Timeseries chart: data as rows instead of columns - apache-spark

I want to use a bar chart as a visualisation to depict the number of times that each of several applications was run on each date (for simplicity's sake, think of the applications as Excel, Word, PowerPoint, etc.). For example, on 15 March, if a user opens Excel once and Word twice, the chart shows a height of two for Word stacked with a height of one for Excel.
The visualisation shows one year of stacked data (365 bars), where each stacked bar uses a different color for each program. There are 20 programs. The dataseries are stacked - for example Word (if it is depicted) is always at the bottom, followed by Excel, etc.
The traditional way to do this is with an outer join that creates 365 rows and 20 columns (Word, Excel, etc.), with an integer value in each column for each day. This technique works, but it is very compute-intensive on databases (not to mention the dense SQL), and I wondered if anyone knows of a timeseries charting tool that handles this problem.
This is a data pivoting issue: I want to extract the data in rows, and let the charting product assemble the bars. It needs to read the Data Series name from the data, and create a new Data Series whenever it encounters a new value. I actually have thousands of possible Programs, but each chart is unlikely to need to depict any more than 20.
The charting product should work like this:
minDate=20150101, maxDate=20151231, interval=Day
and then the data is
20150315 "Word" 1
20150315 "Word" 1
20150315 "Excel" 1
20150316 "Powerpoint" 1
20150316 "Word" 1
Alternatively, each row could be a total per day per program:
20150315 "Word" 2
20150315 "Excel" 1
20150316 "Powerpoint" 1
20150316 "Word" 1
I am using C3, with spark and scala. I can use any javascript library.

What you are asking is how to reshape the data from wide to long format. In Spark, you can use DataFrame.explode (deprecated since Spark 2.0 in favour of flatMap() or select() with functions.explode()). Assuming you had something that looked like this:
+--------+----+-----+
|    date|word|excel|
+--------+----+-----+
|20150101|   3|    4|
|20150102|   1|    2|
+--------+----+-----+
You could reshape it with this:
df.explode($"word", $"excel") { case row =>
  Seq(("Word", row.getInt(0)), ("Excel", row.getInt(1)))
}.select($"date", $"_1" as "app", $"_2" as "count")
Into this:
+--------+-----+-----+
|    date|  app|count|
+--------+-----+-----+
|20150101| Word|    3|
|20150101|Excel|    4|
|20150102| Word|    1|
|20150102|Excel|    2|
+--------+-----+-----+
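Outside Spark, the same reshape and the grouping the question asks for can be sketched in plain Python. This is only an illustration: the function names wide_to_long and long_to_series are made up here, and the per-application lists of daily totals are an assumption about what a column-oriented charting library would want, not something the answer states.

```python
from collections import defaultdict


def wide_to_long(rows, apps):
    """Turn one row per date (one column per app) into (date, app, count) rows."""
    return [(row["date"], app, row[app]) for row in rows for app in apps]


def long_to_series(long_rows, dates):
    """Group long rows into one list of per-date totals per app.

    Repeated (date, app) rows are summed, and dates with no row get 0,
    so every series has the same length as the date axis.
    """
    totals = defaultdict(int)
    for date, app, count in long_rows:
        totals[(date, app)] += count
    apps = sorted({app for _, app, _ in long_rows})
    return {app: [totals[(d, app)] for d in dates] for app in apps}


# The wide table from the answer above.
wide = [
    {"date": "20150101", "Word": 3, "Excel": 4},
    {"date": "20150102", "Word": 1, "Excel": 2},
]
long_rows = wide_to_long(wide, ["Word", "Excel"])
series = long_to_series(long_rows, ["20150101", "20150102"])
```

Because long_to_series sums repeated rows, it accepts either form of input from the question: one row per run, or one total per day per program.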

Related

How to label scatterplot points by name?

I am trying to figure out how to get labels to show on either Google sheets, Excel, or Numbers.
I have information that looks like this
name|x_val|y_val
----------------
a | 1| 1
b | 2| 4
c | 1| 2
Then I would want my final graph to look like this.
4|     .(c)
3|
2|  .(b)
1|  .(a)
 |__ __ __ __
 0  1  2  3  4
Why can't I label each of these points with its name? I can only seem to label the value, e.g., (c) would show 4.
Is the only solution D3?
Well, I did not think this was possible until I went and checked. In some previous versions of Excel I could not do this; I am currently using Excel 2013.
This is what you want to do in a scatter plot:
1. Right-click on your data point.
2. Select "Format Data Labels" (note that you may have to add data labels first).
3. Put a check mark in "Values from Cells".
4. Click on "Select Range" and select the range of labels you want on the points.
UPDATE: Colouring Individual Labels
To colour the labels individually, use the following steps:
1. Select a label. When you first select, all labels for the series should get a box around them like the graph above.
2. Select the individual label you are interested in editing. Only the label you have selected should have a box around it like the graph below.
3. On the right-hand side, as shown below, select "TEXT OPTIONS".
4. Expand the "TEXT FILL" category if required.
5. Second from the bottom of the category list is "COLOR"; select the colour you want from the palette.
If you have the entire series selected instead of the individual label, text formatting changes should apply to all labels instead of just one.
None of these worked for me. I'm on a Mac using Microsoft 365. I found this, which DID work:
This workaround is for Excel 2010 and 2007; it is best for a small number of chart data points.
1. Click twice on a label to select it.
2. Click in the formula bar.
3. Type =
4. Use your mouse to click on a cell that contains the value you want to use.
5. The formula bar changes to something like =Sheet1!$D$3
6. Repeat steps 1 to 5 with the remaining data labels.
Simple
For all those who don't have the option in Excel (like me), there is a macro which works and is explained here: https://www.get-digital-help.com/2015/08/03/custom-data-labels-in-x-y-scatter-chart/ Very useful
Another convoluted answer which should technically work and is ok for a small number of data points is to plot all your data points as 1 series in order to get your connecting line. Then plot each point as its own series. Then format data labels to display series name for each of the individual data points.
In short it works ok for a small data set or just key points from a data set.
If using VBA is an option and assuming that you have a table named 'Table1' of the form:
Label|x_val|y_val
----------------
a | 1| 1
b | 2| 4
c | 1| 2
this routine should work:
Sub labelDatapoints()
    Dim r As Integer
    With ActiveSheet.ChartObjects(1).Chart 'The scatter plot
        .SeriesCollection(1).ApplyDataLabels
        For r = 1 To Range("Table1[Label]").Rows.Count 'iterate through all data points
            'add the custom label to the current data point
            .SeriesCollection(1).Points(r).DataLabel.Text = Range("Table1[Label]").Cells(r).Value
        Next r
    End With
End Sub
Modified from https://www.get-digital-help.com/dynamic-data-labels-in-a-chart/
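If none of the spreadsheet options is available, the same idea (pairing each point with the text in its name column rather than its value) can be sketched outside Excel. The following is a minimal illustration in standard-library Python that writes a small SVG. The function name labelled_scatter_svg, the canvas size and the axis maximum are all assumptions made for this example.

```python
def labelled_scatter_svg(points, size=200, margin=20, max_val=4):
    """Return an SVG string with one dot and one name label per (name, x, y) point.

    Names are inserted as-is, so they are assumed not to contain characters
    that need XML escaping.
    """
    scale = (size - 2 * margin) / max_val
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{size}" height="{size}">']
    for name, x, y in points:
        px = margin + x * scale
        py = size - margin - y * scale  # SVG y grows downward, so flip it
        parts.append(f'<circle cx="{px:g}" cy="{py:g}" r="3"/>')
        # The label text is the point's name, not its y value.
        parts.append(f'<text x="{px + 5:g}" y="{py - 5:g}">{name}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


# The table from the question: name, x_val, y_val.
points = [("a", 1, 1), ("b", 2, 4), ("c", 1, 2)]
svg = labelled_scatter_svg(points)
```

Saving svg to a file with an .svg extension and opening it in a browser should show three dots, each with its name next to it.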

Table Format Report in Acumatica

I want to design a report in table format. I am using Panel and Line Control for now, but it's time consuming to adjust. Is there a better way to do this?
Example:
| column1 | column2 | column3 |
| aaaaa1 | sdf | sdfsdf |
| aaaaa2 | sdf | sdfsdf |
| aaaaa3 | sdf | sdfsdf |
The row background color should be alternating (gray & white).
Your question is a bit ambiguous, so I am going to assume that you want to use the Report Designer to generate a report that looks like the image below.
That being said, you have two different things to accomplish: the table layout and the alternating row color.
For the Table Layout
Unless I am missing something, this is the easy part: you only need to use the field controls and the border settings on those controls to generate the grid. I have added a screen capture of the standard layout for the report above in the designer.
For the Alternating Row Color
This is the challenge, because it can't be done without a little bit of work. Using the visible expression, you have to create two sets of duplicate controls, one placed directly on top of the other. I used the row number to build my expressions, which looked like this:
=IIf( $RowNumber % 2 = '0', True, False )
=IIf( $RowNumber % 2 = '0', False, True )
Each expression divides the row number by two. If there is a remainder, the set of controls with the bottom expression is shown; if there is none, the set with the top expression is shown instead.
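The logic of the two expressions can be checked with a small sketch. This is plain Python rather than report-designer syntax, and the choice of which set is gray and which is white is an assumption made for the example.

```python
def visible_control_set(row_number):
    """Return which of the two overlapping control sets is visible for a row."""
    top_visible = (row_number % 2 == 0)      # mirrors the top expression
    bottom_visible = (row_number % 2 != 0)   # mirrors the bottom expression
    # The two expressions are complements, so exactly one set shows per row.
    assert top_visible != bottom_visible
    return "gray" if top_visible else "white"


# Rows 1 to 4 alternate between the two sets.
colors = [visible_control_set(n) for n in range(1, 5)]
```

Since the expressions are exact complements, the two stacked sets never show at the same time and no row is left without a visible set.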

Summing up a related table's values in PowerPivot/DAX

Say I have two tables. attrsTable:
file | attribute | value
------------------------
A | xdim | 5
A | ydim | 6
B | xdim | 7
B | ydim | 3
B | zdim | 2
C | xdim | 1
C | ydim | 7
sizeTable:
file | size
-----------
A | 17
B | 23
C | 34
I have these tables related via the 'file' field. I want a PowerPivot measure within attrsTable whose calculation uses size. For example, let's say I want (xdim+ydim)/size for each of A, B, C. The calculations would be:
A: (5+6)/17
B: (7+3)/23
C: (1+7)/34
I want the measure to be generic enough so I can use slicers later on to slice by file or attribute. How do I accomplish this?
I tried:
dimPerSize := CALCULATE([value]/SUM(sizeTable[size])) # Calculates 0
dimPerSize := CALCULATE([value]/SUM(RELATED(sizeTable[size]))) # Produces an error
Any idea what I'm doing wrong? I'm probably missing some fundamental concepts here of how to use DAX with relationships.
Hi Redstreet,
taking a step back from your solution and the one proposed by Jacob, I think it might be useful to create another table that would aggregate all the calculations (especially given you probably have more than 2 tables with file-specific attributes).
So I have created one more table that contains (only) unique file names, and thus the relationships could be visualized this way:
It's much simpler to add necessary measures (no need for calculated columns). I have actually tested 2 scenarios:
1) create simple SUM measures for both Attribute Value and File Size. Then divide those two measures and job done :-).
2) use SUMX functions to have a bit more universal solution. Then the final formula for DimPerSize calculation could look like this:
=DIVIDE(
SUMX(DISTINCT(fileTable[file]),[Sum of AttrValue]),
SUMX(DISTINCT(fileTable[file]),[Sum of FileSize]),
BLANK()
)
With [Sum of AttrValue] being:
=SUM(attrsTable[value])
And [Sum of FileSize] being:
=SUM(sizeTable[size])
This worked perfectly fine, even though SUMX in both cases goes over all instances of the given file name. So for file B it also includes zdim in the calculation (if you need to filter this out, use a simple CALCULATE / FILTER combination). For the file size I am using SUMX as well, even though it's not really needed, since the table contains only one record for each file name. If there were two instances, you would use SUMX or AVERAGEX depending on the desired outcome.
This is the link to my source file in Excel (2010).
Hope this helps.
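To make the arithmetic of that DIVIDE formula concrete, here is a small sketch in plain Python over the tables from the question. It is not DAX and only approximates filter context with a set of selected files; the function name dim_per_size is an assumption for this example.

```python
# The two tables from the question.
attrs = [
    ("A", "xdim", 5), ("A", "ydim", 6),
    ("B", "xdim", 7), ("B", "ydim", 3), ("B", "zdim", 2),
    ("C", "xdim", 1), ("C", "ydim", 7),
]
sizes = {"A": 17, "B": 23, "C": 34}


def dim_per_size(selected_files):
    """Sum of attribute values divided by sum of sizes for the selected files.

    Returns None (standing in for BLANK()) when the size total is zero.
    """
    attr_total = sum(v for f, _, v in attrs if f in selected_files)
    size_total = sum(sizes[f] for f in selected_files)
    return attr_total / size_total if size_total else None


per_file = {f: dim_per_size({f}) for f in sizes}
grand_total = dim_per_size(set(sizes))
```

Note that file B comes out as 12/23 rather than 10/23 because zdim is included, which is the behaviour the answer describes.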
You seem to have the concept of relationships right, but you aren't on the right track with CALCULATE(), both in its structure and in the fact that you can't simply use 'naked' numerical columns: they need to be wrapped in an aggregation of some kind.
Your desired approach is correct in that once you get a simple version of the thing running, you will be able to slice and dice it over any of your related dimensions.
Best practice is probably to build this up using several measures:
[xdim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "xdim")
[ydim] = CALCULATE(SUM('attrstable'[value]), 'attrstable'[attribute] = "ydim")
[dimPerSize] = ([xdim] + [ydim]) / VALUES('sizeTable'[size])
But depending on exactly how your pivot is set up, this is likely to throw an error as well, because it will try to use the whole 'size' column in your totals. There are two main strategies for dealing with this:
1. Use an 'iterative' formula such as SUMX() or AVERAGEX() to iterate individually over the 'file' field and then add up or average for the total, e.g.
[ItdimPerSize] = AVERAGEX(VALUES('sizeTable'[file]), [dimPerSize])
Depending on the maths you want, you might find that, to produce a useful average, you need to use SUMX but divide by the number of cases, i.e. COUNTROWS(VALUES('sizeTable'[file])).
2. You might decide that the totals are irrelevant and simply introduce an error-handling element that makes them blank, e.g.
[NtdimPerSize] = IF(HASONEVALUE('sizeTable'[file]),[dimPerSize],BLANK())
NB: all of this assumes that when you create your pivot you are 'dragging in' the file field from 'sizeTable'.
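The behaviour of these measures can be worked through in plain Python using the tables from the question. This is a sketch of the arithmetic only, not of how the DAX engine evaluates filter context, and the function names are invented for the example.

```python
from statistics import mean

attrs = [
    ("A", "xdim", 5), ("A", "ydim", 6),
    ("B", "xdim", 7), ("B", "ydim", 3), ("B", "zdim", 2),
    ("C", "xdim", 1), ("C", "ydim", 7),
]
sizes = {"A": 17, "B": 23, "C": 34}


def attr_sum(file, attribute):
    """Like [xdim] or [ydim]: sum of value for one file and one attribute."""
    return sum(v for f, a, v in attrs if f == file and a == attribute)


def dim_per_size(file):
    """Like [dimPerSize] when exactly one file is in context."""
    return (attr_sum(file, "xdim") + attr_sum(file, "ydim")) / sizes[file]


def it_dim_per_size(files):
    """Like [ItdimPerSize]: average the per-file result over the files."""
    return mean(dim_per_size(f) for f in files)


def nt_dim_per_size(files):
    """Like [NtdimPerSize]: blank (None) unless exactly one file is selected."""
    return dim_per_size(files[0]) if len(files) == 1 else None


total_average = it_dim_per_size(["A", "B", "C"])
total_blank = nt_dim_per_size(["A", "B", "C"])
```

Here file B gives 10/23, because only xdim and ydim are summed, and the total row is either the average of the three per-file ratios or blank, depending on the strategy chosen.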

Graph three columns in Excel as one data point

I'm having trouble graphing some data in Excel properly. I have three columns, which are time, name, and value. If I'm just graphing one name, i.e. "speed", my file looks like this
time | data | value
1 | speed | 1
2 | speed | 2
3 | speed | 3
When I try to graph this, it thinks speed is three different things, and graphs as speed 1, speed 2 and speed 3. I instead want a straight line with a slope of 1, called speed, so it recognizes that they're all the same data. Ideally, I could add in extra data names such as acceleration, and it would treat all of those the same too.
In Excel 2007: go to Insert --> Scatter and create the chart type "Scatter with smooth lines".
Right-click the chart and pick the "Select Data" option.
Click the "Add" series button, set "Series X values" to the range of values in the Time column, and set "Series Y values" to the range of values in the Value column.
If you have additional rows with data for something other than speed, such as acceleration, you can add a second series to draw a second line in the graph.

Excel String - search and ordering with VBA

I'm looking for a starting point for my task.
Situation:
I have three Excel worksheets that have the same structure and contain the following columns amongst others:
|ID|SubID|SubName|Description|Link1|Link2|
|1|1|SubName 1|Desc SubName1|P00001|P00002|
|1|2|SubName 2|Desc SubName2|P00002|P00003|
|2|1|SubName 1|Desc SubName1|P00001|P00002|
here comes a line that contains more than one P-Number inside a cell:
|3|1|SubName 1|Desc SubName1|P00001,P00002,P00003|P00001,P00002|
I now would like to look through the Link1 column and place those values with the first three columns in a sorted order to a "Link1" worksheet like so:
|P-Number|ID|SubID|SubName|
|P00001|1|1|SubName 1|
|P00001|2|1|SubName 1|
|P00001|3|1|SubName 1|
|P00002|1|2|SubName 2|
|P00002|3|1|SubName 1|
|P00003|3|1|SubName 1|
I would like to do the same with the Link2 column, also placing the values on a separate worksheet.
The next step would be to concatenate the P-Number of the new sheet with a string that works like a link. That's what I have working so far.
Another idea would be to make the SubName clickable and have it jump to the corresponding name in one of the three worksheets.
I'm not sure if a macro is the right choice or maybe a Pivot table could also do the trick.
Any ideas are appreciated.
Thanks.
UPDATE:
I tried to incorporate this concept into my main macro, where I also process the columns mentioned above. However, since I have several loops running through this part, the output line gets overwritten by the last P-number of the column. I use this Sub:
Sub PrintArray(Data As Variant, Cl As Range)
    Cl.Resize(UBound(Data, 1), UBound(Data, 2)) = Data
End Sub
And this call for testing:
PrintArray NewArray, ActiveWorkbook.Worksheets("Link 1").Range("A2")
How can I "save" each P-number to the new sheet without having to use a loop inside my code?
Pivot tables can't split one cell into multiple rows, so the starting point of a VBA solution would be:
1. load the input range into an array (my_array = Range(...).Value2)
2. for each row, split Link1 (link_array = Split(my_array(current_row, 5), ","))
3. for each link, append the link and the other values to a results array
4. place the results onto a worksheet
5. sort the resulting range
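The steps above can be sketched in Python to show the intended result before writing the VBA. This is an illustration only: the function name explode_links is made up, and the rows are the sample data from the question held in memory rather than read from a worksheet.

```python
def explode_links(rows, link_index):
    """Split a comma-separated link cell into one sorted output row per link.

    Each input row is (ID, SubID, SubName, Description, Link1, Link2) and
    each output row is (P-Number, ID, SubID, SubName).
    """
    results = []
    for row in rows:
        for link in row[link_index].split(","):
            link = link.strip()
            if link:  # skip empty pieces, e.g. from a trailing comma
                results.append((link, row[0], row[1], row[2]))
    results.sort()  # by P-Number, then ID, then SubID
    return results


rows = [
    (1, 1, "SubName 1", "Desc SubName1", "P00001", "P00002"),
    (1, 2, "SubName 2", "Desc SubName2", "P00002", "P00003"),
    (2, 1, "SubName 1", "Desc SubName1", "P00001", "P00002"),
    (3, 1, "SubName 1", "Desc SubName1", "P00001,P00002,P00003", "P00001,P00002"),
]
link1_sheet = explode_links(rows, 4)  # index 4 is the Link1 column
link2_sheet = explode_links(rows, 5)  # index 5 is the Link2 column
```

Collecting every output row in one list and writing it out once at the end is also what avoids the overwriting problem described in the update: nothing is placed on the sheet until all P-numbers have been gathered.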
