I have just started learning and using Spark, and I'm currently facing a problem. Any suggestion or hint will be greatly appreciated.
Basically, I have a dataset that contains all kinds of events for different users, such as AppLaunch, GameStart, GameEnd, etc., and I want to create a summary of each user's actions for each time he/she starts the app.
For example: I have the following dataset:
UserId | Event Type | Time | GameType | Event Id|
11111 | AppLauch | 11:01:53| null | 101 |
11111 | GameStart | 11:01:59| Puzzle | 102 |
11111 | GameEnd | 11:05:31| Puzzle | 103 |
11111 | GameStart | 11:05:58| Word | 104 |
11111 | GameEnd | 11:09:13| Word | 105 |
11111 | AppEnd | 11:09:24| null | 106 |
11111 | AppLauch | 12:03:43| null | 107 |
22222 | AppLauch | 12:03:52| null | 108 |
22222 | GameStart | 12:03:59| Puzzle | 109 |
11111 | GameStart | 12:04:01| Puzzle | 110 |
22222 | GameEnd | 12:06:11| Puzzle | 111 |
11111 | GameEnd | 12:06:13| Puzzle | 112 |
11111 | AppEnd | 12:06:23| null | 113 |
22222 | AppEnd | 12:06:33| null | 114 |
And what I want is a dataset similar to this:
EventId | UserId| Event Type | Time | FirstGamePlayed| LastGamePlayed|
101 |11111 | AppLauch | 11:01:53| Puzzle | Word |
107 |11111 | AppLauch | 12:03:43| Puzzle | Puzzle |
108 |22222 | AppLauch | 12:03:52| Puzzle | Puzzle |
I only need to know the first game played and the last game played, even if there are more than 3 games played in one app launch.
My initial idea is to group the events by user ID and by time window (AppLaunch to AppEnd), and then find a way to scan through the dataset: the first GameStart event that falls into a window will be the FirstGamePlayed, and the last GameStart event before the time of AppEnd will be the LastGamePlayed. But I haven't found a way to achieve this.
Any hint/suggestion will be nice.
Thanks
I think this can be solved using a window function followed by an aggregation, like this:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark
df
  // enumerate AppLaunches per user with a running count
  .withColumn("AppLauchNr", sum(when($"EventType" === "AppLauch", 1)).over(Window.partitionBy($"UserId").orderBy($"Time".asc)))
  // get first and last game per AppLaunch (first non-null value in ascending / descending time order)
  .withColumn("firstGamePlayed", first($"GameType", true).over(Window.partitionBy($"UserId", $"AppLauchNr").orderBy($"Time".asc)))
  .withColumn("lastGamePlayed", first($"GameType", true).over(Window.partitionBy($"UserId", $"AppLauchNr").orderBy($"Time".desc)))
  // now aggregate; AppLauchNr restarts at 1 for every user, so group by both columns
  .groupBy($"UserId", $"AppLauchNr")
  .agg(
    min($"EventId").as("EventId"),
    lit("AppLauch").as("EventType"), // this is always AppLauch
    min($"Time").as("Time"),
    first($"firstGamePlayed", true).as("firstGamePlayed"),
    first($"lastGamePlayed", true).as("lastGamePlayed")
  )
  .drop($"AppLauchNr")
The first and last game played can also be determined using orderBy().groupBy() instead of window functions, but I'm still not sure whether Spark preserves the ordering during aggregation (this is not mentioned in the docs; see e.g. "Spark DataFrame: does groupBy after orderBy maintain that order?" and the discussion in https://issues.apache.org/jira/browse/SPARK-16207):
df
  .withColumn("AppLauchNr", sum(when($"EventType" === "AppLauch", 1)).over(Window.partitionBy($"UserId").orderBy($"Time".asc)))
  .orderBy($"UserId",$"AppLauchNr",$"Time")
  .groupBy($"UserId",$"AppLauchNr")
  .agg(
    first($"EventId").as("EventId"),
    first($"EventType").as("EventType"),
    first($"Time").as("Time"),
    first($"GameType", true).as("firstGamePlayed"),
    last($"GameType", true).as("lastGamePlayed")
  )
  .drop($"AppLauchNr")
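To make the logic easier to check by hand, here is a minimal sketch of the same idea in plain Python (not Spark): sort each user's events by time, start a new session at every AppLauch row, and take the first and last non-null GameType in each session. The function name summarize and the tuple layout are my own choices for illustration, and rows that occur before a user's first launch are simply skipped here.

```python
from itertools import groupby

# Sample rows from the question: (user_id, event_type, time, game_type, event_id)
events = [
    (11111, "AppLauch", "11:01:53", None, 101),
    (11111, "GameStart", "11:01:59", "Puzzle", 102),
    (11111, "GameEnd", "11:05:31", "Puzzle", 103),
    (11111, "GameStart", "11:05:58", "Word", 104),
    (11111, "GameEnd", "11:09:13", "Word", 105),
    (11111, "AppEnd", "11:09:24", None, 106),
    (11111, "AppLauch", "12:03:43", None, 107),
    (22222, "AppLauch", "12:03:52", None, 108),
    (22222, "GameStart", "12:03:59", "Puzzle", 109),
    (11111, "GameStart", "12:04:01", "Puzzle", 110),
    (22222, "GameEnd", "12:06:11", "Puzzle", 111),
    (11111, "GameEnd", "12:06:13", "Puzzle", 112),
    (11111, "AppEnd", "12:06:23", None, 113),
    (22222, "AppEnd", "12:06:33", None, 114),
]


def summarize(rows):
    """Return one (event_id, user_id, event_type, time, first_game, last_game) per launch."""
    summary = []
    # Plays the role of partitionBy(UserId).orderBy(Time);
    # zero-padded HH:MM:SS strings sort correctly as text.
    ordered = sorted(rows, key=lambda r: (r[0], r[2]))
    for user_id, user_rows in groupby(ordered, key=lambda r: r[0]):
        sessions = []
        for row in user_rows:
            if row[1] == "AppLauch":
                sessions.append([])  # the running launch count goes up by one
            if sessions:  # rows before the first launch are skipped in this sketch
                sessions[-1].append(row)
        for session in sessions:
            launch = session[0]
            games = [r[3] for r in session if r[3] is not None]
            summary.append((
                launch[4], user_id, launch[1], launch[2],
                games[0] if games else None,
                games[-1] if games else None,
            ))
    return sorted(summary)


result = summarize(events)
for row in result:
    print(row)
```

On the sample data this produces one row each for events 101, 107 and 108, which matches the table asked for in the question.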
wmi_cpu_core_frequency_mhz is a gauge that returns a set of distinct values {a, b, c, ...} for the label core, which takes the values p, q, r, s.
I want to get a breakdown of the count of each gauge value a, b, c for each label value p, q, r, s.
Something like this:
| core | count(a) | count(b) | count(c) |...
+------+----------+----------+----------+
| p | 10 | 35 | 5 |...
+------+----------+----------+----------+
| q | 15 | 15 | 20 |...
+------+----------+----------+----------+
| r | 2 | 13 | 35 |...
+------+----------+----------+----------+
| s | 10 | 10 | 30 |...
+------+----------+----------+----------+
Any idea how to tackle this, or where I should start?
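You want to use count_values here, grouped by the core label, so your query would be something like count_values by (core) ("frequency_mhz", wmi_cpu_core_frequency_mhz). The first argument is just the name of the output label that will hold each distinct value; "frequency_mhz" is a name I picked.
You won't get a 2D table with this, but you should at least get the data you're after.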
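If it helps to see what that grouping computes, here is a small Python sketch of the idea: for each core, count how many samples report each distinct value. The sample values are made up for illustration, and count_values_by_core is a hypothetical helper name, not part of any Prometheus client library.

```python
from collections import Counter, defaultdict

# Hypothetical samples: (core label, gauge value in MHz)
samples = [
    ("p", 2400), ("p", 2400), ("p", 1800),
    ("q", 2400), ("q", 1200), ("q", 1200),
]


def count_values_by_core(samples):
    """For each core, count how many samples carry each distinct value."""
    counts = defaultdict(Counter)
    for core, value in samples:
        counts[core][value] += 1
    return {core: dict(counter) for core, counter in counts.items()}


table = count_values_by_core(samples)
print(table)
```

Each (core, value) pair in the result corresponds to one output series of the query, which is why you get a list of series rather than a 2D table.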
I have 3 Excel-files (automated exports) that contain the following information:
1. The total list of shelves in one particular store:
| Shelf_code |
|------------|
| AB01 |
| AA02 |
2. The total list of all shelves linked to each article
| SKU_code | Shelf_code |
|----------|------------|
| 111 | AA01 |
| 111 | AB01 |
| 111 | AC01 |
| 112 | AA01 |
3. The list of all available SKUs
| SKU_code | Other stuff |
|----------|-------------|
| 111 | ... |
| 112 | ... |
| 113 | ... |
| 114 | ... |
And what I want to do is to link the Shelf_codes from that specific store to the total available SKU-list, so it will look like this:
| SKU_code | Other stuff | Shelf_code_store1 |
|----------|-------------|-------------------|
| 111 | ... | AB01 |
| 112 | ... | |
| 113 | ... | |
| 114 | ... | AB01 |
I have tried to embed a MATCH formula within another INDEX/MATCH formula (see code below), which was only partially successful: it works only if the Shelf_code in file 2 happens to be the first one that matches the SKU_code.
Since this will mostly not be the case, it returns a #N/A error:
MATCH(
INDEX({file2_shelfcode},MATCH(file3_skucode,{file2_skucode},0)),
{file1_shelfcode}
)
Does anyone have a solution for this?
Since these files contain over 1000 articles, 200 shelves and 6 stores, and will be frequently updated, I don't think using a PivotTable on file 2 will fit my needs.
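To pin down the lookup being asked for, here is a rough Python sketch (not an Excel formula): for every SKU in file 3, keep only those shelf codes from file 2 that also appear in the store's shelf list from file 1. It checks every shelf linked to a SKU rather than only the first match, which is where the nested MATCH falls short. The variable and function names are my own, and the data is just the sample rows shown above.

```python
store_shelves = {"AB01", "AA02"}                  # file 1: shelves in this store
sku_shelves = [("111", "AA01"), ("111", "AB01"),  # file 2: SKU-to-shelf links
               ("111", "AC01"), ("112", "AA01")]
skus = ["111", "112", "113", "114"]               # file 3: all available SKUs


def shelves_in_store(skus, sku_shelves, store_shelves):
    """Map each SKU to the shelf codes that exist in the given store."""
    matches = {sku: [] for sku in skus}
    for sku, shelf in sku_shelves:
        # Look at every link for the SKU, not just the first one found.
        if sku in matches and shelf in store_shelves:
            matches[sku].append(shelf)
    return matches


linked = shelves_in_store(skus, sku_shelves, store_shelves)
print(linked)
```

With only the sample rows above, SKU 111 maps to AB01 and the other SKUs stay empty, because the excerpt of file 2 has no rows for 113 or 114.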
I have data on automotive spare parts with their multiple storage locations in a warehouse.
All I want to do is list the locations next to the part number, so that it is easy to see all the locations of a specific part number.
The current pivot data looks like this
I've manually transposed a few rows in the image below, but the data contains around 70K rows, so I'm looking for a better solution.
Kindly refer to the table below:
+--------------+-----+-------+-------------+
| Item name | Qty | UoM | Stock |
+--------------+-----+-------+-------------+
| '0450000115 | 324 | piece | G12B04 |
| '0450000A61 | 312 | piece | G12B05 |
| '0450000115 | 336 | piece | G12B06 |
| '0450000A61 | 228 | piece | G12B07 |
| '0450000115 | 336 | piece | G12B08 |
| '0450000115 | 192 | piece | G12B09 |
| '087902E200A | 470 | piece | G12B10 |
| '087902E200A | 760 | piece | G12B13 |
| '087902E200A | 759 | piece | G12B14 |
| '0450000115 | 336 | piece | G12B15 |
| '087902E200A | 400 | piece | G12B16 |
| '087902E200A | 10 | piece | G3B32 |
| '084B410426 | 100 | piece | G3B32 |
| '087902E200A | 300 | piece | G4B08 |
| '0450000A61 | 2 | piece | GDB01 |
| '084B410426 | 60 | piece | GR.04.C.04. |
| '087902E200A | 327 | piece | HD.03.K.05. |
+--------------+-----+-------+-------------+
You need to create a measure using the CONCATENATEX function. For this, you need to add your data to the Data Model. You can do this by checking the box "Add this data to the Data Model" at the bottom of the Create PivotTable dialog box.
Right-click the table in the PivotTable Fields pane and select Add Measure. Then create the following measure and name it StockText: =CONCATENATEX('table','table'[Stock],", ")
Now put [Item name] on Rows and the measure [StockText] on Values. The result should be one row per item name, with all of its stock locations in a single comma-separated cell.
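In plain terms, the measure joins every Stock value that belongs to the same item name. A minimal Python sketch of that grouping, using the rows from the question (leading apostrophes dropped, and concat_locations being a name I made up), would be:

```python
# (item name, stock location) pairs from the question's table
rows = [
    ("0450000115", "G12B04"), ("0450000A61", "G12B05"), ("0450000115", "G12B06"),
    ("0450000A61", "G12B07"), ("0450000115", "G12B08"), ("0450000115", "G12B09"),
    ("087902E200A", "G12B10"), ("087902E200A", "G12B13"), ("087902E200A", "G12B14"),
    ("0450000115", "G12B15"), ("087902E200A", "G12B16"), ("087902E200A", "G3B32"),
    ("084B410426", "G3B32"), ("087902E200A", "G4B08"), ("0450000A61", "GDB01"),
    ("084B410426", "GR.04.C.04."), ("087902E200A", "HD.03.K.05."),
]


def concat_locations(rows, separator=", "):
    """Group locations by item name and join them into one string per item."""
    grouped = {}
    for item, location in rows:
        grouped.setdefault(item, []).append(location)
    return {item: separator.join(locations) for item, locations in grouped.items()}


stock_text = concat_locations(rows)
for item, locations in stock_text.items():
    print(item, "|", locations)
```

This keeps the locations in the order they appear in the source rows.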
+-------------+----------+----------+--------+------------------+
| customer_id | date | time | answer | missed_call_type |
+-------------+----------+----------+--------+------------------+
| 101 | 2018/8/3 | 12:13:00 | no | employee |
| 102 | 2018/8/3 | 12:15:00 | no | customer |
| 103 | 2018/8/3 | 12:20:00 | no | employee |
| 102 | 2018/8/3 | 15:15:00 | no | customer |
| 101 | 2018/8/3 | 18:15:00 | no | employee |
| 105 | 2018/8/3 | 18:18:00 | no | customer |
| 102 | 2018/8/3 | 19:18:00 | no | employee |
+-------------+----------+----------+--------+------------------+
I have a table which looks like this, and I want to calculate the average interval time for those who did not answer the phone.
For this example, the average interval time is:
{(18:15:00-12:13:00)+[(19:18:00-15:15:00)+(15:15:00-12:15:00)]/2}/2
The problem is that I can only manipulate it in Excel... If someone knows Excel, please help! Any suggestion is fine; I am familiar with SQL.
I've entered your data as shown below and added a time_to_next_call column, which calculates the interval until the next call to the same customer_id using the formula =INDEX(C3:C$8,MATCH(A2,A3:A$8,0))-C2 in the first cell of the data set.
Then list your customer IDs in a column and use an AVERAGEIFS function to calculate the average time_to_next_call for each customer_id, as shown below:
As I've placed customer_id 101 in cell A12, my function for calculating its average time_to_next_call is: =AVERAGEIFS($F$2:$F$8,$A$2:$A$8,A12,$F$2:$F$8,">0")
Adjust your ranges as appropriate to fit your workbook. Hope this helps.
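As a cross-check of the numbers, here is a small Python sketch of the same calculation: compute the gap to the next call for each customer, average the gaps per customer, then average those per-customer averages. It assumes all calls are on the same date and unanswered, as in the sample; average_interval is just a name I chose.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# (customer_id, time) from the sample table; every row has answer = "no"
calls = [
    (101, "12:13:00"), (102, "12:15:00"), (103, "12:20:00"), (102, "15:15:00"),
    (101, "18:15:00"), (105, "18:18:00"), (102, "19:18:00"),
]


def average_interval(calls):
    """Return (average gap per customer, average of those per-customer averages)."""
    times = defaultdict(list)
    for customer_id, clock in calls:
        times[customer_id].append(datetime.strptime(clock, "%H:%M:%S"))
    per_customer = {}
    for customer_id, stamps in times.items():
        stamps.sort()
        gaps = [later - earlier for earlier, later in zip(stamps, stamps[1:])]
        if gaps:  # customers with a single call have no interval
            per_customer[customer_id] = sum(gaps, timedelta()) / len(gaps)
    overall = sum(per_customer.values(), timedelta()) / len(per_customer)
    return per_customer, overall


per_customer, overall = average_interval(calls)
print(per_customer[101], per_customer[102], overall)  # 6:02:00 3:31:30 4:46:45
```

Customers 103 and 105 have only one call each, so they contribute no interval, which is the same effect as the ">0" condition in the AVERAGEIFS formula.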
Background
I'm creating a grade book in Excel for my wife. I have sheets for the overall grade, classwork, exams, and participation.
The three sections of work (classwork, exams, and participation) each have a variable number of items, and each item has a different number of points possible. Each section has a weight in the overall grade.
I have this up and running with a fixed number of items per section, but I'd like to create a template that can be updated from class to class and year to year.
Here's the problem:
On the classwork sheet, I'd like to be able to enter new assignments and their point value and have that automatically update the master grade sheet on my first sheet tab. Is there any way to add columns in a section of one worksheet (the master grade sheet) when new rows are added to another worksheet (the list of assignments)?
It is possible to achieve this without using VBA. The reason you will have difficulty achieving it, however, is that you've violated normal form in the table you've already built. It appears the pertinent data you're looking for is each student's score on each assignment. If this is correct, the level of granularity you will want is the Assignment, not the Student.
There are some fairly quick ways to modify your existing work to account for this. I've written out some sample data below. Take a look and see if it helps.
Sample Original Table
+---------+------+------------+------------+
| Student | Quiz | Thumbnails | Watercolor |
+---------+------+------------+------------+
| Paul | 3 | 10 | 90 |
| Frank | 4 | 10 | 95 |
| Mary | 5 | 10 | 70 |
| Ellen | | 10 | 85 |
| Sue | 6 | 10 | 92 |
| Anton | 5 | 10 | 87 |
+---------+------+------------+------------+
Image of the data is below ( note I have highlighted the blank value ).
Sample Normal Table
+---------+-------------+-----------+-------+
| Student | Assignment | New_Score | Score |
+---------+-------------+-----------+-------+
| Paul | Quiz | | 3 |
| Frank | Quiz | | 4 |
| Mary | Quiz | | 5 |
| Ellen | Quiz | | 0 |
| Sue | Quiz | | 6 |
| Anton | Quiz | | 5 |
| Paul | Thumbnails | | 10 |
| Frank | Thumbnails | | 10 |
| Mary | Thumbnails | | 10 |
| Ellen | Thumbnails | | 10 |
| Sue | Thumbnails | | 10 |
| Anton | Thumbnails | | 10 |
| Paul | Watercolor | | 90 |
| Frank | Watercolor | | 95 |
| Mary | Watercolor | | 70 |
| Ellen | Watercolor | | 85 |
| Sue | Watercolor | | 92 |
| Anton | Watercolor | | 87 |
| Mary | ExtraCredit | 10 | 10 |
| Ellen | ExtraCredit | 8 | 8 |
| Sue | ExtraCredit | 9 | 9 |
| Anton | ExtraCredit | 10 | 10 |
+---------+-------------+-----------+-------+
Image of the data is below. The Score column reaches back to your old table and grabs the score you've already entered for each student, so you won't have to do this all manually. The formula for this is =INDEX(non_normal,MATCH([@Student],non_normal[Student],0),MATCH([@Assignment],non_normal[#Headers],0)).
This assumes you've formatted the old data as an Excel table ( ctrl+t ) and named it non_normal ( alt+j+t+i ). Note that the unsubmitted assignment for Ellen comes through with a score of zero using this method. I've added a column named New_Score so that you can add new student-assignment submission combinations to the table without having to modify your old non_normal table ( which was the trouble in the OP ). With this column added, the formula in the Score column can be changed to =IF(NOT(ISBLANK([@[New_Score]])),[@[New_Score]],INDEX(non_normal,MATCH([@Student],non_normal[Student],0),MATCH([@Assignment],non_normal[#Headers],0))), which will take the New_Score value if available and the original score if not.
The orange cells are new student-assignment submission combinations. Note you do not need to add a row for every student, just add a row whenever a student submits an assignment.
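If the reshaping itself is unclear, here is a minimal Python sketch of what the normal table contains: one row per student-assignment pair taken from the old wide table (a blank becomes 0, as with the INDEX formula), followed by the newly entered submissions. The names normalize and new_scores are mine, for illustration only.

```python
assignments = ["Quiz", "Thumbnails", "Watercolor"]

# Old wide table: one row per student, one column per assignment (None = blank cell)
non_normal = {
    "Paul": [3, 10, 90], "Frank": [4, 10, 95], "Mary": [5, 10, 70],
    "Ellen": [None, 10, 85], "Sue": [6, 10, 92], "Anton": [5, 10, 87],
}

# New submissions entered directly in the normal table (the New_Score column)
new_scores = [
    ("Mary", "ExtraCredit", 10), ("Ellen", "ExtraCredit", 8),
    ("Sue", "ExtraCredit", 9), ("Anton", "ExtraCredit", 10),
]


def normalize(non_normal, assignments, new_scores):
    """Turn the wide table into (student, assignment, score) rows, then append new rows."""
    rows = []
    for column, assignment in enumerate(assignments):
        for student, scores in non_normal.items():
            rows.append((student, assignment, scores[column] or 0))  # blank -> 0
    rows.extend(new_scores)
    return rows


normal = normalize(non_normal, assignments, new_scores)
print(len(normal), normal[3])
```

Adding an assignment then means adding rows, not columns, which is what lets the pivot table pick it up on refresh.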
Sample Assignments Table
+-------------+-----------------+
| Assignment | Points_Possible |
+-------------+-----------------+
| Quiz | 6 |
| Thumbnails | 10 |
| Watercolor | 100 |
| ExtraCredit | |
+-------------+-----------------+
I've added the ExtraCredit assignment with a possible max score of zero/blank ( since not completing extra credit shouldn't count against a student )
Payoff - Back to the Original Table
+--------------+---------------+------------+------------+-------------+-------------+--------+
| Sum of Score | Column Labels | | | | | |
+--------------+---------------+------------+------------+-------------+-------------+--------+
| Row Labels | Quiz | Thumbnails | Watercolor | ExtraCredit | Grand Total | |
+--------------+---------------+------------+------------+-------------+-------------+--------+
| Anton | 5 | 10 | 87 | 10 | 112 | 96.6% |
| Ellen | 0 | 10 | 85 | 8 | 103 | 88.8% |
| Frank | 4 | 10 | 95 | | 109 | 94.0% |
| Mary | 5 | 10 | 70 | 10 | 95 | 81.9% |
| Paul | 3 | 10 | 90 | | 103 | 88.8% |
| Sue | 6 | 10 | 92 | 9 | 117 | 100.9% |
+--------------+---------------+------------+------------+-------------+-------------+--------+
Using the image below, you pivot your newly normalized data into a Pivot Table. ( alt+n+v ). Now, simply adding a new assignment to the normal_assignment DataTable will cause that assignment to appear in a new column when you refresh the Pivot Table ( alt+a+r+a ).
The % score on the right of the Pivot Table is calculated using the following formula ( with the sample Pivot Table starting in cell $M$2 ): =GETPIVOTDATA("Score",$M$2,"Student",M4)/SUM(assignment[Points_Possible])
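To check the percentages by hand, here is the same arithmetic as a short Python sketch: each student's grand total divided by the sum of Points_Possible (6 + 10 + 100 + 0 = 116, with the blank ExtraCredit entry counted as 0). The helper name percent_scores is my own.

```python
# Points possible per assignment; the blank ExtraCredit entry is treated as 0
points_possible = {"Quiz": 6, "Thumbnails": 10, "Watercolor": 100, "ExtraCredit": 0}

# Grand totals per student, taken from the pivot table above
totals = {"Anton": 112, "Ellen": 103, "Frank": 109, "Mary": 95, "Paul": 103, "Sue": 117}


def percent_scores(totals, points_possible):
    """Divide each grand total by the total points possible, as a percentage."""
    max_points = sum(points_possible.values())
    return {student: round(100 * total / max_points, 1)
            for student, total in totals.items()}


percentages = percent_scores(totals, points_possible)
print(percentages)
```

Sue ends up above 100% because extra credit adds to her total without adding to the points possible.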
I've uploaded the raw sample file for this to my public repo if you'd like to pull it and take a peek at the source. Credit to sensefulsolutions for text-to-table conversion.
Hope this is what you need!