How to simultaneously group/apply two Spark DataFrames? - apache-spark

(My question is language-agnostic, I think, but I'm using PySpark if it matters.)
Situation
I currently have two Spark DataFrames:
One with per-minute data (1440 rows per person and day) of a person's heart rate per minute:
| Person | date       | time  | heartrate |
|--------+------------+-------+-----------|
| 1      | 2018-01-01 | 00:00 | 70        |
| 1      | 2018-01-01 | 00:01 | 72        |
| ...    | ...        | ...   | ...       |
| 4      | 2018-10-03 | 11:32 | 123       |
| ...    | ...        | ...   | ...       |
And another DataFrame with daily metadata (1 row per person and day), including the results of a clustering of days, i.e. which cluster day X of person Y fell into:
| Person | date       | cluster | max_heartrate |
|--------+------------+---------+---------------|
| 1      | 2018-01-01 | 1       | 180           |
| 1      | 2018-01-02 | 4       | 166           |
| ...    | ...        | ...     | ...           |
| 4      | 2018-10-03 | 1       | 147           |
| ...    | ...        | ...     | ...           |
(Note that clustering is done separately per person, so cluster 1 for person 1 has nothing to do with person 2's cluster 1.)
Goal
I now want to compute, say, the mean heart rate per cluster and per person, that is, each person gets different means. If I have three clusters, I am looking for this DF:
| Person | cluster | mean_heartrate |
|--------+---------+----------------|
| 1      | 1       | 123            |
| 1      | 2       | 89             |
| 1      | 3       | 81             |
| 2      | 1       | 80             |
| ...    | ...     | ...            |
How do I best do this? Conceptually, I want to group these two DataFrames per person and send two DF chunks into an apply function. In there (i.e. per person), I'd group and aggregate the per-minute DF per day, then join the daily DF's cluster IDs, then compute the per-cluster mean values.
But grouping/applying multiple DFs doesn't work, right?
Ideas
I have two ideas and am not sure which, if any, make sense:
Join the daily DF to the per-minute DF before grouping, which would result in highly redundant data (i.e. the cluster ID replicated for each minute). In my "real" application, I will probably have per-person data too (e.g. height/weight), which would be a completely constant column then, i.e. even more memory wasted. Maybe that's the only/best/accepted way to do it?
Before applying, transform the DF into a DF that can hold complex structures, e.g. like
| Person | dataframe  | key              | column    | value |
|--------+------------+------------------+-----------+-------|
| 1      | heartrates | 2018-01-01 00:00 | heartrate | 70    |
| 1      | heartrates | 2018-01-01 00:01 | heartrate | 72    |
| ...    | ...        | ...              | ...       | ...   |
| 1      | clusters   | 2018-01-01       | cluster   | 1     |
| ...    | ...        | ...              | ...       | ...   |
or maybe even
| Person | JSON    |
|--------+---------|
| 1      | { ... } |
| 2      | { ... } |
| ...    | ...     |
What's the best practice here?

But grouping/applying multiple DFs doesn't work, right?
No, AFAIK this works in neither PySpark nor pandas.
Join the daily DF to the per-minute DF before grouping...
This is the way to go in my opinion. You don't need to merge all the redundant columns, only those required for your groupby operation. There is no way to avoid redundancy in the groupby columns themselves, as they are needed for the groupby operation.
In pandas, it is possible to pass an extra groupby column as a pandas Series, but it must have exactly the same shape as the DataFrame being grouped. However, in order to create that groupby column, you would need a merge anyway.
Before applying, transform the DF into a DF that can hold complex structures
Performance- and memory-wise, I would not go with this solution unless you have multiple groupby operations that would benefit from more complex data structures. Moreover, you would need to put in some effort just to create the data structure in the first place.

Related

Multiple metrics over large dataset in spark

I have a big dataset grouped by a certain field, and I need to run descriptive statistics on each field.
Let's say the dataset is 200M+ records and there are about 15 stat functions I need to run (sum/avg/min/max/stddev, etc.). The problem is that this task is very hard to scale, since there is no clear way to partition the dataset.
Example dataset:
+------------+----------+-------+-----------+------------+
| Department | PartName | Price | UnitsSold | PartNumber |
+------------+----------+-------+-----------+------------+
| Texas      | Gadget1  | 5     | 100       | 5943       |
| Florida    | Gadget3  | 484   | 2400      | 4233       |
| Alaska     | Gadget34 | 44    | 200       | 4235       |
+------------+----------+-------+-----------+------------+
Right now I am doing it this way (example):
from collections import namedtuple
import pyspark.sql.functions as F

Function = namedtuple('Function', ['function', 'alias'])

columns_to_profile = ['Price', 'UnitsSold', 'PartNumber']
functions = [
    Function(F.mean, 'mean'),
    Function(F.min, 'min_value'),
    Function(F.max, 'max_value'),
    Function(F.variance, 'variance'),
    Function(F.kurtosis, 'kurtosis'),
    Function(F.stddev, 'std'),
    Function(F.skewness, 'skewness'),
    Function(count_zeros, 'n_zeros'),   # custom helper (not shown)
    Function(F.sum, 'sum'),
    Function(num_hist, 'hist_data'),    # custom helper (not shown)
]
functions_to_apply = [f.function(c).alias(f'{c}${f.alias}')
                      for c in columns_to_profile
                      for f in functions]
df.groupby('Department').agg(*functions_to_apply).toPandas()
The problem is that the real list of functions is bigger than this (there are about 16-20), applied to each column, and the cluster spends most of its time shuffling, with CPU load at about 5-10%.
How should I partition this data, or is my approach incorrect?
If departments are skewed (i.e. Texas has 90% of the volume), what should my approach be?
This is my Spark DAG for this job (screenshot not included here).

Spark replicating rows with values of a column from different dataset

I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1   | 2     |
| 3   | 5     |
Another dataset have different columns
Dataset B
| id |
| 1  |
| 4  |
I would like to replicate the rows from Dataset A with the column values of Dataset B; you could say it's a join without any join condition. The resulting dataset should look like this:
| id | num | group |
| 1  | 1   | 2     |
| 1  | 3   | 5     |
| 4  | 1   | 2     |
| 4  | 3   | 5     |
Can anyone suggest how the above can be achieved? As per my understanding, a join requires a condition and matching columns between the 2 datasets.
What you want to do is called a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation.

How to add a new column with some constant value while appending two datasets using Groovy?

I have multiple monthly datasets with 50 variables each. I need to append these datasets to create one single dataset. However, while appending I also want to add the month's name to the corresponding records, so that the final dataset has a new column identifying which month each record belongs to.
Example:
Data 1: Monthly_file_201807
ID | customerCategory | Amount  |
1  | home             | 654.00  |
2  | corporate        | 9684.65 |
Data 2: Monthly_file_201808
ID | customerCategory | Amount |
84 | SME              | 985.29 |
25 | Govt             | 844.88 |
On Appending, I want something like this:
ID | customerCategory | Amount  | Month  |
1  | home             | 654.00  | 201807 |
2  | corporate        | 9684.65 | 201807 |
84 | SME              | 985.29  | 201808 |
25 | Govt             | 844.88  | 201808 |
Currently, I'm appending using the following code:
List dsList = [
    Data1Path,
    Data2Path
].collect() { app.data.open(source: it) }

// concatenate all records into a single larger dataset
Dataset ds = app.data.create()
dsList.each() {
    ds.prepareToAdd(it)
    ds.addAll(it)
}
ds.save()
app.data.copy(in: ds, out: FinalAppendedDataPath)
I have used the standard append code, but I am unable to add the additional column with a fixed month value. I don't want to loop through the data to create an additional "month" column, as my data is very large and I have multiple files.

Node.js : Check entries regularly based on their timestamps

I have a Node.js backend and MSSQL as the database. I do some data processing and store logs in a database table that shows which entity is at what stage of the process, e.g. (there are three stages each entity has to pass):
+----+-----------+-------+------------------------+
| id | entity_id | stage | timestamp              |
+----+-----------+-------+------------------------+
| 1  | 1         | 1     | 2019-01-01 12:12:01    |
| 2  | 1         | 2     | 2019-01-01 12:12:10    |
| 3  | 1         | 3     | 2019-01-01 12:12:15    |
| 4  | 2         | 1     | 2019-01-01 12:14:01    |
| 5  | 2         | 2     | 2019-01-01 12:14:10 <--|
| 6  | 3         | 1     | 2019-01-01 12:24:01    |
+----+-----------+-------+------------------------+
As you can see in the row with the arrow, entity no. 2 did not go on to stage 3. After a certain amount of time (maybe 120 seconds), such an entity should be considered faulty, and this should be reported somehow.
What would be a good approach in Node.js to check the table for those timed-out entities? Some kind of cron job that checks all rows every x seconds? That sounds rather clumsy to me.
I am looking forward to your ideas!

PySpark getting distinct values over a wide range of columns

I have data with a large number of custom columns, the content of which I poorly understand. The columns are named evar1 to evar250. What I'd like to get is a single table with all distinct values, a count of how often they occur, and the name of the column.
| columnname | value            | count   |
|------------|------------------|---------|
| evar1      | en-GB            | 7654321 |
| evar1      | en-US            | 1234567 |
| evar2      | www.myclient.com | 123     |
| evar2      | app.myclient.com | 456     |
| ...
The best way I can think of doing this feels terrible, as I believe I have to read the data once per column (there are actually about 400 such columns):
import pyspark.sql.functions as fn

i = 1
df_evars = None
while i <= 30:
    colname = "evar" + str(i)
    df_temp = df.groupBy(colname).agg(fn.count("*").alias("rows"))\
        .withColumn("colName", fn.lit(colname))
    if df_evars:
        df_evars = df_evars.union(df_temp)
    else:
        df_evars = df_temp
    i += 1  # without this the loop never terminates
display(df_evars)
Am I missing a better solution?
Update
This has been marked as a duplicate, but the two responses IMO only solve part of my question.
I am looking at potentially very wide tables with a potentially large number of values. I need a simple way to get 3 columns that show the source column, the value, and the count of that value in the source column.
The first of the responses only gives me an approximation of the number of distinct values, which is pretty useless to me.
The second response seems less relevant than the first. To clarify, source data like this:
| evar1 | evar2 | ... |
|-------|-------|-----|
| A     | A     | ... |
| B     | A     | ... |
| B     | B     | ... |
| B     | B     | ... |
| ...
should result in the output
| columnname | value | count |
|------------|-------|-------|
| evar1      | A     | 1     |
| evar1      | B     | 3     |
| evar2      | A     | 2     |
| evar2      | B     | 2     |
| ...
Using melt borrowed from here:
from pyspark.sql.functions import col

melt(
    df.select([col(c).cast("string") for c in df.columns]),
    id_vars=[], value_vars=df.columns
).groupBy("variable", "value").count()
Adapted from the answer by user6910411.
