Multiple metrics over large dataset in spark - apache-spark

I have a big dataset grouped by a certain field, and I need to run descriptive statistics on each column within every group.
The dataset is 200M+ records and there are about 15 statistic functions I need to run: sum, avg, min, max, stddev, etc. The problem is that the task is very hard to scale, since there is no clear way to partition the dataset.
Example dataset:
+------------+----------+-------+-----------+------------+
| Department | PartName | Price | UnitsSold | PartNumber |
+------------+----------+-------+-----------+------------+
| Texas      | Gadget1  |     5 |       100 |       5943 |
| Florida    | Gadget3  |   484 |      2400 |       4233 |
| Alaska     | Gadget34 |    44 |       200 |       4235 |
+------------+----------+-------+-----------+------------+
Right now I am doing it this way (example):
import pyspark.sql.functions as F
from collections import namedtuple

# Function is just a (function, alias) pair; count_zeros, num_hist and
# get_functions(column_types, c) are helpers defined elsewhere in my code.
Function = namedtuple('Function', ['function', 'alias'])

columns_to_profile = ['Price', 'UnitsSold', 'PartNumber']

functions = [
    Function(F.mean, 'mean'),
    Function(F.min, 'min_value'),
    Function(F.max, 'max_value'),
    Function(F.variance, 'variance'),
    Function(F.kurtosis, 'kurtosis'),
    Function(F.stddev, 'std'),
    Function(F.skewness, 'skewness'),
    Function(count_zeros, 'n_zeros'),
    Function(F.sum, 'sum'),
    Function(num_hist, 'hist_data'),
]

# get_functions returns the subset of `functions` applicable to column c
functions_to_apply = [f.function(c).alias(f'{c}${f.alias}')
                      for c in columns_to_profile
                      for f in get_functions(column_types, c)]

df.groupby('Department').agg(*functions_to_apply).toPandas()
The problem is that the real list of functions is bigger than this (about 16-20), and each of them is applied to every column; the cluster spends most of its time shuffling, and CPU load stays around 5-10%.
How should I partition this data, or is my approach incorrect altogether?
If departments are skewed (i.e. Texas has 90% of the volume), what should my approach be?
This is my Spark DAG for this job:

Related

Spark replicating rows with values of a column from different dataset

I am trying to replicate rows inside a dataset multiple times with different values for a column in Apache Spark. Let's say I have a dataset as follows:
Dataset A
| num | group |
| 1 | 2 |
| 3 | 5 |
Another dataset has different columns:
Dataset B
| id |
| 1 |
| 4 |
I would like to replicate the rows of Dataset A with the column values of Dataset B; you could call it a join without any join condition. The resulting dataset should look like this:
| id | num | group |
| 1 | 1 | 2 |
| 1 | 3 | 5 |
| 4 | 1 | 2 |
| 4 | 3 | 5 |
Can anyone suggest how the above can be achieved? As per my understanding, join requires a condition and columns to be matched between 2 datasets.
What you want to do is called a Cartesian product, and df1.crossJoin(df2) will achieve it. But be careful with it, because it is a very heavy operation.
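For illustration, a minimal PySpark sketch of that cross join on the sample data above (the DataFrame names df_a and df_b are mine):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df_a = spark.createDataFrame([(1, 2), (3, 5)], ["num", "group"])  # Dataset A
df_b = spark.createDataFrame([(1,), (4,)], ["id"])                # Dataset B

# Pair every row of Dataset B with every row of Dataset A
result = df_b.crossJoin(df_a)
result.show()
# Expected rows (order may vary):
# +---+---+-----+
# | id|num|group|
# +---+---+-----+
# |  1|  1|    2|
# |  1|  3|    5|
# |  4|  1|    2|
# |  4|  3|    5|
# +---+---+-----+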

How to add a new column with some constant value while appending two datasets using Groovy?

I have multiple monthly datasets with 50 variables each. I need to append these datasets to create one single dataset. However, I also want to add the month's name to the corresponding records while appending such that I can see a new column in the final dataset which can be used to identify records belonging to a month.
Example:
Data 1: Monthly_file_201807
ID | customerCategory | Amount |
1 | home | 654.00 |
2 | corporate | 9684.65 |
Data 2: Monthly_file_201808
ID | customerCategory | Amount |
84 | SME | 985.29 |
25 | Govt | 844.88 |
On Appending, I want something like this:
ID | customerCategory | Amount | Month |
1 | home | 654.00 | 201807 |
2 | corporate | 9684.65 | 201807 |
84 | SME | 985.29 | 201808 |
25 | Govt | 844.88 | 201808 |
Currently, I'm appending using the following code:
List dsList = [
    Data1Path,
    Data2Path
].collect() { app.data.open(source: it) }

// concatenate all records into a single larger dataset
Dataset ds = app.data.create()
dsList.each() {
    ds.prepareToAdd(it)
    ds.addAll(it)
}
ds.save()
app.data.copy(in: ds, out: FinalAppendedDataPath)
I have used the standard append code, but I am unable to add the additional column with a fixed month value. I don't want to loop through the data to create a "month" column, as my data is very large and I have multiple files.

How to simultaneously group/apply two Spark DataFrames?

/* My question is language-agnostic I think, but I'm using PySpark if it matters. */
Situation
I currently have two Spark DataFrames:
One with per-minute data (1440 rows per person and day) of a person's heart rate per minute:
| Person | date | time | heartrate |
|--------+------------+-------+-----------|
| 1 | 2018-01-01 | 00:00 | 70 |
| 1 | 2018-01-01 | 00:01 | 72 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 11:32 | 123 |
| ... | ... | ... | ... |
And another DataFrame with daily metadata (1 row per person and day), including the results of a clustering of days, i.e. which cluster day X of person Y fell into:
| Person | date | cluster | max_heartrate |
|--------+------------+---------+----------------|
| 1 | 2018-01-01 | 1 | 180 |
| 1 | 2018-01-02 | 4 | 166 |
| ... | ... | ... | ... |
| 4 | 2018-10-03 | 1 | 147 |
| ... | ... | ... | ... |
(Note that clustering is done separately per person, so cluster 1 for person 1 has nothing to do with person 2's cluster 1.)
Goal
I now want to compute, say, the mean heart rate per cluster and per person, that is, each person gets different means. If I have three clusters, I am looking for this DF:
| Person | cluster | mean_heartrate |
|--------+---------+----------------|
| 1 | 1 | 123 |
| 1 | 2 | 89 |
| 1 | 3 | 81 |
| 2 | 1 | 80 |
| ... | ... | ... |
How do I best do this? Conceptually, I want to group these two DataFrames per person and send the two DF chunks into an apply function. In there (i.e. per person), I'd group and aggregate the per-minute DF per day, then join in the daily DF's cluster IDs, then compute the per-cluster mean values.
But grouping/applying multiple DFs doesn't work, right?
Ideas
I have two ideas and am not sure which, if any, make sense:
Join the daily DF to the per-minute DF before grouping, which would result in highly redundant data (i.e. the cluster ID replicated for each minute). In my "real" application, I will probably have per-person data too (e.g. height/weight), which would be a completely constant column then, i.e. even more memory wasted. Maybe that's the only/best/accepted way to do it?
Before applying, transform the DF into a DF that can hold complex structures, e.g. like:
| Person | dataframe | key | column | value |
|--------+------------+------------------+-----------+-------|
| 1 | heartrates | 2018-01-01 00:00 | heartrate | 70 |
| 1 | heartrates | 2018-01-01 00:01 | heartrate | 72 |
| ... | ... | ... | ... | ... |
| 1 | clusters | 2018-01-01 | cluster | 1 |
| ... | ... | ... | ... | ... |
or maybe even
| Person | JSON |
|--------+--------|
| 1 | { ...} |
| 2 | { ...} |
| ... | ... |
What's the best practice here?
But grouping/applying multiple DFs doesn't work, right?
No, AFAIK this does not work, neither in PySpark nor in pandas.
Join the daily DF to the per-minute DF before grouping...
This is the way to go in my opinion. You don't need to merge all redundant columns, only those required for your groupby operation. There is no way to avoid redundancy for the groupby columns, as they are needed for the groupby operation itself.
In pandas, it is possible to provide an extra groupby column as a pandas Series, but it must have the exact same shape as the DataFrame being grouped. However, in order to create that groupby column you would need a merge anyway.
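As a minimal PySpark sketch of the join-then-group approach (I'm calling the two DataFrames minute_df and daily_df; the column names are taken from the question):

from pyspark.sql import functions as F

# minute_df: Person, date, time, heartrate      (per-minute rows)
# daily_df:  Person, date, cluster, max_heartrate (one row per person and day)
mean_per_cluster = (
    minute_df
    .join(daily_df.select("Person", "date", "cluster"), on=["Person", "date"])
    .groupBy("Person", "cluster")
    .agg(F.mean("heartrate").alias("mean_heartrate"))
)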
Before applying, transform the DF into a DF that can hold complex structures
Performance- and memory-wise, I would not go with this solution unless you have multiple groupby operations that would benefit from the more complex data structure. In fact, you would need to put in some effort just to create that structure in the first place.

Spark create multiple Data frames from one Data frame

I am using Spark 2.1 with Cassandra (3.9) as the data source. C* has a big table with 50 columns, which is not a good data model for my use case, so I created split tables for each of those sensors along with the partition key and clustering key columns.
All sensor table
-----------------------------------------------------
| Device | Time | Sensor1 | Sensor2 | Sensor3 |
| dev1 | 1507436000 | 50.3 | 1 | 1 |
| dev2 | 1507436100 | 90.2 | 0 | 1 |
| dev1 | 1507436100 | 28.1 | 1 | 1 |
-----------------------------------------------------
Sensor1 table
-------------------------------
| Device | Time | value |
| dev1 | 1507436000 | 50.3 |
| dev2 | 1507436100 | 90.2 |
| dev1 | 1507436100 | 28.1 |
-------------------------------
Now I am using Spark to copy data from the old table to the new ones.
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load().cache()
df.createOrReplaceTempView("data")

query = '''select device, time, sensor1 as value from data'''
vgDF = spark.sql(query)
vgDF.write \
    .format("org.apache.spark.sql.cassandra") \
    .mode('append') \
    .options(table="sensor1", keyspace="dataks") \
    .save()
Copying the data one sensor at a time is taking a lot of time (2.1 hours for a single table). Is there any way I can select * once, create multiple DataFrames for each sensor, and save them all at once (or even sequentially)?
One issue in the code is the cache:
df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load().cache()
Here I don't see df being used multiple times apart from the save, so the cache is counterproductive. You are reading the data, filtering it, and saving it to a separate Cassandra table; the only action happening on the DataFrame is the save and nothing else.
So there is no benefit from caching the data here. Removing the cache will give you some speedup.
To create the multiple tables sequentially, I would suggest using partitionBy to first write the data to HDFS, partitioned by sensor, and then write it back to Cassandra.
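A rough PySpark sketch of that idea, assuming the wide table is first unpivoted into (device, time, sensor, value) rows; the staging path and the stack() expression are illustrative, not taken from the question:

wide_df = spark.read \
    .format("org.apache.spark.sql.cassandra") \
    .options(table="allsensortables", keyspace="dataks") \
    .load()

# Unpivot the three sensor columns into (sensor, value) pairs
long_df = wide_df.selectExpr(
    "device", "time",
    "stack(3, 'sensor1', cast(sensor1 as double), "
    "'sensor2', cast(sensor2 as double), "
    "'sensor3', cast(sensor3 as double)) as (sensor, value)"
)

# One HDFS directory per sensor; each can then be read back and appended
# to its target Cassandra table
long_df.write.partitionBy("sensor").parquet("/staging/sensors")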

Performance: Group by a subset of previous grouping columns

I have a DataFrame with two categorical columns, similar to the following example:
+----+-------+-------+
| ID | Cat A | Cat B |
+----+-------+-------+
| 1 | A | B |
| 2 | B | C |
| 5 | A | B |
| 7 | B | C |
| 8 | A | C |
+----+-------+-------+
I have some processing to do that needs two steps: The first one needs the data to be grouped by both categorical columns. In the example, it would generate the following DataFrame:
+-------+-------+-----+
| Cat A | Cat B | Cnt |
+-------+-------+-----+
| A | B | 2 |
| B | C | 2 |
| A | C | 1 |
+-------+-------+-----+
Then, the next step consists of grouping only by CatA, to calculate a new aggregation, for example:
+-----+-----+
| Cat | Cnt |
+-----+-----+
| A | 3 |
| B | 2 |
+-----+-----+
Now come the questions:
In my solution, I create the intermediate dataframe by doing
val df2 = df.groupBy("catA", "catB").agg(...)
and then I aggregate this df2 to get the last one:
val df3 = df2.groupBy("catA").agg(...)
I assume it is more efficient than aggregating the first DF again. Is that a good assumption? Or does it make no difference?
Are there any suggestions of a more efficient way to achieve the same results?
Generally speaking, it looks like a good approach and should be more efficient than aggregating the full data twice. Since shuffle files are implicitly cached, at least part of the work should be performed only once: when you call an action on df2 and subsequently on df3, you should see that the stages corresponding to df2 have been skipped. Also, the partial structure enforced by the first shuffle may reduce the memory requirements for the aggregation buffer during the second agg.
Unfortunately, DataFrame aggregations, unlike RDD aggregations, cannot use a custom partitioner. This means you cannot compute both DataFrames with a single shuffle based on the value of catA, so the second aggregation will require a separate exchange (hash partitioning). I doubt that justifies switching to RDDs.
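For concreteness, a PySpark sketch of the two-step aggregation (the question uses Scala, but the mechanics are the same; the count/sum aggregates are placeholders for the real ones):

from pyspark.sql import functions as F

df2 = df.groupBy("catA", "catB").agg(F.count("*").alias("Cnt"))
df3 = df2.groupBy("catA").agg(F.sum("Cnt").alias("Cnt"))

df2.collect()  # first action: performs the shuffle on (catA, catB)
df3.collect()  # per the explanation above, the stages already computed for df2
               # should show up as "skipped" in the Spark UI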
