Given a dataset where each line represents an event for a given machine, e.g. {"machineId":"123","timestamp":12222345}...
compute the average amount of time between two consecutive events for each machine,
using the Spark RDD API.
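A minimal sketch with the RDD API, assuming the input is a text file of JSON lines with the machineId and timestamp fields shown above (the path and the groupByKey-based approach are illustrative, not a reference implementation):

import json
from pyspark import SparkContext

sc = SparkContext(appName="avg-time-between-events")

# Each line is a JSON event such as {"machineId": "123", "timestamp": 12222345}.
events = sc.textFile("hdfs:///path/to/events")  # placeholder path

def parse(line):
    e = json.loads(line)
    return (e["machineId"], e["timestamp"])

def avg_gap(timestamps):
    # Sort the timestamps and average the gaps between consecutive events.
    ts = sorted(timestamps)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    return sum(gaps) / float(len(gaps)) if gaps else None

avg_per_machine = events.map(parse).groupByKey().mapValues(avg_gap)
print(avg_per_machine.collect())

Since consecutive gaps telescope, the average gap per machine is simply (max - min) / (count - 1), so the same result can also be obtained with a reduceByKey over (min, max, count) triples, which shuffles far less data than groupByKey.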
I have the following pipeline in HDFS which I am processing in Spark:
input table : batch, team, user, metric1, metric2
This table has user-level metrics in hourly batches. In the same hour a user can have multiple entries.
level 1 aggregation: get the latest entry per user per batch
agg(metric1) as user_metric1, agg(metric2) as user_metric2 (group by batch, team, user)
level 2 aggregation: get team-level metrics
agg(user_metric1) as team_metric1, agg(user_metric2) as team_metric2 (group by batch, team)
The input table is 8 GB (Snappy Parquet format) in HDFS. My Spark job shows about 40 GB of shuffle write and at least 1 GB of shuffle spill per executor.
In order to minimize this, if I repartition the input table on the user column before performing the aggregation,
df = df.repartition('user')
would it improve performance? How should I approach this problem if I want to reduce shuffle?
I am running with the following resources:
spark.executor.cores=6
spark.cores.max=48
spark.sql.shuffle.partitions=200
Spark shuffles data from one node to another because the resources (the input data, etc.) are distributed over the cluster; this can slow down the calculation and put heavy network traffic on the cluster. In your case the shuffle is caused by the group by. If you repartition on the three columns of the group by, it will reduce the amount of shuffling. Regarding the Spark configuration, the default spark.sql.shuffle.partitions is 200; let's leave that as it is. The repartition itself will take some time, but once it is done the calculation will be faster:
new_df = df.repartition("batch", "team", "user")
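Continuing from that new_df, a hedged sketch of the two-level aggregation, using sum as a stand-in for the unspecified agg functions (only the column names from the question are real; everything else is illustrative):

from pyspark.sql import functions as F

# Level 1: per-user metrics within each batch/team.
# Because new_df is already partitioned on these keys, Spark can
# skip the extra exchange before this aggregation.
user_level = (new_df.groupBy("batch", "team", "user")
                    .agg(F.sum("metric1").alias("user_metric1"),
                         F.sum("metric2").alias("user_metric2")))

# Level 2: roll the user metrics up to team level (this step still shuffles,
# but over the much smaller user-level result).
team_level = (user_level.groupBy("batch", "team")
                        .agg(F.sum("user_metric1").alias("team_metric1"),
                             F.sum("user_metric2").alias("team_metric2")))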
For one of my data validation scripts, I am doing df.describe() on the input dataframe to prepare a data profile (where df is my input dataframe). For a particular dataframe of 3000 columns (but only 16 records), this part takes about 3 minutes on average. When I checked the Spark history, this part executed in two stages, each with 1 task. On the event timeline, 99.99% of the time shows as "executor computing time". Shuffle write and shuffle read are only a few KB and take less than a millisecond to complete. Checking the DAG execution flow for this stage gave me the following:
stage 1 (~1.4 min):
scan csv --> project --> sortAggregate --> exchange (shuffle write)
stage 2 (~1.6 min):
exchange (shuffle read) --> sortAggregate
So from this I understand that sortAggregate over 3000 columns is causing the delay. What should I do to bring down this execution time?
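For reference, a minimal sketch of the profiling step as described; as an aside, df.summary() (available since Spark 2.3) can restrict which statistics are computed, which may reduce the aggregation work (the path and session setup are placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-profile").getOrCreate()

# Wide but tiny input: ~3000 columns, 16 records (path is a placeholder).
df = spark.read.csv("hdfs:///path/to/input.csv", header=True, inferSchema=True)

# describe() aggregates count/mean/stddev/min/max for every column;
# with 3000 columns this is what shows up as the sortAggregate stages.
profile = df.describe()

# If only some statistics are needed, summary() can limit them.
smaller_profile = df.summary("count", "mean", "max")

profile.show(truncate=False)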
I am redesigning a real-time prediction pipeline over streaming IoT sensor data. The pipeline ingests sensor data samples, structured as (sensor_id, timestamp, sample_index, value), as they are created in the source system, saves them locally, and runs PySpark batch jobs for training algorithms and making predictions.
Currently, sensor data is saved to local files on disk, with a single file per sensor, and to HDFS for Spark Streaming. The streaming job picks up each microbatch, calculates how many samples arrived for each sensor, and decides which sensors have accumulated enough new data to make a new prediction. It then maps each sensor row in the RDD to a method that opens the data file with Python's open, scans to the last processed sample, picks up the data from that sample onwards plus some history data required for the prediction, and runs the prediction job on the Spark cluster. In addition, every fixed number of samples each algorithm requires a refit, which queries a long history from the same data store and runs on the Spark cluster.
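A rough sketch of the per-sensor read step as described, assuming one value per line in a local file per sensor and a known last-processed index (the file layout, helper name, and history length are illustrative assumptions, not the pipeline's actual code):

def read_new_samples(sensor_id, last_index, history=50):
    # Open the sensor's local file, keep some history before the last
    # processed sample, and return the values as an ordered list.
    values = []
    with open("/data/sensors/%s.csv" % sensor_id) as f:  # assumed layout
        for i, line in enumerate(f):
            if i >= last_index - history:
                values.append(float(line.strip()))
    return (sensor_id, values)

# ready_sensors is an RDD of (sensor_id, last_processed_index) pairs for
# sensors that accumulated enough new samples in this microbatch.
sensor_data = ready_sensors.map(lambda kv: read_new_samples(kv[0], kv[1]))

Each map task opens one file per sensor, which is exactly the per-file latency described below.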
Finally, the RDD that is processed by the prediction job looks like this:
|-----------|-----------------|
| sensor_id | sensor_data     |
|-----------|-----------------|
| SENSOR_0  | [13,52,43,54,5] |
| SENSOR_1  | [22,42,23,3,35] |
| SENSOR_2  | [43,2,53,64,42] |
|-----------|-----------------|
We are now encountering a problem of scale when monitoring a few hundred thousand sensors. It seems that the most costly operation in the process is reading data from files: a latency of a few dozen milliseconds per file read accumulates into an unmanageable latency for the entire prediction job. Furthermore, storing the data as flat files on disk does not scale at all.
We are looking into changing the storage method in order to improve performance and offer scalability. Using time-series databases (we tried TimescaleDB and InfluxDB) poses the problem of querying the data for all sensors in one query, when each sensor needs to be queried from a different point in time, and then grouping the separate samples into the sensor_data column as seen above; this is very costly, causes lots of shuffles, and even underperforms the flat-file solution. We are also trying Parquet files, but their write-once behavior makes it difficult to plan a data structure that will perform well in this case.
tl;dr -
I am looking for a performant architecture for the following scenario:
streaming sensor data is ingested in real time
when a sensor accumulates enough samples, current + historic data is queried and sent to prediction job
each prediction job handles all sensors that reached threshold in the last microbatch
RDD contains rows of sensor ID and an ordered array of all queried samples
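For reference, a minimal sketch of how the prediction input described in the last point could be assembled once the per-sensor samples have been queried (the groupByKey and sorting step are illustrative, not the pipeline's actual code):

# samples is an RDD of (sensor_id, (sample_index, value)) tuples covering
# the required history plus the newly accumulated samples.
sensor_data = (samples.groupByKey()
                      .mapValues(lambda xs: [v for _, v in sorted(xs)]))

# sensor_data now matches the layout shown above:
# ('SENSOR_0', [13, 52, 43, 54, 5]), ('SENSOR_1', [22, 42, 23, 3, 35]), ...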
I have a daily-level transaction dataset covering three months, around 17 GB combined. I have a server with 16 cores and 64 GB of RAM, with 1 TB of hard-disk space. The transaction data is broken into 90 files, each with the same format, and there is a set of queries to be run over the entire dataset; the query for each daily-level file is the same across all 90 files. The result of each query run is appended, and then we get the resulting summary back. Now, before I start on my endeavour, I was wondering if Apache Spark with PySpark can be used to solve this. I tried R, but it was very slow and ultimately I ran into out-of-memory issues.
So my question has two parts:
1. How should I create my RDD? Should I pass my entire dataset as one RDD, or is there a way to tell Spark to work in parallel on these 90 datasets? (a sketch follows below)
2. Can I expect a significant speed improvement if I am not working with Hadoop?
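On the first part, a hedged sketch of the usual approach, assuming the 90 daily files share a CSV-like layout: a single read over a glob covers all files, and Spark splits the work across the available cores on its own (the paths, grouping column, and aggregation are placeholders):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("daily-transactions")
         .master("local[16]")   # single server, no Hadoop/YARN needed
         .getOrCreate())

# A glob over the 90 daily files creates one DataFrame whose partitions
# are processed in parallel across the 16 cores.
df = spark.read.csv("/data/transactions/day_*.csv", header=True, inferSchema=True)

# Run the common query once over all days and write out the summary.
summary = (df.groupBy("some_key")          # placeholder grouping column
             .agg({"amount": "sum"}))      # placeholder aggregation
summary.write.csv("/data/output/summary", header=True, mode="overwrite")

Running in local mode on the single server means no Hadoop setup is required, and since Spark processes the data partition by partition, the 17 GB dataset does not have to fit in the 64 GB of RAM at once.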
Is there a way to define the batch in Spark Streaming such that each RDD represents a single record rather than the data of a time interval?