Spark Streaming UI - Input Rows vs Input Rate - apache-spark

My Spark application (Structured Streaming) displays a number of Input Rows that is much higher than the number of records I'm sending to the application (in my case, the Input Rows value in the UI is always 21 times the number of actual records).
I can't find a clear explanation of what exactly "Input Rows" means. I read somewhere that it is related to the number of actions performed on the dataset, but the math doesn't add up.
Any help is appreciated.

The number of input rows is simply the total number of rows ingested in a batch (trigger). For example, if the query triggers every 20 seconds and the input rate is 10 rows per second, the input rows for that batch will be 200.
https://spark.apache.org/docs/latest/web-ui.html#structured-streaming-tab
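A minimal sketch of where these numbers come from, using the built-in rate source and a console sink (the source, sink, trigger interval, and rows-per-second value are illustrative choices, not from the question): numInputRows in each query progress is what the UI plots as Input Rows, and it is roughly inputRowsPerSecond multiplied by the trigger interval.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("input-rows-demo").getOrCreate()

# The built-in "rate" source generates `rowsPerSecond` rows every second.
stream = (spark.readStream
          .format("rate")
          .option("rowsPerSecond", 10)
          .load())

query = (stream.writeStream
         .format("console")
         .option("numRows", 5)                  # keep the demo output small
         .trigger(processingTime="20 seconds")  # one micro-batch every 20 s
         .start())

query.awaitTermination(60)

# lastProgress mirrors what the Structured Streaming UI plots:
# numInputRows ≈ inputRowsPerSecond * trigger interval (here ~10 * 20 = 200).
progress = query.lastProgress
if progress is not None:
    print(progress["numInputRows"], progress["inputRowsPerSecond"])

query.stop()
```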

Related

Merging the 30-second sampling epochs into 30-minute sampling epochs in an Excel file

I am a PhD student in Sport Science. In my Excel file, there are five columns (Date, Time, ExactTime (computed by me using an Excel function), Activity Level), as shown in the attached photo. As you can see, in the "ExactTime" column, each row indicates the activity level at a 30-second interval. However, my PhD supervisor would like an Excel file containing the average activity level at each 30-minute interval, rather than the default 30-second interval. For instance, the first row becomes 09-08-2021 12:12:00 (in the first column) and the average activity level from 09-08-2021 11:42:00 to 12:12:00. I would be grateful for some step-by-step guidance on how to do this. Many thanks! The link to my data file is attached. (https://drive.google.com/drive/folders/1roIDdcxGwsq9l630YR0gapQ_yM9hvf0g?usp=sharing)
In short: produce an Excel file containing the average activity level for each 30-minute interval, rather than the default 30-second interval.
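A minimal sketch of one way to do this with pandas instead of Excel formulas; the file name, sheet layout, and column names ("ExactTime", "Activity Level") are assumptions based on the description above:

```python
import pandas as pd

# Assumed file/column names; adjust to match the actual workbook.
df = pd.read_excel("activity_data.xlsx")

# Build a proper datetime index from the ExactTime column.
df["ExactTime"] = pd.to_datetime(df["ExactTime"])
df = df.set_index("ExactTime").sort_index()

# Resample the 30-second epochs into 30-minute epochs and average them.
# label="right"/closed="right" makes each bin labelled by its end time,
# e.g. 12:12:00 covers 11:42:00 < t <= 12:12:00 as in the question.
# If the bins need to start at the first timestamp rather than on the clock,
# add origin="start" to resample().
avg_30min = (df["Activity Level"]
             .resample("30min", label="right", closed="right")
             .mean()
             .reset_index())

avg_30min.to_excel("activity_30min_averages.xlsx", index=False)
```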

Pyspark: Doing a count on a sample of a dataframe instead of the whole dataframe

I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken to calculate this sampled count compared to doing a count on the whole dataset. Both seem to take around 40 seconds. Is there a reason this happens? Also, is there an improvement in terms of memory when using a sampled count over a count on the whole dataframe?
You can use countApprox. It lets you choose how long you're willing to wait for an approximate count within a confidence interval.
Sampling still needs to scan all partitions to produce a uniform sample, so you aren't really saving any time by counting a sample.
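A minimal sketch of what that looks like; countApprox lives on the underlying RDD rather than the DataFrame, and the timeout and confidence values here are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("count-approx-demo").getOrCreate()

# Any DataFrame will do; range() keeps the example self-contained.
df = spark.range(0, 100_000_000)

# Return whatever count is available after at most 5 seconds (timeout is in ms),
# with a 95% confidence requirement on the partial result.
approx = df.rdd.countApprox(timeout=5000, confidence=0.95)
print("approximate count:", approx)
```

Note that df.rdd converts the DataFrame to an RDD of Row objects, which has its own overhead, so it is worth benchmarking on your data.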

how to decide number of executors for 1 billion rows in spark

We have a table which has one billion three hundred and fifty-five million rows.
The table has 20 columns.
We want to join this table with another table that has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation details?
How do we find the amount of memory those 1.355 billion rows will take?
Like #samkart says, you have to experiment to figure out the best parameters since it depends on the size and nature of your data. The spark tuning guide would be helpful.
Here are some things that you may want to tweak:
spark.executor.cores is 1 by default (on YARN), but you should look to increase it to improve parallelism per executor. A common rule of thumb is to set it to about 5.
spark.sql.files.maxPartitionBytes determines the amount of data per partition while reading files, and hence the initial number of partitions. You could tweak this depending on the data size; the default is 128 MB (the typical HDFS block size).
spark.sql.shuffle.partitions is 200 by default, but tweak it depending on the data size and the number of cores. This blog would be helpful.
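A minimal sketch of setting these together when building a SparkSession; the specific numbers are illustrative placeholders to be tuned against the actual data and cluster, and the paths and join key are hypothetical:

```python
from pyspark.sql import SparkSession

# Illustrative values only; tune against your data volume and cluster size.
spark = (SparkSession.builder
         .appName("large-join")
         .config("spark.executor.instances", "50")          # number of executors
         .config("spark.executor.cores", "5")                # cores per executor
         .config("spark.executor.memory", "20g")             # heap per executor
         .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
         .config("spark.sql.shuffle.partitions", "2000")     # partitions for the join's shuffle
         .getOrCreate())

big_a = spark.read.parquet("/data/table_a")   # hypothetical paths
big_b = spark.read.parquet("/data/table_b")

joined = big_a.join(big_b, on="id", how="inner")  # "id" is a placeholder join key
```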

Filtering for rows within every 30-second interval

I have a large data file from a test where I send a voltage that is incremented by 1 mV every 30 s (from 0 to 5 V) to test the accuracy of my system. The computer outputs a file that has over 70,000 rows of data, but all I am really concerned with is the data that occurs every 30 s. Is there a way to filter for only the data that aligns with the 30 s intervals, ideally ending up with around 5,000 rows of data?
I am stuck and I really don't want to manually sort through 70,000 lines of data; any help is greatly appreciated.
So you want to filter and see only the rows that occur at every 30-second mark? You can add a calculated column in Excel to extract the seconds and filter by that column:
=RIGHT(TEXT(A1, "hh:mm:ss"),2)
This will extract the seconds from a time value, and you can then filter for rows where the seconds are 00 or 30. Replace A1 with your correct column.
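If the file is easier to handle outside Excel, here is a minimal sketch with pandas; the file name and "Time" column name are assumptions:

```python
import pandas as pd

# Assumed file and column names; adjust to the actual export.
df = pd.read_csv("test_output.csv")

# Parse the timestamp column and keep only rows whose seconds land on :00 or :30,
# i.e. the rows aligned with the 30-second voltage steps.
df["Time"] = pd.to_datetime(df["Time"])
aligned = df[df["Time"].dt.second.isin([0, 30])]

aligned.to_csv("filtered_30s.csv", index=False)
print(len(df), "rows in,", len(aligned), "rows out")
```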

Pyspark job being stuck at the final task

The flow of my program is something like this:
1. Read 4 billion rows (~700 GB) of data from a Parquet file into a data frame. The number of partitions used is 2296.
2. Clean it and filter out 2.5 billion rows.
3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The model is a logistic regression that predicts 0 or 1, and 30% of the data is filtered out of the transformed data frame.
4. The above data frame is left-outer joined with another dataset of ~1 TB (also read from a Parquet file). The number of partitions is 4000.
5. Join it with another dataset of around 100 MB like
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above dataframe is then exploded by a factor of ~2000:
exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed:
aggregate = exploded_data.groupBy(*cols_to_select)\
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))
There are a total of 10 columns in the cols_to_select list.
8. And finally an action, aggregate.count() is performed.
The problem is that the third-to-last count stage (200 tasks) gets stuck at task 199 forever. Despite allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried breaking the input down from 4 billion rows to 700 million rows (one sixth), and it took four hours. I would really appreciate some help in speeding this process up. Thanks.
The operation was getting stuck at the final task because of skewed data being joined to a huge dataset. The key joining the two dataframes was heavily skewed. The problem was solved for now by removing the skewed data from the dataframe. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). See this informative video for more details: https://www.youtube.com/watch?v=6zg7NTw-kTQ
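The answer's fix was to drop or iteratively broadcast the skewed keys; a related alternative (not from the answer) is to salt the skewed join key so that one hot key is spread across many tasks. A minimal sketch, where the salt factor, key name, and dataframes are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("salted-join-sketch").getOrCreate()

SALT_BUCKETS = 32  # illustrative; size it to the observed skew

# Hypothetical inputs standing in for the two dataframes being joined.
big = spark.createDataFrame([(1, "a"), (1, "b"), (2, "c")], ["key", "payload"])
dim = spark.createDataFrame([(1, "x"), (2, "y")], ["key", "attr"])

# Add a random salt to the skewed side so one hot key becomes SALT_BUCKETS keys.
big_salted = big.withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))

# Replicate the other side once per salt value so every salted key still matches.
salts = spark.range(SALT_BUCKETS).select(F.col("id").cast("int").alias("salt"))
dim_salted = dim.crossJoin(salts)

joined = big_salted.join(dim_salted, on=["key", "salt"], how="left_outer").drop("salt")
joined.show()
```

On Spark 3.x, enabling adaptive query execution (spark.sql.adaptive.enabled together with spark.sql.adaptive.skewJoin.enabled) can also split skewed partitions automatically.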
