The flow of my program is something like this:
1. Read 4 billion rows (~700 GB) of data from a Parquet file into a data frame, using 2296 partitions.
2. Clean it and filter out 2.5 billion rows
3. Transform the remaining 1.5 billion rows using a pipeline model and then a trained model. The trained model is a logistic regression that predicts 0 or 1; based on its predictions, 30% of the data is filtered out of the transformed data frame.
4. The above data frame is left-outer-joined with another dataset of ~1 TB (also read from a Parquet file), using 4000 partitions.
5. Join it with another dataset of around 100 MB, like:
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above dataframe is then exploded by a factor of ~2000:
exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed:
aggregate = exploded_data.groupBy(*cols_to_select) \
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))
There are a total of 10 columns in the cols_to_select list.
8. Finally, an action, aggregate.count(), is performed.
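Putting it together, a condensed sketch of the flow (input_path, big_path, clean_condition, pipeline_model, lr_model, and join_key stand in for my actual code):

from pyspark.sql import functions as F

# 1. read ~700 GB of parquet into 2296 partitions
df = spark.read.parquet(input_path).repartition(2296)

# 2-3. clean/filter, then run the fitted pipeline and logistic regression model
cleaned = df.filter(clean_condition)
scored = lr_model.transform(pipeline_model.transform(cleaned))
kept = scored.filter(F.col("prediction") == 1)   # filter on the prediction; ~30% of rows drop out

# 4. left outer join with the ~1 TB dataset (4000 shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", 4000)
data1 = kept.join(spark.read.parquet(big_path), "join_key", "left_outer")

# 5. broadcast join with the 100 MB dataset
joined_data = data1.join(F.broadcast(small_dataset_100MB),
                         data1.field == small_dataset_100MB.field, "left_outer")

# 6. explode by a factor of ~2000
exploded_data = joined_data.withColumn('field', F.explode('field_list'))

# 7-8. aggregate, then trigger the action
aggregate = exploded_data.groupBy(*cols_to_select) \
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'),
         F.count("*").alias('count_all'))
aggregate.count()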
The problem is that the third-to-last count stage (200 tasks) gets stuck at task 199 forever. Despite my allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried cutting the input down from 4 billion rows to 700 million rows (one sixth of the data), and it still took four hours. I would really appreciate some help with speeding this process up. Thanks
The operation was getting stuck at the final task because of skewed data being joined to a huge dataset. The key on which the two dataframes were joined was heavily skewed. For now, the problem was solved by removing the skewed keys from the dataframe. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). This informative video covers it in more detail: https://www.youtube.com/watch?v=6zg7NTw-kTQ
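Another common mitigation, distinct from the iterative broadcast join above, is key salting: spread the hot key across many partitions by adding a random suffix on the skewed side and replicating the other side over every suffix. A minimal sketch, assuming data1 is the skewed side, other_df is the table it joins to, and join_key is the skewed key (the salt factor of 32 is illustrative):

from pyspark.sql import functions as F

SALT = 32  # shards per key; tune to the severity of the skew

# skewed side: assign each row a random salt in [0, SALT)
left_salted = data1.withColumn("salt", (F.rand() * SALT).cast("int"))

# other side: replicate each row once per salt value
salts = spark.range(SALT).select(F.col("id").cast("int").alias("salt"))
right_salted = other_df.crossJoin(salts)

# join on (key, salt): the hot key's rows now spread over SALT tasks
joined = left_salted.join(right_salted, ["join_key", "salt"], "left_outer")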
Related
I have a problem that I can't understand. I have 3 nodes (RF: 3) in my cluster, and the node hardware is pretty good. There are now 60-70 million rows and 3000 columns of data in my cluster, and I want to query a specific subset of approximately 265,000 rows and 4 columns. With the default fetch size, I can retrieve 5000 rows of data per second up to about 55,000 rows; after that, my retrieval speed drops.
I think this can be fixed from the cassandra.yaml file; do you have any idea what I should check?
I currently have some code that computes the overall time taken to run the count operation on a dataframe. I have another implementation which measures the time taken to run count on a sampled version of this dataframe.
sampled_df = df.sample(withReplacement=False, fraction=0.1)
sampled_df.count()
I then extrapolate the overall count from the sampled count. But I do not see an overall decrease in the time taken to calculate this sampled count compared to doing a count on the whole dataset; both seem to take around 40 seconds. Is there a reason this happens? Also, is there an improvement in memory usage when using a sampled count over a count on the whole dataframe?
You can use countApprox. This lets you choose how long you're willing to wait for an approximate count at a given confidence level.
sample still needs to access all partitions to create a sample that is uniform, so you aren't really saving any time by sampling.
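For example, countApprox is exposed on the RDD API, so you go through df.rdd; the timeout (in milliseconds) and confidence values below are illustrative:

# Return whatever estimate is available after at most 20 seconds,
# at 95% confidence (both values are illustrative).
approx_count = df.rdd.countApprox(timeout=20000, confidence=0.95)
print(approx_count)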
We have a table which has 1.355 billion rows.
The table has 20 columns.
We want to join this table with another table which has more or less the same number of rows.
How do we decide the value for spark.conf.set("spark.sql.shuffle.partitions", ?)?
How do we decide the number of executors and their resource allocation details?
How do we find out how much memory those 1.355 billion rows will take?
Like #samkart says, you have to experiment to figure out the best parameters, since they depend on the size and nature of your data. The Spark tuning guide would be helpful.
Here are some things that you may want to tweak:
spark.executor.cores is 1 by default (on YARN), but you should look to increase it to improve parallelism. A common rule of thumb is to set it to 5.
spark.sql.files.maxPartitionBytes determines the amount of data per partition when reading files, and hence the initial number of partitions. You can tweak it depending on the data size; the default is 128 MB (matching the HDFS block size).
spark.sql.shuffle.partitions is 200 by default, but tweak it depending on the data size and the number of cores. This blog would be helpful.
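As a starting point, the settings above can be applied when building the session; all of the numbers below are assumptions to benchmark against your own data, not recommendations:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # ~5 cores per executor is a common rule of thumb
    .config("spark.executor.cores", 5)
    # target ~128 MB of input data per read partition
    .config("spark.sql.files.maxPartitionBytes", 128 * 1024 * 1024)
    # rough heuristic: total shuffle data divided by a 100-200 MB
    # per-partition target, rounded to a multiple of total cores
    .config("spark.sql.shuffle.partitions", 2000)
    .getOrCreate()
)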
I have a list of more than 25 million records (a 1-D array). I want to normalise the values to between 0 and 5.
I'm using scikit-learn's MinMaxScaler for this. It works fine for up to about 20M records, but as the size increases it takes a huge amount of time.
Any suggestions on how to do this in an optimised way?
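For reference, a minimal version of the setup being described (the random array is a stand-in for the real data):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

values = np.random.rand(25_000_000)          # stand-in for the 25M-record list
scaler = MinMaxScaler(feature_range=(0, 5))  # scale into [0, 5]
# MinMaxScaler expects a 2-D array, so reshape to one column and back
scaled = scaler.fit_transform(values.reshape(-1, 1)).ravel()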
I need to select n rows from a very large dataset which has millions of rows, let's say 4 million rows out of 15 million. Currently, I'm adding row_number to the records within each partition and selecting the required percentage of records from each partition. For instance, 4 million is 26.66% of 15 million, but when I try to choose 26% from each partition, the total comes up short because of the missing 0.66%. As shown below, rows are selected when the row_number is less than the percentage. Is there a better way to do this?
The DataFrame sample function can be used. A solution is available in the link below:
How to select an exact number of random rows from DataFrame
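That approach boils down to something like this (a sketch: since sample only honours the fraction approximately, the 10% oversampling margin below is an assumption, trimmed back to exactly n with limit):

n = 4_000_000
total = df.count()

# oversample slightly, then trim to exactly n rows
fraction = min(1.0, (n / total) * 1.1)
exact_n_df = df.sample(withReplacement=False, fraction=fraction).limit(n)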