Speed difference between spark.read.parquet and spark.read.format.load - apache-spark

I'm trying to understand what is causing the huge difference in reading speed. I have a dataframe with 30 million rows and 38 columns.
final_df=spark.read.parquet("/dbfs/FileStore/path/to/file.parquet")
This takes 14 minutes to read the file, while
final_df = spark.read.format("parquet").load("/dbfs/FileStore/path/to/file.parquet")
takes only 2 seconds to read the file.
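For what it's worth, spark.read.parquet is essentially shorthand for spark.read.format("parquet").load, so a fair comparison needs an action forced on each result before timing it; a minimal sketch reusing the path from the question (the SparkSession builder line is only there to keep the snippet self-contained; on Databricks spark already exists):

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
path = "/dbfs/FileStore/path/to/file.parquet"  # path quoted in the question

start = time.time()
df1 = spark.read.parquet(path)
df1.count()  # force a full scan; without an action only file listing / schema work happens
print("spark.read.parquet:", time.time() - start, "seconds")

start = time.time()
df2 = spark.read.format("parquet").load(path)
df2.count()
print("format('parquet').load:", time.time() - start, "seconds")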

Related

How to extract many csv files from a very large csv file using sliding window?

I need to create a script that accepts as input: a very large csv file, a starting time, a window size, a shift time, and a delta time.
Dask Index Structure:
npartitions=49
2022-06-30 19:43:30 datetime64[ns]
2022-07-01 01:46:43 ...
...
2022-07-13 04:17:22 ...
2022-07-13 10:50:46 ...
Name: Timestamp, dtype: datetime64[ns]
Dask Name: sort_index, 196 tasks
The code should create csv files that each span the window size amount of time (for example 20 minutes), starting at the starting time and ending at (starting time + window size) (for example starting at midnight, 0:00 to 0:20 for a 20-minute window).
The file should have entries every delta time. For example, if delta time = 10 seconds, then there should be 10-second increments in the csv file.
Once a csv file spanning the window size is created, a new csv file should be created that starts at (starting time + shift time) and ends a window size later. For example, the first csv file may cover minutes 0-20 and the second csv file minutes 2-22.
Csv files should continue to be created until one of them reaches the ending time.
Note: There is a unique measurement at every single second of the input time series.
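A minimal sketch of that windowing logic, assuming the data has already been pulled into pandas with the Timestamp column as the index (one row per second, as noted above); the function and argument names are hypothetical, and with Dask each .loc slice would need a .compute() before writing:

import pandas as pd

def write_windows(df, start_time, end_time, window, shift, delta, prefix="window"):
    # df: DataFrame indexed by Timestamp, one measurement per second
    start_time, end_time = pd.Timestamp(start_time), pd.Timestamp(end_time)
    window, shift, delta = pd.Timedelta(window), pd.Timedelta(shift), pd.Timedelta(delta)

    i = 0
    win_start = start_time
    while win_start + window <= end_time:
        # slice one window, then keep one entry per delta step
        chunk = df.loc[win_start:win_start + window].resample(delta).first()
        chunk.to_csv(f"{prefix}_{i:04d}.csv")
        win_start += shift
        i += 1

# e.g. write_windows(df, "2022-07-01 00:00", "2022-07-01 06:00", "20min", "2min", "10s")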

Cassandra Data Read Speed Slows Down

I have a problem that I can't understand. I have 3 nodes (RF: 3) in my cluster and my nodes' hardware is pretty good. There are now 60-70 million rows and 3000 columns of data in my cluster, and I want to query a specific set of approximately 265,000 rows and 4 columns. I use the default fetch size, and I can get about 5000 rows of data per second up to roughly 55,000 rows; after that my data retrieval speed drops.
I think the answer lies in the cassandra.yaml file; do you have any idea what I can check?
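Before digging into cassandra.yaml, it may also be worth checking client-side paging; a minimal sketch with the DataStax Python driver, where the contact points, keyspace, table, and column names are placeholders rather than anything from the question:

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["node1", "node2", "node3"])  # placeholder contact points
session = cluster.connect("my_keyspace")        # placeholder keyspace

# Raise the page size above the Python driver's default of 5000 rows per page.
stmt = SimpleStatement(
    "SELECT col1, col2, col3, col4 FROM my_table WHERE pk = %s",
    fetch_size=20000,
)
for row in session.execute(stmt, ["some_partition_key"]):
    pass  # process each row; the driver pages through results transparently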

Minus queries between HDFS and CASSANDRA having 70 million records is taking around 40 mins

My HDFS parquet file and my Cassandra table each have 70 million rows and 16 columns, and 14 of the columns are JSON with lengths of more than 2000 characters.
I am doing source minus target and target minus source, then calculating the count of each resulting data frame for HDFS and Cassandra. All of this takes 40 minutes.
I am running on YARN with 6 TB of space, 20 data nodes and 1640 cores.
Even if I change the number of executors to 100 and the number of cores to 4, the performance does not improve. Please let me know if this is the maximum efficiency we can achieve.
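For reference, the source-minus-target / target-minus-source comparison can be expressed with exceptAll (Spark 2.4+); a minimal sketch, where the path, keyspace, and table names are placeholders and the Spark Cassandra connector is assumed to be on the classpath:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

hdfs_df = spark.read.parquet("/path/to/source.parquet")          # placeholder path
cass_df = (spark.read.format("org.apache.spark.sql.cassandra")
           .options(table="target_table", keyspace="target_ks")  # placeholder names
           .load())

source_minus_target = hdfs_df.exceptAll(cass_df)  # rows in HDFS but not in Cassandra
target_minus_source = cass_df.exceptAll(hdfs_df)  # rows in Cassandra but not in HDFS
print(source_minus_target.count(), target_minus_source.count())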

Excel duration graphing in minutes and seconds only

I am trying to graph time-duration data in Excel using only minutes and seconds, but some of my data is over 60 minutes, e.g. 71 minutes and 32 seconds, and Excel formats that data point as 1 hour, 11 minutes and 32 seconds. I want to keep it in the format of 71 minutes and 32 seconds. Does anyone know how to do this?
Try setting your Number Formatting to [mm]:ss;#? (Or to [m]" minutes and "s" seconds";# if you want the full text blurb)

Pyspark job being stuck at the final task

The flow of my program is something like this:
1. Read 4 billion rows (~700GB) of data from a parquet file into a data frame. Partition size used is 2296
2. Clean it and filter out 2.5 billion rows
3. Transform the remaining 1.5 billion rows using a pipeline model followed by a trained model. The trained model is a logistic regression that predicts 0 or 1, and 30% of the data is filtered out of the transformed data frame.
4. The above data frame is left-outer joined with another dataset of ~1 TB (also read from a parquet file). Partition size is 4000. (A sketch of steps 1-4 appears after this list.)
5. Join it with another dataset of around 100 MB like
joined_data = data1.join(broadcast(small_dataset_100MB), data1.field == small_dataset_100MB.field, "left_outer")
6. The above dataframe is then exploded by a factor of ~2000:
exploded_data = joined_data.withColumn('field', explode('field_list'))
7. An aggregate is performed:
aggregate = exploded_data.groupBy(*cols_to_select)\
    .agg(F.countDistinct(exploded_data.field1).alias('distincts'), F.count("*").alias('count_all'))
There are a total of 10 columns in the cols_to_select list.
8. And finally an action, aggregate.count() is performed.
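A minimal sketch of steps 1-4 above, where every path, column name, and the join key are hypothetical, and "partition size" is read as the number of partitions:

from pyspark.ml import PipelineModel
from pyspark.ml.classification import LogisticRegressionModel
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Read ~4 billion rows into 2296 partitions (placeholder path).
data1 = spark.read.parquet("/path/to/big_input.parquet").repartition(2296)

# 2. Clean and filter (the predicate is purely illustrative).
data1 = data1.filter(F.col("some_column").isNotNull())

# 3. Feature pipeline, then the trained logistic regression; keep the predicted rows.
features = PipelineModel.load("/path/to/pipeline_model").transform(data1)
scored = LogisticRegressionModel.load("/path/to/lr_model").transform(features)
data1 = scored.filter(F.col("prediction") == 1)  # ~30% filtered out per the description

# 4. Left outer join with the ~1 TB dataset at 4000 partitions (placeholder path/key).
data2 = spark.read.parquet("/path/to/one_tb_dataset.parquet").repartition(4000)
data1 = data1.join(data2, on="join_key", how="left_outer")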
The problem is that the third-to-last count stage (200 tasks) gets stuck at task 199 forever. In spite of allocating 4 cores and 56 executors, the count uses only one core and one executor to run the job. I tried cutting the input down from 4 billion rows to 700 million rows (one sixth of it), and it took four hours. I would really appreciate some help with how to speed this process up. Thanks.
The operation was stuck at the final task because of skewed data being joined to a huge dataset. The key joining the two dataframes was heavily skewed. The problem was solved for now by removing the skewed data from the dataframe. If you must include the skewed data, you can use iterative broadcast joins (https://github.com/godatadriven/iterative-broadcast-join). Look at this informative video for more details: https://www.youtube.com/watch?v=6zg7NTw-kTQ
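To confirm which join key values are skewed before deciding between dropping them and an iterative broadcast join, a quick diagnostic sketch (the column name "field" is a placeholder for whichever column joins the two dataframes, and the threshold is purely illustrative):

from pyspark.sql import functions as F

# Count rows per join key to spot heavily skewed values.
key_counts = data1.groupBy("field").count().orderBy(F.desc("count"))
key_counts.show(20, truncate=False)

# Optionally drop the hot keys before the join.
hot_keys = [r["field"] for r in key_counts.filter(F.col("count") > 10_000_000).collect()]
data1_no_skew = data1.filter(~F.col("field").isin(hot_keys))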
