Spark process file in chunks - apache-spark

I would like to process chunks of data (from a CSV file) and then do some analysis within each partition/chunk.
How do I do this and then process the chunks in parallel? I'd like to run map and reduce on each chunk.

I don't think you can read only part of a file. Also, I'm not quite sure whether I understand your intent correctly, or whether you have understood the concept of Spark correctly.
If you read a file and apply a map function to the Dataset/RDD, Spark will automatically apply the function to your data in parallel.
That is, each worker in your cluster is assigned one or more partitions of your data, i.e. it processes "n%" of the data. When a file is read, the partitions follow the input splits; after a key-based shuffle, a partitioner decides which data items end up in the same partition. By default, Spark uses a HashPartitioner for that.
(As an alternative to map, you can apply mapPartitions.)
Here are some thoughts that came to mind:
Partition your data using the partitionBy method (available on key-value RDDs) and create your own partitioner. This partitioner can, for example, put the first n rows into partition 1, the next n rows into partition 2, etc.
If your data is small enough to fit on the driver, you can read the whole file, collect it into an array, skip the desired number of rows (in the first run, no row is skipped), take the next n rows, and then create an RDD from these rows again.
You can preprocess the data, create the partitions somehow (i.e. each containing n% of the data), and then store the result again. This will create different files on your disk/HDFS: part-00000, part-00001, etc. Then, in your actual program, you can read just the desired part file, one after the other...
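For illustration, the "map on each chunk, then reduce" idea can be modelled in plain Python without a cluster. This is only a sketch under stated assumptions: split_into_chunks, summarize_chunk and combine are hypothetical helper names, the chunks stand in for Spark partitions, and a thread pool stands in for the workers. In Spark itself, the per-chunk step would roughly correspond to mapPartitions followed by a reduce.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_chunks(rows, chunk_size):
    """Yield consecutive slices of `rows`, each holding at most `chunk_size` rows."""
    for start in range(0, len(rows), chunk_size):
        yield rows[start:start + chunk_size]


def summarize_chunk(chunk):
    """Per-chunk 'map' step: return (row count, sum of the value column)."""
    return len(chunk), sum(value for _, value in chunk)


def combine(left, right):
    """'Reduce' step: merge two per-chunk summaries into one."""
    return left[0] + right[0], left[1] + right[1]


# Hypothetical parsed CSV rows of the form (id, value).
rows = [(i, i * 10) for i in range(10)]

# Each chunk is summarized independently; pool.map keeps the chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(summarize_chunk, split_into_chunks(rows, 3)))

# Merge the per-chunk results into a single overall result.
total_count, total_sum = reduce(combine, partials)
```

The design point is that summarize_chunk only ever sees one chunk, so the chunks can be processed in any order or on different workers, and only the small per-chunk summaries need to be combined at the end.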

Related

Overused the capacity memory when trying to process the CSV file when using Pyspark and Python

I don't know which part of the code I should share, since what I do is basically the following (I will share a simplified version of the algorithm for reference):
Task: I need to read file A and match its values against a column in file B (file B actually consists of more than 100 CSV files, each containing more than 1 million rows), and then combine the matches into a single CSV.
Extract the column values from file A and turn them into a list of values.
Load file B in PySpark and then use .isin to match against the list of values from file A.
Concatenate the results into a single CSV file.
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"): #here will loop at least 100 times.
    second = spark.read.csv("fileB")
    second = second["columnB"].isin(list_values) # More than hundreds thousands rows will be expected to match.
    combine.append(second)
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a better way to perform such a task? Currently the code runs for more than 30 hours and then fails with the above error. Is there something like parallel programming that could speed up the process, or a way to get rid of the above error?
Also, when I test it with only 2 CSV files, it takes less than a minute to complete, but when I loop over the whole folder with 100 files, it takes more than 30 hours.
There are several things that I think you can try to optimize, assuming your configuration and resources stay unchanged:
Repartition when you read your CSV. I haven't studied the source code for how Spark reads CSV files, but based on my experience and cases on SO, the data can end up in a single partition after the read (for example when the file is not splittable), which might cause a Java OOM error and also means your resources are not fully utilized. Check the partitioning of the data (for example with df.rdd.getNumPartitions()) and make sure there is no data skew before you run any transformation or action.
Rethink how you filter based on another dataframe's column values. In your code, the current approach is to collect the reference values into a Python list and then use .isin() to check whether the main dataframe's column contains a value from this reference list. If the reference list is very long, checking EACH ROW against the whole list is definitely costly. Instead, you can try a left-semi join (.join() with how="leftsemi") to achieve the same goal. If the reference dataset is small and you want to avoid shuffling the data, you can use broadcast to copy the reference dataframe to every node.
If you can do it in Spark SQL, don't do it in pandas. In your last step, you're trying to concatenate all the data after the filtering. You can achieve the same goal with .unionAll() or .unionByName(). Even if you call pd.concat() within the Spark session, all pandas operations run on the driver node and are not distributed. Therefore, this might cause an OOM error and degrade performance too.
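To make the semi-join idea concrete, here is a plain-Python model of it rather than actual PySpark code. The assumptions are loud ones: the in-memory CSV strings, the column names columnB and payload, and the helper semi_join are all hypothetical stand-ins. The sketch builds the reference keys once as a hash set (roughly what a broadcast reference gives each node), keeps only rows whose key is in that set, and appends the matches from every file into one output.

```python
import csv
import io


def semi_join(rows, key_column, reference_keys):
    """Yield only the rows whose key_column value occurs in reference_keys."""
    for row in rows:
        if row[key_column] in reference_keys:
            yield row


# Hypothetical stand-ins: the values from file A and two of the many CSV files.
reference_keys = {"1", "3", "5"}
csv_files = [
    io.StringIO("columnB,payload\n1,a\n2,b\n3,c\n"),
    io.StringIO("columnB,payload\n4,d\n5,e\n"),
]

# Filter each file against the same key set and collect the matches.
matched = []
for handle in csv_files:
    matched.extend(semi_join(csv.DictReader(handle), "columnB", reference_keys))

# Write all matches into a single CSV (an in-memory buffer in this sketch).
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["columnB", "payload"], lineterminator="\n")
writer.writeheader()
writer.writerows(matched)
```

Membership in a set is a hash lookup, so the cost per row does not grow with the number of reference values, which is the property the join-based approach is meant to provide.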

Should we always use rdd.count() instead of rdd.collect().size

rdd.collect().size first moves all the data to the driver; if the dataset is large, this could result in an OutOfMemoryError.
So, should we always use rdd.count() instead?
Or, in other words, in what situation would people prefer rdd.collect().size?
collect causes the data to be processed and then fetched to the driver node.
For count you don't need:
1. Full processing - some columns may not need to be fetched or calculated, e.g. those not included in any filter. You don't need to load, process or transfer the columns that don't affect the count.
2. Fetch to driver node - each worker node can count its own rows, and the counts can be summed up.
I see no reason for calling collect().size.
Just for general knowledge, there is another way to get around #2; however, for this case it is redundant and won't prevent #1: rdd.mapPartitions(p => Iterator(p.size)).reduce(_ + _)
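The difference between the two approaches can be pictured with a small plain-Python model (not Spark code). The partitions list and the helper count_partition are hypothetical: each inner list stands for the rows held by one worker. Counting sends one integer per partition to the driver, whereas collecting sends every row.

```python
# Hypothetical partitions of a dataset, one inner list per worker.
partitions = [["a", "b", "c"], ["d"], [], ["e", "f"]]


def count_partition(partition):
    """Count rows locally without materializing them anywhere else."""
    return sum(1 for _ in partition)


# count-style: each worker reports one number, the driver sums them.
partial_counts = [count_partition(p) for p in partitions]
total = sum(partial_counts)

# collect-style: every row is copied to the driver just to measure the length.
collected = [row for p in partitions for row in p]
```

Both routes arrive at the same number; the difference is that the count-style route transfers len(partitions) integers, while the collect-style route transfers the whole dataset.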
Assuming you're using the Scala size function on the array returned by rdd.collect(), I don't see any advantage in collecting the whole RDD just to get its number of rows.
This is the point of RDDs: to work on chunks of data in parallel to make transformations manageable. Usually the result is smaller than the original dataset because the given data is somehow transformed/filtered/synthesized.
collect usually comes at the end of data processing, and if you run an action you might also want to save the data, since it might have required some expensive computations and the collected data is presumably interesting/valuable.

Get PySpark to output one file per column value (repartition / partitionBy not working)

I've seen many answers and blog posts suggesting that:
df.repartition('category').write.partitionBy('category')
will output one file per category, but this doesn't appear to be true if the number of unique 'category' values in df is less than the number of default partitions (usually 200).
When I use the above code on a file with 100 categories, I end up with 100 folders each containing between 1 and 3 "part" files, rather than having all rows with a given "category" value in the same "part". The answer at https://stackoverflow.com/a/42780452/529618 seems to explain this.
What is the fastest way to get exactly one file per partition value?
Things I've tried
I've seen many claims that
df.repartition(1, 'category').write.partitionBy('category')
df.repartition(2, 'category').write.partitionBy('category')
will create "exactly one file per category" and "exactly two files per category" respectively, but this doesn't appear to be how this parameter works. The documentation makes it clear that the numPartitions argument is the total number of partitions to create, not the number of partitions per column value. Based on that documentation, specifying this argument as 1 should (accidentally) output a single file per partition when the file is written, but presumably only because it removes all parallelism and forces your entire RDD to be shuffled / recalculated on a single node.
required_partitions = df.select('category').distinct().count()
df.repartition(required_partitions, 'category').write.partitionBy('category')
The above seems like a workaround based on the documented behaviour, but one that would be costly for several reasons. For one, the separate count is costly if df is expensive to compute and not cached (and/or so big that it would be wasteful to cache just for this purpose). Also, any repartitioning of a dataframe can cause unnecessary shuffling in a multi-stage workflow that has various dataframe outputs along the way.
The "fastest" way probably depends on the actual hardware set-up and the actual data (in case it is skewed). To my knowledge, df.repartition('category').write.partitionBy('category') will not solve your problem either.
We faced a similar problem in our application, but instead of first doing a count and then the repartition, we separated the writing of the data and the requirement to have only a single file per partition into two different Spark jobs. The first job is optimized to write the data. The second job just iterates over the partitioned folder structure, reads the data per folder/partition, coalesces it into one partition and overwrites the folder with the result. Again, I cannot tell whether that is also the fastest way in your environment, but for us it did the trick.
Some research on this topic led me to the Auto Optimize Writes feature on Databricks for writing to a Delta table. It uses a similar approach: first writing the data and then running a separate OPTIMIZE job to compact the files into a single file. The linked documentation gives this explanation:
"After an individual write, Azure Databricks checks if files can further be compacted, and runs an OPTIMIZE job [...] to further compact files for partitions that have the most number of small files."
As a side note: make sure the configuration spark.sql.files.maxRecordsPerFile stays at 0 (the default value) or is set to a negative number. Otherwise, this configuration alone could lead to multiple files for data with the same value in the column "category".
You can try coalesce(n). coalesce reduces the number of partitions and, unlike repartition, avoids a full shuffle when doing so.
n = the number of output partitions you want.

Does Spark guarantee consistency when reading data from S3?

I have a Spark Job that reads data from S3. I apply some transformations and write 2 datasets back to S3. Each write action is treated as a separate job.
Question: Does Spark guarantee that I read the data in the same order each time? For example, if I apply the function:
.withColumn('id', f.monotonically_increasing_id())
Will the id column have the same values for the same records each time?
You state very little, but the following is easily testable and should serve as a guideline:
If you re-read the same files with the same content, you will get the same blocks / partitions again and the same ids from f.monotonically_increasing_id().
If the total number of rows differs on successive reads, or different partitioning is applied before this function, then you will typically get different ids.
If you have more data the second time round and apply coalesce(1), then the prior entries will still have the same ids and the newer rows will have other ids. A less than realistic scenario, of course.
Blocks for files at rest remain static (in general) on HDFS, so partitions 0..N will be the same each time the data is read from rest. Otherwise zipWithIndex would not be usable either.
I would never rely on the same data being in the same place when read twice unless there were no updates (you could cache as well).
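Why the ids depend on partitioning can be seen from the layout that the Spark documentation describes for monotonically_increasing_id: the partition index goes into the upper bits and the record number within the partition into the lower 33 bits. The following is a plain-Python model of that layout, not Spark code; monotonic_ids and the sample records are hypothetical, and the lists stand in for partitions.

```python
def monotonic_ids(partitions):
    """Model of the documented id layout: the partition index in the upper
    bits, the row position within the partition in the lower 33 bits."""
    return {
        row: (partition_index << 33) + position
        for partition_index, partition in enumerate(partitions)
        for position, row in enumerate(partition)
    }


records = ["a", "b", "c", "d", "e", "f"]

# Reading the same data with the same partition layout twice.
first_read = monotonic_ids([records[:3], records[3:]])
second_read = monotonic_ids([records[:3], records[3:]])

# The same records, but split into three partitions instead of two.
repartitioned = monotonic_ids([records[:2], records[2:4], records[4:]])
```

In this model the two identical reads agree on every id, while the repartitioned read assigns record "c" the id 1 << 33 instead of 2, because "c" is now the first row of the second partition.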

How does saveToCassandra() work?

I want to know what happens when I use rdd.saveToCassandra(): does this function save all elements of the current RDD into the Cassandra table in one go, or does it save element by element, similar to the map function, which processes each element of the RDD and returns a new parsed element?
Thanks
Neither the first option nor the second one. It writes the data after grouping it into batches of a configured size (by default 1024 bytes per batch and 1000 batches per Spark task). If you are interested in the details, it's open source, so check RDDFunctions and TableWriter for a start.
Updated in response to comments: you may split your RDD into multiple RDDs and save each of them using saveToCassandra. RDD splitting is not a standard feature of Spark as of now, so you need a third-party library like Silex. Check the documentation for flatMuxPartitions.
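As a rough picture of what "grouping into batches of a configured size" means, here is a simplified plain-Python model. It is not the connector's implementation: group_into_batches is a hypothetical helper, rows are plain strings measured by their UTF-8 length, and the real grouping rules in TableWriter are more involved. Only the 1024-byte default is taken from the answer above.

```python
def group_into_batches(rows, max_batch_bytes=1024):
    """Group rows into batches whose total encoded size stays within the limit.

    A single row that is larger than the limit ends up in a batch of its own.
    """
    batches, current, current_size = [], [], 0
    for row in rows:
        size = len(row.encode("utf-8"))
        # Close the current batch if adding this row would exceed the limit.
        if current and current_size + size > max_batch_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(row)
        current_size += size
    if current:
        batches.append(current)
    return batches


# Hypothetical rows of varying size.
rows = ["x" * 400, "y" * 400, "z" * 400, "w" * 2000, "v" * 10]
batches = group_into_batches(rows)
```

In this model, neither "everything at once" nor "one element at a time" applies: the five rows are written as four batches, the first of which holds two rows.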
