Spark - Concatenate pairs of files - apache-spark

I wonder if Spark is suitable for the use case below:
I have millions of CSV files that come in pairs. I would like to concatenate each pair of files and output another file (each pair has the same number of rows but different columns, so what I basically want is a join, but the actual operation is not important for this question). So:
a1.csv
a2.csv
b1.csv
b2.csv
...
...
becomes:
a-concat.csv
b-concat.csv
....
I can easily do that in a normal Python script with pandas, for example, but it would take a very long time. Instead I would like to distribute this with Spark. Normally Spark collects files, creates huge dataframes, and operates on them, which is not the case for this specific problem. Any suggestions?
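One pattern that might fit (a hedged sketch, not from the question): treat the list of file pairs itself as the distributed dataset and let each Spark task run plain pandas on one pair. The pairing logic and paths below are illustrative, and the files are assumed to be reachable from every executor (e.g. on shared storage):

import pandas as pd
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("concat-pairs").getOrCreate()

# Illustrative pairing; in practice this list would be built by listing the input directory.
pairs = [
    ("a1.csv", "a2.csv", "a-concat.csv"),
    ("b1.csv", "b2.csv", "b-concat.csv"),
]

def concat_pair(pair):
    left, right, out = pair
    # Each task handles one small pair with plain pandas on the executor.
    merged = pd.concat([pd.read_csv(left), pd.read_csv(right)], axis=1)
    merged.to_csv(out, index=False)

# Distribute the pairs across the cluster; each element is processed independently,
# so no huge dataframe is ever built. With millions of pairs, tune the number of slices.
spark.sparkContext.parallelize(pairs).foreach(concat_pair)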

Related

If Spark reads data from multiple files via glob, then does some mapping, then does take(5), would it read only the first file?

I have multiple large files, and I use a glob that matches them all to read them into a single dataframe. Then I do some mapping, i.e. processing rows independently of each other. For development purposes, I don't want to process all of the data, so I'm thinking of doing a df.take(5). Will Spark be smart enough to realize that it only needs to read the first five rows of the first file? Thanks!
I'm hoping it will only read the first five records, but I don't know if it does.
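For reference, a minimal sketch of the pattern being described (the glob, column name, and mapping are placeholders). In general, take(n) is backed by a limit that Spark evaluates incrementally, scanning the first partition(s) before touching the rest, so for a narrow plan like this it usually does not need to read every file:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.read.csv("data/part-*.csv", header=True)           # glob over many files
processed = df.withColumn("clean", F.trim(F.col("raw_text")))  # row-wise mapping

# Action: Spark tries the first partition(s) first and stops once it has 5 rows.
sample = processed.take(5)

# During development it can be even simpler to point the reader at a single file:
dev_df = spark.read.csv("data/part-00000.csv", header=True)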

Exceeded memory capacity when trying to process CSV files with PySpark and Python

I don't know which part of the code I should share, since what I do is basically as below (I will share a simplified outline of the code for reference):
Task: I need to read file A and then match the values in file A against column values in file B (there are more than 100 CSV files, each containing more than a million rows); after matching, combine the results into a single CSV.
Extract the column values from file A and turn them into a list of values.
Load file B in PySpark, then use .isin to match against the list of values from file A.
Concatenate the results into a single CSV file.
import glob
import pandas as pd
# `spark` is an existing SparkSession

first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance, args=(int,))]["columnA"].values.tolist()

combine = []
for file in glob.glob("directory/*.csv"):  # this loops at least 100 times
    second = spark.read.csv(file, header=True)
    # keep only the rows whose columnB appears in the reference list
    second = second.filter(second["columnB"].isin(list_values))  # hundreds of thousands of rows expected to match
    combine.append(second.toPandas())  # pulls each filtered chunk back to the driver
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a way to perform this task better? Currently it takes more than 30 hours just to run the code, and it ends in failure with the above error. Is there something like parallel programming that could speed up the process, or a way to clear the above error?
Also, when I test it with only 2 CSV files it takes less than a minute to complete, but when I loop over the whole folder with 100 files it takes more than 30 hours.
There are several things I think you can try to optimize, given that your configuration and resources stay unchanged:
Repartition when you read your CSV. I didn't study the source code of how Spark reads CSV, but based on my experience and similar cases on SO, when you use Spark to read a CSV all the data may end up in a single partition, which can cause a Java OOM error and also doesn't fully utilize your resources. Check the partitioning of the data and make sure there is no data skew before you do any transformations or actions.
Rethink how to do the filtering based on another dataframe's column values. From your code, your current approach is to use a Python list to collect and store the reference values, and then use .isin() to check whether the main dataframe's column contains a value from this reference list. If the reference list is very large, having EACH ROW scan the whole reference list is definitely expensive. Instead, you can use a leftsemi .join() to achieve the same goal. Since the reference dataset is small and you want to prevent data shuffling, you can also use broadcast to copy the reference dataframe to every node.
If you can achieve something in Spark SQL, don't do it in pandas. In your last step you're trying to concat all the data after the filtering; you can achieve the same goal with .unionAll() or .unionByName(). Even if you call pd.concat() inside the Spark session, all the pandas operations run on the driver node and are not distributed, so they can cause a Java OOM error and degrade performance too. A sketch of these ideas follows.
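A minimal sketch of these three points (reusing list_values from the snippet above; the column names, header option, and repartition count are assumptions):

import glob
from functools import reduce
from pyspark.sql.functions import broadcast

# Turn the reference values into a small Spark dataframe.
ref = spark.createDataFrame([(v,) for v in list_values], ["columnA"])

filtered = []
for path in glob.glob("directory/*.csv"):
    df = spark.read.csv(path, header=True).repartition(8)   # avoid one giant partition
    # left_semi keeps only the rows of df whose columnB matches a broadcast reference value.
    df = df.join(broadcast(ref), df["columnB"] == ref["columnA"], "left_semi")
    filtered.append(df)

# Combine everything inside Spark instead of pd.concat on the driver.
total = reduce(lambda a, b: a.unionByName(b), filtered)
total.write.csv("matched_output", header=True)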

Create a single CSV per partition with Spark

I have a ~10GB dataframe that should be written as a bunch of CSV files, one per partition.
The CSVs should be partitioned by 3 fields: "system", "date_month" and "customer".
Inside each folder exactly one CSV file should be written, and the data inside the CSV file should be ordered by two other fields: "date_day" and "date_hour".
The filesystem (an S3 bucket) should look like this:
/system=foo/date_month=2022-04/customer=CU000001/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000002/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000003/part-00000-x.c000.csv
/system=foo/date_month=2022-04/customer=CU000004/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000002/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000003/part-00000-x.c000.csv
/system=foo/date_month=2022-05/customer=CU000004/part-00000-x.c000.csv
I know I can easily achieve that using coalesce(1) but that will only use one worker and I'd like to avoid that.
I've tried this strategy:
mydataframe.
repartition($"system", $"date_month", $"customer").
sort("date_day", "date_hour").
write.
partitionBy("system", "date_month", "customer").
option("header", "false").
option("sep", "\t").
format("csv").
save(s"s3://bucket/spool/")
My idea was that each worker would get a different partition, so it could easily sort the data and write a single file in the partition path. After running the code I noticed I had many CSV files for each partition, something like this:
/system=foo/date_month=2022-05/customer=CU000001/part-00000-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00001-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00002-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00003-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00004-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00005-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00006-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
/system=foo/date_month=2022-05/customer=CU000001/part-00007-df027d9e-3d57-492b-b97a-daa5e80fdc93.c000.csv
[...]
The data in each file is ordered as expected, and concatenating all the files would produce the correct file, but that takes too much time and I'd prefer to rely on Spark.
Is there a way to create a single ordered CSV file per partition, without moving all the data to a single worker with coalesce(1)?
I'm using Scala, if that matters.
sort() (and also orderBy()) triggers a shuffle because it sorts the whole dataframe; to sort within each partition you should use the aptly named sortWithinPartitions:
mydataframe.
repartition($"system", $"date_month", $"customer").
sortWithinPartitions("date_day", "date_hour").
write.
partitionBy("system", "date_month", "customer").
option("header", "false").
option("sep", "\t").
format("csv").
save(s"s3://bucket/spool/")

Spark data manipulation with wholeTextFiles

I have 20k compressed files of ~2 MB each to manipulate in Spark. My initial idea was to use wholeTextFiles() so that I get filename -> content tuples. This is useful because I need to maintain this kind of pairing (the processing is done on a per-file basis, with each file representing a minute of gathered data). However, whenever I need to map/filter/etc. the data while maintaining this filename -> content association, the code gets ugly (and perhaps inefficient?), i.e.
Data.map(lambda pair: (pair[0], changeSomehow(pair[1])))  # tuple-unpacking lambdas are Python 2 only
It would be nice to read the data itself, i.e. the content of each file, as a separate RDD, because each file contains tens of thousands of lines of data; however, one cannot have an RDD of RDDs (as far as I know).
Is there any way to ease the process? Any workaround that would basically allow me to use the content of each file as an RDD, hence letting me do rdd.map(lambda x: change(x)) without the ugly bookkeeping of filenames (and the use of list comprehensions instead of transformations)?
The goal of course is to also maintain the distributed approach and to not inhibit it in any way.
The last step of the processing will be to gather together everything through a reduce.
More background: I am trying to identify (near) ship collisions on a per-minute basis, then plot their paths.
If you have normal map functions (o1 -> o2), you can use the mapValues function. There is also a flatMap-style (o1 -> Collection()) counterpart: flatMapValues.
It will keep the key (in your case, the file name) and change only the values.
For example:
rdd = sc.wholeTextFiles(...)
# RDD of, e.g., one pair: /test/file.txt -> Apache Spark
rddMapped = rdd.mapValues(lambda x: veryImportantDataOf(x))
# result: one pair: /test/file.txt -> Spark
Using reduceByKey you can then reduce the results.
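For example, a hedged sketch of how flatMapValues and reduceByKey could fit together here (the path and the per-line processing are placeholders):

rdd = sc.wholeTextFiles("data/minutes/")

# One record per line, still keyed by the file name.
lines = rdd.flatMapValues(lambda content: content.splitlines())

# Placeholder per-line processing; replace with the real per-minute logic.
events = lines.mapValues(lambda line: 1)

# Aggregate back to one value per file (i.e. per minute of data).
per_file = events.reduceByKey(lambda a, b: a + b)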

Splitting a huge dataframe into smaller dataframes and writing to files using Spark (Python)

I am loading a 5 GB compressed file into memory (on AWS), creating a dataframe (in Spark), and trying to split it into smaller dataframes based on 2 column values. Eventually I want to write all these subsets into their respective files.
I just started experimenting with Spark and am still getting used to the data structures. The approach I was trying to follow was something like this:
read the file
sort it by the 2 columns (still not familiar with repartitioning and do not know if it will help)
identify unique list of all values of those 2 columns
iterate through this list
-- create smaller dataframes by filtering using the values in the list
-- writing to files
df = df.sort("DEVICE_TYPE", "PARTNER_POS")  # sort returns a new dataframe, so keep the result
df.registerTempTable("temp")
grp_col = sqlContext.sql("SELECT DEVICE_TYPE, PARTNER_POS FROM temp GROUP BY DEVICE_TYPE, PARTNER_POS")
print(grp_col)
I do not believe this is the cleanest or most efficient way of doing this. I need to write the output to files because there are ETLs that get kicked off in parallel based on it. Any recommendations?
If it's okay for the subsets to be nested in a directory hierarchy, then you should consider using Spark's built-in partitioning:
(df.write
   .partitionBy("device_type", "partner_pos")
   .json("/path/to/root/output/dir"))
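If each subset additionally needs to end up as a single file (for example so that each downstream ETL reads exactly one input), a hedged variation in the spirit of the sortWithinPartitions answer above is to repartition by the same columns before writing, which typically yields one file per output directory:

(df.repartition("device_type", "partner_pos")      # all rows of a subset land in one task
   .write
   .partitionBy("device_type", "partner_pos")      # one directory per (device_type, partner_pos)
   .json("/path/to/root/output/dir"))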
