Spark transformations and ordering - apache-spark

I am working on parsing different types of files (text, XML, CSV, etc.) into a specific text file format using the Spark Java API. This output file maintains the order of file header, start tag, data header, data, and end tag. All of these elements are extracted from the input file at some point.
I tried to achieve this in the following two ways:
Read the file into an RDD using Spark's textFile and perform parsing using map or mapPartitions, which returns a new RDD.
Read the file using Spark's textFile, reduce it to one partition using coalesce, and perform parsing using mapPartitions, which returns a new RDD.
While I am not concerned about the ordering of the actual data, with the first approach I am not able to keep the required order of file header, start tag, data header, and end tag.
The latter works for me, but I know it is not an efficient way and may cause problems with big files.
Is there any efficient way to achieve this?

You are correct in your assumptions. The second option simply cancels the distributed aspect of your application, so it's not scalable. As for the ordering issue: because execution is asynchronous, we cannot keep track of order when the data resides on different nodes. What you could do is some preprocessing that would cancel the need for order. That is, merge lines up to the point where line order does not matter, and only then distribute your file. Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above; see the sketch below.
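A minimal sketch of that preprocessing idea (in PySpark for brevity, though the question uses the Java API), assuming each input file fits comfortably in one record: wholeTextFiles keeps an entire file together, so the header/tag order survives, and only the order-insensitive data lines are distributed. The file layout and the parse step are hypothetical placeholders.

# one (path, content) record per file, so the original line order is preserved
whole = sc.wholeTextFiles("input/")  # assumed input path

def split_sections(path_and_content):
    path, content = path_and_content
    lines = content.splitlines()
    # hypothetical layout: file header, start tag, data header, data..., end tag
    framing = (lines[0], lines[1], lines[2], lines[-1])
    data_lines = lines[3:-1]
    return framing, data_lines

framing, data_lines = split_sections(whole.first())
# only the data lines, whose order does not matter, get distributed for parsing
parsed = sc.parallelize(data_lines).map(lambda line: line.strip())  # placeholder parse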

Related

Loop through RDD elements, read its content for further processing

I have a folder that contains n files.
I am creating an RDD that contains all the filenames of above folder with the code below:
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a filepath, so the content needs to be read through the SparkContext)
The above content should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written whose steps I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first two steps, and I just get invalid syntax every time.
I know I am not supposed to use map() since I want to read a file in each iteration, which requires sc, but map is executed inside the worker nodes where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'll have the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it, but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct() to break that single partition into N smaller ones and avoid needing a large heap on the driver. (Of course, this is only relevant if your list of input files is REALLY long.)
The method of listing the file names itself seems less than efficient. Perhaps you'd want to consider something more direct, using the FileSystem API, for example as in this article.
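For instance, a sketch of a more direct listing through the Hadoop FileSystem API from PySpark (the root path is an assumption; this runs on the driver via the JVM gateway):

hadoop_fs = spark._jvm.org.apache.hadoop.fs
root = hadoop_fs.Path("hdfs:///data/input/")  # hypothetical location
fs = root.getFileSystem(spark._jsc.hadoopConfiguration())
fnames = [status.getPath().toString()
          for status in fs.listStatus(root)
          if status.isFile()]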
I believe you're looking for a recursive file lookup:
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the directory and pick up all the files that sit under the root and its child folders. This will read the files into a single dataframe.
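If the per-file grouping still matters after everything lands in one dataframe, a small addition (a sketch; the column name is my choice) records each row's source file:

from pyspark.sql.functions import input_file_name

df = (spark.read.option("recursiveFileLookup", "true")
      .text(filepathroot)
      .withColumn("source_file", input_file_name()))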

If Spark reads data from multiple files via glob, then does some mapping, then does take(5), would it read only the first file?

I have multiple large files, and I use a glob that matches them all to read them into a single dataframe. Then I do some mapping, i.e. processing rows independently of each other. For development purposes, I don't want to process the whole data set, so I'm thinking of doing a df.take(5). Will Spark be smart enough to realize that it only needs to read the first five rows of the first file? Thanks!
I'm hoping it will only read the first five records, but I don't know if it does.
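For concreteness, the pattern being asked about looks roughly like this (the path and the mapping are hypothetical):

df = spark.read.text("data/part-*.log")          # glob matching several large files
mapped = df.selectExpr("upper(value) AS value")  # row-wise mapping, no shuffle
sample = mapped.take(5)                          # does this scan only the first file?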

Exceeded memory capacity when trying to process CSV files using Pyspark and Python

I don't know which part of the code I should share, since what I do is basically as below (I will share a simple code outline instead, for reference):
Task: I need to search for file A and then match the values in file A with column values in file B (there are more than 100 CSV files, each containing more than 1 million rows); after matching, the results are combined into a single CSV.
Extract the column values from file A and turn them into a list of values.
Load file B in PySpark and use .isin to match against file A's list of values.
Concatenate the results into a single CSV file.
"""
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"): #here will loop at least 100 times.
second = spark.read.csv("fileB")
second = second["columnB"].isin(list_values) # More than hundreds thousands rows will be expected to match.
combine.append(second)
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a better way to perform this task? Currently it takes more than 30 hours just to run the code, and it ends in failure with the above error. Is there something like parallel programming, or anything else that could speed up the process or clear the above error?
Also, when I tested it with only 2 CSV files, it took less than a minute to complete, but when I try to loop over the whole folder with 100 files, it takes more than 30 hours.
There are several things that I think you can try to optimize, given that your configuration and resources are unchanged:
Repartition when you read your CSV. I didn't study the source code of how Spark reads CSV, but based on my experience and cases on SO, when you use Spark to read a CSV, all the data may end up in a single partition, which might cause a Java OOM error and also doesn't fully utilize your resources. Try to check the partitioning of the data and make sure there is no data skew before you do any transformations or actions.
Rethink how to do the filtering based on another dataframe's column values. From your code, your current approach is to use a Python list to collect and store the reference values, and then use .isin() to check whether the main dataframe column contains a value from this reference list. If your reference list is very large, the cost of searching the whole reference list for EACH ROW is definitely high. Instead, you can use a leftsemi .join() to achieve the same goal. Since the reference dataset is small and you want to prevent data shuffling, you can use a broadcast to copy your reference dataframe to every single node (see the sketch after this list).
If you can achieve it in Spark SQL, don't do it with pandas. In your last step, you're trying to concat all the data after the filtering. In fact, you can achieve the same goal with .unionAll() or .unionByName(). Even if you run pd.concat() inside the Spark session, all the pandas operations will be done on the driver node and not distributed. Therefore, it might cause a Java OOM error and degrade performance too.
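A rough sketch combining those three suggestions (the glob pattern, column names, and repartition factor are assumptions; list_values is the reference list from the question):

import glob
from functools import reduce
from pyspark.sql import DataFrame, functions as F

# turn the Python reference list into a small Spark dataframe
ref = spark.createDataFrame([(v,) for v in list_values], ["columnA"])

filtered = []
for path in glob.glob("directory/*.csv"):
    big = spark.read.csv(path, header=True).repartition(8)  # avoid one giant read partition
    # broadcast leftsemi join instead of .isin() over a huge Python list
    matched = big.join(F.broadcast(ref), big["columnB"] == ref["columnA"], "leftsemi")
    filtered.append(matched)

total = reduce(DataFrame.unionByName, filtered)  # union in Spark, not pd.concat on the driver
total.write.mode("overwrite").csv("output/")  # assumed output path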

Spark tagging file names for purpose of possible later deletion/rollback?

I am using Spark 2.4 in AWS EMR.
I am using Pyspark and SparkSQL for my ELT/ETL and using DataFrames with Parquet input and output on AWS S3.
As of Spark 2.4, as far as I know, there is no way to tag or customize the file names of output files (Parquet). Please correct me if I'm wrong.
When I store parquet output files on S3 I end up with file names which look like this:
part-43130-4fb6c57e-d43b-42bd-afe5-3970b3ae941c.c000.snappy.parquet
The middle part of the file name looks like an embedded GUID/UUID: 4fb6c57e-d43b-42bd-afe5-3970b3ae941c
I would like to know if I can obtain this GUID/UUID value from PySpark or a SparkSQL function at run time, to log/save/display it in a text file.
I need to log this GUID/UUID value because I may need to remove the files with this value in their names later, for manual rollback purposes (for example, I may discover a day or a week later that this data is somehow corrupt and needs to be deleted, so all files tagged with the GUID/UUID can be identified and removed).
I know that I can partition the table manually on a GUID column, but then I end up with too many partitions, which hurts performance. What I need is to somehow tag the files for each data load job, so I can identify and delete them easily from S3; hence the GUID/UUID value seems like one possible solution.
Open for any other suggestions.
Thank you
Is this with the new "S3A-specific committer"? If so, it means it's using Netflix's code/trick of putting a GUID on each file written so as to avoid eventual-consistency problems. That doesn't help much here, though.
Consider offering a patch to Spark which lets you add a specific prefix to a file name.
Or, for Apache Hadoop & Spark (i.e. not EMR), an option for the S3A committers to put that prefix in when they generate temporary filenames.
Short term: you can always list the before-and-after state of the directory tree (tip: use FileSystem.listFiles(path, recursive) for speed), and either remember the new files or rename them (which will be slow; remembering the new filenames is better). A sketch of that listing diff follows.
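A PySpark sketch of that before/after diff through the JVM gateway (the output path is an assumption):

jvm = spark._jvm
out = jvm.org.apache.hadoop.fs.Path("s3a://my-bucket/output/")  # hypothetical
fs = out.getFileSystem(spark._jsc.hadoopConfiguration())

def list_files(fs, path):
    it = fs.listFiles(path, True)  # recursive, faster than walking directories
    names = set()
    while it.hasNext():
        names.add(it.next().getPath().toString())
    return names

before = list_files(fs, out)
# ... run the Spark write job here ...
after = list_files(fs, out)
new_files = after - before  # files written by this job; log these for rollback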
Spark already writes files with a UUID in their names. Instead of creating too many partitions, you can set up custom file naming (e.g. add some id). Maybe this is a solution for you - https://stackoverflow.com/a/43377574/1251549
Not tried yet (but planning to) - https://github.com/awslabs/amazon-s3-tagging-spark-util
In theory, you can tag each file with the job id (or whatever) and then run something to find and delete them.
Both solutions lead to performing multiple S3 "list objects" API requests, checking tags/filenames, and deleting the files one by one.
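For the filename route, a sketch of such a cleanup with boto3 (the bucket, prefix, and logged UUID are assumptions):

import boto3

s3 = boto3.client("s3")
job_uuid = "4fb6c57e-d43b-42bd-afe5-3970b3ae941c"  # the value logged at load time

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket="my-bucket", Prefix="table/"):
    for obj in page.get("Contents", []):
        if job_uuid in obj["Key"]:
            s3.delete_object(Bucket="my-bucket", Key=obj["Key"])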

Efficiently write one output file when partitioning by column

I have a large data set, df, made up of events. I want to write it out, partitioned by year/month/day/hour, and have each resulting partition contain only one file.
Here's a code snippet:
df.write.partitionBy("event_year", "event_month", "event_day", "event_hour").
mode(SaveMode.Overwrite).
parquet(s"${output_data_root}/tmp/")
What's unclear is what to do with df prior to this operation to get one file out, as it's unclear how repartition(COL) and coalesce interact. I.e., what happens when I do:
df.repartition(col("year"), col("month"), col("day"), col("event_hour")).coalesce(1)
(or vice versa)
It wouldn't work to just coalesce(1) (the data set is far too large), but from what I can tell, repartition(COL) will not necessarily result in one partition per distinct column combination.
It's still unclear to me exactly what is going on under the hood, but it turns out you just:
df.repartition(1, col("year"), col("month"), col("day"), col("event_hour"))
Anecdotally, this is WAY faster than repartition(...).coalesce(...), and particularly when using S3, it is definitely important to keep your file count minimal.
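For reference, a PySpark sketch of the same pattern (column names and the output path follow the question; the single shuffle partition means one task writes exactly one file into each year/month/day/hour directory, at the cost of write parallelism):

from pyspark.sql import functions as F

(df.repartition(1, F.col("event_year"), F.col("event_month"),
                F.col("event_day"), F.col("event_hour"))
   .write
   .partitionBy("event_year", "event_month", "event_day", "event_hour")
   .mode("overwrite")
   .parquet(f"{output_data_root}/tmp/"))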
