How to move 1000 files to RDD's? - apache-spark

I am new to Apache Spark and need help.
I have a Python script that reads 6 TDMS files (a TDMS() function) and builds a graph from the numerical data of each of them (a GRAPH() function). I do it with a for loop.
But I want to load 1000 such files and run this script in parallel for each one. So I want to create RDDs with these files and apply my function to each file.
How can I do it?
Can I define the number of nodes in Spark?

Have you tried making a Python list that includes all the files you need to read, and running through it in a for loop to read data from each file, create an RDD, run the graph function, and, I guess, save the result?
Or make the file list itself an RDD, and run those with map and a lambda (for the graph), or foreach.
If you only care about running in parallel, you can just keep loading data, make one big RDD, and call sc.parallelize. You can either let Spark decide the number of partitions, or specify the number you want by calling sc.parallelize(data, numSlices).
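A minimal sketch of the file-list-as-RDD idea, assuming TDMS() and GRAPH() are the functions from your script (stubbed out here) and that every worker node can reach the files by path:
from pyspark import SparkContext

def TDMS(path):
    # placeholder for the question's TDMS() reader; should return the numerical data
    ...

def GRAPH(data):
    # placeholder for the question's GRAPH() graph-building function
    ...

sc = SparkContext.getOrCreate()

# Hypothetical list of the 1000 file paths; build it however suits your layout.
file_paths = ["/data/tdms/file_{:04d}.tdms".format(i) for i in range(1000)]

# Distribute the file list; the optional second argument sets the number of partitions,
# which controls the degree of parallelism (partitions, not cluster nodes).
results = sc.parallelize(file_paths, 8).map(lambda p: GRAPH(TDMS(p))).collect()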

Related

Loop through RDD elements, read its content for further processing

I have a folder that contains n files.
I am creating an RDD that contains all the filenames in the above folder with the code below:
from pyspark.sql.functions import input_file_name
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and process the following steps:
Read the content of each element (each element is a filepath, so the content needs to be read through SparkContext)
The above content should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written with steps that I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first 2 steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since reading a file in each iteration requires sc, and map is executed on worker nodes where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'd have the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it, but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct(), to break that single partition into N smaller ones and avoid the need for a large heap on the driver. (Of course, this is only relevant if your list of input files is REALLY long.)
The method used to list the file names itself seems less than efficient. Perhaps you'd want to consider something more direct, using the FileSystem API, for example as in this article.
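For illustration only, here's a rough sketch of listing file names through the Hadoop FileSystem API from PySpark; the py4j gateway attributes and the folder path are assumptions, not part of the linked article:
# List file names directly via the Hadoop FileSystem API instead of reading
# every file with spark.read.text just to extract input_file_name().
jvm = spark.sparkContext._jvm
jsc = spark.sparkContext._jsc
fs = jvm.org.apache.hadoop.fs.FileSystem.get(jsc.hadoopConfiguration())
folder = jvm.org.apache.hadoop.fs.Path("/path/to/folder")  # hypothetical path
file_names = [status.getPath().toString() for status in fs.listStatus(folder)]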
I believe you're looking for recursive file lookup,
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the directory and pick up all the files under the root and its child folders. This will read the files into a single DataFrame.

Overused the capacity memory when trying to process the CSV file when using Pyspark and Python

I don't know which part of the code I should share, since what I do is basically as below (I will share a simple code algorithm instead for reference):
Task: I need to search for file A and then match the values in file A with column values in file B (there are more than 100 CSV files, each containing more than 1 million rows), then, after matching, combine the results into a single CSV.
Extract column values from file A and turn them into a list of values.
Load file B in PySpark and then use .isin to match against the file A list of values.
Concatenate the results into a single CSV file.
"""
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"): #here will loop at least 100 times.
second = spark.read.csv("fileB")
second = second["columnB"].isin(list_values) # More than hundreds thousands rows will be expected to match.
combine.append(second)
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a better way to perform this task? Currently it takes more than 30 hours just to run the code, and it ends in failure with the error above. Is there something like parallel programming that could speed up the process, or a way to clear the above error?
Also, when I tested it with only 2 CSV files, it took less than a minute to complete, but when I loop over the whole folder with 100 files, it takes more than 30 hours.
There are several things I think you can try to optimize, given that your configuration and resources are unchanged:
Repartition when you read your CSV. I didn't study the source code of how Spark reads the CSV, but based on my experience / cases on SO, when you use Spark to read a CSV, all the data will be in a single partition, which might cause a Java OOM error and also doesn't fully utilize your resources. Try to check the partitioning of the data and make sure that there is no data skew before you do any transformation or action.
Rethink how to do the filtering based on another dataframe's column values. From your code, your current approach is to use a Python list to collect and store the reference, and then use .isin() to check whether the main dataframe column contains a value that is in this reference list. If your reference list is very long, the search that EACH ROW does through the whole reference list is definitely a high cost. Instead, you can try to use a left-semi .join() operation to achieve the same goal. If the reference dataset is small and you want to prevent data shuffling, you can use broadcast to copy your reference dataframe to every single node.
If you can achieve it in Spark SQL, don't do it with pandas. In your last step, you're trying to concat all the data after the filtering. In fact, you can achieve the same goal with .unionAll() or .unionByName(). Even if you do the pd.concat() in the Spark session, all the pandas operations will be done on the driver node, not distributed. Therefore, it might cause a Java OOM error and degrade the performance too. A rough sketch of these suggestions follows this list.
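For illustration, a rough sketch of the second and third points above, assuming the file names and column names from the question (fileA.xlsx, columnA, columnB, a directory of CSVs with headers) and simplifying the integer filter on columnA:
import glob
from functools import reduce
import pandas as pd
from pyspark.sql.functions import broadcast

# Small reference dataframe built from file A (renamed so the join key matches columnB).
ref_pdf = pd.read_excel("fileA.xlsx")[["columnA"]].dropna()
ref_df = spark.createDataFrame(ref_pdf).withColumnRenamed("columnA", "columnB")

parts = []
for path in glob.glob("directory/*.csv"):  # hypothetical glob pattern
    part = spark.read.csv(path, header=True)
    # left-semi join keeps only rows whose columnB exists in the reference;
    # broadcast() ships the small reference to every executor and avoids a shuffle.
    parts.append(part.join(broadcast(ref_df), on="columnB", how="leftsemi"))

# Spark-side union instead of pd.concat, so nothing is collected to the driver here.
result = reduce(lambda a, b: a.unionByName(b), parts)
result.write.csv("matched_output", header=True, mode="overwrite")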

How can I append to same file in HDFS(spark 2.11)

I am trying to store stream data into HDFS using Spark Streaming, but it keeps creating new files instead of appending to one single file or a few files.
If it keeps creating n files, I feel it won't be very efficient.
HDFS FILE SYSTEM
Code
lines.foreachRDD(f => {
  if (!f.isEmpty()) {
    val df = f.toDF().coalesce(1)
    df.write.mode(SaveMode.Append).json("hdfs://localhost:9000/MT9")
  }
})
In my pom I am using the respective dependencies:
spark-core_2.11
spark-sql_2.11
spark-streaming_2.11
spark-streaming-kafka-0-10_2.11
As you already realized, Append in Spark means write-to-existing-directory, not append-to-file.
This is intentional and desired behavior (think about what would happen if the process failed in the middle of "appending", even if the format and file system allowed it).
Operations like merging files should be applied by a separate process, if necessary at all, which ensures correctness and fault tolerance. Unfortunately this requires a full copy, which, for obvious reasons, is not desirable on a batch-to-batch basis.
It's creating a file for each RDD because you are reinitialising the DataFrame variable every time. I would suggest declaring a DataFrame variable outside the loop, initialised to null, and unioning each RDD's DataFrame with that outer DataFrame inside the loop. After the loop, write using the outer DataFrame.

Spark transformations and ordering

I am working on parsing different types of files (text, xml, csv, etc.) into a specific text file format using the Spark Java API. This output file maintains the order of file header, start tag, data header, data and end tag. All of these elements are extracted from the input file at some point.
I tried to achieve this in the 2 ways below:
Read the file into an RDD using Spark's textFile and perform parsing using map or mapPartitions, which returns a new RDD.
Read the file using Spark's textFile, reduce it to 1 partition using coalesce, and perform parsing using mapPartitions, which returns a new RDD.
While I am not concerned about the sequencing of the actual data, with the first approach I am not able to keep the required order of file header, start tag, data header and end tag.
The latter works for me, but I know it is not an efficient way and may cause problems in the case of BIG files.
Is there any efficient way to achieve this?
You are correct in your assumptions. The second choice simply cancels the distributed aspect of your application, so it's not scalable. As for the order issue: since the processing is asynchronous, we cannot keep track of order once the data reside on different nodes. What you could do is some preprocessing that removes the need for order, meaning: merge lines up to the point where the line order no longer matters, and only then distribute your file. Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above.
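For reference, a rough PySpark sketch of the second approach from the question (the one that preserves order but, as noted above, gives up parallelism); the parser and paths are placeholders:
def parse_partition(lines):
    # hypothetical parser: would emit file header, start tag, data header,
    # data rows and end tag in their original order
    for line in lines:
        yield line  # replace with real parsing logic

parsed = (sc.textFile("input.txt")   # placeholder input path
            .coalesce(1)             # single partition, so iteration follows file order
            .mapPartitions(parse_partition))
parsed.saveAsTextFile("output_dir")  # placeholder output path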

How to properly load millions of files into an RDD

I have a very large set of json files (>1 million files) that I would like to work on with Spark.
But, I've never tried loading this much data into an RDD before, so I actually don't know if it can be done, or rather if it even should be done.
What is the correct pattern for dealing with this amount of data within RDD(s) in Spark?
The easiest way would be to create a directory, copy all the files into it, and pass the directory as the path when reading the data; see the sketch below.
If you try to use patterns in the directory path, Spark might run into out-of-memory issues.
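A minimal sketch of that suggestion, with a placeholder directory path:
# Pass the directory itself (no glob pattern) when reading.
df = spark.read.json("/data/all_json_files")
# Or, if you specifically want an RDD of raw lines instead of a DataFrame:
rdd = spark.sparkContext.textFile("/data/all_json_files")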
