I need to count the term frequency of each word per document, so I want to implement the map and reduce functions per text file. How do I implement map() and reduce() so they work per text file?
Another problem in MapReduce:
MapReduce writes the reduce output to a single file, /user/output/part-0000, but the project needs the processed output of each file written to a different text file. How can I do that?
Follow the steps below:
In the job driver, compute the number of input files
Set the number of reducers equal to the number of input files
Assign the numbers 0 to n-1 to the files and pass this mapping through the distributed cache
In the mapper's setup() method, get the input file name, look up the number assigned to that file, and store it in a static variable
Return this static variable from a custom Partitioner
The reducers will then emit n files, one per input file.
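For the first part of the question (per-document term frequency), here is a minimal Hadoop Streaming sketch in Python; it covers only the mapper/reducer half, not the Partitioner recipe above, and it assumes Streaming exposes the current split's file name as an environment variable (mapreduce_map_input_file, or map_input_file on older versions, with dots replaced by underscores):

# mapper.py -- prefix each word with its source file so counts stay per document
import os
import sys

# assumption: Hadoop Streaming exports the input-split file name like this
fname = os.environ.get("mapreduce_map_input_file") or os.environ.get("map_input_file", "unknown")

for line in sys.stdin:
    for word in line.split():
        # key is "filename|word" ('|' is just an arbitrary separator), value is a count of 1
        print("%s|%s\t1" % (fname, word))

# reducer.py -- sum the counts for each "filename|word" key (input arrives sorted by key)
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current:
        if current is not None:
            print("%s\t%d" % (current, total))
        current, total = key, 0
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))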
I have a folder that contains n number of files.
I am creating an RDD that contains all the filenames in the above folder with the code below:
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a filepath, so I need to read the content through SparkContext)
The above content should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written whose steps I've tested on a single file, and it works fine
But I've tried various things syntactically to do the first 2 steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since I want to read a file in each iteration, which requires sc, but map() is executed inside the worker nodes, where sc can't be referenced.
Also, I know I can use wholeTextFiles() as an alternative, but that means I'll have the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data into driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct() to break that single partition into N smaller ones and avoid needing a large heap on the driver; a one-line sketch follows these notes. (Of course, this is only relevant if your list of input files is REALLY long.)
The method of listing the file names itself seems less than efficient. Perhaps you'd want to consider something more direct, using the FileSystem API for example, as in this article.
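A one-line tweak to the original snippet showing where the repartitioning mentioned above would go (N = 8 is just a placeholder to tune):

fnameRDD = (spark.read.text(filepath)
            .select(input_file_name())
            .distinct()
            .repartition(8)   # split the single post-distinct partition before toLocalIterator()
            .rdd)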
I believe you're looking for recursive file lookup,
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the root and child folders and pick up all the files underneath, reading them into a single dataframe.
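If you still need to know which rows came from which file after everything lands in one dataframe, one option (a sketch, not part of the answer above) is to tag each row with input_file_name():

from pyspark.sql.functions import input_file_name

df = (spark.read
      .option("recursiveFileLookup", "true")
      .text(filepathroot)
      .withColumn("source_file", input_file_name()))
# per-file logic can then group by or filter on source_file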
I don't know which part of the code I should share, since what I do is basically as below (I'll share a simplified version of the code for reference):
Task: I need to read file A and then match the values in file A against column values in file B (file B is really more than 100 CSV files, each containing more than 1 million rows); after matching, combine the results into a single CSV.
Extract the column values from file A and turn them into a list of values.
Load file B in PySpark and use .isin() to match against file A's list of values.
Concatenate the results into a single CSV file.
"""
first = pd.read_excel("fileA.xlsx")
list_values = first[first["columnA"].apply(isinstance,args=(int,))]["columnA"].values.tolist()
combine = []
for file in glob.glob("directory/"): #here will loop at least 100 times.
second = spark.read.csv("fileB")
second = second["columnB"].isin(list_values) # More than hundreds thousands rows will be expected to match.
combine.append(second)
total = pd.concat(combine)
Error after 30 hours of running time:
UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
Is there a way to perform this task better? Currently it takes more than 30 hours just to run the code, and it ends in failure with the above error. Is there something like parallel programming that could speed up the process, or clear up the above error?
Also, when I test it with only 2 CSV files it takes less than a minute to complete, but when I loop over the whole folder of 100 files it takes more than 30 hours.
There are several things I think you can try to optimize, given that your configuration and resources stay unchanged:
Repartition when you read your CSV. I haven't studied the source code of how Spark reads CSV, but based on my experience and similar cases on SO, when you use Spark to read a CSV, all the data may land in a single partition, which can cause a Java OOM error and also leaves your resources underutilized. Check the partitioning of the data and make sure there is no data skew before you do any transformation or action.
Rethink how to do the filtering based on another dataframe's column values. Your current approach is to collect the reference values into a Python list and then use .isin() to check whether the main dataframe's column contains a value from that list. If the reference list is very large, searching the whole list for EACH ROW is definitely a high cost. Instead, you can use a left-semi .join() to achieve the same goal. If the reference dataset is small and you want to prevent data shuffling, you can broadcast the reference dataframe to every node.
If you can achieve it in Spark SQL, don't do it in pandas. In your last step, you're trying to concat all the data after the filtering; you can achieve the same goal with .unionAll() or .unionByName(). Even if you call pd.concat() inside the Spark session, all the pandas work is done on the driver node and is not distributed, so it can cause a Java OOM error and degrade performance too. A sketch putting these three changes together follows.
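A rough sketch of those three suggestions combined (column names, file paths, and the CSV read options are assumptions carried over from the question, not a tested implementation):

from functools import reduce
import glob

import pandas as pd
from pyspark.sql import functions as F

# reference values from file A, kept as a small Spark dataframe instead of a Python list
first = pd.read_excel("fileA.xlsx")
ref = spark.createDataFrame(first[first["columnA"].apply(isinstance, args=(int,))][["columnA"]])

parts = []
for path in glob.glob("directory/*.csv"):   # assumed file pattern
    df = spark.read.csv(path)               # consider repartitioning here if a single file is huge
    # broadcast left-semi join keeps only rows whose columnB appears in ref.columnA,
    # without shuffling the large dataframe
    matched = df.join(F.broadcast(ref), df["columnB"] == ref["columnA"], "left_semi")
    parts.append(matched)

# the union stays distributed, unlike pd.concat on the driver
total = reduce(lambda a, b: a.unionByName(b), parts)
# coalesce(1) only if you really need one physical CSV file
total.coalesce(1).write.csv("output_dir", mode="overwrite")   # assumed output path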
I am using Flatbuffers as a way to store data and the meta-tags with the data. I am using Python and in order to simulate dictionaries, I have two table structures: One for dictionary entries and one to hold a vector of entries. Here is an example of the schema:
// Define dictionary structure
table tokenEntry{
key:string;
value:int;
}
table TokenDict{
Entries:[tokenEntry];
}
root_type TokenDict;
I wish to write two dictionaries to a single file using Flatbuffers. I want to also read the dictionaries one at a time from the file, and not load both into memory at the same time. I am able to write both to file, one at a time. However, when I read from the file, I get both of the structures at once. The buffer holds all the data from the file. This is not what I want, because later I will have a much larger amount of data in the files. Is there a way to read in just one at a time?
As an example, if I were to use pickle, I could write multiple pickles to a file and read them back one at a time. I wish to do the same with FlatBuffers.
Thank you.
Best to write a file as a sequence of individual FlatBuffers, each prefixed with a size. You can do that in Python using FinishSizePrefixed (see builder.py).
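A minimal sketch of that framing, assuming you already build each TokenDict with the flatc-generated Python code (the function names and path here are just illustrations):

import struct

def append_dict(path, builder, root):
    # FinishSizePrefixed prepends a 4-byte little-endian length to the buffer,
    # so consecutive buffers appended to one file stay self-delimiting.
    builder.FinishSizePrefixed(root)
    with open(path, "ab") as f:
        f.write(builder.Output())

def read_dicts(path):
    # Yield one raw FlatBuffer at a time instead of loading the whole file.
    with open(path, "rb") as f:
        while True:
            prefix = f.read(4)
            if len(prefix) < 4:
                break
            (size,) = struct.unpack("<I", prefix)  # length of the buffer that follows
            yield f.read(size)  # hand this to the flatc-generated TokenDict root accessor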
I have a problem where I am trying to split a file into n-character-length records for a distributed system. I have the functionality to break up a record and map it to the proper names at the record level, but I need to go from the file sitting on the system to breaking the file up and passing it out to the nodes in n-length pieces to be split and processed.
I have looked into the spec for the SparkContext object, and there is a method to pull in a file from the Hadoop environment and load it as an RDD of byte arrays. The function is binaryRecords.
I am trying to load SEG-Y type files into Spark and transform them into an RDD for MapReduce operations.
But I failed to get them into an RDD. Can anyone offer help?
You could use the binaryRecords() PySpark call to convert a binary file's content into an RDD:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
Parameters: path – Directory to the input data files; recordLength – The length at which to split the records
Then you could map() that RDD into a structure by using, for example, struct.unpack()
https://docs.python.org/2/library/struct.html
We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code that generates the format string (the 1st argument to struct.unpack), but if your file layout is static, it's fairly simple to do manually one time.
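A small sketch of that approach (the record length and struct format string are placeholders; a real SEG-Y trace layout would need its own format):

import struct

RECORD_LENGTH = 240   # placeholder: bytes per fixed-width record
FMT = ">60i"          # placeholder struct format whose size matches RECORD_LENGTH

# each element of the RDD is a bytes object of exactly RECORD_LENGTH bytes
records = sc.binaryRecords("hdfs:///path/to/files", RECORD_LENGTH)

parsed = records.map(lambda rec: struct.unpack(FMT, rec))
print(parsed.take(1))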
Similarly, it is possible to do this using pure Scala:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext#binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]
You've not really given much detail, but you can start with the SparkContext.binaryFiles() API
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
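For example, a minimal PySpark sketch (the path and record length are placeholders); each element of the RDD is a (filename, bytes) pair, which you can then split into n-length pieces yourself:

# binaryFiles returns an RDD of (path, file_content_as_bytes) pairs
files = sc.binaryFiles("hdfs:///path/to/files")

n = 1024   # placeholder record length
chunks = files.flatMap(
    lambda kv: [(kv[0], kv[1][i:i + n]) for i in range(0, len(kv[1]), n)]
)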