I have a problem where I am trying to split a file into n-character-length records for a distributed system. I have the functionality for breaking up a record and mapping it to the proper names at the record level, but I need to go from the file sitting on the system to breaking it up and passing it out to the nodes in n-length pieces to be split and processed.
I have looked into the specs for the SparkContext object, and there is a method to pull in a file from the Hadoop environment and load it as an RDD of byte arrays. The method is binaryRecords.
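For illustration, a minimal PySpark sketch of that approach, assuming fixed-length records of 100 bytes; the path and record length here are placeholders, not values from the question:

# Read fixed-length records; every element of the RDD is one raw record (bytes).
records = sc.binaryRecords("hdfs:///data/fixed_width_file", 100)

# If the records are character data, decode them before further per-record mapping.
decoded = records.map(lambda rec: rec.decode("ascii"))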
Related
I have a folder that contains n number of files.
I am creating an RDD that contains all the filenames of above folder with the code below:
fnameRDD = spark.read.text(filepath).select(input_file_name()).distinct().rdd
I want to iterate through these RDD elements and perform the following steps:
Read the content of each element (each element is a file path, so the content needs to be read through SparkContext)
The above content should be another RDD, which I want to pass as an argument to a function
Perform certain steps on the RDD passed as an argument inside the called function
I already have a function written with steps that I've tested for a single file, and it works fine.
But I've tried various things syntactically to do the first 2 steps, and I just get invalid syntax every time.
I know I am not supposed to use map(), since reading a file in each iteration requires sc, but map() is executed on the worker nodes, where sc can't be referenced.
Also, I know I could use wholeTextFiles() as an alternative, but that means I'd be holding the text of all the files in memory throughout the process, which doesn't seem efficient to me.
I am open to suggestions for different approaches as well.
There are possibly other, more efficient ways to do it but assuming you already have a function SomeFunction(df: DataFrame[value: string]), the easiest would be to use toLocalIterator() on your fnameRDD to process one file at a time. For example:
for x in fnameRDD.toLocalIterator():
    fileContent = spark.read.text(x[0])
    # fileContent.show(truncate=False)
    SomeFunction(fileContent)
A couple of thoughts regarding efficiency:
Unlike .collect(), .toLocalIterator() brings data to driver memory one partition at a time. But in your case, after you call .distinct(), all the data will reside in a single partition, and so will be moved to the driver all at once. Hence, you may want to add .repartition(N) after .distinct(), to break that single partition into N smaller ones and avoid the need for a large heap on the driver (see the first sketch after these notes). Of course, this is only relevant if your list of input files is REALLY long.
The method used to list the file names itself also seems less than efficient. Perhaps you'd want to consider something more direct, using the FileSystem API, for example as in this article (second sketch below).
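A minimal sketch of the .repartition(N) suggestion, reusing the fnameRDD expression from the question (N = 8 is just an illustrative value):

from pyspark.sql.functions import input_file_name

fnameRDD = (spark.read.text(filepath)
            .select(input_file_name())
            .distinct()
            .repartition(8)   # break the single post-distinct partition into N smaller ones
            .rdd)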
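And a sketch of listing the files through the Hadoop FileSystem API from PySpark via the JVM gateway; the directory path is illustrative, and spark._jvm / spark._jsc are internal handles, so treat this as an assumption-laden example rather than a polished solution:

jvm = spark._jvm
conf = spark._jsc.hadoopConfiguration()
fs = jvm.org.apache.hadoop.fs.FileSystem.get(conf)

# List the directory directly instead of reading every file just to collect names.
statuses = fs.listStatus(jvm.org.apache.hadoop.fs.Path("/data/input"))
file_paths = [status.getPath().toString() for status in statuses if status.isFile()]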
I believe you're looking for recursive file lookup,
spark.read.option("recursiveFileLookup", "true").text(filepathroot)
If you point this at the root directory of your files, Spark will traverse the directory and pick up all the files that sit under the root and its child folders. This will read the files into a single DataFrame.
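As a sketch (the root path here is illustrative), you can also tag each row with the file it came from via input_file_name(), which keeps the per-file grouping available even though everything lands in one DataFrame:

from pyspark.sql.functions import input_file_name

df = (spark.read
      .option("recursiveFileLookup", "true")
      .text("/data/root")
      .withColumn("source_file", input_file_name()))   # which file each line came from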
This may be a silly question, but I'm not able to understand how the files are split across partitions.
My requirement is to read 10,000 binary files (persisted Bloom filter files) from an HDFS location and process each binary file separately by converting its data to a ByteArrayInputStream. The point to note is that these files are persisted Bloom filter files and should be read sequentially from the start of the file to the end and converted to a byte array; this byte array will then be used to reconstruct the Bloom filter object.
JavaPairRDD<String, PortableDataStream> rdd = sparkContext.binaryFiles(commaSeparatedfilePaths);
rdd.map(new Function<Tuple2<String, PortableDataStream>, BloomCheckResponse>() {
    public BloomCheckResponse call(Tuple2<String, PortableDataStream> v1) { ... }
});
Here in the code, v1._1 is the file path and v1._2 is the PortableDataStream, which will be converted to a ByteArrayInputStream.
Each binary file is of 34 MB.
Now the question is: can there be a situation where part of a file ends up in one partition and the rest in a different one? Or will I always get the entire content of a file mapped to that file in a single partition, never split across partitions?
Executor memory is 4 GB, with 2 cores per executor and 180 executors.
Basically, the expectation is that each file should be read as-is from start to end, without being split.
Each (file, stream) pair is guaranteed to provide the full content of the file in the stream. There is no case where data will be divided between multiple pairs, let alone multiple partitions.
You're safe to use it for your intended scenario.
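The question's code is Java, but for illustration here is a PySpark sketch of the same guarantee: each pair from binaryFiles() carries the complete content of one file, which can be wrapped in an in-memory stream in one go. The path and the returned value are placeholders.

import io

pairs = sc.binaryFiles("hdfs:///bloom/filters/")      # RDD of (path, full file content as bytes)

def rebuild(path_and_content):
    path, content = path_and_content
    stream = io.BytesIO(content)                      # analogue of ByteArrayInputStream
    # In practice the Bloom filter would be reconstructed from `stream` here;
    # returning the size is just a placeholder result.
    return (path, len(content))

checked = pairs.map(rebuild)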
I am working on parsing different types of files (text, XML, CSV, etc.) into a specific text file format using the Spark Java API. This output file maintains the order of file header, start tag, data header, data, and end tag. All of these elements are extracted from the input file at some point.
I tried to achieve this in the 2 ways below:
Read the file to an RDD using Spark's textFile and perform the parsing using map or mapPartitions, which returns a new RDD.
Read the file using Spark's textFile, reduce it to 1 partition using coalesce, and perform the parsing using mapPartitions, which returns a new RDD.
While I am not concerned about the sequencing of the actual data, with the first approach I am not able to keep the required order of File Header, Start Tag, Data Header and End Tag.
The latter works for me, but I know it is not an efficient way and may cause problems in the case of big files.
Is there any efficient way to achieve this?
You are correct in your assumptions. The second choice simply cancels the distributed aspect of your application, so it's not scalable. As for the order issue: since processing is asynchronous, we cannot keep track of order once the data reside on different nodes. What you could do is some preprocessing that removes the need for order, meaning merge the lines that belong together into single records and only then distribute your file (see the sketch below). Unless you can make assumptions about the file structure, such as the number of lines that belong together, I would go with the above.
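A minimal PySpark sketch of that preprocessing idea, under the stated assumption that a fixed number of lines belong together (5 here, purely illustrative, as is the input path), so each distributed record is already an ordered, self-contained block:

LINES_PER_BLOCK = 5

lines = sc.textFile("input.txt").zipWithIndex()                   # (line, global index)
blocks = (lines
          .map(lambda li: (li[1] // LINES_PER_BLOCK, (li[1], li[0])))
          .groupByKey()                                           # gather the lines of one block
          .mapValues(lambda pairs: "\n".join(line for _, line in sorted(pairs)))
          .sortByKey()
          .values())                                              # one ordered block per element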
I am trying to load SEG-Y files into Spark and transform them into an RDD for MapReduce-style operations.
But I have failed to transform them into an RDD. Can anyone offer help?
You could use the binaryRecords() PySpark call to convert a binary file's content into an RDD:
http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.SparkContext.binaryRecords
binaryRecords(path, recordLength)
Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
Parameters:
path – Directory to the input data files
recordLength – The length at which to split the records
Then you could map() that RDD into a structure by using, for example, struct.unpack()
https://docs.python.org/2/library/struct.html
We use this approach to ingest proprietary fixed-width-record binary files. There is a bit of Python code that generates the format string (the 1st argument to struct.unpack), but if your file layout is static, it's fairly simple to do manually one time.
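Putting the two together, a minimal sketch: the record layout here (one 4-byte int followed by one 8-byte double, big-endian) and the path are invented placeholders for whatever your actual file layout is.

import struct

RECORD_FORMAT = ">id"                                  # int32 + float64, big-endian
RECORD_LENGTH = struct.calcsize(RECORD_FORMAT)         # 12 bytes per record

raw = sc.binaryRecords("hdfs:///data/records.bin", RECORD_LENGTH)
parsed = raw.map(lambda rec: struct.unpack(RECORD_FORMAT, rec))   # tuples like (id, value)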
Similarly, this is possible using pure Scala:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext#binaryRecords(path:String,recordLength:Int,conf:org.apache.hadoop.conf.Configuration):org.apache.spark.rdd.RDD[Array[Byte]]
You've not really given much detail, but you can start with the SparkContext.binaryFiles() API:
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext
I need to count the term frequency of each word per document, so I want to implement the map and reduce functions per text file. How do I implement map() and reduce() per text file?
Another problem in MapReduce is that the output from reduce is written to a single file, /user/output/part-0000, but the project needs to write each processed file's output to a different text file. How do I do that?
Follow the steps mentioned below:
In the job file, compute the number of input files.
Set the number of reducers equal to the number of input files.
Assign numbers 0 to n-1 to the files and pass this information via the distributed cache.
In the mapper's setup() method, get the file name, retrieve the number assigned to that file, and store it in a static variable.
Return this static variable from the Partitioner.
The reducers will then emit n files.