I need to call an external process from my EMR Spark job. I see that rdd.pipe would allow me to pipe an RDD to a process. (As an aside, is that one process per RDD, or one per element?).
However, my external process requires a filename as input and generates a file as output.
How can I invoke this external process and subsequently load the output file as an RDD?
is that one process per RDD, or one per element?
Neither. It is a process per partition.
process requires a filename as input and generates a file as output. How can
The simplest solution is to write a simple wrapper which writes to randomly generated path, invokes your program, reads the file and writes to stdout and this is pretty much all what pipe is about. Unless you write to distributed file system you wouldn't be able to retriever the output otherwise.
Related
In my scenario I have CSV files continuously uploaded to HDFS.
As soon as a new file gets uploaded I'd like to process the new file with Spark SQL (e.g., compute the maximum of a field in the file, transform the file into parquet). i.e. I have a one-to-one mapping between each input file and a transformed/processed output file.
I was evaluating Spark Streaming to listen to the HDFS directory, then to process the "streamed file" with Spark.
However, in order to process the whole file I would need to know when the "file stream" completes. I'd like to apply the transformation to the whole file in order to preserve the end-to-end one-to-one mapping between files.
How can I transform the whole file and not its micro-batches?
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct? If so, what alternative should I consider for my scenario?
I may have misunderstood your question the first try...
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct?
No. That's not correct.
Spark Streaming will apply transformation to the whole file at once as was written to HDFS at the time Spark Streaming's batch interval elapsed.
Spark Streaming will take the current content of a file and start processing it.
As soon as a new file gets uploaded I need to process the new file with Spark/SparkSQL
Almost impossible with Spark due to its architecture which takes some time from the moment "gets uploaded" and Spark processes it.
You should consider using a brand new and shiny Structured Streaming or (soon obsolete) Spark Streaming.
Both solutions support watching a directory for new files and trigger Spark job once a new file gets uploaded (which is exactly your use case).
Quoting Structured Streaming's Input Sources:
In Spark 2.0, there are a few built-in sources.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
See also Spark Streaming's Basic Sources:
Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.
File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
One caveat though given your requirement:
I would need to know when the "file stream" completes.
Don't do this with Spark.
Quoting Spark Streaming's Basic Sources again:
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Wrapping up...you should only move the files to the directory that Spark watches when the files are complete and ready for processing using Spark. This is outside the scope of Spark.
You can use DFSInotifyEventInputStream to watch Hadoop dir and then execute Spark job programmatically when file is created.
See this post:
HDFS file watcher
I have a dataframe and a i am going to write it an a .csv file in S3
i use the following code:
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly",mode='overwrite',header=True)
it puts a .csv file in product_profit_weekly folder , at the moment .csv file has a weired name in S3 , is it possible to choose a file name when i am going to write it?
All spark dataframe writers (df.write.___) don't write to a single file, but write one chunk per partition. I imagine what you get is a directory called
df.coalesce(1).write.csv("dbfs:/mnt/mount1/2016//product_profit_weekly
and one file inside called
part-00000
In this case, you are doing something that could be quite inefficient and not very "sparky" -- you are coalescing all dataframe partitions to one, meaning that your task isn't actually executed in parallel!
Here's a different model. To take advantage of all spark parallelization, which means DON'T coalesce, and write in parallel to some directory.
If you have 100 partitions, you will get:
part-00000
part-00001
...
part-00099
If you need everything in one flat file, write a little function to merge it after the fact. You could either do this in scala, or in bash with:
cat ${dir}.part-* > $flatFilePath
I am quite new with pyspark. In my application with pyspark, I want to achieve following things:
Create a RDD using python list and partition it into some partitions.
Now use rdd.mapPartitions(func)
Here, the function "func" performs an iterative operation which, reads content of saved file into a local variable (for e.g. numpy array), performs some updates using the rdd partion data and again saves the content of variable to some common file system.
I am not able to figure out how to read and write a variable inside a worker process which is accessible to all processes??
I am writing Spark application (Single client) and dealing with lots of small files upon whom I want to run an algorithm. The same algorithm for each one of them. But the files cannot be loaded into the same RDD for the algorithm to work, Because it should sort data within one file boundary.
Today I work on a file at a time, As a result I have poor resource utilization (Small amount of data each action, lots of overhead)
Is there any way to perform the same action/transformation on multiple RDD's simultaneously (And only using one driver program)? Or should I look for another platform? Because such mode of operation isn't classic for Spark.
If you use SparkContext.wholeTextFiles, then you could read the files into one RDD and each partition of the RDD would have the content of a single file. Then, you could work on each partition separately using SparkContext.mapPartitions(sort_file), where sort_file is the sorting function that you want to apply on each file. This would use concurrency better than your current solution, as long as your files are small enough that they can be processed in a single partition.
When I run a Spark job and save the output as a text file using method "saveAsTextFile" as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.rdd.RDD :
here are the files that are created :
Is the .crc file a Cyclic Redundancy Check file ? and so is used to check that the content of each generated file IS correct ?
The _SUCCESS file is always empty, what does this signify ?
The files that do not have an extension in above screenshot contain the actual data from the RDD but why are many files generated instead of just one ?
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files.
You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files will be written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means you will write your output much faster that it would be written out if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:
# PySpark
# Get the number of partitions of my_rdd.
my_rdd._jrdd.splits().size()
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
.crc files: I have not seen the .crc files before, but yes, presumably they are checks on the part- files.