Read a file while it is being written by Spark Structured Streaming - apache-spark

I am using Spark Structured Streaming for my application. I have a use case where I need to read a file while it is still being written.
I tried this with Spark Structured Streaming as below:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

sch = StructType([StructField("ID", IntegerType(), True), StructField("COUNTRY", StringType(), True)])
df_str = spark.readStream.format("csv").schema(sch).option("header", True).option("delimiter", ",").load("<Load Path>")
query = df_str.writeStream.format("parquet").outputMode("append").trigger(processingTime='10 seconds').option("path", "<HDFS location>").option("checkpointLocation", "<chckpoint_loc>").start()
But it only reads the file once, initially; after that the file is not read incrementally. As a workaround I am thinking of writing to a temporary directory, creating a new file after some time, and copying it into the directory the Spark Structured Streaming job reads from, but this adds latency.
Is there any other way to handle this (I cannot use Kafka)?
Sorry if this question does not belong on Stack Overflow, but I did not find any other place to ask it.

Unfortunately Spark doesn't support this. The unit of the file stream source is a "file". Spark assumes that the files it reads are immutable, i.e. a file should not change once it has been placed in the source path. This makes offset management much simpler (there is no need to track offsets within files); only the number of files in the source path keeps growing. A reasonable limitation, but still a limitation.
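A minimal writer-side sketch of the usual workaround (the directory names are hypothetical, not from the question): finish each file in a staging directory, then atomically rename it into the path the stream watches, so Spark only ever sees complete, immutable files.

import os
import time
import uuid

STAGING_DIR = "/data/staging"    # hypothetical: the producer writes here
SOURCE_DIR = "/data/incoming"    # hypothetical: spark.readStream watches this path

def publish(rows):
    # Write a complete file in staging, then move it into the watched directory.
    name = "part-%d-%s.csv" % (int(time.time()), uuid.uuid4().hex)
    tmp_path = os.path.join(STAGING_DIR, name)
    with open(tmp_path, "w") as f:
        f.writelines(rows)
    # os.rename is atomic on the same local file system; on HDFS the
    # equivalent is FileSystem.rename or `hdfs dfs -mv`.
    os.rename(tmp_path, os.path.join(SOURCE_DIR, name))

Each call produces one new immutable file, which is exactly the unit the file stream source tracks.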

Related

How to implement exactly-once processing when reading from a directory using Spark Structured Streaming?

I'd like to use the concept of stream processing to read files from a local directory and then publish to Apache Kafka. I thought about using Spark Structured Streaming.
How is checkpointing implemented when the streaming job fails after reading 50 lines of a file? Will it start from the 51st line of the file when started next time, or will it read the file from the start again?
Also, will we have any issues if we use checkpointing in Structured Streaming when there is an upgrade or any change in the code?
when the streaming fails after reading 50 lines of a file. Will it start from the 51st line of the file when started next time or will it again read from the start of the file?
Either the entire file is fully processed or none at all. That's how FileFormat works in general in Spark SQL and has little to do with Spark Structured Streaming in particular (since they share the underlying execution infrastructure).
In short, the engine will "again read from the start of the file."
That is also to say that there is no concept of a single line while processing files in Spark Structured Streaming. You process a streaming DataFrame that is an entire file (or even a couple of files) all at once, and whether you process the dataset line by line or in its entirety is up to you, the Spark developer.
Also, will we have any issues if we use checkpointing in Structured Streaming when there is an upgrade or any change in the code?
In theory, you should not. The purpose of the new checkpointing mechanism in Spark Structured Streaming (as compared to the legacy Spark Streaming's) was to allow for restarts and upgrades in a more comfortable way. The checkpointing uses just a little information (usually stored in JSON files) to restart processing from the point of the last successful checkpoint.
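A minimal PySpark sketch of the restart behaviour (the paths and schema are made up for illustration): as long as the restarted query points at the same checkpointLocation, batches that were already committed in a previous run are not reprocessed.

from pyspark.sql.types import StructType, StructField, StringType

schema = StructType([StructField("line", StringType(), True)])

stream = spark.readStream.schema(schema).csv("/data/incoming")   # hypothetical source directory

query = (stream.writeStream
         .format("parquet")
         .option("path", "/data/out")                 # hypothetical sink
         .option("checkpointLocation", "/data/chk")   # reuse the same location across restarts
         .start())

# If the application is stopped and started again with the same checkpointLocation,
# Structured Streaming reads the committed offsets (small JSON files under /data/chk)
# and skips the files it has already processed.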

Importing a large text file into Spark

I have a pipe delimited text file that is 360GB, compressed (gzip). The file is in an S3 bucket.
This is my first time using Spark. I understand that you can partition a file in order to allow multiple worker nodes to operate on the data, which results in huge performance gains. However, I'm trying to find an efficient way to turn my one 360GB file into a partitioned file. Is there a way to use multiple Spark worker nodes to work on my one compressed file in order to partition it? Unfortunately, I have no control over the fact that I'm just getting one huge file.
I could uncompress the file myself and break it into many files (say 360 1GB files), but I'd just be using one machine to do that and it would be pretty slow. I need to run some expensive transformations on the data using Spark, so I think partitioning the file is necessary. I'm using Spark inside of Amazon Glue, so I know it can scale to a large number of machines. Also, I'm using Python (PySpark).
Thanks.
If I'm not mistaken, Spark uses Hadoop's TextInputFormat when you read a file using SparkContext.textFile. If a compression codec is set, the TextInputFormat determines whether the file is splittable by checking whether the codec is an instance of SplittableCompressionCodec.
I believe GZIP is not splittable, so Spark can only generate one partition to read the entire file.
What you could do is:
1. Add a repartition after SparkContext.textFile so that at least your subsequent transformations process parts of the data in parallel (see the sketch after this list).
2. Ask for multiple files instead of just a single GZIP file.
3. Write an application that decompresses and splits the file into multiple output files before running your Spark application on it.
4. Write your own compression codec for GZIP (this is a little more complex).
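A minimal PySpark sketch of option 1 (the S3 path and partition count are illustrative): the gzipped file is read into a single partition, and an explicit repartition spreads the rows across the cluster before the expensive transformations run.

# The gzip file is not splittable, so textFile yields a single partition.
rdd = sc.textFile("s3://my-bucket/huge-file.gz")   # hypothetical S3 path

# Redistribute the rows across e.g. 500 partitions so that the subsequent
# transformations run in parallel on multiple workers.
rows = rdd.repartition(500).map(lambda line: line.split("|"))   # the file is pipe delimited

Note that the initial decompression and read still happen in a single task; only the work after the repartition is spread out.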
Have a look at these links:
TextInputFormat
source code for TextInputFormat
GzipCodec
source code for GZIPCodec
These are in Java, but I'm sure there are equivalent Python/Scala versions of them.
First, I suggest you use the ORC format with zlib compression, so you get almost 70% compression, and as per my research ORC is the most suitable file format for the fastest data processing. So load your file and simply write it out in ORC format with a repartition:
df.repartition(500).write.format("orc").option("compression", "zlib").mode("overwrite").save("testoutput.orc")
One potential solution could be to use Amazon's S3DistCp as a step on your EMR cluster to copy the 360GB file into the HDFS file system available on the cluster (this requires Hadoop to be deployed on EMR).
A nice thing about S3DistCp is that you can change the codec of the output file, and transform the original gzip file into a format which will allow Spark to spawn multiple partitions for its RDD.
However, I am not sure how long it will take S3DistCp to perform the operation (it is a Hadoop Map/Reduce job over S3; it benefits from optimised S3 libraries when run from EMR, but I am concerned that Hadoop will face the same limitations as Spark when generating the map tasks).

Unable to read the streaming data from the single file in Spark streaming

I am trying to read streaming data from a text file which gets appended continuously, using the Spark Streaming API textFileStream. But I am unable to read the continuously appended data with Spark Streaming. How do I achieve this in Spark?
This is expected behavior. For file-based sources (like fileStream):
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
If you want to read continuously appended data, you'll have to create your own source, or use a separate process which monitors changes and pushes records to, for example, Kafka (though it is rare to combine Spark with file systems that support appending).
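As a rough illustration of the separate-process idea (the kafka-python dependency, broker address, topic, and file path are assumptions, not part of the original answer), a small script could follow the appended file and forward new lines to Kafka:

import time
from kafka import KafkaProducer   # assumed dependency: kafka-python

producer = KafkaProducer(bootstrap_servers="localhost:9092")   # hypothetical broker

def tail_to_kafka(path, topic):
    # Follow a continuously appended file and forward each new line to Kafka.
    with open(path, "r") as f:
        f.seek(0, 2)   # start at the current end of the file
        while True:
            line = f.readline()
            if line:
                producer.send(topic, line.rstrip("\n").encode("utf-8"))
            else:
                time.sleep(1)   # wait for more data to be appended

tail_to_kafka("/data/appended.log", "file-lines")   # hypothetical path and topic

Spark (Structured) Streaming can then consume the Kafka topic instead of the appended file.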

How to process new files in HDFS directory once their writing has eventually finished?

In my scenario I have CSV files continuously uploaded to HDFS.
As soon as a new file gets uploaded I'd like to process the new file with Spark SQL (e.g., compute the maximum of a field in the file, or transform the file into Parquet). In other words, I have a one-to-one mapping between each input file and a transformed/processed output file.
I was evaluating Spark Streaming to listen to the HDFS directory, then to process the "streamed file" with Spark.
However, in order to process the whole file I would need to know when the "file stream" completes. I'd like to apply the transformation to the whole file in order to preserve the end-to-end one-to-one mapping between files.
How can I transform the whole file and not its micro-batches?
As far as I know, Spark Streaming can only apply transformations to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct? If so, what alternative should I consider for my scenario?
I may have misunderstood your question on the first try...
As far as I know, Spark Streaming can only apply transformations to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct?
No. That's not correct.
Spark Streaming will apply the transformation to the whole file at once, i.e. to whatever had been written to HDFS by the time Spark Streaming's batch interval elapsed.
Spark Streaming will take the current content of a file and start processing it.
As soon as a new file gets uploaded I need to process the new file with Spark/SparkSQL
That is almost impossible with Spark, due to its architecture: some time always passes between the moment a file "gets uploaded" and the moment Spark processes it.
You should consider using the brand new and shiny Structured Streaming or the (soon obsolete) Spark Streaming.
Both solutions support watching a directory for new files and triggering a Spark job once a new file gets uploaded (which is exactly your use case).
Quoting Structured Streaming's Input Sources:
In Spark 2.0, there are a few built-in sources.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
See also Spark Streaming's Basic Sources:
Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.
File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
One caveat though given your requirement:
I would need to know when the "file stream" completes.
Don't do this with Spark.
Quoting Spark Streaming's Basic Sources again:
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Wrapping up: you should move files into the directory that Spark watches only when the files are complete and ready for processing with Spark. This is outside the scope of Spark.
You can use DFSInotifyEventInputStream to watch a Hadoop directory and then execute a Spark job programmatically when a file is created.
See this post:
HDFS file watcher

How to use spark streaming from a file system input

I want to use Spark Streaming and take the input from a file system (say HDFS). How do I do that?
For instance, when using JavaStreamingContext there are appropriate methods, e.g. textFileStream() to read any text files and fileStream() to read files from a Hadoop-compatible file system. The directory you pass as a parameter to the API will be monitored for changes. If you move a file there, it will be picked up by the streaming application, depending on the batch interval.
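For reference, a minimal PySpark sketch of the same idea (the directory and batch interval are illustrative): textFileStream watches a directory and turns every new file into a DStream of lines.

from pyspark.streaming import StreamingContext

ssc = StreamingContext(sc, 10)                        # 10-second batch interval
lines = ssc.textFileStream("hdfs:///data/incoming")   # hypothetical directory to monitor
lines.pprint()                                        # print a few lines of each batch

ssc.start()
ssc.awaitTermination()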
Please have a look at my simple samples on GitHub that read data either from Twitter or from the file system.
Hope that will help.
