How to add new files to a Spark Structured Streaming DataFrame - apache-spark

I am getting daily files in a folder on a Linux server. How should I add these to my Spark Structured Streaming DataFrame? (delta update)

Have you read the documentation?
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
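A minimal Structured Streaming sketch in Scala; the directory paths and the two-column schema below are assumptions for illustration, not taken from the question:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

val spark = SparkSession.builder().appName("daily-files").getOrCreate()

// File sources require an explicit schema (these columns are hypothetical).
val schema = new StructType()
  .add("id", StringType)
  .add("value", DoubleType)

// Every file atomically moved into the watched directory becomes new streaming input.
val incoming = spark.readStream
  .schema(schema)
  .option("header", "true")
  .csv("/data/incoming")

// Continuously append the new rows to a Parquet sink.
incoming.writeStream
  .format("parquet")
  .option("path", "/data/output")
  .option("checkpointLocation", "/data/checkpoint")
  .start()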

Related

How to append data to an existing AVRO file using Python

I have a DataFrame with a similar schema, and I need to append its data to the AVRO file. I don't want the AVRO data written into a folder as part files; for your information, my AVRO file is a single file, not part files in a folder. Can you please help me solve this?
You can write the data using overwrite mode when writing the DataFrame.
But part files are created because Spark is a distributed processing engine: each executor writes out its own files based on the amount of data it handles.
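A hedged sketch of the write-mode approach described above, assuming Spark 2.4+ where the built-in "avro" format is available (older versions need the external spark-avro package); note that Spark will still produce part files inside the target directory because of its distributed nature:

// df is the DataFrame whose schema matches the existing AVRO data.
df.write
  .mode("append")                 // or "overwrite" to replace the existing data
  .format("avro")
  .save("/data/avro/target")      // hypothetical output directory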

Read compressed JSON in Spark

I have data stored in S3 as utf-8 encoded json files, and compressed using either snappy/lz4.
I'd like to use Spark to read/process this data, but Spark seems to require the filename suffix (.lz4, .snappy) to understand the compression scheme.
The issue is that I have no control over how the files are named - they will not be written with this suffix. It is also too expensive to rename all such files to include such a suffix.
Is there any way for spark to read these JSON files properly?
For parquet encoded files there is the 'parquet.compression' = 'snappy' in Hive Metastore, which seems to solve this problem for parquet files. Is there something similar for text files?

hadoop: In which format data is stored in HDFS

I am loading data into HDFS using Spark. How is the data stored in HDFS? Is it encrypted? Is it possible to crack the HDFS data? What about security for existing data?
I want to know the details how the system behaves.
HDFS is a distributed file system that supports various formats: plain text (CSV, TSV), as well as Parquet, ORC, JSON, etc.
When saving data to HDFS from Spark, you need to specify the format.
You can't read Parquet files directly without Parquet tools, but Spark can read them.
The security of HDFS is governed by Kerberos authentication; you need to set up the authentication explicitly.
Note that Spark's default format for reading and writing data is Parquet.
HDFS can store data in many formats, and Spark can read them (CSV, JSON, Parquet, etc.). When writing back, specify the format you wish to save the file in, as in the sketch after these commands.
Reading up on the commands below will also help:
hadoop fs -ls /user/hive/warehouse
hadoop fs -get (this will get files from HDFS to your local file system)
hadoop fs -put (this will put files from your local file system into HDFS)
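As an illustration of specifying the format when writing to (and reading back from) HDFS with Spark, using hypothetical paths:

// Parquet is Spark's default, but the format can be set explicitly.
df.write.format("parquet").save("hdfs:///user/data/events_parquet")
df.write.format("csv").option("header", "true").save("hdfs:///user/data/events_csv")

// Reading back, again naming the format.
val eventsParquet = spark.read.parquet("hdfs:///user/data/events_parquet")
val eventsCsv = spark.read.option("header", "true").csv("hdfs:///user/data/events_csv")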

How to process new files in HDFS directory once their writing has eventually finished?

In my scenario I have CSV files continuously uploaded to HDFS.
As soon as a new file gets uploaded, I'd like to process it with Spark SQL (e.g., compute the maximum of a field in the file, or transform the file into Parquet). That is, I have a one-to-one mapping between each input file and a transformed/processed output file.
I was evaluating Spark Streaming to listen to the HDFS directory, then to process the "streamed file" with Spark.
However, in order to process the whole file I would need to know when the "file stream" completes. I'd like to apply the transformation to the whole file in order to preserve the end-to-end one-to-one mapping between files.
How can I transform the whole file and not its micro-batches?
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct? If so, what alternative should I consider for my scenario?
I may have misunderstood your question on my first attempt...
As far as I know, Spark Streaming can only apply transformation to batches (DStreams mapped to RDDs) and not to the whole file at once (when its finite stream has completed).
Is that correct?
No. That's not correct.
Spark Streaming will apply the transformation to the whole file at once, i.e. to whatever was written to HDFS by the time Spark Streaming's batch interval elapsed.
Spark Streaming will take the current content of a file and start processing it.
As soon as a new file gets uploaded I need to process the new file with Spark/SparkSQL
That is almost impossible with Spark due to its architecture: some time passes between the moment a file "gets uploaded" and the moment Spark processes it.
You should consider using the brand new and shiny Structured Streaming or the (soon-to-be-obsolete) Spark Streaming.
Both support watching a directory for new files and triggering a Spark job once a new file gets uploaded (which is exactly your use case).
Quoting Structured Streaming's Input Sources:
In Spark 2.0, there are a few built-in sources.
File source - Reads files written in a directory as a stream of data. Supported file formats are text, csv, json, parquet. See the docs of the DataStreamReader interface for a more up-to-date list, and supported options for each file format. Note that the files must be atomically placed in the given directory, which in most file systems, can be achieved by file move operations.
See also Spark Streaming's Basic Sources:
Besides sockets, the StreamingContext API provides methods for creating DStreams from files as input sources.
File Streams: For reading data from files on any file system compatible with the HDFS API (that is, HDFS, S3, NFS, etc.), a DStream can be created as:
streamingContext.fileStream[KeyClass, ValueClass, InputFormatClass](dataDirectory)
Spark Streaming will monitor the directory dataDirectory and process any files created in that directory (files written in nested directories not supported).
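For plain text files such as CSV, a rough sketch using the simpler textFileStream variant (the directory, batch interval, and processing step are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("hdfs-dir-watch")
val ssc = new StreamingContext(conf, Seconds(30))

// Each batch RDD holds the lines of files that newly appeared in the directory.
val lines = ssc.textFileStream("hdfs:///data/incoming")
lines.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // Process the newly arrived content here, e.g. parse it and write Parquet.
    println(s"new lines in this batch: ${rdd.count()}")
  }
}

ssc.start()
ssc.awaitTermination()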
One caveat though given your requirement:
I would need to know when the "file stream" completes.
Don't do this with Spark.
Quoting Spark Streaming's Basic Sources again:
The files must be created in the dataDirectory by atomically moving or renaming them into the data directory.
Once moved, the files must not be changed. So if the files are being continuously appended, the new data will not be read.
Wrapping up: you should move files into the directory that Spark watches only when they are complete and ready for processing. This is outside the scope of Spark.
You can use DFSInotifyEventInputStream to watch a Hadoop directory and then execute a Spark job programmatically when a file is created.
See this post:
HDFS file watcher
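A rough Scala sketch of that approach; it assumes HDFS superuser privileges (inotify requires them), a hypothetical NameNode URI, and a hypothetical watched path:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.client.HdfsAdmin
import org.apache.hadoop.hdfs.inotify.Event

val admin = new HdfsAdmin(URI.create("hdfs://namenode:8020"), new Configuration())
val events = admin.getInotifyEventStream()

while (true) {
  val batch = events.take()              // blocks until new edit-log events arrive
  batch.getEvents.foreach {
    case e: Event.CreateEvent if e.getPath.startsWith("/data/incoming/") =>
      // Trigger your Spark job for the newly created file here.
      println(s"new file created: ${e.getPath}")
    case _ =>                            // ignore appends, renames, deletes, ...
  }
}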

Do Parquet Metadata Files Need to be Rolled-back?

When Parquet data is written with partitioning on its date column, we get a directory structure like:
/data
  _common_metadata
  _metadata
  _SUCCESS
  /date=1
    part-r-xxx.gzip
    part-r-xxx.gzip
  /date=2
    part-r-xxx.gzip
    part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or file browser, etc) do any of the metadata files need to be rolled back to when there was only the partition date=1?
Or is it ok to delete partitions at will and rewrite them (or not) later?
If you're using DataFrames, there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3:
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then manually delete one of your partitions (the date=1 folder in S3) using an S3 browser (e.g. CloudBerry).
Now you can:
Load your data and see that the data is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode:
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question on the Databricks forum.
