Write parquet files to object store using spark-kubernetes - apache-spark

Working on a project to write parquet files to an object store from the Spark operator on Kubernetes, using multiple partition keys. The output location of the Spark jobs for the parquet files is the same, and the data is written to subfolders based on the partition keys.
What are the options to write the parquet files without HDFS?
The default FileOutputCommitter-based committers (algorithm 1 and 2) over S3A will not work, since the temporary files in the output location get deleted after each successful write; data is continuously written to the output location in small batches.
Is there any way to use the directory committer without HDFS?
NFS does not seem to work.
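For reference, the S3A committer designed to work without a cluster filesystem is the "magic" committer; a sketch of the Spark configuration that enables it (assuming Hadoop 3.x with the spark-hadoop-cloud module on the classpath — check the hadoop-aws documentation for your exact versions):

```
spark.hadoop.fs.s3a.committer.name           magic
spark.hadoop.fs.s3a.committer.magic.enabled  true
spark.sql.sources.commitProtocolClass        org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
spark.sql.parquet.output.committer.class     org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter
```

The magic committer stages data as uncompleted S3 multipart uploads under a `__magic` path and completes them at job commit, so it needs no HDFS and no shared staging directory.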

Related

Renaming Exported files from Spark Job

We are currently using a Spark job on Databricks which does processing on our data lake in S3.
Once the processing is done, we export our result to an S3 bucket using a normal
df.write()
The issue is that when we write the dataframe to S3, the file names are controlled by Spark, but per our agreement we need to rename these files to something meaningful.
Since S3 doesn't have a rename operation, we are currently using boto3 to copy each file to the expected name and delete the original.
This process is complex and does not scale as more clients are onboarded.
Is there a better solution to rename files exported from Spark to S3?
It's not possible to do it directly in Spark's save.
Spark uses the Hadoop output format machinery, which writes one file per partition - that's why you get part- files. If the data is small enough to fit into memory, one workaround is to convert it to a pandas dataframe and save it as CSV from there.
df_pd = df.toPandas()
df_pd.to_csv("path")
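If the data is too large for pandas, the copy-and-delete approach the question describes can at least be made systematic. Below is a hypothetical sketch (function and bucket names are examples, not an established API): a pure function plans the renames, and a separate function applies them with boto3 server-side copies.

```python
# Hypothetical sketch: rename Spark's part- files on S3 via copy + delete,
# since S3 has no native rename. All names here are illustrative.
import posixpath


def plan_renames(part_keys, target_basename):
    """Map each part-file key to a meaningful target key.

    A single part file takes target_basename unchanged; multiple part
    files get a zero-padded numeric suffix before the extension.
    """
    plans = []
    for i, key in enumerate(sorted(part_keys)):
        prefix = posixpath.dirname(key)
        if len(part_keys) == 1:
            new_name = target_basename
        else:
            stem, dot, ext = target_basename.partition(".")
            new_name = f"{stem}-{i:04d}{dot}{ext}"
        plans.append((key, posixpath.join(prefix, new_name)))
    return plans


def apply_renames(bucket, plans):
    # S3 copies are server-side, so no data is downloaded or re-uploaded.
    import boto3

    s3 = boto3.client("s3")
    for src, dst in plans:
        s3.copy_object(Bucket=bucket, Key=dst,
                       CopySource={"Bucket": bucket, "Key": src})
        s3.delete_object(Bucket=bucket, Key=src)
```

This is still a copy under the hood, but centralizing the mapping logic keeps the onboarding of new clients down to one naming rule.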

Why does _spark_metadata have all parquet partitioned files inside 0 when the cluster has 2 workers?

I have a small spark cluster with one master and two workers. I have a Kafka streaming app which streams data from Kafka and writes to a directory in parquet format and in append mode.
So far I am able to read from Kafka stream and write it to a parquet file using the following key line.
val streamingQuery = mydf.writeStream.format("parquet").option("path", "/root/Desktop/sampleDir/myParquet").outputMode(OutputMode.Append).option("checkpointLocation", "/root/Desktop/sampleDir/myCheckPoint").start()
I have checked both of the workers: 3-4 snappy parquet files were created, with file names prefixed part-00006-XXX.snappy.parquet.
But when I try to read this parquet output using the following command:
val dfP = sqlContext.read.parquet("/root/Desktop/sampleDir/myParquet")
it throws file-not-found exceptions for some of the parquet split files. The strange thing is that those files are present on one of the worker nodes.
Checking the logs further, I observed that Spark tries to read all the parquet files from only ONE worker node, and since not all parquet files are present on that one worker, it hits the exception that those files were not found at the given parquet path.
Am I missing some critical step in the streaming query or while reading the data?
NOTE: I don't have a Hadoop infrastructure. I want to use the file system only.
You need a shared file system.
Spark assumes the same file system is visible from all nodes (driver and workers).
If you are using the plain local file system, each node sees its own file system, which is different from the file systems of the other nodes.
HDFS is one way of getting a common, shared file system; another is a common NFS mount (i.e. mount the same remote file system on all nodes at the same path). Other shared file systems exist as well.
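A sketch of the NFS option, as an /etc/fstab entry applied identically on the driver and every worker (server name and paths are examples):

```
# Same export, same mount point, on every node in the cluster:
nfs-server.example.com:/export/spark  /mnt/spark-output  nfs  defaults  0 0
```

With that in place, both the streaming query's path and checkpointLocation would point under /mnt/spark-output, so every executor and the driver read and write the same directory.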

Merge multiple parquet files into single file on S3

I don't want to repartition the Spark dataframe to reduce the number of part files, since writing it as-is gives the best performance. Is there any way I can merge the files after they have been written to S3?
I have used parquet-tools and it can merge local files. I want to do this on S3.

Does Spark lock the File while writing to HDFS or S3

I have an S3 location with the below directory structure with a Hive table created on top of it:
s3://<Mybucket>/<Table Name>/<day Partition>
Let's say I have a Spark program which writes data into above table location spanning multiple partitions using the below line of code:
Df.write.partitionBy("orderdate").parquet("s3://<Mybucket>/<Table Name>/")
If another program, such as a Hive SQL query or an AWS Athena query, starts reading data from the table at the same time:
Do they consider temporary files being written?
Does Spark lock the data file while writing to the S3 location?
How can we handle such concurrency situations using Spark as an ETL tool?
No locks. Locking is not implemented in either S3 or HDFS.
The process of committing work in HDFS is not atomic; there is some renaming going on at job commit, which is fast but not instantaneous.
With S3, things are pathologically slow with the classic output committers, which assume rename is atomic and fast.
The Apache S3A committers avoid the renames and only make the output visible at job commit, which is fast but not atomic.
Amazon EMR now has its own S3 committer, but it makes files visible as each task commits, so it exposes readers to incomplete output for longer.
Spark writes the output in a two-step process. First it writes the data to a _temporary directory, and then, once the write operation is complete and successful, it moves the files to the output directory.
Do they consider temporary files being written?
Since files and directories starting with _ are treated as hidden, you cannot read them from Hive or AWS Athena.
Does Spark lock the data file while writing to the S3 location?
Locking or any other concurrency mechanism is not required because of Spark's simple two-step write process.
How can we handle such concurrency situations using Spark as an ETL tool?
Again, by relying on the write-to-temporary-location mechanism.
One more thing to note: in your example above, after writing output to the output directory you need to add the partition to the Hive external table using an ALTER TABLE <tbl_name> ADD PARTITION (...) command or an MSCK REPAIR TABLE tbl_name command; otherwise the data won't be visible in Hive.
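The partition-registration step can be scripted as part of the ETL job. A hypothetical helper (table and column names are examples) that builds the DDL an ETL job would pass to spark.sql(...) after the write:

```python
# Hypothetical helper: build the Hive DDL that registers a newly written
# partition after a Spark write. Table/column names here are examples.
def add_partition_sql(table, partition_spec, location=None):
    """Return an ALTER TABLE ... ADD PARTITION statement.

    partition_spec maps partition column names to their string values;
    location optionally pins the partition to an explicit S3 path.
    """
    spec = ", ".join(f"{k}='{v}'" for k, v in partition_spec.items())
    sql = f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({spec})"
    if location:
        sql += f" LOCATION '{location}'"
    return sql


# After Df.write.partitionBy("orderdate").parquet(...), the job would run:
#   spark.sql(add_partition_sql("orders", {"orderdate": "2023-01-01"}))
```

Using IF NOT EXISTS keeps the statement idempotent across reruns; MSCK REPAIR TABLE is the simpler alternative when many partitions land at once, at the cost of a full listing of the table path.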

Do Parquet Metadata Files Need to be Rolled-back?

When a Parquet file data is written with partitioning on its date column we get a directory structure like:
/data
    _common_metadata
    _metadata
    _SUCCESS
    /date=1
        part-r-xxx.gzip
        part-r-xxx.gzip
    /date=2
        part-r-xxx.gzip
        part-r-xxx.gzip
If the partition date=2 is deleted without the involvement of Parquet utilities (via the shell or file browser, etc) do any of the metadata files need to be rolled back to when there was only the partition date=1?
Or is it ok to delete partitions at will and rewrite them (or not) later?
If you're using DataFrame there is no need to roll back the metadata files.
For example:
You can write your DataFrame to S3
df.write.partitionBy("date").parquet("s3n://bucket/folderPath")
Then, manually delete one of your partitions (date=1 folder in S3) using S3 browser (e.g. CloudBerry)
Now you can
Load your data and see that the data is still valid, except for the data you had in partition date=1:
sqlContext.read.parquet("s3n://bucket/folderPath").count
Or rewrite your DataFrame (or any other DataFrame with the same schema) using append mode
df2.write.mode("append").partitionBy("date").parquet("s3n://bucket/folderPath")
You can also take a look at this question from the Databricks forum.