Filenames written to HDFS by Spark dataframe - apache-spark

This question is regarding Spark 1.6.
When a dataframe is written to HDFS in SaveMode.Append mode, I want to know which files were newly created.
One way to do this is to keep track of the files in HDFS before and after the job; is there a better way?
Also, MapReduce prints job statistics at the end; do we have something similar for every Spark action?
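A minimal sketch of the track-before-and-after approach, using the Hadoop FileSystem API on the driver (the output path and class name are hypothetical, and the listing here is non-recursive):

import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NewFileTracker {
    // List the files currently present directly under the output directory.
    private static Set<String> listFiles(FileSystem fs, Path dir) throws Exception {
        Set<String> names = new HashSet<>();
        for (FileStatus status : fs.listStatus(dir)) {
            if (status.isFile()) {
                names.add(status.getPath().toString());
            }
        }
        return names;
    }

    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path outputDir = new Path("/data/output");   // hypothetical output path

        Set<String> before = listFiles(fs, outputDir);

        // df.write().mode(SaveMode.Append).parquet(outputDir.toString());

        Set<String> after = listFiles(fs, outputDir);
        after.removeAll(before);                      // whatever remains was created by this job
        after.forEach(System.out::println);
    }
}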

Related

How to get spark streaming to continue where spark batch left off

I have monthly directories of parquet files (~10TB each directory). Files are being atomically written to this directory every minute or so. When we get to a new month, a new directory is created and data is written there. Once data is written, it cannot be moved.
I can easily run batch queries on this data using Spark (batch mode). I can also easily run Spark Streaming queries.
I am wondering how I can reconcile the two modes: batch and stream.
For example: let's say I run a batch query on the data, get the results, and do something with them. I can then checkpoint this dataframe. Now let's say I want to start a streaming job that only processes new files relative to what was processed in the batch job, i.e. only files not processed by the batch job should now be processed.
Is this possible with Spark Streaming? If I start a Spark Streaming job and use the same checkpoint that the batch job used, will it proceed as I want it to?
Or, with the batch job, do I need to keep track of which files were processed and then somehow pass this to Spark Streaming so it knows not to process them?
This seems like a pretty common problem, so I am asking here to see what other big data developers have done.
I apologize for not having any code to post in this question, but I hope my explanation is enough for someone to see a potential solution. If needed, I can come up with some snippets.
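If the streaming checkpoint cannot be seeded from the batch run, one fallback is to keep an explicit manifest of the files the batch job read (via inputFiles()) and only pick up files outside that set on each subsequent pass. A rough sketch, assuming Spark 2.x APIs and hypothetical paths and class name:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class IncrementalAfterBatch {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("incremental").getOrCreate();
        String monthDir = "/data/2019-01";                 // hypothetical monthly directory

        // Batch pass: run the query once and remember exactly which files it read.
        Dataset<Row> batch = spark.read().parquet(monthDir);
        // ... batch query / checkpointing goes here ...
        Set<String> processed = new HashSet<>(Arrays.asList(batch.inputFiles()));
        // In a real job, persist `processed` somewhere durable (HDFS, a table, ...).

        // Incremental pass (run periodically): read only files the batch did not see.
        FileSystem fs = FileSystem.get(new Configuration());
        Set<String> unseen = new HashSet<>();
        for (FileStatus status : fs.listStatus(new Path(monthDir))) {
            String p = status.getPath().toString();        // fully qualified, like inputFiles()
            if (status.isFile() && !processed.contains(p)) {
                unseen.add(p);
            }
        }
        if (!unseen.isEmpty()) {
            Dataset<Row> delta = spark.read().parquet(unseen.toArray(new String[0]));
            // ... process only the new files, then add them to the manifest ...
            processed.addAll(unseen);
            delta.show();
        }
    }
}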

Can output files be moved while doing spark streaming, without crashing the spark job?

I have a Structured Streaming Spark job running with Kafka as the source, writing ORC files in append mode. While the job is running, I want to move the output files to another HDFS location every so often. Will moving the files ever crash the Spark job or cause it to produce bad output? Once Spark writes a file, will it ever look at that file again for any reason? I want to move the files without disrupting Spark in any way.
Since you are writing in append mode, moving the files won't affect your Structured Streaming job as long as the _spark_metadata directory that gets generated in your output folder and the checkpoint directory remain in sync.
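For reference, a minimal sketch of such a sink is below (broker, topic, and paths are hypothetical); the _spark_metadata log lives under the output path option and the commit log under checkpointLocation, so both must stay consistent with whatever files are actually present:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class KafkaToOrc {
    public static void main(String[] args) throws Exception {
        // Requires the spark-sql-kafka-0-10 package on the classpath.
        SparkSession spark = SparkSession.builder().appName("kafka-to-orc").getOrCreate();

        Dataset<Row> kafka = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")   // hypothetical broker
                .option("subscribe", "events")                      // hypothetical topic
                .load();

        StreamingQuery query = kafka
                .selectExpr("CAST(value AS STRING) AS value")
                .writeStream()
                .format("orc")
                .outputMode("append")
                // Spark records every file it has committed under <path>/_spark_metadata;
                // moving data files without keeping this log and the checkpoint consistent
                // breaks downstream reads of the sink directory.
                .option("path", "/data/orc-out")                    // hypothetical output path
                .option("checkpointLocation", "/checkpoints/kafka-to-orc")
                .start();

        query.awaitTermination();
    }
}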

spark sql data bigger than node memory when coalesce(1)

I'm working with Spark 1.6.1.
I have a dataframe that is distributed and is definitely bigger than any single node in my cluster.
What will happen if I bring it all onto one node?
df.coalesce(1)
Will the job fail?
Thanks
It will fail for sure, as the data will not fit in memory.
If you want a single file as output, you can merge the HDFS files afterwards using hdfs getmerge.
You can also use the utility from the GitHub project mentioned below to merge multiple files into one:
https://github.com/gopal-tiwari/hdfs-file-merge
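Alternatively, a small driver-side merge can be done with Hadoop's FileUtil.copyMerge (available up to Hadoop 2.x; on Hadoop 3, hdfs dfs -getmerge achieves the same thing). The paths and class name below are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

public class MergeParts {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Merge all part files under the job's output directory into a single file.
        FileUtil.copyMerge(
                fs, new Path("/data/output"),               // hypothetical source directory
                fs, new Path("/data/output-merged/result"), // hypothetical merged target file
                false,                                      // keep the source files
                conf, null);
    }
}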

Spark streaming + processing files in HDFS stage directory

I have a daemon process that dumps data as files into HDFS. I need to create an RDD over the new files, de-duplicate them, and store them back in HDFS. The file names should be preserved when writing back to HDFS.
Any pointers to achieve this?
I am open to achieve it with or without spark streaming.
I tried creating a Spark Streaming process that processes the data directly (using Java code on the worker nodes) and pushes it into HDFS without creating an RDD.
However, this approach fails for larger files (greater than 15 GB).
I am looking into JavaStreamingContext.fileStream now.
Any pointers would be a great help.
Thanks and Regards,
Abhay Dandekar
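One possible approach (a rough sketch, not a tested solution): de-duplicate each new file on its own and rename the single output part file back to the original name. This assumes Spark 2.x APIs, plain-text input files, duplicates confined to a single file, and purely hypothetical paths and class name:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class DedupKeepName {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("dedup").getOrCreate();
        FileSystem fs = FileSystem.get(new Configuration());

        Path landing = new Path("/data/landing");   // hypothetical staging directory
        Path target = new Path("/data/clean");      // hypothetical output directory
        fs.mkdirs(target);

        for (FileStatus status : fs.listStatus(landing)) {
            if (!status.isFile()) continue;
            String name = status.getPath().getName();

            // Deduplicate one input file at a time so the original file name can be kept.
            Dataset<Row> deduped =
                    spark.read().text(status.getPath().toString()).dropDuplicates();

            // Write to a temporary directory as a single part file ...
            Path tmp = new Path(target, "_tmp_" + name);
            deduped.coalesce(1).write().mode(SaveMode.Overwrite).text(tmp.toString());

            // ... then rename that part file to the original name and clean up.
            for (FileStatus part : fs.listStatus(tmp)) {
                if (part.getPath().getName().startsWith("part-")) {
                    fs.rename(part.getPath(), new Path(target, name));
                }
            }
            fs.delete(tmp, true);
        }
    }
}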

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts these records into a Hive table.
The Spark job creates one file per batch interval in the Hive partition, like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to 10 minutes or so, I might still end up with only 2-3 MB of data, which is way less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to post-process and merge all these small files into one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that grabs a bunch of hourly Avro data partitions from a Hive table, then converts them into a daily Parquet-partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache NiFi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
I had exactly the same situation as you. Here is how I solved it.
Let's assume that your newly arriving data is stored in a dataset: dataset1.
1- Partition the table with a good partition key; in my case I found that I could partition using a combination of keys to get around 100 MB per partition.
2- Save using Spark Core rather than Spark SQL:
a- Load the whole target partition into memory (as a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d- Save the resulting dataset in Overwrite mode to replace the existing files.
If you need more details about any step, please reach out.
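Put together, steps 2a-2d look roughly like the sketch below. All paths and the class name are placeholders, and the write goes through a staging directory (an addition to the steps above) because Spark refuses to overwrite a path it is reading from in the same query:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class CompactPartition {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder().appName("compact").getOrCreate();

        String partitionPath = "/warehouse/events/dt=2019-01-01";   // hypothetical partition
        String stagingPath = partitionPath + "_staging";

        // dataset1: the newly arrived records (hypothetical source path).
        Dataset<Row> dataset1 = spark.read().parquet("/data/incoming/batch-0001");

        // 2a: load the whole existing partition.
        Dataset<Row> dataset2 = spark.read().parquet(partitionPath);

        // 2b + 2c: union old and new records, then collapse to the desired file count.
        Dataset<Row> dataset3 = dataset1.union(dataset2).repartition(1);

        // 2d: write to a staging directory first, then swap it in place of the old one,
        // since Spark cannot overwrite a path that is also being read in the same query.
        dataset3.write().mode(SaveMode.Overwrite).parquet(stagingPath);

        FileSystem fs = FileSystem.get(new Configuration());
        fs.delete(new Path(partitionPath), true);
        fs.rename(new Path(stagingPath), new Path(partitionPath));
    }
}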
