I am using Spark to read text files from a folder and load them into Hive.
The Spark Streaming batch interval is 1 minute. In rare cases the source folder may contain 1000 larger files.
How do I limit the number of files Spark Streaming reads? Currently my program reads all files generated in the last minute, but I want to control how many files it reads.
I am using the textFileStream API:
JavaDStream<String> lines = jssc.textFileStream("C:/Users/abcd/files/");
Is there any way to control the file streaming rate?

I am afraid not.
Spark Streaming is time-driven.
You could use Flink, which is data-driven.

You could use "spark.streaming.backpressure.enabled" and "spark.streaming.backpressure.initialRate" to control the rate at which data is received.

If your files are CSV files, you can use Structured Streaming to read them into a streaming DataFrame with maxFilesPerTrigger, like this:
import org.apache.spark.sql.types._
val schema = StructType(Seq(StructField("some_field", StringType)))
// Read at most 10 new files per trigger
val streamDf = spark.readStream.option("maxFilesPerTrigger", "10").schema(schema).csv("/directory/of/files")
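To make the effect of maxFilesPerTrigger concrete, here is a plain-Python sketch (no Spark involved) of the idea: each trigger takes at most N files that have not been processed yet, oldest first, and leaves the rest for later triggers. The function name `next_batch` and the file names are hypothetical, and this only illustrates the behaviour; it is not how Spark implements it.

```python
from typing import Dict, List, Set


def next_batch(files: Dict[str, float], seen: Set[str], max_files: int) -> List[str]:
    """Pick at most max_files unseen files, oldest modification time first."""
    pending = sorted((mtime, name) for name, mtime in files.items() if name not in seen)
    batch = [name for _, name in pending[:max_files]]
    seen.update(batch)  # remember what has already been handed out
    return batch


# Hypothetical backlog: 25 files with increasing modification times.
backlog = {f"file_{i:02d}.txt": float(i) for i in range(25)}
seen: Set[str] = set()
batch_sizes = []
while True:
    batch = next_batch(backlog, seen, max_files=10)
    if not batch:
        break
    batch_sizes.append(len(batch))

print(batch_sizes)  # [10, 10, 5]
```

With a cap of 10, a backlog of 25 files is spread over three triggers instead of landing in a single batch.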


How can I achieve streaming data aggregation per batch using Spark Structured Streaming?

I am using Spark Structured Streaming to read files as they arrive in a specific folder on my system.
I want to run a streaming aggregation query on the data and write the result to Parquet files every batch, using append mode. That way, Structured Streaming performs a partial, per-batch aggregation that is written to disk, and we read the output Parquet files through an Impala table that points to the output directory.
So I need to have something like this:
batch    aggregated_value
batch-1  10
batch-2  8
batch-3  17
batch-4  13
I actually don't need the batch column but it helps to clarify what I am trying to do.
Does Structured Streaming offer a way to achieve this?
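One way to picture the desired output is a callback that receives one micro-batch at a time and appends a single aggregated row for it. The sketch below simulates that in plain Python, without Spark. `aggregate_batch`, the `output` list (standing in for the Parquet directory) and the sample values are all hypothetical, chosen only so that the sums match the table above. Spark 2.4 and later expose a per-batch hook of a similar shape as `foreachBatch` on the stream writer; whether it fits this pipeline is an assumption to verify.

```python
from typing import List, Tuple

output: List[Tuple[int, int]] = []  # stand-in for the append-only Parquet output


def aggregate_batch(rows: List[int], batch_id: int) -> None:
    """Aggregate one micro-batch on its own and append a single result row."""
    output.append((batch_id, sum(rows)))


# Hypothetical micro-batches; values are made up to reproduce the table.
batches = [[4, 6], [8], [10, 7], [3, 5, 5]]
for batch_id, rows in enumerate(batches, start=1):
    aggregate_batch(rows, batch_id)

for batch_id, value in output:
    print(f"batch-{batch_id} {value}")
```

Each batch is aggregated independently, so no state is carried from one batch to the next.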

Spark 2.x - gzip vs snappy compression for parquet files

I am (for the first time) trying to repartition the data my team is working with to improve our query performance. Our data is currently stored in partitioned .parquet files compressed with gzip. I have read that using snappy instead would significantly increase throughput (we query this data daily for our analysis). I still wanted to benchmark the two codecs to see the performance gap with my own eyes. I wrote a simple (Py)Spark 2.1.1 application to carry out some tests. I persisted 50 million records in memory (deserialized) in a single partition, wrote them to a single Parquet file (on HDFS) with each codec, and then read the files back to assess the difference. My problem is that I can't see any significant difference for either reads or writes.
Here is how I wrote my records to HDFS (same thing for the gzip file, just replace 'snappy' with 'gzip'):
.option('compression', 'snappy')\
And here is how I read my single .parquet file (same thing for the gzip file, just replace 'snappy' with 'gzip'):
df_read_snappy =\
.option('basePath', 'path_to_dir/test_file_snappy')\
.option('compression', 'snappy')\
I looked at the durations in the Spark UI. For reference, the 50 million persisted (deserialized) rows amount to 317.4M. Once written to a single Parquet file, the file weighs 60.5M with gzip and 105.1M with snappy (this is expected, as gzip is supposed to have a better compression ratio). Spark spends 1.7 min (gzip) and 1.5 min (snappy) writing the file (single partition, so a single core has to do all the work). Reading takes 2.7 min (gzip) and 2.9 min (snappy) on a single core (since we have a single file / HDFS block). This is what I do not understand: where is snappy's higher performance?
Have I done something wrong? Is my "benchmarking protocol" flawed? Is the performance gain there, but I am not looking at the right metrics?
I must add that I am using the default Spark configuration. I did not change anything aside from specifying the number of executors, etc.
Many thanks for your help!
Note: the Spark Parquet jar version is 1.8.1.
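As a quick sanity check on the numbers above, the timings can be converted into compression ratios and effective throughput. This sketch assumes "M" means megabytes and divides the in-memory size (317.4M) by the reported durations; `summarize` is a made-up helper name.

```python
RAW_MB = 317.4  # in-memory (deserialized) size from the post; "M" assumed to mean MB

# codec -> (file size in MB, write minutes, read minutes), all taken from the post
results = {
    "gzip": (60.5, 1.7, 2.7),
    "snappy": (105.1, 1.5, 2.9),
}


def summarize(file_mb: float, write_min: float, read_min: float) -> dict:
    """Compression ratio and throughput in MB of raw data per second."""
    return {
        "ratio": round(RAW_MB / file_mb, 2),
        "write_mb_s": round(RAW_MB / (write_min * 60), 2),
        "read_mb_s": round(RAW_MB / (read_min * 60), 2),
    }


summary = {codec: summarize(*values) for codec, values in results.items()}
for codec, stats in summary.items():
    print(codec, stats)
```

Both codecs come out at only a few MB of raw data per second on a single core. That may hint that something other than the codec dominates these timings, but this is a guess from the arithmetic, not a measurement.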

Spark Streaming to Hive, too many small files per partition

I have a Spark Streaming job with a batch interval of 2 minutes (configurable).
This job reads from a Kafka topic, creates a Dataset, applies a schema on top of it, and inserts the records into a Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
The incoming data is not that big. Even if I increase the batch duration to 10 minutes or so, I might still end up with only 2-3 MB of data, which is far less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for an efficient way to post-process these small files and merge them into one big file.
If anyone's done it before, please share your ideas.

I would encourage you not to use Spark to stream data from Kafka to HDFS.
The Kafka Connect HDFS plugin by Confluent (or Apache Gobblin by LinkedIn) exists for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this GitHub issue.
If you need to write Spark code to process Kafka data into a schema, you can still do that and write the result to another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I have personally written a "compaction" process that grabs a batch of hourly Avro data partitions from a Hive table and converts them into a daily-partitioned Parquet table for analytics. It has been working great so far.
If you want to batch the records before they land on HDFS, that is where Kafka Connect or Apache NiFi (mentioned in the link) can help, provided you have enough memory to hold the records before they are flushed to HDFS.

I have exactly the same situation as you. I solved it as follows.
Let's assume that your newly arriving data is stored in a dataset: dataset1
1- Partition the table with a good partition key. In my case, I found that partitioning on a combination of keys gives around 100 MB per partition.
2- Save using Spark core, not Spark SQL:
a- When you want to save, load the whole partition into memory (into a dataset: dataset2)
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d- Save the resulting dataset in "Overwrite" mode to replace the existing file
If you need more details about any step please reach out.
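The steps above can be sketched without Spark, using a local file as a stand-in for one Hive partition: load what is already there, union it with the new records, and overwrite the partition with a single merged file. Everything here (`compact_partition`, the one-record-per-line format, the temp-file swap) is a hypothetical illustration, not the answerer's code. In Spark itself, overwriting a path that the same job is still reading from is known to be fragile, so writing to a temporary location first is a common precaution.

```python
import os
import tempfile
from typing import List


def compact_partition(partition_file: str, new_records: List[str]) -> int:
    """Merge new records into a partition's single file; return the row count."""
    existing: List[str] = []
    if os.path.exists(partition_file):
        with open(partition_file) as f:             # step a: load the current partition
            existing = f.read().splitlines()
    merged = existing + new_records                 # step b: union of old and new data
    directory = os.path.dirname(partition_file) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)  # temp file next to the target
    with os.fdopen(fd, "w") as tmp:                 # step c: a single output file
        tmp.write("\n".join(merged) + ("\n" if merged else ""))
    os.replace(tmp_path, partition_file)            # step d: overwrite the old file
    return len(merged)


# Usage with a throwaway directory: two small batches end up in one file.
with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "part-00000.txt")
    compact_partition(path, ["r1", "r2"])
    total = compact_partition(path, ["r3"])
    with open(path) as f:
        final_rows = f.read().splitlines()
    file_count = len(os.listdir(d))

print(total, final_rows, file_count)
```

The cost of this pattern is that the whole partition is rewritten on every save, which is why keeping partitions to a moderate size matters.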

Spark Streaming : source HBase

Is it possible to set up a Spark Streaming job that keeps track of an HBase table and reads new/updated rows every batch? The blog here says that HDFS files are among the supported sources, but they seem to be using the following static API:
I can't find any documentation around this. Is it possible to stream from HBase using a Spark streaming context? Any help is appreciated.

The link provided does the following:
It reads the streaming data, converts it into HBase Puts, and adds them to the HBase table. Up to this point it is streaming, which means your ingestion process is streaming.
The stats calculation part is, I think, batch: it uses newAPIHadoopRDD. This method treats the data being read as files. In this case the files come from HBase, which is the reason for the following input formats:
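val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
  classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])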
If you want to read HBase updates as a stream, you need a handle on HBase's WAL (write-ahead log) at the back end, and then perform your operations on it. HBase-indexer is a good place to start for reading any updates in HBase.
I have used hbase-indexer to read HBase updates at the back end and direct them to Solr as they arrive. Hope this helps.

How to write Spark streaming calculated results to HDFS?

I am writing a Spark Streaming job with a batch window of 1 minute. At regular intervals of 30 minutes I want to write something to HDFS.
Can I do that in Spark Streaming?
If yes, how?
I don't want to write in every batch, as that would create too many files on HDFS.
I receive an input stream and add only the records I have not seen earlier to an RDD (or DataFrame). Then, at the end of each 30-minute interval, I want to write that to HDFS.
The current solutions I have in mind are:
Use updateStateByKey
Use checkpointing with a huge interval
Just wondering what the standard pattern is in such use cases.
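One pattern that matches this description is to keep the set of records seen so far, buffer only the unseen ones, and flush the buffer once every 30 one-minute batches. The sketch below simulates that in plain Python. `PeriodicWriter` and its `flushed` list (standing in for HDFS files) are hypothetical names, and the sketch says nothing about how the state would be checkpointed or bounded in a real Spark job.

```python
from typing import Iterable, List, Set


class PeriodicWriter:
    """Buffer unseen records and flush them once every `flush_every` batches."""

    def __init__(self, flush_every: int = 30) -> None:
        self.flush_every = flush_every
        self.seen: Set[str] = set()
        self.buffer: List[str] = []
        self.batches = 0
        self.flushed: List[List[str]] = []  # each entry stands in for one HDFS file

    def on_batch(self, records: Iterable[str]) -> None:
        for record in records:
            if record not in self.seen:       # keep only records not seen earlier
                self.seen.add(record)
                self.buffer.append(record)
        self.batches += 1
        if self.batches % self.flush_every == 0:
            self.flushed.append(self.buffer)  # one write per 30-batch window
            self.buffer = []


# Simulate 60 one-minute batches; every batch repeats "a" and adds one new record.
writer = PeriodicWriter(flush_every=30)
for minute in range(60):
    writer.on_batch(["a", f"rec-{minute}"])

print(len(writer.flushed), [len(chunk) for chunk in writer.flushed])
```

Sixty batches produce two writes instead of sixty, and the repeated record "a" is written only once.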
