Structured streaming performance and purging the parquet files - apache-spark

I am using Spark Structured Streaming to read streaming data from Kafka. I need to aggregate various metrics (say, 6 metrics) and write them out as Parquet files. I see a huge delay between metric 1 and metric 2: for example, if metric 1 was updated recently, metric 2's data is an hour old. How do I improve performance so these queries run in parallel?
Also, I write Parquet files that are meant to be read by another application. How do I constantly purge old Parquet data? Should I have a separate application for that?
Dataset<Row> lines_topic = spark.readStream()
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topics)
    .load();

Dataset<Row> data = lines_topic.select(
    functions.from_json(lines_topic.col("value").cast("string"), schema).alias(topics));

Dataset<Row> aggregated = data.withWatermark(---).groupBy(----).count();

query = aggregated.writeStream()
    .format("parquet")
    .option("path", ---)
    .option("truncate", "false")
    .outputMode("append")
    .option("checkpointLocation", checkpointFile)
    .start();

Since each query runs independently of the others, you need to ensure you're giving each query enough resources to execute. What could be happening is that, with the default FIFO scheduler, all triggers run sequentially rather than in parallel.
As described in the Spark job-scheduling documentation, you should set the FAIR scheduler on your SparkContext and then assign each query to its own pool.
// Run streaming query1 in scheduler pool1
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")
df.writeStream.queryName("query1").format("parquet").start(path1)
// Run streaming query2 in scheduler pool2
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool2")
df.writeStream.queryName("query2").format("orc").start(path2)
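Note that the pool assignment only takes effect when the scheduler mode itself is FAIR (the default is FIFO). A minimal sketch of that configuration, assuming you build the session yourself (the app name below is illustrative):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metrics-aggregation")           // illustrative name
  .config("spark.scheduler.mode", "FAIR")   // switch from the default FIFO scheduler
  .getOrCreate()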
Also, in terms of purging old Parquet files, you may want to partition the data and then periodically delete old partitions as needed; you can't simply delete individual rows if all the data is written to the same output path.
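A minimal Scala sketch of that approach, assuming `aggregated` is the aggregated streaming Dataset from the question, that it carries a date column named `date`, and that a hypothetical output path and 7-day retention policy apply (the cleanup could run in a separate maintenance job):
import java.time.LocalDate
import org.apache.hadoop.fs.Path

// 1) Partition the streaming output so each day lands in its own directory.
aggregated.writeStream
  .format("parquet")
  .partitionBy("date")                                   // .../output/date=2019-01-01/
  .option("path", "s3://bucket/output")                  // hypothetical output path
  .option("checkpointLocation", "s3://bucket/checkpoints/metrics")
  .outputMode("append")
  .start()

// 2) Periodically drop partition directories older than the retention window.
val retentionDays = 7                                    // assumed retention policy
val outputDir = new Path("s3://bucket/output")
val fs = outputDir.getFileSystem(spark.sparkContext.hadoopConfiguration)
val cutoff = LocalDate.now().minusDays(retentionDays)
fs.listStatus(outputDir)
  .filter(_.getPath.getName.startsWith("date="))
  .filter(s => LocalDate.parse(s.getPath.getName.stripPrefix("date=")).isBefore(cutoff))
  .foreach(s => fs.delete(s.getPath, true))              // recursive delete of old partitions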

Related

How do spark streaming on hive tables using sql in NON-real time?

We have data (millions of rows) in Hive tables which arrives every day. The next day, once the overnight ingestion is complete, different applications query us for the data (using SQL).
We take this SQL and run it on Spark:
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This causes too much memory usage on the Spark driver. Can we use Spark Streaming (or Structured Streaming) to stream the results in a piped fashion, rather than collecting everything on the driver and then sending it to clients?
We don't want to send out the data as soon as it arrives (as in typical streaming apps), but to stream data to clients when they ask (pull) for it.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting it into micro-batches on the order of milliseconds to seconds.
You can look at streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) => ... }, which gives you a very good way for Spark to write the processed streaming output to a sink in a micro-batch manner.
Nevertheless, Spark Structured Streaming doesn't have a standard JDBC source defined to read from.
Consider an option to store Hive's underlying files directly in a compressed and structured manner and transfer them as-is, rather than selecting through spark.sql, if every client needs the same or similar data; or partition them based on the WHERE condition of the spark.sql query and transfer only the needed files.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
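A minimal Scala sketch of that foreachBatch pattern (the staging paths below are illustrative): each micro-batch arrives as a plain DataFrame, so any batch writer can be used inside the function.
import org.apache.spark.sql.DataFrame

streamingDF.writeStream
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // Each micro-batch is a regular DataFrame, so any batch sink works here.
    batchDF.write
      .mode("append")
      .parquet(s"/staging/output/batch_$batchId")         // hypothetical staging path
  }
  .option("checkpointLocation", "/staging/checkpoints")   // hypothetical checkpoint dir
  .start()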

How to ingest different spark dataframes in a single spark job

I want to write an ETL pipeline in Spark that handles different input sources while using as few computing resources as possible, and I have a problem using the 'traditional' Spark ETL approach.
I have a number of streaming datasources which need to be persisted into Delta Lake tables. Each datasource is simply a folder in S3 with Avro files. Each datasource has a different schema. Each datasource should be persisted into its own Delta Lake table. Little conversion other than Avro -> Delta is needed, only enrichment with some additional fields derived from the filename.
New files are added at a moderate rate, from once a minute to once a day, depending on the datasource. I get a Kafka notification when new data lands, describing the kind of data and the S3 file path.
Assume there are two datasources, A and B. A is the s3://bucket/A/* files, B is s3://bucket/B/*. Whenever a new file is added I get a Kafka message with payload {'datasource': 'A', filename: 's3://bucket/A/file1', ... other fields}. A files should go to the Delta table s3://delta/A/, B files to s3://delta/B/.
How can I ingest them all in a single spark application with minimal latency?
As new data is constantly coming in, this sounds like streaming. But in Spark streaming one needs to define the stream schema upfront, and I have different sources with different schemas that are not known upfront.
Spinning up a dedicated Spark application per datasource is not an option - there are 100+ datasources with very small files arriving. Having 100+ Spark applications is a waste of money. Everything should be ingested using a single cluster of moderate size.
The only idea I have now: in a driver process, run a normal Kafka consumer; for each record, read a DataFrame, enrich it with additional fields and persist it to its Delta table. For more parallelism, consume multiple messages and run them in Futures, so multiple jobs run concurrently (sketched after the pseudo-code below).
Some pseudo-code, in a driver process:
val consumer = KafkaConsumer(...)
consumer.foreach { record =>
  val ds   = record.datasource
  val file = record.filename
  val df = spark.read.format("avro").load(file)
    .withColumn("id", lit(record.id))
  val dest = s"s3://delta/${record.datasourceName}"
  df.write.format("delta").mode("append").save(dest)
  consumer.commit(...)   // commit the offset from this record
}
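A small sketch of the Futures variant mentioned above, assuming the same hypothetical record fields and a hypothetical helper that pulls a batch of notifications; each load/write becomes its own Spark job, and with a FAIR scheduler they can overlap on the cluster:
import scala.concurrent.{Await, Future}
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import org.apache.spark.sql.functions.lit

val records = pollBatch(consumer)                   // hypothetical helper returning a batch of notifications
val jobs = records.map { record =>
  Future {
    spark.read.format("avro").load(record.filename)
      .withColumn("id", lit(record.id))
      .write.format("delta").mode("append")
      .save(s"s3://delta/${record.datasourceName}")
  }
}
Await.result(Future.sequence(jobs), 10.minutes)     // wait for the whole batch, then commit its offsets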
Sounds good (and PoC shows it works), but I wonder if there are other options? Any other ideas are appreciated.
Spark runs in a DataBricks platform.
Spark does not constrain you to one application per datasource; you can group datasources into a couple of Spark apps, or go with one Spark application for all the datasources, which is a feasible approach as long as the app has enough resources to ingest and process every datasource.
You can do something like:
object StreamingJobs extends SparkApp {
// consume from Kafka Topic 1
StreamProcess_1.runStream(spark)
// consume from Kafka Topic 2
StreamProcess_2.runStream(spark)
// consume from Kafka Topic n
StreamProcess_N.runStream(spark)
// wait until termination
spark.streams.awaitAnyTermination()
}
and maybe another Spark job for the batch processing:
object BatchedJobs extends SparkApp {
// consume from data source 1
BatchedProcess_1.run(spark)
// consume from data source 2
BatchedProcess_2.run(spark)
// consume from data source n
BatchedProcess_N.run(spark)
}
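For context, a hypothetical shape of one of those StreamProcess objects, assuming each source is an S3 folder of Avro files with a known schema (the schema value and checkpoint path below are illustrative):
import org.apache.spark.sql.SparkSession

object StreamProcess_1 {
  def runStream(spark: SparkSession): Unit = {
    spark.readStream
      .format("avro")                                            // Avro file source (Spark 2.4+)
      .schema(sourceASchema)                                     // file streams need an explicit schema; illustrative value
      .load("s3://bucket/A/")
      .writeStream
      .format("delta")
      .option("checkpointLocation", "s3://delta/_checkpoints/A") // illustrative checkpoint path
      .start("s3://delta/A/")
  }
}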

How Spark SQL reads Parquet partitioned files

I have a parquet file of around 1 GB. Each data record is a reading from an IOT device which captures the energy consumed by the device in the last one minute.
Schema: houseId, deviceId, energy
The parquet file is partitioned on houseId and deviceId. A file contains the data for the last 24 hours only.
I want to execute some queries on the data residing in this Parquet file using Spark SQL. An example query finds the average energy consumed per device for a given house in the last 24 hours.
Dataset<Row> df4 = ss.read().parquet("/readings.parquet");
df4.as(encoder).registerTempTable("deviceReadings");
ss.sql("Select avg(energy) from deviceReadings where houseId=3123").show();
The above code works well. I want to understand how Spark executes this query.
Does Spark read the whole Parquet file in memory from HDFS without looking at the query? (I don't believe this to be the case)
Does Spark load only the required partitions from HDFS as per the query?
What if there are multiple queries that need to be executed? Will Spark look at multiple queries while preparing an execution plan? One query may work with just one partition, whereas the second query may need all the partitions, so a consolidated plan would load the whole file from disk into memory (if memory limits allow it).
Will it make a difference in execution time if I cache df4 dataframe above?
Does Spark read the whole Parquet file in memory from HDFS without looking at the query?
It shouldn't scan all the data files, but in general it may access the metadata of all the files.
Does Spark load only the required partitions from HDFS as per the query?
Yes, it does.
Will Spark look at multiple queries while preparing an execution plan?
It does not. Each query has its own execution plan.
Will it make a difference in execution time if I cache df4 dataframe above?
Yes, at least for now, it will make a difference - Caching dataframes while keeping partitions
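One way to confirm the partition-pruning behaviour yourself (a small Scala sketch reusing the same file): explain() prints the physical plan, and its PartitionFilters entry shows that only the matching houseId directories are scanned.
import org.apache.spark.sql.functions.col

val readings = spark.read.parquet("/readings.parquet")
readings.filter(col("houseId") === 3123).explain()
// Look for a PartitionFilters entry containing houseId = 3123 in the FileScan node.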

Spark Streaming to Hive, too many small files per partition

I have a spark streaming job with a batch interval of 2 mins(configurable).
This job reads from a Kafka topic and creates a Dataset and applies a schema on top of it and inserts these records into the Hive table.
The Spark Job creates one file per batch interval in the Hive partition like below:
dataset.coalesce(1).write().mode(SaveMode.Append).insertInto(targetEntityName);
Now the data that comes in is not that big, and even if I increase the batch duration to, say, 10 minutes, I might still end up with only 2-3 MB of data, which is far less than the block size.
This is the expected behaviour in Spark Streaming.
I am looking for efficient ways to do a post processing to merge all these small files and create one big file.
If anyone's done it before, please share your ideas.
I would encourage you to not use Spark to stream data from Kafka to HDFS.
Kafka Connect HDFS Plugin by Confluent (or Apache Gobblin by LinkedIn) exist for this very purpose. Both offer Hive integration.
Find my comments about compaction of small files in this Github issue
If you need to write Spark code to process Kafka data into a schema, then you can still do that, and write into another topic in (preferably) Avro format, which Hive can easily read without a predefined table schema.
I personally have written a "compaction" process that actually grabs a bunch of hourly Avro data partitions from a Hive table, then converts into daily Parquet partitioned table for analytics. It's been working great so far.
If you want to batch the records before they land on HDFS, that's where Kafka Connect or Apache Nifi (mentioned in the link) can help, given that you have enough memory to store records before they are flushed to HDFS.
I have exactly the same situation as you. I solved it by:
Let's assume that your newly arriving data is stored in a dataset: dataset1
1- Partition the table with a good partition key; in my case I found I can partition using a combination of keys to get around 100 MB per partition.
2- Save using the Spark core API rather than Spark SQL:
a- Load the whole partition into memory (into a dataset: dataset2) when you want to save.
b- Then apply the dataset union function: dataset3 = dataset1.union(dataset2)
c- Make sure the resulting dataset is partitioned as you wish, e.g. dataset3.repartition(1)
d- Save the resulting dataset in Overwrite mode to replace the existing files.
If you need more details about any step please reach out.
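A minimal Scala sketch of the steps above, assuming dataset1 holds the newly arrived records and that the partition and staging paths below are hypothetical:
import org.apache.spark.sql.SaveMode

val partitionPath = "hdfs:///warehouse/target/dt=2019-01-01"   // hypothetical partition
val stagingPath   = partitionPath + "_staging"                 // hypothetical scratch area

// a) load the existing partition, b) union it with the new data
val dataset2 = spark.read.parquet(partitionPath)
val dataset3 = dataset1.union(dataset2)

// c) collapse to one file; d) overwrite the partition. Spark refuses to overwrite
// a path it is still reading from, so stage the merged result first and then
// rewrite the partition from the staged copy.
dataset3.repartition(1).write.mode(SaveMode.Overwrite).parquet(stagingPath)
spark.read.parquet(stagingPath).write.mode(SaveMode.Overwrite).parquet(partitionPath)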

Alternate to recursively Running Spark-submit jobs

Below is the scenario I would like suggestions on.
Scenario:
Data ingestion is done through Nifi into Hive tables.
Spark program would have to perform ETL operations and complex joins on the data in Hive.
Since the data ingested from Nifi is continuous streaming, I would like the Spark jobs to run every 1 or 2 mins on the ingested data.
Which is the best option to use?
Trigger spark-submit jobs every 1 min using a scheduler?
How do we reduce the overhead and time lag of recursively submitting the job to the Spark cluster? Is there a better way to run a single program recursively?
Run a spark streaming job?
Can spark-streaming job get triggered automatically every 1 min and process the data from hive? [Can Spark-Streaming be triggered only time based?]
Is there any other efficient mechanism to handle such scenario?
Thanks in Advance
If you need something that runs every minute, you'd be better off using Spark Streaming rather than batch.
You may want to get the data directly from Kafka rather than from the Hive table, since that is faster.
As for your question of which is better, batch or streaming: you can think of Spark Streaming as a micro-batch process that runs every "batch interval".
Read this : https://spark.apache.org/docs/latest/streaming-programming-guide.html
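For reference, a minimal Structured Streaming sketch of the "run every minute" idea, assuming a hypothetical Kafka topic and output paths; Trigger.ProcessingTime sets the micro-batch interval, so no external scheduler or repeated spark-submit is needed:
import org.apache.spark.sql.streaming.Trigger

val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")     // hypothetical brokers
  .option("subscribe", "ingest_topic")                  // hypothetical topic
  .load()
  .selectExpr("CAST(value AS STRING) AS value", "timestamp")

events.writeStream
  .format("parquet")
  .option("path", "/data/output")                       // hypothetical sink path
  .option("checkpointLocation", "/data/checkpoints")    // hypothetical checkpoint dir
  .trigger(Trigger.ProcessingTime("1 minute"))          // fire a micro-batch every minute
  .start()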
