I am able to develop a pipeline which reads from kafka does some transformations and write the output to kafka sink as well as parque sink. I would like adding effective logging to log the intermediate results of the transformation like in a regular streaming application.
One option I see is to log the queryExecutionstreams via
df.queryExecution.analyzed.numberedTreeString
or
logger.info("Query progress"+ query.lastProgress)
logger.info("Query status"+ query.status)
But this doesn't seem to have a way to see the business specific messages on which the stream is running on.
Is there a way how I can add more logging info like the data which it's processing?
i found some options to track the same .Basically we can name our streaming Queries using df.writeStream.format("parquet")
.queryName("table1")
The query name table1 will be printed in the Spark Jobs Tab against the Completed Jobs list in the Spark UI from which you can track the status for each of the streaming queries
2) Use the ProgressReporter API in structured streaming to collect more stats
Related
We have some data (millions) in hive tables which comes everyday. Next day, once the over-night ingestion is complete different applications query us for data (using sql)
We take this sql and make a call on spark
spark.sqlContext.sql(statement) // hive-metastore integration is enabled
This is causing too much memory usage on spark driver, can we use spark streaming (or structured streaming), to stream the results in a piped fashion rather than collecting everything on driver and then sending to clients ?
We don't want to send out the data as soon it comes ( in typical streaming apps), but want to send a streaming data to clients when they ask (PULL) for data.
IIUC..
Spark Streaming is mainly designed to process streaming data by converting into batches of Milliseconds to Seconds.
You can look over streamingDF.writeStream.foreachBatch { (batchDF: DataFrame, batchId: Long) provides you a very good functionality for Spark to write
Streaming processed output Sink in micro-batch manner.
Nevertheless Spark structured streaming don't have a standard JDBC source defined to read from.
Work out for an option to directly store Hive underlying files in compressed and structured manner, transfer them directly rather than selecting through spark.sql if every client needs same/similar data or partition them based on where condition of spark.sql query and transfer needed files further.
Source:
Structured Streaming queries are processed using a micro-batch processing engine, which processes data streams as a series of small batch jobs thereby achieving end-to-end latencies as low as 100 milliseconds and exactly-once fault-tolerance guarantees.
ForeachBatch:
foreachBatch(...) allows you to specify a function that is executed on the output data of every micro-batch of a streaming query. Since Spark 2.4, this is supported in Scala, Java and Python. It takes two parameters: a DataFrame or Dataset that has the output data of a micro-batch and the unique ID of the micro-batch.
We have a Spark streaming use case where we need to compute some metrics from ingested events (in Kafka), but the computations require additional metadata which are not present in the events.
The obvious design pattern I can think of is to make point queries to the metadata tables (on the master DB) from spark executor tasks and use that metadata info during the processing of each event.
Another idea would be to "enrich" the ingested events in a separate pipeline as a preprocessor step before sending them to Kafka. This could be done, say by another service or task.
The second scenario is more useful in cases when the domain/environment where Spark/hadoop runs is isolated from the domain of the master DB where all metadata is stored.
Is there a general consensus on how this type of event "enrichment" should be done? What other considerations am I missing here?
Typically the first approach that you thought about is correct and meets your requirements.
There is know that within Apache Spark you can join data-in-motion with data-at-rest.
In other words you have your streaming context that continuously stream data from Kafka.
val dfStream = spark.read.kafka(...)
At the same time you can connect to the metastore DB (e.g spark.read.jdbc)
val dfMetaDb = spark.read.jdbc(...)
You can join them together
dsStream.join(dfMetaDB)
and continue the process from this point on.
The benefits is that you don't touch other components and rely only on Spark processing capabilities.
I need to export data from Hive to Kafka topics based on some events in another Kafka topic. I know I can read data from hive in Spark job using HQL and write it to Kafka from the Spark, but is there a better way?
This can be achieved using unstructured streaming. The steps mentioned below :
Create a Spark Streaming Job which connects to the required topic and fetched the required data export information.
From stream , do a collect and get your data export requirement in Driver variables.
Create a data frame using the specified condition
Write the data frame into the required topic using kafkaUtils.
Provide a polling interval based on your data volume and kafka write throughputs.
Typically, you do this the other way around (Kafka to HDFS/Hive).
But you are welcome to try using the Kafka Connect JDBC plugin to read from a Hive table on a scheduled basis, which converts the rows into structured key-value Kafka messages.
Otherwise, I would re-evaulate other tools because Hive is slow. Couchbase or Cassandra offer much better CDC features for ingestion into Kafka. Or re-write the upstream applications that inserted into Hive to begin with, rather to write immediately into Kafka, from which you can join with other topics, for example.
I would like to use external metrics system to monitor stream progress in spark. For this I should send notifications with metrics as soon as possible (number of read, transformed and written records)
StreamExecution uses ProgressReporter to send QueryProgressEvents with statistics (numInputRows, processedRowsPerSecond etc) to StreamingQueryListener. The problem is it happens when all data in batch are processed. However I would like to get a notification with the number of input rows as soon as they read from source (before transformation and write happens) and then number written records when data sent to a sink.
Is there a way to get such kind of metrics per batch in structured streaming in real time?
Metrics for structured streaming are not currently implemented out of the box anywhere besides the databricks platform. The only way to get them via open source spark is to extend the streaming query listener class and write your own.
Please forgive if this question doesn't make sense, as I am just starting out with Spark and trying to understand it.
From what I've read, Spark is a good use case for doing real time analytics on streaming data, which can then be pushed to a downstream sink such as hdfs/hive/hbase etc.
I have 2 questions about that. I am not clear if there is only 1 spark streaming job running or multiple at any given time. Say I have different analytics I need to perform for each topic from Kafka or each source that is streaming into Kafka, and then push the results of those downstream.
Does Spark allow you to run multiple streaming jobs in parallel so you can keep aggregate analytics separate for each stream, or in this case each Kafka topic. If so, how is that done, any documentation you could point me to ?
Just to be clear, my use case is to stream from different sources, and each source could have potentially different analytics I need to perform as well as different data structure. I want to be able to have multiple Kafka topics and partitions. I understand each Kafka partition maps to a Spark partition, and it can be parallelized.
I am not sure how you run multiple Spark streaming jobs in parallel though, to be able to read from multiple Kafka topics, and tabulate separate analytics on those topics/streams.
If not Spark is this something thats possible to do in Flink ?
Second, how does one get started with Spark, it seems there is a company and or distro to choose for each component, Confluent-Kafka, Databricks-Spark, Hadoop-HW/CDH/MAPR. Does one really need all of these, or what is the minimal and easiest way to get going with a big data pipleine while limiting the number of vendors ? It seems like such a huge task to even start on a POC.
You have asked multiple questions so I'll address each one separately.
Does Spark allow you to run multiple streaming jobs in parallel?
Yes
Is there any documentation on Spark Streaming with Kafka?
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
How does one get started?
a. Book: https://www.amazon.com/Learning-Spark-Lightning-Fast-Data-Analysis/dp/1449358624/
b. Easy way to run/learn Spark: https://community.cloud.databricks.com
I agree with Akbar and John that we can run multiple streams reading from different sources in parallel.
I like add that if you want to share data between streams, you can use Spark SQL API. So you can register your RDD as a SQL table and access the same table in all the streams. This is possible since all the streams share the same SparkContext