I'm building a Spark Streaming pipeline from Apache Kafka into our columnar database.
To ensure fault tolerance I'm using HDFS checkpointing and the write-ahead log.
Apache Kafka topic -> Spark Streaming -> HDFS checkpoint -> Spark SQL (for message manipulation) -> Spark JDBC into our DB.
When I run a Spark job for one topic and one table, everything works fine.
Now I'm trying to stream multiple Kafka topics in a single Spark job and write to multiple tables, and this is where the problem with the checkpoint starts (the checkpoint is per topic/table).
The problem is with checkpoints :(
1) If I use "KafkaUtils.createDirectStream" with a list of topics and "groupBy" on the topic name, there is still only one checkpoint folder. If, for example, I need to increase resources during ongoing streaming (change the number of cores due to Kafka lag), that is impossible, because today it's only possible by deleting the checkpoint folder and restarting the Spark job (a sketch of this setup follows the list).
2) Use multiple Spark StreamingContexts; I will try this today and see if it works.
3) Multiple Spark Streaming jobs with high-level consumers (offsets saved in Kafka 0.10...).
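For reference, a minimal sketch of option 1: a single direct stream subscribed to several topics (an existing StreamingContext ssc is assumed; topic names and Kafka parameters are illustrative):

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.streaming.kafka010._

// Illustrative Kafka parameters; adjust brokers and group id for your cluster
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "broker1:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "multi-topic-job"
)

// One direct stream subscribed to all topics; note it shares a single
// checkpoint folder, which is exactly the limitation described above
val topics = List("topicA", "topicB")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](topics, kafkaParams)
)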
Any other ideas/solutions that I'm missing?
Does Structured Streaming with multiple Kafka topics and checkpoints behave differently?
Thx
Related
I have a requirement to implement a solution for the use case below.
Currently our applications store data in a Postgres database, but Postgres is facing storage issues. The plan is therefore to move the data from Postgres to Hadoop, with near-real-time data available in Hadoop. So we thought of the solution below:
1. Write a Kafka producer application that listens to the Postgres tables, captures changing data, and writes it to a Kafka topic.
2. Write a Kafka sink application that reads from the Kafka topic and writes to Hive tables (Parquet, external tables, partitioned and non-partitioned). For non-partitioned tables, if we want to apply updates/deletes, we need to touch the whole table in the Spark code, right? That will degrade performance for every record coming from the Kafka topic. We have already developed a Sqoop incremental job that runs every 5 minutes to do the same, but the client needs real-time data in Hadoop, so Kafka + Spark processing came into the discussion.
Could you provide pros and cons of step 2 compared to the Sqoop incremental job?
Please share code snippets/links, if any, that would help my thought process.
Getting data into Kafka is easy - use Debezium.
For getting it out...
I wouldn't use Hive at all for this. Real-time data (depending on the volume of the data, obviously) results in tiny files in HDFS, and Hive queries consequently become slower and slower over time.
Hive is not a replacement for Postgres. In fact, the Hive metastore still requires a relational database, such as Postgres.
I also wouldn't use Spark. You'd have to write code for something — ingesting Kafka topics into queryable formats — that is already a solved problem with other tools.
Popular options include Apache Pinot, Druid, or Apache Iceberg storage with Presto (some of which may overlap with HDFS storage, but will be much, much faster than Hive to query). Only the third option requires writing Kafka consumer code; the other two have native Kafka ingestion.
And even then, if you're stuck with HDFS, the Kafka Connect framework comes with Kafka, and there's an HDFS sink plugin, written by Confluent, which supports Hive integration.
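For concreteness, a minimal sketch of such a connector config with Hive integration enabled (it assumes Confluent's HDFS sink is installed; hostnames, the topic name, and sizing values are illustrative):

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
topics=postgres-server.public.mytable
hdfs.url=hdfs://namenode:8020
format.class=io.confluent.connect.hdfs.parquet.ParquetFormat
flush.size=1000
hive.integration=true
hive.metastore.uris=thrift://metastore:9083
schema.compatibility=BACKWARD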
I am trying to calculate Kafka lag for my Spark Structured Streaming application.
I can get the currently processed offsets from the Kafka metadata that comes along with the actual data.
Is there a way to programmatically get the latest offsets of all partitions in a Kafka topic from the Spark interface?
Can I use the Apache Kafka admin classes or Kafka interfaces to get the latest offset information for each batch in my Spark app?
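For context, one way to fetch the latest offsets outside of Spark is the plain Kafka consumer API, whose result can be diffed against the offsets Spark reports per batch; a minimal sketch (broker address and topic name are illustrative):

import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.StringDeserializer

// Throwaway consumer used only for offset lookups, not for reading data
val props = new Properties()
props.put("bootstrap.servers", "broker1:9092")
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)
val consumer = new KafkaConsumer[String, String](props)

// Latest offset per partition; lag = end offset - last processed offset
val partitions = consumer.partitionsFor("mytopic").asScala
  .map(p => new TopicPartition(p.topic, p.partition)).asJava
val endOffsets = consumer.endOffsets(partitions)
consumer.close()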
I have a long-running Spark Structured Streaming job that is ingesting Kafka data, and I have one concern: if the job fails for some reason and is restarted later, how do I ensure that Kafka data will be ingested from the breaking point, instead of always ingesting only current and later data when the job restarts? Do I need to explicitly specify something like a consumer group and auto.offset.reset, etc.? Are they supported in Spark's Kafka ingestion? Thanks!
According to the Spark Structured Streaming + Kafka Integration Guide, Spark itself keeps track of the offsets and no offsets are committed back to Kafka. That means if your Spark Streaming job fails and you restart it, all necessary information about the offsets is stored in Spark's checkpoint files. That way your application will know where it left off and continue to process the remaining data.
I have written more details about setting group.id and Spark's checkpointing of offsets in another post.
Here are the most important Kafka-specific configurations for your Spark Structured Streaming jobs:
group.id: the Kafka source will create a unique group id for each query automatically. According to the code, the group.id is automatically set to:
val uniqueGroupId = s"spark-kafka-source-${UUID.randomUUID}-${metadataPath.hashCode}"
auto.offset.reset: set the source option startingOffsets to specify where to start instead. Structured Streaming manages which offsets are consumed internally, rather than relying on the Kafka consumer to do it.
enable.auto.commit: the Kafka source doesn't commit any offsets.
Therefore, in Structured Streaming it is currently not possible to define a custom group.id for the Kafka consumer, and Structured Streaming manages the offsets internally without committing them back to Kafka (not even automatically).
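To make that concrete, a minimal sketch of a checkpointed Kafka query (broker, topic, and paths are illustrative); on restart with the same checkpointLocation the query resumes from the offsets recorded there, and startingOffsets only applies to the very first run:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

// startingOffsets is only honoured on a fresh start;
// afterwards the checkpoint decides where to resume
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "mytopic")
  .option("startingOffsets", "earliest")
  .load()

val query = df.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("parquet")
  .option("path", "/data/kafka-ingest")
  .option("checkpointLocation", "/checkpoints/kafka-ingest")
  .start()

query.awaitTermination()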
I'm working on an application that connects to a Kafka source, and on that same source I want to create multiple streaming queries with different filter conditions. Each query would apply some business logic and write its output to HBase.
I'm trying to solve some race conditions in the business logic and want to understand how the internals of Spark Structured Streaming work while reading from Kafka.
1) How many Kafka consumers would be created across the application? Is it related to the number of partitions of the topic, or to the number of executors running the application?
2) Would each of the streaming queries write to the same unbounded table, with one Kafka consumer per query?
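For reference, a minimal sketch of the setup being described: one Kafka source DataFrame feeding two filtered queries (topic, filter values, and the console sink are illustrative; writing to HBase would need a custom sink):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("multi-query").getOrCreate()
import spark.implicits._

val source = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker1:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS value")

// Each start() spawns an independent query with its own Kafka consumers
// and its own checkpoint; the queries do not share one read of the topic
val q1 = source.filter($"value".contains("typeA")).writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/q1")
  .start()

val q2 = source.filter($"value".contains("typeB")).writeStream
  .format("console")
  .option("checkpointLocation", "/checkpoints/q2")
  .start()

spark.streams.awaitAnyTermination()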
In our project we are considering using Kafka with Spark Streaming; for a PoC I am using Spark 2.4.1, Kafka, and Java 8.
I have some questions:
How to handle missing data into Kafka topics ingestion?
How to maintain the auditing for the same? What is the big data industry practice in this?
What should be the recovery mechanism to be followed? Any links or videos for the same?
How to handle missing data into Kafka topics ingestion?
I don't understand this. Does it mean data missing in the Kafka topic, or data from the Kafka topic missing in Spark Streaming?
The first can't be handled unless you're the producer of the data and can change things depending on the cause. The second is possible as long as the data is still available in the Kafka topic, which is governed by the retention period on the Kafka cluster.
How to maintain the auditing for the same?
There are a couple of things you could do. You can ask Kafka to manage the offsets by committing them (a sketch follows below), or you could write the offsets to some other store such as HBase, from which you can retrieve the offsets up to which you've successfully processed. With the latest Structured Streaming you don't need to manage such low-level details; Spark manages them in the checkpoint directory.
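For the first option, a minimal sketch of committing offsets back to Kafka from a direct stream (stream is assumed to be a DStream created with KafkaUtils.createDirectStream, as in the snippet below):

import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  // Capture the offset ranges of this batch before any transformations
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... process the batch and persist the results here ...

  // Commit the offsets back to Kafka once processing has succeeded
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}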
What should be the recovery mechanism to be followed?
It depends on which option you're using. If you have the offset numbers in HBase, you can read them from HBase and use the KafkaUtils class to get messages starting from the given offsets:
// fromOffsets: Map[TopicPartition, Long], e.g. read back from HBase
import org.apache.spark.streaming.kafka010._

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Assign[String, String](fromOffsets.keys.toList, kafkaParams, fromOffsets)
)
More details at https://spark.apache.org/docs/2.2.0/streaming-kafka-0-10-integration.html