How to stream 100 GB of data from a Kafka topic? - apache-spark

So, in one of our Kafka topics, there is close to 100 GB of data.
We are running Spark Structured Streaming to land the data in S3.
When the data is up to 10 GB, streaming runs fine and we are able to get the data into S3.
But with 100 GB, it is taking forever to stream the data from Kafka.
Question: How does Spark Streaming read data from Kafka?
Does it take the entire data from the current offset?
Or does it take it in batches of some size?

Spark will work off consumer groups, just as any other Kafka consumer, but in batches. Therefore it takes as much data as possible (based on various Kafka consumer settings) from the last consumed offsets. In theory, if you have the same number of partitions and the same commit interval as with 10 GB, it should only take 10x longer to do 100 GB. You've not stated how long that currently takes, but to some people 1 minute vs 10 minutes might seem like "forever", sure.
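Those per-trigger batch sizes can also be capped explicitly. A minimal sketch, assuming hypothetical broker and topic names; maxOffsetsPerTrigger is the Structured Streaming Kafka source option that bounds how much each micro-batch pulls:

```python
def build_bounded_reader(spark, bootstrap_servers, topic, max_offsets=500_000):
    """Return a streaming DataFrame whose micro-batches are capped in size."""
    # maxOffsetsPerTrigger bounds the total offsets pulled per trigger,
    # split proportionally across the topic's partitions, so a 100 GB
    # backlog is drained in many bounded batches instead of one huge one.
    return (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", bootstrap_servers)  # hypothetical broker
            .option("subscribe", topic)                            # hypothetical topic
            .option("startingOffsets", "earliest")
            .option("maxOffsetsPerTrigger", max_offsets)
            .load())

# Back-of-the-envelope share: a 500k cap over 10 equally loaded
# partitions is ~50k records per partition per trigger.
per_partition_share = 500_000 // 10
```

The cap value here is a placeholder; tune it against your observed consumer lag.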
I would recommend you plot the consumer lag over time using the kafka-consumer-groups command-line tool combined with something like Burrow or Remora. If you notice an upward trend in the lag, then Spark is not consuming records fast enough.
To overcome this, the first option would be to ensure that the number of Spark executors is evenly consuming all Kafka partitions.
You'll also want to make sure you're not doing major data transforms other than simple filters and maps between consuming and writing the records, as these also introduce lag.
For non-Spark approaches, I would point out that the Confluent S3 connector is also batch-y in that it only periodically flushes to S3, but the consumption itself is still closer to real-time than Spark's. I can verify that it is able to write very large S3 files (several GB in size), though, if the heap is large enough and the flush configurations are set to large values.
Secor by Pinterest is another option that requires no manual coding.

Related

PySpark Structured Streaming with Kafka - Scaling Consumers for multiple topics with different loads

We subscribe to 7 topics with spark.readStream in a single running Spark app.
After transforming the event payloads, we save them with spark.writeStream to our database.
For one of the topics, the data is inserted only batch-wise (once a day) with a very high load. This delays our reading from all other topics, too. For example (in Grafana), the delay between a produced and a consumed record stays below 1 minute across all topics for the whole day. When the bulk topic receives its events, our delay increases up to 2 hours on all (!) topics.
How can we solve this? We already tried 2 successive readStreams (with the bulk topic separate), but it didn't help.
Further info: we use 6 executors with 2 executor cores. The topics have different numbers of partitions (3 to 30). Structured Streaming Kafka Integration v0.10.0.
General question: how can we scale the consumers in Spark Structured Streaming? Is 1 readStream equal to 1 consumer? Or 1 executor? Or something else?
Partitions are the main source of parallelism in Kafka, so I suggest you increase the number of partitions (at least for the topic that has performance issues). You may also tweak some of the consumer caching options mentioned in the docs. Try to keep the number of partitions at 2^n. Finally, you may increase the size of the driver machine if possible.
I'm not completely sure, but I think Spark will try to keep the same number of consumers as the number of partitions per topic. Also, I think the stream is actually always fetched by the Spark driver (not by the workers).
We found a solution for our problem:
Our Grafana dashboard after the change shows that the batch-data topic still peaks, but without blocking consumption on the other topics.
What we did:
We still have 1 spark app. We used 2 successive spark.readStreams but also added a sink for each.
In code:
priority_topic_stream = (spark.readStream.format('kafka')
    .options(..).option('subscribe', ','.join([T1, T2, T3])).load())
bulk_topic_stream = (spark.readStream.format('kafka')
    .options(..).option('subscribe', BULK_TOPIC).load())

priority_topic_stream.writeStream.foreachBatch(..).trigger(..).start()
bulk_topic_stream.writeStream.foreachBatch(..).trigger(..).start()

spark.streams.awaitAnyTermination()
To minimize the peak on the bulk stream we will try increasing its partitions, as advised by #partlov. But that would only have sped up consumption on the bulk stream, not resolved the issue of it blocking our reads from the priority topics.

Increase the output size of Spark Structured Streaming job

Context: I have a Spark Structured Streaming job with Kafka as the source and S3 as the sink. The outputs in S3 are again picked up as input by other MapReduce jobs.
I, therefore, want to increase the output size of the files on S3 so that the MapReduce job works efficiently.
Currently, because of small input size, the MapReduce jobs are taking way too long to complete.
Is there a way to configure the streaming job to wait for at least 'X' number of records to process?
You probably want the micro-batch trigger to wait until sufficient data is available at the source. You can use the minOffsetsPerTrigger option to wait for sufficient data to be available in Kafka.
Make sure to also set a sufficient maxTriggerDelay as per your application's needs.
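A sketch of how those two options could be combined on the reader, assuming a newer Spark version that supports them and placeholder broker/topic names:

```python
# Options that make the trigger wait for a minimum batch size, with a
# safety valve so the stream never stalls indefinitely.
MIN_BATCH_OPTIONS = {
    "minOffsetsPerTrigger": "1000000",  # wait until ~1M offsets are available...
    "maxTriggerDelay": "10m",           # ...but fire anyway after 10 minutes
}

def build_min_batch_reader(spark, bootstrap_servers, topic):
    """Reader that waits for enough data before each micro-batch."""
    reader = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", bootstrap_servers)  # hypothetical
              .option("subscribe", topic))                           # hypothetical
    for key, value in MIN_BATCH_OPTIONS.items():
        reader = reader.option(key, value)
    return reader.load()
```

The 1M/10m values are placeholders; pick them so a batch is large enough for your downstream MapReduce jobs.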
In reality, no, there is not.
No for Spark prior to 3.x.
Yes and no for Spark 3.x, which effectively equates to no.
minOffsetsPerTrigger was introduced, but it has a catch, as per below. That means the overall answer still remains no.
From the manuals:
Minimum number of offsets to be processed per trigger interval. The specified total number of offsets will be proportionally split across topicPartitions of different volume. Note, if the maxTriggerDelay is exceeded, a trigger will be fired even if the number of available offsets doesn't reach minOffsetsPerTrigger.
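The proportional split that quote describes can be illustrated in plain Python (an illustration of the documented behaviour, not Spark's actual implementation):

```python
def split_offsets_proportionally(target_offsets, backlog_per_partition):
    """Split a per-trigger offset target across partitions by their backlog."""
    total_backlog = sum(backlog_per_partition)
    if total_backlog <= target_offsets:
        return list(backlog_per_partition)  # everything fits in one trigger
    # Each partition contributes in proportion to how much it has queued.
    return [target_offsets * backlog // total_backlog
            for backlog in backlog_per_partition]
```

For example, a 100k target over backlogs of 300k and 100k yields shares of 75k and 25k.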

Can Spark/EMR read data from s3 multi-threaded

Due to some unfortunate sequences of events, we've ended up with a very fragmented dataset stored on s3. The table metadata is stored on Glue, and data is written with "bucketBy", and stored in parquet format. Thus discovery of the files is not an issue, and the number of spark partitions is equal to the number of buckets, which provides a good level of parallelism.
When we load this dataset on Spark/EMR, each Spark partition ends up loading around ~8k files from S3.
As we've stored the data in a columnar format, and our use case only needs a couple of fields, we don't really read all the data but only a very small portion of what is stored.
Based on CPU utilization on the worker nodes, I can see that each task (running one per partition) is utilizing only around 20% of its CPU, which I suspect is due to a single thread per task reading files from S3 sequentially, so lots of iowait...
Is there a way to encourage Spark tasks on EMR to read data from S3 multi-threaded, so that we can read multiple files at the same time from S3 within a task? That way, we could utilize the 80% idle CPU to make things a bit faster.
There are two parts to reading S3 data with Spark dataframes:
Discovery (listing the objects on S3)
Reading the S3 objects, including decompressing, etc.
Discovery typically happens on the driver. Some managed Spark environments have optimizations that use cluster resources for faster discovery. This is not typically a problem unless you get beyond 100K objects. Discovery is slower if you have .option("mergeSchema", true), as each file will have to be touched to discover its schema.
Reading S3 files is part of executing an action. The parallelism of reading is min(number of partitions, number of available cores). More partitions + more available cores means faster I/O... in theory. In practice, S3 can be quite slow if you haven't accessed these files regularly enough for S3 to scale their availability up. Therefore, in practice, additional Spark parallelism has diminishing returns. Watch the total network RW bandwidth per active core and tune your execution for the highest value.
You can discover the number of partitions with df.rdd.partitions.length.
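Putting those two facts together: in PySpark the partition count is df.rdd.getNumPartitions(), and the effective read fan-out can be sketched as plain arithmetic (illustrative, not a Spark API):

```python
def effective_read_parallelism(num_partitions, num_available_cores):
    # Spark assigns at most one partition per task and one task per free core,
    # so the read fan-out is capped by whichever is smaller.
    return min(num_partitions, num_available_cores)

# E.g. 32 bucketed partitions on a cluster with 96 free cores: only 32
# tasks read at once, however many files each partition contains.
fan_out = effective_read_parallelism(32, 96)
```

This is why a bucketed dataset with few buckets can leave most cores idle regardless of cluster size.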
There are additional things you can do if the S3 I/O throughput is low:
Make sure the data on S3 is dispersed when it comes to its prefix (see https://docs.aws.amazon.com/AmazonS3/latest/dev/optimizing-performance.html).
Open an AWS support request and ask the prefixes with your data to be scaled up.
Experiment with different node types. We have found storage-optimized nodes to have better effective I/O.
Hope this helps.

Storing data from sensors into hdfs

I am working on a project that involves using HDFS for storage and Spark for computation.
I need to store data from sensors into HDFS in real time.
For example, I have a weather station where the sensor generates data (temperature, pressure) every 5 seconds. I would like to know how to store this data in HDFS in real time.
Writing a lot of small files directly to HDFS may have some undesirable effects, as it affects master node memory usage and may lead to lower processing speed compared with batch processing.
Each of your sensors will produce about 500k files monthly, so unless you have a very limited number of sensors, I would suggest you take a look at message brokers. Apache Kafka (https://kafka.apache.org/) is a well-known one and already bundled in some Hadoop distros. You can use it to "stage" your data and process it in (mini-)batches, for example.
Finally, if you need to process incoming data in a real-time manner (CEP and so on), I would recommend paying attention to Spark Streaming (https://spark.apache.org/streaming/).
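A minimal sketch of the staged pipeline described above, assuming the Structured Streaming API, a hypothetical sensor topic, and HDFS paths of your choosing:

```python
def stage_sensor_readings(spark, bootstrap_servers, topic, out_path, checkpoint):
    """Consume sensor readings staged in Kafka and land them on HDFS in mini-batches."""
    raw = (spark.readStream
           .format("kafka")
           .option("kafka.bootstrap.servers", bootstrap_servers)  # hypothetical broker
           .option("subscribe", topic)                            # hypothetical topic
           .load())
    readings = raw.selectExpr("CAST(value AS STRING) AS reading")
    return (readings.writeStream
            .format("parquet")
            .option("path", out_path)               # e.g. an hdfs:// path
            .option("checkpointLocation", checkpoint)
            .trigger(processingTime="5 minutes")    # fewer, larger files per window
            .start())

# The small-files math above: one file per 5-second reading would be
# ~518k files per sensor per month, hence the 5-minute trigger.
files_per_month = (60 // 5) * 60 * 24 * 30
```

The 5-minute trigger interval is a placeholder; lengthen it if the resulting HDFS files are still too small.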

Increase number of partitions in Dstream to be greater then Kafka partitions in Direct approach

There are 32 Kafka partitions and 32 consumers, as per the direct approach.
But the data processing for the 32 consumers is slower than the Kafka ingest rate (by 1.5x), which creates a backlog of data in Kafka.
I want to increase the number of partitions for the DStream received by each consumer.
I would like the solution to be along the lines of increasing partitions on the consumers rather than increasing partitions in Kafka.
In the direct stream approach, you can have at most #consumers = #partitions: Kafka does not allow more than one consumer per partition per group.id. By the way, you are asking for more partitions per consumer? It will not help, since your consumers are already running at full capacity and are still insufficient.
A few technical changes you can try to reduce the data backlog in Kafka:
Increase the number of partitions - although you do not want to do this, it is still the easiest approach. Sometimes the platform just needs more hardware.
Optimize processing on the consumer side - check the possibility of record de-duplication before processing, reduce disk I/O, use loop unrolling techniques, etc. to reduce the time taken by the consumers.
(Higher difficulty) Controlled data distribution - it is often the case that some partitions process better than others. It may be worth checking whether this happens on your platform. Kafka's data distribution policy has some preferences (as does the message key) which often cause uneven load inside the cluster: https://www.cloudera.com/documentation/kafka/latest/topics/kafka_performance.html
Assuming you have enough hardware resources allocated to the consumers, you can check the parameter below:
spark.streaming.kafka.maxRatePerPartition
It sets the number of records you consume from a single Kafka partition per second.
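A sketch of setting that limit with the legacy DStream API; the app name, rate, and batch interval are placeholders:

```python
def build_throttled_context(app_name, batch_interval_s, max_rate_per_partition):
    """StreamingContext whose direct Kafka stream is rate-limited per partition."""
    from pyspark import SparkConf, SparkContext          # legacy DStream API
    from pyspark.streaming import StreamingContext
    conf = (SparkConf()
            .setAppName(app_name)
            .set("spark.streaming.kafka.maxRatePerPartition",
                 str(max_rate_per_partition)))
    sc = SparkContext(conf=conf)
    return StreamingContext(sc, batch_interval_s)

# Resulting cap per batch = rate * partitions * batch interval:
# 1000 records/s/partition * 32 partitions * 10 s batches = 320k records/batch.
records_per_batch = 1000 * 32 * 10
```

Throttling trades backlog in Kafka for stable batch times in Spark; it does not make the consumers faster.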
