Increasing network load in HDFS traffic with stream jobs and Kafka

We are experiencing unexplained behaviour with our new EMR setup, which includes:
EMR 5.16 (3 nodes - c4.8xlarge and 1 master - c4.8xlarge)
Kafka Cluster based on ECS
We are running a simple streaming job that reads from a Kafka topic, applies some logic, and uses writeStream to write back to a Kafka topic (with checkpointLocation set to an HDFS path).
The "problem" is that in Ganglia I can see increasing network traffic going out from the driver (which runs on one of the slaves) to the master server.
From a simple pcap capture I can see that the traffic belongs to port 50010 (Hadoop Data Transfer), and here I'm at a dead end.
Some help needed, thanks!

After some investigation and inspecting the payload of the traffic, it turned out to be the event logs being sent to the master! They were being delivered to the Spark history server and stored in HDFS.
I just needed to add this config to my spark-submit: --conf spark.eventLog.enabled=false
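For reference, here is a rough sketch of the kind of job described above, with event logging turned off in code rather than via spark-submit. The topic names, broker addresses and checkpoint path are placeholders, not the real ones:

import org.apache.spark.sql.SparkSession

object KafkaToKafka {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-kafka")
      .config("spark.eventLog.enabled", "false") // stop event logs being shipped to HDFS / the history server
      .getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092") // placeholder addresses
      .option("subscribe", "input-topic")                             // placeholder topic
      .load()

    // "some logic" stands in for the real transformation; here the value is just passed through as a string
    val output = input.selectExpr("CAST(value AS STRING) AS value")

    output.writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092,broker2:9092")
      .option("topic", "output-topic")                                    // placeholder topic
      .option("checkpointLocation", "hdfs:///checkpoints/kafka-to-kafka") // checkpoint on HDFS, as in the question
      .start()
      .awaitTermination()
  }
}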

Related

How do I setup Spark application to pull from single Kafka topic on multiple Spark nodes?

My application has a Kafka input stream for a single topic, it does some filtering and aggregating of the data, and then writes to Elasticsearch. What I'm seeing is that while the application is distributed to all of the spark nodes and processing the data properly, only one node is pulling data, and the rest are idle.
Also, I am using an R53 hostname for the Kafka nodes. Should I use a comma-separated list of the Kafka nodes instead?
The topic has 20 partitions. I am running Spark 3.2.1 using only Spark Streaming (no DFS).
The topic has 20 partitions
Then up to 20 executors should be able to consume in parallel.
using an R53 hostname for the Kafka nodes
Any Kafka client, including Spark, needs to communicate with the brokers individually. This means you'll need to expose each broker's advertised.listeners setting such that Spark can communicate with each broker directly, not via a single DNS name / load balancer address. If only one broker is resolvable, then you'll only be able to consume from (or produce to) just that one.
Should I use a comma-separated list of the Kafka nodes instead
It's recommended, but not strictly necessary. For example, what if the broker at the single address provided is not responding? The bootstrap protocol returns all of the advertised.listeners addresses back to the client, based on the listener protocol it connected with.
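As a rough sketch (assuming the Structured Streaming Kafka source; broker hostnames, topic and paths are placeholders), listing several brokers in kafka.bootstrap.servers means the client can still bootstrap if one host is down, while actual consumption always goes to each broker's advertised.listeners address:

import org.apache.spark.sql.SparkSession

object MultiBrokerBootstrap {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("multi-broker-bootstrap").getOrCreate()

    val stream = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers",
        "broker-1.internal:9092,broker-2.internal:9092,broker-3.internal:9092") // all brokers, not one DNS name
      .option("subscribe", "my-topic") // 20 partitions -> up to 20 tasks can read in parallel
      .load()

    stream.writeStream
      .format("console")
      .option("checkpointLocation", "/tmp/checkpoints/multi-broker") // placeholder path
      .start()
      .awaitTermination()
  }
}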

Spark Streaming job doesn't fail when the connection to Kafka cannot be established

I'm using Spark Streaming on AWS EMR to connect to a Kafka cluster on AWS MSK. I'm using spark-sql-kafka-0-10 with Spark 2.4.3.
If the security groups are not correctly configured, the Spark Streaming jobs get stuck for hours with the following warning:
20/06/29 14:10:42 WARN NetworkClient: [Consumer clientId=consumer-1, groupId=spark-kafka-source...] Connection to node -1 could not be established. Broker may not be available.
I would expect the job to fail if the connection cannot be established.
Is there a way I can make the job fail? All the timeout values are set to the default values.
This warning message occurs because you don't have connectivity to one or more of the brokers in your Kafka cluster. (It can also happen when new brokers are added to the existing cluster without your knowledge.)
Before setting up a job, I would recommend checking connectivity between the producer's server and all of the Kafka brokers using telnet.
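If it helps, here is a quick connectivity check you can run from the client machine before submitting the job; it just opens a TCP socket to each broker, which is all a telnet test does (the broker addresses are placeholders):

import java.net.{InetSocketAddress, Socket}

object BrokerCheck {
  def main(args: Array[String]): Unit = {
    val brokers = Seq("b-1.msk.example:9092", "b-2.msk.example:9092") // placeholder broker addresses
    brokers.foreach { broker =>
      val Array(host, port) = broker.split(":")
      val socket = new Socket()
      try {
        socket.connect(new InetSocketAddress(host, port.toInt), 5000) // 5 second timeout
        println(s"OK    $broker")
      } catch {
        case e: Exception => println(s"FAIL  $broker (${e.getMessage})")
      } finally socket.close()
    }
  }
}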

Spark Streaming task shutdown gracefully when Kafka client sends messages asynchronously

I am building a Spark Streaming application that reads input messages from a Kafka topic, transforms them, and outputs the result messages to another Kafka topic. Now I am confused about how to prevent data loss when the application restarts, covering both the Kafka read and the output. Can setting the Spark configuration "spark.streaming.stopGracefullyOnShutdown" to true help?
You can configure Spark to checkpoint to HDFS and store the Kafka offsets in ZooKeeper (or HBase, or somewhere else that allows fast, fault-tolerant lookups).
Though, if you process some records and write the results before you're able to commit the offsets, you'll end up reprocessing those records on restart. It's claimed that Spark can do exactly-once with Kafka, but as far as I know that only holds with proper offset management: for example, set enable.auto.commit to false in the Kafka properties, then commit only after you've processed and written the data to its destination, as in the sketch below.
If you're just moving data between Kafka topics, Kafka Streams is the Kafka library included for exactly that, and it doesn't require YARN or a cluster scheduler.
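A rough sketch of that manual offset handling with the spark-streaming-kafka-0-10 DStream API (topic, group id and broker addresses are placeholders): enable.auto.commit is false, and offsets are committed back to Kafka only after the batch has been processed and written out, which gives at-least-once rather than exactly-once semantics:

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object ManualOffsetCommit {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("manual-offsets"), Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1:9092,broker2:9092", // placeholder addresses
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my-group",                           // placeholder group id
      "enable.auto.commit" -> (false: java.lang.Boolean)  // commit manually below instead
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("input-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      // record where this batch starts and ends before doing any work
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... transform and write the results to the output topic here ...

      // only after the write has succeeded, commit the offsets back to Kafka
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }

    ssc.start()
    ssc.awaitTermination()
  }
}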

App server Log process

I have a requirement from my client to process the application (Tomcat) server log files for a back-end REST-based app server which is deployed on a cluster. The client wants to generate "access" and "frequency" reports from that data with different parameters.
My initial plan is: get the data from the app server logs --> push it to Spark Streaming via Kafka and process it --> store the data in Hive --> use Zeppelin to query the processed, centralized log data and generate the reports the client requires.
But as far as I know, Kafka does not have any feature to read data from a log file and post it to a Kafka broker on its own. In that case we would have to write a scheduled job that reads the logs from time to time and sends them to a Kafka broker, which I would prefer not to do, as it would not be real time and there could be synchronization issues to worry about, since we have 4 instances of the application server.
Another option, I think, is Apache Flume.
Can anyone suggest which would be the better approach in this case, or whether Kafka has any way to read data from a log file on its own, and what advantages or disadvantages each of the two options would have?
I guess another option is Flume + Kafka together, but I cannot speculate much about how that would go, as I have almost no knowledge of Flume.
Any help will be highly appreciated...... :-)
Thanks a lot ....
You can use Kafka Connect (the file source connector) to read/consume the Tomcat log files and push them to Kafka. Spark Streaming can then consume from the Kafka topics and churn through the data:
tomcat -> logs ---> kafka connect -> kafka -> spark -> Hive
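A rough sketch of the Spark end of that pipeline, assuming a Kafka Connect file source connector is already tailing the Tomcat logs into a topic called tomcat-logs (the topic, broker address, paths and the crude parsing are all placeholders):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.split

object TomcatLogPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("tomcat-log-pipeline").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder address
      .option("subscribe", "tomcat-logs")                // topic fed by the Kafka Connect file source
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // crude space-split of an access-log line; a real job would use a proper log-format parser
    val parsed = lines.select(
      split($"line", " ").getItem(0).as("client_ip"),
      split($"line", " ").getItem(6).as("path")
    )

    // append each micro-batch as Parquet files that an external Hive table (queried from Zeppelin) can point at
    parsed.writeStream
      .format("parquet")
      .option("path", "/user/hive/warehouse/tomcat_access")       // placeholder warehouse path
      .option("checkpointLocation", "hdfs:///checkpoints/tomcat") // placeholder checkpoint path
      .start()
      .awaitTermination()
  }
}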

How spark streaming data are stored

In Spark Streaming, stream data is received by receivers which run on the workers. The data is pushed into a data block periodically, and the receiver sends the receivedBlockInfo to the driver. I want to know: will Spark Streaming distribute the blocks across the cluster? (In other words, does it use a distributed storage strategy?) If it does not distribute the data across the cluster, how is workload balance guaranteed? (Imagine we have a cluster of tens of nodes but only a few receivers.)
As far as I know, data is received by the worker node where the receiver is running; it is not distributed across other nodes.
If you need the input stream to be repartitioned (balanced across the cluster) before further processing, you can use
inputStream.repartition(<number of partitions>)
You can read more about the level of parallelism in the Spark documentation:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#performance-tuning
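For example, with a single receiver the raw blocks all land on one worker, so a sketch like this (the source host/port are placeholders) repartitions the stream before the heavy processing:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object RepartitionExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("repartition-example"), Seconds(5))

    // a single receiver pulls all the data onto one worker node...
    val inputStream = ssc.socketTextStream("localhost", 9999) // placeholder source

    // ...so spread the blocks across 30 partitions before further processing
    val balanced = inputStream.repartition(30)

    balanced.map(_.toUpperCase).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}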
