I am working on Spark Streaming with Kinesis on EMR. For roughly 25 days the streaming job ran fine, with a one-minute batch interval and around 1,000 records processed per batch.
I have written the job in Scala.
Today, when I looked at the master UI, I saw that Spark was not processing any records and had queued 40 jobs with zero records. It was also stuck reading records from Kinesis.
Has anyone experienced this issue before? I am using Spark 1.6.
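For reference, the kind of setup described looks roughly like the following PySpark sketch (the actual job is written in Scala; the application, stream, region and endpoint names below are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream

sc = SparkContext(appName="kinesis-streaming")
ssc = StreamingContext(sc, 60)   # one-minute batch interval, as described

# Needs the spark-streaming-kinesis-asl assembly on the classpath.
stream = KinesisUtils.createStream(
    ssc,
    "my-kcl-app",                                # KCL application name (placeholder)
    "my-stream",                                 # Kinesis stream name (placeholder)
    "https://kinesis.us-east-1.amazonaws.com",   # endpoint (placeholder region)
    "us-east-1",
    InitialPositionInStream.LATEST,
    60)                                          # Kinesis checkpoint interval

stream.count().pprint()   # around 1,000 records per batch in the scenario described

ssc.start()
ssc.awaitTermination()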
I have a batch Spark job with a checkpoint that takes over 3 hours to finish, and the checkpoint appears over 30 times in the Spark UI.
I tried removing the checkpoint from the code, and a similar thing happens: there is a 3-hour gap between one job and the next.
The data is not too big; the job just reads from 6 tables with no more than 3 GB of data, and it runs on a Cloudera platform (YARN).
I have already tried using more shuffle partitions and more parallelism, and also using fewer, but it doesn't help. I also tried changing the number of executors, but nothing changed.
What do you think is happening?
I finally managed to solve it.
The problem was that the input Hive table had just 5 partitions (5 Parquet files), so the job was working with only 5 partitions the whole time.
Calling .repartition(100) right after the read solved the problem and sped the process up from 5 hours to 40 minutes.
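For reference, a minimal PySpark sketch of the kind of fix described (the table name and partition counts here are placeholders, not the actual job):

from pyspark.sql import SparkSession

# Hive support so the (hypothetical) source table can be read
spark = (SparkSession.builder
         .appName("repartition-example")
         .enableHiveSupport()
         .getOrCreate())

# The table behind this read has only a handful of Parquet files, so Spark
# starts out with just that many partitions.
df = spark.table("db.source_table")
print(df.rdd.getNumPartitions())   # e.g. 5 in the scenario described

# Repartitioning right after the read spreads the downstream work
# (joins, aggregations, checkpointing) across many more tasks.
df = df.repartition(100)

# ... the rest of the job operates on the repartitioned DataFrame ...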
We wrote a Spark Streaming application that receives Kafka messages (backpressure enabled and spark.streaming.kafka.maxRatePerPartition set), maps the DStream into a Dataset, and writes these Datasets to Parquet files (inside DStream.foreachRDD) at the end of every batch.
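The write pattern is roughly the following PySpark sketch (topic, broker and output path are placeholders; the stream creation here uses the old pyspark Kafka API purely as a stand-in for our actual Kafka integration):

from pyspark.sql import SparkSession
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
ssc = StreamingContext(spark.sparkContext, 30)   # 30-second batch interval

# Stand-in for the actual Kafka direct stream (backpressure and
# maxRatePerPartition are set through the Spark configuration).
stream = KafkaUtils.createDirectStream(
    ssc, ["events"], {"metadata.broker.list": "broker1:9092"})

def write_batch(time, rdd):
    # Each batch is converted to a Dataset/DataFrame and appended as Parquet;
    # every partition of the batch produces one Parquet file.
    if not rdd.isEmpty():
        df = spark.read.json(rdd.map(lambda kv: kv[1]))
        df.write.mode("append").parquet("/warehouse/events")

stream.foreachRDD(write_batch)
ssc.start()
ssc.awaitTermination()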
At the beginning everything seems fine; the Spark Streaming processing time is around 10 seconds for a 30-second batch interval. The rate of produced Kafka messages is a bit less than the rate at which our Spark application consumes them, so no backpressure is needed (in the beginning). The Spark job creates many Parquet files inside our Spark warehouse HDFS directory (x partitions => x Parquet files per batch), as expected.
Everything runs just fine for hours, but after around 12-14 hours the processing time increases rapidly, e.g. it jumped from the normal 10 seconds to over 1 minute from one batch to the next. This of course leads to a huge batch queue after a short time.
We saw similar results with 5-minute batches (processing time is around 1.5 minutes there and suddenly increases to over 10 minutes per batch after some period of time).
Similar results also occurred when we wrote ORC instead of Parquet files.
Since the batches can run independently, we do not use the checkpointing feature of Spark Streaming.
We're using the Hortonworks Data Platform 3.1.4 with Spark 2.3.2 and Kafka 2.0.0.
Is this a known problem in Spark Streaming? Are there any dependencies on "old" batches for Parquet/ORC tables? Or is this a general file-based or Hadoop-based problem? Thanks for your help.
Can anyone please explain how Spark works internally when reading data from one Cassandra table and writing it to another?
Here is my use case:
I am ingesting data coming from an IoT platform into Cassandra through a Kafka topic. I have a small Python script that parses each message from Kafka to get the table name it belongs to, prepares a query, and writes it to Cassandra using DataStax's cassandra-driver for Python. With that script I am able to ingest around 300,000 records per minute into Cassandra. However, my incoming data rate is 510,000 records per minute, so the Kafka consumer lag keeps increasing.
The Python script already makes concurrent calls to Cassandra. If I increase the number of Python executors, cassandra-driver starts failing because the Cassandra nodes become unavailable to it. I am assuming there is a limit on Cassandra calls per second that I am hitting there. Here is the error message that I get:
ERROR Operation failed: ('Unable to complete the operation against any hosts', {<Host: 10.128.1.3 datacenter1>: ConnectionException('Pool is shutdown',), <Host: 10.128.1.1 datacenter1>: ConnectionException('Pool is shutdown',)})"
Recently, I ran a PySpark job to copy data from a couple of columns in one table to another. The table had around 168 million records in it. The PySpark job completed in around 5 hours, so it processed over 550,000 records per minute.
Here is the PySpark code I am using:
# Read the source table through the Spark Cassandra connector and cache it
df = spark.read\
    .format("org.apache.spark.sql.cassandra")\
    .options(table=sourcetable, keyspace=sourcekeyspace)\
    .load().cache()
df.createOrReplaceTempView("data")

# Project the columns of interest; `field` holds the name of the source column
query = ("select dev_id, datetime, DATE_FORMAT(datetime, 'yyyy-MM-dd') as day, "
         + field + " as value from data")
vgDF = spark.sql(query)
vgDF.show(50)

# Append the result to the target table via the same connector
vgDF.write\
    .format("org.apache.spark.sql.cassandra")\
    .mode('append')\
    .options(table=newtable, keyspace=newkeyspace)\
    .save()
Versions:
Cassandra 3.9.
Spark 2.1.0.
Datastax's spark-cassandra-connector 2.0.1
Scala version 2.11
Cluster:
Spark is set up with 3 worker nodes and 1 master node.
The 3 worker nodes also host a Cassandra cluster (each Cassandra node is co-located with one Spark worker node).
Each worker was allowed 10 GB of RAM and 3 cores.
So I am wondering:
Does Spark read all the data from Cassandra first and then write it to the new table, or is there some optimization in the Spark Cassandra connector that lets it move data between Cassandra tables without reading all the records?
If I replace my Python script with a Spark Streaming job in which I parse the packet to get the table name for Cassandra, will that help me ingest data into Cassandra more quickly?
The Spark connector is optimized: it parallelizes processing and reads/inserts data on the nodes that own the data. You may get better throughput by using the Cassandra Spark Connector, but this will require more resources.
As for your task: 300,000 inserts/minute is 5,000/second, which frankly is not a very big number. You can increase throughput with several optimizations:
Use asynchronous calls to submit requests. You only need to make sure that you don't submit more requests than a single connection can handle (you can also increase that limit; I'm not sure how to do it in Python, but check the Java driver docs to get an idea). See the sketch after this list.
Use the correct consistency level (LOCAL_ONE should give you very good performance).
Use the correct load-balancing policy.
You can run several copies of your script in parallel, making sure they are all in the same Kafka consumer group.
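For illustration, a minimal sketch of the asynchronous-write idea with the Python cassandra-driver (the contact points, keyspace, table and column names are placeholders, and the in-flight limit is just an example to tune):

from threading import Semaphore

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

# Token-aware routing on top of DC-aware round robin (load-balancing point above)
cluster = Cluster(
    ["10.128.1.1", "10.128.1.3"],
    load_balancing_policy=TokenAwarePolicy(
        DCAwareRoundRobinPolicy(local_dc="datacenter1")))
session = cluster.connect("my_keyspace")

insert = session.prepare(
    "INSERT INTO readings (dev_id, datetime, value) VALUES (?, ?, ?)")
insert.consistency_level = ConsistencyLevel.LOCAL_ONE   # consistency point above

# Cap the number of in-flight requests so the driver's connections are not overloaded
in_flight = Semaphore(64)

def write_async(row):
    in_flight.acquire()
    future = session.execute_async(insert, row)
    # Free the slot when the request succeeds or fails
    future.add_callbacks(lambda _: in_flight.release(),
                         lambda exc: in_flight.release())

rows_from_kafka = []   # placeholder: in the real script these come from the Kafka consumer
for row in rows_from_kafka:
    write_async(row)

The semaphore is just one simple way to bound concurrency; raising that bound (together with the driver's own per-connection request settings) is the "increase that limit" part of the first point.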
I'm using Kafka 0.9 and Spark 1.6. The Spark Streaming application streams messages from Kafka through the direct stream API (artifact version 2.10-1.6.0).
I have 3 workers with 8 GB of memory each. Every minute around 4,000 messages arrive in Kafka, but in Spark each worker only streams around 600 messages, so I always see a lag between the Kafka offsets and the Spark offsets.
I have 5 Kafka partitions.
Is there a way to make Spark stream more messages for each pull from Kafka?
My streaming (batch) interval is 2 seconds.
Spark configuration in the app:
"maxCoresForJob": 3,
"durationInMilis": 2000,
"auto.offset.reset": "largest",
"autocommit.enable": "true",
Would you please explain more? Did you check which piece of code is taking longer to execute? In Cloudera Manager go to YARN -> Applications -> select your application -> Application Master -> Streaming, then select one batch and click on it. Try to find out which task is taking the longest to execute. How many executors are you using? For 5 partitions, it is better to have 5 executors.
You can post your transformation logic; there may be some way to tune it.
Thanks
I am running Spark Streaming and it is consuming messages from Kafka. I have also defined a checkpoint directory in my Spark code.
We did a bulk message upload into Kafka yesterday. When I check the offset status in Kafka using:
bin/kafka-run-class.sh kafka.tools.ConsumerOffsetChecker \
  --group xxx-streaming-consumer-group --zookeeper xxx.xxx.xxx.xxx:2181
it shows there is no message lag. However, my Spark job has still been running for the last 10 hours.
My understanding is that the Spark Streaming code should read the messages sequentially and update the offsets in Kafka accordingly.
I cannot figure out why Spark is still running even though there is no message lag in Kafka. Can someone explain?
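For context, a Spark Streaming driver keeps running regardless of lag: once started, it schedules a (possibly empty) batch every interval until it is stopped explicitly. A minimal PySpark sketch of that lifecycle (the checkpoint path, interval, and the stand-in source are placeholders):

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="streaming-lifecycle")
ssc = StreamingContext(sc, 60)        # one batch every 60 seconds
ssc.checkpoint("/tmp/checkpoints")    # placeholder checkpoint directory

# Stand-in for the Kafka DStream; the real job would create a Kafka stream here.
stream = ssc.queueStream([sc.parallelize(range(3))])
stream.count().pprint()

ssc.start()
# Blocks indefinitely: the application only exits when it is stopped
# (ssc.stop(), killing the YARN application, etc.), not when the Kafka lag
# reaches zero; once the backlog is drained it keeps scheduling empty batches.
ssc.awaitTermination()

So a streaming job still running for 10 hours with no lag is normal; it will not stop on its own just because the bulk-loaded messages have all been processed.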