Kinesis + Spark Streaming giving empty records

I am trying to read a Kinesis data stream with Spark Streaming. I am not getting any records in the output. The code does not throw any errors; it simply prints nothing on the console, even after feeding data to Kinesis. I have also tried switching between TRIM_HORIZON and LATEST, but
still no luck. Please find the code below:
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
import time
from pyspark import StorageLevel
appName="cdc_application"
sc = SparkContext(appName=appName)
ssc = StreamingContext(sc, 1)
streamName = 'cdc-stream'
endpointUrl = 'https://kinesis.ap-south-1.amazonaws.com'
regionName = 'ap-south-1'
checkpointInterval = 5
kinesisstream = KinesisUtils.createStream(
    ssc, appName, streamName, endpointUrl, regionName,
    InitialPositionInStream.LATEST,
    checkpointInterval, StorageLevel.MEMORY_AND_DISK_2)
kinesisstream.pprint()
ssc.start()
time.sleep(10)  # run the stream for 10 seconds in case the producer takes time to send data
ssc.stop(stopSparkContext=True, stopGraceFully=True)
Version - Spark 2.4.8 on EMR
Package used - org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.8
More details: My Kinesis stream is the target of AWS DMS, which is connected to a PostgreSQL database. I am getting the change records (CDC) in the Kinesis stream, and I am trying to process them with Spark on EMR.
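Once records do arrive, a minimal sketch of parsing them might look like the following. The DMS payload is assumed here to be a JSON envelope with "data" and "metadata" keys; verify this against your actual stream before relying on it:
import json

def parse_cdc(record):
    # Records arrive as UTF-8 JSON strings produced by DMS (assumed envelope).
    event = json.loads(record)
    return (event.get("metadata", {}).get("operation"), event.get("data"))

kinesisstream.map(parse_cdc).pprint()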
My output for this looks like below:
Time: 2022-05-05 05:19:33
Time: 2022-05-05 05:19:34
Time: 2022-05-05 05:19:35
Time: 2022-05-05 05:19:36
Time: 2022-05-05 05:19:37
Time: 2022-05-05 05:19:38
The console gives the following output, but never shows any records:
Time: 2022-05-05 07:02:26
22/05/05 07:02:26 INFO JobScheduler: Finished job streaming job 1651734146000 ms.0 from job set of time 1651734146000 ms
22/05/05 07:02:26 INFO JobScheduler: Total delay: 0.014 s for time 1651734146000 ms (execution: 0.003 s)
22/05/05 07:02:26 INFO KinesisBackedBlockRDD: Removing RDD 844 from persistence list
22/05/05 07:02:26 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[844] at createStream at NativeMethodAccessorImpl.java:0 of time 1651734146000 ms
22/05/05 07:02:26 INFO ReceivedBlockTracker: Deleting batches: 1651734144000 ms
22/05/05 07:02:26 INFO InputInfoTracker: remove old batch metadata: 1651734144000 ms
22/05/05 07:02:27 INFO JobScheduler: Added jobs for time 1651734147000 ms
22/05/05 07:02:27 INFO JobScheduler: Starting job streaming job 1651734147000 ms.0 from job set of time 1651734147000 ms
Time: 2022-05-05 07:02:27

Related

Spark Dataframe writing to google pubsub

I am trying to write the Parquet files to Pubsub through Spark on a Dataproc cluster.
I have used the pseudo code below:
dataFrame
  .as[MyCaseClass]
  .foreachPartition(partition => {
    val topicName = "projects/myproject/topics/mytopic"
    val publisher = Publisher.newBuilder(topicName).build()
    try {
      partition.foreach(users => {
        try {
          val jsonUser = users.asJson.noSpaces // using the circe Scala lib
          val data = ByteString.copyFromUtf8(jsonUser)
          val pubsubMessage = PubsubMessage.newBuilder().setData(data).build()
          val message = publisher.publish(pubsubMessage)
        } catch {
          case e: Exception => System.out.println("Exception in processing the event " + e.printStackTrace())
        }
      })
      publisher.shutdown()
    } catch {
      case e: Exception => System.out.println("Exception in processing the partition = " + e.printStackTrace())
    }
  })
Whenever I submit this on the cluster I get Spark prelaunch errors with exit code 134.
I have shaded guava and protobuf in my pom. If I run this example through a local test case it works, but if I submit it on Dataproc I get the errors.
I did not find any relevant information about writing a data frame to Pub/Sub.
Any pointers?
Update:
System Details: Single Node Cluster with N1-Standard-32 (32 Cores,120GB Memory)
Executor Cores: Dynamic enabled
Attaching the stack trace:
20/12/22 17:51:43 WARN org.apache.spark.scheduler.cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 1 for reason Container from a bad node: container_1608332157194_0026_01_000002 on host: dataproc-cluster.internal. Exit status: 134. Diagnostics: [2020-12-22 17:51:43.556]Exception from container-launch.
Container id: container_1608332157194_0026_01_000002
Exit code: 134
[2020-12-22 17:51:43.557]Container exited with a non-zero exit code 134. Error file: prelaunch.err.
Last 4096 bytes of prelaunch.err :
/bin/bash: line 1: 19017 Aborted /usr/lib/jvm/adoptopenjdk-8-hotspot-amd64/bin/java -server -Xmx5586m -Djava.io.tmpdir=/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1608332157194_0026/container_1608332157194_0026_01_000002/tmp '-Dspark.driver.port=43691' '-Dspark.rpc.message.maxSize=512' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/userlogs/application_1608332157194_0026/container_1608332157194_0026_01_000002 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler#dataproc-cluster.internal:43691 --executor-id 1 --hostname dataproc-cluster.internal --cores 2 --app-id application_1608332157194_0026 --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1608332157194_0026/container_1608332157194_0026_01_000002/__app__.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1608332157194_0026/container_1608332157194_0026_01_000002/mySparkJar-1.0.0-0-SNAPSHOT.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1608332157194_0026/container_1608332157194_0026_01_000002/org.apache.spark_spark-avro_2.11-2.4.2.jar --user-class-path file:/hadoop/yarn/nm-local-dir/usercache/root/appcache/application_1608332157194_0026/container_1608332157194_0026_01_000002/org.spark-project.spark_unused-1.0.0.jar > /var/log/hadoop-yarn/userlogs/application_1608332157194_0026/container_1608332157194_0026_01_000002/stdout 2> /var/log/hadoop-yarn/userlogs/application_1608332157194_0026/container_1608332157194_0026_01_000002/stderr
Last 4096 bytes of stderr :
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/lib/hadoop/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
20/12/22 17:51:36 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 11320100 records.
20/12/22 17:51:36 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
20/12/22 17:51:38 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor [.snappy]
20/12/22 17:51:38 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 2301 ms. row count = 11320100
20/12/22 17:51:39 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader initialized will read a total of 11320100 records.
20/12/22 17:51:39 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading next block
20/12/22 17:51:40 INFO org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory in 1411 ms. row count = 11320100
If the job failed early, it could be that there is not enough memory for the Spark driver to start: https://discuss.xgboost.ai/t/container-exited-with-a-non-zero-exit-code-134/133
To solve this, provision the Dataproc cluster with a master node that has more RAM, or allocate more memory/heap to the Spark driver and/or the Spark executors.
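For example, when submitting with spark-submit, the relevant knobs might look like this (the values are illustrative and my_job.jar is a placeholder, not the asker's actual artifact):
spark-submit \
  --driver-memory 8g \
  --executor-memory 8g \
  --conf spark.executor.memoryOverhead=2g \
  my_job.jar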

Spark: executor heartbeat timed out

I am working in a Databricks cluster that has 240GB of memory and 64 cores. These are the settings I defined.
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as fs
from pyspark.sql import SQLContext
from pyspark import SparkContext
from pyspark.sql.functions import count
from pyspark.sql.functions import col, countDistinct
from pyspark import SparkContext
from geospark.utils import GeoSparkKryoRegistrator, KryoSerializer
from geospark.register import upload_jars
from geospark.register import GeoSparkRegistrator
spark.conf.set("spark.sql.shuffle.partitions", 1000)
#Recommended settings for using GeoSpark
spark.conf.set("spark.driver.memory", "20g")
spark.conf.set("spark.network.timeout", "1000s")
spark.conf.set("spark.driver.maxResultSize", "10g")
spark.conf.set("spark.serializer", KryoSerializer.getName)
spark.conf.set("spark.kryo.registrator", GeoSparkKryoRegistrator.getName)
upload_jars()
SparkContext.setSystemProperty("geospark.global.charset","utf8")
spark.conf.set
I am working with large datasets and this is the error I get after hours of running.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 10.0 failed 4 times, most recent failure: Lost task 3.3 in stage 10.0 (TID 6054, 10.17.21.12, executor 7):
ExecutorLostFailure (executor 7 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 170684 ms
Leave the heartbeat interval at its default (10s) and increase the network timeout (default 120s) to 300s and see if that helps. Use set and get:
spark.conf.set("spark.sql.<name-of-property>", <value>)
spark.conf.set("spark.network.timeout", "300s")
Or run this script in the notebook:
%scala
dbutils.fs.put("dbfs:/databricks/init/set_spark_params.sh","""
|#!/bin/bash
|
|cat << 'EOF' > /databricks/driver/conf/00-custom-spark-driver-defaults.conf
|[driver] {
| "spark.network.timeout" = "300000"
|}
|EOF
""".stripMargin, true)
The error tells you that the worker has timed out because it took too long.
There is probably some bottleneck happening in the background. Check the spark UI for executor 7, task 3 and stage 10. You also want to check the query that you have been running.
You also want to check these settings for better configuration:
spark.conf.set("spark.databricks.io.cache.enabled", True) # delta caching
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", True) # adaptive query execution for skewed data
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) # setting treshhold on broadcasting
spark.conf.set("spark.databricks.optimizer.rangeJoin.binSize", 20) #range optimizer
Feel free to give us more info from the Spark UI; that way we can better help you find the problem. Also, what kind of query were you running?
Can you please try the following options:
Repartition the dataframe you are working with into more partitions, for example df.repartition(1000)
--conf spark.network.timeout=10000000
--conf spark.executor.heartbeatInterval=10000000

Spark Structured Streaming with Kafka doesn't honor startingOffset="earliest"

I've set up Spark Structured Streaming (Spark 2.3.2) to read from Kafka (2.0.0). I'm unable to consume from the beginning of the topic if messages entered the topic before the Spark streaming job was started. Is this expected behavior of Spark streaming, where it ignores Kafka messages produced prior to the initial run of the Spark streaming job (even with .option("stratingOffsets","earliest"))?
Steps to reproduce
Before starting the streaming job, create a test topic (single broker, single partition) and produce messages to the topic (3 messages in my example).
Start spark-shell with the following command: spark-shell --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.2.3.1.0.0-78 --repositories http://repo.hortonworks.com/content/repositories/releases/
Execute the spark scala code below.
// Local
val df = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9097")
.option("failOnDataLoss","false")
.option("stratingOffsets","earliest")
.option("subscribe", "test")
.load()
// Sink Console
val ds = df.writeStream.format("console").queryName("Write to console")
.trigger(org.apache.spark.sql.streaming.Trigger.ProcessingTime("10 second"))
.start()
Expected vs actual output
I expect the stream to start from offset=1. However, it starts reading from offset=3. You can see that kafka client is actually resetting the starting offset: 2019-06-18 21:22:57 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
I can see that the spark stream processes messages that I produce after starting the streaming job.
Is this expected behavior of Spark streaming where it ignores Kafka messages produced prior to initial run of Spark Stream job (even with .option("stratingOffsets","earliest"))?
2019-06-18 21:22:57 INFO AppInfoParser:109 - Kafka version : 2.0.0.3.1.0.0-78
2019-06-18 21:22:57 INFO AppInfoParser:110 - Kafka commitId : 0f47b27cde30d177
2019-06-18 21:22:57 INFO MicroBatchExecution:54 - Starting new streaming query.
2019-06-18 21:22:57 INFO Metadata:273 - Cluster ID: LqofSZfjTu29BhZm6hsgsg
2019-06-18 21:22:57 INFO AbstractCoordinator:677 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Discovered group coordinator localhost:9097 (id: 2147483647 rack: null)
2019-06-18 21:22:57 INFO ConsumerCoordinator:462 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Revoking previously assigned partitions []
2019-06-18 21:22:57 INFO AbstractCoordinator:509 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] (Re-)joining group
2019-06-18 21:22:57 INFO AbstractCoordinator:473 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Successfully joined group with generation 1
2019-06-18 21:22:57 INFO ConsumerCoordinator:280 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Setting newly assigned partitions [test-0]
2019-06-18 21:22:57 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
2019-06-18 21:22:58 INFO KafkaSource:54 - Initial offsets: {"test":{"0":3}}
2019-06-18 21:22:58 INFO Fetcher:583 - [Consumer clientId=consumer-2, groupId=spark-kafka-source-e948eee9-3024-4f14-bcb8-75b80d43cbb1--181544888-driver-0] Resetting offset for partition test-0 to offset 3.
2019-06-18 21:22:58 INFO MicroBatchExecution:54 - Committed offsets for batch 0. Metadata OffsetSeqMetadata(0,1560910978083,Map(spark.sql.shuffle.partitions -> 200, spark.sql.streaming.stateStore.providerClass -> org.apache.spark.sql.execution.streaming.state.HDFSBackedStateStoreProvider))
2019-06-18 21:22:58 INFO KafkaSource:54 - GetBatch called with start = None, end = {"test":{"0":3}}
Spark Batch mode
I was able to confirm that batch mode reads from the beginning, so there is no issue with the Kafka retention configuration:
val df = spark
.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9097")
.option("subscribe", "test")
.load()
df.count // Long = 3
Haha it was a simple typo: "stratingOffsets" should be "startingOffsets"
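For reference, a minimal sketch with the option spelled correctly (shown in PySpark; the broker address and topic are taken from the question):
df = (spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9097")
      .option("subscribe", "test")
      .option("startingOffsets", "earliest")  # note the spelling
      .load())

df.writeStream.format("console").start()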
You can do this in two ways: load data from Kafka into a streaming dataframe, or load data from Kafka into a static dataframe (for testing).
I think you are not seeing the data because of the group id. Kafka commits the consumer group and offsets into an internal topic; make sure the group name is unique for each read.
Here are the two options.
Option 1: read data from kafka to streaming dataframe
// spark streaming with kafka
import org.apache.spark.sql.streaming.ProcessingTime
val ds1 = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers","app01.app.test.net:9097,app02.app.test.net:9097")
.option("subscribe", "kafka-testing-topic")
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("startingOffsets","earliest")
.option("maxOffsetsPerTrigger","6000")
.load()
val ds2 = ds1.select(from_json($"value".cast(StringType), dataSchema).as("data")).select("data.*")
val ds3 = ds2.groupBy("TABLE_NAME").count()
ds3.writeStream
.trigger(ProcessingTime("10 seconds"))
.queryName("query1").format("console")
.outputMode("complete")
.start()
.awaitTermination()
Option 2: read data from kafka to static dataframe (for testing, it will load from beginning)
// Subscribe to 1 topic defaults to the earliest and latest offsets
val ds1 = spark.read.format("kafka")
.option("kafka.bootstrap.servers","app01.app.test.net:9097,app02.app.test.net:9097")
.option("subscribe", "kafka-testing-topic")
.option("kafka.security.protocol", "SASL_PLAINTEXT")
.option("spark.streaming.kafka.consumer.cache.enabled","false")
.load()
val ds2 = ds1.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)","topic","partition","offset","timestamp")
val ds3 = ds2.select("value").rdd.map(x => x.toString)
ds3.count()

Apache Spark Kinesis Integration: connected, but no records received

tl;dr: Can't use the Kinesis Spark Streaming integration, because it receives no data.
A testing stream is set up; a Node.js app sends 1 simple record per second.
A standard Spark 1.5.2 cluster is set up with master and worker nodes (4 cores) via docker-compose, with AWS credentials in the environment.
spark-streaming-kinesis-asl-assembly_2.10-1.5.2.jar is downloaded and added to the classpath.
job.py or job.jar (which just reads and prints) is submitted.
Everything seems to be okay, but no records whatsoever are received.
From time to time the KCL worker thread says "Sleeping ..." - it might be failing silently (I checked all the stderr I could find, but no hints). Maybe a swallowed OutOfMemoryError... but I doubt that, given the volume of 1 record per second.
-------------------------------------------
Time: 1448645109000 ms
-------------------------------------------
15/11/27 17:25:09 INFO JobScheduler: Finished job streaming job 1448645109000 ms.0 from job set of time 1448645109000 ms
15/11/27 17:25:09 INFO KinesisBackedBlockRDD: Removing RDD 102 from persistence list
15/11/27 17:25:09 INFO JobScheduler: Total delay: 0.002 s for time 1448645109000 ms (execution: 0.001 s)
15/11/27 17:25:09 INFO BlockManager: Removing RDD 102
15/11/27 17:25:09 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[102] at createStream at NewClass.java:25 of time 1448645109000 ms
15/11/27 17:25:09 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645107000 ms)
15/11/27 17:25:09 INFO InputInfoTracker: remove old batch metadata: 1448645107000 ms
15/11/27 17:25:10 INFO JobScheduler: Added jobs for time 1448645110000 ms
15/11/27 17:25:10 INFO JobScheduler: Starting job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
-------------------------------------------
Time: 1448645110000 ms
-------------------------------------------
<----- Some data expected to show up here!
15/11/27 17:25:10 INFO JobScheduler: Finished job streaming job 1448645110000 ms.0 from job set of time 1448645110000 ms
15/11/27 17:25:10 INFO JobScheduler: Total delay: 0.003 s for time 1448645110000 ms (execution: 0.001 s)
15/11/27 17:25:10 INFO KinesisBackedBlockRDD: Removing RDD 103 from persistence list
15/11/27 17:25:10 INFO KinesisInputDStream: Removing blocks of RDD KinesisBackedBlockRDD[103] at createStream at NewClass.java:25 of time 1448645110000 ms
15/11/27 17:25:10 INFO BlockManager: Removing RDD 103
15/11/27 17:25:10 INFO ReceivedBlockTracker: Deleting batches ArrayBuffer(1448645108000 ms)
15/11/27 17:25:10 INFO InputInfoTracker: remove old batch metadata: 1448645108000 ms
15/11/27 17:25:11 INFO JobScheduler: Added jobs for time 1448645111000 ms
15/11/27 17:25:11 INFO JobScheduler: Starting job streaming job 1448645111000 ms.0 from job set of time 1448645111000 ms
Please let me know if you have any hints - I'd really like to use Spark for real-time analytics... everything but this small detail of not receiving data :) seems to be okay.
PS: I find it strange that Spark somehow ignores my settings for storage level (mem and disk 2) and checkpoint interval (20,000 ms):
15/11/27 17:23:26 INFO KinesisInputDStream: metadataCleanupDelay = -1
15/11/27 17:23:26 INFO KinesisInputDStream: Slide time = 1000 ms
15/11/27 17:23:26 INFO KinesisInputDStream: Storage level = StorageLevel(false, false, false, false, 1)
15/11/27 17:23:26 INFO KinesisInputDStream: Checkpoint interval = null
15/11/27 17:23:26 INFO KinesisInputDStream: Remember duration = 1000 ms
15/11/27 17:23:26 INFO KinesisInputDStream: Initialized and validated org.apache.spark.streaming.kinesis.KinesisInputDStream#74b21a6
Source code (java):
public class NewClass {
public static void main(String[] args) {
SparkConf conf = new SparkConf().setAppName("appname").setMaster("local[3]");
JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1000));
JavaReceiverInputDStream kinesisStream = KinesisUtils.createStream(
ssc, "webassist-test", "test", "https://kinesis.us-west-1.amazonaws.com", "us-west-1",
InitialPositionInStream.LATEST,
new Duration(20000),
StorageLevel.MEMORY_AND_DISK_2()
);
kinesisStream.print();
ssc.start();
ssc.awaitTermination();
}
}
Python code (I tried both pprinting and sending to MongoDB):
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pyspark import SparkContext, StorageLevel
from pyspark.streaming import StreamingContext
from sys import argv
sc = SparkContext(appName="webassist-test")
ssc = StreamingContext(sc, 5)
stream = KinesisUtils.createStream(ssc,
"appname",
"test",
"https://kinesis.us-west-1.amazonaws.com",
"us-west-1",
InitialPositionInStream.LATEST,
5,
StorageLevel.MEMORY_AND_DISK_2)
stream.pprint()
ssc.start()
ssc.awaitTermination()
Note: I also tried sending data to MongoDB with stream.foreachRDD(lambda rdd: rdd.foreachPartition(send_partition)), but I'm not pasting it here since you'd need a MongoDB instance and it's not related to the problem - no records come in on the input already.
One more thing - the KCL never commits. The corresponding DynamoDB table looks like this:
leaseKey checkpoint leaseCounter leaseOwner ownerSwitchesSinceCheckpoint
shardId-000000000000 LATEST 614 localhost:d92516... 8
The command used for submitting:
spark-submit --executor-memory 1024m --master spark://IpAddress:7077 /path/test.py
In the MasterUI I can see:
Input Rate
Receivers: 1 / 1 active
Avg: 0.00 events/sec
KinesisReceiver-0
Avg: 0.00 events/sec
...
Completed Batches (last 76 out of 76)
Thanks for any help!
I've had issues in the past with no record activity being shown in Spark Streaming when connecting to Kinesis.
I'd try these things to get more feedback / a different behaviour from Spark:
Make sure that you force the evaluation of your DStream transformation operations with output operations like foreachRDD, print, saveAs*, etc.
Create a new KCL application in DynamoDB by using a new name for the "Kinesis app name" parameter when creating the stream, or purge the existing one.
Switch between TRIM_HORIZON and LATEST for the initial position when creating the stream.
Restart the context when you try these changes; a minimal sketch combining these tweaks follows below.
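A minimal PySpark sketch combining those tweaks, assuming the StreamingContext ssc, stream name and region from the question; "kinesis-debug-app" is a hypothetical, previously unused application name so that a fresh DynamoDB lease table gets created:
from pyspark.streaming.kinesis import KinesisUtils, InitialPositionInStream
from pyspark import StorageLevel

# New KCL application name (hypothetical) -> new DynamoDB lease table;
# TRIM_HORIZON replays the records already sitting in the stream.
stream = KinesisUtils.createStream(
    ssc, "kinesis-debug-app", "test",
    "https://kinesis.us-west-1.amazonaws.com", "us-west-1",
    InitialPositionInStream.TRIM_HORIZON,
    5, StorageLevel.MEMORY_AND_DISK_2)

# An output operation forces evaluation of the DStream.
def dump(rdd):
    for record in rdd.take(5):
        print(record)

stream.foreachRDD(dump)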
EDIT after code was added:
Perhaps I'm missing something obvious, but I cannot spot anything wrong with your source code. Do you have n+1 CPUs running this application (n being the number of Kinesis shards)?
If you run a KCL application (Java/Python/...) reading from the shards in your docker instance, does it work? Perhaps there's something wrong with your network configuration, but I'd expect some error messages pointing that out.
If this is important enough / you have a bit of time, you can quickly implement a KCL reader in your docker instance, which will allow you to compare it with your Spark application. Some URLs:
Python
Java
Python example
Another option is to run your Spark Streaming application in a different cluster and to compare.
P.S.: I'm currently using Spark Streaming 1.5.2 with Kinesis in different clusters and it processes records / shows activity as expected.
I was facing this issue when I used the suggested documentation and examples. The following Scala code works fine for me (you can always use Java instead):
val conf = ConfigFactory.load
val config = new SparkConf().setAppName(conf.getString("app.name"))
val ssc = new StreamingContext(config, Seconds(conf.getInt("app.aws.batchDuration")))
val stream = if (conf.hasPath("app.aws.key") && conf.hasPath("app.aws.secret")){
logger.info("Specifying AWS account using credentials.")
KinesisUtils.createStream(
ssc,
conf.getString("app.name"),
conf.getString("app.aws.stream"),
conf.getString("app.aws.endpoint"),
conf.getString("app.aws.region"),
InitialPositionInStream.LATEST,
Seconds(conf.getInt("app.aws.batchDuration")),
StorageLevel.MEMORY_AND_DISK_2,
conf.getString("app.aws.key"),
conf.getString("app.aws.secret")
)
} else {
logger.info("Specifying AWS account using EC2 profile.")
KinesisUtils.createStream(
ssc,
conf.getString("app.name"),
conf.getString("app.aws.stream"),
conf.getString("app.aws.endpoint"),
conf.getString("app.aws.region"),
InitialPositionInStream.LATEST,
Seconds(conf.getInt("app.aws.batchDuration")),
StorageLevel.MEMORY_AND_DISK_2
)
}
stream.foreachRDD((rdd: RDD[Array[Byte]], time) => {
val rddstr: RDD[String] = rdd
.map(arrByte => new String(arrByte))
rddstr.foreach(x => println(x))
})

Spark streaming not working

I have a rudimentary Spark Streaming word count and it's just not working.
import sys
from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
sc = SparkContext(appName='streaming', master="local[*]")
scc = StreamingContext(sc, batchDuration=5)
lines = scc.socketTextStream("localhost", 9998)
words = lines.flatMap(lambda line: line.split())
counts = words.map(lambda word: (word, 1)).reduceByKey(lambda x, y: x + y)
counts.pprint()
print 'Listening'
scc.start()
scc.awaitTermination()
I have nc -lk 9998 running in another terminal and I pasted some text into it. Spark prints out the typical logs (no exceptions), but it ends up queuing the job for some weird time (45 yrs) and keeps printing this...
15/06/19 18:53:30 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:874
15/06/19 18:53:30 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (PythonRDD[7] at RDD at PythonRDD.scala:43)
15/06/19 18:53:30 INFO TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
15/06/19 18:53:35 INFO JobScheduler: Added jobs for time 1434754415000 ms
15/06/19 18:53:40 INFO JobScheduler: Added jobs for time 1434754420000 ms
15/06/19 18:53:45 INFO JobScheduler: Added jobs for time 1434754425000 ms
...
...
What am I doing wrong?
Spark Streaming needs enough cores/threads that the receiver does not occupy them all. Try using local[4] for the master.
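Applied to the code above, that is just a change to the master setting (a sketch; everything else stays the same):
sc = SparkContext(appName='streaming', master="local[4]")  # extra threads so the socket receiver does not occupy the only core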
