Exception while integrating Spark with Kafka

I'm doing Spark-Kafka streaming word count and built a jar using sbt.
When I run spark-submit, the following exception is thrown:
Exception in thread "streaming-start" java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.isDirectory()Z
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.initializeOrRecover(FileBasedWriteAheadLog.scala:245)
at org.apache.spark.streaming.util.FileBasedWriteAheadLog.<init>(FileBasedWriteAheadLog.scala:80)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at org.apache.spark.streaming.util.WriteAheadLogUtils$$anonfun$2.apply(WriteAheadLogUtils.scala:142)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLog(WriteAheadLogUtils.scala:141)
at org.apache.spark.streaming.util.WriteAheadLogUtils$.createLogForDriver(WriteAheadLogUtils.scala:99)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:256)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker$$anonfun$createWriteAheadLog$1.apply(ReceivedBlockTracker.scala:254)
at scala.Option.map(Option.scala:146)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.createWriteAheadLog(ReceivedBlockTracker.scala:254)
at org.apache.spark.streaming.scheduler.ReceivedBlockTracker.<init>(ReceivedBlockTracker.scala:77)
at org.apache.spark.streaming.scheduler.ReceiverTracker.<init>(ReceiverTracker.scala:109)
at org.apache.spark.streaming.scheduler.JobScheduler.start(JobScheduler.scala:87)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply$mcV$sp(StreamingContext.scala:583)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.streaming.StreamingContext$$anonfun$liftedTree1$1$1.apply(StreamingContext.scala:578)
at org.apache.spark.util.ThreadUtils$$anon$2.run(ThreadUtils.scala:126)
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#12010fd1{/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#552ed807{/streaming/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#7318daf8{/streaming/batch,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#3f1ddac2{/streaming/batch/json,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler#37864b77{/static/streaming,null,AVAILABLE,#Spark}
18/03/27 12:43:55 INFO streaming.StreamingContext: StreamingContext started
My spark-submit command:
spark-submit --packages org.apache.spark:spark-streaming-kafka-0-8_2.11:2.2.0 --class "KafkaWordCount" --master local[4] scala_project_2.11-1.0.jar localhost:2181 test-consumer-group word-count 1
scala_version: 2.11.8
spark_version: 2.2.0
sbt_version: 1.0.3
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object KafkaWordCount {
  def main(args: Array[String]) {
    // Note: the spark-submit arguments are ignored here; the same values are hard-coded below.
    val (zkQuorum, group, topics, numThreads) = ("localhost:2181", "test-consumer-group", "word-count", 1)

    val sparkConf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("KafkaWordCount")
    val ssc = new StreamingContext(sparkConf, Seconds(2))
    ssc.checkpoint("checkpoint")

    val topicMap = topics.split(",").map((_, numThreads)).toMap
    val lines = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap).map(_._2)
    val words = lines.flatMap(_.split(" "))
    words.foreachRDD(rdd => println("#####################rdd###################### " + rdd.first))

    val wordCounts = words.map(x => (x, 1L))
      .reduceByKeyAndWindow(_ + _, _ - _, Minutes(10), Seconds(2), 2)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
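A NoSuchMethodError on org.apache.hadoop.fs.FileStatus.isDirectory() at streaming start usually points to a jar/classpath conflict rather than a bug in the code above: some jar packaged with (or pulled in next to) the application supplies an older Hadoop FileStatus that lacks the method Spark 2.2.0 calls. As a hedged sketch only (the project name is an assumption, not the asker's actual build file), a build.sbt for this version combination would mark the Spark modules as provided so the Spark and Hadoop classes seen at runtime are the ones shipped by spark-submit itself:

// Hypothetical build.sbt for Scala 2.11.8 / Spark 2.2.0 -- a sketch, not the original project file
name := "scala_project"
version := "1.0"
scalaVersion := "2.11.8"

// "provided" keeps Spark (and its Hadoop client) out of the packaged jar
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"                % "2.2.0" % "provided",
  "org.apache.spark" %% "spark-streaming"           % "2.2.0" % "provided",
  // already supplied at runtime via --packages in the spark-submit command above
  "org.apache.spark" %% "spark-streaming-kafka-0-8" % "2.2.0" % "provided"
)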

Related

Stream data from one kafka topic to another using pyspark

I have been trying to stream some sample data with PySpark from one Kafka topic to another (I eventually want to apply some transformations, but I could not even get the basic data movement to work). Below is my Spark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.sql.functions import from_json, col
import time
confluentApiKey = 'someapikeyvalue'
confluentSecret = 'someapikey'
spark = SparkSession.builder \
    .appName("repartition-job") \
    .config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1') \
    .getOrCreate()

df = spark \
    .readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-cloud:9092") \
    .option("subscribe", "test1") \
    .option("topic", "test1") \
    .option("sasl.mechanisms", "PLAIN") \
    .option("security.protocol", "SASL_SSL") \
    .option("sasl.username", confluentApiKey) \
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)) \
    .option("kafka.ssl.endpoint.identification.algorithm", "https") \
    .option("sasl.password", confluentSecret) \
    .option("startingOffsets", "earliest") \
    .option("basic.auth.credentials.source", "USER_INFO") \
    .option("failOnDataLoss", "true") \
    .load()

df.printSchema()

query = df \
    .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value") \
    .writeStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "pkc-cloud:9092") \
    .option("topic", "test2") \
    .option("sasl.mechanisms", "PLAIN") \
    .option("security.protocol", "SASL_SSL") \
    .option("sasl.username", confluentApiKey) \
    .option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret)) \
    .option("kafka.ssl.endpoint.identification.algorithm", "https") \
    .option("sasl.password", confluentSecret) \
    .option("startingOffsets", "latest") \
    .option("basic.auth.credentials.source", "USER_INFO") \
    .option("checkpointLocation", "/tmp/checkpoint") \
    .start()
The schema prints fine:
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
But when attempting to write to the other Kafka topic using writeStream, I see the logs below: no data is written and Spark shuts down.
22/02/04 18:29:26 INFO CheckpointFileManager: Writing atomically to file:/tmp/checkpoint/metadata using temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp
22/02/04 18:29:26 INFO CheckpointFileManager: Renamed temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp to file:/tmp/checkpoint/metadata
22/02/04 18:29:27 INFO MicroBatchExecution: Starting [id = 71f0aeb8-46fc-49b5-8baf-3b83cb4df71f, runId = 7a274767-8830-448c-b9bc-d03217cd4465]. Use file:/tmp/checkpoint to store the query checkpoint.
22/02/04 18:29:27 INFO MicroBatchExecution: Reading table [org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable#3be72f6d] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider#276c9fdc]
22/02/04 18:29:27 INFO SparkUI: Stopped Spark web UI at http://spark-sample-9d328d7ec5fee0bc-driver-svc.default.svc:4045
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/02/04 18:29:27 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/02/04 18:29:27 INFO MicroBatchExecution: Starting new streaming query.
22/02/04 18:29:27 INFO MicroBatchExecution: Stream started from {}
22/02/04 18:29:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/02/04 18:29:27 INFO MemoryStore: MemoryStore cleared
22/02/04 18:29:27 INFO BlockManager: BlockManager stopped
22/02/04 18:29:27 INFO BlockManagerMaster: BlockManagerMaster stopped
22/02/04 18:29:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/02/04 18:29:28 INFO SparkContext: Successfully stopped SparkContext
22/02/04 18:29:28 INFO ConsumerConfig: ConsumerConfig values:
....
....
....
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
22/02/04 18:29:28 INFO ShutdownHookManager: Shutdown hook called
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-7acecef5-0f0b-4b9a-af81-c8aa12f7fcad
22/02/04 18:29:28 INFO AppInfoParser: Kafka version: 2.4.1
22/02/04 18:29:28 INFO AppInfoParser: Kafka commitId: c57222ae8cd7866b
22/02/04 18:29:28 INFO AppInfoParser: Kafka startTimeMs: 1643999368467
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372/pyspark-59e822b4-a4a4-403b-9937-170d99c67584
22/02/04 18:29:28 INFO KafkaConsumer: [Consumer clientId=consumer-spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0-1, groupId=spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0] Subscribed to topic(s): test1
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372
22/02/04 18:29:28 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Also, sometimes I see the logs below, where the Kafka connection fails to be established.
22/02/06 04:50:55 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:56 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:57 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:58 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
What am I doing wrong?
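Two hedged observations on the snippet above, not a guaranteed fix: the StreamingQuery returned by start() is never awaited, so the driver's main method can return and Spark shuts down exactly as the logs show; and most of the Kafka client settings (security.protocol, the SASL options) are passed without the kafka. prefix that the Spark Kafka source and sink use, so they never reach the underlying consumer/producer (the Java client property is also sasl.mechanism, singular). A minimal Scala sketch of the same read-and-write pipeline with those two points addressed; the broker, topics, and credentials are placeholders taken from the question, not verified values:

// Scala sketch of the equivalent pipeline (the Structured Streaming API is the same shape in PySpark).
// Requires the spark-sql-kafka package on the classpath, as in the question.
import org.apache.spark.sql.SparkSession

object TopicToTopic {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("repartition-job").getOrCreate()

    // placeholder credentials -- substitute the Confluent API key/secret
    val jaas = "org.apache.kafka.common.security.plain.PlainLoginModule required " +
      "username='API_KEY' password='API_SECRET';"

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "pkc-cloud:9092")
      .option("subscribe", "test1")
      // Kafka client properties need the "kafka." prefix to reach the consumer
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas)
      .option("startingOffsets", "earliest")
      .load()

    val query = df
      .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "pkc-cloud:9092")
      .option("topic", "test2")
      .option("kafka.security.protocol", "SASL_SSL")
      .option("kafka.sasl.mechanism", "PLAIN")
      .option("kafka.sasl.jaas.config", jaas)
      .option("checkpointLocation", "/tmp/checkpoint")
      .start()

    // Without this the driver exits right after start(), matching the shutdown in the logs
    query.awaitTermination()
  }
}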

Though I have setMaster as local, my Spark application gives an error

I have the following application (it just starts and stops Spark) on Windows, using Scala IDE (Eclipse) and Spark 2.4.4. I get the error "A master URL must be set in your configuration" even though I have set it here.
Can someone please help me fix this issue?
import org.apache.spark._
import org.apache.spark.sql._

object SampleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setMaster("local[*]")
      .setAppName("Simple Application")
    val sc = new SparkContext(conf)
    sc.stop()
  }
}
The error is:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
19/10/28 22:58:56 INFO SparkContext: Running Spark version 2.4.4
19/10/28 22:58:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/10/28 22:58:56 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
19/10/28 22:58:56 ERROR Utils: Uncaught exception in thread main
java.lang.NullPointerException
at org.apache.spark.SparkContext.postApplicationEnd(SparkContext.scala:2416)
at org.apache.spark.SparkContext.$anonfun$stop$2(SparkContext.scala:1931)
at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
at org.apache.spark.SparkContext.stop(SparkContext.scala:1931)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:585)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
19/10/28 22:58:56 INFO SparkContext: Successfully stopped SparkContext
Exception in thread "main" org.apache.spark.SparkException: A master URL must be set in your configuration
at org.apache.spark.SparkContext.<init>(SparkContext.scala:368)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at com.spark.renga.SampleApp$.main(SampleApp.scala:8)
at com.spark.renga.SampleApp.main(SampleApp.scala)
If you are using version 2.4.4, try this:
import org.apache.spark.sql.SparkSession

object SampleApp {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder
      .master("local[*]")
      .appName("test")
      .getOrCreate()

    println(spark.sparkContext.version)
    spark.stop()
  }
}
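As a small follow-up to the answer (an addition, not part of the original post): printing the master and version the running context actually picked up makes it easy to confirm that the build being executed matches the code being edited:

import org.apache.spark.sql.SparkSession

object MasterCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local[*]")
      .appName("master-check")
      .getOrCreate()

    // SparkContext reports the master it was actually started with
    println(s"effective master: ${spark.sparkContext.master}")
    println(s"spark version:    ${spark.sparkContext.version}")
    spark.stop()
  }
}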

Spark Streaming works in Local mode but "stages fail" with "could not initialize class" in Client/Cluster mode

I have a Spark + Kafka streaming app that runs fine in local mode; however, when I try to launch it on YARN in client/cluster mode, I get several errors like the ones below.
The first error I always see is:
WARN TaskSetManager: Lost task 1.1 in stage 3.0 (TID 9, ip-xxx-24-129-36.ec2.internal, executor 2): java.lang.NoClassDefFoundError: Could not initialize class TestStreaming$
at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:60)
at TestStreaming$$anonfun$main$1$$anonfun$apply$1.apply(TestStreaming.scala:59)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:917)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1944)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
The next error I get is:
ERROR JobScheduler: Error running job streaming job 1541786030000 ms.0
followed by
java.lang.NoClassDefFoundError: Could not initialize class
Spark version 2.1.0
Scala 2.11
Kafka version 10
Part of my code loads the config in main when the app launches. I pass this config file at runtime with -conf AFTER the jar (see below). I'm not quite sure, but must I pass this config to the executors as well?
I launch my streaming app with the commands below. One shows local mode, the other client mode.
runJar = myProgram.jar
loggerPath=/path/to/log4j.properties
mainClass=TestStreaming
logger=-DPHDTKafkaConsumer.app.log4j=$loggerPath
confFile=application.conf
-----------Local Mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit \
  --driver-java-options "$logger" \
  --conf "spark.executor.extraJavaOptions=$logger" \
  --class $mainClass --master local[4] $runJar -conf $confFile &

-----------Client Mode----------
SPARK_KAFKA_VERSION=0.10 nohup spark2-submit --master yarn \
  --conf "spark.executor.extraJavaOptions=$logger" \
  --conf "spark.driver.extraJavaOptions=$logger" \
  --class $mainClass $runJar -conf $confFile &
Here is my code. I've been battling this for over a week now.
import Util.UtilFunctions
import UtilFunctions.config
import org.apache.spark.sql.SparkSession
import org.apache.spark.SparkConf
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.log4j.Logger
object TestStreaming extends Serializable {

  @transient lazy val logger: Logger = Logger.getLogger(getClass.getName)

  def main(args: Array[String]) {
    logger.info("Starting app")
    UtilFunctions.loadConfig(args)
    UtilFunctions.loadLogger()

    val props: Map[String, String] = setKafkaProperties()
    val topic = Set(config.getString("config.TOPIC_NAME"))

    val conf = new SparkConf()
      .setAppName(config.getString("config.SPARK_APP_NAME"))
      .set("spark.streaming.backpressure.enabled", "true")

    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()

    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))
    ssc.sparkContext.setLogLevel("INFO")
    ssc.checkpoint(config.getString("config.SPARK_CHECKPOINT_NAME"))

    val kafkaStream = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topic, props))
    val distRecordsStream = kafkaStream.map(record => (record.key(), record.value()))
    distRecordsStream.window(Seconds(10), Seconds(10))

    distRecordsStream.foreachRDD(rdd => {
      if (!rdd.isEmpty()) {
        rdd.foreach(record => {
          println(record._2) // value from kafka
        })
      }
    })

    ssc.start()
    ssc.awaitTermination()
    ssc.stop()
  }

  def setKafkaProperties(): Map[String, String] = {
    val deserializer = "org.apache.kafka.common.serialization.StringDeserializer"
    val zookeeper = config.getString("config.ZOOKEEPER")
    val offsetReset = config.getString("config.OFFSET_RESET")
    val brokers = config.getString("config.BROKERS")
    val groupID = config.getString("config.GROUP_ID")
    val autoCommit = config.getString("config.AUTO_COMMIT")
    val maxPollRecords = config.getString("config.MAX_POLL_RECORDS")
    val maxPollIntervalms = config.getString("config.MAX_POLL_INTERVAL_MS")

    val props = Map(
      "bootstrap.servers" -> brokers,
      "zookeeper.connect" -> zookeeper,
      "group.id" -> groupID,
      "key.deserializer" -> deserializer,
      "value.deserializer" -> deserializer,
      "enable.auto.commit" -> autoCommit,
      "auto.offset.reset" -> offsetReset,
      "max.poll.records" -> maxPollRecords,
      "max.poll.interval.ms" -> maxPollIntervalms)
    props
  }
}
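A hedged note on the NoClassDefFoundError, not a guaranteed fix: "Could not initialize class TestStreaming$" means the object's static initializer threw earlier on that executor JVM. One common cause is the foreachRDD closure touching a member of the enclosing object (the logger, the loaded config) that is only set up properly on the driver. A frequently used pattern is to copy whatever the closure needs into plain local vals first, so executors never have to initialize TestStreaming$ at all. A sketch of the foreachRDD block from the code above rewritten that way:

// Sketch only: same output as the block above, but the closure captures local vals,
// not members of TestStreaming$ (whose initialization depends on driver-side config loading).
val appName = config.getString("config.SPARK_APP_NAME") // resolved once, on the driver

distRecordsStream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.foreach { case (_, value) =>
      // only the local val above is serialized into this closure
      println(s"$appName value: $value")
    }
  }
}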

Spark fails to load data frame from elasticsearch due to field format exception

My DataFrame fails with a NumberFormatException on one of the nested JSON fields when reading from Elasticsearch.
I am not providing any schema, as it should be inferred automatically from Elasticsearch.
package org.arc
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import scala.io.Source
import java.nio.charset.CodingErrorAction
import scala.io.Codec
import org.apache.spark.storage.StorageLevel
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions
import org.apache.spark.sql.functions.{ concat, lit }
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{ StructType, StructField, StringType };
import org.apache.spark.serializer.KryoSerializer
object SparkOnES {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("SparkESTest")
      .config("spark.master", "local[*]")
      .config("spark.sql.warehouse.dir", "C://SparkScala//SparkLocal//spark-warehouse")
      .enableHiveSupport()
      .getOrCreate()

    //1. Read sample JSON
    import spark.implicits._
    //val myjson = spark.read.json("C:\\Users\\jasjyotsinghj599\\Desktop\\SampleTest.json")
    //myjson.show(false)

    //2. Read data from ES
    val esdf = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "XXXXXX")
      .option("es.port", "80")
      .option("es.query", "?q=*")
      .option("es.nodes.wan.only", "true")
      .option("pushdown", "true")
      .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .load("batch_index/ticket")

    esdf.createOrReplaceTempView("esdf")
    spark.sql("Select * from esdf limit 1").show(false)
    val esdf_fltr_lt = esdf.take(1)
  }
}
The error stack says that it cannot parse the input field. Looking at the exception, the issue seems to be caused by a mismatch between the type of data expected (int, float, double) and the type received (string):
Caused by: java.lang.NumberFormatException: For input string: "161.60"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
at org.elasticsearch.spark.serialization.ScalaValueReader.parseLong(ScalaValueReader.scala:142)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.checkNull(ScalaValueReader.scala:120)
at org.elasticsearch.spark.serialization.ScalaValueReader.longValue(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.readValue(ScalaValueReader.scala:89)
at org.elasticsearch.spark.sql.ScalaRowValueReader.readValue(ScalaEsRowValueReader.scala:46)
at org.elasticsearch.hadoop.serialization.ScrollReader.parseValue(ScrollReader.java:770)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:720)
... 25 more
18/04/25 23:33:53 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 4, localhost): org.elasticsearch.hadoop.rest.EsHadoopParsingException: Cannot parse value [161.60] for field [tvl_tkt_tot_chrg_amt]
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:723)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:867)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:710)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:476)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:401)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:296)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:269)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:393)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "161.60"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
at org.elasticsearch.spark.serialization.ScalaValueReader.parseLong(ScalaValueReader.scala:142)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.checkNull(ScalaValueReader.scala:120)
at org.elasticsearch.spark.serialization.ScalaValueReader.longValue(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.readValue(ScalaValueReader.scala:89)
at org.elasticsearch.spark.sql.ScalaRowValueReader.readValue(ScalaEsRowValueReader.scala:46)
at org.elasticsearch.hadoop.serialization.ScrollReader.parseValue(ScrollReader.java:770)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:720)
... 25 more
18/04/25 23:33:53 INFO SparkContext: Invoking stop() from shutdown hook
18/04/25 23:33:53 INFO SparkUI: Stopped Spark web UI at http://10.1.2.244:4040
18/04/25 23:33:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/25 23:33:53 INFO MemoryStore: MemoryStore cleared
18/04/25 23:33:53 INFO BlockManager: BlockManager stopped
18/04/25 23:33:53 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/25 23:33:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/25 23:33:53 INFO SparkContext: Successfully stopped SparkContext
18/04/25 23:33:53 INFO ShutdownHookManager: Shutdown hook called
18/04/25 23:33:53 INFO ShutdownHookManager: Deleting directory
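The failing conversion is easy to reproduce on its own: the trace goes through ScalaValueReader.longValue, i.e. the connector treats tvl_tkt_tot_chrg_amt as a long (presumably from the index mapping) and then tries to parse the string "161.60" as one. A minimal sketch of just that mismatch:

// Minimal reproduction of the parse failure shown in the stack trace above
object ParseDemo extends App {
  val raw = "161.60"
  println(scala.util.Try(raw.toLong))   // Failure(java.lang.NumberFormatException: For input string: "161.60")
  println(scala.util.Try(raw.toDouble)) // Success(161.6) -- the value only fits a floating-point type
}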

Spark Streaming Kafka stream

I'm having some issues while trying to read from Kafka with Spark Streaming.
My code is:
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("KafkaIngestor")
val ssc = new StreamingContext(sparkConf, Seconds(2))

val kafkaParams = Map[String, String](
  "zookeeper.connect" -> "localhost:2181",
  "group.id" -> "consumergroup",
  "metadata.broker.list" -> "localhost:9092",
  "zookeeper.connection.timeout.ms" -> "10000"
  //"kafka.auto.offset.reset" -> "smallest"
)

val topics = Set("test")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, kafkaParams, topics)
I previously started ZooKeeper on port 2181 and Kafka server 0.9.0.0 on port 9092.
But I get the following error in the Spark driver:
Exception in thread "main" java.lang.ClassCastException: kafka.cluster.BrokerEndPoint cannot be cast to kafka.cluster.Broker
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6$$anonfun$apply$7.apply(KafkaCluster.scala:90)
at scala.Option.map(Option.scala:145)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:90)
at org.apache.spark.streaming.kafka.KafkaCluster$$anonfun$2$$anonfun$3$$anonfun$apply$6.apply(KafkaCluster.scala:87)
Zookeeper log:
[2015-12-08 00:32:08,226] INFO Got user-level KeeperException when processing sessionid:0x1517ec89dfd0000 type:create cxid:0x34 zxid:0x1d3 txntype:-1 reqpath:n/a Error Path:/brokers/ids Error:KeeperErrorCode = NodeExists for /brokers/ids (org.apache.zookeeper.server.PrepRequestProcessor)
Any hint?
Thank you very much
The problem was related to the wrong spark-streaming-kafka version.
As described in the documentation:
Kafka: Spark Streaming 1.5.2 is compatible with Kafka 0.8.2.1
So, including
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka_2.10</artifactId>
    <version>0.8.2.2</version>
</dependency>
in my pom.xml (instead of version 0.9.0.0) solved the issue.
Hope this helps
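For readers using sbt instead of Maven, a hedged equivalent of the dependency fix above (the versions mirror the answer; the two Spark lines reflect the Spark Streaming 1.5.2 mentioned in the quoted documentation):

// sbt equivalent of the pom.xml snippet above -- a sketch with the same versions
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.5.2" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.5.2",
  "org.apache.kafka" %% "kafka"                 % "0.8.2.2"
)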
Kafka10 streaming / Spark 2.1.0 / DCOS / Mesosphere
Ugh, I spent all day on this and must have read this post a dozen times. I tried Spark 2.0.0, 2.0.1, Kafka 8, and Kafka 10. Stay away from Kafka 8 and Spark 2.0.x; dependencies are everything. Start with the setup below. It works.
SBT:
"org.apache.hadoop" % "hadoop-aws" % "2.7.3" excludeAll ExclusionRule(organization = "org.apache.hadoop", name = "hadoop-common"),
"org.apache.spark" %% "spark-core" % "2.1.0",
"org.apache.spark" %% "spark-sql" % "2.1.0" ,
"org.apache.spark" % "spark-streaming-kafka-0-10_2.11" % "2.1.0",
"org.apache.spark" % "spark-streaming_2.11" % "2.1.0"
Working Kafka/Spark Streaming code:
val spark = SparkSession
  .builder()
  .appName("ingest")
  .master("local[4]")
  .getOrCreate()

import spark.implicits._

val ssc = new StreamingContext(spark.sparkContext, Seconds(2))
val topics = Set("water2").toSet
val kafkaParams = Map[String, String](
  "metadata.broker.list" -> "broker:port,broker:port",
  "bootstrap.servers" -> "broker:port,broker:port",
  "group.id" -> "somegroup",
  "auto.commit.interval.ms" -> "1000",
  "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> "true"
)

val messages = KafkaUtils.createDirectStream[String, String](ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

messages.foreachRDD(rdd => {
  if (rdd.count() >= 1) {
    rdd.map(record => (record.key, record.value))
      .toDS()
      .withColumnRenamed("_2", "value")
      .drop("_1")
      .show(5, false)
    println(rdd.getClass)
  }
})

ssc.start()
ssc.awaitTermination()
Please like if you see this so I can get some reputation points. :)
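An optional, hedged variant of the foreachRDD body above: naming the columns directly in toDF gives the same value-only output without the rename-then-drop step (the getClass debug print is omitted).

// Same displayed output as the block above, with explicit column names instead of _1/_2 renames
messages.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    rdd.map(record => (record.key, record.value))
      .toDF("key", "value")
      .select("value")
      .show(5, truncate = false)
  }
}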
