I am trying to get a very simple Kafka + Spark Streaming integration working.
On the Kafka side, I cloned this repository (https://github.com/confluentinc/cp-docker-images) and ran docker-compose up to get an instance of ZooKeeper and Kafka running. I created a topic called "foo" and added messages to it. In this setup, Kafka is listening on port 29092.
On the Spark side, my build.sbt file looks like this:
name := "KafkaSpark"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion,
  "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
)
I was able to get the following code snippet running, consuming data typed into a terminal:
import org.apache.spark._
import org.apache.spark.streaming._
object SparkTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
    val ssc = new StreamingContext(conf, Seconds(3))
    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    // Print the first ten elements of each RDD generated in this DStream to the console
    wordCounts.print()
    ssc.start() // Start the computation
    ssc.awaitTermination() // Wait for the computation to terminate
  }
}
So Spark Streaming itself is working.
Next, I created the following to consume from Kafka:
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.count
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.sql.types.{StringType, StructType, TimestampType}
object KafkaTest {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .master("local")
      .appName("Spark Word Count")
      .getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(3))
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:29092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "stream_group_id",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )
    val topics = Array("foo")
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](topics, kafkaParams)
    )
    stream.foreachRDD { (rdd, time) =>
      val data = rdd.map(record => record.value)
      data.foreach(println)
      println(time)
    }
    ssc.start() // Start the computation
    ssc.awaitTermination()
  }
}
When it runs, I get the following in the console (I'm running this in IntelliJ). The process just hangs at the last line, after "subscribing" to the topic. I've also tried subscribing to a topic that does not exist and I get the same result, i.e. no error is thrown even though the topic is missing. If I point it at a broker that does not exist, I do get an error (Exception in thread "main" org.apache.kafka.common.KafkaException: Failed to construct kafka consumer), so it seems to be finding the broker when I use the proper port.
Any suggestions on how to correct this issue?
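One way to double-check the "it must be finding the broker" reasoning independently of Spark is a plain TCP probe of the bootstrap address. The sketch below uses only the Python standard library; the function name is hypothetical. A successful connection only shows that something is listening on the port. It does not confirm that the broker's advertised listeners are reachable or that the topic holds any data.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and opens a TCP socket;
        # the context manager closes it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False
```

For the setup described here, the call would be can_connect("localhost", 29092).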
Here's the log file:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
17/11/23 05:29:42 INFO SparkContext: Running Spark version 2.2.0
17/11/23 05:29:42 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
17/11/23 05:29:48 INFO SparkContext: Submitted application: Spark Word Count
17/11/23 05:29:48 INFO SecurityManager: Changing view acls to: jonathandick
17/11/23 05:29:48 INFO SecurityManager: Changing modify acls to: jonathandick
17/11/23 05:29:48 INFO SecurityManager: Changing view acls groups to:
17/11/23 05:29:48 INFO SecurityManager: Changing modify acls groups to:
17/11/23 05:29:48 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(jonathandick); groups with view permissions: Set(); users with modify permissions: Set(jonathandick); groups with modify permissions: Set()
17/11/23 05:29:48 INFO Utils: Successfully started service 'sparkDriver' on port 59606.
17/11/23 05:29:48 DEBUG SparkEnv: Using serializer: class org.apache.spark.serializer.JavaSerializer
17/11/23 05:29:48 INFO SparkEnv: Registering MapOutputTracker
17/11/23 05:29:48 INFO SparkEnv: Registering BlockManagerMaster
17/11/23 05:29:48 INFO BlockManagerMasterEndpoint: Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/11/23 05:29:48 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/11/23 05:29:48 INFO DiskBlockManager: Created local directory at /private/var/folders/w2/njgz3jnd097cdybxcvp9c2hw0000gn/T/blockmgr-3a3feb00-0fdb-4bc5-867d-808ac65d7c8f
17/11/23 05:29:48 INFO MemoryStore: MemoryStore started with capacity 2004.6 MB
17/11/23 05:29:48 INFO SparkEnv: Registering OutputCommitCoordinator
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4041. Attempting port 4042.
17/11/23 05:29:49 WARN Utils: Service 'SparkUI' could not bind on port 4042. Attempting port 4043.
17/11/23 05:29:49 INFO Utils: Successfully started service 'SparkUI' on port 4043.
17/11/23 05:29:49 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.1.67:4043
17/11/23 05:29:49 INFO Executor: Starting executor ID driver on host localhost
17/11/23 05:29:49 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 59613.
17/11/23 05:29:49 INFO NettyBlockTransferService: Server created on 192.168.1.67:59613
17/11/23 05:29:49 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
17/11/23 05:29:49 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.1.67:59613 with 2004.6 MB RAM, BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO BlockManager: Initialized BlockManager: BlockManagerId(driver, 192.168.1.67, 59613, None)
17/11/23 05:29:49 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:/Users/jonathandick/IdeaProjects/KafkaSpark/spark-warehouse/').
17/11/23 05:29:49 INFO SharedState: Warehouse path is 'file:/Users/jonathandick/IdeaProjects/KafkaSpark/spark-warehouse/'.
17/11/23 05:29:50 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
17/11/23 05:29:50 WARN StreamingContext: spark.master should be set as local[n], n > 1 in local mode if you have receivers to get data, otherwise Spark jobs will not get resources to process the received data.
17/11/23 05:29:50 WARN KafkaUtils: overriding enable.auto.commit to false for executor
17/11/23 05:29:50 WARN KafkaUtils: overriding auto.offset.reset to none for executor
17/11/23 05:29:50 WARN KafkaUtils: overriding executor group.id to spark-executor-stream_group_id
17/11/23 05:29:50 WARN KafkaUtils: overriding receive.buffer.bytes to 65536 see KAFKA-3135
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Slide time = 3000 ms
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Storage level = Serialized 1x Replicated
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Checkpoint interval = null
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Remember interval = 3000 ms
17/11/23 05:29:50 INFO DirectKafkaInputDStream: Initialized and validated org.apache.spark.streaming.kafka010.DirectKafkaInputDStream@1a38eb73
17/11/23 05:29:50 INFO ForEachDStream: Slide time = 3000 ms
17/11/23 05:29:50 INFO ForEachDStream: Storage level = Serialized 1x Replicated
17/11/23 05:29:50 INFO ForEachDStream: Checkpoint interval = null
17/11/23 05:29:50 INFO ForEachDStream: Remember interval = 3000 ms
17/11/23 05:29:50 INFO ForEachDStream: Initialized and validated org.apache.spark.streaming.dstream.ForEachDStream@1e801ce2
17/11/23 05:29:50 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [localhost:29092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id =
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = stream_group_id
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = latest
17/11/23 05:29:50 DEBUG KafkaConsumer: Starting the Kafka consumer
17/11/23 05:29:50 INFO ConsumerConfig: ConsumerConfig values:
metric.reporters = []
metadata.max.age.ms = 300000
partition.assignment.strategy = [org.apache.kafka.clients.consumer.RangeAssignor]
reconnect.backoff.ms = 50
sasl.kerberos.ticket.renew.window.factor = 0.8
max.partition.fetch.bytes = 1048576
bootstrap.servers = [localhost:29092]
ssl.keystore.type = JKS
enable.auto.commit = false
sasl.mechanism = GSSAPI
interceptor.classes = null
exclude.internal.topics = true
ssl.truststore.password = null
client.id = consumer-1
ssl.endpoint.identification.algorithm = null
max.poll.records = 2147483647
check.crcs = true
request.timeout.ms = 40000
heartbeat.interval.ms = 3000
auto.commit.interval.ms = 5000
receive.buffer.bytes = 65536
ssl.truststore.type = JKS
ssl.truststore.location = null
ssl.keystore.password = null
fetch.min.bytes = 1
send.buffer.bytes = 131072
value.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
group.id = stream_group_id
retry.backoff.ms = 100
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
ssl.trustmanager.algorithm = PKIX
ssl.key.password = null
fetch.max.wait.ms = 500
sasl.kerberos.min.time.before.relogin = 60000
connections.max.idle.ms = 540000
session.timeout.ms = 30000
metrics.num.samples = 2
key.deserializer = class org.apache.kafka.common.serialization.StringDeserializer
ssl.protocol = TLS
ssl.provider = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
ssl.keystore.location = null
ssl.cipher.suites = null
security.protocol = PLAINTEXT
ssl.keymanager.algorithm = SunX509
metrics.sample.window.ms = 30000
auto.offset.reset = latest
17/11/23 05:29:50 INFO AppInfoParser: Kafka version : 0.10.0.1
17/11/23 05:29:50 INFO AppInfoParser: Kafka commitId : a7a17cdec9eaa6c5
17/11/23 05:29:50 DEBUG KafkaConsumer: Kafka consumer created
17/11/23 05:29:50 DEBUG KafkaConsumer: Subscribed to topic(s): foo
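The ConsumerConfig dump above shows auto.offset.reset = latest, and the messages were added to "foo" before the job was started. One possible explanation for the apparent hang (an assumption, not something the log confirms) is that a consumer group with no committed offsets starts at the end of each partition under "latest", so messages written earlier are never delivered and nothing prints until new messages arrive. The toy model below illustrates that rule in plain Python; the function names are hypothetical and no Kafka client is involved.

```python
def starting_offset(committed, log_start, log_end, auto_offset_reset):
    """Offset a consumer would begin reading from in this simplified model.

    committed         -- last committed offset for the group, or None
    log_start/log_end -- first available offset and next offset to be written
    """
    if committed is not None:
        # A committed offset always wins over the reset policy.
        return committed
    if auto_offset_reset == "earliest":
        return log_start
    if auto_offset_reset == "latest":
        return log_end
    # Any other policy (e.g. "none") is treated as an error in this model.
    raise ValueError("no committed offset and auto.offset.reset=%r" % auto_offset_reset)


def unread_messages(committed, log_start, log_end, auto_offset_reset):
    """Number of already-written messages the consumer would still receive."""
    return log_end - starting_offset(committed, log_start, log_end, auto_offset_reset)
```

With five messages already in the partition and no committed offset, the model yields zero pending messages for "latest" and five for "earliest".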
Related
I need to support a use case where we can run Beam pipelines against an external Spark master URL.
I took a basic example of a Beam pipeline, shown below:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
class ConvertToByteArray(beam.DoFn):
    def __init__(self):
        pass
    def setup(self):
        pass
    def process(self, row):
        try:
            yield bytearray(row + '\n', 'utf-8')
        except Exception as e:
            raise e
def run():
    options = PipelineOptions([
        "--runner=SparkRunner",
        # "--spark_master_url=spark://0.0.0.0:7077",
        # "--spark_version=3",
    ])
    with beam.Pipeline(options=options) as p:
        lines = (p
            | 'Create words' >> beam.Create(['this is working'])
            | 'Split words' >> beam.FlatMap(lambda words: words.split(' '))
            | 'Build byte array' >> beam.ParDo(ConvertToByteArray())
            | 'Group' >> beam.GroupBy() # Do future batching here
            | 'print output' >> beam.Map(print)
        )
if __name__ == "__main__":
    run()
I tried to run this pipeline in two ways:
1. Using Apache Beam's internal Spark runner
2. Running Spark locally and passing the Spark master URL
Approach 1 works fine and I'm able to see the output (screenshot below).
[Screenshot of output]
Approach 2 fails with a "local class incompatible" error (java.io.InvalidClassException) on the Spark executor.
Running Spark as a Docker container and natively on my machine both gave the same error.
Exception trace:
Spark Executor Command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=62420" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler@192.168.8.120:62420" "--executor-id" "0" "--hostname" "172.17.0.2" "--cores" "4" "--app-id" "app-20220425143553-0000" "--worker-url" "spark://Worker@172.17.0.2:44575"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/04/25 14:35:55 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 165@ace075bc56c0
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for TERM
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for HUP
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for INT
22/04/25 14:35:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/25 14:35:56 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing view acls groups to:
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls groups to:
22/04/25 14:35:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 194 ms (0 ms spent in bootstraps)
22/04/25 14:35:57 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing view acls groups to:
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls groups to:
22/04/25 14:35:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 19 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO DiskBlockManager: Created local directory at /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/blockmgr-d7bd5b95-ddb3-4642-9863-261a6e109fc4
22/04/25 14:35:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler@192.168.8.120:62420
22/04/25 14:35:58 INFO WorkerWatcher: Connecting to worker spark://Worker@172.17.0.2:44575
22/04/25 14:35:58 INFO TransportClientFactory: Successfully created connection to /172.17.0.2:44575 after 7 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO WorkerWatcher: Successfully connected to spark://Worker@172.17.0.2:44575
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO ResourceUtils: No custom resources configured for spark.executor.
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/04/25 14:35:58 INFO Executor: Starting executor ID 0 on host 172.17.0.2
22/04/25 14:35:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39027.
22/04/25 14:35:59 INFO NettyBlockTransferService: Server created on 172.17.0.2:39027
22/04/25 14:35:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/04/25 14:35:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO Executor: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar with timestamp 1650897351765
22/04/25 14:35:59 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 5 ms (0 ms spent in bootstraps)
22/04/25 14:35:59 INFO Utils: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar to /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/fetchFileTemp1955801007454794690.tmp
22/04/25 14:36:02 INFO Utils: Copying /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/-16622853561650897351765_cache to /opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar
22/04/25 14:36:04 INFO Executor: Adding file:/opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar to class loader
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/04/25 14:36:04 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/04/25 14:36:04 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/04/25 14:36:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 0.0 in stage 0.0 (TID 0),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1.0 in stage 0.0 (TID 1),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
22/04/25 14:36:05 INFO MemoryStore: MemoryStore cleared
22/04/25 14:36:05 INFO BlockManager: BlockManager stopped
22/04/25 14:36:05 INFO ShutdownHookManager: Shutdown hook called
22/04/25 14:36:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d
I confirmed that a worker was running and that Spark was set up properly.
I submitted a sample Spark job to this master and it ran successfully, which implies that there isn't anything wrong with the Spark master or worker.
Versions used:
Spark 3.2.1
Apache Beam 2.37.0
Python 3.7
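One detail visible in the executor log is that it fetches beam-runners-spark-3-job-server-2.33.0.jar, while the Beam version listed above is 2.37.0. Whether that difference is related to the InvalidClassException is not established here. The exception itself only says that the serialized AccumulatorV2 class and the executor's local copy carry different serialVersionUIDs, which usually means the two sides were built against different Spark versions. As a small aid for spotting such differences, the sketch below (standard library only, hypothetical function names) pulls the job-server jar versions out of log text and compares them with an installed Beam version.

```python
import re

# Matches jar names such as beam-runners-spark-3-job-server-2.33.0.jar
JAR_PATTERN = re.compile(r"beam-runners-spark-(\d+)-job-server-(\d+\.\d+\.\d+)\.jar")


def job_server_versions(log_text):
    """Return the set of (spark_major, beam_version) pairs found in jar names."""
    return {(m.group(1), m.group(2)) for m in JAR_PATTERN.finditer(log_text)}


def beam_version_mismatches(log_text, installed_beam_version):
    """Return job-server Beam versions in the log that differ from the installed one."""
    return sorted(
        beam_version
        for _, beam_version in job_server_versions(log_text)
        if beam_version != installed_beam_version
    )
```

Fed the "Fetching spark://.../jars/beam-runners-spark-3-job-server-2.33.0.jar" line from the trace and the installed version "2.37.0", the second function reports "2.33.0" as a differing version.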
I have been trying to stream some sample data using PySpark from one Kafka topic to another (I eventually want to apply some transformations, but I could not get even the basic data movement to work). Below is my Spark code.
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode
from pyspark.sql.functions import split
from pyspark.sql.types import StructType, StringType, IntegerType
from pyspark.sql.functions import from_json, col
import time
confluentApiKey = 'someapikeyvalue'
confluentSecret = 'someapikey'
spark = SparkSession.builder\
.appName("repartition-job") \
.config('spark.jars.packages', 'org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1')\
.getOrCreate()
df = spark\
.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-cloud:9092") \
.option("subscribe", "test1") \
.option("topic", "test1") \
.option("sasl.mechanisms", "PLAIN")\
.option("security.protocol", "SASL_SSL")\
.option("sasl.username", confluentApiKey)\
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))\
.option("kafka.ssl.endpoint.identification.algorithm", "https")\
.option("sasl.password", confluentSecret)\
.option("startingOffsets", "earliest")\
.option("basic.auth.credentials.source", "USER_INFO")\
.option("failOnDataLoss", "true").load()
df.printSchema()
query = df \
.selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value") \
.writeStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "pkc-cloud:9092") \
.option("topic", "test2") \
.option("sasl.mechanisms", "PLAIN")\
.option("security.protocol", "SASL_SSL")\
.option("sasl.username", confluentApiKey)\
.option("kafka.sasl.jaas.config", "kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username='{}' password='{}';".format(confluentApiKey, confluentSecret))\
.option("kafka.ssl.endpoint.identification.algorithm", "https")\
.option("sasl.password", confluentSecret)\
.option("startingOffsets", "latest")\
.option("basic.auth.credentials.source", "USER_INFO")\
.option("checkpointLocation", "/tmp/checkpoint").start()
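A side note on the options used above: Spark's Kafka integration guide describes Kafka client settings as being passed with a "kafka." prefix (for example kafka.bootstrap.servers), and several names above, such as sasl.mechanisms, sasl.username and sasl.password, have no prefix. Whether that plays any part in the behaviour described here is an assumption. The sketch below is a plain Python check with a hypothetical function name that lists the option names which would presumably not be forwarded to the Kafka client. The set of Spark-level options is limited to the ones that appear in this code and is not exhaustive.

```python
# Options consumed by the Spark Kafka source/sink or by the stream writer itself.
# Only the names used in the code above are listed here.
SPARK_LEVEL_OPTIONS = {
    "subscribe",
    "topic",
    "startingOffsets",
    "failOnDataLoss",
    "checkpointLocation",
}


def options_not_forwarded(option_names):
    """Return option names that lack the 'kafka.' prefix and are not Spark-level."""
    return sorted(
        name
        for name in option_names
        if not name.startswith("kafka.") and name not in SPARK_LEVEL_OPTIONS
    )
```

Applied to the readStream options above, it returns basic.auth.credentials.source, sasl.mechanisms, sasl.password, sasl.username and security.protocol.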
The schema prints correctly:
|-- key: binary (nullable = true)
|-- value: binary (nullable = true)
|-- topic: string (nullable = true)
|-- partition: integer (nullable = true)
|-- offset: long (nullable = true)
|-- timestamp: timestamp (nullable = true)
|-- timestampType: integer (nullable = true)
However, when attempting to write to another Kafka topic using writeStream, I see the logs below; no data is written and Spark shuts down.
22/02/04 18:29:26 INFO CheckpointFileManager: Writing atomically to file:/tmp/checkpoint/metadata using temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp
22/02/04 18:29:26 INFO CheckpointFileManager: Renamed temp file file:/tmp/checkpoint/.metadata.e6c58f93-5c1c-4f26-97cf-a8d3ed389a57.tmp to file:/tmp/checkpoint/metadata
22/02/04 18:29:27 INFO MicroBatchExecution: Starting [id = 71f0aeb8-46fc-49b5-8baf-3b83cb4df71f, runId = 7a274767-8830-448c-b9bc-d03217cd4465]. Use file:/tmp/checkpoint to store the query checkpoint.
22/02/04 18:29:27 INFO MicroBatchExecution: Reading table [org.apache.spark.sql.kafka010.KafkaSourceProvider$KafkaTable#3be72f6d] from DataSourceV2 named 'kafka' [org.apache.spark.sql.kafka010.KafkaSourceProvider#276c9fdc]
22/02/04 18:29:27 INFO SparkUI: Stopped Spark web UI at http://spark-sample-9d328d7ec5fee0bc-driver-svc.default.svc:4045
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/02/04 18:29:27 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/02/04 18:29:27 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed (this is expected if the application is shutting down.)
22/02/04 18:29:27 INFO MicroBatchExecution: Starting new streaming query.
22/02/04 18:29:27 INFO MicroBatchExecution: Stream started from {}
22/02/04 18:29:27 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/02/04 18:29:27 INFO MemoryStore: MemoryStore cleared
22/02/04 18:29:27 INFO BlockManager: BlockManager stopped
22/02/04 18:29:27 INFO BlockManagerMaster: BlockManagerMaster stopped
22/02/04 18:29:27 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/02/04 18:29:28 INFO SparkContext: Successfully stopped SparkContext
22/02/04 18:29:28 INFO ConsumerConfig: ConsumerConfig values:
....
....
....
ssl.keystore.type = JKS
ssl.protocol = TLS
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
22/02/04 18:29:28 INFO ShutdownHookManager: Shutdown hook called
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /tmp/spark-7acecef5-0f0b-4b9a-af81-c8aa12f7fcad
22/02/04 18:29:28 INFO AppInfoParser: Kafka version: 2.4.1
22/02/04 18:29:28 INFO AppInfoParser: Kafka commitId: c57222ae8cd7866b
22/02/04 18:29:28 INFO AppInfoParser: Kafka startTimeMs: 1643999368467
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372/pyspark-59e822b4-a4a4-403b-9937-170d99c67584
22/02/04 18:29:28 INFO KafkaConsumer: [Consumer clientId=consumer-spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0-1, groupId=spark-kafka-source-75381ad2-1ce9-4e2b-a0b7-18d6ecb5ea8b--2090736517-driver-0] Subscribed to topic(s): test1
22/02/04 18:29:28 INFO ShutdownHookManager: Deleting directory /var/data/spark-641f2e65-8f10-46b9-9821-d3b1f3536c0e/spark-1a103622-3329-4444-8e69-40f5a341c372
22/02/04 18:29:28 INFO MetricsSystemImpl: Stopping s3a-file-system metrics system...
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system stopped.
22/02/04 18:29:28 INFO MetricsSystemImpl: s3a-file-system metrics system shutdown complete.
Sometimes I also see the logs below, where the Kafka connection fails to establish.
22/02/06 04:50:55 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:56 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:57 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
22/02/06 04:50:58 WARN NetworkClient: [Consumer clientId=consumer-spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0-1, groupId=spark-kafka-source-fc21e146-82f0-4fc7-a2da-34f3e8f70026-289148490-driver-0] Bootstrap broker pkc-.confluent.cloud:9092 (id: -1 rack: null) disconnected
What am I doing wrong?
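Two things in the snippet above are worth checking, sketched here in Scala (the option names are the same in PySpark). First, the Spark Kafka connector only forwards options that start with kafka. to the Kafka client, so unprefixed entries such as security.protocol never reach it, and the Java client spells the mechanism key sasl.mechanism. Second, nothing in the posted script blocks after start(), which would be consistent with the driver shutting down right after "Starting new streaming query". The kafkaOptions helper, the environment variable names, and the unshaded PlainLoginModule class name (assuming a stock Apache Spark build rather than Databricks) are assumptions, not taken from the post.

```scala
import org.apache.spark.sql.SparkSession

object KafkaRoundTripSketch {
  // Hypothetical helper: every client-level setting carries the "kafka." prefix,
  // because the connector drops options without it before building the client.
  def kafkaOptions(bootstrap: String, apiKey: String, secret: String): Map[String, String] = Map(
    "kafka.bootstrap.servers" -> bootstrap,
    "kafka.security.protocol" -> "SASL_SSL",
    "kafka.sasl.mechanism" -> "PLAIN",
    // Assumption: stock Apache Spark, so the login module is not under "kafkashaded."
    "kafka.sasl.jaas.config" ->
      s"org.apache.kafka.common.security.plain.PlainLoginModule required username='$apiKey' password='$secret';",
    "kafka.ssl.endpoint.identification.algorithm" -> "https"
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kafka-round-trip").getOrCreate()

    // Hypothetical environment variable names for the Confluent credentials.
    val opts = kafkaOptions(
      "pkc-cloud:9092",
      sys.env("CONFLUENT_API_KEY"),
      sys.env("CONFLUENT_SECRET"))

    val df = spark.readStream
      .format("kafka")
      .options(opts)
      .option("subscribe", "test1")
      .option("startingOffsets", "earliest") // source-side option only
      .load()

    val query = df
      .selectExpr("CAST(key AS STRING) AS key", "to_json(struct(*)) AS value")
      .writeStream
      .format("kafka")
      .options(opts)
      .option("topic", "test2")
      .option("checkpointLocation", "/tmp/checkpoint")
      .start()

    // Block the driver; without this the script ends and Spark stops the query.
    query.awaitTermination()
  }
}
```

If the broker still disconnects with these settings, the bootstrap address and credentials are the next things to verify.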
I am trying to filter DataFrame/Dataset records using the filter function with a Scala anonymous function, but it throws a Task not serializable exception. Can someone please look at the code and explain what is wrong with it?
val spark = SparkSession.builder()
.appName("test data frame")
.master("local[*]")
.getOrCreate()
val user_seq = Seq(
Row(1,"John","London"),
Row(1,"Martin","New York"),
Row(1,"Abhishek","New York")
)
val user_schema = StructType(
Array(
StructField("user_id",IntegerType,true),
StructField("user_name",StringType,true),
StructField("user_city",StringType,true)
))
var user_df = spark.createDataFrame(spark.sparkContext.parallelize(user_seq),user_schema)
var user_rdd = user_df.filter((item)=>{
return item.getString(2) == "New York"
})
user_rdd.count();
I see the exception below on the console. When I filter the data by column name instead, it works fine.
objc[48765]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java (0x1059db4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x105a5f4e0). One of the two will be used. Which one is undefined.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/18 20:10:09 INFO SparkContext: Running Spark version 2.4.6
20/07/18 20:10:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/18 20:10:09 INFO SparkContext: Submitted application: test data frame
20/07/18 20:10:09 INFO SecurityManager: Changing view acls groups to:
20/07/18 20:10:09 INFO SecurityManager: Changing modify acls groups to:
20/07/18 20:10:12 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/18 20:10:12 INFO ContextCleaner: Cleaned accumulator 0
20/07/18 20:10:13 INFO CodeGenerator: Code generated in 170.789451 ms
20/07/18 20:10:13 INFO CodeGenerator: Code generated in 17.729004 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:163)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:872)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:871)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:871)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
at DataFrameTest$.main(DataFrameTest.scala:65)
at DataFrameTest.main(DataFrameTest.scala)
Caused by: java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object#cec590c)
- field (class: DataFrameTest$$anonfun$1, name: nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class DataFrameTest$$anonfun$1, <function1>)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 5)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
... 48 more
20/07/18 20:10:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/18 20:10:13 INFO SparkUI: Stopped Spark web UI at http://192.168.31.239:4040
20/07/18 20:10:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/18 20:10:13 INFO MemoryStore: MemoryStore cleared
20/07/18 20:10:13 INFO BlockManager: BlockManager stopped
20/07/18 20:10:13 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/18 20:10:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/18 20:10:13 INFO SparkContext: Successfully stopped SparkContext
20/07/18 20:10:13 INFO ShutdownHookManager: Shutdown hook called
20/07/18 20:10:13 INFO ShutdownHookManager: Deleting directory /private/var/folders/33/3n6vtfs54mdb7x6882fyqy4mccfmvg/T/spark-3e071448-7ad7-47b8-bf70-68ab74721aa2
Process finished with exit code 1
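Remove the return keyword from the lambda. In Scala, return inside an anonymous function returns from the enclosing method, so the closure has to capture a plain java.lang.Object as a return key (the nonLocalReturnKey1$1 field in the serialization stack), and that object is not serializable.
Change this code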
var user_rdd = user_df.filter((item)=>{
return item.getString(2) == "New York"
})
to this line
var user_rdd = user_df.filter(_.getString(2) == "New York")
or
user_df.filter($"user_city" === "New York").count
You can also simplify the code as follows (toDF and $ need import spark.implicits._):
val df = Seq((1,"John","London"),(1,"Martin","New York"),(1,"Abhishek","New York"))
.toDF("user_id","user_name","user_city")
df.filter($"user_city" === "New York").count
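The effect of return can be reproduced without Spark. The sketch below (the object and method names are illustrative) pushes two predicates through plain Java serialization, which is the serializer shown in the stack trace for the closure check. It assumes Scala 2.x, where function literals are serializable unless they capture something that is not.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object NonLocalReturnDemo {
  // True if Java serialization accepts the value, false if it rejects it.
  def isSerializable(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // `return` inside the lambda targets this method, so the compiler makes the
  // lambda capture a java.lang.Object that serves as the non-local return key.
  def withReturn(): Boolean = {
    val predicate: String => Boolean = (city: String) => {
      return city == "New York"
    }
    isSerializable(predicate) // the predicate is never invoked, only serialized
  }

  // Without `return`, the lambda captures nothing.
  def withoutReturn(): Boolean = {
    val predicate: String => Boolean = (city: String) => city == "New York"
    isSerializable(predicate)
  }

  def main(args: Array[String]): Unit = {
    println(s"serializable with return:    ${withReturn()}")
    println(s"serializable without return: ${withoutReturn()}")
  }
}
```

The last expression of a Scala block is already its value, so dropping return changes nothing about what the predicate computes.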
I am using spark-sql 2.4.1 for streaming in my PoC.
I am trying to join two DataFrames by registering them as tables.
For this I am using createGlobalTempView, as below:
first_df.createGlobalTempView("first_tab");
second_df.createGlobalTempView("second_tab");
Dataset<Row> joinUpdatedRecordsDs = sparkSession.sql("select a.* , b.create_date, b.last_update_date from first_tab as a "
+ " inner join second_tab as b "
+ " on a.company_id = b.company_id "
);
ERROR org.apache.spark.sql.AnalysisException: Table or view not found: first_tab; line 1 pos 105
What am I doing wrong here? How can I fix this?
Some more info:
My Spark session is created with .enableHiveSupport().
In the logs I found these traces:
19/09/13 12:40:45 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
19/09/13 12:40:45 INFO HiveMetaStore: 0: get_table : db=default tbl=first_tab
19/09/13 12:40:45 INFO audit: ugi=userrw ip=unknown-ip-addr cmd=get_table : db=default tbl=first_tab
19/09/13 12:40:45 INFO HiveMetaStore: 0: get_table : db=default tbl=second_tab
19/09/13 12:40:45 INFO audit: ugi=userrw ip=unknown-ip-addr cmd=get_table : db=default tbl=second_tab
19/09/13 12:40:45 INFO HiveMetaStore: 0: get_database: default
19/09/13 12:40:45 INFO audit: ugi=userrw ip=unknown-ip-addr cmd=get_database: default
19/09/13 12:40:45 INFO HiveMetaStore: 0: get_database: default
19/09/13 12:40:45 INFO audit: ugi=userrw ip=unknown-ip-addr cmd=get_database: default
19/09/13 12:40:45 INFO HiveMetaStore: 0: get_tables: db=default pat=*
19/09/13 12:40:45 INFO audit: ugi=userrw ip=unknown-ip-addr cmd=get_tables: db=default pat=*
System.out.println("first_tab exists : " + sparkSession.catalog().tableExists("first_tab"));
System.out.println("second_tab exists : " + sparkSession.catalog().tableExists("second_tab"));
Output
first_tab exists : false
second_tab exists : false
I tried to print the tables in the database as below, but nothing is printed.
sparkSession.catalog().listTables().foreach( tab -> {
System.out.println("tab.database :" + tab.database());
System.out.println("tab.name :" + tab.name());
System.out.println("tab.tableType :" + tab.tableType());
});
No output is printed, so apparently no table was created.
I tried to create the views with a "global_temp." prefix, but that throws an error:
org.apache.spark.sql.AnalysisException: It is not allowed to add database prefix `global_temp` for the TEMPORARY view name.;
at org.apache.spark.sql.execution.command.CreateViewCommand.<init>(views.scala:122)
I also tried referring to the tables with a "global_temp." prefix, but that throws the same error as above, i.e.
System.out.println("first_tab exists : " + sparkSession.catalog().tableExists("global_temp.first_tab"));
System.out.println("second_tab exists : " + sparkSession.catalog().tableExists("global_temp.second_tab"));
This gives the same error as above.
Global temporary views live in a database named global_temp, so I would recommend referencing the tables in your queries as global_temp.table_name. I am not sure whether it solves your problem, but it is worth a try.
From the Spark source code:
Global temporary view is cross-session. Its lifetime is the lifetime of the Spark application, i.e. it will be automatically dropped when the application terminates. It's tied to a system preserved database global_temp, and we must use the qualified name to refer a global temp view, e.g. SELECT * FROM global_temp.view1.
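A minimal sketch of the qualified-name rule, assuming a local SparkSession and made-up sample rows (the row values, the object name and joinedRowCount are illustrative only): the views are registered under their bare names and then queried with the global_temp. prefix.

```scala
import org.apache.spark.sql.SparkSession

object GlobalTempViewDemo {
  def joinedRowCount(spark: SparkSession): Long = {
    import spark.implicits._

    // Made-up rows standing in for first_df and second_df from the question.
    val firstDf = Seq((1, "acme")).toDF("company_id", "company_name")
    val secondDf = Seq((1, "2019-01-01", "2019-09-13"))
      .toDF("company_id", "create_date", "last_update_date")

    // Register with the bare name; createOrReplace keeps the sketch re-runnable.
    firstDf.createOrReplaceGlobalTempView("first_tab")
    secondDf.createOrReplaceGlobalTempView("second_tab")

    // Query with the qualified name global_temp.<view>.
    val joined = spark.sql(
      """select a.*, b.create_date, b.last_update_date
        |from global_temp.first_tab as a
        |inner join global_temp.second_tab as b
        |on a.company_id = b.company_id""".stripMargin)
    joined.count()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("global-temp-view-demo")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    println(s"joined rows: ${joinedRowCount(spark)}")
    spark.stop()
  }
}
```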
Remove .enableHiveSupport() when creating the session; it should then work.
SparkSession spark = SparkSession
.builder()
.appName("DatabaseMigrationUtility")
//.enableHiveSupport()
.getOrCreate();
As David mentioned, use the global_temp. prefix to refer to the table.
Hi everybody, I have a problem saving a DataFrame. I found a similar unanswered question: "Saving Spark dataFrames as parquet files - no errors, but data is not being saved". My problem appears when I test the following code:
scala> import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vectors
scala> val dataset = spark.createDataFrame(
| Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
| ).toDF("id", "hour", "mobile", "userFeatures", "clicked")
dataset: org.apache.spark.sql.DataFrame = [id: int, hour: int ... 3 more fields]
scala> dataset.show
+---+----+------+--------------+-------+
| id|hour|mobile| userFeatures|clicked|
+---+----+------+--------------+-------+
| 0| 18| 1.0|[0.0,10.0,0.5]| 1.0|
+---+----+------+--------------+-------+
scala> dataset.write.parquet("/home/vitrion/out")
No errors are shown, and it seems the DataFrame has been saved as a Parquet file. Surprisingly, no file has been created in the output directory.
This is my cluster configuration
The logfile says:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
18/03/01 12:56:53 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 51016#t630-0
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for TERM
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for HUP
18/03/01 12:56:53 INFO SignalUtils: Registered signal handler for INT
18/03/01 12:56:53 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/01 12:56:54 WARN Utils: Your hostname, t630-0 resolves to a loopback address: 127.0.1.1; using 192.168.239.218 instead (on interface eno1)
18/03/01 12:56:54 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
18/03/01 12:56:54 INFO SecurityManager: Changing view acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing view acls groups to:
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls groups to:
18/03/01 12:56:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vitrion); groups with view permissions: Set(); users with modify permissions: Set(vitrion); groups with modify permissions: Set()
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:42629 after 80 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO SecurityManager: Changing view acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls to: vitrion
18/03/01 12:56:54 INFO SecurityManager: Changing view acls groups to:
18/03/01 12:56:54 INFO SecurityManager: Changing modify acls groups to:
18/03/01 12:56:54 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(vitrion); groups with view permissions: Set(); users with modify permissions: Set(vitrion); groups with modify permissions: Set()
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:42629 after 2 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO DiskBlockManager: Created local directory at /tmp/spark-d749d72b-6db2-4f02-8dae-481c0ea1f68f/executor-f379929a-3a6a-4366-8983-b38e19fb9cfc/blockmgr-c6d89ef4-b22a-4344-8816-23306722d40c
18/03/01 12:56:54 INFO MemoryStore: MemoryStore started with capacity 8.4 GB
18/03/01 12:56:54 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#192.168.239.54:42629
18/03/01 12:56:54 INFO WorkerWatcher: Connecting to worker spark://Worker#192.168.239.218:45532
18/03/01 12:56:54 INFO TransportClientFactory: Successfully created connection to /192.168.239.218:45532 after 1 ms (0 ms spent in bootstraps)
18/03/01 12:56:54 INFO WorkerWatcher: Successfully connected to spark://Worker#192.168.239.218:45532
18/03/01 12:56:54 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
18/03/01 12:56:54 INFO Executor: Starting executor ID 2 on host 192.168.239.218
18/03/01 12:56:54 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 37178.
18/03/01 12:56:54 INFO NettyBlockTransferService: Server created on 192.168.239.218:37178
18/03/01 12:56:54 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
18/03/01 12:56:54 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO BlockManager: Initialized BlockManager: BlockManagerId(2, 192.168.239.218, 37178, None)
18/03/01 12:56:54 INFO Executor: Using REPL class URI: spark://192.168.239.54:42629/classes
18/03/01 12:57:54 INFO CoarseGrainedExecutorBackend: Got assigned task 0
18/03/01 12:57:54 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
18/03/01 12:57:54 INFO TorrentBroadcast: Started reading broadcast variable 0
18/03/01 12:57:55 INFO TransportClientFactory: Successfully created connection to /192.168.239.54:35081 after 1 ms (0 ms spent in bootstraps)
18/03/01 12:57:55 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 28.1 KB, free 8.4 GB)
18/03/01 12:57:55 INFO TorrentBroadcast: Reading broadcast variable 0 took 103 ms
18/03/01 12:57:55 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 76.6 KB, free 8.4 GB)
18/03/01 12:57:55 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/03/01 12:57:55 INFO SQLHadoopMapReduceCommitProtocol: Using user defined output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
18/03/01 12:57:55 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
18/03/01 12:57:55 INFO SQLHadoopMapReduceCommitProtocol: Using output committer class org.apache.parquet.hadoop.ParquetOutputCommitter
18/03/01 12:57:55 INFO CodecConfig: Compression: SNAPPY
18/03/01 12:57:55 INFO CodecConfig: Compression: SNAPPY
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet block size to 134217728
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet page size to 1048576
18/03/01 12:57:55 INFO ParquetOutputFormat: Parquet dictionary page size to 1048576
18/03/01 12:57:55 INFO ParquetOutputFormat: Dictionary is on
18/03/01 12:57:55 INFO ParquetOutputFormat: Validation is off
18/03/01 12:57:55 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0
18/03/01 12:57:55 INFO ParquetOutputFormat: Maximum row group padding size is 0 bytes
18/03/01 12:57:55 INFO ParquetOutputFormat: Page size checking is: estimated
18/03/01 12:57:55 INFO ParquetOutputFormat: Min row count for page size check is: 100
18/03/01 12:57:55 INFO ParquetOutputFormat: Max row count for page size check is: 10000
18/03/01 12:57:55 INFO ParquetWriteSupport: Initialized Parquet WriteSupport with Catalyst schema:
{
"type" : "struct",
"fields" : [ {
"name" : "id",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "hour",
"type" : "integer",
"nullable" : false,
"metadata" : { }
}, {
"name" : "mobile",
"type" : "double",
"nullable" : false,
"metadata" : { }
}, {
"name" : "userFeatures",
"type" : {
"type" : "udt",
"class" : "org.apache.spark.ml.linalg.VectorUDT",
"pyClass" : "pyspark.ml.linalg.VectorUDT",
"sqlType" : {
"type" : "struct",
"fields" : [ {
"name" : "type",
"type" : "byte",
"nullable" : false,
"metadata" : { }
}, {
"name" : "size",
"type" : "integer",
"nullable" : true,
"metadata" : { }
}, {
"name" : "indices",
"type" : {
"type" : "array",
"elementType" : "integer",
"containsNull" : false
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "values",
"type" : {
"type" : "array",
"elementType" : "double",
"containsNull" : false
},
"nullable" : true,
"metadata" : { }
} ]
}
},
"nullable" : true,
"metadata" : { }
}, {
"name" : "clicked",
"type" : "double",
"nullable" : false,
"metadata" : { }
} ]
}
and corresponding Parquet message type:
message spark_schema {
required int32 id;
required int32 hour;
required double mobile;
optional group userFeatures {
required int32 type (INT_8);
optional int32 size;
optional group indices (LIST) {
repeated group list {
required int32 element;
}
}
optional group values (LIST) {
repeated group list {
required double element;
}
}
}
required double clicked;
}
18/03/01 12:57:55 INFO CodecPool: Got brand-new compressor [.snappy]
18/03/01 12:57:55 INFO InternalParquetRecordWriter: Flushing mem columnStore to file. allocated memory: 84
18/03/01 12:57:55 INFO FileOutputCommitter: Saved output of task 'attempt_20180301125755_0000_m_000000_0' to file:/home/vitrion/out/_temporary/0/task_20180301125755_0000_m_000000
18/03/01 12:57:55 INFO SparkHadoopMapRedUtil: attempt_20180301125755_0000_m_000000_0: Committed
18/03/01 12:57:55 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1967 bytes result sent to driver
Can you please help me to solve this?
Thank you
Have you tried writing without the Vector column? I have seen complex data structures cause write issues in the past.
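One way to test this suggestion is to write the same row with and without the vector column and read both results back. The sketch below assumes local mode and a temporary output directory (both are assumptions, chosen so that the writer and the reader share one filesystem), and it needs spark-mllib on the classpath. The object and method names are illustrative.

```scala
import java.nio.file.Files
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

object ParquetVectorCheck {
  // Returns (rows read back without the vector column, rows read back with it).
  def writtenCounts(spark: SparkSession): (Long, Long) = {
    val dataset = spark.createDataFrame(
      Seq((0, 18, 1.0, Vectors.dense(0.0, 10.0, 0.5), 1.0))
    ).toDF("id", "hour", "mobile", "userFeatures", "clicked")

    // Assumption: a throwaway local directory instead of /home/vitrion/out.
    val outDir = Files.createTempDirectory("parquet-check").toString

    dataset.drop("userFeatures").write.parquet(s"$outDir/no_vector")
    dataset.write.parquet(s"$outDir/with_vector")

    (spark.read.parquet(s"$outDir/no_vector").count(),
     spark.read.parquet(s"$outDir/with_vector").count())
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("parquet-vector-check")
      .master("local[*]") // assumption: local mode, not the cluster from the question
      .getOrCreate()
    val (withoutVector, withVector) = writtenCounts(spark)
    println(s"rows without vector: $withoutVector, rows with vector: $withVector")
    spark.stop()
  }
}
```

If both writes succeed locally, the vector column is probably not the cause, and the path in the executor log (file:/home/vitrion/out/_temporary/... on the worker host) is the next place to look.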