ClassNotFoundException: org.apache.beam.runners.spark.io.SourceRDD$SourcePartition during spark submit - apache-spark

I use spark-submit against a Spark standalone cluster to execute my shaded jar, but the executors fail with this error:
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 1]
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 2]
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 3]
22/12/06 15:21:25 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
22/12/06 15:21:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/06 15:21:25 INFO TaskSchedulerImpl: Cancelling stage 0
22/12/06 15:21:25 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
22/12/06 15:21:25 INFO DAGScheduler: ResultStage 0 (collect at BoundedDataset.java:96) failed in 1.380 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.37.2.77 executor 0): java.lang.ClassNotFoundException: org.apache.beam.runners.spark.io.SourceRDD$SourcePartition
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1988)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
My request looks like:
curl -X POST http://xxxxxx:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "/home/xxxx/xxxx-bundled-0.1.jar",
  "sparkProperties": {
    "spark.master": "spark://xxxxxxx:7077",
    "spark.driver.userClassPathFirst": "true",
    "spark.executor.userClassPathFirst": "true",
    "spark.app.name": "DataPipeline",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "clientSparkVersion": "3.1.3",
  "mainClass": "com.xxxx.DataPipeline",
  "action": "CreateSubmissionRequest",
  "appArgs": [
    "--config=xxxx",
    "--runner=SparkRunner"
  ]
}'
I set "spark.driver.userClassPathFirst": "true", and "spark.executor.userClassPathFirst": "true" due to using proto3 in my jar.
Not sure why this class is not found on the executor.
My beam version 2.41.0, spark version 3.1.3, hadoop version 3.2.0.

Finally, I upgraded the shade plugin to 3.4.0, after which the relocation for protobuf worked, and I removed "spark.driver.userClassPathFirst": "true" and "spark.executor.userClassPathFirst": "true". Everything works after that, whether submitting with spark-submit locally or via the REST API.
Relocation for protobuf in the shade plugin:
<relocations>
  <relocation>
    <pattern>com.google.protobuf</pattern>
    <shadedPattern>shaded.com.google.protobuf</shadedPattern>
  </relocation>
</relocations>
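For context, a hedged sketch of how that relocation can sit inside the maven-shade-plugin configuration; the plugin coordinates and the execution binding below are standard Maven boilerplate rather than values taken from the original pom.xml:
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.0</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <relocations>
          <relocation>
            <pattern>com.google.protobuf</pattern>
            <shadedPattern>shaded.com.google.protobuf</shadedPattern>
          </relocation>
        </relocations>
      </configuration>
    </execution>
  </executions>
</plugin>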
Then both of the following invocations worked:
spark-submit --class XXXXPipeline --master spark://xxxx:7077 --deploy-mode cluster --supervise /xxxxx/xxxx-bundled-0.1.jar
and
curl -X POST http://xxxx:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "xxxxx-bundled-0.1.jar",
  "sparkProperties": {
    "spark.master": "spark://XXXXXX:7077",
    "spark.jars": "xxxx-bundled-0.1.jar",
    "spark.app.name": "XXXPipeline",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "clientSparkVersion": "3.1.3",
  "mainClass": "XXXXPipeline",
  "action": "CreateSubmissionRequest",
  "appArgs": []
}'

Related

Apache Beam - Error with example using Spark Runner pointing to a local spark master URL

I need to support a use case where we can run Beam pipelines against an external Spark master URL.
I took a basic example of a Beam pipeline, shown below:
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

class ConvertToByteArray(beam.DoFn):
    def __init__(self):
        pass

    def setup(self):
        pass

    def process(self, row):
        try:
            yield bytearray(row + '\n', 'utf-8')
        except Exception as e:
            raise e

def run():
    options = PipelineOptions([
        "--runner=SparkRunner",
        # "--spark_master_url=spark://0.0.0.0:7077",
        # "--spark_version=3",
    ])
    with beam.Pipeline(options=options) as p:
        lines = (p
                 | 'Create words' >> beam.Create(['this is working'])
                 | 'Split words' >> beam.FlatMap(lambda words: words.split(' '))
                 | 'Build byte array' >> beam.ParDo(ConvertToByteArray())
                 | 'Group' >> beam.GroupBy()  # Do future batching here
                 | 'print output' >> beam.Map(print)
                 )

if __name__ == "__main__":
    run()
I try to run this pipeline in two ways:
Using Apache Beam's internal Spark runner.
Running Spark locally and passing the Spark master URL.
Approach 1 works fine and I'm able to see the output (screenshot of the output not reproduced here).
Approach 2 gives a class-incompatible error on the Spark server.
Running Spark as a Docker container and natively on my machine both gave the same error.
Exception trace
Spark Executor Command: "/opt/bitnami/java/bin/java" "-cp" "/opt/bitnami/spark/conf/:/opt/bitnami/spark/jars/*" "-Xmx1024M" "-Dspark.driver.port=62420" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "--driver-url" "spark://CoarseGrainedScheduler#192.168.8.120:62420" "--executor-id" "0" "--hostname" "172.17.0.2" "--cores" "4" "--app-id" "app-20220425143553-0000" "--worker-url" "spark://Worker#172.17.0.2:44575"
========================================
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
22/04/25 14:35:55 INFO CoarseGrainedExecutorBackend: Started daemon with process name: 165#ace075bc56c0
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for TERM
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for HUP
22/04/25 14:35:55 INFO SignalUtils: Registering signal handler for INT
22/04/25 14:35:56 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22/04/25 14:35:56 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:56 INFO SecurityManager: Changing view acls groups to:
22/04/25 14:35:56 INFO SecurityManager: Changing modify acls groups to:
22/04/25 14:35:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 194 ms (0 ms spent in bootstraps)
22/04/25 14:35:57 INFO SecurityManager: Changing view acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls to: spark,nikamath
22/04/25 14:35:57 INFO SecurityManager: Changing view acls groups to:
22/04/25 14:35:57 INFO SecurityManager: Changing modify acls groups to:
22/04/25 14:35:57 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark, nikamath); groups with view permissions: Set(); users with modify permissions: Set(spark, nikamath); groups with modify permissions: Set()
22/04/25 14:35:57 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 19 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO DiskBlockManager: Created local directory at /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/blockmgr-d7bd5b95-ddb3-4642-9863-261a6e109fc4
22/04/25 14:35:58 INFO MemoryStore: MemoryStore started with capacity 366.3 MiB
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Connecting to driver: spark://CoarseGrainedScheduler#192.168.8.120:62420
22/04/25 14:35:58 INFO WorkerWatcher: Connecting to worker spark://Worker#172.17.0.2:44575
22/04/25 14:35:58 INFO TransportClientFactory: Successfully created connection to /172.17.0.2:44575 after 7 ms (0 ms spent in bootstraps)
22/04/25 14:35:58 INFO WorkerWatcher: Successfully connected to spark://Worker#172.17.0.2:44575
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO ResourceUtils: No custom resources configured for spark.executor.
22/04/25 14:35:58 INFO ResourceUtils: ==============================================================
22/04/25 14:35:58 INFO CoarseGrainedExecutorBackend: Successfully registered with driver
22/04/25 14:35:58 INFO Executor: Starting executor ID 0 on host 172.17.0.2
22/04/25 14:35:59 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 39027.
22/04/25 14:35:59 INFO NettyBlockTransferService: Server created on 172.17.0.2:39027
22/04/25 14:35:59 INFO BlockManager: Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
22/04/25 14:35:59 INFO BlockManagerMaster: Registering BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManagerMaster: Registered BlockManager BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO BlockManager: Initialized BlockManager: BlockManagerId(0, 172.17.0.2, 39027, None)
22/04/25 14:35:59 INFO Executor: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar with timestamp 1650897351765
22/04/25 14:35:59 INFO TransportClientFactory: Successfully created connection to /192.168.8.120:62420 after 5 ms (0 ms spent in bootstraps)
22/04/25 14:35:59 INFO Utils: Fetching spark://192.168.8.120:62420/jars/beam-runners-spark-3-job-server-2.33.0.jar to /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/fetchFileTemp1955801007454794690.tmp
22/04/25 14:36:02 INFO Utils: Copying /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d/-16622853561650897351765_cache to /opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar
22/04/25 14:36:04 INFO Executor: Adding file:/opt/bitnami/spark/work/app-20220425143553-0000/0/./beam-runners-spark-3-job-server-2.33.0.jar to class loader
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 0
22/04/25 14:36:04 INFO CoarseGrainedExecutorBackend: Got assigned task 1
22/04/25 14:36:04 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
22/04/25 14:36:04 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
22/04/25 14:36:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.scheduler.Task.run(Task.scala:100)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:506)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1462)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:509)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 0.0 in stage 0.0 (TID 0),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
22/04/25 14:36:04 ERROR SparkUncaughtExceptionHandler: Uncaught exception in thread Thread[Executor task launch worker for task 1.0 in stage 0.0 (TID 1),5,main]
java.lang.Error: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
Caused by: java.io.InvalidClassException: org.apache.spark.util.AccumulatorV2; local class incompatible: stream classdesc serialVersionUID = 8273715124741334009, local class serialVersionUID = 574976528730727648
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:699)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:2005)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:109)
at org.apache.spark.scheduler.Task.metrics$lzycompute(Task.scala:72)
at org.apache.spark.scheduler.Task.metrics(Task.scala:71)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1(Executor.scala:424)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$collectAccumulatorsAndResetStatusOnFailure$1$adapted(Executor.scala:423)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.executor.Executor$TaskRunner.collectAccumulatorsAndResetStatusOnFailure(Executor.scala:423)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:704)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
... 2 more
22/04/25 14:36:05 INFO MemoryStore: MemoryStore cleared
22/04/25 14:36:05 INFO BlockManager: BlockManager stopped
22/04/25 14:36:05 INFO ShutdownHookManager: Shutdown hook called
22/04/25 14:36:05 INFO ShutdownHookManager: Deleting directory /tmp/spark-41a0e50e-81aa-48b7-ae55-888ae3a0a4ca/executor-a5ff3b34-0166-405f-9c00-d5a5ee6f8688/spark-cd481993-e8df-46fd-b00c-9a31e17d245d
I confirmed that a worker was running and that Spark was set up properly.
I submitted a sample Spark job to this master and got it to work, implying there isn't anything wrong with the Spark master or worker.
Versions used:
Spark 3.2.1
Apache Beam 2.37.0
Python 3.7

Problem with write from Spark Structured Streaming to Oracle table

I read files in a directory with readStream and process them; at the end I have a DataFrame that I want to write to an Oracle table. I use the JDBC driver and the foreachBatch() API to do that. Here is my code:
def SaveToOracle(df, epoch_id):
    try:
        df.write.format('jdbc').options(
            url='jdbc:oracle:thin:@192.168.49.8:1521:ORCL',
            driver='oracle.jdbc.driver.OracleDriver',
            dbtable='spark.result_table',
            user='spark',
            password='spark').mode('append').save()
        pass
    except Exception as e:
        response = e.__str__()
        print(response)

streamingQuery = (summaryDF4.writeStream
                  .outputMode("append")
                  .foreachBatch(SaveToOracle)
                  .start()
                  )
The job fails without any error and stops after the streaming query starts. The console log looks like this:
2021-08-11 10:45:11,003 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
2021-08-11 10:45:11,003 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
2021-08-11 10:45:11,007 INFO streaming.MicroBatchExecution: Starting new streaming query.
2021-08-11 10:45:11,009 INFO cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2021-08-11 10:45:11,011 INFO streaming.MicroBatchExecution: Stream started from {}
2021-08-11 10:45:11,021 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2021-08-11 10:45:11,034 INFO memory.MemoryStore: MemoryStore cleared
2021-08-11 10:45:11,034 INFO storage.BlockManager: BlockManager stopped
2021-08-11 10:45:11,042 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2021-08-11 10:45:11,046 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2021-08-11 10:45:11,053 INFO spark.SparkContext: Successfully stopped SparkContext
2021-08-11 10:45:11,056 INFO util.ShutdownHookManager: Shutdown hook called
2021-08-11 10:45:11,056 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-bf5c7539-9d1f-4c9d-af46-0c0874a81a40/pyspark-7416fc8a-18bd-4e79-aa0f-ea673e7c5cd8
2021-08-11 10:45:11,060 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-47c28d1d-236c-4b64-bc66-d07a918abe01
2021-08-11 10:45:11,063 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-bf5c7539-9d1f-4c9d-af46-0c0874a81a40
2021-08-11 10:45:11,065 INFO util.ShutdownHookManager: Deleting directory /tmp/temporary-f9420356-164a-4806-abb2-f132b8026b20
What is the problem, and how can I get a proper log?
This is my SparkSession conf:
conf = SparkConf()
conf.set("spark.jars", "/home/hadoop/ojdbc6.jar")
spark = (SparkSession
         .builder
         .config(conf=conf)
         .master("yarn")
         .appName("Test010")
         .getOrCreate()
         )
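As a side note, one common way to make the Oracle driver jar visible to both the driver and the executors is to also attach it at submit time, rather than only via spark.jars inside the application; a rough sketch, where the script name is a placeholder:
spark-submit --master yarn \
  --jars /home/hadoop/ojdbc6.jar \
  --driver-class-path /home/hadoop/ojdbc6.jar \
  your_streaming_app.py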
Update:
I get an error on the JDBC save(); here it is:
An error occurred while calling o379.save.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:217)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:221)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
You need to call the awaitTermination() method on streamingQuery after the start() call, like this:
streamingQuery = (summaryDF4.writeStream
                  .outputMode("append")
                  .foreachBatch(SaveToOracle)
                  .start()
                  .awaitTermination())
Thanks for your answer; it keeps the streaming job alive, but it still doesn't work:
"id" : "3dc2a37f-4a7a-49b9-aa83-bff526aa14c5",
"runId" : "e8864901-2729-41c5-b7e4-a19ac2478f1c",
"name" : null,
"timestamp" : "2021-08-11T06:51:00.000Z",
"batchId" : 2,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 240,
"getBatch" : 66,
"latestOffset" : 21,
"queryPlanning" : 143,
"triggerExecution" : 514,
"walCommit" : 23
},
"eventTime" : {
"watermark" : "1970-01-01T00:00:00.000Z"
},
"stateOperators" : [ {
"numRowsTotal" : 0,
"numRowsUpdated" : 0,
"memoryUsedBytes" : -1,
"numRowsDroppedByWatermark" : 0,
"customMetrics" : {
"loadedMapCacheHitCount" : 0,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : -1
}
} ],
"sources" : [ {
"description" : "FileStreamSource[hdfs://192.168.49.13:9000/input/lz]",
"startOffset" : {
"logOffset" : 1
},
"endOffset" : {
"logOffset" : 2
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "ForeachBatchSink",
"numOutputRows" : -1
}
}
If I use the console sink, my process works correctly without any problem, but writing to Oracle fails.

Spring boot - Spark Application: Unable to process a file on its nodes

I have the setup below:
Spark master and slaves configured and running on my local machine.
17/11/01 18:03:52 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
17/11/01 18:03:52 INFO Master: Starting Spark master at spark://127.0.0.1:7077
17/11/01 18:03:52 INFO Master: Running Spark version 2.2.0
17/11/01 18:03:52 INFO Utils: Successfully started service 'MasterUI' on port 8080.
I have a Spring Boot application whose properties file contents look like the below:
spark.home=/usr/local/Cellar/apache-spark/2.2.0/bin/
master.uri=spark://127.0.0.1:7077
@Autowired
SparkConf sparkConf;

public void processFile(String inputFile, String outputFile) {
    JavaSparkContext javaSparkContext;
    SparkContext sc = new SparkContext(sparkConf);
    SerializationWrapper sw = new SerializationWrapper() {
        private static final long serialVersionUID = 1L;

        @Override
        public JavaSparkContext createJavaSparkContext() {
            // TODO Auto-generated method stub
            return JavaSparkContext.fromSparkContext(sc);
        }
    };
    javaSparkContext = sw.createJavaSparkContext();
    JavaRDD<String> lines = javaSparkContext.textFile(inputFile);
    Broadcast<JavaRDD<String>> outputLines;
    outputLines = javaSparkContext.broadcast(lines.map(new Function<String, String>() {
        private static final long serialVersionUID = 1L;

        @Override
        public String call(String arg0) throws Exception {
            // TODO Auto-generated method stub
            return arg0;
        }
    }));
    outputLines.getValue().saveAsTextFile(outputFile);
    // javaSparkContext.close();
}
When I run the code, I get the error below:
17/11/01 18:16:36 INFO TorrentBroadcast: Started reading broadcast variable 2
17/11/01 18:16:36 INFO TransportClientFactory: Successfully created connection to /192.168.0.135:51903 after 1 ms (0 ms spent in bootstraps)
17/11/01 18:16:36 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 24.4 KB, free 366.3 MB)
17/11/01 18:16:36 INFO TorrentBroadcast: Reading broadcast variable 2 took 82 ms
17/11/01 18:16:36 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 67.2 KB, free 366.2 MB)
17/11/01 18:16:36 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD
at java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(ObjectStreamClass.java:2133)
at java.io.ObjectStreamClass.setObjFieldValues(ObjectStreamClass.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2251)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2245)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2169)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2027)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1535)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:422)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:80)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
17/11/01 18:16:36 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
The Spring Boot Spark app should process the files based on a REST API call, from which I get the input and output file locations, which are shared across the Spark nodes.
Any suggestions to fix the above errors?
I think you should not broadcast a JavaRDD, since an RDD is already distributed among your cluster nodes.
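To illustrate, a minimal sketch of the same transformation without the broadcast, reusing the javaSparkContext, inputFile, and outputFile from the question (lambda syntax assumes Java 8+):
// The RDD is already partitioned across the cluster, so transform and save it directly.
JavaRDD<String> lines = javaSparkContext.textFile(inputFile);
// Identity map, matching the Function in the original code.
JavaRDD<String> result = lines.map(line -> line);
result.saveAsTextFile(outputFile);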

Scala to HBase - ZooKeeperRegistry: ClusterId read in ZooKeeper is null

I am using the following code to connect to my HBase instance, which is running on a local sandbox.
ZooKeeper and the HBase master are running fine, and I can scan the HBase table from the HBase shell. Not sure what's going wrong here:
object HBaseWrite {
  def main(args: Array[String]) {
    System.setProperty("hadoop.home.dir", "c://hadoop");
    val sparkConf = new SparkConf().setAppName("HBaseWrite").setMaster("local[*]").set("spark.driver.allowMultipleContexts", "true").set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    val sc = new SparkContext(sparkConf)
    val conf = HBaseConfiguration.create()
    val outputTable = "TABLE1"
    System.setProperty("user.name", "hdfs")
    System.setProperty("HADOOP_USER_NAME", "hdfs")
    conf.set("hbase.master", "localhost:60000")
    conf.setInt("timeout", 120000)
    conf.set("hbase.zookeeper.quorum", "192.168.163.129/sandbox.hortonworks.com")
    conf.set("zookeeper.znode.parent", "/hbase-unsecure")
    conf.setInt("hbase.client.scanner.caching", 10000)
    sparkConf.registerKryoClasses(Array(classOf[org.apache.hadoop.hbase.client.Result]))
    val jobConfig: JobConf = new JobConf(conf, this.getClass)
    jobConfig.setOutputFormat(classOf[TableOutputFormat])
    jobConfig.set(TableOutputFormat.OUTPUT_TABLE, outputTable)
    val x = 12
    val y = 15
    val z = 25
    var newarray = Array(x, y, z)
    val newrddtohbase = sc.parallelize(newarray)
    val convertFunc = convert _
    new PairRDDFunctions(newrddtohbase.map(convertFunc)).saveAsHadoopDataset(jobConfig)
    sc.stop()
  }

  def convert(a: Int): Tuple2[ImmutableBytesWritable, Put] = {
    val p = new Put(Bytes.toBytes(a))
    p.add(Bytes.toBytes("columnfamily"),
      Bytes.toBytes("col_1"), Bytes.toBytes(a))
    new Tuple2[ImmutableBytesWritable, Put](new ImmutableBytesWritable(a.toString.getBytes()), p);
  }
}
and I get the following error:
ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/03/21 20:13:37 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/03/21 20:13:37 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/03/21 20:13:37 INFO ZooKeeperRegistry: ClusterId read in ZooKeeper is null
17/03/21 20:13:37 INFO SparkHadoopMapRedUtil: No need to commit output of task because needsTaskCommit=false: attempt_201703212013_0000_m_000000_0
17/03/21 20:13:37 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 874 bytes result sent to driver
17/03/21 20:13:37 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 816 ms on localhost (1/4)
17/03/21 20:13:37 ERROR Executor: Exception in task 3.0 in stage 0.0 (TID 3)
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1249)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:370)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:321)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:158)
at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:66)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
... 20 more
17/03/21 20:13:37 ERROR Executor: Exception in task 2.0 in stage 0.0 (TID 2)
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1249)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:370)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:321)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:158)
at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:66)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
... 20 more
17/03/21 20:13:37 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1249)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:370)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:321)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:158)
at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:66)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
... 20 more
17/03/21 20:13:37 WARN TaskSetManager: Lost task 2.0 in stage 0.0 (TID 2, localhost): java.lang.RuntimeException: java.lang.NullPointerException
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:208)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.loadCache(ClientSmallReversedScanner.java:211)
at org.apache.hadoop.hbase.client.ClientSmallReversedScanner.next(ClientSmallReversedScanner.java:185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1249)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1155)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:370)
at org.apache.hadoop.hbase.client.AsyncProcess.submit(AsyncProcess.java:321)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.backgroundFlushCommits(BufferedMutatorImpl.java:206)
at org.apache.hadoop.hbase.client.BufferedMutatorImpl.close(BufferedMutatorImpl.java:158)
at org.apache.hadoop.hbase.mapred.TableOutputFormat$TableRecordWriter.close(TableOutputFormat.java:66)
at org.apache.spark.SparkHadoopWriter.close(SparkHadoopWriter.scala:102)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$8.apply$mcV$sp(PairRDDFunctions.scala:1211)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1366)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1211)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1190)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.NullPointerException
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.getMetaReplicaNodes(ZooKeeperWatcher.java:395)
at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:553)
at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1185)
at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1152)
at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:300)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:151)
at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:59)
at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
... 20 more
Please help, I am stuck on this bit. The HBase master and ZooKeeper are running fine.

Persist RDD as Avro File

I have written this sample program to persist an RDD into an Avro file.
I am using CDH 5.4 with Spark 1.3.
I wrote this .avsc file and then generated code for the class User:
{"namespace": "com.abhi",
"type": "record",
"name": "User",
"fields": [
{"name": "firstname", "type": "string"},
{"name": "lastname", "type": "string"} ]
}
Then I generated the code for User:
java -jar ~/Downloads/avro-tools-1.7.7.jar compile schema User.avsc .
Then I wrote my example:
package com.abhi

import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.spark.SparkContext

object MySpark {
  def main(args: Array[String]): Unit = {
    val sf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("MySpark")
    val sc = new SparkContext(sf)

    val user1 = new User();
    user1.setFirstname("Test1");
    user1.setLastname("Test2");

    val user2 = new User("Test3", "Test4");

    // Construct via builder
    val user3 = User.newBuilder()
      .setFirstname("Test5")
      .setLastname("Test6")
      .build()

    val list = Array(user1, user2, user3)
    val userRdd = sc.parallelize(list)

    val job: Job = Job.getInstance()
    AvroJob.setOutputKeySchema(job, user1.getSchema)
    val output = "/user/cloudera/users.avro"

    userRdd.map(row => (new AvroKey(row), NullWritable.get()))
      .saveAsNewAPIHadoopFile(
        output,
        classOf[AvroKey[User]],
        classOf[NullWritable],
        classOf[AvroKeyOutputFormat[User]],
        job.getConfiguration)
  }
}
I have two concerns with this code:
Some of the imports are from the old MapReduce API, and I wonder why they are required for Spark code:
import org.apache.hadoop.mapreduce.Job
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroKeyOutputFormat, AvroJob, AvroKeyInputFormat}
The code throws an exception when I submit it to the Hadoop cluster.
It does create an empty directory called /user/cloudera/users.avro in HDFS:
15/11/01 08:20:42 INFO Configuration.deprecation: mapred.output.dir is deprecated. Instead, use mapreduce.output.fileoutputformat.outputdir
15/11/01 08:20:42 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
15/11/01 08:20:42 INFO spark.SparkContext: Starting job: saveAsNewAPIHadoopFile at MySpark.scala:52
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Got job 1 (saveAsNewAPIHadoopFile at MySpark.scala:52) with 2 output partitions (allowLocal=false)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Final stage: Stage 1(saveAsNewAPIHadoopFile at MySpark.scala:52)
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Parents of final stage: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Missing parents: List()
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51), which has no missing parents
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(66904) called with curMem=301745, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 65.3 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.MemoryStore: ensureFreeSpace(23066) called with curMem=368649, maxMem=280248975
15/11/01 08:20:42 INFO storage.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 22.5 KB, free 266.9 MB)
15/11/01 08:20:42 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:34630 (size: 22.5 KB, free: 267.2 MB)
15/11/01 08:20:42 INFO storage.BlockManagerMaster: Updated info of block broadcast_2_piece0
15/11/01 08:20:42 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:839
15/11/01 08:20:42 INFO scheduler.DAGScheduler: Submitting 2 missing tasks from Stage 1 (MapPartitionsRDD[2] at map at MySpark.scala:51)
15/11/01 08:20:42 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 2 tasks
15/11/01 08:20:42 ERROR scheduler.TaskSetManager: Failed to serialize task 1, not attempting to retry it.
java.lang.reflect.InvocationTargetException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.serializer.SerializationDebugger$ObjectStreamClassMethods$.getObjFieldValues$extension(SerializationDebugger.scala:240)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visitSerializable(SerializationDebugger.scala:150)
at org.apache.spark.serializer.SerializationDebugger$SerializationDebugger.visit(SerializationDebugger.scala:99)
at org.apache.spark.serializer.SerializationDebugger$.find(SerializationDebugger.scala:58)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:39)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:80)
at org.apache.spark.scheduler.Task$.serializeWithDependencies(Task.scala:149)
at org.apache.spark.scheduler.TaskSetManager.resourceOffer(TaskSetManager.scala:464)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet$1.apply$mcVI$sp(TaskSchedulerImpl.scala:232)
at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
at org.apache.spark.scheduler.TaskSchedulerImpl.org$apache$spark$scheduler$TaskSchedulerImpl$$resourceOfferSingleTaskSet(TaskSchedulerImpl.scala:227)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:296)
at org.apache.spark.scheduler.TaskSchedulerImpl$$anonfun$resourceOffers$3$$anonfun$apply$6.apply(TaskSchedulerImpl.scala:294)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
The problem is that Spark can't serialize your User class; try setting up Kryo serialization and registering your class with it.
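A minimal sketch of that registration, reusing the SparkConf from the example above and assuming the generated com.abhi.User class is on the classpath:
val sf = new SparkConf()
  .setMaster("local[2]")
  .setAppName("MySpark")
  // Use Kryo for serialization and register the Avro-generated class.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[com.abhi.User]))
val sc = new SparkContext(sf)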
