problem with write from spark structured streaming to oracle table - python-3.x

I read files from a directory with readStream and process them; at the end I have a DataFrame that I want to write to an Oracle table. I use the JDBC driver and the foreachBatch() API to do that. Here is my code:
def SaveToOracle(df, epoch_id):
    try:
        df.write.format('jdbc').options(
            url='jdbc:oracle:thin:@192.168.49.8:1521:ORCL',
            driver='oracle.jdbc.driver.OracleDriver',
            dbtable='spark.result_table',
            user='spark',
            password='spark').mode('append').save()
        pass
    except Exception as e:
        response = e.__str__()
        print(response)
streamingQuery = (summaryDF4.writeStream
    .outputMode("append")
    .foreachBatch(SaveToOracle)
    .start()
)
The job fails without any error and stops right after the streaming query starts. The console log looks like this:
2021-08-11 10:45:11,003 INFO cluster.YarnClientSchedulerBackend: Shutting down all executors
2021-08-11 10:45:11,003 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asking each executor to shut down
2021-08-11 10:45:11,007 INFO streaming.MicroBatchExecution: Starting new streaming query.
2021-08-11 10:45:11,009 INFO cluster.YarnClientSchedulerBackend: YARN client scheduler backend Stopped
2021-08-11 10:45:11,011 INFO streaming.MicroBatchExecution: Stream started from {}
2021-08-11 10:45:11,021 INFO spark.MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
2021-08-11 10:45:11,034 INFO memory.MemoryStore: MemoryStore cleared
2021-08-11 10:45:11,034 INFO storage.BlockManager: BlockManager stopped
2021-08-11 10:45:11,042 INFO storage.BlockManagerMaster: BlockManagerMaster stopped
2021-08-11 10:45:11,046 INFO scheduler.OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
2021-08-11 10:45:11,053 INFO spark.SparkContext: Successfully stopped SparkContext
2021-08-11 10:45:11,056 INFO util.ShutdownHookManager: Shutdown hook called
2021-08-11 10:45:11,056 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-bf5c7539-9d1f-4c9d-af46-0c0874a81a40/pyspark-7416fc8a-18bd-4e79-aa0f-ea673e7c5cd8
2021-08-11 10:45:11,060 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-47c28d1d-236c-4b64-bc66-d07a918abe01
2021-08-11 10:45:11,063 INFO util.ShutdownHookManager: Deleting directory /tmp/spark-bf5c7539-9d1f-4c9d-af46-0c0874a81a40
2021-08-11 10:45:11,065 INFO util.ShutdownHookManager: Deleting directory /tmp/temporary-f9420356-164a-4806-abb2-f132b8026b20
What is the problem, and how can I get a proper log?
This is my SparkSession configuration:
conf = SparkConf()
conf.set("spark.jars", "/home/hadoop/ojdbc6.jar")
spark = (SparkSession
    .builder
    .config(conf=conf)
    .master("yarn")
    .appName("Test010")
    .getOrCreate()
)
Update:
I get an error on the JDBC save(); here it is:
An error occurred while calling o379.save.
: java.lang.ClassNotFoundException: oracle.jdbc.driver.OracleDriver
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.jdbc.DriverRegistry$.register(DriverRegistry.scala:46)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.$anonfun$driverClass$1$adapted(JDBCOptions.scala:102)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.execution.datasources.jdbc.JDBCOptions.<init>(JDBCOptions.scala:102)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:217)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcOptionsInWrite.<init>(JDBCOptions.scala:221)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:45)
at org.apache.spark.sql.execution.datasources.SaveIntoDataSourceCommand.run(SaveIntoDataSourceCommand.scala:46)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.doExecute(commands.scala:90)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$execute$1(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$executeQuery$1(SparkPlan.scala:218)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:215)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:176)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:132)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:131)
at org.apache.spark.sql.DataFrameWriter.$anonfun$runCommand$1(DataFrameWriter.scala:989)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:989)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:438)
at org.apache.spark.sql.DataFrameWriter.saveInternal(DataFrameWriter.scala:415)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:301)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)

You need to call the awaitTermination() method on streamingQuery after the start() call, like this:
streamingQuery = (summaryDF4.writeStream
    .outputMode("append")
    .foreachBatch(SaveToOracle)
    .start()
    .awaitTermination())
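A hedged addition on top of the answer above (not part of the original reply): because SaveToOracle catches the exception and only prints it, every micro-batch looks successful to Spark and the real JDBC error never reaches the query status or the driver log. A minimal sketch, reusing the names from the question, of how the failure can be surfaced:
def SaveToOracle(df, epoch_id):
    try:
        df.write.format('jdbc').options(
            url='jdbc:oracle:thin:@192.168.49.8:1521:ORCL',
            driver='oracle.jdbc.driver.OracleDriver',
            dbtable='spark.result_table',
            user='spark',
            password='spark').mode('append').save()
    except Exception as e:
        print("batch %d failed: %s" % (epoch_id, e))
        raise  # let Spark mark the query as failed instead of swallowing the error

streamingQuery = (summaryDF4.writeStream
    .outputMode("append")
    .foreachBatch(SaveToOracle)
    .start())
try:
    streamingQuery.awaitTermination()
except Exception:
    # exception() returns the StreamingQueryException that terminated the query (or None)
    print(streamingQuery.exception())
    raise
With the underlying problem still unfixed this will of course keep failing, but awaitTermination() now raises the StreamingQueryException instead of the job exiting silently.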

Thanks for your answer; it keeps the streaming job alive, but it still doesn't work:
"id" : "3dc2a37f-4a7a-49b9-aa83-bff526aa14c5",
"runId" : "e8864901-2729-41c5-b7e4-a19ac2478f1c",
"name" : null,
"timestamp" : "2021-08-11T06:51:00.000Z",
"batchId" : 2,
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"addBatch" : 240,
"getBatch" : 66,
"latestOffset" : 21,
"queryPlanning" : 143,
"triggerExecution" : 514,
"walCommit" : 23
},
"eventTime" : {
"watermark" : "1970-01-01T00:00:00.000Z"
},
"stateOperators" : [ {
"numRowsTotal" : 0,
"numRowsUpdated" : 0,
"memoryUsedBytes" : -1,
"numRowsDroppedByWatermark" : 0,
"customMetrics" : {
"loadedMapCacheHitCount" : 0,
"loadedMapCacheMissCount" : 0,
"stateOnCurrentVersionSizeBytes" : -1
}
} ],
"sources" : [ {
"description" : "FileStreamSource[hdfs://192.168.49.13:9000/input/lz]",
"startOffset" : {
"logOffset" : 1
},
"endOffset" : {
"logOffset" : 2
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "ForeachBatchSink",
"numOutputRows" : -1
}
}
If I use the console sink, my process runs without any problem and the output is correct, but writing to Oracle is where it fails.
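One further hedged suggestion, not from the original thread: the ClassNotFoundException in the update points at the Oracle driver jar not being visible to the driver JVM. Setting spark.jars inside the application distributes the jar to the executors, but in YARN client mode the driver JVM is already running by then, so driver-side classpath entries generally have to be supplied at submit time. A minimal sketch under that assumption (the script name and flags below are illustrative):
# Illustrative launch command (script name assumed):
#   spark-submit --master yarn \
#       --jars /home/hadoop/ojdbc6.jar \
#       --driver-class-path /home/hadoop/ojdbc6.jar \
#       stream_to_oracle.py
from pyspark import SparkConf
from pyspark.sql import SparkSession

conf = SparkConf()
# spark.jars ships the jar to the executors; the --jars / --driver-class-path
# flags above cover the driver classpath as well.
conf.set("spark.jars", "/home/hadoop/ojdbc6.jar")

spark = (SparkSession
    .builder
    .config(conf=conf)
    .master("yarn")
    .appName("Test010")
    .getOrCreate())
This is only one way to wire it up; putting the jar on the classpath via spark.driver.extraClassPath and spark.executor.extraClassPath in spark-defaults.conf achieves the same thing.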

Related

ClassNotFoundException: org.apache.beam.runners.spark.io.SourceRDD$SourcePartition during spark submit

I use spark-submit against a Spark standalone cluster to execute my shaded jar; however, the executor gets this error:
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.1 in stage 0.0 (TID 1) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 1]
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.2 in stage 0.0 (TID 2) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.2 in stage 0.0 (TID 2) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 2]
22/12/06 15:21:25 INFO TaskSetManager: Starting task 0.3 in stage 0.0 (TID 3) (10.37.2.77, executor 0, partition 0, PROCESS_LOCAL, 5133 bytes) taskResourceAssignments Map()
22/12/06 15:21:25 INFO TaskSetManager: Lost task 0.3 in stage 0.0 (TID 3) on 10.37.2.77, executor 0: java.lang.ClassNotFoundException (org.apache.beam.runners.spark.io.SourceRDD$SourcePartition) [duplicate 3]
22/12/06 15:21:25 ERROR TaskSetManager: Task 0 in stage 0.0 failed 4 times; aborting job
22/12/06 15:21:25 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
22/12/06 15:21:25 INFO TaskSchedulerImpl: Cancelling stage 0
22/12/06 15:21:25 INFO TaskSchedulerImpl: Killing all running tasks in stage 0: Stage cancelled
22/12/06 15:21:25 INFO DAGScheduler: ResultStage 0 (collect at BoundedDataset.java:96) failed in 1.380 s due to Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3) (10.37.2.77 executor 0): java.lang.ClassNotFoundException: org.apache.beam.runners.spark.io.SourceRDD$SourcePartition
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1988)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1852)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2186)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2431)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2355)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2213)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1669)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:503)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:461)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:458)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
My request looks like:
curl -X POST http://xxxxxx:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "/home/xxxx/xxxx-bundled-0.1.jar",
  "sparkProperties": {
    "spark.master": "spark://xxxxxxx:7077",
    "spark.driver.userClassPathFirst": "true",
    "spark.executor.userClassPathFirst": "true",
    "spark.app.name": "DataPipeline",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "clientSparkVersion": "3.1.3",
  "mainClass": "com.xxxx.DataPipeline",
  "action": "CreateSubmissionRequest",
  "appArgs": [
    "--config=xxxx",
    "--runner=SparkRunner"
  ]
}'
I set "spark.driver.userClassPathFirst": "true", and "spark.executor.userClassPathFirst": "true" due to using proto3 in my jar.
Not sure why this class is not found on the executor.
My beam version 2.41.0, spark version 3.1.3, hadoop version 3.2.0.
Finally, I upgraded the shade plugin to 3.4.0, after which the relocation for protobuf worked, and I removed "spark.driver.userClassPathFirst": "true" and "spark.executor.userClassPathFirst": "true". Everything works after that, whether I run spark-submit locally or via the REST API.
Relocation for protobuf in the shade plugin:
<relocations>
  <relocation>
    <pattern>com.google.protobuf</pattern>
    <shadedPattern>shaded.com.google.protobuf</shadedPattern>
  </relocation>
</relocations>
Then both of the following invocations worked:
spark-submit --class XXXXPipeline --master spark://xxxx:7077 --deploy-mode cluster --supervise /xxxxx/xxxx-bundled-0.1.jar
and
curl -X POST http://xxxx:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
  "appResource": "xxxxx-bundled-0.1.jar",
  "sparkProperties": {
    "spark.master": "spark://XXXXXX:7077",
    "spark.jars": "xxxx-bundled-0.1.jar",
    "spark.app.name": "XXXPipeline",
    "spark.submit.deployMode": "cluster",
    "spark.driver.supervise": "true"
  },
  "environmentVariables": {
    "SPARK_ENV_LOADED": "1"
  },
  "clientSparkVersion": "3.1.3",
  "mainClass": "XXXXPipeline",
  "action": "CreateSubmissionRequest",
  "appArgs": []
}'

cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDD

I am trying to submit a PySpark job to a Spark-on-Kubernetes cluster using Airflow. In that Spark job I use the writeStream foreachBatch function to write streaming data, and irrespective of the sink type I face this issue only when I try to write the data:
Inside the Spark cluster:
  Spark 3.3.0
  PySpark 3.3
  Scala 2.12.15
  OpenJDK 64-Bit Server VM 11.0.15
Inside Airflow:
  Spark 3.1.2
  PySpark 3.1.2
  Scala 2.12.10
  OpenJDK 64-Bit Server VM 1.8.0
Dependencies: org.scala-lang:scala-library:2.12.8, org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0, org.apache.spark:spark-sql_2.12:3.3.0, org.apache.spark:spark-core_2.12:3.3.0, org.postgresql:postgresql:42.3.3
The DAG I am using to submit it is:
import airflow
from datetime import timedelta
from airflow import DAG
from time import sleep
from datetime import datetime
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG(dag_id='testpostgres.py', schedule_interval=None, start_date=datetime(2022, 1, 1), catchup=False)

spark_job = SparkSubmitOperator(
    application='/usr/local/airflow/data/testpostgres.py',
    conn_id='spark_kcluster',
    task_id='spark_job_test',
    dag=dag,
    packages="org.scala-lang:scala-library:2.12.8,org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0,org.apache.spark:spark-sql_2.12:3.3.0,org.apache.spark:spark-core_2.12:3.3.0,org.postgresql:postgresql:42.3.3",
    conf={
        'deploy-mode': 'cluster',
        'executor_cores': 1,
        'EXECUTORS_MEM': '2G',
        'name': 'spark-py',
        'spark.kubernetes.namespace': 'sandbox',
        'spark.kubernetes.file.upload.path': '/usr/local/airflow/data',
        'spark.kubernetes.container.image': '**********',
        'spark.kubernetes.container.image.pullPolicy': 'IfNotPresent',
        'spark.kubernetes.authenticate.driver.serviceAccountName': 'spark',
        'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.options.claimName': 'data-pvc',
        'spark.kubernetes.driver.volumes.persistentVolumeClaim.rwopvc.mount.path': '/usr/local/airflow/data',
        'spark.driver.extraJavaOptions': '-Divy.cache.dir=/tmp -Divy.home=/tmp'
    }
)
This is the job I am trying to submit:
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import dayofweek
from pyspark.sql.functions import date_format
from pyspark.sql.functions import hour
from functools import reduce
from pyspark.sql.types import DoubleType, StringType, ArrayType
import pandas as pd
import json

spark = SparkSession.builder.appName('spark').getOrCreate()

kafka_topic_name = '****'
kafka_bootstrap_servers = '*********' + ':' + '*****'

streaming_dataframe = spark.readStream.format("kafka").option("kafka.bootstrap.servers", kafka_bootstrap_servers).option("subscribe", kafka_topic_name).option("startingOffsets", "earliest").load()
streaming_dataframe = streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

dataframe_schema = '******'
streaming_dataframe = streaming_dataframe.select(from_csv(col("value"), dataframe_schema).alias("pipeline")).select("pipeline.*")

tumblingWindows = streaming_dataframe.withWatermark("timeStamp", "48 hour").groupBy(window("timeStamp", "24 hour", "1 hour"), "phoneNumber").agg((F.first(F.col("duration")).alias("firstDuration")))
tumblingWindows = tumblingWindows.withColumn("start_window", F.col('window')['start'])
tumblingWindows = tumblingWindows.withColumn("end_window", F.col('window')['end'])
tumblingWindows = tumblingWindows.drop('window')

def postgres_write(tumblingWindows, epoch_id):
    tumblingWindows.write.jdbc(url=db_target_url, table=table_postgres, mode='append', properties=db_target_properties)
    pass

db_target_url = 'jdbc:postgresql://' + '*******' + ':' + '****' + '/' + 'test'
table_postgres = '******'
db_target_properties = {
    'user': 'postgres',
    'password': 'postgres',
    'driver': 'org.postgresql.Driver'
}

query = tumblingWindows.writeStream.foreachBatch(postgres_write).start().awaitTermination()
Error logs:
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2672)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2608)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2607)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2607)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1182)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1182)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2860)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2802)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2791)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:952)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2228)
at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.writeWithV2(WriteToDataSourceV2Exec.scala:377)
... 42 more
Caused by: java.lang.ClassCastException: cannot assign instance of scala.collection.immutable.List$SerializationProxy to field org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition.inputPartitions of type scala.collection.Seq in instance of org.apache.spark.sql.execution.datasources.v2.DataSourceRDDPartition
at java.base/java.io.ObjectStreamClass$FieldReflector.setObjFieldValues(Unknown Source)
at java.base/java.io.ObjectStreamClass$FieldReflector.checkObjectFieldValueTypes(Unknown Source)
at java.base/java.io.ObjectStreamClass.checkObjFieldValueTypes(Unknown Source)
at java.base/java.io.ObjectInputStream.defaultCheckFieldValues(Unknown Source)
at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
at java.base/java.io.ObjectInputStream.defaultReadFields(Unknown Source)
at java.base/java.io.ObjectInputStream.readSerialData(Unknown Source)
at java.base/java.io.ObjectInputStream.readOrdinaryObject(Unknown Source)
at java.base/java.io.ObjectInputStream.readObject0(Unknown Source)
at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
at java.base/java.io.ObjectInputStream.readObject(Unknown Source)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:87)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:129)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:507)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
at java.base/java.lang.Thread.run(Unknown Source)
Traceback (most recent call last):
File "/usr/local/airflow/data/spark-upload-d03175bc-8c50-4baf-8383-a203182f16c0/debug.py", line 20, in <module>
streaming_dataframe.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")\
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 107, in awaitTermination
File "/opt/spark/python/lib/py4j-0.10.9.5-src.zip/py4j/java_gateway.py", line 1321, in __call__
File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 196, in deco
pyspark.sql.utils.StreamingQueryException: Query [id = d0e140c1-830d-49c8-88b7-90b82d301408, runId = c0f38f58-6571-4fda-b3e0-98e4ffaf8c7a] terminated with exception: Writing job aborted
22/08/24 10:12:53 INFO SparkUI: Stopped Spark web UI at ************************
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend: Shutting down all executors
22/08/24 10:12:53 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asking each executor to shut down
22/08/24 10:12:53 WARN ExecutorPodsWatchSnapshotSource: Kubernetes client has been closed.
22/08/24 10:12:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
22/08/24 10:12:53 INFO MemoryStore: MemoryStore cleared
22/08/24 10:12:53 INFO BlockManager: BlockManager stopped
22/08/24 10:12:53 INFO BlockManagerMaster: BlockManagerMaster stopped
22/08/24 10:12:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
22/08/24 10:12:54 INFO SparkContext: Successfully stopped SparkContext
22/08/24 10:12:54 INFO ShutdownHookManager: Shutdown hook called
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f/pyspark-f3ffaa5e-a490-464a-98d2-fbce223628eb
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /var/data/spark-32ef85e0-e85c-4ac6-a46d-d3379ca58468/spark-adecf44a-dc60-4a85-bbe3-bc125f5cc39f
22/08/24 10:12:54 INFO ShutdownHookManager: Deleting directory /tmp/spark-5acdd5e6-7f6e-45ec-adae-e98862e1537c
I faced this issue recently. I think it occurs when shuffling data coming from Kafka.
I fixed it by loading all of the dependency jars of org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.0 into the project; you can find them here.
For now, I don't know which ones are strictly necessary.
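For illustration only, a minimal sketch of shipping those jars explicitly through the operator's jars argument instead of (or alongside) packages. The file names, paths, and version numbers below are assumptions; the actual list is whatever the connector's transitive dependencies resolve to:
from datetime import datetime
from airflow import DAG
from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator

dag = DAG(dag_id='testpostgres_explicit_jars', schedule_interval=None,
          start_date=datetime(2022, 1, 1), catchup=False)

# Hypothetical local copies of the Kafka connector and its transitive dependencies
# (kafka-clients, spark-token-provider-kafka, commons-pool2, ...).
kafka_jars = ",".join([
    "/usr/local/airflow/jars/spark-sql-kafka-0-10_2.12-3.3.0.jar",
    "/usr/local/airflow/jars/spark-token-provider-kafka-0-10_2.12-3.3.0.jar",
    "/usr/local/airflow/jars/kafka-clients-2.8.1.jar",
    "/usr/local/airflow/jars/commons-pool2-2.11.1.jar",
])

spark_job = SparkSubmitOperator(
    application='/usr/local/airflow/data/testpostgres.py',
    conn_id='spark_kcluster',
    task_id='spark_job_test_explicit_jars',
    jars=kafka_jars,   # ships the jars to driver and executors via --jars
    dag=dag,
)
SparkSubmitOperator also accepts packages (as in the DAG above); listing the jars explicitly just avoids relying on Ivy resolution at submit time.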

Issue while reading delta file placed under wasb storage from mapr cluster

I am trying to read a Delta-format file from Azure storage using the code below in a Jupyter notebook running on a MapR cluster.
When I run this code it throws java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration).
%%configure -f
{
  "conf": {
    "spark.jars.packages": "io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-azure:3.3.1"
  },
  "kind": "spark",
  "driverMemory": "4g",
  "executorMemory": "2g",
  "executorCores": 3,
  "numExecutors": 4
}

val storage_account_name_dim = "storageAcc"
val sas_access_key_dim = "saskey"

spark.conf.set("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.AzureLogStore")
spark.conf.set("fs.AbstractFileSystem.wasbs.impl", "org.apache.hadoop.fs.azure.NativeAzureFileSystem")
spark.conf.set(s"fs.azure.sas.dsr.$storage_account_name_dim.blob.core.windows.net", sas_access_key_dim)

val refdf = spark.read.format("delta").load("wasbs://dsr@storageAcc.blob.core.windows.net/dim/ds/DS_CUSTOM_WAG_VENDOR_UPC")
refdf.show(1)
It gives me the issue below:
java.lang.RuntimeException: java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:135)
at org.apache.hadoop.fs.AbstractFileSystem.createFileSystem(AbstractFileSystem.java:164)
at org.apache.hadoop.fs.AbstractFileSystem.get(AbstractFileSystem.java:249)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:334)
at org.apache.hadoop.fs.FileContext$2.run(FileContext.java:331)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1669)
at org.apache.hadoop.fs.FileContext.getAbstractFileSystem(FileContext.java:331)
at org.apache.hadoop.fs.FileContext.getFileContext(FileContext.java:448)
at org.apache.spark.sql.delta.storage.HDFSLogStore.getFileContext(HDFSLogStore.scala:53)
at org.apache.spark.sql.delta.storage.HDFSLogStore.listFrom(HDFSLogStore.scala:137)
at org.apache.spark.sql.delta.Checkpoints$class.findLastCompleteCheckpoint(Checkpoints.scala:174)
at org.apache.spark.sql.delta.DeltaLog.findLastCompleteCheckpoint(DeltaLog.scala:58)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:156)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.loadMetadataFromFile(Checkpoints.scala:150)
at org.apache.spark.sql.delta.Checkpoints$class.lastCheckpoint(Checkpoints.scala:133)
at org.apache.spark.sql.delta.DeltaLog.lastCheckpoint(DeltaLog.scala:58)
at org.apache.spark.sql.delta.DeltaLog.<init>(DeltaLog.scala:139)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1$$anonfun$apply$10.apply(DeltaLog.scala:744)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1$$anonfun$apply$10.apply(DeltaLog.scala:744)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1.apply(DeltaLog.scala:743)
at org.apache.spark.sql.delta.DeltaLog$$anon$3$$anonfun$call$1.apply(DeltaLog.scala:743)
at com.databricks.spark.util.DatabricksLogging$class.recordOperation(DatabricksLogging.scala:77)
at org.apache.spark.sql.delta.DeltaLog$.recordOperation(DeltaLog.scala:671)
at org.apache.spark.sql.delta.metering.DeltaLogging$class.recordDeltaOperation(DeltaLogging.scala:103)
at org.apache.spark.sql.delta.DeltaLog$.recordDeltaOperation(DeltaLog.scala:671)
at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:742)
at org.apache.spark.sql.delta.DeltaLog$$anon$3.call(DeltaLog.scala:740)
at com.google.common.cache.LocalCache$LocalManualCache$1.load(LocalCache.java:4792)
at com.google.common.cache.LocalCache$LoadingValueReference.loadFuture(LocalCache.java:3599)
at com.google.common.cache.LocalCache$Segment.loadSync(LocalCache.java:2379)
at com.google.common.cache.LocalCache$Segment.lockedGetOrLoad(LocalCache.java:2342)
at com.google.common.cache.LocalCache$Segment.get(LocalCache.java:2257)
at com.google.common.cache.LocalCache.get(LocalCache.java:4000)
at com.google.common.cache.LocalCache$LocalManualCache.get(LocalCache.java:4789)
at org.apache.spark.sql.delta.DeltaLog$.apply(DeltaLog.scala:740)
at org.apache.spark.sql.delta.DeltaLog$.forTable(DeltaLog.scala:712)
at org.apache.spark.sql.delta.sources.DeltaDataSource.createRelation(DeltaDataSource.scala:169)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:318)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:223)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:211)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
... 50 elided
Caused by: java.lang.NoSuchMethodException: org.apache.hadoop.fs.azure.NativeAzureFileSystem.<init>(java.net.URI, org.apache.hadoop.conf.Configuration)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getDeclaredConstructor(Class.java:2178)
at org.apache.hadoop.fs.AbstractFileSystem.newInstance(AbstractFileSystem.java:129)
... 95 more
The same code works for Parquet files; any help would be much appreciated.
I resolved it by adding spark.delta.logStore.class to the conf parameter:
{ "conf": {"spark.jars.packages": "io.delta:delta-core_2.11:0.5.0,org.apache.hadoop:hadoop-azure:3.3.1","spark.delta.logStore.class":"org.apache.spark.sql.delta.storage.AzureLogStore"}, "kind": "spark","driverMemory" : "4g", "executorMemory": "2g", "executorCores": 3, "numExecutors":4}

Spark DataFrame Filter function throwing Task not Serializable exception

I am trying to filter DataFrame/Dataset records using the filter function with a Scala anonymous function, but it throws a Task not serializable exception. Can someone please look at the code and explain what is wrong with it?
val spark = SparkSession.builder()
  .appName("test data frame")
  .master("local[*]")
  .getOrCreate()

val user_seq = Seq(
  Row(1,"John","London"),
  Row(1,"Martin","New York"),
  Row(1,"Abhishek","New York")
)

val user_schema = StructType(
  Array(
    StructField("user_id",IntegerType,true),
    StructField("user_name",StringType,true),
    StructField("user_city",StringType,true)
  ))

var user_df = spark.createDataFrame(spark.sparkContext.parallelize(user_seq),user_schema)

var user_rdd = user_df.filter((item)=>{
  return item.getString(2) == "New York"
})

user_rdd.count();
I can see the exception below on the console. When I filter using a column name instead, it works fine.
objc[48765]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/bin/java (0x1059db4c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x105a5f4e0). One of the two will be used. Which one is undefined.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/07/18 20:10:09 INFO SparkContext: Running Spark version 2.4.6
20/07/18 20:10:09 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/07/18 20:10:09 INFO SparkContext: Submitted application: test data frame
20/07/18 20:10:09 INFO SecurityManager: Changing view acls groups to:
20/07/18 20:10:09 INFO SecurityManager: Changing modify acls groups to:
20/07/18 20:10:12 INFO StateStoreCoordinatorRef: Registered StateStoreCoordinator endpoint
20/07/18 20:10:12 INFO ContextCleaner: Cleaned accumulator 0
20/07/18 20:10:13 INFO CodeGenerator: Code generated in 170.789451 ms
20/07/18 20:10:13 INFO CodeGenerator: Code generated in 17.729004 ms
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:163)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2326)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:872)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1.apply(RDD.scala:871)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:385)
at org.apache.spark.rdd.RDD.mapPartitionsWithIndex(RDD.scala:871)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:630)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.prepareShuffleDependency(ShuffleExchangeExec.scala:92)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:128)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec$$anonfun$doExecute$1.apply(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
at org.apache.spark.sql.execution.exchange.ShuffleExchangeExec.doExecute(ShuffleExchangeExec.scala:119)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.InputAdapter.inputRDDs(WholeStageCodegenExec.scala:391)
at org.apache.spark.sql.execution.aggregate.HashAggregateExec.inputRDDs(HashAggregateExec.scala:151)
at org.apache.spark.sql.execution.WholeStageCodegenExec.doExecute(WholeStageCodegenExec.scala:627)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan.getByteArrayRdd(SparkPlan.scala:247)
at org.apache.spark.sql.execution.SparkPlan.executeCollect(SparkPlan.scala:296)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2836)
at org.apache.spark.sql.Dataset$$anonfun$count$1.apply(Dataset.scala:2835)
at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3370)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3369)
at org.apache.spark.sql.Dataset.count(Dataset.scala:2835)
at DataFrameTest$.main(DataFrameTest.scala:65)
at DataFrameTest.main(DataFrameTest.scala)
Caused by: java.io.NotSerializableException: java.lang.Object
Serialization stack:
- object not serializable (class: java.lang.Object, value: java.lang.Object#cec590c)
- field (class: DataFrameTest$$anonfun$1, name: nonLocalReturnKey1$1, type: class java.lang.Object)
- object (class DataFrameTest$$anonfun$1, <function1>)
- element of array (index: 1)
- array (class [Ljava.lang.Object;, size 5)
- field (class: org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, name: references$1, type: class [Ljava.lang.Object;)
- object (class org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$13, <function2>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:46)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:100)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
... 48 more
20/07/18 20:10:13 INFO SparkContext: Invoking stop() from shutdown hook
20/07/18 20:10:13 INFO SparkUI: Stopped Spark web UI at http://192.168.31.239:4040
20/07/18 20:10:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
20/07/18 20:10:13 INFO MemoryStore: MemoryStore cleared
20/07/18 20:10:13 INFO BlockManager: BlockManager stopped
20/07/18 20:10:13 INFO BlockManagerMaster: BlockManagerMaster stopped
20/07/18 20:10:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
20/07/18 20:10:13 INFO SparkContext: Successfully stopped SparkContext
20/07/18 20:10:13 INFO ShutdownHookManager: Shutdown hook called
20/07/18 20:10:13 INFO ShutdownHookManager: Deleting directory /private/var/folders/33/3n6vtfs54mdb7x6882fyqy4mccfmvg/T/spark-3e071448-7ad7-47b8-bf70-68ab74721aa2
Process finished with exit code 1
Remove the return keyword from the line below. In Scala, return inside an anonymous function is a non-local return; the compiler implements it by capturing a marker object (the nonLocalReturnKey1 of type java.lang.Object shown in the serialization stack), and that object is not serializable, so the closure fails Spark's serializability check.
Change this code:
var user_rdd = user_df.filter((item)=>{
  return item.getString(2) == "New York"
})
to the line below:
var user_rdd = user_df.filter(_.getString(2) == "New York")
or
user_df.filter($"user_city" === "New York").count
You can also refactor your code as shown below:
val df = Seq((1,"John","London"),(1,"Martin","New York"),(1,"Abhishek","New York"))
.toDF("user_id","user_name","user_city")
df.filter($"user_city" === "New York").count

Spark fails to load data frame from elasticsearch due to field format exception

My DataFrame read fails with a NumberFormatException on one of the nested JSON fields when reading from Elasticsearch.
I am not providing any schema, as it should be inferred automatically from Elasticsearch.
package org.arc

import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.log4j._
import scala.io.Source
import java.nio.charset.CodingErrorAction
import scala.io.Codec
import org.apache.spark.storage.StorageLevel
import org.apache.spark._
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.Utils
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.expressions
import org.apache.spark.sql.functions.{ concat, lit }
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.types.{ StructType, StructField, StringType };
import org.apache.spark.serializer.KryoSerializer

object SparkOnES {
  def main(args: Array[String]) {
    val spark = SparkSession
      .builder()
      .appName("SparkESTest")
      .config("spark.master", "local[*]")
      .config("spark.sql.warehouse.dir", "C://SparkScala//SparkLocal//spark-warehouse")
      .enableHiveSupport()
      .getOrCreate()

    //1. Read sample JSON
    import spark.implicits._
    //val myjson = spark.read.json("C:\\Users\\jasjyotsinghj599\\Desktop\\SampleTest.json")
    //myjson.show(false)

    //2. Read data from ES
    val esdf = spark.read.format("org.elasticsearch.spark.sql")
      .option("es.nodes", "XXXXXX")
      .option("es.port", "80")
      .option("es.query", "?q=*")
      .option("es.nodes.wan.only", "true")
      .option("pushdown", "true")
      .option("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .load("batch_index/ticket")

    esdf.createOrReplaceTempView("esdf")
    spark.sql("Select * from esdf limit 1").show(false)
    val esdf_fltr_lt = esdf.take(1)
  }
}
The error stack says that it cannot parse the input field. Looking at the exception, the issue seems to be caused by a mismatch between the data type expected (int, float, double) and the one received (string):
Caused by: java.lang.NumberFormatException: For input string: "161.60"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
at org.elasticsearch.spark.serialization.ScalaValueReader.parseLong(ScalaValueReader.scala:142)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.checkNull(ScalaValueReader.scala:120)
at org.elasticsearch.spark.serialization.ScalaValueReader.longValue(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.readValue(ScalaValueReader.scala:89)
at org.elasticsearch.spark.sql.ScalaRowValueReader.readValue(ScalaEsRowValueReader.scala:46)
at org.elasticsearch.hadoop.serialization.ScrollReader.parseValue(ScrollReader.java:770)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:720)
... 25 more
18/04/25 23:33:53 WARN TaskSetManager: Lost task 3.0 in stage 1.0 (TID 4, localhost): org.elasticsearch.hadoop.rest.EsHadoopParsingException: Cannot parse value [161.60] for field [tvl_tkt_tot_chrg_amt]
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:723)
at org.elasticsearch.hadoop.serialization.ScrollReader.map(ScrollReader.java:867)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:710)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHitAsMap(ScrollReader.java:476)
at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:401)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:296)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:269)
at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:393)
at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:92)
at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:61)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:784)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NumberFormatException: For input string: "161.60"
at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
at java.lang.Long.parseLong(Long.java:589)
at java.lang.Long.parseLong(Long.java:631)
at scala.collection.immutable.StringLike$class.toLong(StringLike.scala:277)
at scala.collection.immutable.StringOps.toLong(StringOps.scala:29)
at org.elasticsearch.spark.serialization.ScalaValueReader.parseLong(ScalaValueReader.scala:142)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader$$anonfun$longValue$1.apply(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.checkNull(ScalaValueReader.scala:120)
at org.elasticsearch.spark.serialization.ScalaValueReader.longValue(ScalaValueReader.scala:141)
at org.elasticsearch.spark.serialization.ScalaValueReader.readValue(ScalaValueReader.scala:89)
at org.elasticsearch.spark.sql.ScalaRowValueReader.readValue(ScalaEsRowValueReader.scala:46)
at org.elasticsearch.hadoop.serialization.ScrollReader.parseValue(ScrollReader.java:770)
at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:720)
... 25 more
18/04/25 23:33:53 INFO SparkContext: Invoking stop() from shutdown hook
18/04/25 23:33:53 INFO SparkUI: Stopped Spark web UI at http://10.1.2.244:4040
18/04/25 23:33:53 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/04/25 23:33:53 INFO MemoryStore: MemoryStore cleared
18/04/25 23:33:53 INFO BlockManager: BlockManager stopped
18/04/25 23:33:53 INFO BlockManagerMaster: BlockManagerMaster stopped
18/04/25 23:33:53 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/04/25 23:33:53 INFO SparkContext: Successfully stopped SparkContext
18/04/25 23:33:53 INFO ShutdownHookManager: Shutdown hook called
18/04/25 23:33:53 INFO ShutdownHookManager: Deleting directory
