NullPointerException while writing from Spark to Cassandra - cassandra

I am using spark-cassandra-connector-2.4.0-s_2.11 to write data from Spark to Cassandra on a Databricks cluster.
I am getting a java.lang.NullPointerException while writing the data. It works fine with a few records,
but the error appears when I try to load ~150 million records.
Can someone help me find the root cause?
Here is the code snippet:
val paymentExtractCsvDF = spark
.read
.format("csv")
.option("header", "true")
.load("/home/otl/extract/csvout/Payment")
paymentExtractCsvDF.printSchema()
root
|-- BAN: string (nullable = true)
|-- ENT_SEQ_NO: string (nullable = true)
|-- PYM_METHOD: string (nullable = true)
case class Payment(account_number: String, entity_sequence_number: String, payment_type: String)
val paymentResultDf = paymentExtractCsvDF.map(row => Payment(row.getAs("BAN"),
row.getAs("ENT_SEQ_NO"),
row.getAs("PYM_METHOD"))).toDF()
var paymentResultFilterDf = paymentResultDf
.filter($"account_number".isNotNull || $"account_number" != "")
.filter($"entity_sequence_number".isNotNull || $"entity_sequence_number" != "")
paymentResultFilterDf
.write
.format("org.apache.spark.sql.cassandra")
.mode("append")
.options(Map( "table" -> "cassandratable", "keyspace" -> "cassandrakeyspace"))
.save()
Here is the exception I am getting:
Failed to write statements to cassandrakeyspace.cassandratable. The
latest exception was
An unexpected error occurred server side on /10.18.15.198:9042: java.lang.NullPointerException
Please check the executor logs for more exceptions and information
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$3.apply(TableWriter.scala:243)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1$$anonfun$apply$3.apply(TableWriter.scala:241)
at scala.Option.map(Option.scala:146)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:241)
at com.datastax.spark.connector.writer.TableWriter$$anonfun$writeInternal$1.apply(TableWriter.scala:210)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:112)
at com.datastax.spark.connector.cql.CassandraConnector$$anonfun$withSessionDo$1.apply(CassandraConnector.scala:111)
at com.datastax.spark.connector.cql.CassandraConnector.closeResourceAfterUse(CassandraConnector.scala:145)
at com.datastax.spark.connector.cql.CassandraConnector.withSessionDo(CassandraConnector.scala:111)
at com.datastax.spark.connector.writer.TableWriter.writeInternal(TableWriter.scala:210)
at com.datastax.spark.connector.writer.TableWriter.insert(TableWriter.scala:197)
at com.datastax.spark.connector.writer.TableWriter.write(TableWriter.scala:183)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at com.datastax.spark.connector.RDDFunctions$$anonfun$saveToCassandra$1.apply(RDDFunctions.scala:36)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
at org.apache.spark.scheduler.Task.run(Task.scala:112)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/11/22 01:12:17 INFO CoarseGrainedExecutorBackend: Got assigned task 1095
19/11/22 01:12:17 INFO Executor: Running task 39.1 in stage 21.0 (TID 1095)
19/11/22 01:12:17 INFO ShuffleBlockFetcherIterator: Getting 77 non-empty blocks including 10 local blocks and 67 remote blocks
19/11/22 01:12:17 INFO ShuffleBlockFetcherIterator: Started 7 remote fetches in 3 ms
19/11/22 01:12:17 INFO ShuffleBlockFetcherIterator: Getting 64 non-empty blocks including 8 local blocks and 56 remote blocks
19/11/22 01:12:17 INFO ShuffleBlockFetcherIterator: Started 7 remote fetches in 1 ms
19/11/22 01:12:17 ERROR Executor: Exception in task 7.0 in stage 21.0 (TID 1012)

It seems your dataframe has key fields with null values. The problem might be in your filter condition; note also that Spark's column inequality operator is =!=, not !=. I think you want to do something like this:
var paymentResultFilterDf = paymentResultDf
.filter($"account_number".isNotNull && $"account_number" =!= "")
.filter($"entity_sequence_number".isNotNull && $"entity_sequence_number" =!= "")

Related

hive is failing when joining external and internal tables

Our environment/versions
hadoop 3.2.3
hive 3.1.3
spark 2.3.0
Our internal table in Hive is defined as
CREATE TABLE dw.CLIENT
(
client_id integer,
client_abbrev string,
client_name string,
effective_start_ts timestamp,
effective_end_ts timestamp,
active_flag string,
record_version integer
)
stored as orc tblproperties ('transactional'='true');
and the external table as
CREATE EXTERNAL TABLE ClientProcess_21
( ClientId string, ClientDescription string, IsActive string, OldClientId string, NewClientId string, Description string,
TinyName string, FinanceCode string, ParentClientId string, ClientStatus string, FSPortalClientId string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '.../client_extract_20220801.csv/' TBLPROPERTIES ("skip.header.line.count"="1")
I can select from both tables.
The internal table is empty. When I try joining them
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = t.client_id
Hive is failing with
SQL Error [3] [42000]: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during runtime. Please check stacktrace for the root cause.
partial stack trace from the Hive log
2022-08-01T18:53:39,012 INFO [RPC-Handler-1] client.SparkClientImpl: Received result for 07a38056-5ba8-45e0-8783-397f25f398cb
2022-08-01T18:53:39,219 ERROR [HiveServer2-Background-Pool: Thread-1667] status.SparkJobMonitor: Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.(OrcFile.java:113)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getOptions(OrcOutputFormat.java:126)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:184)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:61)
at org.apache.hadoop.hive.ql.exec.Utilities.createEmptyFile(Utilities.java:3458)
at org.apache.hadoop.hive.ql.exec.Utilities.createDummyFileForEmptyPartition(Utilities.java:3489)
at org.apache.hadoop.hive.ql.exec.Utilities.access$300(Utilities.java:222)
at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3433)
at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3370)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:318)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:241)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:113)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:359)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:378)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:343)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.(OrcFile.java:113)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
at org.apache.hadoop.hive.q
******* update
DMLs on tables defined as ..stored as orc tblproperties ('transactional'='true');
are failing with
2022-08-02 09:47:42 ERROR SparkJobMonitor:1250 - Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
java.util.concurrent.ExecutionException: Exception thrown by job
,,
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.222.108.202, executor 0): java.lang.RuntimeException: Error processing row: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
..
Caused by: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
I think this is related to data type conversion when joining: one join column is a string and the other is an int.
Can you please try this:
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = cast(t.client_id as string) -- cast it to string
Resolved by copying the ORC jars to the Spark home:
cp $HIVE_HOME/lib/orc $SPARK_HOME/jars/
cp $HIVE_HOME/hive-storage-api-2.7.0.jar $SPARK_HOME/jars/

Spark ML ALS collaborative filtering always fail if the iteration more than 20 [duplicate]

This question already has an answer here:
Checkpointing In ALS Spark Scala
(1 answer)
Closed 4 years ago.
My data set is about 3 GB and has about 380 million rows. The job always fails if I add more iterations, and increasing memory, increasing or decreasing the number of blocks, or decreasing the checkpoint interval does not solve the problem.
Caused by: java.net.ConnectException: Connection refused (Connection refused) at java.net.PlainSocketImpl.socketConnect(Native Method)
The method of setting a small checkpoint interval, suggested in the question below, does not solve my problem either:
StackOverflow-error when applying pyspark ALS's "recommendProductsForUsers" (although cluster of >300GB Ram available)
This is the DataFrame for ALS training, which is about 380 million rows.
+---------+-----------+------+
| user_id|item_id|rating|
+---------+-----------+------+
|154317644| 58866| 6|
| 69669214| 601866| 7|
|126094876| 909352| 3|
| 45246613| 1484481| 3|
|123317968| 2101977| 3|
| 375928| 2681933| 1|
|136939309| 3375806| 2|
| 3150751| 4198976| 2|
| 87648646| 1030196| 3|
| 57672425| 5385142| 2|
+---------+-----------+------+
This is the code to train ALS.
val als = new ALS()
.setMaxIter(setMaxIter)
.setRegParam(setRegParam)
.setUserCol("user_id")
.setItemCol("item_id")
.setRatingCol("rating")
.setImplicitPrefs(false)
.setCheckpointInterval(setCheckpointInterval)
.setRank(setRank)
.setNumItemBlocks(setNumItemBlocks)
.setNumUserBlocks(setNumUserBlocks)
val Array(training, test) = ratings.randomSplit(Array(0.9, 0.1))
val model = als.fit(training) // wrong in this step
This is the ALS source code where error happens.
val srcOut = srcOutBlocks.join(srcFactorBlocks).flatMap {
case (srcBlockId, (srcOutBlock, srcFactors)) =>
srcOutBlock.view.zipWithIndex.map { case (activeIndices, dstBlockId) =>
(dstBlockId, (srcBlockId, activeIndices.map(idx => srcFactors(idx))))
}
}
This is the Exception and Error logs.
18/08/23 15:05:43 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
18/08/23 15:13:35 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 56.0 (TID 31322, 6.ai.bjs-datalake.p1staff.com, executor 9): java.lang.StackOverflowError
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2669)
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:3170)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1678)
18/08/23 15:13:35 WARN server.TransportChannelHandler: Exception in connection from /10.191.161.108:23300
java.io.IOException: Connection reset by peer
at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
at sun.nio.ch.IOUtil.read(IOUtil.java:192)
at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380)
18/08/23 15:13:36 ERROR cluster.YarnClusterScheduler: Lost executor 15 on 2.ai.bjs-datalake.p1staff.com: Container marked as failed: container_e04_1533096025492_4001_01_000016 on host: 2.ai.bjs-datalake.p1staff.com. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e04_1533096025492_4001_01_000016
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
18/08/23 15:05:43 WARN metastore.ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
18/08/23 15:13:35 WARN scheduler.TaskSetManager: Lost task 20.0 in stage 56.0 (TID 31322, 6.ai.bjs-datalake.p1staff.com, executor 9): java.lang.StackOverflowError
at java.io.ObjectInputStream$PeekInputStream.readFully(ObjectInputStream.java:2669)
at java.io.ObjectInputStream$BlockDataInputStream.readInt(ObjectInputStream.java:3170)
at java.io.ObjectInputStream.readHandle(ObjectInputStream.java:1678)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1739)
18/08/23 15:13:36 ERROR cluster.YarnClusterScheduler: Lost executor 10 on 5.ai.bjs-datalake.p1staff.com: Container marked as failed: container_e04_1533096025492_4001_01_000011 on host: 5.ai.bjs-datalake.p1staff.com. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_e04_1533096025492_4001_01_000011
Exit code: 50
Stack trace: ExitCodeException exitCode=50:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:585)
at org.apache.hadoop.util.Shell.run(Shell.java:482)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:776)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Has anybody met this error?
After setting the checkpoint directory, it works. Thanks @eliasah.
spark.sparkContext.setCheckpointDir("hdfs://datalake/check_point_directory/als")
Checkpointing will not work if you do not set the directory.
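For reference, here is a minimal sketch of how the checkpoint directory and the ALS checkpoint interval fit together (the parameter values are placeholders, and spark and training are assumed to be the session and training DataFrame from the question):
import org.apache.spark.ml.recommendation.ALS
// Checkpointing truncates the long lineage ALS builds up across iterations,
// which is what otherwise leads to the StackOverflowError during (de)serialization.
spark.sparkContext.setCheckpointDir("hdfs://datalake/check_point_directory/als")
val als = new ALS()
  .setMaxIter(40)               // placeholder: more than 20 iterations
  .setCheckpointInterval(5)     // checkpoint every 5 iterations
  .setUserCol("user_id")
  .setItemCol("item_id")
  .setRatingCol("rating")
val model = als.fit(training)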

Spark streaming batches stopping and waiting for GC

I have a simple Spark Streaming job. It reads events from a Kafka topic, does a simple event transformation (e.g. replacing some characters with others) and sends the transformed events to a second Kafka topic. Everything works OK for some time (1-1.5 h); after that we see that batches get scheduled and wait to run. The pause takes about 5-6 minutes, during which GC is working and cleaning memory. After that everything works OK again, but sometimes processing stops and we see errors like the one in the stack trace below. Please advise which Spark / Java parameters should be set to avoid this GC overhead.
Spark jobs are scheduled every 10 seconds; one batch execution takes about 5 seconds.
Stack trace
2017-09-21 11:26:15 WARN TaskSetManager:66 - Lost task 33.0 in stage 115.0 (TID 4699, work8, executor 6): java.lang.OutOfMemoryError: GC overhead limit exceeded
at org.apache.kafka.clients.consumer.internals.Fetcher.createFetchRequests(Fetcher.java:724)
at org.apache.kafka.clients.consumer.internals.Fetcher.sendFetches(Fetcher.java:176)
at org.apache.kafka.clients.consumer.KafkaConsumer.pollOnce(KafkaConsumer.java:1042)
at org.apache.kafka.clients.consumer.KafkaConsumer.poll(KafkaConsumer.java:995)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.poll(CachedKafkaConsumer.scala:99)
at org.apache.spark.streaming.kafka010.CachedKafkaConsumer.get(CachedKafkaConsumer.scala:70)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:228)
at org.apache.spark.streaming.kafka010.KafkaRDD$KafkaRDDIterator.next(KafkaRDD.scala:194)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1951)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
2017-09-21 11:26:15 INFO TaskSetManager:54 - Lost task 37.0 in stage 115.0 (TID 4702) on work8, executor 6: java.lang.OutOfMemoryError (GC overhead limit exceeded) [duplicate 1]
2017-09-21 11:26:15 INFO TaskSetManager:54 - Lost task 26.0 in stage 115.0 (TID 4695) on work8, executor 6: java.lang.OutOfMemoryError (GC overhead limit exceeded) [duplicate 2]
Parameters of spark-submit
spark-2.1.1-bin-hadoop2.6/bin/spark-submit \
--master yarn \
--deploy-mode client \
--executor-cores 8 \
--executor-memory 20g \
--driver-memory 20g \
--num-executors 4 \
--conf "spark.driver.maxResultSize=8g" \
--conf "spark.streaming.receiver.maxRate=1125" \
--conf "spark.streaming.kafka.maxRatePerPartition=1125" \
//Job
val sendToKafka = KafkaSender.sendToKafka(spark, kafkaServers, outputTopic, kafkaEnabled) _
val stream = KafkaUtils
.createDirectStream(ssc, PreferConsistent, Subscribe[String, String](inputTopics, kafkaParams))
stream.foreachRDD { statementsStreamBatch =>
val offsetRanges = statementsStreamBatch.asInstanceOf[HasOffsetRanges].offsetRanges
if (!statementsStreamBatch.isEmpty) {
val inputCsvRDD = statementsStreamBatch.map(_.value)
var outputCsvRDD : RDD[String] = null
if(enrichmerEnabled) {
outputCsvRDD = Enricher.processStream(inputCsvRDD, enricherNumberOfFields)
} else {
outputCsvRDD = inputCsvRDD
}
sendToKafka(outputCsvRDD)
}
stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
ssc.start()
ssc.awaitTermination()
//Enricher
object Enricher {
def processStream(eventStream: RDD[String], numberOfFields : Integer): RDD[String] = {
eventStream.map(
csv => if (csv.count(_ == ',') <= numberOfFields) {
csv
} else {
csv.replaceAll(",(?=[^']*',)", "#")
}
)
}
}
//KafkaSender
object KafkaSender {
def sendToKafka(spark: SparkSession, servers: String, topic: String, enabled: Boolean)(message: RDD[String]): Unit = {
val kafkaSink = spark.sparkContext.broadcast(KafkaSink(getKafkaProperties(servers)))
val kafkaTopic = spark.sparkContext.broadcast(topic)
message.foreach(kafkaSink.value.send(kafkaTopic.value, _))
}
}

Error while indexing to Elasticsearch with spark streaming, scala case class with more than 22 argument

I am using Spark Streaming with Scala to write log data to Elasticsearch.
I am not able to create a case class with the more than 22 arguments my case requires, since that is not supported in Scala 2.10.
So I am using the approach below, a regular class instead of a case class.
Scala class
class FactUsage(d_EVENT_TYPE_NR: Long,EVENT_GRP_DESC: String,EVENT_DESC: String,CUST_TYPE_CD: Long,TICKET_RATING_CD: Long,BUS_UNIT_DESC: String,CUST_MKT_SEGM_DESC: String,EVENT_DTTM: String,EVENT_DTNR: Long,SERVED_PARTY_IMEI_NUM: String,SERVED_PARTY_IMSI_NUM: String,SERVED_PARTY_PHONE_NUM: Long,OTHER_PARTY_ID: String,EVENT_DURATION_QTY: Long,EVENT_VOLUME_DOWN_QTY: Long,EVENT_VOLUME_TOTAL_QTY: Long,EVENT_VOLUME_UP_QTY: Long,ACCESS_POINT_ID: String,d_CELL_NR: Long,d_CONTRACT_NR: Long,d_CUSTOMER_NR: Long,d_CUSTOMER_TOP_PARENT_NR: String,d_DEVICE_NR: Long,d_ORIGIN_DESTINATION_NR: Long,d_DIRECTION_NR: Long,d_OTHER_OPER_NR: Long,d_OTHER_SUBSCR_OPER_NR: Long,d_ROAMING_NR: Long,d_SALES_AGENT_NR: String,d_SERVED_OPER_NR: Long,d_SERVED_SUBSCR_OPER_NR: Long,d_TARIFF_MODEL_NR: Long,d_TERMINATION_NR: Long,d_USAGE_SERVICE_NR: Long,RUN_ID: String) extends Product with Serializable
{
def canEqual(that:Any)=that.isInstanceOf[FactUsage]
def productArity = 35 // Number of columns
def productElement(idx: Int) = idx match
{
case 0 => d_EVENT_TYPE_NR;case 1 =>EVENT_GRP_DESC;case 2 =>EVENT_DESC;case 3 =>CUST_TYPE_CD;case 4 =>TICKET_RATING_CD;case 5 =>BUS_UNIT_DESC;case 6 =>CUST_MKT_SEGM_DESC;case 7 =>EVENT_DTTM;case 8 =>EVENT_DTNR;case 9 =>SERVED_PARTY_IMEI_NUM;case 10 =>SERVED_PARTY_IMSI_NUM;case 11 =>SERVED_PARTY_PHONE_NUM;case 12 =>OTHER_PARTY_ID;case 13 =>EVENT_DURATION_QTY;case 14 =>EVENT_VOLUME_DOWN_QTY;case 15 =>EVENT_VOLUME_TOTAL_QTY;case 16 =>EVENT_VOLUME_UP_QTY;case 17 =>ACCESS_POINT_ID;case 18 =>d_CELL_NR;case 19 =>d_CONTRACT_NR;case 20 =>d_CUSTOMER_NR;case 21 =>d_CUSTOMER_TOP_PARENT_NR;case 22 =>d_DEVICE_NR;case 23 =>d_ORIGIN_DESTINATION_NR;case 24 =>d_DIRECTION_NR;case 25 =>d_OTHER_OPER_NR;case 26 =>d_OTHER_SUBSCR_OPER_NR;case 27 =>d_ROAMING_NR;case 28 =>d_SALES_AGENT_NR;case 29 =>d_SERVED_OPER_NR;case 30 =>d_SERVED_SUBSCR_OPER_NR;case 31 =>d_TARIFF_MODEL_NR;case 32 =>d_TERMINATION_NR;case 33 =>d_USAGE_SERVICE_NR;case 34 =>RUN_ID
}
}
Spark Streaming Code to Write to Elasticsearch
val rddAbcServerLog = lines.filter(x => x.toString.contains("abc_server_logs"))
EsSparkStreaming.saveToEs(rddAbcServerLog.map(line => parser.formatDelimeted(line)).map(p => parser.runES(p.toString)), esindex + "/" + estype)
I have debugged, and there are no issues with the functions used in the lambda expressions.
The error comes while writing to Elasticsearch.
Error
17/04/15 11:34:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [xx.xxx.xx.xx:10200] returned Bad Request(400) - failed to parse; Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:250)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:202)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:242)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:182)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:159)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
17/04/15 11:34:05 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: Found unrecoverable error [xx.xxx.xx.xx:10200] returned Bad Request(400) - failed to parse; Bailing out..
at org.elasticsearch.hadoop.rest.RestClient.processBulkResponse(RestClient.java:250)
at org.elasticsearch.hadoop.rest.RestClient.bulk(RestClient.java:202)
at org.elasticsearch.hadoop.rest.RestRepository.tryFlush(RestRepository.java:220)
at org.elasticsearch.hadoop.rest.RestRepository.flush(RestRepository.java:242)
at org.elasticsearch.hadoop.rest.RestRepository.doWriteToIndex(RestRepository.java:182)
at org.elasticsearch.hadoop.rest.RestRepository.writeToIndex(RestRepository.java:159)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:67)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.elasticsearch.spark.rdd.EsSpark$$anonfun$doSaveToEs$1.apply(EsSpark.scala:102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
17/04/15 11:34:05 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
17/04/15 11:34:05 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/04/15 11:34:05 INFO TaskSchedulerImpl: Cancelling stage 0
Note: the code may have odd naming conventions and masked IPs; I have modified it for posting to a public forum.
What you're doing is cumbersome and error-prone. Instead, use multiple case classes.
case class Group(grpDesc: String, eventDesc: String)
case class Event(dttm: String, dtnr: String)
...and so on
Then when you've grouped all related items into their own case classes:
case class FactUsage(group: Group, event: Event, ...)
You should pass an instance of FactUsage to saveToEs.
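For illustration, a rough end-to-end sketch (the parse helpers are hypothetical stand-ins for the poster's parser, and FactUsage is reduced to two groups for brevity; EsSparkStreaming is used as in the question):
import org.elasticsearch.spark.streaming.EsSparkStreaming
case class Group(grpDesc: String, eventDesc: String)
case class Event(dttm: String, dtnr: String)
case class FactUsage(group: Group, event: Event)        // reduced for the sketch
// Hypothetical helpers standing in for the real parsing logic
def parseGroup(line: String): Group = Group("...", "...")
def parseEvent(line: String): Event = Event("...", "...")
val docs = rddAbcServerLog.map(line => FactUsage(parseGroup(line), parseEvent(line)))
EsSparkStreaming.saveToEs(docs, esindex + "/" + estype)  // nested case classes are indexed as nested JSON objects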

spark job failed with exception while saving dataframe contentes as csv files using spark SQL

I am trying to save DataFrame contents to HDFS in CSV format. I am able to do it with a small number of files, but when I try with more files (90+) I get a NullPointerException and the job fails. Below is my code:
val df1 = sqlContext.read.format("com.databricks.spark.csv").option("header", "true").option("inferSchema", "false").option("delimiter", "|").load("hdfs path for loading multiple files/*");
val mydateFunc = udf {(x: String) => x.split("/") match {case Array(month,date,year) => year+"-"+month+"-"+date case Array(y)=> y}}
val df2 = df1.withColumn("orderdate", mydateFunc(df1("Date on which the record was created"))).drop("Date on which the record was created")
val df3 = df2.withColumn("deliverydate", mydateFunc(df2("Requested delivery date"))).drop("Requested delivery date")
val exp = "(.*)(44000\\d{5}|69499\\d{6})(.*)".r
val upc_extractor: (String => String) = (arg: String) => arg match { case exp(pref,required,suffx) => required case x:String => x }
val sqlfunc = udf(upc_extractor)
val df4 = df3.withColumn("formatted_UPC", sqlfunc(col("European Article Numbers/Universal Produ")))
df4.write.format("com.databricks.spark.csv").option("header", "false").save("destination path in hdfs to save the resultant files");
Below is the exception I am getting:
16/02/03 01:59:15 INFO FileOutputCommitter: File Output Committer Algorithm version is 1
16/02/03 01:59:33 ERROR Executor: Exception in task 2.0 in stage 1.0 (TID 3)
java.lang.NullPointerException
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30)
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:71)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at com.databricks.spark.csv.package$CsvSchemaRDD$$anonfun$9$$anon$1.next(package.scala:165)
at com.databricks.spark.csv.package$CsvSchemaRDD$$anonfun$9$$anon$1.next(package.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/02/03 01:59:33 INFO TaskSetManager: Starting task 32.0 in stage 1.0 (TID 33, localhost, ANY, 1692 bytes)
16/02/03 01:59:33 INFO Executor: Running task 32.0 in stage 1.0 (TID 33)
16/02/03 01:59:33 WARN TaskSetManager: Lost task 2.0 in stage 1.0 (TID 3, localhost): java.lang.NullPointerException
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30)
at $line42.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:30)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:71)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf$$anonfun$2.apply(ScalaUdf.scala:70)
at org.apache.spark.sql.catalyst.expressions.ScalaUdf.eval(ScalaUdf.scala:960)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:118)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:68)
at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(Projection.scala:52)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at com.databricks.spark.csv.package$CsvSchemaRDD$$anonfun$9$$anon$1.next(package.scala:165)
at com.databricks.spark.csv.package$CsvSchemaRDD$$anonfun$9$$anon$1.next(package.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply$mcV$sp(PairRDDFunctions.scala:1109)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13$$anonfun$apply$6.apply(PairRDDFunctions.scala:1108)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1285)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1116)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsHadoopDataset$1$$anonfun$13.apply(PairRDDFunctions.scala:1095)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)
at org.apache.spark.scheduler.Task.run(Task.scala:70)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/02/03 01:59:33 ERROR TaskSetManager: Task 2 in stage 1.0 failed 1 times; aborting job
16/02/03 01:59:33 INFO TaskSchedulerImpl: Cancelling stage 1
16/02/03 01:59:33 INFO Executor: Executor is trying to kill task 29.0 in stage 1.0 (TID 30)
16/02/03 01:59:33 INFO Executor: Executor is trying to kill task 8.0 in stage 1.0 (TID 9)
16/02/03 01:59:33 INFO TaskSchedulerImpl: Stage 1 was cancelled
16/02/03 01:59:33 INFO Executor: Executor is trying to kill task 0.0 in stage 1.0 (TID 1)
The Spark version is 1.4.1. Any help is much appreciated.
Probably one of your files has bad input in it. The first thing to do is find that file. Once you have found it, try to find the line that causes the problem; when you have the line, look at it closely and you will probably see the issue. My guess is that the number of columns doesn't match expectations, or that something is not escaped correctly. If you can't find it, you can still update the question with the content of the file.
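For example, a quick way to narrow it down (a sketch using the date column name from the question, and assuming m/d/yyyy dates) is to look for rows where that column is null or malformed before applying the udf:
import org.apache.spark.sql.functions.col
// Rows that would make mydateFunc call split on null
df1.filter(col("Date on which the record was created").isNull).show(20)
// Rows whose date is present but not in m/d/yyyy form
df1.filter(!col("Date on which the record was created").rlike("^\\d{1,2}/\\d{1,2}/\\d{4}$"))
   .select("Date on which the record was created")
   .show(20)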
After adding a null check to the udf mydateFunc (the null values were causing the NPE), the code works fine and I am able to load all the files.
val mydateFunc = udf { (x: String) => if (x == null) x else x.split("/") match { case Array(month, date, year) => year + "-" + month + "-" + date case Array(y) => y } }
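An equivalent, slightly more idiomatic null guard (just a sketch) wraps the input in Option so the null never reaches split:
import org.apache.spark.sql.functions.udf
val mydateFunc = udf { (x: String) =>
  Option(x).map { s =>
    s.split("/") match {
      case Array(month, date, year) => year + "-" + month + "-" + date
      case Array(y) => y
    }
  }.orNull
}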
