Spark exception when inserting dataframe results into a hive table - apache-spark

This is my code snippet. I am getting following exception when spar.sql(query) is getting executed.
My table_v2 has 262 columns. My table_v3 has 9 columns.
Can someone faced similar issue and help to resolve this? TIA
spark = SparkSession.builder.enableHiveSupport().getOrCreate()
sc=spark.sparkContext
df1 = spark.sql("select * from myDB.table_v2")
df2 = spark.sql("select * from myDB.table_v3")
result_df = df1.join(df2, (df1.id_c == df2.id_c) & (df1.cycle_r == df2.cycle_r) & (df1.consumer_r == df2.consumer_r))
final_result_df = result_df.select(df1["*"])
final_result_df.distinct().createOrReplaceTempView("results")
query = "INSERT INTO TABLE myDB.table_v2_final select * from results"
spark.sql(query);
I tried to set the parameter in conf and it did not help to resolve the issue:
spark.sql.debug.maxToStringFields=500
Error:
20/12/16 19:28:20 ERROR FileFormatWriter: Job job_20201216192707_0002 aborted.
20/12/16 19:28:20 ERROR Executor: Exception in task 90.0 in stage 2.0 (TID 225)
org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:285)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:197)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:196)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Missing required char ':' at 'struct<>
at org.apache.orc.TypeDescription.requireChar(TypeDescription.java:293)
at org.apache.orc.TypeDescription.parseStruct(TypeDescription.java:326)
at org.apache.orc.TypeDescription.parseType(TypeDescription.java:385)
at org.apache.orc.TypeDescription.fromString(TypeDescription.java:406)
at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.org$apache$spark$sql$execution$datasources$orc$OrcSerializer$$createOrcValue(OrcSerializer.scala:226)
at org.apache.spark.sql.execution.datasources.orc.OrcSerializer.<init>(OrcSerializer.scala:36)
at org.apache.spark.sql.execution.datasources.orc.OrcOutputWriter.<init>(OrcOutputWriter.scala:36)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$$anon$1.newInstance(OrcFileFormat.scala:108)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.newOutputWriter(FileFormatWriter.scala:367)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$SingleDirectoryWriteTask.execute(FileFormatWriter.scala:378)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:269)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask$3.apply(FileFormatWriter.scala:267)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1415)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:272)
... 8 more

I have dropped my myDB.table_v2_final and modified the below line in my code and it worked.
I suspect there might be some issue in the way I created the table.
query = "create external table myDB.table_v2_final as select * from results"

Related

Getting error while writing parquet files to Azure data lake storage gen 2

Hi I have a usecase where I am reading parquet files and writing it to ADLG Gen 2. This is without any modification to data.
MY Code:
val kustoLogsSourcePath: String = "/mnt/SOME_FOLDER/2023/01/11/fe73f221-b771-49c9-ba7d-2e2af4fe4f2a_1_69fc119b888447efa9ed2ecd7a4ab647.parquet"
val outputPath: String = "/mnt/SOME_FOLDER/2023/01/10/EventLogs1/"
val kustoLogData = spark.read.parquet(kustoLogsSourcePath)
kustoLogData.write.mode(SaveMode.Overwrite).save(outputPath)
I am getting this error, any ideas how to solve it:
Here, I have shared all the exception related messages that I got.
org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:196)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:192)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:110)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:108)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:128)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:143)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$5.apply(SparkPlan.scala:183)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:180)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:131)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:114)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:114)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:690)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:690)
at
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 276 in stage 2.0 failed 4 times, most recent failure: Lost task 276.3 in stage 2.0 (TID 351, 10.139.64.13, executor 5): com.databricks.sql.io.FileReadException: Error while reading file dbfs:[REDACTED]/eventlogs/2023/01/10/[REDACTED-FILE-NAME].parquet.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:272)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:256)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:197)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.scan_nextBatch_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:584)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:634)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.access$100(VectorizedColumnReader.java:49)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader$1.visit(VectorizedColumnReader.java:557)
at
Caused by: com.databricks.sql.io.FileReadException: Error while reading file dbfs:[REDACTED]/eventlogs/2023/01/11/fe73f221-b771-49c9-ba7d-2e2af4fe4f2a_1_69fc119b888447efa9ed2ecd7a4ab647.parquet.
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.logFileNameAndThrow(FileScanRDD.scala:272)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1$$anon$2.getNext(FileScanRDD.scala:256)
at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:197)
at
Caused by: java.lang.UnsupportedOperationException: Unsupported encoding: DELTA_BYTE_ARRAY
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.initDataReader(VectorizedColumnReader.java:584)
at org.apache.spark.sql.execution.datasources.parquet.VectorizedColumnReader.readPageV2(VectorizedColumnReader.java:634)
at
It seems that some columns are DELTA_BYTE_ARRAY encoded, a workarround would be to turn off the vectorized reader property:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
Try to modify your code and also remove the string parameter in the font of the variable and also use .format("delta") for reading delta file.
%scala
val kustoLogsSourcePath = "/mnt/SOME_FOLDER/2023/01/11/"
val outputPath = "/mnt/SOME_FOLDER/2023/01/10/EventLogs1/"
val kustoLogData = spark.read.format("delta").load(kustoLogsSourcePath)
kustoLogData.write.format("parquet").mode("append").mode(SaveMode.Overwrite).save(outputPath)
For the demo, this is my FileStore location /FileStore/tables/delta_train/.
I reproduce same in my environment as per above code .I got this output.

hive is failing when joining external and internal tables

Our environment/versions
hadoop 3.2.3
hive 3.1.3
spark 2.3.0
our internal table in hive is defined as
CREATE TABLE dw.CLIENT
(
client_id integer,
client_abbrev string,
client_name string,
effective_start_ts timestamp,
effective_end_ts timestamp,
active_flag string,
record_version integer
)
stored as orc tblproperties ('transactional'='true');
external as
CREATE EXTERNAL TABLE ClientProcess_21
( ClientId string, ClientDescription string, IsActive string, OldClientId string, NewClientId string, Description string,
TinyName string, FinanceCode string, ParentClientId string, ClientStatus string, FSPortalClientId string,)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE LOCATION '.../client_extract_20220801.csv/' TBLPROPERTIES ("skip.header.line.count"="1")
I can select from both tables.
the internal table is empty, when I try joining them
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = t.client_id
Hive is failing with
SQL Error [3] [42000]: Error while processing statement: FAILED: Execution Error, return code 3 from org.apache.hadoop.hive.ql.exec.spark.SparkTask. Spark job failed during runtime. Please check stacktrace for the root cause.
partial stack trace from the Hive log
2022-08-01T18:53:39,012 INFO [RPC-Handler-1] client.SparkClientImpl: Received result for 07a38056-5ba8-45e0-8783-397f25f398cb
2022-08-01T18:53:39,219 ERROR [HiveServer2-Background-Pool: Thread-1667] status.SparkJobMonitor: Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.(OrcFile.java:113)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getOptions(OrcOutputFormat.java:126)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:184)
at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat.getHiveRecordWriter(OrcOutputFormat.java:61)
at org.apache.hadoop.hive.ql.exec.Utilities.createEmptyFile(Utilities.java:3458)
at org.apache.hadoop.hive.ql.exec.Utilities.createDummyFileForEmptyPartition(Utilities.java:3489)
at org.apache.hadoop.hive.ql.exec.Utilities.access$300(Utilities.java:222)
at org.apache.hadoop.hive.ql.exec.Utilities$GetInputPathsCallable.call(Utilities.java:3433)
at org.apache.hadoop.hive.ql.exec.Utilities.getInputPaths(Utilities.java:3370)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.cloneJobConf(SparkPlanGenerator.java:318)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:241)
at org.apache.hadoop.hive.ql.exec.spark.SparkPlanGenerator.generate(SparkPlanGenerator.java:113)
at org.apache.hadoop.hive.ql.exec.spark.RemoteHiveSparkClient$JobStatusJob.call(RemoteHiveSparkClient.java:359)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:378)
at org.apache.hive.spark.client.RemoteDriver$JobWrapper.call(RemoteDriver.java:343)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.(OrcFile.java:113)
at org.apache.hadoop.hive.ql.io.orc.OrcFile.writerOptions(OrcFile.java:317)
at org.apache.hadoop.hive.q
******* update
DMLs on tables defined as ..stored as orc tblproperties ('transactional'='true');
are failing with
2022-08-02 09:47:42 ERROR SparkJobMonitor:1250 - Job failed with java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
java.util.concurrent.ExecutionException: Exception thrown by job
,,
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, 10.222.108.202, executor 0): java.lang.RuntimeException: Error processing row: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.exec.spark.SparkMapRecordHandler.processRow(SparkMapRecordHandler.java:149)
..
Caused by: java.lang.NoSuchMethodError: org.apache.orc.OrcFile$WriterOptions.useUTCTimestamp(Z)Lorg/apache/orc/OrcFile$WriterOptions;
at org.apache.hadoop.hive.ql.io.orc.OrcFile$WriterOptions.useUTCTimestamp(OrcFile.java:286)
I think this is related to data type conversation when joining. One join col is string and other is int.
Can you please try this
select
null, s.*
from ClientProcess_21 s
join dw.client t
on s.ClientId = cast(t.client_id as string) -- cast it to string
resolved by copying orc jars to spark home
cp $HIVE_HOME/lib/orc $SPARK_HOME/jars/
cp $HIVE_HOME/hive-storage-api-2.7.0.jar $SPARK_HOME/jars/

a Spark Scala String Matching UDF

import org.apache.spark.sql.functions.lit
val containsString = (haystack:String, needle:String) =>{
if (haystack.contains(needle)){
1
}
else{
0
}
}
val containsStringUDF = udf(containsString _)
val new_df = df.withColumn("nameContainsxyz", containsStringUDF($"name"),lit("xyz")))
I am new to Spark scala. The above code seems to compile successfully. However, when I try to run
new_df.groupBy("nameContainsxyz").sum().show()
The error throws out. Could someone help me out? The error message is below.
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$1: (string, string) => int)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.agg_doAggregateWithKeys_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage6.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$15$$anon$2.hasNext(WholeStageCodegenExec.scala:655)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1405)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
... 3 more
Caused by: java.lang.NullPointerException
at $anonfun$1.apply(<console>:41)
at $anonfun$1.apply(<console>:40)
... 15 more
Just an update: the error throws out because some of the rows in the specified column is null. Adding null check in the UDF completely solved the issue.
Thanks
If I understand what you're trying to do correctly, you want to know how many rows there are in which 'xyz' appears in column name?
You can do that without using a UDF:
df.filter('name.contains("xyz")). count

shc-core: NoSuchMethodError org.apache.hadoop.hbase.client.Put.addColumn

I try to use shc-core to save spark dataframe into hbase via spark.
My versions:
hbase: 1.1.2.2.6.4.0-91
spark: 1.6
scala: 2.10
shc: 1.1.1-1.6-s_2.10
hdp: 2.6.4.0-91
Configuration looks like that:
val schema_array = s"""{"type": "array", "items": ["string","null"]}""".stripMargin
def catalog: String = s"""{
|"table":{"namespace":"default", "name":"tblename"},
|"rowkey":"id",
|"columns":{
|"id":{"cf":"rowkey", "col":"id", "type":"string"},
|"col1":{"cf":"data", "col":"col1", "avro":"schema_array"}
|}
|}""".stripMargin
df
.write
.options(Map(
"schema_array"-> schema_array,
HBaseTableCatalog.tableCatalog -> catalog,
HBaseTableCatalog.newTable -> "5"
))
.format("org.apache.spark.sql.execution.datasources.hbase")
.save()
Sometimes it works fine as expected and creates table and saves all the data into hbase. But sometimes just fail with following error:
Lost task 35.0 in stage 9.0 (TID 301, host): java.lang.NoSuchMethodError: org.apache.hadoop.hbase.client.Put.addColumn([B[B[B)Lorg/apache/hadoop/hbase/client/Put;
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:211)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1$1.apply(HBaseRelation.scala:210)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation.org$apache$spark$sql$execution$datasources$hbase$HBaseRelation$$convertToPut$1(HBaseRelation.scala:210)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$insert$1.apply(HBaseRelation.scala:219)
at org.apache.spark.sql.execution.datasources.hbase.HBaseRelation$$anonfun$insert$1.apply(HBaseRelation.scala:219)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1112)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1111)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1277)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1119)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1091)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:247)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Any ideas?
That was actually a class path issue - I've got two different versions of hbase client.

Why is spark throwing an ArrayIndexOutOfBoundsException expection for empty attributes?

Context
I am using Spark 1.5.
I have a file records.txt which is ctrl A delimited and in that file 31st index is for subscriber_id. For some records the subscriber_id is empty. Record with subscriber_id is NOT empty.
Here subscriber_id(UK8jikahasjp23) is located at one before the last attribute:
99^A2013-12-11^A23421421412^qweqweqw2222^A34232432432^A365633049^A1^A6yudgfdhaf9923^AAC^APrimary DTV^AKKKR DATA+ PVR3^AGrundig^AKKKR PVR3^AKKKR DATA+ PVR3^A127b146^APVR3^AYes^ANo^ANo^ANo^AYes^AYes^ANo^A2017-08-07 21:27:30.000000^AYes^ANo^ANo^A6yudgfdhaf9923^A7290921396551747605^A2013-12-11 16:00:03.000000^A7022497306379992936^AUK8jikahasjp23^A
Record with subscriber_id is empty:
23^A2013-12-11^A23421421412^qweqweqw2222^A34232432432^A365633049^A1^A6yudgfdhaf9923^AAC^APrimary DTV^AKKKR DATA+ PVR3^AGrundig^AKKKR PVR3^AKKKR DATA+ PVR3^A127b146^APVR3^AYes^ANo^ANo^ANo^AYes^AYes^ANo^A2017-08-07 21:27:30.000000^AYes^ANo^ANo^A6yudgfdhaf9923^A7290921396551747605^A2013-12-11 16:00:03.000000^A7022497306379992936^A^A
Problem
I am getting java.lang.ArrayIndexOutOfBoundsException for the records with empty subscriber_id.
Why is spark throwing java.lang.ArrayIndexOutOfBoundsException for the empty values for the field subscriber_id?
16/08/20 10:22:18 WARN scheduler.TaskSetManager: Lost task 31.0 in stage 8.0 : java.lang.ArrayIndexOutOfBoundsException: 31
case class CustomerCard(accountNumber:String, subscriber_id:String,subscriptionStatus:String )
object CustomerCardProcess {
val log = LoggerFactory.getLogger(this.getClass.getName)
def doPerform(sc: SparkContext, sqlContext: HiveContext, custCardRDD: RDD[String]): DataFrame = {
import sqlContext.implicits._
log.info("doCustomerCardProcess method started")
val splitRDD = custCardRDD.map(elem => elem.split("\\u0001"))
val schemaRDD = splitRDD.map(arr => new CustomerCard( arr(3).trim, arr(31).trim,arr(8).trim))
schemaRDD.toDF().registerTempTable("customer_card")
val custCardDF = sqlContext.sql(
"""
|SELECT
|accountNumber,
|subscriber_id
|FROM
|customer_card
|WHERE
|subscriptionStatus IN('AB', 'AC', 'PC')
|AND accountNumber IS NOT NULL AND LENGTH(accountNumber) > 0
""".stripMargin)
log.info("doCustomerCardProcess method ended")
custCardDF
}
}
Error
13/09/12 23:22:18 WARN scheduler.TaskSetManager: Lost task 31.0 in
stage 8.0 (TID 595, : java.lang.ArrayIndexOutOfBoundsException: 31 at
com.org.CustomerCardProcess$$anonfun$2.apply(CustomerCardProcess.scala:23)
at
com.org.CustomerCardProcess$$anonfun$2.apply(CustomerCardProcess.scala:23)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
scala.collection.Iterator$$anon$11.next(Iterator.scala:328) at
scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.insertAll(BypassMergeSortShuffleWriter.java:118)
at
org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88) at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Could anyone help me to fix this issue ?
The split function is neglecting all the empty fields at the end of splitted line. So,
Change your following line
val splitRDD = custCardRDD.map(elem => elem.split("\\u0001"))
to
val splitRDD = custCardRDD.map(elem => elem.split("\\u0001", -1))
-1 tells to consider all the empty fields.

Resources