Caused by: org.apache.spark.SparkException: Task failed while writing rows - apache-spark

import org.apache.spark.SparkContext
import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
import org.apache.spark.sql._
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
val rdd = sc.textFile("/user/scvappdf/hdpcf/SIT/df/meta/1576373/SDM_Data/RDM/rdm_all_ctry_cd_ver1.txt")
val header = rdd.first()
val rdd1 = rdd.filter(x => x != header)
val rowRDD = rdd1.map(_.split("\\|")).map(p => Row(p(0).trim,
p(1).trim,p(2).trim,p(3).trim,p(4).trim,p(5).trim,p(6).trim,p(7).trim,p(8).trim,p(9).trim,
p(10).trim,p(11).trim,p(12).trim,p(13).trim,p(14).trim,p(15).trim,p(16).trim,p(17).trim,
p(18).trim,p(19).trim,p(20).trim,p(21).trim,p(22).trim,p(23).trim,p(24).trim,p(25).trim,p(26).trim))
val innerStruct = StructType(StructField("rowid", StringType, true)::StructField("s_startdt",
StringType, false)::StructField("s_starttime", StringType, false)::StructField("s_enddt",
StringType, false)::StructField("s_endtime", StringType, false)::StructField("s_deleted_flag",
StringType, false)::StructField("ctry_cd", StringType, false)::StructField("ctry_cd_desc",
StringType, false)::StructField("intrnl_ctry_rgn_cd", StringType, false)::
StructField("intrnl_ctry_rgn_desc", StringType, false)::StructField("iso_ind", StringType,
false)::StructField("iso_alp2_ctry_cd", StringType, false)::StructField("iso_alp3_ctry_cd",
StringType, false)::StructField("iso_num_cd", StringType, false)::StructField("npc_ind", StringType,
false)::StructField("cal_type_cd", StringType, false)::StructField("hemisphere_cd", StringType,
false)::StructField("hemisphere_desc", StringType, false)::StructField("ref_tbl_id", StringType,
false)::StructField("eff_strt_dt", StringType, false)::StructField("eff_end_dt", StringType,
false)::StructField("ent_st", StringType, false)::StructField("vers", StringType,
false)::StructField("crtd_tmst", StringType, false)::StructField("mod_tmst", StringType,
false)::StructField("src_id", StringType, false)::StructField("ods", StringType, false):: Nil)
val peopleSchemaRDD = hiveContext.createDataFrame(rowRDD, innerStruct)
peopleSchemaRDD.write.format("orc").mode("overwrite")
.save("/sit/regenv/hdata/DATAFABRIC/reg_tb_prd_dfab/stg_rdm_all_ctry_cd_ver1/ods=2019-12-06")
While saving the DataFrame peopleSchemaRDD in ORC format I am getting the error "java.lang.ArrayIndexOutOfBoundsException: 3", followed by Caused by: org.apache.spark.SparkException: Task failed while writing rows.
There is no problem with the path, as the folder is getting created, but no data is written. I am using Spark version 1.6.3.
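One common cause of an ArrayIndexOutOfBoundsException in this pattern, although it cannot be confirmed from the question alone, is that String.split drops trailing empty fields, so a line whose last columns are empty yields fewer than 27 tokens. A minimal defensive sketch of the same mapping, assuming the same pipe-delimited input:
val rowRDD = rdd1
  .map(_.split("\\|", -1))                   // -1 keeps trailing empty fields
  .filter(_.length >= 27)                    // skip malformed lines instead of failing the task
  .map(p => Row(p.take(27).map(_.trim): _*))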

There are some issues with SQLContext; try to use the HiveContext instead. Use the configuration below to resolve that:
spark.sql.orc.impl=native
The above is given from the perspective of the spark-submit command.
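For example, the setting can be passed to spark-submit as a --conf flag (the class and jar names below are placeholders):
spark-submit \
  --conf spark.sql.orc.impl=native \
  --class com.example.WriteOrc \
  your-application.jar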

Related

Unable to read data from kafka topic

I'm a beginner with Kafka. I'm trying to write a Spark application that reads data from a Kafka topic I created. Kafka topic1 is up and running.
Is there any problem with the code provided below?
val kafka_bootstrap_servers = "localhost:9092"
val users_df = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", kafka_bootstrap_servers)
.option("subscribe", kafka_topic_name)
.load()
val users_df_1 = users_df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)")
val user_schema = StructType(
List(
StructField("RecordNumber", IntegerType, true),
StructField("Zipcode", StringType, true),
StructField("ZipCodeType", StringType, true),
StructField("City", StringType, true),
StructField("State", StringType, true),
StructField("LocationType", StringType, true),
StructField("Lat", StringType, true),
StructField("Long", StringType, true),
StructField("Xaxis", StringType, true),
StructField("Yaxis", StringType, true),
StructField("Zaxis", StringType, true),
StructField("WorldRegion", StringType, true),
StructField("Country", StringType, true),
StructField("LocationText", StringType, true),
StructField("Location", StringType, true),
StructField("Decommisioned", StringType, true)
)
)
val users_df_2 = users_df_1.select(from_json(col("RecordNumber"), user_schema)
.as("user_detail"), col("Zipcode"))
val users_df_3 = users_df_2.select(col = "user_detail.*", "Zipcode")
users_df_3.printSchema()
users_df_3.show(numRows = 10, truncate = false)
spark.stop()
println("Apache spark application completed.")
}
}
JSON data sample below:
{"RecordNumber":76511,"Zipcode":27007,"ZipCodeType":"STANDARD","City":"ASH HILL","State":"NC","LocationType":"NOT ACCEPTABLE","Lat":36.4,"Long":-80.56,"Xaxis":0.13,"Yaxis":-0.79,"Zaxis":0.59,"WorldRegion":"NA","Country":"US","LocationText":"Ash Hill, NC","Location":"NA-US-NC-ASH HILL","Decommisioned":false,"TaxReturnsFiled":842,"EstimatedPopulation":1666,"TotalWages":28876493}
Error message below:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Failed to find data source: kafka. Please deploy the application as per the deployment section of "Structured Streaming + Kafka Integration Guide".;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:652)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:167)
at streamingApp$.main(streamingApp.scala:25)
at streamingApp.main(streamingApp.scala)
I need help reading data from the Kafka topic.
Please follow the guide for Structured Streaming + Kafka integration:
https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html
You might be missing the artifact spark-sql-kafka-0-10_2.12.
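For example, the connector can be supplied at submit time or declared as a build dependency; the version below is a placeholder and should match your Spark and Scala versions:
spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1 your-application.jar
// or in build.sbt
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"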

How does the Databricks Delta Lake `mergeSchema` option handle differing data types?

What does the Databricks Delta Lake mergeSchema option do if a pre-existing column is appended with a different data type?
For example, given a Delta Lake table with schema foo INT, bar INT, what would happen when trying to write-append new data with schema foo INT, bar DOUBLE when specifying the option mergeSchema = true?
The write fails (as of Delta Lake 0.5.0 on Databricks 6.3).
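For reference, a minimal sketch of the append described above, where newData is assumed to have schema foo INT, bar DOUBLE and the path is a placeholder for the existing foo INT, bar INT table; because IntegerType and DoubleType cannot be merged, the write fails with a schema-merge error rather than widening the column:
newData.write
  .format("delta")
  .mode("append")
  .option("mergeSchema", "true")
  .save("/mnt/delta/events")   // placeholder path to the existing Delta table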
I think this is what you are looking for.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", StringType, true),
StructField("field3", StringType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true),
StructField("field6", StringType, true),
StructField("field7", StringType, true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("sep", "|")
.schema(customSchema)
.load("mnt/rawdata/corp/ABC*.gz")
.withColumn("file_name", input_file_name())
Just replace 'field1', 'field2', etc., with your actual field names. Also, 'ABC*.gz' does a wildcard search for files beginning with a specific string, like 'abc' or whatever, where the '*' character matches any combination of characters up to the '.gz' extension, which indicates a gzipped file. Yours could be different, of course, so just change that pattern to meet your specific needs.

When reading a CSV is there an option to start on row 2 or below?

I'm reading a bunch of CSV files into a dataframe using the sample code below.
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","true")
.option("header","false")
.load("mnt/rawdata/corp/ABC*.gz")
I'm hoping there is a way to start on row 2 or below, because row 1 contains some basic metadata about these files. The first row has 4 pipe characters, so Spark thinks the file has 4 columns, but the actual data has over 100 columns.
I tried playing with the inferSchema and header options but I couldn't get anything to work.
If the first line in the CSV doesn't match the actual column count and names, you may need to define your schema by hand and then try this combination:
val df = spark.read.format("csv")
.option("sep","|")
.option("inferSchema","false")
.option("header","true")
.schema(mySchema)
.option("enforceSchema","true")
.load(...
Full list of CSV options.
Note that for Spark 2.3 and above, you can use a shorthand, SQL-style notation for schema definition: a simple string "column1 type1, column2 type2, ...".
If, however, your header has more than one line, you will probably be forced to ignore all "errors" by adding the option .option("mode","DROPMALFORMED").
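For example, using the SQL-style shorthand mentioned above (the column names here are made up for illustration):
val df = spark.read.format("csv")
  .option("sep", "|")
  .option("header", "true")
  .schema("field1 STRING, field2 STRING, field3 INT")   // DDL-style schema string, Spark 2.3+
  .option("enforceSchema", "true")
  .load("mnt/rawdata/corp/ABC*.gz")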
You are right! You need to define a custom schema! I ended up going with this.
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.functions.input_file_name
val customSchema = StructType(Array(
StructField("field1", StringType, true),
StructField("field2", StringType, true),
StructField("field3", StringType, true),
StructField("field4", StringType, true),
StructField("field5", StringType, true),
StructField("field6", StringType, true),
StructField("field7", StringType, true)))
val df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.option("sep", "|")
.schema(customSchema)
.load("mnt/rawdata/corp/ABC*.gz")
.withColumn("file_name", input_file_name())

Creating a Hive schema in PySpark

What is the syntax for creating a schema in PySpark?
data.csv
id,name
1,sam
2,smith
val schema = new StructType().add("id", IntType).add("name", StringType)
val ds = spark.read.schema(schema).option("header", "true").csv("data.csv")
ds.show
Define a StructType with StructField(name, dataType, nullable=True). The data types can be imported from pyspark.sql.types:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType,FloatType,BooleanType
schema = StructType([
StructField("col_a", StringType(), True),
StructField("col_b", IntegerType(), True),
StructField("col_c", FloatType(), True),
StructField("col_d", BooleanType(), True)
])

Custom schema in spark-csv throwing error in spark 1.4.1

I am trying to process a CSV file using the spark-csv package in spark-shell on Spark 1.4.1.
scala> import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.HiveContext
scala> import org.apache.spark.sql.hive.orc._
import org.apache.spark.sql.hive.orc._
scala> import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType};
import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
15/12/21 02:06:24 WARN SparkConf: The configuration key 'spark.yarn.applicationMaster.waitTries' has been deprecated as of Spark 1.3 and may be removed in the future. Please use the new key 'spark.yarn.am.waitTime' instead.
15/12/21 02:06:24 INFO HiveContext: Initializing execution hive, version 0.13.1
hiveContext: org.apache.spark.sql.hive.HiveContext = org.apache.spark.sql.hive.HiveContext@74cba4b
scala> val customSchema = StructType(Seq(StructField("year", IntegerType, true),StructField("make", StringType, true),StructField("model", StringType, true),StructField("comment", StringType, true),StructField("blank", StringType, true)))
customSchema: org.apache.spark.sql.types.StructType = StructType(StructField(year,IntegerType,true), StructField(make,StringType,true), StructField(model,StringType,true), StructField(comment,StringType,true), StructField(blank,StringType,true))
scala> val customSchema = (new StructType).add("year", IntegerType, true).add("make", StringType, true).add("model", StringType, true).add("comment", StringType, true).add("blank", StringType, true)
<console>:24: error: not enough arguments for constructor StructType: (fields: Array[org.apache.spark.sql.types.StructField])org.apache.spark.sql.types.StructType. Unspecified value parameter fields.
val customSchema = (new StructType).add("year", IntegerType, true).add("make", StringType, true).add("model", StringType,true).add("comment", StringType, true).add("blank", StringType, true)
According to the Spark 1.4.1 documentation, there isn't a no-arg constructor for StructType, which is why you are getting the error. You need to either upgrade to 1.5.x to get the no-arg constructor, or create the schema as you did in your first example:
val customSchema = StructType(Seq(StructField("year", IntegerType, true),StructField("make", StringType, true),StructField("model", StringType, true),StructField("comment", StringType, true),StructField("blank", StringType, true)))
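For completeness, a minimal sketch of reading a CSV with that schema through the spark-csv package on 1.4.x (the input path is a placeholder):
val df = hiveContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .schema(customSchema)
  .load("/path/to/cars.csv")   // placeholder input path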
