Query fails in HiveContext of pyspark while writing into avro format - apache-spark

I'm trying to load an external table in Avro format using the HiveContext of PySpark.
The external-table creation query runs fine in Hive. However, the same query fails in HiveContext with the error: org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
My Avro schema is as follows:
{
  "type" : "record",
  "name" : "test_table",
  "namespace" : "com.ent.dl.enh.test_table",
  "fields" : [ {
    "name" : "column1",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "column2",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "column3",
    "type" : [ "null", "string" ],
    "default" : null
  }, {
    "name" : "column4",
    "type" : [ "null", "string" ],
    "default" : null
  } ]
}
My create table script is:
CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')
I'm running the below code using spark-submit:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
print "Start of program"
sc = SparkContext()
hive_context = HiveContext(sc)
hive_context.sql("CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')")
print "end"
Spark Version: 2.2.0
OpenJDK version: 1.8.0
Hive Version: 2.3.0

Related

hive external table on avro alias column not working

I tried creating a Hive table on Avro using an avsc spec file, and I need to rename some of the columns. I used an alias but it seems it's not working; the columns are returned as null when I query the table.
SPARK DATAFRAME TO SAVE DATA
val data=Seq(("john","adams"),("john","smith"))
val columns = Seq("fname","lname")
import spark.sqlContext.implicits._
val df=data.toDF(columns:_*)
df.write.format("avro").save("/test")
AVSC Spec file
{
  "type" : "record",
  "name" : "test",
  "doc" : " import of test",
  "fields" : [ {
    "name" : "first_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "fname" ],
    "columnName" : "fname",
    "sqlType" : "12"
  }, {
    "name" : "last_name",
    "type" : [ "null", "string" ],
    "default" : null,
    "aliases" : [ "lname" ],
    "columnName" : "lname",
    "sqlType" : "12"
  } ],
  "tableName" : "test"
}
EXTERNAL HIVE TABLE
create external table test
STORED AS AVRO
LOCATION '/test'
TBLPROPERTIES ('avro.schema.url'='/test.avsc');
HIVE QUERY
SELECT last_name from test;
It returns null even though there is data in the Avro file under the original name, i.e. lname.
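One workaround sometimes suggested for this kind of mismatch (a sketch, not a verified fix for this exact setup) is to rename the DataFrame columns to the avsc field names before writing, so the data files already carry first_name/last_name and no alias resolution is needed on the Hive side:
// Sketch only: write the data with column names that already match the avsc,
// instead of relying on Avro aliases at read time.
import spark.implicits._

val data = Seq(("john", "adams"), ("john", "smith"))
val df = data.toDF("fname", "lname")
  .withColumnRenamed("fname", "first_name") // matches the avsc field name
  .withColumnRenamed("lname", "last_name")  // matches the avsc field name

df.write.format("avro").save("/test")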

Not able to pass schema of the json record dynamically to spark structured streaming records

I have a Spark-Kafka structured streaming pipeline, listening to a topic which may have JSON records of varying schema.
Now I want to resolve the schema based on the key (x_y), and then apply it to the value portion to parse the JSON record.
So here the key's 'y' part tells which schema type to use.
I tried to get the schema string from a UDF and then pass it to the from_json() function.
But it fails with the exception:
org.apache.spark.sql.AnalysisException: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of `schema`
Code used:
df.withColumn("data_type", element_at(split(col("key").cast("string"),"_"),1))
.withColumn("schema", schemaUdf($"data_type"))
.select(from_json(col("value").cast("string"), col("schema")).as("data"))
Schema demo:
{
  "type" : "struct",
  "fields" : [ {
    "name" : "name",
    "type" : {
      "type" : "struct",
      "fields" : [ {
        "name" : "firstname",
        "type" : "string",
        "nullable" : true,
        "metadata" : { }
      } ]
    },
    "nullable" : true,
    "metadata" : { }
  } ]
}
UDF used:
lazy val fetchSchema = (fileName : String) => {
DataType.fromJson(mapper.readTree(new File(fileName)).toString)
}
val schemaUdf = udf[DataType, String](fetchSchema)
Note: I am not using any Confluent features.
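The exception means that from_json only accepts a schema supplied as a literal (a DDL string or the output of schema_of_json), not one computed per row by a UDF. Below is a minimal sketch of one common workaround, assuming the set of schema types is known up front, that df is the streaming DataFrame above, and that schemaFor(dataType) is a hypothetical driver-side helper that loads the matching schema file: split the stream by data_type and parse each subset with its own concrete schema.
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.StructType

// Assumed: the possible 'y' values of the key are known in advance.
val knownTypes = Seq("y1", "y2")

val withType = df
  .withColumn("data_type", element_at(split(col("key").cast("string"), "_"), 1))

val parsed: DataFrame = knownTypes.map { t =>
  // Resolved once on the driver, so from_json receives a concrete StructType.
  val schema: StructType = schemaFor(t) // hypothetical helper, e.g. based on DataType.fromJson
  withType
    .filter(col("data_type") === t)
    .select(from_json(col("value").cast("string"), schema).as("data"))
}.reduce(_ union _)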

spark json schema metadata can be mapped to hive?

When working with Apache Spark we can easily generate a JSON file to describe the DataFrame structure. This structure looks as below:
{
  "type": "struct",
  "fields": [
    {
      "name": "employee_name",
      "type": "string",
      "nullable": true,
      "metadata": {
        "comment": "employee name",
        "system_name": "hr system",
        "business_key": true,
        "private_info": true
      }
    },
    {
      "name": "employee_job",
      "type": "string",
      "nullable": true,
      "metadata": {
        "comment": "employee job description",
        "system_name": "sap",
        "business_key": false,
        "private_info": false
      }
    }
  ]
}
When storing this information in Hive, or getting the DataFrame from Hive, Spark will map the comments from the columns' Hive metadata to the "comment" attribute within metadata. But what about mapping the DataFrame definition in a JSON into a Hive table: is it possible to store additional tags on the columns, such as the business_key or private_info flags?
Thanks
Yes, it's possible to store additional metadata. Create a Spark-compatible Hive table and add the required metadata in TBLPROPERTIES, like below.
Hive Table
CREATE TABLE `employee_details`(
`employee_name` string COMMENT 'employee name',
`employee_job` string COMMENT 'employee job description')
STORED AS ORC
TBLPROPERTIES (
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"employee_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"employee name\",\"business_key\":true,\"system_name\":\"hr system\",\"private_info\":true}},{\"name\":\"employee_job\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"employee job description\",\"business_key\":false,\"system_name\":\"sap\",\"private_info\":false}}]}'
)
Accessing table from spark
scala> val df = spark.table("hivedb.employee_details")
df: org.apache.spark.sql.DataFrame = [employee_name: string, employee_job: string]
scala> df.schema.prettyJson
res12: String =
{
  "type" : "struct",
  "fields" : [ {
    "name" : "employee_name",
    "type" : "string",
    "nullable" : true,
    "metadata" : {
      "comment" : "employee name",
      "business_key" : true,
      "system_name" : "hr system",
      "private_info" : true
    }
  }, {
    "name" : "employee_job",
    "type" : "string",
    "nullable" : true,
    "metadata" : {
      "comment" : "employee job description",
      "business_key" : false,
      "system_name" : "sap",
      "private_info" : false
    }
  } ]
}
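For completeness, here is a rough sketch of doing the same thing from the Spark side, assuming df already holds the employee_name and employee_job columns: attach the custom metadata to the columns with a MetadataBuilder and let saveAsTable persist the schema (including the extra keys) into the table properties.
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

// Build the custom column metadata programmatically.
val nameMeta = new MetadataBuilder()
  .putString("comment", "employee name")
  .putString("system_name", "hr system")
  .putBoolean("business_key", true)
  .putBoolean("private_info", true)
  .build()

val jobMeta = new MetadataBuilder()
  .putString("comment", "employee job description")
  .putString("system_name", "sap")
  .putBoolean("business_key", false)
  .putBoolean("private_info", false)
  .build()

val tagged = df
  .withColumn("employee_name", col("employee_name").as("employee_name", nameMeta))
  .withColumn("employee_job", col("employee_job").as("employee_job", jobMeta))

// Spark writes the resulting schema (with metadata) into the table properties.
tagged.write.format("orc").saveAsTable("hivedb.employee_details")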

Alias value in avsc does not display value on par with avro file

I have updated the avsc file to rename a column, like:
"fields" : [ {
  "name" : "department_id",
  "type" : [ "null", "int" ],
  "default" : null
}, {
  "name" : "office_name",
  "type" : [ "null", "string" ],
  "default" : null,
  "aliases" : [ "department_name" ],
  "columnName" : "department_name"
}
However, in my Avro file the columns are like department_id : 10, department_name : "maths".
Now when I query like below,
select office_name from t
it always returns null values. Will it not return the value from department_name in the Avro file? Is there a way to have multiple names for a column in an avsc?
From the Cloudera community: "we recommend to use the original name rather than the aliased name of the field in the table, as the Avro aliases are stripped during loading into Spark."
Schema with aliases:
val schema = new Schema.Parser().parse(new File("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/user.avsc"))
schema: org.apache.avro.Schema = {"type":"record","name":"User","namespace":"example.avro","fields":[{"name":"name","type":"string","aliases":["customer_name"],"columnName":"customer_name"},{"name":"favorite_color","type":["string","null"],"aliases":["color"],"columnName":"color"}]}
Spark stripping the aliases:
val usersDF = spark.read.format("avro").option("avroSchema",schema.toString).load("../spark-2.4.3-bin-hadoop2.7/examples/src/main/resources/users.avro")
usersDF: org.apache.spark.sql.DataFrame = [name: string, favorite_color: string]
I guess you can go with Spark's built-in features to rename a column, but if you find any other workaround, let me know as well.
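A small sketch of that built-in rename route, using the column names from the question (the path is a placeholder):
// Read with the names actually present in the Avro data, then rename in Spark,
// since the avsc aliases are stripped on load.
val df = spark.read.format("avro")
  .load("/path/to/avro/data")                           // placeholder path
  .withColumnRenamed("department_name", "office_name")  // rename instead of relying on the avsc alias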

Convert CSV (KeyValueTextInputFormat) to Avro (AvroKeyOutputFormat) using Spark saveAsNewAPIHadoopFile

I am trying to convert CSV to Avro using Spark's API as below:
1) read CSV files using newAPIHadoopFile
2) save CSV to Avro using saveAsNewAPIHadoopFile.
At the time of saving the file to Avro, I get the below error:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in org.srdevin.avro.topLevelRecord null of org.srdevin.avro.topLevelRecord
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Below is the code snippet:
val in = "test.csv"
val out = "csvToAvroOutput"
val schema = new Schema.Parser().parse(new File("/path/to/test.avsc"))
val hadoopRDD = sc.newAPIHadoopFile(in, classOf[KeyValueTextInputFormat]
, classOf[Text], classOf[Text])
val job = Job.getInstance
AvroJob.setOutputKeySchema(job, schema)
hadoopRDD.map(row => (new AvroKey(row), NullWritable.get()))
.saveAsNewAPIHadoopFile(
out,
classOf[AvroKey[GenericRecord]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[GenericRecord]],
job.getConfiguration)
Schema: test.avsc
{
  "type" : "record",
  "name" : "topLevelRecord",
  "namespace" : "org.srdevin.avro",
  "aliases" : [ "MyRecord" ],
  "fields" : [ {
    "name" : "name",
    "type" : [ "string", "null" ],
    "default" : "null",
    "aliases" : [ "name" ]
  }, {
    "name" : "age",
    "type" : [ "string", "null" ],
    "default" : "null",
    "aliases" : [ "age" ]
  } ]
}
Spark version used: 2.1.0, Avro version: 1.7.6
Thanks
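Not a verified fix, but a sketch of the shape AvroKeyOutputFormat generally expects: the AvroKey has to wrap a record that matches the output schema, e.g. a GenericRecord built from each parsed CSV line, rather than the raw Text coming out of the input format. The comma split and field names below are assumptions based on the schema above; schema, hadoopRDD, out and job refer to the snippet above.
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.AvroKeyOutputFormat
import org.apache.hadoop.io.NullWritable

// Ship the schema as a String; Avro 1.7.x Schema objects are not Java-serializable.
val schemaJson = schema.toString

hadoopRDD.mapPartitions { rows =>
  val avroSchema = new Schema.Parser().parse(schemaJson) // re-parse once per partition
  rows.map { case (line, _) =>
    // With KeyValueTextInputFormat and no tab in the line, the whole CSV line lands in the key.
    val cols = line.toString.split(",", -1)
    val record: GenericRecord = new GenericData.Record(avroSchema)
    record.put("name", cols(0))
    record.put("age", if (cols.length > 1) cols(1) else null)
    (new AvroKey[GenericRecord](record), NullWritable.get())
  }
}.saveAsNewAPIHadoopFile(
  out,
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]],
  job.getConfiguration)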
