I am writing a program to analyze SQL queries, and I am using Spark's logical plan for that.
Below is the code I am using:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{SQLContext, SparkSession}
import org.slf4j.LoggerFactory

// EdlConstants is a project-specific constants object (not shown here).
object QueryAnalyzer {
  val LOG = LoggerFactory.getLogger(this.getClass)
  // Spark conf
  val conf = new SparkConf().setMaster("local[2]").setAppName("LocalEdlExecutor")
  // Spark context
  val sc = new SparkContext(conf)
  // SQL context
  val sqlContext = new SQLContext(sc)
  // Spark session (reuses the context created above)
  val sparkSession = SparkSession
    .builder()
    .appName("Spark User Data")
    .config("spark.app.name", "LocalEdl")
    .getOrCreate()

  def main(args: Array[String]): Unit = {
    var inputDfColumns = Map[String, List[String]]()
    // Shared CSV reader configuration
    val dfSession = sparkSession.
      read.
      format("csv").
      option("header", EdlConstants.TRUE).
      option("inferschema", EdlConstants.TRUE).
      option("delimiter", ",").
      option("decoding", EdlConstants.UTF8).
      option("multiline", true)
    val oDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\order.csv")
    println("sample data in oDF====>")
    oDF.show()
    val cusDF = dfSession.
      load("C:\\Users\\tarun.khaneja\\data\\customer.csv")
    println("sample data in cusDF====>")
    cusDF.show()
    oDF.createOrReplaceTempView("orderTempView")
    cusDF.createOrReplaceTempView("customerTempView")
    // Get input columns from all dataframes
    inputDfColumns += ("orderTempView" -> oDF.columns.toList)
    inputDfColumns += ("customerTempView" -> cusDF.columns.toList)
    val res = sqlContext.sql("""select OID, max(MID+CID) as MID_new,ROW_NUMBER() OVER (
      ORDER BY CID) as rn from
      (select OID_1 as OID, CID_1 as CID, OID_1+CID_1 as MID from
      (select min(ot.OrderID) as OID_1, ct.CustomerID as CID_1
      from orderTempView as ot inner join customerTempView as ct
      on ot.CustomerID = ct.CustomerID group by CID_1)) group by OID,CID""")
    // show() already prints and returns Unit, so it is not wrapped in println
    res.show(false)
    val analyzedPlan = res.queryExecution.analyzed
    println(analyzedPlan.prettyJson)
  }
}
The problem is this: with Spark 2.2.1 I get the JSON below, in which each SubqueryAlias node carries the alias name of the table used in the query. That alias is the information I need.
...
...
...
[ {
"class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children" : 0,
"name" : "OrderDate",
"dataType" : "string",
"nullable" : true,
"metadata" : { },
"exprId" : {
"product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
"id" : 2,
"jvmId" : "acefe6e6-e469-4c9a-8a36-5694f054dc0a"
},
"isGenerated" : false
} ] ]
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"alias" : "ct",
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"alias" : "customertempview",
"child" : 0
}, {
"class" : "org.apache.spark.sql.execution.datasources.LogicalRelation",
"num-children" : 0,
"relation" : null,
"output" :
...
...
...
With Spark 2.4, however, the SubqueryAlias name comes back as null, as shown in the JSON below.
...
...
{
"class": "org.apache.spark.sql.catalyst.expressions.AttributeReference",
"num-children": 0,
"name": "CustomerID",
"dataType": "integer",
"nullable": true,
"metadata": {},
"exprId": {
"product-class": "org.apache.spark.sql.catalyst.expressions.ExprId",
"id": 19,
"jvmId": "3b0dde0c-0b8f-4c63-a3ed-4dba526f8331"
},
"qualifier": "[ct]"
}]
}, {
"class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children": 1,
"name": null,
"child": 0
}, {
"class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children": 1,
"name": null,
"child": 0
}, {
"class": "org.apache.spark.sql.execution.datasources.LogicalRelation",
"num-children": 0,
"relation": null,
"output":
...
...
So I am not sure whether this is a bug in Spark 2.4 that causes the name in SubqueryAlias to be null.
If it is not a bug, how can I get the relation between the alias name and the real table name?
Any ideas?
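The node list that prettyJson prints can be scanned for alias chains. The sketch below is a minimal Python illustration. It assumes the layout visible in the excerpts above: a flat list of nodes in which nested SubqueryAlias nodes follow each other directly. The helper name `alias_chains` and the sample `plan_json` are hypothetical. The alias is read from `alias` (as in the 2.2.1 output) and falls back to `name` (as in the 2.4 output). For 2.4-style output it therefore yields None, which suggests the names would have to come from the plan objects rather than from the JSON.

```python
import json

SUBQUERY_ALIAS = "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias"


def alias_chains(nodes):
    """Group runs of consecutive SubqueryAlias nodes into lists of aliases."""
    chains, current = [], []
    for node in nodes:
        if node.get("class") == SUBQUERY_ALIAS:
            # The 2.2.1 output uses "alias"; the 2.4 output uses "name".
            current.append(node.get("alias", node.get("name")))
        elif current:
            # A non-alias node ends the current run.
            chains.append(current)
            current = []
    if current:
        chains.append(current)
    return chains


# Hypothetical sample shaped like the 2.2.1 excerpt above.
plan_json = """[
  {"class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
   "num-children": 1, "alias": "ct", "child": 0},
  {"class": "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
   "num-children": 1, "alias": "customertempview", "child": 0},
  {"class": "org.apache.spark.sql.execution.datasources.LogicalRelation",
   "num-children": 0, "relation": null}
]"""

nodes = json.loads(plan_json)
print(alias_chains(nodes))  # [['ct', 'customertempview']]
```

Each inner list reads from the query alias outward to the registered view name, so the first and last entries give the alias-to-table pair.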
I created a Hive table on Avro data using an avsc spec file and need to rename some of the columns. I used aliases for this, but it does not seem to work: the columns are returned as null when I query the table.
SPARK DATAFRAME TO SAVE DATA
val data=Seq(("john","adams"),("john","smith"))
val columns = Seq("fname","lname")
import spark.sqlContext.implicits._
val df=data.toDF(columns:_*)
df.write.format("avro").save("/test")
AVSC Spec file
{
"type" : "record",
"name" : "test",
"doc" : " import of test",
"fields" : [ {
"name" : "first_name",
"type" : [ "null", "string" ],
"default" : null,
"aliases" : [ "fname" ],
"columnName" : "fname",
"sqlType" : "12"
}, {
"name" : "last_name",
"type" : [ "null", "string" ],
"default" : null,
"aliases" : [ "lname" ],
"columnName" : "lname",
"sqlType" : "12"
} ],
"tableName" : "test"
}
EXTERNAL HIVE TABLE
create external table test
STORED AS AVRO
LOCATION '/test'
TBLPROPERTIES ('avro.schema.url'='/test.avsc');
HIVE QUERY
SELECT last_name from test;
This returns null even though the Avro data contains values under the original name, i.e. lname.
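In the Avro specification, aliases belong to the reader's schema: a reader field matches a writer field either by its own name or through one of its aliases. The sketch below shows that matching rule in plain Python, without any Avro library. `resolve_field` and the sample record are hypothetical. Whether Hive's AvroSerDe applies aliases this way when reading is not something the question establishes.

```python
def resolve_field(reader_field, writer_record):
    """Return the writer value for a reader field, matching by name, then by alias."""
    candidates = [reader_field["name"]] + reader_field.get("aliases", [])
    for name in candidates:
        if name in writer_record:
            return writer_record[name]
    # No writer field matched: fall back to the reader's declared default.
    return reader_field.get("default")


# Reader fields reduced from the avsc above.
reader_fields = [
    {"name": "first_name", "type": ["null", "string"], "default": None, "aliases": ["fname"]},
    {"name": "last_name", "type": ["null", "string"], "default": None, "aliases": ["lname"]},
]
# One record as written by the DataFrame, using the original column names.
written = {"fname": "john", "lname": "adams"}

with_aliases = {f["name"]: resolve_field(f, written) for f in reader_fields}
print(with_aliases)  # {'first_name': 'john', 'last_name': 'adams'}

# If the aliases are ignored, every renamed column falls back to its null default.
stripped = [{k: v for k, v in f.items() if k != "aliases"} for f in reader_fields]
without_aliases = {f["name"]: resolve_field(f, written) for f in stripped}
print(without_aliases)  # {'first_name': None, 'last_name': None}
```

The second result has the same shape as the symptom described above, which would be consistent with the aliases not being applied on read.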
When working with Apache Spark we can easily generate a JSON file that describes the DataFrame structure. The structure looks like this:
{
"type": "struct",
"fields": [
{
"name": "employee_name",
"type": "string",
"nullable": true,
"metadata": {
"comment": "employee name",
"system_name": "hr system",
"business_key": true,
"private_info": true
}
},
{
"name": "employee_job",
"type": "string",
"nullable": true,
"metadata": {
"comment": "employee job description",
"system_name": "sap",
"business_key": false,
"private_info": false
}
}
]
}
When storing this information in Hive, or reading the DataFrame from Hive, Spark maps the column comments in the Hive metadata to the "comment" attribute within metadata. What about the other direction, mapping a DataFrame definition held in JSON onto a Hive table? Is it possible to store additional tags on the columns, such as the business_key or private_info flags?
thanks
Yes, it is possible to store additional metadata. Create a Spark-compatible Hive table and add the required metadata in TBLPROPERTIES,
as below.
Hive Table
CREATE TABLE `employee_details`(
`employee_name` string COMMENT 'employee name',
`employee_job` string COMMENT 'employee job description')
STORED AS ORC
TBLPROPERTIES (
'spark.sql.sources.provider'='orc',
'spark.sql.sources.schema.numParts'='1',
'spark.sql.sources.schema.part.0'='{\"type\":\"struct\",\"fields\":[{\"name\":\"employee_name\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"employee name\",\"business_key\":true,\"system_name\":\"hr system\",\"private_info\":true}},{\"name\":\"employee_job\",\"type\":\"string\",\"nullable\":true,\"metadata\":{\"comment\":\"employee job description\",\"business_key\":false,\"system_name\":\"sap\",\"private_info\":false}}]}'
)
Accessing the table from Spark
scala> val df = spark.table("hivedb.employee_details")
df: org.apache.spark.sql.DataFrame = [employee_name: string, employee_job: string]
scala> df.schema.prettyJson
res12: String =
{
"type" : "struct",
"fields" : [ {
"name" : "employee_name",
"type" : "string",
"nullable" : true,
"metadata" : {
"comment" : "employee name",
"business_key" : true,
"system_name" : "hr system",
"private_info" : true
}
}, {
"name" : "employee_job",
"type" : "string",
"nullable" : true,
"metadata" : {
"comment" : "employee job description",
"business_key" : false,
"system_name" : "sap",
"private_info" : false
}
} ]
}
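The schema properties in the DDL above can be generated instead of written by hand. The Python sketch below is a minimal version of that, assuming only the property names shown in the answer (`spark.sql.sources.schema.numParts` and `spark.sql.sources.schema.part.N`). The helpers `schema_properties` and `restore_schema`, and the `max_len` chunk size, are illustrative assumptions rather than values taken from Spark.

```python
import json


def schema_properties(schema, max_len=4000):
    """Serialize a schema dict compactly and split it into numbered parts."""
    text = json.dumps(schema, separators=(",", ":"))
    parts = [text[i:i + max_len] for i in range(0, len(text), max_len)]
    props = {"spark.sql.sources.schema.numParts": str(len(parts))}
    for i, part in enumerate(parts):
        props["spark.sql.sources.schema.part.%d" % i] = part
    return props


def restore_schema(props):
    """Join the numbered parts in order and parse them back into a dict."""
    count = int(props["spark.sql.sources.schema.numParts"])
    text = "".join(props["spark.sql.sources.schema.part.%d" % i] for i in range(count))
    return json.loads(text)


schema = {
    "type": "struct",
    "fields": [
        {"name": "employee_name", "type": "string", "nullable": True,
         "metadata": {"comment": "employee name", "business_key": True,
                      "system_name": "hr system", "private_info": True}},
        {"name": "employee_job", "type": "string", "nullable": True,
         "metadata": {"comment": "employee job description", "business_key": False,
                      "system_name": "sap", "private_info": False}},
    ],
}

props = schema_properties(schema)
print(props["spark.sql.sources.schema.numParts"])
print(restore_schema(props) == schema)
```

The values still need their double quotes escaped before they are pasted into a Hive string literal, as in the DDL above.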
I am trying to convert CSV to Avro using Spark's API, as follows:
1) read the CSV files using newAPIHadoopFile
2) save the data as Avro using saveAsNewAPIHadoopFile.
When saving the file as Avro, I get the error below:
org.apache.avro.file.DataFileWriter$AppendWriteException: java.lang.NullPointerException: in org.srdevin.avro.topLevelRecord null of org.srdevin.avro.topLevelRecord
at org.apache.avro.file.DataFileWriter.append(DataFileWriter.java:308)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:77)
at org.apache.avro.mapreduce.AvroKeyRecordWriter.write(AvroKeyRecordWriter.java:39)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply$mcV$sp(PairRDDFunctions.scala:1125)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12$$anonfun$apply$4.apply(PairRDDFunctions.scala:1123)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1341)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1131)
at org.apache.spark.rdd.PairRDDFunctions$$anonfun$saveAsNewAPIHadoopDataset$1$$anonfun$12.apply(PairRDDFunctions.scala:1102)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Below is the code snippet:
val in = "test.csv"
val out = "csvToAvroOutput"
val schema = new Schema.Parser().parse(new File("/path/to/test.avsc"))
val hadoopRDD = sc.newAPIHadoopFile(in, classOf[KeyValueTextInputFormat]
, classOf[Text], classOf[Text])
val job = Job.getInstance
AvroJob.setOutputKeySchema(job, schema)
hadoopRDD.map(row => (new AvroKey(row), NullWritable.get()))
.saveAsNewAPIHadoopFile(
out,
classOf[AvroKey[GenericRecord]],
classOf[NullWritable],
classOf[AvroKeyOutputFormat[GenericRecord]],
job.getConfiguration)
Schema: test.avsc
{
"type" : "record",
"name" : "topLevelRecord",
"namespace" : "org.srdevin.avro",
"aliases": ["MyRecord"],
"fields" : [ {
"name" : "name",
"type" : [ "string", "null"] ,
"default": "null",
"aliases": ["name"]
}, {
"name" : "age",
"type" : [ "string" , "null" ],
"default": "null",
"aliases": ["age"]
}]
}
Spark version used: 2.1.0, Avro version: 1.7.6
Thanks
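One thing that stands out in the snippet is that AvroKey wraps the raw (Text, Text) pair produced by KeyValueTextInputFormat, while the output format is declared for GenericRecord. Each row would probably first have to become a record whose fields match test.avsc. The Python sketch below shows the shape of that conversion without using the Avro API. `to_record`, the comma delimiter and the sample line are assumptions.

```python
import csv


def to_record(line, schema, delimiter=","):
    """Map one delimited text line onto the field names of a record schema."""
    names = [field["name"] for field in schema["fields"]]
    values = next(csv.reader([line], delimiter=delimiter))
    if len(values) != len(names):
        raise ValueError("expected %d columns, got %d" % (len(names), len(values)))
    # Every field in test.avsc is a string (or null), so values stay as text.
    return dict(zip(names, values))


# Field list reduced from test.avsc above.
schema = {
    "type": "record",
    "name": "topLevelRecord",
    "fields": [
        {"name": "name", "type": ["string", "null"]},
        {"name": "age", "type": ["string", "null"]},
    ],
}

record = to_record("alice,30", schema)  # hypothetical CSV line
print(record)  # {'name': 'alice', 'age': '30'}
```

In the Scala job, the same step would populate a GenericRecord built from the parsed schema before it is wrapped in AvroKey.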
I'm trying to create an external table in Avro format using the HiveContext of pyspark.
The external-table creation query runs fine in Hive. However, the same query fails in the Hive context with the error: org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
My avro schema is as follows.
{
"type" : "record",
"name" : "test_table",
"namespace" : "com.ent.dl.enh.test_table",
"fields" : [ {
"name" : "column1",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column2",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column3",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column4",
"type" : [ "null", "string" ] , "default": null
} ]
}
My create table script is:
CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')
I'm running the code below using spark-submit:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
print("Start of program")
sc = SparkContext()
hive_context = HiveContext(sc)
hive_context.sql("CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')")
print("end")
Spark Version: 2.2.0
OpenJDK version: 1.8.0
Hive Version: 2.3.0
I'm using a custom sink in Structured Streaming (Spark 2.2.0) and noticed that Spark reports incorrect metrics for the number of input rows: it is always zero.
My stream construction:
StreamingQuery writeStream = session
.readStream()
.schema(RecordSchema.fromClass(TestRecord.class))
.option(OPTION_KEY_DELIMITER, OPTION_VALUE_DELIMITER_TAB)
.option(OPTION_KEY_QUOTE, OPTION_VALUE_QUOTATION_OFF)
.csv(s3Path.toString())
.as(Encoders.bean(TestRecord.class))
.flatMap(
((FlatMapFunction<TestRecord, TestOutputRecord>) (u) -> {
List<TestOutputRecord> list = new ArrayList<>();
try {
TestOutputRecord result = transformer.convert(u);
list.add(result);
} catch (Throwable t) {
System.err.println("Failed to convert a record");
t.printStackTrace();
}
return list.iterator();
}),
Encoders.bean(TestOutputRecord.class))
.map(new DataReinforcementMapFunction<>(), Encoders.bean(TestOutputRecord.class))
.writeStream()
.trigger(Trigger.ProcessingTime(WRITE_FREQUENCY, TimeUnit.SECONDS))
.format(MY_WRITER_FORMAT)
.outputMode(OutputMode.Append())
.queryName("custom-sink-stream")
.start();
writeStream.processAllAvailable();
writeStream.stop();
Logs:
Streaming query made progress: {
"id" : "a8a7fbc2-0f06-4197-a99a-114abae24964",
"runId" : "bebc8a0c-d3b2-4fd6-8710-78223a88edc7",
"name" : "custom-sink-stream",
"timestamp" : "2018-01-25T18:39:52.949Z",
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0,
"durationMs" : {
"getOffset" : 781,
"triggerExecution" : 781
},
"stateOperators" : [ ],
"sources" : [ {
"description" : "FileStreamSource[s3n://test-bucket/test]",
"startOffset" : {
"logOffset" : 0
},
"endOffset" : {
"logOffset" : 0
},
"numInputRows" : 0,
"inputRowsPerSecond" : 0.0,
"processedRowsPerSecond" : 0.0
} ],
"sink" : {
"description" : "com.mycompany.spark.MySink#f82a99"
}
}
Do I have to populate any metrics in my custom sink to be able to track progress? Or could it be a problem in FileStreamSource when it reads from an S3 bucket?
The problem was that my custom sink used dataset.rdd, which creates a new plan that StreamExecution does not know about, so it cannot collect metrics from it.
Replacing data.rdd.mapPartitions with data.queryExecution.toRdd.mapPartitions fixes the issue.