Inserting arrays into parquet using spark sql query - apache-spark

I need to add complex data types to a parquet file using the SQL query option.
I've had partial success using the following code:
self._operationHandleRdd = spark_context_.sql(u"INSERT OVERWRITE \
    TABLE _df_Dns VALUES \
    array(struct(struct(35,'ww'),5,struct(47,'BGN')), \
          struct(struct(70,'w'),1,struct(82,'w')), \
          struct(struct(86,'AA'),1,struct(97,'ClU')) \
    )")
spark_context_.sql("select * from _df_Dns").collect()
[Row(dns_rsp_resource_record_items=[Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'),
dns_rsp_rr_type=1, dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU'))])]
So this returns an array with three items, but the last item appears three times.
Has anyone run into this and found a workaround using only Spark SQL (not Python)?
Any help is appreciated.

Using your example:
from pyspark.sql import Row

df = spark.createDataFrame([
    Row(dns_rsp_resource_record_items=[
        Row(dns_rsp_rr_name=Row(seqno=35, value=u'ww'),
            dns_rsp_rr_type=5,
            dns_rsp_rr_value=Row(seqno=47, value=u'BGN')),
        Row(dns_rsp_rr_name=Row(seqno=70, value=u'w'),
            dns_rsp_rr_type=1,
            dns_rsp_rr_value=Row(seqno=82, value=u'w')),
        Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'),
            dns_rsp_rr_type=1,
            dns_rsp_rr_value=Row(seqno=97, value=u'ClU'))])])
df.write.saveAsTable("_df_Dns")
Overwriting and inserting new rows work fine with your code (apart from the extra parenthesis):
spark.sql(u"INSERT OVERWRITE \
TABLE _df_Dns VALUES \
array(struct(struct(35,'ww'),5,struct(47,'BGN')), \
struct(struct(70,'w'),1,struct(82,'w')), \
struct(struct(86,'AA'),1,struct(97,'ClU')) \
)")
spark.sql("select * from _df_Dns").show(truncate=False)
+---------------------------------------------------------------+
|dns_rsp_resource_record_items |
+---------------------------------------------------------------+
|[[[35,ww],5,[47,BGN]], [[70,w],1,[82,w]], [[86,AA],1,[97,ClU]]]|
+---------------------------------------------------------------+
The only possible reason I see for the weird outcome you get is that your initial table had a compatible but different schema.
df.printSchema()
root
|-- dns_rsp_resource_record_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dns_rsp_rr_name: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- dns_rsp_rr_type: long (nullable = true)
| | |-- dns_rsp_rr_value: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
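If you are not sure what the existing table looks like, a quick check is to compare its schema against the one above (a sketch, reusing the table name from the question):
spark.table("_df_Dns").printSchema()
# If it differs, one option is to drop the table and recreate it from a
# DataFrame with the intended schema, as in the createDataFrame example above:
# spark.sql("DROP TABLE IF EXISTS _df_Dns")
# df.write.saveAsTable("_df_Dns")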

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an array-type object to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But when writing the data frame to CSV I get an error when using the explode function; from what I understand, it's not possible to do this with two explodes in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("sample")
         .getOrCreate())

df = (spark.read.option("multiline", "true")
      .json("data/origin/crops.json"))

df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))

(df2.write.format('com.databricks.spark.csv')
 .mode('overwrite')
 .option("header", "true")
 .save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this: a flat table with the columns started_at, finished_at, is_trial and ws10_max (one row per history entry).
Thank you!
Use explode on the array and select("struct.*") on the struct:
df2 = (df.select("trial", "id", explode('history').alias('history'))
       .select('id', 'history.*', 'trial.*'))
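Putting it together with the CSV write from the question, a sketch (assuming the same paths and session setup, and using the built-in CSV writer) looks like:
from pyspark.sql.functions import explode
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[1]").appName("sample").getOrCreate()

df = spark.read.option("multiline", "true").json("data/origin/crops.json")

# Only the array column needs explode; the struct columns are flattened with ".*"
df2 = (df.select("trial", "id", explode("history").alias("history"))
       .select("id", "history.*", "trial.*"))

(df2.write
 .mode("overwrite")
 .option("header", "true")
 .csv("data/output/"))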

iterate array in pyspark /nested elements

I have input_data as
[[2022-04-06,test],[2022-04-05,test2]]
schema of the input_data is
|-- source: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- #date: string (nullable = true)
| | |-- user: string (nullable = true)
I am looking for output like:
+-----------+--------+
| date      | user   |
+-----------+--------+
|2022-04-06 |test    |
|2022-04-05 |test2   |
+-----------+--------+
I have created a df from input_data and applied explode on it, and I was thinking of exploding the result of that as well:
df.select(explode(df.source))
Is there a better way to achieve this output in Spark SQL or with the DataFrame API?
Note: I am getting #date and not date in input_data, so applying Spark SQL is also a bit of a challenge.
Use selectExpr with inline:
df.selectExpr("inline(source)").show()
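If the #date column name gets in the way of plain SQL afterwards, a small follow-up sketch is to rename it right after the inline:
# inline() expands the array of structs into one row per element,
# keeping the original field names (#date, user)
df.selectExpr("inline(source)").withColumnRenamed("#date", "date").show()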

pyspark save json handling nulls for struct

Using PySpark with Spark 2.4 and Python 3 here. While writing the dataframe as a JSON file, if a struct column is null I want it written as {}, and if a struct field is null I want it written as "". For example:
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- child1: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
|-- child2: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
>>> df.show()
+---+------------+------------+
| id| child1| child2|
+---+------------+------------+
|123|[John, Matt]|[Paul, Matt]|
|111|[Jack, null]| null|
|101| null| null|
+---+------------+------------+
df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')
Result:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""}}
{"id":"111"}
Output Required:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""},"child2": {}}
{"id":"111","child1":{},"child2": {}}
I tried some map and UDFs but was not able to achieve what I need. I'd appreciate your help here.
Spark 3.x
If you set the ignoreNullFields option to False, you will get output like this. It's not exactly an empty struct as you requested, but the schema is still correct.
df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}
Spark 2.x
Since the option above does not exist in 2.x, a "dirty fix" is to mimic the JSON structure yourself and bypass the null check. Again, the result is not exactly what you're asking for, but the schema is correct.
from pyspark.sql import functions as F

(df
 .select(F.struct(
     F.col('id'),
     F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
     F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
 ).alias('json'))
 .coalesce(1).write.mode("overwrite").format('json').save('/home/test'))
{"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
{"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
{"json":{"id":"101","child1":{},"child2":{}}}

Reading XML File Through Dataframe

I have an XML file in the format below.
<nt:vars>
<nt:var id="1.3.0" type="TimeStamp"> 89:19:00.01</nt:var>
<nt:var id="1.3.1" type="OBJECT ">1.9.5.67.2</nt:var>
<nt:var id="1.3.9" type="STRING">AB-CD-EF</nt:var>
</nt:vars>
I built a dataframe on it using the code below. Though the code displays 3 rows and retrieves the id and type fields, it's not displaying the actual values, which are 89:19:00.01, 1.9.5.67.2 and AB-CD-EF.
spark.read.format("xml").option("rootTag","nt:vars").option("rowTag","nt:var").load("/FileStore/tables/POC_DB.xml").show()
Could you please help me with any other options I need to add to the line above to bring in the values as well?
You can instead specify rowTag as nt:vars:
df = spark.read.format("xml").option("rowTag","nt:vars").load("file.xml")
df.printSchema()
root
|-- nt:var: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- _VALUE: string (nullable = true)
| | |-- _id: string (nullable = true)
| | |-- _type: string (nullable = true)
df.show(truncate=False)
+-------------------------------------------------------------------------------------------+
|nt:var |
+-------------------------------------------------------------------------------------------+
|[[ 89:19:00.01, 1.3.0, TimeStamp], [1.9.5.67.2, 1.3.1, OBJECT ], [AB-CD-EF, 1.3.9, STRING]]|
+-------------------------------------------------------------------------------------------+
And to get the values as separate rows, you can explode the array of structs:
from pyspark.sql import functions as F
df.select(F.explode('nt:var')).show(truncate=False)
+--------------------------------+
|col |
+--------------------------------+
|[ 89:19:00.01, 1.3.0, TimeStamp]|
|[1.9.5.67.2, 1.3.1, OBJECT ] |
|[AB-CD-EF, 1.3.9, STRING] |
+--------------------------------+
Or if you just want the values:
df.select(F.explode('nt:var._VALUE')).show()
+------------+
| col|
+------------+
| 89:19:00.01|
| 1.9.5.67.2|
| AB-CD-EF|
+------------+
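And if you want each attribute as its own column, a small extension of the same approach (a sketch based on the schema above) is to explode and then expand the struct:
from pyspark.sql import functions as F

# One row per <nt:var>, with the element text and its attributes as columns
(df.select(F.explode('nt:var').alias('var'))
   .select(F.col('var._VALUE').alias('value'),
           F.col('var._id').alias('id'),
           F.col('var._type').alias('type'))
   .show(truncate=False))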

Dynamic query preparation and execution in spark

In Spark, this JSON is in a DataFrame (DF). Now we have to navigate to tables (in the JSON, based on cust), read the first block of tables and prepare a SQL query.
Ex: SELECT CUST_NAME FROM CUST WHERE CUST_ID = 112
We have to execute this query against the database and store the result in a JSON file.
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
root
|-- cust: string (nullable = true)
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- op: string (nullable = true)
| | |-- param1: string (nullable = true)
| | |-- table_NAME: string (nullable = true)
| | |-- val: string (nullable = true)
The same for the second block of tables.
Ex: SELECT MONTHLY_SALE FROM SALE WHERE COUNTRY = 'IND'
We have to execute this query against the DB and store this result in the same JSON file as well.
What is the best approach to do this? Any ideas?
This is my way of achieving this. For this whole solution I've used spark-shell. These are some prerequisites:
Download this jar from json-serde
Extract the zip file to any location
Now run spark-shell using this command
spark-shell --jars path/to/jars/json-serde-cdh5-shim-1.3.7.3.jar,path/to/jars/json-serde-1.3.7.3.jar,path/to/jars/json-1.3.7.3.jar
Your Json document:
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
Collapsed version:
{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}
I've put this JSON at /tmp/sample.json.
Now on to the Spark SQL part:
Create a table based on the JSON schema:
sql("CREATE TABLE json_table(cust string,tables array<struct<Name: string,table_NAME:string,param1:string,val:string,op:string>>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
Now load the JSON data into the table:
sql("LOAD DATA LOCAL INPATH '/tmp/sample.json' OVERWRITE INTO TABLE json_table")
Now I'll use the Hive LATERAL VIEW concept:
val ans = sql("SELECT myCol FROM json_table LATERAL VIEW explode(tables) myTable as myCol")
Schema of the returned result:
ans.printSchema
root
|-- myCol: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- table_NAME: string (nullable = true)
| |-- param1: string (nullable = true)
| |-- val: string (nullable = true)
| |-- op: string (nullable = true)
Result of ans.show
ans.show
+--------------------+
| myCol|
+--------------------+
|[customer,cust,cu...|
|[sales,sale,count...|
+--------------------+
Now I'm assuming there can be two types of data, e.g. cust_id is of Number type and country is of String type. I'm adding a method to identify the type of data based on its value, e.g.:
def isAllDigits(x: String) = x forall Character.isDigit
Note: You can use your own way of identifying this.
Now create the queries based on the JSON data:
ans.foreach(f => {
  // Row.toString gives e.g. "[[customer,cust,cust_id,112,cust_name]]",
  // so splitting on "," yields Name, table_NAME, param1, val and op in order
  val splitted_string = f.toString.split(",")
  val op = splitted_string(4).substring(0, splitted_string(4).size - 2)
  val table_NAME = splitted_string(1)
  val param1 = splitted_string(2)
  val value = splitted_string(3)
  if (isAllDigits(value)) {
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "=" + value)
  } else {
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "='" + value + "'")
  }
})
This is the result I've got:
SELECT cust_name FROM cust WHERE cust_id=112
SELECT monthly_sale FROM sale WHERE country='ind'
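For comparison, a DataFrame-only sketch in PySpark (an alternative to the Hive SerDe route above, assuming the same /tmp/sample.json) that builds the same statements:
from pyspark.sql import functions as F

# Read the JSON directly and explode the tables array into one row per block
df = spark.read.option("multiline", "true").json("/tmp/sample.json")
rows = df.select(F.explode("tables").alias("t")).select("t.*").collect()

# Build one SQL statement per block, quoting non-numeric values
for r in rows:
    value = r["val"] if r["val"].isdigit() else "'{}'".format(r["val"])
    print("SELECT {} FROM {} WHERE {}={}".format(r["op"], r["table_NAME"], r["param1"], value))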
