pyspark save json handling nulls for struct - python-3.x

Using PySpark with Spark 2.4 and Python 3 here. While writing the dataframe as a JSON file, if a struct column is null I want it written as {}, and if a struct field is null I want it written as "". For example:
>>> df.printSchema()
root
|-- id: string (nullable = true)
|-- child1: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
|-- child2: struct (nullable = true)
| |-- f_name: string (nullable = true)
| |-- l_name: string (nullable = true)
>>> df.show()
+---+------------+------------+
| id| child1| child2|
+---+------------+------------+
|123|[John, Matt]|[Paul, Matt]|
|111|[Jack, null]| null|
|101| null| null|
+---+------------+------------+
df.fillna("").coalesce(1).write.mode("overwrite").format('json').save('/home/test')
Result:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""}}
{"id":"111"}
Output Required:
{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}
{"id":"111","child1":{"f_name":"jack","l_name":""},"child2": {}}
{"id":"111","child1":{},"child2": {}}
I tried some map functions and UDFs but was not able to achieve what I need. I appreciate your help here.

Spark 3.x
If you set the ignoreNullFields option to False, you will get output like this. Not exactly an empty struct as you requested, but the schema is still correct.
df.fillna("").coalesce(1).write.mode("overwrite").format('json').option('ignoreNullFields', False).save('/home/test')
{"child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"},"id":"123"}
{"child1":{"f_name":"Jack","l_name":null},"child2":null,"id":"111"}
{"child1":null,"child2":null,"id":"101"}
Spark 2.x
Since the option above does not exist in Spark 2.x, a "dirty fix" is to mimic the JSON structure and bypass the null check. Again, the result is not exactly what you're asking for, but the schema is correct.
from pyspark.sql import functions as F

(df
 .select(F.struct(
     F.col('id'),
     F.coalesce(F.col('child1'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child1'),
     F.coalesce(F.col('child2'), F.struct(F.lit(None).alias('f_name'), F.lit(None).alias('l_name'))).alias('child2')
 ).alias('json'))
 .coalesce(1).write.mode("overwrite").format('json').save('/home/test'))
{"json":{"id":"123","child1":{"f_name":"John","l_name":"Matt"},"child2":{"f_name":"Paul","l_name":"Matt"}}}
{"json":{"id":"111","child1":{"f_name":"Jack"},"child2":{}}}
{"json":{"id":"101","child1":{},"child2":{}}}

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an array-type object to a CSV, so I used the explode function to pull out the fields I need and get them into columnar form. But when writing the data frame to CSV I get an error from the explode function; from what I understand, it's not possible to use it on two columns in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return a table with the columns started_at, finished_at, is_trial and ws10_max.
Thank you!
Use explode on array and select("struct.*") on struct.
df.select("trial", "id", explode('history').alias('history')),
.select('id', 'history.*', 'trial.*'))
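To finish the pipeline from the question, the flattened frame can then be written with the built-in CSV writer. A short sketch; the output path is the one from the question:
(df2.write
    .mode("overwrite")
    .option("header", "true")
    .csv("data/output/"))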

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON fields
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
The following code produces the following DataFrame (I'm running the code on Databricks):
df = (spark.read
      .format("csv")
      .schema(mySchema)
      .load(dataPath))
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is that the JSON is nested and I'm trying to convert it to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
Infer the schema by default or supply your own; in your case, set multiLine to False in PySpark.
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
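If you want to supply the schema yourself, a minimal sketch might look like this (mySchema mirrors the fields in the sample records above; somepath is the placeholder path from the snippet):
from pyspark.sql.types import StructType, StructField, LongType, StringType

# Explicit schema for the two sample records; the types match what inference
# produces below.
mySchema = StructType([
    StructField("constructorId", LongType(), True),
    StructField("constructorRef", StringType(), True),
    StructField("name", StringType(), True),
    StructField("nationality", StringType(), True),
    StructField("url", StringType(), True),
])

someDF = spark.read.json(somepath, schema=mySchema, multiLine=False)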
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef| name|nationality| url|
+-------------+--------------+----------+-----------+--------------------+
| 1| mclaren| McLaren| British|http://en.wikiped...|
| 2| bmw_sauber|BMW Sauber| German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2
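Since the end goal is a CSV, the flat DataFrame can then be written out directly. A sketch; the output path is a made-up example:
# Write the flat DataFrame as CSV; the path below is an assumption.
(df.write
   .mode("overwrite")
   .option("header", "true")
   .csv("/FileStore/tables/constructors_csv"))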

How to find out the schema from tons of messy structured data?

I have a huge dataset with a messy schema.
The same field can hold different data types; for example, data.tags can be a list of strings or a list of objects.
I tried to load the JSON data from HDFS and print the schema, but I get the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code
import json

data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to figure out the schema something like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you're looking to find how to infer the schema of a JSON file. You should take a look at reading the JSON into a DataFrame straightaway, instead of through a Python mapping function.
Also, I'm referring you to How to infer schema of JSON files?, as I think it answers your question.
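A minimal sketch of that suggestion, reusing data_path from the question: let the JSON reader infer the schema directly instead of going through json.loads and toDF().
# The DataFrame reader infers a merged schema; when field types conflict across
# records it generally falls back to string for that field.
data_df = spark.read.json(data_path)
data_df.printSchema()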

Inserting arrays into parquet using spark sql query

I need to add complex data types to a parquet file using the SQL query option.
I've had partial success using the following code:
self._operationHandleRdd = spark_context_.sql(u"""INSERT OVERWRITE
    TABLE _df_Dns VALUES
    array(struct(struct(35,'ww'),5,struct(47,'BGN')),
          struct(struct(70,'w'),1,struct(82,'w')),
          struct(struct(86,'AA'),1,struct(97,'ClU'))
    )""")
spark_context_.sql("select * from _df_Dns").collect()
[Row(dns_rsp_resource_record_items=[Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'),
dns_rsp_rr_type=1, dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU')),
Row(dns_rsp_rr_name=Row(seqno=86, value=u'AA'), dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(seqno=97, value=u'ClU'))])]
So, this returns an array with 3 items but the last item appears thrice.
Did anyone encounter such issues and found a way around just by using Spark SQL and not Python?
Any help is appreciated.
Using your example:
from pyspark.sql import Row
df = spark.createDataFrame([
Row(dns_rsp_resource_record_items=[Row(
dns_rsp_rr_name=Row(
seqno=35, value=u'ww'),
dns_rsp_rr_type=5,
dns_rsp_rr_value=Row(seqno=47, value=u'BGN')),
Row(
dns_rsp_rr_name=Row(
seqno=70, value=u'w'),
dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(
seqno=82, value=u'w')),
Row(
dns_rsp_rr_name=Row(
seqno=86, value=u'AA'),
dns_rsp_rr_type=1,
dns_rsp_rr_value=Row(
seqno=97,
value=u'ClU'))])])
df.write.saveAsTable("_df_Dns")
Overwriting and inserting new rows work fine with your code (apart from the extra parenthesis):
spark.sql(u"INSERT OVERWRITE \
TABLE _df_Dns VALUES \
array(struct(struct(35,'ww'),5,struct(47,'BGN')), \
struct(struct(70,'w'),1,struct(82,'w')), \
struct(struct(86,'AA'),1,struct(97,'ClU')) \
)")
spark.sql("select * from _df_Dns").show(truncate=False)
+---------------------------------------------------------------+
|dns_rsp_resource_record_items |
+---------------------------------------------------------------+
|[[[35,ww],5,[47,BGN]], [[70,w],1,[82,w]], [[86,AA],1,[97,ClU]]]|
+---------------------------------------------------------------+
The only possible reason I see for the weird outcome you get is that your initial table had a compatible but different schema.
df.printSchema()
root
|-- dns_rsp_resource_record_items: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- dns_rsp_rr_name: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
| | |-- dns_rsp_rr_type: long (nullable = true)
| | |-- dns_rsp_rr_value: struct (nullable = true)
| | | |-- seqno: long (nullable = true)
| | | |-- value: string (nullable = true)
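A quick, hedged way to check that theory (not from the answer itself) is to compare the schema Spark has registered for the table with the DataFrame schema above:
# A mismatch between these two schemas would explain the duplicated last element.
spark.table("_df_Dns").printSchema()
spark.sql("DESCRIBE TABLE _df_Dns").show(truncate=False)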

PySpark : Use of aggregate function for complex query

Suppose you have a dataframe with three numeric columns, like:
>>> df.show()
+-----------+---------------+------------------+--------------+---------------------+
| IP| URL|num_rebuf_sessions|first_rebuffer|num_backward_sessions|
+-----------+---------------+------------------+--------------+---------------------+
|10.45.12.13| ww.tre.com/ada| 1261| 764| 2043|
|10.54.12.34|www.rwr.com/yuy| 1126| 295| 1376|
|10.44.23.09|www.453.kig/827| 2725| 678| 1036|
|10.23.43.14|www.res.poe/skh| 2438| 224| 1455|
|10.32.22.10|www.res.poe/skh| 3655| 157| 1838|
|10.45.12.13|www.453.kig/827| 7578| 63| 1754|
|10.45.12.13| ww.tre.com/ada| 3854| 448| 1224|
|10.34.22.10|www.rwr.com/yuy| 1029| 758| 1275|
|10.54.12.34| ww.tre.com/ada| 7341| 10| 856|
|10.34.22.10| ww.tre.com/ada| 4150| 455| 1372|
+-----------+---------------+------------------+--------------+---------------------+
With schema being
>>> df.printSchema()
root
|-- IP: string (nullable = true)
|-- URL: string (nullable = true)
|-- num_rebuf_sessions: long (nullable = false)
|-- first_rebuffer: long (nullable = false)
|-- num_backward_sessions: long (nullable = false)
Question
I am interested in computing a complex aggregation, say (sum(num_rebuf_sessions) - sum(num_backward_sessions)) * 100 / sum(first_rebuffer).
How do I do it programmatically?
The aggregation can be anything provided as an input (assume an XML or JSON file).
Note: In the interpreter, I can run the complete statement, like
>>> df.groupBy(keyList).agg((((func.sum('num_rebuf_sessions') - func.sum('first_rebuffer')) * 100)/func.sum('num_backward_sessions')).alias('Result')).show()
+-----------+---------------+------------------+
| IP| URL| Result|
+-----------+---------------+------------------+
|10.54.12.34|www.rwr.com/yuy|263.70753561548884|
|10.23.43.14|www.453.kig/827| 278.7099317601011|
|10.34.22.10| ww.tre.com/ada|187.53939800299088|
+-----------+---------------+------------------+
But programmatically, agg would only take a dict or a list of columns, which makes the above functionality tough to achieve.
So is pyspark.sql.context.SQLContext.sql the only option left? Or am I missing something obvious?
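One programmatic route, offered only as a sketch (it is not an answer given in this thread): F.expr accepts the aggregation as a plain SQL expression string, so both the keys and the expression can be loaded from a config file. The hard-coded values below just mirror the interpreter example above.
from pyspark.sql import functions as F

# Hypothetical: key_list and agg_expr could come from a JSON/XML config file.
key_list = ['IP', 'URL']
agg_expr = "(sum(num_rebuf_sessions) - sum(first_rebuffer)) * 100 / sum(num_backward_sessions)"

result = df.groupBy(key_list).agg(F.expr(agg_expr).alias('Result'))
result.show()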
