AWS Glue - DynamicFrame with varying schema in json files - apache-spark

Sample:
I have a partitioned table with the DDL below in the Glue catalog:
CREATE EXTERNAL TABLE `test`(
`id` int,
`data` struct<a:string,b:string>)
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
The underlying data in S3 consists of JSON files with varying schemas, meaning that some elements may be missing from some files but present in others.
In this sample, partition_0='01' contains a JSON file with all elements:
{"id": 1,"data": {"a": "value-a", "b": "value-b"}}
The file in partition_0='02' does not contain element data.b:
{"id": 1,"data": {"a": "value-a"}}
Issue:
When I create a DynamicFrame in Glue (I use Python), its schema depends on the data that I query. If I include the data from partition_0='01', then all elements are present in the schema.
id_partition_predicate="partition_0 = '01'"
print("partition with 'b'")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition with 'b'
root
|-- id: int
|-- data: struct
| |-- a: string
| |-- b: string
|-- partition_0: string
print("both partitions")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test").printSchema()
both partitions
root
|-- id: int
|-- data: struct
| |-- a: string
| |-- b: string
|-- partition_0: string
If I query only the data from partition_0='02', then element data.b does not exist in the DynamicFrame schema, even though it exists in the table definition.
print("partition without 'b'")
id_partition_predicate="partition_0 = '02'"
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition without 'b'
root
|-- id: int
|-- data: struct
| |-- a: string
|-- partition_0: string
Question: How do I create a DynamicFrame or DataFrame that always contains all elements from the Glue table's schema?
Thanks in advance!

I came up with this solution:
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType

id_partition_predicate="partition_0 = '02'"
dyf = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate)
dyf.printSchema()
df = dyf.toDF()
try:
    # Succeeds when the struct actually contains the field "b"
    df = df.withColumn("b", col("data").getItem("b"))
except Exception:
    # "b" is missing from this partition's schema, so add it as a null string column
    df = df.withColumn("b", lit(None).cast(StringType()))
df.show()
Output:
root
|-- id: int
|-- data: struct
| |-- a: string
|-- partition_0: string
+---+---------+-----------+----+
| id|     data|partition_0|   b|
+---+---------+-----------+----+
|  1|[value-a]|         02|null|
+---+---------+-----------+----+
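A more general sketch (not part of the original solution; the S3 path is a placeholder and a SparkSession named spark is assumed, e.g. glueContext.spark_session in a Glue job): instead of patching each missing field with try/except, the JSON can be read as a DataFrame with an explicit schema that mirrors the Glue table, so absent elements simply come back as null.
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Schema mirroring the Glue table definition, including data.b
expected_schema = StructType([
    StructField("id", IntegerType()),
    StructField("data", StructType([
        StructField("a", StringType()),
        StructField("b", StringType())
    ]))
])

# Placeholder path to the partition's JSON files
df = spark.read.schema(expected_schema).json("s3://my-bucket/test/partition_0=02/")
df.printSchema()  # data.b is always part of the schema; missing values are null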

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an array-type object to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But when writing the data frame to CSV I'm getting an error when using the explode function; from what I understand it's not possible to do this with two variables in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return a table with the columns started_at, finished_at, is_trial and ws10_max, one output row per history element.
Thank you!
Use explode on the array and select("struct.*") on the struct:
(df.select("trial", "id", explode("history").alias("history"))
   .select("id", "history.*", "trial.*"))
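As a hedged follow-up (df2 is the name from the question; chaining it this way is an assumption), once the array is exploded and both structs are flattened, the CSV write from the question should go through, since no array or struct columns remain:
df2 = (df.select("trial", "id", explode("history").alias("history"))
         .select("id", "history.*", "trial.*"))

(df2.write.format("com.databricks.spark.csv")
    .mode("overwrite")
    .option("header", "true")
    .save("data/output/"))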

How to concatenate nested json in Apache Spark

Can someone let me know where I'm going wrong with my attempt to concatenate a nested JSON field?
I'm using the following code:
df = (df
      .withColumn("ingestion_date", current_timestamp())
      .withColumn("name", concat(col("name.forename"),
                                 lit(" "), col("name.surname"))))
Schema:
root
|-- driverRef: string (nullable = true)
|-- number: integer (nullable = true)
|-- code: string (nullable = true)
|-- forename: string (nullable = true)
|-- surname: string (nullable = true)
|-- dob: date (nullable = true)
As you can see, I'm trying to concatenate forename and surname so as to provide a full name in the name field. After concatenating, the 'name' field should hold one single value, e.g. it would just show Lewis Hamilton, and likewise for the other rows.
My code produces the following error:
Can't extract value from name#6976: need struct type but got string
It would seem that you have a dataframe that contains a name column holding JSON with two values, forename and surname, just like this: {"forename": "Lewis", "surname" : "Hamilton"}.
That column, in Spark, has a string type, which explains the error you get. You could only do name.forename if name were of type struct with a field called forename. That is what Spark means by need struct type but got string.
You just need to tell Spark that this string column is JSON, and how to parse it.
from pyspark.sql.types import StructType, StringType, StructField
from pyspark.sql import functions as f
# initializing data
df = spark.range(1).withColumn('name',
f.lit('{"forename": "Lewis", "surname" : "Hamilton"}'))
df.show(truncate=False)
+---+---------------------------------------------+
|id |name                                         |
+---+---------------------------------------------+
|0  |{"forename": "Lewis", "surname" : "Hamilton"}|
+---+---------------------------------------------+
And parsing that JSON:
json_schema = StructType([
    StructField('forename', StringType()),
    StructField('surname', StringType())
])
df\
.withColumn('s', f.from_json(f.col('name'), json_schema))\
.withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))\
.show()
+---+--------------+-----------------+
| id|          name|                s|
+---+--------------+-----------------+
|  0|Lewis Hamilton|{Lewis, Hamilton}|
+---+--------------+-----------------+
You may then get rid of s with drop; it contains the parsed struct.
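For completeness, a small sketch of that last step, reusing the names from the answer above (df_final is just an illustrative name):
df_final = (df
    .withColumn("s", f.from_json(f.col("name"), json_schema))
    .withColumn("name", f.concat_ws(" ", f.col("s.forename"), f.col("s.surname")))
    .drop("s"))
df_final.show()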

pyspark with hive - can't properly create with partition and save a table from a dataframe

I'm trying to convert JSON files to Parquet with very few transformations (adding a date column), but I then need to partition this data before saving it as Parquet.
I'm hitting a wall in this area.
Here is the creation process of the table:
df_temp = spark.read.json(data_location) \
    .filter(cond3)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
then regarding the save of the conversion:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the create table to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, but I'm running out of ideas.
If I use USING instead of LIKE, I get the error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
PS: Once the schema of the table is defined with the partitions, I want to simply use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with Spark:
df_temp = spark.read.json(...)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
CREATE TABLE IF NOT EXISTS {1}
USING PARQUET
PARTITIONED BY (customer_id, date)
LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto: it uses a weird customer_id out of nowhere and doesn't append the different dates.
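A possible explanation, offered as an assumption rather than a confirmed cause: insertInto matches columns by position, not by name, and for a partitioned table the partition columns sit last in the table's layout. Reordering the DataFrame columns to match the table before the call might make insertInto behave:
# Sketch only: select the columns in the table's own order (partition columns
# customer_id and date come last) so the positional insert lines up.
table_cols = spark.table("duration").columns
df_final.select(*table_cols).write.insertInto("duration")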

Dynamic query preparation and execution in spark

In Spark, this JSON is in a DataFrame (DF). Now we have to navigate to tables (in the JSON, based on cust), read the first block of tables, and prepare a SQL query.
Ex: SELECT CUST_NAME FROM CUST WHERE CUST_ID =112
We have to execute this query in the database and store the result in a JSON file.
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
root
|-- cust: string (nullable = true)
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- op: string (nullable = true)
| | |-- param1: string (nullable = true)
| | |-- table_NAME: string (nullable = true)
| | |-- val: string (nullable = true)
The same applies to the second block of tables.
Ex: SELECT MONTHLY_SALE FROM SALE WHERE COUNTRY = 'IND'
We have to execute this query in the DB and store this result as well in the above JSON file.
What is the best approach to do this? Any ideas?
This is my way of achieving this. For this whole solution I've used spark-shell. These are some prerequisites:
Download this jar from json-serde
Extract the zip file to any location
Now run spark-shell using this command
spark-shell --jars path/to/jars/json-serde-cdh5-shim-1.3.7.3.jar,path/to/jars/json-serde-1.3.7.3.jar,path/to/jars/json-1.3.7.3.jar
Your Json document:
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
Collapsed version:
{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}
I've put this JSON in /tmp/sample.json.
Now, on to the Spark SQL part.
Creating a table based on the JSON schema:
sql("CREATE TABLE json_table(cust string,tables array<struct<Name: string,table_NAME:string,param1:string,val:string,op:string>>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
Now load the JSON data into the table:
sql("LOAD DATA LOCAL INPATH '/tmp/sample.json' OVERWRITE INTO TABLE json_table")
Now I'll be using the Hive LATERAL VIEW concept:
val ans=sql("SELECT myCol FROM json_table LATERAL VIEW explode(tables) myTable as myCol").collect
Schema of the returned result:
ans.printSchema
root
|-- table: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- table_NAME: string (nullable = true)
| |-- param1: string (nullable = true)
| |-- val: string (nullable = true)
| |-- op: string (nullable = true)
Result of ans.show
ans.show
+--------------------+
|               table|
+--------------------+
|[customer,cust,cu...|
|[sales,sale,count...|
+--------------------+
Now I'm assuming there can be two types of data, e.g. cust_id is of Number type and country is of String type. I'm adding a method to identify the type of data based on its value, e.g.:
def isAllDigits(x: String) = x forall Character.isDigit
Note: you can use your own way of identifying this.
Now, query creation based on the JSON data:
ans.foreach(f => {
  val splitted_string = f.toString.split(",")
  val op = splitted_string(4).substring(0, splitted_string(4).size - 2)
  val table_NAME = splitted_string(1)
  val param1 = splitted_string(2)
  val value = splitted_string(3)
  if (isAllDigits(value)) {
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "=" + value)
  } else {
    println("SELECT " + op + " FROM " + table_NAME + " WHERE " + param1 + "='" + value + "'")
  }
})
This is the result I've got:
SELECT cust_name FROM cust WHERE cust_id=112
SELECT monthly_sale FROM sale WHERE country='ind'
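For reference, a PySpark sketch of the same idea (not part of the original answer; the file path and the multiLine option are assumptions): exploding the tables array and reading the struct fields directly avoids splitting the Row's string form.
from pyspark.sql.functions import explode

rows = (spark.read.option("multiLine", "true").json("/tmp/sample.json")
        .select(explode("tables").alias("t"))
        .select("t.op", "t.table_NAME", "t.param1", "t.val")
        .collect())

for r in rows:
    # Quote the value unless it is purely numeric, as in the Scala version
    value = r["val"] if r["val"].isdigit() else "'{}'".format(r["val"])
    print("SELECT {} FROM {} WHERE {} = {}".format(r["op"], r["table_NAME"], r["param1"], value))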

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from the Hive table using sqlContext:
val a = sqlContext.sql("select * from student_record")
below is the schema:
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read data using read.json(rdd) and get Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details in Programmatically specifying the schema in the Spark SQL Documentation.
For example, in your specific case it would be:
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column legal_name being a map.
When you have defined your schema, you can call sqlContext.read.schema(schema).json(rdd) to create your DataFrame from your JSON dataset with the desired schema.
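In PySpark the equivalent would be (a sketch; json_rdd standing in for the RDD of JSON strings from the streaming job):
from pyspark.sql.types import StructType, StructField, MapType, StringType

schema = StructType([StructField("legal_name", MapType(StringType(), StringType(), True))])
df = sqlContext.read.schema(schema).json(json_rdd)
df.printSchema()  # legal_name is now map<string,string> instead of a struct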
