Dynamic query preparation and execution in Spark

In Spark, the JSON below is loaded into a DataFrame (DF). We have to navigate to tables (selected by cust), read the first block of tables, and prepare a SQL query from it.
Ex: SELECT CUST_NAME FROM CUST WHERE CUST_ID = 112
We then have to execute this query against the database and store the result in a JSON file.
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
root
|-- cust: string (nullable = true)
|-- tables: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- Name: string (nullable = true)
| | |-- op: string (nullable = true)
| | |-- param1: string (nullable = true)
| | |-- table_NAME: string (nullable = true)
| | |-- val: string (nullable = true)
The same applies to the second block of tables.
Ex: SELECT MONTHLY_SALE FROM SALE WHERE COUNTRY = 'IND'
This query also has to be executed against the database, and its result stored in the same JSON file.
What is the best approach to do this? Any ideas?

This is how I would approach it. I used spark-shell for the whole solution. Prerequisites:
Download the json-serde jars.
Extract the zip file to any location.
Run spark-shell with the jars on the classpath:
spark-shell --jars path/to/jars/json-serde-cdh5-shim-1.3.7.3.jar,path/to/jars/json-serde-1.3.7.3.jar,path/to/jars/json-1.3.7.3.jar
Your JSON document:
{
"cust": "Retails",
"tables": [
{
"Name":"customer",
"table_NAME":"cust",
"param1":"cust_id",
"val":"112",
"op":"cust_name"
},
{
"Name":"sales",
"table_NAME":"sale",
"param1":"country",
"val":"ind",
"op":"monthly_sale"
}]
}
Collapsed version:
{"cust": "Retails","tables":[{"Name":"customer","table_NAME":"cust","param1":"cust_id","val":"112","op":"cust_name"},{"Name":"sales","table_NAME":"sale","param1":"country","val":"ind","op":"monthly_sale"}]}
I saved this JSON as /tmp/sample.json.
Now the Spark SQL part.
Create a table based on the JSON schema:
sql("CREATE TABLE json_table(cust string,tables array<struct<Name: string,table_NAME:string,param1:string,val:string,op:string>>) ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'")
Now load the JSON data into the table:
sql("LOAD DATA LOCAL INPATH '/tmp/sample.json' OVERWRITE INTO TABLE json_table")
Next, use a Hive lateral view to explode the tables array into one row per element. The DataFrame is kept in its own variable so that its schema can be inspected (printSchema and show are DataFrame methods, not available on the collected array), and ans holds the collected rows:
val exploded = sql("SELECT myCol FROM json_table LATERAL VIEW explode(tables) myTable as myCol")
val ans = exploded.collect
Schema of the returned result:
exploded.printSchema
root
|-- myCol: struct (nullable = true)
| |-- Name: string (nullable = true)
| |-- table_NAME: string (nullable = true)
| |-- param1: string (nullable = true)
| |-- val: string (nullable = true)
| |-- op: string (nullable = true)
Result of exploded.show:
exploded.show
+--------------------+
|               myCol|
+--------------------+
|[customer,cust,cu...|
|[sales,sale,count...|
+--------------------+
I'm assuming there can be two types of data, e.g. cust_id is a number and country is a string. I'm adding a method that identifies the type of a value based on its content:
def isAllDigits(x: String) = x forall Character.isDigit
Note: you can use your own way of identifying the type.
Now build the queries from the JSON data:
ans.foreach { row =>
  // Each collected row holds one struct from the exploded tables array.
  val t = row.getStruct(0)
  // Field positions follow the struct definition: Name, table_NAME, param1, val, op.
  val tableName = t.getString(1)
  val param1 = t.getString(2)
  val value = t.getString(3)
  val op = t.getString(4)
  // Quote the value unless it consists only of digits.
  val literal = if (isAllDigits(value)) value else s"'$value'"
  println(s"SELECT $op FROM $tableName WHERE $param1=$literal")
}
This is the result I've got:
SELECT cust_name FROM cust WHERE cust_id=112
SELECT monthly_sale FROM sale WHERE country='ind'
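The answer above stops at printing the queries, while the question also asks to run them and save the results as JSON. The following is a minimal sketch of that last step in plain Python, not a Spark solution. It assumes an in-memory SQLite database as a stand-in for the real database; the `build_query` and `run_queries` helpers, the `result` key, and the sample rows are hypothetical. Column and table names are validated because they cannot be bound as query parameters, while the value is bound as a parameter.

```python
import json
import re
import sqlite3

# The JSON document from the question.
DOC = json.loads("""
{"cust": "Retails",
 "tables": [
  {"Name": "customer", "table_NAME": "cust", "param1": "cust_id", "val": "112", "op": "cust_name"},
  {"Name": "sales", "table_NAME": "sale", "param1": "country", "val": "ind", "op": "monthly_sale"}]}
""")

IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_query(block):
    """Return (sql, params) for one entry of the tables array."""
    for key in ("op", "table_NAME", "param1"):
        # Identifiers cannot be bound as parameters, so validate them instead.
        if not IDENTIFIER.match(block[key]):
            raise ValueError(f"unsafe identifier in {key!r}: {block[key]!r}")
    sql = f"SELECT {block['op']} FROM {block['table_NAME']} WHERE {block['param1']} = ?"
    # Treat all-digit values as integers, everything else as strings.
    raw = block["val"]
    value = int(raw) if re.fullmatch(r"[0-9]+", raw) else raw
    return sql, (value,)


def run_queries(conn, doc):
    """Run one query per tables entry and attach the rows under a 'result' key."""
    out = {"cust": doc["cust"], "tables": []}
    for block in doc["tables"]:
        sql, params = build_query(block)
        rows = [row[0] for row in conn.execute(sql, params)]
        out["tables"].append({**block, "result": rows})
    return out


# Hypothetical sample data standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cust (cust_id INTEGER, cust_name TEXT)")
conn.execute("CREATE TABLE sale (country TEXT, monthly_sale REAL)")
conn.execute("INSERT INTO cust VALUES (112, 'Alice')")
conn.execute("INSERT INTO sale VALUES ('ind', 1500.0)")

result = run_queries(conn, DOC)
print(json.dumps(result, indent=2))
```

To write the output to a file instead of printing it, `json.dump(result, f)` on an open file handle does the same job. Against a real database the connection would come from that database's driver, and the parameter placeholder style may differ from `?`.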

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file with the structure below, and I need to generate a CSV with this data in columns. I know that I can't write an array-type column directly to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. However, I get an error when I use explode before writing the DataFrame to CSV. From what I understand, it isn't possible to use explode on two columns in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (explode('history').alias('history'), explode('trial').alias('trial'))
.select('history.started_at', 'history.finished_at', col('id'), trial.is_trial, trial.ws10_max))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this:
started_at | finished_at | is_trial | ws10_max
First | row | row
Second | row | row
Thank you!
Use explode on the array and select("struct.*") on the struct:
df2 = (df.select("trial", "id", explode('history').alias('history'))
       .select('id', 'history.*', 'trial.*'))
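To make the effect of explode followed by select("struct.*") concrete, here is a plain-Python sketch (no Spark involved) that applies the same two steps to a record shaped like the schema in the question and writes the rows as CSV. The sample values and the `flatten` helper are hypothetical.

```python
import csv
import io

# Hypothetical record matching the schema in the question.
records = [
    {
        "id": 1,
        "history": [
            {"started_at": "2021-01-01", "finished_at": "2021-01-02"},
            {"started_at": "2021-02-01", "finished_at": "2021-02-02"},
        ],
        "trial": {"is_trial": True, "ws10_max": 3.5},
    }
]


def flatten(record):
    """Yield one flat dict per history element, like explode + 'struct.*'."""
    for item in record["history"]:   # explode: one output row per array element
        row = {"id": record["id"]}
        row.update(item)             # history.*: struct fields become columns
        row.update(record["trial"])  # trial.*: same for the non-array struct
        yield row


rows = [row for record in records for row in flatten(record)]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer, fieldnames=["id", "started_at", "finished_at", "is_trial", "ws10_max"]
)
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Only the array needs exploding; the struct is repeated unchanged on every output row, which is why a single explode is enough.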

AWS Glue - DynamicFrame with varying schema in json files

Sample:
I have a partitioned table with DDL below in Glue catalog:
CREATE EXTERNAL TABLE `test`(
`id` int,
`data` struct<a:string,b:string>)
PARTITIONED BY (
`partition_0` string)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
The underlying data in S3 consists of JSON files with a varying schema: some elements may be missing from some files but present in others.
In this sample, partition_0='01' contains a JSON file with all elements:
{"id": 1,"data": {"a": "value-a", "b": "value-b"}}
The file in partition_0='02' does not contain element data.b:
{"id": 1,"data": {"a": "value-a"}}
Issue:
When I create DynamicFrame in Glue (I use Python), its schema depends on the data that I query. If I include the data from partition_0='01' then all elements are present in the schema.
id_partition_predicate="partition_0 = '01'"
print("partition with 'b'")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition with 'b'
root
|-- id: int
|-- data: struct
| |-- a: string
| |-- b: string
|-- partition_0: string
print("both partitions")
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test").printSchema()
both partitions
root
|-- id: int
|-- data: struct
| |-- a: string
| |-- b: string
|-- partition_0: string
If I query only data from partition_0='02' then element data.b does not exist in the DynamicFrame schema even though it exists in the table definition.
print("partition without 'b'")
id_partition_predicate="partition_0 = '02'"
glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate).printSchema()
partition without 'b'
root
|-- id: int
|-- data: struct
| |-- a: string
|-- partition_0: string
Question: How can I create a DynamicFrame or DataFrame that always contains all elements from the Glue table's schema?
Thanks in advance!
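I came up with this solution:
from pyspark.sql.functions import col, lit
from pyspark.sql.types import StringType
from pyspark.sql.utils import AnalysisException

id_partition_predicate = "partition_0 = '02'"
dyf = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = "test", push_down_predicate = id_partition_predicate)
dyf.printSchema()
df = dyf.toDF()
try:
    # Raises AnalysisException when data.b is absent from the inferred schema
    df = df.withColumn("b", col("data").getItem("b"))
except AnalysisException:
    # Fall back to a null string column so that b always exists
    df = df.withColumn("b", lit(None).cast(StringType()))
df.show()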
Output:
root
|-- id: int
|-- data: struct
| |-- a: string
|-- partition_0: string
+---+---------+-----------+----+
| id| data|partition_0| b|
+---+---------+-----------+----+
| 1|[value-a]| 02|null|
+---+---------+-----------+----+
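The try/except above handles one known field. A more general way to think about the problem is to fill in every field that the table definition declares. The sketch below shows that idea in plain Python on parsed JSON records rather than on a DynamicFrame; the `TABLE_SCHEMA` template and the `conform` helper are hypothetical, and the logic would still need to be translated into DataFrame operations before it could run in a Glue job.

```python
# Hypothetical template mirroring the table DDL: nested dicts stand for
# struct columns, None marks a leaf column.
TABLE_SCHEMA = {"id": None, "data": {"a": None, "b": None}}


def conform(record, schema):
    """Return a copy of record with every field in schema; missing ones are None."""
    result = {}
    for name, subschema in schema.items():
        value = record.get(name)
        if isinstance(subschema, dict):
            # Struct column: recurse so that missing nested fields are filled too.
            nested = value if isinstance(value, dict) else {}
            result[name] = conform(nested, subschema)
        else:
            result[name] = value
    return result


full = {"id": 1, "data": {"a": "value-a", "b": "value-b"}}
partial = {"id": 1, "data": {"a": "value-a"}}

print(conform(full, TABLE_SCHEMA))
print(conform(partial, TABLE_SCHEMA))
```

Unlike the solution above, this keeps b inside the data struct instead of adding it as a top-level column.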

How to find out the schema from a tons of messy structured data?

I have a huge dataset with a messy, inconsistent schema.
The same field can hold different data types; for example, data.tags can be a list of strings or a list of objects.
I tried to load the JSON data from HDFS and print the schema, but it raises the error below.
TypeError: Can not merge type <class 'pyspark.sql.types.ArrayType'> and <class 'pyspark.sql.types.StringType'>
Here is the code
data_json = sc.textFile(data_path)
data_dataset = data_json.map(json.loads)
data_dataset_df = data_dataset.toDF()
data_dataset_df.printSchema()
Is it possible to infer a schema something like
root
|-- children: array (nullable = true)
| |-- element: map (containsNull = true)
| | |-- key: string
| | |-- value: boolean (valueContainsNull = true)
| |-- element: string
|-- first_name: string (nullable = true)
|-- last_name: string (nullable = true)
|-- occupation: string (nullable = true)
in this case?
If I understand correctly, you want to infer the schema of a JSON file. Consider reading the JSON straight into a DataFrame instead of going through a Python mapping function.
See also "How to infer schema of JSON files?", which I think answers your question.
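Before settling on a schema, it can help to see which types each field actually takes in the data. The following is a sketch in plain Python, not a Spark API: it walks parsed JSON records and collects the set of type names observed at each path. The `profile` helper and the sample lines are hypothetical.

```python
import json
from collections import defaultdict


def profile(value, path, seen):
    """Record the type of value at path, descending into dicts and lists."""
    if isinstance(value, dict):
        seen[path].add("object")
        for key, child in value.items():
            profile(child, f"{path}.{key}" if path else key, seen)
    elif isinstance(value, list):
        seen[path].add("array")
        for child in value:
            profile(child, f"{path}[]", seen)
    else:
        seen[path].add(type(value).__name__)


# Hypothetical lines in which 'tags' is sometimes a list of strings
# and sometimes a list of objects.
lines = [
    '{"first_name": "A", "tags": ["x", "y"]}',
    '{"first_name": "B", "tags": [{"name": "x", "ok": true}]}',
]

seen = defaultdict(set)
for line in lines:
    profile(json.loads(line), "", seen)

for path in sorted(seen):
    print(path or "<root>", sorted(seen[path]))
```

Paths that end up with more than one type are the ones that cannot be merged into a single column type, so they are candidates for an explicit schema or for being read as strings.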

Is there any better way to convert Array<int> to Array<String> in pyspark

I have a very large DataFrame with this schema:
root
|-- id: string (nullable = true)
|-- ext: array (nullable = true)
| |-- element: integer (containsNull = true)
So far I have tried exploding the data and then using collect_list:
select
id,
collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
id
But this approach is too expensive.
You can simply cast the ext column to an array of strings:
df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext
val a = sqlContext.sql("select * from student_record")
the schema is as follows:
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read data using read.json(rdd) and get Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to define your schema explicitly when calling read.json.
You can read about the details in "Programmatically specifying the schema" in the Spark SQL documentation.
For example, in your specific case it would be
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That gives one column, legal_name, of map type.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your DataFrame from the JSON dataset with the desired schema.
