Pyspark: Write CSV from JSON file with struct column - apache-spark

I'm reading a .json file with the structure below, and I need to generate a CSV with this data in column form. I know I can't write an array-type column directly to a CSV, so I used the explode function to pull out the fields I need and put them in columnar form. But when writing the DataFrame to CSV I get an error from the explode call; as I understand it, it's not possible to use two explodes in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this
started_at | finished_at | is_trial | ws10_max
First      | row         | row      |
Second     | row         | row      |
Thank you!

Use explode on the array column and select("struct.*") on the struct column:
df.select("trial", "id", explode('history').alias('history')),
.select('id', 'history.*', 'trial.*'))
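For completeness, a minimal end-to-end sketch under the schema shown in the question (paths and column names come from the question; the built-in csv format is used in place of the com.databricks.spark.csv alias):
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = (SparkSession.builder
         .master("local[1]")
         .appName("sample")
         .getOrCreate())

# Read the nested JSON, as in the question
df = (spark.read.option("multiline", "true")
      .json("data/origin/crops.json"))

# Explode the history array, then flatten both structs with .*
df2 = (df.select("trial", "id", explode("history").alias("history"))
       .select("id", "history.*", "trial.*"))

# Every column is now an atomic type, so the CSV write succeeds
(df2.write.format("csv")
 .mode("overwrite")
 .option("header", "true")
 .save("data/output/"))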

Related

How to convert JSON file into regular table DataFrame in Apache Spark

I have the following JSON fields
{"constructorId":1,"constructorRef":"mclaren","name":"McLaren","nationality":"British","url":"http://en.wikipedia.org/wiki/McLaren"}
{"constructorId":2,"constructorRef":"bmw_sauber","name":"BMW Sauber","nationality":"German","url":"http://en.wikipedia.org/wiki/BMW_Sauber"}
The following code produces the following DataFrame:
I'm running the code on Databricks
df = (spark.read
.format("csv") \
.schema(mySchema) \
.load(dataPath)
)
display(df)
However, I need the DataFrame to look like the following:
I believe the problem is because the JSON is nested, and I'm trying to convert to CSV. However, I do need to convert to CSV.
Is there code that I can apply to remove the nested feature of the JSON?
Just try:
someDF = spark.read.json(somepath)
Let Spark infer the schema by default, or supply your own; in your case, set multiLine to False in PySpark.
someDF = spark.read.json(somepath, someschema, multiLine=False)
See https://spark.apache.org/docs/latest/sql-data-sources-json.html
With schema inference:
df = spark.read.option("multiline","false").json("/FileStore/tables/SOabc2.txt")
df.printSchema()
df.show()
df.count()
returns:
root
|-- constructorId: long (nullable = true)
|-- constructorRef: string (nullable = true)
|-- name: string (nullable = true)
|-- nationality: string (nullable = true)
|-- url: string (nullable = true)
+-------------+--------------+----------+-----------+--------------------+
|constructorId|constructorRef| name|nationality| url|
+-------------+--------------+----------+-----------+--------------------+
| 1| mclaren| McLaren| British|http://en.wikiped...|
| 2| bmw_sauber|BMW Sauber| German|http://en.wikiped...|
+-------------+--------------+----------+-----------+--------------------+
Out[11]: 2
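Since the goal in the question is ultimately a CSV, and the inferred schema above is already flat (no struct or array columns), the DataFrame can be written out directly. A short sketch, where the output path is just a placeholder:
# All columns are atomic types, so a plain CSV write works
(df.write.format("csv")
 .mode("overwrite")
 .option("header", "true")
 .save("/FileStore/tables/constructors_csv"))  # hypothetical output path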

How to modify a dataframe in-place so that its ArrayType column can't be null (nullable = false and containsNull = false)?

Take the following example dataframe:
val df = Seq(Seq("xxx")).toDF("a")
Schema:
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
How can I modify df in-place so that the resulting dataframe is not nullable anywhere, i.e. has the following schema:
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
I understand that I can re-create the dataframe with a non-nullable schema enforced, as in Change nullable property of column in spark dataframe:
spark.createDataFrame(df.rdd, StructType(StructField("a", ArrayType(StringType, false), false) :: Nil))
But this is not an option under structured streaming, so I want it to be some kind of in-place modification.
The way to achieve this is with a UserDefinedFunction.
// Problem setup
val df = Seq(Seq("xxx")).toDF("a")
df.printSchema
root
|-- a: array (nullable = true)
| |-- element: string (containsNull = true)
Onto the solution:
import org.apache.spark.sql.types.{ArrayType, StringType}
import org.apache.spark.sql.functions.{udf, col}
// We define a sub schema with the appropriate data type and null condition
val subSchema = ArrayType(StringType, containsNull = false)
// We create a UDF that applies this sub schema
// while specifying the output of the UDF to be non-nullable
val applyNonNullableSchemaUdf = udf((x:Seq[String]) => x, subSchema).asNonNullable
// We apply the UDF
val newSchemaDF = df.withColumn("a", applyNonNullableSchemaUdf(col("a")))
And there you have it.
// Check new schema
newSchemaDF.printSchema
root
|-- a: array (nullable = false)
| |-- element: string (containsNull = false)
// Check that it actually works
newSchemaDF.show
+-----+
| a|
+-----+
|[xxx]|
+-----+

Create dataframe on printschema output

I have created a dataframe on top of a parquet file and can now see the dataframe schema. Now I want to create a dataframe from the printSchema output.
df = spark.read.parquet("s3/location")
df.printSchema()
the output looks like [(cola , string) , (colb,string)]
Now I want to create a dataframe from the output of printSchema.
What would be the best way to do that?
Adding more detail on what has been achieved so far:
df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()
We got the below result -
root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)
We want to create dataframe with two columns - 1) colname , 2) datatype
But if we run the below code -
schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df.show()
We get the output below, where all the column names and datatypes end up in a single row:
+--------------------+------+
| fields| type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+
I'm looking for output like:
Atp| string
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string
Not sure what language you are using, but in PySpark I would do it like this:
# df1.dtypes already returns (column name, datatype) pairs
schema_df = sqlContext.createDataFrame(
    list(zip([c[0] for c in df1.dtypes], [c[1] for c in df1.dtypes])),
    schema=['colname', 'datatype'])
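A quick sanity check: schema_df.show() should now list one (colname, datatype) pair per row, which for the parquet schema in the question would look roughly like this:
schema_df.show()
# +-------+--------+
# |colname|datatype|
# +-------+--------+
# |    Atp|  string|
# |  Ccetp|  string|
# |  Ccref|  string|
# |    ...|     ...|
# +-------+--------+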

Is there any better way to convert Array<int> to Array<String> in pyspark

I have a very large DataFrame with this schema:
root
|-- id: string (nullable = true)
|-- ext: array (nullable = true)
| |-- element: integer (containsNull = true)
So far I've tried exploding the data, then collect_list:
select
id,
collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
id
But this way is too expensive.
You can simply cast the ext column to a string array
df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()
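As a self-contained sketch with made-up sample data standing in for source, to show that the cast changes the element type without any explode:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cast-array").getOrCreate()

# Hypothetical sample data with the same shape as the schema in the question
source = spark.createDataFrame([("a", [1, 2, 3]), ("b", [4, 5])], ["id", "ext"])

# Cast the whole array column in one pass; no explode / collect_list needed
df = source.withColumn("ext", source.ext.cast("array<string>"))

df.printSchema()  # ext: array, element: string
df.show()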

JSON Struct to Map[String,String] using sqlContext

I am trying to read JSON data in a Spark Streaming job.
By default, sqlContext.read.json(rdd) converts all map types to struct types.
|-- legal_name: struct (nullable = true)
| |-- first_name: string (nullable = true)
| |-- last_name: string (nullable = true)
| |-- middle_name: string (nullable = true)
But when I read from a Hive table using sqlContext,
val a = sqlContext.sql("select * from student_record")
below is the schema.
|-- leagalname: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
Is there any way we can read data using read.json(rdd) and get Map data type?
Is there any option like
spark.sql.schema.convertStructToMap?
Any help is appreciated.
You need to explicitly define your schema when calling read.json.
You can read about the details in Programmatically specifying the schema in the Spark SQL Documentation.
For example, in your specific case it would be:
import org.apache.spark.sql.types._
val schema = StructType(List(StructField("legal_name",MapType(StringType,StringType,true))))
That would be one column legal_name being a map.
Once you have defined your schema, you can call
sqlContext.read.schema(schema).json(rdd) to create your data frame from your JSON dataset with the desired schema.
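If you happen to be doing this from PySpark instead of Scala, a rough equivalent sketch (rdd here stands in for your RDD of JSON strings):
from pyspark.sql.types import StructType, StructField, MapType, StringType

# legal_name declared as map<string,string> instead of the inferred struct
schema = StructType([
    StructField("legal_name", MapType(StringType(), StringType(), True))
])

df = sqlContext.read.schema(schema).json(rdd)
df.printSchema()  # legal_name: map (key: string, value: string)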
