Create dataframe from printSchema output - apache-spark

I have created a dataframe on top of a parquet file and am now able to see its schema. Now I want to create a dataframe from the printSchema output:
df = spark.read.parquet("s3/location")
df.printSchema()
The output looks like [(cola, string), (colb, string)].
Now I want to create a dataframe from the output of printSchema.
What would be the best way to do that?
Adding more detail on what has been tried so far -
df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()
We got the result below -
root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)
We want to create a dataframe with two columns: 1) colname, 2) datatype.
But if we run the code below -
schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df.show()
We are getting the output below, where all the column names and datatypes land in a single row -
+--------------------+------+
| fields| type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+
We are looking for output like
Atp| string
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string

Not sure what language you are using, but in PySpark I would do it like this:
schemaRDD = spark.sparkContext.parallelize([df.schema.json()])
schema_df = spark.read.json(schemaRDD)

Or, for the two-column (colname, datatype) output, build it directly from df1.dtypes, which already returns a list of (name, type) pairs:

schema_df = sqlContext.createDataFrame(df1.dtypes, schema=['colname', 'datatype'])
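For reference, the single-row result from reading the schema JSON comes about because df1.schema.json() is one JSON document whose fields array holds the per-column entries. A minimal pure-Python sketch of pulling (colname, datatype) pairs out of it — the schema string is hand-built here to match the shape Spark emits, not taken from a live session:

```python
import json

# Hand-built stand-in for the string df1.schema.json() returns.
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "Atp", "type": "string", "nullable": True, "metadata": {}},
        {"name": "Ccetp", "type": "string", "nullable": True, "metadata": {}},
    ],
})

# One (colname, datatype) pair per entry in the 'fields' array -- the rows
# you would hand to spark.createDataFrame(rows, ['colname', 'datatype']).
rows = [(f["name"], f["type"]) for f in json.loads(schema_json)["fields"]]
print(rows)  # [('Atp', 'string'), ('Ccetp', 'string')]
```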

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file that contains the structure below, and I need to generate a CSV with this data in column form. I know that I can't directly write an array-type object to a CSV, so I used the explode function to pull out the fields I need and leave them in columnar form. But when writing the dataframe to CSV I get an error from the explode function; from what I understand it's not possible to do this with two explodes in the same select. Can someone help me with an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession
spark = (SparkSession.builder
.master("local[1]")
.appName("sample")
.getOrCreate())
df = (spark.read.option("multiline", "true")
.json("data/origin/crops.json"))
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
    .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))
(df2.write.format('com.databricks.spark.csv')
.mode('overwrite')
.option("header","true")
.save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this, one row per history entry:
started_at | finished_at | is_trial | ws10_max
(first row)
(second row)
Thank you!
Use explode on array and select("struct.*") on struct.
df2 = (df.select("trial", "id", explode('history').alias('history'))
    .select('id', 'history.*', 'trial.*'))
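A pure-Python model of what this explode-then-flatten produces — the record layout mirrors the asker's schema, but the values and the single-record shape are made up for illustration:

```python
# One input record shaped like the asker's schema: an array column
# ('history'), a scalar ('id'), and a struct ('trial').
record = {
    "id": 1,
    "history": [
        {"started_at": "2021-01-01", "finished_at": "2021-01-02"},
        {"started_at": "2021-02-01", "finished_at": "2021-02-02"},
    ],
    "trial": {"is_trial": True, "ws10_max": 9.5},
}

# explode: one output row per element of the history array;
# 'history.*' / 'trial.*' flatten each struct into top-level columns.
rows = [
    (record["id"], h["started_at"], h["finished_at"],
     record["trial"]["is_trial"], record["trial"]["ws10_max"])
    for h in record["history"]
]
print(rows)
```

The key point is that only the array needs explode; the struct is flattened with the `.*` selection, so no second generator is required in the same select.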

Apache Spark Column has a data type that cannot participate in a columnstore index with PySpark

Whenever I attempt to transfer data from Apache Spark to Azure SQL DW using PySpark, I get the following error:
Column 'cd_created_date' has a data type that cannot participate in a columnstore index
My schema is as follows:
root
|-- extraction_date: string (nullable = true)
|-- ce_case_data_id: string (nullable = true)
|-- cd_created_date: string (nullable = true)
|-- cd_last_modified: string (nullable = true)
|-- cd_jurisdiction: string (nullable = true)
|-- cd_latest_state: string (nullable = true)
|-- cd_reference: string (nullable = true)
|-- cd_security_classification: string (nullable = true)
|-- cd_version: string (nullable = true)
|-- cd_last_state_modified_date: string (nullable = true)
The failure starts with field 'cd_created_date', but I'm sure I will also get the error with 'cd_last_state_modified_date'.
My guess is that I will need to change schema of those fields to fix the problem, but I'm not sure.
Any thoughts?

Loading Parquet Files with Different Column Ordering

I have two Parquet directories that are being loaded into Spark. They have all the same columns, but the columns are in a different order.
val df = spark.read.parquet("url1").printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
val df = spark.read.parquet("url2").printSchema()
root
|-- b: string (nullable = true)
|-- a: string (nullable = true)
val urls = Array("url1", "url2")
val df = spark.read.parquet(urls: _*).printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
When I load the files together they always seem to take on the ordering of url1. I'm worried that having the parquet files in url1 and url2 saved in a different order will have unintended consequences, such as a and b swapping values. Can someone explain how parquet loads columns stored in a different order, with links to official documentation, if possible?

Timestamp conversion mismatching?

I have a database in which I want to save readable timestamps as strings with a specific format.
The input dataframe I get has timestamps of type 'timestamp'.
For calculations I convert them to Unix timestamps. The results go back to a database as string-formatted timestamps.
The problem is that when I convert these string-formatted timestamps back to Unix time, I get mismatched values.
How can that happen?
time_format = "YYYY-MM-dd'T'HH:mm:ssz"
dummy = df.withColumn('start_time_unix', f.unix_timestamp('start_time'))\
.withColumn('start_time_string', f.from_unixtime('start_time_unix', format=time_format))\
.withColumn('start_time_unix_2', f.unix_timestamp('start_time_string', format=time_format))
dummy.printSchema()
dummy.show(10, False)
OUTPUT
root
|-- start_time: timestamp (nullable = true)
|-- start_time_unix: long (nullable = true)
|-- start_time_string: string (nullable = true)
|-- start_time_unix_2: long (nullable = true)
+-------------------+---------------+-----------------------+-----------------+
|start_time |start_time_unix|start_time_string |start_time_unix_2|
+-------------------+---------------+-----------------------+-----------------+
|2019-06-04 00:39:08|1559601548 |2019-06-04T00:39:08CEST|1546123148 |
+-------------------+---------------+-----------------------+-----------------+
EDIT
The input-dataframe was generated according to:
df = spark.read.csv('data.csv', header=True, inferSchema=True)
With the CSV containing rows like "2019-06-04 00:39:08" (without the quotes).
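One plausible culprit — an assumption, since the thread carries no accepted answer here — is the uppercase YYYY in time_format: in the Java date patterns Spark uses, YYYY is the week-based year while the calendar year is lowercase yyyy, so parsing a plain month/day with YYYY can resolve to a different instant, typically near the start of the week year. Python's strftime has the same split between %Y and %G, which makes the distinction easy to see without Spark:

```python
from datetime import date

# 2018-12-31 falls in ISO week 1 of 2019, so the week-based year
# ('%G', the analogue of Java's uppercase YYYY) disagrees with the
# calendar year ('%Y', the analogue of lowercase yyyy).
d = date(2018, 12, 31)
print(d.strftime("%Y"))  # 2018
print(d.strftime("%G"))  # 2019
```

If that is the cause, switching the pattern to "yyyy-MM-dd'T'HH:mm:ssz" would be the fix to try first.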

Is there any better way to convert Array<int> to Array<String> in pyspark

I have a very large DataFrame with this schema:
root
|-- id: string (nullable = true)
|-- ext: array (nullable = true)
| |-- element: integer (containsNull = true)
So far I have tried exploding the data, then collect_list:
select
id,
collect_list(cast(item as string))
from default.dual
lateral view explode(ext) t as item
group by
id
But this way is too expensive.
You can simply cast the ext column to a string array:
df = source.withColumn("ext", source.ext.cast("array<string>"))
df.printSchema()
df.show()
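Element-wise, cast("array&lt;string&gt;") just stringifies each entry while leaving nulls as null. A plain-Python model of what happens to one row's ext value (made-up numbers, not Spark itself):

```python
# What casting array<int> to array<string> does to one row's ext value:
ext = [101, 102, None]
ext_str = [str(x) if x is not None else None for x in ext]
print(ext_str)  # ['101', '102', None]
```

Unlike the explode/collect_list route, the cast is a single column-level operation with no shuffle, which is why it is much cheaper on a large DataFrame.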
