Loading Parquet Files with Different Column Ordering - apache-spark

I have two Parquet directories that are being loaded into Spark. They have all the same columns, but the columns are in a different order.
val df = spark.read.parquet("url1").printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
val df = spark.read.parquet("url2").printSchema()
root
|-- b: string (nullable = true)
|-- a: string (nullable = true)
val urls = Array("url1", "url2")
val df = spark.read.parquet(urls: _*)
df.printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
When I load the directories together, the combined DataFrame always seems to take on the column ordering of url1. I'm worried that the Parquet files in url1 and url2 being saved with different column orders will have unintended consequences, such as the values of a and b being swapped. Can someone explain how Spark loads Parquet columns that are stored in a different order, ideally with links to official documentation?
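For what it's worth, a small experiment suggests the values do not swap, because Spark's Parquet reader matches columns by name rather than by position; a minimal sketch, assuming throwaway paths /tmp/url1 and /tmp/url2:
import spark.implicits._

// Write the same kind of data with two different column orders
Seq(("a1", "b1")).toDF("a", "b").write.mode("overwrite").parquet("/tmp/url1")
Seq(("a2", "b2")).toDF("a", "b").select("b", "a").write.mode("overwrite").parquet("/tmp/url2")

// Each file's columns are matched by name when reading, so column a
// still holds a-values and column b still holds b-values
val combined = spark.read.parquet("/tmp/url1", "/tmp/url2")
combined.show()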

Related

Do Parquet files preserve column order of Spark DataFrames?

Does creating a Spark DataFrame and saving it in Parquet format guarantee that the order of columns in the parquet file will be preserved?
For example: a Spark DataFrame is created with columns A, B, C and saved as Parquet. When the Parquet files are read back, will the column order always be A, B, C?
I've noticed that if I save a Spark DataFrame, and then read the parquet files, the column order is preserved:
df.select("A", "B", "C").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- A: string (nullable = true)
|-- B: string (nullable = true)
|-- C: string (nullable = true)
Then, if I save by selecting a different order of columns, and then read the parquet files, I can see the order is also what I expect:
df.select("C", "B", "A").write.save(...)
df = spark.read.load(...)
df.printSchema()
root
|-- C: string (nullable = true)
|-- B: string (nullable = true)
|-- A: string (nullable = true)
However, I can't find any documentation supporting this, and the comments on the post "Is there a possibility to keep column order when reading parquet?" contain conflicting information.
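If a guaranteed column order matters, one option is to supply an explicit schema at read time instead of relying on whatever order was written; a minimal sketch, using a placeholder path:
import org.apache.spark.sql.types.{StringType, StructField, StructType}

// The supplied schema pins both the column order and the types; columns in
// the Parquet files are matched by name ("/tmp/example" is a placeholder)
val fixedSchema = StructType(Seq(
  StructField("A", StringType),
  StructField("B", StringType),
  StructField("C", StringType)
))
val df = spark.read.schema(fixedSchema).parquet("/tmp/example")
df.printSchema()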

Create DataFrame from printSchema output

I have created a DataFrame on top of a Parquet file and am now able to see the DataFrame's schema. Now I want to create a DataFrame on top of the printSchema output.
df = spark.read.parquet("s3/location")
df.printSchema()
The output looks like [(cola, string), (colb, string)].
Now I want to create a DataFrame from the output of printSchema.
What would be the best way to do that?
Adding more detail on what has been achieved so far:
df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()
We got the result below:
root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)
We want to create a DataFrame with two columns: 1) colname, 2) datatype.
But if we run the code below:
schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df.show()
We are getting the output below, where all the column names and datatypes end up in a single row:
+--------------------+------+
| fields| type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+
We are looking for output like:
Atp| string
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string
Not sure what language you are using, but in PySpark I would do it like this:
# df1.dtypes is already a list of (column name, data type) tuples,
# so it can be passed straight to createDataFrame
schema_df = sqlContext.createDataFrame(df1.dtypes, schema=['colname', 'datatype'])
schema_df.show()
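If the rest of your pipeline is in Scala, a minimal equivalent sketch (assuming the same df1) would be:
// dtypes returns Array[(String, String)] of (column name, data type),
// which can be turned straight into a two-column DataFrame
import spark.implicits._
val schemaDF = df1.dtypes.toSeq.toDF("colname", "datatype")
schemaDF.show(false)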

Spark read ORC with specific columns

I have an ORC file; when read with the option below, it reads all the columns.
val df = spark.read.orc("/some/path/")
df.printSchema()
root
|-- id: string (nullable = true)
|-- name: string (nullable = true)
|-- value: string (nullable = true)
|-- all: string (nullable = true)
|-- next: string (nullable = true)
|-- action: string (nullable = true)
But I want to read only two columns from that file. Is there any way to read only two columns (id, name) while loading the ORC file?
Is there any way to read only two columns (id, name) while loading the ORC file?
Yes, all you need is a subsequent select. Spark will take care of the rest for you:
val df = spark.read.orc("/some/path/").select("id", "name")
Spark has a lazy execution model, so you can declare any data transformations in your code without any immediate effect. Only when an action is called does Spark start doing the work, and it is smart enough not to do extra work.
So you can write like this:
val inDF: DataFrame = spark.read.orc("/some/path/")
import spark.implicits._
val filteredDF: DataFrame = inDF.select($"id", $"name")
// any additional transformations
// real work starts after this action
val result: Array[Row] = filteredDF.collect()
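To double-check that only the selected columns are actually read, you can inspect the physical plan; with the native ORC reader the scan's ReadSchema should list only id and name. A quick check, assuming the filteredDF above:
// Column pruning is pushed down to the ORC scan, so the plan's
// ReadSchema should contain only id and name
filteredDF.explain()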

Spark inner join results in empty records

I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales DataFrame (0 records), even though both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales DataFrame is populated from a 20 GB file.
The same code with a small file of a few KB works fine with the join and I can see the results. The empty result DataFrame occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array: _*))
  .show()
What is wrong with the inner join?
Try broadcasting the small days_df and then applying the inner join. As days_df is tiny compared to the sales table, broadcasting will help.
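A minimal sketch of that suggestion, assuming the same sales and days_df DataFrames as above:
import org.apache.spark.sql.functions.broadcast

// Hint Spark to ship the small days_df to every executor instead of
// shuffling the large sales DataFrame for the join
val filtered_sales = sales.join(broadcast(days_df), Seq("DAY_ID"))
filtered_sales.show()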

Multiple aggregations on nested structure in a single Spark statement

I have a json structure like this:
{
  "a": 5,
  "b": 10,
  "c": {
    "c1": 3,
    "c4": 5
  }
}
I have a DataFrame created from this structure with several million rows. What I need are aggregations on several keys, like this:
df.agg(count($"b") as "cntB", sum($"c.c4") as "sumC")
Am I just missing the syntax? Or is there a different way to do it? Most importantly, Spark should only scan the data once for all the aggregations.
It is possible, but your JSON must be on a single line.
Each line = one JSON object.
import org.apache.spark.sql.functions.{count, sum}
import sqlContext.implicits._

val json = sc.parallelize(
  "{\"a\":5,\"b\":10,\"c\":{\"c1\": 3,\"c4\": 5}}" :: Nil)
val jsons = sqlContext.read.json(json)
jsons.agg(count($"b") as "cntB", sum($"c.c4") as "sumC").show
It works fine; note that the JSON is formatted to be on one line.
jsons.printSchema() prints:
root
|-- a: long (nullable = true)
|-- b: long (nullable = true)
|-- c: struct (nullable = true)
| |-- c1: long (nullable = true)
| |-- c4: long (nullable = true)
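Regarding the single-scan requirement: all aggregate expressions passed to one agg call are evaluated together, so additional aggregations can simply be added to the same call. A hedged sketch, assuming the jsons DataFrame above (the avg of a is just an illustrative extra aggregate):
import org.apache.spark.sql.functions.{avg, count, sum}

// All of these aggregates are computed in a single pass over jsons;
// explain() should show one scan feeding a single aggregation
jsons.agg(
  count($"b") as "cntB",
  sum($"c.c4") as "sumC",
  avg($"a") as "avgA"
).explain()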
