Timestamp conversion mismatch? - apache-spark

I have a database in which I want to save readable timestamps as strings with a specific format.
The input dataframe I get has timestamps of type 'timestamp'.
For calculations I convert them to Unix timestamps. The results go back to a database as string-formatted timestamps.
The problem is that when I convert these string-formatted timestamps back to Unix time, I get mismatching values.
How can that happen?
from pyspark.sql import functions as f

time_format = "YYYY-MM-dd'T'HH:mm:ssz"
dummy = df.withColumn('start_time_unix', f.unix_timestamp('start_time'))\
    .withColumn('start_time_string', f.from_unixtime('start_time_unix', format=time_format))\
    .withColumn('start_time_unix_2', f.unix_timestamp('start_time_string', format=time_format))
dummy.printSchema()
dummy.show(10, False)
OUTPUT
root
|-- start_time: timestamp (nullable = true)
|-- start_time_unix: long (nullable = true)
|-- start_time_string: string (nullable = true)
|-- start_time_unix_2: long (nullable = true)
+-------------------+---------------+-----------------------+-----------------+
|start_time |start_time_unix|start_time_string |start_time_unix_2|
+-------------------+---------------+-----------------------+-----------------+
|2019-06-04 00:39:08|1559601548 |2019-06-04T00:39:08CEST|1546123148 |
+-------------------+---------------+-----------------------+-----------------+
EDIT
The input-dataframe was generated according to:
df = spark.read.csv('data.csv', header=True, inferSchema=True)
The CSV contains rows like 2019-06-04 00:39:08 (unquoted).
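One likely explanation, offered as a hedged sketch rather than a confirmed diagnosis: in Java's SimpleDateFormat pattern syntax, which Spark versions before 3.0 follow for these functions, uppercase `Y` means week-based year while lowercase `y` means calendar year. Parsing with `YYYY` can therefore resolve the string to the start of week-year 2019 instead of 4 June 2019. The plain-Python snippet below only decodes the two epoch values from the output above to show where the second one lands; the helper name `describe_epoch` is made up for illustration.

```python
from datetime import datetime, timezone

def describe_epoch(seconds):
    """Return the UTC calendar time for a Unix timestamp as an ISO string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).isoformat()

original = 1559601548    # start_time_unix from the output above
round_trip = 1546123148  # start_time_unix_2 from the output above

print(describe_epoch(original))    # 2019-06-03T22:39:08+00:00
print(describe_epoch(round_trip))  # 2018-12-29T22:39:08+00:00

# The time of day survives the round trip; only the date moves.
gap_days, remainder = divmod(original - round_trip, 86400)
print(gap_days, remainder)  # 156 0

# Candidate fix (an assumption, not verified against the asker's cluster):
# use the calendar-year letter in the pattern.
time_format = "yyyy-MM-dd'T'HH:mm:ssz"
```

At a +02:00 offset the second value is 2018-12-30 00:39:08, a Sunday, which is consistent with the first day of week-based year 2019 when weeks start on Sunday.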

Related

Pyspark: Write CSV from JSON file with struct column

I'm reading a .json file with the structure below, and I need to generate a CSV with this data in columnar form. I know that I can't write an array-type column directly to a CSV, so I used the explode function to pull out the fields I need and lay them out as columns. However, I get an error when using explode before writing the dataframe to CSV. From what I understand, it's not possible to use explode twice in the same select. Can someone suggest an alternative?
from pyspark.sql.functions import col, explode
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[1]")
         .appName("sample")
         .getOrCreate())

df = (spark.read.option("multiline", "true")
      .json("data/origin/crops.json"))

# Failing attempt: two explode calls in the same select
df2 = (df.select(explode('history').alias('history'), explode('trial').alias('trial'))
       .select('history.started_at', 'history.finished_at', col('id'), 'trial.is_trial', 'trial.ws10_max'))

(df2.write.format('com.databricks.spark.csv')
 .mode('overwrite')
 .option("header", "true")
 .save('data/output/'))
root
|-- history: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- finished_at: string (nullable = true)
| | |-- started_at: string (nullable = true)
|-- id: long (nullable = true)
|-- trial: struct (nullable = true)
| |-- is_trial: boolean (nullable = true)
| |-- ws10_max: double (nullable = true)
I'm trying to return something like this:
started_at | finished_at | is_trial | ws10_max
First | row | row
Second | row | row
Thank you!
Use explode on the array and select("struct.*") on the struct.
(df.select("trial", "id", explode('history').alias('history'))
   .select('id', 'history.*', 'trial.*'))
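To picture what that answer produces, here is a plain-Python model of the two operations, using a made-up record shaped like the schema above. It only illustrates the semantics (one output row per `history` element, struct fields promoted to top-level columns) and is not Spark code; `flatten_record` and the sample values are hypothetical.

```python
def flatten_record(record):
    """Mimic explode('history') followed by select('id', 'history.*', 'trial.*')."""
    rows = []
    for entry in record["history"]:      # explode: one row per array element
        row = {"id": record["id"]}
        row.update(entry)                # history.*: struct fields become columns
        row.update(record["trial"])      # trial.*: a struct needs no explode
        rows.append(row)
    return rows

# Hypothetical record matching the printed schema
sample = {
    "id": 1,
    "history": [
        {"started_at": "2021-01-01", "finished_at": "2021-01-02"},
        {"started_at": "2021-02-01", "finished_at": "2021-02-02"},
    ],
    "trial": {"is_trial": True, "ws10_max": 3.5},
}

for row in flatten_record(sample):
    print(row)
```

Because `trial` is a struct rather than an array, its fields are simply repeated on every exploded row, which is why a single explode is enough.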

Loading Parquet Files with Different Column Ordering

I have two Parquet directories that are being loaded into Spark. They have all the same columns, but the columns are in a different order.
spark.read.parquet("url1").printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
spark.read.parquet("url2").printSchema()
root
|-- b: string (nullable = true)
|-- a: string (nullable = true)
val urls = Array("url1", "url2")
spark.read.parquet(urls: _*).printSchema()
root
|-- a: string (nullable = true)
|-- b: string (nullable = true)
When I load the files together they always seem to take on the ordering of url1. I'm worried that having the parquet files in url1 and url2 saved in a different order will have unintended consequences, such as a and b swapping values. Can someone explain how parquet loads columns stored in a different order, with links to official documentation, if possible?

pyspark with hive - can't properly create with partition and save a table from a dataframe

I'm trying to convert json files to parquet with very few transformations (adding date) but I then need to partition this data before saving it to parquet.
I'm hitting a wall on this area.
Here is the creation process of the table:
from pyspark.sql import functions as fn

# data_location, cond3, today and warehouse_location are defined elsewhere
df_temp = spark.read.json(data_location).filter(cond3)
df_temp = df_temp.withColumn("date", fn.to_date(fn.lit(today.strftime("%Y-%m-%d"))))
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
Then, to save the converted data:
df_final.write.mode("append").format("parquet").partitionBy("customer_id", "date").saveAsTable('duration')
but this generates the following error:
pyspark.sql.utils.AnalysisException: '\nSpecified partitioning does not match that of the existing table default.duration.\nSpecified partition columns: [customer_id, date]\nExisting partition columns: []\n ;'
the schema being:
root
|-- action_id: string (nullable = true)
|-- customer_id: string (nullable = true)
|-- duration: long (nullable = true)
|-- initial_value: string (nullable = true)
|-- item_class: string (nullable = true)
|-- set_value: string (nullable = true)
|-- start_time: string (nullable = true)
|-- stop_time: string (nullable = true)
|-- undo_event: string (nullable = true)
|-- year: integer (nullable = true)
|-- month: integer (nullable = true)
|-- day: integer (nullable = true)
|-- date: date (nullable = true)
Thus I tried to change the create table to:
spark.sql("CREATE TABLE IF NOT EXISTS {1} LIKE {0}_tmp PARTITIONED BY (customer_id, date) LOCATION '{2}/{1}'".format("duration_small","duration", warehouse_location))
But this creates an error like:
...mismatched input 'PARTITIONED' expecting ...
So I discovered that PARTITIONED BY doesn't work with LIKE, and I'm running out of ideas.
When using USING instead of LIKE, I get the error:
pyspark.sql.utils.AnalysisException: 'It is not allowed to specify partition columns when the table schema is not defined. When the table schema is not provided, schema and partition columns will be inferred.;'
How am I supposed to add a partition when creating the table?
P.S. Once the schema of the table is defined with the partitions, I simply want to use:
df_final.write.format("parquet").insertInto('duration')
I finally figured out how to do it with Spark.
# df_temp = spark.read.json... (loaded as in the question)
df_temp.createOrReplaceTempView("{}_tmp".format("duration_small"))
spark.sql("""
    CREATE TABLE IF NOT EXISTS {1}
    USING PARQUET
    PARTITIONED BY (customer_id, date)
    LOCATION '{2}/{1}' AS SELECT * FROM {0}_tmp
""".format("duration_small","duration", warehouse_location))
spark.sql("DESC {}".format("duration"))
df_temp.write.mode("append").partitionBy("customer_id", "date").saveAsTable('duration')
I don't know why, but I can't use insertInto: when I do, it uses a weird customer_id out of nowhere and doesn't append the different dates.
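A possible explanation for the odd customer_id values, stated as an assumption: `DataFrameWriter.insertInto` resolves columns by position rather than by name, and a partitioned table lists its partition columns last. If the dataframe's column order differs from the table's, values end up in the wrong columns. The sketch below shows one way to reorder the column names before inserting; `order_for_insert` is a hypothetical helper, and the Spark call in the trailing comment is not executed here.

```python
def order_for_insert(columns, partition_cols):
    """Return columns with the partition columns moved to the end, in the given order."""
    missing = [c for c in partition_cols if c not in columns]
    if missing:
        raise ValueError(f"partition columns not in dataframe: {missing}")
    data_cols = [c for c in columns if c not in partition_cols]
    return data_cols + list(partition_cols)

# Column names taken from the schema printed in the question
columns = ["action_id", "customer_id", "duration", "initial_value", "item_class",
           "set_value", "start_time", "stop_time", "undo_event",
           "year", "month", "day", "date"]

ordered = order_for_insert(columns, ["customer_id", "date"])
print(ordered)

# Intended use with Spark (not executed in this sketch):
# df_final.select(*ordered).write.insertInto("duration")
```

Alternatively, selecting `spark.table('duration').columns` from the dataframe reuses the table's own column order directly.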

Create dataframe on printschema output

I have created a dataframe on top of a Parquet file and can now see the dataframe schema. Now I want to create a dataframe from the printSchema output.
df = spark.read.parquet("s3/location")
df.printSchema()
The output looks like [(cola, string), (colb, string)].
Now I want to create a dataframe from the output of printSchema.
What would be the best way to do that?
Adding more detail on what has been achieved so far:
df1 = sqlContext.read.parquet("s3://t1")
df1.printSchema()
We got the result below:
root
|-- Atp: string (nullable = true)
|-- Ccetp: string (nullable = true)
|-- Ccref: string (nullable = true)
|-- Ccbbn: string (nullable = true)
|-- Ccsdt: string (nullable = true)
|-- Ccedt: string (nullable = true)
|-- Ccfdt: string (nullable = true)
|-- Ccddt: string (nullable = true)
|-- Ccamt: string (nullable = true)
We want to create a dataframe with two columns: 1) colname, 2) datatype.
But if we run the code below:
schemaRDD = spark.sparkContext.parallelize([df1.schema.json()])
schema_df = spark.read.json(schemaRDD)
schema_df.show()
We get the output below, where all the column names and datatypes end up in a single row:
+--------------------+------+
| fields| type|
+--------------------+------+
|[[Atp,true,str...|struct|
+--------------------+------+
We are looking for output like:
Atp| string
Ccetp| string
Ccref| string
Ccbbn| string
Ccsdt| string
Ccedt| string
Ccfdt| string
Ccddt| string
Ccamt| string
Not sure what language you are using, but in PySpark I would do it like this:
# df1.dtypes is already a list of (column name, type string) tuples
schema_df = spark.createDataFrame(df1.dtypes, schema=['colname', 'datatype'])
schema_df.show()
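The single-row result in the question comes from the shape of `df1.schema.json()`: it is one JSON object whose `fields` entry is an array. As a hedged alternative to the `dtypes` approach, the JSON can be unpacked in plain Python into one (name, type) pair per field. The schema string below is a shortened, hand-written stand-in for the real output, and `schema_pairs` is a hypothetical helper.

```python
import json

def schema_pairs(schema_json):
    """Turn a Spark schema JSON string into a list of (colname, datatype) pairs."""
    fields = json.loads(schema_json)["fields"]
    pairs = []
    for field in fields:
        dtype = field["type"]
        # Nested types (struct, array, map) are dicts; keep only their kind here.
        if isinstance(dtype, dict):
            dtype = dtype["type"]
        pairs.append((field["name"], dtype))
    return pairs

# Stand-in for df1.schema.json(), limited to two of the columns
schema_json = json.dumps({
    "type": "struct",
    "fields": [
        {"name": "Atp", "type": "string", "nullable": True, "metadata": {}},
        {"name": "Ccetp", "type": "string", "nullable": True, "metadata": {}},
    ],
})

for name, dtype in schema_pairs(schema_json):
    print(f"{name}| {dtype}")
```

The resulting list of pairs can then be passed to `spark.createDataFrame(pairs, schema=['colname', 'datatype'])` to get one row per column.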

Spark inner joins results in empty records

I'm performing an inner join between dataframes to only keep the sales for specific days:
val days_df = ss.createDataFrame(days_array.map(Tuple1(_))).toDF("DAY_ID")
val filtered_sales = sales.join(days_df, Seq("DAY_ID"))
filtered_sales.show()
This results in an empty filtered_sales dataframe (0 records). Both DAY_ID columns have the same type (string).
root
|-- DAY_ID: string (nullable = true)
root
|-- SKU: string (nullable = true)
|-- DAY_ID: string (nullable = true)
|-- STORE_ID: string (nullable = true)
|-- SALES_UNIT: integer (nullable = true)
|-- SALES_REVENUE: decimal(20,5) (nullable = true)
The sales df is populated from a 20GB file.
The same code works fine with a small file of a few KB: the join returns the expected results. The empty result occurs only with the bigger dataset.
If I change the code and use the following one, it works fine even with the 20GB sales file:
sales.filter(sales("DAY_ID").isin(days_array:_*))
  .show()
What is wrong with the inner join?
Try broadcasting days_array and then applying the inner join. Because days_array is very small compared to the other table, broadcasting will help.

Resources