Is there any way to handle time in pyspark? - apache-spark

I have a 6-character string that should be loaded into SQL Server as the TIME data type.
But Spark doesn't have a time data type. I have tried a few approaches, but the column never ends up as a timestamp.
I read the data as a string, convert it to a timestamp, and then try to extract just the time portion, but the result comes back as a string again.
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).printSchema()
root
|-- time_col: timestamp (nullable = true)
|-- tim2: string (nullable = true)
The data itself looks right, but tim2 ends up with the wrong data type:
df.select('time_col').withColumn("time_col",to_timestamp(col("time_col"),"HHmmss").cast(TimestampType())).withColumn("tim2", date_format(col("time_col"), "HHmmss")).show(5)
+-------------------+------+
| time_col| tim2|
+-------------------+------+
|1970-01-01 14:44:51|144451|
|1970-01-01 14:48:37|144837|
|1970-01-01 14:46:10|144610|
|1970-01-01 11:46:39|114639|
|1970-01-01 17:44:33|174433|
+-------------------+------+
Is there any way to get the tim2 column as a timestamp, or as something equivalent to SQL Server's TIME data type?

I don't think you'll get what you are trying to do: there's no type in PySpark for handling "HH:mm:ss" values, see: What data type should be used for a time column.
I'd suggest you keep it as a string.

In my case I converted the value to a timestamp in Spark and turned it back into a string just before sending it to SQL Server; that worked fine for me.
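A minimal sketch of that approach, assuming time_col holds the 6-character HHmmss strings from the question (the commented-out JDBC options are placeholders, not tested settings):
from pyspark.sql import functions as F

# Work with the value as a timestamp inside Spark...
df_out = (
    df.withColumn("time_ts", F.to_timestamp("time_col", "HHmmss"))
      # ...and turn it back into an "HH:mm:ss" string just before writing to SQL Server
      .withColumn("time_str", F.date_format("time_ts", "HH:mm:ss"))
)

# Hypothetical JDBC write; host, database, table and credentials are placeholders
# df_out.write.format("jdbc") \
#     .option("url", "jdbc:sqlserver://<host>;databaseName=<db>") \
#     .option("dbtable", "dbo.my_table") \
#     .option("user", "<user>") \
#     .option("password", "<password>") \
#     .mode("append").save()
SQL Server should then be able to convert the "HH:mm:ss" string into a TIME column on insert.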

Maybe this will help you, though it seems to me that this turns the column into a string:
df.withColumn('TIME', date_format('datetime', 'HH:mm:ss'))

In Scala (Python will be similar):
scala> val df = Seq("144451","144837").toDF("c").select('c.cast("INT").cast("TIMESTAMP"))
df: org.apache.spark.sql.DataFrame = [c: timestamp]
scala> df.show()
+-------------------+
| c|
+-------------------+
|1970-01-02 17:07:31|
|1970-01-02 17:13:57|
+-------------------+
scala> df.printSchema()
root
|-- c: timestamp (nullable = true)
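A rough PySpark equivalent of that Scala snippet (a sketch; note that casting INT to TIMESTAMP interprets the number as seconds since the epoch, which is why 144451 lands on 1970-01-02 rather than on a time of day):
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([("144451",), ("144837",)], ["c"])
df = df.withColumn("c", F.col("c").cast("int").cast("timestamp"))
df.printSchema()
df.show()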

Related

How to convert string to TimestampType in PySpark

I have a CSV file with the data below:
Tran_id,Tran_date1,Tran_date2,Tran_date3
1,2022-07-02T16:53:30.375Z,2022-07-02T16:53:30.3750000+00:00,2022-07-02 16:53:30.3750000+00:00
2,2022-08-02T17:33:10.456Z,2022-08-02T17:33:10.4560000+00:00,2022-08-02 17:33:10.4560000+00:00
3,2022-09-02T18:13:20.375Z,2022-09-02T18:13:20.3750000+00:00,2022-09-02 18:13:20.3750000+00:00
4,2022-09-02T19:23:90.322Z,2022-09-02T19:23:90.3220000+00:00,2022-09-02 19:23:90.3220000+00:00
I want to read this CSV file using PySpark and convert the data to the format below:
root
|-- Tran_id: integer (nullable = false)
|-- Tran_date1: TimestampType(nullable = false)
|-- Tran_date2: TimestampType(nullable = false)
|-- Tran_date3: TimestampType(nullable = false)
and then save this data into a Hive table after converting the string columns to TimestampType.
How do I convert the strings into TimestampType without losing the format?
You could have read your CSV file with automatic conversion to the required types, like this:
df = spark.read.option("header","true").option("inferSchema", "true").csv("test.csv")
df.printSchema()
root
|-- Tran_id: integer (nullable = true)
|-- Tran_date1: string (nullable = true)
|-- Tran_date2: string (nullable = true)
|-- Tran_date3: string (nullable = true)
But as you can see, with your data it won't give you the correct schema. The reason is that you have some bad data in your CSV. E.g. look at your record #4: 2022-09-02T19:23:90.322Z; there can't be 90 seconds.
You can do the parsing yourself:
from pyspark.sql import functions as F

df = (
    spark.read
    .option("header", "true")
    .csv("test.csv")
    .select(
        "Tran_id",
        F.to_timestamp("Tran_date1").alias("Tran_date1"),
        F.to_timestamp("Tran_date2").alias("Tran_date2"),
        F.to_timestamp("Tran_date3").alias("Tran_date3")))
# Schema is correct now
df.printSchema()
root
|-- Tran_id: string (nullable = true)
|-- Tran_date1: timestamp (nullable = true)
|-- Tran_date2: timestamp (nullable = true)
|-- Tran_date3: timestamp (nullable = true)
# But we now have nulls for bad data
df.show(truncate=False)
+-------+-----------------------+-----------------------+-----------------------+
|Tran_id|Tran_date1 |Tran_date2 |Tran_date3 |
+-------+-----------------------+-----------------------+-----------------------+
|1 |2022-07-02 16:53:30.375|2022-07-02 16:53:30.375|2022-07-02 16:53:30.375|
|2 |2022-08-02 17:33:10.456|2022-08-02 17:33:10.456|2022-08-02 17:33:10.456|
|3 |2022-09-02 18:13:20.375|2022-09-02 18:13:20.375|2022-09-02 18:13:20.375|
|4 |null |null |null |
+-------+-----------------------+-----------------------+-----------------------+
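If you want to see which rows failed to parse before deciding how to handle them, one option (a sketch reusing the same test.csv and the F alias from above) is to compare the raw string with its parsed value:
# Rows where the original string is present but to_timestamp could not parse it
raw = spark.read.option("header", "true").csv("test.csv")
bad_rows = raw.where(
    F.col("Tran_date1").isNotNull() & F.to_timestamp("Tran_date1").isNull()
)
bad_rows.show(truncate=False)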
With the correct schema in place, you can later save your dataframe to Hive; Spark will take care of preserving the types.
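For example, a minimal sketch (the database and table names below are placeholders):
# Write the typed dataframe as a Hive table; the timestamp columns keep their types
df.write.mode("overwrite").saveAsTable("my_db.transactions")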
I chose to show you how to solve this problem with Spark SQL.
#
# Create sample dataframe + view
#
# array of tuples - data
dat1 = [
(1,"2022-07-02T16:53:30.375Z","2022-07-02T16:53:30.3750000+00:00","2022-07-02 16:53:30.3750000+00:00"),
(2,"2022-08-02T17:33:10.456Z","2022-08-02T17:33:10.4560000+00:00","2022-08-02 17:33:10.4560000+00:00"),
(3,"2022-09-02T18:13:20.375Z","2022-09-02T18:13:20.3750000+00:00","2022-09-02 18:13:20.3750000+00:00"),
(4,"2022-09-02T19:23:90.322Z","2022-09-02T19:23:90.3220000+00:00","2022-09-02 19:23:90.3220000+00:00")
]
# array of names - columns
col1 = ["Tran_id", "Tran_date1", "Tran_date2", "Tran_date3"]
# make data frame
df1 = spark.createDataFrame(data=dat1, schema=col1)
# make temp hive view
df1.createOrReplaceTempView("sample_data")
# show schema
df1.printSchema()
Instead of creating a file on my Azure Databricks storage, I decided to use an array of tuples, since this is a very simple file.
We can see from the schema output that the last three fields are strings.
The code below uses spark functions to convert the information from string to timestamp.
See this link for Spark SQL functions.
https://spark.apache.org/docs/2.3.0/api/sql/index.html#to_timestamp
#
# Convert data using PySpark SQL
#
# sql stmt
stmt = """
select
Tran_id,
to_timestamp(Tran_date1) as Tran_Ts1,
to_timestamp(Tran_date2) as Tran_Ts2,
to_timestamp(Tran_date3) as Tran_Ts3
from sample_data
"""
# new dataframe w/results
df2 = spark.sql(stmt)
# show schema
df2.printSchema()
I totally agree with qaziqarta.
Once the data is a timestamp, you can format it any way you want, either in the front-end reporting tool (Power BI) or by converting it to a formatted string in the curated zone, as a file already formatted for reporting.
See link for Spark SQL Function.
https://spark.apache.org/docs/2.3.0/api/sql/index.html#date_format
See link for valid date format strings.
https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
The code below formats the timestamp in 3 different forms.
#
# Convert data using PySpark SQL
#
# sql stmt
stmt = """
select
Tran_id,
date_format(to_timestamp(Tran_date1), "yyyy-MM-dd") as Tran_Fmt1,
date_format(to_timestamp(Tran_date2), "yyyy-MM-dd HH:mm:ss") as Tran_Fmt2,
date_format(to_timestamp(Tran_date3), "yyyy-MM-dd HH:mm:ss.SSS") as Tran_Fmt3
from sample_data
"""
# new dataframe w/results
df3 = spark.sql(stmt)
# show schema
df3.printSchema()
Please note, we are now going back to a string format.
Executing the SQL above yields formatted strings representing the date, the date/time, and the date/time with fractional seconds.

PySpark not picking the custom schema in csv

I am struggling with a very basic PySpark example; I don't know what is going on and would really appreciate it if someone could help me out.
Below is my PySpark code to read a CSV file which contains three columns:
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").appName("Sample App").getOrCreate()
child_df1 = spark.read.csv("E:\\data\\person.csv",inferSchema=True,header=True,multiLine=True)
child_df1.printSchema()
Below is the output of the above code:
root
|-- CPRIMARYKEY: long (nullable = true)
|-- GENDER: string (nullable = true)
|-- FOREIGNKEY: long (nullable = true)
child_df1.select("CPRIMARYKEY","FOREIGNKEY","GENDER").show()
Output
+--------------------+----------------------+------+
| CPRIMARYKEY |FOREIGNKEY |GENDER|
+--------------------+----------------------+------+
| 6922132627268452352| -4967470388989657188| F|
|-1832965148339791872| 761108337125613824| F|
| 7948853342318925440| -914230724356211688| M|
The issue comes when I provide the custom schema
import pyspark.sql.types as T
child_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType())
    ]
)
child_df2 = spark.read.csv("E:\\data\\person.csv",schema=child_schema,multiLine=True,header=True)
child_df2.show()
+--------------------+----------------------+
| CPRIMARYKEY |FOREIGNKEY|
+--------------------+----------------------+
| 6922132627268452352| null|
|-1832965148339791872| null|
| 7948853342318925440| null|
I can't understand why Spark recognizes the long values when inferring the schema, yet puts nulls in the FOREIGNKEY column when I provide the schema myself. I have been struggling with this simple exercise for a very long time with no luck. Could someone please point out what I am missing? Thank you.
As far as I understand, your schema declares only 2 columns, so the FOREIGNKEY and GENDER columns are de facto read as one.
Spark then tries to parse -4967470388989657188,F as a long and returns null because it isn't a valid long.
Can you add the GENDER column to the schema and see if that fixes FOREIGNKEY?
If you don't want the GENDER column, instead of removing it from the schema just .drop('GENDER') after reading the CSV, as in the sketch below.
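A sketch of that suggestion, reusing the file path and reader options from the question:
import pyspark.sql.types as T

# Declare all three CSV columns so the positions line up, then drop GENDER
full_schema = T.StructType(
    [
        T.StructField("CPRIMARYKEY", T.LongType()),
        T.StructField("FOREIGNKEY", T.LongType()),
        T.StructField("GENDER", T.StringType())
    ]
)
child_df2 = (
    spark.read.csv("E:\\data\\person.csv", schema=full_schema, multiLine=True, header=True)
    .drop("GENDER")
)
child_df2.show()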

Do I need to use "mergeSchema" option in spark with parquet if I am passing in a schema explicitly?

From the Spark documentation:
Since schema merging is a relatively expensive operation, and is not a necessity in most cases, we turned it off by default starting from 1.5.0. You may enable it by setting data source option mergeSchema to true when reading Parquet files (as shown in the examples below), or setting the global SQL option spark.sql.parquet.mergeSchema to true.
(https://spark.apache.org/docs/latest/sql-data-sources-parquet.html)
My understanding from the documentation is that if I have multiple parquet partitions with different schemas, spark will be able to merge these schemas automatically if I use spark.read.option("mergeSchema", "true").parquet(path).
This seems like a good option if I don't know at query time what schemas exist in these partitions.
However, consider the case where I have two partitions, one using an old schema, and one using a new schema that differs only in having one additional field. Let's also assume that my code knows the new schema and I'm able to pass this schema in explicitly.
In this case, I would do something like spark.read.schema(my_new_schema).parquet(path). What I'm hoping Spark would do in this case is read in both partitions using the new schema and simply supply null values for the new column to any rows in the old partition. Is this the expected behavior? Or do I also need to use option("mergeSchema", "true") in this case?
I'm hoping to avoid using the mergeSchema option if possible in order to avoid the additional overhead mentioned in the documentation.
I've tried extending the example code from the spark documentation linked above, and my assumptions appear to be correct. See below:
// This is used to implicitly convert an RDD to a DataFrame.
scala> import spark.implicits._
import spark.implicits._
// Create a simple DataFrame, store into a partition directory
scala> val squaresDF = spark.sparkContext.makeRDD(1 to 5).map(i => (i, i * i)).toDF("value", "square")
squaresDF: org.apache.spark.sql.DataFrame = [value: int, square: int]
scala> squaresDF.write.parquet("test_data/test_table/key=1")
// Create another DataFrame in a new partition directory,
// adding a new column and dropping an existing column
scala> val cubesDF = spark.sparkContext.makeRDD(6 to 10).map(i => (i, i * i * i)).toDF("value", "cube")
cubesDF: org.apache.spark.sql.DataFrame = [value: int, cube: int]
scala> cubesDF.write.parquet("test_data/test_table/key=2")
// Read the partitioned table
scala> val mergedDF = spark.read.option("mergeSchema", "true").parquet("test_data/test_table")
mergedDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 2 more fields]
scala> mergedDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- cube: integer (nullable = true)
|-- key: integer (nullable = true)
// Read without mergeSchema option
scala> val naiveDF = spark.read.parquet("test_data/test_table")
naiveDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 1 more field]
// Note that cube column is missing.
scala> naiveDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- key: integer (nullable = true)
// Take the schema from the mergedDF above and use it to read the same table with an explicit schema, but without the "mergeSchema" option.
scala> val explicitSchemaDF = spark.read.schema(mergedDF.schema).parquet("test_data/test_table")
explicitSchemaDF: org.apache.spark.sql.DataFrame = [value: int, square: int ... 2 more fields]
// Spark was able to use the correct schema despite not using the "mergeSchema" option
scala> explicitSchemaDF.printSchema()
root
|-- value: integer (nullable = true)
|-- square: integer (nullable = true)
|-- cube: integer (nullable = true)
|-- key: integer (nullable = true)
// Data is as expected.
scala> explicitSchemaDF.show()
+-----+------+----+---+
|value|square|cube|key|
+-----+------+----+---+
| 3| 9|null| 1|
| 4| 16|null| 1|
| 5| 25|null| 1|
| 8| null| 512| 2|
| 9| null| 729| 2|
| 10| null|1000| 2|
| 1| 1|null| 1|
| 2| 4|null| 1|
| 6| null| 216| 2|
| 7| null| 343| 2|
+-----+------+----+---+
As you can see, spark appears to be correctly supplying null values to any columns missing from the parquet partitions when using an explicit schema to read the data.
This makes me feel fairly confident that I can answer my question with "no, the mergeSchema option is not necessary in this case," but I'm still wondering if there are any caveats that I should be aware of. Any additional help from others would be appreciated.

How to split values from map_keys() into multiple columns in PySpark

I have this data frame that has a schema with a map like below:
root
|-- events: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
When I explode it or use map_keys() to obtain those values I get this dataframe below:
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk...|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
|[[{event_name=run...|[{event_name=walk...|
|[[{event_name=fly...| 2019-02-17|
|[[{event_name=run...| 09:00:00|
+--------------------+--------------------+
This is my code to get to the dataframe shown above:
from pyspark.sql import functions as F

events = event_data.withColumn(
    "map_data",
    F.map_values(event_data.events)
)
events.printSchema()

(events.select("map_data")
    .withColumn(
        "map_values",
        F.explode(events.map_data)
    )
    .show(10))
From what I started with, I would consider this a milestone reached; however, I would like my data frame to look like this:
+--------------------+-----------+--------+
| events | date | time |
+--------------------+-----------+--------+
|[{event_name=walk...| 2019-02-17|08:00:00|
|[{event_name=walk...| 2019-02-17|09:00:00|
+--------------------+-----------+--------+
I have been researching and have seen that people are using UDFs; however, I am sure there is a way to accomplish what I want purely with DataFrames and SQL functions.
For more insight, here is what my rows look like with .show(truncate=False):
+--------------------+--------------------+
| map_data| map_values|
+--------------------+--------------------+
|[[{event_name=walk..|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|
|[[{event_name=walk..| 2019-02-17|
|[[{event_name=walk..| 08:00:00|
Also, with the dataframe I have now, my issue is to figure out how to explode an array into multiple columns. I mention this because I can either work with that or use a more efficient process to create the dataframe based on the map I was given.
I have found a solution to my problem. I needed to follow this approach (Create a dataframe from a hashmap with keys as column names and values as rows in Spark) and perform that series of computations on event_data, which is my initial dataframe.
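For reference, here is a sketch of how the map can be split into columns by key with getItem(); the key names "events", "date" and "time" are assumptions, since the question does not show the actual keys:
from pyspark.sql import functions as F

split_df = event_data.select(
    F.col("events").getItem("events").alias("events"),   # assumed key names
    F.col("events").getItem("date").alias("date"),
    F.col("events").getItem("time").alias("time")
)
split_df.show(truncate=False)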
This is how my dataframe looks now
|25769803776|2019-03-19|[{event_name=walk, duration=0.47, x=0.39, y=0.14, timestamp=08:02:30.574892}, {event_name=walk, duration=0.77, x=0.15, y=0.08, timestamp=08:02:50.330245}, {event_name=run, duration=0.02, x=0.54, y=0.44, timestamp=08:02:22.737803}, {event_name=run, duration=0.01, x=0.43, y=0.56, timestamp=08:02:11.629404}, {event_name=run, duration=0.03, x=0.57, y=0.4, timestamp=08:02:22.660778}, {event_name=run, duration=0.02, x=0.49, y=0.49, timestamp=08:02:56.660186}]|08:02:00|

Spark 1.5.2: org.apache.spark.sql.AnalysisException: unresolved operator 'Union;

I have two dataframes df1 and df2. Both of them have the following schema:
|-- ts: long (nullable = true)
|-- id: integer (nullable = true)
|-- managers: array (nullable = true)
| |-- element: string (containsNull = true)
|-- projects: array (nullable = true)
| |-- element: string (containsNull = true)
df1 is created from an Avro file, while df2 is created from an equivalent Parquet file. However, if I execute df1.unionAll(df2).show(), I get the following error:
org.apache.spark.sql.AnalysisException: unresolved operator 'Union;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:37)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:44)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:174)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.apply(CheckAnalysis.scala:49)
at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:103)
I ran into the same situation, and it turns out that not only do the fields need to be the same, but you also need to maintain the exact same ordering of the fields in both dataframes to make it work.
This is old and there are already some answers lying around but I just faced this problem while trying to make a union of two dataframes like in...
//Join 2 dataframes
val df = left.unionAll(right)
As others have mentioned, order matters. So just select the right dataframe's columns in the same order as the left dataframe's columns:
//Join 2 dataframes, but take columns in the same order
val df = left.unionAll(right.select(left.columns.map(col):_*))
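A PySpark equivalent, for reference (a sketch assuming left and right are DataFrames as above):
from pyspark.sql.functions import col

# Reorder the right dataframe's columns to match the left before the union
df = left.unionAll(right.select([col(c) for c in left.columns]))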
I found the following PR on GitHub:
https://github.com/apache/spark/pull/11333
It relates to UDF (user defined function) columns, which were not correctly handled during the union and would thus cause the union to fail. The PR fixes it, but it hasn't made it into Spark 1.6.2; I haven't checked on Spark 2.x yet.
If you're stuck on 1.6.x, there's a crude workaround: map the DataFrame to an RDD and back to a DataFrame.
// for a DF with 2 columns (Long, Array[Long])
val simple = dfWithUDFColumn
.map{ r => (r.getLong(0), r.getAs[Array[Long]](1))} // DF --> RDD[(Long, Array[Long])]
.toDF("id", "tags") // RDD --> back to DF but now without UDF column
// dfOrigin has the same structure but no UDF columns
val joined = dfOrigin.unionAll(simple).dropDuplicates(Seq("id")).cache()
