Partition of Timestamp column in Dataframes Pyspark - apache-spark

I have a DataFrame in PySpark in the below format
Date        Id  Name  Hours  Dno  Dname
12/11/2013  1   sam   8      102  It
12/10/2013  2   Ram   7      102  It
11/10/2013  3   Jack  8      103  Accounts
12/11/2013  4   Jim   9      101  Marketing
I want to partition the data based on Dno and save it as a table in Hive using the Parquet format.
df.write.saveAsTable(
    'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')
The query worked fine and created the table in Hive in Parquet format.
Now I want to partition based on the year and month of the date column. The timestamp is a Unix timestamp.
How can I achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.

Spark >= 3.1
Instead of cast, use timestamp_seconds:
from pyspark.sql.functions import col, timestamp_seconds, year
year(timestamp_seconds(col("timestamp")))
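For example, deriving the partition columns this way and combining them with the write pattern from the Spark < 3.1 section below might look like this (a sketch, assuming the timestamp column holds seconds and the table/column names from the question):

from pyspark.sql.functions import col, month, timestamp_seconds, year

# Derive year/month partition columns from the UNIX timestamp and write to Hive.
(df
    .withColumn("year", year(timestamp_seconds(col("timestamp"))))
    .withColumn("month", month(timestamp_seconds(col("timestamp"))))
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))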
Spark < 3.1
Just extract the fields you want to use and provide a list of columns as an argument to the writer's partitionBy. If timestamp contains UNIX timestamps expressed in seconds:
df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])
add columns:
from pyspark.sql.functions import year, month, col
df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))
and write:
(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))

Related

Saving DataFrame to Hive (how to change DataFrame schema types to string if of type date; I do not want to hardcode column names)

I want to convert a pandas DataFrame to a Spark DataFrame and save it to Hive.
# create Spark DataFrame from pandas DataFrame
df = self.ss.createDataFrame(dataframe)
df.createOrReplaceTempView("table_Template")
self.ss.sql("create table IF NOT EXISTS database." + table_name + " STORED AS PARQUET as select * from table_Template")
ERROR:
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support date. See HIVE-6384;
Try the code below to cast all date-type columns to string:
from pyspark.sql import functions as F
df.select([F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name) for f in df.schema]).show()
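With the date columns cast, the original save flow from the question should go through. A minimal sketch (spark stands for the session the question calls self.ss, and the database/table names are placeholders):

from pyspark.sql import functions as F

# Cast date columns to string so the Parquet-backed Hive table accepts the schema.
casted = df.select([
    F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name)
    for f in df.schema
])
casted.createOrReplaceTempView("table_Template")
spark.sql("create table IF NOT EXISTS database." + table_name + " STORED AS PARQUET as select * from table_Template")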

How to use SparkSQL to select rows in Spark DF based on multiple conditions

I am relatively new to pyspark and I have a spark dataframe with a date column "Issue_Date". The "Issue_Date" column contains several dates from 1970-2060 (due to errors). From the spark dataframe, I have created a temp table from it and have been able to filter the data from year 2018. I would also like to include the data from year 2019 (i.e., multiple conditions). Is there a way to do so? I've tried many combinations but couldn't get it. Any form of help is appreciated, thank you.
# Filter data from 2018
sparkdf3.createOrReplaceTempView("table_view")
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) = 2018")
sparkdf4.count()
Did you try using year(Issue_Date) >= 2018?:
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) >= 2018")
If your column has errors and you want to specify a range, you can use year IN (2018, 2019):
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) in (2018, 2019)")

Updating static source based on Kafka Stream using Spark Streaming?

I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I have some metadata in dataset1, which is loaded from an HDFS Parquet file.
And I have another dataset2 which is read from a Kafka stream.
For each record from dataset2 I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, then I need to replace the columnX value with the column1 value from dataset1.
Else I need to increment max(column1) from dataset1 by one and store it in dataset1.
Some sample data you can see here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
val df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067")
).toDF("company_id_external","company_id")

val df2 = Seq(
  ("60886923","Chengdu Fuma Food Co,.Ltd"),        // company_id_external match found in df1
  ("608815923","Australia Deloraine Dairy Pty Ltd "),
  ("59322769","Consalac B.V."),
  ("583633","Boso oil and fat Co., Ltd. ")         // company_id_external match found in df1
).toDF("company_id_external","companyName")
If a match is found in df1:
Here only two records of df2 have a "company_id_external" that matches df1,
i.e. 60886923 & 583633 (first and last record).
For these records of df2,
("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd. ") becomes ==> ("46067","Boso oil and fat Co., Ltd. ")
Else, if no match is found in df1:
For the other two records of df2 there is no "company_id_external" match in df1, so I need to generate a company_id for them and add it to df1,
i.e. ("608815923","Australia Deloraine Dairy Pty Ltd "),
("59322769","Consalac B.V.")
company_id generation logic:
new company_id = max(company_id) of df1 + 1
From the above, max is 50330, + 1 => 50331; add this record to df1, i.e. ("608815923","50331")
Do the same for the other one, i.e. add this record to df1, i.e. ("59322769","50332")
So now:
df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067"),
  ("608815923","50331"),
  ("59322769","50332")
).toDF("company_id_external","company_id")

How to repartition in spark based on column?

I want to repartition the DataFrame based on the day column.
I have 90 days of data in the DataFrame and I want to partition the data based on day, so that each day ends up in its own partition.
I want syntax like the below:
df.repartition("day",90)
Where
day => column in dataframe
90 => number of partitions I want
You can do that with:
import spark.implicits._
// Use the number of distinct day values as the partition count.
df.repartition(df.select($"day").distinct().count().toInt, $"day")
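The PySpark equivalent is a sketch along these lines (assuming a column literally named "day" as in the question):

from pyspark.sql.functions import col

# One shuffle partition per distinct day value.
num_days = df.select("day").distinct().count()
df_by_day = df.repartition(num_days, col("day"))

Note that this is hash partitioning, so two day values can still land in the same partition if their hashes collide; it does not strictly guarantee exactly one day per partition.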

Apache Spark subtract days from timestamp column

I am using Spark Dataset and having trouble subtracting days from a timestamp column.
I would like to subtract days from the Timestamp column and get a new column with the full datetime format. Example:
2017-09-22 13:17:39.900 - 10 ----> 2017-09-12 13:17:39.900
With the date_sub function I am getting 2017-09-12 without 13:17:39.900.
You can cast the data to timestamp and use expr to subtract an INTERVAL:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq("2017-09-22 13:17:39.900").toDF("timestamp")
df.withColumn(
  "10_days_before",
  $"timestamp".cast("timestamp") - expr("INTERVAL 10 DAYS")).show(false)
+-----------------------+---------------------+
|timestamp |10_days_before |
+-----------------------+---------------------+
|2017-09-22 13:17:39.900|2017-09-12 13:17:39.9|
+-----------------------+---------------------+
If the data is already of TimestampType you can skip the cast.
Or you can simply use the date_sub function, available in PySpark 1.5+:
from pyspark.sql.functions import col, date_sub
df.withColumn("10_days_before", date_sub(col('timestamp'), 10).cast('timestamp'))
