Partition of Timestamp column in Dataframes Pyspark - apache-spark

I have a DataFrame in PySpark in the below format
Date        Id  Name  Hours  Dno  Dname
12/11/2013  1   sam   8      102  It
12/10/2013  2   Ram   7      102  It
11/10/2013  3   Jack  8      103  Accounts
12/11/2013  4   Jim   9      101  Marketing
I want to partition the data based on Dno and save it as a table in Hive using the Parquet format.
df.write.saveAsTable(
    'default.testing', mode='overwrite', partitionBy='Dno', format='parquet')
The query worked fine and created the table in Hive in Parquet format.
Now I want to partition based on the year and month of the date column. The timestamp is a Unix timestamp.
How can I achieve that in PySpark? I have done it in Hive but am unable to do it in PySpark.

Spark >= 3.1
Instead of cast, use timestamp_seconds:
from pyspark.sql.functions import col, timestamp_seconds, year
year(timestamp_seconds(col("timestamp")))
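For example, deriving the partition columns this way and combining them with the write pattern from the Spark < 3.1 section below might look like this (a sketch, assuming the timestamp column holds seconds and the table/column names from the question):

from pyspark.sql.functions import col, month, timestamp_seconds, year

# Derive year/month partition columns from the UNIX timestamp and write to Hive.
(df
    .withColumn("year", year(timestamp_seconds(col("timestamp"))))
    .withColumn("month", month(timestamp_seconds(col("timestamp"))))
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))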
Spark < 3.1
Just extract the fields you want to use and provide a list of columns as an argument to the writer's partitionBy. If timestamp contains UNIX timestamps expressed in seconds:
df = sc.parallelize([
    (1484810378, 1, "sam", 8, 102, "It"),
    (1484815300, 2, "ram", 7, 103, "Accounts")
]).toDF(["timestamp", "id", "name", "hours", "dno", "dname"])
add columns:
from pyspark.sql.functions import year, month, col
df_with_year_and_month = (df
    .withColumn("year", year(col("timestamp").cast("timestamp")))
    .withColumn("month", month(col("timestamp").cast("timestamp"))))
and write:
(df_with_year_and_month
    .write
    .partitionBy("year", "month")
    .mode("overwrite")
    .format("parquet")
    .saveAsTable("default.testing"))

Related

Saving DataFrame to Hive (how to change DataFrame schema types to string if of type date; I do not want to hardcode column names)

I want to convert a pandas DataFrame to a Spark DataFrame and save it to Hive.
# create Spark DataFrame from pandas DataFrame
df = self.ss.createDataFrame(dataframe)
df.createOrReplaceTempView("table_Template")
self.ss.sql("create table IF NOT EXISTS database." + table_name + " STORED AS PARQUET as select * from table_Template")
ERROR:
pyspark.sql.utils.AnalysisException: 'org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.UnsupportedOperationException: Parquet does not support date. See HIVE-6384;
Try the code below to cast all date-type columns to string:
from pyspark.sql import functions as F
df.select([F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name) for f in df.schema]).show()
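With the date columns cast, the original save flow from the question should go through. A minimal sketch (spark stands for the session the question calls self.ss, and the database/table names are placeholders):

from pyspark.sql import functions as F

# Cast date columns to string so the Parquet-backed Hive table accepts the schema.
casted = df.select([
    F.col(f.name).cast("string") if f.dataType.typeName() == "date" else F.col(f.name)
    for f in df.schema
])
casted.createOrReplaceTempView("table_Template")
spark.sql("create table IF NOT EXISTS database." + table_name + " STORED AS PARQUET as select * from table_Template")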

How to use SparkSQL to select rows in Spark DF based on multiple conditions

I am relatively new to pyspark and I have a spark dataframe with a date column "Issue_Date". The "Issue_Date" column contains several dates from 1970-2060 (due to errors). From the spark dataframe, I have created a temp table from it and have been able to filter the data from year 2018. I would also like to include the data from year 2019 (i.e., multiple conditions). Is there a way to do so? I've tried many combinations but couldn't get it. Any form of help is appreciated, thank you.
# Filter data from 2018
sparkdf3.createOrReplaceTempView("table_view")
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) = 2018")
sparkdf4.count()
Did you try using year(Issue_Date) >= 2018?:
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) >= 2018")
If your column has errors and you want to specify a range, you can use year IN (2018, 2019):
sparkdf4 = spark.sql("select * from table_view where year(to_date(cast(unix_timestamp(Issue_Date,'MM/dd/yyyy') as timestamp))) in (2018, 2019)")

Updating static source based on Kafka Stream using Spark Streaming?

I am using spark-sql 2.4.1 with Java 8.
I have a scenario where I have some metadata in dataset1, which is loaded from an HDFS Parquet file.
And I have another dataset2 which is read from a Kafka stream.
For each record from dataset2 I need to check whether its columnX value is present in dataset1.
If it is there in dataset1, then I need to replace the columnX value with the column1 value from dataset1.
Else I need to increment max(column1) from dataset1 by one and store it in dataset1.
Some sample data you can see here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1165111237342523/3447405230020171/7035720262824085/latest.html
How can this be done in Spark?
Example:
val df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067")
).toDF("company_id_external","company_id")

val df2 = Seq(
  ("60886923","Chengdu Fuma Food Co,.Ltd"),        // company_id_external match found in df1
  ("608815923","Australia Deloraine Dairy Pty Ltd "),
  ("59322769","Consalac B.V."),
  ("583633","Boso oil and fat Co., Ltd. ")         // company_id_external match found in df1
).toDF("company_id_external","companyName")
If a match is found in df1:
Here only two records of df2 have a "company_id_external" that matches df1,
i.e. 60886923 & 583633 (first and last record).
For these records of df2,
("60886923","Chengdu Fuma Food Co,.Ltd") becomes ==> ("2860","Chengdu Fuma Food Co,.Ltd")
("583633","Boso oil and fat Co., Ltd. ") becomes ==> ("46067","Boso oil and fat Co., Ltd. ")
Else, if no match is found in df1:
For the other two records of df2 there is no "company_id_external" match in df1, so I need to generate a company_id for them and add it to df1,
i.e. ("608815923","Australia Deloraine Dairy Pty Ltd "),
("59322769","Consalac B.V.")
company_id generation logic:
new company_id = max(company_id) of df1 + 1
From the above, max is 50330, + 1 => 50331; add this record to df1, i.e. ("608815923","50331")
Do the same for the other one, i.e. add this record to df1, i.e. ("59322769","50332")
So now:
df1 = Seq(
  ("20359045","2263"),
  ("8476349","3280"),
  ("60886923","2860"),
  ("204831453","50330"),
  ("6487533","48236"),
  ("583633","46067"),
  ("608815923","50331"),
  ("59322769","50332")
).toDF("company_id_external","company_id")

How to repartition in spark based on column?

I want to repartition the DataFrame based on the day column.
I have 90 days of data in the DataFrame and I want to partition the data based on day, so that each day ends up in its own partition.
I want syntax like the below:
df.repartition("day",90)
Where
day => column in dataframe
90 => number of partitions I want
You can do that with:
import spark.implicits._
// Use the number of distinct day values as the partition count.
df.repartition(df.select($"day").distinct().count().toInt, $"day")
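The PySpark equivalent is a sketch along these lines (assuming a column literally named "day" as in the question):

from pyspark.sql.functions import col

# One shuffle partition per distinct day value.
num_days = df.select("day").distinct().count()
df_by_day = df.repartition(num_days, col("day"))

Note that this is hash partitioning, so two day values can still land in the same partition if their hashes collide; it does not strictly guarantee exactly one day per partition.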

Apache Spark subtract days from timestamp column

I am using Spark Dataset and having trouble subtracting days from a timestamp column.
I would like to subtract days from the Timestamp column and get a new column with the full datetime format. Example:
2017-09-22 13:17:39.900 - 10 ----> 2017-09-12 13:17:39.900
With the date_sub function I am getting 2017-09-12 without 13:17:39.900.
You can cast the data to timestamp and use expr to subtract an INTERVAL:
import org.apache.spark.sql.functions.expr
import spark.implicits._

val df = Seq("2017-09-22 13:17:39.900").toDF("timestamp")
df.withColumn(
  "10_days_before",
  $"timestamp".cast("timestamp") - expr("INTERVAL 10 DAYS")).show(false)
+-----------------------+---------------------+
|timestamp |10_days_before |
+-----------------------+---------------------+
|2017-09-22 13:17:39.900|2017-09-12 13:17:39.9|
+-----------------------+---------------------+
If the data is already of TimestampType you can skip the cast.
Or you can simply use the date_sub function, available in PySpark 1.5+:
from pyspark.sql.functions import col, date_sub
df.withColumn("10_days_before", date_sub(col('timestamp'), 10).cast('timestamp'))
