spark sql change date format using spark expr - apache-spark

I'm using PySpark 2.4; the code is below.
I have a dataframe with French month names.
I convert them to English month names in order to change the date format (the date_desired column), and everything works using two expressions:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [
    (1, "20 mai 2021"),
    (1, "21 juin 2021")
]
schema = StructType([
    StructField('montant', IntegerType(), False),
    StructField('date', StringType(), True),
])
col = ["montant", "date"]
df2 = spark.createDataFrame(data=data, schema=schema)
df2 = df2.select(col)
df2.show()
dd = df2.withColumn('date_expr', F.expr("""
    CASE WHEN rlike(date, 'mai')  THEN regexp_replace(date, 'mai', 'may')
         WHEN rlike(date, 'juin') THEN regexp_replace(date, 'juin', 'june')
         ELSE date
    END
"""))
dd = dd.withColumn('date_desired', F.expr("to_date(date_expr, 'dd MMMM yyyy')"))
dd.show()
+-------+------------+
|montant| date|
+-------+------------+
| 1| 20 mai 2021|
| 1|21 juin 2021|
+-------+------------+
+-------+------------+------------+------------+
|montant| date| date_expr|date_desired|
+-------+------------+------------+------------+
| 1| 20 mai 2021| 20 may 2021| 2021-05-20|
| 1|21 juin 2021|21 june 2021| 2021-06-21|
+-------+------------+------------+------------+
But:
I want to achieve the same result with a single expression, as below:
dd =df2.withColumn('date_expr',F.expr(" CASE WHEN rlike(date,'mai') THEN regexp_replace(date,'mai','may') \
WHEN rlike(date,'juin') THEN regexp_replace(date,'juin','june') \
ELSE date \
END as dt_col\
to_date(dt_col ,'dd MMMM yyyy')"))
but I get a SQL syntax error.
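For what it's worth, one way to get this into a single expression (a sketch, not from the original post): SQL does not allow an alias in the middle of an expression, so nest the CASE directly inside to_date instead:
dd = df2.withColumn('date_desired', F.expr("""
    to_date(
        CASE WHEN rlike(date, 'mai')  THEN regexp_replace(date, 'mai', 'may')
             WHEN rlike(date, 'juin') THEN regexp_replace(date, 'juin', 'june')
             ELSE date
        END,
        'dd MMMM yyyy')
"""))
dd.show()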

from itertools import chain
from pyspark.sql import functions as F
from pyspark.sql.functions import create_map, lit, split, array, array_join

# create the month mapping using itertools.chain
d = {'mai': "May", 'juin': "June"}
m_expr1 = create_map([lit(x) for x in chain(*d.items())])

new = (df2.withColumn('new_date', split(df2['date'], r'\s'))  # split the date string on whitespace
       .withColumn('x', F.struct(*[F.col("new_date")[i].alias(f"val{i+1}") for i in range(3)]))  # turn the parts into a struct column
       .withColumn("x", F.col("x").withField("val2", m_expr1[F.col("x.val2")]))  # map the French month to English
       .select('montant', 'date', array_join(array('x.*'), ' ').alias('newdate'))  # convert the struct column back to a string date
       .withColumn('date_desired', F.expr("to_date(newdate, 'dd MMMM yyyy')"))  # convert to a date
       ).show()
+-------+------------+------------+------------+
|montant| date| newdate|date_desired|
+-------+------------+------------+------------+
| 1| 20 mai 2021| 20 May 2021| 2021-05-20|
| 1|21 juin 2021|21 June 2021| 2021-06-21|
+-------+------------+------------+------------+
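As a side note, Column.withField used above was added in Spark 3.1, while the question targets PySpark 2.4. A sketch that stays within the 2.4 API folds the same mapping dict into chained regexp_replace calls on the df2 from the question:
from functools import reduce
from pyspark.sql import functions as F

d = {'mai': 'May', 'juin': 'June'}

# Fold the dict into nested regexp_replace calls over the date column.
mapped = reduce(lambda c, kv: F.regexp_replace(c, kv[0], kv[1]), d.items(), F.col('date'))

(df2.withColumn('newdate', mapped)
    .withColumn('date_desired', F.to_date('newdate', 'dd MMMM yyyy'))
    .show())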

Related

Change string to HH:MM:SS in PySpark

I have Column "minutes" . i want change the column to hh:mm:ss format in PySpark
Input:
minutes(string type)
10
20
70
90
output:
minutes(string type) min_change
10 00:10:00
20 00:20:00
70 01:10:00
90 01:30:00
Add a column with lit("00:00:00") and cast it to timestamp. Convert the minutes to seconds and add it to the timestamp column. Finally, use date_format() to get your desired format:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
df.withColumn("minutes", col("minutes").cast("int"))\
.withColumn("min_change", lit("00:00:00").cast("timestamp"))\
.withColumn("min_change", (F.unix_timestamp("min_change") + F.col("minutes")*60).cast('timestamp'))\
.withColumn("min_change", date_format("min_change",'HH:mm:ss')).show()
+-------+----------+
|minutes|min_change|
+-------+----------+
| 10| 00:10:00|
| 20| 00:20:00|
| 70| 01:10:00|
| 90| 01:30:00|
+-------+----------+
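For reference, a shorter arithmetic-only sketch (not part of the original answer) that skips the intermediate timestamp column and formats the hours and minutes directly:
from pyspark.sql import functions as F

mins = F.col("minutes").cast("int")
df.withColumn(
    "min_change",
    F.format_string("%02d:%02d:%02d",
                    (mins / 60).cast("int"),  # whole hours
                    mins % 60,                # remaining minutes
                    F.lit(0))                 # seconds are always zero here
).show()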

Pyspark column: convert string to datetype

I am trying to convert a pyspark column of string type to date type as below.
**Date**
31 Mar 2020
2 Apr 2020
29 Jan 2019
8 Sep 2109
Output required:
31-03-2020
02-04-2020
29-01-2019
08-09-2109
Thanks.
You can use the dayofmonth, year and month built-in functions, or date_format(), or from_unixtime(unix_timestamp()), for this case.
Example:
#sample data
df=spark.createDataFrame([("31 Mar 2020",),("2 Apr 2020",),("29 Jan 2019",)],["Date"])
#DataFrame[Date: string]
df.show()
#+-----------+
#| Date|
#+-----------+
#|31 Mar 2020|
#| 2 Apr 2020|
#|29 Jan 2019|
#+-----------+
from pyspark.sql.functions import *
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",year(col("new_dt"))).\
withColumn("month",month(col("new_dt"))).\
withColumn("day",dayofmonth(col("new_dt"))).\
show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 3| 31|
#| 2 Apr 2020|2020-04-02|2020| 4| 2|
#|29 Jan 2019|2019-01-29|2019| 1| 29|
#+-----------+----------+----+-----+---+
#using date_format
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",date_format(col("new_dt"),"yyyy")).\
withColumn("month",date_format(col("new_dt"),"MM")).\
withColumn("day",date_format(col("new_dt"),"dd")).show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 03| 31|
#| 2 Apr 2020|2020-04-02|2020| 04| 02|
#|29 Jan 2019|2019-01-29|2019| 01| 29|
#+-----------+----------+----+-----+---+
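For completeness, here is a sketch of the from_unixtime(unix_timestamp()) variant mentioned above; it yields the dd-MM-yyyy strings requested in the question rather than a date column:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# parse the string with the input pattern, then re-format it as dd-MM-yyyy
df.withColumn("new_dt",
              from_unixtime(unix_timestamp(col("Date"), "dd MMM yyyy"), "dd-MM-yyyy")).show()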
The to_date function would need the day as 02 or ' 2' instead of 2. Therefore, we can use a regex to remove the spaces, and then, wherever the length of the string is less than the maximum (9), prepend a 0 to the string. Then we can apply to_date and use it to extract the other columns (day, month, year). You can also use date_format to keep the date in a specified format.
df.show()  # sample df
+-----------+
|       Date|
+-----------+
|31 Mar 2020|
| 2 Apr 2020|
|29 Jan 2019|
| 8 Sep 2019|
+-----------+
from pyspark.sql import functions as F
df.withColumn("regex", F.regexp_replace("Date","\ ",""))\
.withColumn("Date", F.when(F.length("regex")<9, F.concat(F.lit(0),F.col("regex")))\
.otherwise(F.col("regex"))).drop("regex")\
.withColumn("Date", F.to_date("Date",'ddMMMyyyy'))\
.withColumn("Year", F.year("Date"))\
.withColumn("Month",F.month("Date"))\
.withColumn("Day", F.dayofmonth("Date"))\
.withColumn("Date_Format2", F.date_format("Date", 'dd-MM-yyyy'))\
.show()
#output
+----------+----+-----+---+------------+
| Date|Year|Month|Day|Date_Format2|
+----------+----+-----+---+------------+
|2020-03-31|2020| 3| 31| 31-03-2020|
|2020-04-02|2020| 4| 2| 02-04-2020|
|2019-01-29|2019| 1| 29| 29-01-2019|
|2019-09-08|2019| 9| 8| 08-09-2019|
+----------+----+-----+---+------------+

Parsing through rows and isolating student records from Spark Dataframe

My student database has multiple records for each student in the table Student.
I am reading the data into a Spark dataframe and then want to iterate through the dataframe, isolate the records for each student, and do some processing on each student's records.
My code so far:
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .appName("app") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.2") \
    .getOrCreate()

class_3A = spark_session.sql("SQL")

for row in class_3A:
    # for each student
    # print Name, Age and Subject Marks
How do I do this?
Another approach would be to use SparkSQL
>>> df = spark.createDataFrame([('Ankit',25),('Jalfaizy',22),('Suresh',20),('Bala',26)],['name','age'])
>>> df.show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Suresh| 20|
| Bala| 26|
+--------+---+
>>> df.where('age > 20').show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Bala| 26|
+--------+---+
>>> from pyspark.sql.functions import *
>>> df.select('name', col('age') + 100).show()
+--------+-----------+
| name|(age + 100)|
+--------+-----------+
| Ankit| 125|
|Jalfaizy| 122|
| Suresh| 120|
| Bala| 126|
+--------+-----------+
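Since the answer mentions Spark SQL, the same filter can also be run through a temporary view (a sketch reusing the df defined above):
# Register the dataframe as a temp view so it can be queried with plain SQL.
df.createOrReplaceTempView("students")
spark.sql("SELECT name, age FROM students WHERE age > 20").show()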
Imperative approach (in addition to Bala's SQL approach):
class_3A = spark_session.sql("SQL")
def process_student(student_row):
# Do Something with student_row
return processed_student_row
#"isolate records for each student"
# Each student record will be passed to process_student function for processing.
# Results will be accumulated to a new DF - result_df
result_df = class_3A.map(process_student)
# If you don't care about results and just want to do some processing:
class_3A.foreach(process_student)
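To make the pattern above concrete, here is a minimal runnable sketch with hypothetical column names (name, age, marks), since the original SQL and schema are not shown; it assumes a SparkSession available as spark:
from pyspark.sql import Row

def process_student(student_row):
    # Example processing: build a printable summary for one student record.
    return Row(summary="{} ({}): {}".format(student_row.name,
                                            student_row.age,
                                            student_row.marks))

class_3A = spark.createDataFrame(
    [("Ankit", 25, 80), ("Bala", 26, 75)],
    ["name", "age", "marks"])

result_df = class_3A.rdd.map(process_student).toDF()
result_df.show(truncate=False)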
You can loop through each record in a dataframe and access it by its column names:
from pyspark.sql import Row
from pyspark.sql.functions import *

l = [('Ankit', 25), ('Jalfaizy', 22), ('Suresh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = spark.createDataFrame(people)
schemaPeople.show(10, False)

for row in schemaPeople.rdd.collect():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))
This will produce the output below:
+---+--------+
|age|name |
+---+--------+
|25 |Ankit |
|22 |Jalfaizy|
|20 |Suresh |
|26 |Bala |
+---+--------+
Hi Ankit your age is : 25
Hi Jalfaizy your age is : 22
Hi Suresh your age is : 20
Hi Bala your age is : 26
So you can apply whatever processing or logic you need to each record of your dataframe.
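If the dataframe is too large to collect() onto the driver, a sketch of the same loop using toLocalIterator(), which fetches rows lazily instead of materialising them all at once:
# Same loop as above, but rows are streamed to the driver one partition at a time.
for row in schemaPeople.toLocalIterator():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))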
Not sure if I understand the question right, but if you want to perform operations on
rows based on any column, you can do that using dataframe functions. Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window
sc = SparkSession.builder.appName("example").\
config("spark.driver.memory","1g").\
config("spark.executor.cores",2).\
config("spark.max.cores",4).getOrCreate()
df1 = sc.read.format("csv").option("header","true").load("test.csv")
w = Window.partitionBy("student_id")
df2 = df1.groupBy("student_id").agg(f.sum(df1["marks"]).alias("total"))
df3 = df1.withColumn("max_marks_inanysub",f.max(df1["marks"]).over(w))
df3 = df3.filter(df3["marks"] == df3["max_marks_inanysub"])
df1.show()
df3.show()
sample data
student_id,subject,marks
1,maths,3
1,science,6
2,maths,4
2,science,7
output
+----------+-------+-----+
|student_id|subject|marks|
+----------+-------+-----+
| 1| maths| 3|
| 1|science| 6|
| 2| maths| 4|
| 2|science| 7|
+----------+-------+-----+
+----------+-------+-----+------------------+
|student_id|subject|marks|max_marks_inanysub|
+----------+-------+-----+------------------+
| 1|science| 6| 6|
| 2|science| 7| 7|
+----------+-------+-----+------------------+
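If you'd rather not depend on a test.csv file, the same sample data can be built inline (a sketch reusing the column names above; note that sc here is the SparkSession created in this answer):
# Inline version of the sample data, with marks as integers.
df1 = sc.createDataFrame(
    [(1, "maths", 3), (1, "science", 6), (2, "maths", 4), (2, "science", 7)],
    ["student_id", "subject", "marks"])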

How to count the null,na and nan values in each column of pyspark dataframe

The dataframe has NA, NaN and null values.
Schema: (Name: String, Rol.No: Integer, Dept: String)
Example:
Name   Rol.No  Dept
priya  345     cse
James  NA      NaN
Null   567     NULL
Expected output: the column name and the count of null, NA and NaN values:
Name    1
Rol.No  1
Dept    2
Use when()
spark.version
'2.3.2'
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
schema = T.StructType([
    T.StructField("Name", T.StringType(), True),
    T.StructField("RolNo", T.StringType(), True),
    T.StructField("Dept", T.StringType(), True),
])
rows = sc.parallelize([("priy", "345", "cse"),
                       ("james", "NA", np.nan),
                       (None, "567", "NULL")])
myDF = spark.createDataFrame(rows, schema)
myDF.show()
+-----+-----+----+
| Name|RolNo|Dept|
+-----+-----+----+
| priy| 345| cse|
|james| NA| NaN|
| null| 567|NULL|
+-----+-----+----+
# gives you a count of NaNs, nulls, specific string values, etc. for each column
myDF = myDF.select([F.count(F.when(F.isnan(i) |
                                   F.col(i).contains('NA') |
                                   F.col(i).contains('NULL') |
                                   F.col(i).isNull(), i)).alias(i)
                    for i in myDF.columns])
myDF.show()
+----+-----+----+
|Name|RolNo|Dept|
+----+-----+----+
| 1| 1| 2|
+----+-----+----+
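To print the counts in the layout the question asks for (one line per column name), a small follow-up sketch that collects the single-row result and walks its fields:
# myDF now holds a single row of per-column counts.
counts = myDF.collect()[0].asDict()
for column_name, n in counts.items():
    print(column_name, n)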

Convert string in Spark dataframe to date. Month and date are incorrect [duplicate]

Any idea why I am getting the result below?
scala> val b = to_timestamp($"DATETIME", "ddMMMYYYY:HH:mm:ss")
b: org.apache.spark.sql.Column = to_timestamp(`DATETIME`, 'ddMMMYYYY:HH:mm:ss')
scala> sourceRawData.withColumn("ts", b).show(6,false)
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|DATETIME |LOAD_DATETIME |SOURCE_BANK|EMP_NAME|HEADER_ROW_COUNT|EMP_HOURS|ts |
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|01JAN2017:01:02:03|01JAN2017:01:02:03 | RBS | Naveen |100 |15.23 |2017-01-01 01:02:03|
|15MAR2017:01:02:03|15MAR2017:01:02:03 | RBS | Naveen |100 |115.78 |2017-01-01 01:02:03|
|02APR2015:23:24:25|02APR2015:23:24:25 | RBS |Arun |200 |2.09 |2014-12-28 23:24:25|
|28MAY2010:12:13:14| 28MAY2010:12:13:14|RBS |Arun |100 |30.98 |2009-12-27 12:13:14|
|04JUN2018:10:11:12|04JUN2018:10:11:12 |XZX | Arun |400 |12.0 |2017-12-31 10:11:12|
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
I am trying to convert DATETIME (which is in ddMMMyyyy:HH:mm:ss format) to a timestamp (shown in the last column above), but it doesn't seem to convert to the correct value.
I referred to the post below, but it didn't help:
Better way to convert a string field into timestamp in Spark
Can anyone help me?
Use y (year), not Y (week year). Because the pattern supplies a week year without any week-of-year or day-of-week field, the parser resolves each value to the start of that week year, which is why the rows above collapse onto dates near 1 January:
spark.sql("SELECT to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')").show
// +--------------------------------------------------------+
// |to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')|
// +--------------------------------------------------------+
// | 2018-06-04 10:11:12|
// +--------------------------------------------------------+
Another example:
scala> sql("select to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')").show
+-------------------------------------------------------------+
|to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')|
+-------------------------------------------------------------+
| 2020-12-08 01:24:21|
+-------------------------------------------------------------+
Try this UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{udf, lit}

val changeDtFmt = udf{ (cFormat: String,
                        rFormat: String,
                        date: String) => {
  val formatterOld = new SimpleDateFormat(cFormat)
  val formatterNew = new SimpleDateFormat(rFormat)
  formatterNew.format(formatterOld.parse(date))
}}

sourceRawData.
  withColumn("ts",
    changeDtFmt(lit("ddMMMyyyy:HH:mm:ss"), lit("yyyy-MM-dd HH:mm:ss"), $"DATETIME")).
  show(6, false)
Try the code below.
I have created a sample dataframe "df" for the table:
+---+-------------------+
| id| date|
+---+-------------------+
| 1| 01JAN2017:01:02:03|
| 2| 15MAR2017:01:02:03|
| 3|02APR2015:23:24:25 |
+---+-------------------+
val t_s= unix_timestamp($"date","ddMMMyyyy:HH:mm:ss").cast("timestamp")
df.withColumn("ts",t_s).show()
+---+-------------------+--------------------+
| id| date| ts|
+---+-------------------+--------------------+
| 1| 01JAN2017:01:02:03|2017-01-01 01:02:...|
| 2| 15MAR2017:01:02:03|2017-03-15 01:02:...|
| 3|02APR2015:23:24:25 |2015-04-02 23:24:...|
+---+-------------------+--------------------+
Thanks
