spark sql change date format using spark expr - apache-spark

I'm using PySpark 2.4; the code is below.
I have a dataframe with French month names.
I convert them to English month names in order to change the date format (the date_desired column), and everything works using two expressions:
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

data = [
    (1, "20 mai 2021"),
    (1, "21 juin 2021")
]
schema = StructType([
    StructField('montant', IntegerType(), False),
    StructField('date', StringType(), True),
])
col = ["montant", "date"]
df2 = spark.createDataFrame(data=data, schema=schema)
df2 = df2.select(col)
df2.show()
dd = df2.withColumn('date_expr', F.expr("""
    CASE WHEN rlike(date, 'mai')  THEN regexp_replace(date, 'mai', 'may')
         WHEN rlike(date, 'juin') THEN regexp_replace(date, 'juin', 'june')
         ELSE date
    END
"""))
dd = dd.withColumn('date_desired', F.expr("to_date(date_expr, 'dd MMMM yyyy')"))
dd.show()
+-------+------------+
|montant| date|
+-------+------------+
| 1| 20 mai 2021|
| 1|21 juin 2021|
+-------+------------+
+-------+------------+------------+------------+
|montant| date| date_expr|date_desired|
+-------+------------+------------+------------+
| 1| 20 mai 2021| 20 may 2021| 2021-05-20|
| 1|21 juin 2021|21 june 2021| 2021-06-21|
+-------+------------+------------+------------+
But:
I want to achieve the same result with a single expression, as below:
dd =df2.withColumn('date_expr',F.expr(" CASE WHEN rlike(date,'mai') THEN regexp_replace(date,'mai','may') \
WHEN rlike(date,'juin') THEN regexp_replace(date,'juin','june') \
ELSE date \
END as dt_col\
to_date(dt_col ,'dd MMMM yyyy')"))
but I get a SQL syntax error.
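For what it's worth, one way to get this into a single expression (a sketch, not from the original post): SQL does not allow an alias in the middle of an expression, so nest the CASE directly inside to_date instead:
dd = df2.withColumn('date_desired', F.expr("""
    to_date(
        CASE WHEN rlike(date, 'mai')  THEN regexp_replace(date, 'mai', 'may')
             WHEN rlike(date, 'juin') THEN regexp_replace(date, 'juin', 'june')
             ELSE date
        END,
        'dd MMMM yyyy')
"""))
dd.show()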

from itertools import chain
from pyspark.sql import functions as F
from pyspark.sql.functions import create_map, lit, split, array, array_join

# create the month mapping using itertools.chain
d = {'mai': "May", 'juin': "June"}
m_expr1 = create_map([lit(x) for x in chain(*d.items())])

new = (df2.withColumn('new_date', split(df2['date'], r'\s'))  # split the date string on whitespace
       .withColumn('x', F.struct(*[F.col("new_date")[i].alias(f"val{i+1}") for i in range(3)]))  # turn the parts into a struct column
       .withColumn("x", F.col("x").withField("val2", m_expr1[F.col("x.val2")]))  # map the French month to English
       .select('montant', 'date', array_join(array('x.*'), ' ').alias('newdate'))  # convert the struct column back to a string date
       .withColumn('date_desired', F.expr("to_date(newdate, 'dd MMMM yyyy')"))  # convert to a date
       ).show()
+-------+------------+------------+------------+
|montant| date| newdate|date_desired|
+-------+------------+------------+------------+
| 1| 20 mai 2021| 20 May 2021| 2021-05-20|
| 1|21 juin 2021|21 June 2021| 2021-06-21|
+-------+------------+------------+------------+
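As a side note, Column.withField used above was added in Spark 3.1, while the question targets PySpark 2.4. A sketch that stays within the 2.4 API folds the same mapping dict into chained regexp_replace calls on the df2 from the question:
from functools import reduce
from pyspark.sql import functions as F

d = {'mai': 'May', 'juin': 'June'}

# Fold the dict into nested regexp_replace calls over the date column.
mapped = reduce(lambda c, kv: F.regexp_replace(c, kv[0], kv[1]), d.items(), F.col('date'))

(df2.withColumn('newdate', mapped)
    .withColumn('date_desired', F.to_date('newdate', 'dd MMMM yyyy'))
    .show())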

Related

Change string to HH:MM:SS in PySpark

I have Column "minutes" . i want change the column to hh:mm:ss format in PySpark
Input:
minutes(string type)
10
20
70
90
output:
minutes(string type) min_change
10 00:10:00
20 00:20:00
70 01:10:00
90 01:30:00
Add a column with lit("00:00:00") and cast it to timestamp. Convert the minutes to seconds and add it to the timestamp column. Finally, use date_format() to get your desired format:
from pyspark.sql.functions import *
from pyspark.sql import functions as F
df.withColumn("minutes", col("minutes").cast("int"))\
.withColumn("min_change", lit("00:00:00").cast("timestamp"))\
.withColumn("min_change", (F.unix_timestamp("min_change") + F.col("minutes")*60).cast('timestamp'))\
.withColumn("min_change", date_format("min_change",'HH:mm:ss')).show()
+-------+----------+
|minutes|min_change|
+-------+----------+
| 10| 00:10:00|
| 20| 00:20:00|
| 70| 01:10:00|
| 90| 01:30:00|
+-------+----------+
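For reference, a shorter arithmetic-only sketch (not part of the original answer) that skips the intermediate timestamp column and formats the hours and minutes directly:
from pyspark.sql import functions as F

mins = F.col("minutes").cast("int")
df.withColumn(
    "min_change",
    F.format_string("%02d:%02d:%02d",
                    (mins / 60).cast("int"),  # whole hours
                    mins % 60,                # remaining minutes
                    F.lit(0))                 # seconds are always zero here
).show()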

Pyspark column: convert string to datetype

I am trying to convert a pyspark column of string type to date type as below.
**Date**
31 Mar 2020
2 Apr 2020
29 Jan 2019
8 Sep 2109
Output required:
31-03-2020
02-04-2020
29-01-2019
08-09-2109
Thanks.
You can use the dayofmonth, year and month built-in functions, or date_format(), or from_unixtime(unix_timestamp()), for this case.
Example:
#sample data
df=spark.createDataFrame([("31 Mar 2020",),("2 Apr 2020",),("29 Jan 2019",)],["Date"])
#DataFrame[Date: string]
df.show()
#+-----------+
#| Date|
#+-----------+
#|31 Mar 2020|
#| 2 Apr 2020|
#|29 Jan 2019|
#+-----------+
from pyspark.sql.functions import *
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",year(col("new_dt"))).\
withColumn("month",month(col("new_dt"))).\
withColumn("day",dayofmonth(col("new_dt"))).\
show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 3| 31|
#| 2 Apr 2020|2020-04-02|2020| 4| 2|
#|29 Jan 2019|2019-01-29|2019| 1| 29|
#+-----------+----------+----+-----+---+
#using date_format
df.withColumn("new_dt", to_date(col("Date"),"dd MMM yyyy")).\
withColumn("year",date_format(col("new_dt"),"yyyy")).\
withColumn("month",date_format(col("new_dt"),"MM")).\
withColumn("day",date_format(col("new_dt"),"dd")).show()
#+-----------+----------+----+-----+---+
#| Date| new_dt|year|month|day|
#+-----------+----------+----+-----+---+
#|31 Mar 2020|2020-03-31|2020| 03| 31|
#| 2 Apr 2020|2020-04-02|2020| 04| 02|
#|29 Jan 2019|2019-01-29|2019| 01| 29|
#+-----------+----------+----+-----+---+
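For completeness, here is a sketch of the from_unixtime(unix_timestamp()) variant mentioned above; it yields the dd-MM-yyyy strings requested in the question rather than a date column:
from pyspark.sql.functions import col, from_unixtime, unix_timestamp

# parse the string with the input pattern, then re-format it as dd-MM-yyyy
df.withColumn("new_dt",
              from_unixtime(unix_timestamp(col("Date"), "dd MMM yyyy"), "dd-MM-yyyy")).show()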
The to_date function would need the day as 02 or ' 2' instead of 2. Therefore, we can use a regex to remove the spaces, and then, wherever the length of the string is less than the maximum (9), prepend a 0 to the string. Then we can apply to_date and use it to extract the other columns (day, month, year). You can also use date_format to keep the date in a specified format.
df.show()  # sample df
+-----------+
|       Date|
+-----------+
|31 Mar 2020|
| 2 Apr 2020|
|29 Jan 2019|
| 8 Sep 2019|
+-----------+
from pyspark.sql import functions as F
df.withColumn("regex", F.regexp_replace("Date","\ ",""))\
.withColumn("Date", F.when(F.length("regex")<9, F.concat(F.lit(0),F.col("regex")))\
.otherwise(F.col("regex"))).drop("regex")\
.withColumn("Date", F.to_date("Date",'ddMMMyyyy'))\
.withColumn("Year", F.year("Date"))\
.withColumn("Month",F.month("Date"))\
.withColumn("Day", F.dayofmonth("Date"))\
.withColumn("Date_Format2", F.date_format("Date", 'dd-MM-yyyy'))\
.show()
#output
+----------+----+-----+---+------------+
| Date|Year|Month|Day|Date_Format2|
+----------+----+-----+---+------------+
|2020-03-31|2020| 3| 31| 31-03-2020|
|2020-04-02|2020| 4| 2| 02-04-2020|
|2019-01-29|2019| 1| 29| 29-01-2019|
|2019-09-08|2019| 9| 8| 08-09-2019|
+----------+----+-----+---+------------+

Parsing through rows and isolating student records from Spark Dataframe

My student database has multiple records for each student in the table Student.
I am reading the data into a Spark dataframe and then want to iterate through the dataframe, isolate the records for each student, and do some processing on each student's records.
My code so far:
from pyspark.sql import SparkSession

spark_session = SparkSession \
    .builder \
    .appName("app") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.2") \
    .getOrCreate()

class_3A = spark_session.sql("SQL")

for row in class_3A:
    # for each student
    # print Name, Age and Subject Marks
How do I do this?
Another approach would be to use SparkSQL
>>> df = spark.createDataFrame([('Ankit',25),('Jalfaizy',22),('Suresh',20),('Bala',26)],['name','age'])
>>> df.show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Suresh| 20|
| Bala| 26|
+--------+---+
>>> df.where('age > 20').show()
+--------+---+
| name|age|
+--------+---+
| Ankit| 25|
|Jalfaizy| 22|
| Bala| 26|
+--------+---+
>>> from pyspark.sql.functions import *
>>> df.select('name', col('age') + 100).show()
+--------+-----------+
| name|(age + 100)|
+--------+-----------+
| Ankit| 125|
|Jalfaizy| 122|
| Suresh| 120|
| Bala| 126|
+--------+-----------+
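Since the answer mentions Spark SQL, the same filter can also be run through a temporary view (a sketch reusing the df defined above):
# Register the dataframe as a temp view so it can be queried with plain SQL.
df.createOrReplaceTempView("students")
spark.sql("SELECT name, age FROM students WHERE age > 20").show()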
Imperative approach (in addition to Bala's SQL approach):
class_3A = spark_session.sql("SQL")
def process_student(student_row):
# Do Something with student_row
return processed_student_row
#"isolate records for each student"
# Each student record will be passed to process_student function for processing.
# Results will be accumulated to a new DF - result_df
result_df = class_3A.map(process_student)
# If you don't care about results and just want to do some processing:
class_3A.foreach(process_student)
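To make the pattern above concrete, here is a minimal runnable sketch with hypothetical column names (name, age, marks), since the original SQL and schema are not shown; it assumes a SparkSession available as spark:
from pyspark.sql import Row

def process_student(student_row):
    # Example processing: build a printable summary for one student record.
    return Row(summary="{} ({}): {}".format(student_row.name,
                                            student_row.age,
                                            student_row.marks))

class_3A = spark.createDataFrame(
    [("Ankit", 25, 80), ("Bala", 26, 75)],
    ["name", "age", "marks"])

result_df = class_3A.rdd.map(process_student).toDF()
result_df.show(truncate=False)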
You can loop through each record in a dataframe and access it by its column names:
from pyspark.sql import Row
from pyspark.sql.functions import *

l = [('Ankit', 25), ('Jalfaizy', 22), ('Suresh', 20), ('Bala', 26)]
rdd = sc.parallelize(l)
people = rdd.map(lambda x: Row(name=x[0], age=int(x[1])))
schemaPeople = spark.createDataFrame(people)
schemaPeople.show(10, False)

for row in schemaPeople.rdd.collect():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))
This will produce the output below:
+---+--------+
|age|name |
+---+--------+
|25 |Ankit |
|22 |Jalfaizy|
|20 |Suresh |
|26 |Bala |
+---+--------+
Hi Ankit your age is : 25
Hi Jalfaizy your age is : 22
Hi Suresh your age is : 20
Hi Bala your age is : 26
So you can apply whatever processing or logic you need to each record of your dataframe.
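If the dataframe is too large to collect() onto the driver, a sketch of the same loop using toLocalIterator(), which fetches rows lazily instead of materialising them all at once:
# Same loop as above, but rows are streamed to the driver one partition at a time.
for row in schemaPeople.toLocalIterator():
    print("Hi " + str(row.name) + " your age is : " + str(row.age))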
Not sure if I understand the question right, but if you want to perform operations on
rows based on any column, you can do that using dataframe functions. Example:
from pyspark.sql import SparkSession
import pyspark.sql.functions as f
from pyspark.sql import Window
sc = SparkSession.builder.appName("example").\
config("spark.driver.memory","1g").\
config("spark.executor.cores",2).\
config("spark.max.cores",4).getOrCreate()
df1 = sc.read.format("csv").option("header","true").load("test.csv")
w = Window.partitionBy("student_id")
df2 = df1.groupBy("student_id").agg(f.sum(df1["marks"]).alias("total"))
df3 = df1.withColumn("max_marks_inanysub",f.max(df1["marks"]).over(w))
df3 = df3.filter(df3["marks"] == df3["max_marks_inanysub"])
df1.show()
df3.show()
sample data
student_id,subject,marks
1,maths,3
1,science,6
2,maths,4
2,science,7
output
+----------+-------+-----+
|student_id|subject|marks|
+----------+-------+-----+
| 1| maths| 3|
| 1|science| 6|
| 2| maths| 4|
| 2|science| 7|
+----------+-------+-----+
+----------+-------+-----+------------------+
|student_id|subject|marks|max_marks_inanysub|
+----------+-------+-----+------------------+
| 1|science| 6| 6|
| 2|science| 7| 7|
+----------+-------+-----+------------------+
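If you'd rather not depend on a test.csv file, the same sample data can be built inline (a sketch reusing the column names above; note that sc here is the SparkSession created in this answer):
# Inline version of the sample data, with marks as integers.
df1 = sc.createDataFrame(
    [(1, "maths", 3), (1, "science", 6), (2, "maths", 4), (2, "science", 7)],
    ["student_id", "subject", "marks"])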

How to count the null,na and nan values in each column of pyspark dataframe

The dataframe has NA, NaN and null values.
Schema: (Name: String, Rol.No: Integer, Dept: String)
Example:
Name   Rol.No  Dept
priya  345     cse
James  NA      NaN
Null   567     NULL
Expected output: the column name and the count of null, NA and NaN values:
Name    1
Rol.No  1
Dept    2
Use when()
spark.version
'2.3.2'
import numpy as np
import pyspark.sql.functions as F
import pyspark.sql.types as T
schema = T.StructType([
    T.StructField("Name", T.StringType(), True),
    T.StructField("RolNo", T.StringType(), True),
    T.StructField("Dept", T.StringType(), True),
])
rows = sc.parallelize([("priy", "345", "cse"),
                       ("james", "NA", np.nan),
                       (None, "567", "NULL")])
myDF = spark.createDataFrame(rows, schema)
myDF.show()
+-----+-----+----+
| Name|RolNo|Dept|
+-----+-----+----+
| priy| 345| cse|
|james| NA| NaN|
| null| 567|NULL|
+-----+-----+----+
# gives you a count of NaNs, nulls, specific string values, etc. for each column
myDF = myDF.select([F.count(F.when(F.isnan(i) |
                                   F.col(i).contains('NA') |
                                   F.col(i).contains('NULL') |
                                   F.col(i).isNull(), i)).alias(i)
                    for i in myDF.columns])
myDF.show()
+----+-----+----+
|Name|RolNo|Dept|
+----+-----+----+
| 1| 1| 2|
+----+-----+----+
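To print the counts in the layout the question asks for (one line per column name), a small follow-up sketch that collects the single-row result and walks its fields:
# myDF now holds a single row of per-column counts.
counts = myDF.collect()[0].asDict()
for column_name, n in counts.items():
    print(column_name, n)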

Convert string in Spark dataframe to date. Month and date are incorrect [duplicate]

Any idea why I am getting the result below?
scala> val b = to_timestamp($"DATETIME", "ddMMMYYYY:HH:mm:ss")
b: org.apache.spark.sql.Column = to_timestamp(`DATETIME`, 'ddMMMYYYY:HH:mm:ss')
scala> sourceRawData.withColumn("ts", b).show(6,false)
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|DATETIME |LOAD_DATETIME |SOURCE_BANK|EMP_NAME|HEADER_ROW_COUNT|EMP_HOURS|ts |
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
|01JAN2017:01:02:03|01JAN2017:01:02:03 | RBS | Naveen |100 |15.23 |2017-01-01 01:02:03|
|15MAR2017:01:02:03|15MAR2017:01:02:03 | RBS | Naveen |100 |115.78 |2017-01-01 01:02:03|
|02APR2015:23:24:25|02APR2015:23:24:25 | RBS |Arun |200 |2.09 |2014-12-28 23:24:25|
|28MAY2010:12:13:14| 28MAY2010:12:13:14|RBS |Arun |100 |30.98 |2009-12-27 12:13:14|
|04JUN2018:10:11:12|04JUN2018:10:11:12 |XZX | Arun |400 |12.0 |2017-12-31 10:11:12|
+------------------+-------------------+-----------+--------+----------------+---------+-------------------+
I am trying to convert DATETIME (which is in ddMMMyyyy:HH:mm:ss format) to a timestamp (shown in the last column above), but it doesn't seem to convert to the correct value.
I referred to the post below, but it didn't help:
Better way to convert a string field into timestamp in Spark
Can anyone help me?
Use y (year), not Y (week year). Because the pattern supplies a week year without any week-of-year or day-of-week field, the parser resolves each value to the start of that week year, which is why the rows above collapse onto dates near 1 January:
spark.sql("SELECT to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')").show
// +--------------------------------------------------------+
// |to_timestamp('04JUN2018:10:11:12', 'ddMMMyyyy:HH:mm:ss')|
// +--------------------------------------------------------+
// | 2018-06-04 10:11:12|
// +--------------------------------------------------------+
Another example:
scala> sql("select to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')").show
+-------------------------------------------------------------+
|to_timestamp('12/08/2020 1:24:21 AM', 'MM/dd/yyyy H:mm:ss a')|
+-------------------------------------------------------------+
| 2020-12-08 01:24:21|
+-------------------------------------------------------------+
Try this UDF:
import java.text.SimpleDateFormat
import org.apache.spark.sql.functions.{udf, lit}

val changeDtFmt = udf{ (cFormat: String,
                        rFormat: String,
                        date: String) => {
  val formatterOld = new SimpleDateFormat(cFormat)
  val formatterNew = new SimpleDateFormat(rFormat)
  formatterNew.format(formatterOld.parse(date))
}}

sourceRawData.
  withColumn("ts",
    changeDtFmt(lit("ddMMMyyyy:HH:mm:ss"), lit("yyyy-MM-dd HH:mm:ss"), $"DATETIME")).
  show(6, false)
Try the code below.
I have created a sample dataframe "df" for the table:
+---+-------------------+
| id| date|
+---+-------------------+
| 1| 01JAN2017:01:02:03|
| 2| 15MAR2017:01:02:03|
| 3|02APR2015:23:24:25 |
+---+-------------------+
val t_s= unix_timestamp($"date","ddMMMyyyy:HH:mm:ss").cast("timestamp")
df.withColumn("ts",t_s).show()
+---+-------------------+--------------------+
| id| date| ts|
+---+-------------------+--------------------+
| 1| 01JAN2017:01:02:03|2017-01-01 01:02:...|
| 2| 15MAR2017:01:02:03|2017-03-15 01:02:...|
| 3|02APR2015:23:24:25 |2015-04-02 23:24:...|
+---+-------------------+--------------------+
Thanks
