How to convert a date string to a timestamp in Spark?

%scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}
Seq(("20110813"),("20090724")).toDF("Date").select(
  col("Date"),
  to_date(col("Date"), "yyyy-mm-dd").as("to_date")
).show()
+--------+-------+
| Date|to_date|
+--------+-------+
|20110813| null|
|20090724| null|
+--------+-------+
Seq(("20110813"),("20090724")).toDF("Date").select(
  col("Date"),
  to_date(col("Date"), "yyyymmdd").as("to_date")
).show()
+--------+----------+
|    Date|   to_date|
+--------+----------+
|20110813|2011-01-13|
|20090724|2009-01-24|
+--------+----------+
I am trying to convert a string to a timestamp, but I always get null or wrong default values back in the date column.

You should use withColumn to add the new date column derived from the existing Date column. Cast the value to a string and parse it with the correct pattern: uppercase MM is month (lowercase mm means minutes), and the input 20110813 contains no dashes.
import org.apache.spark.sql.functions.{col, to_date}
import org.apache.spark.sql.types._

val df = Seq((20110813), (20090724)).toDF("Date")
// cast the integer to a string and parse it with yyyyMMdd (MM = month; mm would mean minutes)
val newDf = df.withColumn("to_date", to_date(col("Date").cast(StringType), "yyyyMMdd"))
newDf.show()
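Since the title asks for a timestamp rather than a date, here is a minimal sketch using to_timestamp with the same pattern (assuming Spark 2.2+, where to_timestamp is available; the values come out as midnight of each parsed date):
import org.apache.spark.sql.functions.{col, to_timestamp}

Seq(("20110813"), ("20090724")).toDF("Date")
  .withColumn("ts", to_timestamp(col("Date"), "yyyyMMdd")) // TimestampType column
  .show()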

Related

Pyspark: create a lag column

I am using PySpark and obtained a table as below, table_1:
+--------+------------------+---------------+----------+-----+
| Number|Institution | datetime| date| time|
+--------+------------------+---------------+----------+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|
+--------+------------------+---------------+----------+-----+
I would like to add a column to table_1 that lags the time by 15 minutes:
+--------+------------------+---------------+----------+-----+-----+
| Number|Institution | datetime| date| time|lag1 |
+--------+------------------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|7:30 |
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+------------------+---------------+----------+-----+-----+
from datetime import datetime, timedelta
table_2 = table_1.withColumn('lag1', (datetime.strptime(table_1['time'], '%H:%M') - timedelta(minutes=15)).strftime('%H:%M'))
The code above works on a plain string, but I have no idea why it cannot be applied to the table in this case. It throws "TypeError: strptime() argument 1 must be str, not Column". Is there a way to obtain a string from a column in PySpark? Thanks!
You can't use Python functions directly on Spark dataframe columns. You can use Spark SQL functions instead, as shown below:
import pyspark.sql.functions as F
df2 = df.withColumn(
    'lag1',
    F.expr("date_format(to_timestamp(time, 'H:m') - interval 15 minute, 'H:m')")
)
df2.show()
+--------+-----------+---------------+----------+-----+-----+
| Number|Institution| datetime| date| time| lag1|
+--------+-----------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45| 7:30|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+-----------+---------------+----------+-----+-----+
Alternatively, you can call the Python function as a UDF (but the performance should be worse than calling Spark SQL functions directly):
import pyspark.sql.functions as F
from datetime import datetime, timedelta
lag = F.udf(lambda t: (datetime.strptime(t, '%H:%M') - timedelta(minutes=15)).strftime('%H:%M'))
df2 = df.withColumn('lag1', lag('time'))
df2.show()
+--------+-----------+---------------+----------+-----+-----+
| Number|Institution| datetime| date| time| lag1|
+--------+-----------+---------------+----------+-----+-----+
|AE19075B| ABC| 7/20/2019 7:45|07/20/2019| 7:45|07:30|
|AE11688U| CBT|2/11/2019 20:31|02/11/2019|20:31|20:16|
+--------+-----------+---------------+----------+-----+-----+
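If you need the lag relative to the full datetime column rather than just the time of day, a similar expression should work; this is a sketch assuming the M/d/yyyy H:m format shown in the sample data and Spark 2.2+:
import pyspark.sql.functions as F

# subtract 15 minutes from the parsed datetime column
df3 = df.withColumn(
    'lag_datetime',
    F.expr("to_timestamp(datetime, 'M/d/yyyy H:m') - interval 15 minutes")
)
df3.show(truncate=False)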

Promote Row 1 as Column Heading - Spark DataFrame

I have the Spark DataFrame shown below. I want to promote Row 1 to the column headings, so that the result is a new Spark DataFrame.
I know this can be done in pandas easily as:
new_header = pandaDF.iloc[0]
pandaDF = pandaDF[1:]
pandaDF.columns = new_header
But I don't want to convert it to a pandas DataFrame, because I have to persist this to a database, which would mean converting the pandas DataFrame back to a Spark DataFrame, registering it as a table, and then writing it to the database.
Try with .toDF and filter out the rows that match the column values.
Example:
#sample dataframe
df.show()
#+----------+------------+----------+
#| prop_0| prop_1| prop_2|
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
#+----------+------------+----------+
from pyspark.sql.functions import col
#use the first row's values as the new column names
cols = list(df.first())
df.toDF(*cols).filter(~col("station_id").isin(*cols)).show()
#+----------+------------+----------+
#|station_id|station_name|sample_num|
#+----------+------------+----------+
#| 101| Station101| Sample101|
#| 102| Station102| Sample102|
#+----------+------------+----------+
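If you would rather not filter by the header values themselves (for example, if a data row could legitimately contain one of them), here is a minimal sketch using zipWithIndex to drop the first row; it assumes the header really is the first row in the DataFrame's current order:
#take the first row's values as the new column names
header = list(df.first())

new_df = (df.rdd
            .zipWithIndex()                    # pair each Row with its index
            .filter(lambda pair: pair[1] > 0)  # drop the header row (index 0)
            .map(lambda pair: pair[0])
            .toDF(header))
new_df.show()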

PYSPARK: how can I update a value in a column based on a condition

Given a table with two columns: DEVICEID and DEVICETYPE
How can I update column DEVICETYPE if the string length in DEVICEID is 5:
from pyspark.sql.functions import *
df.where(length(col("DEVICEID")) == 5).show()
Use a when + otherwise statement: check whether the length of DEVICEID is 5 and, if so, supply the new value.
Example:
df=spark.createDataFrame([('abcde',1),('abc',2)],["DEVICEID","DEVICETYPE"])
from pyspark.sql.functions import *
df.withColumn("new_col",when(length(col("deviceid")) ==5,lit("new_length")).otherwise(col("DEVICEID"))).show()
#+--------+----------+----------+
#|DEVICEID|DEVICETYPE| new_col|
#+--------+----------+----------+
#| abcde| 1|new_length|
#| abc| 2| abc|
#+--------+----------+----------+
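If the goal is to update the DEVICETYPE column in place rather than add a new column, the same pattern can target DEVICETYPE directly. A minimal sketch, assuming "new_length" is the replacement value you want (Spark will coerce the column to string when mixing it with the existing integer values):
from pyspark.sql.functions import when, length, col, lit

df2 = df.withColumn(
    "DEVICETYPE",
    # keep the original DEVICETYPE when the length condition is not met
    when(length(col("DEVICEID")) == 5, lit("new_length")).otherwise(col("DEVICETYPE"))
)
df2.show()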

Pyspark DataFrame: Split column with multiple values into rows

I have a dataframe (with more rows and columns) as shown below.
Sample DF:
from pyspark import Row
from pyspark.sql import SQLContext
from pyspark.sql.functions import explode
sqlc = SQLContext(sc)
df = sqlc.createDataFrame([Row(col1 = 'z1', col2 = '[a1, b2, c3]', col3 = 'foo')])
# +------+-------------+------+
# | col1| col2| col3|
# +------+-------------+------+
# | z1| [a1, b2, c3]| foo|
# +------+-------------+------+
df
# DataFrame[col1: string, col2: string, col3: string]
What I want:
+-----+-----+-----+
| col1| col2| col3|
+-----+-----+-----+
| z1| a1| foo|
| z1| b2| foo|
| z1| c3| foo|
+-----+-----+-----+
I tried to replicate the RDD solution provided here: Pyspark: Split multiple array columns into rows
(df
.rdd
.flatMap(lambda row: [(row.col1, col2, row.col3) for col2 in row.col2])
.toDF(["col1", "col2", "col3"]))
However, it is not giving the required result
Edit: The explode option does not work because col2 is currently stored as a string and the explode function expects an array.
You can use explode but first you'll have to convert the string representation of the array into an array.
One way is to use regexp_replace to remove the leading and trailing square brackets, followed by split on ", ".
from pyspark.sql.functions import col, explode, regexp_replace, split
df.withColumn(
    "col2",
    explode(split(regexp_replace(col("col2"), r"(^\[)|(\]$)", ""), ", "))
).show()
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| z1| a1| foo|
#| z1| b2| foo|
#| z1| c3| foo|
#+----+----+----+
Pault's solution should work perfectly fine, although here is another solution that uses regexp_extract instead (you don't really need to replace anything in this case), and it can handle an arbitrary number of spaces:
from pyspark.sql.functions import col, explode, regexp_extract, regexp_replace, split

df.withColumn(
    "col2",
    explode(
        split(
            regexp_extract(regexp_replace(col("col2"), r"\s", ""), r"^\[(.*)\]$", 1),
            ","
        )
    )
).show()
Explanation:
Explanation:
First, regexp_replace(col("col2"), r"\s", "") replaces all whitespace with the empty string.
Next, regexp_extract extracts the content of the column that starts with [ and ends with ].
Then we split on the commas and finally explode.
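To make each intermediate step visible, here is a minimal sketch (assuming the same df as above) that keeps every stage as its own column before exploding:
from pyspark.sql.functions import col, regexp_replace, regexp_extract, split, explode

(df
 .withColumn("no_spaces", regexp_replace(col("col2"), r"\s", ""))          # "[a1,b2,c3]"
 .withColumn("inner", regexp_extract(col("no_spaces"), r"^\[(.*)\]$", 1))  # "a1,b2,c3"
 .withColumn("as_array", split(col("inner"), ","))                         # ["a1","b2","c3"]
 .withColumn("col2_exploded", explode(col("as_array")))                    # one row per value
 .show(truncate=False))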

How to add a value to the date field using a DataFrame in Spark

I have date values (yyyy/mm/dd) in my DataFrame, and I need to find the next 7 days of data. How can I do it using a DataFrame in Spark?
For example, I have data like below:
23/01/2018 , 23
24/01/2018 , 21
25/01/2018, 44
...
29/01/2018,17
I need to get the next 7 days of data including today (starting from the minimum date in the data). So in my example I need the dates from 2018/01/23 up to 7 days ahead. Is there any way to achieve this?
Note: I need to find the minimum date in the data and filter to that minimum date + 7 days of data.
scala> df.show
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
|17/01/2019| 10| BBC|
+----------+---+-------+
scala> val df2 = df.select("*").filter( to_date(replaceUDF('data_date)) > date_add(to_date(replaceUDF(lit(minDate))),7))
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_date: string, vol: int ... 1 more field]
scala> df2.show
+---------+---+-------+
|data_date|vol|channel|
+---------+---+-------+
+---------+---+-------+
I need data as below: the minimum date is 02/01/2019, so minimum date + 7 is 09/01/2019. I need the data between 02/01/2019 and 09/01/2019.
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
+----------+---+-------+
Can someone help, as I am a beginner with Spark?
Import the statement below:
import org.apache.spark.sql.functions._
Code Snippet
val minDate = df.agg(min($"date1")).collect()(0).get(0)
val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))
df2.show()
For data
val data = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25))
Output would be
+----------+---+
| date1|day|
+----------+---+
|2018/02/20| 25|
+----------+---+
If you are looking for different output, please update your question with the expected results.
Below is a complete program for your reference
package com.nelamalli.spark.dataframe
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
object DataFrameUDF {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExample")
      .getOrCreate()

    import spark.sqlContext.implicits._

    val data = Seq(("2018/01/23", 23), ("2018/01/24", 24), ("2018/02/20", 25))
    val df = data.toDF("date1", "day")

    // minimum date in the data
    val minDate = df.agg(min($"date1")).collect()(0).get(0)

    // rows strictly more than 7 days after the minimum date
    val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))

    df2.show()
  }
}
Thanks
Your question is still unclear. I'm borrowing the input from @Naveen, and you can get the same results without UDFs. Check this out:
scala> val df = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25)).toDF("dt","day").withColumn("dt",to_date(regexp_replace('dt,"/","-")))
df: org.apache.spark.sql.DataFrame = [dt: date, day: int]
scala> df.show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-01-23|23 |
|2018-01-24|24 |
|2018-02-20|25 |
+----------+---+
scala> val mindt = df.groupBy().agg(min('dt)).as[(java.sql.Date)].first
mindt: java.sql.Date = 2018-01-23
scala> df.filter('dt > date_add(lit(mindt),7)).show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-02-20|25 |
+----------+---+
scala>
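If, as the expected output in the question suggests, you want the rows within 7 days of the minimum date rather than after it, flipping the comparison should do it (a sketch using the same df and mindt as in the transcript above):
// keep rows from the minimum date up to and including minDate + 7 days
df.filter('dt <= date_add(lit(mindt), 7)).show(false)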
