How do I change a string to HH:mm:ss only in Spark?

I am getting the time as a string like 134455 and I need to convert it into 13:44:55 using Spark SQL. How can I get this into the right format?

You can try the regexp_replace function.
scala> val df = Seq((134455 )).toDF("ts_str")
df: org.apache.spark.sql.DataFrame = [ts_str: int]
scala> df.show(false)
+------+
|ts_str|
+------+
|134455|
+------+
scala> df.withColumn("ts",regexp_replace('ts_str,"""(\d\d)""","$1:")).show(false)
+------+---------+
|ts_str|ts |
+------+---------+
|134455|13:44:55:|
+------+---------+
scala> df.withColumn("ts",trim(regexp_replace('ts_str,"""(\d\d)""","$1:"),":")).show(false)
+------+--------+
|ts_str|ts |
+------+--------+
|134455|13:44:55|
+------+--------+
scala>
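An alternative that avoids the trailing colon and the trim step is to anchor the regex and emit the colons from capture groups. A minimal sketch, assuming the input is always exactly six digits and a spark-shell session (so the implicits are in scope):
import org.apache.spark.sql.functions._

// One anchored regex with three capture groups; no trailing colon, so no trim needed.
df.withColumn("ts", regexp_replace(col("ts_str"), """^(\d{2})(\d{2})(\d{2})$""", "$1:$2:$3")).show(false)
+------+--------+
|ts_str|ts      |
+------+--------+
|134455|13:44:55|
+------+--------+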

val df = Seq("133456").toDF
+------+
| value|
+------+
|133456|
+------+
df.withColumn("value", unix_timestamp('value, "HHmmss"))
.withColumn("value", from_unixtime('value, "HH:mm:ss"))
.show
+--------+
| value|
+--------+
|13:34:56|
+--------+
Note that a unix timestamp is stored as the number of seconds since 00:00:00 UTC, 1 January 1970. If you try to convert a time with millisecond accuracy to a timestamp, you will lose the millisecond part of the time. For times including milliseconds you will need a different approach, such as the string-based sketch below.
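If the incoming string does carry milliseconds, a pure string manipulation keeps them intact. A minimal sketch, assuming a nine-digit HHmmssSSS input (that format is my assumption, not something stated in the question):
import org.apache.spark.sql.functions._

// Assumed format HHmmssSSS, e.g. 134455123 -> 13:44:55.123
val dfMs = Seq("134455123").toDF("ts_str")
dfMs.withColumn("ts", regexp_replace(col("ts_str"), """^(\d{2})(\d{2})(\d{2})(\d{3})$""", "$1:$2:$3.$4")).show(false)
+---------+------------+
|ts_str   |ts          |
+---------+------------+
|134455123|13:44:55.123|
+---------+------------+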

Related

pyspark: read partitioned parquet "my_file.parquet/col1=NOW" string value replaced by <current_time> on read()

With pyspark 3.1.1 on WSL Debian 10.
When reading a parquet file partitioned by a column containing the string NOW, the string is replaced by the current time at the moment the read() function is executed. I suppose the NOW string is being interpreted as now().
# step to reproduce
df = spark.createDataFrame(data=[("NOW",1), ("TEST", 2)], schema = ["col1", "id"])
df.write.partitionBy("col1").parquet("test/test.parquet")
>>> /home/test/test.parquet/col1=NOW
df_loaded = spark.read.option(
"basePath",
"test/test.parquet",
).parquet("test/test.parquet/col1=*")
df_loaded.show(truncate=False)
>>>
+---+--------------------------+
|id |col1 |
+---+--------------------------+
|2 |TEST |
|1 |2021-04-18 14:36:46.532273|
+---+--------------------------+
Is that a bug or normal pyspark behaviour?
If the latter, is there a sparkContext option to avoid that behaviour?
I suspect that's an expected feature... but I'm not sure where it was documented. Anyway, if you want to keep the column as a string column, you can provide a schema while reading the parquet file:
df = spark.read.schema("id long, col1 string").parquet("test/test.parquet")
df.show()
+---+----+
| id|col1|
+---+----+
| 1| NOW|
| 2|TEST|
+---+----+
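If you would rather not spell out a schema, there is also a session configuration that disables partition column type inference, so partition values stay strings. A sketch, shown here in Scala (the spark.conf.set call is the same from pyspark); worth verifying against your Spark version:
// Assumption to verify: with inference off, every partition column is read back as a string.
spark.conf.set("spark.sql.sources.partitionColumnTypeInference.enabled", "false")
val dfLoaded = spark.read.parquet("test/test.parquet")
dfLoaded.show(false)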

Ingest "t" and "f" as boolean to Cassandra

I use pyspark to load a CSV as a dataframe, then save it to Cassandra. One of the columns is defined as boolean in Cassandra's schema, but my actual data in the CSV are the strings t or f. Is there any chance I can make Cassandra recognize t and f as boolean? Otherwise I have to add a data transformation step.
The Spark Cassandra Connector uses a String.toBoolean call to convert strings to boolean values, but that accepts only true and false and throws an exception for any other string. So you'll need to write a small data transformation, like this:
scala> val df = Seq((1, "t"), (2, "f"), (3, "t")).toDF("id", "b")
df: org.apache.spark.sql.DataFrame = [id: int, b: string]
scala> val df2 = df.withColumn("b", $"b" === "t")
df2: org.apache.spark.sql.DataFrame = [id: int, b: boolean]
scala> df2.show()
+---+-----+
| id| b|
+---+-----+
| 1| true|
| 2|false|
| 3| true|
+---+-----+
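If the column can also contain values other than t or f, the plain comparison above would silently turn them into false. A sketch of a slightly stricter variant that maps anything unexpected to null instead (accepting upper-case letters is my own addition, not something from the question):
import org.apache.spark.sql.functions._

// t/T -> true, f/F -> false, anything else -> null
val df3 = df.withColumn("b",
  when(lower($"b") === "t", true)
    .when(lower($"b") === "f", false)
    .otherwise(lit(null).cast("boolean")))
df3.show()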

How to add a value to a date field using a dataframe in Spark

I have date values (yyyy/mm/dd) in my dataframe. I need to find the next 7 days of data. How can I do it using a dataframe in Spark?
For example, I have data like below:
23/01/2018 , 23
24/01/2018 , 21
25/01/2018, 44
...
29/01/2018,17
I need to get the next 7 days of data, including today (starting from the minimum date in the data). So in my example I need the date 2018/01/23 plus the 7 days ahead of it. Is there any way to achieve this?
Note: I need to find the minimum date in the data and then filter for that minimum date + 7 days of data.
scala> df.show
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
|17/01/2019| 10| BBC|
+----------+---+-------+
scala> val df2 = df.select("*").filter( to_date(replaceUDF('data_date)) > date_add(to_date(replaceUDF(lit(minDate))),7))
df2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [data_date: string, vol: int ... 1 more field]
scala> df2.show
+---------+---+-------+
|data_date|vol|channel|
+---------+---+-------+
+---------+---+-------+
I need data as below: the minimum date is 02/01/2019, so I need the data from that minimum date up to minimum date + 7 days:
+----------+---+-------+
| data_date|vol|channel|
+----------+---+-------+
|05/01/2019| 10| ABC|
|05/01/2019| 20| CNN|
|06/01/2019| 10| BBC|
|07/01/2019| 10| ABC|
|02/01/2019| 20| CNN|
+----------+---+-------+
Can someone help, as I am a beginner with Spark?
Import the statement below:
import org.apache.spark.sql.functions._
Code snippet:
val minDate = df.agg(min($"date1")).collect()(0).get(0)
val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))
df2.show()
For the data
val data = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25))
the output would be:
+----------+---+
| date1|day|
+----------+---+
|2018/02/20| 25|
+----------+---+
If you are looking for different output, please update your question with the expected results.
Below is a complete program for your reference
package com.nelamalli.spark.dataframe

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object DataFrameUDF {

  def main(args: Array[String]): Unit = {

    val spark: SparkSession = SparkSession.builder()
      .master("local[3]")
      .appName("SparkByExample")
      .getOrCreate()

    val data = Seq(("2018/01/23", 23), ("2018/01/24", 24), ("2018/02/20", 25))

    import spark.sqlContext.implicits._
    val df = data.toDF("date1", "day")

    val minDate = df.agg(min($"date1")).collect()(0).get(0)
    val df2 = df.select("*").filter(to_date(regexp_replace('date1, "/", "-")) > date_add(to_date(regexp_replace(lit(minDate), "/", "-")), 7))

    df2.show()
  }
}
Thanks
Your question is still unclear. I'm borrowing the input from @Naveen, and you can get the same results without UDFs. Check this out:
scala> val df = Seq(("2018/01/23",23),("2018/01/24",24),("2018/02/20",25)).toDF("dt","day").withColumn("dt",to_date(regexp_replace('dt,"/","-")))
df: org.apache.spark.sql.DataFrame = [dt: date, day: int]
scala> df.show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-01-23|23 |
|2018-01-24|24 |
|2018-02-20|25 |
+----------+---+
scala> val mindt = df.groupBy().agg(min('dt)).as[(java.sql.Date)].first
mindt: java.sql.Date = 2018-01-23
scala> df.filter('dt > date_add(lit(mindt),7)).show(false)
+----------+---+
|dt |day|
+----------+---+
|2018-02-20|25 |
+----------+---+
scala>
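If the goal is actually the rows within the first 7 days from the minimum date (which is what the expected output in the question shows), flipping the comparison on the same dataframe should be enough. A sketch:
// Keep rows up to minimum date + 7 days.
df.filter('dt <= date_add(lit(mindt), 7)).show(false)
+----------+---+
|dt        |day|
+----------+---+
|2018-01-23|23 |
|2018-01-24|24 |
+----------+---+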

How to filter dataframe using two dates?

I have a scenario where the dataframe has data_date as below:
root
|-- data_date: timestamp (nullable = true)
+-------------------+
| data_date|
+-------------------+
|2009-10-19 00:00:00|
|2004-02-24 00:00:00|
+-------------------+
I need to filter the data between two dates, i.e. data_date between '01-Jan-2017' and '31-Dec-2017'.
I tried many ways, like:
df.where(col("data_date") >= "2017-01-01" )
df.filter(col("data_date").gt("2017-01-01"))
df.filter(col("data_date").gt(lit("2017-01-01"))).filter(col("data_date").lt("2017-12-31")
but nothing worked.
I am getting the error below:
java.lang.AssertionError: assertion failed: unsafe symbol Unstable (child of <none>) in runtime reflection universe
at scala.reflect.internal.Symbols$Symbol.<init>(Symbols.scala:205)
at scala.reflect.internal.Symbols$TypeSymbol.<init>(Symbols.scala:3030)
at scala.reflect.internal.Symbols$ClassSymbol.<init>(Symbols.scala:3222)
at scala.reflect.internal.Symbols$StubClassSymbol.<init>(Symbols.scala:3522)
at scala.reflect.internal.Symbols$class.newStubSymbol(Symbols.scala:191)
at scala.reflect.internal.SymbolTable.newStubSymbol(SymbolTable.scala:16)
How can I solve it?
You need to cast the literal value to the "date" datatype. By the way, the input data is not between the dates that you are specifying. Check this out:
scala> val df = Seq(("2009-10-19 00:00:00"),("2004-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
df: org.apache.spark.sql.DataFrame = [data_date: timestamp]
scala> df.printSchema
root
|-- data_date: timestamp (nullable = true)
scala> df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date"))).show
+-------------------+-------+------+
| data_date|greater|lesser|
+-------------------+-------+------+
|2009-10-19 00:00:00| false| true|
|2004-02-24 00:00:00| false| true|
+-------------------+-------+------+
scala>
If I change the input as below, the filter works.
val df = Seq(("2017-10-19 00:00:00"),("2017-02-24 00:00:00")).toDF("data_date").select('data_date.cast("timestamp"))
val df2= df.withColumn("greater",'data_date.gt(lit("2017-01-01").cast("date"))).withColumn("lesser",'data_date.lt(lit("2017-12-31").cast("date")))
df2.filter("greater and lesser ").show(false)
+-------------------+-------+------+
|data_date |greater|lesser|
+-------------------+-------+------+
|2017-10-19 00:00:00|true |true |
|2017-02-24 00:00:00|true |true |
+-------------------+-------+------+
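As a side note, Column.between does the same thing a bit more concisely, and both bounds are inclusive. A sketch on the same df:
df.filter('data_date.between(lit("2017-01-01").cast("date"), lit("2017-12-31").cast("date"))).show(false)
+-------------------+
|data_date          |
+-------------------+
|2017-10-19 00:00:00|
|2017-02-24 00:00:00|
+-------------------+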

How to get non-null sorted ascending data from Spark DataFrame?

I load the data into a dataframe where one of the columns is zipCode (String type). I wonder how to get the non-null values for that column in ascending order in Scala? Many thanks in advance.
scala> val df = Seq("2", "1", null).toDF("x")
df: org.apache.spark.sql.DataFrame = [x: string]
scala> df.orderBy($"x".asc_nulls_last).show
+----+
| x|
+----+
| 1|
| 2|
|null|
+----+
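If the null rows should be dropped entirely rather than pushed to the end, filter them out before sorting. A sketch:
df.filter($"x".isNotNull).orderBy($"x".asc).show
+---+
|  x|
+---+
|  1|
|  2|
+---+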
