I have a column "minutes". I want to change the column to hh:mm:ss format in PySpark.
Input:
minutes (string type)
10
20
70
90

Output:
minutes (string type)    min_change
10                       00:10:00
20                       00:20:00
70                       01:10:00
90                       01:30:00
Add a column with lit("00:00:00") and cast it to timestamp. Convert the minutes to seconds and add it to the timestamp column. Finally, use date_format() to get your desired format:
from pyspark.sql import functions as F

df.withColumn("minutes", F.col("minutes").cast("int")) \
  .withColumn("min_change", F.lit("00:00:00").cast("timestamp")) \
  .withColumn("min_change", (F.unix_timestamp("min_change") + F.col("minutes") * 60).cast("timestamp")) \
  .withColumn("min_change", F.date_format("min_change", "HH:mm:ss")) \
  .show()
+-------+----------+
|minutes|min_change|
+-------+----------+
| 10| 00:10:00|
| 20| 00:20:00|
| 70| 01:10:00|
| 90| 01:30:00|
+-------+----------+
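An alternative sketch that skips the timestamp round trip and builds the string arithmetically (same df and column names as above; format_string uses printf-style formatting, and the hours/minutes arithmetic here is my own addition, not part of the original answer):
from pyspark.sql import functions as F

df.withColumn("minutes", F.col("minutes").cast("int")) \
  .withColumn("min_change",
              F.format_string("%02d:%02d:%02d",
                              (F.col("minutes") / 60).cast("int"),  # whole hours
                              F.col("minutes") % 60,                # remaining minutes
                              F.lit(0))) \
  .show()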
I have calculated the average temperature for two cities grouped by seasons, but I'm having trouble getting the difference between the avg(TemperatureF) for City A vs. City B. Here is an example of what my Spark Scala DataFrame looks like:
+----+------+-----------------+
|City|Season|avg(TemperatureF)|
+----+------+-----------------+
|   A|  Fall|               52|
|   A|Spring|               50|
|   A|Summer|               72|
|   A|Winter|               25|
|   B|  Fall|               49|
|   B|Spring|               44|
|   B|Summer|               69|
|   B|Winter|               22|
+----+------+-----------------+
You may use the pivot function as follows (the aggregated column is the avg(TemperatureF) column from your question):
import pyspark.sql.functions as f

df.groupBy('Season').pivot('City').agg(f.first('avg(TemperatureF)')) \
    .withColumn('diff', f.expr('A - B')) \
    .show()
+------+---+---+----+
|Season| A| B|diff|
+------+---+---+----+
|Spring| 50| 44| 6.0|
|Summer| 72| 69| 3.0|
| Fall| 52| 49| 3.0|
|Winter| 25| 22| 3.0|
+------+---+---+----+
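For reference, a self-contained sketch that reproduces this (the sample values are taken from the question's table, loaded as doubles; `spark` is assumed to be an existing SparkSession, and the exact number formatting in the display may differ):
import pyspark.sql.functions as f

data = [('A', 'Fall', 52.0), ('A', 'Spring', 50.0), ('A', 'Summer', 72.0), ('A', 'Winter', 25.0),
        ('B', 'Fall', 49.0), ('B', 'Spring', 44.0), ('B', 'Summer', 69.0), ('B', 'Winter', 22.0)]
df = spark.createDataFrame(data, ['City', 'Season', 'avg(TemperatureF)'])

(df.groupBy('Season')
   .pivot('City')
   .agg(f.first('avg(TemperatureF)'))
   .withColumn('diff', f.expr('A - B'))
   .show())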
My DataFrame in PySpark looks like the one below.
model,DAYS
MarutiDesire,15
MarutiErtiga,30
Suzukicelerio,45
I10lxi,60
Verna,55
The output I am trying to get is:
when DAYS is less than 30, then economical;
when between 30 and 60, then average;
and when greater than 60, then Low Profit.
Here is the code I tried, but it gives incorrect output:
dataset1.selectExpr("*", "CASE WHEN DAYS <=30 THEN 'ECONOMICAL' WHEN DAYS>30 AND LESS THEN 60 THEN 'AVERAGE' ELSE 'LOWPROFIT' END REASON").show()
Kindly share your suggestions. Is there any better way to do this in PySpark?
>>> from pyspark.sql.functions import *
>>> df.show()
+-------------+----+
| model|DAYS|
+-------------+----+
| MarutiDesire| 15|
| MarutiErtiga| 30|
|Suzukicelerio| 45|
| I10lxi| 60|
| Verna| 55|
+-------------+----+
>>> df.withColumn("REMARKS", when(col("DAYS") < 30, lit("ECONOMICAL")).when((col("DAYS") >= 30) & (col("DAYS") < 60), lit("AVERAGE")).otherwise(lit("LOWPROFIT"))).show()
+-------------+----+----------+
| model|DAYS| REMARKS|
+-------------+----+----------+
| MarutiDesire| 15|ECONOMICAL|
| MarutiErtiga| 30| AVERAGE|
|Suzukicelerio| 45| AVERAGE|
| I10lxi| 60| LOWPROFIT|
| Verna| 55| AVERAGE|
+-------------+----+----------+
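If you prefer the SQL-style CASE expression from your attempt, a corrected selectExpr would look roughly like this (the original "LESS THEN 60" is not valid SQL; the thresholds below match the when/otherwise version above):
>>> df.selectExpr(
...     "*",
...     "CASE WHEN DAYS < 30 THEN 'ECONOMICAL' "
...     "WHEN DAYS >= 30 AND DAYS < 60 THEN 'AVERAGE' "
...     "ELSE 'LOWPROFIT' END AS REMARKS"
... ).show()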
Hi, I have a JSON like the one below:
{meta:{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}
and I'd like to obtain the following dataframe:
+---------------+----+---------------+
| 1| 2| 5 |
+---------------+----+---------------+
| Aged 35 to 49|Male| Aged 15 to 17|
+---------------+----+---------------+
How could I do it in pyspark?
Thanks
You can use the get_json_object() function to parse the JSON column:
Example:
from pyspark.sql import Row

df = spark.createDataFrame([Row(jsn='{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}')])
df.selectExpr("get_json_object(jsn,'$.meta.clusters[0].1') as `1`",
              "get_json_object(jsn,'$.meta.clusters[*].2') as `2`",
              "get_json_object(jsn,'$.meta.clusters[*].5') as `5`").show(10, False)
"Output":
+-------------+------+---------------+
|1 |2 |5 |
+-------------+------+---------------+
|Aged 35 to 49|"Male"|"Aged 15 to 17"|
+-------------+------+---------------+
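An alternative sketch, in case you prefer a parsed structure over JSON-path strings (this assumes each cluster entry can be read as a map<string,string>, as in the sample, and that `spark` is an existing SparkSession):
from pyspark.sql import Row, functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, MapType, StringType

# Schema matching the sample: meta.clusters is an array of single-entry string maps.
schema = StructType([
    StructField("meta", StructType([
        StructField("clusters", ArrayType(MapType(StringType(), StringType())))
    ]))
])

df = spark.createDataFrame([Row(jsn='{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}')])

(df.select(F.from_json("jsn", schema).alias("j"))
   .select(F.explode("j.meta.clusters").alias("m"))   # one row per cluster entry
   .select(F.explode("m").alias("k", "v"))            # flatten each map into key/value
   .groupBy().pivot("k").agg(F.first("v"))            # keys 1, 2, 5 become columns
   .show(truncate=False))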
I have a DataFrame in PySpark. Here is what it looks like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098930| 53 |
|670098934| 55 |
+---------+---------+
I want to fill in the gaps in timestamp with the previous state, so that I can get a perfect set to calculate time-weighted averages. Here is what the output should look like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098929| 50 |
|670098930| 53 |
|670098931| 53 |
|670098932| 53 |
|670098933| 53 |
|670098934| 55 |
+---------+---------+
Eventually, I want to persist this new dataframe on disk and visualize my analysis.
How do I do this in PySpark? (For simplicity's sake, I have kept just 2 columns. My actual DataFrame has 89 columns and ~670 million records before filling the gaps.)
You can generate a small timestamp range for each row, flatten (explode) it, and aggregate back to one price per timestamp:
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType

a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]]) \
      .toDF(['timestamp', 'price'])

# For each row, generate the next 5 timestamps (list() is needed on Python 3).
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

a.withColumn('timestamp', f(a.timestamp)) \
 .withColumn('timestamp', func.explode(func.col('timestamp'))) \
 .groupBy('timestamp') \
 .agg(func.max(func.col('price'))) \
 .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
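A UDF-free alternative sketch, which also avoids generating timestamps past the last observation (this assumes Spark 2.4+ for sequence(), uses the column names from the question, and forward-fills with last(..., ignorenulls=True); the unpartitioned window is fine for a sketch but will be expensive at ~670 million rows):
from pyspark.sql import Window, functions as F

a = spark.createDataFrame([(670098928, 50), (670098930, 53), (670098934, 55)],
                          ['timestamp', 'price'])

# Build the full timestamp range between the min and max observed timestamps.
bounds = a.agg(F.min('timestamp').alias('lo'), F.max('timestamp').alias('hi'))
full = bounds.select(F.explode(F.sequence('lo', 'hi')).alias('timestamp'))

# Left-join the known prices and carry the last non-null price forward.
w = Window.orderBy('timestamp')
filled = (full.join(a, 'timestamp', 'left')
              .withColumn('price', F.last('price', ignorenulls=True).over(w)))

filled.orderBy('timestamp').show()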
Suppose I have a data frame
+-------------------+-----+---+
|          timestamp|login|Age|
+-------------------+-----+---+
|2016-06-01 01:05:20| 7372| 50|
|2016-06-01 01:00:20| 7374| 35|
|2016-06-01 01:10:20| 7376| 40|
+-------------------+-----+---+
I want only the records between 01:00 and 01:10, irrespective of date. The timestamp is stored in "yyyy-MM-dd HH:mm:ss" format (from unix_timestamp).
How do I extract those records?
This is to analyze people who are coming in late.
I achieved it using the code below:
val attendenceDF = DF.withColumn("Attendence",when(date_format(DF("timestamp"),"HH:mm:ss").between("01:00:00","01:10:00"),"InTime").otherwise("NotInTime"))
attendenceDF.show()
+-------------------+-----+---+----------+
|          timestamp|login|Age|Attendence|
+-------------------+-----+---+----------+
|2016-06-01 01:05:20| 7372| 50|    InTime|
|2016-06-01 01:00:20| 7374| 35|    InTime|
|2016-06-01 01:10:20| 7376| 40| NotInTime|
+-------------------+-----+---+----------+
You could try using the hour and minute functions from the functions package:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val tsCol = col("timestamp").cast(TimestampType)
val filteredDF = df.filter(
  (hour(tsCol) === 1) && (minute(tsCol).between(0, 10))
)
If the timestamp is of type string, you could also do it with a substring. If it is stored as a unix timestamp, you could convert it first, but it is more efficient to check the exact type and format in which it is saved and look for a way to extract the hour and minute directly.
Hope it helps you :)
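For completeness, a rough PySpark equivalent of the same filter (the column name "timestamp" is taken from the question, and df is assumed to already exist):
from pyspark.sql import functions as F

ts = F.col("timestamp").cast("timestamp")

# Keep only rows whose time of day falls between 01:00 and 01:10, regardless of date.
filtered_df = df.filter((F.hour(ts) == 1) & (F.minute(ts).between(0, 10)))
filtered_df.show()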