I have a column "minutes". I want to change the column to hh:mm:ss format in PySpark.
Input:
minutes (string type)
10
20
70
90

Output:
minutes (string type)    min_change
10                       00:10:00
20                       00:20:00
70                       01:10:00
90                       01:30:00
Add a column with lit("00:00:00") and cast it to timestamp. Convert the minutes to seconds and add it to the timestamp column. Finally, use date_format() to get your desired format:
from pyspark.sql import functions as F

df.withColumn("minutes", F.col("minutes").cast("int")) \
  .withColumn("min_change", F.lit("00:00:00").cast("timestamp")) \
  .withColumn("min_change", (F.unix_timestamp("min_change") + F.col("minutes") * 60).cast("timestamp")) \
  .withColumn("min_change", F.date_format("min_change", "HH:mm:ss")) \
  .show()
+-------+----------+
|minutes|min_change|
+-------+----------+
| 10| 00:10:00|
| 20| 00:20:00|
| 70| 01:10:00|
| 90| 01:30:00|
+-------+----------+
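An alternative sketch that skips the timestamp round trip and builds the string arithmetically (same df and column names as above; format_string uses printf-style formatting, and the hours/minutes arithmetic here is my own addition, not part of the original answer):
from pyspark.sql import functions as F

df.withColumn("minutes", F.col("minutes").cast("int")) \
  .withColumn("min_change",
              F.format_string("%02d:%02d:%02d",
                              (F.col("minutes") / 60).cast("int"),  # whole hours
                              F.col("minutes") % 60,                # remaining minutes
                              F.lit(0))) \
  .show()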
I have calculated the average temperature for two cities grouped by seasons, but I'm having trouble getting the difference between the avg(TemperatureF) for City A vs. City B. Here is an example of what my Spark Scala DataFrame looks like:
+----+------+-----------------+
|City|Season|avg(TemperatureF)|
+----+------+-----------------+
|   A|  Fall|               52|
|   A|Spring|               50|
|   A|Summer|               72|
|   A|Winter|               25|
|   B|  Fall|               49|
|   B|Spring|               44|
|   B|Summer|               69|
|   B|Winter|               22|
+----+------+-----------------+
You may use the pivot function as follows (the aggregated column is the avg(TemperatureF) column from your question):
import pyspark.sql.functions as f

df.groupBy('Season').pivot('City').agg(f.first('avg(TemperatureF)')) \
    .withColumn('diff', f.expr('A - B')) \
    .show()
+------+---+---+----+
|Season| A| B|diff|
+------+---+---+----+
|Spring| 50| 44| 6.0|
|Summer| 72| 69| 3.0|
| Fall| 52| 49| 3.0|
|Winter| 25| 22| 3.0|
+------+---+---+----+
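For reference, a self-contained sketch that reproduces this (the sample values are taken from the question's table, loaded as doubles; `spark` is assumed to be an existing SparkSession, and the exact number formatting in the display may differ):
import pyspark.sql.functions as f

data = [('A', 'Fall', 52.0), ('A', 'Spring', 50.0), ('A', 'Summer', 72.0), ('A', 'Winter', 25.0),
        ('B', 'Fall', 49.0), ('B', 'Spring', 44.0), ('B', 'Summer', 69.0), ('B', 'Winter', 22.0)]
df = spark.createDataFrame(data, ['City', 'Season', 'avg(TemperatureF)'])

(df.groupBy('Season')
   .pivot('City')
   .agg(f.first('avg(TemperatureF)'))
   .withColumn('diff', f.expr('A - B'))
   .show())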
My DataFrame in PySpark looks like the one below.
model,DAYS
MarutiDesire,15
MarutiErtiga,30
Suzukicelerio,45
I10lxi,60
Verna,55
The output I am trying to get is:
when DAYS is less than 30, then economical;
when between 30 and 60, then average;
and when greater than 60, then Low Profit.
Here is the code I tried, but it gives incorrect output:
dataset1.selectExpr("*", "CASE WHEN DAYS <=30 THEN 'ECONOMICAL' WHEN DAYS>30 AND LESS THEN 60 THEN 'AVERAGE' ELSE 'LOWPROFIT' END REASON").show()
Kindly share your suggestions. Is there any better way to do this in PySpark?
>>> from pyspark.sql.functions import *
>>> df.show()
+-------------+----+
| model|DAYS|
+-------------+----+
| MarutiDesire| 15|
| MarutiErtiga| 30|
|Suzukicelerio| 45|
| I10lxi| 60|
| Verna| 55|
+-------------+----+
>>> df.withColumn("REMARKS", when(col("DAYS") < 30, lit("ECONOMICAL")).when((col("DAYS") >= 30) & (col("DAYS") < 60), lit("AVERAGE")).otherwise(lit("LOWPROFIT"))).show()
+-------------+----+----------+
| model|DAYS| REMARKS|
+-------------+----+----------+
| MarutiDesire| 15|ECONOMICAL|
| MarutiErtiga| 30| AVERAGE|
|Suzukicelerio| 45| AVERAGE|
| I10lxi| 60| LOWPROFIT|
| Verna| 55| AVERAGE|
+-------------+----+----------+
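If you prefer the SQL-style CASE expression from your attempt, a corrected selectExpr would look roughly like this (the original "LESS THEN 60" is not valid SQL; the thresholds below match the when/otherwise version above):
>>> df.selectExpr(
...     "*",
...     "CASE WHEN DAYS < 30 THEN 'ECONOMICAL' "
...     "WHEN DAYS >= 30 AND DAYS < 60 THEN 'AVERAGE' "
...     "ELSE 'LOWPROFIT' END AS REMARKS"
... ).show()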
Hi, I have a JSON like the one below:
{meta:{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}
and I'd like to obtain the following dataframe:
+---------------+----+---------------+
| 1| 2| 5 |
+---------------+----+---------------+
| Aged 35 to 49|Male| Aged 15 to 17|
+---------------+----+---------------+
How could I do it in pyspark?
Thanks
You can use the get_json_object() function to parse the JSON column:
Example:
from pyspark.sql import Row

df = spark.createDataFrame([Row(jsn='{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}')])
df.selectExpr("get_json_object(jsn,'$.meta.clusters[0].1') as `1`",
              "get_json_object(jsn,'$.meta.clusters[*].2') as `2`",
              "get_json_object(jsn,'$.meta.clusters[*].5') as `5`").show(10, False)
"Output":
+-------------+------+---------------+
|1 |2 |5 |
+-------------+------+---------------+
|Aged 35 to 49|"Male"|"Aged 15 to 17"|
+-------------+------+---------------+
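An alternative sketch, in case you prefer a parsed structure over JSON-path strings (this assumes each cluster entry can be read as a map<string,string>, as in the sample, and that `spark` is an existing SparkSession):
from pyspark.sql import Row, functions as F
from pyspark.sql.types import StructType, StructField, ArrayType, MapType, StringType

# Schema matching the sample: meta.clusters is an array of single-entry string maps.
schema = StructType([
    StructField("meta", StructType([
        StructField("clusters", ArrayType(MapType(StringType(), StringType())))
    ]))
])

df = spark.createDataFrame([Row(jsn='{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}')])

(df.select(F.from_json("jsn", schema).alias("j"))
   .select(F.explode("j.meta.clusters").alias("m"))   # one row per cluster entry
   .select(F.explode("m").alias("k", "v"))            # flatten each map into key/value
   .groupBy().pivot("k").agg(F.first("v"))            # keys 1, 2, 5 become columns
   .show(truncate=False))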
I have a DataFrame in PySpark. Here is what it looks like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098930| 53 |
|670098934| 55 |
+---------+---------+
I want to fill in the gaps in timestamp with the previous state, so that I can get a perfect set to calculate time-weighted averages. Here is what the output should look like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098929| 50 |
|670098930| 53 |
|670098931| 53 |
|670098932| 53 |
|670098933| 53 |
|670098934| 55 |
+---------+---------+
Eventually, I want to persist this new dataframe on disk and visualize my analysis.
How do I do this in PySpark? (For simplicity's sake, I have kept just 2 columns. My actual DataFrame has 89 columns and ~670 million records before filling the gaps.)
You can generate a small timestamp range for each row, flatten (explode) it, and aggregate back to one price per timestamp:
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType

a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]]) \
      .toDF(['timestamp', 'price'])

# For each row, generate the next 5 timestamps (list() is needed on Python 3).
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))

a.withColumn('timestamp', f(a.timestamp)) \
 .withColumn('timestamp', func.explode(func.col('timestamp'))) \
 .groupBy('timestamp') \
 .agg(func.max(func.col('price'))) \
 .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
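A UDF-free alternative sketch, which also avoids generating timestamps past the last observation (this assumes Spark 2.4+ for sequence(), uses the column names from the question, and forward-fills with last(..., ignorenulls=True); the unpartitioned window is fine for a sketch but will be expensive at ~670 million rows):
from pyspark.sql import Window, functions as F

a = spark.createDataFrame([(670098928, 50), (670098930, 53), (670098934, 55)],
                          ['timestamp', 'price'])

# Build the full timestamp range between the min and max observed timestamps.
bounds = a.agg(F.min('timestamp').alias('lo'), F.max('timestamp').alias('hi'))
full = bounds.select(F.explode(F.sequence('lo', 'hi')).alias('timestamp'))

# Left-join the known prices and carry the last non-null price forward.
w = Window.orderBy('timestamp')
filled = (full.join(a, 'timestamp', 'left')
              .withColumn('price', F.last('price', ignorenulls=True).over(w)))

filled.orderBy('timestamp').show()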
Suppose I have a data frame
+-------------------+-----+---+
|          timestamp|login|Age|
+-------------------+-----+---+
|2016-06-01 01:05:20| 7372| 50|
|2016-06-01 01:00:20| 7374| 35|
|2016-06-01 01:10:20| 7376| 40|
+-------------------+-----+---+
I want only the records between 01:00 and 01:10, irrespective of date. The timestamp is stored in "yyyy-MM-dd HH:mm:ss" format (from unix_timestamp).
How do I extract those records?
This is to analyze people who are coming in late.
I achieved it using the code below:
val attendenceDF = DF.withColumn("Attendence",when(date_format(DF("timestamp"),"HH:mm:ss").between("01:00:00","01:10:00"),"InTime").otherwise("NotInTime"))
attendenceDF.show()
+-------------------+-----+---+----------+
|          timestamp|login|Age|Attendence|
+-------------------+-----+---+----------+
|2016-06-01 01:05:20| 7372| 50|    InTime|
|2016-06-01 01:00:20| 7374| 35|    InTime|
|2016-06-01 01:10:20| 7376| 40| NotInTime|
+-------------------+-----+---+----------+
You could try using the hour and minute functions from the functions package:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
val tsCol = col("timestamp").cast(TimestampType)
val filteredDF = df.filter(
  (hour(tsCol) === 1) && (minute(tsCol).between(0, 10))
)
If the timestamp is of type string, you could also do it with a substring. If it is stored as a unix timestamp, you could convert it first, but it is more efficient to check the exact type and format in which it is saved and look for a way to extract the hour and minute directly.
Hope it helps you :)
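For completeness, a rough PySpark equivalent of the same filter (the column name "timestamp" is taken from the question, and df is assumed to already exist):
from pyspark.sql import functions as F

ts = F.col("timestamp").cast("timestamp")

# Keep only rows whose time of day falls between 01:00 and 01:10, regardless of date.
filtered_df = df.filter((F.hour(ts) == 1) & (F.minute(ts).between(0, 10)))
filtered_df.show()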