Explode array into columns Spark - apache-spark

Hi, I have a JSON like below:
{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}
and I'd like to obtain the following dataframe:
+---------------+----+---------------+
| 1| 2| 5 |
+---------------+----+---------------+
| Aged 35 to 49|Male| Aged 15 to 17|
+---------------+----+---------------+
How could I do it in pyspark?
Thanks

You can use the get_json_object() function to parse the JSON column:
Example:
from pyspark.sql import Row

df = spark.createDataFrame([Row(jsn='{"meta":{"clusters":[{"1":"Aged 35 to 49"},{"2":"Male"},{"5":"Aged 15 to 17"}]}}')])
df.selectExpr("get_json_object(jsn,'$.meta.clusters[0].1') as `1`",
              "get_json_object(jsn,'$.meta.clusters[*].2') as `2`",
              "get_json_object(jsn,'$.meta.clusters[*].5') as `5`").show(10, False)
"Output":
+-------------+------+---------------+
|1 |2 |5 |
+-------------+------+---------------+
|Aged 35 to 49|"Male"|"Aged 15 to 17"|
+-------------+------+---------------+
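If you want the values without the surrounding quotes that the [*] wildcard leaves on columns 2 and 5, you can index each array element instead, the same way column 1 is handled above. A minimal sketch, assuming the clusters always arrive in this order (and that the numeric keys resolve the same way as `.1` does in the query above):
df.selectExpr("get_json_object(jsn,'$.meta.clusters[0].1') as `1`",
              "get_json_object(jsn,'$.meta.clusters[1].2') as `2`",
              "get_json_object(jsn,'$.meta.clusters[2].5') as `5`").show(10, False)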

Related

Calculate duration inside groups based on timestamp

I have a dataframe looking like this:
| id | device   | x  | y  | z  | timestamp                    |
| 1  | device_1 | 22 | 8  | 23 | 2020-10-30T16:00:00.000+0000 |
| 1  | device_1 | 21 | 88 | 65 | 2020-10-30T16:01:00.000+0000 |
| 1  | device_1 | 33 | 34 | 64 | 2020-10-30T16:02:00.000+0000 |
| 2  | device_2 | 12 | 6  | 97 | 2019-11-30T13:00:00.000+0000 |
| 2  | device_2 | 44 | 77 | 13 | 2019-11-30T13:00:00.000+0000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:00:00.000+0000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:01:00.000+0000 |
The data represents events for an "id" at a certain point in time. I would like to see how the values develop over a period of time, to plot a time series for instance.
I'm thinking of adding a column 'duration' which is 0 for the first entry and then, for each following entry, the time elapsed since the first entry for the same id on the same day (there might be multiple different event streams for the same id on separate days).
I would ideally want a dataframe looking something like this:
| id | device   | x  | y  | z  | timestamp                    | duration     |
| 1  | device_1 | 22 | 8  | 23 | 2020-10-30T16:00:00.000+0000 | 00:00:00.000 |
| 1  | device_1 | 21 | 88 | 65 | 2020-10-30T16:01:00.000+0000 | 00:01:00.000 |
| 1  | device_1 | 33 | 34 | 64 | 2020-10-30T16:02:00.000+0000 | 00:02:00.000 |
| 2  | device_2 | 12 | 6  | 97 | 2019-11-30T13:00:00.000+0000 | 00:00:00.000 |
| 2  | device_2 | 44 | 77 | 13 | 2019-11-30T13:00:30.000+0000 | 00:00:30.000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:00:00.000+0000 | 00:00:00.000 |
| 1  | device_1 | 22 | 11 | 30 | 2022-10-30T08:01:00.000+0000 | 00:01:00.000 |
I have no idea where to begin in order to achieve this so either a good explanation or a code example would be very helpful!
Any other suggestions on how to be able to plot development over time (in general not related to a specific date or time of the day) based on this dataframe are also very welcome.
Note: It has to be in PySpark (not pandas) since the dataset is extremely large.
You will need to use window functions (functions that operate within partitions defined by an over clause). The code below does the same thing as the other answer, but I wanted to show a more streamlined version, fully in PySpark, as opposed to PySpark + SQL with subqueries.
Initially, the "duration" column will be of type interval, so it's up to you to transform it into whatever data type you need. Here I simply extract the hh:mm:ss part with regexp_extract, which stores it as a string.
Input (I assume your "timestamp" column is of type timestamp):
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 'device_1', 22, 8, 23, '2020-10-30T16:00:00.000+0000'),
     (1, 'device_1', 21, 88, 65, '2020-10-30T16:01:00.000+0000'),
     (1, 'device_1', 33, 34, 64, '2020-10-30T16:02:00.000+0000'),
     (2, 'device_2', 12, 6, 97, '2019-11-30T13:00:00.000+0000'),
     (2, 'device_2', 44, 77, 13, '2019-11-30T13:00:30.000+0000'),
     (1, 'device_1', 22, 11, 30, '2022-10-30T08:00:00.000+0000'),
     (1, 'device_1', 22, 11, 30, '2022-10-30T08:01:00.000+0000')],
    ["id", "device", "x", "y", "z", "timestamp"]
).withColumn("timestamp", F.to_timestamp("timestamp"))
Script:
w = W.partitionBy('id', F.to_date('timestamp')).orderBy('timestamp')
df = df.withColumn('duration', F.col('timestamp') - F.min('timestamp').over(w))
df = df.withColumn('duration', F.regexp_extract('duration', r'\d\d:\d\d:\d\d', 0))
df.show(truncate=0)
# +---+--------+---+---+---+-------------------+--------+
# |id |device |x |y |z |timestamp |duration|
# +---+--------+---+---+---+-------------------+--------+
# |1 |device_1|22 |8 |23 |2020-10-30 16:00:00|00:00:00|
# |1 |device_1|21 |88 |65 |2020-10-30 16:01:00|00:01:00|
# |1 |device_1|33 |34 |64 |2020-10-30 16:02:00|00:02:00|
# |1 |device_1|22 |11 |30 |2022-10-30 08:00:00|00:00:00|
# |1 |device_1|22 |11 |30 |2022-10-30 08:01:00|00:01:00|
# |2 |device_2|12 |6 |97 |2019-11-30 13:00:00|00:00:00|
# |2 |device_2|44 |77 |13 |2019-11-30 13:00:30|00:00:30|
# +---+--------+---+---+---+-------------------+--------+
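If a numeric duration is handier downstream (e.g. for plotting a time series), you could keep it as seconds instead of a string. A minimal sketch reusing the same window w from the script above:
df = df.withColumn(
    'duration_seconds',
    F.col('timestamp').cast('long') - F.min('timestamp').over(w).cast('long')  # seconds since first event of the day
)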
Your problem can be resolved using window functions, as below:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [
        (1, 'device_1', 22, 8, 23, '2020-10-30T16:00:00.000+0000'),
        (1, 'device_1', 21, 88, 65, '2020-10-30T16:01:00.000+0000'),
        (1, 'device_1', 33, 34, 64, '2020-10-30T16:02:00.000+0000'),
        (2, 'device_2', 12, 6, 97, '2019-11-30T13:00:00.000+0000'),
        (2, 'device_2', 44, 77, 13, '2019-11-30T13:00:00.000+0000'),
        (1, 'device_1', 22, 11, 30, '2022-10-30T08:00:00.000+0000'),
        (1, 'device_1', 22, 11, 30, '2022-10-30T08:01:00.000+0000')
    ],
    ("id", "device_name", "x", "y", "z", "timestmp"))
df.show(5, False)
+---+-----------+---+---+---+----------------------------+
|id |device_name|x |y |z |timestmp |
+---+-----------+---+---+---+----------------------------+
|1 |device_1 |22 |8 |23 |2020-10-30T16:00:00.000+0000|
|1 |device_1 |21 |88 |65 |2020-10-30T16:01:00.000+0000|
|1 |device_1 |33 |34 |64 |2020-10-30T16:02:00.000+0000|
|2 |device_2 |12 |6 |97 |2019-11-30T13:00:00.000+0000|
|2 |device_2 |44 |77 |13 |2019-11-30T13:00:00.000+0000|
+---+-----------+---+---+---+----------------------------+
from pyspark.sql.functions import *
df_1 = df.withColumn("timestmp_t", to_timestamp(col("timestmp")))
df_1 = df_1.withColumn("date_t", to_date(substring(col("timestmp"), 1, 10)))
df_1.show(5)
+---+-----------+---+---+---+--------------------+-------------------+----------+
| id|device_name| x| y| z| timestmp| timestmp_t| date_t|
+---+-----------+---+---+---+--------------------+-------------------+----------+
| 1| device_1| 22| 8| 23|2020-10-30T16:00:...|2020-10-30 16:00:00|2020-10-30|
| 1| device_1| 21| 88| 65|2020-10-30T16:01:...|2020-10-30 16:01:00|2020-10-30|
| 1| device_1| 33| 34| 64|2020-10-30T16:02:...|2020-10-30 16:02:00|2020-10-30|
| 2| device_2| 12| 6| 97|2019-11-30T13:00:...|2019-11-30 13:00:00|2019-11-30|
| 2| device_2| 44| 77| 13|2019-11-30T13:00:...|2019-11-30 13:00:00|2019-11-30|
+---+-----------+---+---+---+--------------------+-------------------+----------+
df_1.createOrReplaceTempView("tmp_table")
spark.sql("""
select t.*, (timestmp_t - min) as duration from (
SELECT id, device_name, date_t, timestmp_t, MIN(timestmp_t) OVER (PARTITION BY id, date_t ORDER BY timestmp_t) AS min
FROM tmp_table) as t
""").show(5, False)
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
|id |device_name|date_t |timestmp_t |min |duration |
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
|1 |device_1 |2020-10-30|2020-10-30 16:00:00|2020-10-30 16:00:00|INTERVAL '0 00:00:00' DAY TO SECOND|
|1 |device_1 |2020-10-30|2020-10-30 16:01:00|2020-10-30 16:00:00|INTERVAL '0 00:01:00' DAY TO SECOND|
|1 |device_1 |2020-10-30|2020-10-30 16:02:00|2020-10-30 16:00:00|INTERVAL '0 00:02:00' DAY TO SECOND|
|1 |device_1 |2022-10-30|2022-10-30 08:00:00|2022-10-30 08:00:00|INTERVAL '0 00:00:00' DAY TO SECOND|
|1 |device_1 |2022-10-30|2022-10-30 08:01:00|2022-10-30 08:00:00|INTERVAL '0 00:01:00' DAY TO SECOND|
+---+-----------+----------+-------------------+-------------------+-----------------------------------+
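If the INTERVAL type is awkward to work with, the same query can return the duration as plain seconds instead; a sketch using unix_timestamp, otherwise identical to the query above:
spark.sql("""
select t.*, (unix_timestamp(timestmp_t) - unix_timestamp(min)) as duration_seconds from (
SELECT id, device_name, date_t, timestmp_t, MIN(timestmp_t) OVER (PARTITION BY id, date_t ORDER BY timestmp_t) AS min
FROM tmp_table) as t
""").show(5, False)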

Split large dataframe into small ones Spark

I have a DF that has 200 million lines. I can't group this DF and I have to split it into 8 smaller DFs (approx. 30 million lines each). I've tried this approach but with no success: without caching the DF, the counts of the split DFs do not match the larger DF, and if I use cache I run out of disk space (my config is 64 GB RAM and a 512 GB SSD).
Considering this, I thought about the following approach:
Load the entire DF
Give 8 random numbers to this DF
Distribute the random numbers evenly in the DF
Consider the following DF as example:
+------+--------+
| val1 | val2 |
+------+--------+
|Paul | 1.5 |
|Bostap| 1 |
|Anna | 3 |
|Louis | 4 |
|Jack | 2.5 |
|Rick | 0 |
|Grimes| null|
|Harv | 2 |
|Johnny| 2 |
|John | 1 |
|Neo | 5 |
|Billy | null|
|James | 2.5 |
|Euler | null|
+------+--------+
The DF has 14 lines; I thought of using a window to create the following DF:
+------+--------+----+
| val1 | val2 | sep|
+------+--------+----+
|Paul | 1.5 |1 |
|Bostap| 1 |1 |
|Anna | 3 |1 |
|Louis | 4 |1 |
|Jack | 2.5 |1 |
|Rick | 0 |1 |
|Grimes| null|1 |
|Harv | 2 |2 |
|Johnny| 2 |2 |
|John | 1 |2 |
|Neo | 5 |2 |
|Billy | null|2 |
|James | 2.5 |2 |
|Euler | null|2 |
+------+--------+----+
Considering the last DF, I will use a filter to filter by sep. My question is: how can I use a window function to generate the sep column of the last DF?
Since you are randomly splitting the dataframe into 8 parts, you could use randomSplit():
split_weights = [1.0] * 8
splits = df.randomSplit(split_weights)
for df_split in splits:
    # do what you want with the smaller df_split
    ...
Note that this will not ensure the same number of records in each df_split. There may be some fluctuation, but with 200 million records it will be negligible.
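If you specifically want the sep column sketched in the question, ntile over a window is one way to get it. A sketch only: note that a window without a partition pulls all rows through a single task, so for 200 million rows randomSplit is usually the better choice.
from pyspark.sql import functions as F, Window as W

w = W.orderBy(F.monotonically_increasing_id())     # any stable ordering works here
df_with_sep = df.withColumn('sep', F.ntile(8).over(w))   # splits rows into 8 roughly equal buckets
part_1 = df_with_sep.filter(F.col('sep') == 1)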
If you want to process the splits and store them to files, you can name the files by their index to avoid getting them mixed up:
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('parquet-files')
split_w = [1.0] * 5
splits = df.randomSplit(split_w)
for count, df_split in enumerate(splits, start=1):
    df_split.write.parquet(f'split-files/split-file-{count}', mode='overwrite')
The files will be roughly the same size, some with a slight difference.

Sum last 2 days with day data gaps in Spark dataframe

I have a data frame
| Id | Date | Value |
| 1 | 1/1/2019 | 11 |
| 1 | 1/2/2019 | 12 |
| 1 | 1/3/2019 | 13 |
| 1 | 1/5/2019 | 14 |
| 1 | 1/6/2019 | 15 |
I want to calculate the sum of last 2 values by date:
| Id | Date | Value | Sum |
| 1 | 1/1/2019 | 11 | null |
| 1 | 1/2/2019 | 12 | null |
| 1 | 1/3/2019 | 13 | 23 |
| 1 | 1/5/2019 | 14 | -13 | // there is no 1/4 so 0 - 13
| 1 | 1/6/2019 | 15 | 14 | // there is no 1/4 so 14 - 0
Right now I have
let window = Window
.PartitionBy("Id")
.OrderBy(Functions.Col("Date").Cast("timestamp").Cast("long"))
data.WithColumn("Sum", Functions.Lag("Value", 1).Over(window) - Functions.Lag("Value", 2).Over(window))
With this approach the missing value is assumed to be equal to the previous one (so 1/4 is taken to be equal to 1/3 = 13).
How can I treat 1/4 as zero instead?
You have two ways to do this.
One would be to use the lag function with when and otherwise, together with the date API to take one day off the date.
The pro is that this works fine and quickly; the con is that every time you change your lag formula, you have to rewrite it...
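A rough sketch of that first approach, written in PySpark purely for illustration (the question itself uses the .NET API); it assumes a DataFrame data with columns Id, Date (already a date type) and Value, and uses date_sub as the date-API call:
from pyspark.sql import functions as F, Window as W

w = W.partitionBy('Id').orderBy('Date')
# value of the previous calendar day, or 0 if that day is missing
prev1 = (F.when(F.lag('Date', 1).over(w) == F.date_sub('Date', 1), F.lag('Value', 1).over(w))
          .otherwise(F.lit(0)))
# value of two calendar days ago, or 0 if that day is missing
prev2 = (F.when(F.lag('Date', 1).over(w) == F.date_sub('Date', 2), F.lag('Value', 1).over(w))
          .when(F.lag('Date', 2).over(w) == F.date_sub('Date', 2), F.lag('Value', 2).over(w))
          .otherwise(F.lit(0)))
data = data.withColumn('Sum', prev1 - prev2)
This illustrates the rewrite problem: looking one more day back means chaining yet more when clauses.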
However, I found a more generalizable method. The idea is to fill in the missing dates: cast the timestamps to Long and use spark.range to generate every possible date between minDate and maxDate.
// Some imports
import org.apache.spark.sql.{functions => F}
import org.apache.spark.sql.expressions.Window
import spark.implicits._  // for $-notation and Dataset encoders (already in scope in spark-shell)
// Our DF
val df = Seq(
(1, "1/1/2019", 11),
(1, "1/2/2019", 12),
(1, "1/3/2019", 13),
(1, "1/5/2019", 14),
(1, "1/6/2019", 15)
).
toDF("id", "date", "value").
withColumn("date", F.to_timestamp($"date", "MM/dd/yyyy"))
// min and max date, as epoch seconds
val (mindate, maxdate) = df.select(F.min($"date").cast("long"), F.max($"date").cast("long")).as[(Long, Long)].first
// Our step in seconds, so one day here
val step: Long = 24 * 60 * 60
// Generate missing dates
val reference = spark.
range(mindate, ((maxdate / step) + 1) * step, step).
select($"id".cast("timestamp").as("date"))
// Our df filled !
val filledDf = reference.join(df, Seq("date"), "leftouter").na.fill(0, Seq("value"))
/**
+-------------------+----+-----+
| date| id|value|
+-------------------+----+-----+
|2019-01-01 00:00:00| 1| 11|
|2019-01-02 00:00:00| 1| 12|
|2019-01-03 00:00:00| 1| 13|
|2019-01-04 00:00:00|null| 0|
|2019-01-05 00:00:00| 1| 14|
|2019-01-06 00:00:00| 1| 15|
+-------------------+----+-----+
*/
val windowSpec = Window.orderBy($"date")  // no partition: the filled-in rows have a null id
filledDf.
  withColumn("result", F.lag($"value", 1, 0).over(windowSpec) - F.lag($"value", 2, 0).over(windowSpec)).show
/**
+-------------------+----+-----+------+
| date| id|value|result|
+-------------------+----+-----+------+
|2019-01-01 00:00:00| 1| 11| 0|
|2019-01-02 00:00:00| 1| 12| 11|
|2019-01-03 00:00:00| 1| 13| 1|
|2019-01-04 00:00:00|null| 0| 1|
|2019-01-05 00:00:00| 1| 14| -13|
|2019-01-06 00:00:00| 1| 15| 14|
+-------------------+----+-----+------+
*/

Add a new Column to my DataSet in spark Java API

I'm new to Spark.
My DataSet contains two columns. I want to add a third column that is the sum of the two.
My DataSet is:
+---------+-------------------+
|C1 | C2 |
+---------+-------------------+
| 44 | 10|
| 55 | 10|
+---------+-------------------+
I want to obtain a DataSet like this:
+---------+-------------------+---------+
|C1 | C2 | C3 |
+---------+-------------------+---------+
| 44 | 10| 54 |
| 55 | 10| 65 |
+---------+-------------------+---------+
Any help will be appreciated.
The correct solution is:
df.withColumn("C3", df.col("C1").plus(df.col("C2")));
or
df.selectExpr("*", "C1 + C2 as C3");
For more arithmetic operators check Java-specific expression operators in the Column documentation.

Inserting records in a spark dataframe

I have a dataframe in pyspark. Here is what it looks like,
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098930| 53 |
|670098934| 55 |
+---------+---------+
I want to fill in the gaps in timestamp with the previous state, so that I can get a perfect set to calculate time-weighted averages. Here is what the output should look like:
+---------+---------+
|timestamp| price |
+---------+---------+
|670098928| 50 |
|670098929| 50 |
|670098930| 53 |
|670098931| 53 |
|670098932| 53 |
|670098933| 53 |
|670098934| 55 |
+---------+---------+
Eventually, I want to persist this new dataframe on disk and visualize my analysis.
How do I do this in pyspark? (For simplicity's sake, I have just kept 2 columns. My actual dataframe has 89 columns with ~670 million records before filling the gaps.)
You can generate timestamp ranges, flatten them and select rows
import pyspark.sql.functions as func
from pyspark.sql.types import IntegerType, ArrayType
a = sc.parallelize([[670098928, 50], [670098930, 53], [670098934, 55]])\
    .toDF(['timestamp', 'price'])
f = func.udf(lambda x: list(range(x, x + 5)), ArrayType(IntegerType()))
a.withColumn('timestamp', f(a.timestamp))\
    .withColumn('timestamp', func.explode(func.col('timestamp')))\
    .groupBy('timestamp')\
    .agg(func.max(func.col('price')))\
    .show()
+---------+----------+
|timestamp|max(price)|
+---------+----------+
|670098928| 50|
|670098929| 50|
|670098930| 53|
|670098931| 53|
|670098932| 53|
|670098933| 53|
|670098934| 55|
|670098935| 55|
|670098936| 55|
|670098937| 55|
|670098938| 55|
+---------+----------+
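The hard-coded range(x, x + 5) only fits this toy data. A sketch that instead fills exactly up to the next row's timestamp, using lead and the built-in sequence function (Spark 2.4+), assuming the same DataFrame a as above:
from pyspark.sql import functions as F, Window as W

w = W.orderBy('timestamp')   # single partition; fine for an example, partition by a key for real data
filled = (a
    .withColumn('next_ts', F.lead('timestamp').over(w))                                  # timestamp of the following row
    .withColumn('timestamp',
                F.when(F.col('next_ts').isNotNull(), F.expr('sequence(timestamp, next_ts - 1)'))
                 .otherwise(F.array('timestamp')))                                       # last row has no successor, keep as-is
    .withColumn('timestamp', F.explode('timestamp'))
    .drop('next_ts'))
filled.show()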
