I got the following problem, while developing an Apache Spark Application. I have two Datasets (D1 and D2) from a Postgresql Database, that I would like to process using Apache Spark. Both contain a column (ts) with timestamps from the same period. I would like to join D2 with the largest timestamp from D1 that is smaller or equal. It might look like:
D1        D2                  DJOIN
|ts|      |ts|                |D1.ts|D2.ts|
----      ----                -------------
| 1|      | 1|                |    1|    1|
| 3|      | 2|                |    1|    2|
| 5|      | 3|                |    3|    3|
| 7|      | 4|                |    3|    4|
|11|      | 5|                |    5|    5|
|13|      | 6|                |    5|    6|
          | 7|   = join =>    |    7|    7|
          | 8|                |    7|    8|
          | 9|                |    7|    9|
          |10|                |    7|   10|
          |11|                |   11|   11|
          |12|                |   11|   12|
          |13|                |   13|   13|
          |14|                |   13|   14|
In SQL I can simply write something like:
SELECT D1.ts, D2.ts
FROM D1, D2
WHERE D1.ts = (SELECT max(D1.ts)
               FROM D1
               WHERE D1.ts <= D2.ts);
Nested SELECT queries are possible for Spark Datasets, but unfortunately they only support equality (=) and not <=. I am a beginner in Spark and currently stuck here. Does anyone more knowledgeable have a good idea how to solve this?
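One workaround that avoids the correlated subquery: join D1 and D2 on the non-equi condition D1.ts <= D2.ts, then keep, for every D2.ts, the largest matching D1.ts. A minimal PySpark sketch, assuming the two Datasets are available as DataFrames d1 and d2 (hypothetical names), each with a ts column:
from pyspark.sql import functions as F

# non-equi join: every D1 row whose ts is <= the D2 ts,
# then reduce to the largest D1.ts per D2.ts
joined = (d1.alias("d1")
            .join(d2.alias("d2"), F.col("d1.ts") <= F.col("d2.ts"))
            .groupBy(F.col("d2.ts").alias("D2_ts"))
            .agg(F.max(F.col("d1.ts")).alias("D1_ts"))
            .select("D1_ts", "D2_ts"))

joined.orderBy("D2_ts").show()
A non-equi join over two large tables can be expensive; if D1 is small, broadcasting it (F.broadcast(d1)) helps.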
I have this dataframe:
+---------+--------+------+
| topic| emotion|counts|
+---------+--------+------+
| dog | sadness| 4 |
| cat |surprise| 1 |
| bird | fear| 3 |
| cat | joy| 2 |
| dog |surprise| 10 |
| dog |surprise| 3 |
+---------+--------+------+
I want to create a column for every distinct emotion, aggregating the counts per topic and emotion, ending up with an output like this:
+---------+--------+---------+-----+----------+
| topic| fear | sadness | joy | surprise |
+---------+--------+---------+-----+----------+
| dog | 0 | 4 | 0 | 13 |
| cat | 0 | 0 | 2 | 1 |
| bird | 3 | 0 | 0 | 0 |
+---------+--------+---------+-----+----------+
This is what I tried so far for the fear column, but the rest of the emotions keep showing up for every topic. How can I get a result like the above?
agg_emotion = df.groupby("topic", "emotion") \
    .agg(F.sum(F.when(F.col("emotion").eqNullSafe("fear"), 1)
               .otherwise(0)).alias('fear'))
Group by and sum, then group by and pivot the outcome:
df.groupby('topic', 'emotion') \
  .agg(F.sum('counts').alias('counts')) \
  .groupby('topic') \
  .pivot('emotion') \
  .agg(F.first('counts')) \
  .na.fill(0) \
  .show()
+-----+----+---+-------+--------+
|topic|fear|joy|sadness|surprise|
+-----+----+---+-------+--------+
| dog| 0| 0| 4| 13|
| cat| 0| 2| 0| 1|
| bird| 3| 0| 0| 0|
+-----+----+---+-------+--------+
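A slightly shorter variant skips the intermediate aggregation and lets pivot do the summing directly; a sketch, assuming the same df from the question and pyspark.sql.functions imported as F:
import pyspark.sql.functions as F

# pivot on emotion and sum counts per topic in one pass,
# filling topics that never show a given emotion with 0
df.groupby('topic') \
  .pivot('emotion') \
  .agg(F.sum('counts')) \
  .na.fill(0) \
  .show()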
I have a Spark dataframe consisting of two columns.
+-----------------------+-----------+
| Metric|Recipe_name|
+-----------------------+-----------+
| 100. | A |
| 200. | A |
| 300. | A |
| 10. | A |
| 20. | A |
| 10. | B |
| 20. | B |
| 10. | A |
| 20. | A |
| .. | .. |
| .. | .. |
| 10. | B |
The dataframe is time ordered (you can imagine there is an increasing timestamp column). I need to add a column 'Cycle'. There are two scenarios in which a new cycle begins:
If the same recipe is running, let's say recipe 'A', and the value of Metric decreases (with respect to the last row), then a new cycle begins.
If we switch from the current recipe 'A' to a second recipe 'B' and then switch back to recipe 'A', a new cycle for recipe 'A' has begun.
So in the end I would like to have a column 'Cycle' which looks like this:
+-----------------------+-----------+-----------+
| Metric|Recipe_name| Cycle|
+-----------------------+-----------+-----------+
| 100. | A | 0 |
| 200. | A | 0 |
| 300. | A | 0 |
| 10. | A | 1 |
| 20. | A | 1 |
| 10. | B | 0 |
| 20. | B | 0 |
| 10. | A | 2 |
| 20. | A | 2 |
| .. | .. | 2 |
| .. | .. | 2 |
| 10. | B | 1 |
So recipe A has cycle 0, then the metric decreases and the cycle changes to 1.
Then a new recipe B starts, so it gets a new cycle 0.
Then we get back to recipe A, so a new cycle begins for recipe A, and with respect to the last cycle number it gets cycle 2 (and similarly for recipe B).
In total there are 200 recipes.
Thanks for the help.
Replace my order column with your ordering column. Check the decrease condition with the lag function over a window partitioned by the Recipe_name column.
from pyspark.sql import Window
from pyspark.sql.functions import col, lag, sum, when

w = Window.partitionBy('Recipe_name').orderBy('order')
# flag Metric decreases per recipe, then turn the flags into cycle numbers with a running sum
df.withColumn('Cycle', when(col('Metric') < lag('Metric', 1, 0).over(w), 1).otherwise(0)) \
  .withColumn('Cycle', sum('Cycle').over(w)) \
  .orderBy('order') \
  .show()
+------+-----------+-----+
|Metric|Recipe_name|Cycle|
+------+-----------+-----+
| 100| A| 0|
| 200| A| 0|
| 300| A| 0|
| 10| A| 1|
| 20| A| 1|
| 10| B| 0|
| 20| B| 0|
| 10| A| 2|
| 20| A| 2|
| 10| B| 1|
+------+-----------+-----+
I have a csv file where I want to skip a random percentage of rows but only for rows where one of the columns has a specific entry. For example I might have a csv with contents below and I want to skip a certain percentage of all the apple entries:
| a | b | c | d | e |
|----|----|----|----|--------|
0| 9 | 1 | 2 | 3 | apple |
1| 8 | 4 | 5 | 6 | apple |
2| 7 | 7 | 8 | 9 | apple |
3| 6 | 10 | 11 | 12 | orange |
4| 5 | 13 | 14 | 15 | orange |
5| 4 | 16 | 17 | 18 | orange |
6| 3 | 19 | 20 | 21 | orange |
7| 2 | 22 | 23 | 24 | banana |
8| 1 | 25 | 26 | 27 | banana |
9| 0 | 28 | 29 | 30 | banana |
I know I could skip rows across the entire file with something like
df = pd.read_csv('fruit.csv', skiprows = lambda i: i>0 and random.random() > probability_value)
I know I can also select just the apple entries from the dataframe with
df2 = df.loc[df['e'] == 'apple']
But is there a simple way to apply the row skipping while importing the csv only to the 'apple' entries, so that the non-'apple' entries aren't affected?
You can do it as follows, but I would prefer doing it at a later stage.
df = pd.read_csv('fruit.csv').query("e != 'apple'")
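If the goal is to drop only a random fraction of the 'apple' rows and leave everything else untouched, one way, as a sketch with an assumed drop probability p, is to filter after reading:
import numpy as np
import pandas as pd

p = 0.3  # assumed fraction of 'apple' rows to drop

df = pd.read_csv('fruit.csv')
# mask that is True only for 'apple' rows selected for removal
drop_mask = (df['e'] == 'apple') & (np.random.rand(len(df)) < p)
df = df[~drop_mask]
This keeps every non-'apple' row and removes each 'apple' row with probability p.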
I have a Spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Prediction over a window of the next N values?
Input Data
+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date | Prediction | N |
+-----------+---------+------------+------------+---+
| 1 | 100 | 2019-07-01 | 0.92 | 2 |
| 1 | 100 | 2019-07-02 | 0.62 | 2 |
| 1 | 100 | 2019-07-03 | 0.89 | 2 |
| 1 | 100 | 2019-07-04 | 0.57 | 2 |
| 2 | 200 | 2019-07-01 | 1.39 | 3 |
| 2 | 200 | 2019-07-02 | 1.22 | 3 |
| 2 | 200 | 2019-07-03 | 1.33 | 3 |
| 2 | 200 | 2019-07-04 | 1.61 | 3 |
+-----------+---------+------------+------------+---+
Expected Output Data
+-----------+---------+------------+------------+---+------------------------+
| ProductId | StoreId | Date | Prediction | N | RollingSum |
+-----------+---------+------------+------------+---+------------------------+
| 1 | 100 | 2019-07-01 | 0.92 | 2 | sum(0.92, 0.62) |
| 1 | 100 | 2019-07-02 | 0.62 | 2 | sum(0.62, 0.89) |
| 1 | 100 | 2019-07-03 | 0.89 | 2 | sum(0.89, 0.57) |
| 1 | 100 | 2019-07-04 | 0.57 | 2 | sum(0.57) |
| 2 | 200 | 2019-07-01 | 1.39 | 3 | sum(1.39, 1.22, 1.33) |
| 2 | 200 | 2019-07-02 | 1.22 | 3 | sum(1.22, 1.33, 1.61 ) |
| 2 | 200 | 2019-07-03 | 1.33 | 3 | sum(1.33, 1.61) |
| 2 | 200 | 2019-07-04 | 1.61 | 3 | sum(1.61) |
+-----------+---------+------------+------------+---+------------------------+
There are lots of questions and answers to this problem in Python but I couldn't find any in PySpark.
Similar Question 1
There is a similar question here, but in that one the frame size is fixed to 3. The provided answer uses the rangeBetween function, which only works with fixed-size frames, so I cannot use it for varying sizes.
Similar Question 2
There is also a similar question here. In that one, writing cases for all possible sizes is suggested, but that is not applicable to my case since I don't know how many distinct frame sizes I would need to handle.
Solution attempt 1
I've tried to solve the problem using a pandas udf:
rolling_sum_predictions = predictions.groupBy('ProductId', 'StoreId').apply(calculate_rolling_sums)
calculate_rolling_sums is a pandas udf where I solve the problem in Python. This solution works with a small amount of test data. However, when the data gets bigger (in my case, the input df has around 1B rows), the calculations take too long.
Solution attempt 2
I have used a workaround based on the answer to Similar Question 1 above. I calculated the biggest possible N, collected the list of following predictions using it, and then calculated the sum by slicing that list.
import numpy as np
from pyspark.sql import Window, functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

predictions = predictions.withColumn('DayIndex', F.rank().over(Window.partitionBy('ProductId', 'StoreId').orderBy('Date')))

# find the biggest period
biggest_period = predictions.agg({"N": "max"}).collect()[0][0]

# collect the next predictions starting from the DayIndex
w = (Window.partitionBy(F.col("ProductId"), F.col("StoreId")).orderBy(F.col('DayIndex')).rangeBetween(0, biggest_period - 1))
rolling_prediction_lists = predictions.withColumn("next_preds", F.collect_list("Prediction").over(w))

# calculate rolling forecast sums by slicing the collected list up to N
pred_sum_udf = udf(lambda preds, period: float(np.sum(preds[:period])), FloatType())
rolling_pred_sums = rolling_prediction_lists \
    .withColumn("RollingSum", pred_sum_udf("next_preds", "N"))
This solution also works with the test data. I haven't had a chance to test it with the original data yet, but whether it works or not, I don't like this solution. Is there a smarter way to solve this?
If you're using Spark 2.4+, you can use the new higher-order array functions slice and aggregate to efficiently implement your requirement without any UDFs:
summed_predictions = predictions \
    .withColumn("summed", F.collect_list("Prediction").over(Window.partitionBy("ProductId", "StoreId").orderBy("Date").rowsBetween(Window.currentRow, Window.unboundedFollowing))) \
    .withColumn("summed", F.expr("aggregate(slice(summed, 1, N), cast(0 as double), (acc, d) -> acc + d)"))
summed_predictions.show()
+---------+-------+-------------------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| summed|
+---------+-------+-------------------+----------+---+------------------+
| 1| 100|2019-07-01 00:00:00| 0.92| 2| 1.54|
| 1| 100|2019-07-02 00:00:00| 0.62| 2| 1.51|
| 1| 100|2019-07-03 00:00:00| 0.89| 2| 1.46|
| 1| 100|2019-07-04 00:00:00| 0.57| 2| 0.57|
| 2| 200|2019-07-01 00:00:00| 1.39| 3| 3.94|
| 2| 200|2019-07-02 00:00:00| 1.22| 3| 4.16|
| 2| 200|2019-07-03 00:00:00| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04 00:00:00| 1.61| 3| 1.61|
+---------+-------+-------------------+----------+---+------------------+
It might not be the best approach, but you can get the distinct "N" column values and loop like below.
val arr = df.select("N").distinct.collect
for (n <- arr) df.filter(col("N") === n.get(0))
  .withColumn("RollingSum", sum(col("Prediction"))
    .over(Window.partitionBy("ProductId", "StoreId").orderBy("Date")
      .rowsBetween(Window.currentRow, n.get(0).toString.toLong - 1)))
  .show
This will give you like:
+---------+-------+----------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| RollingSum|
+---------+-------+----------+----------+---+------------------+
| 2| 200|2019-07-01| 1.39| 3| 3.94|
| 2| 200|2019-07-02| 1.22| 3| 4.16|
| 2| 200|2019-07-03| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04| 1.61| 3| 1.61|
+---------+-------+----------+----------+---+------------------+
+---------+-------+----------+----------+---+----------+
|ProductId|StoreId| Date|Prediction| N|RollingSum|
+---------+-------+----------+----------+---+----------+
| 1| 100|2019-07-01| 0.92| 2| 1.54|
| 1| 100|2019-07-02| 0.62| 2| 1.51|
| 1| 100|2019-07-03| 0.89| 2| 1.46|
| 1| 100|2019-07-04| 0.57| 2| 0.57|
+---------+-------+----------+----------+---+----------+
Then you can do a union of all the dataframes inside the loop.
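If you need this in PySpark, a rough sketch of the same loop-and-union idea (assuming the predictions DataFrame from the question and the column names shown above):
from functools import reduce
from pyspark.sql import Window, functions as F

# one rolling sum per distinct window size N, then union the pieces back together
n_values = [row["N"] for row in predictions.select("N").distinct().collect()]

parts = []
for n in n_values:
    w = (Window.partitionBy("ProductId", "StoreId")
                .orderBy("Date")
                .rowsBetween(Window.currentRow, n - 1))
    parts.append(predictions.filter(F.col("N") == n)
                            .withColumn("RollingSum", F.sum("Prediction").over(w)))

result = reduce(lambda a, b: a.unionByName(b), parts)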
Following this question, I would like to ask.
I have 2 tables:
The first table - MajorRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1200 | 1500 | A
2 | 2200 | 2700 | B
3 | 1700 | 1900 | C
4 | 2100 | 2150 | D
...
The second table - SubRange
row | From | To | Group ....
-----|--------|---------|---------
1 | 1208 | 1300 | E
2 | 1400 | 1600 | F
3 | 1700 | 2100 | G
4 | 2100 | 2500 | H
...
The output table should contain all the SubRange groups that have an overlap with a MajorRange group. In the following example the result table is:
row | Major | Sub |
-----|--------|------|-
1 | A | E |
2 | A | F |
3 | B | H |
4 | C | G |
5 | D | H |
If there is no overlap between the ranges, the Major group will not appear.
Both tables are big data tables. How can I do this using Hive/Spark in the most efficient way?
With Spark, maybe a non-equi join like this?
val join_expr = major_range("From") < sub_range("To") && major_range("To") > sub_range("From")
(major_range.join(sub_range, join_expr)
  .select(
    monotonically_increasing_id().as("row"),
    major_range("Group").as("Major"),
    sub_range("Group").as("Sub")
  )
).show
+---+-----+---+
|row|Major|Sub|
+---+-----+---+
| 0| A| E|
| 1| A| F|
| 2| B| H|
| 3| C| G|
| 4| D| H|
+---+-----+---+
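For completeness, the same non-equi join written in PySpark; a sketch, assuming the two tables are loaded as DataFrames named major_range and sub_range with the columns shown above:
from pyspark.sql import functions as F

# ranges overlap when each one starts before the other one ends
join_expr = (major_range["From"] < sub_range["To"]) & (major_range["To"] > sub_range["From"])

(major_range.join(sub_range, join_expr)
    .select(F.monotonically_increasing_id().alias("row"),
            major_range["Group"].alias("Major"),
            sub_range["Group"].alias("Sub"))
    .show())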