Add a new Column to my DataSet in spark Java API - apache-spark

I'm new to Spark.
My DataSet contains two columns, and I want to add a third column that is the sum of those two.
My DataSet is:
+----+----+
| C1 | C2 |
+----+----+
| 44 | 10 |
| 55 | 10 |
+----+----+
I want to obtain a DataSet like this:
+----+----+----+
| C1 | C2 | C3 |
+----+----+----+
| 44 | 10 | 54 |
| 55 | 10 | 65 |
+----+----+----+
Any help will be appreciated.

The correct solution is:
df.withColumn("C3", df.col("C1").plus(df.col("C2")));
or
df.selectExpr("*", "C1 + C2 AS C3");
For more arithmetic operators, see the Java-specific expression operators in the Column documentation.

Related

How to get the rows with Max value in Spark DataFrame

I have a dataframe (df1) with the following details
| Date       | High | Low |
| ---------- | ---- | --- |
| 2021-01-23 | 89   | 43  |
| 2021-02-09 | 90   | 54  |
| 2009-09-19 | 96   | 50  |
I then apply an aggregate function to the High column:
df1.agg({'High':'max'}).show()
This will give me:
| max(High)|
| -------- |
| 96 |
How can I apply a filter or other methods so that the other columns in the same row as max(High) show together with the aggregated result?
My desired outcome is -
| Date       | High | Low |
| ---------- | ---- | --- |
| 2009-09-19 | 96   | 50  |
You can do this easily by extracting the max High value and then applying a filter against that value on the entire DataFrame.
Data Preparation
import pandas as pd
from pyspark.sql import functions as F

df = pd.DataFrame({
    'Date': ['2021-01-23', '2021-02-09', '2009-09-19'],
    'High': [89, 90, 96],
    'Low': [43, 54, 50]
})

# `sql` here is an existing SparkSession
sparkDF = sql.createDataFrame(df)
sparkDF.show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2021-01-23| 89| 43|
|2021-02-09| 90| 54|
|2009-09-19| 96| 50|
+----------+----+---+
Filter
max_high = sparkDF.select(F.max(F.col('High')).alias('High')).collect()[0]['High']
>>> 96
sparkDF.filter(F.col('High') == max_high).orderBy(F.col('Date').desc()).limit(1).show()
+----------+----+---+
| Date|High|Low|
+----------+----+---+
|2009-09-19| 96| 50|
+----------+----+---+
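As a side note (not part of the answer above), a single-pass sketch that returns the same row here is to sort by High descending and keep the first row; with ties on High it keeps an arbitrary one of the tied rows unless you add a tie-breaker:
from pyspark.sql import functions as F

# Sort by High descending and keep only the top row.
sparkDF.orderBy(F.col('High').desc()).limit(1).show()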

PySpark: Filtering duplicates of a union, keeping only the groupby rows with the maximum value for a specified column

I want to create a DataFrame that contains all the rows from two DataFrames, and where there are duplicates we keep only the row with the max value of a column.
For example, given the two tables with the same schema below, we merge them into one table that keeps only the row with the maximum value of one column (the highest score) for each group of rows grouped by another column ("name" in the example below).
Table A
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Finch | Acme | 62 |
| Jones | Acme | 30 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Table B
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Jones | Beta | 75 |
| Lewis | Beta | 59 |
| Smith | Beta | 64 |
| Starr | Beta | 81 |
+--------+---------+-------+
Final Table
+--------------------------+
| name | source | score |
+--------+---------+-------+
| Bryan | Beta | 93 |
| Finch | Acme | 62 |
| Jones | Beta | 75 |
| Lewis | Acme | 59 |
| Smith | Acme | 98 |
| Starr | Acme | 87 |
+--------+---------+-------+
Here's what seems to work:
from pyspark.sql import functions as F
schema = ["name", "source", "score"]
rows1 = [("Smith", "Acme", 98),
         ("Jones", "Acme", 30),
         ("Finch", "Acme", 62),
         ("Lewis", "Acme", 59),
         ("Starr", "Acme", 87)]
rows2 = [("Smith", "Beta", 64),
         ("Jones", "Beta", 75),
         ("Bryan", "Beta", 93),
         ("Lewis", "Beta", 59),
         ("Starr", "Beta", 81)]
df1 = spark.createDataFrame(rows1, schema)
df2 = spark.createDataFrame(rows2, schema)
df_union = df1.unionAll(df2)
df_agg = df_union.groupBy("name").agg(F.max("score").alias("score"))
df_final = df_union.join(df_agg, on="score", how="leftsemi").orderBy("name", F.col("score").desc()).dropDuplicates(["name"])
The above results in the DataFrame I expect. It seems like a convoluted way to do this, but I don't know as I'm relatively new to Spark. Can this be done in a more efficient, elegant, or "Pythonic" manner?
You can use window functions. Partition by name and choose the record with the highest score.
from pyspark.sql.functions import *
from pyspark.sql.window import Window
w = Window.partitionBy("name").orderBy(desc("score"))
df_union.withColumn("rank", row_number().over(w)) \
    .filter(col("rank") == 1).drop("rank").show()
+-----+------+-----+
| name|source|score|
+-----+------+-----+
|Bryan| Beta| 93|
|Finch| Acme| 62|
|Jones| Beta| 75|
|Lewis| Acme| 59|
|Smith| Acme| 98|
|Starr| Acme| 87|
+-----+------+-----+
I don't see anything wrong with your answer except for the last line: you cannot join on score alone, you need to join on the combination of "name" and "score". You can also choose an inner join, which eliminates the need to remove rows with lower scores for the same name:
df_final = (df_union.join(df_agg, on=["name", "score"], how="inner")
            .orderBy("name")
            .dropDuplicates(["name"]))
Notice that there is no need to order by score, and .dropDuplicates(["name"]) is only needed if you want to avoid displaying two rows for name = Lewis who has the same score in both dataframes.

How to calculate rolling sum with varying window sizes in PySpark

I have a spark dataframe that contains sales prediction data for some products in some stores over a time period. How do I calculate the rolling sum of Predictions for a window size of next N values?
Input Data
+-----------+---------+------------+------------+---+
| ProductId | StoreId | Date | Prediction | N |
+-----------+---------+------------+------------+---+
| 1 | 100 | 2019-07-01 | 0.92 | 2 |
| 1 | 100 | 2019-07-02 | 0.62 | 2 |
| 1 | 100 | 2019-07-03 | 0.89 | 2 |
| 1 | 100 | 2019-07-04 | 0.57 | 2 |
| 2 | 200 | 2019-07-01 | 1.39 | 3 |
| 2 | 200 | 2019-07-02 | 1.22 | 3 |
| 2 | 200 | 2019-07-03 | 1.33 | 3 |
| 2 | 200 | 2019-07-04 | 1.61 | 3 |
+-----------+---------+------------+------------+---+
Expected Output Data
+-----------+---------+------------+------------+---+------------------------+
| ProductId | StoreId | Date | Prediction | N | RollingSum |
+-----------+---------+------------+------------+---+------------------------+
| 1 | 100 | 2019-07-01 | 0.92 | 2 | sum(0.92, 0.62) |
| 1 | 100 | 2019-07-02 | 0.62 | 2 | sum(0.62, 0.89) |
| 1 | 100 | 2019-07-03 | 0.89 | 2 | sum(0.89, 0.57) |
| 1 | 100 | 2019-07-04 | 0.57 | 2 | sum(0.57) |
| 2 | 200 | 2019-07-01 | 1.39 | 3 | sum(1.39, 1.22, 1.33) |
| 2 | 200 | 2019-07-02 | 1.22 | 3 | sum(1.22, 1.33, 1.61 ) |
| 2 | 200 | 2019-07-03 | 1.33 | 3 | sum(1.33, 1.61) |
| 2 | 200 | 2019-07-04 | 1.61 | 3 | sum(1.61) |
+-----------+---------+------------+------------+---+------------------------+
There are lots of questions and answers to this problem in Python but I couldn't find any in PySpark.
Similar Question 1
There is a similar question here, but in that one the frame size is fixed to 3. In the provided answer the rangeBetween function is used, which only works with fixed-size frames, so I cannot use it for varying sizes.
Similar Question 2
There is also a similar question here. In that one, writing cases for all possible sizes is suggested, but it is not applicable to my case since I don't know how many distinct frame sizes I need to calculate.
Solution attempt 1
I've tried to solve the problem using a pandas udf:
rolling_sum_predictions = predictions.groupBy('ProductId', 'StoreId').apply(calculate_rolling_sums)
calculate_rolling_sums is a pandas UDF where I solve the problem in Python. This solution works with a small amount of test data; however, when the data gets bigger (in my case the input df has around 1B rows), the calculation takes too long.
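(For reference, the question does not show the body of calculate_rolling_sums. A minimal sketch of what such a grouped-map pandas UDF could look like is below; the column types and the reversed-rolling trick are my own assumptions, not the asker's actual code.)
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import (StructType, StructField, IntegerType,
                               TimestampType, DoubleType)

# Assumed output schema: the input columns plus a RollingSum column.
out_schema = StructType([
    StructField("ProductId", IntegerType()),
    StructField("StoreId", IntegerType()),
    StructField("Date", TimestampType()),
    StructField("Prediction", DoubleType()),
    StructField("N", IntegerType()),
    StructField("RollingSum", DoubleType()),
])

@pandas_udf(out_schema, PandasUDFType.GROUPED_MAP)
def calculate_rolling_sums(pdf):
    # Each pdf is one (ProductId, StoreId) group as a plain pandas DataFrame.
    pdf = pdf.sort_values("Date").reset_index(drop=True)
    n = int(pdf["N"].iloc[0])
    # A forward-looking sum of the next n values is a trailing rolling sum
    # computed on the reversed series.
    pdf["RollingSum"] = (pdf["Prediction"][::-1]
                         .rolling(window=n, min_periods=1).sum()[::-1])
    return pdf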
Solution attempt 2
I have used a workaround based on the answer to Similar Question 1 above: I calculated the biggest possible N, collected the prediction list using it, and then calculated the sum of predictions by slicing the list.
import numpy as np
from pyspark.sql import functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
from pyspark.sql.window import Window

predictions = predictions.withColumn('DayIndex', F.rank().over(
    Window.partitionBy('ProductId', 'StoreId').orderBy('Date')))
# find the biggest period
biggest_period = predictions.agg({"N": "max"}).collect()[0][0]
# calculate rolling predictions starting from the DayIndex
w = (Window.partitionBy(F.col("ProductId"), F.col("StoreId"))
     .orderBy(F.col('DayIndex')).rangeBetween(0, biggest_period - 1))
rolling_prediction_lists = predictions.withColumn("next_preds", F.collect_list("Prediction").over(w))
# calculate rolling forecast sums
pred_sum_udf = udf(lambda preds, period: float(np.sum(preds[:period])), FloatType())
rolling_pred_sums = rolling_prediction_lists \
    .withColumn("RollingSum", pred_sum_udf("next_preds", "N"))
This solution also works with the test data. I haven't had a chance to test it with the original data yet, but whether it works or not, I do not like this solution. Is there any smarter way to solve this?
If you're using spark 2.4+, you can use the new higher-order array functions slice and aggregate to efficiently implement your requirement without any UDFs:
# collect the remaining predictions for each row, then sum the first N of them
summed_predictions = predictions \
    .withColumn("summed", F.collect_list("Prediction").over(
        Window.partitionBy("ProductId", "StoreId").orderBy("Date")
              .rowsBetween(Window.currentRow, Window.unboundedFollowing))) \
    .withColumn("summed", F.expr("aggregate(slice(summed, 1, N), cast(0 as double), (acc, d) -> acc + d)"))
summed_predictions.show()
+---------+-------+-------------------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| summed|
+---------+-------+-------------------+----------+---+------------------+
| 1| 100|2019-07-01 00:00:00| 0.92| 2| 1.54|
| 1| 100|2019-07-02 00:00:00| 0.62| 2| 1.51|
| 1| 100|2019-07-03 00:00:00| 0.89| 2| 1.46|
| 1| 100|2019-07-04 00:00:00| 0.57| 2| 0.57|
| 2| 200|2019-07-01 00:00:00| 1.39| 3| 3.94|
| 2| 200|2019-07-02 00:00:00| 1.22| 3| 4.16|
| 2| 200|2019-07-03 00:00:00| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04 00:00:00| 1.61| 3| 1.61|
+---------+-------+-------------------+----------+---+------------------+
It might not be the best, but you can get distinct "N" column values and loop like below.
val arr = df.select("N").distinct.collect
for (n <- arr) {
  df.filter(col("N") === n.get(0))
    .withColumn("RollingSum", sum(col("Prediction")).over(
      Window.partitionBy("N").orderBy("Date")
        .rowsBetween(Window.currentRow, n.get(0).toString.toLong - 1)))
    .show
}
This will give you output like:
+---------+-------+----------+----------+---+------------------+
|ProductId|StoreId| Date|Prediction| N| RollingSum|
+---------+-------+----------+----------+---+------------------+
| 2| 200|2019-07-01| 1.39| 3| 3.94|
| 2| 200|2019-07-02| 1.22| 3| 4.16|
| 2| 200|2019-07-03| 1.33| 3|2.9400000000000004|
| 2| 200|2019-07-04| 1.61| 3| 1.61|
+---------+-------+----------+----------+---+------------------+
+---------+-------+----------+----------+---+----------+
|ProductId|StoreId| Date|Prediction| N|RollingSum|
+---------+-------+----------+----------+---+----------+
| 1| 100|2019-07-01| 0.92| 2| 1.54|
| 1| 100|2019-07-02| 0.62| 2| 1.51|
| 1| 100|2019-07-03| 0.89| 2| 1.46|
| 1| 100|2019-07-04| 0.57| 2| 0.57|
+---------+-------+----------+----------+---+----------+
Then you can do a union of all the dataframes inside the loop.
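Not part of the answer above, but for completeness: a rough PySpark equivalent of the same loop-and-union idea (assuming the question's predictions DataFrame) might look like this, using functools.reduce to union the per-N pieces:
from functools import reduce
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One window per distinct N, ordered by Date within each product/store,
# then union the per-N results back together.
pieces = []
for (n,) in predictions.select("N").distinct().collect():
    w = (Window.partitionBy("ProductId", "StoreId").orderBy("Date")
         .rowsBetween(Window.currentRow, n - 1))
    pieces.append(predictions.filter(F.col("N") == n)
                  .withColumn("RollingSum", F.sum("Prediction").over(w)))

rolling_pred_sums = reduce(lambda a, b: a.unionByName(b), pieces)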

How to align timestamps from two Datasets in Apache Spark

I ran into the following problem while developing an Apache Spark application. I have two Datasets (D1 and D2) from a PostgreSQL database that I would like to process using Apache Spark. Both contain a column (ts) with timestamps from the same period. I would like to join each row of D2 with the largest timestamp from D1 that is smaller than or equal to it. It might look like:
 D1        D2                  DJOIN
|ts|      |ts|            |D1.ts|D2.ts|
----      ----            -------------
| 1|      | 1|            |   1 |   1 |
| 3|      | 2|            |   1 |   2 |
| 5|      | 3|            |   3 |   3 |
| 7|      | 4|            |   3 |   4 |
|11|      | 5|            |   5 |   5 |
|13|      | 6|            |   5 |   6 |
          | 7|  = join => |   7 |   7 |
          | 8|            |   7 |   8 |
          | 9|            |   7 |   9 |
          |10|            |   7 |  10 |
          |11|            |  11 |  11 |
          |12|            |  11 |  12 |
          |13|            |  13 |  13 |
          |14|            |  13 |  14 |
In SQL I can simply write something like:
SELECT D1.ts, D2.ts
FROM D1, D2
WHERE D1.ts = (SELECT max(D1.ts)
               FROM D1
               WHERE D1.ts <= D2.ts);
Spark Datasets do allow nested SELECT queries, but unfortunately they support only equality (=) and not <=. I am a beginner in Spark and I am currently stuck here. Does someone more knowledgeable have a good idea how to solve this issue?
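One way to express the same logic with the DataFrame API (sketched here in PySpark; d1 and d2 are placeholders for the two Datasets, and D2 timestamps are assumed to be unique) is a non-equi join followed by a group-wise max:
from pyspark.sql import functions as F

# Pair every D2 timestamp with all D1 timestamps that are <= it,
# then keep only the largest D1 timestamp for each D2 timestamp.
djoin = (d2.alias("d2")
         .join(d1.alias("d1"), F.col("d1.ts") <= F.col("d2.ts"))
         .groupBy(F.col("d2.ts"))
         .agg(F.max(F.col("d1.ts")).alias("d1_ts"))
         .withColumnRenamed("ts", "d2_ts")
         .orderBy("d2_ts"))
Note that this is a range join rather than an equi-join, so it can be expensive on large inputs.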

Excel: sort column with mixed numbers and letters?

I am working on a dataset in Excel that I obtained from an experiment. Since I needed some ratings (and I wanted the raters to be blind), I completely randomized the answers, and now I can't put them back in order!
This is what I have:
1A
38R
22R
7A
41R
64A
etc...
And this is what I need in the end:
1A
2A
3A
...
99R
100R
101R
Thank you!
I have created two new columns (B and C in this case, as in the other example posted).
I typed =LEFT(A1,LEN(A1)-1) in column B to get the number, then =RIGHT(A1,1) in column C to get the letter; finally I can sort by B and then C.
You will not get your desired output by sorting alphabetically, because 100R would come before 2A.
If your values will always be in the format of a number followed by a single character and will be at most 5 characters long, you can use @Scott Craner's formula =RIGHT("00000"&A1,5) to pad the left of your value with "0" so that it alphabetizes correctly: 100R becomes 0100R and 2A becomes 0002A, and these now sort correctly.
Now you can simply sort your range by column B ascending alphabetically.
If you need more characters, add as many zeroes to the formula as the maximum number of characters, and change the 5 in the formula to that new number of characters.
Here is an example excel file.
INPUT
+---+-----+-------+
| | A | B |
+---+-----+-------+
| 1 | 1A | 0001A |
+---+-----+-------+
| 2 | 38R | 0038R |
+---+-----+-------+
| 3 | 22R | 0022R |
+---+-----+-------+
| 4 | 7A | 0007A |
+---+-----+-------+
| 5 | 41R | 0041R |
+---+-----+-------+
| 6 | 64A | 0064A |
+---+-----+-------+
RESULT
+---+-----+-------+
| | A | B |
+---+-----+-------+
| 1 | 1A | 0001A |
+---+-----+-------+
| 2 | 7A | 0007A |
+---+-----+-------+
| 3 | 22R | 0022R |
+---+-----+-------+
| 4 | 38R | 0038R |
+---+-----+-------+
| 5 | 41R | 0041R |
+---+-----+-------+
| 6 | 64A | 0064A |
+---+-----+-------+
