Spark drop duplicates and select row with max value - apache-spark

I'm trying to drop duplicates based on column1 and select the row with max value in column2. The column2 has "year"(2019,2020 etc) as values and it is of type "String". The solution I have is, converting the column 2 into integer and selecting the max value.
Dataset<Row> ds ; //The dataset with column1,column2(year), column3 etc.
Dataset<Row> newDs = ds.withColumn("column2Int", col("column2").cast(DataTypes.IntegerType));
newDs = newDs.groupBy("column1").max("column2Int"); // drops all other columns
This approach drops all other columns in the original dataset 'ds' when I do a "group by". So I have to do a join between 'ds' and 'newDS' to get back all the original columns. Also casting the String column to Integer looks like an ineffective workaround.
Is it possible to drop the duplicates and get the row with bigger string value from the original dataset itself ?

This is a classic de-duplication problem and you'll need to use Window + Rank + filter combo for this.
I'm not very familiar with the Java syntax, but the sample code should look like something below,
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
Dataset<Row> df = ???;
WindowSpec windowSpec = Window.partitionBy("column1").orderBy(functions.desc("column2Int"));
Dataset<Row> result =
df.withColumn("column2Int", functions.col("column2").cast(DataTypes.IntegerType))
.withColumn("rank", functions.rank().over(windowSpec))
.where("rank == 1")
.drop("rank");
result.show(false);
Overview of what happened,
Add the casted integer column to the df for future sorting.
Subsections/ windows were formed in your dataset (partitions) based on the value of column1
For each of these subsections/ windows/ partitions the rows were sorted on column casted to int. Desc order as you want max.
Ranks like row numbers are assigned to the rows in each partition/ window created.
Filtering is done for all row where rank is 1 (max value as the ordering was desc.)

Related

Get only rows of dataframe where a subset of columns exist in another dataframe

I want to get all rows of a dataframe (df2) where the city column value and postcode column value also exist in another dataframe (df1).
Important is here that I want the combination of both columns and not look at the column individually.
My approach was this:
#1. Get all combinations
df_combinations=np.array(df1.select("Ort","Postleitzahl").dropDuplicates().collect())
sc.broadcast(df_combinations)
#2.Define udf
def combination_in_vx(ort,plz):
for arr_el in dfSpark_combinations:
if str(arr_el[0]) == ort and int(arr_el[1]) == plz:
return True
return False
combination_in_vx = udf(combination_in_vx, BooleanType())
#3.
df_tmp=df_2.withColumn("Combination_Exists", combination_in_vx('city','postcode'))
df_result=df_tmp.filter(df_tmp.Combination_Exists)
Although this should theoretically work it takes forever!
Does anybody know about a better solution here? Thank you very much!
You can do a left semi join using the two columns. This will include the rows in df2 where the values in both of the two specified columns exist in df1:
import pyspark.sql.functions as F
df_result = df2.join(df1, ["Ort", "Postleitzahl"], 'left_semi')

How to selecting multiple rows and take mean value based on name of the row

From this data frame I like to select rows with same concentration and also almost same name. For example, first three rows has same concentration and also same name except at the end of the name Dig_I, Dig_II, Dig_III. This 3 rows same with same concentration. I like to somehow select this three rows and take mean value of each column. After that I want to create a new data frame.
here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean()
Note: This will only find the averages for columns with dtype float or int... this will drop the img_name column and will take the averages of all columns...
This may be faster...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean()
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean()
pd.merge(df, new, left_on = 'concentration', right_on = 'concentration', how = 'inner')
Does that help?

How to randomly select rows from a Spark dataframe while a condition based on a column must holds too

Let say we have a Spark dataframe df with a column col where the values in this column are only 0 and 1. How can we select all rows where col==1 and also 50% of rows where col==0? The 50% population with col==0 should be randomly selected.
The sample method allows random selection of 50% of rows but no other condition can be imposed.
The solution that I currently have is as follows which seems a bit ugly to me. I wonder if there is a better solution.
from pyspark.sql import functions as F
df = df.withColumn('uniform', F.rand())
df = df.filter((df.uniform<0.5) | (df.col==1))
This won't guarantee exactly 50%, but it should suffice given a large enough data set.
df.where($"col" == 1 or rand() > rand())
note: This will return a different set of random rows each time the dataframe/dataset is calculated. To remedy this, add the rand() > rand() qualification as a column in the DF, i.e. df.withColumn("lucky", rand() > rand())

PySpark: Update column values for a given number of rows of a DataFrame

I have a DataFrame with 10 rows and 2 columns: an ID column with random identifier values and a VAL column filled with None.
vals = [
Row(ID=1,VAL=None),
Row(ID=2,VAL=None),
Row(ID=3,VAL=None),
Row(ID=4,VAL=None),
Row(ID=5,VAL=None),
Row(ID=6,VAL=None),
Row(ID=7,VAL=None),
Row(ID=8,VAL=None),
Row(ID=9,VAL=None),
Row(ID=10,VAL=None)
]
df = spark.createDataFrame(vals)
Now lets say I want to update the VAL column for 3 Rows with value "lets", 3 Rows with value "bucket" and 4 Rows with value "this".
Is there a straightforward way of doing this in PySpark?
Note: ID values is not necessarily consecutive, bucket distribution is not necessarily even
I'll try to explain an idea with some pseudo-code and you'll map to your solution.
Using window function on one partition we can generate row_number() sequential number for each row in dataframe and store it let say in column row_num.
Next your "rules" can be represented as another little dataframe: [min_row_num, max_row_num, label].
All you need is to join those two datasets on row number, adding new column:
df1.join(df2,
on=col('df1.row_num').between(col('min_row_num'), col('max_row_num'))
)
.select('df1.*', 'df2.label')

spark Dataframe column transformations using lookup for values into other Dataframe

I need to transform dataframe's multiple column values by looking up into other dataframe.
The other dataframe on the right will not have too much rows, say around 5000 records.
I need to replace for example field_1 column values to ratios like field_1,0 to 8 & field_1,3 to 25 by looking up into right data frame.
So eventually it will be filled like below:
Option 1 is to load & collect the look up dataframe on left into memory as broadcast it as boadcast variable. A Map of Map can be used I believe and should not take too much of memory on executors.
Option 2 is join the lookup data frame for each column. But I believe this will be highly inefficient as the number of field columns can be too many like 50 to 100.
Which of the above option is good? Or is there a better way of filling the values?
I would go for option1, e.g.:
val dfBig : DataFrame = ???
val dfLookup : DataFrame = ???
val lookupMap = dfLookup
.map{case Row(category:String,field_values:Int,ratio:Int) => ((category,field_values),ratio)}
.collect()
.toMap
val bc_lookupMap = spark.sparkContext.broadcast(lookupMap)
val lookupUdf = udf((field1:Int,field2:Int) =>
(bc_lookupMap.value(("field_1",field1)),bc_lookupMap.value(("field_2",field2)))
)
dfBig
.withColumn("udfResult", lookupUdf($"field_1",$"field_2"))
.select($"primaryId",$"udfResult._1".as("field_1"),$"udfResult._2".as("field_2"))

Resources