Compare each Spark dataframe element with all the rest of the same dataframe - apache-spark

I'm looking for an efficient way of applying some map function to each pair of elements in a dataframe, e.g.
records = spark.createDataFrame(
    [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')],
    ['id', 'val'])
records.show()
+---+---+
| id|val|
+---+---+
| 1| a|
| 2| b|
| 3| c|
| 4| d|
+---+---+
I want to take values a, b, c, d and compare each of them with all the rest:
a -> b
a -> c
a -> d
b -> c
b -> d
c -> d
By comparison I mean a custom function that takes those 2 values and calculates some similarity index between them.
Could you suggest an efficient way to perform this calculation, assuming the input dataframe could contain tens of millions of elements?
Spark version 2.4.6 (AWS emr-5.31.0), using an EMR notebook with PySpark.

Collect the val column values into a lookup column, then compare each value from the lookup array with the val column.
Check the code below.
from pyspark.sql import functions as F

(records
    .select(
        F.collect_list(F.struct(F.col("id"), F.col("val"))).alias("data"),
        F.collect_list(F.col("val")).alias("lookup"))
    .withColumn("data", F.explode(F.col("data")))
    .select("data.*", F.expr("filter(lookup, v -> v != data.val)").alias("lookup"))
    # .withColumn("compare", F.expr("transform(lookup, v -> [.....])"))  # maybe you can add your logic in this -> [.....]
    .show())
+---+---+---------+
| id|val| lookup|
+---+---+---------+
| 1| a|[b, c, d]|
| 2| b|[a, c, d]|
| 3| c|[a, b, d]|
| 4| d|[a, b, c]|
+---+---+---------+
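If the comparison can be expressed in Spark SQL, the commented-out transform step above can be filled in directly. A minimal sketch, using levenshtein purely as a stand-in for the actual similarity metric (which is not specified in the question):
from pyspark.sql import functions as F

(records
    .select(
        F.collect_list(F.struct(F.col("id"), F.col("val"))).alias("data"),
        F.collect_list(F.col("val")).alias("lookup"))
    .withColumn("data", F.explode(F.col("data")))
    .select("data.*", F.expr("filter(lookup, v -> v != data.val)").alias("lookup"))
    # levenshtein is only a placeholder for the real similarity function
    .withColumn("compare",
                F.expr("transform(lookup, v -> named_struct('other', v, 'score', levenshtein(val, v)))"))
    .show(truncate=False))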

This is a cross join operation with a collect_list aggregation. If you want a's lookup list to contain only [b, c, d], you should apply that filter before doing the collect_list.
records.alias("lhs")
    .crossJoin(records.alias("rhs"))
    .filter("lhs.val != rhs.val")
    .groupBy("lhs.id", "lhs.val")
    .agg(functions.collect_list("rhs.val").alias("lookup"))
    .selectExpr("id", "val", "lookup");
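For reference, a PySpark sketch of the same cross-join idea (assuming the records dataframe from the question):
import pyspark.sql.functions as F

(records.alias("lhs")
    .crossJoin(records.alias("rhs"))
    .filter("lhs.val != rhs.val")
    .groupBy("lhs.id", "lhs.val")
    .agg(F.collect_list("rhs.val").alias("lookup"))
    .show())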

Related

Remove rows where groups of two columns have differences

Is it possible to remove rows where a value in the Block column occurs at least twice with different values in the ID column?
My data looks like this:
ID  Block
1   A
1   C
1   C
3   A
3   B
In the above case, the value A in the Block column occurs twice, with values 1 and 3 in the ID column, so those rows are removed.
The expected output should be:
ID  Block
1   C
1   C
3   B
I tried to use dropDuplicates after groupBy, but I don't know how to filter with this type of condition. It seems I would need a set of IDs for each Block value to check against.
One way to do it is using window functions. The first one (lag) marks a row if its ID differs from the previous one within the same Block. The second (sum) propagates that mark to every row of the Block, so rows whose Block has no marked row get _flag = True. Lastly, the filter keeps the flagged rows and drops the helper (_flag) column.
Input:
from pyspark.sql import functions as F, Window as W
df = spark.createDataFrame(
    [(1, 'A'),
     (1, 'C'),
     (1, 'C'),
     (3, 'A'),
     (3, 'B')],
    ['ID', 'Block'])
Script:
w1 = W.partitionBy('Block').orderBy('ID')
w2 = W.partitionBy('Block')
grp = F.when(F.lag('ID').over(w1) != F.col('ID'), 1).otherwise(0)
df = df.withColumn('_flag', F.sum(grp).over(w2) == 0) \
    .filter('_flag').drop('_flag')
df.show()
# +---+-----+
# | ID|Block|
# +---+-----+
# | 3| B|
# | 1| C|
# | 1| C|
# +---+-----+
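A non-window variant of the same idea, closer to the "set" intuition in the question, is to keep only the Blocks that map to a single distinct ID. A sketch, assuming the original df from the Input block above (before it is overwritten by the script):
from pyspark.sql import functions as F

# keep only Blocks whose rows all share the same ID
keep = (df.groupBy('Block')
          .agg(F.countDistinct('ID').alias('n_ids'))
          .filter('n_ids = 1')
          .select('Block'))
df.join(keep, on='Block', how='inner').select('ID', 'Block').show()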
Use window functions: get ranks per group of Blocks and throw away any rows that rank higher than 1. Code below.
from pyspark.sql import Window
from pyspark.sql.functions import row_number, rank, col

(df.withColumn('index', row_number().over(Window.partitionBy().orderBy('ID', 'Block')))  # create an index to restore the order after the ranking
   .withColumn('BlockRank', rank().over(Window.partitionBy('Block').orderBy('ID'))).orderBy('index')  # rank per Block
   .where(col('BlockRank') == 1)
   .drop('index', 'BlockRank')
).show()
+---+-----+
| ID|Block|
+---+-----+
| 1| A|
| 1| C|
| 1| C|
| 3| B|
+---+-----+

How to remove all the subset from a column except few based on the other column in Pyspark?

I have a PySpark data frame with a column of lists (column a) and another column with numbers (column b). I want to retain all superset rows, and those subset rows that have a greater value in column b than their supersets.
For Example,
Input data frame:
Column a = ([A,B,C],[A,C],[B,C],[J,S,K],[J,S],[J,K])
Column b = (10,15,7,8,9,8)
Expected Outcome:
Column a = ([A,B,C],[A,C],[J,S,K],[J,S])
Column b = (10,15,8,9)
Here [B,C] and [A,C] are subsets of [A,B,C], but we only retain [A,C] because this subset has 15 in column b, which is greater than 10, the superset's ([A,B,C]) value in column b.
Similarly, the superset [J,S,K] is retained along with its subset [J,S], because the subset's value in column b is greater than the superset's column b value.
You can use a self left_anti join to filter out the rows that satisfy that condition.
You'll need to have an ID column in your dataframe; here I'm using the monotonically_increasing_id function to generate an ID for each row:
import pyspark.sql.functions as F
df = df.withColumn("ID", F.monotonically_increasing_id())
df.show()
#+---------+---+-----------+
#| a| b| ID|
#+---------+---+-----------+
#|[A, B, C]| 10| 8589934592|
#| [A, C]| 15|17179869184|
#| [B, C]| 7|25769803776|
#|[J, S, K]| 8|42949672960|
#| [J, S]| 9|51539607552|
#| [J, K]| 8|60129542144|
#+---------+---+-----------+
Now, to verify that an array arr1 is a subset of another array arr2, you can use the array_intersect and size functions: size(array_intersect(arr1, arr2)) = size(arr1):
df_result = df.alias("df1").join(
    df.alias("df2"),
    (
        (F.size(F.array_intersect("df1.a", "df2.a")) == F.size("df1.a"))
        & (F.col("df1.b") <= F.col("df2.b"))
        & (F.col("df1.ID") != F.col("df2.ID"))  # not the same row
    ),
    "left_anti"
).drop("ID")
df_result.show()
#+---------+---+
#| a| b|
#+---------+---+
#|[A, B, C]| 10|
#| [A, C]| 15|
#|[J, S, K]| 8|
#| [J, S]| 9|
#+---------+---+

Pyspark -- Filter ArrayType rows which contain null value

I am a beginner with PySpark. Suppose I have a Spark dataframe like this:
import pandas as pd
test_df = spark.createDataFrame(pd.DataFrame({"a": [[1, 2, 3], [None, 2, 3], [None, None, None]]}))
Now I want to keep only the rows whose array does NOT contain a None value (in my case, just the first row).
I have tried to use:
test_df.filter(array_contains(test_df.a, None))
But it does not work and throws an error:
AnalysisException: "cannot resolve 'array_contains(a, NULL)' due to
data type mismatch: Null typed values cannot be used as
arguments;;\n'Filter array_contains(a#166, null)\n+- LogicalRDD
[a#166], false\n
How should I filter in the correct way? Many thanks!
You need to use the forall function.
from pyspark.sql import functions as F

df = test_df.filter(F.expr('forall(a, x -> x is not null)'))
df.show(truncate=False)
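On Spark 3.1+ the same check is also available through the DataFrame API as F.forall (a sketch assuming the same test_df):
from pyspark.sql import functions as F

df = test_df.filter(F.forall("a", lambda x: x.isNotNull()))
df.show(truncate=False)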
You can use the aggregate higher-order function to count the number of nulls and filter the rows where the count is 0. This lets you drop all rows with at least one None within the array.
import pyspark.sql.functions as func

data_ls = [
    (1, ["A", "B"]),
    (2, [None, "D"]),
    (3, [None, None])
]
data_sdf = spark.sparkContext.parallelize(data_ls).toDF(['a', 'b'])
data_sdf.show()
+---+------+
| a| b|
+---+------+
| 1|[A, B]|
| 2| [, D]|
| 3| [,]|
+---+------+
# count the number of nulls within an array
data_sdf. \
withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))')). \
show()
+---+------+---+
| a| b| c|
+---+------+---+
| 1|[A, B]| 0|
| 2| [, D]| 1|
| 3| [,]| 2|
+---+------+---+
Once you have that column, you can apply the filter as filter(func.col('c') == 0).
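Putting that together, the full chain might look like this (same data_sdf as above):
import pyspark.sql.functions as func

(data_sdf
    .withColumn('c', func.expr('aggregate(b, 0, (x, y) -> x + int(y is null))'))
    .filter(func.col('c') == 0)
    .drop('c')
    .show())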
You can use the exists function:
test_df.filter("!exists(a, x -> x is null)").show()
#+---------+
#| a|
#+---------+
#|[1, 2, 3]|
#+---------+

groupby category and sum the count

Let's say I have a table (df) like so:
type count
A 5000
B 5000
C 200
D 123
... ...
... ...
Z 453
How can I sum the count column by type, keeping A and B and having all other types fall into an Others category?
I currently have this:
df = df.withColumn('type', when(col("type").isnot("A", "B"))
My expected output would be like so:
type count
A 5000
B 5000
Other 3043
You want to group by a when expression and sum the count:
from pyspark.sql import functions as F

df1 = df.groupBy(
    F.when(
        F.col("type").isin("A", "B"), F.col("type")
    ).otherwise("Others").alias("type")
).agg(
    F.sum("count").alias("count")
)
df1.show()
#+------+-----+
#| type|count|
#+------+-----+
#| B| 5000|
#| A| 5000|
#|Others| 776|
#+------+-----+
You can divide the dataframe into two parts based on the type, aggregate a sum for the second part, and do a unionAll to combine them.
import pyspark.sql.functions as F

result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other'), F.sum('count'))
)
result.show()
+-----+-----+
| type|count|
+-----+-----+
| A| 5000|
| B| 5000|
|Other| 776|
+-----+-----+
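If you want the aggregated branch to carry explicit column names before the union (the union itself matches columns by position), a small variant of the same approach:
import pyspark.sql.functions as F

result = df.filter("type in ('A', 'B')").unionAll(
    df.filter("type not in ('A', 'B')")
      .select(F.lit('Other').alias('type'), F.sum('count').alias('count'))
)
result.show()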

Collect only non-null columns of each row to an array

The difficulty is that I'm trying to avoid UDFs as much as possible.
I have a dataset "wordsDS", which contains many null values:
+------+------+------+------+
|word_0|word_1|word_2|word_3|
+------+------+------+------+
|     a|     b|  null|     d|
|  null|     f|     m|  null|
|  null|  null|     d|  null|
+------+------+------+------+
I need to collect all of the columns for each row into an array. I don't know the number of columns in advance, so I'm using the columns() method.
groupedQueries = wordsDS.withColumn("collected",
    functions.array(Arrays.stream(wordsDS.columns())
        .map(functions::col).toArray(Column[]::new)));
But this approach produces empty elements:
+--------------------+
| collected|
+--------------------+
| [a, b,,d]|
| [, f, m,,]|
| [,, d,,]|
+--------------------+
Instead, I need the following result:
+--------------------+
| collected|
+--------------------+
| [a, b, d]|
| [f, m]|
| [d]|
+--------------------+
So basically, I need to collect all of the columns for each row into an array with the following requirements:
Resulting array doesn't contain empty elements.
Don't know number of columns upfront.
I've also thought of the approach of filtering the dataset's "collected" column for empty values, but can't come up with anything other than a UDF. I'm trying to avoid UDFs in order not to kill performance; if anyone could suggest a way to filter the dataset's "collected" column for empty values with as little overhead as possible, that would be really helpful.
You can use array("*") to get all the elements into one array, then use array_except (needs Spark 2.4+) to filter out the nulls:
df
.select(array_except(array("*"),array(lit(null))).as("collected"))
.show()
gives
+---------+
|collected|
+---------+
|[a, b, d]|
| [f, m]|
| [d]|
+---------+
For Spark versions where array_except is not available (it needs 2.4+), you can use a UDF to remove the nulls:
scala> var df = Seq(("a", "b", "null", "d"),("null", "f", "m", "null"),("null", "null", "d", "null")).toDF("word_0","word_1","word_2","word_3")
scala> def arrayNullFilter = udf((arr: Seq[String]) => arr.filter(x=>x != "null"))
scala> df.select(array('*).as('all)).withColumn("test",arrayNullFilter(col("all"))).show
+--------------------+---------+
| all| test|
+--------------------+---------+
| [a, b, null, d]|[a, b, d]|
| [null, f, m, null]| [f, m]|
|[null, null, d, n...| [d]|
+--------------------+---------+
hope this helps you.
display(df_part_groups.withColumn("combined", F.array_except(F.array("*"), F.array(F.lit("null"))) ))
This statement doesn't remove the nulls; it only keeps distinct occurrences of them.
Use this instead:
display(df_part_groups.withColumn("combined", F.array_except(F.array("*"), F.array(F.lit(""))) ))
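Note that array_except also deduplicates the surviving elements. If duplicates must be preserved, the filter higher-order function (Spark 2.4+) drops the nulls without deduplicating; a sketch, assuming a PySpark dataframe df with the same string columns as wordsDS:
import pyspark.sql.functions as F

# build filter(array(word_0, word_1, ...), x -> x is not null) over all columns
df.select(
    F.expr("filter(array({}), x -> x is not null)".format(", ".join(df.columns)))
     .alias("collected")
).show()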
