Get the distinct count of each row's array using PySpark - apache-spark

I want the number of distinct values in the array on each row of a PySpark DataFrame:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of each array instead:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
output:
3
3
4
How can I achieve this with a PySpark DataFrame?
Thanks in advance!

For Spark 2.4+, you can use array_distinct and then take the size of the result to get the count of distinct values in each array. A UDF is slow and inefficient on big data, so prefer Spark's built-in functions where possible.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
from pyspark.sql import functions as F
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
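As a sanity check on what size(array_distinct(...)) computes per row, here is a plain-Python model that runs without Spark. The helper name distinct_count is made up for illustration and is not a Spark function. It also shows why the UDF in the question returned the array length: len(s) counts every element, while len(set(s)) counts only distinct ones.

```python
# Plain-Python model of the per-row computation (no Spark needed).
# `distinct_count` is a hypothetical helper name, not a Spark function.
def distinct_count(values):
    """Return the number of distinct elements in one row's array."""
    return len(set(values))

rows = [[1, 1, 1], [3, 4, 5], [1, 2, 1, 2]]

lengths = [len(r) for r in rows]              # what the question's UDF computed
distinct = [distinct_count(r) for r in rows]  # what the question asked for

print(lengths)
print(distinct)
```

The same function body could be dropped into the question's UDF, but the built-in functions in the answer avoid the Python round trip.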

Related

Spark: Replace collect()[][] operation

I have this code:
new_df=spark.sql("Select col1,col2 from table1 where id=2").collect()[0][0]
I tried toLocalIterator(), but I get an error saying the result is not subscriptable.
Please suggest a better way to replace collect()[0][0].
If I understand correctly:
Assume this is the resulting DataFrame
+----+---+---------------+
| id|num| list_col|
+----+---+---------------+
|1001| 5|[1, 2, 3, 4, 5]|
|1002| 3| [1, 2, 3]|
+----+---+---------------+
To get the first value of list_col, add one more [] to your existing code:
print(df.select("list_col").collect()[0][0][0])
This gives you 1.
Likewise, this gives you 2:
print(df.select("list_col").collect()[0][0][1])
Updating my answer as per my new understanding,
i.e. to access the first element of a list column in a DataFrame:
df = df.withColumn("list_element", F.col("list_col").getItem(0))
df.show()
+----+---+---------------+------------+
| id|num| list_col|list_element|
+----+---+---------------+------------+
|1001| 5|[1, 2, 3, 4, 5]| 1|
|1002| 3| [1, 2, 3]| 1|
+----+---+---------------+------------+
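On the "not subscriptable" error from the question: toLocalIterator() hands back an iterator rather than a list, so indexing it with [0] fails. The sketch below models that with an ordinary Python generator; local_rows is a hypothetical stand-in, not a Spark API, and the tuples stand in for Row objects.

```python
# Plain-Python model of why indexing an iterator fails (no Spark needed).
# `local_rows` is a hypothetical stand-in for DataFrame.toLocalIterator().
def local_rows(rows):
    """Yield rows one at a time, like a lazy row iterator would."""
    for row in rows:
        yield row

rows = [("a", 1), ("b", 2)]  # stand-ins for Row objects

error = None
try:
    local_rows(rows)[0]      # iterators cannot be indexed
except TypeError as exc:
    error = str(exc)

# Pull the first row with next(), then index into that row.
first_value = next(local_rows(rows))[0]

print(error)
print(first_value)
```

On a real DataFrame the analogous calls would presumably be next(df.toLocalIterator())[0] or df.first()[0]; neither was run against the asker's table.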

collect_set by preserving order

I was referring to a related question, but the approach there works for collect_list and not for collect_set.
I have a dataframe like this
data = [(("ID1", 9)),
(("ID1", 9)),
(("ID1", 8)),
(("ID1", 7)),
(("ID1", 5)),
(("ID1", 5))]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()
+---+------+
| ID|Values|
+---+------+
|ID1| 9|
|ID1| 9|
|ID1| 8|
|ID1| 7|
|ID1| 5|
|ID1| 5|
+---+------+
I am trying to create a new column by collecting the values as a set:
df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()
+---+------------+
| ID| Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+
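But the order is not maintained; it should be [9, 8, 7, 5].
I solved it like this:
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, IntegerType
df = df.groupby('ID').agg(collect_list('Values').alias('Values_List'))
df.show()
def my_function(x):
    # dict.fromkeys keeps the first occurrence of each value, in order (Python 3.7+)
    return list(dict.fromkeys(x))
udf_set = udf(lambda x: my_function(x), ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))
df.show(truncate=False)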
+---+------------------+------------+
|ID |Values_List |Values_Set |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
From the pyspark source code, the documentation for collect_set:
_collect_set_doc = """
Aggregate function: returns a set of objects with duplicate elements eliminated.
.. note:: The function is non-deterministic because the order of collected results depends
on order of rows which may be non-deterministic after a shuffle.
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]
"""
This means you will get unordered sets backed by a hash table, much like the arbitrary ordering of Python sets.
If your data is relatively small, you can coalesce it to 1 partition and then sort it before collecting.
E.g., with columns relation,index:
cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6
df.coalesce(1).sort("index").agg(collect_list("relation").alias("names_list")).show()
names_list
[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]
If you use Spark 2.4 or above, you can also apply the array_sort() function to your column.
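One caveat on sorting: it reproduces the order the question wants only because the input values happen to arrive in descending order. The plain-Python sketch below (no Spark; the sample lists are illustrative) contrasts first-occurrence de-duplication, which is what the asker's UDF does, with de-duplicate-then-sort. Note also that array_sort sorts ascending, so a descending target like [9, 8, 7, 5] would need a descending sort, for example sort_array with asc=False.

```python
# Plain-Python comparison of two ways to de-duplicate (no Spark needed).
values = [9, 9, 8, 7, 5, 5]    # already in descending order, as in the question

first_seen = list(dict.fromkeys(values))         # keeps first-occurrence order
sorted_desc = sorted(set(values), reverse=True)  # de-duplicate, then sort descending

# With input that is not pre-sorted, the two approaches disagree.
shuffled = [7, 9, 5, 9, 8, 5]
first_seen_shuffled = list(dict.fromkeys(shuffled))
sorted_desc_shuffled = sorted(set(shuffled), reverse=True)

print(first_seen, sorted_desc)
print(first_seen_shuffled, sorted_desc_shuffled)
```

If the goal is "order of first appearance", sorting is the wrong tool and the collect_list plus de-duplication route is the closer match; if the goal is "distinct values in sorted order", sorting the collected set is enough.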

Add different arrays from numpy to each row of dataframe

I have a Spark SQL DataFrame and a 2D numpy matrix with the same number of rows. I want to add each row of the numpy matrix as a new column value in the existing PySpark DataFrame, so that each row receives a different list.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5]
[5, 2, 6]
[1, 4, 7]]
The resulting expected dataframe should be like this
| Id | Name | customized_list |
| ------ | ------ | --------------- |
| 1 | Bob | [2, 3, 5] |
| 2 | Alice | [5, 2, 6] |
| 3 | Mike | [1, 4, 7] |
The Id column corresponds to the order of the entries in the numpy matrix.
Is there an efficient way to implement this?
Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+
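To make the two steps concrete without a Spark session, the sketch below models them in plain Python: enumerate(..., start=1) produces the (Id, list) rows, and a dictionary lookup stands in for the inner join. The people mapping is an illustrative stand-in for the original DataFrame, and matrix is written out as the nested list that a.tolist() would return.

```python
# Plain-Python model of "number the rows, then join on Id" (no Spark needed).
matrix = [[2, 3, 5], [5, 2, 6], [1, 4, 7]]   # what a.tolist() returns above

# Step 1: pair each inner list with a 1-based row number.
rows = list(enumerate(matrix, start=1))

# Step 2: inner join on Id; `people` stands in for the original DataFrame.
people = {1: "Bob", 2: "Alice", 3: "Mike"}
joined = [(i, people[i], lst) for i, lst in rows if i in people]

for record in joined:
    print(record)
```

This only works because Id matches the row position in the matrix, as the question states. Also note that the row order of the joined Spark output is not guaranteed, as the Bob, Mike, Alice ordering above shows.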

How to get the minimum of nested lists in PySpark

See the following data frame for example,
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[[1, 2, 3, 4]],[[0, 2, 4]],[[]],[[3]]])
df.show()
Then we have
+------------+
| _1|
+------------+
|[1, 2, 3, 4]|
| [0, 2, 4]|
| []|
| [3]|
+------------+
Then I want to find the minimum of each list; use -1 in case of empty list. I tried the following, which does not work.
import pyspark.sql.functions as F
sim_col = F.col('_1')
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.min(sim_col))).show()
The error is:
AnalysisException: "cannot resolve 'CASE WHEN (_1 IS NULL) THEN -1.0D ELSE min(_1) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;\n'Aggregate [_1#404, CASE WHEN isnull(_1#404) THEN -1.0 ELSE min(_1#404) END AS min_turn_sim#411]\n+- LogicalRDD [_1#404], false\n"
The size function works; I don't understand why min does not.
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.size(sim_col))).show()
+------------+------------+
| _1|min_turn_sim|
+------------+------------+
|[1, 2, 3, 4]| 4.0|
| [0, 2, 4]| 3.0|
| []| -1.0|
| [3]| 1.0|
+------------+------------+
min is an aggregate function: it operates on columns, not on the values within a row. Therefore min(sim_col) means the minimum array across all rows in scope, according to array ordering, not the minimum value within each row.
To find a minimum for each row you'll need a non-aggregate function. In the latest Spark versions (2.4.0 and later) this would be array_min (similarly array_max to get the maximum value):
df.withColumn("min_turn_sim", F.coalesce(F.array_min(sim_col), F.lit(-1)))
Earlier versions will require a UDF:
@F.udf("long")
def long_array_min(xs):
    # Empty or null arrays fall back to -1
    return min(xs) if xs else -1
df.withColumn("min_turn_sim", F.coalesce(long_array_min(sim_col), F.lit(-1)))
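The difference between a per-row minimum and an aggregate minimum can be seen in plain Python, without Spark. In the sketch below, row_min is a hypothetical helper with the same body as the UDF above, and column_min is only a loose analogue of what an aggregate min over an array column does, since Python compares lists lexicographically.

```python
# Plain-Python model of per-row vs. aggregate minimum (no Spark needed).
# `row_min` is a hypothetical helper mirroring the UDF body above.
def row_min(xs, default=-1):
    """Minimum of one row's list, or `default` if it is empty or None."""
    return min(xs) if xs else default

rows = [[1, 2, 3, 4], [0, 2, 4], [], [3]]

per_row = [row_min(r) for r in rows]   # one result per row
column_min = min(rows)                 # one result for the whole "column"

print(per_row)
print(column_min)
```

The aggregate version collapses all rows into a single value (here the empty list, which sorts first), which is why it cannot be used where one value per row is expected.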

Filter array column content

I am using pyspark 2.3.1 and would like to filter array elements with an expression rather than a udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| A| [1, 2, 3, 4]|
| 2| B|[1, 2, 3, 4, 5]|
+----+----+---------------+
The expression shown below is wrong. How can I tell Spark to remove any values smaller than 3 from the array in col3? I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2| newcol|
+----+----+---------+
| 1| A| [3, 4]|
| 2| B|[3, 4, 5]|
+----+----+---------+
I already have a udf solution, but it is very slow (> 1 billion rows):
largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))
Any help is welcome. Thank you very much in advance.
Spark < 2.4
There is no reasonable* replacement for a udf in PySpark.
Spark >= 2.4
Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.
Reference
Querying Spark SQL DataFrame with complex types
* Given the cost of exploding, or of converting to and from an RDD, a udf is almost always preferable.

Resources