collect_set by preserving order - apache-spark

I was referring to this question here; however, that approach works for collect_list and not for collect_set.
I have a dataframe like this
data = [("ID1", 9),
        ("ID1", 9),
        ("ID1", 8),
        ("ID1", 7),
        ("ID1", 5),
        ("ID1", 5)]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()
+---+------+
| ID|Values|
+---+------+
|ID1| 9|
|ID1| 9|
|ID1| 8|
|ID1| 7|
|ID1| 5|
|ID1| 5|
+---+------+
I am trying to create a new column, collecting the values as a set:
from pyspark.sql.functions import collect_set
df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()
+---+------------+
| ID| Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+
But the order is not maintained; my expected order is [9, 8, 7, 5].

I solved it like this
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, IntegerType

df = df.groupby('ID').agg(collect_list('Values').alias('Values_List'))
df.show()

# dict.fromkeys keeps only the first occurrence of each value, preserving order
def my_function(x):
    return list(dict.fromkeys(x))

udf_set = udf(my_function, ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))
df.show(truncate=False)
+---+------------------+------------+
|ID |Values_List |Values_Set |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
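On Spark 2.4+, the same first-occurrence deduplication can be done without a UDF by combining collect_list with array_distinct, which drops duplicates while keeping the first occurrence of each value. A minimal sketch starting again from the original df, with the caveat (see the collect_set documentation quoted below) that the order produced by collect_list after a groupBy is only preserved in practice, not guaranteed:
from pyspark.sql.functions import array_distinct, collect_list
df = df.groupby('ID').agg(array_distinct(collect_list('Values')).alias('Values_Set'))
df.show(truncate=False)
# +---+------------+
# |ID |Values_Set  |
# +---+------------+
# |ID1|[9, 8, 7, 5]|
# +---+------------+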

From the pyspark source code, the documentation for collect_set:
_collect_set_doc = """
Aggregate function: returns a set of objects with duplicate elements eliminated.
.. note:: The function is non-deterministic because the order of collected results depends
on order of rows which may be non-deterministic after a shuffle.
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]
"""
This means you will get unordered sets, which are backed by a hash table; for more background, look into how 'order' works for unordered Python sets.

If your data is relatively small, you can coalesce it to a single partition and sort it before calling collect_list() or collect_set().
E.g. with columns name and ind:
name,ind
cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6
df.coalesce(1).sort("ind").agg(collect_list("name").alias("names_list")).show()
names_list
[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]

If you use Spark 2.4 or above, you can apply the array_sort() function to your column:
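Note that array_sort sorts ascending, which would give [5, 7, 8, 9]; for the descending order [9, 8, 7, 5] asked for above, sort_array with asc=False can be combined with collect_set. A minimal sketch applied to the original df from the question:
from pyspark.sql.functions import collect_set, sort_array
result = df.groupBy('ID').agg(sort_array(collect_set('Values'), asc=False).alias('Value_set'))
result.show()
# +---+------------+
# | ID|   Value_set|
# +---+------------+
# |ID1|[9, 8, 7, 5]|
# +---+------------+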

Related

Merge two data frames and retrieve all the information from the right data frame

Hi Stack Overflow community. I am new to spark/pyspark and I have this question.
Say I have two data frames: df2 is the data set of interest with many records, and df1 is a new update. I want to join the two data frames on multiple columns (if possible) and get the updated information from df1 when there is a key match; otherwise keep the df2 information as it is.
Here are my sample data sets and my attempt so far:
df1 = spark.createDataFrame([("a", 4, 'x'), ("b", 3, 'y'), ("c", 4, 'z'), ("d", 4, 'l')], ["C1", "C2", "C3"])
df2 = spark.createDataFrame([("a", 4, 5), ("f", 3, 4), ("b", 3, 6), ("c", 4, 7), ("d", 4, 8)], ["C1", "C2","C3"])
from pyspark.sql.functions import col
df1_s = df1.select([col(c).alias('s_' + c) for c in df1.columns])
You can use a left join on your list of key columns and, with a list comprehension over the remaining columns, select the updated values using the coalesce function:
from pyspark.sql import functions as F
join_columns = ["C1"]
result = df2.alias("df2").join(
    df1.alias("df1"),
    join_columns,
    "left"
).select(
    *join_columns,
    *[
        F.coalesce(f"df1.{c}", f"df2.{c}").alias(c)
        for c in df1.columns if c not in join_columns
    ]
)
result.show()
#+---+---+---+
#| C1| C2| C3|
#+---+---+---+
#| f| 3| 4|
#| d| 4| l|
#| c| 4| z|
#| b| 3| y|
#| a| 4| x|
#+---+---+---+
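Since join_columns is a plain list, joining on multiple key columns (as the question asks) only requires extending it, assuming both data frames share those columns; the rest of the join/select code above stays the same:
join_columns = ["C1", "C2"]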

get distinct count from an array of each rows using pyspark

I am looking for the distinct count of the array in each row of a PySpark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of each array instead:
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
output:
3
3
4
Please help me achieve this using a PySpark dataframe. Thanks in advance!
For Spark 2.4+ you can use array_distinct and then take its size to get the count of distinct values in the array. A UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
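If you prefer SQL expressions, the same built-in functions are available through selectExpr; a minimal sketch:
df.selectExpr("col1", "size(array_distinct(col1)) AS count").show()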

Filter array column content

I am using pyspark 2.3.1 and would like to filter array elements with an expression and not using a udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| A| [1, 2, 3, 4]|
| 2| B|[1, 2, 3, 4, 5]|
+----+----+---------------+
The expression shown below does not work; I wonder how to tell Spark to remove any values smaller than 3 from the array in col3. I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2| newcol|
+----+----+---------+
| 1| A| [3, 4]|
| 2| B|[3, 4, 5]|
+----+----+---------+
I already have a udf solution, but it is very slow (> 1 billion rows):
largerThan = F.udf(lambda row, threshold: [x for x in row if x >= threshold], ArrayType(IntegerType()))
df = df.withColumn('newcol', largerThan(df.col3, lit(3)))
Any help is welcome. Thank you very much in advance.
Spark < 2.4
There is no reasonable* replacement for udf in PySpark.
Spark >= 2.4
Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.
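For reference, here is that expression applied end to end to the sample dataframe from the question; a minimal sketch, assuming Spark 2.4+:
from pyspark.sql.functions import expr
df = spark.createDataFrame([(1, "A", [1, 2, 3, 4]), (2, "B", [1, 2, 3, 4, 5])], ["col1", "col2", "col3"])
# filter() is a Spark SQL higher-order function: keep only elements of col3 that are >= 3
filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
filtered.show()
# +----+----+---------------+---------+
# |col1|col2|           col3|   newcol|
# +----+----+---------------+---------+
# |   1|   A|   [1, 2, 3, 4]|   [3, 4]|
# |   2|   B|[1, 2, 3, 4, 5]|[3, 4, 5]|
# +----+----+---------------+---------+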
Reference
Querying Spark SQL DataFrame with complex types
* Given the cost of exploding or converting to and from RDD, udf is almost exclusively preferable.

Can we do a groupby on one column in spark using pyspark and get list of values of other columns (raw values without an aggregation) [duplicate]

This question already has answers here:
Apply a function to groupBy data with pyspark
(2 answers)
Closed 4 years ago.
I have a dataframe with more than 10 columns. I would like to group by one of the columns, say "Column1", and get the list of all values in "Column1" and "Column2" corresponding to each grouped "Column1" value.
Is there a way to do that using PySpark groupBy or any other function?
You can groupBy Column1 and then use the agg function to collect the values from Column2 and Column3:
from pyspark.sql.functions import col
import pyspark.sql.functions as F
df = spark.createDataFrame([[1, 'r1', 1],
                            [1, 'r2', 0],
                            [2, 'r2', 1],
                            [2, 'r1', 1],
                            [3, 'r1', 1]], schema=['col1', 'col2', 'col3'])
df.show()
#output
#+----+----+----+
#|col1|col2|col3|
#+----+----+----+
#| 1| r1| 1|
#| 1| r2| 0|
#| 2| r2| 1|
#| 2| r1| 1|
#| 3| r1| 1|
#+----+----+----+
df.groupby(col('col1')) \
    .agg(F.collect_list(F.struct(col('col2'), col('col3'))).alias('merged')) \
    .show()
You should see the output as:
+----+----------------+
|col1| merged|
+----+----------------+
| 1|[[r1,1], [r2,0]]|
| 3| [[r1,1]]|
| 2|[[r2,1], [r1,1]]|
+----+----------------+
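If you only need the raw values of a single column per group, the same pattern works without the struct; a minimal sketch:
df.groupby(col('col1')).agg(F.collect_list(col('col2')).alias('col2_values')).show()
# e.g. col1 = 1 gives col2_values = [r1, r2]; the order of groups and of the collected elements is not guaranteed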

Why does createDataFrame reorder the columns?

Suppose I am creating a data frame from a list without a schema:
data = [Row(c=0, b=1, a=2), Row(c=10, b=11, a=12)]
df = spark.createDataFrame(data)
df.show()
+---+---+---+
| a| b| c|
+---+---+---+
| 2| 1| 0|
| 12| 11| 10|
+---+---+---+
Why are the columns reordered in alphabetical order?
Can I preserve the original order of columns without adding a schema?
Why are the columns reordered in alphabetical order?
Because Row created with **kwargs sorts the arguments by name.
This design choice is required to address the issues described in PEP 468. Please check SPARK-12467 for a discussion.
Can I preserve the original order of columns without adding a schema?
Not with **kwargs. You can use plain tuples:
df = spark.createDataFrame([(0, 1, 2), (10, 11, 12)], ["c", "b", "a"])
or namedtuple:
from collections import namedtuple
CBA = namedtuple("CBA", ["c", "b", "a"])
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)])
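For instance, showing the namedtuple version keeps the declared field order:
spark.createDataFrame([CBA(0, 1, 2), CBA(10, 11, 12)]).show()
# +---+---+---+
# |  c|  b|  a|
# +---+---+---+
# |  0|  1|  2|
# | 10| 11| 12|
# +---+---+---+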
