Keeping identifier in exceptAll in PySpark - apache-spark

I was curious whether there is an easy way to keep an identifying ID in the exceptAll command in PySpark. For example, suppose I have two dataframes (DF1, DF2), both with an ID column and another column "A". I want to keep the rows in DF1 that have a value for "A" not in DF2, so essentially I am trying to keep the identifier along with the usual output of exceptAll. I attached an image with the ideal output.
Cheers!

You are probably looking for a leftanti join in Spark:
df1 = spark.createDataFrame([
    [1, "Dog"],
    [2, "Cat"],
    [3, "Dog"]
], ["id", "A"])
df2 = spark.createDataFrame([
    [4, "Dog"],
    [5, "Elmo"]
], ["id", "A"])
df1.join(df2, ["A"], "leftanti").show()
# +---+---+
# | A| id|
# +---+---+
# |Cat| 2|
# +---+---+
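Note that joining on ["A"] puts the join column first in the result. If you prefer the original id, A column order, a small optional select restores it (a sketch using the same df1/df2 as above):
df1.join(df2, ["A"], "leftanti").select("id", "A").show()
# +---+---+
# | id|  A|
# +---+---+
# |  2|Cat|
# +---+---+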

PySpark's DataFrame method subtract should give you what you want. See Spark: subtract two DataFrames for more details.
Using exceptAll will not give the result you want, as it will retain the second dog entry in df1, because exceptAll keeps duplicates.
Given your dataframes:
df1 = spark.createDataFrame([{'id': 1, 'A': 'dog'},
                             {'id': 2, 'A': 'cat'},
                             {'id': 3, 'A': 'dog'}])
df2 = spark.createDataFrame([{'id': 4, 'A': 'dog'},
                             {'id': 5, 'A': 'elmo'}])
Use subtract on the column of interest (i.e. A), then join the result back to the original dataframe to get the remaining columns (i.e. id).
except_df = df1.select('A').subtract(df2.select('A'))
except_df.join(df1, on='A').show()
+---+---+
| A| id|
+---+---+
|cat| 2|
+---+---+
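For contrast, here is a hedged sketch of what exceptAll on the A column alone would return with these dataframes: df1 has two dog rows and df2 only one, so one dog survives, which is exactly why subtract is used above (row order in the output may differ):
df1.select('A').exceptAll(df2.select('A')).show()
+---+
|  A|
+---+
|cat|
|dog|
+---+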

Related

PySpark, save unique letters in strings of a column

I'm using PySpark, and I want a simple way of doing the following without overcomplicating it.
Suppose I have a table that looks like this:
+---+-------+
| ID|Letters|
+---+-------+
|  1|a,b,c,d|
|  2|  b,d,b|
|  3|  c,y,u|
+---+-------+
I want to get the unique letters in this dataframe from the column "Letters"; the result would be:
List = [a, b, c, d, y, u].
I tried using the in operator, but I don't really know how to iterate through each record, and I don't want to make a mess because the original plan is to use this on a big dataset.
You can try something like this:
import pyspark.sql.functions as F

data1 = [
    [1, "a,b,c,d"],
    [2, "b,d,b"],
    [3, "c,y,u"],
]
df = spark.createDataFrame(data1).toDF("ID", "Letters")

dfWithDistinctValues = df.select(
    F.array_distinct(
        F.flatten(F.collect_set(F.array_distinct(F.split(df.Letters, ","))))
    ).alias("unique_letters")
)

defaultValues = [
    data[0] for data in dfWithDistinctValues.select("unique_letters").collect()
]
print(defaultValues)
What is happening here:
First I split the string by "," with F.split and drop duplicates at the row level with F.array_distinct.
Then I use collect_set to get all the distinct arrays into one row, which at this stage is an array of arrays and looks like this:
[[b, d], [a, b, c, d], [c, y, u]]
Then I use flatten to get all the values as separate strings:
[b, d, a, b, c, d, c, y, u]
There are still some duplicates, which are removed by array_distinct, so at the end the output looks like this:
[b, d, a, c, y, u]
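If you want to inspect those intermediate stages yourself, here is a hedged sketch (reusing the df and the F import from the code above) that materializes each step separately:
# step 1: split each string and drop row-level duplicates
step1 = df.select(F.array_distinct(F.split(df.Letters, ",")).alias("letters_arr"))
# step 2: collect all distinct arrays into a single row (an array of arrays)
step2 = step1.select(F.collect_set("letters_arr").alias("set_of_arrays"))
# step 3: flatten the array of arrays into one array
step3 = step2.select(F.flatten("set_of_arrays").alias("flat"))
# step 4: drop the remaining duplicates
step4 = step3.select(F.array_distinct("flat").alias("unique_letters"))
step4.show(truncate=False)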
If you also need counts, you can use explode as Koedit mentioned; you can change part of his code to something like this:
# Unique letters with counts (this uses Koedit's dataframe below, where Letters is already an array)
uniqueLettersDf = (
    df.select(explode(array_distinct("Letters")).alias("Letter"))
    .groupBy("Letter")
    .count()
)
uniqueLettersDf.show()
Now you will get something like this:
+------+-----+
|Letter|count|
+------+-----+
| d| 2|
| c| 2|
| b| 2|
| a| 1|
| y| 1|
| u| 1|
+------+-----+
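If you then want those counts back on the driver as a plain Python dict, a small hedged follow-up (assuming uniqueLettersDf is the counts dataframe built above):
letter_counts = {row["Letter"]: row["count"] for row in uniqueLettersDf.collect()}
print(letter_counts)  # e.g. {'d': 2, 'c': 2, 'b': 2, 'a': 1, 'y': 1, 'u': 1}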
Depending on how large your dataset and your arrays are (if they are very large, this might not be the route you want to take), you can use the explode function to easily get what you want:
from pyspark.sql.functions import explode
df = spark.createDataFrame(
    [
        (1, ["a", "b", "c", "d"]),
        (2, ["b", "d", "b"]),
        (3, ["c", "y", "u"])
    ],
    ["ID", "Letters"]
)
# Creating a dataframe with 1 column, "letters", with distinct values per row
uniqueLettersDf = df.select(explode("Letters").alias("letters")).distinct()
# Using list comprehension and the .collect() method to turn our dataframe into a Python list
output = [row['letters'] for row in uniqueLettersDf.collect()]
output
['d', 'c', 'b', 'a', 'y', 'u']
EDIT: To make it a bit safer, we can use array_distinct before using explode: this limits the number of rows that get created by removing the duplicates before exploding.
The code would be identical, except for these lines:
from pyspark.sql.functions import explode, array_distinct
...
uniqueLettersDf = df.select(explode(array_distinct("Letters")).alias("letters")).distinct()
...

Spark: Replace collect()[][] operation

I have code like this:
new_df=spark.sql("Select col1,col2 from table1 where id=2").collect()[0][0]
I have tried toLocalIterator(), but I get a message that it is not subscriptable.
Please suggest a better way to replace collect()[0][0].
IIUC -
Assume this is the resulting DF:
+----+---+---------------+
| id|num| list_col|
+----+---+---------------+
|1001| 5|[1, 2, 3, 4, 5]|
|1002| 3| [1, 2, 3]|
+----+---+---------------+
In order to get the first value of list_col, use one more [] in your existing code:
print(df.select("list_col").collect()[0][0][0])
will give you 1
Likewise, this will give you 2
print(df.select("list_col").collect()[0][0][1])
Updating my answer as per my new understanding, i.e. to access the first element of a list column from a dataframe:
df = df.withColumn("list_element", F.col("list_col").getItem(0))
df.show()
+----+---+---------------+------------+
| id|num| list_col|list_element|
+----+---+---------------+------------+
|1001| 5|[1, 2, 3, 4, 5]| 1|
|1002| 3| [1, 2, 3]| 1|
+----+---+---------------+------------+
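If the goal is simply to avoid the double indexing on collect(), a hedged alternative is first(), which returns a single Row (or None when the query matches nothing); the table1/col1 names below are just the ones from the question:
row = spark.sql("Select col1,col2 from table1 where id=2").first()
value = row["col1"] if row is not None else None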

get distinct count from an array of each rows using pyspark

I am looking for the distinct count of the array in each row of a pyspark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
expected output:
1
3
2
I used the code below, but it is giving me the length of each array:
output:
3
3
4
Please help me understand how to achieve this using a python pyspark dataframe.
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
Thanks in advance!
For Spark 2.4+ you can use array_distinct and then just take the size of that to get the count of distinct values in your array. Using a UDF will be very slow and inefficient for big data; always try to use Spark built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
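If you are stuck on Spark < 2.4 (no array_distinct), a hedged sketch using explode and countDistinct also works, at the cost of a shuffle; the row_id column is hypothetical and only added so the counts can be joined back to the original rows:
df_with_id = df.withColumn("row_id", F.monotonically_increasing_id())
counts = (df_with_id
          .select("row_id", F.explode("col1").alias("val"))
          .groupBy("row_id")
          .agg(F.countDistinct("val").alias("count")))
df_with_id.join(counts, "row_id").select("col1", "count").show()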

collect_set by preserving order

I was referring to this question Here; however, that works for collect_list and not collect_set.
I have a dataframe like this
data = [("ID1", 9),
        ("ID1", 9),
        ("ID1", 8),
        ("ID1", 7),
        ("ID1", 5),
        ("ID1", 5)]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()
+---+------+
| ID|Values|
+---+------+
|ID1| 9|
|ID1| 9|
|ID1| 8|
|ID1| 7|
|ID1| 5|
|ID1| 5|
+---+------+
I am trying to create a new column, collecting it as a set:
df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()
+---+------------+
| ID| Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+
But the order is not maintained; my order should be [9, 8, 7, 5].
I solved it like this
from pyspark.sql.functions import collect_list, udf
from pyspark.sql.types import ArrayType, IntegerType

df = df.groupby('ID').agg(collect_list('Values').alias('Values_List'))
df.show()

def my_function(x):
    return list(dict.fromkeys(x))

udf_set = udf(lambda x: my_function(x), ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))
df.show(truncate=False)
+---+------------------+------------+
|ID |Values_List |Values_Set |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
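A UDF-free variant of the same idea, assuming Spark 2.4+ and starting again from the original (ID, Values) dataframe: array_distinct keeps the first occurrence of each element, so it can replace the dict.fromkeys trick. The same caveat applies that the order coming out of collect_list is only meaningful if the rows arrive ordered:
import pyspark.sql.functions as F

df.groupby('ID').agg(
    F.array_distinct(F.collect_list('Values')).alias('Values_Set')
).show(truncate=False)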
From the pyspark source code, the documentation for collect_set:
_collect_set_doc = """
Aggregate function: returns a set of objects with duplicate elements eliminated.
.. note:: The function is non-deterministic because the order of collected results depends
on order of rows which may be non-deterministic after a shuffle.
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]
"""
This means you will get unordered sets, which are based on a hash table; see the discussion of the 'order' of unordered Python sets for more information.
If your data is relatively small, you can coalesce it to 1 partition and then sort it before using collect_set().
E.g., with columns name and ind:
cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6
df.coalesce(1).sort("ind").agg(collect_list("name").alias("names_list")).show(truncate=False)
names_list
[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]
You can apply the array_sort() function to your column if you use Spark 2.4 or above:
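A hedged sketch of that suggestion, starting from the original (ID, Values) dataframe and assuming Spark 2.4+; array_sort only sorts ascending, so sort_array with asc=False is used here to get the descending order [9, 8, 7, 5] the question asks for:
import pyspark.sql.functions as F

df.groupBy("ID").agg(
    F.sort_array(F.collect_set("Values"), asc=False).alias("Value_set")
).show()
+---+------------+
| ID|   Value_set|
+---+------------+
|ID1|[9, 8, 7, 5]|
+---+------------+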

Filter array column content

I am using pyspark 2.3.1 and would like to filter array elements with an expression, not by using a udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| A| [1, 2, 3, 4]|
| 2| B|[1, 2, 3, 4, 5]|
+----+----+---------------+
The expression shown below is wrong. I wonder how to tell Spark to remove any values from the array in col3 which are smaller than 3. I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2| newcol|
+----+----+---------+
| 1| A| [3, 4]|
| 2| B|[3, 4, 5]|
+----+----+---------+
I already have a udf solution, but it is very slow (> 1 billion rows):
largerThan = F.udf(lambda row,max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', size(largerThan(df.queries, lit(3))))
Any help is welcome. Thank you very much in advance.
Spark < 2.4
There is no reasonable* replacement for udf in PySpark.
Spark >= 2.4
Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.
Reference
Querying Spark SQL DataFrame with complex types
* Given the cost of exploding or converting to and from RDD, udf is almost exclusively preferable.
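For what it's worth, on Spark 3.1+ the same higher-order function is also exposed directly in the Python API, so the SQL expression string becomes optional; a minimal sketch, assuming Spark 3.1+:
from pyspark.sql.functions import col
from pyspark.sql.functions import filter as array_filter  # available in Spark 3.1+

df.withColumn("newcol", array_filter(col("col3"), lambda x: x >= 3)).show()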
