Spark: Replace collect()[][] operation - apache-spark

I am having code as:
new_df=spark.sql("Select col1,col2 from table1 where id=2").collect()[0][0]
I have tried toLocalIterator() but getting message that is not subscriptable.
Please suggest a better way to replace collect()[0][0].

IIUC -
Assume this is the resulted DF
+----+---+---------------+
| id|num| list_col|
+----+---+---------------+
|1001| 5|[1, 2, 3, 4, 5]|
|1002| 3| [1, 2, 3]|
+----+---+---------------+
In order to get the first value of list_col use one more [] in your existing code
print(df.select("list_col").collect()[0][0][0])
will give you 1
Likewise, this will give you 2
print(df.select("list_col").collect()[0][0][1])
Updating my answer as per new understanding -
i.e. To access the first element of a list column from a dataframe
df = df.withColumn("list_element", F.col("list_col").getItem(0))
df.show()
+----+---+---------------+------------+
| id|num| list_col|list_element|
+----+---+---------------+------------+
|1001| 5|[1, 2, 3, 4, 5]| 1|
|1002| 3| [1, 2, 3]| 1|
+----+---+---------------+------------+

Related

get distinct count from an array of each rows using pyspark

I am looking for distinct counts from an array of each rows using pyspark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used below code but it is giving me the length of an array:
output:
3
3
4
please help me how do i achieve this using python pyspark dataframe.
slen = udf(lambda s: len(s), IntegerType())
count = Df.withColumn("Count", slen(df.col1))
count.show()
Thanks in advanced !
For spark2.4+ you can use array_distinct and then just get the size of that, to get count of distinct values in your array. Using UDF will be very slow and inefficient for big data, always try to use spark in-built functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+

collect_set by preserving order

I was referring to this question Here, however it works for collect_list and not collect_set
I have a dataframe like this
data = [(("ID1", 9)),
(("ID1", 9)),
(("ID1", 8)),
(("ID1", 7)),
(("ID1", 5)),
(("ID1", 5))]
df = spark.createDataFrame(data, ["ID", "Values"])
df.show()
+---+------+
| ID|Values|
+---+------+
|ID1| 9|
|ID1| 9|
|ID1| 8|
|ID1| 7|
|ID1| 5|
|ID1| 5|
+---+------+
I am trying to create a new column, collecting it as set
df = df.groupBy('ID').agg(collect_set('Values').alias('Value_set'))
df.show()
+---+------------+
| ID| Value_set|
+---+------------+
|ID1|[9, 5, 7, 8]|
+---+------------+
But the order is not maintained, my order should be [9, 8, 7, 5]
I solved it like this
df = df.groupby('ID').agg(collect_list('Values').alias('Values_List'))
df.show()
def my_function(x):
return list(dict.fromkeys(x))
udf_set = udf(lambda x: my_function(x), ArrayType(IntegerType()))
df = df.withColumn("Values_Set", udf_set("Values_List"))
df.show(truncate=False)
+---+------------------+------------+
|ID |Values_List |Values_Set |
+---+------------------+------------+
|ID1|[9, 9, 8, 7, 5, 5]|[9, 8, 7, 5]|
+---+------------------+------------+
From the pyspark source code, the documentation for collect_set:
_collect_set_doc = """
Aggregate function: returns a set of objects with duplicate elements eliminated.
.. note:: The function is non-deterministic because the order of collected results depends
on order of rows which may be non-deterministic after a shuffle.
>>> df2 = spark.createDataFrame([(2,), (5,), (5,)], ('age',))
>>> df2.agg(collect_set('age')).collect()
[Row(collect_set(age)=[5, 2])]
"""
This means, you will have unordered sets which are based on a hash table and you can get more information on the 'order' of unordered Python sets
If you data is relatively small , you can coalesce it to 1 and then sort it before using collect_set()
Eg : relation,index
cook,3
jone,1
sam,7
zack,4
tim,2
singh,9
ambani,5
ram,8
jack,0
nike,6
df.coalesce(1).sort("ind").agg(collect_list("name").alias("names_list")).show
names_list
[jack, jone, tim, cook, zack, ambani, nike, sam, ram, singh]
you can apply the array_sort() function to your column if you use spark 2.4 or above:

Add different arrays from numpy to each row of dataframe

I have a SparkSQL dataframe and 2D numpy matrix. They have the same number of rows. I intend to add each different array from numpy matrix as a new column to the existing PySpark data frame. In this way, the list added to each row is different.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5]
[5, 2, 6]
[1, 4, 7]]
The resulting expected dataframe should be like this
| Id | Name | customized_list
| ------ | ------ | ---------------
| 1 | Bob | [2, 3, 5]
| 2 | Alice | [5, 2, 6]
| 3 | Mike | [1, 4, 7]
Id column correspond to the order of the entries in the numpy matrix.
I wonder is there any efficient way to implement this?
Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+

How to get the minimum of nested lists in PySpark

See the following data frame for example,
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[[1, 2, 3, 4]],[[0, 2, 4]],[[]],[[3]]])
df.show()
Then we have
+------------+
| _1|
+------------+
|[1, 2, 3, 4]|
| [0, 2, 4]|
| []|
| [3]|
+------------+
Then I want to find the minimum of each list; use -1 in case of empty list. I tried the following, which does not work.
import pyspark.sql.functions as F
sim_col = F.col('_1')
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.min(sim_col))).show()
The error is:
AnalysisException: "cannot resolve 'CASE WHEN (_1 IS NULL) THEN -1.0D ELSE min(_1) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;\n'Aggregate [_1#404, CASE WHEN isnull(_1#404) THEN -1.0 ELSE min(_1#404) END AS min_turn_sim#411]\n+- LogicalRDD [_1#404], false\n"
The size function will work. Don't understand why 'min' does not.
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.size(sim_col))).show()
+------------+------------+
| _1|min_turn_sim|
+------------+------------+
|[1, 2, 3, 4]| 4.0|
| [0, 2, 4]| 3.0|
| []| -1.0|
| [3]| 1.0|
+------------+------------+
min is an aggregate function - it operates on columns, not values. Therefore min(sim_col) means minimum array value across all rows in the scoper, according to array ordering, not minimum value in each row.
To find a minimum for each row you'll need a non-aggregate function. In the latest Spark versions (2.4.0 and later) this would be array_min (similarly array_max to get the maximum value):
df.withColumn("min_turn_sim", F.coalesce(F.array_min(sim_col), F.lit(-1)))
Earlier versions will require an UDF:
#F.udf("long")
def long_array_min(xs):
return min(xs) if xs else -1
df.withColumn("min_turn_sim", F.coalesce(long_array_min(sim_col), F.lit(-1))

Spark simpler value_counts

Something similar to Spark - Group by Key then Count by Value would allow me to emulate df.series.value_counts() the functionality of Pandas in Spark to:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious if this can't be achieved nicer / simpler for data frames in Spark.
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
If you want to include additional grouping columns (like "key") just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)

Resources