See the following data frame, for example:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('test').getOrCreate()
df = spark.createDataFrame([[[1, 2, 3, 4]],[[0, 2, 4]],[[]],[[3]]])
df.show()
Then we have
+------------+
| _1|
+------------+
|[1, 2, 3, 4]|
| [0, 2, 4]|
| []|
| [3]|
+------------+
Then I want to find the minimum of each list, using -1 in case of an empty list. I tried the following, which does not work:
import pyspark.sql.functions as F
sim_col = F.col('_1')
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.min(sim_col))).show()
The error is:
AnalysisException: "cannot resolve 'CASE WHEN (_1 IS NULL) THEN -1.0D ELSE min(_1) END' due to data type mismatch: THEN and ELSE expressions should all be same type or coercible to a common type;;\n'Aggregate [_1#404, CASE WHEN isnull(_1#404) THEN -1.0 ELSE min(_1#404) END AS min_turn_sim#411]\n+- LogicalRDD [_1#404], false\n"
The size function works; I don't understand why 'min' does not:
df.withColumn('min_turn_sim', F.when(F.size(sim_col)==0, -1.0).otherwise(F.size(sim_col))).show()
+------------+------------+
| _1|min_turn_sim|
+------------+------------+
|[1, 2, 3, 4]| 4.0|
| [0, 2, 4]| 3.0|
| []| -1.0|
| [3]| 1.0|
+------------+------------+
min is an aggregate function - it operates on columns, not values. Therefore min(sim_col) means the minimum array value across all rows in the scope, according to array ordering, not the minimum value in each row.
To find a minimum for each row you'll need a non-aggregate function. In the latest Spark versions (2.4.0 and later) this would be array_min (similarly array_max to get the maximum value):
df.withColumn("min_turn_sim", F.coalesce(F.array_min(sim_col), F.lit(-1)))
Earlier versions will require a UDF:
# Python UDF returning a bigint; returns -1 for an empty list
@F.udf("long")
def long_array_min(xs):
    return min(xs) if xs else -1
df.withColumn("min_turn_sim", F.coalesce(long_array_min(sim_col), F.lit(-1)))
Related
I have code like this:
new_df=spark.sql("Select col1,col2 from table1 where id=2").collect()[0][0]
I have tried toLocalIterator(), but I get an error saying that it is not subscriptable.
Please suggest a better way to replace collect()[0][0].
IIUC -
Assume this is the resulting DF:
+----+---+---------------+
| id|num| list_col|
+----+---+---------------+
|1001| 5|[1, 2, 3, 4, 5]|
|1002| 3| [1, 2, 3]|
+----+---+---------------+
In order to get the first value of list_col, use one more [] in your existing code:
print(df.select("list_col").collect()[0][0][0])
will give you 1
Likewise, this will give you 2
print(df.select("list_col").collect()[0][0][1])
Updating my answer as per new understanding -
i.e. to access the first element of a list column from a dataframe:
df = df.withColumn("list_element", F.col("list_col").getItem(0))
df.show()
+----+---+---------------+------------+
| id|num| list_col|list_element|
+----+---+---------------+------------+
|1001| 5|[1, 2, 3, 4, 5]| 1|
|1002| 3| [1, 2, 3]| 1|
+----+---+---------------+------------+
I am looking for the distinct count of the array in each row, using a PySpark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it is giving me the length of each array:
output:
3
3
4
Please help me figure out how to achieve this using a Python PySpark dataframe.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
slen = udf(lambda s: len(s), IntegerType())  # len() gives the array length, not the distinct count
count = df.withColumn("Count", slen(df.col1))
count.show()
Thanks in advance!
For Spark 2.4+ you can use array_distinct and then just take the size of that to get the count of distinct values in your array. Using a UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
I am using PySpark 2.3.1 and would like to filter array elements with an expression and not using a udf:
>>> df = spark.createDataFrame([(1, "A", [1,2,3,4]), (2, "B", [1,2,3,4,5])],["col1", "col2", "col3"])
>>> df.show()
+----+----+---------------+
|col1|col2| col3|
+----+----+---------------+
| 1| A| [1, 2, 3, 4]|
| 2| B|[1, 2, 3, 4, 5]|
+----+----+---------------+
The expression shown below is wrong; I wonder how to tell Spark to remove any values from the array in col3 which are smaller than 3. I want something like:
>>> filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
>>> filtered.show()
+----+----+---------+
|col1|col2| newcol|
+----+----+---------+
| 1| A| [3, 4]|
| 2| B|[3, 4, 5]|
+----+----+---------+
I already have a udf solution, but it is very slow (> 1 billion data rows):
largerThan = F.udf(lambda row, max: [x for x in row if x >= max], ArrayType(IntegerType()))
df = df.withColumn('newcol', largerThan(df.col3, lit(3)))
Any help is welcome. Thank you very much in advance.
Spark < 2.4
There is no reasonable* replacement for udf in PySpark.
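If you are on an older version, a cleaned-up sketch of the udf route (it returns the filtered array rather than its size; the function name and threshold are illustrative, the column name col3 is taken from the question):
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, IntegerType
# Python-level filter; the threshold is passed in as a literal column
filter_ge = F.udf(
    lambda xs, threshold: [x for x in xs if x >= threshold] if xs is not None else None,
    ArrayType(IntegerType()))
df.withColumn("newcol", filter_ge(F.col("col3"), F.lit(3)))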
Spark >= 2.4
Your code:
expr("filter(col3, x -> x >= 3)")
can be used as is.
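For example, a minimal sketch on Spark 2.4+ using the frame from the question:
from pyspark.sql.functions import expr
# filter is a SQL higher-order function available from Spark 2.4
filtered = df.withColumn("newcol", expr("filter(col3, x -> x >= 3)"))
filtered.show()
# expected: [1, 2, 3, 4] -> [3, 4] and [1, 2, 3, 4, 5] -> [3, 4, 5]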
Reference
Querying Spark SQL DataFrame with complex types
* Given the cost of exploding or converting to and from RDD, a udf is almost exclusively preferable.
I have a dataframe with two columns, listA stored as Seq[String] and valB stored as String. I want to create a third column valC, which will be of Int type and whose value is 1 if valB is present in listA, otherwise 0.
I tried doing the following:
val dfWithAdditionalColumn = df.withColumn("valC", when($"listA".contains($"valB"), 1).otherwise(0))
But Spark failed to execute this and gave the following error:
cannot resolve 'contains('listA', 'valB')' due to data type mismatch: argument 1 requires string type, however, 'listA' is of array type.;
How do I use an array-type column value in a CASE statement?
Thanks,
Devj
You should use array_contains:
import org.apache.spark.sql.functions.{expr, array_contains}
df.withColumn("valC", when(expr("array_contains(listA, valB)"), 1).otherwise(0))
You can write a simple udf that will check if the element is present in the array:
val arrayContains = udf( (col1: Int, col2: Seq[Int]) => if(col2.contains(col1) ) 1 else 0 )
And then just call it and pass the necessary columns in the correct order:
df.withColumn("hasAInB", arrayContains($"a", $"b" ) ).show
+---+---------+-------+
| a| b|hasAInB|
+---+---------+-------+
| 1| [1, 2]| 1|
| 2|[2, 3, 4]| 1|
| 3| [1, 4]| 0|
+---+---------+-------+
Something similar to Spark - Group by Key then Count by Value would allow me to emulate the df.series.value_counts() functionality of Pandas in Spark:
The resulting object will be in descending order so that the first
element is the most frequently-occurring element. Excludes NA values
by default. (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html)
I am curious whether this can be achieved more nicely / simply for data frames in Spark.
It is just a basic aggregation, isn't it?
df.groupBy($"value").count.orderBy($"count".desc)
Pandas:
import pandas as pd
pd.Series([1, 2, 2, 2, 3, 3, 4]).value_counts()
2 3
3 2
4 1
1 1
dtype: int64
Spark SQL:
Seq(1, 2, 2, 2, 3, 3, 4).toDF("value")
.groupBy($"value").count.orderBy($"count".desc)
+-----+-----+
|value|count|
+-----+-----+
| 2| 3|
| 3| 2|
| 1| 1|
| 4| 1|
+-----+-----+
If you want to include additional grouping columns (like "key") just put these in the groupBy:
df.groupBy($"key", $"value").count.orderBy($"count".desc)