Add different arrays from numpy to each row of dataframe - apache-spark

I have a SparkSQL DataFrame and a 2D numpy matrix. They have the same number of rows. I intend to add each row of the numpy matrix as a new column to the existing PySpark DataFrame, so that the list added to each row is different.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5]
[5, 2, 6]
[1, 4, 7]]
The resulting expected dataframe should be like this
| Id | Name | customized_list |
| ------ | ------ | --------------- |
| 1 | Bob | [2, 3, 5] |
| 2 | Alice | [5, 2, 6] |
| 3 | Mike | [1, 4, 7] |
The Id column corresponds to the order of the entries in the numpy matrix.
Is there an efficient way to implement this?

Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join it to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+
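For completeness, a minimal sketch of the original DataFrame the join above assumes, built from the data in the question (column names as given there):
df = spark.createDataFrame(
    [(1, "Bob"), (2, "Alice"), (3, "Mike")],
    ["Id", "Name"],
)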

Related

Get unique elements for every array-based row

I have a dataset which looks somewhat like this:
idx | attributes
--------------------------
101 | ['a','b','c']
102 | ['a','b','d']
103 | ['b','c']
104 | ['c','e','f']
105 | ['a','b','c']
106 | ['c','g','h']
107 | ['b','d']
108 | ['d','g','i']
I wish to transform the above dataframe into something like this:
idx | attributes
--------------------------
101 | [0,1,2]
102 | [0,1,3]
103 | [1,2]
104 | [2,4,5]
105 | [0,1,2]
106 | [2,6,7]
107 | [1,3]
108 | [3,6,8]
Here, 'a' is replaced by 0, 'b' is replaced by 1, and so on. Essentially, I wish to find all unique elements and assign them numbers so that integer operations can be performed on them. My current approach uses RDDs to maintain a single set and loops across rows, but it is highly memory- and time-intensive. Is there any other method for this in PySpark?
Thanks in advance
Annotated code
from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F
# Explode the dataframe by `attributes` so each element gets its own row
df1 = df.selectExpr('idx', "explode(attributes) as attributes")
# Create a StringIndexer to encode the labels in alphabetical order
idx = StringIndexer(inputCol='attributes', outputCol='encoded', stringOrderType='alphabetAsc')
df1 = idx.fit(df1).transform(df1)
# Group the encoded column by idx and aggregate using `collect_list`
df1 = df1.groupBy('idx').agg(F.collect_list(F.col('encoded').cast('int')).alias('attributes'))
Result
df1.show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [0, 1, 2]|
|102| [0, 1, 3]|
|103| [1, 2]|
|104| [2, 4, 5]|
|105| [0, 1, 2]|
|106| [2, 6, 7]|
|107| [1, 3]|
|108| [3, 6, 8]|
+---+----------+
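If you also need the mapping from attribute to code, keep a handle on the fitted model instead of chaining fit and transform as above; a small sketch under that assumption:
# Instead of idx.fit(df1).transform(df1), keep the fitted model around
model = idx.fit(df1)        # df1 here is still the exploded frame
df1 = model.transform(df1)
print(model.labels)         # e.g. ['a', 'b', 'c', ...]; the list position is the encoded integer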
In Spark 2.4+ this can be done as a one-liner with expr. In Spark 3.x the same transform is also exposed directly in the DataFrame API, so expr is not needed (see the sketch after the explanation below).
from pyspark.sql.functions import expr

df = spark.createDataFrame(
    data=[(101, ['a', 'b', 'c']),
          (102, ['a', 'b', 'd']),
          (103, ['b', 'c']),
          (104, ['c', 'e', 'f']),
          (105, ['a', 'b', 'c']),
          (106, ['c', 'g', 'h']),
          (107, ['b', 'd']),
          (108, ['d', 'g', 'i'])],
    schema=["idx", "attributes"])
df.select(df.idx, expr("transform(attributes, x -> ascii(x) - 96)").alias("attributes")).show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [1, 2, 3]|
|102| [1, 2, 4]|
|103| [2, 3]|
|104| [3, 5, 6]|
|105| [1, 2, 3]|
|106| [3, 7, 8]|
|107| [2, 4]|
|108| [4, 7, 9]|
+---+----------+
The tricky bit: expr("transform(attributes, x -> ascii(x) - 96)")
expr marks the string as a SQL expression.
transform takes an array column and applies a function to each of its elements; x is the lambda parameter bound to each element, and -> separates the parameter from the function body.
ascii(x) - 96 converts each character's ASCII code into an integer: ascii('a') is 97, so 'a' maps to 1, 'b' to 2, and so on.
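As noted above, newer PySpark releases (roughly Spark 3.1+) also expose transform directly as pyspark.sql.functions.transform, so the same expression can be written without expr; a sketch under that assumption:
import pyspark.sql.functions as F
df.select(
    df.idx,
    F.transform("attributes", lambda x: F.ascii(x) - 96).alias("attributes")
).show()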
If performance matters, compare the explain plan for my answer with the plan for the other answer provided so far:
df1.groupBy('idx').agg(collect_list(col('encoded').cast('int')).alias('attributes')).explain()
== Physical Plan ==
ObjectHashAggregate(keys=[idx#24L], functions=[collect_list(cast(encoded#140 as int), 0, 0)])
+- Exchange hashpartitioning(idx#24L, 200)
+- ObjectHashAggregate(keys=[idx#24L], functions=[partial_collect_list(cast(encoded#140 as int), 0, 0)])
+- *(1) Project [idx#24L, UDF(attributes#132) AS encoded#140]
+- Generate explode(attributes#25), [idx#24L], false, [attributes#132]
+- Scan ExistingRDD[idx#24L,attributes#25]
my answer:
df.select(df.idx, expr("transform( attributes, x -> ascii(x)-96)").alias("attributes") ).explain()
== Physical Plan ==
Project [idx#24L, transform(attributes#25, lambdafunction((ascii(lambda x#128) - 96), lambda x#128, false)) AS attributes#127]
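In short, the transform version compiles to a single Project with no Exchange, while the explode/StringIndexer/groupBy version needs a shuffle (hashpartitioning) plus two aggregation steps, so the one-liner should scale better.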

Get index of column item that is in an array in another column in a Spark dataframe

I have a data frame that looks like this:
+-------+-------+--------------------+
| user| item| ls_rec_items|
+-------+-------+--------------------+
| 321| 3| [4, 3, 2, 6, 1, 5]|
| 123| 2| [5, 6, 3, 1, 2, 4]|
| 123| 7| [5, 6, 3, 1, 2, 4]|
+-------+-------+--------------------+
I want to know in which position the "item" is in the "ls_rec_items" array.
I know the function array_position, but I don't know how to get the "item" value there.
I know this:
df.select(F.array_position(df.ls_rec_items, 3)).collect()
But I want this:
df.select(F.array_position(df.ls_rec_items, df.item)).collect()
The output should look like this:
+-------+-------+--------------------+-----+
| user| item| ls_rec_items| pos|
+-------+-------+--------------------+-----+
| 321| 3| [4, 3, 2, 6, 1, 5]| 2|
| 123| 2| [5, 6, 3, 1, 2, 4]| 5|
| 123| 7| [5, 6, 3, 1, 2, 4]| 0|
+-------+-------+--------------------+-----+
You could use expr with array_position like this:
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"user": 321, "item": 3, "ls_rec_items": [4, 3, 2, 6, 1, 5]},
        {"user": 123, "item": 2, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
        {"user": 123, "item": 7, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
    ]
    df = spark.createDataFrame(data)
    df = df.withColumn("pos", F.expr("array_position(ls_rec_items, item)"))
    df.show()
Result
+----+------------------+----+---+
|item| ls_rec_items|user|pos|
+----+------------------+----+---+
| 3|[4, 3, 2, 6, 1, 5]| 321| 2|
| 2|[5, 6, 3, 1, 2, 4]| 123| 5|
| 7|[5, 6, 3, 1, 2, 4]| 123| 0|
+----+------------------+----+---+
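An equivalent one-liner with selectExpr, a sketch on the same df as above:
df.selectExpr("user", "item", "ls_rec_items",
              "array_position(ls_rec_items, item) as pos").show()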

Spark: Replace collect()[][] operation

I have code like this:
new_df=spark.sql("Select col1,col2 from table1 where id=2").collect()[0][0]
I have tried toLocalIterator(), but I get a message that it is not subscriptable.
Please suggest a better way to replace collect()[0][0].
IIUC -
Assume this is the resulted DF
+----+---+---------------+
| id|num| list_col|
+----+---+---------------+
|1001| 5|[1, 2, 3, 4, 5]|
|1002| 3| [1, 2, 3]|
+----+---+---------------+
To get the first value of list_col, add one more [] to your existing code:
print(df.select("list_col").collect()[0][0][0])
will give you 1
Likewise, this will give you 2
print(df.select("list_col").collect()[0][0][1])
Updating my answer as per a new understanding, i.e. to access the first element of a list column of a DataFrame:
df = df.withColumn("list_element", F.col("list_col").getItem(0))
df.show()
+----+---+---------------+------------+
| id|num| list_col|list_element|
+----+---+---------------+------------+
|1001| 5|[1, 2, 3, 4, 5]| 1|
|1002| 3| [1, 2, 3]| 1|
+----+---+---------------+------------+
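Back to the original collect()[0][0] pattern: if you only need a single scalar, one sketch of an alternative is first(), which pulls just one Row to the driver (or None if the query returns nothing); query and column names are taken from the question:
row = spark.sql("Select col1,col2 from table1 where id=2").first()
value = row["col1"] if row is not None else None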

Remove value from different datasets

I have 2 pyspark datasets:
df_1
name | number <Array>
-------------------------
12 | [1, 2, 3]
-------------------------
34 | [9, 8, 7]
-------------------------
46 | [10]
-------------------------
68 | [2, 88]
-------------------------
df_2
number_to_be_deleted <String>
------------------
1
------------------
2
------------------
10
------------------
I would like to remove the numbers in df_2 from the arrays in df_1.
If an array becomes empty, I want to change its value to null.
I used array_remove
df = df_1.select(F.array_remove(df_1.number, df_2.number_to_be_deleted)).collect()
I got :
TypeError: 'Column' object is not callable in array_remove
Expected result:
df_1
name | number <Array>
-------------------------
12 | [3]
-------------------------
34 | [9, 8, 7]
-------------------------
46 | null
-------------------------
68 | [88]
-------------------------
Any suggestions, please?
Thank you
You can collapse df2 into a single row holding all the values to delete, cross join it with df1, and use array_except to remove those values. Finally, use when with size to check whether the resulting array is empty and replace it with null.
from pyspark.sql.functions import collect_list, array_except, when, size, col, lit

df2 = df2.groupBy().agg(collect_list("number_to_be_deleted").alias("to_delete"))
df1.crossJoin(df2).withColumn("number", array_except("number", "to_delete"))\
   .withColumn("number", when(size(col("number")) == 0, lit(None)).otherwise(col("number")))\
   .select("name", "number")\
   .show()
#+----+---------+
#|name| number|
#+----+---------+
#| 12| [3]|
#| 34|[9, 8, 7]|
#| 46| null|
#| 68| [88]|
#+----+---------+
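Note that the cross join is cheap here because the groupBy().agg(collect_list(...)) step collapses df2 to a single row, so every row of df1 is joined against just one array of values to delete.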

Get distinct count from an array in each row using pyspark

I am looking for the distinct count of the array in each row of a pyspark dataframe:
input:
col1
[1,1,1]
[3,4,5]
[1,2,1,2]
output:
1
3
2
I used the code below, but it gives me the length of each array instead:
output:
3
3
4
Please help me understand how to achieve this using a PySpark dataframe.
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
slen = udf(lambda s: len(s), IntegerType())
count = df.withColumn("Count", slen(df.col1))
count.show()
Thanks in advance!
For Spark 2.4+ you can use array_distinct and then take its size to get the count of distinct values in each array. A UDF will be very slow and inefficient for big data; always try to use Spark's built-in functions.
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.functions.array_distinct
(welcome to SO)
df.show()
+------------+
| col1|
+------------+
| [1, 1, 1]|
| [3, 4, 5]|
|[1, 2, 1, 2]|
+------------+
import pyspark.sql.functions as F
df.withColumn("count", F.size(F.array_distinct("col1"))).show()
+------------+-----+
| col1|count|
+------------+-----+
| [1, 1, 1]| 1|
| [3, 4, 5]| 3|
|[1, 2, 1, 2]| 2|
+------------+-----+
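For reference, the UDF the question was reaching for would need len(set(s)) rather than len(s); a sketch, kept only for comparison since the built-in version above avoids the Python serialization overhead:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
distinct_len = udf(lambda s: len(set(s)), IntegerType())  # distinct count, not array length
df.withColumn("count", distinct_len("col1")).show()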
