How to zip/concat value and list in pyspark - apache-spark

I am working on a function that takes 4 inputs.
To do so, I would like to get a list summarizing these 4 elements. However, two of my columns hold a single value and two hold lists. I can zip the two lists with arrays_zip, but I can't get an array combining the 4 elements:
+----+----+---------+---------+
| l1 | l2 | l3      | l4      |
+----+----+---------+---------+
| 1  | 5  | [1,2,3] | [2,2,2] |
| 2  | 9  | [8,2,7] | [1,7,7] |
| 3  | 3  | [8,4,9] | [5,1,3] |
| 4  | 1  | [5,5,3] | [8,4,3] |
+----+----+---------+---------+
What I want to get:
+----+----+---------+---------+------------------------------------------+
| l1 | l2 | l3      | l4      | l5                                       |
+----+----+---------+---------+------------------------------------------+
| 1  | 5  | [1,2,3] | [2,2,2] | [[1, 5, 1, 2],[1, 5, 2, 2],[1, 5, 3, 2]] |
| 2  | 9  | [8,2,7] | [1,7,7] | [[2, 9, 8, 1],[2, 9, 2, 7],[2, 9, 7, 7]] |
| 3  | 3  | [8,4,9] | [5,1,3] | [[3, 3, 8, 5],[3, 3, 4, 1],[3, 3, 9, 3]] |
| 4  | 1  | [5,5,3] | [8,4,3] | [[4, 1, 5, 8],[4, 1, 5, 4],[4, 1, 3, 3]] |
+----+----+---------+---------+------------------------------------------+
My idea was to turn l1 and l2 into lists of the same size as l3 and then apply arrays_zip, but I didn't find a clean way to build those lists.
Once I had this list of lists, I would apply a function as follows:
import pyspark.sql.functions as f
import pyspark.sql.types as T

def is_good(data):
    a, b, c, d = data
    return a + b + c + d

is_good_udf = f.udf(lambda x: is_good(x), T.FloatType())
spark.udf.register("is_good_udf", is_good, T.FloatType())
My guess, thanks to #kafels, would be to build something like this, where for each row and each list inside the list it applies the function:
df.withColumn("tot", f.expr("transform(l5, y -> is_good_udf(y))"))
The goal is to obtain a list of results such as [9, 10, 11] for the first row, for instance.

You can use the expr function and apply TRANSFORM:
import pyspark.sql.functions as f
df = df.withColumn('l5', f.expr("""TRANSFORM(arrays_zip(l3, l4), el -> array(l1, l2, el.l3, el.l4))"""))
# +---+---+---------+---------+------------------------------------------+
# |l1 |l2 |l3 |l4 |l5 |
# +---+---+---------+---------+------------------------------------------+
# |1 |5 |[1, 2, 3]|[2, 2, 2]|[[1, 5, 1, 2], [1, 5, 2, 2], [1, 5, 3, 2]]|
# |2 |9 |[8, 2, 7]|[1, 7, 7]|[[2, 9, 8, 1], [2, 9, 2, 7], [2, 9, 7, 7]]|
# |3 |3 |[8, 4, 9]|[5, 1, 3]|[[3, 3, 8, 5], [3, 3, 4, 1], [3, 3, 9, 3]]|
# |4 |1 |[5, 5, 3]|[8, 4, 3]|[[4, 1, 5, 8], [4, 1, 5, 4], [4, 1, 3, 3]]|
# +---+---+---------+---------+------------------------------------------+
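To then get the per-row totals the question mentions (e.g. [9, 10, 11] for the first row), one option is to stay in Spark SQL and sum each inner array with the built-in aggregate higher-order function instead of a Python UDF. A minimal sketch, assuming the l5 column created above and reusing the f alias from the snippet:
# Sum the four entries of each inner array of l5 (no Python UDF needed)
df = df.withColumn("tot", f.expr("transform(l5, y -> aggregate(y, 0, (acc, v) -> acc + v))"))
# For the first row this yields [9, 10, 11]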

Related

Get unique elements for every array-based row

I have a dataset which looks somewhat like this:
idx | attributes
--------------------------
101 | ['a','b','c']
102 | ['a','b','d']
103 | ['b','c']
104 | ['c','e','f']
105 | ['a','b','c']
106 | ['c','g','h']
107 | ['b','d']
108 | ['d','g','i']
I wish to transform the above dataframe into something like this:
idx | attributes
--------------------------
101 | [0,1,2]
102 | [0,1,3]
103 | [1,2]
104 | [2,4,5]
105 | [0,1,2]
106 | [2,6,7]
107 | [1,3]
108 | [3,6,8]
Here, 'a' is replaced by 0, 'b' by 1, and so on. Essentially, I wish to find all unique elements and assign them numbers so that integer operations can be performed on them. My current approach uses RDDs to maintain a single set and loop across rows, but it is highly memory- and time-intensive. Is there any other method for this in PySpark?
Thanks in advance
Annotated code
from pyspark.ml.feature import StringIndexer
import pyspark.sql.functions as F

# Explode the dataframe by `attributes`
df1 = df.selectExpr('idx', "explode(attributes) as attributes")
# Create a StringIndexer to encode the labels in alphabetical order
idx = StringIndexer(inputCol='attributes', outputCol='encoded', stringOrderType='alphabetAsc')
df1 = idx.fit(df1).transform(df1)
# Group the encoded column by idx and aggregate using `collect_list`
df1 = df1.groupBy('idx').agg(F.collect_list(F.col('encoded').cast('int')).alias('attributes'))
Result
df1.show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [0, 1, 2]|
|102| [0, 1, 3]|
|103| [1, 2]|
|104| [2, 4, 5]|
|105| [0, 1, 2]|
|106| [2, 6, 7]|
|107| [1, 3]|
|108| [3, 6, 8]|
+---+----------+
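If you also need the label-to-index mapping itself, the fitted StringIndexerModel exposes the labels in index order. A small sketch, assuming the fit from the snippet above is kept in a variable rather than chained:
model = idx.fit(df1)          # instead of idx.fit(df1).transform(df1)
df1 = model.transform(df1)
# model.labels lists the labels ordered by their encoded index,
# e.g. position 0 -> 'a', position 1 -> 'b', ...
print(model.labels)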
This can be done in Spark 2.4 as a one-liner.
In Spark 3.0 this can be done without expr.
from pyspark.sql.functions import expr

df = spark.createDataFrame(data=[(101, ['a', 'b', 'c']),
                                 (102, ['a', 'b', 'd']),
                                 (103, ['b', 'c']),
                                 (104, ['c', 'e', 'f']),
                                 (105, ['a', 'b', 'c']),
                                 (106, ['c', 'g', 'h']),
                                 (107, ['b', 'd']),
                                 (108, ['d', 'g', 'i'])], schema=["idx", "attributes"])
df.select(df.idx, expr("transform(attributes, x -> ascii(x) - 96)").alias("attributes")).show()
+---+----------+
|idx|attributes|
+---+----------+
|101| [1, 2, 3]|
|102| [1, 2, 4]|
|103| [2, 3]|
|104| [3, 5, 6]|
|105| [1, 2, 3]|
|106| [3, 7, 8]|
|107| [2, 4]|
|108| [4, 7, 9]|
+---+----------+
The tricky bit: expr("transform(attributes, x -> ascii(x) - 96)")
expr is used to say that this is a SQL expression.
transform takes a column [that is an array] and applies a function to each element of the array (x is the lambda parameter bound to each array element, -> starts the function body, and ) ends it).
ascii(x) - 96 converts each character's ASCII code into an integer ('a' is 97, so it maps to 1).
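Note that the question asks for 0-based ids ('a' → 0), while ascii(x) - 96 yields 1-based values ('a' → 1). If the 0-based form is needed, the same expression with an offset of 97 (the ASCII code of 'a') should do it:
# Same transform, offset by ascii('a') = 97 so that 'a' -> 0, 'b' -> 1, ...
df.select(df.idx, expr("transform(attributes, x -> ascii(x) - 97)").alias("attributes")).show()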
If you are concerned about performance, compare the explain plan for my answer with the one for the other answer provided so far:
df1.groupBy('idx').agg(collect_list(col('encoded').cast('int')).alias('attributes')).explain()
== Physical Plan ==
ObjectHashAggregate(keys=[idx#24L], functions=[collect_list(cast(encoded#140 as int), 0, 0)])
+- Exchange hashpartitioning(idx#24L, 200)
+- ObjectHashAggregate(keys=[idx#24L], functions=[partial_collect_list(cast(encoded#140 as int), 0, 0)])
+- *(1) Project [idx#24L, UDF(attributes#132) AS encoded#140]
+- Generate explode(attributes#25), [idx#24L], false, [attributes#132]
+- Scan ExistingRDD[idx#24L,attributes#25]
My answer:
df.select(df.idx, expr("transform(attributes, x -> ascii(x) - 96)").alias("attributes")).explain()
== Physical Plan ==
Project [idx#24L, transform(attributes#25, lambdafunction((ascii(lambda x#128) - 96), lambda x#128, false)) AS attributes#127]
Note that the transform version is a single Project with no Exchange (shuffle), whereas the groupBy/collect_list version requires one.

Get index of column item that is in an array in another column in a Spark dataframe

I have a data frame that looks like this:
+-------+-------+--------------------+
| user| item| ls_rec_items|
+-------+-------+--------------------+
| 321| 3| [4, 3, 2, 6, 1, 5]|
| 123| 2| [5, 6, 3, 1, 2, 4]|
| 123| 7| [5, 6, 3, 1, 2, 4]|
+-------+-------+--------------------+
I want to know in which position the "item" is in the "ls_rec_items" array.
I know the function array_position, but I don't know how to get the "item" value there.
I know this:
df.select(F.array_position(df.ls_rec_items, 3)).collect()
But I want this:
df.select(F.array_position(df.ls_rec_items, df.item)).collect()
The output should look like this:
+-------+-------+--------------------+-----+
| user| item| ls_rec_items| pos|
+-------+-------+--------------------+-----+
| 321| 3| [4, 3, 2, 6, 1, 5]| 2|
| 123| 2| [5, 6, 3, 1, 2, 4]| 5|
| 123| 7| [5, 6, 3, 1, 2, 4]| 0|
+-------+-------+--------------------+-----+
You could use expr with array_position like this (note that array_position is 1-based and returns 0 when the item is not found, which matches your expected output):
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.getOrCreate()
    data = [
        {"user": 321, "item": 3, "ls_rec_items": [4, 3, 2, 6, 1, 5]},
        {"user": 123, "item": 2, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
        {"user": 123, "item": 7, "ls_rec_items": [5, 6, 3, 1, 2, 4]},
    ]
    df = spark.createDataFrame(data)
    df = df.withColumn("pos", F.expr("array_position(ls_rec_items, item)"))
Result
+----+------------------+----+---+
|item| ls_rec_items|user|pos|
+----+------------------+----+---+
| 3|[4, 3, 2, 6, 1, 5]| 321| 2|
| 2|[5, 6, 3, 1, 2, 4]| 123| 5|
| 7|[5, 6, 3, 1, 2, 4]| 123| 0|
+----+------------------+----+---+

Add different arrays from numpy to each row of dataframe

I have a Spark SQL dataframe and a 2D numpy matrix. They have the same number of rows. I intend to add each row of the numpy matrix as a new column to the existing PySpark dataframe, so that the list added to each row is different.
For example, the PySpark dataframe is like this
| Id | Name |
| ------ | ------ |
| 1 | Bob |
| 2 | Alice |
| 3 | Mike |
And the numpy matrix is like this
[[2, 3, 5],
 [5, 2, 6],
 [1, 4, 7]]
The resulting expected dataframe should be like this
| Id     | Name   | customized_list |
| ------ | ------ | --------------- |
| 1      | Bob    | [2, 3, 5]       |
| 2      | Alice  | [5, 2, 6]       |
| 3      | Mike   | [1, 4, 7]       |
The Id column corresponds to the order of the entries in the numpy matrix.
Is there an efficient way to implement this?
Create a DataFrame from your numpy matrix and add an Id column to indicate the row number. Then you can join to your original PySpark DataFrame on the Id column.
import numpy as np
a = np.array([[2, 3, 5], [5, 2, 6], [1, 4, 7]])
list_df = spark.createDataFrame(enumerate(a.tolist(), start=1), ["Id", "customized_list"])
list_df.show()
#+---+---------------+
#| Id|customized_list|
#+---+---------------+
#| 1| [2, 3, 5]|
#| 2| [5, 2, 6]|
#| 3| [1, 4, 7]|
#+---+---------------+
Here I used enumerate(..., start=1) to add the row number.
Now just do an inner join:
df.join(list_df, on="Id", how="inner").show()
#+---+-----+---------------+
#| Id| Name|customized_list|
#+---+-----+---------------+
#| 1| Bob| [2, 3, 5]|
#| 3| Mike| [1, 4, 7]|
#| 2|Alice| [5, 2, 6]|
#+---+-----+---------------+
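Keep in mind that a join does not guarantee row order (note the 1, 3, 2 order above); if the original ordering matters downstream, you can sort by Id after joining:
# Sort on Id after the join to restore the original ordering
df.join(list_df, on="Id", how="inner").orderBy("Id").show()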

Numpy specific reshape order for 2D Array

I've got a 4x3 matrix and I need to reshape it to a 2x6 matrix with a specific order:
Initial:
|1 5 9|
|2 6 10|
|3 7 11|
|4 8 12|
Wanted:
|1 2 5 6 9 10|
|3 4 7 8 11 12|
I tried to use numpy, but the order was not what I expected:
import numpy as np
a = np.array([[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]])
b = a.reshape(2, 6)
print(a)
print(b)
Reshape, permute and reshape -
In [51]: n = 2 # no. of rows to select from input to form each row of output
In [52]: a.reshape(-1,n,a.shape[1]).swapaxes(1,2).reshape(-1,n*a.shape[1])
Out[52]:
array([[ 1,  2,  5,  6,  9, 10],
       [ 3,  4,  7,  8, 11, 12]])
Sort of an explanation:
Cut along the first axis to end up with a 3D array such that we select n rows along the second axis. Swap that second axis with the last (third) axis, so that it is pushed to the end.
Reshape back to 2D to merge the last two axes. That's our output!
For more in-depth info, please refer to the linked Q&A.
If we are given the number of rows in output -
In [54]: nrows = 2 # number of rows in output
In [55]: a.reshape(nrows,-1,a.shape[1]).swapaxes(1,2).reshape(nrows,-1)
Out[55]:
array([[ 1,  2,  5,  6,  9, 10],
       [ 3,  4,  7,  8, 11, 12]])
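For this particular 4x3 input, the same recipe with the shapes written out explicitly (a sketch equivalent to the general formula above):
import numpy as np

a = np.array([[1, 5, 9], [2, 6, 10], [3, 7, 11], [4, 8, 12]])
# (4, 3) -> (2, 2, 3): group the rows in pairs, swap the last two axes, then flatten back to 2D
b = a.reshape(2, 2, 3).swapaxes(1, 2).reshape(2, 6)
# array([[ 1,  2,  5,  6,  9, 10],
#        [ 3,  4,  7,  8, 11, 12]])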

pyspark dataframe from rdd containing key and values as list of lists

I have an RDD like the one below, with keys and values as lists of lists containing some parameters.
(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]])
(32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])
I want to create a dataframe with rows and columns as below:
32719, u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0
32719, u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0
32897, u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0
Or just a dataframe of all the values grouped by the key. How can I do this?
Use flatMapValues:
a = [(32719, [[[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]]),
     (32897, [[[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]]])]
rdd = sc.parallelize(a)
rdd.flatMapValues(lambda x: x[0]).map(lambda x: [x[0]] + x[1]).toDF().show()
Output
+-------+----------------+---------------+----+----+-------+-----+----+
| _1 | _2 | _3 | _4 | _5 | _6 | _7 | _8 |
+-------+----------------+---------------+----+----+-------+-----+----+
| 32719 | 200.73.55.34 | 192.16.48.217 | 0 | 6 | 10163 | 443 | 0 |
| 32719 | 177.207.76.243 | 192.16.58.8 | 0 | 6 | 59575 | 80 | 0 |
| 32897 | 200.73.55.34 | 193.16.48.217 | 0 | 6 | 10163 | 443 | 0 |
| 32897 | 167.207.76.243 | 194.16.58.8 | 0 | 6 | 59575 | 80 | 0 |
+-------+----------------+---------------+----+----+-------+-----+----+
You can map to add the key to each value and create a dataframe. I tried it my way:
>>>dat1 = [(32719, [[u'200.73.55.34', u'192.16.48.217', 0, 6, 10163, 443, 0], [u'177.207.76.243', u'192.16.58.8', 0, 6, 59575, 80, 0]]),(32897, [[u'200.73.55.34', u'193.16.48.217', 0, 6, 10163, 443, 0], [u'167.207.76.243', u'194.16.58.8', 0, 6, 59575, 80, 0]])]
>>>rdd1 = sc.parallelize(dat1).map(lambda x : [[x[0]]+i for i in x[1]]).flatMap(lambda x : (x))
>>>df = rdd1.toDF(['col1','col2','col3','col4','col5','col6','col7','col8'])
>>> df.show()
+-----+--------------+-------------+----+----+-----+----+----+
| col1| col2| col3|col4|col5| col6|col7|col8|
+-----+--------------+-------------+----+----+-----+----+----+
|32719| 200.73.55.34|192.16.48.217| 0| 6|10163| 443| 0|
|32719|177.207.76.243| 192.16.58.8| 0| 6|59575| 80| 0|
|32897| 200.73.55.34|193.16.48.217| 0| 6|10163| 443| 0|
|32897|167.207.76.243| 194.16.58.8| 0| 6|59575| 80| 0|
+-----+--------------+-------------+----+----+-----+----+----+
