I have a dataset, for example:
from pyspark import SparkContext

sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply every element of the inner list by 2 across the RDD. Since the data is already parallelized, I can't just operate on the result of y.take(1) to multiply [2, 3, 4, 5] by 2; that would only transform a local copy on the driver.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
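Since the pairs are (key, list), mapValues is a slightly cleaner equivalent that leaves the key untouched; a minimal sketch of the same transformation:
# mapValues applies the function to the value part of each pair only
y = sc.parallelize(x).mapValues(lambda arr: [2 * t for t in arr])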
It will be more efficient if you use the DataFrame API instead of RDDs: that way all your processing happens inside the JVM, without the serialization to Python that the RDD API incurs.
For example, you can use the transform function to apply a transformation to each array element:
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x*2).alias("arr"))
df2.show()
will give you the desired output:
+---+---------------+
| id| arr|
+---+---------------+
| 1| [4, 6, 8, 10]|
| 2|[4, 14, 16, 20]|
+---+---------------+
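Note that passing a Python lambda to F.transform requires Spark 3.1+. On older versions (Spark 2.4+) the same higher-order function is available through SQL via F.expr; a sketch:
import pyspark.sql.functions as F

# Same transformation via the SQL transform() higher-order function (Spark 2.4+)
df2 = df.select("id", F.expr("transform(arr, x -> x * 2)").alias("arr"))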
In a PySpark dataframe, I have a column whose values are lists, for example: [1,2,3,4,5,6,7,8]
I would like to split every such value into chunks of 4, giving [[1,2,3,4], [5,6,7,8]].
Please let me know how I can achieve this.
Thanks for your help in advance.
You can use the transform function together with sequence and slice as shown below: sequence(1, size(values), 4) generates the chunk start indices (1, 5, ...), and slice(values, v, 4) takes the 4 elements starting at each of those indices.
df = spark.createDataFrame([([1, 2, 3, 4, 5, 6, 7, 8],)], ['values'])
df.selectExpr("transform(sequence(1, size(values), 4), v-> slice(values, v, 4)) as values")\
.show(truncate=False)
+----------------------------+
|values |
+----------------------------+
|[[1, 2, 3, 4], [5, 6, 7, 8]]|
+----------------------------+
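If you prefer the DataFrame API over selectExpr, the same chunking can be written with F.transform, F.sequence and F.slice (the Python-lambda form again requires Spark 3.1+); a sketch:
import pyspark.sql.functions as F

# Generate chunk start indices 1, 5, 9, ... and take a 4-element slice at each
df.select(
    F.transform(
        F.sequence(F.lit(1), F.size("values"), F.lit(4)),
        lambda i: F.slice("values", i, 4),
    ).alias("values")
).show(truncate=False)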
I have a pandas DataFrame:
import pandas as pd

df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).agg("count")
But all gave either incorrect results (option 3) or an error (options 1 and 2).
How can I solve this?
We can try named aggregation
d = {c:(c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
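For reference, the dict unpacking above expands to this explicit named-aggregation call (named aggregation requires pandas 0.25+):
df.groupby('col_1').agg(
    col_2=('col_2', list),
    col_3=('col_3', list),
    count=('col_2', 'size'),
)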
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
col_2 col_3 count
col_1
apple [1, 8] [56, 22] 2
banana [4, 8, 6] [4, 1, 5] 3
Just adding another approach to solve the problem:
x = df.groupby('col_1')
x.agg({'col_2': lambda s: list(s), 'col_3': lambda s: list(s)}).reset_index().join(
    x['col_2'].transform('count').rename('count'))
Output
col_1 col_2 col_3 count
0 apple [1, 8] [56, 22] 2
1 banana [4, 8, 6] [4, 1, 5] 3
I have a matrix of M vectors where each vector is of size N (NxM).
I also have a Boolean vector of size L>=M, with exactly M entries set to True.
I want to create a list of L lists: where the Boolean vector is True, I place the M column vectors in the same order as they appear in the matrix; everywhere else I want an empty list.
Example: M = 3, N = 4, L = 5
import numpy as np

mat = np.array([[1, 5, 9],
[2, 6, 10],
[3, 7, 11],
[4, 8, 12]])
mask = [True, False, True, True, False]
I want to create the following:
res = [ [1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
Accessing it can be done using:
data = [res[idx] for idx in range(len(res)) if mask[idx]]
However, creating it is a bit problematic.
I tried creating a list of empty lists, but I can't access all relevant entries at once.
Is there an elegant way of doing it?
Here is how I would do it: consume the columns of mat from an iterator, advancing it only where the mask is True.
mi = iter(mat.T.tolist())  # iterator over the columns of mat
[(m or []) and next(mi) for m in mask]
# (m or []) is [] when m is False, which short-circuits the `and`;
# when m is True, the `and` evaluates next(mi) and yields the next column.
# [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
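If the short-circuit trick looks too clever, a sketch of the same logic as a plain loop (same mat and mask as above):
cols = iter(mat.T.tolist())
res = []
for m in mask:
    res.append(next(cols) if m else [])  # consume the next column only where the mask is True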
As you are already using a list comprehension to get the data back from res, I would do a similar thing to create res in the first place.
mask_cs = np.cumsum(mask) - 1 # array([0, 0, 1, 2, 2]) , gives the corresponding index in mat
res = [mat[:, mask_cs[idx]].tolist() if mask[idx] else [] for idx in range(L)]
As an alternative, which accesses all columns of mat at once, one can create an intermediate array of shape (N, L):
import numpy as np

res = np.zeros((N, L), dtype=mat.dtype)  # create result array, keeping mat's integer dtype
res[:, mask] = mat  # copy the data into the positions where mask is True
res = res.T.tolist()  # transform the array into a list of lists
for idx in range(L): # Replace the columns with empty lists, if mask[idx] is False
if not mask[idx]:
res[idx] = []
We could make use of np.split for some elegance, like so -
In [162]: split_cols = np.split(mat.T,np.cumsum(mask)[:-1])
In [163]: split_cols
Out[163]:
[array([[1, 2, 3, 4]]),
array([], shape=(0, 4), dtype=int64),
array([[5, 6, 7, 8]]),
array([[ 9, 10, 11, 12]]),
array([], shape=(0, 4), dtype=int64)]
So, that gives us a list of 2D arrays. For the desired list-of-lists output, we need to flatten each of them down to a plain list -
In [164]: list(map(list,(map(np.ravel,split_cols))))
Out[164]: [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
Alternatively, we can use lambda if that looks more elegant to some -
In [165]: F = lambda a: np.ravel(a).tolist()
In [166]: list(map(F,split_cols))
Out[166]: [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
I have two data frames like below:
data frame 1 (df1):
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
data frame 2 (df2):
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that I converted the DataFrames to RDDs:
rdd1 = df1.rdd
If I do rdd1.collect() the result is as below:
[Row(id=8, features=[5, 4, 5]), Row(id=9, features=[4, 5, 2])]
rdd2 = df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now broadcastedrddif.value gives:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want to compute the sum of products of each row of rdd1 with every entry of broadcastedrddif, i.e. it should return output like below:
((8, [(1, 5*1+4*2+5*3), (2, 5*4+4*5+5*6)]), (9, [(1, 4*1+5*2+2*3), (2, 4*4+5*5+2*6)]))
so my final output should be
((8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)]))
where (1, 28) is a tuple, not a float.
Please help me with this.
I did not understand why you used sc.broadcast(), but I used it anyway...
mapValues is very useful on the last RDD here, and I used a list comprehension to perform the operations against the broadcast dictionary.
x1=sc.parallelize([[8,5,4,5], [9,4,5,2]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x1.collect()
x2=sc.parallelize([[1,1,2,3], [2,4,5,6]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x2.collect()
# I built the RDDs directly because it is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2=broadcastedrddif.value
def sum_prod(x, y):
    c = 0
    for i in range(len(x)):
        c += x[i] * y[i]
    return c

x1.mapValues(lambda v: [(k, sum_prod(v, d2[k])) for k in d2.keys()]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
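For reference, sum_prod is just a dot product, so the same result can be written more compactly with zip (a sketch, using the same x1 and d2 as above):
x1.mapValues(lambda v: [(k, sum(a * b for a, b in zip(v, d2[k]))) for k in d2]).collect()
# [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]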
I have an RDD with the following rows:
[(id,value)]
How would you sum the values of all rows in the RDD?
Simply use sum; you just need to get the values into a flat RDD of numbers first.
For example
(sc.parallelize([('id', [1, 2, 3]), ('id2', [3, 4, 5])])
   .flatMap(lambda tup: tup[1])  # [1, 2, 3, 3, 4, 5]
   .sum())
Outputs 18
Similarly, just use values() to get the second column as an RDD on its own.
(sc.parallelize([('id', 6), ('id2', 12)])
   .values()  # [6, 12]
   .sum())
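values() is equivalent to mapping out the second element of each pair yourself; a sketch:
sc.parallelize([('id', 6), ('id2', 12)]).map(lambda kv: kv[1]).sum()  # 18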