I have a dataset, for example:
from pyspark import SparkContext

sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply every element of the inner list by 2 across the RDD. Since the data is already parallelized, I can't just operate on the result of y.take(1) to multiply [2, 3, 4, 5] by 2; that would only transform a local copy on the driver.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
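Since the pairs are (key, list), mapValues is a slightly cleaner equivalent that leaves the key untouched; a minimal sketch of the same transformation:
# mapValues applies the function to the value part of each pair only
y = sc.parallelize(x).mapValues(lambda arr: [2 * t for t in arr])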
It will be more efficient if you use the DataFrame API instead of RDDs: that way all your processing happens inside the JVM, without the serialization to Python that the RDD API incurs.
For example, you can use the transform function to apply a transformation to each array element:
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x*2).alias("arr"))
df2.show()
will give you the desired output:
+---+---------------+
| id| arr|
+---+---------------+
| 1| [4, 6, 8, 10]|
| 2|[4, 14, 16, 20]|
+---+---------------+
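Note that passing a Python lambda to F.transform requires Spark 3.1+. On older versions (Spark 2.4+) the same higher-order function is available through SQL via F.expr; a sketch:
import pyspark.sql.functions as F

# Same transformation via the SQL transform() higher-order function (Spark 2.4+)
df2 = df.select("id", F.expr("transform(arr, x -> x * 2)").alias("arr"))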
In a PySpark dataframe, I have a column whose values are lists, for example: [1,2,3,4,5,6,7,8]
I would like to split every such value into chunks of 4, giving [[1,2,3,4], [5,6,7,8]].
Please let me know how I can achieve this.
Thanks for your help in advance.
You can use the transform function together with sequence and slice as shown below: sequence(1, size(values), 4) generates the chunk start indices (1, 5, ...), and slice(values, v, 4) takes the 4 elements starting at each of those indices.
df = spark.createDataFrame([([1, 2, 3, 4, 5, 6, 7, 8],)], ['values'])
df.selectExpr("transform(sequence(1, size(values), 4), v-> slice(values, v, 4)) as values")\
.show(truncate=False)
+----------------------------+
|values |
+----------------------------+
|[[1, 2, 3, 4], [5, 6, 7, 8]]|
+----------------------------+
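If you prefer the DataFrame API over selectExpr, the same chunking can be written with F.transform, F.sequence and F.slice (the Python-lambda form again requires Spark 3.1+); a sketch:
import pyspark.sql.functions as F

# Generate chunk start indices 1, 5, 9, ... and take a 4-element slice at each
df.select(
    F.transform(
        F.sequence(F.lit(1), F.size("values"), F.lit(4)),
        lambda i: F.slice("values", i, 4),
    ).alias("values")
).show(truncate=False)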
I have a pandas DataFrame:
import pandas as pd

df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list).agg("count")
But all gave either incorrect results (option 3) or an error (options 1 and 2).
How can I solve this?
We can try named aggregation
d = {c:(c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
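For reference, the dict unpacking above expands to this explicit named-aggregation call (named aggregation requires pandas 0.25+):
df.groupby('col_1').agg(
    col_2=('col_2', list),
    col_3=('col_3', list),
    count=('col_2', 'size'),
)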
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
col_2 col_3 count
col_1
apple [1, 8] [56, 22] 2
banana [4, 8, 6] [4, 1, 5] 3
Just adding another approach to solve the problem:
x = df.groupby('col_1')
x.agg({'col_2': lambda s: list(s), 'col_3': lambda s: list(s)}).reset_index().join(
    x['col_2'].transform('count').rename('count'))
Output
col_1 col_2 col_3 count
0 apple [1, 8] [56, 22] 2
1 banana [4, 8, 6] [4, 1, 5] 3
I have a matrix of M vectors where each vector is of size N (NxM).
I also have a Boolean vector of size L>=M, with exactly M entries set to True.
I want to create a list of L lists: where the Boolean vector is True, I place the M column vectors in the same order as they appear in the matrix; everywhere else I want an empty list.
Example: M = 3, N = 4, L = 5
import numpy as np

mat = np.array([[1, 5, 9],
[2, 6, 10],
[3, 7, 11],
[4, 8, 12]])
mask = [True, False, True, True, False]
I want to create the following:
res = [ [1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
Accessing it can be done using:
data = [res[idx] for idx in range(len(res)) if mask[idx]]
However, creating it is a bit problematic.
I tried creating a list of empty lists, but I can't access all relevant entries at once.
Is there an elegant way of doing it?
Here is how I would do it: consume the columns of mat from an iterator, advancing it only where the mask is True.
mi = iter(mat.T.tolist())  # iterator over the columns of mat
[(m or []) and next(mi) for m in mask]
# (m or []) is [] when m is False, which short-circuits the `and`;
# when m is True, the `and` evaluates next(mi) and yields the next column.
# [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
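If the short-circuit trick looks too clever, a sketch of the same logic as a plain loop (same mat and mask as above):
cols = iter(mat.T.tolist())
res = []
for m in mask:
    res.append(next(cols) if m else [])  # consume the next column only where the mask is True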
As you are already using a list comprehension to get the data back from res, I would do a similar thing to create res in the first place.
mask_cs = np.cumsum(mask) - 1 # array([0, 0, 1, 2, 2]) , gives the corresponding index in mat
res = [mat[:, mask_cs[idx]].tolist() if mask[idx] else [] for idx in range(L)]
As an alternative, which accesses all columns of mat at once, one can create an intermediate array of shape (N, L):
import numpy as np

res = np.zeros((N, L), dtype=mat.dtype)  # create result array, keeping mat's integer dtype
res[:, mask] = mat  # copy the data into the positions where mask is True
res = res.T.tolist()  # transform the array into a list of lists
for idx in range(L): # Replace the columns with empty lists, if mask[idx] is False
if not mask[idx]:
res[idx] = []
We could make use of np.split for some elegance, like so -
In [162]: split_cols = np.split(mat.T,np.cumsum(mask)[:-1])
In [163]: split_cols
Out[163]:
[array([[1, 2, 3, 4]]),
array([], shape=(0, 4), dtype=int64),
array([[5, 6, 7, 8]]),
array([[ 9, 10, 11, 12]]),
array([], shape=(0, 4), dtype=int64)]
So, that gives us a list of 2D arrays. For the desired list-of-lists output, we need to flatten each of them down to a plain list -
In [164]: list(map(list,(map(np.ravel,split_cols))))
Out[164]: [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
Alternatively, we can use lambda if that looks more elegant to some -
In [165]: F = lambda a: np.ravel(a).tolist()
In [166]: list(map(F,split_cols))
Out[166]: [[1, 2, 3, 4], [], [5, 6, 7, 8], [9, 10, 11, 12], []]
I have two data frames like below:
data frame 1 (df1):
+---+----------+
|id |features |
+---+----------+
|8 |[5, 4, 5] |
|9 |[4, 5, 2] |
+---+----------+
data frame 2 (df2):
+---+----------+
|id |features |
+---+----------+
|1 |[1, 2, 3] |
|2 |[4, 5, 6] |
+---+----------+
After that I converted the DataFrames to RDDs:
rdd1 = df1.rdd
If I do rdd1.collect() the result is as below:
[Row(id=8, features=[5, 4, 5]), Row(id=9, features=[4, 5, 2])]
rdd2 = df2.rdd
broadcastedrddif = sc.broadcast(rdd2.collectAsMap())
Now broadcastedrddif.value gives:
{1: [1, 2, 3], 2: [4, 5, 6]}
Now I want to compute the sum of products of each row of rdd1 with every entry of broadcastedrddif, i.e. it should return output like below:
((8, [(1, 5*1+4*2+5*3), (2, 5*4+4*5+5*6)]), (9, [(1, 4*1+5*2+2*3), (2, 4*4+5*5+2*6)]))
so my final output should be
((8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)]))
where (1, 28) is a tuple, not a float.
Please help me with this.
I did not understand why you used sc.broadcast(), but I used it anyway...
mapValues is very useful on the last RDD here, and I used a list comprehension to perform the operations against the broadcast dictionary.
x1=sc.parallelize([[8,5,4,5], [9,4,5,2]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x1.collect()
x2=sc.parallelize([[1,1,2,3], [2,4,5,6]]).map(lambda x: (x[0], (x[1],x[2],x[3])))
x2.collect()
# I built the RDDs directly because it is simpler to test
broadcastedrddif = sc.broadcast(x2.collectAsMap())
d2=broadcastedrddif.value
def sum_prod(x, y):
    c = 0
    for i in range(len(x)):
        c += x[i] * y[i]
    return c

x1.mapValues(lambda v: [(k, sum_prod(v, d2[k])) for k in d2.keys()]).collect()
Out[19]: [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]
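For reference, sum_prod is just a dot product, so the same result can be written more compactly with zip (a sketch, using the same x1 and d2 as above):
x1.mapValues(lambda v: [(k, sum(a * b for a, b in zip(v, d2[k]))) for k in d2]).collect()
# [(8, [(1, 28), (2, 70)]), (9, [(1, 20), (2, 53)])]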
I have an RDD with the following rows:
[(id,value)]
How would you sum the values of all rows in the RDD?
Simply use sum; you just need to get the values into a flat RDD of numbers first.
For example
(sc.parallelize([('id', [1, 2, 3]), ('id2', [3, 4, 5])])
   .flatMap(lambda tup: tup[1])  # [1, 2, 3, 3, 4, 5]
   .sum())
Outputs 18
Similarly, just use values() to get the second column as an RDD on its own.
(sc.parallelize([('id', 6), ('id2', 12)])
   .values()  # [6, 12]
   .sum())
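values() is equivalent to mapping out the second element of each pair yourself; a sketch:
sc.parallelize([('id', 6), ('id2', 12)]).map(lambda kv: kv[1]).sum()  # 18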