How to Sum values of Column Within RDD - apache-spark

I have an RDD with the following rows:
[(id,value)]
How would you sum the values of all rows in the RDD?

Simply use sum(); you just need to get the values into a flat RDD of numbers first.
For example:
(sc.parallelize([('id', [1, 2, 3]), ('id2', [3, 4, 5])])
   .flatMap(lambda tup: tup[1])  # [1, 2, 3, 3, 4, 5]
   .sum())
Outputs 18
Similarly, just use values() to get that second column as an RDD on its own.
(sc.parallelize([('id', 6), ('id2', 12)])
   .values()  # [6, 12]
   .sum())
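For the exact [(id, value)] shape in the question, a minimal self-contained sketch looks like this (the SparkContext setup and the sample values are illustrative, not taken from the original post):
from pyspark import SparkContext
sc = SparkContext("local", "sum-values")  # illustrative local context
rdd = sc.parallelize([('id1', 6), ('id2', 12), ('id3', 4)])  # the [(id, value)] shape from the question
total = rdd.map(lambda tup: tup[1]).sum()  # keep only the values, then reduce them to one number
print(total)  # 22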

Related

PySpark RDD: Manipulating Inner Array

I have a dataset (for example)
sc = SparkContext()
x = [(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])]
y = sc.parallelize(x)
print(y.take(1))
The print statement returns [(1, [2, 3, 4, 5])]
I now need to multiply everything in the sub-array by 2 across the RDD. Since I have already parallelized, I can't further break down "y.take(1)" to multiply [2, 3, 4, 5] by 2.
How can I essentially isolate the inner array across my worker nodes to then do the multiplication?
I think you can use map with a lambda function:
y = sc.parallelize(x).map(lambda x: (x[0], [2*t for t in x[1]]))
Then y.take(2) returns:
[(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]
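An equivalent variant (a sketch, not from the original answer) uses mapValues, which applies the function only to the value part of each pair and keeps the key untouched:
y2 = sc.parallelize(x).mapValues(lambda arr: [2 * t for t in arr])  # key stays as-is, inner array is doubled
print(y2.take(2))  # [(1, [4, 6, 8, 10]), (2, [4, 14, 16, 20])]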
It will be more efficient to use the DataFrame API instead of RDDs: that way all of your processing happens without the serialization to Python that occurs when you use the RDD API.
For example, you can use the transform function to apply a transformation to the array values:
import pyspark.sql.functions as F
df = spark.createDataFrame([(1, [2, 3, 4, 5]), (2, [2, 7, 8, 10])],
                           schema="id int, arr array<int>")
df2 = df.select("id", F.transform("arr", lambda x: x*2).alias("arr"))
df2.show()
will give you the desired result:
+---+---------------+
| id|            arr|
+---+---------------+
|  1|  [4, 6, 8, 10]|
|  2|[4, 14, 16, 20]|
+---+---------------+
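If you start from the pair RDD y defined in the question, one way to move the data into the DataFrame world before applying transform (a sketch; the column names "id" and "arr" are assumptions) is:
df = spark.createDataFrame(y, ["id", "arr"])  # build a DataFrame from the existing pair RDD
df.select("id", F.transform("arr", lambda x: x * 2).alias("arr")).show()  # same doubling as above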

Pyspark: Convert list to list of lists

In a pyspark dataframe, I have a column which has list values, for example: [1,2,3,4,5,6,7,8]
I would like to convert the above into [[1,2,3,4], [5,6,7,8]], i.e. split it into chunks of 4, for every column value.
Please let me know how I can achieve this.
Thanks for your help in advance.
You can use the transform function as shown below:
df = spark.createDataFrame([([1, 2, 3, 4, 5, 6, 7, 8],)], ['values'])
df.selectExpr("transform(sequence(1, size(values), 4), v-> slice(values, v, 4)) as values")\
.show(truncate=False)
+----------------------------+
|values                      |
+----------------------------+
|[[1, 2, 3, 4], [5, 6, 7, 8]]|
+----------------------------+
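The same logic can also be written with the Python column functions instead of selectExpr (a sketch assuming Spark 3.1+, where transform, sequence and slice accept column arguments and Python lambdas):
import pyspark.sql.functions as F
df.select(
    F.transform(
        F.sequence(F.lit(1), F.size("values"), F.lit(4)),  # start positions 1, 5, ...
        lambda i: F.slice("values", i, 4)                   # take 4 elements from each start
    ).alias("values")
).show(truncate=False)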

How to let pandas groupby add a count column for each group after applying list aggregations?

I have a pandas DataFrame:
df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I want to additionally add a column that for each resulting group adds the number of elements in that group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]]
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list).agg("count")
But all gave either incorrect results (option 3) or an error (options 1 and 2).
How to solve this?
We can try named aggregation:
d = {c:(c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
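Written out without the dict unpacking, the same named aggregation reads as follows (an equivalent sketch, not part of the original answer):
out = df.groupby('col_1').agg(
    col_2=('col_2', list),    # aggregate each group's values into a list
    col_3=('col_3', list),
    count=('col_2', 'size'),  # number of rows in each group
)
print(out)  # same result as the table shown below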
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists:
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
            col_2      col_3  count
col_1
apple      [1, 8]   [56, 22]      2
banana  [4, 8, 6]  [4, 1, 5]      3
Just adding another performant approach to solve the problem:
x = df.groupby('col_1')
x.agg({'col_2': lambda x: list(x), 'col_3': lambda x: list(x)}).reset_index().join(
    x['col_2'].transform('count').rename('count'))
Output:
    col_1      col_2      col_3  count
0   apple     [1, 8]   [56, 22]      2
1  banana  [4, 8, 6]  [4, 1, 5]      3
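If you actually need the dict-of-lists layout shown in the question, reset the index and convert the joined result (a small sketch building on the answers above; the variable name result is an assumption):
g = df.groupby('col_1')
result = g[['col_2', 'col_3']].agg(list).join(g.size().rename('count')).reset_index()
print(result.to_dict(orient='list'))
# {'col_1': ['apple', 'banana'], 'col_2': [[1, 8], [4, 8, 6]],
#  'col_3': [[56, 22], [4, 1, 5]], 'count': [2, 3]}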

Sorting a list of tensors by their length in Pytorch

I have a list of tensors in the form list = [tensor([1,2]), tensor([3, 4, 5])] and would like to order it in descending order based on the length of the tensors. This means the sorted list should look like list = [tensor([3, 4, 5]), tensor([1, 2])].
Using .sort(key=length) does not work, and I have also tried using .sort(key=lambda x: len(x)) without success.
You should avoid using Python built-ins (like list) as variable names.
You can sort it like the following:
import torch
list_tensors = [torch.tensor([1,2]), torch.tensor([3, 4, 5])]
print(sorted(list_tensors, key=lambda x: x.size()[0]))
which will output:
[tensor([1, 2]), tensor([3, 4, 5])]
Or in descending order:
list_tensors = [torch.tensor([1,2]), torch.tensor([3, 4, 5])]
print(sorted(list_tensors, key=lambda x: x.size()[0], reverse=True))
output:
[tensor([3, 4, 5]), tensor([1, 2])]
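As a side note (not from the original answer), len() on a 1-D tensor returns its number of elements, so the same sort can use key=len:
print(sorted(list_tensors, key=len, reverse=True))  # [tensor([3, 4, 5]), tensor([1, 2])]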

Selecting pandas dataframe column by row-specific list

For each row in a dataframe, I'm trying to select the column, which is specified in a list. The list has the same length as the dataframe has rows.
df = pd.DataFrame({"a":[1,2,3,4,5],
"b":[3,4,5,6,7],
"c":[9,10,11,12,13]})
lst = ["a","a","c","b","a"]
The result would look like this:
[1,2,11,6,5]
Just lookup would be fine:
df.lookup(df.index,lst)
#array([ 1, 2, 11, 6, 5], dtype=int64)
lookup should be the way to go, but here is something different to try:
df.stack().reindex(pd.MultiIndex.from_arrays([df.index,lst])).values
array([ 1, 2, 11, 6, 5])
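Note that DataFrame.lookup was deprecated in pandas 1.2 and removed in pandas 2.0, so on recent versions a NumPy-based equivalent (a sketch, not from the original answers) is:
import numpy as np
rows = np.arange(len(df))           # positional row indices 0..4
cols = df.columns.get_indexer(lst)  # column position chosen for each row
print(df.to_numpy()[rows, cols])    # [ 1  2 11  6  5]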
