Pyspark: sum column values

I have this RDD (showing two elements):
[['a', [1, 2]], ['b', [3, 0]]]
and I'd like to add up the list elements by index, so that the final result is
[4, 2]
How would I achieve this? I know the first element ('a'/'b') is irrelevant, since I could strip it out with a map, so the question becomes how to sum column values.

$ pyspark
>>> x = [['a', [1, 2]], ['b', [3, 0]]]
>>> rdd = sc.parallelize(x)
>>> rdd.map(lambda x: x[1]).reduce(lambda x,y: [sum(i) for i in zip(x, y)])

You can strip the keys as you said, and then reduce your RDD as follows (assuming exactly two columns, and that myRDD already holds only the value lists):
myRDD.reduce(lambda x, y: [x[0] + y[0], x[1] + y[1]])
This gives you the sum of each column.
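To see what the reducer does without starting Spark, here is a minimal local sketch. It swaps in functools.reduce from the standard library for RDD.reduce (an assumption made purely for illustration; on a real RDD the same function would be passed to rdd.reduce), and the name add_columns is hypothetical.

```python
from functools import reduce

rows = [['a', [1, 2]], ['b', [3, 0]]]

# Strip the keys, keeping only the value lists (the rdd.map step).
values = [row[1] for row in rows]

def add_columns(left, right):
    # zip pairs up elements at the same index, so any number of columns works.
    return [a + b for a, b in zip(left, right)]

column_sums = reduce(add_columns, values)
print(column_sums)  # [4, 2]
```

Because the zip-based reducer does not hard-code indices, it also covers rows with more than two columns. Note that zip stops at the shortest input, so rows of unequal length would be silently truncated.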

Related

What is the role of [:] in overwriting a list in a for loop?

I came across a weird syntactical approach at work today that I couldn't wrap my head around. Let's say I have the following list:
my_list = [[1, 2, 3], [4, 5, 6]]
My objective is to filter each nested list according to some criteria and overwrite the elements of the list in place. So, let's say I want to remove odd numbers from each nested list such that my_list contains lists of even numbers, where the end result would look like this:
[[2], [4, 6]]
If I try to do this using a simple assignment operator, it doesn't work.
my_list = [[1, 2, 3], [4, 5, 6]]
for l in my_list:
    l = [num for num in l if num % 2 == 0]
print(my_list)
Output: [[1, 2, 3], [4, 5, 6]]
However, if I "slice" the list, it provides the expected output.
my_list = [[1, 2, 3], [4, 5, 6]]
for l in my_list:
    l[:] = [num for num in l if num % 2 == 0]
print(my_list)
Output: [[2], [4, 6]]
My original hypothesis was that l was a newly created object that didn't actually point to the corresponding object in the list, but comparing the outputs of id(my_list[i]), id(l), and id(l[:]) (where i is the index of l in my_list), I realized that l[:] was the one with the differing id. So, if Python is creating a new object when I assign to l[:], then how does Python know to overwrite the existing object of l? Why does this work? And why doesn't the simple assignment operator l = ... work?

It's subtle.
Snippet one:
my_list = [[1, 2, 3], [4, 5, 6]]
for l in my_list:
    l = [num for num in l if num % 2 == 0]
Why doesn't this work? Because when you do l = ..., you're only rebinding the variable l to a new list; the list it previously referenced (the one stored in my_list) is left unchanged.
If we write the loop out "manually", it hopefully will become more clear why this strategy fails:
my_list = [[1, 2, 3], [4, 5, 6]]
# iteration 1
l = my_list[0]
l = [num for num in l if num % 2 == 0]
# iteration 2
l = my_list[1]
l = [num for num in l if num % 2 == 0]
Snippet two:
my_list = [[1, 2, 3], [4, 5, 6]]
for l in my_list:
    l[:] = [num for num in l if num % 2 == 0]
Why does this work? Because by using l[:] = , you're actually modifying the value that l references, not just the variable l. Let me elaborate.
Generally speaking, using [:] notation (slice notation) on lists allows one to work with a section of the list.
The simplest use is for getting values out of a list; we can write a[n:k] to get item n, item n+1, and so on, up to item k-1. For instance:
>>> a = ["a", "very", "fancy", "list"]
>>> print(a[1:3])
['very', 'fancy']
Python also allows use of slice notation on the left-side of a =. In this case, it interprets the notation to mean that we want to update only part of a list. For instance, we can replace "very", "fancy" with "not", "so", "fancy" like so:
>>> print(a)
['a', 'very', 'fancy', 'list']
>>> a[1:3] = ["not", "so", "fancy"]
>>> print(a)
['a', 'not', 'so', 'fancy', 'list']
When using slice syntax, Python also provides some convenient shorthand. Instead of writing [n:k], we can omit n or k or both.
If we omit n, then our slice looks like a[:k], and Python understands it to mean "up to k", i.e., the same as a[0:k].
If we omit k, then our slice looks like a[n:], and Python understands it to mean "n and after", i.e., the same as a[n:len(a)].
If we omit both, then both rules take place, so a[:] is the same as a[0:len(a)], which is a slice over the entire list.
Examples:
>>> print(a)
['a', 'not', 'so', 'fancy', 'list']
>>> print(a[2:4])
['so', 'fancy']
>>> print(a[:4])
['a', 'not', 'so', 'fancy']
>>> print(a[2:])
['so', 'fancy', 'list']
>>> print(a[:])
['a', 'not', 'so', 'fancy', 'list']
Crucially, this all still applies if we are using our slice on the left-hand side of a =:
>>> print(a)
['a', 'not', 'so', 'fancy', 'list']
>>> a[:4] = ["the", "fanciest"]
>>> print(a)
['the', 'fanciest', 'list']
And using [:] means to replace every item in the list:
>>> print(a)
['the', 'fanciest', 'list']
>>> a[:] = ["something", "completely", "different"]
>>> print(a)
['something', 'completely', 'different']
Okay, so far so good.
The key thing to note is that using slice notation on the left-hand side of an assignment updates the list in-place. In other words, when I do a[1:3] = ..., the variable a is never rebound; the list that it references is modified.
We can see this with id(), as you were doing:
>>> print(a)
['something', 'completely', 'different']
>>> print(id(a))
139848671387072
>>> a[1:] = ["truly", "amazing"]
>>> print(a)
['something', 'truly', 'amazing']
>>> print(id(a))
139848671387072
Perhaps more pertinently, this means that if a were a reference to a list within some other object, then using a[:] = will update the list within that object. Like so:
>>> list_of_lists = [ [1, 2], [3, 4], [5, 6] ]
>>> second_list = list_of_lists[1]
>>> print(second_list)
[3, 4]
>>> second_list[1:] = [2, 1, 'boom!']
>>> print(second_list)
[3, 2, 1, 'boom!']
>>> print(list_of_lists)
[[1, 2], [3, 2, 1, 'boom!'], [5, 6]]
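As for the id() observation in the question, a short sketch may help. It applies the points above to the question's own data; the variable names are illustrative. Reading l[:] in an expression builds a new list (a shallow copy), which is why its id differs, whereas assigning to l[:] modifies the existing list.

```python
my_list = [[1, 2, 3], [4, 5, 6]]
inner = my_list[0]

# Reading a slice creates a new list object, hence a different id.
copy_of_inner = inner[:]
copy_is_new_object = copy_of_inner is not inner

# Assigning to a slice mutates the existing list in place.
inner[:] = [num for num in inner if num % 2 == 0]
same_object_after_slice_assign = inner is my_list[0]

# Plain assignment only rebinds the name; my_list is not affected.
inner = [99]

print(copy_is_new_object)              # True
print(same_object_after_slice_assign)  # True
print(my_list)                         # [[2], [4, 5, 6]]
```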

Filter pandas dataframe in python3 depending on the value of a list

So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the ones that contain a 3, to get something like:
df2 = {'c': ['A','C'],
'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!

df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
   c          x
0  A  [1, 2, 3]
2  C     [1, 3]

The code below should help.
First, create an actual DataFrame (the df in the question is a plain dict):
import pandas as pd
df = pd.DataFrame({'c': ['A','B','C','D'],
                   'x': [[1,2,3],[2],[1,3],[1,2,5]]})
Then filter the rows whose list contains 3:
df[df.x.apply(lambda x: 3 in x)]
Output:
   c          x
0  A  [1, 2, 3]
2  C     [1, 3]
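As for why the original attempt raised KeyError: False, a small sketch in plain Python (no pandas needed; the variable names are illustrative) shows that the membership test runs once against the outer list rather than once per row:

```python
x_values = [[1, 2, 3], [2], [1, 3], [1, 2, 5]]

# One test against the outer list: 3 is compared with each inner list, so this is False.
whole_column_test = 3 in x_values

# One test per row: this is the boolean mask that df.x.apply(lambda x: 3 in x) builds.
row_mask = [3 in row for row in x_values]

print(whole_column_test)  # False
print(row_mask)           # [True, False, True, False]
```

Indexing a DataFrame with the single value False is treated as a column lookup, which is why the error names the key False.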

Sort python dictionary with value as list

I want to sort the keys of a dictionary by comparing the value lists at every index, not just at the first element. If two lists are identical, the tie should be broken by key. The length of the lists is not known in advance. How can I sort the keys in that case? Here is an example:
{'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}
output:
[A, C, B]
Explanation: A comes first because its value at index 0 (5) is the highest. C comes second because B and C tie at index 0, and at index 1 C has 3 while B has 2. As you can see, we need to compare all positions, and we don't know the list size beforehand.
I tried the code below:
countPos = {'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}
res = sorted(countPos.items(), key=lambda x: ((-x[1][i]) for i in range(3)))
I get an error with this code. Could someone help me with this?

I think I found a solution that works. It might be naive, so I welcome corrections.
r = sorted(countPos.items(), key=lambda x: x[0])
r = dict(r)
res = sorted(r.items(), key=lambda x: x[1], reverse=True)
So, I first sorted by key, and then sorted by value in reverse order. Because Python's sort is stable (even with reverse=True), entries with identical lists keep their key order from the first pass.
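The same ordering can probably be had in a single pass. The sketch below is one possible approach, not the poster's method: it builds a key of (negated values, dictionary key), so lists compare element by element in descending order and ties fall back to the key in ascending order. The helper name sort_key is hypothetical. The key must be a list or tuple; the attempt in the question returned a generator expression, and generators cannot be compared with each other.

```python
count_pos = {'A': [5, 0, 0], 'B': [0, 2, 3], 'C': [0, 3, 2]}

def sort_key(item):
    key, values = item
    # Negating every value turns the ascending sort into a descending one,
    # for lists of any length; the dict key breaks ties in ascending order.
    return ([-v for v in values], key)

ordered_keys = [key for key, _ in sorted(count_pos.items(), key=sort_key)]
print(ordered_keys)  # ['A', 'C', 'B']
```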

Multiply values of two columns per row

I'd like to multiply the values of two columns per row...
from this:
to this:

I think this can be done easily with numpy or pandas. Here is a sample solution using pandas:
import pandas as pd
dataframe = pd.DataFrame({"A": ['a','b','c'], "B": [1,2,3], "C": [2,2,2]})
# element-wise product of columns B and C, stored in a new column D
dataframe['D'] = dataframe['B'] * dataframe['C']
print(dataframe)

The answer using pandas is perfectly fine, but when learning Python it is perhaps better to start with the built-in types first. Here is the answer using lists:
my_list = []
my_list.append([1, 2])
my_list.append([2, 2])
my_list.append([3, 2])
print(my_list)
product_list = []
for element in my_list:
    # multiply the two values of the row
    my_product = element[0] * element[1]
    product_list.append(element + [my_product])
print(product_list)
Result:
[[1, 2], [2, 2], [3, 2]]
[[1, 2, 2], [2, 2, 4], [3, 2, 6]]
Adding the first column is left as an exercise!
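A more compact variant uses a list comprehension and carries the label column along. The rows data below is made up to mirror the columns of the pandas answer, and the sketch computes the per-row product asked for in the question title.

```python
# Each row: [label, first value, second value] (illustrative data).
rows = [['a', 1, 2], ['b', 2, 2], ['c', 3, 2]]

# Unpack each row and append the product of its two numeric values.
with_product = [[label, b, c, b * c] for label, b, c in rows]
print(with_product)  # [['a', 1, 2, 2], ['b', 2, 2, 4], ['c', 3, 2, 6]]
```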

Different behavior in list comprehension

In my mind these two pieces of code do the same thing:
l = [[1,2], [3,4],[3,2], [5,4], [4,4],[5,7]]
1)
In [4]: [list(g) for k,g in groupby(sorted(l,key=lambda x:x[1]),
key = lambda x:x[1]) if len(list(g)) == 2]
Out[4]: [[]]
2)
In [5]: groups = [list(g) for k,g in groupby(sorted(l,
key=lambda x:x[1]), key = lambda x:x[1])]
In [6]: [g for g in groups if len(g) == 2]
Out[6]: [[[1, 2], [3, 2]]]
But as you can see, the first one gives a list containing a single empty list, while the second one gives what I need. Where am I mistaken?

The group is an iterator, so you cannot consume it (e.g. by calling list on it) twice. For example:
>>> from operator import itemgetter
>>> from itertools import groupby
>>> l = [[1,2], [3,4],[3,2], [5,4], [4,4],[5,7]]
>>> for _, group in groupby(sorted(l, key=itemgetter(1)), key=itemgetter(1)):
...     print('first', list(group))
...     print('second', list(group))
...
first [[1, 2], [3, 2]]
second []
first [[3, 4], [5, 4], [4, 4]]
second []
first [[5, 7]]
second []
Instead, you need to call list once per group and filter on the results of that, e.g. by using map:
>>> [lst for lst in map(list, (group for _, group in groupby(sorted(l, key=itemgetter(1)), key=itemgetter(1)))) if len(lst) == 2]
[[[1, 2], [3, 2]]]
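If the one-liner is hard to read, the same idea can be split into two steps. This is only a sketch of an equivalent formulation; the names second, groups, and pairs are illustrative. Each group is turned into a list exactly once, and the filter then works on those lists.

```python
from itertools import groupby
from operator import itemgetter

l = [[1, 2], [3, 4], [3, 2], [5, 4], [4, 4], [5, 7]]
second = itemgetter(1)

# Materialize each group exactly once, as soon as groupby yields it.
groups = (list(group) for _, group in groupby(sorted(l, key=second), key=second))

# Filter on the already-built lists, so nothing is consumed twice.
pairs = [grp for grp in groups if len(grp) == 2]
print(pairs)  # [[[1, 2], [3, 2]]]
```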
