Group two dimensional list records Python [duplicate] - python-3.x

I have a list of lists, each of the form [string, integer],
e.g.:
my_list=[["apple",5],["banana",6],["orange",6],["banana",9],["orange",3],["apple",111]]
I'd like to sum the same items and finally get this:
my2_list=[["apple",116],["banana",15],["orange",9]]

You can use itertools.groupby on the sorted list:
from itertools import groupby
my_list=[["apple",5],["banana",6],["orange",6],["banana",9],["orange",3],["apple",111]]
my_list2 = []
for i, g in groupby(sorted(my_list), key=lambda x: x[0]):
    my_list2.append([i, sum(v[1] for v in g)])
print(my_list2)
# [['apple', 116], ['banana', 15], ['orange', 9]]
Speaking of SQL Group By and pre-sorting:
The operation of groupby() is similar to the uniq filter in Unix. It
generates a break or new group every time the value of the key
function changes (which is why it is usually necessary to have sorted
the data using the same key function). That behavior differs from
SQL’s GROUP BY which aggregates common elements regardless of their
input order.
Emphasis Mine
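To make the quoted behaviour concrete, here is a minimal sketch that runs groupby on the question's data with and without sorting first. The helper name first is only for illustration and is not part of the original answer.

```python
from itertools import groupby

my_list = [["apple", 5], ["banana", 6], ["orange", 6],
           ["banana", 9], ["orange", 3], ["apple", 111]]


def first(pair):
    # Key function: group on the fruit name.
    return pair[0]


# Without sorting, a new group starts every time the key changes,
# so each fruit shows up in two separate groups.
unsorted_groups = [[k, sum(v for _, v in g)]
                   for k, g in groupby(my_list, key=first)]

# After sorting on the same key, equal keys are adjacent: one group per fruit.
sorted_groups = [[k, sum(v for _, v in g)]
                 for k, g in groupby(sorted(my_list, key=first), key=first)]

print(unsorted_groups)
print(sorted_groups)
```

The unsorted call yields six groups (one per input row), while the sorted call yields the three totals the question asks for.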

from collections import defaultdict
my_list= [["apple",5],["banana",6],["orange",6],["banana",9],["orange",3],["apple",111]]
result = defaultdict(int)
for fruit, value in my_list:
    result[fruit] += value
result = list(result.items())
print(result)
Or you can keep result as a dictionary.

Using Pandas and groupby:
>>> import pandas as pd
>>> pd.DataFrame(my_list, columns=['fruit', 'count']).groupby('fruit').sum()
        count
fruit
apple     116
banana     15
orange      9

from itertools import groupby
[[k, sum(v for _, v in g)] for k, g in groupby(sorted(my_list), key = lambda x: x[0])]
# [['apple', 116], ['banana', 15], ['orange', 9]]

If you don't need the result sorted, you can use the code below. On Python 3.7+ the dict keeps keys in the order they first appear.
my_list=[["apple",5],["banana",6],["orange",6],["banana",9],["orange",3],["apple",111]]
my_dict1 = {}
for d in my_list:
    if d[0] in my_dict1:
        my_dict1[d[0]] += d[1]
    else:
        my_dict1[d[0]] = d[1]
my_list2 = [[k,v] for (k,v) in my_dict1.items()]

Related

numpy selecting elements in sub array using slicing [duplicate]

I have a list like this:
a = [[4.0, 4, 4.0], [3.0, 3, 3.6], [3.5, 6, 4.8]]
I want an outcome like this (EVERY first element in the list):
4.0, 3.0, 3.5
I tried a[::1][0], but it doesn't work
You can take index [0] of each element with a list comprehension:
>>> [i[0] for i in a]
[4.0, 3.0, 3.5]
Use zip:
columns = list(zip(*rows))  # transpose rows to columns (rows is the list a from the question)
print(columns[0])  # print the first column
# you can also do more with the columns
print(columns[1])  # or print the second column
columns.append([7, 7, 7])  # add a new column to the end
backToRows = list(zip(*columns))  # now we are back to rows with a new column
print(backToRows)
You can also use numpy:
import numpy
a = numpy.array(a)
print(a[:, 0])
Edit:
In Python 3, a zip object is not subscriptable. It needs to be converted to a list before it can be indexed:
columns = list(zip(*rows))
You could use this:
import numpy as np
a = ((4.0, 4, 4.0), (3.0, 3, 3.6), (3.5, 6, 4.8))
a = np.array(a)
a[:, 0]
which returns array([4. , 3. , 3.5])
You can get it like
[ x[0] for x in a]
which will return a list of the first element of each list in a
I compared the three methods:
2D list: 5.323603868484497 seconds
Numpy library: 0.3201274871826172 seconds
Zip (thanks to Joran Beasley): 0.12395167350769043 seconds
import time
import numpy as np
D2_list=[list(range(100))]*100
# 1) list comprehension over the 2D list
t1=time.time()
for i in range(10**5):
    for j in range(10):
        b=[k[j] for k in D2_list]
D2_list_time=time.time()-t1
# 2) numpy column slicing
array=np.array(D2_list)
t1=time.time()
for i in range(10**5):
    for j in range(10):
        b=array[:,j]
Numpy_time=time.time()-t1
# 3) transpose once with zip, then index the precomputed columns
D2_trans = list(zip(*D2_list))
t1=time.time()
for i in range(10**5):
    for j in range(10):
        b=D2_trans[j]
Zip_time=time.time()-t1
print ('2D List:',D2_list_time)
print ('Numpy:',Numpy_time)
print ('Zip:',Zip_time)
The Zip method works best.
It was quite useful when I had to do some column wise processes for mapreduce jobs in the cluster servers where numpy was not installed.
If you have access to numpy,
import numpy as np
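a_transposed = np.array(a).T  # a is a plain list, so convert it to an array first
# Get the first row of the transpose, i.e. the first column of a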
print(a_transposed[0])
The benefit of this method is that if you want the "second" element in a 2d list, all you have to do now is a_transposed[1]. The a_transposed object is already computed, so you do not need to recalculate.
Description
Finding the first element of every row in a 2-D list can be rephrased as finding the first column of the 2-D list. Because your data structure is a list of rows, an easy way of sampling the value at the first index in every row is to transpose the matrix and take its first row.
Try using
for i in a:
    print(i[0])
Here i represents an individual row in a, so i[0] is the first element of each row.

Appending value to a list based on dictionary key

I started writing Python scripts for my research this past summer, and have been picking up the language as I go. For my current work, I have a dictionary of lists, sample_range_dict, that is initialized with descriptor_cols as the keys and empty lists for values. Sample code is below:
import numpy as np
import pandas as pd
def rangeFunc(arr):
    return (np.max(arr) - np.min(arr))
df_sample = pd.DataFrame(np.random.rand(2000, 4), columns=list("ABCD")) #random dataframe for testing
col_list = df_sample.columns
sample_range_dict = dict.fromkeys(col_list, []) #creates dictionary where each key pairs with an empty list
rand_df = df_sample.sample(n=20) #make a new dataframe with 20 random rows of df_sample
I want to go through each column from rand_df and calculate the range of values, putting each range in the list with the specified column name (e.g. sample_range_dict["A"] = [range in column A]). The following is the code I initially thought to use for this:
for d in col_list:
    sample_range_dict[d].append(rangeFunc(rand_df[d].tolist()))
However, instead of each key having one item in the list, printing sample_range_dict shows each key having an identical list of 4 values:
{'A': [0.8404352070810013,
0.9766398946246098,
0.9364714925930782,
0.9801082480908744],
'B': [0.8404352070810013,
0.9766398946246098,
0.9364714925930782,
0.9801082480908744],
'C': [0.8404352070810013,
0.9766398946246098,
0.9364714925930782,
0.9801082480908744],
'D': [0.8404352070810013,
0.9766398946246098,
0.9364714925930782,
0.9801082480908744]}
I've determined that the first value is the range for "A", second value is the range for "B", and so on. My question is about why this is happening, and how I could rewrite the code in order to get one item in the list for each key.
P.S. I'm looking to make this an iterative process, hence using lists instead of single numbers.
The issue is this line:
sample_range_dict = dict.fromkeys(col_list, [])
You only created one list. You don't have four lists with the same elements; you have one list, and four references to it. When you add to it via one reference, the element is visible through the other references, because it's the same list:
>>> a = dict.fromkeys(['x', 'y', 'z'], [])
>>> a['x'] is a['y']
True
>>> a['x'].append(5)
>>> a['y']
[5]
If you want each key to have a different list, either create a new list for each key:
>>> a = { k: [] for k in ['x', 'y', 'z'] }
>>> a['x'] is a['y']
False
>>> a['x'].append(5)
>>> a['y']
[]
Or use a defaultdict which will do it for you:
>>> from collections import defaultdict
>>> a = defaultdict(list)
>>> a['x'] is a['y']
False
>>> a['x'].append(5)
>>> a['y']
[]
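Applied to the question's loop, the fix could look roughly like the sketch below. To keep it dependency-free it uses plain lists of random floats as a hypothetical stand-in for the columns of rand_df; the names range_func and columns are illustrative and not from the original post.

```python
import random


def range_func(values):
    # Same idea as the question's rangeFunc: max minus min.
    return max(values) - min(values)


random.seed(0)
# Hypothetical stand-in for rand_df: four columns of 20 random floats each.
columns = {name: [random.random() for _ in range(20)] for name in "ABCD"}

# One fresh list per key, instead of dict.fromkeys(columns, []).
sample_range_dict = {name: [] for name in columns}
for name, values in columns.items():
    sample_range_dict[name].append(range_func(values))

print(sample_range_dict)
```

Each key now holds exactly one range per pass, and running the loop again appends a second value to each list independently.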

Effective ways to group things into list

I am doing a K-means project and I have to do it by hand, which is why I am trying to figure out the best way to group things according to their last values into a list or a dictionary. Here is what I am talking about:
list_of_tuples = [("honey",1),("bee",2),("tree",5),("flower",2),("computer",5),("key",1)]
Now my ultimate goal is to sort the list into 3 different lists, each with its respective elements:
"""This is the goal"""
list_1 = ["honey","key"]
list_2 = ["bee","flower"]
list_3 = ["tree","computer"]
I can use a lot of if statements and a for loop, but is there a more efficient way to do it?
If you're not opposed to using something like pandas, you could do something along these lines:
import pandas as pd
list_1, list_2, list_3 = pd.DataFrame(list_of_tuples).groupby(1)[0].apply(list).values
Result:
In [19]: list_1
Out[19]: ['honey', 'key']
In [20]: list_2
Out[20]: ['bee', 'flower']
In [21]: list_3
Out[21]: ['tree', 'computer']
Explanation:
pd.DataFrame(list_of_tuples).groupby(1) groups your list of tuples by the value at index 1, then you extract the values as lists of index 0 with [0].apply(list).values. This gives you an array of lists as below:
array([list(['honey', 'key']), list(['bee', 'flower']),
list(['tree', 'computer'])], dtype=object)
Something to that effect can be achieved with a dictionary and a for loop, using the second element of each tuple as the key.
list_of_tuples = [("honey",1),("bee",2),("tree",5),("flower",2),("computer",5),("key",1)]
dict_list = {}
for t in list_of_tuples:
    # create key and a single element list if key doesn't exist yet
    # append to existing list otherwise
    if t[1] not in dict_list:
        dict_list[t[1]] = [t[0]]
    else:
        dict_list[t[1]].append(t[0])
list_1, list_2, list_3 = dict_list.values()
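The same loop can be written a little more compactly with collections.defaultdict, which creates the empty list on first access. This is only a sketch of that variant; the names groups, name and label are illustrative.

```python
from collections import defaultdict

list_of_tuples = [("honey", 1), ("bee", 2), ("tree", 5),
                  ("flower", 2), ("computer", 5), ("key", 1)]

groups = defaultdict(list)
for name, label in list_of_tuples:
    # A missing label gets an empty list automatically.
    groups[label].append(name)

# Sort by label so the unpacking order does not depend on insertion order.
list_1, list_2, list_3 = (groups[label] for label in sorted(groups))
print(list_1, list_2, list_3)
```

Unpacking into three names assumes exactly three distinct labels; for a variable number of clusters it is probably simpler to keep working with the groups dictionary.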

Split/partition list based on invariant/hash?

I have a list [a1, a2, ...] and would like to split it based on the value of a function f(a).
For example if the input is the list [0,1,2,3,4] and the function def f(x): return x % 3,
I would like to return a list [0,3], [1,4], [2], since the first group all takes values 0 under f, the 2nd group take value 1, etc...
Something like this works:
return [[x for x in lst if f(x) == val] for val in set(map(f,lst))],
But it does not seem optimal (nor pythonic) since the inner loop unnecessarily scans the entire list and computes same f values of the elements several times.
I'm looking for a solution that would compute the value of f ideally once for every element...
If you're not irrationally ;-) set on a one-liner, it's straightforward:
from collections import defaultdict
lst = [0,1,2,3,4]
f = lambda x: x % 3
d = defaultdict(list)
for x in lst:
    d[f(x)].append(x)
print(list(d.values()))
displays what you want. f() is executed len(lst) times, which can't be beat
EDIT: or, if you must:
from itertools import groupby
print([[pair[1] for pair in grp]
       for ignore, grp in
       groupby(sorted((f(x), x) for x in lst),
               key=lambda pair: pair[0])])
That doesn't require that f() produce values usable as dict keys, but incurs the extra expense of a sort, and is close to incomprehensible. Clarity is much more Pythonic than striving for one-liners.
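As an illustration of the hashability point, the sketch below uses a hypothetical key function that returns a list, which can be compared and sorted but cannot serve as a dict key. The defaultdict version raises TypeError, while the sort-then-groupby version still works.

```python
from collections import defaultdict
from itertools import groupby

lst = [0, 1, 2, 3, 4]


def f(x):
    # Hypothetical key function returning a list: sortable, but not hashable.
    return [x % 3]


# The dict-based approach fails because a list cannot be a dict key.
d = defaultdict(list)
try:
    for x in lst:
        d[f(x)].append(x)
    dict_failed = False
except TypeError:
    dict_failed = True

# Sorting plus groupby only needs the keys to support comparison.
pairs = sorted((f(x), x) for x in lst)
groups = [[x for _, x in grp]
          for _, grp in groupby(pairs, key=lambda pair: pair[0])]

print(dict_failed)
print(groups)
```

For ordinary hashable keys such as ints or strings, the defaultdict loop remains the simpler choice.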
@Tim Peters is right; here are the setdefault approach he mentioned and another itertools.groupby option.
Given
import itertools as it
iterable = range(5)
keyfunc = lambda x: x % 3
Code
setdefault
d = {}
for x in iterable:
    d.setdefault(keyfunc(x), []).append(x)
list(d.values())
groupby
[list(g) for _, g in it.groupby(sorted(iterable, key=keyfunc), key=keyfunc)]
See also more on itertools.groupby

testing if the values of a dictionary are non zero with all() function

I use Python 3
I want to check if all of my tested values in the nested dictionary are non 0.
So here is the simplified example dict:
d = {'a': {'1990': 10, '1991': 0, '1992': 30},
'b': {'1990': 15, '1991': 40, '1992': 0}}
and I want to test if for both dicts 'a' and 'b' the values of the keys '1990' and '1991' are not zero
for i in d:
    for k in range(2):
        year = 1990
        year = year + k
        if all((d[i][str(year)]) != 0):
            print(d[i])
So it should only return b, because a['1991'] = 0.
But this is the first time I have worked with the all() function, and I get the error: TypeError: 'bool' object is not iterable
The error is in the if all() line.
thank you very much!
This can be done a bit more generally with a list comprehension where you iterate over the items in dict d. A simple comprehension to iterate over the keys and values in our dictionary looks like this:
>>> [k for k, v in d.items()]
['a', 'b']
In the above k will contain the keys and v the values. The comprehension also has an if clause. With that you can filter out the items you don't want. So we define years = ('1990', '1991'). Now we can do another comprehension to test our year values.
To iterate over only 'a', we could do this:
>>> [d['a'][y] for y in years]
[10, 0]
>>> all([d['a'][y] for y in years])
False
Gluing the whole thing together:
>>> d={'a' :{ '1990': 10, '1991':0, '1992':30},'b':{ '1990':15, '1991':40, '1992':0}}
>>> years = ('1990', '1991')
>>> [k for k, v in d.items() if all([v[y] for y in years])]
['b']
See the python docs for more information on list comprehensions.
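For completeness, here is a short sketch of why the original loop raised the TypeError and how a generator expression avoids it. all() expects a single iterable, whereas the question passed it the bool produced by a != comparison. The variable names raised and non_zero are illustrative only.

```python
d = {'a': {'1990': 10, '1991': 0, '1992': 30},
     'b': {'1990': 15, '1991': 40, '1992': 0}}
years = ('1990', '1991')

# all() takes one iterable argument. Here it receives a single bool,
# which is what triggered the TypeError in the question.
try:
    all(d['a']['1990'] != 0)
    raised = False
except TypeError:
    raised = True

# Passing a generator expression gives all() something to iterate over,
# and the explicit != 0 keeps the intent of the original test visible.
non_zero = [name for name, values in d.items()
            if all(values[y] != 0 for y in years)]

print(raised)
print(non_zero)
```

The generator expression also lets all() stop at the first zero without building an intermediate list.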
