pandas: aggregate a column of list into one list - python-3.x

I have the following data frame my_df:
name numbers
----------------------
A [4,6]
B [3,7,1,3]
C [2,5]
D [1,2,3]
I want to combine all numbers to a new list, so the output should be:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
And here is my code:
def combine_list(my_lists):
    new_list = []
    for x in my_lists:
        new_list.append(x)
    return new_list
new_df = my_df.agg({'numbers': combine_list})
but the new_df still looks the same as original:
numbers
----------------------
0 [4,6]
1 [3,7,1,3]
2 [2,5]
3 [1,2,3]
What did I do wrong? How do I make new_df like:
new_numbers
---------------
[4,6,3,7,1,3,2,5,1,2,3]
Thanks!

You need to flatten the values and then create a new DataFrame with the constructor:
flatten = [item for sublist in df['numbers'] for item in sublist]
Or:
import numpy as np
flatten = np.concatenate(df['numbers'].values).tolist()
Or:
from itertools import chain
flatten = list(chain.from_iterable(df['numbers'].values.tolist()))
df1 = pd.DataFrame({'numbers':[flatten]})
print (df1)
numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
Timings are here.
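As a side note on the function in the question: list.append adds each sub-list as a single element, so even if combine_list received the whole column it would rebuild a list of lists, whereas list.extend copies the items. A minimal plain-Python sketch of the difference (combine_list_flat is a made-up name for illustration):

```python
rows = [[4, 6], [3, 7, 1, 3], [2, 5], [1, 2, 3]]  # same values as the 'numbers' column

def combine_list(my_lists):
    # version from the question: append keeps every sub-list nested
    new_list = []
    for x in my_lists:
        new_list.append(x)
    return new_list

def combine_list_flat(my_lists):
    # hypothetical variant: extend copies the items of each sub-list
    new_list = []
    for x in my_lists:
        new_list.extend(x)
    return new_list

nested = combine_list(rows)
flat = combine_list_flat(rows)
print(nested)
print(flat)
```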

You can use df['numbers'].sum(), which returns a combined list, to create the new dataframe:
new_df = pd.DataFrame({'new_numbers': [df['numbers'].sum()]})
new_numbers
0 [4, 6, 3, 7, 1, 3, 2, 5, 1, 2, 3]
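Summing lists builds a new list at every step, which may get slow for long columns. Another option, sketched here under the assumption of pandas 0.25 or later (where Series.explode is available), is to explode the column into one row per item and collect the values. The frame below only re-creates the question's sample data:

```python
import pandas as pd

my_df = pd.DataFrame({
    'name': ['A', 'B', 'C', 'D'],
    'numbers': [[4, 6], [3, 7, 1, 3], [2, 5], [1, 2, 3]],
})

# explode() yields one row per list item; tolist() collects them in order
flat = my_df['numbers'].explode().tolist()

# wrap the flat list in another list so it lands in a single cell
new_df = pd.DataFrame({'new_numbers': [flat]})
print(new_df)
```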

This should do:
newdf = pd.DataFrame({'numbers':[[x for i in mydf['numbers'] for x in i]]})

See the related question "pandas groupby and join lists".
What you are looking for is:
my_df = my_df.groupby(['name']).agg(sum)

Related

Python: from list group by elements after a specific trigger element [duplicate]

I have a list like
a=['a',2,'[abcd]','bb',4,5,'kk','[efgh]',6,7,'no','[ijkl]',4,5,'lo']
I want to start a new group at each element enclosed in '[]',
so the expected output would be
[['a',2],{'abcd': ['bb',4,5,'kk']},{'efgh': [6,7,'no']},{'ijkl': [4,5,'lo']}]
Any help would be appreciated.
You can use groupby:
from itertools import groupby
a=['a',2,'[abcd]','bb',4,5,'kk','[efgh]',6,7,'no','[ijkl]',4,5,'lo']
def group_by_tag(li):
    def tag_counter(x):
        if isinstance(x, str) and x.startswith('[') and x.endswith(']'):
            tag_counter.cnt += 1
        return tag_counter.cnt
    tag_counter.cnt = 0
    return groupby(li, key=tag_counter)
Which you can use to make a list of tuples for each segment partitioned by the [tag]:
>>> x=[(k,list(l)) for k, l in group_by_tag(a)]
>>> x
[(0, ['a', 2]), (1, ['[abcd]', 'bb', 4, 5, 'kk']), (2, ['[efgh]', 6, 7, 'no']), (3, ['[ijkl]', 4, 5, 'lo'])]
And then create your desired mixed-type list from that:
>>> [v if k==0 else {v[0].strip('[]'):v[1:]} for k,v in x]
[['a', 2], {'abcd': ['bb', 4, 5, 'kk']}, {'efgh': [6, 7, 'no']}, {'ijkl': [4, 5, 'lo']}]
But consider that it is usually better to have a list of the same type of object to make processing that list easier.
If you want that, you could do:
>>> [{'no_tag':v} if k==0 else {v[0].strip('[]'):v[1:]} for k,v in x]
[{'no_tag': ['a', 2]}, {'abcd': ['bb', 4, 5, 'kk']}, {'efgh': [6, 7, 'no']}, {'ijkl': [4, 5, 'lo']}]
By "=>" I assume you want a dictionary whenever there is a preceding keyword, with the key being the word enclosed within the square brackets (if that's not what you intended, feel free to comment and I'll edit this post).
import re
def sort(iterable):
    result = {}
    governing_match = None
    for each in list(iterable):
        match = re.findall(r"\[.{1,}\]", str(each))
        if len(match) > 0:
            governing_match = match[0][1:-1]
            result[governing_match] = []
            continue
        result[governing_match].append(each)
    return result
a=['[foo]', 'a' , 2 ,'[abcd]','bb',4,5,'kk','[efgh]',6,7,'no','[ijkl]',4,5,'lo']
for (k, v) in sort(a).items():
    print(f"{k} : {v}")
Result :
foo : ['a', 2]
abcd : ['bb', 4, 5, 'kk']
efgh : [6, 7, 'no']
ijkl : [4, 5, 'lo']
The limitation of this is that every sequence should start with an element that is enclosed within square brackets.
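If the input may start with untagged items, one possible way around that limitation is to collect them under a placeholder key. This is only a sketch: the function name group_by_tag, the default_key parameter and its 'no_tag' value are invented for the example, and it uses re.fullmatch rather than re.findall so that only whole-element tags count.

```python
import re

def group_by_tag(iterable, default_key='no_tag'):
    # default_key is a made-up name for items seen before the first [tag]
    result = {}
    current = default_key
    for each in iterable:
        match = re.fullmatch(r"\[(.+)\]", str(each))
        if match:
            current = match.group(1)   # text between the brackets
            result[current] = []
            continue
        # setdefault creates the bucket on first use, so no KeyError
        result.setdefault(current, []).append(each)
    return result

a = ['a', 2, '[abcd]', 'bb', 4, 5, 'kk', '[efgh]', 6, 7, 'no', '[ijkl]', 4, 5, 'lo']
grouped = group_by_tag(a)
print(grouped)
```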
You can use pairwise for that:
from itertools import pairwise
a=['a',2,'[abcd]','bb',4,5,'kk','[efgh]',6,7,'no','[ijkl]',4,5,'lo']
bracket_indices = [i for i, x in enumerate(a) if isinstance(x, str)
                   and x.startswith('[') and x.endswith(']')]
bracket_indices.append(len(a))
output = [a[:bracket_indices[0]]] # First item is special cased
output.extend({a[i][1:-1]: a[i+1:j]} for i, j in pairwise(bracket_indices))
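Note that itertools.pairwise was added in Python 3.10. On older versions, a sketch of the same idea can pair each index with the next one using zip (the helper is_tag is a name chosen for this example only):

```python
a = ['a', 2, '[abcd]', 'bb', 4, 5, 'kk', '[efgh]', 6, 7, 'no', '[ijkl]', 4, 5, 'lo']

def is_tag(x):
    # True for strings of the form '[...]'
    return isinstance(x, str) and x.startswith('[') and x.endswith(']')

bracket_indices = [i for i, x in enumerate(a) if is_tag(x)]
bracket_indices.append(len(a))  # sentinel so the last segment has an end

output = [a[:bracket_indices[0]]]  # items before the first tag
# zip(seq, seq[1:]) yields the same consecutive pairs as pairwise(seq)
output.extend({a[i][1:-1]: a[i + 1:j]}
              for i, j in zip(bracket_indices, bracket_indices[1:]))
print(output)
```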

In Pandas dataframe remove duplicate strings in a list column and remove the corresponding ids in the other list column

I have Pandas dataframe like this:
df =
A B
[1, 5, 8, 10] [str1, str_to_be_removed*, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7, 9, 15] [str6, str2, str_to_be_removed*, str_to_be_removed*, str_to_be_removed*]
... ...
Given str_to_be_removed, I would like to keep only the first instance of a string containing str_to_be_removed in column B and remove the other ones. In addition, I would also like to remove the corresponding ids from A. The lists in A and B have the same length in each row.
How to do this?
Desired Output:
df =
A B
[1, 5, 10] [str1, str_to_be_removed*, str2]
[4, 7, 10] [str2, str5, str10]
[5, 2, 7] [str6, str2, str_to_be_removed*]
EDIT:
So, this is the sample df:
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
                   'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
                        ['str_to_be_removed_bla','b'],
                        ['b','c','d'],
                        ['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
                       ]})
Desired Output:
df =
A B
[1] [str_to_be_removed_bla]
[2,3] [str_to_be_removed_bla, b]
[3,4,5] [b, c, d]
[10, 12] [str_to_be_removed_bla_bla,f]
Steps:
Use zip to combine related elements together.
Unnest the zipped list with explode.
Drop duplicates, keeping the first.
Group by index and aggregate as lists.
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5],[10, 11, 12]
],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b', 'c', 'd'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']
]})
# zip and explode step
# df_obj = df.apply(lambda x: list(zip(x.A, x.B)),
# axis = 1
# ).to_frame()
# df_obj = df_obj.explode(0).reset_index()
# df_obj['A'] = df_obj[0].str[0]
# df_obj['B'] = df_obj[0].str[1]
Update: simplified the steps using @Joe Ferndz's solution.
# explode the columns
dfTemp = df.apply(lambda x: x.explode())
df_obj = dfTemp.reset_index()
# can add more conditions to filter and handle to drop_duplicates
cond = df_obj['B'].astype(str).str.contains("str_to_be_removed")
df1 = df_obj[cond].drop_duplicates(['index'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
result = df3.groupby('index').agg({'A':list, 'B':list})
result
A B
index
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
Extending this solution to multiple patterns in one step (str_to_be_removed1, str_to_be_removed2):
df = pd.DataFrame({'A':[
[1,2],[2,3],[3,4,5, 6],[10, 11, 12]
],
'B':[['str_to_be_removed1_bla','str_to_be_removed1_bla'],
['str_to_be_removed1_bla','b'],
['b', 'b', 'c', 'd'],
['str_to_be_removed2_bla_bla','str_to_be_removed2_bla','f']
]})
print(df)
# A B
# 0 [1, 2] [str_to_be_removed1_bla, str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 11, 12] [str_to_be_removed2_bla_bla, str_to_be_removed2_bla, f]
df_obj = df.apply(lambda x: x.explode()).reset_index()
df_obj['tag'] = df_obj['B'].str.extract(r'(str_to_be_removed1|str_to_be_removed2)')
print(df_obj)
# index A B tag
# 0 0 1 str_to_be_removed1_bla str_to_be_removed1
# 1 0 2 str_to_be_removed1_bla str_to_be_removed1
# 2 1 2 str_to_be_removed1_bla str_to_be_removed1
# 3 1 3 b NaN
# 4 2 3 b NaN
# 5 2 4 b NaN
# 6 2 5 c NaN
# 7 2 6 d NaN
# 8 3 10 str_to_be_removed2_bla_bla str_to_be_removed2
# 9 3 11 str_to_be_removed2_bla str_to_be_removed2
# 10 3 12 f NaN
cond = df_obj['tag'].notnull()
df1 = df_obj[cond].drop_duplicates(['index', 'tag'], keep='first')
df2 = df_obj[~cond]
df3 = pd.concat([df1, df2]).sort_index()
print(df3.groupby('index').agg({'A':list, 'B':list}))
# A B
# index
# 0 [1] [str_to_be_removed1_bla]
# 1 [2, 3] [str_to_be_removed1_bla, b]
# 2 [3, 4, 5, 6] [b, b, c, d]
# 3 [10, 12] [str_to_be_removed2_bla_bla, f]
Here's another approach. Here you can give the search string, and the repeated matches will be removed from the list.
import pandas as pd
pd.set_option('display.max_colwidth', 500)
df = pd.DataFrame({'A':[[1,2],[2,3],[3,4,5],[10,11,12]],
'B':[['str_to_be_removed_bla','str_to_be_removed_bla'],
['str_to_be_removed_bla','b'],
['b','c','d'],
['str_to_be_removed_bla_bla','str_to_be_removed_bla','f']]})
print (df)
search_str = 'str_to_be_removed'
#explode each column into its own rows
dfTemp = df.apply(lambda x: x.explode())
#flag columns that contains search string
dfTemp['Found'] = dfTemp.B.str.contains(search_str)
#cumcount by Index and by search strings to get duplicates
dfTemp['Bx'] = dfTemp.groupby([dfTemp.index,'Found']).cumcount()
#Exclude all records where search string was found and it is not the first match in its row
#(the comparison is parenthesized because & binds more tightly than >)
dfTemp = dfTemp[~(dfTemp['Found'] & (dfTemp['Bx'] > 0))]
#Extract back all the records into the dataframe and store as list
df_final = dfTemp.groupby(dfTemp.index).agg({'A':list, 'B':list})
del dfTemp
print (df_final)
The output of this is:
Original dataframe:
A B
0 [1, 2] [str_to_be_removed_bla, str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 11, 12] [str_to_be_removed_bla_bla, str_to_be_removed_bla, f]
search string: 'str_to_be_removed'
Final dataframe:
A B
0 [1] [str_to_be_removed_bla]
1 [2, 3] [str_to_be_removed_bla, b]
2 [3, 4, 5] [b, c, d]
3 [10, 12] [str_to_be_removed_bla_bla, f]
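One thing to watch in the filtering step: in Python, & binds more tightly than comparison operators such as >, so a comparison combined with & should get its own parentheses. A small plain-Python illustration (the variable names are invented for the example):

```python
found = True   # the row matches the search string
count = 2      # it is the third match within its group (cumcount is 0-based)

# Without parentheses this parses as (found & count) > 0.
# True & 2 is 0 as a bitwise operation, so the test is False.
unparenthesized = found & count > 0

# With parentheses the comparison runs first: True & (2 > 0) is True.
parenthesized = found & (count > 0)

print(unparenthesized, parenthesized)
```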
Let us try dict
out = pd.DataFrame([[list(dict(zip(y[::-1],x[::-1])).values())[::-1],list(dict(zip(y[::-1],x[::-1])).keys())[::-1]] for x , y in zip(df.A,df.B)])
out
0 1
0 [1] [a]
1 [2, 2] [a, b]
Sample dataframe
df = pd.DataFrame({'A':[[1,2],[2,2]],'B':[['a','a'],['a','b']]})
I don't think a native pandas solution is possible for this case. And even if there is one, you are not likely to see a performance gain from it. A traditional for loop might work better for your case:
def remove_dupes(a, b, dupe_pattern):
    seen_dupe = False
    _a, _b = [], []
    for x, y in zip(a, b):
        if dupe_pattern in y:
            if not seen_dupe:
                seen_dupe = True
                _a.append(x)
                _b.append(y)
        else:
            _a.append(x)
            _b.append(y)
    return _a, _b
_A, _B = [], []
for a, b in zip(df.A, df.B):
    _a, _b = remove_dupes(a, b, 'str_to_be_removed')
    _A.append(_a)
    _B.append(_b)
df['A'], df['B'] = _A, _B
print(df)
# A B
#0 [1] [str_to_be_removed_bla]
#1 [2, 3] [str_to_be_removed_bla, b]
#2 [3, 4, 5] [b, c, d]
#3 [10, 12] [str_to_be_removed_bla_bla, f]

How to convert the dictionary containing keys and list of list value?

How can I convert a dictionary whose values are lists of lists into a dictionary whose values are flat lists, dropping duplicates in the newly created lists?
I tried running the following code and got a 'memory error':
from collections import defaultdict
my_dict = defaultdict(list)
for k, v in zip(df_Group.guest_group, df_Group.list_guest):
    for item in v:
        v.append(item)
    my_dict[k].append(set(v))
My original dictionary is created from two columns of one dataframe and looks like: {Group1: [[1,2,3,4], [1, 2, 5, 6 ]]}
I want my dictionary to look like: {Group1: [1,2,3,4,5,6]}
From what I understood, what you essentially want to do is flatten your list while keeping unique items.
The unique items can be obtained by converting the list into a set and then back into a list.
The unpacking part is really well explained in this post. Here's some working code for you -
df_dict = {
    'Group1': [[1,2,3,4], [1, 2, 5, 6 ]],
    'Group2': [[1,2], [2,3],[2,3,4]]
}
final_dict = {}
for k, v in df_dict.items():
    # flatten the list using double list comprehension
    flat_list = [item for sublist in v for item in sublist]
    final_dict[k] = list(set(flat_list))
This gives the final_dict as -
{'Group1': [1, 2, 3, 4, 5, 6], 'Group2': [1, 2, 3, 4]}
Please tell me if this answers your query.
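As for the memory error in the question, a likely cause is that the inner loop appends to v while iterating over it, so the list keeps growing and the loop never reaches the end. A small sketch with an artificial cap (the cap exists only to make the demonstration stop):

```python
v = [1, 2]
steps = 0
for item in v:
    v.append(item)      # the list grows by one on every iteration
    steps += 1
    if steps == 5:      # artificial cap; without it the loop never ends
        break

print(len(v))  # 7: the 2 original items plus 5 appended ones
```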
Edit for integer values in between the lists -
If the outer list has bare integer values in between the sub-lists, you will get an "'int' object is not iterable" error. To solve it, we can check whether the item is an int and wrap it in a list ourselves.
Working code -
df_dict = {
    'Group1': [[1,2,3,4], 3, [1, 2, 5, 6 ]],
    'Group2': [[1,2], [2,3],[2,3,4]],
}
final_dict = {}
for k, v in df_dict.items():
    # making a new list
    new_list = []
    for item in v:
        # for int we will convert it to 1 length list
        if isinstance(item, int):
            item = [item]
        for ele in item:
            new_list.append(ele)
    final_dict[k] = list(set(new_list))
final_dict
Final dict -
{'Group1': [1, 2, 3, 4, 5, 6], 'Group2': [1, 2, 3, 4]}
As expected
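Note that a set does not guarantee any particular order. If the order of first appearance matters, one possible variation uses dict.fromkeys, since dictionaries keep insertion order in Python 3.7+. The sample values here are reordered on purpose to make that visible:

```python
from itertools import chain

df_dict = {
    'Group1': [[4, 3, 2, 1], [1, 2, 6, 5]],
    'Group2': [[1, 2], [2, 3], [2, 3, 4]],
}

final_dict = {
    # chain.from_iterable flattens one level; dict.fromkeys drops repeats
    # while keeping the first occurrence of each item in place
    k: list(dict.fromkeys(chain.from_iterable(v)))
    for k, v in df_dict.items()
}
print(final_dict)
```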

Filter pandas dataframe in python3 depending on the value of a list

So I have a dataframe like this:
df = {'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]}
And I want to create another dataframe that contains only the rows that have a certain value contained in the lists of x. For example, if I only want the ones that contain a 3, to get something like:
df2 = {'c': ['A','C'],
'x': [[1,2,3],[1,3]]}
I am trying to do something like this:
df2 = df[(3 in df.x.tolist())]
But I am getting a
KeyError: False
exception. Any suggestion/idea? Many thanks!!!
df = df[df.x.apply(lambda x: 3 in x)]
print(df)
Prints:
c x
0 A [1, 2, 3]
2 C [1, 3]
The code below should help.
To create the dataframe correctly:
df = pd.DataFrame({'c': ['A','B','C','D'],
'x': [[1,2,3],[2],[1,3],[1,2,5]]})
To filter the rows which contain 3:
df[df.x.apply(lambda x: 3 in x)]
Output:
c x
0 A [1, 2, 3]
2 C [1, 3]
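The KeyError in the question comes from the fact that 3 in df.x.tolist() is evaluated once for the whole column: it asks whether the number 3 is itself one of the elements (which are lists), yielding a single False, and df[False] is then treated as a lookup of a key named False. The filter needs one boolean per row instead. A plain-Python sketch of the difference, using the question's values:

```python
x = [[1, 2, 3], [2], [1, 3], [1, 2, 5]]  # contents of column 'x'

# One membership test against the outer list: 3 is not equal to any sub-list
whole_column = 3 in x

# One membership test per row: this is the shape a boolean mask needs
per_row = [3 in row for row in x]

print(whole_column)
print(per_row)
```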

how to multiply nested list with list?

I have:
dataA=[[1,2,3],[1,2,5]]
dataB=[1,2]
I want to multiply each element of dataA[0] by dataB[0], and each element of dataA[1] by dataB[1]. How can I do it?
I tried the following, but the results didn't match expectations:
dataA=[[1,2,3],[1,2,5]]
dataB=[1,2]
tmp=[]
for a in dataA:
    tampung = []
    for b in a:
        cou=0
        hasil = b*dataB[cou]
        tampung.append(hasil)
        cou+=1
    tmp.append(tampung)
print(tmp)
output : [[1, 2, 3], [1, 2, 5]]
expected output : [[1,2,3],[2,4,10]]
Please help
List comprehensions are a wonderful thing in Python.
result = [[x*y for y in l] for x, l in zip(dataB, dataA)]
This is equivalent to:
result = []
for x, l in zip(dataB, dataA):
    temp = []
    for y in l:
        temp.append(x * y)
    result.append(temp)
result
## [[1, 2, 3], [2, 4, 10]]
If you are working with numbers, consider using numpy, as it will make your operations much easier.
import numpy as np
dataA = [[1,2,3],[1,2,5]]
dataB = [1,2]
# map list to array
dataA = np.asarray(dataA)
dataB = np.asarray(dataB)
# dataA = array([[1, 2, 3], [1, 2, 5]])
# 2 x 3 array
# dataB = array([1, 2])
# 1-D array of length 2
dataC_1 = dataA[0] * dataB[0] # multiply first row of dataA by first element of dataB
dataC_2 = dataA[1] * dataB[1] # multiply second row of dataA by second element of dataB
# dataC_1 = array([1, 2, 3])
# dataC_2 = array([2, 4, 10])
These arrays can always be converted back into lists by passing them to list() or by calling their tolist() method.
As other contributors have said, please look into the numpy library!
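With numpy, the row-by-row multiplication can also be written in one step through broadcasting. This is a sketch that assumes dataB has one entry per row of dataA; adding a trailing axis turns it into a column so that each entry scales its own row:

```python
import numpy as np

dataA = np.asarray([[1, 2, 3], [1, 2, 5]])   # shape (2, 3)
dataB = np.asarray([1, 2])                   # shape (2,)

# dataB[:, None] has shape (2, 1) and broadcasts across the columns of dataA
result = dataA * dataB[:, None]

print(result.tolist())
```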
