How to let pandas groupby add a count column for each group after applying list aggregations? - python-3.x

I have a pandas DataFrame:
df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
"col_2": [1, 4, 8, 8, 6],
"col_3": [56, 4, 22, 1, 5]})
on which I apply a groupby operation that aggregates multiple columns into a list, using:
df = df.groupby(['col_1'])[["col_2", "col_3"]].agg(list)
Now I also want a column that holds the number of elements in each resulting group. The result should look like this:
{"col_1": ["apple", "banana"],
"col_2": [[1, 8], [4, 8, 6]],
"col_3": [[56, 22], [4, 1, 5]],
"count": [2, 3]}
I tried the following from reading other Stack Overflow posts:
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list).size()
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list, "count")
df = df.groupby(['col_1'])[["col_2", "col_3", "col_4"]].agg(list).agg("count")
But each of these gave either incorrect results (option 3) or an error (options 1 and 2).
How can I solve this?

We can try named aggregation
d = {c:(c, list) for c in ('col_2', 'col_3')}
df.groupby('col_1').agg(**{**d, 'count': ('col_2', 'size')})
Or we can separately calculate the size of each group, then join it with the dataframe that contains the columns aggregated as lists
g = df.groupby('col_1')
g[['col_2', 'col_3']].agg(list).join(g.size().rename('count'))
            col_2       col_3  count
col_1
apple      [1, 8]    [56, 22]      2
banana  [4, 8, 6]   [4, 1, 5]      3
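
For reference, the named-aggregation approach can also be written out without dictionary unpacking. This is a minimal sketch on the question's data; the variable name result and the final reset_index() call (to get col_1 back as a regular column) are illustrative choices, not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame({"col_1": ["apple", "banana", "apple", "banana", "banana"],
                   "col_2": [1, 4, 8, 8, 6],
                   "col_3": [56, 4, 22, 1, 5]})

# Each keyword is an output column: (input column, aggregation function).
result = df.groupby("col_1").agg(
    col_2=("col_2", list),
    col_3=("col_3", list),
    count=("col_2", "size"),  # number of rows in each group
).reset_index()

print(result.to_dict("list"))
```

Spelling the keywords out is only practical for a handful of columns; the dictionary comprehension scales better when many columns need the same aggregation.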

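Another approach, which also returns col_1 as a regular column. The count has to come from a per-group aggregation such as size(); transform('count') returns one value per original row, so joining it onto the reset index only lines up by coincidence on this data.
x = df.groupby('col_1')
x.agg({'col_2': list, 'col_3': list}).join(
    x.size().rename('count')).reset_index()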
Output
    col_1      col_2      col_3  count
0   apple     [1, 8]   [56, 22]      2
1  banana  [4, 8, 6]  [4, 1, 5]      3
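
When computing the count, it may help to compare size() with transform('count'): the former yields one value per group, the latter one value per original row. The sketch below uses a hypothetical row order (different from the question's) to show how a per-row result can misalign when joined onto the aggregated frame. The names per_row, per_group, aligned and misaligned are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical data: same idea as the question, but the rows start with "banana".
df = pd.DataFrame({"col_1": ["banana", "banana", "apple", "banana", "apple"],
                   "col_2": [4, 8, 1, 6, 8]})

g = df.groupby("col_1")
per_row = g["col_2"].transform("count")  # indexed like df: one value per row
per_group = g.size()                     # indexed by col_1: one value per group

lists = g[["col_2"]].agg(list)

# Joining on the group key (before reset_index) keeps counts and groups aligned.
aligned = lists.join(per_group.rename("count")).reset_index()

# Joining the per-row result on the positional index pairs group 0 with row 0, etc.
misaligned = lists.reset_index().join(per_row.rename("count"))

print(aligned)
print(misaligned)
```

On this data the second frame reports a count of 3 for apple, because row 0 of the original frame belongs to banana.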

Related

Merging DataFrame with additional column

Given DataFrames:
a = pd.DataFrame({"Question Skill": ["Algebra", "Patterns"],
"Average": [56,76],
"SD": [45,30]})
b = pd.DataFrame({"Question No.": [1, 2, 3, 4, 5],
"Question Skill": ['Algebra', 'Patterns', 'Algebra', 'Patterns', 'Patterns']})
Below is the required output:
c = pd.DataFrame({"Question Skill": ["Algebra","Patterns"],
"Question No.": [[1,3],[2,4,5]],
"Average": [56,76],
"SD": [45,30]})
You would probably want a function to handle more question-skill categories, but for just two categories this should be sufficient:
algebra=[b['Question No.'][i] for i in range(len(b)) if b['Question Skill'][i]=='Algebra']
patterns=[b['Question No.'][i] for i in range(len(b)) if b['Question Skill'][i]=='Patterns']
a['Question No.']=[algebra, patterns]
You can get it done in SQL style with some additional processing mixed in.
import pandas as pd
a = pd.DataFrame(
    {"Question Skill": ["Algebra", "Patterns"], "Average": [56, 76], "SD": [45, 30]}
)
b = pd.DataFrame(
    {
        "Question No.": [1, 2, 3, 4, 5],
        "Question Skill": ["Algebra", "Patterns", "Algebra", "Patterns", "Patterns"],
    }
)
c = (
    pd.merge(a, b, how="left", on=["Question Skill"])
    # select several columns with a list (double brackets)
    .groupby("Question Skill")[["Question No.", "Average", "SD"]]
    # keep a scalar when the group has one distinct value, otherwise a list
    .agg(lambda x: list(set(x))[0] if len(list(set(x))) == 1 else list(set(x)))
    .reset_index()
)
print(c)
Output:
  Question Skill Question No.  Average  SD
0        Algebra       [1, 3]       56  45
1       Patterns    [2, 4, 5]       76  30
You can aggregate the Question No. column of b into lists first, then map the result onto a:
a['Question No.'] = a['Question Skill'].map(b.groupby("Question Skill")["Question No."].apply(list))
print(a)
  Question Skill  Average  SD Question No.
0        Algebra       56  45       [1, 3]
1       Patterns       76  30    [2, 4, 5]
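
The same idea can also be expressed as an explicit merge: aggregate b into one row per skill, then merge that onto a. This is a sketch under the assumption that every skill in a should be kept (hence how="left"); the names grouped and c are illustrative.

```python
import pandas as pd

a = pd.DataFrame({"Question Skill": ["Algebra", "Patterns"],
                  "Average": [56, 76],
                  "SD": [45, 30]})
b = pd.DataFrame({"Question No.": [1, 2, 3, 4, 5],
                  "Question Skill": ["Algebra", "Patterns", "Algebra", "Patterns", "Patterns"]})

# One row per skill, with the question numbers collected into a list.
grouped = b.groupby("Question Skill")["Question No."].agg(list).reset_index()

# Left merge keeps every row of a, even a skill that has no questions in b.
c = a.merge(grouped, on="Question Skill", how="left")
print(c)
```

Compared with map, a merge extends more naturally if b later contributes more than one aggregated column.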

How to merge the list values under shared keys of two dictionaries?

e.g.
d1 = {'a':[1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a':[4, 5, 6], 'b': [3, 4, 5]}
The output should be like this:
{'a':[1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5]}
If the value repeats itself, it should be recorded only once.
Assume both dictionaries have the same keys.
One way to achieve this:
d1 = {'a':[1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a':[4, 5, 6], 'b': [3, 4, 5]}
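# collect both dictionaries in a list
dicts = [d1, d2]
# merged will be the resulting dictionary
merged = {}
for k in d1:
    # gather the lists stored under key k in every dictionary
    lists = [d[k] for d in dicts]
    # flatten, drop duplicates, and sort for a stable order
    merged[k] = sorted(set(item for sublist in lists for item in sublist))
print(merged)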
Output
{'a': [1, 2, 3, 4, 5, 6], 'b': [1, 2, 3, 4, 5]}
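
If the order of first appearance matters, or a key might be missing from one of the dictionaries, a variant along these lines may be useful. It is a sketch, not the answer's method: the helper name merge_unique is hypothetical, and it relies on dict.fromkeys keeping insertion order (guaranteed for dicts since Python 3.7).

```python
def merge_unique(*dicts):
    """Merge list values key by key, keeping each item once in first-seen order."""
    combined = {}
    for d in dicts:
        for key, values in d.items():
            # setdefault tolerates keys that appear in only some dictionaries
            combined.setdefault(key, []).extend(values)
    # dict.fromkeys drops duplicates while preserving insertion order
    return {key: list(dict.fromkeys(values)) for key, values in combined.items()}


d1 = {'a': [1, 2, 3], 'b': [1, 2, 3]}
d2 = {'a': [4, 5, 6], 'b': [3, 4, 5]}
print(merge_unique(d1, d2))
```

Unlike a set-based approach, this does not sort or reorder the items, which matters when the values are not naturally ordered.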

Python 3 ~ How to take rows from a csv file and put them into a list

I would like to know how to take this file:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
and put it in a list like the following:
[['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
I'm fairly new to Python, so please excuse me.
My current code looks like this:
file = open(argv[1] , "r")
file1 = open(argv[2] , "r")
text = file1.read()
strl = []
with file:
    # use a separate name so the csv module is not shadowed
    reader = csv.reader(file, delimiter=",")
    for row in reader:
        strl = row[1:9]
        break
df = pd.read_csv(argv[1],header=0)
df = [df[col].tolist() for col in df.columns]
Ignore the strl part; it is for something unrelated.
The code outputs this:
[['Alice', 'Bob', 'Charlie'], [2, 4, 3], [8, 1, 2], [3, 5, 5]]
but I want it to output this:
[['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
Using pandas
In [13]: import pandas as pd
In [14]: df = pd.read_csv("a.csv",header=None)
In [15]: df
Out[15]:
         0  1  2  3
0    Alice  2  8  3
1      Bob  4  1  5
2  Charlie  3  2  5
In [16]: [df[col].tolist() for col in df.columns]
Out[16]: [['Alice', 'Bob', 'Charlie'], [2, 4, 3], [8, 1, 2], [3, 5, 5]]
Update:
In [51]: import pandas as pd
In [52]: df = pd.read_csv("a.csv",header=None)
In [53]: data = df[df.columns[1:]].to_numpy().tolist()
In [57]: data.insert(0,df[0].tolist())
In [58]: data
Out[58]: [['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
Update:
In [51]: import pandas as pd
In [52]: df = pd.read_csv("a.csv")
In [94]: df
Out[94]:
      name  AGATC  AATG  TATC
0    Alice      2     8     3
1      Bob      4     1     5
2  Charlie      3     2     5
In [97]: data = df.loc[:, df.columns != 'name'].to_numpy().tolist()
In [98]: data.insert(0, df["name"].tolist())
In [99]: data
Out[99]: [['Alice', 'Bob', 'Charlie'], [2, 8, 3], [4, 1, 5], [3, 2, 5]]
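
The same result can be sketched without pandas, using only the standard csv module. The example below reads the question's sample from an in-memory string so that it is self-contained; with a real file you would pass an open file handle instead. The function name rows_to_lists is hypothetical, and the sketch assumes the first column holds names and all remaining columns hold integers.

```python
import csv
import io

sample = "name,AGATC,AATG,TATC\nAlice,2,8,3\nBob,4,1,5\nCharlie,3,2,5\n"


def rows_to_lists(handle):
    """Return [names, row1_counts, row2_counts, ...] from a CSV with a header row."""
    reader = csv.reader(handle)
    next(reader)  # skip the header line
    names, counts = [], []
    for row in reader:
        names.append(row[0])
        counts.append([int(value) for value in row[1:]])
    return [names] + counts


data = rows_to_lists(io.StringIO(sample))
print(data)
```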

Python: How do I split a list into multiple by comparing its contents?

I have a list such as:
list = [[1, 3, 'orange'], [3, 5, 'apple'], [2, 3, 'orange'], [7, 9, 'pear']]
and I would like to convert it into multiple lists such as:
list1 = [[1, 3, 'orange'], [2, 3, 'orange']]
list2 = [3, 5, 'apple']
list3 = [7, 9, 'pear']
Thank you.
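You can iterate over the list and check whether your filter element is present in each sublist (data is the list from the question):
data = [[1, 3, 'orange'], [3, 5, 'apple'], [2, 3, 'orange'], [7, 9, 'pear']]
filtered_list1, filtered_list2 = [], []
for l in data:
    if 'orange' in l:  # first filter element
        filtered_list1.append(l)
    elif 'apple' in l:  # second condition
        filtered_list2.append(l)
If you prefer a more functional style, you can use the built-in filter function.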
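
When the categories are not known in advance, writing one branch per filter element does not scale. A possible alternative, sketched here under the assumption that the grouping key is always the third element of each sublist, is to collect the sublists in a defaultdict. The names data, groups and result are illustrative.

```python
from collections import defaultdict

data = [[1, 3, 'orange'], [3, 5, 'apple'], [2, 3, 'orange'], [7, 9, 'pear']]

groups = defaultdict(list)
for item in data:
    # item[2] is the fruit name; sublists with the same name end up together
    groups[item[2]].append(item)

# One list per distinct fruit, in order of first appearance
result = list(groups.values())
print(result)
```

Note that every group here is a list of sublists, including groups with a single member.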

How to merge two arrays and group by key?

How can I merge two lists and group the values by key?
Example:
my_list = [3, 4, 5, 6, 4, 6, 8]
keys = [1, 1, 2, 2, 3, 5, 7]
Expected outcome:
[[1, 3, 4], [2, 5, 6], [3, 4], [5, 6], [7, 8]]
If I understand it right, the list of keys maps to the list of values. You can use the zip function to iterate through two lists at the same time, which is convenient in this case. Also have a look at defaultdict: we can use it to fill a list without initialising it explicitly.
from collections import defaultdict
result = defaultdict(list) # a dictionary which by default returns a list
for key, val in zip(keys, my_list):
    result[key].append(val)
result
# {1: [3, 4], 2: [5, 6], 3: [4], 5: [6], 7: [8]}
You can then go to a list (but not sure why you would want to) with:
final = []
for key, val in result.items():
    final.append([key] + val) # add key back to the list of values
final
# [[1, 3, 4], [2, 5, 6], [3, 4], [5, 6], [7, 8]]
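
A more compact variant of the same grouping can be sketched with itertools.groupby. Since groupby only merges consecutive items with equal keys, the pairs are sorted by key first; this is harmless for the example (the keys are already sorted) and Python's sort is stable, so values keep their original order within each key. The names pairs and final are illustrative.

```python
from itertools import groupby
from operator import itemgetter

my_list = [3, 4, 5, 6, 4, 6, 8]
keys = [1, 1, 2, 2, 3, 5, 7]

# Pair each key with its value and sort by key so equal keys are adjacent
pairs = sorted(zip(keys, my_list), key=itemgetter(0))

# For each run of equal keys, emit [key, value1, value2, ...]
final = [[key] + [val for _, val in group]
         for key, group in groupby(pairs, key=itemgetter(0))]
print(final)
```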
You can also write it yourself without any imports. The function below, merge_group, collects the values for each key in a plain dict (which keeps insertion order) and then prepends each key to its values:
my_list = [3, 4, 5, 6, 4, 6, 8]
keys = [1, 1, 2, 2, 3, 5, 7]
def merge_group(input_list: list, input_keys: list) -> list:
    groups = {}  # key -> list of values, in order of first appearance
    for key, val in zip(input_keys, input_list):
        groups.setdefault(key, []).append(val)
    # prepend each key to the values collected for it
    return [[key] + vals for key, vals in groups.items()]
print(merge_group(my_list, keys))
# [[1, 3, 4], [2, 5, 6], [3, 4], [5, 6], [7, 8]]
