Sum a pandas dataframe - python-3.x

>>> df
        A  B    C         D         E
0     one  A  foo  2.284039  0.802802
1     one  B  foo -1.463983  0.710178
2     two  C  foo -0.109677  2.930710
3   three  A  bar -0.356390 -1.972306
4     one  B  bar  1.425968 -0.285079
5     one  C  bar -0.657890 -0.555669
6     two  A  foo -0.168804 -1.930447
7   three  B  foo  0.488953 -2.512408
8     one  C  foo  0.251062 -0.465522
9     one  A  bar  0.427243 -0.845034
10    two  B  bar  0.629268 -0.892264
11  three  C  bar  0.171773  0.457268
I want to get the sum of column D, and the sum of column E, where column A is "one" and column C is "foo".
I know this works:
>>> x = df[df["A"] == "one"]
>>> y = x[x["C"] == "foo"]
>>> sum(y["D"])
1.0711178939632426
>>> sum(y["E"])
1.0474592505139344
Is there a more compact/elegant solution?

Using pandas, you can do:
df.groupby(['A','C']).sum()
Hope this helps.
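If you just need those two sums rather than every group total, a boolean mask with .loc is another compact option. A minimal sketch, with made-up numbers standing in for the random data above:

```python
import pandas as pd

# Illustrative stand-in for the question's frame (values are made up).
df = pd.DataFrame({
    "A": ["one", "one", "two", "three", "one"],
    "C": ["foo", "foo", "foo", "bar", "bar"],
    "D": [1.0, 2.0, 3.0, 4.0, 5.0],
    "E": [0.5, 0.5, 1.0, 1.0, 1.0],
})

# One boolean mask selects the rows; .sum() then totals both columns at once.
mask = (df["A"] == "one") & (df["C"] == "foo")
totals = df.loc[mask, ["D", "E"]].sum()
print(totals["D"], totals["E"])  # 3.0 1.0
```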

Related

How can I duplicate a row and append it directly after the duplicated row using pandas?

I've been trying to figure this problem out for a couple of hours now and seem to reach a dead end every time. A small example of what I want to do is shown below.
Normal Series
a
b
c
d
Duplicated Series
a
a
b
b
c
c
d
d
Try with loc and df.index.repeat:
>>> df.loc[df.index.repeat(2)]
Normal Series
0 a
0 a
1 b
1 b
2 c
2 c
3 d
3 d
>>>
Or with reset_index:
>>> df.loc[df.index.repeat(2)].reset_index(drop=True)
Normal Series
0 a
1 a
2 b
3 b
4 c
5 c
6 d
7 d
>>>
You can just concat a duplicated series together and sort it.
sample = pd.Series(['a','b','c','d'])
output = pd.concat([sample,sample]).sort_values().reset_index(drop=True)
output
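Running that concat approach end to end (same series as above) gives the duplicated, sorted result:

```python
import pandas as pd

sample = pd.Series(['a', 'b', 'c', 'd'])
# Stack the series on itself, sort so duplicates sit together,
# then rebuild a clean 0..n-1 index.
output = pd.concat([sample, sample]).sort_values().reset_index(drop=True)
print(output.tolist())  # ['a', 'a', 'b', 'b', 'c', 'c', 'd', 'd']
```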

split based on comma and create new data frame in Python

Let's say I have the following data frame.
df
Nodes Weight
A,B 10
A,C,F 8
B,F,D 6
B,E 4
I would like to split on the comma while keeping the weight. For example, for Nodes (A,C,F): A has a connection with C, and C has a connection with F, so I would like to see A>>C and C>>F. No need to see A>>F. Their weight should be 8 as well, as shown below.
The final data frame that I am looking for looks like this:
Node_1 Node_2 Weight
A B 10
A C 8
C F 8
B F 6
F D 6
B E 4
The goal of creating this data frame is to build a network graph from it.
There are similar solutions, but I couldn't get the result I want.
I tried with the following:
df = (df['Nodes'].str.split(',') .groupby(df['Weight'])
Can anyone help on this?
Here is one way to do this:
# From https://docs.python.org/3/library/itertools.html#itertools-recipes
from itertools import tee
def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = tee(iterable)
    next(b, None)
    return zip(a, b)
df['Node_pairs'] = df['Nodes'].str.split(',').apply(lambda x: list(pairwise(x)))
df = df.explode('Node_pairs')
df['Node1'] = df['Node_pairs'].str[0]
df['Node2'] = df['Node_pairs'].str[1]
df
Output:
Nodes Weight Node_pairs Node1 Node2
0 A,B 10 (A, B) A B
1 A,C,F 8 (A, C) A C
1 A,C,F 8 (C, F) C F
2 B,F,D 6 (B, F) B F
2 B,F,D 6 (F, D) F D
3 B,E 4 (B, E) B E
Details:
- Use the pairwise recipe from the itertools documentation to create 'Node_pairs'
- Explode the dataframe on the list of 'Node_pairs'
- Assign 'Node1' and 'Node2' using the .str get shortcut
The logic is the same as in the solution provided by Scott.
def grouper(input_list, n=2):
    for i in range(len(input_list) - (n - 1)):
        yield input_list[i:i+n]
(df.set_index('Weight')['Nodes']
   .str.split(',')
   .map(grouper)
   .map(list)
   .explode()
   .apply(pd.Series).add_prefix('Node_')
   .reset_index())
Weight Node_0 Node_1
0 10 A B
1 8 A C
2 8 C F
3 6 B F
4 6 F D
5 4 B E
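The grouper generator behaves the same way on one list of nodes:

```python
def grouper(input_list, n=2):
    # yield every window of n consecutive items
    for i in range(len(input_list) - (n - 1)):
        yield input_list[i:i + n]

# ['A', 'C', 'F'] produces the two consecutive pairs, never A-F.
print(list(grouper(['A', 'C', 'F'])))  # [['A', 'C'], ['C', 'F']]
```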

Do I use a loop, df.melt or df.explode to achieve a flattened dataframe?

Can anyone help with some code that will achieve the following transformation? I have tried variations of df.melt, df.explode, and also a looping statement, but only get error statements. I think it might need nesting, but I don't have the experience to do so.
index A B C D
0 X d 4 2
1 Y b 5 2
Where column D represents frequency of column C.
desired output is:
index A B C
0 X d 4
1 X d 4
2 Y b 5
3 Y b 5
If you want to repeat rows, why not use index.repeat?
import pandas as pd
# recreate the sample dataframe (D holds the repeat count for each row)
df = pd.DataFrame({"A": ["X", "Y"], "B": ["d", "b"], "C": [4, 5], "D": [2, 2]}, columns=list("ABCD"))
df = df.reindex(df.index.repeat(df["D"])).drop(columns="D").reset_index(drop=True)
print(df)
Sample output
   A  B  C
0  X  d  4
1  X  d  4
2  Y  b  5
3  Y  b  5

Easily generate edge list from specific structure using pandas

This is a question about how to make things properly with pandas (I use version 1.0).
Let's say I have a DataFrame of missions, each with an origin and one or more destinations:
mid from to
0 0 A [C]
1 1 A [B, C]
2 2 B [B]
3 3 C [D, E, F]
E.g., for the mission (mid=1), people will travel from A to B, then from B to C, and finally from C to A. Notice that I have no control over the data model of the input DataFrame.
I would like to compute metrics on each travel of the mission. The expected output would be exactly:
tid mid from to
0 0 0 A C
1 1 0 C A
2 2 1 A B
3 3 1 B C
4 4 1 C A
5 5 2 B B
6 6 2 B B
7 7 3 C D
8 8 3 D E
9 9 3 E F
10 10 3 F C
I have found a way to achieve my goal. Please find below the MCVE:
import pandas as pd
# Input:
df = pd.DataFrame(
    [["A", ["C"]],
     ["A", ["B", "C"]],
     ["B", ["B"]],
     ["C", ["D", "E", "F"]]],
    columns=["from", "to"]
).reset_index().rename(columns={'index': 'mid'})
# Create chain:
df['chain'] = df.apply(lambda x: list(x['from']) + x['to'] + list(x['from']), axis=1)
# Explode chain:
df = df.explode('chain')
# Shift to create travel:
df['end'] = df.groupby("mid")["chain"].shift(-1)
# Remove extra row, clean, reindex and rename:
df = df.dropna(subset=['end']).reset_index(drop=True).reset_index().rename(columns={'index': 'tid'})
df = df.drop(['from', 'to'], axis=1).rename(columns={'chain': 'from', 'end': 'to'})
My question is: is there a better/easier way to do this with pandas? By better I mean not necessarily more performant (though that's welcome, of course), but more readable and intuitive.
Your operation is basically explode and concat:
# turn series of lists in to single series
tmp = df[['mid','to']].explode('to')
# new `from` is concatenation of `from` and the list
df1 = pd.concat((df[['mid', 'from']],
                 tmp.rename(columns={'to': 'from'}))).sort_index()
# new `to` is concatenation of the list and `to`
df2 = pd.concat((tmp,
                 df[['mid', 'from']].rename(columns={'from': 'to'}))).sort_index()
df1['to'] = df2['to']
Output:
mid from to
0 0 A C
0 0 C A
1 1 A B
1 1 B C
1 1 C A
2 2 B B
2 2 B B
3 3 C D
3 3 D E
3 3 E F
3 3 F C
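The explode step on its own may be easier to picture with a tiny frame (a hypothetical two-mission sample):

```python
import pandas as pd

df = pd.DataFrame({"mid": [0, 1], "to": [["C"], ["B", "C"]]})
# explode gives each list element its own row, repeating the index and 'mid'
exploded = df.explode("to")
print(exploded)
```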
If you don't mind re-constructing the entire DataFrame, you can clean it up a bit with np.roll to get the pairs of destinations, and then assign the value of mid based on the number of trips (the length of each sublist in l).
import pandas as pd
import numpy as np
from itertools import chain
l = [[fr] + to for fr, to in zip(df['from'], df['to'])]
df1 = (pd.DataFrame(data=chain.from_iterable([zip(sl, np.roll(sl, -1)) for sl in l]),
                    columns=['from', 'to'])
       .assign(mid=np.repeat(df['mid'].to_numpy(), [*map(len, l)])))
from to mid
0 A C 0
1 C A 0
2 A B 1
3 B C 1
4 C A 1
5 B B 2
6 B B 2
7 C D 3
8 D E 3
9 E F 3
10 F C 3
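The np.roll trick is easiest to see on one chain: rolling left by one and zipping pairs each stop with the next, and the wrap-around supplies the final return leg.

```python
import numpy as np

chain = ['A', 'B', 'C']  # from A, visit B then C, then return to A
# np.roll(chain, -1) shifts the chain left by one: ['B', 'C', 'A'].
# Zipping the chain with its rolled copy pairs consecutive stops.
legs = [(a, str(b)) for a, b in zip(chain, np.roll(chain, -1))]
print(legs)  # [('A', 'B'), ('B', 'C'), ('C', 'A')]
```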

Print dictionary to file using pandas DataFrame, but changing dataframe format

I have a dictionary of dictionaries I want to print into a csv file. I came across a way to do this using pandas.DataFrame:
import pandas as pd
dict = {'foo': {'A':'a', 'B':'b'}, 'bar': {'C':'c', 'D':'d'}}
df = pd.DataFrame(dict)
#df.to_csv(path_or_buf = r"results.txt", mode='w')
This gives me a formatted result like so:
bar foo
A NaN a
B NaN b
C c NaN
D d NaN
I expected (and would like to have) a DataFrame that instead looks like:
foo A a
foo B b
bar C c
bar D d
I'm new to manipulation of dataframes, so I'm not sure how to change the formatting - would I do it in the DataFrame argument? Or is there a way to change it once the dictionary is already a df?
You are looking for stack
df.stack()
Out[91]:
A foo a
B foo b
C bar c
D bar d
dtype: object
If keys overlap across the outer dicts, stack returns a MultiIndex:
dict = {'foo': {'A':'a', 'B':'b'}, 'bar': {'A':'a','C':'c', 'D':'d'}}
df = pd.DataFrame(dict)
df.stack()
Out[93]:
A bar a
foo a
B foo b
C bar c
D bar d
dtype: object
df.stack().reset_index()
Out[94]:
level_0 level_1 0
0 A bar a
1 A foo a
2 B foo b
3 C bar c
4 D bar d
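To get exactly the column order the question asks for (foo A a), you can swap the index levels before resetting. A sketch, where the column names 'group', 'key', 'value' are my own labels:

```python
import pandas as pd

d = {'foo': {'A': 'a', 'B': 'b'}, 'bar': {'C': 'c', 'D': 'd'}}
# stack pairs each (letter, outer key) with its value; dropna is explicit
# because newer pandas versions of stack keep the NaN rows.
s = pd.DataFrame(d).stack().dropna()
# swaplevel puts the outer dict key first, matching the desired layout.
out = s.swaplevel().reset_index()
out.columns = ['group', 'key', 'value']
print(out)
```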
