How can I create a new dataframe by taking the rolling COLUMN total/sum of another dataframe? - python-3.x

import pandas as pd
df = {'a': [1,1,1], 'b': [2,2,2], 'c': [3,3,3], 'd': [4,4,4], 'e': [5,5,5], 'f': [6,6,6], 'g': [7,7,7]}
df1 = pd.DataFrame(df, columns = ['a', 'b', 'c', 'd', 'e', 'f', 'g'])
dg = {'h': [10,10,10], 'i': [14,14,14], 'j': [18,18,18], 'k': [22,22,22]}
df2 = pd.DataFrame(dg, columns = ['h', 'i', 'j', 'k'])
df1
a b c d e f g
0 1 2 3 4 5 6 7
1 1 2 3 4 5 6 7
2 1 2 3 4 5 6 7
df1 is my original dataframe. I would like to create another dataframe by summing each window of 4 consecutive columns (a rolling column sum).
df2
h i j k
0 10 14 18 22
1 10 14 18 22
2 10 14 18 22
df2 is the resulting dataframe after adding 4 consecutive columns of df1.
For example: column h in df2 is the sum of columns a, b, c, d in df1; column i in df2 is the sum of columns b, c, d, e in df1; column j in df2 is the sum of columns c, d, e, f in df1; column k in df2 is the sum of columns d, e, f, g in df1.
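(In the first row, for instance, h = 1 + 2 + 3 + 4 = 10 and k = 4 + 5 + 6 + 7 = 22.)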
I could not find any similar question/answer/example like this.
I would appreciate any help.

You can use rolling with a window of 4 along the columns and take the sum. Finally, drop the first 3 columns, which are all NaN, with dropna:
df1.rolling(4, axis=1).sum().dropna(axis=1)
d e f g
0 10.0 14.0 18.0 22.0
1 10.0 14.0 18.0 22.0
2 10.0 14.0 18.0 22.0
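Note that the result keeps the original labels d..g rather than h..k, and that rolling with axis=1 is deprecated in recent pandas. A minimal sketch of an equivalent without axis=1, rolling down the rows of the transposed frame and relabeling (the new names are simply the ones from the question):
import pandas as pd

df = {'a': [1,1,1], 'b': [2,2,2], 'c': [3,3,3], 'd': [4,4,4],
      'e': [5,5,5], 'f': [6,6,6], 'g': [7,7,7]}
df1 = pd.DataFrame(df)

# roll a window of 4 down the rows of the transpose, then transpose back;
# the first 3 rows of the transpose are all NaN and get dropped
df2 = df1.T.rolling(4).sum().dropna().T
df2.columns = ['h', 'i', 'j', 'k']  # relabel to the desired names
print(df2)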

Related

How to create a weighted edgelist directed from a pandas dataframe with weights in two columns?

I have the following pandas DataFrame (df):
>>> import pandas as pd
>>> df = pd.DataFrame([
...     ['A', 'B', '1'],
...     ['A', 'B', '2'],
...     ['B', 'A', '41'],
...     ['A', 'C', '11'],
...     ['C', 'B', '3'],
...     ['B', 'D', '4'],
...     ['D', 'B', '51']
... ], columns=('station_i', 'station_j', 'UID'))
I used
>>> df2=df.groupby(by=['station_i', 'station_j']).size().to_frame(name = 'counts_ij').reset_index()
to obtain the dataframe df2:
>>> print(df2)
station_i station_j counts_ij
0 A B 2
1 A C 1
2 B A 1
3 B D 1
4 C B 1
5 D B 1
Now I would like to obtain the dataframe df3, built as shown below, where pairs with the same values but in reversed order are dropped and counted in an extra column:
>>> print(df3)
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 C B 1 0
3 B D 1 1
Would really appreciate some suggestions.
import pandas as pd
import numpy as np

# find the indices whose (station_i, station_j) pair is a reversed duplicate
dupe = pd.concat([
    np.maximum(df2.station_i, df2.station_j),
    np.minimum(df2.station_i, df2.station_j)
], axis=1).duplicated()
df2[dupe]
station_i station_j counts_ij
2 B A 1
5 D B 1
df2[~dupe]
station_i station_j counts_ij
0 A B 2
1 A C 1
3 B D 1
4 C B 1
# split by dupe, swap the stations on the duplicated rows,
# and merge them back onto the non-duplicated rows
df2[~dupe].merge(
    df2[dupe].rename(
        columns={'station_i': 'station_j', 'station_j': 'station_i', 'counts_ij': 'counts_ji'}
    ), how='left'
).fillna(0).astype({'counts_ji': int})
station_i station_j counts_ij counts_ji
0 A B 2 1
1 A C 1 0
2 B D 1 1
3 C B 1 0
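Note that merge with no on= argument joins on the columns the two frames share, here station_i and station_j, so each swapped duplicate lines up with its forward counterpart; fillna(0) then covers the pairs that never occur reversed.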

Pandas: categorical column and insertion of rows for every category

I seem unable to insert rows with missing values while having one column as Categorical.
Assume the following dataframe df, where column B is categorical and categories should appear in the order of 'd', 'b', 'c', 'a'.
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['i', 'i', 'i', 'j', 'k'],
                   'B': pd.Categorical(['d', 'c', 'b', 'b', 'a'],
                                       categories=['d', 'b', 'c', 'a'],
                                       ordered=True),
                   'C': [1, 0, 3, 2, np.nan]})
I need to convert df into the following format:
A B C
0 i d 1.0
1 i b 3.0
2 i c 0.0
3 i a NaN
4 j d NaN
5 j b 2.0
6 j c NaN
7 j a NaN
8 k d NaN
9 k b NaN
10 k c NaN
11 k a NaN
Thank you in advance!
You could set the dataframe index to column B; this way we can use reindex later on to fill in the missing categorical values for each group. Group by column A and select column C, then apply the reindex as mentioned, using the desired category sequence. Afterwards, use reset_index to move the indices (A and B) back into dataframe columns.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': ['i', 'i', 'i', 'j', 'k'],
                   'B': pd.Categorical(['d', 'c', 'b', 'b', 'a'],
                                       categories=['d', 'b', 'c', 'a'],
                                       ordered=True),
                   'C': [1, 0, 3, 2, np.nan]})
print(df)

df = df.set_index('B')
df = df.groupby('A')['C'] \
       .apply(lambda x: x.reindex(['d', 'b', 'c', 'a'])) \
       .reset_index()
# restore B as an ordered categorical (reindex returned a plain object index)
df.B = pd.Categorical(df.B, categories=['d', 'b', 'c', 'a'], ordered=True)
print(df)
Output from the final print(df):
A B C
0 i d 1.0
1 i b 3.0
2 i c 0.0
3 i a NaN
4 j d NaN
5 j b 2.0
6 j c NaN
7 j a NaN
8 k d NaN
9 k b NaN
10 k c NaN
11 k a NaN
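As a side note, because B is categorical, unstacking it creates a column for every category, which gives a shorter sketch of the same reshape when applied to the original df from the question (assuming a pandas version where stack(dropna=False) is still supported):
# unstack the categorical level B (one column per category, in categorical
# order), then stack it back with dropna=False to keep the NaN rows
out = (df.set_index(['A', 'B'])['C']
         .unstack('B')
         .stack('B', dropna=False)
         .rename('C')
         .reset_index())
print(out)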

Can I apply vectorization here? Or should I think about this differently?

To put it simply, I have rows of activity that happen in a given month of the year. I want to append additional rows of inactivity in between this activity, while resetting the month values into a sequence. For example, if I have months 2, 5, 7, I need to map these to 1, 4, 7, while my inactive months happen in 2, 3, 5, and 6, so I would have to add four rows for this inactivity.
I've done this with dictionaries and for loops, but I know it is not efficient, especially when I move on to thousands of rows of data. Any suggestions on how to optimize this? Do I need to think about the data format differently? I've had a suggestion to build lists and then move those into the dataframe at the end, but I don't see a huge gain there. I also don't know enough NumPy to figure out how to do this with vectorization, which is super fast, and it would be awesome to learn something new.
Below is my code with the steps I took:
df = pd.DataFrame({'col1': ['A','A', 'B','B','B','C','C'], 'col2': ['X','Y','X','Y','Z','Y','Y'], 'col3': [1, 8, 2, 5, 7, 6, 7]})
Output:
col1 col2 col3
0 A X 1
1 A Y 8
2 B X 2
3 B Y 5
4 B Z 7
5 C Y 6
6 C Y 7
I'm creating a dictionary to handle this in for loops:
df1 = df.groupby('col1')['col3'].apply(list).to_dict()
df2 = df.groupby('col1')['col2'].apply(list).to_dict()
max_num = max(df.col3)
Output:
{'A': [1, 8], 'B': [2, 5, 7], 'C': [6, 7]}
{'A': ['X', 'Y'], 'B': ['X', 'Y', 'Z'], 'C': ['Y', 'Y']}
8
And now I'm adding those rows using my dictionaries by creating a new data frame:
df_new = pd.DataFrame({'col1': [], 'col2': [], 'col3': []})
for key in df1.keys():
    k = 1
    if list(df1[key])[-1] - list(df1[key])[0] + 1 < max_num:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
        df_new = df_new.append({'col1': key, 'col2': 'E', 'col3': str(k)}, ignore_index=True)
    else:
        for i in list(range(list(df1[key])[0], list(df1[key])[-1] + 1, 1)):
            if i in df1[key]:
                df_new = df_new.append({'col1': key, 'col2': list(df2[key])[list(df1[key]).index(i)], 'col3': str(k)}, ignore_index=True)
            else:
                df_new = df_new.append({'col1': key, 'col2': 'N', 'col3': str(k)}, ignore_index=True)
            k += 1
Output:
col1 col2 col3
0 A X 1
1 A N 2
2 A N 3
3 A N 4
4 A N 5
5 A N 6
6 A N 7
7 A Y 8
8 B X 1
9 B N 2
10 B N 3
11 B Y 4
12 B N 5
13 B Z 6
14 B E 7
15 C Y 1
16 C Y 2
17 C E 3
And then I pivot to the form I want it:
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
Output:
col3 1 2 3 4 5 6 7 8
col1
A X N N N N N N Y
B X N N Y N Z E NaN
C Y Y E NaN NaN NaN NaN NaN
Thanks for the help.
We can replace the steps of creating and using the dictionaries with the single statement below. Within each col1 group it uses reindex twice: the first reindex fills the inactive months inside the group's observed range with 'N', and the second appends one trailing 'E' month right after the last active one (unless the group already reaches max_num). The final set_index with a RangeIndex renumbers the months from 1, all without explicit loops.
df_new = df.set_index('col3') \
    .groupby('col1') \
    .apply(lambda dg:
        dg.drop(columns='col1')
          .reindex(range(dg.index.min(), dg.index.max() + 1), fill_value='N')
          .reindex(range(dg.index.min(), min(max_num, dg.index.max() + 1) + 1), fill_value='E')
          .set_index(pd.RangeIndex(1, min(max_num, dg.index.max() - dg.index.min() + 1 + 1) + 1, name='col3'))
    ) \
    .reset_index()
After this, you can apply your pivot statement as it is.
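For instance (col3 is now an integer column, so the pivoted column labels are ints rather than strings, but the layout is the same):
df_pivot = df_new.pivot(index='col1', columns='col3', values='col2')
print(df_pivot)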

append one dataframe column value to another dataframe

I have two dataframes. df1 is an empty dataframe and df2 has some data, as shown below. A few columns are common to both. I want to append the data from df2's columns into df1's columns; df3 is the expected result.
I have referred to Python + Pandas + dataframe : couldn't append one dataframe to another, but it is not working. It gives the following error:
ValueError: Plan shapes are not aligned
df1:
Empty DataFrame
Columns: [a, b, c, d, e]
Index: []
df2:
c e
0 11 55
1 22 66
df3 (expected output):
   a  b   c  d   e
0        11      55
1        22      66
I tried append as well as the following, but I am not getting the desired result:
import pandas as pd
l1 = ['a', 'b', 'c', 'd', 'e']
l2 = []
df1 = pd.DataFrame(l2, columns=l1)
l3 = ['c', 'e']
l4 = [[11, 55],
      [22, 66]]
df2 = pd.DataFrame(l4, columns=l3)
print("concat", "\n", pd.concat([df1, df2]))  # column order stays in place
print("merge NaN", "\n", pd.merge(df2, df1, how='left', on=l3))  # column order is not preserved
#### Output ####
#concat
a b c d e
0 NaN NaN 11 NaN 55
1 NaN NaN 22 NaN 66
#merge
c e a b d
0 11 55 NaN NaN NaN
1 22 66 NaN NaN NaN
Append seems to work for me. Does this not do what you want?
df1 = pd.DataFrame(columns=['a', 'b', 'c'])
print("df1: ")
print(df1)
df2 = pd.DataFrame(columns=['a', 'c'], data=[[0, 1], [2, 3]])
print("df2:")
print(df2)
print("df1.append(df2):")
print(df1.append(df2, ignore_index=True, sort=False))
Output:
df1:
Empty DataFrame
Columns: [a, b, c]
Index: []
df2:
a c
0 0 1
1 2 3
df1.append(df2):
a b c
0 0 NaN 1
1 2 NaN 3
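Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0; on current pandas the equivalent call is:
print(pd.concat([df1, df2], ignore_index=True, sort=False))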
Have you tried pd.concat?
pd.concat([df1,df2])

How to trim and reshape dataframe?

I have df that looks like this:
a b c d e f
1 na 2 3 4 5
1 na 2 3 4 5
1 na 2 3 4 5
1 6 2 3 4 5
How do I trim and reshape the dataframe so that, for every column, the rows containing n/a are dropped and the dataframe looks like this:
a b c d e f
1 6 2 3 4 5
This dataframe has millions of rows; I need to be able to drop the n/a rows while retaining the rows and columns that contain data.
Edit: df.dropna() is dropping all the rows. When I check whether the columns with n/a are empty via df.column_name.empty, I get False, so there is data in the columns with n/a.
For me, dropna works fine for removing missing values and Nones:
df = df.dropna()
print (df)
a b c d e f
3 1 6.0 2 3 4 5
But if there are possibly multiple values to remove, create a mask with isin, chain it with a missing-values test via isnull, and reduce with any (True for rows where at least one condition holds); finally filter with the inverted mask ~:
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': ['a', None, 's', 'd'],
                   'b': ['na', 7, 2, 6],
                   'c': [2, 2, 2, 2],
                   'd': [3, 3, 3, 3],
                   'e': [4, 4, np.nan, 4],
                   'f': [5, 5, 5, 5]})
print (df)
a b c d e f
0 a na 2 3 4.0 5
1 None 7 2 3 4.0 5
2 s 2 2 3 NaN 5
3 d 6 2 3 4.0 5
df1 = df.dropna()
print (df1)
a b c d e f
0 a na 2 3 4.0 5
3 d 6 2 3 4.0 5
mask = (df.isin(['na', 'n/a']) | df.isnull()).any(axis=1)
df2 = df[~mask]
print (df2)
a b c d e f
3 d 6 2 3 4.0 5
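A roughly equivalent sketch: convert the placeholder strings to real missing values first, so that a single dropna handles everything:
# turn the 'na'/'n/a' placeholders into real NaN, then drop as usual
df3 = df.replace(['na', 'n/a'], np.nan).dropna()
print(df3)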
