Pandas aggregate column and keep header - python-3.x

I have code which works, but it gives me the data without the header. Is there a way I can write this code so the header is not removed? I know one way would be to add the header back afterwards, but is there a better way?
My code:
df = pd.read_csv("_data.csv", skiprows=[0], header=None)
df = df.groupby([2])[10].sum().astype(float)
Data:
A B
1 2
1 1
2 3
2 4
I have data like the above and am trying to get this result:
A B
1 3
2 7

Try to use the function reset_index after the sum:
import pandas as pd

data = [{'a': 1, 'b': 2}, {'a': 1, 'b': 1}, {'a': 2, 'b': 3}, {'a': 2, 'b': 4}]
df = pd.DataFrame(data)
df
a b
0 1 2
1 1 1
2 2 3
3 2 4
df.groupby('a').sum().reset_index()
a b
0 1 3
1 2 7
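Alternatively, as_index=False should give the same result without needing reset_index; a minimal sketch with the same df:
df.groupby('a', as_index=False).sum()
a b
0 1 3
1 2 7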

You should specify the separator (several spaces in your case) and that the header is the first row (row 0, with Python indexing), then group by the column you want.
df = pd.read_csv("_data.csv", sep=r'\s+', header=0)
A B
0 1 2
1 1 1
2 2 3
3 2 4
df = df.groupby(['A']).sum()
B
A
1 3
2 7

Related

Pandas Adding Column Maximum to the Original Dataframe [duplicate]

I have a dataframe with columns A,B. I need to create a column C such that for every record / row:
C = max(A, B).
How should I go about doing this?
You can get the maximum like this:
>>> import pandas as pd
>>> df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
>>> df
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]]
A B
0 1 -2
1 2 8
2 3 1
>>> df[["A", "B"]].max(axis=1)
0 1
1 8
2 3
and so:
>>> df["C"] = df[["A", "B"]].max(axis=1)
>>> df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If you know that "A" and "B" are the only columns, you could even get away with
>>> df["C"] = df.max(axis=1)
And you could use .apply(max, axis=1) too, I guess.
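For reference, that variant would look roughly like this (it is typically slower than the vectorized .max):
>>> df["C"] = df[["A", "B"]].apply(max, axis=1)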
@DSM's answer is perfectly fine in almost any normal scenario. But if you're the type of programmer who wants to go a little deeper than the surface level, you might be interested to know that it is a little faster to call numpy functions on the underlying .to_numpy() (or .values for pandas < 0.24) array instead of directly calling the (cythonized) functions defined on the DataFrame/Series objects.
For example, you can use ndarray.max() along the first axis.
# Data borrowed from @DSM's post.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [1,2,3], "B": [-2, 8, 1]})
df
A B
0 1 -2
1 2 8
2 3 1
df['C'] = df[['A', 'B']].values.max(1)
# Or, assuming "A" and "B" are the only columns,
# df['C'] = df.values.max(1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
If your data has NaNs, you will need numpy.nanmax:
df['C'] = np.nanmax(df.values, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
You can also use numpy.maximum.reduce. numpy.maximum is a ufunc (Universal Function), and every ufunc has a reduce:
df['C'] = np.maximum.reduce(df[['A', 'B']].values, axis=1)
# df['C'] = np.maximum.reduce(df[['A', 'B']], axis=1)
# df['C'] = np.maximum.reduce(df, axis=1)
df
A B C
0 1 -2 1
1 2 8 8
2 3 1 3
np.maximum.reduce and np.max appear to be more or less the same (for most normal sized DataFrames)—and happen to be a shade faster than DataFrame.max. I imagine this difference roughly remains constant, and is due to internal overhead (indexing alignment, handling NaNs, etc).
The graph was generated using perfplot. Benchmarking code, for reference:
import numpy as np
import pandas as pd
import perfplot

np.random.seed(0)
df_ = pd.DataFrame(np.random.randn(5, 1000))

perfplot.show(
    setup=lambda n: pd.concat([df_] * n, ignore_index=True),
    kernels=[
        lambda df: df.assign(new=df.max(axis=1)),
        lambda df: df.assign(new=df.values.max(1)),
        lambda df: df.assign(new=np.nanmax(df.values, axis=1)),
        lambda df: df.assign(new=np.maximum.reduce(df.values, axis=1)),
    ],
    labels=['df.max', 'np.max', 'np.nanmax', 'np.maximum.reduce'],
    n_range=[2**k for k in range(0, 15)],
    xlabel='N (* len(df))',
    logx=True,
    logy=True)
To find the overall maximum across multiple columns:
df[['A','B']].max(axis=1).max(axis=0)
Example:
df =
A B
timestamp
2019-11-20 07:00:16 14.037880 15.217879
2019-11-20 07:01:03 14.515359 15.878632
2019-11-20 07:01:33 15.056502 16.309152
2019-11-20 07:02:03 15.533981 16.740607
2019-11-20 07:02:34 17.221073 17.195145
print(df[['A','B']].max(axis=1).max(axis=0))
17.221073
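The same overall maximum can also be taken in one step over the flattened values; a minimal sketch with the same df:
print(df[['A','B']].to_numpy().max())
17.221073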

How to perform cumulative sum inside iterrows

I have a pandas dataframe as below:
df2 = pd.DataFrame({ 'b' : [1, 1, 1]})
df2
b
0 1
1 1
2 1
I want to create a column 'cumsum' with the cumulative sum of column b, starting from row 2. Also, I want to use iterrows to perform this. I tried the code below, but it does not seem to work.
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'].cumsum()
My expected output:
b cum_sum
0 1 NaN
1 1 2
2 1 3
As per your requirement, you may try this:
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[:row_index, 'b'].sum()
Out[10]:
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0
To stick to iterrows():
i = 0
df2['cumsum'] = 0
col = list(df2.columns).index('cumsum')
for row_index, row in df2.iloc[1:].iterrows():
    df2.loc[row_index, 'cumsum'] = df2.loc[row_index, 'b'] + df2.iloc[i, col]
    i += 1
Outputs:
b cumsum
0 1 0
1 1 1
2 1 2
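If iterrows() is not strictly required, a vectorized sketch (assuming numpy is imported as np) gives the expected output directly:
df2['cumsum'] = df2['b'].cumsum()
df2.loc[0, 'cumsum'] = np.nan  # blank out the first row, as in the expected output
df2
b cumsum
0 1 NaN
1 1 2.0
2 1 3.0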

The way `drop column by id` results in all same-name columns being removed from the dataframe

import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df = pd.concat([df1,df2],axis=1)
Let's look at the concatenated df: the first and third columns share the same column name, A.
df
A B A C
0 14 1 14 5
1 4 2 4 6
2 5 3 5 7
3 4 4 4 8
I want to get the following format.
df
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
Drop column by id.
result = df.drop(df.columns[2],axis=1)
result
B C
0 1 5
1 2 6
2 3 7
3 4 8
I can get what I expect this way:
import pandas as pd
df1 = pd.DataFrame({"A":[14, 4, 5, 4],"B":[1,2,3,4]})
df2 = pd.DataFrame({"A":[14, 4, 5, 4],"C":[5,6,7,8]})
df2 = df2.drop(df2.columns[0],axis=1)
df = pd.concat([df1,df2],axis=1)
It is strange that both the first and third columns are removed when dropping the specified column by id.
1. Please tell me the reason for this behaviour.
2. How can I remove the third column while keeping the first column?
df.columns[2] is just the label 'A', and DataFrame.drop removes every column with that label, which is why both 'A' columns disappear. To drop only the third column, work with positional indexes instead. Here's a way using indexes:
index_to_drop = 2
# get indexes to keep
col_idxs = [en for en, _ in enumerate(df.columns) if en != index_to_drop]
# subset the df
df = df.iloc[:,col_idxs]
A B C
0 14 1 5
1 4 2 6
2 5 3 7
3 4 4 8
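If the goal is simply to keep the first occurrence of each duplicated column name, a boolean mask over df.columns.duplicated() is another sketch worth considering:
df = df.loc[:, ~df.columns.duplicated()]  # keeps the first 'A', drops the second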

Drop a column in pandas if all values equal 1?

How do I drop columns in pandas where all values in that column are equal to a particular number? For instance, consider this dataframe:
df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, 1]})
print(df)
Output:
A B C
0 1 0 1
1 1 1 1
2 1 2 1
3 1 3 1
How would I drop the 1 columns so that the output is:
B
0 0
1 1
2 2
3 3
Use DataFrame.loc with a mask that tests whether each column has at least one value not equal to 1, using DataFrame.ne with DataFrame.any:
df1 = df.loc[:, df.ne(1).any()]
Or test for 1 with DataFrame.eq, use DataFrame.all to find columns that are all True, and invert the mask with ~:
df1 = df.loc[:, ~df.eq(1).all()]
print (df1)
B
0 0
1 1
2 2
3 3
EDIT:
One consideration is what you want to happen if a column contains only NaN and 1.
Then replace the NaNs with 0 using DataFrame.fillna and apply the same solutions as before:
df1 = df.loc[:, df.fillna(0).ne(1).any()]
df1 = df.loc[:, ~df.fillna(0).eq(1).all()]
You can use any:
df.loc[:, df.ne(1).any()]
One consideration is what you want to happen if a column contains only NaN and 1.
If you want to drop the column in that case as well, you will need to either fillna with 1 or add a new condition with |.
import numpy as np

df = pd.DataFrame({'A': [1, 1, 1, 1],
                   'B': [0, 1, 2, 3],
                   'C': [1, 1, 1, np.nan]})
print(df)
A B C
0 1 0 1.0
1 1 1 1.0
2 1 2 1.0
3 1 3 NaN
Both of these leave the column that contains only NaN and 1s in place:
df.loc[:, df.ne(1).any()]
df.loc[:, ~df.eq(1).all()]
So you can add this extra condition to drop that column as well:
df.loc[:, ~(df.eq(1) | df.isna()).all()]
Output:
B
0 0
1 1
2 2
3 3
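If the value to test against is not always 1, the same pattern generalizes; a sketch with a hypothetical value variable:
value = 1  # the particular number to drop on
df1 = df.loc[:, ~df.eq(value).all()]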

Index order of a shuffled dataframe

I have two DataFrames, namely A and B. B is generated by shuffling the rows of A. For each row of B, I would like to know the index of the same row in A.
Example:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A
a b c
0 1 1 1
1 2 2 2
2 3 3 3
B
a b c
0 2 2 2
1 3 3 3
2 1 1 1
The answer should be [1, 2, 0], because B equals A.loc[[1, 2, 0]]. I am wondering how to do this efficiently, since my A and B are large.
I came up with a possible solution using DataFrame.merge:
A=pd.DataFrame({"a":[1,2,3],"b":[1,2,3],"c":[1,2,3]})
B=pd.DataFrame({"a":[2,3,1],"b":[2,3,1],"c":[2,3,1]})
A['index_a'] = A.index
B['index_b'] = B.index
merge_df= pd.merge(A, B, left_on=['a', 'b', 'c'], right_on=['a', 'b', 'c'])
Where merge_df is
a b c index_a index_b
0 1 1 1 0 2
1 2 2 2 1 0
2 3 3 3 2 1
Now you can cross-reference the rows of the A and B DataFrames.
Example:
You know that the row with index 0 in A is at index 2 in B.
NOTE: Rows that do not have a match in the other dataframe will not appear in merge_df.
IIUC use merge
pd.merge(B.reset_index(), A.reset_index(),
         left_on=A.columns.tolist(),
         right_on=B.columns.tolist()).iloc[:, -1].values
array([1, 2, 0], dtype=int64)
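Another sketch, assuming the original A and B (without the helper index columns) and unique rows: build a lookup from the row tuples of A to their positions, then map each row of B through it.
pos = {tuple(row): i for i, row in enumerate(A[['a', 'b', 'c']].to_numpy())}
order = [pos[tuple(row)] for row in B[['a', 'b', 'c']].to_numpy()]
order
[1, 2, 0]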
