Maximum for each column, return value of other for max, create new dataframe of returns - python-3.x

I hope the title is not misleading.
I need to go from this dataframe:
Column_1 Columns_2 First Second Third
0 Element_1 to_be_ignored 10 5 77
1 Element_2 to_be_ignored 30 30 11
2 Element_3 to_be_ignored 60 7 3
3 Element_4 to_be_ignored 20 87 90
to:
New_Column New_Column_1 Max
0 Element_3 First 60
1 Element_4 Second 87
2 Element_4 Third 90
get maximum value of every column
get responding value of Column_1 for maximum value
transform to new dataframe
what i got so far:
data = {'Column_1': ['Element_1', 'Element_2', 'Element_3', 'Element_4'],
'Columns_2': ['to_be_ignored', 'to_be_ignored', 'to_be_ignored', 'to_be_ignored'],
'First': [10,30,60,20], 'Second': [5,30,7,87], 'Third': [77,11,3,90]}
df = pd.DataFrame(data)
df.loc[df.iloc[:, 1:].idxmax(), ['Column_1']
so i am able to get the index position and value for the maximum in the columns.
2 Element_3
3 Element_4
3 Element_4
Unfortunately i can't figure out the rest.
THX

IIUC melt then sort_values + drop_duplicates
df.melt(['Column_1','Columns_2']).sort_values('value').drop_duplicates(['variable'],keep='last')
Column_1 Columns_2 variable value
2 Element_3 to_be_ignored First 60
7 Element_4 to_be_ignored Second 87
11 Element_4 to_be_ignored Third 90

Related

Grouping data based on month-year in pandas and then dropping all entries except the latest one- Python

Below is my example dataframe
Date Indicator Value
0 2000-01-30 A 30
1 2000-01-31 A 40
2 2000-03-30 C 50
3 2000-02-27 B 60
4 2000-02-28 B 70
5 2000-03-31 C 90
6 2000-03-28 C 100
7 2001-01-30 A 30
8 2001-01-31 A 40
9 2001-03-30 C 50
10 2001-02-27 B 60
11 2001-02-28 B 70
12 2001-03-31 C 90
13 2001-03-28 C 100
Desired Output
Date Indicator Value
2000-01-31 A 40
2000-02-28 B 70
2000-03-31 C 90
2001-01-31 A 40
2001-02-28 B 70
2001-03-31 C 90
I want to write a code that groups data by particular month-year and then keep the entry of latest date in that particular month-year and drop the rest. The data is till year 2020
I was only able to fetch the count by month-year. I am not able to drop create a proper code that helps to group data as per month-year and indicator and get the correct results
Use Series.dt.to_period for months periods, aggregate index of maximal date per groups by DataFrameGroupBy.idxmax and then pass to DataFrame.loc:
df['Date'] = pd.to_datetime(df['Date'])
print (df['Date'].dt.to_period('m'))
0 2000-01
1 2000-01
2 2000-03
3 2000-02
4 2000-02
5 2000-03
6 2000-03
7 2001-01
8 2001-01
9 2001-03
10 2001-02
11 2001-02
12 2001-03
13 2001-03
Name: Date, dtype: period[M]
df = df.loc[df.groupby(df['Date'].dt.to_period('m'))['Date'].idxmax()]
print (df)
Date Indicator Value
1 2000-01-31 A 40
4 2000-02-28 B 70
5 2000-03-31 C 90
8 2001-01-31 A 40
11 2001-02-28 B 70
12 2001-03-31 C 90

Mark sudden changes in prices in a dataframe time series and color them

I have a Pandas dataframe of prices for different months and years (timeseries), 80 columns. I want to be able to detect significant changes in prices either up or down and color them differently in a dataframe. Is that possible and what would be the best approach?
Jan-2001 Feb-2001 Jan-2002 Feb-2002 ....
100 30 10 ...
110 25 1 ...
40 5 50
70 11 4
120 35 2
Here in the first column 40 and 70 should be marked, in the second column 5 and 11 should be marked, in the third column not really sure but probably 1, 50, 4, 2...
Your question involves 2 problems I can see.
Printing the highlighting depends on the output method your trying to get to, be it STDOUT, file, or some program specific.
Identification of outliers based on the Column data. Its hard to interpret if you want it based on the entire dataset, vice the previous data in the column like a rolling outlier, ie the data previous is calculated to identify if the next thing is out of wack.
In the below instance I provide a method to go at the data with std dev/zscoring based on the mean of the data in the entire column. You will have to tweak the > < items to get to your desired state, there is many intricacies dealing with this concept and I would suggest taking a look at a few resources about this subject.
For your data:
Jan-2001,Feb-2001,Jan-2002
100,30,10
110,25,1
40,5,50
70,11,4
120,35,20000
I am aware of methods to highlight, but not in the terminal. The https://pandas.pydata.org/pandas-docs/stable/style.html method works in a few programs.
To get at the original item, identification of outliers in your data, you could use something like below to identify based on standard deviation and zscore.
Sample Code:
df = pd.read_csv("full.txt")
original = df.columns
print(df)
for col in df.columns:
col_zscore = col + "_zscore"
df[col_zscore] = (df[col] - df[col].mean())/df[col].std(ddof=0)
print(df[col].loc[(df[col_zscore] > 1.5) | (df[col_zscore] < -.5)])
print(df)
Output 1: # prints the original dataframe
Jan-2001 Feb-2001 Jan-2002
100 30 10
110 25 1
40 5 50
70 11 4
120 35 20000
Output 2: # Identifies the outliers
2 40
3 70
Name: Jan-2001, dtype: int64
2 5
3 11
Name: Feb-2001, dtype: int64
0 10
1 1
3 4
4 20000
Name: Jan-2002, dtype: int64
Output 3: # Prints the full dataframe created, with zscore of each item based on the column
Jan-2001 Feb-2001 Jan-2002 Jan-2001_std Jan-2001_zscore \
0 100 30 10 32.710854 0.410152
1 110 25 1 32.710854 0.751945
2 40 5 50 32.710854 -1.640606
3 70 11 4 32.710854 -0.615227
4 120 35 2 32.710854 1.093737
Feb-2001_std Feb-2001_zscore Jan-2002_std Jan-2002_zscore
0 12.735776 0.772524 20.755722 -0.183145
1 12.735776 0.333590 20.755722 -0.667942
2 12.735776 -1.422147 20.755722 1.971507
3 12.735776 -0.895426 20.755722 -0.506343
4 12.735776 1.211459 20.755722 -0.614076
Resources for zscore are here:
https://statistics.laerd.com/statistical-guides/standard-score-2.php

How to take values in the column as the columns in the DataFrame in pandas

My current DataFrame is:
Term value
Name
A 1 35
A 2 40
A 3 50
B 1 20
B 2 45
B 3 50
I want to get a dataframe as:
Term 1 2 3
Name
A 35 40 50
B 20 45 50
How can i get it?I've tried using pivot_table but i didn't get my expected output.Is there any way to get my expected output?
Use:
df = df.set_index('Term', append=True)['value'].unstack()
Or:
df = pd.pivot(df.index, df['Term'], df['value'])
print (df)
Term 1 2 3
Name
A 35 40 50
B 20 45 50
EDIT: If duplicates in pairs Name with Term is necessary aggretion, e.g. sum or mean:
df = df.groupby(['Name','Term'])['value'].sum().unstack(fill_value=0)

row substraction in lambda pandas dataframe

I have a dataframe with multiple columns. One of the column is the cumulative revenue column. If the year is not ended then the revenue will be constant for the rest of the period because the coming daily revenue is 0.
The dataframe looks like this
Now I want to create a new column where the row is substracted by the last row and if the result is 0 then print 0 for that row in the new column. If not zero then use the row value. The new dataframe should look like this:
My idea was to do this with the apply lambda method. So this is the thinking:
{df['2017new'] = df['2017'].apply(lambda x: 0 if row - lastrow == 0 else x)}
But i do not know how to write the row - lastrow part of the code. How to do this? Thanks in advance!
By using np.where
df2['New']=np.where(df2['2017'].diff().eq(0),0,df2['2017'])
df2
Out[190]:
2016 2017 New
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0
We can shift the data and fill the values based on condition using np.where i.e
df['new'] = np.where(df['2017']-df['2017'].shift(1)==0,0,df['2017'])
or with df.where i.e
df['new'] = df['2017'].where(df['2017']-df['2017'].shift(1)!=0,0)
2016 2017 new
0 10 21 21
1 15 34 34
2 70 40 40
3 90 53 53
4 93 53 0
5 99 53 0

Pandas multi-index subtract from value based on value in other column part 2

Based on a thorough and accurate response to this question, I am now faced with a new issue based on slightly different data.
Given this data frame:
df = pd.DataFrame({
('A', 'a'): [23,3,54,7,32,76],
('B', 'b'): [23,'n/a',54,7,32,76],
('possible','possible'):[100,100,100,100,100,100]
})
df
A B possible
a b possible
0 23 23 100
1 3 n/a 100
2 54 54 100
3 7 n/a 100
4 32 32 100
5 76 76 100
I'd like to subtract 4 from 'possible', per row, for any instance (column) where the value is 'n/a' for that row (and then change all 'n/a' values to 0).
A B possible
a b possible
0 23 23 100
1 3 n/a 96
2 54 54 100
3 7 n/a 96
4 32 32 100
5 76 76 100
Some conditions:
It may occur that a column is all floats (though they appear to be integers upon inspection). This was not factored into the original question.
It may also occur that a row contains two instances (columns) of 'n/a' values. This was addressed by the previous solution.
Here is the previous solution:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -= (df.loc[:, idx[('A','B'),:]] == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
It works, except for where a column (A or B) contains all floats (seemingly integers). When that's the case, this error occurs:
TypeError: Could not compare ['n/a'] with block values
I think you can add casting to string by astype to condition:
idx = pd.IndexSlice
df.loc[:, idx['possible', 'possible']] -=
(df.loc[:, idx[('A','B'),:]].astype(str) == 'n/a').sum(axis=1) * 4
df.replace({'n/a':0}, inplace=True)
print df
A B possible
a b possible
0 23 23 100
1 3 0 96
2 54 54 100
3 7 0 96
4 32 32 100
5 76 76 100

Resources