Pandas Dataframe Query - Location of highest value per row - python-3.x

The following code generates a small dataframe that is intended to be a fictitious Olympics medal table.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randint(0, 47, 20).reshape(4, 5),
                  index=['USA', 'USR', 'ITL', 'GBR'],
                  columns=[1996, 2000, 2004, 2008, 2012])
df['Highest'] = df.max(axis=1).round()
df = df.sort_values('Highest', ascending = False).head(10)
df
I have added a column at the end to establish the highest medal tally per row (Country).
I need to add a 'Year' column that holds, for each row, the year in which the highest medal tally was won.
So, if the highest number of medals on row 1 was won in the year 2012, the value of 2012 should be added in row 1 of the new 'Year' column.
How can I do that?
Thanks

Here's one option: find the index location of the maximum, then look up the Year. You can adapt it for your purpose as needed. Create a random df first.
Filtering on the max and taking .index gives an Index of matching labels; in this case it has one element, so use [0] to get the label itself.
Then use .at to get the year at the max value.
df = pd.DataFrame(data={'Year': range(2000, 2010), 'Value': np.random.uniform(low=0.5, high=13.3, size=(10,))}, columns=['Year', 'Value'])
max_value = df.Value.max()
idx_max_value = df.loc[df.Value == max_value].index[0]
year_at_max_value = df.at[idx_max_value,'Year']
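The lookup above can probably be shortened with idxmax, which returns the index label of the first maximum directly. This is only a sketch on the same kind of random data; the seeded generator is my addition so the run is repeatable.

```python
import numpy as np
import pandas as pd

# Same shape of data as the answer above; the seed is an assumption for repeatability.
rng = np.random.default_rng(0)
df = pd.DataFrame({'Year': range(2000, 2010),
                   'Value': rng.uniform(low=0.5, high=13.3, size=10)})

# idxmax returns the index label of the first occurrence of the maximum.
idx_max_value = df['Value'].idxmax()
year_at_max_value = df.at[idx_max_value, 'Year']
print(year_at_max_value)
```

If several rows tie for the maximum, idxmax returns the first one, which matches the [0] used above.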

Probably not the most Pythonic solution, but this works:
year = []
for x in range(len(df)):
    pip = np.array(df.iloc[x, :5])  # the five year columns of row x
    i = np.argmax(pip)              # position of the largest tally in that row
    year.append(df.columns[i])
df['Year'] = year
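A vectorised alternative to the loop, as a sketch: idxmax(axis=1) returns, for each row, the column label of the largest value. The year_cols list and the seeded generator are assumptions added for illustration; restricting to year_cols keeps the 'Highest' column out of the comparison.

```python
import numpy as np
import pandas as pd

year_cols = [1996, 2000, 2004, 2008, 2012]  # assumed year columns
rng = np.random.default_rng(0)              # seed is only for repeatability
df = pd.DataFrame(rng.integers(0, 47, size=(4, 5)),
                  index=['USA', 'USR', 'ITL', 'GBR'],
                  columns=year_cols)

# Row-wise maximum and the column label (year) where it occurs.
df['Highest'] = df[year_cols].max(axis=1)
df['Year'] = df[year_cols].idxmax(axis=1)
df = df.sort_values('Highest', ascending=False)
print(df)
```

When a country has the same top tally in two years, idxmax reports the earliest of those columns.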

Related

How to calculate the diff between values of 2 adjacent values across every column in a pandas dataframe?

I made a dataset of shape (252, 60) by concatenating the ['Close'] columns of every stock in the Sensex-30 index and adding columns that shift each ['Close'] column down by one row. I want to compute the difference between the shifted price and the current price for every day and every stock. I tried to do so in a Colab notebook, but I get the error IndexError: single positional indexer is out-of-bounds
The dataset and code are too long to be shown here, so you can look at them in this colab notebook
Reducing your code, I find the below works
import io
import requests
import pandas as pd
df = pd.DataFrame()
for stock in ['RELIANCE','INFY','HCLTECH','TCS','BAJAJ-AUTO',
              'TITAN','LT','NESTLEIND','TECHM','ASIANPAINT',
              'M&M','ICICIBANK','POWERGRID','HINDUNILVR','SUNPHARMA',
              'TATASTEEL','AXISBANK','SBIN','ULTRACEMCO','BAJAJFINSV',
              'ITC','NTPC','BAJFINANCE','BHARTIARTL','MARUTI',
              'KOTAKBANK','HDFC','HDFCBANK','ONGC','INDUSINDBK']:
    url = "https://query1.finance.yahoo.com/v7/finance/download/"+stock+".BO?period1=1577110559&period2=1608732959&interval=1d&events=history&includeAdjustedClose=true"
    # Keep only the Close column of each download and name it after the stock
    df = pd.concat([df, pd.read_csv(io.BytesIO(requests.get(url).content), index_col="Date")
                    .loc[:,"Close"]
                    .to_frame().rename(columns={"Close":stock})], axis=1)
# Bind c as a default argument so each lambda keeps its own column name
profit={f"{c}_profit":lambda dfa, c=c: dfa[c]-dfa[c].shift(periods=1) for c in df.columns}
df = df.assign(**profit)
df.shape
output
(252, 60)
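For the day-over-day difference itself, DataFrame.diff() should give the same result as subtracting a one-row shift, for all columns at once. A minimal sketch on made-up prices (the two tickers and their values are hypothetical, not downloaded data):

```python
import pandas as pd

# Hypothetical closing prices; real data would come from the download loop above.
close = pd.DataFrame({'RELIANCE': [100.0, 102.0, 101.0],
                      'INFY': [50.0, 49.0, 52.0]},
                     index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-03']))

# diff() subtracts the previous row from the current one in every column.
profit = close.diff().add_suffix('_profit')
result = pd.concat([close, profit], axis=1)
print(result.shape)
```

The first row of every _profit column is NaN because there is no previous day to compare against.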

How do I get the maximum and minimum values of a column depending on another two columns in pandas dataframe?

This is my first time asking a question. I have a dataframe that looks like below:
import pandas as pd
data = [['AK', 'Co',2957],
['AK', 'Ot', 15],
['AK','Petr', 86848],
['AL', 'Co',167],
['AL', 'Ot', 10592],
['AL', 'Petr',1667]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
I need to find the maximum and minimum values of the third column based on the first two columns. I did browse through a few stackoverflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK','Ot', 15],
['AK','Petr',86848],
['AL','Co',167],
['AL','Ot', 10592]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
Note: Please let me know where I am lagging before leaving a negative marking on the question
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter.
new_df = my_df.loc[
    my_df.groupby(["State"])
    .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
    .stack()
]
print(new_df)
  State Energy   Elec
1    AK     Ot     15
2    AK   Petr  86848
3    AL     Co    167
4    AL     Ot  10592
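Another way to get the same rows, sketched here as an alternative rather than as the method used above: compare each value with its group's min and max via transform and keep the matches. The reset_index call is my addition to renumber the result from 0.

```python
import pandas as pd

my_df = pd.DataFrame(
    [['AK', 'Co', 2957], ['AK', 'Ot', 15], ['AK', 'Petr', 86848],
     ['AL', 'Co', 167], ['AL', 'Ot', 10592], ['AL', 'Petr', 1667]],
    columns=['State', 'Energy', 'Elec'])

# transform broadcasts each state's min/max back onto the original rows.
grouped = my_df.groupby('State')['Elec']
is_extreme = (my_df['Elec'] == grouped.transform('min')) | (my_df['Elec'] == grouped.transform('max'))

new_df = my_df[is_extreme].reset_index(drop=True)
print(new_df)
```

Unlike idxmin/idxmax, this keeps every row that ties for the minimum or maximum within a state.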

How to select multiple rows and take the mean value based on the name of the row

From this data frame I would like to select rows with the same concentration and almost the same name. For example, the first three rows have the same concentration and the same name except for the ending: Dig_I, Dig_II, Dig_III. I would like to select these three rows and take the mean value of each column. After that I want to create a new data frame.
Here is the whole data frame:
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
import pandas as pd
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new_df = df.groupby('concentration').mean(numeric_only=True)
Note: This only averages columns with a numeric dtype (float or int), so the img_name column is dropped. In pandas 2.0 and later you need numeric_only=True for that; without it, mean() raises a TypeError on the text column.
The same thing as a single chained expression...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js").groupby('concentration').mean(numeric_only=True)
If you would like to preserve the img_name...
df = pd.read_csv("https://gist.github.com/akash062/75dea3e23a002c98c77a0b7ad3fbd25b.js")
new = df.groupby('concentration').mean(numeric_only=True)
pd.merge(df, new, on = 'concentration', how = 'inner')
Does that help?
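If the grouping should also respect the shared part of the name, one possible approach is to strip the trailing Roman-numeral suffix and group on both keys. This is a sketch on hypothetical rows: the sample names, the 'red' column, and the assumption that names end in _I, _II, _III are mine and may not match the real file.

```python
import pandas as pd

# Hypothetical rows; the real column names and name format may differ.
df = pd.DataFrame({
    'img_name': ['sampleA_Dig_I', 'sampleA_Dig_II', 'sampleA_Dig_III',
                 'sampleB_Dig_I', 'sampleB_Dig_II'],
    'concentration': [0.5, 0.5, 0.5, 1.0, 1.0],
    'red': [10.0, 12.0, 14.0, 20.0, 22.0],
})

# Strip a trailing suffix such as _I, _II or _III to get the shared name.
df['base_name'] = df['img_name'].str.replace(r'_[IVX]+$', '', regex=True)

# Average the numeric columns for each (concentration, base_name) pair.
new_df = (df.groupby(['concentration', 'base_name'], as_index=False)
            .mean(numeric_only=True))
print(new_df)
```

Grouping on both keys keeps two different samples apart even when they happen to share a concentration.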

ValueError when filtering and adding a new column at once

I'm getting the error code:
ValueError: Wrong number of items passed 3, placement implies 1.
What I want to do is import a dataset, count the duplicated values, drop the duplicates, and add a column that says how many times each value occurred.
This is to try and sort a dataset of 13 000 rows and 45 columns.
I've tried different solutions found online but it seems like it does not help. I'm pretty new to programming and all help is really appreciated
import pandas as pd
# Making file ready
data = pd.read_excel(r'Some file.xlsx', header = 0)
data.rename(columns={'Dato': 'Last ordered', 'ArtNr': 'Item No:'}, inplace = True)
#Formatting dates
pd.to_datetime(data['Last ordered'],
               format = '%Y-%m-%d %H:%M:%S')
#Creates new table content and order
df = data[['Item No:','Last ordered', 'Description']]
df['Last ordered'] = df['Last ordered'].dt.strftime('%Y-/%m-/%d')
df = df.sort_values('Last ordered', ascending = False)
#Adds total sold quantity column
df['Quantity'] = df.groupby('Item No:').transform('count')
df2 = df.drop_duplicates('Item No:').reset_index(drop=True)
#Prints to environment and creates new excel file
print(df2)
df2.to_excel(r'New Sorted File.xlsx')
I expect it to provide a new excel file with columns:
Item No | Last ordered | Description | Quantity
I also want to be able to add other columns from the original dataset later on if I need to.
The problem is at this line:
df['Quantity'] = df.groupby('Item No:').transform('count')
The right-hand side of the assignment is a dataframe with several columns, and you are trying to fit it into a single column. You need to select only one of the columns. Something like
df['Quantity'] = df.groupby('Item No:').transform('count')['Description']
should work.
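A small self-contained sketch of the fix, on made-up rows (the item numbers, dates and descriptions are hypothetical). Selecting the column before calling transform gives a single Series, which fits into one new column.

```python
import pandas as pd

# Hypothetical order lines, already sorted newest first.
df = pd.DataFrame({
    'Item No:': [101, 102, 101, 103, 101],
    'Last ordered': ['2019-05-01', '2019-04-20', '2019-03-15', '2019-02-01', '2019-01-10'],
    'Description': ['bolt', 'nut', 'bolt', 'washer', 'bolt'],
})

# Select one column first so transform returns a Series, not a dataframe.
df['Quantity'] = df.groupby('Item No:')['Description'].transform('count')

# Keep the first (most recent) row per item.
df2 = df.drop_duplicates('Item No:').reset_index(drop=True)
print(df2)
```

Note that count skips missing values, so the chosen column should be one that is always filled in.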

Create Dynamic Columns with Calculation

I have a dataframe called prices, with historical stock prices for the following companies:
['APPLE', 'AMAZON', 'GOOGLE']
So far, with the help of a friendly user, I was able to create a dataframe for each of these periods with the following code:
import pandas as pd
import numpy as np
from datetime import datetime, date
prices = pd.read_excel('database.xlsx')
companies=prices.columns
companies=list(companies)
del companies[0]
step = 250
prices_list = [prices[day:day + step] for day in range(len(prices) - step)]
Now, I need to evaluate the change in price for every period of 251 days (Price251/Price1; Price252/Price2; Price253/Price3 and so on) for each one of the companies, and create a column for each one of them.
I would also like the column names to be dynamic, so I can replicate this on a much longer database.
So, I would get a dataframe similar to this:
open image here
Here you can find the dataframe head(3): Initial Dataframe
IIUC, try this:
def create_cols(df, num_dates):
    # list(df)[1:] skips the first column (assumed to be the date)
    for col in list(df)[1:]:
        # Relative change from each row to the row num_dates later
        df[f'{col}%'] = -((df[col].shift(num_dates) - df[col]) / df[col].shift(num_dates)).shift(-num_dates)
    return df
create_cols(prices, 251)
You would then only have to format the new columns as percentages.
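For the percentage formatting, one option is to map a format string over the column. This is only a sketch: the 'APPLE%' values below are hypothetical fractional changes of the kind create_cols would produce.

```python
import pandas as pd

# Hypothetical fractional changes; the last row has no value yet.
changes = pd.DataFrame({'APPLE%': [0.1234, -0.05, None]})

# '{:.2%}' multiplies by 100 and appends a percent sign; NaN rows are left alone.
formatted = changes['APPLE%'].map('{:.2%}'.format, na_action='ignore')
print(formatted)
```

The result holds strings, so it is best kept for display and the numeric column used for any further calculation.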
