Pandas groupby specific range - python-3.x

I am trying to use groupby to group a dataset by a specific range. In the following DataFrame, I'd like to group by the max_speed values that are above 150 and count how many rows fall at each of those speeds.
An example dataset is as follows:
df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 250.0),
        ("bird", "Psittaciformes", 250.0),
        ("mammal", "Carnivora", 180.2),
        ("mammal", "Primates", 159.0),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
What I've tried is:
df[['max_speed'] > 150].groupby('max_speed').size()
Expected output:
max_speed count
180.2 1
250.0 2
159.0 1
How can I do this?
Thank you
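One possible approach (a sketch, not from the original thread): the attempt above compares a plain list to an int, so the mask never works; indexing with the column itself fixes that, and a filter-then-group does the counting. Note that groupby sorts the keys, so the rows come out in ascending max_speed order rather than the order shown in the expected output.

import pandas as pd

# keep only rows faster than 150, then count rows per distinct speed
fast = df[df['max_speed'] > 150]
counts = fast.groupby('max_speed').size().reset_index(name='count')
print(counts)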

Related

Iterate over a column, check a condition, and carry out calculations with values of other data frames

import pandas as pd
import numpy as np
I have 3 dataframes: df1, df2 and df3.
df1:
data = {'Period': ['2024-04-01', '2024-07-01', '2024-10-01', '2025-01-01', '2025-04-01',
                   '2025-07-01', '2025-10-01', '2026-01-01', '2026-04-01', '2026-07-01',
                   '2026-10-01', '2027-01-01', '2027-04-01', '2027-07-01', '2027-10-01',
                   '2028-01-01', '2028-04-01', '2028-07-01', '2028-10-01'],
        'Price': [np.nan] * 19,
        'years': [2024, 2024, 2024, 2025, 2025, 2025, 2025, 2026, 2026, 2026, 2026,
                  2027, 2027, 2027, 2027, 2028, 2028, 2028, 2028],
        'quarters': [2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
        }
df1 = pd.DataFrame(data=data)
df2:
data = {'price': [473.26, 244, 204, 185, 152, 157],
        'year': [2023, 2024, 2025, 2026, 2027, 2028]
        }
df2 = pd.DataFrame(data=data)
df3:
data = {'quarters': [1, 2, 3, 4],
        'weights': [1.22, 0.81, 0.83, 1.12]
        }
df3 = pd.DataFrame(data=data)
My aim is to compute the Price column of df1. For each row of df1, check the year and quarter and carry out the calculation accordingly. For example, for the first row: if df1['years'] == 2024 and df1['quarters'] == 2, then
df1['Price'] = df2.loc[df2['year'] == 2024, 'price'] * df3.loc[df3['quarters'] == 2, 'weights']
which should give
df1['Price'][0] = 473.26 * 0.81
df1['Price'][1] = 473.26 * 0.83
and so on.
I could have used this method, but I want to write the code in a more efficient way. I would like to use the following code structure:
for i in range(len(df1)):
    if (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 2):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 2, 'weights'].iloc[0]
    elif (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 3):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 3, 'weights'].iloc[0]
    elif (df1.loc[i, 'years'] == 2024) and (df1.loc[i, 'quarters'] == 4):
        df1.loc[i, 'Price'] = df2.loc[df2['year'] == 2024, 'price'].iloc[0] * df3.loc[df3['quarters'] == 4, 'weights'].iloc[0]
    ...
    ...
    ...
Thanks!!!
I think, if I understand correctly, you can use pd.merge to bring these fields together first:
df1 = df1.merge(df2, how='left', left_on='years', right_on='year')
df1 = df1.merge(df3, how='left', left_on='quarters', right_on='quarters')
df1['Price'] = df1['price'] * df1['weights']
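As a usage note (my addition, not from the original answer): after the two merges, df1 also carries the helper columns year, price and weights; once Price is computed they can be dropped:

df1 = df1.drop(columns=['year', 'price', 'weights'])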

Replace element with specific value in a pandas dataframe

I have a pandas dataframe with the following form:
                        cluster number
Robin_lodging_Dorthy                 0
Robin_lodging_Phillip                1
Robin_lodging_Elmer                  2
...                                ...
I want to replace every 0 in the column cluster number with the string "low", every 1 with "mid", and every 2 with "high". Any idea of how that can be done?
You can use the replace function with a mapping to change your column values:
import pandas as pd

values = {
    0: 'low',
    1: 'mid',
    2: 'high'
}
data = {
    'name': ['Robin_lodging_Dorthy', 'Robin_lodging_Phillip', 'Robin_lodging_Elmer'],
    'cluster_number': [0, 1, 2]
}
df = pd.DataFrame(data)
df.replace({'cluster_number': values}, inplace=True)
df
Output:
name cluster_number
0 Robin_lodging_Dorthy low
1 Robin_lodging_Phillip mid
2 Robin_lodging_Elmer high
More info in the documentation of the replace function.
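An equivalent one-liner (a sketch, not from the original answer) uses Series.map with the same dictionary; note that map returns NaN for values missing from the mapping, whereas replace leaves them untouched:

df['cluster_number'] = df['cluster_number'].map(values)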

Concatenation of pd.Series results in pandas.core.indexes.base.InvalidIndexError

I don't get why my code doesn't work.
If I execute the code below I get the following error:
pandas.core.indexes.base.InvalidIndexError: Reindexing only valid with uniquely valued Index objects.
The created series all have a timestamp index, where some entries are the same. I don't understand what the difference to the example from the pandas manual is, because the DataFrames there also have identical indices.
ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)  # create the series
ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)  # I copied this line from the example in the pandas manual
The resulting dataframe should be created like the one from the example.
In the pandas example, a pandas.Series may have the same or similar index values as another Series you intend to concatenate. However, the index values of each individual pandas.Series have to be unique,
i.e. len(ds.index) - len(ds.index.unique()) should equal 0.
Your index in ds has multiple values that are the same.
If you want to concatenate based on the index, it is unclear how such an index should be combined with another pandas.Series, so you will receive an error.
For example, your code will work if each ds has a unique index:
import numpy as np
import pandas as pd

ds_list = []
dates = pd.date_range(start='1/1/2019', periods=7)
values = np.arange(7)
dic_data = {'Col1': [(dates[i], values[i]) for i in range(7)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)
In [43]: df
Out[43]:
Col1 Col2
2019-01-01 0 0
2019-01-02 1 2
2019-01-03 2 4
2019-01-04 3 6
2019-01-05 4 8
2019-01-06 5 10
2019-01-07 6 12
However, if your ds has duplicated index values, it will not:
ds_list = []
dic_data = {'Col1': [(dates[i % 7], values[i % 7]) for i in range(12)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds_list.append(ds)
df = pd.concat(ds_list, axis=1, sort=False)  # raises the error below
InvalidIndexError: Reindexing only valid with uniquely valued Index objects
As a fix, I would aggregate on the timestamps, for example with a groupby (or however you want to aggregate the duplicates):
ds_list = []
dic_data = {'Col1': [(dates[i % 7], values[i % 7]) for i in range(12)], 'Col2': [(dates[i], values[i]*2) for i in range(7)]}
for name in dic_data.keys():
    ds = pd.Series([dat[1] for dat in dic_data[name]], index=[dat[0] for dat in dic_data[name]], name=name)
    ds = ds.groupby(level=0).sum()  # aggregate values that share a timestamp
    ds_list.append(ds)
Or you could just drop the duplicated timestamps; note that ds.drop_duplicates() looks at the values rather than the index, so it is the index that has to be de-duplicated.
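A minimal sketch of that (my addition), keeping the first value per timestamp:

ds = ds[~ds.index.duplicated(keep='first')]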

Using non-zero values from columns in function - pandas

I have the below dataframe and would like to calculate, within a function, the difference between the columns 'animal1' and 'animal2' over their sum, while only taking into consideration the rows whose values are bigger than 0 in both 'animal1' and 'animal2'.
How could I do this?
import pandas as pd

animal1 = pd.Series({'Cat': 4, 'Dog': 0, 'Mouse': 2, 'Cow': 0, 'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3, 'Mouse': 0, 'Cow': 1, 'Chicken': 2})
data = pd.DataFrame({'animal1': animal1, 'animal2': animal2})

def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values in a row are greater than 0 with DataFrame.gt, test it with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
# alternative for "not equal 0"
# df = data[data.ne(0).all(axis=1)].copy()
print(df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat
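As a design note (my addition): a variant that avoids mutating the input frame, computing the ratio as a local Series instead of adding a column:

def animals(data):
    anim_diff = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return anim_diff.abs().idxmax()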

Python - unable to count occurrences of values in defined ranges in dataframe

I'm trying to write code that analyses the values in a dataframe: if a value falls in a class, it is counted towards the corresponding key of a dictionary. But the code is not working for me. I'm trying to create logarithmic classes and count the total number of values that fall in each one.
def bins(df):
    """Returns new df with values assigned to bins"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    for i in df:
        if 100 < i and i <= 1000:
            bins_dict[500] += 1
        elif 1000 < i and i <= 10000:
            bins_dict[5000] += 1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):
    for i in df:
        for j in range:
            b = 10**j
            while j == 1:
                while i > 100:
                    if i >= b:
                        j += 1
                    elif i < b:
                        b = b / 2
                    i = b * (int(i / b))
            print(i)
This code is returning the original dataframe.
My dataframe consists of only one column, with values ranging between 100 and 10,000,000.
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If I can get a dataframe output directly, that would be helpful too.
PS: I am very new to programming and I only know Python.
Note that iterating over a DataFrame with for i in df yields the column names, not the values, so your counters are never updated the way you expect. You can let pandas do the binning for you with pd.cut:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
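Since the values range up to 10,000,000, the bin edges and labels could also be generated instead of hard-coded (a sketch, my addition; it uses integer labels rather than strings):

edges = [10**k for k in range(2, 8)]       # 100, 1000, ..., 10000000
labels = [5 * 10**k for k in range(2, 7)]  # 500, 5000, ..., 5000000
df['bins'] = pd.cut(df['Area'], bins=edges, labels=labels)
dic = dict(df['bins'].value_counts())      # zero-count classes are included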
