Replace element with specific value to pandas dataframe - python-3.x

I have a pandas dataframe with the following form:
cluster number
Robin_lodging_Dorthy 0
Robin_lodging_Phillip 1
Robin_lodging_Elmer 2
... ...
I want to replace every 0 in the column cluster number with the string "low", every 1 with "mid", and every 2 with "high". Any idea how that can be done?

You can use the replace function with a mapping to change your column values:
import pandas as pd

values = {
    0: 'low',
    1: 'mid',
    2: 'high'
}
data = {
    'name': ['Robin_lodging_Dorthy', 'Robin_lodging_Phillip', 'Robin_lodging_Elmer'],
    'cluster_number': [0, 1, 2]
}
df = pd.DataFrame(data)
df.replace({'cluster_number': values}, inplace=True)
df
Output:
name cluster_number
0 Robin_lodging_Dorthy low
1 Robin_lodging_Phillip mid
2 Robin_lodging_Elmer high
More info in the documentation for the replace function.
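If every value in the column appears in the mapping, Series.map is a close alternative to replace. One difference to be aware of: map turns values that are absent from the dict into NaN, while replace leaves them unchanged. A minimal sketch with the same data:

```python
import pandas as pd

values = {0: 'low', 1: 'mid', 2: 'high'}
df = pd.DataFrame({
    'name': ['Robin_lodging_Dorthy', 'Robin_lodging_Phillip', 'Robin_lodging_Elmer'],
    'cluster_number': [0, 1, 2]
})
# map each cluster number through the dict; values absent from the dict become NaN
df['cluster_number'] = df['cluster_number'].map(values)
print(df)
```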

Related

Change a dataframe column value based on the current value?

I have a pandas dataframe with several columns, and one of them contains string values. I need to change these strings to an acceptable value based on the current value. The dataframe is relatively large (40,000 x 32).
I've made a small function that takes the string to be changed as a parameter and then looks up what it should be changed to.
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

def lut(txt):
    my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
              'Weekday': ['Sunday', 'Monday', 'Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key  # the return already exits the loop
    return 'Unknown'
The desired output should be:
A B
0 Script Song
1 Script Blues
2 Script Rock
3 Weekday Classic
4 Weekday Whatever
5 Unknown Something
I can't figure out how to apply this to the dataframe.
I've struggled with this for some time now, so any input will be appreciated.
Regards,
Check this out:
import pandas as pd
df = pd.DataFrame({
'A': ['Script','Scrpt','MyScript','Sunday','sdfsd','qwerty'],
'B': ['Song','Blues','Rock','Classic','Whatever','Something']})
dic = {'Weekday': ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday'],
       'Script': ['Script', 'Scrpt', 'MyScript']}
for k, v in dic.items():
    for item in v:
        df.loc[df.A == item, 'A'] = k
df.loc[~df.A.isin(dic.keys()), 'A'] = "Unknown"
Output:
A B
0 Script Song
1 Script Blues
2 Script Rock
3 Weekday Classic
4 Unknown Whatever
5 Unknown Something
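For the question as asked, the lut function from the question can also be applied element-wise with Series.apply, which produces the desired output without rewriting the lookup as loops:

```python
import pandas as pd

df = pd.DataFrame({
    'A': ['Script', 'Scrpt', 'MyScript', 'Sunday', 'Monday', 'qwerty'],
    'B': ['Song', 'Blues', 'Rock', 'Classic', 'Whatever', 'Something']})

def lut(txt):
    my_lut = {'Script': ['Script', 'Scrpt', 'MyScript'],
              'Weekday': ['Sunday', 'Monday', 'Tuesday']}
    for key, value in my_lut.items():
        if txt in value:
            return key
    return 'Unknown'

# apply the lookup to every element of column A
df['A'] = df['A'].apply(lut)
print(df)
```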

Pandas groupby specific range

I am trying to use groupby to group a dataset by a specific range. In the following dataframe, I'd like to group by the max_speed values that are above 150 and also count the number of items above 150.
An example dataset is as follows:
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 250.0),
        ("bird", "Psittaciformes", 250.0),
        ("mammal", "Carnivora", 180.2),
        ("mammal", "Primates", 159.0),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
What I've tried is:
df[['max_speed'] > 150].groupby('max_speed').size()
Expected output:
max_speed count
180.2 1
250.0 2
159.0 1
How can I do this?
Thank you
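A sketch of one way to do this, assuming the goal is simply to keep the rows where max_speed is above 150 and count how often each speed occurs: filter first, then group. Note that groupby sorts the result by max_speed rather than in the order shown in the expected output.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("bird", "Falconiformes", 250.0),
        ("bird", "Psittaciformes", 250.0),
        ("mammal", "Carnivora", 180.2),
        ("mammal", "Primates", 159.0),
        ("mammal", "Carnivora", 58),
    ],
    index=["falcon", "parrot", "lion", "monkey", "leopard"],
    columns=("class", "order", "max_speed"),
)
# keep only rows above 150, then count occurrences of each speed
counts = df[df["max_speed"] > 150].groupby("max_speed").size()
print(counts)
```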

How to filter time series if data exists at least data every 6 hours?

I'd like to verify that there is data at least once every 6 hours per ID, and filter out the IDs that do not meet this criterion.
Essentially a filter: "if an ID's data does not appear at least every 6h, drop the ID from the dataframe".
I tried to use the same method I used for filtering once per day, but I'm having trouble adapting the code.
# add day column from datetime index
df['1D'] = df.index.day
# reset index
daily = df.reset_index()
# count per ID per day. Result is per ID data of non-zero
a = daily.groupby(['1D', 'id']).size()
# filter by right join
filtered = a.merge(df, on='id', how='right')
I cannot figure out how to adapt this for the following 6hr periods each day: 00:01-06:00, 06:01-12:00, 12:01-18:00, 18:01-24:00.
Group by ID, then integer-divide the hour by 6 and count the unique values. In your case the count should be greater than or equal to 4, because there are four 6-hour bins in 24 hours and each day has 4 unique bins, i.e.
Bins = 4
00:01-06:00
06:01-12:00
12:01-18:00
18:01-24:00
Code
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
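To see the mask at work, here is a hypothetical sample where id 1 has data in all four 6-hour bins of the day and id 2 does not:

```python
import pandas as pd
from datetime import datetime

df = pd.DataFrame({
    'date': [datetime(2020, 1, 1, 1), datetime(2020, 1, 1, 7),
             datetime(2020, 1, 1, 13), datetime(2020, 1, 1, 19),
             datetime(2020, 1, 1, 1)],
    'id': [1, 1, 1, 1, 2],
})
# hour // 6 maps each timestamp to one of four bins (0, 1, 2, 3);
# an id survives only if it hits all four distinct bins
mask = df.groupby('id')['date'].transform(lambda x: (x.dt.hour // 6).nunique() >= 4)
df = df[mask]
print(df)
```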
I propose using pivot_table with resample, which allows changing to arbitrary frequencies. Please see the comments for further explanations.
import pandas as pd
from datetime import datetime

# build test data. I need a dummy column to use pivot_table later.
# Any column with numerical values will suffice.
data = [[datetime(2020, 1, 1, 1), 1, 1],
        [datetime(2020, 1, 1, 6), 1, 1],
        [datetime(2020, 1, 1, 12), 1, 1],
        [datetime(2020, 1, 1, 18), 1, 1],
        [datetime(2020, 1, 1, 1), 2, 1],
        ]
df = pd.DataFrame.from_records(data=data, columns=['date', 'id', 'dummy'])
df = df.set_index('date')
# We need a helper dataframe df_tmp.
# Transform id entries to columns. resample with 6h = 360 minutes = 360T.
# Take mean(); 6h windows without observations will contain NaN values.
# WARNING: It will only work if you have at least one id with observations for every 6h.
df_tmp = pd.pivot_table(df, columns='id', index=df.index).resample('360T').mean()
# Drop MultiColumnHierarchy and drop all columns with NaN values
df_tmp.columns = df_tmp.columns.get_level_values(1)
df_tmp.dropna(axis=1, inplace=True)
# Filter the original dataframe: keep only ids that had data in every 6h window
mask_id = df.id.isin(df_tmp.columns.to_list())
df = df[mask_id]
I kept your requirements on timestamps, but I believe you want to use the commented lines in my solution.
import pandas as pd
period = pd.to_datetime(['2020-01-01 00:01:00', '2020-01-01 06:00:00'])
# period = pd.to_datetime(['2020-01-01 00:00:00', '2020-01-01 06:00:00'])
shift = pd.to_timedelta(['6H', '6H'])
id_with_data = set(df['ID'])
for _ in range(4):  # for a day (00:01 --> 24:00)
    period_mask = (period[0] <= df.index) & (df.index <= period[1])
    # period_mask = (period[0] <= df.index) & (df.index < period[1])
    present_ids = set(df.loc[period_mask, 'ID'])
    id_with_data = id_with_data.intersection(present_ids)
    period += shift
df = df.loc[df['ID'].isin(list(id_with_data))]

Using non-zero values from columns in function - pandas

I have the dataframe below and would like to calculate, inside a function, the difference between columns 'animal1' and 'animal2' over their sum, taking into consideration only the rows where both 'animal1' and 'animal2' have values bigger than 0.
How could I do this?
import pandas as pd
animal1 = pd.Series({'Cat': 4, 'Dog': 0,'Mouse': 2, 'Cow': 0,'Chicken': 3})
animal2 = pd.Series({'Cat': 2, 'Dog': 3,'Mouse': 0, 'Cow': 1,'Chicken': 2})
data = pd.DataFrame({'animal1':animal1, 'animal2':animal2})
def animals():
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

print(data)
I believe you need to check that all values are greater than 0 with DataFrame.gt, test with DataFrame.all, and filter by boolean indexing:
def animals(data):
    data['anim_diff'] = (data['animal1'] - data['animal2']) / (data['animal1'] + data['animal2'])
    return data['anim_diff'].abs().idxmax()

df = data[data.gt(0).all(axis=1)].copy()
#alternative for not equal 0
#df = data[data.ne(0).all(axis=1)].copy()
print(df)
animal1 animal2
Cat 4 2
Chicken 3 2
print(animals(df))
Cat

Python - unable to count occurences of values in defined ranges in dataframe

I'm trying to write code that analyses values in a dataframe: if a value falls in a class, the total number of those values is assigned to a key in the dictionary. But the code is not working for me. I'm trying to create logarithmic classes and count the total number of values that fall in each.
def bins(df):
    """Returns new df with values assigned to bins"""
    bins_dict = {500: 0, 5000: 0, 50000: 0, 500000: 0}
    for i in df:
        if 100 < i <= 1000:
            bins_dict[500] += 1
        elif 1000 < i <= 10000:
            bins_dict[5000] += 1
    print(bins_dict)
However, this is returning the original dictionary.
I've also tried modifying the dataframe using
def transform(df, range):
    for i in df:
        for j in range:
            b = 10**j
            while j == 1:
                while i > 100:
                    if i >= b:
                        j += 1
                    elif i < b:
                        b = b / 2
            print(i = b*(int(i/b)))
This code is returning the original dataframe.
My dataframe consists of only one column with values ranging between 100 and 10000000
Data Sample:
Area
0 1815
1 907
2 1815
3 907
4 907
Expected output
dict={500:3, 5000:2, 50000:0}
If I can get a dataframe output directly, that would be helpful too.
PS: I am very new to programming and I only know Python.
You can use pandas for this:
import pandas as pd
df = pd.DataFrame()
df['Area'] = [1815, 907, 1815, 907, 907]
# create new column to categorize your data
df['bins'] = pd.cut(df['Area'], [0,1000,10000,100000], labels=['500', '5000', '50000'])
# converting into dictionary
dic = dict(df['bins'].value_counts())
print(dic)
Output:
{'500': 3, '5000': 2, '50000': 0}
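Since a dataframe output was also requested: one way is to group on the bin labels directly. A sketch, assuming empty bins should still appear as zero counts (observed=False keeps them):

```python
import pandas as pd

df = pd.DataFrame({'Area': [1815, 907, 1815, 907, 907]})
df['bins'] = pd.cut(df['Area'], [0, 1000, 10000, 100000],
                    labels=['500', '5000', '50000'])
# group on the categorical bin column; observed=False keeps empty bins
counts = df.groupby('bins', observed=False).size().reset_index(name='count')
print(counts)
```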
