GroupBy and Sum with a RangeIndex - python-3.x

I'm reading in a CSV with wide data, which I convert to long data. The data contains daily values for all of 2020. I'm trying to aggregate this by month and sum. This is what I have tried:
import pandas as pd
df = pd.read_csv('Notebooks/updated_predicted_data.csv', parse_dates=['Unnamed: 0'])
df.rename( columns={'Unnamed: 0':'yyyy_mm_dd'}, inplace=True)
df = df.melt(id_vars=['yyyy_mm_dd'])
df.rename(columns={'variable': 'name'}, inplace=True)
df.rename(columns={'value': 'predicted_value'}, inplace=True)
df['predicted_value'] = df['predicted_value'].str.replace('€', '')
df['predicted_value'] = df['predicted_value'].str.replace(',', '')
df['predicted_value'] = df['predicted_value'].astype(int)
df.dtypes
yyyy_mm_dd         datetime64[ns]
name                       object
predicted_value             int64
dtype: object
All of the above works fine. When I then try to group the data, I run into this issue:
sum_df = df.groupby(pd.Grouper(freq='M'))
TypeError: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex'
Why do I get the above error, despite having converted my date column from string to datetime and my predicted column from string to int?

You can pass the key parameter to Grouper to specify the datetime column, group by it together with the name column in a list, and add the sum aggregation:
sum_df = (df.groupby(['name', pd.Grouper(freq='M', key='yyyy_mm_dd')])['predicted_value']
            .sum()
            .reset_index())
Or convert yyyy_mm_dd to DatetimeIndex first:
sum_df = (df.set_index('yyyy_mm_dd')
            .groupby(['name', pd.Grouper(freq='M')])['predicted_value']
            .sum()
            .reset_index())
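Why the error occurs: without a key, pd.Grouper groups on the index, which here is still the default RangeIndex, hence the TypeError. For reference, a minimal sketch with invented dummy data showing the key-based grouping end to end:
import pandas as pd

# two names, daily values spread over two months (dummy data)
df = pd.DataFrame({
    'yyyy_mm_dd': pd.to_datetime(['2020-01-15', '2020-01-20', '2020-02-10', '2020-02-11'] * 2),
    'name': ['a'] * 4 + ['b'] * 4,
    'predicted_value': [1, 2, 3, 4, 10, 20, 30, 40],
})
sum_df = (df.groupby(['name', pd.Grouper(freq='M', key='yyyy_mm_dd')])['predicted_value']
            .sum()
            .reset_index())
# one row per (name, month): a/Jan=3, a/Feb=7, b/Jan=30, b/Feb=70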

Related

How to map column values to a formatted string (% formatting) in another pandas' column?

I have a pandas dataframe (df) with columns df['Message'], df['value1'], df['value2'], etc.
The Messages in the ['Message'] column are formatted strings such as: 'CPU-Value: %d, Event-Type: %s, ..,..'
My goal is to replace the %d or %s in ['Message'] with values from the other columns ['value1'] and ['value2']...
The following approach would not work (it only works for {}-formatted strings):
for index, row in df['Message'].items():
    df['Message'] = df['Message'][index].format(str(df['value1'][index]), ......)
Is there any advice on how to replace the variables/format the string with values from the other columns?
Since the variables in df['Message'] should all be formatted the same way (same order of %d, %s, etc.), you can just use str.replace(old, new, count):
import pandas as pd

df = pd.DataFrame({'mes': ['CPU-Value: %d, Event-Type: %s'], 'd': [1], 's': ['a']})

def formatter(a, b, c):
    # replace the first occurrence of each placeholder, in order
    a = a.replace('%d', str(b), 1)
    a = a.replace('%s', str(c), 1)
    return a

df['formatted'] = df.apply(lambda x: formatter(x['mes'], x['d'], x['s']), axis=1)
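Alternatively, since the placeholders are printf-style, Python's % operator can fill them in a single call; a minimal sketch reusing the same sample frame:
df['formatted'] = df.apply(lambda x: x['mes'] % (x['d'], x['s']), axis=1)
Both approaches produce 'CPU-Value: 1, Event-Type: a' for the sample row.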
You can apply a function (e.g. format) to an axis (e.g. per row via axis=1) with df.apply:
import pandas as pd

# sample data
df = pd.DataFrame.from_dict({
    'value1': [0, 1, 2],
    'value2': ['TestA', 'TestB', 'TestC']
})

# format message with values
df['message'] = df.apply(lambda x: f"CPU-Value: {x['value1']:d}, Event-Type: {x['value2']}", axis=1)
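With this sample data, df['message'] contains 'CPU-Value: 0, Event-Type: TestA', 'CPU-Value: 1, Event-Type: TestB', and 'CPU-Value: 2, Event-Type: TestC'.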

Applying a function on pandas groupby object return variable type

When applying a user-defined function f that takes a pd.DataFrame as input and returns a pd.Series on a pd.DataFrame.groupby object, the type of the returned object seems to depend on the number of unique values present in the field used to perform the grouping operation.
I am trying to understand why the API behaves this way, and to find a neat way to have a pd.Series returned regardless of the number of unique values in the grouping field.
I went through the split-apply-combine section of the pandas documentation, and it seems like the single-valued dataframe is treated differently, which does not make sense to me.
import pandas as pd
from typing import Union

def f(df: pd.DataFrame) -> pd.Series:
    """
    User-defined function
    """
    return df['B'] / df['B'].max()

# Should only output a pd.Series
def perform_apply(df: pd.DataFrame) -> Union[pd.Series, pd.DataFrame]:
    return df.groupby('A').apply(f)

# Some dummy dataframe with multiple values in field 'A'
df1 = pd.DataFrame({'A': 'a a b'.split(),
                    'B': [1, 2, 3],
                    'C': [4, 6, 5]})

# Subset of the dataframe with a single value in field 'A'
df2 = df1[df1['A'] == 'a'].copy()

res1 = perform_apply(df1)
res2 = perform_apply(df2)
print(type(res1), type(res2))
# --------------------------------
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.frame.DataFrame'>
pandas : 1.4.2
python : 3.9.0
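One way to get a pd.Series regardless of the number of groups, assuming the goal is each value's ratio to its group maximum, is groupby().transform, which always returns output aligned index-for-index with its input; a minimal sketch reusing df1 and df2 from above:
res1 = df1['B'] / df1.groupby('A')['B'].transform('max')
res2 = df2['B'] / df2.groupby('A')['B'].transform('max')
print(type(res1), type(res2))
# -> <class 'pandas.core.series.Series'> <class 'pandas.core.series.Series'>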

Iteratively append new data into pandas dataframe column and join with another dataframe

I have been extracting data from many APIs. I would like to add a common column across all of them. This is what I have tried:
df = pd.DataFrame()
for i in range(1, 200):
    url = '{id}/values'.format(id=i)
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = json.loads(res.content.decode('utf-8'))
        if data['success']:
            df['id'] = i
            test = pd.json_normalize(data[parent][child])
            df = df.append(test, ignore_index=True)
But in the dataframe's id column I'm only getting the last iterated id, and for APIs that return many rows I'm getting invalid data.
For performance reasons it is better to first store the data in a dictionary and then create the dataframe from that dictionary:
import pandas as pd
from collections import defaultdict

d = defaultdict(list)
for i in range(1, 200):
    # simulate the dataframe retrieved from the pd.json_normalize() call
    row = pd.DataFrame({'id': [i], 'field1': [f'f1-{i}'], 'field2': [f'f2-{i}'], 'field3': [f'f3-{i}']})
    for k, v in row.to_dict().items():
        d[k].append(v[0])

df = pd.DataFrame(d)
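An alternative sketch that also fixes the id column: attach the id to each normalized chunk, collect the chunks in a list, and concatenate once at the end. The URL, headers, and the 'results' key below are placeholders standing in for the question's elided specifics:
import pandas as pd
import requests

headers = {'Accept': 'application/json'}  # placeholder headers
frames = []
for i in range(1, 200):
    url = 'https://api.example.com/{id}/values'.format(id=i)  # placeholder URL
    res = requests.get(url, headers=headers)
    if res.status_code == 200:
        data = res.json()
        if data['success']:
            test = pd.json_normalize(data['results'])  # 'results' stands in for data[parent][child]
            test['id'] = i  # broadcasts the id to every row of this chunk
            frames.append(test)

df = pd.concat(frames, ignore_index=True)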

Changing column data type in Pandas DF using for command

I want to change the data type of a few columns in a dataframe from int/float to object.
I am trying to do this using a for loop, but the data type is not changing.
# convert_to_object is a list of columns in dataframe df_app1
convert_to_object = [df_app1.REGION_RATING_CLIENT,
                     df_app1.REGION_RATING_CLIENT_W_CITY,
                     df_app1.REG_REGION_NOT_LIVE_REGION,
                     df_app1.REG_REGION_NOT_WORK_REGION,
                     df_app1.LIVE_REGION_NOT_WORK_REGION,
                     df_app1.REG_CITY_NOT_LIVE_CITY,
                     df_app1.REG_CITY_NOT_WORK_CITY,
                     df_app1.LIVE_CITY_NOT_WORK_CITY]
for i in convert_to_object:
    i = i.astype('object', inplace=True)
df_app1.info()
The data type for the given columns should change to object.
As far as I know, DataFrame.astype does not accept an inplace parameter; see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.astype.html. How about:
cols = [
    "REGION_RATING_CLIENT",
    "REGION_RATING_CLIENT_W_CITY",
    "REG_REGION_NOT_LIVE_REGION",
    "REG_REGION_NOT_WORK_REGION",
    "LIVE_REGION_NOT_WORK_REGION",
    "REG_CITY_NOT_LIVE_CITY",
    "REG_CITY_NOT_WORK_CITY",
    "LIVE_CITY_NOT_WORK_CITY"
]
df[cols] = df[cols].astype(object)
Consider this example:
import pandas as pd

df = pd.DataFrame({
    'A': [1, 2],
    'B': [1.2, 3.4]
})
print(df.dtypes)
df[['A', 'B']] = df[['A', 'B']].astype(object)
print(df.dtypes)
Returns:
A      int64
B    float64
dtype: object
A    object
B    object
dtype: object
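If you do want to keep a loop, note that rebinding the loop variable never touches the frame; the converted column has to be assigned back. A minimal sketch using the cols list from above:
for c in cols:
    df[c] = df[c].astype(object)  # assigning back is what actually updates the frame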

Calling a data frame via a string

I have a list of countries such as:
country = ["Brazil", "Chile", "Colombia", "Mexico", "Panama", "Peru", "Venezuela"]
I created data frames using the names from the country list:
for c in country:
    c = pd.read_excel(str(c + ".xls"), skiprows=1)
    c = pd.to_datetime(c.Date, infer_datetime_format=True)
    c = c[["Date", "spreads"]]
Now I want to merge all the countries' data frames using the Date column as the key. The idea is to create a loop like the following:
df = Brazil  # the first dataframe, which also corresponds to the first element of the country list
for i in range(len(country)-1):
    df = df.merge(country[i+1], on="Date", how="inner")
df.set_index("Date", inplace=True)
I got the error ValueError: can not merge DataFrame with instance of type <class 'str'>. It seems Python is not resolving the data frames whose names are in the country list. How can I call those data frames starting from the country list?
Thanks masters!
Your loop doesn't modify the contents of the country list, so country is still a list of strings.
Consider building a new list of dataframes and looping over that:
country_dfs = []
for c in country:
    df = pd.read_excel(c + ".xls", skiprows=1)
    # convert the Date column in place instead of overwriting the whole frame
    df["Date"] = pd.to_datetime(df["Date"], infer_datetime_format=True)
    df = df[["Date", "spreads"]]
    # add the new dataframe to our list of dataframes
    country_dfs.append(df)
then to merge,
merged_df = country_dfs[0]
for df in country_dfs[1:]:
    merged_df = merged_df.merge(df, on='Date', how='inner')
merged_df.set_index('Date', inplace=True)
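Equivalently, the pairwise merge loop can be collapsed into a single functools.reduce call; a sketch assuming country_dfs from above:
from functools import reduce

merged_df = reduce(lambda left, right: left.merge(right, on='Date', how='inner'), country_dfs)
merged_df.set_index('Date', inplace=True)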
