How can I count categorical columns by month in Pandas? - python-3.x

I have time series data with a column which can take a value A, B, or C.
An example of my data looks like this:
date,category
2017-01-01,A
2017-01-15,B
2017-01-20,A
2017-02-02,C
2017-02-03,A
2017-02-05,C
2017-02-08,C
I want to group my data by month and store both the sum of the count of A and the count of B in column a_or_b_count and the count of C in c_count.
I've tried several things, but the closest I've been able to do is to preprocess the data with the following function:
def preprocess(df):
# Remove everything more granular than day by splitting the stringified version of the date.
df['date'] = pd.to_datetime(df['date'].apply(lambda t: t.replace('\ufeff', '')), format="%Y-%m-%d")
# Set the time column as the index and drop redundant time column now that time is indexed. Do this op in-place.
df = df.set_index(df.date)
df.drop('date', inplace=True, axis=1)
# Group all events by (year, month) and count category by values.
counted_events = df.groupby([(df.index.year), (df.index.month)], as_index=True).category.value_counts()
counted_events.index.names = ["year", "month", "category"]
return counted_events
which gives me the following:
year month category
2017 1 A 2
B 1
2 C 3
A 1
The process to sum up all A's and B's would be quite manual since category becomes a part of the index in this case.
I'm an absolute pandas menace, so I'm likely making this much harder than it actually is. Can anyone give tips for how to achieve this grouping in pandas?

I tried this so posting though I like #Scott Boston's solution better as I combined A and B values earlier.
df.date = pd.to_datetime(df.date, format = '%Y-%m-%d')
df.loc[(df.category == 'A')|(df.category == 'B'), 'category'] = 'AB'
new_df = df.groupby([df.date.dt.year,df.date.dt.month]).category.value_counts().unstack().fillna(0)
new_df.columns = ['a_or_b_count', 'c_count']
new_df.index.names = ['Year', 'Month']
a_or_b_count c_count
Year Month
2017 1 3.0 0.0
2 1.0 3.0

Related

pandas to_datetime() funtion is not converting for date 08-12-1600 in dataframe

raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
df_1['dates'] = pd.to_datetime(df_1['dates'])
the above code gives error due to date 08-12-1600 if the date is removed it works fine, what could be the possible reason for it?
error is:
Out of bounds nanosecond timestamp: 1600-08-12 00:00:00
That is because the provided dates are outside the range of Timestamp.
pd.Timestamp.min
Timestamp('1677-09-21 00:12:43.145225')
pd.Timestamp.max
Timestamp('2262-04-11 23:47:16.854775807')
Details here
If we need the dates even out of range
Then we can convert them to period using below code
raw_data = {'Event': ['A','B','C','D', 'E'],
'dates': ['08-12-1600','26-09-1400', '04-11-1991','25-03-1991', '10-05-1991']}
df_1 = pd.DataFrame(raw_data, columns = ['Event', 'dates'])
def conv(x):
day,month,year = tuple(x.split('-'))
return pd.Period(year=int(year), month=int(month), day=int(day), freq="D")
df_1['dates'] = df_1.dates.apply(conv)
df_1
Output
Event dates
0 A 1600-12-08
1 B 1400-09-26
2 C 1991-11-04
3 D 1991-03-25
4 E 1991-05-10
If we can ignore dates outside range
df_1['dates'] = pd.to_datetime(df_1.dates, errors='coerce')
df_1
Output
Event dates
0 A NaT
1 B NaT
2 C 1991-04-11
3 D 1991-03-25
4 E 1991-10-05
Bonus Fact
Why timestamp can hold values for around 584 years 1677-2262?
Since timestamps provides nano second precision and is stored in 64-bit integer, hence it can store around 584 years with this nano second resolution in 64-bit int space.

Extract row from pandas dateframe

I have a data frame as the image below. I want to extract the rows of data frame which are having year and month as '1395/01'. I used the code below, but I know it is not correct because we can use string slice on a series of strings. Can anyone show me a way without using nested for loops?
df[df['Date'][:7] == '1395/01']
I might use str.match here:
df[df['Date'].str.match(r'^1395/01')]
But in general it is usually preferable to store dates as datetime and not text. Also, the year 1395 seems dubious.
You can use loc and startswith to filter your dataframe.
Sample:
df = pd.DataFrame({'Date': ['1395/01/01', '1395/02/01', '1395/01/01', '1395/05/01']})
print(df)
Date
0 1395/01/01
1 1395/02/01
2 1395/01/01
3 1395/05/01
Solution:
print(df.loc[df['Date'].str.startswith('1395/01'), :])
Date
0 1395/01/01
2 1395/01/01
If you would like to extract year and month for all rows, you can use str.slice:
df['Extracted Date'] = df['Date'].str.slice(0, 7)
print(df)
Date Extracted Date
0 1395/01/01 1395/01
1 1395/02/01 1395/02
2 1395/01/01 1395/01
3 1395/05/01 1395/05

Add new rows to dataframe using existing rows from previous year

I'm creating a Pandas dataframe from an existing file and it ends up essentially like this.
import pandas as pd
import datetime
data = [[i, i+1] for i in range(14)]
index = pd.date_range(start=datetime.date(2019,1,1), end=datetime.date(2020,2,1), freq='MS')
columns = ['col1', 'col2']
df = pd.DataFrame(data, index, columns)
Notice that this doesn't go all the way up to the present -- often the file I'm pulling from is a month or two behind. What I then need to do is add on any missing months and fill them with the same value as the previous year.
So in this case I need to add another row that is
2020-03-01 2 3
It could be anywhere from 0-2 rows that need to be added to the end of the dataframe at a given point in time. What's the best way to do this?
Note: The data here is not real so please don't take advantage of the simple pattern of entries I gave above. It was just a quick way to fill two columns of a table as an example.
If I understand your problem, then the following should help you. This does assume that you always have data 12 months ago however. You can define a new DataFrame which includes the months up to the most recent date.
# First create the new index. Get the most recent date and add an offset.
start, end = df.index[-1] + pd.DateOffset(), pd.Timestamp.now()
index_new = pd.date_range(start, end, freq='MS')
Create your DataFrame
# Get the data from the previous year.
data = df.loc[index_new - pd.DateOffset(years=1)].values
df_new = pd.DataFrame(data, index = index_new, columns=df.columns)
which looks like
col1 col2
2020-03-01 2 3
then just use;
pd.concat([df, df_new], axis=0)
Which gives
col1 col2
2019-01-01 0 1
2019-02-01 1 2
2019-03-01 2 3
... ... ...
2020-02-01 13 14
2020-03-01 2 3
Note
This also works for cases where the number of months missing is greater than 1.
Edit
Slightly different variation
# Create series with missing months added.
# Get the corresponding data 12 months prior.
s = pd.date_range(df.index[0], pd.Timestamp.now(), freq='MS')
fill = df.loc[s[~s.isin(df.index)] - pd.DateOffset(years=1)]
# Reindex the original dataframe
df = df.reindex(s)
# Find the dates to fill and replace with lagged data
df.iloc[-1 * fill.shape[0]:] = fill.values

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500MB large. I am hoping to hear some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of the column's values during each month. An added complexity is that I have a column named group which denotes a different sensor in which the parameter (column) was measured. Therefore, the analysis needs to iterate around group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Coding annotations provide descriptions of most lines. I recognize that the three for loops are causing this runtime issue, but I do not have the foresight to see a way around it. Does anyone know any
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M', how='mean')
# Store the monthly dates created in last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on multiIndex columns. future note: use df[DATE, COL_NAME][UNIT] to access mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0,1,1).sort_index(axis=1)
divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
print(grp)
# Iterate through month
for mnth in month_dates:
# Make mask where month and group
mask = (df.index.month == mnth.month) & (df['group'] == grp)
for col in process_cols:
# Set values of divide_df
divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]
# Divide process_df with divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
df['month'] = df.index.month
print(df['month'])
df['year'] = df.index.year
print(df['year'])
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
print("###################", x.name, x['month'])
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by
print(column)
mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
print(mean_col)
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
normalize_cols = df.columns.tolist()
normalize_cols.remove('group')
#normalize_cols.remove('mode')
df2 = df.apply(find_norm, df_col_list = normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see according to df.apply() documentation that the first row always runs twice. I'm just not sure why this fails on the second time through.
Assuming that the requirement is to group the columns by mean and the month, here is another approach:
Create new columns - month and year from the index. df.index.month can be used for this provided the index is of type DatetimeIndex
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Now, group over (grp, month, year) and aggregate to find mean of every column. (Added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Use a function to calculate the normalized values and use apply() over the original dataframe
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)
#df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
Hope this helps!
Here is how the code would look like:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs
# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
# Step 3
def find_norm(x, df_col_list): # x is a row in dataframe, col_list is the list of columns to normalize
for column in df_col_list: # iterate over col list, find mean from aggregations, and divide the value by the mean.
mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
col_name = "norm" + str(column)
x[col_name] = x[column] / mean_col # norm
return x
df2 = df.apply(find_norm, df_col_list = ['A','B','C'], axis=1)

Code optimisation - comparing two datetime columns by month and creating a new column too slow

I am trying to create a new column in Pandas dataframe. If the other two date columns in my dataframe share the same month, then this new column should have 1 as a value, otherwise 0. Also, I need to check that ids match my other list of ids that I have saved previously in another place and mark those only with 1. I have some code but it is useless since I am dealing with almost a billion of rows.
my_list_of_ids = df[df.bool_column == 1].id.values
def my_func(date1, date2):
for id_ in df.id:
if id_ in my_list_of_ids:
if date1.month == date2.month:
my_var = 1
else:
my_var = 0
else:
my_var = 0
return my_var
df["new_column"] = df.progress_apply(lambda x: my_func(x['date1'], x['date2']), axis=1)
Been waiting for 30 minutes and still 0%. Any help is appreciated.
UPDATE (adding an example):
id | date1 | date2 | bool_column | new_column |
id1 2019-02-13 2019-04-11 1 0
id1 2019-03-15 2019-04-11 0 0
id1 2019-04-23 2019-04-11 0 1
id2 2019-08-22 2019-08-11 1 1
id2 ....
id3 2019-09-01 2019-09-30 1 1
.
.
.
What I need to do is save the ids that are 1 in my bool_column, then I am looping through all of the ids in my dataframe and checking if they are in the previously created list (= 1). Then I want to compare month and the year of date1 and date2 columns and if they are the same, create a new_column with a value 1 where they mach, otherwise, 0.
The pandas way to do this is
mask = ((df['date1'].month == df['date2'].month) & (df['id'].isin(my_list_of_ids)))
df['new_column'] = mask.replace({False: 0, True: 1})
Since you have a large data-set, this will take time, but should be faster than using apply
The best way to deal with the month match is to use vectorization in pandas and do this:
new_column = (df.date1.dt.month == df.date2.dt.month).astype(int)
That is, avoid using apply() over the DataFrame (which will probably be iterative) and take advantage of the underlying numpy vectorization. The gateway to such functionality is almost always in families of Series functions and properties, like the dt family for dates, str family for strings, and so forth.
Luckily, you have pre-computed the id_list membership in your bool_column, so to add membership as a criterion, just do this:
new_column = ((df.date1.dt.month == df.date2.dt.month) & df.bool_column).astype(int)
Once again, the & of two Series takes advantage of vectorization. You stay inside boolean space till the end, then cast to int with astype(int). Reviewing your code, it occurs to me that the iterative checking of your id_list may be the real performance hit here, even more so than the DataFrame.apply(). Whatever you do, avoid at all costs iterating your id_list at each row, since you already have a vector denoting membership in your bool_column.
By the way I believe there's a tiny error in your example data, the new_column value for your third row should be 0, since your bool_column value there is 0.

Resources