Compare two classes with range of Marks - python-3.x

I have a dataframe with two classes (A or B) and marks and I want to present the mark ranges per class.
Dataframe:
Class Mark Department
A 74.0 1
A 73.0 2
B 72.0 1
A 75.0 1
B 64.0 2
What I want to achieve:
Class Mark Range
A 73.0-75.0
B 64.0-72.0
and I was thinking of using the min and max (creating a new field for the range). But as a start, I tried to just group it:
df['count'] = 1
result = df.pivot_table('count', index='Mark', columns='Class', aggfunc='sum').fillna(0)
which felt overly complex, so I abandoned it quickly.
I then kept only two columns in my dataframe (Mark and Class) and used the following:
df[['Mark','Class']].values
Now I just have to create the Mark Range column. I was wondering whether there is a simpler way, without all these steps, to pivot the data and get the range (min and max of one column grouped by another).

We can use GroupBy.apply to get the min and max per group and represent them as a string with an f-string:
df = (
    df.groupby('Class')['Mark'].apply(lambda x: f'{x.min()}-{x.max()}')
      .reset_index(name='Mark Range')
)
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
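If you would rather avoid apply, a named aggregation gives the same result; a minimal sketch (the min_mark/max_mark intermediate names are my own, and the example frame is rebuilt from the question):
import pandas as pd

df = pd.DataFrame({'Class': ['A', 'A', 'B', 'A', 'B'],
                   'Mark': [74.0, 73.0, 72.0, 75.0, 64.0],
                   'Department': [1, 2, 1, 1, 2]})

ranges = df.groupby('Class', as_index=False).agg(min_mark=('Mark', 'min'),
                                                 max_mark=('Mark', 'max'))
ranges['Mark Range'] = ranges['min_mark'].astype(str) + '-' + ranges['max_mark'].astype(str)
print(ranges[['Class', 'Mark Range']])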

Simple but ugly:
temp = df.groupby('Class')['Mark'].agg(['min', 'max'])
temp['range'] = temp['min'].map(str) + '-' + temp['max'].map(str)
Result of doing temp[['range']]:
range
Class
A 73.0-75.0
B 64.0-72.0

If you are interested in using pivot_table:
df_new = (df.pivot_table('Mark', 'Class', aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[1543]:
Class Mark Range
0 A 73.0-75.0
1 B 64.0-72.0
As in your comment, to add Department, just use the list ['Class', 'Department'] for the index, as follows:
df_new = (df.pivot_table('Mark', ['Class', 'Department'],
                         aggfunc=lambda x: f'{x.min()}-{x.max()}')
            .add_suffix(' Range').reset_index())
Out[259]:
Class Department Mark Range
0 A 1 74.0-75.0
1 A 2 73.0-73.0
2 B 1 72.0-72.0
3 B 2 64.0-64.0

Related

groupby consecutive identical values in pandas dataframe and cumulative count of the number of occurrences

I have a problem where I would like to count the number of times the current value has not changed in a dataframe over rolling periods.
For example:
df = pd.DataFrame({'col':list('aaaabbab')})
would somehow give output of
0
1
2
3
0
1
0
0
I have been trying something along the following lines:
df['col'] = df['col'] == df['col'].shift(1)
df.rolling(window=3).sum().reset_index(drop=True, level=0)
I have added in the rolling because I will want to look at the full data set in terms of rolling periods, but even without rolling periods I cannot quite figure out the logic.
I am not sure if I am missing something simple, or whether this may not be possible using shift.
You need to generate a grouper for the changes in value. For this, compare each value with the previous one and apply a cumsum. This gives you groups in the itertools.groupby style ([1, 1, 1, 1, 2, 2, 3, 4]); finally, group on that and apply a cumcount.
df['count'] = (df.groupby(df['col'].ne(df['col'].shift()).cumsum())
                 .cumcount())
output:
col count
0 a 0
1 a 1
2 a 2
3 a 3
4 b 0
5 b 1
6 a 0
7 b 0
Edit: for fun, here is a solution using itertools (much faster):
from itertools import groupby, chain

df['count'] = list(chain(*(list(range(len(list(g))))
                           for _, g in groupby(df['col']))))
NB. this runs much faster (88 µs vs 707 µs on the provided example)
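For reference, here is a self-contained sketch that runs both versions on the example frame and checks that they agree (timings will of course vary by machine):
from itertools import chain, groupby

import pandas as pd

df = pd.DataFrame({'col': list('aaaabbab')})

# pandas version: group on the change points, then count within each run
pandas_count = df.groupby(df['col'].ne(df['col'].shift()).cumsum()).cumcount()

# itertools version: enumerate each run directly
iter_count = list(chain(*(range(len(list(g))) for _, g in groupby(df['col']))))

assert pandas_count.tolist() == iter_count == [0, 1, 2, 3, 0, 1, 0, 0]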
I can't comment, so just to add some more to @mozway's answer.
My goal was to count consecutive values for an entire, huge dataframe efficiently.
The problem I encountered is that, by construction,
np.nan == np.nan
returns False, so you could have a whole column containing only NaN and yet the counter would stay at 0.
A simple workaround would be to replace all NaN in your df by a value not already in it.
For instance, in the case of a float dataset, you could do
df.fillna('NA')
which will work, but it changes the dtype of your columns to object, and that makes the following code much slower (20x on my setup).
I would rather advise something like:
import numpy as np
from itertools import groupby, chain

all_values = list(np.unique(np.array(df)))
all_values = [a for a in all_values if a == a]  # drop NaN (NaN != NaN)
unik_val = min(all_values) - 1                  # a sentinel value not present in the data
temp = df.fillna(unik_val).copy()

for col in temp.columns:
    temp[col] = list(chain(*(list(range(len(list(g))))
                             for _, g in groupby(temp[col]))))
temp  # temp now holds the consecutive counts for every column

How to turn a column of a data frame into suffixes for other column names? [duplicate]

Suppose I have a data frame like this:
A B C D
0 1 10 x 5
1 1 20 y 5
2 1 30 z 5
3 2 40 x 6
4 2 50 y 6
5 2 60 z 6
This can be viewed as a table that stores the value of B as a function of A, C, and D. Now, I would like to transform the B column into three columns B_x, B_y, B_z, like this:
A B_x B_y B_z D
0 1 10 20 30 5
1 2 40 50 60 6
I.e., B_x stores B(A, D) when C = 'x', B_y stores B(A, D) when C = 'y', etc.
What is the most efficient way to do this?
I have found a solution like this:
frames = []
for c, subframe in df.groupby('C'):
    subframe = subframe.rename(columns={'B': f'B_{c}'})
    subframe = subframe.set_index(['A', 'D'])
    del subframe['C']
    frames.append(subframe)

out = frames[0]
for frame in frames[1:]:
    out = out.join(frame)
out = out.reset_index()
This gives the correct result, but I feel that it is highly inefficient. I am also not too happy with the fact that, to implement this solution, one needs to know explicitly which columns should not get the suffix from column C. (In this MWE there were only two of them, but there could be tens in real life.)
Is there a better solution? I.e., a method that says: take a column as a suffix column (in this case C) and a set of 'value' columns (in this case only B), turn the value column names into name_suffix, and fill them appropriately?
Here's one way to do it:
import pandas as pd

df = pd.DataFrame(data={'A': [1, 1, 1, 2, 2, 2],
                        'B': [10, 20, 30, 40, 50, 60],
                        'C': ['x', 'y', 'z', 'x', 'y', 'z'],
                        'D': [5, 5, 5, 6, 6, 6]})

df2 = df.pivot_table(index=['A', 'D'],
                     columns=['C'],
                     values=['B'])
df2.columns = ['_'.join(col) for col in df2.columns.values]
df2 = df2.reset_index()
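To address the "tens of columns" concern from the question, the index columns can also be derived rather than listed by hand. A rough sketch, reusing the df built just above (the suffix_pivot helper is my own, not part of the answer):
def suffix_pivot(df, suffix_col, value_cols):
    """Pivot value_cols against suffix_col, keeping every other column as the index."""
    index_cols = [c for c in df.columns if c != suffix_col and c not in value_cols]
    out = df.pivot_table(index=index_cols, columns=suffix_col, values=value_cols)
    out.columns = ['_'.join(map(str, col)) for col in out.columns]
    return out.reset_index()

# pivot_table aggregates duplicate entries with the mean by default, so values may come back as floats
print(suffix_pivot(df, 'C', ['B']))  # columns: A, D, B_x, B_y, B_z, one row per (A, D) pair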

Randomly select and assign values to given number of rows in python dataframe

How can I randomly select and assign values to a given number of rows in a python dataframe?
Col B contains only 1's and 0's.
Suppose I have a dataframe as below
Col A Col B
A 0
B 0
A 0
B 0
C 0
A 0
B 0
C 0
D 0
A 0
I aim to randomly choose 5% of the rows and change the value of Col B to 1. I saw df.sample() but that won't allow me to make in-place changes to the column data.
You can try the built-in random library; it has its own sample function.
import random

randindx = random.sample(range(0, dataframe['Col B'].size),
                         dataframe['Col B'].size // 20)
Considering 5%, you need to divide by 20.
You can first use the sample method to get the random 5% of examples and get hold of their indices like so:
samples_indices = df.sample(frac=0.05, replace=False).index
With the knowledge of the indices, loc method can be used to update the values corresponding to the examples.
df.loc[samples_indices, 'Col B'] = 1
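Putting the two lines together, a minimal end-to-end sketch (the 1,000-row frame is made up here so that 5% amounts to a visible number of rows):
import pandas as pd

# Hypothetical 1,000-row frame standing in for the real data
df = pd.DataFrame({'Col A': list('ABCD') * 250, 'Col B': 0})

samples_indices = df.sample(frac=0.05, replace=False).index  # pick 5% of the rows
df.loc[samples_indices, 'Col B'] = 1

print(df['Col B'].sum())  # 50 rows were flipped to 1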

Normalize Column Values by Monthly Averages with added Group dimension

Initial Note
I already got this running, but it takes a very long time to execute. My DataFrame is around 500 MB. I am hoping for some feedback on how to execute this as quickly as possible.
Problem Statement
I want to normalize the DataFrame columns by the mean of each column's values during each month. An added complexity is that I have a column named group which denotes the sensor with which the parameter (column) was measured. Therefore, the analysis needs to iterate over each group and each month.
DF example
X Y Z group
2019-02-01 09:30:07 1 2 1 'grp1'
2019-02-01 09:30:23 2 4 3 'grp2'
2019-02-01 09:30:38 3 6 5 'grp1'
...
Code (Functional, but slow)
This is the code that I used. Code comments describe most lines. I recognize that the three for loops are causing the runtime issue, but I do not see a way around them. Does anyone know a better approach?
# Get mean monthly values for each group
mean_per_month_unit = process_df.groupby('group').resample('M').mean()
# Store the monthly dates created in the last line into a list called month_dates
month_dates = mean_per_month_unit.index.get_level_values(1)
# Place date on MultiIndex columns. Future note: use df[DATE, COL_NAME][UNIT] to access the mean value
mean_per_month_unit = mean_per_month_unit.unstack().swaplevel(0, 1, 1).sort_index(axis=1)

divide_df = pd.DataFrame().reindex_like(df)
process_cols.remove('group')
for grp in group_list:
    print(grp)
    # Iterate through the months
    for mnth in month_dates:
        # Build a mask for this month and group
        mask = (df.index.month == mnth.month) & (df['group'] == grp)
        for col in process_cols:
            # Set the values of divide_df
            divide_df.iloc[mask.tolist(), divide_df.columns.get_loc(col)] = mean_per_month_unit[mnth, col][grp]

# Divide process_df by divide_df
final_df = process_df / divide_df.values
EDIT: Example data
Here is the data in CSV format.
EDIT2: Current code (according to current answer)
def normalize_df(df):
    df['month'] = df.index.month
    print(df['month'])
    df['year'] = df.index.year
    print(df['year'])

    def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
        agg = df.groupby(by=['group', 'month', 'year'], as_index=True).mean()
        print("###################", x.name, x['month'])
        for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by it
            print(column)
            mean_col = agg.loc[(x['group'], x['month'], x['year']), column]
            print(mean_col)
            col_name = "norm" + str(column)
            x[col_name] = x[column] / mean_col  # norm
        return x

    normalize_cols = df.columns.tolist()
    normalize_cols.remove('group')
    # normalize_cols.remove('mode')
    df2 = df.apply(find_norm, df_col_list=normalize_cols, axis=1)
The code runs perfectly for one iteration and then it fails with the error:
KeyError: ('month', 'occurred at index 2019-02-01 11:30:17')
As I said, it runs correctly once. However, it iterates over the same row again and then fails. I see in the df.apply() documentation that the first row always runs twice. I'm just not sure why it fails on the second time through.
Assuming that the requirement is to normalize the columns by their mean per group and month, here is another approach:
Step 1: Create new columns, month and year, from the index. df.index.month can be used for this, provided the index is of type DatetimeIndex:
type(df.index) # df is the original dataframe
#pandas.core.indexes.datetimes.DatetimeIndex
df['month'] = df.index.month
df['year'] = df.index.year # added year assuming the grouping occurs per grp per month per year. No need to add this column if year is not to be considered.
Step 2: Now, group over (grp, month, year) and aggregate to find the mean of every column. (Year is added assuming the grouping occurs per grp per month per year; no need to add this column if year is not to be considered.)
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()
Step 3: Use a function to calculate the normalized values and use apply() over the original dataframe:
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
# df2 will now have 3 additional columns - normA, normB, normC
df2:
A B C grp month year normA normB normC
2019-02-01 09:30:07 1 2 3 1 2 2019 0.666667 0.8 1.5
2019-03-02 09:30:07 2 3 4 1 3 2019 1.000000 1.0 1.0
2019-02-01 09:40:07 2 3 1 2 2 2019 1.000000 1.0 1.0
2019-02-01 09:38:07 2 3 1 1 2 2019 1.333333 1.2 0.5
Alternatively, for step 3, one can join the agg and df dataframes and find the norm.
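A rough sketch of that join-based alternative, reusing the agg frame and the A/B/C, grp/month/year columns assumed above (the '_mean' suffix naming is my own):
merged = (df.reset_index()
            .merge(agg.reset_index(), on=['grp', 'month', 'year'], suffixes=('', '_mean'))
            .set_index('index'))
for column in ['A', 'B', 'C']:
    merged['norm' + column] = merged[column] / merged[column + '_mean']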
Hope this helps!
Here is how the code would look:
# Step 1
df['month'] = df.index.month
df['year'] = df.index.year  # added year assuming the grouping occurs per grp per month per year

# Step 2
agg = df.groupby(by=['grp', 'month', 'year'], as_index=True).mean()

# Step 3
def find_norm(x, df_col_list):  # x is a row in the dataframe, df_col_list is the list of columns to normalize
    for column in df_col_list:  # iterate over the column list, find the mean from the aggregations, and divide the value by the mean
        mean_col = agg.loc[(str(x['grp']), x['month'], x['year']), column]
        col_name = "norm" + str(column)
        x[col_name] = x[column] / mean_col  # norm
    return x

df2 = df.apply(find_norm, df_col_list=['A', 'B', 'C'], axis=1)
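If speed is the main concern, a vectorized variation (my own, not part of the answer above) avoids the row-wise apply entirely by letting groupby(...).transform('mean') broadcast the per-group monthly means back onto every row:
# Assumes the same grp/month/year columns and value columns A, B, C as above
value_cols = ['A', 'B', 'C']
group_means = df.groupby(['grp', 'month', 'year'])[value_cols].transform('mean')
normed = df[value_cols] / group_means
df2 = df.join(normed.add_prefix('norm'))  # adds normA, normB, normC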

Take a column from a Dataframe and normalize all of the other columns against it?

I've got a Dataframe like this:
df = pd.DataFrame(np.reshape(np.arange(0,9), (3,3)))
print(df)
0 1 2
0 0 1 2
1 3 4 5
2 6 7 8
I'd like to normalize two of the columns against a reference column. For example, if I choose df[0] as my reference column, then df[1] and df[2] would also end up with a mean of 3 and a standard deviation of 3.
What's the best way to do this?
You can shift and scale the values in each column by the mean and standard deviation of the reference column ref:
ref = 0
means = df.mean()
stds = df.std()
(df - means) / stds * stds[ref] + means[ref]
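A quick self-contained check, using the example frame from the question, that every column ends up with the reference column's mean and standard deviation:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.reshape(np.arange(0, 9), (3, 3)))

ref = 0
means = df.mean()
stds = df.std()
normed = (df - means) / stds * stds[ref] + means[ref]

print(normed.mean())  # every column now has the reference mean (3.0)
print(normed.std())   # and the reference standard deviation (3.0)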
