Apply styles to specific cells in a pandas MultiIndex DataFrame based on value comparison - python-3.x

I have a pandas MultiIndex DataFrame that looks something like this:
In [1]:
import pandas as pd
import numpy as np

iterables = [['Chemistry', 'Math', 'English'], ['Semester_1', 'Semester_2']]
columns = pd.MultiIndex.from_product(iterables)
index = ['Gabby', 'Sam', 'Eric', 'Joe']
df = pd.DataFrame(data=np.random.randint(50, 100, (len(index), len(columns))), index=index, columns=columns)
df

Out[1]:
       Chemistry              Math                English
      Semester_1 Semester_2 Semester_1 Semester_2 Semester_1 Semester_2
Gabby         86         80         63         50         87         75
Sam           57         84         91         84         60         87
Eric          67         64         52         96         84         70
Joe           51         68         74         69         85         86
I am trying to see if there were students whose grades dropped by more than 10 points in the last semester, color the cells containing the bad grade red, and export the whole table to Excel. For example, Gabby's Math grade in the second semester dropped 13 points, so I would like the cell containing "50" to be colored red.
Here is the full output I'm expecting.
I have tried the following:
def color_values(row):
    change = row['Semester_1'] - row['Semester_2']
    color = 'red' if change > 10 else ''
    return 'color: ' + color

for subject in ['English', 'Algebra', 'Geometry']:
    df = df.style.apply(color_values, axis=1, subset=[subject])
However, I'm getting the following error:
AttributeError                            Traceback (most recent call last)
<ipython-input-5-e83756bce6ef> in <module>
      1 for subject in ['English', 'Algebra', 'Geometry']:
----> 2     df = df.style.apply(color_values, axis=1, subset=[subject])

AttributeError: 'Styler' object has no attribute 'style'
I cannot figure out a way to do this. Please help.

In the first iteration of your loop, when subject is "English", you set df to a Styler object, i.e. the result of df.style. In the second iteration you then call .style on df, which is now a Styler rather than a DataFrame, hence the AttributeError. Create the Styler once and apply each subset's style to it:
styler = df.style
for subject in ['English', 'Algebra', 'Geometry']:
    styler.apply(color_values, axis=1, subset=[subject])
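Putting it together with the question's sample data, here is a minimal end-to-end sketch. Assumptions on my part: the subject list is adjusted to match the sample DataFrame, the two semester values are read positionally because the subset keeps the MultiIndex columns, and grades.xlsx is a placeholder filename (writing it requires an Excel engine such as openpyxl):

import numpy as np
import pandas as pd

iterables = [['Chemistry', 'Math', 'English'], ['Semester_1', 'Semester_2']]
columns = pd.MultiIndex.from_product(iterables)
index = ['Gabby', 'Sam', 'Eric', 'Joe']
df = pd.DataFrame(np.random.randint(50, 100, (len(index), len(columns))),
                  index=index, columns=columns)

def color_values(row):
    # with subset=[subject], each row holds exactly the two cells
    # (subject, 'Semester_1') and (subject, 'Semester_2')
    change = row.iloc[0] - row.iloc[1]
    # style only the Semester_2 cell; one style string per cell is required
    return ['', 'color: red' if change > 10 else '']

styler = df.style
for subject in ['Chemistry', 'Math', 'English']:
    styler = styler.apply(color_values, axis=1, subset=[subject])

styler.to_excel('grades.xlsx')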

Related

Appending DataFrame to empty DataFrame in {Key: Empty DataFrame (with columns)}

I am struggling to understand this one.
I have a regular df (with the same columns as the empty df in the dict) and an empty df which is a value in a dictionary (the keys in the dict are variable based on certain inputs, so there can be just one key/value pair or multiple key/value pairs; I think this might be relevant). The dict structure is essentially:
{key: [[Empty DataFrame
Columns: [list of columns]
Index: []]]}
I am using the following code to try and add the data:
dict[key].append(df, ignore_index=True)
The error I get is:
temp_dict[product_match].append(regular_df, ignore_index=True)
TypeError: append() takes no keyword arguments
Is this error due to me mis-specifying the value I am attempting to append the df to (like am I trying to append the df to the key instead here) or something else?
Your dictionary contains a list of lists at the key; we can see this in the shown output:
{key: [[Empty DataFrame Columns: [list of columns] Index: []]]}
#     ^^ list starts                                        ^^ list ends
For this reason dict[key].append is calling list.append, as mentioned by @nandoquintana.
To append to the DataFrame access the specific element in the list:
temp_dict[product_match][0][0].append(df, ignore_index=True)
Notice there is no in-place version of append; it always produces a new DataFrame:
Sample Program:
import numpy as np
import pandas as pd

temp_dict = {
    'key': [[pd.DataFrame()]]
}
product_match = 'key'
np.random.seed(5)
df = pd.DataFrame(np.random.randint(0, 100, (5, 4)))

temp_dict[product_match][0][0].append(df, ignore_index=True)
print(temp_dict)
Output (temp_dict was not updated):
{'key': [[Empty DataFrame
Columns: []
Index: []]]}
The new DataFrame will need to be assigned to the correct location.
Either a new variable:
some_new_variable = temp_dict[product_match][0][0].append(df, ignore_index=True)
some_new_variable
    0   1   2   3
0  99  78  61  16
1  73   8  62  27
2  30  80   7  76
3  15  53  80  27
4  44  77  75  65
Or back to the list:
temp_dict[product_match][0][0] = (
temp_dict[product_match][0][0].append(df, ignore_index=True)
)
temp_dict
{'key': [[    0   1   2   3
          0  99  78  61  16
          1  73   8  62  27
          2  30  80   7  76
          3  15  53  80  27
          4  44  77  75  65]]}
Assuming the DataFrame at the key is actually an empty DataFrame, append is unnecessary; simply updating the value at the key to be the new DataFrame works:
temp_dict[product_match] = df
temp_dict
{'key':     0   1   2   3
        0  99  78  61  16
        1  73   8  62  27
        2  30  80   7  76
        3  15  53  80  27
        4  44  77  75  65}
Or, if the list-of-lists structure is needed:
temp_dict[product_match] = [[df]]
temp_dict
{'key': [[    0   1   2   3
          0  99  78  61  16
          1  73   8  62  27
          2  30  80   7  76
          3  15  53  80  27
          4  44  77  75  65]]}
Maybe you have an empty list at dict[key]? Remember that the list append method (unlike the pandas DataFrame one) receives only one parameter and no keyword arguments:
https://docs.python.org/3/tutorial/datastructures.html#more-on-lists
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.append.html
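Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0. On recent versions the same update can be written with pd.concat; a minimal sketch using the nested temp_dict layout from above:

import pandas as pd

temp_dict = {'key': [[pd.DataFrame()]]}
product_match = 'key'
df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})

# concat also returns a new DataFrame, so assign it back into the nested list
temp_dict[product_match][0][0] = pd.concat(
    [temp_dict[product_match][0][0], df], ignore_index=True
)
print(temp_dict)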

Performing pair-wise comparisons of some pandas dataframe rows as efficiently as possible

For a given pandas dataframe df, I would like to compare every sample (row) with each other.
For bigger datasets this would lead to too many comparisons (n**2). Therefore, it is necessary to perform these comparisons only for smaller groups (i.e. for all of those which share the same id) and as efficiently as possible.
I would like to construct a dataframe (df_pairs), which contains in every row one pair. Additionally, I would like to get all pair indices (ideally as a Python set).
First, I construct an example dataframe:
import numpy as np
import pandas as pd
from functools import reduce
from itertools import product, combinations

n_samples = 10_000
suffixes = ["_1", "_2"]  # for df_pairs
id_str = "id"
df = pd.DataFrame({id_str: np.random.randint(0, 10, n_samples),
                   "A": np.random.randint(0, 100, n_samples),
                   "B": np.random.randint(0, 100, n_samples),
                   "C": np.random.randint(0, 100, n_samples)},
                  index=range(0, n_samples))
columns_df_pairs = ([elem + suffixes[0] for elem in df.columns] +
                    [elem + suffixes[1] for elem in df.columns])
In the following, I am comparing 4 different options with the corresponding performance measures:
Option 1
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [set(product(elem.tolist(), repeat=2)) for _, elem in groups.items()]  # determine pairs per group
set_of_pairs = reduce(set.union, pairs_per_group)  # convert all groups into one set
idcs1, idcs2 = zip(*[(e1, e2) for e1, e2 in set_of_pairs])
df_pairs = pd.DataFrame(np.hstack([df.values[idcs1, :], df.values[idcs2, :]]),  # construct the dataframe of pairs
                        columns=columns_df_pairs,
                        index=pd.MultiIndex.from_tuples(set_of_pairs, names=('index 1', 'index 2')))
df_pairs.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 1 takes 34.2 s ± 1.28 s.
Option 2
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array(np.meshgrid(elem.values, elem.values)).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs2 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs2.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 2 takes 13 s ± 1.34 s.
Option 3
groups = df.groupby(id_str).groups  # get the groups
pairs_per_group = [np.array([np.tile(elem.values, len(elem.values)), np.repeat(elem.values, len(elem.values))]).T.reshape(-1, 2) for _, elem in groups.items()]
idcs = np.unique(np.vstack(pairs_per_group), axis=0)
df_pairs3 = pd.DataFrame(np.hstack([df.values[idcs[:, 0], :], df.values[idcs[:, 1], :]]),  # construct the dataframe of pairs
                         columns=columns_df_pairs,
                         index=pd.MultiIndex.from_arrays([idcs[:, 0], idcs[:, 1]], names=('index 1', 'index 2')))
df_pairs3.drop([id_str + suffixes[0], id_str + suffixes[1]], inplace=True, axis=1)
Option 3 takes 12.1 s ± 347 ms.
Option 4
df_pairs4 = pd.merge(left=df, right=df, how="inner", on=id_str, suffixes=suffixes)
# here, I do not know how to get the MultiIndex in
df_pairs4.drop([id_str], inplace=True, axis=1)
Option 4 is computed the quickest with 1.41 s ± 239 ms. However, I do not have the paired indices in this case.
I could improve the performance a little by using itertools.combinations instead of itertools.product. I could also build the comparison matrix, use only its upper triangle, and construct my dataframe from there. However, this does not seem to be more efficient than performing the Cartesian product and then removing the self-references as well as the inverse comparisons (a, b) = (b, a).
Could you tell me a more efficient way to get pairs for comparison (ideally as a set to be able to use set operations)?
Could I use merge or another pandas function to construct my desired dataframe with the multi-indices?
An inner merge will destroy the index in favor of a new Int64Index. If the index is important, bring it along as a column with reset_index, then set those columns back as the index after the merge:
df_pairs4 = (pd.merge(left=df.reset_index(), right=df.reset_index(),
                      how="inner", on=id_str, suffixes=suffixes)
               .set_index(['index_1', 'index_2']))
                 id  A_1  B_1  C_1  A_2  B_2  C_2
index_1 index_2
0       0         4   92   79   10   92   79   10
        13        4   92   79   10   83   68   69
        24        4   92   79   10   67   73   90
        25        4   92   79   10   22   31   35
        36        4   92   79   10   64   44   20
...              ..  ...  ...  ...  ...  ...  ...
9993    9971      7   20   65   92   47   65   21
        9977      7   20   65   92   50   35   27
        9980      7   20   65   92   43   36   62
        9992      7   20   65   92   99    2   17
        9993      7   20   65   92   20   65   92
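If the self-comparisons and mirrored pairs (a, b) = (b, a) mentioned in the question should be dropped, the MultiIndex from the merge makes that straightforward. A short sketch continuing from df_pairs4 above:

# keep each unordered pair once and drop the self-comparisons (a, a)
idx1 = df_pairs4.index.get_level_values('index_1')
idx2 = df_pairs4.index.get_level_values('index_2')
df_pairs4 = df_pairs4[idx1 < idx2]

# the pair indices as a Python set, as asked for in the question
pair_set = set(df_pairs4.index)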

Splitting a time-formatted object doesn't work with Python and pandas

I have the simple line of code:
print(df['Duration'])
df['Duration'].str.split(':')
print(df['Duration'])
Here are the values I get for each print:
00:58:59
00:27:41
00:27:56
Name: Duration, dtype: object
Why is the split not working here? What am I missing?
str.split doesn't modify the column in place, so you need to assign the result to something:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df['Duration'] = df['Duration'].str.split(':')
print(df)
Prints:
       Duration  other
0  [00, 58, 59]     10
1  [00, 27, 41]     20
2  [00, 27, 56]     30
If you want to expand the columns of DataFrame by splitting, you can try:
import pandas as pd
df = pd.DataFrame({'Duration':['00:58:59', '00:27:41', '00:27:56'], 'other':[10, 20, 30]})
df[['hours', 'minutes', 'seconds']] = df['Duration'].str.split(':', expand=True)
print(df)
Prints:
   Duration  other hours minutes seconds
0  00:58:59     10    00      58      59
1  00:27:41     20    00      27      41
2  00:27:56     30    00      27      56
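If the goal is arithmetic on the durations rather than their string parts, converting the column with pd.to_timedelta may be more convenient than splitting. A small sketch with the same sample data:

import pandas as pd

df = pd.DataFrame({'Duration': ['00:58:59', '00:27:41', '00:27:56']})
# parse 'HH:MM:SS' strings into Timedelta values
df['Duration'] = pd.to_timedelta(df['Duration'])
print(df['Duration'].dt.total_seconds())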

Sum of all the column values in the given dataframe and display the output in a new dataframe

I have tried the below code:
import pandas as pd
dataframe = pd(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = daeframet.sum(axis=0)
print (sum_column)
I am getting the below error
TypeError: 'module' object is not callable
The error comes from calling the module pd as a function. It's difficult to know which pandas function you should be calling without knowing what C1 is, but if it is a dictionary or a pandas DataFrame, try:
import pandas as pd
# common to abbreviate dataframe as df
df = pd.DataFrame(C1, columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = df.sum(axis=0)
print(sum_column)
Using sum will only return a Series, not a DataFrame. There are many ways you can do this; let's try using select_dtypes and the to_frame() method:
import numpy as np
import pandas as pd
np.random.seed(5)
df = pd.DataFrame({'class': ['first', 'second', 'third', 'fourth', 'fifth'],
                   'School A': np.random.randint(1, 50, 5),
                   'School B': np.random.randint(1, 50, 5),
                   'School C': np.random.randint(1, 50, 5),
                   'School D': np.random.randint(1, 50, 5),
                   'School E': np.random.randint(1, 50, 5)})
print(df)
    class  School A  School B  School C  School D  School E
0   first        36        10        49        16        14
1  second        15         9        31        40        12
2   third        48        37        17        17         2
3  fourth        39        40         8        28        48
4   fifth        17        28        13        45        31
new_df = (df.select_dtypes(include='int').sum(axis=0).to_frame()
            .reset_index().rename(columns={0: 'Total', 'index': 'School'}))
print(new_df)
     School  Total
0  School A    155
1  School B    124
2  School C    118
3  School D    146
4  School E    107
Edit: it seems there are some typos in your code; the corrected version
import pandas as pd
dataframe = pd.DataFrame(C1,columns=['School-A','School-B','School-C','School-D','School-E'])
sum_column = dataframe.sum(axis=0)
print (sum_column)
will return the sum as a Series, and will also sum the text column by way of string concatenation:
class       firstsecondthirdfourthfifth
School A                            155
School B                            124
School C                            118
School D                            146
School E                            107
dtype: object
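Alternatively, the string concatenation of the class column can be avoided with sum's numeric_only flag; a small sketch reusing the dataframe from the corrected code above:

# sum only the numeric School columns and skip the text column
sum_column = dataframe.sum(axis=0, numeric_only=True)
print(sum_column)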

Pandas HDF limiting number of rows of CSV file

I have a 3 GB CSV file. I'm trying to save it to HDF format with pandas so I can load it faster.
import pandas as pd
import traceback
df_all = pd.read_csv('file_csv.csv', iterator=True, chunksize=20000)
for _i, df in enumerate(df_all):
    try:
        print('Saving %d chunk...' % _i, end='')
        df.to_hdf('file_csv.hdf',
                  'file_csv',
                  format='table',
                  data_columns=True)
        print('Done!')
    except:
        traceback.print_exc()
        print(df)
        print(df.info())
del df_all
The original CSV file has about 3 million rows, which is reflected in the output of this piece of code. The last line of output is: Saving 167 chunk...Done!
That means: 167 * 20000 = 3,340,000 rows
My issue is:
df_hdf = pd.read_hdf('file_csv.hdf')
df_hdf.count()
=> 4613 rows
And:
item_info = pd.read_hdf('ItemInfo_train.hdf', where="item=1")
Returns nothing, even though I'm sure the "item" column has an entry equal to 1 in the original file.
What can be wrong?
Use append=True to tell to_hdf to append new chunks to the same file.
df.to_hdf('file_csv.hdf', ..., append=True)
Otherwise, each call overwrites the previous contents and only the last chunk remains saved in file_csv.hdf.
import os
import numpy as np
import pandas as pd

np.random.seed(2016)
df = pd.DataFrame(np.random.randint(10, size=(100, 2)), columns=list('AB'))
df.to_csv('file_csv.csv')
if os.path.exists('file_csv.hdf'):
    os.unlink('file_csv.hdf')

for i, df in enumerate(pd.read_csv('file_csv.csv', chunksize=50)):
    print('Saving {} chunk...'.format(i), end='')
    df.to_hdf('file_csv.hdf',
              'file_csv',
              format='table',
              data_columns=True,
              append=True)
    print('Done!')
    print(df.loc[df['A'] == 1])
    print('-' * 80)

df_hdf = pd.read_hdf('file_csv.hdf', where="A=1")
print(df_hdf)
prints:
    Unnamed: 0  A  B
22          22  1  7
30          30  1  7
41          41  1  9
44          44  1  0
19          69  1  3
29          79  1  1
31          81  1  5
34          84  1  6
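An equivalent formulation is to keep a single HDFStore open and call its append method for each chunk; a short sketch under the same setup (like to_hdf, this requires PyTables):

import pandas as pd

# mode='w' starts a fresh file; store.append always adds rows to the table
with pd.HDFStore('file_csv.hdf', mode='w') as store:
    for chunk in pd.read_csv('file_csv.csv', chunksize=50):
        store.append('file_csv', chunk, format='table', data_columns=True)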
