Categorizing data based on a string in each row - python-3.x

I have the following dataframe:
import pandas as pd

raw_data = {'name': ['Willard', 'Nan', 'Omar', 'Spencer'],
            'Last_Name': ['Smith', 'Nan', 'Sheng', 'Poursafar'],
            'favorite_color': ['blue', 'red', 'Nan', 'green'],
            'Statues': ['Match', 'Mis-Match', 'Match', 'Mis_match']}
df = pd.DataFrame(raw_data, columns=['name', 'Last_Name', 'favorite_color', 'Statues'])
df
I want to do the following tasks:
Separate the rows that contain Match and Mis-Match.
Make a category that only contains people whose first name and last name are Nan and who love a color (any color except Nan).
Can you help me?

Use boolean indexing:
df1 = df[df['Statues'] == 'Match']
df2 = df[df['Statues'] =='Mis-Match']
If the missing values are not strings, use Series.isna and Series.notna:
df3 = df[df['name'].isna() & df['Last_Name'].isna() & df['favorite_color'].notna()]
If the Nan values are strings, compare with 'Nan':
df3 = df[(df['name'] == 'Nan') &
         (df['Last_Name'] == 'Nan') &
         (df['favorite_color'] != 'Nan')]
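For reference, with the sample data above (where the missing entries are the literal string 'Nan'), df3 keeps only the row whose name and Last_Name are both 'Nan' but whose favorite_color is a real color:
print(df3)
#   name Last_Name favorite_color    Statues
# 1  Nan       Nan            red  Mis-Match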

Related

turn three columns into dictionary python

Name = [['Amy', 'A', 'Angu'],
        ['Jon', 'Johnson'],
        ['Bob', 'Barker']]
Other = [['Amy', 'Any', 'Anguish'],
         ['Jon', 'Jan'],
         ['Baker', 'barker']]
import pandas as pd
df = pd.DataFrame({'Other': Other,
                   'ID': ['E123', 'E456', 'E789'],
                   'Other_ID': ['A123', 'A456', 'A789'],
                   'Name': Name,
                   })
     ID            Name                Other Other_ID
0  E123  [Amy, A, Angu]  [Amy, Any, Anguish]     A123
1  E456  [Jon, Johnson]           [Jon, Jan]     A456
2  E789   [Bob, Barker]      [Baker, barker]     A789
I have the df as seen above. I want to make columns ID, Name and Other into a dictionary with the key being ID. I tried this, following python pandas dataframe columns convert to dict key and value:
todict = dict(zip(df.ID, df.Name))
Which is close to what I want
{'E123': ['Amy', 'A', 'Angu'],
'E456': ['Jon', 'Johnson'],
'E789': ['Bob', 'Barker']}
But I would like to get this output, which includes the values from the Other column:
{'E123': ['Amy', 'A', 'Angu','Amy', 'Any','Anguish'],
'E456': ['Jon', 'Johnson','Jon','Jan'],
'E789': ['Bob', 'Barker','Baker','barker']
}
And if I add the third column Other, it gives me an error:
todict = dict(zip(df.ID, df.Name, df.Other))
How do I get the output I want?
Why not just combine the Name and Other columns before creating the dict from the Name column?
df['Name'] = df['Name'] + df['Other']
dict(zip(df.ID, df.Name))
Gives
{'E123': ['Amy', 'A', 'Angu', 'Amy', 'Any', 'Anguish'],
'E456': ['Jon', 'Johnson', 'Jon', 'Jan'],
'E789': ['Bob', 'Barker', 'Baker', 'barker']}
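If you prefer not to overwrite the Name column, an equivalent one-liner (a small sketch relying on the fact that list-valued columns concatenate element-wise with +) is:
todict = dict(zip(df.ID, df.Name + df.Other))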

Pandas: check whether a searched prefix exists in the data and drop columns with no values

I have the below code snippet, which works fine.
import pandas as pd
import numpy as np
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
df = df.rename_axis(None)
Example File new_hosts
sj000001
sj000002
sj000003
sj000004
sj124000
sj125000
sj126000
sj127000
sj128000
sj129000
sj130000
sj131000
sj132000
cr000011
cr000012
cr000013
cr000014
crn00001
crn00002
crn00003
crn00004
euk000011
eu0000012
eu0000013
eu0000014
eu5000011
eu5000013
eu5000014
eu5000015
Current output:
      sj00 sj12      cr00      cr08       eu00       eu50
  sj000001       cr000011  crn00001  euk000011  eu5000011
  sj000002       cr000012  crn00002  eu0000012  eu5000013
  sj000003       cr000013  crn00003  eu0000013  eu5000014
  sj000004       cr000014  crn00004  eu0000014  eu5000015
What's expected:
1) The code works fine, but as you can see in the current output, the second column (sj12) doesn't have any values yet still appears. How could I add a check so that a column with no values is removed from the display?
2) Can we place a check for whether the prefixes exist in the dataframe before processing, to avoid an error?
Appreciate any help.
IIUC, before
df = df[ prefixes ].dropna(axis=0, how='all').replace(np.nan, '', regex=True)
you can do:
# remove all empty columns
df = df.dropna(axis=1, how='all')
That would solve the first part. For the second part, you can use reindex:
# select prefixes:
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50', 'sh00', 'dt00']
df = df.reindex(prefixes, axis=1).dropna(axis=1, how='all').replace(np.nan, '', regex=True)
Note the axis=1, not axis=0; this dropna is identical to what I propose for question 1.
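If you would rather check explicitly which of the requested prefixes exist before selecting them, a minimal sketch (using the prefixes list and the pivoted df from above) could be:
# keep only the prefixes that actually exist as columns after the pivot
present = [p for p in prefixes if p in df.columns]
missing = [p for p in prefixes if p not in df.columns]
if missing:
    print('prefixes not found in data:', missing)
df = df[present]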
Many thanks to Quang Hoang for the hints on the post. Just as a workaround, I got it working as follows until I get a better answer:
# Select prefixes
prefixes = ['sj00', 'sj12', 'cr00', 'cr08', 'eu00', 'eu50']
df = pd.read_csv('new_hosts', index_col=False, header=None)
df['prefix'] = df[0].str[:4]
df['grp'] = df.groupby('prefix').cumcount()
df = df.pivot(index='grp', columns='prefix', values=0)
df = df[prefixes]
# For column `sj12`, only extract values having `sj12` with a word character immediately after it, like `sj12[a-z]`
df['sj12'] = df['sj12'].str.extract(r'(\w{2}\d{2}\w\*)', expand=True)
df.replace('', np.nan, inplace=True)
# Remove the empty columns
df = df.dropna(axis=1, how='all')
# again drop rows where all values are NaN and replace NaN with empty strings in the remaining columns
df = df.dropna(axis=0, how='all').replace(np.nan, '', regex=True)
# remove the index name
df = df.rename_axis(None)
print(df)

xlsxwriter - Conditional formatting based on column name of the dataframe

I have a dataframe as below. I want to apply conditional formatting on column "Data2" using the column name. I know how to define a format for a specific column range, but I am not sure how to define it based on the column name, as shown below.
So basically I want to apply the same formatting keyed on the column name (because the order of the columns might change).
import pandas as pd

df1 = pd.DataFrame({'Data1': [10, 20, 30],
                    'Data2': ["a", "b", "c"]})
writer = pd.ExcelWriter('pandas_filter.xlsx', engine='xlsxwriter')
workbook = writer.book
df1.to_excel(writer, sheet_name='Sheet1', index=False)
worksheet = writer.sheets['Sheet1']
blue = workbook.add_format({'bg_color': '#000080', 'font_color': 'white'})
red = workbook.add_format({'bg_color': '#E52935', 'font_color': 'white'})
l = ['B2:B500']
for columns in l:
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'a',
                                           'format': blue})
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'b',
                                           'format': red})
writer.save()
Using xlsxwriter with xl_col_to_name we can get the column letter from the column index.
from xlsxwriter.utility import xl_col_to_name

target_col = xl_col_to_name(df1.columns.get_loc("Data2"))  # xl_col_to_name takes a zero-based index
l = [f'{target_col}2:{target_col}500']
for columns in l:
    ...
Using openpyxl with get_column_letter we can get the column letter from the column index.
from openpyxl.utils import get_column_letter

target_col = get_column_letter(df1.columns.get_loc("Data2") + 1)  # add 1 because get_column_letter indexing starts from 1
l = [f'{target_col}2:{target_col}500']
for columns in l:
    ...
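Putting the xlsxwriter variant together with the formats from the question, a minimal sketch (reusing the worksheet and the blue/red formats defined above) might look like:
from xlsxwriter.utility import xl_col_to_name

target_col = xl_col_to_name(df1.columns.get_loc("Data2"))
for columns in [f'{target_col}2:{target_col}500']:
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'a',
                                           'format': blue})
    worksheet.conditional_format(columns, {'type': 'text',
                                           'criteria': 'containing',
                                           'value': 'b',
                                           'format': red})
writer.save()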

Manipulating a dataframe conditionally

I have the following data and I am attempting to do the following:
If elements in tag_3 & tag_4 are 'NaN' then return an intermediate df with the following columns: tag_0, tag_1 & tag_2.
If elements in tag_4 only are 'NaN' then return another intermediate df with the following columns: tag_0, tag_2, tag_3.
Finally if ALL columns have non-NaN values then return an intermediate df with the following columns: tag_0, tag_3, tag_4.
DATA:
import pandas as pd

data = {'tag_0': ['1', '2', '3'],
        'tag_1': ['4', '5', '6'],
        'tag_2': ['7', '8', '9'],
        'tag_3': ['NaN', '10', '11'],
        'tag_4': ['NaN', 'NaN', '12']}
df_1 = pd.DataFrame(data, columns=['tag_0', 'tag_1', 'tag_2', 'tag_3', 'tag_4'])
dummy data
I like to use bool masks for this sort of task in pandas because I think it is easy to read, but there are other ways to go about it.
What is a bool mask?
A bool mask is essentially a Series of True/False values that is applied to a DataFrame to filter it.
Step 1: create the Series of True/False values.
tag_3_is_nan = df_1['tag_3'].isna()
tag_4_is_nan = df_1['tag_4'].isna()
Step 2: apply them to the DataFrame
df[bool_mask]
In your case this would be applied using the following logic.
Case 1: If elements in tag_3 & tag_4 are 'NaN' then return an intermediate df with the following columns: tag_0, tag_1 & tag_2.
df_1[tag_3_is_nan & tag_4_is_nan][['tag_0', 'tag_1', 'tag_2']]
Case 2: If elements in tag_4 only are 'NaN' then return another intermediate df with the following columns: tag_0, tag_2, tag_3.
df_1[tag_4_is_nan & ~tag_3_is_nan][['tag_0', 'tag_2', 'tag_3']]
The ~ is equal to not - so ~tag_3_is_nan means tag_3 is not nan.
Case 3: Finally if ALL columns have non-NaN values then return an intermediate df with the following columns: tag_0, tag_3, tag_4.
Dropping all rows that contain at least one NaN value is simple in pandas - just use the method dropna()
df_1.dropna()[['tag_0', 'tag_3', 'tag_4']]
To avoid a SettingWithCopyWarning down the line, you should .copy() the filtered df.
The above assumes real missing values (NaN/None), but your example uses 'NaN' as a string. You can use the same method if your data contains the string 'NaN' rather than actual missing values:
tag_3_is_nan_string = df_1['tag_3'] == 'NaN'
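Putting it together for the sample data above (where the missing entries are the literal string 'NaN'), a minimal sketch of the three cases might be:
tag_3_is_nan = df_1['tag_3'] == 'NaN'
tag_4_is_nan = df_1['tag_4'] == 'NaN'

# Case 1: tag_3 and tag_4 are both 'NaN' -> keep tag_0, tag_1, tag_2
case_1 = df_1[tag_3_is_nan & tag_4_is_nan][['tag_0', 'tag_1', 'tag_2']].copy()

# Case 2: only tag_4 is 'NaN' -> keep tag_0, tag_2, tag_3
case_2 = df_1[tag_4_is_nan & ~tag_3_is_nan][['tag_0', 'tag_2', 'tag_3']].copy()

# Case 3: no 'NaN' anywhere in the row -> keep tag_0, tag_3, tag_4
case_3 = df_1[~df_1.eq('NaN').any(axis=1)][['tag_0', 'tag_3', 'tag_4']].copy()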

Multi-index pandas dataframes: find the index values for which a column has more than one unique value

# import Pandas library
import pandas as pd

idx = pd.MultiIndex.from_product([['A001', 'B001', 'C001'],
                                  ['0', '1', '2']],
                                 names=['ID', 'Entries'])
col = ['A', 'B']
df = pd.DataFrame('-', idx, col)
df.loc['A001', 'A'] = [10,10,10]
df.loc['A001', 'B'] = [90,84,70]
df.loc['B001', 'A'] = [10,20,10]
df.loc['B001', 'B'] = [70,86,67]
df.loc['C001', 'A'] = [20,20,20]
df.loc['C001', 'B'] = [98,81,72]
#df is a dataframe
df
Following is the problem: how do I return the ID which has more than one unique value in column 'A'? In the above dataset, ideally it should return B001.
I would appreciate it if anyone could help me out with performing operations on multi-index pandas dataframes.
Use GroupBy.transform with nunique and filter by boolean indexing; then, for the values of the first level of the MultiIndex, add get_level_values with unique:
a = df[df.groupby(level=0)['A'].transform('nunique') > 1].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
Or use duplicated, but first convert the MultiIndex to columns with reset_index:
m = df.reset_index().duplicated(subset=['ID','A'], keep=False).values
a = df[~m].index.get_level_values(0).unique()
print(a)
Index(['B001'], dtype='object', name='ID')
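If you then want the full rows for those IDs rather than just the index values, one option (a small sketch) is:
# select every row whose first index level is one of the IDs found above
print(df[df.index.get_level_values('ID').isin(a)])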
