Match pandas column values and headers across dataframes - python-3.x

I have 3 files that I am reading into dataframes (https://pastebin.com/v7BnSH3s)
map_df: Maps data_file headers to codes_df headers
Field Name Code Name
Gender gender_codes
Race race_codes
Ethnicity ethnicity_codes
code_df: Valid codes
gender_codes race_codes ethnicity_codes
1 1 1
2 2 2
3 3 3
4 4 4
NaN NaN 5
NaN NaN 6
NaN NaN 7
data_df: the actual data that needs to be checked against the codes
Name Gender Race Ethnicity
Alex 99 1 7
Cindy 2 4 5
Tom 1 99 1
Problem:
I need to confirm that each value in every column of data_df is a valid code. If not, I need to write out the Name, the invalid value, and the corresponding column header to a result dataframe. So my example data_df would yield the following dataframe for the gender_codes check:
result_df:
Name Value Column
Alex 99 Gender
Background:
My actual data file has over 100 columns.
A code column can map to multiple columns in the data_df.
I am currently not using the map_df other than to know which columns map to which codes. However, if I can incorporate this into my script, that would be ideal.
What I've tried:
I am currently sending each code column to a list, removing the nan string, performing the lookup with loc and isin, then setting up the result_df...
# code column to list
gender_codes = codes_df["gender_codes"].tolist()
# remove nan string
gender_codes = [gender_codes
                for gender_codes in gender_codes
                if str(gender_codes) != "nan"]
# check each value against code list
result_df = data_df.loc[(~data_df.Gender.isin(gender_codes))]
result_df = result_df.filter(items = ["Name","Gender"])
result_df.rename(columns = {"Gender":"Value"}, inplace = True)
result_df['Column'] = 'Gender'
This works but obviously is extremely primitive and won't scale with my dataset. I'm hoping to find an iterative and pythonic approach to this problem.
EDIT:
Modified Dataset with np.nan
https://pastebin.com/v7BnSH3s

Boolean indexing
I'd reformat your data into different forms
m = dict(map_df.itertuples(index=False))
c = code_df.T.stack().groupby(level=0).apply(set)
ddf = data_df.melt('Name', var_name='Column', value_name='Value')
ddf[[val not in c[col] for val, col in zip(ddf.Value, ddf.Column.map(m))]]
Name Column Value
0 Alex Gender 99
5 Tom Race 99
Details
m # Just a dictionary with the same content as `map_df`
{'Gender': 'gender_codes',
'Race': 'race_codes',
'Ethnicity': 'ethnicity_codes'}
c # Series of membership sets
ethnicity_codes {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0}
gender_codes {1.0, 2.0, 3.0, 4.0}
race_codes {1.0, 2.0, 3.0, 4.0}
dtype: object
ddf # Melted dataframe to help match the final output
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1

You will need to preprocess your dataframes and define a validation function. Something like below:
1. Preprocessing
# call melt() to convert columns to rows
mcodes = codes_df.melt(
    value_vars=list(codes_df.columns),
    var_name='Code Name',
    value_name='Valid Code').dropna()
mdata = data_df.melt(
    id_vars='Name',
    value_vars=list(data_df.columns[1:]),
    var_name='Column',
    value_name='Value')
validation_df = mcodes.merge(map_df, on='Code Name')
Out:
mcodes:
Code Name Valid Code
0 gender_codes 1
1 gender_codes 2
7 race_codes 1
8 race_codes 2
9 race_codes 3
10 race_codes 4
14 ethnicity_codes 1
15 ethnicity_codes 2
16 ethnicity_codes 3
17 ethnicity_codes 4
18 ethnicity_codes 5
19 ethnicity_codes 6
20 ethnicity_codes 7
mdata:
Name Column Value
0 Alex Gender 99
1 Cindy Gender 2
2 Tom Gender 1
3 Alex Race 1
4 Cindy Race 4
5 Tom Race 99
6 Alex Ethnicity 7
7 Cindy Ethnicity 5
8 Tom Ethnicity 1
validation_df:
Code Name Valid Code Field Name
0 gender_codes 1 Gender
1 gender_codes 2 Gender
2 race_codes 1 Race
3 race_codes 2 Race
4 race_codes 3 Race
5 race_codes 4 Race
6 ethnicity_codes 1 Ethnicity
7 ethnicity_codes 2 Ethnicity
8 ethnicity_codes 3 Ethnicity
9 ethnicity_codes 4 Ethnicity
10 ethnicity_codes 5 Ethnicity
11 ethnicity_codes 6 Ethnicity
12 ethnicity_codes 7 Ethnicity
2. Validation Function
def isValid(row):
    valid_list = validation_df[validation_df['Field Name'] == row.Column]['Valid Code'].tolist()
    return row.Value in valid_list
3. Validation
mdata['isValid'] = mdata.apply(isValid, axis=1)
result = mdata[mdata.isValid == False]
Out:
result:
Name Column Value isValid
0 Alex Gender 99 False
5 Tom Race 99 False
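If the row-wise apply() becomes slow with 100+ columns, one vectorized alternative (a sketch that is not part of the original answer, reusing mdata and validation_df from above) is a left merge with indicator=True:
# align the dtypes of the merge keys (the valid codes become floats after dropna)
mdata['Value'] = mdata['Value'].astype(float)
valid = validation_df.rename(columns={'Field Name': 'Column', 'Valid Code': 'Value'})
# left-merge the melted data against the valid (Column, Value) pairs;
# rows that find no match carry invalid codes
checked = mdata.merge(valid[['Column', 'Value']], on=['Column', 'Value'],
                      how='left', indicator=True)
result = checked.loc[checked['_merge'] == 'left_only', ['Name', 'Column', 'Value']]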

m, df1 = dict(map_df.values), data_df.set_index('Name')
df1[df1.apply(lambda x:~x.isin(code_df[m[x.name]]))].stack().reset_index()
Out:
Name level_1 0
0 Alex Gender 99.0
1 Tom Race 99.0
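If you want the same column labels as result_df in the question (an assumption about the desired output, not part of the original answer), you can rename the stacked result:
res = df1[df1.apply(lambda x: ~x.isin(code_df[m[x.name]]))].stack().reset_index()
res.columns = ['Name', 'Column', 'Value']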

Related

For and if loop combination takes a lot of time in Pandas (Data manipulation)

I have two datasets, each with about half a million observations. I am running the code below and it never seems to finish executing. I would like to know if there is a better way of doing it. Appreciate inputs.
Below are sample formats of my dataframes. Both dataframes share a set of 'sid' values, meaning all the 'sid' values in 'df2' will have a match in the 'df1' 'sid' values. The 'tid' values, and consequently the 'rid' values (which are a combination of 'sid' and 'tid' values), may not appear in both sets.
The task is simple. I would like to create the 'tv' column in df2. Wherever the 'rid' in df2 matches the 'rid' in 'df1', the 'tv' column in df2 takes the corresponding 'tv' value from df1. If it does not match, the 'tv' value in 'df2' will be the median 'tv' value for the matching 'sid' subset in 'df1'.
In fact my original task includes creating a few more similar columns like 'tv' in df2 (based on their values in 'df1'; these columns exist in 'df1').
I believe that because my code contains a for loop combined with an if/else statement and multiple value assignments, it is taking forever to execute. Appreciate any inputs.
df1
sid tid rid tv
0 0 0 0-0 9
1 0 1 0-1 8
2 0 3 0-3 4
3 1 5 1-5 2
4 1 7 1-7 3
5 1 9 1-9 14
6 1 10 1-10 24
7 1 11 1-11 13
8 2 14 2-14 2
9 2 16 2-16 5
10 3 17 3-17 6
11 3 18 3-18 8
12 3 20 3-20 5
13 3 21 3-21 11
14 4 23 4-23 6
df2
sid tid rid
0 0 0 0-0
1 0 2 0-2
2 1 3 1-3
3 1 6 1-6
4 1 9 1-9
5 2 10 2-10
6 2 12 2-12
7 3 1 3-1
8 3 15 3-15
9 3 1 3-1
10 4 19 4-19
11 4 22 4-22
rids = [rid.split('-') for rid in df1.rid]
for r in df2.rid:
    s, t = r.split('-')
    if [s, t] in rids:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.rid == r, 'tv']
    else:
        df2.loc[df2.rid == r, 'tv'] = df1.loc[df1.sid == int(s), 'tv'].median()
The expected df2 shall be as follows:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
You can left merge df2 with a subset of df1 (since you only need the tv column; you could also pass df1 without subsetting) on 'rid', then calculate the per-'sid' median and fill the missing values:
out = df2.merge(df1[['rid', 'tv']], on='rid', how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
OR
Since you said that:
all the 'sid' values in 'df2' will have a match in 'df1' 'sid' values
So you can also left merge them on ['sid', 'rid'] and then fillna() the missing 'tv' values with the per-'sid' median of the df1 'tv' column, mapped in with the map() method:
out = df2.merge(df1[['sid', 'rid', 'tv']], on=['sid', 'rid'], how='left')
out['tv'] = out['tv'].fillna(out['sid'].map(df1.groupby('sid')['tv'].median()))
out
output of out:
sid tid rid tv
0 0 0 0-0 9.0
1 0 2 0-2 8.0
2 1 3 1-3 13.0
3 1 6 1-6 13.0
4 1 9 1-9 14.0
5 2 10 2-10 3.5
6 2 12 2-12 3.5
7 3 1 3-1 7.0
8 3 15 3-15 7.0
9 3 1 3-1 7.0
10 4 19 4-19 6.0
11 4 22 4-22 6.0
Here is a suggestion without any loops, based on dictionaries:
matching_values = dict(zip(df1['rid'], df1['tv']))  # 'rid' -> 'tv' for every row of df1
matched = df2['rid'].isin(df1['rid'])
# for matched rows, start from the 'rid' key and replace it with the dictionary value
df2.loc[matched, 'tv'] = df2.loc[matched, 'rid'].replace(matching_values)
# per-'sid' medians from df1 for the remaining rows
median_values = df1.groupby('sid')['tv'].median().to_dict()
df2.loc[~matched, 'tv'] = df2.loc[~matched, 'sid'].replace(median_values)
This should do the trick. The logic here is that we first build dictionaries in which the 'rid' and 'sid' values are the keys and the matching and median 'tv' values are the dictionary values. Next, we seed the 'tv' values in df2 from the 'rid' and 'sid' keys, respectively (because they are the dictionary keys), which can thus easily be replaced by the correct 'tv' values by calling .replace().
Don't use for loops in pandas; they are known to be slow, and you don't get to benefit from all the internal optimizations that have been made.
Try to use the split-apply-combine pattern:
split df1 by 'sid' to calculate the median: df1.groupby('sid')['tv'].median()
join df2 on df1: df2.join(df1.set_index('rid'), on='rid')
fill the NaN values with the median calculated in step 1, as sketched below.
(Haven't tested the code).
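A minimal sketch of those three steps (likewise untested against the real data, assuming df1 and df2 are shaped as in the question):
# step 1: per-sid medians from df1
medians = df1.groupby('sid')['tv'].median()
# step 2: join df2 to df1's tv column via rid
out = df2.join(df1.set_index('rid')['tv'], on='rid')
# step 3: fill the rids that had no match with the per-sid median
out['tv'] = out['tv'].fillna(out['sid'].map(medians))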

Can we separate data using Unique ID into the following format?

Current Format:
UNIQUE ID  NAME   AGE  DEP  RANK
001        John   10   4th  1
002        Priya  11   4th  2
003        Jack   15   5th  2
004        Jill   14   5th  1
Expected Format:
UNIQUE ID  NAME   COLUMN_NO
001        John   1
001        10     2
001        4th    3
001        1      4
002        Priya  1
002        11     2
002        4th    3
002        2      4
My starting point:
>>> df
UNIQUE ID NAME AGE DEP RANK
0 1 John 10 4th 1
1 2 Priya 11 4th 2
2 3 Jack 15 5th 2
3 4 Jill 14 5th 1
The basic transformation you need is provided by df.stack, which results in:
0 UNIQUE ID 1
NAME John
AGE 10
DEP 4th
RANK 1
1 UNIQUE ID 2
NAME Priya
[...]
However, you want column UNIQUE ID to be treated separately. This can be accomplished by making it the index:
>>> df.set_index('UNIQUE ID').stack()
UNIQUE ID
1 NAME John
AGE 10
DEP 4th
RANK 1
2 NAME Priya
AGE 11
DEP 4th
RANK 2
The last missing bit is the column names: you want them renamed to numbers. This could be accomplished in two different ways: a) by re-assigning df.columns (after having moved column UNIQUE ID to the index first):
df = df.set_index('UNIQUE ID')
df.columns = range(1, 5)
or b) by df.renaming the columns:
df = df.set_index('UNIQUE ID')
df = df.rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
And finally you can convert the resulting Series back to a DataFrame. The most elegant way to get COLUMN NO at the right place is using df.rename_axis before stacking. All together as one expression (possibly better to split it up):
>>> (df.set_index('UNIQUE ID')
...    .rename(columns={'NAME': 1, 'AGE': 2, 'DEP': 3, 'RANK': 4})
...    .rename_axis('COLUMN NO', axis=1)
...    .stack()
...    .to_frame('NAME')
...    .reset_index())
UNIQUE ID COLUMN NO NAME
0 1 1 John
1 1 2 10
2 1 3 4th
3 1 4 1
4 2 1 Priya
5 2 2 11
6 2 3 4th
7 2 4 2
8 3 1 Jack
9 3 2 15
10 3 3 5th
11 3 4 2
12 4 1 Jill
13 4 2 14
14 4 3 5th
15 4 4 1
Things left out: reading the data, and preserving the correct type. UNIQUE ID only looks numeric, but it has leading zeros that probably want to be preserved, so parsing it as a string would be better.
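For example, a minimal sketch of the reading step (assuming the data lives in a CSV file named students.csv, which the question does not provide):
import pandas as pd
# force UNIQUE ID to be read as a string so the leading zeros survive
df = pd.read_csv('students.csv', dtype={'UNIQUE ID': str})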

Display row with False values in validated pandas dataframe column [duplicate]

This question already has answers here:
Display rows with one or more NaN values in pandas dataframe
(5 answers)
Closed 2 years ago.
I was validating the 'Price' column in my dataframe. Sample:
ArticleId SiteId ZoneId Date Quantity Price CostPrice
53 194516 9 2 2018-11-26 11.0 40.64 27.73
164 200838 9 2 2018-11-13 5.0 99.75 87.24
373 200838 9 2 2018-11-27 1.0 99.75 87.34
pd.to_numeric(df_sales['Price'], errors='coerce').notna().value_counts()
And I'd love to display those rows with False values so I know what's wrong with them. How do I do that?
True 17984
False 13
Name: Price, dtype: int64
Thank you.
You could print the rows where Price isnull():
print(df_sales[df_sales['Price'].isnull()])
ArticleId SiteId ZoneId Date Quantity Price CostPrice
1 200838 9 2 2018-11-13 5 NaN 87.240
pd.to_numeric(df['Price'], errors='coerce').isna() returns a Boolean Series, which can be used to select the rows that cause errors.
This includes NaN values as well as rows with strings that can't be parsed as numbers.
import pandas as pd
# test data
df = pd.DataFrame({'Price': ['40.64', '99.75', '99.75', pd.NA, 'test', '99. 0', '98 0']})
Price
0 40.64
1 99.75
2 99.75
3 <NA>
4 test
5 99. 0
6 98 0
# find the value of the rows that are causing issues
problem_rows = df[pd.to_numeric(df['Price'], errors='coerce').isna()]
# display(problem_rows)
Price
3 <NA>
4 test
5 99. 0
6 98 0
Alternative
Create an extra column and then use it to select the problem rows
df['Price_Updated'] = pd.to_numeric(df['Price'], errors='coerce')
Price Price_Updated
0 40.64 40.64
1 99.75 99.75
2 99.75 99.75
3 <NA> NaN
4 test NaN
5 99. 0 NaN
6 98 0 NaN
# find the problem rows
problem_rows = df.Price[df.Price_Updated.isna()]
Explanation
Updating the column with .to_numeric(), and then checking for NaNs will not tell you why the rows had to be coerced.
# update the Price row
df.Price = pd.to_numeric(df['Price'], errors='coerce')
# check for NaN
problem_rows = df.Price[df.Price.isnull()]
# display(problem_rows)
3 NaN
4 NaN
5 NaN
6 NaN
Name: Price, dtype: float64

Groupby and create a new column by randomly assigning multiple strings to it in Pandas

Let's say I have students' info: id, age and class, as follows:
id age class
0 1 23 a
1 2 24 a
2 3 25 b
3 4 22 b
4 5 16 c
5 6 16 d
I want to group by class and create a new column named major by randomly assigning math, art, business, or science to it; for the same class, the major strings should be the same.
We may need to use apply(lambda x: random.choice..) to realize this, but I don't know how to do this. Thanks for your help.
Output expected:
id age major class
0 1 23 art a
1 2 24 art a
2 3 25 science b
3 4 22 science b
4 5 16 business c
5 6 16 math d
Use numpy.random.choice with the number of values set to the length of the DataFrame:
df['major'] = np.random.choice(['math', 'art', 'business', 'science'], size=len(df))
print (df)
id age major
0 1 23 business
1 2 24 art
2 3 25 science
3 4 22 math
4 5 16 science
5 6 16 business
EDIT: for the same major value per group, use Series.map with a dictionary:
c = df['class'].unique()
vals = np.random.choice(['math', 'art', 'business', 'science'], size=len(c))
df['major'] = df['class'].map(dict(zip(c, vals)))
print (df)
id age class major
0 1 23 a business
1 2 24 a business
2 3 25 b art
3 4 22 b art
4 5 16 c science
5 6 16 d math

How to change a value of a cell that contains nan to another specific value?

I have a dataframe that contains nan values in a particular column. While iterating through the rows, if it comes across a nan (using the isnan() method) I need to change it to some other value (since I have some conditions). I tried using replace() and fillna() with the limit parameter as well, but they modify the whole column when they come across the first nan value. Is there any method that lets me assign a value to a specific nan rather than changing all the values of a column?
Example: the dataframe looks like this:
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 NaN
2 x3 3 'cat' 1 2 3 1 1 NaN
3 x4 6 'lion' 8 4 3 7 1 NaN
4 x5 4 'lion' 1 1 3 1 1 NaN
5 x6 8 'cat' 10 10 9 7 1 0.0
and I have a list like
a = [1.0, 0.0]
and I expect the result to be like
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
I wanted to change the target_class values based on some conditions and assign values from the above list.
I believe you need to replace NaN values with 1 only for the indexes specified in the list idx:
mask = df['target_class'].isnull()
idx = [1,2,3]
df.loc[mask, 'target_class'] = df[mask].index.isin(idx).astype(int)
print (df)
points sundar cate king varun vicky john charlie target_class
1 x2 5 'cat' 4 10 3 2 1 1.0
2 x3 3 'cat' 1 2 3 1 1 1.0
3 x4 6 'lion' 8 4 3 7 1 1.0
4 x5 4 'lion' 1 1 3 1 1 0.0
5 x6 8 'cat' 10 10 9 7 1 0.0
Or:
idx = [1,2,3]
s = pd.Series(df.index.isin(idx).astype(int), index=df.index)
df['target_class'] = df['target_class'].fillna(s)
EDIT:
From the comments, the solution is to assign values by index and column with DataFrame.loc:
df2.loc['x2', 'target_class'] = list1[0]
I suppose your conditions for imputing the nan values do not depend on how many of them are in a column. In the code below I stored all the imputation rules in one function that receives as parameters the entire row (containing the nan) and the column you are investigating. If you also need the whole dataframe for the imputation rules, just pass it to the replace_nan function. In the example I impute the col element with the mean of the other columns.
import pandas as pd
import numpy as np
def replace_nan(row, col):
    row[col] = row.drop(col).mean()
    return row
df = pd.DataFrame(np.random.rand(5,3), columns = ['col1', 'col2', 'col3'])
col_to_impute = 'col1'
df.loc[[1, 3], col_to_impute] = np.nan
df = df.apply(lambda x: replace_nan(x, col_to_impute) if np.isnan(x[col_to_impute]) else x, axis=1)
The only thing you need to do is make the right assignment, that is, assign only to the rows that contain nulls.
Example dataset:
,event_id,type,timestamp,label
0,asd12e,click,12322232,0.0
1,asj123,click,212312312,0.0
2,asd321,touch,12312323,0.0
3,asdas3,click,33332233,
4,sdsaa3,touch,33211333,
Note: the last two rows contain nulls in the 'label' column. Then we load the dataset:
df = pd.read_csv('dataset.csv')
Now, we build the appropriate condition:
cond = df['label'].isnull()
Now we make the assignment over these rows (I don't know your assignment logic, so I simply assign the value 1 to the NaNs):
df.loc[cond, 'label'] = 1
There are other, more precise approaches; the fillna() method could also be used. You would need to provide your logic for us to help further.
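For example, a minimal fillna() sketch (assuming, as above, that every null in 'label' should simply become 1):
df['label'] = df['label'].fillna(1)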
