How do I validate data mapping between 2 data frames in pandas - python-3.x

I am trying to validate a data mapping between two data frames for specific columns. I need to check the following:
whether values in a specific column in df1 match the specified mapping in a specific column in df2;
whether values in a specific column in df1 do not match the specified mapping in a specific column in df2 (a different value appears in df2);
whether values in a specific column in df1 have no match at all in df2.
df1 looks like this:
cp_id  cp_code
2A23   A
2A24   D
3A45   G
7A96   B
2A30   R
6A18   K
df2 looks like this:
cp_type_id  cp_type_code
2A23        8
2A24        7
3A45        3
2A44        1
6A18        8
4A08        2
The data mapping consists of sets of values, where a code can match any value within the corresponding set, as follows:
('A','C','F','K','M') in df1 should map to (2, 8) in df2 - either 2 or 8
('B') in df1 should map to 4 in df2
('D','G','I') in df1 should map to 7 in df2
('T','U') in df1 should map to (3,5) in df2 - either 3 or 5
Note that df1 has a cp_code of R, which is not mapped, and that 3A45 is a mismatch. The good news is that there is a unique identifier key to use.
First, I created a list for each mapping set and a statement using merge to check each mapping. I ended up with 3 lists and 3 statements per set, and I am not sure this is the right way to do it.
At the end I want to combine all matches into one df that I call match, all mismatches into another df that I call no_match, and all unmapped rows into another df that I call no_mapping, like the following:
Match
cp_id  cp_code  cp_type_id  cp_type_code
2A23   A        2A23        8
2A24   D        2A24        7
6A18   K        6A18        8
Mismatch
cp_id  cp_code  cp_type_id  cp_type_code
3A45   G        3A45        3
No Mapping
cp_id  cp_code  cp_type_id  cp_type_code
7A96   B        NaN         NaN
NaN    NaN      2A44        1
2A30   R        NaN         NaN
NaN    NaN      4A08        2
I am having a hard time making the no_match part work.
This is what I tried for no match:
filtered df1 based on the set 2 codes
filtered df2 based on the codes not in mapping set 2
for the no mapping, I did a merge with on='cp_id':
no_mapping_set2 = df1_filtered.merge(df2_filtered, on='cp_id', indicator=True)
With the code above, for the row with cp_code 'B', for example, instead of getting only one row back, I get many duplicate rows.
Just to state my level, I am a beginner in Python. Any help would be appreciated.
Thank you so much for your time.
Rob
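For what it is worth, here is a minimal sketch of one way to do this with a single merge rather than one merge per mapping set. It is only an illustration built from the sample data above; the dictionary-based mapping and the variable names are assumptions, not part of the original question.
import pandas as pd

# Sample frames from the question
df1 = pd.DataFrame({'cp_id': ['2A23', '2A24', '3A45', '7A96', '2A30', '6A18'],
                    'cp_code': ['A', 'D', 'G', 'B', 'R', 'K']})
df2 = pd.DataFrame({'cp_type_id': ['2A23', '2A24', '3A45', '2A44', '6A18', '4A08'],
                    'cp_type_code': [8, 7, 3, 1, 8, 2]})

# Expand the mapping sets into one dict: cp_code -> set of allowed cp_type_codes
mapping = {}
for codes, types in [(('A', 'C', 'F', 'K', 'M'), {2, 8}),
                     (('B',), {4}),
                     (('D', 'G', 'I'), {7}),
                     (('T', 'U'), {3, 5})]:
    for c in codes:
        mapping[c] = types

# One outer merge on the identifiers keeps unmatched rows from both sides
merged = df1.merge(df2, left_on='cp_id', right_on='cp_type_id',
                   how='outer', indicator=True)

both = merged['_merge'] == 'both'
ok = merged.apply(lambda r: r['cp_type_code'] in mapping.get(r['cp_code'], set()),
                  axis=1)

match = merged[both & ok].drop(columns='_merge')
no_match = merged[both & ~ok].drop(columns='_merge')
no_mapping = merged[~both].drop(columns='_merge')
The duplicate rows you are seeing usually come from merging on a key that is not unique in one of the filtered frames; merging once on the identifier and classifying afterwards avoids that.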

Related

How to find complete empty row in pandas

I am working on one dataset in which I need to find completely empty rows in the dataset.
example:
A B C D
nan nan nan nan
1 ss nan 3.0
2 bb w2 4.0
nan nan nan nan
Currently, I am using
import pandas as pd
nan_col = []
for col in df.columns:
    if df.loc[df[col].isnull()].empty != True:
        nan_col.append(col)
But this captures columns containing null values, whereas I need to capture fully null rows.
Expected answer: rows [0, 3]
Can anyone suggest a way to identify a completely null row in the dataframe?
You can check whether all values in a row are missing with DataFrame.isna and DataFrame.all, and then get the index values by boolean indexing:
L = df.index[df.isna().all(axis=1)].tolist()
# alternative, slower on a huge dataframe
# L = df[df.isna().all(axis=1)].index.tolist()
print(L)
[0, 3]
Or you could use dropna with set and sorted: get the index after dropping the all-NaN rows, get the index of the whole dataframe, and use ^ (symmetric difference) to find the values that are not in both indexes; sorted then turns the result into a sorted list, like below:
print(sorted(set(df.index) ^ set(df.dropna(how='all').index)))
If you might have duplicate index values, you can use a list comprehension that iterates through the whole df's index and keeps a value if its position is not in the dropna index; enumerate is used so that it still works even if all index values are the same (all duplicated), like below:
idx = df.dropna(how='all').index
print([i for index, i in enumerate(df.index) if index not in idx])
Both codes output:
[0, 3]
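For reference, a minimal runnable version of the first approach on the sample frame from the question (the column values are taken from the example above):
import numpy as np
import pandas as pd

# Sample frame from the question: rows 0 and 3 are completely empty
df = pd.DataFrame({'A': [np.nan, 1, 2, np.nan],
                   'B': [np.nan, 'ss', 'bb', np.nan],
                   'C': [np.nan, np.nan, 'w2', np.nan],
                   'D': [np.nan, 3.0, 4.0, np.nan]})

L = df.index[df.isna().all(axis=1)].tolist()
print(L)  # [0, 3]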

How to add an empty column in a dataframe using pandas (without specifying column names)?

I have a dataframe with only one column (headerless). I want to add another empty column to it having the same number of rows.
To make it clearer: currently the size of my data frame is 1050 rows (since there is only one column), and I want the new size to be 1050 x 2, with the second column completely empty.
A pandas DataFrame always has columns, so for a new default column filled with missing values, use the current number of columns as the new column name:
import numpy as np
import pandas as pd

s = pd.Series([2, 3, 4])
df = s.to_frame()
df[len(df.columns)] = np.nan
# which, for a one-column df, is the same as
# df[1] = np.nan
print(df)
0 1
0 2 NaN
1 3 NaN
2 4 NaN
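Applied to the situation in the question, a hedged sketch (the file name and the headerless CSV read are assumptions, not from the original post):
import numpy as np
import pandas as pd

# Hypothetical headerless, one-column file; header=None keeps the data out of the header row
df = pd.read_csv('data.csv', header=None)

# Add a second, completely empty column; its name becomes 1 (the current column count)
df[len(df.columns)] = np.nan

print(df.shape)  # e.g. (1050, 2) for a 1050-row input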

Join one column of a dataframe with another dataframe based on a condition

I have 2 dataframes, df1 and df2 as shown below:
df1:
Name Code Title_num
0 Title_1 0 TN_1234_4687
1 Title_2 0 TN_1234_7053
2 off_1 18301 TN_1234_1915
3 off_2 18302 TN_1234_7068
4 off_3 18303 TN_1234_1828
df2:
A_Code T_Code
0 000000086 18301
1 000000126 18302
2 000001236 18303
3 000012346 18938
4 000123456 18910
5 000123457 18301
T_Code in df2 is the same as Code in df1. I want to join the Title_num column from df1 to df2.
For example, if 'T_Code' in df2 matches 'Code' in df1, I want the value in column df1['Title_num'] to be joined to df2. If no match exists, NaN should be populated.
Expected output (df2 after join):
A_Code T_Code Title_num
0 000000086 18301 TN_1234_1915
1 000000126 18302 TN_1234_7068
2 000001236 18303 TN_1234_1828
3 000012346 18938 NaN
4 000123456 18910 NaN
5 000123457 18301 TN_1234_1915
For this, I renamed column 'Code' in df1 to 'T_Code' so as to match the name in df2. Then I ran the following code:
df2.merge(df1, on='T-Code', how='left')
This gave the following error: 'T_code'
Now, one thing to note is that duplicate T_Code values will exist in df2, while Code is unique in df1. I want the Title_num values in df2 to always appear based on the T_Code value [check row 5 of the expected output; its T_Code value is the same as row 1's].
Do let me know of a method to perform this. Any help is much appreciated!
Hello, this question is already answered
here.
Thanks, good luck.
I ended up doing this:
df2 = pd.merge(df2, df1, left_on='T_Code', right_on='Code', how='left')
df2 = df2.drop(columns=['Name', 'Code'])
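A minimal runnable sketch of that final approach on the sample frames (values copied from the question; note that drop returns a new frame, so the result is assigned back):
import pandas as pd

df1 = pd.DataFrame({'Name': ['Title_1', 'Title_2', 'off_1', 'off_2', 'off_3'],
                    'Code': [0, 0, 18301, 18302, 18303],
                    'Title_num': ['TN_1234_4687', 'TN_1234_7053', 'TN_1234_1915',
                                  'TN_1234_7068', 'TN_1234_1828']})
df2 = pd.DataFrame({'A_Code': ['000000086', '000000126', '000001236',
                               '000012346', '000123456', '000123457'],
                    'T_Code': [18301, 18302, 18303, 18938, 18910, 18301]})

# Left merge keeps every df2 row; rows with no matching Code get NaN in Title_num
df2 = pd.merge(df2, df1, left_on='T_Code', right_on='Code', how='left')
df2 = df2.drop(columns=['Name', 'Code'])
print(df2)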

Merging sheets of excel using python

I am trying to take data from two sheets and compare them with each other; if a value matches, I want to append a column. Let me explain this by showing what I am doing and what I am trying to get as output using Python.
This is my sheet1 from excel.xlsx:
it contains four columns: name, class, age and group.
This is my sheet2 from excel.xlsx:
it contains a default column and a name column with extra names in it.
So now I am trying to match the names in sheet2 with sheet1; if a name in sheet1 matches one in sheet2, I want to add the default value corresponding to that name from sheet2.
This is what I need in the output:
As you can see, only Ravi and Neha have a default in sheet2, and those names match names in sheet1. Suhas and Aish don't have any default value, so nothing should appear there.
This is the code I tried:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
df1['DEFAULT'] = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)
and I am getting an output Excel like this:
I am not getting the default against Ravi.
Please help me get the expected output using Python.
Assuming you read each sheet into a dataframe (df = sheet1, df2 = sheet2)
it's quite easy and there are a few options (ranked in order of speed, from fastest to slowest):
# .merge
df = df.merge(df2, how='left', on='Name')
# pd.concat
df = pd.concat([df.set_index('Name'), df2.set_index('Name').Default], axis=1, sort='Name', join='inner')
# .join
df = df.set_index('Name').join(df2.set_index('Name'))
# .map
df.Default = df.Name.map(df2.set_index('Name')['Default'].to_dict())
All of them will have the following output:
Name Default Class Age Group
0 NaN NaN 4 2 tig
1 Ravi 2.0 5 5 rose
2 NaN NaN 3 3 lily
3 Suhas NaN 5 5 rose
4 NaN NaN 2 2 sun
5 Neha 3.0 5 5 rose
6 NaN NaN 5 2 sun
7 Aish NaN 5 5 rose
Then you overwrite the original sheet by using df.to_excel.
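For example, a hedged sketch of writing the merged result to a new workbook while keeping the second sheet (the file names come from the question; using pandas' ExcelWriter here is my assumption, not part of the original answer):
import pandas as pd

df = pd.read_excel('stack.xlsx', sheet_name='Sheet1')
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')

# One of the options above: left merge keeps every sheet1 row
df = df.merge(df2, how='left', on='NAME')

# Write both sheets into a new workbook so the original file stays untouched
with pd.ExcelWriter('play.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1', index=False)
    df2.to_excel(writer, sheet_name='Sheet2', index=False)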
EDIT
So the code you shared has 3 problems, one of which seems to be a language barrier... You only need one of the options I gave you. Secondly, there is a missing ' when reading the first sheet into df. And lastly, you are inconsistent with the df names: you defined df1 and df2 but used just df in the code, which doesn't work.
So the correct code would be as follows:
import pandas as pd
import xlrd
df1 = pd.read_excel('stack.xlsx', sheet_name='Sheet1') #Here the ' was missing
df2 = pd.read_excel('stack.xlsx', sheet_name='Sheet2')
## Now choose one of the options; I used map here, but you can pick any of them
df1.DEFAULT = df1.NAME.map(df2.set_index('NAME')['DEFAULT'].to_dict())
df1.to_excel('play.xlsx',index=False)

Filter columns based on a value (Pandas): TypeError: Could not compare ['a'] with block values

I'm trying to filter a DataFrame's columns based on a value.
In[41]: df = pd.DataFrame({'A':['a',2,3,4,5], 'B':[6,7,8,9,10]})
In[42]: df
Out[42]:
A B
0 a 6
1 2 7
2 3 8
3 4 9
4 5 10
Filtering columns:
In[43]: df.loc[:, (df != 6).iloc[0]]
Out[43]:
A
0 a
1 2
2 3
3 4
4 5
It works! But when I use strings,
In[44]: df.loc[:, (df != 'a').iloc[0]]
I'm getting this error: TypeError: Could not compare ['a'] with block values
You are trying to compare the string 'a' with the numeric values in column B.
If you want your code to work, first promote the dtype of column B to object, and it will work:
df.B = df.B.astype(object)
Always check the data types of the columns before performing such operations, using
df.info()
You could do this with masks instead, for example:
df[df.A!='a'].A
and to filter from any column:
df[df.apply(lambda x: sum([x_=='a' for x_ in x])==0, axis=1)]
The problem is due to the fact that there are numeric and string objects in the dataframe.
You can loop through each column and check each column as a series for a specific value using
(Series=='a').any()
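A short sketch of that loop on the sample frame from the question (the variable name cols_without_a is just for illustration):
import pandas as pd

df = pd.DataFrame({'A': ['a', 2, 3, 4, 5], 'B': [6, 7, 8, 9, 10]})

# Keep only the columns that never contain the value 'a'
cols_without_a = [col for col in df.columns if not (df[col] == 'a').any()]
print(df[cols_without_a])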
