Check if dictionaries are equal in df - python-3.x

I have a df, in which a column contains dictionaries:
a b c d
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'}
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
and another dict:
di = {0.0: 'a', 1.0: 'b'}
I want to add a new column containing 'yes' when d equals di, and 'no' / NaN / just empty when it doesn't. I tried the below, but it doesn't work:
df.loc[df['d'] == di, 'e'] = 'yes'
The result would be:
a b c d e
0 a1 b1 c1 {0.0: 'a', 1.0: 'b'} yes
1 a2 b2 c2 NaN
2 a3 b3 c3 {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}
Is anyone able to help me here?
Thank you in advance!

You can try
df['new'] = df['d'].eq(di).map({True:'yes',False:''})
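Comparing a whole Series directly to a dict can misbehave in some pandas versions (a dict is treated as list-like), so an element-wise apply is a safer sketch of the same idea; the frame below just rebuilds the example columns from the question:

```python
import numpy as np
import pandas as pd

# Rebuild the relevant part of the example frame
df = pd.DataFrame({'a': ['a1', 'a2', 'a3'],
                   'd': [{0.0: 'a', 1.0: 'b'}, np.nan,
                         {0.0: 'cs', 1.0: 'ef', 2.0: 'efg'}]})
di = {0.0: 'a', 1.0: 'b'}

# dict == dict compares keys and values; NaN == di is simply False,
# so missing rows fall through to the empty string
df['e'] = df['d'].apply(lambda x: x == di).map({True: 'yes', False: ''})
print(df['e'].tolist())  # ['yes', '', '']
```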

Related

DataFrame merge on specific columns

I have a basic question on dataframe merge. After I merge two dataframes, is there a way to keep only a few columns in the result?
For Example:
left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})
result = pd.merge(left, right, on=['key1', 'key2'])
Result:
A B key1 key2 C D
0 A0 B0 K0 K0 C0 D0
1 A2 B2 K1 K0 C1 D1
2 A2 B2 K1 K0 C2 D2
Is there a way I can choose only column 'C' from the 'right' dataframe and column 'A' from the 'left' dataframe? For example, I would like my result to be:
A key1 key2 C
0 A0 K0 K0 C0
1 A2 K1 K0 C1
2 A2 K1 K0 C2
Sure: first select only the necessary columns plus the columns used for the join:
result = pd.merge(left[['A', 'key1', 'key2']],
                  right[['C', 'key1', 'key2']],
                  on=['key1', 'key2'])
Or:
keys = ['key1', 'key2']
result = pd.merge(left[['A'] + keys], right[['C'] + keys], on=keys)
mergeDF = pd.merge(left[['key1', 'key2', 'A']], right[['key1', 'key2', 'C']], on=['key1', 'key2'])
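The answer above can be checked end to end with the question's own frames:

```python
import pandas as pd

left = pd.DataFrame({'key1': ['K0', 'K0', 'K1', 'K2'],
                     'key2': ['K0', 'K1', 'K0', 'K1'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key1': ['K0', 'K1', 'K1', 'K2'],
                      'key2': ['K0', 'K0', 'K0', 'K0'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

keys = ['key1', 'key2']
# select the wanted columns plus the join keys before merging,
# so B and D never enter the result
result = pd.merge(left[['A'] + keys], right[['C'] + keys], on=keys)
print(result)
```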

How to reorganize/restructure values in a dataframe with no column header by referring to a master dataframe in python?

Master dataframe:
B   D   E
b1  d1  e1
b2  d2  e2
b3  d3
    d4
    d5
Dataframe with no column names:
b1
d3  e1
d2
b2  e2
e1
d5  e1
How do I convert the dataframe above into something like the table below (with column names) by referring to the master dataframe?
B   D   E
b1
    d3  e1
    d2
b2      e2
        e1
    d5  e1
Thank you in advance for your help!
One way would be to make a mapping dict, then reindex each row:
# mapping dict
d = {}
for k, v in df.to_dict("list").items():
    d.update(**dict.fromkeys(set(v) - {np.nan}, k))

# or a pandas approach
d = df.melt().dropna().set_index("value")["variable"].to_dict()

def reorganize(ser):
    data = [i for i in ser if pd.notna(i)]
    ind = [d.get(i, i) for i in data]
    return pd.Series(data, index=ind)

df2.apply(reorganize, axis=1)
Output:
B D E
0 b1 NaN NaN
1 NaN d3 e1
2 NaN d2 NaN
3 b2 NaN e2
4 NaN NaN e1
5 NaN d5 e1
It's not a beautiful answer, but I was able to do it using .loc. I don't think you need the master dataframe at all.
import pandas as pd
df = pd.DataFrame({'col1': ['b1', 'd3', 'd2', 'b2', 'e1', 'd5'],
                   'col2': ['', 'e1', '', 'e2', '', 'e1']},
                  columns=['col1', 'col2'])
df
# col1 col2
# 0 b1
# 1 d3 e1
# 2 d2
# 3 b2 e2
# 4 e1
# 5 d5 e1
df_reshaped = pd.DataFrame()
for index, row in df.iterrows():
    for col in df.columns:
        i = row[col]
        j = i[0] if i != '' else ''  # first character names the target column
        if j != '':
            df_reshaped.loc[index, j] = i
df_reshaped.columns = df_reshaped.columns.str.upper()
df_reshaped
# B D E
# 0 b1 NaN NaN
# 1 NaN d3 e1
# 2 NaN d2 NaN
# 3 b2 NaN e2
# 4 NaN NaN e1
# 5 NaN d5 e1
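For completeness, here is the melt-based mapping answer as a self-contained sketch; the master and unlabelled frames are reconstructed from the example tables, so the exact NaN padding is an assumption:

```python
import numpy as np
import pandas as pd

# Master frame reconstructed from the question (shorter columns padded with NaN)
master = pd.DataFrame({'B': ['b1', 'b2', 'b3', np.nan, np.nan],
                       'D': ['d1', 'd2', 'd3', 'd4', 'd5'],
                       'E': ['e1', 'e2', np.nan, np.nan, np.nan]})

# Unlabelled frame with values in arbitrary positions
df2 = pd.DataFrame([['b1', np.nan], ['d3', 'e1'], ['d2', np.nan],
                    ['b2', 'e2'], ['e1', np.nan], ['d5', 'e1']])

# value -> column-name lookup built from the master frame
d = master.melt().dropna().set_index('value')['variable'].to_dict()

def reorganize(ser):
    # keep the non-null values and index them by their target column
    data = [i for i in ser if pd.notna(i)]
    return pd.Series(data, index=[d.get(i, i) for i in data])

out = df2.apply(reorganize, axis=1)
print(out)
```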

Remove duplicates within a column depending on 2 lists in Python

I have a dataframe such as :
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus B
G1 Canis_lupus F
G1 Cattus_cattus C
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Zebra-fish B
G2 Cattus_cattus D
G2 Cattus_cattus E
and the idea is that, within each group, I would like to keep at most two rows per duplicated NAME: one where LETTER is in list1 = ['A','B','C'] and one where it is in list2 = ['D','E','F'].
When duplicates fall in the same list (for instance both A and B), I keep the first in alphabetical order, so A.
In the example I should then get:
Groups NAME LETTER
G1 Canis_lupus A
G1 Canis_lupus F
G1 Cattus_cattus C
G2 Canis_lupus C
G2 Zebra_fish A
G2 Zebra_fish D
G2 Cattus_cattus D
Here is the dataframe:
{'Groups': {0: 'G1', 1: 'G1', 2: 'G1', 3: 'G1', 4: 'G1', 5: 'G2', 6: 'G2', 7: 'G2', 8: 'G2', 9: 'G2', 10: 'G2'}, 'NAME': {0: 'Canis_lupus', 1: 'Canis_lupus', 2: 'Canis_lupus', 3: 'Cattus_cattus', 4: 'Cattus_cattus', 5: 'Canis_lupus', 6: 'Zebra_fish', 7: 'Zebra_fish', 8: 'Zebra-fish', 9: 'Cattus_cattus', 10: 'Cattus_cattus'}, 'LETTER': {0: 'A', 1: 'B', 2: 'F', 3: 'C', 4: 'C', 5: 'C', 6: 'A', 7: 'D', 8: 'B', 9: 'D', 10: 'E'}}
The idea is to first sort by LETTER if necessary, then filter by both lists with Series.isin, remove duplicates with DataFrame.drop_duplicates, and finally join the pieces back together with concat:
# sorting per groups
df = df.sort_values(['Groups', 'LETTER'])
# or sorting by one column only
# df = df.sort_values('LETTER')
list1 = ['A', 'B', 'C']
list2 = ['D', 'E', 'F']
df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups','NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups','NAME'])
df = pd.concat([df1, df2]).sort_index(ignore_index=True)
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Another idea is to map the letters to a helper column using merged dictionaries, similar to the other solution, except that rows matching neither list are also removed with DataFrame.dropna; finally drop the helper column and sort:
d = {**dict.fromkeys(list1, 'a'),
     **dict.fromkeys(list2, 'b')}
df = (df.assign(new=df.LETTER.map(d))
        .dropna(subset=['new'])
        .drop_duplicates(subset=['Groups', 'NAME', 'new'])
        .sort_index(ignore_index=True)
        .drop(columns='new'))
print (df)
Groups NAME LETTER
0 G1 Canis_lupus A
1 G1 Canis_lupus F
2 G1 Cattus_cattus C
3 G2 Canis_lupus C
4 G2 Zebra_fish A
5 G2 Zebra_fish D
6 G2 Cattus_cattus D
Create a list_id column to identify which list a particular letter belongs to. Then just drop the duplicates using the subset parameter.
condlist = [df.LETTER.isin(list1),
            df.LETTER.isin(list2)]
choicelist = ['list1', 'list2']
df['list_id'] = np.select(condlist, choicelist)
df = (df.sort_values('LETTER')
        .drop_duplicates(subset=['Groups', 'NAME', 'list_id'])
        .drop(columns='list_id')
        .sort_values(['Groups', 'NAME']))
OUTPUT:
Groups NAME LETTER
0 G1 Canis_lupus A
2 G1 Canis_lupus F
3 G1 Cattus_cattus C
5 G2 Canis_lupus C
9 G2 Cattus_cattus D
6 G2 Zebra_fish A
7 G2 Zebra_fish D
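A runnable version of the isin + drop_duplicates answer, using the data from the question; note I normalize the stray 'Zebra-fish' spelling to 'Zebra_fish' so that the duplicate is actually detected:

```python
import pandas as pd

df = pd.DataFrame({
    'Groups': ['G1'] * 5 + ['G2'] * 6,
    'NAME': ['Canis_lupus', 'Canis_lupus', 'Canis_lupus',
             'Cattus_cattus', 'Cattus_cattus',
             'Canis_lupus', 'Zebra_fish', 'Zebra_fish',
             'Zebra_fish',  # 'Zebra-fish' in the question, normalized here
             'Cattus_cattus', 'Cattus_cattus'],
    'LETTER': ['A', 'B', 'F', 'C', 'C', 'C', 'A', 'D', 'B', 'D', 'E']})

list1 = ['A', 'B', 'C']
list2 = ['D', 'E', 'F']

df = df.sort_values(['Groups', 'LETTER'])
# keep one representative per (Groups, NAME) from each letter list
df1 = df[df['LETTER'].isin(list1)].drop_duplicates(['Groups', 'NAME'])
df2 = df[df['LETTER'].isin(list2)].drop_duplicates(['Groups', 'NAME'])
out = pd.concat([df1, df2]).sort_index().reset_index(drop=True)
print(out)
```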

How to subset a DataFrame based on similar column names

How can I subset similar columns in pandas based on leading keywords like A, B, C, D? I have taken this as an example; is there a better way that would still work if new columns were added?
df
A1 A2 A3 B1 B2 B3 C1 C2 D1 D2 D3 D4
1 a x 1 a x 3 c 7 d s 4
2 b 5 2 b 5 4 d s c 7 d
3 c 7 3 c 7 1 a x 1 a x
4 d s 4 d s b 5 2 b s 7
You can use pandas.Index.groupby
groups = df.columns.groupby(df.columns.str[0])
#{'A': ['A1', 'A2', 'A3'],
# 'B': ['B1', 'B2', 'B3'],
# 'C': ['C1', 'C2'],
# 'D': ['D1', 'D2', 'D3', 'D4']}
Then you can access data this way:
df[groups['B']]
# B1 B2 B3
#0 1 a x
#1 2 b 5
#2 3 c 7
#3 4 d s
Keep in mind groups is a dict, so you can use any dict method too.
Another approach can be to use df.columns in conjunction with str.contains:
a_col_lst = df.columns[df.columns.str.contains('A')]
b_col_lst = df.columns[df.columns.str.contains('B')]
df_A = df.loc[:, a_col_lst]
df_B = df.loc[:, b_col_lst]
You can use a regex within str.contains as well.
You could use filter along with a regex pattern, e.g.
df_A = df.filter(regex=(r'^A.*'))
In older pandas versions you could also use select along with startswith (DataFrame.select has since been removed, so prefer filter or loc in current pandas):
df_A = df.select(lambda col: col.startswith('A'), axis=1)
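A quick runnable check of the Index.groupby idea; the single data row is made up for illustration, only the column labels matter:

```python
import pandas as pd

# One made-up data row; the grouping depends only on the column labels
df = pd.DataFrame([[1, 'a', 'x', 1, 'a', 'x', 3, 'c', 7, 'd', 's', 4]],
                  columns=['A1', 'A2', 'A3', 'B1', 'B2', 'B3',
                           'C1', 'C2', 'D1', 'D2', 'D3', 'D4'])

# group the column labels by their first character
groups = df.columns.groupby(df.columns.str[0])
print({k: list(v) for k, v in groups.items()})

# any group can then be used directly as a column selector
df_B = df[groups['B']]
```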

Merging 2 data frames on 3 columns where data sometimes exists

I am attempting to merge two data frames and fill in missing values in one from the other. Hopefully this isn't too long an explanation; I have been wracking my brain around this for too long. I am working with 2 huge CSV files, so I made a small example here. Thank you so much in advance! Here we go:
print(df1)
A B C D E
0 1 B1 D1 E1
1 C1 D1 E1
2 1 B1 D1 E1
3 2 B2 D2 E2
4 B2 C2 D2 E2
5 3 D3 E3
6 3 B3 C3 D3 E3
7 4 C4 D4 E4
print(df2)
A B C F G
0 1 C1 F1 G1
1 B2 C2 F2 G2
2 3 B3 F3 G3
3 4 B4 C4 F4 G4
I would essentially like to merge df2 into df1 by 3 different columns. I understand that you can merge on multiple column names, but it does not give me the desired result. I would like to KEEP all data in df1 and fill in the data from df2, so I use how='left'.
I am fairly new to Python and have done a lot of research but have hit a stuck point. Here is what I have tried:
data3 = df1.merge(df2, how='left', on=['A'])
print(data3)
A B_x C_x D E B_y C_y F G
0 1 B1 D1 E1 C1 F1 G1
1 C1 D1 E1 B2 C2 F2 G2
2 1 B1 D1 E1 C1 F1 G1
3 2 B2 D2 E2 NaN NaN NaN NaN
4 B2 C2 D2 E2 B2 C2 F2 G2
5 3 D3 E3 B3 F3 G3
6 3 B3 C3 D3 E3 B3 F3 G3
7 4 C4 D4 E4 B4 C4 F4 G4
As you can see, it sort of worked with just A. However, since this is a CSV file with blank values, the blank values seem to merge together, which I do not want: because df2 was blank in row 2, it filled in data wherever it saw blanks. It should be NaN if no match is found.
Whenever I start putting additional columns into my on=['A', 'B'], it does not do anything better; in fact, A no longer merges on its own.
data3 = df1.merge(df2, how='left', on=['A', 'B'])
print(data3)
A B C_x D E C_y F G
0 1 B1 D1 E1 NaN NaN NaN
1 C1 D1 E1 NaN NaN NaN
2 1 B1 D1 E1 NaN NaN NaN
3 2 B2 D2 E2 NaN NaN NaN
4 B2 C2 D2 E2 C2 F2 G2
5 3 D3 E3 NaN NaN NaN
6 3 B3 C3 D3 E3 F3 G3
7 4 C4 D4 E4 NaN NaN NaN
Columns A, B, and C are the values I want to correlate and merge on. Using both data frames, it should have enough information to fill in all the gaps. My ending df should look like:
print(desired_output):
A B C D E F G
0 1 B1 C1 D1 E1 F1 G1
1 1 B1 C1 D1 E1 F1 G1
2 1 B1 C1 D1 E1 F1 G1
3 2 B2 C2 D2 E2 F2 G2
4 2 B2 C2 D2 E2 F2 G2
5 3 B3 C3 D3 E3 F3 G3
6 3 B3 C3 D3 E3 F3 G3
7 4 B4 C4 D4 E4 F4 G4
Even though A, B, and C have repeating rows, I want to keep ALL the data and just fill in the data from df2 where it fits, even if it is repeat data. I also do not want the _x and _y suffixes from merging; I know how to rename, but doing 3 different merges and then merging those merges gets really complicated really fast with repeated rows and suffixes.
Long story short: how can I merge both data frames by A, and then B, and then C? The order in which that happens is irrelevant.
Here is a sample of the actual data. I have my own data with additional columns, and I relate it to this data by certain identifiers: basically MMSI, Name, and IMO. I want to keep duplicates because they aren't actually duplicates, just additional data points for each vessel.
MMSI BaseDateTime LAT LON VesselName IMO CallSign
366940480.0 2017-01-04T11:39:36 52.48730 -174.02316 EARLY DAWN 7821130 WDB7319
366940480.0 2017-01-04T13:51:07 52.41575 -174.60041 EARLY DAWN 7821130 WDB7319
273898000.0 2017-01-06T16:55:33 63.83668 -174.41172 MYS CHUPROVA NaN UAEZ
352844000.0 2017-01-31T22:51:31 51.89778 -176.59334 JACHA 8512920 3EFC4
352844000.0 2017-01-31T23:06:31 51.89795 -176.59333 JACHA 8512920 3EFC4
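No answer is included in this excerpt, but the "blanks merge together" surprise can be isolated in a small sketch. The frames and the fix shown here (dropping blank keys from the right frame before a left merge) are my own illustration, not the asker's code; a full three-key fill would need the same treatment applied per key column:

```python
import numpy as np
import pandas as pd

# Minimal made-up frames reproducing the blank-key problem
df1 = pd.DataFrame({'A': ['1', '', '1', '2'],
                    'D': ['D1', 'D1', 'D1', 'D2']})
df2 = pd.DataFrame({'A': ['1', '', '4'],
                    'F': ['F1', 'F2', 'F4']})

# Naive left merge: the blank strings count as equal keys,
# so row 1 wrongly picks up F2
naive = df1.merge(df2, how='left', on='A')

# Treat blanks as missing and drop incomplete keys from the right
# frame before merging, so only real key values can match
clean_right = df2.replace('', np.nan).dropna(subset=['A'])
fixed = df1.merge(clean_right, how='left', on='A')
print(fixed)
```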
