Convert a dictionary of list to a dataframe in a specific format - python-3.x

I have a dictionary with each key holding a list of values. Now, I want to convert that to a dataframe in a specific format.
Example -
dct = {"key1": ["North", "South", "East"], "key2": ["East"], "key3": ["East", "West", "North", "South"]}
The table should look like
key1  North
      South
      East
key2  East
key3  East
      West
      North
      South

First create a DataFrame from a list comprehension of tuples:
df = pd.DataFrame([(k, x) for k, v in dct.items() for x in v], columns=['a','b'])
print (df)
      a      b
0  key1  North
1  key1  South
2  key1   East
3  key2   East
4  key3   East
5  key3   West
6  key3  North
7  key3  South
In pandas you can replace the duplicated values by NaN or by some other value, here '' (an empty string):
#replace values by NaN
#df['a'] = df['a'].mask(df['a'].duplicated())
df['a'] = df['a'].mask(df['a'].duplicated(), '')
print (df)
      a      b
0  key1  North
1        South
2         East
3  key2   East
4  key3   East
5         West
6        North
7        South
If you need to convert column a to the index (note that the index values are still key1, key2, key3 and ''):
s = df.set_index('a')['b']
print (s)
a
key1    North
        South
        East
key2    East
key3    East
        West
        North
        South
Name: b, dtype: object
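As a side note (my addition, not part of the original answer): if only the key-to-value pairs are needed, pd.Series.explode reaches a similar Series in one step, with the dict keys as the index:

```python
import pandas as pd

dct = {"key1": ["North", "South", "East"],
       "key2": ["East"],
       "key3": ["East", "West", "North", "South"]}

# Each list element becomes its own row; the dict key is kept as the index
s = pd.Series(dct).explode()
print(s)
```

This skips the intermediate two-column DataFrame entirely, at the cost of not being able to blank out the repeated keys first.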

Related

Extract the mapping dictionary between two columns in pandas

I have a dataframe as shown below.
df:
id  player   country_code  country
1   messi    arg           argentina
2   neymar   bra           brazil
3   tevez    arg           argentina
4   aguero   arg           argentina
5   rivaldo  bra           brazil
6   owen     eng           england
7   lampard  eng           england
8   gerrard  eng           england
9   ronaldo  bra           brazil
10  marria   arg           argentina
From the above df, I would like to extract the mapping dictionary that relates the country_code and country columns.
Expected Output:
d = {'arg':'argentina', 'bra':'brazil', 'eng':'england'}
A dictionary has unique keys, so it is possible to convert a Series with a duplicated index (built from column country_code) directly:
d = df.set_index('country_code')['country'].to_dict()
If it is possible that one country_code maps to more than one country value, the last value per code is used.
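To illustrate that last point (my sketch, with a hypothetical conflicting spelling added for 'arg'): to_dict builds the dictionary pair by pair, so for a duplicated index label the last value wins:

```python
import pandas as pd

# Hypothetical frame where code 'arg' maps to two different spellings
df = pd.DataFrame({
    "country_code": ["arg", "bra", "arg"],
    "country": ["argentina", "brazil", "Argentina"],
})

# The dict is built pair by pair, so the last value per key wins
d = df.set_index("country_code")["country"].to_dict()
print(d)
```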

How to merge data with duplicates using panda python

I have the two dataframes below and I'd like to merge them to get Id onto df1. However, using merge I cannot get the Id when a name occurs more than once. df2 has unique names; df1 and df2 differ in rows and columns. My code is below:
df1:
  Name  Region
0    P    Asia
1    Q     Eur
2    R  Africa
3    S      NA
4    R  Africa
5    R  Africa
6    S      NA
df2:
  Name    Id
0    P  1234
1    Q  1244
2    R  1233
3    S  1111
code:
x = df1.assign(temp1=df1.groupby('Name').cumcount())
y = df2.assign(temp1=df2.groupby('Name').cumcount())
xy = x.merge(y, on=['Name', 'temp1'], how='left').drop(columns=['temp1'])
the output is:
  Name  Region      Id
0    P    Asia  1234.0
1    Q     Eur  1244.0
2    R  Africa  1233.0
3    S      NA  1111.0
4    R  Africa     NaN
5    R  Africa     NaN
6    S      NA     NaN
How do I find all the id for these duplicate names?
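A minimal sketch of one fix (assuming, as stated, that df2 really has one row per Name): drop the cumcount columns entirely and merge on Name alone, so every duplicate row in df1 picks up the same Id. The cumcount trick only matches the first occurrence of each name, because df2's cumcount is always 0:

```python
import pandas as pd

df1 = pd.DataFrame({"Name": ["P", "Q", "R", "S", "R", "R", "S"],
                    "Region": ["Asia", "Eur", "Africa", "NA", "Africa", "Africa", "NA"]})
df2 = pd.DataFrame({"Name": ["P", "Q", "R", "S"],
                    "Id": [1234, 1244, 1233, 1111]})

# df2 has one row per Name, so a plain left merge fills Id for every
# duplicate of that Name in df1 - no cumcount columns needed
xy = df1.merge(df2, on="Name", how="left")
print(xy)
```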

Compare three dataframe and create a new column in one of the dataframe based on a condition

I am comparing two region data frames with master_df and creating a new column based on a condition, where a match is available.
For example, I have master_df and two region dataframes, asia_df and europe_df. I want to check whether a company from master_df is present in either of the region data frames and create a new column region with the value Europe or Asia.
master_df
company product
ABC Apple
BCA Mango
DCA Apple
ERT Mango
NFT Oranges
europe_df
account sales
ABC 12
BCA 13
DCA 12
asia_df
account sales
DCA 15
ERT 34
My final output dataframe is expected to be
company product region
ABC Apple Europe
BCA Mango Europe
DCA Apple Europe
DCA Apple Asia
ERT Mango Asia
NFT Oranges Others
When I try to merge and compare, some rows are removed. I need help on how to fix this issue:
final_df = europe_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final1_df = asia_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final['region'] = np.where(final_df['account'] == final_df['company'] ,'Europe','Others')
final['region'] = np.where(final1_df['account'] == final1_df['company'] ,'Asia','Others')
First use pd.concat to concatenate the dataframes asia_df and europe_df, then use DataFrame.merge to merge the result with master_df, and finally use Series.fillna to fill NaN values in Region with 'Others':
r = pd.concat([europe_df.assign(Region='Europe'), asia_df.assign(Region='Asia')])\
.rename(columns={'account': 'company'})[['company', 'Region']]
df = master_df.merge(r, on='company', how='left')
df['Region'] = df['Region'].fillna('Others')
Result:
print(df)
company product Region
0 ABC Apple Europe
1 BCA Mango Europe
2 DCA Apple Europe
3 DCA Apple Asia
4 ERT Mango Asia
5 NFT Oranges Others
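For reference, a self-contained run of the steps above (frames rebuilt from the question's sample data, so the snippet can be executed directly):

```python
import pandas as pd

master_df = pd.DataFrame({"company": ["ABC", "BCA", "DCA", "ERT", "NFT"],
                          "product": ["Apple", "Mango", "Apple", "Mango", "Oranges"]})
europe_df = pd.DataFrame({"account": ["ABC", "BCA", "DCA"], "sales": [12, 13, 12]})
asia_df = pd.DataFrame({"account": ["DCA", "ERT"], "sales": [15, 34]})

# Tag each region frame, stack them, and left-merge onto the master list;
# a company in both regions (DCA) yields one output row per region
r = pd.concat([europe_df.assign(Region="Europe"),
               asia_df.assign(Region="Asia")]) \
      .rename(columns={"account": "company"})[["company", "Region"]]
df = master_df.merge(r, on="company", how="left")
df["Region"] = df["Region"].fillna("Others")
print(df)
```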

How to select records with not exists condition in pandas dataframe

I have two dataframes as below. I want to rewrite the following data-selection SQL query, which contains a NOT EXISTS condition, in pandas.
SQL
SELECT ORDER_NUM, DRIVER FROM df
WHERE
    1=1
    AND NOT EXISTS
    (
        SELECT 1 FROM order_addition oa
        WHERE
            oa.Flag_Value = 'Y'
            AND df.ORDER_NUM = oa.ORDER_NUM
    )
Sample data
order_addition.head(10)
ORDER_NUM Flag_Value
22574536 Y
32459745 Y
15642314 Y
12478965 N
25845673 N
36789156 N
df.head(10)
ORDER_NUM REGION DRIVER
22574536 WEST Ravi
32459745 WEST David
15642314 SOUTH Rahul
12478965 NORTH David
25845673 SOUTH Mani
36789156 SOUTH Tim
How can this be done easily in pandas?
IIUC, you can merge df with the rows of order_addition where Flag_Value equals 'Y', and then find the NaNs:
result = df.merge(order_addition[order_addition["Flag_Value"].eq("Y")], how="left", on="ORDER_NUM")
print (result[result["Flag_Value"].isnull()])
   ORDER_NUM REGION DRIVER Flag_Value
3   12478965  NORTH  David        NaN
4   25845673  SOUTH   Mani        NaN
5   36789156  SOUTH    Tim        NaN
Or even simpler, if your ORDER_NUM values are unique:
print (df.loc[~df["ORDER_NUM"].isin(order_addition.loc[order_addition["Flag_Value"].eq("Y"), "ORDER_NUM"])])
ORDER_NUM REGION DRIVER
3 12478965 NORTH David
4 25845673 SOUTH Mani
5 36789156 SOUTH Tim
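Another variant (my sketch, not from the original answer): merge with indicator=True makes the anti-join explicit, since 'left_only' rows are exactly those with no matching Flag_Value == 'Y' order:

```python
import pandas as pd

order_addition = pd.DataFrame({
    "ORDER_NUM": [22574536, 32459745, 15642314, 12478965, 25845673, 36789156],
    "Flag_Value": ["Y", "Y", "Y", "N", "N", "N"],
})
df = pd.DataFrame({
    "ORDER_NUM": [22574536, 32459745, 15642314, 12478965, 25845673, 36789156],
    "REGION": ["WEST", "WEST", "SOUTH", "NORTH", "SOUTH", "SOUTH"],
    "DRIVER": ["Ravi", "David", "Rahul", "David", "Mani", "Tim"],
})

# indicator=True adds a _merge column; 'left_only' marks df rows that found
# no flagged order, i.e. the NOT EXISTS rows
flagged = order_addition.loc[order_addition["Flag_Value"].eq("Y"), ["ORDER_NUM"]]
m = df.merge(flagged, on="ORDER_NUM", how="left", indicator=True)
result = m.loc[m["_merge"].eq("left_only"), ["ORDER_NUM", "DRIVER"]]
print(result)
```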

How to apply IF, else, else if condition in Pandas DataFrame

I have a column in my pandas DataFrame with country names. I want to apply different filters on the column using if-else conditions and have to add a new column on that DataFrame with those conditions.
Current DataFrame:-
Company Country
BV Denmark
BV Sweden
DC Norway
BV Germany
BV France
DC Croatia
BV Italy
DC Germany
BV Austria
BV Spain
I have tried this, but this way I have to define the countries again and again.
bookings_d2.loc[(bookings_d2.Country== 'Denmark') | (bookings_d2.Country== 'Norway'), 'Country'] = bookings_d2.Country
In R I am currently using an ifelse condition like this; I want to implement the same thing in Python.
R Code Example 1 :
ifelse(bookings_d2$COUNTRY_NAME %in% c('Denmark','Germany','Norway','Sweden','France','Italy','Spain','Germany','Austria','Netherlands','Croatia','Belgium'),
as.character(bookings_d2$COUNTRY_NAME),'Others')
R Code Example 2 :
ifelse(bookings_d2$country %in% c('Germany'),
ifelse(bookings_d2$BOOKING_BRAND %in% c('BV'),'Germany_BV','Germany_DC'),bookings_d2$country)
Expected DataFrame:-
Company Country
BV Denmark
BV Sweden
DC Norway
BV Germany_BV
BV France
DC Croatia
BV Italy
DC Germany_DC
BV Others
BV Others
Not sure exactly what you are trying to achieve, but I guess it is something along the lines of:
df=pd.DataFrame({'country':['Sweden','Spain','China','Japan'], 'continent':[None] * 4})
country continent
0 Sweden None
1 Spain None
2 China None
3 Japan None
df.loc[(df.country=='Sweden') | ( df.country=='Spain'), 'continent'] = "Europe"
df.loc[(df.country=='China') | ( df.country=='Japan'), 'continent'] = "Asia"
country continent
0 Sweden Europe
1 Spain Europe
2 China Asia
3 Japan Asia
You can also use a Python list comprehension:
df.continent=["Europe" if (x=="Sweden" or x=="Denmark") else "Other" for x in df.country]
You can use:
For example 1: use Series.isin with numpy.where or loc (for loc it is necessary to invert the mask with ~):
#removed Austria, Spain
L = ['Denmark','Germany','Norway','Sweden','France','Italy',
'Germany','Netherlands','Croatia','Belgium']
df['Country'] = np.where(df['Country'].isin(L), df['Country'], 'Others')
Alternative:
df.loc[~df['Country'].isin(L), 'Country'] ='Others'
For example 2: use numpy.select or nested np.where:
m1 = df['Country'] == 'Germany'
m2 = df['Company'] == 'BV'
df['Country'] = np.select([m1 & m2, m1 & ~m2],['Germany_BV','Germany_DC'], df['Country'])
Alternative:
df['Country'] = np.where(~m1, df['Country'],
np.where(m2, 'Germany_BV','Germany_DC'))
print (df)
Company Country
0 BV Denmark
1 BV Sweden
2 DC Norway
3 BV Germany_BV
4 BV France
5 DC Croatia
6 BV Italy
7 DC Germany_DC
8 BV Others
9 BV Others
You can do this to get it:
country_others = ['Austria', 'Spain']
df.loc[df['Country'] == 'Germany', 'Country'] = 'Germany_' + df['Company']
df.loc[df['Country'].isin(country_others), 'Country'] = 'Others'
