If a column has duplicates and the next column has a value, how to add that value to the duplicate rows - python-3.x

There is a dataframe with the columns below as input. If a username has duplicates, the owner region should be added to the duplicate rows as well.
Queue  Owner Region  username
xxy                  aan
xyz    india         aan
yyx                  aandiapp
xox    UK            aandiapp
yox    china         aashwins
zxy                  aashwins
yoz    aus           aasyed
zxo                  aasyed
The required output should be
Queue  Owner Region  username
xxy    india         aan
xyz    india         aan
yyx    UK            aandiapp
xox    UK            aandiapp
yox    china         aashwins
zxy    china         aashwins
yoz    aus           aasyed
zxo    aus           aasyed
Could anyone please help me? Thanks in advance.

I think you need to replace the empty values with NaNs first, and then fill them per group by forward and back filling:
df['Owner Region'] = df['Owner Region'].replace('', np.nan)
df['Owner Region'] = df.groupby('username')['Owner Region'].transform(lambda x: x.ffill().bfill())
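As a minimal, self-contained sketch of the two steps above, the snippet below rebuilds the question's table and applies them. It assumes the blank regions are stored as empty strings; if they are already NaN, the replace step is simply a no-op.

```python
import numpy as np
import pandas as pd

# Sample data from the question; '' marks a missing region (assumed representation).
df = pd.DataFrame({
    "Queue": ["xxy", "xyz", "yyx", "xox", "yox", "zxy", "yoz", "zxo"],
    "Owner Region": ["", "india", "", "UK", "china", "", "aus", ""],
    "username": ["aan", "aan", "aandiapp", "aandiapp",
                 "aashwins", "aashwins", "aasyed", "aasyed"],
})

# Step 1: turn empty strings into NaN so the fill methods can see them.
df["Owner Region"] = df["Owner Region"].replace("", np.nan)

# Step 2: within each username, fill forward and then backward.
df["Owner Region"] = (
    df.groupby("username")["Owner Region"]
      .transform(lambda s: s.ffill().bfill())
)

print(df)
```

Because transform returns a result aligned to the original index, the assignment back into the column does not depend on the row order.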

You can use mask and groupby.
df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    .groupby(df.username)
    .ffill()
    .bfill())
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
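When you call groupby + ffill, the subsequent bfill call does not require a groupby, provided each user's rows are contiguous and every group has at least one non-empty value.
If a group may contain only NaNs, the ungrouped bfill would pull in a value from the next group, so you cannot avoid a per-group apply: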
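df['Owner Region'] = (
    df['Owner Region']
    .mask(df['Owner Region'].str.len().eq(0))
    # group_keys=False keeps the original index so the result aligns with df (needed on pandas 2.x)
    .groupby(df.username, group_keys=False)
    .apply(lambda x: x.ffill().bfill()))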
df
Queue Owner Region username
0 xxy india aan
1 xyz india aan
2 yyx UK aandiapp
3 xox UK aandiapp
4 yox china aashwins
5 zxy china aashwins
6 yoz aus aasyed
7 zxo aus aasyed
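To illustrate why the all-NaN case matters, here is a small sketch with made-up data (the usernames a, b, c are hypothetical) in which group b has no region at all. It compares a grouped ffill followed by an ungrouped bfill against a fill done entirely within each group, here written with transform.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'b' has no region at all.
df = pd.DataFrame({
    "username": ["a", "a", "b", "b", "c", "c"],
    "Owner Region": ["", "india", "", "", "china", ""],
})
region = df["Owner Region"].replace("", np.nan)

# Grouped ffill, then an ungrouped bfill: values can cross group boundaries.
leaky = region.groupby(df["username"]).ffill().bfill()

# Both fills done inside each group: an all-NaN group stays NaN.
safe = region.groupby(df["username"]).transform(lambda s: s.ffill().bfill())

print(pd.DataFrame({"username": df["username"], "leaky": leaky, "safe": safe}))
```

In the leaky column, the rows for b pick up china from the following group, while the safe column leaves them as NaN.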

Related

Convert a dataset with a single level header to a dataset with multiple level headers

For the dataset df with one single-level header as follows, I hope to convert it into a dataset with a three-level header.
How to achieve this? Thanks.
year cpi ppi Country code Country name
0 1995 0.107797 0.085297 AUS Australia
1 1996 0.110997 0.082973 AUS Australia
2 1997 0.110098 0.083651 AUS Australia
3 1998 0.107635 0.086049 AUS Australia
4 1999 0.104969 0.089700 AUS Australia
5 1995 0.120071 0.129769 MEX Mexico
6 1996 0.119249 0.142606 MEX Mexico
7 1997 0.124866 0.151372 MEX Mexico
8 1998 0.127448 0.153303 MEX Mexico
9 1999 0.134342 0.159876 MEX Mexico
The expected output:
Country name Australia Australia Mexico Mexico
Country code AUS AUS MEX MEX
Indicator Name cpi ppi cpi ppi
0 1995 0.107797 0.085297 0.120071 0.129769
1 1996 0.110997 0.082973 0.119249 0.142606
2 1997 0.110098 0.083651 0.124866 0.151372
3 1998 0.107635 0.086049 0.127448 0.153303
4 1999 0.104969 0.0897 0.134342 0.159876
Use DataFrame.pivot with DataFrame.reorder_levels and DataFrame.sort_index:
df = (df.pivot(index=['year'],
               columns=['Country code','Country name'])
        .reorder_levels([2, 1, 0],axis=1)
        .sort_index(axis=1))
print(df)
Country name Australia Mexico
Country code AUS MEX
cpi ppi cpi ppi
year
1995 0.107797 0.085297 0.120071 0.129769
1996 0.110997 0.082973 0.119249 0.142606
1997 0.110098 0.083651 0.124866 0.151372
1998 0.107635 0.086049 0.127448 0.153303
1999 0.104969 0.089700 0.134342 0.159876
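The printed result leaves the third header level unnamed, whereas the expected output labels it Indicator Name. A possible sketch, using a shortened copy of the question's data, adds a rename_axis call to name all three levels; treat it as one way to do it rather than the only one.

```python
import pandas as pd

# Shortened copy of the question's data (two years per country).
df = pd.DataFrame({
    "year": [1995, 1996, 1995, 1996],
    "cpi": [0.107797, 0.110997, 0.120071, 0.119249],
    "ppi": [0.085297, 0.082973, 0.129769, 0.142606],
    "Country code": ["AUS", "AUS", "MEX", "MEX"],
    "Country name": ["Australia", "Australia", "Mexico", "Mexico"],
})

out = (
    df.pivot(index="year", columns=["Country code", "Country name"])
      # after pivot the levels are (indicator, code, name); reverse them
      .reorder_levels([2, 1, 0], axis=1)
      .sort_index(axis=1)
      # name the three column levels as in the expected output
      .rename_axis(columns=["Country name", "Country code", "Indicator Name"])
)

print(out)
```

If year should be an ordinary column rather than the index, as in the expected output, out.reset_index() can be applied afterwards.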

Compare three dataframe and create a new column in one of the dataframe based on a condition

I am comparing two data frames with master_df and want to create a new column based on a condition.
For example, I have master_df and two region data frames, asia_df and europe_df. I want to check whether the company in master_df appears in either region data frame and create a new column, region, with the value Europe or Asia.
master_df
company product
ABC Apple
BCA Mango
DCA Apple
ERT Mango
NFT Oranges
europe_df
account sales
ABC 12
BCA 13
DCA 12
asia_df
account sales
DCA 15
ERT 34
My final output dataframe is expected to be
company product region
ABC Apple Europe
BCA Mango Europe
DCA Apple Europe
DCA Apple Asia
ERT Mango Asia
NFT Oranges Others
When I try to merge and compare, some rows are removed. I need help fixing this issue:
final_df = europe_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final1_df = asia_df.merge(master_df, left_on='company', right_on='account', how='left').drop_duplicates()
final['region'] = np.where(final_df['account'] == final_df['company'] ,'Europe','Others')
final['region'] = np.where(final1_df['account'] == final1_df['company'] ,'Asia','Others')
First use pd.concat to concatenate asia_df and europe_df, then use DataFrame.merge to merge the result with master_df, and finally use Series.fillna to fill the NaN values in Region with Others:
r = pd.concat([europe_df.assign(Region='Europe'), asia_df.assign(Region='Asia')])\
.rename(columns={'account': 'company'})[['company', 'Region']]
df = master_df.merge(r, on='company', how='left')
df['Region'] = df['Region'].fillna('Others')
Result:
print(df)
company product Region
0 ABC Apple Europe
1 BCA Mango Europe
2 DCA Apple Europe
3 DCA Apple Asia
4 ERT Mango Asia
5 NFT Oranges Others

Python: how to remove footnotes when loading data, and how to select the first when there is a pair of numbers

I am new to python and looking for help.
import requests
import bs4 as bs

resp = requests.get("https://en.wikipedia.org/wiki/World_War_II_casualties")
soup = bs.BeautifulSoup(resp.text)
table = soup.find("table", {"class": "wikitable sortable"})
deaths = []
for row in table.findAll('tr')[1:]:
    death = row.findAll('td')[5].text.strip()
    deaths.append(death)
It comes out as
'30,000',
'40,400',
'',
'88,000',
'2,000',
'21,500',
'252,600',
'43,600',
'15,000,000[35]to 20,000,000[35]',
'100',
'340,000 to 355,000',
'6,000',
'3,000,000to 4,000,000',
'1,100',
'83,000',
'100,000[49]',
'85,000 to 95,000',
'600,000',
'1,000,000to 2,200,000',
'6,900,000 to 7,400,000',
...
'557,000',
'5,900,000[115] to 6,000,000[116]',
'40,000to 70,000',
'500,000[39]',
'36,000–50,000',
'11,900',
'10,000',
'20,000,000[141] to 27,000,000[142][143][144][145][146]',
'',
'2,100',
'100',
'7,600',
'200',
'450,900',
'419,400',
'1,027,000[160] to 1,700,000[159]',
'',
'70,000,000to 85,000,000']
I want to plot a graph, but the [] footnotes would completely ruin it. Many of the values have footnotes. Is it also possible to select the first number when there is a pair in one cell? I'd appreciate it if anyone could teach me... Thank you
You can call find_next() with the text=True parameter on the table cell to get its first text node, then split/strip accordingly.
For example:
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/World_War_II_casualties'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tr in soup.table.select('tr:has(td)')[1:]:
    tds = tr.select('td')
    if not tds[0].b:
        continue
    name = tds[0].b.get_text(strip=True, separator=' ')
    casualties = tds[5].find_next(text=True).strip()
    print('{:<30} {}'.format(name, casualties.split('–')[0].split()[0] if casualties else ''))
Prints:
Albania 30,000
Australia 40,400
Austria
Belgium 88,000
Brazil 2,000
Bulgaria 21,500
Burma 252,600
Canada 43,600
China 15,000,000
Cuba 100
Czechoslovakia 340,000
Denmark 6,000
Dutch East Indies 3,000,000
Egypt 1,100
Estonia 83,000
Ethiopia 100,000
Finland 85,000
France 600,000
French Indochina 1,000,000
Germany 6,900,000
Greece 507,000
Guam 1,000
Hungary 464,000
Iceland 200
India 2,200,000
Iran 200
Iraq 700
Ireland 100
Italy 492,400
Japan 2,500,000
Korea 483,000
Latvia 250,000
Lithuania 370,000
Luxembourg 5,000
Malaya & Singapore 100,000
Malta 1,500
Mexico 100
Mongolia 300
Nauru 500
Nepal
Netherlands 210,000
Newfoundland 1,200
New Zealand 11,700
Norway 10,200
Papua and New Guinea 15,000
Philippines 557,000
Poland 5,900,000
Portuguese Timor 40,000
Romania 500,000
Ruanda-Urundi 36,000
South Africa 11,900
South Pacific Mandate 10,000
Soviet Union 20,000,000
Spain
Sweden 2,100
Switzerland 100
Thailand 7,600
Turkey 200
United Kingdom 450,900
United States 419,400
Yugoslavia 1,027,000
Approx. totals 70,000,000
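If you already have the list of raw strings shown in the question, another option is to clean them without touching the HTML again. The sketch below is an assumption-laden alternative, not part of the answer above: first_number is a hypothetical helper that drops [n] footnote markers with a regular expression, takes the first number in the cell, and converts it to an int for plotting.

```python
import re

def first_number(cell):
    """Return the first number in a cell as an int, or None if there is none."""
    cleaned = re.sub(r"\[\d+\]", "", cell)   # drop footnote markers such as [35]
    match = re.search(r"\d[\d,]*", cleaned)  # first run of digits and thousands separators
    return int(match.group().replace(",", "")) if match else None

# A few of the raw values from the question.
deaths = [
    "30,000",
    "",
    "15,000,000[35]to 20,000,000[35]",
    "100,000[49]",
    "36,000–50,000",
    "20,000,000[141] to 27,000,000[142][143][144][145][146]",
]

print([first_number(d) for d in deaths])
```

Empty cells come back as None, so they can be filtered out or treated as missing before plotting.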

How to replace dataframe columns country name with continent?

I have Dataframe like this.
problem.head(30)
Out[25]:
Country
0 Sweden
1 Africa
2 Africa
3 Africa
4 Africa
5 Germany
6 Germany
7 Germany
8 Germany
9 UK
10 Germany
11 Germany
12 Germany
13 Germany
14 Sweden
15 Sweden
16 Africa
17 Africa
18 Africa
19 Africa
20 Africa
21 Africa
22 Africa
23 Africa
24 Africa
25 Africa
26 Pakistan
27 Pakistan
28 ZA
29 ZA
Now I want to replace each country name with its continent name.
What I did was create an array for each continent that appears in my data frame (I have 56 countries):
asia = ['Afghanistan', 'Bahrain', 'United Arab Emirates','Saudi Arabia', 'Kuwait', 'Qatar', 'Oman',
'Sultanate of Oman','Lebanon', 'Iraq', 'Yemen', 'Pakistan', 'Lebanon', 'Philippines', 'Jordan']
europe = ['Germany','Spain', 'France', 'Italy', 'Netherlands', 'Norway', 'Sweden','Czech Republic', 'Finland',
'Denmark', 'Czech Republic', 'Switzerland', 'UK', 'UK&I', 'Poland', 'Greece','Austria',
'Bulgaria', 'Hungary', 'Luxembourg', 'Romania' , 'Slovakia', 'Estonia', 'Slovenia','Portugal',
'Croatia', 'Lithuania', 'Latvia','Serbia', 'Estonia', 'ME', 'Iceland' ]
africa = ['Morocco', 'Tunisia', 'Africa', 'ZA', 'Kenya']
other = ['USA', 'Australia', 'Reunion', 'Faroe Islands']
Now I am trying to replace using
dataframe['Continent'] = dataframe['Country'].replace(asia, 'Asia', regex=True)
where asia is my list name and Asia is the replacement text. But it is not working;
it only works for
dataframe['Continent'] = dataframe['Country'].replace(np.nan, 'Asia', regex=True)
So, help will be appreciated.
Using apply with a custom function.
Demo:
import pandas as pd
asia = ['Afghanistan', 'Bahrain', 'United Arab Emirates','Saudi Arabia', 'Kuwait', 'Qatar', 'Oman',
'Sultanate of Oman','Lebanon', 'Iraq', 'Yemen', 'Pakistan', 'Lebanon', 'Philippines', 'Jordan']
europe = ['Germany','Spain', 'France', 'Italy', 'Netherlands', 'Norway', 'Sweden','Czech Republic', 'Finland',
'Denmark', 'Czech Republic', 'Switzerland', 'UK', 'UK&I', 'Poland', 'Greece','Austria',
'Bulgaria', 'Hungary', 'Luxembourg', 'Romania' , 'Slovakia', 'Estonia', 'Slovenia','Portugal',
'Croatia', 'Lithuania', 'Latvia','Serbia', 'Estonia', 'ME', 'Iceland' ]
africa = ['Morocco', 'Tunisia', 'Africa', 'ZA', 'Kenya']
other = ['USA', 'Australia', 'Reunion', 'Faroe Islands']
def GetConti(country):
    # Return the continent label for a country name, falling back to "other".
    if country in asia:
        return "Asia"
    elif country in europe:
        return "Europe"
    elif country in africa:
        return "Africa"
    else:
        return "other"

df = pd.DataFrame({"Country": ["Sweden", "Africa", "Africa", "Germany", "Germany", "UK","Pakistan"]})
df['Continent'] = df['Country'].apply(GetConti)
print(df)
Output:
Country Continent
0 Sweden Europe
1 Africa Africa
2 Africa Africa
3 Germany Europe
4 Germany Europe
5 UK Europe
6 Pakistan Asia
It would be better to store your country-to-continent map as a dictionary rather than four separate lists. You can do this as follows, starting with your current lists:
continents = {country: 'Asia' for country in asia}
continents.update({country: 'Europe' for country in europe})
continents.update({country: 'Africa' for country in africa})
continents.update({country: 'Other' for country in other})
Then you can use the Pandas map function to map continents to countries:
dataframe['Continent'] = dataframe['Country'].map(continents)
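One thing to keep in mind is that Series.map returns NaN for any country that is not a key in the dictionary. The sketch below uses abbreviated versions of the lists (and a hypothetical unlisted country, Brazil) to show one way to give those rows a default label with fillna.

```python
import pandas as pd

# Abbreviated versions of the lists from the question.
asia = ["Pakistan", "Iraq", "Jordan"]
europe = ["Germany", "Sweden", "UK"]
africa = ["Morocco", "Africa", "ZA"]
other = ["USA", "Australia"]

continents = {country: "Asia" for country in asia}
continents.update({country: "Europe" for country in europe})
continents.update({country: "Africa" for country in africa})
continents.update({country: "Other" for country in other})

# "Brazil" is a hypothetical value that appears in none of the lists.
dataframe = pd.DataFrame({"Country": ["Sweden", "ZA", "Pakistan", "USA", "Brazil"]})

# Unmapped countries become NaN after map; fillna assigns them a default.
dataframe["Continent"] = dataframe["Country"].map(continents).fillna("Other")

print(dataframe)
```

Leaving out the fillna call instead keeps the NaN values, which can be useful for spotting countries that are missing from the mapping.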

Dynamically merge row cells with the same values in Excel

In a datasheet with automatic filters, I have this (values and column names are just examples):
Continent Country City Street
----------------------------------------------------------
Asia Vietnam Hanoi egdsqgdfgdsfg
Asia Vietnam Hanoi fhfdghdfdh
Asia Vietnam Hanoi dfhdfhfdhfdhfdhfdh
Asia Vietnam Saigon ggdsfgfdsdgsdfgdf
Asia Vietnam Hue qsdfqsfqsdf
Asia China Beijing qegfqsddfgdf
Asia China Canton sdgsdfgsdgsdg
Asia China Canton tjgjfgj
Asia China Canton tzeryrty
Asia Japan Tokyo ertsegsgsdfdg
Asia Japan Kyoto qegdgdfgdfgdf
Asia Japan Sapporo gsdgfdgsgsdfgf
Europa France Paris qfqsdfdsqfgsdfgsg
Europa France Toulon qgrhrgqzfqzetzeqrr
Europa France Lyon pàjhçuhàçuh
Europa Italy Rome qrgfqegfgdfg
Europa Italy Rome qergqegsdfgsdfgdsg
I would like it to be displayed like this, with rows merged dynamically if the filters change:
Continent    Country    City       Street
----------------------------------------------------------
                                   egdsqgdfgdsfg
                        Hanoi      fhfdghdfdh
             Vietnam               dfhdfhfdhfdhfdhfdh
                        Saigon     ggdsfgfdsdgsdfgdf
                        Hue        qsdfqsfqsdf
             ---
Asia                    Beijing    qegfqsddfgdf
             China                 sdgsdfgsdgsdg
                        Canton     tjgjfgj
                                   tzeryrty
             ---
                        Tokyo      ertsegsgsdfdg
             Japan      Kyoto      qegdgdfgdfgdf
                        Sapporo    gsdgfdgsgsdfgf
---
                        Paris      qfqsdfdsqfgsdfgsg
             France     Toulon     qgrhrgqzfqzetzeqrr
Europa                  Lyon       pàjhçuhàçuh
             Italy      Rome       qrgfqegfgdfg
                                   qergqegsdfgsdfgdsg
Is a macro mandatory for this?
I don't want to merge values in the Street column. I want to keep all lines. I just want to change how the first columns are displayed, to avoid long runs of the same value.
You can also set up a PivotTable:
Go to Insert -> PivotTable, select your data as the input, and create the PivotTable in a new worksheet.
Put all fields in the Rows section and remove any subtotal or sum calculations.
Because you don't have any values to sum up, you can simply hide those columns to get a clear view.
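If you want to use a formula instead, you can do it like this (Tabelle1 is the sheet holding the data):
=IF(MATCH(Tabelle1!A1;(Tabelle1!A:A);0)=ROW();Tabelle1!A1;"")
Insert this formula in another sheet. It shows a value only on the row where that value first occurs in column A and leaves the other rows blank.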

Resources