Ordering values across different groups in Pandas

I am trying to order values of different cars across different regions, as an example. Following is the sample data set.
import pandas as pd

region = ['east', 'west', 'central', 'east', 'west', 'central', 'east', 'west', 'central']
automobile = ['bmw', 'bmw', 'bmw', 'tesla', 'tesla', 'tesla', 'lucid', 'lucid', 'lucid']
price = [250, 350, 300, 500, 550, 575, 950, 900, 850]

df_test = pd.DataFrame({'region': region,
                        'automobile': automobile,
                        'price': price})
display(df_test)
I would like to make sure that, for each automobile, the price across the three regions is synchronized such that East <= Central <= West (as it is for BMW). If they are not synced, the price in the East should be the base price. E.g. for Lucid, its price in Central should be raised to 950, and its price in West to 950 as well. For Tesla, the price in West needs to be raised to match Central, i.e. 575.
I think I should use groupby but just can't make any progress. I imagine that a function like ffill() could be used after pivoting the data, but I hope there is a simpler solution.
Any help would be appreciated.
Thank you

You can use cummax with groupby, but you first need to sort your data into the correct order using a categorical dtype:
# assign the order for the regions
df_test['region'] = pd.Categorical(df_test['region'], ordered=True,
                                   categories=['east', 'central', 'west'])

df_test['price'] = (df_test.sort_values(['automobile', 'region'])    # sort into the correct order
                           .groupby('automobile')['price'].cummax()  # running max corrects the values
                    )
Output:
region automobile price
0 east bmw 250
1 west bmw 350
2 central bmw 300
3 east tesla 500
4 west tesla 575
5 central tesla 575
6 east lucid 950
7 west lucid 950
8 central lucid 950
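For reference, the pivot-based route the asker guessed at also works; here is a minimal sketch of it (it substitutes cummax along the ordered columns for the ffill the asker imagined):

wide = df_test.pivot(index='automobile', columns='region', values='price')
wide = wide[['east', 'central', 'west']]             # enforce the East <= Central <= West order
wide = wide.cummax(axis=1)                           # running max left-to-right fixes violations
fixed = wide.stack().rename('price').reset_index()   # back to long format

The groupby/cummax answer above is simpler because it avoids the reshape round-trip.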

Python: how to remove footnotes when loading data, and how to select the first when there is a pair of numbers

I am new to Python and looking for help.
import requests
import bs4 as bs

resp = requests.get("https://en.wikipedia.org/wiki/World_War_II_casualties")
soup = bs.BeautifulSoup(resp.text)
table = soup.find("table", {"class": "wikitable sortable"})

deaths = []
for row in table.findAll('tr')[1:]:
    death = row.findAll('td')[5].text.strip()
    deaths.append(death)
It comes out as:
['30,000',
'40,400',
'',
'88,000',
'2,000',
'21,500',
'252,600',
'43,600',
'15,000,000[35]to 20,000,000[35]',
'100',
'340,000 to 355,000',
'6,000',
'3,000,000to 4,000,000',
'1,100',
'83,000',
'100,000[49]',
'85,000 to 95,000',
'600,000',
'1,000,000to 2,200,000',
'6,900,000 to 7,400,000',
...
'557,000',
'5,900,000[115] to 6,000,000[116]',
'40,000to 70,000',
'500,000[39]',
'36,000–50,000',
'11,900',
'10,000',
'20,000,000[141] to 27,000,000[142][143][144][145][146]',
'',
'2,100',
'100',
'7,600',
'200',
'450,900',
'419,400',
'1,027,000[160] to 1,700,000[159]',
'',
'70,000,000to 85,000,000']
I want to plot a graph, but the [] footnotes would completely ruin it, and many of the values have footnotes. Is it also possible to select the first number when there is a pair in one cell? I'd appreciate it if anyone could teach me. Thank you
You can use soup.find_next() with the text=True parameter, then split/strip accordingly.
For example:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/World_War_II_casualties'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

for tr in soup.table.select('tr:has(td)')[1:]:
    tds = tr.select('td')
    if not tds[0].b:
        continue
    name = tds[0].b.get_text(strip=True, separator=' ')
    casualties = tds[5].find_next(text=True).strip()
    print('{:<30} {}'.format(name, casualties.split('–')[0].split()[0] if casualties else ''))
Prints:
Albania 30,000
Australia 40,400
Austria
Belgium 88,000
Brazil 2,000
Bulgaria 21,500
Burma 252,600
Canada 43,600
China 15,000,000
Cuba 100
Czechoslovakia 340,000
Denmark 6,000
Dutch East Indies 3,000,000
Egypt 1,100
Estonia 83,000
Ethiopia 100,000
Finland 85,000
France 600,000
French Indochina 1,000,000
Germany 6,900,000
Greece 507,000
Guam 1,000
Hungary 464,000
Iceland 200
India 2,200,000
Iran 200
Iraq 700
Ireland 100
Italy 492,400
Japan 2,500,000
Korea 483,000
Latvia 250,000
Lithuania 370,000
Luxembourg 5,000
Malaya & Singapore 100,000
Malta 1,500
Mexico 100
Mongolia 300
Nauru 500
Nepal
Netherlands 210,000
Newfoundland 1,200
New Zealand 11,700
Norway 10,200
Papua and New Guinea 15,000
Philippines 557,000
Poland 5,900,000
Portuguese Timor 40,000
Romania 500,000
Ruanda-Urundi 36,000
South Africa 11,900
South Pacific Mandate 10,000
Soviet Union 20,000,000
Spain
Sweden 2,100
Switzerland 100
Thailand 7,600
Turkey 200
United Kingdom 450,900
United States 419,400
Yugoslavia 1,027,000
Approx. totals 70,000,000
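If you already have the raw strings in deaths, a plain regex pass is an alternative (a sketch; it assumes footnotes always look like [digits] and that ranges are joined by "to" or an en dash):

import re

def first_number(s):
    s = re.sub(r'\[\d+\]', '', s)            # strip bracketed footnotes like [35]
    parts = re.split(r'\s*(?:to|–)\s*', s)   # split ranges such as "1,000 to 2,200"
    return parts[0].strip()

cleaned = [first_number(d) for d in deaths]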

Use my custom row order with pandas .describe() function

Assuming I have the following test DataFrame df:
Car      Sold  make      profit
Honda    100   Accord    10
Honda    20    Fit       5
Toyota   300   Corolla   20
Hyundai  150   Elantra   20
BMW      20    Z-class   100
Toyota   45    Lexus     7
BMW      50    X-class   30
JEEP     150   cherokee  2
Honda    20    CRV       5
Toyota   30    Yaris     3
I need a summary statistic table for number of cars sold, by type of car.
I can do that this way:
df.groupby('Car')['Sold'].describe()
this gives me something like the following:
Car      count  mean  std  min  25%  50%  75%  max
BMW      2
Honda    3
Hyundai  1
JEEP     1
Toyota   3
The 'Car' values are listed in the summary statistics table in ascending alphabetical order. I am looking for a way to sort them in my own pre-specified order: I want the table listed as "Toyota, Hyundai, JEEP, BMW, Honda".
df.groupby('Car')['Sold'].describe().loc[["Toyota", "Hyundai", "JEEP", "BMW", "Honda"]]
helps me put it in order, but I am not able to do it for multi-level indexing. For instance, if I want the summary statistics table by 'Car', and further by the make, .loc does not give me the desired solution.
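One possible route (a sketch, not from the original thread) is to reuse the ordered-categorical trick from the first question above; groupby then emits the groups in category order, even with multi-level keys:

order = ["Toyota", "Hyundai", "JEEP", "BMW", "Honda"]
df['Car'] = pd.Categorical(df['Car'], categories=order, ordered=True)
df.groupby(['Car', 'make'], observed=True)['Sold'].describe()  # outer level follows the custom order

Alternatively, .reindex(order, level='Car') on the describe() result reorders just the outer level of the MultiIndex.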

Handling duplicate data with pandas

Hello everyone, I'm having some issues with the pandas Python library. Basically, I'm reading a CSV file with pandas and want to remove duplicates. I've tried everything and the problem is still there.
import sqlite3
import pandas as pd
import numpy
connection = sqlite3.connect("test.db")
## pandas dataframe
dataframe = pd.read_csv('Countries.csv')
##dataframe.head(3)
countries = dataframe.loc[:, ['Retailer country', 'Continent']]
countries.head(6)
Output of this will be:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 Canada North America
5 United States North America
6 France Europe
I want to be able to drop duplicate values based on the columns of the dataframe above, so that I get the unique (country, continent) pairs. The desired output is:
Retailer country Continent
-----------------------------
0 United States North America
1 Canada North America
2 Japan Asia
3 Italy Europe
4 France Europe
I have tried some methods mentioned there (Using pandas for duplicate values) and looked around the net, and realized I could use the df.drop_duplicates() function. But when I use the code below, df.head(3) displays only one row. What can I do to get those unique rows and finally loop through them?
countries.head(4)
country = countries['Retailer country']
continent = countries['Continent']
df = pd.DataFrame({'a':[country], 'b':[continent]})
df.head(3)
It seems like a simple groupby could solve your problem.
import pandas as pd

na = 'North America'
a = 'Asia'
e = 'Europe'
df = pd.DataFrame({'Retailer': [0, 1, 2, 3, 4, 5, 6],
                   'country': ['United States', 'Canada', 'Japan', 'Italy', 'Canada', 'United States', 'France'],
                   'continent': [na, na, a, e, na, na, e]})
df.groupby(['country', 'continent']).agg('count').reset_index()
The Retailer column now shows a count of the number of times each (country, continent) combination occurs. You can remove it with df = df[['country', 'continent']].
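As a side note, the asker's last snippet shows only one row because pd.DataFrame({'a': [country], 'b': [continent]}) wraps each whole Series in a one-element list, producing a single-row frame. The drop_duplicates the asker already found works directly on the two-column frame from the question (a sketch, assuming the countries frame defined above):

unique_countries = countries.drop_duplicates().reset_index(drop=True)
for _, row in unique_countries.iterrows():
    print(row['Retailer country'], row['Continent'])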

Subtotal for each level in Pivot table

I'm trying to create a pivot table that has, besides the general total, a subtotal between each row level.
I created my df.
import pandas as pd
import numpy as np

df = pd.DataFrame(
    np.array([['SOUTH AMERICA', 'BRAZIL', 'SP', 500],
              ['SOUTH AMERICA', 'BRAZIL', 'RJ', 200],
              ['SOUTH AMERICA', 'BRAZIL', 'MG', 150],
              ['SOUTH AMERICA', 'ARGENTINA', 'BA', 180],
              ['SOUTH AMERICA', 'ARGENTINA', 'CO', 300],
              ['EUROPE', 'SPAIN', 'MA', 400],
              ['EUROPE', 'SPAIN', 'BA', 110],
              ['EUROPE', 'FRANCE', 'PA', 320],
              ['EUROPE', 'FRANCE', 'CA', 100],
              ['EUROPE', 'FRANCE', 'LY', 80]], dtype=object),
    columns=["CONTINENT", "COUNTRY", "LOCATION", "POPULATION"]
)
After that I created my pivot table as shown below:
table = pd.pivot_table(df, values=['POPULATION'], index=['CONTINENT', 'COUNTRY', 'LOCATION'], fill_value=0, aggfunc=np.sum, dropna=True)
table
To compute the subtotals, I started by summing at the CONTINENT level:
tab_tots = table.groupby(level='CONTINENT').sum()
tab_tots.index = [tab_tots.index, ['Total'] * len(tab_tots)]
And concatenated it with my first pivot table to get the subtotals:
pd.concat([table, tab_tots]).sort_index()
And got the subtotals, but how can I get the values separated by level like in the first table? I'm not finding a way to do this.
Use margins=True, and change your pivot's index and columns a little:
newdf = pd.pivot_table(df, index=['CONTINENT'], values=['POPULATION'],
                       columns=['COUNTRY', 'LOCATION'], aggfunc=np.sum,
                       dropna=True, margins=True)
newdf.drop('All').stack([1, 2])
Out[132]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE All 1010.0
FRANCE CA 100.0
LY 80.0
PA 320.0
SPAIN BA 110.0
MA 400.0
SOUTH AMERICA ARGENTINA BA 180.0
CO 300.0
All 1330.0
BRAZIL MG 150.0
RJ 200.0
SP 500.0
IIUC:
contotal = (table.groupby(level=0).sum()
                 .assign(COUNTRY='TOTAL', LOCATION='')
                 .set_index(['COUNTRY', 'LOCATION'], append=True))
coutotal = (table.groupby(level=[0, 1]).sum()
                 .assign(LOCATION='TOTAL')
                 .set_index(['LOCATION'], append=True))
df_out = pd.concat([table, contotal, coutotal]).sort_index()
df_out
Output:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE FRANCE CA 100
LY 80
PA 320
TOTAL 500
SPAIN BA 110
MA 400
TOTAL 510
TOTAL 1010
SOUTH AMERICA ARGENTINA BA 180
CO 300
TOTAL 480
BRAZIL MG 150
RJ 200
SP 500
TOTAL 850
TOTAL 1330
You want to do something like this instead:
tab_tots.index = [tab_tots.index, ['Total'] * len(tab_tots), [''] * len(tab_tots)]
which gives the following, which I think is what you are after:
In [277]: pd.concat([table, tab_tots]).sort_index()
Out[277]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE FRANCE CA 100
LY 80
PA 320
SPAIN BA 110
MA 400
Total 1010
SOUTH AMERICA ARGENTINA BA 180
CO 300
BRAZIL MG 150
RJ 200
SP 500
Total 1330
Note that although this solves your problem, it isn't stylistically good: you have inconsistent logic across your summed levels. This makes sense for a UI, but if you are processing the data downstream it would be better to use
tab_tots.index = [tab_tots.index, ['All'] * len(tab_tots), ['All'] * len(tab_tots)]
This follows SQL table logic and will give you:
In [289]: pd.concat([table, tab_tots]).sort_index()
Out[289]:
POPULATION
CONTINENT COUNTRY LOCATION
EUROPE All All 1010
FRANCE CA 100
LY 80
PA 320
SPAIN BA 110
MA 400
SOUTH AMERICA ARGENTINA BA 180
CO 300
All All 1330
BRAZIL MG 150
RJ 200
SP 500

Plotting UK Districts, Postcode Areas and Regions

I am wondering if we can produce a choropleth similar to the US example below with UK district, postcode area, and region maps. It would be great if you could show an example of UK choropleths. Geographic shape files can be downloaded from http://martinjc.github.io/UK-GeoJSON/
import os
import pandas as pd
import folium
from branca.utilities import split_six

state_geo = os.path.join('data', 'us-states.json')
state_unemployment = os.path.join('data', 'US_Unemployment_Oct2012.csv')
state_data = pd.read_csv(state_unemployment)
j1 = pd.read_json(state_geo)

threshold_scale = split_six(state_data['Unemployment'])

m = folium.Map(location=[48, -102], zoom_start=3)
m.choropleth(
    geo_path=state_geo,
    geo_str='choropleth',
    data=state_data,
    columns=['State', 'Unemployment'],
    key_on='feature.id',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Unemployment Rate (%)'
)
m
m.save('choropleth.html')
This is what I did.
First, collect your data. I used www.nomisweb.co.uk to collect employment rates for the main regions:
North East (England)
North West (England)
Yorkshire and The Humber
East Midlands (England)
West Midlands (England)
East of England
London
South East (England)
South West (England)
Wales
Scotland
Northern Ireland
I saved this dataset as UKEmploymentData.csv. Note that you will have to change the region names to match the geo data IDs.
Then I followed what you posted using the NUTS data from the ONS geoportal.
import os
import json
import pandas as pd
import folium
from branca.utilities import split_six

# read in the employment data collected above
df = pd.read_csv('UKEmploymentData.csv')

state_geo = 'http://geoportal1-ons.opendata.arcgis.com/datasets/01fd6b2d7600446d8af768005992f76a_4.geojson'

m = folium.Map(location=[55, 4], zoom_start=5)
m.choropleth(
    geo_data=state_geo,
    data=df,
    columns=['region', 'Total in employment - aged 16 and over'],
    key_on='feature.properties.nuts118nm',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Employment Rate (%)',
    highlight=True
)
m
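Note that newer folium releases replaced Map.choropleth with the folium.Choropleth class; here is a minimal sketch of the same map with that API (my adaptation, not part of the original answer; the [55, -3] centre is my assumption for Great Britain):

import folium

m = folium.Map(location=[55, -3], zoom_start=5)  # centre on Great Britain (assumed coordinates)
folium.Choropleth(
    geo_data=state_geo,
    data=df,
    columns=['region', 'Total in employment - aged 16 and over'],
    key_on='feature.properties.nuts118nm',
    fill_color='YlGn',
    fill_opacity=0.7,
    line_opacity=0.2,
    legend_name='Employment Rate (%)',
    highlight=True,
).add_to(m)
m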
