Creating a cross-tabulation table from a pandas data frame - python-3.x

I am new to cross tables. I have 3 data frames (df_1, df_2, df_3) created in my Jupyter Notebook:
df_1
person ID, date, price, location, code
542, 12/04/12, $2.5, 66234, 103
df_2
Brand, region, location
AA, Texas, 66234
BB, SF, 15467
df_3
person ID, First, Last, Region
542, Tom, Barker, Texas
I need to combine these data frames to get something like this:
Brand, code, Texas, SF
AA, 103, $2.5, $3.8
Any idea where I should start?
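No answer was posted for this one, but one way to start is to merge the sales onto the brand table via the shared location column, then pivot so each region becomes a price column. This is a sketch under assumptions: the second sales row and the numeric (un-dollared) prices are made up so both regions appear, and the Brand column is left out of the pivot because the desired output mixes prices from two locations in one row.

```python
import pandas as pd

# Toy frames mirroring the question; the second sale row is an assumption
df_1 = pd.DataFrame({'person ID': [542, 543],
                     'price': [2.5, 3.8],
                     'location': [66234, 15467],
                     'code': [103, 103]})
df_2 = pd.DataFrame({'Brand': ['AA', 'BB'],
                     'region': ['Texas', 'SF'],
                     'location': [66234, 15467]})

# Attach region info to each sale via the shared 'location' column,
# then pivot so each region becomes its own price column
merged = df_1.merge(df_2, on='location')
table = merged.pivot_table(index='code', columns='region', values='price')
print(table)
```

From here you can join the brand column back on however fits your real data.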

Related

Create a parent child tree dictionary based on two columns in dataframe

Suppose I have a dataframe containing countries and cities, looking like this:
data = {'Parent':['Netherlands','Belgium','Germany','France'],'Child':['Amsterdam','Brussels','Berlin', '']}
I want to create a tree dictionary depicting which city belongs to which country.
In this example the country France has no cities; I don't want the empty values as child nodes in the dictionary.
Can someone point me into the right direction on what to use to reach this solution?
Kind regards
Using pandas:
df = pd.DataFrame(data)
df.transpose()
0 1 2 3
Parent Netherlands Belgium Germany France
Child Amsterdam Brussels Berlin
Or using zip:
dict(zip(*data.values()))
{'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin', 'France': ''}
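Note the zip answer still carries the empty value for France. A small variation filters empty children out while building the dictionary:

```python
data = {'Parent': ['Netherlands', 'Belgium', 'Germany', 'France'],
        'Child': ['Amsterdam', 'Brussels', 'Berlin', '']}

# Pair parents with children and keep only truthy (non-empty) children
tree = {parent: child
        for parent, child in zip(data['Parent'], data['Child'])
        if child}
print(tree)
# {'Netherlands': 'Amsterdam', 'Belgium': 'Brussels', 'Germany': 'Berlin'}
```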

Python Pandas Move Na or Null values to a new dataframe

I know I can drop NaN rows from a DataFrame with df.dropna(). But what if I want to move those NaN rows to a new DataFrame?
Dataframe looks like
FNAME, LNAME, ADDRESS, latitude, longitude, altitude
BOB, JONES, 555 Seaseme Street, 38.00,-91.00,0.0
JOHN, GREEN, 111 Maple Street, 34.00,-75.00,0.0
TOM, SMITH, 100 A Street, 20.00,-80.00,0.0
BETTY, CROCKER, 5 Elm Street, NaN,NaN,NaN
I know I can group and move to a new DataFrame like this
grouped = df.groupby(df.FNAME)
df1 = grouped.get_group("BOB")
and it will give me a new DataFrame with FNAME of BOB but when I try
grouped = df.groupby(df.altitude)
df1 = grouped.get_group("NaN")
I get a KeyError: 'NaN'. So how can I group by Nan or Null values?
Assuming you're satisfied that all NaN values in a column are to be grouped together, what you can do is use DataFrame.fillna() to convert the NaN into something else to be grouped.
df.fillna(value={'altitude': 'null_altitudes'})
This fills every null in the altitude column with the string 'null_altitudes'. If you do a groupby now, all 'null_altitudes' rows will be together. You can fill multiple columns at once by passing multiple key-value pairs: values = {'col_1': 'val_1', 'col_2': 'val_2', ...}
You can use isna with any on rows:
# to get rows with NA in a new df
df1 = df[df.isna().any(axis=1)]
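Putting that together, a minimal sketch that splits the frame into a NaN part and a clean part in one pass (the sample frame here abbreviates the columns from the question):

```python
import pandas as pd
import numpy as np

# Abbreviated version of the frame from the question
df = pd.DataFrame({'FNAME': ['BOB', 'JOHN', 'TOM', 'BETTY'],
                   'latitude': [38.0, 34.0, 20.0, np.nan],
                   'longitude': [-91.0, -75.0, -80.0, np.nan],
                   'altitude': [0.0, 0.0, 0.0, np.nan]})

# One boolean mask, reused for both halves of the split
mask = df.isna().any(axis=1)
nan_rows = df[mask]       # rows containing at least one NaN
clean_rows = df[~mask]    # rows with no NaN at all
print(nan_rows)
```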

sorting multi-column grouped by data frame

I'm trying to work on this drinks-by-country data set and find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and like that for the other continents (Don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
but how do I do the sorting?
Thanks a lot.
In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
ddf = pd.read_csv('drinks.csv')
#sorting by continent and beer_servings columns
ddf = ddf.sort_values(by=['continent','beer_servings'], ascending=True)
#making the dataframe with only needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
#exporting to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
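Note that the sort_values approach above skips the mean the question asked for. A sketch that aggregates first and then sorts descending within each continent (the inline rows are assumed stand-ins for drinks.csv):

```python
import pandas as pd

# Assumed stand-in rows for drinks.csv
ddf = pd.DataFrame({'continent': ['South America', 'South America',
                                  'South America', 'Asia', 'Asia'],
                    'country': ['Venezuela', 'Brazil', 'Paraguay',
                                'Afghanistan', 'Japan'],
                    'beer_servings': [333, 245, 213, 0, 77]})

# Mean per (continent, country), then sort: continents ascending,
# servings descending within each continent
result = (ddf.groupby(['continent', 'country'],
                      as_index=False)['beer_servings']
             .mean()
             .sort_values(['continent', 'beer_servings'],
                          ascending=[True, False]))
print(result)
```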

Merging DataFrames in Pandas and Numpy

I have two different data frames pertaining to sales analytics. I would like to merge them together to make a new data frame with the columns customer_id, name, and total_spend. The two data frames are as follows:
import pandas as pd
import numpy as np
customers = pd.DataFrame([[100, 'Prometheus Barwis', 'prometheus.barwis#me.com',
'(533) 072-2779'],[101, 'Alain Hennesey', 'alain.hennesey#facebook.com',
'(942) 208-8460'],[102, 'Chao Peachy', 'chao.peachy#me.com',
'(510) 121-0098'],[103, 'Somtochukwu Mouritsen',
'somtochukwu.mouritsen#me.com','(669) 504-8080'],[104,
'Elisabeth Berry', 'elisabeth.berry#facebook.com','(802) 973-8267']],
columns = ['customer_id', 'name', 'email', 'phone'])
orders = pd.DataFrame([[1000, 100, 144.82], [1001, 100, 140.93],
[1002, 102, 104.26], [1003, 100, 194.6 ], [1004, 100, 307.72],
[1005, 101, 36.69], [1006, 104, 39.59], [1007, 104, 430.94],
[1008, 103, 31.4 ], [1009, 104, 180.69], [1010, 102, 383.35],
[1011, 101, 256.2 ], [1012, 103, 930.56], [1013, 100, 423.77],
[1014, 101, 309.53], [1015, 102, 299.19]],
columns = ['order_id', 'customer_id', 'order_total'])
When I group by customer_id and order_id I get the following table:
customer_id order_id order_total
100 1000 144.82
1001 140.93
1003 194.60
1004 307.72
1013 423.77
101 1005 36.69
1011 256.20
1014 309.53
102 1002 104.26
1010 383.35
1015 299.19
103 1008 31.40
1012 930.56
104 1006 39.59
1007 430.94
1009 180.69
This is where I get stuck. I do not know how to sum up all of the orders for each customer_id in order to make a total_spent column. If anyone knows of a way to do this it would be much appreciated!
IIUC, you can do something like below
orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')
Output
customer_id Customer_Total
0 100 1211.84
1 101 602.42
2 102 786.80
3 103 961.96
4 104 651.22
You can create an additional table then merge back to your current output.
# group by customer id and order id to match your current output
df = orders.groupby(['customer_id', 'order_id']).sum()
# create a new lookup table called total by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()
# only keep the columns you want
totalbycust = totalbycust[['customer_id', 'order_total']]
# merge back to your current table
df = df.merge(totalbycust, left_on='customer_id', right_on='customer_id')
df = df.rename(columns={"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})
# expected output
df
df_merge = customers.merge(orders, how='left', left_on='customer_id', right_on='customer_id').filter(['customer_id','name','order_total'])
df_merge = df_merge.groupby(['customer_id','name']).sum()
df_merge = df_merge.rename(columns={'order_total':'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)
Results in:
total_spend
customer_id name
100 Prometheus Barwis 1211.84
103 Somtochukwu Mouritsen 961.96
102 Chao Peachy 786.80
104 Elisabeth Berry 651.22
101 Alain Hennesey 602.42
A step-by-step explanation:
Start by merging your orders table onto your customers table using a left join. For this you will need pandas' .merge() method. Be sure to set the how argument to left because the default merge type is inner (which would ignore customers with no orders).
This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.
You can append your merge with the .filter() method to only keep your columns of interest (in your case: customer_id, name and order_total).
Now that you have your merged table, we still need to sum up all the order_total values per customer. To achieve this we need to group all non-numeric columns using .groupby() and then apply an aggregation method on the remaining numeric columns (.sum() in this case).
The .groupby() documentation link above provides some more examples on this. It is also worth knowing that this is a pattern referred to as "split-apply-combine" in the pandas documentation.
Next you will need to rename your numeric column from order_total to total_spend using the .rename() method and setting its column argument.
And last, but not least, sort your customers by your total_spend column using .sort_values().
I hope that helps.

How to create and populate new column in 2nd Pandas dataframe based on condition in 1st df?

I have a pandas dataframe which looks like the following.
df1 = {'City' :['London', 'Basel', 'Kairo'], 'Country': ['UK','Switzerland','Egypt']}
City Country
0 London UK
1 Basel Switzerland
2 Kairo Egypt
Now I want to assign a new column called 'Continent' that should be filled based on the values.
So df2 should look like:
City Country Continent
0 London UK Europe
1 Basel Switzerland Europe
2 Kairo Egypt Africa
So df1 indicates the values for the new column in df2. How do I do this in the easiest way?
With a dictionary mapping country -> continent, you can pass it to the .replace() method, like this:
df['Continent'] = df['Country'].replace(country_dict)
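Spelled out end to end (the contents of country_dict here are assumptions; extend the lookup with whatever countries your real data contains):

```python
import pandas as pd

df = pd.DataFrame({'City': ['London', 'Basel', 'Kairo'],
                   'Country': ['UK', 'Switzerland', 'Egypt']})

# Assumed country -> continent lookup table
country_dict = {'UK': 'Europe', 'Switzerland': 'Europe', 'Egypt': 'Africa'}

# .replace() keeps unmatched countries as-is;
# .map() would turn them into NaN instead
df['Continent'] = df['Country'].replace(country_dict)
print(df)
```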
