I have two different data frames pertaining to sales analytics. I would like to merge them together to make a new data frame with the columns customer_id, name, and total_spend. The two data frames are as follows:
import pandas as pd
import numpy as np
customers = pd.DataFrame(
    [[100, 'Prometheus Barwis', 'prometheus.barwis#me.com', '(533) 072-2779'],
     [101, 'Alain Hennesey', 'alain.hennesey#facebook.com', '(942) 208-8460'],
     [102, 'Chao Peachy', 'chao.peachy#me.com', '(510) 121-0098'],
     [103, 'Somtochukwu Mouritsen', 'somtochukwu.mouritsen#me.com', '(669) 504-8080'],
     [104, 'Elisabeth Berry', 'elisabeth.berry#facebook.com', '(802) 973-8267']],
    columns=['customer_id', 'name', 'email', 'phone'])
orders = pd.DataFrame(
    [[1000, 100, 144.82], [1001, 100, 140.93], [1002, 102, 104.26],
     [1003, 100, 194.6], [1004, 100, 307.72], [1005, 101, 36.69],
     [1006, 104, 39.59], [1007, 104, 430.94], [1008, 103, 31.4],
     [1009, 104, 180.69], [1010, 102, 383.35], [1011, 101, 256.2],
     [1012, 103, 930.56], [1013, 100, 423.77], [1014, 101, 309.53],
     [1015, 102, 299.19]],
    columns=['order_id', 'customer_id', 'order_total'])
When I group by customer_id and order_id I get the following table:
customer_id order_id order_total
100 1000 144.82
1001 140.93
1003 194.60
1004 307.72
1013 423.77
101 1005 36.69
1011 256.20
1014 309.53
102 1002 104.26
1010 383.35
1015 299.19
103 1008 31.40
1012 930.56
104 1006 39.59
1007 430.94
1009 180.69
This is where I get stuck. I do not know how to sum up all of the orders for each customer_id in order to make a total_spend column. If anyone knows of a way to do this it would be much appreciated!
IIUC, you can do something like the below:
orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')
Output
customer_id Customer_Total
0 100 1211.84
1 101 602.42
2 102 786.80
3 103 961.96
4 104 651.22
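To get the customer_id, name and total_spend frame asked for, you can merge this result back onto customers. A minimal sketch, assuming the customers and orders frames from the question:
# sum each customer's orders, then attach names from the customers table
totals = orders.groupby('customer_id')['order_total'].sum().reset_index(name='total_spend')
result = customers[['customer_id', 'name']].merge(totals, on='customer_id')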
You can create an additional table, then merge it back to your current output.
# group by customer_id and order_id to match your current output
df = orders.groupby(['customer_id', 'order_id']).sum()
# create a new lookup table of totals by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()
# only keep the columns you want
totalbycust = totalbycust[['customer_id', 'order_total']]
# merge back to your current table
df = df.merge(totalbycust, left_on='customer_id', right_on='customer_id')
df = df.rename(columns={"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})
# expected output
df
df_merge = customers.merge(orders, how='left', left_on='customer_id', right_on='customer_id').filter(['customer_id','name','order_total'])
df_merge = df_merge.groupby(['customer_id','name']).sum()
df_merge = df_merge.rename(columns={'order_total':'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)
Results in:
total_spend
customer_id name
100 Prometheus Barwis 1211.84
103 Somtochukwu Mouritsen 961.96
102 Chao Peachy 786.80
104 Elisabeth Berry 651.22
101 Alain Hennesey 602.42
A step-by-step explanation:
Start by merging your orders table onto your customers table using a left join. For this you will need pandas' .merge() method. Be sure to set the how argument to left because the default merge type is inner (which would ignore customers with no orders).
This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.
You can chain the .filter() method onto your merge to keep only your columns of interest (in your case: customer_id, name and order_total).
Now that you have your merged table, you still need to sum up all the order_total values per customer. To achieve this, group by the non-numeric columns using .groupby() and then apply an aggregation method (.sum() in this case) to the remaining numeric column.
The .groupby() documentation linked above provides some more examples of this. It is also worth knowing that this pattern is referred to as "split-apply-combine" in the pandas documentation.
Next you will need to rename your numeric column from order_total to total_spend using the .rename() method and its columns argument.
And last, but not least, sort your customers by your total_spend column using .sort_values().
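For reference, the same steps can be written as a single chain; a sketch equivalent to the code above:
df_merge = (customers.merge(orders, how='left', on='customer_id')
            .groupby(['customer_id', 'name'], as_index=False)['order_total'].sum()
            .rename(columns={'order_total': 'total_spend'})
            .sort_values('total_spend', ascending=False))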
I hope that helps.
Related
Hello, I have a dataframe with two columns from which I want to get the minimum and maximum values. I am new to pandas dataframes; can anyone suggest how I can do it?
df=
data1 data2 name
100 200 cand1
300 400 cand1
400 100 cand1
from the example above my output should look like as below
data1 data2 name
400 100 cand1
I have tried individual columns like
df['data1'].max()
df['data2'].min()
but I want the result as a dataframe. Please suggest.
You can use groupby + agg:
>>> df.groupby('name', as_index=False).agg({'data1': 'max', 'data2': 'min'})
name data1 data2
0 cand1 400 100
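On pandas 0.25+, named aggregation gives the same result with explicit output column names; a minimal sketch:
>>> df.groupby('name', as_index=False).agg(data1=('data1', 'max'), data2=('data2', 'min'))
This returns the same one-row frame as above.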
You have almost managed to do it correctly, so only a little bit of help is needed to make it work as you want. No groupby or anything more complicated is needed.
First we make a dataframe similar to the one you have.
df = pd.DataFrame({'data1': [100, 200, 400], 'data2': [200, 400, 100], 'name': ['cand1', 'cand1', 'cand1']})
Then we take the max and min values of columns 'data1' and 'data2' respectively.
df['data1'] = df.data1.max()
df['data2'] = df.data2.min()
Finally we drop all rows except the first one.
df = df.iloc[0:1,:]
And now we have the df as you need it to be.
data1 data2 name
400 100 cand1
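Since every row shares the same name in this example, you could also skip the indexing step and build the one-row result directly; a hedged sketch, assuming pandas is imported as pd:
# take each column's extreme, plus the (single) name, and wrap in a one-row frame
result = pd.DataFrame({'data1': [df['data1'].max()],
                       'data2': [df['data2'].min()],
                       'name': [df['name'].iloc[0]]})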
This is my first time asking a question. I have a dataframe that looks like below:
import pandas as pd
data = [['AK', 'Co', 2957],
        ['AK', 'Ot', 15],
        ['AK', 'Petr', 86848],
        ['AL', 'Co', 167],
        ['AL', 'Ot', 10592],
        ['AL', 'Petr', 1667]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])
print(my_df)
I need to find the maximum and minimum values of the third column based on the first two columns. I did browse through a few stackoverflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK','Ot', 15],
['AK','Petr',86848],
['AL','Co',167],
['AL','Ot', 10592]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
Note: please let me know where I am lacking before leaving a downvote on the question.
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter.
new_df = my_df.loc[
    my_df.groupby(["State"])
    .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
    .stack()
].reset_index(drop=True)
print(new_df)
State Energy Elec
0 AK Ot 15
1 AK Petr 86848
2 AL Co 167
3 AL Ot 10592
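An alternative approach, offered as a sketch rather than the answer's method: sort by Elec, keep the first (lowest) and last (highest) row per State with drop_duplicates, then concatenate. This assumes my_df from the question:
import pandas as pd

# lowest Elec per State (keep='first') and highest (keep='last')
lo = my_df.sort_values('Elec').drop_duplicates('State', keep='first')
hi = my_df.sort_values('Elec').drop_duplicates('State', keep='last')
new_df = pd.concat([lo, hi]).sort_values(['State', 'Elec']).reset_index(drop=True)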
I'm trying to work with the drinks-by-country data set and find the mean beer servings of each country in each continent, sorted from highest to lowest.
So my result should look something like below:
South America: Venezuela 333, Brazil 245, Paraguay 213
and likewise for the other continents (I don't want to mix countries of different continents!)
Creating the grouped data without the sorting is quite easy, like below:
ddf = pd.read_csv('drinks.csv')
grouped_continent_and_country = ddf.groupby(['continent', 'country'])
print(grouped_continent_and_country['beer_servings'].mean())
but how to do the sorting??
Thanks a lot.
In this case you can just sort values by 'continent' and 'beer_servings' without applying .mean():
ddf = pd.read_csv('drinks.csv')
# sort by the continent and beer_servings columns
ddf = ddf.sort_values(by=['continent', 'beer_servings'], ascending=True)
# keep only the needed columns
ddf = ddf[['continent', 'country', 'beer_servings']].copy()
# export to csv
ddf.to_csv("drinks1.csv")
Output fragment:
continent,country,beer_servings
...
Africa,Botswana,173
Africa,Angola,217
Africa,South Africa,225
Africa,Gabon,347
Africa,Namibia,376
Asia,Afghanistan,0
Asia,Bangladesh,0
Asia,North Korea,0
Asia,Iran,0
Asia,Kuwait,0
Asia,Maldives,0
...
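If you do want the grouped mean from your original code and a highest-first ranking within each continent, you can sort the aggregated frame on both columns; a sketch assuming the drinks.csv column names used above:
import pandas as pd

ddf = pd.read_csv('drinks.csv')
# mean beer_servings per (continent, country), then highest first within each continent
result = (ddf.groupby(['continent', 'country'], as_index=False)['beer_servings']
             .mean()
             .sort_values(['continent', 'beer_servings'], ascending=[True, False]))
print(result)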
I am trying to sort my dataframe first by id, then within each id by visitnumber.
in: df.sort_values(by=['id', 'visitnumber'], ascending=[False, True])
out: df=
id visitnumber
-9223372 194
-9223372 226
-9223372 181
As you can see, the sort for visitnumber did not work: visitnumber=181 is the smallest and thus should be listed first.
What am I doing wrong?
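No answer is attached here, but one common cause worth checking (an assumption, not confirmed by the source): .sort_values() returns a new, sorted DataFrame rather than sorting in place, so printing df afterwards shows the original order unless you assign the result back:
# assign the sorted result back (or pass inplace=True)
df = df.sort_values(by=['id', 'visitnumber'], ascending=[False, True])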
I'm currently working with some FITS tables and I'm having trouble with output in astropy.io.fits. Essentially, I am slicing out a bunch of rows that hold data for objects I'm not interested in, but when I save the new table all of those rows have magically reappeared.
For example:
import astropy.io.fits as fits
import numpy as np
hdu = fits.open('some_fits_file.fits')[1].data
sample_slice = [True, True, True, False, False, True]
hdu_sliced = hdu[sample_slice]
Now my naive mind expects that "hdu" has 6 rows and hdu_sliced has 4 rows, which is what you would get if you used np.size(). So if I save hdu_sliced, the new fits file will also have 4 rows:
new_hdu = fits.BinTableHDU.from_columns(fits.ColDefs(hdu_sliced.columns))
new_hdu.writeto('new_fits_file.fits')
np.size(new_hdu.data)
6
So those two rows that I got rid of with the slice are for some reason not actually being removed from the table and the outputted file is just the same as the original file.
How do I delete the rows I don't want from the table and then output that new data to a new file?
Cheers,
Ashley
Can you use astropy.table.Table instead of astropy.io.fits.BinTableHDU?
It's a much more friendly table object.
One way to make a row selection is to index into the table object with a list (or array) of rows you want:
>>> from astropy.table import Table
>>> table = Table()
>>> table['col_a'] = [1, 2, 3]
>>> table['col_b'] = ['spam', 'ham', 'jam']
>>> print(table)
col_a col_b
----- -----
1 spam
2 ham
3 jam
>>> table[[0, 2]] # Table with rows 0 and 2 only, row 1 removed (a copy)
<Table length=2>
col_a col_b
int64 str4
----- -----
1 spam
3 jam
You can read and write to FITS directly with Table:
table = Table.read('file.fits', hdu='mydata')
table2 = table[[2, 7, 10]]
table2.write('file2.fits')
There are potential issues, e.g. the FITS BINTABLE header isn't preserved when using Table; only the key/value info is stored in table.meta. You can consult the Astropy docs on Table and FITS BINTABLE for details about the two table objects, how they represent data, and how you can convert between the two, or just ask follow-up questions here or on the astropy-dev mailing list.
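For converting between the two representations, astropy provides direct paths; a minimal sketch, assuming an existing BinTableHDU named hdu:
from astropy.io import fits
from astropy.table import Table

table = Table(hdu.data)             # FITS_rec -> Table
new_hdu = fits.table_to_hdu(table)  # Table -> BinTableHDU, carries table.meta into the header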
If you want to stick to using FITS_rec, you can try the following, which seems to be a workaround:
new_hdu = fits.BinTableHDU.from_columns(hdu_sliced._get_raw_data())
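If the private _get_raw_data() call feels fragile, constructing the HDU directly from the sliced FITS_rec may also work; a hedged sketch:
# build the HDU from the sliced data itself rather than from the original column objects
new_hdu = fits.BinTableHDU(data=hdu_sliced)
new_hdu.writeto('new_fits_file.fits')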