Getting maximum and minimum values in dataframe - python-3.x

Hello, I have a dataframe with two data columns, and I want to get the minimum and maximum values from it. I am new to pandas dataframes; can anyone suggest how I can do this?
df=
data1 data2 name
100 200 cand1
300 400 cand1
400 100 cand1
from the example above my output should look like as below
data1 data2 name
400 100 cand1
I have tried individual columns, like
df['data1'].max()
df['data2'].min()
but I want the result as a dataframe. Please suggest how.

You can use groupby followed by agg:
>>> df.groupby('name', as_index=False).agg({'data1': max, 'data2': min})
name data1 data2
0 cand1 400 100
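If you prefer not to pass raw functions to agg, the same result can be written with pandas' named-aggregation syntax (available since pandas 0.25); a small sketch reusing the example data:

```python
import pandas as pd

df = pd.DataFrame({'data1': [100, 300, 400],
                   'data2': [200, 400, 100],
                   'name': ['cand1', 'cand1', 'cand1']})

# Named aggregation: each output column is an explicit
# (source column, function) pair
out = df.groupby('name', as_index=False).agg(data1=('data1', 'max'),
                                             data2=('data2', 'min'))
print(out)
#     name  data1  data2
# 0  cand1    400    100
```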

You have almost managed to do it correctly, so only a little bit of help is needed to make it work as you want it to work. No groupby or anything more complicated is needed.
First we build a dataframe similar to yours.
df = pd.DataFrame({'data1': [100, 200, 400], 'data2': [200, 400, 100], 'name': ['cand1', 'cand1', 'cand1']})
Then we take the max and min values of columns 'data1' and 'data2' respectively.
df['data1'] = df.data1.max()
df['data2'] = df.data2.min()
Finally we drop all rows except the first one.
df = df.iloc[0:1,:]
And now we have the df as you need it to be.
data1 data2 name
400 100 cand1
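If you would rather not overwrite the columns of the original dataframe, the same single-row result can also be built directly; a minimal sketch using the question's data:

```python
import pandas as pd

df = pd.DataFrame({'data1': [100, 300, 400],
                   'data2': [200, 400, 100],
                   'name': ['cand1', 'cand1', 'cand1']})

# Collect the column-wise extremes into a fresh one-row frame,
# leaving df itself untouched
result = pd.DataFrame({'data1': [df['data1'].max()],
                       'data2': [df['data2'].min()],
                       'name': [df['name'].iloc[0]]})
print(result)
#    data1  data2   name
# 0    400    100  cand1
```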

Related

How to transpose a Pandas DataFrame and name the new columns?

I have a simple Pandas DataFrame with 3 columns. I am trying to transpose it and then rename the columns of the new dataframe, and I am having a bit of trouble.
df = pd.DataFrame({'TotalInvoicedPrice': [123],
'TotalProductCost': [18],
'ShippingCost': [5]})
I tried using
df = df.T
which transposes the DataFrame into:
TotalInvoicedPrice,123
TotalProductCost,18
ShippingCost,5
So now I have to add the column names "Metrics" and "Values" to this data frame.
I tried using
df.columns["Metrics","Values"]
but I'm getting errors.
What I need to get is DataFrame that looks like:
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
Let's reset the index, then set the column labels:
df.T.reset_index().set_axis(['Metrics', 'Values'], axis=1)
Metrics Values
0 TotalInvoicedPrice 123
1 TotalProductCost 18
2 ShippingCost 5
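Since the goal is just long-format column/value pairs, melt can get there without transposing at all; a sketch under the same data:

```python
import pandas as pd

df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})

# melt turns each column of the single row into its own (name, value) row
out = df.melt(var_name='Metrics', value_name='Values')
print(out)
#               Metrics  Values
# 0  TotalInvoicedPrice     123
# 1    TotalProductCost      18
# 2        ShippingCost       5
```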
Maybe you can avoid the transpose operation altogether (it carries a little performance overhead):
# your dataframe
df = pd.DataFrame({'TotalInvoicedPrice': [123],
                   'TotalProductCost': [18],
                   'ShippingCost': [5]})
# form lists from the column names and the first-row values
l1 = df.columns.values.tolist()
l2 = df.iloc[0].tolist()
# create the new dataframe
df2 = pd.DataFrame(list(zip(l1, l2)), columns=['Metrics', 'Values'])
print(df2)

Python: Join/Merge 2 dfs on Approximate Key Match

I have two DataFrames:
data = [['B100',30], ['C200',33], ['C201',11]]
data2 = [['B99/B100/B105','Yes'], ['C150/C200/C201','Yes'], ['D56/D500/D501','Yes']]
df_1 = pd.DataFrame(data, columns = ['code', 'value'])
df_2 = pd.DataFrame(data2, columns = ['code_agg', 'rating'])
I need to pull in the rating from df_2 into df_1 using a partial match from the 'code' columns in each dataframe (df_1 only has a partial key/code). The result should look like this:
I have tried several methods; the most common error I get is "TypeError: 'Series' objects are mutable, thus they cannot be hashed".
I would greatly appreciate any help on this. Thank you!
df_1.merge(df_2.assign(code=df_2.code_agg.str.split('/')).explode('code'))
Out[]:
code value code_agg rating
0 B100 30 B99/B100/B105 Yes
1 C200 33 C150/C200/C201 Yes
2 C201 11 C150/C200/C201 Yes
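To make the one-liner above easier to follow, here is the same idea split into steps; the intermediate exploded frame is the key trick:

```python
import pandas as pd

df_1 = pd.DataFrame({'code': ['B100', 'C200', 'C201'],
                     'value': [30, 33, 11]})
df_2 = pd.DataFrame({'code_agg': ['B99/B100/B105', 'C150/C200/C201', 'D56/D500/D501'],
                     'rating': ['Yes', 'Yes', 'Yes']})

# 1. split the slash-separated keys into lists
# 2. explode, so each partial code gets its own row
exploded = df_2.assign(code=df_2['code_agg'].str.split('/')).explode('code')

# 3. now an exact merge on 'code' works; unmatched codes drop out
out = df_1.merge(exploded, on='code')
print(out)
```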

How do I get the maximum and minimum values of a column depending on another two columns in pandas dataframe?

This is my first time asking a question. I have a dataframe that looks like below:
import pandas as pd
data = [['AK', 'Co',2957],
['AK', 'Ot', 15],
['AK','Petr', 86848],
['AL', 'Co',167],
['AL', 'Ot', 10592],
['AL', 'Petr',1667]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
I need to find the maximum and minimum values of the third column based on the first two columns. I did browse through a few stackoverflow questions but couldn't find the right way to solve this.
My output should look like below:
data = [['AK','Ot', 15],
['AK','Petr',86848],
['AL','Co',167],
['AL','Ot', 10592]]
my_df = pd.DataFrame(data, columns = ['State', 'Energy', 'Elec'])
print(my_df)
Note: Please let me know what I am missing before leaving a negative mark on the question.
This link helped me: Python pandas dataframe: find max for each unique values of an another column
Try idxmin and idxmax with a .loc filter:
new_df = my_df.loc[
    my_df.groupby(["State"])
    .agg(ElecMin=("Elec", "idxmin"), ElecMax=("Elec", "idxmax"))
    .stack()
].reset_index(drop=True)
print(new_df)
State Energy Elec
0 AK Ot 15
1 AK Petr 86848
2 AL Co 167
3 AL Ot 10592
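An equivalent, possibly easier-to-read variant takes the min and max rows separately and concatenates them; a sketch using the same my_df:

```python
import pandas as pd

data = [['AK', 'Co', 2957], ['AK', 'Ot', 15], ['AK', 'Petr', 86848],
        ['AL', 'Co', 167], ['AL', 'Ot', 10592], ['AL', 'Petr', 1667]]
my_df = pd.DataFrame(data, columns=['State', 'Energy', 'Elec'])

# idxmin/idxmax return the row labels of the per-state extremes
mins = my_df.loc[my_df.groupby('State')['Elec'].idxmin()]
maxs = my_df.loc[my_df.groupby('State')['Elec'].idxmax()]

# stitch them together and restore the original row order
new_df = pd.concat([mins, maxs]).sort_index().reset_index(drop=True)
print(new_df)
#   State Energy   Elec
# 0    AK     Ot     15
# 1    AK   Petr  86848
# 2    AL     Co    167
# 3    AL     Ot  10592
```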

Merging DataFrames in Pandas and Numpy

I have two different data frames pertaining to sales analytics. I would like to merge them together to make a new data frame with the columns customer_id, name, and total_spend. The two data frames are as follows:
import pandas as pd
import numpy as np
customers = pd.DataFrame([[100, 'Prometheus Barwis', 'prometheus.barwis#me.com',
'(533) 072-2779'],[101, 'Alain Hennesey', 'alain.hennesey#facebook.com',
'(942) 208-8460'],[102, 'Chao Peachy', 'chao.peachy#me.com',
'(510) 121-0098'],[103, 'Somtochukwu Mouritsen',
'somtochukwu.mouritsen#me.com','(669) 504-8080'],[104,
'Elisabeth Berry', 'elisabeth.berry#facebook.com','(802) 973-8267']],
columns = ['customer_id', 'name', 'email', 'phone'])
orders = pd.DataFrame([[1000, 100, 144.82], [1001, 100, 140.93],
[1002, 102, 104.26], [1003, 100, 194.6 ], [1004, 100, 307.72],
[1005, 101, 36.69], [1006, 104, 39.59], [1007, 104, 430.94],
[1008, 103, 31.4 ], [1009, 104, 180.69], [1010, 102, 383.35],
[1011, 101, 256.2 ], [1012, 103, 930.56], [1013, 100, 423.77],
[1014, 101, 309.53], [1015, 102, 299.19]],
columns = ['order_id', 'customer_id', 'order_total'])
When I group by customer_id and order_id I get the following table:
customer_id order_id order_total
100 1000 144.82
1001 140.93
1003 194.60
1004 307.72
1013 423.77
101 1005 36.69
1011 256.20
1014 309.53
102 1002 104.26
1010 383.35
1015 299.19
103 1008 31.40
1012 930.56
104 1006 39.59
1007 430.94
1009 180.69
This is where I get stuck. I do not know how to sum up all of the orders for each customer_id in order to make a total_spent column. If anyone knows of a way to do this it would be much appreciated!
IIUC, you can do something like below
orders.groupby('customer_id')['order_total'].sum().reset_index(name='Customer_Total')
Output
customer_id Customer_Total
0 100 1211.84
1 101 602.42
2 102 786.80
3 103 961.96
4 104 651.22
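If instead you want the customer total attached to every order row (rather than one row per customer), groupby().transform broadcasts the sum back onto the original frame; a small sketch using a trimmed-down version of the orders data:

```python
import pandas as pd

orders = pd.DataFrame([[1000, 100, 144.82], [1001, 100, 140.93],
                       [1002, 102, 104.26], [1005, 101, 36.69]],
                      columns=['order_id', 'customer_id', 'order_total'])

# transform returns a result aligned with the original rows,
# so each order carries its customer's grand total
orders['customer_total'] = (orders.groupby('customer_id')['order_total']
                                  .transform('sum'))
print(orders)
```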
You can create an additional table then merge back to your current output.
# group by customer id and order id to match your current output
df = orders.groupby(['customer_id', 'order_id']).sum()
# create a new lookup table called total by customer
totalbycust = orders.groupby('customer_id').sum()
totalbycust = totalbycust.reset_index()
# only keep the columsn you want
totalbycust = totalbycust[['customer_id', 'order_total']]
# merge bcak to your current table
df =df.merge(totalbycust, left_on='customer_id', right_on='customer_id')
df = df.rename(columns = {"order_total_x": "order_total", "order_total_y": "order_amount_by_cust"})
# expect output
df
df_merge = customers.merge(orders, how='left', left_on='customer_id', right_on='customer_id').filter(['customer_id','name','order_total'])
df_merge = df_merge.groupby(['customer_id','name']).sum()
df_merge = df_merge.rename(columns={'order_total':'total_spend'})
df_merge.sort_values(['total_spend'], ascending=False)
Results in:
total_spend
customer_id name
100 Prometheus Barwis 1211.84
103 Somtochukwu Mouritsen 961.96
102 Chao Peachy 786.80
104 Elisabeth Berry 651.22
101 Alain Hennesey 602.42
A step-by-step explanation:
Start by merging your orders table onto your customers table using a left join. For this you will need pandas' .merge() method. Be sure to set the how argument to left because the default merge type is inner (which would ignore customers with no orders).
This step requires some basic understanding of SQL-style merge methods. You can find a good visual overview of the various merge types in this thread.
You can append your merge with the .filter() method to only keep your columns of interest (in your case: customer_id, name and order_total).
Now that you have your merged table, we still need to sum up all the order_total values per customer. To achieve this we need to group all non-numeric columns using .groupby() and then apply an aggregation method on the remaining numeric columns (.sum() in this case).
The .groupby() documentation link above provides some more examples on this. It is also worth knowing that this is a pattern referred to as "split-apply-combine" in the pandas documentation.
Next you will need to rename your numeric column from order_total to total_spend using the .rename() method and setting its column argument.
And last, but not least, sort your customers by your total_spend column using .sort_values().
I hope that helps.

extract second row values in spark data frame

I have a Spark dataframe for a table (1000000x4) sorted by the second column.
I need to get two values: second row, column 0 and second row, column 3.
How can I do it?
If you just need the values it's pretty simple, just use the DataFrame's internal RDD. You didn't specify the language, so I will take this freedom to show you how to achieve this using python2.
df = sqlContext.createDataFrame([("Bonsanto", 20, 2000.00),
("Hayek", 60, 3000.00),
("Mises", 60, 1000.0)],
["name", "age", "balance"])
requiredRows = [0, 2]
data = (df.rdd.zipWithIndex()
.filter(lambda ((name, age, balance), index): index in requiredRows)
.collect())
And now you can manipulate the variables inside the data list. By the way, I didn't remove the index inside every tuple just to provide you another idea about how this works.
print data
#[(Row(name=u'Bonsanto', age=20, balance=2000.0), 0),
# (Row(name=u'Mises', age=60, balance=1000.0), 2)]
