Pandas calculating over duplicated entries - python-3.x

This is my sample dataframe
Price DateOfTrasfer PAON Street
115000 2018-07-13 00:00 4 THE LANE
24000 2018-04-10 00:00 20 WOODS TERRACE
56000 2018-06-22 00:00 6 HEILD CLOSE
220000 2018-05-25 00:00 25 BECKWITH CLOSE
58000 2018-05-09 00:00 23 AINTREE DRIVE
115000 2018-06-21 00:00 4 EDEN VALE MEWS
82000 2018-06-01 00:00 24 ARKLESS GROVE
93000 2018-07-06 00:00 14 HORTON CRESCENT
42500 2018-06-27 00:00 18 CATHERINE TERRACE
172000 2018-05-25 00:00 67 HOLLY CRESCENT
This is the task to perform:
For any address that appears more than once in the dataset, define a holding period as the time
between any two consecutive transactions involving that property (i.e. N(holding_periods)
= N(appearances) - 1). Implement a function that takes price paid data and returns the
average length of a holding period and the annualised change in value between the purchase
and sale, grouped by the year a holding period ends and the property type.
def holding_time(df):
    df = df.copy()
    # work only with dates (drop the time component)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(['PAON', 'Street'], axis=1, inplace=True)
    df = df.groupby(['address', 'Price'], as_index=False).agg({'PPD': 'size'})\
           .rename(columns={'PPD': 'count_2'})
    return df

This script creates columns containing the individual holding times, the average holding time for that property, and the price changes during the holding times:
import numpy as np
import pandas as pd
# assume df is defined above ...
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:,0]).reset_index(name='pgb')
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes'] = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)
df.drop_duplicates(subset=['Street','avg_hold'], keep=False, inplace=True)
I created 2 new dummy entries for "Heild Close" to test it:
# Input:
Price DateOfTransfer PAON Street
0 115000 2018-07-13 4 THE LANE
1 24000 2018-04-10 20 WOODS TERRACE
2 56000 2018-06-22 6 HEILD CLOSE
3 220000 2018-05-25 25 BECKWITH CLOSE
4 58000 2018-05-09 23 AINTREE DRIVE
5 115000 2018-06-21 4 EDEN VALE MEWS
6 82000 2018-06-01 24 ARKLESS GROVE
7 93000 2018-07-06 14 HORTON CRESCENT
8 42500 2018-06-27 18 CATHERINE TERRACE
9 172000 2018-05-25 67 HOLLY CRESCENT
10 59000 2018-06-27 12 HEILD CLOSE
11 191000 2018-07-13 1 HEILD CLOSE
# Output:
Price DateOfTransfer PAON Street holding_periods price_changes avg_hold
0 115000 2018-07-13 4 THE LANE [] [] 0.0
1 24000 2018-04-10 20 WOODS TERRACE [] [] 0.0
2 56000 2018-06-22 6 HEILD CLOSE [5 days, 16 days] [3000, 132000] 10.5
3 220000 2018-05-25 25 BECKWITH CLOSE [] [] 0.0
4 58000 2018-05-09 23 AINTREE DRIVE [] [] 0.0
5 115000 2018-06-21 4 EDEN VALE MEWS [] [] 0.0
6 82000 2018-06-01 24 ARKLESS GROVE [] [] 0.0
7 93000 2018-07-06 14 HORTON CRESCENT [] [] 0.0
8 42500 2018-06-27 18 CATHERINE TERRACE [] [] 0.0
9 172000 2018-05-25 67 HOLLY CRESCENT [] [] 0.0
Your question also mentions the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type. However, there is no property-type column in the sample (PAON, maybe?), and grouping by year would make the table very hard to read, so I did not implement that part. As it stands, you have the holding time between each pair of transactions and the price change over each holding time, so it should be straightforward to build on this to compute annualised figures if you choose.
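If your data does carry a property-type column (the fuller price paid extract used later in this thread has a Prop_Type column), the annualised part could be sketched roughly like this. This is an illustration, not the poster's implementation; the column names DateOfTransfer and Prop_Type are assumed:

```python
import pandas as pd

def annualised_summary(df):
    """Average holding period and annualised change, grouped by the
    year a holding period ends and the property type (a sketch)."""
    df = df.copy()
    df["DateOfTransfer"] = pd.to_datetime(df["DateOfTransfer"])
    df = df.sort_values(["Street", "PAON", "DateOfTransfer"])
    g = df.groupby(["Street", "PAON"])
    # consecutive-transaction differences within each property
    df["hold_days"] = g["DateOfTransfer"].diff().dt.days
    df["prev_price"] = g["Price"].shift()
    held = df.dropna(subset=["hold_days"])  # one row per holding period
    years = held["hold_days"] / 365.25
    held = held.assign(
        annualised=(held["Price"] / held["prev_price"]) ** (1 / years) - 1,
        end_year=held["DateOfTransfer"].dt.year,
    )
    return held.groupby(["end_year", "Prop_Type"]).agg(
        avg_hold_days=("hold_days", "mean"),
        avg_annualised=("annualised", "mean"),
    )
```

A property bought for 100,000 and sold a year later for 110,000 comes out at roughly a 10% annualised change, as expected.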

After manually spot-checking the maximum and minimum average differences, I had to modify the accepted solution so that it matches the manual results.
These are the datasets; the function below is a bit slow, so I would appreciate a faster implementation.
urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']
def holding_time(df):
    df = df.copy()
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()
    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1f}'.format)
    return df
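One direction for speeding this up, sketched below under the same column assumptions: replace the two row-wise .apply calls with vectorised string concatenation, and leave the results numeric (the final string formatting is cheap and can be done by the caller). This is an untimed sketch, not a guaranteed win:

```python
import pandas as pd

def holding_time_fast(df):
    """Same logic as holding_time above, but the address key is built with
    vectorised string ops instead of per-row lambdas (a sketch)."""
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']].copy()
    # keep only addresses that appear more than once
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    # vectorised concatenation replaces the two .apply calls
    df['address'] = (df['Postcode'].astype(str) + '_'
                     + df['PAON'].astype(str) + '_'
                     + df['Street'].astype(str)).str.replace(' ', '_', regex=False)
    df['DateOfTrasfer'] = pd.to_datetime(df['DateOfTrasfer'])
    df = df.sort_values(['address', 'DateOfTrasfer'])
    g = df.groupby('address')
    df['avg_price'] = g['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = g['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    out = df.drop_duplicates(subset=['address'], keep='first').dropna()
    return out.drop(columns=['Price', 'DateOfTrasfer', 'address'])
```

Sorting by address and date also guards against out-of-order input, which the diff-based averages silently depend on.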

Related

Sum all elements in a column in pandas

I have data in one column of a pandas DataFrame.
1-2 3-4 8-9
4-5 6-2
3-1 4-2 1-4
The need is to sum all the data available in that column.
I tried to apply the logic below, but it does not work for a list of lists.
lst = []
str = '5-7 6-1 6-3'
str2 = str.split(' ')
for ele in str2:
    lst.append(ele.split('-'))
print(lst)
sum(lst)
Can anyone please let me know the simplest method?
My expected result should be:
27
17
15
I think we can do a split
df.col.str.split(' |-').map(lambda x : sum(int(y) for y in x))
Out[149]:
0 27
1 17
2 15
Name: col, dtype: int64
Or
pd.DataFrame(df.col.str.split(' |-').tolist()).astype(float).sum(1)
Out[156]:
0 27.0
1 17.0
2 15.0
dtype: float64
Using pd.Series.str.extractall:
df = pd.DataFrame({"col":['1-2 3-4 8-9', '4-5 6-2', '3-1 4-2 1-4']})
print(df["col"].str.extractall(r"(\d+)")[0].astype(int).groupby(level=0).sum())
0 27
1 17
2 15
Name: 0, dtype: int32
Use .str.extractall and sum on the first index level (sum(level=0) is deprecated in recent pandas, so the groupby(level=0).sum() equivalent is used here):
df['data'].str.extractall(r'(\d+)').astype(int).groupby(level=0).sum()
Output:
0
0 27
1 17
2 15
A for loop works fine here, and should be performant, since we are dealing with strings:
Using #HenryYik's sample data:
df.assign(sum_=[sum(int(n) for n in ent if n.isdigit())
                for ent in df.col])
Out[1329]:
col sum_
0 1-2 3-4 8-9 27
1 4-5 6-2 17
2 3-1 4-2 1-4 15
I suspect it will be faster to pull the data out, work on it in plain Python, and then return it to the pandas DataFrame.
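For reference, the three approaches above can be cross-checked against each other on the sample data (the column name col is assumed, as in the answers):

```python
import pandas as pd

df = pd.DataFrame({"col": ["1-2 3-4 8-9", "4-5 6-2", "3-1 4-2 1-4"]})

# 1) split on spaces/hyphens, then sum the integer parts per row
s1 = df["col"].str.split(" |-").map(lambda x: sum(int(y) for y in x))

# 2) extract every digit run, then sum per original row index
s2 = df["col"].str.extractall(r"(\d+)")[0].astype(int).groupby(level=0).sum()

# 3) plain Python loop over the strings
s3 = pd.Series([sum(int(n) for n in ent.replace("-", " ").split())
                for ent in df["col"]])

print(s1.tolist())  # [27, 17, 15]
```

Note that approaches 1 and 3 as written assume the numbers are separated only by spaces and hyphens; extractall is the most tolerant of odd formatting.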

Moving aggregate within a specified date range

Using the sample credit card transaction data below (the random and datetime imports are shown since the snippet uses them):
import random
from datetime import datetime
import pandas as pd

df = pd.DataFrame({
    'card_id': [1, 1, 1, 2, 2],
    'date': [datetime(2020, 6, random.randint(1, 14)) for i in range(5)],
    'amount': [random.randint(1, 100) for i in range(5)]})
df
card_id date amount
0 1 2020-06-07 11
1 1 2020-06-11 45
2 1 2020-06-14 87
3 2 2020-06-04 48
4 2 2020-06-12 76
I'm trying to take the total amount spent in the past 7 days of a card at the point of the transaction. For example, if card_id 1 made a transaction on June 8, I want to get the total transactions from June 1 to June 7. This is what I was hoping to get:
card_id date amount sum_past_7d
0 1 2020-06-07 11 0
1 1 2020-06-11 45 11
2 1 2020-06-14 87 56
3 2 2020-06-04 48 0
4 2 2020-06-12 76 48
I'm currently using this function and pd.apply to generate my desired column but it's taking too long on the actual data (> 1 million rows).
df['past_week'] = df['date'].apply(lambda x: x - timedelta(days=7))

def myfunction(x):
    return df.loc[(df['card_id'] == x.card_id) &
                  (df['date'] >= x.past_week) &
                  (df['date'] < x.date), :]['amount'].sum()
Is there a faster and more efficient way to do this?
Let's try rolling on date with groupby:
# make sure the data is sorted properly
# your sample is already sorted, so you can skip this
df = df.sort_values(['card_id', 'date'])
df['sum_past_7D'] = (df.set_index('date').groupby('card_id')
['amount'].rolling('7D').sum()
.groupby('card_id').shift(fill_value=0)
.values
)
Output:
card_id date amount sum_past_7D
0 1 2020-06-07 11 0.0
1 1 2020-06-11 45 11.0
2 1 2020-06-14 87 56.0
3 2 2020-06-04 48 0.0
4 2 2020-06-12 76 48.0
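One caveat worth deciding on: the expected output in the question gives 48 for the last card_id 2 row even though 2020-06-04 is 8 days before 2020-06-12, so it is not strictly "the past 7 days". If you do want a strict sum over [date - 7 days, date), time-based rolling with closed='left' expresses that directly, with no shift (a sketch on the same sample):

```python
import pandas as pd

df = pd.DataFrame({
    "card_id": [1, 1, 1, 2, 2],
    "date": pd.to_datetime(["2020-06-07", "2020-06-11", "2020-06-14",
                            "2020-06-04", "2020-06-12"]),
    "amount": [11, 45, 87, 48, 76],
}).sort_values(["card_id", "date"])

# closed="left" makes each window [date - 7d, date), so the current
# transaction is excluded and no shift is needed; empty windows give NaN
df["sum_past_7d_strict"] = (df.set_index("date").groupby("card_id")["amount"]
                              .rolling("7D", closed="left").sum()
                              .fillna(0)
                              .values)
print(df["sum_past_7d_strict"].tolist())  # [0.0, 11.0, 56.0, 0.0, 0.0]
```

Under this definition the last row becomes 0.0 rather than 48, because the earlier card 2 transaction falls outside the 7-day window.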

Calculate Percentage using Pandas DataFrame

Of all the medals won by these 5 countries across all Olympics,
what is the percentage of medals won by each of them?
I have combined all the Excel files into one using a pandas DataFrame, but now I'm stuck on finding the percentages.
Country Gold Silver Bronze Total
0 USA 10 13 11 34
1 China 2 2 4 8
2 UK 1 0 1 2
3 Germany 12 16 8 36
4 Australia 2 0 0 2
0 USA 9 9 7 25
1 China 2 4 5 11
2 UK 0 1 0 1
3 Germany 11 12 6 29
4 Australia 1 0 1 2
0 USA 9 15 13 37
1 China 5 2 4 11
2 UK 1 0 0 1
3 Germany 10 13 7 30
4 Australia 2 1 0 3
Combined data sheet.
Code that I have tried so far:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
df = pd.DataFrame()
for f in ['E:\\olympics\\Olympics-2002.xlsx', 'E:\\olympics\\Olympics-2006.xlsx',
          'E:\\olympics\\Olympics-2010.xlsx', 'E:\\olympics\\Olympics-2014.xlsx',
          'E:\\olympics\\Olympics-2018.xlsx']:
    data = pd.read_excel(f, 'Sheet1')
    df = df.append(data)
df.to_excel("E:\\olympics\\combineddata.xlsx")
data = pd.read_excel("E:\\olympics\\combineddata.xlsx")
print(data)
final_Data = {}
for i in data['Country']:
    x = i
    t1 = (data[(data.Country == x)].Total).tolist()
    print("Name of Country=", i, int(sum(t1)))
    final_Data.update({i: int(sum(t1))})
t3=data.groupby('Country').Total.sum()
t2= df['Total'].sum()
t4= t3/t2*100
print(t3)
print(t2)
print(t4)
This is how I got the answer. Now I need to plot it; I want to put it in a pie chart.
Let's assume you have created the DataFrame as 'df'. Then you can do the following to first group by and then calculate percentages.
df = df.groupby('Country').sum()
df['Gold_percent'] = (df['Gold'] / df['Gold'].sum()) * 100
df['Silver_percent'] = (df['Silver'] / df['Silver'].sum()) * 100
df['Bronze_percent'] = (df['Bronze'] / df['Bronze'].sum()) * 100
df['Total_percent'] = (df['Total'] / df['Total'].sum()) * 100
df = df.round(2)
print (df)
The output will be as follows:
Gold Silver Bronze ... Silver_percent Bronze_percent Total_percent
Country ...
Australia 5 1 1 ... 1.14 1.49 3.02
China 9 8 13 ... 9.09 19.40 12.93
Germany 33 41 21 ... 46.59 31.34 40.95
UK 2 1 1 ... 1.14 1.49 1.72
USA 28 37 31 ... 42.05 46.27 41.38
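Since the follow-up asks for a pie chart, the grouped totals can be fed straight to pandas' pie plotting. A sketch: the totals are hard-coded from the combined sample above, and the Agg backend and output file name are just for illustration:

```python
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this when running locally
import matplotlib.pyplot as plt

# Totals summed from the three sample sheets above (USA = 34 + 25 + 37, etc.)
totals = pd.Series({"USA": 96, "Germany": 95, "China": 30,
                    "Australia": 7, "UK": 4}, name="Total")

# same percentages as the Total_percent column in the grouped table
pct = (totals / totals.sum() * 100).round(2)
print(pct["USA"])  # 41.38

# autopct prints each wedge's share on the chart itself
ax = totals.plot.pie(autopct="%1.1f%%", ylabel="")
plt.savefig("medals_pie.png")
```

With the real data you would use df['Total'] after the groupby('Country').sum() step instead of the hard-coded Series.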
I don't have the exact dataset you have, so I'll explain with a similar one. Try adding a column with the sum of medals across each row, then find the percentage by dividing each row by the sum of the entire column.
I am posting this as a model; check it:
import pandas as pd
cars = {'Brand': ['Honda Civic','Toyota Corolla','Ford Focus','Audi A4'],
'ExshowroomPrice': [21000,26000,28000,34000],'RTOPrice': [2200,250,2700,3500]}
df = pd.DataFrame(cars, columns = ['Brand', 'ExshowroomPrice','RTOPrice'])
Brand ExshowroomPrice RTOPrice
0 Honda Civic 21000 2200
1 Toyota Corolla 26000 250
2 Ford Focus 28000 2700
3 Audi A4 34000 3500
df['percentage'] = ((df.ExshowroomPrice + df.RTOPrice) * 100
                    / (df.ExshowroomPrice.sum() + df.RTOPrice.sum()))
print(df)
Brand ExshowroomPrice RTOPrice percentage
0 Honda Civic 21000 2200 19.719507
1 Toyota Corolla 26000 250 22.311942
2 Ford Focus 28000 2700 26.094348
3 Audi A4 34000 3500 31.874203
Hope it's clear.

Iterate over rows in a data frame create a new column then adding more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output I need is to take the quantity difference between two consecutive dates, spread it over 24 hours, and create 23 columns where each hourly quantity adds that difference to the previous column, as below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate with a loop, but it's not working; my code is below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity')) / 24
    for j in range(24):
        df[i, [1+j]] = df[i, [j]] * (1 + diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s = df.set_index('Date').resample('1 H').interpolate()
s = pd.pivot_table(s, index=s.index.date,
                   columns=s.groupby(s.index.date).cumcount(),
                   values=['Quantity'], aggfunc='mean')
s.columns = s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the required columns, i.e.:
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...and so on, up to Hour-23.
There are other approaches which will work way better.
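For example, the 23 columns can be built in one go instead of 23 hand-written assignments (a sketch; note it fills the last row flat with its own value rather than leaving it blank, unlike the resample answer above):

```python
import pandas as pd

df = pd.DataFrame({"Date": pd.to_datetime(["2019-04-25", "2019-04-26", "2019-04-27"]),
                   "Quantity": [100, 148, 124]})

# per-hour step toward the next day's quantity (0 for the last row)
step = df["Quantity"].diff().shift(-1).fillna(0) / 24

# build all 23 hourly columns at once, then attach them
hours = pd.DataFrame({f"Hour-{h}": df["Quantity"] + h * step
                      for h in range(1, 24)})
out = pd.concat([df, hours], axis=1)
print(out.loc[0, "Hour-23"])  # 146.0
```

diff().shift(-1) gives each row the difference to the *next* row, which is what the hourly interpolation needs.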

Determining the number of unique entries left after experiencing a specific item in pandas

I have a data frame with three columns: timestamp, lecture_id, and userid.
I am trying to count the number of students who dropped (were never seen again) after experiencing a specific lecture. The goal is ultimately a fourth column showing the number of students remaining after exposure to each lecture.
I'm having trouble writing this in Python; I tried a for loop, which never finished (I have 13m rows).
import pandas as pd
import numpy as np
ids = list(np.random.randint(0, 5, size=(100, 1)))
users = list(np.random.randint(0, 10, size=(100, 1)))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
dft = pd.DataFrame({
    'lecture_id': ids,
    'userid': users,
    'timestamp': dates,
})
I want to make a new data frame that shows for every user that experienced x lecture, how many never came back (dropped).
Not sure if this is what you want and also not sure if this can be done simpler but this could be a way to do it:
import pandas as pd
import numpy as np
np.random.seed(42)
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
df = pd.DataFrame({'lecture_id': ids, 'userid': users, 'timestamp': dates})
# Get the last date for each user
last_seen = df.timestamp.iloc[df.groupby('userid').timestamp.apply(lambda x: np.argmax(x))]
df['remaining'] = len(df.userid.unique())
tmp = np.zeros(len(df))
tmp[last_seen.index] = 1
df['remaining'] = (df['remaining']- tmp.cumsum()).astype(int)
df[-10:]
where the last 10 entries are:
lecture_id timestamp userid remaining
90 2 2013-01-04 18:00:00 9 6
91 0 2013-01-04 19:00:00 5 6
92 2 2013-01-04 20:00:00 6 6
93 2 2013-01-04 21:00:00 3 5
94 0 2013-01-04 22:00:00 6 4
95 2 2013-01-04 23:00:00 7 4
96 4 2013-01-05 00:00:00 0 3
97 1 2013-01-05 01:00:00 5 2
98 1 2013-01-05 02:00:00 7 1
99 0 2013-01-05 03:00:00 4 0
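A slightly shorter sketch of the same idea: duplicated(keep='last'), negated, marks each user's final row, and a cumulative sum of those marks counts the users already gone:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
df = pd.DataFrame({
    "lecture_id": np.random.randint(0, 5, 100),
    "userid": np.random.randint(0, 10, 100),
    "timestamp": pd.date_range("20130101", periods=100, freq="H"),
})

# True on each user's last appearance, i.e. the row after which they "drop"
is_last = ~df.duplicated(subset="userid", keep="last")

# remaining = total distinct users minus users whose last row has passed
df["remaining"] = df["userid"].nunique() - is_last.cumsum()
```

This avoids the argmax/index gymnastics of the answer above and vectorises well, which matters at 13m rows.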