Determining the number of unique entries left after experiencing a specific item in pandas - python-3.x

I have a data frame with three columns: timestamp, lecture_id, and userid.
I am trying to count the number of students who dropped (were never seen again) after experiencing a specific lecture. The goal is ultimately to have a fourth column that shows the number of students remaining after exposure to a specific lecture.
I'm having trouble writing this in Python; I tried a for loop, which never finished (I have 13m rows).
import pandas as pd
import numpy as np
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
dft = pd.DataFrame({
    'lecture_id': ids,
    'userid': users,
    'timestamp': dates,
})
I want to make a new data frame that shows, for each lecture, how many of the users who experienced it never came back (dropped).

Not sure if this is exactly what you want, and it can probably be done more simply, but this could be one way to do it:
import pandas as pd
import numpy as np
np.random.seed(42)
ids = list(np.random.randint(0, 5, size=100))
users = list(np.random.randint(0, 10, size=100))
dates = list(pd.date_range('20130101', periods=100, freq='H'))
df = pd.DataFrame({'lecture_id': ids, 'userid': users, 'timestamp': dates})
# Index of the row where each user is seen for the last time
last_seen = df.groupby('userid').timestamp.idxmax()
# Start from the total number of unique users and subtract one
# each time a user makes their final appearance
tmp = np.zeros(len(df))
tmp[last_seen.values] = 1
df['remaining'] = (df.userid.nunique() - tmp.cumsum()).astype(int)
df[-10:]
where the last 10 entries are:
lecture_id timestamp userid remaining
90 2 2013-01-04 18:00:00 9 6
91 0 2013-01-04 19:00:00 5 6
92 2 2013-01-04 20:00:00 6 6
93 2 2013-01-04 21:00:00 3 5
94 0 2013-01-04 22:00:00 6 4
95 2 2013-01-04 23:00:00 7 4
96 4 2013-01-05 00:00:00 0 3
97 1 2013-01-05 01:00:00 5 2
98 1 2013-01-05 02:00:00 7 1
99 0 2013-01-05 03:00:00 4 0
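To get the per-lecture drop counts the question actually asks for, a minimal sketch on top of the frame above (my addition, not part of the original answer) is to take each user's final row and count how many of those final appearances fall on each lecture:
# each user's last row; that user "drops" after the lecture in this row
last_rows = df.loc[df.groupby('userid').timestamp.idxmax()]
dropped_per_lecture = last_rows.groupby('lecture_id').userid.nunique()
dropped_per_lecture then maps each lecture_id to the number of users whose final appearance was at that lecture.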

Related

Pandas calculating over duplicated entries

This is my sample dataframe
Price DateOfTrasfer PAON Street
115000 2018-07-13 00:00 4 THE LANE
24000 2018-04-10 00:00 20 WOODS TERRACE
56000 2018-06-22 00:00 6 HEILD CLOSE
220000 2018-05-25 00:00 25 BECKWITH CLOSE
58000 2018-05-09 00:00 23 AINTREE DRIVE
115000 2018-06-21 00:00 4 EDEN VALE MEWS
82000 2018-06-01 00:00 24 ARKLESS GROVE
93000 2018-07-06 00:00 14 HORTON CRESCENT
42500 2018-06-27 00:00 18 CATHERINE TERRACE
172000 2018-05-25 00:00 67 HOLLY CRESCENT
This is the task to perform:
For any address that appears more than once in the dataset, define a holding period as the time between any two consecutive transactions involving that property (i.e. N(holding_periods) = N(appearances) - 1). Implement a function that takes price paid data and returns the average length of a holding period and the annualised change in value between purchase and sale, grouped by the year a holding period ends and the property type.
def holding_time(df):
    df = df.copy()
    # to work only with dates (day)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(['PAON', 'Street'], axis=1, inplace=True)
    df = (df.groupby(['address', 'Price'], as_index=False)
            .agg({'PPD': 'size'})
            .rename(columns={'PPD': 'count_2'}))
    return df
This script creates columns containing the individual holding times, the average holding time for that property, and the price changes during the holding times:
import numpy as np
import pandas as pd

# assume df is defined above ...
# collect the transfer dates (column 1) and prices (column 0) per street
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:, 1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:, 0]).reset_index(name='pgb')
# consecutive differences: holding periods (timedeltas) and price changes
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes'] = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
# mean holding period in days (0 for single-transaction addresses)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)
# drop the extra rows of addresses that appear more than once
df.drop_duplicates(subset=['Street', 'avg_hold'], keep=False, inplace=True)
I created 2 new dummy entries for "Heild Close" to test it:
# Input:
Price DateOfTransfer PAON Street
0 115000 2018-07-13 4 THE LANE
1 24000 2018-04-10 20 WOODS TERRACE
2 56000 2018-06-22 6 HEILD CLOSE
3 220000 2018-05-25 25 BECKWITH CLOSE
4 58000 2018-05-09 23 AINTREE DRIVE
5 115000 2018-06-21 4 EDEN VALE MEWS
6 82000 2018-06-01 24 ARKLESS GROVE
7 93000 2018-07-06 14 HORTON CRESCENT
8 42500 2018-06-27 18 CATHERINE TERRACE
9 172000 2018-05-25 67 HOLLY CRESCENT
10 59000 2018-06-27 12 HEILD CLOSE
11 191000 2018-07-13 1 HEILD CLOSE
# Output:
Price DateOfTransfer PAON Street holding_periods price_changes avg_hold
0 115000 2018-07-13 4 THE LANE [] [] 0.0
1 24000 2018-04-10 20 WOODS TERRACE [] [] 0.0
2 56000 2018-06-22 6 HEILD CLOSE [5 days, 16 days] [3000, 132000] 10.5
3 220000 2018-05-25 25 BECKWITH CLOSE [] [] 0.0
4 58000 2018-05-09 23 AINTREE DRIVE [] [] 0.0
5 115000 2018-06-21 4 EDEN VALE MEWS [] [] 0.0
6 82000 2018-06-01 24 ARKLESS GROVE [] [] 0.0
7 93000 2018-07-06 14 HORTON CRESCENT [] [] 0.0
8 42500 2018-06-27 18 CATHERINE TERRACE [] [] 0.0
9 172000 2018-05-25 67 HOLLY CRESCENT [] [] 0.0
Your question also mentions the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type, but there is no property type column (PAON, maybe?) and grouping by year would make the table difficult to read, so I did not implement it. As it stands, you have the holding time between each pair of transactions and the price change over each holding period, so it should be straightforward to build on this and compute annualised figures if you choose.
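As a rough sketch of that calculation (my addition, not part of the answer above), one holding period can be annualised by scaling the price ratio to a yearly rate:
def annualised_change(buy_price, sell_price, holding_days):
    # compound annual growth rate implied by one purchase/sale pair
    years = holding_days / 365.25
    return (sell_price / buy_price) ** (1 / years) - 1

# e.g. the second HEILD CLOSE holding period above: bought at 59000,
# sold 16 days later at 191000
annualised_change(59000, 191000, 16)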
After manually checking the maximum and minimum average differences, I had to modify the accepted solution in order to match the manual results.
These are the datasets; the function below is a bit slow, so I would appreciate a faster implementation.
urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']
def holding_time(df):
    df = df.copy()
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()
    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1f}'.format)
    return df
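As for speed, one possible direction (a sketch under the assumption that the columns match the function above, not a tested drop-in replacement) is to compute the per-transaction differences once with groupby().diff() instead of calling a Python lambda inside transform() for every address:
def holding_time_fast(df):
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']].copy()
    df['DateOfTrasfer'] = pd.to_datetime(df['DateOfTrasfer'])
    # keep only addresses with more than one transaction
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    df = df.sort_values(['Postcode', 'PAON', 'Street', 'DateOfTrasfer'])
    g = df.groupby(['Postcode', 'PAON', 'Street'], sort=False)
    # consecutive price and date differences within each address
    df['price_diff'] = g['Price'].diff()
    df['hold_days'] = g['DateOfTrasfer'].diff().dt.days
    out = (df.groupby(['Postcode', 'PAON', 'Street', 'Prop_Type'], sort=False)
             .agg(avg_price=('price_diff', 'mean'), avg_hold=('hold_days', 'mean'))
             .dropna()
             .reset_index())
    return out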

Get the last date before an nth date for each month in Python

I am using a csv with an accumulative number that changes daily.
Day Accumulative Number
0 9/1/2020 100
1 11/1/2020 102
2 18/1/2020 98
3 11/2/2020 105
4 24/2/2020 95
5 6/3/2020 120
6 13/3/2020 100
I am now trying to find the best way to aggregate it and compare the monthly results before a specific date. I want to check the balance on the 11th of each month, but for some months there is no activity on that specific day. As a result, I am trying to get the latest day before the 12th of each month. So, the above would become:
Day Accumulative Number
0 11/1/2020 102
1 11/2/2020 105
2 6/3/2020 120
What I managed to do so far is to just get the latest day of each month:
from datetime import datetime

dateparse = lambda x: datetime.strptime(x, "%d/%m/%Y")
df = pd.read_csv("Accumulative.csv", quotechar="'", usecols=["Day", "Accumulative Number"],
                 index_col=False, parse_dates=["Day"], date_parser=dateparse, na_values=['.', '??'])
df.index = df['Day']
grouped = df.groupby(pd.Grouper(freq='M')).sum()
print(df.groupby(df.index.month).apply(lambda x: x.iloc[-1]))
which returns:
Day Accumulative Number
1 2020-01-18 98
2 2020-02-24 95
3 2020-03-13 100
Is there a way to achieve this in pandas/Python, or do I have to use SQL logic in my script? Is there an easier way I am missing to get the "balance" as of the 11th day of each month?
You can sort by date, keep only the rows that fall before the nth day of the month, and then take the last remaining row per month with groupby:
n = 12
df = df.sort_values('Day')
before_nth = df[df.Day.dt.day < n]
df_sub = before_nth.groupby(before_nth.Day.dt.strftime('%Y-%m'), sort=False).tail(1).copy()
You can try filtering the dataframe to the rows where the day is less than 12, then taking the last row of each group (grouped by year and month):
df['Day'] = pd.to_datetime(df['Day'],dayfirst=True)
(df[df['Day'].dt.day.lt(12)]
.groupby([df['Day'].dt.year,df['Day'].dt.month],sort=False).last()
.reset_index(drop=True))
Day Accumulative_Number
0 2020-01-11 102
1 2020-02-11 105
2 2020-03-06 120
I would try:
# convert to datetime type:
df['Day'] = pd.to_datetime(df['Day'], dayfirst=True)
# select day before the 12th
new_df = df[df['Day'].dt.day < 12]
# select the last day in each month
new_df.loc[~new_df['Day'].dt.to_period('M').duplicated(keep='last')]
Output:
Day Accumulative Number
1 2020-01-11 102
3 2020-02-11 105
5 2020-03-06 120
Here's another way, by expanding the date range and forward-filling:
# work on a copy with a shorter column name
df2 = df.rename(columns={'Accumulative Number': 'Number'})
# set as datetime
df2['Day'] = pd.to_datetime(df2['Day'], dayfirst=True)
# set as index
df2 = df2.set_index('Day')
# make a list of all dates
dates = pd.date_range(start=df2.index.min(), end=df2.index.max(), freq='1D')
# add the missing dates
df2 = df2.reindex(dates)
# replace NA with forward fill
df2['Number'] = df2['Number'].ffill()
# filter to get the output
df2 = df2[df2.index.day == 11].reset_index().rename(columns={'index': 'Date'})
print(df2)
print(df2)
Date Number
0 2020-01-11 102.0
1 2020-02-11 105.0
2 2020-03-11 120.0

How to calculate 6 months moving average from daily data using pyspark

I am trying to calculate the moving average of price over the last six months in PySpark.
Currently my table has a 6-month lagged date column.
id dates lagged_6month price
1 2017-06-02 2016-12-02 14.8
1 2017-08-09 2017-02-09 16.65
2 2017-08-16 2017-02-16 16
2 2018-05-14 2017-11-14 21.05
3 2017-09-01 2017-03-01 16.75
Desired Results
id dates avg6mprice
1 2017-06-02 20.6
1 2017-08-09 21.5
2 2017-08-16 16.25
2 2018-05-14 25.05
3 2017-09-01 17.75
Sample code
from pyspark.sql.functions import col
from pyspark.sql import functions as F
from pyspark.sql.window import Window

df = sqlContext.table("price_table")
w = Window.partitionBy([col('id')]).rangeBetween(col('dates'), col('lagged_6month'))
rangeBetween does not seem to accept columns as arguments in the window function.
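There is no answer captured here, but a common workaround (a sketch, untested against this table, with a fixed 182-day window standing in for "6 months") is to order the window by the date expressed as a day count and give rangeBetween plain numeric offsets, which makes the lagged_6month column unnecessary:
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# order by days-since-epoch so rangeBetween can take numeric offsets (~6 months back)
w = (Window.partitionBy('id')
           .orderBy(F.datediff(F.col('dates'), F.lit('1970-01-01')))
           .rangeBetween(-182, 0))

df = df.withColumn('avg6mprice', F.avg('price').over(w))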

Plot values for multiple months and years in Plotly/Dash

I have a Dash dashboard and I need to plot months 0-12 on the x axis, with multiple lines on the same figure for the different years that have been selected, i.e. 1991-2040. The plotted value is a column, say 'Total', in a dataframe; the line labels should be the years and the total value is on the y axis. My data looks like this:
Month Year Total
0 0 1991 31.4
1 0 1992 31.4
2 0 1993 31.4
3 0 1994 20
4 0 1995 300
.. ... ... ...
33 0 2024 31.4
34 1 2035 567
35 1 2035 10
36 1 2035 3
....
Do I need to group it, and how do I achieve that in Dash/Plotly?
It seems to me that you should have a look at pd.pivot_table.
%matplotlib inline
import pandas as pd
import numpy as np
import plotly.offline as py
import plotly.graph_objs as go

# create a df
N = 100
df = pd.DataFrame({"Date": pd.date_range(start='1991-01-01', periods=N, freq='M'),
                   "Total": np.random.randn(N)})
df["Month"] = df["Date"].dt.month
df["Year"] = df["Date"].dt.year

# use pivot_table to have years as columns
pv = pd.pivot_table(df,
                    index=["Month"],
                    columns=["Year"],
                    values=["Total"])

# remove multiindex in columns
pv.columns = [col[1] for col in pv.columns]

data = [go.Scatter(x=pv.index,
                   y=pv[col],
                   name=str(col))
        for col in pv.columns]

py.iplot(data)
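If you are feeding this into Dash, a possibly simpler route (my addition, assuming the long-format df built above with Month/Year/Total columns) is plotly.express, which draws one line per year directly and returns a figure you can hand to dcc.Graph:
import plotly.express as px

fig = px.line(df, x='Month', y='Total', color='Year')
# in a Dash layout: dcc.Graph(figure=fig)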

Python correlation matrix 3d dataframe

I have a historical return table in SQL Server, keyed by date and asset ID, like this:
[Date] [Asset] [1DRet]
jan asset1 0.52
jan asset2 0.12
jan asset3 0.07
feb asset1 0.41
feb asset2 0.33
feb asset3 0.21
...
So I need to calculate the correlation matrix over a given date range for all asset combinations: A1,A2; A1,A3; A2,A3.
I'm using pandas, and in the WHERE clause of my SQL SELECT I'm filtering the date range and ordering by date.
I'm trying to do it using pandas df.corr(), numpy.corrcoef and SciPy, but I am not able to do it for my n-variable dataframe.
I've seen some examples, but they are always for a dataframe with one asset per column and one row per day.
This is the code block where I'm doing it:
qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"
result = conn.execute(qryRet)
df = pd.DataFrame(data=list(result),columns=result.keys())
df1d = df[['Date','Id_RiskFactor','1DReturn']]
corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)
conn.close()
For it I'm receiving this message:
corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'
print(df1d.head()):
Date Id_RiskFactor 1DReturn
0 2010-09-02 149 0E-12
1 2010-09-02 150 -0.004242875148
2 2010-09-02 33 0.000590000011
3 2010-09-02 28 0.000099999997
4 2010-09-02 34 -0.000010000000
print(df.head()):
Date Id_RiskFactor Value 1DReturn 5DReturn
0 2010-09-02 149 0.040096000000 0E-12 0E-12
1 2010-09-02 150 1.736700000000 -0.004242875148 -0.013014321215
2 2010-09-02 33 2.283000000000 0.000590000011 0.001260000048
3 2010-09-02 28 2.113000000000 0.000099999997 0.000469999999
4 2010-09-02 34 0.615000000000 -0.000010000000 0.000079999998
print(corr.columns):
Index([], dtype='object')
Create a sample DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})
>>> df
daily_return date symbol
0 0.011467 2015-01-01 A
1 0.613518 2015-01-02 A
2 0.334343 2015-01-03 A
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
5 0.431729 2015-01-01 B
6 0.474905 2015-01-02 B
7 0.372366 2015-01-03 B
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
10 0.946504 2015-01-01 C
11 0.337204 2015-01-02 C
12 0.798704 2015-01-03 C
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
I'll assume you've already filtered your DataFrame for the relevant dates. You then want a pivot table where you have unique dates as your index and your symbols as separate columns, with daily returns as the values. Finally, you call corr() on the result.
corr = df.set_index(['date','symbol']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
>>> corr
symbol_2 A B C
symbol_1
A 1.000000 0.188065 -0.745115
B 0.188065 1.000000 -0.688808
C -0.745115 -0.688808 1.000000
You can select the subset of your DataFrame based on dates as follows:
start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
daily_return date symbol
3 0.371809 2015-01-04 A
4 0.169016 2015-01-05 A
8 0.801619 2015-01-04 B
9 0.505487 2015-01-05 B
13 0.311597 2015-01-04 C
14 0.545215 2015-01-05 C
If you want to flatten your correlation matrix:
corr.stack().reset_index()
symbol_1 symbol_2 0
0 A A 1.000000
1 A B 0.188065
2 A C -0.745115
3 B A 0.188065
4 B B 1.000000
5 B C -0.688808
6 C A -0.745115
7 C B -0.688808
8 C C 1.000000
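Applied to the question's own columns, a sketch (my addition, not part of the answer above): the empty corr and the droplevel error are usually a sign that 1DReturn came back from SQL Server as Decimal/object rather than float (the 0E-12 values in the printout are typical of decimal.Decimal), and corr() silently drops non-numeric columns. Cast to float first and select the single return column before unstacking:
df1d = df[['Date', 'Id_RiskFactor', '1DReturn']].copy()
df1d['1DReturn'] = pd.to_numeric(df1d['1DReturn'], errors='coerce')
# selecting the single column avoids the MultiIndex, so no droplevel is needed
corr = df1d.set_index(['Date', 'Id_RiskFactor'])['1DReturn'].unstack().corr()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'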

Resources