Pandas Cumulative Sum of Difference Between Value Counts in Two Dataframe Columns - python-3.x

The charts below show my basic challenge: subtract NUMBER OF STOCKS WITH DATA END from NUMBER OF STOCKS WITH DATA START. The difficulty is that the date ranges of the two series do not match, so I need to merge both sets onto a common date range, perform the subtraction, and save the results to a new comma-separated value file.
Input data in the file named 'meta.csv' contains 3187 lines. Each line holds ticker, start, and end fields. Head and tail are shown here:
0000 ticker,start,end
0001 A,1999-11-18,2016-12-27
0002 AA,2016-11-01,2016-12-27
0003 AAL,2005-09-27,2016-12-27
0004 AAMC,2012-12-13,2016-12-27
0005 AAN,1984-09-07,2016-12-27
...
3183 ZNGA,2011-12-16,2016-12-27
3184 ZOES,2014-04-11,2016-12-27
3185 ZQK,1990-03-26,2015-09-09
3186 ZTS,2013-02-01,2016-12-27
3187 ZUMZ,2005-05-06,2016-12-27
Python code and console output:
import pandas as pd
df = pd.read_csv('meta.csv')
s = df.groupby('start').size().cumsum()
e = df.groupby('end').size().cumsum()
#s.plot(title='NUMBER OF STOCKS WITH DATA START',
# grid=True,style='k.')
#e.plot(title='NUMBER OF STOCKS WITH DATA END',
# grid=True,style='k.')
print(s.head(5))
print(s.tail(5))
print(e.tail(5))
OUT:
start
1962-01-02 11
1962-11-19 12
1970-01-02 30
1971-08-06 31
1972-06-01 54
dtype: int64
start
2016-07-05 3182
2016-10-04 3183
2016-11-01 3184
2016-12-05 3185
2016-12-08 3186
end
2016-12-08 544
2016-12-15 545
2016-12-16 546
2016-12-21 547
2016-12-27 3186
dtype: int64
Chart output when the comments are removed from the code shown above: [two scatter charts, NUMBER OF STOCKS WITH DATA START and NUMBER OF STOCKS WITH DATA END]
I want to create one population file with the date and the number of stocks with active data, which should have a head and tail as follows:
date,num_stocks
1962-01-02,11
1962-11-19,12
1970-01-02,30
1971-08-06,31
1972-06-01,54
...
2016-12-08,2642
2016-12-15,2641
2016-12-16,2640
2016-12-21,2639
2016-12-27,2639
The ultimate goal is to be able to plot the number of stocks with data over any specified date range by reading the population file.

To align the dates with their respective counts, I'd take the difference of pd.Series.value_counts:
df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
1984-09-07 1.0
1990-03-26 1.0
1999-11-18 1.0
2005-05-06 1.0
2005-09-27 1.0
2011-12-16 1.0
2012-12-13 1.0
2013-02-01 1.0
2014-04-11 1.0
2015-09-09 -1.0
2016-11-01 1.0
2016-12-27 -9.0
dtype: float64

Thanks to the crucial tip provided by piRSquared, I solved the challenge using this code:
import pandas as pd
df = pd.read_csv('meta.csv')
# +1 for each start date, -1 for each end date; sub() aligns the two
# value_counts on a union index that comes out sorted by date
x = df.start.value_counts().sub(df.end.value_counts(), fill_value=0)
# the last date (2016-12-27) is the end of the data set, not a mass delisting,
# so zero it to keep those stocks counted as active
x.iloc[-1] = 0
# running total = number of stocks with active data on each date
r = x.cumsum()
r.to_csv('pop.csv')
z = pd.read_csv('pop.csv', index_col=0, header=None)
z.plot(title='NUMBER OF STOCKS WITH DATA', legend=None,
       grid=True, style='k.')
'pop.csv' file head/tail:
1962-01-02 11.0
1962-11-19 12.0
1970-01-02 30.0
1971-08-06 31.0
1972-06-01 54.0
...
2016-12-08 2642.0
2016-12-15 2641.0
2016-12-16 2640.0
2016-12-21 2639.0
2016-12-27 2639.0
Chart: [scatter plot, NUMBER OF STOCKS WITH DATA]
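To illustrate the stated ultimate goal, here is a minimal sketch (not part of the original solution) that reads the population file back and plots an arbitrary date range; it assumes 'pop.csv' has no header row, as written by Series.to_csv above:
import pandas as pd

# read the population file and parse the first column as a DatetimeIndex
pop = pd.read_csv('pop.csv', header=None, names=['date', 'num_stocks'],
                  index_col='date', parse_dates=True)
# slice any date range via the index and plot it
pop.loc['1990-01-01':'2010-12-31'].plot(
    title='NUMBER OF STOCKS WITH DATA, 1990-2010',
    legend=None, grid=True, style='k.')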

Related

Pandas calculating over duplicated entries

This is my sample dataframe:
Price DateOfTrasfer PAON Street
115000 2018-07-13 00:00 4 THE LANE
24000 2018-04-10 00:00 20 WOODS TERRACE
56000 2018-06-22 00:00 6 HEILD CLOSE
220000 2018-05-25 00:00 25 BECKWITH CLOSE
58000 2018-05-09 00:00 23 AINTREE DRIVE
115000 2018-06-21 00:00 4 EDEN VALE MEWS
82000 2018-06-01 00:00 24 ARKLESS GROVE
93000 2018-07-06 00:00 14 HORTON CRESCENT
42500 2018-06-27 00:00 18 CATHERINE TERRACE
172000 2018-05-25 00:00 67 HOLLY CRESCENT
This is the task to perform:
For any address that appears more than once in a dataset, define a holding period as the time between any two consecutive transactions involving that property (i.e. N(holding_periods) = N(appearances) - 1). Implement a function that takes price paid data and returns the average length of a holding period and the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type.
def holding_time(df):
    df = df.copy()
    # to work only with dates (day)
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    cols = ['PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
    df.drop(['PAON', 'Street'], axis=1, inplace=True)
    df = df.groupby(['address', 'Price'], as_index=False).agg({'PPD': 'size'})\
           .rename(columns={'PPD': 'count_2'})
    return df
This script creates columns containing the individual holding times, the average holding time for that property, and the price changes during the holding times:
import numpy as np
import pandas as pd
# assume df is defined above ...
# collect each street's dates (column 1) and prices (column 0) into arrays
hdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:, 1]).reset_index(name='hgb')
pdf = df.groupby("Street", sort=False).apply(lambda c: c.values[:, 0]).reset_index(name='pgb')
# consecutive differences give the holding periods and the price changes
df['holding_periods'] = hdf['hgb'].apply(lambda c: np.diff(c.astype(np.datetime64)))
df['price_changes'] = pdf['pgb'].apply(lambda c: np.diff(c.astype(np.int64)))
# rows beyond the number of streets get NaN via index alignment; make them empty lists
df['holding_periods'] = df['holding_periods'].fillna("").apply(list)
df['avg_hold'] = df['holding_periods'].apply(lambda c: np.array(c).astype(np.float64).mean() if c else 0).fillna(0)
# drop rows whose (Street, avg_hold) pair is duplicated (the repeat-sale rows)
df.drop_duplicates(subset=['Street', 'avg_hold'], keep=False, inplace=True)
I created 2 new dummy entries for "Heild Close" to test it:
# Input:
Price DateOfTransfer PAON Street
0 115000 2018-07-13 4 THE LANE
1 24000 2018-04-10 20 WOODS TERRACE
2 56000 2018-06-22 6 HEILD CLOSE
3 220000 2018-05-25 25 BECKWITH CLOSE
4 58000 2018-05-09 23 AINTREE DRIVE
5 115000 2018-06-21 4 EDEN VALE MEWS
6 82000 2018-06-01 24 ARKLESS GROVE
7 93000 2018-07-06 14 HORTON CRESCENT
8 42500 2018-06-27 18 CATHERINE TERRACE
9 172000 2018-05-25 67 HOLLY CRESCENT
10 59000 2018-06-27 12 HEILD CLOSE
11 191000 2018-07-13 1 HEILD CLOSE
# Output:
Price DateOfTransfer PAON Street holding_periods price_changes avg_hold
0 115000 2018-07-13 4 THE LANE [] [] 0.0
1 24000 2018-04-10 20 WOODS TERRACE [] [] 0.0
2 56000 2018-06-22 6 HEILD CLOSE [5 days, 16 days] [3000, 132000] 10.5
3 220000 2018-05-25 25 BECKWITH CLOSE [] [] 0.0
4 58000 2018-05-09 23 AINTREE DRIVE [] [] 0.0
5 115000 2018-06-21 4 EDEN VALE MEWS [] [] 0.0
6 82000 2018-06-01 24 ARKLESS GROVE [] [] 0.0
7 93000 2018-07-06 14 HORTON CRESCENT [] [] 0.0
8 42500 2018-06-27 18 CATHERINE TERRACE [] [] 0.0
9 172000 2018-05-25 67 HOLLY CRESCENT [] [] 0.0
Your question also mentions the annualised change in value between the purchase and sale, grouped by the year a holding period ends and the property type, but there is no property type column (PAON maybe?) and grouping by year would make the table extremely difficult to read, so I did not implement it. As it stands, you have the holding time between each transaction and the change of price at each time, so it should be trivial to implement a function to use this information to plot annualized data, if you so choose.
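For completeness, here is a hypothetical helper (mine, not from the question or the answer above) showing how one holding period and its price change could be annualised as a compound growth rate:
def annualised_change(price_buy, price_sell, days):
    # compound annual growth rate between purchase and sale
    years = days / 365.25
    return (price_sell / price_buy) ** (1 / years) - 1

# e.g. the first HEILD CLOSE holding period above: 56000 -> 59000 over 5 days
print(annualised_change(56000, 59000, 5))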
After manually checking the max and min average differences, I had to modify the accepted solution in order to match the manual results.
These are the source data files; this function is a bit slow, so I would appreciate a faster implementation.
urls = ['http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2020.csv',
        'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2019.csv',
        'http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-2018.csv']
def holding_time(df):
    df = df.copy()
    df = df[['Price', 'DateOfTrasfer', 'Prop_Type', 'Postcode', 'PAON', 'Street']]
    # keep only properties that appear more than once
    df = df[df.duplicated(subset=['Postcode', 'PAON', 'Street'], keep=False)]
    cols = ['Postcode', 'PAON', 'Street']
    df['address'] = df[cols].apply(lambda row: '_'.join(row.values.astype(str)), axis=1)
    df['address'] = df['address'].apply(lambda x: x.replace(' ', '_'))
    df.DateOfTrasfer = pd.to_datetime(df.DateOfTrasfer)
    # per-address mean price change and mean holding time in days
    df['avg_price'] = df.groupby(['address'])['Price'].transform(lambda x: x.diff().mean())
    df['avg_hold'] = df.groupby(['address'])['DateOfTrasfer'].transform(lambda x: x.diff().dt.days.mean())
    df.drop_duplicates(subset=['address'], keep='first', inplace=True)
    df.drop(['Price', 'DateOfTrasfer', 'address'], axis=1, inplace=True)
    df = df.dropna()
    df['avg_hold'] = df['avg_hold'].map('Days {:.1f}'.format)
    df['avg_price'] = df['avg_price'].map('£{:,.1f}'.format)
    return df
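Since a faster implementation was requested: one likely hotspot is the row-wise apply used to build the address key. A sketch of a vectorised replacement (an assumption on my part, not benchmarked against the full files):
# build the address key with vectorised string operations instead of apply
df['address'] = (df['Postcode'].astype(str) + '_'
                 + df['PAON'].astype(str) + '_'
                 + df['Street'].astype(str)).str.replace(' ', '_')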

Unable to convert text format to proper data frame using Pandas

I am reading a text source from URL = 'https://www.census.gov/construction/bps/txt/tb2u201901.txt'
Here I used pandas to convert it into a DataFrame:
df = pd.read_csv(URL, sep='\t')
After exporting the df I see that all the columns are merged into a single column, in spite of giving the separator as '\t'. How can I solve this issue?
As your file is not a CSV file, you should use the function read_fwf() from pandas, because your columns have a fixed width. You also need to skip the first 12 lines, which are not part of your data, and remove the empty lines with dropna().
df = pd.read_fwf(URL, skiprows=12)
df.dropna(inplace=True)
df.head()
United States 94439 58086 1600 1457 33296 1263
1 Northeast 9099.0 3330.0 272.0 242.0 5255.0 242.0
2 New England 1932.0 1079.0 90.0 72.0 691.0 46.0
3 Connecticut 278.0 202.0 8.0 3.0 65.0 8.0
4 Maine 357.0 222.0 6.0 0.0 129.0 5.0
5 Massachusetts 819.0 429.0 38.0 54.0 298.0 23.0
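If you also want usable column headers, one option is to pass them explicitly. A sketch; the names below are read off the report layout and are assumptions, so verify them against the txt file:
# hypothetical column names for this census report
names = ['Area', 'Total', '1 Unit', '2 Units', '3 and 4 Units',
         '5 Units or More', 'Structures with 5 Units or More']
df = pd.read_fwf(URL, skiprows=12, names=names)
df.dropna(inplace=True)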
Your output is coming out correct. If you open the URL, you will see that the file begins with sentences that are not tab-separated, so pandas is not able to present them in the correct way. From line number 9 onward the results are correct.

Python - Create copies of rows based on column value and increase date by number of iterations

I have a dataframe in Python:
md
Out[94]:
Key_ID ronDt multidays
0 Actuals-788-8AA-0001 2017-01-01 1.0
11 Actuals-788-8AA-0012 2017-01-09 1.0
20 Actuals-788-8AA-0021 2017-01-16 1.0
33 Actuals-788-8AA-0034 2017-01-25 1.0
36 Actuals-788-8AA-0037 2017-01-28 1.0
... ... ...
55239 Actuals-789-8LY-0504 2020-02-12 1.0
55255 Actuals-788-T11-0001 2018-08-23 8.0
55257 Actuals-788-T11-0003 2018-09-01 543.0
55258 Actuals-788-T15-0001 2019-02-20 368.0
55259 Actuals-788-T15-0002 2020-02-24 2.0
I want to create an additional record for every multiday and increase the date (ronDt) by the number of times that record was duplicated.
For example:
row[0] would repeat one time with the new date reading 2017-01-02.
row[55255] would be repeated 8 times with the corresponding dates ranging from 2018-08-24 to 2018-08-31.
When I did this in VBA, I used loops, and in Alteryx I used multirow functions. What is the best way to achieve this in Python? Thanks.
Here's a way to do it in pandas:
# get the list of dates covered by each record
df['datecol'] = df.apply(lambda x: pd.date_range(start=x['ronDt'], periods=int(x['multidays']), freq='D'), axis=1)
# convert the lists into new rows
df = df.explode('datecol').drop('ronDt', axis=1)
# rename the columns
df.rename(columns={'datecol': 'ronDt'}, inplace=True)
print(df)
Key_ID multidays ronDt
0 Actuals-788-8AA-0001 1.0 2017-01-01
1 Actuals-788-8AA-0012 1.0 2017-01-09
2 Actuals-788-8AA-0021 1.0 2017-01-16
3 Actuals-788-8AA-0034 1.0 2017-01-25
4 Actuals-788-8AA-0037 1.0 2017-01-28
.. ... ... ...
8 Actuals-788-T15-0001 368.0 2020-02-20
8 Actuals-788-T15-0001 368.0 2020-02-21
8 Actuals-788-T15-0001 368.0 2020-02-22
9 Actuals-788-T15-0002 2.0 2020-02-24
9 Actuals-788-T15-0002 2.0 2020-02-25
# get the count of duplications for each row, which corresponds to the multidays col
df = df.groupby(df.columns.tolist()).size().reset_index().rename(columns={0: 'multidays'})
# assume ronDt dtype is str, so convert it to a datetime object,
# then add the multidays column (as days) to ronDt
df['ronDt_new'] = pd.to_datetime(df['ronDt']) + pd.to_timedelta(df['multidays'], unit='d')
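As a quick sanity check of the timedelta arithmetic (values taken from the sample above):
import pandas as pd

# 2018-08-23 plus 8 days lands on 2018-08-31
print(pd.to_datetime('2018-08-23') + pd.to_timedelta(8.0, unit='d'))
# 2018-08-31 00:00:00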

Iterate over rows in a data frame, create a new column, then add more columns based on the new column

I have a data frame as below:
Date Quantity
2019-04-25 100
2019-04-26 148
2019-04-27 124
The output that I need is to take the quantity difference between two consecutive dates, average it over 24 hours, and create 23 columns in which the hourly quantity difference is added to the previous column, like below:
Date Quantity Hour-1 Hour-2 ....Hour-23
2019-04-25 100 102 104 .... 146
2019-04-26 148 147 146 .... 123
2019-04-27 124
I'm trying to iterate in a loop, but it's not working. My code is below:
for i in df.index:
    diff = (df.get_value(i+1, 'Quantity') - df.get_value(i, 'Quantity')) / 24
    for j in range(24):
        df[i, [1+j]] = df.[i, [j]] * (1+diff)
I did some research but I have not found how to create columns like above iteratively. I hope you could help me. Thank you in advance.
IIUC, using resample and interpolate, then pivoting the output:
s=df.set_index('Date').resample('1 H').interpolate()
s=pd.pivot_table(s,index=s.index.date,columns=s.groupby(s.index.date).cumcount(),values=s,aggfunc='mean')
s.columns=s.columns.droplevel(0)
s
Out[93]:
0 1 2 3 ... 20 21 22 23
2019-04-25 100.0 102.0 104.0 106.0 ... 140.0 142.0 144.0 146.0
2019-04-26 148.0 147.0 146.0 145.0 ... 128.0 127.0 126.0 125.0
2019-04-27 124.0 NaN NaN NaN ... NaN NaN NaN NaN
[3 rows x 24 columns]
If I have understood the question correctly, here is a for-loop approach:
list_of_values = []
for i, row in df.iterrows():
    if i < len(df) - 1:  # every row except the last has a following row
        qty = row['Quantity']
        qty_2 = df.at[i+1, 'Quantity']
        diff = (qty_2 - qty) / 24
        list_of_values.append(diff)
    else:
        list_of_values.append(0)
df['diff'] = list_of_values
Output:
Date Quantity diff
2019-04-25 100 2
2019-04-26 148 -1
2019-04-27 124 0
Now create the required columns, i.e.:
df['Hour-1'] = df['Quantity'] + df['diff']
df['Hour-2'] = df['Quantity'] + 2*df['diff']
...
There are other approaches that will work much better; for example, the vectorised sketch below.
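A minimal vectorised sketch (my addition, building on the diff column computed above) that creates all 23 hourly columns at once:
import pandas as pd

# Hour-h = Quantity + h * (hourly difference), for h = 1..23
hours = {'Hour-{}'.format(h): df['Quantity'] + h * df['diff'] for h in range(1, 24)}
df = pd.concat([df, pd.DataFrame(hours)], axis=1)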

When I download data and convert it into a DataFrame I lose the first column, which holds the dates

I use quandl to download stock prices. I have a list of company names and I download all the information. After that, I convert it into a data frame. When I do it for only one company everything works well, but when I try to do it for all of them at the same time something goes wrong: the first column, which holds the dates, is converted into an index with values from 0 to 3 instead of the dates.
My code looks like below:
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x).reset_index(drop=True)
Actual results look like below:
Index Open High Low Close %Change Volume # of Trades Turnover (1000) company
0 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
1 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
But I expected:
Data Open High Low Close %Change Volume # of Trades Turnover (1000) company
2018-11-29 204.5 208.5 204.5 206.0 0.73 3461.0 105.0 717.31 11BIT
2018-11-29 205.0 215.0 202.5 214.0 3.88 10812.0 392.0 2254.83 ABCDATA
2018-11-29 215.0 215.0 203.5 213.0 -0.47 12651.0 401.0 2656.15 ALCHEMIA
So as you can see, there is an issue with the dates because they aren't converted the correct way. But as I said, if I do it for only one company, it works. Below is the code:
x = quandl.get('WSE/11BIT', start_date='2019-01-01', end_date='2019-01-03')
df = pd.DataFrame(x)
I will be very grateful for any help! Thanks all.
When you store it to a dataframe, the date is your index. You lose it because when you use .reset_index(), you overwrite the old index (the date), and instead of the date being added as a column, you tell pandas to drop it with .reset_index(drop=True).
So I'd still append inside the loop, but once the whole results dataframe is populated, I'd reset the index WITHOUT dropping it, by doing either results = results.reset_index(drop=False) or results = results.reset_index(), since the default is False.
import quandl
import pandas as pd
names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
results = pd.DataFrame()
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29',
                   paginate=True)
    x['company'] = names
    results = results.append(x)
results = results.reset_index(drop=False)
Output:
print (results)
Date Open High ... # of Trades Turnover (1000) company
0 2018-11-29 269.50 271.00 ... 280.0 1822.02 11BIT
1 2018-11-29 0.82 0.92 ... 309.0 1027.14 ABCDATA
2 2018-11-29 4.55 4.55 ... 1.0 0.11 ALCHEMIA
[3 rows x 10 columns]
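One caveat: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on current pandas the same loop is better written with pd.concat. A sketch under that assumption:
import quandl
import pandas as pd

names_of_company = ['11BIT', 'ABCDATA', 'ALCHEMIA']
frames = []
for names in names_of_company:
    x = quandl.get('WSE/%s' % names, start_date='2018-11-29',
                   end_date='2018-11-29', paginate=True)
    x['company'] = names
    frames.append(x)
# concatenate once at the end, then move the Date index into a column
results = pd.concat(frames).reset_index()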
