Calculation of stock values with yfinance and python - python-3.x

I would like to make some calculations on stock prices in Python 3 and I have installed the module yfinance.
I try to get an individual value like this:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
row_date = tickerDf[tickerDf['Date']=='2020-12-30']
value = row_date.Open.item()
#see your data
print (value)
But when I run this, it says:
KeyError: 'Date'
Which is strange because when I do this, it works well and I have the column Date:
import yfinance as yf
#define the ticker symbol
tickerSymbol = 'MSFT'
#get data on this ticker
tickerData = yf.Ticker(tickerSymbol)
#get the historical prices for this ticker
tickerDf = tickerData.history(period='1d', start='2015-1-1', end='2020-12-30')
#row_date = tickerDf[tickerDf['Date']=='2020-12-30']
#value = row_date.Open.item()
#see your data
print (tickerDf)
I get the following result:
G:\python> python test.py
Open High Low Close Volume Dividends Stock Splits
Date
2014-12-31 41.512481 42.143207 41.263744 41.263744 21552500 0.0 0
2015-01-02 41.450302 42.125444 41.343701 41.539135 27913900 0.0 0
2015-01-05 41.192689 41.512495 41.086088 41.157158 39673900 0.0 0
2015-01-06 41.201567 41.530255 40.455355 40.553074 36447900 0.0 0
2015-01-07 40.846223 41.272629 40.410934 41.068310 29114100 0.0 0
... ... ... ... ... ... ... ...
2020-12-22 222.690002 225.630005 221.850006 223.940002 22612200 0.0 0
2020-12-23 223.110001 223.559998 220.800003 221.020004 18699600 0.0 0
2020-12-24 221.419998 223.610001 221.199997 222.750000 10550600 0.0 0
2020-12-28 224.449997 226.029999 223.020004 224.960007 17933500 0.0 0
2020-12-29 226.309998 227.179993 223.580002 224.149994 17403200 0.0 0
[1510 rows x 7 columns]

Under the hood, yfinance builds the price history as a pandas DataFrame. In this DataFrame, Date isn't an ordinary column; it is the name given to the index (see line 240 in base.py of yfinance). The index behaves differently from regular columns and can't be selected by name with tickerDf['Date'], which is why you get the KeyError. You can filter on it with tickerDf.index == '2020-12-30', or turn it into a regular column using reset_index, as explained in another question. Lookups on an index are generally faster than searching a regular column, so if you are looking through a lot of data, it is to your advantage to leave it as an index.
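As a minimal sketch of both options, here is a small hand-built frame that mimics the shape of the output above (a date index named Date plus an Open column). The frame and the variable names are assumptions for illustration, not actual yfinance output. The printed frame in the question stops at 2020-12-29, so the sketch looks up that date.

```python
import pandas as pd

# Hand-built stand-in for the yfinance result: "Date" is the index name,
# not a column. The Open values are copied from the printed output above.
dates = pd.to_datetime(["2020-12-24", "2020-12-28", "2020-12-29"])
prices = pd.DataFrame(
    {"Open": [221.419998, 224.449997, 226.309998]},
    index=pd.DatetimeIndex(dates, name="Date"),
)

target = pd.Timestamp("2020-12-29")

# "Date" is not among the columns, which is why prices["Date"] raises KeyError.
date_is_column = "Date" in prices.columns

# Option 1: look the row up by its index label.
open_by_label = prices.loc[target, "Open"]

# Option 2: build a boolean mask from the index, as in the question's code.
open_by_mask = prices[prices.index == target].Open.item()

# Option 3: turn the index into a regular column and filter on it.
flat = prices.reset_index()
open_by_column = flat[flat["Date"] == target].Open.item()

print(date_is_column, open_by_label, open_by_mask, open_by_column)
```

If the requested date is missing from the data, the mask selects no rows and .item() raises a ValueError, so it is worth checking that the date is present first.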

Related

how to replace stock-prices symbols in a dataframe

I would like to get the S&P 500 ['Adj Close'] column and replace the column with the corresponding stock symbol, however, I am not able to replace the dataframe columns because it gives me an error: KeyError: '5'
What I would like to achieve is to loop through all the available stocks from the list and replace the Adj Close with the stock symbol.
This is what I did:
First I have scraped the stock symbols from Wikipedia and added them to a list.
data = pd.read_html('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
symbols = data[0] # read_html returns a list of tables; take the first one
symbols.head()
stock = symbols['Symbol'].to_list()
print(stock[0:5])
This gives me a list of stock symbols as below:
['MMM', 'ABT', 'ABBV', 'ABMD', 'ACN']
Then I scraped Yahoo Finance to get the daily financial data as below:
stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
'range' : '1y',
'interval' : '1d',
'events' : 'history'
}
response = requests.get(stock_url.format(stock[0]), params=params)
file = StringIO(response.text)
reader = csv.reader(file)
data = list(reader)
df = pd.DataFrame(data)
stock_data = df['5']
Fix for key error
You are calling the URL with the whole list 'stock', which gave a 404 response when I tried it.
Call the URL with an individual stock symbol, like below:
requests.get(stock_url.format(stock[0]), params=params)
Also change the lookup as below. The column label 5 is stored as an integer rather than a string, which is why you got the KeyError:
stock_data = df[5]
I tried for stock 'MMM' - stock[0] and it prints below:
0 1 2 3 4 5 \
0 Date Open High Low Close Adj Close
1 2019-12-11 168.380005 168.839996 167.330002 168.740005 162.682480
2 2019-12-12 166.729996 170.850006 166.330002 168.559998 162.508926
3 2019-12-13 169.619995 171.119995 168.080002 168.789993 162.730667
4 2019-12-16 168.940002 170.830002 168.190002 170.750000 164.620316
.. ... ... ... ... ... ...
249 2020-12-04 172.130005 173.160004 171.539993 172.460007 172.460007
250 2020-12-07 171.720001 172.500000 169.179993 170.149994 170.149994
251 2020-12-08 169.740005 172.830002 169.699997 172.460007 172.460007
252 2020-12-09 172.669998 175.639999 171.929993 175.289993 175.289993
253 2020-12-10 174.869995 175.399994 172.690002 173.490005 173.490005
[254 rows x 7 columns]
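The KeyError can be reproduced without any network access. The sketch below is an illustration only: csv_text is a made-up stand-in for the downloaded response body (the first data row is taken from the output above, and the volume figure is a placeholder). It also shows one possible alternative, promoting the first CSV row to the header so the column can be selected by name.

```python
import csv
import io

import pandas as pd

# Stand-in for response.text; the volume value is a placeholder.
csv_text = (
    "Date,Open,High,Low,Close,Adj Close,Volume\n"
    "2019-12-11,168.380005,168.839996,167.330002,168.740005,162.682480,100\n"
)
rows = list(csv.reader(io.StringIO(csv_text)))

# Built from a list of lists, the frame gets integer column labels 0..6.
raw = pd.DataFrame(rows)

try:
    raw["5"]  # the string '5' is not one of the labels
    string_key_failed = False
except KeyError:
    string_key_failed = True

by_position = raw[5]  # the integer label works

# Alternative: use the first row as the header and select by name.
tidy = pd.DataFrame(rows[1:], columns=rows[0])
adj_close = tidy["Adj Close"].astype(float)

print(string_key_failed, by_position.tolist(), adj_close.tolist())
```

With the header promoted, the column values can be converted to numbers, which is not possible while the header text sits in the first data row.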
Loop through stocks and replace Adj Close (Edited as per requirements from comments)
Code for looping through stocks and replacing Adj close with Stock symbol.
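import csv
import io

import pandas as pd
import requests

stock_url = 'https://query1.finance.yahoo.com/v7/finance/download/{}?'
params = {
    'range' : '1y',
    'interval' : '1d',
    'events' : 'history'
}
frames = []
for i in stock:
    response = requests.get(stock_url.format(i), params=params)
    file = io.StringIO(response.text)
    reader = csv.reader(file)
    data = list(reader)
    df1 = pd.DataFrame(data)
    # Replace the 'Adj Close' header cell (column label 5) with the symbol
    df1.loc[df1[5] == 'Adj Close', 5] = i
    frames.append(df1)
# DataFrame.append was removed in pandas 2.0; concatenate once instead
df = pd.concat(frames)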
Tried the code for first 3 stocks and here it is:

Counting the number of times the values are more than the mean for a specific column in Dataframe

I'm trying to find the number of times the value in a certain column (in this case under "AveragePrice") is more than its mean & median. I calculated the mean using the below:
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis = 0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis = 0)
How do I count the number of times the values were more than the mean?
Sample of the Dataframe:
Date AveragePrice Total Volume PLU4046 PLU4225 PLU4770 Total Bags
0 27/12/2015 1.33 64236.62 1036.74 54454.85 48.16 8696.87
1 20/12/2015 1.35 54876.98 674.28 44638.81 58.33 9505.56
2 13/12/2015 0.93 118220.22 794.70 109149.67 130.50 8145.35
3 06/12/2015 1.08 78992.15 1132.00 71976.41 72.58 5811.16
4 29/11/2015 1.28 51039.60 941.48 43838.39 75.78 6183.95
5 22/11/2015 1.26 55979.78 1184.27 48067.99 43.61 6683.91
6 15/11/2015 0.99 83453.76 1368.92 73672.72 93.26 8318.86
7 08/11/2015 0.98 109428.33 703.75 101815.36 80.00 6829.22
8 01/11/2015 1.02 99811.42 1022.15 87315.57 85.34 11388.36
import numpy as np
mean_AveragePrice = avocadodf["AveragePrice"].mean(axis = 0)
median_AveragePrice = avocadodf["AveragePrice"].median(axis = 0)
where_bigger = np.where((avocadodf["AveragePrice"] > mean_AveragePrice) & (avocadodf["AveragePrice"] > median_AveragePrice), 1, 0 )
where_bigger.sum()
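You already have the mean and the median; what you still need is the comparison. np.where marks each row that exceeds both the mean and the median with 1 and every other row with 0, and the sum of those marks is the count.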
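A slightly shorter variant, sketched here on the nine sample rows from the question, skips np.where: comparing a column with a number yields a boolean Series, and summing it counts the True values. Only the AveragePrice column is rebuilt; the variable names above_mean, above_median and above_both are made up for this example.

```python
import pandas as pd

# AveragePrice values copied from the sample rows in the question.
avocadodf = pd.DataFrame(
    {"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28, 1.26, 0.99, 0.98, 1.02]}
)
prices = avocadodf["AveragePrice"]

# True counts as 1 when summed, so each sum is a row count.
above_mean = int((prices > prices.mean()).sum())
above_median = int((prices > prices.median()).sum())
above_both = int(((prices > prices.mean()) & (prices > prices.median())).sum())

print(above_mean, above_median, above_both)  # 4 4 4
```

On the full dataset the three counts will usually differ, so keep them separate if the mean and the median need to be reported individually.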

Why am I getting the same POST data even though I'm posting to different URLs

I'm trying to scrape http://www.moneycontrol.com/stocks/histstock.php?sc_id=BPC&mycomp=BPCL to get price data.
So I did the following:
Opened that link and entered the dates (daily).
In Chrome -> Inspect -> Network, obtained the form details and found the URL for the POST.
Filled in the form data and sent the POST.
I have multiple tickers for which I need the data.
Eg:
'AXISBANK': 'http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK',
'BAJAJ-AUTO': 'http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL',
But when I run the POST, I get the same output even though the URLs I'm posting to are different.
What could I be missing?
Output:
running for http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK
Date Open High Low Close Volume
244 05-01-2016 881.3 905.00 881.3 900.65 1372748
245 04-01-2016 876.2 892.45 871.7 880.80 709103
246 01-01-2016 882.0 885.60 876.9 878.75 294006
running for http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL
Date Open High Low Close Volume
244 05-01-2016 881.3 905.00 881.3 900.65 1372748
245 04-01-2016 876.2 892.45 871.7 880.80 709103
246 01-01-2016 882.0 885.60 876.9 878.75 294006
This is the code I wrote to test it.
url='http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=AXISBANK'
url2='http://www.moneycontrol.com/stocks/hist_stock_result.php?ex=N&sc_id=API&mycomp=BPCL'
import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
data = {
'frm_dy':'01',
'frm_mth':'01',
'frm_yr':'2016',
'to_dy':'31',
'to_mth':'12',
'to_yr':'2016',
'hdn':'daily'
# 'x':'15',
# 'y':'14'
}
print('running for {}'.format(url))
test = requests.post(url,data=data) # Post the data
doc = bs(test.text,'html.parser')
tables = doc.find('table',{'class':'tblchart'})
tData = pd.read_html(str(tables),header=1) #You get a list
#Convert it to dataFrame
tData = tData[0].drop(columns=['(High-Low)','(Open-Close)'])
print(tData.tail(3))
import time
time.sleep(20) # Hopefully sleep works?
url = url2 # test only
print('running for {}'.format(url))
test = requests.post(url,data=data)
doc = bs(test.text,'html.parser')
tables = doc.find('table',{'class':'tblchart'})
tData = pd.read_html(str(tables),header=1) #You get a list
#Convert it to dataFrame
tData = tData[0].drop(columns=['(High-Low)','(Open-Close)'])
print(tData.tail(3))
I noticed that sc_id changed when I ran it directly from the URL compared with what I saw in 'Inspect'.
I don't know what sc_id is (a session ID?).
I'm totally new to web scraping, so I don't really know the gotchas or whether I've hit any.
What could I be missing?
You have to set the sc_id= parameter in the URL correctly; it is the site's code for the company.
For AXIS Bank it's UTI10
For Bajaj Auto it's BA06
For example:
import re
import requests
import pandas as pd
from bs4 import BeautifulSoup
def get_sc_id(name, full_name):
    url = 'https://www.moneycontrol.com/stocks/autosuggest.php'
    params = {'str': name}
    return re.search(r'set_val\(\'{}\',\'(.*?)\'\)'.format(full_name), requests.get(url, params=params).text, flags=re.I)[1]

def get_table(sc_id, mycomp):
    url = 'https://www.moneycontrol.com/stocks/hist_stock_result.php'
    params = {
        'ex':'B',
        'sc_id': sc_id,
        'mycomp': mycomp
    }
    data = {
        'frm_dy':'01',
        'frm_mth':'01',
        'frm_yr':'2016',
        'to_dy':'31',
        'to_mth':'12',
        'to_yr':'2016',
        'hdn':'daily'
    }
    soup = BeautifulSoup(requests.post(url, data=data, params=params).content, 'html.parser')
    return pd.read_html( str(soup.select_one('.tblchart')) )[0].droplevel(0, axis=1)

code = get_sc_id('AXIS', 'Axis Bank')
print('Axis Bank code: ', code)
print(get_table(code, 'Axis Bank'))
code = get_sc_id('BAJAJ', 'Bajaj Auto')
print('Bajaj Auto code:', code )
print(get_table(code, 'Bajaj Auto'))
Prints:
Axis Bank code: UTI10
Date Open High Low Close Volume (High-Low) (Open-Close)
0 30-12-2016 446.00 451.80 443.45 450.00 234037 8.35 -4.00
1 29-12-2016 447.00 447.00 437.80 444.15 267677 9.20 2.85
2 28-12-2016 437.45 447.85 436.00 439.50 251149 11.85 -2.05
3 27-12-2016 430.00 438.55 430.00 437.45 210857 8.55 -7.45
4 26-12-2016 432.15 436.00 427.00 431.75 405044 9.00 0.40
.. ... ... ... ... ... ... ... ...
242 07-01-2016 424.25 425.00 407.30 409.35 1441934 17.70 14.90
243 06-01-2016 439.70 439.70 429.80 430.80 730512 9.90 8.90
244 05-01-2016 439.00 440.00 433.65 436.35 726947 6.35 2.65
245 04-01-2016 448.85 448.85 437.40 439.25 743518 11.45 9.60
246 01-01-2016 450.00 452.70 445.80 449.80 433052 6.90 0.20
[247 rows x 8 columns]
Bajaj Auto code: BA06
Date Open High Low Close Volume (High-Low) (Open-Close)
0 30-12-2016 2655.55 2667.00 2627.25 2633.85 10377 39.75 21.70
1 29-12-2016 2621.00 2665.65 2611.50 2655.45 8704 54.15 -34.45
2 28-12-2016 2629.35 2653.00 2624.55 2631.60 6475 28.45 -2.25
3 27-12-2016 2563.00 2642.00 2563.00 2633.60 15491 79.00 -70.60
4 26-12-2016 2618.00 2618.35 2578.00 2596.70 7205 40.35 21.30
.. ... ... ... ... ... ... ... ...
242 07-01-2016 2470.00 2481.80 2407.25 2419.25 15962 74.55 50.75
243 06-01-2016 2495.00 2513.70 2475.00 2485.50 11975 38.70 9.50
244 05-01-2016 2518.00 2520.00 2480.00 2497.05 11967 40.00 20.95
245 04-01-2016 2507.90 2545.85 2480.65 2488.15 23077 65.20 19.75
246 01-01-2016 2530.00 2530.00 2512.15 2520.05 9055 17.85 9.95
[247 rows x 8 columns]
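The lookup in get_sc_id can be tried offline. The sketch below applies the same pattern to a made-up string: the real autosuggest response is not shown here, so sample is only an assumption about its shape, and the second entry with the code 'XX00' is invented. The helper name extract_sc_id is hypothetical. Compared with the code above, it escapes the company name and returns None instead of failing when nothing matches.

```python
import re

# Invented sample text; only the set_val('name','code') shape is taken
# from the regular expression used in get_sc_id above.
sample = "set_val('Axis Bank','UTI10') set_val('Axis Bank Gold ETF','XX00')"


def extract_sc_id(text, full_name):
    """Return the code paired with full_name, or None if it is absent."""
    # re.escape keeps characters such as '.' or '&' in a name literal.
    pattern = r"set_val\('{}','(.*?)'\)".format(re.escape(full_name))
    match = re.search(pattern, text, flags=re.I)
    return match[1] if match else None


print(extract_sc_id(sample, "Axis Bank"))   # UTI10
print(extract_sc_id(sample, "Bajaj Auto"))  # None
```

Returning None makes a missing company easy to detect before the code is passed on as sc_id.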

Finding out the NAN values for Summary report

```
def drag_mis(data):
    list = []
    for val in data.values:
        if np.any(val) == None:
            list.append(val)
    return list.count(val)
```
I need a summary report like the attached XLS file, and I want to automate producing it.
The function above is meant to pick out the NaN values and return their count.
df.groupby(["Operator","Model"],axis=0)[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17',
'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17',
'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18',
'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19',
'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19',
'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20',
'May-20']].apply(drag_mis)
I want to count all the NaN values so that I can write the counts to a summary report in a new CSV file.
The output is as follows:
AAL 737 0
757 0
767 0
777 0
787 0
MD80 0
AAR 747 0
767 0
777 0
ABM 747 0
ACN 737 0
Where is my function going wrong? Any ideas are welcome.
I also tried the code below, but I need a summary like value_counts, which cannot be applied to a DataFrame: [![enter image description here][1]][1]
df.groupby(["Operator","Model"])[['Jan-17', 'Feb-17', 'Mar-17', 'Apr-17', 'May-17', 'Jun-17', 'Jul-17', 'Aug-17', 'Sep-17', 'Oct-17', 'Nov-17', 'Dec-17', 'Jan-18', 'Feb-18', 'Mar-18', 'Apr-18', 'May-18', 'Jun-18', 'Jul-18', 'Aug-18', 'Sep-18', 'Oct-18', 'Nov-18', 'Dec-18', 'Jan-19', 'Feb-19', 'Mar-19', 'Apr-19', 'May-19', 'Jun-19', 'Jul-19', 'Aug-19', 'Sep-19', 'Oct-19', 'Nov-19', 'Dec-19', 'Jan-20', 'Feb-20', 'Mar-20', 'Apr-20', 'May-20']].apply(lambda x: x.isnull().sum())
Please look at this snapshot of the XLS file:
[1]: https://i.stack.imgur.com/E1FTN.jpg
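One likely reason drag_mis always returns 0: np.any(val) yields a boolean, and a boolean compared with None is never equal, so nothing is ever appended to the list. NaN is also not equal to None (or even to itself), so it has to be detected with isna(). A minimal sketch of a per-group NaN count follows, on a made-up miniature of the table; the rows, the two month columns and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Made-up miniature of the fleet table; only the column layout follows
# the question (Operator, Model, then one column per month).
df = pd.DataFrame({
    "Operator": ["AAL", "AAL", "AAL", "AAR"],
    "Model": ["737", "737", "757", "747"],
    "Jan-17": [1.0, np.nan, 3.0, np.nan],
    "Feb-17": [np.nan, np.nan, 2.0, 5.0],
})
month_cols = ["Jan-17", "Feb-17"]

# isna() flags every missing cell as True.
missing = df[month_cols].isna()

# Summing the flags per (Operator, Model) gives the NaN count per month.
per_group_month = missing.groupby([df["Operator"], df["Model"]]).sum()

# One total per group, which can be written out with to_csv().
per_group_total = per_group_month.sum(axis=1)

print(per_group_month)
print(per_group_total)
```

per_group_total is a Series indexed by (Operator, Model), so it has the same shape as the output shown above but with the actual counts.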

Find two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth

I have the following df with data about the American quarterly GDP in billions of chained 2009 dollars, from 1947q1 to 2016q2:
df = pd.DataFrame(data = [1934.5, 1932.3, 1930.3, 1960.7, 1989.5, 2021.9, 2033.2, 2035.3, 2007.5, 2000.8, 2022.8, 2004.7, 2084.6, 2147.6, 2230.4, 2273.4, 2304.5, 2344.5, 2392.8, 2398.1, 2423.5, 2428.5, 2446.1, 2526.4, 2573.4, 2593.5, 2578.9, 2539.8, 2528.0, 2530.7, 2559.4, 2609.3, 2683.8, 2727.5, 2764.1, 2780.8, 2770.0, 2792.9, 2790.6, 2836.2, 2854.5, 2848.2, 2875.9, 2846.4, 2772.7, 2790.9, 2855.5, 2922.3, 2976.6, 3049.0, 3043.1, 3055.1, 3123.2, 3111.3, 3119.1, 3081.3, 3102.3, 3159.9, 3212.6, 3277.7, 3336.8, 3372.7, 3404.8, 3418.0, 3456.1, 3501.1, 3569.5, 3595.0, 3672.7, 3716.4, 3766.9, 3780.2, 3873.5, 3926.4, 4006.2, 4100.6, 4201.9, 4219.1, 4249.2, 4285.6, 4324.9, 4328.7, 4366.1, 4401.2, 4490.6, 4566.4, 4599.3, 4619.8, 4691.6, 4706.7, 4736.1, 4715.5, 4707.1, 4715.4, 4757.2, 4708.3, 4834.3, 4861.9, 4900.0, 4914.3, 5002.4, 5118.3, 5165.4, 5251.2, 5380.5, 5441.5, 5411.9, 5462.4, 5417.0, 5431.3, 5378.7, 5357.2, 5292.4, 5333.2, 5421.4, 5494.4, 5618.5, 5661.0, 5689.8, 5732.5, 5799.2, 5913.0, 6017.6, 6018.2, 6039.2, 6274.0, 6335.3, 6420.3, 6433.0, 6440.8, 6487.1, 6503.9, 6524.9, 6392.6, 6382.9, 6501.2, 6635.7, 6587.3, 6662.9, 6585.1, 6475.0, 6510.2, 6486.8, 6493.1, 6578.2, 6728.3, 6860.0, 7001.5, 7140.6, 7266.0, 7337.5, 7396.0, 7469.5, 7537.9, 7655.2, 7712.6, 7784.1, 7819.8, 7898.6, 7939.5, 7995.0, 8084.7, 8158.0, 8292.7, 8339.3, 8449.5, 8498.3, 8610.9, 8697.7, 8766.1, 8831.5, 8850.2, 8947.1, 8981.7, 8983.9, 8907.4, 8865.6, 8934.4, 8977.3, 9016.4, 9123.0, 9223.5, 9313.2, 9406.5, 9424.1, 9480.1, 9526.3, 9653.5, 9748.2, 9881.4, 9939.7, 10052.5, 10086.9, 10122.1, 10208.8, 10281.2, 10348.7, 10529.4, 10626.8, 10739.1, 10820.9, 10984.2, 11124.0, 11210.3, 11321.2, 11431.0, 11580.6, 11770.7, 11864.7, 11962.5, 12113.1, 12323.3, 12359.1, 12592.5, 12607.7, 12679.3, 12643.3, 12710.3, 12670.1, 12705.3, 12822.3, 12893.0, 12955.8, 12964.0, 13031.2, 13152.1, 13372.4, 13528.7, 13606.5, 13706.2, 13830.8, 13950.4, 14099.1, 14172.7, 14291.8, 14373.4, 14546.1, 14589.6, 14602.6, 14716.9, 
14726.0, 14838.7, 14938.5, 14991.8, 14889.5, 14963.4, 14891.6, 14577.0, 14375.0, 14355.6, 14402.5, 14541.9, 14604.8, 14745.9, 14845.5, 14939.0, 14881.3, 14989.6, 15021.1, 15190.3, 15291.0, 15362.4, 15380.8, 15384.3, 15491.9, 15521.6, 15641.3, 15793.9, 15747.0, 15900.8, 16094.5, 16186.7, 16269.0, 16374.2, 16454.9, 16490.7, 16525.0, 16583.1],
index = ['1947q1', '1947q2', '1947q3', '1947q4', '1948q1', '1948q2', '1948q3',
'1948q4', '1949q1', '1949q2', '1949q3', '1949q4', '1950q1', '1950q2',
'1950q3', '1950q4', '1951q1', '1951q2', '1951q3', '1951q4', '1952q1',
'1952q2', '1952q3', '1952q4', '1953q1', '1953q2', '1953q3', '1953q4',
'1954q1', '1954q2', '1954q3', '1954q4', '1955q1', '1955q2', '1955q3',
'1955q4', '1956q1', '1956q2', '1956q3', '1956q4', '1957q1', '1957q2',
'1957q3', '1957q4', '1958q1', '1958q2', '1958q3', '1958q4', '1959q1',
'1959q2', '1959q3', '1959q4', '1960q1', '1960q2', '1960q3', '1960q4',
'1961q1', '1961q2', '1961q3', '1961q4', '1962q1', '1962q2', '1962q3',
'1962q4', '1963q1', '1963q2', '1963q3', '1963q4', '1964q1', '1964q2',
'1964q3', '1964q4', '1965q1', '1965q2', '1965q3', '1965q4', '1966q1',
'1966q2', '1966q3', '1966q4', '1967q1', '1967q2', '1967q3', '1967q4',
'1968q1', '1968q2', '1968q3', '1968q4', '1969q1', '1969q2', '1969q3',
'1969q4', '1970q1', '1970q2', '1970q3', '1970q4', '1971q1', '1971q2',
'1971q3', '1971q4', '1972q1', '1972q2', '1972q3', '1972q4', '1973q1',
'1973q2', '1973q3', '1973q4', '1974q1', '1974q2', '1974q3', '1974q4',
'1975q1', '1975q2', '1975q3', '1975q4', '1976q1', '1976q2', '1976q3',
'1976q4', '1977q1', '1977q2', '1977q3', '1977q4', '1978q1', '1978q2',
'1978q3', '1978q4', '1979q1', '1979q2', '1979q3', '1979q4', '1980q1',
'1980q2', '1980q3', '1980q4', '1981q1', '1981q2', '1981q3', '1981q4',
'1982q1', '1982q2', '1982q3', '1982q4', '1983q1', '1983q2', '1983q3',
'1983q4', '1984q1', '1984q2', '1984q3', '1984q4', '1985q1', '1985q2',
'1985q3', '1985q4', '1986q1', '1986q2', '1986q3', '1986q4', '1987q1',
'1987q2', '1987q3', '1987q4', '1988q1', '1988q2', '1988q3', '1988q4',
'1989q1', '1989q2', '1989q3', '1989q4', '1990q1', '1990q2', '1990q3',
'1990q4', '1991q1', '1991q2', '1991q3', '1991q4', '1992q1', '1992q2',
'1992q3', '1992q4', '1993q1', '1993q2', '1993q3', '1993q4', '1994q1',
'1994q2', '1994q3', '1994q4', '1995q1', '1995q2', '1995q3', '1995q4',
'1996q1', '1996q2', '1996q3', '1996q4', '1997q1', '1997q2', '1997q3',
'1997q4', '1998q1', '1998q2', '1998q3', '1998q4', '1999q1', '1999q2',
'1999q3', '1999q4', '2000q1', '2000q2', '2000q3', '2000q4', '2001q1',
'2001q2', '2001q3', '2001q4', '2002q1', '2002q2', '2002q3', '2002q4',
'2003q1', '2003q2', '2003q3', '2003q4', '2004q1', '2004q2', '2004q3',
'2004q4', '2005q1', '2005q2', '2005q3', '2005q4', '2006q1', '2006q2',
'2006q3', '2006q4', '2007q1', '2007q2', '2007q3', '2007q4', '2008q1',
'2008q2', '2008q3', '2008q4', '2009q1', '2009q2', '2009q3', '2009q4',
'2010q1', '2010q2', '2010q3', '2010q4', '2011q1', '2011q2', '2011q3',
'2011q4', '2012q1', '2012q2', '2012q3', '2012q4', '2013q1', '2013q2',
'2013q3', '2013q4', '2014q1', '2014q2', '2014q3', '2014q4', '2015q1',
'2015q2', '2015q3', '2015q4', '2016q1', '2016q2'])
df.columns = ['GDP in billions of chained 2009 dollars']
df.index.rename('quarter', inplace = True)
A recession period is defined as starting with two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth. The goal is to create a function 'get_recession_periods()' that returns all of the recession periods between 1947q1 and 2016q2. The output could be a DataFrame with two columns (start and end) or a list of tuples [(start, end), ...] with all the recession periods found.
Here is my try:
def get_recession_periods():
    lst_start = []
    for i in range(0,len(df['GDP in billions of chained 2009 dollars'])-2):
        if df['GDP in billions of chained 2009 dollars'][i] < df['GDP in billions of chained 2009 dollars'][i-1] and df['GDP in billions of chained 2009 dollars'][i+1] < df['GDP in billions of chained 2009 dollars'][i]:
            lst_start.append(df.index[i])
    start = lst_start[0]
    lst_end = []
    for j in range(df.index.get_loc(start),len(df['GDP in billions of chained 2009 dollars'])-2):
        if df['GDP in billions of chained 2009 dollars'][j] > df['GDP in billions of chained 2009 dollars'][j-1] and df['GDP in billions of chained 2009 dollars'][j+1] > df['GDP in billions of chained 2009 dollars'][j]:
            lst_end.append(df.index[j])
    return (lst_start[0], lst_end[0])
But with the function above, I am only able to get the start and end quarter of the first recession in 1947.
Any idea?
This is probably overkill for this particular example. In a nutshell, this is a bit more complicated than @zaq's answer but also much faster (about 9x here, and the difference would be much bigger on larger datasets) because it's vectorized instead of looped. For this very small dataset, you would clearly go with the simpler answer, since even the slower way is fast enough. Finally, it stores the data in the dataframe itself rather than as a tuple (which could be an advantage or a disadvantage, depending on the situation).
Thanks to @zaq for pointing out that I misread the question initially. I believe this now gives the same answer as zaq's, except that we have different implicit assumptions about the initial state of the world (beginning in recession or not), which is indeterminate in the data provided.
df['change'] = df.diff() # change in GDP from prior quarter
start = (df.change<0) & (df.change.shift(-1)<0) # potential start
end = (df.change>0) & (df.change.shift(-1)>0) # potential end
df['recess' ] = np.nan
df.loc[ start, 'recess' ] = -1
df.loc[ end, 'recess' ] = 1
df['recess'] = df.recess.ffill() # if the current row doesn't fit the
# definition of a start or end, then
# fill it with the prior row value
df['startend'] = np.nan
df.loc[ (df.recess==-1) & (df.recess.shift()== 1), 'startend'] = -1 # start
df.loc[ (df.recess== 1) & (df.recess.shift()==-1), 'startend'] = 1 # end
df[df.startend.notnull()]
GDP change recess startend
quarter
1947q4 1960.7 30.4 1.0 1.0
1949q1 2007.5 -27.8 -1.0 -1.0
1950q1 2084.6 79.9 1.0 1.0
1953q3 2578.9 -14.6 -1.0 -1.0
1954q2 2530.7 2.7 1.0 1.0
1957q4 2846.4 -29.5 -1.0 -1.0
1958q2 2790.9 18.2 1.0 1.0
1969q4 4715.5 -20.6 -1.0 -1.0
1970q2 4715.4 8.3 1.0 1.0
1974q3 5378.7 -52.6 -1.0 -1.0
1975q2 5333.2 40.8 1.0 1.0
1980q2 6392.6 -132.3 -1.0 -1.0
1980q4 6501.2 118.3 1.0 1.0
1981q4 6585.1 -77.8 -1.0 -1.0
1982q4 6493.1 6.3 1.0 1.0
1990q4 8907.4 -76.5 -1.0 -1.0
1991q2 8934.4 68.8 1.0 1.0
2008q3 14891.6 -71.8 -1.0 -1.0
2009q3 14402.5 46.9 1.0 1.0
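To make the steps easier to follow, here is the same diff / shift / ffill idea as a sketch on a short made-up series instead of the GDP data; the numbers and the variable names (state, starts, ends) are assumptions for illustration. It also shows the initial-state effect: the first decline is not reported as a start, because no earlier state exists to switch from, which is why the table above opens with an end row.

```python
import numpy as np
import pandas as pd

# Made-up series: fall, recover, fall again, recover again.
gdp = pd.Series([10, 9, 8, 7, 8, 9, 10, 9, 8, 9, 10], dtype=float)

change = gdp.diff()                                # change from prior row
start = (change < 0) & (change.shift(-1) < 0)      # two declines in a row
end = (change > 0) & (change.shift(-1) > 0)        # two rises in a row

# -1 marks "in recession", 1 marks "growing"; other rows inherit the
# previous state through ffill.
state = pd.Series(np.nan, index=gdp.index)
state[start] = -1
state[end] = 1
state = state.ffill()

# A period starts or ends only where the state flips.
starts = list(gdp.index[(state == -1) & (state.shift() == 1)])
ends = list(gdp.index[(state == 1) & (state.shift() == -1)])

print(starts, ends)  # [7] [4, 9]
```

Position 4 is reported as an end without a matching start, mirroring the 1947q4 row in the output above.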
One issue with your code is that you are not tracking the current status of the economy. If there are 20 consecutive quarters of GDP decline, your code will report 18 recessions beginning. And if there are 20 quarters of growth, it will report 18 recessions ending even if there wasn't one to begin with.
So, I introduce a Boolean recession to indicate whether we are in recession currently. Other changes: chained inequalities like a < b < c work in Python as expected, and improve readability; also, your column name is so verbose that I used positional indexing iloc instead, to have readable conditions in if-statements.
lst_start = []
lst_end = []
recession = False
for i in range(1, len(df)-1):
    if not recession and (df.iloc[i-1, 0] > df.iloc[i, 0] > df.iloc[i+1, 0]):
        recession = True
        lst_start.append(df.index[i])
    elif recession and (df.iloc[i-1, 0] < df.iloc[i, 0] < df.iloc[i+1, 0]):
        recession = False
        lst_end.append(df.index[i])
print(list(zip(lst_start, lst_end)))
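The same state-tracking loop can be wrapped in a function and checked on a toy series. This is only a sketch: the function name recession_periods, the plain-list inputs and the sample values are assumptions for illustration, not part of the original answer.

```python
def recession_periods(values, labels):
    """Pair each recession start with the next recession end.

    A start is a point with a decline before and after it; an end is a
    point with growth before and after it, seen while in recession.
    """
    periods = []
    start = None  # label of the open recession, or None when not in one
    for i in range(1, len(values) - 1):
        falling = values[i - 1] > values[i] > values[i + 1]
        rising = values[i - 1] < values[i] < values[i + 1]
        if start is None and falling:
            start = labels[i]
        elif start is not None and rising:
            periods.append((start, labels[i]))
            start = None
    return periods


# Made-up series: fall, recover, fall again, recover again.
gdp = [10, 9, 8, 7, 8, 9, 10, 9, 8, 9, 10]
quarters = ["q{}".format(n) for n in range(len(gdp))]
periods = recession_periods(gdp, quarters)
print(periods)  # [('q1', 'q4'), ('q7', 'q9')]
```

For the question's data, it could be called with df.iloc[:, 0].tolist() and list(df.index). A recession that is still open at the end of the series is left out, just as zip drops an unmatched start in the loop above.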
