How to remove u' when running pandas' df.columns function - python-3.x

I have a data frame as shown below.
df:
id fcb psg rma
1 4.0 2.9 4.1
2 3.5 4.2 3.5
3 2.5 4.5 4.0
4 4.1 4.6 4.2
I wanted to see all of the ratings in a single column, so I ran the code below.
df.set_index(['id']).apply(dict, axis=1).reset_index(name='ratings')
However, I obtained the following results, and I want to remove the leading 'u' from all of the keys in each dictionary.
id ratings
1 {u'fcb': 4.0, u'psg': 2.9, u'rma': 4.1}
2 {u'fcb': 3.5, u'psg': 4.2, u'rma': 3.5}
3 {u'fcb': 2.5, u'psg': 4.5, u'rma': 4.0}
4 {u'fcb': 4.1, u'psg': 4.6, u'rma': 4.2}
Expected output:
id ratings
1 {'fcb': 4.0, 'psg': 2.9, 'rma': 4.1}
2 {'fcb': 3.5, 'psg': 4.2, 'rma': 3.5}
3 {'fcb': 2.5, 'psg': 4.5, 'rma': 4.0}
4 {'fcb': 4.1, 'psg': 4.6, 'rma': 4.2}
I tried the code below to eliminate the leading 'u' from each key.
df['rec_dict'] = df['rec_dict'].apply(lambda x: {str(k[1:]): v for k, v in x.items()})

Try this
df = df.set_index(['id']).apply(dict, axis=1).reset_index(name='ratings')
df['ratings'] = df['ratings'].apply(lambda x: {str(k): v for k, v in x.items()})
The first line is the same as your original code. The second line uses apply() with a lambda to iterate over each dictionary in the 'ratings' column and rebuild it with plain str keys. Note that k[1:] would strip the first character of the key itself ('fcb' would become 'cb'): the 'u' is part of the repr, not of the string. The resulting dictionary is assigned back to the corresponding row in the 'ratings' column.

u'string' denotes a Unicode string literal in Python. Since Python 3, all strings are Unicode by default, so the u prefix is redundant and never shown in the repr. You can safely ignore the notation.
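A quick way to see that the u prefix is only display notation in Python 3:

```python
# In Python 3, u'abc' and 'abc' are the same str type and compare equal,
# and repr() no longer shows the u prefix.
d = {u'fcb': 4.0, u'psg': 2.9, u'rma': 4.1}
print(d)                    # {'fcb': 4.0, 'psg': 2.9, 'rma': 4.1}
print(u'fcb' == 'fcb')      # True
print(type(u'fcb') is str)  # True
```

If you are seeing u'…' keys, you are most likely running Python 2, where str and unicode are distinct types.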

convert Array of polygons to Multipolygon

I have an array of Polygons. I need to convert the array in to Multipolygon.
["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
I am using the following code to convert to Multipolygons using Apache Sedona:
select FID,ST_Multi(ST_GeomFromText(collect_list(polygon))) polygon_list group by 1
I am getting the error "org.apache.spark.sql.catalyst.util.GenericArrayData cannot be cast to org.apache.spark.unsafe.types.UTF8String". How can I overcome this issue? Can the same thing be achieved using GeoPandas or Shapely?
The answer given by @Antoine B is a very good attempt, but it won't work with polygons that have holes in them. There is another approach that works with such polygons, and the code is easier to comprehend.
from shapely.geometry import MultiPolygon
from shapely.wkt import loads
# List of strings representing polygons
poly_string = ["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
# Create a list of polygons from the list of strings
all_pgons = [loads(pgon) for pgon in poly_string]
# Create the required multipolygon
multi_pgon = MultiPolygon(all_pgons)
This is a list of strings of polygons with holes.
# List of polygons with hole
poly_string = ['POLYGON ((1 2, 1 5, 4 4, 1 2), (1.2 3, 3 4, 1.3 4, 1.2 3))',
               'POLYGON ((11 12, 11 15, 14 14, 11 12), (11.2 13, 13 14, 11.3 14, 11.2 13))']
The code above also works well in this case.
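A quick sanity check (a sketch using the made-up hole polygons above) that the holes survive the round trip through shapely.wkt.loads:

```python
from shapely.geometry import MultiPolygon
from shapely.wkt import loads

# Polygons with one interior ring (hole) each
poly_string = ['POLYGON ((1 2, 1 5, 4 4, 1 2), (1.2 3, 3 4, 1.3 4, 1.2 3))',
               'POLYGON ((11 12, 11 15, 14 14, 11 12), (11.2 13, 13 14, 11.3 14, 11.2 13))']

multi_pgon = MultiPolygon([loads(pgon) for pgon in poly_string])

print(len(multi_pgon.geoms))               # 2 polygons in the multipolygon
print(len(multi_pgon.geoms[0].interiors))  # 1 -- the hole is preserved
```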
A MultiPolygon is just a list of Polygons, so you need to reconstruct every Polygon in a list and then pass that list to MultiPolygon.
With the format of the string you gave, I got it to work like this:
from shapely.geometry import Polygon, MultiPolygon
poly_string = ["POLYGON ((-93.8153401599999 31.6253224010001, -93.8154545089999 31.613245482, -93.8256952309999 31.6133096470001, -93.8239846819999 31.6142335050001, -93.822649241 31.614534889, -93.819589744 31.6141266810001, -93.8187199179999 31.6145615630001, -93.818796329 31.6166099970001, -93.8191396409999 31.616805696, -93.822160944 31.6185287610001, -93.8259606669999 31.6195415540001, -93.827173805 31.6202834370001, -93.826861 31.621054014, -93.826721397 31.6210996090001, -93.825838469 31.621387795, -93.823763302 31.620645804, -93.8224278609999 31.620880388, -93.8207344099999 31.6214468590001, -93.817712918 31.621645233, -93.8171636009999 31.6218779230001, -93.8170138 31.622175612, -93.816896795 31.622408104, -93.816843193 31.622514901, -93.8172703129999 31.623758464, -93.817027909 31.6250143240001, -93.816942408 31.624910524, -93.8153401599999 31.6253224010001))", "POLYGON ((-93.827875499 31.6135011530001, -93.8276549939999 31.6133218590001, -93.830593683 31.613340276, -93.827860513 31.616556659, -93.825911348 31.6159317660001, -93.825861447 31.615915767, -93.826296355 31.6149087000001, -93.8272805829999 31.614407122, -93.827341685 31.6143140250001, -93.827875499 31.6135011530001))"]
polygons = []
for poly in poly_string:
    coordinates = []
    for s in poly.split('('):
        if len(s.split(')')) > 1:
            for c in s.split(')')[0].split(','):
                coordinates.append((float(c.lstrip().split(' ')[0]),
                                    float(c.lstrip().split(' ')[1])))
    polygons.append(Polygon(coordinates))
multipoly = MultiPolygon(polygons)
I would try
select
FID,
ST_Multi(ST_Collect(ST_GeomFromText(polygon))) polygon_list
group by 1

Making two-dimensional list with the data given

name_list=['William','Laura','Robert','Alicia','Sharon','Jack','Mary','Edward','Jessie','Debra']
day_list=['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
hormones_list=[name_list,day_list]
print(hormones_list[0][0])
I wrote the one-dimensional ones, but how can I write the 2D one? When I write hormone_list[0][0], it has to show William's Monday data, but I don't know how. Also, how can I temporarily store separate lists? Should I use if structures without writing them one by one? It says shortest, so I'm a bit cautious.
You can nest lists, so instead of [value, value, value] you'd use [list, list, list].
So the values from Monday to Sunday would be one list, and then you'd add all of these lists in a row, like:
hormone_list = [[1,2,3,4,5,6,7],[2,3,4,5,6,7,8], ...]
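For instance, with made-up readings (the values here are placeholders, not real data), the first index picks the person and the second picks the day:

```python
name_list = ['William', 'Laura', 'Robert']
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

# One inner list of 7 readings per person, in the same order as name_list
hormone_list = [
    [1, 2, 3, 4, 5, 6, 7],  # William, Monday..Sunday
    [2, 3, 4, 5, 6, 7, 8],  # Laura
    [3, 4, 5, 6, 7, 8, 9],  # Robert
]

print(hormone_list[0][0])  # William's Monday reading -> 1
print(hormone_list[1][2])  # Laura's Wednesday reading -> 4
```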
This is likely not how I would solve the problem if presented as a business case, but the question seems to ask for lists and comprehensions, so I might do:
## ----------------------------
## Our data
## ----------------------------
name_list = ['William', 'Laura', 'Robert', 'Alicia', 'Sharon', 'Jack', 'Mary', 'Edward', 'Jessie', 'Debra']
day_list = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
hormones_list = [
    [1.0, 1.3, 1.7, 1.1, 1.8, 1.3, 1.2],  # William's daily readings
    [1.9, 1.9, 1.5, 1.8, 1.9, 1.7, 1.5],  # Laura's daily readings
    [3.0, 3.3, 3.1, 2.7, 3.5, 4.5, 4.7],  # Robert's daily readings
    # ....
]
## ----------------------------
print(f"Williams reading on Monday was: {hormones_list[0][0]}")
## ----------------------------
## build the list of high days using a list of lists like hormones_list.
## Note: I would probably use a days_high dictionary with something like
## days_high.setdefault(patient_index, []).append(day_index)
## ----------------------------
days_high = [
    [
        day_index
        for day_index, day_reading in enumerate(patient_readings)
        if day_reading > 4.0
    ]
    for patient_readings in hormones_list
]
## ----------------------------
## ----------------------------
## Print out the warnings for a given patient (if there are any)
## ----------------------------
for patient_index, patient_readings in enumerate(days_high):
    if not patient_readings:
        continue
    patient_name = name_list[patient_index]
    days = " ".join([day_list[i] for i in patient_readings])
    print(f"WARNING: High values for {patient_name}: {days}")
## ----------------------------
giving the following output:
Williams reading on Monday was: 1.0
WARNING: High values for Robert: Saturday Sunday

Keras Prediction result (getting score,use of argmax)

I am trying to use the Elmo model for text classification on my own dataset. The training is completed and the number of classes is 4 (Keras model with Elmo embedding). In the prediction, I got a numpy array. I am attaching the sample code and the result below.
import numpy as np
import tensorflow as tf
import keras.backend as K

new_text_pr = np.array(data, dtype=object)[:, np.newaxis]
with tf.Session() as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())
    model_elmo = build_model(classes)
    model_elmo.load_weights(model + "/" + elmo_model)
    import time
    t = time.time()
    predicted = model_elmo.predict(new_text_pr)
    print("time: ", time.time() - t)
    print(predicted)
    # print(predicted[0][0])
    print("result:", np.argmax(predicted[0]))
    return np.argmax(predicted[0])
When I print the predicted variable, I get this:
time: 1.561854362487793
[[0.17483692 0.21439584 0.24001297 0.3707543 ]
[0.15607062 0.24448264 0.4398888 0.15955798]
[0.06494818 0.3439018 0.42254424 0.16860574]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.0365712 0.4194748 0.3321385 0.21181548]
[0.05350104 0.18225929 0.56712115 0.19711846]
[0.08343349 0.37218323 0.32528472 0.2190985 ]
[0.09541835 0.19085276 0.41069734 0.30303153]
[0.03930932 0.40526104 0.45785302 0.09757669]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.09784496 0.2292052 0.44426462 0.22868524]
[0.06089798 0.31685832 0.47317514 0.14906852]
[0.03956613 0.46605557 0.3502095 0.14416872]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.15165758 0.22900137 0.50939053 0.10995051]
[0.06377257 0.33980298 0.32396355 0.27246094]
[0.11404029 0.21311268 0.46880838 0.2040386 ]
[0.07556026 0.20502563 0.52019936 0.19921473]
[0.11096822 0.23295449 0.36192006 0.29415724]
[0.05018891 0.16656907 0.60114646 0.18209551]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.09596984 0.18282187 0.5053091 0.2158991 ]
[0.09428936 0.13995855 0.62395805 0.14179407]
[0.10513227 0.26166025 0.36598155 0.26722598]
[0.08244281 0.15743142 0.5462735 0.21385226]
[0.07199708 0.2446867 0.44568574 0.23763043]
[0.1339082 0.27288827 0.43478844 0.15841508]
[0.07354636 0.24499843 0.44873005 0.23272514]
[0.08880813 0.2893545 0.44374797 0.1780894 ]
[0.14868192 0.25948635 0.32722548 0.2646063 ]
[0.08924995 0.36547357 0.40014726 0.14512917]
[0.05132649 0.28190497 0.5224545 0.14431408]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04849219 0.36724472 0.39698333 0.1872797 ]
[0.07206573 0.31368822 0.4667826 0.14746341]
[0.05948553 0.28048623 0.41831577 0.2417125 ]
[0.07582933 0.18771031 0.54879296 0.18766735]
[0.03858965 0.20433436 0.5596278 0.19744818]
[0.07443814 0.20681688 0.3933627 0.32538226]
[0.0639974 0.23687115 0.5357675 0.16336392]
[0.11005415 0.22901568 0.4279426 0.23298755]
[0.12625505 0.22987585 0.31619486 0.32767424]
[0.08893713 0.14554602 0.45740074 0.30811617]
[0.07906891 0.18683094 0.5214609 0.21263924]
[0.06316617 0.30398315 0.4475617 0.185289 ]
[0.07060979 0.17987429 0.4829593 0.26655656]
[0.0720717 0.27058697 0.41439256 0.24294883]
[0.06377257 0.33980292 0.32396355 0.27246094]
[0.04745338 0.25831962 0.46751252 0.22671448]
[0.06624557 0.20708969 0.54820716 0.17845756]]
result:3
Does anyone have an idea why only the 0th index value is taken? Considering this as a list of lists, the 0th index means the first list, and argmax returns the index of the maximum value in that list. What is the use of the other values in the lists? Why aren't they considered? Also, is it possible to get the score from this? I hope the question is clear. Is this the correct way, or is it wrong?
I have found the issue; just posting it for others who hit the same problem.
Answer: when predicting, the Elmo model expects a list of strings. In my code, the prediction data was split and the model predicted for each word separately; that's why I got this huge array. I used a temporary fix: the data is appended to a list, then an empty string is also appended to the list. The model predicts both list values, but I take only the first prediction. This is not the correct way, but I've done it as a quick fix and hope to find a proper one in the future.
To find the predicted class for each test example, you need to use axis=1. So, in your case the predicted classes will be:
>>> predicted_classes = predicted.argmax(axis=1)
>>> predicted_classes
[3 2 2 1 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 2
2 2 2 2 2 2 3 2 2 2 2 2 1 2 2]
This means that the first test example is predicted as class 3, the second as class 2, and so on.
The previous part answers your question (I think); now let's see what np.argmax(predicted) does. Using np.argmax() alone, without specifying an axis, will flatten your predicted matrix and return the index of the maximum number.
Let's see this simple example to know what I mean:
>>> x = np.matrix(np.arange(12).reshape((3,4)))
>>> x
matrix([[ 0, 1, 2, 3],
[ 4, 5, 6, 7],
[ 8, 9, 10, 11]])
>>> x.argmax()
11
11 is the index of the value 11, which is the biggest number in the whole flattened matrix.
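If you also want a confidence score per example (assuming each row is a softmax probability vector, as it appears to be here), pair argmax with max over the same axis. The rows below are the first three from the output above:

```python
import numpy as np

predicted = np.array([
    [0.17483692, 0.21439584, 0.24001297, 0.3707543 ],
    [0.15607062, 0.24448264, 0.4398888 , 0.15955798],
    [0.06494818, 0.3439018 , 0.42254424, 0.16860574],
])

classes = predicted.argmax(axis=1)  # winning class index per row
scores = predicted.max(axis=1)      # probability of the winning class per row
print(classes)  # [3 2 2]
print(scores)   # [0.3707543  0.4398888  0.42254424]
```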

how to find exponential weighted moving average using dataframe.ewma?

Previously I used the following to calculate the ewma
dataset['26ema'] = pd.ewma(dataset['price'], span=26)
But, in the latest version of pandas pd.ewma has been removed. How to calculate using the new method dataframe.ewma?
dataset['26ema'] = dataset['price'].ewma(span=26)
This is giving an error: AttributeError: 'Series' object has no attribute 'ewma'
Use Series.ewm:
dataset['price'].ewm(span=26)
See GH11603 for the relevant PR and mapping of the old API to new ones.
Minimal Code Example
s = pd.Series(range(5))
s.ewm(span=3).mean()
0 0.000000
1 0.666667
2 1.428571
3 2.266667
4 3.161290
dtype: float64
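Under the hood (with the default adjust=True), ewm with a span uses alpha = 2 / (span + 1) and weights (1 - alpha)**i on past observations. A quick sketch verifying the third value of the output above by hand:

```python
import pandas as pd

s = pd.Series(range(5))
result = s.ewm(span=3).mean()

alpha = 2 / (3 + 1)                              # span=3 -> alpha=0.5
weights = [(1 - alpha) ** i for i in range(3)]   # [1, 0.5, 0.25], newest first
manual = sum(w * s[2 - i] for i, w in enumerate(weights)) / sum(weights)
print(round(manual, 6), round(result[2], 6))     # 1.428571 1.428571
```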

Find two consecutive quarters of GDP decline, and ending with two consecutive quarters of GDP growth

I have the following df with data about the American quarterly GDP in billions of chained 2009 dollars, from 1947q1 to 2016q2:
df = pd.DataFrame(data = [1934.5, 1932.3, 1930.3, 1960.7, 1989.5, 2021.9, 2033.2, 2035.3, 2007.5, 2000.8, 2022.8, 2004.7, 2084.6, 2147.6, 2230.4, 2273.4, 2304.5, 2344.5, 2392.8, 2398.1, 2423.5, 2428.5, 2446.1, 2526.4, 2573.4, 2593.5, 2578.9, 2539.8, 2528.0, 2530.7, 2559.4, 2609.3, 2683.8, 2727.5, 2764.1, 2780.8, 2770.0, 2792.9, 2790.6, 2836.2, 2854.5, 2848.2, 2875.9, 2846.4, 2772.7, 2790.9, 2855.5, 2922.3, 2976.6, 3049.0, 3043.1, 3055.1, 3123.2, 3111.3, 3119.1, 3081.3, 3102.3, 3159.9, 3212.6, 3277.7, 3336.8, 3372.7, 3404.8, 3418.0, 3456.1, 3501.1, 3569.5, 3595.0, 3672.7, 3716.4, 3766.9, 3780.2, 3873.5, 3926.4, 4006.2, 4100.6, 4201.9, 4219.1, 4249.2, 4285.6, 4324.9, 4328.7, 4366.1, 4401.2, 4490.6, 4566.4, 4599.3, 4619.8, 4691.6, 4706.7, 4736.1, 4715.5, 4707.1, 4715.4, 4757.2, 4708.3, 4834.3, 4861.9, 4900.0, 4914.3, 5002.4, 5118.3, 5165.4, 5251.2, 5380.5, 5441.5, 5411.9, 5462.4, 5417.0, 5431.3, 5378.7, 5357.2, 5292.4, 5333.2, 5421.4, 5494.4, 5618.5, 5661.0, 5689.8, 5732.5, 5799.2, 5913.0, 6017.6, 6018.2, 6039.2, 6274.0, 6335.3, 6420.3, 6433.0, 6440.8, 6487.1, 6503.9, 6524.9, 6392.6, 6382.9, 6501.2, 6635.7, 6587.3, 6662.9, 6585.1, 6475.0, 6510.2, 6486.8, 6493.1, 6578.2, 6728.3, 6860.0, 7001.5, 7140.6, 7266.0, 7337.5, 7396.0, 7469.5, 7537.9, 7655.2, 7712.6, 7784.1, 7819.8, 7898.6, 7939.5, 7995.0, 8084.7, 8158.0, 8292.7, 8339.3, 8449.5, 8498.3, 8610.9, 8697.7, 8766.1, 8831.5, 8850.2, 8947.1, 8981.7, 8983.9, 8907.4, 8865.6, 8934.4, 8977.3, 9016.4, 9123.0, 9223.5, 9313.2, 9406.5, 9424.1, 9480.1, 9526.3, 9653.5, 9748.2, 9881.4, 9939.7, 10052.5, 10086.9, 10122.1, 10208.8, 10281.2, 10348.7, 10529.4, 10626.8, 10739.1, 10820.9, 10984.2, 11124.0, 11210.3, 11321.2, 11431.0, 11580.6, 11770.7, 11864.7, 11962.5, 12113.1, 12323.3, 12359.1, 12592.5, 12607.7, 12679.3, 12643.3, 12710.3, 12670.1, 12705.3, 12822.3, 12893.0, 12955.8, 12964.0, 13031.2, 13152.1, 13372.4, 13528.7, 13606.5, 13706.2, 13830.8, 13950.4, 14099.1, 14172.7, 14291.8, 14373.4, 14546.1, 14589.6, 14602.6, 14716.9, 
14726.0, 14838.7, 14938.5, 14991.8, 14889.5, 14963.4, 14891.6, 14577.0, 14375.0, 14355.6, 14402.5, 14541.9, 14604.8, 14745.9, 14845.5, 14939.0, 14881.3, 14989.6, 15021.1, 15190.3, 15291.0, 15362.4, 15380.8, 15384.3, 15491.9, 15521.6, 15641.3, 15793.9, 15747.0, 15900.8, 16094.5, 16186.7, 16269.0, 16374.2, 16454.9, 16490.7, 16525.0, 16583.1],
index = ['1947q1', '1947q2', '1947q3', '1947q4', '1948q1', '1948q2', '1948q3',
'1948q4', '1949q1', '1949q2', '1949q3', '1949q4', '1950q1', '1950q2',
'1950q3', '1950q4', '1951q1', '1951q2', '1951q3', '1951q4', '1952q1',
'1952q2', '1952q3', '1952q4', '1953q1', '1953q2', '1953q3', '1953q4',
'1954q1', '1954q2', '1954q3', '1954q4', '1955q1', '1955q2', '1955q3',
'1955q4', '1956q1', '1956q2', '1956q3', '1956q4', '1957q1', '1957q2',
'1957q3', '1957q4', '1958q1', '1958q2', '1958q3', '1958q4', '1959q1',
'1959q2', '1959q3', '1959q4', '1960q1', '1960q2', '1960q3', '1960q4',
'1961q1', '1961q2', '1961q3', '1961q4', '1962q1', '1962q2', '1962q3',
'1962q4', '1963q1', '1963q2', '1963q3', '1963q4', '1964q1', '1964q2',
'1964q3', '1964q4', '1965q1', '1965q2', '1965q3', '1965q4', '1966q1',
'1966q2', '1966q3', '1966q4', '1967q1', '1967q2', '1967q3', '1967q4',
'1968q1', '1968q2', '1968q3', '1968q4', '1969q1', '1969q2', '1969q3',
'1969q4', '1970q1', '1970q2', '1970q3', '1970q4', '1971q1', '1971q2',
'1971q3', '1971q4', '1972q1', '1972q2', '1972q3', '1972q4', '1973q1',
'1973q2', '1973q3', '1973q4', '1974q1', '1974q2', '1974q3', '1974q4',
'1975q1', '1975q2', '1975q3', '1975q4', '1976q1', '1976q2', '1976q3',
'1976q4', '1977q1', '1977q2', '1977q3', '1977q4', '1978q1', '1978q2',
'1978q3', '1978q4', '1979q1', '1979q2', '1979q3', '1979q4', '1980q1',
'1980q2', '1980q3', '1980q4', '1981q1', '1981q2', '1981q3', '1981q4',
'1982q1', '1982q2', '1982q3', '1982q4', '1983q1', '1983q2', '1983q3',
'1983q4', '1984q1', '1984q2', '1984q3', '1984q4', '1985q1', '1985q2',
'1985q3', '1985q4', '1986q1', '1986q2', '1986q3', '1986q4', '1987q1',
'1987q2', '1987q3', '1987q4', '1988q1', '1988q2', '1988q3', '1988q4',
'1989q1', '1989q2', '1989q3', '1989q4', '1990q1', '1990q2', '1990q3',
'1990q4', '1991q1', '1991q2', '1991q3', '1991q4', '1992q1', '1992q2',
'1992q3', '1992q4', '1993q1', '1993q2', '1993q3', '1993q4', '1994q1',
'1994q2', '1994q3', '1994q4', '1995q1', '1995q2', '1995q3', '1995q4',
'1996q1', '1996q2', '1996q3', '1996q4', '1997q1', '1997q2', '1997q3',
'1997q4', '1998q1', '1998q2', '1998q3', '1998q4', '1999q1', '1999q2',
'1999q3', '1999q4', '2000q1', '2000q2', '2000q3', '2000q4', '2001q1',
'2001q2', '2001q3', '2001q4', '2002q1', '2002q2', '2002q3', '2002q4',
'2003q1', '2003q2', '2003q3', '2003q4', '2004q1', '2004q2', '2004q3',
'2004q4', '2005q1', '2005q2', '2005q3', '2005q4', '2006q1', '2006q2',
'2006q3', '2006q4', '2007q1', '2007q2', '2007q3', '2007q4', '2008q1',
'2008q2', '2008q3', '2008q4', '2009q1', '2009q2', '2009q3', '2009q4',
'2010q1', '2010q2', '2010q3', '2010q4', '2011q1', '2011q2', '2011q3',
'2011q4', '2012q1', '2012q2', '2012q3', '2012q4', '2013q1', '2013q2',
'2013q3', '2013q4', '2014q1', '2014q2', '2014q3', '2014q4', '2015q1',
'2015q2', '2015q3', '2015q4', '2016q1', '2016q2'])
df.columns = ['GDP in billions of chained 2009 dollars']
df.index.rename('quarter', inplace = True)
A recession period is defined as starting with two consecutive quarters of GDP decline and ending with two consecutive quarters of GDP growth. The goal is to create a function get_recession_periods() that returns all of the recession periods between 1947q1 and 2016q2. The output could be a dataframe with two columns (start and end) or a list of tuples [(start, end), ...] with all the recession periods found.
Here is my try:
def get_recession_periods():
    lst_start = []
    for i in range(0, len(df['GDP in billions of chained 2009 dollars']) - 2):
        if df['GDP in billions of chained 2009 dollars'][i] < df['GDP in billions of chained 2009 dollars'][i-1] and df['GDP in billions of chained 2009 dollars'][i+1] < df['GDP in billions of chained 2009 dollars'][i]:
            lst_start.append(df.index[i])
    start = lst_start[0]
    lst_end = []
    for j in range(df.index.get_loc(start), len(df['GDP in billions of chained 2009 dollars']) - 2):
        if df['GDP in billions of chained 2009 dollars'][j] > df['GDP in billions of chained 2009 dollars'][j-1] and df['GDP in billions of chained 2009 dollars'][j+1] > df['GDP in billions of chained 2009 dollars'][j]:
            lst_end.append(df.index[j])
    return (lst_start[0], lst_end[0])
But with the function above, I am only able to get the start and end quarter of the first recession in 1947.
Any idea?
This is probably overkill for this particular example. In a nutshell, this is a bit more complicated than @zaq's answer, but also much faster (about 9x here, and the difference would be much bigger on larger datasets) because it's vectorized instead of looped. For this very small dataset, though, you would clearly go with the simpler answer, since even the slower way is fast enough. Finally, it stores the data in the dataframe itself rather than as a tuple (which could be an advantage or a disadvantage, depending on the situation).
Thanks to @zaq for pointing out that I misread the question initially. I believe this now gives the same answer as zaq's, except that we have different implicit assumptions about the initial state of the world (beginning in recession or not), which is indeterminate in the data provided.
df['change'] = df.diff() # change in GDP from prior quarter
start = (df.change<0) & (df.change.shift(-1)<0) # potential start
end = (df.change>0) & (df.change.shift(-1)>0) # potential end
df['recess' ] = np.nan
df.loc[ start, 'recess' ] = -1
df.loc[ end, 'recess' ] = 1
df['recess'] = df.recess.ffill() # if the current row doesn't fit the
# definition of a start or end, then
# fill it with the prior row value
df['startend'] = np.nan
df.loc[ (df.recess==-1) & (df.recess.shift()== 1), 'startend'] = -1 # start
df.loc[ (df.recess== 1) & (df.recess.shift()==-1), 'startend'] = 1 # end
df[df.startend.notnull()]
GDP change recess startend
quarter
1947q4 1960.7 30.4 1.0 1.0
1949q1 2007.5 -27.8 -1.0 -1.0
1950q1 2084.6 79.9 1.0 1.0
1953q3 2578.9 -14.6 -1.0 -1.0
1954q2 2530.7 2.7 1.0 1.0
1957q4 2846.4 -29.5 -1.0 -1.0
1958q2 2790.9 18.2 1.0 1.0
1969q4 4715.5 -20.6 -1.0 -1.0
1970q2 4715.4 8.3 1.0 1.0
1974q3 5378.7 -52.6 -1.0 -1.0
1975q2 5333.2 40.8 1.0 1.0
1980q2 6392.6 -132.3 -1.0 -1.0
1980q4 6501.2 118.3 1.0 1.0
1981q4 6585.1 -77.8 -1.0 -1.0
1982q4 6493.1 6.3 1.0 1.0
1990q4 8907.4 -76.5 -1.0 -1.0
1991q2 8934.4 68.8 1.0 1.0
2008q3 14891.6 -71.8 -1.0 -1.0
2009q3 14402.5 46.9 1.0 1.0
One issue with your code is that you are not tracking the current status of the economy. If there are 20 consecutive quarters of GDP decline, your code will report 18 recessions beginning. And if there are 20 quarters of growth, it will report 18 recessions ending even if there wasn't one to begin with.
So I introduce a Boolean flag, recession, to indicate whether we are currently in a recession. Other changes: chained inequalities like a < b < c work in Python as expected and improve readability; also, your column name is so verbose that I used positional indexing (iloc) instead, to keep the if-conditions readable.
lst_start = []
lst_end = []
recession = False
for i in range(1, len(df) - 1):
    if not recession and (df.iloc[i-1, 0] > df.iloc[i, 0] > df.iloc[i+1, 0]):
        recession = True
        lst_start.append(df.index[i])
    elif recession and (df.iloc[i-1, 0] < df.iloc[i, 0] < df.iloc[i+1, 0]):
        recession = False
        lst_end.append(df.index[i])
print(list(zip(lst_start, lst_end)))
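As a sanity check, here is the same state-machine logic run on a tiny made-up series (toy values, not GDP data), which contains two recessions and finds both:

```python
import pandas as pd

gdp = pd.Series([10, 9, 8, 7, 8, 9, 10, 9, 8, 9, 10],
                index=[f'q{i}' for i in range(11)])

lst_start, lst_end = [], []
recession = False
for i in range(1, len(gdp) - 1):
    prev, cur, nxt = gdp.iloc[i - 1], gdp.iloc[i], gdp.iloc[i + 1]
    if not recession and prev > cur > nxt:      # two consecutive declines
        recession = True
        lst_start.append(gdp.index[i])
    elif recession and prev < cur < nxt:        # two consecutive rises
        recession = False
        lst_end.append(gdp.index[i])

print(list(zip(lst_start, lst_end)))  # [('q1', 'q4'), ('q7', 'q9')]
```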
