negative forecasts using facebook prophet - python-3.x

I have almost 2 years of daily time series data for cluster available space (in GB). I am trying to use Facebook's Prophet to make future forecasts, but some forecasts come out negative. Since negative values do not make sense here, I saw that using a carrying capacity with the logistic growth model helps eliminate negative forecasts by capping values. I am not sure whether this is applicable to my case, or how to get the cap value for my time series. Please help, as I am new to this and confused. I am using Python 3.6.
import numpy as np
import pandas as pd
import xlrd
import openpyxl
from pandas import datetime
import csv
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
from fbprophet import Prophet
import os
import sys
import signal
df = pd.read_excel("Data_Per_day.xlsx")
df1=df.filter(['cluster_guid','date','avail_capacity'],axis=1)
uniquevalues = np.unique(df1[['cluster_guid']].values)
for id in uniquevalues:
    newdf = df1[df1['cluster_guid'] == id]
    # sum available capacity per cluster per day, then append to one CSV
    newdf1 = newdf.groupby(['cluster_guid', 'date'], as_index=False)['avail_capacity'].sum()
    #newdf11 = newdf.groupby(['cluster_guid', 'date'], as_index=False)['total_capacity'].sum()
    #cap[id] = newdf11['total_capacity'].max()
    #print(cap[id])
    newdf1.set_index('cluster_guid', inplace=True)
    newdf1.to_csv('my_csv.csv', mode='a', header=None)
with open('my_csv.csv', newline='') as f:
    r = csv.reader(f)
    data = [line for line in r]
with open('my_csv.csv', 'w', newline='') as f:
    w = csv.writer(f)
    w.writerow(['cluster_guid', 'DATE_TAKEN', 'avail_capacity'])
    w.writerows(data)
in_df = pd.read_csv('my_csv.csv', parse_dates=True, index_col='DATE_TAKEN')
in_df.to_csv('my_csv.csv')
dfs = pd.read_csv('my_csv.csv')
uni = dfs.cluster_guid.unique()
while True:
    try:
        print("Press Ctrl+C to exit or enter the cluster guid to be forecasted")
        i = input('Please enter the cluster guid')
        if i not in uni:
            print('Please enter a valid cluster guid')
            continue
        else:
            dfs1 = dfs.loc[dfs['cluster_guid'] == i]  # filter dfs (not the original df) by the entered guid
            dfs1.drop('cluster_guid', axis=1, inplace=True)
            dfs1.to_csv('dataframe' + i + '.csv', index=False)
            dfs2 = pd.read_csv('dataframe' + i + '.csv')
            dfs2['DATE_TAKEN'] = pd.DatetimeIndex(dfs2['DATE_TAKEN'])
            dfs2 = dfs2.rename(columns={'DATE_TAKEN': 'ds', 'avail_capacity': 'y'})
            my_model = Prophet(interval_width=0.99)
            my_model.fit(dfs2)
            future_dates = my_model.make_future_dataframe(periods=30, freq='D')
            forecast = my_model.predict(future_dates)
            print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']])
            my_model.plot(forecast, uncertainty=True)
            my_model.plot_components(forecast)
            plt.show()
            os.remove('dataframe' + i + '.csv')
            os.remove('my_csv.csv')
    except KeyboardInterrupt:
        try:
            os.remove('my_csv.csv')
        except OSError:
            pass
        sys.exit(0)

A Box-Cox transform of order 0 (a log transform) does the trick. Here are the steps:
1. Add 1 to each value (to avoid log(0))
2. Take the natural log of each value
3. Make the forecasts
4. Take the exponent and subtract 1
This way you will not get negative forecasts. The log also has the nice property of converting multiplicative seasonality to additive form.
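For illustration, here is a minimal sketch of those four steps, reusing the dfs2 frame (already renamed to Prophet's ds/y columns) and the model settings from the question; np.log1p and np.expm1 cover steps 1-2 and step 4 in one call each:
import numpy as np
from fbprophet import Prophet

dfs2['y'] = np.log1p(dfs2['y'])             # steps 1-2: log(1 + y), safe when y == 0

my_model = Prophet(interval_width=0.99)
my_model.fit(dfs2)
future_dates = my_model.make_future_dataframe(periods=30, freq='D')
forecast = my_model.predict(future_dates)   # step 3: forecast on the log scale

# step 4: invert the transform on the point forecast and its bounds
for col in ['yhat', 'yhat_lower', 'yhat_upper']:
    forecast[col] = np.expm1(forecast[col])

# The cap approach from the question would instead use Prophet(growth='logistic')
# with a 'cap' column (e.g. the cluster's total capacity) added to both dfs2 and
# future_dates; a 'floor' column of 0 then gives a saturating minimum.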

Related

How do I interpolate the time based on a given condition?

I have over 50 csv files to process, and each dataset looks like this:
[a simplified csv file example](https://i.stack.imgur.com/yoGo9.png)
There are three columns: Times, Index, Voltage.
I want to interpolate the time at which the total voltage decrease [here (84-69) = 15] reaches 53% [i.e. 15*0.53] at index 2.
I will repeat this process for index 4, too. May I ask what I should do?
I am a beginner in Python and tried the following script:
import pandas as pd
import glob
import numpy as np
import os
import matplotlib.pyplot as plt
import xlwings as xw

path = os.getcwd()
csv_files = glob.glob(os.path.join(path, "*.csv"))  # read all csv files
for f in csv_files:  # process each dataset
    df = pd.read_csv(f)
    ta95 = df[df.Step_Index == 2]  # create new dataframe based on index
    a = ta95.iloc[0]               # first row of the step (iloc is 0-based)
    b = a["Voltage(V)"]            # voltage in the first row
    c = ta95.iloc[-1]              # last row of the step
    d = c["Voltage(V)"]            # voltage in the last row
    e = (b - d) * 0.53             # 53% of the total voltage decrease
I don't know what I should do next in this script.
I appreciate your time and support if you can offer help.
If you have any recommended websites for me to read that would help me solve this kind of problem, I would appreciate that too. Thanks again.
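Since the script above already computes the 53% target, the remaining step is a linear interpolation of time against voltage. Here is a minimal, hedged sketch using numpy.interp, assuming the column names visible in the code (Times, Step_Index, Voltage(V)) and a monotonically decreasing voltage within each step:
import glob
import os
import numpy as np
import pandas as pd

def time_at_fraction(df, step, frac=0.53):
    seg = df[df["Step_Index"] == step]
    v_first = seg["Voltage(V)"].iloc[0]             # voltage at the start of the step
    v_last = seg["Voltage(V)"].iloc[-1]             # voltage at the end of the step
    v_target = v_first - (v_first - v_last) * frac  # voltage after 53% of the total drop
    # np.interp needs increasing x values; voltage decreases, so reverse both arrays
    return np.interp(v_target,
                     seg["Voltage(V)"].values[::-1],
                     seg["Times"].values[::-1])

for f in glob.glob(os.path.join(os.getcwd(), "*.csv")):
    df = pd.read_csv(f)
    print(f, time_at_fraction(df, 2), time_at_fraction(df, 4))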

Visual Studio Code autocomplete comma to '%%!'

While using the Jupyter extension in VS Code, for some reason, every time I type a comma, VS Code suggests %%!, which means I have to hit Esc every time in order to write comma-separated lists over multiple lines. Can anyone tell me why this is happening or how to stop it? It doesn't happen in a blank notebook, but after running two cells it's back again.
import pandas as pd
import matplotlib.dates as mdates
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import datetime as dt
%matplotlib inline
sns.set()

def open_hbal(file):
    df_balance = pd.read_excel(file)  # open xlsx
    t_0 = dt.datetime(2018, 1, 1)     # set start point for time series
    # add Date and Time in fraction steps starting from t_0
    df_balance["Date and Time"] = t_0 + pd.to_timedelta(df_balance["Time"], unit='h')
    # convert to datetime object
    df_balance["Date and Time"] = pd.to_datetime(df_balance["Date and Time"])
    # replace index with Date and Time; inplace overwrites df
    df_balance.set_index("Date and Time", inplace=True)
    # remove Time column as it is no longer needed; axis 0 = row, 1 = column
    df_balance.drop("Time", axis=1, inplace=True)
    #df_balance = df_balance / 1000  # convert to kWh
    # replace units in all columns
    #df_balance.columns = df_balance.columns.str.replace(", W", ", kWh")
    df_balance.rename(columns={"Net losses, W": "_Net losses, W"}, inplace=True)
    return df_balance
In case anyone else lands here: the other suggested solutions did not work for me on v2021.10.1101450599. As per this issue, rolling back to v2021.8.1236758218 removes the problem until it gets fixed.

How To Find a Point In Polygon from a Geojson file Using Python And Geopandas

So I have a .geojson file that contains a FeatureCollection of multiple polygons representing a country. I am trying to determine whether a specific point is inside one of these polygons. If so, I return the entire feature itself; if not, I return a simple message.
So far, I am able to load the data into a GeoDataFrame using geopandas, but for some reason I can't iterate through the GeoDataFrame and successfully perform polygon.contains(point). It seems to me that the iteration stops after a certain point, or maybe my code does not work at all.
I have tried multiple suggestions from Stack Overflow and other tutorials on Google, but I couldn't get what I wanted. Below is my code.
GeoJSON file: data
Code:
%matplotlib inline
import json
import pandas as pd
import geopandas as gpd
import matplotlib.pyplot as plt
import shapely
from shapely.geometry import Point, Polygon
from descartes import PolygonPatch
import geocoder
import requests
import copy

session = requests.Session()
test_point = [14.1747157, 10.4952759]
f, ax = plt.subplots(1, figsize=(10, 10))
url = 'https://trello-attachments.s3.amazonaws.com/599b7f6ff18b8d629ac53168/5d03586a06add530095c325c/26f5d54bbfa9731ec16737641b59de9a/CMR_adm3-2.geojson'
df = gpd.read_file(url)
df['Area'] = df['geometry'].area
df['centroid'] = df['geometry'].centroid
df.plot(ax=ax, column="Area", cmap='OrRd', alpha=0.5, edgecolor='k')
# ax.set_title(arr + " " + depart + " " + region, fontsize = font_size)
# print(df.head(3))
plt.show()
print("The length of the Dataframe is:", len(df))

def find_department(df, point):
    for feature in df['geometry']:
        polygon = Polygon(feature)
        # print(type(polygon))
        if polygon.contains(point):
            # print(feature.to_json())
            print('Found containing polygon:', feature)
        else:
            print('Found nothing!')

p1 = Point(float(test_point[0]), float(test_point[1]))
dept = find_department(df, p1)
print("The department is:", dept)
This is the response I get when I run it in a notebook: [output screenshot omitted]
This worked for me:
def find_department(df, point):
    for index, row in df.iterrows():
        if row.geometry.contains(point):
            return row
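A usage sketch under the same names as the question (the function returns the matching row, or None when no polygon contains the point):
p1 = Point(float(test_point[0]), float(test_point[1]))
dept = find_department(df, p1)
if dept is not None:
    print('Found containing polygon:', dept)
else:
    print('Found nothing!')
Iterating row-wise also sidesteps the Polygon(feature) wrapping in the question, which can break on MultiPolygon geometries; alternatively, GeoPandas can answer this without an explicit loop via df[df.geometry.contains(p1)], which returns all features containing the point.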

KeyError when adding a column to a DataFrame (Pandas)

My Pandas DataFrame is not accepting the addition of a second column, and I cannot troubleshoot the issue. I am trying to display moving averages. The code works fine for just the first one (MA_9), but gives me an error as soon as I try to add an additional MA (MA_20).
Is it not possible in this case to add more than one column?
The code:
import numpy as np
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
symbol = 'GOOG.US'
start = '20140314'
end = '20180414'
google = pdr.DataReader(symbol, 'stooq', start, end)
print(google.head())
google_close = pd.DataFrame(google.Close)
print(google_close.last_valid_index())
google_close['MA_9'] = google_close.rolling(9).mean()
google_close['MA_20'] = google_close.rolling(20).mean()
# google_close['MA_60'] = google_close.rolling(60).mean()
# print(google_close)
plt.figure(figsize=(15, 10))
plt.grid(True)
# display MA's
plt.plot(google_close['Close'], label='Google_Cls')
plt.plot(google_close['MA_9'], label='MA 9 day')
plt.plot(google_close['MA_20'], label='MA 20 day')
# plt.plot(google_close['MA_60'], label='MA 60 day')
plt.legend(loc=2)
plt.show()
Please update your code as below and then it should work:
google_close['MA_9'] = google_close.Close.rolling(9).mean()
google_close['MA_20'] = google_close.Close.rolling(20).mean()
Initially there was only one column of data, Close, so your old code google_close['MA_9'] = google_close.rolling(9).mean() worked. After that line, however, the frame has two columns, so pandas no longer knows which one you want to average. Specifying the column to roll over fixes it: google_close['MA_20'] = google_close.Close.rolling(20).mean()

KeyError: nan when applying KNN classifier

I am trying to test my KNN classifier against some data that I sourced from UCI's Machine Learning Repository. When running the classifier I keep getting the same KeyError:
train_set[i[-1]].append(i[:-1])
KeyError: nan
I am not sure why this keeps happening, because if I comment out the classifier and just print the first 10 lines or so, the data shows up fine with no corruption or duplication of any kind.
Here is a link to the data I am using; I simply downloaded it and added the column IDs (note: in this link the column IDs have not been added).
Here is what some of the code looks like:
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import style
import warnings
from math import sqrt
from collections import Counter
import pandas as pd
import random
style.use('fivethirtyeight')
def k_nearest_neighbors(data, predict, k=3):
    if len(data) >= k:
        warnings.warn('K is set to a value less than total voting groups!')
    distances = []
    for group in data:
        for features in data[group]:
            euclidean_distance = np.linalg.norm(np.array(features) - np.array(predict))
            distances.append([euclidean_distance, group])
    votes = [i[1] for i in sorted(distances)[:k]]
    vote_result = Counter(votes).most_common(1)[0][0]
    return vote_result

df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)
df.drop(columns=['id'], inplace=True)
full_data = df.astype(float).values.tolist()
random.shuffle(full_data)

test_size = 0.2
train_set = {2: [], 4: []}
test_set = {2: [], 4: []}
train_data = full_data[:-int(test_size*len(full_data))]
test_data = full_data[-int(test_size*len(full_data)):]

for i in train_data:
    train_set[i[-1]].append(i[:-1])
for i in test_data:
    test_set[i[-1]].append(i[:-1])

correct = 0
total = 0
for group in test_set:
    for data in test_set[group]:
        vote = k_nearest_neighbors(train_set, data, k=5)
        if group == vote:
            correct += 1
        total += 1
print('Accuracy:', correct/total)
I am completely stumped as to why this KeyError keeps showing up (it also happens on the test_set[i[-1]].append(i[:-1]) line). I tried looking for people who experienced similar issues, but I have found nobody with the same problem. As always, any assistance is greatly appreciated; thank you.
I figured out that the error was caused by a spacing issue. When typing in the column names for the data after I downloaded it, I forgot to put them on their own line; I typed them right in front of the first data point instead, which caused the error.
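More generally, a quick sanity check right after loading the file would have surfaced the malformed first row. A minimal sketch, assuming the same file name and preprocessing as in the question:
import pandas as pd

df = pd.read_csv('breast-cancer-wisconsin.data.txt')
df.replace('?', -99999, inplace=True)

# rows that failed to parse show up as NaN; inspect and drop them
# before they become keys into train_set/test_set
print(df.isnull().sum())   # count missing values per column
df.dropna(inplace=True)    # remove unparseable rows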
