Python time series using FB Prophet with covid-19 - python-3.x

I have prepared a time series model using FB Prophet for making forecasts. The model forecasts the coming 30 days, and my data ranges from Jan 2019 until Mar 2020, both months inclusive, with all dates filled in. The model has been built specifically for the UK market.
I have already taken care of the following:
Seasonality
Holiday effects
My question is: how do I incorporate the current COVID-19 situation into the same model? The cases I am trying to forecast also depend on previous data, at least from Jan 2020. So, in order to forecast, I need to take the current coronavirus situation into account as well, since it would impact my forecasts beyond seasonality and holiday effects.
How should I achieve this?

I haven't tried it, but one thing I have in mind is to add COVID-19 as a one-time holiday. You can see an example of adding a one-time promo as a pseudo-holiday in the following link: https://towardsdatascience.com/forecasting-in-python-with-facebook-prophet-29810eb57e66
I assume the same approach can be applied to a negative effect caused by COVID-19.
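A minimal sketch of that idea with Prophet's holidays mechanism (the start date and window length are assumptions you would tune to your data):
import pandas as pd
from prophet import Prophet  # package was named 'fbprophet' in older releases

# Treat the COVID-19 period as a one-off "holiday"; date and window are assumptions
covid = pd.DataFrame({
    'holiday': 'covid19',
    'ds': pd.to_datetime(['2020-03-01']),
    'lower_window': 0,
    'upper_window': 30,  # let the effect persist ~30 days past the start date
})

m = Prophet(holidays=covid)
m.fit(df)  # df must have the usual 'ds' and 'y' columns
forecast = m.predict(m.make_future_dataframe(periods=30))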

I have had the same issue with COVID at my work with sales forecasting. The easy solution for me was to add an extra regressor that indicates the COVID period and use it in my model. Then my future is not affected by COVID, unless I tell it that it should be.
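As a sketch, assuming a simple 0/1 indicator column (the cutoff date is an assumption):
import pandas as pd
from prophet import Prophet

# Mark the COVID period with a 0/1 indicator; the cutoff date is an assumption
df['covid'] = (df['ds'] >= pd.Timestamp('2020-03-01')).astype(int)

m = Prophet()
m.add_regressor('covid')
m.fit(df)

future = m.make_future_dataframe(periods=30)
future['covid'] = 0  # future unaffected by COVID; set to 1 to carry the effect forward
forecast = m.predict(future)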

Related

Quarterly forecast with MIDAS (EViews)

I'm really struggling with this problem.
Suppose I want to forecast 2022Q3 GDP with MIDAS, using a monthly variable for industrial production (let's call it IP), the first lag of GDP, and a constant. Now, let's consider that I'm in July/22 and have the value for the July/22 IP, and I want to forecast my 2022Q3 GDP using all data for IP until July/22 (consider a sample from 2010Q1-2022Q2 for GDP and 2010M1-2022M7 for IP). Considering that, my idea was to use two lags of IP in my MIDAS estimation. In the next month, with the August/22 IP in hand, I would reduce the lag structure to one, and so on.
In EViews, we can estimate the equation below without problems for the sample 2010Q1-2022Q2:
equation eq_umid.midas(midwgt=umidas, fixedlag=12) gdp c gdp(-1) # monthly\IP(-2)
The problem is when I try to forecast GDP for 2022Q3. I receive the message: "Unable to compute due to missing data". But this makes no sense to me, given that I have all the data I need with IP at two lags.
I don't know if anyone has ever faced a similar problem, but I would really appreciate some help.
Thank you very much.

How to understand the answer_start parameter of Squad dataset for training BERT-QA model + practical implications for creating custom dataset?

I am in the process of creating a custom dataset to benchmark the accuracy of the 'bert-large-uncased-whole-word-masking-finetuned-squad' model for my domain, to understand if I need to fine-tune further, etc.
When looking at the different Question Answering datasets on the Hugging Face site (squad, adversarial_qa, etc.), I see that the answer is commonly formatted as a dictionary with keys: answer (the text) and answer_start (char index where the answer starts).
I'm trying to understand:
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
If so, is there a programmatic way to do this to avoid manual effort?
Any help or direction would be greatly appreciated!
Code example to show format:
import datasets
ds = datasets.load_dataset('squad')
train = ds['train']
print('Example: \n')
print(train['answers'][0])
Your question is a bit broad to give you a specific answer, but I will try my best to point you in some directions.
The intuition behind how the model uses the answer_start when calculating the loss, accuracy, etc.
There are different types of QA tasks/datasets. The ones you mentioned (SQuAD and adversarial_qa) belong to the field of extractive question answering. There, a model must select a span from a given context that answers the given question. For example:
context = 'Second, Democrats have always elevated their minority floor leader to the speakership upon reclaiming majority status. Republicans have not always followed this leadership succession pattern. In 1919, for instance, Republicans bypassed James R. Mann, R-IL, who had been minority leader for eight years, and elected Frederick Gillett, R-MA, to be Speaker. Mann "had angered many Republicans by objecting to their private bills on the floor;" also he was a protégé of autocratic Speaker Joseph Cannon, R-IL (1903–1911), and many Members "suspected that he would try to re-centralize power in his hands if elected Speaker." More recently, although Robert H. Michel was the Minority Leader in 1994 when the Republicans regained control of the House in the 1994 midterm elections, he had already announced his retirement and had little or no involvement in the campaign, including the Contract with America which was unveiled six weeks before voting day.'
question='How did Republicans feel about Mann in 1919?'
answer='angered' #-> starting at character 365
A simple approach that is often used today is a linear layer that predicts the answer start and answer end from the last hidden state of a transformer encoder (code example). The last hidden state holds one vector for each input token (tokens != words), and the linear layer is trained to assign high probabilities to tokens that could potentially be the start and end of the answer span. To train a model with your data, the loss function needs to know which tokens should get a high probability (i.e. the answer start and end tokens).
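As an illustrative sketch (not the only way to do it), this is roughly how the start/end prediction looks with the Hugging Face transformers library, reusing the context and question defined above:
from transformers import AutoTokenizer, AutoModelForQuestionAnswering

name = 'bert-large-uncased-whole-word-masking-finetuned-squad'
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForQuestionAnswering.from_pretrained(name)

inputs = tokenizer(question, context, return_tensors='pt', truncation=True)
outputs = model(**inputs)  # start_logits / end_logits: one score per token
start = outputs.start_logits.argmax()
end = outputs.end_logits.argmax()
print(tokenizer.decode(inputs['input_ids'][0][start:end + 1]))

# During training, passing the gold token positions (derived from answer_start)
# makes the model also return a cross-entropy loss over start/end predictions:
# outputs = model(**inputs, start_positions=..., end_positions=...)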
If I need to go through the process of adding this to my custom dataset (easier to run model evaluation code, etc?)
You should go through this process; otherwise, how should someone know where the answer starts in your context? One could of course infer it programmatically, but what if your answer string appears twice in the context? Providing an answer start position avoids confusion and allows your users to use the dataset right away with one of the many extractive question answering scripts that are already available out there.
If so, is there a programmatic way to do this to avoid manual effort?
You could simply loop through your dataset and use str.find:
context.find(answer)
Output:
365
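Put together for a whole custom dataset, a sketch might look like the following (the field names 'context' and 'answer_text' are assumptions about your data; note that str.find returns the first occurrence only, so double-check contexts where the answer appears more than once):
def add_answer_start(example):
    # Assumed custom fields: 'context' and 'answer_text'
    start = example['context'].find(example['answer_text'])  # -1 if not found
    example['answers'] = {'text': [example['answer_text']], 'answer_start': [start]}
    return example

my_ds = my_ds.map(add_answer_start)  # my_ds: your custom datasets.Dataset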

Creating radar image from web api data

To get familiar with front-end web development, I'm creating a weather app. Most of the tutorials I found display the temperature, humidity, chance of rain, etc.
Looking at the Dark Sky API, I see the "Time Machine Request" returns observed weather conditions, and the response contains a 'precipIntensity' field: The intensity (in inches of liquid water per hour) of precipitation occurring at the given time. This value is conditional on probability (that is, assuming any precipitation occurs at all).
So it made me wonder: could I create a 'radar image' of precipitation intensity?
Assuming other weather apis are similar, is generating a radar image of precipitation as straightforward as:
Create a grid of latitude/longitude coordinates.
Submit a request for weather data for each coordinate.
Build a color-coded grid of received precipitation intensity values and smooth between them.
Or would that be considered a misuse of the data?
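For concreteness, here is a minimal sketch of those three steps (fetch_precip_intensity is a hypothetical stand-in for a real API call):
import numpy as np
import matplotlib.pyplot as plt

def fetch_precip_intensity(lat, lon):
    # Hypothetical: call the weather API here and return inches/hour at (lat, lon)
    return 0.0

lats = np.linspace(29.0, 31.0, 20)                   # 1. grid of coordinates
lons = np.linspace(-99.0, -97.0, 20)
grid = np.array([[fetch_precip_intensity(lat, lon)   # 2. one request per point
                  for lon in lons] for lat in lats])

plt.pcolormesh(lons, lats, grid, shading='gouraud')  # 3. color-coded + smoothed
plt.colorbar(label='precip intensity (in/hr)')
plt.show()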
Thanks,
Mike
This would most likely end up as a very low-resolution product. I will explain.
Weather observations come in from input sources ranging from mesonet stations to airports and programs like the Citizen Weather Observer Program. All of these thousands of inputs are fed into the NOAA MADIS system, a centralized server that stores all observations. The companies that provide the APIs pull their data from MADIS.
The problem with the observed conditions is twofold. First, the stations are highly clustered in urban areas. In Texas, for example, there are hundreds of stations in Central TX near the cities of San Antonio and Austin, but 100 miles west there is essentially nothing. Generating a radar image this way would require extreme interpolation.
The second problem is observation time. Input from rain gauges is often delayed by several minutes to an hour or more, which would give inaccurate data.
If you wanted a gridded system, the best answer would be to use MRMS (multi-radar-multi-sensor) data from the NWS. It is not an API. These are .grib files that must be downloaded and processed. This is the live viewer and if you want to work on the data itself you can use the NOAA Weather Climate Toolkit to view and/or process by GUI or batch process (You can export to geoTIF and colorize it with GDAL tools). The actual MRMS data is located here and for the basic usage you are looking for, you could use the latest data in the "MergedReflectivityComposite" folder. (That would be how other radar apps show rain.) If you want actual precip intensity, check the "PrecipRate" folder.
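As a sketch for getting started with those files in Python, assuming the xarray + cfgrib stack is installed (pip install xarray cfgrib; the file name below is hypothetical):
import xarray as xr

ds = xr.open_dataset('MRMS_PrecipRate.grib2', engine='cfgrib')
print(ds)  # inspect variables and coordinates
# ds[<variable>].plot() then renders a quick gridded view via matplotlib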
For anything other than radar (warning polygons, etc.), the NWS has an API, located here.
If you have other questions, I will be happy to help.

Number of training samples for text classification task

Suppose you have a set of transcribed customer service calls between customers and human agents, where on average each call's length is 7 minutes. Customers will mostly call because of issues they have with the product. Let's assume that a human can assign one label per axis per call:
Axis 1: What was the problem from the customer's perspective?
Axis 2: What was the problem from the agent's perspective?
Axis 3: Could the agent resolve the customer's issue?
Based on the manually labeled texts, you want to train a text classifier that predicts a label for each call on each of the three axes. But labeling recordings takes time and costs money. On the other hand, you need a certain amount of training data to get good prediction results.
Given the above assumptions, how many manually labeled training texts would you start with? And how do you know that you need more labeled training texts?
Maybe you've worked on a similar task before and can give some advice.
UPDATE (2018-01-19): There's no right or wrong answer to my question. Ok, ideally, somebody worked on exactly the same task, but that's very unlikely. I'll leave the question open for one more week and then accept the best answer.
This would be tricky to answer but I will try my best based on my experience.
In the past, I have performed text classification on 3 datasets; the numbers in brackets indicate how big each dataset was: restaurant reviews (50K sentences), Reddit comments (250K sentences), and developer comments from issue tracking systems (10K sentences). Each of them had multiple labels as well.
In each of the three cases, including the one with 10K sentences, I achieved an F1 score of more than 80%. I am stressing this dataset specifically because some people told me its size was too small.
So, in your case, assuming you have at least 1000 instances (calls that include conversation between customer and agent) of average 7-minute calls, this should be a decent start. If the results are not satisfying, you have the following options:
1) Use different models (Multinomial Naive Bayes (MNB), Random Forest, Decision Tree, and so on, in addition to whatever you are using); a minimal baseline sketch follows this list.
2) If point 1 gives more or less similar results, check the ratio of instances across all the classes you have (on each of the 3 axes you are talking about here). If the classes are heavily imbalanced, get more data, or try different balancing techniques if you cannot get more data.
3) Another way would be to classify at the sentence level rather than the message or conversation level, to generate more data and individual labels for sentences rather than for the message or the conversation itself.
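For point 1, a minimal baseline sketch with scikit-learn (texts and labels stand for your transcripts and the labels on one axis):
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# texts: list of call transcripts; labels: the labels for one axis
clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=2), MultinomialNB())
scores = cross_val_score(clf, texts, labels, cv=5, scoring='f1_macro')
print(scores.mean())  # macro F1 also surfaces class-imbalance problems (point 2)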

Calculate expected color temperature of daylight

I have a location (latitude/longitude) and a timestamp (year/month/day/hour/minute).
Assuming clear skies, is there an algorithm to loosely estimate the color temperature of sunlight at that time and place?
If I know what the weather was at that time, is there a suggested way to modify the color temperature for the amount of cloud cover at that time?
I suggest taking a look at this paper, which has a nice practical implementation for CG applications:
"A Practical Analytic Model for Daylight" by A. J. Preetham, Peter Shirley, and Brian Smits
Abstract
Sunlight and skylight are rarely rendered correctly in computer graphics. A major reason for this is high computational expense. Another is that precise atmospheric data is rarely available. We present an inexpensive analytic model that approximates full spectrum daylight for various atmospheric conditions. These conditions are parameterized using terms that users can either measure or estimate. We also present an inexpensive analytic model that approximates the effects of atmosphere (aerial perspective). These models are fielded in a number of conditions and intermediate results verified against standard literature from atmospheric science. Our goal is to achieve as much accuracy as possible without sacrificing usability.
Both compressed postscript and pdf files of the paper are available. Example code is available.
Link-only answers are discouraged, but I can post neither a sufficient portion of the article nor any complete C++ code snippet here, as both are way too big. Following the link you can find both right now.
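Short of implementing the full Preetham model, here is a very rough sketch of the basic idea: compute solar elevation from location and time, then map elevation to a ballpark color temperature. The elevation formula ignores the equation of time, and the CCT numbers are crude assumptions (roughly 2000 K near the horizon rising to ~6500 K overhead, pulled toward a flat overcast sky as cloud cover grows), not values from the paper:
import math

def solar_elevation_deg(lat_deg, lon_deg, utc_hour, day_of_year):
    # Cooper's declination formula; solar time approximated from longitude only
    decl = math.radians(23.45) * math.sin(2 * math.pi * (284 + day_of_year) / 365)
    hour_angle = math.radians(15.0 * (utc_hour + lon_deg / 15.0 - 12.0))
    lat = math.radians(lat_deg)
    sin_el = (math.sin(lat) * math.sin(decl)
              + math.cos(lat) * math.cos(decl) * math.cos(hour_angle))
    return math.degrees(math.asin(sin_el))

def daylight_cct(elevation_deg, cloud_fraction=0.0):
    # Assumed mapping: ~2000 K at the horizon up to ~6500 K at 60 degrees and above;
    # clouds blend the result toward a ~6750 K overcast sky
    clear = 2000.0 + 4500.0 * min(max(elevation_deg, 0.0) / 60.0, 1.0)
    return (1.0 - cloud_fraction) * clear + cloud_fraction * 6750.0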
