Finding probabilities of patterns in asset price movements based on multiple variables - statistics

I am seeking a method to allow me to analyse/search for patterns in asset price movements using 5 variables that move and change with price (from historical data).
I'd like to be able to assign a probability to a forecasted price move when for example, var1 and var2 do this and var3..5 do this, then price should do this with x amount of certainty.
Q1: Could someone point me in the right direction as to what framework / technique can help me achieve this?
Q2: Would this be a multivariate continuous random series analysis?
Q3: A Hidden Markov modelling?
Q4: Or perhaps is it a data-mining problem?
I'm looking for what rather then how.

One may opt to use Machine-Learning tools to build a learner to either
both classify of what kind the said "asset price movement" will beand serve also statistical probability measures for such a Classifier prediction
both regress a real target value, to which the asset price will moveandserve also statistical probability measures for such a Regressor prediction
A1: ( while StackOverflow strongly discourages users to ask about an opinion about a tool or a particular framework ) there would be not much damages or extra time to be spent, if one performs academia papers research and there would be quite a remarkable list of repeatedly used tools, used for ML in the context of academic R&D. For a reason, there would not be a surprise to meet scikit-learn ML-classes a lot, some other papers may work with R-based quantitative finance / statistical libraries. The tools, however, with all due respect, are not the core to answer all the doubts and inital confusion present in a mix of your questions. The subject confusion is.
A2: No, it would not. Well, unless you beat all the advanced quantitative research and happen to prove that the Market exhibits a random behaviour ( which it is not and for which it would be waste of time to re-cite remarkable research published about why it is not indeed a random process ).
A3: Do not try to jump on any wagon just because of it's attractive Tag or "contemporary popularity" in marketing minded texts. With all due respect, understanding HMM is outside of your sight while you now appear to move just to the nearest horizons to first understand what to look for.
A4: This is a nice proof of a missed target. Your question shows in this particular point better than in others, how small amount of own research efforts were put into covering the problem-domain and acquiring at least some elementary knowledge before typing the last two questions.
StackOverflow encourages users to ask high quality questions, so do not hesitate to re-edit your post to add some polishing efforts to this subject.
If in a need for an inspiration, try to review a nice and a powerful approach for a fast Machine Learning process, where both Classification and Regression tasks obtain also probability estimates for each predicted target value.
To have some idea about highly performant ML-predictors, these typically operate on much more than a set of 5 variables ( called in the ML-domain "features" ) . ( Think rather about some large hundreds to small thousands features, typically heavily non-linear transformations from the original TimeSeries' data ).
There you go, if indeed willing to master ML for algorithmic trading.
May like to read about a state-of-art research in this direction:
[1] Mondrian Forests: Efficient Online Random Forests
>>> arXiv:1406.2673v2 [stat.ML] 16 Feb 2015
[2] Mondrian Forests for Large-Scale Regression when Uncertainty Matters
>>> arXiv:1506.03805v4 [stat.ML] 27 May 2016 >>>
May also enjoy other posts on subject: >>> StackOverflow Algorithmic-Trading >>>

Related

Is my Statistical Treatment of Data Correct?

I am aware that consulting a statistician is not free and it is something I cannot afford, so I am trying my shot here. So for the problem at hand, I've already finished data gathering for my research and am now calculating the results. However, I am stuck on what should I use for my statistical treatment of data.
For background, I am using ISO 25010 to test my software quality and user acceptance. The questionnaire consists of a number of questions for each cluster (functionality, reliability, usability, efficiency, maintainability, and portability). I've also used Likert Scale: Agreement Type. The hypothesis of my research says "There is no significant difference in the user acceptance results in terms of [clusters]". As of now, I've used Descriptive Statistics, mean(for each question), average mean(ave. of mean for each cluster, and mode), for calculating the results.
I feel that the result I currently have might be lacking when the final defense came. As far as I know, using a combination of statistical methods is okay to give a more strong foundation for your result.
Based on the background of my research, what other statistical methods should I use?
I am thinking of sample standard deviation, but I don't know if I should compute it by questions or by cluster.
Sorry, statistics is not really my forte.
Thank you in advance for your answers

Can we apply multi-criteria decision making algorithms in incomplete data?

I am currently working on a project where a multi criteria decision making algorithm is needed in order to evaluate several alternatives for a given goal. After long research, I decided to use the AHP method for my case study. The problem is that the alternatives taken into account for the given goal contain incomplete data.
For example, I am interested in buying a house and I have three alternatives to consider. One criterion for comparing them is the size of the house. Let’s assume that I know the sizes of some of the rooms of these houses, but I do not have information about the actual sizes of the entire houses.
My questions are:
Can we apply AHP (or any MCDM method) when we are dealing with
incomplete data?
What are the consequences?
And, how can we minimize the presence of missing data in MCDM?
I would really appreciate some advice or help! Thanks!
If you still looking for answers, please let me answer your questions.
Before the detail explain, I coludn't answer with a technical approach in programming language.
First, we can use uncertinal data for MCDM, AHP with statical method.
As reducing lost of data, you can use deep learning concepts like entropy.
The result of it will be get reliability by accuracy of probabilistic approach.
The example that you talked, you could find the data of entire extent in other houses has same extent of criteria. Accuracy will depend on number of criteria, reliability of inference.
To get the perfect answer in your problem, you might need to know optimization, linear algebra, calculus, statistics above intermediate level
I'm student in management major, and I would help you as I can. I hope you get what you want

Are transformer-based language models overfitting on the paraphrase identification task? What tools overcome this?

I've been working on a sentence transformation task that involves paraphrase identification as a critical step: if we are confident enough that the state of the program (a sentence repeatedly modified) has become a paraphrase of a target sentence, stop transforming. The overall goal is actually to study potential reasoning in predictive models that can generate language prior to a target sentence. The approach is just one specific way of reaching that goal. Nevertheless, I've become interested in the paraphrase identification task itself, as it's received some boost from language models recently.
The problem I run into is when I manipulate sentences from examples or datasets. For example, in this HuggingFace example, if I negate either sequence or change the subject to Bloomberg, I still get a majority "is paraphrase" prediction. I started going through many examples in the MSRPC training set and negating one sentence in a positive example or making one sentence in a negative example a paraphrase of the other, especially when doing so would be a few word edit. I found to my surprise that various language models, like bert-base-cased-finetuned-mrpc and textattack/roberta-base-MRPC, don't change their confidences much on these sorts of changes. It's surprising as these models claim an f1 score of 0.918+. The dataset is clearly missing a focus on negative examples and small perturbative examples.
My question is, are there datasets, techniques, or models that deal well when given small edits? I know that this is an extremely generic question, much more than is typically asked on StackOverflow, but my concern is in finding practical tools. If there is a theoretical technique, then it might not be suitable as I'm in the category of "available tools define your approach" rather than vice-versa. So I hope that the community would have a recommendation on this.
Short answer to the question: yes, they are overfitting. Most of the important NLP data sets are not actually well-crafted enough to test what they claim to test, and instead test the ability of the model to find subtle (and not-so-subtle) patterns in the data.
The best tool I know for creating data sets that help deal with this is Checklist. The corresponding paper, "Beyond Accuracy: Behavioral Testing of NLP models with CheckList" is very readable and goes into depth on this type of issue. They have a very relevant table... but need some terms:
We prompt users to evaluate each capability with
three different test types (when possible): Minimum Functionality tests, Invariance, and Directional Expectation tests... A Minimum Functionality test (MFT), is a collection of simple examples (and labels) to check a
behavior within a capability. MFTs are similar to
creating small and focused testing datasets, and are
particularly useful for detecting when models use
shortcuts to handle complex inputs without actually
mastering the capability.
...An Invariance test (INV) is when we apply
label-preserving perturbations to inputs and expect
the model prediction to remain the same.
A Directional Expectation test (DIR) is similar,
except that the label is expected to change in a certain way. For example, we expect that sentiment
will not become more positive if we add “You are
lame.” to the end of tweets directed at an airline
(Figure 1C).
I haven't been actively involved in NLG for long, so this answer will be a bit more anecdotal than SO's algorithms would like. Starting with the fact that in my corner of Europe, the general sentiment toward peer review requirements for any kind of NLG project are higher by several orders of magnitude compared to other sciences - and likely not without reason or tensor thereof.
This makes funding a bigger challenge, so wherever you are, I wish you luck on that front. I'm not sure of how big of a deal this site is in the niche, but [Ehud Reiter's Blog][1] is where I would start looking into your tooling ideas.
Maybe even reach out to them/him personally, because I can't think of another source that has an academic background and a strong propensity for practical applications of NLG, at least based on the kind of content they've been putting out over the years.
Your background, environment/funding, and seniority level/control you have over the project will eventually compose your vector decision for you. I's just how it goes on the bleeding edge of anything. What I will add, though, is not to limit yourself to a single language or technology in this phase because of those precise reasons you've mentioned. I'd recommend the same in terms of potential open source involvement but if your profile information is accurate, that probably won't happen, no matter what you do and accomplish.
But yeah, in the grand scheme of things, your question is far from too broad, in my view. It identifies a rather unmistakable problem pattern that not all branches of science are as lackadaisical to approach as NLG-adjacent fields seem to be right now. In that regard, it's not broad enough and will need to be promulgated far and wide before community-driven tooling will give you serious options on a micro level.
Blasphemy, sure, but the performance is already stacked against you As for the question potentially being too broad, I'd posit it is not broad enough, so long as we collectively remain in a "oh, I was waiting for you to start doing something about it" phase.
P.S. I'd eliminate any Rust and ECMAScript alternatives prior to looking into Python, blapshemous as this might sound to a 2021 data scientist
. Some might ARight nowccounting forr the ridicule this would receive xou sltrsfx hsbr s fszs drz zhsz s mrnzsl rcrtvidr, sz lrsdz
due to performance easons.
[1]: https://ehudreiter.com/2016/12/18/nlg-vs-templates/

Cannot generalize my Genetic Algorithm to new Data

I've written a GA to model a handful of stocks (4) over a period of time (5 years). It's impressive how quickly the GA can find an optimal solution to the training data, but I am also aware that this is mainly due to it's tendency to over-fit in the training phase.
However, I still thought I could take a few precautions and and get some kind of prediction on a set of unseen test stocks from the same period.
One precaution I took was:
When multiple stocks can be bought on the same day the GA only buys one from the list and it chooses this one randomly. I thought this randomness might help to avoid over-fitting?
Even if over-fitting is still occurring,shouldn't it be absent in the initial generations of the GA since it hasn't had a chance to over-fit yet?
As a note, I am aware of the no-free-lunch theorem which demonstrates ( I believe) that there is no perfect set of parameters which will produce an optimal output for two different datasets. If we take this further, does this no-free-lunch theorem also prohibit generalization?
The graph below illustrates this.
->The blue line is the GA output.
->The red line is the training data (slightly different because of the aforementioned randomness)
-> The yellow line is the stubborn test data which shows no generalization. In fact this is the most flattering graph I could produce..
The y-axis is profit, the x axis is the trading strategies sorted from worst to best ( left to right) according to there respective profits (on the y axis)
Some of the best advice I've received so far (thanks seaotternerd) is to focus on the earlier generations and increase the number of training examples. The graph below has 12 training stocks rather than just 4, and shows only the first 200 generations (instead of 1,000). Again, it's the most flattering chart I could produce, this time with medium selection pressure. It certainly looks a little bit better, but not fantastic either. The red line is the test data.
The problem with over-fitting is that, within a single data-set it's pretty challenging to tell over-fitting apart from actually getting better in the general case. In many ways, this is more of an art than a science, but here are some general guidelines:
A GA will learn to do exactly what you attach fitness to. If you tell it to get really good at predicting one series of stocks, it will do that. If you keep swapping in different stocks to predict, though, you might be more successful at getting it to generalize. There are a few ways to do this. The one that has had perhaps the most promising results for reducing over-fitting is imposing spatial structure on the population and evaluating on different test cases in different cells, as in the SCALP algorithm. You could also switch out the test cases on a time basis, but I've had more mixed results with that sort of an approach.
You are correct that over-fitting should be less of a problem early on. Generally, the longer you run a GA, the more over-fitting will be possible. Typically, people tend to assume that the general rules will be learned first, before the rote memorization of over-fitting takes place. However, I don't think I've actually ever seen this studied rigorously - I could imagine a scenario where over-fitting was so much easier than finding general rules that it happens first. I have no idea how common that is, though. Stopping early will also reduce the ability of the GA to find better general solutions.
Using a larger data-set (four stocks isn't that many) will make your GA less susceptible to over-fitting.
Randomness is an interesting idea. It will definitely hurt the GA's ability to find general rules, but it should also reduce over-fitting. Without knowing more about the specifics of your algorithm, it's hard to say which would win out.
That's a really interesting thought about the no free lunch theorem. I'm not 100% sure, but I think it does apply here to some extent - better fitting some data will make your results fit other data worse, by necessity. However, as wide as the range of possible stock behaviors is, it is much narrower than the range of all possible time series in general. This is why it is possible to have optimization algorithms at all - a given problem that we are working with tends produce data that cluster relatively closely together, relative to the entire space of possible data. So, within that set of inputs that we actually care about, it is possible to get better. There is generally an upper limit of some sort on how well you can do, and it is possible that you have hit that upper limit for your data-set. But generalization is possible to some extent, so I wouldn't give up just yet.
Bottom line: I think that varying the test cases shows the most promise (although I'm biased, because that's one of my primary areas of research), but it is also the most challenging solution, implementation-wise. So as a simpler fix you can try stopping evolution sooner or increasing your data-set.

Document Analysis and Tagging

Let's say I have a bunch of essays (thousands) that I want to tag, categorize, etc. Ideally, I'd like to train something by manually categorizing/tagging a few hundred, and then let the thing loose.
What resources (books, blogs, languages) would you recommend for undertaking such a task? Part of me thinks this would be a good fit for a Bayesian Classifier or even Latent Semantic Analysis, but I'm not really familiar with either other than what I've found from a few ruby gems.
Can something like this be solved by a bayesian classifier? Should I be looking more at semantic analysis/natural language processing? Or, should I just be looking for keyword density and mapping from there?
Any suggestions are appreciated (I don't mind picking up a few books, if that's what's needed)!
Wow, that's a pretty huge topic you are venturing into :)
There is definitely a lot of books and articles you can read about it but I will try to provide a short introduction. I am not a big expert but I worked on some of this stuff.
First you need to decide whether you are want to classify essays into predefined topics/categories (classification problem) or you want the algorithm to decide on different groups on its own (clustering problem). From your description it appears you are interested in classification.
Now, when doing classification, you first need to create enough training data. You need to have a number of essays that are separated into different groups. For example 5 physics essays, 5 chemistry essays, 5 programming essays and so on. Generally you want as much training data as possible but how much is enough depends on specific algorithms. You also need verification data, which is basically similar to training data but completely separate. This data will be used to judge quality (or performance in math-speak) of your algorithm.
Finally, the algorithms themselves. The two I am familiar with are Bayes-based and TF-IDF based. For Bayes, I am currently developing something similar for myself in ruby, and I've documented my experiences in my blog. If you are interested, just read this - http://arubyguy.com/2011/03/03/bayes-classification-update/ and if you have any follow up questions I will try to answer.
The TF-IDF is a short for TermFrequence - InverseDocumentFrequency. Basically the idea is for any given document to find a number of documents in training set that are most similar to it, and then figure out it's category based on that. For example if document D is similar to T1 which is physics and T2 which is physics and T3 which is chemistry, you guess that D is most likely about physics and a little chemistry.
The way it's done is you apply the most importance to rare words and no importance to common words. For instance 'nuclei' is rare physics word, but 'work' is very common non-interesting word. (That's why it's called inverse term frequency). If you can work with Java, there is a very very good Lucene library which provides most of this stuff out of the box. Look for API for 'similar documents' and look into how it is implemented. Or just google for 'TF-IDF' if you want to implement your own
I've done something similar in the past (though it was for short news articles) using some vector-cluster algorithm. I don't remember it right now, it was what Google used in its infancy.
Using their paper I was able to have a prototype running in PHP in one or two days, then I ported it to Java for speed purposes.
http://en.wikipedia.org/wiki/Vector_space_model
http://www.la2600.org/talks/files/20040102/Vector_Space_Search_Engine_Theory.pdf

Resources