How to make a score from multiple KPIs

I am wondering if it is possible to create a global score using multiple KPIs with different scales.
Example:
I would like to combine all these KPIs into one score that tells me which version is better. Is this possible? (I give the three KPIs the same weight in the score.)

There is quite a lot of theory on (credit) rating methods, which provides a sound mathematical basis for what you are after. You might start by reading about scorecards in general. A common way of combining different scores uses a logit model.
The short answer to your question: there is no single best way to combine three KPIs. You have to try different formulas and decide on one of them based on statistical tests in a validation step.
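As one candidate formula to try (not the only one, and not validated), each KPI can be standardized to a z-score so that the different scales become comparable, and the z-scores can then be averaged with equal weights. The KPI names, the values, and the "higher is better" flags below are made up for illustration.

```python
from statistics import mean, pstdev

def zscores(values, higher_is_better=True):
    """Standardize one KPI across versions; flip the sign if lower is better."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0] * len(values)  # the KPI does not discriminate between versions
    sign = 1.0 if higher_is_better else -1.0
    return [sign * (v - mu) / sigma for v in values]

def composite_scores(kpis, directions, weights=None):
    """Weighted sum of standardized KPIs; equal weights by default."""
    names = list(kpis)
    weights = weights or {name: 1 / len(names) for name in names}
    standardized = {name: zscores(kpis[name], directions[name]) for name in names}
    n_versions = len(kpis[names[0]])
    return [
        sum(weights[name] * standardized[name][i] for name in names)
        for i in range(n_versions)
    ]

# Hypothetical data: three versions (index 0, 1, 2) measured on three KPIs.
kpis = {
    "conversion_rate": [0.021, 0.025, 0.019],
    "avg_load_time_ms": [850, 1200, 700],
    "revenue_per_user": [3.1, 3.4, 2.9],
}
# Assumption: load time is the only KPI where a lower value is better.
directions = {
    "conversion_rate": True,
    "avg_load_time_ms": False,
    "revenue_per_user": True,
}

scores = composite_scores(kpis, directions)
best_version = max(range(len(scores)), key=scores.__getitem__)
print(scores, best_version)
```

Min-max scaling or rank-based scores are alternative formulas of the same kind; which one to keep is exactly what the validation step has to decide.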
Further reading
Using a Balanced Scorecard to Measure Your Key Performance Indicators - a brief primer on the topic
Chapter on Logit from the book Stefan Trueck, Svetlozar T. Rachev: Rating Based Modeling of Credit Risk: Theory and Application of Migration
Guidelines on Credit Risk Management - OeNB as PDF

Related

How can I determine the best data structure/implementation for my dataset?

Preface: I'm a self-taught coder, so a lot of my knowledge is limited to my research. I'm hoping to have other opinions as I want to build things right the first time. I need help with determining an appropriate solution and how to implement the solution.
I'm looking to build a least cost alternative model (essentially a shortest path) for delivering between locations (nodes), based on different modes of transportation (vehicles) and the different roads taken (paths). Another consideration is the product price (value) to determine the least cost path.
Here are my important data items:
nodes: cities where the product will travel to and from.
paths: roads have different costs, depending on the road.
vehicles: varying vehicles have differing rental costs when transporting (motorbike, car, truck). Note that the cost of a vehicle is not constant, it is highly dependent on the to/from nodes. For example, using a car to go from city A to city B will have a different cost than using a car to go from city B to A or city A to city C.
value: Product value. Again, a product's value is highly dependent on its destination node. The same product can have a different value at City A, B or C.
Problem Statement
How do I set up a data structure to best determine the least cost path for getting a product from one location to every other location?
Possible Solutions
From my research, I believe a weighted graph data structure, in combination with Dijkstra's algorithm, would be most suitable for my situation. I believe it is essential to break the problem down, first creating a simple weighted graph of only nodes and paths.
From there, I would add the vehicle cost and product value considerations. Perhaps just adding the two values as a cost to "visit" a node (that is, incorporating them into the path cost)?
Thoughts on my current solution? Other considerations I overlooked? Perhaps a better solution?
Implementation
I'd love to be able to build this within Excel VBA (as that is how I learned to code), and Excel is what I use for my tools. Would VBA be too limited for this task? How else could I integrate my analysis in Excel with another language?

Try the book Practical Management Science by Winston & Albright and check out the chapter on Operations Management - lots of models explained in there from the simple onwards. Available online as a pdf : http://ingenieria-industrial.net/downloads/practicalmanagementscience.pdf

VBA is more a scripting language than a full-fledged one; note that it is hosted inside the Office applications and is not built on .NET. Why don't you give C++ or Java a shot? If you intuitively understand the data structure and the algorithm, then coding them in these languages will be a breeze. Chapter 4 of Algorithms by Sedgewick and Wayne has a beautiful explanation of shortest paths. You may also consider studying the Bellman-Ford algorithm if you foresee any negative edge weights: Dijkstra's algorithm assumes non-negative weights, whereas Bellman-Ford handles negative weights and can detect negative-weight cycles.
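To make the data structure concrete, here is a minimal sketch of Dijkstra's algorithm over a directed weighted graph, written in Python rather than VBA. It assumes that the road cost and the vehicle rental cost have already been added together into one non-negative cost per (origin, destination, vehicle) triple, so that A to B by car can cost something different from B to A by car. The city names and costs are invented.

```python
import heapq

def least_cost_paths(edges, source):
    """Dijkstra's algorithm from one source to every reachable node.

    edges maps (origin, destination, vehicle) -> non-negative cost.
    Returns {node: (total_cost, [(origin, destination, vehicle), ...])}.
    """
    adjacency = {}
    for (origin, dest, vehicle), cost in edges.items():
        if cost < 0:
            raise ValueError("Dijkstra's algorithm requires non-negative costs")
        adjacency.setdefault(origin, []).append((dest, vehicle, cost))
        adjacency.setdefault(dest, [])

    best = {source: 0.0}          # cheapest known cost to reach each node
    previous = {}                 # node -> (node we came from, vehicle used)
    queue = [(0.0, source)]
    while queue:
        cost_so_far, node = heapq.heappop(queue)
        if cost_so_far > best.get(node, float("inf")):
            continue              # stale queue entry, a cheaper route was found
        for dest, vehicle, cost in adjacency.get(node, []):
            candidate = cost_so_far + cost
            if candidate < best.get(dest, float("inf")):
                best[dest] = candidate
                previous[dest] = (node, vehicle)
                heapq.heappush(queue, (candidate, dest))

    routes = {}
    for node, total in best.items():
        legs, current = [], node
        while current != source:  # walk back to the source to rebuild the route
            origin, vehicle = previous[current]
            legs.append((origin, current, vehicle))
            current = origin
        routes[node] = (total, legs[::-1])
    return routes

# Hypothetical costs (road cost plus vehicle rental, already summed).
edges = {
    ("A", "B", "car"): 10.0,
    ("A", "B", "truck"): 14.0,
    ("B", "A", "car"): 12.0,
    ("A", "C", "motorbike"): 4.0,
    ("C", "B", "motorbike"): 3.0,
    ("B", "D", "truck"): 6.0,
}
routes = least_cost_paths(edges, "A")
for city, (cost, legs) in sorted(routes.items()):
    print(city, cost, legs)
```

Because the product value depends only on the destination, one option is to leave it out of the graph and compare value at each city against the delivery cost returned here. Folding it into the edge costs could make some of them negative, which is the case where Bellman-Ford is needed instead.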

Alternatives to TF-IDF and Cosine Similarity (comparing documents with different formats)

I've been working on a small, personal project which takes a user's job skills and suggests the most ideal career for them based on those skills. I use a database of job listings to achieve this. At the moment, the code works as follows:
1) Process the text of each job listing to extract skills that are mentioned in the listing
2) For each career (e.g. "Data Analyst"), combine the processed text of the job listings for that career into one document
3) Calculate the TF-IDF of each skill within the career documents
After this, I'm not sure which method I should use to rank careers based on a list of a user's skills. The most popular method that I've seen would be to treat the user's skills as a document as well, then to calculate the TF-IDF for the skill document, and use something like cosine similarity to calculate the similarity between the skill document and each career document.
This doesn't seem like the ideal solution to me, since cosine similarity is best used when comparing two documents of the same format. For that matter, TF-IDF doesn't seem like the appropriate metric to apply to the user's skill list at all. For instance, if a user adds additional skills to their list, the TF for each skill will drop. In reality, I don't care what the frequency of the skills is in the user's skills list -- I just care that they have those skills (and maybe how well they know those skills).
It seems like a better metric would be to do the following:
1) For each skill that the user has, calculate the TF-IDF of that skill in the career documents
2) For each career, sum the TF-IDF results for all of the user's skills
3) Rank careers based on the above sum
Am I thinking along the right lines here? If so, are there any algorithms that work along these lines, but are more sophisticated than a simple sum? Thanks for the help!
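The three steps above can be sketched as follows. This is only an illustration of the proposed sum, using one common TF-IDF variant (relative term frequency times log(N/df)); the careers and skills are invented, and each career document is assumed to be already reduced to a list of extracted skills.

```python
import math
from collections import Counter

def tfidf_by_career(career_skills):
    """TF-IDF weight of every skill within every career document."""
    n_docs = len(career_skills)
    doc_freq = Counter()
    for skills in career_skills.values():
        doc_freq.update(set(skills))      # count each skill once per career
    weights = {}
    for career, skills in career_skills.items():
        counts, total = Counter(skills), len(skills)
        weights[career] = {
            skill: (count / total) * math.log(n_docs / doc_freq[skill])
            for skill, count in counts.items()
        }
    return weights

def rank_careers(user_skills, weights):
    """Sum the TF-IDF weights of the user's skills per career, best first."""
    scores = {
        career: sum(w.get(skill, 0.0) for skill in set(user_skills))
        for career, w in weights.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical career documents (skills extracted from job listings).
career_skills = {
    "Data Analyst": ["sql", "excel", "python", "sql", "tableau"],
    "Web Developer": ["javascript", "html", "css", "javascript", "sql"],
    "Data Engineer": ["python", "sql", "spark", "python", "airflow"],
}
ranking = rank_careers(["python", "spark", "sql"], tfidf_by_career(career_skills))
print(ranking)
```

Note that the user's list is treated as a set, so adding more skills never lowers the contribution of the skills already there, and a skill that appears in every career (here "sql") contributes nothing because its IDF is zero.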

The second approach you described will work, but there are better ways to solve this kind of problem.
First, you should learn a little about language models and move away from the vector space model.
Second, since your problem is similar to expert finding/profiling, you should learn a baseline language modeling framework in order to implement a solution.
You can implement "A language modeling framework for expert finding" with a few changes so that its formulas are adapted to your problem.
Reading "On the assessment of expertise profiles" will also give you a better understanding of expert profiling with the framework above.
You can find some good ideas, resources, and projects on expert finding/profiling on Balog's blog.
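To give a flavour of the language-model view, here is a rough sketch of query-likelihood scoring with Jelinek-Mercer smoothing, a standard baseline in that family. It is not claimed to reproduce the exact formulas of the papers mentioned above; the data, the smoothing weight lam, and the choice to skip skills unseen in the whole collection are all assumptions made for illustration.

```python
import math
from collections import Counter

def rank_by_query_likelihood(user_skills, career_skills, lam=0.5):
    """Rank careers by log P(user skills | career language model).

    P(skill | career) is smoothed with the collection-wide model:
    (1 - lam) * P(skill | career) + lam * P(skill | collection).
    """
    collection = Counter()
    for skills in career_skills.values():
        collection.update(skills)
    collection_size = sum(collection.values())
    if collection_size == 0:
        raise ValueError("career documents contain no skills")

    scores = {}
    for career, skills in career_skills.items():
        counts, length = Counter(skills), len(skills)
        log_p = 0.0
        for skill in user_skills:
            p_doc = counts[skill] / length if length else 0.0
            p_coll = collection[skill] / collection_size
            p = (1 - lam) * p_doc + lam * p_coll
            if p == 0.0:
                continue  # assumption: ignore skills unseen in every career
            log_p += math.log(p)
        scores[career] = log_p
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical career documents (skills extracted from job listings).
career_skills = {
    "Data Analyst": ["sql", "excel", "python", "sql", "tableau"],
    "Web Developer": ["javascript", "html", "css", "javascript", "sql"],
    "Data Engineer": ["python", "sql", "spark", "python", "airflow"],
}
ranking = rank_by_query_likelihood(["python", "spark"], career_skills)
print(ranking)
```

Unlike the plain TF-IDF sum, a career that lacks one of the user's skills is penalized through the smoothed probability instead of simply contributing zero for that skill.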

I would take the SSRM [1] approach to expand the query (job documents) using WordNet (extracted database [2]) as a semantic lexicon, so you are not constrained to direct word-vs-word matches. SSRM has its own similarity measure (I believe the paper is open access; if not, check this: http://blog.veles.rs/document-similarity-computation-models-literature-review/, where many similarity computation models are listed). Alternatively, if your corpus is big enough, you might try LSA/LSI [3,4] (also covered on that page), which needs no external lexicon. But if the text is in English, WordNet's semantic graph is really rich in all directions (hyponyms, synonyms, hypernyms... concepts/synsets).
The bottom line: I would avoid simple VSM/TF-IDF for such a concrete domain. I measured a substantial margin for SSRM over TF-IDF/VSM (measured as macro-averaged F1, 5-class single-label classification, narrow domain).
[1] A. Hliaoutakis, G. Varelas, E. Voutsakis, E.G.M. Petrakis, E. Milios, Information Retrieval by Semantic Similarity, Int. J. Semant. Web Inf. Syst. 2 (2006) 55–73. doi:10.4018/jswis.2006070104.
[2] J.E. Petralba, An extracted database content from WordNet for Natural Language Processing and Word Games, in: 2014 Int. Conf. Asian Lang. Process., 2014: pp. 199–202. doi:10.1109/IALP.2014.6973502.
[3] P.W. Foltz, Latent semantic analysis for text-based research, Behav. Res. Methods, Instruments, Comput. 28 (1996) 197–202. doi:10.3758/BF03204765.
[4] A. Kashyap, L. Han, R. Yus, J. Sleeman, T. Satyapanich, S. Gandhi, T. Finin, Robust semantic text similarity using LSA, machine learning, and linguistic resources, Springer Netherlands, 2016. doi:10.1007/s10579-015-9319-2.

Finding probabilities of patterns in asset price movements based on multiple variables

I am seeking a method to allow me to analyse/search for patterns in asset price movements using 5 variables that move and change with price (from historical data).
I'd like to be able to assign a probability to a forecasted price move: for example, when var1 and var2 do this and var3..5 do that, then the price should do this with x amount of certainty.
Q1: Could someone point me in the right direction as to what framework / technique can help me achieve this?
Q2: Would this be a multivariate continuous random series analysis?
Q3: A Hidden Markov modelling?
Q4: Or perhaps is it a data-mining problem?
I'm looking for what rather than how.

One may opt to use machine-learning tools to build a learner that either:
classifies what kind of "asset price movement" will occur, and also provides statistical probability measures for that classifier's prediction, or
regresses a real target value to which the asset price will move, and also provides statistical probability measures for that regressor's prediction.
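As a deliberately naive illustration of "a classifier that also reports a probability", the sketch below reduces each of the 5 variables to the sign of its change, counts how often the price rose after each sign pattern in the history, and returns a smoothed frequency. It is a stand-in for a real probabilistic learner, not one of the methods cited further down; the history rows, the prior, and its strength are invented.

```python
from collections import defaultdict

def sign_pattern(changes):
    """Reduce each variable's change to its direction: 1, 0 or -1."""
    return tuple(1 if c > 0 else -1 if c < 0 else 0 for c in changes)

def fit_pattern_table(history):
    """history: iterable of (variable_changes, next_price_change) pairs."""
    table = defaultdict(lambda: [0, 0])   # pattern -> [times price rose, total]
    for changes, price_change in history:
        entry = table[sign_pattern(changes)]
        entry[0] += price_change > 0
        entry[1] += 1
    return table

def probability_up(table, changes, prior=0.5, strength=2.0):
    """Smoothed estimate of P(price rises | pattern).

    Unseen patterns fall back to the prior; strength says how many
    observations the prior is worth (both values are assumptions).
    """
    up, total = table.get(sign_pattern(changes), (0, 0))
    return (up + prior * strength) / (total + strength)

# Hypothetical history: changes of var1..var5, then the next price change.
history = [
    ((0.2, 0.1, -0.3, 0.0, 0.5), 1.2),
    ((0.4, 0.3, -0.1, 0.0, 0.2), 0.8),
    ((0.1, 0.2, -0.2, 0.0, 0.9), -0.4),
    ((-0.2, 0.1, 0.3, 0.1, -0.5), -1.0),
]
table = fit_pattern_table(history)
p_seen = probability_up(table, (0.3, 0.3, -0.5, 0.0, 0.1))
p_unseen = probability_up(table, (-1.0, -1.0, -1.0, -1.0, -1.0))
print(p_seen, p_unseen)
```

With only 5 raw variables and a handful of rows, such a table mostly shows how little evidence each pattern has, which is one reason real predictors use far richer feature sets.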
A1: (While StackOverflow strongly discourages users from asking for opinions about a tool or a particular framework,) it would cost little extra time to survey academic papers, and you would find quite a remarkable list of tools used repeatedly for ML in academic R&D. It would be no surprise to meet scikit-learn ML classes a lot; other papers work with R-based quantitative finance / statistical libraries. The tools, however, with all due respect, are not the core of the doubts and initial confusion present in the mix of your questions. The confusion about the subject is.
A2: No, it would not. Well, unless you beat all the advanced quantitative research and happen to prove that the market exhibits random behaviour (which it does not, and it would be a waste of time to re-cite the remarkable research published on why it is not a random process).
A3: Do not jump on any wagon just because of its attractive tag or "contemporary popularity" in marketing-minded texts. With all due respect, understanding HMMs is beyond your current horizon while you are still working out what to look for.
A4: This is a nice proof of a missed target. This point shows, better than the others, how little research effort of your own was put into covering the problem domain and acquiring at least some elementary knowledge before typing the last two questions.
StackOverflow encourages users to ask high-quality questions, so do not hesitate to edit your post and polish it.
If you need inspiration, review a powerful approach to fast machine learning in which both classification and regression tasks also obtain probability estimates for each predicted target value.
To give some idea: highly performant ML predictors typically operate on many more than 5 variables (called "features" in the ML domain). Think rather of several hundred to a few thousand features, typically heavily non-linear transformations of the original time-series data.
There you go, if you are indeed willing to master ML for algorithmic trading.
You may like to read about state-of-the-art research in this direction:
[1] Mondrian Forests: Efficient Online Random Forests, arXiv:1406.2673v2 [stat.ML], 16 Feb 2015
[2] Mondrian Forests for Large-Scale Regression when Uncertainty Matters, arXiv:1506.03805v4 [stat.ML], 27 May 2016
You may also enjoy other posts on the subject: StackOverflow Algorithmic-Trading

Clustering non-numeric groups

I am trying to group together parts of a data set that I am working with. I have a group of individuals (agents) who work with a variety of different skills. The idea is to get the largest percentage of agents and skills represented.
So in a perfect scenario, I would get a sample of agents that accounts for 85-90% of the records, along with a group of skills that also represents 85-90% of the records. Basically, I want to obtain the largest possible sample without including small groups of agents that work with only a few skills, or skills that only a very small percentage of agents work with.
I am trying to find a more statistical approach to doing this and thought about clustering. But from my understanding, clustering requires a distance definition, and I am not sure that this data fits that requirement.
Below is a small sample of what the data looks like:
Agent  Skill
1      Claims
1      Benefits
2      Claims
2      -
3      Other

You are looking at the wrong tools for this problem.
What you are trying to do is a variant of the set cover problem, not clustering.
Except that you are not looking for a minimal cover, but for an approximate upper cover.
You'll need to decide when one solution is better than another. Your description of this is too vague: it allows the trivial solution of keeping everything, which gives 100% cover.
Then repeatedly try to either:
remove an agent
remove a skill
depending on what yields the best improvement.
But again, you need to have a formal quality criterion.
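A sketch of that greedy loop is below. The quality criterion is purely an assumption chosen to make the example concrete: the fraction of records kept, minus a fixed penalty for every agent and skill that remains. Any other formal criterion can be swapped in for the quality function; the records are invented.

```python
def coverage(records, agents, skills):
    """Fraction of records whose agent and skill are both still kept."""
    kept = sum(1 for agent, skill in records if agent in agents and skill in skills)
    return kept / len(records)

def quality(records, agents, skills, penalty):
    """Hypothetical criterion: reward coverage, penalize every kept item."""
    return coverage(records, agents, skills) - penalty * (len(agents) + len(skills))

def greedy_prune(records, penalty):
    """Repeatedly remove the agent or skill that improves quality the most."""
    agents = {agent for agent, _ in records}
    skills = {skill for _, skill in records}
    current = quality(records, agents, skills, penalty)
    while True:
        candidates = [
            (quality(records, agents - {a}, skills, penalty), "agent", a)
            for a in sorted(agents)
        ] + [
            (quality(records, agents, skills - {s}, penalty), "skill", s)
            for s in sorted(skills)
        ]
        if not candidates:
            break
        best_score, kind, item = max(candidates, key=lambda c: c[0])
        if best_score <= current:
            break                          # no removal improves the criterion
        if kind == "agent":
            agents.discard(item)
        else:
            skills.discard(item)
        current = best_score
    return agents, skills, coverage(records, agents, skills)

# Hypothetical (agent, skill) records.
records = [
    (1, "Claims"), (1, "Benefits"), (1, "Claims"),
    (2, "Claims"), (2, "Benefits"), (2, "Claims"),
    (3, "Claims"), (3, "Benefits"), (3, "Other"),
    (4, "Other"),
]
kept_agents, kept_skills, kept_share = greedy_prune(records, penalty=0.12)
print(kept_agents, kept_skills, kept_share)
```

With this toy data the loop first drops agent 4, then the skill "Other", and stops with 80% of the records covered. Being greedy, it can miss removals that only pay off jointly.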

Content based recommendation in scale

This question has probably been asked many times on blogs and Q&A websites, but I couldn't find a concrete answer yet.
I am trying to build a recommendation system for customers using only their purchase history.
Let's say my application has n products.
Compute item similarities for all n products based on their attributes (like country, type, price).
When a user needs a recommendation, loop over the products p previously purchased by user u and fetch the similar products (the similarities were computed in the previous step).
If I am right, this is called content-based recommendation, as opposed to collaborative filtering, since it doesn't involve co-occurrence of items or user preferences for an item.
My problem is multi-fold:
Is there any existing scalable ML platform that addresses content-based recommendation? (I am fine with adopting different technologies/languages.)
Is there a way to tweak Mahout to get this result?
Is classification a way to handle content-based recommendation?
Is it something that a graph database is good at solving?
Note: I looked at Mahout (since I am familiar with Java, and Mahout apparently utilizes Hadoop for distributed processing) for doing this at scale and for the advantage of having well-tested ML algorithms.
Your help is appreciated. Any examples would be really great. Thanks.

So-called item-item recommenders are natural candidates for precomputing the similarities, because the attributes of the items rarely change. I would suggest you precompute the similarity between each pair of items, perhaps storing only the top K for each item; if you have enough resources, you could load the similarity matrix into main memory for real-time recommendation.
Check out my answer to this question for a way to do this in Mahout: Does Mahout provide a way to determine similarity between content (for content-based recommendations)?
The example shows how to compute the textual similarity between the items and then load the precomputed values into main memory.
For a performance comparison of different data structures to hold the values, check out this question: Mahout precomputed Item-item similarity - slow recommendation
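Independently of Mahout, the precompute-then-look-up idea can be sketched in a few lines of plain Python. The similarity function here is a made-up average over the attributes mentioned in the question (country, type, price) and assumes positive prices; the product catalogue is invented, and the all-pairs loop is quadratic, so a real deployment would run it as an offline batch job.

```python
import heapq

def similarity(a, b):
    """Hypothetical attribute similarity in [0, 1] (assumes positive prices)."""
    country = 1.0 if a["country"] == b["country"] else 0.0
    kind = 1.0 if a["type"] == b["type"] else 0.0
    price = 1.0 - abs(a["price"] - b["price"]) / max(a["price"], b["price"])
    return (country + kind + price) / 3

def precompute_top_k(products, k):
    """Offline step: keep only the k most similar items for every item."""
    top = {}
    for pid, attrs in products.items():
        scored = (
            (similarity(attrs, other), oid)
            for oid, other in products.items()
            if oid != pid
        )
        top[pid] = heapq.nlargest(k, scored)   # list of (similarity, item id)
    return top

def recommend(purchased, top_k, n):
    """Online step: look up neighbours of purchased items, best first."""
    scores = {}
    for pid in purchased:
        for sim, oid in top_k.get(pid, []):
            if oid not in purchased:
                scores[oid] = max(scores.get(oid, 0.0), sim)
    return sorted(scores, key=lambda oid: (-scores[oid], oid))[:n]

# Hypothetical product catalogue.
products = {
    "p1": {"country": "France", "type": "wine", "price": 20.0},
    "p2": {"country": "France", "type": "wine", "price": 25.0},
    "p3": {"country": "Italy", "type": "wine", "price": 22.0},
    "p4": {"country": "France", "type": "cheese", "price": 8.0},
    "p5": {"country": "Italy", "type": "pasta", "price": 3.0},
}
top_k = precompute_top_k(products, k=2)
recommendations = recommend({"p1"}, top_k, n=2)
print(recommendations)
```

Only the top-K table has to be held in memory at serving time, which is what keeps the online step a cheap dictionary lookup.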
