Clustering non-numeric groups - statistics

I am trying to group together parts of a data set that I am working with. I have a group of individuals that work with a variety of different skills. The idea is to get the largest pct of agents and skills represented.
So in a perfect scenario, it would be nice to get a sample of agents that comprise 85-90% of the records along with a group of skills that represent 85-90% of records too. Basically, I want to obtain the largest percent sample without having small groups of agents that work with only a few skills or have skills that only a very small pct of agents work with.
I am trying to find a more statistical approach to doing this and thought about clustering. But from my understanding, clustering requires a distance definition. I am not sure that this data would fit this requirement.
Below is a small sample of what the data looks like:
Agent Skill
1 Claims
1 Benefits
2 Claims
2 -
3 Other

You are looking at the wrong tools for this problem.
What you are trying to do is a variant of the set cover problem, not clustering.
Except that you are not looking for a minimal cover, but an approximate upper cover.
You'll need to decide when a solution is better than another. Your description of this is too vague - it allows the trivial solution of keeping everything: 100% cover.
Then repeatedly try to either:
remove an agent
remove a skill
depending on what yields the best improvement.
But again, you need to have a formal quality criterion.
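The removal loop described above can be sketched as follows. This is only an illustration under an assumed quality criterion that is not in the original post: a selection is acceptable while it still covers at least `target` of the records, and among acceptable selections fewer agents and skills is better. The function names and the sample records are hypothetical.

```python
def coverage(records, agents, skills):
    """Fraction of records whose agent AND skill are both kept."""
    kept = sum(1 for a, s in records if a in agents and s in skills)
    return kept / len(records)


def greedy_trim(records, target=0.85):
    """Repeatedly remove the agent or skill that costs the least coverage.

    Assumed criterion (hypothetical): stop as soon as the best possible
    removal would push coverage below `target`.
    """
    agents = {a for a, _ in records}
    skills = {s for _, s in records}
    while True:
        candidates = []
        if len(agents) > 1:  # never remove the last agent
            candidates += [(agents - {a}, skills) for a in sorted(agents, key=str)]
        if len(skills) > 1:  # never remove the last skill
            candidates += [(agents, skills - {s}) for s in sorted(skills, key=str)]
        if not candidates:
            break
        best = max(candidates, key=lambda c: coverage(records, *c))
        if coverage(records, *best) < target:
            break
        agents, skills = best
    return agents, skills


# Made-up (agent, skill) records, 100 in total.
records = (
    [("A", "Claims")] * 40 + [("A", "Benefits")] * 20
    + [("B", "Claims")] * 30 + [("B", "Other")] * 5
    + [("C", "Other")] * 5
)
agents, skills = greedy_trim(records, target=0.85)
final_coverage = coverage(records, agents, skills)
```

A greedy loop like this is not guaranteed to find the best selection; it only shows where the formal criterion plugs in.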


How to make a score from multiple KPIs

I am wondering if it is possible to create a global score using multiple KPIs with different scales.
Example:
I would like to combine all these KPIs into one score that tells me which version is better. Is it possible? (I give the 3 KPIs the same weight in the score.)
There is quite a lot of theory on (credit) rating methods, which provides a solid mathematical basis for what you are after. You might start by reading about scorecards in general. A common way of combining different scores uses the logit function.
The short answer to your question: there is no single best way to combine three KPIs. You have to try different formulas and decide on one of them based on statistical tests in a validation step.
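As one candidate formula among many, here is a minimal sketch that rescales each KPI to the range 0 to 1 across versions (min-max normalization) and then averages them with equal weights, as the question asks. The KPI names, the values, and the choice of min-max scaling are assumptions made for illustration; they do not come from the question.

```python
def combined_score(kpis, higher_is_better):
    """Equal-weight average of min-max normalized KPIs.

    kpis: {version: {kpi_name: value}}
    higher_is_better: {kpi_name: bool}, so that "lower is better"
    KPIs (e.g. a load time) can be flipped before averaging.
    """
    names = list(higher_is_better)
    scores = {version: 0.0 for version in kpis}
    for name in names:
        values = [kpis[v][name] for v in kpis]
        lo, hi = min(values), max(values)
        for v in kpis:
            # If all versions tie on this KPI, give everyone the midpoint.
            x = 0.5 if hi == lo else (kpis[v][name] - lo) / (hi - lo)
            if not higher_is_better[name]:
                x = 1.0 - x
            scores[v] += x / len(names)
    return scores


# Hypothetical KPI values for two versions.
versions = {
    "v1": {"conversion_pct": 2.0, "load_time_s": 1.0, "revenue": 100.0},
    "v2": {"conversion_pct": 3.0, "load_time_s": 3.0, "revenue": 150.0},
}
direction = {"conversion_pct": True, "load_time_s": False, "revenue": True}
scores = combined_score(versions, direction)
```

Min-max scaling is sensitive to outliers and to which versions are in the comparison, which is one reason a validation step is still needed before settling on a formula.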
Further reading
Using a Balanced Scorecard to Measure Your Key Performance Indicators - a brief primer on the topic
Chapter on Logit from the book Stefan Trueck, Svetlozar T. Rachev: Rating Based Modeling of Credit Risk: Theory and Application of Migration
Guidelines on Credit Risk Management - OeNB as PDF

Hypothesis search tree

I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? And what does the search tree that Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations where I test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the maximum number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, with strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!

How can I determine the best data structure/implementation for my dataset?

Preface: I'm a self-taught coder, so a lot of my knowledge is limited to my research. I'm hoping to have other opinions as I want to build things right the first time. I need help with determining an appropriate solution and how to implement the solution.
I'm looking to build a least cost alternative model (essentially a shortest path) for delivering between locations (nodes), based on different modes of transportation (vehicles) and the different roads taken (paths). Another consideration is the product price (value) to determine the least cost path.
Here are my important data items:
nodes: cities where the product will travel to and from.
paths: roads have different costs, depending on the road.
vehicles: different vehicles have different rental costs when transporting (motorbike, car, truck). Note that the cost of a vehicle is not constant; it is highly dependent on the to/from nodes. For example, using a car to go from city A to city B will have a different cost than using a car to go from city B to A or city A to city C.
value: Product value. Again, a product's value is highly dependent on its destination node. The same product can have a different value at City A, B or C.
Problem Statement
How do I set up a data structure to best determine the least cost path for getting a product from one location to every other location?
Possible Solutions
From my research, I believe a weighted graph data structure in combination with Dijkstra's algorithm would be most suitable for my situation. I believe it is essential to break the problem down into simpler parts, starting with a simple weighted graph of only nodes and paths.
From there, I would add the vehicle cost and the product value considerations. Perhaps just adding the two values as a cost to "visit" a node (i.e. incorporating them into the path cost)?
Thoughts on my current solution? Other considerations I overlooked? Perhaps a better solution?
Implementation
I'd love to be able to build this within Excel VBA (as that is how I learned how to code) and Excel is what I use for my tools. Would VBA be too limited in this task? How else can I incorporate my analysis with Excel with another language?
Try the book Practical Management Science by Winston & Albright and check out the chapter on Operations Management; it explains lots of models, from the simple ones onwards. Available online as a PDF: http://ingenieria-industrial.net/downloads/practicalmanagementscience.pdf
VBA is more of a scripting language than a full-fledged one (and, unlike VB.NET, it does not run on the .NET framework). Why don't you give C++ or Java a shot? If you intuitively understand the data structure and the algorithm, then coding it in these will be a breeze. Chapter 4 of Algorithms by Sedgewick and Wayne has a beautiful explanation of shortest paths. You may also consider studying the Bellman-Ford algorithm if you foresee any negative edge weights: unlike Dijkstra's algorithm it handles them, and it can detect negative-weight cycles.
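The weighted-graph idea from the question can be sketched in a few lines. This is a minimal illustration, shown in Python rather than VBA, C++ or Java, under these assumptions: edges are directed (so A to B can cost something different from B to A), each edge weight is the road cost plus the cheapest vehicle rental for that specific from/to pair, and product value is left out. All city names and costs are made up.

```python
import heapq


def build_graph(road_cost, vehicle_cost):
    """Directed graph: weight = road cost + cheapest vehicle for that edge."""
    graph = {}
    for (src, dst), road in road_cost.items():
        cheapest = min(vehicle_cost[(src, dst)].values())
        graph.setdefault(src, []).append((dst, road + cheapest))
        graph.setdefault(dst, [])  # make sure every node has an entry
    return graph


def dijkstra(graph, source):
    """Least total cost from `source` to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, weight in graph[node]:
            nd = d + weight
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return dist


# Hypothetical costs, keyed by (from_city, to_city).
road_cost = {("A", "B"): 4, ("B", "A"): 4, ("A", "C"): 10, ("B", "C"): 3}
vehicle_cost = {
    ("A", "B"): {"car": 2, "truck": 5},
    ("B", "A"): {"car": 3, "truck": 5},
    ("A", "C"): {"car": 1, "truck": 6},
    ("B", "C"): {"car": 1, "truck": 2},
}
dist = dijkstra(build_graph(road_cost, vehicle_cost), "A")
```

Here the direct road A to C costs 11, but going via B costs 6 + 4 = 10, so `dist["C"]` is 10. Folding the vehicle cost into the edge weight only works if the vehicle can change at every city; otherwise the vehicle has to become part of the node state.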

Test multiple algorithms in one experiment

Is there any way to test multiple algorithms in one go, rather than running each algorithm separately and then checking the result? There are a lot of times where I don’t really know which one to use, so I would like to test several and get the result (error rate) fairly quickly in Azure Machine Learning Studio.
You could connect the scores of multiple algorithms to an 'Evaluate Model' module to evaluate the algorithms against each other.
Hope this helps.
The module you are looking for is called “Cross-Validate Model”. It splits whatever comes in from the input port (the dataset) into 10 folds, holds each fold out in turn as the “answer”, trains a model on the other nine, and returns a set of accuracy statistics measured against the held-out fold. What you would look at is the column called “Mean absolute error”, which is the average error for the trained models. You can connect whatever algorithm you want to one of the ports, and you will then get the result for that algorithm when you right-click the output port that gives the score.
After that you can assess which algorithm did best. As a pro tip, you could use the Filter Based Feature Selection module to see which columns had a significant impact on the result.
You can check section 6.2.4 of hands-on-lab at GitHub https://github.com/Azure-Readiness/hol-azure-machine-learning/blob/master/006-lab-model-evaluation.md which focuses on the evaluation of multiple algorithms etc.
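Azure Machine Learning Studio does this graphically, so the following is not its API. It is only a plain-Python sketch of the underlying idea: run the same 10-fold cross-validation for several algorithms and compare their mean absolute error. The two toy "algorithms" (predict the mean, fit a straight line) and the data are hypothetical stand-ins.

```python
def k_fold_mae(xs, ys, fit, k=10):
    """Average mean-absolute-error over k folds for one algorithm."""
    fold_errors = []
    for i in range(k):
        test_idx = set(range(i, len(xs), k))  # every k-th point is held out
        pairs = list(enumerate(zip(xs, ys)))
        train = [p for j, p in pairs if j not in test_idx]
        test = [p for j, p in pairs if j in test_idx]
        predict = fit(train)
        fold_errors.append(sum(abs(predict(x) - y) for x, y in test) / len(test))
    return sum(fold_errors) / k


def fit_mean(train):
    """Baseline: always predict the mean of the training targets."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Least-squares straight line through the training points."""
    n = len(train)
    mx = sum(x for x, _ in train) / n
    my = sum(y for _, y in train) / n
    sxx = sum((x - mx) ** 2 for x, _ in train)
    slope = sum((x - mx) * (y - my) for x, y in train) / sxx
    return lambda x: my + slope * (x - mx)


# Made-up data: roughly y = 2x + 1 with a small alternating offset.
xs = list(range(20))
ys = [2 * x + 1 + (0.5 if x % 2 else -0.5) for x in xs]

algorithms = {"mean": fit_mean, "line": fit_line}
results = {name: k_fold_mae(xs, ys, fit) for name, fit in algorithms.items()}
best = min(results, key=results.get)
```

Every algorithm sees exactly the same folds, which is what makes the error figures comparable.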

How to estimate search application's efficiency?

I hope it belongs here.
Can anyone please tell me whether there is any method to compare different search applications working in the same domain with the same dataset?
The problem is that they are quite different: one is a web application that looks up a database where items are grouped in categories, and the other is a rich client that searches by keywords.
Are there any standard test guides for that purpose?
There are testing methods. You may use, e.g., precision/recall or the F-beta score to compute a rate that expresses the "efficiency". However, you need to build a reference set yourself. That means you are not really measuring efficiency in the domain, but rather efficiency relative to your own judgement.
All the more, you need to make sure that your reference set is representative of the data you have.
In most cases, common sense will also give you the result.
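For one query, the measures mentioned above can be computed like this. It is a minimal sketch: the item IDs, the reference set, and the two applications' result lists are invented for illustration, and in practice the scores would be averaged over many queries.

```python
def precision_recall_fbeta(returned, relevant, beta=1.0):
    """Precision, recall and F-beta of one result list against a reference set."""
    returned, relevant = set(returned), set(relevant)
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    # beta > 1 weights recall more heavily, beta < 1 weights precision more.
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Hand-built reference set: the items a human judged relevant for the query.
relevant = {1, 2, 3, 4}

# Hypothetical results from the two applications for the same query.
app_a = precision_recall_fbeta([1, 2, 5], relevant)
app_b = precision_recall_fbeta([1, 2, 3, 6, 7, 8], relevant)
```

In this made-up case application A is more precise (2 of its 3 results are relevant) while application B finds more of the relevant items, and the F1 score comes out slightly in favour of B.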
If you want to measure performance in terms of speed, you need to formulate a set of assumed queries and send them to your search engine at a given rate. That's doable with any common load-testing tool.
