How to detect outliers in two-dimensional arrays? - statistics

Given an array like:
[
{ final_amount: 20.0, shipping_amount: 5 },
{ final_amount: 30.0, shipping_amount: 5.5 },
{ final_amount: 25.0, shipping_amount: 105.5 },
{ final_amount: 325.0, shipping_amount: 125.5 }
]
How could I detect that
{ final_amount: 25.0, shipping_amount: 105.5 }
is an outlier?
A bigger final_amount generally means a bigger shipping_amount; however, we have some bad entries in our data set.
If I consider only shipping_amount (using the median and standard deviation), some valid entries are removed because final_amount is not taken into account.

The right way to approach a problem like this is to have a model of "normal" data and one or more models of "abnormal" data. Each model is a distribution p(data|category) for some category. Apply Bayes' rule to compute p(category|data) and choose among the categories, e.g. pick the category with the largest p(category|data). This is a pretty wide-open field, so good luck and have fun. You might also get more interest on stats.stackexchange.com.
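As a rough illustration of the Bayes' rule idea, here is a minimal sketch with two categories. Every number in it is an assumption made up for the example, not something derived from real data: the "normal" model says shipping_amount is roughly 0.4 times final_amount with Gaussian noise, the "abnormal" model says shipping_amount is uniform on [0, 200] regardless of final_amount, and the prior probability of a bad entry is 10%. In practice these parameters would have to be estimated from your own data.

```python
import math

# All parameters below are assumptions chosen for illustration only.
SLOPE = 0.4            # assumed shipping cost per unit of final_amount
SIGMA = 15.0           # assumed noise of "normal" entries
MAX_SHIPPING = 200.0   # "abnormal" entries: uniform on [0, MAX_SHIPPING]
PRIOR_ABNORMAL = 0.1   # assumed share of bad entries


def normal_pdf(x, mean, sigma):
    """Density of a Gaussian with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))


def p_abnormal(entry):
    """Posterior p(abnormal | entry) from Bayes' rule with two categories."""
    predicted = SLOPE * entry["final_amount"]
    like_normal = normal_pdf(entry["shipping_amount"], predicted, SIGMA)
    like_abnormal = 1.0 / MAX_SHIPPING
    weighted_abnormal = PRIOR_ABNORMAL * like_abnormal
    weighted_normal = (1.0 - PRIOR_ABNORMAL) * like_normal
    return weighted_abnormal / (weighted_abnormal + weighted_normal)


entries = [
    {"final_amount": 20.0, "shipping_amount": 5},
    {"final_amount": 30.0, "shipping_amount": 5.5},
    {"final_amount": 25.0, "shipping_amount": 105.5},
    {"final_amount": 325.0, "shipping_amount": 125.5},
]

# Pick the category with the larger posterior probability.
outliers = [e for e in entries if p_abnormal(e) > 0.5]
```

With these assumed parameters only the third entry ends up in `outliers`, because its shipping_amount is far from what the "normal" model predicts for its final_amount.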

Related

combine a list of dictionaries with one key value match

listofdicts = [
    {
        "if-e0": "e0",
        "ip-add-e0": "192.168.1.1",
        "name": "host1"
    },
    {
        "if-e1": "e1",
        "ip-add-e1": "192.168.2.1",
        "name": "host1"
    },
    {
        "if-e1": "e1",
        "ip-add-e1": "172.16.1.1",
        "name": "host2"
    },
    {
        "if-e2": "e2",
        "ip-add-e2": "172.16.2.1",
        "name": "host2"
    }]
Expected Result:
listofdicts = [
    {
        "if-e0": "e0",
        "ip-add-e0": "192.168.1.1",
        "if-e1": "e1",
        "ip-add-e1": "192.168.2.1",
        "name": "host1"
    },
    {
        "if-e1": "e1",
        "ip-add-e1": "172.16.1.1",
        "if-e2": "e2",
        "ip-add-e2": "172.16.2.1",
        "name": "host2"
    }]
I have been trying to make this work but have had no luck yet. The actual list has more than 60K dicts with unique and matching hosts.
It may be easy to solve, but for me it has been a nightmare for the past few hours.
I appreciate your assistance.
Regards,
Avinash
Graph theory seems to be helpful here.
To solve this, you need to build a graph, where each vertex relates to one dictionary from your input list.
There should be an edge between two vertices if the corresponding dictionaries share a key-value pair. More specifically, for dictionaries d1 and d2 there should be an edge if len(set(d1.items()).intersection(d2.items())) != 0 or, more simply, if set(d1.items()).intersection(d2.items()) is non-empty. The condition means there is at least one key-value pair in the intersection of the items of d1 and d2.
After the graph is built, you need to find all of its connected components (a simple DFS (depth-first search) does this; you can look it up if you are not familiar with graph algorithms). The dictionaries of each connected component should be combined into one, so there is one resulting dictionary per component. The list of these resulting dictionaries is your answer.
Here is an example of how you combine some dictionaries:
connectivity_component_dicts = [{...}, ...]  # placeholder: the dicts of one component
resulting_dict = {}
for d in connectivity_component_dicts:
    resulting_dict.update(d)
# Note that the greater the index of `d` in `connectivity_component_dicts`,
# the higher priority its keys have if the same key appears in several dicts.
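The graph approach described above could be sketched roughly as follows. The function name and the optional `keys` parameter are hypothetical additions for this example. Instead of comparing every pair of dictionaries, the sketch indexes dictionaries by their key-value pairs, which should scale better for 60K entries; it assumes all values are hashable. Note that in the sample data the pair "if-e1": "e1" appears under both host1 and host2, so linking on any shared pair would merge all four dictionaries; restricting the edges to the "name" key gives the expected result.

```python
from collections import defaultdict


def merge_connected_dicts(dicts, keys=None):
    """Merge dicts that (transitively) share a key-value pair.

    If `keys` is given, only pairs with those keys create edges.
    """
    # Map each (key, value) pair to the indices of the dicts containing it.
    pair_to_indices = defaultdict(list)
    for index, d in enumerate(dicts):
        for key, value in d.items():
            if keys is None or key in keys:
                pair_to_indices[(key, value)].append(index)

    # Chain together the dicts sharing a pair; a chain is enough to put
    # them all in the same connected component.
    neighbours = defaultdict(set)
    for indices in pair_to_indices.values():
        for a, b in zip(indices, indices[1:]):
            neighbours[a].add(b)
            neighbours[b].add(a)

    seen = set()
    merged = []
    for start in range(len(dicts)):
        if start in seen:
            continue
        # Iterative depth-first search collecting one component.
        seen.add(start)
        stack = [start]
        component = []
        while stack:
            node = stack.pop()
            component.append(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        # Later dicts (higher index) win when keys collide.
        combined = {}
        for index in sorted(component):
            combined.update(dicts[index])
        merged.append(combined)
    return merged


sample = [
    {"if-e0": "e0", "ip-add-e0": "192.168.1.1", "name": "host1"},
    {"if-e1": "e1", "ip-add-e1": "192.168.2.1", "name": "host1"},
    {"if-e1": "e1", "ip-add-e1": "172.16.1.1", "name": "host2"},
    {"if-e2": "e2", "ip-add-e2": "172.16.2.1", "name": "host2"},
]
by_name = merge_connected_dicts(sample, keys={"name"})
by_any_pair = merge_connected_dicts(sample)
```

Here `by_name` contains one dictionary per host, while `by_any_pair` collapses everything into a single dictionary because of the shared "if-e1" pair.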
@Kolay.Ne Hi, hey guys,
It did work, with a very basic catch. The graph method is a fantastic way to solve it, although I used the approach below and that worked:
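import copy

for d in listofdicts:
    x = listofdicts.index(d)
    for y in range(len(listofdicts)):
        k = 'name'
        # The list shrinks while we loop, so re-check the upper bound.
        if y != x and y < len(listofdicts):
            if listofdicts[x][k] == listofdicts[y][k]:
                # Fold the matching dict into the current one, then drop it.
                dc = copy.deepcopy(listofdicts[y])
                listofdicts[x].update(dc)
                listofdicts.remove(dc)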
There could be other approaches to solve it; I'm sure the Pythonic way would be just a couple of lines, but this solved my problem for the job at hand.
Thank you to Kolay.Ne for responding quickly and trying to assist. The graph method is fantastic as well; it requires more professional coding and will surely be more scalable.
a = []
for i in listofdicts:
    if i["name"] not in a:
        a.append(i["name"])
        print(i)
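For the specific case where dictionaries only need to be merged when their "name" matches, a single pass with a lookup dictionary may be enough. This is a minimal sketch, not taken from any of the answers above; the function name `merge_by_key` is hypothetical, and it assumes every dictionary has the key.

```python
def merge_by_key(dicts, key="name"):
    """Merge all dicts that share the same value for `key`, in one pass."""
    merged = {}
    for d in dicts:
        # Create the combined dict on first sight of this name, then fold in
        # the current dict; later entries overwrite earlier ones on conflict.
        merged.setdefault(d[key], {}).update(d)
    return list(merged.values())


sample = [
    {"if-e0": "e0", "ip-add-e0": "192.168.1.1", "name": "host1"},
    {"if-e1": "e1", "ip-add-e1": "192.168.2.1", "name": "host1"},
    {"if-e1": "e1", "ip-add-e1": "172.16.1.1", "name": "host2"},
    {"if-e2": "e2", "ip-add-e2": "172.16.2.1", "name": "host2"},
]
result = merge_by_key(sample)
```

Unlike the nested-loop version, this does not modify the input list while iterating over it, and it visits each dictionary once.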

Precise anchoring of shadows on entities

I have a question and am hoping for your advice. In my project I am working on the different entities that will be present; one of them is the "TreeEntity". It is visibly composed of four parts: "Leaves, Shadow, Stump, Trunk". All four parts have only one id and are randomly chosen when a new "TreeEntity" is created. In addition, leaves and shadow each have four different states, based on the age of the tree. The shadow will have to rotate around the trunk based on the time. Since I would like to make the engine of my project as universal as possible, so that it can also be used in other projects or by other people, I am facing a dead end.
I have this result at the moment (premise, all the images used are for testing purposes only):
As can be seen, the shadow is very detached from the trunk. Theoretically, this could be fixed by modifying the image used:
Or by writing a huge dictionary with precise anchors for each image that makes up the leaves, shadow, stump and trunk.
But doing so would be detrimental to my goal. So my question is whether, in pyglet or otherwise, it is possible to choose an anchor point that puts the shadow in the right place regardless of the shadow used (which in this case is random). Or do I have to modify the image or create a dictionary?
My question arises from the fact that I have several times seen images, for example of shadows, like the one I inserted here, and in those cases the shadows were in their correct place. The same game from which I took the pictures for testing, and from which I borrow one or two mechanics, has the shadows placed in the right place. So I sincerely don't know which way to go.
At the moment, all four parts have a universal anchor dictated by a small dictionary:
__statdict = {
    "leaves": {
        "cell": (3, 1),
        "anchor_x": "center",
        "anchor_y": "bottom"
    },
    "shadow": {
        "cell": (4, 1),
        "anchor_x": "left",
        "anchor_y": "bottom"
    },
    "stump": {
        "cell": (1, 1),
        "anchor_x": "center",
        "anchor_y": "bottom"
    },
    "trunk": {
        "cell": (1, 1),
        "anchor_x": "center",
        "anchor_y": "bottom"
    }
}
And when loading resources, anchor is assigned:
if self.__statdict[name]["anchor_x"] == "center":
    ancx = texture.width // 2
elif self.__statdict[name]["anchor_x"] == "right":
    ancx = texture.width
else:
    ancx = 0
if self.__statdict[name]["anchor_y"] == "center":
    ancy = texture.height // 2
elif self.__statdict[name]["anchor_y"] == "top":
    ancy = texture.height
else:
    ancy = 0
texture.anchor_x = ancx
texture.anchor_y = ancy
I have tried to be as complete as possible. Thanks for your help. Maybe I'm having problems for nothing and the solution is simpler than I thought, so I apologize if that is the case.
Update (1):
Is it possible, via pyglet or a companion module, given the direction of a sprite (in this case to the right), to find the position of the first pixel that is not transparent? Something like this: "The direction of the SpriteShadow is to the right. The first pixel whose RGB value has an alpha equal to 255 is at 20 (relative to the width of the sprite). Then I move the SpriteShadow 20 pixels to the left with respect to the general anchor."
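The scan described in the update could look roughly like the sketch below. It works on raw RGBA bytes (4 bytes per pixel, rows stored one after another) and does not depend on pyglet; the function name and the tiny test image are hypothetical. With pyglet, such bytes can presumably be obtained with something like `texture.get_image_data().get_data('RGBA', texture.width * 4)`, but treat that call as an assumption to check against your pyglet version.

```python
def first_opaque_column(rgba, width, height, min_alpha=255):
    """Return the x index of the leftmost column that contains a pixel
    with alpha >= min_alpha, or None if there is no such pixel."""
    pitch = width * 4  # bytes per row, assuming tightly packed RGBA
    for x in range(width):
        for y in range(height):
            alpha = rgba[y * pitch + x * 4 + 3]
            if alpha >= min_alpha:
                return x
    return None


# Hypothetical 4x2 image: fully transparent except one opaque pixel at x=2, y=1.
width, height = 4, 2
pixels = bytearray(width * height * 4)
pixels[(1 * width + 2) * 4 + 3] = 255
offset = first_opaque_column(bytes(pixels), width, height)
```

The returned offset could then be used as `anchor_x` for a shadow that extends to the right, so that its first visible pixel sits on the trunk. Scanning every pixel in Python is slow for large textures, so it is probably best done once at load time.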

Redundant query trigger when creating a graph?

Whenever I try to create a new graph with 700,000 to 2 million edges, it takes a long time. Thanks to the great new feature in the API
/_api/query/current
I observed that the graph creation possibly triggers some kind of cache loading automatically, but twice?
[
  {
    "id": "70",
    "query": "FOR x IN GRAPH_VERTICES(#graph, {}) SORT RAND() LIMIT #limit RETURN x",
    "started": "2015-03-31T19:06:59Z",
    "runTime": 41.95919394493103
  },
  {
    "id": "71",
    "query": "FOR x IN GRAPH_VERTICES(#graph, {}) SORT RAND() LIMIT #limit RETURN x",
    "started": "2015-03-31T19:06:59Z",
    "runTime": 41.95719385147095
  }
]
Is this correct? Is there a more efficient way?
Thanks in advance!
The graph viewer issued the mentioned RAND() query two times:
- one instance is fired to determine a random vertex from the graph
- the other instance is fired to determine the attributes of some random vertices of the graph, in order to populate the search input field
The AQL that was used by the graph viewer was inefficient. It built a big list, sorted it randomly and returned 1 (first query) or 10 (second query) documents from it. This has been fixed in commit c28575f202a58d5c93e6c36883effda48c2a7159, so it is much more efficient now.
The fix will be included in the next build (i.e. 2.5.2).
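The cost difference can be illustrated outside of AQL. The following Python sketch is not the code from the commit; it only contrasts randomly ordering a whole list to keep a few elements with sampling those few elements directly. The vertex ids and function names are hypothetical.

```python
import random


def pick_by_full_shuffle(vertices, limit, rng):
    # Same shape as "SORT RAND() LIMIT n": every vertex gets a random sort
    # key and the whole list is sorted, just to keep `limit` items.
    shuffled = sorted(vertices, key=lambda _: rng.random())
    return shuffled[:limit]


def pick_by_sampling(vertices, limit, rng):
    # Draws `limit` distinct items directly, without ordering the rest.
    return rng.sample(vertices, min(limit, len(vertices)))


vertices = [f"v/{i}" for i in range(100_000)]  # hypothetical vertex ids
rng = random.Random(0)
one = pick_by_sampling(vertices, 1, rng)
ten = pick_by_sampling(vertices, 10, rng)
slow_ten = pick_by_full_shuffle(vertices, 10, rng)
```

Both functions return a random selection, but the first does work proportional to sorting the entire vertex list, which is what made the original queries take so long on large graphs.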

Flot data serie with specific color per data element?

Being new to Flot, I am struggling a bit. My goal is to present a bar with different data elements in it, each of which must have a different color. I want to provide the color per data element.
Any hints on how this can be done?
Example:
[0,100][0,200][0,100][0,200]
All elements with value 100 should be blue and all elements with 200 should be green.
A nice format would be:
[0,100,blue][0,200,green][0,100,blue][0,200,green]
But this of course does not work; it is just an illustration of what I want to achieve!
Doing this with multiple data series does not seem to work in my case.
Any hints on how this can be done?
You can do it with multiple series; this is how I have done something like this :)
So in your plot method you would have something like this:
$.plot($('#placeholder'), []); // The array would hold your data
You can also extend this further by passing some options to Flot to tell it what your chart should look like. To do that, you pass an object of options after the array.
However, I'm not quite sure what you need in terms of the graph. Your example gave some numbers, but the x coordinate was always 0, so I'm not sure if you just want a bar graph?
Anyway, here's the code showing how to get it to display a bar graph, with one green line and one blue line:
var flotOptions = {
    series: {
        // Thick, unfilled lines drawn vertically act as bars
        lines: { show: true, fill: false, lineWidth: 15 }
    }
};
var data = [{
    color: '#001EFF', // blue
    data: [[0, 0], [0, 100]]
}, {
    color: '#00FF0E', // green
    data: [[5, 0], [5, 200]]
}];
$.plot($('#placeholder'), data, flotOptions);
I would recommend generating the data automatically on the server side; then you can check it in JavaScript and add the color depending on the y value in the data array of each series.
I have also created a jsFiddle of this for you, so you can take a look and play around with it. As I said, I'm not quite sure what you want, but this is a good start for you. Good luck!!
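If the series are generated on the server side, the grouping by value could be done there as well. The sketch below is in Python and is only one possible way to do it: the value-to-color mapping comes from the question's example, while the function name, the default color and the choice of using the element's index as the x coordinate are assumptions. The output is a list of objects with `color` and `data` fields, the same shape as the `data` array passed to `$.plot` above.

```python
import json

# Mapping from y value to color, following the example in the question.
COLOR_BY_VALUE = {100: "blue", 200: "green"}


def build_flot_series(values, color_by_value=COLOR_BY_VALUE, default_color="gray"):
    """Group values into one series per color.

    The x coordinate is the position of the value in the input list,
    which is an assumption made for this sketch.
    """
    series_by_color = {}
    for x, y in enumerate(values):
        color = color_by_value.get(y, default_color)
        series = series_by_color.setdefault(color, {"color": color, "data": []})
        series["data"].append([x, y])
    return list(series_by_color.values())


series = build_flot_series([100, 200, 100, 200])
payload = json.dumps(series)  # ready to be sent to the page and given to $.plot
```

Each distinct color becomes one series, so the number of series stays small even when there are many data elements.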

ElasticSearch default scoring mechanism

What I am looking for is a plain, clear explanation of how the default scoring mechanism of ElasticSearch (Lucene) really works. I mean, does it use Lucene scoring, or does it use scoring of its own?
For example, I want to search for documents by the "Name" field. I use the .NET NEST client to write my queries. Let's consider this type of query:
IQueryResponse<SomeEntity> queryResult = client.Search<SomeEntity>(s =>
    s.From(0)
        .Size(300)
        .Explain()
        .Query(q => q.Match(a => a.OnField(q.Resolve(f => f.Name)).QueryString("ExampleName")))
);
which is translated to the following JSON query:
{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}
The search is performed on about 1.1 million documents. What I get in return is (this is only part of the result, formatted on my own):
650 "ExampleName" 7,313398
651 "ExampleName" 7,313398
652 "ExampleName" 7,313398
653 "ExampleName" 7,239194
654 "ExampleName" 7,239194
860 "ExampleName of Something" 4,5708737
where the first field is just an Id, the second is the Name field on which ElasticSearch performed its search, and the third is the score.
As you can see, there are many duplicates in the ES index. Since some of the documents found have different scores even though they are exactly the same (with only a different Id), I concluded that different shards performed the search on different parts of the whole dataset. This leads me to think that the score is based in part on the overall data in a given shard, not exclusively on the document actually being considered by the search engine.
The question is, how exactly does this scoring work? I mean, could you tell me, show me or point me to the exact formula used to calculate the score for each document found by ES? And finally, how can this scoring mechanism be changed?
The default scoring is the DefaultSimilarity algorithm in core Lucene, largely documented here. You can customize scoring by configuring your own Similarity, or using something like a custom_score query.
The odd score variation in the first five results shown seems small enough that it doesn't concern me much as far as the validity of the query results and their ordering goes, but if you want to understand the cause of it, the explain API can show you exactly what is going on there.
The score variation is based on the data in a given shard (as you suspected). By default ES uses a search type called 'query then fetch', which sends the query to each shard and finds all the matching documents, scoring them with local TF/IDF statistics (these vary with the data on a given shard, which is your problem).
You can change this by using the 'dfs query then fetch' search type, which first pre-queries each shard for its term and document frequencies and then sends the query to each shard, etc.
You can set it in the URL (the host below assumes a local node on the default port 9200):
$ curl -XGET 'localhost:9200/index/type/_search?pretty=true&search_type=dfs_query_then_fetch' -d '{
  "from": 0,
  "size": 300,
  "explain": true,
  "query": {
    "match": {
      "Name": {
        "query": "ExampleName"
      }
    }
  }
}'
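To see why identical documents can score differently across shards, it helps to look at the inverse document frequency part of the score alone. The sketch below uses the IDF formula of Lucene's classic TF/IDF similarity, 1 + ln(numDocs / (docFreq + 1)); the shard names and statistics are hypothetical, and the full score also involves term frequency and normalization factors that are left out here.

```python
import math


def idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF/IDF similarity:
    1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-shard statistics for the searched term:
# (documents containing the term, total documents in the shard).
shards = {"shard_0": (120, 550_000), "shard_1": (150, 550_000)}

# 'query then fetch': each shard scores with its own local statistics.
local_idf = {name: idf(df, n) for name, (df, n) in shards.items()}

# 'dfs query then fetch': statistics are combined first, so every shard
# scores with the same global value.
global_df = sum(df for df, _ in shards.values())
global_n = sum(n for _, n in shards.values())
global_idf = idf(global_df, global_n)
```

With local statistics the same document gets a slightly higher IDF on the shard where the term is rarer, which matches the small score differences seen in the question; with the combined statistics there is a single IDF for all shards.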
There is a great explanation in the Elasticsearch documentation:
What is relevance:
https://www.elastic.co/guide/en/elasticsearch/guide/current/relevance-intro.html
Theory behind relevance scoring:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html

Resources