Search selection - search

For a C# program that I am writing, I need to compare similarities in two entities (can be documents, animals, or almost anything).
Based on certain properties, I calculate the similarities between the documents (or entities).
I put their similarities in a table as below
  |  X  |  Y  |  Z
A | 0.6 | 0.5 | 0.4
B | 0.6 | 0.4 | 0.2
C | 0.6 | 0.3 | 0.6
I want to find the best matching pairs (e.g. AX, BY, CZ) based on the highest similarity score; a higher score indicates higher similarity.
My problem arises when there is a tie between similarity values. For example, AX and CZ both have 0.6. How do I decide which pairs to select? Are there any procedures/theories for this kind of problem?
Thanks.

In general, tie-breaking methods are going to depend on the context of the problem. In some cases, you want to report all the tying results. In other situations, you can use an arbitrary means of selection such as which one is alphabetically first. Finally, you may choose to have a secondary characteristic which is only evaluated in the case of a tie in the primary characteristic.
Additionally, you can always report one or more results and then alert the user that there was a tie, letting them decide for themselves.
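For illustration, here is a minimal sketch of the "secondary characteristic, then alphabetical" tie-break in Python (the question itself is about C#); the similarity and secondary scores are placeholders for whatever your own calculation produces:

```python
# Hypothetical scores: the primary similarity and a secondary property used
# only to break ties; names and values are made up for illustration.
similarity = {
    ("A", "X"): 0.6, ("A", "Y"): 0.5, ("A", "Z"): 0.4,
    ("B", "X"): 0.6, ("B", "Y"): 0.4, ("B", "Z"): 0.2,
    ("C", "X"): 0.6, ("C", "Y"): 0.3, ("C", "Z"): 0.6,
}
secondary = {pair: 0.0 for pair in similarity}  # e.g. a second similarity measure

# Sort by primary score (descending), then secondary score (descending),
# then alphabetically, so ties are always resolved deterministically.
ranked = sorted(similarity, key=lambda p: (-similarity[p], -secondary[p], p))
print(ranked[0])   # best pair after tie-breaking
```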

In this case, the similarities you should be looking for are:
- Value
- Row
- Column
Objects which have any of the above in common are "similar". You could assign a weighting to each property, so that objects which have the same value are more similar than objects which are in the same column. Also, objects which have the same value and are in the same column are more similar than objects with just the same value.
Depending on whether there are any natural ranges occurring in your data, you could also consider comparing ranges. For example two numbers in the range 0-0.5 might be somewhat similar.
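As a hedged sketch of the weighting idea above (in Python; the weights are arbitrary placeholders you would tune yourself):

```python
# Each table entry is treated as a (row, column, value) triple; the weights
# are illustrative, chosen so that sharing a value counts for more than
# sharing a row or a column.
W_VALUE, W_ROW, W_COLUMN = 3.0, 1.0, 1.0

def cell_similarity(a, b):
    score = 0.0
    if a[2] == b[2]:
        score += W_VALUE    # same value: strongest signal
    if a[0] == b[0]:
        score += W_ROW      # same row
    if a[1] == b[1]:
        score += W_COLUMN   # same column
    return score

# AX and CZ share only the value 0.6, so they score W_VALUE = 3.0.
print(cell_similarity(("A", "X", 0.6), ("C", "Z", 0.6)))
```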

Related

Smartest way to filter a list using a different list

I have two lists. One of them is essentially representing keys (dates), the other the values.
I really just need the values themselves, but I want to get all values that lie between two dates. And optimally, I'd also like to use a certain sampling frequency to, say, get the values for all first days of the week between my two dates (ie sampling every 7th day).
I can easily filter my dates between two dates by calling .filter(e => e > start && e < end), combine that with my prices array into its own object, and then map it or something.
But since I'll be running this on large datasets in AWS, I'd need to be quite efficient with the way I do this. What would be the computationally least expensive algorithm to achieve what I want?
The best way would probably be a simple for loop, or actually, probably a binary search, but is there a less ugly way of doing it? I really enjoy chaining stream operations.
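A minimal sketch of the binary-search idea mentioned above, written in Python with bisect (the question itself is about JavaScript arrays); it assumes dates and values are parallel lists sorted by date, with one entry per day so that a step of 7 means weekly sampling:

```python
from bisect import bisect_left, bisect_right

def sample_between(dates, values, start, end, step=7):
    # Two binary searches locate the exclusive (start, end) window,
    # mirroring filter(e => e > start && e < end); the slice then
    # takes every `step`-th value inside that window.
    lo = bisect_right(dates, start)   # first index with dates[i] > start
    hi = bisect_left(dates, end)      # first index with dates[i] >= end
    return values[lo:hi:step]
```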

How does Google S2's use of Hilbert Curve solve (if not, minimize) the problem of closer cells having different prefix values like in Geohash?

In the case of GeoHash, two points that are close can have totally different hash values, making it impossible to do things like prefix comparison. This is due to the fact that somewhere in the ancestry line, there is a split (in geographical grouping).
How does S2 try to solve that problem for the purpose of querying? I read a bunch of posts on S2 but couldn't understand.
I would not say S2 solves this problem. Two close points may still have totally different cell ids in S2 too. One can argue that S2's Hilbert curve makes it somewhat less common than with the Z-curve used by GeoHash, but the root problem remains.
When you use S2 you don't normally use prefix comparison though, you use interval search. Alternatively, you compute a few prefixes possible within a specific radius of a point and search for them. You can do both of these approaches with GeoHash too of course.
S2 does solve a different problem of GeoHash's, one that makes GeoHash impractical for nearby search except in local cases: the very different size and geometry of the cells. GeoHash cells near the poles are much smaller (in real area) than cells of the same level near the equator, and near-polar GeoHash cells are also stretched. S2 cells are more even across the globe.
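As a rough illustration of the interval-search idea (not the actual S2 API): if every indexed point is stored under its cell id in sorted order, and a region coverer has produced a list of (lo, hi) cell-id ranges for the query region, the lookup is just a few binary searches. The covering itself is assumed to come from an S2 (or GeoHash) library:

```python
from bisect import bisect_left, bisect_right

def ids_in_covering(sorted_cell_ids, covering):
    """sorted_cell_ids: cell ids of all indexed points, sorted ascending.
    covering: (lo, hi) cell-id ranges describing the query region,
    assumed to be produced by a region coverer elsewhere."""
    hits = []
    for lo, hi in covering:
        i = bisect_left(sorted_cell_ids, lo)
        j = bisect_right(sorted_cell_ids, hi)
        hits.extend(sorted_cell_ids[i:j])
    return hits
```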

Is there a way to call the pcfcross function on groups of marks?

I'm using the pcfcross function to estimate the pair correlation functions (PCFs) between pairs of cell types, indicated by marks. I would now like to expand my analysis to include measuring the PCFs between cell types and groups of cell types. Is there a way to use the pcfcross function on a group of marks?
Alternatively, is there a way to change the marks of a group of marks to a singular mark?
You can collapse several levels of a factor to a single level, using the spatstat function mergeLevels. This will group several types of points into a single type.
However, this may not give you any useful new information. The pair correlation function is a second-order summary, so the pair correlation for the grouped data can be calculated from the pair correlations for the un-grouped data. (See Chapter 7 of the spatstat book).
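Concretely, for a stationary multitype process, merging types i and j and comparing against type k gives the following intensity-weighted average (a sketch of the relationship implied above, with $\lambda$ denoting the intensity of each type):

$$ g_{\,i\cup j,\,k}(r) \;=\; \frac{\lambda_i\, g_{ik}(r) + \lambda_j\, g_{jk}(r)}{\lambda_i + \lambda_j} $$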

Calculating the highest 2-side average match

"First part of the question is dedicated towards explaining the concept better, so we know, what we're calculating with. Feel free to
skip below to the latter parts, if you find it unnecessary"
1. Basic overview of the question:
Hello, I've got an Excel application, something akin to a dating site. You can open various user profiles and even scan through the data and find potential matches, based on hobbies, cities and other criteria.
How it's calculated is not relevant to the question, but the result of a "Find Match" calculation looks something like this: a sorted list of users, depending on how fitting they are (last column).
Relevant to the question are mainly:
- the first column (ID): the ID of the user
- the last column (Zhoda): the match% of other users against the one currently selected
2. What I need to do - how it's currently done
I need to find the highest match on average out of all users. If I were to write this algorithmically:
1. Loop through all users
2. For each user in our database calculate the potential matches
3. Store the score of the selected user ID against all the found user IDs
4. Once it's all calculated, pit all users against each other and find the highest match on average
Obviously that sounds pretty complicated/vague, so here's a simplified example. Let's say I have completed the first 3 steps and have gotten the following result:
Here, the desired result would be:
User1 <- 46% -> User2
as they have the highest combined percentage average:
User1 vs User2: 30%
User2 vs User1: 62%
User1 <- (30+62)/2 -> User2
And no other possible combination of users has higher match% average
3. The purpose behind the question:
Now obviously you may ask: if I get the calculation behind it, then why ask the question in the first place? Well, the reason is that the combination of everything vs everything is extremely inefficient.
As soon as there are, let's say, 100 users instead of 3 in my database, I would have to do 100*100 calculations on match% alone, let alone afterwards check the average match% of each individual user against another.
Is there perhaps some better approach, in which I could either:
- minimize the data I have to calculate with,
- use some sorting algorithm that lets me skip certain calculations in order to be quicker, or
- find an overall better way of calculating the highest average match%?
So to recapitulate:
I've got a database of users.
Each individual user has a certain amount of Match% against every other user
I need to find the two users who, matched against each other (in both directions), have the highest average match% out of all possible combinations.
If you feel like you need any additional info, please let me know.
I'll try to keep the question updated as much as possible.
As you've presented the problem -- no, you cannot speed this up significantly. Since you've presented match% as an arbitrary function, constrained only by implied range, there are no mathematical properties you can harness to reduce the worst-case search scenario.
Under the given circumstances, the best you can do is to leverage the range. First, don't bother with "average": since these are strictly binary matches, dividing by 2 is simply a waste of time; keep the total.
Start by picking a pair; do the two-way match. Once you find a total of more than 100, store that value and use it to prune any sub-standard searches. For instance, if your best match so far totals 120, then if you find a couple where match(A, B) < 20, you don't bother with computing match(B, A).
In between, you can maintain a sorted list (O(n log n)) of first matches; don't do the second match unless you have reason to believe that this one might exceed your best match.
The rest of your optimization consists of gathering statistics about your matching, so that you can balance when to do first-only against two-way matches. For instance, you might defer the second match for any first match that is below the 70th percentile of those already deferred. This is in the hope of finding a far better match that would entirely eliminate this one.
If you gather statistics on the distribution of your match function, then you can tune this back-and-forth process better.
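Here is a hedged sketch of the basic pruning bound in Python, assuming match(a, b) is the expensive 0-100 scoring function (the statistics-driven deferral above is left out):

```python
def best_pair(users, match):
    """Return the pair with the highest two-way total match(a, b) + match(b, a).
    `match` is assumed to be the expensive scoring function returning 0-100."""
    best_total, best = -1.0, None
    for i, a in enumerate(users):
        for b in users[i + 1:]:
            first = match(a, b)
            # Prune: even a perfect 100 on the return match cannot beat
            # the best total found so far, so skip the second call.
            if first + 100 <= best_total:
                continue
            total = first + match(b, a)
            if total > best_total:
                best_total, best = total, (a, b)
    return best, best_total / 2   # divide by 2 only once, at the very end
```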
If you can derive mathematical properties about your match function, then there may be ways to leverage those properties for greater efficiency. However, since it's already short of being a formal topological "distance" metric d (see below), I don't hold out much hope for that.
Basic metric properties:
- d(A, B) exists for all pairs (A, B)
- d(A, B) = d(B, A)
- d(A, A) = 0 (does not apply to a bipartite graph)
- d obeys the triangle inequality, which doesn't apply directly but has some indirect consequences for a bipartite graph

Decision Tree status column & related numerical value column

I have data including two columns, where one categorically shows the status of the feature and the other numerically shows the related value, just like below:
I want to run a decision tree algorithm via scikit-learn on this data. I am not sure how to deal with these two columns, because conceptually I cannot figure out how to combine these two very correlated features. Basically, we are not supposed to leave null data; however, this one is supposed to be null in the numerical column by nature. If we make it "0", it takes on another meaning.
So, how should I pre-process this data to have the decision tree algorithm work properly?
My professor provided a reasonable answer, as below.
First, fill the null cells with "0".
If you plug the data into decision tree algorithms with these two features, we have two cases:
If "Status" comes first:
The tree will split 0's and 1's into two branches. Under 0, all Amount values will be already 0, hence this feature will not be chosen. Under 1, there will not be any 0 Status.
If "Amount" comes first: All Status 0's will go under only one branch and they will get together with the ones that are very small amounts.
So, if the Amount data is noisy, it might be helpful to keep the Status column. Otherwise, I would remove the Status column.
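A minimal sketch of that suggestion with scikit-learn; the column names and the tiny DataFrame are made up for illustration:

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Toy data: Amount is null exactly when Status is 0.
df = pd.DataFrame({
    "Status": [1, 1, 0, 1, 0],
    "Amount": [120.0, 35.5, None, 8.0, None],
    "Target": [1, 0, 0, 1, 0],
})

df["Amount"] = df["Amount"].fillna(0)   # step 1: fill the null cells with 0

clf = DecisionTreeClassifier(random_state=0)
clf.fit(df[["Status", "Amount"]], df["Target"])
```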
