Evaluating the "Value" Attribute - nlp

I'm attempting to use the OpenAmplify API to evaluate the content of a URI. The point is to draw out the topics that are truly relevant to the article. Unfortunately, the topical analysis I'm getting back is:
Huge, and
Varied
Neither quality is terribly useful for what I'm trying to do because the signal to noise ratio is being heavily skewed towards noise. I'm analyzing web content, so there is a certain amount (perhaps a large amount) of irrelevant content (ads, etc.) involved. I get that.
Nonetheless, many of the topics being returned are either useless (utterly nonsensical, not even words), irrelevant (as in, where did that come from?) or too granular to provide any meaning or insight. I can probably filter out most of this noise using the value, um, value that is returned for each domain, subdomain, topic, et al., but I don't really know what it means.
Certainly I understand that the value is a measure of "the prominence of the word in the text," but the number itself appears entirely arbitrary, in a way that prevents me from saying something like "ignore any terms with a value less than 50" and having it carry any real meaning.
Are there any range criteria that I can use to help me understand how to use a topic's value score as a filtering threshold? Alternatively, is there another field that I should be using for this sort of filtration?
Thanks for your help.

From other channels, I've learned that the value attribute can't be evaluated the way I was hoping. It means different things for different signals, and none of them is defined in a way that is meaningful for this kind of requirement.


Find the most similar text in a list of strings

First of all, I want to acknowledge that this is perhaps a very sophisticated problem; however, I have not been able to find a definitive answer for it online, so I'm looking for suggestions.
Suppose I want to maintain a list containing over a hundred thousand strings, where each string is a sentence that a user has typed. New values are added to the list as soon as a user types a new message. For example:
["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
But I also want a timeout for each string, so that if it is not repeated within 5 minutes it gets removed. So I change it to the following format:
[{message:"Hello world!", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, my name is John", timeout: NodeJS.Timeout, count: 1}, {message:"Good morning, everyone", timeout: NodeJS.Timeout, count: 1}]
Now suppose a user types the following message:
Good morning, everyBODY
I want to compare this string to all the messages in the list, and if one is 70% or more similar, update the count of that message; otherwise insert it as a new message. For this example message, the application should update the count of Good morning, everyone to 2.
Since users can type a lot of messages in a short amount of time, the algorithm must also support fast insertion, searching, and deleting after the timeout.
What is the best way to implement this? Or are there any libraries to help me with this?
NOTE: The strings do not need to be in an array, any data structure would work.
The main purpose of this algorithm is to detect similar messages once the count reaches a predefined value, for example: warning: Over 5 users typed messages similar to "Hello everybody" within 5 minutes.
I have looked at B-Trees, Nearest Neighbor, etc but I can't figure out what would be the best solution.
Update:
I plan on using Levenshtein distance for string similarity; however, the main problem is how to apply that to a list of strings in the most time-efficient way, without having to check every single string every time a new message is added.
Levenshtein distance
Unlike the other answer, I think Levenshtein distance is perfectly capable of dealing with spelling mistakes. Indeed, Levenshtein and LevXenshtein have a Levenshtein distance of only 1, and can thus be concluded to likely be the same message.
However, if you want to use this distance, you will have to compute the distance between the new message and every message stored, every time a new message comes in. There is likely no way around this.
Unfortunately there is no real useful pre-processing you can do for this.
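To make the brute-force approach concrete, here is a minimal Python sketch (the question is Node.js, but the logic carries over directly). The store layout, the monotonic-clock expiry, and names like SIMILARITY_THRESHOLD are my own illustrative choices, not anything from the question:

import time

SIMILARITY_THRESHOLD = 0.7   # "70% or more similar"
TTL_SECONDS = 5 * 60

def levenshtein(a: str, b: str) -> int:
    # classic dynamic-programming edit distance, O(len(a) * len(b))
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

store = {}   # message -> {"count": int, "expires": float}

def add_message(msg: str) -> None:
    now = time.monotonic()
    for key in [k for k, v in store.items() if v["expires"] <= now]:
        del store[key]                                   # drop entries whose 5-minute window passed
    for key, entry in store.items():                     # linear scan over every stored message
        if similarity(msg, key) >= SIMILARITY_THRESHOLD:
            entry["count"] += 1
            entry["expires"] = now + TTL_SECONDS
            return
    store[msg] = {"count": 1, "expires": now + TTL_SECONDS}

Note that a match refreshes the 5-minute window here; if you want the window anchored to the first occurrence instead, simply don't update "expires" on a match.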
Other possibilities
If you can find a way to map every message to a fixed-size vector, you can use essentially any nearest neighbor search technique. I suggest doing so.
This leaves us with two problems to solve. Generating the fixed-length vector, and doing the search.
Fixed-size vector representation
There are multiple ways of doing this, all with their own set of drawbacks. I'll specifically mention two, but it will depend on your architecture and data which method is best for you.
First, you could go the machine-learning way. You could map every word to a pre-trained vector with fastText, average the words in the message, and use that as your vector. The drawbacks of this method are that it will ignore word order, and it will work less well if the words used tend to be very informal. If your messages have their own culture to them (such as for example Twitch chat) you would have to retrain these vectors instead of using pre-trained ones.
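As a hedged sketch of that averaging idea, assuming the fasttext Python package and a pre-trained model file you have downloaded yourself (the file name below is only an example):

import numpy as np
import fasttext

model = fasttext.load_model("cc.en.300.bin")   # pre-trained vectors, downloaded separately

def message_vector(message: str) -> np.ndarray:
    words = message.lower().split()
    if not words:
        return np.zeros(model.get_dimension())
    # average the word vectors; word order is lost, as noted above
    return np.mean([model.get_word_vector(w) for w in words], axis=0)

fastText also ships a get_sentence_vector method that does roughly this averaging (with per-word normalization) for you.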
Alternatively, you could use the structure of the text directly, and make an occurrence vector of bigrams. That is, jot down how often every 2-character combination occurs in a message. This is fairly robust, but has the drawback that the vectors will become relatively large.
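A sketch of the bigram-count idea; the padding and the use of sparse counters (rather than a literal dense vector) are my own simplifications:

from collections import Counter

def bigram_counts(message: str) -> Counter:
    text = " " + message.lower() + " "   # pad so the first and last characters form bigrams too
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sum(v * v for v in a.values()) ** 0.5
    norm_b = sum(v * v for v in b.values()) ** 0.5
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

To get a true fixed-size vector you would fix an alphabet and fill a dense array with one slot per possible bigram, which is exactly why these vectors get relatively large.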
Regardless, these are just two options, and it's impossible to tell what method is ideal for you. Unless of course someone has a brilliant idea.
Nearest neighbor search
Given that we have fixed length vectors, we can now do nearest neighbor search. As you've probably found, there are once again many different methods for this, all with their own drawbacks. Exhausting, I know.
I'll choose to discuss three categories.
Approximate search: This method may seem a little silly, but it could be what you want. Specifically, Locality-sensitive hashing is essentially just making some hashing function where "similar" vectors are likely to end up in the same bucket. You could then do anything you want, such as Levenshtein, with all of the other members of the bucket, because there should not be too many of them. The advantage of such an approximate algorithm is that it can be fast, and with some smart hashing you don't even need fixed-length vectors. A downside, of course, is that it is not guaranteed to work.
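A minimal sketch of the locality-sensitive-hashing idea using random hyperplanes (suitable for cosine similarity); the dimensionality and number of planes below are arbitrary assumptions:

from collections import defaultdict
import numpy as np

DIM = 729        # dimension of whatever fixed-size vector you chose
N_PLANES = 12    # more planes -> smaller, purer buckets

rng = np.random.default_rng(42)
planes = rng.normal(size=(N_PLANES, DIM))

def lsh_key(vec: np.ndarray) -> int:
    # record which side of each random hyperplane the vector falls on
    bits = (planes @ vec) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

buckets = defaultdict(list)

def add(message: str, vec: np.ndarray) -> None:
    buckets[lsh_key(vec)].append(message)

def candidates(vec: np.ndarray) -> list:
    # only these bucket members need the exact (e.g. Levenshtein) check
    return buckets[lsh_key(vec)]

In practice you would use several independent hash tables, so that near-misses under one set of planes still land together under another.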
Exact search: We can also choose to instead solve the problem of Fixed-radius near neighbors. That is, find the points within some distance of the target point. You could do this by mapping vectors to integers (if they aren't already) and simply checking every lattice point within the distance you want to search. The primary drawback here is that the search time grows very fast not with the number of points, but with the number of dimensions of the vector. This method would necessitate small vectors.
Fancy data structures: This seems to me most likely to be the right solution. Unfortunately, there are a lot of letter-trees. You mention B-trees, but there are also R-trees, R+-trees, R*-trees, X-trees, and that's just the direct descendants of the R-tree. At the risk of missing the trees for the forest, I'd suggest taking a look at the k-d tree. It can do nearest neighbor search in logarithmic time, as well as insertion and deletion.
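For completeness, a self-contained sketch of the lookup using scipy's k-d tree (assumed available). Note that scipy's cKDTree is static, so unlike a hand-rolled k-d tree you would rebuild or batch it rather than insert and delete incrementally; the tiny bigram featurizer is illustrative only:

import numpy as np
from scipy.spatial import cKDTree

ALPHABET = "abcdefghijklmnopqrstuvwxyz "   # characters outside this set are ignored

def dense_bigrams(message: str) -> np.ndarray:
    # dense, fixed-size character-bigram counts (illustrative featurizer only)
    vec = np.zeros(len(ALPHABET) ** 2)
    text = [c for c in message.lower() if c in ALPHABET]
    for a, b in zip(text, text[1:]):
        vec[ALPHABET.index(a) * len(ALPHABET) + ALPHABET.index(b)] += 1
    return vec

stored = ["Hello world!", "Good morning, my name is John", "Good morning, everyone"]
tree = cKDTree(np.array([dense_bigrams(m) for m in stored]))

dist, idx = tree.query(dense_bigrams("Good morning, everyBODY"), k=1)
print(stored[idx], dist)   # nearest stored message and its distance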
You want to convert all of the words to their Soundex value.
Then you need a database for the Soundex values that ranks the importance of the word in the sentence, e.g. "the" should probably get 0. The more information the word carries, the higher its value.
Then sort the words in the sentence into a list of integers.
Use the list of integers as the key to find similar sentences.
Since the key is a list of integers a Rose tree should work as data structure.
While some may suggest measuring similarity with something like Levenshtein distance, that presupposes the sentences have no spelling mistakes or the like. You need something that is flexible enough to deal with human error.
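For reference, a simplified Soundex sketch; it omits the official H/W adjacency rule, so treat it as illustrative rather than a standards-compliant implementation:

def simple_soundex(word: str) -> str:
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit
    word = word.lower()
    if not word:
        return ""
    out = word[0].upper()
    prev = codes.get(word[0], "")
    for ch in word[1:]:
        digit = codes.get(ch, "")
        if digit and digit != prev:
            out += digit     # skip vowels and collapse adjacent repeats
        prev = digit
    return (out + "000")[:4]

print(simple_soundex("morning"), simple_soundex("mourning"))   # both give M655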
I would suggest you use Algolia, which has its own ranking algorithm that rates each matching record on several criteria (such as the number of typos or the geo-distance) and assigns each an integer score.
I would definitely take a look at it, since they offer search-as-you-type and configurable ranking criteria.
https://blog.algolia.com/search-ranking-algorithm-unveiled/
I think a search engine like Solr or Elasticsearch is the best fit for your problem.
You have to create a single collection in which you store the data as described in the question; after that, you just add the data to Solr and search it, applying your time limit to the query.

Hypothesis search tree

I have an object with many fields. Each field has a different range of values. I want to use Hypothesis to generate different instances of this object.
Is there a limit to the number of combinations of field values Hypothesis can handle? Or what does the search tree Hypothesis creates look like? I don't need all the combinations, but I want to make sure that I get a fair number of combinations where I test many different values for each field. I want to make sure Hypothesis is not doing a DFS until it hits the max number of examples to generate.
TLDR: don't worry, this is a common use-case and even a naive strategy works very well.
The actual search process used by Hypothesis is complicated (as in, "lead author's PhD topic"), but it's definitely not a depth-first search! Briefly, it's a uniform distribution layered on a pseudo-random number generator, with a coverage-guided fuzzer biasing that towards less-explored code paths, and strategy-specific heuristics on top of that.
In general, I trust this process to pick good examples far more than I trust my own judgement, or that of anyone without years of experience in QA or testing research!
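For example, the usual way to express "an object with many fields, each with its own range" is st.builds; the Order class and the ranges below are made up purely for illustration:

from dataclasses import dataclass
from hypothesis import given, settings, strategies as st

@dataclass
class Order:              # hypothetical object with several fields
    quantity: int
    price: float
    category: str

orders = st.builds(
    Order,
    quantity=st.integers(min_value=1, max_value=1_000),
    price=st.floats(min_value=0.01, max_value=10_000, allow_nan=False),
    category=st.sampled_from(["books", "food", "toys"]),
)

@given(orders)
@settings(max_examples=500)   # raise this if you want more combinations explored
def test_order_invariants(order):
    assert 1 <= order.quantity <= 1_000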

Effect of randomness on search results

I am currently working on a search ranking algorithm which will be applied to Elasticsearch queries (domain: e-commerce). It assigns scores to the entities returned and finally sorts them based on the assigned score.
My question is: has anyone ever tried to introduce a certain level of randomness into a search algorithm and experienced a positive effect from it? I am thinking that it might be useful to reduce bias and promote lower-ranking items, to give them a chance to be seen more easily and become popular if they deserve it. I know that some machine learning algorithms introduce randomization to reduce bias, so I thought it might apply to search as well.
The closest I could find is this, but it is not exactly what I am hoping to get answers for:
Randomness in Artificial Intelligence & Machine Learning
I don't see this mentioned in your post... Elasticsearch offers a random scoring feature: https://www.elastic.co/guide/en/elasticsearch/guide/master/random-scoring.html
As the owner of the website, you want to give your advertisers as much exposure as possible. With the current query, results with the same _score would be returned in the same order every time. It would be good to introduce some randomness here, to ensure that all documents in a single score level get a similar amount of exposure.
We want every user to see a different random order, but we want the same user to see the same order when clicking on page 2, 3, and so forth. This is what is meant by consistently random.
The random_score function, which outputs a number between 0 and 1, will produce consistently random results when it is provided with the same seed value, such as a user’s session ID.
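For reference, roughly what that looks like as a query body, written here as a Python dict; note that in recent Elasticsearch versions random_score expects both a seed and a field, and the exact shape can vary by version:

user_session_id = 1234567   # any stable per-user value works as the seed

body = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "running shoes"}},
            "functions": [
                {"random_score": {"seed": user_session_id, "field": "_seq_no"}}
            ],
            "boost_mode": "sum"   # add the random tie-breaker to the relevance score
        }
    }
}
# with an elasticsearch-py client this would be roughly: es.search(index="products", body=body)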
Your intuition is right - randomization can help surface results that get a lower-than-deserved score due to uncertainty in the estimation. Empirically, Google search ads seem to have sometimes been randomized, and this paper, for example, hints at it (see Section 6).
This problem describes an instance of a class of problems called Explore/Exploit algorithms, or Multi-Armed Bandit problems; see e.g. http://en.wikipedia.org/wiki/Multi-armed_bandit. There is a large body of mathematical theory and algorithmic approaches. A general idea is to not always order by expected, "best" utility, but by an optimistic estimate that takes the degree of uncertainty into account. A readable, motivating blog post can be found here.
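To give a flavour of the explore/exploit idea, here is a tiny UCB1 sketch: instead of sorting purely by estimated relevance, each item gets an optimism bonus that shrinks as it gathers impressions (all names here are illustrative):

import math

def ucb1_score(clicks: int, impressions: int, total_impressions: int) -> float:
    if impressions == 0:
        return float("inf")   # always give unseen items a chance
    mean = clicks / impressions                                        # estimated utility so far
    bonus = math.sqrt(2 * math.log(total_impressions) / impressions)   # optimism for uncertainty
    return mean + bonus

# rank items by ucb1_score instead of by the raw relevance estimate alone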

String Matching Algorithms

I have a python app with a database of businesses and I want to be able to search for businesses by name (for autocomplete purposes).
For example, consider the names "best buy", "mcdonalds", "sony" and "apple".
I would like "app" to return "apple", as well as "appel" and "ple".
"Mc'donalds" should return "mcdonalds".
"bst b" and "best-buy" should both return "best buy".
Which algorithm am I looking for, and does it have a python implementation?
Thanks!
The Levenshtein distance should do.
Look around - there are implementations in many languages.
Levenshtein distance will do this.
Note: this is a distance, so you have to calculate it against every string in your database, which can be a big problem if you have a lot of entries.
If you have this problem, then record all the typos the users make (typo = no direct match) and build a correction database offline containing all the typo->fix mappings. Some companies do this even more cleverly, e.g. Google watches how users correct their own typos and learns the mappings from that.
Soundex or Metaphone might work.
I think what you are looking for falls into the huge field of data quality and data cleansing. I doubt you will find a ready-made Python implementation for this, as it has to be something that cleanses a considerable amount of data in the database, which could be of business value.
Levenshtein distance goes in the right direction, but only half the way. There are several tricks to get it to use partial matches as well.
One would be to use subsequence dynamic time warping (DTW is actually a generalization of Levenshtein distance). For this you relax the start and end cases when calculating the cost matrix. If you only relax one of the conditions, you can get autocompletion with spell checking. I am not sure if there is a Python implementation available, but if you want to implement it yourself it should not be more than 10-20 LOC.
The other idea would be to use a trie for speed-up, which can do DTW/Levenshtein on multiple results simultaneously (a huge speedup if your database is large). There is a paper on Levenshtein on tries at IEEE, so you can find the algorithm there. Again, for this you would need to relax the final boundary condition so you get partial matches. However, since you step down the trie, you just need to check when you have fully consumed the input and then return all leaves.
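For reference, a sketch of the "Levenshtein over a trie" walk: one DP row is computed per trie edge, and whole subtrees are pruned once every cell in the row exceeds the allowed distance. (Relaxing the final-row check, as described above, gives the autocomplete behaviour.)

class TrieNode:
    def __init__(self):
        self.children = {}
        self.word = None            # set on the node that ends a stored word

def insert(root: TrieNode, word: str) -> None:
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.word = word

def search(root: TrieNode, query: str, max_dist: int) -> list:
    results = []
    first_row = list(range(len(query) + 1))
    for ch, child in root.children.items():
        _walk(child, ch, query, first_row, max_dist, results)
    return results

def _walk(node, ch, query, prev_row, max_dist, results):
    # standard Levenshtein recurrence, one row per character on the trie path
    row = [prev_row[0] + 1]
    for col in range(1, len(query) + 1):
        row.append(min(row[col - 1] + 1,                            # insertion
                       prev_row[col] + 1,                           # deletion
                       prev_row[col - 1] + (query[col - 1] != ch))) # substitution or match
    if node.word is not None and row[-1] <= max_dist:
        results.append((node.word, row[-1]))
    if min(row) <= max_dist:        # prune: nothing below here can get closer
        for next_ch, child in node.children.items():
            _walk(child, next_ch, query, row, max_dist, results)

# usage: root = TrieNode(); insert(root, "best buy"); search(root, "bst b", max_dist=3)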
Check out difflib: http://docs.python.org/library/difflib.html. It should help you.
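For a quick feel for what difflib gives you out of the box (the commented results are what I would expect with the default SequenceMatcher ratio):

import difflib

names = ["best buy", "mcdonalds", "sony", "apple"]

difflib.get_close_matches("appel", names, n=3, cutoff=0.6)        # ['apple']
difflib.get_close_matches("mc'donalds", names, n=3, cutoff=0.6)   # ['mcdonalds']
difflib.SequenceMatcher(None, "bst b", "best buy").ratio()        # roughly 0.77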

Is it OK to have a precision value of 100% in text retrieval system?

Since the formula for precision is:
precision = retrieved_and_relevant / (retrieved_and_relevant + retrieved_and_irrelevant)
I am wondering if the value for precision in a text-retrieval system will ever be different from 100%. I suspect it will always be 100%, because we programmers put a hell of a lot of effort into not forgetting to index each and every piece of text in all the documents out there. So, when a query is fired at the text-retrieval system, it will output all the documents containing the query text. This means that all the documents retrieved are relevant documents, essentially making the score 100%.
Is this true or am I missing some point ?
You're slightly confused on the concept behind precision.
A simple example would be searching for the terms iraq war. Depending on how the search engine is designed, the results may or may not be what the user is looking for. It might return:
Wars that Iraq, the country, is involved in
A fictional story about a soldier in the current Iraq war,
A news article that talks about various wars and their financial impact.
Each document could be completely different and contain the exact search terms, but might be irrelevant to what the user was looking for.
The search engine would definitely LIKE to have a precision of 100% but it's very rare that this is the case.
Precision can ONLY be determined by the user who performs the search query, as they are the only one who knows without a doubt whether a result is relevant or not. It's definitely something to strive for, but don't expect it to always equal 100%.
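A toy example of the arithmetic, with invented document IDs, to make the distinction concrete:

retrieved = {"d1", "d2", "d3", "d4"}      # what the engine returned for "iraq war"
relevant  = {"d2", "d4", "d7"}            # what the user actually considers relevant

precision = len(retrieved & relevant) / len(retrieved)   # 2 / 4 = 0.5
recall    = len(retrieved & relevant) / len(relevant)    # 2 / 3 ≈ 0.67

Every retrieved document may well contain the query terms, yet precision is only 50% here, because relevance is judged by the user, not by term presence.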
