I'm trying to replace a document ranker from a project, which needs a model that takes a huge amount of memory to train, with a simpler one based on the wikipedia library.
I start from queries, a list that contains only one query at the moment:
queries : ['What is the population of Toulon']
I would like to change the way the closest docs are ranked, using the wikipedia.page() function. Yet for this ranker to work I know that I need an iterable object at the end. Indeed I tried:
# Rank documents for queries.
if len(queries) == 1:
    # ranked = [self.ranker.closest_docs(queries[0], k=n_docs)]
    ranked = [wikipedia.page(queries), wikipedia.page(queries)]  # which is stupid I know, but don't know how to do it differently yet.
all_docids, all_doc_scores = zip(*ranked)
and got a TypeError: zip argument #1 must support iteration error on the all_docids, all_doc_scores = zip(*ranked) line.
Until now I have two wikipedia pages:
<WikipediaPage 'Toulon'> <WikipediaPage 'Toulon'>
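For zip(*ranked) to work, every element of ranked has to be a (doc_ids, doc_scores) pair, one pair per query. A minimal sketch of that shape, assuming the page title is used as the doc id and a placeholder score of 1.0 (both of those are my assumptions, not part of the original project):

import wikipedia

queries = ['What is the population of Toulon']

if len(queries) == 1:
    page = wikipedia.page(queries[0])      # look up the single query string, not the whole list
    ranked = [([page.title], [1.0])]       # one (doc_ids, doc_scores) pair per query; 1.0 is a dummy score
all_docids, all_doc_scores = zip(*ranked)  # -> (['Toulon'],) and ([1.0],)

A WikipediaPage object itself is not such a pair, which is why zip complains that argument #1 does not support iteration.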
I have a collection of objects
[{name: Peter}, {name: Evan}, {name: Michael}];
And I want to get an object, for example {name: Evan}, by its index (1).
How can I pull this out?
I tried getting all objects with find() and then getting an object by index, but it's not a good idea in terms of speed.
There are a few notable aspects of this question. In the comments you clarify:
yes they are different documents. By the index I mean const users = await User.find(); users[1] // {name: "Evan"}
Probably what you are looking to do here is something along the lines of:
const users = await User.find().skip(1).limit(1);
This will return just the single document that you are looking for.
Keep in mind, however, that without providing a sort to the operation the database is free to return the results in any order. So the "index" (position) is not guaranteed to be consistent without the sort clause.
I tried getting all objects with find() and then getting an object by index, but it's not a good idea in terms of speed.
In general, your current approach requires that the database iterate through all of the items being skipped, which can be slow. Limiting the results at least reduces the amount of network activity that is required. Depending on what you are trying to achieve, you could consider setting a smaller batch size (and iterating the cursor) or using range queries instead.
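The same sort-then-skip/limit idea, sketched with PyMongo purely as an illustration (the connection string, database, collection and field names here are all assumptions, not taken from the question):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local instance
users = client["mydb"]["users"]

# Sort first so the "index" is well defined, then skip/limit down to one document.
cursor = users.find().sort("name", 1).skip(1).limit(1)
print(next(cursor, None))  # e.g. the second user by name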
I'm building a workout app that has an entity called Workout and another one called Exercise.
A workout can contain multiple exercises (thus a one-to-many relationship). I want to show the users of my app the exercises contained in a workout but in an ordered way (it's not the same to start with strength exercises as with the cardio ones).
Apparently, when establishing this kind of relationship in Core Data, I need to use an NSSet, because if I try to use, for example, an Array whose elements are ordered, I get the following error:
*** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: 'Unacceptable type of value for to-many relationship: property = "consistsOf"; desired type = NSSet; given type = __NSArray0; value = (
).'
I have tried to check the "ordered" checkmark in my model, but then I get an error saying "Workout.consistsOf must not be ordered".
I have also tried to use an NSDictionary whose keys would be the position and the values would be the exercises themselves, but I'm getting the same error as above.
How can I show the users the exercises that a workout consists of in an ordered way?
Thanks a lot in advance!
P.S.: Here's a screenshot of the properties of my model.
Ordered relationships use NSOrderedSet, but CloudKit doesn't support ordered sets, so you can't use an ordered relationship and CloudKit in the same data model.
To keep an order, you need to have some property on Exercise that would indicate the order. This could be as simple as an integer property called something like index. You'd sort the result based on the index value. If there's something else that also indicates order-- like a date, maybe?-- use that instead of adding a new property.
I am evaluating OpenNLP for use as a document categorizer. I have a sanitized training corpus with roughly 4k files, in about 150 categories. The documents have many shared, mostly irrelevant words - but many of those words become relevant in n-grams, so I'm using the following parameters:
TrainingParameters params = new TrainingParameters();
params.put(TrainingParameters.ITERATIONS_PARAM, 20000);
params.put(TrainingParameters.CUTOFF_PARAM, 10);
DoccatFactory dcFactory = new DoccatFactory(new FeatureGenerator[] { new NGramFeatureGenerator(3, 10) });
params.put(AbstractTrainer.ALGORITHM_PARAM, NaiveBayesTrainer.NAIVE_BAYES_VALUE);
Some of these categories apply to documents that are almost completely identical (think boiler-plate legal documents, with maybe only names and addresses different between document instances) - and will be mostly identical to documents in the test set. However, no matter how I tweak these params, I can't break out of the "1 outcome patterns" result. When running a test, every document in the test set is tagged with "Category A."
I did manage to effect a single minor change in output, by moving from previous use of the BagOfWordsFeatureGenerator to the NGramFeatureGenerator, and from maxent to Naive Bayes; before the change, every document in the test set was assigned "Category A", but after the change, all the documents were now assigned to "Category B." But other than that, I can't seem to move the dial at all.
I've tried fiddling with iterations, cutoff, ngram sizes, using maxent instead of bayes, etc; but all to no avail.
Example code from tutorials that I've found on the interweb uses much smaller training sets with fewer iterations, and is able to perform at least some rudimentary differentiation.
Usually in such a situation - bewildering lack of expected behavior - the engineer has forgotten to flip some simple switch, or has some fatal lack of fundamental understanding. I am eminently capable of both those failures. Also, I have no Data Science training, although I have read a couple of O'Reilly books on the subject. So the problem could be procedural. Is the training set too small? Is the number of iterations off by an order of magnitude? Would a different algo be a better fit? I'm utterly surprised that no tweaks have even slightly moved the dial away from the "1 outcome" outcome.
Any response appreciated.
Well, the answer to this one did not come from the direction in which the question was asked. It turns out that there was a code sample in the OpenNLP documentation that was wrong, and no amount of parameter tuning would have solved it. I've submitted a jira to the project so it should be resolved; but for those who make their way here before then, here's the rundown:
Documentation (wrong):
String inputText = ...
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText);
String category = myCategorizer.getBestCategory(outcomes);
Should be something like:
String inputText = ... // sanitized document to be classified
DocumentCategorizerME myCategorizer = new DocumentCategorizerME(m);
double[] outcomes = myCategorizer.categorize(inputText.split(" "));
String category = myCategorizer.getBestCategory(outcomes);
DocumentCategorizerME.categorize() needs an array; since this is an obviously self-documenting bug the second you run the code, I had assumed the necessary array parameter should be an array of documents in string form; instead it needs an array of tokens from a single document.
I'm trying to build my own search engine for experimenting.
I know about inverted indexes, for example when indexing words:
The key is the word and it has a list of document ids containing that word, so when you search for that word you get the documents right away.
How does it work for multiple words?
Do you get all documents for every word and traverse those documents to see if they have both words?
I feel it is not the case.
Does anyone know the real answer for this, without speculating?
An inverted index is very efficient for getting the intersection, using a zig-zag algorithm:
Assume your terms are a list T:
lastDoc <- 0                  //the first doc in the collection
currTerm <- 0                 //the first term in T
while (lastDoc != infinity):
    if (currTerm > T.last):   //if we have passed the last term:
        insert lastDoc into result
        currTerm <- 0
        lastDoc <- lastDoc + 1
        continue
    docId <- T[currTerm].getFirstAfter(lastDoc - 1)
    if (docId != lastDoc):
        lastDoc <- docId
        currTerm <- 0
    else:
        currTerm <- currTerm + 1
This algorithm assumes an efficient getFirstAfter(), which gives you the first document that fits the term and whose docId is greater than the specified parameter. It should return infinity if there is none.
The algorithm will be most efficient if the terms are sorted such that the rarest term is first.
The algorithm ensures at most #docs_matching_first_term * #terms iterations, but in practice it will usually be far fewer.
Note: Though this algorithm is efficient, AFAIK Lucene does not use it.
More info can be found in these lecture notes, slides 11-13 [copyright notice on the lecture's first page].
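A runnable Python sketch of the same zig-zag idea, assuming each term's postings list is a sorted list of docIds and approximating getFirstAfter() with a binary search (this is an illustration, not the answerer's or Lucene's code):

from bisect import bisect_left

def first_at_or_after(postings, doc_id):
    # first docId >= doc_id in a sorted postings list, or None if the list is exhausted
    i = bisect_left(postings, doc_id)
    return postings[i] if i < len(postings) else None

def zigzag_intersect(postings_lists):
    # intersect several sorted postings lists; put the rarest term's list first
    result, last_doc, term = [], 0, 0
    while True:
        if term == len(postings_lists):      # every list agreed on last_doc
            result.append(last_doc)
            last_doc, term = last_doc + 1, 0
            continue
        doc = first_at_or_after(postings_lists[term], last_doc)
        if doc is None:                      # one list is exhausted: done
            return result
        if doc != last_doc:                  # jump ahead and re-check all lists
            last_doc, term = doc, 0
        else:                                # this list matches, try the next term
            term += 1

print(zigzag_intersect([[5, 9, 20], [2, 5, 9, 14]]))  # -> [5, 9]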
You need to store the position of each word in a document in the index file.
Your index file structure should be like this:
word id - doc id - no. of hits - positions of hits
Now suppose the query contains four words, "w1 w2 w3 w4". Choose the documents containing most of these words. Now calculate the words' relative distances in each document. The document where most of the words occur and where their relative distance is smallest will have high priority in the search results.
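A minimal Python sketch of such a positional index and a crude "most words, smallest span" ranking, assuming whitespace tokenisation (the structure and scoring are illustrative, not the answerer's actual engine):

from collections import defaultdict

def build_positional_index(docs):
    # index[word][doc_id] -> list of positions of that word in the doc
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in enumerate(docs):
        for pos, word in enumerate(text.lower().split()):
            index[word][doc_id].append(pos)
    return index

def proximity_rank(index, query):
    # prefer docs containing more query words, then a tighter spread of positions
    words = query.lower().split()
    doc_ids = set().union(*(index[w].keys() for w in words if w in index))
    scores = {}
    for doc_id in doc_ids:
        positions = [index[w][doc_id][0] for w in words if doc_id in index[w]]
        spread = max(positions) - min(positions) if len(positions) > 1 else 0
        scores[doc_id] = (-len(positions), spread)
    return sorted(scores, key=scores.get)

docs = ["the population of Toulon", "Toulon is a city", "population growth"]
print(proximity_rank(build_positional_index(docs), "population of Toulon"))  # doc 0 first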
I have developed a complete search engine without using any crawling or indexing tool available on the internet. You can read a detailed description here: Search Engine.
For more info, read this paper by the Google founders: click here.
You find the intersection of document sets as biziclop said, and you can do it in a fairly fast way. See this post and the papers linked therein for a more formal description.
As pointed out by biziclop, for an AND query you need to intersect the match lists (aka inverted lists) for the two query terms.
In typical implementations, the inverted lists are implemented such that they can be searched for any given document id very efficiently (generally, in logarithmic time). One way to achieve this is to keep them sorted (and use binary search), but note that this is not trivial as there is also a need to store them in compressed form.
Given a query A AND B, assume that there are occ(A) matches for A and occ(B) matches for B (i.e. occ(x) := the length of the match list for term x). Assume, without loss of generality, that occ(A) > occ(B), i.e. A occurs more frequently in the documents than B. What you then do is iterate through all matches for B and search for each of them in the list for A. If indeed the lists can be searched in logarithmic time, this means you need
occ(B) * log(occ(A))
computational steps to identify all matches that contain both terms.
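A small Python illustration of that occ(B) * log(occ(A)) approach, assuming both match lists are already sorted lists of document ids (the lists below are made up):

from bisect import bisect_left

def and_query(matches_b, matches_a):
    # binary-search the longer list A for every docId of the shorter list B:
    # occ(B) * log(occ(A)) steps overall
    hits = []
    for doc_id in matches_b:
        i = bisect_left(matches_a, doc_id)
        if i < len(matches_a) and matches_a[i] == doc_id:
            hits.append(doc_id)
    return hits

print(and_query([3, 8, 21], [1, 3, 5, 8, 13, 21, 34]))  # -> [3, 8, 21]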
A great book describing various aspects of the implementation is Managing Gigabytes.
I don't really understand why people are talking about intersection for this.
Lucene supports combining queries using BooleanQuery, which you can nest indefinitely if you must.
The QueryParser also supports the AND keyword, which would require both words to be in the document.
Example (Lucene.NET, C#):
var outerQuery = new BooleanQuery();
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word1 ) ), BooleanClause.Occur.MUST );
outerQuery.Add(new TermQuery( new Term( "FieldNameToSearch", word2 ) ), BooleanClause.Occur.MUST );
If you want to split the words (your actual search term) using the same analyzer, there are ways to do that too. Although, a QueryParser might be easier to use.
You can view this answer for example on how to split the string using the same analyzer that you used for indexing:
No hits when searching for "mvc2" with lucene.net
The inverse document frequency is defined as follows:
IDF(term,document) = tf(term) * log(1 + n/df(term))
where tf(term) = 'frequency of term in document', n = 'number of documents', df(term) = 'number of docs containing term'.
Just curious about df(term) - do I only count a document once even if it contains the term more than once?
Also, is it easy to determine this stat with Lucene(.NET)? I am only starting to use the latter and use a relational db at the moment.
Thanks.
Christian
For using idf with Lucene, check the API for example here.
You are right about the docs being counted only once. The idea of the 1 + n/df(term) form is to get a function with a lower bound in the log part, so even a term that appears in every document keeps a small positive weight.
If you are interested in the idf theory behind the scenes, you may take a peek at this paper.
HTH!
Of course you have to count each document only once for df(term). Therefore, you should group the words to get the distinct words of each document.
See my class IDF here.
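A small Python sketch of that counting rule, assuming whitespace tokenisation (purely illustrative, not the linked IDF class):

import math

def df(term, documents):
    # a document counts once, however often it contains the term,
    # so reduce each document to its set of distinct words first
    return sum(1 for doc in documents if term in set(doc.lower().split()))

def idf(term, documents):
    n = len(documents)
    d = df(term, documents)
    return math.log(1 + n / d) if d else 0.0

docs = ["the cat sat on the mat", "cat cat cat", "the dog barked"]
print(df("cat", docs))   # 2 documents, even though "cat" occurs 4 times
print(idf("cat", docs))  # log(1 + 3/2)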