Can someone help me find the frequency of a word across an entire Lucene index?
For example, if doc A contains 3 occurrences of the word B and doc C contains 2, I'd like a method that returns 5, i.e. the total frequency of word B across the whole index.
This has been asked multiple times:
Get term frequencies in Lucene
How to count term frequency for set of documents?
Get highest frequency terms from Lucene index
How do I get solr term frequency?
Assuming you work with Lucene 3.x:
IndexReader ir = IndexReader.open(dir);
// enumerate all documents containing the term and sum its per-document frequency
TermDocs termDocs = ir.termDocs(new Term("your_field", "your_word"));
int count = 0;
while (termDocs.next()) {
    count += termDocs.freq();
}
Some comments:
dir is an instance of the Lucene Directory class. Its creation differs for RAM and filesystem indexes; see the Lucene documentation for details.
"your_field" is the field to search the term in. If you have multiple fields, you can run the procedure for each of them or, alternatively, when you index your files, you can create a special field (e.g. "_content") and store the concatenated values of all other fields there.
Using Lucene 3.4:
An easy way to get the count, but you need two arrays :-/
int[] docs = new int[1000];
int[] freqs = new int[1000];
// read() fills both arrays and returns the number of matching documents it read (at most docs.length);
// the total term frequency is then the sum of freqs[0..count-1]
int count = indexReader.termDocs(term).read(docs, freqs);
Beware: if you use read(), you cannot call next() any more, because after the read() you are already at the end of the enumeration:
int[] docs = new int[1000];
int[] freqs = new int[1000];
TermDocs td = indexReader.termDocs(term);
int count = td.read(docs, freqs);
while (td.next()) { // always false, already at the end of the enumeration
}
I'm trying to figure out what the weight assigned to each word in a topic represents in Mallet.
I'm assuming it's some form of document occurrence count. However, I'm having a hard time figuring out how that figure is arrived at.
In my model, there are several words that occur in more than one topic, and in each topic they have a different weight assigned, so clearly the number is not the word count over the entire corpus. My next guess was that the number is the number of occurrences of the word in the set of documents assigned to the topic, but when I tried to verify that manually, it turned out to be incorrect.
As an example: I'm training a model over a corpus of about 12,000 documents (alpha 0.1, beta 0.01, t = 50). After training, my model has the following topic:
t1 = "knoflook (158.0), olie (156.0), ...."
So the word 'knoflook' is assigned a weight of 158. Yet when I manually count the number of documents in my corpus that contain that word and have t1 assigned, I get a completely different number (1855).
It's possible that my manual verification is off, of course, but it would be useful to know, in general, how the word weight in each topic is arrived at.
By the way, the above topic is a rendering based on the following code:
// The data alphabet maps word IDs to strings
Alphabet dataAlphabet = instances.getDataAlphabet();
// Get an array of sorted sets of word ID/count pairs
ArrayList<TreeSet<IDSorter>> topicSortedWords = topicModel.getSortedWords();
for (int t = 0; t < numberOfTopics; t++) {
    Iterator<IDSorter> iterator = topicSortedWords.get(t).iterator();
    StringBuilder sb = new StringBuilder();
    while (iterator.hasNext()) {
        IDSorter idWeightPair = iterator.next();
        final String wordLabel = dataAlphabet.lookupObject(idWeightPair.getID()).toString();
        final double weight = idWeightPair.getWeight();
        sb.append(wordLabel + " (" + weight + "), ");
    }
    sb.setLength(sb.length() - 2);
    // sb.toString is now a human-readable representation of the topic
}
Mallet assigns each word token to a topic. The getSortedWords() method counts how many word tokens are of a particular type (e.g. knoflook) and are also assigned to topic k. The division of tokens into documents does not matter for this calculation.
If I'm understanding correctly, you're finding that there are 1855 documents that have a word token of type knoflook and also have a word token assigned to topic t1. But there's no guarantee that those two tokens are the same.
From other work looking at recipes, I would guess that garlic is a common ingredient that occurs in many contexts, and probably has high probability in many topics. It would not be surprising if many instances of the word are assigned to other topics.
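To make the distinction concrete, here is a rough sketch of how that per-topic token count could be recomputed from the trained model, assuming the topicModel and dataAlphabet variables from the snippet above and Mallet's TopicAssignment, FeatureSequence and LabelSequence classes; the word and the topic index are placeholders:
// Count word tokens of one type (e.g. "knoflook") that are assigned to one topic;
// this token count is what getSortedWords() reports as the word's weight.
int typeId = dataAlphabet.lookupIndex("knoflook");
int targetTopic = 0; // placeholder: the index of topic t1
int tokenCount = 0;
for (TopicAssignment assignment : topicModel.getData()) {
    FeatureSequence tokens = (FeatureSequence) assignment.instance.getData();
    LabelSequence topics = assignment.topicSequence;
    for (int position = 0; position < tokens.getLength(); position++) {
        if (tokens.getIndexAtPosition(position) == typeId
                && topics.getIndexAtPosition(position) == targetTopic) {
            tokenCount++; // right word type AND assigned to the target topic
        }
    }
}
// tokenCount should match the weight printed for the word in that topic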
Task
I have a set S of n = 10,000,000 strings s and need to find the subset S_p of strings in S that contain the substring p.
Simple solution
As I'm using C# this is quite a simple task using LINQ:
string[] S = new string[] { "Hello", "world" };
string p = "ll";
IEnumerable<string> S_p = S.Where(s => s.Contains(p));
Problem
If S contains many strings (like the mentioned 10,000,000 strings) this gets horribly slow.
Idea
Build some kind of index to retrieve S_p faster.
Question
What is the best way to index S for this task and do you have any implementation in C#?
Here is one way to do it:
1. Create a string T = S[0] + sep_0 + S[1] + sep_1 + ... + S[n - 1] + sep_n-1 (where sep_i is a unique character that never appears in any S[j]; it can actually be an integer if the character set is not big enough).
2. Build a suffix tree for T (this can be done in linear time).
3. For each query string Q, traverse the suffix tree (this takes O(length(Q)) time). All possible answers are located in the leaves of some subtree, so you can just traverse those leaves. If Q is rather long, the number of leaves in this subtree is likely to be much smaller than n.
4. If Q is really short, the number of leaves in the subtree can be pretty large. That's why you can use another strategy for short query strings: precompute all short substrings of S[0] ... S[n - 1] and, for each of them, store the set of indices of the strings it occurs in. Then you can simply print those indices for a given Q. It is difficult to say what 'short' exactly means here, but it can be found out experimentally; a sketch of this index is given below.
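A minimal sketch of that short-substring index, in Java for brevity (the same shape carries over to a C# Dictionary<string, HashSet<int>>; the cutoff length 3 and all names are illustrative):
import java.util.*;

public class ShortSubstringIndex {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // index every substring of S[i] up to maxLen characters
    public ShortSubstringIndex(String[] S, int maxLen) {
        for (int i = 0; i < S.length; i++) {
            String s = S[i];
            for (int start = 0; start < s.length(); start++) {
                for (int end = start + 1; end <= Math.min(start + maxLen, s.length()); end++) {
                    postings.computeIfAbsent(s.substring(start, end), k -> new HashSet<>()).add(i);
                }
            }
        }
    }

    // indices of all strings containing the short query q (q must be at most maxLen long)
    public Set<Integer> lookup(String q) {
        return postings.getOrDefault(q, Collections.emptySet());
    }

    public static void main(String[] args) {
        String[] S = { "Hello", "world" };
        ShortSubstringIndex idx = new ShortSubstringIndex(S, 3);
        System.out.println(idx.lookup("ll")); // prints [0], i.e. "Hello"
    }
}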
I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints actually make this quite straightforward with a single loop. Imagine you have two indices, each initially pointing at the first character of its string. Now compare the characters they point at: if they don't match, increment the subject index and try again; if they do match, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
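For later reference, a sketch of that two-index walk (written in Java since the idea is language-agnostic; it uses no library search functions):
// true if every character of query appears in subject in the same relative order
static boolean containsInOrder(String subject, String query) {
    int qi = 0; // index into query
    for (int si = 0; si < subject.length() && qi < query.length(); si++) {
        if (subject.charAt(si) == query.charAt(qi)) {
            qi++; // matched: advance the query index (the subject index advances every iteration)
        }
    }
    return qi == query.length(); // did we reach the end of the query?
}
// containsInOrder("dura", "da") -> true; containsInOrder("dura", "ad") -> false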
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m, ii and jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(@eq, subject.', query);
% m: test each char of subject against each char of query
[ii jj] = find(m);
jj = jj.'; % jj: which char of query is found within subject...
ii = ii.'; % ii: ... and at which position of subject
test = nchoosek(1:numel(jj),numel(query)).'; % test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
% cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
% cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); % final result: 1 or 0
The code could be improved with a better approach for test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
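As a rough illustration only (plain character comparison with arbitrary match/mismatch/gap scores, not tuned to this question), a basic Smith-Waterman scoring pass in Java looks like this:
// Smith-Waterman local-alignment score (no traceback), using
// example scores: +2 match, -1 mismatch, -1 per gap character.
static int localAlignmentScore(String a, String b) {
    final int MATCH = 2, MISMATCH = -1, GAP = -1;
    int[][] h = new int[a.length() + 1][b.length() + 1]; // h[i][j]: best local alignment ending at a[i-1], b[j-1]
    int best = 0;
    for (int i = 1; i <= a.length(); i++) {
        for (int j = 1; j <= b.length(); j++) {
            int diag = h[i - 1][j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? MATCH : MISMATCH);
            int up = h[i - 1][j] + GAP;   // gap in b
            int left = h[i][j - 1] + GAP; // gap in a
            h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
            best = Math.max(best, h[i][j]);
        }
    }
    return best; // e.g. localAlignmentScore("dura", "da") == 2 with these scores
}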
Is this question actually from bioinformatics? If so, you should tag it as such.
I came across the word break problem which goes something like this:
Given an input string and a dictionary of words, segment the input string into a space-separated sequence of dictionary words if possible.
For example, if the input string is "applepie" and the dictionary contains a standard set of English words, then we would return the string "apple pie" as output.
I came up with a quadratic-time solution myself, and I have come across various other quadratic-time solutions using DP.
However, on Quora a user posted a linear-time solution to this problem.
I can't figure out how it comes out to be linear. Is there some mistake in the time complexity calculation? What is the best possible worst-case time complexity for this problem? I am posting the most common DP solution here:
String SegmentString(String input, Set<String> dict) {
    int len = input.length();
    for (int i = 1; i < len; i++) {
        String prefix = input.substring(0, i);
        if (dict.contains(prefix)) {
            String suffix = input.substring(i, len);
            if (dict.contains(suffix)) {
                return prefix + " " + suffix;
            }
        }
    }
    return null;
}
The 'linear-time' algorithm that you linked works as follows:
If the string is sharperneedle and the dictionary is sharp, sharper, needle,
it pushes sharp onto the result list.
Then it sees that er is not in the dictionary, but if we combine it with the last word added, then sharper exists. Hence it pops the last element and pushes sharper instead.
IMO the above logic fails for the string eaterror and the dictionary eat, eater, error.
Here er pops eat from the list and pushes eater. The remaining string ror is not recognized and gets discarded, even though eat error is a valid segmentation.
As regards the code you posted, as mentioned in the comments, it only works for splits into exactly two words (a single partition point); a sketch of the general quadratic DP is given below.
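For reference, a minimal sketch of the standard DP the question alludes to, which handles any number of words with a quadratic number of dictionary lookups (names are illustrative):
import java.util.*;

public class WordBreak {
    // dp[i] is true if input.substring(0, i) can be segmented into dictionary words
    static String segmentString(String input, Set<String> dict) {
        int len = input.length();
        boolean[] dp = new boolean[len + 1];
        int[] split = new int[len + 1]; // remembers where the last word of a valid prefix starts
        dp[0] = true;
        for (int i = 1; i <= len; i++) {
            for (int j = 0; j < i; j++) {
                if (dp[j] && dict.contains(input.substring(j, i))) {
                    dp[i] = true;
                    split[i] = j;
                    break;
                }
            }
        }
        if (!dp[len]) return null;
        // walk the split points back to rebuild the segmentation
        Deque<String> words = new ArrayDeque<>();
        for (int i = len; i > 0; i = split[i]) {
            words.addFirst(input.substring(split[i], i));
        }
        return String.join(" ", words);
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList("apple", "pie"));
        System.out.println(segmentString("applepie", dict)); // prints "apple pie"
    }
}
Run on eaterror with the dictionary eat, eater, error it returns eat error, the case where the greedy pop-and-merge approach fails.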
This is a coding exercise. Suppose I have to decide if one string is created by a cyclic shift of another. For example: cab is a cyclic shift of abc but cba is not.
Given two strings s1 and s2 we can do that as follows:
if (s1.length() != s2.length())
    return false;
for (int i = 0; i < s1.length(); i++)
    if ((s1.substring(i) + s1.substring(0, i)).equals(s2))
        return true;
return false;
Now what if I have an array of strings and want to find all strings that are cyclic shift of one another? For example: ["abc", "xyz", "yzx", "cab", "xxx"] -> ["abc", "cab"], ["xyz", "yzx"], ["xxx"]
It looks like I have to check all pairs of the strings. Is there a "better" (more efficient) way to do that?
As a start, you can know if a string s1 is a rotation of a string s2 with a single call to contains(), like this:
public boolean isRotation(String s1, String s2){
    // the lengths must match, otherwise any substring of s2+s2 would pass
    if (s1.length() != s2.length()) return false;
    String s2twice = s2 + s2;
    return s2twice.contains(s1);
}
Namely, if s1 is "rotation" and s2 is "otationr", the concat gives you "otationrotationr", which contains s1 indeed.
Now, even if we assume this is linear, or close to it (which is not impossible using Rabin-Karp, for instance), you are still left with O(n^2) pair comparisons, which may be too much.
What you could do is build a hashtable where the sorted word is the key and the posting list contains all the words from your list that, when sorted, give that key (i.e. key("bca") and key("cab") should both return "abc"):
private Map<String, List<String>> index;
/* ... */
public void buildIndex(String[] words){
    for(String word : words){
        String sortedWord = sortWord(word); // the word with its characters sorted, e.g. "cab" -> "abc"
        if(!index.containsKey(sortedWord)){
            index.put(sortedWord, new ArrayList<String>());
        }
        index.get(sortedWord).add(word);
    }
}
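The sortWord helper is not shown above; a straightforward version (using java.util.Arrays) would be:
// returns the word with its characters sorted, so all rotations and anagrams map to the same key
private String sortWord(String word){
    char[] chars = word.toCharArray();
    Arrays.sort(chars);
    return new String(chars);
}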
CAVEAT: The hashtable will contain, for each key, all the words that have exactly the same letters occurring the same number of times (not just the rotations; i.e. "abba" and "baba" will have the same key, but isRotation("abba", "baba") will return false).
But once you have built this index, you can significantly reduce the number of pairs you need to consider: if you want all the rotations for "bca" you just need to sort("bca"), look it up in the hashtable, and check (using the isRotation method above, if you want) if the words in the posting list are the result of a rotation or not.
If strings are short compared to the number of strings in the list, you can do significantly better by rotating all strings to some normal form (lexicographic minimum, for example). Then sort lexicographically and find runs of the same string. That's O(n log n), I think... neglecting string lengths. Something to try, maybe.
Concerning the way to find the pairs in the table, there could be many better ways, but what I came up with as a first thought is to sort the table and apply the check per adjacent pair.
This is much better and simpler than checking every string against every other string in the table.
Consider building an automaton for each string against which you wish to test.
Each automaton should have one entry point for each possible character in the string, and transitions for each character, plus an extra transition from the end to the start.
You could improve performance even further if you amalgamated the automata.
I think a combination of the answers by Patrick87 and savinos would make a fair amount of sense. Specifically, in a Java-esque pseudo-code:
List<String> inputs = Arrays.asList("abc", "xyz", "yzx", "cab", "xxx");
Map<String, List<String>> uniques = new HashMap<String, List<String>>();
for (String value : inputs) {
    String normalized = normalize(value); // e.g. the rotation with the lowest lexicographic order, as described below
    if (!uniques.containsKey(normalized)) {
        uniques.put(normalized, new ArrayList<String>());
    }
    uniques.get(normalized).add(value);
}
// you now have a Map of normalized strings to every string in the input
// that is "equal to" that normalized version
Normalizing the string, as stated by Patrick87, might be best done by picking the rotation of the string that results in the lowest lexicographic ordering.
It's worth noting, however, that the "best" algorithm probably relies heavily on the inputs... the number of strings, the length of those string, how many duplicates there are, etc.
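One possible normalize() for the code above is a brute-force pick of the smallest rotation; it is quadratic per string, and the next answer shows how to get the same result in linear time:
// lexicographically smallest rotation of s, by checking every rotation (O(s^2))
static String normalize(String s) {
    String doubled = s + s;
    String best = s;
    for (int i = 1; i < s.length(); i++) {
        String rotation = doubled.substring(i, i + s.length());
        if (rotation.compareTo(best) < 0) {
            best = rotation;
        }
    }
    return best;
}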
You can rotate all the strings to a normalized form using Booth's algorithm (https://en.wikipedia.org/wiki/Lexicographically_minimal_string_rotation) in O(s) time, where s is the length of the string.
You can then use the normalized form as a key in a HashMap (where the value is the set of rotations seen in the input). You can populate this HashMap in a single pass over the data, i.e. for each string:
calculate the normalized form
check if the HashMap contains the normalized form as a key; if not, insert an empty Set at this key
add the string to the Set in the HashMap
You then just need to output the values of the HashMap. This makes the total runtime of the algorithm O(n * s) - where n is the number of words and s is the average word length. The total space usage is also O(n * s).
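For completeness, here is a Java sketch of Booth's algorithm, following the pseudocode on the linked Wikipedia page (it returns the start index of the least rotation):
// Booth's algorithm: index at which the lexicographically least rotation of s starts
static int leastRotationIndex(String s) {
    String ss = s + s;            // doubling avoids modular arithmetic
    int n = ss.length();
    int[] f = new int[n];         // failure function
    java.util.Arrays.fill(f, -1);
    int k = 0;                    // start of the least rotation found so far
    for (int j = 1; j < n; j++) {
        char sj = ss.charAt(j);
        int i = f[j - k - 1];
        while (i != -1 && sj != ss.charAt(k + i + 1)) {
            if (sj < ss.charAt(k + i + 1)) {
                k = j - i - 1;
            }
            i = f[i];
        }
        if (sj != ss.charAt(k + i + 1)) { // here i == -1
            if (sj < ss.charAt(k)) {
                k = j;
            }
            f[j - k] = -1;
        } else {
            f[j - k] = i + 1;
        }
    }
    return k;
}

// normalized form in O(s): the rotation starting at that index
static String normalizeBooth(String s) {
    int k = leastRotationIndex(s);
    return s.substring(k) + s.substring(0, k);
}
Dropping normalizeBooth in as the normalize() used earlier keeps the whole grouping at O(n * s).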