How can I get the tokens (whether as a list of tokens, a TokenStream, or something else) that were used for a Field within a Document in a Lucene index? That is, is it possible to get back, from the index, the tokens that were passed in as tokens in the example below? (I'm not asking how to get tokens out of a TokenStream.)
doc.add(new Field("title", tokens))
In the documentation there's Field.tokenStreamValue(), but calling it on the field returned by doc.getFieldable(field_name) simply returns null.
I've also tried (from the third comment in lucene - Fieldable.tokenStreamValue()):
TokenSources.getTokenStream(reader, doc_id, field_name)
but I get
java.lang.IllegalArgumentException: title in doc #630 does not have any term position data stored
at org.apache.lucene.search.highlight.TokenSources.getTokenStream(TokenSources.java:256)
The TokenSources class is a helper class to retrieve the tokens of a document for highlighting purposes. There are two ways to retrieve the terms for a given document:
re-analyzing a stored field,
reading the document's terms vector.
The method you want to use tries to read the document's terms vector, but fails because you didn't enable term vectors at indexing time.
So you can either enable term vectors at indexing time and keep using this method (see the Field constructor and the documentation of Field.TermVector), or re-analyze the content of your stored fields. The first approach may perform better, especially for large fields, whereas the second saves space (if the field is already stored, there is no additional information to keep).
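If you go the term-vector route, the field has to be indexed with term vectors enabled. A minimal sketch against the Lucene 3.x Field constructor (the field name "title" and the titleText variable are placeholders matching the question):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;

// Index the field with term vectors including positions (and offsets),
// so TokenSources.getTokenStream(reader, docId, "title") can rebuild the tokens.
Document doc = new Document();
doc.add(new Field("title", titleText,
        Field.Store.YES,
        Field.Index.ANALYZED,
        Field.TermVector.WITH_POSITIONS_OFFSETS));
```

With plain Field.TermVector.YES only terms and frequencies are recorded; WITH_POSITIONS (or WITH_POSITIONS_OFFSETS) is what the highlighter's "does not have any term position data stored" error is asking for.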
Problem we are trying to solve:
Given a list of keys, what is the best way to get the values from an IMap when the number of entries is around 500K?
Also we need to filter the values based on fields.
Here is the example map we are trying to read from.
Given IMap<String, Object>
We are using protobuf to serialize the object. The object can be, say:
message Test {
  required mac_address eth_mac = 1;
  // size can be around 300 bytes
}
You can use IMap.getAll(keySet) if you know the keys beforehand. It's much better than single gets, since a bulk operation makes far fewer network trips.
For filtering, you can use predicates with IMap.values(predicate), IMap.entrySet(predicate), or IMap.keySet(predicate), depending on what you want back.
See more: http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#distributed-query
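A minimal sketch of both calls, assuming a map named "devices" and a queryable ethMac field on the value class (all names are placeholders; note that server-side predicates require Hazelcast to be able to read the fields, which takes extra serializer setup if the values are raw protobuf blobs):

```java
import com.hazelcast.core.Hazelcast;
import com.hazelcast.core.HazelcastInstance;
import com.hazelcast.core.IMap;
import com.hazelcast.query.Predicate;
import com.hazelcast.query.Predicates;

import java.io.Serializable;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class BulkReadSketch {

    // placeholder value class standing in for the protobuf message
    public static class Device implements Serializable {
        public String ethMac;
    }

    public static void main(String[] args) {
        HazelcastInstance hz = Hazelcast.newHazelcastInstance();
        IMap<String, Device> map = hz.getMap("devices");

        // bulk read: one batched operation instead of 500K single gets
        Set<String> keys = new HashSet<>(Arrays.asList("k1", "k2", "k3"));
        Map<String, Device> byKey = map.getAll(keys);

        // server-side filtering on a field of the value
        Predicate p = Predicates.equal("ethMac", "aa:bb:cc:dd:ee:ff");
        Collection<Device> matches = map.values(p);
    }
}
```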
In Cassandra, when specifying a table and fields, one has to give each field a type (text, int, boolean, etc.). The same applies to collections: you have to lock a collection to a specific type (set<text> and such).
I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>.
Is this possible in Cassandra, and if not, what workaround would you suggest for storing a list of mixed-type items? I sketched a few, but none of them seems the right way to go...
Cassandra's CQL interface is strictly typed, so you will not be able to create a table with an untyped collection column.
I basically see two options:
Create a list field, and convert everything to text (not too nice, I agree)
Use the Thrift API and store everything as is.
As suggested at http://www.mail-archive.com/user@cassandra.apache.org/msg37103.html, I decided to encode the various values into binary and store them in a list<blob>. The collection values can still be queried (in Cassandra 2.1+); one just needs to encode the values in the query.
In Python 2, the simplest way is probably to pickle and hex-encode when storing data:
pickle.dumps('Hello world').encode('hex')
And to load it:
pickle.loads(item.decode('hex'))
Using pickle ties the implementation to Python, but it automatically restores the correct type (int, string, boolean, etc.) when loading, which is convenient.
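The same round trip can be sketched in Python 3, where the 'hex' string codec is gone, using bytes.hex()/bytes.fromhex(); as above, this ties the stored data to Python's pickle format:

```python
import pickle

def to_blob(value):
    # pickle gives bytes; hex-encode for a printable form usable in a CQL query
    return pickle.dumps(value).hex()

def from_blob(hex_text):
    # decode the hex text back to bytes, then unpickle to the original type
    return pickle.loads(bytes.fromhex(hex_text))

mixed = ["Hello world", 42, True]           # mixed types for a list<blob> column
blobs = [to_blob(v) for v in mixed]
restored = [from_blob(b) for b in blobs]    # types survive the round trip
```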
I use a Cloudant CouchDB, and I've noticed that the "_changes" query on the database returns an "update_sequence" that is not a number, e.g.
"437985-g1AAAADveJzLYWBgYM..........".
What's more, the response is not stable: I get 3 different update_sequences if I query the DB 3 times.
Has there been any change in the known semantics of "update_sequence", "since", etc.?
Regards,
Vangelis
Paraphrasing an answer that Robert has previously given:
The update sequence values are opaque. In CouchDB, they are currently integers but, in Cloudant, the value is an encoding of the sequence value for each shard of the database. CouchDB will likely adopt this in future as clustering support is added (via the BigCouch merge).
In both CouchDB and Cloudant, _changes will return a "seq" value with every row that is guaranteed to return newer updates if you pass it back as "since". In cases of failover, that might include changes you've already seen.
So, the correct way to read changes since a particular update sequence is this:
Call /dbname/_changes?since=<checkpoint seq>.
Read the entire response, applying the changes as you go.
Record the last_seq value as your new checkpoint seq value.
Do not interpret the seq values, and do not compare them for equality. If you need to, you can record any "seq" value you see in step 2 as your current checkpoint seq value. The key thing you cannot do is compare them.
It'll jump around, the representation is a packed base64 string representing the update_seq of the various replicas of each shard of your database. It can't be a simple integer because it's a snapshot of a distributed database.
As for CouchDB, treat the update_seq as opaque JSON and you'll be fine.
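The checkpointing loop above can be sketched as a small function over one parsed _changes response; the seq strings are carried around as opaque tokens throughout (the sample response shape below is illustrative, with made-up seq values):

```python
def process_changes_batch(body, apply_change):
    # body: parsed JSON from /dbname/_changes?since=<checkpoint>
    # Apply every row, then return last_seq as the new checkpoint.
    # Never parse or compare the seq strings; only store and replay them.
    for row in body.get("results", []):
        apply_change(row)
    return body["last_seq"]

# canned response shaped like a Cloudant _changes reply (values are made up)
sample = {
    "results": [
        {"id": "doc1", "seq": "101-g1AAAA", "changes": [{"rev": "1-abc"}]},
        {"id": "doc2", "seq": "102-g1BBBB", "changes": [{"rev": "3-def"}]},
    ],
    "last_seq": "102-g1BBBB",
}
seen = []
checkpoint = process_changes_batch(sample, seen.append)
```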
I have a string attribute in a Core Data entity whose Max Length value is 40. I'd like to use this value in code and not have to re-type the value "40." Is this possible?
As @K.Steff says in the comments above, you are better off validating in your code and not setting a max length in your Core Data model. To add to that comment, I would also advise using a custom NSManagedObject subclass for this entity type and, within that subclass, overriding validateValue:forKey:error: or implementing a key-specific validation method for this property.
The value of this approach is that you can do things like "coerce" the validation by truncating strings at validation time. From the NSManagedObject documentation:
This method is responsible for two things: coercing the value into an appropriate type for the object, and validating it according to the object's rules.
The default implementation provided by NSManagedObject consults the object's entity description to coerce the value and to check for basic errors, such as a null value when that isn't allowed and the length of strings when a field width is specified for the attribute. It then searches for a method of the form validate<Key>:error: and invokes it if it exists.
You can implement methods of the form validate<Key>:error: to perform validation that is not possible using the constraints available in the property description. If it finds an unacceptable value, your validation method should return NO and in error an NSError object that describes the problem. For more details, see "Model Object Validation". For inter-property validation (to check for combinations of values that are invalid), see validateForUpdate: and related methods.
So you can implement this method to both validate that the string is not too long and, if necessary, truncate it when it is too long.
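A sketch of such a key-specific validation method, assuming a hypothetical entity subclass with a title attribute and the 40-character limit from the question (the class and attribute names are placeholders):

```objc
// Hypothetical NSManagedObject subclass; "Note", "title" and 40 are assumptions.
@interface Note : NSManagedObject
@property (nonatomic, copy) NSString *title;
@end

@implementation Note

// Key-specific validation: coerce (truncate) the value instead of rejecting it.
- (BOOL)validateTitle:(id *)value error:(NSError **)error {
    NSString *title = *value;
    if ([title length] > 40) {
        *value = [title substringToIndex:40];  // replace the proposed value
    }
    return YES;
}

@end
```

Replacing the object at *value is how coercion is done here; Apple's documentation notes memory-management caveats when substituting a new value inside a validation method, so check the current docs before relying on this pattern.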
From NSManagedObject you can access the NSEntityDescription via entity. In there you can grab the array properties or the dictionary propertiesByName, either of which will get you to NSPropertyDescription objects. Each property description has a validationPredicates property that returns an array of NSPredicates; one of those will be the condition that your string length must be at most 40.
Sadly predicates are a lot of hassle to reverse engineer — and doing so can even be impossible, given that you can create one by supplying a block. Hopefully though you'll just have an NSComparisonPredicate or be able to get to one by tree walking downward from an NSCompoundPredicate or an NSExpression.
From the comparison predicate you'll be able to spot from the left and right expressions that one is string length and the other is a constant value.
So, in summary:
Core Data exposes validation criteria only via the very general means of predicates;
you can usually, but not always, rebuild an expression (in the natural language sense rather than the NSExpression sense) from a predicate; and
if you know specifically you're just looking for a length comparison somewhere then you can simplify that further into a tree walk for comparison predicates that involve the length.
It's definitely not going to be pretty because of the mismatch of the specific and the general but it is possible.
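The tree walk can be sketched for the simple case where the validation predicate is a single comparison (entity and attribute names, "Note" and "title", are placeholders; a real implementation would also recurse into NSCompoundPredicate):

```objc
// Recover a "length <= 40"-style constant from the model at run time.
NSEntityDescription *entity = [NSEntityDescription entityForName:@"Note"
                                          inManagedObjectContext:context];
NSPropertyDescription *prop = entity.propertiesByName[@"title"];

NSUInteger maxLength = 0;
for (NSPredicate *predicate in prop.validationPredicates) {
    if (![predicate isKindOfClass:[NSComparisonPredicate class]]) continue;
    NSComparisonPredicate *cmp = (NSComparisonPredicate *)predicate;
    // expect one side to be the length expression and the other a constant
    if (cmp.rightExpression.expressionType == NSConstantValueExpressionType) {
        maxLength = [[cmp.rightExpression constantValue] unsignedIntegerValue];
    }
}
```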
I am using the Stanford coreNLP ( http://nlp.stanford.edu/software/corenlp.shtml ) in order to parse sentences and extract dependencies between the words.
I have managed to create the dependency graph as in the example at the supplied link, but I don't know how to work with it. I can print the entire graph using the toString() method. The problem is that the methods that search for certain words in the graph, such as getChildList, require an IndexedWord object as a parameter. It is clear why: the nodes of the graph are of type IndexedWord. But it's not clear to me how to create such an object in order to search for a specific node.
For example: I want to find the children of the node that represents the word "problem" in my sentence. How do I create an IndexedWord object that represents the word "problem" so I can search for it in the graph?
In general, you shouldn't be creating your own IndexedWord objects. (These are used to represent "word tokens", i.e., particular words in a text, not "word types", and so asking for the word "problem" -- a word type -- isn't really valid; in particular, a sentence could have multiple tokens of this word type.)
There are a couple of convenience methods that let you do what you want:
sg.getNodeByWordPattern(String pattern)
sg.getAllNodesByWordPattern(String pattern)
The first is a little dangerous, since it just returns the first IndexedWord matching the pattern, or null if there are none. But it's most directly what you asked for.
Some other methods to start from are:
sg.getFirstRoot() to find the (first, usually only) root of the graph and then to navigate down from there, such as by using the sg.getChildren(root) method.
sg.vertexSet() to get all of the IndexedWord objects in the graph.
sg.getNodeByIndex(int) if you already know the input sentence, and therefore can ask for words by their integer index.
Commonly these methods leave you iterating through nodes. Really, the first two get...Node... methods just do the iteration for you.
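Putting the pattern lookup together (sg is assumed to be a SemanticGraph produced by the CoreNLP pipeline for your sentence; a word type can match several tokens, hence the outer loop):

```java
import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.semgraph.SemanticGraph;

import java.util.List;

public class FindChildrenSketch {

    // Print every child of every node whose word matches the given pattern.
    static void printChildrenOf(SemanticGraph sg, String wordPattern) {
        for (IndexedWord node : sg.getAllNodesByWordPattern(wordPattern)) {
            List<IndexedWord> children = sg.getChildList(node);
            for (IndexedWord child : children) {
                System.out.println(node.word() + " -> " + child.word());
            }
        }
    }
}
```

Calling printChildrenOf(sg, "problem") would cover the example from the question without ever constructing an IndexedWord by hand.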