Retrieve analyzed tokens from ElasticSearch documents

Trying to access the analyzed/tokenized text in my ElasticSearch documents.
I know you can use the Analyze API to analyze arbitrary text according to your analysis modules, so I could copy and paste data from my documents into the Analyze API to see how it was tokenized.
This seems unnecessarily time consuming, though. Is there any way to instruct ElasticSearch to return the tokenized text in search results? I've looked through the docs and haven't found anything.

This question is a little old, but I think an additional answer is necessary.
With ElasticSearch 1.0.0 the Term Vector API was added, which gives you direct access to the tokens ElasticSearch stores under the hood on a per-document basis. The API docs are not very clear on this (it's only mentioned in the example), but in order to use the API you first have to indicate in your mapping definition that you want term vectors stored, via the term_vector property on each field.
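For example, a hedged sketch (index, type, and field names are made up; on the 1.x line the per-document endpoint is _termvector): first declare term vectors in the mapping, then fetch the stored tokens for a single document.
# Enable term vectors on the field at mapping time (names are hypothetical):
curl -XPUT 'http://localhost:9200/my_index' -d '{
  "mappings": {
    "my_type": {
      "properties": {
        "text": {
          "type": "string",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}'
# After indexing, fetch the stored tokens for one document:
curl -XGET 'http://localhost:9200/my_index/my_type/1/_termvector?pretty=true'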

Have a look at this other answer: elasticsearch - Return the tokens of a field. Unfortunately, it requires reanalyzing the content of your field on the fly using the provided script.
It should be possible to write a plugin to expose this feature. The idea would be to add two endpoints:
one to read the Lucene TermsEnum, like the Solr TermsComponent does, which is also useful for auto-suggestions. Note that this wouldn't be per document: it's every term in the index, with its term frequency and document frequency (potentially expensive with a lot of unique terms)
one to read the term vectors, if enabled, like the Solr TermVectorComponent does. This would be per document, but it requires storing the term vectors (you can configure that in your mapping) and also allows retrieving positions and offsets, if enabled

You may want to use scripting; however, your server must have scripting enabled.
curl 'http://localhost:9200/your_index/your_type/_search?pretty=true' -d '{
  "query" : {
    "match_all" : { }
  },
  "script_fields": {
    "terms" : {
      "script": "doc[field].values",
      "params": {
        "field": "field_x.field_y"
      }
    }
  }
}'
The default setting for allowing scripts depends on the Elasticsearch version, so please check the official documentation for your release.
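As a hedged example: on the 1.x line, dynamic scripting was toggled in elasticsearch.yml roughly like this (the exact setting name changed across versions, so verify against your release).
# elasticsearch.yml (ES 1.x; setting names vary by version)
script.disable_dynamic: false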

Related

Azure Cognitive Search - return full json as SearchDocument?

I'm using Azure.Search.Documents in C# to index JSON documents in Azure blob storage. About half of the fields of each JSON doc are meant to be searchable or fielded. The JSON also includes some fields that I don't want evaluated by my search.
My goal is to return the entire JSON document in my search results.
It seems like my choices are to (a) add SearchField records to my SearchIndex for every aspect of the document (in which the SearchDocument results are ready for me to use) or (b) leverage metadata_storage_path / metadata_storage_name and do a separate fetch for the document itself.
Option (b) feels less efficient, considering that the SearchDocument returned is already so close to the full JSON; it seems a shame to have to make a separate fetch for each document. But for option (a) to work, I'd need to tell the SearchIndex about the extra fields without them triggering false positive search results.
For (a) is there a way to add SearchFields (or the equivalent) and have them not trigger false positives? (IsSearchable seems to affect how, but not whether, they are evaluated). Also, if (b) is the better approach, is there a way to do this using "new SearchField" as opposed to declared via attributes? Thanks.
Thank you Vince. Adding your comment as an answer to help other community users.
Set IsSearchable to FALSE
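The SDK's IsSearchable property corresponds to the searchable attribute in the REST index definition. A minimal hedged sketch (service name, index name, field names, and api-version are assumptions) that keeps a raw field retrievable while excluding it from full-text search:
curl -X PUT 'https://YOUR-SERVICE.search.windows.net/indexes/my-index?api-version=2020-06-30' \
  -H 'Content-Type: application/json' -H 'api-key: YOUR-ADMIN-KEY' \
  -d '{
    "name": "my-index",
    "fields": [
      { "name": "id",      "type": "Edm.String", "key": true },
      { "name": "title",   "type": "Edm.String", "searchable": true },
      { "name": "rawJson", "type": "Edm.String", "searchable": false, "retrievable": true }
    ]
  }'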

Google Freebase Search API Alternative?

Google deprecated their Freebase Search API and is transferring things over to Wikidata, but there appears to be no replacement for their Freebase Search API (https://developers.google.com/freebase/v1/search-overview) that supports:
Autosuggesting entities (e.g. the Freebase Suggest widget).
Getting a ranked list of the most notable entities with a given name.
Finding entities using Search Metaschema.
Moreover, it would also take in malformed strings and correct them, and return nice, detailed relevancy rankings along with the associated Freebase topic id. I can't find anything in their Custom Search API that returns any information relevant to their, or any other, knowledge graph.
Ideally I would like something I can query in a similar way and that returns results like it used to.
For example, a query for "Nirvana" in the Freebase Search API would return:
{
  "status": "200 OK",
  "result": [
    {
      "mid": "/m/0b1zz",
      "name": "Nirvana",
      "notable": { "name": "Record Producer", "id": "/music/producer" },
      "score": 55.227268
    },
    {
      "mid": "/m/05b3c",
      "name": "Nirvana",
      "notable": { "name": "Belief", "id": "/religion/belief" },
      "score": 44.248726
    },
    {
      "mid": "/m/01h89tx",
      "name": "Nirvana",
      "notable": { "name": "Musical Album", "id": "/music/album" },
      "score": 30.371510
    },
    {
      "mid": "/m/01rn9fm",
      "name": "Nirvana",
      "notable": { "name": "Musical Group", "id": "/music/musical_group" },
      "score": 30.092449
    },
    {
      "mid": "/m/02_6qh",
      "name": "Nirvana",
      "notable": { "name": "Film", "id": "/film/film" },
      "score": 29.003593
    },
    {
      "mid": "/m/01rkx5",
      "name": "Nirvana Sutra",
      "score": 21.344824
    }
  ],
  "cost": 10,
  "hits": 0
}
Note the relevance scores and Freebase mids.
Essentially, are there any alternatives out there, either open source or commercial, that replace this much-needed functionality?
I've used the Prismatic Interest Graph API for somewhat similar functionality. My use case was a bit different (tagging documents with topics), but looking at their API endpoints you might be able to duplicate the functionality you described above with a query to topic/search (search for topics that match a search string) and a query to topic/topic (search for similar topics, sorted by score).
Edit
As David notes in the comments below, the Prismatic Interest Graph API has been discontinued.
Also, the Google Knowledge Graph Search API now seems to be the intended replacement for the Freebase Search API.
How about the Google Knowledge Graph Search API? There is also a web application exposing the API.
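For instance, a hedged sketch of a query against it (you need your own API key; the response includes resultScore values and kg:/m/... IDs that line up with the old Freebase mids):
curl 'https://kgsearch.googleapis.com/v1/entities:search?query=Nirvana&limit=5&indent=true&key=YOUR_API_KEY'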
The :BaseKB project offers Freebase data (plus some other data) as RDF. :BaseKB's data can be downloaded for free or easily run on an AWS instance for live queries. The AWS machine image contains a Virtuoso database so you can query it with the SPARQL query language.
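As a small smoke test against such a Virtuoso instance (host and port are assumptions; /sparql is Virtuoso's default endpoint path), you can count triples without relying on any particular :BaseKB predicate:
curl -G 'http://localhost:8890/sparql' \
  -H 'Accept: application/sparql-results+json' \
  --data-urlencode 'query=SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }'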

How to do "Not Equals" in couchdb?

Folks, I was wondering what the best way is to model documents and/or map functions so that I can run "Not Equals" queries.
For example, my documents are:
1. { name : 'George' }
2. { name : 'Carlin' }
I want to trigger a query that returns every document where name is not equal to 'John'.
Note: I don't have all possible names beforehand, so the parameter in the query can be any arbitrary text, like 'John' in my example.
In short: there is no easy solution.
You have four options:
sending a multi range query
filter the view response with a server-side list function
using a CouchDB plugin
use the mango query language
sending a multi range query
You can request the view with two ranges defined by startkey and endkey. You have to choose the ranges so that the key John is not requested.
Unfortunately, you have to find the commit request that exists somewhere and compile your CouchDB with it. It's not included in the official source.
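Without that patch, the usual approximation is two separate range requests against a view keyed on name (database and view names here are made up), splitting the key space around John:
# Everything strictly below "John" (inclusive_end=false makes endkey exclusive):
curl -G 'http://localhost:5984/mydb/_design/people/_view/by_name' \
  --data-urlencode 'endkey="John"' \
  --data-urlencode 'inclusive_end=false'
# Everything strictly above "John" ("John\u0000" is the next possible key):
curl -G 'http://localhost:5984/mydb/_design/people/_view/by_name' \
  --data-urlencode 'startkey="John\u0000"'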
filter the view response with a server-side list function
It's not recommended, but you can use a list function and ignore the rows with the key John in your response. It's like you would do it with a JavaScript array.
using a CouchDB plugin
Create an additional index with e.g. couchdb-lucene. The Lucene server has such query capabilities.
use the "mango" query language
It's included in the CouchDB 2.0 developer preview. It's not ready for production yet, but it will definitely be included in the stable release.
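With Mango, the $ne operator expresses this directly. A minimal hedged sketch against a hypothetical database (note that $ne cannot be served from an index and is applied as an in-memory filter):
curl -X POST 'http://localhost:5984/mydb/_find' \
  -H 'Content-Type: application/json' \
  -d '{ "selector": { "name": { "$ne": "John" } } }'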

CouchDB and Couchbase Document Keys

In reference material for CouchDB and Couchbase it's common guidance to store the type of a document as a parameter within the actual document.
I've got a database, where I have different documents that record certain behaviour by URL. So naturally, I use the URL as the id of the document.
The problem I find is that by using just the key as the document id, I now get clashes between documents of different types. So I have started using the type as the first part of the key like this:
{ doc._id: "rss_entry|http://www.spiegel.de/1234", [...] }
{ doc._id: "page_text|http://www.spiegel.de/1234", [...] }
Now I start to wonder why I've never seen this approach to model type in any of the documentation.
Prefixes are commonly used. In addition to supporting scenarios such as yours, prefixing allows one to perform logical range queries against views. This technique is used in the modeling examples, but perhaps the concept is not described in as much detail as you are expecting.
In the section http://docs.couchbase.com/couchbase-devguide-2.5/#modeling-documents, the documents are keyed as beer_NNNN and brewery_NNNN. The section http://docs.couchbase.com/couchbase-devguide-2.5/#using-reference-documents-for-lookups goes a bit deeper into this technique: there is a counter document named user::count and then each user is keyed as user::NNNN. Additionally, there are documents in the example that are keyed as fb::NNNN for a Facebook ID, email::XXX@YYYY.com for a user's email address, etc.
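For example, with prefixed keys like those above, a hedged CouchDB sketch of such a range query, fetching only the rss_entry documents (the database name is made up; the high Unicode sentinel \ufff0 caps the prefix range):
curl -G 'http://localhost:5984/mydb/_all_docs' \
  --data-urlencode 'startkey="rss_entry|"' \
  --data-urlencode 'endkey="rss_entry|\ufff0"' \
  --data-urlencode 'include_docs=true'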

WildcardQuery error in Solr

I use Solr to search for documents, and when trying to search with the query "id:*" I get a query parser exception telling me that it cannot parse a query with * or ? as the first character.
HTTP Status 400 - org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
type Status report
message org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery
description The request sent by the client was syntactically incorrect (org.apache.lucene.queryParser.ParseException: Cannot parse 'id:*': '*' or '?' not allowed as first character in WildcardQuery).
Is there any patch for getting this to work with just * ? Or is it very costly to do such a query?
If you want all documents, do a query on *:*
If you want all documents with a certain field (e.g. id) try id:[* TO *]
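Both forms as plain HTTP requests against a default Solr setup (host, port, and handler path are assumptions; the range brackets need URL encoding):
# All documents:
curl 'http://localhost:8983/solr/select?q=*:*'
# All documents that have an id field (id:[* TO *], URL-encoded):
curl 'http://localhost:8983/solr/select?q=id:%5B*%20TO%20*%5D'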
Lucene doesn't allow you to start WildcardQueries with an asterisk by default, because those are incredibly expensive queries and will be very, very, very slow on large indexes.
If you're using the Lucene QueryParser, call setAllowLeadingWildcard(true) on it to enable it.
If you want all of the documents with a certain field set, you are much better off querying or walking the index programmatically than using QueryParser. You should really only use QueryParser to parse user input.
id:[a* TO z*] id:[0* TO 9*] etc.
I just did this in lukeall on my index and it worked, so it should work in Solr, which uses the standard Lucene query parser. (I don't actually use Solr.)
In raw Lucene there's a good reason you'd never query for every document: to run a query you must open an IndexReader on the index directory and apply the query to it. You can therefore skip the query entirely and use the IndexReader methods numDocs() to get a count of all the documents, and document(int n) to retrieve any of them.
If you are just trying to get all documents, Solr does support the *:* query. It's the only time I know of that Solr will let you begin a query with an *. I'm sure you've probably seen this as the default query in the Solr admin page.
If you are trying to do a more specific query with an * as the first character, like say id:*456 then one of the best ways I've seen is to index that field twice. Once normally (field name: id), and once with all the characters reversed (field name: reverse_id). Then you could essentially do the query id:456 by sending the query reverse_id:654 instead. Hope that makes sense.
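A hedged sketch of that double-indexing idea (field names follow the answer; the core URL and the XML update handler are assumptions about your setup):
# Index the value both ways:
curl 'http://localhost:8983/solr/update?commit=true' \
  -H 'Content-Type: text/xml' \
  --data-binary '<add><doc>
    <field name="id">123456</field>
    <field name="reverse_id">654321</field>
  </doc></add>'
# The leading-wildcard query id:*456 then becomes a cheap prefix query:
curl 'http://localhost:8983/solr/select?q=reverse_id:654*'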
You can also search the Solr user mailing list archive at http://www.mail-archive.com/solr-user@lucene.apache.org/ where questions like this come up quite often.
The following Solr issue is a request to make the default Lucene query parser configurable:
https://issues.apache.org/jira/browse/SOLR-218
In this issue you can find a description of how to 'patch' Solr. The modification allows you to start queries with a *.
Jonas Salk: I've basically updated only one Java file: SolrQueryParser.java.
public SolrQueryParser(IndexSchema schema, String defaultField) {
    ...
    setAllowLeadingWildcard(true);
    setLowercaseExpandedTerms(true);
    ...
}

...

public SolrQueryParser(QParser parser, String defaultField, Analyzer analyzer) {
    ...
    setAllowLeadingWildcard(true);
    setLowercaseExpandedTerms(true);
    ...
}
I'm not sure if setLowercaseExpandedTerms is needed...
I'm assuming with id:* you're just trying to match all documents, right?
I've never used Solr before, but in my Lucene experience, when ingesting data we added a hidden field to every document; when we need to return every record, we search for the string constant in that field that's the same for every record.
If you can't add a field like that in your situation, you could use a RegexQuery with a regex that would match anything that could be found in the id field.
Edit: actually answering the question. I've never heard of a patch to get that to work, but I would be surprised if it could be made to work reasonably well. See this question for a reason why unconstrained PrefixQueries can cause problems.
Actually, I have been using a workaround for this: I append a character to the id, e.g. A1, A2, etc.
With such values in the field, it is possible to search using the query id:A*.
But I would love to find out whether a true solution exists.
