Does IBM-Graph use a search index? If so, which one?

My understanding is that IBM-Graph uses Titan, backed by Cassandra as its persistent datastore.
In this stack it is usual to have a separate search index such as Solr, Lucene or Elasticsearch, in order to enable more advanced queries like full-text search and geo-related queries.
Does IBM-Graph implement a search index like this? If so, which one? Also, are these more advanced queries exposed via Gremlin, i.e. can we make use of this search index directly in order to perform full-text queries?

IBM Graph supports search indexes: set composite to false when you create an index and a mixed index will be created. FYI, the API doc: https://ibm-graph-docs.ng.bluemix.net/api.html#index-apis
However, IBM Graph only supports first-level indexes. For example, an index on the field name is available to the Gremlin query g.V().has("name","Jack"), but not to the second criterion has("age",20) in the query g.V().has("name","Jack").out().has("age",20).
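For illustration, here is a sketch of creating such a mixed index through the schema/index API described in the doc above (the property key and index name are illustrative, and the exact endpoint path may differ; check the linked API doc):

POST /schema
{
  "propertyKeys": [
    { "name": "name", "dataType": "String", "cardinality": "SINGLE" }
  ],
  "vertexIndexes": [
    { "name": "vByName", "propertyKeys": ["name"], "composite": false, "unique": false }
  ]
}

With such an index in place, g.V().has("name","Jack") can be answered from the index, while the second has() step in g.V().has("name","Jack").out().has("age",20) is still evaluated without index support.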

Related

Can you find a specific document's position in a sorted Azure Search index?

We have several Azure Search indexes that use a Cosmos DB collection of 25K documents as a source, and each index has a large number of document properties that can be used for sorting and filtering.
We have a requirement to allow users to sort and filter the documents, and then search for and jump to a specific document's page in the paginated result set.
Is it possible to query an Azure Search index with sorting and filtering and get the position/rank of a specific document ID from the result set? Would I need to look at an alternative option? I believe there could be a way of doing this with a SQL back-end, but obviously that would be a major undertaking to implement.
I've yet to find a way of doing this other than writing a query that paginates through the results until it finds the required document, which would be relatively expensive and possibly slow in terms of processing on the server.
There is no mechanism in Azure Search for filtering within the result set of another query. You'd have to page through the results, looking for the document ID on the client side. If your queries aren't very selective and produce many pages of results, this can be slow, as $skip actually re-evaluates all results up to the page you specify.
You could use caching to make this faster. At least one Azure Search customer is using Redis to cache search results. If your queries are selective enough, you could even cache the results in memory so you'd only pay the cost of paging once.
I'm trying this at the moment, using a two-step process:
Generate your query, but set $count=true and $top=0. The query result should contain a field named @odata.count.
You can then pick an index and use $top=1 and $skip=<index> to return a single entry. There is one caveat: $skip only accepts numbers less than 100000.
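A sketch of the two requests against the REST API (service name, index name, filter, sort order, and api-version are placeholders):

GET https://<service>.search.windows.net/indexes/<index>/docs?search=*&$filter=<filter>&$orderby=<sort>&$count=true&$top=0&api-version=2017-11-11
(the response body contains "@odata.count": <total>)

GET https://<service>.search.windows.net/indexes/<index>/docs?search=*&$filter=<filter>&$orderby=<sort>&$top=1&$skip=<index>&api-version=2017-11-11
(returns the single document at position <index> in the sorted, filtered result set)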

Google Datastore query filter for multiple values for same property

I have a query I wish to run on Google Datastore that is intended to retrieve data from multiple devices. However, I couldn't find anything in the documentation that would allow me to get data from, say, device-1 or device-2 or device-3; only one value can be set per property name. Is this a Datastore limitation, or am I just missing something?
Based on the Node.js client library, the query might look something like the filter criteria below:
// Note: chained .filter() calls are combined with AND,
// so this matches nothing rather than device 1 OR 2 OR 3.
var query = datastore.createQuery('data')
    .filter('device_id', 1)
    .filter('device_id', 2)
    .filter('device_id', 3);
Otherwise, I might have to run separate queries for the various devices, which doesn't seem like a very elegant solution, especially if there are a lot of devices to simultaneously run queries on.
Any suggestions for the Datastore API or alternative approaches are welcome!
Yes, this would be an OR operation, which is one of the documented Restrictions on queries (emphasis mine):
The nature of the index query mechanism imposes certain restrictions on what a query can do. Cloud Datastore queries do not support substring matches, case-insensitive matches, or so-called full-text search. The NOT, OR, and != operators are not natively supported, but some client libraries may add support on top of Cloud Datastore.
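Since OR isn't natively supported, one workaround is the approach the question already suggests: one query per device, merged client-side. A minimal sketch with the Node.js client library (the kind and property names come from the question; the exact .filter() signature varies between library versions):

const {Datastore} = require('@google-cloud/datastore');
const datastore = new Datastore();

// Emulate device_id IN (1, 2, 3): run one query per value and merge.
async function dataForDevices(deviceIds) {
  const results = await Promise.all(deviceIds.map((id) =>
    datastore.runQuery(
      datastore.createQuery('data').filter('device_id', '=', id)
    )
  ));
  // runQuery resolves to [entities, queryInfo]; keep and flatten the entities.
  return results.flatMap(([entities]) => entities);
}

dataForDevices([1, 2, 3]).then((entities) => console.log(entities.length));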

Field.Store and Field.Index both set to `NO` in a Lucene document?

I am aware of what Field.Store and Field.Index mean in a Lucene document, and aware of the use-cases where either Field.Store or Field.Index is set to NO.
But recently I came across a piece of code where both are set to NO. Could anybody explain, with an example, the use-case where we need to set both of them to NO?
PS: I referred to this SO question, which explains why one is set to NO and the other to YES, with good use-cases, but it doesn't answer my question.
Lucene is a generic full-text indexing and search library; it is not a framework in itself like Elasticsearch or Solr.
So, if you are developing your search application and using Lucene directly, then you have full control over which fields to index and/or which fields to store in the Lucene inverted index.
Frameworks like Elasticsearch or Solr, which are built on top of Lucene, may use a schema for indexing, or they may be schemaless.
I think in the schemaless case it makes sense to explicitly ignore the fields that we want neither to index nor to store.
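One concrete case where a field is neither indexed nor stored is a doc-values field in more recent Lucene versions: the value lives in a column-oriented per-document structure that can be used for sorting and faceting, but it is not searchable and is not returned with the document. A minimal sketch (the field name and value are illustrative):

import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

Document doc = new Document();
// Not indexed for search and not stored for retrieval:
// the value is only reachable through doc values,
// e.g. for sorting results by price at query time.
doc.add(new NumericDocValuesField("price", 42L));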

Neo4j as a search engine

I have run several tests and read a lot of cases about using Neo4j for graph-based search. I am convinced by features such as the flexible schema and real-time search and retrieval. But I also realise it is not designed to store documents to facilitate full-text search. For me, the potential of this product lies in the business value of data relationships.
The product matches my case for 99%: an 'internal Google' for the company where I work, except for full-text search on documents (Word, PDF, etc.). This is not a hard requirement, but a nice-to-have. Nevertheless, should I drop the specific Neo4j features and go for a product like Elasticsearch, or is Neo4j the product we are looking for?
There are a few options for text search in Neo4j:
Cypher (the Neo4j query language) includes a few string comparison operators: CONTAINS, STARTS WITH and ENDS WITH. For example:
MATCH (d:Document) WHERE d.title STARTS WITH "Graph"
RETURN d
You can also make use of Lucene queries with Neo4j through "legacy" indexes. For example:
START doc=node:node_auto_index("title:graph*")
...
See this post for more information.
You can also model documents as graphs, and query them using Cypher as a graph model. For example, see the Neo4j Doc Manager project for converting data from MongoDB to Neo4j.
Finally, you can also use Neo4j and Elasticsearch together, indexing text data in Elasticsearch and using Neo4j for graph traversals. See this project.

What's the best approach for using SOLR with web projects?

OK, I'm totally new to Solr and Lucene, but I've got Solr running out of the box under Tomcat 6.x and have just gone over some of the basic wiki entries.
I have a few questions, and require some suggestions too.
Solr can index data in files (XML, CSV) and it can also index databases. Can you also just point it at a URI/domain and have it index a website the way Google would?
If I have a website with "Pages" data ("Page Name", "Page Content", etc.) and "Products" data ("Product Name", "SKU", etc.), do I need two different schema.xml files? And if so, does that mean two different instances of Solr?
Finally, if you have a project with a large relational and normalized database, what would you say is the best approach out of the three options below?
Have a middleware service running in the background, which mines the DB and manually creates the relevant XML files to then send to Solr
Have Solr index the DB directly. In this case, would it be best to just point Solr at views, which would abstract away all the table relationships?
Any other options I'm unaware of?
Context: We're running in a Windows 2003 environment, .NET 3.5, SQLServer 2005/2008
cheers!
No, you need a crawler for that, e.g. Nutch.
Yes, you want two separate indexes (i.e. two schema.xml files), since the datasets don't seem to be related. This doesn't mean two instances of Solr: you can manage the two indexes with cores, as sketched below.
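A sketch of what that looks like in a legacy (pre-SolrCloud) solr.xml, assuming core names of pages and products (illustrative):

<solr persistent="true">
  <cores adminPath="/admin/cores">
    <core name="pages" instanceDir="pages" />
    <core name="products" instanceDir="products" />
  </cores>
</solr>

Each instanceDir carries its own conf/schema.xml and conf/solrconfig.xml, so the two datasets get independent schemas inside a single Solr deployment.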
As for populating the Solr index, it depends on your particular project: for example, can it tolerate stale data, or does it have to be absolutely fresh?
Other options to index data include:
Database triggers
If you're using some sort of ORM, use its interception capabilities. For example, you can use NHibernate events to update the index on update, insert or delete. If you use NHibernate and SolrNet, this is taken care of automatically.
I think Mauricio is dead on with his advice. The only point I would add concerns the choice between a "middleware" indexer and using the database directly: if your database (or your views) maps very closely to what a good Solr schema wants, then DIH is great. But if you are indexing from multiple sources of data, or if you have to munge the data in your database into what Solr would like, then a dedicated middleware indexer is better.
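For the DIH route, the mapping lives in a data-config.xml. A minimal sketch against the SQL Server back-end mentioned in the question (connection details, table, and column names are placeholders):

<dataConfig>
  <dataSource driver="com.microsoft.sqlserver.jdbc.SQLServerDriver"
              url="jdbc:sqlserver://localhost;databaseName=shop"
              user="solr" password="changeme" />
  <document>
    <entity name="product" query="SELECT SKU, ProductName FROM Products">
      <field column="SKU" name="sku" />
      <field column="ProductName" name="name" />
    </entity>
  </document>
</dataConfig>

The entity query can just as well point at a view, which is one way to hide join complexity from Solr.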
