Is there a way to use Lucene to work with graph data?
Example
One user has a relationship with many Lucene documents (Document Connections)
One User has a relationship with other Users (User Connections [Graph])
If a user searches the index, they get back the documents they have a relationship with. This is simple and straightforward.
What would be a way to get back the documents that the User Connections have a relationship with?
Indexing each document with all the users that have a relationship with it in a user_id field is one approach. However, when you query the index by supplying the User Connections of the user performing the search, the query size is unpredictable. Think of users that have thousands of User Connections. This will not scale.
It's almost as if the User Connections and User Documents stored in a graph DB can easily provide the set of documents to search against, but what is an effective way to communicate that set to Lucene so that it searches only those documents for the given query? If any results are returned, this guarantees that at least one of the User Connections has a relationship with each document returned in the results.
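To make the question concrete, the kind of query-time restriction I mean might look something like the following Lucene sketch. This is only an illustration under assumptions: the field names ("body", "doc_id") are made up, and it presumes the graph DB has already handed back the list of allowed document IDs, which still has to be shipped into every query.

```java
// Hypothetical sketch only: field names ("body", "doc_id") are invented.
// The graph DB is assumed to have already returned the IDs of documents the user's
// connections are related to; Lucene then restricts the text query to that set.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.util.BytesRef;

import java.util.List;
import java.util.stream.Collectors;

public class ConnectionScopedSearch {

    // Search only within the documents the user's connections are related to.
    public static TopDocs search(IndexSearcher searcher, String userInput,
                                 List<String> allowedDocIds) throws Exception {
        // The free-text part of the query, as typed by the user.
        Query textQuery = new QueryParser("body", new StandardAnalyzer()).parse(userInput);

        // Restrict matches to the allowed document IDs supplied by the graph DB.
        Query idFilter = new TermInSetQuery("doc_id",
                allowedDocIds.stream().map(BytesRef::new).collect(Collectors.toList()));

        BooleanQuery query = new BooleanQuery.Builder()
                .add(textQuery, BooleanClause.Occur.MUST)    // contributes to scoring
                .add(idFilter, BooleanClause.Occur.FILTER)   // restricts, does not score
                .build();

        return searcher.search(query, 10);
    }
}
```

Note that this still requires passing every allowed ID into the query, so for users with huge connection lists it only moves the size problem around rather than removing it.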
I don't believe there is currently any graph technology that sits on top of Solr or Lucene.
You would probably be best off looking at one of these two camps:
Neo4j with Spring Data (free for single instance)
OR
TinkerPop Blueprints (possibly Rexster if not using Java/Scala)
on one of these technologies:
Titan on Cassandra with Hadoop (multi-master, no single point of failure)
OrientDB
Neo4j
These databases are graph databases.
TinkerPop Blueprints is a standard that lets you abstract away the specific implementation.
Spring Data currently supports only Neo4j among graph technologies.
Neo4j costs money if you cluster (the free license is single-instance only).
You can read a discussion on Solr/Lucene with graphs here:
http://lucene.472066.n3.nabble.com/indexing-directed-graph-td2949556.html
Note that Neo4j supports full-text search.
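As a rough illustration (assuming a reasonably modern Neo4j, 3.5 or later, where a full-text index named "docText" has already been created over the relevant node property; the index name, label, and credentials are placeholders), querying that full-text index from the Java driver might look like:

```java
// Hedged sketch: querying an existing Neo4j full-text index from the Java driver (4.x).
// The URI, credentials, index name ("docText"), and returned property names are placeholders.
import org.neo4j.driver.*;

public class Neo4jFullTextExample {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {

            // db.index.fulltext.queryNodes is the built-in full-text search procedure.
            Result result = session.run(
                    "CALL db.index.fulltext.queryNodes('docText', $q) YIELD node, score " +
                    "RETURN node.title AS title, score",
                    Values.parameters("q", "lucene graph"));

            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("title").asString() + "  " + record.get("score").asDouble());
            }
        }
    }
}
```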
Solr supports graph traversal since version 6.0. If you don't already have Solr installed, it's probably still better to use a graph database instead, but now at least you have a choice. I found this; documentation is still sparse:
https://solr.pl/en/2016/04/18/solr-6-0-and-graph-traversal-support/
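As a rough sketch of what the Solr 6 graph query parser looks like from SolrJ (the collection name and the edge fields parent_id/id are placeholders, not anything specific to your schema; the client shown is the 7.x/8.x-era HttpSolrClient):

```java
// Illustrative sketch of Solr's {!graph} query parser via SolrJ.
// Collection name, field names, and the starting node id are placeholders.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class SolrGraphTraversal {
    public static void main(String[] args) throws Exception {
        try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/docs").build()) {
            // Start from the document matching id:user_42 and walk parent_id -> id edges, up to 3 hops.
            SolrQuery q = new SolrQuery("{!graph from=parent_id to=id maxDepth=3}id:user_42");
            QueryResponse rsp = solr.query(q);
            for (SolrDocument doc : rsp.getResults()) {
                System.out.println(doc.getFieldValue("id"));
            }
        }
    }
}
```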
Apache Jena may be relevant here, since it has graph capabilities (SPARQL, RDF) and makes use of Lucene.
See Apache Jena Fuseki and Jena Text.
Related
Looking at the new Azure Cosmos database, I'm a bit confused about its multi-model nature. Specifically, does it mean:
a) That the same underlying database/store can be queried in multiple ways concurrently, so that I can use both Gremlin graph queries and the MongoDB API against the same collections.
or -
b) Does it mean that you can choose a different model (graph, key-value, column, document) at the time of provisioning your Cosmos DB, and that is how the data will be stored from then on?
The brochure makes it sound like a), but using the Azure dashboard to create a Cosmos instance makes it seem like b), since you have to choose a model type at creation.
Additionally, the literature makes reference to columnar data, but I don't see the option for it at create time.
Cosmos DB is a single NoSQL data engine, an evolution of DocumentDB. When you create a container ("database instance") you choose the most relevant API for your use case, which optimises the way you interact with the underlying data store and how the data is persisted into that store.
So, depending on the API chosen, it projects the desired model (graph, column, key value or document) on to the underlying store.
You can only use one API against a container; multiple are not possible due to the way the data is stored and retrieved. The API dictates the storage model (graph, key-value, column, etc.), but they all map back onto the same technology under the hood.
Thanks to @Jesse Carter's comment below, it appears you are, however, able to mix and match the Graph and DocumentDB SQL APIs.
From the docs:
Multi-model, multi-API support
Azure Cosmos DB natively supports multiple data models including documents, key-value, graph, and column-family. The core content-model of Cosmos DB’s database engine is based on atom-record-sequence (ARS). Atoms consist of a small set of primitive types like string, bool, and number. Records are structs composed of these types. Sequences are arrays consisting of atoms, records, or sequences.
The database engine can efficiently translate and project different data models onto the ARS-based data model. The core data model of Cosmos DB is natively accessible from dynamically typed programming languages and can be exposed as-is as JSON.
The service also supports popular database APIs for data access and querying. Cosmos DB’s database engine currently supports DocumentDB SQL, MongoDB, Azure Tables (preview), and Gremlin (preview). You can continue to build applications using popular OSS APIs and get all the benefits of a battle-tested and fully managed, globally distributed database service.
Cosmos DB at its heart is a geographically distributed database with its own Atom-Record-Sequence storage engine and index. On top of that infrastructure we are able to implement many different kinds of stores, from SQL like stores using our SQL API, to Mongo, to Cassandra, to Gremlin, to an implementation of Azure Table storage and so on.
Each of the different store types has its own data types (e.g. ways of encoding numbers, dates, etc.) and is encoded in our storage and index layer in its own way. Over time we expect most of those data types to be natively supported by our SQL API. But for now each of our database types uses its own encoding conventions. When creating an account in Cosmos DB (this is a unit of organization; users can have many accounts), the "type" of database is specified on the account. So one can have a Table API account or a Mongo account or what have you.
In some cases it is possible to access an account with Data Type X using API Y. For example, one can use SQL API to talk to tables in a Table API account. But outside of graph, that is usually not a great idea. Right now we encode information for each API in a special format and the different data types don't speak each other's formats. So if one were to write to a Table API using SQL API the end result will most likely be corrupt data.
The exception is graph, which we work hard to make sure works reasonably well with all database types, and we'll have more to say on that in the future.
So if you do want to play around with multi-API access, we strongly encourage you to only do so in "read only" mode when not using the "native" API for the given account. In other words, by all means play around with the SQL API reading from a Table API account, just please don't write to a Table API account using a SQL API client.
The accepted answer misses out on some points.
Cosmos DB is a NoSQL database, but it is highly distributed, and its storage format is Atom-Record-Sequence.
Why does that matter? We know that it accepts JSON as input and output formats, but that does not mean Cosmos stores its data as JSON; it could actually be any format. This helps us reason about the multi-model nature of Cosmos: what you get when you execute a query according to a certain model is probably a projection or view of your data.
@JesseCarter already explained that we can interchangeably use the Document API and the Graph API. Last week the Table API was publicly announced, and this API is probably not too different either.
The guys over at Spectologic have written a nice blog post about cross-API usage of Cosmos and have also pointed out that the multi-model nature is more cosmetic than internal; the only real exception seems to be Mongo. The interesting part is pointed out in the chapter 'Switching the portal experience' here: https://blog.spectologic.com/2017/06/30/digging-into-cosmosdb-storage/
So maybe in the end it boils down to GlobalDocumentDB vs. MongoDB.
I too was intrigued by this, wanting to understand more from an API-usage auditing perspective, and I have learned more reading through these answers.
Upon experimenting, it appears things have progressed further than the original answers suggest, so to add a contemporary spin...
I have been able to successfully create a Cosmos DB account choosing the SQL API, created a document in the portal then retrieved the document via the MongoDB API.
The original answers suggested that MongoDB was the odd-one-out and couldn't interact with data created with other APIs.
Whether fuller testing would show this resulting in corrupt documents due to the data type differences hinted at by Yaron (https://stackoverflow.com/a/48286729/141022), and whether the storage differences would result in poor performance, remains to be seen.
For my purposes I'm interested in whether auditing one API is enough, which in this case it is not, as data created with one API can be retrieved with another; so I haven't tested in depth.
Notably, the ARM template deploys with neither the GlobalDocumentDB nor the MongoDB kind; however, exporting the ARM template back from the portal results in GlobalDocumentDB, if that happens to make a difference.
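For anyone wanting to repeat the experiment, a hedged sketch of what such a cross-API read might look like with the MongoDB Java driver is below. The connection string, database, and collection names are placeholders, and whether such reads are safe to rely on depends on the account kind, as the earlier answers point out.

```java
// Hedged sketch: reading documents (created via the SQL/DocumentDB API) back through
// the MongoDB wire protocol. Connection string, database and collection names are placeholders.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class CosmosCrossApiRead {
    public static void main(String[] args) {
        String conn = "mongodb://<account>:<key>@<account>.mongo.cosmos.azure.com:10255/?ssl=true";
        try (MongoClient client = MongoClients.create(conn)) {
            MongoCollection<Document> col = client.getDatabase("mydb").getCollection("mycol");
            Document first = col.find().first();   // fetch one document written via the SQL API
            System.out.println(first == null ? "no documents" : first.toJson());
        }
    }
}
```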
If you are interested in the implementation details of CosmosDB, you can read this whitepaper from a long time ago (assuming that the implementation hasn't changed). http://www.vldb.org/pvldb/vol8/p1668-shukla.pdf
TLDR:
At the bottom, Cosmos DB stores data in ARS and exposes it in JSON format.
The database engine indexes ALL fields in ALL documents by default, enabling very flexible queries.
The database engine executes an intermediate language similar to JavaScript, bridging the low-level storage and the APIs that the database exposes.
Because of that bridging, more database APIs can be added to support different querying mechanisms (e.g. SQL, document, columnar).
Multi-model means your data can be stored in a number of different ways. Currently, Cosmos DB stores 4 different types of data, and it allows you to integrate with an API and build out a user experience around these storage types.
The 4 types are document (DocumentDB or MongoDB), graph, key-value pair, and wide column (column family).
I am trying to understand the difference between Solr distributed search and the concept of federated search. Can I use Solr distributed search to implement federated search? The requirement is that two or more domain models exist, and each such domain system indexes its own data into a Lucene-based index. Now I have an interesting use case: I should be able to do a federated search for a single query, cutting across the different domain systems, each with its own index.
No, distributed search is not the same as federated.
Federated search" is the term more typically used when searching
across heterogeneous data sources - think about things like
meta-search engines, as a common example of this.
Distributed search is when you have a homogeneous data source, but it
needs to be distributed in order to scale properly.
(taken from here - http://wiki.apache.org/solr/FederatedSearch)
About the second question - is it possible to implement federated search using Solr? I'm pretty sure it's possible; the only question is how much effort it will require from you.
One possible solution I can see is to create separate collections in Solr and query them, then merge all results at query time, but it's just a rough idea (sketched below).
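Something like the following SolrJ sketch. The collection names and the naive score-ordered merge are assumptions; scores from different collections aren't really comparable, which is part of why federated ranking is hard.

```java
// Rough sketch of "query separate collections and merge at query time".
// Collection names and the merge strategy are assumptions, not a recommendation.
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrDocument;

import java.util.ArrayList;
import java.util.List;

public class FederatedSearchSketch {
    public static void main(String[] args) throws Exception {
        List<SolrDocument> merged = new ArrayList<>();

        // Query each domain's collection separately.
        for (String collection : new String[]{"domain_a", "domain_b"}) {
            try (HttpSolrClient solr = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/" + collection).build()) {
                SolrQuery q = new SolrQuery("some query");
                q.setFields("id", "title", "score");   // ask Solr to return the score pseudo-field
                merged.addAll(solr.query(q).getResults());
            }
        }

        // Naive merge: sort everything by raw score (caveat: scores are not normalised across indexes).
        merged.sort((a, b) -> Float.compare(
                (Float) b.getFieldValue("score"),
                (Float) a.getFieldValue("score")));
        merged.forEach(d -> System.out.println(d.getFieldValue("id") + "  " + d.getFieldValue("title")));
    }
}
```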
Currently I'm in the process of evaluating CouchDB for a new project.
Key constraint for this project is strong privacy. There need to be resources that are readable by exactly two users.
One use case may be something similar to Direct Messages (DMs) on Twitter. Another use case would be User / SuperUser access levels.
I currently don't have any ideas about how to solve these kinds of problems with CouchDB, other than creating one database that is accessible only by these 2 users. I wonder how I would then build views aggregating data from several databases?
Do you have any hints / suggestions for me?
I've asked this question several times on couchdb mailing lists, and never got an answer.
There are a number of things that couchdb is missing.
One of them is document-level security, which would:
allow only certain users to view a doc
filter the documents indexed in a view based on user-level permissions
I don't think that there is a solution to the permission considerations with the current couchdb implementation.
One solution would be to use an external indexing tool like Lucene: tag your documents with user rights, then issue a Lucene query that includes the user's rights in order to get the docs (a rough sketch follows). This also implies extra load on your server(s) (Lucene requires a JVM) and an extra delay before the data is available (Lucene indexing time...).
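A rough sketch of that tagging approach in plain Lucene. The field names ("body", "readers") are invented, and the API shown is the modern Lucene 8.x style rather than what was current when this answer was written.

```java
// Hedged sketch: index each CouchDB document with a "readers" field listing the users
// allowed to see it, then filter every query by the requesting user's id.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class LuceneAclSketch {
    public static void main(String[] args) throws Exception {
        Directory dir = new ByteBuffersDirectory();

        // Indexing side: one "readers" value per user allowed to see this document.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new TextField("body", "a direct message between alice and bob", Field.Store.YES));
            doc.add(new StringField("readers", "alice", Field.Store.NO));
            doc.add(new StringField("readers", "bob", Field.Store.NO));
            writer.addDocument(doc);
        }

        // Query side: combine the user's text query with a filter on their user id.
        try (IndexReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new BooleanQuery.Builder()
                    .add(new TermQuery(new Term("body", "message")), BooleanClause.Occur.MUST)
                    .add(new TermQuery(new Term("readers", "alice")), BooleanClause.Occur.FILTER)
                    .build();
            System.out.println(searcher.search(query, 10).totalHits);
        }
    }
}
```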
As for the several-databases solution, there are language framework implementations that simply don't allow using more than one database (for instance, couch_potato for Ruby).
Having several databases also means that you'll have several replication processes if your databases are replicated.
Also, this means that the views will be updated for each of the databases. In some cases this is better than having huge views indexed in a single database, but it also means that distinct users might not be up to date with respect to a single source of information (i.e. some will have their views updated, others won't). So you cannot guarantee that the data is consistent for all users.
So unless something is implemented in the Couch core to manage document-level authorizations, CouchDB does not seem appropriate for managing data with privacy constraints.
There are a bunch of details missing about what you are trying to accomplish and what the data looks like, so it's hard to make a specific recommendation. You may be able to create a database per user and copy items into each user's database (for the DM use case you described); a sketch of this follows. Each user would only be able to access their own database, and then you could have an admin user that can access all databases. If you need to later update those records, copying them to multiple databases might not be a good idea, and then you might consider whether you want to control permissions at a different level from storage.
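A sketch of the per-user database idea against CouchDB's HTTP API, using Java 11's built-in HttpClient. The database name, user names, and admin credentials are placeholders; the /_security endpoint itself is standard CouchDB.

```java
// Hedged sketch: create a per-user database and restrict its membership to that user.
// For the DM use case, the same message document would be copied into both users' databases.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.Base64;

public class CouchPerUserDatabase {
    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();
        String auth = "Basic " + Base64.getEncoder().encodeToString("admin:secret".getBytes());
        String db = "http://localhost:5984/userdb_alice";

        // 1. Create the database that only alice (and server admins) will be able to read.
        http.send(HttpRequest.newBuilder(URI.create(db))
                .header("Authorization", auth)
                .PUT(HttpRequest.BodyPublishers.noBody())
                .build(), HttpResponse.BodyHandlers.ofString());

        // 2. Restrict membership via the standard _security document.
        String security = "{\"members\":{\"names\":[\"alice\"],\"roles\":[]}}";
        HttpResponse<String> rsp = http.send(HttpRequest.newBuilder(URI.create(db + "/_security"))
                .header("Authorization", auth)
                .PUT(HttpRequest.BodyPublishers.ofString(security))
                .build(), HttpResponse.BodyHandlers.ofString());

        System.out.println(rsp.statusCode() + " " + rsp.body());
    }
}
```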
For views that aggregate data from several databases, I recommend looking at lounge and bigcouch, which take different approaches.
http://tilgovi.github.com/couchdb-lounge/
http://support.cloudant.com/faqs/views/chained-mapreduce-views
I'm looking at both projects and I can't really see the difference.
From the Cassandra site:
Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store...Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
From the CouchDB site:
Apache CouchDB is a distributed, fault-tolerant and schema-free document-oriented database accessible via a RESTful HTTP/JSON API.
That said, I see specific differences between the projects (access methods, implementation languages, etc.), but to give AN EXAMPLE: when you talk about Solr or Sphinx, you know both are indexers with big differences, but in the end both are indexers.
Can I say here that Cassandra and CouchDB are non-relational databases that in some cases can replace one another?
CouchDB is a document store. You put documents (JSON objects) in it and define views (indexes) over them. The objects can be arbitrarily complex with potentially deep structure. Further, they are not constrained to following some consistent schema.
Cassandra is a ragged-table key-value store. It just stores rows, each of which has a set of named columns grouped in to families with values. It sounds quite close to BigTable; BigTable doesn't require each row to have the same structure (unlike an SQL database). The values may have some structure, but this kind of store doesn't know anything about that -- they're just strings/byte sequences.
Yes, they are both non-relational databases, and there is probably a fair amount of overlap in their applicability, but they do have distinctly different data organization models. Each can probably be forced into emulating the other, but each model will map best to a different set of problems.
CouchDB has a feature present in very few open source database technologies: offline replication. CouchDB is designed so that applications can be run at the edge of the network. These applications are available even when internet connectivity fails.
Offline replication can also be leveraged to build large clusters, but CouchDB is designed to be robust and simple whether it is running on a single server, a datacenter, or even a smartphone.
Is there a distinct winner among all the key-value stores (Cassandra, MongoDB, CouchDB)? Do they all follow some central guidelines, or does each have its own say in defining its API?
I'm asking this especially from the perspective of an RDBMS-skilled person who is new to key-value stores. Which one should we follow to best grasp the understanding and usage of this field?
We know about RDBMSs from theory that all available DBs (Oracle, SQL Server, ...) will have the same artifacts, e.g. tables, indexes, foreign keys, etc. The only differences between them are efficiency, security, and features.
How can I learn the universal theory behind these document-centered databases and know what minimal artifacts all of these DBs (Mongo, Couch, etc.) will have?
I work on MongoDB so I'm biased that way, but I think it is a nice combination of the things you're used to with an RDBMS (like dynamic queries and secondary indexes) and the performance and scalability of a key-value store.
Cassandra has a nice distributed model but, as far as I know, doesn't support secondary indexes. The document data model supported by Mongo and Couch also allows for a little more complexity than the tabular model Cassandra uses.
One of the big differences between Mongo and Couch is the way queries are constructed. Couch uses a cool map/reduce mechanism, but your queries must be defined in advance. Mongo uses a more traditional dynamic query model that is more like what you're used to in an RDBMS.
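To make the contrast concrete, here's a small sketch. The collection and field names are invented, the MongoDB calls use the current Java driver, and the CouchDB side is shown only as a comment because its view has to be defined ahead of time rather than at query time.

```java
// Hedged sketch of MongoDB's ad hoc query model vs. CouchDB's predefined map/reduce views.
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.and;
import static com.mongodb.client.model.Filters.eq;
import static com.mongodb.client.model.Filters.gt;

public class MongoDynamicQuery {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> posts = client.getDatabase("blog").getCollection("posts");

            // Built at runtime from whatever the user asked for; no predefined view needed.
            posts.find(and(eq("author", "alice"), gt("votes", 10)))
                 .forEach(doc -> System.out.println(doc.toJson()));

            // The rough CouchDB equivalent would be a design-document view defined in advance:
            //   map:   function(doc) { if (doc.votes > 10) emit(doc.author, doc); }
            //   query: GET /blog/_design/posts/_view/by_author?key="alice"
        }
    }
}
```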