I have a model of posts and their corresponding comments, like this:
{
"id": "1234",
"moment": "2021-02-19T10:00:00Z",
"body": "Good morning!",
"author": "Bob",
"comments": [
{
"body": "Take care!",
"moment": "2021-02-19T11:13:00Z",
"author": "Bob"
},
{
"body": "Hey there!",
"moment": "2021-02-19T11:15:00Z",
"author": "Maria"
}
]
}
Using Cassandra 3.11.10, I managed to create and query a case insensitive LIKE search by text contained in post body:
CREATE TABLE post(
id uuid,
moment timestamp,
body text,
author varchar,
PRIMARY KEY (id)
);
CREATE CUSTOM INDEX body_idx ON post (body) USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS', 'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer','case_sensitive': 'false'};
INSERT INTO post (id, moment, body, author) VALUES (uuid(), '2021-02-19T10:00:00Z', 'Good morning!', 'Bob');
SELECT * FROM post WHERE body LIKE '%morning%';
But how can I create a table structure for nested comments, and also search text in both post and comments bodies?
First, please keep in mind that data modeling in Cassandra is different from a relational database: denormalization is your friend in NoSQL, especially in Cassandra.
You need to focus on the data you want to retrieve from the database, that is, do a query-driven design.
However, if you cannot remodel the system, there are a couple of solutions:
Add a search engine such as Solr or Elasticsearch alongside Cassandra: this solution introduces a second service that handles all of the searching.
Pro: You can keep the Cassandra model similar to a relational one.
Cons: More operational complexity. Maintaining two services makes it harder to manage data from two different sources and keep them synchronized.
Use Stratio: a plugin that adds a Lucene index to Cassandra.
Pro: It is a full-text search engine integrated with Cassandra. You don't need a new infrastructure service or have to worry about replicating data across services. It supports UDTs, so you can define the comments as a UDT and search them as usual.
Cons: You need to include a jar on each Cassandra node. Besides, since Cassandra and Lucene run on the same machine, it might impact performance, so look at the references.
Cassandra DSE: A commercial version that has several features, such as search integration.
Pro: A search engine integrated with several features.
Cons: There is no free version.
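As a rough illustration of the query-driven, denormalized approach mentioned at the start of this answer (not of the three options above), the nested post can be flattened into one row per searchable body. This is only a sketch: the helper names toSearchRows and searchRows, the row shape, and the "kind" field are assumptions for illustration, and the in-memory search merely mimics a case-insensitive LIKE '%...%'.

```javascript
// Hypothetical helper: flatten one post into rows for a single denormalized
// table, one row per post body or comment body.
function toSearchRows(post) {
  const rows = [
    { postId: post.id, kind: "post", moment: post.moment, author: post.author, body: post.body },
  ];
  for (const comment of post.comments || []) {
    rows.push({
      postId: post.id,
      kind: "comment",
      moment: comment.moment,
      author: comment.author,
      body: comment.body,
    });
  }
  return rows;
}

// In-memory stand-in for a case-insensitive LIKE '%term%' on the body column.
function searchRows(rows, term) {
  const needle = term.toLowerCase();
  return rows.filter((row) => row.body.toLowerCase().includes(needle));
}

const post = {
  id: "1234",
  moment: "2021-02-19T10:00:00Z",
  body: "Good morning!",
  author: "Bob",
  comments: [
    { body: "Take care!", moment: "2021-02-19T11:13:00Z", author: "Bob" },
    { body: "Hey there!", moment: "2021-02-19T11:15:00Z", author: "Maria" },
  ],
};

const rows = toSearchRows(post);
console.log(searchRows(rows, "hey").map((row) => row.kind));
```

With rows shaped like this, a single text column holds both post and comment bodies, so one index on that column would cover both kinds of search.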
I have a design document in CouchDB. I've set up views and filters.
{
"_id": "_design/my_index_id",
"_rev": "17-fa5c543fcc80f4420aa98d58f7a07130",
"views": {
"jobsbyid": {
"map": "function (doc,req) {if (doc.type === 'job') {emit(doc.id);}}"
}
},
"filters": {
"myfilter": "function (doc, req) {return req.query.type === 'job'}"
}
}
What's the difference between views and filters in terms of performance, use cases and usage? When should I use views and when should I use filters?
In CouchDB you have different filtering options for the replication process. All of them are documented here: CouchDB filtering options
About filtering: keep in mind that filtering is one of the most expensive operations in CouchDB and can lead to performance degradation as the database grows. You can check this answer: Filtered Sync between CouchDB and PouchDB
Filters and views perform almost the same for this purpose, as both run against the whole database on each filtering request. This is stated in the docs:
Using _view filter doesn’t queries the view index files, so you cannot
use common view query parameters to additionally filter the changes
feed by index key. Also, CouchDB doesn’t returns the result instantly
as it does for views - it really uses the specified map function as
filter.
Moreover, you cannot make such filters dynamic e.g. process the
request query parameters or handle the User Context Object - the map
function is only operates with the document.
The advantage of using views for filtering is that you reuse your existing map functions as filters.
So the use cases of both approaches are very similar, except that filters can access the query parameters and the security context.
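To make the difference concrete, here is a small standalone sketch that runs outside CouchDB. It assumes the behaviour described in the quoted docs: a map function used as a filter accepts a document when it emits something for it. The names typeFilter, jobsById and passesViewFilter are hypothetical, and emit is passed in as an argument only so the code runs on its own (in CouchDB it is a global).

```javascript
// A filter function gets the document and the request and returns a boolean,
// so it can react to query parameters.
function typeFilter(doc, req) {
  return doc.type === req.query.type;
}

// A map function only sees the document. `emit` is passed in here so the
// sketch is runnable without CouchDB.
function jobsById(doc, emit) {
  if (doc.type === "job") emit(doc._id, null);
}

// Hypothetical stand-in for the _view filter: a document passes if the map
// function emits at least once for it.
function passesViewFilter(mapFn, doc) {
  let emitted = false;
  mapFn(doc, () => {
    emitted = true;
  });
  return emitted;
}

const docs = [
  { _id: "1", type: "job" },
  { _id: "2", type: "invoice" },
  { _id: "3", type: "job" },
];

// The filter can be parameterised per request...
const invoices = docs.filter((doc) => typeFilter(doc, { query: { type: "invoice" } }));
// ...while the map-based filter is fixed to whatever the map function tests.
const jobs = docs.filter((doc) => passesViewFilter(jobsById, doc));

console.log(invoices.map((doc) => doc._id), jobs.map((doc) => doc._id));
```

Both variants still visit every document, which is the performance point made above.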
I'm working on a product search engine with a big set of undefined products which is constantly growing. Each product has different attributes and at this time they're saved in an array of string key-value pairs like this:
"attributes": [
{
"key": "Producttype",
"value": "Headphones - 3.5 mm plug"
},
{
"key": "Weight",
"value": "280 g"
},
{
"key": "Soundmode",
"value": "Stereo"
},
....
]
Each product also has a category. I'm using Elasticsearch 2.4.x, via spring-data-elasticsearch, to persist the data that I want to search on. It's possible to upgrade to the newest Elasticsearch version if needed.
As you can see, the attributes are really generic. Nested objects are also needed to be able to search on these attributes. I'm also thinking about preprocessing these attributes into a standardized format. For example, the "Weight" key might be written in different forms like "Productweight" or "Weight of product". Because there are a lot of attributes and I wouldn't like to create a custom property/field for each one, I thought about mapping only the important ones (like weight) to a dedicated field and mapping the other attributes as described above.
Now if someone searches for example "iphone", I would like to show some facets on the left of the search result page. The facets should differ if someone searches "Adidas shoes". Is this possible with the given format above using nested objects? Is it possible to build the facets dynamically based on the result set Elasticsearch is returning? E.g. the most common properties that all result products contain should be used to create facets. Or do I have to persist some predefined filters/facets on each category? I think that would be too much work and also wouldn't work on search results where products can have different categories. What's the best practice for building a search feature with faceting on entities with n different properties that can grow in the future?
Considering the following two documents:
{
"_id": "a6b8d3d7e2d61c97f4285220c103abca",
"_rev": "7-ad8c3eaaab2d4abfa01abe36a74da171",
"File":"/store/document/scan_bgd123.jpg",
"Commend": "Describes a person",
"DateAdded": "2014-07-17T14:13:00Z",
"Name": "Joe",
"LastName": "Soap",
"Height": "192cm",
"Age": "25"
}
{
"_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
"_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
"File":"/store/document/scan_adf123.jpg",
"Comment": "Describes a car",
"Make": "Ford",
"Year": "2011",
"Model": "Focus",
"Color": "Blue"
}
How would I find a document based on multiple criteria, say for example "Make"="Ford" and "Color"="Blue"? I realize I need a view for this, but I don't know what the key is going to be, and as you can see from the two documents, the key/value pairs aren't consistent. The only consistent item will be the "File" key.
I'm attempting to create a CouchDB database that will store the location of files, tagged with key/value pairs.
EDIT:
Perhaps I should reconsider my data structure and modify it slightly?
{
"_id": "a6b8d3d7e2d61c97f4285220c103c4a9",
"_rev": "1-f43410cb2fe51bfa13dfcedd560f9511",
"File": "/store/document/scan_adf123.jpg",
"Tags": {
"Comment": "Describes a car",
"Make": "Ford",
"Year": "2011",
"Model": "Focus",
"Color": "Blue"
}
}
So, I need to find documents by a key/value pair in the tags, or by any number of key/value pairs, to filter which document I want. The problem here is that I want to tag objects with key/value pairs. These tags could be very different per document, so the next document will have a whole different set of key/value pairs.
CouchDB supports flexible schemas. There is no need for the documents to be consistent for them to be queryable. The view for your scenario is pretty straightforward. Here is the map function that should do the trick.
function(doc){
if(doc.Make&&doc.Color)
emit([doc.Make,doc.Color],null);
}
This gives you a view which you can then query like
/view-name?key=["Ford","Blue"]&include_docs=true
This should give you the desired result.
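To see why inconsistent documents are not a problem, the map function above can be run by hand over the two documents from the question. This is only an in-memory sketch: buildView and queryByKey are hypothetical helpers, emit is passed in as an argument so the code runs outside CouchDB, the _id values are shortened placeholders, and the linear scan stands in for CouchDB's index lookup.

```javascript
// Same logic as the map function above, with `emit` passed in for the sketch.
function makeColorMap(doc, emit) {
  if (doc.Make && doc.Color) emit([doc.Make, doc.Color], null);
}

// Hypothetical helper: run a map function over all docs and collect view rows.
function buildView(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ id: doc._id, key, value }));
  }
  return rows;
}

// Hypothetical helper: exact-match lookup on a (possibly composite) key.
function queryByKey(rows, key) {
  const wanted = JSON.stringify(key);
  return rows.filter((row) => JSON.stringify(row.key) === wanted);
}

const docs = [
  { _id: "person-doc", File: "/store/document/scan_bgd123.jpg", Name: "Joe", LastName: "Soap" },
  { _id: "car-doc", File: "/store/document/scan_adf123.jpg", Make: "Ford", Color: "Blue" },
];

const view = buildView(docs, makeColorMap);
const hits = queryByKey(view, ["Ford", "Blue"]);
console.log(hits.map((row) => row.id));
```

The person document simply produces no row, so it never shows up in this view.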
Edit based on comment
For that you will need two separate views. Every view in CouchDB is designed to fulfil a specific query need. This means that you have to think about the access strategy for your data. It is a lot more work on your part initially, but for the trouble you are rewarded with data that is indexed and has very fast access times.
So to answer your question directly: create two views, one for Make like we have already done and another for Name, like
function(doc){
if(doc.Name&&doc.LastName)
emit([doc.Name,doc.LastName],null);
}
Now the Name view will index only those documents that have a name in them, whereas the Make view will index those documents that have a make in them.
What happens when a requirement comes in future for which you don't have a query?
You can try a few things.
1. This is probably the easiest solution. Use couchdb-lucene for your dynamic queries. In this case your architecture will be: CouchDB views for the queries that you know your application needs, and a Lucene index for the queries that you don't know you might need. So for instance you have indexed name and last name in a CouchDB view. If a requirement arises and you need to query by age, simply dump the age field into Lucene and it will take care of the rest.
2. Another approach is using the PPP technique, where you exploit the fact that creating views is a one-time cost: you can build views during less active hours and deploy them to a production service once they are built.
3. Combine steps 1 and 2! Lucene handles ad hoc requests while you are building views using the PPP technique.
I am preparing to use CouchDB in my project, but I cannot find a way to implement a view like the SQL SELECT * FROM Employees WHERE LastName NOT IN (SELECT LastName FROM Managers). In other words, I want to get the set of rows that are in view A but not in view B. Question: how do I implement a not-in condition in CouchDB?
Keeping the employee and manager lists as different sets of documents means using a relational structure where your DB is not relational. If, for some reason, you are forced to do that, you need some way to tell which scheme a doc belongs to (which table it comes from). Let's say you do it with a scheme field:
{ _id: "EMPL_ID", scheme: "employee", ... }
{ _id: "MNGR_ID", scheme: "manager", employee: "EMPL_ID", ... }
Then you can use map:
function (doc) {
if (!doc.scheme) return;
if (doc.scheme != "manager") emit(doc.last_name, doc);
}
If, for some strange reason, you cannot do that, and you only have the reference to employee doc in manager doc, you can emit both documents:
function (doc) {
if (some_test_for_being_employee_scheme(doc))
emit([doc._id, 1], doc);
if (doc.emp_id)
emit([doc.emp_id, 0], null);
}
You will get the list of employees with keys ["employee_id", 1], and each employee who is a manager is preceded by a row marking them as a manager (key [..., 0]). This will require some space, but with a list function you can filter out managers easily, and the client will receive only the non-managers from the DB.
Keep in mind that this is only a workaround for not having a proper DB design.
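The marker-row trick above can be sketched in plain JavaScript. This is an illustration only: the doc.scheme === "employee" test stands in for some_test_for_being_employee_scheme, emit is passed in as an argument so the code runs outside CouchDB, the sort is a simplified version of CouchDB's key collation, and nonManagers plays the role of the list function.

```javascript
// Emit a [id, 1] row for each employee and a [emp_id, 0] marker for each manager.
function collationMap(doc, emit) {
  if (doc.scheme === "employee") emit([doc._id, 1], doc);
  if (doc.emp_id) emit([doc.emp_id, 0], null);
}

// Build the rows and sort them by key, so a marker [id, 0] lands right
// before the employee row [id, 1] with the same id.
function buildSortedRows(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ key, value }));
  }
  rows.sort((a, b) => {
    if (a.key[0] !== b.key[0]) return a.key[0] < b.key[0] ? -1 : 1;
    return a.key[1] - b.key[1];
  });
  return rows;
}

// Stand-in for the list function: skip employees preceded by a manager marker.
function nonManagers(rows) {
  const result = [];
  let managerId = null;
  for (const row of rows) {
    if (row.key[1] === 0) {
      managerId = row.key[0];
    } else if (row.key[0] !== managerId) {
      result.push(row.value);
    }
  }
  return result;
}

const docs = [
  { _id: "e1", scheme: "employee", last_name: "Adams" },
  { _id: "e2", scheme: "employee", last_name: "Brown" },
  { _id: "e3", scheme: "employee", last_name: "Clark" },
  { _id: "m1", scheme: "manager", emp_id: "e2" },
];

const remaining = nonManagers(buildSortedRows(docs, collationMap));
console.log(remaining.map((doc) => doc.last_name));
```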
If you change the model to make it fit a document-oriented database, this would be easy. I generally keep a "type" key in all of my documents to keep different types of documents straight. If you have a single "person" type and decorate all "person" documents who are also "manager" with a separate key, you could easily emit view keys only for non-managerial personnel. If you opt to have a separate "manager" type, you could similarly restrict emitted view keys to non-managers only.
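A minimal sketch of the single "person" type variant described above, assuming hypothetical field names (type, manager, last_name) and passing emit in as an argument so it runs outside CouchDB:

```javascript
// Emit view keys only for person documents that are not flagged as managers.
function nonManagerMap(doc, emit) {
  if (doc.type === "person" && !doc.manager) {
    emit(doc.last_name, null);
  }
}

const people = [
  { _id: "p1", type: "person", last_name: "Adams" },
  { _id: "p2", type: "person", last_name: "Brown", manager: true },
  { _id: "x1", type: "invoice" },
];

const keys = [];
for (const doc of people) {
  nonManagerMap(doc, (key) => keys.push(key));
}
console.log(keys);
```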
I think the answer is simply: you can't mix view results. Views are independent.
However, there is a strategy called view collation that probably solves your problems. I suggest reading this: http://wiki.apache.org/couchdb/View_collation
To summarize it: You need to use different document types and then use a single view to collate the results.
Considering these three documents...
[
{
_id: "...",
_rev: "...",
title: "Foo",
body: "..."
},
{
_id: "...",
_rev: "...",
title: "Bar",
body: "..."
},
{
_id: "...",
_rev: "...",
title: "Hello World!",
body: "..."
},
]
And this view...
byTitle: {
map: function (document)
{
emit(document.title, document);
}
}
What goes on behind the scenes, when I query the view?...
GET /database/_design/posts/_view/byTitle?key="Foo"
I've asked a few questions on views lately... questions about what I phrased as "dynamic parameters"... Essentially I wanted to know how to do the equivalent of SELECT ... WHERE field = parameter
All answers steered me towards using temporary views, which are really slow, and should not be used in production. So my second question is... is the above method for querying by title, fit for use in production? Or am I forcing CouchDB to do unspeakable horrors, performance-wise?... am I essentially doing the same as using a temporary view?
I think you have misinterpreted some of the answers. You can use a temporary view to test various map/reduce functions. When you are satisfied with the code, you should put it into a design document and use that for querying.
Temporary views are slow because the index is built and deleted for every query. Putting the view into a design document tells CouchDB not to delete the index and to keep it updated (this is done at query time).
So
GET /database/_design/posts/_view/byTitle?key="Foo"
is the fastest way to query by title because it is indexed.
As a side note: you can use
byTitle: {
map: function (document)
{
emit(document.title, null);
}
}
and query with include_docs=true to save some disk space.
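The trade-off can be pictured with a small in-memory model. It is a sketch under assumptions, not CouchDB's implementation: buildIndex and queryByKey are hypothetical helpers, emit is passed in as an argument, a sorted array with binary search stands in for the B-tree, and the _id values are placeholders.

```javascript
// Map function that stores no document copy in the index.
function byTitle(doc, emit) {
  emit(doc.title, null);
}

// Hypothetical helper: build rows sorted by key, built once and then reused.
function buildIndex(docs, mapFn) {
  const rows = [];
  for (const doc of docs) {
    mapFn(doc, (key, value) => rows.push({ key, id: doc._id, value }));
  }
  rows.sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
  return rows;
}

// Hypothetical helper: binary search for the first matching key, then
// optionally attach the full document (the include_docs=true behaviour).
function queryByKey(index, docsById, key, includeDocs) {
  let lo = 0;
  let hi = index.length;
  while (lo < hi) {
    const mid = (lo + hi) >> 1;
    if (index[mid].key < key) lo = mid + 1;
    else hi = mid;
  }
  const result = [];
  for (let i = lo; i < index.length && index[i].key === key; i++) {
    const row = { ...index[i] };
    if (includeDocs) row.doc = docsById.get(row.id);
    result.push(row);
  }
  return result;
}

const docs = [
  { _id: "a", title: "Foo", body: "..." },
  { _id: "b", title: "Bar", body: "..." },
  { _id: "c", title: "Hello World!", body: "..." },
];
const docsById = new Map(docs.map((doc) => [doc._id, doc]));
const index = buildIndex(docs, byTitle);

const found = queryByKey(index, docsById, "Foo", true);
console.log(found);
```

The index rows carry only the key and the document id; the document body is looked up separately when it is requested.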
To answer your question, a few things have to be cleared up (and I hope I get it all right):
Permanent vs. temporary views:
The difference between permanent and temporary views is that the index of a permanent view is stored permanently.
In order to understand the storing part, you need to know, that CouchDB's storage engine relies on a B+ Tree offering very powerful indexing capabilities that enable us to find data in that storage by key in a "logarithmic amortized time" (CouchDB book).
CouchDB handles documents in an "append only" manner. That means it is not like most relational DBMSs, where single values within a table row get updated and locking occurs. If a document is updated, it is simply assigned a new revision (_rev) and appended to the storage.
When you create a permanent view, the first time you query it the map function is executed for each document in your database and the results are stored in a new B+ tree file for that view, thus providing a new index that organizes the data by the key you defined in your view.
When documents handled by that view are updated, the whole permanent view does not need to be recomputed, only the entries for the updated documents.
Now you should be able to understand why temporary views are nice for developing or testing in Futon, but, since they have to be computed anew for all your documents, are not recommended for anything other than development.
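The incremental update described above can be sketched as a toy model. It is loosely inspired by how a persisted view only needs to process changes it has not seen yet; the IncrementalView class, the change-feed shape ({ seq, doc }) and the mapCalls counter are assumptions made for illustration, and emit is passed in as an argument.

```javascript
// Toy model of a persisted view: remembers the last change it processed and
// re-runs the map function only for newer changes.
class IncrementalView {
  constructor(mapFn) {
    this.mapFn = mapFn;
    this.rowsByDocId = new Map();
    this.lastSeq = 0;
    this.mapCalls = 0; // counts how often the map function actually ran
  }

  update(changes) {
    for (const change of changes) {
      if (change.seq <= this.lastSeq) continue; // already indexed
      const rows = [];
      this.mapCalls += 1;
      this.mapFn(change.doc, (key, value) => rows.push({ key, value }));
      this.rowsByDocId.set(change.doc._id, rows);
      this.lastSeq = change.seq;
    }
  }
}

const view = new IncrementalView((doc, emit) => emit(doc.title, null));

const changes = [
  { seq: 1, doc: { _id: "a", title: "Foo" } },
  { seq: 2, doc: { _id: "b", title: "Bar" } },
  { seq: 3, doc: { _id: "c", title: "Hello World!" } },
];
view.update(changes); // first query: every document is mapped

changes.push({ seq: 4, doc: { _id: "a", title: "Foo v2" } });
view.update(changes); // later query: only the updated document is mapped

console.log(view.mapCalls, view.rowsByDocId.get("a"));
```

A temporary view corresponds to creating a fresh IncrementalView for every query, so all documents are mapped again each time.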
Anyway, Marcello is right. If you intend to just pass back complete documents, you are encouraged to query with "include_docs=true". Why? Because otherwise the B-tree for your permanent view has to store a copy of the document data next to your indexing key.
@Marcello-Nuccio: I am not sure, though, whether it is correct to say that dynamic views have no index. As I understood it, they have an index, but it makes no sense because it is computed anew upon every query. Ok, now my brain is hurting!