How to check for duplication before creating a new document in CouchDB/Cloudant?

We want to check whether the database already contains a document with the same fields and values as a new object we are trying to save, to prevent duplicate items.
Note: This question is not about updating documents or about duplicate document IDs; we only check the data to prevent saving a new document with the same data as an existing one.
Preferably we'd like to accomplish this with Mango/Cloudant queries and not rely on views.
The idea so far is:
1) Scan the data that we are trying to save and dynamically create a selector that matches that document's structure. (We can't hardcode the selectors because we have many types of documents.)
2) Query the DB for any documents matching that selector to see whether a document matching those criteria already exists.
However, I wonder about the performance of this approach, since many of the selector fields will not be indexed.
I would also much rather follow best practices than invent something from scratch, but I haven't been able to find any known solutions for this specific scenario.
If you happen to know of any, please share.
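For illustration, the two-step idea above could be sketched as follows. This is a minimal sketch, not a known best practice: the function name build_selector and the sample document are hypothetical, and it assumes field names do not themselves contain dots. Nested objects are flattened into dotted paths and every value is compared with $eq.

```python
from typing import Any

# Fields managed by CouchDB itself; they are not part of the document's data.
RESERVED_FIELDS = {"_id", "_rev"}


def build_selector(doc: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Turn a document into a Mango selector that matches the same field values.

    Nested objects are flattened into dotted paths ("address.city");
    every other value, including lists, is compared with $eq.
    """
    selector: dict[str, Any] = {}
    for field, value in doc.items():
        # Only skip _id/_rev at the top level of the document.
        if not prefix and field in RESERVED_FIELDS:
            continue
        path = f"{prefix}{field}"
        if isinstance(value, dict) and value:
            selector.update(build_selector(value, prefix=f"{path}."))
        else:
            selector[path] = {"$eq": value}
    return selector


# Hypothetical document that is about to be saved.
new_doc = {
    "_id": "ignored",
    "type": "customer",
    "name": "Ada",
    "address": {"city": "London", "zip": "N1"},
}

# Request body for POST /{db}/_find; limit=1 because one hit is enough.
find_body = {"selector": build_selector(new_doc), "limit": 1}
print(find_body)
```

Note that such a selector also matches existing documents that carry extra fields beyond those in the new document, so a hit means "contains the same data", not necessarily "is identical".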

Option 1 - Define a meaningful ID for your documents
The ID could be a logical composition of, or a hash computed from, the values that should be unique.
If you want to check whether a document ID already exists, you can use the HEAD method
HEAD /db/docId
which returns 200 OK if the docId exists in the database (and 404 if it does not).
If you would like to check whether the new document has the same content as the previous one, you can use a validate document update function, which lets you compare both documents.
function(newDoc, oldDoc, userCtx, secObj) {
...
}
Option 2 - Use content hash computed outside CouchDB
Before creating or updating a document, compute a hash from the values of the attributes that should be unique.
Include the hash in the document as a new attribute, e.g. "key_hash".
Create a Mango index on the "key_hash" attribute.
When a new doc is to be inserted, compute its hash and search for documents with the same hash value using a Mango query before inserting the doc.
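As a rough sketch of Option 2, the hash and the two request bodies might look like the following. The field list, the index name and the helper name key_hash are assumptions made for the example; the only requirement taken from the steps above is that the hash is computed outside CouchDB from the fields that should be unique.

```python
import hashlib
import json
from typing import Any


def key_hash(doc: dict[str, Any], unique_fields: list[str]) -> str:
    """SHA-256 over the chosen fields, serialized in a canonical form.

    A missing field is hashed like a field set to null (None).
    """
    subset = {name: doc.get(name) for name in unique_fields}
    # sort_keys + fixed separators make the serialization order-independent.
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


UNIQUE_FIELDS = ["type", "name", "email"]  # hypothetical choice of fields

new_doc = {"type": "customer", "name": "Ada", "email": "ada@example.com"}
new_doc["key_hash"] = key_hash(new_doc, UNIQUE_FIELDS)

# Body for POST /{db}/_index: creates the Mango index once.
index_body = {
    "index": {"fields": ["key_hash"]},
    "name": "key-hash-index",  # hypothetical index name
    "type": "json",
}

# Body for POST /{db}/_find: run this before inserting new_doc.
find_body = {"selector": {"key_hash": new_doc["key_hash"]}, "limit": 1}
```

Keep in mind that a check followed by an insert is not atomic: two clients can both run the query, both find nothing, and both insert.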
Option 3 - Compute hash in a View
Define a view that emits the computed hash of each document as the key.
CouchDB's JavaScript support does not include hashing functions, so this could be difficult to include in a design document.
Use Erlang to define the map function, where you have access to Erlang's hashing support.
Before creating a new document, compute its hash and query the view with that hash as the key.

One solution would be to take Juanjo's and Alexis's comments one step further.
Select the keys you wish to keep unique.
Put the values in a string and generate a hash.
Set the document's _id to that hash.
PUT the document to the database.
Check the response for failure.
If another document with the same _id value already exists in the database, the PUT request will fail with 409 Conflict.
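The steps above could be sketched like this. It is only an illustration under some assumptions: the names http_put and create_if_absent are made up for the example, the database URL is passed in by the caller, and authentication is left out. The put parameter exists so the logic can be exercised without a running CouchDB.

```python
import hashlib
import json
import urllib.error
import urllib.parse
import urllib.request
from typing import Any, Callable


def http_put(url: str, body: bytes) -> int:
    """PUT body as JSON and return the HTTP status code."""
    request = urllib.request.Request(
        url, data=body, method="PUT", headers={"Content-Type": "application/json"}
    )
    try:
        with urllib.request.urlopen(request) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # e.g. 409 when the _id is already taken


def create_if_absent(
    db_url: str,
    doc: dict[str, Any],
    unique_fields: list[str],
    put: Callable[[str, bytes], int] = http_put,
) -> bool:
    """Return True if the document was created, False if it was a duplicate."""
    # Hash the fields that must be unique and use the digest as the _id.
    subset = {name: doc.get(name) for name in unique_fields}
    canonical = json.dumps(subset, sort_keys=True, separators=(",", ":"))
    doc_id = hashlib.sha256(canonical.encode("utf-8")).hexdigest()

    # The caller's dict is left untouched; _id is added to a copy.
    body = json.dumps({**doc, "_id": doc_id}).encode("utf-8")
    status = put(f"{db_url}/{urllib.parse.quote(doc_id)}", body)
    if status in (201, 202):
        return True
    if status == 409:
        return False
    raise RuntimeError(f"unexpected status {status} from CouchDB")
```

A call might look like create_if_absent("http://localhost:5984/mydb", doc, ["type", "name"]), where the URL is a placeholder. Because the uniqueness check and the write are a single request, this avoids the gap between "query" and "insert" that a separate lookup has.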

Related

How to append index level information to documents when returning search results

Relatively simple question -- I want to append index-level information onto each document when returning those documents. I do not want to copy that information into each document (makes it harder to adjust that information if it changes). I've found out that you can use the _meta tag to add information to the index level, but now I want it to be appended onto the document when returning results from a search query.
My specific use case is: I have indices that store posts per user (indices are structured as: posts-USER_ID). I'm performing a search across all posts across all user indices (search index: posts-*), and I want to return user information with each index (that user information being a JSON object with fields like username, userColor, displayName).
I see that fields like _index and _type are index-level and returned with each document automatically. I essentially want to return a custom field as well. As said above I've been able to successfully append this user information on _meta for an index but I can't figure out how to append it to documents returned from that index (for my search results from that multi-index query).
The reason I want this is that I need user information along with post information on search (to display various things: username, displayName, coloring posts in the userColor). Ideally I'd prefer not to have to perform another query for each search result to retrieve user information (for each document result, querying the user that created that post seems expensive). I also would not like to copy that user information into each document in an index (so, under a posts-USERID index, adding a creator field with user information). That seems insanely repetitive (as the indices are already partitioned per user), and when the user updates their information it is very expensive (I would have to iterate through each document in "their" index and change the information).
What do I do / help!
(linked question in the elastic discussion page: https://discuss.elastic.co/t/how-to-append-index-level-information-to-documents-when-returning-search-results/262923)

CouchDB check if a document exists in a validation function

I would like to see if a document exists in the database that has the name field "name" set to "a name" before allowing a new document to be added to the database.
Is this possible in CouchDB using update handlers (inside design documents)?

It seems you are looking for a unique constraint in CouchDB. The only unique constraint supported by CouchDB is based on the document ID.
You should include your "name" attribute value in the document ID if you would like document uniqueness to be based on it.
Validate document update functions defined in design documents can only use the data of the document being created/updated/deleted; they cannot use data from other documents in the database.
You can find a similar question here.

This is not widely known, but the _update endpoint is allowed to return a doc with an _id different from the requested one. It means that, in your case, you need a unique document, say _id:"doc-name", which will serve as the constraint.
Then you call something like POST _design/whatever/_update/saveDependentDoc/doc-name, providing the new doc with a different _id as the request body.
Your _update function will effectively receive two docs as input (or null and newDoc if the constraint doc is missing). The function then decides what to do: return the received doc to persist it, or return nothing.
The solution isn't a full answer to your question; however, it might be helpful in some cases.
This trick only works for updating existing docs if you know the revision, of course.

How can I retrieve the id of a document I added to a Cosmosdb collection?

I have a single collection into which I am inserting documents of different types. I use the type parameter to distinguish between different datatypes in the collection. When I am inserting a document, I have created an Id field for every document, but Cosmosdb has a built-in id field.
How can I insert a new document and retrieve the id of the created Document all in one query?

The CreateDocumentAsync method returns the created document so you should be able to get the document id.
Document created = await client.CreateDocumentAsync(collectionLink, order);

I think you just need to call the .getResource() method to get the created document object.
Please refer to the java code:
DocumentClient documentClient = new DocumentClient(END_POINT,
MASTER_KEY, ConnectionPolicy.GetDefault(),
ConsistencyLevel.Session);
Document document = new Document();
document.set("name","aaa");
document = documentClient.createDocument("dbs/db/colls/coll",document,null,false).getResource();
System.out.println(document.toString());
//then do your business logic with the document.....
C# code:
Parent p = new Parent
{
FamilyName = "Andersen.1",
FirstName = "Andersen",
};
Document doc = client.CreateDocumentAsync("dbs/db/colls/coll",p,null).Result.Resource;
Console.WriteLine(doc);
Hope it helps you.

Sure, you could always fetch the id from creation method response in your favorite API as already shown in other answers. You may have reasons why you want to delegate key-assigning to DocumentDB, but to be frank, I don't see any good ones.
If the inserted document has no id set, DocumentDB generates a GUID for you. There wouldn't be any notable difference compared to simply generating a new GUID yourself and assigning it to the id field before saving. Self-assigning the identity lets you simplify your code a bit and also lets you use the identity not only after persisting but also BEFORE, which could simplify a lot of scenarios you have or may run into in the future.
Also, note that you don't have to use GUIDs as the id and could use any unique value you already have. Since you mentioned you have an Id field (which, by its name, I assume to be a primary key), you should consider reusing it instead of introducing another set of keys.
A self-assigned non-GUID key is usually a better choice, since it can be designed to match your data and application needs better than a GUID. For example, in addition to being just unique, it may also be a natural key, narrower, human-readable, ordered, etc.
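To make the point concrete, here is a small language-neutral sketch (written in Python rather than against the DocumentDB SDK). The helper name with_id and the "customer:1001" key format are purely illustrative assumptions; the idea is only that the id is chosen by the application before the document is sent to the database.

```python
import uuid
from typing import Any, Optional


def with_id(doc: dict[str, Any], natural_key: Optional[str] = None) -> dict[str, Any]:
    """Return a copy of doc whose "id" is set before the document is saved.

    Uses the natural key when one is given, otherwise a freshly generated GUID.
    """
    doc_id = natural_key if natural_key is not None else str(uuid.uuid4())
    return {**doc, "id": doc_id}


# Hypothetical documents of two different types in the same collection.
order = with_id({"type": "order", "total": 42})
customer = with_id({"type": "customer", "name": "Ada"}, natural_key="customer:1001")

# Both ids are known here, before any call to the database.
print(order["id"], customer["id"])
```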

RethinkDB: How do I create a custom duplicate check on insert

I want to bulk insert an array of data using NodeJS and RethinkDB but I don't want to insert existing records (where name & value already has a record, I don't want to dupcheck on primary key id).
[
{name:"Robert", value:"1337"},
{name:"Martin", value:"0"},
{name:"Oskar", value:"1"}
]
If any of the above values already exist, don't insert, but update "value".
My current working solution is to loop through the array and first check whether each record exists using a filter; if not, I insert it. But it's very slow on 10.000 records.

I don't think we have that kind of concept in RethinkDB. I tried reading the docs some more. To insert a new document, use insert; to update a field, use update; to replace a whole document, use replace (the primary key won't change)... So I don't think it's possible in RethinkDB.
Here are some ways you can make it run faster:
Create a compound index containing those two fields: name and value.
Then use that index to check for existence instead of using filter.
Generate your own id field instead of letting RethinkDB generate it. That way you know the primary key and can use it to look up the document with get, which will be very fast.
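The last suggestion could look roughly like the sketch below. The helper with_deterministic_ids, the namespace value and the table name "scores" are assumptions made for the example; it also assumes that name alone identifies a record, since the question wants value updated for an existing name. Once the id is derived from the data, RethinkDB's insert with the conflict option can handle the whole array in one call instead of one lookup per record.

```python
import uuid
from typing import Any

# Fixed namespace so the same key always maps to the same UUID (arbitrary choice).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.invalid")


def with_deterministic_ids(
    records: list[dict[str, Any]], key_fields: list[str]
) -> list[dict[str, Any]]:
    """Return copies of the records with an id derived from the key fields."""
    docs = []
    for record in records:
        # Join the key values with a separator that is unlikely to occur in data.
        key = "\x1f".join(str(record[field]) for field in key_fields)
        docs.append({**record, "id": str(uuid.uuid5(NAMESPACE, key))})
    return docs


records = [
    {"name": "Robert", "value": "1337"},
    {"name": "Martin", "value": "0"},
    {"name": "Oskar", "value": "1"},
]
docs = with_deterministic_ids(records, ["name"])

# With the RethinkDB Python driver this becomes one bulk call, for example:
#   r.table("scores").insert(docs, conflict="update").run(conn)
# conflict="update" merges the new fields into an existing row with the same id.
```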

I had a similar requirement in a RethinkDB project, but in that case the primary key was being checked for duplicates, and it was also custom instead of being auto-generated.
What you could do is run an async.series or async.waterfall two-step check. First pick a single object from your array, then filter the database for the name-value pairs of your current object. If the results come up null, it is unique. If not, you have a pre-existing record with same details.
Depending on the result, you can then pass on the control to next step which will either insert the new document or update existing one. It will be simpler if you use a flag for this in async.waterfall.

How to index documents with elastic.js client?

So far I haven't found any samples of HOW the elastic.js client api (https://github.com/fullscale/elastic.js) can be used for indexing documents. There are some clues here & there but nothing concrete yet.
http://docs.fullscale.co/elasticjs/ejs.Document.html
Document ( index, type, id ): Object used to create, replace, update, and delete documents
Document > doIndex(fnCallBack): Stores a document in the given index and type. If no id is set, one is created during indexing.
Document > source (doc): Sets the source document.
Can anyone provide a sample snippet of code to show how a document object can be instantiated and used to index data?
Thanks!
Update # 1 (Sun Apr 21st, 2013 on 12:58pm CDT)
https://gist.github.com/pulkitsinghal/5430444

Your gist is correct.
You create ejs.Document objects specifying the index, type, and optionally the id of the document you want indexed. If you don't specify an id, elasticsearch will generate one for you.
You set the source to the json object you want indexed then call the doIndex method specifying a callback if needed. The node example does not index docs, but the angular and jquery examples show a basic example and can easily be used with the node client.
https://github.com/fullscale/elastic.js/blob/master/examples/angular/js/controllers.js#L30
Also have a peek at the tests:
https://github.com/fullscale/elastic.js/blob/master/tests/index_test.js#L265

elastic.js nowadays only implements the Query DSL, so it can't be used for this scenario anymore. See this commit.
