Validate documents during CouchDB replication process

During replication, I need to validate documents that a client tries to write to my CouchDB instance. Ideally, I would reject just the "invalid" documents and allow all others to pass through. Another acceptable outcome would be to reject the whole replication and not accept any documents at all. I cannot use the validate_doc_update function because it lacks the information necessary to make a decision.
I thought about serving all endpoints needed for replication behind a service and validating docs at the application level. For example, take all docs from POST /_bulk_docs and send back a 400 error response if some docs are invalid.
Do I understand it right that such an approach stops the replication process and the database might be left with partially replicated documents? Documents are uploaded in chunks during replication, so there might be a couple of POST /_bulk_docs calls where the first one contains only valid docs and the second contains invalid ones.
Is there another way to discard only the invalid docs?
Thanks for your help!

You can apply a Cloudant Query selector via your replication document to specify which documents are valid. The canonical example is to ditch tombstones:
{
  "_id": "repl01",
  "source": "https://.../<source>",
  "target": "https://.../<target>",
  "selector": {
    "_deleted": {
      "$exists": false
    }
  }
}
See https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-replication-api#the-selector-field for more details.
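For completeness, here is a rough sketch of creating such a filtered replication by writing the document into the _replicator database with Node's built-in fetch. The host, credentials, and database names are placeholders, not from the question:

// Sketch: create a filtered replication document in the _replicator database.
// Host, credentials, and database names are assumptions.
const COUCH = 'http://admin:password@localhost:5984';

async function createFilteredReplication() {
  const res = await fetch(`${COUCH}/_replicator/repl01`, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      source: `${COUCH}/source_db`,
      target: `${COUCH}/target_db`,
      continuous: true,
      // Only documents matching the selector are replicated;
      // here, deleted documents (tombstones) are skipped.
      selector: { _deleted: { $exists: false } }
    })
  });
  console.log(res.status, await res.json());
}

createFilteredReplication().catch(console.error);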

Related

GraphQL: One Big Query vs Lots of Little Queries

I know this question is as old as time -- and there is no silver bullet. But I think there might be a solid pattern out there, and I would rather not reinvent the wheel.
Consider the following two schema options:
Approach 1) My original implementation
type Query {
  note(id: ID!): Note
  notes(input: NotesQueryInput): [Note!]!
}
Approach 2) My current experimental approach
type DatedId {
  date: DateTime!
  id: ID!
}

type Query {
  note(id: ID!): Note
  notes(input: NotesQueryInput): [DatedId!]!
}
The differences are:
with approach 1) the notes query returns a list of potentially large Note objects
with approach 2) the notes query returns a much lighter payload, BUT the client then needs to execute n additional queries
So my question is: with the Apollo Client / Server stack and the in-memory cache, which approach is better for achieving a responsive client with a scalable server?
Notes
With approach 1 -- my 500 MB dyno (Heroku server) ran out of memory.
I expect that with either approach I will implement pagination with the connection / edge pattern.
The GraphQL server is primarily there to serve my own frontend.
If you're running out of memory on the server, it may be time to upgrade. If you're running out of memory now, imagine what will happen when you have multiple users hitting your endpoint.
The only other way to get around that specific problem is to break up your query into several smaller queries. However, your proposed approach suffers from a couple of problems:
You will end up hammering your server and your database with significantly more requests
Your UI may take longer to load, depending on whether the requested data needs to be rendered immediately
Handling the scenario when one of your requests fails, or attempting to retry a failed request, may be challenging
You already suggested adding pagination, and I think it would be a much better way to break up your single large query into smaller ones. Not only does pagination lend itself to a better user experience, but by enforcing a limit on the page size you effectively enforce a limit on the size of any given query.
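As a rough sketch of what that could look like with a Relay-style connection (type and argument names here are illustrative, not from the question):

// Sketch of a cursor-based connection for paginating notes.
// Type names, field names, and the default page size are illustrative.
const { gql } = require('apollo-server');

const typeDefs = gql`
  type Note {
    id: ID!
    title: String
    body: String
  }

  type NoteEdge {
    cursor: String!
    node: Note!
  }

  type PageInfo {
    endCursor: String
    hasNextPage: Boolean!
  }

  type NoteConnection {
    edges: [NoteEdge!]!
    pageInfo: PageInfo!
  }

  type Query {
    note(id: ID!): Note
    notes(first: Int = 20, after: String): NoteConnection!
  }
`;

The cap on first is what bounds how much data any single notes query can pull into server memory.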
One other option you may consider exploring is using deferred queries. This experimental feature was added specifically with expensive queries in mind. By making one or more fields on your Note type deferred, you would effectively return null for them initially, and their values would be sent down in a second "patch" response after they finally resolve. This works great for fields that are expensive to resolve, but may also help with fields that return a large amount of data.

Mongodb queries from multiple processes; how to implement atomicity?

I have a MongoDB database where multiple Node processes read and write documents. I would like to know how I can make it so that only one process can work on a document at a time (some sort of locking) that is freed after the process has finished updating that entry.
My application should do the following:
Walk through each entry one by one with a cursor.
(Lock the entry so no other processes can work with it)
Fetch information from a third-party site.
Calculate new information and update the entry.
(Unlock the document)
Also, after unlocking the document, there will be no need for other processes to update it for a few hours.
Later on I would like to set up multiple MongoDB clusters to reduce the load on the databases, so the solution should work with both a single database server and multiple ones, or at least with multiple Mongo servers.
An elegant solution that doesn't involve locks is:
Add a version property to your document.
When updating the document, increment the version property.
When updating the document, include the last read version in the find query. If your document has been updated elsewhere, the find query will yield no results and your update will fail.
If your update fails, you can retry the operation.
I have used this pattern with great success in the past.
Example
Imagine you have a document {_id: 123, version: 1}.
Imagine now you have 3 Mongo clients concurrently doing db.collection.findAndModify({ query: { _id: 123, version: 1 }, update: { $inc: { version: 1 } } });.
The first update will apply; the rest will fail. Why? Because version is now 2, and the query included version: 1.
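A minimal sketch of that pattern with the Node.js driver, wrapped in a retry loop (collection name, field names, and computeUpdate() are placeholders):

// Sketch: optimistic concurrency with retries using the official MongoDB Node.js driver.
// Collection name, field names, and computeUpdate() are placeholders.
const { MongoClient } = require('mongodb');

async function updateWithVersionCheck(collection, id, computeUpdate, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const doc = await collection.findOne({ _id: id });
    const updateFields = computeUpdate(doc); // your complex/expensive logic

    // Apply the update only if nobody bumped the version since we read the doc.
    const result = await collection.updateOne(
      { _id: id, version: doc.version },
      { $set: updateFields, $inc: { version: 1 } }
    );
    if (result.modifiedCount === 1) return true; // success
    // Lost the race: another client updated the document first, so retry.
  }
  return false;
}

async function main() {
  const client = await MongoClient.connect('mongodb://localhost:27017');
  const docs = client.db('test').collection('docs');
  await updateWithVersionCheck(docs, 123, doc => ({ status: 'processed' }));
  await client.close();
}

main().catch(console.error);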
Per the MongoDB documentation, the $isolated operator "prevents a write operation that affects multiple documents from yielding to other reads or writes once the first document is written ... causes write operations to acquire an exclusive lock on the collection ... will make WiredTiger single-threaded for the duration of the operation."
So if you are updating multiple documents, you could first get the data from the third-party API, parse the info into an array, for example, and then use something like this in the Mongo shell:
db.foo.update(
  { status: "A", $isolated: 1 },
  { $set: { < your key >: < your info > } }, // use the info from your array
  { multi: true }
)
Or if you have to update the documents one by one, you could use findAndModify() or updateOne() from the Node.js driver for MongoDB. Please note that, per the MongoDB documentation, 'When modifying a single document, both findAndModify() and the update() method atomically update the document.'
An example of updating one by one: first connect to mongod with the Node.js driver, then call the third-party API (using Node's request module, for example), get and parse the data, and then use it to modify your documents, something like below:
var request = require('request');
var MongoClient = require('mongodb').MongoClient,
    test = require('assert');

MongoClient.connect('mongodb://localhost:27017/test', function(err, db) {
  var collection = db.collection('simple_query');
  collection.find().forEach(
    function(doc) {
      // Call the third-party API for this document.
      request('http://www.google.com', function(error, response, body) {
        console.log('body:', body); // parse body for your info
        // Atomically find and update the matching document.
        collection.findAndModify(
          { <query based on your doc> },  // query
          [],                             // sort
          { $set: { < your key >: < your info > } },
          function(err, result) {}
        );
      });
    },
    function(err) {
      // called when the cursor is exhausted
    }
  );
});
I encountered this question today and feel like it's been left open.
First, findAndModify really seems like the way to go about it. But I found vulnerabilities in both suggested answers:
In Treefish Zhang's answer: if you run multiple processes in parallel they will pick up the same documents, because at the beginning you use "find" and not "findAndModify"; "findAndModify" is only used after the processing is done. During processing the document is still not updated, so other processes can query it as well.
In arboreal84's answer: what happens if the process crashes in the middle of handling the entry? If you update the version while querying and the process then crashes, you have no clue whether the operation succeeded or not.
Therefore, I think the most reliable approach would be to have multiple fields:
version
locked:[true/false],
lockedAt:[timestamp] (optional - in case the process crashed and was not able to unlock, you may want to retry after x amount of time)
attempts:0 (optional - increment this if you want to know how many process attempts were done, good to count retries)
Then, for your code (see the sketch after these steps):
findAndModify: where version=oldVersion and locked=false, set locked=true, lockedAt=now
process the entry
if process succeeded, set locked=false, version=newVersion
if process failed, set locked=false
optional: for retry after ttl you can also query by "or locked=true and lockedAt<now-ttl"
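A minimal sketch of that claim/process/unlock flow with the MongoDB Node.js driver 4.x/5.x (where findOneAndUpdate returns { value }); the collection, TTL, and fetchFromThirdParty() are placeholders, and the version check is omitted from the claim filter for brevity:

// Sketch: claim an entry with the lock fields above, process it, then unlock it.
// Collection, TTL, and fetchFromThirdParty() are placeholders.
const LOCK_TTL_MS = 10 * 60 * 1000;

async function claimNextEntry(collection) {
  const now = new Date();
  const { value: entry } = await collection.findOneAndUpdate(
    {
      $or: [
        { locked: false },
        // Retry entries whose lock went stale (process crashed before unlocking).
        { locked: true, lockedAt: { $lt: new Date(now.getTime() - LOCK_TTL_MS) } }
      ]
    },
    { $set: { locked: true, lockedAt: now }, $inc: { attempts: 1 } },
    { returnDocument: 'after' }
  );
  return entry; // null when no entry is available
}

async function processEntry(collection, entry) {
  try {
    const info = await fetchFromThirdParty(entry); // your external call
    // Success: store the result, bump the version, release the lock.
    await collection.updateOne(
      { _id: entry._id },
      { $set: { locked: false, info }, $inc: { version: 1 } }
    );
  } catch (err) {
    // Failure: just release the lock so another worker can retry later.
    await collection.updateOne({ _id: entry._id }, { $set: { locked: false } });
  }
}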
And about:
I have a VPS in New York and one in Hong Kong and I would like to apply the lock on both database servers, so those two VPS servers won't perform the same task by any chance.
I think the answer to this depends on why you need two database servers and why they have the same entries:
If one of them is a secondary in a cross-region replica set for high availability, findAndModify will query the primary, since writing to a secondary replica is not allowed, so you don't need to worry about the two servers being in sync (there may be a latency issue, but you'll have that anyway since you're communicating between two regions).
If you want it just for sharding and horizontal scaling, there's no need to worry, because each shard holds different entries, so an entry lock is only ever relevant to one shard.
Hope this helps someone in the future.
relevant questions:
MongoDB as a queue service?
Can I trust a MongoDB collection as a task queue?

How do I resolve RequestRateTooLargeException on Azure Search when indexing a DocumentDB source?

I have a DocumentDB instance with about 4,000 documents. I just configured Azure Search to search and index it. This worked fine at first. Yesterday I updated the documents and indexed fields along with one UDF to index a complex field. Now the indexer reports that DocumentDB is returning RequestRateTooLargeException. The docs on that error suggest throttling calls, but it seems like Search would need to do that. Is there a workaround?
Azure Search uses the DocumentDB client SDK, which retries internally with an appropriate timeout when it encounters a RequestRateTooLarge error. However, this only works if there are no other clients using the same DocumentDB collection concurrently. Check whether you have other concurrent users of the collection; if so, consider adding capacity to the collection.
This could also happen because, due to some other issue with the data, the DocumentDB indexer isn't able to make forward progress; it will then retry on the same data and may encounter the same data problem again, akin to a poison message. If you observe that a specific document (or a small number of documents) causes indexing problems, you can choose to ignore them. I'm pasting an excerpt from the documentation we're about to publish:
Tolerating occasional indexing failures
By default, an Azure Search indexer stops indexing as soon as even a single document fails to be indexed. Depending on your scenario, you can choose to tolerate some failures (for example, if you repeatedly re-index your entire datasource). Azure Search provides two indexer parameters to fine-tune this behavior:
maxFailedItems: The number of items that can fail indexing before an indexer execution is considered a failure. Default is 0.
maxFailedItemsPerBatch: The number of items that can fail indexing in a single batch before an indexer execution is considered a failure. Default is 0.
You can change these values at any time by specifying one or both of these parameters when creating or updating your indexer:
PUT https://service.search.windows.net/indexers/myindexer?api-version=[api-version]
Content-Type: application/json
api-key: [admin key]

{
  "dataSourceName" : "mydatasource",
  "targetIndexName" : "myindex",
  "parameters" : { "maxFailedItems" : 10, "maxFailedItemsPerBatch" : 5 }
}
Even if you choose to tolerate some failures, information about which documents failed is returned by the Get Indexer Status API.
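As a small illustration, a sketch of calling that status endpoint with Node's fetch; the service URL, indexer name, API version, and key are placeholders, and the exact response fields can vary by API version:

// Sketch: check indexer status to see which documents failed to index.
// Service URL, indexer name, api-version, and admin key are placeholders.
const SERVICE = 'https://service.search.windows.net';
const API_VERSION = '[api-version]'; // replace with the API version you use

async function getIndexerStatus() {
  const res = await fetch(
    `${SERVICE}/indexers/myindexer/status?api-version=${API_VERSION}`,
    { headers: { 'api-key': process.env.SEARCH_ADMIN_KEY } }
  );
  const status = await res.json();
  // The last execution result includes per-document errors when failures are tolerated.
  console.log(status.lastResult);
}

getIndexerStatus().catch(console.error);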

Handling conflict in find, modify, save flow in MongoDB with Mongoose

I would like to update a document in a way that involves reading another collection and applying complex modifications, so the update operators in findAndModify() cannot serve my purpose.
Here's what I have:
Collection.findById(id, function (err, doc) {
  // read from other collection, validation
  // modify fields in doc according to user input
  // (with decent amount of logic)
  doc.save(function (err, doc) {
    if (err) {
      return res.json(500, { message: err });
    }
    return res.json(200, doc);
  });
});
My worry is that this flow might cause conflicts if multiple clients happen to modify the same document.
It is said here that:
Operations on a single document are always atomic with MongoDB databases
I'm a bit confused about what "operations" means.
Does this mean that findById() will hold a lock until doc goes out of scope (after the response is sent), so there wouldn't be conflicts? (I don't think so.)
If not, how should I modify my code to support multiple clients, knowing that they will modify the same Collection?
Will Mongoose report conflict if it occurs?
How to handle the possible conflict? Is it possible to manually lock the Collection?
I see suggestions to use Mongoose's versionKey (or a timestamp) and retry on stale documents
Don't use MongoDB at all...
Thanks.
EDIT
Thanks @jibsales for the pointer; I now use Mongoose's versionKey (a timestamp will also work) to avoid committing conflicts.
aaronheckmann — Mongoose v3 part 1 :: Versioning
See this sample code:
https://gist.github.com/anonymous/9dc837b1ef2831c97fe8
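As an additional rough sketch of the same idea, here is a conditional update that filters on Mongoose's __v version key and retries on stale documents (the Note model, its schema, and applyChanges() are assumptions; modifiedCount assumes Mongoose 6+):

// Sketch: optimistic concurrency with Mongoose's version key (__v).
// Model, schema, and applyChanges() are assumptions for illustration.
const mongoose = require('mongoose');

const Note = mongoose.model('Note', new mongoose.Schema({
  title: String,
  body: String
}));

async function updateWithRetry(id, applyChanges, maxRetries = 3) {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    const doc = await Note.findById(id);
    const changes = await applyChanges(doc); // read other collections, complex logic

    // Write only if no other client bumped the version since we read the doc.
    const result = await Note.updateOne(
      { _id: doc._id, __v: doc.__v },
      { $set: changes, $inc: { __v: 1 } }
    );
    if (result.modifiedCount === 1) return true;
    // Stale document: someone else modified it first, so re-read and retry.
  }
  throw new Error('Could not apply update after retries');
}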
Operations refers to reads/writes. Bear in mind that MongoDB is not an ACID-compliant data layer, and if you need true ACID compliance, you're better off picking another tech. That said, you can achieve atomicity and isolation via the two-phase commit technique outlined in this article in the MongoDB docs. This is no small undertaking, so be prepared for some heavy lifting, as you'll need to work with the native driver instead of Mongoose. Again, my ultimate suggestion is to not drink the NoSQL koolaid if you need transaction support, which it sounds like you do.
When MongoDB receives a request to update a document, it will lock the database until it has completed the operation. Any other requests that MongoDB receives will wait until the locking operation has completed and the database is unlocked. This lock/wait behavior is automatic, so there aren't any conflicts to handle. You can find a lot more information about this behavior in the Concurrency section of the FAQ.
See jibsales answer for links to MongoDB's recommended technique for doing multi-document transactions.
There are a couple of NoSQL databases that do full ACID transactions, which would make your life a lot easier. FoundationDB is one such database. Data is stored as Key-Value but it supports multiple data models through layers.
Full disclosure: I'm an engineer at FoundationDB.
In my case, my mistake was trying to query a dynamic field with the upsert option. This guide helped me: How to solve error E11000 duplicate
According to that guide, you're probably making one of two mistakes:
Upserting a document with findOneAndUpdate() where the query matches a non-unique field.
Inserting many new documents in one go without using "ordered = false".

How to account for a failed write or add process in Mongodb

So I've been trying to wrap my head around this one for weeks, but I just can't seem to figure it out. MongoDB isn't equipped to deal with rollbacks as we typically understand them (i.e. when a client adds information to the database, like a username for example, but quits in the middle of the registration process, the DB is left with some "hanging" information that isn't associated with anything). How can MongoDB handle that? Or if no one can answer that question, maybe you can point me to a source/example that can? Thanks.
MongoDB does not support multi-document transactions: you can't perform atomic multi-statement transactions to ensure consistency; atomicity is only guaranteed for operations on a single document at a time. When dealing with NoSQL databases you need to validate your data as much as you can; they seldom complain about anything. There are some workarounds and patterns to achieve SQL-like transactions. For example, in your case, you can store the user's information in a temporary collection, check the data's validity, and store it in the users collection afterwards.
This should be straightforward, but things get more complicated when we deal with multiple documents. In that case, you need to create a designated collection for transactions. For instance:
transaction collection
{
  id: ..,
  state: "new_transaction",
  value1: values from document_1 before updating document_1,
  value2: values from document_2 before updating document_2
}
// update document 1
// update document 2
Ooohh!! something went wrong while updating document 1 or 2? No worries, we can still restore the old values from the transaction collection.
This pattern is known as compensation to mimic the transactional behavior of SQL.
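A rough sketch of that compensation flow with the Node.js driver; the collection names, transaction document shape, and rollback strategy are illustrative assumptions:

// Sketch: snapshot old values into a transaction document, apply the updates,
// and restore the snapshot if anything fails. Names are illustrative.
async function updateTwoDocuments(db, id1, id2, changes1, changes2) {
  const docs = db.collection('documents');
  const txns = db.collection('transactions');

  // 1. Snapshot the current state of both documents.
  const doc1 = await docs.findOne({ _id: id1 });
  const doc2 = await docs.findOne({ _id: id2 });
  const { insertedId: txnId } = await txns.insertOne({
    state: 'new_transaction',
    value1: doc1,
    value2: doc2
  });

  try {
    // 2. Apply the updates.
    await docs.updateOne({ _id: id1 }, { $set: changes1 });
    await docs.updateOne({ _id: id2 }, { $set: changes2 });
    await txns.updateOne({ _id: txnId }, { $set: { state: 'done' } });
  } catch (err) {
    // 3. Compensate: restore the snapshots taken before the updates.
    await docs.replaceOne({ _id: id1 }, doc1);
    await docs.replaceOne({ _id: id2 }, doc2);
    await txns.updateOne({ _id: txnId }, { $set: { state: 'rolled_back' } });
    throw err;
  }
}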
