Mass update of all data with NoSQL - CouchDB

Trying to get my head around best practice for a full update.
Example scenario: I'm storing a document for each file on a hard drive. A process runs daily to update all the file information. No history needs to be kept of deleted files; the documents can just go.
Clearly, querying each record individually would be inefficient, and it wouldn't cover deleted files (the data flow is one way only).
So I guess the options are:
1) Store all the records with a timestamp, and then delete all yesterday's records
2) Delete the database, but then I would lose all my views, etc.
3) Something else?

"No history needs to be kept of deleted files"
If you use a single database, every time you delete your documents CouchDB will keep a tombstone for each of them (tombstones don't use a lot of space, but still...).
Here's the solution I would use:
Create a Template database with the design documents. Maintain your views there.
Create a script that does the following:
Create a database with today's date
Replicate _design documents from the template database
Add your data to the database
With this solution, each day's content can be completely removed from CouchDB simply by deleting that day's database, leaving no tombstones behind.
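A minimal sketch of such a script in Python with the requests library (the URL, credentials, template database name, and sample documents are all placeholders):

import datetime
import requests

COUCH = "http://admin:secret@localhost:5984"  # placeholder URL and credentials
db = "files-" + datetime.date.today().isoformat()  # e.g. files-2024-06-01

# 1. Create a database named after today's date
requests.put(f"{COUCH}/{db}").raise_for_status()

# 2. Replicate only the _design documents from the template database
ddocs = requests.get(
    f"{COUCH}/files-template/_all_docs",
    params={"startkey": '"_design/"', "endkey": '"_design0"'},
).json()["rows"]
requests.post(
    f"{COUCH}/_replicate",
    json={
        "source": f"{COUCH}/files-template",
        "target": f"{COUCH}/{db}",
        "doc_ids": [row["id"] for row in ddocs],
    },
).raise_for_status()

# 3. Bulk-load today's file data (placeholder documents)
docs = [{"_id": "/etc/hosts", "size": 220}]
requests.post(f"{COUCH}/{db}/_bulk_docs", json={"docs": docs}).raise_for_status()

Yesterday's database can then be dropped with a single DELETE request, which removes its documents without leaving tombstones.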

Related

Deleting _deleted documents on CouchDB by date

My CouchDB database is getting bigger, and I would like to remove documents by date. I would also like to remove _deleted documents by date.
I know how to replicate my DB while removing documents by date, but:
Is there a way to do the same with _deleted documents? I mean, remove _deleted documents by date.
There's not really a way to conditionally cause a deletion using filtered replication, nor can you replicate a complete removal of a document.
You have a variety of options:
you can avoid replicating updates to old documents by filtering on date, but if they have already been replicated they won't be deleted
you can make a view that returns old documents, and use a script to delete them at the source database (see the sketch after this list). The deletions will replicate to any target databases, but every database will retain at least a {_deleted: true} tombstone of the documents [that's how the deletion gets replicated in the first place]
you can find old documents and _purge them, but you'll have to do that on each replica
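For the view-and-script route, a rough sketch in Python with requests, assuming a hypothetical by_date view that emits each document's date as the key and its _rev as the value (the URL, view name, and cutoff are placeholders):

import requests

DB = "http://localhost:5984/mydb"  # placeholder database URL
CUTOFF = "2015-01-01"              # delete everything dated before this

# Hypothetical view: function(doc) { emit(doc.date, doc._rev); }
rows = requests.get(
    f"{DB}/_design/maint/_view/by_date",
    params={"endkey": f'"{CUTOFF}"'},
).json()["rows"]

# Mark each old document deleted in one bulk request. This leaves a
# {_deleted: true} tombstone that replicates to the other databases.
stubs = [{"_id": r["id"], "_rev": r["value"], "_deleted": True} for r in rows]
requests.post(f"{DB}/_bulk_docs", json={"docs": stubs}).raise_for_status()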
What is your main goal?
If you have hundreds of objects and you want to hide the old ones from the UI of all replicas, write a script to find them and delete them (via HTTP DELETE or by setting _deleted: true) in a source/master replica, and the changes will propagate.
If you have bazillions of e.g. log messages and you need to free up space by forgetting old ones, write a script to find and _purge them and finally _compact, then run it on every replica. But for a case like that, it might be better to rotate databases instead, e.g. manually "shard" or bin into a different database each week, and every week simply drop the database that is N+1 weeks old on each replica.
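For the purge-and-compact route, a sketch against one replica (the URL, credentials, and the document ID/revision are placeholders; remember that purges do not replicate, so this must run on every copy):

import requests

DB = "http://admin:secret@localhost:5984/logs"  # placeholder URL/credentials

# Purge erases all trace of the listed revisions from this replica only.
requests.post(
    f"{DB}/_purge",
    json={"old-log-000123": ["1-967a00dff5e02add41819138abb3284d"]},  # placeholder id/rev
).raise_for_status()

# Compaction then actually reclaims the disk space.
requests.post(
    f"{DB}/_compact", headers={"Content-Type": "application/json"}
).raise_for_status()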
If your database is getting bigger, this is probably due to the versioning of your documents. A simple way to free some space is to run database compaction (see the documentation).
As for _deleted documents, the only way to REALLY delete them is to purge them.
However, purging is risky (purges do not replicate, and purging documents that other replicas still hold can confuse replication), so it's not generally recommended. It should be reserved for removing sensitive data, such as credentials.

rename collection vs updating collection

I have a MongoDB that I need to update daily (delete non-relevant documents and add new ones).
The DB is not sharded.
I take the data from an external data master which is not so easy to work with.
There are 2 options:
1. Re-ingest the entire DB (it's not so big) into a temp collection and then rename it to the old collection name (with dropTarget set to true)
2. Do the hard work myself: delete the old entries, and figure out from the data master which new documents are relevant and insert them into the DB
Option 1 is obviously preferable, but what is the impact? I'm doing this maintenance at a late hour, but I don't want users to get errors when querying the DB during the rename.
Is using rename to overwrite a collection a standard way to get things done, or am I abusing the API? :)
According to the documentation, renameCollection blocks all database activity for the duration of the operation. If your users have set a sufficiently large timeout, they will not be directly affected by the rename, but since the dataset can change under their feet there might be side effects. For example, renaming a collection can invalidate open cursors, which interrupts queries that are currently returning data.
Regarding renaming collections in production: personally I would avoid it where possible, firstly because of the cursor issue above, but more importantly because an incomplete renameCollection operation can leave the target collection in an unusable state and require manual intervention to clean up. Instead, I would use an update with upsert:true that overwrites the entire document, or inserts a new record if one doesn't exist.
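A sketch of that upsert approach with PyMongo (the connection string, database/collection names, and the loader stub are placeholders):

from pymongo import MongoClient, ReplaceOne

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["catalog"]["items"]                  # placeholder db/collection names

def fetch_from_data_master():
    # Placeholder for the external data master feed
    return [{"_id": 1, "name": "example", "updated": "2016-01-01"}]

# Overwrite each document in place, inserting it if it does not exist yet.
# No rename, so no database-wide lock and no invalidated cursors.
ops = [ReplaceOne({"_id": doc["_id"]}, doc, upsert=True) for doc in fetch_from_data_master()]
coll.bulk_write(ops, ordered=False)

Documents the feed no longer contains can then be swept afterwards with a delete_many on IDs outside the new set.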

Can ElasticSearch delete all and insert new documents in a single query?

I'd like to swap out all documents for a specific index's type. I'm thinking about this like a database transaction, where I'd:
Delete all documents inside of the type
Create new documents
Commit
It appears that this is possible with ElasticSearch's bulk API, but is there a more direct way?
Based on the following statement from the Elasticsearch Delete by Query API documentation:
Note, delete by query bypasses versioning support. Also, it is not recommended to delete "large chunks of the data in an index", many times, it’s better to simply reindex into a new index.
You might want to reconsider removing entire types and recreating them in the same index. As this statement suggests, it is better to simply reindex.
In fact, I have a scenario where we have an index of manufacturer products, and when a manufacturer sends an updated list of products, we load the new data into our persistent store and then completely rebuild the entire index. I have implemented index aliases to mask the actual index being used. When product changes occur, a process rebuilds the new index in the background (a process that currently takes about 15 minutes) and then switches the alias to the new index once the data load is complete, deleting the old index. This is completely seamless and causes no downtime for our users.
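A sketch of that alias swap against Elasticsearch's REST API using Python's requests (the URL and index names are placeholders, and the mapping/bulk-load steps are omitted):

import requests

ES = "http://localhost:9200"  # placeholder Elasticsearch URL

# Build the replacement index in the background
# (mappings and the bulk data load are omitted here).
requests.put(f"{ES}/products_v2").raise_for_status()

# Atomically repoint the alias from the old index to the new one,
# then drop the old index. Searches against "products" never see a gap.
requests.post(f"{ES}/_aliases", json={
    "actions": [
        {"remove": {"index": "products_v1", "alias": "products"}},
        {"add": {"index": "products_v2", "alias": "products"}},
    ],
}).raise_for_status()
requests.delete(f"{ES}/products_v1")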

CouchDB delete and recreate a document

I'm trying to avoid revisions building up in my CouchDB, and I also want to use TouchDB's "bulk pull" for replication (it bulk-pulls all 1st-revs). Would it be bad practice to just delete a document and recreate it, rather than modifying it, so that all documents stay at rev-1?
Deleting a document in CouchDB will not reset the _rev.
CouchDB never truly deletes a document; it simply marks the last revision as deleted. Compaction will delete previous revisions, keeping only the last one. This is needed for replication to work properly, and it is why the deleted revision of a document should not contain any data, only the _id of the document and the _deleted flag.
The only way to completely remove any trace of deleted documents is to copy all the remaining documents to a new database. But keep in mind the consequences for replication.
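For completeness, a sketch of the two ways a deletion can be recorded (the URL, document ID, and revision are placeholders): a plain DELETE keeps the tombstone minimal, while PUTting the whole body with _deleted set drags all the data into the tombstone.

import requests

DB = "http://localhost:5984/mydb"     # placeholder database URL
doc_id, rev = "file-123", "4-abc123"  # placeholder id/revision

# Good: DELETE stores a minimal tombstone (_id, revision history, _deleted).
requests.delete(f"{DB}/{doc_id}", params={"rev": rev})

# Bad: this also "deletes" the document, but keeps the full body in the tombstone.
# requests.put(f"{DB}/{doc_id}", json={"_rev": rev, "_deleted": True, "data": "..."})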
Well, I want to say that your proposal makes me feel dirty, but that wouldn't be an SO answer, so...
You mention TouchDB and bulk pull, so you have a mobile app whose data can be modified externally, and I assume it wants to be able to modify its own data too. So the biggest issue I can think of would be update conflict resolution, i.e. how do you handle changes to the document on both the client and the server while the client is offline? I think you'll end up doing a lot of the synchronisation work that Couch is meant to handle for you.

Delete all documents in a CouchDB database *except* the design documents

Is it possible to delete all documents in a CouchDB database, except design documents, without creating a specific view for that?
My first approach was to access the _all_docs standard view and discard the documents starting with _design. This works, but for large databases it is too slow, since the documents need to be requested from the database one at a time (in order to get each document's revision).
If this is the only valid approach, I think it is much more practical to delete the complete database and recreate it from scratch, inserting the design documents again.
I can think of a couple of ideas.
Use _all_docs
You do not need to fetch all the documents; only the IDs and revisions are required, and by default that is all _all_docs returns. You can make a pretty big request in a batch (10k or 100k docs at a time should be fine).
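A sketch of one such batch in Python with requests (the URL and batch size are placeholders; in practice you would loop with startkey until the database is empty):

import requests

DB = "http://localhost:5984/mydb"  # placeholder database URL

# _all_docs returns only id/key/rev by default -- no document bodies.
rows = requests.get(f"{DB}/_all_docs", params={"limit": 10000}).json()["rows"]

# Skip design documents and mark everything else deleted in one request.
stubs = [
    {"_id": r["id"], "_rev": r["value"]["rev"], "_deleted": True}
    for r in rows
    if not r["id"].startswith("_design/")
]
requests.post(f"{DB}/_bulk_docs", json={"docs": stubs}).raise_for_status()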
Replicate then delete
You could use an _all_docs query to get the IDs of all design documents.
GET /db/_all_docs?startkey="_design/"&endkey="_design0"
Then replicate them somewhere temporary.
POST /_replicator
{ "source":"db", "target":"db_ddocs", "create_target":true
, "user_ctx": {"roles":["_admin"]}
, "doc_ids": ["_design/ddoc_1", "_design/ddoc_2", "etc..."]
}
Now you can just delete the original database and replicate the temporary one back by swapping the "source" and "target" values.
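For instance, a sketch of the reverse replication in Python with requests (the URL and credentials are placeholders):

import requests

COUCH = "http://admin:secret@localhost:5984"  # placeholder URL/credentials

# After deleting and recreating "db", pull the design docs back in.
requests.post(f"{COUCH}/_replicate", json={
    "source": f"{COUCH}/db_ddocs",
    "target": f"{COUCH}/db",
}).raise_for_status()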
Deleting vs "deleting"
Note, these are really apples vs. oranges techniques. By deleting a database, you are wiping out the edit history of all its documents. In other words, you cannot replicate those deletion events to any other database. When you "delete" a document in CouchDB, it stores a record of that deletion. If you replicate that database, those deletions will be reflected in the target. (CouchDB stores "tombstones" indicating the document ID, its revision history, and its deleted state.)
That may or may not be important to you. The first idea is probably considered more "correct"; however, I can see the value of the second: you can visualize the entire program to accomplish it in your head. It's only a few queries and you're done. No looping through _all_docs batches, no headache. Your specific situation will probably make it obvious which is better.
Install couchapp, pull down the design doc to your hard disk, delete the db in futon, push the design doc back up to your recreated database. =)
You could write a shell script that goes through the list of all documents and deletes them one by one, except the design docs. Apparently couch-batch can do that. Note that you don't need to fetch the whole docs to do that, just the ID and revision.
Other than that, I think filtered replication (or the replication proposed by JasonSmith) is your best bet.