Point-in-time restores of databases and documents using Cloudant

How can I save changes in CouchDB / Cloudant in order to later do point-in-time restores of my databases, or even specific documents?

We’re working on making this a first-class feature, but until we roll it out, this is how one of our customers did it:
You have collections, and within those collections, resources. So, you keep a logging database where every document has an ID like collection-resource: for a collection named "cars" and a resource named "ford", you'd have a document in your logging database named cars-ford. That document looks like this:
{
  versions: [...]
}
Any time that resource is touched or modified, your application updates the logging document by appending the new version to the end of the versions field. That version might look like this:
{
  timestamp: ..., // some integer timestamp, for sorting
  doc: {...}      // attributes of the document as of the save
}
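For instance, the application-side hook that writes these log entries could look like this minimal Node.js sketch (illustrative names throughout; assumes Node 18+ for the built-in fetch and that BASE carries any needed credentials):
// Append a new version to the logging doc for collection/resource.
// BASE is assumed to point at the logging database, e.g. {CLOUDANT_HOST}/loggy.
async function logVersion(BASE, collection, resource, attributes) {
  const url = BASE + '/' + collection + '-' + resource;
  // Fetch the current logging doc; a 404 means this is the first version.
  const res = await fetch(url);
  const doc = res.ok
    ? await res.json()
    : { _id: collection + '-' + resource, versions: [] };
  // Append the new version and save it back (real code should retry on 409).
  doc.versions.push({ timestamp: Date.now(), doc: attributes });
  await fetch(url, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(doc)
  });
}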
Later, we'll define a view that returns a list of all versions of all documents, sorted by when each change occurred.
Then, here's how you use that to do restores and the like:
Getting the most recent version of a resource
Get the document in its entirety, and grab the last element in the versions field. That's the most recent version.
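In code, that's just the last element of the array (a sketch reusing the illustrative BASE setup from above):
const doc = await (await fetch(BASE + '/cars-ford')).json();
const latest = doc.versions[doc.versions.length - 1]; // most recent version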
See all versions relative to a timestamp
We'll create a view to sort by timestamp. The view looks like this:
{
  map: "function(doc) {
    for(var i = 0; i < doc.versions.length; i++) {
      emit(doc.versions[i].timestamp, doc.versions[i].doc);
    }
  }"
}
Say our database is named loggy, the design doc where our views live is named restore, and the view itself is named time. Then we'll make a GET request to this URL:
{CLOUDANT_HOST}/loggy/_design/restore/_view/time?startkey='...'
...where the value for startkey is some timestamp. This, unmodified, will return every version after the indicated timestamp. Add limit=X and you'll get the X versions after the timestamp. Add descending=true and you'll get versions before the timestamp, instead of after.
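For instance (timestamp invented for illustration), the five versions at or immediately before a given moment would be:
{CLOUDANT_HOST}/loggy/_design/restore/_view/time?startkey=1633046400&descending=true&limit=5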
See the Nth revision for a resource
Much like above, but we'll tweak our view a little:
{
  map: "function(doc) {
    // use a numeric index so the keys sort as numbers, not strings
    for(var i = 0; i < doc.versions.length; i++) {
      emit(i, doc.versions[i].doc);
    }
  }"
}
Now our view results are keyed by index rather than timestamp. So, instead of passing a timestamp to startkey, we just pass N to get the versions around the Nth revision.
Getting the number of revisions for a collection or resource
We'll use another view to group by collection and resource:
{
  map: "function(doc) {
    // split the ID into collection and resource
    var parts = doc._id.split('-');
    // emit them as keys so we can group by them
    emit([parts[0], parts[1]], null);
  }",
  reduce: "_count"
}
Use the query parameters group and group_level to group results by their keys. So, if we want the number of events that have touched resources in the cars collection, we would use a querystring like this:
?group_level=1&startkey=["cars"]&endkey=["cars",{}]
group=true groups rows whose keys are identical, while group_level=1 says "only group on the first element of the key", which in our case is the collection. The startkey/endkey pair restricts the results to keys whose first element is "cars" (an empty object sorts after every string, so it closes out the range).
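The response would look something like this (the count is invented for illustration):
{"rows": [{"key": ["cars"], "value": 27}]}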
Getting all resources for a given collection
Using the _all_docs view, we'll use a querystring like this:
?reduce=false&startkey="{collection}-"&endkey="{collection}0"
Remember the reduce part of our grouping view? That _count value means "return the number of records emitted by map". reduce=false means "don't do that": only the map function is run.
That startkey and endkey pair relies on how Cloudant sorts document IDs: the character 0 sorts just after -, so the range matches every ID beginning with "{collection}-" and excludes everything else, e.g. ?startkey="cars-"&endkey="cars0" for the cars collection.
Updating docs
Once you've got the versions you'd like to restore, GET the current version of the resource, GET the past version from the loggy database, and PUT the past version to the resource using the current version's _rev value. Bam, restored. Rinse and repeat for point-in-time restore.
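As a concrete sketch (same illustrative fetch setup as above; RESOURCES points at the database holding the live resources, and pastVersion is one entry from a versions array):
// Overwrite the live resource with a past version from the log.
async function restore(RESOURCES, resourceId, pastVersion) {
  // GET the current version of the resource to learn its _rev...
  const current = await (await fetch(RESOURCES + '/' + resourceId)).json();
  // ...then PUT the past attributes back under that _rev to avoid a conflict.
  const restored = Object.assign({}, pastVersion.doc, {
    _id: resourceId,
    _rev: current._rev
  });
  await fetch(RESOURCES + '/' + resourceId, {
    method: 'PUT',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(restored)
  });
}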


Use an array of values to query Firestore and setup a snapshot listener

Here is my problem:
I have a firestore collection that has a number of documents. There are about 500 documents generated/updated every hour and saved to the collection.
I would like to query the collection and set up a real-time snapshot listener for a subset of document IDs that are provided by the client.
I think maybe I could do something like this (this syntax is likely not correct... just trying to get a feel for whether it's even possible... but isn't the "in" operator limited to an array of 10 items?):
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
docsRef.where('docID', 'in', subbedDocs).onSnapshot((doc) => {
  handleSnapshot(doc);
});
I'm sorry, that code probably doesn't make sense... I'm still trying to learn all the ins and outs of Firestore.
Essentially, what I am trying to do is take an array of IDs and set up a .onSnapshot listener for those IDs. This list of IDs could be upwards of 40-50 items. Is this even possible? I am trying to avoid just setting up a listener on the whole collection and filtering out things I am not "subscribed" to, as that seems wasteful from a resources perspective.
If you have the doc IDs in your array (it looks like you do), you can loop over them and start a listener for each one:
const subbedDocs = ["doc1", "doc2", "doc3", "doc4", "doc5"];
for (let i = 0; i < subbedDocs.length; i++) {
  const docID = subbedDocs[i];
  docsRef.doc(docID).onSnapshot((doc) => {
    handleSnapshot(doc);
  });
}
It would be better to listen to a single query that matches all the filtered docs at once. But if you want to listen to each of them with an explicit listener, this will do the trick.
As you've discovered, Firestore's in operator only allows up to 10 entries in the array. I'm also guessing you've added docID as a field in the document, since I don't believe docID references the actual document ID.
I would not take this approach, because of the 10-entry limitation. What I would do instead: as the client selects documents to follow, add a unique ID for that client into an array field on each selected document (called something like "ListenerArray"), so your query avoids the limitation entirely. This allows an unlimited number of client listeners (up to Firestore's implementation limits). Your query would be more like:
docsRef.where('ListenerArray', 'array-contains', clientID).onSnapshot((doc) => {
  handleSnapshot(doc);
});
array-contains checks a single value against all entries in a document array, without limit. Every client can mark any number of documents to subscribe to.
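Subscribing a client is then just an array update on each chosen document; here's a minimal sketch using the namespaced (v8) JavaScript SDK, reusing the docsRef and clientID names from above:
// Add this client's ID to the doc's ListenerArray (the array is created
// if it doesn't exist yet); arrayRemove would unsubscribe it later.
docsRef.doc('doc1').update({
  ListenerArray: firebase.firestore.FieldValue.arrayUnion(clientID)
});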

How To Get Item From DynamoDB Based On Multiple (one primary) Attributes Lambda/NodeJS

My table structure in DynamoDB looks like the following:
uuid (Primary Key) | ip | userAgent
From within a NodeJS function inside of Lambda, I would like to get the uuid of an item whose ip and userAgent match the information I provide.
Scan becomes less and less efficient and more expensive over time as millions of items are added to the table every week.
Here is the code I am using to try and accomplish this:
function tieDown(sIP, uA){
  const userQuery = {
    Key: {
      "ip": sIP,        // e.g. "192.168.0.1"
      "userAgent": uA   // e.g. "sample"
    },
    TableName: "mytable"
  };
  return ddb.get(userQuery, function(err, data){
    if (err) console.log(err.stack);
  }).promise();
}
When this code executes, the following error is thrown ValidationException: The provided key element does not match the schema.
So I guess my questions are:
Is it even possible to get one specific item based on non-primary attributes
Are there any issues with the code sample I provided that could lead to this error being thrown? (I'm using DocumentClient, so there's no need to explicitly declare strings, numbers, etc.)
Thanks!
You cannot get a single item using the get operation without specifying the partition key, and sort key if you have one. Scans should be avoided in most cases. What you probably need is a Global Secondary Index that allows you to query by ip and userAgent. Keep in mind that the records on a GSI are not guaranteed unique, so you may get more than one result.
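For illustration, a query against a hypothetical GSI named ip-userAgent-index (with ip as its partition key and userAgent as its sort key; you'd create this index on the table first) might look like:
// Sketch: query the assumed ip-userAgent-index GSI with DocumentClient.
function tieDown(sIP, uA) {
  const params = {
    TableName: "mytable",
    IndexName: "ip-userAgent-index", // assumed GSI name
    KeyConditionExpression: "ip = :ip AND userAgent = :ua",
    ExpressionAttributeValues: {
      ":ip": sIP,
      ":ua": uA
    }
  };
  // data.Items may contain several records, each including its table's uuid.
  return ddb.query(params).promise();
}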

Bulk delete documents of a specific type in elasticsearch

I have an index called "animals" in Elasticsearch that contains several documents of type "dogs".
I want to delete all "dogs" documents in the "animals" index. I am using the Python elasticsearch package.
My python code is as follows:
connection = Elasticsearch([{"host": "myhost", "port": "myport"}])
body = {}  # What should I put in the body???
connection.bulk(body, index="animals", doc_type="dogs", ignore=[400, 404])
Here I don't know what I need to put in the body. Can anyone help me out?
The bulk method is defined at https://github.com/elastic/elasticsearch-py/blob/master/elasticsearch/client/__init__.py#L1002
There is no API in elasticsearch-py to delete all documents of a type. This is because from Elasticsearch 2.x onwards there is no API to delete a type.
From the documentation:
In 1.x it was possible to delete a type mapping, along with all of the documents of that type, using the delete mapping API. This is no longer supported, because remnants of the fields in the type could remain in the index, causing corruption later on.
Instead, if you need to delete a type mapping, you should reindex to a new index which does not contain the mapping. If you just need to delete the documents that belong to that type, then use the delete-by-query plugin instead.
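With the plugin installed, the request is roughly as follows (a sketch of the 2.x-era plugin API, which reinstated the _query endpoint; check the docs for your exact version):
DELETE /animals/dogs/_query
{
  "query": {
    "match_all": {}
  }
}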

Couchdb filter using reduce functions/linked documents

Considering:
doc profile
{
  _id: "1",
  name: "john",
  likes: ["2222", "1111"]
}
doc likes
{
  _id: "2222",
  value: "true"
}
{
  _id: "1111",
  value: "false"
}
I have a filter on my Xamarin app to get the profile, and it works well, but I need to include the "children" (linked) docs. I can do this with a view setting include_docs=true, but I want CouchDB to do the filtering so I can use replication.
Also, it would be possible to accomplish the same result if I could use a reduce function to filter the data, but I can't make the filter use the reduce function. So, any ideas?
the expected result would be:
doc profile
{
  _id: "1",
  name: "john",
  likes: [
    {_id: "2222", value: "true"},
    {_id: "1111", value: "false"}
  ]
}
Thanks!
I can do this with a view setting include_docs=true but I want couchdb to filter so I can use replication
You might already know this, but you can use CouchDB views as filters.
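For example (a sketch assuming a view named likesview in a design document named app), the _changes feed, and therefore replication, can be filtered with the built-in _view filter:
GET /mydb/_changes?filter=_view&view=app/likesview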
Also, it would be possible to accomplish the same result if I could use a reduce function to filter data
The reduce function is for "reducing" the values that are returned by the map function. The map function returns a key and a value like so:
emit(key,value)
The reduce function only gets the keys and the values that are returned from a map function. For example, if you call a view with
?key="abc"
and it returns results like
[
  {
    _id: ...,
    type: "abc"
  },
  {
    _id: ...,
    type: "abc"
  },
  ...
]
You already have all the documents filtered by the key "abc". The reduce function gets as inputs the keys, the values, and a rereduce parameter. If you use the reduce function as a post-map processing step to further filter the results from the view, there will be two problems:
There is no way to pass a parameter to a reduce. The keys that you specify will only be used by the map function and then passed as they are to reduce.
It is not a good idea anyway. With reduce you want to return a small value that aggregates the results you get from a view. Taking the above example: if the map function emitted an integer as its value (that is, emit(key, value) where value is an integer), the reduce function might return a sum or aggregate of those values. Trying to return a modified document is not what the reduce function is for. From the docs:
"A reduce function must reduce the input values to a smaller output value. If you are building a composite return structure in your reduce, or only transforming the values field, rather than summarizing it, you might be misusing this feature. "
List functions might be better suited to what you are trying to do. If you want to process the results of the view query before returning them, they are the way to go.
In list functions you get the set of rows returned by the view query. You can even pass additional parameters if you'd like to apply complex filters to them. But you won't be able to use list functions for replication.
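Here's a minimal list-function sketch (getRow() and send() are the standard CouchDB list-function API; the flag field and query parameter are invented for illustration):
// Join the view rows into one JSON array, filtered by a query parameter.
function(head, req) {
  var row, results = [];
  while ((row = getRow())) {
    // row.value is whatever the underlying map function emitted.
    if (!req.query.flag || row.value.flag === req.query.flag) {
      results.push(row.value);
    }
  }
  send(JSON.stringify(results));
}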
Finally, replication works at the document level. Documents have a _rev field that the replicator process uses to check which version a document is in before replication is performed. So you won't be able to replicate the results returned by a view; only the documents themselves get replicated.

Referencing external doc in CouchDB view

I am scraping a 90K-record database using JSON-RPC, and I am trying to put in some basic error checking. I want to start by scraping the database twice using two different settings, adding a prefix to the second scrape. This way I can check that the two settings are not producing different records (due to dropped updates, etc.). I wanted to implement the comparison using a view which compares each document from the first scrape with its twin produced by the second scrape and then emits the names of records that differ.
However, I cannot quite figure out how to pull another doc into the view; everything I have read only discusses linking external docs via the emit() function, which is too late to permit the comparison. In the example below, the lookup() function would grab the referenced document.
Is this just not possible?
function(doc) {
  if (doc._id.slice(0,1) !== '$' && doc._id.slice(0,1) !== '_') {
    // lookup() is hypothetical; it would fetch the prefixed twin document
    var otherDoc = lookup('$test' + doc._id);
    if (otherDoc) {
      var keys = Object.keys(doc);
      var same = true;
      keys.forEach(function(key) {
        if ((key.slice(0,1) !== '_') && (key.slice(0,1) !== '$') && (key !== 'expires')) {
          // Object.equal is also hypothetical (a deep-equality check)
          if (!Object.equal(otherDoc[key], doc[key])) {
            same = false;
          }
        }
      });
      if (!same) {
        emit(doc._id, 1);
      }
    }
  }
}
Context
You are correct that this is not possible in CouchDB. The whole point of the map function is that it must be idempotent, otherwise you lose all the other nice benefits of a pre-calculated index.
This is why you cannot access external resources in the map function, whether they be other records or the clock. Any time you run a map you must always get the same result if you put the same record into it. Since there are no relationships between records in CouchDB, you cannot promise that this is possible.
Solution
However, you can still achieve your end goal, just be different means. Some possibilities...
Assuming there is some meaningful numeric value in each doc, you could use a view to take the sum of all those values and group them by which import you did ({key: <batch id>, value: <meaningful number>}). Then compare the two numbers in your client or the browser to see if they match.
A brute force approach would be to use a view to pair the docs that should match. Each doc is on a different row, but they're grouped by a common field. Then iterate through the entire index comparing the pairs. This would certainly be the quickest to code and doesn't depend on your application or data.
Implement a validation function to enforce a schema on your data. Just be warned that this will reduce your write throughput since each written record will be piped out of Erlang and into the JS engine. Also, this is only applicable if you're worried about properly formed records instead of their precise content, which might not be the case.
Instead of your different batch jobs creating different docs, have them place their data into the same doc. The structure might look like this:
{
  "_id": "something meaningful",
  "batch_one": { ..data.. },
  "batch_two": { ..data.. }
}
Then your validation function could compare them, or you could create a view that indexes all the docs that don't match (see the sketch below). All depends on where in your pipeline you want to do the error checking and correction.
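That mismatch view might look something like this crude sketch (it assumes the combined-doc structure above; the string comparison stands in for a real deep-equality check and is sensitive to key order):
// Emit the IDs of docs whose two batches disagree.
function(doc) {
  if (doc.batch_one && doc.batch_two) {
    if (JSON.stringify(doc.batch_one) !== JSON.stringify(doc.batch_two)) {
      emit(doc._id, 1);
    }
  }
}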
Personally I like the last option better, but only if you don't plan to use the database as-is in production. I.e., you wouldn't want to carry around all that extra data in each record.
Hope that helps.
Cheers.
