update_sequence changed semantics in cloudant db? - couchdb

I use a Cloudant CouchDB and I've noticed that the "_changes" query on the database returns an "update_sequence" that is not a number, e.g.
"437985-g1AAAADveJzLYWBgYM..........".
What is more, the response is not stable: I get 3 different update_sequences if I query the db 3 times.
Is there any change in the known semantics of the "update_sequence", "since", etc. or what?
Regards,
Vangelis

Paraphrasing an answer that Robert has previously given:
The update sequence values are opaque. In CouchDB, they are currently integers but, in Cloudant, the value is an encoding of the sequence value for each shard of the database. CouchDB will likely adopt this in future as clustering support is added (via the BigCouch merge).
In both CouchDB and Cloudant, _changes will return a "seq" value with every row that is guaranteed to return newer updates if you pass it back as "since". In cases of failover, that might include changes you've already seen.
So, the correct way to read changes since a particular update sequence is this:
1. Call /dbname/_changes?since=<your checkpoint seq>.
2. Read the entire response, applying the changes as you go.
3. Record the last_seq value as your new checkpoint seq value.
Do not interpret the two values; you can't compare them for equality. You can, if you need to, record any "seq" value you see during step 2 as your current checkpoint seq value. The key thing you cannot do is compare them.
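To make those steps concrete, here's a rough Python sketch of the loop (the requests library, the database URL, and the apply_change handler are stand-ins of my own, not part of the original answer):
import requests

COUCH = "http://localhost:5984/dbname"   # stand-in database URL

def apply_change(row):
    # Stand-in for your real processing; after a failover you may see
    # rows you have already applied, so this should be idempotent.
    print(row["id"], row["seq"])

def read_changes(since):
    # 1. call _changes with your checkpoint
    resp = requests.get(COUCH + "/_changes", params={"since": since})
    resp.raise_for_status()
    body = resp.json()
    # 2. read the entire response, applying the changes as you go
    for row in body["results"]:
        apply_change(row)
    # 3. record last_seq as the new checkpoint; never compare seq values yourself
    return body["last_seq"]

checkpoint = read_changes(since=0)   # 0 is a valid initial "since" value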

It'll jump around; the representation is a packed base64 string encoding the update_seq of the various replicas of each shard of your database. It can't be a simple integer because it's a snapshot of a distributed database.
As for CouchDB, treat the update_seq as opaque JSON and you'll be fine.

Related

What is the datatype to store json object into postgresql?

I am very new to postgresql.
I want to store the JSON object below in a PostgreSQL database.
{
"host": "xxx.xxx.xx.xx"
"type": "OS"
}
Can you please advise me what data type I should use in PostgreSQL? Thanks in advance.
If your data always has this same simple structure, I don't see any reason to store it as JSON. You should think about storing it simply in a table with columns host and type.
INSERT INTO my_table (my_host_column, my_type_column) VALUES
(my_json ->> 'host', my_json ->> 'type');
This makes many things much simpler (search, update, ...). In your case Postgres offers the inet type for IP address columns. Such a column could do plausibility checks on your host value, for example (https://www.postgresql.org/docs/current/static/datatype-net-types.html)
You are able to recreate the JSON at any time with json_build_object('host', my_host_column, 'type', my_type_column) (https://www.postgresql.org/docs/current/static/functions-json.html)
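To make that concrete, a small Python sketch of the column-based approach with psycopg2 (the connection settings, table name and sample values are made up):
import psycopg2

conn = psycopg2.connect("dbname=mydb")   # connection settings are made up
cur = conn.cursor()

# Plain columns instead of JSON; inet makes Postgres validate the host value.
cur.execute("""
    CREATE TABLE IF NOT EXISTS my_table (
        my_host_column inet,
        my_type_column text
    )
""")

doc = {"host": "127.0.0.1", "type": "OS"}   # sample values
cur.execute(
    "INSERT INTO my_table (my_host_column, my_type_column) VALUES (%s, %s)",
    (doc["host"], doc["type"]),
)

# Recreate the original JSON shape on the way out.
cur.execute(
    "SELECT json_build_object('host', my_host_column, 'type', my_type_column)"
    " FROM my_table"
)
print(cur.fetchone()[0])   # e.g. {'host': '127.0.0.1', 'type': 'OS'}
conn.commit()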
But if you still want to store the JSON as it is:
If you do not want to do anything with it, store it as a text type (which I definitely do not recommend, since you don't know what the future brings). If you want to use the JSON functions of Postgres you should store it as the json or jsonb type (https://www.postgresql.org/docs/current/static/datatype-json.html).
jsonb mostly has a storage overhead (more metadata) but is often significantly faster on operations.
Further reading:
Explanation of JSONB introduced by PostgreSQL
Faster Operations with the JSONB Data Type in PostgreSQL
Just store it as a text type if no interaction is required (watch the maximum size for a text data type). Otherwise, PostgreSQL supports JSON; just read the corresponding documentation: https://www.postgresql.org/docs/9.6/static/datatype-json.html
The advantage of the JSON types is that PostgreSQL then analyses the content, and you can use that later in SELECT statements that take the JSON data structure into account.
PostgreSQL has two json data types. From Postgres docs:
There are two JSON data types: json and jsonb. They accept almost identical sets of values as input. The major practical difference is one of efficiency. The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.
So TL;DR: Postgres's json stores the JSON as text and needs to reparse it every time it's processed, whereas jsonb takes a little longer to store but is already parsed, and it can be indexed in the db! So jsonb is probably the way to go most of the time.
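If you go with jsonb, a minimal psycopg2 sketch (connection settings and table name made up) could look like this:
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=mydb")   # connection settings are made up
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS docs (doc jsonb)")

# Json() adapts a Python dict to a JSON parameter.
cur.execute("INSERT INTO docs (doc) VALUES (%s)",
            [Json({"host": "127.0.0.1", "type": "OS"})])

# ->> extracts a field as text, so the column stays queryable.
cur.execute("SELECT doc ->> 'host' FROM docs WHERE doc ->> 'type' = %s", ("OS",))
print(cur.fetchone()[0])   # 127.0.0.1
conn.commit()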

Storing a list of mixed types in Cassandra

In Cassandra, when specifying a table and fields, one has to give each field a type (text, int, boolean, etc.). The same applies to collections: you have to lock a collection to a specific type (set<text> and such).
I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>.
Is this possible in Cassandra, and if not, what workaround would you suggest for storing a list of mixed-type items? I sketched a few, but none of them seem the right way to go...
Cassandra's CQL interface is strictly typed, so you will not be able to create a table with an untyped collection column.
I basically see two options:
Create a list field, and convert everything to text (not too nice, I agree)
Use the Thrift API and store everything as is.
As suggested at http://www.mail-archive.com/user@cassandra.apache.org/msg37103.html, I decided to encode the various values into binary and store them in a list<blob>. This still allows querying the collection values (in Cassandra 2.1+); one just needs to encode the values in the query.
In Python, the simplest way is probably to pickle and hexify when storing data:
pickle.dumps('Hello world').encode('hex')
And to load it:
pickle.loads(item.decode('hex'))
Using pickle ties the implementation to Python, but it automatically converts to the correct type (int, string, boolean, etc.) when loading, so it's convenient.
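For completeness, a rough sketch of the list<blob> approach with the DataStax cassandra-driver (the keyspace, table name and schema are assumptions of mine):
import pickle
from cassandra.cluster import Cluster

# Assumed schema: CREATE TABLE items (id text PRIMARY KEY, mixed list<blob>);
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

values = [42, "hello", True]                    # mixed types
blobs = [pickle.dumps(v) for v in values]       # each value encoded to bytes

session.execute("INSERT INTO items (id, mixed) VALUES (%s, %s)",
                ("row1", blobs))

row = session.execute("SELECT mixed FROM items WHERE id = %s", ("row1",)).one()
decoded = [pickle.loads(b) for b in row.mixed]  # back to int, str, bool
print(decoded)                                  # [42, 'hello', True]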

couchDB reduce: does rereduce preserve the order of results from map?

With a CouchDB view, we get results ordered by key. I have been using this to get the value associated with the highest key. For example, take this result (in key: value form):
{1:'sam'}
{2:'jim'}
{4:'joan'}
{5:'jill'}
CouchDB will sort those according to the key. (It could be helpful to think of the key as the "score".) I want to find out who has the highest or lowest score.
I have written a reduce function like so:
function(keys, values) {
  var len = values.length;
  return values[len - 1];
}
I know there's _stats and the like, but these are not possible in my application (this is a slimmed-down, hypothetical example).
Usually when I run this reduce, I will get either 'sam' or 'jill' depending on whether descending is set. This is what I want. However, in large datasets, sometimes I get someone from the middle of the list.
I suspect this is happening on rereduce. I had assumed that when rereduce has been run, the order of results is preserved. However, I can find no assurances that this is the case. I know that on rereduce, the key is null, so by the normal sorting rules they would not be sorted. Is this the case?
If so, any advice on how to get my highest scorer?
Yeah, I don't think sorting order is guaranteed, probably because it cannot be guaranteed in clustered environments. I suspect the way you're using map/reduce here is a little iffy, but you should post your view code if you really want a good answer here.
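If all you need is the top (or bottom) scorer, one common workaround is to skip the reduce entirely and query the map rows with descending=true&limit=1. A quick Python sketch, with made-up database/design doc/view names:
import requests

# Made-up database/design doc/view names; adjust to your setup.
view_url = "http://localhost:5984/scores/_design/app/_view/by_score"

# reduce=false asks for the raw map rows; descending=true with limit=1
# returns the row with the highest key, i.e. the top score.
resp = requests.get(view_url, params={"reduce": "false",
                                      "descending": "true",
                                      "limit": 1})
resp.raise_for_status()
rows = resp.json()["rows"]
if rows:
    print(rows[0]["key"], rows[0]["value"])   # e.g. 5 jill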

Updating TimeUUID columns in cassandra

I'm trying to store some time series data on the following column family:
create column family t_data with comparator=TimeUUIDType and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
I'm successfully inserting data this way:
data={datetime.datetime(2013, 3, 4, 17, 8, 57, 919671):'VALUE'}
key='row_id'
col_fam.insert(key,data)
As you can see, when using a datetime object as the column name, pycassa converts it to a TimeUUID correctly.
[default@keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
Sometimes, the application needs to update some data. The problem is that when I try to update that column, passing the same datetime object, pycassa creates a different UUID object (the time part is the same) so instead of updating the column, it creates another one.
[default@keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
=> (column=f36ad7be-84ed-11e2-b2fa-a6d3e28fea13, value=VALUE, timestamp=1362424025433209)
The question is: how can I update TimeUUID-based columns with pycassa, passing the datetime object? Or, if this is not the correct way to do it, what is the recommended way?
Unless you do a read-modify-write you can't. UUIDs are by their nature unique. They exist to solve the problem of how to get unique IDs that sort in chronological order but at the same time avoid collisions for things that happen at exactly the same time.
So to update that column you need to first read it, so you can find its column key, change its value and write it back again.
It's not a particularly elegant solution. You should really avoid read-modify-write in Cassandra. Perhaps TimeUUID isn't the right type for your column keys? Or perhaps there's another way you can design your application to avoid having to go back and change things.
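If you do go the read-modify-write route, a minimal pycassa sketch could look like this (the keyspace name is made up, which column to overwrite is up to you, and I'm assuming pycassa accepts an existing uuid.UUID as the column name):
import pycassa

pool = pycassa.ConnectionPool("my_keyspace")   # keyspace name made up
col_fam = pycassa.ColumnFamily(pool, "t_data")
key = "row_id"

# Read the row to recover the existing TimeUUID column names.
columns = col_fam.get(key)                     # OrderedDict of {uuid.UUID: value}

# Pick the column you want to change (here: simply the first/oldest one).
target_uuid = list(columns.keys())[0]

# Re-insert under the *same* UUID so the column is overwritten
# instead of a new one being created.
col_fam.insert(key, {target_uuid: "NEW VALUE"})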
Without knowing what your query patterns look like I can't say exactly what you should do instead, but here are some suggestions that hopefully are relevant:
Don't update values, just write new values. If something was true at time T, it will always have been true for time T, even if it changes at time T + 1. When things change you write a new value with the time of the change and let the old values be. When you read the time line you resolve these conflicts by picking the most recent value -- and since the values will be sorted in chronological order the most recent value will always be the last one. This is very similar to how Cassandra does things internally, and it's a very powerful pattern.
Don't worry that this will use up more disk space, or require some extra CPU when reading the time series, it will most likely be tiny in comparison with the read-modify-write complexity that you would otherwise have to implement.
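A small pycassa sketch of that append-only pattern, reusing t_data from the question (keyspace name made up):
import datetime
import pycassa

pool = pycassa.ConnectionPool("my_keyspace")   # keyspace name made up
col_fam = pycassa.ColumnFamily(pool, "t_data")
key = "row_id"

# Never overwrite: every change is appended as a new TimeUUID column.
col_fam.insert(key, {datetime.datetime.utcnow(): "NEW VALUE"})

# The "current" state is simply the most recent column.
latest = col_fam.get(key, column_count=1, column_reversed=True)
(time_uuid, value), = latest.items()
print(time_uuid, value)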
There might be other ways to solve your problem, and if you give us some more details maybe we can come up with something that fits better.

How can I tell if Hazelcast's IMap.putIfAbsent() worked?

I need to put a key into a Hazelcast map, and throw an error if the key already exists in the map. This must be done in an atomic way, so testing if the key is there first and then doing the put in a separate operation won't work.
Here's the problem: the only way to tell if putIfAbsent() actually put anything is to test if the returned object is the new one or the existing one. But Hazelcast doesn't return the existing one; it returns a clone of it. So you can't do if (old == new) to test. You would have to do if (old.equals(new)). The trouble is that my objects are large and complicated, and it's not going to be easy to implement a custom .equals() method.
Surely there's a better way to do this. Does Hazelcast have a different way to do an atomic putIfAbsent()?
Edit:
I've run into a similar problem with IMap.replace(). In order to supply the old and the new values, I have to clone the old value, modify it, call replace(), and be sure that I have an equals() method on my value that will do the comparison. There has got to be a better way. It would be good if Hazelcast would somehow give me a version number or a timestamp for an object in a map so I could do a compare-and-set on the version number or the timestamp, instead of having to deal with every field of a complicated object.
Perhaps you could try an EntryProcessor.
map.executeOnKey(key, new PutIfAbsentEntryProcessor(value));
You need to implement this PutIfAbsentEntryProcessor yourself; it returns whether the original value (which you have access to in the EntryProcessor) is null or not.
