I'm trying to store some time series data on the following column family:
create column family t_data with comparator=TimeUUIDType and default_validation_class=UTF8Type and key_validation_class=UTF8Type;
I'm successfully inserting data this way:
data={datetime.datetime(2013, 3, 4, 17, 8, 57, 919671):'VALUE'}
key='row_id'
col_fam.insert(key,data)
As you can see, when a datetime object is used as the column name, pycassa correctly converts it to a TimeUUID.
[default@keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
Sometimes, the application needs to update some data. The problem is that when I try to update that column, passing the same datetime object, pycassa creates a different UUID object (the time part is the same) so instead of updating the column, it creates another one.
[default@keyspace] get t_data[row_id];
=> (column=f36ad7be-84ed-11e2-af42-ef3ff4aa7c40, value=VALUE, timestamp=1362423749228331)
=> (column=f36ad7be-84ed-11e2-b2fa-a6d3e28fea13, value=VALUE, timestamp=1362424025433209)
The question is: how can I update TimeUUID-based columns with pycassa by passing the datetime object? Or, if this is not the correct way to do it, what is the recommended way?
Unless you do a read-modify-write you can't. UUIDs are by their nature unique. They exist to solve the problem of how to get unique IDs that sort in chronological order but at the same time avoid collisions for things that happen at exactly the same time.
So to update that column you need to first read it, so you can find its column key, change its value and write it back again.
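A minimal sketch of that read-modify-write, reusing the names from the question (if I recall pycassa's slicing behavior correctly, datetime bounds on a TimeUUIDType comparator are converted to the lowest/highest possible TimeUUID for that instant, so the slice finds the existing column):
import datetime
import pycassa

pool = pycassa.ConnectionPool('keyspace')           # assumed keyspace name
col_fam = pycassa.ColumnFamily(pool, 't_data')

key = 'row_id'
ts = datetime.datetime(2013, 3, 4, 17, 8, 57, 919671)

# Slice the row around the original timestamp to recover the stored TimeUUID.
columns = col_fam.get(key, column_start=ts, column_finish=ts)

# Re-insert using the existing TimeUUID as the column name; this overwrites in place.
for existing_uuid in columns:
    col_fam.insert(key, {existing_uuid: 'NEW_VALUE'})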
It's not a particularly elegant solution. You should really avoid read-modify-write in Cassandra. Perhaps TimeUUID isn't the right type for your column keys? Or perhaps there's another way you can design your application to avoid having to go back and change things.
Without knowing what your query patterns look like I can't say exactly what you should do instead, but here are some suggestions that hopefully are relevant:
Don't update values, just write new values. If something was true at time T, it will always have been true at time T, even if it changes at time T + 1. When things change, write a new value with the time of the change and leave the old values alone. When you read the time line you resolve these conflicts by picking the most recent value -- and since the values are sorted in chronological order, the most recent value will always be the last one. This is very similar to how Cassandra does things internally, and it's a very powerful pattern.
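As a rough illustration of that pattern with the column family from the question: never rewrite a column, just append one at the time of the change, and read the newest column when you need the current value:
import datetime
import pycassa

pool = pycassa.ConnectionPool('keyspace')           # assumed keyspace name
col_fam = pycassa.ColumnFamily(pool, 't_data')
key = 'row_id'

# The value changed: record the new value under the time of the change.
col_fam.insert(key, {datetime.datetime.utcnow(): 'NEW_VALUE'})

# The current value is simply the newest column in the chronologically sorted row.
latest = col_fam.get(key, column_reversed=True, column_count=1)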
Don't worry that this will use up more disk space or require some extra CPU when reading the time series; it will most likely be tiny in comparison with the read-modify-write complexity that you would otherwise have to implement.
There might be other ways to solve your problem, and if you give us some more details maybe we can come up with something that fits better.
I'm building a CRUD application that pulls data using Persistent and executes a number of fairly complicated queries, for instance using window functions. Since these aren't supported by either Persistent or Esqueleto, I need to use raw sql.
A good example is that I want to select rows in which the value does not deviate strongly from the previous value, so in pseudo-SQL the condition is WHERE val - lag(val) <= x. I need to run this selection in SQL, rather than pulling all the data and then filtering in Haskell, because otherwise I'd have way too much data to handle.
These queries return many columns. However, the RawSql instance maxes out at tuples with 8 elements. So now I am writing additional functions from9, to9, from10, to10 and so on. And after that, all these are converted using functions with type (Single a, Single b, ...) -> DesiredType. Even though this could be shortened using code generation, the approach is simply hacky and clearly doesn't feel like good Haskell. This concerns me because I think most of my queries will require rawSql.
Do you have suggestions on how to improve this? Currently, my main thought is to denormalize the database and duplicate data, e.g. by including the lagged value as a column, so that I can query the data with Esqueleto.
In Cassandra, when specifying a table and fields, one has to give each field a type (text, int, boolean, etc.). The same applies to collections: you have to lock a collection to a specific type (set<text> and such).
I need to store a list of mixed types in Cassandra. The list may contain numbers, strings and booleans. So I would need something like list<?>.
Is this possible in Cassandra and, if not, what workaround would you suggest for storing a list of mixed-type items? I sketched a few, but none of them seem the right way to go...
Cassandra's CQL interface is strictly typed, so you will not be able to create a table with an untyped collection column.
I basically see two options:
Create a list field, and convert everything to text (not too nice, I agree)
Use the Thrift API and store everything as is.
As suggested at http://www.mail-archive.com/user@cassandra.apache.org/msg37103.html, I decided to encode the various values into binary and store them in a list<blob>. This still allows querying the collection values (in Cassandra 2.1+); one just needs to encode the values in the query.
In Python, the simplest way is probably to pickle and hexify when storing data:
pickle.dumps('Hello world').encode('hex')
And to load it:
pickle.loads(item.decode('hex'))
Using pickle ties the implementation to python, but it automatically converts to correct type (int, string, boolean, etc.) when loading, so it's convenient.
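For reference, a small round-trip sketch of this approach (it uses binascii, which works on both Python 2 and 3; the .encode('hex') form above is Python 2 only):
import binascii
import pickle

def to_hex_blob(value):
    # pickle preserves the Python type (int, str, bool, ...); hexify for the query
    return binascii.hexlify(pickle.dumps(value))

def from_hex_blob(hex_bytes):
    return pickle.loads(binascii.unhexlify(hex_bytes))

mixed = [42, 'hello', True]
encoded = [to_hex_blob(v) for v in mixed]
decoded = [from_hex_blob(b) for b in encoded]     # [42, 'hello', True], types intact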
I use a Cloudant CouchDB and I've noticed that the "_changes" query on the database returns an "update_sequence" that is not a number, e.g.
"437985-g1AAAADveJzLYWBgYM..........".
What is more, the response is not stable: I get 3 different update_sequences if I query the db 3 times.
Has there been any change in the known semantics of "update_sequence", "since", etc.?
Regards,
Vangelis
Paraphrasing an answer that Robert has previously given:
The update sequence values are opaque. In CouchDB, they are currently integers but, in Cloudant, the value is an encoding of the sequence value for each shard of the database. CouchDB will likely adopt this in future as clustering support is added (via the BigCouch merge).
In both CouchDB and Cloudant, _changes will return a "seq" value with every row that is guaranteed to return newer updates if you pass it back as "since". In cases of failover, that might include changes you've already seen.
So, the correct way to read changes since a particular update sequence is this:
Call /dbname/_changes?since=
Read the entire response, applying the changes as you go.
Record the last_seq value as your new checkpoint seq value.
Do not interpret the values, and do not compare them for equality. You can, if you need to, record any "seq" value you see in step 2 as your current checkpoint seq value. The key thing you cannot do is compare them.
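A rough sketch of that loop in Python with the requests library (the account URL, database name and the apply_change handler are placeholders):
import requests

def read_changes(base_url, db, since=None):
    params = {'since': since} if since is not None else {}
    resp = requests.get('{0}/{1}/_changes'.format(base_url, db), params=params)
    body = resp.json()
    for row in body['results']:
        apply_change(row)            # your own handler for each change row
    return body['last_seq']          # opaque checkpoint: store it, never compare it

checkpoint = None                    # or the last_seq you stored on the previous run
checkpoint = read_changes('https://account.cloudant.com', 'dbname', since=checkpoint)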
It'll jump around; the representation is a packed base64 string encoding the update_seq of the various replicas of each shard of your database. It can't be a simple integer because it's a snapshot of a distributed database.
As for CouchDB, treat the update_seq as opaque JSON and you'll be fine.
I am monitoring an application using Zabbix and have defined a custom item which returns a string value. Since my item's values are actually checksums, they will only contain the characters [0-9a-f]. Two mirror copies of my application are running on two servers for the sake of redundancy. I would like to create a trigger which would take the item values from both machines and fire if they are not the same.
For now, let's set aside the moment when the values change (it's not an atomic operation, so the system may briefly see an inconsistent state, which is not a real error); I could work around that by looking at several previous values.
The crux is: how to write a Zabbix trigger expression which could compare for equality the string values of two items (the same item on two mirror hosts, actually)?
Both according to the fine manual and as I confirmed in practice, the standard operators = and # only work on numeric values, so I can't just write the natural {host1:myitem[param].last(0)} # {host2:myitem[param].last(0)}. Functions such as change() or diff() can only compare values of the same item at different points in time. Functions such as regexp() can only compare the item's value with a constant string/regular expression, not with another item's value. This is very limiting.
I could move the comparison logic into the script which my custom item executes, but it's a bit messy and not elegant, so if at all possible, I would prefer to have this logic inside my Zabbix trigger.
Perhaps despite the limitations listed above, someone can come up with a workaround?
Workaround:
{host1:myitem[param].change(0)} # {host2:myitem[param].change(0)}
When only one of the servers sees a modification since the previously received value, an event is triggered.
From the Zabbix Manual,
change (float, int, str, text, log)
Returns difference between last and previous values.
For strings:
0 - values are equal
1 - values differ
I believe, and I am struggling with this exact situation myself, that the correct way to do this is via calculated items.
You want to create a new item, not a trigger (yet!), that performs a calculated comparison on multiple item values (string difference, numbers within range, etc.).
Once you have that item, have the calculation give you a value you can trigger off of. You can use any trigger function in your calculation, along with arithmetic operations.
Now to the issue (which I've submitted a feature request for because this is extremely limiting), most trigger expressions evaluate to a number or a 0/1 bool.
I think I have a solution for my own problem, which is that I am tracking a version number from a webpage, e.g. v2.0.1. I believe I can use string concatenation and regex in calculated items to convert my string values into multiple numeric values, which would be a breeze to compare.
But again, this is convoluted and painful.
If you want my advice, have yourself or a dev look at the code for trigger expressions and see if you can submit a patch adding a trigger function for simple string comparison (difference, length, possible conversion to numerical values using binary and/or hex combinations, etc.).
I'm trying to work on a patch myself, but I don't have time, as I have so much monitoring to implement; and while Zabbix is powerful, it has several huge flaws. I still believe it's the best monitoring system out there.
Simple answer: Create a UserParameter until someone writes a patch.
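As one possible shape for that workaround, here is a rough Python sketch of an external script that fetches the same item from both hosts with zabbix_get and prints a number Zabbix can trigger on (host names and the item key are placeholders):
#!/usr/bin/env python
# Print 1 if the two hosts disagree on the item value, 0 otherwise.
import subprocess

def fetch(host, key):
    return subprocess.check_output(['zabbix_get', '-s', host, '-k', key]).strip()

a = fetch('host1', 'myitem[param]')
b = fetch('host2', 'myitem[param]')
print(1 if a != b else 0)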
You could change your items to return numbers instead of strings. Because your items are checksums that use only [0-9a-f] characters, they are numbers written in hexadecimal, so you would just need to convert the checksum to a decimal number.
Because the checksum is a big number, you would need to limit the hexadecimal value to 8 characters so it fits the Numeric (unsigned) type before conversion. Or, if you wanted higher precision, you could use float (but that would be more work):
Numeric (unsigned) - 64bit unsigned integer
Numeric (float) - floating point number
Negative values can be stored.
Allowed range (for MySQL): -999999999999.9999 to 999999999999.9999 (double(16,4)).
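The conversion itself is trivial in whatever script produces the item value; a tiny sketch (the checksum below is made up):
# Truncate the hex checksum so it fits in Zabbix's 64-bit Numeric (unsigned) type
# (16 hex digits is the absolute maximum for 64 bits; 8 stays well within range).
checksum = 'f36ad7be84ed11e2af42ef3ff4aa7c40'
print(int(checksum[:8], 16))    # 4083865534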
I wish Zabbix had a .hashedUnsigned() function that would compute the hash of a string and return it as a number. Such a function should be easy to write.
Suppose I have a list of numbers and I've computed the q-quantile (using Quantile).
Now a new datapoint comes along and I want to update my q-quantile, without having stored the whole list of previous datapoints.
What would you recommend?
Perhaps it can't be done exactly without, in the worst case, storing all previous datapoints.
In that case, can you think of something that would work well enough?
One idea I had, if you can assume normality, is to use the inverse CDF instead of the q-quantile.
Keep track of the sample mean and variance as you go; then you can compute InverseCDF[NormalDistribution[sampleMean, Sqrt[sampleVariance]], q] (NormalDistribution expects the standard deviation, not the variance), which should be the value such that a fraction q of the values is smaller -- which is what the q-quantile is.
(I see belisarius was thinking along the same lines.
Here's the link he pointed to: http://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#On-line_algorithm )
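The above is in Mathematica terms; here is a rough Python equivalent of the same idea, combining Welford's online mean/variance update from the linked article with SciPy's normal quantile function:
import math
from scipy.stats import norm

class RunningQuantile(object):
    """Approximate q-quantile under a normality assumption (Welford's online update)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def quantile(self, q):
        variance = self.m2 / (self.n - 1)        # sample variance; needs n >= 2
        return norm.ppf(q, loc=self.mean, scale=math.sqrt(variance))

rq = RunningQuantile()
for x in [1.2, 0.7, 3.4, 2.1, 2.8]:
    rq.add(x)
print(rq.quantile(0.9))                          # normal-approximation estimate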
Unless you know that your underlying data comes from some distribution, it is not possible to update arbitrary quantiles without retaining the original data. You can, as others suggested, assume that the data has some sort of distribution and store the quantiles this way, but this is a rather restrictive approach.
Alternatively, have you thought of programming this somewhere besides Mathematica? For example, you could create a class for your datapoints that contains (1) the double value and (2) a timestamp for when the data came in. In a SortedList of these datapoint objects (compared by value), you could get the quantile very fast by simply indexing into the list. Want a historical quantile? Simply filter on the timestamps in your sorted list.
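For example, the sorted-container idea is only a few lines in Python with the bisect module (the timestamp filtering is left out for brevity):
import bisect

class SortedQuantile(object):
    """Keep the datapoints sorted on insert; a quantile is then a direct index lookup."""

    def __init__(self):
        self.values = []

    def add(self, x):
        bisect.insort(self.values, x)            # O(log n) search, O(n) insert

    def quantile(self, q):
        idx = int(q * (len(self.values) - 1))    # nearest-rank style estimate
        return self.values[idx]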