We are doing the following to update the value of a counter. Is there a straightforward way to get the updated counter value back immediately?
mutator.incrementCounter(rowid1, "cf1", "counter1", value);
There is no single 'incrementAndGet' operation in the Cassandra Thrift API.
Counters in Cassandra are eventually consistent and non-atomic. A fragile ConsistencyLevel.ALL operation is required to get a "guaranteed to be updated" counter value, i.e. to perform a consistent read. ConsistencyLevel.QUORUM is not sufficient (as specified in the counters design document: https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf).
To implement an incrementAndGet method that looks consistent, you might first read the counter value, then issue the increment mutation, and return (read value + inc).
For example, if the previous counter value is 10 or 20 (depending on the replica) and you add 50 to it, read-before-increment will return either 60 or 70, whereas read-after-increment might still return 10 or 20.
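The read-then-increment idea above can be sketched roughly as follows. This is only an illustration, not a Cassandra API: `CounterOps`, `readCounter` and `incrementCounter` are hypothetical names standing in for whatever read and increment calls your client exposes, and the result is still not atomic if other writers increment the same counter concurrently.

```java
import java.util.function.LongConsumer;
import java.util.function.LongSupplier;

// Hypothetical helper; the two callbacks stand in for a real counter read and increment.
final class CounterOps {
    // Reads the current value, issues the increment, and returns (read value + delta).
    // Not atomic: a concurrent increment between the two steps is not reflected in the result.
    static long incrementAndGet(LongSupplier readCounter, LongConsumer incrementCounter, long delta) {
        long before = readCounter.getAsLong();   // read-before-increment
        incrementCounter.accept(delta);          // issue the increment mutation
        return before + delta;                   // value the caller "sees" after the increment
    }
}
```

With a replica that currently reports 10 and a delta of 50, the method returns 60, matching the example above.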
The only way to do it is to query for it. There is no increment-then-read functionality available in Cassandra.
I am using a Logic App to transform some data for an integration. I am trying to avoid using For Each loops as the amount of data I am working with is high, and these incur a cost for each action and iteration of the for each loop.
However the integration I am working with requires a unique incrementing number for each line. They don't have to be sequential, or even starting with 1 but the order should be kept the same.
So with the above, the first one would get LineNumber 1, the second LineNumber 2, etc.. (or like I said, it could be 67829, 67835, etc..)
I tried to set a variable with ticks(utcNow()) before the start of the mapping, and then use sub(ticks(utcNow()), variables('startTicks')) but this is evaluated once and the same number is applied to all.
My next thought is to use an azure function/inline javascript to go through afterward and assign them, but just wondering if there is a way to accomplish this in the select.
or like I said, it could be 67829, 67835, etc..
To answer this requirement:
Inside the Select action:
indexOf(string(variables('<DATA Variable>')),string(item()))
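Explanation:
item() is the current item in the Select. It is stringified and then searched for in the stringified version of the entire data, and indexOf returns the character position at which it is found. Because the items appear in order inside the stringified data, these positions increase from one item to the next.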
Please note:
I did not get a chance to check this on a very large dataset.
This may fail if a specific row (all values in the row) is repeated. I assume this is not your case (the order number might be unique).
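To see why the indexOf trick yields increasing numbers, and why repeated rows break it, here is a small sketch of the same idea outside Logic Apps. It is written in Java purely for illustration; `LineNumbers` and `positions` are hypothetical names, and the items are assumed to be already serialized to strings.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of indexOf(string(allItems), string(item())).
final class LineNumbers {
    // Returns, for each serialized item, its character position inside the serialized array.
    static List<Integer> positions(List<String> serializedItems) {
        String whole = "[" + String.join(",", serializedItems) + "]";
        List<Integer> result = new ArrayList<>();
        for (String item : serializedItems) {
            // indexOf finds the FIRST occurrence, so identical rows get the same number.
            result.add(whole.indexOf(item));
        }
        return result;
    }
}
```

For distinct rows the positions are strictly increasing, which satisfies the "unique, ordered, not necessarily sequential" requirement. Two identical rows map to the same position, which is the failure case noted above.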
I have a 2 node cassandra cluster with RF=2.
When a delete from x where y cql statement is issued - is it known how long it will take all nodes to delete the row?
What I see in one of the integration tests:
A row is deleted, the result of the deletion is tested with a select * from y where id = xxx statement. What I see is that sometimes the result is not null as expected and the deleted row is still found.
Is the correct approach to read with CL=2 to get the result I am expecting?
Make sure that the servers' clocks are in sync if you are using server-side timestamps.
Better to use client-side timestamps.
Is the correct approach to read with CL=2 to get the result I am expecting?
I assume you are using the default consistency level for the delete, i.e. ONE. Since 1 + 2 > 2 (i.e. W + R > N) in your case, it is OK.
Local to the replica it will be sub-ms. The time is dominated by the network hops app -> coordinator -> replica -> coordinator -> app. Use QUORUM or LOCAL_QUORUM on both the write and the read for consistency across sequential requests like that.
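The W + R > N rule mentioned above is easy to check by hand, but a tiny sketch may make it concrete. `ReplicaOverlap` and its method names are hypothetical helpers, not part of any Cassandra driver; the quorum formula used is the usual floor(RF / 2) + 1.

```java
// Hypothetical helper for reasoning about consistency levels; not a driver API.
final class ReplicaOverlap {
    // Number of replicas that make up a quorum for the given replication factor.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // True when every read set must overlap every write set (W + R > N),
    // so a read issued after a successful write reaches at least one updated replica.
    static boolean readSeesLatestWrite(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }
}
```

With RF=2, a delete at ONE followed by a read at ONE gives 1 + 1 = 2, which is not greater than 2, so the read can hit the replica that has not seen the delete yet. A delete at ONE with a read of both replicas, or QUORUM on both sides, satisfies the rule.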
Could someone please explain the difference between RDD countApprox() and count(), and if possible say which is faster? It would be of great help: we have a requirement where count() is very slow (it takes about 30 minutes). We tried countApprox(); it was fast on the first run (about 1.2 minutes) and then slowed down to 30 minutes.
This is how we used it; not sure if it's the best way to use it:
rdd.countApprox(timeout=800, confidence=0.5)
count() - Returns the number of elements in an RDD.
countApprox() - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the spark source code, support for countApprox is marked 'Experimental'.
With timeout=800, you should have seen an approximate count in <1min.
Are you sure nothing else is causing this slowdown to 30 minutes?
Share your code/code-snippet to get more accurate inputs from other members.
Not my answer, but there is a very useful and important answer here.
In short, countApprox's getFinalValue blocks until the job completes, even if that takes longer than the timeout.
getInitialValue does not block, so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitialValue the job will continue running until it reaches the final value.
If you are repeating this in a loop, the final-value computation will still be running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose.
rdd.count() is an action, which is an eager operation.
This means that all the transformations you wrote before it start executing only now, because of Spark's lazy evaluation. So, essentially, it's not only the count() operation that is taking all the time but also all the other operations that were waiting to be executed.
Now, coming back to the question of count() vs countApprox().
count() is just like doing a select count(*) from Table. countApprox() takes a timeout and a confidence level and returns a result that is approximately correct, a number you can live with.
Use countApprox() when an approximate number is good enough and you want to save time, for example in a streaming application.
Use count() when you need the exact count, for example to log something or for auditing.
I have a cluster of 3 Cassandra 2.0 nodes. For my application I wrote a test which writes some data into Cassandra and reads it back. In general this works fine.
The curious thing is that after I restart my computer, this test fails: after writing, I read the same value I wrote before and get null instead of the value, but there was no exception while writing.
If I manually truncate the column family used, the test passes. After that I can execute the test as often as I want; it passes again and again. Furthermore, it doesn't matter whether there are values in Cassandra or not. The result is always the same.
If I look at the CLI and the CQL-shell there are two different views:
Does anyone have an idea what is going wrong? The timestamp in the CLI is updated after re-execution, so it seems to be a read problem?
A part of my code:
For inserts I tried
Insert.Options insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue())
.using(timestamp(System.nanoTime() / 1000));
and
Insert insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue());
My select looks like
Select.Where select = QueryBuilder.select(WERT)
.from(KEYSPACE_NAME,TABLENAME)
.where(eq(ID, id))
.and(eq(JAHR, zonedDateTime.getYear()))
.and(eq(MONAT, zonedDateTime.getMonthValue()))
.and(eq(ZEITPUNKT, Date.from(instant)));
The consistency level is QUORUM (for both) and the replication factor is 3.
I'd say this seems to be a problem with timestamps since a truncate solves the problem. In Cassandra last write wins and this could be a problem caused by the use of System.nanoTime() since
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
...
The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()
This means that the write that occurred before the restart could have been performed "in the future" compared to the write after the restart. This would not fail the query, but the newly written value would simply not be visible, because a "newer" value (one with a higher timestamp) is already stored.
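The effect described above can be reproduced with a toy last-write-wins cell. This is a deliberately simplified sketch, not Cassandra's actual storage code: `LastWriteWinsCell` is a hypothetical class, and timestamp ties are simply ignored here, whereas Cassandra has its own tie-breaking rule.

```java
// Hypothetical, simplified model of a single cell resolved by write timestamp.
final class LastWriteWinsCell {
    private String value;
    private long timestamp = Long.MIN_VALUE;

    // Keeps the write only if its timestamp is strictly newer than the stored one.
    // A write with an older timestamp is accepted without error but has no visible effect.
    void write(String newValue, long writeTimestamp) {
        if (writeTimestamp > timestamp) {
            value = newValue;
            timestamp = writeTimestamp;
        }
    }

    String read() {
        return value;
    }
}
```

If the first write carries a large nanoTime-derived timestamp and the write after the restart carries a smaller one, read() keeps returning the first value, and no exception is raised for the second write.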
Do you have a requirement to use sub-millisecond precision for the insert timestamps? If possible I would recommend using System.currentTimeMillis() instead of nanoTime().
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()
If you have a requirement to use sub-millisecond precision, it would be possible to combine System.currentTimeMillis() with some kind of atomic counter that ranges between 0 and 999 (for example, milliseconds * 1000 + counter) and use the result as the timestamp. This would, however, break if multiple clients insert the same row at the same time.
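One possible shape for such a generator is sketched below. It is a variant of the idea rather than a literal 0-999 counter: wall-clock milliseconds are scaled to microseconds, and when several calls land in the same millisecond the last issued value is bumped by one, so the low three digits act as the counter. `MicrosTimestamps` is a hypothetical class name, and the guarantee only holds within a single JVM.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical per-process generator of microsecond-resolution write timestamps.
final class MicrosTimestamps {
    private final AtomicLong last = new AtomicLong();

    // Timestamp based on the current wall-clock time.
    long next() {
        return next(System.currentTimeMillis());
    }

    // Same logic with the clock passed in, which makes the behaviour easy to test.
    long next(long nowMillis) {
        long candidate = nowMillis * 1000;
        // Never go backwards and never repeat: take the clock value if it has advanced,
        // otherwise the previous timestamp plus one microsecond.
        return last.updateAndGet(prev -> Math.max(prev + 1, candidate));
    }
}
```

The value from next() could then be passed to the insert in place of System.nanoTime() / 1000. More than 1000 calls within one millisecond spill over into the following millisecond, and two separate clients can still produce identical timestamps, as noted above.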
I am new to Cassandra and am having an issue with counters double counting sometimes. I am trying to keep track of daily event counts for certain events. Here is my table structure:
create table pipes.pipe_event_counts (
count counter,
pipe_id text,
event_type text,
date text,
PRIMARY KEY ((pipe_id, event_type, date))
);
The driver I am using is the Datastax Java driver, and I am compiling and binding parameters to the following prepared statement:
incrementPipeEventCountStatement = CassandraClient.getInstance().getSession().prepare(
QueryBuilder.update("pipes", PIPE_EVENT_COUNT_TABLE_NAME).with(incr("count")).
where(eq("pipe_id", "?")).and(eq("date", "?")).and(eq("event_type", "?")).
getQueryString()
);
incrementPipeEventCountStatement.bind(
event.getAttrubution(Meta.PIPE_ID), dateString, event.getType().toString()
)
The problem is very weird. Sometimes when I process a single event, the counter increments properly by 1. However, the majority of the time, it double increments. I've been looking at my code for some time now and can't find any issues that would cause a second increment.
Is my implementation of counters in Cassandra correct for my use case? I think it is, but I could be losing my mind. I'm hoping someone can help me confirm so I can focus in the right area to find my problem.
Important edit: This is the query I'm running to check the count after the event:
select count from pipes.pipe_event_counts where pipe_id = 'homepage' and event_type = 'click' and date = '2015-04-07';
The thing with counters is that increments are not idempotent operations, so when you retry (without knowing whether your original write was successful) you may end up over-counting.
Alternatively, you can never retry, and risk under-counting instead.
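The retry problem can be illustrated with a toy model. This is a hypothetical simulation, not driver code: `FlakyCounter` stands in for a counter whose increment is applied but whose acknowledgement is lost (for example, on a timeout), and `incrementWithRetry` is a naive client that retries until it sees an acknowledgement.

```java
// Hypothetical simulation of a non-idempotent increment combined with client retries.
final class RetryOvercountDemo {
    static final class FlakyCounter {
        private long value;
        private boolean ackWillBeLost;

        // Makes the next increment succeed on the server side but report failure to the client.
        void loseNextAck() {
            ackWillBeLost = true;
        }

        // Always applies the increment; returns false when the acknowledgement is "lost".
        boolean increment(long delta) {
            value += delta;
            if (ackWillBeLost) {
                ackWillBeLost = false;
                return false;
            }
            return true;
        }

        long value() {
            return value;
        }
    }

    // Naive client: retries until an acknowledgement arrives or attempts run out.
    static void incrementWithRetry(FlakyCounter counter, long delta, int maxAttempts) {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (counter.increment(delta)) {
                return;
            }
        }
    }
}
```

A single logical increment of 1 whose first acknowledgement is lost leaves the counter at 2, which is the kind of double count described in the question.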
As Chris shared, there are some issues with the counter implementation pre-2.1 that make the over-counting issue much more severe. There are also performance issues associated with counters, so make sure you look into these in detail before you push a counter deployment to production.
Here are the related Jiras to help you make informed decisions:
Counters ++ (major improvement - fixed 2.1) -- https://issues.apache.org/jira/browse/CASSANDRA-6504
Memory / GC issues from large counter workloads, Counter Column (major improvement - fixed 2.1)--https://issues.apache.org/jira/browse/CASSANDRA-6405
Counters into separate cells (final solution - eta 3.1)- https://issues.apache.org/jira/browse/CASSANDRA-6506