I am new to Cassandra and am having an issue with counters double counting sometimes. I am trying to keep track of daily event counts for certain events. Here is my table structure:
create table pipes.pipe_event_counts (
count counter,
pipe_id text,
event_type text,
date text,
PRIMARY KEY ((pipe_id, event_type, date))
);
The driver I am using is the Datastax Java driver, and I am compiling and binding parameters to the following prepared statement:
incrementPipeEventCountStatement = CassandraClient.getInstance().getSession().prepare(
QueryBuilder.update("pipes", PIPE_EVENT_COUNT_TABLE_NAME).with(incr("count")).
where(eq("pipe_id", "?")).and(eq("date", "?")).and(eq("event_type", "?")).
getQueryString()
);
incrementPipeEventCountStatement.bind(
event.getAttrubution(Meta.PIPE_ID), dateString, event.getType().toString()
)
The problem is very weird. Sometimes when I process a single event, the counter increments properly by 1. However, the majority of the time, it double increments. I've been looking at my code for some time now and can't find any issues that would cause a second increment.
Is my implementation of counters in Cassandra correct for my use case? I think it is, but I could be losing my mind. I'm hoping someone can help me confirm so I can focus in the right area to find my problem.
Important edit: This is the query I'm running to check the count after the event:
select count from pipes.pipe_event_counts where pipe_id = 'homepage' and event_type = 'click' and date = '2015-04-07';
The thing with counters is that they are not idempotent operations so when you retry (and don't know if your original write was successful) you may end up over-counting.
You can also never re-try and undercount.
As Chris chared, there are some issues with the counter implementation pre-2.1 that make the overcounting issue much more severe. There are also performance issues associated with counters so you want to make sure you look into these in detail before you push a counter deployment to production.
Here are the related Jiras to help you make informed decisions:
Counters ++ (major improvement - fixed 2.1) -- https://issues.apache.org/jira/browse/CASSANDRA-6504
Memory / GC issues from large counter workloads, Counter Column (major improvement - fixed 2.1)--https://issues.apache.org/jira/browse/CASSANDRA-6405
Counters into separate cells (final solution - eta 3.1)- https://issues.apache.org/jira/browse/CASSANDRA-6506
Related
I have a node.js function that needs to be executed for each order on my application. In this function my app gets an order number from a oracle database, process the order and then adds + 1 to that number on the database (needs to be the last thing on the function because order can fail and therefore the number will not be used).
If all recieved orders at time T are processed at the same time (asynchronously) then the same order number will be used for multiple orders and I don't want that.
So I used rabbit to try to remedy this situation since it was a queue. It seems that the processes finishes in the order they should, but a second process does NOT wait for the first one to finish (ack) to begin, so in the end I'm having the same problem of using the same order number multiple times.
Is there anyway I can configure my queue to process one message at a time? To only start process n+1 when process n has been acknowledged?
This would be a life saver to me!
If the problem is to avoid duplicate order numbers, then use an Oracle sequence, or use an identity column when you insert into a table to generate the order number:
CREATE TABLE mytab (
id NUMBER GENERATED BY DEFAULT ON NULL AS IDENTITY(START WITH 1),
data VARCHAR2(20));
INSERT INTO mytab (data) VALUES ('abc');
INSERT INTO mytab (data) VALUES ('def');
SELECT * FROM mytab;
This will give:
ID DATA
---------- --------------------
1 abc
2 def
If the problem is that you want orders to be processed sequentially, then don't pull an order from the queue until the previous one is finished. This will limit your throughput, so you need to understand your requirements and make some architectural decisions.
Overall, it sounds Oracle Advanced Queuing would be a good fit. See the node-oracledb documentation on AQ.
Could someone please explain the difference between RDD countApprox() vs count() and also if possible can answer which is the fastest ? it would be of great help we have a requirement where count() is very slow takes about 30 min's ** ...tried countApprox() it was **fast for the first run (**About 1.2 min) and then slowed to 30 min's .....
this is how we used it not sure if it's the best way to use
rdd.countApprox(timeout=800, confidence=0.5)
Count() - Returns you the number of elements in an RDD.
CountApprox - Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
countApprox(timeout: Long, confidence: Double)
Default: confidence = 0.95
Note: As per the spark source code, support for countApprox is marked 'Experimental'.
With timeout=800, you should have seen an approximate count in <1min.
Are you sure nothing else is causing this slowdown of 30mins.
Share your code/code-snippet to get more accurate inputs from other members.
Not my answer, but there is a very useful and important answer here.
In very short, countApprax.getFinalValue blocks even if this is longer than the timeout.
getInitialValue does not block and so you will get a response within the timeout.
BUT, as I learned from painful experience, even if you use getInitalValue the process will continue to final value.
If you are repeating this in a loop, the getFinalValue will be running for multiple RDDs long after you have retrieved the result from getInitialValue. This can then lead to OOM conditions and broadcast errors that are difficult to diagnose
rdd.count() is an action, which is an eager operation.
This means that all the other transformations that you had written before that will start executing now because of Spark's lazy evaluation. So, essentially its not only Count() operation that's taking all the time but, all the other operations which were waiting to get executed.
Now coming back to the question of count() vs countApprox().
Count is just like doing a select count(*) from Table. countApprox can have a timeout and confidence level which returns back a result which is approximately correct and a number you can live with.
We should use countApprox when we are more interested in knowing an approximate number and save time for example in a streaming application.
Count() should be used when you need the exact count for example to log something or for auditing.
I've created a few reasonably complex M queries and have started running into some severe performance issues. I'm wondering if has to do with how I sometime organize my code.
The issues I've been having are:
1) Power Query constantly uses all of several CPU cores, calculating something, even if I'm not waiting for a result.
2) In task manager I can sometimes see that the Power Query threads ("Microsoft.mashup.Container.NetFX40.exe") are nearly idle, while Excel.exe is using 100% of one core for tens of minutes - even though at most I'm looking values in a few parameter tables that don't contain more than a couple dozen cells.
3) Some steps take extremely long to calculate, even though the operations involved are trivial. For example, I have a list of 10 text values taken from an Excel table. This list appears as one of my query steps when I 'preview' it. Then I want to remove a single value, so the next step = List.RemoveItems(myList, {"val"}). It didn't compute after 30 minutes, even though I could see the list was correctly loaded in a previous step.
4) UI sometimes becomes unresponsive for several minutes after changing code. Can still right-click on Queries at left hand side to enter advanced editor, and click the red X at top right and choose to keep changes, but all the rest is unresponsive. Not greyed out, just unresponsive.
Anyway, I just wanted to ask if anyone's had similar trouble, and if anyone knows what triggers particularly bad performance in PQ.
I'll often use something like the following pattern to keep the total number of queries down while still being able to easily inspect individual steps:
let
ThisWB = Excel.CurrentWorkbook(),
CfgTbl = ThisWB{[Name="myCfgTbl"]}[Content],
x = aFn(CfgTbl),
y = bFn(CfgTbl),
output = [ThisWB=ThisWB, CfgTbl=CfgTbl, x=x, y=y]
in
output
Is this likely to lead to any issues? Just thought it might because at one point after waiting a very long time for a simple function result, I created a new query = Excel.CurrentWorkbook(){[Name="myCfgTbl"]}[Content], referenced it from the other query, and my result calculated immediately. No idea why.
It calculates previews. Turn off auto preview generation.
I messed with something like this in cases with formula-heavy tables.
The rest probably requires code examples, especially your last case.
BTW, is your version of power query (or Excel 2016) up-to-date?
I have a cluster of 3 Cassandra 2.0 nodes. My application I wrote a test which tries to write and read some data into/from Cassandra. In general this works fine.
The curiosity is that after I restarted my computer, this test will fail, because after writting I read the same value I´ve write before and there I get null instead of the value, but the was no exception while writing.
If I manually truncate the used column family, the test will pass. After that I can execute this test how often I want, it passes again and again. Furthermore it doesn´t matter if there are values in the Cassandra or not. The result is alwalys the same.
If I look at the CLI and the CQL-shell there are two different views:
Does anyone have an ideas what is going wrong? The timestamp in the CLI is updated after re-execution, so it seems to be a read-problem?
A part of my code:
For inserts I tried
Insert.Options insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue())
.using(timestamp(System.nanoTime() / 1000));
and
Insert insert = QueryBuilder.insertInto(KEYSPACE_NAME,TABLENAME)
.value(ID, id)
.value(JAHR, zonedDateTime.getYear())
.value(MONAT, zonedDateTime.getMonthValue())
.value(ZEITPUNKT, date)
.value(WERT, entry.getValue());
My select looks like
Select.Where select = QueryBuilder.select(WERT)
.from(KEYSPACE_NAME,TABLENAME)
.where(eq(ID, id))
.and(eq(JAHR, zonedDateTime.getYear()))
.and(eq(MONAT, zonedDateTime.getMonthValue()))
.and(eq(ZEITPUNKT, Date.from(instant)));
Consistencylevel is QUORUM (for both) and replicationfactor 3
I'd say this seems to be a problem with timestamps since a truncate solves the problem. In Cassandra last write wins and this could be a problem caused by the use of System.nanoTime() since
This method can only be used to measure elapsed time and is not related to any other notion of system or wall-clock time.
...
The values returned by this method become meaningful only when the difference between two such values, obtained within the same instance of a Java virtual machine, is computed.
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#nanoTime()
This means that the write that occured before the restart could have been performed "in the future" compared to the write after the restart. This would not fail the query, but the written value would simply not be visible due to the fact that there is a "newer" value available.
Do you have a requirement to use sub-millisecond precision for the insert timestamps? If possible I would recommend using System.currentTimeMillis() instead of nanoTime().
http://docs.oracle.com/javase/7/docs/api/java/lang/System.html#currentTimeMillis()
If you have a requirement to use sub-millisecond precision it would be possible to use System.currentTimeMillis() with some kind of atomic counter that ranged between 0-999 and then use that as a timestamp. This would however break if multiple clients insert the same row at the same time.
We are doing the following to update the value of a counter, now we wonder if there is a straightforward way to get back the updated counter value immediately.
mutator.incrementCounter(rowid1, "cf1", "counter1", value);
There's no single 'incrementAndGet' operation in Cassandra thrift API.
Counters in Cassandra are eventually consistent and non-atomic. Fragile ConsistencyLevel.ALL operation is required to get "guaranteed to be updated" counter value, i.e. perform consistent read. ConsistencyLevel.QUORUM is not sufficient (as specified in counters design document: https://issues.apache.org/jira/secure/attachment/12459754/Partitionedcountersdesigndoc.pdf).
To implement incrementAndGet method that looks consistent, you might want at first read counter value, then issue increment mutation, and return (read value + inc).
For example, if previous counter value is 10 to 20 (on different replicas), and one add 50 to it, read-before-increment will return either 60 or 70. And read-after-increment might still return 10 or 20.
The only way to do it is query for it. There is no increment-then-read functionality available in Cassandra.