We have a single-node Apache Cassandra cluster with 2 vCPUs and around 16 GB RAM on AWS, and around 28 GB of data loaded into Cassandra.
Cassandra works fine for SELECT and GROUP BY queries on primary-key columns; however, when we use user-defined functions to run aggregate functions on a non-primary-key column, it times out.
To elaborate: we have partitions on Year, Month and Date for 3 years of data. For example, given two columns Bill_ID and Bill_Amount, we want the sum of Bill_Amount per Bill_ID using a UDF.
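For reference, the aggregation we are attempting looks roughly like the sketch below, written against the Python driver. The contact point, keyspace, table and function names are placeholders, not our production names.

```python
from cassandra.cluster import Cluster

# Placeholder address/keyspace; UDFs also require
# enable_user_defined_functions: true in cassandra.yaml.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect('billing')

# State function: accumulate a running sum of Bill_Amount per Bill_ID in a map.
session.execute("""
    CREATE OR REPLACE FUNCTION state_sum_by_bill(
        state map<text, double>, bill_id text, amount double)
    CALLED ON NULL INPUT
    RETURNS map<text, double>
    LANGUAGE java AS '
        Double current = (Double) state.get(bill_id);
        state.put(bill_id, current == null ? amount : current + amount);
        return state;
    ';
""")

# Aggregate: fold every row of the result set through the state function.
session.execute("""
    CREATE OR REPLACE AGGREGATE sum_by_bill(text, double)
    SFUNC state_sum_by_bill
    STYPE map<text, double>
    INITCOND {};
""")

# The query that times out is then of the form:
rows = session.execute("SELECT sum_by_bill(bill_id, bill_amount) FROM bills")
```

Note that an aggregate like this is evaluated on the coordinator over a scan of every partition in the table, which is presumably what makes it so much slower than our primary-key queries.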
We are confused here: if the info says the coordinator received 1 response and only 1 response was required, why does it report a timeout at all? And why do we only get the timeout when using user-defined functions?
ReadTimeout: Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 1 responses." info={'received_responses': 1, 'required_responses': 1, 'consistency': 'ONE'}
We have increased the read timeouts in cassandra.yaml to as high as 10 minutes.
Edit: adding a screenshot of the query, with the results shown before setting --request-timeout and afterwards when using the UDF. The table has 150 million rows spread over only about 1095 partitions (roughly one per day) for the 3 years of data, with the primary key being year, month and day.
Try increasing the timeout on the client side as well; with cqlsh, for example:
cqlsh --request-timeout=3600
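If you are hitting the same limit from the Python driver rather than cqlsh, the equivalent knob is the session's client-side timeout. A minimal sketch, with the address and keyspace as placeholders:

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])      # placeholder contact point
session = cluster.connect('billing')  # placeholder keyspace

# Client-side request timeout in seconds (the driver default is only 10).
session.default_timeout = 600

# Or override it for a single statement:
rows = session.execute("SELECT count(*) FROM bills", timeout=600)
```

Keep in mind this only stops the client from giving up early; the server-side timeouts in cassandra.yaml still apply.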
I have raised a GitHub issue regarding this as well. Pasting the same below.
JanusGraph version - janusgraph-0.3.1
Cassandra - cassandra:3.11.4
When we run JanusGraph with the Cassandra backend, after a period of time JanusGraph starts throwing the errors below and goes into an unusable state.
JanusGraph Logs:
466489 [gremlin-server-exec-6] INFO org.janusgraph.diskstorage.util.BackendOperation - Temporary exception during backend operation [EdgeStoreKeys]. Attempting backoff retry.
org.janusgraph.diskstorage.TemporaryBackendException: Temporary failure in storage backend
    at io.vavr.API$Match$Case0.apply(API.java:3174)
    at io.vavr.API$Match.of(API.java:3137)
    at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.lambda$static$0(CQLKeyColumnValueStore.java:123)
    at io.vavr.control.Try.getOrElseThrow(Try.java:671)
    at org.janusgraph.diskstorage.cql.CQLKeyColumnValueStore.getKeys(CQLKeyColumnValueStore.java:405)
Caused by: com.datastax.driver.core.exceptions.ReadFailureException: Cassandra failure during read query at consistency QUORUM (1 responses were required but only 0 replica responded, 1 failed)
    at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:130)
    at com.datastax.driver.core.exceptions.ReadFailureException.copy(ReadFailureException.java:30)
Cassandra Logs:
WARN [ReadStage-2] 2019-07-19 11:40:02,980 ReadCommand.java:569 - Read 74 live rows and 100001 tombstone cells for query SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100 (see tombstone_warn_threshold)
ERROR [ReadStage-2] 2019-07-19 11:40:02,980 StorageProxy.java:1896 - Scanned over 100001 tombstones during query 'SELECT * FROM janusgraph.edgestore WHERE column1 >= 02 AND column1 <= 03 LIMIT 100' (last scanned row partion key was ((00000000002b9d88), 02)); query aborted
Related Question:
Cassandra failure during read query at consistency LOCAL_ONE (1 responses were required but only 0 replica responded, 1 failed)
Questions:
1) Are edge updates stored as new items, causing the tombstones? (Since JanusGraph is a fork of Titan.)
How to increment Number of Visit count in Titan graph database Edge Label?
https://github.com/JanusGraph/janusgraph/issues/934
2) What is the right approach to this?
Any solutions or pointers would be really helpful.
[Update]
1) Updates to the edges did not cause the tombstones in JanusGraph.
2) Solutions:
- As per the answer below, reduce gc_grace_seconds to a lower value based on how often edges/vertices are deleted.
- Also consider tuning tombstone_failure_threshold in cassandra.yaml based on your needs.
In Cassandra, a tombstone is a flag indicating that a record should be deleted. It can appear after a delete operation is explicitly requested, or once a record's Time To Live (TTL) period has expired. A tombstoned record persists for the time defined by gc_grace_seconds after the delete was executed, which is 10 days by default.
Usually, running nodetool repair janusgraph edgestore (based on the error log provided) should fix the issue. If you are still getting the error, you may need to decrease the gc_grace_seconds value of your table, as explained here.
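For example, gc_grace_seconds can be lowered with a plain ALTER TABLE. A sketch via the Python driver; the one-day value is an arbitrary illustration you should weigh against your own repair schedule:

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])  # placeholder contact point
session = cluster.connect()

# Lower gc_grace_seconds from the 10-day default to 1 day (86400 s).
# Caution: keep gc_grace_seconds longer than your repair interval,
# otherwise deleted data can reappear from an out-of-sync replica.
session.execute("ALTER TABLE janusgraph.edgestore WITH gc_grace_seconds = 86400")
```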
For more information regarding tombstones:
https://thelastpickle.com/blog/2016/07/27/about-deletes-and-tombstones.html
Tombstone vs nodetool and repair
I need help understanding why a query with ConsistencyLevel.ONE would receive no response from any of the 3 replicas (RF=3). This happens sporadically, but frequently enough to cause headaches.
I see the same thing occasionally when doing queries with QUORUM, which I could almost rationalize as something like 2/3 of the replicas are "down" doing compaction, GC, something.
But with ONE, doesn't that mean all 3 replicas are "down"? What could cause them to be down? Could something else be going on?
Full disclosure: we're running 2.1.9 with 15 nodes in a single data center in GCE, but the nodes are in 3 different zones. As far as I know, this isn't due to a network partition among the zones, but anything is possible.
Please help me understand the possible causes for 0 replicas responding. Thanks in advance!
Here's an example of an exception:
cassandra.OperationTimedOut: errors={: ReadTimeout('Error from server: code=1200 [Coordinator node timed out waiting for replica nodes\' responses] message="Operation timed out - received only 0 responses." info={\'received_responses\': 0, \'required_responses\': 1, \'consistency\': \'ONE\'}',), : ConnectionException('Host has been marked down or removed',)}, last_host=10.240.0.31
Note
This is getting quite long so I will try and re-edit parts through the day.
These databases are no longer active, which means I can play with them to work out what is going wrong.
The only thing left to answer: given two databases running on Azure SQL Database at the S3 tier (100 DTU), should a secondary ever fall significantly behind the primary, even while the DTU is hammered at 100% for over half the day? The DTU is being hammered mostly by IO writes.
The Start: a few problems.
DTU limits were hit on Monday, Tuesday and to some extent Wednesday for a significant amount of time (3 PM UTC to 6 AM UTC).
Problem 1 (lag in data on the secondary): this appeared to cause the secondary's data to lag by about 9 1/2 hours. The servers were effectively being spammed with updates, causing a lot of IO writes: 6-8 million records on one table over a 24-hour period, for example. This problem drove the reason for the post:
Shouldn't these be much more in sync?
The data became out of sync on Monday morning and stayed out of sync until Friday. On Thursday some new databases were started up to replace these Standard-tier SQL databases, so these were left to rot. Well, for me to experiment with at least.
The application causing the redundant queries couldn't be fixed for a few days (I doubt they will ever fix it now), so that left changing the service tier. That was attempted on the current instance, but an instance must disconnect from all Standard replicas before it can move up to the Premium tier. This led to the second problem (see below): the replica taking its time to be removed. It began on Wednesday morning and did not complete until Friday.
Problem 2 (can't remove the replica):
(Solved itself after two days)
The process of disconnecting the secondary database began around Wednesday 8:00 UTC, when the primary was at about 80 GB; the secondary was about 11 GB behind in size at that point.
Setup
The databases (primary and secondary) are S3 tier, geo-replicated (North + West Europe).
There is an audit log table (which I read from the secondary, normally with an SQL query), but it is currently 9 1/2 hours behind the last entry in the primary database. Running the query on the secondary again a few seconds later shows it slowly catching up, but the progress appears to be tied to the refresh rate rather than actually playing catch-up.
Both primary and secondary (read-only) databases are S3 (about to be bumped to P2).
The Azure documentation states:
Active Geo-Replication (readable secondaries) is now available for all databases in all service tiers. In April 2017, the non-readable secondary type will be retired and existing non-readable databases will automatically be upgraded to readable secondaries.
How has the secondary got so far behind? Seconds to minutes would be acceptable; hours, not so much. The documentation above describes the lag as only "slightly" behind:
While at any given point, the secondary database might be slightly behind the primary database, the secondary data is guaranteed to always be transactionally consistent with changes committed to the primary database.
Given that the secondary is about to be destroyed and replaced at a higher tier (replicas must be removed when upgrading from Standard to Premium), I'm curious to know whether it will happen again, as well as what the definition of "slightly" might be in this instance.
Notes: the primary did reach maximum DTU for a few hours, but performance wasn't harmed significantly; this is when the 9 1/2 hour difference was noticed.
Stats
Update for TheGameiswar:
I can't query it right now, as it has started removing itself as a replica (to allow the primary to move up to the P2 tier; that began hours ago at ~8:30 UTC, and 5 hours later it is still going). I think it's quite broken now.
Query - nothing special:
SELECT TOP 1000 [ID]
,[DateCreated]
,[SrcID]
,[HardwareID]
,[IP_Port]
,[Action]
,[ResponseTime]
,[ErrorCode]
,[ExtraInfo]
FROM [dbo].[Audit]
ORDER BY [DateCreated] DESC
I can't compare the tables anymore as it's quite stuck and refusing connections.
The 586 hours (10-14 GB) are the combined time of the inserts into the primary database's audit table. It was similar yesterday, when the 9 1/2 hour difference in data was noticed.
When the attempt was made to remove the replica (another person started the process), there was about a 10 GB difference in size.
I can't compare the data, but I can show the DB size at the equivalent time.
Primary DB Size (last 24 hours):
Secondary DB Size (last 24 hours):
Primary database size - week view
Secondary database size - week view
As mentioned, it is being removed as a replica, but it is still playing catch-up with the DB size, as the charts above show.
Stop replication errored for serverName: ---------, databaseName: Cloud4
ErrorCode: 400
ErrorMessage: A terminate operation is already in progress for database 'Cloud4'.
Update 2 - Replication - dm_continuous_copy_status
Replication is still being removed... moving on...
select * from sys.dm_continuous_copy_status
sys.dm_exec_requests
Querying from Thursday, it appears to be quite empty, the only record being:
Replica removed itself at last.
The replica finally removed itself after 2 days, at the 80 GB mark I predicted. It waited to replay the transaction data (up to the point at which it was removed as a replica) before removing itself.
A week later on the P2 databases
DTU is holding between 20-40% at busy periods, currently handling ~12 million data entries every 24 hours (a similar amount for reads, but writing costs much more on the indexes and the table); that is 70-100% more inserts than the week before. This time the replica is not struggling to keep up, which is good, but that is likely because the DTU is not reaching 100%.
Conclusion
Replicas are useful, but not in this case. This one caused degraded performance for several days, which could have been averted by a simple increase in the service tier until the cause of the problem was fixed. If the replica looks like it is dragging behind and you are on the border of Basic -> Standard or Standard -> Premium, it is safest to remove the replica as soon as possible and move up to the higher tier.
Now we are on P2. The database is increasing at 20 GB a day... and they say they have fixed the problem that was sending 15 thousand redundant updates per minute. Thanks to Query Performance Insight for highlighting that, as querying the table is extremely painful on the DTU (even querying the last minute of data in that table is bad on the DTU, with ~15 thousand new records arriving every minute).
62617: insert ...
62618: select ...
62619: select ...
A positive from the above is that the combined time for the insert statements has moved from 586 hours on S3 (7.5 million rows inserted per day) to 3 hours on P2 (12.4 million rows per day): an extremely significant decrease in processing time. It did start with an empty table on Thursday, but it has surpassed the previous database's size within a week, whereas the previous one took a few months to get there.
It's doing well on the new tier. DTU should be ~5% if the applications were using the database responsibly, and the secondary is up to date.
Spoke too soon. Now on P2
Someone thought it was a good idea to run an SQL query that repeats itself, deleting a thousand rows at a time, against a table receiving 12 million new rows a day.
Between 10 AM and 12 AM it managed to remove about 5.2 million rows, and now the database is showing signs of being in the same state as last week. I'm curious whether that is what happened then, too.
I was trying to import a CSV with about 20 million rows.
I did a pilot run with a CSV of a few hundred rows, just to check that the columns were in order and that there were no parsing errors. All went well.
Every time I tried importing the 20-million-row CSV, it failed after a varying amount of time. On my local machine it failed after 90 minutes with the following error; on the server box it fails within 10 minutes:
Processed 4050000 rows; Write: 624.27 rows/ss
code=1100 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out - received only 0 responses." info=
{'received_responses': 0, 'required_responses': 1, 'write_type': 0, 'consistency': 1}
Aborting import at record #4050617. Previously-inserted values still present.
4050671 rows imported in 1 hour, 26 minutes, and 43.649 seconds.
Cassandra: Coordinator node timed out waiting for replica nodes' responses (it is a one-node cluster and the replication factor is 1, so why it is waiting for other nodes is another question)
Then, based on a recommendation in another thread, I changed the write timeout, though I was not convinced it was the root cause:
write_request_timeout_in_ms: 20000
(I also tried changing it to 300000.)
But it still eventually fails.
So now I have chopped the original CSV into many 500,000-line CSVs (a sketch of the splitting script is below).
This has a better success rate (compared to zero!), but even these fail 2 times out of 5, for various reasons.
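(For what it's worth, the chopping itself was done with a trivial script along these lines; the file name is made up.)

```python
import csv

def split_csv(path, lines_per_chunk=500_000):
    """Split a large CSV into numbered chunk files, repeating the header."""
    with open(path, newline='') as src:
        reader = csv.reader(src)
        header = next(reader)
        chunk, writer, out = 0, None, None
        for i, row in enumerate(reader):
            if i % lines_per_chunk == 0:
                if out:
                    out.close()
                chunk += 1
                out = open(f'{path}.part{chunk}.csv', 'w', newline='')
                writer = csv.writer(out)
                writer.writerow(header)  # each chunk stays a valid CSV
            writer.writerow(row)
        if out:
            out.close()

split_csv('bills.csv')  # hypothetical file name
```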
Sometimes I get the following error:
Processed 460000 rows; Write: 6060.32 rows/ss
Connection heartbeat failure
Aborting import at record #443491. Previously inserted records are still present, and some records after that may be present as well.
Other times it just stops updating the progress on the console, and the only way out is to abort with Ctrl+C.
I've spent most of the day like this. My RDBMS is running happily with 5 billion rows; I wanted to test Cassandra with 10 times as much data, but I'm having trouble even importing a million rows at a time.
One observation about how the COPY command proceeds: once the command is entered, it starts writing at a rate of about 10,000 rows per second. It sustains this speed until it has inserted about 80,000 rows. Then there is a pause of about 30 seconds, after which it consumes another 70,000 to 90,000 rows, pauses for another 30 seconds, and so on, until it either finishes all the rows in the CSV, fails midway with an error, or simply hangs.
I need to get to the root of this. I really hope to find that I am doing something silly and it's not something I have to accept and work around.
I am using Cassandra 2.2.3
A lot of people have trouble with the COPY command; it seems to work for small datasets, but it starts to fail when you have a lot of data.
The documentation recommends using the SSTable loader if you have a few million rows to import. I used it at my company and had a lot of consistency problems.
I have tried everything, and for me the safest way to import a large amount of data into Cassandra is to write a little script that reads your CSV and then executes async queries. Python does this very well.
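A minimal sketch of that approach, assuming a made-up keyspace and two-column table, and throttling by waiting on a window of in-flight futures:

```python
import csv
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# Prepared statements are parsed once, so per-row cost is much lower.
insert = session.prepare("INSERT INTO my_table (id, value) VALUES (?, ?)")

WINDOW = 200  # max in-flight requests; tune to what the node can absorb
futures = []

with open('data.csv', newline='') as f:
    reader = csv.reader(f)
    next(reader)  # skip the header row
    for row in reader:
        futures.append(session.execute_async(insert, (row[0], row[1])))
        if len(futures) >= WINDOW:
            for future in futures:
                future.result()  # block; surfaces any write error
            futures = []

for future in futures:
    future.result()
```

The window keeps the client from flooding a single node with unbounded async writes, which is exactly what tends to make COPY fall over.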
Will is correct. COPY is meant for small data sets and usually struggles once you start hitting millions of rows. In addition to the SSTable loader, there's this utility: https://github.com/brianmhess/cassandra-loader which I find to have very good performance and some added convenience.
I have 5 nodes in my ring with SimpleStrategy and replication_factor=3. I inserted 1M rows using the stress tool. When I try to read the row count in cqlsh using
SELECT count(*) FROM Keyspace1.Standard1 LIMIT 1000000;
It fails with error:
Request did not complete within rpc_timeout.
It works with LIMIT 100000, but fails even for 500000.
All my nodes are up. Do I need to increase the rpc_timeout?
Please help.
You get this error because the request is timing out on the server side. Be aware that this is a very expensive operation in Cassandra, as others have pointed out.
Still, if you really want to do this, you should update your /etc/cassandra/cassandra.yaml file and change the range_request_timeout_in_ms parameter. This will apply to all of your range queries.
Example to set a 40 second timeout:
range_request_timeout_in_ms: 40000
You will probably have to adjust the client side as well. When using cqlsh as the client, this is accomplished by creating/updating the cqlsh configuration file at ~/.cassandra/cqlshrc and adding the client_timeout parameter to the connection section.
Example to set a 40 second timeout:
[connection]
client_timeout=40
Reading 1M rows takes a long time, so that is probably why it is timing out. You shouldn't use count(*) like this; it is very expensive, since it has to read all the data. Use Cassandra counters if you need to count lots of items (a sketch follows below).
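A minimal sketch of the counter approach, with made-up keyspace and table names: you bump the counter on every write, and reading the count becomes a single-row lookup instead of a full scan.

```python
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])          # placeholder contact point
session = cluster.connect('my_keyspace')  # placeholder keyspace

# A counter table with one row per counted table (placeholder names).
session.execute("""
    CREATE TABLE IF NOT EXISTS row_counts (
        table_name text PRIMARY KEY,
        total counter
    )
""")

# Increment the counter alongside every insert into the data table...
session.execute(
    "UPDATE row_counts SET total = total + 1 WHERE table_name = 'standard1'"
)

# ...then reading the count touches one tiny row instead of 1M rows.
row = session.execute(
    "SELECT total FROM row_counts WHERE table_name = 'standard1'"
).one()
print(row.total)
```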
You should also check your Cassandra logs to confirm there aren't any other issues - sometimes exceptions in Cassandra lead to timeouts on the client.
If you can live with an approximate row count, take a look at this answer to Row count of a column family in Cassandra.