In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. Because of this, clients see a lot of read and write timeouts. This repeats every few hours.
Observations:
Cluster write latency (99th percentile): 4 s
Local write latency (99th percentile): 10 ms
Read & write consistency: LOCAL_ONE
Total nodes: 7
I enabled tracing using settraceprobability for a few minutes and observed that most of the time is spent in internode communication:
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
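For reference, tracing like the above can be turned on with nodetool by sampling only a small fraction of requests; the 0.001 probability here is just an example value, and it should be set back to 0 when done:

nodetool settraceprobability 0.001   # trace roughly 0.1% of requests on this node
nodetool settraceprobability 0       # turn tracing back off afterwards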
I checked the connectivity between the Cassandra nodes but did not see any issues. The Cassandra logs are flooded with read timeout exceptions, as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue window Cassandra is not even able to read its own metadata.
Has anyone faced similar issues? Any suggestions?
(Based on the data in your comments above.) The long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from the RMI distributed GC (DGC) that runs at regular intervals regardless of whether it is needed. With a large heap that is VERY expensive. It is safe to disable.
Check your GC log header and make sure the minimum heap size is not set. I would also recommend setting -XX:G1ReservePercent=20.
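A minimal sketch of where these flags go (the exact file depends on your Cassandra version and packaging):

# conf/jvm.options (Cassandra 3.0 and later), one flag per line:
-XX:+DisableExplicitGC
-XX:G1ReservePercent=20

# conf/cassandra-env.sh (older versions):
JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC -XX:G1ReservePercent=20"

Restart the node for the change to take effect.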
This is a question about building a data-analytics pipeline in a kappa architecture. The question is conceptual.
Assume you have a system that emits events; for simplicity, assume there are just two event types, CREATED and DELETED, which indicate that an item gets created or deleted at a given point in time. Those events contain an id and a timestamp. An item gets created and is deleted again after a certain time. Assume the application ensures correct ordering of events, prevents duplicate events, and never emits two events with the exact same timestamp.
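For concreteness, the two events could look like this (the JSON encoding and any field names beyond id and timestamp are assumptions):

{"type": "CREATED", "id": "item-42", "timestamp": "2021-11-18T09:00:00.000Z"}
{"type": "DELETED", "id": "item-42", "timestamp": "2021-11-19T14:30:00.000Z"}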
The metrics that should be available in data analytics are:
Current number of items
Number of items as a graph over the last week
Number of items per day as historical data
Now, a proposal for an architecture for such a scenario would be the following:
Emit events to Kafka
Use Kafka as short-term storage
Use Superset to display live data directly from Kafka via Presto
Use Spark to consume Kafka events and write aggregations to an analytics Postgres DB
Schematically it would look like this:
Application
|
| (publish events)
↓
Kafka [topics: item_created, item_deleted]
| ↑
| | (query short-time)
| |
| Presto ←-----------┐
| |
| (read event stream) |
↓ |
Spark |
| |
| (update metrics) |
↓ |
Postgres |
↑ |
| (query) | (query)
| |
└-----Superset-----┘
This data-analytics setup should be used to visualise both historical and live data. It is very important to note that the application may already have a database with historical data. To make this work, when the data-analytics stack starts up, that database is first parsed and events are emitted to Kafka to transfer the historical data. Live data can arrive at any time and will also be processed.
An idea to make the metrics work is the following: with the help of Presto, the live events can easily be aggregated over the short-term retention of Kafka itself.
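As a rough sketch of what that aggregation could look like through Presto's Kafka connector (the kafka.default catalog/schema, the topic names, and exposing message fields as columns via a topic description file are all assumptions):

-- Current number of items = CREATED events minus DELETED events still in retention
SELECT
  (SELECT count(*) FROM kafka.default.item_created)
  - (SELECT count(*) FROM kafka.default.item_deleted) AS current_items;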
For historical data, the idea could be to create a table Items with the following schema:
--------------------------------------------
| Items |
--------------------------------------------
| timestamp | numberOfItems |
--------------------------------------------
| 2021-11-16 09:00:00.000 | 0 |
| 2021-11-17 09:00:00.000 | 20 |
| 2021-11-18 09:00:00.000 | 5 |
| 2021-11-19 09:00:00.000 | 7 |
| 2021-11-20 09:00:00.000 | 14 |
Now the idea would be that the Spark program (which would of course need to parse the schema of the topic messages) inspects the timestamp, checks which time window the event falls into (in this case which day), and updates the count by +1 in case of a CREATED event or -1 in case of a DELETED event.
The question I have is whether this is a reasonable interpretation of the problem in a kappa architecture. At startup it would mean a lot of reads and writes to the analytics database. There will be multiple Spark workers updating the analytics database in parallel, and the queries must be written so that everything is an atomic operation, not a read followed by a write-back, because the value might have been altered in the meantime by another Spark node (the kind of statement this implies is sketched below). What could be done to make this process efficient? How could Kafka be prevented from being flooded during the startup process?
Is this an intended use case for Spark? What would be a good alternative for this problem?
In terms of data throughput, assume something like 1,000-10,000 of these events per day.
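To make the atomicity concern concrete, the kind of statement each Spark worker would have to issue per event (or per pre-aggregated delta) is something like the following sketch, assuming the analytics table is items(day, number_of_items) with a primary key on day:

-- delta is +1 for CREATED and -1 for DELETED; concurrent workers hitting the
-- same day are serialized by Postgres on the conflicting row
INSERT INTO items (day, number_of_items)
VALUES ('2021-11-20', 1)
ON CONFLICT (day)
DO UPDATE SET number_of_items = items.number_of_items + EXCLUDED.number_of_items;

Summing the deltas per day inside Spark before writing would reduce the number of such statements considerably during the historical startup replay.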
Update:
Apparently Spark is not intended to be used like this, as can be seen from this issue.
Apparently Spark is not intended to be used like this
You don't need Spark, or at least, not completely.
Kafka Streams can be used to move data between various Kafka topics.
Kafka Connect can be used to insert/upsert into Postgres via JDBC Connector.
Also, you can use Apache Pinot for indexed real-time and batch/historical analytics over Kafka data, rather than having Presto just consume and parse the data (or needing a separate Postgres database only for analytical purposes).
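For the Kafka Connect route, a JDBC sink could be configured roughly like this (a sketch based on the Confluent JDBC sink connector; the topic names, table handling, and connection details are assumptions):

name=items-postgres-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=item_created,item_deleted
connection.url=jdbc:postgresql://postgres:5432/analytics
connection.user=analytics
connection.password=changeme
insert.mode=upsert
pk.mode=record_value
pk.fields=id
auto.create=true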
assume something like 1,000-10,000 of these events per day
Should be fine. I've worked with systems that did millions of events, but were mostly written to Hadoop or S3 rather than directly into a database, which you could also have Presto query.
I'm using dsbulk 1.6.0 to unload data from Cassandra 3.11.3.
Each unload results in wildly different row counts. Here are the results from three invocations of unload on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to and data is never deleted, so a decrease in unloaded rows should not occur. There are three Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
total | failed | rows/s | p50ms | p99ms | p999ms
10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
total | failed | rows/s | p50ms | p99ms | p999ms
60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
total | failed | rows/s | p50ms | p99ms | p999ms
45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if that host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host that you provide is just a contact point; after that the cluster topology is discovered, and DSBulk selects replicas based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command line option (doc). You can compare results using LOCAL_QUORUM or ALL; in these modes Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the repair writes.
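For example, reusing the invocation from the question with a stronger read consistency:

dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE -cl LOCAL_QUORUM > $DATA_FILE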
Could you advise? Even though my query does not have the IF EXISTS or IF NOT EXISTS option, the query tracing result shows consistency_level 'QUORUM', which is what we wanted, but it also shows 'serial_consistency_level': 'SERIAL'. What is this behavior?
session_id | client | command | coordinator | duration | parameters
--------------------------------------+------------+---------+-------------+----------+------------
278a2000-3dfb-11e9-b459-9775e6c46fc6 | 10.244.*.* | QUERY | 10.244.*.* | 3338 |
`{'bound_var_0_stream_id': '''3c17230d-ea24-4ff7-9599-352fef883b31''',
'bound_var_1_property_name': '''Location:rxRSSI''',
'bound_var_2_shard_date': '2019-03-03T00:00:00.000Z',
'bound_var_3_time': '2019-03-03T21:27:30.749Z',
'bound_var_4_source_id': '''fe30653c-467f-401a-9646-67b10378e1c9''',
'bound_var_5_time_lag': '1328',
'bound_var_6_property_class': '''java.lang.Integer''',
'bound_var_7_property_type': '''ByteType''',
'bound_var_8_property_value': '''-44''',
'consistency_level': 'LOCAL_QUORUM',
'page_size': '5000',
'query': 'INSERT INTO "cloudleaf"."stream_48" ("stream_id", "property_name", "shard_date", "time", "source_id", "time_lag", "property_class", "property_type", "property_value")
VALUES (:"stream_id", :"property_name", :"shard_date", :"time", :"source_id", :"time_lag", :"property_class", :"property_type", :"property_value")
USING TTL 432000',
'serial_consistency_level': 'SERIAL'}
IF EXISTS and IF NOT EXISTS trigger a lightweight transaction, which can have one of two serial consistency levels, SERIAL or LOCAL_SERIAL. Those are defined as follows:
SERIAL:
Achieves linearizable consistency for lightweight transactions by preventing unconditional updates. This consistency level is only for use with lightweight transactions. Equivalent to QUORUM.
LOCAL_SERIAL:
Same as SERIAL, but confined to the local datacenter: a conditional write must be written to the commit log and memtable on a quorum of replica nodes in the same datacenter. Used to maintain consistency locally (within a single datacenter). Equivalent to LOCAL_QUORUM.
see: https://docs.datastax.com/en/cassandra/3.0/cassandra/dml/dmlConfigSerialConsistency.html
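In other words, as the trace in the question shows, the serial consistency level is recorded with every statement's parameters, but it only takes effect when the statement is conditional. A short cqlsh sketch (the keyspace and table names are made up):

-- regular consistency, used by every write
CONSISTENCY QUORUM;
-- only exercised when a statement is conditional
SERIAL CONSISTENCY LOCAL_SERIAL;

-- plain insert: no lightweight transaction involved
INSERT INTO ks.t (id, val) VALUES (1, 'a');

-- conditional insert: triggers a lightweight transaction (Paxos);
-- the serial consistency level applies to its Paxos phase
INSERT INTO ks.t (id, val) VALUES (1, 'a') IF NOT EXISTS;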
My setup is PostgreSQL-BDR on 4 servers with the same configuration.
After network problems (e.g. connection lost for some minutes), some of the nodes start replicating again within seconds, but other nodes start replicating only after 2 hours.
I couldn't find any configuration switch to set the timing of the replication.
I see the following lines when I am monitoring the replication slots:
slot_name | database | active | retained_bytes
bdr_16385_6255603470654648304_1_16385__ | mvcn | t | 56
bdr_16385_6255603530602290326_1_16385__ | mvcn | f | 17640
bdr_16385_6255603501002479656_1_16385__ | mvcn | f | 17640
Any idea why this is happening?
The problem was that the default tcp_keepalive_time is 7200 seconds, which is exactly 2 hours, so changing the value of /proc/sys/net/ipv4/tcp_keepalive_time solved the problem.
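A minimal sketch of lowering it (the value of 60 seconds is just an example; pick whatever suits your network):

# takes effect immediately, lost on reboot
sysctl -w net.ipv4.tcp_keepalive_time=60

# persist across reboots
echo "net.ipv4.tcp_keepalive_time = 60" >> /etc/sysctl.conf
sysctl -p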
I have been stuck with this issue for almost a week now and would like your suggestions and help. I am seeing read latency problems even for a simple table. I created a simple table with 4k rows: when I read 500 rows it fetches in 5 ms, but if I increase it to 1000 rows it takes ~10 ms, and reading all 4k rows takes around 50 ms. I tried checking stats, network, iostat, tpstats, and heap, but couldn't get a clue about what the issue is. Could anyone help me with what more I need to do to resolve this high-priority issue assigned to me? Thank you very much in advance.
Tracing session: b4287090-0ea5-11e5-a9f9-bbcaf44e5ebc
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
Execute CQL3 query | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 0
Parsing select * from location_eligibility_by_type12; [SharedPool-Worker-1] | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 33
Preparing statement [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 62
Computing ranges to query [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 101
Submitting range requests on 1537 ranges with a concurrency of 1537 (1235.85 rows per range expected) [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 314
Submitted 1 concurrent range requests covering 1537 ranges [SharedPool-Worker-1] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 6960
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-2] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 7033
Read 4007 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-09 07:47:36.045000 | 10.65.133.202 | 84055
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-09 07:47:36.046000 | 10.65.133.202 | 84109
Request complete | 2015-06-09 07:47:36.052498 | 10.65.133.202 | 91498
Selecting lots of rows in Cassandra often takes unpredictably long since the query will be routed to more machines.
It's best to avoid such schemas if you need high read performance. A better approach is to store data in a single row and spread the load between nodes by having a higher replication factor. Wide rows are generally preferable: http://www.slideshare.net/planetcassandra/cassandra-summit-2014-39677149
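As a sketch of the wide-row idea (the table and column names here are made up, not the schema from the question): keeping the rows you read together under one partition key turns the read into a single-partition query instead of a cluster-wide range scan like the 1537-range scan in the trace above.

-- all entries for one type live in a single partition, clustered by location
CREATE TABLE location_eligibility_by_type (
    item_type text,
    location  text,
    payload   text,
    PRIMARY KEY ((item_type), location)
);

-- hits one set of replicas instead of every token range:
SELECT * FROM location_eligibility_by_type WHERE item_type = 'type12';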