PostgreSQL-BDR: Some nodes resume replication only 2 hours after network problems

My setup is PostgreSQL-BDR on 4 servers with the same configuration.
After network problems (e.g. the connection is lost for a few minutes), some of the nodes resume replication within seconds, but other nodes resume only after 2 hours.
I couldn't find any configuration switch that sets the timing of the replication.
I see the following lines when I monitor the replication slots:
slot_name | database | active | retained_bytes
bdr_16385_6255603470654648304_1_16385__ | mvcn | t | 56
bdr_16385_6255603530602290326_1_16385__ | mvcn | f | 17640
bdr_16385_6255603501002479656_1_16385__ | mvcn | f | 17640
Any idea why this is happening?

The problem was that the default tcp_keepalive_time is 7200 seconds, which is exactly 2 hours, so changing the value of /proc/sys/net/ipv4/tcp_keepalive_time solved the problem.
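To make the 2-hour figure concrete, here is a rough sketch of how the three Linux keepalive settings combine into the time needed to notice a dead idle connection. The function name and the "tuned" values are illustrative assumptions, not settings taken from the answer; the defaults (7200, 75, 9) are the stock Linux values for tcp_keepalive_time, tcp_keepalive_intvl and tcp_keepalive_probes.

```python
def dead_peer_detection_seconds(keepalive_time, keepalive_intvl, keepalive_probes):
    """Approximate worst-case seconds before an idle, silently broken
    TCP connection is declared dead by the kernel.

    The first probe is sent after `keepalive_time` seconds of idleness,
    then `keepalive_probes` unanswered probes are sent `keepalive_intvl`
    seconds apart before the connection is dropped.
    """
    return keepalive_time + keepalive_intvl * keepalive_probes


# Stock Linux defaults: the first probe alone waits 2 hours.
linux_default = dead_peer_detection_seconds(7200, 75, 9)

# Hypothetical tuned values, chosen only for illustration.
tuned = dead_peer_detection_seconds(60, 10, 6)

print(f"default: {linux_default} s, tuned: {tuned} s")
```

With the defaults, a replication connection that died without a TCP reset can stay "open" for a little over 2 hours before the node reconnects, which matches the behaviour described in the question.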

dsbulk unload missing data

I'm using dsbulk 1.6.0 to unload data from Cassandra 3.11.3.
Each unload results in wildly different row counts. Here are the results from 3 invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to; data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra databases in the cluster, and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these were executed in quick succession, so the number of added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. Because DSBulk reads with consistency level LOCAL_ONE by default, different hosts will provide different views (the host that you're providing is just a contact point; after that the cluster topology is discovered, and DSBulk selects a replica based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command line option (doc). You can compare the results when using LOCAL_QUORUM or ALL. In these modes Cassandra will also "fix" inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the repair writes.
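As a minimal sketch of the suggestion above, the following builds the same unload command as in the question with the -cl option appended. The helper name, host, keyspace and table values are placeholders made up for the example; only the flags -h, -k, -t and -cl come from the question and answer.

```python
import shlex


def build_unload_command(host, keyspace, table, consistency="LOCAL_QUORUM"):
    """Return the dsbulk unload argument list with an explicit consistency level."""
    return [
        "dsbulk", "unload",
        "-h", host,
        "-k", keyspace,
        "-t", table,
        "-cl", consistency,  # read consistency level, as suggested in the answer
    ]


# Placeholder connection details, for illustration only.
command = build_unload_command("10.0.0.1", "my_keyspace", "my_table")
print(shlex.join(command))
```

The resulting list can be passed to subprocess.run with stdout redirected to the data file. Running it once with LOCAL_QUORUM and once with ALL and comparing row counts is one way to check whether the differences come from inconsistent replicas.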

Sizing Azure Kubernetes Services (AKS) Cluster

I am trying to size my AKS clusters. As I understand it, the number of microservices and their replica counts are the primary parameters. The resource usage of each microservice, and the predicted growth of that usage over the coming years, also needs to be considered. But all this information seems too scattered to arrive at a number for AKS sizing. By sizing I mean: how many nodes should be assigned, what should the node configuration be, how many pods should be planned for, how many IP addresses should be reserved based on the number of pods, and so on.
Is there a standard matrix or a practical way to calculate AKS cluster sizing, based on anyone's experience?
No, I'm pretty sure there is none (and how could there be?). Just take your pods' CPU/memory usage and sum it up; that gives you an estimate of the resources needed to run your workloads. Add the Kubernetes system services on top of that.
Also, as Peter mentions in his comment, you can always scale your cluster, so such planning seems a bit unreasonable.
Actually, you may be interested in the sizing of your nodes: memory, CPU, networking and disk are directly tied to the node size you choose. For example:
Not all memory and CPU in a node can be used to run Pods. The resources are partitioned into four parts:
Memory and CPU reserved to the operating system and system daemons such as SSH
Memory and CPU reserved to the Kubelet and Kubernetes agents such as the CRI
Memory reserved for the hard eviction threshold
Memory and CPU available to Pods
CPU and memory available to Pods
_____________________________________________________
Memory (GB) | % Available | CPU (cores) | % Available
_____________________________________________________
1           | 0.00%       | 1           | 84.00%
2           | 32.50%      | 2           | 90.00%
4           | 53.75%      | 4           | 94.00%
8           | 66.88%      | 8           | 96.50%
16          | 78.44%      | 16          | 97.75%
64          | 90.11%      | 32          | 98.38%
128         | 92.05%      | 64          | 98.69%
192         | 93.54%      |             |
256         | 94.65%      |             |
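A rough way to combine "sum up your pod usage" with the availability percentages above is sketched below. The function name and the example workload (20 pods of 1.5 GB and 0.5 CPU each) are assumptions made for illustration, as is reading the table's memory column as GB and its CPU column as cores.

```python
import math


def nodes_needed(total_pod_memory_gb, total_pod_cpu,
                 node_memory_gb, node_cpu,
                 memory_available, cpu_available):
    """Estimate the node count from summed pod requests.

    memory_available / cpu_available are the fractions of a node that
    Pods can actually use (the "% Available" columns of the table).
    """
    allocatable_memory = node_memory_gb * memory_available
    allocatable_cpu = node_cpu * cpu_available
    by_memory = math.ceil(total_pod_memory_gb / allocatable_memory)
    by_cpu = math.ceil(total_pod_cpu / allocatable_cpu)
    # Whichever resource runs out first decides the node count.
    return max(by_memory, by_cpu)


# Hypothetical workload: 20 pods, each requesting 1.5 GB and 0.5 CPU.
pods = 20
estimate = nodes_needed(
    total_pod_memory_gb=pods * 1.5,
    total_pod_cpu=pods * 0.5,
    node_memory_gb=8, node_cpu=2,                # one row of the table
    memory_available=0.6688, cpu_available=0.90,
)
print(estimate)
```

This ignores headroom for rolling updates, node failures and system pods, so the result is better treated as a lower bound than as a final answer.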
Other factors are disk and networking, for example:
Node Size | Maximum Disks| Maximum Disk IOPS | Maximum Throughput (MBps)
_______________________________________________________________________________
Standard_DS2_v2 | 8 | 6,400 | 96
Standard_B2ms | 4 | 1,920 | 22.5

Frequent Spikes in Cassandra write latency

In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 s. As a result, clients face a lot of read and write timeouts. This repeats every few hours.
Observation:
Cluster Write latency (99th percentile) - 4Sec
Local Write latency (99th percentile) - 10ms
Read & Write consistency - local_one
Total nodes - 7
I enabled tracing with settraceprobability for a few minutes and observed that most of the time is spent in internode communication:
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
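One way to read a trace like this is to look at the gap between consecutive source_elapsed values (microseconds) from the same source node. The sketch below does that for the five rows above; the function name is made up for the example and the activity strings are shortened copies of the trace rows.

```python
def slowest_step(events):
    """Return (activity, delta_us) for the step with the largest gap.

    `events` is a list of (activity, source_elapsed_us) tuples in trace
    order, all reported by the same source node.
    """
    previous = 0
    worst = None
    for activity, elapsed in events:
        delta = elapsed - previous
        if worst is None or delta > worst[1]:
            worst = (activity, delta)
        previous = elapsed
    return worst


# source_elapsed values copied from the trace (all from cassandranode3).
trace = [
    ("Parsing SELECT", 7),
    ("Preparing statement", 47),
    ("reading data from /Cassandranode1", 121),
    ("REQUEST_RESPONSE message received from /cassandranode1", 40614),
    ("Processing response from /Cassandranode1", 40626),
]

activity, delta_us = slowest_step(trace)
print(f"{activity}: {delta_us / 1000:.1f} ms")
```

Almost all of the roughly 40 ms in this trace passes between sending the read to the replica and receiving its response. That gap covers both network time and time the remote node spent before answering, so on its own it does not distinguish a network problem from a stalled replica.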
I tried checking the connectivity between Cassandra nodes but did not see any issues. Cassandra logs are flooded with Read timeout exceptions as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue Cassandra is not even able to read the metadata.
Has anyone faced similar issues? Any suggestions?
(Based on the data in your comments above.) The long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from the RMI distributed GC (DGC), which runs at regular intervals regardless of whether it is needed. With the larger heap that is VERY expensive. It is safe to disable.
Check your GC log header and make sure the min heap size is not set. I would recommend setting -XX:G1ReservePercent=20.

Cassandra isolation model

I have a use case for Cassandra where I need to store multiple rows of data that belong to different customers. I'm new to Cassandra, and I need to provide a permissions model where a base role can access only one customer at a time but a 'supervisor' role can access all of them. Essentially, every time a query is made, one customer must not be able to see another customer's data, unless the query is made by a supervisor. We have to enforce security by design.
The data could look like this:
-----------------------------------------
| id | customer name | data column1... |
-----------------------------------------
| 0 | customer1 | 3 |
-----------------------------------------
| 1 | customer2 | 23 |
-----------------------------------------
| 2 | customer3 | 33 |
-----------------------------------------
| 3 | customer3 | 32 |
-----------------------------------------
Is something like this easily doable with Cassandra?
The way you have modeled this is a perfectly good way to do multi-tenancy. This is how UserGrid models multiple tenants, and it is used in several large-scale applications.
A couple of drawbacks to be up-front about:
Doesn't help with a "noisy neighbor" problem and unequal tenants
Application code has to manage the tenant security
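To illustrate what "application code has to manage the tenant security" can look like, here is a minimal sketch of a query builder that always scopes non-supervisor queries to one customer. The table name customer_data, the column name customer_name, the role names and the %s placeholder style are all assumptions for the example; it also assumes customer_name is the partition key, so the WHERE clause is a valid Cassandra query.

```python
def scoped_query(role, customer_name=None):
    """Return (cql, params) restricted according to the caller's role."""
    base = "SELECT * FROM customer_data"  # hypothetical table name
    if role == "supervisor":
        # Supervisors may read every tenant's rows.
        return base, ()
    if role == "customer" and customer_name:
        # Bind the tenant as a parameter instead of formatting it into the string.
        return base + " WHERE customer_name = %s", (customer_name,)
    raise PermissionError("non-supervisor queries must name exactly one customer")


print(scoped_query("customer", "customer1"))
print(scoped_query("supervisor"))
```

Routing every read through a single function like this keeps the tenant filter in one place, which is easier to audit than filters scattered across the code base.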

Read time increases by 1ms for every 100 rows

I have been stuck on this issue for almost a week now and would appreciate your suggestions and help. I am getting read latency problems even for a simple table. I created a simple table with 4k rows: reading 500 rows takes 5 ms, 1000 rows takes ~10 ms, and all 4k rows takes around 50 ms. I tried checking stats, network, iostat, tpstats and heap, but couldn't get a clue as to what the issue is. Could anyone tell me what more I need to do to resolve this high-priority issue? Thank you very much in advance.
Tracing session: b4287090-0ea5-11e5-a9f9-bbcaf44e5ebc
activity | timestamp | source | source_elapsed
-----------------------------------------------------------------------------------------------------------------------------+----------------------------+---------------+----------------
Execute CQL3 query | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 0
Parsing select * from location_eligibility_by_type12; [SharedPool-Worker-1] | 2015-06-09 07:47:35.961000 | 10.65.133.202 | 33
Preparing statement [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 62
Computing ranges to query [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 101
Submitting range requests on 1537 ranges with a concurrency of 1537 (1235.85 rows per range expected) [SharedPool-Worker-1] | 2015-06-09 07:47:35.962000 | 10.65.133.202 | 314
Submitted 1 concurrent range requests covering 1537 ranges [SharedPool-Worker-1] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 6960
Executing seq scan across 1 sstables for [min(-9223372036854775808), min(-9223372036854775808)] [SharedPool-Worker-2] | 2015-06-09 07:47:35.968000 | 10.65.133.202 | 7033
Read 4007 live and 0 tombstoned cells [SharedPool-Worker-2] | 2015-06-09 07:47:36.045000 | 10.65.133.202 | 84055
Scanned 1 rows and matched 1 [SharedPool-Worker-2] | 2015-06-09 07:47:36.046000 | 10.65.133.202 | 84109
Request complete | 2015-06-09 07:47:36.052498 | 10.65.133.202 | 91498
Selecting lots of rows in Cassandra often takes unpredictably long since the query will be routed to more machines.
It's best to avoid such schemas if you need high read performance. A better approach is to store data in a single row and spread the load between nodes by having a higher replication factor. Wide rows are generally preferable: http://www.slideshare.net/planetcassandra/cassandra-summit-2014-39677149
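For reference, the timings quoted in the question work out to a fairly constant cost per row, which is what the "1ms for every 100 rows" in the title describes. The sketch below only redoes that arithmetic; the function name is made up for the example and the numbers are the approximate ones from the question.

```python
def microseconds_per_row(rows, millis):
    """Average read cost per row, in microseconds."""
    return millis * 1000 / rows


# (rows read, elapsed ms) as reported in the question.
measurements = [(500, 5.0), (1000, 10.0), (4000, 50.0)]

for rows, millis in measurements:
    print(f"{rows} rows: {microseconds_per_row(rows, millis):.1f} us/row")
```

The per-row cost stays in the same range (about 10 to 12.5 microseconds) across all three sizes, so the total time grows roughly linearly with the number of rows returned rather than showing a sudden slowdown at some size.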
