I'm using DSBulk 1.6.0 to unload data from Cassandra 3.11.3.
Each unload results in wildly different row counts. Here are the results from three invocations of unload, on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to and data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster with a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of newly added rows would be in the hundreds (if there were any), not in the tens of thousands.
Run 1:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 10,937 | 7 | 97 | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 60,558 | 3 | 266 | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
│ total | failed | rows/s | p50ms | p99ms | p999ms
│ 45,404 | 4 | 211 | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data, Run 2 may be closer to complete, and Run 3 is missing a significant amount of data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if that host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. And because DSBulk reads by default with consistency level LOCAL_ONE, different hosts can provide different views (the host that you provide is just a contact point; after that the cluster topology is discovered, and DSBulk selects replicas based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command line option (doc). You can compare the results using LOCAL_QUORUM or ALL; at these levels Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the resulting repair writes.
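For example, a minimal sketch reusing the invocation from the question, just with the consistency level raised (LOCAL_QUORUM shown; ALL works the same way):
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE -cl LOCAL_QUORUM > $DATA_FILE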
Related
I am trying to size my AKS clusters. What I understood and followed is that the number of microservices and their replica counts are the primary parameters. The resource usage of each microservice, and the predicted growth of that usage over the coming years, also need to be considered. But all this information seems too scattered to arrive at a number for AKS sizing. By sizing I mean: how many nodes should be assigned, what the configuration of those nodes should be, how many pods should be planned for, how many IP addresses should be reserved based on the number of pods, and so on.
Is there any standard matrix or practical calculation to compute AKS cluster sizing, based on anyone's experience?
No, I'm pretty sure there is none (and how could there be?). Just take your pod CPU/memory usage and sum it up; that gives you an estimate of the resources needed to run your workloads, and then add the Kubernetes system services on top of that.
Also, as Peter mentions in his comment, you can always scale your cluster, so such detailed up-front planning seems a bit unreasonable.
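A quick way to get those per-pod numbers from an existing cluster is kubectl top (a hedged example; it assumes the metrics-server add-on is running, which AKS normally deploys by default):
# per-pod CPU/memory usage across all namespaces
kubectl top pods --all-namespaces
# per-node usage, to compare against node capacity
kubectl top nodes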
Actually, you may be interested in the sizing of your nodes: things like memory, CPU, networking, and disk are directly linked to the node size you choose. For example:
Not all memory and CPU in a node can be used to run pods. The resources are partitioned four ways:
Memory and CPU reserved for the operating system and system daemons such as SSH
Memory and CPU reserved for the kubelet and Kubernetes agents such as the CRI
Memory reserved for the hard eviction threshold
Memory and CPU available to pods
Memory available to pods:
_____________________________________
 Memory (GB) | % available to pods
_____________________________________
 1           | 0.00%
 2           | 32.50%
 4           | 53.75%
 8           | 66.88%
 16          | 78.44%
 64          | 90.11%
 128         | 92.05%
 192         | 93.54%
 256         | 94.65%

CPU available to pods:
_____________________________________
 CPU (cores) | % available to pods
_____________________________________
 1           | 84.00%
 2           | 90.00%
 4           | 94.00%
 8           | 96.50%
 16          | 97.75%
 32          | 98.38%
 64          | 98.69%
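On a running cluster you can see this split directly by comparing each node's reported capacity against its allocatable resources (a hedged example; the column names are arbitrary):
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_CAP:.status.capacity.memory,MEM_ALLOC:.status.allocatable.memory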
Other things to consider are disk and networking. For example:
Node Size       | Maximum Disks | Maximum Disk IOPS | Maximum Throughput (MBps)
_______________________________________________________________________________
Standard_DS2_v2 | 8             | 6,400             | 96
Standard_B2ms   | 4             | 1,920             | 22.5
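The per-size limits such as cores, memory, and maximum data disk count can be listed for your region with the Azure CLI (a hedged example; the region is a placeholder, and the IOPS/throughput caps still come from the VM size documentation):
az vm list-sizes --location eastus --output table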
I have roughly 100 GB of data that I'm trying to process. The data has the form:
| timestamp | social_profile_id | json_payload |
|-----------|-------------------|------------------|
| 123 | 1 | {"json":"stuff"} |
| 124 | 2 | {"json":"stuff"} |
| 125 | 3 | {"json":"stuff"} |
I'm trying to split this data frame into folders in S3 by social_profile_id. There are roughly 430,000 social_profile_ids.
I've loaded the data into a Dataset with no problem. However, when I write it out and try to partition it, it takes forever! Here's what I've tried:
messagesDS
  .write
  .partitionBy("socialProfileId")
  .mode(sparkSaveMode)
  .parquet(outputPath) // terminal write call not shown in the original; outputPath is the S3 prefix
I don't really care how many files are in each folder at the end of the job. My theory is that each node can group by the social_profile_id, then write out to its respective folder without having to do a shuffle or communicate with other nodes. But this isn't happening as evidenced by the long job time. Ideally the end result would look a little something like this:
├── social_id_1 (only two partitions had id_1 data)
| ├── partition1_data.parquet
| └── partition3_data.parquet
├── social_id_2 (more partitions have this data in it)
| ├── partition3_data.parquet
| └── partition4_data.parquet
| ├── etc.
├── social_id_3
| ├── partition2_data.parquet
| └── partition4_data.parquet
| ├── etc.
├── etc.
I've tried increasing the compute resources a few times, both by increasing instance sizes and the number of instances. What I've been able to see from the Spark UI is that the majority of the time is being taken by the write operation. It seems that all of the executors are being used, but they take an absurdly long time to execute (like taking 3-5 hours to write ~150 MB). Any help would be appreciated! Sorry if I mixed up some of the Spark terminology.
In our production cluster, the cluster write latency frequently spikes from 7 ms to 4 seconds. Due to this, clients face a lot of read and write timeouts. This repeats every few hours.
Observation:
Cluster Write latency (99th percentile) - 4Sec
Local Write latency (99th percentile) - 10ms
Read & Write consistency - local_one
Total nodes - 7
I enabled tracing using settraceprobability for a few minutes and observed that most of the time is taken in internode communication:
session_id | event_id | activity | source | source_elapsed | thread
--------------------------------------+--------------------------------------+-----------------------------------------------------------------------------------------------------------------------------+---------------+----------------+------------------------------------------
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c3314-bb79-11e8-aeca-439c84a4762c | Parsing SELECT * FROM table1 WHERE uaid = '506a5f3b' AND messageid >= '01;' | cassandranode3 | 7 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a20-bb79-11e8-aeca-439c84a4762c | Preparing statement | Cassandranode3 | 47 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 429c5a21-bb79-11e8-aeca-439c84a4762c | reading data from /Cassandranode1 | Cassandranode3 | 121 | SharedPool-Worker-47
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38610-bb79-11e8-aeca-439c84a4762c | REQUEST_RESPONSE message received from /cassandranode1 | cassandranode3 | 40614 | MessagingService-Incoming-/Cassandranode1
4267dca2-bb79-11e8-aeca-439c84a4762c | 42a38611-bb79-11e8-aeca-439c84a4762c | Processing response from /Cassandranode1 | Cassandranode3 | 40626 | SharedPool-Worker-5
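For reference, the tracing mentioned above is toggled with nodetool settraceprobability; a minimal sketch (the probability value is an assumption, keep it low on a busy cluster):
nodetool settraceprobability 0.001
# let the workload run for a few minutes, then turn tracing back off
nodetool settraceprobability 0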
I tried checking the connectivity between Cassandra nodes but did not see any issues. Cassandra logs are flooded with Read timeout exceptions as this is a pretty busy cluster with 30k reads/sec and 10k writes/sec.
Warning in the system.log:
WARN [SharedPool-Worker-28] 2018-09-19 01:39:16,999 SliceQueryFilter.java:320 - Read 122 live and 266 tombstone cells in system.schema_columns for key: system (see tombstone_warn_threshold). 2147483593 columns were requested, slices=[-]
During the spike the cluster just stalls, and even simple commands like "use system_traces" fail.
cassandra#cqlsh:system_traces> select * from sessions ;
Warning: schema version mismatch detected, which might be caused by DOWN nodes; if this is not the case, check the schema versions of your nodes in system.local and system.peers.
Schema metadata was not refreshed. See log for details.
I validated the schema versions on all nodes and they are the same, but it looks like during the issue window Cassandra is not even able to read the metadata.
Has anyone faced similar issues? Any suggestions?
(Based on the data in your comments above.) The long full GC pauses can definitely cause this. Add -XX:+DisableExplicitGC: you are getting full GCs because of calls to System.gc(), which most likely come from RMI's distributed GC, which runs at regular intervals regardless of whether it's needed. With a larger heap that is VERY expensive. It is safe to disable.
Also check your GC log header and make sure the minimum heap size is not set. I would recommend setting -XX:G1ReservePercent=20.
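On Cassandra 3.x both flags can be added to conf/jvm.options, one per line (a sketch assuming the packaged jvm.options file; each node needs a restart to pick them up):
# conf/jvm.options
-XX:+DisableExplicitGC
-XX:G1ReservePercent=20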
My setup is PostgreSQL-BDR on 4 servers with the same configuration.
After network problems (e.g. the connection being lost for some minutes), some of the nodes start to replicate again within seconds, but other nodes start to replicate only after 2 hours.
I couldn't find any configuration switch that controls the timing of the replication.
I see the following when I am monitoring replication slots:
slot_name | database | active | retained_bytes
bdr_16385_6255603470654648304_1_16385__ | mvcn | t | 56
bdr_16385_6255603530602290326_1_16385__ | mvcn | f | 17640
bdr_16385_6255603501002479656_1_16385__ | mvcn | f | 17640
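For reference, output like the above can be pulled from pg_replication_slots; a hedged sketch of one such monitoring query (it assumes the pre-9.6 pg_xlog_* function names used by BDR-era PostgreSQL and derives retained bytes from the slot's restart_lsn):
psql -d mvcn -c "SELECT slot_name, database, active, pg_xlog_location_diff(pg_current_xlog_insert_location(), restart_lsn) AS retained_bytes FROM pg_replication_slots;"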
Any idea why this is happening?
The problem was that the default tcp_keepalive_time is 7200 seconds, which is exactly 2 hours, so changing the value of /proc/sys/net/ipv4/tcp_keepalive_time solved the problem.
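A minimal sketch of the change (the 60-second value is only an example; pick whatever suits your network):
# check the current value (seconds)
cat /proc/sys/net/ipv4/tcp_keepalive_time
# apply a lower value immediately
sysctl -w net.ipv4.tcp_keepalive_time=60
# persist it across reboots
echo "net.ipv4.tcp_keepalive_time = 60" >> /etc/sysctl.conf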
I have a 6-node cluster and I set the replication factor to 3 and 3 for the two data centers. I inserted a single row as a test, and later more rows. As the total replication factor is 6, I want to check whether the data is written to all nodes. How can I check individually whether the data is present on a given node? The worst option I have is shutting down the remaining 5 nodes, running the SELECT against the one remaining node, and repeating the same on all nodes. Is there a better way to check this? Thanks for your time.
You can use nodetool getendpoints for this. Here is a sample table to keep track of Blade Runners.
aploetz#cqlsh:stackoverflow> SELECT * FROM bladerunners;
id | type | datetime | data
--------+--------------+--------------------------+---------------------------------------------
B25881 | Blade Runner | 2015-02-16 18:00:03+0000 | Holden- Fine as long as nobody unplugs him.
B26354 | Blade Runner | 2015-02-16 18:00:03+0000 | Deckard- Filed and monitored.
(2 rows)
Now if I exit back out to my command prompt, I can use nodetool getendpoints, followed by my keyspace, table, and a partition key value. The list of nodes containing data for that key should be displayed underneath:
aploetz#dockingBay94:~$ nodetool getendpoints stackoverflow bladerunners 'B26354'
127.0.0.1
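If you also want to see whether each of those endpoints is up and which data center/rack it belongs to, nodetool status accepts the keyspace as an optional argument (same keyspace as above; output omitted here):
aploetz#dockingBay94:~$ nodetool status stackoverflow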