DataStax DSBulk utility giving errors when loading CSV data to Astra - Cassandra

I am migrating data from EC2 Cassandra Nodes to DataStax Astra (Premium Account) using DSBulk utility.
Command used:
dsbulk load -url folder_created_during_unload -header true -k keyspace -t table -b "secure-connect-file.zip" -u username -p password
This command gives an error after a few seconds. On checking the documentation, I found that I can add --executor.maxPerSecond to this command to throttle the load.
After adding this setting, the load command executed without any error, but if I enter a value over 15,000, the load command starts giving the error again.
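For reference, the throttled command looks like this (15,000 being the highest rate value that worked for me):
dsbulk load -url folder_created_during_unload -header true -k keyspace -t table -b "secure-connect-file.zip" -u username -p password --executor.maxPerSecond 15000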
Now, if a table has over 100M entries and only 15,000 entries are migrated every second, it would take the better part of two hours to migrate just that one table, and the complete database would take several days.
I want to understand what is causing this error and whether there is a way to load the data at a higher speed.

What's happening here is that DSBulk is running into the rate limit on the database. At the moment, it looks like the only way to increase that rate limit is to submit a ticket to support.
To submit a ticket, look for the "Other Resources" section in the Astra Dashboard's left nav and click "Get Support" at the bottom.
When the "Help Center" pops up, click "Create Request" in the lower right corner.
On the next page, click the green/cyan "Submit a Ticket" button in the upper right corner. Describe the problem you're having (rate limit) along with what DSBulk outputs when set for more than 15k/sec.

To add to Aaron's response, you are hitting the default limit of 4K operations per second on your Astra DB.
We contacted you directly last week when we detected that you were hitting the limit but haven't heard back. I've reached out to you directly again today to let you know that I've logged a request on your behalf to increase the limit on your DB. Cheers!

Related

How to copy managed database?

AFAIK there is no REST API providing this functionality directly, so I am doing it with a restore via a Create request (there are other ways, but those don't guarantee transactional consistency and are more complicated).
Since it is not possible to turn off short-term backups (retention has to be at least 1 day), this should be reliable. I am using the current time for the 'properties.restorePointInTime' property in the request. This works fine for most databases, but one DB returns this error (from the async operation request):
"error": {
"code": "BackupSetNotFound",
"message": "No backups were found to restore the database to the point in time 6/14/2021 8:20:00 PM (UTC). Please contact support to restore the database."
}
I know I am not out of range, because if the restore time is before 'earliestRestorePoint' (which can be found with a GET request on the managed database) or in the future, I get a 'PitrPointInTimeInvalid' error instead. Nevertheless, I found some information saying that I shouldn't use the current time but rather the current time minus 6 minutes at most. This is also how the Azure Portal behaves (where the restore fails with the same error, by the way): it doesn't allow a time newer than the current time minus 6 minutes. After a few tries, I found that roughly the current time minus 40 minutes starts to work, but 40 minutes is a lot, and I didn't find any way to determine what time will work before I try it and wait for the result of the async operation.
My question is: Is there a way to find what is the latest time possible for restore?
Or is there a better way to make a 'copy' of a managed database which guarantees transactional consistency and is reasonably quick?
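For reference, a point-in-time-restore Create request has roughly this shape (resource IDs, region and api-version are placeholders, and the property names are my reading of the Managed Instance databases REST API, so double-check them against the reference):
PUT https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Sql/managedInstances/{instanceName}/databases/{newDatabaseName}?api-version={apiVersion}
{
  "location": "{region}",
  "properties": {
    "createMode": "PointInTimeRestore",
    "sourceDatabaseId": "/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Sql/managedInstances/{instanceName}/databases/{sourceDatabaseName}",
    "restorePointInTime": "2021-06-14T20:20:00Z"
  }
}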
EDIT:
The issue I was describing was reported to MS. It was occurring when:
there is a custom time zone format, e.g. UTC + 1 hour;
backups are skipped for the source database at the desired point in time because the database is inactive (no active transactions).
This should be fixed as of now (25 August 2021), and I was not able to reproduce it with the current time minus 10 minutes. I was also told there should be a new API which will allow making a copy without using PITR (no sooner than Q1 2022).
To answer your first question "Is there a way to find what is the latest time possible for restore?"
Yes, via SQL. The only way to find this out is by using extended event (XEvent) sessions to monitor backup activity.
The process to start logging the backup_restore_progress_trace extended event and report on it is described here: https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/backup-activity-monitor
Including the SQL here in case the link goes stale.
This is for storing in the ring buffer (max last 1000 records):
CREATE EVENT SESSION [Verbose backup trace] ON SERVER
ADD EVENT sqlserver.backup_restore_progress_trace(
WHERE (
[operation_type]=(0) AND (
[trace_message] like '%100 percent%' OR
[trace_message] like '%BACKUP DATABASE%' OR [trace_message] like '%BACKUP LOG%'))
)
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,
TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
ALTER EVENT SESSION [Verbose backup trace] ON SERVER
STATE = start;
Then to see output of all backup events:
WITH
a AS (SELECT xed = CAST(xet.target_data AS xml)
FROM sys.dm_xe_session_targets AS xet
JOIN sys.dm_xe_sessions AS xe
ON (xe.address = xet.event_session_address)
WHERE xe.name = 'Verbose backup trace'),
b AS(SELECT
d.n.value('(@timestamp)[1]', 'datetime2') AS [timestamp],
ISNULL(db.name, d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)')) AS database_name,
d.n.value('(data[@name="trace_message"]/value)[1]', 'varchar(4000)') AS trace_message
FROM a
CROSS APPLY xed.nodes('/RingBufferTarget/event') d(n)
LEFT JOIN master.sys.databases db
ON db.physical_database_name = d.n.value('(data[@name="database_name"]/value)[1]', 'varchar(200)'))
SELECT * FROM b
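If all you want from that output is a rough answer to "what is the latest time I can restore to", one option is to take the most recent completed-backup message per database, for example by replacing the final SELECT * FROM b with the following (a sketch; it leans on the '100 percent' trace message that the session above captures when a backup finishes):
SELECT database_name, MAX([timestamp]) AS last_backup_completed
FROM b
WHERE trace_message LIKE '%100 percent%'
GROUP BY database_name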
NOTE: This tip came to me via Microsoft support when I had the same issue of point-in-time restores failing at what seemed like random times. They do not give any SLA for log backups. I found that on a busy database the log backups seemed to happen every 5-10 minutes, but on a quiet database only hourly. Recovery of a database this way can be slow depending on the number of transaction logs and the amount of activity to replay, etc. (https://learn.microsoft.com/en-us/azure/azure-sql/database/recovery-using-backups)
To answer your second question: "Or is there a better way to do ‘copy’ of managed database which guarantees transactional consistency and is reasonably quick?"
I'd have to agree with Thomas - if you're after guaranteed transactional consistency and speed you need to look at creating a failover group https://learn.microsoft.com/en-us/azure/azure-sql/database/auto-failover-group-overview?tabs=azure-powershell#best-practices-for-sql-managed-instance and https://learn.microsoft.com/en-us/azure/azure-sql/managed-instance/failover-group-add-instance-tutorial?tabs=azure-portal
A failover group for a managed instance will have a primary server and a failover server, with the same user databases on each kept in sync.
But yes, whether this suits your needs depends on the question Thomas asked: what is the purpose of the copy?

Postgres CPU utilisation shot up. Any insights for my case?

My Postgres instance's CPU utilisation has shot up recently. I'm trying to identify the root cause. I will add the details below.
My postgres database instance running on GCP has the following configuration:
PostgreSQL 9.6
vCPUs-1
Memory-3.75 GB
SSD storage-15 GB
I'm running 5 databases on the above DB server, which are connected to a Node.js app.
I use Sequelize as my ORM and recently upgraded it from 4.6.x to 5.8.6.
Before this upgrade the CPU utilization would usually remain below 20 percent. But after the upgrade, I see a lot of fluctuation in the CPU utilization graph, and it hits 100 percent too often as well. Also, when it hits 100%, my services stop working as expected (because they can't interact with the DB).
I tried running this query:
SELECT "usesysid", "backend_start", "xact_start","query_start", "state_change", "state", "query" FROM pg_stat_activity ORDER BY "query_start" DESC
It returns a list of the current sessions, but I'm not sure if this info is enough for me to find out which query could be causing this issue.
I also ran this query:
SELECT max(now() - xact_start) FROM pg_stat_activity WHERE state IN ('idle in transaction', 'active');
and it returns max = 1 day 01:42:10.987635. I think this is alarming, but I don't know how to put this info to use.
Another thing which I think is worth mentioning: I have started using Sequelize's bulk update (bulkCreate with updateOnDuplicate).
Its syntax is something like this:
Model.bulkCreate(scalesToUpdate, {
updateOnDuplicate: [
'field1',
'field2'
],
})
And, this gets translated into SQL like below:
INSERT INTO "mymodel" ("id","field1","field2","field3","field4","field5","field6","field7") VALUES (') ON CONFLICT ("id") DO UPDATE SET "field3"=EXCLUDED."field3","field4"=EXCLUDED."field4","field6"=EXCLUDED."field6","field7"=EXCLUDED."field7"
And, this query gets fired 5 times per second. Could this be the culprit?
Any insight into this is highly appreciated.
You could try the following things:
Increase the machine type to have one more core (vCPUs = 2).
It might be that Sequelize 5.8.6 requires more resources than the old version. You could install a query monitoring extension such as pg_stat_statements, run the queries you typed, and review which query has the highest resource usage.
If that upsert query is running 5 times per second, it could well be consuming a lot of resources. Test it with one of those tools to get a better picture; see the sketch below.
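A couple of sketches for that (this assumes the pg_stat_statements extension can be enabled on the instance; the column names below are the PostgreSQL 9.6 ones):
-- top statements by total execution time (requires pg_stat_statements)
SELECT query, calls, total_time, rows
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 5;
-- sessions stuck "idle in transaction", like the 1-day-old transaction in the question;
-- once you are sure a session is safe to kill, SELECT pg_terminate_backend(pid) ends it
SELECT pid, state, now() - xact_start AS xact_age, query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_age DESC;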

CPU / DTUs getting maxed out on Azure SQL Database, but top queries less than 1% and database only a few MB

I just launched an Azure SQL Database, and the DTU and CPU usage is behaving strangely. The database is only receiving about 30 requests per minute, and the CPU/DTU will be extremely low for hours, and then jump up to 100% and stay there (with no increase in the number of requests that triggers this). When I click to view the top queries, none of them are above 1% cpu usage. I started out on a 5 DTU plan, and yesterday upgraded to 20 DTUs and the same behavior is occurring. Any idea what else might cause the DTU/CPU to get maxed out? See images below:
https://i.imgur.com/LdbYTPw.png
https://i.imgur.com/jlus3FM.png
Thanks in advance for any advice!
Joe
EDIT: I'm getting closer. I found these repeated entries in the error log (about 8-10 per second):
"The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request."
The thing is, the App Service that queries the database is only doing simple selects, updates, and inserts... none of which uses any complex WHERE IN statement. Furthermore, every query is wrapped in a try/catch block, and I'm never seeing an exception like this.
Where could these large queries be originating from?
You are only seeing the CPU component of the DTU graph; what about the "Data IO" and the "Log IO" components? Look at the top 5 queries in all three sections, and let me know if you find a query that starts with "SELECT Statman ...". If you see that, then the Auto Update Statistics process is creating those DTU spikes.
I would suggest installing the sp_whoisactive script so that you can see what's going on more easily:
http://whoisactive.com/
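If installing sp_whoisactive is not an option, a first look at what is currently burning CPU (including the full text of those parameter-heavy statements, if they are still running) can be had from the standard DMVs, something like:
-- currently executing requests, heaviest CPU first
SELECT TOP (10)
    r.session_id,
    r.status,
    r.cpu_time,
    r.total_elapsed_time,
    t.text AS sql_text
FROM sys.dm_exec_requests AS r
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
WHERE r.session_id <> @@SPID
ORDER BY r.cpu_time DESC;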

select count(*) runs into timeout issues in Cassandra

Maybe it is a stupid question, but I'm not able to determine the size of a table in Cassandra.
This is what I tried:
select count(*) from articles;
It works fine if the table is small, but once it fills up I always run into timeout issues:
cqlsh:
OperationTimedOut: errors={}, last_host=127.0.0.1
DBeaver:
Run 1: 225,000 (7477 ms)
Run 2: 233,637 (8265 ms)
Run 3: 216,595 (7269 ms)
I assume that it hits some timeout and just aborts. The actual number of entries in the table is probably much higher.
I'm testing against a local Cassandra instance which is completely idle. I would not mind if it has to do a full table scan and is unresponsive during that time.
Is there a way to reliably count the number of entries in a Cassandra table?
I'm using Cassandra 2.1.13.
Here is my current workaround:
COPY articles TO '/dev/null';
...
3568068 rows exported to 1 files in 2 minutes and 16.606 seconds.
Background: Cassandra supports exporting a table to a text file, for instance:
COPY articles TO '/tmp/data.csv';
Output: 3568068 rows exported to 1 files in 2 minutes and 25.559 seconds
That also matches the number of lines in the generated file:
$ wc -l /tmp/data.csv
3568068
As far as I can see, your problem is connected to the cqlsh timeout: OperationTimedOut: errors={}, last_host=127.0.0.1
You can simply increase it with these options:
--connect-timeout=CONNECT_TIMEOUT
Specify the connection timeout in seconds (default: 5
seconds).
--request-timeout=REQUEST_TIMEOUT
Specify the default request timeout in seconds
(default: 10 seconds).
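For example, to give a one-off count much more time (the keyspace name is a placeholder and the timeout values are arbitrary):
cqlsh --connect-timeout=60 --request-timeout=3600 -e "SELECT count(*) FROM my_keyspace.articles;"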
Is there a way to reliably count the number of entries in a Cassandra table?
The plain answer is no. This is not a Cassandra limitation but a hard challenge for distributed systems: counting unique items reliably.
That's the challenge that approximation algorithms like HyperLogLog address.
One possible solution is to use a counter in Cassandra to count the number of rows, but even counters can miscount in some corner cases, so you'll get a few percent of error.
This is a good utility for counting rows that avoids the timeout issues that happen when running a large COUNT(*) in Cassandra:
https://github.com/brianmhess/cassandra-count
The reason is simple:
When you're using:
SELECT count(*) FROM articles;
it has the same effect on the database as:
SELECT * FROM articles;
You have to query over all your nodes. Cassandra simply runs into a timeout.
You can change the timeout, but it isn't a good solution. (It's fine as a one-off, but don't use it in your regular queries.)
There's a better solution: make your client count your rows. You can create a Java app that counts your rows as you insert them and stores the result in a counter column in a Cassandra table.
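A minimal sketch of that counter approach in CQL (table and column names are made up for this example):
CREATE TABLE row_counts (table_name text PRIMARY KEY, row_count counter);
-- run alongside every insert into articles:
UPDATE row_counts SET row_count = row_count + 1 WHERE table_name = 'articles';
Reading the count back is then a cheap single-partition query: SELECT row_count FROM row_counts WHERE table_name = 'articles';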
You can use COPY to avoid the Cassandra timeout that usually happens on count(*).
Use this bash one-liner (it pulls the row count out of the COPY summary line):
cqlsh -e "copy keyspace.table_name (first_partition_key_name) to '/dev/null'" | sed -n 5p | sed 's/ .*//'

RPC timeout in cqlsh - Cassandra

I have 5 nodes in my ring with SimpleStrategy and replication_factor=3. I inserted 1M rows using the stress tool. When I try to read the row count in cqlsh using
SELECT count(*) FROM Keyspace1.Standard1 limit 1000000;
It fails with error:
Request did not complete within rpc_timeout.
It works for LIMIT 100000 but fails even for 500000.
All my nodes are up. Do I need to increase the rpc_timeout?
Please help.
You get this error because the request is timing out on the server side. One should know that this is a very expensive operation in Cassandra as others have pointed out.
Still, if you really want to do this, you should update your /etc/cassandra/cassandra.yaml file and change the range_request_timeout_in_ms parameter. This will apply to all of your range queries.
Example to set a 40 second timeout:
range_request_timeout_in_ms: 40000
You will probably have to adjust at the client side as well. When using cqlsh as a client this is accomplished by creating/updating your configuration file for cqlsh under ~/.cassandra/cqlshrc and add the client_timeout parameter to the connection section.
Example to set a 40 second timeout:
[connection]
client_timeout=40
It takes a long time to read in 1M rows, so that is probably why it is timing out. You shouldn't use count like this; it is very expensive since it has to read all the data. Use Cassandra counters if you need to count lots of items.
You should also check your Cassandra logs to confirm there aren't any other issues - sometimes exceptions in Cassandra lead to timeouts on the client.
If you can live with an approximate row count, take a look at this answer to Row count of a column family in Cassandra.
