Slow bulk insert in Spanner - google-cloud-spanner

Slow bulk insert in Spanner - google-cloud-spanner

I'm trying to load a dataset of 1TB in spanner and I can't get further than 5 MB/s with 3 nodes.
Main issue is that the dataset I want to load is mainly composed by integers + some nulls columns, so the commit size is too small, lower than 500K.
I've followed the rules for bulk insert defined at https://cloud.google.com/spanner/docs/bulk-loading. (partitions, workers, etc...) If I add a text column to make the commit size bigger (2.5MB), I can reach a throughput of 70MB/s.
However, with a dataset composed by integer and nulls I don't know how to make the importer faster than 5MB/s.
Table definition:
id INT64 NOT NULL,
date DATE NOT NULL,
type STRING(16) NOT NULL,
category STRING(3) NOT NULL,
quadkey STRING(18) NOT NULL,
subcategory STRING(2) NOT NULL,
txn INT64,
accouunts INT64,
acct_cnt FLOAT64,
avg_freq FLOAT64,
avg_spend_amt FLOAT64,
avg_ticket FLOAT64,
txn_amt FLOAT64,
txn_cnt FLOAT64,
) PRIMARY KEY (id)

To improve throughput, the recommendation is to commit 1 MB to 5 MB at a time: https://cloud.google.com/spanner/docs/bulk-loading#commit-size
Perhaps you could try batching more rows together in a single transaction.
Also see https://cloud.google.com/spanner/docs/bulk-loading#test-measure and https://medium.com/google-cloud/cloud-spanner-maximizing-data-load-throughput-23a0fc064b6d
Hope this helps!

There are many factors that contribute to a maximized throughput when loading data in Spanner.
1- The first step is to ensure that there isn’t any network capacity or added latencies due to the datasets being loaded into Spanner from an inefficient storage solution. Having your datasets in a Cloud Storage bucket in the same region as your Spanner instance and loading from there is a safe way to do so.
2- Then Spanner needs to partition the data by primary key, so that the splits are equally distributed and the nodes have equal workload to insure best performances. Spanner organizes rows lexicographically, so here are the recommendations on how to optimize the splits. This will avoid “hotspots”, where some nodes have higher workload than others. This can be assessed by the CPU utilization - high priority graph accessible via the console. Trying to get as close to 65% as possible for that metric (the recommended limit) is a good way to insure hotspots are not happening.
3- Having many splits can sometimes seem to increase efficiency, but this benefit is reduced by the coordination overhead created. This is where appropriate ordered batches will find the right balance to maximize throughput. This blog post covers the concept with a detailed example, I highly suggest you read it in full.
As mentioned in the documentation, it is possible to achieve bulk writes of 10-20 MB/s, but since the commits are not in the 1-5 MB at a time, this might not be possible even following the best practices. In this case, the number of operations/s, rows/s or requests/s might be more relevant, and 6MB/s could be in fact a good throughput considering the table schemas. These metrics can be accessed via the console or Stackdriver Monitoring for more advanced insights.

Related

Why does including partition key in WHERE clause to Cosmos SQL API query increase consumed RUs for some queries?

I would like to optimise my Azure Cosmos DB SQL API queries for consumed RUs (in part in order to reduce the frequency of 429 responses).
Specifically I thought that including the partition key in WHERE clauses would decrease consumed RUs (e.g. I read https://learn.microsoft.com/en-us/azure/cosmos-db/optimize-cost-queries and https://learn.microsoft.com/en-us/azure/cosmos-db/partitioning-overview which made me think this).
However, when I run
SELECT TOP 1 *
FROM c
WHERE c.Field = "some value"
AND c.PartitionKeyField = "1234"
ORDER BY c.TimeStampField DESC
It consumes 6 RUs.
Whereas without the partition key, e.g.
SELECT TOP 1 *
FROM c
WHERE c.Field = "some value"
ORDER BY c.TimeStampField DESC
It consumes 5.76 RUs - i.e. cheaper.
(whilst there is some variation in the above numbers depending on the exact document selected, the second query is always cheaper, and I have tested against both the smallest and largest partitions.)
My database currently has around 400,000 documents and 29 partitions (both are expected to grow). Largest partition has around 150,000 documents (unlikely to grow further than this).
The above results indicate to me that I should not pass the partition key in the WHERE clause for this query. Please could someone explain why this is so as from the documentation I thought the opposite should be true?

There might a few reasons and it depends on which index the query engine decides to use or if there is an index at all.
First thing I can say is that there is likely not much data in this container because queries without a partition key get progressively more expensive the larger the container, especially when they span physical partitions.
The first one could be more expensive if there is no index on the partition key and did a scan on it after filtering by the c.field.
It could also be more expensive depending on whether there is a composite index and whether it used it.
Really though you cannot take query metrics for small containers and extrapolate. The only way to measure is to put enough data into the container. Also the amount here is so small that it's not worth optimizing over. I would put the amount of data into this container you expect to have once in production and re-run your queries.
Lastly, with regards to measuring and optimizing, pareto principle applies. You'll go nuts chasing down every optimization. Find your high concurrency queries and focus on those.
Hope this is helpful.

Why varying blob size gives different performance?

My cassandra table looks like this -
CREATE TABLE cs_readwrite.cs_rw_test (
part_id bigint,
s_id bigint,
begin_ts bigint,
end_ts bigint,
blob_data blob,
PRIMARY KEY (part_id, s_id, begin_ts, end_ts)
) WITH CLUSTERING ORDER BY (s_id ASC, begin_ts DESC, end_ts DESC)
When I insert 1 million row per client, with 8 kb blob per row and test the speed of insertions from different client hosts the speed is almost constant at ~100 mbps. But with the same table definition, from same client hosts if I insert rows with 16 bytes of blob data then my speed numbers are dramatically low ~4 to 5 mbps. Why is there such a speed difference? I am only measuring write speeds for now. My main concern is not speed (though some inputs will help) when I add more clients I see speed is almost constant for bigger blob size but for 16 bytes blob the speed is increasing only by 10-20% per added client before it becomes constant.
I have also looked at bin/nodetool tablehistograms output and do adjust number of partitions in my test data so no partition is > 100 mb.
Any insights/ links for documentation would be helpful. Thanks!

I think you are measuring the throughput in the wrong way. The throughput should be measured in transactions per second and not in data written per second.
Even though the amount of data written can play a role in determining the write throughput of a system but usually it depends on many other factors.
Compaction Strategy like STCS is write-optimized whereas LOCS is
read-optimized.
Connection speed and latency between the client and the cluster, and
between machines in the cluster
CPU usage of the node which is processing data, sending data to other
replicas and waiting for their acknowledgment.
Most of the writes are immediately written in memory instead of being written directly in the disk which basically makes the impact of the amount of data being written on final write throughput almost negligible whereas other fixed things like Network delay, CPU to coordinate the processing of data across nodes, etc have a bigger impact.
The way you should see it is that with 8KB of payload you are getting X transactions per second and with 16 Bytes you are getting Y transactions per second. Y will always be better than X but it will not be linearly proportional to the size difference.
You can find how writes are handled in cassandra explained in detail here.

Theres management overhead in Cassandra per row/partition, the more data (in bytes) you have in each row the less that overhead impacts throughput in bytes/sec. The reverse is true if you look at rows per sec as a metric of throughput. The larger the payloads the worse your rows/sec throughput would get.

Cassandra read performance degrade as we increase data on nodes

DB used: Datastax cassandra community 3.0.9
Cluster: 3 x (8core 15GB AWS c4.2xlarge) with 300GB io1 with 3000iops.
Write consistency: Quorum , read consistency: ONE Replication
factor: 3
Problem:
I loaded our servers with 50,000 users and each user had 1000 records initially and after sometime, 20 more records were added to each users. I wanted to fetch the 20 additional records that were added later(Query : select * from table where userID='xyz' and timestamp > 123) here user_id and timestamp are part of primary key. It worked fine when I had only 50,000 users. But as soon as I added another 20GB of dummy data, the performance for same query i.e. fetch 20 additional records for 50,000 users dropped significantly. Read performance is getting degraded with increase in data. As far as I have read, this should not have happened as keys get cached and additional data should not matter.
what could be possible cause for this? CPU and RAM utilisation is negligible and I cant find out what is causing the query time to increase.
I have tried changing compaction strategy to "LeveledCompaction" but that didn't work either.
EDIT 1
EDIT 2
Heap size is 8GB. The 20GB data is added in a way similar to the way in which the initial 4GB data was added (the 50k userIDs) and this was done to simulate real world scenario. "userID" and "timestamp" for the 20GB data is different and is generated randomly. Scenario is that I have 50k userIDs with 1020 rows where 1000 rows were added first and then additional 20 rows were added after some timestamp, I am fetching these 20 messages. It works fine if only 50k userIDs are present but once I have more userIDs (additional 20GB) and I try to fetch those same 20 messages (for initial 50k userIDs), the performance degrades.
EDIT 3
cassandra.yaml

Read performance is getting degraded with increase in data.
This should only happen when your add a lot of records in the same partition.
From what I can understand your table may looks like:
CREATE TABLE tbl (
userID text,
timestamp timestamp,
....
PRIMARY KEY (userID, timestamp)
);
This model is good enough when the volume of the data in a single partition is "bound" (eg you have at most 10k rows in a single partition). The reason is that the coordinator gets a lot of pressure when dealing with "unbound" queries (that's why very large partitions are a big no-no).
That "rule" can be easily overlooked and the net result is an overall slowdown, and this could be simply explained as this: C* needs to read more and more data (and it will all be read from one node only) to satisfy your query, keeping busy the coordinator, and slowing down the entire cluster. Data grow usually means slow query response, and after a certain threshold the infamous read timeout error.
That being told, it would be interesting to see if your DISK usage is "normal" or something is wrong. Give it a shot with dstat -lrvn to monitor your servers.
A final tip: depending on how many fields you are querying with SELECT * and on the amount of retrieved data, being served by an SSD may be not a big deal because you won't exploit the IOPS of your SSDs. In such cases, preferring an ordinary HDD could lower the costs of the solution, and you wouldn't incur into any penalty.

Dramatic decrease of Azure Table storage performance after querying whole partition

I use Azure Table storage as a time series database. The database is constantly extended with more rows, (approximately 20 rows per second for each partition). Every day I create new partitions for the day's data so that all partition have a similar size and never get too big.
Until now everything worked flawlessly, when I wanted to retrieve data from a specific partition it would never take more than 2.5 secs for 1000 values and on average it would take 1 sec.
When I tried to query all the data of a partition though things got really really slow, towards the middle of the procedure each query would take 30-40 sec for 1000 values.
So I cancelled the procedure just to re start it for a smaller range. But now all queries take too long. From the beginning all queries need 15-30 secs. Can that mean that data got rearranged in a non efficient way and that's why I am seeing this dramatic decrease in performance? If yes is there a way to handle such a rearrangement?

I would definitely recommend you to go over the links Jason pointed above. You have not given too much detail about how you generate your partition keys but from sounds of it you are falling into several anti patterns. Including by applying Append (or Prepend) and too many entities in a single partition. I would recommend you to reduce your partition size and also put either a hash or a random prefix to your partition keys so they are not in lexicographical order.
Azure storage follows a range partitioning scheme in the background, so even if the partition keys you picked up are unique, if they are sequential they will fall into the same range and potentially be served by a single partition server, which would hamper the ability of azure storage service overall to load balance and scale out your storage requests.
The other aspect you should think is how you are reading the entities back, the best recommendation is point query with partition key and row key, worst is a full table scan with no PK and RK, there in the middle you have partition scan which in your case will also be pretty bad performance due to your partition size.

One of the challenges with time series data is that you can end up writing all your data to a single partition which prevents Table Storage from allocating additional resources to help you scale. Similarly for read operations you are constrained by potentially having all your data in a single partition which means you are limited to 2000 entities / second - whereas if you spread your data across multiple partitions you can parallelize the query and yield far greater scale.
Do you have Storage Analytics enabled? I would be interested to know if you are getting throttled at all or what other potential issues might be going on. Take a look at the Storage Monitoring, Diagnosing and Troubleshooting guide for more information.
If you still can't find the information you want please email AzTableFeedback#microsoft.com and we would be happy to follow up with you.
The Azure Storage Table Design Guide talks about general scalability guidance as well as patterns / anti-patterns (see the append only anti-pattern for a good overview) which is worth looking at.

Azure Table Storage Performace For a Very Specific Table

Is there a way around 500 entities / second / partition with ATS (Azure Table Storage)? OK with dirty reads. If in insert is not immediately available for read then OK.
Looking to move some large tables from SQL to ATS.
Scale: Because of these tables the size is bumping the 150 GB limit of SQL Azure
Insert speed: Inverted index for query speed. Insert order is not
sorted by the table clustered index which causes rapid SQL table
fragmentation. ATS most likely has an insert advantage over SQL.
Cost: ATS has a lower monthy cost. But ATS has a higher load cost as millions of rows and cannot batch as the order of the load is not by partition.
Query speed: A search is almost never on just one partitionKey. A search will have a SQL component and zero or more ATS components. This ATS query is always by partitionKey and returning rowKeys. Raw search on partitionKey is fast the problem is the time to return the entities (rows). A given partitionKey will have on average 1,000 rowKeys which is 2 seconds at 500 entities / second / partition. But there will be some partitionKeys with over 100,000 rowKeys which equates to over 3 minutes. Return 10,000 rows at a time and in SQL and no query is over 10 seconds as with the power of joins don't have to bring down 100,000 rows to have those rows considered in the where.
Is there a was around this select entity speed with ATS? For scale and insert speed would like to go to ATS.
Windows Azure Storage Abstractions and their Scalability Targets
How to get most out of Windows Azure Tables
Designing a Scalable Partitioning Strategy for Windows Azure Table Storage
Turn entity tracking off for query results that are not going to be modified:
context.MergeOption = MergeOption.NoTracking;

One potential workaround is to stripe the data across multiple partitions and/or tables, perform queries across all the (sub)partitions in parallel and merge the results.
For example, for striping across partitions, prepending the partition key with a single digit can multiple the scalability of the partition 10 times.
So a partition key, say ABCDEFGH, could be sub partitioned 0ABCDEFGH to 9ABCDEFGH.
Writes are made to a partition, with the prefix digit generated either randomly or in round robin fashion.
Reads would query across all 10 partitions in parallel and merge the results.
For striping across tables, one of N tables can be written to randomly or in round robin fashion and queried similarly in parallel.

Edit: I had originally stated that the limit was 500 transaction/partition/sec. That was incorrect. The limit is actually 500 entities/partition/sec, as stated in the original question.
This also applies to the query speeds you've calculated. If you query an ATS PartitionKey and it returns 1000 entities, that will likely take only a little longer, perhaps a few hundred milliseconds, than returning a single entity. On the other hand, if the query returns more than 1000 entities it will be much slower, as each set of 1000 rows requires an essentially independent transaction and must be done in serial.
It's not completely clear to me what you're doing, but it sounds like a lot of querying. Keep in mind that querying ATS on non-key columns tends to be very slow. If you're doing a lot of that, you might be better served by using SQL Azure Federations and fan-out queries instead.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string