How to efficiently query google-cloud-spanner in parallel with multiple threads?
(Sorry this is long, but I'm desperate and want to be thorough!)
We are moving a service from AWS to GCP and switching from DynamoDB to Cloud Spanner as the back-end data store.
The data store (Spanner) contains data that users of the web service query for. Under production loads, the data being queried is found between 1% and 10% of the time. I have a simple multi-threaded Java test client that queries our service, continually adding new threads as long as the average throughput over the last minute is increasing.
My test client runs on a GCE VM (64 CPUs), and with the DynamoDB data source I can get up to 3700 threads, pushing through 50k req/s on average once our service auto-scales up to the configured maximum pod count. Each thread reads 100 hashes from DynamoDB for every 1000 requests (a 10% hit rate).
I now need to switch my Java client to query spanner for data used in 10% of the requests. My query generally looks like:
SELECT A, B, C FROM data_table LIMIT 250 OFFSET XXX
Theoretically, I want each thread to SELECT blocks of unique rows. I use the OFFSET to start each thread reading from a unique position and once each block of records has been used up, I increment the OFFSET to startingOffset + totalRows and SELECT another block of data.
I realize this query may not translate to every implementation, but the concept should hold true that every thread can query spanner for a unique dataset over the life of the thread.
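Conceptually, each thread runs a loop like the sketch below (placeholder names and values, not my actual client code):

import com.google.cloud.spanner.DatabaseClient;
import com.google.cloud.spanner.ResultSet;
import com.google.cloud.spanner.Statement;

// Sketch of the per-thread paging scheme; blockSize and the starting
// offset calculation are placeholders, not values from my real client.
void readBlocks(DatabaseClient client, int threadIndex, long totalRows) {
    final int blockSize = 250;
    long offset = threadIndex * 1000L;  // each thread starts at a unique position
    while (!Thread.currentThread().isInterrupted()) {
        Statement stmt = Statement.of(String.format(
            "SELECT A, B, C FROM data_table LIMIT %d OFFSET %d", blockSize, offset));
        try (ResultSet rs = client.singleUse().executeQuery(stmt)) {
            while (rs.next()) {
                // use this row's data to answer ~10% of incoming requests
            }
        }
        offset += totalRows;  // advance to this thread's next unique block
    }
}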
I tried using java-spanner-jdbc, both with a c3p0 connection pool and through the standard DriverManager.getConnection() route. I played with the min/max session configuration as well as numChannels, but nothing seemed to help this scale. To be honest, I still don't understand the relationship between sessions and channels.
I also tried the native Spanner client with singleUseReadOnlyTransaction(), batchReadOnlyTransaction(), and most recently txn.partitionQuery().
Since partitionQuery() feels a lot like the DynamoDB code, it seems like the right direction, but because my query (based on the "Read data in parallel" example at https://cloud.google.com/spanner/docs/reads) has a LIMIT clause, I'm getting the error:
com.google.cloud.spanner.SpannerException: INVALID_ARGUMENT:
com.google.api.gax.rpc.InvalidArgumentException:
io.grpc.StatusRuntimeException: INVALID_ARGUMENT: Query is not root
partitionable since it does not have a DistributedUnion at the root.
Please run EXPLAIN for query plan details.
Removing the LIMIT clause gets past this, but then the queries take an eternity!
So the question is, if the partitionQuery() route is correct, how do I do parallel queries with 'paging' limits? If this is not the best route, what should I use to get the best parallel read throughput with unique data sets for each thread?
[EDIT]
Based on the comment below by Knut Olav Loite, partitioned or batch queries are not the right approach, so I am back to a single-use read-only query.
Here is my code for creating spannerDbClient:
import com.google.api.gax.retrying.RetrySettings;
import com.google.api.gax.rpc.StatusCode;
import com.google.cloud.spanner.DatabaseId;
import com.google.cloud.spanner.SessionPoolOptions;
import com.google.cloud.spanner.SpannerOptions;
import org.threeten.bp.Duration; // the gax RetrySettings builder expects threeten Durations

// Retry policy applied to the ExecuteSql RPCs below.
RetrySettings retrySettings = RetrySettings.newBuilder()
    .setInitialRpcTimeout(Duration.ofSeconds(SPANNER_INITIAL_TIMEOUT_RETRY_SECONDS))
    .setMaxRpcTimeout(Duration.ofSeconds(SPANNER_MAX_TIMEOUT_RETRY_SECONDS))
    .setMaxAttempts(SPANNER_MAX_RETRY_ATTEMPTS)
    .setTotalTimeout(Duration.ofSeconds(SPANNER_TOTAL_TIMEOUT_RETRY_SECONDS))
    .build();

SpannerOptions.Builder builder = SpannerOptions.newBuilder()
    .setSessionPoolOption(SessionPoolOptions.newBuilder()
        .setFailIfPoolExhausted()  // throw instead of blocking when no session is free
        .setMinSessions(SPANNER_MIN_SESSIONS)
        .setMaxSessions(SPANNER_MAX_SESSIONS)
        .build())
    .setNumChannels(SPANNER_NUM_CHANNELS);

if (credentials != null) {
    builder.setCredentials(credentials);
}

builder.getSpannerStubSettingsBuilder()
    .executeSqlSettings()
    .setRetryableCodes(StatusCode.Code.DEADLINE_EXCEEDED, StatusCode.Code.UNAVAILABLE)
    .setRetrySettings(retrySettings);

spanner = builder.build().getService();
databaseId = DatabaseId.of(projectName, instanceName, databaseName);
spannerDbClient = spanner.getDatabaseClient(databaseId);
Here is my method for performing the actual query:
List<Entry> entry = new ArrayList<>();
try (ResultSet resultSet = spannerDbClient
        .singleUseReadOnlyTransaction(TimestampBound.ofMaxStaleness(5, TimeUnit.SECONDS))
        .executeQuery(Statement.newBuilder(
            String.format("SELECT * from %s LIMIT %d OFFSET %d", tableName, limit, offset))
            .build())) {
    while (resultSet.next()) {
        entry.add(getEntryFromResultSet(resultSet));
    }
}
I added timer code to show how long the queries take; this is what it looks like for 50 threads, using a shared spannerDbClient instance with maxSessions=50, minSessions=50, and numChannels=4 (the default):
--> [0h:00m:00s] Throughput: Total 0, Interval 0 (0 req/s), 0/0 threads reporting
[tId:099][00:00:00.335] Spanner query, LIMIT 250 OFFSET 99000
[tId:146][00:00:00.382] Spanner query, LIMIT 250 OFFSET 146000
[tId:140][00:00:00.445] Spanner query, LIMIT 250 OFFSET 140000
[tId:104][00:00:00.494] Spanner query, LIMIT 250 OFFSET 104000
[tId:152][00:00:00.363] Spanner query, LIMIT 250 OFFSET 152000
[tId:149][00:00:00.643] Spanner query, LIMIT 250 OFFSET 149000
[tId:143][00:00:00.748] Spanner query, LIMIT 250 OFFSET 143000
[tId:163][00:00:00.682] Spanner query, LIMIT 250 OFFSET 163000
[tId:155][00:00:00.799] Spanner query, LIMIT 250 OFFSET 155000
[tId:166][00:00:00.872] Spanner query, LIMIT 250 OFFSET 166000
[tId:250][00:00:00.870] Spanner query, LIMIT 250 OFFSET 250000
[tId:267][00:00:01.319] Spanner query, LIMIT 250 OFFSET 267000
[tId:229][00:00:01.917] Spanner query, LIMIT 250 OFFSET 229000
[tId:234][00:00:02.256] Spanner query, LIMIT 250 OFFSET 234000
[tId:316][00:00:02.401] Spanner query, LIMIT 250 OFFSET 316000
[tId:246][00:00:02.844] Spanner query, LIMIT 250 OFFSET 246000
[tId:312][00:00:02.989] Spanner query, LIMIT 250 OFFSET 312000
[tId:176][00:00:03.497] Spanner query, LIMIT 250 OFFSET 176000
[tId:330][00:00:03.140] Spanner query, LIMIT 250 OFFSET 330000
[tId:254][00:00:03.879] Spanner query, LIMIT 250 OFFSET 254000
[tId:361][00:00:03.816] Spanner query, LIMIT 250 OFFSET 361000
[tId:418][00:00:03.635] Spanner query, LIMIT 250 OFFSET 418000
[tId:243][00:00:04.503] Spanner query, LIMIT 250 OFFSET 243000
[tId:414][00:00:04.006] Spanner query, LIMIT 250 OFFSET 414000
[tId:324][00:00:04.457] Spanner query, LIMIT 250 OFFSET 324000
[tId:498][00:00:03.865] Spanner query, LIMIT 250 OFFSET 498000
[tId:252][00:00:04.945] Spanner query, LIMIT 250 OFFSET 252000
[tId:494][00:00:04.211] Spanner query, LIMIT 250 OFFSET 494000
[tId:444][00:00:04.780] Spanner query, LIMIT 250 OFFSET 444000
[tId:422][00:00:04.951] Spanner query, LIMIT 250 OFFSET 422000
[tId:397][00:00:05.234] Spanner query, LIMIT 250 OFFSET 397000
[tId:420][00:00:05.106] Spanner query, LIMIT 250 OFFSET 420000
[tId:236][00:00:05.985] Spanner query, LIMIT 250 OFFSET 236000
[tId:406][00:00:05.429] Spanner query, LIMIT 250 OFFSET 406000
[tId:449][00:00:05.291] Spanner query, LIMIT 250 OFFSET 449000
[tId:437][00:00:05.929] Spanner query, LIMIT 250 OFFSET 437000
[tId:341][00:00:06.611] Spanner query, LIMIT 250 OFFSET 341000
[tId:475][00:00:06.223] Spanner query, LIMIT 250 OFFSET 475000
[tId:490][00:00:06.186] Spanner query, LIMIT 250 OFFSET 490000
[tId:416][00:00:06.460] Spanner query, LIMIT 250 OFFSET 416000
[tId:328][00:00:07.446] Spanner query, LIMIT 250 OFFSET 328000
[tId:322][00:00:07.679] Spanner query, LIMIT 250 OFFSET 322000
[tId:158][00:00:09.357] Spanner query, LIMIT 250 OFFSET 158000
[tId:496][00:00:08.183] Spanner query, LIMIT 250 OFFSET 496000
[tId:256][00:00:09.250] Spanner query, LIMIT 250 OFFSET 256000
--> [0h:00m:10s] Throughput: Total 9848, Interval +9848 (984 req/s), 44/50 threads reporting
[tId:492][00:00:08.646] Spanner query, LIMIT 250 OFFSET 492000
[tId:390][00:00:09.810] Spanner query, LIMIT 250 OFFSET 390000
[tId:366][00:00:10.142] Spanner query, LIMIT 250 OFFSET 366000
[tId:320][00:00:10.451] Spanner query, LIMIT 250 OFFSET 320000
[tId:318][00:00:10.619] Spanner query, LIMIT 250 OFFSET 318000
--> [0h:00m:20s] Throughput: Total 56051, Interval +46203 (4620 req/s), 50/50 threads reporting
--> [0h:00m:30s] Throughput: Total 102172, Interval +46121 (4612 req/s), 50/50 threads reporting
Note that the query times only increase regardless of the offset, and it takes between 10 and 20 seconds for the initial Spanner query to return data for all 50 threads before they start reporting results. If I increase the limit to 1000, it takes almost 2 minutes for all 50 threads to get their results back from Spanner.
Compare that to the DynamoDB equivalent (except that the limit is 1000), where all queries return in less than 1 second and all 50 threads are reporting results before the 10-second status update is displayed:
--> [0h:00m:00s] Throughput: Total 0, Interval 0 (0 req/s), 0/0 threads reporting
[tId:045] Dynamo query, LIMIT 1000 [00:00:00.851]
[tId:138] Dynamo query, LIMIT 1000 [00:00:00.463]
[tId:183] Dynamo query, LIMIT 1000 [00:00:00.121]
[tId:122] Dynamo query, LIMIT 1000 [00:00:00.576]
[tId:095] Dynamo query, LIMIT 1000 [00:00:00.708]
[tId:072] Dynamo query, LIMIT 1000 [00:00:00.778]
[tId:115] Dynamo query, LIMIT 1000 [00:00:00.619]
[tId:166] Dynamo query, LIMIT 1000 [00:00:00.296]
[tId:058] Dynamo query, LIMIT 1000 [00:00:00.814]
[tId:179] Dynamo query, LIMIT 1000 [00:00:00.242]
[tId:081] Dynamo query, LIMIT 1000 [00:00:00.745]
[tId:106] Dynamo query, LIMIT 1000 [00:00:00.671]
[tId:162] Dynamo query, LIMIT 1000 [00:00:00.348]
[tId:035] Dynamo query, LIMIT 1000 [00:00:00.889]
[tId:134] Dynamo query, LIMIT 1000 [00:00:00.513]
[tId:187] Dynamo query, LIMIT 1000 [00:00:00.090]
[tId:158] Dynamo query, LIMIT 1000 [00:00:00.405]
[tId:191] Dynamo query, LIMIT 1000 [00:00:00.095]
[tId:195] Dynamo query, LIMIT 1000 [00:00:00.096]
[tId:199] Dynamo query, LIMIT 1000 [00:00:00.144]
[tId:203] Dynamo query, LIMIT 1000 [00:00:00.112]
[tId:291] Dynamo query, LIMIT 1000 [00:00:00.102]
[tId:303] Dynamo query, LIMIT 1000 [00:00:00.094]
[tId:312] Dynamo query, LIMIT 1000 [00:00:00.101]
[tId:318] Dynamo query, LIMIT 1000 [00:00:00.075]
[tId:322] Dynamo query, LIMIT 1000 [00:00:00.086]
[tId:326] Dynamo query, LIMIT 1000 [00:00:00.096]
[tId:330] Dynamo query, LIMIT 1000 [00:00:00.085]
[tId:334] Dynamo query, LIMIT 1000 [00:00:00.114]
[tId:342] Dynamo query, LIMIT 1000 [00:00:00.096]
[tId:391] Dynamo query, LIMIT 1000 [00:00:00.081]
[tId:395] Dynamo query, LIMIT 1000 [00:00:00.088]
[tId:406] Dynamo query, LIMIT 1000 [00:00:00.088]
[tId:415] Dynamo query, LIMIT 1000 [00:00:00.078]
[tId:421] Dynamo query, LIMIT 1000 [00:00:00.089]
[tId:425] Dynamo query, LIMIT 1000 [00:00:00.068]
[tId:429] Dynamo query, LIMIT 1000 [00:00:00.088]
[tId:433] Dynamo query, LIMIT 1000 [00:00:00.105]
[tId:437] Dynamo query, LIMIT 1000 [00:00:00.092]
[tId:461] Dynamo query, LIMIT 1000 [00:00:00.110]
[tId:483] Dynamo query, LIMIT 1000 [00:00:00.071]
[tId:491] Dynamo query, LIMIT 1000 [00:00:00.078]
[tId:495] Dynamo query, LIMIT 1000 [00:00:00.075]
[tId:503] Dynamo query, LIMIT 1000 [00:00:00.064]
[tId:499] Dynamo query, LIMIT 1000 [00:00:00.108]
[tId:514] Dynamo query, LIMIT 1000 [00:00:00.163]
[tId:518] Dynamo query, LIMIT 1000 [00:00:00.135]
[tId:529] Dynamo query, LIMIT 1000 [00:00:00.163]
[tId:533] Dynamo query, LIMIT 1000 [00:00:00.079]
[tId:541] Dynamo query, LIMIT 1000 [00:00:00.060]
--> [0h:00m:10s] Throughput: Total 24316, Interval +24316 (2431 req/s), 50/50 threads reporting
--> [0h:00m:20s] Throughput: Total 64416, Interval +40100 (4010 req/s), 50/50 threads reporting
Am I missing something in the config? If I let it auto-scale, the performance issue is greatly magnified.
I suspect that in order to produce accurate results for
SELECT A, B, C FROM data_table LIMIT 250 OFFSET XXX
the backend would need to fetch 250 + XXX rows and then skip XXX of them. So if XXX is very large, this can be a very expensive query that requires scanning a big chunk of data_table.
Would it make sense to instead restrict on the table key(s)? Something like:
SELECT A, B, C FROM data_table WHERE TableKey1 > 'key_restriction' LIMIT 250;
This type of query should only read up to 250 rows.
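In the Java client, that could look something like the sketch below (assuming TableKey1 is a STRING primary key column; spannerDbClient is the client from your question):

// Keyset pagination sketch: resume after the last key read instead of using OFFSET.
String lastKey = "";  // or a per-thread starting key
Statement stmt = Statement.newBuilder(
        "SELECT TableKey1, A, B, C FROM data_table "
            + "WHERE TableKey1 > @lastKey ORDER BY TableKey1 LIMIT 250")
    .bind("lastKey").to(lastKey)
    .build();
try (ResultSet rs = spannerDbClient.singleUse().executeQuery(stmt)) {
    while (rs.next()) {
        lastKey = rs.getString("TableKey1");  // remember where to resume next time
        // process the row
    }
}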
Independently, it would be good to understand how representative such queries would be for your production workload. Can you explain what type of queries you expect in production?
EDIT based on the additional information:
As Panagiotis Voulgaris pointed out above, I don't think the problem in this case is related to the client configuration but to the query itself. The query seems to be quite slow, especially for higher OFFSET values. I tried it out with a table of approximately 1,000,000 rows, and for an OFFSET value of 900,000 a single query runs for 4-5 seconds. The problem probably gets worse when you scale up because you are overwhelming the backend with many long-running parallel queries, not because the client is wrongly configured.
It would be best to rewrite your query to select a range of rows based on the primary key value instead of using a LIMIT x OFFSET y construct. Your query would then look something like this:
SELECT A, B, C
FROM data_table
WHERE A >= x AND A < (x+250)
This obviously won't guarantee that you get exactly 250 rows in each partition if your key column contains gaps between the values. In that case, you could increase the +250 value a little to get reasonably sized partitions.
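If A is an INT64 key, you could also pass the range bounds as query parameters instead of formatting them into the SQL string (a sketch; how you assign ranges to threads is up to you):

// Sketch: each thread reads a disjoint key range with bound parameters.
long x = threadIndex * 250L;  // hypothetical per-thread range start
Statement stmt = Statement.newBuilder(
        "SELECT A, B, C FROM data_table WHERE A >= @start AND A < @end")
    .bind("start").to(x)
    .bind("end").to(x + 250)
    .build();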
If the above is not possible because the key values are completely random values (or are not evenly distributed), then I think the following query would be more efficient than your current query:
SELECT A, B, C
FROM data_table
WHERE A >= (
  SELECT ANY_VALUE(A)
  FROM data_table
  GROUP BY A
  LIMIT 1 OFFSET y
)
ORDER BY A
LIMIT 250
It's not really clear to me exactly what your end goal is in this case, and that makes a difference when it comes to the concrete question:
...if the partitionQuery() route is correct (?)
The BatchReadOnlyTransaction and partitionQuery() route is intended for reading a large dataset at a single point in time, for example when you want to create a dump of all the data in a table. Spanner will partition the query for you and return a list of partitions. Each partition can then be handled by a separate thread (or even a separate VM). This, so to speak, automatically replaces the LIMIT 250 OFFSET xxxx part of your query, as Spanner creates the different partitions based on the actual data in the table.
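A minimal sketch of that route looks like this (note that the query must be root-partitionable, which is why your LIMIT clause triggers the error you saw):

// Sketch: partitioned read of the whole table at a single timestamp.
BatchClient batchClient = spanner.getBatchClient(databaseId);
try (BatchReadOnlyTransaction txn =
        batchClient.batchReadOnlyTransaction(TimestampBound.strong())) {
    List<Partition> partitions = txn.partitionQuery(
        PartitionOptions.getDefaultInstance(),
        Statement.of("SELECT A, B, C FROM data_table"));  // no LIMIT/OFFSET allowed
    for (Partition partition : partitions) {  // each partition can go to its own thread/VM
        try (ResultSet rs = txn.execute(partition)) {
            while (rs.next()) {
                // process row
            }
        }
    }
}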
However, if your end goal here is to simulate production load, then BatchReadOnlyTransaction is not the route to follow.
If what you want to do is efficiently query a dataset, then you should make sure that you use a single-use read-only transaction for the query. This is what you are already doing with the native client. The JDBC driver will also automatically use single-use read-only transactions for queries, as long as the connection is in autocommit mode. If you turn off autocommit, the driver will automatically start a transaction when you execute a query.
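With the JDBC driver that could look like this (the URL format is the driver's standard form; the project, instance, and database names are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: in autocommit mode, each query below runs as a
// single-use read-only transaction.
String url = "jdbc:cloudspanner:/projects/my-project"
    + "/instances/my-instance/databases/my-database";
try (Connection connection = DriverManager.getConnection(url)) {
    connection.setAutoCommit(true);  // the default; shown for clarity
    try (Statement stmt = connection.createStatement();
         ResultSet rs = stmt.executeQuery("SELECT A, B, C FROM data_table LIMIT 250")) {
        while (rs.next()) {
            // process row
        }
    }
}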
Regarding sessions and channels:
Sessions are somewhat comparable to what you would normally call a connection. Both the JDBC driver and the native client use an internal session pool. The important part in your case is the number of parallel reads that will be executing at any time. One session can handle one transaction (i.e. one read operation) at a time, so you will need as many sessions as there will be parallel read operations. I assume that in your setup with c3p0 you are assigning a single JDBC connection to each reading thread. In that case, the max number of sessions should be set equal to the max number of connections in the c3p0 pool.
Channels: A channel is a low-level network connection that is used by gRPC. One channel can handle multiple simultaneous requests in parallel. As far as I know, the default max is 100 simultaneous requests per channel, so you should use 1 channel for every 100 sessions. This is also the default in the JDBC driver and native client library.
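Put concretely, the sizing rule could look like this (the thread count is a placeholder):

// Sketch: one session per parallel reader, roughly one channel per 100 sessions.
int parallelReaders = 400;  // placeholder: your max thread count
SpannerOptions options = SpannerOptions.newBuilder()
    .setSessionPoolOption(SessionPoolOptions.newBuilder()
        .setMinSessions(parallelReaders)
        .setMaxSessions(parallelReaders)
        .build())
    .setNumChannels((parallelReaders + 99) / 100)  // ceil(sessions / 100)
    .build();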
Regarding the (example) query:
As mentioned above, it's not really clear to me whether this is just a test setup or an actual production example. I would, however, expect the query to contain an explicit ORDER BY clause to ensure that the data is returned in the expected order, and that ORDER BY clause should obviously use an indexed column.
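For example, assuming the table's primary key is A:

SELECT A, B, C
FROM data_table
ORDER BY A
LIMIT 250 OFFSET XXX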
Finally: is the problem caused by the backend responding too slowly to each query? Or is the backend basically idling, with the client unable to really ramp up the queries?