Understanding spanner explanation - google-cloud-spanner

I have a table of 860M rows in Google Cloud Spanner and I'm trying to understand how the query explanation works.
The table has a string column geoid, and there is an index on this column.
When I run the following query it takes only 36ms:
SELECT count(*)
FROM usbg_2015
WHERE geoid= '340170175001'
Table structure is:
CREATE TABLE usbg_2015 (
geoid STRING(12),
quadkey STRING(24),
) PRIMARY KEY (geoid, quadkey)
However, I don't understand why the explanation says it uses a Table Scan instead of an Index Scan. I understood a Table Scan to be a full scan of the table, which in this case would mean reading 860M rows and should take far longer than 36ms. What am I missing?

In the explanation, Table Scan merely means that it reads the data from a table and does not necessarily mean a full table scan. Same goes for index scan. It means that it is reading from an index. In both cases, if there is a seekable predicate (e.g., constant prefixes on primary key or indexed column), they will do the seek.
The plan used the base table, and seek-and-scanned 11 rows, otherwise you would see 860M rows returned as a result out of the Table Scan.
Is geoid the leading primary key column of the table usbg_2015? That is the only explanation that I can think of given the plan.
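To make the seek-and-scan idea concrete, here is a minimal Python sketch. It is not Spanner's implementation; it only assumes that rows are stored in primary key order, and the key values and the helper name count_by_geoid are made up for illustration.

```python
import bisect

# Hypothetical miniature of usbg_2015: rows kept sorted by (geoid, quadkey),
# which is the primary key order. The key values are invented.
rows = sorted([
    ("060371011101", "0230100"),
    ("340170175001", "0320101"),
    ("340170175001", "0320102"),
    ("340170175002", "0320103"),
    ("360470001001", "0320200"),
])

def count_by_geoid(sorted_rows, geoid):
    """Seek to the first key with this geoid prefix, then scan until it changes."""
    # (geoid,) sorts just before every (geoid, quadkey) pair, so this is the seek.
    i = bisect.bisect_left(sorted_rows, (geoid,))
    count = 0
    while i < len(sorted_rows) and sorted_rows[i][0] == geoid:
        count += 1
        i += 1
    return count

print(count_by_geoid(rows, "340170175001"))
```

The scan only touches the rows that share the requested prefix, which is why the operator can still be labelled a table scan while reading a handful of rows.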

A table scan has two meanings:
1- When you search by the primary key (or, in your case, the first part of it).
2- When you perform an index scan and the select list contains a column that is in neither the index nor its STORING clause, you have to join the index with the table itself. This operation is also called a table scan.

Related

Cassandra : (3.11.11) find a string in the cassandra table column

I am a newbie to Cassandra.
I have a table (table1) with data like this:
ch1,ch2,ch3,ch4
LD,9813970,1484914,'T03103','T04014'
LD,1008203,1486104,'T03103','T04024'
I want to find a string in this Cassandra table, table1. Is there any option to search for a given string in the table's column ch4 using only the IN operator (not the LIKE operator)? A sample query would look like this:
select * from table1 where 'T04014' IN (ch4)
If required, the ch4 column may be included in the partition or clustering keys.
You didn't post the table schema so I'm going to assume that ch4 is not part of the primary key.
You cannot include a column in the filter unless it is part of the primary key or you have a secondary index defined on it. Be aware that secondary indexes are not always a good fit. Have a look at when to use an index for details.
The general recommendation is to denormalise and create a table specifically designed for each app query so you get the best performance out of your cluster. Cheers!

Regarding suggestion of best schema for a cassandra table?

I want to have a table in Cassandra that has a partition key, say column 'A', and a column, say 'B', which is of 'set' type and can have up to 10,000 elements in the set.
But when I retrieve a row from this table, the whole set is retrieved at once, and because of that the JVM heap grows rapidly. Should I stick to this schema, or go with another schema where 'A' is the partition key and I create dynamic columns for each element of the set, say 'B1', 'B2', ... 'B10000', where each of these columns is a clustering key?
Which schema is best suited and will give optimal performance? Please recommend.
NOTE: cqlsh 5.0.1v
Based off of what you've described, and the documentation I've read, I would not create a collection with 10k elements. Instead I would have two tables, one with everything but the collection, and then use the primary key values of the first table, as the partition key columns of the second table; adding the element name (or whatever you can use to identify an individual element) as a clustering column.
So for a given query, if you wanted everything for a particular primary key value (including all elements), you'd query the first table with the primary key, grab whatever you need, then hit the second table as well, looping/fetching through all elements.
If the query only provides a filter on the partition key (not the primary key - i.e. retrieving multiple rows) , the first query would have to retrieve all columns that make up the primary key for each row, and then query the second table looping for all elements - nested loop here - one loop for each primary key record retrieved from the first table, and a second loop to grab all elements for each pk record.
That's probably the best way to go, and it's how I would tackle this.
Does that make sense?
-Jim

Filter on the partition and the clustering key with an additional criteria

I want to filter on a table that has a partition key and a clustering key, with an additional criterion on a regular column. I got the following error.
InvalidQueryException: Cannot execute this query as it might involve
data filtering and thus may have unpredictable performance. If you
want to execute this query despite the performance unpredictability,
use ALLOW FILTERING
I understand the problem if the partition and the clustering key are not used. In my case, is it a relevant error or can I ignore it?
Here is an example of the table and query.
CREATE TABLE mytable(
name text,
id uuid,
deleted boolean,
PRIMARY KEY((name),id)
)
SELECT id FROM mytable WHERE name='myname' AND id='myid' AND deleted=false;
In Cassandra, you can't filter on a non-primary-key column unless you create an index on it.
In Cassandra 3.0 and later, filtering on a non-primary-key column is allowed, but with unpredictable performance.
In Cassandra 3.0 and later, if you provide the full primary key (as in your query), you can run the query with ALLOW FILTERING and ignore the warning.
Otherwise, filter on the client side, or remove the deleted field and create another table:
Instead of setting deleted to true, move the row to another table, say mytable_deleted.
CREATE TABLE mytable_deleted (
name text,
id uuid,
PRIMARY KEY (name, id)
);
Now mytable holds only the non-deleted data, and mytable_deleted holds the deleted data.
or
Create an index on it:
The column deleted is a low-cardinality column, so keep this in mind:
A query on an indexed column in a large cluster typically requires collating responses from multiple data partitions. The query response slows down as more machines are added to the cluster. You can avoid a performance hit when looking for a row in a large partition by narrowing the search.
Read More : When not to use an index

Where and Order By Clauses in Cassandra CQL

I am new to NoSQL databases and have just started using Apache Cassandra. I created a simple table "emp" with a primary key on the "empno" column. It is the same simple table that ships in Oracle's default scott schema.
I loaded data using the COPY command and issued the query Select * from emp order by empno, but I was surprised that CQL did not allow ORDER BY on the empno column (which is the PK). Also, when I used a WHERE condition, it did not allow any inequality operations on the empno column (it said only EQ or IN conditions are allowed). It also did not allow WHERE and ORDER BY on any other column, as they were not part of the PK and did not have an index.
Can someone please tell me what I should do if I want to keep empno unique in the table and want query results sorted by empno?
(My version is:
cqlsh:demodb> show version
[cqlsh 5.0.1 | Cassandra 2.2.0 | CQL spec 3.3.0 | Native protocol v4]
)
There are two parts to a PRIMARY KEY in Cassandra:
partition key(s)
clustering key(s)
PRIMARY KEY (partitionKey1,clusteringKey1,clusteringKey2)
or
PRIMARY KEY ((partitionKey1,partitionKey2),clusteringKey1,clusteringKey2)
The partition key determines which node(s) your data is stored on. The clustering key determines the order of the data within your partition key.
In CQL, the ORDER BY clause is really only used to reverse the defined sort direction of your clustering order. As for the columns themselves, you can only specify the columns defined (and in that exact order...no skipping) in your CLUSTERING ORDER BY clause at table creation time. So you cannot pick arbitrary columns to order your result set at query-time.
Cassandra achieves performance by using the clustering keys to sort your data on-disk, thereby only returning ordered rows in a single read (no random reads). This is why you must take a query-based modeling approach (often duplicating your data into multiple query tables) with Cassandra. Know your queries ahead of time, and build your tables to serve them.
Select * from emp order by empno;
First of all, you need a WHERE clause. It's ok to query without it, if you're working with a relational database. With Cassandra, you should do your best to avoid unbound SELECT queries. Besides, Cassandra can only enforce a sort order within a partition, so querying without a WHERE clause won't return data in the order you want, anyway.
Secondly, as I mentioned above, you need to define clustering keys. If you want to order your result set by empno, then you must find another column to define as your partition key. Try something like this:
CREATE TABLE emp_by_dept (
empno text,
dept text,
name text,
PRIMARY KEY (dept,empno)
) WITH CLUSTERING ORDER BY (empno ASC);
Now, I can query employees by department, and they will be returned to me ordered by empno:
SELECT * FROM emp_by_dept WHERE dept='IT';
But to be clear, you will not be able to query every row in your table, and have it ordered by a single column. The only way to get meaningful order into your result sets, is first partition your data in a way that makes sense to your business case. Running an unbound SELECT will return all of your rows (assuming that the query doesn't time-out while trying to query every node in your cluster), but result set ordering can only be enforced within a partition. So you have to restrict by partition key in order for that to make any sense.
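As a rough illustration of why ordering only holds inside a partition, here is a toy Python model. It is a sketch, not how Cassandra is implemented: the token function (an MD5 hash standing in for a partitioner), the helper names, and the sample employees are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict

def token(partition_key):
    # Stand-in for a partitioner: hashing discards the natural order of keys.
    return hashlib.md5(partition_key.encode()).hexdigest()

emp_by_dept = defaultdict(list)  # dept -> [(empno, name), ...]

def insert(dept, empno, name):
    rows = emp_by_dept[dept]
    rows.append((empno, name))
    rows.sort()  # keep rows ordered by the clustering key, empno

def select_where_dept(dept):
    # Restricting by partition key returns rows already sorted by empno.
    return list(emp_by_dept.get(dept, []))

def select_all():
    # Partitions come back in token order; rows are sorted only inside each one.
    out = []
    for dept in sorted(emp_by_dept, key=token):
        out.extend((dept, empno, name) for empno, name in emp_by_dept[dept])
    return out

insert("IT", "7902", "FORD")
insert("IT", "7369", "SMITH")
insert("SALES", "7499", "ALLEN")
insert("SALES", "7844", "TURNER")
insert("SALES", "7521", "WARD")

print(select_where_dept("IT"))
print(select_all())
```

In this model, select_where_dept always yields rows in empno order, while select_all simply concatenates partitions, so its overall empno order depends on how the partition keys happen to hash.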
My apologies for self-promoting, but last year I wrote an article for DataStax called We Shall Have Order!, in which I addressed how to solve these types of problems. Give it a read and see if it helps.
Edit for additional questions:
From your answer I concluded 2 things about Cassandra:
(1) There is no way of getting a result set which is ordered only by a column that has been defined as unique.
(2) When we define a PK (partition key + clustering key), the results will always be ordered by the clustering columns within any fixed partition key (we must restrict to one partition key value). That means there is no need for an ORDER BY clause, since it cannot ever change the order of rows (the order in which rows are actually stored), i.e. ORDER BY is useless.
1) All PRIMARY KEYs in Cassandra are unique. There's no way to order your result set by your partition key. In my example, I order by empno (after partitioning by dept).
2) Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC.
I created an index on the "empno" column of the "emp" table, and it is still not allowing ORDER BY empno. So, what are indexes for? Are they only for searching records for a specific value of the index key?
You cannot order a result set by an indexed column. Secondary indexes are (not the same as their relational counterparts) really only useful for edge-case, analytics-based queries. They don't scale, so the general recommendation is not to use secondary indexes.
Ok, that simply means that one table cannot be used for getting
different result sets with different conditions and different sorting
order.
Correct.
Hence for each new requirement, we need to create a new table.
It means that if we have a billion rows in a table (say a Sales table), and we need the sum of sales (1) product-wise and (2) region-wise, then we will duplicate all those billion rows in 2 tables, one in clustering order of Product and the other in clustering order of Region. And if we also need to sum sales per Salesman_id, then we build a 3rd table, again storing all those billion rows? Is that sensible?
It's really up to you to decide how sensible it is. But lack of query flexibility is a drawback of Cassandra. To get around it you can keep creating query tables (I.E., trading disk for performance). But if it gets to a point where it becomes ungainly or difficult to manage, then it's time to think about whether or not Cassandra is really the right solution.
EDIT 20160321
Hi Aaron, you said above "Stopping short of saying that ORDER BY is useless, I'll say that its only real use is to switch your sort direction between ASC and DESC."
But I found that even that is not correct. Cassandra only allows ORDER BY in the same direction as we define in the "CLUSTERING ORDER BY" clause of CREATE TABLE. If we define ASC in that clause, it allows only ORDER BY ASC, and vice versa.
Without seeing an error message, it's hard to know what to tell you on that one. Although I have heard of queries with ORDER BY failing when you have too many rows stored in a partition.
ORDER BY also behaves a little oddly if you specify multiple columns to sort by. If I have two clustering columns defined, I can use ORDER BY on the first column in either direction. But as soon as I add the second column to the ORDER BY clause, my query only works if I specify both sort directions the same as the CLUSTERING ORDER BY definition, or both different from it. If I mix and match, I get this:
InvalidRequest: code=2200 [Invalid query] message="Unsupported order by relation"
I think that has to do with how the data is stored on-disk. Otherwise Cassandra would have more work to do in preparing result sets. Whereas if it requires everything to either match or mirror the direction(s) specified in the CLUSTERING ORDER BY, it can just relay a sequential read from disk. So it's probably best to only use a single column in your ORDER BY clause, for more predictable results.
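The "match or mirror" behaviour described above can be written down as a small check. This is only a sketch of the rule as observed in this answer, not Cassandra's validation code; the function name directions_allowed and the columns c and d are hypothetical, and it assumes the ORDER BY columns are already valid clustering columns.

```python
def directions_allowed(clustering_order, order_by):
    """Both arguments are lists of (column, "ASC" or "DESC") pairs.

    Returns True when every ORDER BY direction matches the declared
    CLUSTERING ORDER BY direction, or when every one is reversed.
    """
    declared = dict(clustering_order)
    same = all(declared[col] == direction for col, direction in order_by)
    mirrored = all(declared[col] != direction for col, direction in order_by)
    return same or mirrored

# Hypothetical table declared WITH CLUSTERING ORDER BY (c ASC, d DESC).
clustering = [("c", "ASC"), ("d", "DESC")]

print(directions_allowed(clustering, [("c", "DESC")]))
print(directions_allowed(clustering, [("c", "DESC"), ("d", "ASC")]))
print(directions_allowed(clustering, [("c", "ASC"), ("d", "ASC")]))
```

Under this rule, a fully mirrored ORDER BY is just a backwards read of the same on-disk order, while a mixed one would need a re-sort.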
Adding a redux answer as the accepted one is quite long.
ORDER BY is currently only supported on the clustering columns of the PRIMARY KEY, and only when the partition key is restricted by an equality or an IN operator in the WHERE clause.
That is, if you have your primary key defined like this:
PRIMARY KEY ((a,b),c,d)
then you will be able to use ORDER BY when, and only when, your query has a WHERE clause with all the partition key columns restricted by either an equality operator (=) or an IN operator, such as:
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c,d;
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY c;
These two queries are the only valid ones.
This query, by contrast, would not work:
SELECT * FROM emp WHERE a = 1 AND b = 'India' ORDER BY d,c;
because ORDER BY currently only supports ordering by columns in their declared order in the PRIMARY KEY. In the primary key definition, c is declared before d, and the query violates that order by placing d first.
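The two conditions in this answer can be sketched as a small Python predicate. It models only the rule as stated here (partition key fully restricted by = or IN, ORDER BY columns forming a leading run of the clustering columns) and is not taken from Cassandra itself; the function name order_by_valid and its parameters are assumptions for illustration.

```python
def order_by_valid(partition_key, clustering_columns, restricted, order_by):
    """Check the ORDER BY rule for PRIMARY KEY ((partition_key...), clustering...).

    restricted: columns restricted with = or IN in the WHERE clause.
    order_by:   columns listed in the ORDER BY clause, in query order.
    """
    # Every partition key column must be restricted by = or IN.
    if not set(partition_key) <= set(restricted):
        return False
    # ORDER BY must list clustering columns in declared order, with no skipping.
    return list(order_by) == list(clustering_columns[:len(order_by)])

pk, ck = ["a", "b"], ["c", "d"]
print(order_by_valid(pk, ck, {"a", "b"}, ["c", "d"]))
print(order_by_valid(pk, ck, {"a", "b"}, ["c"]))
print(order_by_valid(pk, ck, {"a", "b"}, ["d", "c"]))
```

The first two calls correspond to the two valid queries above, and the third to the rejected ORDER BY d,c.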

Why cassandra/cql restrict to use where clause on a column that not indexed?

I have a table as follows in Cassandra 2.0.8:
CREATE TABLE emp (
empid int,
deptid int,
first_name text,
last_name text,
PRIMARY KEY (empid, deptid)
)
When I try to search by: "select * from emp where first_name='John';"
the cql shell says:
"Bad Request: No indexed columns present in by-columns clause with Equal operator"
I searched for the issue, and everywhere it says to add a secondary index on the column 'first_name'.
But I need to know the exact reason why that column needs to be indexed.
The only thing I can think of is performance.
Are there any other reasons?
Cassandra does not support searching by an arbitrary column, because that would involve scanning all the rows, which is not supported.
The data are internally organised into something which one can compare to HashMap[X, SortedMap[Y, Z]]. The key of the outer map is a partition key value and the key of the inner map is a kind of concatenation of all clustering columns values and a name of some regular column.
Unless you have an index on a column, you need to provide full (preferred) or partial path to the data you want to collect with the query. Therefore, you should design your schema so that queries contain primary key value and some range on clustering columns.
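The HashMap[X, SortedMap[Y, Z]] comparison can be mimicked in a few lines of Python. This is a toy model of the analogy only, not Cassandra's storage engine; the helper names put, read_partition and find_by_value, and the sample employees, are invented for the example.

```python
from bisect import bisect_left

# Outer dict: partition key (empid) -> inner list of cells sorted by
# (clustering value deptid, regular column name), mimicking the SortedMap.
store = {}

def put(empid, deptid, column, value):
    cells = store.setdefault(empid, [])
    key = (deptid, column)
    i = bisect_left(cells, (key,))
    if i < len(cells) and cells[i][0] == key:
        cells[i] = (key, value)          # overwrite an existing cell
    else:
        cells.insert(i, (key, value))    # keep the inner list sorted

def read_partition(empid, deptid):
    """Follow the path: one dict lookup, then a seek inside the sorted cells."""
    cells = store.get(empid, [])
    i = bisect_left(cells, ((deptid,),))
    out = []
    while i < len(cells) and cells[i][0][0] == deptid:
        out.append(cells[i])
        i += 1
    return out

def find_by_value(column, value):
    """No path to follow: every cell of every partition has to be inspected."""
    hits, visited = [], 0
    for empid, cells in store.items():
        for (deptid, col), val in cells:
            visited += 1
            if col == column and val == value:
                hits.append((empid, deptid))
    return hits, visited

put(1, 10, "first_name", "John")
put(1, 10, "last_name", "Doe")
put(2, 20, "first_name", "Jane")
put(2, 20, "last_name", "Roe")

print(read_partition(1, 10))
print(find_by_value("first_name", "John"))
```

Reading by empid and deptid touches only the cells of one partition, whereas searching by first_name visits every cell in the store, which is the full scan the error message is guarding against.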
You may read about what is allowed and what is not here
Alternatively you can create an index in Cassandra, but that will hamper your write performance.
