Kudu auto-generated key column - apache-kudu

I am trying to make a custom auto-generated/incremented key in Kudu which will keep increasing its value, starting from a seed that is zero by default.
It's pretty inefficient to go through all records and increment a counter to get a row count.
Does Kudu provide the row count out of the box?
If not, what is the best way to get it?

Apache Kudu does not support AUTO_INCREMENT columns at this time. There is a FAQ entry on the Kudu web site that mentions this.
Kudu is a distributed storage engine that is focused on being a good analytical store (OLAP) as opposed to being a good transactional store (OLTP) and it shows in the features we've prioritized so far. This is a good example of that.
Because we're not trying to be an OLTP store, Kudu doesn't yet implement multi-row or multi-node transactions, and so a simple incrementing primary key counter would be difficult to implement correctly at this time -- especially for example when the table is hash-partitioned on the primary key. We'd need a central transaction coordinator that doesn't currently exist.
To answer your second question, getting a row count is currently a little expensive in Kudu as it involves scanning the index column on each tablet and summing up the total count. Apache Impala / Apache Spark SQL will do this transparently for you if you do a SELECT COUNT(*) from kudu_table but I wouldn't currently rely on that for the purposes of assigning a new ID, since Impala currently allows scanning from a slightly stale Kudu replica thus potentially being off on the row count.
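For example, counting through Impala looks like this (subject to the staleness caveat above):

SELECT COUNT(*) FROM kudu_table;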
The best thing to do right now is rely on some external mechanism to assign row IDs.
Source: I am a PMC member on Apache Kudu.

In addition to @JoeyVanHalen's answer, there is another option, which is also explained here on SO. You can use row_number() to create an ID that resembles a counter but does not force you into cumbersome nesting or anything else if you only want a counter-like column.
Straightforward, it looks like this:
SELECT
    row_number() OVER (PARTITION BY "dummy" ORDER BY "dummy") AS incremented_id
FROM some_table
row_number() creates an incremented number over a partition. Unlike rank(), row_number() guarantees a unique increment even if your partition contains duplicates.
PARTITION BY "dummy" treats a temporary "dummy" column at runtime as a single partition spanning the entire table. Thus, the increment happens across all records.
ORDER BY follows the same "dummy" logic.
Of course, you can also replace "dummy" with whatever columns suit your table logic; a sketch of this follows after the result table below.
The result looks like:
-- ID = incremented_id
| ID | some_content |
|----|--------------|
| 1  | "a"          |
| 2  | "b"          |
| 3  | "c"          |
| 4  | "d"          |
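If "dummy" is replaced with real columns, the same pattern numbers rows within each group. A hypothetical sketch (client_id and created_at are illustrative names, not taken from any table above):

SELECT
    row_number() OVER (PARTITION BY client_id ORDER BY created_at DESC) AS incremented_id,
    some_content
FROM some_table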

There are several ways to work around this (a sketch follows after this list):
Use Impala's uuid() function to generate a unique ID.
Convert the uuid() to a BIGINT (via hashing, etc.).
Use Impala's unix_timestamp() to generate a BIGINT value representing the current date and time as a delta from the Unix epoch (this might cause some collisions, so better add another column if you're going to use this as a primary key).
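A minimal sketch of the first and third options in Impala SQL (the table and column names are illustrative; fnv_hash() is Impala's built-in hash function, used here to derive a BIGINT from another column):

-- Option 1: a string UUID as the key
INSERT INTO kudu_table (id, payload)
SELECT uuid(), payload FROM staging_table;

-- Option 3: a time-based BIGINT key plus a hash column to reduce collisions
INSERT INTO kudu_table_ts (ts_id, payload_hash, payload)
SELECT unix_timestamp(), fnv_hash(payload), payload FROM staging_table;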

Related

Cloud Spanner complex primary key and queries

I'm playing with Cloud Spanner and I created an imgur clone with the schema as follows:
CREATE TABLE Images (
    id STRING(36) NOT NULL,
    createdAt TIMESTAMP,
    caption STRING(1024),
    fileType STRING(10)
) PRIMARY KEY (id, createdAt DESC)
The id is a version 4 UUID as the GCP documentation specifies so that I avoid hotspots. The createdAt is a timestamp when an image is first created. I have my PRIMARY KEY defined as (id, createdAt DESC) so that I can more easily query by latest added images.
What I don't understand is what happens if I want to get a single image using only SELECT * FROM Images WHERE id = 'some UUID'. Will Spanner still search by key in an efficient way, meaning getting the information from the server that stores the specific key in its key range, even though I only specified a part of the primary key?
In your simple example, yes. It will try to come up with an efficient execution plan, which may include using an index (automatically created for PKs), even though your predicate is on just one of the two columns of the composite PK, because it is on the first column. If your predicate was just on createdAt, then it will scan the table. It would be far more expensive to find matches for col2 in your composite PK of (col1, col2) than it is to just scan col2.
This assumes there's enough data to matter. For example, if you have 42 rows, it really won't matter how you execute the query or what predicates were provided; the number of I/O requests (often the most expensive part of a query) will be the same.
In general, Spanner tries to pick the index it thinks will be most efficient. The actual physical steps don't work like that but conceptually, it's a reasonable way to think about it.
Whether an index is helpful or not depends on a few things, and whether it gets picked or not also has dependencies: does it have statistics, are the statistics correct/fresh, is it making correct estimates on row counts, etc. Composite indexes/keys are just a bit more interesting, as noted above.
Just make sure you always test with enough data (closely matching your production environment if possible).
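As a side note, if you ever need efficient lookups on createdAt alone, a secondary index avoids the table scan described above. A minimal sketch (the index name is illustrative):

CREATE INDEX ImagesByCreatedAt ON Images (createdAt DESC)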

Cassandra get latest entry for each element contained within IN clause

So, I have a Cassandra CQL statement that looks like this:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID = ? AND DATA_SCHEMA = ?
This table is sorted by a timestamp column.
The functionality is fronted by a REST API, and one of the filter parameters they can specify is to get only the most recent row, in which case I append "LIMIT 1" to the end of the CQL statement, since it's ordered by the timestamp column in descending order. What I would like to do is allow them to specify multiple device ids to get back the latest entries for. So, my question is: is there any way to do something like this in Cassandra:
SELECT * FROM DATA WHERE APPLICATION_ID = ? AND PARTNER_ID = ? AND LOCATION_ID = ? AND DEVICE_ID IN ? AND DATA_SCHEMA = ?
and still use something like "LIMIT 1" to only get back the latest row for each device id? Or, will I simply have to execute a separate CQL statement for each device to get the latest row for each of them?
FWIW, the table's composite key looks like this:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema), activity_timestamp)
) WITH CLUSTERING ORDER BY (activity_timestamp DESC);
IN is not recommended when there are a lot of parameters for it; under the hood it makes requests to multiple partitions anyway, and it puts pressure on the coordinator node.
Not that you can't do it. It is perfectly legal, but most of the time it's not performant and is not suggested. If you specify LIMIT, it applies to the whole statement; basically, you can't pick just the first item out of each partition. The simplest option would be to issue multiple queries to the cluster (every element in IN would become one query) and put a LIMIT 1 on every one of them.
To be honest, this was my solution in a lot of projects and it works pretty much fine. Basically, the coordinator would go to multiple nodes under the hood anyway, but it would also have to do more work to gather all the results for you, and might run into timeouts, etc.
In short, it's far better for the cluster, and more performant, if the client asks multiple times (using multiple coordinators with smaller requests) than to make a single coordinator do all the work.
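Concretely, each per-device query would look like this (one concrete device id bound per query; the queries can be issued concurrently from the client):

SELECT * FROM data
WHERE application_id = ? AND partner_id = ? AND location_id = ?
  AND device_id = ? AND data_schema = ?
LIMIT 1;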
This is all in case you can't afford more disk space for your cluster.
Usual Cassandra solution
Data in Cassandra should be modeled to be ready for your queries (query-first design). So basically you would have to have one additional table with the same partition key you have now, minus the clustering column activity_timestamp, i.e.:
PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
The double parentheses are intentional.
Every time you write to your table, you would also write the data to latest_entry (the table without activity_timestamp). Then you can issue the query that you need with IN, and since this table contains only the latest entry per partition key, you don't have to use LIMIT 1. That would be the usual solution in Cassandra.
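A sketch of such a table (the column types here are assumptions, since the original DDL isn't shown):

CREATE TABLE latest_entry (
    application_id text,
    partner_id text,
    location_id text,
    device_id text,
    data_schema text,
    activity_timestamp timestamp,
    payload text,
    PRIMARY KEY ((application_id, partner_id, location_id, device_id, data_schema))
);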
If you are afraid of the additional writes, don't worry; they are inexpensive and CPU-bound. With Cassandra it's always "bring on the writes", I guess :)
Basically it's up to you:
multiple queries - a bit of refactoring, no additional space cost
new schema - additional inserts when writing, additional space cost
Your table definition is not suitable for such use of the IN clause. It is supported only on the last column of the partition key or the last column of the clustering key. So you can:
swap your two last fields of the primary key
use one query for each device id

How to copy data from a Cassandra table to another structure for better performance

In several places it's advised to design our Cassandra tables according to the queries we are going to perform on them. In this article by DataScale they state this:
The truth is that having many similar tables with similar data is a good thing in Cassandra. Limit the primary key to exactly what you’ll be searching with. If you plan on searching the data with a similar, but different criteria, then make it a separate table. There is no drawback for having the same data stored differently. Duplication of data is your friend in Cassandra.
[...]
If you need to store the same piece of data in 14 different tables, then write it out 14 times. There isn’t a handicap against multiple writes.
I have understood this, and now my question is: provided that I have an existing table, say
CREATE TABLE invoices (
    id_invoice int PRIMARY KEY,
    year int,
    id_client int,
    type_invoice text
)
But I want to query by year and type instead, so I'd like to have something like
CREATE TABLE invoices_yr (
    id_invoice int,
    year int,
    id_client int,
    type_invoice text,
    PRIMARY KEY (type_invoice, year)
)
With type_invoice as the partition key and year as the clustering key, what's the preferred way to copy the data from one table to the other in order to run optimized queries later on?
My Cassandra version:
user#cqlsh> show version;
[cqlsh 5.0.1 | Cassandra 3.5.0 | CQL spec 3.4.0 | Native protocol v4]
You can use the cqlsh COPY command.
To copy your invoices data into a CSV file, use:
COPY invoices(id_invoice, year, id_client, type_invoice) TO 'invoices.csv';
And to copy back from the CSV file into the table, in your case invoices_yr, use:
COPY invoices_yr(id_invoice, year, id_client, type_invoice) FROM 'invoices.csv';
If you have huge data, you can use the sstable writer to write and sstableloader to load the data faster.
http://www.datastax.com/dev/blog/using-the-cassandra-bulk-loader-updated
To echo what was said about the COPY command, it is a great solution for something like this.
However, I will disagree with what was said about the Bulk Loader, as it is infinitely harder to use. Specifically, because you need to run it on every node (whereas COPY needs to be run on only a single node).
To help COPY scale for larger data sets, you can use the PAGETIMEOUT and PAGESIZE parameters.
COPY invoices(id_invoice, year, id_client, type_invoice)
TO 'invoices.csv' WITH PAGETIMEOUT=40 AND PAGESIZE=20;
Using these parameters appropriately, I have used COPY to successfully export/import 370 million rows before.
For more info, check out this article titled: New options and better performance in cqlsh copy.
An alternative to using COPY command (see other answers for examples) or Spark to migrate data is to create a materialized view to do the denormalization for you.
CREATE MATERIALIZED VIEW invoices_yr AS
    SELECT * FROM invoices
    WHERE type_invoice IS NOT NULL AND year IS NOT NULL AND id_client IS NOT NULL
    PRIMARY KEY ((type_invoice), year, id_client)
    WITH CLUSTERING ORDER BY (year DESC);
Cassandra will fill the table for you, so you won't have to migrate the data yourself. With 3.5, be aware that repairs don't work well (see CASSANDRA-12888).
Note: Materialized Views are probably not the best idea to use; they have since been changed to "experimental" status.

How to avoid key lookup

My database needs to have GUIDs as its primary key, as it's synced with multiple offline databases, so an IDENTITY column was not an option because it would lead to collisions during syncing.
Since GUIDs would have led to high table fragmentation, we opted to add another column, CREATEDDATETIME (a timestamp), to all our tables and make CREATEDDATETIME the CLUSTERED index; the GUID column has been made a NON-CLUSTERED index.
The issue is that CREATEDDATETIME is hardly, if ever, used in a WHERE clause, so almost all queries show in their execution plan a KEY LOOKUP on the clustered index CREATEDDATETIME to get the data. I was wondering if this performance can be improved in one of these two ways:
For all non-clustered indexes, such as the one on the GUID column, also INCLUDE the CREATEDDATETIME column? OR
Make every non-clustered index a composite key, making sure the clustered index key is part of it, i.e. GUID + CREATEDDATETIME?
Which one might be better?
Key lookups occur when the information that you ultimately need is not available at the leaf level and so it must go to the clustered index to obtain it. Imagine the following query:
select a, b, c
from dbo.yourTable
where GUID = <some guid>;
If a, b, and c are included columns in the index, the key lookup can be avoided. Note that the clustering key is automatically an include column in every non-clustered index (which makes sense - how else would it be able to do the key lookup?). So, include columns based on what is actually being selected and I think you'll see the key lookups disappear from your query plans.
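A minimal sketch of such a covering index in T-SQL (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_yourTable_GUID
    ON dbo.yourTable (GUID)
    INCLUDE (a, b, c);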
Since you mentioned SQL Azure in the note above, it is safe to say that you will have to test different approaches. You have listed two, and there might be others depending on your application (data, query profiles, and index coverage).
As you may already know, fragmentation affects selects and inserts differently, so your app's needs will dictate what choices you make. While you're optimizing for lookups, your inserts might become unbearably slow.
Both logical and physical fragmentation could affect option 1, whereas option 2 looks like plain overhead with no clear conditions under which plans would benefit from it. The plan optimization techniques shown in the Azure manuals can help there.
For fragmentation testing, I use this query that someone recommended a while back:
SELECT OBJECT_NAME(S.[object_id]) AS ObjectName,
       I.name AS IndexName,
       ROUND(S.avg_fragmentation_in_percent, 2) AS FragPercent,
       S.page_count AS PageCount,
       ROUND(S.avg_page_space_used_in_percent, 2) AS PageDensity
FROM sys.dm_db_index_physical_stats
     (DB_ID('MyDatabase'), NULL, NULL, NULL, 'DETAILED') AS S
CROSS APPLY sys.indexes AS I
WHERE I.object_id = S.object_id
  AND I.index_id = S.index_id
  AND S.index_level = 0;

An Approach to Cassandra Data Model

Please note that I am using NoSQL for the first time, and pretty much every concept in this NoSQL world is new to me, having come from RDBMS for a long time!
In one of my heavily used applications, I want to use NoSQL for some part of the data and move away from MySQL where a transactional/relational model doesn't make sense. What I would get is the A and P of CAP [Availability and Partition tolerance].
The present data model is simple as this
ID (integer) | ENTITY_ID (integer) | ENTITY_TYPE (String) | ENTITY_DATA (Text) | CREATED_ON (Date) | VERSION (interger)|
We can safely assume that this part of the application is similar to logging of activity!
I would like to move this to NoSQL, per my requirements, and keep it separate from the performance-oriented MySQL DB.
Cassandra says everything in it is a simple Map<Key, Value> type! Thinking in terms of the map level,
I can use ENTITY_ID|ENTITY_TYPE|ENTITY_APP as key and store the rest of the data in values!
After reading about User Defined Types in Cassandra, could I use a UserDefinedType as the value, essentially giving me one key and multiple values? Otherwise, use plain columns without a UserDefinedType. One idea is to use the same model for different applications across systems, where simple logging/activity data can be pushed into it, since the key varies from application to application, and within an application each entity will be unique!
There is no application/business function that accesses this data without the key; in simple terms, there is no requirement to fetch data randomly!
References: http://www.ebaytechblog.com/2012/07/16/cassandra-data-modeling-best-practices-part-1/
Let me explain the Cassandra data model a bit (or at least a part of it). You create tables like so:
create table event(
    id uuid,
    timestamp timeuuid,
    some_column text,
    some_column2 list<text>,
    some_column3 map<text, text>,
    some_column4 map<text, text>,
    primary key (id, timestamp, ...)
);
Note the primary key. There are multiple columns specified. The first column is the partition key. All "rows" in a partition are stored together. Inside a partition, data is ordered by the second, then third, then fourth... keys in the primary key. These are called clustering keys.
To query, you almost always hit a partition (by specifying equality in the where clause). Any further filters in your query are then done on the selected partition. If you don't specify a partition key, you make a cluster-wide query, which may be slow or, most likely, time out. After hitting the partition, you can filter with matches on subsequent keys in order, with a range query on the last clustering key specified in your query. Anyway, that's all about querying.
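For instance, a typical query hits one partition and ranges over the clustering key (the UUID and dates are illustrative):

SELECT * FROM event
WHERE id = 123e4567-e89b-12d3-a456-426614174000
  AND timestamp > maxTimeuuid('2016-01-01 00:00+0000')
  AND timestamp < minTimeuuid('2016-02-01 00:00+0000');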
In terms of structure, you have a few column types. Some primitives like text, int, etc., but also three collections: sets, lists and maps. Yes, maps. UDTs are typically more useful when used in collections, e.g. a Person may have a map of addresses: map<text, address>. You would typically store info in columns if you need to query on it, or index on it, or you know each row will have those columns. You're also free to use a map column, which would let you store "arbitrary" key-value data; that is what it seems you're looking to do.
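A short sketch of a UDT inside a map (the type and field names are illustrative; frozen is required for UDTs inside collections):

CREATE TYPE address (
    street text,
    city text,
    zip_code text
);

CREATE TABLE person (
    id uuid PRIMARY KEY,
    name text,
    addresses map<text, frozen<address>>  -- e.g. 'home' -> { ... }
);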
One thing to watch out for: your primary key is unique per record. If you do another insert with the same PK, you won't get an error; it'll simply overwrite the existing data. Everything in Cassandra is an upsert. And you won't be able to change the value of any column that's in the primary key for any row.
You mentioned querying is not a factor. However, if you do find yourself needing to do aggregations, you should check out Apache Spark, which works very well with Cassandra (and also supports relational data sources... so you should be able to aggregate data across MySQL and Cassandra for analytics).
Lastly, if your data is time-series log data, Cassandra is a very good choice.
