We are supporting multi-tenancy. Is it better to have the customer id as part of the partition key or as a clustering column?
Having the customer id as part of the partition key ensures that one customer's data cannot be viewed by another customer.
Having the customer id as a clustering column means developers have to ensure that the customer id appears in every WHERE clause. It also takes up more space.
Is one approach better than the other?
There will be an impact on time-series data: since data will be partitioned by customer id, a super user with access to all customers will not be able to view time-series data across customers correctly.
Thanks
Have customer_id as a part of your partition key. You'll need this to ensure that each customer's data is stored together.
However, make sure that customer_id is not the only partition key. If you have a time series data set with millions of rows, you won't want to attempt to store them all in the same partition (it'll get too big).
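For illustration, here is a minimal sketch of that layout; the table name, day bucket, and value column are assumptions, not from the question:

CREATE TABLE readings_by_customer (
    customer_id uuid,
    day text,                 -- time bucket, e.g. '2015-06-01', keeps partitions bounded
    reading_time timestamp,
    value double,
    PRIMARY KEY ((customer_id, day), reading_time)
) WITH CLUSTERING ORDER BY (reading_time DESC);

-- One customer's readings for one day, newest first:
SELECT * FROM readings_by_customer
WHERE customer_id = 62c36092-82a1-3a00-93d1-46196ee77204 AND day = '2015-06-01';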
There will be an impact on time-series data: since data will be partitioned by customer id, a super user with access to all customers will not be able to view time-series data across customers correctly.
This comes back to designing your tables with a query-based approach. If you have a query requirement to support queries on time series data for all (or multiple) customers at once, then you may need a table designed to support that.
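As a sketch of such a table (names are illustrative, continuing the hypothetical schema above), the super-user view could be served by a second table partitioned by the time bucket alone, written to alongside the per-customer table:

CREATE TABLE readings_by_day (
    day text,                 -- the time bucket alone is the partition key
    reading_time timestamp,
    customer_id uuid,
    value double,
    PRIMARY KEY (day, reading_time, customer_id)
) WITH CLUSTERING ORDER BY (reading_time DESC, customer_id ASC);

-- All customers' readings for one day:
SELECT * FROM readings_by_day WHERE day = '2015-06-01';

The application writes each reading to both tables; the extra write buys the cross-customer read.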
I have a customer_info table in Cassandra. It will have the following columns:
uuid (the primary key)
customer_id
amount
other fields ...
There is a $100 transaction limit for each customer during a 365-day period.
I have the following two options:
Select all records for a particular customer_id from the customer_info table. Do the calculation in memory in application code; if the $100 transaction limit will not be exceeded, do the insert or update in the customer_info table.
Maintain a new table, customer_limit, consisting of customer_id and limit fields. Before any CRUD operation on customer_info, query the customer_limit table for the limit and, based on it, perform the CRUD operation on customer_info.
In terms of maintenance and faster reads/writes, which option is best suited?
I would use 2 tables for this purpose.
table-2 would be a counter table, with the limit as the counter value. You should always query table-2 before inserting into the customer_info table.
Refer to Counters here. They make concurrent increments easy, avoiding read-before-write in application code.
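A minimal sketch of such a counter table, assuming the running total is tracked in whole cents (counter columns only support integer increments, and the check-then-write across two tables is not atomic):

CREATE TABLE customer_limit (
    customer_id uuid PRIMARY KEY,
    amount_spent counter      -- running total for the period, in cents
);

-- Atomically add a $25.00 transaction to the running total:
UPDATE customer_limit SET amount_spent = amount_spent + 2500
WHERE customer_id = 62c36092-82a1-3a00-93d1-46196ee77204;

-- Check before inserting into customer_info:
SELECT amount_spent FROM customer_limit
WHERE customer_id = 62c36092-82a1-3a00-93d1-46196ee77204;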
Also, please read about partition and clustering key concepts. Your choice of keys for customer_info is not very good.
I think you must keep the details of each transaction, because you need a "moving" window of fixed width (365 days) that advances with each transaction.
You could create a transactions table with the following primary key fields pair:
(customer_id, transaction_date)
By clustering this table in descending order by date, you can always query for the last 365 days efficiently, every day.
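A sketch of that table; the amount column is an assumption, and note that with a plain timestamp as the clustering column, two transactions sharing the same millisecond would collide (a timeuuid would avoid that):

CREATE TABLE transactions (
    customer_id uuid,
    transaction_date timestamp,
    amount decimal,
    PRIMARY KEY (customer_id, transaction_date)
) WITH CLUSTERING ORDER BY (transaction_date DESC);

-- All transactions in the trailing 365 days; the application sums the amounts:
SELECT amount FROM transactions
WHERE customer_id = 62c36092-82a1-3a00-93d1-46196ee77204
  AND transaction_date > '2014-08-14 00:00:00+0000';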
I have a massively huge table, with hundreds of millions of records, and I mean to add a field to this table whose value would be repeated for millions of records. I don't know how to model this efficiently in Cassandra. Allow me to elaborate:
I have a generic table:
CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    PRIMARY KEY ((key, key2), time)
)
This table has 700,000,000+ records.
I want to create a field in this table, named source. This field indicates where the record was obtained from, since the software has many ways of receiving the information in the readings table. Possible values for this field are "XML: path\to\file.xml", "Direct import from the X database", or even "Manually added". I want this to be a descriptive field, used exclusively to allow later maintenance of the database where we want to manipulate only records from a given source.
The queries I want to run that I can't now are:
Which records on the readings table were gotten from a given source?
What is the source of a given record?
A solution would be for me to create a table such as:
CREATE TABLE readings_per_source (
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
)
which would allow me to execute the first query, but would also mean creating 700,000,000+ new records in my database with a lot of information, taking up a lot of unnecessary storage space, since tens of millions of these records would have the same value for source.
If this was a relational environment, I would create a source_id field on the readings table and a source table with id (PK) and name fields, that would mean storing only an additional integer for each row on the readings table and a new table with as many records as different sources there was.
How does one go about modelling this in Cassandra?
Your schema
CREATE TABLE readings_per_source (
    source text,
    key int,
    key2 int,
    time timestamp,
    PRIMARY KEY (source, key, key2, time)
)
is a very bad idea, because source is the partition key and you can have millions of records sharing the same source, i.e. a very, very wide partition --> hot spots.
For your second query, What is the source of a given record?, it is quite trivial if you access the data using the record's full primary key ((key, key2), time). The source column can be added as just a regular column on the table:
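Concretely (the literal key values are placeholders):

ALTER TABLE readings ADD source text;

-- Fetch the source of one record by its full primary key:
SELECT source FROM readings
WHERE key = 1 AND key2 = 1 AND time = '2015-06-01 00:00:00+0000';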
For the first query, Which records on the readings table were gotten from a given source?, it is trickier. The idea here is to fetch all the records having the same source.
Do you realize that this query can potentially return tens of millions of records?
If that's what you want to do, there is a solution: use the new SASI secondary index (read my blog post for all the details) and create an index on the source column:
CREATE TABLE readings (
    key int,
    key2 int,
    time timestamp,
    name text,
    source text,
    PRIMARY KEY ((key, key2), time)
)

CREATE CUSTOM INDEX source_idx ON readings(source)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'PREFIX',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};
Then, to fetch all records having the same source, use the server-side paging feature of the Java driver (or any other DataStax driver):
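For example, assuming the index above is in place (SASI requires Cassandra 3.4+; the source values are placeholders):

-- Exact match on source:
SELECT key, key2, time FROM readings WHERE source = 'Manually added';

-- PREFIX mode also supports prefix searches with LIKE:
SELECT key, key2, time FROM readings WHERE source LIKE 'XML:%';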
http://www.datastax.com/2015/03/how-to-do-joins-in-apache-cassandra-and-datastax-enterprise is a pretty good article on how to go about joining tables in Cassandra.
Normalized data will always take up less storage than de-normalized (flat) data (provided the related data is larger than the key used to join the tables together), but it requires joins, which take more horsepower to compute during queries.
There's always a trade-off. One trade-off concerns state with fully normalized data; an example is the customer who changes addresses. In a fully normalized schema, once the address change is made, all invoices for the customer, past and present, show the new address. This isn't always desirable.
Often it's desirable to partially de-normalize, to preserve historic state on records where it's important to show the state of the data at a given time, such as on invoices. In that case you'd store a copy of the customer's address data on the invoice at the time of invoice creation.
This is especially important for pricing and taxes as well. You want the price and tax stored with the invoice so you can show what the customer paid at the time the invoice was created; then, when accounting runs monthly, yearly, and longer-range numbers, the prices on a given invoice are correct for the date on the invoice, even though the prices of the products may have changed since. Otherwise you have an accounting nightmare!
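As a sketch of that partial de-normalization in Cassandra terms (table and column names are illustrative assumptions):

CREATE TABLE invoices (
    customer_id uuid,
    invoice_id timeuuid,
    billing_address text,   -- copied from the customer record at invoice time
    unit_price decimal,     -- the price actually charged, not the current catalog price
    tax decimal,
    PRIMARY KEY (customer_id, invoice_id)
);

Later address or price changes touch only the customer and catalog tables; the invoice keeps its historical state.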
There is a lot more to consider than simply storage space when deciding how to normalize/de-normalize a schema.
Sorry for rambling...
I want to store user-related data, and each chunk of data has its own id. The id is the user id plus the id of the specific bit of information. The data is also time-based, and reads will be, say, 50% as many as writes. I am thinking of using one of the following:
create table d1 (id text, time bigint, data bigint, primary key (id, time)) with clustering order by (time desc)
Here the id will be userid+dataid, all in one string. This means that data for one user will be spread across different nodes.
create table d2 (id text, dataid text, time bigint, data bigint, primary key (id, dataid, time)) with clustering order by (time desc)
Here the id = userid and dataid is a clustering column. This means all data for a user will be on one node.
So for one request from this user, I might have to query 2-3 or more of these rows.
Which will be the best approach in terms of performance?
I assume that you're interested in some sort of time lookup since you're mentioning time series. The answer depends on what type of query you're pursuing.
a) If you want to query all rows for a given user sorted by time, your second schema is better by far. In the first schema your data is partitioned by user and dataid together, so you give up the ability to run fast queries per user: the data for even a single user will be spread across the cluster (assuming you use the default partitioner).
b) If you want to query by user and dataid sorted by time (to get the history of changes for a single piece of data, say), schema 1 is preferable, since its partition key matches the query exactly.
c) If you want to query by time globally, both schemas will be slow, because you'll need to scan multiple users (the whole cluster) to satisfy the query.
Lastly, writes will be comparatively fast no matter what schema you pick.
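To make the trade-off concrete, here is what each schema can answer efficiently, using the table definitions above (the 'user1:data42' key format for d1 is an illustrative convention, not from the question):

-- d2: everything for one user (rows come back grouped by dataid, newest first within each):
SELECT * FROM d2 WHERE id = 'user1';

-- d2: history of one piece of data for one user:
SELECT * FROM d2 WHERE id = 'user1' AND dataid = 'data42';

-- d1: only the combined key is addressable:
SELECT * FROM d1 WHERE id = 'user1:data42';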
I am new to Azure Tables and have read a lot of articles, but would like some reassurance on the above, given it's fundamental.
I have data which is similar to this:
CustomerId, GUID
TripId, GUID
JourneyStep, GUID
Time, DateTime
AverageSpeed, int
Based on what I have read, is CustomerId a good PartitionKey? Where I become stuck is that the combination of CustomerId and TripId does not make a unique row. My justification for TripId as the RowKey is that every query will be a dataset based on CustomerId and TripId.
Just for context, the CustomerId is clearly unique, the TripId represents one journey in a vehicle and within that journey the JourneyStep represents a unit within that Trip which may be 10 steps or 1000.
The intention is to aggregate the data into further tables, with each level being used for a different purpose. At the most aggregated level, the customer will be given some scores.
The amount of data will obviously be huge, so I need to think about query performance from the outset.
Updated:
As requested: the solution is for vehicle telematics, so think of yourself in your own car. A black box ships data to a server, which in turn passes it to Azure Tables. In relational DB terms, I would have a customer table and a trip table with a foreign key back to the customer table.
The TripId is auto-generated by the black box. TripId does not need to be ordered by date/time from a query point of view, though it may be relevant from a query performance point of view.
Queries will be split into two:
Display a map of a single journey for each customer: filter by customer and then by trip, then iterate over each row (JourneyStep) to draw the map.
Per customer, I will score each trip and then retrieve trips for, say, the last month to aggregate a score. I do have a SQL database to enrich the data with client records etc., but for the volume data (the trip data) I wish to use Azure Tables.
The aggregates from the second query will probably be stored in a separate table. So if someone made 10 trips in one month, I would run the second query to score each trip, then produce a score for all trips that month, and store both answers: potentially a table of trip aggregates and a table of monthly aggregates.
The thing about the PartitionKey is that it represents a logical grouping; you cannot, for example, perform a batch transaction spanning multiple partition keys. Similarly, rows with the same PartitionKey are likely to be stored on the same server, making it quick to retrieve all the data for a given PartitionKey.
As such, it is important to look at your domain and figure out what aggregate you are likely to work with.
If I understand your domain model correctly, I would actually be tempted to use the TripId as the PartitionKey and the JourneyStep as the RowKey.
You will need, separately, a table that lists all the TripIds belonging to a given customer, which sort of makes sense, as you probably want to store some data such as a "trip name" in such a table anyway.
Your design has to be driven by your queries. You can filter your data based on two columns, PartitionKey and RowKey. PartitionKey is your most important column, since your queries will hit that column first.
In your case, CustomerId should be your PartitionKey, since most of the time you will try to reach your data based on the customer. (You may also need to keep another table for your client list.)
Now, the RowKey can be your TripId or the time. If I were you, I would probably use a RowKey in the format yyyyMMddHHmm|tripId, which will let you query based on startsWith and endsWith options.
Adding to @Frans' answer:
One thing you could do is create a separate table for each customer, so that each customer's data is nicely segregated into its own table. Then you could use TripId as the PartitionKey and JourneyStep as the RowKey, as suggested by @Frans. For storing some metadata about the trip, instead of going into a separate table, I would still use the same table, but keep the RowKey empty and put the other information about the trip there.
I would suggest considering the following approach to your PK/RK design. I believe it would yield the best performance for your outlined queries:
PartitionKey: combination of CustomerId and TripId.
string.Format("{0}_{1}", customerId.ToString(), tripId.ToString())
RowKey: combination of the DateTime.MaxValue.Ticks - Time.Ticks formatted to a large 0-padded string with the JourneyStep.
string.Format("{0}_{1}", (DateTime.MaxValue.Ticks - Time.Ticks).ToString("00000000000000000"), JourneyStep.ToString())
Such a combination will allow you to do the following queries "quickly".
Get data by CustomerId only. Example: context.Trips.Where(n => string.Compare(id + "_00000000-0000-0000-0000-000000000000", n.PartitionKey) <= 0 && string.Compare(id + "_zzzzzzzz-zzzz-zzzz-zzzz-zzzzzzzzzzzz", n.PartitionKey) >= 0).AsTableServiceQuery(context);
Get data by CustomerId and TripId. Example: context.Trips.Where(n => n.PartitionKey == string.Format("{0}_{1}", customerId, tripId)).AsTableServiceQuery(context);
Get the last X journey steps when searching by either CustomerId or CustomerId/TripId, by using the "Take" function
Get data via date-range queries by translating timestamps into Ticks
Save data into a trip with a single storage transaction (assuming you have less than 100 steps)
If you can guarantee that the Times of the Steps within each Trip are unique, you don't even have to put JourneyStep into the RowKey (it is somewhat inconvenient there anyway)
The only downside to this schema is not being able to retrieve a particular single journey step without knowing its Time and Id. However, unless you have very specific use cases, downloading all of the steps inside a trip and then picking a particular one from the list should not be so bad.
HTH
The design of table storage is driven by two major capabilities of Azure Tables:
Scalability
Search performance
As @Frans already pointed out, Azure Tables uses the PartitionKey to decide how to scale your data out across multiple storage nodes. Because of this, I would advise against having unique PartitionKeys, since in theory Azure would span out storage nodes that each serve a single customer only. I say "in theory" because, in practice, Azure uses smart algorithms to identify patterns in your PartitionKeys and group them (for example, if your ids are consecutive numbers). You don't want to fall into this scenario, because the scalability of your storage will be unpredictable and at the hands of obscure algorithms making those decisions. See HERE for more information about scalability.
Regarding performance, the fastest way to search is to hit both PartitionKey and RowKey in your queries. Contrary to Amazon DynamoDB, Azure Tables does not support secondary column indexes. If your queries search on attributes stored in columns other than those two, Azure will need to do a full table scan.
I faced a situation similar to yours, where the design of the partition/row keys was not trivial. In the end, we expanded our data model to include more information so we could design our table in such a way that ~80% of all search queries can be matched to partition and row keys, while the remaining 20% require a table scan. We decided to include the user's location, so our PartitionKey is the user's country and the RowKey is a unique customer ID. This meant our data model had to be expanded to include the user's country, which was not a big deal. Maybe you can do the same thing? Group your customers by segment, by location, or by email SMTP domain?
I'm currently using Cassandra and researching its data modeling practices. So far, I understand that you need to model your data based on the queries to be executed. However, multiple select requirements can make data modeling harder, or impossible to handle in one table. When you can't satisfy these requirements with one table, you need to add 2-3 tables; in other words, you need to make multiple inserts for one logical operation.
Currently, I'm dealing with the data model for a campaign structure. I have a campaign table in Cassandra defined with the following CQL:
CREATE TABLE campaign_users (
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    updated_at timestamp,
    PRIMARY KEY (campaign_id, uid)
);

CREATE INDEX ON campaign_users (created_at);
In this model, I need to be able to make incremental exports given a timestamp only. In Cassandra, the ALLOW FILTERING option enables select queries that filter on secondary index columns. So my CQL statement for the incremental export is the following:
select campaign_id, uid
from campaign_users
where created_at > minTimeuuid('2013-08-14 12:26:06+0000') allow filtering;
However, if ALLOW FILTERING is used, there is a warning saying that the statement may have unpredictable performance. So, is relying on ALLOW FILTERING good practice? What are the alternatives?
The ALLOW FILTERING warning is because Cassandra is internally skipping over data, rather than using an index and seeking. This is unpredictable because you don't know how much data Cassandra is going to skip over per row returned. You could be scanning through all your data to return zero rows, in the worst case. This is in contrast to operations without ALLOW FILTERING (apart from SELECT COUNT queries), where the data read through scales linearly with the amount of data returned.
This is OK if you're returning most of the data, so the data skipped over doesn't cost very much. But if you were skipping over most of your data a lot of work will be wasted.
The alternative is to include time in the first component of your primary key, in buckets. E.g. you could have day buckets and duplicate your queries for each day that contains data you need. This method guarantees that most of the data Cassandra reads is data you want. The problem is that all data for a bucket (e.g. one day) needs to fit in one partition. You can fix this by sharding the partition somehow, e.g. by including some aspect of the uid within it.
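A sketch of that bucketing approach; the table name and day format are assumptions, and the export has to loop over one query per day bucket in the window:

CREATE TABLE campaign_users_by_day (
    day text,                -- e.g. '2013-08-14': the day bucket is the partition key
    created_at timeuuid,
    campaign_id int,
    uid bigint,
    PRIMARY KEY (day, created_at, uid)
);

-- Incremental export within one day bucket, no ALLOW FILTERING needed:
SELECT campaign_id, uid FROM campaign_users_by_day
WHERE day = '2013-08-14'
  AND created_at > minTimeuuid('2013-08-14 12:26:06+0000');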