Sequence having big gaps in YugabyteDB YSQL - yugabytedb

[Question posted by a user on YugabyteDB Community Slack]
Using YugabyteDB 2.11, and having a simple table like below:
create table my_table (id bigserial primary key, a text);
But while inserting data, sequence is having big gaps. eg.:, while call insert services, 'id' column gets values as 1,2,3,101,201,202,301,.......

This happens because of --ysql_sequence_cache_minval glfag https://docs.yugabyte.com/latest/reference/configuration/yb-tserver/#ysql-sequence-cache-minval.
All sequences are cached by 100 in each yb-tserver so that it can scale without querying the 'single point of truth' each time. Sequences are part of the postgres catalog, and therefore stored on the yb-master. A sequence with a cache of 1 (no cache), will have to reach out to the master for each value, a cache of 100 allows a backend to obtain the sequence number from the master and use the number as indicated by the cache from that number. This will make a next session pick its own range of 100, etc.
Note that even if you remove the cache there will be gaps in sequences because they are not transactional by nature.
The problem of gaps in sequences exists since database sequences exist. It's a computer science problem, not a YB (or implementation) problem. If you want no gaps, you need a serial (not scalable) process. This is not worth the trade-off, especially if you implement YB for scalability.

Related

Best way to shard multi-tenant app in YugabyteDB

[Question posted by a user on YugabyteDB Community Slack]
I want to someone remove my confusion, please correct me If I am wrong:
I have 3 nodes (3 tables)
Table structure:
ID (Hash of Account/Site/TS)
Account
Site
Timestamp
I have pattern of accounts inside multiple sites. Should I partition by account is it better by site? (Small partition size is better / Large partition size is better).
Read happens by all three columns. Which is a better choice of partition?
YugabyteDB doesn't need declarative partitioning to distributed data (this is done by sharding on the key). Partitioning is used to isolate data (like cold data to archive later, or like geo-distribution placement).
If you define PRIMARY KEY( ( Account, Site, Timestamp ) HASH ) you will have the distribution (YugabyteDB uses a hash function on the columns to distribute to tablets) and no skew (because the timestamp is in it). Sharding is automatic. You don't have to define an additional column for that https://docs.yugabyte.com/preview/architecture/docdb-sharding/sharding/

Are secondary indices always a bad idea in Cassandra even if I specify them in conjunction with the partitioning key in all my queries?

I know that secondary indices in Cassandra are generally a bad idea because the index is stored locally in each node i.e. not distributed across the cluster which may result in a query scanning a huge number of nodes. However, I don't understand why they are still a bad idea if I always specify the partition key in my queries and only use the secondary index as a final filter. I've read that they don't scale with large amounts of data even if I specify the partition key. Is this true? and if it's then why?
In general secondary indexes are bad idea, not only for the distributed part, but also for the index size and the number of distinct value, so if you have a field with high or low cardinality,you will be spending time on scanning many rows or many columns.
Also you can have other issue while dealing with tombstones ...
To answer your question, secondary index in Cassandra doesn't scale that good, but if you use a partition key and by it you tell Cassandra which node have the data, it perform really better !
you can find more details here in section F :
https://www.datastax.com/blog/2016/04/cassandra-native-secondary-index-deep-dive
I hope this helps !
These guys have a nice write-up on the performance impacts of secondary indexes: 
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to
satisfy a query by indexed value, each node has to query its own records to build the
final result set (as opposed to a primary key query where it is known exactly which node
needs to be queried). So there's not just an impact on writes, but on read performance as
well.
Cassandra on a ring of five machines, with a primary index of user IDs and a secondary index of user emails. If you were to query for a user by their ID or by their primary indexed key any machine in the ring would know which machine has a record of that user. One query, one read from disk. However to query a user by their email or their secondary indexed value each machine has to query its own record of users. One query, five reads from disk. By either scaling the number of users system wide, or by scaling the number of machines in the ring, the noise to signal-to-ratio increases and the overall efficiency of reading drops. In some cases to the point of timing out also.
Please refer below link for good explanation on secondary index.
https://dzone.com/articles/cassandra-scale-problem

Getting rid of confusion regarding NoSQL databases

This question is about NoSQL (for instance take cassandra).
Is it true that when you use a NoSQL database without data replication that you have no consistency concerns? Also not in the case of access concurrency?
What happens in case of a partition where the same row has been written in both partitions, possible multiple times? When the partition is gone, which written value is used?
Let's say you use N=5 W=3 R=3. This means you have guaranteed consistency right? How good is it to use this quorum? Having 3 nodes returning the data isn't that a big overhead?
Can you specify on a per query basis in cassandra whether you want the query to have guaranteed consistency? For instance you do an insert query and you want to enforce that all replica's complete the insert before the value is returned by a read operation?
If you have: employees{PK:employeeID, departmentId, employeeName, birthday} and department{PK:departmentID, departmentName} and you want to get the birthday of all employees with a specific department name. Two problems:
you can't ask for all the employees with a given birthday (because you can only query on the primary key)
You can't join the employee and the department column families because joins are impossible.
So what you can do is create a column family:
departmentBirthdays{PK:(departmentName, birthday), [employees-whos-birthday-it-is]}
In that case whenever an employee is fired/hired it has to be removed/added in the departmentBirthdays column family. Is this process something you have to do manually? So you have to manually create queries to update all redundant/denormalized data?
I'll answer this from the perspective of cassandra, coz that's what you seem to be looking at (hardly any two nosql stores are the same!).
For a single node, all operations are in sequence. Concurrency issues can be orthogonal though...your web client may have made a request, and then another, but due to network load, cassandra got the second one first. That may or may not be an issue. There are approaches around such problems, like immutable data. You can also leverage "lightweight transactions".
Cassandra uses last write wins to resolve conflicts. Based on you replication factor and consistency level for your query, this can work well.
Quurom for reads AND writes will give you consistency. There is an edge case..if the coordinator doesn't know a quorum node is down, it sends the write requests, then the write would complete when quorum is re-established. The client in this case would get a timeout and not a failure. The subsequent query may get the stale data, but any query after that will get latest data. This is an extreme edge case, and typically N=5, R=3, W3= will give you full consistency. Reading from three nodes isn't actually that much of an overhead. For a query with R=3, the client would make that request to the node it's connected to (the coordinator). The coordinator will query replicas in parallel (not sequenctially). It willmerge up the results with LWW to get the result (and issue read repairs etc. if needed). As the queries happen in parallel, the overhead is greatly reduced.
Yes.
This is a matter of data modelling. You describe one approach (though partitioning on birthday rather than dept might be better and result in more even distribution of partitions). Do you need the employee and department tables...are they needed for other queries? If not, maybe you just need one. If you denormalize, you'll need to maintain the data manually. In Cassandra 3.0, global indexes will allow you to query on an index without being inefficient (which is the case when using a secondary index without specifying the partition key today). Yes another option is to partition employeed by birthday and do two queries, and do the join in memory in the client. Cassandra queries hitting a partition are very fast, so doing two won't really be that expensive.

Problems In Cassandra ByteOrderedPartitioner in Cluster Environment

I am using cassandra 1.2.15 with ByteOrderedPartitioner in a cluster environment of 4 nodes with 2 replicas. I want to know what are the drawbacks of using the above partitioner in cluster environment? After a long search I found one drawback. I need to know what are the consequences of such drawback?
1) Data will not distribute evenly.
What type of problem will occur if data are not distributed evenly?
Is there is any other drawback with the above partitioner in cluster environment if so, what are the consequences of such drawbacks? Please explain me clearly.
One more question is, Suppose If I go with Murmur3Partitioner the data will distribute evenly. But the order will not be preserved, however this drawback can be overcome with cluster ordering (Second key in the primary keys). Whether my understanding is correct?
As you are using Cassandra 1.2.15, I have found a doc pertaining to Cassandra 1.2 which illustrates the points behind why using the ByteOrderedPartitioner (BOP) is a bad idea:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/architecture/architecturePartitionerBOP_c.html
Difficult load balancing More administrative overhead is required to load balance the cluster. An ordered partitioner
requires administrators to manually calculate partition ranges
(formerly token ranges) based on their estimates of the row key
distribution. In practice, this requires actively moving node
tokens around to accommodate the actual distribution of data once
it is loaded.
Sequential writes can cause hot spots If your application tends to write or update a sequential block of rows at a time, then the
writes are not be distributed across the cluster; they all go to
one node. This is frequently a problem for applications dealing
with timestamped data.
Uneven load balancing for multiple tables If your application has multiple tables, chances are that those tables have different row keys and different distributions of data. An ordered
partitioner that is balanced for one table may cause hot spots and uneven distribution for another table in the same cluster.
For these reasons, the BOP has been identified as a Cassandra anti-pattern. Matt Dennis has a slideshare presentation on Cassandra Anti-Patterns, and his slide about the BOP looks like this:
So seriously, do not use the BOP.
"however this drawback can be overcome with cluster ordering (Second key in the primary keys). Whether my understanding is correct?"
Somewhat, yes. In Cassandra you can dictate the order of your rows (within a partition key) by using a clustering key. If you wanted to keep track of (for example) station-based weather data, your table definition might look something like this:
CREATE TABLE stationreads (
stationid uuid,
readingdatetime timestamp,
temperature double,
windspeed double,
PRIMARY KEY ((stationid),readingdatetime));
With this structure, you could query all of the readings for a particular weather station, and order them by readingdatetime. However, if you queried all of the data (ex: SELECT * FROM stationreads;) the results probably will not be in any discernible order. That's because the total result set will be ordered by the (random) hashed values of the partition key (stationid in this case). So while "yes" you can order your results in Cassandra, you can only do so within the context of a particular partition key.
Also, there have been many improvements in Cassandra since 1.2.15. You should definitely consider using a more recent (2.x) version.

Cassandra schema advice needed

I'm designing a Cassandra schema for a browser event collection system, and I was hoping to sanity check my approach. The system collects user events in the browser, like mouse movements, clicks, etc. The events are stored and processed to create heat maps of user activity on a web page. I've chosen Cassandra for persistence, since my use case is more write heavy than ready heavy: every 50 milliseconds, an ajax call dumps the aggregated events to my server, and into the database. I'm using node.js for the server, and the JSON events look something like this on the server:
{ uuid: dsf86ag487hadf97hadf97, type: 'MOVE', time: 12335234345, pageX: 334, pageY:566, .... }
As you can see each user has a unique uuid, associated with each of their events, generated on the browser, stored in a cookie. My read case will be some map-reduce job. Each top-level domain will be a keyspace, and I was planning using the uuid as my partition key. The main table will be the events table, where each row will be one event, using a composite primary key, consisting of the browser-generated uuid and a cassandra-generated timeuuid. The primary key must have a timeuuid component, since two events may have the same timestamp on certain browsers. The data types for event will be strings, ints, timestamps. The total data for a partition should not exceed a few hundred megabytes. So...Is this sane? What questions should I be asking myself? I recognize that this use case has many analogs in sensor data collection, etc, so please point me to existing examples. Thanks in advance.
Choosing a partition key
While recording the user ID may be important in some cases for distinguishing events from different users that may occur at the same time, the user ID is probably not the best choice for the partition key. That is, unless you are planning to analyze the behavior of specific users.
You are probably more concerned with how the heatmap changes over time and specifically which areas of the page were involved. These are probably better considerations for your partition key, though perhaps not stored as a timestamp nor as X/Y coordinates, which I'll get into later.
You will generally want to choose a partition key that has (1) a large distribution of values, to create even load across your cluster, and (2) is made up of values that are relatively "well known". By "well known", I mean something you either know in advance or something that can be computed easily and deterministically. For instance, you will have many users and will gather statistics over many days. While the the specific of days (encoded as, say, YYYY-MM-DD strings) can be easily determined based on a known start/end date range or query input, the set of all valid user IDs (assuming UUIDs or other non-incremental value, or hash) is much harder to determine without doing a scan of the entire cluster. Avoid doing partition key scans; aim for "exact" random access to your partitions.
Format of the partition key
The partition key is traditionally shown as a single column in many examples, but you can have a multi-column partition key. This can be useful when using date/time information as all or part of the key. You would aim to have as few unique values per column as possible, so that the set of values you need to enumerate is as small as possible, but as many values (or additional columns) as necessary to balance the I/O load and data distribution across the cluster.
For example, if you were to use a timestamp as your partition key, in 64-bit Java timestamp format, there are 1,000 possible partitions per second. Even though you can technically iterate over them, that may be more granular than you need or want. On the other side, if your partition key were simply the 4-digit year, then all of that year's events would go to the same partition (making it very large) and to the same set of replica nodes (hotspots, inefficient cluster use). By choosing a key that balances between these extremes, you can control the size of your partitions and also the number of partitions you must access in order to satisfy a query.
Also consider what you'll do when you ever want to delete old data. The easiest means (within a single column family/table) is to delete an entire partition as this helps avoid accumulating individual column tombstones. If you ever want to run an operation like "delete all data older than 2013" then you definitely don't want to bury the date deep down in the data and would rather have it as part of your partition key.
Choosing a row (clustering) key
Any additional columns in the primary key that are not part of the partition key become the row key within the partition, and the rows are clustered (ordered) by the sort order of the first of these columns.
That clustering/sorting is important, because it's generally the only native sorting you're going to get with Cassandra. Even if the partition key is down to the level of a specific hour or minute of a specific day, you might choose to cluster the rows by your millisecond timestamp or time UUID, to keep everything within that partition in chronological order.
You can still have additional columns, like your X/Y coordinates or user IDs, in your row keys -- in case it sounded like I was recommending that you put time (only) in both the partition and clustering keys.
Using X/Y coordinates
This part has nothing to do with Cassandra, but if you're heat-mapping the page, do be aware that people use different screens and devices at different resolutions. Unless you're doing pixel-perfect layout on your site (and hopefully you're using a fluid, responsive layout instead) then the X/Y coordinate of one user isn't going to match the X/Y coordinates from another user. They might not even match for the same user, if that user switches devices.
Consider mapping not by X/Y coordinate of the mouse, but perhaps the IDs of elements in the DOM. Have an ID for your "sidebar", "main menu", "main body div" and any specific elements you want to map. These would be string keys, not coordinate pairs, and while they'd still be triggered on mouse enter/leave/click the logged information doesn't depend or assume any particular screen geometry.
Perhaps you decide to include the element ID as part of the row or partition key, too.

Resources