Does CouchDB have a partition key and sort key? - couchdb

Does CouchDB have a partition key and a sort key? If yes, what is the sorting logic and theory?

Recent versions of CouchDB support partitioned databases, letting you specify an explicit partition as a document id of the form mypartition:documentid. For certain use cases, this can be a very efficient way of organising your data, with partitioned view queries not having to visit every shard. Note that there are some quirks to be aware of in terms of partition sizes etc. A database can be created as partitioned, but an existing (non-partitioned) database can't be made partitioned at a later stage. Cloudant wrote a handy blog post introducing partitioned databases.
There is no separate sort key; ordering is determined by the document id (or for views, the view key). The ordering is lexicographical on the key.
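For illustration, here's a rough Python sketch using the requests library against a local CouchDB 3.x instance (the database name, credentials, and document ids below are made up): create the database as partitioned, write documents with partition-prefixed ids, then run a partition-scoped query.

import requests

BASE = "http://admin:secret@127.0.0.1:5984"  # assumed local CouchDB with admin credentials

# A database has to be created as partitioned up front; it can't be converted later.
requests.put(f"{BASE}/orders", params={"partitioned": "true"})

# Document ids take the form "<partition>:<doc id>".
requests.put(f"{BASE}/orders/customer42:order-0001", json={"total": 19.99})
requests.put(f"{BASE}/orders/customer42:order-0002", json={"total": 5.00})

# A partition-scoped request only touches the shards holding that partition, and
# the rows come back ordered by document id.
resp = requests.get(f"{BASE}/orders/_partition/customer42/_all_docs",
                    params={"include_docs": "true"})
print(resp.json())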

Related

How do I design a table in Cassandra for a TinyURL use case?

Recently I came across a well-known design problem: 'Tiny URL'.
What I found was people vouching for NoSQL DBs such as DynamoDB or Cassandra. I've been reading about Cassandra for a couple of days, and I want to design my solution around this DB for this specific problem.
What would be the table definition? If I choose the following table definition:
CREATE TABLE UrlMap (tiny_url text PRIMARY KEY, url text);
Wouldn't this result in a lot of partitions, since my partition key can take on around 68B values (using 6-character base64 strings)?
Would that somehow affect the overall read/write performance? If so, what would be a better way to model the table?
Lots of partitions is fine; think of it as using C* as a key-value store.
The primary principle of data modelling in Cassandra is to design one table for each application query.
For a URL shortening service, the main application query is to retrieve the equivalent full URL for a given tiny URI. In pseudo-code, the query looks like:
GET long url FROM datastore WHERE uri = ?
Note that for the purposes of the service, we won't store the web domain name, to keep the app reusable for any domain. The filter (WHERE clause) is the URI, so this is what you want as the partition key, and we would design the table accordingly:
CREATE TABLE urls_by_uri (
    uri text,
    long_url text,
    PRIMARY KEY (uri)
)
If we want to retrieve the URL for http://tinyu.rl/abc123, the CQL query is:
SELECT long_url FROM urls_by_uri WHERE uri = 'abc123'
As Phact and Andrew pointed out, there is no need to worry about the number of partitions (records) you'll be storing in the table; the default Murmur3 token range alone spans 2^64 values, so for practical purposes the number of partitions a table can hold is limitless.
In Cassandra, each partition key gets hashed into a token value using the Murmur3 hash algorithm (the default partitioner). This distributes partitions evenly across all nodes in the cluster. The same hash is used to determine which node "owns" a partition, making retrieval (reads) very fast in Cassandra.
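To see this in action, here's a rough sketch with the Python driver against the urls_by_uri table above (the contact point and keyspace name are assumptions): token() returns the Murmur3 hash value Cassandra uses to place the partition.

from cassandra.cluster import Cluster

# Assumed contact point and keyspace; adjust for your cluster.
session = Cluster(["127.0.0.1"]).connect("shortener")

# token() exposes the Murmur3 hash value that decides which node owns the partition.
row = session.execute(
    "SELECT token(uri), long_url FROM urls_by_uri WHERE uri = %s", ("abc123",)
).one()
print(row)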
As long as you limit your SELECT queries to a single partition, retrieving the data is extremely fast. In fact, I work with hundreds of companies that have an SLA of 95% of reads completing within 6-9 milliseconds. This is achievable in Cassandra when you model your data correctly and size your cluster correctly. Cheers!

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a Twitter snowflake-generated 64-bit integer. My problem is how I can use WHERE without always providing the id (most of the time I will filter on the bigint timestamp). I run into this problem in other tables as well. Because the id is unique, I cannot query a batch of rows by timestamp.
Is it okay if, say, for a bunch of tables on my single node I use an id like cluster1, so that when I query I just specify id = cluster1? But then I lose the uniqueness feature.
ALLOW FILTERING is an option here, but I keep reading that it is bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a C++ implementation compatible with Apache Cassandra.
In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So your situation, where you want to query by a different filter, would ideally entail creating another Cassandra table. That's the optimal way.
Partition keys are required in filters unless you add the ALLOW FILTERING "switch", but that isn't recommended because it performs a DC-wide (possibly cluster-wide) search and you're still subject to timeouts. You could instead consider secondary indexes or materialized views, which are essentially Cassandra-maintained tables populated from the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but either of these components can have side effects like any other Cassandra table (inconsistencies, latencies, additional rules, etc.).
I would do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice, especially for high-volume, frequent queries or for tables containing large amounts of data. You could also investigate Solr if that's an option, depending on what you're filtering.
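For example, a materialized view keyed by sec_code would let you query by security and time range without knowing the snowflake id. This is only a sketch (the keyspace name, contact point, and query values below are made up, and the usual materialized-view caveats apply):

from cassandra.cluster import Cluster

# Assumed contact point and keyspace; Scylla speaks the same protocol as Cassandra.
session = Cluster(["127.0.0.1"]).connect("market_data")

# Every primary-key column of the base table must appear in the view's key
# and be restricted with IS NOT NULL.
session.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS intraday_by_sec_code AS
        SELECT * FROM intraday_history
        WHERE sec_code IS NOT NULL AND timestamp_seconds IS NOT NULL
          AND timestamp IS NOT NULL AND id IS NOT NULL
        PRIMARY KEY ((sec_code), timestamp_seconds, timestamp, id)
""")

# A time-range query per security now hits a single partition, no ALLOW FILTERING needed.
rows = session.execute(
    "SELECT * FROM intraday_by_sec_code"
    " WHERE sec_code = %s AND timestamp_seconds >= %s",
    ("AAPL", 1600000000),
)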
Hope that helps.
-Jim

Inject Custom Sharding in Cassandra or Couchbase

Can I inject a sharding algorithm into either Cassandra or Couchbase?
Or do they decide where each document goes?
For instance, what if I want to pin data to shards by one of the data's properties?
Couchbase hashes the key of the document to decide which shard (vBucket) the document is associated with. The SDK also uses the same algorithm to find out which shard the document is located in when you want to retrieve it by its key.
One of the problems of letting developers decide on the sharding algorithm is that they sometimes end up with an excessive number of documents in a single shard, and naturally this shard becomes the bottleneck of the application.
One of the core concepts in Couchbase is that documents are (almost) evenly distributed across all shards, so I am not aware of any native support for plugging in your own algorithm there.
Cassandra decides where the data goes by the partition key. So if you use the property you want to "pin" on as the partition key, it will accomplish what you're asking for, I think. However, you don't pick the replicas explicitly, and the placement can change as hosts are removed from or added to the cluster.
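As a rough sketch of that idea (the keyspace, table, and column names below are invented for illustration), making the property you want to pin on the partition key looks like this:

from cassandra.cluster import Cluster

# Assumed contact point and keyspace; table and column names are made up.
session = Cluster(["127.0.0.1"]).connect("app")

# Every row with the same region value hashes to the same token, so it lives on the
# same set of replicas; you still don't choose which replicas those are.
session.execute("""
    CREATE TABLE IF NOT EXISTS events_by_region (
        region   text,
        event_id timeuuid,
        payload  text,
        PRIMARY KEY ((region), event_id)
    )
""")

session.execute(
    "INSERT INTO events_by_region (region, event_id, payload) VALUES (%s, now(), %s)",
    ("eu-west", "login"),
)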

Partition key for Azure Cosmos DB collection

I am a bit new to Azure Cosmos DB and trying to understand the concepts.
I want help deciding the best possible partition key for a DocumentDB collection. Please refer to the image below, which shows possible partitions using different partition keys.
As mentioned in the blog post here,
An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
From the above, I think that in my case UserId can be used as the partition key.
Can someone please suggest which key is the best candidate for the partition key?
In the "10 things to know about DocumentDB Partitioned Collections" post and the official Microsoft documentation, you can find lots of very good advice about choosing a partition key, so I'm not going to repeat it here.
The selection of a partition key depends on the data stored in the database and on the filtering criteria of your frequent queries.
It is often advised to partition on something like userid, which is good if you have one. Suppose your business logic has many queries for a given userid and needs to look up no more than a few hundred entries; in such cases the data can be quickly extracted from a single partition without the overhead of collating data across partitions.
However, if you have millions of records per user, then partitioning on userid is perhaps the worst option, as extracting large volumes of data from a single partition will soon outweigh the benefit of avoiding cross-partition collation. In such cases you want to distribute user data as evenly as possible over all partitions, and you may need to find another column to be the partition key.
So, if the data volume is very large, I suggest you do some simple tests based on your business logic and choose the partition key that performs best. After all, the partition key cannot be changed once it is set.
Hope it helps you.
It depends, but here are a few things to consider:
The blog post you mentioned says:
Additionally, the storage size for documents belonging to the same partition key is limited to 10GB. An ideal partition key is one that appears frequently as a filter in your queries and has sufficient cardinality to ensure your solution is scalable.
Also, I really recommend checking this article and video: https://learn.microsoft.com/en-us/azure/cosmos-db/partition-data
The choice of the partition key is an important decision that you have to make at design time. You must pick a property name that has a wide range of values and has even access patterns.
So make sure to choose a partition key that has many values and meets those requirements.
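To make the "decided at design time" point concrete, here is a small sketch with the azure-cosmos Python SDK (the endpoint, key, and names below are placeholders): the partition key path is fixed when the container is created, and every document carries that property.

from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint, key and names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
database = client.create_database_if_not_exists("appdb")

# The partition key path is fixed at container creation and cannot be changed later.
container = database.create_container_if_not_exists(
    id="orders",
    partition_key=PartitionKey(path="/userId"),
)

container.upsert_item({"id": "order-001", "userId": "user-42", "total": 19.99})

# Reads that supply the partition key value stay inside one logical partition.
item = container.read_item(item="order-001", partition_key="user-42")
print(item)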

Cassandra: Sort by query

I have a little bit special request.
Constellation: I use a Redis DB to store geo data and use GEORADIUS to get them back, sorted by distance. With these keys I look up the data in Cassandra, but Cassandra's results come back sorted by the key or something else.
What I want is to get the information back in the same order I requested it.
The partition key is built from an id (which I get back from Redis) and a status.
Can I tell Cassandra to sort by my array of ids?
Partition keys are designed to be distributed randomly across different nodes. You can use ByteOrderedPartitioner to do ordered queries, but BOP is considered an anti-pattern in Cassandra and I would highly recommend against it. You can read more about it here: Cassandra ByteOrderedPartitioner.
You can add more columns to the primary key, which determine how data is stored on disk within a partition. These are known as clustering keys, and you can do ORDER BY queries on clustering keys. This is a good document on clustering keys: https://docs.datastax.com/en/cql/3.1/cql/ddl/ddl_compound_keys_c.html.
If you can share more schema details, I can suggest what to use as clustering key.
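To illustrate clustering keys in the meantime (only a sketch, since the schema wasn't shared; the keyspace, table, and column names below are invented), you could store the distance as a clustering column so that rows within a partition come back already sorted:

from cassandra.cluster import Cluster

# Assumed contact point and keyspace; the schema below is invented for illustration.
session = Cluster(["127.0.0.1"]).connect("geo")

# id + status form the partition key; distance is a clustering column, so rows inside
# a partition are stored (and returned) in distance order.
session.execute("""
    CREATE TABLE IF NOT EXISTS places_by_id (
        id       text,
        status   text,
        distance double,
        place_id text,
        info     text,
        PRIMARY KEY ((id, status), distance, place_id)
    ) WITH CLUSTERING ORDER BY (distance ASC, place_id ASC)
""")

# ORDER BY works only on clustering columns and only within a single partition.
rows = session.execute(
    "SELECT place_id, info FROM places_by_id"
    " WHERE id = %s AND status = %s ORDER BY distance DESC",
    ("abc123", "active"),
)

Note that ORDER BY only applies within a single partition, so results gathered from many different ids would still need to be re-ordered on the client against the list you got from Redis.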
