Big data solution for frequent queries

I need a big data storage solution for batch inserts of denormalized data which happen infrequently and queries on the inserted data which happen frequently.
I've gone through Cassandra and feel that it's not that good for batch inserts, but an OK solution for querying. Also, it would be good if there were a mechanism to segregate data based on a data attribute.

Since you mentioned Cassandra, I will talk about it:
Can you insert in an unbatched way, or is batching imposed by the system? If you can insert unbatched, Cassandra will probably be able to handle it easily.
Batched inserts should also be manageable for Cassandra nodes, but they won't distribute the load properly among all the nodes (NOTE: I'm talking about load balancing, not about data balance, which depends only on your partition key setup). If you are not very familiar with Cassandra, you could tell us your data structure and your query types, and we could suggest how to fit them to Cassandra's data model.
For the filtering part of the question, Cassandra has clustering keys and secondary indexes. A secondary index is basically like adding another column to the clustering key, so that you can query on both.

Related

Is it hacky to do RF=ALL + CL=TWO for a small frequently used Cassandra table?

I plan to enhance the search for our retail service, which is managed by DataStax. We have about 500KB of raw data on our wheels and tires, which could be compressed and encrypted down to about 20KB. This table is frequently used and changes about every day. We send the data to the frontend, where it is later processed with Next.js. Now we want to store this data in a single-row table in a separate keyspace, with a consistency level of TWO and RF equal to the number of nodes, so that the table is replicated to every node.
Now the question: Is this solution hacky or abnormal? Is there another solution that fits this situation better?

The quick answer to your question is yes, it is a hacky solution to do RF=ALL.
The table is very small, so there is no benefit to replicating it to all nodes in the cluster. In practice, the table is so small that the data will be cached anyway.
Since you are running with DataStax Enterprise (DSE), you might as well take advantage of the DSE In-Memory feature which allows you to keep data in RAM to save from disk seeks. Since your table can easily fit in RAM, it is a perfect use case for DSE In-Memory.
To configure the table to run In-Memory, set the table's compaction strategy to MemoryOnlyStrategy:
CREATE TABLE inmemorytable (
    ...
    PRIMARY KEY ( ... )
) WITH compaction = {'class': 'MemoryOnlyStrategy'}
  AND caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};
To alter the configuration of an existing table:
ALTER TABLE inmemorytable
    WITH compaction = {'class': 'MemoryOnlyStrategy'}
    AND caching = {'keys': 'NONE', 'rows_per_partition': 'NONE'};
Note that tables configured with DSE In-Memory are still persisted to disk so you won't lose any data in the event of a power outage or service disruption. In-Memory tables operate the same as regular tables so the same backup and restore processes still apply with the only difference being that a copy of the data is kept in memory for faster read performance.
For details, see DataStax Enterprise In-Memory. Cheers!

Best way of querying table without providing the primary key

I am designing the data model of our Scylla database. For example, I created a table, intraday_history, with fields:
CREATE TABLE intraday_history (
    id bigint,
    timestamp_seconds bigint,
    timestamp timestamp,
    sec_code text,
    open float,
    high float,
    low float,
    close float,
    volume float,
    trade int,
    PRIMARY KEY ((id, sec_code), timestamp_seconds, timestamp)
);
My id is a 64-bit integer generated with Twitter Snowflake. My problem is how to use WHERE without always providing the id (most of the time I will query by the bigint timestamp). I also run into this problem in other tables. Because the id is unique, I cannot query a range of timestamps.
Would it be okay if, say, for a bunch of tables on my one node, I used an id like cluster1, so that when I query I can just use id=cluster1? But then the id loses its uniqueness.
ALLOW FILTERING comes up as an option here. But I keep reading that it is a bad practice, especially when dealing with millions of queries.
I'm using ScyllaDB, a Cassandra-compatible database written in C++.

In Cassandra, as you've probably already read, the queries drive the table design, not the other way around. So in your situation, where you want to query by a different filter, the ideal approach is to create another Cassandra table. That's the optimal way. Partition keys are required in filters unless you add ALLOW FILTERING, but that isn't recommended: it performs a DC-wide (possibly cluster-wide) search, and you're still subject to timeouts.
You could consider using indexes or materialized views, which are basically Cassandra-maintained tables populated by the base table's changes. That would save you the trouble of having the application populate multiple tables (Cassandra would do it for you). We've had some luck with materialized views, but with either of these components there can be side effects, as with any other Cassandra table (inconsistencies, latencies, additional rules, etc.). I would say do a bit of research to determine the best approach, but most likely ALLOW FILTERING isn't the best choice (especially for high-volume, frequent queries or for tables containing large volumes of data). You could also investigate Solr if that's an option, depending on what you're filtering.
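To make the "another table per query" idea concrete, here is a minimal sketch in Python rather than CQL. It assumes a hypothetical second table partitioned by (sec_code, day), where the day bucket is derived from timestamp_seconds; the bucket size and the names day_bucket, history_by_day, insert_tick and ticks_for_day are my own illustrations, not something from the question. A dict stands in for the table.

```python
from collections import defaultdict
from datetime import datetime, timezone


def day_bucket(timestamp_seconds: int) -> str:
    """Derive a coarse partition value (the UTC date) from an epoch timestamp."""
    moment = datetime.fromtimestamp(timestamp_seconds, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


# Hypothetical query table: partition key = (sec_code, day), rows ordered by time.
history_by_day = defaultdict(list)


def insert_tick(sec_code: str, timestamp_seconds: int, close: float) -> None:
    """Write a row into the partition chosen by security code and day."""
    key = (sec_code, day_bucket(timestamp_seconds))
    history_by_day[key].append((timestamp_seconds, close))


def ticks_for_day(sec_code: str, day: str):
    """Rough equivalent of WHERE sec_code = ? AND day = ?; no id is needed."""
    return sorted(history_by_day[(sec_code, day)])


insert_tick("ABC", 86410, 1.5)
insert_tick("ABC", 86405, 1.4)
insert_tick("ABC", 10, 1.0)
print(ticks_for_day("ABC", "1970-01-02"))
```

The point is only that the partition key is something the query already knows (security code and day), so the unique id can stay as a regular column. Whether a day is the right bucket size depends on how many rows arrive per security per day.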
Hope that helps.
-Jim

Why Cassandra doesn't have secondary index?

Cassandra is positioned as a scalable and fast database.
Why, in technical terms, can't these goals be accomplished with secondary indexes?

Cassandra does indeed have secondary indexes. But secondary index usage doesn't work well with distributed databases, and it's because each node only holds a subset of the overall dataset.
I previously wrote an answer which discussed the underlying details of secondary index queries:
How do secondary indexes work in Cassandra?
While it should help give you some understanding of what's going on, that answer is written from the context of first querying by a partition key. This is an important distinction, as secondary index usage within a partition should perform well.
The problem arises when querying only by a secondary index: Cassandra cannot guarantee that all of your data can be served by a single node. When this happens, Cassandra designates a node as a coordinator, which in turn queries all other nodes for the specified indexed values.
Essentially, instead of performing sequential reads from a single node, secondary index usage forces Cassandra to perform random reads from all nodes. Now you don't have just disk seek time, but also network time complicating things.
The recommendation for Cassandra modeling is to duplicate your data into new tables to support the desired query. This adds some other complications with keeping the data in sync. But (when done correctly) it ensures that your queries can indeed be served by a single node. That's a tradeoff you need to make when building your model. You can have convenience or performance, but not both.
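As a rough illustration of the difference, here is a toy simulation in Python. It is a sketch under heavy assumptions: four hypothetical nodes, an MD5-based placement function instead of Cassandra's real partitioner, no replication, and made-up rows. It only counts how many nodes each kind of read has to contact.

```python
import hashlib

NODES = 4  # hypothetical cluster size


def node_for(partition_key: str) -> int:
    """Toy placement: hash the partition key to pick the one owning node."""
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big") % NODES


# Each node stores only the rows whose partition key it owns.
storage = [dict() for _ in range(NODES)]
rows = [("user1", "red"), ("user2", "blue"), ("user3", "red"), ("user4", "green")]
for key, colour in rows:
    storage[node_for(key)][key] = colour


def read_by_key(key: str):
    """Partition-key read: exactly one node is contacted."""
    node = node_for(key)
    return storage[node].get(key), 1


def read_by_value(colour: str):
    """Index-style read without a partition key: every node must be asked."""
    hits = [k for node in storage for k, v in node.items() if v == colour]
    return sorted(hits), len(storage)


print(read_by_key("user3"))   # value plus number of nodes contacted
print(read_by_value("red"))   # matching keys plus number of nodes contacted
```

The key lookup touches one node no matter how large the cluster is, while the value lookup has to fan out to all of them, which is where the extra network time comes from.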

So yes, Cassandra does have secondary indexes, and Aaron's answer does a great job of explaining why they don't perform well.
You see many people trying to solve this issue by writing their data to multiple tables. This is done so they can be sure that the data needed to answer a query that would traditionally rely on a secondary index is on the same node.
Some of the recent iterations of Cassandra have this 'built in' via materialized views. I've not really used them since 3.0.11, but they are promising. The problems I had at the time were primarily with adding them to tables with existing data, and they had a surprisingly large amount of overhead on writes (increased latency).

How to ensure data consistency in Cassandra on different tables?

I'm new to Cassandra and I've read that Cassandra encourages denormalization and duplication of data. This leaves me a little confused.
Let us imagine the following scenario:
I have a keyspace with four tables: A,B,C and D.
CREATE TABLE A (
    tableID int,
    column1 int,
    column2 varchar,
    column3 varchar,
    column4 varchar,
    column5 varchar,
    PRIMARY KEY (column1, tableID)
);
Let us imagine that the other tables (B, C, D) have the same structure and the same data as table A, only with a different primary key, in order to respond to other queries.
If I update a row in table A, how can I ensure consistency of the data in the other tables that hold the same data?

Cassandra provides BATCH for this purpose. From the documentation:
A BATCH statement combines multiple data modification language (DML) statements (INSERT, UPDATE, DELETE) into a single logical operation, and sets a client-supplied timestamp for all columns written by the statements in the batch. Batching multiple statements can save network exchanges between the client/server and server coordinator/replicas. However, because of the distributed nature of Cassandra, spread requests across nearby nodes as much as possible to optimize performance. Using batches to optimize performance is usually not successful, as described in Using and misusing batches section. For information about the fastest way to load data, see "Cassandra: Batch loading without the Batch keyword."
Batches are atomic by default. In the context of a Cassandra batch operation, atomic means that if any of the batch succeeds, all of it will. To achieve atomicity, Cassandra first writes the serialized batch to the batchlog system table that consumes the serialized batch as blob data. When the rows in the batch have been successfully written and persisted (or hinted) the batchlog data is removed. There is a performance penalty for atomicity. If you do not want to incur this penalty, prevent Cassandra from writing to the batchlog system by using the UNLOGGED option: BEGIN UNLOGGED BATCH
UNLOGGED BATCH is almost always undesirable and I believe is removed in future versions. Normal batches provide the functionality you desire.

You can also explore a new feature from Cassandra 3.0 called materialized views:
Basic rules of data modeling in Cassandra involve manually denormalizing data into separate tables based on the queries that will be run against that table. Currently, the only way to query a column without specifying the partition key is to use secondary indexes, but they are not a substitute for the denormalization of data into new tables as they are not fit for high cardinality data. High cardinality secondary index queries often require responses from all of the nodes in the ring, which adds latency to each request. Instead, client-side denormalization and multiple independent tables are used, which means that the same code is rewritten for many different users.
In 3.0, Cassandra will introduce a new feature called Materialized Views. Materialized views handle automated server-side denormalization, removing the need for client side handling of this denormalization and ensuring eventual consistency between the base and view data. This denormalization allows for very fast lookups of data in each view using the normal Cassandra read path.
The idea is exactly the same as the one suggested by Jeff Jirsa, but it won't require you to handle all the multi-table consistency logic inside your application; Cassandra will do it for you automatically.

Choosing a NoSQL database

I need a NoSQL database that will run on Windows Azure and works well for the following parameters. Right now Azure Table Storage, HBase and Cassandra seem to be the most promising options.
1 billion entities
up to 100 reads per second, though caching will mostly make it much less
around 10 - 50 writes per second
Strong consistency would be a plus, so perhaps HBase would be better than Cassandra in that regard.
Querying will often be done on a secondary in-memory database with various indexes in addition to ElasticSearch or Windows Azure Search for fulltext search and perhaps some filtering.
Azure Table Storage looks like it could be nice, but from what I can tell, the big difference between Azure Table Storage and HBase is that HBase supports updating and reading values for a single property instead of the whole entity at once. I guess there must be some disadvantages to HBase however, but I'm not sure what they would be in this case.
I also think crate.io looks like it could be interesting, but I wonder if there might be unforeseen problems.
Anyone have any other ideas of the advantages and disadvantages of the different databases in this case, and if any of them are really unsuited for some reason?

I currently work with Cassandra, so I can help with a few pros and cons.
Requirements
Cassandra can easily handle those 3 requirements. It was designed to have fast reads and writes. In fact, Cassandra is blazing fast with writes, mostly because you can write without doing a read.
Also, Cassandra keeps some of its data in memory, so you could even avoid the secondary database.
Consistency
In Cassandra you choose the consistency in each query you make, therefore you can have consistent data if you want to. Normally you use:
ONE - Only one replica node has to get or accept the change. This means fast reads/writes, but low consistency (another machine may serve older information until the replicas have converged).
QUORUM - A majority of the replica nodes (more than half) must get or accept the change. This means slower reads and writes, but you get FULL consistency IF you use it for BOTH reads and writes. That's because if more than half of the replicas have your data after you inserted/updated/deleted it, then any read from more than half of the replicas includes at least one that has the most recent information, which is the version that gets delivered.
Both of these options are recommended because they avoid single points of failure. If all machines had to accept and one node was down or busy, you wouldn't be able to query.
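The overlap argument for QUORUM can be checked with a few lines of Python. This is only a sketch of the arithmetic, not driver code: the function names are mine, and it brute-forces every possible set of replicas that could acknowledge a write or answer a read.

```python
from itertools import combinations


def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def always_overlaps(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True if every possible write set shares a replica with every read set."""
    replicas = range(replication_factor)
    return all(
        set(written) & set(read)
        for written in combinations(replicas, write_acks)
        for read in combinations(replicas, read_acks)
    )


rf = 3
q = quorum(rf)
print(q, always_overlaps(rf, q, q))  # QUORUM writes with QUORUM reads
print(always_overlaps(rf, 1, 1))     # ONE writes with ONE reads
```

With three replicas, any two sets of two replicas share at least one member, so a QUORUM read always reaches a replica that took the latest QUORUM write. With ONE on both sides the read can land on a replica the write never reached.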
Pros
Cassandra is the solution for performance, linear scalability and avoiding single points of failure (you can have machines down and the others will take over the work). And it does most of its management work automatically. You don't need to manage the data distribution, replication, etc.
Cons
The downsides of Cassandra are in the modeling and queries.
With a relational database you model around the entities and the relationships between them. Normally you don't really care about what queries will be made and you work to normalize it.
With Cassandra the strategy is different. You model the tables to serve the queries. And that happens because you can't join and you can't filter the data any way you want (only by its primary key).
So if you have a database for a company with grocery stores and you want to make a query that returns all products of a certain store (Ex.: New York City), and another query to return all products of a certain department (Ex.: Computers), you would have two tables "ProductsByStore" and "ProductsByDepartment" with the same data, but organized differently to serve the query.
Materialized Views can help with this by avoiding the need to apply each change to multiple tables, but the example shows how differently things work with Cassandra.
Denormalization is also common in Cassandra for the same reason: Performance.
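The two-table idea from the grocery store example can be sketched in plain Python, with dicts standing in for the ProductsByStore and ProductsByDepartment tables. The sample rows and field names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sample rows; the field names are assumptions for illustration.
products = [
    {"name": "Laptop", "store": "New York City", "department": "Computers"},
    {"name": "Mouse", "store": "Boston", "department": "Computers"},
    {"name": "Blender", "store": "New York City", "department": "Kitchen"},
]

products_by_store = defaultdict(list)       # plays the role of ProductsByStore
products_by_department = defaultdict(list)  # plays the role of ProductsByDepartment


def insert_product(product: dict) -> None:
    """Write the same row to both 'tables', each keyed for one query."""
    products_by_store[product["store"]].append(product)
    products_by_department[product["department"]].append(product)


for product in products:
    insert_product(product)

print([p["name"] for p in products_by_store["New York City"]])
print([p["name"] for p in products_by_department["Computers"]])
```

Each query becomes a single lookup by key, at the cost of writing every product twice and keeping both copies in sync.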
