The MemSQL 4.1 release notes advise that there have been improvements to insert performance of Columnstore tables.
My (basic) understanding of the Columnstore table type is that it is unsuitable for individual inserts and best suited to larger bulk inserts (~100k rows per insert).
Is this still the case with the 4.1 release or does the memory-optimized data structure in front of each columnstore table now fix this deficiency?
To be clear, performance is not the issue for my use case; it's whether the columnstore can be used for individual inserts at all.
I'd appreciate any additional information or links to further reading - I cannot find more details on these changes.
That's right: with 4.1, MemSQL columnstore tables can support individual inserts reasonably well (unlike before 4.1, where each individual insert created a separate columnstore segment). Individual inserts go into in-memory row storage until enough accumulate to be batched into column storage. Of course, MemSQL will still handle bulk inserts with much better performance.
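As a minimal sketch of what this looks like in practice (table and column names here are hypothetical, not from the release notes):

CREATE TABLE events (
    id BIGINT NOT NULL,
    ts DATETIME NOT NULL,
    value DOUBLE,
    KEY (ts) USING CLUSTERED COLUMNSTORE
);

-- A single-row insert lands in the in-memory rowstore segment in front of the
-- columnstore, and is flushed into column storage once enough rows accumulate.
INSERT INTO events VALUES (1, NOW(), 3.14);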
Related
I plan to enhance the search for our retail service, which is managed by DataStax. We have about 500 KB of raw data for our wheels and tires, which can be compressed and encrypted down to about 20 KB. This table is read frequently and changes about once a day. We send the data to the frontend, where it will later be processed with Next.js. Now we want to store this data in a single-row table in a separate keyspace, with a consistency level of TWO and an RF equal to the number of nodes, replicating the table to all of the nodes.
Now the question: is this solution hacky or abnormal? Is there another solution that fits this situation better?
The quick answer to your question is yes, it is a hacky solution to do RF=ALL.
The table is very small, so there is no benefit to replicating it to all nodes in the cluster. In practice, a table this small will end up cached anyway.
Since you are running with DataStax Enterprise (DSE), you might as well take advantage of the DSE In-Memory feature which allows you to keep data in RAM to save from disk seeks. Since your table can easily fit in RAM, it is a perfect use case for DSE In-Memory.
To configure the table to run In-Memory, set the table's compaction strategy to MemoryOnlyStrategy:
CREATE TABLE inmemorytable (
...
PRIMARY KEY ( ... )
) WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
To alter the configuration of an existing table:
ALTER TABLE inmemorytable
WITH compaction= {'class': 'MemoryOnlyStrategy'}
AND caching = {'keys':'NONE', 'rows_per_partition':'NONE'};
Note that tables configured with DSE In-Memory are still persisted to disk, so you won't lose any data in the event of a power outage or service disruption. In-Memory tables operate the same as regular tables, so the same backup and restore processes apply; the only difference is that a copy of the data is kept in memory for faster read performance.
For details, see DataStax Enterprise In-Memory. Cheers!
Cassandra is positioned as a scalable and fast database.
Why, from a technical standpoint, can these goals not be accomplished with secondary indexes?
Cassandra does indeed have secondary indexes. But secondary index usage doesn't work well in a distributed database, because each node only holds a subset of the overall dataset.
I previously wrote an answer which discussed the underlying details of secondary index queries:
How do secondary indexes work in Cassandra?
While it should help give you some understanding of what's going on, that answer is written from the context of first querying by a partition key. This is an important distinction, as secondary index usage within a partition should perform well.
The problem with querying only by a secondary index is that Cassandra cannot guarantee that all of your data can be served by a single node. When this happens, Cassandra designates a node as the coordinator, which in turn queries all other nodes for the specified indexed values.
Essentially, instead of performing sequential reads from a single node, secondary index usage forces Cassandra to perform random reads from all nodes. Now you don't have just disk seek time, but also network time complicating things.
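To make this concrete, here is a sketch of the kind of query that triggers this fan-out (table, index, and column names are hypothetical):

CREATE TABLE users (
    id uuid PRIMARY KEY,
    email text,
    name text
);

CREATE INDEX users_email_idx ON users (email);

-- The coordinator must fan this query out to every node, because matching
-- rows could live anywhere in the cluster:
SELECT * FROM users WHERE email = 'a@example.com';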
The recommendation for Cassandra modeling is to duplicate your data into new tables built to support each desired query, as sketched below. This adds some complications around keeping the copies in sync, but (when done correctly) it ensures that your queries can indeed be served by a single node. That's a tradeoff you need to make when building your model: you can have convenience or performance, but not both.
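Continuing the hypothetical users example from above, the duplicated query table might look like:

-- Same data, keyed by the column we want to query, so the read is served
-- by a single partition on a single node:
CREATE TABLE users_by_email (
    email text,
    id uuid,
    name text,
    PRIMARY KEY (email, id)
);

SELECT * FROM users_by_email WHERE email = 'a@example.com';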
So yes, Cassandra does have secondary indexes, and Aaron's explanation does a great job of showing why they don't scale well.
You see many people trying to solve this issue by writing their data to multiple tables, so they can be sure that the data needed to answer a query that would traditionally rely on a secondary index lives on the same node.
Some of the recent iterations of Cassandra have this 'built in' via materialized views. I've not really used them since 3.0.11, but they are promising. The problems I had at the time were primarily with adding them to tables with existing data, and they had a surprisingly large amount of overhead on write (increased latency).
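For reference, a materialized view over the hypothetical users table sketched earlier would look something like this; Cassandra then maintains the denormalized copy on every write, which is where the extra write latency comes from:

CREATE MATERIALIZED VIEW users_by_name AS
    SELECT * FROM users
    WHERE name IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (name, id);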
I am working on an Oracle-to-Cassandra migration tool, where I want to maintain a validation table with an Oracle count column and a Cassandra count column so that I can validate the migration job. In Cassandra, is there any way the system maintains the count of recently executed/inserted queries, or the total row count of a particular table? Does Cassandra store this anywhere in its system tables? If so, where? If not, please suggest a way to design a validation framework for the data migration.
Is there a way in Cassandra to get the record count of the latest inserts and the total row count of a table from system tables, so we can read the counts instead of executing a count(*) query against the tables? Does Cassandra maintain these counts anywhere internally? If so, in which system tables can we check this metadata?
Cassandra is a distributed system, and there is no place where it collects counts per table. You can get some estimates from system.size_estimates, but it only reports the partition count per token range, and their sizes.
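For example, the estimates for a given table can be read like this (keyspace and table names are placeholders):

-- One row per token range; summing partitions_count gives only a rough
-- estimate of the table's partition count, not an exact row count:
SELECT range_start, range_end, partitions_count, mean_partition_size
FROM system.size_estimates
WHERE keyspace_name = 'my_keyspace' AND table_name = 'my_table';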
For the framework you're asking about, the easiest way may be to develop custom Spark code that counts the rows and performs the other checks. Spark is highly optimized for efficient data access and is likely preferable to writing that custom code yourself.
Also, during the migration, consider using a consistency level greater than ONE to make sure that at least several nodes have confirmed writing the data, although this depends on the amount of data and the timing requirements of your migration jobs.
I plan to use memsql to store my last 7 days data for real time analytics using SQL.
I checked the documentation and found that there is no TTL/expiration feature in MemSQL.
Is there any such feature (in case I missed it)?
Does MemSQL fit the use case if I do a daily delete of data older than 7 days? I'm quite curious about fragmentation.
We tried this on PostgreSQL, where we had to execute the VACUUM command, and it takes a long time to run.
There is no TTL/expiration feature. You can achieve it by running delete queries. Many customer use cases do this type of thing, so yes, MemSQL does fit the use case. Fragmentation generally shouldn't be too much of a problem here - what kind of fragmentation are you concerned about?
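A minimal sketch of the delete-query approach (table and column names are hypothetical):

-- Run on a schedule, e.g. daily via cron or an external scheduler:
DELETE FROM metrics
WHERE ts < NOW() - INTERVAL 7 DAY;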
There is no out-of-the-box TTL feature in MemSQL.
We achieved TTL by adding an additional TS column in our MemSQL Rowstore table with TIMESTAMP(6) datatype.
This provides automatic current timestamp insertion when you add a new row to the table.
When querying data from this table, you can apply a simple filter based on this TIMESTAMP column to filter older records beyond your TTL value.
https://docs.memsql.com/sql-reference/v6.7/datatypes/#time-and-date
You can always have a batch job which runs once a month to delete the older data.
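A minimal sketch of that pattern (hypothetical names; the TIMESTAMP(6) column defaults to the current timestamp on insert):

CREATE TABLE metrics (
    id BIGINT NOT NULL,
    value DOUBLE,
    ts TIMESTAMP(6) NOT NULL DEFAULT CURRENT_TIMESTAMP(6),
    PRIMARY KEY (id)
);

-- Reads simply filter out rows older than the TTL window:
SELECT * FROM metrics
WHERE ts >= NOW(6) - INTERVAL 7 DAY;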
We have not seen any issues due to fragmentation, but if it is a concern for you, you can occasionally do the following:
MemSQL’s memory allocators can become fragmented over time (especially if a large table is shrunk dramatically by deleting data randomly). There is no command currently available that will compact them, but running ALTER TABLE ADD INDEX followed by ALTER TABLE DROP INDEX will do it.
Warning
Caution should be taken with this workaround. Plans will be rebuilt, and the two ALTER queries will move all rows in the table twice, so this should not be done often.
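In other words, something like this (index and column names are placeholders):

-- Adding and then dropping a throwaway index rewrites the table's
-- in-memory storage, compacting the allocators as a side effect:
ALTER TABLE metrics ADD INDEX tmp_defrag_idx (ts);
ALTER TABLE metrics DROP INDEX tmp_defrag_idx;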
Reference:
https://docs.memsql.com/troubleshooting/latest/troubleshooting/
Given a simple CQL table which stores an ID and a Blob, is there any problem or performance impact of storing potentially billions of rows?
I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that. I don't have any particular requirement to ensure the data is clustered together or able to filter in any order. I'm wondering whether very many rows in a CQL table could be problematic in any way.
I'm considering binning my data, that is, creating a partition key which is hash(ID) % n, limiting the data to n 'bins' (millions of them?). Before I add that overhead I'd like to validate whether it's actually worthwhile.
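For reference, the binning idea I'm describing would look something like this (names are hypothetical, with the bucket computed client-side as hash(ID) % n before each insert):

CREATE TABLE items_by_bucket (
    bucket int,
    id uuid,
    data blob,
    PRIMARY KEY ((bucket), id)
);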
First, I don't think this is correct:
I know with earlier versions of Cassandra wide rows were de rigueur, but CQL seems to encourage us to move away from that.
Wide rows are still supported, and work well. There's a post from Jonathan Ellis, Does CQL support dynamic columns / wide rows?:
A common misunderstanding is that CQL does not support dynamic columns or wide rows. On the contrary, CQL was designed to support everything you can do with the Thrift model, but make it easier and more accessible.
For the part about the "performance impact of storing potentially billions of rows" I think the important part to keep in mind is the size of these rows.
According to Aaron Morton in this mail thread:
When rows get above a few 10’s of MB things can slow down, when they get above 50 MB they can be a pain, when they get above 100 MB it’s a warning sign. And when they get above 1 GB, well, you don’t want to know what happens then.
and later:
Larger rows take longer to go through compaction, tend to cause more JVM GC, and have issues during repair. See the in_memory_compaction_limit_in_mb comments in the yaml file. During repair we detect differences in ranges of rows and stream them between the nodes. If you have wide rows and a single column is out of sync, we will create a new copy of that row on the node, which must then be compacted. I’ve seen the load on nodes with very wide rows go down by 150GB just by reducing the compaction settings.
IMHO, all things being equal, rows in the few 10’s of MB work better.
In a chat with Aaron Morton (The Last Pickle) he indicated that billions of rows per table is not necessarily problematic.
Leaving this answer for reference, but not selecting it as "talked to a guy who knows a lot more than me" isn't particularly scientific.