Cassandra: Multiple tables vs materialized view

Creating a materialized view seems like an easier option compared to maintaining multiple tables, but is it a good one?
After all, a materialized view is nothing but another table behind the scenes.
What exactly happens when we create a materialized view over a table and the partition key is changed to a clustering key?
I just think creating another table rather than a materialized view is better from a long-term perspective when the data growth rate is high.

MVs really help avoid the overhead of managing multiple tables on the client side. However, they have some functional limitations.
This is a good blog post on MVs.
Also, you will see a warning when using MVs:
MVs are experimental and are not recommended for production use.
Personally, I would prefer managing my own table instead of taking on this risk.
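As an illustration, here is a minimal sketch with the Python cassandra-driver of the two approaches - an MV where a regular column becomes the partition key and the old partition key becomes a clustering key, versus a hand-managed second table. The keyspace, tables, and columns (users_by_id, users_by_email, etc.) are hypothetical, not from the question:

    # Minimal sketch with the Python cassandra-driver; keyspace, table,
    # and column names are hypothetical.
    from cassandra.cluster import Cluster

    session = Cluster(["127.0.0.1"]).connect("demo")

    # Base table, partitioned by user_id.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_id (
            user_id int,
            email   text,
            name    text,
            PRIMARY KEY (user_id)
        )""")

    # Option 1: a materialized view partitioned by email; the old
    # partition key user_id becomes a clustering key. Cassandra keeps
    # the view in sync itself, at the cost of a read-before-write on
    # every base-table update.
    session.execute("""
        CREATE MATERIALIZED VIEW IF NOT EXISTS users_by_email AS
        SELECT user_id, email, name FROM users_by_id
        WHERE email IS NOT NULL AND user_id IS NOT NULL
        PRIMARY KEY (email, user_id)""")

    # Option 2: an equivalent second table that the application must
    # keep in sync on every write itself.
    session.execute("""
        CREATE TABLE IF NOT EXISTS users_by_email_manual (
            email   text,
            user_id int,
            name    text,
            PRIMARY KEY (email, user_id)
        )""")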

Related

Precalculate OLAP cube inside Azure Synapse

We have a dimensional model with fact tables of 100-300 GB in Parquet each. We build PBI reports on top of Azure Synapse (DirectQuery) and experience performance issues on slicing/dicing and especially on calculating multiple KPIs. At the same time, the data volume is pretty expensive to keep in Azure Analysis Services. Because of the number of dimensions, the fact table can't be aggregated significantly, so PBI import mode or a composite model isn't an option either.
Azure Synapse Analytics facilitates OLAP operations, like GROUP BY ROLLUP/CUBE/GROUPING SETS.
How can I benefit from Synapse's OLAP operations support?
Is it possible to pre-calculate OLAP cubes inside Synapse in order to boost PBI report performance? How?
If the answer is yes, is it recommended to pre-calculate KPIs? That means moving KPI definitions to the DWH OLAP cube level - is that an anti-pattern?
P.S. Using separate aggregations for each PBI visualisation is not an option; it's more of an exception to the rule. Synapse is clever enough to take advantage of a materialized view's aggregation even when querying the base table, but that way you can't implement RLS, and managing that number of materialized views also looks cumbersome.
Update for @NickW
Could you please answer the following sub-questions:
1. Have I got it right that OLAP operations support is mainly for downstream cube providers, not for warehouse performance?
2. Is filling the warehouse with materialized views in order to boost performance considered a common practice or an anti-pattern? I've found (see the link) that Power BI can create materialized views automatically based on query patterns. Still, I'm afraid it won't provide a stable, testable solution, and there's the RLS support issue again.
3. Is KPI pre-calculation on the warehouse side considered a common approach or an anti-pattern? As I understand it, this is usually done on the cube provider side, but what if I don't have one?
4. Do you see any other options to boost the performance? I can think only of reducing query parallelism by using a PBI composite model and importing all dimensions into PBI. Not sure if it'd help.
Synapse Result Set Caching and Materialized Views can both help.
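As a rough sketch of what the two features look like against a dedicated SQL pool (run here through pyodbc; the connection string, database, and fact table names are made up, not from the question):

    # Hypothetical sketch via pyodbc; connection string, database, and
    # table/column names are placeholders.
    import pyodbc

    CONN_STR = (
        "Driver={ODBC Driver 18 for SQL Server};"
        "Server=myworkspace.sql.azuresynapse.net;"  # placeholder
        "Database=SalesDW;UID=loader;PWD=secret"    # placeholder
    )

    # Result set caching is a per-database setting and has to be
    # enabled from the master database; repeated identical queries are
    # then served from cache.
    master = pyodbc.connect(
        CONN_STR.replace("Database=SalesDW", "Database=master"),
        autocommit=True,
    )
    master.cursor().execute("ALTER DATABASE SalesDW SET RESULT_SET_CACHING ON")

    # A materialized view pre-aggregates the fact table; the optimizer
    # can use it even for queries written against the base table.
    cur = pyodbc.connect(CONN_STR, autocommit=True).cursor()
    cur.execute("""
        CREATE MATERIALIZED VIEW dbo.mv_sales_by_product_day
        WITH (DISTRIBUTION = HASH(ProductKey))
        AS
        SELECT ProductKey,
               DateKey,
               COUNT_BIG(*)     AS row_count,
               SUM(SalesAmount) AS sales_amount
        FROM dbo.FactSales
        GROUP BY ProductKey, DateKey""")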
In the future, the creation and maintenance of Materialized Views will be automated.
Azure Synapse will automatically create and manage materialized views for larger Power BI Premium datasets in DirectQuery mode. The materialized views will be based on usage and query patterns. They will be automatically maintained as a self-learning, self-optimizing system. Power BI queries to Azure Synapse in DirectQuery mode will automatically use the materialized views. This feature will provide enhanced performance and user concurrency.
https://learn.microsoft.com/en-us/power-platform-release-plan/2020wave2/power-bi/synapse-integration
Power BI Aggregations can also help. If there are a lot of dimensions, select the most commonly used to create aggregations.
To hopefully answer some of your questions...
You can't pre-calculate OLAP cubes in Synapse; the closest you could get is creating aggregate tables, and you've stated that this is not a viable solution.
OLAP operations can be used in queries but don't "pre-build" anything that can be used by other queries (ignoring CTEs, sub-queries, etc.). So if you have existing queries that don't use these functions then re-writing them to use these functions might improve performance - but only for each specific query
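For example, a single GROUPING SETS query (hypothetical table and column names again) returns several aggregation levels in one pass - but the work benefits only that one query:

    # Hypothetical GROUPING SETS example; names are placeholders.
    import pyodbc

    CONN_STR = "<odbc connection string for the dedicated SQL pool>"
    cur = pyodbc.connect(CONN_STR).cursor()

    # One scan of the fact table yields three aggregation levels:
    # (product, day), product-only, and the grand total.
    cur.execute("""
        SELECT ProductKey, DateKey, SUM(SalesAmount) AS sales_amount
        FROM dbo.FactSales
        GROUP BY GROUPING SETS ((ProductKey, DateKey), (ProductKey), ())""")
    for row in cur.fetchall():
        print(row)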
I realise that your question was about OLAP but the underlying issue is obviously performance. Given that OLAP is unlikely to be a solution to your performance issues, I'd be happy to talk about performance tuning if you want?
Update 1 - Answers to additional numbered questions
1. I'm not entirely sure I understand the question, so this may not be an answer: the OLAP functions are there so that it is possible to write queries that use them. There can be an infinite number of reasons why people might need to write queries that use these functions.
2. Performance is the main (only?) reason for creating materialised views. They are very effective for creating datasets that will be used frequently, e.g. when base data is at day level but lots of reports are aggregated at week/month level. As stated by another user in the comments, Synapse can manage this process automatically, but whether it can actually create aggregates that are useful for a significant proportion of your queries is obviously entirely dependent on your particular circumstances.
3. KPI pre-calculation. In a DW, any measures that can be calculated in advance should be (by your ETL/ELT process). For example, if you have reports that use Net Sales Amount (Gross Sales - Tax) and your source system only provides Gross Sales and Tax amounts, then you should be calculating Net Sales as a measure when loading your fact table (see the sketch after this list). Obviously there are KPIs that can't be calculated in advance (probably anything involving averages) and these need to be defined in your BI tool.
4. Boosting Performance: I'll cover this in the next section as it is a longer topic.
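A minimal sketch of that KPI point - deriving an additive measure once during the load rather than in every report (the staging and fact table names are made up):

    # Hypothetical ELT step: compute Net Sales once at load time.
    import pyodbc

    CONN_STR = "<odbc connection string for the dedicated SQL pool>"
    cur = pyodbc.connect(CONN_STR, autocommit=True).cursor()

    # Net Sales = Gross Sales - Tax is additive, so precompute it in
    # the fact table instead of repeating the formula in every report.
    cur.execute("""
        INSERT INTO dbo.FactSales
            (OrderKey, GrossSalesAmount, TaxAmount, NetSalesAmount)
        SELECT OrderKey,
               GrossSalesAmount,
               TaxAmount,
               GrossSalesAmount - TaxAmount
        FROM staging.Sales""")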
Boosting Performance
Performance tuning is a massive subject - some areas are generic and some will be specific to your infrastructure; this is not going to be a comprehensive review but will highlight a few areas you might need to consider.
Bear in mind a couple of things:
There is always an absolute limit on performance - based on your infrastructure - so even in a perfectly tuned system there is always going to be a limit that may not be what you hoped to achieve. However, with modern cloud infrastructure the chances of you hitting this limit are very low
Performance costs money. If all you can afford is a Mini then regardless of how well you tune it, it is never going to be as fast as a Ferrari
Given these caveats, a few things you can look at:
Query plan. Have a look at how your queries are executing and whether there are any obvious bottlenecks you can then focus on. This link gives some further information: Monitor SQL Workloads
Scale up your Synapse SQL pool. If you throw more resources at your queries they will run quicker. Obviously this is a bit of a "blunt instrument" approach but worth trying once other tuning activities have been tried. If this does turn out to give you acceptable performance you'd need to decide if it is worth the additional cost. Scale Compute
Ensure your statistics are up to date
Check if the distribution mechanism (Round Robin, Hash) you've used for each table is still appropriate and, on a related topic, check the skew on each table (the sketch after this list covers both this and the statistics refresh)
Indexing. Adding appropriate indexes will speed up your queries though they also have a storage implication and will slow down data loads. This article is a reasonable starting point when looking at your indexing: Synapse Table Indexing
Materialised Views. Covered previously but worth investigating. I think the automatic management of MVs may not be out yet (or is only in public preview) but may be something to consider down the line
Data Model. If you have some fairly generic facts and dimensions that support a lot of queries then you might need to look at creating additional facts/dimensions just to support specific reports. I would always (if possible) derive them from existing facts/dimensions but you can create new tables by dropping unused SKs from facts, reducing data volumes, sub-setting the columns in tables, combining tables, etc.
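A small sketch covering the statistics and skew items from the list above (the table name is hypothetical):

    # Hypothetical maintenance sketch: refresh statistics and check
    # distribution skew on a fact table.
    import pyodbc

    CONN_STR = "<odbc connection string for the dedicated SQL pool>"
    cur = pyodbc.connect(CONN_STR, autocommit=True).cursor()

    # Refresh all statistics on the table after a large load so the
    # optimizer has accurate row counts.
    cur.execute("UPDATE STATISTICS dbo.FactSales")

    # Rows per distribution; a wide spread between distributions means
    # the hash column is skewed and may need to be changed.
    cur.execute("DBCC PDW_SHOWSPACEUSED('dbo.FactSales')")
    for row in cur.fetchall():
        print(row)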
Hopefully this gives you at least a starting point for investigating your performance issues.

Will DynamoDB get Materialized Views?

I am considering choosing between DynamoDB and AWS Keyspaces.
My main issue is still with many-to-many relationships in Dynamo. You don't really have nice options. Either you use adjacency lists, for immutable data... but in most scenarios the data is going to change. Another way is making two DB calls, which is really not that great. The third option would be updating the data all the time, which also seems like a big pain in the a**. Also, batch writes are limited to 25 items, I think.
However, Cassandra provides materialized views, where at least I don't have to manage the replication on my own. I can also make one DB call to get everything I need.
I am still relatively new to NoSQL databases so I might be missing a lot of stuff.
Are there plans for Dynamo to add materialized views, or is there a better way to do it?
In my eyes it seems like a really good feature. It wouldn't even have to create new tables, rather just references between items' attributes to make it auto-update.
DynamoDB has a feature called Global Secondary Index which is very close to the materialized view feature of Cassandra. Despite its confusing name, DynamoDB's GSI is not just an index like what Cassandra calls a "secondary index"! It doesn't just list the keys matching a particular column value: beyond the keys, it can also keep any other item attributes which you choose to project. Exactly like a materialized view.
DynamoDB also has a more efficient Local Secondary Index, which you can consider if the view's partition key is the same as the base table's and you just want to sort items differently or project only some of the attributes.
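To illustrate, a minimal boto3 sketch (the table, index, and attribute names are made up): DynamoDB keeps the GSI in sync on every write, and one Query call against it returns whatever was projected into it:

    # Hypothetical boto3 sketch; table/index/attribute names are made up.
    import boto3

    ddb = boto3.client("dynamodb")

    ddb.create_table(
        TableName="Orders",
        BillingMode="PAY_PER_REQUEST",
        AttributeDefinitions=[
            {"AttributeName": "order_id", "AttributeType": "S"},
            {"AttributeName": "customer_id", "AttributeType": "S"},
            {"AttributeName": "order_date", "AttributeType": "S"},
        ],
        KeySchema=[{"AttributeName": "order_id", "KeyType": "HASH"}],
        GlobalSecondaryIndexes=[{
            "IndexName": "by_customer",
            "KeySchema": [
                {"AttributeName": "customer_id", "KeyType": "HASH"},
                {"AttributeName": "order_date", "KeyType": "RANGE"},
            ],
            # Projecting non-key attributes is what makes a GSI behave
            # like a materialized view rather than a bare index.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["status", "total"],
            },
        }],
    )

    # One call answers "all orders for a customer, newest first".
    ddb.query(
        TableName="Orders",
        IndexName="by_customer",
        KeyConditionExpression="customer_id = :c",
        ExpressionAttributeValues={":c": {"S": "c-123"}},
        ScanIndexForward=False,
    )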

What is better in Cassandra, a view or another table?

I have a table that I need to search by a non-indexed field. What is better: to make a separate table containing the data I need, keyed by that field, or to make a view? What are the drawbacks of each choice? Maybe I could use a secondary index in that case instead?
A second table will be better, hands down. The only disadvantage is that it requires more effort on your part.
Materialized views have issues where they get out of sync, and there's no way to repair them other than dropping and recreating them (they are now considered experimental and not production-ready). Secondary indexes require huge scatter-gather queries that make your 99th percentile your average (while also being difficult to size appropriately). Ultimately, under any heavy load, MVs or 2i will break, even though they're easy to add.
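The extra effort looks roughly like this sketch (Python driver, made-up table names): every write goes to both tables, typically in a logged batch so the two stay eventually consistent:

    # Sketch of maintaining a second query table by hand; names are
    # hypothetical. Assumes items_by_id and items_by_owner both exist
    # with compatible schemas.
    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    session = Cluster(["127.0.0.1"]).connect("demo")

    # A logged batch guarantees both inserts eventually apply, keeping
    # the denormalized tables in sync without MVs or 2i.
    batch = BatchStatement(batch_type=BatchType.LOGGED)
    batch.add(
        "INSERT INTO items_by_id (item_id, owner, name) VALUES (%s, %s, %s)",
        (42, "ann", "widget"),
    )
    batch.add(
        "INSERT INTO items_by_owner (owner, item_id, name) VALUES (%s, %s, %s)",
        ("ann", 42, "widget"),
    )
    session.execute(batch)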

Is Cassandra just a storage engine?

I've been evaluating Cassandra to replace MySQL in our microservices environment, due to MySQL being the only portion of the infrastructure that is not distributed. Our needs are both write and read intensive as it's a platform for exchanging raw data. A type of "bus" for lack of better description. Our selects are fairly simple and should remain that way, but I'm already struggling to get past some basic filtering due to the extreme limitations of select queries.
For example, if I need to filter data it has to be in the key. At that point I can't change data in the fields because they're part of the key. I can use a SASI index but then I hit a wall if I need to filter by more than one field. The hope was that materialized views would help with this but in another post I was told to avoid them, due to some instability and problematic behavior.
It would seem that Cassandra is good at storage but, realistically, not good as a standalone database platform for non-trivial applications beyond very basic filtering (i.e. a single field). I'm guessing I'll have to accept the use of another front end like Elastic, Solr, etc. The other option might be to accept the idea of filtering data within application logic, which is doable as long as the data sets coming back remain small enough.
Apache Cassandra is far more than just a storage engine. It is designed as a distributed database oriented towards high availability and partition tolerance, which can limit query capability if you want good and reliable performance.
It has a query language, CQL, which is quite powerful but deliberately limited, in a way that guides users towards effective queries. To use it effectively you need to model your tables around your queries.
More often than not, you need to query your data in multiple ways, so users will often denormalize their data into multiple tables. Materialized views aim to make that user experience better, but they have had their share of bugs and limitations, as you indicated. At this point, if you are considering them, you should be aware of their limitations - although that is generally a good idea when evaluating anything.
If you need advanced querying capabilities or do not have ahead-of-time knowledge of what the queries will be, Cassandra may not be a good fit. You can build these capabilities using products like Spark and Solr on top of Cassandra (as DataStax Enterprise does), but it may be difficult to achieve using Cassandra alone.
On the other hand there are many use cases where Cassandra is a great fit, such as messaging, personalization, sensor data, and so on.

Is the notion of a Table required at all in Azure Table service?

I am working with Azure Table services. What I would like to ask is whether the Azure internals care about the notion of a Table at all.
Making things fast largely depends on the partition key and row key. Tables do not look like containers or groupings of entities, because there is no limit on the number of tables you can create, and the total storage size is tied to my storage account.
So is a Table just a notion to help people transition from RDBMS land, or does it serve a purpose internally? Can I build applications with a single-table design without worrying about performance? After all, if a table is just a tag, then I may as well include it as part of the partition key.
EDIT
To give an example, Table partition keys look very much like Cassandra rows, and Table rows are like Cassandra columns. It is okay to treat the storage as a big bucket of key (RowKey)-value pairs. Partition keys are the sharding mechanism. A table then just comes across as a "labeling" notion.
You can have all your entities in a single Azure table and get optimum performance by thoughtfully choosing the right PartitionKey and RowKey combination, but IMHO it is always better to keep related entities in separate tables, as that is more easily manageable (from a developer perspective): you know which part of your application is hitting which table.
You may want to view this recorded session which discusses best practices and internals.
Windows Azure Storage: What’s Coming, Best Practices, and Internals
You can think of the combination of a table name and a partition key as being the unit of performance.
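If you do fold the "table" notion into the key, a single-table design might look like this sketch (azure-data-tables package; the connection string and key scheme are hypothetical):

    # Hypothetical single-table design with the azure-data-tables SDK;
    # the connection string and key scheme are made up.
    from azure.data.tables import TableServiceClient

    CONN_STR = "<storage account connection string>"
    table = TableServiceClient.from_connection_string(
        CONN_STR
    ).create_table_if_not_exists("app")  # one physical table

    # The logical "table" becomes a prefix on the PartitionKey, so each
    # entity type still gets its own partitions (the unit of performance).
    table.create_entity({
        "PartitionKey": "user_42", "RowKey": "profile", "name": "Ann",
    })
    table.create_entity({
        "PartitionKey": "order_42", "RowKey": "2013-07-01_1001", "total": 99.5,
    })

    # Point queries stay fast because they target a single partition.
    for entity in table.query_entities("PartitionKey eq 'user_42'"):
        print(entity)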
You're probably in the mindset that will re-invent FatEntities, developed by Lokad.
The concept of a "Table" has many restrictions and issues. For example, if you have a large table with 100,000 entries in a partition, you can't easily jump to entry 99,001 without iterating through each of the entries, and going backwards is impossible (you can't start at the last entry and iterate backwards).
