I am choosing between DynamoDB and AWS Keyspaces.
My main issue is still with many-to-many relationships in Dynamo. There aren't really any nice options. You can use an adjacency list, which works for immutable data, but in most scenarios the data is going to change. Another way is to make 2 DB calls, which is really not that great. A third option would be to update the duplicated data all the time, which also seems like a big pain in the a**. Also, batch writes are limited to 25 items, I think.
However, Cassandra provides materialized views, where at least I don't have to manage replication on my own. Also, I can do 1 DB call to get all I need.
I am still relatively new to NoSQL databases so I might be missing a lot of stuff.
Are there plans for Dynamo to add materialized views, or is there a better way to do it?
In my eyes it seems like a really good feature. It doesn't even have to create new tables, rather references between columns of items to make it autoupdate.
DynamoDB has a feature called Global Secondary Index (GSI), which is very close to the materialized view feature of Cassandra. Despite its confusing name, DynamoDB's GSI is not just an index like what Cassandra calls a "secondary index"! It doesn't just list the keys matching a particular column value: beyond the keys, it can also keep any other item attributes that you choose to project. Exactly like a materialized view.
DynamoDB also has a more efficient Local Secondary Index which you can consider if the view's partition key is the same as the base table's - and you just want to sort items differently or project only part of the attributes.
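To make the GSI idea more concrete, here is a rough sketch of the parameters one might pass to boto3's `create_table` for a many-to-many table. The table name, the attribute names (`user_id`, `group_id`, `role`) and the index name are all made up for this example; only the overall parameter shape follows the DynamoDB CreateTable API. The dictionary is built but not sent anywhere, so it runs without AWS credentials.

```python
# Hypothetical many-to-many "membership" table; every name here is illustrative.
create_table_params = {
    "TableName": "Membership",
    "BillingMode": "PAY_PER_REQUEST",
    # Only attributes used in a key schema need to be declared.
    "AttributeDefinitions": [
        {"AttributeName": "user_id", "AttributeType": "S"},
        {"AttributeName": "group_id", "AttributeType": "S"},
    ],
    # Base table: look up the groups of a user.
    "KeySchema": [
        {"AttributeName": "user_id", "KeyType": "HASH"},
        {"AttributeName": "group_id", "KeyType": "RANGE"},
    ],
    # GSI with the keys swapped: look up the users of a group.
    "GlobalSecondaryIndexes": [
        {
            "IndexName": "MembersByGroup",
            "KeySchema": [
                {"AttributeName": "group_id", "KeyType": "HASH"},
                {"AttributeName": "user_id", "KeyType": "RANGE"},
            ],
            # Copy the (assumed) non-key attribute "role" into the index,
            # which is what makes it behave like a materialized view.
            "Projection": {
                "ProjectionType": "INCLUDE",
                "NonKeyAttributes": ["role"],
            },
        }
    ],
}

# With boto3 installed and credentials configured, this could be sent as:
# boto3.client("dynamodb").create_table(**create_table_params)
```

DynamoDB maintains the index itself on every write to the base table, so the application does not have to keep the two directions in sync.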
As the title says, I'm trying to query all the rows that have no value stored in a given column. I've been searching for a while, and the only operation allowed that I've found is CONTAINS, which doesn't fit my need.
Consider the following table:
CREATE TABLE environment(
id uuid,
name varchar,
message text,
public Boolean,
participants set<varchar>,
PRIMARY KEY (id)
)
How can I get all entries in the table with an empty set? E.g. participants = {} or null?
Unfortunately, you really can't. Cassandra makes queries like this difficult by design, because there's no way it can be done without doing a full table scan (scanning each and every node). This is why a big part of Cassandra data modeling is understanding all the ways that table will be queried, and building it to support those queries.
The other issue that you'll have to deal with is that (generally speaking) Cassandra does not allow filtering by nulls. Again, it's a design choice...it's much easier to query for data that exists, rather than data that does not exist. Although, when writing with lightweight transactions, there are ways around this one (using the IF clause).
If you knew all of the ids ahead of time, you could write something to iterate through them, SELECT each row, and check for null on the app side. That approach will be slow, but it won't stress the cluster. Probably the better approach is to use a distributed OLAP layer like Apache Spark. It still wouldn't be fast, but this is probably the best way to handle a situation like this.
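The app-side check could look roughly like the sketch below. It assumes the rows have already been fetched (for example, one SELECT per known id) and are available as dict-like objects with the `id` and `participants` columns from the table above; the function name is made up. Depending on the driver, an unset set may come back as `None` or as an empty collection, so the check treats both the same way.

```python
from typing import Any, Iterable, List, Mapping


def ids_without_participants(rows: Iterable[Mapping[str, Any]]) -> List[Any]:
    """Return the ids of rows whose participants column is null or empty."""
    # `not value` is true for None, for an empty set, and for a missing key.
    return [row["id"] for row in rows if not row.get("participants")]


# Example with plain dicts standing in for driver rows.
sample_rows = [
    {"id": "env-1", "participants": {"alice", "bob"}},
    {"id": "env-2", "participants": None},
    {"id": "env-3", "participants": set()},
]
empty_ids = ids_without_participants(sample_rows)
```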
I know the question was asked before, but at the time it was, we had EF Core 2.x. The short answer was "no you can't" and obviously, not very helpful.
The other answers involved ugly hacks like changing migration files after they were created by the tool.
I am building a Code First application. I have created my models with lots of foreign keys and database joins in mind.
But here comes the unpleasant surprise (I'm a little new to EF): those joins written in LINQ are pretty slow. As a matter of fact, they do not produce a database join, but fetch whole tables instead.
Of course that's totally unacceptable. I imported an old database with millions of records; with the joins I get results in milliseconds, without them I get lags of a couple of seconds, and that is on my very fast internet connection (in a real-world scenario it would be much worse).
I need views, and AFAIK EF won't create them for me. Is that STILL true for EF 3.0?
Then, what would be the best and cleanest way to create views in SQL and to make entities for them? I mean, considering that the database models will change over time and the database structure will have to be updated.
Well, I would prefer not to do my joins in SQL views, and just have queries return the results of a "JOIN" statement. Especially for some non-obvious joins. Let's say table B has a column that is a foreign key referencing table A. I want to get results from table A joined with B for details. With normal SQL JOIN performance.
I checked the database: there is no significant performance difference between "select * from A" and "select * from A join B...". In LINQ - the difference is huge.
I figured out that in Code First database views are redundant.
The "views" can be created as models (ordinary classes) having a field or a property set to the joined entity. I use private fields for that purpose. Then I use LINQ Join() to create my view entity. The query may refer ONLY to the fields set to joined entities, nothing else. Such a query, if written properly, translates cleanly to a SQL JOIN and works at full speed. In my application it's the equivalent of a database view.
Why private fields and not properties, you may ask. Partly because joined entities are "implementation details", but another reason is that my presentation code uses reflection to operate on the entities' public properties, so it's good to have the joined entities hidden from it. Otherwise I would probably need to use attributes to hide those "columns".
BTW, such views can be ordered with OrderBy() and filtered with Where() at virtually no cost. The constraint is to maintain the collection's IQueryable interface and never refer to joined entities indirectly. So even if X refers to A.B, never refer to X in a LINQ query; always use A.B, where A is the direct entity reference assigned in the Join() query.
To build dynamic queries at runtime, one must use expressions.
This set of properties of EF Core 3.0 allows building a database application without using SQL, but with full SQL speed maintained. However, the database / entity structure must be relatively simple to achieve that.
Given a scenario where you have a User table, with id as PRIMARY KEY.
You have a column called email, and a column called name.
You want to UPDATE User.name based on User.email
I realized that the UPDATE command requires you to pass in a PRIMARY KEY. Does this mean I can't use a pure CQL migration, and would need to first query for the User.id primary key before I can UPDATE?
In this case, I DO know the PRIMARY KEY because the UUIDs are the same for dev and prod, but it feels dirty.
Yes, you're correct: you need to know the primary key of the record to perform an update on the data, or to delete a specific record. There are several options here, depending on your data model:
Perform a full scan of the table using an efficient token range scan (look at this answer for more details);
If this is required very often, you can create a materialized view, with User.email as partition key, and fetch all message IDs that you can update (but you'll need to do this from your application, as there is no nested query support in CQL). But also be aware that materialized views are an "experimental" feature in Cassandra, and may not work all the time (they are more stable in DataStax Enterprise). Also, if you have some users with hundreds of thousands of emails, this may create big partitions.
Do the same as in the 2nd item, but in your own code, by maintaining an additional table
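For the first option, a token range scan means splitting the full token ring into contiguous slices and querying each slice separately, so that no single query has to cover the whole table. The sketch below only computes the slices and formats the CQL text; it assumes the default Murmur3Partitioner (tokens from -2^63 to 2^63-1) and a hypothetical table called `users` with `id` as the partition key. The function names are made up, and executing the queries through a driver is left out.

```python
from typing import List, Tuple

# Token bounds of Murmur3Partitioner (assumed to be the partitioner in use).
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def split_token_ranges(parts: int) -> List[Tuple[int, int]]:
    """Split the full token ring into `parts` contiguous (start, end) slices."""
    if parts < 1:
        raise ValueError("parts must be at least 1")
    total = MAX_TOKEN - MIN_TOKEN
    bounds = [MIN_TOKEN + total * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def range_query(start: int, end: int, first: bool) -> str:
    """Format the CQL for one slice; only the first slice includes its start."""
    lower_op = ">=" if first else ">"
    return (
        "SELECT id, email FROM users "  # hypothetical table name
        f"WHERE token(id) {lower_op} {start} AND token(id) <= {end}"
    )


ranges = split_token_ranges(4)
queries = [range_query(s, e, first=(i == 0)) for i, (s, e) in enumerate(ranges)]
```

Each query can then be run independently (and in parallel), with the application filtering on `email` and issuing the UPDATE by `id` for the matching rows.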
I think Alex's answer covers your question -- "how can I find a value in a PK column working backwards from a non-PK column's value?".
However, I think it's worth noting that asking this question indicates you should reconsider your data model. A rule of thumb in C* data model design is that you begin by considering the queries you need, and you've missed the UPDATE query use case. You can probably make things work without changing your model for now, but if you find you need to make other queries you're unprepared for, you'll run into operational issues with lots of indexes and/or MVs.
More generally, search around for articles and other resources about Cassandra data modeling. It sounds like you're basically using C* for a relational use case so you'll want to look into that.
I have a table that I need to search by a non-indexed field. What is better: to make a separate table with the data I need, keyed by that field, or to make a view? What are the drawbacks of each choice? Or maybe I can use a secondary index in that case instead?
A second table will be better hands down. Only disadvantage is it requires more of your effort.
Materialized views have issues where they get out of sync and there's no way to repair them, only drop and recreate (they are now considered experimental and not production ready). Secondary indexes require huge scatter-gather queries that make your 99th percentile your average (while also being difficult to size appropriately). Ultimately, under any heavy load, MVs or 2i will break, even though they're easy to add.
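The extra effort of a second table is mostly about writing to both tables on every change, including removing the stale lookup row when the searched field changes. Here is a minimal in-memory sketch of that bookkeeping, with two dicts standing in for the two Cassandra tables; the class, the table names and the user/email columns are all hypothetical. In a real cluster the two writes would typically be grouped, for example in a logged batch, which this sketch does not model.

```python
from typing import Dict, Optional


class UserStore:
    """Two dicts standing in for a base table and an app-maintained lookup table."""

    def __init__(self) -> None:
        self.user_by_id: Dict[str, Dict[str, str]] = {}  # id -> {"email", "name"}
        self.id_by_email: Dict[str, str] = {}            # email -> id

    def upsert(self, user_id: str, email: str, name: str) -> None:
        """Write the base row and keep the lookup row in sync."""
        old = self.user_by_id.get(user_id)
        if old is not None and old["email"] != email:
            # The lookup key changed: delete the stale row, as a view would.
            self.id_by_email.pop(old["email"], None)
        self.user_by_id[user_id] = {"email": email, "name": name}
        self.id_by_email[email] = user_id

    def find_id(self, email: str) -> Optional[str]:
        """Query the lookup table by the otherwise non-indexed field."""
        return self.id_by_email.get(email)

    def rename_by_email(self, email: str, new_name: str) -> bool:
        """Resolve the primary key first, then update the base row."""
        user_id = self.find_id(email)
        if user_id is None:
            return False
        self.user_by_id[user_id]["name"] = new_name
        return True


store = UserStore()
store.upsert("u1", "ann@example.com", "Ann")
store.upsert("u1", "ann@work.example.com", "Ann")  # email change
renamed = store.rename_by_email("ann@work.example.com", "Ann B.")
```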
Creating a materialized view seems to be an easy option compared to multiple tables...but is it a good option?
After all, a materialized view is nothing but another table behind the scenes.
What exactly happens when we create a materialized view over a table and the partition key is changed to a clustering key?
I just think creating another table rather than a materialized view is better in the long term when the data growth rate is high.
MVs really help avoid the overhead of managing multiple tables on the client side. However, they have some functional limitations.
This is a good blog post on MVs.
Also, you will see a warning when using MVs:
MVs are experimental and are not recommended for production use.
Personally, I would prefer managing my own table, instead of working with this risk.