How many items can be updated in a transaction in QLDB? - amazon-qldb

In DynamoDB, a transaction cannot contain more than 25 unique items.
Refer to: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Limits.html#limits-dynamodb-transactions
What are the QLDB limits on this?
Also, does QLDB support cross-table transactions?

As per the limits page:
Number of documents in a transaction: 40
QLDB transactions can operate across tables (but not across ledgers).
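To make the cross-table point concrete, here is a minimal sketch of what a single transaction touching two tables could look like. The table names and the runInQldbTransaction helper are hypothetical stand-ins for the QLDB driver's transaction API; only the PartiQL statements and the 40-document cap come from the limits above.
// Hypothetical stand-in for the QLDB driver's transaction callback: with a real driver,
// every statement issued inside the callback runs in one transaction and is committed
// atomically, provided the transaction touches no more than 40 documents.
def runInQldbTransaction(statements: Seq[String]): Unit =
  statements.foreach(stmt => println(s"executing in one transaction: $stmt"))

// One transaction spanning two tables (allowed), inside a single ledger.
runInQldbTransaction(Seq(
  "INSERT INTO PurchaseOrders VALUE { 'orderId': 'PO-1', 'status': 'OPEN' }",
  "UPDATE Products SET status = 'OUT_FOR_PURCHASE' WHERE productId = 'P-1'"
))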

Related

Large number of Cassandra partitions required

I am going to design a Cassandra cluster for the telecom domain with 7 nodes and a data volume of 30 TB at 45 days' retention. The application layer will generate a unique transaction ID for each transaction, which is a combination of mobile number and date-time. Customers can ask for all details of a specific mobile number for a particular day or range of dates, for all transactions on a given day, and from those details they can request the full details of a particular transaction ID.
Would it be a good idea to create a single table with transaction ID as the primary key and the other details as non-key columns? That would need about 22*10^9 unique partitions. Are there any practical examples of such a large number of partitions? A secondary index would be needed for the first two types of queries.
Would it be better to create different tables: one with mobile number as the partition key and date as the clustering column, and another with transaction ID as the primary key? Storage requirements would be higher. (See the CQL sketch after this question.)
Would a materialised view help here?
Kindly suggest any other ideas for best performance.
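For what it's worth, here is a minimal CQL sketch of the two-table variant described above, run through the DataStax Java driver from Scala. The keyspace and column names (telecom, msisdn, details) are made up. With this layout, queries by mobile number for a day or date range hit a single partition, and transaction-ID lookups are point reads, so no secondary index is needed.
import com.datastax.oss.driver.api.core.CqlSession

object TelecomSchemaSketch {
  def main(args: Array[String]): Unit = {
    // Assumes a reachable Cassandra node via the driver's default configuration.
    val session = CqlSession.builder().build()

    session.execute(
      "CREATE KEYSPACE IF NOT EXISTS telecom WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")

    // Queries by mobile number for a day or date range: the mobile number is the
    // partition key and the transaction time is the clustering column.
    session.execute(
      """CREATE TABLE IF NOT EXISTS telecom.transactions_by_msisdn (
        |  msisdn text,
        |  txn_time timestamp,
        |  txn_id text,
        |  details text,
        |  PRIMARY KEY ((msisdn), txn_time)
        |) WITH CLUSTERING ORDER BY (txn_time DESC)""".stripMargin)

    // Direct lookup of the full details of a single transaction by its id.
    session.execute(
      """CREATE TABLE IF NOT EXISTS telecom.transactions_by_id (
        |  txn_id text PRIMARY KEY,
        |  msisdn text,
        |  txn_time timestamp,
        |  details text
        |)""".stripMargin)

    session.close()
  }
}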

Cosmos DB partition key and query design for sequential access

We would like to store a set of documents in Cosmos DB with a primary key of EventId. These records are evenly distributed across a number of customers. Clients need to access the latest records for a subset of customers as new documents are added. The documents are immutable, and need to be stored indefinitely.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
If we use just CustomerId as the partition key, we would eventually run over the 10GB limit for a logical partition, and if we use EventId, then querying becomes inefficient (would result in a cross-partition query, and high RU usage, which we'd like to avoid).
Another idea would be to group documents into blocks. i.e. PartitionKey = int(EventId / PartitionSize). This would result in all clients hitting the latest partition(s), which presumably would result in poor performance and throttling.
If we use a combined PartitionKey of CustomerId and int(EventId / PartitionSize), then it's not clear to me how we would avoid a cross-partition query to retrieve the correct set of documents.
Edit:
Clarification of a couple of points:
Clients will access the events by specifying a list of CustomerIds, the last EventId they received, and a maximum number of records to retrieve.
For this reason, the use of EventId alone won't perform well, as it will result in a cross partition query (i.e. WHERE EventId > LastEventId).
The system will probably be writing on the order of 1GB a day, in 15 minute increments.
It's hard to know what the read volume will be, but I'd guess probably moderate, with maybe a few thousand clients polling the API at regular intervals.
First things first: the logical partition size limit has now been increased to 20 GB, please see here.
You can use EventID as the partition key as well: the limit is on a logical partition's size, but there is no limit on the number of logical partitions. So using EventID is fine; you will get a point read, which is very fast, if you query using the EventID. Now, you mention that this way you would have to do cross-partition queries; can you explain how?
A few things to keep in mind, though: Cosmos DB is not really meant for storing this kind of log-based data, as it stores everything on SSDs, so calculate the size of one document, how many you would store per second, then per day, then per month. You can use TTL to delete documents from Cosmos DB when you are done with them; for long-term storage put them in Azure Blob Storage, and for fast retrieval use Azure Search to query the data in Blob by CustomerID and EventID in your search query.
How should we design our partition key and queries to avoid clients all hitting the same partitions and/or high RU usage?
I faced a similar issue some time back, and a PartitionKey of customerId + datekey (e.g. cust1_20200920) worked well for me.
I created the date key as 20200920 (YYYYMMDD), but you can choose to drop the day part or even the month (cust1_202009 / cust1_2020), based on your query requirements.
Also, IMO, having multiple known PartitionKeys at query time is kind of a good thing. For example, if you keep YYYYMM as the PartitionKey and want to get data for 4 months, you can run 4 queries in parallel and combine the data, which is faster if you have many clients and these partition keys are distributed among multiple physical partitions.
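As an illustration, here is a small sketch of that fan-out, assuming a cust<id>_YYYYMM key format; queryPartition is a hypothetical stand-in for the actual single-partition Cosmos DB query.
import java.time.YearMonth
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

object PartitionKeyFanOut {
  // Builds one synthetic partition key per calendar month, e.g. "cust1_202009".
  def monthlyPartitionKeys(customerId: String, from: YearMonth, to: YearMonth): Seq[String] =
    Iterator.iterate(from)(_.plusMonths(1))
      .takeWhile(!_.isAfter(to))
      .map(m => f"${customerId}_${m.getYear}%04d${m.getMonthValue}%02d")
      .toSeq

  // Hypothetical stand-in for a single-partition query such as
  // SELECT * FROM c WHERE c.partitionKey = @pk AND c.eventId > @lastEventId.
  def queryPartition(partitionKey: String): Future[Seq[String]] =
    Future(Seq(s"events from $partitionKey"))

  def main(args: Array[String]): Unit = {
    val keys = monthlyPartitionKeys("cust1", YearMonth.of(2020, 6), YearMonth.of(2020, 9))
    // Fan out one single-partition query per key and merge the results client-side.
    val merged = Await.result(Future.sequence(keys.map(queryPartition)), 30.seconds).flatten
    merged.foreach(println)
  }
}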
On a separate note, Cosmos DB has recently introduced an analytical store for transactional data, which can be useful for your use case.
More about it here - https://learn.microsoft.com/en-us/azure/cosmos-db/analytical-store-introduction
One approach is using multiple Cosmos containers as "hot/cold" tiers with different partitioning. We could use two containers:
Recent: all writes and all queries for recent items go here. Partitioned by CustomerId.
Archive: all items are copied here for long term storage and access. Partitioned by CustomerId + timespan (e.g. partition per calendar month)
The Recent container would provide single partition queries by customer. Data growth per partition would be limited either by setting reasonable TTL during creation, or using a separate maintenance job (perhaps Azure Function on timer) to delete items when they are no longer candidates for recent-item queries.
A Change Feed processor, implemented by an Azure Function or otherwise, would trigger on each creation in Recent and make a copy into Archive. This copy would have partition key combining the customer ID and date range as appropriate to limit the partition size.
This scheme should provide efficient recent-item queries from Recent and safe long-term storage in Archive, with reasonable Archive query efficiency given a desired date range. The main downside is two writes for each item (one for each container) -- but that's the tradeoff for efficient polling. Whether this tradeoff is worthwhile is probably best determined by simulating the load and observing performance.
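A rough sketch of the copy step follows, with made-up field names (eventId, customerId, createdAt) and a hypothetical onRecentInsert handler standing in for the Change Feed trigger; the essential part is deriving the Archive partition key from the customer ID plus the calendar month.
import java.time.{Instant, ZoneOffset}
import java.time.format.DateTimeFormatter

object HotColdCopy {
  // Minimal shape of an event document; field names are made up for the sketch.
  final case class EventDoc(eventId: Long, customerId: String, createdAt: Instant, payload: String)

  private val monthFmt = DateTimeFormatter.ofPattern("yyyyMM").withZone(ZoneOffset.UTC)

  // Archive partition key: customer + calendar month, e.g. "cust42_202009",
  // which caps the size of each Archive logical partition.
  def archivePartitionKey(doc: EventDoc): String =
    s"${doc.customerId}_${monthFmt.format(doc.createdAt)}"

  // Hypothetical change-feed handler: the real one would be an Azure Function (or the
  // change feed processor library) receiving each new document written to Recent.
  def onRecentInsert(doc: EventDoc, writeToArchive: (String, EventDoc) => Unit): Unit =
    writeToArchive(archivePartitionKey(doc), doc)

  def main(args: Array[String]): Unit = {
    val doc = EventDoc(1001L, "cust42", Instant.parse("2020-09-20T10:15:30Z"), "...")
    onRecentInsert(doc, (pk, d) => println(s"archive upsert: pk=$pk eventId=${d.eventId}"))
  }
}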

Managing multiple database connections and data with foreachPartition

I will try to make this as clear as possible so an example isn't required, as this has to be a concept that I didn't grasp properly and am struggling with, rather than a problem with the data or the Spark code itself.
I'm required to insert each city's data into its own database (MongoDB), and I'm trying to perform those upserts as fast as possible.
Consider a sample DataFrame with the following columns, where I want to do some upserts against MongoDB based on, for example, year, city and zone.
year - city - zone - num_business - num_vehicles.
Having grouped by those columns, all that's left is to perform the upsert into the DB.
Using the MongoDB Driver I'm required to instantiate several WriteConfigs to cope with multiple databases (1 database per city).
// 'getDatabaseWriteConfigsPerCity' filters 'df' so each entry only contains the docs from a single city.
for (cityDBConnection <- getDatabaseWriteConfigsPerCity(df)) {
  cityDBConnection.getDf.foreach(
    ... // set MongoDB upsert criteria.
  )
}
Doing it that way works, but more performance could be gained by using foreachPartition, since I hope the records within the DF would be spread across the executors and more data would be upserted concurrently.
However, I get erroneous results when using foreachPartition. Erroneous because they seem incomplete: counters are way off, and so on.
I suspect this is because the same keys end up in different partitions, and it's not until those are merged on the driver that they are inserted into MongoDB as a single record.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
I don't really know if I'm being clear enough, but if it's still too complicated I will update as soon as possible.
Is there any way I can make sure partitions contain the total of documents related to an upsert key?
If you do:
df.repartition(col("city")).foreachPartition { ... }
you can be sure that all records with the same city are in the same partition (but there will probably be more than one city per partition!).
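Here is a runnable sketch of that pattern, using a small in-memory DataFrame and a hypothetical upsertToCityDb function in place of the MongoDB write; the key steps are the repartition by city followed by a per-city grouping inside each partition.
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.col

object CityUpsertSketch {
  // Hypothetical stand-in for the MongoDB upsert: one call per city, with that city's docs.
  def upsertToCityDb(city: String, docs: Seq[Row]): Unit =
    println(s"upserting ${docs.size} docs into the database for city=$city")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("city-upserts").master("local[*]").getOrCreate()
    import spark.implicits._

    // Assumes the per-(year, city, zone) aggregation has already been done upstream.
    val df = Seq(
      (2021, "Barcelona", "A", 10, 3),
      (2021, "Barcelona", "B", 7, 1),
      (2021, "Madrid", "A", 4, 2)
    ).toDF("year", "city", "zone", "num_business", "num_vehicles")

    // Repartition by city so every record of a given city lands in the same partition;
    // a partition may still hold several cities, so group by city inside the partition.
    df.repartition(col("city"))
      .rdd
      .foreachPartition { rows =>
        rows.toSeq.groupBy(_.getAs[String]("city")).foreach { case (city, docs) =>
          upsertToCityDb(city, docs)
        }
      }

    spark.stop()
  }
}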

Cassandra - Batch too large

I have a list of Products which have to be added to a Purchase Order. The Purchase order has a sequence number and once the Products are added, their status should be changed to indicate that these are out for purchase.
The typical number of Products being processed in 1 Purchase Order would be 500.
On the DB side I have 2 tables: one for Products and another for Purchase Orders, which means I need 500 updates and 1 insert.
When I try to do this in a BatchStatement I get the error - Batch too large.
Suggestions from various quarters tell me that I should use multiple async queries. My concern however is atomicity of the entire operation.
Please suggest what would be the best way forward given my requirement.
Thanks in advance.
This is interesting. Putting a lot of statements (> 10) into a batch (to achieve atomicity) is really going to perform badly, so raising the batch limit is not really an option.
Since Cassandra also guarantees atomicity at the single-row level, you could exploit that by changing your model: add a table to "bookmark" your purchase orders, where one row stores both the purchase order id and the items in a map, so you have idempotency in your queries. You can then expand or post-process this table to continue your workflow as needed.
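A minimal sketch of that single-row model, assuming the DataStax Java driver used from Scala and made-up keyspace and column names (shop.purchase_order_items); because the whole purchase order and its item statuses live in one row, the write is atomic without a batch.
import java.util.UUID
import com.datastax.oss.driver.api.core.CqlSession
import scala.jdk.CollectionConverters._

object PurchaseOrderBookmark {
  def main(args: Array[String]): Unit = {
    // Assumes a reachable Cassandra node via the driver's default configuration.
    val session = CqlSession.builder().build()

    session.execute(
      "CREATE KEYSPACE IF NOT EXISTS shop WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")

    // One row per purchase order; the ~500 product ids live in a single map column,
    // so writing the whole order is a single-row (and therefore atomic) operation.
    session.execute(
      """CREATE TABLE IF NOT EXISTS shop.purchase_order_items (
        |  po_id uuid PRIMARY KEY,
        |  po_seq bigint,
        |  items map<text, text>  -- product id mapped to its status
        |)""".stripMargin)

    val insert = session.prepare(
      "INSERT INTO shop.purchase_order_items (po_id, po_seq, items) VALUES (?, ?, ?)")
    val items = Map("P-001" -> "OUT_FOR_PURCHASE", "P-002" -> "OUT_FOR_PURCHASE").asJava
    session.execute(insert.bind(UUID.randomUUID(), java.lang.Long.valueOf(42L), items))

    session.close()
  }
}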
My concern however is atomicity of the entire operation. Please suggest what would be the best way forward given my requirement.
Please note that Cassandra batches don't provide isolation (http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2):
Note that we mean “atomic” in the database sense that if any part of the batch succeeds, all of it will. No other guarantees are implied; in particular, there is no isolation; other clients will be able to read the first updated rows from the batch, while others are in progress.
So if you need isolation, as #xmas79 answered, you should store products and purchase orders together in one table.
If isolation and performance are not critical, you could try tuning cassandra.yaml and increasing the value of the batch_size_fail_threshold_in_kb parameter:
Fail any batch exceeding this value. 50kb (10x warn threshold) by default.

Azure storage table records filtering advice

Imagine something like a blog posting system, built using Azure Storage Table.
A user posts a message and the database records the user's Region, City and Language along with it.
After that, a user is able to browse all other users' posts and filter them by any combination of Region, City and Language, or by none of them and see all posts.
I see several solutions:
Put each message in 8 different partitions with combinations of Region-City-Language (pros: lightning fast point queries on read; cons: 8 transactions per message on write).
Put each message in 4 different partitions with combinations of Region-City and an ability to do partition scan to filter by languages (pros: less transactions than (1); cons: partition scan, 4 transactions per message).
Put each message in partitions, based on user's ID (pros: single transaction per message; cons: slow table scan and partition scan after that).
The way I see it:
Fast reads, slow (and perhaps costly) writes.
Balanced reads/writes/cost.
Fast writes, slow (but cheap) reads.
By "cost/cheap" i mean pricing based on transactions (not space).
And by "balanced" i mean just among these variants.
I thought about using index tables, but I can't see how they would help here.
So the question is, perhaps there is another, better way?
I've decided to go with a variation of (1).
The difference is that I won't be storing ALL combinations of Region-Location-Language. Instead I decided to store only unique ones:
Table: FiltersByRegion
----------------------
Partition: Region
Row: Location.Language
Prop: Message
Table: FiltersByRegionPlace
---------------------------
Partition: Region.Location
Row: Language
Prop: Message
Table: FiltersByRegionLanguage
------------------------------
Partition: Region.Language
Row: Location
Prop: Message
Table: FiltersByLanguage
------------------------
Partition: Language
Row: Region.Location
Prop: Message
Because I'm storing only unique combinations, there won't be many transactions per post: only those that are not already present in the database.
In other words, if there are a lot of posts with the same region-location-language, the filter tables won't be updated and no transactions will be spent. The uniqueness checks could use Redis to speed things up a bit.
Filtering is now only a matter of picking the right table.
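For illustration, a small sketch of how the four filter tables' keys could be derived from a post, with a hypothetical insertIfAbsent in place of the actual table write and an in-memory set standing in for the Redis uniqueness check.
object FilterTableKeys {
  final case class Post(region: String, location: String, language: String, messageId: String)
  final case class FilterRow(table: String, partitionKey: String, rowKey: String)

  // One row per unique combination, mirroring the four filter tables above.
  def filterRows(p: Post): Seq[FilterRow] = Seq(
    FilterRow("FiltersByRegion", p.region, s"${p.location}.${p.language}"),
    FilterRow("FiltersByRegionPlace", s"${p.region}.${p.location}", p.language),
    FilterRow("FiltersByRegionLanguage", s"${p.region}.${p.language}", p.location),
    FilterRow("FiltersByLanguage", p.language, s"${p.region}.${p.location}")
  )

  // Hypothetical persistence call: only spend a storage transaction when the combination
  // is new; the in-memory set stands in for the Redis uniqueness check mentioned above.
  def insertIfAbsent(row: FilterRow, seen: scala.collection.mutable.Set[String]): Unit =
    if (seen.add(s"${row.table}|${row.partitionKey}|${row.rowKey}"))
      println(s"insert into ${row.table}: PartitionKey=${row.partitionKey} RowKey=${row.rowKey}")

  def main(args: Array[String]): Unit = {
    val seen = scala.collection.mutable.Set.empty[String]
    val posts = Seq(
      Post("EU", "Berlin", "de", "msg-1"),
      Post("EU", "Berlin", "de", "msg-2") // same combination: no extra filter-table writes
    )
    posts.flatMap(filterRows).foreach(insertIfAbsent(_, seen))
  }
}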
It depends on your scenarios and read/write pattern. You might want to consider some aspects:
Design for how the records will be queried. Putting them into a "Region-City-Language" partition with the message ID as entity data may help with fast queries.
Give each message a unique message ID and save the ID-message mappings in other tables; then you only need to update one table when a message is updated, and the message ID referenced in the other tables stays unchanged.
Leverage PartitionKey and RowKey in your table design and query entities with both keys. For instance: "Region-City-Language" as the partition key and "User" as the row key.
Consider storing duplicate copies of entities for query scenarios. For example, if you have heavy user-based and language-based queries, you may consider having two tables with "user" and "language" as keys respectively.
Please also refer to https://azure.microsoft.com/en-us/documentation/articles/storage-table-design-guide/ for a full guide.
