I am not sure which NoSQL is suitable for my scenario - cassandra

I am trying to design a cloud-based system (IaaS) that will gather data from sensors (water-pollution-related activity) and, upon certain events, decide to process the data for a specific sensor.
Data characteristics are:
1. Each sensor sends data once every couple of days (up to 6 times a month).
2. Each sensor reading contains about 5000 events that are encapsulated in 50-100 messages sent to the server (such a "session" takes about 20 minutes, with messages sent every 5 seconds).
3. I am building the system to handle a rate of 30,000 messages per second.
4. Processing of the data doesn't need to be real time; I have about 10 minutes once the "session" is finished to do the processing.
5. 90% of the sessions are not interesting and can be thrown away once they are finished. The other 10% contain one or more events in their messages that I use to decide whether to process the entire session's data and send an alert to the sensor that there is pollution.
I created a tool that generates 5000 messages per second and I am trying to figure out which database would be the most optimal for my scenario.
These are the databases I am thinking to try:
Cassandra - For each session I will keep an in-memory collection of keys; the keys are for the messages stored in Cassandra. Once I detect a message that contains bad readings, I will need to pull all of the other messages in the "session" and process them (that means 50-100 requests to Cassandra). My concern here is write performance (since I have many read and write operations), and I don't have a good strategy for deleting the 90% of sessions that are not needed.
Couchbase - I will save a document for each "session", keyed by sensorID, and append each message to the document. Once I detect a message that contains bad readings, I will only need to send one request for the document. My concern here is read performance.
Redis - Use it like Cassandra. I assume performance will be the best, but I will need to handle sharding and replication of the data myself in order not to reach the memory limit.
I would love to hear which option would be the most appropriate
thanks

Regarding Redis: you may consider using a DaaS (Data as a Service). The service will manage all the instances, clusters, scaling, data persistence and high-availability settings for you.
One example is Redis Cloud by Redis Labs.

This is an interesting one. Let's go back to the basics of the CAP theorem and choose a DB based on the need for consistency, availability, and partition tolerance.
For high consistency and availability: choose MySQL, PostgreSQL, Greenplum, Vertica, Neo4J.
For high availability and partition tolerance: use Cassandra, Voldemort, Dynamo, CouchDB, Riak.
For high consistency and partition tolerance: use HBase, Redis, MongoDB, BerkeleyDB, BigTable.
So my vote is for Cassandra here.

Related

Cosmos DB metrics report 100x more requests than expected

I'm comparing the service side metrics of my app with the metrics emitted by Cosmos DB and I can see a 100x difference in request counts.
Is my container misconfigured? Am I querying the wrong way? Is Cosmos performing multiple requests internally for each query I'm running against it?
The metric I'm looking at in Cosmos is TotalRequests/Count/5min.
The container has indexes on all attributes + a few composite indexes.
The query I'm running is:
SELECT *
FROM x
WHERE x.partitionKey = 0
and x.index1 = 1
and x.index2 = 2
The container is suffering from a VERY hot partition.
Each request consumes about 5 RUs.
The consistency level is BOUNDED_STALENESS.
I tried changing the consistency level to EVENTUAL which brought the consumed RUs down, but I'm still seeing a huge amount of requests that aren't accounted for.
The Total Requests metric includes every request between the SDK and the service. The SDK makes frequent calls to the service when an SDK instance is first created, then makes regular calls for metadata and other information. If you want to see just the requests made by the user, apply a filter for OperationType and select the operations you want to monitor.
It's not clear why you were using Bounded Staleness. Reads using Strong and Bounded Staleness consume twice the RU/s because they read from 2 replicas rather than 1 replica for the other weaker consistency models. In addition to differences in cost, there are of course differences in whether you may read stale data or not. They also play a big role in your RTO and RPO in multi-region scenarios.
A hot partition does not have an impact on throughput consumption. 5 RUs for a query is actually very good.

Cosmos Write Returning 429 Error With Bulk Execution

We have a solution utilizing a micro-service approach. One of our micro-services is responsible for pushing data to Cosmos. Our Cosmos database uses serverless provisioning with a 5,000 RU/s limit.
The data we are inserting into Cosmos looks like the below. There are 10 columns and we are pushing a batch containing 5,807 rows of this data.
Id | CompKey | Primary Id | Secondary Id | Type | DateTime | Item | Volume | Price | Fee
1  | Veg_Buy | csd2354csd | dfg564dsfg55 | Buy  | 30/08/21 | Leek | 10     | 0.75  | 5.00
2  | Veg_Buy | sdf15s1dfd | sdf31sdf654v | Buy  | 30/08/21 | Corn | 5      | 0.48  | 3.00
We are retrieving data from multiple sources, normalizing it, and sending out the data as one bulk execution to Cosmos. The retrieval process happens every hour. We understand that we are spiking the Cosmos database once per hour with the data that has been retrieved and then stop sending data until the next retrieval cycle. So if this high peak is the problem, what remedies exist for such a scenario?
Can anyone shed some light on what we should/need to do to overcome this issue? Perhaps we are missing a setting when creating the Cosmos database or possibly this has something to do with partitioning?
You can mostly determine these things by looking at the metrics published in the Azure Portal. This doc is a good place to start, Monitor and debug with insights in Azure Cosmos DB.
In particular I would look at the section titled, Determine the throughput consumption by a partition key range
If you are not dealing with a hot partition key, you may want to look at options to throttle your writes. This may include modifying your batch size and putting the write operations in a while loop with a one-second timer so that the RU/s consumed stays at or below 5000 RU/s. You could also look at queue-based load leveling: put writes on a queue in front of Cosmos and stream them in.
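The throttling loop suggested above could look roughly like the sketch below. It is only an illustration under stated assumptions: `write_batch` is a hypothetical callable standing in for whatever actually writes to Cosmos, and it is assumed to return the request charge (in RUs) for the batch it wrote. The 5000 RU/s budget and the batch size of 100 are placeholders to tune.

```python
import time
from typing import Callable, Sequence


def write_paced(
    items: Sequence[dict],
    write_batch: Callable[[Sequence[dict]], float],  # hypothetical: returns RUs charged
    ru_budget_per_sec: float = 5000.0,
    batch_size: int = 100,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> float:
    """Write items in small batches, pausing once the RU budget for the
    current one-second window is spent. Returns the total RUs consumed."""
    total_ru = 0.0
    window_start = clock()
    window_ru = 0.0
    for start in range(0, len(items), batch_size):
        now = clock()
        if now - window_start >= 1.0:
            # A full second has passed on its own: start a fresh window.
            window_start, window_ru = now, 0.0
        elif window_ru >= ru_budget_per_sec:
            # Budget spent: wait out the rest of this window, then reset.
            sleep(1.0 - (now - window_start))
            window_start, window_ru = clock(), 0.0
        charge = write_batch(items[start:start + batch_size])
        window_ru += charge
        total_ru += charge
    return total_ru
```

The `sleep` and `clock` parameters exist only so the pacing can be exercised without real waiting. A queue in front of the writer would achieve the same smoothing across processes; this loop only smooths a single writer.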

MongoDB ChangeStream performance

Is it possible to use change stream for extensive use? I want to watch many collections with many documents with various parameters. The idea is to allow for multiple users to watch data that they are interested in. So not only to show few real-time updates on e.g. some stock data from a single collection or whatever, but to allow a modern web application to be real-time. I've stumbled upon some discussions e.g. this one which suggests, that the feature is not usable for such purpose.
So imagine implementing commonly known social network. Each user would want to have live data on (1) notifications, (2) online friends, (3) friends requests, (4) news feed, (5) comments on news feed posts (maybe one for each post?). This makes at least 5 open change streams per user. If a service would have connected e.g. 10000 users, it makes 50000 active change streams.
Is this mechanism ready for such load? If I understood the discussion (and some others) correctly, every change stream watcher creates one connection. Would it be okay to have tens of thousands of connections? It does not seem like a good design. It seems like it'd be better to watch each collection and do the filtering on an application server, but that is more of a database server's job.
Is there a way to handle such load with MongoDB?
Each change stream will require a connection to the server. Assuming your 10000 active users are going to do things like login, post things, read things, comment on other people's things, manage friend lists, etc. you may actually be needing more like 10 connections per user.
Each change stream is essentially an aggregation that maintains a cursor over the operations log. That should work fairly well as long as the server is sufficiently sized to handle:
100,000 simultaneous connections
state for 50,000 long running cursors
10s of thousands of queries per second for those change streams
whatever query rate the other non-changestream reads and writes will need
On MongoDB Atlas you would need at least an M140 instance just to handle that number of connections, with a price tag in the neighborhood of $10K per month.
At that price point, it would probably be more cost effective to design a pub/sub notification service that uses a total of 5 change streams to watch for the different types of changes, and deliver those to users with a push mechanism rather than having every user poll the database directly.
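A minimal sketch of the fan-out side of such a pub/sub service follows. It assumes one worker per feed (for example, one per watched collection) reads change events and calls `publish`; the `recipients` field on the event is a hypothetical convention for "which users care about this change" and is not something MongoDB provides. How events reach the worker (for instance via a driver's change stream cursor) and how handlers push to clients are left out.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Event = dict
Handler = Callable[[Event], None]


class ChangeFanout:
    """Routes events from a handful of shared feeds to many per-user subscribers."""

    def __init__(self) -> None:
        # (feed name, user id) -> handlers registered for that pair
        self._subs: Dict[Tuple[str, str], List[Handler]] = defaultdict(list)

    def subscribe(self, feed: str, user_id: str, handler: Handler) -> None:
        self._subs[(feed, user_id)].append(handler)

    def unsubscribe(self, feed: str, user_id: str) -> None:
        self._subs.pop((feed, user_id), None)

    def publish(self, feed: str, event: Event) -> int:
        """Deliver one change event to every subscribed recipient.
        Returns the number of handler calls made."""
        delivered = 0
        # "recipients" is an assumed field listing the affected user ids.
        for user_id in event.get("recipients", ()):
            for handler in self._subs.get((feed, user_id), ()):
                handler(event)
                delivered += 1
        return delivered
```

With this shape the database sees a fixed, small number of change streams regardless of how many users are connected; the per-user cost moves to the application tier.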

How to avoid database from being hit hard when API is getting bursted?

I have an API which allows other microservices to call on to check whether a particular product exists in the inventory. The API takes in only one parameter which is the ID of the product.
The API is served through API Gateway in Lambda and it simply queries against a Postgres RDS to check for the product ID. If it finds the product, it returns the information about the product in the response. If it doesn't, it just returns an empty response. The SQL is basically this:
SELECT * FROM inventory where expired = false and product_id = request.productId;
However, the problem is that many services are calling this particular API very heavily to check the existence of products. Not only that, the calls often come in bursts. I assume those services loop through a list of product IDs and check for their existence individually, hence the burst.
The number of concurrent calls on the API has resulted in it making many queries to the database. The rate can burst beyond 30 queries per second and there can be a few hundred thousand requests to fulfil. The queries are mostly the same, except for the product ID in the where clause. The column has been indexed and a query takes an average of only 5-8ms to complete. Still, the connection to the database occasionally times out when the rate gets too high.
I'm using Sequelize as my ORM and the error I get when it times out is SequelizeConnectionAcquireTimeoutError. There is a good chance that the burst rate was too high and it maxed out the pool too.
Some options I have considered:
Using a cache layer. But I have noticed that, most of the time, 90% of the product IDs in the requests are not repeated. This would mean that 90% of the time, it would be a cache miss and it will still query against the database.
Auto scale up the database. But because the calls are bursty and I don't know when they may come, the autoscaling won't complete in time to avoid the time out. Moreover, the query is a very simple select statement and the CPU of the RDS instance hardly crosses 80% during the bursts. So I doubt scaling it would do much too.
What other techniques can I do to avoid the database from being hit hard when the API is getting burst calls which are mostly unique and difficult to cache?
Use cache in the boot time
You can load all necessary columns into an in-memory data storage (redis). Every update in database (cron job) will affect cached data.
Problems: memory overhead of updating cache
Limit db calls
Create a buffer for ids. Store n ids and then make one query for all of them. Or empty the buffer every m seconds!
Problems: longer client response time, extra processing to split the query result per request
Change your database
Use NoSql database for these data. According to this article and this one, I think choosing NoSql database is a better idea.
Problems: multiple data stores
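The "Limit db calls" option above (store n ids, then make one query for all of them) might be sketched as follows. `fetch_many` is a hypothetical function that runs a single query for a list of IDs, for example a `WHERE product_id = ANY(...)` lookup, and returns the rows it found keyed by ID. The time-based flush ("every m seconds") is left to the caller, who can invoke `flush()` from a timer.

```python
from typing import Callable, Dict, List, Optional

Row = dict


class IdBuffer:
    """Collects product IDs and resolves them with one query per full buffer."""

    def __init__(
        self,
        fetch_many: Callable[[List[str]], Dict[str, Row]],  # hypothetical bulk lookup
        max_size: int = 50,
    ) -> None:
        self._fetch_many = fetch_many
        self._max_size = max_size
        self._pending: List[str] = []

    def add(self, product_id: str) -> Optional[Dict[str, Optional[Row]]]:
        """Queue an ID. Returns results when the buffer fills, else None."""
        self._pending.append(product_id)
        if len(self._pending) >= self._max_size:
            return self.flush()
        return None

    def flush(self) -> Dict[str, Optional[Row]]:
        """Run one bulk query for everything queued so far."""
        ids, self._pending = self._pending, []
        if not ids:
            return {}
        found = self._fetch_many(ids)
        # IDs missing from the result map to None (product not in inventory).
        return {pid: found.get(pid) for pid in ids}
```

The trade-off named above still applies: callers wait until the buffer fills or the timer fires, and each caller's answer has to be picked out of the combined result.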
Start with a covering index to handle your query. You might create an index like this for your table:
CREATE INDEX inv_lkup ON inventory (product_id, expired) INCLUDE (col, col, col);
Mention all the columns in your SELECT in the index, either in the main list of indexed columns or in the INCLUDE clause. Then the DBMS can satisfy your query completely from the index. It's faster.
You could start using AWS lambda throttling to handle this problem. But, for that to work the consumers of your API will need to retry when they get 429 responses. That might be super-inconvenient.
Sorry to say, you may need to stop using lambda. Ordinary web servers have good stuff in them to manage burst workload.
They have an incoming connection (TCP/IP listen) queue. Each new request coming in lands in that queue, where it waits until the server software accepts the connection. When the server is busy, requests wait in that queue. When there's a high load, the requests wait a bit longer in that queue. In nodejs's case, if you use clustering there's just one of these incoming connection queues, and all the processes in the cluster use it.
The server software you run (to handle your API) has a pool of connections to your DBMS. That pool has a maximum number of connections in it. As your server software handles each request, it awaits a connection from the pool. If no connection is immediately available, the request handling pauses until one is available, then proceeds. This too smooths out the requests to the DBMS. (Be aware that each process in a nodejs cluster has its own pool.)
Paradoxically, a smaller DBMS connection pool can improve overall performance, by avoiding too many concurrent SELECTs (or other queries) on the DBMS.
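To illustrate how a bounded pool smooths a burst, here is a small simulation rather than real database code. A semaphore stands in for the connection pool and a short sleep stands in for the SELECT; both are assumptions made for the sketch, and the function only reports the peak number of simultaneous "queries".

```python
import asyncio


async def run_burst(n_requests: int, pool_size: int, query_seconds: float = 0.001) -> int:
    """Simulate a burst of requests sharing a bounded pool.
    Returns the peak number of queries that were in flight at once."""
    pool = asyncio.Semaphore(pool_size)  # stand-in for the connection pool
    in_flight = 0
    peak = 0

    async def handle_request() -> None:
        nonlocal in_flight, peak
        async with pool:  # wait here until a "connection" is free
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(query_seconds)  # stand-in for the SELECT
            in_flight -= 1

    await asyncio.gather(*(handle_request() for _ in range(n_requests)))
    return peak
```

Calling `asyncio.run(run_burst(200, 10))` shows that however many requests arrive together, the DBMS never sees more than `pool_size` concurrent queries; the rest simply wait their turn.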
This kind of server configuration can be scaled out: a load balancer will do. So will a server with more cores and more nodejs cluster processes. An elastic load balancer can also add new server VMs when necessary.

High throughput send to EventHubs resulting into MessagingException / TimeoutException / Server was unable to process the request errors

We are experiencing lots of these exceptions sending events to EventHubs during peak traffic:
"Failed to send event to EventHub. Exception : Microsoft.ServiceBus.Messaging.MessagingException: The server was unable to process the request; please retry the operation. If the problem persists, please contact your Service Bus administrator and provide the tracking id."
or
"Failed to send event to EventHub. Exception : System.TimeoutException: The operation did not complete within the allocated time "
You can see it clearly here:
As you can see, we got lots of Internal Errors, Server Busy Errors, Failed Request when Incoming messages are over 400K events/hour (or ~270 MB/hour). This is not just a transient issue. It's clearly related to throughput.
Our EH has 32 partitions, message retention of 7 days, and 5 throughput units assigned. OperationTimeout is set to 5 mins, and we are using the default RetryPolicy.
Is there anything we still need to tweak here? We are really concerned about the scalability of EH.
Thanks
Send throughput tuning can be achieved using efficient partition distribution strategies. There isn't any single knob that does this. Below is the basic information you will need to design for high-throughput scenarios.
1) Let's start from the namespace: Throughput Units (aka TUs) are configured at the namespace level. Please bear in mind that the TUs configured apply to the aggregate of all Event Hubs under that namespace. If you have 5 TUs on your namespace and 5 event hubs under it, the TUs will be divided among all 5 event hubs.
2) Now let's look at the Event Hub level: if the Event Hub is allocated 5 TUs and has 32 partitions, no single partition can use all 5 TUs. For example, sending 5 TUs of data to 1 partition and zero to the other 31 partitions is not possible. The maximum you should plan per partition is 1 TU. In general, you will need to ensure that the data is distributed evenly across all partitions. Event Hubs supports 3 types of send, which give users different levels of control over partition distribution:
EventHubClient.Send(EventDataWithoutPartitionKey) -> if you use this API to send, Event Hubs takes care of evenly distributing the data across all partitions. The Event Hubs service gateway will round-robin the data to all partitions. When a specific partition is down, the gateways auto-detect it and ensure clients don't see any impact. This is the most recommended way to send to Event Hubs.
EventHubClient.Send(EventDataWithPartitionKey) -> if you use this API to send to Event Hubs, the partitionKey determines the distribution of your data. The PartitionKey is used to hash the EventData to the appropriate partition (the hashing algorithm is Microsoft proprietary and not shared). Typically, users who require correlation of a group of messages use this variant of Send.
EventHubSender.Send(EventData) -> in this variant, the sender is already attached to the partition, so this gives the client complete control of distribution across partitions.
To measure your present distribution of data, use the EventHubClient.GetPartitionRuntimeInfo API to estimate which partition is overloaded. The difference between BeginSequenceNumber and LastEnqueuedSequenceNumber gives an estimate of that partition's load compared to the others.
3) Last but not least, you can tune performance (not throughput) at the send-operation level using the SendBatch API.
1 TU buys a maximum of 1000 msgs/sec or 1 MB/s; you will be throttled by whichever limit is hit first, and this cannot be changed.
If your messages are small, say 100 bytes, and you can send only 1000 msgs/sec (as per the TU limit), you will hit the 1000 events/sec limit first. However, using the SendBatch API you can batch, say, 10 of the 100-byte msgs and push at the same rate, 1000 msgs/sec, with just 100 API calls, and improve the end-to-end latency of the system (as it also helps the service persist messages efficiently). Remember, the only limitation here is the maximum message size that can be sent, which is 256 KB (this limit applies to your batch size if you use the SendBatch API).
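The arithmetic above can be checked with a short helper. This is a back-of-the-envelope sketch using only the limits quoted in this answer (1000 msgs/sec or 1 MB/s per TU, 256 KB maximum send size). It treats 1 MB as 1,000,000 bytes and ignores per-message overhead, both of which are simplifying assumptions.

```python
import math


def tu_send_plan(
    msg_bytes: int,
    batch_size: int = 1,
    throughput_units: int = 1,
    max_batch_bytes: int = 256 * 1024,
) -> dict:
    """Estimate the sustainable send rate and API call rate for a given
    message size, batch size and number of throughput units."""
    if msg_bytes * batch_size > max_batch_bytes:
        raise ValueError("batch exceeds the maximum send size")
    max_by_count = 1000 * throughput_units                     # events/sec limit
    max_by_bytes = (1_000_000 * throughput_units) // msg_bytes  # bytes/sec limit
    msgs_per_sec = min(max_by_count, max_by_bytes)
    return {
        "msgs_per_sec": msgs_per_sec,
        "limiting_factor": "event count" if max_by_count <= max_by_bytes else "bytes",
        "api_calls_per_sec": math.ceil(msgs_per_sec / batch_size),
    }
```

For the example in the text, `tu_send_plan(100, batch_size=10)` reports 1000 msgs/sec limited by event count and 100 API calls per second. With 2000-byte messages the byte limit is reached first instead.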
Given that background, in your case:
- Having 32 partitions and 5 TUs - I would really double-check the Partition distribution strategy.
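One way to double-check the distribution is to compare sequence-number ranges per partition, as described above. The sketch below assumes you have already collected each partition's BeginSequenceNumber and LastEnqueuedSequenceNumber into a plain mapping; fetching them from the service is not shown, and the `factor` threshold for calling a partition "hot" is an arbitrary choice.

```python
from typing import Dict, List, Tuple


def partition_shares(runtime_info: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """runtime_info maps partition id -> (begin_sequence_number,
    last_enqueued_sequence_number). Returns each partition's share of
    the retained events, as a fraction of the total."""
    retained = {pid: last - begin for pid, (begin, last) in runtime_info.items()}
    total = sum(retained.values())
    return {pid: (count / total if total else 0.0) for pid, count in retained.items()}


def hot_partitions(shares: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Partitions holding more than `factor` times an even share."""
    even_share = 1.0 / len(shares)
    return sorted(pid for pid, share in shares.items() if share > factor * even_share)
```

With an even distribution every share is close to 1/32 for a 32-partition hub, so a partition holding several times that is a good candidate for investigation.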
here's some more general reading on Event Hubs...
After a lot of digging we decided to stop setting the PK for posted messages, and the issue simply went away! We were using a GUID as the PK. We started to get very few errors on the Azure Portal, and no more exceptions. Hope this helps someone else.

Resources