Using Arangodb to implement a queue - arangodb

Is there a way to implement a thread-safe concurrent queue store using arangodb?
I read an article from RocksDB saying that a scalable persistent queue service can be implemented "easily" on top of a KV store. Does this apply to ArangoDB as well? I read somewhere that ArangoDB uses RocksDB as the storage engine for its KV store, so I was wondering if someone has already tried this.
Thanks!

I have tried this, but for whatever reason (probably bone-headed implementation) I ran into resource contention issues (deadlock), even at low usage rates.
ArangoDB does indeed use RocksDB as the default storage engine (MMFiles is deprecated) but doesn't expose RocksDB internals other than a few knobs to tweak for performance tuning. If you want something VERY similar to the RocksDB-based solution, ArangoDB is probably not what you are looking for, but ArangoDB does provide a sort of K/V solution.
Since ArangoDB only supports two "collection" types ("document" and "edge"), a K/V-store is really an implementation method, not an option you choose. Their idea is to use the native "_key" attribute (present in every document, unique, automatically indexed) with a single "value" attribute, creating a document like:
{
  "_key": "my awesome key name",
  "value": "supercool"
}
Part of my use-case was to create a queue of "nonce" tokens that I would pick up when a request came in, to act as a sort of cheap resource governor. However, the queue quickly became overwhelmed when I tried to go with sub-1-second query rates, giving me deadlocks when it tried to access/lock tokens that were being written.
Again, I believe this could have been sorted out, but the project went in a different direction and I never ended up troubleshooting it to completion.
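For reference, the general idea of layering a queue on a key/value store can be sketched without any database at all. The following is a minimal in-memory model in Python, not ArangoDB code: the KVQueue class, its zero-padded sequence keys, and the single lock are all assumptions made for illustration. The point is that "find the oldest key and remove it" has to be one atomic step, which is where a naive implementation tends to run into contention.

```python
import threading


class KVQueue:
    """FIFO queue modeled on a key/value store (hypothetical, in-memory)."""

    def __init__(self):
        self._store = {}            # stands in for the K/V collection
        self._next_seq = 0          # monotonically increasing sequence number
        self._lock = threading.Lock()

    def enqueue(self, value):
        with self._lock:
            # Zero-padded keys sort lexicographically in insertion order.
            key = f"{self._next_seq:020d}"
            self._next_seq += 1
            self._store[key] = value
            return key

    def dequeue(self):
        """Atomically remove and return the oldest value, or None if empty."""
        with self._lock:
            if not self._store:
                return None
            oldest = min(self._store)
            return self._store.pop(oldest)


def drain(queue, out):
    """Worker loop: claim items until the queue is empty."""
    while (item := queue.dequeue()) is not None:
        out.append(item)


q = KVQueue()
for i in range(100):
    q.enqueue(f"nonce-{i}")

claimed = []
workers = [threading.Thread(target=drain, args=(q, claimed)) for _ in range(4)]
for w in workers:
    w.start()
for w in workers:
    w.join()
```

With four workers draining concurrently, each token should be claimed exactly once. In a real database the lock would be replaced by whatever atomic claim the store offers, such as a transaction.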

Use transactions. For more details, check how ArangoDB implements Foxx queues (the foxx/queues code on GitHub).

Related

Shopware 6 partitioning

Has anyone had any experience with database partitioning? We already have a lot of data and queries on it are already starting to slow down. Maybe someone has some examples? These are tables related to orders.
Shopware, since version 6.4.12.0, allows the use of database clusters, see the relevant documentation. You will have to set up a number of read-only nodes first. The load of reading data will then be distributed among the read-only nodes while write operations are restricted to the primary node.
Note that in a cluster setup you should also use a lock storage that complements the setup.
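To make the read/write split concrete, here is a minimal sketch in Python of how such routing works in principle. It is not Shopware's implementation: the ClusterRouter class, the node names, and the "starts with SELECT" test are all assumptions chosen for illustration.

```python
import itertools


class ClusterRouter:
    """Send reads to replicas in round-robin order and writes to the primary."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def node_for(self, sql):
        # Simplistic classification: anything that is not a SELECT is a write.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


# Hypothetical node names.
router = ClusterRouter("primary", ["replica-1", "replica-2"])
```

Calling router.node_for(...) with successive SELECT statements alternates between the replicas, while any INSERT, UPDATE or DELETE is sent to the primary.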
Besides using a DB cluster you can also try to reduce the load of the db server.
The first thing you should do is enable the HTTP cache; better still, additionally set up a reverse proxy cache like Varnish. This will greatly decrease the number of requests that hit your web server and thus your DB server as well.
Besides, all the measures explained here should improve the overall performance of your shop as well as decrease the load on the DB.
Additionally you could use Elasticsearch, so that costly search requests won't hit the Database. And use a "real" MessageQueue, so that the messages are not stored in the Database. And use Redis instead of the database for the storage of performance critical information as is documented in the articles in this category of the official docs.
The impact of all those measures probably depends on your concrete project setup. If you see something in the DB locks that hints at one of the points I mentioned previously, that would be an indicator to start in that direction. E.g. if you see a lot of search-related queries, Elasticsearch would be a great start, but if you see a lot of DB load coming from writing/reading/deleting messages, then the MessageQueue might be a better starting point.
All in all, when you use a DB cluster with a primary and multiple replicas and use the additional services I mentioned here, your shop should be able to scale quite well without the need for partitioning the actual DB.

Is there a built-in method to replicate a collection to a "follower" collection in the same region?

CosmosDB can geo-replicate collections and clients can be configured to make (read-only) queries to these "follower" regions.
Is there a built-in way for CosmosDB to provide a "follower" collection in the same region?
The scenario for using that is to use the "main" collection for fast interactive queries, and use the "follower" collection for slower, heavier backend queries, without the possibility of hitting limits and causing throttling that would impact the interactive case.
The usual answer for "copying" collections is to use a change feed (possibly via an Azure function), but this is "manual" work and the client (me) would have to take care of general dev-ops overhead like provisioning, telemetry, monitoring, alerting, key rotation etc.
I'd like to know if there's a "managed" way to do this, like there is for geo-replication.
The built-in geo-replication feature only works when replicating to different regions. You cannot replicate the same collection(s) back to the same region.
You'll need to set this up yourself. As you've already mentioned, you can use Change Feed to do this (though you called it a "manual" process and I don't see it as such, since this can be completely automated in code). You can also incorporate a messaging/event pattern: subscribe to database update events, and have multiple consumers writing to different database collections, per your querying needs.
Also: by having an independent collection where you provide the data-movement code, you can choose a different data model for your slower, heavier backend queries (maybe with a different partition key; maybe with some helpful aggregations; etc.).
There's really no way to avoid the added infrastructure setup.
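As a rough illustration of the data-movement code described above, here is a minimal sketch in Python. It does not use the Cosmos DB SDK: the change feed is modeled as a plain ordered list, the follower collection as a dictionary, and the customerId partition key and the apply_changes helper are hypothetical names.

```python
def apply_changes(feed, follower, checkpoint):
    """Copy changes after `checkpoint` into `follower`; return the new checkpoint."""
    for position in range(checkpoint, len(feed)):
        doc = feed[position]
        # Re-key by customer so the follower suits a different query shape.
        partition = follower.setdefault(doc["customerId"], {})
        # Upsert: a later version of a document replaces the earlier one.
        partition[doc["id"]] = doc
    return len(feed)


# Hypothetical change feed: ordered document versions from the main collection.
feed = [
    {"id": "o1", "customerId": "c1", "total": 10},
    {"id": "o2", "customerId": "c2", "total": 25},
]
follower = {}
checkpoint = apply_changes(feed, follower, 0)

# Document o1 is updated later; only the new change is processed.
feed.append({"id": "o1", "customerId": "c1", "total": 12})
checkpoint = apply_changes(feed, follower, checkpoint)
```

The checkpoint is what lets the consumer resume where it left off instead of re-reading the whole collection.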
Replication is limited to a single container/collection. For most scenarios like yours, one would use an alternate partition key to make the second collection read-optimized. You should also review your top queries and consider using an alternate database that is more read-optimized.
You could use this new tool:
https://github.com/Azure-Samples/azure-cosmosdb-live-data-migrator

Data access layer patterns using azure function

We are currently working on a design using Azure functions with Azure storage queue binding.
Each message in the queue represents a complete transaction. An Azure function will be bound to that queue so that the function will be triggered as soon as there is a new message in the queue.
The function will then commit the transaction in a SQL DB.
The first-cut implementation is also complete, and it's working fine. However, in retrospect, we are considering the following:
In a typical DAL, there are well-established design patterns using Entity Framework, repository patterns, etc. However, we didn't find similar guidance or best practices for implementing a DAL within server-less code.
Therefore, my question is: should such patterns be implemented with Azure functions (this would be challenging :) ), or should the server-less code be kept as light as possible or this is not a use-case for azure functions, at all?
It doesn't take anything too special. We're using a routine set of library DLLs for all kinds of things -- database, interacting with other parts of Azure (like retrieving Key Vault secrets for connection strings), parsing file uploads, business rules, and so on. The libraries are targeting netstandard20 so we can more easily migrate to Functions v2 when the right triggers become available.
Mainly just design your libraries so they're highly modularized, so you can minimize how much you load to get the job done (assuming reuse in other areas of the system is important, which it usually is).
It would be easier if dependency injection was available today. See this for a few ways some of us have hacked it together until we get official DI support. (DI is on the roadmap for Functions, I believe the 3.0 release.)
At first I was a little worried about startup time with the library approach, but the underlying WebJobs stack itself is already pretty heavy, and Functions startup performance seems to vary wildly anyway (on the cheaper tiers, at least). During testing, one of our infrequently-executed Functions has varied from just ~300ms to a peak of about ~3800ms to parse the exact same test file, with all but ~55ms spent on startup.
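To show the shape of this approach, here is a minimal sketch in Python: a thin queue-triggered entry point that delegates to a reusable data-access function. It is only an illustration under assumptions. sqlite3 stands in for the SQL DB, and the message format, table names, and the save_order and handle_message functions are hypothetical, not part of the Azure Functions API.

```python
import json
import sqlite3


def save_order(conn, order):
    """Data-access function: write one order and its lines atomically."""
    with conn:  # commits on success, rolls back if an exception is raised
        conn.execute(
            "INSERT INTO orders (id, customer) VALUES (?, ?)",
            (order["id"], order["customer"]),
        )
        conn.executemany(
            "INSERT INTO order_lines (order_id, sku, qty) VALUES (?, ?, ?)",
            [(order["id"], line["sku"], line["qty"]) for line in order["lines"]],
        )


def handle_message(conn, body):
    """Thin entry point: parse the queue message, delegate to the library."""
    save_order(conn, json.loads(body))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE order_lines (order_id INTEGER, sku TEXT, qty INTEGER)")

handle_message(
    conn, '{"id": 1, "customer": "acme", "lines": [{"sku": "A", "qty": 2}]}'
)
```

Because each message is one complete transaction, a malformed message leaves no partial rows behind: the insert into orders is rolled back if building the order lines fails.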
"should such patterns be implemented with Azure functions (this would be challenging :) ), or should the server-less code be kept as light as possible or this is not a use-case for azure functions, at all?"
My answer is NO.
There should be patterns to follow, but the traditional repository patterns and CRUD operations do not seem to be valid in the cloud era.
Many strong concepts we were raised to adhere to have become invalid these days.
Denormalizing the database has become not only acceptable but preferable.
Now, designing a pattern will depend on the database you selected for your solution, as well as on the type of your application and the type of your data.
This is a link to general guidelines for Table Storage design.
Is your application read-heavy or write-heavy? The design will vary accordingly.
Are you using Azure Tables or Mongo? There are design decisions based on that. Indexing is important in Mongo, while there is none that you can do in Azure Tables.
Sharding consideration.
Redundancy Consideration.
In modern development and architecture many principles have changed: each microservice has its own database, which might be totally different from that of any other microservice.
If you read along the guidelines that I provided, you will see what I mean.
Designing your Table service solution to be read efficient:
Design for querying in read-heavy applications. When you are designing your tables, think about the queries (especially the latency sensitive ones) that you will execute before you think about how you will update your entities. This typically results in an efficient and performant solution.
Specify both PartitionKey and RowKey in your queries. Point queries such as these are the most efficient table service queries.
Consider storing duplicate copies of entities. Table storage is cheap so consider storing the same entity multiple times (with different keys) to enable more efficient queries.
Consider denormalizing your data. Table storage is cheap so consider denormalizing your data. For example, store summary entities so that queries for aggregate data only need to access a single entity.
Use compound key values. The only keys you have are PartitionKey and RowKey. For example, use compound key values to enable alternate keyed access paths to entities.
Use query projection. You can reduce the amount of data that you transfer over the network by using queries that select just the fields you need.
Designing your Table service solution to be write efficient:
Do not create hot partitions. Choose keys that enable you to spread your requests across multiple partitions at any point of time.
Avoid spikes in traffic. Smooth the traffic over a reasonable period of time and avoid spikes in traffic.
Don't necessarily create a separate table for each type of entity. When you require atomic transactions across entity types, you can store these multiple entity types in the same partition in the same table.
Consider the maximum throughput you must achieve. You must be aware of the scalability targets for the Table service and ensure that your design will not cause you to exceed them.
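Several of the read-side guidelines above (point queries, duplicate copies, compound keys) can be illustrated together. The following is a minimal sketch in Python that models a table as a dictionary keyed by (PartitionKey, RowKey). It does not use the Azure Storage SDK, and the employee entity, the id_ and email_ key prefixes, and the helper functions are assumptions made for the example.

```python
table = {}  # (PartitionKey, RowKey) -> entity; stands in for one table


def insert_employee(emp):
    """Store the same entity twice so it can be found by id or by email."""
    dept = emp["department"]
    table[(dept, f"id_{emp['id']}")] = emp
    table[(dept, f"email_{emp['email']}")] = emp  # duplicate copy, alternate key


def point_query(partition_key, row_key):
    """Both keys are given: a single direct lookup, no scan."""
    return table.get((partition_key, row_key))


insert_employee(
    {"id": 42, "email": "jo@example.com", "department": "sales", "name": "Jo"}
)

by_id = point_query("sales", "id_42")
by_email = point_query("sales", "email_jo@example.com")
```

Both lookups are point queries on the same partition, at the cost of writing the entity twice.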
Another good source is this link:

How does AWS SimpleDB differ from Azure DocumentDB? How do both differ from ElasticSearch?

In terms of
scalability,
performance,
maintenance,
Ease of use / Learning curve
cost,
In order of significance, but I wouldn't mind a general answer as I appreciate I'm probably asking for too much :)
Thanks
EDIT: I'm looking for a database that will serve as the single authoritative data store and I need all attributes of the documents stored to be indexed for various business reasons. Therefore I know that other solutions won't do what I'm looking for.
tl;dr; If you are using JavaScript and building browser apps, node.js and DocumentDB are a match made in heaven. If you are using .NET and/or other Azure services, then DocumentDB is favored. If you are using other AWS services, then SimpleDB might be better.
I know that questions like this are not ideal for Stack Overflow, but I often see value in answers like this and my most popular answer on SO is essentially informed opinion backed by evidence. I have not used SimpleDB but I looked into it before deciding on DocumentDB. I rejected it pretty quickly... although I did give AWS Lambda a serious look before deciding on DocumentDB. So:
scalability. DocumentDB has a very straightforward and explicit scaling model -- add more collections if you need either more space or more operations per second. SimpleDB's scaling model is similar except less straightforward since you add domains which are overloaded to both provide type separation (think tables) and scalability. You can scale either to whatever you need.
performance. Since I never built anything on it, I can't say anything about SimpleDB's performance. However, I've been very impressed with the performance of DocumentDB. Latency is less than 10ms for simple id-based reads and I get impressive latency and throughput for queries. The DocumentDB implementation of our current app returns complex n-dimensional aggregations (done in stored procedures on DocumentDB using documentdb-lumenize) in 1/4 the time of the functionally-equivalent MongoDB/node.js implementation. You'd have to do your own performance testing on your actual application to have a definitive answer here.
maintenance. Both are much more hands off than traditional data stores. There just aren't that many knobs to turn maintaining either of them. SimpleDB geographically distributes your data by default. You'd have to do the equivalent manually in DocumentDB. Possible, but harder. DocumentDB has good import/export tools and their backup solution is about to be significantly upgraded.
ease of use / learning curve. If you are a JavaScript programmer, then DocumentDB has a lot to recommend it. DocumentDB uses JSON natively. SimpleDB uses XML. DocumentDB has ACID-enabling stored procedures written in JavaScript. You'd need to combine SimpleDB with something else (Lambda maybe, but the XML/JavaScript mismatch would make this less than ideal) to get the equivalent. Both allow you to use SQL but DocumentDB also allows for JavaScript native queries.
There is one huge mindset hurdle that you will have to get over in order to be successful with DocumentDB. Despite the fact that they both scale by adding more domains/collections, SimpleDB domains are closer conceptually to tables. The word choice of "collection" by the DocumentDB team is unfortunate since they are more akin to partitions and should not be thought of as tables. The hard part is getting used to the idea that you store all of your different data types in the same collection. Once you get over that, I find DocumentDB's approach refreshing and incredibly flexible. I can efficiently model inheritance and type-mixins. Collections nay partitions have one purpose -- scalability. Domains are used for both scalability and data type separation which is actually harder in practice.
cost. Not much to say here. Both allow you to scale your cost gradually. For really small implementations, DocumentDB is probably more expensive since the smallest unit of usage is a single collection which is $25/month minimum. You'd have to do your own modeling/what-if analysis to determine which would be less expensive for you. Note that Azure is being very aggressive in general and even pushing AWS to lower prices in some cases. My gut is that they would be roughly equal in cost for most applications.
Other thoughts:
You wrote, "I need all attributes of the documents stored to be indexed". One really nice feature of DocumentDB is that you can specify the size of your indexes. By default, every field is indexed into a 3-byte per field hash index, which is highly space efficient. I do not know if SimpleDB has the equivalent.
This is a bit like comparing apples to oranges. I consider DocumentDB to be like MongoDB or CouchDB in its data model and VoltDB in its execution model (although VoltDB sprocs are written in Java). SimpleDB feels more like a simple XML object store. If you already have a big XML mindset, then it might be easier, but I think there are more folks using JSON today than XML.
Writing ACID-enabling stored procedures in JavaScript is a killer feature that only DocumentDB has. Some say the days of stored procedures are over; that you should put all such logic in your application server layer. If you are implementing a simple CRUD API, that may be, but almost every application requires some sort of transaction where more than one row is changed at a time. This is mind bogglingly hard to do correctly without transaction support in your data store. Even if you do implement the equivalent of transactions with your NoSQL database, the overhead of the implementation eats away any development/performance/scalability advantages that you got by choosing NoSQL rather than SQL.
DocumentDB's user defined functions and triggers (also written in JavaScript) might be useful, although I believe the trigger implementation is crippled at this moment in time and I haven't found a use for UDFs myself yet.
DocumentDB has built-in attachment support. You need to integrate manually with S3 for the equivalent on AWS.
DocumentDB has geo indexing and operators.
SimpleDB's 1K per document limit is a serious limitation. This tells me that it's designed mostly for logging or as an index to S3 and not a full-fledged document store. The limit for DocumentDB is 512K.
If comparison to SimpleDB is like apples to oranges, then comparison to ElasticSearch is like apples to fire engines. My impression of ElasticSearch is that it's all about full-text searching and analytics. I don't think it's space/execution/api efficient to serve as a primary transactional store. Built on Lucene, it was not designed to have the reliability/durability to be your primary store. Further, even when hosted, it's more of an IaaS offering, whereas DocumentDB and SimpleDB are true PaaS offerings. The maintenance will be much higher with ElasticSearch.

How does Azure DocumentDB scale? And do I need to worry about it?

I've got an application that's outgrowing SQL Azure - at the price I'm willing to pay, at any rate - and I'm interested in investigating Azure DocumentDB. The preview clearly has distinct scalability limits (as described here, for instance), but I think I could probably get away with those for the preview period, provided I'm using it correctly.
So here's the question I've got. How do I need to design my application to take advantage of the built-in scalability of Azure DocumentDB? For instance, I know that with Azure Table Storage - that cheap but awful, highly limited alternative - you need to structure all your data in a two-step hierarchy: PartitionKey and RowKey. Provided you do that (which is well-nigh impossible in a real-world application), ATS (as I understand it) moves partitions around behind the scenes, from machine to machine, so that you get near-infinite scalability. Awesome, and you never have to think about it.
Scaling out with SQL Server is obviously much more complicated - you need to design your own sharding system, deal with figuring out which server the shard in question sits on, and so forth. Possible, and done right quite scalable, but complex and painful.
So how does scalability work with DocumentDB? It promises arbitrary scalability, but how does the storage engine work behind the scenes? I see that it has "Databases", and each database can have some number of "Collections", and so forth. But how does its arbitrary scalability map to these other concepts? If I have a SQL table that contains hundreds of millions of rows, am I going to get the scalability I need if I put all this data into one collection? Or do I need to manually spread it across multiple collections, sharded somehow? Or across multiple DB's? Or is DocumentDB somehow smart enough to coalesce queries in a performant way from across multiple machines, without me having to think about any of it? Or...?
I've been looking around, and haven't yet found any guidance on how to approach this. Very interested in what other people have found or what MS recommends.
Update: As of April 2016, DocumentDB has introduced the concept of a partitioned collection, which allows you to scale out and take advantage of server-side partitioning.
A single DocumentDB database can scale practically to an unlimited amount of document storage partitioned by collections (in other words, you can scale out by adding more collections).
Each collection provides 10 GB of storage, and a variable amount of throughput (based on performance level). A collection also provides the scope for document storage and query execution; and is also the transaction domain for all the documents contained within it.
Source: http://azure.microsoft.com/en-us/documentation/articles/documentdb-manage/
Here's a link to a blog post I wrote on scaling and partitioning data for a multi-tenant application on DocumentDB.
With the latest version of DocumentDB, things have changed. There is still the 10GB limit per collection but in the past, it was up to you to figure out how to split up your data into multiple collections to avoid hitting the 10 GB limit.
Instead, you can now specify a partition key, and DocumentDB handles the partitioning for you. For example, if you have log data, you may want to partition the data on the date value in your JSON document, so that each day a new partition is created.
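The effect of a partition key can be sketched in a few lines of Python. This is only a model under stated assumptions: the text does not describe how DocumentDB places data internally, so the hash-based placement, the fixed PARTITION_COUNT, and the write and read_by_key helpers are all hypothetical.

```python
import hashlib
from collections import defaultdict

PARTITION_COUNT = 4  # assumed; a managed service would choose this itself

partitions = defaultdict(list)  # partition number -> documents


def partition_for(key):
    """Map a partition key value to a partition number, stable across runs."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % PARTITION_COUNT


def write(doc, partition_key="date"):
    partitions[partition_for(doc[partition_key])].append(doc)


def read_by_key(value, partition_key="date"):
    """Single-partition query: only one partition is consulted."""
    candidates = partitions[partition_for(value)]
    return [d for d in candidates if d[partition_key] == value]


write({"date": "2016-04-01", "msg": "started"})
write({"date": "2016-04-01", "msg": "stopped"})
write({"date": "2016-04-02", "msg": "started"})

april_first = read_by_key("2016-04-01")
```

All documents sharing a date end up together, so a query that filters on the partition key touches one partition, while a query without it would have to fan out across all of them.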
You can fan out queries like this - http://stuartmcleantech.blogspot.co.uk/2016/03/scalable-querying-multiple-azure.html

Resources