Does an LMDB reader get the latest view of the database if the writer is updating constantly?

I am using an LMDB database.
I have a writer and multiple readers. One of the readers is transient, and the other is long-running: it opens the database once and reads from it many times. The transient reader works fine, but reads from the long-running one sometimes return stale data and sometimes fail.
Do reads in an LMDB reader always represent the latest view of the database?
Does the reader have to do update() or something to get the latest view?

LMDB is a copy-on-write database, and a read transaction pins the snapshot of pages that existed when it began. An open, long-running read transaction will therefore only ever see that older view of the data. A long-running read transaction also prevents the writer from reusing the pages it pins, so the database is forced to allocate new pages and grows much faster as a result.
LMDB isn't designed for long-running transactions, and keeping them open will lead to exactly these issues. You need to keep your transactions as short as possible. If you genuinely need a long-running transaction, it might be better to look for an alternative database, as that pattern most likely won't work well with LMDB.
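To make that concrete: a freshly opened read transaction always sees the most recently committed data, so the fix for the long-running reader is to stop holding one transaction open for its whole lifetime and instead begin a short read transaction per lookup (or per small batch of lookups). A minimal sketch using the py-lmdb binding, which is an assumption about the binding in use; the database path is a placeholder:

    import lmdb

    # Opening the environment once is fine; it is the transaction, not the
    # environment, that freezes the view of the data.
    env = lmdb.open("/path/to/db", readonly=True)

    def read_latest(key: bytes):
        # A short-lived read transaction sees everything committed so far.
        # Exiting the 'with' block aborts it and releases the pinned snapshot,
        # so the writer can reuse those pages again.
        with env.begin() as txn:
            return txn.get(key)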

Related

Synchronicity of Azure stored procedures [duplicate]

Can documentDb stored procedures run in parallel and update the same object? Will documentDb process them sequentially?
Consider the following scenario.
I have an app and I have 10000 coins to give away to my users when they complete a task. And I have the following object
{
    remainingPoints: 10000
}
I have a stored procedure that subtracts 10 points from this object and adds them to the users' points.
Now let's say 10 users complete the task at the same time and I call the stored procedure 10 times at the same time. Will DocDb execute them sequentially, or will I have to execute the stored procedures sequentially?
I had similar questions when I first started using DocumentDB and got good answers here and in email from the DocumentDB product managers. Quoting:
Stored procedures ... get an isolated snapshot of the database for transactional support. The snapshot reflects the current state of the world (no stale data) at the time the sproc begins execution (strongly consistent).
Caveat: since stored procedures are operating on a snapshot, you can still get a stale read in a sproc if a new write comes in from the outside world during execution.
Also, stored procedures will ALWAYS read their own writes.
Sprocs are DocumentDB's mechanism for multi-document transactions. Sproc writes are committed when a sproc successfully completes execution. If an exception is thrown, all work done in a sproc gets rolled back.
So if two sprocs are running concurrently, they won't see each other's writes.
If both sprocs happen to write to the same document (replace) – then the 2nd one will fail due to an etag mismatch when it attempts to commit writes.
From that, I went forward with my design, making sure to use ETags in my writes as @Julian suggests. I also automatically retry each sproc execution up to 3 times to handle the case where it fails due to parallel operations, among other reasons. In practice, I've never exceeded the 3 retries (except in cases where my sproc had a bug), and I rarely even get a single retry.
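To make the retry-on-conflict part concrete, here is a minimal client-side sketch. It assumes the current azure-cosmos Python SDK (the original answer predates it), and the sproc name awardPoints, the database/container names, and the choice of partition key are all hypothetical; the idea is simply to re-execute the sproc a few times when its commit loses a race with a concurrent execution:

    from azure.cosmos import CosmosClient, exceptions

    client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
    container = client.get_database_client("appdb").get_container_client("points")

    def award_points(user_id, attempts=3):
        for attempt in range(attempts):
            try:
                # The server-side JS sproc runs on a snapshot and commits only
                # if the documents it replaces still match their ETags.
                return container.scripts.execute_stored_procedure(
                    "awardPoints",              # hypothetical sproc name
                    params=[user_id, 10],
                    partition_key=user_id,      # assumes the user id is the partition key
                )
            except exceptions.CosmosHttpResponseError as err:
                # 412 (ETag mismatch) or 449 (retry with) means a concurrent
                # execution won the race; retrying re-runs the sproc against a
                # fresh snapshot.
                if err.status_code not in (412, 449) or attempt == attempts - 1:
                    raise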
I assume from the behavior that I observe that it sends each new sproc execution to a different replica until it runs out of replicas and then it queues them for sequential execution, so it's a hybrid of parallel and serial execution.
One other tip that I learned through experimentation is that you are better off doing pure read operations (no writes and no significant aggregation) client-side rather than in a sproc when you are on a heavily loaded system. I assume the advantage is because DocumentDB can satisfy different reads from different replicas in parallel. I have modularized my sproc code using the expandScript functionality of documentdb-utils to make sure that I use the exact same code for write validation, intra-document consistency, and derived fields both client-side and server-side, which is possible using node.js. Even if you are mostly .NET, you may want to use expandScripts to build your sprocs in a modular DRY way. You'll still need to run node.js in your build process to pre-process your sprocs or use Edge.NET (node running inside of .NET) to do so on the fly.
It will depend on the consistency level you have chosen for your collection. But the idea is that DocumentDB handles concurrency using ETags and executes the stored procedure on a snapshot of the document version, committing the result only if the execution succeeds.
See: https://azure.microsoft.com/en-us/documentation/articles/documentdb-faq/#develop
This thread may help too: Atomically increment an integer in a document in Azure DocumentDB

Nodejs and Sqlite. Perform long queries

I have to perform two queries: query A is long (20 seconds) and query B is fast (1 second).
I want to guarantee that query B still completes quickly even while query A is running.
How can I achieve this behaviour?
It may not be easy to do because of how SQLite does locking.
From the official Appropriate Uses For SQLite documentation:
SQLite supports an unlimited number of simultaneous readers, but it will only allow one writer at any instant in time. For many situations, this is not a problem. Writers queue up. Each application does its database work quickly and moves on, and no lock lasts for more than a few dozen milliseconds. But there are some applications that require more concurrency, and those applications may need to seek a different solution.
[...]
SQLite only supports one writer at a time per database file. But in most cases, a write transaction only takes milliseconds and so multiple writers can simply take turns. SQLite will handle more write concurrency than many people suspect. Nevertheless, client/server database systems, because they have a long-running server process at hand to coordinate access, can usually handle far more write concurrency than SQLite ever will.
As the SQLite documentation suggests, SQLite may not be the best fit when you have so much data that a single query takes that long.
There is no easy fix for that, other than moving to a client/server RDBMS such as PostgreSQL.
And since you didn't include the queries that take so long, it's impossible to say much more than that. Maybe your queries could be optimized, but we can't tell without seeing them.
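That said, it is worth making sure the two queries are not simply serialized by your own code. SQLite readers never block other readers, so if both A and B are SELECTs the main thing is to give each one its own connection rather than funnelling them through a single serialized queue; if writes are involved, WAL mode additionally keeps readers and the single writer from blocking each other. A minimal sketch, in Python's sqlite3 for brevity (the question uses node.js, but the separate-connection and PRAGMA idea carries over); the events table and both statements are hypothetical stand-ins:

    import sqlite3
    import threading

    DB_PATH = "app.db"   # placeholder

    def run(sql):
        # Each query gets its own connection, created in its own thread.
        conn = sqlite3.connect(DB_PATH, timeout=30)
        try:
            conn.execute("PRAGMA journal_mode=WAL")   # readers and the writer no longer block each other
            return conn.execute(sql).fetchall()
        finally:
            conn.close()

    slow_sql = "SELECT COUNT(*) FROM events e1 JOIN events e2"       # stand-in for the 20 s query A
    fast_sql = "SELECT * FROM events ORDER BY rowid DESC LIMIT 10"   # stand-in for the 1 s query B

    a = threading.Thread(target=run, args=(slow_sql,))
    b = threading.Thread(target=run, args=(fast_sql,))
    a.start(); b.start()
    a.join(); b.join()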

Massive inserts kill arangod (well, almost)

I was wondering of anyone has ever encountered this:
When inserting documents via AQL, I can easily kill my arango server. For example
FOR i IN 1 .. 10
  FOR u IN users
    INSERT {
      _from: u._id,
      _to: CONCAT("posts/",CEIL(RAND()*2000)),
      displayDate: CEIL(RAND()*100000000)
    } INTO canSee
(where users contains 500000 entries), the following happens
canSee becomes completely locked (also no more reads)
memory consumption goes up
arangosh or the web console becomes unresponsive and fails with [ArangoError 2001: Could not connect]
server is still running, accessing collection gives timeouts
it takes around 5-10 minutes until the server recovers and I can access the collection again
access to any other collection works fine
So OK, I'm creating a lot of entries, and AQL might be implemented in a way that does this in bulk. Doing the writes via the db.save method works, but is much slower.
I also suspect this might have to do with the write-ahead log filling up.
But still, is there a way I can fix this? Writing a lot of entries to a database should not necessarily kill it.
Logs say
DEBUG [./lib/GeneralServer/GeneralServerDispatcher.h:411] shutdownHandler called, but no handler is known for task
DEBUG [arangod/VocBase/datafile.cpp:949] created datafile '/usr/local/var/lib/arangodb/journals/logfile-6623368699310.db' of size 33554432 and page-size 4096
DEBUG [arangod/Wal/CollectorThread.cpp:1305] closing full journal '/usr/local/var/lib/arangodb/databases/database-120933/collection-4262707447412/journal-6558669721243.db'
Best
The above query will insert 5M documents into ArangoDB in a single transaction. This will take a while to complete, and while the transaction is still ongoing, it will hold lots of (potentially needed) rollback data in memory.
Additionally, the above query will first build up all the documents to insert in memory, and once that's done, will start inserting them. Building all the documents will also consume a lot of memory. When executing this query, you will see the memory usage steadily increasing until at some point the disk writes will kick in when the actual inserts start.
There are at least two ways to improve this:
it might be beneficial to split the query into multiple, smaller transactions (see the sketch after this list). Each transaction then won't be as big as the original one, and won't block as many system resources while it is ongoing.
for the query above, it technically isn't necessary to build up all documents to insert in memory first, and only after that insert them all. Instead, documents read from users could be inserted into canSee as they arrive. This won't speed up the query, but it will significantly lower memory consumption during query execution for result sets as big as above. It will also lead to the writes starting immediately, and thus write-ahead log collection starting earlier. Not all queries are eligible for this optimization, but some (including the above) are. I worked on a mechanism today that detects eligible queries and executes them this way. The change was pushed into the devel branch today, and will be available with ArangoDB 2.5.
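A minimal sketch of the first suggestion, splitting the insert into several smaller transactions. It uses the python-arango driver purely for illustration (the question uses arangosh), and the batch size and connection details are assumptions; each execute() call runs as its own transaction over a slice of users, so no single transaction has to hold millions of documents of rollback data:

    from arango import ArangoClient

    db = ArangoClient(hosts="http://localhost:8529").db(
        "_system", username="root", password=""          # placeholder credentials
    )

    BATCH = 10000                                         # tune to taste
    total = db.collection("users").count()

    for offset in range(0, total, BATCH):
        # One transaction per slice of users instead of one giant transaction.
        db.aql.execute(
            """
            FOR u IN users
              LIMIT @offset, @count
              FOR i IN 1 .. 10
                INSERT {
                  _from: u._id,
                  _to: CONCAT("posts/", CEIL(RAND() * 2000)),
                  displayDate: CEIL(RAND() * 100000000)
                } INTO canSee
            """,
            bind_vars={"offset": offset, "count": BATCH},
        )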

Custom in-memory cache

Imagine there's a web service:
Runs on a cluster of servers (nginx/node.js)
All data is stored remotely
Must respond within 20ms
Data that must be read for a response is split like this:
BatchA
Millions of small objects stored in AWS DynamoDB
Updated randomly at random times
Only consistent reads, can't be cached
BatchB
~2,000 records in SQL
Updated rarely, records up to 1KB
Can be cached for up to 60-90s
We can't read them all at once as we don't know which records to fetch from BatchB until we read from BatchA.
A read from DynamoDB takes up to 10ms. If we also read BatchB from a remote location, it would leave us no time for calculations, or we would already have timed out.
My current idea is to load all BatchB records into the memory of each node (that's only ~2MB). On startup, the system would connect to the SQL server and fetch all records, and then it would refresh them every 60 or 90 seconds. The question is what's the best way to do this?
I could simply read them all into a variable (array) in node.js and then use setTimeout to update the array after 60-90s. But is that the best solution?
Your solution doesn't sound bad. It fits your needs. Go for it.
I suggest keeping two copies of the cache while in the process of updating it from remote location. While the 2MB are being received you've got yourself a partial copy of the data. I would hold on to the old cache until the new data is fully received.
Another approach would be to maintain only one cache set and update it as each record arrives. However, this is more difficult to implement and is error-prone. (For example, you should not forget to delete records from the cache if they are no longer found in the remote location.) This approach conserves memory, but I don't suppose that 2MB is a big deal.
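A minimal sketch of that double-copy refresh, in Python for illustration (the question's service is node.js, where the same reference-swap idea works with setInterval); fetch_all_records is a hypothetical stand-in for the SQL query that returns the ~2,000 BatchB rows:

    import threading

    def fetch_all_records():
        # Hypothetical stand-in for the remote SQL fetch; returns the full set,
        # keyed by whatever ID the BatchA objects point at.
        return {"record-1": {"payload": "..."}}

    class SwapCache:
        # Holds one complete copy of the data; a refresh builds the new copy
        # off to the side and swaps the reference only when it is complete.

        def __init__(self, fetch, interval=60):
            self._fetch = fetch
            self._interval = interval
            self._data = fetch()            # first load before serving traffic
            self._schedule()

        def _schedule(self):
            t = threading.Timer(self._interval, self._refresh)
            t.daemon = True
            t.start()

        def _refresh(self):
            try:
                fresh = self._fetch()       # the old copy keeps serving reads meanwhile
                self._data = fresh          # single reference swap; readers never see a partial set
            finally:
                self._schedule()            # keep refreshing even if this attempt failed

        def get(self, key):
            return self._data.get(key)

    cache = SwapCache(fetch_all_records, interval=90)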

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't doing any work most of the time. The clients poll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't really made to push data out (not easily, anyway). There are things that you could use to push data out (replication, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (xp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. They are also low-overhead, reasonably stable, and error-tolerant. The downside is that they can hammer the database into dust if not carefully controlled.
A message queue might work well, as they're usually setup to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.
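A minimal sketch of that cheap check, assuming Python with mysql-connector-python on the client side (the question's workers are Python); the latest_job table, its columns, and process_new_jobs are hypothetical placeholders for the existing job logic:

    import time
    import mysql.connector

    def process_new_jobs():
        # Placeholder for the existing full poll / lock-and-run logic.
        pass

    conn = mysql.connector.connect(
        host="dbhost", user="worker", password="secret",
        database="letters",
        autocommit=True,   # so each SELECT sees freshly committed data
    )
    cur = conn.cursor()

    seen = 0
    while True:
        # One tiny indexed read instead of scanning the whole jobs table.
        cur.execute("SELECT last_job_id FROM latest_job WHERE id = 1")
        row = cur.fetchone()
        last_id = row[0] if row else 0
        if last_id != seen:
            seen = last_id
            process_new_jobs()
        time.sleep(5)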
Unlike Postgres and SQL Server (or object stores like CouchDb), MySQL does not emit database events. However there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
You might think that MyISAM is the best choice for the changes table since it mostly receives writes (or even MEMORY if you don't need the events to survive a database server restart). However, keep in mind that both MEMORY and MyISAM have only full-table locks, so a trigger on an InnoDB table might hit a bottleneck when inserting into a MEMORY or MyISAM changes table. You may also require InnoDB for the changes table if you're using ON DELETE CASCADE with another InnoDB table (foreign keys require both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of your changes table to see whether there's something to process. This feature won't work for InnoDB tables.
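A minimal sketch of the trigger-plus-changes-table pattern described above, issuing the DDL through Python with mysql-connector-python; the jobs and job_changes table names and columns are hypothetical:

    import mysql.connector

    conn = mysql.connector.connect(
        host="dbhost", user="admin", password="secret", database="letters"
    )
    cur = conn.cursor()

    # Small InnoDB table that records one row per event the workers care about.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS job_changes (
            id         INT AUTO_INCREMENT PRIMARY KEY,
            job_id     INT NOT NULL,
            created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        ) ENGINE=InnoDB
    """)

    # The trigger filters/queues events; workers poll job_changes (or just its
    # MAX(id)) instead of scanning the full jobs table.
    cur.execute("DROP TRIGGER IF EXISTS jobs_after_insert")
    cur.execute("""
        CREATE TRIGGER jobs_after_insert
        AFTER INSERT ON jobs
        FOR EACH ROW
            INSERT INTO job_changes (job_id) VALUES (NEW.id)
    """)
    conn.commit()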
These articles describe in more depth some alternative ways to implement queues in MySQL, and even how to avoid polling:
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you
