Hanging replications on Cloudant

Hanging replications on Cloudant - couchdb

In the last months, we've been charged for a lot of HTTP requests we were not expecting on Cloudant. By looking at CouchDB console locally, I found out that for each continuous replication a GET request is issued every 5 seconds or so.
I have stopped all the continuous replications I could find in Futon and I did the same for every Cloudant accounts we have. By looking at Cloudant's dashboard, I have seen a reduction of GET requests (many thousands), but it did not went down to a reasonable level. So there must be some continuous replications left, but I cannot find them.
How can I find and stop the remaining replications?

To identify continuous replications that may be hidden to the user, the best way is to query a curl command, invoke _active_tasks, and apply a jq filter to display only those tasks of type "replication".
That is to say, in the command line, run a command of the following form:
curl 'https://username:password#username.cloudant.com/databasename/_active_tasks | jq 'map(select(.type == "replication"))'
The same methodology can be applied to retrieve other active tasks (view_compaction, database_compaction, etc.)
That said, in general, Cloudant-based replication is much smoother when using the _replicator database. To do so:
1) As an initial one-time task, create the database:
https://username.cloudant.com/_replicator
2) Then, create a document for every replication. If you have "continuous":true in the doc it will be treated as continuous.
3) Then, to cancel the replication you simply delete the document.
All of the above commands (e.g. creating and deleting documents) are well documented on Cloudant's website as well as throughout Stack Overflow, so please refer there for further details.
Finally, it is imperative to add the usr_ctx field so that the replication gets triggered and run within your user context. This is critical so that it shows up when you query _active_tasks, otherwise it'll run anonymously and only show up in the _active_tasks if queried by an admin. This is precisely what happened in the case of the original poster.

Related

Elasticsearch how to check for a status of a bulk indexing request?

I am bulk indexing into Elasticsearch docs containing country shapes (files here), based on the cshapes dataset.
The geoshapes have a lot of points in "geometry":{"type":"MultiPolygon", and the bulk request takes a long time to complete (and sometimes does not complete, which is a separate and already reported problem).
Since the client times out (I use the official ES node.js), I would like to have a way to check what the status of the bulk request is, without having to use enormous timeout values.
What I would like is to have a status such as active/running, completed or aborted. I guess that just by querying the single doc in the batch would not tell me whether the request has been aborted.
Is this possible?

I'm not sure if this is exactly what you're looking for, but may be helpful. Whenever I'm curious about what my cluster is doing, I check out the tasks API.
The tasks API shows you all of the tasks that are currently running on your cluster. It will give you information about individual tasks, such as the task ID, start time, and running time. Here's the command:
curl -XGET http://localhost:9200/_tasks?group_by=parents | python -m json.tool

Elasticsearch doesn't provide a way to check the status of an ongoing Bulk request- documentation reference here.
First, check that your request succeeds with a smaller input, so you know there is no problem with the way you are making the request. Second, try dividing the data into smaller chunks and calling the Bulk API on them in parallel.
You can also try with a higher request_timeout value, but I guess that is something you don't want to do.

just a side note hint, of why your requests might take a lot of time (unless you are just indexing too many in a single bulk run). If you have configured your own precision for geo shapes, also make sure you are configuring distance_error_pct, otherwise no error is assumed, resulting in documents with a lot of terms that take a lot of time indexing.

DocumentDB: How to run a query without timing out

I am new to the documentDb. I wrote a stored procedure that checks all records and update them under certain circumstances.
Current scenario:
It would run 100 records at a time, updates them and after running few times( taking 100 records at a time and updating) it is timing out.
Expectation
Run the script on all the records without timing out.
The document has close to a million records. So, running the same script multiple times manually is not a the way I am looking for.
Can anyone please advise me how I can achieve that?

tl;dr; Keep calling the sproc with the query continuation token being passed back and forth.
A few thoughts:
There is no capacity of RUs for collections that will allow you to do all million in one call to the sproc.
Sprocs run in isolation on a single replica. This means that they can be transactional but their use will have lower throughput than a regular query that can use all replicas to satisfy the request, so unless you need it to be in a sproc, I recommend using direct queries for reads that don't need to be transactional with writes. Even then, with a million documents, your queries will max out and you'll have to run the query again with a continuation token.
If you must use a sproc... As you are probably aware since you have done the 100 at a time thing, each query returns a continuation token. You can actually add that to the package that you send back from your sproc when it times out. Then you can pass that back into another call to the same sproc and write your sproc to pick up where you left off. The documentdb-utils library for node.js automatically re-calls the sproc until done as long as you follow this pattern for writing your sprocs. If you are using node.js, you could use that (but it has not yet been upgraded to support partitioned collections) or you could write the equivalent in whatever platform you are using.

couchdb request never completes

I am trying to request a large number of documents from my database (which has over 400k documents). I started using _all_docs built-in view. I first tried with this query:
http://database:port/databasename/_all_docs?limit=100&include_docs=true
No problem. Completes as expected. Now to ramp it up:
http://database:port/databasename/_all_docs?limit=1000&include_docs=true
Still fine. Took longer, more data, etc. as expected. Ramp it up again:
http://database:port/databasename/_all_docs?limit=10000&include_docs=true
Request never completes. The Dev tools in chrome show Size = 5.3MB (seems to be significant), and this occurs no matter what value for the limit parameter I use that is over 6500ish. No matter if i specify 6500 or 10,000, it always returns 5.3MB downloaded, and the request stalls.
I have also tried other combinations, such as "skip" and it seems that limit + skip must be < 6500 or I get the same stall.
My environment: Couchdb 1.6.1, Ubuntu 14.04.3 LTS, Azure A1 standard

you have to prewarm your queries, just throwing a 100K or more docs and expecting that you'd get them out of couchdb won't work, it just won't work.
When you ask for some items from a view (in your case Default View), at the first read CouchDB will notice that the B-tree for the view doesn't exist yet, so it goes ahead and builds it on the first read. Depending on how many documents you have in your database, that can take a while, putting a good work load on your database.
On every subsequent read, CouchDB will check if documents have changed since the last write, and throw the changed documents at the map and reduce function. So if you only query some views from time to time, but have lots of changes in between, expect some delays on the next read.
There are 2 ways to handle this situation
1. Pre-warm your view - run a cronjob that does reads to make sure that your view has the B-Tree for this View.
2. Prepare your view in advance for a particular query before inserting the data in the couchdb.
and for now if you really want to read all your docs, don't read them all at once, rather use the skip, limit range queries.

Mongodb, can i trigger secondary replication only at the given time or manually?

I'm not a mongodb expert, so I'm a little unsure about server setup now.
I have a single instance running mongo3.0.2 with wiredtiger, accepting both read and write ops. It collects logs from client, so write load is decent. Once a day I want to process this logs and calculate some metrics using aggregation framework, data set to process is something like all logs from last month and all calculation takes about 5-6 hours.
I'm thinking about splitting write and read to avoid locks on my collections (server continues to write logs while i'm reading, newly written logs may match my queries, but i can skip them, because i don't need 100% accuracy).
In other words, i want to make a setup with a secondary for read, where replication is not performing continuously, but starts in a configured time or better is triggered before all read operations are started.
I'm making all my processing from node.js so one option i see here is to export data created in some period like [yesterday, today] and import it to read instance by myself and make calculations after import is done. I was looking on replica set and master/slave replication as possible setups but i didn't get how to config it to achieve the described scenario.
So maybe i wrong and miss something here? Are there any other options to achieve this?

Your idea of using a replica-set is flawed for several reasons.
First, a replica-set always replicates the whole mongod instance. You can't enable it for individual collections, and certainly not only for specific documents of a collection.
Second, deactivating replication and enabling it before you start your report generation is not a good idea either. When you enable replication, the new slave will not be immediately up-to-date. It will take a while until it has processed the changes since its last contact with the master. There is no way to tell how long this will take (you can check how far a secondary is behind the primary using rs.status() and comparing the secondaries optimeDate with its lastHeartbeat date).
But when you want to perform data-mining on a subset of your documents selected by timespan, there is another solution.
Transfer the documents you want to analyze to a new collection. You can do this with an aggregation pipeline consisting only of a $match which matches the documents from the last month followed by an $out. The out-operator specifies that the results of the aggregation are not sent to the application/shell, but instead written to a new collection (which is automatically emptied before this happens). You can then perform your reporting on the new collection without locking the actual one. It also has the advantage that you are now operating on a much smaller collection, so queries will be faster, especially those which can't use indexes. Also, your data won't change between your aggregations, so your reports won't have any inconsistencies between them due to data changing between them.
When you are certain that you will need a second server for report generation, you can still use replication and perform the aggregation on the secondary. However, I would really recommend you to build a proper replica-set (consisting of primary, secondary and an arbiter) and leave replication active at all times. Not only will that make sure that your data isn't outdated when you generate your reports, it also gives you the important benefit of automatic failover should your primary go down for some reason.

alternative to polling database?

I have an application that works as follows: Linux machines generate 28 different types of letter to customers. The letters must be sent in .docx (Microsoft Word format). A secretary maintains MS Word templates, which are automatically used as necessary. Changing from using MS Word is not an option.
To coordinate all this, document jobs are placed into a database table and a python program running on each of the windows machines polls the database frequently, locking out jobs and running them as necessary.
We use a central database table for the job information to coordinate different states ("new", "processing", "finished", "printed")... as well to give accurate status information.
Anyway, I don't like the clients polling the database frequently, seeing as they aren't working most of the time. Clients hpoll every 5 seconds.
To avoid polling, I kind of want a broadcast "there's some work to do" or "check your database for some work to do" message sent to all the client machines.
I think some kind of publish/subscribe message queue would be up to the job, but I don't want any massive extra complexity.
Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
X

Is there any objective evidence that any significant load is being put on the server? If it works, I'd make sure there's really a problem to solve here.
It must be nice to have everything running so smoothly that you're looking at things that might only possibly be improved!

Is there a zero or near zero config/maintenance piece of software that would achieve this? What are the options?
Possibly, but what you would save in configuration and implementation time would likely hurt performance more than your polling service ever could. SQL Server isn't made to do a push really (not easily anyway). There are things that you could use to push data out (replication service, log shipping - icky stuff), but they would be more complex and require more resources than your simple polling service. Some options would be:
some kind of trigger which runs your executable using command-line calls (sp_cmdshell)
using a COM object which SQL Server could open and run
using a SQL Agent job to run a VBScript (which would again be considered "polling")
These options are a bit ridiculous considering what you have already done is much simpler.
If you are worried about the polling service using too many cycles or something - you can always throttle it back - polling every minute, every 10 minutes, or even just once a day might be more appropriate - this would be a business decision, so go ask someone in the business how fast it needs to be.
Simple polling services are fairly common, because they are, well... simple. In addition they are also low overhead, remotely stable, and error-tolerant. The down side is that they can hammer the database into dust if not carefully controlled.

A message queue might work well, as they're usually setup to be able to block for a while without wasting resources. But with MySQL, I don't think that's an option.
If you just want to reduce load on the DB, you could create a table with a single row: the latest job ID. Then clients just need to compare that to their last ID to see if they need to run a full poll against the real table. This way the overhead should be greatly reduced, if it's an issue now.

Unlike Postgres and SQL Server (or object stores like CouchDb), MySQL does not emit database events. However there are some coding patterns you can use to simulate this.
If you have one or more tables that you wish to monitor, you can create triggers on these tables that add a row to a "changes" table that records a queue of events to process. Your triggers filter the subset of data changes that you care about and create records in your changes table for each event you wish to perform. Because this pattern queues and persists events it works well even when the workers that process these events have outages.
You might think that MyISAM is the best choice for the changes table since it's mostly performing writes (or even MEMORY if you don't need to persist the events between database server outages). However, keep in mind that both Memory and MEMORY and MyISAM have only full-table locks so your trigger on an InnoDB table might hit a bottle neck when performing an insert into a MEMORY and MyISAM table. You may also require InnoDB for the changes table if you're using a ON DELETE CASCADE with another InnoDB table (requires both tables to use the same engine).
You might also use SHOW TABLE STATUS to check the last update time of you changes table to check if there's something to perform. This feature wont work for InnoDB tables.
These articles describes in more depth some of alternative ways to implement queues in MySQL and even avoid polling!
How to notify event listeners in MySQL
How to implement a queue in SQL
5 subtle ways you're using MySQL as a queue, and why it'll bite you

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string