How does CouchDB compaction affect /db/_changes? - couchdb

I have an application that pulls documents from CouchDB, from the first doc to the latest one, batch by batch.
I tried compacting my database, which shrank it from 1.7GB to 1.0GB, and /db/_changes seems the same.
Can anyone please clarify whether CouchDB compaction affects /db/_changes?

All compaction does is remove the bodies of old document revisions from a given database. The changes feed deals exclusively with write operations, which are unaffected by compaction, since those writes have already happened.
Now, it should be noted that the changes feed gives you the rev numbers as well. Upon compaction, the bodies of all but the most recent rev are deleted, so any older rev you recorded from the changes feed becomes a "dead" link, so to speak.
See the docs for more information about compaction.
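For the batch-by-batch pull described in the question, a minimal sketch of paging through the changes feed might look like the following. It only builds the URL and reads the resume point from a response. The host, database name and batch size are assumptions for illustration, and the sample response contains made-up values.

```python
from urllib.parse import urlencode


def changes_url(base_url, db, since="0", limit=100):
    """Build a /db/_changes URL for one batch, starting after `since`."""
    query = urlencode({"since": since, "limit": limit})
    return f"{base_url}/{db}/_changes?{query}"


def next_since(response):
    """Return the sequence to resume from, or None when the batch was empty."""
    if not response["results"]:
        return None
    return response["last_seq"]


# Shape of a _changes response; the ids, revs and seqs here are made up.
sample = {
    "results": [
        {"seq": 1, "id": "doc-a", "changes": [{"rev": "1-abc"}]},
        {"seq": 2, "id": "doc-b", "changes": [{"rev": "3-def"}]},
    ],
    "last_seq": 2,
}

print(changes_url("http://localhost:5984", "mydb", since=next_since(sample)))
```

Because the feed is keyed by sequence rather than by file layout, the same loop should keep working before and after a compaction.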

Related

CouchDB taking lot of space due to revisions

We have a project that involves syncing a database with PouchDB on mobile devices. We ran into an issue when updating many documents (8400 docs per minute): internal storage grows rapidly (around 20MB per minute).
We figured that one main reason for this is CouchDB revisions, so we decided to decrease the database rev_limit to around 5. But we heard this may affect the replication process between CouchDB and PouchDB. My first question is:
how does decreasing the revision limit affect the replication process?
We also found that views take more space than normal document storage. My second question: is there any way to reduce CouchDB view size?
Your data model (fast updates) doesn't play to CouchDB's strengths. Even after compaction, old revisions (including tombstones) take up space. CouchDB is happiest when using small, immutable documents. Such a model is also less likely to suffer from update conflicts.
Look to your documents - can they be broken apart such that updates can be changed to new document writes? Typical indicators are nested objects or arrays that grow in documents over time.
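As a rough illustration of that suggestion, the sketch below turns each element of a growing array into its own small document, so that a new value becomes a new write instead of another revision of the parent. The field names (`readings`, `parent`, `type`, `value`) and the id scheme are hypothetical, not something from the question.

```python
import uuid


def split_readings(doc, array_field="readings"):
    """Turn each element of a growing array into its own small document."""
    parent_id = doc["_id"]
    return [
        {
            "_id": f"{parent_id}:{uuid.uuid4().hex}",  # hypothetical id scheme
            "type": "reading",
            "parent": parent_id,
            "value": item,
        }
        for item in doc.get(array_field, [])
    ]


big_doc = {"_id": "sensor-1", "readings": [20.5, 20.7, 21.0]}
for small_doc in split_readings(big_doc):
    print(small_doc)
```

Each small document is written once and never updated, so it does not accumulate revisions of its own.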

Best practices for ArangoDB compaction on-demand for file space reclamation

Part of my evaluation of ArangoDB involves importing a few CSV files of over 1M rows into a staging area, then deleting the resulting collections or databases. I will need to do this repeatedly for the production processes I envision.
I understand that the ArangoDB service invokes compaction periodically per this page:
https://docs.arangodb.com/3.3/Manual/Administration/Configuration/Compaction.html
After deleting a database, I waited over 24 hours and no disk space has been reclaimed, so I'm not sure this automated process is working.
I'd like answers to these questions:
What are the default values for the automatic compaction parameters shown in the link above?
Other than observing a change in file space, how do I know that a compaction worked? Is there a log file or other place that would indicate this?
How can I execute a compaction on-demand? All the references I found that discussed such a feature indicated that it was not possible, but they were from several years ago and I'm hoping this feature has been added.
Thanks!
The GET route /_api/collection/{collection-name}/figures contains a sub-attribute compactionStatus in the attribute figures with time and message of the last compaction for debugging purposes. There is also some other information in the response that you might be interested in. See if doCompact is set to true at all.
https://docs.arangodb.com/3.3/HTTP/Collection/Getting.html#return-statistics-for-a-collection
You can run arangod --help-compaction to see the startup options for compaction including the default values. This information is also available online in the 3.4 docs:
https://docs.arangodb.com/3.4/Manual/Programs/Arangod/Options.html#compaction-options
The PUT route /_api/collection/{collection-name}/rotate, quoting the documentation directly:
Rotates the journal of a collection. The current journal of the collection will be closed and made a read-only datafile. The purpose of the rotate method is to make the data in the file available for compaction (compaction is only performed for read-only datafiles, and not for journals).
Saving new data in the collection subsequently will create a new journal file automatically if there is no current journal.
https://docs.arangodb.com/3.3/HTTP/Collection/Modifying.html#rotate-journal-of-a-collection
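A minimal sketch of how the two routes mentioned above could be called from a script. It only builds the requests and reads `compactionStatus` out of a figures response. The base URL and collection name are assumptions, authentication is left out, and the sample response values are made up.

```python
from urllib.request import Request


def rotate_request(base_url, collection):
    """PUT /_api/collection/{collection-name}/rotate (closes the current journal)."""
    return Request(f"{base_url}/_api/collection/{collection}/rotate", method="PUT")


def figures_request(base_url, collection):
    """GET /_api/collection/{collection-name}/figures (collection statistics)."""
    return Request(f"{base_url}/_api/collection/{collection}/figures", method="GET")


def compaction_status(figures_response):
    """Pull the compactionStatus sub-attribute out of a parsed figures response."""
    return figures_response.get("figures", {}).get("compactionStatus")


# Assumed server address and collection name, for illustration only.
req = rotate_request("http://localhost:8529", "staging")
print(req.get_method(), req.full_url)

# Made-up response body showing where the status is expected to sit.
sample = {"figures": {"compactionStatus": {"message": "example message", "time": "example time"}}}
print(compaction_status(sample))
```

Passing either request to `urllib.request.urlopen` would actually send it; that step is omitted here since it needs a running server and credentials.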

Couchdb View Compaction

I understand that compaction of a db removes old revisions beyond the limit set in the config. The result is decreased disk usage, with little to no effect on view speed, because old revisions aren't part of the view index.
I recognize that view compaction is different from view cleanup, which removes unused view index files to save space.
However, what happens with a view compaction? I haven't been able to find much documentation on this, just that it is necessary. Does it operate similarly to db compaction in that it removes old revisions from design docs? If so, I don't think there is much of a benefit as design docs are usually small and few.
Views are structured similarly to databases, so when you make changes to documents there will be old revisions in your view index until you do a compaction, just like a database. The documentation doesn't state this explicitly, but it's implied by the "Views are also need compaction like databases" statement.
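For reference, view compaction and view cleanup are triggered through separate endpoints. The sketch below only constructs the two POST requests. The host, database name and design document name are assumptions for illustration.

```python
from urllib.request import Request

# CouchDB expects a JSON content type on these POST endpoints.
JSON_HEADERS = {"Content-Type": "application/json"}


def view_compact_request(base_url, db, design_doc):
    """POST /db/_compact/design-doc: compact the view index of one design doc."""
    url = f"{base_url}/{db}/_compact/{design_doc}"
    return Request(url, data=b"", headers=JSON_HEADERS, method="POST")


def view_cleanup_request(base_url, db):
    """POST /db/_view_cleanup: remove index files no longer used by any design doc."""
    url = f"{base_url}/{db}/_view_cleanup"
    return Request(url, data=b"", headers=JSON_HEADERS, method="POST")


req = view_compact_request("http://localhost:5984", "mydb", "reports")
print(req.get_method(), req.full_url)
```

Note that the design document name is given without the `_design/` prefix in the compaction URL.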

Replicating CouchDB to local couch reduces size - why?

I recently started using Couch for a large app I'm working on.
I have a database with 7907 documents, and wanted to rename the database. I poked around for a bit, but couldn't figure out how to rename it, so I figured I would just replicate it to a local database of the name I wanted.
The first time I tried, the replication failed, I believe the error was a timeout. I tried again, and it worked very quickly, which was a little disconcerting.
After the replication, I'm seeing that the new database has the correct number of records, but the database size is about 1/3 of the original.
Also a little odd is that if I refresh Futon, the size of the original fluctuates between 94.6 and 95.5 MB.
This leaves me with a few questions:
Is the 2nd database storing references to the first? If so, can I delete the first without causing harm?
Why would the size be so different? Had the original built indexes that the new one eventually will?
Why is the size fluctuating?
edit:
A few things that might be helpful:
This is on a Cloudant CouchDB install.
I checked the first and last record of the new db, and they match, so I don't believe futon is underreporting.
Replicating to a new database is similar to compaction. Both involve certain side-effects (incidentally, and intentionally, respectively) which reduce the size of the new .couch file.
The b-tree indexes get balanced
Data from old document revisions is discarded.
Metadata from previous updates to the DB is discarded.
Replications store to/from checkpoints, so if you re-replicate from the same source, to the same location (i.e. re-run a replication that timed out), it will pick up where it left off.
Answers:
Replication does not create a reference to another database. You can delete the first without causing harm.
Replicating (and compacting) generally reduces disk usage. If you have any views in any design documents, those will re-build when you first query them. View indexes use their own .view file which also consumes space.
I am not sure why the size is fluctuating. Browser and proxy caches are the bane of CouchDB (and web) development. But perhaps it is also a result of internal Cloudant behavior (for example, different nodes in the cluster reporting slightly different sizes).
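The "rename by replicating" approach from the question can be sketched as a single POST to `/_replicate`. The snippet below only builds that request. The host and database names are assumptions, and on a hosted service such as Cloudant the source and target would more likely be full URLs with credentials.

```python
import json
from urllib.request import Request


def replicate_request(base_url, source, target):
    """Build a POST /_replicate request that copies `source` into `target`."""
    body = {"source": source, "target": target, "create_target": True}
    return Request(
        f"{base_url}/_replicate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical database names, for illustration only.
req = replicate_request("http://localhost:5984", "old_name", "new_name")
print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

Re-sending the same request after a timeout should resume from the stored checkpoint, as described above, rather than start over.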

CouchDB Compaction and Doc Deletion - Compaction indifferent?

I'm testing a simple CouchDB database against the theory that CouchDB compaction is totally indifferent to deleted docs.
Deleting a doc from couch via a DELETE method yields the following when trying to retrieve it:
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
Expected.
Now I compact the database:
localhost:5984/enq/_compact
{"ok":true}
And check compaction has finished
"compact_running":false
Now I would expect CouchDB to return not_found, reason "missing" on a simple GET
localhost:5984/enq/deleted-doc-id
{"error":"not_found","reason":"deleted"}
And trying with ?rev=deleted_rev gives me a full doc, yay for worthless data.
So am I correct in thinking that CouchDB compaction gives no special treatment to deleted docs and simply looks at the rev count against the rev limit when deciding what is part of compaction? Is there a special rev_limit we can set for deleted docs?
Surely the only solution can't be a _purge? At the moment we must have thousands of orphaned deleted docs, and whilst we want to maintain some version history for normal docs, we don't want to reduce our rev_limit to 1 to assist in this scenario.
What are the replication issues we should be aware of with purge?
Deleted documents are preserved forever (because it's essential to providing eventual consistency between replicas). So, the behaviour you described is intentional.
To delete a document as efficiently as possible use the DELETE verb, since this stores only _id, _rev and the deleted flag. You can, of course, achieve the same more manually via POST or PUT.
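A small sketch of both routes: a DELETE request carrying the current rev, and the minimal tombstone body you would send yourself via PUT or POST. The host, database name and rev value are assumptions for illustration.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request


def delete_request(base_url, db, doc_id, rev):
    """DELETE /db/doc-id?rev=...: CouchDB stores only a minimal tombstone."""
    url = f"{base_url}/{db}/{quote(doc_id, safe='')}?{urlencode({'rev': rev})}"
    return Request(url, method="DELETE")


def tombstone(doc):
    """Minimal body for a manual delete: drop every field except _id and _rev."""
    return {"_id": doc["_id"], "_rev": doc["_rev"], "_deleted": True}


# Made-up document; only _id and _rev survive in the tombstone.
doc = {"_id": "deleted-doc-id", "_rev": "2-abc", "password": "secret", "notes": "..."}
print(tombstone(doc))
print(delete_request("http://localhost:5984", "enq", doc["_id"], doc["_rev"]).full_url)
```

Sending the full document body with `_deleted: true` added would keep all of those extra fields in the deleted revision, which is what the minimal body avoids.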
Finally, _purge exists only for extreme cases where, for example, you've put an important password into a CouchDB document and need it to be gone from disk. It is not a recommended method for pruning a database: it will typically invalidate any views you have (forcing a full rebuild) and it messes with replication too.
Adding a document, deleting it, and then compacting does not return the CouchDB database to a pristine state. A deleted document is retained through compaction, though in the usual case the resulting document is small (just the _id, _rev and _deleted=true). The reason for this is replication. Imagine the following:
Create document.
Replicate DB to remote DB.
Delete document.
Compact DB.
Replicate DB to remote DB again.
If the document is totally removed after deletion+compaction, then the second replication won't know to tell the remote DB that the document has been deleted. This would result in the two DBs being inconsistent.
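The five steps above can be played out in a toy model. This is not CouchDB's replication protocol, just two dictionaries and a copy loop, written to show why dropping the tombstone leaves the remote copy stale.

```python
def replicate(source, target):
    """Copy every entry (live docs and tombstones) from source to target."""
    for doc_id, doc in source.items():
        target[doc_id] = dict(doc)


def visible(db):
    """Ids of documents that are not marked as deleted."""
    return sorted(doc_id for doc_id, doc in db.items() if not doc.get("_deleted"))


def run(keep_tombstone):
    local, remote = {}, {}
    local["a"] = {"_deleted": False}      # 1. create document
    replicate(local, remote)              # 2. replicate to remote
    if keep_tombstone:
        local["a"] = {"_deleted": True}   # 3+4. delete, tombstone survives compaction
    else:
        del local["a"]                    # 3+4. delete, compaction erases all trace
    replicate(local, remote)              # 5. replicate again
    return visible(remote)


print(run(keep_tombstone=True))   # remote also sees the deletion
print(run(keep_tombstone=False))  # remote still has the document
```

With the tombstone kept, the second replication carries the deletion across. Without it, there is nothing left to send, so the remote keeps serving the document.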
There was an issue reported that could result in the document in the DB not being small; however it did not pertain to the HTTP DELETE method AFAIK (though I could be wrong). The ticket is here:
https://issues.apache.org/jira/browse/COUCHDB-1141
The basic idea is that audit information can be included with the DELETE that will be kept through compaction. Make sure you aren't posting the full doc body with the DELETE method (doing so might explain why the document isn't actually removed).
To clarify... from our experience you have to kick off a DELETE with the id and a compact in order to fully remove the document data.
As pointed out above you will still have the "header data" in your database afterwards.
