Difference between a flush request and emptying the cache for Elasticsearch

What is the difference between issuing a flush request and emptying the cache for Elasticsearch? Does a restart of Elasticsearch achieve either of these?

If you mean the difference between the flush and clear cache APIs, it is pretty big.
Flush issues a Lucene commit and empties the Elasticsearch transaction log. As a result it gives durability at the Lucene index level (that's why the translog can be emptied). Flush is called automatically under the hood at regular intervals that adapt to how many documents you index, how big they are, and when the last flush was. You don't normally call flush unless you are doing maintenance on the indices.
Clear cache empties the Elasticsearch caches that are used to make search faster, for instance when executing the same filters or the same facets repeatedly. There are different types of caches, but at this time they are all stored in memory (Java heap).
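If it helps, here is a minimal sketch of calling both APIs over HTTP with Python's requests library; the node address and the index name my-index are assumptions, and the exact parameters may vary slightly between Elasticsearch versions:

import requests

ES = "http://localhost:9200"   # assumed local node
INDEX = "my-index"             # hypothetical index name

# Flush: issues a Lucene commit and empties the transaction log.
print(requests.post(f"{ES}/{INDEX}/_flush").json())

# Clear cache: drops the in-memory caches used to speed up search
# (query/filter cache, fielddata, etc.); nothing on disk is touched.
print(requests.post(f"{ES}/{INDEX}/_cache/clear").json())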

Related

How do I force a memtable flush for every write?

I would like to flush the memtable to disk after every update/write operation (or in any case, as frequently as possible). My sole purpose is to stress test the underlying disk using production-level database software.
It seems like memtable_cleanup_threshold is the way to go, but it's deprecated. Is there another way to accomplish this? How about memtable_heap_space_in_mb and memtable_offheap_space_in_mb? I'm no Java programmer; which one should I tune without compromising the rest of the functionality?
You can definitely try setting both memtable_heap_space_in_mb and memtable_offheap_space_in_mb to a really low value.
Additionally, you can also configure commitlog_total_space_in_mb. If the occupied space goes above this threshold, it will cause more frequent flushes.
But since your goal is to stress-test the disk, my suggestion is to do the following:
Configure both data_file_directories and commitlog_directory to be mounted on the same disk.
Use NoSQLBench to stress test with heavy writes.
This way, you don't have to muck around with the memtable settings. Have a look at the NoSQLBench Beginner's Guide blog post for details. Cheers!
You could also just trigger a flush on the table by issuing nodetool flush, or by running the corresponding JMX operation, after each write. However, Cassandra stores data distributed over many nodes, and a flush is always a node-bound operation. To find out which nodes you need to flush, you would have to query the list of endpoints the written data is stored on (also available via JMX or with nodetool); otherwise you would need to flush all nodes.
While this is fine for testing purposes, I would not recommend it for production.
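To illustrate the nodetool approach, here is a rough sketch using the DataStax Python driver plus nodetool on a single local node; the keyspace and table names are hypothetical, and on a real multi-node cluster you would have to flush every replica as described above:

import subprocess

from cassandra.cluster import Cluster  # DataStax Python driver

# Hypothetical keyspace/table; adjust to your schema.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("stress_ks")

insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

for i in range(1000):
    session.execute(insert, (i, "x" * 1024))
    # Force the memtable to disk after every write. nodetool only talks to
    # the local node, so on a multi-node cluster you would have to flush
    # every replica (or simply all nodes).
    subprocess.run(["nodetool", "flush", "stress_ks", "events"], check=True)

cluster.shutdown()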

Hazelcast load data from RDBMS in client-server topology

I am using a client-server topology for the Hazelcast cache. I have multiple maps which I load eagerly using MapLoaders. When there is a cache miss, the MapLoader's load(key) method is called. The MapLoader.load(key) method seems to be executed by the partition thread, which means that all other operations on the partition are blocked until loading is done. A very common use case for the MapLoader is to load data from a DB, which arguably can take some time. So what is the best possible approach to take so that other operations on the partition are not blocked while the load is taking place? Is there any other way to load missing data at runtime? (Hazelcast version: 4.0.3)
There's a good answer to this question that gives a few options.
MapLoader.load(key) only loads a single entry, but if the remote source is really slow or there are lots of cache misses, the delays are going to mount up.
Another alternative to mike-yawn's answer would be to have a Runnable that fetches needed items from the database and writes them directly into the map (see the sketch below). You can still have the MapLoader.load(key) as well, but the chances of a cache miss are reduced if your fetcher code is good at predicting which entries will be needed.
If you don't cache 100% of the records, then a cache miss is inevitable. If it's prohibitively slow, you could always return a value that carries some sort of flag marking it as a placeholder, and launch a thread to do the actual load. Then your code has to deal with that placeholder and try again later, noting that when it tries later the eventual result of the database query could be that no record was found.
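As a rough illustration of the "fetcher that writes directly into the map" idea, here is a sketch with the Hazelcast Python client; the map name orders and fetch_hot_rows() are hypothetical, and on a member/JVM you would do the same thing with a Runnable and an ExecutorService:

import threading
import time

import hazelcast  # hazelcast-python-client


def fetch_hot_rows():
    # Hypothetical DB call: return {key: value} for the entries most likely
    # to be requested soon, based on whatever heuristic fits your data.
    return {}


def prefetch_loop(client, stop_event):
    cache = client.get_map("orders").blocking()   # assumed map name
    while not stop_event.is_set():
        for key, value in fetch_hot_rows().items():
            # Warms the map so MapLoader.load() is never hit for these keys.
            cache.put(key, value)
        time.sleep(30)                             # arbitrary refresh interval


client = hazelcast.HazelcastClient()               # assumed default cluster config
stop = threading.Event()
threading.Thread(target=prefetch_loop, args=(client, stop), daemon=True).start()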

Massive inserts kill arangod (well, almost)

I was wondering if anyone has ever encountered this:
When inserting documents via AQL, I can easily kill my arango server. For example
FOR i IN 1 .. 10
  FOR u IN users
    INSERT {
      _from: u._id,
      _to: CONCAT("posts/", CEIL(RAND() * 2000)),
      displayDate: CEIL(RAND() * 100000000)
    } INTO canSee
(where users contains 500,000 entries), the following happens:
canSee becomes completely locked (also no more reads)
memory consumption goes up
arangosh or the web console becomes unresponsive and fails with [ArangoError 2001: Could not connect]
server is still running, accessing collection gives timeouts
it takes around 5-10 minutes until the server recovers and I can access the collection again
access to any other collection works fine
So OK, I'm creating a lot of entries, and AQL might be implemented in a way that it does this in bulk. When doing the writes via the db.save() method it works, but is much slower.
I also suspect this might have to do with the write-ahead log filling up.
But still, is there a way I can fix this? Writing a lot of entries to a database should not necessarily kill it.
Logs say
DEBUG [./lib/GeneralServer/GeneralServerDispatcher.h:411] shutdownHandler called, but no handler is known for task
DEBUG [arangod/VocBase/datafile.cpp:949] created datafile '/usr/local/var/lib/arangodb/journals/logfile-6623368699310.db' of size 33554432 and page-size 4096
DEBUG [arangod/Wal/CollectorThread.cpp:1305] closing full journal '/usr/local/var/lib/arangodb/databases/database-120933/collection-4262707447412/journal-6558669721243.db'
Best
The above query will insert 5M documents into ArangoDB in a single transaction. This will take a while to complete, and while the transaction is still ongoing, it will hold lots of (potentially needed) rollback data in memory.
Additionally, the above query will first build up all the documents to insert in memory, and once that's done, will start inserting them. Building all the documents will also consume a lot of memory. When executing this query, you will see the memory usage steadily increasing until at some point the disk writes will kick in when the actual inserts start.
There are at least two ways to improve this:
It might be beneficial to split the query into multiple, smaller transactions (see the sketch after this list). Each transaction then won't be as big as the original one, and will not block as many system resources while it is ongoing.
For the query above, it technically isn't necessary to build up all the documents to insert in memory first and only insert them afterwards. Instead, documents read from users could be inserted into canSee as they arrive. This won't speed up the query, but it will significantly lower memory consumption during query execution for result sets as big as the above. It will also lead to the writes starting immediately, and thus to write-ahead log collection starting earlier. Not all queries are eligible for this optimization, but some (including the above) are. I worked on a mechanism today that detects eligible queries and executes them this way. The change was pushed into the devel branch today, and will be available with ArangoDB 2.5.
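To illustrate the first suggestion, here is a sketch using python-arango that runs the insert as ten separate AQL queries, each in its own transaction; the connection details are assumptions, the collection names are taken from the question:

from arango import ArangoClient  # python-arango

# Assumed connection details; adjust to your setup.
client = ArangoClient(hosts="http://localhost:8529")
db = client.db("_system", username="root", password="")

# One INSERT query per outer-loop iteration: ten smaller transactions
# instead of a single 5M-document transaction.
aql = """
FOR u IN users
  INSERT {
    _from: u._id,
    _to: CONCAT("posts/", CEIL(RAND() * 2000)),
    displayDate: CEIL(RAND() * 100000000)
  } INTO canSee
"""

for _ in range(10):
    db.aql.execute(aql)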

CouchDB .view file growing out of control?

I recently encountered a situation where my CouchDB instance used all available disk space on a 20GB VM instance.
Upon investigation I discovered that a directory in /usr/local/var/lib/couchdb/ contained a bunch of .view files, the largest of which was 16GB. I was able to remove the *.view files to restore normal operation. I'm not sure why the .view files grew so large and how CouchDB manages .view files.
A bit more information. I have a VM running Ubuntu 9.10 (karmic) with 512MB and CouchDB 0.10. The VM has a cron job which invokes a Python script which queries a view. The cron job runs once every five minutes. Every time the view is queried the size of a .view file increases. I've written a job to monitor this on an hourly basis and after a few days I don't see the file rolling over or otherwise decreasing in size.
Does anyone have any insights into this issue? Is there a piece of documentation I've missed? I haven't been able to find anything on the subject but that may be due to looking in the wrong places or my search terms.
CouchDB is very disk hungry, trading disk space for performance. Views will increase in size as items are added to them. You can recover disk space that is no longer needed with cleanup and compaction.
Every time you create, update, or delete a document, the view indexes will be updated with the relevant changes to the documents. The update to the view happens when it is queried. So if you are making lots of document changes, you should expect your index to grow, and it will need to be managed with compaction and cleanup.
If your views are very large for a given set of documents, then you may have poorly designed views. Alternatively, your design may just require large views, and you will need to manage that as you would any other resource.
It would be easier to tell what is happening if you could describe what document updates (including creates and deletes) are happening and what your view functions are emitting, especially for the large view.
Your .view files grow each time you access a view because CouchDB updates views on access. CouchDB views need compaction, just like databases. If you have frequent changes to your documents, resulting in changes to your views, you should run view compaction from time to time. See http://wiki.apache.org/couchdb/HTTP_view_API#View_Compaction
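For reference, a small sketch of triggering view compaction (and view cleanup) over HTTP with Python's requests; the database and design document names are hypothetical:

import requests

COUCH = "http://127.0.0.1:5984"    # assumed local instance
DB = "yourdb"                      # hypothetical database name
DDOC = "reports"                   # hypothetical design doc (without "_design/")

headers = {"Content-Type": "application/json"}

# Compact the view index built from the given design document.
requests.post(f"{COUCH}/{DB}/_compact/{DDOC}", headers=headers)

# Drop index files whose view signatures are no longer used
# (e.g. after a design document changed).
requests.post(f"{COUCH}/{DB}/_view_cleanup", headers=headers)

# Compact the database itself while you are at it.
requests.post(f"{COUCH}/{DB}/_compact", headers=headers)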
To reduce the size of your views, have a look at the data you are emitting. When you emit(foo, doc), the entire document is copied to the view so it is instantly available when you query the view. A map function like function(doc) { emit(doc.title, doc); } will result in a view as big as the database itself. You could instead emit(doc.title, null); and use the include_docs option to let CouchDB fetch the document from the database when you access the view (which will result in a slight performance penalty). See http://wiki.apache.org/couchdb/HTTP_view_API#Querying_Options
Use sequential or monotonic IDs for documents instead of random ones
Yes, CouchDB is very disk hungry and needs regular compactions. But there is another thing that can help reduce this disk usage, especially when it's unnecessary in the first place.
CouchDB uses B+ trees for storing data/documents, which is a very good data structure for data-retrieval performance. However, the B+ tree trades disk space for that performance. With completely random IDs, the B+ tree fans out quickly: since the minimum fill rate is 1/2 for every internal node, the nodes are mostly only about half full (as the data spreads evenly due to its randomness), generating more internal nodes. New insertions can also cause a rewrite of the full tree. That's what randomness can cause ;)
Using sequential or monotonic IDs instead avoids all of this (see the sketch below).
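As a sketch of what monotonic IDs could look like in practice (the database URL and the ID scheme here are just an example, not the only way to do it):

import itertools
import time

import requests

COUCH = "http://127.0.0.1:5984/yourdb"   # hypothetical database URL
_counter = itertools.count()


def next_doc_id():
    # Millisecond timestamp plus a counter: IDs are (roughly) monotonically
    # increasing, so new documents append at the right edge of the B+ tree
    # instead of splitting internal nodes all over the place.
    return f"{int(time.time() * 1000):013d}-{next(_counter):06d}"


resp = requests.put(f"{COUCH}/{next_doc_id()}", json={"title": "hello"})
print(resp.json())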
I've had this problem too, trying out CouchDB for a browsed-based game.
We had about 100,000 unexpected visitors on the first day of a site launch, and within 2 days the CouchDB database was taking up about 40GB of space. This made the server crash because the disk was completely full.
Compaction brought that back to about 50MB. I also set the _revs_limit (which defaults to 1000) to 10, since we didn't care about revision history, and it's been running perfectly since. After almost 1M users, the database size is usually about 2-3GB. When I run compaction it's about 500MB.
Setting document revision limit to 10:
curl -X PUT -d "10" http://dbuser:dbpassword@127.0.0.1:5984/yourdb/_revs_limit
Or without user:password (not recommended):
curl -X PUT -d "10" http://127.0.0.1:5984/yourdb/_revs_limit

Limit the number of revisions in Couchdb

Is there a way to limit the number of revisions in couchdb? Something along the lines of a hard limit in a config file. I am aware of the fact that I could periodically compact the database, but somehow it feels like a hack. Is there a better way?
There's no configurable limit, primarily because CouchDB uses append-only storage, i.e. it promises to only ever write to the end of a file and never change anything in the middle. As a result a configurable limit is meaningless.
Compaction is your only option. There has been some talk about automatically triggered compaction on the mailing lists but it can only be triggered manually for now.
It is possible on a per database basis through this HTTP API:
PUT /{db}/_revs_limit
Sets the maximum number of document revisions that will be tracked by CouchDB, even after compaction has occurred. You set the revision limit on a database by sending a scalar integer (the limit you want) as the request body.
See https://docs.couchdb.org/en/stable/api/database/misc.html#db-revs-limit
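The same call as the curl example earlier in this thread, expressed with Python's requests for completeness; the database URL and credentials are placeholders, and you still need compaction to reclaim the space of revisions that were already stored:

import requests

DB = "http://127.0.0.1:5984/yourdb"   # hypothetical database URL
auth = ("dbuser", "dbpassword")       # drop this if the db is not protected

# Track at most 10 revisions per document from now on.
print(requests.put(f"{DB}/_revs_limit", data="10", auth=auth).json())

# Read the current limit back.
print(requests.get(f"{DB}/_revs_limit", auth=auth).text)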
