Sphinx getting slow after continuously adding and removing items - search

I have a Sphinx search setup with about 5 million items. New items are added and removed continuously, and Sphinx gets slow after a while.
When I run TRUNCATE RTINDEX on the Sphinx index and add every item back, Sphinx is fast again, but after a while it slows down again.
I don't want to truncate every day because it takes about an hour and a half to add the products to Sphinx again.
Does someone know how to optimize Sphinx to fix this problem? Maybe something with caching?
Some extra info:
Virtual memory 1021.73 MB used, 1021.99 MB total
Real memory 10.42 GB used, 31.39 GB total

Real-time indexes are stored on disk in chunks. After a while there are many chunks, which hurts performance. So you need to optimize your index by executing the OPTIMIZE INDEX command: http://sphinxsearch.com/docs/current.html#sphinxql-optimize-index
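A minimal sketch of issuing that command from a script, for example from a nightly cron job. It assumes searchd exposes SphinxQL over the MySQL protocol and that `conn` is any DB-API style connection to it; the helper name `optimize_rt_index` and the index name `products_rt` are made up for illustration.

```python
import re


def optimize_rt_index(conn, index_name):
    """Send OPTIMIZE INDEX for one real-time index over an open SphinxQL connection."""
    # Index names cannot be bound as query parameters, so validate the
    # name before interpolating it into the statement.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", index_name):
        raise ValueError("unexpected index name: %r" % index_name)
    statement = "OPTIMIZE INDEX " + index_name
    cursor = conn.cursor()
    try:
        cursor.execute(statement)
    finally:
        cursor.close()
    return statement


# Hypothetical usage (assumes the third-party pymysql package and a
# SphinxQL listener on 127.0.0.1:9306; adjust to your searchd config):
#   import pymysql
#   conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
#   optimize_rt_index(conn, "products_rt")
#   conn.close()
```

As far as I understand the linked documentation, the merge itself runs in the background, so the statement returning does not mean the chunks are already merged.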

Related

Query for RAMDirectory for 6 GB Index File

I am writing a Spark job whose use case is millions of documents each running a search against a Lucene index to fetch details.
The index is around 6 GB and contains around 40 million docs. I can't use Elasticsearch because batch search becomes a bottleneck. I used an HDFS Directory for search and the performance is bad.
I was thinking of creating a RAMDirectory index searcher per executor.
I see the following warning in the code:
Warning: This class is not intended to work with huge indexes. Everything beyond several hundred megabytes will waste resources (GC cycles), because it uses an internal buffer size of 1024 bytes, producing millions of byte[1024] arrays. This class is optimized for small memory-resident indexes. It also has bad concurrency on multithreaded environments.
My search is not simple (it uses More Like This and other term queries) and cannot be mimicked by Spark SQL.
My question is: has anybody used RAMDirectory with an index of several GB? My index will be read-only, so maybe I won't hit this issue.
If not, how can I solve this?

Forcing MongoDB to prefetch memory

I'm currently running MongoDB 2.6.7 on CentOS Linux release 7.0.1406 (Core).
Soon after the server starts, as well as (sometimes) after a period of inactivity (at least a few minutes) all queries take longer than usual. After a while, their speeds increase and stabilize around a much shorter duration (for a particular (more complex) query, the difference is from 30 seconds initially, to about 7 after the "warm-up" period).
After monitoring my VM using both top and various network traffic monitoring tools, I've noticed that the bottleneck is hit because of hard page faults (which had been my hunch from the beginning).
Given that my data takes up less than 2 GB and that my machine has 3.5 GB available, my collections should all fit in memory (even with their indexes). And they actually do end up being fetched, but only on demand, which can have a noticeably negative impact on the user experience.
MongoDB uses memory-mapped files to operate on collections. Is there any way to force the operating system to prefetch the whole file into memory as soon as MongoDB starts up, instead of waiting for queries to trigger random page faults?
From the MongoDB docs:
The touch command loads data from the data storage layer into memory. touch can load the data (i.e. documents) indexes or both documents and indexes. Use this command to ensure that a collection, and/or its indexes, are in memory before another operation. By loading the collection or indexes into memory, mongod will ideally be able to perform subsequent operations more efficiently.
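A minimal sketch of calling touch right after startup from Python. It assumes `db` is a pymongo `Database` object (or anything with the same `command` method); the helper name and the database and collection names in the usage comment are hypothetical.

```python
def touch_collection(db, collection_name, data=True, index=True):
    """Ask mongod to load a collection's documents and/or indexes into memory."""
    # Equivalent to the shell call:
    #   db.runCommand({ touch: "<collection>", data: true, index: true })
    return db.command("touch", collection_name, data=data, index=index)


# Hypothetical usage (assumes the third-party pymongo package):
#   from pymongo import MongoClient
#   client = MongoClient("localhost", 27017)
#   for name in ("orders", "users"):
#       touch_collection(client["mydb"], name)
```

The command is tied to the memory-mapped storage described in the question, so this applies to releases like the 2.6 mentioned there; as far as I know, later MongoDB versions no longer provide it.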

Elasticsearch bad indexing time

I am trying to migrate (copy) 35 million documents (a standard amount, not too big) from Couchbase to Elasticsearch.
My Elasticsearch (version 1.3) cluster is composed of 3 A3 (4 cores, 7 GB memory) CentOS servers on Microsoft Azure (each server is equivalent to a large server on Amazon).
I used "timing data flow" indexing to store the documents. Each index represents a month and is composed of 3 shards and 2 replicas.
When I start the migration script, insertion becomes very slow (about 10 documents per second) and the load average of each server in the cluster jumps above 1.5.
In addition, JVM memory usage climbs to almost 100% while the CPU shows 20% and IOPS shows 20 at most.
(I used Marvel CNC to get all this data.)
Has anyone faced this kind of indexing problem in Elasticsearch?
I would like to know if there are any parameters I should be aware of to extend Java memory.
Are my cluster specifications good enough to handle 100 indexing operations per second?
Does the indexing time depend on how big the index is? And should it be that slow?
Thanks, Niv
I am quoting an answer I got in the Google group:
A couple of suggestions:
1) Disable replicas before large amounts of inserts (set replica count to 0), and only enable them again afterwards.
2) Use batching; the actual batch size depends on many factors (doc sizes, network, instance strength).
3) Follow ES's advice on node setup, e.g. allocate 50% of the available memory to the Java heap of ES, don't run anything else on that machine, and disable swappiness.
4) Your index is already sharded; try spreading it out to 3 different servers instead of having them on one server ("virtual shards"). This will help fan out the indexing load.
5) If you don't specify the document IDs yourself, make sure you use the latest ES; there's a significant improvement there in the ID generation mechanism which could help speed things up.
I applied points 1 & 3 and it seems the problem is solved :)
Now I am indexing at a rate of 80 docs per second and the load average is low (0.7 at most).
I have to give credit to Itamar Syn-Hershko, who posted this reply.
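The first two suggestions quoted above (drop replicas during the load, and batch the inserts) could be sketched roughly as follows. This only builds the request bodies for the index settings and bulk endpoints; how they are sent (HTTP client, retries) is left out. The helper names, the index name, and the document type are assumptions for illustration, and the batch size is something to tune by measurement.

```python
import json


def replica_settings_body(replicas):
    """JSON body for PUT /<index>/_settings that changes the replica count."""
    return json.dumps({"index": {"number_of_replicas": replicas}})


def chunked(items, size):
    """Yield consecutive lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def bulk_body(index, doc_type, docs):
    """Build the newline-delimited body for POST /_bulk from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(source))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical flow for one monthly index:
#   1. PUT /events-2014-09/_settings with replica_settings_body(0)
#   2. for batch in chunked(all_docs, 1000): POST /_bulk with bulk_body(...)
#   3. PUT /events-2014-09/_settings with replica_settings_body(2)
```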

cassandra read performance for large number of keys

Here is the situation:
I am trying to fetch around 10k keys from a CF.
Size of cluster: 10 nodes
Data on node: 250 GB
Heap allotted: 12 GB
Snitch used: property snitch with 2 racks in the same data center.
No. of sstables for the CF per node: around 8 to 10
I am using the supercolumn approach. Each row contains around 300 supercolumns, which in turn contain 5-10 columns. I am firing a multiget with 10k row keys and 1 supercolumn.
When I fire the call the first time it takes around 30 to 50 secs to return the result. After that Cassandra serves the data from the key cache and returns the result in 2-4 secs.
So Cassandra read performance is hampering our project. I am using phpcassa. Is there any way I can tweak the Cassandra servers so that I get results faster?
Does the super column approach affect read performance?
Use of super columns is best suited for use cases where the number of sub-columns is a relatively small number. Read more here:
http://www.datastax.com/docs/0.8/ddl/column_family
Just in case you haven't done this already, since you're using phpcassa library, make sure that you've compiled the Thrift library. Per the "INSTALLING" text file in the phpcassa library folder:
Using the C Extension
The C extension is crucial for phpcassa's performance.
You need to configure and make to be able to use the C extension.
cd thrift/ext/thrift_protocol
phpize
./configure
make
sudo make install
Add the following line to your php.ini file:
extension=thrift_protocol.so
After doing a lot of R&D on this, we figured there is no way to get this working optimally.
When Cassandra fetches these 10k rows for the first time it is going to take time, and there is no way to optimize this.
1) However, in practice the probability of people accessing the same records is high, so we take maximum advantage of the key cache. The default setting for the key cache is 2 MB, so we can afford to increase it to 128 MB with no memory problems.
After data loading, run the expected queries to warm up the key cache.
2) The JVM works optimally at 8-10 GB (I don't have numbers to prove it, just observation).
3) Most important: if you are using physical machines (not cloud or virtual machines), check which disk scheduler you are using. Set it to NOOP, which is good for Cassandra as it reads all keys from one section, reducing disk head movement.
The above changes helped bring the time required for querying down to within acceptable limits.
Along with the above changes, if you have CFs which are small in size but frequently accessed, enable row caching for them.
Hope the above info is useful.
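For the disk scheduler check in point 3, a small sketch of reading the active scheduler on Linux. It assumes the usual sysfs layout, where /sys/block/<device>/queue/scheduler lists the available schedulers and marks the active one in square brackets; the device name in the usage comment is hypothetical.

```python
def active_scheduler(sysfs_text):
    """Return the scheduler name shown in square brackets, or None if none is marked."""
    for token in sysfs_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_scheduler(device):
    """Read and parse the scheduler file for a block device such as 'sda'."""
    path = "/sys/block/%s/queue/scheduler" % device
    with open(path) as handle:
        return active_scheduler(handle.read())


# Hypothetical usage on a Cassandra data disk:
#   if read_scheduler("sda") != "noop":
#       print("consider switching the scheduler to noop")
```

Changing the value is typically done as root by writing the scheduler name into the same file, and the set of available names depends on the kernel version.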

Solr for constantly updating index

I have a news site with 150,000 news articles. About 250 new articles are added daily to the database at an interval of 5-15 minutes. I understand that Solr is optimized for millions of records and my 150K won't be a problem for it. But I am worried the frequent updates will be a problem, since the cache gets invalidated with every update. On my dev server, a cold load of a page takes 5-7 seconds (since every page runs a few MLT queries).
Would it help if I split my index into two: an archive index and a latest index? The archive index would be updated once every day.
Can anyone suggest ways to optimize my installation for a constantly updating index?
Thanks
My answer is: test it! Don't try to optimize yet if you don't know how it performs. Like you said, 150K is not a lot, so it should be quick to build an index of that size for your tests. After that, run a couple of MLT queries from different concurrent threads (to simulate users) while you index more documents to see how it behaves.
One setting that you should keep an eye on is auto-commit. Since you are indexing constantly, you can't commit at each document (you will bring Solr down). The value you choose for this setting lets you tune the latency of the system (how long it takes for new documents to be returned in results) while keeping the system responsive.
Consider using mlt=true in the main query instead of issuing per-result MoreLikeThis queries. You'll save the roundtrips and so it will be faster.
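As a rough illustration of that last suggestion, the sketch below builds a single select URL that asks Solr to attach MoreLikeThis results to each hit. The `mlt`, `mlt.fl` and `mlt.count` parameters are the standard MoreLikeThis component parameters; the helper name, core name, and field names are assumptions.

```python
from urllib.parse import urlencode


def select_with_mlt(base_url, query, mlt_fields, mlt_count=5, rows=10):
    """Build a /select URL that returns similar documents alongside each result."""
    params = [
        ("q", query),
        ("rows", rows),
        ("wt", "json"),
        ("mlt", "true"),                    # enable the MoreLikeThis component
        ("mlt.fl", ",".join(mlt_fields)),   # fields used to compute similarity
        ("mlt.count", mlt_count),           # similar docs returned per result
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical usage against a core named "news":
#   url = select_with_mlt("http://localhost:8983/solr/news",
#                         "category:sports", ["title", "body"], mlt_count=3)
```

One request then replaces the main query plus one MLT query per displayed article, which is where the saved roundtrips come from.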
