I'm going to build a search engine on Solr, with Nutch as the crawler. I have to index about 13 million documents.
I have 3 servers for this job:
4-core Xeon 3 GHz, 20 GB RAM, 1.5 TB SATA
2×4-core Xeon 3 GHz, 16 GB RAM, 500 GB IDE
2×4-core Xeon 3 GHz, 16 GB RAM, 500 GB IDE
I can use one of the servers as a master for crawling and indexing and the other two as slaves for searching, or I can use one for searching and the other two for indexing with two shards.
What architecture can you recommend? Should I use sharding, how many shards, and which of the servers should I use for what?
I would try both. Read up on what the HathiTrust has done. I would start out with a single master and two slaves; that is the simplest approach. And if you only have 13 million documents, I am guessing the load will be on the indexing/crawling side. But 13 million documents is only ~300 pages a minute over a month, so I think your Nutch crawler will be the bottleneck.
I'd tend towards using two servers for search and one for indexing.
As a general rule you want to keep search as fast as possible, at the expense of indexing performance. Also, two search servers gives you some natural redundancy.
I'd use the third server for searching, too, when it's not actually doing the indexing. (13 million docs isn't a huge index, and indexing it shouldn't take very long compared to how often you reindex it)
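For the master/slave split, Solr's built-in ReplicationHandler can copy the index from the indexing box to the two search boxes. A minimal sketch of checking and triggering replication from a slave, assuming placeholder host names indexer/search1 and core name web (the handler itself is configured in each core's solrconfig.xml):
# Ask the slave for its replication status and details
curl "http://search1:8983/solr/web/replication?command=details"
# Manually pull the latest index generation from the master
curl "http://search1:8983/solr/web/replication?command=fetchindex&masterUrl=http://indexer:8983/solr/web/replication"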
We use ArangoDB to store telco data. The main goal of our application is to let users build certain types of reports very quickly. The reports are mostly based on the data we get from ArangoDB when we traverse different graphs. The business logic of the reports is not simple, which leads to very complex AQL queries with multiple nested traversals (sub-queries).
Quick Overview of the data we store in ArangoDB:
28 collections with documents (the biggest collection consists of 3500K documents; an average collection usually has 100K to 1000K)
3 collections with edges (335K edges, 3500K edges and 15000K edges)
3 graphs (each graph is linked to one edge collection and the biggest graph has 23 from/to collections)
The overall data set takes about 28 GB of RAM when fully loaded (including indexes).
We have been using MMFiles for almost two years now and were very happy with the results, except for some problems:
unprecedented memory consumption which I described here
very slow restart (takes 1 hour 30 minutes before the database is fully responsive again)
the fact that we have to use very expensive VMs with 64 GB of RAM to be able to fit all the data into the RAM
After some research we started to look into a new RocksDB storage engine. I have read:
https://www.arangodb.com/why-arangodb/rocksdb-storage-engine/
https://docs.arangodb.com/3.4/Manual/Architecture/StorageEngines.html
From those documents and from the proposed answers to my question about the problem with RAM consumption, I can see that RocksDB should be the way to go for us. All the documents say it is the new default engine for ArangoDB and that it should be used if you want to store more data than fits into RAM.
I installed the new ArangoDB 3.4.1 and converted our database from MMFiles to RocksDB (via arangodump and arangorestore). Then I ran some performance tests and found that all traversals became 2-6 times slower compared to what we had with the MMFiles engine. Some queries which took 20 seconds with the MMFiles engine now take 40 seconds with RocksDB, even if you run the same query multiple times (i.e. the data must already be cached).
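Roughly, the conversion is a dump from the old server followed by a restore into a server started with the RocksDB storage engine; a hedged sketch with placeholder endpoint, database name and directory:
arangodump --server.endpoint tcp://127.0.0.1:8529 --server.database telco --output-directory /tmp/telco-dump
arangorestore --server.endpoint tcp://127.0.0.1:8529 --server.database telco --create-database true --input-directory /tmp/telco-dump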
Update 2/15/2019:
We run ArangoDB inside a Docker container on an m4.4xlarge instance on AWS with 16 vCPUs and 64 GB of RAM. We allocated 32 GB of RAM and 6144 CPU units to the ArangoDB container. Here is a short summary of our tests (the numbers show the time it took to execute a particular AQL traversal query, in HH:mm:ss format):
Note that in this particular table we do not have the 10x performance degradation I mentioned in my original question. The maximum is 6 times slower, when we run AQL right after a restart of ArangoDB (which I guess is OK). But most of the queries are 2 times slower compared to MMFiles even when you run them a second time, when all the data must already be cached in RAM. The situation is even worse on Windows (that is where I saw performance degradation of 10 times and more). I will post the detailed spec of my Windows PC with the performance tests a bit later.
My question is: Is it expected behavior that AQL traversals are much slower with the RocksDB engine? Are there any general recommendations on when to use the MMFiles engine and when to use RocksDB, and in which cases is RocksDB not an option?
With ArangoDB 3.7, support for MMFiles has been dropped, hence this question can be answered with "use RocksDB".
It took us a while to mature the RocksDB-based storage engine in ArangoDB, but we now feel confident it can fully handle all loads.
We demonstrate how to work with parts of the RocksDB storage system and what effects they have in this article.
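If traversal performance after the switch is dominated by cache misses, the knobs that usually matter are the RocksDB block cache and ArangoDB's in-memory document/edge cache. A hedged sketch of the relevant startup options (the values are purely illustrative; confirm the option names against arangod --help-all for your version):
# ~8 GB RocksDB block cache and ~4 GB ArangoDB in-memory cache; leave room for
# the OS page cache on top of this.
arangod \
  --server.storage-engine rocksdb \
  --rocksdb.block-cache-size 8589934592 \
  --cache.size 4294967296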
I am trying to migrate (copy) 35 million documents (which is a standard amount, not too big) from Couchbase to Elasticsearch.
My Elasticsearch (version 1.3) cluster is composed of 3 A3 (4 cores, 7 GB memory) CentOS servers on Microsoft Azure (each server is roughly equivalent to a large instance on Amazon).
I used "timing data flow" (time-based) indexing to store the documents: each index represents a month and is composed of 3 shards and 2 replicas.
When I start the migration script, I see that insertion becomes very slow (about 10 documents per second) and the load average of each server in the cluster jumps above 1.5.
In addition, JVM memory usage climbs to almost 100% while the CPU shows 20% and IOPS shows 20 at most.
(I used Marvel CNC to get all this data.)
Has anyone faced this kind of indexing problem in Elasticsearch?
I would like to know if there are any parameters I should be aware of to extend the Java memory.
Are my cluster specifications good enough to handle indexing 100 documents per second?
Does the indexing time depend on how big the index is? And should it be that slow?
Thanks, Niv
I am quoting an answer I got in a Google group (link).
A couple of suggestions:
Disable replicas before large amounts of inserts (set the replica count to 0), and only enable them again afterwards.
Use batching; the actual batch size will depend on many factors (doc sizes, network, instance strengths).
Follow ES's advice on node setup, e.g. allocate 50% of the available memory to ES's Java heap, don't run anything else on that machine, and disable swappiness.
Your index is already sharded; try spreading it out to 3 different servers instead of having them all on one server ("virtual shards"). This will help fan out the indexing load.
If you don't specify the document IDs yourself, make sure you use the latest ES; there's a significant improvement in the ID generation mechanism which could help speed things up.
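A minimal sketch of the first two suggestions with curl, assuming an ES 1.x node on localhost:9200 and a placeholder monthly index name docs-2014-08:
# 1. Turn replicas off for the duration of the bulk load
curl -XPUT 'http://localhost:9200/docs-2014-08/_settings' -d '{"index": {"number_of_replicas": 0}}'
# 2. Send documents in batches through the bulk API instead of one request per document
cat <<'EOF' > bulk.json
{"index": {"_type": "doc", "_id": "1"}}
{"title": "first document"}
{"index": {"_type": "doc", "_id": "2"}}
{"title": "second document"}
EOF
curl -XPOST 'http://localhost:9200/docs-2014-08/_bulk' --data-binary @bulk.json
# Re-enable replicas once the migration is done
curl -XPUT 'http://localhost:9200/docs-2014-08/_settings' -d '{"index": {"number_of_replicas": 2}}'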
I applied points 1 & 3 and it seems that the problems are solved :)
Now I am indexing at a rate of 80 docs per second and the load average is low (0.7 at most).
I have to give credit to Itamar Syn-Hershko, who posted this reply.
I'm running multiple websites with separate content and design from the same middleware, and I want to use Solr as a search engine. The sites differ in domain but not in internal structure (meaning the actual database and data structures are identical between the sites).
The question now is: is it better to store that site data in a single Solr index and then separate it by a "site" field, or to use a separate Solr core within a single JVM for each site?
What will provide the best performance (there are no cross-site queries)? What will provide the best recall and precision (I'm worried about loss of precision because of IDF factors - differences in content domains are quite large)?
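For reference, a hedged sketch of the single-index option (core name sites and field names id/site/title are placeholders that would have to exist in the shared schema): every document carries a site field, and each front end adds a filter query for its own domain.
curl "http://localhost:8983/solr/sites/update?commit=true" -H 'Content-Type: text/xml' --data-binary '<add><doc><field name="id">1</field><field name="site">site_a</field><field name="title">hello</field></doc></add>'
curl "http://localhost:8983/solr/sites/select?q=title:hello&fq=site:site_a&wt=json"
Note that fq only restricts the result set; IDF is still computed over the whole shared index, which is exactly the source of the precision worry.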
I assume you are more worried about what happens when your sites grow. IMO, multiple cores seems a better choice.
Single large index: All updates and queries impinge upon a single point. When it starts getting slow, you must make a cluster by sharding or replication to store your large index. And it's a single point of failure. Backing up the index will be tough.
Multiple cores: If one site is growing and dwarfing others, you can easily migrate it to a different server, ensuring that no servers are overloaded. Backing up individual sites will be relatively trivial.
Multiple cores will make your life simpler while your sites are not busy. As your sites grow, you can put off clustering and performance tuning until later.
I would do multiple Solr cores on a single Tomcat.
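A hedged sketch of the multi-core route via Solr's CoreAdmin API, assuming a multi-core Solr on localhost:8983 and per-site instance directories (names site_a/site_b are placeholders) that already contain a conf/ folder:
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=site_a&instanceDir=site_a"
curl "http://localhost:8983/solr/admin/cores?action=CREATE&name=site_b&instanceDir=site_b"
# Each site then queries only its own core, so IDF statistics never mix across sites:
curl "http://localhost:8983/solr/site_a/select?q=foo&wt=json"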
I'm working on a hobby project involving a rather CPU-intensive calculation. The problem is embarrassingly parallel. This calculation will need to happen on a large number of nodes (say 1000-10000). Each node can do its work almost completely independently of the others. However, the entire system will need to answer queries from outside the system. Approximately 100000 such queries per second will have to be answered. To answer the queries, the system needs some state that is sometimes shared between two nodes. The nodes need at most 128MB RAM for their calculations.
Obviously, I'm probably not going to afford to actually build this system in the scale described above, but I'm still interested in the engineering challenge of it, and thought I'd set up a small number of nodes as proof-of-concept.
I was thinking about using something like Cassandra or CouchDB to have scalable persistent state across all nodes. If I ran a distributed database server on each node, each would be very lightly loaded, but it would be very nice from an ops perspective to have all nodes be identical.
Now to my question:
Can anyone suggest a distributed database implementation that would be a good fit for a cluster of a large number of nodes, each with very little RAM?
Cassandra seems to do what I want, but http://wiki.apache.org/cassandra/CassandraHardware talks about recommending at least 4G RAM for each node.
I haven't found a figure for the memory requirements of CouchDB, but given that it is implemented in Erlang, I figure maybe it isn't so bad?
Anyway, recommendation, hints, suggestions, opinions are welcome!
You should be able to do this with Cassandra, though depending on your reliability requirements, an in-memory database like Redis might be more appropriate.
Since the data set is so small (100 MBs of data), you should be able to run with less than 4 GB of RAM per node. Adding in Cassandra overhead, you probably need 200 MB of RAM for the memtable and another 200 MB for the row cache (to cache the entire data set, turn off the key cache), plus another 500 MB of RAM for Java in general, which means you could get away with 2 GB of RAM per machine.
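A hedged sketch of capping the heap along those lines in conf/cassandra-env.sh (the variable names are the standard ones in that file; the values just mirror the rough budget above, not a tested recommendation):
# memtable + row cache + general JVM overhead from the estimate above
MAX_HEAP_SIZE="1G"
# young generation; normally sized much smaller than the full heap
HEAP_NEWSIZE="200M"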
Using a replication factor of three, you probably only need a cluster on the order of tens of nodes to serve the number of reads/writes you require (especially since your data set is so small and all reads can be served from the row cache). If you need the computing power of thousands of nodes, have them talk to the tens of Cassandra nodes storing your data rather than trying to split Cassandra to run across thousands of nodes.
I've not used CouchDB myself, but I am told that Couch will run in as little as 256M with around 500K records. At a guess that would mean that each of your nodes might need ~512M, taking into account the extra 128M they need for their calculations. Ultimately you should download and give each a test inside a VPS, but it does sound like Couch will run in less memory than Cassandra.
Okay, after doing some more reading after posting the question, and trying some things out, I decided to go with MongoDB.
So far I'm happy. I have very little load, and MongoDB is using very little system resources (~200MB at most). However, my dataset isn't nearly as large as described in the question, and I am only running 1 node, so this doesn't mean anything.
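For what it's worth, a quick hedged way to watch MongoDB's footprint from the shell (figures are reported in MB; mongo is the legacy shell, newer installs ship mongosh instead):
mongo --quiet --eval 'printjson(db.serverStatus().mem)'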
CouchDB doesn't seem to support sharding out-of-the-box, so is not (it turns out) a good fit for the problem described in the question (I know there are addons for sharding).
I am planning on using ElasticSearch to index my Cassandra database. I am wondering if anyone has seen the practical limits of ElasticSearch. Do things get slow in the petabyte range? Also, has anyone had any problems using ElasticSearch to index Cassandra?
See this thread from 2011, which mentions ElasticSearch configurations with 1700 shards each of 200GB, which would be in the 1/3 petabyte range. I would expect that the architecture of ElasticSearch would support almost limitless horizontal scalability, because each shard index works separately from all other shards.
The practical limits (which would apply to any other solution as well) include the time needed to actually load that much data in the first place. Managing a Cassandra cluster (or any other distributed datastore) of that size will also involve significant workload just for maintenance, load balancing etc.
Sonian is the company kimchy alludes to in that thread. We have over a petabyte on AWS across multiple ES clusters. There isn't a technical limitation to how far horizontally you can scale ES, but as DNA mentioned there are practical problems. The biggest by far is network, and it applies to every distributed data store. You can only move so much across the wire at a time. When ES has to recover from a failure, it has to move data. The best option is to use smaller shards across more nodes (more concurrent transfer), but you risk a higher rate of failure and exorbitant cost per byte.
As DNA mentioned, 1700 shards; but it is not really 1700 shards of one index: there are 1700 indexes, each with 1 shard and 1 replica. So it is quite possible that these 1700 indexes are not present on a single machine but are spread across multiple machines.
So this is never a problem.
I am currently starting to work with Elassandra (Elasticsearch + Cassandra).
I am also having problems indexing Cassandra with Elasticsearch. My problem is basically the node configuration.
Running $ nodetool status you can see the Host ID, and then running:
curl -XGET http://localhost:9200/_cluster/state/?pretty=true
you can check that one of the node names is the same as the Host ID.
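A hedged sketch of the same check in two commands (default ports assumed; the grep pattern is just a convenience):
# Note the "Host ID" column printed by Cassandra
nodetool status
# One of the node names reported in the Elasticsearch cluster state should match that Host ID
curl -s 'http://localhost:9200/_cluster/state/?pretty=true' | grep '"name"'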