In Solr, how can I optimize search to improve speed? I have tried different cache mechanisms, but they did not help. We are searching 65 million records with Solr, and a search takes approximately 45 seconds; I want to search those 65 million records in roughly 5-10 seconds. Please suggest how I can reduce the search time.
I am using Apache Solr 5.2.1.
You can create multiple cores and split your data across them. Because the data is divided among the cores, each search is limited to a single core and a smaller index, which can improve your search speed.
In my case the data falls into different categories, so I created one core per category, named after the category. When a search request comes in for a category, the request is sent only to that category's core.
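For illustration, a minimal SolrJ sketch of routing a query to a per-category core might look like the following; the core name category_books, the query, and the URL are placeholders, and the plain HttpSolrClient constructor is the style used by the 5.x client.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CategorySearch {
        public static void main(String[] args) throws Exception {
            // Point the client at the core for the requested category only,
            // so the query never touches the other cores' indexes.
            HttpSolrClient solr = new HttpSolrClient("http://localhost:8983/solr/category_books");

            SolrQuery query = new SolrQuery("title:solr");
            query.setRows(10);

            QueryResponse response = solr.query(query);
            System.out.println("Hits: " + response.getResults().getNumFound());

            solr.close();
        }
    }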
The second approach is sharding, which again splits the data, this time into different shards, where each shard holds a portion of the index data.
When the data is too large for one node, you can break it up and store it in sections by creating one or more shards. Each shard is a portion of the logical index (core), and it is the set of all nodes containing that section of the index.
It is highly recommended that you use SolrCloud when you need to scale up or scale out.
Below are links that will help you with SolrCloud:
https://cwiki.apache.org/confluence/display/solr/Distributed+Search+with+Index+Sharding
https://cwiki.apache.org/confluence/display/solr/Shards+and+Indexing+Data+in+SolrCloud
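As a rough sketch of the SolrCloud route (assumptions: a collection named records already created via the Collections API with action=CREATE and a numShards/replicationFactor of your choosing, and ZooKeeper hosts at the addresses shown), querying with SolrJ's CloudSolrClient could look like this; SolrCloud fans the query out to the shards and merges the partial results for you:

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class CloudSearch {
        public static void main(String[] args) throws Exception {
            // Connect via ZooKeeper; the client routes the request to the
            // shards of the default collection and merges the results.
            CloudSolrClient cloud = new CloudSolrClient("zk1:2181,zk2:2181,zk3:2181");
            cloud.setDefaultCollection("records");

            QueryResponse response = cloud.query(new SolrQuery("name:john"));
            System.out.println("Hits: " + response.getResults().getNumFound());

            cloud.close();
        }
    }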
I have an index that now holds almost 50 GB of data, and it will soon exceed 100 GB, so I would like to set up an index architecture that gives better performance.
I have checked out many options, one of them being index lifecycle management, but my index can be updated at any time, so how can I design my index for good performance in that case?
Another thing: I found an article on dynamic indices covering updating and deleting records in an index. It shows that updated and deleted records still affect search operations, and since I have a very large number of updated records, this will reduce the performance of the index.
How can we improve index performance when we have a large amount of data and many update and delete operations? What architecture should we follow?
Is all of your data likely to be updated or deleted, or only the latest data?
If your updates are on fairly recent data and the old data is read-only, you can create a hot-warm-cold architecture as described in this blog post.
If all of your data is likely to be updated, you can use a hot-warm architecture, where all your updates go to the hot nodes and all queries go to the warm nodes. Elasticsearch will sync the hot and warm nodes to achieve eventual consistency, so you might have to live with data that is stale by a few milliseconds, I assume. Check this.
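As a hedged sketch of the hot-warm idea rather than the exact steps from the linked post: nodes are started with a custom attribute (commonly called box_type) set to hot or warm, and a write-heavy index is pinned to the hot tier with a routing allocation setting. With Elasticsearch's low-level Java REST client that might look like the following; the index name, host, and attribute name are assumptions.

    import java.util.Collections;

    import org.apache.http.HttpHost;
    import org.apache.http.entity.ContentType;
    import org.apache.http.nio.entity.NStringEntity;
    import org.elasticsearch.client.Response;
    import org.elasticsearch.client.RestClient;

    public class HotWarmAllocation {
        public static void main(String[] args) throws Exception {
            RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build();

            // Pin the write-heavy index to nodes tagged with the (assumed)
            // custom attribute box_type=hot; older indices would later be
            // re-pointed at "warm" nodes the same way.
            String settings = "{ \"index.routing.allocation.require.box_type\": \"hot\" }";
            Response response = client.performRequest(
                    "PUT",
                    "/myindex/_settings",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(settings, ContentType.APPLICATION_JSON));

            System.out.println(response.getStatusLine());
            client.close();
        }
    }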
In my experience, Elasticsearch easily handles 50-100 GB of data even if you update and search the same set of indices and nodes. It all depends on the rate of updates and searches.
I have around 10 million names, spread across about 5 files of 2 million names each, and there are hundreds of users. Each user comes with a file that has 1 million numbers.
I need to process these 1 million numbers against the 2 million names, generate values, and show the values along with the names to the user.
Would Cassandra be a good choice for this?
Currently I'm using SQL with Ruby on Rails, but it's quite slow at returning the values.
Cassandra is a NoSQL database, not an RDBMS.
So, in case you don't know: there are no joins in Cassandra.
If your data comes back slowly because of I/O, then Cassandra is definitely a good choice. However, if your results are slow because of joins, then Cassandra cannot help you, because, as I said, there are no joins in Cassandra.
Now, coming to your requirement: more information is needed to form an opinion, for example:
When do you want to process the data to create the values (in batch, or on the fly)?
How many records do you want to pull and show to the user at a time? And so on.
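If Cassandra does turn out to fit, the usual shape is a query-first, denormalized table where the name/number matching is computed at write time (during your batch or ingest step) and reads are a single-partition lookup with no join. A minimal sketch with the DataStax Java driver, with made-up keyspace, table, and column names:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Row;
    import com.datastax.driver.core.Session;

    public class UserResults {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();

            // One partition per user; the value for each (name, number) pair is
            // written here when the user's file is processed, so the read path
            // is a single-partition query.
            session.execute("CREATE KEYSPACE IF NOT EXISTS demo "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS demo.user_results ("
                    + "  user_id text,"
                    + "  name text,"
                    + "  number bigint,"
                    + "  value double,"
                    + "  PRIMARY KEY (user_id, name))");

            ResultSet rs = session.execute(
                    "SELECT name, value FROM demo.user_results WHERE user_id = ?", "user-42");
            for (Row row : rs) {
                System.out.println(row.getString("name") + " -> " + row.getDouble("value"));
            }
            cluster.close();
        }
    }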
Background
We have recently started a "Big Data" project where we want to track what users are doing with our product - how often they are logging in, which features they are clicking on, etc. - your basic user analytics stuff. We still don't know exactly what questions we will be asking, but most of it will be of the "how often did X occur over the last Y months?" variety, so we started storing the data sooner rather than later, on the theory that we can always migrate and re-shape it when we need to, but if we don't store it, it is gone forever.
We are now looking at what sorts of questions we can ask. In a typical RDBMS, this stage would consist of slicing and dicing the data in many different dimensions, exporting to Excel, producing graphs, looking for trends etc - it seems that for Cassandra, this is rather difficult to do.
Currently we are using Apache Spark, and submitting Spark SQL jobs to slice and dice the data. This actually works really well, and we are getting the data we need, but it is rather cumbersome as there doesn't seem to be any native API for Spark that we can connect to from our workstations, so we are stuck using the spark-submit script and a Spark app that wraps some SQL from the command line and outputs to a file which we then have to read.
The question
In a table (or Column Family) with ~30 columns running on 3 nodes with RF 2, how bad would it be to add an INDEX to every non-PK column, so that we could simply query it using CQL across any column? Would there be a horrendous impact on the performance of writes? Would there be a large increase in disk space usage?
The other option I have been investigating is using Triggers, so that for each row inserted, we populated another handful of tables (essentially, custom secondary index tables) - is this a more acceptable approach? Does anyone have any experience of the performance impact of Triggers?
Impact of adding more indexes:
This really depends on your data structure, its distribution, and how you access it; you were right earlier when you compared this process to an RDBMS. For Cassandra, it's best to define your queries first and then build the data model.
These guys have a nice write-up on the performance impacts of secondary indexes:
https://pantheon.io/blog/cassandra-scale-problem-secondary-indexes
The main impact (from the post) is that secondary indexes are local to each node, so to satisfy a query by an indexed value, each node has to query its own records to build the final result set (as opposed to a primary key query, where it is known exactly which node needs to be queried). So there's not just an impact on writes, but on read performance as well.
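To make the difference concrete, here is a hedged sketch against the DataStax Java driver, with an invented, cut-down events table: the secondary-index query has to be answered by every node, while a purpose-built query table keyed by the filtered column (one flavour of the "custom secondary index tables" you mention) is served from a single partition.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class IndexVsQueryTable {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
            Session session = cluster.connect();
            session.execute("CREATE KEYSPACE IF NOT EXISTS analytics "
                    + "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 2}");

            // A cut-down stand-in for the real ~30-column events table.
            session.execute("CREATE TABLE IF NOT EXISTS analytics.events ("
                    + "  user_id text, event_time timeuuid, feature text, detail text,"
                    + "  PRIMARY KEY (user_id, event_time))");

            // Option 1: a secondary index. Every node consults its own local
            // index to answer the query, so the read fans out across the cluster.
            session.execute("CREATE INDEX IF NOT EXISTS events_feature_idx "
                    + "ON analytics.events (feature)");
            session.execute("SELECT * FROM analytics.events WHERE feature = 'login' LIMIT 100");

            // Option 2: a query table keyed by the filter column. It costs one
            // extra write per event (the trigger/dual-write approach), but the
            // read is a single-partition lookup.
            session.execute("CREATE TABLE IF NOT EXISTS analytics.events_by_feature ("
                    + "  feature text, event_time timeuuid, user_id text,"
                    + "  PRIMARY KEY (feature, event_time)"
                    + ") WITH CLUSTERING ORDER BY (event_time DESC)");
            session.execute("SELECT * FROM analytics.events_by_feature "
                    + "WHERE feature = 'login' LIMIT 100");

            cluster.close();
        }
    }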
In terms of working out the performance on your data model, I'd recommend using the cassandra-stress tool; you can combine it with a data modeler tool that DataStax has built to quickly generate profile YAMLs:
http://www.datastax.com/dev/blog/data-modeler
For example, I ran the basic stress profile without and then with secondary indexes on the default table, and the "with indexes" batch of writes took a little over 40% longer to complete. There was also an increase in GC operations / duration etc.
I have created a search project based on Lucene 4.5.1.
There are about 1 million documents, each of them a few KB in size, and I index them with the fields docname (stored), lastmodified, and content. The overall size of the index folder is about 1.7 GB.
I used one document (the original one) as a sample and queried the content of that document against the index. The problem is that each query comes back slowly. After some tests, I found that my queries are too large, even though I removed stopwords, but I have no idea how to reduce the query string size. Besides, the smaller the query string is, the less accurate the results become.
This is not limited to a specific file; I also tested with other original files, and search performance is similarly slow (often 1-8 seconds).
Also, I have tried copying the entire index directory to a RAMDirectory before searching; that didn't help.
In addition, I share a single IndexSearcher across multiple threads, but in testing I used only one thread as the benchmark; the expected response time should be a few milliseconds.
So, how can I improve search performance in this case?
Hint: I'm retrieving the top 1000 results.
If the number of fields is large, a nice solution is to not store them individually, but instead serialize the whole object into a single binary field.
The plus is that when projecting the object back out after a query, you read a single field rather than many. getField(name) iterates over the document's entire field set, so it is O(n/2) on average per lookup, before you even get the values and set the fields on your object. With one binary field, you just read it and deserialize.
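A hedged Lucene 4.x sketch of that idea, using plain Java serialization for a made-up payload; StoredField accepts a byte[], and the whole blob comes back with a single getBinaryValue call:

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.io.Serializable;

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.StoredField;
    import org.apache.lucene.util.BytesRef;

    public class BlobFieldExample {

        // Serialize any Serializable payload into one stored binary field.
        static Document toDocument(Serializable payload) throws Exception {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(payload);
            }
            Document doc = new Document();
            doc.add(new StoredField("blob", bytes.toByteArray()));
            // Indexed-but-not-stored fields (content, lastmodified, ...) would
            // still be added here for searching; they just aren't stored individually.
            return doc;
        }

        // Read the single blob field back after a search and deserialize it.
        static Object fromDocument(Document doc) throws Exception {
            BytesRef blob = doc.getBinaryValue("blob");
            try (ObjectInputStream in = new ObjectInputStream(
                    new ByteArrayInputStream(blob.bytes, blob.offset, blob.length))) {
                return in.readObject();
            }
        }
    }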
Second, it might be worth looking at something like a MoreLikeThis query. See https://stackoverflow.com/a/7657757/277700
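And a hedged sketch of the MoreLikeThis route on Lucene 4.5, which builds a compact query from only the most significant terms of the sample document instead of feeding in the whole text; the index path, the content field name, and the tuning values are assumptions:

    import java.io.File;

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queries.mlt.MoreLikeThis;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;

    public class MltExample {
        public static void main(String[] args) throws Exception {
            IndexReader reader = DirectoryReader.open(FSDirectory.open(new File("/path/to/index")));
            IndexSearcher searcher = new IndexSearcher(reader);

            MoreLikeThis mlt = new MoreLikeThis(reader);
            mlt.setAnalyzer(new StandardAnalyzer(Version.LUCENE_45));
            mlt.setFieldNames(new String[] { "content" });
            mlt.setMinTermFreq(2);    // ignore terms that are rare in the sample doc
            mlt.setMinDocFreq(5);     // ignore terms that are rare in the whole index
            mlt.setMaxQueryTerms(25); // keep the generated query small

            // Build the query from an already-indexed sample document (docId 0 here).
            Query query = mlt.like(0);
            TopDocs hits = searcher.search(query, 1000);
            System.out.println("Matches: " + hits.totalHits);

            reader.close();
        }
    }

Capping maxQueryTerms is what keeps the generated query small, which addresses the "queries are too large" problem directly.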
I have some software that collects data over a long period of time, at approximately 200 readings per second, and it uses an SQL database for this. I am looking at using Azure to move a lot of my old "archived" data to.
The software uses a multi-tenant architecture, so I am planning to use one Azure Table per tenant. Each tenant monitors perhaps 10-20 different metrics, so I am planning to use the metric ID (an int) as the PartitionKey.
Since each metric will only have one reading per minute (max), I am planning to use DateTime.Ticks.ToString("d19") as my RowKey.
However, I don't fully understand how this will scale, so I was hoping somebody might be able to clear this up:
For performance, Azure will/might split my table by PartitionKey in order to keep things fast, which in this case would result in one partition per metric.
However, my RowKey could potentially represent data over approximately 5 years, so I estimate approximately 2.5 million rows per partition.
Is Azure clever enough to then split based on RowKey as well, or am I designing in a future bottleneck? I know not to optimise prematurely as a rule, but with something like Azure that doesn't seem as sensible as usual!
I'm looking for an Azure expert to let me know whether I am on the right track or whether I should be partitioning my data into more tables too.
A few comments:
Apart from storing the data, you may also want to look into how you will retrieve it, as that may change your design considerably. Some questions you might want to ask yourself:
When I retrieve the data, will I always be retrieving the data for a particular metric and for a date/time range?
Or will I need to retrieve the data for all metrics for a particular date/time range? If that's the case, you're looking at a full table scan. Obviously you could avoid this by doing multiple queries (one query per PartitionKey).
Do I need to see the latest results first, or do I not really care? If it's the former, then your RowKey strategy should be something like (DateTime.MaxValue.Ticks - DateTime.UtcNow.Ticks).ToString("d19").
Also, since PartitionKey is a string value, you may want to convert the int metric ID to a string with some "0" pre-padding so that all your IDs appear in order; otherwise you'll get 1, 10, 11, ..., 19, 2, etc.
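A small Java sketch of both key conventions, purely for illustration; it uses Long.MAX_VALUE minus epoch milliseconds where the .NET version would subtract from DateTime.MaxValue.Ticks, and the padding widths are assumptions:

    public class TableKeys {
        // Zero-pad the metric id so string ordering matches numeric ordering:
        // "000001", "000002", ..., "000010" instead of 1, 10, 11, ..., 2.
        static String partitionKey(int metricId) {
            return String.format("%06d", metricId);
        }

        // Reverse the timestamp so the newest reading sorts first within the
        // partition (the Java analogue of DateTime.MaxValue.Ticks - UtcNow.Ticks).
        static String rowKey(long epochMillis) {
            return String.format("%019d", Long.MAX_VALUE - epochMillis);
        }

        public static void main(String[] args) {
            System.out.println(partitionKey(7));                    // 000007
            System.out.println(rowKey(System.currentTimeMillis())); // e.g. 9223370...
        }
    }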
To the best of my knowledge, Windows Azure partitions the data based on the PartitionKey only, not the RowKey. Within a partition, the RowKey serves as the unique key. Windows Azure will try to keep data with the same PartitionKey on the same node, but since each node is a physical device (and thus has size limitations), the data may flow to another node as well.
You may want to read this blog post from Windows Azure Storage Team: http://blogs.msdn.com/b/windowsazurestorage/archive/2010/11/06/how-to-get-most-out-of-windows-azure-tables.aspx.
UPDATE
Based on your comments below and some information from above, let's try and do some math. This is based on the latest scalability targets published here: http://blogs.msdn.com/b/windowsazurestorage/archive/2012/11/04/windows-azure-s-flat-network-storage-and-2012-scalability-targets.aspx. The documentation states that:
Single Table Partition – a table partition are all of the entities in a table with the same partition key value, and usually tables have many partitions. The throughput target for a single table partition is: Up to 2,000 entities per second.
Note, this is for a single partition, and not a single table. Therefore, a table with good partitioning can process up to the 20,000 entities/second, which is the overall account target described above.
Now, you mentioned that you have 10-20 different metrics, and for each metric you'll write a maximum of 1 record per minute. That means you would be writing a maximum of 20 entities/minute/table, which is well under the scalability target of 2,000 entities/second per partition.
The question that remains is reading. Assume a user reads at most 24 hours' worth of data (i.e. 24 * 60 = 1,440 points) per partition. If the user gets the data for all 20 metrics for 1 day, then each user (and thus each table) will fetch a maximum of 28,800 data points. The question left for you, I guess, is how many requests like this you will get per second and whether that stays within the threshold. If you can extrapolate that, I think you can reach some conclusion about the scalability of your architecture.
I would also recommend watching this video: http://channel9.msdn.com/Events/Build/2012/4-004.
Hope this helps.