I searched on SO and didn't really find an answer to this but it seems like a common problem.
I have several hundred thousand locations in a database, each having the geocode (lat/long). If it matters, they are spread out across the U.S. Now, I have a client app in which I want users to give me their lat/long and a radius (say 5mi, 10mi, 25mi, etc.) and I want to return all the records that match. I only care about the distance value that can be computed via, say, the Haversine formula, not shortest road distance. However, given that, I want it to be as accurate as possible.
This database is mostly read-only. On a good day, there might be 10 inserts. Now, I will have hundreds of clients, maybe tens of thousands of clients that will be using the software. I want users to get results in a few seconds but if a single query takes 10-20 seconds, then it will crawl when hit with a load of clients.
How do I serve up results as efficiently as possible? I know I could just store them in MySQL or PostgreSQL (Oracle and MS SQL Server are out for this, but some other open source data store may be fine) and just put the Haversine formula in the WHERE clause, but I don't think that is going to yield efficient results.
PostgreSQL supports a wide range of geospatial queries, so long as it has the PostGIS extension installed. Nearest-neighbour, radius, and bounding-box searches are particularly easy.
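As a rough illustration (table, column and connection details are all made up; it assumes a geography column with a GiST index so ST_DWithin can use it), a radius query from Python could look like this:

# Hedged sketch: radius search against PostGIS via psycopg2.
# Assumes something like:
#   CREATE TABLE locations (id serial PRIMARY KEY, name text,
#                           geog geography(Point, 4326));
#   CREATE INDEX locations_geog_idx ON locations USING GIST (geog);
import psycopg2

def find_within_radius(conn, lat, lon, radius_miles):
    meters = radius_miles * 1609.344          # ST_DWithin on geography works in meters
    sql = """
        SELECT id, name,
               ST_Distance(geog, ST_MakePoint(%s, %s)::geography) AS meters_away
        FROM locations
        WHERE ST_DWithin(geog, ST_MakePoint(%s, %s)::geography, %s)
        ORDER BY meters_away
    """
    with conn.cursor() as cur:
        cur.execute(sql, (lon, lat, lon, lat, meters))   # note: ST_MakePoint takes (lon, lat)
        return cur.fetchall()

conn = psycopg2.connect("dbname=geo")                    # hypothetical connection string
nearby = find_within_radius(conn, 40.7128, -74.0060, 25)

Because ST_DWithin can use the spatial index, only the candidates inside the radius are examined rather than every row in the table.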
I have used Solr (a Lucene-based search server) for radius search. We have written a property portal that lets users search for properties within a radius.
We index the database, so searches are ultra fast.
I'm starting to feel like the official spokesman for Sphinx. This article explains how to configure it for geospatial searching: http://www.god-object.com/2009/10/20/geospatial-search-using-sphinx-search-and-php/
To clarify: the data will be stored in mysql/postgres but indexed by and searched via Sphinx.
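Not from that article, but as a hedged sketch of what the search side can look like: Sphinx speaks a MySQL-flavoured protocol (SphinxQL, typically on port 9306), so any MySQL client library can query it, and GEODIST() does the distance math. This assumes an index named locations with lat/lon float attributes stored in radians (GEODIST's default), so the details may need adjusting:

# Hedged sketch: radius search over SphinxQL using pymysql.
import math
import pymysql

def radius_search(lat_deg, lon_deg, radius_miles, limit=100):
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)   # GEODIST defaults to radians
    meters = radius_miles * 1609.344                          # ...and returns meters
    conn = pymysql.connect(host="127.0.0.1", port=9306)       # searchd's SphinxQL listener
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, GEODIST(lat, lon, %s, %s) AS dist "
            "FROM locations WHERE dist < %s ORDER BY dist ASC LIMIT %s",
            (lat, lon, meters, limit),
        )
        return cur.fetchall()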
I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis etc). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1TB.
Periodically I will need to search these files for occurrences of a particular string (not regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At this moment I am open to different approaches such as databases, elasticsearch etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (Dual-Core CPU and 8GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file it was within.
There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, whether you use Lucene, Elasticsearch (which internally uses Lucene) or some grep command, if you want faster search. The reason is simple: all these systems hold data in the fastest memory in order to serve it quickly, and out of 8 GB (of which you should reserve 25% for the OS and at least another 25-50% for other applications) you are left with very few GB of RAM.
Upgrading to an SSD and increasing the RAM on your system will help, but it's quite cumbersome, and if you hit performance issues again it will be difficult to keep scaling your system vertically.
Suggestion
I know you already mentioned that you want to do this on your system, but as I said it wouldn't give any real benefit and you might end up wasting a lot of time (on infrastructure and on code, given the many approaches mentioned in the various answers). Hence I would suggest you follow the top-down approach mentioned in my other answer for determining the right capacity. It would help you quickly identify the correct capacity for whatever approach you choose.
Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it's very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier as well and can later be scaled quickly. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are already familiar with ES.
To make search faster, you can split each file into multiple fields (title, body, tags, author etc) and index only the important fields, which reduces the size of the inverted index. And if you are looking only for exact string matches (no partial or full-text search), then you can simply use the keyword field type, which is even faster to index and search; a rough sketch of such a mapping is shown below.
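A minimal sketch of such a mapping and an exact-match query, using recent versions of the official Python client (index and field names are made up for illustration):

# Hedged sketch: index only selected fields; use a keyword field for exact match.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="medical-records",
    mappings={"properties": {
        "diagnosis": {"type": "keyword"},               # exact-match only, cheap to index/search
        "symptoms":  {"type": "text"},                  # full-text searched
        "file_name": {"type": "keyword"},               # so each hit shows which file it came from
        "raw_line":  {"type": "text", "index": False},  # returned with hits but not indexed
    }},
)

# Exact string match against the keyword field:
hits = es.search(index="medical-records",
                 query={"term": {"diagnosis": "searchTerm"}})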
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search will need a significant amount of memory, CPU, and disk, and any one of them becoming a bottleneck will hamper your local search and your other applications. Hence I advise you to seriously consider doing this on an external system, and Elasticsearch really stands out here, as it is meant for distributed systems and is the most popular open-source search system today.
You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term rather than indexing the entire dataset;
Create a multilevel index (i.e. an index over the index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I wanted to recommend caching of your searches as an alternative, but this will cause a new search to again take half a day. So preprocessing your data to build an index is clearly better than processing the data as the query comes.
Minor Update:
A lot of answers here are suggesting that you put the data in the cloud. I'd highly recommend, even for anonymized medical data, that you confirm with the source (unless you scraped the data from the web) that it is OK to do so.
To speed up your searches you need an inverted index. To be able to add new documents without the need to re-index all existing files the index should be incremental.
One of the first open source projects that introduced incremental indexing is Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays. Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, support for analytical querying, filtering, grouping, support for indexing non-text files or an infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java shouldn't be the language of your choice, its Python wrapper, PyLucene, may also be an alternative. The indexing of the demo application is already reduced nearly to the minimum. By default no advanced functionality is used, such as stemming or optimization for complex queries - features you most likely will not need for your use case but which would increase the size of the index and the indexing time.
I see 3 options for you.
You should really consider upgrading your hardware; an HDD -> SSD upgrade can multiply the speed of search several times over.
Increase the speed of your search on the spot.
You can refer to this question for various recommendations. The main idea of this method is to optimize CPU load, but you will be limited by your HDD speed. The maximum speed multiplier is the number of cores you have; see the sketch just below.
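As a rough illustration of that parallel-scan idea (not a tuned implementation; the path and search term are the placeholders from the question), a plain-Python scan that spreads files across a process pool and prints every occurrence with its file name:

# Hedged sketch: case-insensitive plain-string scan across files, in parallel.
import pathlib
from multiprocessing import Pool

SEARCH_TERM = "searchTerm"                     # placeholder from the question
ROOT = pathlib.Path("/Folder/Containing/Files")

def scan_file(path):
    term = SEARCH_TERM.lower()
    matches = []
    with open(path, "r", errors="ignore") as fh:
        for line_no, line in enumerate(fh, 1):
            if term in line.lower():           # plain substring match, no regex
                matches.append((str(path), line_no, line.rstrip()))
    return matches

if __name__ == "__main__":
    files = [p for p in ROOT.rglob("*") if p.is_file()]
    with Pool() as pool:                       # one worker per CPU core by default
        for found in pool.imap_unordered(scan_file, files, chunksize=16):
            for path, line_no, line in found:
                print(f"{path}:{line_no}: {line}")

Even with both cores busy, the scan remains bound by how fast the disk can deliver 1 TB, which is why the indexing option below is the real win.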
You can index your dataset.
Because you're working with text, you need a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x the space, depending on the data structure and the list of fields you want to index).
This method will be drastically faster (seconds rather than hours).
If you decide to use this method, select the analyzer configuration carefully so that it matches what is considered a single word for your task (here is an example for Elasticsearch); a rough sketch of such a configuration follows.
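As a hedged illustration with the official Python client (index and analyzer names are invented): the tokenizer is the part that decides word boundaries, so that is what you would adjust if, say, hyphenated terms must stay intact.

# Hedged sketch: a custom analyzer wired onto the indexed text field.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="records",
    settings={"analysis": {"analyzer": {
        "my_terms": {
            "type": "custom",
            "tokenizer": "standard",                  # decides what counts as a "word"
            "filter": ["lowercase", "asciifolding"],  # case- and accent-insensitive terms
        }
    }}},
    mappings={"properties": {"body": {"type": "text", "analyzer": "my_terms"}}},
)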
It is worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will provide significant help. Pre-indexing performs a one-time scan of the data and builds a compact index that makes it possible to perform quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate the need to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches will benefit from large-memory, multi-core hardware. There are excellent solutions out there - Elasticsearch, Solr and similar tools can do magic, given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if a sample of the data could be shared by the OP):
* Build your own index, using a lightweight database (SQLite, PostgreSQL), OR
* Use a lightweight search engine.
For the second approach, using the described hardware, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on big data repositories (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly, it is not as modern and feature-rich as Elasticsearch, but it is much easier to set up, and it is server-less. The main benefit for the use case described by the OP is the ability to scan existing files without having to load the documents into an extra search engine repository.
Could you consider ingesting all this data into Elasticsearch, provided it has a consistent data structure?
If yes, below are the quick steps:
1. Install Filebeat on your local computer.
2. Install Elasticsearch and Kibana as well.
3. Export the data by having Filebeat send all of it to Elasticsearch.
4. Start searching it easily from Kibana.
FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can serve as your search engine.
I think that caching the most recently searched medical data might help performance-wise, so that you don't have to go through the whole 1TB every time; you can use Redis/Memcached for this.
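A rough sketch of that idea (the key scheme and TTL are arbitrary, and run_search stands in for whatever scan or index lookup you end up using):

# Hedged sketch: cache recent search results in Redis to skip repeat scans.
import json
import redis

cache = redis.Redis(host="localhost", port=6379)

def cached_search(term, run_search, ttl_seconds=24 * 3600):
    key = "search:" + term.lower()
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)              # served from the cache
    results = run_search(term)              # the slow 1 TB scan or index lookup
    cache.setex(key, ttl_seconds, json.dumps(results))
    return results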
We are interested in using DocumentDb as a data store for a number of data sources and as such we are running a quick POC to establish whether it meets the criteria we are looking for.
One of the areas we are keen to provide is look ahead search capabilities for certain fields. These are traditionally provided using the SQL LIKE syntax which does not appear to be supported at present.
Searching online I have seen people talking about integrating Azure search but this appears to be a very costly mechanism for such a simple use case.
I have also seen people mention the use of UDF's but this appears to require an entire collection scan which is not practical from a performance perspective.
Does anyone have any alternative suggestions? One thing I considered was simply using a SQL table and initiating an update each time a document was inserted\updated\deleted?
DocumentDB supports STARTSWITH and range indexes to support prefix/look ahead searching.
You can progressively make queries like the following based on what your user types in a text box:
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "H")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hi")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hil")
SELECT TOP 10 * FROM hotel H WHERE STARTSWITH(H.name, "Hilton")
Note that you must configure the collection, or the path/property you're using for these queries with a range index. You can extend this approach to handle additional cases as well:
To query in a case-insensitive manner, you must store the lower case form of the search property, and use that for querying.
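DocumentDB lives on as Azure Cosmos DB, and with its current Python SDK the lowercase-copy pattern could look roughly like this (account, database, container and field names are all made up; it assumes the container is partitioned on /id):

# Hedged sketch: store a lowercased copy of the name and prefix-match it.
from azure.cosmos import CosmosClient

client = CosmosClient(url="https://<account>.documents.azure.com:443/",
                      credential="<key>")
container = client.get_database_client("travel").get_container_client("hotels")

# On write, keep a lowercased copy next to the display value.
container.upsert_item({"id": "1",
                       "name": "Hilton Midtown",
                       "name_lower": "hilton midtown"})

# On each keystroke, lowercase the input and prefix-match the lowercased field.
def type_ahead(prefix):
    return list(container.query_items(
        query="SELECT TOP 10 c.name FROM c WHERE STARTSWITH(c.name_lower, @p)",
        parameters=[{"name": "@p", "value": prefix.lower()}],
        enable_cross_partition_query=True,
    ))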
I faced a similar situation, where a fast lookup was required, as a user typed search terms.
My scenario was that potentially thousands of simultaneous users would be performing such lookups; when testing this under load, to avoid saturation and throttling, we found we would have to increase the DocumentDB Request Unit (RU) throughput amount to a point that was not financially viable for us, in our specific circumstances.
We decided that DocumentDB was best used as the persistent store and for 'full' data retrieval - a role it performs exceptionally well - while a small ElasticSearch cluster performed the role it was designed for: text search, faceted search, weighted search, stemming, and, most relevant to your question, autocomplete analyzers and completion suggesters.
The subjects of type-ahead queries, index creation, autocomplete analyzers and query-time 'search as you type' in ElasticSearch are covered here, here and here.
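As a rough illustration of the completion-suggester approach with recent versions of the official Python client (index and field names are invented):

# Hedged sketch: a completion field backing type-ahead suggestions.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

es.indices.create(
    index="hotels",
    mappings={"properties": {
        "name":         {"type": "text"},
        "name_suggest": {"type": "completion"},   # backs the completion suggester
    }},
)
es.index(index="hotels", document={"name": "Hilton Midtown",
                                   "name_suggest": "Hilton Midtown"})

def suggest(prefix):
    resp = es.search(index="hotels", suggest={
        "hotel-suggest": {"prefix": prefix,
                          "completion": {"field": "name_suggest"}}})
    return [opt["text"]
            for opt in resp["suggest"]["hotel-suggest"][0]["options"]]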
The fact that you plan to have several data sources would also potentially make the ElasticSearch cluster approach more attractive, to aggregate search data.
I used the Bitnami template available in the Azure market place to create relatively small instances, and most importantly, this allowed me to place the cluster on the same Virtual Network as my other components, which greatly increased performance.
Cost was lower than Azure Search (which uses ElasticSearch under the hood).
I am developing an Azure based website and I want to provide search capabilities using Lucene. (structured json objects would be indexed and stored in Lucene and other content such as Word documents, etc. would be indexed in lucene but stored in blob storage) I want the search to be secure, such that one user would never see a document belonging to another user. I want to allow ad-hoc searches as typed by the user. Lastly, I want to query programmatically to return predefined sets of data, such as "all notes for user X". I think I understand how to add properties to each document to achieve these 3 objectives. (I am listing them here so if anyone is kind enough to answer, they will have better idea of what I am trying to do)
My questions revolve around performance and security.
Can I improve document security by having a separate index for each user, or is including the user's ID as a parameter in each search sufficient?
Can I improve indexing speed and total throughput of the system by having a separate index for each user? My thinking is that having separate indexes would allow me to scale the system by having multiple index writers (perhaps even on different server instances) working at the same time, each on their own index.
Any insight would be greatly appreciated.
Regards,
Nate
Of course, one index.
You can do even better than what you suggested by using ManifoldCF (Apache product that knows how to handle Solr) to manage security.
And one off topic, uninformed suggestion: I'd rather use CloudBees or Heroku (or Amazon) instead of Azure.
Unless you are going to use several machines for indexing, I think it's more convenient to use a single index. The Lucene community has done a lot of work to make the indexing process as efficient as it can be. So unless you intentionally want to implement distributed indexing, I don't recommend splitting indexes.
However there are several reasons why you would want to split indexes:
If your machine has several IO devices that can be utilized in parallel: in this case, if you are IO-bound, splitting indexes is a good idea.
Splitting document fields between indexes (this is what ParallelReader is intended for). This is a more exotic form of splitting, but it may be a good idea if searches are performed over different groups of fields. Suppose we have two query types: the first uses the fields name and type, and the second uses the fields price and discount. If those fields are updated at different rates (name updates are presumably far rarer than price updates), updating only part of the index would require fewer IO resources. This gives the system more overall throughput.
Hey everyone, I am building a website that needs to have the capability to store the physical location of a person and allow someone to search for other people within some radius of that location. For the sake of an example, let's pretend it's a dating website (it's not), and so as a user you want to find people within 50 miles of your current location that meet some other set of search criteria.
Currently I am storing all of my user information in Azure Table Storage. However this is the first time I've ever attempted to create a geo-aware search algorithm so I wanted to verify that I am not going to waste my time doing something totally insane. To that end the idea for my implementation is as follows:
Store the longitude and latitude for the users location in the Azure Table
The PartitionKey for each entry is the state (or country if outside the US) that the person lives in
I want to calculate the distance between the current user and all other users using the haversine equation. I'm assuming I can embed this into my LINQ query?
Eventually I'd like to be able to get a list of all states/countries within the radius so I can optimize based on PartitionKey's
Naturally this implementation strategy leads to a few questions:
If I do the haversine equation in a LINQ query, where is it being executed? The last thing I want is for my WebRole to pull every record from Azure storage and run the haversine equation on it within the application process.
Is the idea of doing this with Azure Storage totally crazy? I know there are solutions like MongoDB that have the capability to do spatial search queries built-in. I like the scalability of Azure but is there some better alternative I should investigate instead?
Thanks for the help in advance!
If you want the query to be fast against Azure tables, the query has to run against the partition key and row key. Also, if you're using LINQ to query Azure tables, you need to be careful about which functions you use. If a value can be calculated once for all rows, LINQ will be clever and evaluate it before sending the query to Azure Table Storage; however, if it needs to be evaluated for each row, then you can only use the functions that Azure supports (which is a pretty short list).
You might be able to make this work if you are willing to use a bounding square for your area rather than a bounding circle. Store the latitude in the PartitionKey and the longitude in the RowKey and have a query that looks like this:
var query = from UserLocation ul
in repository.All()
where
ul.PartitionKey.CompareTo(minimumLatitude) > 0
&& ul.PartitionKey.CompareTo(maximumLatitude) < 0
&& ul.RowKey.CompareTo(minimumLongitude) > 0
&& ul.RowKey.CompareTo(maximumLongitude) < 0
select
ul;
This is probably not as clever as you were hoping for though. If this is not going to work for you, then you'll need to look at other options. SQL Azure supports geospatial queries if you want to stay within the Microsoft family.
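For what it's worth, here is a rough sketch of how the two steps fit together in plain Python (constants and field names are illustrative): the cheap bounding box narrows the candidate set, then the haversine formula discards the corners of the square that fall outside the true radius.

# Hedged sketch: bounding-box prefilter plus haversine post-filter.
import math

EARTH_RADIUS_MILES = 3958.8

def haversine_miles(lat1, lon1, lat2, lon2):
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))

def bounding_box(lat, lon, radius_miles):
    # Degrees of latitude/longitude that the radius spans at this latitude.
    dlat = math.degrees(radius_miles / EARTH_RADIUS_MILES)
    dlon = math.degrees(radius_miles / (EARTH_RADIUS_MILES * math.cos(math.radians(lat))))
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

def within_radius(candidates, lat, lon, radius_miles):
    # candidates: rows already narrowed down by the bounding-box query above
    return [c for c in candidates
            if haversine_miles(lat, lon, c["lat"], c["lon"]) <= radius_miles]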
Having used spatial queries with SQL Azure (approximately 5 million records), I can confirm that it can be very quick - it may well be worth a look for you.
I am trying to figure out exactly what these newfangled data stores such as bigtable, hbase and cassandra really are.
I work with massive amounts of stock market data, billions of rows of price/quote data that can add up to 100s of gigabytes every day (although these text files often compress by at least an order of magnitude). This data is basically a handful of numbers, two or three short strings and a timestamp (usually millisecond level). If I had to pick a unique identifier for each row, I would have to pick the whole row (since an exchange may generate multiple values for the same symbol in the same millisecond).
I suppose the simplest way to map this data to bigtable (I'm including its derivatives) is by symbol name and date (which may return a very large time series, more than million data points isn't unheard of). From reading their descriptions, it looks like multiple keys can be used with these systems. I'm also assuming that decimal numbers are not good candidates for keys.
Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?
What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?
What if I want to get two time series, subtract one from the other, and return the two time series and their result - will I have to do this logic in my own program?
Reading relevant papers seems to show that these systems are not a very good fit for massive time series systems. However, if systems such as google maps are based on them, I think time series should work as well. For example, think of time as the x-axis, prices as y-axis and symbols as named locations--all of a sudden it looks like bigtable should be the ideal store for time series (if the whole earth can be stored, retrieved, zoomed and annotated, stock market data should be trivial).
Can some expert point me in the right direction or clear up any misunderstandings?
Thanks
I am not an expert yet, but I've been playing with Cassandra for a few days now, and I have some answers for you:
Don't worry about the amount of data; it's irrelevant with systems like Cassandra, provided you have the money for a large hardware cluster.
Some of these systems (Cassandra, for example) claim to be able to do range queries. Would I be able to efficiently query, say, all values for MSFT, for a given day, between 11:00 am and 1:30 pm?
Cassandra is very useful when you know how to work with keys. It can sift through keys very quickly. So to search for MSFT between 11:00 am and 1:30 pm, you'd have to key your rows like this:
MSFT-timestamp, GOOG-timestamp, etc.
Then you can tell Cassandra to find all keys that start with MSFT-now and end with MSFT-now+1hour.
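For what it's worth, in current Cassandra versions the same idea is normally expressed in CQL rather than hand-built row keys: the symbol becomes the partition key and the timestamp a clustering column, so a per-symbol time-range query is served from a single partition. A hedged sketch with the DataStax Python driver (keyspace, table and column names are made up):

from datetime import datetime
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("market")

# Real tick data would likely need an extra clustering column (e.g. a sequence
# number) so that ticks sharing the same millisecond don't overwrite each other.
session.execute("""
    CREATE TABLE IF NOT EXISTS ticks (
        symbol text,
        ts timestamp,
        price decimal,
        PRIMARY KEY (symbol, ts)
    )
""")

rows = session.execute(
    "SELECT ts, price FROM ticks WHERE symbol = %s AND ts >= %s AND ts < %s",
    ("MSFT", datetime(2010, 1, 4, 11, 0), datetime(2010, 1, 4, 13, 30)),
)
for row in rows:
    print(row.ts, row.price)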
What if I want to search across ALL symbols for a given day, and request all symbols that have a price between $10 and $10.25 (so I'm searching the values, and want keys returned as a result)?
I am not an expert, but so far I have realized that Cassandra doesn't search by values at all. So if you want to do the above, you will have to make another table dedicated just to this problem and design your schema to fit the case. But it won't be much different from what I described above. It's all about naming your keys and columns. Cassandra can find them very quickly!
What if I want to get two time series, subtract one from the other, and return the two time series and their result - will I have to do this logic in my own program?
Correct, all logic is done inside your program. This is not MySQL; it is just a storage engine. (But I am sure future versions will offer these sorts of things.)
Please remember that I am a novice at this; if I am wrong, feel free to correct me.
If you're dealing with a massive time series database, then the standards are:
KDB: http://www.kx.com/
OneTick: http://www.onetick.com
Vhayu: http://www.vhayu.com
These aren't cheap, but they can handle your data very efficiently.
Someone whom I respect recommended the Open Time Series Database. In particular, that the schema was the nicest he had ever seen.
http://opentsdb.net/
I'm standing in front of the same mountain. My main problem with Cassandra is that I cannot get a stream over the result set, for example in the form of an iterator.
I have already looked up and down the docs and the net, but found nothing.
I can't fetch all the keys and then get the rows, as billions of rows make this impossible.
The DataStax Java Driver allows for automatic paging so that will stream the results just like an iterator and it's all built in. This is in Cassandra 2.0.1 by the way - http://www.datastax.com/dev/blog/client-side-improvements-in-cassandra-2-0
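The Python driver exposes the same automatic paging; iterating the result set pulls pages lazily rather than materialising billions of rows. A hedged sketch (keyspace and table names are placeholders):

from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("market")

stmt = SimpleStatement("SELECT symbol, ts, price FROM ticks", fetch_size=5000)
for row in session.execute(stmt):       # pages of 5000 rows are fetched lazily
    print(row.symbol, row.ts, row.price)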
Just for the sake of completeness, reading this in 2018: there is now a specialized database just for time-series data called TimescaleDB.
http://www.timescale.com/
This blog post is worth reading; it explains why it's superior to solutions like Cassandra for that special case, and why they decided to build it on top of the relational PostgreSQL database:
https://blog.timescale.com/time-series-data-why-and-how-to-use-a-relational-database-instead-of-nosql-d0cd6975e87c