Best local data store for 34,000 rows - search

I'm on Win 7 pro and I've inherited a text file with 34,000 rows. I need to do some really basic searching and sorting on it.
I was going to use Excel, but it would bloat and die.
What's the simplest solution, short of loading the thing into my local SQL server, to throw this data into to do some simple searching?

Read the text into a list or map (depending on which suits your needs better), then apply a sorting algorithm to the list. Check out this list to find one that suits your needs:
Sorting Algorithms
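For 34,000 rows, plain Python in memory is more than enough. A minimal sketch, assuming a tab-delimited file called data.txt (the file name, delimiter, sort column and search term are all placeholders to adjust):

import csv

# Hypothetical file name and delimiter -- adjust to match the real file.
with open("data.txt", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f, delimiter="\t"))

# Sort by the first column; the built-in sort handles 34,000 rows instantly.
rows.sort(key=lambda row: row[0])

# Simple substring search across all columns.
matches = [row for row in rows if any("searchTerm" in cell for cell in row)]
print(len(matches), "matching rows")

If you later need joins or aggregates, loading the same file into SQLite via the sqlite3 module is the next step up, still without touching a real SQL server.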

Related

How to speed up a search on large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis etc). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1TB.
Periodically I will need to search these files for occurrences of a particular string (not regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At this moment I am open to different approaches such as databases, elasticsearch etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (Dual-Core CPU and 8GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file it was in.
There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, be it Lucene, Elasticsearch (which internally uses Lucene) or some grep command, if you want faster search. The reason is very simple: all these systems hold the data in the fastest memory to be able to serve results quickly, and out of 8 GB (25% of which you should reserve for the OS and at least another 25-50% for other applications) you are left with very few GB of RAM.
Upgrading to an SSD or increasing the RAM on your system will help, but it's quite cumbersome, and if you hit performance issues again it will be difficult to scale your system vertically.
Suggestion
I know you already mentioned that you want to do this on your system, but as I said it wouldn't give any real benefit and you might end up wasting a lot of time (infra- and code-wise, given the many approaches mentioned in the various answers). Hence I would suggest you take the top-down approach mentioned in my other answer for determining the right capacity. It would help you quickly identify the correct capacity for whatever approach you choose.
Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it's very easy to set up and scale. You can even use AWS Elasticsearch, which is available in the free tier as well, and later on scale quickly. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are already familiar with ES.
In order to make search faster, you can split the file into multiple fields (title, body, tags, author etc.) and index only the important fields, which would reduce the inverted index size. And if you are looking only for exact string matches (no partial or full-text search), then you can simply use the keyword field type, which is even faster to index and search.
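To make that concrete, here is a minimal sketch of such a mapping created over the REST API with the Python requests library; the index and field names are made-up examples, not anything from the question:

import requests

# Hypothetical index and field names; only fields you actually search get indexed.
mapping = {
    "mappings": {
        "properties": {
            "diagnosis": {"type": "keyword"},               # exact match only, compact index
            "symptoms":  {"type": "text"},                   # full-text search
            "body":      {"type": "text", "index": False},   # stored but not searchable
        }
    }
}
resp = requests.put("http://localhost:9200/medical-records", json=mapping)
print(resp.json())

Keeping rarely searched fields out of the inverted index (keyword or non-indexed, as above) is what keeps the index small and the searches fast.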
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search will need a significant amount of memory, CPU and disk, and any one of them becoming a bottleneck would hamper your local searches and other applications. Hence I advise you to really consider doing this on an external system; Elasticsearch really stands out here, as it is meant for distributed systems and is the most popular open-source search system today.
You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term rather than indexing the entire dataset;
Create a multilevel index (i.e. an index over the index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I wanted to recommend caching of your searches as an alternative, but this would still cause a new search to take half a day. So preprocessing your data to build an index is clearly better than processing the data as each query comes in.
Minor Update:
A lot of answers here are suggesting that you put the data in the cloud. I'd highly recommend, even for anonymised medical data, that you confirm with the source (unless you scraped the data from the web) that this is OK to do.
To speed up your searches you need an inverted index. To be able to add new documents without the need to re-index all existing files the index should be incremental.
One of the first open-source projects that introduced incremental indexing is Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays. Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, support for analytical querying, filtering, grouping, support for indexing non-text files, or an infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line-based demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java shouldn't be the language of your choice, its Python wrapper, PyLucene, may also be an alternative. The indexing of the demo application is already reduced nearly to the minimum. By default no advanced functionality is used, like stemming or optimization for complex queries - features you most likely will not need for your use case but which would increase the size of the index and the indexing time.
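For intuition only, this is roughly what an inverted index does: map each term to the files it appears in, so a query becomes a dictionary lookup instead of a full scan. The toy Python sketch below is not Lucene (which builds and merges its index on disk, far more compactly); the folder path is just the one from the question's grep command:

from collections import defaultdict
from pathlib import Path

def build_index(folder):
    # Toy inverted index: token -> set of files containing it.
    index = defaultdict(set)
    for path in Path(folder).rglob("*.txt"):
        for token in path.read_text(errors="ignore").lower().split():
            index[token].add(str(path))
    return index

index = build_index("/Folder/Containing/Files")   # one slow pass over the data
print(sorted(index.get("searchterm", set())))     # later lookups are instant

A 1TB corpus obviously won't fit such an in-memory dictionary, which is exactly why Lucene persists the index to disk and why the demo application is worth trying as-is.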
I see 3 options for you.
You should really consider upgrading your hardware; an HDD-to-SSD upgrade can multiply search speed several times over.
Increase the speed of your search on the spot.
You can refer to this question for various recommendations. The main idea of this method is to optimize the CPU load by searching in parallel, but you will still be limited by your HDD speed; the maximum speedup factor is the number of cores you have (a rough sketch appears at the end of this answer).
You can index your dataset.
Because you're working with text, you would need a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x the space, depending on the data structure and the list of fields you want to index).
This method will be dramatically faster (searches take seconds).
If you decide to use this method, select the analyzer configuration carefully to match what is considered a single word for your task (here is an example for Elasticsearch).
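Regarding option 2 above (speeding up the scan on the spot), here is a rough sketch of fanning the search out over all CPU cores with Python's multiprocessing module; it is still ultimately bounded by disk throughput, and the folder and search term are the ones from the question:

import multiprocessing as mp
from pathlib import Path

def search_file(path, term="searchterm"):
    # Runs in a worker process: return the lines of one file containing the term
    # (case-insensitive, mirroring the fgrep -i call in the question).
    hits = []
    with open(path, errors="ignore") as f:
        for line in f:
            if term in line.lower():
                hits.append(line.rstrip())
    return path, hits

if __name__ == "__main__":
    files = [str(p) for p in Path("/Folder/Containing/Files").rglob("*") if p.is_file()]
    with mp.Pool() as pool:                     # one worker per core by default
        for path, hits in pool.imap_unordered(search_file, files):
            for line in hits:
                print(f"{path}: {line}")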
Worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will provide significant help. Pre-indexing will perform a one-time scan of the data and build a compact index that makes it possible to perform quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate having to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches will benefit from large-memory, multi-core hardware. There are excellent solutions out there - Elasticsearch, Solr and similar tools can do magic, given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if a sample of the data could be shared by the OP):
* Build your own index, using a light-weight database (SQLite, PostgreSQL), OR
* Use a light-weight search engine.
For the second approach, given the hardware described, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on big data repositories (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly, it is not as modern and feature-rich as Elasticsearch, but much easier to set up. It is server-less. The main benefit for the use case described by the OP is the ability to scan existing files without having to load the documents into a separate search-engine repository.
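For the first option (building your own index with a lightweight database), here is a minimal sketch using SQLite's FTS5 full-text extension from Python. The database name, table layout and the assumption that the corpus is plain .txt files are all mine, and FTS5 must be available in your SQLite build (it usually is):

import sqlite3
from pathlib import Path

con = sqlite3.connect("search_index.db")
con.execute("CREATE VIRTUAL TABLE IF NOT EXISTS docs USING fts5(path, body)")

# One-time (slow) indexing pass over the corpus.
for p in Path("/Folder/Containing/Files").rglob("*.txt"):
    con.execute("INSERT INTO docs (path, body) VALUES (?, ?)",
                (str(p), p.read_text(errors="ignore")))
con.commit()

# Indexed lookup instead of scanning every file.
for (path,) in con.execute("SELECT path FROM docs WHERE docs MATCH ?", ("searchTerm",)):
    print(path)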
Have you thought about ingesting all this data into Elasticsearch, if it has a consistent data structure/format?
If yes, below are the quick steps:
1. Install filebeat on your local computer
2. Install elasticsearch and kibana as well.
3. Export the data by making filebeat send all the data to elasticsearch.
4. Start searching it easily from Kibana.
FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can be your search engine.
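For example, once the files are indexed, a plain match query returns every document containing the term. The sketch below uses the REST API via Python requests; the index name is made up and the content/file.filename fields are typical FSCrawler defaults, so verify them against your own index:

import requests

query = {"query": {"match": {"content": "searchTerm"}}}
resp = requests.post("http://localhost:9200/medical-files/_search", json=query)

for hit in resp.json()["hits"]["hits"]:
    # FSCrawler normally stores the original file name under file.filename.
    print(hit["_source"].get("file", {}).get("filename"), hit["_score"])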
I think caching the most recently searched medical data might help performance-wise: instead of going through the whole 1TB each time, you can use Redis/Memcached.

full database table update

I currently have a REST endpoint with basic CRUD operations for a sqlite database.
But my application updates whole tables at a time (with a "save" button)
My current idea/solution is to query the data first, compare the data, and update only the "rows" that changed.
The solution is a bit complex because there are several different types of changes that can be done:
Add row
Remove row
Row content changed (similar to content moving up or down)
Is there a simpler solution?
The simplest solution is a bit dirty: delete the table, recreate it, and add each row back.
The simple answer is
Yes, you are correct.
That is exactly how you do it.
There is literally no easy way to do this.
Be aware that, for example, Firebase entirely exists to do this.
Firebase is worth billions, is far from perfect, and was created by the smartest minds around. It literally exists to do exactly what you ask.
Again there is literally no easy solution to what you ask!
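That said, for a single small SQLite table the compare-and-apply idea from the question can be sketched roughly as below. The items(id, content) schema is a made-up example, and this deliberately ignores everything that makes real sync hard (concurrent edits, conflicts, partial failures):

import sqlite3

def save_table(con, new_rows):
    # new_rows: {id: content} as currently shown in the UI after "save".
    old_rows = dict(con.execute("SELECT id, content FROM items"))
    for row_id in old_rows.keys() - new_rows.keys():        # removed rows
        con.execute("DELETE FROM items WHERE id = ?", (row_id,))
    for row_id in new_rows.keys() - old_rows.keys():        # added rows
        con.execute("INSERT INTO items (id, content) VALUES (?, ?)",
                    (row_id, new_rows[row_id]))
    for row_id in new_rows.keys() & old_rows.keys():        # changed rows
        if new_rows[row_id] != old_rows[row_id]:
            con.execute("UPDATE items SET content = ? WHERE id = ?",
                        (new_rows[row_id], row_id))
    con.commit()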
Some general reading:
One of the handful of decent articles on this:
https://www.objc.io/issues/10-syncing-data/data-synchronization/
Secondly, you will want to familiarize yourself with Firebase, since a normal part of computing now is either using BaaS sync solutions (e.g. Firebase, usually some NoSQL solution) or indeed doing it by hand.
http://firebase.google.com/docs/ios/setup/
(I don't especially recommend Firebase, but you have to know how to use it, in much the same way that you have to know how to do regex and how to write SQL calls.)
Finally you can't make realistic iOS apps without Core Data,
https://developer.apple.com/library/archive/documentation/Cocoa/Conceptual/CoreData/index.html
By no means does core data solve the problem you describe, but, realistically you will use it while you solve the problem conceptually.
You may enjoy realm,
https://realm.io
which again is - precisely - a solution to the problem you describe (which is basically the central problem in typical modern app development). As with Firebase, even if you don't like it or decide not to go with it on a particular project, one needs to be familiar with it.

Best way to process huge .csv

I need to process a pretty huge .csv (at least 10 million rows, hundreds of columns) with Python. I'd like:
To filter the content based on several criteria (mostly strings, maybe some regular expressions)
To consolidate the filtered data. For instance, grouping it by date and, for each date, counting occurrences based on a specific criterion. Pretty similar to what a pivot table could do.
I'd like to have user-friendly access to that consolidated data
I'd like to generate charts (mostly basic line charts)
Processing must be fast AND light, because computers at work cannot handle much and we're always in a hurry
Given these prerequisites, could you please suggest some ideas? I thought about using pandas. I also thought about dumping the csv into a SQLite database (because it may be easier to query if I code a user interface). But this is really my first foray into this world, so I don't know where to start. I don't have much time, but I would be very glad if you could offer some pieces of advice, some good (and fresh) things to read, interesting libs and so forth. Sorry if Stack Overflow is not the best place to ask for this kind of help. I'll delete the post if needed. Regards.
Give xsv a shot. It is quite convenient with decent speed, and it fits the Unix philosophy. However, if the dataset is used more than ten times, I'd suggest converting the csv to some binary format, and ClickHouse is a good choice for that.
There are 2 rather different situations:
when your reports (charts, pivot tables) use a limited number of columns from the original CSV, and you can pre-aggregate your large CSV file only once to get a much smaller dataset. This one-time processing can take some time (minutes), and there is no need to load the whole CSV into memory as it can be processed as a data stream (row by row; a minimal sketch appears at the end of this answer). After that you can use this small dataset for fast processing (filtering, grouping etc).
when you don't know which columns of the original CSV may be used for grouping and filtering, and pre-aggregation is not possible. In other words, all 10M rows should be processed in real time (very fast) - this is the OLAP use case. This is possible if you load the CSV data into memory once and then iterate over the 10M rows quickly when needed; if this is not possible, the only option is to import it into a database. SQLite is a good lightweight DB and you can easily import CSV with the sqlite3 command line tool. Note that SQL queries over 10M rows might not be so fast, and possibly you'll need to add some indexes.
Another option might be using a specialized OLAP database like Yandex ClickHouse - you can use it to query the CSV file directly with SQL (table engine=FILE) or import the CSV into its column store. This database is lightning fast with GROUP BY queries (it can process 10M rows in <1s).
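For the first situation, the row-by-row pre-aggregation can be as simple as the sketch below with the standard csv module. The file name, the ';' delimiter, the date/category column names and the filter value are all assumptions to adjust:

import csv
from collections import Counter

counts = Counter()
with open("big_file.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter=";"):
        if row["category"] == "something_interesting":    # filtering criterion
            counts[(row["date"], row["category"])] += 1    # pivot-style count

with open("aggregated.csv", "w", newline="", encoding="utf-8") as out:
    writer = csv.writer(out)
    writer.writerow(["date", "category", "count"])
    for (date, category), n in sorted(counts.items()):
        writer.writerow([date, category, n])

The resulting aggregated.csv is small enough to chart or pivot with pandas on a modest machine.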

Searching substrings from a large set of strings

Is there a space efficient data structure that can help answer the following question:
Assume I have a database of a large number of strings (in the millions). I need to be able to answer quickly whether a given string is a substring of one of these strings in the database.
Note that it's not even necessary in this case to tell which string it is a substring of, just that it's a substring of one.
As clarification, the ideal is to keep the data as small as possible, but query speed is really the most important issue. The minimum requirement is being able to hold the query data structure in RAM.
The right way to go about this is to avoid using your Java application to answer the question. If you solve the problem in Java, your app is guaranteed to read the entire table, and this is in addition to logic you will have to run on each record.
A better strategy would be to use your database to answer the question. Consider the following SQL query (assuming your database is some SQL flavor):
SELECT COUNT(*) FROM your_table WHERE column LIKE '%substring%'
This query will return the number of rows where 'column' contains some 'substring'. You can issue a JDBC call from your Java application. As a general rule, you should leave the heavy database lifting to your RDBMS; it was created for that.
I am giving a hat tip to this SO post which was the basis for my response: http://www.stackoverflow.com/questions/4122193/how-to-search-for-rows-containing-a-substring
Strings are highly compact structures, so for regular English text it is unlikely that you will find any other kind of structure that would be more space efficient than strings. You can perform various tricks with bits so as to make each character occupy less space in memory (at the expense of supporting other languages), but the savings there will be linear.
However, if your strings have a very low degree of variation (a very high level of repetition), then you might be able to save space by constructing a tree in which each node corresponds to a letter. Each path of nodes in the tree then forms a possible word, as follows:
[c]---[a]-+-[t]
          |
          +-[r]
So, the above tree encodes the following words: cat, car. Of course this will only result in savings if you have a huge number of mostly similar strings.
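As a toy illustration of that prefix-sharing idea, here is a plain trie in Python. Note this answers exact word lookups, not the substring query from the question, which would need something like a suffix tree or suffix automaton:

class TrieNode:
    def __init__(self):
        self.children = {}   # letter -> TrieNode; shared prefixes are stored once
        self.is_word = False

def insert(root, word):
    node = root
    for ch in word:
        node = node.children.setdefault(ch, TrieNode())
    node.is_word = True

def contains(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            return False
        node = node.children[ch]
    return node.is_word

root = TrieNode()
for w in ("cat", "car"):
    insert(root, w)
print(contains(root, "car"), contains(root, "ca"))   # True False

The savings come only from shared prefixes, so as noted above it pays off only when the strings are highly repetitive.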

Reading MS excel using Apache POI

We have an Excel file which uses a lot of built-in formulas, VLOOKUPs, macros etc. We need to convert this Excel file to a web-based solution.
Can we use Apache POI for this?
When we went through the documentation, we saw that it reads the Excel file every time and then evaluates the formulas. But with multiple users accessing the application, we are not sure if it is good to read the Excel file and evaluate the formulas every time. Can we convert all the Excel formulas to methods in Java code and invoke them from there? Can this be done with Apache POI?
I do a lot of work on Excel/Access integration with VBA, and some work in Java with Hibernate when the record counts are really high, for preprocessing on the client side before inserting records into a database.
I've had a lot of cases like this where the data is being manipulated over a few thousand lines of code, with formulas driven by other formulas doing queries and so on. The problem sounds like it's way out of scope for Apache POI. POI is meant for extracting data from a more or less static file, whereas your data sounds much more dynamic, like a lot of my own work. I'm afraid a web solution would require a redesign; I can't think of a simple solution for this one.
Try to re-examine the data you're working with and see how much of it you can process on the server side, storing the results in a daily table to keep things performant. Reduce the number of records to constrain the record set as much as possible and simplify the queries a user may try to launch from the web form. Maybe you're doing more work than you need to in Excel; any way to reduce the queries' workload, the better.
Back-end preprocessing has helped keep things fast and simple for me; just make sure it's all automated and well documented. You don't want some admin to come along, wonder why some task is running each night, decide it's not important, and leave your front end running off old, bad data.
Sorry, I hope that helps give you some ideas. POI doesn't seem like a good solution in this case to me. Excel documents have a way of being overly complex; look at ways to drill the data down at the source. You might be doing half the work just formatting the data in order to process it. Some SQL statements or stored expressions might solve it in a fraction of the time. Be critical of all Excel formulas!
