Searching substrings from a large set of strings

Is there a space efficient data structure that can help answer the following question:
Assume I have a database of a large number of strings (in the millions). I need to be able to answer quickly if a given string is a substring of one of these strings in the database.
Note that it's not even necessary in this case to tell which string it is a substring of, just that it's a substring of one.
As clarification, the ideal is to keep the data as small as possible, but query speed is really the most important issue. The minimum requirement is being able to hold the query data structure in RAM.

The right way to go about this is to avoid using your Java application to answer the question. If you solve the problem in Java, your app is guaranteed to read the entire table, and this is in addition to logic you will have to run on each record.
A better strategy would be to use your database to answer the question. Consider the following SQL query (assuming your database is some SQL flavor):
SELECT COUNT(*) FROM your_table WHERE column LIKE '%substring%'
This query will return the number of rows where 'column' contains some 'substring'. You can issue a JDBC call from your Java application. As a general rule, you should leave the heavy database lifting to your RDBMS; it was created for that.
I am giving a hat tip to this SO post which was the basis for my response: http://www.stackoverflow.com/questions/4122193/how-to-search-for-rows-containing-a-substring
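If you do go this route, a minimal JDBC sketch along these lines might look as follows. The table and column names, the connection URL, and the credentials are placeholders for illustration only, not anything given in the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class SubstringCheck {
    // Returns true if any row's value contains the given substring.
    public static boolean containsSubstring(String substring) throws Exception {
        String url = "jdbc:mysql://localhost:3306/mydb"; // hypothetical connection details
        String sql = "SELECT COUNT(*) FROM your_table WHERE your_column LIKE ?";
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setString(1, "%" + substring + "%"); // bind the pattern to avoid SQL injection
            try (ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1) > 0;
            }
        }
    }
}
Binding the pattern as a parameter also lets the driver handle escaping of any wildcard characters in the user's input.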

Strings are already highly compact structures, so for regular English text it is unlikely that you will find any other kind of structure that is more space efficient than plain strings. You can perform various tricks with bits to make each character occupy less space in memory (at the expense of supporting other languages), but the savings there will only be linear.
However, if your strings have a very low degree of variation (a very high level of repetition), then you might be able to save space by constructing a tree in which each node corresponds to a letter. Each path of nodes in the tree then forms a possible word, as follows:
[c]-+-[a]-+-[t]
          |
          +-[r]
So, the above tree encodes the following words: cat, car. Of course this will only result in savings if you have a huge number of mostly similar strings.
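For illustration only (this is not from the answer itself), a minimal Java sketch of such a letter tree (commonly called a trie) could look like the following. Note that, like the diagram above, it answers whole-word membership, not the substring query from the original question:
import java.util.HashMap;
import java.util.Map;

// Minimal trie sketch: each node is one letter, a path from the root spells a word.
class Trie {
    private static class Node {
        Map<Character, Node> children = new HashMap<>();
        boolean endOfWord; // true if a stored word ends at this node
    }

    private final Node root = new Node();

    void insert(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.endOfWord = true;
    }

    boolean contains(String word) {
        Node node = root;
        for (char c : word.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return false;
        }
        return node.endOfWord;
    }

    public static void main(String[] args) {
        Trie t = new Trie();
        t.insert("cat");
        t.insert("car");
        System.out.println(t.contains("car")); // true
        System.out.println(t.contains("cab")); // false
    }
}
The space saving comes from "cat" and "car" sharing the nodes for "c" and "a"; with mostly dissimilar strings, the per-node overhead makes it larger than plain strings.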

Related

How do you save a List<> as a column in a table in room?

I am building an app in which I have a Room entity where one of its columns is supposed to hold a List.
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
I tried serializing with Jackson (turning the List into a long JSON String and then bringing it back to a List when fetched), but I am not sure if this is the correct approach.
Thank you,
What is the best approach for doing this in an app that uses Flow, Coroutines and Room?
This is very much open to opinion.
From a database perspective the approach would be to have any list as a table of its own, and thus:
* reduce the JSON bloat and thus improve efficiency,
* reduce duplication and thus be more likely to conform to normalisation,
* avoid potentially introducing complexities and even greater inefficiencies (e.g., not mentioned in the answer below, but a wildcard as the first character of a LIKE pattern forces a full table scan),
* perhaps consider this question and answer: matching multiple title in single query using like keyword, where, if the table-per-list approach were taken, a simple SELECT * FROM task WHERE task_tags IN(:taglist) could do the same.
From a coding point of view, embedding JSON at first appears simpler, as the complex code lives inside the JSON libraries.
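For reference, here is a minimal sketch of the JSON approach the question describes, using a Room TypeConverter backed by Jackson. The class and method names are illustrative, and as argued above a separate table is usually the better design:
import androidx.room.TypeConverter;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.IOException;
import java.util.List;

// Sketch of the JSON-serialisation approach: Room stores the list as one String column.
public class StringListConverter {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @TypeConverter
    public static String fromList(List<String> items) {
        try {
            return items == null ? null : MAPPER.writeValueAsString(items);
        } catch (IOException e) {
            throw new IllegalStateException("Could not serialize list", e);
        }
    }

    @TypeConverter
    public static List<String> toList(String json) {
        try {
            return json == null ? null : MAPPER.readValue(json, new TypeReference<List<String>>() {});
        } catch (IOException e) {
            throw new IllegalStateException("Could not deserialize list", e);
        }
    }
}
The converter would then be registered on the database class with @TypeConverters(StringListConverter.class). Once the list is stored this way, you lose the ability to query individual elements efficiently, which is exactly the trade-off discussed above.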

How to speed up a search on large collection of text files (1TB)

I have a collection of text files containing anonymised medical data (age, country, symptoms, diagnosis etc). This data goes back at least 30 years, so as you can imagine I have quite a large data set. In total I have around 20,000 text files totalling approx. 1TB.
Periodically I will need to search these files for occurrences of a particular string (not regex). What is the quickest way to search through this data?
I have tried using grep and recursively searching through the directory as follows:
LC_ALL=C fgrep -r -i "searchTerm" /Folder/Containing/Files
The only problem with doing the above is that it takes hours (sometimes half a day!) to search through this data.
Is there a quicker way to search through this data? At this moment I am open to different approaches such as databases, elasticsearch etc. If I do go down the database route, I will have approx. 1 billion records.
My only requirements are:
1) The search will be happening on my local computer (Dual-Core CPU and 8GB RAM)
2) I will be searching for strings (not regex).
3) I will need to see all occurrences of the search string and the file it was within.
There are a lot of answers already, I just wanted to add my two cents:
Having this much data (1 TB) with just 8 GB of memory will not be good enough for any approach, be it Lucene, Elasticsearch (which internally uses Lucene) or some grep command, if you want faster search. The reason is simple: all these systems hold data in the fastest memory in order to serve queries quickly, and out of 8 GB (of which you should reserve 25% for the OS and at least another 25-50% for other applications) you are left with very few GB of RAM.
Upgrading to an SSD or increasing the RAM on your system will help, but it's quite cumbersome, and if you hit performance issues again it will be difficult to scale your system vertically.
Suggestion
I know you already mentioned that you want to do this on your own system, but as I said it won't give any real benefit and you might end up wasting a lot of time (both on infrastructure and code, given the many approaches mentioned in the various answers). Hence I would suggest the top-down approach mentioned in another answer of mine for determining the right capacity. It will help you quickly identify the correct capacity for whichever approach you choose.
Implementation-wise, I would suggest doing it with Elasticsearch (ES), as it's very easy to set up and scale. You can even use AWS Elasticsearch, which is also available in the free tier, and later scale quickly. Although I am not a big fan of AWS ES, it saves a lot of setup time and you can get started quickly if you are already familiar with ES.
To make search faster, you can split each file into multiple fields (title, body, tags, author etc.) and index only the important fields, which reduces the inverted index size. If you are looking only for exact string matches (no partial or full-text search), then you can simply use the keyword field type, which is even faster to index and search.
I could go on about why Elasticsearch is good and how to optimize it, but that's not the crux. The bottom line is that any search needs a significant amount of memory, CPU and disk, and any one of them becoming a bottleneck will hamper your local system's search and your other applications. Hence I advise you to really consider doing this on an external system, and Elasticsearch really stands out as it's meant for distributed systems and is the most popular open-source search system today.
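As a rough illustration of the field-splitting idea, the mapping below indexes only the searchable fields and keeps an exact-match field as keyword. It talks to the Elasticsearch REST API directly with Java's built-in HTTP client, so no ES client library is needed; the index name, field names and the local endpoint are assumptions, not anything from the question:
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Creates a hypothetical "medical_records" index: "body" is full-text indexed,
// "diagnosis" is an exact-match keyword field, and "raw_file" is kept but not indexed.
public class CreateIndex {
    public static void main(String[] args) throws Exception {
        String mapping = """
            {
              "mappings": {
                "properties": {
                  "body":      { "type": "text" },
                  "diagnosis": { "type": "keyword" },
                  "raw_file":  { "type": "text", "index": false }
                }
              }
            }""";
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:9200/medical_records"))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(mapping))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
A term query against the diagnosis keyword field then gives exact matches without the analysis overhead of full-text search.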
You clearly need an index, as almost every answer has suggested. You could totally improve your hardware but since you have said that it is fixed, I won’t elaborate on that.
I have a few relevant pointers for you:
Index only the fields in which you want to find the search term rather than indexing the entire dataset;
Create multilevel index (i.e. index over index) so that your index searches are quicker. This will be especially relevant if your index grows to more than 8 GB;
I wanted to recommend caching of your searches as an alternative, but a new (uncached) search would still take half a day. So preprocessing your data to build an index is clearly better than processing the data as the query comes.
Minor Update:
A lot of answers here are suggesting that you put the data in the cloud. I'd highly recommend, even for anonymized medical data, that you confirm with the source (unless you scraped the data from the web) that this is OK to do.
To speed up your searches you need an inverted index. To be able to add new documents without the need to re-index all existing files the index should be incremental.
One of the first open source projects that introduced incremental indexing is Apache Lucene. It is still the most widely used indexing and search engine, although other tools that extend its functionality are more popular nowadays. Elasticsearch and Solr are both based on Lucene. But as long as you don't need a web frontend, support for analytical querying, filtering, grouping, support for indexing non-text files, or an infrastructure for a cluster setup over multiple hosts, Lucene is still the best choice.
Apache Lucene is a Java library, but it ships with a fully functional, command-line-based demo application. This basic demo should already provide all the functionality that you need.
With some Java knowledge it would also be easy to adapt the application to your needs. You will be surprised how simple the source code of the demo application is. If Java shouldn't be the language of your choice, its wrapper for Python, PyLucene, may also be an alternative. The indexing of the demo application is already reduced nearly to the minimum. By default no advanced functionality is used, like stemming or optimization for complex queries - features you most likely will not need for your use case but which would increase the size of the index and the indexing time.
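If you end up adapting the library directly rather than the demo, a minimal indexing-and-search sketch against the standard Lucene API (roughly versions 8/9) looks something like the following; the paths and field names are illustrative:
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        Path indexDir = Paths.get("/tmp/medical-index"); // illustrative index location

        // Index one file: store the path, index (but do not store) the content.
        try (FSDirectory dir = FSDirectory.open(indexDir);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
            Path file = Paths.get("/data/records/1990/patient-0001.txt"); // illustrative file
            Document doc = new Document();
            doc.add(new StringField("path", file.toString(), Field.Store.YES));
            doc.add(new TextField("body", Files.readString(file), Field.Store.NO));
            writer.addDocument(doc); // incremental: re-open the writer later to add more files
        }

        // Search for a term and print the files it occurs in.
        try (FSDirectory dir = FSDirectory.open(indexDir);
             DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            TopDocs hits = searcher.search(new TermQuery(new Term("body", "searchterm")), 100);
            for (ScoreDoc hit : hits.scoreDocs) {
                System.out.println(searcher.doc(hit.doc).get("path"));
            }
        }
    }
}
The demo application essentially wraps this same loop over a directory tree, which is why adapting it tends to be straightforward.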
I see 3 options for you.
1. You should really consider upgrading your hardware; an HDD -> SSD upgrade can multiply the speed of search several times over.
2. Increase the speed of your search on the spot.
You can refer to this question for various recommendations. The main idea of this method is to optimize CPU load, but you will still be limited by your HDD speed. The maximum speed multiplier is the number of your cores.
3. Index your dataset.
Because you're working with texts, you would need a full-text search database. Elasticsearch and Postgres are good options.
This method requires more disk space (but usually less than 2x the space, depending on the data structure and the list of fields you want to index).
This method will be orders of magnitude faster (seconds).
If you decide to use this method, select the analyzer configuration carefully to match what counts as a single word for your task (here is an example for Elasticsearch); a Postgres sketch follows below.
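For the Postgres option mentioned above, here is a minimal sketch using its built-in full-text search through JDBC. The table, columns and connection details are assumptions made for illustration, not part of the question:
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.Statement;

// Sketch: full-text search in Postgres over a hypothetical "records(path, body)" table.
public class PostgresFts {
    public static void main(String[] args) throws Exception {
        String url = "jdbc:postgresql://localhost:5432/medical"; // illustrative connection
        try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
            // One-time: expression index so queries don't re-parse every row.
            try (Statement st = conn.createStatement()) {
                st.execute("CREATE INDEX IF NOT EXISTS records_body_fts "
                         + "ON records USING GIN (to_tsvector('english', body))");
            }
            // Query: which files contain the search term?
            String sql = "SELECT path FROM records "
                       + "WHERE to_tsvector('english', body) @@ plainto_tsquery('english', ?)";
            try (PreparedStatement ps = conn.prepareStatement(sql)) {
                ps.setString(1, "searchTerm");
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        System.out.println(rs.getString("path"));
                    }
                }
            }
        }
    }
}
The 'english' text-search configuration is the part that decides what counts as a word, which is the analogue of the Elasticsearch analyzer choice mentioned above.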
It's worth covering the topic at two levels: the approach, and the specific software to use.
Approach:
Based on the way you describe the data, it looks like pre-indexing will provide significant help. Pre-indexing performs a one-time scan of the data and builds a compact index that makes it possible to perform quick searches and identify where specific terms appear in the repository.
Depending on the queries, the index will reduce or completely eliminate the need to search through the actual documents, even for complex queries like 'find all documents where AAA and BBB appear together'.
Specific Tool
The hardware that you describe is relatively basic. Running complex searches will benefit from large-memory, multi-core hardware. There are excellent solutions out there - Elasticsearch, Solr and similar tools can do magic, given strong hardware to support them.
I believe you want to look into two options, depending on your skills and the data (it would help if a sample of the data could be shared by the OP):
* Build your own index, using a light-weight database (SQLite, PostgreSQL), OR
* Use a light-weight search engine.
For the second approach, given the described hardware, I would recommend looking into 'glimpse' (and the supporting agrep utility). Glimpse provides a way to pre-index the data, which makes searches extremely fast. I've used it on big data repositories (a few GB, but never TB).
See: https://github.com/gvelez17/glimpse
Clearly it is not as modern and feature-rich as Elasticsearch, but it is much easier to set up. It is server-less. The main benefit for the use case described by the OP is the ability to scan existing files, without having to load the documents into an additional search-engine repository.
Can you think about ingesting all this data into Elasticsearch, if it has a consistent data structure format?
If yes, below are the quick steps:
1. Install Filebeat on your local computer.
2. Install Elasticsearch and Kibana as well.
3. Export the data by making Filebeat send all the data to Elasticsearch.
4. Start searching it easily from Kibana.
FSCrawler might help you index the data into Elasticsearch. After that, normal Elasticsearch queries can serve as your search engine.
I think that if you cache the most recently searched medical data, it might help performance-wise; instead of going through the whole 1 TB each time, you can use Redis/Memcached.

inexact string search - short query string to huge database (blast?)

I have an OCR system that recognises a few short query strings (4-12 letters) in a given picture, and I would like to match these recognised words against a big database of known words. I've already built a confusion matrix over the alphabet used, from the most common mistakes, and I tried a full Gotoh alignment against all words in my database and found (not surprisingly) that this is too time consuming.
So I am looking for a heuristic approach to match these words to the database (allowing mismatches). Does anyone know of an available library or algorithm that could help me out?
I've already thought about using BLAST or FASTA, but the way I understand it, both are limited to the standard amino-acid alphabet, and I would like to use all letters and numbers.
Thank you for your help!
Not an expert, but I've done some reading on bioinformatics (which isn't the topic, but is related). You could use suffix trees or related data structures to search the database more quickly. I believe the time required to construct the tree is linear in the database length, and the time required to query the tree is linear in the length of the query string, so if you have a lot of relatively short query strings this sounds like the perfect data structure for you. More reading can be found on the Wikipedia page for suffix trees.
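As a toy illustration of the suffix-structure idea (this is a naive suffix array, not a real suffix tree, and its construction is far from linear time), exact-substring lookup over a concatenated dictionary could look like the following; the mismatch tolerance from the question would still have to be layered on top, e.g. by querying several likely OCR variants:
import java.util.Arrays;

// Naive suffix-array sketch: sort all suffix start positions of the concatenated
// dictionary, then binary-search for the query as a prefix of some suffix.
public class SuffixSearch {
    private final String text;
    private final Integer[] suffixes;

    SuffixSearch(String[] words) {
        // Join the dictionary with a separator that never appears in queries.
        StringBuilder sb = new StringBuilder();
        for (String w : words) sb.append(w).append('\u0001');
        this.text = sb.toString();
        this.suffixes = new Integer[text.length()];
        for (int i = 0; i < text.length(); i++) suffixes[i] = i;
        // O(n^2 log n) worst case; fine for a sketch, not for a huge dictionary.
        Arrays.sort(suffixes, (a, b) -> text.substring(a).compareTo(text.substring(b)));
    }

    boolean containsSubstring(String query) {
        int lo = 0, hi = suffixes.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            String prefix = text.substring(suffixes[mid],
                    Math.min(text.length(), suffixes[mid] + query.length()));
            int cmp = prefix.compareTo(query);
            if (cmp == 0) return true;
            if (cmp < 0) lo = mid + 1; else hi = mid - 1;
        }
        return false;
    }

    public static void main(String[] args) {
        SuffixSearch s = new SuffixSearch(new String[] {"diagnosis", "symptom", "penicillin"});
        System.out.println(s.containsSubstring("cill")); // true
        System.out.println(s.containsSubstring("xyz"));  // false
    }
}
A production version would use a linear-time suffix array or suffix automaton library, but the lookup idea is the same.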

Fuzzy String Matching

I have a requirement within my application to fuzzy match a string value inputted by the user, against a datastore.
I am basically attempting to find possible duplicates in the process in which data is added to the system.
I have looked at Metaphone, Double Metaphone, and SoundEx, and the conclusion I have come to is that they are all well and good when dealing with a single-word input string; however, I am trying to match against an undefined number of words (they are actually place names).
I did consider splitting each of the words out of the string (removing any I define as noise words), then implementing some logic which would determine which place names within my data store best matched (based on the keys from the algorithm I chose); the advantage I see in this is that I could selectively tighten up or loosen the match criteria to suit the application. However, this does seem a little dirty to me.
So my question(s) are:
1: Am I approaching this problem in the right way? Yes, I understand it will be quite expensive; however (without going too deeply into the implementation) this information will be coming from a memcache database.
2: Are there any algorithms out there, that already specialise in phonetically matching multiple words? If so, could you please provide me with some information on them, and if possible their strengths and limitations.
You may want to look into a Locality-Sensitive Hash such as the Nilsimsa hash. I have used Nilsimsa to "hash" Craigslist posts across various cities to search for duplicates (NOTE: I'm not a CL employee, just a personal project I was working on).
Most of these methods aren't as tunable as you may want (basically you get some loosely defined "edit distance" metric), and they're not phonetic, only character based.
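To make the "edit distance" remark concrete (this is an illustration, not part of the original answer, and it is character-based rather than phonetic), a plain Levenshtein distance that you could apply per word after splitting the place names:
// Classic dynamic-programming Levenshtein (edit) distance between two strings.
public class EditDistance {
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(levenshtein("stratford upon avon", "stratford-on-avon")); // small distance
        System.out.println(levenshtein("london", "paris"));                          // large distance
    }
}
The threshold you accept as a "match" is the knob for tightening or loosening the criteria mentioned in the question.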

What code could be used as a string aggregator for Sybase? (Like Oracle's stragg)

In my travels in Oracle, the 'stragg' function, or 'String Aggregator' was life-saving when I had to create dynamic SQL queries on the fly.
You can read up about it here: http://www.oratechinfo.co.uk/delimited_lists_to_collections.html
The basic use of it was:
select stragg(fruit) from food;
fruit
-----------
apple,pear,banana,strawberry
1 row(s) returned
So simple to use, concatenating chr(13) turned it into a long list, and selecting information from system tables gave a 5 minute solution to dynamically generated SQL, e.g. auditing triggers.
Now I've been charged with transferring oracle functionality related to auditing into Sybase, and a function similar to Stragg would be ideal for this purpose.
E.g.
declare @my_table varchar(30)
select @my_table = 'table_of_fruit'
select 'insert into '+@my_table+'_copy (' +char(10)
+ stragg(c.name) + ')' +char(10)
+ 'select '
+ stragg('inserted.'+c.name) + char(10)
+ 'from '+@my_table
from syscolumns c
where object_id(@my_table) = c.id
------------------------------------------
insert into table_of_fruit_copy
(fruit, sweetness, price)
select fruit, sweetness,price
from inserted
Done. Simple.
Except I don't know how to get a string-aggregation function working in Sybase.
Does anyone know of an attempt to do this kind of thing, or code that could work the same as stragg that could be used in this way?
The alternative at the moment is printing code based on complex cursors and such (sample LOC: 500), or select statements combining static strings and columns from user tables (sample LOC: 200). Stragg would severely reduce the complexity of this code, and would be a great deal of help in the future (sample LOC: who knows, maybe 50?)
p.s. I'm calling these selects through a shell script then piping them to file, then running the file through iSQL. Not the nicest solution, but it's better than the alternatives.
There are three separate answers
Question
You have made comments about simplicity, which need to be addressed before we get to the solution.
It is a common requirement to be able to take a delimited list of values, say A,B,C,D, and treat this data as if it were a set of rows in a table, or vice versa.
This is one of the Top Ten Worst Programming Practices I read about recently.
In general, Sybase types tend to be somewhat more academically and Relationally qualified than Oracle types, so we simply do not do that sort of thing in SybaseLand or DB2Land.
In 20 years of working with Sybase, I have had to code that as part of my project just once, and that was for a non-technical auditor who loaded the result set into MS Access.
On the other hand, I have had to code that at least 12 times when producing text files for importation into Oracle databases (fulfilling external requirements is outside my project, but I satisfy any such requirement for free). Obviously the target databases were sub-standard and non-relational (loading a column with more than one datum breaks 1NF and creates update anomalies), which is typical of what Oracle types have to do to get some speed.
Therefore, no, it is not simplicity, at least in the sense of that principle. It is, by definition, complexity.
Your reference to "arrays" is incorrect. All commercial DBMS handle arrays according to ISO/IEC/ANSI SQL (STRAGG and LIST are non-standard operators, therefore not SQL). Sybase is very strong at processing arrays. If it were an array, you would not need special hand coding to handle it (and you do, as per your question). This is not an array; there is no definition for the cells. It is a single concatenated scalar string.
Pivoting is an entirely different process, which uses set-processing; it does not require row-processing. (I understand on good authority that Oracle is hopeless at scalar subqueries, and thus Oracle people are used to writing them as [very inefficient] joins or inline views, and then filtering: all that can be elevated to set-processing via scalar subqueries, and it will perform much faster. Particularly your Pivots.)
Even the author in your link posts as follows. Please familiarise yourself with the caveats:
It's as simple as this: If you want to have a system with no logical limitation in the number of data elements passed to a given process, then forget the following mechanisms! They are simply the wrong way to approach the problem.
Therefore, know that whatever you are doing is sub-standard, non-relational, and limited; and go ahead with your eyes open. There is no use pretending that it will not break, that it is not limited, that it is an "array", or that Sybase doesn't have a neat little function that Oracle has. Any professional will see through all that. And if the string length is exceeded, for God's sake send some indicator back to the caller ("!Exceeded" in the string) identifying that condition.
Essentially you are turning the set-processing engine on its head, and forcing it into row-processing mode, so it will be very slow. A WHILE loop is distinctly faster than a cursor, but both are in the same class, row-processors.
The alternative at the moment is printing code based on complex cursors and such
What 200 or 500 LoC? It is possible I am missing something, but my code is the same few lines of code identified under "Using a Table Function" in your link. Maximum 20, if you count nice formatting, the loop, initialisation and error handling. There is nothing "complex" about it. Do the exact reverse to concatenate a single string from multiple rows. We use stored procedures for this (which Oracle does not have, really; PL/SQL is a different animal). If you have ASE 15.0.2 or greater, you can use a User Defined Function, which you can then use in place of a column. Stored procs are better for true arrays.
The concatenation operator in Sybase is the plus sign. For reversal (decomposing the CSV string) you need the CHARINDEX and SUBSTRING functions.
You may need the Function Reference Manual, if for nothing else, to avoid writing code where we have functions.
Likewise, we do not have a RANK() function. We are quite happy with the 4 lines of code required for the subquery. It is only required for Oracle because its subqueries are crippled.
Ok, I have answered your question. Now to address the approaches.
You will be aware that code using Oracle Extensions to the SQL standard will need to be changed.
Sybase is way more automated than Oracle; if you familiarise yourself with its feature set, in many instances you can get the same result (as you did in Oracle) without writing any code. Writing code-for-code blocks is the chain-gang, rock-breaking method of building roads in an age of bulldozers. Even if your company had good reason to use that method, you need to be aware that features work quite differently, e.g. triggers, which is why I am posting so much detail.
Another issue that will annoy you is that Oracle isn't really ANSI SQL compliant (stretches the definitions in many places, in order to appear to be compliant), and Sybase, given its customer base, is rigidly SQL compliant. So in addition to the same function working differently, or in a different deployment, you need to be aware that code changes may be required to elevate Oracle code to ANSI compliance levels, just to execute on an ANSI SQL compliant platform.
I am not sure if you are trying to write code for the content of a trigger, or if you are trying to capture the changes to a database. I will provide both answers.
Auditing
Capture Changes to Database
We have a very robust, fast and configurable auditing subsystem, fit for high volumes and banking-level auditing requirements. Get your DBA to set up the sybaudit (separate) database, and to configure exactly what changes need to be captured. This facility will perform much faster than any code you or I can write in a trigger (as much as 100 times faster than the row-by-row processing required for the above), as it is executed within the engine, within your executing thread. And of course the setup time is a fraction of your coding time.
Triggers
Again, I am not sure exactly what you are trying to achieve, but assuming you want to copy every insert on some table into a COPY of that table (inside the trigger), the example code you have provided will not work (and I am not counting syntax issues).
Speaking to your example, you need to do way more work, to deal with the different datatypes; column sizes; precisions; scale; etc. And perhaps the UPDATE() function to identify which columns have changed (for an UPDATE trigger of course). If all you are trying to do is convert the various datatypes to strings, check the CONVERT() function.
Triggers are transactional.
Never place row-processing code in a Trigger (it will strangle the table)
You can't place Dynamic SQL in a Trigger.
But in Sybase even that is not necessary. Refer to the User Guide, chapter 19 is devoted to Triggers, with several variations, and examples. Inside the trigger, you should be able to simply:
INSERT table_copy
SELECT column_list -- never use * unless you want the db fixed in cement
FROM inserted
If you are trying to copy the inserts on all tables into one Audit table, then beware; now I understand your example a little bit more. You will be forcing a highly Symmetric Multi-Threading server (Oracle is not a server in the architecture sense) into single-threading through your table. Auditing is multi-threaded.
Last, the use of manual methods of any kind is not required, so if you could expand a bit more on your PS, what the requirement you are trying to fulfil is, I can identify the programmatic method for you. It appears you are trying to use the PL/SQL approach (which is very limited).
Just use the LIST() function. It's a direct replacement for stragg() function. Example:
SELECT LIST(state, ', ') FROM cities
Result:
name
CA, CA, MA, NY
