How do I find strings and string patterns in a set of many files? - search

I have a collection of about two million text files that total about 10 GB uncompressed. I would like to find documents in this collection containing phrases such as "every time" or "bill clinton" (simple case-insensitive string matching). I would also like to find phrases with fuzzy content, e.g. "for * weeks".
I've tried indexing with Lucene, but it is no good at finding phrases containing stopwords, as these are removed at index time by default. xargs and grep are a slow solution. What's fast and appropriate for this amount of data?

You may want to check out the ugrep utility for fuzzy search, which is much faster than agrep:
ugrep -i -Z PATTERN ...
This runs multiple threads (typically 8 or more) to search files concurrently. Option -i is for case-insensitive search and -Z enables fuzzy search. You can increase the fuzziness from 1 to 3 with -Z3 to allow up to 3 errors (max edit distance 3), or allow only up to 3 insertions (extra characters) with -Z+3, for example. Unicode regex matching is supported by default; for example, "for" fuzzy-matches "für" (i.e. one substitution).

You could use a PostgreSQL database. It has a full-text search implementation, and by using dictionaries you can define your own stop words. I don't know if it helps much, but I would give it a try.
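If you go that route, here is a minimal sketch of what the query could look like from Python, assuming a hypothetical docs table with a body column; psycopg2 and the 'simple' text-search configuration (which keeps stop words like "every") are my own choices, not something required by the suggestion above:

# A minimal sketch, not a full solution; the table name, column name and
# connection string are placeholders.
import psycopg2

conn = psycopg2.connect("dbname=corpus")
cur = conn.cursor()

# The 'simple' configuration does no stop-word removal or stemming, so a phrase
# like "every time" keeps its function words.
cur.execute("""
    SELECT id
    FROM docs
    WHERE to_tsvector('simple', body) @@ phraseto_tsquery('simple', %s)
""", ("every time",))
print(cur.fetchall())

For a couple of million documents you would also want an expression index, e.g. CREATE INDEX ON docs USING gin (to_tsvector('simple', body)), so that such queries do not have to scan every row.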

Related

Comparing big texts in Python

I'm not good at math, so I'm posting my question here. I hope I won't get tons of dislikes.
I have a lot of big texts, from 200,000 to 1,000,000 characters each, and I need to compare them to find duplicates. I decided to use a fingerprint (an MD5 hash) for each text and then compare the fingerprints. But then I realised another way to compare them: counting the characters in each text.
So which one will be faster, and which one will use less CPU power?
P.S. IMPORTANT: there CANNOT be 2 or more different texts with the same character count.
Taking the length of the string will be a lot faster and use less CPU power.
This is because it is only one operation, it is easy for Python, and it has the benefit of being a built-in function.
To compute an MD5, however, Python needs to do calculations on every character to produce the overall hash, which takes a lot longer.
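A small sketch to make the comparison concrete; the synthetic one-million-character text and the timing harness are mine, and the absolute numbers will vary by machine:

# len() is O(1) for Python strings (the length is stored with the string),
# while MD5 must read every byte of the text.
import hashlib
import timeit

text = "a" * 1_000_000  # stand-in for one of the big texts

length_time = timeit.timeit(lambda: len(text), number=1000)
md5_time = timeit.timeit(lambda: hashlib.md5(text.encode("utf-8")).hexdigest(), number=1000)

print(f"len(): {length_time:.4f} s for 1000 calls")
print(f"md5(): {md5_time:.4f} s for 1000 calls")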
If the texts are exact duplicates, you can take the hashes, or even faster, the lengths of the texts, and sort them (paired with the id of the text, or with a reference to the text itself), identifying repeated lengths (or hashes).
For sorting you can use a fast sorting algorithm, for example quicksort.
In fact, there is even a special *nix command-line utility for sorting items with duplicate removal: sort -u.
If the texts are near duplicates rather than exact ones, things get harder: you need a duplication-aware hashing algorithm and a sort of the resulting hashes by a similarity metric, so that two items count as near duplicates when the distance between them is less than some similarity threshold.
Then pass over the resulting sorted list again and collect the near duplicates.
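A minimal sketch of the exact-duplicate case, assuming the texts are already loaded into a dict keyed by some id (the ids and contents below are placeholders):

# Bucket by length first (cheap), then confirm with a hash so that two different
# texts that merely happen to share a length are not reported as duplicates.
from collections import defaultdict
import hashlib

texts = {
    "doc1": "first long text ...",
    "doc2": "second long text ...",
    "doc3": "first long text ...",
}

by_length = defaultdict(list)
for doc_id, body in texts.items():
    by_length[len(body)].append(doc_id)

by_hash = defaultdict(list)
for doc_ids in by_length.values():
    if len(doc_ids) < 2:
        continue  # a unique length cannot be a duplicate
    for doc_id in doc_ids:
        digest = hashlib.md5(texts[doc_id].encode("utf-8")).hexdigest()
        by_hash[digest].append(doc_id)

for doc_ids in by_hash.values():
    if len(doc_ids) > 1:
        print("duplicates:", doc_ids)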

Mallet topic modeling: remove most common words

I'm new to Mallet and to topic modeling, which I'm applying in the field of art history. I'm working with Mallet 2.0.8 from the command line (I don't know Java yet). I'd like to remove the most common and least common words (those occurring about 10 times in the whole corpus, as D. Mimno recommends) before training the model, because the results aren't clean (even with the stoplist), which is not surprising.
I've found that the prune command could be useful, with options like prune-document-freq. Is that right? Or is there another way? Could someone explain the whole procedure to me in detail (for example: create/input a Vectors2Vectors file, at which stage, and then what)? It would be much appreciated!
I'm sorry for this question; I'm a beginner with Mallet and text mining! But it's quite exciting!
Thanks a lot for your help!
There are two places where you can use Mallet to curate the vocabulary. The first is at data import, for example with the import-file command. The --remove-stopwords option removes a fixed set of English stopwords. This is here for backwards-compatibility reasons, and is probably not a bad idea for some English-language prose, but you can generally do better by creating a custom list. I would instead recommend the --stoplist-file option along with the name of a file. All words in this file, separated by spaces and/or newlines, will be removed. (Using both options will remove the union of the two lists, which is probably not what you want.) Another useful option is --replacement-files, which allows you to specify multi-word strings to treat as single words. For example, this file:
black hole
white dwarf
will convert "black hole" into "black_hole". Here newlines are treated differently from spaces. You can also specify multi-word stopwords with --deletion-files.
Once you have a Mallet file, you can modify it with the prune command. --prune-count N will remove words that occur fewer than N times across the whole corpus. --prune-document-freq N will remove words that occur in fewer than N documents; this version can be more robust against words that occur many times in a single document. You can also prune by proportion: --min-idf removes infrequent words, --max-idf removes frequent words. A word with IDF 10.0 occurs less than once in 20,000 documents; a word with IDF below 2.0 occurs in more than 13% of the collection.
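As a sanity check on those IDF figures, assuming IDF here means the natural log of the total number of documents divided by a word's document frequency (an assumption on my part, not something stated above):

# Assumes IDF = ln(total documents / document frequency); this definition is an
# assumption used only to check the numbers quoted above.
import math

docs_per_occurrence = math.exp(10.0)   # a word with IDF 10.0
print(f"IDF 10.0 -> roughly one document in {docs_per_occurrence:,.0f}")

fraction = math.exp(-2.0)              # a word with IDF 2.0
print(f"IDF 2.0  -> roughly {fraction:.1%} of the collection")

This prints roughly one document in 22,026 and roughly 13.5%, consistent with the figures above.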

Grep internal working principle

I want to know how grep works internally. Specifically, I want to know whether finding only the first match is significantly faster than finding all matches. For example, suppose the first match occurs 10% of the way into the file, while the matches as a whole are spread across the entire file. Then I think finding only the first match will make grep process much less of the file than finding all matches (in which case grep must traverse the whole file, compared to 10% of it in the first case). I want to know whether my assumption is correct, because this possible improvement could vastly speed up my processing work.
Thanks.
If you're using grep to print all matching lines from the file, then of course it has to process the entire file.
On the other hand, if you use grep -q to produce a successful termination status when at least one match is found, then of course grep can stop at the first match. If the first match is found early in the file, that saves time, because grep can immediately exit at that point and return a successful termination status. If no match occurs in the file (the worst case), then it has to process the entire file; otherwise, how could it be sure there is no match? If a match occurred only in the very last line but grep ignored that line, it would wrongly report that there is no match.
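To make the first-match-only behaviour concrete, here is a small Python sketch; it uses plain substring matching rather than grep's real regex machinery, and the function names are mine:

# has_match() behaves like grep -q: it stops at the first hit and never reads the
# rest of the file. count_matches() behaves like plain grep printing every match:
# it must read the file to the end.
def has_match(path, needle):
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                return True   # early exit
    return False

def count_matches(path, needle):
    count = 0
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            if needle in line:
                count += 1    # keep scanning to the end
    return count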
Grep compiles regular-expression patterns into an internal matcher, and there are performance implications in how regular expressions are structured. Some regular expressions perform better than others, and depending on the algorithm used, some regular expressions that appear small can generate state machines with a large number of states.
A technique to speed up searching is indexing. If you're often looking for specific words in a corpus of text, it's faster if you have an index of the words indicating the locations where they are found in the corpus. The index is organized in such a way that the list of locations where a word is found can be retrieved very quickly, without scanning the text. It takes time to build the index (requiring a complete scan of the entire body of text), and the index has to be rebuilt when the corpus changes.
This is the basis for tools that speed up identifier searching over computer program source code, such as GNU Id-Utils. And of course, indexing is the basis for World-Wide-Web search engines like Google.
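A toy version of such an index, assuming a small in-memory corpus (the document names and contents below are placeholders); real indexers such as Lucene or GNU Id-Utils also persist the index and keep it in sync with the corpus, which this sketch does not attempt:

# Build a word -> set-of-documents mapping once; lookups are then dictionary
# accesses instead of scans over the text.
from collections import defaultdict
import re

corpus = {
    "doc1.txt": "bill clinton spoke for two weeks",
    "doc2.txt": "every time it rains",
}

index = defaultdict(set)
for name, text in corpus.items():
    for word in re.findall(r"\w+", text.lower()):
        index[word].add(name)

print(index["clinton"])   # {'doc1.txt'}
print(index["time"])      # {'doc2.txt'}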
A quick look at the grep source code (version 2.18) shows there's a variable in /src/main.c called done_on_match which, if set, is supposed to stop scanning after the first match. This variable is set by -l, -L, and -q (and possibly other options). So yes, searching only for the first match does make grep exit earlier than it otherwise would.
That being said, it's not clear to me that this will make your processing go much faster; your main latency is probably still going to be file I/O.

Given a list of dozens of words, how do I find the best matching sections from a corpus of hundreds of texts?

Let’s say I have a list of 250 words, which may consist of unique entries throughout, or a bunch of words in all their grammatical forms, or all sorts of words in a particular grammatical form (e.g. all in the past tense). I also have a corpus of text that has conveniently been split up into a database of sections, perhaps 150 words each (maybe I would like to determine these sections dynamically in the future, but I shall leave it for now).
My question is this: What is a useful way to get those sections out of the corpus that contain most of my 250 words?
I have looked at a few full text search engines like Lucene, but am not sure they are built to handle long query lists. Bloom filters seem interesting as well. I feel most comfortable in Perl, but if there is something fancy in Ruby or Python, I am happy to learn. Performance is not an issue at this point.
The use case of such a program is in language teaching, where it would be nice to have a variety of word lists that mirror the different extents of learner knowledge, and to quickly find fitting bits of text or examples from original sources. Also, I am just curious to know how to do this.
Effectively what I am looking for is document comparison. I have found a way to rank texts by similarity to a given document, in PostgreSQL.
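Not the PostgreSQL ranking mentioned above, but the simplest possible scoring scheme looks roughly like this in Python; the word list, the sections, and the choice to score by the number of distinct list words present are all placeholders and assumptions:

# Score each section by how many distinct words from the list it contains, then
# rank the sections by that score.
import re

word_list = {"walked", "jumped", "spoke"}   # stand-in for the 250-word list
sections = {
    "section-1": "She walked home and spoke quietly.",
    "section-2": "The results were inconclusive.",
}

def score(text, words):
    tokens = set(re.findall(r"\w+", text.lower()))
    return len(tokens & words)

ranked = sorted(sections, key=lambda name: score(sections[name], word_list), reverse=True)
for name in ranked:
    print(name, score(sections[name], word_list))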

A string searching algorithm to quickly match an abbreviation in a large list of unabbreviated strings?

I am having a lot of trouble finding a string matching algorithm that fits my requirements.
I have a very large database of strings in unabbreviated form that need to be matched against an arbitrary abbreviation. An abbreviation that is an actual substring of the target, with no other letters between its characters, should also match, and with a higher score.
Example: if the word to be matched within was "download" and I searched "down", "ownl", and then "dl", I would get the highest matching score for "down", followed by "ownl" and then "dl".
The algorithm would have to be optimized for speed and for a large number of strings to search through, and should let me pull back a list of matching strings (if I had added both "download" and "upload" to the database, searching "load" should return both). Memory is still important, but not as important as speed.
Any ideas? I've done a bunch of research on some of these algorithms but I haven't found any that even touch abbreviations, let alone with all these conditions!
I'd wonder if Peter Norvig's spell checker could be adapted in some way for this problem.
It's a stretch that I haven't begun to work out, but it's such an elegant solution that it's worth knowing about.
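Separately from the spell-checker idea, here is a tiny sketch of the ranking rule described in the question; the scoring constants are arbitrary, and nothing here addresses speed over a very large database:

# A prefix match scores highest, another contiguous substring next, and a
# scattered-letter (subsequence) match lowest.
def is_subsequence(abbrev, word):
    it = iter(word)
    return all(ch in it for ch in abbrev)

def score(abbrev, word):
    if word.startswith(abbrev):        # e.g. "down" in "download"
        return 3
    if abbrev in word:                 # e.g. "ownl" in "download"
        return 2
    if is_subsequence(abbrev, word):   # e.g. "dl" in "download"
        return 1
    return 0

words = ["download", "upload"]
for query in ["down", "ownl", "dl", "load"]:
    matches = sorted(((score(query, w), w) for w in words if score(query, w) > 0), reverse=True)
    print(query, matches)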
