How to set maximum sentence length in spacy? - nlp

I have a string that I converted to a spaCy Doc. However, when I iterate over the Doc.sents object, some of the sentences I get are too long.
Is there a way, when doing doc = nlp(string), to set a maximum length for a single sentence?
Thanks a lot, this would really help.

No, there is no way to do this.
In natural language there's no strict limit on the length of a sentence, even if in practice sentences rarely get extremely long. Imagine a sentence listing every kind of fruit, for example.
Partly because of that, it's not clear what to do with overlong sentences. Do you split them into segments of the max length or less? Do you throw them out entirely, or cut off words after the first chunk? The right approach depends on your application.
It should typically be easy to implement the strategy you want on top of the .sents iterator.
To split sentences into chunks no longer than a maximum length, you can do something like this:
def my_sents(doc, max_len):
    for sent in doc.sents:
        if len(sent) < max_len:
            yield sent
            continue
        # this is a long one
        offset = 0
        while offset < len(sent):
            yield sent[offset:offset + max_len]
            offset += max_len
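
A quick usage sketch, assuming a pipeline such as en_core_web_sm that sets sentence boundaries (the threshold of 64 tokens below is arbitrary):

import spacy

nlp = spacy.load("en_core_web_sm")  # any pipeline that sets sentence boundaries
doc = nlp("A short sentence. " + "word " * 300)
for chunk in my_sents(doc, max_len=64):
    print(len(chunk), chunk.text[:40])

Each chunk is a Span of at most 64 tokens, so downstream code that expects bounded inputs can consume it directly.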
However, note that for many applications this isn't useful. If you have a max length for sentences you should really think about why you have it and adjust your approach based on that.

Related

What value to set for max_len in pad sequences?

Does the value of max_len in pad_sequences for deep learning depend on the use case? For example, if it were a Twitter-related classification task, should the value be set to 280 (280 being the maximum number of characters in a tweet)?
Absolutely not. After you convert the texts into sequences with a tokenizer that has been fitted on the list of tweets, you can iterate over those sequences to find their lengths.
The max_len parameter of the pad_sequences function refers to the maximum length of a sequence in tokens, not to the length of a tweet in characters.
You also don't need to set it to the maximum length found among the tweet sequences; you can even set it lower than that. With that approach, though, it is better to remove stopwords and filter out unwanted characters before fitting the tokenizer on the list of tweets.
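
A minimal sketch of that workflow, assuming the Keras tokenizer and pad_sequences the question implies (Keras spells the parameter maxlen; the tweets below are made up):

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tweets = ["great day at the beach", "so tired of this rain", "coffee first"]

tokenizer = Tokenizer()
tokenizer.fit_on_texts(tweets)
sequences = tokenizer.texts_to_sequences(tweets)

# Look at the actual token lengths instead of guessing from character counts.
lengths = [len(seq) for seq in sequences]
print(max(lengths))  # far below 280 for word-level tokens

# Choose the padding length from the length distribution, not from 280.
padded = pad_sequences(sequences, maxlen=max(lengths))

Picking, say, the 95th percentile of the lengths instead of the maximum is a common compromise that keeps the padded tensors small.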

Brain.js dynamic array

Quick question about Brain.js, on the server side.
I'm building a neural network that works with strings, so I encode everything and pad the shorter arrays with 0 values. But I'm wondering something: what if the user writes a string longer than all the strings I used in my dataset?
I tried it and it didn't crash, but I'm wondering whether Brain.js actually uses all the values of the new, longer string.
Thanks in advance for the information!
So after some tests, this is my conclusion.
The longest array in my dataset has a size of 10; after training, the AI classifies this array as class 1 with a probability of 0.987...
I then gave the AI an array of size 50 whose first 10 values are the same as that array. The AI again answers class 1 with a probability of 0.987... (in fact exactly the same result as before). That seems logical: I think the extra values of the array are ignored because there are no input neurons for them.
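
Given that, it is safest to pad or truncate every input to the size the network was trained on. A minimal Python sketch of that preprocessing idea (Brain.js itself is JavaScript; the size of 10 matches the example above, and the encoded values are made up):

def to_fixed_length(values, size=10, pad_value=0):
    # Truncate anything longer than the trained input size and
    # zero-pad anything shorter, so every vector has exactly `size` entries.
    values = list(values)[:size]
    return values + [pad_value] * (size - len(values))

encoded = [0.2, 0.5, 0.9, 0.1]           # stand-in for an encoded string
print(to_fixed_length(encoded))           # [0.2, 0.5, 0.9, 0.1, 0, 0, 0, 0, 0, 0]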

How do I limit word length in FastText?

I am using FastText to compute skipgrams on a corpus containing a long sequence of characters with no spaces. After an hour or so, FastText produces a model containing vectors (of length 100) corresponding to "words" of length 50 characters from the corpus.
I tried setting the -minn and -maxn parameters, but that does not help (I kind of knew it wouldn't, but tried anyway), and the -wordNgrams parameter only applies if there are spaces, I guess (?!). This is just a long stream of characters representing state, without spaces.
The documentation doesn't seem to have any information on this (or perhaps I'm missing something?)
The tool just takes whatever space-delimited tokens you feed it.
If you want to truncate, or discard, tokens that are longer than 50 characters (or any other threshold), you'd need to preprocess the data yourself.
(If your question is actually something else, add more details to the question showing example lines from your corpus, how you're invoking fasttext on it, how you're reviewing unsatisfactory results, and how you would expect satisfactory results to look instead.)
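
A minimal preprocessing sketch along those lines, assuming a whitespace-tokenized corpus file and a 50-character threshold (the file names are placeholders):

def filter_long_tokens(in_path, out_path, max_chars=50, truncate=True):
    # Rewrite the corpus, truncating (or dropping) any token longer than
    # max_chars before the file is handed to fastText.
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for line in src:
            kept = []
            for tok in line.split():
                if len(tok) <= max_chars:
                    kept.append(tok)
                elif truncate:
                    kept.append(tok[:max_chars])
                # else: drop the overlong token entirely
            dst.write(" ".join(kept) + "\n")

filter_long_tokens("corpus.txt", "corpus.filtered.txt")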

How to find unknown repeated patterns in the set of strings?

Here is a description of the problem. Suppose you have a set of strings (up to 10 billion strings, each up to 10k characters long, built from an alphabet of 1000 unique symbols). How can I find patterns of length 2 up to length N (let's say 10 for simplicity)? I'd also like to see only those patterns which occur in at least 1% of all strings (some threshold).
I'd like to find an algorithm which can help me solve this problem. The numbers are not exact, but they are of the same order of magnitude as in our project.
Thank you
Index all your strings in a suffix tree. Building it is O(number of characters), and you only need to do it once before you start.
A suffix tree allows you to quickly (in O(pattern length)) tell whether a pattern appears in any of the strings you've indexed, and how many times.
You can do another pass through the structure and count the number of leaves in each subtree (O(N) again); that tells you how often the substring from the root to that node occurs, so you can drop patterns or do whatever you want based on how common they are.
Now, 10 billion strings of length 10k, with 2-byte characters (to fit the 1000 unique symbols), is quite large (roughly 200 TB if my math is right), which doesn't fit in RAM. So you'll either need to wait a while or get more computers and set up a distributed solution. You can apply the solution above to batches of strings so that each batch fits into your available memory, but the lookup cost then gets multiplied by the number of batches.
If everything is processed in batches, the most efficient way is to make the batches as big as you can; then, once you've built the suffix tree for a batch, run all your queries through it, save the results, and drop the tree to free memory for the next batch of input strings.
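
To make the counting-and-thresholding part concrete, here is a brute-force Python sketch for a single small batch; it enumerates substrings directly instead of using a suffix tree, so at the scale in the question you would swap the inner loops for the suffix-tree lookups described above:

from collections import Counter

def frequent_patterns(strings, min_len=2, max_len=10, threshold=0.01):
    # Count, for each pattern, the number of strings containing it at least once.
    containing = Counter()
    for s in strings:
        seen = set()
        for n in range(min_len, max_len + 1):
            for i in range(len(s) - n + 1):
                seen.add(s[i:i + n])
        containing.update(seen)
    cutoff = threshold * len(strings)
    return {p: c for p, c in containing.items() if c >= cutoff}

batch = ["abcabcabc", "xxabcxx", "zzzzzz"]
print(frequent_patterns(batch, threshold=0.5))  # patterns present in at least half the strings

Counting each pattern once per string matches the "occurs in at least 1% of all strings" threshold; if you need total occurrence counts instead, drop the per-string set.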

Find most repeated phrase on huge text

I have huge text data. My entire database is text in UTF-8 format.
I need a list of the most repeated phrases across my whole text data.
For example, my desired output would be something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing every phrase takes a huge amount of database space,
for example in MySQL or MongoDB.
My question is: is there a more efficient database or algorithm for finding this result?
Solr, Elasticsearch, etc.?
I think a maximum of 10 words per phrase would be good for me.
I'd suggest combining ideas from two fields here: streaming algorithms, and the Apriori algorithm from market-basket analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
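
A minimal sketch of the two stages, treating the corpus as an in-memory list of words purely for brevity (a real implementation would stream it twice); f is the frequency threshold as a fraction of the corpus:

import math
import random
from collections import Counter

def sampled_top_k(corpus_words, k, f):
    n = len(corpus_words)
    # Stage 1: sample ~log(n)/f words; with high probability every word whose
    # frequency is at least f (as a fraction of n) lands in the sample.
    sample_size = max(1, int(math.log(n) / f))
    candidates = set(random.choices(corpus_words, k=sample_size))
    # Stage 2: one full pass, counting only the candidate words.
    counts = Counter(w for w in corpus_words if w in candidates)
    return counts.most_common(k)

The second-stage Counter is exactly the per-segment dictionary described above, so partitioning the corpus and merging the Counters parallelizes it directly.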
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words; I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job that sorts them by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
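
An in-process Python sketch of the mapper/reducer logic (on Hadoop the two functions would run as separate map and reduce tasks, with the shuffle grouping pairs by key; the sample lines are made up):

from collections import defaultdict
from itertools import chain

def mapper(line, max_words=10):
    # Emit (phrase, 1) for every window of 1 to max_words consecutive words.
    words = line.split()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n]), 1

def reducer(pairs):
    # Sum the counts for each phrase key.
    totals = defaultdict(int)
    for phrase, count in pairs:
        totals[phrase] += count
    return totals

lines = ["this is a test", "this is another test"]
counts = reducer(chain.from_iterable(mapper(line) for line in lines))
print(sorted(counts.items(), key=lambda kv: -kv[1])[:5])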
Tokenize the text into 1-to-10-word tokens and insert them into 10 SQL tables, one per token length. Make sure to use a hash index on the column with the string tokens. Then just run SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each N-gram update its count by +1 or insert a new row into the table (in MySQL the useful query is INSERT ... ON DUPLICATE KEY UPDATE). You should definitely still use hash indexes, though.
After that, just sort by the number of occurrences and merge the data from these 10 tables (you could do that in a single step, but that would put more strain on memory).
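
A small sketch of the table-per-length upsert idea, using SQLite for portability (SQLite's ON CONFLICT upsert, available in reasonably recent versions, stands in for MySQL's INSERT ... ON DUPLICATE KEY UPDATE, and its primary-key index stands in for the hash index; the file name is a placeholder):

import sqlite3

conn = sqlite3.connect("phrases.db")
for n in range(1, 11):
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS grams_{n} (token TEXT PRIMARY KEY, cnt INTEGER NOT NULL)"
    )

def add_line(line, max_words=10):
    words = line.split()
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            conn.execute(
                f"INSERT INTO grams_{n} (token, cnt) VALUES (?, 1) "
                "ON CONFLICT(token) DO UPDATE SET cnt = cnt + 1",
                (" ".join(words[i:i + n]),),
            )

add_line("this is a test and this is a test")
conn.commit()
print(conn.execute("SELECT token, cnt FROM grams_2 ORDER BY cnt DESC LIMIT 5").fetchall())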
Be wary of heuristic methods like the one suggested by Ami Tavory: if you select the wrong parameters, you can get wrong results. The flaw of the sampling algorithm shows up on some classic terms or phrases, e.g. "habeas corpus": neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending a common word. There is certainly no need to use such methods for tokens of smaller length; use them only when classic methods fail (take too much time or memory).
The top answer by Ami Tavory states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to make this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavori would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
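
A minimal sketch of that pruning idea on a single word list: count m-grams level by level and only extend positions whose m-gram actually repeats (the threshold and example text are illustrative; punctuation handling will change the exact counts if you run it on the corpus above):

from collections import Counter

def repeated_ngrams(words, max_n=10, min_count=2):
    results = {}
    starts = range(len(words))
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in starts:
            if i + n <= len(words):
                counts[tuple(words[i:i + n])] += 1
        kept = {gram: c for gram, c in counts.items() if c >= min_count}
        if not kept:
            break
        results[n] = kept
        # Only positions whose n-gram repeats can start a repeated (n+1)-gram.
        starts = [i for i in starts
                  if i + n <= len(words) and tuple(words[i:i + n]) in kept]
    return results

text = "a tricksy corpus will surprise you since a tricksy corpus will not match the pattern"
print(repeated_ngrams(text.split(), max_n=4))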
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.
