How do I limit word length in FastText? - nlp

I am using FastText to compute skipgrams on a corpus containing a long sequence of characters with no spaces. After an hour or so, FastText produces a model containing vectors (of length 100) corresponding to "words" of length 50 characters from the corpus.
I tried setting the -minn and -maxn parameters, but that does not help (I suspected it wouldn't, but tried anyway), and the -wordNgrams parameter only applies if there are spaces, I guess(?). This is just a long stream of characters representing state, without spaces.
The documentation doesn't seem to have any information on this (or perhaps I'm missing something?).

The tool just takes whatever space-delimited tokens you feed it.
If you want to truncate, or discard, tokens that are longer than 50 characters (or any other threshold), you'd need to preprocess the data yourself.
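For example, a minimal preprocessing sketch in Python; the 50-character threshold and the file names are illustrative assumptions, not anything fasttext requires:

```python
# Hypothetical preprocessing: drop (or truncate) tokens longer than MAX_LEN
# before handing the file to fasttext. File names are placeholders.
MAX_LEN = 50

with open("corpus.txt", encoding="utf-8") as src, \
     open("corpus.filtered.txt", "w", encoding="utf-8") as dst:
    for line in src:
        tokens = line.split()
        # keep only tokens up to MAX_LEN characters;
        # use [t[:MAX_LEN] for t in tokens] instead if you prefer truncation
        kept = [t for t in tokens if len(t) <= MAX_LEN]
        dst.write(" ".join(kept) + "\n")
```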
(If your question is actually something else, add more details to the question showing example lines from your corpus, how you're invoking fasttext on it, how you are reviewing unsatisfactory results, and how you would expect satisfactory results to look instead.)

Related

Is there a good way to summarize a given text to a specific length?

1. Why asking
I'm doing a regression task using transformers.BertModel (i.e. passing a text to the model, which outputs a score for the text). To my knowledge, BERT can only receive inputs of max_length=512, and my average training-data length is 593. Of course, I can use truncation and padding to modify the input, but this can result in a loss of performance (I know this from comparing the "tail_truncate" and "head_truncate" results, and from some domain knowledge).
2. What is the problem
I want to apply a text-summarization preprocessor to my input text; the expected output text length should be no more than 510 but as close to that as possible (i.e. I don't want a one-line summary). Is there a method, a model, or a library that exists to do this?
3. What I've tried
As I've mentioned above, I have tried to implement tail truncation: for any given text, simply take the last 510 characters, e.g. text[-510:] (accounting for the special tokens [CLS] and [SEP], the actual text length should be 510), then pass that to the BERT model. This improved performance on my task by 2%, which is expected given the nature of the text.
The problem is that quite a few texts are longer than 512 (or even 800) tokens, so truncation could lose tons of useful information. I think text summarization could be a way out, and there should be existing solutions since it's a heavily demanded NLP task. However, I can only find either TextRank/LSA methods (provided by the PyTextRank library), which tell you which sentences are more important, or methods that give you a "one-line" summary (provided by the PaddleNLP library).
More details about the texts:
The task is: given a commutation verdict, predict the reduction of the jail sentence in months.
The corpus is in Chinese, and it is structured like this: what crime the criminal committed, how he/she behaved in jail, and what the judge's opinion is toward commutation.
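For reference, here is a minimal sketch of the tail truncation described in point 3, assuming a Hugging Face BertTokenizer; the checkpoint name and the 510-token budget are illustrative assumptions, not details taken from the question:

```python
from transformers import BertTokenizer

# checkpoint name is a placeholder; use whatever Chinese BERT you train with
tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
MAX_BODY = 510  # 512 minus room for [CLS] and [SEP]

def tail_truncate_ids(text):
    tokens = tokenizer.tokenize(text)
    tokens = tokens[-MAX_BODY:]          # keep only the last 510 wordpieces
    ids = tokenizer.convert_tokens_to_ids(tokens)
    return [tokenizer.cls_token_id] + ids + [tokenizer.sep_token_id]
```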

Use the polarity distribution of word to detect the sentiment of new words

I have just started a project in NLP. Suppose I have a graph for each word that shows the polarity distribution of sentiments for that word in different sentences. I want to know what I can use to recognize the sentiment of new words. I would also be happy to hear about any other uses you have in mind.
I apologize for any possible errors in my writing. Thanks a lot.
Assuming you've got some words that have been hand-labeled with positive/negative sentiments, but then you encounter some new words that aren't labeled:
If you encounter the new words totally alone, outside of contexts, there's not much you can do. (Maybe you could go out and try to find extra texts with those new words, such as via dictionaries or the web, then use those larger texts in the next approach.)
If you encounter the new words inside texts that also include some of your hand-labeled words, you could try guessing that the new words are most like the words you already know that are closest-to, or used-in-the-same-places. This would leverage what's called "the distributional hypothesis" – words with similar distributions have similar meanings – that underlies a lot of computer natural-language analysis, including word2vec.
One simple thing to try along these lines: across all your texts, for every unknown word U, tally up the counts of all neighboring words within N positions. (N could be 1, or larger.) From that, pick the top 5 words occurring most often near the unknown word, look up your prior labels for them, and average those labels together (perhaps weighted by the number of occurrences).
You'll then have a number for the new word.
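A rough sketch of that tally-and-average idea; the window size, the top-5 cutoff, and the `labels` dict mapping known words to scores are all illustrative choices:

```python
from collections import Counter

def guess_label(unknown_word, texts, labels, window=2, top_n=5):
    """texts: list of token lists; labels: dict mapping known words to scores."""
    neighbor_counts = Counter()
    for tokens in texts:
        for i, tok in enumerate(tokens):
            if tok != unknown_word:
                continue
            lo, hi = max(0, i - window), i + window + 1
            for neighbor in tokens[lo:i] + tokens[i + 1:hi]:
                if neighbor in labels:
                    neighbor_counts[neighbor] += 1
    top = neighbor_counts.most_common(top_n)
    if not top:
        return None  # no labeled neighbors observed
    total = sum(count for _, count in top)
    # count-weighted average of the known labels
    return sum(labels[word] * count for word, count in top) / total
```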
Alternatively, you could train a word2vec set-of-word-vectors on all of your texts, including the unknown & known words. Then, ask that model for the N most-similar neighbors to your unknown word. (Again, N could be small or large.) Then, from among those neighbors with known labels, average them together (again perhaps weighted by similarity) to get a number for the previously unknown word.
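A sketch of that word2vec variant using Gensim; the training parameters and the topn value are illustrative:

```python
from gensim.models import Word2Vec

def guess_label_w2v(unknown_word, model, labels, topn=20):
    """model: a trained gensim Word2Vec; labels: dict of known word -> score."""
    weighted_sum, weight_total = 0.0, 0.0
    for neighbor, similarity in model.wv.most_similar(unknown_word, topn=topn):
        if neighbor in labels:
            weighted_sum += labels[neighbor] * similarity
            weight_total += similarity
    return weighted_sum / weight_total if weight_total else None

# texts: list of token lists covering both known and unknown words
# model = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=1)
# score = guess_label_w2v("some_new_word", model, labels)
```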
I wouldn't particularly expect either of these techniques to work very well. The idea that individual words can have a specific sentiment is somewhat weak, given that in actual language their meaning is heavily modified, or even reversed, by the surrounding grammar/context. But in each case these simple calculate-from-neighbors techniques are probably better than random guesses.
If your real aim is to calculate the overall sentiment of longer texts, like sentences, paragraphs, reviews, etc., then you should discard your labels of individual words, acquire/create labels for full texts, and apply real text-classification techniques to those larger texts. A simple word-by-word approach won't do very well compared to other techniques – as long as those techniques have plenty of labeled training data.

My training data contains line breaks; how can I work with Gensim's LineSentence format for the corpus_file parameter?

Per Gensim's documentation, changelog, and previous StackOverflow answers, I know that passing training data in the LineSentence format to the corpus_file parameter can dramatically speed up Any2Vec training.
Documentation on the LineSentence format reads as follows:
Iterate over a file that contains sentences: one line = one sentence. Words must be already preprocessed and separated by whitespace.
My training data comprises tens of millions (and potentially over 100 million) sentences extracted from plaintext files using spaCy. A sample sentence quite often contains one or more line-break characters (\n).
How can I make these samples compatible with the LineSentence format? As far as I understand, these samples should be "understood" in the context of their linebreaks, as these breaks are present in the target text (data not trained upon). That means I can't just remove them from the training data.
Do I escape the newline characters with \\n? Is there a way to pass a custom delimiter?
I appreciate any guidance. Thanks in advance.
LineSentence is only an appropriate iterable for classes like Word2Vec, that expect a corpus to be a Python sequence, where each item is a list of tokens.
The exact placement of linebreaks is unlikely to make much difference in usual word-vector training. If reading a file line-by-line, all words on the same line will still appear in each other's contexts. (Extra linebreaks will just prevent words from 'bleeding' over slightly into the contexts of preceding/subsequent texts – which in a large training corpus probably makes no net difference for end results.)
So mainly: don't worry about it.
If you think it might be a problem you could try either...
Removing the newlines between texts that you conjecture "should" be a single text, creating longer lines. (Note, though, that you don't want any of your texts to be over 10,000 tokens, or else an internal implementation limit in Gensim will mean tokens past the first 10,000 are ignored.)
Replacing the newlines, in texts that you conjecture "should" be a single text, with some synthetic token, like say <nl> (or whatever).
...then evaluate whether the results have improved over simply not doing that. (I doubt they will improve, for basic Word2Vec/FastText training.)
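For example, a minimal preprocessing sketch that writes one whitespace-tokenized sentence per line, replacing internal newlines either with a space or with a synthetic token; the function name and file paths are placeholders:

```python
def write_line_sentences(sentences, path, newline_replacement=" "):
    """sentences: an iterable of sentence strings (e.g. the text of spaCy's doc.sents).
    Writes one whitespace-separated sentence per line, matching the
    LineSentence / corpus_file format. Pass newline_replacement=" <nl> " to keep
    a marker token instead of silently joining the pieces."""
    with open(path, "w", encoding="utf-8") as out:
        for sent in sentences:
            tokens = sent.replace("\n", newline_replacement).split()
            if tokens:
                out.write(" ".join(tokens) + "\n")

# Usage (illustrative):
# write_line_sentences(my_sentences, "corpus.txt")
# from gensim.models import Word2Vec
# model = Word2Vec(corpus_file="corpus.txt", vector_size=100, window=5, min_count=5)
```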
For Doc2Vec, you might have to pay more attention to ensuring that all words of a 'document' are handled as a single text. In that case, you should make sure that whatever iterable sequence you have that produces TaggedDocument-like objects assigns the desired, same tag to all raw text that should be considered part of the same document.

Find most repeated phrase on huge text

I have huge text data. My entire database is plain text in UTF-8.
I need a list of the most repeated phrases across my whole text data.
For example, my desired output would be something like this:
{
'a': 423412341,
'this': 423412341,
'is': 322472341,
'this is': 222472341,
'this is a': 122472341,
'this is a my': 5235634
}
Processing and storing every phrase takes up a huge amount of database space, for example in MySQL or MongoDB.
My question is: is there a more efficient database or algorithm for finding this result? Solr, Elasticsearch, etc.?
I think a maximum of 10 words per phrase would be good enough for me.
I'd suggest combining ideas from two fields, here: Streaming Algorithms, and the Apriori Algorithm From Market-Basket Analysis.
Let's start with the problem of finding the k most frequent single words without loading the entire corpus into memory. A very simple algorithm, Sampling (see Finding Frequent Items in Data Streams), can do so very easily. Moreover, it is very amenable to parallel implementation (described below). There is a plethora of work on top-k queries, including some on distributed versions (see, e.g., Efficient Top-K Query Calculation in Distributed Networks).
Now to the problem of the k most frequent phrases (of possibly multiple words). Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity. Hence, once you have the k most frequent single words, you can scan the corpus for only them (which is faster) to build the most frequent phrases of length 2. Using this, you can build the most frequent phrases of length 3, and so on. The stopping condition is when a phrase of length l + 1 does not evict any phrase of length l.
A Short Description of The Sampling Algorithm
This is a very simple algorithm which will, with high probability, find the top k items out of those having frequency at least f. It operates in two stages: the first finds candidate elements, and the second counts them.
In the first stage, randomly select ~ log(n) / f words from the corpus (note that this is much less than n). With high probability, all your desired words appear in the set of these words.
In the second stage, maintain a dictionary of the counts of these candidate elements; scan the corpus, and count the occurrences.
Output the top k of the items resulting from the second stage.
Note that the second stage is very amenable to parallel implementation. If you partition the text into different segments, and count the occurrences in each segment, you can easily combine the dictionaries at the end.
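A rough single-machine sketch of those two stages, assuming f is a relative frequency threshold and treating the corpus as an in-memory token list (both simplifications for illustration):

```python
import math
import random
from collections import Counter

def frequent_words(corpus_tokens, k, f):
    """corpus_tokens: list of words; f: minimum relative frequency of interest."""
    n = len(corpus_tokens)
    # Stage 1: sample ~log(n)/f positions; with high probability every word
    # with frequency >= f shows up among the sampled candidates.
    sample_size = min(n, int(math.log(n) / f) + 1)
    candidates = set(random.sample(corpus_tokens, sample_size))
    # Stage 2: one full pass, counting only the candidate words.
    counts = Counter(tok for tok in corpus_tokens if tok in candidates)
    return counts.most_common(k)
```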
If you can store the data in Apache Solr, then the Luke Request Handler could be used to find the most common phrases. Example query:
http://127.0.0.1:8983/solr/admin/luke?fl=fulltext&numTerms=100
Additionally, the Terms Component may help find the most common individual words. Here is an article about Self Updating Solr Stopwords which uses the Terms Component to find the 100 most common indexed words and add them to the Stopwords file. Example query:
http://127.0.0.1:8983/solr/terms?terms.fl=fulltext&terms.limit=100
Have you considered using MapReduce?
Assuming you have access to a proper infrastructure, this seems to be a clear fit for it. You will need a tokenizer that splits lines into multi-word tokens of up to 10 words; I don't think that's a big deal. The outcome of the MR job will be token -> frequency pairs, which you can pass to another job to sort them by frequency (one option). I would suggest reading up on Hadoop/MapReduce before considering other solutions. You may also use HBase to store any intermediary outputs.
Original paper on MapReduce by Google.
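For illustration, a Hadoop-Streaming-style mapper and reducer sketched in Python; the 10-word cap and the tab-separated key/value convention are assumptions about how you would wire the job up:

```python
import sys

MAX_WORDS = 10  # phrases of 1..10 words

def mapper(lines):
    """Emit (phrase, 1) pairs for every n-gram of up to MAX_WORDS words."""
    for line in lines:
        tokens = line.split()
        for i in range(len(tokens)):
            for n in range(1, MAX_WORDS + 1):
                if i + n > len(tokens):
                    break
                print(" ".join(tokens[i:i + n]) + "\t1")

def reducer(sorted_pairs):
    """Input must arrive sorted by key (as Hadoop guarantees); sums the counts."""
    current, total = None, 0
    for line in sorted_pairs:
        key, value = line.rsplit("\t", 1)
        if key != current:
            if current is not None:
                print(f"{current}\t{total}")
            current, total = key, 0
        total += int(value)
    if current is not None:
        print(f"{current}\t{total}")

if __name__ == "__main__":
    # Illustrative usage with Hadoop Streaming:
    #   -mapper "python ngrams_mr.py map" -reducer "python ngrams_mr.py reduce"
    role = sys.argv[1] if len(sys.argv) > 1 else "map"
    (mapper if role == "map" else reducer)(sys.stdin)
```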
Tokenize the text into n-grams of 1 to 10 words and insert them into 10 SQL tables, one per token length. Make sure to use a hash index on the column with the string tokens. Then just run SELECT token, COUNT(*) FROM tablename GROUP BY token on each table, dump the results somewhere, and wait.
EDIT: that would be infeasible for large datasets; instead, for each n-gram either update its count by +1 or insert a new row into the table (in MySQL, INSERT ... ON DUPLICATE KEY UPDATE is the useful query here). You should definitely still use hash indexes, though.
After that, just sort by number of occurrences and merge the data from these 10 tables (you could do that in a single step, but that would put more strain on memory).
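A self-contained sketch of that insert-or-increment counting pattern, using SQLite here purely for illustration (its ON CONFLICT clause plays the role of MySQL's ON DUPLICATE KEY UPDATE; the table and file names are placeholders):

```python
import sqlite3

conn = sqlite3.connect("ngrams.db")
conn.execute("""CREATE TABLE IF NOT EXISTS ngram_2 (
                   token TEXT PRIMARY KEY,
                   cnt   INTEGER NOT NULL DEFAULT 0)""")

def add_bigrams(tokens):
    # for each 2-gram, insert a row or bump its existing count by 1
    for i in range(len(tokens) - 1):
        gram = " ".join(tokens[i:i + 2])
        conn.execute(
            "INSERT INTO ngram_2(token, cnt) VALUES(?, 1) "
            "ON CONFLICT(token) DO UPDATE SET cnt = cnt + 1",
            (gram,),
        )
    conn.commit()

# later: the most frequent bigrams
top = conn.execute(
    "SELECT token, cnt FROM ngram_2 ORDER BY cnt DESC LIMIT 100").fetchall()
```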
Be wary of heuristic methods like the one suggested by Ami Tavory: if you select the wrong parameters, you can get wrong results (a flaw of the sampling algorithm can be seen with some classic terms or phrases, e.g. "habeas corpus": neither "habeas" nor "corpus" will be selected as frequent by itself, but as a two-word phrase it may very well rank higher than some phrases you get by appending/prepending to a common word). There is surely no need to use them for tokens of shorter length; you could use them only when classic methods fail (take too much time or memory).
The top answer by Ami Tavory states:
Clearly, the most frequent phrases of length l + 1 must contain the most frequent phrases of length l as a prefix, as appending a word to a phrase cannot increase its popularity.
While it is true that appending a word to a phrase cannot increase its popularity, there is no reason to assume that the frequency of 2-grams is bounded by the frequency of 1-grams. To illustrate, consider the following corpus (constructed specifically to illustrate this point):
Here, a tricksy corpus will exist; a very strange, a sometimes cryptic corpus will dumbfound you maybe, perhaps a bit; in particular since my tricksy corpus will not match the pattern you expect from it; nor will it look like a fish, a boat, a sunflower, or a very handsome kitten. The tricksy corpus will surprise a user named Ami Tavory; this tricksy corpus will be fun to follow a year or a month or a minute from now.
Looking at the most frequent single words, we get:
1-Gram Frequency
------ ---------
a 12
will 6
corpus 5
tricksy 4
or 3
from 2
it 2
the 2
very 2
you 2
The method suggested by Ami Tavory would identify the top 1-gram, 'a', and narrow the search to 2-grams with the prefix 'a'. But looking at the corpus from before, the top 2-grams are:
2-Gram Frequency
------ ---------
corpus will 5
tricksy corpus 4
or a 3
a very 2
And moving on to 3-grams, there is only a single repeated 3-gram in the entire corpus, namely:
3-Gram Frequency
------ ---------
tricksy corpus will 4
To generalize: you can't use the top m-grams to extrapolate directly to top (m+1)-grams. What you can do is throw away the bottom m-grams, specifically the ones which do not repeat at all, and look at all the ones that do. That narrows the field a bit.
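A rough sketch of that pruning idea: count n-grams level by level, keep only the ones that repeat, and extend to length n+1 only from surviving prefixes. A plain in-memory Counter is used here; a corpus of the size described in the question would need a streaming or disk-backed counter instead.

```python
from collections import Counter

def repeated_ngrams(tokens, max_n=10):
    """Return {n: Counter} of n-grams occurring at least twice, extending an
    n-gram to length n+1 only if its length-n prefix itself repeats."""
    results = {}
    surviving_prefixes = None  # None means "no prefix constraint" (n == 1)
    for n in range(1, max_n + 1):
        counts = Counter()
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if surviving_prefixes is None or gram[:-1] in surviving_prefixes:
                counts[gram] += 1
        # keep only the n-grams that actually repeat
        repeated = Counter({g: c for g, c in counts.items() if c >= 2})
        if not repeated:
            break
        results[n] = repeated
        surviving_prefixes = set(repeated)
    return results
```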
This can be simplified greatly. You don't need a database at all. Just store the full text in a file. Then write a PHP script to open and read the file contents. Use the PHP regex function to extract matches. Keep the total in a global variable. Write the results to another file. That's it.

Bytes vs Characters vs Words - which granularity for n-grams?

At least 3 types of n-grams can be considered for representing text documents:
byte-level n-grams
character-level n-grams
word-level n-grams
It's unclear to me which one should be used for a given task (clustering, classification, etc). I read somewhere that character-level n-grams are preferred to word-level n-grams when the text contains typos, so that "Mary loves dogs" remains similar to "Mary lpves dogs".
Are there other criteria to consider for choosing the "right" representation?
Evaluate. The criterion for choosing the representation is whatever works.
Indeed, character level (!= bytes, unless you only care about English) probably is the most common representation, because it is robust to spelling differences (which do not need to be errors; if you look at history, spelling changes). So for spelling-correction purposes, this works well.
On the other hand, the Google Books n-gram viewer uses word-level n-grams on their books corpus, because they don't want to analyze spelling but term usage over time, e.g. "child care", where the individual words aren't as interesting as their combination. This was shown to be very useful in machine translation, often referred to as the "refrigerator magnet model".
If you are not processing international language, bytes may be meaningful, too.
I would outright discard byte-level n-grams for text-related tasks, because bytes are not a meaningful representation of anything.
Of the two remaining levels, character-level n-grams will need much less storage space and will, consequently, hold much less information. They are usually utilized in tasks such as language identification, writer identification (i.e. fingerprinting), and anomaly detection.
As for word-level n-grams, they may serve the same purposes, and much more, but they need much more storage. For instance, you'll need up to several gigabytes to represent in memory a useful subset of English word 3-grams (for general-purpose tasks). Yet, if you have a limited set of texts you need to work with, word-level n-grams may not require so much storage.
As for the issue of errors, a sufficiently large word n-grams corpus will also include and represent them. Besides, there are various smoothing methods to deal with sparsity.
The other issue with n-grams is that they will almost never be able to capture the whole needed context, so they will only approximate it.
You can read more about n-grams in the classic Foundations of Statistical Natural Language Processing.
I use character ngrams on small strings, and word ngrams for something like text classification of larger chunks of text. It is a matter of which method will preserve the context you need more or less...
In general, for classification of text, word ngrams will help a bit with word-sense disambiguation, where character ngrams would be easily confused and your features could be completely ambiguous. For unsupervised clustering, it will depend on how general you want your clusters to be, and on what basis you want docs to converge. I find stemming, stopword removal, and word bigrams work well in unsupervised clustering tasks on fairly large corpora.
Character ngrams are great for fuzzy string matching of small strings.
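For instance, a tiny sketch of character-n-gram fuzzy matching via Jaccard similarity over trigram sets; the choice of n=3 is illustrative:

```python
def char_ngrams(s, n=3):
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def jaccard(a, b, n=3):
    """Fuzzy similarity of two short strings via their character n-gram sets."""
    ga, gb = char_ngrams(a, n), char_ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)

# "Mary loves dogs" vs "Mary lpves dogs" still overlap heavily
print(jaccard("Mary loves dogs", "Mary lpves dogs"))
```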
I like to think of a set of grams as a vector, and imagine comparing vectors with the grams you have, then ask yourself if what you are comparing maintains enough context to answer the question you are trying to answer.
HTH
