Is there a good way to summarize a given text to specific length? - text

1. Why asking
I'm doing a regression task using transformers.BertModel (i.e. passing a text to the model, output a score for the text). To my knowledge, Bert can only receive max_length=512 input and my average training data length is 593. Of course, I can use truncation and padding to modify the input, but this can result in a loss in the performance (I'm aware of this by comparing "tail_truncate" and "head_truncate" result, also with some domain knowledge).
2. What is the problem
I want to apply a text summary preprocessor for my input text, the expected output text length should be no more than 510 but as near as possible (i.e. I don't want a one-line summary). Is there a method, a model, a library exists to do so?
3. What I've tried
As I've mentioned above, I have tried to implement tail truncation. For any given text, simply text[-511:-1] (consider the special token [CLS] and [SEP], the actual text length should be 510) then pass to the Bert Model. This improved 2% performance on my task and it is expected since the nature of the text.
The problem is that there are quite a few texts length more than 512 (or even 800), a truncation could lose tons of useful information. I think text summary could be a way out and there should be existing solutions since it's a heavily demanded NLP task. However, I can only find whether TextRank, LSA methods (provided by library PyTextRank) that tells you which sentence is more important, or give you a "one-line" summary (provided by library PaddleNLP)
More details about the texts:
The task is that given a commutation verdict, predict the reduction of months in jail.
The corpus is in Chinese, and it structured like this: what crime did the criminal committed, how does he/she behave in jail, what is the judge's opinion toward commutation.

Related

What are the negative & sample parameters?

I am new to NLP and doc2Vec. I want to understand the parameters of doc2Vec. Thank you
Doc2Vec(dm=0, vector_size=300, negative=5, hs=0, sample = 0, seed=0)
vector_size:I believe this is to control over-fitting. A larger feature vector will learn more details so it tends to over-fit. Is there a method to determine a appropriate vector size based on the number of document or total words in all doc?
negative: how many “noise words” should be drawn. What is noise word?
sample: the threshold for configuring which higher-frequency words are randomly down sampled. So what does sample=0 mean?
As a beginner, only vector_size will be of initial interest.
Typical values are 100-1000, but larger dimensionalities require far more training data & more memory. There's no hard & fast rules – try different values, & see what works for your purposes.
Very vaguely, you'll want your count of unique vocabulary words to be much larger than the vector_size, at least the square of the vector_size: the gist of the algorithm is to force many words into a smaller-number of dimensions. (If for some reason you're running experiments on tiny amounts of data with a tiny vocabulary – for which word2vec isn't really good anyway – you'll have to shrink the vector_size very low.)
The negative value controls a detail of how the internal neural network is adjusted: how many random 'noise' words the network is tuned away from predicting for each target positive word it's tuned towards predicting. The default of 5 is good unless/until you have a repeatable way to rigorously score other values against it.
Similarly, sample controls how much (if at all) more-frquent words are sometimes randomly skipped (down-sampled). (So many redundant usage examples are overkill, wasting training time/effort that could better be spent on rarer words.) Again, you'd only want to tinker with this if you've got a way to compare the results of alternate values. Smaller values make the downsampling more aggressive (dropping more words). sample=0 would turn off such down-sampling completely, leaving all training text words used.
Though you didn't ask:
dm=0 turns off the default PV-DM mode in favor of the PV-DBOW mode. That will train doc-vectors a bit faster, and often works very well on short texts, but won't train word-vectors at all (unless you turn on an extra dbow_words=1 mode to add-back interleaved ski-gram word-vector training).
hs is an alternate mode to train the neural-network that uses multi-node encodings of words, rather than one node per (positive or negative) word. If enabled via hs=1, you should disable the negative-sampling with negative=0. But negative-sampling mode is the default for a reason, & tends to get relatively better with larger amounts of training data - so it's rare to use this mode.

doc2vec: Pull documents from inferred document

i am new in word/paragraph embedding and trying to understand via doc2vec in GENSIM. I would like to seek advice on whether my understanding is incorrect. My understanding is that doc2vec is potentially able to return documents that may have semantically similar content. As a test, i tried the following and have the following questions.
Question 1: I noted that every run of training with the exact same parameters and examples will result in a model that produces very different results from previous trains (E.g. Different vectors and different ranking of similar documents eveytime).. Why is this so indeterministic? As such, can this be reliably used for any practical work?
Question 2: Why am i not getting the tag ids of the top similar documents instead?
Results: [('day',0.477),('2016',0.386)....
Question 2 answer: The problem was due to model.most_similar, should use model.docvecs.most_similar instead
Please advise if i misunderstood anything?
Data prep
I had created multiple documents with a sentence each. I had deliberately made it such that they are distinctly different semantically.
A: It is a fine summer weather, with the birds singing and sun shining bright.
B: It is a lovely day indeed, if only i had a degree in appreciating.
C: 2016-2017 Degree in Earth Science Earthly University
D: 2009-2010 Dip in Life and Nature Life College
Query: Degree in Philosophy from Thinking University from 2009 to 2010
Training
I trained the documents (tokens as words, running index as tag)
tdlist=[]
docstring=['It is a fine summer weather, with the birds singing and sun shining bright.',
'It is a lovely day indeed, if only i had a degree in appreciating.',
'2016-2017 Degree in Earth Science Earthly University',
'2009-2010 Dip in Life and Nature Life College']
counter=1
for para in docstring:
tokens=tokenize(para) #This will also strip punctuation
td=TaggedDocument(gensim.utils.to_unicode(str.encode(' '.join(tokens))).split(), str(counter))
tdlist.append(td)
counter=counter+1
model=gensim.models.Doc2Vec(tdlist,dm=0,alpha=0.025, size=20, min_alpha=0.025, min_count=0)
for epoch in range(200):
model.train(tdlist, total_examples=model.corpus_count, epochs=model.iter)
Inference
I then attempted to infer the query. Although they are many missing words in the vocab for the query, i would expect closest document similarity results for C and D. But the results only gave me a list of 'words' followed by a similarity score. I am unsure if my understanding is wrong. Below is my code extract.
mydocvector=model.infer_vector(['Degree' ,'in' ,'Philosophy' ,'from' ,'Thinking' ,'University', 'from', '2009', 'to', '2010'])
print(model.docvecs.most_similar(positive=[mydocvector])
Doc2Vec doesn't work well on toy-sized datasets - few documents, few total words, few words per document. You'll absolutely want more documents than vector dimensions (size), and ideally tens-of-thousands of documents or more.
The second argument to TaggedDocument should be a list of tags. By supplying a single string-of-an-int, each of its elements (characters) will be seen as tags. (With just documents 1 to 4 this won't yet hurt, but as soon as you have document 10, Doc2Vec will see it as tags 1 and 0, unless you supply it as ['10'] (a single-element list).
Yes, to find most-similar documents you use model.docvecs.most_similar() rather than model.most_similar() (which only operates on learned words, if any).
You are using dm=0 mode, which is a pretty good starting idea – it's fast and often a top-performer. But note that this mode doesn't train word-vectors too. So anything you ask for from the top model, like model['summer'] or model.most_similar('sun'), will be nonsense results based on randomly-initialized but never-trained words. (If you need words trained too, either add dbow_words=1 to the dm=0 mode, or use a dm=1 mode. But for pure doc-vectors, dm=0 is a pretty good choice.)
There's no need to call train() in a loop - or indeed at all, given the line above it. The form you've used to instantiate Doc2Vec, with an actual corpus tdlist as the first argument, already triggers model-setup and training, using the default number of iter passes (5) and the supplied alpha and min_alpha. Now, for Doc2Vec training you often want more passes (10 to 20 are common, though smaller datasets might benefit from even more). And for any training, for properly gradient-descent, you want the effective learning-rate alpha to gradually decline to a negligible value, such as the default 0.0001 (rather than a forced same-as-starting value).
The only situation where you'd usually call train() explicitly is if you instantiate the model without a corpus. In that case, you'd need to both call model.build_vocab(tdlist) (to let the model initialize with a discovered vocabulary), and then some form of train() - but you'd still need only one call to train, supplying the desired number of passes. (Allowing the default model.iter 5 passes, inside an outer loop of 200 iterations, means a total of 1000 passes over the data... and all at the same fixed alpha, which is not proper gradient-descent.)
When you have a beefier dataset, you may find results improve with a higher min_count. Usually words that appear only a few times can't contribute much meaning, and thus only serve as noise slowing training and interfering with other vectors becoming more expressive. (Don't assume "more words must equal better results".) Throwing out the singletons, or more, usually helps.
Regarding inference, almost none of the words in your inference text are in the training set. (I only see 'Degree', 'in', and 'University' repeated.) So in addition to all the issues above, inferring a good vector for the example text would be hard. With a richer training set, you'd likely get better results. It also often helps to increase the steps optional parameter to infer_vector() far above its default of 5.

Information Retrieval: How to combine different word results when using tf-idf?

Let's say I have a user search query which looks like:
"the happy bunny"
I have already computed tf-idf and have something like this (following are made up example values) for each document in which I am searching (of coures the idf is always the same):
tf idf score
the 0.06 1 0.06 * 1 = 0.06
happy 0.002 20 0.002 * 20 = 0.04
bunny 0.0005 60 0.0005 * 60 = 0.03
I have two questions with what to do next.
Firstly, the still has the highest score, even though it is adjusted for rarity by idf, still it's not exactly important - do you think I should square the idf values to weight in terms of rare words, or would this give bad results? Otherwise I'm worried that the is getting equal importance to happy and bunny, and it should be obvious that bunny is the most important word in the search. As long as rare always equals important then it would be always a good idea to weight in terms of rarity, but if that is not always the case then doing so could really mess up the results.
Secondly and more importantly: what is the best/preferred method for combining the scores for each word together to give each document a single score that represents how well it reflects the entire search query? I was thinking of adding them, but it has become apparent that that is going to give higher priority to a document containing 10,000 happy but only 1 bunny instead of another document with 500 happy and 500 bunny (which would be a better match).
First, make sure that you are computing the correct TF-IDF values. As others have pointed they do not look right. TF is relative to specific documents, and we often do not need to compute them for queries (since raw term frequency is almost always 1 in queries). There are different types of TF functions to pick from (check the Wikipedia page on tf-idf, it has a good coverage). Log Normalisation is common and the most efficient scheme, since it saves an extra disk access to get the respective document's total frequency maxF that is needed for something like Double Normalisation. When you are dealing with large volumes of documents this can be expensive, especially if you can't bring these into memory. A bit of insight on inverted files can go a long way in understanding some of the underlying complexities. Log normalisation is efficient and is a non-linear function, therefore better than raw frequency.
Once you are certain on your weighting scheme, then you may want to consider a stop list to get rid of very common/noisy words. These do not contribute to the rank of documents. It is generally recommended to use a stop list of high frequency, very common words. Do a search and you will find many available, including the one that Lucene uses.
The remaining lies on your ranking strategy and that will depend on your implementation/model. The vector space model (VSM) is simple and readily available with libraries like Lucene, Lemur, etc. VSM computes the Dot product or scalar of the weights of common terms between the query and a document. Term weights are normalised via vector length normalisation (which solves your second question), and the result of applying the model is a value between 0 and 1. This is also justified/interpreted as the Cosine of the angle between two vectors in a planar graph, or the Euclidean distance divided by the Euclidean vector length of two vectors.
One of the earliest comprehensive studies on weighting schemes and ranking with VSM is an article by Salton (pdf) and is a good read if you are interested in Information Retrieval. A bit outdated perhaps (notice how log normalisation is not mentioned in the article).
Your best read I believe is the book Introduction to Information Retrieval by Christopher Manning. It will take you through everything that you need to know, from indexing to ranking schemes, etc. A bit lacking on ranking models (does not cover some of the more complex probabilistic approaches).
You should reconsider your TF and IDF values, they do not look correct. The TF value is usually just how often the word occurs, so if the word "the" appeared 20 times it's tf value would be 20. A word like "the" should have a very low IDF value (possibly around 4 decimal places, 0.000...).
You could use stop word removal if word like the are not necessary, they would be removed rather than just given a low score.
A vector space model could be used for this.
can you compute tf-idf for amalgamated terms? That is, you first generate a sentiment that considers each of its component as equal before treating the sentiment as a single term for which you now compute the tf-idf

I need a function that describes a set of sequences of zeros and ones?

I have multiple sets with a variable number of sequences. Each sequence is made of 64 numbers that are either 0 or 1 like so:
Set A
sequence 1: 0,0,0,0,0,0,1,1,0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,0,0,0,0,0
sequence 2:
0,0,0,0,1,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0
sequence 3:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,0
...
Set B
sequence1:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,1,1,1,1,1,1
sequence2:
0,0,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,1,1,1,0,0,0,0,1,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,1,1,1,1,1,1,0
...
I would like to find a mathematical function that describes all possible sequences in the set, maybe even predict more and that does not contain the sequences in the other sets.
I need this because I am trying to recognize different gestures in a mobile app based on the cells in a grid that have been touched (1 touch/ 0 no touch). The sets represent each gesture and the sequences a limited sample of variations in each gesture.
Ideally the function describing the sequences in a set would allow me to test user touches against it to determine which set/gesture is part of.
I searched for a solution, either using Excel or Mathematica, but being very ignorant about both and mathematics in general I am looking for the direction of an expert.
Suggestions for basic documentation on the subject is also welcome.
It looks as if you are trying to treat what is essentially 2D data in 1D. For example, let s1 represent the first sequence in set A in your question. Then the command
ArrayPlot[Partition[s1, 8]]
produces this picture:
The other sequences in the same set produce similar plots. One of the sequences from the second set produces, in response to the same operations, the picture:
I don't know what sort of mathematical function you would like to define to describe these pictures, but I'm not sure that you need to if your objective is to recognise user gestures.
You could do something much simpler, such as calculate the 'average' picture for each of your gestures. One way to do this would be to calculate the average value for each of the 64 pixels in each of the pictures. Perhaps there are 6 sequences in your set A describing gesture A. Sum the sequences element-by-element. You will now have a sequence with values ranging from 0 to 6. Divide each element by 6. Now each element represents a sort of probability that a new gesture, one you are trying to recognise, will touch that pixel.
Repeat this for all the sets of sequences representing your set of gestures.
To recognise a user gesture, simply compute the difference between the sequence representing the gesture and each of the sequences representing the 'average' gestures. The smallest (absolute) difference will direct you to the gesture the user made.
I don't expect that this will be entirely foolproof, it may well result in some user gestures being ambiguous or not recognisable, and you may want to try something more sophisticated. But I think this approach is simple and probably adequate to get you started.
In Mathematica the following expression will enumerate all the possible combinations of {0,1} of length 64.
Tuples[{1, 0}, {64}]
But there are 2^62 or 18446744073709551616 of them, so I'm not sure what use that will be to you.
Maybe you just wanted the unique sequences contained in each set, in that case all you need is the Mathematica Union[] function applied to the set. If you have a the sets grouped together in a list in Mathematica, say mySets, then you can apply the Union operator to every set in the list my using the map operator.
Union/#mySets
If you want to do some type of prediction a little more information might be useful.
Thanks you for the clarifications.
Machine Learning
The task you want to solve falls under the disciplines known by a variety of names, but probably most commonly as Machine Learning or Pattern Recognition and if you know which examples represent the same gestures, your case would be known as supervised learning.
Question: In your case do you know which gesture each example represents ?
You have a series of examples for which you know a label ( the form of gesture it is ) from which you want to train a model and use that model to label an unseen example to one of a finite set of classes. In your case, one of a number of gestures. This is typically known as classification.
Learning Resources
There is a very extensive background of research on this topic, but a popular introduction to the subject is machine learning by Christopher Bishop.
Stanford have a series of machine learning video lectures Standford ML available on the web.
Accuracy
You might want to consider how you will determine the accuracy of your system at predicting the type of gesture for an unseen example. Typically you train the model using some of your examples and then test its performance using examples the model has not seen. The two of the most common methods used to do this are 10 fold Cross Validation or repeated 50/50 holdout. Having a measure of accuracy enables you to compare one method against another to see which is superior.
Have you thought about what level of accuracy you require in your task, is 70% accuracy enough, 85%, 99% or better?
Machine learning methods are typically quite sensitive to the specific type of data you have and the amount of examples you have to train the system with, the more examples, generally the better the performance.
You could try the method suggested above and compare it against a variety of well proven methods, amongst which would be Random Forests, support vector machines and Neural Networks. All of which and many more are available to download in a variety of free toolboxes.
Toolboxes
Mathematica is a wonderful system, is infinitely flexible and my favourite environment, but out of the box it doesn't have a great deal of support for machine learning.
I suspect you will make a great deal of progress more quickly by using a custom toolbox designed for machine learning. Two of the most popular free toolboxes are WEKA and R both support more than 50 different methods for solving your task along with methods for measuring the accuracy of the solutions.
With just a little data reformatting, you can convert your gestures to a simple file format called ARFF, load them into WEKA or R and experiment with dozens of different algorithms to see how each performs on your data. The explorer tool in WEKA is definitely the easiest to use, requiring little more than a few mouse clicks and typing some parameters to get started.
Once you have an idea of how well the established methods perform on your data you have a good starting point to compare a customised approach against should they fail to meet your criteria.
Handwritten Digit Recognition
Your problem is similar to a very well researched machine learning problem known as hand written digit recognition. The methods that work well on this public data set of handwritten digits are likely to work well on your gestures.

Supervised Learning for User Behavior over Time

I want to use machine learning to identify the signature of a user who converts to a subscriber of a website given their behavior over time.
Let's say my website has 6 different features which can be used before subscribing and users can convert to a subscriber at any time.
For a given user I have stats which represent the intensity on a continuous range of that user's interaction with features 1-6 on a daily basis so:
D1: f1,f2,f3,f4,f5,f6
D2: f1,f2,f3,f4,f5,f6
D3: f1,f2,f3,f4,f5,f6
D4: f1,f2,f3,f4,f5,f6
Let's say on day 5, the user converts.
What machine using algorithms would help me identify which are the most common patterns in feature usage which lead to a conversion?
(I know this is a super basic classification question, but I couldn't find a good example using longitudinal data, where input vectors are ordered by time like I have)
To develop the problem further, let's assume that each feature has 3 intensities at which the user can interact (H, M, L).
We can then represent each user as a string of states of interaction intensity. So, for a user:
LLLLMM LLMMHH LLHHHH
Would mean on day one they only interacted significantly with features 5 and 6, but by the third day they were interacting highly with features 3 through 6.
N-gram Style
I could make these states words and the lifetime of a user a sentence. (Would probably need to add a "conversion" word to the vocabulary as well)
If I ran these "sentences" through an n-gram model, I could get the likely future state of a user given his/her past few state which is somewhat interesting. But, what I really want to know the most common sets of n-grams that lead to the conversion word. Rather than feeding in an n-gram and getting the next predicted word, I want to give the predicted word and get back the 10 most common n-grams (from my data) which would be likely to lead to the word.
Amaç Herdağdelen suggests identifying n-grams to practical n and then counting how many n-gram states each user has. Then correlating with conversion data (I guess no conversion word in this example). My concern is that there would be too many n-grams to make this method practical. (if each state has 729 possibilities, and we're using trigrams, thats a lot of possible trigrams!)
Alternatively, could I just go thru the data logging the n-grams which led to the conversion word and then run some type of clustering on them to see what the common paths are to a conversion?
Survival Style
Suggested by Iterator, I understand the analogy to a survival problem, but the literature here seems to focus on predicting time to death as opposed to the common sequence of events which leads to death. Further, when looking up the Cox Proportional Hazard model, I found that it does not event accommodate variables which change over time (its good for differentiating between static attributes like gender and ethnicity)- so it seems very much geared toward a different question than mine.
Decision Tree Style
This seems promising though I can't completely wrap my mind around how to structure the data. Since the data is not flat, is the tree modeling the chance of moving from one state to another down the line and when it leads to conversion or not? This is very different than the decision tree data literature I've been able to find.
Also, need clarity on how to identify patterns which lead to conversion instead a models predicts likely hood of conversion after a given sequence.
Theoretically, hidden markov models may be a suitable solution to your problem. The features on your site would constitute the alphabet, and you can use the sequence of interactions as positive or negative instances depending on whether a user finally subscribed or not. I don't have a guess about what the number of hidden states should be, but finding a suitable value for that parameter is part of the problem, after all.
As a side note, positive instances are trivial to identify, but the fact that a user has not subscribed so far doesn't necessarily mean s/he won't. You might consider to limit your data to sufficiently old users.
I would also consider converting the data to fixed-length vectors and apply conceptually simpler models that could give you some intuition about what's going on. You could use n-grams (consecutive interaction sequences of length n).
As an example, assuming that the interaction sequence of a given user ise "f1,f3,f5", "f1,f3,f5" would constitute a 3-gram (trigram). Similarly, for the same user and the same interaction sequence you would have "f1,f3" and "f3,f5" as the 2-grams (bigrams). In order to represent each user as a vector, you would identify all n-grams up to a practical n, and count how many times the user employed a given n-gram. Each column in the vector would represent the number of times a given n-gram is observed for a given user.
Then -- probably with the help of some suitable normalization techniques such as pointwise mutual information or tf-idf -- you could look at the correlation between the n-grams and the final outcome to get a sense of what's going on, carry out feature selection to find the most prominent sequences that users are involved in, or apply classification methods such as nearest neighbor, support machine or naive Bayes to build a predictive model.
This is rather like a survival analysis problem: over time the user will convert or will may drop out of the population, or will continue to appear in the data and not (yet) fall into neither camp. For that, you may find the Cox proportional hazards model useful.
If you wish to pursue things from a different angle, namely one more from the graphical models perspective, then a Kalman Filter may be more appealing. It is a generalization of HMMs, suggested by #AmaçHerdağdelen, which work for continuous spaces.
For ease of implementation, I'd recommend the survival approach. It is the easiest to analyze, describe, and improve. After you have a firm handle on the data, feel free to drop in other methods.
Other than Markov chains, I would suggest decision trees or Bayesian networks. Both of these would give you a likely hood of a user converting after a sequence.
I forgot to mention this earlier. You may also want to take a look at the Google PageRank algorithm. It would help you account for the user completely disappearing [not subscribing]. The results of that would help you to encourage certain features to be used. [Because they're more likely to give you a sale]
I think Ngramm is most promising approach, because all sequnce in data mining are treated as elements depndent on few basic steps(HMM, CRF, ACRF, Markov Fields) So I will try to use classifier based on 1-grams and 2 -grams.

Resources