Are Seq2Seq models used for time series only? - nlp

Can we use a Seq2Seq model with input data that has no temporal relation (not a time series)? For example, I have a list of image regions that I would like to feed to my Seq2Seq model, and the model should predict a description or caption (the output is a time series).
I'm not asking from the technical perspective; I know that if the data is in the correct format then I can do that. My question is rather theoretical: is it OK to use Seq2Seq with non-time-series data? And are there any papers, articles, or references on using Seq2Seq in this setting?

No. The only requirement is that the data is sequence-like.
Klaus Greff, et al., LSTM: A Search Space Odyssey, 2015 :
Since LSTMs are effective at capturing long-term temporal dependencies without suffering from the optimization hurdles that plague simple recurrent networks (SRNs), they have been used to advance the state of the art for many difficult problems. This includes handwriting recognition and generation, language modeling and translation, acoustic modeling of speech, speech synthesis, protein secondary structure prediction, analysis of audio, and video data among others.
Felix A. Gers, et al., Learning to Forget: Continual Prediction with LSTM, 2000 :
LSTM holds promise for any sequential processing task in which we suspect that a hierarchical decomposition may exist, but do not know in advance what this decomposition is.

Related

Sentiment Analysis: Is there a way to extract positive and negative aspects in reviews?

Currently, I'm working on a project where I need to extract the relevant aspects used in positive and negative reviews in real time.
Deciding whether an aspect is positive or negative requires looking at each word in context, for example to recognize a word that sounds positive but is used in a negative context (as with irony).
Here is an example:
Very nice welcome!!! We ate very well with traditional dishes as at home, the quality but also the quantity are in appointment!!!
Positive aspects: welcome, traditional dishes, quality, quantity
Can anyone suggest to me some tutorials, papers or ideas about this topic?
Thank you in advance.
This task is called Aspect-Based Sentiment Analysis (ABSA). The most popular format and dataset are those specified for the 2014 Semantic Evaluation workshop (SemEval-2014, Task 4) and its updated versions in the following years.
Overview of model performance over the years:
https://paperswithcode.com/sota/aspect-based-sentiment-analysis-on-semeval
A good source of resources and repositories on the topic (some are very advanced, but there are also some more beginner-friendly resources in there):
https://github.com/ZhengZixiang/ABSAPapers
In my experience, a very powerful starting point that doesn't require advanced knowledge of machine learning model design is to prepare a dataset (such as the one provided for the SemEval-2014 task) in a token classification format and use it to fine-tune a pretrained transformer model such as BERT or RoBERTa. Check out any tutorial on fine-tuning a token classification model, like this one in huggingface. They usually use Named Entity Recognition (NER) as the example task, but for ABSA you basically do the same thing with other labels and a different dataset.
Obviously an even easier approach would be a rule-based one, or a combination of rules with a trained sentiment analysis model, negation detection, etc., but I think you can generally expect much worse performance from a rule-based approach than from state-of-the-art models such as transformers.
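To make the "token classification format" concrete, here is a minimal sketch of how aspect terms could be encoded with BIO tags and read back out. The tag names (`B-POS`, `I-POS`, `B-NEG`, `I-NEG`, `O`) and the helper `extract_aspects` are assumptions chosen for illustration, not the official SemEval format; in practice the tags would come from the fine-tuned model's predictions.

```python
def extract_aspects(tokens, tags):
    """Collect (aspect term, polarity) pairs from BIO-style tags.

    Assumed tag scheme: B-<POL> starts an aspect, I-<POL> continues it,
    O marks tokens outside any aspect.
    """
    aspects = []
    current, polarity = [], None
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "B":
            # A new aspect starts; close the previous one if it is still open.
            if current:
                aspects.append((" ".join(current), polarity))
            current, polarity = [token], label
        elif prefix == "I" and current and label == polarity:
            current.append(token)
        else:
            if current:
                aspects.append((" ".join(current), polarity))
            current, polarity = [], None
    if current:
        aspects.append((" ".join(current), polarity))
    return aspects


# Hand-labeled toy example based on the review in the question.
tokens = ["Very", "nice", "welcome", "!", "We", "ate", "very", "well",
          "with", "traditional", "dishes"]
tags = ["O", "O", "B-POS", "O", "O", "O", "O", "O",
        "O", "B-POS", "I-POS"]

print(extract_aspects(tokens, tags))
```

Each token gets exactly one label, which is why the NER fine-tuning tutorials carry over almost unchanged: only the label set and the dataset differ.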
If you want to go even more advanced than just fine-tuning the pretrained transformer models then check out the second and third link I provided and look at some of the machine learning model designs specifically designed for Aspect Based Sentiment Analysis.

Changes in GPT2/GPT3 model during few shot learning

During transfer learning, we take a pre-trained network and some observation pair (input and label), and use these data to fine-tune the weight by use of backpropagation. However, during one shot/few shot learning, according to this paper- 'Language Models are Few-Shot Learners' (https://arxiv.org/pdf/2005.14165.pdf), "No gradient updates are performed". Then what changes happen to the models like GPT2 and GPT3 during one shot/few shot learning?
There is no change to the model at all. The model does not learn anything persistently. Instead, the "training examples" are given to the model as context, and the model generates an output at the end of this context. Figure 2.1 (Brown, Tom B., et al. "Language models are few-shot learners." (2020)) shows examples of input for fine-tuning, zero-shot learning and few-shot learning.
As you can see, the training examples are part of the input and must be given each time a prediction is made. Therefore the model itself never changes.
You may think that there are some changes because the model returns better results in the few-shot case. However, it is the same model with a different context as input. GPT-2 and GPT-3 are both autoregressive models, meaning that the output depends on the preceding context.
More examples make the context clearer and thus increase the chance of obtaining the desired result.
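A minimal sketch of what "examples as context" means in practice: the demonstrations are simply concatenated into the input text, and nothing is stored in the model. The function name `build_few_shot_prompt` and the `=>` separator are assumptions for illustration (the layout loosely follows the translation example in the paper's figure); no model is actually called here.

```python
def build_few_shot_prompt(examples, query, task="Translate English to French:"):
    """Build one input string: task description, demonstrations, then the query.

    The demonstrations live only in this string. They have to be sent again
    with every new query because the model's weights never change.
    """
    lines = [task]
    for source, target in examples:
        lines.append(f"{source} => {target}")
    # The prompt ends right after the separator so the model continues from there.
    lines.append(f"{query} =>")
    return "\n".join(lines)


examples = [("sea otter", "loutre de mer"), ("cheese", "fromage")]
prompt = build_few_shot_prompt(examples, "peppermint")
print(prompt)
```

Zero-shot corresponds to calling the same function with an empty `examples` list; the only difference between the two settings is the text passed in.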

Multiclass text classification with python and nltk

I have been given the task of classifying news text into one of the following 5 categories: Business, Sports, Entertainment, Tech and Politics.
About the data I am using:
It consists of text data labeled as one of the 5 types of news (BBC news data).
I am currently using the nltk module to calculate the frequency distribution of every word in the training data with respect to each category (excluding stopwords).
Then I classify new data by calculating the sum of weights of all its words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Here's the actual code.
This algorithm does predict new data accurately, but I would like to know about other simple algorithms that I can implement to achieve better results. I have used the Naive Bayes algorithm to classify data into two classes (spam or not spam, etc.) and would like to know how to implement it for multiclass classification, if that is feasible.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent and require knowledge about the data, but good features lead to better systems more quickly than tuning or selecting algorithms and parameters.
In your case you can use word embeddings, as already suggested, but you can also design your own custom features that you think will help discriminate between classes (whatever the number of classes is). For instance, what does a spam e-mail typically look like? Lots of mistakes, syntactic inversions, poor translation, unusual punctuation, slang words... There are a lot of possibilities! Try to think about your own case with sport, business, news, etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
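As a rough illustration of tf-idf weighting, the sketch below computes one common variant (relative term frequency times `log(N / df)`) with the standard library only. The function name `tf_idf` and the toy documents are assumptions for illustration; libraries such as scikit-learn use slightly different smoothing and normalization.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    weight = (count of term in doc / doc length) * log(n_docs / doc frequency)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))  # count each term once per document

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


docs = [
    ["the", "stocks", "rise"],
    ["the", "team", "wins"],
    ["the", "stocks", "fall"],
]
weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(f"{term}: {weight:.3f}")
```

A term that occurs in every document gets weight zero, while terms concentrated in a few documents are boosted, which is the main difference from raw frequency counts.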
Since you're dealing with words, I would suggest word embeddings, which give more insight into the relationships and meanings of words with respect to your dataset, and often lead to better classification.
If you are looking for other classification implementations, you can check my sample code here. These models from scikit-learn can easily handle multiple classes; take a look here at the scikit-learn documentation.
If you want an easy-to-use framework around these classifiers, you can check out my rasa-nlu; it uses the spacy_sklearn model, and sample implementation code is here. All you have to do is prepare the dataset in the given format and train the model.
If you want more intelligence, you can check out my Keras implementation here; it uses a CNN for text classification.
Hope this helps.
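On the question of whether Naive Bayes extends beyond two classes: it does, because the classifier just picks the class with the highest posterior score, however many classes there are. Below is a minimal standard-library sketch of multinomial Naive Bayes with add-one smoothing. The class name `TinyNaiveBayes` and the toy data are assumptions for illustration; this is not the asker's code and not a scikit-learn API.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes over token lists, with add-one smoothing."""

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)      # documents per class
        self.word_counts = defaultdict(Counter)  # word counts per class
        self.vocab = set()
        for tokens, label in zip(docs, labels):
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        n_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log P(class) + sum of log P(word | class)
            score = math.log(count / n_docs)
            total = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore words never seen in training
                numerator = self.word_counts[label][token] + 1
                score += math.log(numerator / (total + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [
    ["stocks", "market", "profit"],
    ["match", "goal", "team"],
    ["film", "actor", "award"],
    ["market", "shares"],
    ["team", "coach"],
]
train_labels = ["business", "sport", "entertainment", "business", "sport"]

model = TinyNaiveBayes().fit(train_docs, train_labels)
print(model.predict(["market", "profit"]))
print(model.predict(["goal", "coach"]))
```

Nothing in `predict` depends on the number of classes, so the same code covers the two-class spam case and the five news categories.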

Linear Chain Conditional Random Field Sequence Model - NER

I am confused about what exactly a linear chain CRF implementation is. Some people say that "The Linear Chain CRF restricts the features to depend on only the current(i) and previous label(i-1), rather than arbitrary labels throughout the sentence", while others say that it restricts the features to depend on the current (i) and future (i+1) label.
I am trying to understand the implementation that goes behind the Stanford NER Model. Can someone please explain what exactly the linear chain CRF Model is?
Both models would be linear chain CRF models. The important part of "linear chain" is that the features depend only on the current label and one direct neighbour in the sequence. Usually this is the previous label (because that corresponds to reading order), but it could also be the future label. Such a model would basically process the sentence backwards. I have never seen this in the literature, but it would still be a linear chain CRF.
As far as I know, the Stanford NER model uses the current and the previous label, but it also uses an extension that can look at labels further back. It is therefore not a strict linear-chain model, but uses an extension described in this paper:
Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370. http://nlp.stanford.edu/~manning/papers/gibbscrf3.pdf
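The restriction to (previous label, current label) pairs is what makes exact decoding cheap. The sketch below is a minimal Viterbi decoder under that assumption; the function name `viterbi`, the label set and all scores are made up for illustration and are not taken from the Stanford NER implementation.

```python
def viterbi(emissions, transitions, labels):
    """Find the best label sequence for a linear-chain model.

    emissions:   one {label: score} dict per token position
    transitions: {(previous_label, current_label): score}
    Each score depends only on the current label and its direct
    predecessor, which is the linear-chain assumption.
    """
    best = [{label: emissions[0][label] for label in labels}]
    backpointers = []
    for i in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels, key=lambda p: best[-1][p] + transitions[(p, cur)])
            scores[cur] = best[-1][prev] + transitions[(prev, cur)] + emissions[i][cur]
            pointers[cur] = prev
        best.append(scores)
        backpointers.append(pointers)

    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda label: best[-1][label])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "PER"]
# Hypothetical per-token scores for the three tokens "Jenny Finkel spoke".
emissions = [
    {"O": 0.2, "PER": 1.0},
    {"O": 0.6, "PER": 0.5},
    {"O": 1.0, "PER": 0.1},
]
transitions = {
    ("O", "O"): 0.5, ("O", "PER"): 0.0,
    ("PER", "PER"): 0.8, ("PER", "O"): 0.0,
}
print(viterbi(emissions, transitions, labels))
```

Taken alone, the second token slightly prefers `O`, but the `PER` to `PER` transition score flips it to `PER`. A model that conditions on the next label instead would run the same recursion from right to left.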

What does discriminative reranking do in NLP tasks?

Recently, I read "discriminative reranking for natural language processing" by Collins.
I'm confused about what the reranking actually does.
Does it add more global features to the reranking model, or something else?
If you mean this paper, then what is done is the following:
1. train a parser using a generative model, i.e. one where you compute P(term | tree) and use Bayes' rule to reverse that and get P(tree | term),
2. apply that to get an initial k-best ranking of trees from the model,
3. train a second model on features of the desired trees,
4. apply that to re-rank the output from 2.
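The final re-ranking step can be sketched as scoring each candidate in the k-best list with a linear model over its features and sorting by that score. Everything below is assumed for illustration: the feature names, the hand-set weights (in a real system they would be learned in the training step), and the choice to keep the base parser's log-probability as one feature.

```python
def rerank(candidates, weights):
    """Sort k-best candidates by a linear score over their features.

    candidates: list of (tree, base_log_prob, {feature_name: value})
    weights:    {feature_name: weight}; "base_log_prob" weights the
                score that the first-stage generative model assigned.
    """
    def score(candidate):
        _, base_log_prob, features = candidate
        total = weights.get("base_log_prob", 0.0) * base_log_prob
        total += sum(weights.get(name, 0.0) * value
                     for name, value in features.items())
        return total

    return sorted(candidates, key=score, reverse=True)


# Hypothetical 2-best list: the generative parser prefers tree_a.
k_best = [
    ("tree_a", -10.0, {"parallel_coordination": 0.0}),
    ("tree_b", -10.5, {"parallel_coordination": 1.0}),
]
# Hand-set weights standing in for a trained discriminative model.
weights = {"base_log_prob": 1.0, "parallel_coordination": 2.0}

for tree, base_log_prob, _ in rerank(k_best, weights):
    print(tree, base_log_prob)
```

Here a feature of the whole tree promotes the second candidate above the parser's first choice, which is the kind of global evidence that is awkward to build into the generative model itself.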
The reason why the second model is useful is that in generative models (such as naïve Bayes, HMMs, PCFGs), it can be hard to add features other than word identity, because the model would try to predict the probability of the exact feature vector instead of the separate features, which might not have occurred in the training data and will have P(vector|tree) = 0 and therefore P(tree|vector) = 0 (+ smoothing, but the problem remains). This is the eternal NLP problem of data sparsity: you can't build a training corpus that contains every single utterance that you'll want to handle.
Discriminative models such as MaxEnt are much better at handling feature vectors, but take longer to fit and can be more complicated to handle (although CRFs and neural nets have been used to construct parsers as discriminative models). Collins et al. try to find a middle ground between the fully generative and fully discriminative approaches.

Resources