How to use BERT in image caption tasks,such as im2txt,densecap - nlp

Preparing to use Bert Before refining downstream tasks, let's start with a few general questions about NLP, saying that this is to emphasize Bert's universality. In general, most NLP problems fall into the four types of tasks shown in the figure above:
One type is sequence labeling. This is the most typical NLP task. For example, Chinese word segmentation, part-of-speech tagging, named entity recognition, semantic role tagging, etc. can be classified into this type of problem. It is characterized by the requirement of each word in the sentence model. The context must give a classification category.
The second category is classification tasks, such as our common text classification, emotional calculations, etc. can be classified into this category. It is characterized by the fact that no matter how long the article is, it is generally given a classification category.
The third type of task is sentence relationship judgment. For example, Entailment, QA, semantic rewriting, natural language reasoning and other tasks are all this model. Its characteristic is that given two sentences, the model judges whether two sentences have some semantic relationship.
The fourth category is generative tasks, such as machine translation, text summaries, writing poems, reading pictures, and so on. It is characterized by the need to generate another piece of text autonomously after entering the text content.
The first three tasks are very common, and now I want to do the image caption task in the fourth task.My normal implementation is based on densecap:A Hierarchical Approach for Generating Descriptive Image Paragraphs. the code is here

Related

How do you differentiate between names, places, and things?

Here is a list of proper nouns taken from The Lord of the Rings. I was wondering if there is a good way to sort them based on whether they refer to a person, place or thing. Does there exist a natural language processing library that can do this? Is there a way to differentiate between places, names, and things?
Shire, Tookland, Bagginses, Boffins, Marches, Buckland, Fornost, Norbury, Hobbits, Took, Thain, Oldbucks, Hobbitry, Thainship, Isengrim, Michel, Delving, Midsummer, Postmaster, Shirriff, Farthing, Bounders, Bilbo, Frodo
You're talking about Named Entity Recognition. It is the task of information extraction that seeks to locate and classify piece of text into predefined categories such as pre-defined names, location, organizations, time expressions, monetary values, etc. You can either do that by unsupervised methods using a dictionary such as the words you have. Or though supervised methods, using methods such as CRFs, Neural Networks etc. But you need a list of predefined sentences with the respective annotated names and classes. In this example here, using Spacy (A NLP library), the authors applied NER to Lord of the rings novels. You can read more in the link.
Here is the solution:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entity mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Wikipedia link: https://en.wikipedia.org/wiki/Named-entity_recognition
Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:
Scanning news articles for the people, organizations and locations reported.
Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
Quickly retrieving geographical locations talked about in Twitter posts.
NER with spaCy
spaCy is regarded as the fastest NLP framework in Python, with single optimized functions for each of the NLP tasks it implements. Being easy to learn and use, one can easily perform simple tasks using a few lines of code.
Installation :
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)
for ent in doc.ents:
print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
In the output, the first column specifies the entity, the next two columns the start and end characters within the sentence/document, and the final column specifies the category.
Further, it is interesting to note that spaCy’s NER model uses capitalization as one of the cues to identify named entities. The same example, when tested with a slight modification, produces a different result.

Text Classification + NLP + Data-mining + Data Science: Should I do stop word removal and stemming before applying tf-idf?

I am working on a text classification problem. The problem is explained below:
I have a dataset of events which contains three columns - name of the event, description of the event, category of the event. There are about 32 categories in the dataset, such as, travel, sport, education, business etc. I have to classify each event to a category depending on its name and description.
What I understood is this particular task of classification is highly dependent on keywords, rather than, semantics. I am giving you two examples:
If the word 'football' is found either in the name or description or in both, it is highly likely that the event is about sport.
If the word 'trekking' is found either in the name or description or in both, it is highly likely that the event is about travel.
We are not considering multiple categories for an event(however, that's a plan for future !! )
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem. My question is:
Should I do stop word removal and stemming before applying tf-idf or should I apply tf-idf just on raw text? Here text means entries in name of event and description columns.
The question is too generic and you are not providing samples of the dataset, code, and not even indicating the language you are using. To this regard, I will presume that you are using English, since the two words that you are providing as an example are "football" and "trekking". The answer will however necessarily be generic.
Should I do stop word removal
Yes. Have a look at this to see the most frequent words in the English language. As you can see they have no semantic meaning, and thus would not contribute to solving the classification task that you have proposed. if stopwords is a list containing stopwords, the parameter stop_words=stopwords passed to the CountVectorizer or TfidfVectorizer constructor will automatically exclude the stopwords when invoking the .fit_transform() method.
Should I do stemming
It depends. Languages other than English, whose grammar rules allow for a big number of possible prefixes-suffixes, normally require stemming when performing classification task, in order to reach any useful result. The English language however has very poor grammar rules, and thus you can often get away without stemming/lemmatization. You should check the results obtained against the desired accuracy first, and if it is insufficient, try adding a stemming/lemmatization step in the preprocessing of your data. Stemming is a computationally expensive process for large corpora, and I personally use it only for languages that require it.
I hope applying tf-idf before Multinomial Naive Bayes would lead to decent result for this problem
Careful with this. While tf-idf in practice works with Naive Bayesian classifiers, this is not the way that specific classifier is meant to be used. From the documentation,
The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. It is in your best interest to tackle the classification task with CountVectorizer first and score it, and after you have a baseline accuracy for evaluating the TfidfVectorizer, check whether its results are better or worse than those of the CountVectorizer.
If you post some code and a sample of your dataset we can help you with that, otherwise this should be enough.

Multiclass text classification with python and nltk

I am given a task of classifying a given news text data into one of the following 5 categories - Business, Sports, Entertainment, Tech and Politics
About the data I am using:
Consists of text data labeled as one of the 5 types of news statement (Bcc news data)
I am currently using NLP with nltk module to calculate the frequency distribution of every word in the training data with respect to each category(except the stopwords).
Then I classify the new data by calculating the sum of weights of all the words with respect to each of those 5 categories. The class with the most weight is returned as the output.
Heres the actual code.
This algorithm does predict new data accurately but I am interested to know about some other simple algorithms that I can implement to achieve better results. I have used Naive Bayes algorithm to classify data into two classes (spam or not spam etc) and would like to know how to implement it for multiclass classification if it is a feasible solution.
Thank you.
In classification, and especially in text classification, choosing the right machine learning algorithm often comes after selecting the right features. Features are domain dependent, require knowledge about the data, but good quality leads to better systems quicker than tuning or selecting algorithms and parameters.
In your case you can either go to word embeddings as already said, but you can also design your own custom features that you think will help in discriminating classes (whatever the number of classes is). For instance, how do you think a spam e-mail is often presented ? A lot of mistakes, syntaxic inversion, bad traduction, punctuation, slang words... A lot of possibilities ! Try to think about your case with sport, business, news etc.
You should try some new ways of creating/combining features and then choose the best algorithm. Also, have a look at other weighting methods than term frequencies, like tf-idf.
Since your dealing with words I would propose word embedding, that gives more insights into relationship/meaning of words W.R.T your dataset, thus much better classifications.
If you are looking for other implementations of classification you check my sample codes here , these models from scikit-learn can easily handle multiclasses, take a look here at documentation of scikit-learn.
If you want a framework around these classification that is easy to use you can check out my rasa-nlu, it uses spacy_sklearn model, sample implementation code is here. All you have to do is to prepare the dataset in a given format and just train the model.
if you want more intelligence then you can check out my keras implementation here, it uses CNN for text classification.
Hope this helps.

Is it possible to supplement Naive Bayes text classification algorithm with author information?

I am working on a text classification project where I am trying to assign topic classifications to speeches from the Congressional Record.
Using topic codes from the Congressional Bills Project (http://congressionalbills.org/), I've tagged speeches that mention a specific bill as belonging to the topic of the bill. I'm using this as my "training set" for the model.
I have a "vanilla" Naive Bayes classifier working well-enough, but I keep feeling like I could get better accuracy out of the algorithm by incorporating information about the member of Congress who is making the speech (e.g. certain members are much more likely to talk about Foreign Policy than others).
One possibility would be to replace the prior in the NB classifier (usually defined as the proportion of documents in the training set that have the given classification) with speaker's observed prior speeches.
Is this worth pursuing? Are there existing approaches that have followed this same kind of logic? I'm a little bit familiar with the "author-topic models" that come out of Latent Dirichlet Allocation models, but I like the simplicity of the NB model.
There is no need to modify anything, simply add this information to your Naive Bayes and it will work just fine.
And as it was previously mentioned in the comment - do not change any priors - prior probability is P(class), this has nothing to do with actual features.
Just add to your computations another feature corresponding to the authorship, e.g. "author:AUTHOR" and train Naive Bayes as usual, ie. compute P(class|author:AUTHOR) for each class and AUTHOR and use it later on in your classification process.If your current representation is a bag of words, it is sufficient to add a "artificial" word of form "author:AUTHOR" to it.
One other option would be to train independent classifier for each AUTHOR, which would capture person-specific type of speech, for example - one uses lots of words "environment" only when talking about "nature", while other simply likes to add this word in each speach "Oh, in our local environment of ...". Independent NBs would capture these kind of phenomena.

Paraphrase recognition using sentence level similarity

I'm a new entrant to NLP (Natural Language Processing). As a start up project, I'm developing a paraphrase recognizer (a system which can recognize two similar sentences). For that recognizer I'm going to apply various measures at three levels, namely: lexical, syntactic and semantic. At the lexical level, there are multiple similarity measures like cosine similarity, matching coefficient, Jaccard coefficient, et cetera. For these measures I'm using the simMetrics package developed by the University of Sheffield which contains a lot of similarity measures. But for the Levenshtein distance and Jaro-Winkler distance measures, the code is only at character level, whereas I require code at the sentence level (i.e. considering a single word as a unit instead of character-wise). Additionally, there is no code for computing the Manhattan distance in SimMetrics. Are there any suggestions for how I could develop the required code (or someone provide me the code) at the sentence level for the above mentioned measures?
Thanks a lot in advance for your time and effort helping me.
I have been working in the area of NLP for a few years now, and I completely agree with those who have provided answers/comments. This really is a hard nut to crack! But, let me still provide a few pointers:
(1) Lexical similarity: Instead of trying to generalize Jaro-Winkler distance to sentence-level, it is probably much more fruitful if you develop a character-level or word-level language model, and compute the log-likelihood. Let me explain further: train your language model based on a corpus. Then take a whole lot of candidate sentences that have been annotated as similar/dissimilar to the sentences in the corpus. Compute the log-likelihood for each of these test sentences, and establish a cut-off value to determine similarity.
(2) Syntactic similarity: So far, only stylometric similarities can manage to capture this. For this, you will need to use PCFG parse trees (or TAG parse trees. TAG = tree adjoining grammar, a generalization of CFGs).
(3) Semantic similarity: off the top of my head, I can only think of using resources such as Wordnet, and identifying the similarity between synsets. But this is not simple either. Your first problem will be to identify which words from the two (or more) sentences are "corresponding words", before you can proceed to check their semantics.
As Chris suggests, this is a non-trivial project for a beginner. I would suggest you start of something simpler (if relatively boring) such as chunking.
Have a look at the docs and books for the Python NLTK library - there are some samples that are close to what you are looking for. For example, containment: is it plausible that one statement contains another. note the 'plausible' there, the state of the art isn't good enough for a simple yes/no or even a probability.

Resources