dataset to use for question formation from any text - nlp

I am trying to create an improved quiz generator that accepts a text as input and forms questions from its sentences. I want to create a machine learning model that splits a sentence into different parts so it is capable of forming different questions from the same sentence. For example, from the sentence "The Amazon river is the longest river in South America." it should form questions such as: What is the longest river in South America? Is the Amazon river the longest river in South America? Where is the Amazon river located? etc. If possible, I would also like it to capture context across multiple sentences and form one question from the combined information. I want it to perform well on any text, not just a specific topic. How should I build my dataset, or which existing dataset should I use?
I don't have a lot of prior knowledge on the topic, so I was thinking of somehow using nltk.pos_tag(), which tags every word in a sentence with its part of speech. I am just not sure how to use it in my model and dataset.

What you're attempting to do is non-trivial and is related to the task of Automatic Question Generation (AQG), which concerns converting structured or unstructured declarative natural language sentences into valid interrogative forms. Various automated linguistic (rule-based) and statistical methods have been employed. I'd recommend reading [1] by Blšták & Rozinajová, particularly Section 2, which summarises some of the datasets and methods available. The survey by Lu & Lu [2] provides a recent overview of the field. The most common approach seems to be to leverage existing QA datasets (e.g. SQuAD, HotpotQA, et cetera; see Table 5 of [2]). As a more practical, quick way to get started without having to train your own ML/DL model, you could use an existing Transformer-based model from HuggingFace such as iarfmoose/t5-base-question-generator, available here, which takes a concatenated answer and context as its input sequence, e.g.:
<answer> answer text here <context> context text here
and will generate a full question (interrogative) sentence as an output sequence. According to the author, it is recommended that a large number of sequences be generated and then filtered with iarfmoose/bert-base-cased-qa-evaluator.
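For a quick start, a minimal Python sketch along these lines should work (this assumes the transformers library is installed; the generation settings are illustrative starting points, not the model author's recommended values):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Illustrative only: load the question-generation checkpoint mentioned above
model_name = "iarfmoose/t5-base-question-generator"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Input follows the "<answer> ... <context> ..." format described above
text = ("<answer> the Amazon river <context> The Amazon river is the "
        "longest river in South America.")

inputs = tokenizer(text, return_tensors="pt")
# Generation settings here are assumptions; tune max_length/num_beams as needed
output_ids = model.generate(**inputs, max_length=64, num_beams=4)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))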
References
[1] Blšták, M. and Rozinajová, V., 2022. Automatic question generation based on sentence structure analysis using machine learning approach. Natural Language Engineering, 28(4), pp.487-517.
[2] Lu, C.Y. and Lu, S.E., 2021, October. A Survey of Approaches to Automatic Question Generation: from 2019 to Early 2021. In Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021) (pp. 151-162).

Which HuggingFace summarization models support more than 1024 tokens? Which model is more suitable for programming related articles?

If this is not the best place to ask this question, please point me to a more appropriate one.
I am planning to use one of the Huggingface summarization models (https://huggingface.co/models?pipeline_tag=summarization) to summarize my lecture video transcriptions.
So far I have tested facebook/bart-large-cnn and sshleifer/distilbart-cnn-12-6, but they only support a maximum of 1,024 tokens as input.
So, here are my questions:
Are there any summarization models that support longer inputs such as 10,000 word articles?
What are the optimal output lengths for given input lengths? Let's say for a 1,000 word input, what is the optimal (minimum) output length (the min. length of the summarized text)?
Which model would likely work on programming related articles?
Question 1
Are there any summarization models that support longer inputs such as
10,000 word articles?
Yes, the Longformer Encoder-Decoder (LED) [1] model published by Beltagy et al. is able to process up to 16k tokens. Various LED models are available here on HuggingFace. There is also PEGASUS-X [2] published recently by Phang et al. which is also able to process up to 16k tokens. Models are also available here on HuggingFace.
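For example, an LED checkpoint can be run much like any other seq2seq summariser, with the one LED-specific detail that global attention should be enabled on at least the first token. A rough sketch follows; the checkpoint allenai/led-large-16384-arxiv is just one publicly available option (fine-tuned on arXiv papers), so results on lecture transcripts may need further fine-tuning, and the generation settings are assumptions to tune:

import torch
from transformers import AutoTokenizer, LEDForConditionalGeneration

# Illustrative checkpoint choice; pick a LED model suited to your domain
model_name = "allenai/led-large-16384-arxiv"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = LEDForConditionalGeneration.from_pretrained(model_name)

long_text = "..."  # your full transcript, up to ~16k tokens

inputs = tokenizer(long_text, return_tensors="pt", truncation=True, max_length=16384)

# LED expects global attention on at least the first token
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, 0] = 1

summary_ids = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    global_attention_mask=global_attention_mask,
    max_length=256,
    num_beams=4,
)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))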
Alternatively, you can look at either:
Extractive followed by abstractive summarisation, or
Splitting a large document into chunks of max_input_length (e.g. 1,024 tokens), summarising each chunk, and then concatenating the summaries; see the sketch after this list. Care must be taken in how the document is chunked, to avoid splitting mid-way through a particular topic or producing a relatively short final chunk that yields an unusable summary.
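A rough sketch of the chunk-and-summarise option, using the facebook/bart-large-cnn model already mentioned in the question; the whitespace chunking and the 700-word chunk size are simplifying assumptions to stay under the 1,024-token limit, and sentence-aware splitting would be better:

from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def summarize_long(text, chunk_words=700, max_length=150, min_length=50):
    # Naive whitespace chunking; splitting on sentence/topic boundaries is preferable
    words = text.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)]
    # Summarise each chunk independently
    partial = [summarizer(chunk, max_length=max_length, min_length=min_length)[0]["summary_text"]
               for chunk in chunks]
    # Concatenate the per-chunk summaries (optionally summarise this again)
    return " ".join(partial)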
Question 2
What are the optimal output lengths for given input lengths? Let's say
for a 1,000 word input, what is the optimal (minimum) output
length (i.e. the min. length of the summarized text)?
This is a very difficult question to answer, as it is hard to empirically evaluate the quality of a summarisation. I would suggest running a few tests yourself with varied output-length limits (e.g. 20, 50, 100, 200) and finding what subjectively works best. Each model and document genre will be different. Anecdotally, I would say 50 words is a good minimum, with 100-150 giving better results.
Question 3
Which model would likely work on programming related articles?
I can imagine three possible cases for what constitutes a programming related article.
Source code summarisation (which involves producing a natural (informal) language summary of code (formal language)).
Traditional abstractive summarisation (i.e. a natural language summary of natural language, for articles that talk about programming yet contain no code).
Combination of both 1 and 2.
For case (1), I'm not aware of any implementations on HuggingFace that focus on this problem. However, it is an active research topic (see [3], [4], [5]).
For case (2), you can use the models you've been using already and, if feasible, fine-tune them on your own dataset of programming related articles.
For case (3), simply look at combining implementations from both (1) and (2) based on whether the input is categorised as either formal (code) or informal (natural) language.
References
[1] Beltagy, I., Peters, M.E. and Cohan, A., 2020. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
[2] Phang, J., Zhao, Y. and Liu, P.J., 2022. Investigating Efficiently Extending Transformers for Long Input Summarization. arXiv preprint arXiv:2208.04347.
[3] Ahmad, W.U., Chakraborty, S., Ray, B. and Chang, K.W., 2020. A transformer-based approach for source code summarization. arXiv preprint arXiv:2005.00653.
[4] Wei, B., Li, G., Xia, X., Fu, Z. and Jin, Z., 2019. Code generation as a dual task of code summarization. Advances in neural information processing systems, 32.
[5] Wan, Y., Zhao, Z., Yang, M., Xu, G., Ying, H., Wu, J. and Yu, P.S., 2018, September. Improving automatic source code summarization via deep reinforcement learning. In Proceedings of the 33rd ACM/IEEE international conference on automated software engineering (pp. 397-407).

Named Entity Recognition Systems for German Texts

I am working on a Named Entity Recognition (NER) project in which I have a large amount of text, too much to read or even skim. Therefore, I want to create an overview of what is mentioned by extracting named entities (places, names, times, maybe topics) and build an index of the form (entity, list of pages/lines where it is mentioned). I have worked through Stanford's NLP lecture and (parts of) Eisenstein's Introduction to NLP book, and found some literature and systems for English texts. As my corpus is in German, I would like to ask how I can approach this problem. Also, this is my first NLP project, so I am not sure I could solve this challenge even if the texts were in English.
As a first step
are there German NER systems out there which I could use?
The further roadmap of my project is:
How can I avoid mapping misspellings or rare names to a NUL/UNK token? This is relevant because there are also some historic passages that use words no longer in use or that follow old orthography. I think the relevant terms are tokenisation or stemming.
I have thought about fine-tuning (transfer learning) a base NER model on a corpus of historic texts to improve the NER.
A major challenge is that there is no annotated dataset for my corpus available and I could only manually annotate a tiny fraction of it. So I would be happy for hints on German annotated datasets which I could incorporate into my project.
Thank you in advance for your inputs and fruitful discussions.
Most good NLP toolkits can perform NER in German:
Stanford NLP
Spacy
probably NLTK and OpenNLP as well
What is crucial to understand is that using NER software like the above means using a pretrained model, i.e. a model which has been previously trained on some standard corpus with standard annotated entities.
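For instance, with spaCy a pretrained German pipeline can be loaded in a couple of lines (after running python -m spacy download de_core_news_sm; the example sentence is only an illustration):

import spacy

# Pretrained German pipeline (small); larger variants exist (de_core_news_md/lg)
nlp = spacy.load("de_core_news_sm")
doc = nlp("Angela Merkel besuchte im Juli das Goethe-Institut in München.")
for ent in doc.ents:
    print(ent.text, ent.label_)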
Btw you can usually find the original annotated dataset by looking at the documentation. There's one NER corpus here.
This is convenient and might suit your goal, but sometimes it doesn't capture exactly everything that you would like it to, especially if your corpus is from a very specific domain. If you need more specific NER, you must train your own model, and that requires obtaining some annotated data (i.e. manually annotating it or paying somebody to do it).
Even then, a NER model is statistical and will unavoidably make some mistakes, so don't expect perfect results.
About misspellings or rare names: a NER model doesn't care (or not too much) about the actual entity string, because it's not primarily based on the words in the entity. It's based on cues in the surrounding text; for example, in the sentence "It was announced by Mr XYZ that the event would take place in July", the NER model should find 'Mr XYZ' as a person due to "announced by" and 'July' as a date because of "take place in". However, if the language used in your corpus is very different from the training data used for the model, performance could be very poor.

How do you differentiate between names, places, and things?

Here is a list of proper nouns taken from The Lord of the Rings. I was wondering if there is a good way to sort them based on whether they refer to a person, place or thing. Does there exist a natural language processing library that can do this? Is there a way to differentiate between places, names, and things?
Shire, Tookland, Bagginses, Boffins, Marches, Buckland, Fornost, Norbury, Hobbits, Took, Thain, Oldbucks, Hobbitry, Thainship, Isengrim, Michel, Delving, Midsummer, Postmaster, Shirriff, Farthing, Bounders, Bilbo, Frodo
You're talking about Named Entity Recognition (NER). It is the information extraction task that seeks to locate and classify pieces of text into predefined categories such as person names, locations, organizations, time expressions, monetary values, etc. You can either do that with unsupervised methods, using a dictionary such as the word list you have, or with supervised methods such as CRFs or neural networks, but then you need a set of sentences annotated with the entity spans and their classes. In this example here, using spaCy (an NLP library), the authors applied NER to the Lord of the Rings novels. You can read more in the link.
Here is the solution:
Named-entity recognition (NER) (also known as entity identification, entity chunking and entity extraction) is a subtask of information extraction that seeks to locate and classify named entities mentioned in unstructured text into pre-defined categories such as person names, organizations, locations, medical codes, time expressions, quantities, monetary values, percentages, etc.
Wikipedia link: https://en.wikipedia.org/wiki/Named-entity_recognition
Named Entity Recognition (NER) is a standard NLP problem which involves spotting named entities (people, places, organizations etc.) from a chunk of text, and classifying them into a predefined set of categories. Some of the practical applications of NER include:
Scanning news articles for the people, organizations and locations reported.
Providing concise features for search optimization: instead of searching the entire content, one may simply search for the major entities involved.
Quickly retrieving geographical locations talked about in Twitter posts.
NER with spaCy
spaCy is regarded as one of the fastest NLP frameworks in Python, with single optimized functions for each of the NLP tasks it implements. It is easy to learn and use, and simple tasks can be performed with a few lines of code.
Installation :
!pip install spacy
!python -m spacy download en_core_web_sm
import spacy

# Load the small English pipeline and process a sample sentence
nlp = spacy.load('en_core_web_sm')
sentence = "Apple is looking at buying U.K. startup for $1 billion"
doc = nlp(sentence)

# Print each recognised entity with its character offsets and label
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
Output:
Apple 0 5 ORG
U.K. 27 31 GPE
$1 billion 44 54 MONEY
In the output, the first column specifies the entity, the next two columns the start and end characters within the sentence/document, and the final column specifies the category.
Further, it is interesting to note that spaCy’s NER model uses capitalization as one of the cues to identify named entities. The same example, when tested with a slight modification, produces a different result.
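For instance, lowercasing the sentence (an illustrative modification, reusing the nlp object loaded above; exact output will depend on the model version) will typically cause spaCy to miss some of the entities it found before:

# Same sentence, lowercased: the capitalization cue is gone
doc = nlp("apple is looking at buying u.k. startup for $1 billion")
for ent in doc.ents:
    print(ent.text, ent.start_char, ent.end_char, ent.label_)

With the capitalization removed, 'apple' and 'u.k.' are often no longer recognised, while the monetary value usually still is.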

Text processing tool for tweets

I am collecting millions of sports related tweets daily. I want to process the text in those tweets. I want to recognize the entities, find the sentiment of the sentence and find the events in those tweets.
Entity recognition:
For example:
"Rooney will play for England in their next match".
From this tweet I want to recognize the person entity "Rooney" and the place entity "England".
Sentiment analysis:
I want to find the sentiment of a sentence. For example:
Chelsea played their worst game ever
Ronaldo scored a beautiful goal
The first one should be marked as a "negative" sentence and the latter one as "positive".
Event recognition:
I want to find "goal scoring events" in tweets. Sentences like "messi scored goal in first half" and "that was a fantastic goal from gerrald" should be marked as "goal scoring event".
I know entity recognition and sentiment analysis tools are available, and that I need to write the rules for event recognition myself. I have seen many tools, such as Stanford NER, AlchemyAPI, OpenCalais, MeaningCloud API, LingPipe, Illinois NER, etc.
I'm really confused about which tool I should select. Are there any free tools available without daily rate limits? I want to process millions of tweets daily, and Java is my preferred language.
Thanks.
For NER you can also use TwitIE, which is a GATE pipeline, so you can call it through the GATE API in Java.
Given that your preferred language is Java, I would strongly suggest starting with the Stanford NLP project. Most of your basic needs, like cleansing, chunking and NER, can be met with it. For NER click here.
Going ahead, for sentiment analysis you can use simple classifiers like Naive Bayes and then add complexity. More here.
For event extraction, you can use a linguistic approach: identify the verbs and their association with an ontology on your side.
Just remember, this is only to get you started and is in no way an exhaustive answer.
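As a starting point for that rule-based route, here is a hypothetical Python/spaCy sketch of a "goal scoring event" trigger; it is only a lemma lookup, not a real event extractor, and the trigger set is an assumption you would replace with your own ontology:

import spacy

nlp = spacy.load("en_core_web_sm")

# Assumed trigger lemmas; in practice these would come from your ontology
GOAL_TRIGGERS = {"goal", "score"}

def is_goal_event(text):
    # Flag a tweet as a goal-scoring event if any trigger lemma appears
    lemmas = {tok.lemma_.lower() for tok in nlp(text)}
    return bool(lemmas & GOAL_TRIGGERS)

for tweet in ["messi scored goal in first half",
              "that was a fantastic goal from gerrald",
              "Rooney will play for England in their next match"]:
    print(tweet, "->", is_goal_event(tweet))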
No API with unlimited calls is available. If you want to stick with Java, use the Stanford package and customize it to your needs.
If you are comfortable with Python, look at NLTK.
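For example, NLTK ships with the rule-based VADER sentiment scorer, which was designed for social-media text; a minimal sketch (it requires downloading the vader_lexicon resource, and the thresholding on the compound score is a simplification):

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

for tweet in ["Chelsea played their worst game ever",
              "Ronaldo scored a beautiful goal"]:
    scores = sia.polarity_scores(tweet)
    # The compound score runs from -1 (most negative) to +1 (most positive)
    label = "positive" if scores["compound"] > 0 else "negative"
    print(tweet, "->", label, scores)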
Well, for person and organization entities Stanford will work; for your input query:
Rooney will play for England in their next match
[Text=Rooney CharacterOffsetBegin=0 CharacterOffsetEnd=6 PartOfSpeech=NNP Lemma=Rooney NamedEntityTag=PERSON]
[Text=will CharacterOffsetBegin=7 CharacterOffsetEnd=11 PartOfSpeech=MD Lemma=will NamedEntityTag=O]
[Text=play CharacterOffsetBegin=12 CharacterOffsetEnd=16 PartOfSpeech=VB Lemma=play NamedEntityTag=O]
[Text=for CharacterOffsetBegin=17 CharacterOffsetEnd=20 PartOfSpeech=IN Lemma=for NamedEntityTag=O]
[Text=England CharacterOffsetBegin=21 CharacterOffsetEnd=28 PartOfSpeech=NNP Lemma=England NamedEntityTag=LOCATION]
[Text=in CharacterOffsetBegin=29 CharacterOffsetEnd=31 PartOfSpeech=IN Lemma=in NamedEntityTag=O]
[Text=their CharacterOffsetBegin=32 CharacterOffsetEnd=37 PartOfSpeech=PRP$ Lemma=they NamedEntityTag=O]
[Text=next CharacterOffsetBegin=38 CharacterOffsetEnd=42 PartOfSpeech=JJ Lemma=next NamedEntityTag=O]
[Text=match CharacterOffsetBegin=43 CharacterOffsetEnd=48 PartOfSpeech=NN Lemma=match NamedEntityTag=O]
If you want to add event recognition too, you need to retrain the Stanford classifier with an extra class, using an event-based dataset. That will let you classify event-related input.
Does the NER use part-of-speech tags?
None of our current models use pos tags by default. This is largely
because the features used by the Stanford POS tagger are very similar
to those used in the NER system, so there is very little benefit to
using POS tags.
However, it certainly is possible to train new models which do use POS
tags. The training data would need to have an extra column with the
tag information, and you would then add tag=X to the map parameter.
check - http://nlp.stanford.edu/software/crf-faq.shtml
Stanford NER and OPENNLP are both open-source and have models that perform well on formal article/texts.
But their accuracy drops significantly over Twitter (from 90% recall over formal text to 40% recall over tweets).
The informal nature of tweets (poor capitalization, spelling and punctuation), improper word usage, vernacular and emoticons make the task more complicated.
NER, sentiment analysis and event extraction over tweets are well-researched areas, apparently because of their applications.
Take a look at this: https://github.com/aritter/twitter_nlp, see this demo of twitter NLP and event extraction: http://ec2-54-170-89-29.eu-west-1.compute.amazonaws.com:8000/
Thank you

NLP software for classification of large datasets

Background
For years I've been using my own Bayesian-like methods to categorize new items from external sources based on a large and continually updated training dataset.
There are three types of categorization done for each item:
30 categories, where each item must belong to at least one category and at most two.
10 other categories, where each item is only associated with a category if there is a strong match, and each item can belong to as many categories as match.
4 other categories, where each item must belong to only one category, and if there isn't a strong match the item is assigned to a default category.
Each item consists of English text of around 2,000 characters. In my training dataset there are about 265,000 items, which contain a rough estimate of 10,000,000 features (unique three word phrases).
My homebrew methods have been fairly successful, but definitely have room for improvement. I've read the NLTK book's chapter "Learning to Classify Text", which was great and gave me a good overview of NLP classification techniques. I'd like to be able to experiment with different methods and parameters until I get the best classification results possible for my data.
The Question
What off-the-shelf NLP tools are available that can efficiently classify such a large dataset?
Those I've tried so far:
NLTK
TIMBL
I tried to train them with a dataset that consisted of less than 1% of the available training data: 1,700 items, 375,000 features. For NLTK I used a sparse binary format, and a similarly compact format for TIMBL.
Both seemed to rely on doing everything in memory, and quickly consumed all system memory. I can get them to work with tiny datasets, but nothing large. I suspect that if I tried incrementally adding the training data the same problem would occur either then or when doing the actual classification.
I've looked at Google's Prediction API, which seems to do much of what I'm looking for but not everything. I'd also like to avoid relying on an external service if possible.
About the choice of features: in testing with my homebrew methods over the years, three word phrases produced by far the best results. Although I could reduce the number of features by using words or two word phrases, that would most likely produce inferior results and would still be a large number of features.
Following this post, and based on personal experience, I would recommend Vowpal Wabbit. It is said to have one of the fastest text classification algorithms.
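For context, VW is a command-line tool that streams its training data from disk, so your 265,000 items never need to fit in memory. A sketch of what a multiclass (one-against-all) setup might look like; the file names, the underscore-joined three-word-phrase features and the pass count are placeholders, not prescriptions:

# train.vw: one item per line, in the format "<class id 1..30> | <features>"
# e.g.  7 | mariners_won_the won_the_game the_game_in ...

vw -d train.vw --oaa 30 --passes 5 -c -f model.vw      # train one-against-all over 30 classes
vw -d test.vw -t -i model.vw -p predictions.txt        # predict on held-out items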
MALLET has a number of classifiers (NB, MaxEnt, CRF, etc.). It's written by Andrew McCallum's group. SVMLib is another good option, but SVM models typically require a bit more tuning than MaxEnt. Alternatively, some sort of online clustering like K-means might not be bad in this case.
SVMLib and MALLET are quite fast (C and Java) once you have your model trained. Model training can take a while though! Unfortunately it's not always easy to find example code. I have some examples of how to use MALLET programmatically (along with the Stanford Parser, which is slow and probably overkill for your purposes). NLTK is a great learning tool and is simple enough that if you can prototype what you are doing there, that's ideal.
NLP is more about features and data quality than which machine learning method you use. 3-grams might be good, but how about character n-grams across those? I.e., all the character n-grams within a 3-gram, to account for spelling variations/stemming/etc.? Named entities might also be useful, or some sort of lexicon.
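To make the character n-gram idea concrete, here is a tiny Python helper (the n=4 choice is arbitrary):

def char_ngrams(phrase, n=4):
    # Character n-grams of a phrase, so near-duplicate spellings share features
    s = phrase.replace(" ", "_")
    return [s[i:i + n] for i in range(len(s) - n + 1)]

print(char_ngrams("scored a goal"))
# ['scor', 'core', 'ored', 'red_', 'ed_a', 'd_a_', '_a_g', 'a_go', '_goa', 'goal']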
I would recommend Mahout as it is intended for handling very large scale data sets.
The ML algorithms are built on Apache Hadoop (MapReduce), so scaling is inherent.
Take a look at the classification section at the link below and see if it helps.
https://cwiki.apache.org/confluence/display/MAHOUT/Algorithms
Have you tried MALLET?
I can't be sure that it will handle your particular dataset but I've found it to be quite robust in previous tests of mine.
However, my focus was on topic modeling rather than classification per se.
Also, be aware that with many NLP solutions you needn't input the "features" yourself (such as the n-grams, i.e. the three-word and two-word phrases mentioned in the question), but can instead rely on the various NLP functions to produce their own statistical model.
