Can you make a Q&A language model stay on topic? - nlp

I’m thinking of fine-tuning a pre-trained language model for a Q&A task. More specifically, I’d like to fine-tune the model on a single chapter of a classic college textbook. Afterward, the reader of the chapter should be able to engage in a Q&A session with the model about the content of the chapter. But how do I make sure that the model stays on topic and doesn’t go off on a tangent? I know it is possible when looking at what https://play.aidungeon.io/ has achieved, but I don’t know if it will require me to build a model from the ground up for each chapter. Can anyone tell me if I’m out of my mind or if it’s feasible?
Best,

Related

Best lexicons for sentence vs document level analysis

What are the best lexicons for document-level and sentence-level analysis? I'm currently using Vader for sentence-level analysis, but I'm worried that when I move to the document level, Vader may not perform as well as others.
This is similar to the question in the post here, but more specific.
In addition to the sentiment lexica listed in the linked post, I can recommend the AFINN sentiment lexicon.
For sentiment analysis, relying on lexica alone may not be the best solution, especially at the document level. Language is so flexible that attributes and notions other than sentiment-laden vocabulary affect semantics deeply.
Some of the core notions are contrastive discourse markers (especially at the document level), negation and modality.
contrastive discourse markers
Some opinions contain both pros and cons within one document, and we tie those together with markers like 'however', 'nevertheless', etc. to convey a meaning or an idea. For a bag-of-words approach, the two sentences below are treated the same, yet if people were asked to annotate their sentiment with a single label, they might not give both the same one:
The laptop has amazing features, but its screen is killing me.
The laptop's screen is killing me, but it has amazing features.
In general, we evaluate these kinds of sentences or paragraphs by the sentiment of the clause after 'but'. Other contrastive discourse markers have their own semantics as well. This is studied in an area called discourse analysis.
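To make the 'but' point concrete, here is a minimal sketch in Python. The two-word lexicon, the marker list and the function names are all made up for illustration; a real setup would plug in Vader, AFINN or similar, and "score only the clause after the last marker" is just one simple rule, not an established algorithm.

```python
import re

# Toy lexicon invented for this example; swap in a real sentiment lexicon.
TOY_LEXICON = {"amazing": 2.0, "killing": -2.0}

# A few contrastive discourse markers; the list is illustrative, not complete.
CONTRAST_MARKERS = ("but", "however", "nevertheless")


def lexicon_score(text):
    """Plain bag-of-words score: sum the lexicon values of all tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)


def contrast_aware_score(sentence):
    """Score only the clause that follows the last contrastive marker."""
    pattern = r"\b(?:" + "|".join(CONTRAST_MARKERS) + r")\b"
    clauses = re.split(pattern, sentence, flags=re.IGNORECASE)
    return lexicon_score(clauses[-1])


a = "The laptop has amazing features, but its screen is killing me."
b = "The laptop's screen is killing me, but it has amazing features."

# Bag of words cannot tell the two apart; the contrast-aware rule can.
print(lexicon_score(a), lexicon_score(b))
print(contrast_aware_score(a), contrast_aware_score(b))
```

With this toy lexicon the plain score is identical for both sentences, while the contrast-aware score is negative for the first and positive for the second.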
negation and modality
These notions change semantics as well, so they cannot be overlooked at either level. There are studies and papers that have used negation and modality triggers together with sentiment lexica. You can search for 'negation and modality on sentiment analysis' to see what you can do.
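As a rough illustration of a negation trigger, the sketch below flips the polarity of lexicon words that appear shortly after a negator. The lexicon, the negator list and the three-token window are assumptions chosen for the example, and modality is not handled at all.

```python
import re

# Toy lexicon and negator list, invented for this example.
TOY_LEXICON = {"good": 1.0, "bad": -1.0}
NEGATORS = {"not", "never", "no"}
WINDOW = 3  # how many tokens after a negator get their polarity flipped


def negation_aware_score(text):
    """Sum lexicon values, flipping the sign inside a negation window."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0.0
    flips_left = 0  # tokens still inside the current negation scope
    for tok in tokens:
        if tok in NEGATORS or tok.endswith("n't"):
            flips_left = WINDOW
            continue
        value = TOY_LEXICON.get(tok, 0.0)
        score += -value if flips_left > 0 else value
        flips_left = max(0, flips_left - 1)
    return score


print(negation_aware_score("The screen is good"))
print(negation_aware_score("The screen is not good"))
print(negation_aware_score("The screen isn't bad"))
```

A fixed window is crude (it ignores clause boundaries and punctuation), but it shows why a lexicon alone gives the wrong sign for "not good".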
Finally, if you have a domain-specific dataset, I'd suggest building your own lexicon using distant supervision.
Hope this helps,
Cheers

How are collaborative-filtering and topic-modeling different and how are they the same?

related to: Simple Python implementation of collaborative topic modeling?
I'm trying to grasp the fundamental differences and the fundamental similarities between collaborative filtering and topic modeling. Both seem very much alike to me: each looks for a latent dimension that can compactly predict which user would like which movie, or which document would contain which word.
Can you shed some light or send me to sources that will clarify that?
Thanks!
I think this paper is your best bet:
https://www.cs.princeton.edu/~blei/papers/WangBlei2011.pdf
It talks about combining collaborative filtering and topic modeling (two really distinct things).
There may be some resemblance in the way the solution is generated, especially if you compare probabilistic matrix factorization for collaborative filtering with probabilistic topic modeling, but it is still rather limited.
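To show where the resemblance lies, here is a minimal sketch of regularized matrix factorization trained with SGD, in plain Python. It is not the probabilistic model from the paper, just the simplest version of "find a few latent dimensions that explain a matrix". The ratings, the function names and the hyperparameters are made up for the example.

```python
import random


def factorize(matrix, k=2, steps=2000, lr=0.01, reg=0.02, seed=0):
    """Approximate matrix[i][j] by dot(P[i], Q[j]) using SGD.

    Entries that are None are treated as unobserved and skipped.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(matrix), len(matrix[0])
    P = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_rows)]
    Q = [[rng.uniform(0, 0.5) for _ in range(k)] for _ in range(n_cols)]
    observed = [(i, j, matrix[i][j])
                for i in range(n_rows) for j in range(n_cols)
                if matrix[i][j] is not None]
    for _ in range(steps):
        for i, j, value in observed:
            err = value - sum(P[i][f] * Q[j][f] for f in range(k))
            for f in range(k):
                p, q = P[i][f], Q[j][f]
                P[i][f] += lr * (err * q - reg * p)
                Q[j][f] += lr * (err * p - reg * q)
    return P, Q


def rmse(matrix, P, Q):
    """Root mean squared error over the observed entries only."""
    errors = [(matrix[i][j] - sum(a * b for a, b in zip(P[i], Q[j]))) ** 2
              for i in range(len(matrix)) for j in range(len(matrix[0]))
              if matrix[i][j] is not None]
    return (sum(errors) / len(errors)) ** 0.5


# Rows are users, columns are movies, None means "not rated yet".
ratings = [
    [5, 4, None, 1],
    [4, 5, 1, None],
    [1, None, 5, 4],
    [None, 1, 4, 5],
]
P, Q = factorize(ratings)
print(rmse(ratings, P, Q))
```

The same function would accept a document-by-word count matrix, which is the sense in which the two problems look alike. The difference is in the model behind the factors: a topic model such as LDA treats each document as a probability distribution over topics, and its factors do not come out of a squared-error fit like this one.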
From your question it is not clear whether you're wondering about topic modeling or collaborative topic modeling.
Nonetheless, the paper I mentioned gives some background on collaborative filtering (through matrix factorization), some background on probabilistic topic modeling and then:
Collaborative Topic Regression (CTR): CTR combines traditional collaborative filtering with topic modeling.
I just realized that this paper is already referenced in the question you are linking to, so let me share another great resource, this article from the NYT, which is less math-heavy:
http://open.blogs.nytimes.com/2015/08/11/building-the-next-new-york-times-recommendation-engine/
There they describe how they actually implemented the approach from the paper mentioned above.
For more details on topic modeling itself, I'd suggest diving into the resources on this page:
https://www.cs.princeton.edu/~blei/topicmodeling.html
and this paper for matrix factorization for collaborative filtering:
https://datajobs.com/data-science-repo/Recommender-Systems-%5BNetflix%5D.pdf

NLP of Legal Texts?

I have a corpus of a few hundred thousand legal documents (mostly from the European Union): laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc.). But at the single-document level, I wish I had better tools to allow fast comprehension. I am open to ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I have played around with Google's new Parsey McParseface and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. There are two main options, each with an example:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents then you may apply some post-processing to determine which information is relevant. Finding which information is relevant is task related and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting use case. You've also mentioned that you have a corpus (which is a really good plus). Let me describe a solution that I sketched for extracting the crux from research papers.
To make sense of documents, you need to tell (or train) the computer to look for "triggers". You can approach this with a supervised learning algorithm, at the most basic level a simple text classification setup. But this needs prior work: help from domain experts, initially, to discern the "triggers" in the textual data. There are tools to extract the gist of sentences. For example, take the noun phrases in a sentence, assign weights based on co-occurrences and represent them as vectors. This is your training data.
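A minimal sketch of the "weights based on co-occurrences" idea follows. It assumes the noun phrases have already been extracted per sentence (that step needs a chunker or parser and is not shown), and the example phrases and the function name are invented.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cooccurrence_vectors(sentences):
    """Build one count vector per phrase.

    sentences: list of lists of phrases, one inner list per sentence.
    Returns (vocab, vectors) where vectors[phrase][n] is how many
    sentences contain both that phrase and vocab[n].
    """
    counts = defaultdict(Counter)
    for phrases in sentences:
        for a, b in combinations(sorted(set(phrases)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    vocab = sorted({p for phrases in sentences for p in phrases})
    vectors = {p: [counts[p][q] for q in vocab] for p in vocab}
    return vocab, vectors


# Hypothetical pre-extracted noun phrases, one list per sentence.
sentences = [
    ["member state", "data protection", "tariff"],
    ["data protection", "tariff"],
    ["member state", "court"],
]
vocab, vectors = cooccurrence_vectors(sentences)
print(vocab)
print(vectors["tariff"])
```

Vectors like these, paired with labels from domain experts, could serve as input to a text classifier.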
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is word sense disambiguation and domain adaptation. You want to make sense of what is in the documents, i.e. understand the semantics to figure out the meaning. You can build a legal ontology of terms in SKOS or JSON-LD format, represent it in a knowledge graph, and use it with dependency parsing such as TensorFlow's Parsey McParseface. Or you can stream your documents in using a kappa-based architecture, something like Kafka-Flink-Elasticsearch with added intermediate NLP layers using CoreNLP/TensorFlow/UIMA, and cache your indexing setup between Flink and Elasticsearch using Redis to speed up the process. To understand relevancy you can apply case-specific boosting in your search. Furthermore, apply sentiment analysis to work out intents and truthfulness.
Your use case is one of information extraction, summarization, and semantic web/linked data. As the EU has a different legal system, you would first need to generalize on what a legal document really is, then narrow it down to specific legal concepts as they relate to a topic or region. You could also use topic modelling techniques here, such as LDA, or embeddings like Word2Vec/Sense2Vec. Lemon might also help with converting lexical to semantic and semantic to lexical, i.e. NLP -> ontology and ontology -> NLP. Essentially, feed the clustering into your classification step for named entity recognition. You can also use the clustering to assist you in building out the ontology, or in seeing which word vectors are in a document or set of documents using cosine similarity. But in order to do all that, it would be best to first visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.
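For the cosine similarity step, here is a minimal sketch using plain word counts instead of learned word vectors (Word2Vec and the like would replace `bag_of_words` with real embeddings). The two example sentences and the helper names are made up.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts; a stand-in for a real document vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = bag_of_words("The directive amends the tariff schedule")
doc_b = bag_of_words("The regulation amends the tariff schedule")
print(cosine_similarity(doc_a, doc_b))
```

Documents (or clusters) whose vectors have a high cosine similarity are candidates for sharing the same ontology concepts.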

Custom NER and POS tagging

I was checking out Stanford CoreNLP in order to understand NER and POS tagging. But what if I want to create custom tags for entities, like <title>Nights</title>, <genre>Jazz</genre>, <year>1992</year>? How can I do it? Is CoreNLP useful in this case?
Out of the box, CoreNLP is restricted to the types it documents: PERSON, LOCATION, ORGANIZATION, MISC, DATE, TIME, MONEY, NUMBER. No, you won't be able to recognize other entities just by assuming it could "intuitively" do it :)
In practice, you'll have to choose, either:
Find another NER system that tags those types
Address this tagging task using knowledge-based / unsupervised approaches.
Search for extra resources (corpora) that contain the types you want to recognize, and re-train a supervised NER system (CoreNLP or other)
Build (and possibly annotate) your own resources - then you'll have to define an annotation scheme, rules, etc. - quite an interesting part of the work!
Indeed, unless you find an existing system that fulfills your needs, some effort will be required! Unsupervised approaches may help you bootstrap a system, so you can see whether you need to find / annotate a dedicated corpus. In the latter case, it would be better to split the data into train/dev/test parts, so that you can assess how well the resulting system performs on unseen data.
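The train/dev/test split can be as simple as the sketch below. The 80/10/10 proportions, the seed and the function name are arbitrary choices for illustration.

```python
import random


def split_corpus(docs, dev_frac=0.1, test_frac=0.1, seed=13):
    """Shuffle once with a fixed seed, then cut into train/dev/test."""
    docs = list(docs)
    random.Random(seed).shuffle(docs)
    n_test = int(len(docs) * test_frac)
    n_dev = int(len(docs) * dev_frac)
    test = docs[:n_test]
    dev = docs[n_test:n_test + n_dev]
    train = docs[n_test + n_dev:]
    return train, dev, test


# Hypothetical document ids standing in for annotated documents.
train, dev, test = split_corpus(range(100))
print(len(train), len(dev), len(test))
```

Tune on dev and touch test only for the final evaluation, otherwise the score on "unseen" data is no longer honest.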
Look into this FAQ (http://nlp.stanford.edu/software/crf-faq.shtml) to see how to use the CRF classifier to train your model for new classes. You may find it useful.
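As far as I understand the FAQ, the CRF classifier is trained from a file with one token per line and its label in a second, tab-separated column, but please check the FAQ for the exact format and the properties file that goes with it. Under that assumption, here is a rough sketch that turns inline-tagged text like the example in the question into such lines. The whitespace tokenization, the "O" label for untagged tokens and the function name are simplifications of mine.

```python
import re

# Matches <tag>some text</tag> with the same tag name on both sides.
TAG_SPAN = re.compile(r"<(\w+)>(.*?)</\1>")


def to_token_label_lines(text, outside="O"):
    """Convert inline-tagged text to 'token<TAB>label' lines."""
    lines = []
    pos = 0
    for match in TAG_SPAN.finditer(text):
        # Untagged text before this span gets the outside label.
        for tok in text[pos:match.start()].split():
            lines.append(f"{tok}\t{outside}")
        # Every token inside the span gets the tag name as its label.
        for tok in match.group(2).split():
            lines.append(f"{tok}\t{match.group(1).upper()}")
        pos = match.end()
    for tok in text[pos:].split():
        lines.append(f"{tok}\t{outside}")
    return lines


example = ("<title>Nights</title> is a <genre>Jazz</genre> "
           "album from <year>1992</year>")
print("\n".join(to_token_label_lines(example)))
```

A real pipeline should tokenize the same way the tagger will at prediction time, since splitting on whitespace leaves punctuation attached to words.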

UML Diagram Examples of Popular Software

Where can one find thorough documentation and UML diagrams of popular software? I've searched around and have found very few examples. I'm sure most of this documentation will be private for enterprises, but maybe there are a few links around?
Cheers!
You will not find such a document unless you work at the MOF level, as Omondo EclipseUML does.
There is also a dilemma in UML.
Should UML just be a view of a problem, and therefore cover only one specific view of a piece of software, or should it cover the full project?
UML has been held back for many years by Model Driven Development (MDD), which means generating code from a model. That is one-shot modeling which cannot be reused. The problem is that to complete a project you cannot just generate code from a model; you need to complete the code by hand! This is why MDD and UML are useless.
Your question is really good, because why don't we just take an existing model that is specific to a business, adapt the model as much as needed and then generate the code?
I think reusable models do not exist because there are too many transformation layers between the graphical model, views, metamodel, MOF, etc.
Projects would be developed 10 times faster if reusable models and architectures were robust. So why does no free model exist today? It is always a question of money :-)
If you have no money then you need to use open source, but open source MDD is crap. It does not give a full model, just a specific view of a problem, while you need the full project to generate usable code!
Omondo has taken a courageous initiative to cover the full project by reversing all model information into a MOF model and then giving views of the model through multiple class diagrams. The class diagram is synchronized live with the code and the MOF model. The problem is that you need to pay for the tool, and consulting companies are selling their business models built on top of MOF. UML tools have less and less value, but models could be a much more profitable market in the near future.
