Using Conditional Random Fields for nested named entity recognition

My question is the following.
When we work on named entity recognition tasks, the classic LSTM-CRF architecture is used in most cases, where the CRF uses the Viterbi decoder and the transition matrix to find the best tag sequence for a sentence.
Now, if a token is associated with multiple entities rather than just one (the nested NER case), as in "Bank of China", where "China" is a location and "Bank of China" is an organization, can the CRF algorithm be adapted to handle this, i.e. to find more than one possible path through the sequence?

This issue has more to do with the dataset format than with the LSTM-CRF itself: you may indeed implement an LSTM-CRF that recognizes nested entities without any depth limitation, but such implementations are rather rare.
Most machine learning software (including LSTM-CRF implementations) is trained on the CoNLL (tab-separated) dataset format, which is not convenient for unlimited-depth nesting. Many datasets and systems therefore implement fixed-depth nesting, using additional columns (roughly one per nesting depth). Software may learn each depth separately or jointly, or use cascading models.
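As a rough illustration of what such a fixed-depth format looks like in practice, here is a minimal parsing sketch; the exact column layout (token first, then one BIO-tag column per depth) and the function name are assumptions for illustration, not a standard.

```python
# Minimal sketch: reading a hypothetical fixed-depth nested-NER file in a
# CoNLL-like tab-separated format, where column 0 is the token and each
# additional column holds the BIO tags for one nesting depth, e.g.:
#
#   Bank    B-ORG   O
#   of      I-ORG   O
#   China   I-ORG   B-LOC

def read_nested_conll(path, n_depths=2):
    """Return a list of sentences; each sentence is (tokens, [tags_depth_0, tags_depth_1, ...])."""
    sentences, tokens, tags = [], [], [[] for _ in range(n_depths)]
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:                      # blank line = sentence boundary
                if tokens:
                    sentences.append((tokens, tags))
                    tokens, tags = [], [[] for _ in range(n_depths)]
                continue
            cols = line.split("\t")
            tokens.append(cols[0])
            for d in range(n_depths):
                tags[d].append(cols[1 + d])
    if tokens:
        sentences.append((tokens, tags))
    return sentences

# Each depth can then be fed to its own (or a joint) LSTM-CRF as a flat tagging task.
```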

Related

How to encode a taxonomy in Weaviate contextionary

I would like to create a semantic context for my data before vectorizing the actual data in Weaviate (https://github.com/semi-technologies/weaviate).
Let's say we have a taxonomy with a set of domain-specific concepts together with links to their related concepts. Could you advise me on the best way to encode not only those concepts but also the relations between them using the contextionary?
Depending on your use case, there are a few possible answers:
You can create the "semantic context" in a Weaviate schema and use a vectorization module to vectorize the data according to this schema.
You have domain-specific concepts in your data that the out-of-the-box vectorization modules don't know about (e.g., specific abbreviations).
You want to capture the semantic context of (i.e., vectorize) the graph itself before adding it to Weaviate.
The first is the easiest and most straightforward; the last is the most esoteric.
Create a schema and use a vectorizer for your data
In your case, you would create a schema based on your taxonomy and load the data using an out-of-the-box vectorizer (this configurator helps you to build a Docker-compose file).
I would recommend starting with this anyway, because it will determine your data model and how you can search through and/or classify data. It might even be the case that for your use case this step already solves the problem because the out-of-the-box vectorizers are (bias alert) pretty decent.
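As a minimal sketch (assuming the v3-style Python client and made-up class/property names such as Concept and relatedTo), a taxonomy-derived schema could look roughly like this:

```python
import weaviate

# The local URL and the class/property names are illustrative assumptions,
# not part of any particular taxonomy.
client = weaviate.Client("http://localhost:8080")

concept_class = {
    "class": "Concept",
    "vectorizer": "text2vec-contextionary",   # or a transformers module
    "properties": [
        {"name": "name",        "dataType": ["text"]},
        {"name": "description", "dataType": ["text"]},
        # A cross-reference to the same class models the taxonomy's "related to" links
        {"name": "relatedTo",   "dataType": ["Concept"]},
    ],
}

client.schema.create_class(concept_class)
```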
Domain-specific concepts
At the moment of writing, Weaviate has two vectorizers, the contextionary and the transformers modules.
If you want to extend Weaviate with custom context, you can extend the contextionary or fine tune and distribute custom transformers.
If you do this, I would still highly recommend taking the first step, because it will simply improve the results.
Capture semantic context of your graph
I don't think this is what you want, but it is possible and quite esoteric. In principle, you can store your vectorized graph in Weaviate, but you need to generate the vectors on your own. For example, at the moment of writing, we are looking at RDF2Vec.
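As a minimal sketch of the bring-your-own-vector route (assuming the v3-style Python client, a class configured with its vectorizer set to "none", and vectors you computed yourself, e.g. with an RDF2Vec implementation):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Suppose `node_vectors` maps graph nodes to embeddings you generated yourself;
# this dict is a placeholder, not real RDF2Vec output.
node_vectors = {"ex:Contract": [0.12, -0.34, 0.56]}

for node, vec in node_vectors.items():
    client.data_object.create(
        data_object={"name": node},
        class_name="Concept",   # a class set up for bring-your-own vectors
        vector=vec,             # supply the externally computed vector directly
    )
```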
PS:
Because people often ask about the role of ontologies and taxonomies in Weaviate, I've written this blog post.

Is there a general way to calculate the similarity between product models or specifications?

Product models and specifications always differ subtly.
For example:
iphone6, iphone7sp
12mm*10mm*8mm, 12*8*8, (L)12mm*(W)8mm*(H)8mm
brand-410B-12, brand-411C-09, brand410B12
So, in common E-commerce search, is there a general method to calculate the model or specification similarity?
is there a general method to calculate the model or specification similarity?
No.
This is a research topic sometimes referred to as "product matching", or more broadly "schema matching". It's a hard problem with no standard approach.
Finding out whether two strings refer to the same thing is covered by entity resolution, but that's typically used for things like the names of people or organizations, where a small change is more likely to be a typo or a meaningless variation than an important difference (example: Ulysses S. Grant vs Ulysses Grant). Because a small change in a model number may or may not be important, it's a different problem. Specifications make things even more complicated.
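To see why this is harder than plain string matching, here is a crude normalization-plus-fuzzy-matching sketch using only the Python standard library; the normalization rules are assumptions about the data, not a general solution.

```python
import re
from difflib import SequenceMatcher

def normalize(model: str) -> str:
    """Crude normalization: lowercase, drop unit markers and separators."""
    s = model.lower()
    s = re.sub(r"(mm|cm)", "", s)      # strip dimension units (assumption about the data)
    s = re.sub(r"[^a-z0-9]", "", s)    # drop -, *, (), spaces, etc.
    return s

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

# Same model written two ways -> score of 1.0, as hoped:
print(similarity("brand-410B-12", "brand410B12"))
# Different models -> still scores fairly high, which is exactly the problem
# described above: a one-character difference can be decisive.
print(similarity("brand-410B-12", "brand-411C-09"))
```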
Here are some papers you can look at for example approaches:
Synthesizing Products for Online Catalogs - Semantic Scholar
Matching Unstructured Product Offers to Structured Product Descriptions - Microsoft Research
Tailoring entity resolution for matching product offers

NLP of Legal Texts?

I have a corpus of a few hundred thousand legal documents (mostly from the European Union) – laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc.). But at the single-document level, I wish I had better tools for fast comprehension. I am open to ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I played around with Google's new Parsey McParseface and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. You have two main possibilities, each with an example tool:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents, you can apply some post-processing to determine which information is relevant. Finding the relevant information is task-dependent, and I don't think you can find a generic tool that extracts "the relevant" information.
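If it helps to see the general shape of the output such tools give you, here is a very rough sketch using spaCy's dependency parser to pull out subject-verb-object triples. This is only a lightweight stand-in for proper frame semantics or SRL (the SEMAFOR and CogComp tools above), and it assumes the en_core_web_sm model is installed.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def rough_triples(text):
    """Very rough subject-verb-object extraction from a dependency parse."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ == "VERB":
            subjects = [c for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
            objects = [c for c in token.children if c.dep_ in ("dobj", "obj", "attr")]
            for s in subjects:
                for o in objects:
                    triples.append((s.text, token.lemma_, o.text))
    return triples

print(rough_triples("The regulation prohibits the import of certain chemicals."))
```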
I see you have an interesting use case. You've also mentioned the presence of a corpus (which is a really good plus). Let me relate a solution that I had sketched for extracting the crux of research papers.
To make sense of documents, you need triggers, and you need to tell (or train) the computer to look for these "triggers". You can approach this with a supervised learning algorithm, at the most basic level as a simple text classification problem. But this would need prior work, and initial help from domain experts to discern "triggers" in the textual data. There are tools to extract the gist of sentences - for example, take the noun phrases in a sentence, assign weights based on co-occurrences, and represent them as vectors. This is your training data.
This can be a really good start to incorporating NLP into your domain.
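As a rough sketch of that idea (noun phrases as features feeding a supervised classifier), something like the following could work; the spaCy/scikit-learn choice, the toy sentences, and the relevant/boilerplate labels are all illustrative assumptions.

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

nlp = spacy.load("en_core_web_sm")

def noun_phrase_text(doc_text):
    """Reduce a document to its noun phrases, the 'gist' features described above."""
    doc = nlp(doc_text)
    return " ".join(chunk.text.lower() for chunk in doc.noun_chunks)

# Toy training data: in practice the labels would come from domain experts.
texts = ["The tariff on steel imports shall be raised to 25 percent.",
         "This agreement shall enter into force on the date of signature."]
labels = ["relevant", "boilerplate"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
clf.fit([noun_phrase_text(t) for t in texts], labels)

print(clf.predict([noun_phrase_text("A 10 percent duty applies to aluminium imports.")]))
```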
Don't use triggers. What you need is word sense disambiguation and domain adaptation: you want to make sense of what is in the documents, i.e. understand the semantics to figure out the meaning.
You can build a legal ontology of terms in SKOS or JSON-LD format, represent it as a knowledge graph, and use it together with dependency parsing such as tensorflow/parseymcparseface. Or you can stream your documents in using a kappa-based architecture - something like Kafka-Flink-Elasticsearch with intermediate NLP layers using CoreNLP/TensorFlow/UIMA - and cache your indexing setup between Flink and Elasticsearch with Redis to speed up the process. To handle relevancy, you can apply boosting to specific cases in your search. Furthermore, apply sentiment analysis to work out intent and truthfulness.
Your use case is one of information extraction, summarization, and semantic web/linked data. As the EU has a different legal system, you would need to generalize first on what a legal document really is, then narrow it down to specific legal concepts as they relate to a topic or region. You could also use topic modelling techniques here, from LDA to Word2Vec/Sense2Vec. Lemon might also help with converting lexical to semantic and semantic to lexical representations, i.e. NLP->ontology and ontology->NLP.
Essentially, feed the clustering into your classification for named entity recognition. You can also use the clustering to help build out the ontology, or to see which word vectors occur in a document or set of documents using cosine similarity. But in order to do all that, it would be best to visualize the word sparsity of your documents first. Something like commonsense reasoning + deep learning might help in your case as well.
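For the topic-modelling part specifically, a minimal LDA sketch with gensim might look like this; the tokenised toy documents are invented placeholders.

```python
from gensim import corpora, models

# Tiny placeholder corpus of pre-tokenised "legal" documents.
docs = [["directive", "import", "tariff", "steel"],
        ["court", "ruling", "appeal", "member", "state"],
        ["tariff", "duty", "aluminium", "import"]]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary, passes=10, random_state=0)
for topic_id, words in lda.print_topics():
    print(topic_id, words)
```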

Concept extraction using WordNet

I wish to know how I can use WordNet to extract concepts from a text document. Earlier I used a bag-of-words approach to measure similarity between text documents; however, I wish to use the semantic information of the text and therefore want to extract concepts from the document. I understand WordNet offers synsets that contain synonyms for a given word.
However, what I am trying to achieve is to use this information to define a concept in the textual data. I wonder whether I need to define the list of concepts separately and manually before using synsets, and then compare those concepts with the synsets.
Any suggestion or link is appreciated.
I think you'll find that there are too many concepts out there for enumerating all of them yourself to be practical. Instead, you should consider using a pre-existing source of knowledge such as Wikidata, Wikipedia, Freebase, the content of Tweets, the web at large, or some other source as the basis for constructing your concepts. You may find clustering algorithms useful for defining these. In terms of synonyms: words related to a concept are not necessarily synonymous (e.g. both love and hate may be connected to the same concept of intense emotion towards someone else), and some words could belong to multiple concepts (e.g. wedding could sit in both the love concept and the marriage concept), so I'd suggest having some linkage from synset to concept that isn't strictly 1:1.
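To make the synset part concrete, here is a minimal sketch with NLTK's WordNet interface; how you then map synsets (and clusters of them) onto your own concepts is the part you still have to design, as discussed above.

```python
from nltk.corpus import wordnet as wn
# nltk.download("wordnet")  # one-time download, if not already present

# Synsets are WordNet's closest thing to "concepts": one word maps to several of them.
for syn in wn.synsets("bank")[:3]:
    print(syn.name(), "-", syn.definition())

# Hypernyms let you roll specific synsets up to broader concepts...
dog = wn.synset("dog.n.01")
print([h.name() for h in dog.hypernyms()])

# ...and path-based similarity compares two synsets through the hypernym hierarchy.
cat = wn.synset("cat.n.01")
print(dog.path_similarity(cat))
```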

ML-based domain-specific named entity recognition (NER)?

I need to build a classifier which identifies NEs in a specific domain. So for instance, if my domain is hockey or football, the classifier should accept NEs in that domain but NOT all the pronouns it sees on web pages. My ultimate goal is to improve text classification through NER.
For people working in this area please suggest me how should I build such a classifier?
thanks!
If all you want is to ignore pronouns, you can run any POS tagger followed by any NER algorithm (the Stanford package is a popular implementation) and then ignore any named entities which are pronouns. However, the pronouns might refer to named entities, which may or may not turn out to be important for the performance of your classifier. The only way to tell for sure is to try.
A slightly unrelated comment- a NER system trained on domain-specific data (e.g. hockey) is more likely to pick up entities from that domain because it will have seen some of the contexts entities appear in. Depending on the system, it might also pick up entities from other domains (which you do not want, if I understand your question correctly) because of syntax, word shape patterns, etc.
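As a concrete starting point for the "POS tagger + NER, then filter" suggestion, here is a minimal sketch using spaCy, which runs tagging and NER in one pipeline; the example sentence is invented and the en_core_web_sm model is assumed to be installed. The Stanford tools mentioned above could be used the same way in principle.

```python
import spacy

nlp = spacy.load("en_core_web_sm")

doc = nlp("He scored twice as Toronto beat the Boston Bruins at Scotiabank Arena.")

# Keep only entity spans that contain no pronoun tokens.
entities = [
    ent for ent in doc.ents
    if not any(tok.pos_ == "PRON" for tok in ent)
]
print([(ent.text, ent.label_) for ent in entities])
```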
I think something like AutoNER might be useful for this. Essentially, the input to the system is text documents from a particular domain and a list of domain-specific entities that you'd like the system to recognize (like Hockey players in your case).
According to their results in this paper, they perform well on recognizing chemical names and disease names among others.

Resources