Custom NER and POS tagging - nlp

I was checking out Stanford CoreNLP to understand NER and POS tagging. But what if I want to create custom tags for entities, like <title>Nights</title>, <genre>Jazz</genre>, <year>1992</year>? How can I do that? Is CoreNLP useful in this case?

Out of the box, CoreNLP is restricted to the entity types it documents: PERSON, LOCATION, ORGANIZATION, MISC, DATE, TIME, MONEY, NUMBER. No, you won't be able to recognize other entities just by assuming it could "intuitively" do it :)
In practice, you'll have to choose one of the following:
Find another NER system that tags the types you need.
Address the tagging task using knowledge-based / unsupervised approaches.
Search for extra resources (corpora) that contain the types you want to recognize, and re-train a supervised NER system (CoreNLP or another).
Build (and possibly annotate) your own resources - then you'll have to define an annotation scheme, rules, etc. - quite an interesting part of the work!
Indeed, unless you find an existing system that fulfills your needs, some effort will be required! Unsupervised approaches may help you bootstrap a system, so you can see whether you need to find or annotate a dedicated corpus. In the latter case, it is better to separate the data into train/dev/test splits, so you can assess how well the resulting system performs on unseen data.
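The train/dev/test separation mentioned above can be sketched in a few lines of Python (the 80/10/10 fractions and fixed seed here are arbitrary choices, not anything prescribed by CoreNLP):

```python
import random

def train_dev_test_split(sentences, dev_frac=0.1, test_frac=0.1, seed=42):
    """Shuffle annotated sentences and split them into train/dev/test parts."""
    rng = random.Random(seed)
    shuffled = sentences[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_test = int(n * test_frac)
    n_dev = int(n * dev_frac)
    test = shuffled[:n_test]
    dev = shuffled[n_test:n_test + n_dev]
    train = shuffled[n_test + n_dev:]
    return train, dev, test

corpus = [f"annotated sentence {i}" for i in range(100)]
train, dev, test = train_dev_test_split(corpus)
print(len(train), len(dev), len(test))  # 80 10 10
```

The dev split is what you tune on; only touch the test split once, at the very end, to measure performance on unseen data.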

Look into this FAQ (http://nlp.stanford.edu/software/crf-faq.shtml) to learn how to use the CRF classifier to train a model for new classes. You may find it useful.
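If you go the re-training route, the FAQ above uses training data with one token per line, a tab-separated gold label, and a blank line between sentences. Here is a hedged Python sketch that writes annotated sentences into that shape (the tag names below are the asker's custom types; double-check the exact column layout against the FAQ and your properties file):

```python
def write_crf_training_file(sentences, path):
    """Write (token, label) sequences one token per line, tab-separated,
    with a blank line marking each sentence boundary."""
    with open(path, "w", encoding="utf-8") as f:
        for sentence in sentences:
            for token, label in sentence:
                f.write(f"{token}\t{label}\n")
            f.write("\n")  # sentence boundary

sentences = [
    [("Nights", "TITLE"), ("is", "O"), ("a", "O"),
     ("Jazz", "GENRE"), ("album", "O"), ("from", "O"), ("1992", "YEAR")],
]
write_crf_training_file(sentences, "train.tsv")
```

Tokens outside any entity conventionally get the O label, as above.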

Related

NLP of Legal Texts?

I have a corpus of a few hundred thousand legal documents (mostly from the European Union) – laws, commentary, court documents, etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc.). But on the single-document level, I wish I had better tools to allow fast comprehension. I am open to ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I have played around with Google's new Parsey McParseface and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. You have two main possibilities, each with examples:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents, you can apply some post-processing to determine which information is relevant. Finding the relevant information is task-specific, and I don't think you will find a generic tool that extracts "the relevant" information.
I see you have an interesting use case. You've also mentioned the presence of a corpus (which is a really good plus). Let me relate a solution I once sketched for extracting the crux from research papers.
To make sense of documents, you need "triggers" to tell (or train) the computer what to look for. You can approach this with a supervised learning algorithm - at the most basic level, a simple implementation of a text classification problem. But this needs prior work, with initial help from domain experts to discern the "triggers" in the textual data. There are tools to extract the gist of sentences - for example, take the noun phrases in a sentence, assign weights based on co-occurrences, and represent them as vectors. This is your training data.
This can be a really good start to incorporating NLP into your domain.
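As an illustration of that basic text-classification setup, here is a minimal pure-Python sketch: bag-of-words count vectors and a nearest-centroid classifier. A real system would use noun-phrase extraction and proper feature weighting, and the labels and sentences below are invented for the example:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words count vector over a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

def train(labeled_docs):
    """labeled_docs: list of (text, label). Returns label -> centroid vector."""
    centroids = {}
    for text, label in labeled_docs:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids

def classify(model, text):
    """Assign the label whose centroid is most similar to the text."""
    v = vectorize(text)
    return max(model, key=lambda label: cosine(model[label], v))

model = train([
    ("the court ruled on the appeal", "relevant"),
    ("this regulation shall enter into force", "boilerplate"),
])
print(classify(model, "the appeal was ruled inadmissible"))  # relevant
```

With expert-provided "trigger" sentences as the labeled data, the same skeleton would flag candidate relevant passages for human review.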
Don't use triggers. What you need is word sense disambiguation and domain adaptation: you want to make sense of what is in the documents, i.e. understand the semantics to figure out the meaning. You can build a legal ontology of terms in SKOS or JSON-LD format, represent it in a knowledge graph, and use it together with dependency parsing such as TensorFlow / Parsey McParseface.
Alternatively, you can stream your documents in using a Kappa-based architecture - something like Kafka-Flink-Elasticsearch with intermediate NLP layers built on CoreNLP / TensorFlow / UIMA, caching your indexing between Flink and Elasticsearch with Redis to speed up the process. To handle relevancy you can apply boosting in your search. Furthermore, apply sentiment analysis to work out intent and truthfulness.
Your use case combines information extraction, summarization, and semantic web / linked data. As the EU has a different legal system, you would need to generalize first on what a legal document really is, then narrow it down to specific legal concepts as they relate to a topic or region. You could also use topic-modelling techniques such as LDA, or Word2Vec / Sense2Vec. Lemon might also help with converting between the lexical and semantic levels, i.e. NLP -> ontology and ontology -> NLP. Essentially, feed the clustering into your named-entity-recognition classification. You can also use the clustering to help build out the ontology, or to see which word vectors occur in a document or set of documents using cosine similarity. But before doing all that, it would be best to visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.

concept extraction using Wordnet

I wish to know how I can use WordNet to extract concepts from a text document. Earlier I used a bag-of-words approach to measure similarity between text documents; however, I now want to use the semantic information in the text, and therefore want to extract concepts from the documents. I understand WordNet offers synsets that contain synonyms for a given word.
However, what I am trying to figure out is how I can use this information to define a concept in the textual data. I wonder whether I need to define the list of concepts separately and manually before using synsets, and then compare those concepts with the synsets.
Any suggestion or link is appreciated.
I think you'll find that there are too many concepts out there for enumerating all of them yourself to be practical. Instead, you should consider using a pre-existing source of knowledge such as Wikidata, Wikipedia, Freebase, the content of Tweets, the web at large, or some other source as the basis for constructing your concepts. You may find clustering algorithms useful for defining these. In terms of synonyms... words related to a concept may not necessarily be synonymous (e.g. both love and hate may be connected to the same concept regarding an intensity of emotion towards someone else) and some words could belong to multiple concepts (e.g. wedding could be in both the love and in the marriage concept), so I'd suggest having some linkage from synset to concept that isn't strictly 1:1.
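To make that non-1:1 word-to-concept linkage concrete, here is a small pure-Python sketch. The concept inventory and word lists are invented for illustration (in practice they would come from clustering over one of the knowledge sources above), and "wedding" is deliberately allowed to belong to two concepts:

```python
import math
from collections import Counter

# Invented word -> concepts mapping; note the deliberate overlaps:
# antonyms "love"/"hate" share a concept, "wedding" belongs to two.
CONCEPTS = {
    "love": {"love"},
    "hate": {"love"},            # intensity-of-emotion, same concept as "love"
    "wedding": {"love", "marriage"},
    "divorce": {"marriage"},
}

def concept_vector(text):
    """Map each word to its concept(s) and count concept occurrences."""
    counts = Counter()
    for word in text.lower().split():
        for concept in CONCEPTS.get(word, ()):
            counts[concept] += 1
    return counts

def cosine(u, v):
    dot = sum(u[c] * v[c] for c in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

a = concept_vector("love and hate at the wedding")
b = concept_vector("the divorce ended it")
print(cosine(a, b))  # nonzero: no shared words, but a shared concept
```

Documents with no words in common can still score as similar because the comparison happens in concept space rather than word space, which is the advantage over the plain bag-of-words approach.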

How apache UIMA is different from Apache Opennlp

I have been doing some capability testing with Apache OpenNLP, which can do sentence detection, tokenization, and named entity recognition. Now, when I started looking at the UIMA documents, the UIMA home page mentions "language identification" => "language specific segmentation" => "sentence boundary detection" => "entity detection (person/place names etc.)".
This suggests that I can use UIMA to do the same tasks as OpenNLP. What added features does each have? I am new to this area; please help me understand the uses and capabilities of both.
As I understand the question, you are asking for the differences between the feature sets of Apache UIMA and Apache OpenNLP. Their feature sets barely have anything in common as these two projects have very different aims.
Apache UIMA is an open source implementation of the UIMA specification. The latter defines a conceptual framework for augmenting unstructured information (such as natural language produced by humans) with structured metadata so that computers can work with it.
As an example of an application working with unstructured information, let us take an application that accepts natural language text as input and marks all named entities in the given text, e.g.
Input text = "Bob's cat Charlie is chasing a mouse."
Result = "<NE>Bob</NE>'s cat <NE>Charlie</NE> is chasing a mouse."
To identify the named entities in this example (i.e. Bob and Charlie), several steps of natural language processing have to be performed. Without going into detail about what each of the steps does, a hypothetical system for named entity recognition might involve the following steps:
Data preparation
Sentence splitting
Tokenization
Token lemmatization
Part-of-speech tagging
Phrase detection
Classifying phrases as named entities or not
As you can see, such applications can be modelled very intuitively as sequences of components, and this is exactly what UIMA does: it models applications dealing with unstructured information as pipelines of components (called analytics in UIMA parlance). As you can imagine, many of the pipeline components listed above can be reused for other tasks, which is why the architecture design of UIMA emphasizes reusability of components.
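A minimal sketch of that pipeline idea, with plain Python functions standing in for UIMA components (this is not the real UIMA API - actual components exchange a shared CAS data structure, and the entity "detector" here is a toy heuristic):

```python
def sentence_split(doc):
    """Split the raw text into sentences (naively, on periods)."""
    doc["sentences"] = [s.strip() for s in doc["text"].split(".") if s.strip()]
    return doc

def tokenize(doc):
    """Split each sentence into whitespace tokens."""
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def tag_named_entities(doc):
    """Toy heuristic: flag capitalized tokens that are not sentence-initial."""
    doc["entities"] = [
        tok for sent in doc["tokens"] for i, tok in enumerate(sent)
        if i > 0 and tok[0].isupper()
    ]
    return doc

def run_pipeline(text, components):
    """Thread a shared document structure through a sequence of components."""
    doc = {"text": text}
    for component in components:
        doc = component(doc)
    return doc

doc = run_pipeline("Bob's cat Charlie is chasing a mouse.",
                   [sentence_split, tokenize, tag_named_entities])
print(doc["entities"])  # ['Charlie']
```

Each component only reads what earlier components produced, which is what makes them individually swappable and reusable - the property UIMA's architecture is built around.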
To avoid confusion, the UIMA standard itself doesn't provide any specific components, but defines an infrastructure for UIM (Unstructured Information Management) applications, e.g. workflows, data types, inter-component communication, and so on.
Apache OpenNLP on the other hand does exactly that, namely provide concrete implementations of NLP algorithms dealing with very specific tasks (sentence splitting, POS-tagging, etc.). The source of your confusion might be that it is possible to write Apache UIMA components that wrap OpenNLP tools. The OpenNLP project actually provides such components.
Whether you want to use the UIMA framework for your UIM applications depends on the size of the project. If it is small, I would go without UIMA and just use OpenNLP directly, as UIMA is rather heavy-weight and thus only adds complex yet (for small applications) unnecessary overhead. Also, due to its complexity, it takes a good amount of time to learn how to use it.
Summing up, Apache UIMA and Apache OpenNLP solve different problems, but since both deal with unstructured information, they can be combined profitably.

ML based domain specific named enitty recognition (NER)?

I need to build a classifier which identifies NEs in a specific domain. So, for instance, if my domain is hockey or football, the classifier should accept NEs in that domain but NOT all pronouns it sees on web pages. My ultimate goal is to improve text classification through NER.
For people working in this area: how should I build such a classifier?
Thanks!
If all you want is to ignore pronouns, you can run any POS tagger followed by any NER algorithm (the Stanford package is a popular implementation) and then ignore any named entities which are pronouns. However, the pronouns might refer to named entities, which may or may not turn out to be important for the performance of your classifier. The only way to tell for sure is to try.
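A sketch of that filtering step, assuming you already have (token, POS tag, NER tag) triples from some tagger pipeline; the Penn Treebank tags PRP and PRP$ mark pronouns, and "O" is the usual no-entity label:

```python
PRONOUN_TAGS = {"PRP", "PRP$"}  # Penn Treebank pronoun tags

def entities_without_pronouns(tagged_tokens):
    """Keep tokens the NER layer marked as entities, dropping any token
    the POS tagger labelled as a pronoun."""
    return [
        (token, ner)
        for token, pos, ner in tagged_tokens
        if ner != "O" and pos not in PRONOUN_TAGS
    ]

tagged = [
    ("Gretzky", "NNP", "PERSON"),
    ("scored", "VBD", "O"),
    ("and", "CC", "O"),
    ("he", "PRP", "PERSON"),   # coreferent pronoun carrying an entity tag
    ("celebrated", "VBD", "O"),
]
print(entities_without_pronouns(tagged))  # [('Gretzky', 'PERSON')]
```

Note that dropping "he" here discards the link back to Gretzky, which is exactly the recall trade-off mentioned above.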
A slightly unrelated comment: an NER system trained on domain-specific data (e.g. hockey) is more likely to pick up entities from that domain, because it will have seen some of the contexts those entities appear in. Depending on the system, it might also pick up entities from other domains (which you do not want, if I understand your question correctly) because of syntax, word shape patterns, etc.
I think something like AutoNER might be useful for this. Essentially, the input to the system is a set of text documents from a particular domain and a list of domain-specific entities that you'd like the system to recognize (like hockey players in your case).
According to the results in this paper, it performs well on recognizing chemical names and disease names, among others.

Is NER necessary for Coreference resolution?

... or is gender information enough?
More specifically, I'm interested in knowing if I can reduce the number of models loaded by the Stanford Core NLP to extract coreferences. I am not interested in actual named entity recognition.
Thank you
According to the EMNLP paper that describes the coref system packaged with Stanford CoreNLP, named entity tags are only used in the following coref annotation passes: precise constructs, relaxed head matching, and pronouns (Raghunathan et al. 2010).
You can specify what passes to use with the dcoref.sievePasses configuration property. If you want coreference but you don't want to do NER, you should be able to just run the pipeline without NER and specify that the coref system should only use the annotation passes that don't require NER labels.
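For example, a CoreNLP properties file could drop the NER annotator and restrict the sieve passes along these lines. The pass names below are illustrative only - check the dcoref documentation for your CoreNLP version for the exact values and which passes depend on NER labels:

```
annotators = tokenize, ssplit, pos, lemma, parse, dcoref
dcoref.sievePasses = MarkRole, DiscourseMatch, ExactStringMatch, RelaxedExactStringMatch, StrictHeadMatch1, StrictHeadMatch2
```

The key point is simply that the NER-dependent passes (precise constructs, relaxed head matching, pronoun matching) are left out of the list.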
However, the resulting coref annotations will take a hit on recall. So, you might want to run some experiments to determine whether the degraded quality of the annotations is a problem for whatever you are using them for downstream.
In general, yes. First, you need named entities because they serve as the candidate antecedents, i.e. the targets to which the pronouns refer. Many (most?) systems perform both entity recognition and type classification in one step. Second, the semantic category (e.g. person, organization, location) of the entities is important for constructing accurate coreference chains.