How to encode a taxonomy in Weaviate contextionary - search

I would like to create a semantic context for my data before vectorizing the actual data in Weaviate (https://github.com/semi-technologies/weaviate).
Let's say we have a taxonomy with a set of domain-specific concepts together with links to their related concepts. Could you advise me on the best way to encode not only those concepts but also the relations between them using the contextionary?

Depending on your use case, a few answers are possible:
You can create the "semantic context" in a Weaviate schema and use a vectorization module to vectorize the data according to this schema.
You have domain-specific concepts in your data that the out-of-the-box vectorization modules don't know about (e.g., specific abbreviations).
You want to capture the semantic context of (i.e., vectorize) the graph itself before adding it to Weaviate.
The first is the easiest and most straightforward; the last is the most esoteric.
Create a schema and use a vectorizer for your data
In your case, you would create a schema based on your taxonomy and load the data using an out-of-the-box vectorizer (this configurator helps you build a Docker Compose file).
I would recommend starting with this anyway, because it determines your data model and how you can search through and/or classify your data. It might even be that for your use case this step already solves the problem, because the out-of-the-box vectorizers are (bias alert) pretty decent.
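For illustration, here is a minimal sketch of what such a schema could look like with the Weaviate Python client (a v3-style API is assumed; the class and property names are invented, and relatedTo is a cross-reference back to the same class so that concepts can point to their related concepts):

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical class for taxonomy concepts; "relatedTo" is a cross-reference
# to the same class, which models the links between related concepts.
concept_class = {
    "class": "Concept",
    "description": "A domain-specific concept from the taxonomy",
    "vectorizer": "text2vec-contextionary",
    "properties": [
        {"name": "name", "dataType": ["text"]},
        {"name": "definition", "dataType": ["text"]},
        {"name": "relatedTo", "dataType": ["Concept"]},  # link to related concepts
    ],
}

client.schema.create_class(concept_class)
```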
Domain-specific concepts
At the moment of writing, Weaviate has two vectorizers, the contextionary and the transformers modules.
If you want to extend Weaviate with custom context, you can extend the contextionary or fine tune and distribute custom transformers.
If you do this, I would still highly recommend taking the first step, because it will simply improve the results.
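As a rough sketch of the contextionary route (again assuming the v3 Python client and the text2vec-contextionary module; the concept and definition below are invented), custom concepts can be added like this:

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical domain-specific term taught to the contextionary.
# The weight (0.0-1.0) controls how strongly the definition influences the vector.
client.contextionary.extend(
    concept="gdpr",
    definition="European Union regulation on data protection and privacy",
    weight=1.0,
)
```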
Capture semantic context of your graph
I don't think this is what you want, but it is possible, if quite esoteric. In principle, you can store your vectorized graph in Weaviate, but you need to generate the vectors on your own. For example, at the moment of writing, we are looking at RDF2Vec.
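If you do go this way, a hedged sketch of supplying your own (e.g., RDF2Vec-generated) vectors through the Python client could look like the following; the vector values are placeholders and the class would typically be configured with vectorizer "none":

```python
import weaviate

client = weaviate.Client("http://localhost:8080")

# Hypothetical vector produced outside Weaviate (e.g., by RDF2Vec) for one graph node.
node_vector = [0.12, -0.03, 0.44]  # placeholder values; real vectors are much longer

client.data_object.create(
    data_object={"name": "SomeConcept"},
    class_name="Concept",
    vector=node_vector,  # supply the externally generated vector
)
```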
PS:
Because people often ask about the role of ontologies and taxonomies in Weaviate, I've written this blog post.

Related

Extract entities without specifying during intent specification

I am using Rasa 2.0 to build an FAQ chatbot. I have a large dataset, and specifying entities while defining intents does not seem efficient to me.
I have the intents and examples defined in nlu.yml and would like to extract entities.
Here is an example of what I want to achieve:
User message -> I want a hospital in Delhi.
Entity -> Delhi, hospital
Is it possible to do so?
Entity detection is not a solved problem. There are pre-trained models that integrate with Rasa, like Duckling and spaCy, and while these tools certainly contribute a lot of knowledge, they will make errors. If you're interested in more background on why these models can fail, you can enjoy this YouTube video that explains human name detection.
That's why a popular alternative is to use name lists. There are downloadable lists of cities around the world, as well as lists of baby names, that can serve as a rule-based alternative. You can configure this in Rasa via the RegexEntityExtractor, but if you have name lists with 1000+ items then a FlashTextExtractor might be preferable.
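As a rough, Rasa-agnostic illustration of the name-list idea, here is a minimal sketch using the flashtext library (the city list is invented):

```python
from flashtext import KeywordProcessor

# Hypothetical name list; in practice this could be thousands of city names.
cities = ["Delhi", "Mumbai", "Berlin"]

processor = KeywordProcessor(case_sensitive=False)
for city in cities:
    processor.add_keyword(city)

# Extract every known city mentioned in a user message.
print(processor.extract_keywords("I want a hospital in Delhi"))
```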
If you've got labelled examples you can also train Rasa itself to recognise the entities, but in order to do this you will need to have labels available.
specifying entities while defining intents does not seem efficient to me
Labelling might not be super fun, but it is super effective. Without labelling your received utterances you won't know what intents your users are interested in.
You could use entity annotations in your nlu training data; for example, assuming you have defined building_type and city as entity names:
I want a [hospital](building_type) in [Delhi](city).
Alternatively, you could try out these options:
annotate a smaller sample (for example, those entities that are essential for your FAQ assistant)
use the RegexEntityExtractor to write some rules
if you have a list of entities, you can use lookup tables to generate the regular expressions
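As a minimal, framework-agnostic sketch of turning a lookup table into a regular expression (the lookup values are invented):

```python
import re

# Hypothetical lookup table of cities.
cities = ["Delhi", "New Delhi", "Mumbai"]

# Build one alternation pattern; longer names first so "New Delhi" wins over "Delhi".
pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in sorted(cities, key=len, reverse=True)) + r")\b",
    re.IGNORECASE,
)

print(pattern.findall("I want a hospital in New Delhi"))  # ['New Delhi']
```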

Is it possible to generate parts of a meta model from upper layer?

Based on the four-layer MOF structure, I'm currently working on a model (in fact a UML class diagram) at the M1 level. However, I observed that some parts of the meta model depend heavily on references to certain classes, which may differ depending on the use case. Therefore, I created a meta model at the M2 level, which allows users to define the variable parts of the M1 model, which can then be generated and incorporated into the M1 model. The following image tries to depict that:
A resulting M1 model example would then look like this:
As switching between the different levels can be a little confusing, I wonder whether this approach is possible at all and UML-conformant. Furthermore, is there by chance a notation for the "generated instances" relation in Figure 1? Within the MOF spec, <<merge>> or <<import>> is used, for example, which might fit that purpose.
Your question is probably too broad for a concise answer. However, here's my advice when dealing with meta models: I found that people rarely have an idea why you need a meta model at all, and it takes quite some time to convince them to start creating one, even with so-called UML pros. With that in mind, it's evident that the modelers who are supposed to use the meta model may have even more difficulty dealing with it. This leaves just one way: keep it simple. And that's what I did in the past: introduce a meta model with just the basics, concentrating on meta types, tagged values and some connectors. After a while, people really get used to it and appreciate working with the meta model. Only then does the need arise to switch to a version two, which is still static, though.
Now, what you want looks like version ninety-nine. It would probably only work in a super model where some gurus float on top of it all and provide a meta meta model. That would certainly be interesting, and I'd like to be part of that team, but I doubt you will get practicable results from it. My recommendation is to stay with the static meta model. Everything else will likely lead you nowhere.

NLP of Legal Texts?

I have a corpus of a few hundred thousand legal documents (mostly from the European Union) – laws, commentary, court documents, etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc.). But at the single-document level, I wish I had better tools to allow fast comprehension. I am open to ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I played around with Google's new Parsey McParseface and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense of documents you need to perform some sort of semantic analysis. There are two main possibilities, with examples:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents, you can apply some post-processing to determine which information is relevant. Relevance is task-related, and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting use case. You've also mentioned that you have a corpus (which is a really good plus). Let me relate a solution I once sketched for extracting the crux from research papers.
To make sense of documents, you need "triggers": you tell (or train) the computer to look for them. At the most basic level, you can approach this with a supervised learning algorithm, as a simple text classification problem. But this needs prior work, initially with help from domain experts, to discern the triggers in the textual data. There are tools to extract the gist of sentences; for example, take the noun phrases in a sentence, assign weights based on co-occurrences, and represent them as vectors. This is your training data.
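A minimal sketch of that noun-phrase-to-vector idea, assuming spaCy and scikit-learn (the example sentences are placeholders):

```python
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires the en_core_web_sm model to be downloaded beforehand.
nlp = spacy.load("en_core_web_sm")

# Placeholder documents; in practice these would be sentences from the corpus.
docs = [
    "The commission may restrict imports of industrial goods.",
    "This annex lists the applicable data tables.",
]

# Extract noun phrases per document and join them back into pseudo-documents.
noun_phrases = [" ".join(chunk.text for chunk in nlp(d).noun_chunks) for d in docs]

# Weight the phrases by TF-IDF; the resulting matrix can serve as training data
# for a classifier that separates "relevant" from "boilerplate" sentences.
vectors = TfidfVectorizer().fit_transform(noun_phrases)
print(vectors.shape)
```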
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is word sense disambiguation and domain adaptation: you want to make sense of what is in the documents, i.e. understand the semantics to figure out the meaning. You can build a legal ontology of terms in SKOS or JSON-LD format, represent it as a knowledge graph, and use it together with dependency parsing such as TensorFlow/Parsey McParseface. Alternatively, you can stream your documents through a Kappa-based architecture, something like Kafka, Flink and Elasticsearch with intermediate NLP layers using CoreNLP/TensorFlow/UIMA, and cache the indexing setup between Flink and Elasticsearch with Redis to speed up the process. To understand relevancy, you can apply specific forms of boosting in your search, and you can add sentiment analysis to work out intent and truthfulness.
Your use case is one of information extraction, summarization, and semantic web/linked data. As the EU has a different legal system, you would need to generalize first on what a legal document really is, and then narrow it down to specific legal concepts as they relate to a topic or region. You could also apply topic modelling techniques such as LDA, or use Word2Vec/Sense2Vec. Lemon might also help with converting from lexical to semantic and from semantic to lexical, i.e. NLP -> ontology and ontology -> NLP. Essentially, feed the clustering into your classification, e.g. named entity recognition. You can also use the clustering to help build out the ontology, or to see which word vectors occur in a document or set of documents using cosine similarity. To do all of that, it is best to first visualize the word sparsity of your documents. Something like commonsense reasoning plus deep learning might help in your case as well.
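To make one piece of this concrete, here is a hedged sketch of training word vectors on a corpus and checking cosine similarity with gensim (4.x parameter names assumed; the tokenized sentences are placeholders):

```python
from gensim.models import Word2Vec

# Placeholder, pre-tokenized sentences; in practice these come from the legal corpus.
sentences = [
    ["the", "commission", "may", "restrict", "imports"],
    ["the", "annex", "lists", "applicable", "data", "tables"],
    ["the", "regulation", "may", "restrict", "exports"],
]

# Train a small Word2Vec model on the toy corpus.
model = Word2Vec(sentences=sentences, vector_size=50, window=3, min_count=1, epochs=50)

# Cosine similarity between two terms, and nearest neighbours of a term.
print(model.wv.similarity("imports", "exports"))
print(model.wv.most_similar("restrict", topn=3))
```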

How does GATE use ontologies for NLP?

What is the role of ontologies in natural language processing when using GATE?
As I understand it, at a high level, an ontology allows for the modelling of a domain consisting of classes, their instances, properties of these instances and relationships between classes in the domain.
However is there an advantage to creating a custom ontology when working with GATE?
Or can processing be just as effective using only the built-in processing resources provided by ANNIE?
You can check this tutorial on ontologies in GATE.
As stated in the pdf:
Link annotations to concepts in a knowledge base.
The annotated text is a “Mention” of a concept in the KB
We can use the knowledge associated with Mentions in our IE pipeline: e.g. Persons have JobTitles, Cities have zip codes
We can use the knowledge associated with Mentions for “Semantic Search”
We can use semantically annotated documents to add new facts to our knowledge base
In the process of annotation, ontology data (instances, classes, relations, etc.) can be used by JAPE for smarter matching, e.g. matching a mention of class "engineer" while knowing that "engineer" is a subclass of "person". There are also ontology-aware gazetteers which can create annotations based on instances and assign the right class and URI to the created annotations.
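Outside of GATE, the underlying idea can be sketched with rdflib (the ontology and namespace are invented): because the subclass relation is known, a mention annotated as "Engineer" can also be treated as a "Person" mention.

```python
from rdflib import Graph, Namespace, RDFS

EX = Namespace("http://example.org/ontology#")  # hypothetical ontology namespace

g = Graph()
g.add((EX.Engineer, RDFS.subClassOf, EX.Person))
g.add((EX.Person, RDFS.subClassOf, EX.Agent))

# Engineer and all its (transitive) superclasses: an Engineer mention can
# therefore also be treated as a Person (and Agent) mention.
for cls in g.transitive_objects(EX.Engineer, RDFS.subClassOf):
    print(cls)
```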
The last two questions are too generic but I'll try...
After following the tutorial, you'll know exactly how to use ontologies for annotation, hence you'll know if you need to create a custom ontology for your task.
ANNIE is an example of a pipeline and a good place to start studying GATE and writing your own application.

Why not split the data access layer into two?

Everywhere I look, I notice that both Domain-Driven Design (DDD) and entity hydration approaches attempt to populate entities directly from the data layer. I disagree with such approaches, not because they don't work (they do), but because they offer little transparency for testing purposes. I propose that, at the data access layer, data is retrieved into dictionaries instead of directly populating the entities themselves. There are several reasons for this:
First, there is greater flexibility. A dictionary per result set could be populated. We would decide later which entities could be populated from these result sets.
Second, less knowledge about the data layer is needed to determine where data retrieval is failing. We can still write tests to verify data retrieval without having to understand anything about the associated, complex domain entity factories.
There is one so-called disadvantage: performance. Going through two layers is slower than going through one? Yes, it is, but the performance gain from going through a single data layer is negligible here, because both the dictionaries and the entities they would populate would be cached. So, if anything, there is a memory overhead. I think that would be worthwhile for the two advantages stated above.
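To make the proposal concrete, here is a minimal sketch (in Python, with invented names) of a data access layer that returns plain dictionaries and a separate hydration step that builds entities from them later:

```python
from dataclasses import dataclass


# Data access layer: one dictionary per row of the result set.
# (Here the rows are hard-coded; in practice they would come from a query.)
def fetch_customer_rows() -> list[dict]:
    return [
        {"id": 1, "name": "Ada"},
        {"id": 2, "name": "Grace"},
    ]


@dataclass
class Customer:
    id: int
    name: str


# Separate step: decide later which entities to hydrate from the result set.
def hydrate_customers(rows: list[dict]) -> list[Customer]:
    return [Customer(**row) for row in rows]


print(hydrate_customers(fetch_customer_rows()))
```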
It seems like testing is the issue ("for testing purposes"), so I suggest you use repositories just like #tschmuck pointed out.
As Ayende points out, they might give you unnecessary lasagna code (i.e. too many layers), but they will give you flexibility. You can implement fakes/test spies yourself, mock and stub 'em, as well as use an in-memory DB such as SQLite, and the dependent class is just as happy.
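As a rough illustration of that last point, here is a sketch (names invented) of a small repository tested against an in-memory SQLite database:

```python
import sqlite3


class CustomerRepository:
    """Thin repository; the dependent code only sees this interface."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def get_names(self) -> list[str]:
        return [row[0] for row in self.conn.execute("SELECT name FROM customers")]


def test_get_names():
    # The in-memory database acts as a stand-in for the real one.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO customers (name) VALUES ('Ada'), ('Grace')")

    repo = CustomerRepository(conn)
    assert repo.get_names() == ["Ada", "Grace"]
```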
