I have a case where I need to detect anomalies in a dataset. I do not have a labeled trainingset. The problem is not entirely unsupervised either since I know part of the data contains no anomalies, and I know which part that is. This is called novelty detection, as far as I'm aware. One class SVM would work here but I'd like to compare it to alternative methods.
So far I've only found methods that are entirely unsupervised or entirely supervised, so no valid alternatives to one class SVM for my novelty detection case. Does anyone know of methods that would be valid alternatives (preferably available in Python)?
There is a textblob module and it has a PositiveNaiveBayesClassifier which is more practical to code than OneClassSVM. Have a look at that, It will suffice your needs.
Related
I would like to create a semantic context for my data before vectorizing the actual data in Weaviate (https://github.com/semi-technologies/weaviate).
Lets say we have a taxonomy where we have a set of domain specific concepts together with links to their related concepts. Could you advise me what the best way is to encode not only those concepts but also relations between them using contextionary?
Depending on your use case, there are a few answers possible.
You can create the "semantic context" in a Weaviate schema and use a vectorization module to vectorized the data according to this schema.
You have domain-specific concepts in your data that the out-of-the-box vectorization modules don't know about (e.g., specific abbreviations).
You want to capture the semantic context of (i.e., vectorize) the graph itself before adding it to Weaviate.
The first is the easiest and straightforward one, the last one is the most esoteric.
Create a schema and use a vectorizer for your data
In your case, you would create a schema based on your taxonomy and load the data using an out-of-the-box vectorizer (this configurator helps you to build a Docker-compose file).
I would recommend starting with this anyway, because it will determine your data model and how you can search through and/or classify data. It might even be the case that for your use case this step already solves the problem because the out-of-the-box vectorizers are (bias alert) pretty decent.
Domain-specific concepts
At the moment of writing, Weaviate has two vectorizers, the contextionary and the transformers modules.
If you want to extend Weaviate with custom context, you can extend the contextionary or fine tune and distribute custom transformers.
If you do this, I would highly recommend still taking the first step. Because it will simply improve the results.
Capture semantic context of your graph
I don't think this is what you want, but it possible and quite esoteric. In principle, you can store your vectorized graph in Weaviate, but you need to generate the vectors on your own. For example, at the moment of writing, we are looking at RDF2Vec.
PS:
Because people often ask about the role of ontologies and taxonomies in Weaviate, I've written this blog post.
based on the four layer MOF structure, I'm currently working on a model (in fact a UML class diagram) at M1 level. However, I observed that some parts of the meta model are highly depending on references to certain classes, which may may differ depending on the use case. Therefore, I created a meta model on the M2 level, which allows users to define the variable parts of the M1 Model, which again can then be generated and incorpareted in the M1 model. The following images tries to depict that:
A resulting M1 model example would then look like that:
As switching between the different levels can be a little bit confusing, I wonder if this approach is per se possible and UML conform? Furthermore, is there a notation for the "generated instances" relation in Figure 1 by chance? Within the MOF spec, <<merge>> or <<import>> is for example used, which maybe fit in for that purpose.
Probably your question is too broad to give a concise answer. However, here's my advice when dealing with meta models: I found that people hardly have an idea why you need a meta model at all and it takes quite some time to convince them starting to create one. Even with so called UML pros. Now, with that in background, it's evident that modelers who shall use the meta model might have even more difficulties dealing with it. This leaves just one way: keep it simple. And that's what I did in the past. Introducing a meta model with just really the basics, concentrating on meta types, tagged values and some connectors. After a while, people really get used to it and appreciate working with the meta model. Only then there starts the need to switch to a version two, which is still static though.
Now, what you want looks like a version ninety nine. This would probably only work in a super model where you have some gurus floating on top of it all an provide a meta meta model. This will going to be interesting and I'd like to be part of that team. However, I doubt you will be able to get practicable results from it. My recommendation is that you stay with the static meta model. Everything else will likely lead you to nowhere.
I have a corpus of a few 100-thousand legal documents (mostly from the European Union) – laws, commentary, court documents etc. I am trying to algorithmically make some sense of them.
I have modeled the known relationships (temporal, this-changes-that, etc). But on the single-document level, I wish I had better tools to allow fast comprehension. I am open for ideas, but here's a more specific question:
For example: are there NLP methods to determine the relevant/controversial parts of documents as opposed to boilerplate? The recently leaked TTIP papers are thousands of pages with data tables, but one sentence somewhere in there may destroy an industry.
I played around with google's new Parsey McParface, and other NLP solutions in the past, but while they work impressively well, I am not sure how good they are at isolating meaning.
In order to make sense out of documents you need to perform some sort of semantic analysis. You have two main possibilities with their exemples:
Use Frame Semantics:
http://www.cs.cmu.edu/~ark/SEMAFOR/
Use Semantic Role Labeling (SRL):
http://cogcomp.org/page/demo_view/srl
Once you are able to extract information from the documents then you may apply some post-processing to determine which information is relevant. Finding which information is relevant is task related and I don't think you can find a generic tool that extracts "the relevant" information.
I see you have an interesting usecase. You've also mentioned the presence of a corpus(which a really good plus). Let me relate a solution that I had sketched for extracting the crux from research papers.
To make sense out of documents, you need triggers to tell(or train) the computer to look for these "triggers". You can approach this using a supervised learning algorithm with a simple implementation of a text classification problem at the most basic level. But this would need prior work, help from domain experts initially for discerning "triggers" from the textual data. There are tools to extract gists of sentences - for example, take noun phrases in a sentence, assign weights based on co-occurences and represent them as vectors. This is your training data.
This can be a really good start to incorporating NLP into your domain.
Don't use triggers. What you need is a word sense disambiguation and domain adaptation. You want to make sense of is in the documents i.e understand the semantics to figure out the meaning. You can build a legal ontology of terms in skos or json-ld format represent it ontologically in a knowledge graph and use it with dependency parsing like tensorflow/parseymcparseface. Or, you can stream your documents in using a kappa based architecture - something like kafka-flink-elasticsearch with added intermediate NLP layers using CoreNLP/Tensorflow/UIMA, cache your indexing setup between flink and elasticsearch using redis to speed up the process. To understand relevancy you can apply specific cases from boosting in your search. Furthermore, apply sentiment analysis to work out intents and truthness. Your use case is one of an information extraction, summarization, and semantic web/linked data. As EU has a different legal system you would need to generalize first on what is really a legal document then narrow it down to specific legal concepts as they relate to a topic or region. You could also use here topic modelling techniques from LDA or Word2Vec/Sense2Vec. Also, Lemon might also help from converting lexical to semantics and semantics to lexical i.e NLP->ontology ->ontology->NLP. Essentially, feed the clustering into your classification of a named entity recognition. You can also use the clustering to assist you in building out the ontology or seeing what word vectors are in a document or set of documents using cosine similarity. But, in order to do all that it be best to visualize the word sparsity of your documents. Something like commonsense reasoning + deep learning might help in your case as well.
I was checking out Stanford CoreNLP in order to understand NER and POS tagging. But what if I want to create custom tags for entities like<title>Nights</title>, <genre>Jazz</genre>, <year>1992</year> How can I do it? is CoreNLP useful in this case?
CoreNLP out-of-the-box will be restricted to types they mention : PERSON, LOCATION, ORGANIZATION, MISC, DATE, TIME, MONEY, NUMBER. No, you won't be able to recognize other entities just by assuming it could "intuitively" do it :)
In practice, you'll have to choose, either:
Find another NER systems that tags those types
Address this tagging task using knowledge-based / unsupervised approaches.
Search for extra resources (corpora) that contain types you want recognize, and re-train a supervised NER system (CoreNLP or other)
Build (and possibly annotate) your own resources - then you'll have to define an annotation scheme, rules, etc. - quite an interesting part of the work!
Indeed, unless you find an existing system that fulfills your needs, some effort will be required! Unsupervised approaches may help you bootstrapping a system, so as to see if you need to find / annotate a dedicated corpus. In the latter case, it would be better to separate data as train/dev/test parts, so as to be able to assess how much the resulting system performs on unseen data.
Look into this FAQ (http://nlp.stanford.edu/software/crf-faq.shtml) to use CRF classifier to train your model for new classes. You may find it useful.
I had some questions regarding the structure and behavior of a model, using UML, and the relationship between the two :
Did you find any limitations for UML regarding the specification or understanding of the relationship between structure and behavior?
I was wondering if you have any practical ideas of how one can optimize the relationship between structure and behavior, using UML.
Do you know any UML tools that help understand better this relationship or represent it much easier?
Thanks
Yes:
A sequence diagram is readable at a high level, showing how a transaction involves a few components; but it's not good (not readable) at the detailed level, showing how a transaction involves dozens of methods (method A calls method B, which gets data from methods D and E, and then invokes method F, etc.).
Looking at a class diagram, you might see a based class with several subclasses; this tells you nearly nothing about the behaviour of the classes (it only tells you that they may have some behaviour in common, or at least a common API, plus some individual behaior that's unique to each subclass).
That's a big question. A quick answer is, "Attach text notes to the objects: diagrams aren't sufficient without descriptive text."
No, I don't really; a UML tool help you create UML diagrams (and generate code from the diagrams), but it's up to you how you use it. There was a neat product described in the book titled Real-Time Object-Oriented Modeling (1994) which was an executable model, i.e. the model itself had behaviour, but I know of no UML tool quite like that. The closest I know of is being able to "round trip" between the model and code (i.e. generate code from the model, and the model from code).
Sounds like a homework problem. Wiki can tell you all about UML.
The limitations of UML are the same as any form of communication. The simpler your language, the fewer things you can communicate and the clearer your communications will become. A shape like a square or circle identifies a structure, a line indicates relationship, an arrow indicates movement, or flow. You could enhance this by defining the meaning of other properties, like direction, boldness, color, number count, different shapes. You could incorporate multimedia layers like audio or video, motion, tooltips- but now we're not talking about UML anymore.
My favorite UML tools are a whiteboard and some dry erase markers.
I think that things have changed, regarding UML's usefulness to melculetz.
In Visual Studio 2010, I can define an association relationship, that will generate composite classes. I can specify the multiplicity and class qualifiers. I can also generate classes from the model.
Presently, I am attempting to visually model the phases of a system, in order to visually define the methods for a state-machine object. That is my attempt to integrate structure and behaviour. Check my blog to see how I get on.
Class Analyser visually expresses the behaviour of class objects. Limitation removed.
I think that the answer is to turn your development methods towards MDA. You will generate more classes, but the payoff is in terms of manageability and re-use (where you template your efforts).
I am still working through my model but, I find VS2010 promises good tools for managing the development process. I have yet to investigate UI modelling, but have heard the rumours. I may have it all wrong but I think that, by working with Lightswitch, I may be able to model the UI also.
UML allows you to specify the signature of a method, and group methods into classes, but it says nothing at all about what code you use as implementation. If that's what you mean by "behavior", I don't think UML addresses it at all at the class level.
It's even worse at the UI level. My impression of UML is that it's woefully inadequate for specifying UIs.
I think the effort required to embed everything into UML is greater than or equal to coding the application, with the added burdens of UML tools being poor IDEs and inability to prove correctness of UML the way you can with unit testing.
UML is way oversold, IMO. I consider it a convenient notation for informal communication between developers, nothing more. It has never been and never will be the object oriented equivalent of engineering drawings.