How to evaluate SciSpaCy's entity linking

I'm using SciSpaCy's Entity Linker with a custom knowledge base. As I'm updating some components of my application (e.g. the underlying language model, sentence tokenization pipeline, the knowledge base itself, etc), I'm noticing that (1) the number of entities that the application picks up changes and (2) the linked concepts themselves change (not the detected entities but the concepts that are linked to these entities). With this in mind, I'd like to be able to evaluate my entity-linking application.
Unfortunately, I cannot seem to find any resources for that. I was hoping to find either an evaluation library of some sort (assuming we are not just interested in a confusion matrix) or a "gold standard" dataset with entities in various forms (e.g. abbreviated, inflected, etc) and the expected linked concept.
I'm afraid I'm a novice in this field which is why I'm reaching out here, hoping that anyone might be able to point me to a set of useful resources or share some tips with me.
Many thanks in advance.
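One way to make the two observed changes measurable is to score a pipeline version against a small hand-annotated sample. The sketch below is only an illustration under stated assumptions: the `Link` tuple, the `evaluate` function and the concept IDs are hypothetical, and extracting predictions from the actual SciSpaCy pipeline is left out. It reports mention-level precision/recall/F1 (was the entity found?), end-to-end precision/recall/F1 (was it found and linked to the expected concept?), and linking accuracy on the mentions both sides agree on.

```python
from typing import NamedTuple


class Link(NamedTuple):
    """One annotated mention: character span plus linked concept ID (hypothetical format)."""
    start: int
    end: int
    concept_id: str


def prf(true_positives: int, n_predicted: int, n_gold: int) -> tuple[float, float, float]:
    """Precision, recall and F1, returning 0.0 where a ratio is undefined."""
    precision = true_positives / n_predicted if n_predicted else 0.0
    recall = true_positives / n_gold if n_gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(gold: list[Link], predicted: list[Link]) -> dict:
    """Compare predictions with gold annotations using exact span matching."""
    gold_by_span = {(g.start, g.end): g.concept_id for g in gold}
    pred_by_span = {(p.start, p.end): p.concept_id for p in predicted}
    matched = gold_by_span.keys() & pred_by_span.keys()
    correct = sum(1 for span in matched if gold_by_span[span] == pred_by_span[span])
    return {
        "mention": prf(len(matched), len(pred_by_span), len(gold_by_span)),
        "end_to_end": prf(correct, len(pred_by_span), len(gold_by_span)),
        "linking_accuracy": correct / len(matched) if matched else 0.0,
    }


# Made-up annotations for one document; the concept IDs are placeholders.
gold = [Link(0, 5, "C001"), Link(10, 14, "C002"), Link(20, 25, "C003")]
predicted = [Link(0, 5, "C001"), Link(10, 14, "C009"), Link(30, 33, "C004")]
scores = evaluate(gold, predicted)
print(scores)
```

Running the same annotated sample through each version of the application and comparing the three numbers separates "fewer entities detected" from "same entities, different concepts".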

Related

Is it possible to generate parts of a meta model from upper layer?

Based on the four-layer MOF structure, I'm currently working on a model (in fact a UML class diagram) at the M1 level. However, I observed that some parts of the meta model depend heavily on references to certain classes, which may differ depending on the use case. Therefore, I created a meta model on the M2 level, which allows users to define the variable parts of the M1 model, which can then be generated and incorporated into the M1 model. The following image tries to depict that:
A resulting M1 model example would then look like that:
As switching between the different levels can be a little bit confusing, I wonder whether this approach is possible in principle and conforms to UML. Furthermore, is there by any chance a notation for the "generated instances" relation in Figure 1? Within the MOF spec, <<merge>> or <<import>> is used, for example, which might fit that purpose.

Your question is probably too broad for a concise answer. However, here's my advice when dealing with meta models: I found that people hardly have an idea why you need a meta model at all, and it takes quite some time to convince them to start creating one. Even so-called UML pros. With that in the background, it's evident that modelers who are supposed to use the meta model might have even more difficulty dealing with it. This leaves just one way: keep it simple. And that's what I did in the past: introducing a meta model with really just the basics, concentrating on meta types, tagged values and some connectors. After a while, people get used to it and appreciate working with the meta model. Only then does the need arise to switch to a version two, which is still static though.
Now, what you want looks like a version ninety-nine. This would probably only work in a super model where you have some gurus floating on top of it all and providing a meta meta model. This is going to be interesting and I'd like to be part of that team. However, I doubt you will be able to get practicable results from it. My recommendation is that you stay with the static meta model. Everything else will likely lead you nowhere.

What is common practice for designing an initial class diagram for a project?

I am currently taking a course that gives an introduction to project planning. It is mostly about how to draw UML diagrams (blegh), but also has a few other topics.
One part in particular keeps bugging me. In the course they describe a method for going from a set of requirements to an initial class diagram, but everything about the method gives me this feeling that it is most definitely not the way to go. Let me first give an example before proceeding.
Let's consider a system that manages a greenhouse company. The company has multiple greenhouses, and every employee is assigned to his/her own greenhouse. A greenhouse has a location and a type of plant being grown in there. An employee has a name and phone number.
Here's what according to the course's method the class diagram would look like:
To me this looks like a database layout adapted for code. When I go about designing a program, I try to identify major abstractions. Like all the code that interacts with the database or the code that is responsible for the GUI are all different parts of the system. That would be what I consider to be an initial class diagram.
I simply cannot imagine that this is a common way to start designing the architecture of a project. The classes look ugly, since if you take a slightly larger example the classes will be flooded with responsibilities. To me they look like data objects that have functionality they shouldn't have. It does not give me a clue on how to continue from here and get a general architecture going. Everything about it seems obsolete.
All I want to know is whether someone out there can tell me if this is a common way to get a first class diagram on paper, for reasons I am overlooking.

I would say it's reasonable to start with a logical model that's free of implementation constraints. That logical model is not necessarily concerned with physical implementation details (e.g. whether or not to use a database, what type of database, OS / UI choice, etc.) and thus represents just "real" business domain objects and processes. The similarity to a potential database implementation shouldn't be surprising for the simple example.
By understanding your business domain (through the logical model you've started to construct), you will be better placed to subsequently identify, for example, which architectural patterns are appropriate, what screens you need to build, and database elements to design. Possibly, there will be another part of the course that will aid you in this stage.
In practice, you will often know that you're intending to implement, say, a web-based application using MVC with a back-end database, and may look to model the implementation classes in parallel with your business items. For your course to use a method that emphasises the distinction between logical and physical stages doesn't sound unreasonable.
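To make the idea of a logical model concrete, here is a minimal sketch of the greenhouse example from the question as plain Python dataclasses. It is not the course's diagram (which is missing here); the class and method names are assumptions chosen for illustration, and the model deliberately contains no database, GUI or framework code.

```python
from dataclasses import dataclass, field


@dataclass
class Greenhouse:
    location: str
    plant_type: str


@dataclass
class Employee:
    name: str
    phone_number: str
    greenhouse: Greenhouse  # the greenhouse this employee is assigned to


@dataclass
class Company:
    greenhouses: list[Greenhouse] = field(default_factory=list)
    employees: list[Employee] = field(default_factory=list)

    def hire(self, name: str, phone_number: str, greenhouse: Greenhouse) -> Employee:
        """Business rule: an employee can only be assigned to one of the company's greenhouses."""
        if greenhouse not in self.greenhouses:
            raise ValueError("greenhouse does not belong to this company")
        employee = Employee(name, phone_number, greenhouse)
        self.employees.append(employee)
        return employee


company = Company()
tomatoes = Greenhouse(location="North field", plant_type="tomato")
company.greenhouses.append(tomatoes)
alice = company.hire("Alice", "555-0100", tomatoes)
```

Persistence, screens and architectural layers can be added around these classes later without changing what they say about the business domain.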

When I go about designing a program, I try to identify major abstractions
The same principle applies in UML. You represent abstractions and their relationships, and with existing visual tools you can present the system to stakeholders or even generate code stubs automatically from your design.

Can't help but see Domain entities as wasteful. Why?

I've got a question on my mind that has been stirring for months as I've read about DDD, patterns and many other topics of application architecture. I'm going to frame this in terms of an MVC web application, but the question is, I'm sure, much broader. It is this: does the adherence to domain entities create rigidity and inefficiency in an application?
The DDD approach makes complete sense for managing the business logic of an application and as a way of working with stakeholders. But to me it falls apart in the context of a multi-tiered application. Namely, there are very few scenarios when a view needs all the data of an entity, or when even two repositories have it all. In and of itself that's not bad, but it means I make multiple queries returning a bunch of properties I don't need to get a few that I do. And once that is done, the extraneous information either gets passed to the view or there is the overhead of discarding, merging and mapping data to a DTO or view model. I need to generate a lot of reports, and the problem seems magnified there. Each requires a unique slicing or aggregating of information that SQL can do well but repositories can't, as they're expected to return full entities. It seems wasteful, honestly, and I don't want to pound a database and generate unneeded network traffic on a matter of principle. From questions like "Should the repository layer return data-transfer-objects (DTO)?" it seems I'm not the only one to struggle with this question. So what's the answer to the limitations it seems to impose?
Thanks from a new and confounded DDD-er.

What's the real problem here? Processing business rules and querying for data are 2 very different concerns. That realization leads us to CQRS - Command-Query Responsibility Segregation. What's that? You just don't use the same model for both tasks: the Domain Model is about behavior, performing business processes, handling commands. And there is a separate Reporting Model used for display. In general, it can contain a table per view. These tables contain only relevant information, so you can get rid of DTOs, AutoMapper, etc.
How do these two models synchronize? It can be done in many ways:
Reporting model can be built just on top of database views
Database replication
Domain model can issue events containing information about each change and they can be handled by denormalizers updating proper tables in Reporting Model
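The third option can be sketched in a few lines. This is a minimal in-memory illustration, not a production CQRS setup: `EventBus`, `OrderCommandHandler`, `OrderPlaced` and `SalesReportDenormalizer` are hypothetical names, and a plain dictionary stands in for the reporting table.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Domain event describing one change on the write side."""
    order_id: int
    customer: str
    total: float


class EventBus:
    """Very small synchronous publish/subscribe mechanism."""

    def __init__(self) -> None:
        self._handlers = defaultdict(list)

    def subscribe(self, event_type, handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event) -> None:
        for handler in self._handlers[type(event)]:
            handler(event)


class OrderCommandHandler:
    """Write side: enforces business rules, then announces what happened."""

    def __init__(self, bus: EventBus) -> None:
        self._bus = bus
        self._order_ids: set[int] = set()

    def place_order(self, order_id: int, customer: str, total: float) -> None:
        if total <= 0:
            raise ValueError("order total must be positive")
        if order_id in self._order_ids:
            raise ValueError("duplicate order id")
        self._order_ids.add(order_id)
        self._bus.publish(OrderPlaced(order_id, customer, total))


class SalesReportDenormalizer:
    """Read side: keeps one pre-aggregated row per customer for a report view."""

    def __init__(self) -> None:
        self.rows: dict[str, dict] = {}

    def on_order_placed(self, event: OrderPlaced) -> None:
        row = self.rows.setdefault(event.customer, {"orders": 0, "revenue": 0.0})
        row["orders"] += 1
        row["revenue"] += event.total


bus = EventBus()
report = SalesReportDenormalizer()
bus.subscribe(OrderPlaced, report.on_order_placed)

commands = OrderCommandHandler(bus)
commands.place_order(1, "alice", 10.0)
commands.place_order(2, "alice", 20.0)
commands.place_order(3, "bob", 5.0)
print(report.rows["alice"])
```

The view reads `report.rows` directly, so no entity is loaded and no mapping step is needed on the query path.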

as I've read about DDD, patterns and many other topics of application architecture
Domain-driven design is not about patterns and architecture but about designing your code according to the business domain. Instead of thinking about repositories and layers, think about the problem you are trying to solve. The simplest way to "start rehabilitation" would be to rename ProductRepository to just Products.
Does the adherence to domain entities create rigidity and inefficiency in an application?
Inefficiency comes from bad modeling. [citation needed]
The DDD approach makes complete sense for managing the business logic of an application and as a way of working with stakeholders. But to me it falls apart in the context of a multi-tiered application.
Tiers aren't layers
Namely there are very few scenarios when a view needs all the data of an entity or when even two repositories have it all. In and of itself that's not bad but it means I make multiple queries returning a bunch of properties I don't need to get a few that I do.
Query that data as you wish. Do not try to box your problems into some "ready-made solutions". Instead - learn from them and apply only what's necessary to solve them.
Each requires a unique slicing or aggregating of information that SQL can do well but repositories can't as they're expected to return full entities.
http://ayende.com/blog/3955/repository-is-the-new-singleton
So what's the answer to the limitations it seems to impose?
"seems"
Btw, internet is full of things like this (I mean that sample app).
To understand what DDD is, read the blue book slowly and carefully. Twice.

If you think that fully fledged DDD is too much effort for your scenario then maybe you need to take a step down and look at something closer to Active Record.
I use DDD, but in my scenario I have to support multiple front-ends: a couple of web sites and a WinForms app, as well as a set of services that allow interaction with other automated processes. In this case, the extra complexity is worth it. I use DTOs to transfer a representation of my data to the various presentation layers. The CPU overhead in mapping domain entities to DTOs is small - a rounding error when compared to network calls and database calls. There is also the overhead in managing this complexity. I have mitigated this to some extent by using AutoMapper. My repositories return fully populated domain objects. My service layer maps to/from DTOs. Here we can flatten out the domain objects, combine domain objects, etc. to produce a more tabulated representation of the data.
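The flattening step described above might look roughly like this. It is a hand-written sketch in Python rather than the AutoMapper setup the answer uses, and all class and field names are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Customer:
    name: str
    email: str


@dataclass
class OrderLine:
    product: str
    quantity: int
    unit_price_cents: int


@dataclass
class Order:
    """Fully populated domain object, as a repository would return it."""
    order_id: int
    customer: Customer
    lines: list[OrderLine]

    def total_cents(self) -> int:
        return sum(line.quantity * line.unit_price_cents for line in self.lines)


@dataclass(frozen=True)
class OrderSummaryDto:
    """Flat, presentation-friendly view combining data from several domain objects."""
    order_id: int
    customer_name: str
    line_count: int
    total_cents: int


def to_summary_dto(order: Order) -> OrderSummaryDto:
    """Service-layer mapping from the domain object graph to a flat DTO."""
    return OrderSummaryDto(
        order_id=order.order_id,
        customer_name=order.customer.name,
        line_count=len(order.lines),
        total_cents=order.total_cents(),
    )


order = Order(
    order_id=42,
    customer=Customer("Ada", "ada@example.com"),
    lines=[OrderLine("seeds", 2, 500), OrderLine("pot", 1, 250)],
)
dto = to_summary_dto(order)
```

Each front-end receives only the DTO, so the domain classes can change without breaking the presentation layers.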
Dino Esposito wrote an MSDN Magazine article on this subject here - you may find this interesting.
So, I guess to answer your "Why" question - as usual, it depends on your context. DDD may be too much effort, in which case do something simpler.

Each requires a unique slicing or aggregating of information that SQL can do well but repositories can't as they're expected to return full entities.
Add methods to your repository to return ONLY what you want e.g. IOrderRepository.GetByCustomer
It's completely OK in DDD.
You may also use the Query Object pattern or Specification to make your repositories more generic; just remember not to use anything ORM-specific in the interfaces of the repositories (e.g. ICriteria of NHibernate).
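As a rough illustration of a repository that returns only what the caller needs, here is a sketch using Python's built-in sqlite3 module with an in-memory database. The table layout, `OrderSummary` and the method names are assumptions made for the example; `get_by_customer` loosely mirrors the IOrderRepository.GetByCustomer idea, and the second method lets SQL do the aggregation for a report.

```python
import sqlite3
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderSummary:
    """Only the columns this use case needs, not the full order."""
    order_id: int
    total: float


class OrderRepository:
    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def get_by_customer(self, customer_id: int) -> list[OrderSummary]:
        """Select two columns instead of loading whole order rows."""
        rows = self._conn.execute(
            "SELECT id, total FROM orders WHERE customer_id = ? ORDER BY id",
            (customer_id,),
        ).fetchall()
        return [OrderSummary(order_id=row[0], total=row[1]) for row in rows]

    def revenue_by_customer(self) -> dict[int, float]:
        """Report-style query: the database aggregates, nothing is materialized as entities."""
        rows = self._conn.execute(
            "SELECT customer_id, SUM(total) FROM orders GROUP BY customer_id"
        ).fetchall()
        return {customer_id: revenue for customer_id, revenue in rows}


# Demo data; the extra columns are never fetched by the repository.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, "
    "total REAL, shipping_address TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 10.0, "1 Main St", "gift"),
        (2, 1, 5.5, "1 Main St", ""),
        (3, 2, 7.0, "9 Elm Rd", "fragile"),
    ],
)
repo = OrderRepository(conn)
print(repo.get_by_customer(1))
print(repo.revenue_by_customer())
```

The repository interface exposes plain Python types only, so nothing database-specific leaks to its callers.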

DDD what all terms mean for Joe the plumber who can't afford to read books few times?

I am on a tight schedule with my project, so I don't have time to read books to understand DDD.
Like anything else, it can surely be summarized in a few lines by someone who has read the books a few times. So I need a short description of each term in the DDD practice guidelines, so that I can apply them to my project piece by piece.
I already know the terms in general but can't relate them to a C# project.
Below are the terms I know so far from reading brief descriptions. For each, what is its purpose in a C# project?
Services
Factories
Repository
Aggregates
DomainObjects
Infrastructure
I am really confused about Infrastructure, Repository and Services.
When should I use Services and when should I use a Repository?
Please let me know if there is any way I can make this question clearer.

I recommend that you read through the Domain-Driven Design Quickly book from InfoQ. It is short and free as a PDF that you can download right away, and it does its best to summarize the concepts presented in Eric Evans's Blue Bible.
You didn't specify which language/framework the project you are currently working on is in, if it is a .NET project then take a look at the source code for CodeCampServer for a good example.
There is also a fairly more complicated example named Fohjin.DDD that you can look at (it has a focus on CQRS concepts that may be more than you are looking for)
Steve Bohlen has also given a presentation to an alt.net crowd on DDD, you can find the videos from links off of his blog post
I've just posted a blog post which lists these and some other resources as well.
Hopefully some of these resources will help you get started quickly.

This is my understanding and I did NOT read any DDD book, even the holy bible of it.
Services - stateless classes that usually operate on different layer objects, thus helping to decouple them; also to avoid code duplication
Factories - classes that know how to create objects, thus decoupling invoking code from implementation details and making it easier to switch implementations; many factories also help to auto-resolve object dependencies (IoC containers); factories are infrastructure
Repository - interfaces (and corresponding implementations) that narrow data access to the bare minimum that clients should know about
Aggregates - classes that unify access to several related entities via a single interface (e.g. order and line items)
Domain Objects - classes that operate purely on domain/business logic, and do not care about persistence, presentation, or other concerns
Infrastructure - classes/layers that glue different objects or layers together; contains the actual implementation details that are not important to real application/user at all (e.g. how data is written to database, how HTTP form is mapped to view models).
Repository provides access to a very specific, usually single, kind of domain object. They emulate collection of objects, to some extent. Services usually operate on very different types of objects, usually accessed via static methods (do not have state), and can perform any operation (e.g. send email, prepare report), while repositories concentrate on CRUD methods.
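The contrast drawn in this answer can be sketched as follows. The example is in Python rather than C#, follows this answer's description rather than any official DDD definition, and uses made-up names (`Customer`, `CustomerRepository`, `WelcomeService`); a list stands in for a real mail gateway.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Customer:
    customer_id: int
    name: str
    email: str


class CustomerRepository:
    """Collection-like access to exactly one kind of domain object."""

    def __init__(self) -> None:
        self._items: dict[int, Customer] = {}

    def add(self, customer: Customer) -> None:
        self._items[customer.customer_id] = customer

    def get(self, customer_id: int) -> Optional[Customer]:
        return self._items.get(customer_id)

    def remove(self, customer_id: int) -> None:
        self._items.pop(customer_id, None)

    def __len__(self) -> int:
        return len(self._items)


class WelcomeService:
    """Stateless operation that touches several kinds of objects (a customer and an outbox)."""

    @staticmethod
    def send_welcome(customer: Customer, outbox: list) -> None:
        outbox.append((customer.email, f"Welcome, {customer.name}!"))


repo = CustomerRepository()
repo.add(Customer(1, "Ada", "ada@example.com"))

outbox: list = []
customer = repo.get(1)
if customer is not None:
    WelcomeService.send_welcome(customer, outbox)
```

The repository only stores and retrieves customers; the service holds no state of its own and performs an operation that does not belong to any single object.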

DDD what all terms mean for Joe the plumber who can’t afford to read books few times?
I would say - not much. Not enough for sure.

I think you're being quite ambitious in trying to apply a new technique to a project that's under such tight deadlines that you can't take the time to study the technique in detail.
At a high level, DDD is about decomposing your solution into layers and allocating responsibilities cleanly. If you attempt just to do that in your application, you're likely to get some benefit. Later, when you have more time to study, you may discover that you didn't quite follow all the details of the DDD approach. I don't see that as a problem: you probably already got some benefit from a thoughtful structure, even if you deviated from some of the DDD guidance.
To specifically answer your question in detail would just mean reiterating material that's already out there: Seems to me that this document nicely summarises the terms you're asking about.
They say about Services:
Some concepts from the domain aren’t natural to model as objects. Forcing the required domain functionality to be the responsibility of an ENTITY or VALUE either distorts the definition of a model-based object or adds meaningless artificial objects.
Therefore: When a significant process or transformation in the domain is not a natural responsibility of an ENTITY or VALUE OBJECT, add an operation to the model as a standalone interface declared as a SERVICE.
Now the thing about this kind of wisdom is that to apply it you need to be able to spot when you are "distorting the definition". And I suspect that only with experience (or guidance from someone who is experienced) do you gain the insight to spot such things.
You must expect to experiment with ideas, get it a bit wrong sometimes, then reflect on why your decisions hurt or work. Your goal should not be to do DDD for its own sake, but to produce good software. When you find it cumbersome to implement something, or difficult to maintain something, think about why this is, then examine what you did in the light of DDD advice. At that point you may say "Oh, if I had made that a Service, the Model would be so much cleaner", or whatever.
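A commonly used illustration of the quoted SERVICE guidance is a money transfer: the process involves two accounts and is not naturally owned by either. The sketch below is one hypothetical way to express that in Python; `Account` and `TransferService` are names invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Account:
    """Entity: responsible only for its own balance rules."""
    account_id: str
    balance_cents: int

    def withdraw(self, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        if amount_cents > self.balance_cents:
            raise ValueError("insufficient funds")
        self.balance_cents -= amount_cents

    def deposit(self, amount_cents: int) -> None:
        if amount_cents <= 0:
            raise ValueError("amount must be positive")
        self.balance_cents += amount_cents


class TransferService:
    """Domain service: a process that spans two entities and belongs to neither."""

    def transfer(self, source: Account, target: Account, amount_cents: int) -> None:
        # withdraw() validates the amount first, so deposit() cannot fail afterwards.
        source.withdraw(amount_cents)
        target.deposit(amount_cents)


checking = Account("checking", 10_000)
savings = Account("savings", 0)
TransferService().transfer(checking, savings, 2_500)
```

Putting `transfer` on `Account` would force one account to reach into another, which is the kind of distortion the quote warns about.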
You may find it helpful to read an example.

Breaking up Entities into smaller Entities in DDD?

Does it make sense to make subsets of an Entity if you consider their usage in the application differently? I.e., I take my entity and define a new entity with only some of the attributes of the first. Now I have 2 Entities that overlap but are used differently, yet ultimately persist in the same data table. These Entities will be accessed through different repositories...

I am only starting to learn about DDD myself, so if I am wrong please comment and let me know. Here are my thoughts though:
If the entity is going to be accessed through a different repository, I think it deserves its own class. Additionally, the bits that overlap now may not overlap in the future, and if you use a shared base class, you will probably be more likely to try adapting things at that point, which will dirty up your domain.
If the two classes are part of separate sub-domains, they probably should be separate. My thoughts are based around parts of an example I remember hearing in Rob Conery's interview on Hanselminutes. A product has several properties that are important to consumers (pricing, description, etc.), and several properties that are important to warehouse personnel (location in the warehouse, weight, dimensions, etc.). The implication to me in that episode was that the two products should be defined separately in the domain, instead of being defined once and shared.
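The product example can be sketched as two entities backed by the same stored rows, each with its own repository. This is only an illustration in Python: the class names, the field names and the dictionary standing in for the shared table are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogProduct:
    """What the sales sub-domain cares about."""
    sku: str
    description: str
    price_cents: int


@dataclass(frozen=True)
class WarehouseProduct:
    """What the warehouse sub-domain cares about."""
    sku: str
    aisle: str
    weight_kg: float


# Stand-in for the single shared data table, keyed by SKU.
PRODUCT_ROWS = {
    "SKU-1": {
        "description": "Garden hose",
        "price_cents": 1999,
        "aisle": "B7",
        "weight_kg": 2.4,
    },
}


class CatalogProductRepository:
    def __init__(self, rows: dict) -> None:
        self._rows = rows

    def get(self, sku: str) -> CatalogProduct:
        row = self._rows[sku]
        return CatalogProduct(sku, row["description"], row["price_cents"])


class WarehouseProductRepository:
    def __init__(self, rows: dict) -> None:
        self._rows = rows

    def get(self, sku: str) -> WarehouseProduct:
        row = self._rows[sku]
        return WarehouseProduct(sku, row["aisle"], row["weight_kg"])


catalog_item = CatalogProductRepository(PRODUCT_ROWS).get("SKU-1")
warehouse_item = WarehouseProductRepository(PRODUCT_ROWS).get("SKU-1")
```

Neither class inherits from the other, so each sub-domain's view of a product can evolve independently even though the storage is shared.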

If by "usage in the application" you mean you'll display different parts in different views, then I'd suggest you use a presentation pattern like the one Fowler describes in his Presentation Model (or if you're developing a WPF app you can use the more WPF-specialized version called Model-View-ViewModel (MVVM)).
But if by "usage" you mean that you'll use different attributes of the entity in different sub-domains or parts of your domain, then I agree with Chris: you'd probably be better off breaking them into different entities. The reason is that your domain model should reflect how the entity is used in that specific (sub)domain. And if you're using different parts of the entity under different circumstances, they probably have different meanings in these settings, which again should be reflected in the naming of the entities. And if it were me, I'd probably make one repository for each of the entities. Having a 1:1 mapping between entities and repositories seems to make sense in most cases, as far as I've experienced. But then again, just like Chris, Rob Conery and 90% of the developers trying to do DDD, I'm fairly new to the DDD game, and so my experiences might be overruled by someone more experienced :)
