I am processing plain text documents and identifying entities like college/university names present in the document. Some times these names are written in different formats but they refer to a single college/university name.
Example:
Jawaharlal Nehru Technological University Hyderabad
J.N.T.U Hyderabad
JNTU Hyderabad
JNTU-H
Jawaharlal Nehru Technological University (JNTU) Hyderabad
All the above names refer to same college name.
How can we relate all these names to a single college/university names?
(I am looking for some kind of web service or something like Google search because if i search for any of those names it returns same college link.)
This task is named "Entity Linking". Some systems are dedicated to this, in most cases by leveraging Wikipedia (in particular redirects which give possible mentions for entities), such as Babelfy or DBpedia Spotlight.
Those service rely on data to link mentions to unique identifiers: if they have possible mentions for your entities, it should probably work in most cases (but for those that are to ambiguous). But in many cases their lexicon are not sufficient and you'll probably face unknown entities or mentions. In that case, you'll have to build your own system by using an existing framework and provide it with relevant database of entities and their mentions. Acronyms could be automatically generated from their full names.
Related
Hi I would like to know if we can have something like the following example on Doccano:
So let's say that we have a sentence like this : "MS is an IT company". I want to label some words in this sentence, for example MS (Microsoft). MS should be labelled as a Company (so imagine that I have an entity named Company) but I also want to say that MS stands for Microsoft.
Is there a way to do that with Doccano?
Thanks
Doccano supports
Sequence Labelling good for Named Entity Recognition (NER)
Text Classification good e.g. for Sentiment Analysis
Sequence To Sequence good for Machine Translation
What you're describing sounds a little like Entity Linking.
You can see from Doccano's roadmap in its docs that Entity Linking is part of the plans, but not yet available.
For now, I suggest to frame this as a NER problem, and to have different entities for MS (Microsoft) and MS (other). If you have too many entities to choose from, the labelling could become complicated, but then you could break up the dataset in smaller entity-focussed datasets. For example, you could get only documents with MS in them and label the mentions as one of the few synonyms.
I'm new to natural language process so I apologize if my question is unclear. I have read a book or two on the subject and done general research of various libraries to figure out how i should be doing this, but I'm not confident yet that know what to do.
I'm playing with an idea for an application and part of it is trying to find product mentions in unstructured text (e.g. tweets, facebook posts, emails, websites, etc.) in real-time. I wont go into what the products are but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe i could use a string similarity check to catch these.
Product names that are also English words or things would get caught. Like mustang the horse versus mustang the car
Needing to keep a list of alternative names for products (e.g. "coke" for "coco-cola", etc.)
I don't really know where to start with this but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really gleam how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not specific to any language at this point. Java preferably but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is using Named Entity Recognizer(NER) and POS tagger (grab NNP/NNPS; Proper nouns). The database there might be missing some new brands like Lyft (Uber's rival) but without developing your own prop database, Stanford tagger will solve half of your immediate needs.
If you have time, I would build the dictionary that has every brands name and simply extract it from tweet strings.
http://www.namedevelopment.com/brand-names.html
If you know how to crawl, it's not a hard problem to solve.
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
Misspellings
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many reasons for "misspellings" is opacity in the relationship between the word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
Conclusion
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.
I am not an expert in Machine Learning, so I will try to be as accurate as possible...
I am currently analyzing financial documents that are giving information on a specific fund. What I would like to do is to be able to extract the fund name.
For this, I am using Named Entity Recognition (NER) in Azure Machine Learning platform. After analyzing approx. 100 documents, I get results classified as Organizations. In most cases, they are really organizations. This is great, but my problem is that the fund name is also categorized as an organization. I am not able to distinguish between a company name and a fund name.
From some readings on Internet, I could discover that Gazette system could help so that we can match the recognized organizations against a list of funds, and therefore make sure that we have a fund name.
Do you think this would be a good approach? Or is there any other algorithm that I should try to improve the results?
Thanks for any suggestion!
NER has its origins in identifying text identifying broad semantic categories, like the names of people or organizations (companies) in your case. Reading the description of question, I don't think this is the problem you really want to solve. Specifically you mention:
that Gazette system could help so that we can match the recognized organizations against a list of funds
I suspect the problem you really want to solve is one of semantic interoperability - you want text from your NLP program to match a list you have that is part of another system. In that case, the only accepted way you are going to solve your problem is to map all of the input text to a list/common standard - ie) use the gazetteer. So you are on the right path.
The only caveat is that if you only need to distinguish between funds and other types of organizations - without the need to match the results against a list. If that is the case, you write a classifier to distinguish funds from everything else and you can avoid mapping to your list entirely. Otherwise use a gazetteer.
Does DBpedia name have any standard or convention? By that, I mean, e.g., United Kingdom has a resource named United_Kingdom. But I'm seeing that the fact of having an underscore and having each word being capitalized doesn't hold. For instance, take University_of_Manchester; if you type it as University_Of_Manchester with a capital ‘O’ in “of,” you won't get the resource. Is it obligatory to do a filtering to get the resource name in the proper case, because we may want to make all letters lowercase, have underscore in spaces and just make a query because doing in filtering in the SPARQL do takes some time.
Any suggestions? I've just started to learn about DBpedia, so I may be missing something.
DBpedia encodes the information available in Wikipedia, and its naming convention is based on the names of Wikipedia articles. The DBpedia wiki page, The DBpedia Data Set, says, in Section 3. Denoting or Naming “Things”:
Each thing in the DBpedia data set is denoted by a de-referenceable IRI- or URI-based reference of the form http://dbpedia.org/resource/Name, where Name is derived from the URL of the source Wikipedia article, which has the form http://en.wikipedia.org/wiki/Name. Thus, each DBpedia entity is tied directly to a Wikipedia article. Every DBpedia entity name resolves to a description-oriented Web document (or Web resource).
Until DBpedia release 3.6, we only used article names from the English Wikipedia, but since DBpedia release 3.7, we also provide Internationalized Datasets that contain IRIs like http://xx.dbpedia.org/resource/Name, where xx is a Wikipedia language code and Name is taken from the source URL, http://xx.wikipedia.org/wiki/Name.
Thus, since the Wikipedia article is University of Manchester, not University Of Manchester, the DBpedia resource is http://dbpedia.org/page/University_of_Manchester, and not http://dbpedia.org/page/University_Of_Manchester.
I need to categorize a text or word to a particular category. For example, the text 'Pink Floyd' should be categorized as 'music' or 'Wikimedia' as 'technology' or 'Einstein' as 'science'.
How can this be done? Is there a way I can use the DBpedia for the same? If not, the database has to be trained from time to time, right?
This is a text classification problem. Manning, Raghavan and Schütze's Information Retrieval book chapter is a nice introduction. I think you do not need DBPedia nor NER for this, just a small labeled training data set with enough labeled examples for all of your classes.
Yes, DBpedia may be a good choice for this kind of problem. You'll have to
squash the DBpedia category structure so you get the right granularity (e.g., Pink Floyd is listed under Capitol Records artists and a host of other categories, but not directly under Music). Maybe pick a few large categories and try to find whether your concepts are listed indirectly in them;
normalize text; Einstein is listed as Albert Einstein, not einstein
deal with ambiguity due to terms describing multiple concepts and concepts belonging to multiple top-level categories.
These problems may be solvable using machine learning, but I only see how it can be done if you extract these terms, along with relevant features, from running text. But in that case, you might just as well classify the entire text into one of the categories you choose in step 1.
This is the well-studied named entity recognition problem. Unless you have a particular need to roll your own technology (hint: it's a hard problem in general), using Gate, or perhaps one of the online services that builds on it (e.g. TSO's Data Enrichment Service), would be a good option. An alternative online service is OpenCalais.
Mapping your categries to DBPedia.
Index with lucene selected DBPedia categories and label data with your category names.
Do search for your data - tokenization, normalization will be done by Lucene.
This approach is somehow related to KNN classification.
Yes DBpedia is a good choice for text classification, as you can use its predicates/ relations to query and to extract the meaningful information for the particular category.
You can look into the endpoint for querying Dbpedia:
http://dbpedia.org/sparql
Further, learn the basic syntax of SPARQL to query on the endpoint from the following link:
http://www.w3.org/TR/rdf-sparql-query/