Entity extraction from bank wire transactions (not-so-natural text) - nlp

I am trying to extract entities (name, address, organization) from not-so-natural text, like the comments in bank wire transactions.
I am obviously not getting good results; I have used NLTK, OpenNLP and CoreNLP.
Any idea how to improve the results?
The text can look like:
EVERITT 620122T NAT ABC INDIA LTD
REF ROBERT FINEMANN - REASON SHOP RENTAL
REF BY92 00 112233999 - REASON SPEEDING FINE
GEM SS HEUTIGEM SCHIENDLER
PENSION CH1234 CAB28
...
References to research work or existing products would also help.

If you're using OpenNLP and know how to train it, you should give 15000 examples in the training data, which can look like:
<START:name> EVERITT <END> <START:Address> 620122T NAT <END> <START:Organisation> ABC INDIA LTD <END>
.......
....(15000 lines)
and then you can expect some good results!
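For instance, if your annotated spans live in some structured form, you can generate those training lines programmatically. A minimal sketch in plain Python with a made-up record format (the actual OpenNLP training happens separately):

records = [
    # (raw text, list of (label, (token start, token end))) -- hypothetical format
    ("EVERITT 620122T NAT ABC INDIA LTD",
     [("name", (0, 1)), ("Address", (1, 3)), ("Organisation", (3, 6))]),
]

def to_opennlp_line(text, annotations):
    tokens = text.split()
    spans = {start: (label, end) for label, (start, end) in annotations}
    out, i = [], 0
    while i < len(tokens):
        if i in spans:
            label, end = spans[i]
            out.append("<START:%s> %s <END>" % (label, " ".join(tokens[i:end])))
            i = end
        else:
            out.append(tokens[i])
            i += 1
    return " ".join(out)

for text, anns in records:
    print(to_opennlp_line(text, anns))
# <START:name> EVERITT <END> <START:Address> 620122T NAT <END> <START:Organisation> ABC INDIA LTD <END>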

It seems to me you'll have to use a dictionary/database.
You could try growing one using a procedure like this: http://www.cs.columbia.edu/~mcollins/papers/eacl2014.pdf
But you'll still need a way of defining candidate "phrases" -- the examples from the paper, e.g. capitalized words, obviously won't work here.
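To illustrate the dictionary idea, here is a rough sketch of greedy longest-match lookup against a hand-grown gazetteer; the entries and label names are invented:

gazetteer = {
    "ABC INDIA LTD": "Organisation",
    "ROBERT FINEMANN": "name",
}

def match_phrases(text, gazetteer, max_len=5):
    # Greedy longest-match of gazetteer entries over whitespace tokens.
    tokens = text.upper().split()
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in gazetteer:
                matches.append((phrase, gazetteer[phrase]))
                i += n
                break
        else:
            i += 1
    return matches

print(match_phrases("REF ROBERT FINEMANN - REASON SHOP RENTAL", gazetteer))
# [('ROBERT FINEMANN', 'name')]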

Related

Direction on feature extraction in an unstructured document with Python

Project Background:
I'm quite new to NLP, so please forgive me if my problem seems unreasonably complicated. I am trying to extract some features, like company names, money values and names of individuals, from a public company listing document, which is a large body of text (300+ pages).
Text parsed into the program looks something like this:
"In this motion, Company A Holdings (The "Company"), was sponsored by Company B Limited. John Doe, the chairman of the company, has approved of this activity"
The expected outcome looks something like this:
The Company: Company A Holdings
Sponsor: Company B Limited
Chairman: John Doe
Since all documents came in PDF form, I have parsed them in as text. I performed some NER with spaCy on the documents I have, and judging by the NER results, it successfully recognised all the entities that I needed (i.e. it recognised Company A Holdings, Company B Limited and John Doe).
How should I approach the said goal? I don't have a massive number of files to train a model with (currently around 30-ish documents), so a general direction or examples of modules for tackling the problem would be highly appreciated.
Thank you all in advance!
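Since spaCy already recognises the entities, one possible next step (a hedged sketch, not a full solution) is to attach a few hand-picked role keywords to the nearest recognised entity. This assumes the en_core_web_sm model is installed and that it actually tags these spans:

import spacy

nlp = spacy.load("en_core_web_sm")
text = ('In this motion, Company A Holdings (The "Company"), was sponsored '
        'by Company B Limited. John Doe, the chairman of the company, has '
        'approved of this activity')
doc = nlp(text)

for ent in doc.ents:
    print(ent.text, ent.label_)  # inspect the ORG and PERSON spans found

# Naive role assignment: an entity preceded by "sponsored by" is the sponsor;
# a PERSON followed closely by "chairman" is the chairman.
for ent in doc.ents:
    before = doc[max(ent.start - 4, 0):ent.start].text.lower()
    after = doc[ent.end:ent.end + 5].text.lower()
    if "sponsored by" in before:
        print("Sponsor:", ent.text)
    if ent.label_ == "PERSON" and "chairman" in after:
        print("Chairman:", ent.text)

With only ~30 documents, pattern rules over pretrained NER output like this are usually more practical than training a new model.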

Identifying the context of a word in a sentence

I created a classifier to classify nouns, adjectives and named entities in a given sentence. I used a large Wikipedia dataset for classification.
For example:
Where Abraham Lincoln was born?
So the classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all three categories)
born - time
When Titanic was released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact context of a word?
Please note:
Word sense disambiguation would not help here, because there might not be a nearby word in the sentence which can help.
The Lesk algorithm with WordNet or synsets also does not help, because for a word like "bank" the Lesk algorithm will behave like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
Here, for the word "bank", it suggested a financial institution and sloping land. In my case I am already getting such predictions, e.g. that Titanic could be a movie or a game.
I want to know: is there any approach, apart from the Lesk algorithm, baseline algorithms and traditional word sense disambiguation, which can help me identify which class is correct for a particular keyword?
Thanks for using the pywsd examples. With regards to WSD, there are many other variants and I'm coding them myself in my free time. So if you want to see it improve, do join me in coding the open-source tool =)
Meanwhile, you will find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/), where tokens/segments of text are assigned an entity and the task is to link them, or to solve a simplified question-and-answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an NP-complete AI system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the field is out there; but if you're looking for tools, wikification tools are most probably the closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
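As a toy illustration of the gloss-overlap idea that wikification and entity-linking systems build on: score each candidate sense by word overlap between the query and a short description of the sense. The descriptions below are invented; a real system would pull them from Wikipedia abstracts or a knowledge base.

import re

candidates = {
    "Movie":   "American epic romance disaster film directed by James Cameron",
    "Song":    "song recorded by various artists and released on several albums",
    "Vehicle": "British passenger liner ship that sank in the North Atlantic",
    "Game":    "video game based on the famous ship disaster",
}

def words(s):
    return set(re.findall(r"[a-z]+", s.lower()))

def best_sense(query, candidates):
    # naive overlap score; ties break by dict order
    return max(candidates, key=lambda c: len(words(query) & words(candidates[c])))

print(best_sense("When Titanic was released?", candidates))
# -> Song: "released" only overlaps the Song gloss in this toy inventory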

How to extract information from these sentences

I got a list of sentences, shown below, that I extracted from job descriptions. I want to extract information like degree type, major, and whether the degree is required or preferred.
The result should look like:
{
    degree: Bachelor,
    major: Computer Science,
    required: True
}
There are no obvious rules in these sentences. How can I achieve this goal?
Bachelor ’ s degree in Computer Science or equivalent
Pursuing B.S. or advanced degree in computer science or related technical/engineering degree .
Bachelor 's Degree in Computer Science or equivalent experience
Youre educated ( BS/MS in Computer Science or other technical degree ) .
•BS in Computer Science , Digital Media or similar technical degree with 3 + years of experience
· Bachelors degree .
Bachelor 's degree in computer science , design or related field
Ability to absorb , master and leverage emerging technologies
BA/BS degree or equivalent practical experience
Education Required : Bachelors Degree
• Bachelor 's degree in related field , OR four ( 4 ) years of experience in a directly related field .
Since you are dealing with unstructured data, I hope that using the following steps you can reach a decent accuracy level.
Create lookup tables of all the keywords that may occur for each of your required variables, like degree, education, etc. You will need to mine various online sources to grab these keywords.
Split your data into sentences or lines and iterate over the list.
While iterating, look for the keywords in your lookup tables to find the useful lines.
Create hierarchical rules to accurately extract the variables, and append the results.
Overview of hierarchical rules:
For example, a degree name will be completely alphabetic.
Experience will be alphanumeric.
Terms like "pursuing" point towards the variable Major.
Try to refine these rules on each iteration of the code, and keep adding new rules.
This is just the basic approach; I believe that if you do a few iterations over your methodology, you will be able to extract the information.
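A rough sketch of this lookup-table-plus-rules pipeline; the keyword tables are tiny stand-ins for lists you would mine from online sources, and the rules are deliberately simplistic:

import re

DEGREES = {"bachelor": "Bachelor", "bachelors": "Bachelor", "bs": "Bachelor",
           "b.s.": "Bachelor", "ba": "Bachelor", "ms": "Master", "master": "Master"}
MAJORS = ["computer science", "digital media", "design", "engineering"]

def extract(line):
    result = {"degree": None, "major": None, "required": True}
    lower = line.lower()
    for token in re.findall(r"[a-z.]+", lower):  # degree names are alphabetic
        if token in DEGREES:
            result["degree"] = DEGREES[token]
            break
    for major in MAJORS:  # phrase lookup for majors
        if major in lower:
            result["major"] = major.title()
            break
    if "preferred" in lower:  # rule for required vs preferred
        result["required"] = False
    return result

print(extract("Bachelor 's degree in Computer Science or equivalent"))
# {'degree': 'Bachelor', 'major': 'Computer Science', 'required': True}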
You probably need to gather a list of majors and degrees (for example: http://en.wikipedia.org/wiki/List_of_tagged_degrees) to extract the degree and major. Then, based on some general rules (or by designing a classifier), decide on "required" or "not required".
Another way to do this would be:
First: clean up the data - remove all punctuation, stop words, unwanted symbols, etc.
Second: make a list of the keywords you are interested in.
Third: split your data into words (word_tokenize in nltk).
Fourth: make a dictionary of the values you are looking for.
Fifth: look words up in the dictionary as you read the word list, matching them against your keyword list, and then append the matches to a new output dictionary.
Hope this helps.
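A short sketch of those five steps with NLTK (this assumes the punkt and stopwords data have already been downloaded via nltk.download, and the keyword dictionary is invented):

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

keywords = {"bachelor": "degree", "bachelors": "degree", "bs": "degree",
            "science": "major-word", "required": "required"}

line = "Education Required : Bachelors Degree"
stops = set(stopwords.words("english"))

# clean up and tokenize (first and third steps)
tokens = [w.lower() for w in word_tokenize(line)
          if w.isalnum() and w.lower() not in stops]
# dictionary lookup (fourth and fifth steps)
output = {w: keywords[w] for w in tokens if w in keywords}
print(output)  # {'required': 'required', 'bachelors': 'degree'}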

How do you extract various meanings of a certain word

Given "violence" as input would it be possible to come up with how violence construed by a person (e.g. physical violence, a book, an album, a musical group ..) as mentioned below in Ref #1.
Assuming if the user meant an Album, what would be the best way to look for violence as an album from a set of tweets.
Is there a way to infer this via any of the NLP API(s) say OpenNLP.
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little too hard in general for this problem.
Instead we might use:
Resources like WordNet, or a semantic dictionary.
For languages other than English you can look at the EuroWordNet (non-free) dataset.
To get more meanings (i.e. for the album sense), process some well-managed resource like Wikipedia. Wikipedia has a lot of meta-information that would be very useful for this kind of processing.
The reliability of the process is achieved by combining the maximum number of data sources and processing them correctly, with specialized programs.
As a last resort you may try hand processing/annotating. Long and costly, but useful in an enterprise context where you need only a small part of a language.
No free lunch here.
If you're working on English NLP in Python, then you can try the WordNet API like this:
from nltk.corpus import wordnet as wn

query = 'violence'
for ss in wn.synsets(query):
    # zero-padded offset plus POS gives a unique id for each sense
    print(query, str(ss.offset()).zfill(8) + '-' + ss.pos(), ss.definition())
If you're working on other human languages, maybe you can take a look at the open wordnets available from http://casta-net.jp/~kuribayashi/multi/
NOTE: the reason for str(ss.offset()).zfill(8)+'-'+ss.pos() is that it is used as the unique id for each sense of a specific word, and this id is consistent across the open wordnets for every language. The first 8 digits give the id and the character after the dash is the part of speech of the sense.
Check this out: the Twitter Filtering Demo from Idilia. It does exactly what you want: it first analyzes a piece of text to discover the meaning of its words and then filters the texts that contain the sense you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all the contexts "violence" occurs in (a context can be a whole document, or a window of, say, 50 words), convert them into features (using, say, bag of words), and then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical contexts.
Then you need to see which cluster the "violence" in the query belongs to, either based on other words in the query which act as context, or by asking explicitly (Do you mean violence as in "..." or as in "..."?).
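A compact sketch of this idea with scikit-learn, assuming contexts holds the text windows around each occurrence of "violence" (the examples are invented):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contexts = [
    "the film depicts graphic violence and war scenes",
    "violence erupted in the streets after the match",
    "the band Violence released their second album in 1988",
    "listening to the album Violence by the metal group",
]

# bag-of-words style features, then unsupervised clustering
vectors = TfidfVectorizer(stop_words="english").fit_transform(contexts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, ctx in zip(labels, contexts):
    print(label, ctx)  # occurrences grouped by usage; name the clusters by hand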
This will be incredibly difficult, because the proper-noun uses of the word "Violence" will be extremely infrequent as a proportion of all words, and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of named entity disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource, as Mr K suggested, is probably your best bet.

Accurate algorithm for normalizing taxonomy terms?

I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Here comes the problem. Some merchants put "Microsoft" in the "Brands" column of the data feed, others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants and, worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem where we manually map the imported brands to the "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database, so this is doable. The problem comes with bigger things, like "Authors". When we import books into the system, there are over 800,000 authors, we have the same problem, and this is not doable by hand-mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc.
Does anybody know a good way to solve this problem automatically with an acceptable degree of accuracy (85%-95%)?
Thank you for the help!
An idea that comes to my mind, although it's just a loose thought:
Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ','; its common usage, though, is to put the surname before the forename, so just swapping positions should work ("A, TM" would become "TM, A"; get rid of the comma: TMA).
Filter the authors in the database by those initials.
For each initial, if you have the whole name (Tom, Apostol), check whether it matches; otherwise (M.), consider it a match automatically.
If you want some tolerance, you can compare names with the Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
Names that match you treat as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one with not just an initial (M.) but a whole name (Mike); if you can't find one, use the initial. Therefore, each of the examples you gave would be converted to the same value, which would be the full name (Tom Mike Apostol).
Things worth thinking about:
Include mappings for name synonyms (most likely at most a few hundred records, like Thomas <-> Tom).
For this approach it is crucial to have valid initials (no M instead of N, etc.).
edit: I coded such a thing some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should maybe consider in the solution, allowing the algorithm to ignore a second name, although in your situation it would be rather rare to omit someone's second name, I guess).
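A loose sketch of this initials-plus-tolerance idea, using difflib from the Python standard library in place of a dedicated Levenshtein implementation:

import difflib
import re

def parts(name):
    # "Apostol, Tom M." -> ["Tom", "M", "Apostol"]; '-' is treated as a space
    name = name.replace("-", " ")
    if "," in name:
        last, first = name.split(",", 1)
        name = first + " " + last
    return [p.strip(".") for p in re.split(r"\s+", name.strip()) if p.strip(".")]

def initials(name):
    return "".join(p[0].upper() for p in parts(name))

def same_author(a, b, tolerance=0.8):
    if initials(a) != initials(b):
        return False
    for x, y in zip(parts(a), parts(b)):
        if len(x) > 1 and len(y) > 1:  # two full name parts must fuzzily match
            if difflib.SequenceMatcher(None, x.lower(), y.lower()).ratio() < tolerance:
                return False
    return True  # a lone initial (e.g. "M.") matches automatically

print(same_author("Tom Mike Apostol", "Apostol, Tom M."))  # True
print(same_author("Tom M. Apostol", "Tim M. Apostol"))     # False: Tom vs Tim fails the tolerance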
