Project Background:
I'm quite new to NLP, so please forgive me if my problem seems unreasonably complicated. I am trying to extract some features, like company names, monetary values, and names of individuals, from a public company listing document, which is a large body of text (300+ pages).
Text parsed into the program looks something like this:
"In this motion, Company A Holdings (The "Company"), was sponsored by Company B Limited. John Doe, the chairman of the company, has approved of this activity"
The expected outcome looks something like this:
The Company: Company A Holdings
Sponsor: Company B Limited
Chairman: John Doe
Since all documents came in PDF form, I have parsed them into text. I performed some NER with spaCy on the document I have, and based on the NER results, it successfully recognised all the entities that I needed (i.e. it recognised Company A Holdings, Company B Limited and John Doe).
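Roughly, the spaCy pass I describe looks like this minimal sketch (en_core_web_sm is just the model I happened to use; any pretrained English pipeline works similarly):

    import spacy

    # assumes: python -m spacy download en_core_web_sm
    nlp = spacy.load("en_core_web_sm")

    text = ('In this motion, Company A Holdings (The "Company"), was sponsored '
            'by Company B Limited. John Doe, the chairman of the company, has '
            'approved of this activity')

    doc = nlp(text)
    for ent in doc.ents:
        # ORG -> companies, PERSON -> individuals, MONEY -> monetary values
        print(ent.text, ent.label_)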
How should I approach this goal? I don't have a massive number of files to train a model (currently around 30 documents), so a general direction or examples of modules for tackling the problem would be highly appreciated.
Thank you all in advance!
I'm working on a research project trying to understand the patterns and breakdowns of search usage and volumes in the United States.
Ideally, I would love a breakdown of search volume across topics like:
navigational (i.e. just want to get to a domain link)
news (if possible: split amongst events, celebs, politics, ...)
sports (if possible: dig into splits of live scores, news about an athlete or a team, ... )
finance (e.g. stock names )
anything local (e.g. food, restaurants, places)
people (e.g. bios)
anything time related (what time in nyc, sf, ...)
anything numbers related (math/calculators)
Other topics: immigration, legal, health/medicine, science/technology, food/recipes, code/math, politics, weather, images/video, etc.
Not sure if there is a dataset or good report somewhere that would give me insight into all these?
There seem to be a lot of keyword planning tools, which is somewhat helpful, and I guess I could collect data on groups of keywords related to the topics above, but for things like celebrity bios it would be quite difficult to group together all the data, because each possible well-known person is their own keyword…
Any help or direction would be appreciated! Thank you so much.
I'm new to NLP.
I want to extract music artists' names from plain text like what is posted on social media.
The text looks like this (this is just a sample, not real):
Today bandcamp is waiving fees again! CHANGE, TAYLOR SWIFT and POP SMOKE will be using all funds collected through bandcamp to donate to Anti Repression Committee. No Justice No Peace.
This time, I want to extract the strings "CHANGE", "TAYLOR SWIFT" and "POP SMOKE".
I already tried NLTK and spaCy, but they didn't work as desired.
Is there any other idea how I can achieve this?
Thanks in advance.
If you have a lot of upper-case data like in your example, you might want to pass the data through a truecaser first. There's one available in the Stanford NLP package. After that, spaCy might have a better shot at picking the names out. On this text:
Today bandcamp is waiving fees again! Change, Taylor Swift, and Pop Smoke will be using all funds collected through bandcamp to donate to Anti Repression Committee. No Justice No Peace.
en_core_web_sm will pick out Taylor Swift and Pop Smoke as entities. Change / CHANGE is going to be tough for any model to pick out.
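As a rough sketch of that pipeline, here is one way to wire it up, using the truecase Python package as a stand-in for the Stanford truecaser (the package choice is my assumption; any truecaser should do):

    import spacy
    import truecase  # pip install truecase; a stand-in for the Stanford truecaser

    raw = ("Today bandcamp is waiving fees again! CHANGE, TAYLOR SWIFT and POP "
           "SMOKE will be using all funds collected through bandcamp to donate "
           "to Anti Repression Committee. No Justice No Peace.")

    # restore sentence-style casing so the NER model sees familiar input
    text = truecase.get_true_case(raw)

    nlp = spacy.load("en_core_web_sm")
    print([(ent.text, ent.label_) for ent in nlp(text).ents])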
I created a classifier to classify the class of nouns, adjectives, and named entities in a given sentence. I used a large Wikipedia dataset for classification.
For example:
Where was Abraham Lincoln born?
So the classifier will give this sort of result (word - class):
Where - question
Abraham Lincoln - Person, Movie, Book (because the classifier finds Abraham Lincoln in all these categories)
born - time
When was Titanic released?
when - question
Titanic - Song, Movie, Vehicle, Game (Titanic is classified in all these categories)
Is there any way to identify the exact context for a word?
Please note:
Word sense disambiguation would not help here, because there might not be nearby words in the sentence that can help.
The Lesk algorithm with WordNet synsets does not help either. For example, for the word "bank" it behaves like this:
======== TESTING simple_lesk ===========
TESTING simple_lesk() ...
Context: I went to the bank to deposit my money
Sense: Synset('depository_financial_institution.n.01')
Definition: a financial institution that accepts deposits and channels the money into lending activities
TESTING simple_lesk() with POS ...
Context: The river bank was full of dead fishes
Sense: Synset('bank.n.01')
Definition: sloping land (especially the slope beside a body of water)
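For reference, the call producing output like the above is roughly this (a minimal sketch, assuming pywsd is installed):

    from pywsd.lesk import simple_lesk  # pip install pywsd

    sense = simple_lesk('I went to the bank to deposit my money', 'bank')
    print(sense, '-', sense.definition())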
Here for the word "bank" it suggested a financial institution and sloping land. But in my case I am already getting such predictions, like Titanic being a movie or a game.
I want to know: is there any other approach, apart from the Lesk algorithm, baseline algorithms, and traditional word sense disambiguation, that can help me identify which class is correct for a particular keyword?
Thanks for using the pywsd examples. With regards to WSD, there are many other variants, and I'm coding them myself in my free time. So if you want to see it improve, do join me in coding the open-source tool =)
Meanwhile, you will find the following technologies more relevant to your task:
Knowledge base population (http://www.nist.gov/tac/2014/KBP/), where tokens/segments of text are assigned an entity and the task is to link them, or to solve a simplified question-and-answer task.
Knowledge representation (http://groups.csail.mit.edu/medg/ftp/psz/k-rep.html)
Knowledge extraction (https://en.wikipedia.org/wiki/Knowledge_extraction)
The above technologies usually include several sub-tasks, such as:
Wikification (http://nlp.cs.rpi.edu/kbp/2014/elreading.html)
Entity linking
Slot filling (http://surdeanu.info/kbp2014/def.php)
Essentially you're asking for a tool that is an AI-complete system for language/text processing, so I don't really think such a tool exists as of yet. Maybe it's IBM Watson.
If you're looking for a field to look into, the field is out there; but if you're looking for tools, most probably wikification tools are closest to what you might need. (http://nlp.cs.rpi.edu/paper/WikificationProposal.pdf)
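In the meantime, if you just want a lightweight baseline to experiment with (my own sketch, not one of the tools above): score each candidate class by the vector similarity between the sentence and a short gloss for that class. The glosses and the model choice below are my own assumptions:

    import spacy

    # en_core_web_md ships word vectors; the small model does not
    nlp = spacy.load("en_core_web_md")

    # hypothetical one-line glosses for the candidate classes of "Titanic"
    glosses = {
        "Movie":   "a film with actors that is directed and shown in cinemas",
        "Song":    "a piece of music with lyrics performed by a singer",
        "Vehicle": "a ship, boat or other machine used for transport",
        "Game":    "a video game played on a computer or console",
    }

    context = nlp("When was Titanic released?")
    scores = {cls: context.similarity(nlp(g)) for cls, g in glosses.items()}
    for cls, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(cls, round(score, 3))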
Is it possible to count how many times an entity has been mentioned in an article? For example
ABC Company is one of the largest car manufacturers in the world. It is also the largest company in terms of annual production. It is also the second largest exporter of luxury cars, after XYZ company. Both ABC and XYZ together produce over n% of total car production in the country.
mentions ABC Company 4 times ("ABC Company", "It", "It", "ABC").
Yes, this is possible. It's a combination of
named-entity recognition (NER), which for English is practically a solved problem, and
coreference resolution, which is the subject of ongoing research (but give this package a try)
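A minimal sketch of the NER half with spaCy; resolving the two "It"s back to ABC Company is the coreference part and needs a separate library on top of this:

    from collections import Counter
    import spacy

    nlp = spacy.load("en_core_web_sm")

    text = ("ABC Company is one of the largest car manufacturers in the world. "
            "It is also the largest company in terms of annual production. "
            "It is also the second largest exporter of luxury cars, after XYZ "
            "company. Both ABC and XYZ together produce over n% of total car "
            "production in the country.")

    doc = nlp(text)
    # count explicit ORG mentions; note "ABC Company" vs "ABC" still need
    # alias merging, and the pronouns need coreference resolution
    print(Counter(ent.text for ent in doc.ents if ent.label_ == "ORG"))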
I'm developing a shopping comparison website, and the project is at a very advanced stage. We index 50 million products daily, using merchant feeds from various affiliate networks. Most of the problems I had are already solved, including the majority of the performance bottlenecks.
What is my problem: first of all, we are using Apache Solr with Drupal, BUT this problem IS NOT specific to Drupal or Solr; if you do not have knowledge of them, it doesn't matter.
We receive product feeds from over 2000 different merchants, and those feeds are a mess. They have no specific pattern; each merchant sends the feeds the way they want. We have already solved many problems regarding this, but one remains: normalizing the taxonomy terms for the faceted browsing functionality.
Suppose that I have a "Narrow by Brands" browsing facet on my website. Now suppose that 100 merchants offer products from Microsoft. Now comes the problem. Some merchants put in the "Brands" column of the data feed "Microsoft", others "Microsoft, Inc.", others "Microsoft Corporation", others "Products from Microsoft", etc. There is no specific pattern between merchants, and worse, some individual merchants are so sloppy that they have different strings for the same brand IN THE SAME DATA FEED.
We do not want all those different brands appearing in the navigation. We have a manual solution to the problem, where we manually map the imported brands to a "good" brands table ("Microsoft Corporation" -> "Microsoft", "Products from Microsoft" -> "Microsoft", etc.). We have something like 10,000 brands in the database, so this is doable. The problem is when it comes to bigger things, like "Authors". When we import books into the system, there are over 800,000 authors, and we have the same problem; this is not doable by hand-mapping. The problem is the same: "Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M.", etc...
Does anybody know a good way to solve this problem automatically with an acceptable degree of accuracy (85%-95%)?
Thank you for the help!
Some idea that comes to my mind, although it's just a loose thought:
Convert names to initials (in your example: TMA). Treat '-' as a space, so e.g. Antoine de Saint-Exupéry would be ADSE. The problem here is how to treat ","; although its common usage is to put the surname before the forename, so just swapping positions should work (so A,TM would become TM,A; get rid of the comma - TMA).
Filter authors in the database by those initials.
For each initial, if you have the whole name (Tom, Apostol), check if it matches; otherwise (M.), consider it a match automatically.
If you want some tolerance, you can compare names with Levenshtein distance and tolerate some differences (here you have an Oracle implementation).
Names that match you treat as the same author. To find the whole name, for each initial (T, M, A) you look up your filtered authors (after step 2) and try to find one with not just an initial (M.) but a whole name (Mike); if you can't find one, use the initial. Therefore, each of the examples you gave would be converted to the same value, the full name (Tom Mike Apostol).
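A rough Python sketch of this idea, using only the standard library (SequenceMatcher stands in for Levenshtein distance, and the 0.8 threshold is an arbitrary assumption):

    from difflib import SequenceMatcher

    def normalize(name: str) -> str:
        # move a comma-separated surname to the end and treat '-' as a space:
        # "Apostol, Tom M." -> "Tom M. Apostol"
        if "," in name:
            last, first = name.split(",", 1)
            name = f"{first.strip()} {last.strip()}"
        return name.replace("-", " ")

    def initials(name: str) -> str:
        return "".join(part[0].upper() for part in normalize(name).split())

    def same_author(a: str, b: str, threshold: float = 0.8) -> bool:
        if initials(a) != initials(b):      # step 2: filter by initials
            return False
        # step 4: tolerate small spelling differences; SequenceMatcher is a
        # stdlib stand-in for Levenshtein, 0.8 is an arbitrary threshold
        a, b = normalize(a).lower(), normalize(b).lower()
        return SequenceMatcher(None, a, b).ratio() > threshold

    names = ["Tom Mike Apostol", "Tom M. Apostol", "Apostol, Tom M."]
    print([initials(n) for n in names])   # ['TMA', 'TMA', 'TMA']
    print(same_author("Tom Mike Apostol", "Apostol, Tom M."))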
Things worth thinking about:
Include mappings for name synonyms (most likely at most a few hundred records, like Thomas <-> Tom).
For this to work, it is crucial to have valid initials (no M instead of N, etc.).
Edit: I coded such a thing some time ago, when I had to identify a person by their signature, ignoring scanning problems. People sometimes sign as Name S. Surname, or N.S., or just Name Surname (which is another thing you should maybe consider in the solution, to allow the algorithm to ignore the second name, although in your situation it would be rather rare to omit someone's second name, I guess).