Named entity recognition - tagging tools - nlp

Does someone have a recommendation of tagging tool for NER types in raw text?
The input for the tool should be a library of text files(.txt simple format) , there should be a convenient UI for selecting words and set the tag/annotation fit to selection, the output should be structural representations of the tags(e.gs tart index , last index, tag in a JSON format)

Founderof LightTag here
We provide a super convenient interface to do span annotations such as named entity recognition, classifications and relationships.
You can work as one labeler or bring in a team and LightTag will disribute work between everyone automatically (no more selecting files and remembering what you labeled already) .
You can upload your own suggestions and let labelers use those, or use LightTags built in model.
Of course you can annotate at the character level and highlight subwords or multi word phrases.

You can try https://github.com/lasigeBioTM/MER (bash)
see the demo at http://labs.fc.ul.pt/mer/

Online tools:
I guess Dataturks' POS tool should work fine for your use case, you can just upload your data and specify the labels. The UI seems convenient enough.
Here is the link:
https://dataturks.com
It's an online tool, so you can work with multiple people to get the tagging done.
The exact output format you are looking for is not supported, but the format can easily be converted to what you are looking for, the output is like: word___LABEL word2___LABEL , so a simple 2-line script can convert it to start and end index.
Offline:
Another tool you can check out is prodigy, it's a downloadable software and does similar things. Just that you might be willing to pay for it upfront.
https://prodi.gy

Related

Remove part of a string in each row of a large column of data in KNIME

I am stumbed.
I have a column with some thousand rows of unique adresses regarding universities, pharmacompanies etc. in a KNIME workflow
Example:
55 Shattuck Street Boston Massachusetts 02115 US [NAT: US RES: US] for all designated states
What I need is to clean the data, so each row look like nice and computable like this:
55 Shattuck Street Boston Massachusetts 02115 US.
My problem Is I can't seem to get the system to remove everything after US. Does anyone know a suitable approach in KNIME?
You should be able to use either String Replacer or String Manipulation for this. The first one lets you use either a simple wildcard or a full regular expression pattern while the second one uses a Java-like syntax - the choice comes down to how many different variations on the input data you need to handle and which syntax you prefer.
If you just need to remove any text between square brackets including the space before the open bracket then you can use String Replacer configured like this:
Beside the nodes which were already mentioned by nekomatic and which will work perfectly for the given scenario, there's also a user-friendly regular expression tool in the Palladian nodes extension called Regex Extractor, which allows you to build your regexes with a live preview as you might know from popular online regex testers.
For your scenario, you could e.g. set up a regex like this:
^(?<address>.*)(?:\s\[.*)
In prose, this means: Capture all characters until a space + square opening bracket and output into a column named address.
The Palladian extension is available here as a free plugin for KNIME Desktop and provides a variety of different tools for web, text, and geo data mining and classification.

Visualizing keywords from text using spaCY

I am using Textrank method for extracting keywords from text and I am able to print individual keywords along with their scores. But I am currently trying to output whole text with the keywords I extracted earlier be highlighted (encircled etc).
I'm not sure who your target audience is, but I think the simplest solution might be to programmatically generate hypertext (HTTP), for example, where the keywords are given a foreground/background color of your choice. In fact, this can see this as being quite useful.
SpaCy has visualization tools but I believe they are targetted at providing specific NLP visualizations. I don't think they offer what you want, which seems to be a canvas for present information.
Oh! If you want to hack a solution, you can try this:
Create a custom entity type in SpaCy and have SpaCy report your keywords as your new custom entity type. Then you can use the SpaCy Entity Visualizer to highlight your entities.

how to access google define feature in a batch

Suppose I have a huge set of noisy phrases. For each one of them, I want to check if it is defined by some resources by using the google define feature. Once I type "define my_phrase" to the google search box, if the retrieved results contain the definition panel (e.g. https://www.google.com/#q=define+home+cooking), I put it into my phrase pool.
I'm wondering is this possible to do this task in a batch so that I don't have to type each of the phrase manually one by one? It would be great if this could be achieved from a unix terminal but windows is also welcome!
I heard of google-app-engine but I only have a rough idea and not sure if it could help.
Thanks!
as starting point, you may try and play with the Google Custom search following API reference - Xml results
https://developers.google.com/custom-search/docs/xml_results?hl=en&csw=1#XML_Results
Be aware of:
google TOS for this service
quantity courtesy limit

Does SharePoint Search support range tags?

I am working on a project to digitize approximately 1 million images for which metadata will be added to facilitate search.
Each image is, for example, a page in a dictionary. But not text. Just a static scanned image. OCR is not an option :(
My objective is to emulate the current search procedure which consists of looking up the alphabetical entries till the correct page is found. In absence of machine readable text, I am looking at tagging each page with Dictionary range tag. For Example (Apple-Canada). So if someone searches for "Banana", it should hit the (Apple-Canada) range Tag.
Is this supported in SharePoint out of the box? If not, is there an addon product which provides this functionality or am I looking at building a customized extension?
Any help will be appreciated :)
Installing the IFilter for TIF files is done with a couple of clicks and gives you free OCR along the way. Very good for scanned pages.
On your question though: No, SharePoint does not have any kind of "range" tags or fields. The only vaguely similar thing to what you are requesting is the Thesaurus of the search. There you could define acronyms and synonyms for words and it would actually search for something else. So you could enter Banana but it would actually search for Apple. Some examples here: How to: Customize the Thesaurus in SharePoint Search and Search Server.
Other than that I can only think of a custom implemented search provider giving you the flexibility you need.

United States State shapes for Office

I want to create visuals along the lines of CNN's "red-state, blue-state" shadings of the states in the U.S. for my project. I'm planning to do something fancier than just shading the state's shape in a color. Are there open source libraries of state shapes/polygons (or - if not open source - others) that I can import into Word, Excel, etc. that I can use to show complicated graphs based on states?
I have Map Point, but haven't been able to figure out how to shade the states in a complex way.
you could try google charts, it looks like http://www.woot.com is doing something similar to what you need
Here is a good example using google maps... I've used code like that before.. perhaps from this exact example.
http://econym.org.uk/gmap/example_states2.htm
EDIT: you might want to consider converting the states.xml into JSON... it'll be smaller (136k of XML right now!) and should load faster in most browsers.
There might be a couple parts to the question you are asking, but to address the first part "Are there open source libraries of state shapes/polygons...", here's a resource to check out:
http://commons.wikimedia.org/wiki/Category:SVG_maps_of_the_United_States
It's a list of various SVG(scalable vector graphics) files which can be imported into a number of applications. Basically a giant xml representation of lines and endpoints. This can be directly converted to XAML, if you're into a more programmatic way of charting(ie, C# w/ Silverlight).
However, to address the second part regarding MS Office, Visio can import SVG files for manipulation as well. I'm unsure what type of graphs you were looking for, but I hope this can assist in some small way on your path to awesomeness ;)

Resources