Natural language de-identification

I am looking for a natural language tool that can automatically de-identify English text. For example, every email address should be renamed or obscured. Proper names should also be de-identified, as should addresses and the like.
There is a MITRE Identification Scrubber Toolkit. I don't know how well it works.
My questions:
Are there any other tools out there?
Does anyone have experience with the MITRE tool? How well does it work?
Thanks.

De-identification (perhaps more often referred to as anonymization) is a very active research area as its success is obviously a requirement for the use of authentic text corpora in such fields as NLP for healthcare, medicine and the like. I recommend that you look at the tools listed in the answer to this question on CrossValidated. If you follow the links further, you will find research papers describing how these tools work with further references and results evaluations.
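For a quick experiment while you evaluate those tools, a crude scrubber can be assembled from an email regex plus an off-the-shelf named-entity chunker. Here is a minimal sketch using NLTK (my choice of library; the placeholder tags and entity list are illustrative, not a real specification):

```python
import re
import nltk

# A minimal sketch, assuming NLTK's stock models are installed:
#   nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
#   nltk.download("maxent_ne_chunker"); nltk.download("words")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")

def scrub(text):
    text = EMAIL_RE.sub("[EMAIL]", text)
    tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(text)))
    tokens = []
    for node in tree:
        if isinstance(node, nltk.Tree):  # a recognized named entity
            if node.label() in ("PERSON", "GPE", "LOCATION", "ORGANIZATION"):
                tokens.append("[%s]" % node.label())
            else:
                tokens.extend(word for word, _ in node.leaves())
        else:
            tokens.append(node[0])
    return " ".join(tokens)  # note: original spacing/punctuation layout is lost

print(scrub("Contact John Smith at john.smith@example.com in Boston."))
```

A real de-identifier needs far more care than this (recall matters much more than precision when privacy is at stake), which is exactly why the dedicated tools in the linked answer are worth studying.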

Related

Simple toolkits for emotion (sentiment) analysis (not using machine learning)

I am looking for a tool that can analyze the emotion of short texts. I searched for a week and I couldn't find a good one that is publicly available. The ideal tool is one that takes a short text as input and guesses the emotion. It is preferably a standalone application or library.
I don't need tools that must be trained on texts, and although similar questions have been asked before, no satisfactory answers were given.
I searched the Internet and read some papers, but I couldn't find a good tool that does what I want. Currently I have found SentiStrength, but its accuracy is not good. I am using emotional dictionaries right now. I feel that some syntax parsing may be necessary, but it's too complex for me to build myself. Furthermore, it has already been researched by others, and I don't want to reinvent the wheel. Does anyone know of such publicly/research-available software? I need a tool that doesn't require training before use.
Thanks in advance.
I think that you will not find a more accurate program than SentiStrength (or SoCal) for this task - other than machine learning methods in a specific narrow domain. If you have a lot (>1000) of hand-coded data for a specific domain then you might like to try a generic machine learning approach based on your data. If not, then I would stop looking for anything better ;)
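To make that "generic machine learning approach" concrete: with enough hand-labelled examples, a bag-of-n-grams classifier is straightforward to set up. A minimal sketch with scikit-learn (my library choice; the toy training set stands in for the >1000 hand-coded examples you would actually need):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy labelled data; replace with your own hand-coded domain examples.
texts = [
    "I love this, it made my day",
    "Absolutely wonderful experience",
    "This is terrible and upsetting",
    "I hate how this turned out",
]
labels = ["positive", "positive", "negative", "negative"]

# n-grams up to length 2 as features, linear SVM as the classifier.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LinearSVC())
model.fit(texts, labels)
print(model.predict(["what a wonderful day"]))  # ['positive']
```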
Identifying entities and extracting precise information, let alone sentiment, is a very challenging problem, especially with short texts, because of the lack of context. However, there are a few unsupervised approaches to extracting sentiment from texts, mainly proposed by Turney (2002). Look at that, and maybe you can adapt the method of extracting sentiment based on adjectives in the short text for your use case. It is, however, important to note that this might require you to POS-tag your short text efficiently.
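As a very rough, training-free illustration of the adjective-based idea (note: Turney's actual method estimates each phrase's polarity from search-engine co-occurrence statistics, which this sketch replaces with a tiny hand-made lexicon):

```python
import nltk  # requires nltk.download("punkt") and nltk.download("averaged_perceptron_tagger")

# Hypothetical mini-lexicon; a real emotional dictionary is far larger.
POSITIVE = {"good", "great", "excellent", "happy", "wonderful"}
NEGATIVE = {"bad", "poor", "terrible", "sad", "awful"}

def adjective_score(text):
    """Score a short text by the polarity of its adjectives (JJ* tags)."""
    tagged = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    adjectives = [word for word, tag in tagged if tag.startswith("JJ")]
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in adjectives)

print(adjective_score("The acting was great but the plot felt terrible."))  # 0
```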
Maybe EmoLib could be of help.

NLP: Language Analysis Techniques and Algorithms

Situation:
I wish to perform a Deep-level Analysis of a given text, which would mean:
1. Ability to extract keywords and assign importance levels based on contextual usage.
2. Ability to draw conclusions about the mood expressed.
3. Ability to hint at the education level (Microsoft Word does this a little, but I want something more automated).
4. Ability to mix and match phrases and find certain communication patterns.
5. Ability to draw substantial meaning out of the text, so that it can be quantified and processed for answering by a machine.
Question:
What kind of algorithms and techniques need to be employed for this?
Is there a software that can help me in doing this?
When you figure out how to do this, please contact DARPA, the CIA, the FBI, and every other U.S. intelligence agency. Contracts for projects like these are the subject of current research worth many millions in grants. ;)
That being said, you'll need to process the text in layers and analyze it at each layer. For items 2 and 3, training an SVM on word n-grams (try n = 3) will help. For 1 and 4 you'll want deeper analysis: use a tool like NLTK, or one of the many other parsers, to find the subject words in sentences and their related words, and also use WordNet (from Princeton) to find the most common senses used and take those as keywords.
Item 5 is extremely challenging. I think intelligent use of the data above can give you what you want, but you'll need to use all your grammatical and programming knowledge, and it will still be very coarse-grained.
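A small sketch of the NLTK-plus-WordNet part of this advice, picking out nouns and their first-listed senses (the function name and example sentence are just illustrations):

```python
import nltk
from nltk.corpus import wordnet as wn
# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"),
# nltk.download("wordnet")

def noun_keywords(sentence):
    """Pick out nouns and look up their most common WordNet sense."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    keywords = {}
    for word, tag in tagged:
        if tag.startswith("NN"):
            synsets = wn.synsets(word, pos=wn.NOUN)
            if synsets:
                # WordNet lists synsets roughly by frequency of use,
                # so the first entry is the "most common sense".
                keywords[word] = synsets[0].definition()
    return keywords

print(noun_keywords("The bank approved the loan for the new bridge."))
```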
It sounds like you might be open to some experimentation, in which case a toolkit approach might be best. If so, look at the NLTK Natural Language Toolkit for Python. It's open source under the Apache license, and there are a couple of excellent books about it (including one from O'Reilly that is also released online under a Creative Commons license).

Programmatic parsing and understanding of language (English)

I am looking for some resources pertaining to the parsing and understanding of English (or just human language in general). While this is obviously a fairly complicated and wide field of study, I was wondering if anyone had any book or internet recommendations for study of the subject. I am aware of the basics, such as searching for copulas to draw word relationships, but anything you guys recommend I will be sure to thoroughly read.
Thanks.
Check out WordNet.
You probably want a book like "Representation and Inference for Natural Language - A First Course in Computational Semantics"
http://homepages.inf.ed.ac.uk/jbos/comsem/book1.html
Another way is to look at existing tools that already do the job, built on the basis of research papers: http://nlp.stanford.edu/index.shtml
I've used this tool once, and it's very nice. There's even an online version that parses English and draws dependency trees and so on.
So you can start by taking a look at their papers or the code itself.
Anyway, take into consideration that in any field, what you get from such generic tools is almost never exactly what you want, in the sense that the semantics these tools assign is not what you would expect. In most cases, given a specific constrained domain, it's preferable to roll your own parser and do your best to avoid ambiguities beforehand.
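If you want to try the Stanford tools programmatically, NLTK ships a client for the CoreNLP server. A minimal sketch, assuming a CoreNLP server is already running locally on port 9000:

```python
from nltk.parse.corenlp import CoreNLPDependencyParser

# Assumes a Stanford CoreNLP server started separately, e.g. with:
#   java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
parser = CoreNLPDependencyParser(url="http://localhost:9000")

parse, = parser.raw_parse("The quick brown fox jumps over the lazy dog.")
print(parse.to_conll(4))  # columns: word, POS tag, head index, dependency relation
```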
The process that you describe is called natural language understanding. There are various algorithms and software tools that have been developed for this purpose.

(human) Language of a document

Is there a way (a program, a library) to determine, at least approximately, which language a document is written in?
I have a bunch of text documents (~500K) in mixed languages that I want to import into an i18n-enabled CMS (Drupal).
I don't need perfect matches, only some guess.
There is a pretty easy way to do this, given that you have corpus data in all the different languages you'll need to identify. It's called n-gram modeling. I think Lingua::Identify does this already, though, so that is your best bet rather than implementing your own.
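A toy version of the n-gram idea, in case you want to see the mechanics before reaching for Lingua::Identify (the two-sentence "corpora" here are stand-ins for real training text):

```python
from collections import Counter

def char_trigrams(text):
    """Character trigram counts, padded with spaces at the edges."""
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

# Tiny hypothetical profiles; real ones are built from large samples.
PROFILES = {
    "english": char_trigrams("the quick brown fox jumps over the lazy dog and then some"),
    "spanish": char_trigrams("el rapido zorro marron salta sobre el perro perezoso y algo mas"),
}

def guess_language(text):
    grams = char_trigrams(text)
    # Score by summed counts of shared trigrams: crude, but shows the idea.
    return max(PROFILES, key=lambda lang: sum((grams & PROFILES[lang]).values()))

print(guess_language("the dog jumps"))  # english
```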
I'd say your best bet is to look for key words (articles, that kind of thing) that are unique to the languages you're looking for. "Un" will show up in both Spanish and French, but "une" is identifiably French, whereas "unos" is identifiably Spanish. Diacritics are useful too: you'll see "ñ" in Spanish and possibly Portuguese, "ç" in French and a few others... that kind of thing.
edit - Paul's solution is probably the best; looks like it uses methods like what I outlined, plus a few extra.
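A toy version of that marker-word heuristic (the word lists are tiny illustrations, not complete):

```python
# Count language-unique function words; the language with the most hits wins.
MARKERS = {
    "french": {"une", "des", "est", "avec", "mais"},
    "spanish": {"unos", "los", "es", "con", "pero"},
}

def guess_by_markers(text):
    words = set(text.lower().split())
    return max(MARKERS, key=lambda lang: len(words & MARKERS[lang]))

print(guess_by_markers("une grande maison avec un jardin"))  # french
```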
By running a Google search for "determine language of document" I found many different sites that will help you. The third link on the first page eventually led me to a function in the Google Code API that is exactly what you need.
Google Translation API is cool, and has a REST interface. But I would need to send it a LOT of BIG documents (yes, I could use an excerpt), and even if Google is Google, I don't think that's fair.
The documents are also not mine, and I'd have to ask my client whether it's OK to send them to a third party (even if, sooner or later, G will get them ;)).
I think I'll go down the Perl path...
There seems to be a Perl module for this: Lingua::Identify
Paul.

NLP: Building (small) corpora, or "Where to get lots of not-too-specialized English-language text files?"

Does anyone have a suggestion for where to find archives or collections of everyday English text for use in a small corpus? I have been using Gutenberg Project books for a working prototype, and would like to incorporate more contemporary language. A recent answer here pointed indirectly to a great archive of usenet movie reviews, which hadn't occurred to me, and is very good. For this particular program technical usenet archives or programming mailing lists would tilt the results and be hard to analyze, but any kind of general blog text, or chat transcripts, or anything that may have been useful to others, would be very helpful. Also, a partial or downloadable research corpus that isn't too marked-up, or some heuristic for finding an appropriate subset of wikipedia articles, or any other idea, is very appreciated.
(BTW, I am being a good citizen w/r/t downloading, using a deliberately slow script that is not demanding on servers hosting such material, in case you perceive a moral hazard in pointing me to something enormous.)
UPDATE: User S0rin points out that wikipedia requests no crawling and provides this export tool instead. Project Gutenberg has a policy specified here, bottom line, try not to crawl, but if you need to: "Configure your robot to wait at least 2 seconds between requests."
UPDATE 2: The Wikipedia dumps are the way to go, thanks to the answerers who pointed them out. I ended up using the English version from here: http://download.wikimedia.org/enwiki/20090306/ , and a Spanish dump about half the size. They are some work to clean up, but well worth it, and they contain a lot of useful data in the links.
- Use the Wikipedia dumps (needs lots of cleanup; a rough sketch follows below)
- See if anything in nltk-data helps you (the corpora are usually quite small)
- The Wacky people have some free corpora (tagged); you can spider your own corpus using their toolkit
- Europarl is free and is the basis of pretty much every academic MT system (spoken language, translated)
- The Reuters Corpora are free of charge, but only available on CD
You can always get your own, but be warned: HTML pages often need heavy cleanup, so restrict yourself to RSS feeds.
If you do this commercially, the LDC might be a viable alternative.
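For the Wikipedia dumps recommended above, the cleanup can start with something as blunt as this sketch (the file name is a placeholder, and real dumps need much more careful handling of templates, tables, and references):

```python
import bz2
import re

TEXT_RE = re.compile(r"<text[^>]*>(.*?)</text>", re.DOTALL)
MARKUP_RES = [
    (re.compile(r"\{\{[^{}]*\}\}"), ""),                      # templates (one level)
    (re.compile(r"\[\[(?:[^|\]]*\|)?([^\]]*)\]\]"), r"\1"),   # wiki links -> label
    (re.compile(r"'{2,}"), ""),                               # bold/italic quote runs
    (re.compile(r"<[^>]+>"), ""),                             # leftover XML/HTML tags
]

def extract_plain_text(dump_path):
    with bz2.open(dump_path, "rt", encoding="utf-8") as f:
        raw = f.read()  # fine for a slice; stream page by page for a full dump
    for page in TEXT_RE.findall(raw):
        for pattern, repl in MARKUP_RES:
            page = pattern.sub(repl, page)
        yield page

for article in extract_plain_text("enwiki-pages-articles.xml.bz2"):
    print(article[:200])
    break
```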
Wikipedia sounds like the way to go. There is an experimental Wikipedia API that might be of use, but I have no clue how it works. So far I've only scraped Wikipedia with custom spiders or even wget.
Then you could search for pages that offer their full article text in RSS feeds. RSS, because no HTML tags get in your way.
Scraping mailing lists and/or Usenet has several disadvantages: you'll be getting AOLbonics and Techspeak, and that will tilt your corpus badly.
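If you go the RSS route suggested above, the third-party feedparser library makes collection trivial (the feed URL is a placeholder):

```python
import feedparser  # third-party: pip install feedparser

# Pull entry text from a full-text RSS feed; fall back to the summary
# when the feed doesn't include full content.
feed = feedparser.parse("http://example.com/full-text-feed.rss")
for entry in feed.entries:
    content = entry.get("content", [{}])[0].get("value", entry.get("summary", ""))
    print(entry.get("title", ""), len(content))
```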
The classical corpora are the Penn Treebank and the British National Corpus, but they are paid for. You can read the Corpora list archives, or even ask them about it. Perhaps you will find useful data using the Web as Corpus tools.
I actually have a small project under construction that allows linguistic processing of arbitrary web pages. It should be ready for use within the next few weeks, but so far it's not really meant to be a scraper. I could write a module for it, I guess; the functionality is already there.
If you're willing to pay money, you should check out the data available at the Linguistic Data Consortium, such as the Penn Treebank.
Wikipedia seems to be the best way. Yes you'd have to parse the output. But thanks to wikipedia's categories you could easily get different types of articles and words. e.g. by parsing all the science categories you could get lots of science words. Details about places would be skewed towards geographic names, etc.
You've covered the obvious ones. The only other areas I can think of to supplement:
1) News articles / blogs.
2) Magazines are posting a lot of free material online, and you can get a good cross section of topics.
Looking into the Wikipedia data, I noticed that they had done some analysis on bodies of TV and movie scripts. I thought that might be interesting text but not readily accessible; it turns out it is everywhere, and it is structured and predictable enough that it should be possible to clean it up. This site, helpfully titled "A bunch of movie scripts and screenplays in one location on the 'net", would probably be useful to anyone who stumbles on this thread with a similar question.
You can get quotations content (in limited form) here:
http://quotationsbook.com/services/
This content also happens to be on Freebase.
