Normalize using SNOMED-CT - health-monitoring

I wanted to understand the purpose of using SNOMED-CT for normalization of clinical terms.
Let's say I have a criteria/statement like
Gender is Male
My question is if SNOMED-CT is used for normalizing both
Gender and Male OR just one of them like
Sex is M OR
Gender is M

I'm not sure I quite follow the question, but this might help. SNOMED CT can represent the same information in multiple ways. For example, a left-sided hip scan can be represented using a single concept (426100003 | Ultrasound scan of left hip |) or by attaching a laterality of left to the concept for ultrasound of hip (the actual expression is a little complex here; I can post it if you need).
However, some operations, e.g. subsumption tests, need the form to be consistent. Thus there are standardised forms and standard algorithms to get to them; I nearly always use the Long Normal Form.
So in short, the normal form of an expression is a standard representation of that expression, to which other representations can be transformed.
More information can be found if you search "Normal form" on the technical reference guide:

Both. It includes terms for the abstract concept of "Gender", the notion of a "Finding of biological sex", and the concept of a specific finding like "Male":
However, please note that the concept of Gender is different from Sex.

Supporting the answer above but from a different perspective
Normalization using SNOMED CT allows computers to
- Define a single set of representations (i.e. you don't have to map from M or F) that can be used for information exchange and understood in all healthcare settings irrespective of the geographic or healthcare domain.
- These representations are used as rules for queries in clinical decision support (for example). Where these rules are developed by a professional body (such as e.g. pharmacists) the rules can be shared irrespective of your legacy system and used consistently across all products. At least that is the intention.
This supports safe clinical practice.
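As a rough illustration of the "single set of representations" point, the sketch below maps a few legacy local values onto one coded concept each. The mapping table and the `normalize_sex` helper are hypothetical, and the concept identifiers shown are assumptions that should be checked against your own SNOMED CT release before use.

```python
# Hypothetical local-value -> SNOMED CT concept table.
# The IDs are assumptions for illustration; verify them against your release.
LOCAL_TO_SNOMED = {
    "m": ("248153007", "Male"),
    "male": ("248153007", "Male"),
    "f": ("248152002", "Female"),
    "female": ("248152002", "Female"),
}


def normalize_sex(raw_value):
    """Return (concept_id, term) for a legacy value, or None if it is unmapped."""
    return LOCAL_TO_SNOMED.get(raw_value.strip().lower())


for raw in ["M", "Male", " male ", "F", "unknown"]:
    print(repr(raw), "->", normalize_sex(raw))
```

Once every source system stores the same identifier, a query or decision-support rule only has to test for that one code rather than for every local spelling.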


How to determine if a piece of text mentions a product

I'm new to natural language processing, so I apologize if my question is unclear. I have read a book or two on the subject and done general research on various libraries to figure out how I should be doing this, but I'm not yet confident that I know what to do.
I'm playing with an idea for an application, and part of it is trying to find product mentions in unstructured text (e.g. tweets, Facebook posts, emails, websites, etc.) in real time. I won't go into what the products are, but it can be assumed that they are known (stored in a file or database). Some examples:
"starting tomorrow, we have 5 boxes of #hersheys snickers available for $5 each - limit 1 pp" (snickers is the product from the hershey company [mentioned as "#hersheys"])
"Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri." (coca-cola is the product [aliased as "coke"] from coca-cola company and Pepsi is the product from the PepsiCo company)
"#OMG, i just bought my dream car. a mustang!!!!" (mustang is the product from Ford)
So basically, given a piece of text, query the text to see if it mentions a product and receive some indication (boolean or confidence number) that it does mention the product.
Some concerns I have are:
Missing products because of misspellings. I thought maybe I could use a string similarity check to catch these.
Product names that are also English words or things would get caught, like mustang the horse versus mustang the car.
Needing to keep a list of alternative names for products (e.g. "coke" for "coca-cola", etc.)
I don't really know where to start with this, but any help would be appreciated. I've already looked at NLTK and SciKit and didn't really glean how to do this from there. If you know of examples or papers that explain this, links would be helpful. I'm not tied to any language at this point: Java preferably, but Python and Scala are acceptable.
The answer that you chose is not really answering your question.
The best approach you can take is to use a Named Entity Recognizer (NER) and a POS tagger (grab NNP/NNPS, i.e. proper nouns). The database there might be missing some new brands like Lyft (Uber's rival), but without developing your own prop database, the Stanford tagger will solve half of your immediate needs.
If you have time, I would build a dictionary that has every brand name and simply extract matches from the tweet strings.
If you know how to crawl, it's not a hard problem to solve.
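A minimal sketch of the dictionary idea, assuming a small hand-built catalogue. The `PRODUCT_ALIASES` table and the `find_products` helper are hypothetical names used only for illustration; a real catalogue would come from your own file or database.

```python
import re

# Hypothetical catalogue: canonical product name -> known aliases (lowercase).
PRODUCT_ALIASES = {
    "snickers": {"snickers"},
    "coca-cola": {"coca-cola", "coke"},
    "pepsi": {"pepsi"},
    "mustang": {"mustang"},
}

# Invert the catalogue so each alias points back to its canonical product.
ALIAS_TO_PRODUCT = {
    alias: product
    for product, aliases in PRODUCT_ALIASES.items()
    for alias in aliases
}

# Words made of letters/digits, optionally joined by hyphens (keeps "coca-cola" whole).
TOKEN_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def find_products(text):
    """Return the set of canonical product names mentioned in the text."""
    tokens = TOKEN_RE.findall(text.lower())
    return {ALIAS_TO_PRODUCT[t] for t in tokens if t in ALIAS_TO_PRODUCT}


print(find_products("Big news: 12-oz. bottles of Coke and Pepsi on sale starting Fri."))
print(find_products("#OMG, i just bought my dream car. a mustang!!!!"))
```

This only does exact token lookup, so it will miss misspellings and will happily match mustang the horse; those are separate problems.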
It looks like your goal is to classify linguistic forms in a given text as references to semantic entities (which can be referred to by many different linguistic forms). You describe a number of subtasks which should be done in order to get good results, but they nevertheless are still independent tasks.
In order to deal with potential misspellings of words, you need to associate these possible misspellings to their canonical (i.e. correct) form.
Phonetic similarity: Many "misspellings" are caused by opacity in the relationship between a word's phonetic form (i.e. how it sounds) and its orthographic form (i.e. how it's spelled). Therefore, a good way to address this is to index terms phonetically so that e.g. innovashun is associated with innovation.
Form similarity: Additionally, you could do a string similarity check, but you may introduce a lot of noise into your results which you would have to address because many distinct words are in fact very similar (e.g. chic vs. chick). You could make this a bit smarter by first morphologically analyzing the word and then using a tree kernel instead.
Hand-made mappings: You can also simply make a list of common misspelling → canonical_form mappings. This would work well for "exceptions" not handled by the above methods.
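The three approaches above can be chained. The following is a sketch under stated assumptions: the vocabulary, the `MANUAL` table and the `canonicalize` helper are hypothetical, the phonetic key is classic Soundex (chosen only because it is short to write), and the string similarity comes from the standard library's `difflib`.

```python
import difflib

# Hypothetical vocabulary of canonical product names.
CANONICAL = ["snickers", "pepsi", "mustang", "coca-cola"]

# Hand-made exceptions: misspelling or alias -> canonical form.
MANUAL = {"coke": "coca-cola", "coka-cola": "coca-cola"}

SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        SOUNDEX_CODES[ch] = digit


def soundex(word):
    """Classic four-character Soundex key (first letter plus three digits)."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = SOUNDEX_CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept
    return (result + "000")[:4]


# Phonetic index: Soundex key -> canonical name.
PHONETIC_INDEX = {soundex(name): name for name in CANONICAL}


def canonicalize(token, cutoff=0.8):
    """Try exact match, hand-made map, phonetic key, then string similarity."""
    token = token.lower()
    if token in CANONICAL:
        return token
    if token in MANUAL:
        return MANUAL[token]
    key = soundex(token)
    if key in PHONETIC_INDEX:
        return PHONETIC_INDEX[key]
    close = difflib.get_close_matches(token, CANONICAL, n=1, cutoff=cutoff)
    return close[0] if close else None


for word in ["snikers", "Coke", "pesi", "horse"]:
    print(word, "->", canonicalize(word))
```

Soundex is quite coarse: it gives innovashun and innovation different keys because s and t fall into different code groups, so a richer phonetic scheme such as Metaphone is usually a better fit for the example in the text. The `cutoff` value controls how much noise the similarity fallback lets in.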
Word-sense disambiguation
Mustang the car and Mustang the horse are the same form but refer to entirely different entities (or rather classes of entities, if you want to be pedantic). In fact, we ourselves as humans can't tell which one is meant unless we also know the word's context. One widely-used way of modelling this context is distributional lexical semantics: Defining a word's semantic similarity to another as the similarity of their lexical contexts, i.e. the words preceding and succeeding them in text.
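To make the context idea concrete, here is a toy sketch that compares the words around a mention with a bag of words for each sense. The sense profiles in `SENSE_CONTEXTS` are hand-written assumptions; in a real system they would be estimated from a corpus. `guess_sense` is a hypothetical helper name.

```python
import math
import re
from collections import Counter

# Hypothetical sense profiles: words that tend to surround each sense.
SENSE_CONTEXTS = {
    "car": "bought drive engine ford dealer car road horsepower garage",
    "horse": "wild herd horse ride ranch gallop stallion prairie",
}


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


SENSE_VECTORS = {sense: bag_of_words(words) for sense, words in SENSE_CONTEXTS.items()}


def guess_sense(text, target="mustang"):
    """Return (best sense or None, all scores) for the target word in this text."""
    context = bag_of_words(text)
    context.pop(target, None)  # the target itself carries no disambiguating signal
    scores = {sense: cosine(context, vec) for sense, vec in SENSE_VECTORS.items()}
    best = max(scores, key=scores.get)
    return (best if scores[best] > 0 else None), scores


print(guess_sense("#OMG, i just bought my dream car. a mustang!!!!")[0])
print(guess_sense("A wild mustang ran with the herd across the prairie")[0])
```

When no context word overlaps with any profile the function returns `None` rather than guessing, which matches the point that the form alone is not enough to decide.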
Linguistic aliases (synonyms)
As stated above, any given semantic entity can be referred to in a number of different ways: bathroom, washroom, restroom, toilet, water closet, WC, loo, little boys'/girls' room, throne room etc. For simple meanings referring to generic entities like this, they can often be considered to be variant spellings in the same way that "common misspellings" are and can be mapped to a "canonical" form with a list. For ambiguous references such as throne room, other metrics (such as lexical-distributional methods) can also be included in order to disambiguate the meaning, so that you don't relate e.g. I'm in the throne room just now! to The throne room of the Buckingham Palace is beautiful.
You have a lot of work to do in order to get where you want to go, but it's all interesting stuff and there are already good libraries available for doing most of these tasks.

How do you extract various meanings of a certain word

Given "violence" as input, would it be possible to come up with the ways violence could be construed by a person (e.g. physical violence, a book, an album, a musical group ...), as listed below in Ref #1?
Assuming the user meant an album, what would be the best way to look for violence as an album in a set of tweets?
Is there a way to infer this via any of the NLP APIs, say OpenNLP?
Ref #1
violence/N1 - intentional harmful physical action.
violence/N2 - the property of being wild or turbulent.
Violence/N6 - a book from Neil L. Whitehead; nonfiction
Violence/N7 - an album by The Last Resort
Violence/N8 - Violence is the third album by the Washington-based Alternative metal music group Nothingface.
Violence/N9 - a musical group which produced the albums Eternal Nightmare and Nothing to Gain
Violence/N10 - a song by Aesthetic Perfection, Angel Witch, Arsenic, Beth Torbert, Brigada Flores Magon, etc on the albums A Natural Disaster, Adult Themes for Voice, I Bificus, Retribution, S.D.E., etc
Violence/N11 - an album by Bombardier, Dark Quarterer and Invisible Limits
Violence/N12 - a song by CharlElie Couture, EsprieM, Fraebbblarnir, Ian Hunter, Implant, etc on the albums All the Young Dudes, Broke, No Regrets, Power of Limits, Repercussions, etc
Violence/N18 - Violence: The Roleplaying Game of Egregious and Repulsive Bloodshed is a short, 32-page roleplaying game written by Greg Costikyan under the pseudonym "Designer X" and published by Hogshead Publishing as part of its New Style line of games.
Violence/N42 - Violence (1947) is an American drama film noir directed by Jack Bernhard.
Pure automatic inference is a little too hard in general for this problem.
Instead we might use:
Resources like WordNet, or a semantic dictionary.
For languages other than English you can look at the EuroWordNet (non-free) dataset.
To get more meanings (i.e. for the album sense) we can process some well-managed resource like Wikipedia. Wikipedia has a lot of meta-information that would be very useful for this kind of processing.
Reliability is achieved by combining as many data sources as possible and processing them correctly with specialized programs.
As a last resort you may try hand processing/annotating. Long and costly, but useful in enterprise context where you need only a small part of a language.
No free lunch here.
If you're working on English NLP in Python, then you can try the WordNet API as such:
from nltk.corpus import wordnet as wn  # needs the WordNet data: nltk.download('wordnet')
query = 'violence'
for ss in wn.synsets(query):
    # In NLTK 3, offset(), pos() and definition() are methods, not attributes.
    print(query, str(ss.offset()).zfill(8) + '-' + ss.pos(), ss.definition())
If you're working on other human languages, maybe you can take a look at the open wordnets available from
NOTE: the reason for str(ss.offset()).zfill(8) + '-' + ss.pos() is that it is used as the unique id for each sense of a specific word, and this id is consistent across the open wordnets for every language. The first 8 digits give the id and the character after the dash is the part of speech of the sense.
Check this out: Twitter Filtering Demo from Idilia. It does exactly what you want by first analyzing a piece of text to discover the meaning of its words and then filtering the texts that contain the sense that you are looking for. It's available as an API.
Disclaimer: I work for Idilia.
You can extract all contexts "violence" occurs in (context can be a whole document, or a window of say 50 words), then convert them into features (using say bag of words), then cluster these features. As clustering is unsupervised, you won't have names for the clusters, but you can label them with some typical context.
Then you need to see which cluster "violence" in the query belongs to. Either based on other words in the query which act as a context or by asking explicitly (Do you mean violence as in "...." or as in "....")
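A small sketch of that pipeline, using only the standard library. The example texts and the stop-word list are made up for illustration, and the greedy single-pass clustering is a simple stand-in for a proper algorithm such as k-means; `context_windows` and `cluster` are hypothetical helper names.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not a standard resource).
STOPWORDS = {"a", "an", "the", "to", "by", "on", "in", "of", "has", "that", "after", "against"}


def context_windows(texts, target, size=5):
    """Yield a bag of words for the tokens within `size` positions of each target occurrence."""
    for text in texts:
        tokens = [t for t in re.findall(r"[a-z']+", text.lower()) if t not in STOPWORDS]
        for i, tok in enumerate(tokens):
            if tok == target:
                yield Counter(tokens[max(0, i - size):i] + tokens[i + 1:i + 1 + size])


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(vectors, threshold=0.15):
    """Greedy clustering: join the most similar existing cluster, or start a new one."""
    clusters = []  # each cluster: {"centroid": Counter, "members": [indices]}
    for idx, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for c in clusters:
            sim = cosine(vec, c["centroid"])
            if sim >= best_sim:
                best, best_sim = c, sim
        if best is None:
            clusters.append({"centroid": Counter(vec), "members": [idx]})
        else:
            best["centroid"].update(vec)
            best["members"].append(idx)
    return clusters


texts = [
    "listening to the new album violence by nothingface on repeat",
    "that album violence has the best guitar tracks",
    "police report a rise in street violence after the match",
    "the report condemns violence against protesters in the street",
]
clusters = cluster(list(context_windows(texts, "violence")))
for c in clusters:
    # Label each cluster with its most frequent context words.
    print(c["members"], [w for w, _ in c["centroid"].most_common(2)])
```

The most frequent words in each centroid serve as the "typical context" label. A query can be assigned to a cluster by running its other words through the same `cosine` comparison.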
This will be incredibly difficult due to the fact that the proper noun uses of the word 'Violence' will be incredibly infrequent as a proportion of all words and their frequency distribution is likely highly skewed in some way. We run into these problems almost any time we want to do some form of Named Entity Disambiguation.
No tool I'm aware of will do this for you, so you will be building your own classifier. Using Wikipedia as a training resource as Mr K suggested is probably your best bet.

Word Map for Emotions

I am looking for a resource similar to WordNet. However, I want to be able to look up the positive/negative connotation of a word. For example:
bribe - negative
offer - positive
I'm curious as to whether anyone has run across any tool like this in AI/NLP research, or even in linguistics.
For the curious, the accepted answer below put me on the right track towards what I needed. Wikipedia listed several different resources. The two I would recommend (because of ease of use/free use for a small number of API calls) are AlchemyAPI and Lymbix. I decided to go with AlchemyAPI, since people affiliated with academic institutions (like myself) and non-profits can get even more API calls per day if they just email the company.
Start looking up topics on 'sentiment analysis':
There are some vocabulary compilations regarding affect, aka dictionaries of affect, such as the Affective Norms for English Words (ANEW) or the Dictionary of Affect in Language (DAL). They provide a dimensional representation of affect (valence, activation and control) that may be of use in a sentiment analysis scenario (detection of positive/negative connotation). In this sense, EmoLib works with the former, by default, but may be easily extended with a more specific lexicon to tackle particular needs (for example, EmoLib provides an additional neutral label that is more appropriate than the positive/negative tag set alone in a Text-To-Speech synthesis setting).
There is also SentiWordNet, which gives you positive, negative and objective scores for each WordNet synset.
However, you should be aware that the positive and negative connotation of a term often depends on the context in which it is used. A great introduction to this topic is the book Opinion mining and sentiment analysis by Bo Pang and Lillian Lee, which is available online for free.
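As a toy illustration of both the lexicon lookup and the context caveat, the sketch below scores text against a tiny polarity table and flips a word's polarity when it directly follows a negator. The `POLARITY` values are invented for the example; real resources such as SentiWordNet assign graded scores per sense.

```python
import re

# Hypothetical toy lexicon: +1 positive, -1 negative.
POLARITY = {"bribe": -1, "offer": 1, "great": 1, "terrible": -1}
NEGATORS = {"not", "no", "never"}


def polarity_score(text):
    """Sum word polarities, flipping the sign of a word that directly follows a negator."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = 0
    for i, tok in enumerate(tokens):
        value = POLARITY.get(tok, 0)
        if i > 0 and tokens[i - 1] in NEGATORS:
            value = -value
        score += value
    return score


for sentence in ["He took a bribe", "a great offer", "not great"]:
    print(sentence, "->", polarity_score(sentence))
```

Even this crude negation rule shows why a context-free word list is only a starting point.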

Word characteristics tags

I want to do a riddle AI chatbot for my AI class.
So I figured the input to the chatbot would be something like:
"It is blue, and it is up, but it is not the ceiling"
Translation :
<Object X>
</Object X>
(Answer : sky?)
So the input is a set of characteristics (present or absent in the object), and the output is the best-matching, most likely object.
The domain will be limited to a number of objects. I could input all the attributes myself, but I was thinking:
How could I programmatically build a database of characteristics for a word?
Is there such a database available? How could I tag a word, and how could I programmatically find all its attributes? I was thinking of crawling Wikipedia, or some forum, but I can't see that building any reliable word-tag database.
Any ideas on how I could achieve such a thing? Any ideas on some literature on the subject?
Thank you
This sounds like a basic classification problem. You're essentially asking; given N features (color=blue, location=up, etc), which of M classifications is the most likely? There are many algorithms for accomplishing this (Naive Bayes, Maximum Entropy, Support Vector Machine), but you'll have to investigate which is the most accurate and easiest to implement. The biggest challenge is typically acquiring accurate training data, but if you're willing to restrict it to a list of manually entered examples, then that should simplify your implementation.
Your example suggests that whatever algorithm you choose will have to support sparse data. In other words, if you've trained the system on 300 features, it won't require you to enter all 300 features in order to get an answer. It'll also make your training and testing files smaller, because you'll be able to omit features that are irrelevant for certain objects, e.g.
sky | color:blue,location:up
tree | has_bark:true,has_leaves:true,is_an_organism:true
cat | has_fur:true,eats_mice:true,is_an_animal:true,is_an_organism:true
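A minimal sketch of how such sparse records could answer the riddle from the question. It uses plain feature-overlap counting rather than Naive Bayes or another trained classifier, the "ceiling" row is an added assumption, and `parse_objects` and `solve_riddle` are hypothetical helper names.

```python
# Object descriptions in a "name | feature:value,..." layout.
# The ceiling row is a made-up addition so the "not the ceiling" clue has something to exclude.
RAW_OBJECTS = """
sky | color:blue,location:up
ceiling | color:white,location:up
tree | has_bark:true,has_leaves:true,is_an_organism:true
cat | has_fur:true,eats_mice:true,is_an_animal:true,is_an_organism:true
"""


def parse_objects(raw):
    """Parse the text layout into {name: {feature: value}}; absent features are simply omitted."""
    objects = {}
    for line in raw.strip().splitlines():
        name, _, feature_text = line.partition("|")
        features = dict(pair.split(":", 1) for pair in feature_text.strip().split(","))
        objects[name.strip()] = features
    return objects


def solve_riddle(objects, clues, excluded=()):
    """Rank objects by how many clues they match, skipping explicitly excluded names."""
    scores = {}
    for name, features in objects.items():
        if name in excluded:
            continue
        scores[name] = sum(1 for key, value in clues.items() if features.get(key) == value)
    best = max(scores, key=scores.get)
    return best, scores


OBJECTS = parse_objects(RAW_OBJECTS)

# "It is blue, and it is up, but it is not the ceiling"
answer, scores = solve_riddle(
    OBJECTS, {"color": "blue", "location": "up"}, excluded={"ceiling"}
)
print(answer, scores)
```

Because each object lists only the features it has, adding a new feature for one object never forces a change to the others, which is the sparse-data property described above.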
It might not be terribly helpful, since it's proprietary, but a commercial application that's similar to what you're trying to accomplish is the website, albeit the system asks the questions instead of the user. It's interesting in that it's trained "online" based on user input.
Wikipedia certainly has a lot of data, but you'll probably find extracting that data for your program will be very difficult. Cyc's data is more normalized, but its API has a huge learning curve. Another option is the semantic dictionary project Wordnet. It has reasonably intuitive APIs for nearly every programming language, as well as an extensive hypernym/hyponym model for thousands of words (e.g. cat is a type of feline/mammal/animal/organism/thing).
The Cyc project has very similar aims: I believe it contains both inference engines to perform the AI, and databases of facts about commonsense knowledge (like the colour of the sky).

Determining what a word "is" - categorizing a token

I'm writing a bridge between the user and a search engine, not a search engine. Part of my value added will be inferring the intent of a query. The intent of a tracking number, stock symbol, or address is fairly obvious. If I can categorise a query, then I can decide if the user even needs to see search results. Of course, if I cannot, then they will see search results. I am currently designing this inference engine.
I'm writing a parser; it should take any given token and assign it a category. Here are some theoretical English examples:
"denver" is a USCITY and a PLACENAME
"555 555 5555" is a USPHONENUMBER
I know that each of these cases will most likely require specific handling, however I'm not sure where to start.
Ideally I'd end up with something simple like:
queryCategory = magicCategoryFinder( query )
>print queryCategory
>"SOMECATEGORY or a list"
Natural language parsing is a complicated topic. One of the problems here is that determining what a word is depends on context and implied knowledge. Also, you're not so much interested in words as you are in groups of words. Consider: "New York City" is a place, but it's three words, two of which (new and city) have other meanings.
You also have to consider ambiguity, which is once again where context and implied knowledge come in. For example, JAVA is (or was) a stock symbol for Sun Microsystems. It's also a programming language, a place and has meaning associated with coffee. How do you classify it? You'd need to know the context in which it was used.
And if you can solve that problem reliably you can make yourself very wealthy.
What's all this in aid of anyway?
To learn about "tagging" (the term of art for what you're trying to do), I suggest playing around with NLTK's tag module. More generally, NLTK, the Natural Language ToolKit, is an excellent toolkit (based on the Python programming language) for experimentation and learning in the field of Natural Language Processing (whether it's suitable for a given production application may be a different issue, esp. if said application requires very high speed processing on large volumes of data -- but, you have to walk before you can run!-).
You're bumping up against one of the hardest problems in computer science today: determining semantics from English context. This is the classic text mining problem and gets into some very advanced topics. I would suggest thinking more about your problem to see if you can a) go without categorization or b) perhaps use structural info such as document position to give you a hint (is it a city, a place name, or undetermined?), and maybe some lookup tables to help, e.g. stock symbols are pretty easy to build a fairly complete lookup for. You might consider downloading the CIA World Factbook for a lookup of cities, etc.
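The lookup-table suggestion could be sketched roughly as follows. The tables here are tiny hypothetical stand-ins for real data sources, the phone pattern covers only a few common US layouts, and `categorize` is an illustrative name, not an existing API.

```python
import re

# Hypothetical lookup tables; in practice these would be loaded from real data.
US_CITIES = {"denver", "new york city", "boston"}
STOCK_SYMBOLS = {"java", "ibm", "ko"}

# Ten digits with optional parentheses and space, dot or dash separators.
US_PHONE_RE = re.compile(r"^\(?\d{3}\)?[ .-]?\d{3}[ .-]?\d{4}$")


def categorize(query):
    """Return every category the whole query matches; an empty list means 'show search results'."""
    normalized = " ".join(query.lower().split())
    categories = []
    if US_PHONE_RE.match(normalized):
        categories.append("USPHONENUMBER")
    if normalized in US_CITIES:
        categories.extend(["USCITY", "PLACENAME"])
    if normalized in STOCK_SYMBOLS:
        categories.append("STOCKSYMBOL")
    return categories


for q in ["denver", "555 555 5555", "New York City", "JAVA", "fruit flies"]:
    print(repr(q), "->", categorize(q))
```

Returning a list rather than a single label leaves room for ambiguous tokens: a query can match several tables, and the caller decides what to do with the overlap.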
As others have already pointed out, this is an exceptionally difficult task. The classic test is a pair of sentences: "Time flies like an arrow." "Fruit flies like a banana."
In the first sentence, "flies" is a verb. In the second, it's part of a noun. In the first, "like" is a preposition, but in the second it's a verb. The context doesn't make this particularly easy to sort out either -- there's no obvious difference between "Time" and "Fruit" (both normally nouns). Likewise, "arrow" and "banana" are both normally nouns.
It can be done -- but it really is decidedly non-trivial.
Although it might not help you much with disambiguation, you could use Cyc. It's a huge database of what things are that's intended to be used in AI applications (though I haven't heard any success stories).
