Wiktionary API to retrieve word forms (or other free service)

This is a question particularly for Russian/Ukrainian languages but may be useful for other languages too.
Is there a way to retrieve word forms as raw data, for use in a mobile application for example? These forms are present on the regular wiki pages: for example, the forms of the verb 'to be', and likewise for nouns, such as the noun forms of 'apple' in Russian.
I need these forms with description of the form. What I mean is for example:
to be - infinitive; am - first person singular, present tense; are - first person plural, present tense; etc.
So far I have found that only wiktionary.org provides such information for Russian. It would be nice if someone could point me to other services/dictionaries for Russian, Ukrainian and English.

If you're interested in using Wiktionary, you can consider Wikokit, which is an interface to a parsed Wiktionary database. The English and Russian database dumps are available in their download section, but they also provide code/library (Java) for you to create your own database dump. They also provide (I think) the code/library for interfacing with the database, so you no longer have to deal with web services, since you have it running locally.
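If you would rather work against the live site than a dump, the MediaWiki API that Wiktionary exposes can at least hand you the raw wikitext of an entry; parsing the inflection templates out of it (which differ per language edition) is the part Wikokit otherwise does for you. A minimal sketch, assuming the requests package and the Russian Wiktionary endpoint:

```python
import requests

# Fetch the raw wikitext of a Wiktionary entry via the MediaWiki API,
# so the inflection table can be parsed out of it afterwards.
API_URL = "https://ru.wiktionary.org/w/api.php"

def fetch_wikitext(title):
    params = {
        "action": "parse",
        "page": title,
        "prop": "wikitext",
        "format": "json",
        "formatversion": 2,
    }
    resp = requests.get(API_URL, params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["parse"]["wikitext"]

if __name__ == "__main__":
    wikitext = fetch_wikitext("яблоко")  # "apple"
    # The declension/conjugation tables live in templates inside this
    # wikitext; turning them into (form, description) pairs is the part
    # that Wikokit's parser already handles for you.
    print(wikitext[:500])
```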

Wget to download Wikipedia text

This is what I want to do:
Given an initial URL (e.g. http://en.wikipedia.org/wiki/Lists_of_scientists), I want to visit all the links on that page (the relevant links, of course).
Each link corresponds to another page containing several other links (e.g. http://en.wikipedia.org/wiki/List_of_American_scientists). I want to visit each such link so that I can extract XML information from them.
Can this be done using wget? Someone suggested I should use Scrapy, but I am having problems installing it.
The hierarchy to crawl looks like this: List of Scientists->List of American Scientists->Bryan Hayes (And a lot more scientists).
My target is to extract basic information from these wiki texts, like a person's name, organization, age, etc.
PS: I am a NOOB with good understanding.
Rather than scrape Wikipedia, you can just download the whole thing in one go.
There are tools for scanning categories, so you don't have to crawl the articles yourself.
Of course, you could just skip Wikipedia altogether, as there's already an effort to do this.
If you're still intent on extracting information from Wikipedia itself, start by exploiting Wikipedia's own structure and formatting. Writing a tool to pull information from InfoBoxes would be a good start. If you absolutely want to get information out of the text, the first place to begin is with a named entity recognizer. This finds all of the named entities in text. If you're too lazy to deploy an existing one, you're working on English, and you don't mind a few extra errors, you can just grab sequences of tokens that start with capital letters.
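As a concrete illustration of that last shortcut, here is a minimal sketch of grabbing capitalized token sequences as entity candidates (the example sentence is made up, and a real NER system will do far better):

```python
import re

# Naive stand-in for a named entity recognizer: collect runs of
# consecutive capitalized tokens. Quick to try on English text, but it
# will also pick up sentence-initial words and miss lowercase entities.
def naive_entities(text):
    tokens = re.findall(r"[A-Za-z]+", text)
    entities, current = [], []
    for tok in tokens:
        if tok[0].isupper():
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

# Made-up example sentence:
print(naive_entities("Bryan Hayes worked at Bell Labs before moving to Boston."))
# -> ['Bryan Hayes', 'Bell Labs', 'Boston']
```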
From there, you're probably looking for particular patterns in the data to get information from. You can use a parser, such as the Stanford Parser, to exploit the grammatical relations of the language in text. There are also systems that work by finding patterns in strings without any traditional or explicit grammatical knowledge, like Etzioni et al.'s KnowItAll system. Depending on what exactly you're looking for, one may be better than the other.

how schema.org can help in nlp

I am basically working on NLP, collecting interest-based data from web pages.
I came across http://schema.org/ as a source that is supposed to be helpful for NLP work.
I went through the documentation, from which I can see that it adds additional tag properties to identify HTML tag content.
It may help search engines get specific data matching a user's query.
It says: Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!
But I don't understand how it can help me as an NLP guy. Generally I parse web page content to process and extract data from it. schema.org may help there, but I don't know how to utilize it.
Any example or guidance would be appreciated.
Schema.org uses the microdata format for representation. People use microdata for text analytics and for extracting curated content. There are numerous possible applications.
Suppose you want to create a news summarization system: you can use the hNews microformat to extract the most relevant content and perform summarization on it.
Suppose you have a review-based search engine where you want to list the products with the most positive reviews: you can use the hReview microformat to extract the reviews, then perform sentiment analysis on them to decide whether a product's reviews are positive or negative.
If you want to create a skill-based resume classifier, extract content marked up with the hResume microformat, which can give you details such as contact information (which uses the hCard microformat), experience, achievements, education, skills/qualifications, affiliations, publications, and so on. You can then run a classifier on it to pick out CVs with particular skill sets.
Though schema.org does not help NLP people directly, it provides a platform for doing text processing in a better way.
Check out http://en.wikipedia.org/wiki/Microformat#Specific_microformats to see the various microformats; the same page gives you more details.
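To make that concrete, here is a rough sketch of pulling microdata properties out of a marked-up page with BeautifulSoup (the HTML snippet is invented for illustration; a dedicated extractor such as the extruct library will give you properly grouped items):

```python
from bs4 import BeautifulSoup

# Illustrative HTML snippet using schema.org microdata (itemscope/itemprop).
html = """
<div itemscope itemtype="http://schema.org/Review">
  <span itemprop="itemReviewed">Acme Phone X</span>
  <span itemprop="reviewRating">2</span>
  <p itemprop="reviewBody">Battery life is terrible.</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Collect every itemprop name/value pair; a real extractor would also
# track the enclosing itemscope so properties are grouped per item.
props = {el["itemprop"]: el.get_text(strip=True)
         for el in soup.find_all(attrs={"itemprop": True})}

print(props)
# {'itemReviewed': 'Acme Phone X', 'reviewRating': '2', 'reviewBody': 'Battery life is terrible.'}
```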
Schema.org is something like a vocabulary or ontology for annotating data, here specifically web pages.
It's a good idea to extract microdata from web pages, but is it really used by web developers? I don't think so; I think the majority of microdata is consumed by companies such as Google or Yahoo.
In the end, you can find some data, but not a lot, and it is mainly used by a specific type of website.
What do you want to extract, and for what type of application? You can probably use another source of data such as DBpedia or Freebase, for example.
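If DBpedia looks like a fit, here is a small sketch of querying its public SPARQL endpoint with the SPARQLWrapper package; the class and property names used (dbo:Scientist, dbo:birthDate) are assumptions about the DBpedia ontology, so check them against the current vocabulary:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Ask DBpedia's public endpoint for a few scientists and their birth
# dates instead of scraping and parsing the pages yourself.
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?person ?birthDate WHERE {
        ?person a dbo:Scientist ;
                dbo:birthDate ?birthDate .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["person"]["value"], row["birthDate"]["value"])
```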
GoodRelations also supports schema.org. You can annotate your content on the fly from the front end, based on the various domain contexts defined. So schema.org is very useful for NLP extraction. One can even use it for HATEOAS services, for hypermedia link relations. Metadata (data about data) for any context is good for content and data in general. Alternatives include microformats, RDFa, RDFa Lite, etc. The more context you have the better, as it will turn your data into smart content and help crawler bots understand the data. It also leads further into the web of data and helps with global queries over resource domains. In the long run such approaches will help towards domain adaptation of agents for transfer learning on the web, pretty much making the web of pages an externalized unit of a massive commonsense knowledge base. They also help advertising agencies understand publisher sites and better contextualize ad retargeting.

list of english verbs and their tenses, various forms, etc

Is there a huge CSV/XML or whatever file somewhere that contains a list of English verbs and their variations (e.g. sell -> sold, sale, selling, seller, sellee)?
I imagine this will be useful for NLP systems, but there doesn't seem to be a listing anywhere, or it could be my terrible googling skills. Does anybody have a clue otherwise?
Consider Catvar:
A Categorial-Variation Database (or Catvar) is a database of clusters of uninflected words (lexemes) and their categorial (i.e. part-of-speech) variants. For example, the words hunger(V), hunger(N), hungry(AJ) and hungriness(N) are different English variants of some underlying concept describing the state of being hungry. Another example is the developing cluster: (develop(V), developer(N), developed(AJ), developing(N), developing(AJ), development(N)).
I am not sure what you are looking for but I think WordNet -- a lexical database for the English language -- would be a good place to start. Read more at http://wordnet.princeton.edu/
The link I referred you to says that
WordNet's structure makes it a useful tool for computational linguistics and natural language processing.
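As a quick sketch of what WordNet gives you for the 'sell' example, using NLTK's WordNet interface (the wordnet corpus must be downloaded first); it covers derivational variants and mapping inflected forms back to the lemma, but it will not enumerate every inflection, which is where Wiktionary or Catvar still help:

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download('wordnet')

# Map an inflected form back to its base form.
print(wn.morphy("sold", wn.VERB))  # -> 'sell'

# Derivationally related words for the verb 'sell'
# (expect things like 'seller' and 'sale').
related = set()
for lemma in wn.lemmas("sell", pos=wn.VERB):
    for rel in lemma.derivationally_related_forms():
        related.add(rel.name())
print(sorted(related))
```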
Consider getting a dump of Wiktionary and extracting this information out of it.
http://en.wiktionary.org/wiki/sell mentions many of the forms of the word (sells, selling, sold).
If your aim is simply to normalize words to some base canonical form, consider using a lemmatizer or stemmer. Try playing with morpha, which is a really good English lemmatizer.
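morpha itself is a standalone tool, but if you just want to see the stemmer-versus-lemmatizer difference quickly, NLTK ships both; a small sketch (not morpha itself):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()  # requires: nltk.download('wordnet')

for word in ["selling", "sold", "sells"]:
    print(word,
          "stem:", stemmer.stem(word),
          "lemma:", lemmatizer.lemmatize(word, pos="v"))

# The stemmer just chops suffixes ('selling' -> 'sell', but 'sold' stays
# 'sold'), while the lemmatizer maps irregular forms back to the
# dictionary form ('sold' -> 'sell').
```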

Extracting user interests from social profiles

This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
do these profiles have tags (supplied either by the Site or by the user)?
what does the Site's API make available (assuming that's how you are accessing this data; if you are scraping it, then this doesn't apply)? A good example is Facebook: if you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed at NLP that are accessible to the non-specialist or novice. The first is a text processing API. This simple web service uses REST and JSON I/O. It is free and seems to have a fairly large rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable library in Python that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging and chunk extraction, etc.
Which particular sub-domain is most relevant to solving the question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules along with demos for each of them.
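As a small, hedged illustration of the tagging and chunk-extraction route on a sentence from the profile above (it assumes the standard NLTK models have already been fetched via nltk.download(); output quality will vary, and deciding like versus dislike still needs sentiment analysis on top):

```python
import nltk

# Assumes the standard NLTK models (punkt tokenizer, POS tagger,
# NE chunker, words corpus) have already been downloaded.
text = ("I'll head bang to AC/DC, and I'm seriously considering "
        "getting a Legend of Zelda tattoo.")

tokens = nltk.word_tokenize(text)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged)

# Named-entity subtrees become candidate interests; whether they are
# liked or disliked still has to be decided separately.
candidates = [" ".join(tok for tok, _ in subtree.leaves())
              for subtree in tree.subtrees()
              if subtree.label() in {"PERSON", "ORGANIZATION", "GPE"}]
print(candidates)
```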

The best IR software for my use?

I want to take what people chat about in a chat room and do the following information retrieval:
Get the keywords
Ignore all noise words; keep mainly verbs and nouns
Perform stemming on the keywords so that I don't store the same keyword in many forms
If a synonym keyword is already stored in my storage then the existing synonym should be used instead of the new keyword
Store the processed keyword in persistent storage with a reference to the chat message it was located in and the user who uttered it
With this processed information I want to slowly get an idea of what people are talking about in chatrooms, and then use this to automatically find related chatrooms etc. based on these keywords.
My question to you is as follows: what are the best C/C++ or .NET tools for doing the above?
I partially agree with @larsmans' comment. Your question, in practice, may indeed be more complex than the question you posted.
However, simplifying the question/problem, I guess the answer to your question could be one of Lucene's implementation: Lucene (Java), Lucene.Net (C#) or CLucene (C++).
Following the points in your question:
Lucene would take care of point 1 by using String tokenizers (you can customize or use your own).
For point 2 you could use a TokenFilter like StopFilter so Lucene can read a list of stopwords ("the", "a", "an"...) that it should not use.
For point 3 you could use PorterStemFilter.
Point 4 is a little bit trickier, but could be done using a customized TokenFilter.
Points 1 to 4 are performed in the analysis/tokenization phase, which an Analyzer is responsible for.
Regarding point 5, in Lucene you can store Documents with fields. A document can have an arbitrary number and mix of fields. So you could create a single Document for each chat room with all its text concatenated, and have another field of the document reference the chatroom it was extracted from. You will end up with a bunch of Lucene documents that you can compare, so you can see which chatrooms are most similar to the one you are in.
If all you want is a set of the best keywords to describe a chatroom, your needs are closer to an information extraction/automatic summarization/topic spotting task, as @larsmans said. But you can still use Lucene for the parsing/tokenization phase (sketched below in NLTK terms just to illustrate the steps).
*I referenced the Java docs, but CLucene and Lucene.Net have very similar APIs so it won't be much trouble to figure out the differences.
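Not Lucene code, but a hedged sketch of what steps 1 to 3 of that analysis chain do (plus a crude take on comparing rooms), written with Python/NLTK purely to make the pipeline concrete; in Lucene the same roles are played by the tokenizer, StopFilter and PorterStemFilter:

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords      # requires: nltk.download('stopwords')
from nltk.stem import PorterStemmer

# Rough equivalent of the analysis chain described above:
# tokenize -> drop stopwords -> stem -> count keywords per chat room.
stop = set(stopwords.words("english"))
stemmer = PorterStemmer()

def keywords(chat_text):
    tokens = nltk.word_tokenize(chat_text.lower())  # requires 'punkt'
    return Counter(stemmer.stem(t) for t in tokens
                   if t.isalpha() and t not in stop)

# Made-up chat snippets standing in for two rooms' concatenated text.
room_a = keywords("We were selling guitars and talking about amps all night.")
room_b = keywords("Anyone selling a guitar? Also looking for amp advice.")

print(room_a.most_common(5))
# Naive room similarity: shared stemmed keywords.
print(set(room_a) & set(room_b))
```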
