Extracting user interests from social profiles - nlp

This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naiive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like and I have no means of distinguishing the 2.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!

First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
do these profiles have tags (supplied either by the Site or by the
user)?
what does the Site's API make available (assuming that's how you are accessing this data; if you are scraping it, then this doesn't of course apply)? A good example, Facebook. if you read a user's posts, you'll see words like "wrestler", "karaoke", etc. but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed to NLP and which are accessible to the non-specialist or novice. The first is a text processing API. This simple web service uses REST and JSON IO. It is free and seems to have a fairly large rate limit.
This API appears to rely heavily on the excellent Natural Language Tooolkit (NLTK) which is a mature stable library in python, that includes modules directed to the problem in your Question, e.g., Sentiment Analysis, Tagging and Chunk Extraction, etc.
Which particular sub-domain is most relevant to solving the Question in the OP? I don't know, but I suspect there's a module somewhere in the NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API Documentation (which is organized by module); reading the Getting Started section which contains an excellent survey of NLTK's modules as well as demos for all of each of them.

Related

how to get the comments on social media and make it as your data?

I've proposed a title for our thesis, Movie Success Prediction through Social Media comments using Sentiment Analysis, is there a way you can get the comments on social media (twitter, Instagram, Facebook etc.) and use it for your software? like an API or any other way. is that even possible to use your software on different social media to get the comments for prediction or should i change my title and stick to one social media like Facebook or twitter only?
what's the good algorithm for this?
what programming language and framework/IDE should i use?
I've done lots of research on google and still hoping for more info here. Thank you.
Edit: I'll only use YouTube and YouTube API.
From the title of your question, it seems that the method you need to use is distant supervision. You need to retrieve data with labels you think it is proper for your task. For instance, a tweet containing #perfect hashtag would probably be a positive tweet. So, you can define set of hashtags for your task, negative, positive or even for neutral; then you can retrieve tweets by those via Twitter API. For your task, those should be for movies, therefore your data should contain movie related information in first place.
Given that you will deal with text data and you'd like to create your own dataset, it is better to start with Twitter. Its API works for your needs and it is very well-documented. The language and frameworks are upto your choice, since APIs supports many known languages as well. Personally, I'd start with python or java to quickly solve future problems easier with community support.
For a general survey of this area, you may dive into papers and resources from here:
https://scholar.google.com.tr/scholar?hl=en&q=distant+supervision+sentiment+analysis
Distant supervision could be used to create a sentiment lexicon out of millions English tweets by using sets of negative and positive hashtags as well. You may take a look at Chapter 5 of this thesis ( https://spectrum.library.concordia.ca/980377/1/Ozdemir_MCompSc_F2015.pdf ), this may also give a good insight for your thesis, too.
Hope this helps.
Cheers

suggest list of how-to articles based on text content

I have 20,000 messages (combination of email and live chat) between my customer and my support staff. I also have a knowledge base for my product.
Often times, the questions customers ask are quite simple and my support staff simply point them to the right knowledge base article.
What I would like to do, in order to save my support staff time, is to show my staff a list of articles that may likely be relevant based on the initial user's support request. This way they can just copy and paste the link to the help article instead of loading up the knowledge base and searching for the article manually.
I'm wondering what solutions I should investigate.
My current line of thinking is to run analysis on existing data and use a text classification approach:
For each message, see if there is a response with a link to a how-to article
If Yes, extract key phrases (microsoft cognitive services)
TF-IDF?
Treat each how-to as a 'classification' that belongs to sets of key phrases
Use some supervised machine learning, support vector machines maybe to predict which 'classification, aka how-to article' belongs to key phrase determined from a new support ticket.
Feed new responses back into the set to make the system smarter.
Not sure if I'm over complicating things. Any advice on how this is done would be appreciated.
PS: naive approach of just dumping 'key phrases' into search query of our knowledge base yielded poor results since the content of the help article is often different than how a person phrases their question in an email or live chat.
A simple classifier along the lines of a "spam" classifier might work, except that each FAQ would be a feature as opposed to a single feature classifier of spam, not-spam.
Most spam-classifiers start-off with a dictionary of words/phrases. You already have a start on this with your naive approach. However, unlike your approach a spam classifier does much more than a text search. Essentially, in a spam classifier, each word in the customer's email is given a weight and the sum of weights indicates if the message is spam or not-spam. Now, extend this to as many features as FAQs. That is, features like: FAQ1 or not-FAQ1, FAQ2 or not-FAQ2, etc.
Since your support people can easily identify which of the FAQs an e-mail requires then using a supervised learning algorithm would be appropriate. To reduce the impact of any miss-classification errors, then consider the application presenting a support person with the customer's email followed by the computer generated response and all the support person would have to-do is approve the response or modify it. Modifying a response should result in a new entry in the training set.
Support Vector Machines are one method to implement machine learning. However, you are probably suggesting this solution way too early in the process of first identifying the problem and then getting a simple method to work, as well as possible, before using more sophisticated methods. After all, if a multi-feature spam classifier works why invest more time and money in something else that also works?
Finally, depending on your system this is something I would like to work-on.

how schema.org can help in nlp

I am basically working on nlp, collecting interest based data from web pages.
I came across this source http://schema.org/ as being helpful in nlp stuff.
I go through the documentation, from which I can see it adds additional tag properties to identify html tag content.
It may help search engine to get specific data as per user query.
it says : Schema.org provides a collection of shared vocabularies webmasters can use to mark up their pages in ways that can be understood by the major search engines: Google, Microsoft, Yandex and Yahoo!
But I don't understand how it can help me being nlp guy? Generally I parse web page content to process and extract data from it. schema.org may help there, but don't know how to utilize it.
Any example or guidance would be appreciable.
Schema.org uses microdata format for representation. People use microdata for text analytics and extracting curated contents. There can be numerous application.
Suppose you want to create news summarization system. So you can use hNews microformats to extract most relevant content and perform summrization onit
Suppose if you have review based search engine, where you want to list products with most positive review. You can use hReview microfomrat to extract the reviews, now perform sentiment analysis on it to identify product has -ve or +ve review
If you want to create skill based resume classifier then extract content with hResume microformat. Which can give you various details like contact (uses the hCard microformat), experience, achievements , related to this work, education , skills/qualifications, affiliations
, publications , performance/skills for performance etc. You can perform classifier on it to classify CVs with particular skillsets
Thought schema.org does not helps directly to nlp guys, it provides platform to perform text processing in better way.
Check out this http://en.wikipedia.org/wiki/Microformat#Specific_microformats to see various mircorformat, same page will give you more details.
Schema.org is something like a vocabulary or ontology to annotate data and here specifically Web pages.
It's a good idea to extract microdata from Web pages but is it really used by Web developper ? I don't think so and I think that the majority of microdata are used by company such as Google or Yahoo.
Finally, you can find data but not a lot and mainly used by a specific type of website.
What do you want to extract and for what type of application ? Because you can probably use another type of data such as DBpedia or Freebase for example.
GoodRelations also supports schema.org. You can annotate your content on the fly from the front-end based on the various domain contexts defined. So, schema.org is very useful for NLP extraction. One can even use it for HATEOS services for hypermedia link relations. Metadata (data about data) for any context is good for content and data in general. Alternatives, include microformats, RDFa, RDFa Lite, etc. The more context you have the better as it will turn your data into smart content and help crawler bots to understand the data. It also leads further into web of data and in helping global queries over resource domains. In long run such approaches will help towards domain adaptation of agents for transfer learning on the web. Pretty much making the web of pages an externalized unit of a massive commonsense knowledge base. They also help advertising agencies understand publisher sites and to better contextualize ad retargeting.

Open source projects for email scrubbing generating structured data from unstructured source?

Don't know where to start on this one so hopefully you guys can clear up my question. I have project where email will be searched for specific words/patterns and stored in a structured manner. Something that is done with Trip it.
The article states that they developed a DataMapper
The DataMapper is responsible for taking inbound email messages
addressed to plans [at] tripit.com and transforming them from the
semi-structured format you see in your mail reader into a highly
structured XML document.
There is a comment that also states
If you're looking to build this yourself, reading a little bit about
Wrappers and Wrapper Induction might be helpful
I Googled and read about wrapper induction but it was just too broad of a definition and didn't help me understand how one would go about solving such problem.
Is there some open source project out there that does similar things?
There are a couple of different ways and things you can do to accomplish this.
The first part, which involves getting access to the email content I'll not answer here. Basically, I'll assume that you have access to the text of emails, and if you don't there are some libraries that allow you to connect java to an email box like camel (http://camel.apache.org/mail.html).
So now you've got the email so then what?
A handy thing that could help is that lingpipe (http://alias-i.com/lingpipe/) has an entity recognizer that you can populate with your own terms. Specifically, look at some of their extraction tutorials and their dictionary extractor (http://alias-i.com/lingpipe/demos/tutorial/ne/read-me.html) So inside of the lingpipe dictionary extractor (http://alias-i.com/lingpipe/docs/api/com/aliasi/dict/ExactDictionaryChunker.html) you'd simply import the terms you're interested in and use that to associate labels with an email.
You might also find the following question helpful: Dictionary-Based Named Entity Recognition with zero edit distance: LingPipe, Lucene or what?
Really a very broad question, but I can try to give you some general ideas, which might be enough to get started. Basically, it sounds like you're talking about an elaborate parsing problem - scanning through the text and looking to apply meaning to specific chunks. Depending on what exactly you're looking for, you might get some good mileage out of a few regular expressions to start - things like phone numbers, email addresses, and dates have fairly standard structures that should be matchable. Other data points might benefit from some indicator words - the phrase "departing from" might indicate that what follows is an address. The natural language processing community also has a large tool set available for text processing - check out things like parts of speech taggers and semantic analyzers if they're appropriate to what you're trying to do.
Armed with those techniques, you can follow a basic iterative development process: For each data point in your expected output structure, define some simple rules for how to capture it. Then, run the application over a batch of test data and see which samples didn't capture that datum. Look at the samples and revise your rules to catch those samples. Repeat until the extractor reaches an acceptable level of accuracy.
Depending on the specifics of your problem, there may be machine learning techniques that can automate much of that process for you.

social media search engine question

I came across this site called social mention and am curious about how applications like this work, hopefully somebody can offer some glimpses/suggestions on this.
Upon looking at the search results, I realize that they grab results from facebook, twitter, google.... I suppose this is done on the fly, probably through some REST api exposed by the mentioned?
If what I mention in point 1 is probably true, does that means sentiment analysis on the documents/links return is done on the fly too? Wouldn't that be too computationally intensive? I am curious because other than sentiments, they also return the top keywords in the document set.
They have something called the "trends". They looked like the trendingtopics in twitter, but seems like they also include phrases >3 words long. Is this relevant to nlp's entity extraction or more to keyphrase extraction? Is there apis other than that of Twitter that provides this? Is "trends" generally done on search queries submitted by users or do the system actually processes the pages?
A curious man.
sentiment can be fast and on the fly, if it is for example rule-based and the dictionaries are in memory. Curious? Get in touch

Resources