What technique would you use to find similar people with the same social profile as you? (computer science)

Let's take your Facebook social profile. There are interests, activities, movies, music, and TV shows.
You have these five things, in text, of course. Given your social profile and 10 other people's, we want to find overlaps, similarity, etc. What method would you use to do it?
I'm guessing it would be best to use vectors and Euclidean distance or Pearson correlation? That's my approach. What's yours?
Please use a visual style to answer this question, including examples and/or drawing out the vectors.

The December ACM student magazine discussed this area.
http://mags.acm.org/crossroads/2009winter/
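One possible way to make the vector idea concrete (a rough sketch, assuming scikit-learn and SciPy are available; the profile strings are invented placeholders): treat each profile's combined text as a bag-of-words vector, then rank the other people by cosine similarity and Pearson correlation against your own vector.

```python
# Represent each profile as a term-count vector and compare it to "my"
# profile with cosine similarity and Pearson correlation.
# The profile strings below are made-up examples, not real data.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.stats import pearsonr

my_profile = "hiking indie rock sci-fi movies chess photography"
others = [
    "indie rock concerts photography travel",
    "football cooking reality tv shows",
    "sci-fi movies chess programming hiking",
]

vectors = CountVectorizer().fit_transform([my_profile] + others).toarray()
me, candidates = vectors[0], vectors[1:]

for i, vec in enumerate(candidates):
    cos = cosine_similarity([me], [vec])[0, 0]
    r, _ = pearsonr(me, vec)
    print(f"person {i}: cosine={cos:.2f}, pearson={r:.2f}")
```

Whichever distance you pick, the main decision is what the dimensions mean: raw word counts are the simplest choice, while one dimension per liked page/band/movie (when you have structured data) usually gives cleaner comparisons.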

Related

How do you get comments from social media and use them as your data?

I've proposed a title for our thesis, "Movie Success Prediction through Social Media Comments using Sentiment Analysis." Is there a way to get the comments from social media (Twitter, Instagram, Facebook, etc.) and use them in your software, like an API or any other way? Is it even possible to use your software on different social media platforms to get the comments for prediction, or should I change my title and stick to one platform like Facebook or Twitter only?
What's a good algorithm for this?
What programming language and framework/IDE should I use?
I've done lots of research on Google and am still hoping for more info here. Thank you.
Edit: I'll only use YouTube and the YouTube API.
From the title of your question, it seems that the method you need is distant supervision. You retrieve data with labels you believe are appropriate for your task. For instance, a tweet containing the #perfect hashtag would probably be a positive tweet. So you can define sets of hashtags for your task (negative, positive, or even neutral) and then retrieve tweets containing them via the Twitter API. Since your task is about movies, your data should contain movie-related information in the first place.
Given that you will deal with text data and you'd like to create your own dataset, it is better to start with Twitter. Its API fits your needs and is very well documented. The language and frameworks are up to you, since the API is supported in many popular languages. Personally, I'd start with Python or Java to solve future problems more quickly with community support.
For a general survey of this area, you may dive into papers and resources from here:
https://scholar.google.com.tr/scholar?hl=en&q=distant+supervision+sentiment+analysis
Distant supervision can also be used to create a sentiment lexicon out of millions of English tweets by using sets of negative and positive hashtags. You may take a look at Chapter 5 of this thesis ( https://spectrum.library.concordia.ca/980377/1/Ozdemir_MCompSc_F2015.pdf ), which may also give a good insight for your thesis.
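As a rough illustration of the labelling step only (a sketch; the hashtag sets and tweet texts are invented, and actually collecting the tweets via the Twitter API is left out):

```python
# Distant supervision: assign noisy sentiment labels to tweets based on
# which "seed" hashtags they contain; tweets matching both sets or
# neither set are dropped. Hashtags and tweets are invented examples.
POSITIVE_TAGS = {"#perfect", "#loved", "#mustwatch"}
NEGATIVE_TAGS = {"#boring", "#waste", "#worstmovie"}

def label_tweet(text):
    words = {w.strip(".,!?") for w in text.lower().split()}
    pos = bool(words & POSITIVE_TAGS)
    neg = bool(words & NEGATIVE_TAGS)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # ambiguous or unlabelled; exclude from the training set

tweets = [
    "That ending was #perfect, go see it",
    "Two hours of my life gone #waste",
    "Saw the new trailer today",
]
dataset = [(t, label_tweet(t)) for t in tweets if label_tweet(t) is not None]
print(dataset)
```

The labels are noisy by construction (a sarcastic use of #perfect gets mislabelled), which is why distant supervision usually relies on sheer volume to compensate.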
Hope this helps.
Cheers

Teachable AI Chatbot

I'm starting on AI chatbots and don't know where to actually start.
What I've imagined is something like this:
An empty chatbot that doesn't know anything
It learns when the user asks a question; if the bot doesn't know the answer, it asks for it
It records all the data it learns and parses synonymous questions
Example procedure:
User: what is the color of a ripe mango?
Bot: I don't know [to input answer add !#: at the start]
User: !#:yellow
User: do you know the color of a ripe mango?
Bot: yellow
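A very small sketch of that teach-mode loop (plain Python, nothing pretrained; the "synonymous questions" handling here is just lowercasing and dropping filler words, which falls well short of real paraphrase matching):

```python
# Minimal teachable bot: unknown questions get an "I don't know" prompt,
# and a reply starting with "!#:" stores the answer for the last question.
import string

knowledge = {}        # normalized question -> answer
last_question = None  # question waiting for a taught answer

def normalize(text):
    # crude canonical form so "what is X?" and "do you know X?" collide
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    fillers = {"what", "is", "do", "you", "know", "the", "a", "an", "of"}
    return " ".join(w for w in text.split() if w not in fillers)

def reply(user_input):
    global last_question
    if user_input.startswith("!#:") and last_question:
        knowledge[last_question] = user_input[3:].strip()
        return "Got it."
    last_question = normalize(user_input)
    return knowledge.get(last_question, "I don't know [to input answer add !#: at the start]")

print(reply("what is the color of a ripe mango?"))    # I don't know ...
print(reply("!#:yellow"))                              # Got it.
print(reply("do you know the color of ripe mango?"))   # yellow
```

Anything beyond this exact-ish lookup (real paraphrase detection, spelling tolerance, multi-turn context) quickly becomes genuine NLP work.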
Chatbots, or conversational dialogue systems in general, have to be able to generate natural language, and as you might expect, this is not trivial. State-of-the-art approaches usually mine human-human conversations (for example, conversations on chat platforms like Facebook or Twitter, or even movie dialogs; basically anything that is available in large quantities and resembles natural conversation). These conversations are then, for example, labelled as question-answer pairs, possibly using pretrained word embeddings.
This is an active area of research in NLP. One example category of systems is end-to-end sequence-to-sequence models (seq2seq). However, basic seq2seq models have a tendency to produce repetitive and therefore dull responses. More recent papers try to address this using reinforcement learning, as well as techniques like adversarial networks, in order to learn to choose better responses. Another technique that improves the system is to extend the context of the conversation by allowing the model to see more prior turns, for example by using a hierarchical model.
If you don't really know where to start, I think you will find all the basics you need in this free chapter of "Speech and Language Processing" by Daniel Jurafsky & James H. Martin (August 2017). Good luck!

FourSquare vs. Google Places vs. Yelp API

I am trying to create an app that will help users find restaurants/movie theaters/malls/etc. to hang out at, based on ratings and distance. Beyond the place itself, I would also like more detailed information about it. For example, if I were looking for parks, I would also like to know if there's a basketball or tennis court there. Ratings and popularity would also be an important aspect in prioritizing suggestions.
After looking through all three of the APIs, I could not really find any substantial differences other than their search limits. Could anyone really differentiate each API for me? Maybe even recommend one based on my specific need?
Thanks!
The Foursquare API would fit this use case perfectly because you can supply very specific filters through the API. Also, they have extensive coverage around the world, unlike Google or Yelp.
I would check out the venues/explore endpoint and use a categoryId of Parks. You can use a query parameter of "basketball" or "tennis" to find parks that have courts for these.
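To make that suggestion concrete, here is a rough sketch against the v2 venues/explore endpoint (CLIENT_ID, CLIENT_SECRET and PARKS_CATEGORY_ID are placeholders; the real category ID comes from Foursquare's venue category list):

```python
# Sketch: explore parks near a point, filtered by a "basketball" query.
# Credentials and the parks category ID are placeholders to fill in.
import requests

params = {
    "client_id": "CLIENT_ID",
    "client_secret": "CLIENT_SECRET",
    "v": "20180323",              # API version date
    "ll": "40.7128,-74.0060",     # latitude,longitude of the search center
    "categoryId": "PARKS_CATEGORY_ID",
    "query": "basketball",
    "limit": 20,
}
resp = requests.get("https://api.foursquare.com/v2/venues/explore", params=params)
resp.raise_for_status()
for group in resp.json()["response"]["groups"]:
    for item in group["items"]:
        venue = item["venue"]
        print(venue["name"], venue.get("location", {}).get("distance"))
```

Explore results come back ranked by Foursquare's own popularity scoring; if you also need per-venue ratings, those generally come from a separate venue details call rather than from explore itself.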

Extracting user interests from social profiles

This is my first time dabbling in NLP so please excuse my ignorance. I'm looking for a method to extract interests/likes/hobbies from users' social profiles. Here is an example where all the interests/likes/hobbies are in bold:
"I consider myself a pretty diverse character... I'm a professional
wrestler, but I'd take a bullet for Wall•E. I train like a one-man genocide machine in the gym, but I cried at
"Armageddon." I'll head bang to AC/DC, and I'm seriously
considering getting a Legend of Zelda tattoo. I'm 420-friendly. I
like to party it up with the frat crowd one night, hang out with
my Burning Man friends the next, play Halo and World of
Warcraft the next, and jam with friends that aren't any younger than
40 the next. My youngest friend is 16, my oldest friend is 66. I'll
sing karaoke at the bars, and I'm my friends' collective
psychiatrist/shoulder."
The profiles are plain text. There are no meta tags or ids associated with any of it, it's just a paragraph of text.
My naive idea was to take each noun and match it against Freebase to see if it's an activity/artist/movie/book, etc. The problem is that although most entities mentioned will be things the user likes, she will also mention things she doesn't like, and I have no means of distinguishing the two.
I have 2 questions:
What sub field of NLP should I be looking at? Some googleable algorithms/techniques/authors would be greatly appreciated.
How hard is this problem?
Thanks!
First, unless using NLP to do this is a particular objective for you, check your problem domain to see if you can avoid it completely.
For instance:
Do these profiles have tags (supplied either by the site or by the user)?
What does the site's API make available (assuming that's how you are accessing this data; if you are scraping it, this of course doesn't apply)? Facebook is a good example: if you read a user's posts, you'll see words like "wrestler", "karaoke", etc., but if you look at what fields are exposed via the Graph API, you'll see that these activities nearly always have an associated FB ID.
I am not a specialist in this field, but I can recommend a couple of resources directed at NLP that are accessible to the non-specialist or novice. The first is a text-processing API. This simple web service uses REST and JSON I/O. It is free and seems to have a fairly generous rate limit.
This API appears to rely heavily on the excellent Natural Language Toolkit (NLTK), a mature, stable Python library that includes modules directed at the problem in your question, e.g., sentiment analysis, tagging, and chunk extraction.
Which particular sub-domain is most relevant to the question in the OP? I don't know, but I suspect there's a module somewhere in NLTK that does what you need. Finding that module is hopefully just a matter of skimming the API documentation (which is organized by module) and reading the Getting Started section, which contains an excellent survey of NLTK's modules as well as demos for each of them.
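If it helps to see the NLTK route, here is a rough candidate-extraction sketch (it only surfaces candidate entities from a small excerpt of the example profile; it does not distinguish likes from dislikes):

```python
# Tokenize, POS-tag and NE-chunk a profile snippet with NLTK, keeping
# chunked entities as candidate "interests" to look up against a
# knowledge base such as Freebase/Wikidata. Candidates only; this does
# not tell likes from dislikes.
import nltk

# classic resource names; newer NLTK releases may also want *_tab variants
for resource in ("punkt", "averaged_perceptron_tagger", "maxent_ne_chunker", "words"):
    nltk.download(resource, quiet=True)

profile = ("I'm a professional wrestler, but I'd take a bullet for Wall-E. "
           "I'll head bang to AC/DC, and I'm seriously considering getting "
           "a Legend of Zelda tattoo. I like to play Halo and World of Warcraft.")

tokens = nltk.word_tokenize(profile)
tagged = nltk.pos_tag(tokens)
tree = nltk.ne_chunk(tagged, binary=True)   # binary=True: just "NE" vs. not

candidates = [" ".join(word for word, tag in subtree.leaves())
              for subtree in tree.subtrees()
              if subtree.label() == "NE"]
print(candidates)   # candidate entity strings to check against a knowledge base
```

The harder half of the question, deciding whether each mention is a like or a dislike, is closer to targeted (aspect-based) sentiment analysis than to entity extraction.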

NLP classify sentences/paragraph as funny

Is there a way to classify a particular sentence/paragraph as funny? There are very few pointers as to where one should go from here.
There is research on this; it's called computational humor. It's an interdisciplinary area that takes elements from computational linguistics, psycholinguistics, artificial intelligence, machine learning, etc. Researchers are trying to find out what it is that makes stories or jokes funny (e.g., an unexpected connection, or using a taboo topic in a surprising way) and to apply it to text, either to generate a funny story or to measure the 'funniness' of a text.
There are books and articles about it (e.g. by Graeme Ritchie).
Yes, you should use a training corpus to build a predictive model able to detect funny sentences. Sometimes this is known as "sentiment analysis" in the literature. Take a look at this article about sentiment analysis with LingPipe.
If you can use Java, you can use their library (see the license matrix). I found it very useful, though not in exactly the same context as yours.
The only way to pull this off is to get a couple of thousand people (monkeys won't do, sorry) to look through thousands of funny sentences/stories, rate them, and then build some sort of expert system/neural network out of it. Given the scope of the problem and its subjectivity (a thing funny to one person might not be funny, or might even be offensive, to another), I'd say it's an impossible task.
You can use the same technique as spam filters: instead of spam/non-spam, you classify funny/not-funny. Look into naive Bayesian classifiers for more information.
http://en.wikipedia.org/wiki/Naive_Bayesian_classification
Also, try searching Google Scholar for "Computational Humor" if you're serious about getting into the field. Sentiment analysis has been mentioned too; see Wikipedia on that.
Of course, this all depends on what your scope and aims are...
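A bare-bones sketch of that spam-filter-style approach (the labelled sentences are invented placeholders; building a real funny/not-funny corpus is the hard part the other answers point at):

```python
# Naive Bayes funny/not-funny classifier over bag-of-words counts.
# Training sentences and labels are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = [
    "I told my wife she was drawing her eyebrows too high. She looked surprised.",
    "The quarterly report is due on Friday at noon.",
    "I used to be a banker, but I lost interest.",
    "Please remember to submit your timesheet before the deadline.",
]
train_labels = ["funny", "not-funny", "funny", "not-funny"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Why don't scientists trust atoms? They make up everything."]))
```

Bag-of-words features capture topic and vocabulary rather than timing or incongruity, so expect a baseline like this to miss much of what actually makes text funny.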

Resources