Hi, I am writing a small app that uses Facebook data to group people by their social networks. The main problem I face is grouping similar texts together: some people list their education as "Anna University, Guindy" while others put it as "Anna University". How do I group these together? What algorithm or term should I search for?
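The terms to search for are fuzzy (approximate) string matching, and more broadly record linkage or entity resolution. A minimal sketch using only Python's standard library `difflib`; the 0.75 threshold and the greedy first-match grouping are illustrative assumptions you would tune:

```python
# Toy sketch of grouping near-duplicate strings with fuzzy matching.
# Uses only the standard library (difflib); for larger data, look at
# dedicated record-linkage libraries or locality-sensitive hashing.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; case-insensitive."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def group_similar(values, threshold=0.75):
    """Greedily assign each string to the first group whose
    representative is similar enough (threshold is a guess to tune)."""
    groups = []  # list of lists; groups[i][0] is the representative
    for v in values:
        for g in groups:
            if similarity(v, g[0]) >= threshold:
                g.append(v)
                break
        else:
            groups.append([v])
    return groups

schools = ["Anna University, Guindy", "Anna University", "MIT"]
print(group_similar(schools))
# The two "Anna University" variants land in one group, "MIT" in another.
```

Note that greedy grouping depends on input order; clustering over a pairwise similarity matrix is a more robust variant of the same idea.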
Let's say we have two sentences:
Jacob is going to watch a movie with Justin.
He will be back by 10 pm.
How does Stanford NLP identify "he" refers to Jacob and not Justin?
This is called coreference resolution, and it is a well-studied problem in NLP, so there are many possible ways to do it. Stack Overflow is not the right venue for a literature review, but here are a few links to get you started:
http://www-labs.iro.umontreal.ca/~felipe/IFT6010-Hiver2015/Presentations/Abbas-Coreference.pdf
https://nlp.stanford.edu/projects/coref.shtml and links therein
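Real systems combine many features (gender and number agreement, syntactic role, recency, salience) learned from annotated data. Purely to illustrate two such features, here is a toy heuristic applying gender agreement plus a "subject preference"; the mini-lexicon and rules are invented and are not the Stanford approach:

```python
# Toy illustration of two signals real coreference systems use:
# gender agreement, and a preference for the sentence subject (the
# first mention). Real systems (the Stanford sieve, neural
# mention-rankers) combine many such features with learned weights.

# Hypothetical mini-lexicon: name -> gender
GENDER = {"Jacob": "male", "Justin": "male", "Anna": "female"}

def resolve_pronoun(pronoun, mentions):
    """mentions: names in order of appearance in the previous sentence.
    Heuristic: among gender-compatible names, pick the first mention,
    mirroring the 'subject preference' feature."""
    wanted = {"he": "male", "she": "female"}[pronoun.lower()]
    candidates = [m for m in mentions if GENDER.get(m) == wanted]
    return candidates[0] if candidates else None

# "Jacob is going to watch a movie with Justin. He will be back by 10 pm."
print(resolve_pronoun("He", ["Jacob", "Justin"]))  # Jacob
```

Here both names agree in gender, so the subject preference is what breaks the tie in favor of Jacob, which matches what the Stanford system outputs for this sentence pair.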
I'm starting on AI chatbots and don't know where to actually start.
What I've imagined is something like this:
An empty chatbot that doesn't know anything
It learns when a user asks a question: if the bot doesn't know the answer, it asks for it
It records everything it has learned and matches synonymous questions
Example procedure:
User: what is the color of a ripe mango?
Bot: I don't know [to input answer add !#: at the start]
User: !#:yellow
User: do you know the color of ripe mango?
Bot: yellow
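The procedure above can be sketched in a few lines of Python using only the standard library; the stop-word list used for normalization and the 0.6 fuzzy-match cutoff are illustrative assumptions:

```python
# Minimal sketch of the procedure described above: an "empty" bot that
# stores answers the user teaches it (via the "!#:" prefix) and matches
# paraphrased questions with fuzzy matching (difflib).
import difflib

class LearningBot:
    def __init__(self):
        self.knowledge = {}   # normalized question -> answer
        self.pending = None   # question awaiting an answer

    @staticmethod
    def normalize(text):
        # crude normalization so paraphrases land near each other
        drop = {"do", "you", "know", "what", "is", "the", "a", "of"}
        words = [w for w in text.lower().strip("?!. ").split() if w not in drop]
        return " ".join(words)

    def reply(self, message):
        if message.startswith("!#:"):            # user is teaching us
            if self.pending is not None:
                self.knowledge[self.pending] = message[3:]
                self.pending = None
                return "Got it, thanks!"
            return "Nothing to learn right now."
        key = self.normalize(message)
        match = difflib.get_close_matches(key, self.knowledge, n=1, cutoff=0.6)
        if match:
            return self.knowledge[match[0]]
        self.pending = key
        return "I don't know [to input answer add !#: at the start]"

bot = LearningBot()
print(bot.reply("what is the color of a ripe mango?"))   # I don't know ...
print(bot.reply("!#:yellow"))                            # Got it, thanks!
print(bot.reply("do you know the color of ripe mango?")) # yellow
```

Both phrasings normalize to the same key ("color ripe mango"), which is how the paraphrase is matched; a real system would replace this normalization with embeddings or the seq2seq approaches discussed below.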
Chatbots, or conversational dialogue systems in general, have to be able to generate natural language, and as you might expect, this is not trivial. State-of-the-art approaches usually mine human-human conversations (for example from chat platforms like Facebook or Twitter, or even movie dialogs: things that are available in large quantities and resemble natural conversation). These conversations are then, for example, labelled as question-answer pairs, possibly using pretrained word embeddings.
This is an active area of research in NLP. One family of systems is end-to-end sequence-to-sequence models (seq2seq). However, basic seq2seq models tend to produce repetitive and therefore dull responses. More recent papers try to address this with reinforcement learning, as well as techniques like adversarial networks, in order to learn to choose better responses. Another improvement is to extend the context of the conversation by letting the model see more prior turns, for example with a hierarchical model.
If you don't really know where to start, I think you will find all the basics you need in this free chapter of "Speech and Language Processing" by Daniel Jurafsky and James H. Martin (August 2017). Good luck!
I'm currently working on a project where I take emails, strip out the message bodies using the email package, and then categorize them using labels like sports, politics, technology, etc. I've successfully stripped the message bodies out of my emails, and I'm now looking to start classifying.
To make multiple labels like sports, technology, politics, and entertainment, I need a set of words for each one to do the labelling. For example, the
sports label would have the label data: football, soccer, hockey, ...
Where can I find labeled data online to help me?
You can use DMOZ.
Be aware that there are different kinds of text. For example, among the most common words in email text will be "Hi" or "Hello", but in wiki text "Hi" and "Hello" will not be common words.
What you're trying to do is called topic modeling:
https://en.wikipedia.org/wiki/Topic_model
The list of topics is very dependent on your training dataset and the ultimate purpose for which you're building this.
A good place to start can be here:
https://nlp.stanford.edu/software/tmt/tmt-0.4/
You can look at their topics, and you can probably also use the tool to assign some initial topics to your data and then build on top of them.
You can use the BBC dataset.
It has labeled news articles which can help.
For feature extraction, remove stopwords, do stemming, use n-grams with tf-idf, and then choose the best features.
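That recipe (stopword removal, stemming, n-grams with tf-idf, then feature selection) can be sketched as a scikit-learn pipeline with an NLTK stemmer. The tiny stopword list, k=10, and the toy training texts are all illustrative assumptions:

```python
# Sketch of the recipe above: stopword removal + stemming inside a
# custom tokenizer, word n-grams with tf-idf, chi-squared feature
# selection, then a Naive Bayes classifier. Toy data only.
import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = PorterStemmer()
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "is"}  # tiny illustrative list

def stem_tokenize(text):
    return [stemmer.stem(t) for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS]

pipeline = make_pipeline(
    TfidfVectorizer(tokenizer=stem_tokenize, token_pattern=None,
                    ngram_range=(1, 2)),       # unigrams + bigrams
    SelectKBest(chi2, k=10),                   # keep the 10 "best" features
    MultinomialNB(),
)

texts = ["the team won the cup final", "a late goal decided the match",
         "parliament passed the new bill", "the minister resigned today"]
labels = ["sports", "sports", "politics", "politics"]
pipeline.fit(texts, labels)
print(pipeline.predict(["another goal in the final"]))
```

With only four documents the feature selection is mostly noise; the point is the shape of the pipeline, which stays the same when you swap in a real labeled corpus such as the BBC dataset.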
I am new to Natural Language Processing and I want to learn more by creating a simple project. NLTK was suggested as a popular NLP library, so I will use it in my project.
Here is what I would like to do:
I want to scan our company's intranet pages; approximately 3K pages
I would like to parse and categorize the content of these pages based on certain criteria such as: HR, Engineering, Corporate Pages, etc...
From what I have read so far, I can do this with Named Entity Recognition. I can describe entities for each category of pages, train the NLTK solution and run each page through to determine the category.
Is this the right approach? I appreciate any direction and ideas...
Thanks
It looks like you want to do text/document classification, which is not quite the same as Named Entity Recognition, where the goal is to recognize named entities (proper names, places, institutions, etc.) in text. However, proper names can be very good features when doing text classification in a limited domain: for example, a page containing the name of the head engineer is likely to belong to Engineering.
The NLTK book has a chapter on basic text classification.
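In the spirit of that chapter, here is a minimal bag-of-words Naive Bayes sketch with NLTK. The page snippets and labels are invented; note that a distinctive word (such as a person's name) would act as exactly the kind of strong feature described above:

```python
# Tiny document-classification sketch in the style of the NLTK book:
# bag-of-words boolean features feeding a Naive Bayes classifier.
# All "page" texts and labels are made up.
import nltk

def features(text):
    return {f"contains({w})": True for w in text.lower().split()}

train = [
    (features("vacation policy and payroll forms"), "HR"),
    (features("benefits enrollment and payroll dates"), "HR"),
    (features("api design review by the engineering team"), "Engineering"),
    (features("deployment pipeline and code review checklist"), "Engineering"),
]

classifier = nltk.NaiveBayesClassifier.train(train)
print(classifier.classify(features("payroll questions and benefits")))  # HR
```

For 3K intranet pages you would build the training set by hand-labeling a sample of pages per category, and `classifier.show_most_informative_features()` helps check which words (or names) drive each decision.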
Let's take your Facebook social profile. There are interests, activities, movies, music, and tv-shows.
You have these 5 things, in text, of course. Given your social profile and 10 other people, we want to find overlaps, similarity, etc. What method would you use to do it?
I'm guessing it would be best to use vectors and Euclidean distance or Pearson correlation? That's my approach. What's yours?
Please answer in a visual style, including examples and/or drawing out the vectors.
The December (winter 2009) issue of the ACM student magazine, Crossroads, discussed this area.
http://mags.acm.org/crossroads/2009winter/
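To make the vector idea from the question concrete, here is a sketch that encodes interests as binary vectors over a shared vocabulary and compares them with cosine and Pearson similarity (the vocabulary and profiles are invented, standard library only):

```python
# Encode each profile's interests as a binary vector over a shared
# vocabulary, then compare profiles with cosine and Pearson similarity.
import math

VOCAB = ["hiking", "jazz", "sci-fi", "cooking", "football"]

def to_vector(interests):
    # e.g. {"hiking", "jazz"} -> [1, 1, 0, 0, 0]
    return [1.0 if item in interests else 0.0 for item in VOCAB]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(u, v):
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv) if su and sv else 0.0

me   = to_vector({"hiking", "jazz", "sci-fi"})
you  = to_vector({"hiking", "jazz", "cooking"})
them = to_vector({"football"})

print(cosine(me, you))    # two shared interests out of three each
print(cosine(me, them))   # 0.0: no overlap
```

The same idea extends to the five profile sections (interests, activities, movies, music, TV shows): build one vector per section, or concatenate them, and rank the 10 other people by similarity to you.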