Text summarization dataset - text

Does anyone have a text summarization dataset containing both the texts and their summaries?
I found http://www.nist.gov/tac/data/past/2009/Summ09.html, but its distribution requires a lot of paperwork and authorization.
Can somebody please help me here?
Thanks.

The dataset should be used only if you have an agreement with NIST for the data. Try signing an agreement and getting authorization from your organization.

There are strict rules about distributing TAC data. You can request access to the data, but you will have to fill in some forms first:
TAC: http://www.nist.gov/tac/data/forms/index.html

Related

Any way to get past the minimum of 20 tokens for text classification - Google NLP API

Is there any way to get past the minimum token requirement for Google's NLP API text classification method? I'm trying to input a short, simple sentence such as "I can't wait for the presidential debates", but this returns an error saying:
Invalid text content: too few tokens (words) to process.
Is there any way to get around this? I've tried inputting random words until the input string reached 20 tokens, but that messes up the labels and confidence a lot of the time. If there is any way around this, such as setting an option or adding something, that would be awesome! If there is no workaround, let me know if you know of another pre-trained text classification model that would work for me!
Also, I can't create the categories and labels I want; there would just be too many needed for what I'm doing, which is why the predefined categories in the NLP API are great. I just need to get rid of that 20-token requirement.
As clarified in the official Content Classification documentation:
Important: You must supply a text block (document) with at least twenty tokens (words) to the classifyText method.
Considering that and checking for possible alternatives, it seems that, unfortunately, there isn't a way to work around this: you will indeed need to supply at least 20 words.
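For reference, here is a rough sketch of what the documented call looks like with the Python client (this assumes the google-cloud-language package, v2, and configured credentials; the example sentence is simply padded with on-topic words to clear the 20-token minimum, which is the very limitation being discussed):

```python
# Sketch only: classifying text with the Cloud Natural Language API.
# Assumes google-cloud-language (v2) is installed and
# GOOGLE_APPLICATION_CREDENTIALS points at valid credentials.
from google.cloud import language_v1

client = language_v1.LanguageServiceClient()

# classifyText rejects inputs with fewer than 20 tokens, so the short
# sentence from the question is extended with extra on-topic words here.
text = (
    "I can't wait for the presidential debates. The candidates will "
    "discuss the economy, healthcare, foreign policy and several other "
    "topics on stage during the event."
)

document = language_v1.Document(
    content=text, type_=language_v1.Document.Type.PLAIN_TEXT
)
response = client.classify_text(request={"document": document})

for category in response.categories:
    print(category.name, category.confidence)
```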
For this reason, searching around, I found this one here and this other one (the latter in Chinese, but it might still help you :) ) with pre-trained models for text classification that I believe might help you.
Anyway, feel free to raise a Feature Request in Google's Issue Tracker for them to look into the possibility of removing this limitation.
Let me know if the information helped you!

Using Learning To Rank on textual documents?

I need some help in implementing Learning To Rank (LTR). It is related to my semester project, and I'm totally new to this. The details are as follows:
I gathered around 90 documents and wrote 10 user queries. Now I have to rank these documents against each query using three algorithms, specifically LambdaMART, AdaRank, and Coordinate Ascent. Previously I applied clustering techniques on a Vector Space Model, but that was easy. In this case, however, I don't know how to transform the data into the form these algorithms expect; I have this textual data (documents and queries) in .txt format in separate files. I have searched for solutions online and couldn't find a proper one, so can anyone here please guide me in the right direction, i.e. the steps involved? I would really appreciate it.
As you said, you have already applied clustering on a vector space model; the input to these algorithms is also vectors.
Why don't you have a look at the standard dataset introduced for the learning-to-rank task (the LETOR benchmark), in which documents are represented as vectors of features?
There is also a Java implementation of these algorithms (RankLib), which may give you an idea of how to solve the problem. I hope this helps!
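To make that concrete, here is a rough sketch (in Python with scikit-learn; the folder names and the single TF-IDF cosine feature are only assumptions) of turning raw documents and queries into the LETOR-style feature-vector file that RankLib's LambdaMART, AdaRank and Coordinate Ascent implementations consume. In practice you would add more features and fill in real relevance labels:

```python
# Sketch: convert plain-text documents and queries into LETOR/RankLib format.
# Assumes documents/ and queries/ contain .txt files; labels are placeholders.
from pathlib import Path
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = {p.stem: p.read_text(encoding="utf-8") for p in sorted(Path("documents").glob("*.txt"))}
queries = {p.stem: p.read_text(encoding="utf-8") for p in sorted(Path("queries").glob("*.txt"))}

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(docs.values())

with open("train.txt", "w") as out:
    for qid, qtext in enumerate(queries.values(), start=1):
        query_vec = vectorizer.transform([qtext])
        scores = cosine_similarity(query_vec, doc_matrix)[0]
        for doc_name, score in zip(docs, scores):
            label = 0  # TODO: your own relevance judgment for this (query, document) pair, e.g. 0-2
            # LETOR / RankLib line format: <label> qid:<id> <feature>:<value> ... # <comment>
            out.write(f"{label} qid:{qid} 1:{score:.4f} # {doc_name}\n")
```

Each output line describes one (query, document) pair, which is the format RankLib trains on.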

suggest list of how-to articles based on text content

I have 20,000 messages (combination of email and live chat) between my customer and my support staff. I also have a knowledge base for my product.
Oftentimes, the questions customers ask are quite simple, and my support staff simply point them to the right knowledge base article.
What I would like to do, in order to save my support staff time, is to show my staff a list of articles that may likely be relevant based on the initial user's support request. This way they can just copy and paste the link to the help article instead of loading up the knowledge base and searching for the article manually.
I'm wondering what solutions I should investigate.
My current line of thinking is to run analysis on existing data and use a text classification approach:
For each message, see if there is a response with a link to a how-to article
If yes, extract key phrases (Microsoft Cognitive Services)
TF-IDF?
Treat each how-to as a 'classification' that belongs to sets of key phrases
Use some supervised machine learning, maybe support vector machines, to predict which 'classification' (i.e. how-to article) matches the key phrases extracted from a new support ticket.
Feed new responses back into the set to make the system smarter.
Not sure if I'm overcomplicating things. Any advice on how this is done would be appreciated.
PS: the naive approach of just dumping 'key phrases' into the search query of our knowledge base yielded poor results, since the content of a help article is often worded differently than how a person phrases their question in an email or live chat.
A simple classifier along the lines of a "spam" classifier might work, except that instead of the single spam/not-spam decision there would be one decision per FAQ.
Most spam classifiers start off with a dictionary of words/phrases. You already have a start on this with your naive approach. However, unlike your approach, a spam classifier does much more than a text search: essentially, each word in the customer's email is given a weight, and the sum of the weights indicates whether the message is spam or not-spam. Now extend this to as many decisions as there are FAQs, i.e. FAQ1 or not-FAQ1, FAQ2 or not-FAQ2, and so on.
Since your support people can easily identify which of the FAQs an e-mail requires, a supervised learning algorithm would be appropriate. To reduce the impact of any misclassification errors, consider having the application present a support person with the customer's email followed by the computer-generated response; all the support person would have to do is approve the response or modify it. Modifying a response should result in a new entry in the training set.
Support Vector Machines are one method to implement machine learning. However, you are probably suggesting this solution way too early in the process: first identify the problem and get a simple method to work as well as possible before moving to more sophisticated methods. After all, if a multi-FAQ spam-style classifier works, why invest more time and money in something else that also works?
Finally, depending on your system, this is something I would like to work on.
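To make the spam-classifier analogy concrete, here is a rough sketch in Python with scikit-learn; the message texts and FAQ ids are made up, and in practice the labels would come from the articles your staff actually linked in past responses:

```python
# Sketch: one spam-style decision per FAQ, via a one-vs-rest linear classifier.
# The training messages and FAQ ids below are illustrative placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline

messages = [
    "How do I reset my password?",
    "The invoice I received has the wrong billing address",
    "I forgot my login details and can't get into my account",
]
labels = ["faq_password_reset", "faq_billing", "faq_password_reset"]

# Each word gets a weight per FAQ; summing the weights gives a per-FAQ score,
# just like a spam classifier extended to many FAQ/not-FAQ decisions.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(messages, labels)

new_ticket = "I can't remember my password, please help"
probabilities = model.predict_proba([new_ticket])[0]
for faq, p in sorted(zip(model.classes_, probabilities), key=lambda t: -t[1]):
    print(faq, round(p, 3))  # ranked list of candidate articles for the agent
```

A support person then only approves or corrects the suggested article, and corrected tickets go back into the training set.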

Text Mining - What is the best way to mine descriptive excel sheet data

I have university placement data pulled from databases into an Excel sheet. I need to text-mine the job descriptions offered by companies, which is a descriptive field on every row, and then come up with an analysis of the profiles in demand.
Here is a snapshot of the data
Could anyone help me to kick start this activity?
Thanks
Saurabh
I am not a data expert, but I have some data mining experience. I would try following these steps for starters:
Excel is not a good fit for such an analysis. Find a tool dedicated to data mining, e.g. RStudio; R has many useful out-of-the-box algorithms for data mining.
Cleanse the data, e.g. convert all text to lower case, remove stop words, remove punctuation, remove extra white space.
Tokenize the data, e.g. into one-word tokens such as "finance", "bachelor".
Decide how you will determine whether a certain profile is in demand or not. If by "profile" you mean the frequency with which certain tokens (e.g. "finance", "bachelor") appear in the data more often than others, then simply create a frequency matrix. R also lets you visualise this as a word cloud.
This is to start you off :). I am sure there is much more that can be suggested on this matter.
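If Python is more familiar to you than R, the same cleansing, tokenizing and counting steps can be sketched like this (the file name and column name are only assumptions about what your sheet contains):

```python
# Sketch: cleanse, tokenize and count terms in a job-description column.
# Assumes an Excel file named placements.xlsx with a "Job Description" column;
# both names are placeholders for whatever your sheet actually uses.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.read_excel("placements.xlsx")
descriptions = df["Job Description"].fillna("").str.lower()

# CountVectorizer handles punctuation removal, tokenization and stop words.
vectorizer = CountVectorizer(stop_words="english")
term_matrix = vectorizer.fit_transform(descriptions)

# Overall term frequencies, highest first: a rough view of profiles in demand.
freqs = term_matrix.sum(axis=0).A1
terms = vectorizer.get_feature_names_out()
for term, count in sorted(zip(terms, freqs), key=lambda t: -t[1])[:20]:
    print(term, count)
```

The top terms give a first, rough picture of which profiles are mentioned most often.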

How can I analyze pieces of text for positive or negative words?

I'm looking for some sort of module (preferably for Python) that would let me give it a string about 200 characters long. The module should then return how many positive or negative words that string contains (e.g. love, like, enjoy vs. hate, dislike, bad).
I'd really like to avoid having to reinvent the wheel in natural language processing, so if there is anything you guys know of that would allow me to do what I described above, it'd be a huge time-saver if you could share.
Thanks for the help!
I think you're looking for sentiment analysis. Here's a Twitter sentiment app.
Here's a question about sentiment analysis using Python.
Before you analyse pieces of text, you need to preprocess the given text by stripping punctuation, repairing the language, splitting on spaces, lower-casing the whole text, and storing the words in an iterable data structure.
For some basic sentiment analysis, the following techniques can be used:
Bag of words
In the bag-of-words technique, we basically go through a bag (file) of words and check whether the iterable we built contains them. If it does, we assign some value to each word's presence in order to weigh the total sentiment of the text.
This link should help you understand more about this:
https://en.wikipedia.org/wiki/Bag-of-words_model
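A minimal sketch of this idea in Python (the two word lists here are tiny placeholders; you would swap in a proper opinion lexicon):

```python
# Minimal bag-of-words sentiment sketch.
# The word lists are small placeholders; a real lexicon would be much larger.
import string

POSITIVE = {"love", "like", "enjoy", "great", "good"}
NEGATIVE = {"hate", "dislike", "bad", "awful", "terrible"}

def sentiment_counts(text):
    # Preprocess: lower-case, strip punctuation, split on whitespace.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    pos = sum(word in POSITIVE for word in words)
    neg = sum(word in NEGATIVE for word in words)
    return pos, neg

print(sentiment_counts("I love this phone, but the battery is bad."))  # (1, 1)
```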
Keyword Extraction and Tagging
Keywords and important information can be extracted from the input text by tagging the elements and then removing unwanted data.
For example:
My name is John.
Here "John" and "name" carry the information, and "is" isn't really needed.
Similarly, verbs and other unimportant words can be removed in order to retain only the main information.
Chunking and chinking help here.
This link should be of help:
http://nltk.org/book/ch07.html
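A small sketch of that chapter's tagging and chunking approach with NLTK (the tokenizer and tagger models need to be downloaded once, and the chunk grammar below is just one simple noun-phrase pattern):

```python
# Sketch: POS tagging plus chunking with NLTK to keep the informative words.
# Requires: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
import nltk

sentence = "My name is John."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)  # e.g. [('My', 'PRP$'), ('name', 'NN'), ('is', 'VBZ'), ...]

# Chunk noun phrases; verbs such as "is" fall outside the chunks (chinked out).
grammar = r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}"
chunker = nltk.RegexpParser(grammar)
tree = chunker.parse(tagged)

for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
    print(" ".join(word for word, tag in subtree.leaves()))
# -> "My name" and "John"
```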
You can tokenize your text and get its sentiment using existing sentiment analysis tools. The most comprehensive resource that I know of is SentiBench, which is basically a survey of sentiment analysis tools, along with code and examples of how to use them.
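As a quick way to try one such tool (named here purely as an illustration, not as SentiBench's specific recommendation), NLTK ships the lexicon-based VADER analyzer, which scores a short string directly:

```python
# Sketch: scoring a short string with NLTK's bundled VADER sentiment analyzer.
# Requires: nltk.download("vader_lexicon")
from nltk.sentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()
scores = analyzer.polarity_scores("I love this phone but I hate the battery.")
print(scores)  # dict with 'neg', 'neu', 'pos' and a 'compound' score
```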

Resources