I want to summarize some news articles. I need a dataset.
There is the BBC dataset, but the problem is that I can't evaluate my output against reference summaries.
This dataset can be used for your text summarization task, and you can check your output against the provided summaries.
https://www.kaggle.com/sunnysai12345/news-summary/data
If you need a news article dataset, this can be useful:
http://research.signalmedia.co/newsir16/signal-dataset.html (a dataset of one million news articles)
I am using gensim LDA for topic modelling.
I need to get the topic distribution of a corpus, not the individual documents.
Let's say I have 1000 documents belonging to 10 different categories (say, 100 docs per category).
After training the LDA model over all 1000 documents, I want to see what the dominant topics of each category are.
So far I can think of two approaches, but I am not sure either is sound; I would be happy to know if there is a better way of doing it.
In the first approach, I concatenate the documents of each category into one large document, so there are only 10 large documents, and for each of them I can retrieve its topic distribution.
The other approach is to get the topic distribution of every document individually, without concatenating. Each category then has 100 document-level topic distributions. To get the dominant topics of a category, I could sum the probability of each topic across its documents and keep only the few highest-scoring topics.
I am not sure either of these approaches is right; what would you suggest?
In approach 1), you are concatenating documents (of possibly different lengths) and getting the topics of one big document, so the influence of the smaller documents is likely to be diminished.
In approach 2), documents of all lengths get roughly equal importance (depending on how you combine the topic distributions).
Which approach you should go with will depend on your use case.
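As a minimal sketch of approach 2 (assuming gensim's LdaModel and assuming you keep a list of category labels parallel to the corpus; the toy documents below are only placeholders), you could sum the per-document topic distributions within each category and keep the few highest-scoring topics:

```python
from collections import defaultdict
from gensim import corpora, models

# Placeholder tokenized documents and a parallel list of category labels;
# in practice these would be the 1000 documents and 10 categories.
docs = [["economy", "market", "stocks"], ["match", "goal", "league"]]
labels = ["finance", "sports"]

dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10)

# Approach 2: sum per-document topic probabilities within each category
category_scores = defaultdict(lambda: defaultdict(float))
for bow, label in zip(corpus, labels):
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        category_scores[label][topic_id] += prob

# Keep only the few highest-scoring topics per category
for label, scores in category_scores.items():
    top = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:3]
    print(label, top)
```

Dividing each category's sums by its document count turns them into an average distribution, which keeps categories of different sizes comparable.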
I'm working on text classification and I want to use Topic models (LDA).
My corpus consists of at least 24,000 Persian news documents. Each doc in the corpus is in the format of (keyword, weight) pairs extracted from the news.
I have seen two Java toolkits: Mallet and LingPipe.
I've read the Mallet tutorial on importing data, and it takes data as plain text, not the format that I have. Is there any way I could convert mine?
I've also read a little about LingPipe; the example from its tutorial was using arrays of integers. Is that practical for large data?
I need to know which implementation of LDA is better for me. Are there any other implementations (in Java) that suit my data?
From the keyword-weight file you can create an artificial text containing the words in random order, repeated according to the given weights. Run Mallet on the generated texts to retrieve the topics.
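A minimal sketch of that idea, assuming the weights are (or can be rescaled to) rough relative frequencies; the pseudo-documents can then be written to plain-text files and imported through Mallet's normal import commands:

```python
import random

def keywords_to_text(pairs, scale=100):
    """Turn (keyword, weight) pairs into an artificial text by repeating
    each keyword roughly in proportion to its weight, then shuffling."""
    words = []
    for keyword, weight in pairs:
        words.extend([keyword] * max(1, int(round(weight * scale))))
    random.shuffle(words)
    return " ".join(words)

# Hypothetical document: (keyword, weight) pairs with weights summing to ~1
doc = [("election", 0.4), ("parliament", 0.35), ("economy", 0.25)]
print(keywords_to_text(doc))
```

Once each pseudo-document is written to its own file, `mallet import-dir` followed by `mallet train-topics` runs LDA over them as usual.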
Is there any publicly available news+summary corpus for automatic summarization? If yes, can you please provide a way to get it?
Here you can also get the Priberam Compressive Summarization Corpus for free; it's in Portuguese:
http://labs.priberam.com/Resources/PCSC.aspx
This corpus contains 801 documents split into 80 topics, each of which has 10 documents (one has 11). The documents are news stories from major Portuguese newspapers, radio and TV stations. Each topic also has two human-generated summaries of up to 100 words. The human summaries are compressive: the annotators performed only sentence and word deletion operations.
There is the Open Text Summarizer, downloadable from SourceForge. For more ideas, please see the answers to this question.
I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among different categories like sports, weather, technology, football, cricket, etc.
I need some pointers on where I can find a dataset with these categories.
Another option for me is to crawl Wikipedia to get data for the 30+ categories, but I wanted some brainstorming and opinions on whether there is a better way to do this.
Edit
Train the model using the bag-of-words approach for these categories.
Test: classify new/unknown websites into these predefined categories depending on the content of the webpage.
The UCI machine learning repository contains a searchable archive of datasets for supervised learning.
You might get better answers if you provide more specific information about what inputs and outputs your ideal dataset would have.
Edit:
It looks like dmoz has a dump that you can download.
A dataset of newsgroup messages, classified by subject
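A rough sketch of the bag-of-words setup described in the edit, assuming scikit-learn is an option; the 20 Newsgroups loader is only a stand-in for whatever category dataset you settle on:

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Two stand-in categories; swap in sports/weather/technology data as available
categories = ["rec.sport.baseball", "sci.electronics"]
train = fetch_20newsgroups(subset="train", categories=categories)
test = fetch_20newsgroups(subset="test", categories=categories)

# Bag-of-words features fed into a Naive Bayes classifier
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train.data, train.target)
print("accuracy:", model.score(test.data, test.target))

# Classify the text of a new/unknown page the same way
print(model.predict(["The batter hit a home run in the ninth inning"]))
```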
I am embarking upon a NLP project for sentiment analysis.
I have successfully installed NLTK for Python (it seems like a great piece of software for this). However, I am having trouble understanding how it can be used to accomplish my task.
Here is my task:
I start with one long piece of data (let's say several hundred tweets on the subject of the UK election, pulled from a web service)
I would like to break this up into sentences (or chunks no longer than 100 or so characters); I guess I can just do this in Python (see the sketch below)
Then I want to search through all the sentences for specific mentions, e.g. "David Cameron"
Then I would like to check for positive/negative sentiment in each sentence and count them accordingly
NB: I am not too worried about accuracy because my datasets are large, and I am also not too worried about sarcasm.
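For the sentence-splitting and searching steps, a minimal sketch with NLTK's sentence tokenizer (the text here is just a placeholder for the concatenated tweets):

```python
import nltk

# Sentence tokenizer models (newer NLTK releases may also need "punkt_tab")
nltk.download("punkt", quiet=True)

# Stand-in for the long piece of data pulled from the web service
raw_text = ("David Cameron spoke about the economy today. "
            "The debate was dull. David Cameron was booed by the crowd.")

sentences = nltk.sent_tokenize(raw_text)
mentions = [s for s in sentences if "David Cameron" in s]
print(len(mentions), "sentences mention David Cameron")
```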
Here are the troubles I am having:
All the datasets I can find, e.g. the movie review corpus that comes with NLTK, aren't in a web-service format. It looks like this data has already had some processing done. As far as I can see, the processing (by Stanford) was done with WEKA. Is it not possible for NLTK to do all this on its own? All of these datasets have already been organised into positive/negative, e.g. the polarity dataset (http://www.cs.cornell.edu/People/pabo/movie-review-data/). How is this done? (To organise the sentences by sentiment, is it definitely WEKA, or something else?)
I am not sure I understand why WEKA and NLTK would be used together; they seem to do much the same thing. If I'm processing the data with WEKA first to find sentiment, why would I need NLTK? Could you explain why this might be necessary?
I have found a few scripts that get somewhat near this task, but all are using the same pre-processed data. Is it not possible to process this data myself to find sentiment in sentences rather than using the data samples given in the link?
Any help is much appreciated and will save me much hair!
Cheers Ke
The movie review data has already been marked by humans as being positive or negative (the person who made the review gave the movie a rating which is used to determine polarity). These gold standard labels allow you to train a classifier, which you could then use for other movie reviews. You could train a classifier in NLTK with that data, but applying the results to election tweets might be less accurate than randomly guessing positive or negative. Alternatively, you can go through and label a few thousand tweets yourself as positive or negative and use this as your training set.
For a description of using Naive Bayes for sentiment analysis with NLTK: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/
Then in that code, instead of using the movie corpus, use your own data to calculate word counts (in the word_feats method).
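A minimal sketch of that substitution, assuming you have (or hand-label) your own positive/negative examples; the word_feats function mirrors the idea from the linked post:

```python
from nltk.classify import NaiveBayesClassifier

def word_feats(words):
    # Simple bag-of-words presence features, as in the linked tutorial
    return {word: True for word in words}

# Hand-labelled training examples would go here; these pairs are placeholders
train_data = [
    ("great result for david cameron tonight".split(), "pos"),
    ("terrible performance in the debate".split(), "neg"),
]
train_set = [(word_feats(words), label) for words, label in train_data]

classifier = NaiveBayesClassifier.train(train_set)
print(classifier.classify(word_feats("what a great debate".split())))
```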
Why don't you use WSD? Use a disambiguation tool to find senses, and map polarity to the senses instead of to the words. In this case you will get somewhat more accurate results compared to word-level polarity.