Is there any publicly available news+summary corpus for automatic summarization? If yes, can you please provide a way to get it?
You can also get the Priberam Compressive Summarization Corpus for free; it is in Portuguese:
http://labs.priberam.com/Resources/PCSC.aspx
The corpus contains 801 documents split into 80 topics of 10 documents each (one topic has 11). The documents are news stories from major Portuguese newspapers and radio and TV stations. Each topic also has two human-generated summaries of up to 100 words. The summaries are compressive: the annotators performed only sentence and word deletion operations.
There is the Open Text Summarizer, downloadable from SourceForge. For more ideas, please see the answers to this question.
We have a news website where we need to match news articles to a particular user.
For the matching we can only use the user's textual information, for example the user's interests or a brief description about them.
I was thinking of treating both the user's textual information and the news text as documents and computing document similarity.
In this way, I hope that if my profile contains a sentence like "I loved the speech the president gave in Chicago last year" and a news article says "Trump is going to speak in Illinois", I can get a match (the example is purely illustrative).
I first tried embedding my documents with TF-IDF and then running k-means to see whether anything sensible came out, but I am not very happy with the results.
I think the problem comes from the poor embeddings that TF-IDF gives me.
So I was thinking of using BERT to obtain embeddings of my documents and then using cosine similarity to measure the similarity of two documents (a user-profile document and a news article).
Is this an approach that could make sense? BERT can be used to obtain sentence embeddings, but is there a way to embed an entire document?
What would you advise?
Thank you
BERT is trained on pairs of sentences, so it is unlikely to generalize to much longer texts. Also, BERT's memory requirements grow quadratically with the length of the input, so very long texts may cause memory issues. In most implementations, it does not accept sequences longer than 512 subwords.
Making pre-trained Transformers work efficiently for long texts is an active research area; you can have a look at a paper called DocBERT to get an idea of what people are trying. But it will take some time until there is a nicely packaged, working solution.
There are also other methods for document embedding; for instance, Gensim implements doc2vec. However, I would still stick with TF-IDF.
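If you do want to try doc2vec, a minimal Gensim sketch looks roughly like this; the toy tokenized documents and the hyperparameters are only placeholders to be tuned on your own data:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Toy corpus: in practice these would be your tokenized user profiles and news articles.
texts = [
    ["loved", "the", "speech", "the", "president", "gave", "in", "chicago"],
    ["trump", "is", "going", "to", "speak", "in", "illinois"],
]
tagged = [TaggedDocument(words=doc, tags=[i]) for i, doc in enumerate(texts)]

# Small vector size and epoch count, just for illustration.
model = Doc2Vec(tagged, vector_size=100, window=5, min_count=1, epochs=40)

# Infer a vector for a new, unseen document and look up similar training documents.
new_vec = model.infer_vector(["the", "president", "visited", "chicago"])
print(model.dv.most_similar([new_vec], topn=2))  # model.docvecs in older Gensim versions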
TF-IDF is typically very sensitive to data pre-processing. You certainly need to remove stopwords, and in many languages it also pays off to do lemmatization. Given the specific domain of your texts, you can also try extending the standard stop-word list with words that appear frequently in news stories. You can get further improvements by detecting named entities and keeping them together.
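As a rough sketch of that TF-IDF route with scikit-learn and NLTK (assuming the usual NLTK downloads for the tokenizer, stop words, and WordNet; the toy texts and the English stop-word list are placeholders for your own data and language):

from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))  # extend with words frequent in news stories

def preprocess(text):
    # Lowercase, tokenize, drop stopwords and non-alphabetic tokens, lemmatize.
    tokens = word_tokenize(text.lower())
    return " ".join(lemmatizer.lemmatize(t) for t in tokens
                    if t.isalpha() and t not in stop_words)

profile = "I loved the speech the president gave in Chicago last year"
article = "Trump is going to speak in Illinois"

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([preprocess(profile), preprocess(article)])

# Cosine similarity between the user profile and the news article.
print(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

In practice you would fit the vectorizer on all news articles plus profiles, so that the IDF weights reflect your whole collection rather than a single pair.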
I am working on a document classification problem for financial reports/documents. Is there a ready-made corpus for this? I found a couple of use cases, but they all built their own corpus.
You will more than likely have to create your own corpus. I had a similar task, and manually building such a corpus would have been too tedious. As a result I created News Corpus Builder, a Python module that lets you quickly build a corpus around your particular topics of interest.
The module allows you to generate your own corpus and store the text and associated label in SQLite or as flat files.
from news_corpus_builder import NewsCorpusGenerator

# Location to save the generated corpus
corpus_dir = '/Users/skillachie/finance_corpus'

# Save results to sqlite or to flat files, one per article
ex = NewsCorpusGenerator(corpus_dir, 'sqlite')

# Retrieve 50 links related to the search term 'dogs' and assign the category 'Pet' to them
links = ex.google_news_search('dogs', 'Pet', 50)

# Generate and save the corpus
ex.generate_corpus(links)
More details on my blog
The finance corpus is available for download here. It has the following categories:
Policy (licenses, regulation, SEC, fed, monetary, fiscal, IMF)
International Finance (global finance, IMF, ECB, trouble in Greece, RMB devaluation)
Economy (GDP, jobs, unemployment, housing, economy)
Raising Capital (IPO, equity)
Real Estate
Mergers & Acquisitions (mergers, acquisitions)
Oil (oil, oil prices, natural gas prices)
Commodities (commodities, gold, silver)
Fraud (insider trading, Ponzi schemes, finance fraud)
Litigation (company litigation, company settlements)
Earnings Reports
You can use the Reuters-21578 corpus. http://www.daviddlewis.com/resources/testcollections/reuters21578/
It is a basic corpus for text classification.
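If it helps, NLTK ships an ApteMod version of Reuters-21578, so after a one-time nltk.download('reuters') you can explore it roughly like this:

import nltk
from nltk.corpus import reuters

nltk.download("reuters")  # one-time download of the ApteMod split

print(len(reuters.fileids()))      # roughly 10,788 documents in train/test splits
print(reuters.categories()[:10])   # topic labels such as 'acq', 'earn', ...

# Documents are multi-labelled; inspect one article and its topics.
fid = reuters.fileids()[0]
print(reuters.categories(fid))
print(reuters.raw(fid)[:200])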
I'm looking to train a naive Bayes classifier on some new data sources that haven't been used before. I've already looked at the Pang & Lee corpus of IMDb reviews and the MPQA opinion corpus. I'm looking for new web services that fit the following criteria:
Easily classified - must have a like/dislike or 5-star rating
Readily available
Pertain to new material (less important than the first two)
Here are some samples I have come up with on my own.
Etsy API
Rotten Tomatoes API
Yelp API
Any other suggestions would be much appreciated =)
In Pang & Lee's later work (2008), "Opinion Mining and Sentiment Analysis" (here), they have a section on publicly available resources with links to those corpora.
Take a look at Sentiment140. It has a corpus that you can download and train on, and you can easily extend it to new tweets.
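As a minimal sketch of the training step once you have such a corpus (the file name and column layout below assume the Sentiment140 CSV, where the first field is the polarity and the last field is the tweet text; adjust them for whatever source you choose):

import csv

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

texts, labels = [], []
# Assumed Sentiment140-style CSV: polarity, id, date, query, user, text.
with open("training.1600000.processed.noemoticon.csv", encoding="latin-1") as f:
    for row in csv.reader(f):
        labels.append(int(row[0]))   # 0 = negative, 4 = positive
        texts.append(row[-1])

X_train, X_test, y_train, y_test = train_test_split(texts, labels, test_size=0.2, random_state=0)

vectorizer = CountVectorizer()  # bag-of-words features
clf = MultinomialNB()
clf.fit(vectorizer.fit_transform(X_train), y_train)

pred = clf.predict(vectorizer.transform(X_test))
print("accuracy:", accuracy_score(y_test, pred))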
I am experimenting with classification algorithms in ML and am looking for a corpus to train my model to distinguish among categories like sports, weather, technology, football, cricket, etc.
I need some pointers on where I can find datasets with these categories.
Another option for me is to crawl Wikipedia to get data for the 30+ categories, but I wanted some brainstorming and opinions on whether there is a better way to do this.
Edit
Train the model on these categories using a bag-of-words approach.
Test - classify new/unknown websites into the predefined categories based on the content of the webpage.
The UCI machine learning repository contains a searchable archive of datasets for supervised learning.
You might get better answers if you provide more specific information about what inputs and outputs your ideal dataset would have.
Edit:
It looks like dmoz has a dump that you can download.
A dataset of newsgroup messages, classified by subject
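Assuming the newsgroup dataset meant here is the classic 20 Newsgroups collection, scikit-learn can fetch it directly, which makes it easy to prototype the bag-of-words train/test setup described in the question; the category subset below is just an example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Example subset of categories; the full collection has 20.
cats = ["rec.sport.baseball", "rec.sport.hockey", "sci.electronics", "comp.graphics"]
train = fetch_20newsgroups(subset="train", categories=cats, remove=("headers", "footers", "quotes"))
test = fetch_20newsgroups(subset="test", categories=cats, remove=("headers", "footers", "quotes"))

# Bag-of-words features plus a simple linear classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train.data, train.target)

print("test accuracy:", model.score(test.data, test.target))

# Classifying new/unknown text works the same way.
print(train.target_names[model.predict(["the goalie made an incredible save"])[0]])

The same pipeline transfers directly to your own crawled webpages once you have labels for them.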
I have encountered a rather unusual problem. I have a set of phrases (noun phrases) extracted from a large corpus of documents. The phrases are 2 to 3 words long. They need to be clustered because there are very many of them, and showing them as a flat list would not be useful for the user.
We are looking for simple ways of clustering these. Is there a quick tool/software/method I could use to cluster them so that all phrases inside a cluster belong to a particular theme/topic, if I fix the number of topics initially? I don't have any training set or existing clusters that I could use as a training set.
Topic classification is not an easy problem.
The conventional methods used to classify long documents (hundreds of words) are usually based on frequent words and are not suitable for very short texts. I believe your problem is somewhat similar to tweet classification.
Two very interesting papers are:
Discovering Context: Classifying Tweets through a Semantic Transform Based on Wikipedia (presented at HCI International 2011)
Eddi: Interactive Topic-based Browsing of Social Status Streams (presented at UIST'10)
If you want to include knowledge about the world so that, e.g., cat and dog will be clustered together, you can use WordNet's domains hierarchy.
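As a minimal illustration of the WordNet idea with NLTK (this uses the plain hypernym hierarchy rather than the separate WordNet Domains resource, and it only looks at single head words, not full phrases):

from nltk.corpus import wordnet as wn  # requires nltk.download('wordnet')

def head_synset(word):
    # Take the most common noun sense of a word (a crude but simple choice).
    synsets = wn.synsets(word, pos=wn.NOUN)
    return synsets[0] if synsets else None

cat, dog, car = head_synset("cat"), head_synset("dog"), head_synset("car")

# 'cat' and 'dog' share a close ancestor (carnivore/animal); 'cat' and 'car' only share a generic one.
print(cat.lowest_common_hypernyms(dog))   # e.g. [Synset('carnivore.n.01')]
print(cat.lowest_common_hypernyms(car))

# Path similarity can serve as a distance for clustering the phrase heads.
print(cat.path_similarity(dog), cat.path_similarity(car))

The lowest common hypernym, or a threshold on path similarity, can then act as the grouping criterion for the heads of your noun phrases.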