Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 4 years ago.
I am doing a project in news classification. Basically, the system will classify news articles by pre-defined topic (e.g. sports, politics, international). To build the system, I need free data sets for training.
So far, after a few hours of googling and following links from here, the only suitable data set I could find is this. While this will hopefully be enough, I think I will try to find more.
Note that the data sets I want should:
Contain full news articles, not just titles
Be in English
Be in .txt format, not XML or db
Can anybody help me?
Have you tried Reuters21578? It is the most common dataset for text classification. It is formatted in SGML, but it is quite simple to parse and convert to a txt format.
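As a rough illustration of the "parse and convert" step, here is a minimal sketch using only the standard library. It assumes the usual Reuters-21578 layout, where each article sits in a <REUTERS> element containing <TOPICS>, <TITLE> and <BODY>. The function names and the shortened sample record are made up for the example.

```python
import html
import re

# Each article is wrapped in <REUTERS ...> ... </REUTERS>.
ARTICLE_RE = re.compile(r"<REUTERS\b[^>]*>(.*?)</REUTERS>", re.DOTALL)
TOPICS_RE = re.compile(r"<TOPICS>(.*?)</TOPICS>", re.DOTALL)
TOPIC_ITEM_RE = re.compile(r"<D>(.*?)</D>")
TITLE_RE = re.compile(r"<TITLE>(.*?)</TITLE>", re.DOTALL)
BODY_RE = re.compile(r"<BODY>(.*?)</BODY>", re.DOTALL)


def first_group(pattern, text):
    """Return the first captured group with entities decoded, or '' if absent."""
    match = pattern.search(text)
    return html.unescape(match.group(1)).strip() if match else ""


def parse_reuters_sgml(sgml_text):
    """Extract topics, title, and body from every article in one .sgm file."""
    articles = []
    for match in ARTICLE_RE.finditer(sgml_text):
        block = match.group(1)
        topics_match = TOPICS_RE.search(block)
        topics = TOPIC_ITEM_RE.findall(topics_match.group(1)) if topics_match else []
        articles.append({
            "topics": topics,
            "title": first_group(TITLE_RE, block),
            "body": first_group(BODY_RE, block),
        })
    return articles


# Shortened, hypothetical record in the assumed layout.
sample = """<REUTERS TOPICS="YES" LEWISSPLIT="TRAIN" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>BAHIA COCOA REVIEW</TITLE>
<BODY>Showers continued throughout the week in the Bahia cocoa zone.
</BODY></TEXT>
</REUTERS>"""

articles = parse_reuters_sgml(sample)
```

Each dictionary can then be written to its own .txt file, with the topic list serving as the class label. When reading the real files, opening them with encoding="latin-1" is a safe assumption, since they are not guaranteed to be valid UTF-8.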
You can build it yourself: write a Python/Perl/PHP script that runs a search, then isolate the attributes you need from the results with regular expressions. I think this is the best option. It is not easy, but it should be fun, and in the end you can share the dataset with us.
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 days ago.
Is there any open source project or freely available source where I can query a webpage's category type (like https://www.trustedsource.org/en/feedback/url)? I have more than 200K webpages in my dataset.
To me this looks like a classification problem, which is a good fit for machine learning. You can build your own model in a popular ML framework (such as Keras/TensorFlow or PyTorch), or search the internet for an existing model and apply transfer learning with your dataset.
I could find a project on GitHub (link) that can be a good starting point.
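To make the "classification problem" framing concrete, here is a minimal sketch of a bag-of-words naive Bayes classifier in plain Python. It stands in for the deep learning frameworks mentioned above and is not transfer learning. The class name, the toy training texts and the labels are all invented for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayesTextClassifier:
    """Multinomial naive Bayes over word counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, doc_count in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(doc_count / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy training data; a real model needs many labeled pages.
train_texts = [
    "football match score goal league",
    "team wins championship final game",
    "stock market shares bank profit",
    "interest rates bank inflation economy",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = NaiveBayesTextClassifier().fit(train_texts, train_labels)
prediction = model.predict("the bank raised interest rates")
```

For 200K pages, the text passed to fit would be the visible text extracted from each page, and a held-out labeled subset would be needed to check accuracy.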
Hi today and happy weekend!
It would be interesting to know whether a category is used for category pages, since Google shows multiple results from one domain when it has category pages.
Examples:
danlok(com)
best example to see: bloomberg....
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
Assume a site like https://www.wood-database.com/wood-finder/ (our working example). Each of its pages has data on one wood species. If we need to sort the woods by a ratio of two properties, for example hardness/weight, the site's own tools aren't very useful.
What would be useful, though, is getting that data into an Excel sheet, which could trivially calculate the ratio and sort.
What ways are there to fill out that sheet automatically? What other tools besides Excel could do it?
You should have a look at Python; it's a perfect fit for the job. You could use the requests library together with BeautifulSoup to begin with, then load all the data into a pandas DataFrame and simply export it to Excel (standard functionality of pandas).
If you really want to scrape the site thoroughly, you could consider using Scrapy (https://scrapy.org/).
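The parse, compute and export steps can be sketched with the standard library alone, so the example runs without installing anything. It uses html.parser and csv in place of BeautifulSoup and pandas. The table markup, the column names and the numbers are hypothetical, and the real site's HTML will differ.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rank_by_ratio(page_html):
    """Return (name, hardness / weight) pairs, highest ratio first."""
    parser = TableRowParser()
    parser.feed(page_html)
    header, *records = parser.rows
    name_i = header.index("Name")
    hard_i = header.index("Hardness")
    weight_i = header.index("Weight")
    ranked = [(r[name_i], float(r[hard_i]) / float(r[weight_i])) for r in records]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


def to_csv(ranked):
    """Serialize the ranking as CSV text, which Excel can open directly."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["Name", "Hardness/Weight"])
    writer.writerows(ranked)
    return buffer.getvalue()


# Hypothetical markup and made-up numbers, for illustration only.
sample_html = """
<table>
  <tr><th>Name</th><th>Hardness</th><th>Weight</th></tr>
  <tr><td>Wood A</td><td>1000</td><td>500</td></tr>
  <tr><td>Wood B</td><td>900</td><td>300</td></tr>
</table>
"""

ranked = rank_by_ratio(sample_html)
csv_text = to_csv(ranked)
```

In a real run, page_html would come from an HTTP request for each species page, and the values would likely need unit stripping before the float conversion.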
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I've got a pet project for which I need graphics of the outlines of certain countries (mostly European countries). I want to dynamically generate country graphics like the image below, preferably in a combination of JavaScript, HTML and CSS. I've been Googling for a bit and found: http://www.dafont.com/geobats.font.
It is nearly perfect; the sad thing is that a few countries are missing. I have no clue how to edit TTF files, so I'm not able to update it myself. I also lack the Photoshop skills to create the images I need by hand. So I was hoping you guys could help me out. Is there a site where I can get SVGs* of several countries, or a TTF file such as geobats, only with more (European) countries? Thanks in advance.
*In the case of SVGs I'd prefer cutouts over outlines.
Update 1: I've included an image to show which kind of graphics I'm trying to make.
Mike Bostock is a map geek and has a project separate from d3, topojson, with all kinds of sampling and projection features. This may be too much for your project, but he also has a blog post that covers finding data while demonstrating topojson's capabilities. The link is:
http://bost.ocks.org/mike/map/#finding-data
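Once boundary data is available as GeoJSON (an assumption; TopoJSON would need converting first), turning a country into a filled SVG shape is mostly string formatting. Here is a minimal sketch with a plain equirectangular mapping. The function names, the scale factor and the square test shape are invented for illustration.

```python
def ring_to_path(ring, scale=10.0):
    """Convert one ring of (longitude, latitude) pairs into SVG path commands."""
    # Plain equirectangular mapping: x grows with longitude, and y is the
    # negated latitude because the SVG y axis points down.
    points = [f"{lon * scale:.1f},{-lat * scale:.1f}" for lon, lat in ring]
    return "M" + " L".join(points) + " Z"


def geometry_to_svg_path(geometry, scale=10.0):
    """Build the `d` attribute for a GeoJSON Polygon or MultiPolygon."""
    if geometry["type"] == "Polygon":
        polygons = [geometry["coordinates"]]
    elif geometry["type"] == "MultiPolygon":
        polygons = geometry["coordinates"]
    else:
        raise ValueError(f"unsupported geometry type: {geometry['type']}")
    return " ".join(
        ring_to_path(ring, scale) for polygon in polygons for ring in polygon
    )


# Hypothetical square "country", 2 degrees wide and 1 degree tall.
square = {
    "type": "Polygon",
    "coordinates": [[[0, 0], [2, 0], [2, 1], [0, 1], [0, 0]]],
}

path_d = geometry_to_svg_path(square)

# fill gives a solid cutout rather than an outline; evenodd keeps holes empty.
svg = (
    '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 -10 20 10">'
    f'<path d="{path_d}" fill="black" fill-rule="evenodd"/></svg>'
)
```

The resulting string can be saved as an .svg file or inlined in HTML. For maps spanning large latitudes, a proper projection such as the ones d3 provides will look better than this flat mapping.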
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 5 years ago.
I am looking for training data for text classification into categories like sports, finance, politics, music, etc.
Please point me to references.
You can get a Reuters corpus by applying at Reuters
You can also get the Technion Text Repository TechnionRepo
If you are building a text classification system in real time, you would already have a corpus of documents. One assumption behind any classifier is that the training data and test data are similar, i.e. drawn from the same distribution.
If you are just exploring or building sample use cases in this area, then these links might be helpful for getting some training data:
http://web.ist.utl.pt/acardoso/datasets/
http://disi.unitn.it/moschitti/corpora.htm
Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 4 years ago.
I want to determine which parts of an audio file contain speech and which contain music.
I hope someone has made something like this or can tell me where to start.
Can you please suggest a method or tutorial for doing this?
Thank you.
Check out the pyAudioAnalysis Python library. Among other things, it has a pre-trained speech/music classifier and two segmentation-classification methods (one based on fixed-size windows and another based on HMMs).
You can extract speech and music parts of an audio recording quite easily, e.g.:
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc] = aS.mtFileClassification("data/scottish.wav", "data/svmSM", "svm", True, 'data/scottish.segments')
with a result as the one in this image
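To turn the per-window output into explicit speech and music intervals, consecutive windows with the same class can be merged. The sketch below assumes that flagsInd is a sequence of class indices, one per fixed-size window, that classesAll lists the class names, and that windows advance by 1 second. Those are assumptions about the call above, and flags_to_segments is a hypothetical helper, not part of pyAudioAnalysis.

```python
from itertools import groupby


def flags_to_segments(flags, class_names, step_seconds=1.0):
    """Merge consecutive windows with the same class into (start, end, label)."""
    segments = []
    position = 0  # index of the first window in the current run
    for class_index, run in groupby(flags):
        run_length = len(list(run))
        start = position * step_seconds
        end = (position + run_length) * step_seconds
        segments.append((start, end, class_names[int(class_index)]))
        position += run_length
    return segments


# Hypothetical classifier output: three speech windows, two music, one speech.
flags = [0, 0, 0, 1, 1, 0]
segments = flags_to_segments(flags, ["speech", "music"])
```

With the real output, the call would be flags_to_segments(flagsInd, classesAll), with step_seconds set to whatever window step the classifier used.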
There's lots of prior art in this area, but I'd suggest browsing through some of Dan Ellis's papers. The slides for this talk have some good background. In short, it all comes down to picking the right feature vectors.