Looking for training data for text classification [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 5 years ago.
I am looking for training data for text classification into categories like sports, finance, politics, music, etc.
Please point me to some references.

You can get a Reuters corpus by applying to Reuters.
You can also get the Technion Text Repository (TechnionRepo).

If you are building a real-time text classification system, you probably already have a corpus of documents. One of the assumptions behind any classifier is that the training data and the test data are similar, i.e. drawn from the same distribution.
If you are just exploring or building sample use cases in this area, these links might help you find some training data:
http://web.ist.utl.pt/acardoso/datasets/
http://disi.unitn.it/moschitti/corpora.htm
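To make the idea concrete, here is a minimal sketch of what such a categorised corpus is used for: a tiny multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The four training sentences and the two labels are made up for illustration; a real system would train on one of the corpora above and would normally use an established library.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters as tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, docs, labels):
        self.word_counts = defaultdict(Counter)  # label -> token -> count
        self.label_counts = Counter(labels)      # label -> number of documents
        self.vocab = set()
        for doc, label in zip(docs, labels):
            tokens = tokenize(doc)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, doc):
        tokens = tokenize(doc)
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # log prior + sum of smoothed log likelihoods
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up toy data, for illustration only.
train_docs = [
    "The striker scored twice in the final match",
    "The team won the league title after a late goal",
    "Shares fell as the central bank raised interest rates",
    "The bank reported higher quarterly profit and revenue",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = NaiveBayes().fit(train_docs, train_labels)
print(model.predict("The bank raised rates"))
print(model.predict("A late goal won the match"))
```

The classifier only knows the vocabulary it saw during training, which is why the training and test documents need to come from the same distribution.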

Related

Train NER with Custom Training Data [closed]

Closed 1 year ago.
I am new to NLP and started with spaCy. I want to train NER with custom data, and I am looking for a free tool that can be used for annotation.
Please suggest an open-source, user-friendly tool if you are using one.
Thanks in advance.
You can start labeling with:
https://labelstud.io, or
https://github.com/doccano/doccano
The spaCy team also offers Prodigy (https://prodi.gy/), which can be used free of charge in academia.
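Whichever tool you pick, NER annotations usually end up as character-offset spans over the raw text. The sketch below builds that structure by hand; the helper name `to_offsets` and the sample sentence are hypothetical, and the `(text, {"entities": [...]})` layout is an assumption about how you will feed the data to your training code, so check it against the spaCy version you use.

```python
def to_offsets(text, phrases):
    """Turn (surface_string, label) pairs into character-offset spans.

    Each phrase is matched at its first occurrence in `text`. Returns
    (text, {"entities": [(start, end, label), ...]}) with spans sorted
    by start offset.
    """
    entities = []
    for phrase, label in phrases:
        start = text.find(phrase)
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        entities.append((start, start + len(phrase), label))
    return text, {"entities": sorted(entities)}


# Hypothetical example sentence and labels.
example = to_offsets(
    "Apple opened a new office in Berlin.",
    [("Apple", "ORG"), ("Berlin", "GPE")],
)
print(example)
```

An annotation tool does the same job interactively and also copes with repeated phrases, which this first-occurrence lookup does not.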

How can I determine a webpage's category [closed]

Closed 4 days ago.
Is there any open-source project or freely available source where I can query a webpage's category type (like https://www.trustedsource.org/en/feedback/url)? I have more than 200K webpages in my dataset.
To me this looks like a classification problem, which is a good fit for machine learning. You can build your own model in one of the popular ML frameworks (such as Keras/TensorFlow or PyTorch), or search for an existing model online and use your dataset for transfer learning.
I found a project on GitHub (link) that could be a good starting point.
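Before any model can classify a page, the visible text has to be pulled out of the HTML. Here is a minimal sketch of that step using only the standard library; the class name `VisibleText`, the helper `page_text`, and the sample markup are made up for illustration, and real pages will need more care (encodings, navigation menus, boilerplate removal).

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text nodes, ignoring the contents of script/style blocks."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def page_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Made-up sample page.
sample = (
    "<html><head><title>Match report</title>"
    "<style>p{color:red}</style></head>"
    "<body><p>The striker scored twice.</p>"
    "<script>var x = 1;</script></body></html>"
)
print(page_text(sample))
```

The resulting string is what you would pass to the feature extraction step of whichever classifier you choose.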
Hi, and happy weekend!
It would also be interesting to know whether a site uses category pages, since Google shows several results from the same domain when it has category pages.
Examples:
danlok(com)
The best example to look at: bloomberg....

Implement Faster Rcnn from scratch [closed]

Closed 2 years ago.
I want to build my own Faster R-CNN model from scratch for multi-object detection on image data.
Can somebody please point me to good sources with a step-by-step approach to implementing Faster R-CNN?
Which one is better, YOLO or Faster R-CNN, in terms of accuracy and execution time?
If you work in computer vision, go through https://www.pyimagesearch.com/; Adrian has published a lot of great material there.
Instead of starting from scratch, use a pre-built model as the base model; afterwards you can
implement your own intermediate layers.
The architecture of Faster R-CNN:
https://medium.com/#smallfishbigsea/faster-r-cnn-explained-864d4fb7e3f8
Actual implementation source 1
Actual implementation source 2
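Whatever implementation you follow, two small building blocks show up in every detector of this kind: intersection over union (IoU) between boxes, and non-maximum suppression (NMS), which Faster R-CNN applies to its region proposals and detections. The sketch below is a plain-Python illustration, assuming boxes given as (x1, y1, x2, y2) tuples; the sample boxes and scores are made up, and real implementations use vectorised versions from their framework.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it.

    Returns the indices of the kept boxes, highest score first.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Made-up boxes: the first two overlap heavily, the third is separate.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The threshold of 0.5 is only a common default; the right value depends on how crowded the objects in your images are.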

Speech/ Music classification [closed]

Closed 4 years ago.
I want to determine which parts of an audio file contain speech and which contain music.
I hope someone has made something like this or can tell me where to start.
Can you please suggest a method or tutorial for doing this?
Thank you.
Check out the pyAudioAnalysis Python library. Among other things, it has a pre-trained speech/music classifier and two segmentation-classification methods (one based on fixed-size windows and another based on HMMs).
You can extract the speech and music parts of an audio recording quite easily, e.g.:
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc] = aS.mtFileClassification("data/scottish.wav", "data/svmSM", "svm", True, 'data/scottish.segments')
with a result as the one in this image
There's lots of prior art in this area, but I'd suggest browsing through some of Dan Ellis's papers. The slides for this talk have some good background. In short, it all comes down to picking the right feature vectors.
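As a rough illustration of what "feature vectors" means here, the sketch below computes two classic frame-level features, short-time energy and zero-crossing rate, in plain Python. The frame size, hop size, and the two synthetic sine tones are assumptions chosen for the example, not values taken from the papers mentioned above; a real speech/music discriminator would use more features and real recordings.

```python
import math


def frames(signal, size, hop):
    """Yield consecutive frames of `size` samples, `hop` samples apart."""
    for start in range(0, len(signal) - size + 1, hop):
        yield signal[start:start + size]


def short_time_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)


def features(signal, size=400, hop=200):
    """Return one (energy, zcr) feature vector per frame."""
    return [
        (short_time_energy(f), zero_crossing_rate(f))
        for f in frames(signal, size, hop)
    ]


# Two synthetic one-second tones at 8 kHz, for illustration only.
rate = 8000
low = [math.sin(2 * math.pi * 100 * n / rate + 0.3) for n in range(rate)]
high = [0.5 * math.sin(2 * math.pi * 1000 * n / rate + 0.3) for n in range(rate)]

low_feats = features(low)
high_feats = features(high)
print(low_feats[0], high_feats[0])
```

A classifier such as an SVM or an HMM is then trained on these per-frame vectors rather than on the raw samples.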

News Article Data Sets [closed]

Closed 4 years ago.
I am doing a project on news classification. Basically, the system will classify news articles by pre-defined topic (e.g. sports, politics, international). To build the system, I need free data sets for training.
So far, after a few hours of googling and following links from here, the only suitable data set I could find is this. While it will hopefully be enough, I would like to find more.
Note that the data sets I want:
Contain full news articles, not just titles
Are in English
Are in .txt format, not XML or a database
Can anybody help me?
Have you tried Reuters21578? It is the most common dataset for text classification. It is formatted in SGML, but it is quite simple to parse and convert to a txt format.
You can also build one yourself: write a Python/Perl/PHP script that runs a search and, when it finds results, isolates the attributes you need with regular expressions. I think this is the best option. It is not easy, but it should be fun, and you can share the resulting data set with us afterwards.
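A minimal sketch of that conversion, using only regular expressions from the standard library. The sample record below is made up and only imitates the tag layout of the collection (REUTERS, TOPICS, D, TITLE, BODY); the helper names `first`, `parse_sgml`, and `to_txt` are hypothetical, and the real files contain more fields and some records without a body.

```python
import html
import re

ARTICLE = re.compile(r"<REUTERS\b.*?</REUTERS>", re.DOTALL)


def first(tag, block):
    """Return the unescaped text of the first <tag>...</tag>, or ''."""
    m = re.search(rf"<{tag}>(.*?)</{tag}>", block, re.DOTALL)
    return html.unescape(m.group(1)).strip() if m else ""


def parse_sgml(sgml):
    """Yield one dict per article with its topics, title, and body."""
    for block in ARTICLE.findall(sgml):
        topics = re.findall(r"<D>(.*?)</D>", first("TOPICS", block))
        yield {
            "topics": topics,
            "title": first("TITLE", block),
            "body": first("BODY", block),
        }


def to_txt(article):
    """Plain-text form of one article: title, blank line, body."""
    return f"{article['title']}\n\n{article['body']}\n"


# Made-up record that imitates the tag layout, for illustration only.
sample = """<REUTERS TOPICS="YES" NEWID="1">
<TOPICS><D>cocoa</D></TOPICS>
<TEXT>
<TITLE>SAMPLE COCOA REVIEW</TITLE>
<BODY>Smith &amp; Co said cocoa prices rose this week.</BODY>
</TEXT>
</REUTERS>"""

articles = list(parse_sgml(sample))
print(articles[0]["topics"])
print(to_txt(articles[0]))
```

Writing each `to_txt` result to its own file, in a folder named after the first topic, gives the plain .txt layout asked for in the question.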
