I am working on my bachelor thesis, and my topic is benchmarking Node.js applications. I have an I/O-intensive application for which I need some sample files.
Ideally I would have a lot of small files, some medium-sized files, and some big files (>1 GB). They should represent real data (e.g. pictures, PDFs, documents, archives, ...).
If you think this should not be asked on Stack Overflow, please tell me where I can ask it instead.
Do you know where I can get such sample datasets?
English Wikipedia database dumps (~12 GB)
Sample audio and video files (~12 MB - ~650 MB)
Text files of various sizes (~1 KB - 114 KB)
StackExchange data dumps (~1 MB - ~15.4 GB)
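Once a few of these are downloaded into one directory, a short script can check whether the collection actually covers the size mix described in the question (many small files, some medium ones, a few above 1 GB). A minimal sketch, assuming the files live under a local sample_data/ directory and using arbitrary size thresholds:

import os
from collections import Counter

SAMPLE_DIR = "sample_data"  # assumed download directory

# arbitrary buckets: small < 1 MB, medium < 100 MB, large < 1 GB, huge >= 1 GB
def bucket(size_bytes):
    if size_bytes < 1024**2:
        return "small (<1 MB)"
    if size_bytes < 100 * 1024**2:
        return "medium (<100 MB)"
    if size_bytes < 1024**3:
        return "large (<1 GB)"
    return "huge (>=1 GB)"

counts = Counter()
for root, _, files in os.walk(SAMPLE_DIR):
    for name in files:
        counts[bucket(os.path.getsize(os.path.join(root, name)))] += 1

for label, n in counts.most_common():
    print(f"{label}: {n} files")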
I'm struggling to analyze a very large, complex, multi-dimensional data set in Excel. The file takes long minutes to open (there are millions of columns) and even longer to analyze. Sorry for the rookie question, but I am sure technology can help.
Thank you.
What I have tried: an Excel pivot table. It was slow and didn't handle the multi-dimensional aspect.
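For what it's worth, outside Excel the same kind of multi-dimensional summary is usually expressed as a pivot over a data frame. A minimal sketch with pandas, assuming the sheet has been exported to data.csv and that the column names (region, product, month, sales) are hypothetical placeholders:

import pandas as pd

# load the exported sheet; reading a CSV avoids Excel's open-time overhead
df = pd.read_csv("data.csv")

# pivot over two row dimensions and one column dimension at once
summary = pd.pivot_table(
    df,
    index=["region", "product"],   # row dimensions
    columns="month",               # column dimension
    values="sales",                # measure
    aggfunc="sum",
)
print(summary.head())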
Is there any open-source project or freely available source where I can query a webpage's category (like https://www.trustedsource.org/en/feedback/url)? I have more than 200K webpages in my dataset.
To me this looks more like a classification problem, which is well suited to machine learning. You can build your own model in a popular ML framework (such as Keras/TensorFlow or PyTorch), or search for an existing one online and use your dataset for transfer learning.
I found a project on GitHub (link) that could be a good starting point.
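If you go the do-it-yourself route, a plain bag-of-words baseline is often enough to see whether the idea works before reaching for the frameworks mentioned above. A minimal sketch with scikit-learn (a deliberately simpler stand-in, not the GitHub project linked above); the example pages and labels are made up, and in practice you would first fetch the text of each of your 200K URLs:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training data: page text paired with a category label
pages = [
    "latest scores and match highlights from the premier league",
    "stock markets rallied after the central bank decision",
    "new graphics card benchmarks and driver updates",
]
labels = ["sports", "finance", "technology"]

# TF-IDF features + logistic regression is a standard text-classification baseline
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression(max_iter=1000))
model.fit(pages, labels)

print(model.predict(["quarterly earnings beat analyst expectations"]))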
Hi, and happy weekend!
It would also be interesting to know whether a site uses category pages, since Google shows multiple results for one domain when it has category pages.
Examples:
danlok(com)
best example to see: bloomberg....
I have a video that is 700 MB in size (5 minutes long). I want to reduce the file size to less than 30 MB.
I have gone through the FFmpeg blog and successfully run the samples, but the problem with that library is that it takes too much time to compress a video file (a 146 MB file takes 20 minutes). So I am looking for a good library, or the right approach, to meet my requirement.
My application supports Android API 9 and above.
The public API for access to hardware video codecs was added in API 16, though it didn't really stabilize until API 18. See the docs for the MediaCodec class. Some examples are available here.
For API 9+ you're generally limited to software solutions like ffmpeg.
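To put a number on the 30 MB target: 30 MB is roughly 240 megabits, so over a 5-minute (300 s) clip the total bitrate budget is about 800 kbit/s. A rough sketch of a corresponding ffmpeg invocation, wrapped in Python purely for illustration (the file names are placeholders, and the bitrates/preset are assumptions to tune; two-pass encoding would hit the size target more precisely):

import subprocess

# ~700 kbit/s video + 96 kbit/s audio stays under the ~800 kbit/s budget for 5 minutes
subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-c:v", "libx264", "-preset", "veryfast", "-b:v", "700k",
    "-c:a", "aac", "-b:a", "96k",
    "output.mp4",
], check=True)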
I want to determine which parts of an audio file contain speech and which contain music.
I hope someone has made something like this or can tell me where to start.
Can you please suggest a method or tutorial for doing this?
Thank you.
Check out the pyAudioAnalysis Python library. Among other things, it has a pre-trained speech-music classifier and two segmentation-classification methods (one based on fixed-size windows and another based on HMMs).
You can extract the speech and music parts of an audio recording quite easily, e.g.:
from pyAudioAnalysis import audioSegmentation as aS
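# mtFileClassification splits the file into mid-term segments and classifies each one
# with the bundled speech/music SVM model ("data/svmSM"); the True flag plots the result.
# flagsInd holds the predicted class index per segment, classesAll the class names,
# and acc an accuracy score when a ground-truth .segments file is supplied.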
[flagsInd, classesAll, acc] = aS.mtFileClassification("data/scottish.wav", "data/svmSM", "svm", True, 'data/scottish.segments')
The fourth argument (True) plots the result, i.e. the predicted speech/music label over the duration of the recording.
There's a lot of prior art in this area, but I'd suggest browsing through some of Dan Ellis's papers. The slides for this talk have some good background. In short, it all comes down to picking the right feature vectors.
I am doing a project on news classification. Basically, the system will classify news articles into pre-defined topics (e.g. sports, politics, international). To build the system, I need free datasets for training it.
So far, after a few hours of googling and following links from here, the only suitable dataset I could find is this. While it will hopefully be enough, I would like to find more.
Note that the datasets I want:
Contain full news articles, not just titles
Are in English
Are in .txt format, not XML or a database
Can anybody help me?
Have you tried Reuters-21578? It is one of the most common datasets for text classification. It is formatted in SGML, but it is quite simple to parse and transform into a .txt format.
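A minimal sketch of that conversion, assuming the distribution's reut2-*.sgm files have been unpacked into a local reuters21578/ directory (the output layout is an assumption):

import glob
import os
from bs4 import BeautifulSoup

OUT_DIR = "reuters_txt"
os.makedirs(OUT_DIR, exist_ok=True)

for path in glob.glob("reuters21578/reut2-*.sgm"):
    # the collection predates UTF-8, so read it as latin-1
    with open(path, encoding="latin-1") as f:
        soup = BeautifulSoup(f.read(), "html.parser")
    for doc in soup.find_all("reuters"):
        body = doc.find("body")
        if body is None:  # some documents have no body text
            continue
        topics_tag = doc.find("topics")
        topics = ",".join(d.get_text() for d in topics_tag.find_all("d")) if topics_tag else ""
        # one .txt file per article, named after the NEWID attribute
        with open(os.path.join(OUT_DIR, f"{doc['newid']}.txt"), "w", encoding="utf-8") as out:
            out.write(topics + "\n" + body.get_text())

If a pre-parsed version is acceptable, NLTK also ships this collection as nltk.corpus.reuters.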
You could also build the dataset yourself: write a Python/Perl/PHP script that runs a search, and when you find the articles, isolate the fields you need with regular expressions... I think that is the best option. It is not easy, but it should be fun, and afterwards you can share the dataset with us.