This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
Looking for dataset to test FULLTEXT style searches on
I recently started a data mining project for which I need 100 GB of plain text for testing. I have spent the whole day searching the net without success. Could someone please point me to links where I can download such text files?
What type of text are you searching for? Conversational, articles, books - or a good spread of everything?
Project Gutenberg might be a good start:
http://www.gutenberg.org/
Wikipedia also allows you to download an archive of articles:
http://en.wikipedia.org/wiki/Wikipedia:Database_download
You should use http://dumps.wikimedia.org/
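For what it's worth, here is a minimal sketch of how you might pull the article text out of one of those dumps without unpacking it first. It assumes you have downloaded a bz2-compressed XML dump (for example a file named like enwiki-latest-pages-articles.xml.bz2); the function name iter_page_texts is just something I made up for illustration. Note that what you get is wiki markup, not clean prose, so you may still want to strip the markup afterwards.

```python
import bz2
import xml.etree.ElementTree as ET


def iter_page_texts(dump_path):
    """Yield (title, wikitext) pairs from a bz2-compressed MediaWiki XML dump."""
    with bz2.open(dump_path, "rb") as stream:
        title = None
        # iterparse reads the file incrementally instead of loading it whole.
        for _event, elem in ET.iterparse(stream, events=("end",)):
            # Tags carry an XML namespace prefix such as "{...}title";
            # strip it so the code does not depend on the schema version.
            tag = elem.tag.rsplit("}", 1)[-1]
            if tag == "title":
                title = elem.text
            elif tag == "text":
                yield title, elem.text or ""
            elif tag == "page":
                # Drop the finished page's contents to limit memory use.
                elem.clear()
```

You would loop over iter_page_texts(path_to_dump) and append each text to your test corpus until you reach the size you need.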
This question already has answers here:
Trying to get the frequencies of a .wav file in Python
Closed last year.
I have a fun artistic project in mind :) Basically, I would like to load a music file (probably as a .wav), and take it up multiple octaves so that you can represent it on the color spectrum. My question is, how do I go about breaking down an audio file so that I can perform this transformation? I'm curious to see if colors can enhance the experience of listening to music if both are "in harmony" :)
All the best!
Anthony :)
Hey, so I found a thread that pretty much answers my question: "Trying to get the frequencies of a .wav file in Python". Basically, I can multiply the calculated frequency by 2^x to represent it with light :) Thanks for all the help!
Edit: While I was working on finishing up my program I stumbled upon this https://github.com/rho-bit/Psynesthesia. Looks like someone did exactly what I was trying to do back in 2003!! This was my first post and I totally love this website. Look forward to using it more! :)
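In case it helps anyone else, here is a rough sketch of the 2^x step only (getting the frequency out of the .wav is covered in the linked thread). It keeps doubling a sound frequency until it lands in the visible band. The 400 THz lower edge is an approximate value I am assuming for deep red, and the function name sound_to_light is made up for this example. Since the result always falls between 400 and 800 THz, the very top of that range sits at or just past the violet edge.

```python
SPEED_OF_LIGHT = 299_792_458.0  # metres per second
VISIBLE_LOW_HZ = 400e12  # assumed lower edge of visible light (deep red)


def sound_to_light(freq_hz):
    """Double freq_hz until it reaches the visible band.

    Returns (octaves, light_hz, wavelength_nm).
    """
    if freq_hz <= 0:
        raise ValueError("frequency must be positive")
    octaves = 0
    light_hz = float(freq_hz)
    # Each doubling raises the pitch by exactly one octave.
    while light_hz < VISIBLE_LOW_HZ:
        light_hz *= 2.0
        octaves += 1
    wavelength_nm = SPEED_OF_LIGHT / light_hz * 1e9
    return octaves, light_hz, wavelength_nm
```

For concert pitch A (440 Hz) this takes 40 doublings and ends up at roughly 620 nm, which is in the orange part of the spectrum.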
I have a text script that is used to create podcasts. So the words in podcast audio are exactly the same as in my text. Now what I want to have is the following:
Word in text | Pronunciation started at
Hello        | 0:0:0.000
my           | 0:0:1.125
friends      | 0:0:2.750
Is that possible to do at all?
Thanks in advance!
Given the complexity of the problem, a good keyword to start with is "forced alignment". This site also covers questions on the topic, e.g. here, which leads you to questions and answers about HTK (the Hidden Markov Model Toolkit) via the related threads.
You can find a more hands-on style description of how to use forced alignment in automated audio segmentation here.
So the answer is: yes, it is possible, but it is algorithmically very complex and even in its best implementations it is not error-free.
PS.: I found you a really simple tool
This question already has answers here:
Closed 11 years ago.
Possible Duplicate:
Has reCaptcha been cracked / hacked / OCR'd / defeated / broken?
I noticed that for some time now reCaptcha has been served without the horizontal line. I do not like the change. According to some articles, Google's captcha service is getting weaker and weaker. Do you think reCaptcha is still the best captcha there is, or should I think about using another implementation?
(I'm using Play in my current project, and I'm also not very happy with the implementation Play supplies.)
Have a read of this article at allspammedup from January 2011, which gives statistics on a crack rate of 17.5% on Google reCAPTCHA.
Two things are certain: people are getting better at programming algorithms to get around CAPTCHAs, and CAPTCHAs themselves are reaching the limit of what people can read. So yes, the effective strength of CAPTCHAs is decreasing, and at some point another solution will have to be used.
Is there any way to get the text from a scanned document in JPG, JPEG or any other format? I am using Ruby as my programming language, but I guess if I can get the text with some help from another programming language, it will not be much of a problem to integrate.
Thanks.
Yes, you can use an OCR library. There are additional details at https://stackoverflow.com/questions/1085/free-ocr-library.
In brief, you may wish to consider using tessnet (http://www.pixel-technology.com/freeware/tessnet2/).
This technology is called optical character recognition (OCR).
For programming, check out this question, which recommends tesseract-ocr.
OCR for Ruby? Check out this question.
If it's just a couple images, here's a site that supposedly does it for free.
OCR Terminal http://www.ocrterminal.com has been the best (most accurate) free tool out of at least a dozen that I have used. It works especially well with formatted (table) data.
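If you go the tesseract-ocr route mentioned above, a minimal sketch of calling it from a script might look like this. It assumes the tesseract command-line tool is installed and on your PATH; passing "stdout" as the output name makes it print the recognised text instead of writing a file. The helper names tesseract_command and ocr_image are mine, not part of any library. The example is in Python, but the same command line can be invoked from Ruby.

```python
import shutil
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build the argument list; "stdout" makes tesseract print the text."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Return the text that the tesseract CLI recognises in an image."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return result.stdout
```

Calling ocr_image("scan.jpg") would then return the recognised text as a string; accuracy depends heavily on the quality of the scan.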
This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Text editor to open big (giant, huge, large) text files
I saw "Text editor to open big text files", but that question referred to megabyte-sized files. I work with 7 GB CSV files and find that even vim and gedit take a long time to open them.
What text editor do you use for gigabyte-sized files?
Appreciate any advice I can get.
I don't know about others, but I use vim (on Windows) for editing GB-sized files and it works every time. http://vim.sourceforge.net/
You can use Total Commander.