How do I transform multiple ASCII files to dfs2 files? - text

Is it possible to use code to convert ASCII (text) files to dfs2 files? Can this be done in Mike Zero for multiple files? At the moment I can only use the Mike Zero toolbox to convert one file at a time.

Related

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a batch file that loops through the input images and outputs both txt and pdf at the same time. This is more efficient: a single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: is the name of the output file.
txt pdf: are the output formats, you can also use only one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001 because we removed the last 4 characters.
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files. Although from my observations, I'm not sure that Tesseract runs any faster by simultaneously converting batch images to both PDF and TXT (versus running it twice - once for PDF, and once for TXT).
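For anyone who needs the same thing, here is a minimal sketch of such a split function in Python 3. It is not the exact function mentioned above. It assumes the pages in the merged TXT are separated by a form feed character (which I believe is the default page separator in 4.1.x; change `separator` if your config sets something else) and that the pages appear in the same order as the paths in the image list. The names `split_ocr_text` and `write_pages` are made up for illustration.

```python
from pathlib import Path


def split_ocr_text(merged_text, image_paths, separator="\f"):
    """Map '<image name>.txt' to that image's share of the merged OCR text."""
    pages = merged_text.split(separator)
    # A separator after the last page leaves one empty trailing piece.
    if len(pages) == len(image_paths) + 1 and pages[-1].strip() == "":
        pages.pop()
    if len(pages) != len(image_paths):
        raise ValueError(
            f"expected {len(image_paths)} pages, found {len(pages)}"
        )
    # Assumption: image file names are unique across directories.
    return {f"{Path(p).name}.txt": text for p, text in zip(image_paths, pages)}


def write_pages(pages_by_name, out_dir):
    """Write each page to out_dir under its image-derived name."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for name, text in pages_by_name.items():
        (out / name).write_text(text, encoding="utf-8")
```

Usage would be along the lines of `write_pages(split_ocr_text(Path("outfile.txt").read_text(), image_list), "pages")`, where `image_list` holds the same paths as infile. The count check is there so that a mismatch between pages and images fails loudly instead of silently mislabelling files.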
Thank you!
BTW I'm using 4.1.1.
I also discovered another traineddata file for the Spanish language that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly I don't know how the new traineddata file does the job better. I downloaded it from:
https://github.com/tesseract-ocr/tessdata_best

Split large file into small files with particular extension

I want to split a large file into small files of 10000 lines each. I know I can do the same using:
split --lines=10000
However, the above command does not give extensions to the split files. I want to give all my split files the extension .txt. Is it possible to do this using split in Linux? If yes, then how?
Also is it possible to number the files such that the first file has the name a1.txt. The second file has the name a2.txt, and so on. I know split gives names of the files as aa,ab, etc. but I want to replace this with a1.txt, a2.txt, a3.txt, a4.txt, a5.txt, a6.txt, a7.txt, etc.
Use the -d option to get numeric suffixes:
split --lines=10000 -d <file>
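If the exact names a1.txt, a2.txt, ... matter, a small Python 3 script is another option. This is only a sketch and not a feature of split itself: the function name `split_file` and its parameters are made up for illustration, and it assumes the input is UTF-8 text.

```python
from itertools import islice
from pathlib import Path


def split_file(src, lines_per_file=10000, prefix="a", suffix=".txt"):
    """Split src into prefix1.txt, prefix2.txt, ... next to the source file."""
    src = Path(src)
    written = []
    with src.open("r", encoding="utf-8") as handle:
        index = 1
        while True:
            # Read at most lines_per_file lines without loading the whole file.
            chunk = list(islice(handle, lines_per_file))
            if not chunk:
                break
            target = src.parent / f"{prefix}{index}{suffix}"
            target.write_text("".join(chunk), encoding="utf-8")
            written.append(target)
            index += 1
    return written
```

Calling `split_file("big.log")` would write a1.txt, a2.txt and so on into the same directory as big.log, with the last file holding whatever lines are left over.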

Is there any difference between having a comma at the end of a line or not in a CSV file?

csvFile1.csv
abc,def
csvFile2.csv
abc,def,
Is there any difference between them?
Usually there is no issue with this, except that a parser may read the trailing comma as an extra empty field at the end of the line. Besides that, I believe there is no other difference.
EDIT: It also depends where you will be using your CSV file. I do know some places where these CSV files must have a specific format but as I said this depends on the use of the file itself.
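To see the empty field for yourself, here is a quick check with Python's standard csv module. Other parsers may behave differently, so treat this as one example and not a rule; `parse_line` is just a helper name I made up.

```python
import csv


def parse_line(line):
    """Parse a single CSV line into a list of fields."""
    return next(csv.reader([line]))


for line in ("abc,def", "abc,def,"):
    fields = parse_line(line)
    print(len(fields), fields)
# prints:
# 2 ['abc', 'def']
# 3 ['abc', 'def', '']
```

So for this parser csvFile2.csv has three columns, the last one empty, which can matter if the consumer expects a fixed number of columns.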

How to create a .dat file based on an Excel document

I have a .csv file in my MATLAB folder with 38 columns and about 48 thousand entries. I was hoping to use the findcluster GUI, but it only accepts .dat files.
How do I create a .dat file in MATLAB, or more specifically, how do I convert the .csv file into a .dat file that can be used by the MATLAB fcm clustering tool?
example of csv:
How would I go about creating a data file for this kind of information?
The only documentation I could find about the file format was
The data set must have the extension .dat. For example, to load the data set,
clusterdemo.dat, type findcluster('clusterdemo.dat').
I checked clusterdemo.dat and found that the data is stored in ASCII format. Therefore, try
a = csvread('data.csv');
save 'data.dat' a -ASCII
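If you would rather do the conversion outside MATLAB, roughly the same thing can be sketched in Python 3. This assumes the CSV has no header row and contains only numeric cells, and that space-separated values in exponent notation are acceptable to the tool, which is what clusterdemo.dat appeared to contain. The function name `csv_to_dat` is made up for illustration.

```python
import csv


def csv_to_dat(csv_path, dat_path):
    """Rewrite a numeric CSV as space-delimited ASCII, one row per line."""
    with open(csv_path, newline="", encoding="utf-8") as src, \
            open(dat_path, "w", encoding="utf-8") as dst:
        for row in csv.reader(src):
            if not row:
                continue  # skip blank lines
            # float() raises ValueError on non-numeric cells such as headers.
            values = [float(cell) for cell in row]
            dst.write(" ".join(f"{v:.7e}" for v in values) + "\n")
```

Calling `csv_to_dat("data.csv", "data.dat")` would then produce a file you can try passing to findcluster.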
Just rename xxx.csv to xxx.dat. This worked for me.
You should try changing the extension. To do that, go to the folder settings and, in the View tab, uncheck the option that hides extensions for known file types. You can then change the extension of any file by renaming it.
This works because there really isn't such a thing as a 'dat' format. A 'dat' file is just a text file, and it could theoretically have any extension you want. It could also be delimited however you want or need; it all depends on what you are trying to achieve.
In other words, what are you going to use this file for?
If it's for use with another application, then the requirements of that application will probably dictate how it's delimited, structured, etc.
Or you can simply save the file from Excel as .csv and then change the extension afterwards.
It worked for me.

I want to change the way text is represented internally in ANY Text Editor

I want to use an algorithm to reduce the memory used to save a particular text file. I don't really know how text is stored, but I have an idea in mind.
Would it be better to extend an open source text editor (if yes, then which one) or to write a text editor myself?
It would be nice if someone could also give me a link or tutorial on the basics of how text editors work and the way data is stored.
Edited to add
To clarify, what I wanted to do is instead of saving duplicates of a word make a hash table and store the address where it needs to be placed.
That way I wouldn't be storing the duplicates.
This would have become specific to a particular text editor.
Update
Thanks everyone, I got what all of you are trying to say. Anyway, all I wanted to do is, instead of saving duplicates of a word, make a hash table and store the address where it needs to be placed.
This way I wouldn't be storing the duplicates.
Yes, and this would have become specific to a particular text editor. I never realized that.
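For readers wondering what the idea looks like in practice, here is a rough Python 3 sketch. It is not how any existing text editor stores text: each distinct token (a word or a run of non-word characters) is kept once in a table, and the document becomes a list of indexes into that table. The names `encode` and `decode` are made up for illustration, and whether this actually saves memory depends on how repetitive the text is and how the indexes are stored.

```python
import re


def encode(text):
    """Return (vocab, ids): each distinct token once, plus an index per token."""
    # Alternating runs of word and non-word characters cover the whole text.
    tokens = re.findall(r"\w+|\W+", text)
    table = {}
    ids = []
    for token in tokens:
        # setdefault assigns the next free index the first time a token is seen.
        ids.append(table.setdefault(token, len(table)))
    vocab = list(table)  # dict insertion order matches the assigned indexes
    return vocab, ids


def decode(vocab, ids):
    """Rebuild the original text from the table and the index list."""
    return "".join(vocab[i] for i in ids)
```

For "the cat and the dog and the bird" this gives 6 table entries for 15 tokens, and `decode(*encode(text))` returns the original string. The file on disk would no longer be plain text, though, which is the objection raised in the answers.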
I want to use an algorithm to reduce the memory used to save a particular text file
If you did this you would no longer have a text editor, but instead you would have created some sort of binary file editor.
The whole point of the text file format is that it is universal, meaning any text file can be open in any other text editor.
Emacs handles compression transparently. Just create a text file with .gz extension. Emacs will automatically compress contents of the file during save operation, and decompress when you open the file next time.
Text is basically stored as-is, i.e., every character takes up a byte or two (wide chars), and no conversion is done on it when it's saved. It might add an end-of-file character or something, though. Don't try coming up with your own algorithm to compress these files. That's why zip files and other archives were created. They're really good at compressing text. If you wanted to add this feature to your text editor, you'd have to add some sort of post-save hook to zip the file, and then put a hook on the open command to unzip it, unless you wanted to do it by hand every time. Don't try writing the text editor yourself from scratch, unless (maybe) you're writing something like Notepad. Text editors with syntax highlighting aren't very easy to make, even with the proper libraries. I'd say write a plugin for something like Visual Studio or what have you. Or find an open-source text editor.
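As a rough illustration of the save and open hooks described above, here is a Python 3 sketch using the standard gzip module. Note that it uses gzip and not zip, to match the .gz example from the Emacs answer, and that `save_text` and `load_text` are made-up names, not part of any editor's API.

```python
import gzip


def save_text(path, text):
    """Compress the text while saving (the 'post-save hook' side)."""
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        handle.write(text)


def load_text(path):
    """Decompress the text while opening (the 'open hook' side)."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return handle.read()
```

An editor plugin would call `save_text` in place of its normal save and `load_text` in place of its normal open, so the user only ever sees plain text while the file on disk stays compressed.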
