Split a large file into small files with a particular extension - Linux

I want to split a large file into small files of 10000 lines each. I know I can do this using:
split --lines=10000
However, the above command does not give the split files an extension. I want to give all my split files the extension .txt. Is it possible to do this using split on Linux? If yes, then how?
Also, is it possible to number the files so that the first file is named a1.txt, the second a2.txt, and so on? I know split names the files aa, ab, etc., but I want a1.txt, a2.txt, a3.txt, a4.txt, and so on instead.

Use the -d option to get numeric suffixes:
split --lines=10000 -d <file>
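With GNU coreutils' split you can also add the .txt extension and start numbering at 1 (a sketch; note that numeric suffixes are zero-padded to two digits by default, so the names come out as a01.txt, a02.txt, ... rather than a1.txt):
split --lines=10000 --numeric-suffixes=1 --additional-suffix=.txt <file> a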

Related

Tesseract Batch Convert Images to Searchable PDF And Multiple Corresponding Text Files

I’m using tesseract to batch convert a list of images to both a searchable PDF and a TXT file containing the OCR'd text.
tesseract infile outfile -l eng myconfig
infile contains a list of image paths to process
myconfig contains tesseract preferences to specify the output types (tessedit_create_text 1 and tessedit_create_pdf 1)
This leaves me with outfile.pdf and outfile.txt, the latter of which contains page separators for delimiting text between images.
What I’m really looking to do, however, is to output multiple TXT files on a per-image basis, using the same corresponding image name. For example, Image1.jpg.txt, Image2.jpg.txt, Image3.jpg.txt...
Does tesseract have the option to support this behavior natively? I realize that I can loop through the image file list and execute tesseract on a per-image basis, but this is not ideal as I’d also have to run tesseract a second time to generate the merged PDF. Instead, I’d like to run both options at the same time, with less overall execution time.
I also realize that I can split the merged TXT file on the page separator into multiple text files, but then I have to introduce less elegant code to map and rename all of those split files to correspond to their original image names: Rename 0001.txt to Image1.jpg.txt...
I’m working with both Python 3 and Linux commands at my disposal.
You can prepare a list file of the input images and output to both txt and pdf at the same time -- more efficient, a single OCR operation instead of two. You can then split the output .txt file into pages.
tesseract inimagefile outfile txt pdf
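For example (a sketch; images.txt and combined are illustrative names -- tesseract accepts a plain-text list of image paths in place of a single image):
ls *.jpg > images.txt
tesseract images.txt combined txt pdf
This produces combined.pdf and combined.txt in a single OCR pass.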
Converting multiple images to a single PDF file.
On Linux, you can list all images and then pipe them to tesseract
ls *.jpg | tesseract - yourFileName txt pdf
Where:
yourFileName: the name of the output file.
txt pdf: the output formats; you can also use just one of them.
Converting images to individual text files
On Linux, you can use the for loop to go through files and execute an action for every file.
for FILE in *.jpg; do tesseract "$FILE" "${FILE::-4}"; done
Where:
for FILE in *.jpg : loop through all JPG files (you can change the extension based on your format)
$FILE: is the name of the image file, e.g. 001.jpg
${FILE::-4}: is the name of the image but without the extension, e.g. 001.jpg will be 001, because we removed the last 4 characters (the dot plus a three-character extension).
We need this to name the text files to the corresponding names, e.g.
001.jpg will be converted to 001.txt
002.jpg will be converted to 002.txt
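If your extensions are not always three characters long, a safer variant (a sketch using the POSIX parameter expansion ${FILE%.*}, which strips everything from the last dot regardless of length) is:
for FILE in *.jpg; do tesseract "$FILE" "${FILE%.*}"; done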
Since Tesseract doesn't seem to handle this natively, I've just developed a function to split the merged TXT file on the page separator into multiple text files (a shell sketch of that split follows this comment). Although from my observations, I'm not sure that Tesseract runs any faster by simultaneously converting batch images to both PDF and TXT (versus running it twice -- once for PDF, and once for TXT).
Thank you!
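A minimal shell sketch of that split, assuming Tesseract's default form-feed (\f) page separator and that images.txt lists one image path per line in OCR order (outfile.txt and images.txt are illustrative names):
awk 'BEGIN { RS = "\f" }
{ if ((getline img < "images.txt") > 0) { out = img ".txt"; printf "%s", $0 > out; close(out) } }' outfile.txt
Each page of the merged text then lands in Image1.jpg.txt, Image2.jpg.txt, and so on, with no separate rename pass.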
BTW, I'm using 4.1.1.
And I discovered another traineddata file for the Spanish language that does a better job than the standard one. It actually recognizes the "o" character well. The only problem is the processing time, but I let the PC work overnight.
Honestly, I don't know how the new traineddata file does the job better. I downloaded it at:
https://github.com/tesseract-ocr/tessdata_best

How to remove multiple lines from CSV file in nodejs

I have a CSV file with 5 lines at the top that I want to remove using Node.js. I then want to add my own header line that better matches the header I would use. I have no control over the original CSV file, so I am unable to do this at the source.
It will be easiest using one of the following modules:
https://www.npmjs.com/package/csv
https://www.npmjs.com/package/tsv
or other that you find in:
https://www.npmjs.com/browse/keyword/csv
https://www.npmjs.com/browse/keyword/tsv
(don't worry whether it's CSV or TSV - just make sure that you use the correct delimiter, which is a comma in your case).
You could do it all manually, parsing the file as text, but using a module will be much less error-prone.
A shell pipeline also works if you just need to drop the first five lines and prepend your own header (tail -n +6 starts printing at line 6, i.e. it skips the first five lines):
(cat good-header.csv; tail -n +6 original-file.csv) > the-result.csv

Can gulp collect grepped lines into a single output file?

I'm trying to filter out lines from all .js source files and put them into a separate file. (Specifically, I'm trying to grep all calls to a string translation function and post-process them.)
I think I have the different parts figured out but can't make them fit together.
For each file, process it
Write each file's grepped lines to the output
Append the result to a file
I've tried calling through.push(<output per file>) from the plugin, but the following step expects a file, not a string.
From there, I expect I could do something like gulp-concat or a stream merge on the results and pipe them on to gulp.dest, but there's a bit missing here.
I figured out a way - simply replace the Vinyl file's contents with the lines to output, and push that file on with through.push.

Importing list of files to compress/zip

Does anybody know of any way (on Windows) to create an archive (zip, rar, ...) and add files to it by importing a list of the files to be added (say, from a CSV or text file, or simply pasted)? Say I have a simple list of 1,000 files across multiple directories that I want to add to an archive; this would be a much simpler method than adding each file individually. Also, I do not want to archive the entire directory, as it is absolutely massive.
eg:
c:\somedir\file1.php
c:\somedir\somesubdir\file2.php
c:\someotherdir\file3.php
...
And no, I do not want to import all files in certain directories; the hundreds of files are scattered across tens of directories, which also contain lots of other files that I do not want to archive.
Thanks
rar.exe from WinRAR has the following option:
-n@<list>   Include files listed in specified list file
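For example (a sketch; filelist.txt is a hypothetical list file with one path per line, like the list in the question -- rar also accepts list files passed directly as arguments with an @ prefix):
rar a archive.rar @filelist.txt
7-Zip's command line accepts list files in the same way: 7z a archive.zip @filelist.txt.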

Getting files names inside a rar/zip file without unzip

Does anyone know if it is possible to get the names of the files inside a rar/zip archive without having to unrar/unzip it? And if yes, is there a way to block this or make it difficult?
Thanks
The file names in a zip file are visible even if the data is encrypted. If you want to hide the names, the easy solution is to zip the zip file encrypted.
Later versions of PKZip do have an option to encrypt the file names as well with -cd=encrypt. (cd means central directory.)
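For example, with Info-ZIP's zip on Linux (a sketch; secret.zip stands for the archive whose entry names you want to hide):
zip -e wrapped.zip secret.zip
Listing wrapped.zip then shows only the single name secret.zip; the inner names stay hidden until the outer archive is decrypted.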
The -l flag to unzip(1) does just that:
-l
list archive files (short format). The names, uncompressed file sizes and modification dates and times of the specified files are printed, along with totals for all files specified.
unrar(1) has the l option:
l
List archive content.
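For example, where archive.zip and archive.rar are your files:
unzip -l archive.zip
unrar l archive.rar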
