Concatenate PDFs while preserving rank in list - linux

I am currently struggling to concatenate my various PDF files into one file in an automated way while at the same time preserving the order the files are provided in.
The main problem is that I include a rank in each file name (the files are visualizations of list items), currently ranging from 1 to 100. If I run
pdftk *.pdf cat output all.pdf
the pages of the combined PDF will not be ordered from 1 to 100 accordingly. My PDFs are named similarly to the following example; note that "rank_XXX" obviously determines their rank in the list. However, the fact that the terminal lists 10 and 100 before 2 messes up my sorting. I was thinking that ls -v could somehow be used to pipe the filenames into pdftk or a similar tool, but I could not get it working.
rank_1_XYZ_123123A.pdf
rank_1_XYZ_123123B.pdf
rank_2_XYZ_123141A.pdf
rank_2_XYZ_123141B.pdf
rank_3_ABC_394124A.pdf
rank_3_ABC_394124B.pdf
...
rank_10_XYZ_129123A.pdf
rank_10_XYZ_129123B.pdf
...
rank_100_ZZZ_929123A.pdf
rank_100_ZZZ_929123B.pdf
I managed to get at least partially what I want by using
pdftk rank_[1-9]*.pdf cat output all.pdf
Nevertheless, this somehow does not work for numbers larger than 9.
Any help is greatly appreciated.

ls -v seems to do the job:
pdftk `ls -v` cat output all.pdf
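If any of the filenames could confuse word splitting, a slightly more explicit variant (a sketch, assuming no whitespace in the names) builds the list with GNU sort -V, which version-sorts the same way ls -v does:

# Assumes no whitespace in filenames; sort -V orders rank_2 before rank_10
pdftk $(printf '%s\n' rank_*.pdf | sort -V) cat output all.pdf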

Related

How to code for iterating through multiple files in linux?

I have code which I am trying to update from another example. The aim is to run plink using files for each chromosome, a file of SNP IDs, and a file containing only one ID, which is an individual's ID. Running these files through plink ultimately makes a VCF file per individual for a given chromosome.
I have 22 chromosome files, 1 SNP file (which is always the same), and 500 individual files. For each individual I am aiming to make a VCF for each chromosome, so I will have 22*500 (11000) VCF files as output.
So far I have tried a bash script with this:
ID=$SGE_TASK_ID
indiv=$SGE_TASK_ID
plink --bed chr${ID}.bed --bim chr${ID}.bim --fam chr${ID}.fam --extract snps.txt \
    --recode vcf-iid --out output${indiv}chr${ID}vcf --keep-fam individual${indiv}.txt
This runs; however, it only runs through one individual, giving me 22 chromosome VCF files for that one person, and stops there. How do I make this run for all 500 people? Would it be with a for loop? Looking through other questions I haven't been able to find one that matches my question and is in Linux. Any help would be appreciated.
${indiv} would just be a number, so the text files are named like individual1.txt, increasing through the 500 individuals (individual1.txt, individual2.txt, individual3.txt, ...).
Assuming that ${indiv} contains no spaces,
for indiv in $(<individuals.data); do
plink [...] individual${indiv}.txt
done
The file individuals.data would name the individuals, separated by spaces or newlines.
If you're unsure what the Bash shell's $(<...) operator does, try this:
for A in $(<individuals.data); do
echo "[$A]"
done
Note that, as @Kaz has observed, if you wish your script to also work in shells other than Bash, then you might write $(cat ...) rather than $(<...).
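For the plink case above, a minimal sketch (assuming the files really are named individual1.txt through individual500.txt and chr1 through chr22, as the question describes) would nest a chromosome loop inside the individual loop:

# One VCF per individual per chromosome: 500 * 22 = 11000 plink runs
for indiv in $(seq 1 500); do
    for chr in $(seq 1 22); do
        plink --bed chr${chr}.bed --bim chr${chr}.bim --fam chr${chr}.fam \
            --extract snps.txt --recode vcf-iid \
            --out output${indiv}chr${chr} --keep-fam individual${indiv}.txt
    done
done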

Combine lots of CSV files in CentOS?

I have a CentOS machine and I want to combine .csv data.
I have thousands of small documents all with the same column information.
How would I go about combining all of them into files of up to 20 MB in size?
For example, 1.csv would combine the first few files, and once the 20 MB limit is reached the data would continue to go into 2.csv, and so on.
Any help is greatly appreciated
If they don't have headers, something as simple as:
$ cat *.csv > combined.csv
would work (run it in the directory containing the files; the shell expands *.csv in the same order ls *.csv returns).
You can achieve what you want with a simple tail command:
tail -q -n+2 *.csv
You only need to add the proper header line afterward.
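Neither snippet enforces the 20 MB cap from the question. A sketch with GNU split (assuming a reasonably recent coreutils; --additional-suffix is not available on very old systems) chops the combined output into chunks of at most 20 MB without splitting a line in half:

# -C 20m: at most 20 MB per output file, cut only on line boundaries
# -d: numeric suffixes, giving part_00.csv, part_01.csv, ...
tail -q -n +2 *.csv > combined.csv
split -C 20m -d --additional-suffix=.csv combined.csv part_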
You might want to look at the join utility: https://www.gnu.org/software/coreutils/manual/html_node/join-invocation.html#join-invocation

grab 2 numbers from file name then insert into command

I'm a bit new to programming in general and I'm not sure how to go about accomplishing this task in my bash script.
A quick background: when importing my music library (formerly organized by iTunes) to Banshee, all of the files were duplicated to fit Banshee's numbering style (e.g. "02. " instead of "02 "). On top of that, iTunes apparently did not save the ID3 tags to the files, so many of them are blank. So now I've got a few thousand tags to fix and duplicate files to get rid of.
To automate the process, I started learning to write bash scripts. I came up with a script (which you can see here) that does four things: removes unnecessary iTunes files; takes input from the user about ID3 tag information and stores it in variables; clears any present tag info from all files; and writes new tags with the info taken from the user, using a program called eyeD3.
Now, here's where I run into my problem. This script is basically blindly writing info to all mp3 files in the dir. This is fine for tags that all the files have in common, like artist, album, total tracks, year, etc. But I can't tag each individual track number with this method, so I'm still editing the track# tags one at a time, manually. And that's something I really don't want to do 2,000+ times.
The files names all look like this:
01. song1.mp3
02. song2.mp3
03. song3.mp3
The command to write a track number to a tag looks like this:
$ eyeD3 -n 1 "01. song1.mp3"
So... I'm not sure how to go about automating this. I need to grab the first two digits of each file name, store them somewhere, then recall each one into a separate eyeD3 command.
You can loop over the files using globbing, and use substring expansion to capture the first two characters of the filename:
for f in *.mp3; do
    eyeD3 -n "${f:0:2}" "$f"
done
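One possible wrinkle, purely as a precaution: if you ever use the captured digits in an arithmetic context, Bash treats "08" and "09" as invalid octal numbers. Forcing base 10 avoids that (a sketch; eyeD3 itself may accept the leading zero just fine):

for f in *.mp3; do
    # 10# forces base-10 interpretation, so "08"/"09" don't error as octal
    eyeD3 -n "$((10#${f:0:2}))" "$f"
done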

awk/sed/grep command to compare the contents of three files

Hi, I am trying to automate some data entry, and I am using a TCP server/client to send filenames around so another server can go into a repository and pull those files. As part of testing this, I am running the program while logging the filenames that are supposed to be sent and those that were received, and when a file is received I send a reply back with its filename.
So I have three text files with filenames inside them:
SupposedToSend.txt
Recieved.txt
GotReplyFor.txt
I know that awk could do what I am trying to do, but I am not sure how to set it up. I need to compare the three files for entries that do not exist in any of the other files, so if one entry is missing from any file I need to know which entry and from which file.
I could write a program for this, but it would take much longer to write and to run, since these files are getting 5 entries/minute dumped into them.
paste -d '\n' SupposedToSend.txt Recieved.txt GotReplyFor.txt | uniq -c | grep -v '^ 3'
It's tolerable if you have no errors; deeply suboptimal otherwise, or if the data in the different files is out of sequence (in which case you might need to sort them somehow).
Or you could just run diff3 to compare 3 files...
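If sorting the files is acceptable, a short sketch with comm (assuming one filename per line) pinpoints exactly which entries are missing and from which file:

# comm requires sorted input, so sort into working copies first
sort SupposedToSend.txt > sent.sorted
sort Recieved.txt > recv.sorted
sort GotReplyFor.txt > reply.sorted
comm -23 sent.sorted recv.sorted   # sent but never received
comm -13 sent.sorted recv.sorted   # received but never expected
comm -23 recv.sorted reply.sorted  # received but never replied to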

splitting text files column-wise

So I have an invoice that I need to make a report out of. It is on average about 250 pages long. So I'm trying to create a script that would extract specific values from the invoice and make a report. Here are my problems:
1. The invoice is in PDF format, spanning two columns. I want to use the 'pdftotext' Linux command to convert it into multiple text files (with each txt file representing one PDF page). How do I do that?
2. I recognize that the 'pdftotext' command splits the left part of the page from the right part with 21 spaces in between. How do I move the right side of the data (identified after reading at least 21 spaces in a row) to the end of the file?
3. Since the file is large and I only need the last few pages, how do I delete all those text files in a script (not manually) until I read a keyword (let's just say the keyword = Start Invoice)?
I know this is a lot of questions, but I'm confused about what Linux commands can do. Can you guys guide me in the right direction? Thanks
PS: I'm using CentOS 5.2
What about:
pdftotext YOUR.pdf | sed 's/^\([^ ]\+\) \{21\}.*/\1/' > OUTPUT    # keep the left column
pdftotext YOUR.pdf | sed 's/.* \{21\}\(.*\)/\1/' >> OUTPUT        # append the right column
But you should check out pdftotext's -raw and -layout options too. And there are more ways to do it...
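For the one-file-per-page part of the question, a sketch using pdftotext's -f/-l (first/last page) options, with the page count read from pdfinfo (both tools ship in the same poppler/xpdf package; invoice.pdf is just a placeholder name):

# Extract each page into its own text file, preserving the column layout
pages=$(pdfinfo invoice.pdf | awk '/^Pages:/ {print $2}')
for p in $(seq 1 "$pages"); do
    pdftotext -f "$p" -l "$p" -layout invoice.pdf "page_${p}.txt"
done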
