Split a PDF by Bookmarks? - linux

I need to process single PDFs that have each been created by merging multiple PDFs. In each merged PDF, the place where each original part starts is marked with a bookmark.
Is there any way to automatically split this up by bookmarks with a script?
We only have the bookmarks to indicate the parts, not the page numbers, so we would need to infer the page numbers from the bookmarks. A Linux tool would be best.

pdftk can be used to split the PDF file and extract the page numbers of the bookmarks.
To get the page numbers of the bookmarks do
pdftk in.pdf dump_data
and make your script read the page numbers from the output.
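For each bookmark, dump_data emits a small block of lines; the ones relevant here look roughly like this (titles and page numbers are just illustrative):
BookmarkBegin
BookmarkTitle: Part 1
BookmarkLevel: 1
BookmarkPageNumber: 1
BookmarkBegin
BookmarkTitle: Part 2
BookmarkLevel: 1
BookmarkPageNumber: 8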
Then use
pdftk in.pdf cat A-B output out_A-B.pdf
to get the pages from A to B into out_A-B.pdf.
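For example, if the dump shows bookmarks on pages 1 and 8, the first part would be extracted with
pdftk in.pdf cat 1-7 output out_1-7.pdf
and the last part can use the keyword end, e.g. pdftk in.pdf cat 8-end output out_8-end.pdf.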
The script could be something like this:
#!/bin/bash
infile=$1 # input pdf
outputprefix=$2
[ -e "$infile" -a -n "$outputprefix" ] || exit 1 # Invalid args
pagenumbers=( $(pdftk "$infile" dump_data | \
grep '^BookmarkPageNumber: ' | cut -f2 -d' ' | uniq)
end )
for ((i=0; i < ${#pagenumbers[@]} - 1; ++i)); do
    a=${pagenumbers[i]}    # start page number
    b=${pagenumbers[i+1]}  # end page number
    [ "$b" = "end" ] || b=$((b-1))
    pdftk "$infile" cat $a-$b output "${outputprefix}"_$a-$b.pdf
done

There's a command line tool written in Java called Sejda that has a splitbybookmarks command which does exactly what you asked for. It's Java, so it runs on Linux, and since it's a command line tool you can write a script around it.
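A minimal sketch of such an invocation, assuming the sejda-console launcher is on your PATH (double-check the option names against your Sejda version's --help; -f for the input file, -o for the output directory and -l for the bookmark level to split at are assumptions here):
sejda-console splitbybookmarks -f in.pdf -o /path/to/outdir -l 1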
Disclaimer
I'm one of the authors

There are programs built for PDF splitting, like A-PDF Split, that can do this for you:
A-PDF Split is a very simple, lightning-quick desktop utility program that lets you split any Acrobat pdf file into smaller pdf files. It provides complete flexibility and user control in terms of how files are split and how the split output files are uniquely named. A-PDF Split provides numerous alternatives for how your large files are split - by pages, by bookmarks and by odd/even page. You can even extract or remove part of a PDF file. A-PDF Split also offers advanced defined splits that can be saved and later imported for use with repetitive file-splitting tasks. A-PDF Split represents the ultimate in file splitting flexibility to suit every need.
A-PDF Split works with password-protected pdf files, and can apply various pdf security features to the split output files. If needed, you can recombine the generated split files with other pdf files using a utility such as A-PDF Merger to form new composite pdf files.
A-PDF Split does NOT require Adobe Acrobat, and produces documents compatible with Adobe Acrobat Reader Version 5 and above.
Edit:
I also found a free, open-source program Here if you do not want to pay.

Here's a little Perl program I use for the task. Perl isn't special; it's just a wrapper around pdftk that interprets its dump_data output and turns it into the page ranges to extract:
#!perl
use v5.24;
use warnings;
use Data::Dumper;
use File::Path qw(make_path);
use File::Spec::Functions qw(catfile);
my $pdftk = '/usr/local/bin/pdftk';
my $file = $ARGV[0];
my $split_dir = $ENV{PDF_SPLIT_DIR} // 'pdf_splits';
die "Can't find $ARGV[0]\n" unless -e $file;
# Read the data that pdftk spits out.
open my $pdftk_fh, '-|', $pdftk, $file, 'dump_data';
my @chapters;
while( <$pdftk_fh> ) {
    state $chapter = 0;
    next unless /\ABookmark/;
    if( /\ABookmarkBegin/ ) {
        my( $title )       = <$pdftk_fh> =~ /\ABookmarkTitle:\s+(.+)/;
        my( $level )       = <$pdftk_fh> =~ /\ABookmarkLevel:\s+(.+)/;
        my( $page_number ) = <$pdftk_fh> =~ /\ABookmarkPageNumber:\s+(.+)/;

        # I only want to split on chapters, so I skip higher
        # level numbers (higher means more nesting, 1 is lowest).
        next unless $level == 1;

        # If you have front matter (preface, etc) then this numbering
        # will be off. Chapter 1 might be called Chapter 3.
        push @chapters, {
            title      => $title,
            start_page => $page_number,
            chapter    => $chapter++,
        };
    }
}

# The end page for one chapter is one before the start page for
# the next chapter. There might be some blank pages at the end
# of the split for PDFs where the next chapter needs to start on
# an odd page.
foreach my $i ( 0 .. $#chapters - 1 ) {
    my $last_page = $chapters[$i+1]->{start_page} - 1;
    $chapters[$i]->{last_page} = $last_page;
}
$chapters[$#chapters]->{last_page} = 'end';

make_path $split_dir;

foreach my $chapter ( @chapters ) {
    my( $start, $end ) = $chapter->@{qw(start_page last_page)};

    # slugify the title so we can use it as a filename
    my $title = lc( $chapter->{title} =~ s/[^a-z]+/-/gri );
    my $path  = catfile( $split_dir, "$title.pdf" );
    say "Outputting $path";

    # Use pdftk to extract that part of the PDF
    system $pdftk, $file, 'cat', "$start-$end", 'output', $path;
}

Related

Split a huge file in LINUX into multiple small files (each less than 100MB) splitting at a specific line with pattern match

I have the below source file (~10GB) and I need to split it into several small files (<100MB each), and each file should have the same header record. The tricky part is that I can't just split the file at any random line using some split command. Records belonging to an agent shouldn't be split across multiple files. For simplicity I am only showing a few agents here (there are thousands of them in the real file).
Input.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Output1.csv
Src,AgentNum,PhoneNum
DWH,Agent_1234,phone1
NULL,NULL,phone2
NULL,NULL,phone3
Output2.csv
Src,AgentNum,PhoneNum
DWH,Agent_5678,phone1
NULL,NULL,phone2
NULL,NULL,phone3
DWH,Agent_9999,phone1
NULL,NULL,phone2
NULL,NULL,phone3
#!/bin/bash
#Calculate filesize in bytes
FileSizeBytes=`du -b $FileName | cut -f1`

#Check for the file size
if [[ $FileSizeBytes -gt 100000000 ]]
then
    echo "Filesize is greater than 100MB"
    NoOfLines=`wc -l < $FileName`
    AvgLineSize=$((FileSizeBytes / NoOfLines))
    LineCountInEachFile=$((100000000 / AvgLineSize))
    #Section for splitting the files
else
    echo "Filesize is already less than 100MB. No splitting needed"
    exit 0
fi
I am new to UNIX but am trying this bash script on my own and am kind of stuck at splitting the files. I am not expecting somebody to give me a full script; I am looking for any simple approach/recommendation, possibly using other simple alternatives like sed or such. Many thanks in advance!
Here is a rough idea of how to do it in Perl. Please modify the regular expression if it doesn't exactly match your actual data. I have only tested it on your dummy data.
#!/usr/bin/perl -w
my $l=<>; chomp($l); my $header=$l;
my $agent=""; my $fh;
while ($l=<>) {
    chomp($l);
    if ($l=~m/^\s*[^,]+,(Agent_\d+),[^,]+/) {
        $agent="$1";
        open($fh,">","${agent}.txt") or die "$!";
        print $fh $header."\n";
    }
    print $fh $l."\n";
}
Use it as follows:
./perlscript.pl < inputfile.txt
If you don't have perl (check for it at /usr/bin/perl or some other such location), I will try to do an awk script. Let me know if you find problems running the above script.
In response to your updated request: you only want to split the file, with each output file less than 100MB, with no agent's records split across two files, and with the header printed in each output file. Here is a rough idea of how you can accomplish that. It doesn't do an exact cut (because you would need to calculate sizes before writing). If you set $maxfilesize to a value like 95*1024*1024 or 99*1024*1024, that should keep each file under 100MB (for example, if the maximum size of any one agent's records is less than 5MB, set $maxfilesize to 95*1024*1024).
#!/usr/bin/perl -w
# Max file size, approximately in bytes
#
# For 99MB make it as 99*1024*1024
#
my $maxfilesize=95*1024*1024;
#my $maxfilesize=400;

my $l=<>; chomp($l); my $header=$l;
my $fh;
my $filecounter=0;
my $filename="";
my $filesize=1000000000000; # big dummy size for first iteration
while ($l=<>) {
    chomp($l);
    if ($l=~m/^\s*[^,]+,Agent_\d+,[^,]+/) {
        if ($filesize>$maxfilesize) {
            print "FileSize: $filesize\n";
            $filecounter++; $filename=sprintf("outfile_%05d",$filecounter);
            print "Opening New File: $filename\n";
            open($fh,">","${filename}.txt") or die "$!";
            print $fh $header."\n";
            $filesize=length($header);
        }
    }
    print $fh $l."\n";
    $filesize+=length($l);
    print "FileSize: $filesize\n";
}
If you want more precise cuts than this, I will update it to buffer the data before printing.
Step 1. Save the header.
Step 2. Create a variable "content" to temporarily store what the program reads.
Step 3. Start reading the following lines; in Python:
if line.startswith("DWH"):
    # a new agent starts here
    if len(content) >= size_limit:   # size_limit is your predefined size
        # write your header + content to the next output file here,
        # then reinitiate the buffer with content = ""
        content = ""
    content += line
else:
    content += line

Adding custom header to specific files in a directory

I would like to add a unique one-line header to each FOCUS*.tsv file in a specified directory. After that, I would like to combine all of these files into one file.
First I tried a sed command:
my $cmd9 = `sed -i '1i$SampleID[4]' $tsv_file`; print $cmd9;
It looked like it worked, but after I combined all of these files into one file in the next section of the code, the inserted row was listed four times for each file.
I then tried the following Perl script to accomplish the same, but it deleted the content of the file and only printed out the added header.
I’m looking for the simplest way to accomplish what I’m looking for.
Here is what I’ve tried.
#!perl
use strict;
use warnings;
use Tie::File;

my $home="/data/";
my $tsv_directory = $home."test_all_runs/".$ARGV[0];
my $tsvfiles = $home."test_all_runs/".$ARGV[0]."/tsv_files.txt";
my @run_directory = (); @run_directory = split /\//, $tsv_directory; print "The run directory is #############".$run_directory[3]."\n";
my $cmd = `ls $tsv_directory/FOCUS*\.tsv > $tsvfiles`; #print "$cmd";
my $cmda = "ls $tsv_directory/FOCUS*\.tsv > $tsvfiles"; #print "$cmda";
my @tsvfiles = ();

#this code opens the tsv_files.txt file and passes each line into an array for individual manipulation
open(TXT2, "$tsvfiles");
while (<TXT2>){
    push (@tsvfiles, $_);
}
close(TXT2);
foreach (@tsvfiles){
    chop($_);
}

#this loop works fine
for my $tsv_file (@tsvfiles){
    open my $in,  '>', $tsv_file or die "Can't write new file: $!";
    open my $out, '>', "$tsv_file.new" or die "Can't write new file: $!";

    $tsv_file =~ m|([^/]+)-oncomine.tsv$| or die "Can't extract Sample ID";
    my $sample_id = $1;
    #print "The sample ID is ############## $sample_id\n";
    my $headerline = $run_directory[3]."/".$sample_id;
    print $out $headerline;

    while( <$in> ) {
        print $out $_;
    }
    close $out;
    close $in;
    unlink($tsv_file);
    rename("$tsv_file.new", $tsv_file);
}
Thank you
Apparently, the wrong '>' when opening the file for reading was the problem and it got solved.
However, I'd like to make a few comments on some of the rest of the code.
The list of files is built by running external ls redirected to a file, then reading this file into an array. However, that is exactly the job of glob and all of that is replaced by
my @tsvfiles = glob "$tsv_directory/FOCUS*.tsv";
Then you don't need the chomp either, and the chop that is used would actually hurt since it removes the last character, not only the newline (or really $/).
Use of chop is probably not what you want. If you are removing the linefeed ($/) use chomp
To extract a match and assign it, a common idiom is
my ($sample_id) = $tsv_file =~ m|([^/]+)-oncomine.tsv$|
or die "Can't extract Sample ID: $!";
Note that I also added $!, to actually print the error. Otherwise we just don't know what it was.
The unlink and rename appear to be overwriting one file with another. You can do that by using move from the core module File::Copy
use File::Copy qw(move);
move ("$tsv_file.new", $tsv_file)
    or die "Can't move $tsv_file.new to $tsv_file: $!";
which renames the _new into $tsv_file, so overwriting it.
As for how the files need to be combined, more precise explanation would be needed.

Parsing Excel of ASCII format in Perl

We have a Perl script whose job is to read an Excel file and convert it into a flat file. We get the Excel file from some other system on a shared location.
The other system actually generates a flat file with the data separated by tabs and a .xls extension appended. The problem is that if the file contains a string with a leading 0, e.g. 012345, it will be displayed as 12345 in Excel. To preserve the leading 0, what they do is write the data in this fashion (in Java):
"=\"" + some string + "\""
Now if we open the file in Excel it displays properly, with no = or ", but when reading it via Perl we get the string as it is, i.e. ="some string".
How can we work around this? I have tried a solution that trims the leading =" and trailing ", but it doesn't feel like a clean one. Can someone suggest anything else?
My suggestion (with the limited information you've given) would be to read the source file using Text::CSV, and then output via whatever means you would otherwise. For substitution, use regular expressions.
Simplified example (partially taken straight from the documentation):
#!/usr/bin/perl
use Text::CSV;

# set binary attribute and use a tab as a separator:
my $csv = Text::CSV->new ( { binary => 1, sep_char => "\t"} )
    or die "Cannot use CSV: ".Text::CSV->error_diag ();

open my $fh, "<:encoding(utf8)", "test.xls" or die "test.xls: $!";
my @rows;
while ( my $row = $csv->getline( $fh ) ) {
    foreach my $column (@$row) {
        $column =~ s/^="|"$//g; # remove opening '="' and closing '"'
    }
    push @rows, $row;
}
$csv->eof or $csv->error_diag();
close $fh;
# do file writing magic or further processing here
In case you don't know, the 's' at the beginning of the regular expression indicates you want to substitute, and the 'g' at the end means "repeat for all matches".
For more information, see:
https://metacpan.org/pod/Text::CSV
http://perldoc.perl.org/perlretut.html (Perl documentation on regular expressions)

How to rename multiple files in terminal (LINUX)?

I have a bunch of files in a directory with no pattern in their names at all. All I know is that they are all JPG files. How do I rename them so that they have some sort of sequence in their names?
I know that in Windows all you do is select all the files and rename them to the same name, and Windows automatically adds sequence numbers to distinguish them.
I want to be able to do that in Linux (Fedora), but there you can only do it in the terminal. Please help, I am lost.
What is the command for doing this?
The best way to do this is to run a loop in the terminal, going from picture to picture and renaming each one with a number that increases by one on every iteration.
You can do this with:
n=1
for i in *.jpg; do
    p=$(printf "%04d.jpg" ${n})
    mv ${i} ${p}
    let n=n+1
done
Just enter it into the terminal line by line.
If you want to put a custom name in front of the numbers, you can put it before the percent sign in the third line.
If you want to change the number of digits in the names' number, just replace the '4' in the third line (don't change the '0', though).
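For example, to get names like holiday-001.jpg, holiday-002.jpg and so on (a made-up prefix and three digits), the third line would become:
p=$(printf "holiday-%03d.jpg" ${n})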
I will assume that:
There are no spaces or other weird control characters in the file names
All of the files in a given directory are jpeg files
That in mind, to rename all of the files to 1.jpg, 2.jpg, and so on:
N=1
for a in ./* ; do
    mv $a ${N}.jpg
    N=$(( $N + 1 ))
done
If there are spaces in the file names:
find . -type f | awk 'BEGIN{N=1}
{print "mv \"" $0 "\" " N ".jpg"
N++}' | sh
Should be able to rename them.
The point being, Linux/UNIX does have a lot of tools which can automate a task like this, but they have a bit of a learning curve to them
Create a script containing:
#!/bin/sh
filePrefix="$1"
sequence=1
for file in $(ls -tr *.jpg) ; do
    renamedFile="$filePrefix$sequence.jpg"
    echo $renamedFile
    currentFile="$(echo $file)"
    echo "renaming \"$currentFile\" to $renamedFile"
    mv "$currentFile" "$renamedFile"
    sequence=$(($sequence+1))
done
exit 0
If you named the script, say, RenameSequentially then you could issue the command:
./RenameSequentially Images-
This would rename all *.jpg files in the directory to Images-1.jpg, Images-2.jpg, etc... in order of oldest to newest... tested in the OS X command shell.
I wrote a perl script a long time ago to do pretty much what you want:
#
# reseq.pl renames files to a new named sequence of filenames
#
# Usage: reseq.pl newname [-n seq] [-p pad] fileglob
#
use strict;

my $newname = $ARGV[0];
my $seqstr  = "01";
my $seq     = 1;
my $pad     = 2;
shift @ARGV;

if ($ARGV[0] eq "-n") {
    $seqstr = $ARGV[1];
    $seq = int $seqstr;
    shift @ARGV;
    shift @ARGV;
}
if ($ARGV[0] eq "-p") {
    $pad = $ARGV[1];
    shift @ARGV;
    shift @ARGV;
}

my $filename;
my $suffix;
for (@ARGV) {
    $filename = sprintf("${newname}_%0${pad}d", $seq);
    if (($suffix) = m/.*\.(.*)/) {
        $filename = "$filename.$suffix";
    }
    print "$_ -> $filename\n";
    rename ($_, $filename);
    $seq++;
}
You specify a common prefix for the files, a beginning sequence number and a padding factor.
For example:
# reseq.pl abc -n 1 -p 2 *.jpg
Will rename all matching files to abc_01.jpg, abc_02.jpg, abc_03.jpg...

Combine CSV files

What's the best way to combine two csv files and append the results to the same line in perl?
For example, one CSV file looks like
1234,user1,server
4323,user2,server
532,user3,server
The second looks like
user1,owner
user2,owner
user3,owner1
The result I want it to look like is
1234,user1,server,owner
4323,user2,server,owner
532,user3,server,owner1
The users are not in order, so I'll need to search the first CSV file (which I've stored in an array) to see which users match, then append the owner to the end of the line.
So far I've read in both files into arrays and then I get lost
I would post the code but it's part of a much larger script
This sounds most suited to a hash. First read one file into a hash, then add the other. You might add warnings for values that exist in one file but not the other.
Something like:
use warnings;
use strict;
use Text::CSV;
use autodie;

my %data;
my $file1 = "user.csv";
my $file2 = "user2.csv";

my $csv = Text::CSV->new ( { binary => 1 } );

open my $fh, '<', $file1;
while (my $row = $csv->getline($fh)) {
    my ($num, $user, $server) = @$row;
    $data{$user} = { 'num' => $num, 'server' => $server };
}

open $fh, '<', $file2;
while (my $row = $csv->getline($fh)) {
    my ($user, $owner) = @$row;
    if (not defined $data{$user}) {
        # warning? something else appropriate
    } else {
        $data{$user}{'owner'} = $owner;
    }
}

for my $user (keys %data) {
    print join(',', $data{$user}{'num'}, $user, $data{$user}{'server'},
        $data{$user}{'owner'}), "\n";
}
Edit: As recommended in comments and other answers, I changed the method of extracting the data to using Text::CSV instead of split. I'm not too familiar with the module, but it seems to be working in my testing.
Looks like a direct application for the join command (tied with sort). This assumes that the data is as simple as shown - no commas embedded in strings or anything nasty.
sort -t, -k 2 file1 > file1.sorted
sort -t, -k 1 file2 > file2.sorted
join -t, -1 2 -2 1 file1.sorted file2.sorted
With bash, you could do it all on one line.
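For instance, using process substitution, the sorting and the join collapse into a single line (same field numbers as in the commands above):
join -t, -1 2 -2 1 <(sort -t, -k2,2 file1) <(sort -t, -k1,1 file2)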
If you really want to do it in Perl, then you need to use a hash keyed by the user column, potentially with an array of entries per hash key. You then iterate through the keys of one of the hashes, pulling the matching values from the other and printing the data. If you're in Perl, you can use the Text::CSV module to get accurate CSV splitting.
Assuming the 1st file has 2 commas per line and the 2nd only one, you will get all lines of the 1st file, but only the matching ones of the 2nd:
my %content;
while( <$file1> ) {
    chomp;
    /,(.+),/;
    $content{$1} = "$_,";
}
while( <$file2> ) {
    chomp;
    /(.+),(.+)/;
    $content{$1} .= $2;
}
print "$content{$_}\n" for sort keys %content;
import csv

files=['h21.csv', 'h20.csv','h22.csv']
lineCount=0
for file in files:
    with open(file,'r') as f1:
        csv_reader=csv.reader(f1, delimiter=',')
        with open('testout1.csv','a' ,newline='') as f2:
            csv_writer=csv.writer(f2,delimiter=',')
            if lineCount==0:
                csv_writer.writerow(["filename","sno","name","age"])
                lineCount += 1
            next(csv_reader,None)
            for row in csv_reader:
                data=[file]+row
                csv_writer.writerow(data)
