Trying to output the page counts of a large number of PDFs to a log file - Linux

I have about 1,550 .pdf files that I want to find page counts for.
I used the command ls -Q | grep \.pdf > ../lslog.log to output all the file names with the extension .pdf into a .log file with double quotes around them. I then opened the lslog.log file in gedit and replaced all the " (double quotes) with ' (apostrophes) so that I could use the files that contain parentheses in the final command.
When I use the command exiftool -"*Count*" (which outputs any metadata tag of the selected file whose name contains "Count") on a single file, for example exiftool -"*Count*" 'examplePDF(withparantheses).pdf', I get something like "Page Count : 512", or whatever the page count happens to be.
However, when I use it on multiple files, for example: exiftool -"*Count*" 'examplePDF(withparantheses).pdf' 'anotherExamplePDF.pdf' I get
File not found: examplePDF(withparantheses).pdf,
======== anotherExamplePDF.pdf
Page Count : 362
1 image files read
1 files could not be read
So basically, I'm able to read the last file, but not the first one. This pattern continues as I add more files: exiftool finds the file and the page count of the last file, but not of the other files.
Do I need to input multiple files differently? I'm using a comma right now to separate files, but even without the comma I get the same result.
Does exiftool take multiple files?

I don't know exactly why you're getting the behaviour that you're getting, but it looks to me like everything you're doing can be collapsed into one line:
exiftool -"*Count*" *.pdf
My output from a bunch of PDFs I had around looks like this:
======== 86A103EW00.pdf
Page Count : 494
======== DSET3.5_Reportable_Items_Linux.pdf
Page Count : 70
======== DSView 4 v4.1.0.36.pdf
Page Count : 7
======== DSView-Release-Notes-v4.1.0.77 (1).pdf
Page Count : 7
======== DSView-Release-Notes-v4.1.0.77.pdf
Page Count : 7
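For what it's worth, exiftool does take multiple files, as plain space-separated arguments; commas between names become part of the names themselves, which would explain the "File not found: examplePDF(withparantheses).pdf," error above. A small sketch with throwaway files in /tmp (the names are the question's hypothetical examples), showing that a shell glob hands names containing parentheses to a command intact, so the quote-replacement detour isn't needed:

```shell
# Create two throwaway files whose names mimic the question's examples.
mkdir -p /tmp/glob_demo && cd /tmp/glob_demo
touch 'examplePDF(withparantheses).pdf' 'anotherExamplePDF.pdf'

# The glob expands to separate, intact arguments -- exactly what a
# command like `exiftool -"*Count*" *.pdf > ../pagecounts.log` would see.
for f in *.pdf; do
  printf '<%s>\n' "$f"   # angle brackets show each name is one argument
done
```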

Related

Copying and pasting using Python for files with similar but not exact names

I have two folders each with several files.
Folder 1:
abc_1600_efg.xlsx
abc_1601_efg.xlsx
abc_1602_efg.xlsx
abc_1603_efg.xlsx
Folder 2:
ijk_1600_xyz.xlsx
ijk_1601_xyz.xlsx
ijk_1602_xyz.xlsx
ijk_1603_xyz.xlsx
lmn_1600_tuv.xlsx
lmn_1601_tuv.xlsx
lmn_1602_tuv.xlsx
lmn_1603_tuv.xlsx
Assuming the files in each folder are in random order, does anyone have ideas on how to use Python 3.x to copy from file 'abc_1600_efg.xlsx' in folder 1 and then have Python search for the corresponding file in folder 2 ('ijk_1600_xyz.xlsx')? The number portion of the title is the key that needs to be matched. Then I want to paste the data into the file 'ijk_1600_xyz.xlsx' (folder 2 has two files with the same number, 1600, but I need to find just the 'ijk_1600_xyz' file).
I want to loop this so that this would be done for every file in folder 1, starting at 1600, then 1601, then 1602, etc. I have the copy-and-paste portion finished; I'm just stuck on the search-and-match portion.
Thank you in advance.
I haven't checked it, but something like:
import os
import re
for file1 in os.listdir(folder1):
    match = re.match(r'..._(\d+)_.*', file1).group(1)
    for file2 in os.listdir(folder2):
        if '_' + match + '_' in file2:
            ... copy ...
Anyway, you should know how to adapt to these situations.

How to avoid the 40 character maximum, when reading a filename from a Weblogic server?

I am trying to read the names of some files from a WebLogic server.
dir.eachFileRecurse(FileType.FILES) { file ->
    println file.getName()
}
However, the base filename must be too long, since it's cut off when I print file.getName(). Looking at the deployed jar, I have the file
OnlineOfflineSomethingknowledgement-2.DDD
The result of the print however is
OnlineOfflineSomethingknowledgement-2.D
It's like 40 characters is the maximum length of the filename.
Looking at the list of files in the SB console, the 40-character maximum is also present in the web view. Hovering the mouse over the filename, though, shows the full name of the file.
Is there a way to get the full file name from the code?
It's not clear what environment your script executes in; normally there is no such limitation.
Try printing the class of your dir and file variables; that will probably give you an answer.

Searching in multiple files using findstr, only proceeding with the resulting files? (cmd)

I'm currently working on a project where I search hundreds of files using findstr in the command line. If I find the string which I searched for, I want to proceed with this exact file (and the other ones that include my string).
So in my case:
I searched for the string WRI2016 by using:
H:\KOBINI>findstr "WRI2016" *.ini > %temp%\xx.txt && %temp%\xx.txt
To see what the PC does, I save the output in a .txt file, as you can see.
So if my file includes WRI2016 I want to extract some facts out of the file. In my case it is NR, Kunde, WebHDAktiv, DigIDAktiv.
But I just can't find a proper way to link these two steps.
At first I simply printed all of the parameters:
H:\KOBINI>findstr "\<NR Kunde WRI2016 WebHDAktiv DigIDAktiv" *.ini > %temp%\xx.csv && %temp%\xx.csv
I also played around using the if command but that didn't really work out. I'm pretty new to this stuff as you'll see in my following tries to solve this problem:
H:\KOBINI>findstr "\<NR DigIDAktiv WebHDAktiv" set a =*.ini findstr "WRI2016" set b =*.ini if a EQU b > %temp%\xx.txt && %temp%\xx.txt
So all I wanted to achieve with that weird code was: if there is a WRI2016 in the file, give me the remaining parameters. But that didn't work out at all.
I also tried it with using new lines for every command which didn't change a thing.
As I want this to be a .csv in the end I want to add a semicolon between my parameters, any chance how I could do that? I've seen versions using -s";" which didn't do anything for me.
Sorry, I'm quite new and thought I'd give it a shot.
An example of my .ini files looks like this:
> Kunde=Markt
> Nr=101381
> [...]
> DigIDAktiv=Ja
> WebHDAktiv=Nein
> Version=WRI2016_U2_P1
Some files have a different Version, though.
So I only want to know NR, DigIDAktiv, ... if it's the 2016 version.
As a result it should be sorted into a CSV, in different columns.
So I search these files in order to find Version WRI2016 and then try to extract my information and put it into a .csv.

Matching text files from a list of system numbers

I have ~ 60K bibliographic records, which can be identified by system number. These records also hold full text (individual text files named by system number).
I have lists of system numbers in bunches of 5K and I need to find a way to copy only the text files from each 5K list.
All text files are stored in a directory (/fulltext) and are named something along these lines:
014776324.txt.
The 5k lists are plain text stored in separate directories (e.g. /5k_list_1, /5k_list_2, ...), where each system number matches a .txt file.
For example: bibliographic record 014776324 matches to 014776324.txt.
I am struggling to find a way to copy into the 5k_list_* folders only the corresponding text files.
Any idea?
Thanks indeed,
Let's assume we invoke the following script this way:
./the-script.sh fulltext 5k_list_1 5k_list_2 [...]
Or more succinctly:
./the-script.sh fulltext 5k_list_*
Then try using this (totally untested) script:
#!/usr/bin/env bash
set -eu # enable error checking
src_dir=$1 # first argument is where to copy files from
shift 1
for list_dir; do # implicitly loops over the remaining args
  while read -r sys_num _; do # one system number per list line
    cp "$src_dir/$sys_num.txt" "$list_dir/"
  done < "$list_dir/list.txt"
done
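A throwaway dry run of the same idea with fake data (my assumptions: each list line holds one system number, and each list directory contains a file named list.txt, which is also what the script reads):

```shell
# Fake data: one full-text file and one list naming it.
mkdir -p /tmp/biblio/fulltext /tmp/biblio/5k_list_1
echo 'full text here' > /tmp/biblio/fulltext/014776324.txt
echo '014776324'      > /tmp/biblio/5k_list_1/list.txt

# Copy every listed file into the list's own directory.
cd /tmp/biblio
while read -r sys_num _; do
  cp "fulltext/$sys_num.txt" "5k_list_1/"
done < 5k_list_1/list.txt
ls 5k_list_1   # 014776324.txt now sits next to list.txt
```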

Filename manipulation in cygwin

I am running cygwin on Windows 7. I am using a signal processing tool and basically performing alignments. I had about 1200 input files. Each file is of the format given below.
input_file_format = "AC_XXXXXX.abc"
The first step required building some kind of indexes for all the input files, this was done with the tool's build-index command and now each file had 6 indexes associated with it. Therefore now I have about 1200*6 = 7200 index files. The indexes are of the form given below.
indexes_format = "AC_XXXXXX.abc.1",
"AC_XXXXXX.abc.2",
"AC_XXXXXX.abc.3",
"AC_XXXXXX.abc.4",
"AC_XXXXXX.abc.rev.1",
"AC_XXXXXX.abc.rev.1"
Now, I need to use these indexes to perform the alignment. All the 6 indexes of each file are called together and the final operation is done as follows.
signal-processing-tool ..\path-to-indexes\AC_XXXXXX.abc ..\Query file
Where AC_XXXXXX.abc is the index base name associated with that particular input file. All 6 index files are matched by AC_XXXXXX.abc*.
My problem is that I need to use only the first 14 characters of the index file names for the final operation.
When I use the code below, the alignment is not executed.
for file in indexes/*; do ./tool $file|cut -b1-14 Project/query_file; done
I'd appreciate help with this!
First of all, keep in mind that $file will always start with "indexes/", so trimming first 14 characters would always include that folder name in the beginning.
To use first 14 characters in a variable, use ${file:0:14}, where 0 is the starting string index, and 14 is the length of the desired substring.
Alternatively, if you want to use cut, you need to run it in a command substitution: for file in indexes/*; do ./tool $(echo $file|cut -c 1-14) Project/query_file; done (I changed the arg for cut to -c, for characters instead of bytes.)
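One caveat either way, as noted above: trim the basename, not the whole path, or the leading "indexes/" (8 characters) eats most of the 14. A small sketch with a hypothetical index name; note that since the AC_XXXXXX.abc stem is 13 characters, taking 14 keeps a trailing dot, so 13 may be the count the tool actually wants:

```shell
file='indexes/AC_123456.abc.rev.1'   # hypothetical index file
base=${file##*/}                     # strip the directory: AC_123456.abc.rev.1
echo "${base:0:14}"                  # first 14 chars of the name: AC_123456.abc.
echo "${base:0:13}"                  # 13 chars drops the trailing dot: AC_123456.abc
```

With that, the loop would look like: for file in indexes/*; do base=${file##*/}; ./tool "indexes/${base:0:13}" Project/query_file; done (adjust 13 vs 14 to whatever your tool expects).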
