Get list of files from regex - Linux

I'm trying to write a bash script that extracts info from PDF documents. The first argument should be a regex or the name of a file. E.g.:
$ autobib shrek2001.pdf
$ autobib *.pdf
My idea is to generate a list of files matching the regex and extract information from them. My code at the moment looks like this:
for article in $(ls $1);do
pdfinfo $article
done
But doing so the loop stops at the first file. How can I loop over all the files matching my regex?

clpgr has it completely right. Change your program to look like this:
for article in "$@"; do
pdfinfo "$article"
done
The reason your program only does the first file is that the shell command gets globbed. That is, when you issue the command autobib *.pdf, you are really issuing this command: autobib 1.pdf 2.pdf 3.pdf (I'm making up some file names since I don't know what's in the directory). The point is, your program will have $1 set to 1.pdf, so you'll be executing $( ls 1.pdf ), which only returns 1.pdf.
Truth is, your program may have worked (depending on the file names in the directory) if you executed this way: autobib "*.pdf". In this example, the "*.pdf" is not globbed by the shell because it is quoted. Now, your program's $1 variable will have the value *.pdf.
That said, "$@" is soooooo much better than $( ls $1 ). "$@" will actually preserve spaces in the arguments.
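To make the difference concrete, here is a minimal sketch. The helper names show_args, forward_quoted and forward_unquoted are made up for illustration, and so are the file names. It compares forwarding arguments with quoted "$@" against unquoted $*.

```shell
#!/bin/bash
# Hypothetical helper: print one line per argument received.
show_args() {
    local arg
    for arg in "$@"; do
        printf '<%s>\n' "$arg"
    done
}

# Quoted "$@" forwards every argument unchanged.
forward_quoted() { show_args "$@"; }

# Unquoted $* re-splits the arguments on whitespace.
forward_unquoted() { show_args $*; }

forward_quoted "shrek 2001.pdf" fiona.pdf
# <shrek 2001.pdf>
# <fiona.pdf>

forward_unquoted "shrek 2001.pdf" fiona.pdf
# <shrek>
# <2001.pdf>
# <fiona.pdf>
```

With the quoted form, a file name containing a space still arrives as a single argument, which is what pdfinfo needs.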

Related

How to have the list of files represented by a regex given in input to a bash script

I'm writing a script for the automatic extraction of bib records from scientific papers.
In an old version of the script I passed as input the name of the folder where all the PDFs were stored; now I want to pass a regex. E.g. before:
./AutoBib.sh Papers/
Now:
./Autobib.sh Papers/*.pdf
In the folder there are, for example, 3 PDF files: Shrek.pdf, Fiona.pdf, Donkey.pdf. My script should retrieve the DOI from every file and create a file listing all of them, but when I execute it, it returns the DOI of the first file and nothing more.
Here there is my code:
for i in $1; do
doi $i
done
doi is a function that extracts the DOI from a PDF and puts it in a txt file. When I run the script, it returns only the DOI of the first file.
How can I feed a regex to my script and iterate through all files that match that regex?
It's important to understand that Papers/*.pdf is not a regular expression, it's a wildcard pattern that causes bash to perform filename expansion, or globbing.
$1 represents the first argument to your script, so your for loop is only ever iterating over that one argument.
Use "$@" to represent all arguments:
for i in "$@"; do
doi "$i"
done
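If you want to filter files within a directory by pattern, you can pass the pattern as a second script parameter and search for matching files using find.
Here is the code. It also handles filenames containing spaces:
find "$1" -maxdepth 1 -name "$2" -exec doi {} \;
Usage example: ./Autobib.sh Papers/ '*.pdf' (quote the pattern so the shell passes it to the script unexpanded). Note that -exec runs external programs, so this requires doi to be an executable on your PATH rather than a shell function.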
You can run the ls command in a loop to solve your problem. Note that this breaks on file names containing whitespace.
for x in $(ls $@/*.pdf)
do
echo $x ## if you want only the file name, change this line to echo `basename $x`
done
I recreated the scenario you described above; refer to the snapshot.

how to pass asterisk into ls command inside bash script

Hi… Need a little help here…
I tried to emulate the DOS' dir command in Linux using bash script. Basically it's just a wrapped ls command with some parameters plus summary info. Here's the script:
#!/bin/bash
# default to current folder
if [ -z "$1" ]; then var=.;
else var="$1"; fi
# check file existence
if [ -a "$var" ]; then
# list contents with color, folder first
CMD="ls -lgG $var --color --group-directories-first"; $CMD;
# sum all files size
size=$(ls -lgGp "$var" | grep -v / | awk '{ sum += $3 }; END { print sum }')
if [ "$size" == "" ]; then size="0"; fi
# create summary
if [ -d "$var" ]; then
folder=$(find $var/* -maxdepth 0 -type d | wc -l)
file=$(find $var/* -maxdepth 0 -type f | wc -l)
echo "Found: $folder folders "
echo " $file files $size bytes"
fi
# error message
else
echo "dir: Error \"$var\": No such file or directory"
fi
The problem is that when the argument contains an asterisk (*), the ls within the script behaves differently compared to the direct ls command given at the prompt. Instead of returning the whole file list, the script only returns the first file. See the video below for a comparison in action. I don't know why it behaves like that.
Anyone knows how to fix it? Thank you.
Video: problem in action
UPDATE:
The problem has been solved. Thank you all for the answers. Now my script works as expected. See the video here: http://i.giphy.com/3o8dp1YLz4fIyCbOAU.gif
The asterisk * is expanded by the shell when it parses the command line. In other words, your script doesn't get a parameter containing an asterisk, it gets a list of files as arguments. Your script only works with $1, the first argument. It should work with "$@" instead.
This is because when you retrieve $1 you assume the shell does NOT expand *.
In fact, when * (or another glob) matches, the shell expands it into one argument per matching file, and your script receives them as $1, $2, etc.
You were lucky to get just the first file. If the first file's path contains spaces, you'll get an error: the unquoted $var inside your CMD string is split on whitespace, so ls only sees fragments of the name.
Seriously, read this and especially this. Really.
And please don't do things like
CMD=whatever you get from user input; $CMD;
You are begging for trouble. Don't execute arbitrary strings from the user.
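One common alternative to building a command in a string is a bash array, which keeps each word as its own element so nothing gets re-split or re-interpreted. The sketch below is only an illustration: list_dir is a made-up function name, and the ls options are copied from the script in the question (they assume GNU ls).

```shell
#!/bin/bash
# Sketch: keep the command in an array instead of a string.
# list_dir is a hypothetical name; the options come from the question.
list_dir() {
    local target=${1:-.}    # default to the current folder
    # One array element per word; "$target" stays a single element
    # even if it contains spaces or glob characters.
    local cmd=(ls -lgG --color --group-directories-first -- "$target")
    "${cmd[@]}"
}

list_dir "$HOME"
```

Because the array is expanded with "${cmd[@]}", the user's argument is passed to ls as data and is never parsed as shell syntax.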
Both answers above already answered your question, so I'll be a bit more verbose.
Your terminal is (probably) running the bash interpreter. This is the program which parses your input line(s) and does "things" based on your input.
When you enter a line, bash goes through the following workflow:
parsing and lexical analysis
expansion
brace expansion
tilde expansion
variable expansion
arithmetic and other substitutions
command substitution
word splitting
filename generation (globbing)
removing quotes
Only after all of the above will bash
execute some external commands, like ls or dir.sh, etc.,
or perform some "internal" actions for the known keywords and builtins like echo, for, if, etc.
As you can see, the second-to-last step is filename generation (globbing). So, in your case, if test* matches some files, bash expands the wildcard characters (i.e. does the globbing).
So,
when you enter dir.sh test*,
and the test* matches some files
the bash does the expansion first
and after will execute the command dir.sh with already expanded filenames
e.g. the script get executed (in your case) as: dir.sh test.pas test.swift
BTW, it acts in exactly the same way for your ls example:
bash expands ls test* to ls test.pas test.swift
then executes ls with the above two arguments
and ls prints the result for the two arguments it got.
In other words, ls never even sees the test* argument: whenever possible, bash expands the wildcard characters (* and ?) first.
Now back to your script: add the following line after the shebang:
echo "$0 got these arguments: $@"
and you will immediately see the real arguments your script was executed with.
Also, in such cases it is good practice to try executing the script in debug mode, e.g.
bash -x dir.sh test*
and you will see, what the script does exactly.
Also, you can do the same in your current interpreter: just enter in the terminal
set -x
and try running dir.sh test*, and you will see how bash executes the dir.sh command. (To stop the debug mode, just enter set +x.)
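The effect of globbing on the argument list can also be checked with a small self-contained experiment. This is just a sketch: report_args is a made-up helper, and the scratch directory reuses the test.pas and test.swift names from the example above.

```shell
#!/bin/bash
# Hypothetical helper: report how many arguments arrived, then list them.
report_args() {
    echo "got $# argument(s)"
    printf '  %s\n' "$@"
}

# Scratch directory holding the two example files.
demo_dir=$(mktemp -d)
touch "$demo_dir/test.pas" "$demo_dir/test.swift"

# Unquoted pattern: bash expands it before the function is called.
report_args "$demo_dir"/test*      # got 2 argument(s)

# Quoted pattern: the function receives the literal string.
report_args "$demo_dir/test*"      # got 1 argument(s)
```

Here $# is the argument count, which makes it easy to see that the callee never receives the asterisk unless it is quoted.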
Everybody is giving you valuable advice which you should definitely follow!
But here is the real answer to your question.
To pass unexpanded arguments to any executable you need to single quote them:
./your_script '*'
The best solution I have is to use the eval command, in this way:
#!/bin/bash
cmd="some command \"with_quetes_and_asterisk_in_it*\""
echo "$cmd"
eval "$cmd"
The eval command takes its arguments and evaluates them into the command as the shell does.
This solves my problem when I need to call a command with asterisk '*' in it from a script.

find returning inverted results

In short, I wrote this little script to clean up some directories where I had consolidated directories and files from multiple sources. I used the cp command with the --backup=numbered feature, so that files with identical names would have a suffix like .~1~ appended to avoid overwriting. I then ran fdupes to remove duplicate files. In some cases fdupes removed the file without the suffix (the original file), so I wanted to scan the directories for files with the suffix appended by cp. If no file exists with the suffix removed, I would mv the file to that name; otherwise I would leave it alone to avoid deleting anything, since fdupes did not think it was a duplicate.
The issue is that the test condition if [ -f ... ] in the code below returns the inverse of what it should, and I cannot understand why. For example, when the file exists it returns false, and when the file does not exist it returns true. I worked around it by swapping the actions for the two branches, verified that it worked as intended, and ran it that way, but I would like to know if anyone knows why it behaves this way. I am not a bash script expert by any means, so it's possible that I missed something simple.
#!/bin/bash
logfile=$$.log
exec > $logfile 2>&1
IFS='
'
#set -f
for FILE in $(find . -type f -regextype posix-extended -regex '^.*(\.~[0-9]+~)+$')
do
FILE2=${FILE%%.~[0-9]*} # remove the suffix
if [ -f "${FILE2}" ]
then
echo ERROR: "${FILE2}" already exists!
else
echo "${FILE}" renamed "${FILE2}"
mv "${FILE}" "${FILE2}"
fi
done
You might be able to see the problem by modifying your script to show both FILE and FILE2 in the error message. There are a few minor problems with the script which could cause some confusion (but not the "inverted" logic):
find output is not sorted. If you had more than one backup file, a randomly chosen one would replace the original file;
you could sort the output using an expression like |sort -t~ -n -k2 on the end of the find-command.
the regular expression allows multiple matches of the .~[0-9]+~ pattern. Conceivably you could have some odd file which ends with .~1~.~2~.
the part where the suffix is removed assumes a single .~[0-9]+~ is on the end of the filename. An embedded .~0, e.g., foo.~0bar.~1~, would reduce FILE2 to foo. The workaround for that would be more cumbersome (since the suffix-stripping uses globbing), but could be done with a case statement which matched an explicit number of digits (likely three digits would be enough).
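As an alternative to the case statement suggested above (a different technique, not the one the answer describes), bash's [[ ... =~ ... ]] regex match can strip exactly one trailing numbered-backup suffix. This is a minimal sketch; strip_backup_suffix is a made-up helper name.

```shell
#!/bin/bash
# Sketch: remove exactly one trailing ".~N~" suffix using a regex match.
# strip_backup_suffix is a hypothetical helper name.
strip_backup_suffix() {
    local name=$1
    # Greedy (.*) keeps everything up to the LAST ".~digits~" at the end.
    local re='^(.*)\.~[0-9]+~$'
    if [[ $name =~ $re ]]; then
        printf '%s\n' "${BASH_REMATCH[1]}"
    else
        printf '%s\n' "$name"    # no backup suffix: return the name as-is
    fi
}

strip_backup_suffix 'report.txt.~1~'    # report.txt
strip_backup_suffix 'foo.~0bar.~12~'    # foo.~0bar
```

Unlike the ${FILE%%...} glob, this only touches a suffix that is anchored at the end of the name, so an embedded .~0 is left alone.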

How to make this (l)unix script dynamically accept directory name in for-loop?

I am teaching myself more (l)unix skills and wanted to see if I could begin to write a program that will eventually read all .gz files and expand them. However, I want it to be super dynamic.
#!/bin/bash
dir=~/derp/herp/path/goes/here
for file in $(find dir -name '*gz')
do
echo $file
done
So when I execute this file, I simply run
bash derp.sh.
I don't like this. I feel the script is too brittle.
How can I rework my for loop so that I can say
bash derp.sh ~/derp/herp/path/goes/here (1)
I tried re-coding it as follows:
for file in $*
However, I don't want to have to type in bash
derp.sh ~/derp/herp/path/goes/here/*.gz.
How could I rewrite this so I could simply type what is in (1)? I feel I must be missing something simple?
Note
I tried
for file in $*/*.gz and that obviously did not work. I appreciate your assistance; my sources have been a wrox unix text, carpentry v5, and man pages. Unfortunately, I haven't found anything that does what I want.
Thanks,
GeekyOmega
for dir in "$@"
do
for file in "$dir"/*.gz
do
echo "$file"
done
done
Notes:
In the outer loop, dir is assigned successively to each argument given on the command line. The special form "$@" is used so that directory names that contain spaces will be processed correctly.
The inner loop runs over each .gz file in the given directory. By placing $dir in double-quotes, the loop will work correctly even if the directory name contains spaces. This form will also work correctly if the gz file names have spaces.
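One caveat with the glob loop: if a directory contains no .gz files, bash by default leaves the pattern unexpanded and the loop body runs once with the literal string. A possible refinement, sketched here with a made-up function name list_gz, is to enable nullglob inside a subshell so the setting does not leak into the rest of the script.

```shell
#!/bin/bash
# Sketch: the nested loop with nullglob, so directories without
# .gz files produce no iterations. list_gz is a hypothetical name.
# The function body is a subshell ( ... ), so shopt stays local to it.
list_gz() (
    shopt -s nullglob
    for dir in "$@"; do
        for file in "$dir"/*.gz; do
            printf '%s\n' "$file"
        done
    done
)

list_gz "$HOME" /tmp
```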
#!/bin/bash
for file in $(find "$@" -name '*.gz')
do
echo $file
done
You'll probably prefer "$@" instead of $*; if you were to have spaces in filenames, like with a directory named My Documents and a directory named Music, $* would effectively expand into:
find My Documents Music -name '*.gz'
whereas "$@" would expand into:
find "My Documents" "Music" -name '*.gz'
Requisite note: Using for file in $(find ...) is generally regarded as a bad practice, because it does tend to break if you have spaces or newlines in your directory structure. Using nested for loops (as in John's answer) is often a better idea, or using find -print0 and read as in this answer.
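For reference, the find -print0 and read combination mentioned above usually looks something like the following. Treat it as a sketch rather than the linked answer's exact code; list_gz_recursive is a made-up name.

```shell
#!/bin/bash
# Sketch: iterate over find results safely using NUL-delimited names.
# list_gz_recursive is a hypothetical helper name.
list_gz_recursive() {
    local file
    # -print0 ends each name with a NUL byte; read -d '' splits on NUL,
    # so spaces and newlines inside names are preserved.
    while IFS= read -r -d '' file; do
        printf '%s\n' "$file"
    done < <(find "$@" -type f -name '*.gz' -print0)
}

list_gz_recursive /tmp
```

The process substitution < <(...) keeps the loop in the current shell, so variables set inside the loop are still visible afterwards.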

run cat command for all the files in the directory given in argument of the script file and out put with the name given as second argument

I run the following code for concatenating files in a directory given as the argument for the script file in bash
for i in $*
do
cat $* > /home/christy/Documents/filetest/catted.txt
done
This produces the error
cat: /home/christy/Documents/filetest/catted.txt: input file is output file
I think there are at least 4 things wrong with your script....
Firstly, your loop will set the value of i to the name of each file in succession, so you would want to actually use i inside your loop, like this:
for i in $*
do
cat "$i" ....somewhere
done
Secondly, if you use the > redirection, each file will land exactly on top of the previous one, so you should really use the >> redirection, which will append the current file to the end of the previous one, like this:
for i in $*
do
cat "$i" >> ...somewhere
done
Thirdly, I think you should use double-quoted "$@" to get all your command-line arguments, rather than plain $*
for i in "$@"
...
Fourthly, you can achieve the exact effect I think you want with this simpler command:
cat "$@" > /home/christy/Documents/filetest/catted.txt
You can't cat a file back onto itself. That's what "input file is output file" means. Because catted.txt shows up in your list of arguments to cat, it is going to try to cat to itself. So, move catted.txt to somewhere other than the source directory.
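If the output file has to live in the source directory anyway, one option is to skip it explicitly while looping. The sketch below assumes the interface from the question title (directory as first argument, output name as second); concat_dir is a made-up function name.

```shell
#!/bin/bash
# Sketch: concatenate every regular file in a directory into an output
# file, skipping the output file itself. concat_dir is a hypothetical name.
# Usage: concat_dir SOURCE_DIR OUTPUT_FILE
concat_dir() {
    local src=$1 out=$2 f
    : > "$out"                            # create or truncate the output
    for f in "$src"/*; do
        [ -f "$f" ] || continue           # skip directories, unmatched glob
        [ "$f" -ef "$out" ] && continue   # same file as the output: skip
        cat -- "$f" >> "$out"
    done
}
```

The -ef test compares the underlying files, so the output is skipped even if it is reached through a different path spelling.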

Resources