Remove string of characters from filename using Applescript or Linux - linux

I am converting several Word documents to PDFs. The input file names look like "CM_Genetics_in_OBGYN_docx", while the output file names look like "job_10-Microsoft_Word_-_CM_Genetics_in_OBGYN_docx.pdf". I want to delete "job_10-Microsoft_Word_-_" and "_docx" so that only the PDF file name is left: "CM_Genetics_in_OBGYN.pdf". I would really like to end up with "CM Genetics in OBGYN.pdf", but "CM_Genetics_in_OBGYN.pdf" would be acceptable if that last part makes it too complicated. I have some experience with AppleScript and Linux commands but can't nail this down.

Here you go:
for fn in job_*.pdf; do
    newname=${fn#job_??-*-_}      # strip the leading "job_NN-...-_" prefix (shortest match; assumes a two-digit job number)
    newname=${newname/_docx}      # drop the first occurrence of "_docx"
    newname=${newname//_/ }       # replace every remaining underscore with a space
    echo "mv '$fn' '$newname'"    # preview only; nothing is renamed yet
done
This will print mv commands ready to execute, but without renaming anything. To execute the renames, simply pipe the output to sh.
The echo is useful for testing everything safely. Make sure to check the strangest pattern you can find to cover all corner cases. If everything looks good, change the echo to perform the real action you want, for example:
for fn in job_*.pdf; do
    newname=${fn#job_??-*-_}
    newname=${newname/_docx}
    newname=${newname//_/ }
    mv "$fn" "/some/other/dir/$newname"
done

Related

Trying to iterate through files stored in variables

I have to go through 2 files stored as variables and delete the lines which contain a string stored in another variable:
file1="./file1"
file2="./file2"
text="searched text"
for i in $file1,$file2; do
sed -i.txt '/$text/d' $i
done
The files do exist in the same folder as the script.
I get "No such file or directory". I have been stuck for the past 3 hours on this and honestly I'm pretty much about to quit Linux.
You have several issues in your script. The right way to do it is:
file1="./file1"
file2="./file2"
text="searched text"
for i in "$file1" "$file2"; do
sed -i.txt "/$text/d" "$i"
done
Issues:
for expects a space-delimited list of arguments, not a comma-separated one
it is important to enclose your variable expansions in double quotes to prevent word splitting
you need double quotes around the sed expression, since single quotes won't expand the variable inside
You could catch these issues through shellcheck and debug mode (bash -x script) as suggested by Charles.
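As a throwaway illustration of the word-splitting point (not from the original thread):

f="my file"
for i in $f;   do echo "[$i]"; done   # unquoted: two iterations, [my] then [file]
for i in "$f"; do echo "[$i]"; done   # quoted: one iteration, [my file]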
Sorry to say, but your shell script is not nicely designed. In a shell script, multiple files should not be stored in separate variables. Suppose you need to do the same operation on 100 different files; what will you do then? So follow the style of code below.
First put all the file names in filelist.dat and save it:
text="searched text"
while IFS= read -r file; do
    sed -i.txt "/$text/d" "$file"   # double quotes so $text expands; loop variable is $file
done < filelist.dat
Also, if $text contains slashes, sed accepts an alternate address delimiter introduced by a backslash, like below:
sed -i.txt '\|'"$text"'|d' "$file"

concatenate two strings and one variable using bash

I need to generate a filename from three parts: two strings and one variable.
for f in `cat files.csv`; do echo fastq/$f\_1.fastq.gze; done
files.csv has the following lines:
Sample_11
Sample_12
I need to generate the following:
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze
My problem is that I got the following files:
_1.fastq.gze_11
_1.fastq.gze_12
the string after the variable deletes the string before it.
I appreciate any help
Regards
By the way, your idiom for f in `cat files.csv` should be avoided. Refer: Dangerous Backticks
while IFS= read -r f
do
    echo "fastq/${f}_1.fastq.gze"
done < files.csv
You can make it a one-liner with xargs and printf.
xargs printf 'fastq/%s_1.fastq.gze\n' <files.csv
The function of printf is to apply the first argument (the format string) to each argument in turn.
xargs runs this command with as many arguments as it can fit onto the command line (splitting it up into multiple invocations if the input file is too large to fit all the arguments onto a single command line, subject to the ARG_MAX constant in your kernel).
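To see how printf recycles its format string across arguments, you can try it directly (sample run):

$ printf 'fastq/%s_1.fastq.gze\n' Sample_11 Sample_12
fastq/Sample_11_1.fastq.gze
fastq/Sample_12_1.fastq.gze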
Your best bet, generally, is to wrap the variable name in braces. So, in this case:
echo "fastq/${f}_1.fastq.gze"
See this answer for some details about the general concept, as well.
Edit: An additional thought looking at the now-provided output makes me think that this isn't a coding problem at all, but rather a conflict between line-endings and the terminal/console program.
Specifically, if the CSV file ends its lines with just a carriage return (ASCII/Unicode 13), the end of Sample_11 might "rewind" the line to the start and overwrite.
In that case, based loosely on this article, I'd recommend replacing cat (if you understandably don't want to re-architect the actual script with something like while) with something that will strip the carriage returns, such as:
for f in $(tr -cd '\011\012\040-\176' < files.csv)
do
    echo "fastq/${f}_1.fastq.gze"
done
As the cited article explains, Octal 11 is a tab, 12 a line feed, and 40-176 are typeable characters (Unicode will require more thinking). If there aren't any line feeds in the file, for some reason, you probably want to replace that with tr '\015' '\012', which will convert the carriage returns to line feeds.
Of course, at that point, better is to find whatever produces the file and ask them to put reasonable line-endings into their file...
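To confirm the line-ending diagnosis before touching the script, you can dump the raw bytes of the file (od is POSIX, so this should work anywhere):

od -c files.csv | head   # \r before each \n (or \r with no \n at all) means CR line endings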

how to use do loop to read several files with similar names in shell script

I have several files named scale1.dat, scale2.dat, scale3.dat, ... up to scale9.dat.
I want to read these files in a do loop one by one, and with each file I want to do some manipulation (write the 1st column of each scale*.dat file to scale*.txt).
So my question is: is there a way to read files with similar names? Thanks.
The regular syntax for this is
for file in scale*.dat; do
awk '{print $1}' "$file" >"${file%.dat}.txt"
done
The asterisk * matches any text or no text; if you want to constrain to just single non-zero digits, you could say for file in scale[1-9].dat instead.
In Bash, there is also the non-standard brace expansion syntax scale{1..9}.dat, but this is Bash-only and so will not work in #!/bin/sh scripts (see the quick comparison below). (Your question has both sh and bash tags, so it's not clear which you require. Your comment that the Bash syntax is not working for you suggests that you may need a POSIX-portable solution.) Furthermore, Bash has something called extended globbing, which allows for quite elaborate pattern matching. See also http://mywiki.wooledge.org/glob
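A quick way to see the difference (brace expansion generates names unconditionally, while a glob expands only to files that actually exist):

echo scale{1..9}.dat   # Bash: prints all nine names even if no such files exist
echo scale[1-9].dat    # glob: prints only existing matches (or the literal pattern if none match)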
For a simple task like this, you don't really need the shell at all, though.
awk 'FNR==1 { if (f) close(f); f = FILENAME; sub(/\.dat$/, ".txt", f) }
{ print $1 > f }' scale[1-9]*.dat
(Okay, maybe that's slightly intimidating for a first-timer. But the basic point is that you will often find that the commands you want to use will happily work on multiple files, and so you don't need shell loops at all in those cases.)
I don't think so. Similar names or not, you will have to iterate through all your files (perhaps with a for loop) and use a nested loop to iterate through lines or words or whatever you plan to read from those files.
Alternatively, you can copy your files into one (say, scale-all.dat) and read that single file.
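For that combine-first approach, using the file name suggested above, this could be as simple as:

cat scale[1-9].dat > scale-all.dat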

How does linux redirect IO work internally

When we use the IO redirection operator for a shell script, does the operator keep all the data to be written in memory and write it all at once, or does it write to the file line by line?
Here is what I am working on.
I have about 200 small files ~1000 lines each in a specific format. I want to process (do a regex and change the format a little) each line in all the files and have the new transformed lines in a single combined file.
I have a transformscript.sh that takes a single file and applies the transformation. I run it in the following manner
sh transformscript.sh somefile.txt > newfile.txt
This works fine and fast for a single file.
How do I extend this to all the files? Will it be efficient to change transformscript.sh to take a directory as an argument instead of a filename and add a for loop to transform all the lines of all the files together? Or should I run transformscript.sh for each file, create a new file for each one, and combine them separately?
Thanks.
The redirect operator simply opens the file for writing and passes that file descriptor to the shell as its standard output. The shell then writes to the file directly.
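If you want to watch this happen, strace can show it on Linux (a rough check; the exact syscalls may differ between shells and libc versions):

$ strace -f -e trace=openat,dup2 sh -c 'echo hi > out.txt' 2>&1 | grep -E 'out\.txt|dup2'
# expect an openat() of out.txt with O_WRONLY|O_CREAT|O_TRUNC, then a dup2() onto fd 1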
You probably do NOT want to run the script separately for each file since you will incur the overhead of bash process creation for each pass. For example:
# don't do it this way
for somefile in $(ls somefiles*.txt); do
newfile=${somefile//some/new}
sh transformscript.sh $somefile > $newfile
done
The above starts one shell for every file found, which is pretty inefficient. It would be better to rewrite transformscript.sh to handle multiple files if possible. Depending on how complicated your transform is and whether you need to keep the original filenames, you might be able to use a single sed process. For example, assume you have 200 files named test1.txt through test200.txt, all with a "Hello world" line you want to change to "Hello joe". You could do something as simple as this:
sed -i.save 's/Hello world/Hello joe/' test*.txt
The -i tells sed to do an "in place" edit (edit the original file) and the optional ".save" argument to -i makes a backup copy of the original file with a .save extension before editing the original file. Note, this will leave the original contents in the .save files and the new content in the files with the original name which may not be what you want.
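Since the goal in the question was one combined output file, a sketch of the single-shell approach might look like this (the sed program is a placeholder for whatever transformscript.sh actually does):

for f in *.txt; do
    sed 's/old/new/' "$f"    # placeholder per-line transform
done > combined.txt          # one redirect on the loop collects all the output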

How do you pass on filenames to other programs correctly in bash scripts?

What idiom should one use in Bash scripts (no Perl, Python, and such please) to build up a command line for another program out of the script's arguments while handling filenames correctly?
By correctly, I mean handling filenames with spaces or odd characters without inadvertently causing the other program to handle them as separate arguments (or, in the case of < or > — which are, after all, valid if unfortunate filename characters if properly escaped — doing something even worse).
Here's a made-up example of what I mean, in a form that doesn't handle filenames correctly: Let's assume this script (foo) builds up a command line for a command (bar, assumed to be in the path) by taking all of foo's input arguments and moving anything that looks like a flag to the front, and then invoking bar:
#!/bin/bash
# This is clearly wrong
FILES=
FLAGS=
for ARG in "$#"; do
echo "foo: Handling $ARG"
if [ x${ARG:0:1} = "x-" ]; then
# Looks like a flag, add it to the flags string
FLAGS="$FLAGS $ARG"
else
# Looks like a file, add it to the files string
FILES="$FILES $ARG"
fi
done
# Call bar with the flags and files (we don't care that they'll
# have an extra space or two)
CMD="bar $FLAGS $FILES"
echo "Issuing: $CMD"
$CMD
(Note that this just an example; there are lots of other times one needs to do this and that to a bunch of args and then pass them onto other programs.)
In a naive scenario with simple filenames, that works great. But if we assume a directory containing the files
one
two
three and a half
four < five
then of course the command foo * fails miserably in its task:
foo: Handling four < five
foo: Handling one
foo: Handling three and a half
foo: Handling two
Issuing: bar four < five one three and a half two
If we actually allow foo to issue that command, well, the results won't be what we're expecting.
Previously I've tried to handle this through the simple expedient of ensuring that there are quotes around each filename, but I've (very) quickly learned that that is not the correct approach. :-)
So what is? Constraints:
I want to keep the idiom as simple as possible (not least so I can remember it).
I'm looking for a general-purpose idiom, hence my making up the bar program and the contrived example above instead of using a real scenario where people might easily (and reasonably) go down the route of trying to use features in the target program.
I want to stick to Bash script, I don't want to call out to Perl, Python, etc.
I'm fine with relying on (other) standard *nix utilities, like xargs, sed, or tr provided we don't get too obtuse (see #1 above). (Apologies to Perl, Python, etc. programmers who think #3 and #4 combine to draw an arbitrary distinction.)
If it matters, the target program might also be a Bash script, or might not. I wouldn't expect it to matter...
I don't just want to handle spaces, I want to handle weird characters correctly as well.
I'm not bothered if it doesn't handle filenames with embedded nul characters (literally character code 0). If someone's managed to create one in their filesystem, I'm not worried about handling it, they've tried really hard to mess things up.
Thanks in advance, folks.
Edit: Ignacio Vazquez-Abrams pointed me to Bash FAQ entry #50, which after some reading and experimentation seems to indicate that one way is to use Bash arrays:
#!/bin/bash
# This appears to work, using Bash arrays
# Start with blank arrays
FILES=()
FLAGS=()
for ARG in "$#"; do
echo "foo: Handling $ARG"
if [ x${ARG:0:1} = "x-" ]; then
# Looks like a flag, add it to the flags array
FLAGS+=("$ARG")
else
# Looks like a file, add it to the files array
FILES+=("$ARG")
fi
done
# Call bar with the flags and files
echo "Issuing (but properly delimited, not exactly as this appears): bar ${FLAGS[#]} ${FILES[#]}"
bar "${FLAGS[#]}" "${FILES[#]}"
Is that correct and reasonable? Or am I relying on something environmental above that will bite me later? It seems to work and it ticks all the other boxes for me (simple, easy to remember, etc.). It does appear to rely on a relatively recent Bash feature (FAQ entry #50 mentions v3.1, but I wasn't sure whether that was arrays in general or some of the syntax they were using with them), but I think it's likely I'll only be dealing with versions that have it.
(If the above is correct and you want to un-delete your answer, Ignacio, I'll accept it provided I haven't accepted any others yet, although I stand by my statement about link-only answers.)
Why do you want to "build up" a command? Add the files and flags to arrays using proper quoting and issue the command directly, using the quoted arrays as arguments.
Selected lines from your script (omitting unchanged ones):
if [[ ${ARG:0:1} == - ]]; then # using a Bash idiom
FLAGS+=("$ARG") # add an element to an array
FILES+=("$ARG")
echo "Issuing: bar \"${FLAGS[#]}\" \"${FILES[#]}\""
bar "${FLAGS[#]}" "${FILES[#]}"
For a quick demo of using arrays in this manner:
$ a=(aaa 'bbb ccc' ddd); for arg in "${a[@]}"; do echo "..${arg}.."; done
Output:
..aaa..
..bbb ccc..
..ddd..
Please see BashFAQ/050 regarding putting commands in variables. The reason your script doesn't work is that there's no way to quote the arguments within a quoted string. If you were to put quotes there, they would be considered part of the string itself instead of as delimiters. With the arguments left unquoted, word splitting is done and arguments that include spaces are seen as more than one argument. Arguments containing "<", ">" or "|" are not a problem in any case, since redirection and piping are recognized during parsing, before variable expansion, so those characters are seen as part of a string.
By putting the arguments (filenames) in an array, spaces, newlines, etc., are preserved. By quoting the array variable when it's passed as an argument, they are preserved on the way to the consuming program.
Some additional notes:
Use lowercase (or mixed case) variable names to reduce the chance that they will collide with the shell's builtin variables.
If you use single square brackets for conditionals in any modern shell, the archaic "x" idiom is no longer necessary if you quote the variables (see my answer here). However, in Bash, use double brackets. They provide additional features (see my answer here).
Use getopts as Let_Me_Be suggested (see the sketch after these notes). Your script, though I know it's only an example, will not be able to handle switches that take arguments.
This for ARG in "$#" can be shortened to this for ARG (but I prefer the readability of the more explicit version).
See BashFAQ #50 (and also maybe #35 on option parsing). For the scenario you describe, where you're building a command dynamically, the best option is to use arrays rather than simple strings, as they won't lose track of where the word boundaries are. The general rules are: to create an array, instead of VAR="foo bar baz", use VAR=("foo" "bar" "baz"); to use the array, instead of $VAR, use "${VAR[@]}". Here's a working version of your example script using this method:
#!/bin/bash
# Working version of the example, using arrays
FILES=()
FLAGS=()
for ARG in "$@"; do
    echo "foo: Handling $ARG"
    if [ x${ARG:0:1} = "x-" ]; then
        # Looks like a flag, add it to the flags array
        FLAGS=("${FLAGS[@]}" "$ARG") # FLAGS+=("$ARG") would also work in bash 3.1+, as Dennis pointed out
    else
        # Looks like a file, add it to the files array
        FILES=("${FILES[@]}" "$ARG")
    fi
done
# Call bar with the flags and files
CMD=("bar" "${FLAGS[@]}" "${FILES[@]}")
echo "Issuing: ${CMD[*]}"
"${CMD[@]}"
Note that in the echo command I used "${VAR[*]}" instead of the [@] form because there's no need/point to preserving word breaks here. If you wanted to print/record the command in unambiguous form, this would be a lot messier.
Also, this gives you no way to build up redirections or other special shell options in the built command -- if you add >outfile to the FILES array, it'll be treated as just another command argument, not a shell redirection. If you need to programmatically build these, be prepared for headaches.
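One hedged option for that messier unambiguous form is Bash's printf %q, which re-quotes each word so the line can be pasted back into a shell:

printf '%q ' "${CMD[@]}"; echo   # e.g. bar three\ and\ a\ half four\ \<\ five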
getopts should be able to handle spaces in arguments correctly ("file name.txt"). Weird characters should work as well, assuming they are correctly escaped (ls -b).
