bash script zip filename parsing strangely - linux

I'm trying to zip various files together (one of the included files is actually a zip itself) and name the resulting zip based on a handful of bash variables defined earlier. One of the variables used in the zip file name is being parsed from a #define in a config.h file. I successfully parsed together a .zip with the correct name, but when I tried to implement the same zip script in a slightly different situation I get erroneous zip names.
In Windows explorer, the erroneous zip name looks something like X1276N~E.ZIP
In Linux, the zip appears with the intended name except for a question mark (which I've come to understand is some sort of placeholder), e.g. foo-stuff-bar-9.1b?.zip
My current code trying to zip a file with name foo-stuff-bar-9.1b.zip:
foo_name=$1
bar_name=$2
rev_number=$(grep define[[:space:]]*SOME_NUMBER $directory/config.h | awk '{print $3;}'| tr -d '/"')
archive_name="$foo_name"-stuff-"$bar_name"-9."$rev_number"
zip "$archive_name".zip file1 file2 backup1.zip file3
So "foo_name" and "bar_name" are strings coming from the terminal when the script is run, "rev_number" is being parsed from config.h, and I'm formatting it all into "archive_name" before using it in the zip command.
I've tried all sorts of variations of quotation marks and brackets, and I get the same weird name no matter what I try. I'm not sure where the error comes from, as I'm parsing from many sources. Any advice is much appreciated.

Per Marc B's suggestion, I piped the string to xxd -b to look at each character byte by byte. It appeared as though I was accidentally parsing a character at the end of $archive_name when scraping the config.h file.
I was able to fix this by piping my string through tr -d "[:cntrl:]" to remove any control characters that would give weird file names.
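A minimal sketch of that fix, assuming the stray character was a carriage return from a config.h saved with CRLF line endings (the question does not say which control character it was). The temporary directory and the value 1b are made up for the demo.

```shell
# Hypothetical config.h with a CRLF line ending, created only for this demo.
directory=$(mktemp -d)
printf '#define SOME_NUMBER "1b"\r\n' > "$directory/config.h"

# Same pipeline as in the question: the trailing \r survives grep, awk and tr.
raw=$(grep 'define[[:space:]]*SOME_NUMBER' "$directory/config.h" | awk '{print $3}' | tr -d '/"')
printf '%s' "$raw" | od -c    # the dump shows \r after 1b

# Adding tr -d '[:cntrl:]' strips the carriage return (and any other control character).
rev_number=$(printf '%s' "$raw" | tr -d '[:cntrl:]')
archive_name="foo-stuff-bar-9.$rev_number"
echo "$archive_name.zip"

rm -r "$directory"
```

Dumping the value with od -c (or xxd) before building the file name is a quick way to confirm which byte is causing trouble.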

Related

Is it possible to partially unzip a .vcf file?

I have a ~300 GB zipped vcf file (.vcf.gz) which contains the genomes of about 700 dogs. I am only interested in a few of these dogs and I do not have enough space to unzip the whole file at this time, although I am in the process of getting a computer to do this. Is it possible to unzip only parts of the file to begin testing my scripts?
I am trying to extract a specific SNP at a position on a subset of the samples. I have tried using bcftools to no avail. (If anyone can identify what went wrong with that I would also really appreciate it. I created an empty file for the output (722g.990.SNP.INDEL.chrAll.vcf.bgz) but it returns the following error.)
bcftools view -f PASS --threads 8 -r chr9:55252802-55252810 -o 722g.990.SNP.INDEL.chrAll.vcf.gz -O z 722g.990.SNP.INDEL.chrAll.vcf.bgz
The output type "722g.990.SNP.INDEL.chrAll.vcf.bgz" not recognised
I am planning on trying awk, but need to unzip the file first. Is it possible to partially unzip it so I can try this?
Double check your command line for bcftools view.
The error message 'The output type "something" is not recognized' is printed by bcftools when you specify an invalid value for the -O (upper-case O) command line option like this -O something. Based on the error message you are getting it seems that you might have put the file name there.
Check that you don't have your input and output file names the wrong way around in your command. Note that the -o (lower-case o) command line option specifies the output file name, and the file name at the end of the command line is the input file name.
Also, you write that you created an empty file for the output. You don't need to do that, bcftools will create the output file.
I don't have much experience with bcftools, but in general, if you want to use awk to manipulate a gzipped file, you can pipe the decompressed stream to it so that the file is only decompressed as needed. You can also pipe the result directly through gzip so that it too is compressed, e.g.
gzip -cd largeFile.vcf.gz | awk '{ <some awk> }' | gzip -c > newfile.txt.gz
Also, zcat is equivalent to gzip -cd: -c writes to standard output and -d decompresses.
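Here is a small, self-contained sketch of that pipeline. The sample rows, the temporary directory and the chr9 filter are assumptions made for the demo; a real VCF would need a filter on its own columns.

```shell
# Build a small gzipped sample; the contents are made up for this demo.
workdir=$(mktemp -d)
printf 'chr9\t100\tA\nchr1\t200\tC\nchr9\t300\tG\n' | gzip -c > "$workdir/largeFile.vcf.gz"

# Decompress to a stream, keep only the chr9 rows, and recompress the result.
# No uncompressed copy of the input is ever written to disk.
gzip -cd "$workdir/largeFile.vcf.gz" | awk '$1 == "chr9"' | gzip -c > "$workdir/newfile.txt.gz"

# Count the rows that were kept, again without writing an uncompressed file.
kept=$(gzip -cd "$workdir/newfile.txt.gz" | wc -l)
echo "rows kept: $kept"

rm -r "$workdir"
```

Because everything is streamed, disk usage is limited to the compressed output, which is the point when the uncompressed file would not fit.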
As a side note, if you are trying to work on just a part of a large file, you may also find less, an excellent pager, useful. It can view a large file while loading only the needed parts. The -S option is particularly useful for wide formats with many columns as it stops line wrapping, as is -N for showing line numbers.
less -S largefile.vcf.gz
Quit the view with q; g takes you to the top of the file.

bash handling of quotation marks in filename

I am trying to remove and replace quotation marks that are present in a file name. For example, I would like to change:
$ ls
abc"def"ghi"jkl"mno
to this
$ ls
abc:def:ghi:jkl:mno
In trying to solve this, I came across How to rename a bunch of files to eliminate quote marks, which is exactly what I want to do. However, it didn't work for my case. To figure out why, I tried creating a test file like this:
$ touch abba\"abba\"cde\"cde\"efef
With this file, the solutions I came across (such as mentioned above) worked. But why didn't it work for the first file?
One thing I discovered was that bash command completion sees them differently. If I type in
$ ls abb<tab>
bash will complete the filename like so:
$ abba\"abba\"cde\"cde\"efef
just as I created it. But for the original file, bash completion went like this:
$ ls abc<tab>
results in
$ abc"def"ghi"jkl"mno
So in the test case file, the quotation marks are escaped, and in the other case (the file I really want to rename), they are not. I don't know how the original files were named.
Can anyone explain why bash sees these names differently, and how I would go about renaming my file?
Here are two ways to rename a file with a " (quotation mark) in its name.
Option 1: with the escape character \
mv abc\"cdf\"efg\"hij newFileName
Option 2: by using ' (single quotes)
mv 'abc"cdf"efg"hij' newFileName
Note: using special characters like : (colon) in a file name might not be a good idea.
Regarding auto-completion, it usually fills in the name with escape characters. For example,
ls abc<tab> will complete the name to ls abc\"cdf\"efg\"hij
unless you start the name with a quote. For example,
ls 'abc<tab> will complete the name to ls 'abc"cdf"efg"hij'
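If there are many such files, the rename can be scripted. This is only a sketch: it assumes the names contain plain ASCII double quotes, it works in a throwaway temporary directory, and it uses the colon from the question as the replacement character (swap in another character if colons cause trouble on your system).

```shell
# Create a test file in a scratch directory; single quotes keep the " literal.
workdir=$(mktemp -d)
touch "$workdir"/'abc"def"ghi"jkl"mno'

# Rename every file whose name contains a double quote.
for path in "$workdir"/*'"'*; do
    name=$(basename "$path")
    new=$(printf '%s' "$name" | tr '"' ':')   # replace each " with :
    mv -- "$path" "$workdir/$new"
done

renamed=$(ls "$workdir")
echo "$renamed"

rm -r "$workdir"
```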

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
echo "$sequence";
blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can suggest the following improvements:
for fasta_file in ./sequences/*.fa # ";" is not necessary if "do" is on a new line
do
    # ${variable%something} is $variable with
    # a trailing "something" removed
    # basename path/to/file is the name of the file
    # without the full path
    # $(some command) allows you to use the result of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename "${fasta_file%.fa}")
    echo "${sequence_name}"
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p "${sequence_name}"
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results=${sequence_name}/${sequence_name}_blast.txt
    blastp -query "${fasta_file}" -db database.faa \
        -out "${blast_results}" \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits=${sequence_name}/${sequence_name}_top_hits.txt
    # alternatively, using "%"
    #top_hits=${blast_results%_blast.txt}_top_hits.txt
    # No need to cat: awk can take a file as argument
    # Write to ${top_hits} so the file lands in the per-sequence directory
    awk '/hits found/{getline;print}' "${blast_results}" \
        | grep -v "#" > "${top_hits}"
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
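To see what the ${variable%something} and basename steps in the loop above produce, here is a small sketch that only builds the names and runs no blastp. The input path ./sequences/protein1.fa is hypothetical.

```shell
fasta_file=./sequences/protein1.fa    # hypothetical input path

# Strip the .fa suffix, then drop the leading directories.
sequence_name=$(basename "${fasta_file%.fa}")

# Build the per-sequence output paths from that name.
blast_results=${sequence_name}/${sequence_name}_blast.txt
top_hits=${blast_results%_blast.txt}_top_hits.txt

echo "$sequence_name"
echo "$blast_results"
echo "$top_hits"
```

Printing the derived names like this before running the real commands is a cheap way to catch the kind of ".txt.txt" surprises described in the question.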
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using
mkdir 'folder_name'
then you need to direct your -o outputs to those folders, something like this
'command' -o /path/to/output/folder
To help you test this script out, you can run each line one by one to test them. You need to make sure each line works by itself before combining.
One last thing: be careful with your use of semicolons. It should look something like this:
for filename in *.fa; do 'command'; done

Obtaining file names from directory in Bash

I am trying to create a zsh script to test my project. The teacher supplied us with some input files and expected output files. I need to diff the output files from myExecutable with the expected output files.
Question: Does $iF contain a string in the following code or some kind of bash reference to the file?
#!/bin/bash
inputFiles=~/project/tests/input/*
outputFiles=~/project/tests/output
for iF in $inputFiles
do
./myExecutable $iF > $outputFiles/$iF.out
done
Note:
Any tips on fulfilling my objectives would be nice. I am new to shell scripting and I am using the following websites to write the script quickly (since I have to focus on the project development and not waste time on extra stuff):
Grammar for bash language
Beginner guide for bash
As your code stands, $iF contains the full path of the file as a string.
N.B.: Don't use for iF in $inputFiles;
use for iF in ~/project/tests/input/* instead. Otherwise your code will fail if the path contains spaces or newlines.
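Since $iF holds the full path, the output name is easier to build from the base name. The sketch below uses a temporary directory and made-up input files, and cat stands in for ./myExecutable, so treat it as an illustration rather than a drop-in script.

```shell
# Scratch layout mirroring tests/input and tests/output; names are made up.
testdir=$(mktemp -d)
mkdir "$testdir/input" "$testdir/output"
touch "$testdir/input/case 1.txt" "$testdir/input/case2.txt"

# Glob directly in the for statement and quote every expansion,
# so file names with spaces stay in one piece.
for iF in "$testdir"/input/*; do
    name=$(basename "$iF")                    # file name without the directory
    cat "$iF" > "$testdir/output/$name.out"   # cat stands in for ./myExecutable
done

count=$(ls "$testdir/output" | wc -l)
echo "outputs written: $count"

rm -r "$testdir"
```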
If you need to diff the files you can do another for loop on your output files. Grab just the file name with the basename command and then put that all together in a diff and output to a ".diff" file using the ">" operator to redirect standard out.
Then diff each one with the expected file, something like:
expectedOutput=~/<some path here>
diffFiles=~/<some path>
for oF in ~/project/tests/output/* ; do
file=$(basename "$oF")
diff "$oF" "${expectedOutput}/${file}" > "${diffFiles}/${file}.diff"
done

zip command not working

I am trying to zip a file using shell script command. I am using following command:
zip ./test/step1.zip $FILES
where $FILES contain all the input files. But I am getting a warning as follows
zip warning: name not matched: myfile.dat
One more thing I observed is that the warning is always for the last file in the folder's list of files, and that file does not get zipped.
Can anyone explain to me why this is happening? I am new to the shell scripting world.
zip warning: name not matched: myfile.dat
This means the file myfile.dat does not exist.
You will get the same error if the file is a symlink pointing to a non-existent file.
As you say, whatever file is last in $FILES is not added to the zip and triggers the warning. So I think something's wrong with the way you create $FILES. Chances are there is a newline, carriage return, space, tab, or other invisible character at the end of the last filename, resulting in something that doesn't exist. Try this for example:
for f in $FILES; do echo :$f:; done
I bet the last line will be incorrect, for example:
:myfile.dat :
...or something like that instead of :myfile.dat: with no characters before the last :
UPDATE
If you say the script started working after running dos2unix on it, that confirms what everybody suspected already, that somehow there was a carriage-return at the end of your $FILES list.
od -c shows the \r carriage-return. Try echo $FILES | od -c
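A short sketch of that check, using a simulated $FILES value with a trailing carriage return (the file names are made up). It dumps the list with od -c and then strips the \r with tr.

```shell
# Simulated list whose last name carries a trailing carriage return.
FILES=$(printf 'first.dat myfile.dat\r')

# The colon markers plus od -c make the invisible \r show up in the dump.
for f in $FILES; do echo ":$f:"; done | od -c

# Remove every carriage return before handing the list to zip.
clean=$(printf '%s' "$FILES" | tr -d '\r')
printf '%s' "$clean" | od -c
```

Running dos2unix on the script or data file that produced the list fixes the cause; the tr step only cleans up the value after the fact.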
Another possible cause that can generate a zip warning: name not matched: error is having any of zip's environment variables set incorrectly.
From the man page:
ENVIRONMENT
The following environment variables are read and used by zip as described.
ZIPOPT
contains default options that will be used when running zip. The contents of this environment variable will get added to the command line just after the zip command.
ZIP
[Not on RISC OS and VMS] see ZIPOPT
Zip$Options
[RISC OS] see ZIPOPT
Zip$Exts
[RISC OS] contains extensions separated by a : that will cause native filenames with one of the specified extensions to be added to the zip file with basename and extension swapped.
ZIP_OPTS
[VMS] see ZIPOPT
In my case, I was using zip in a script and had the binary location in an environment variable ZIP so that we could change to a different zip binary easily without making tonnes of changes in the script.
Example:
ZIP=/usr/bin/zip
...
${ZIP} -r folder.zip folder
This is then processed as:
/usr/bin/zip /usr/bin/zip -r folder.zip folder
And generates the errors:
zip warning: name not matched: folder.zip
zip I/O error: Operation not permitted
zip error: Could not create output file (/usr/bin/zip.zip)
The first because it's now trying to add folder.zip to the archive instead of using it as the archive. The second and third because it's trying to use the file /usr/bin/zip.zip as the archive which is (fortunately) not writable by a normal user.
Note: This is a really old question, but I didn't find this answer anywhere, so I'm posting it to help future searchers (my future self included).
eebbesen's comment hit the nail on the head for my case (but I cannot vote on comments).
Another possible reason missed in the other comments is a file exceeding the file size limit (4 GB).
I converted my script for the Unix environment using the dos2unix command and executed it as ./myscript.sh instead of bash myscript.sh.
I just discovered another potential cause for this. If the permissions of the directory/subdirectory don't allow the zip to find the file, it will report this error. Actually, if you run a chmod -R 444 on the directory, and then try to zip it, you will reproduce this error, and also have a "stored 0%" report, like this:
zip warning: name not matched: borrar/enviar
adding: borrar/ (stored 0%)
Hence, try changing the permissions of the file. If you are trying to send the files through email, and the email filters (like Gmail's) refuse to send executables, don't forget that making permissions very strict before zipping can be the cause of the "name not matched" error you are reporting.
Spaces are not allowed:
it will fail if there is more than one file in $FILES unless you put them in a loop.
I also encountered this issue. In my case, the line separator in my zip shell script was CRLF, which caused the problem. Using LF fixed it.

Resources