Pick the specific file in the folder - linux

I want to pick the files with a specific name format from the list of files in a directory. Please see the example below.
I have the following list of files (6 files).
Set-1
1) MAG_L_NT_AA_SUM_2017_01_20.dat
2) MAG_L_NT_AA_2017_01_20.dat
Set-2
1) MAG_L_NT_BB_SUM_2017_01_20.dat
2) MAG_L_NT_BB_2017_01_20.dat
Set-3
1) MAG_L_NT_CC_SUM_2017_01_20.dat
2) MAG_L_NT_CC_2017_01_20.dat
From the above three sets I need only 3 files.
1) MAG_L_NT_AA_2017_01_20.dat
2) MAG_L_NT_BB_2017_01_20.dat
3) MAG_L_NT_CC_2017_01_20.dat
Note: The solution can span multiple commands, because I have to create a script for the above requirement. Thanks.

Probably the easiest and least complex solution to your problem is to combine find (a tool for searching for files in a directory hierarchy) and grep (a tool for printing lines that match a pattern). You can read the manuals for these tools by typing man find and man grep.
Before going straight to the solution, we need to understand how to approach the problem. To match a pattern against the name of a file, we use the find command with the -name option:
-name pattern
Base of file name (the path with the leading directories removed) matches shell pattern pattern. The metacharacters ('*', '?', and '[]')
match a '.' at the start of the base name (this is a change in
findutils-4.2.2; see section STANDARDS CONFORMANCE below). To ignore a
directory and the files under it, use -prune; see an example in the
description of -path. Braces are not recognised as being special,
despite the fact that some shells including Bash imbue braces with a
special meaning in shell patterns. The filename matching is performed
with the use of the fnmatch(3) library function. Don't forget to
enclose the pattern in quotes in order to protect it from expansion by
the shell.
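For instance, to search for files whose names contain the string 'abc' in a directory called 'words_directory', we enter the following:
$ find words_directory -name "*abc*"
find descends into subdirectories on its own, so the command above already covers everything below words_directory. Starting from words_directory/* also works, but the shell expands the * first and leaves out hidden entries at the top level:
$ find words_directory/* -name "*abc*"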
So first we need to find all files whose names begin with "MAG_L_NT_" and end with ".dat". To find all matching names under /your/specified/path, including any subdirectories that may contain such files:
$ find /your/specified/path -name "MAG_L_NT_*.dat"
This prints every matching filename, including the names that contain the string "SUM". That is where grep comes in. To exclude names containing the unwanted string, we use the -v option:
-v, --invert-match
Invert the sense of matching, to select non-matching lines. (-v is
specified by POSIX.)
To make grep filter the output of the first command, we use a pipe (|):
The standard shell syntax for pipelines is to list multiple commands,
separated by vertical bars ("pipes" in common Unix verbiage). For
example, to list files in the current directory (ls), retain only the
lines of ls output containing the string "key" (grep), and view the
result in a scrolling page (less), a user types the following into the
command line of a terminal:
ls -l | grep key | less
"ls -l" produces a process, the output (stdout) of which is piped to
the input (stdin) of the process for "grep key"; and likewise for the
process for "less". Each process takes input from the previous process
and produces output for the next process via standard streams. Each
"|" tells the shell to connect the standard output of the command on
the left to the standard input of the command on the right by an
inter-process communication mechanism called an (anonymous) pipe,
implemented in the operating system. Pipes are unidirectional; data
flows through the pipeline from left to right.
process1 | process2 | process3
Now that you are acquainted with the commands and options involved, you are ready for the solution:
$ find /your/specified/path -name "MAG_L_NT_*.dat" | grep -v "SUM"
find prints all names that begin with "MAG_L_NT_" and end with ".dat". grep -v takes that output as its input and removes every line containing the string "SUM". Note that grep filters the whole path, so a file inside a directory whose name contains "SUM" would be removed as well.
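A variation that avoids the pipe, offered as a sketch rather than as the method of the answer above: find can negate a test with !, so the SUM files can be excluded by file name alone. The scratch directory and the variable names demo, tag and result are made up for illustration.

```shell
# Build a throwaway directory holding the six example files.
demo=$(mktemp -d)
for tag in AA BB CC; do
    touch "$demo/MAG_L_NT_${tag}_SUM_2017_01_20.dat" \
          "$demo/MAG_L_NT_${tag}_2017_01_20.dat"
done

# Keep names that match the first pattern but not the second.
# -type f limits the result to regular files.
result=$(find "$demo" -type f -name 'MAG_L_NT_*.dat' ! -name '*_SUM_*' | sort)
printf '%s\n' "$result"
```

Because both tests look only at the base name, a directory called SUM somewhere in the path does not affect the result.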

Related

How do I exclude a character in Linux

Write a wildcard to match all files (does not matter the files are in which directory, just ask for the wildcard) named in the following rule: starts with a string “image”, immediately followed by a one-digit number (in the range of 0-9), then a non-digit char plus anything else, and ends with either “.jpg” or “.png”. For example, image7.jpg and image0abc.png should be matched by your wildcard while image2.txt or image11.png should not.
My folder contained these files imag2gh.jpeg image11.png image1agb.jpg image1.png image2gh.jpg image2.txt image5.png image70.jpg image7bn.jpg Screenshot .png
If my command works, it should only display image1agb.jpg image1.png image2gh.jpg image5.png image70.jpg image7bn.jpg
This is the command I used (ls -ad image[0-9][^0-9]*{.jpg,.png}), but I'm only getting image1agb.jpg image2gh.jpg image7bn.jpg, so I'm missing image1.png and image5.png. What I did in the Kali terminal:
ls -ad image[0-9][!0-9]*{.jpg,.png}
Info
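Character ranges such as [0-9] and negated sets such as [!0-9] do work in shell globs (wildcards). The pattern misses image1.png because [!0-9] must match exactly one character; here it consumes the dot, so the name would need a second ".png" after it to match.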
Possible solution
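Pipe the output of ls -a1 to the standard input of the grep command, which supports regular expressions.
Use a regular expression to make grep filter the filenames. The pattern below is anchored at both ends and allows exactly one digit after "image", so image11.png is rejected:
ls -a1 | grep -E '^image[0-9]([^0-9].*)?\.(jpg|png)$'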
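The same selection can be sketched with wildcards only, by treating "digit followed directly by the extension" and "digit, non-digit, more text, extension" as two separate cases. This is one possible reading of the rule in the question (under it, image70.jpg is not matched because a digit follows the first digit). The scratch directory and the variable names demo and matches are illustrative.

```shell
# Recreate the question's file list in a scratch directory.
demo=$(mktemp -d)
cd "$demo" || exit 1
touch imag2gh.jpeg image11.png image1agb.jpg image1.png image2gh.jpg \
      image2.txt image5.png image70.jpg image7bn.jpg 'Screenshot .png'

# Case 1: the digit is followed directly by the extension (image1.png).
# Case 2: the digit is followed by a non-digit and then more text
#         before the extension (image1agb.jpg).
matches=""
for f in image[0-9].jpg image[0-9].png \
         image[0-9][!0-9]*.jpg image[0-9][!0-9]*.png; do
    [ -e "$f" ] || continue   # a pattern with no match stays literal; skip it
    matches="$matches $f"
done
echo "$matches"
```

In bash the four patterns can be written more compactly with brace expansion as image[0-9]{,[!0-9]*}.{jpg,png}, with the caveat that a pattern matching nothing is passed to ls unchanged and produces an error message.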

"Cat" into multiple files using brace expansion

I am quite new to bash and trying to type some text into multiple files with a single command using brace expansion.
I tried: cat > file_{1..100} to write into 100 files some text that I will type in the terminal. I get the following error:
bash: file_{1..100}: ambiguous redirect
I also tried: cat > "file_{1..100}" but that creates a single file named: file_{1..100}.
I tried: cat > `file_{1..100}` but that gives the error:
file_1: command not found
How can I achieve this using brace expansion? Maybe there are other ways using other utilities and/or pipelines. But I want to know if that is possible using only simple brace expansion or not.
You can't do this with cat alone. It only writes its output to its standard output, and that single file descriptor can only be associated with a single file.
You can however do it with tee file_{1..100}.
You may wish to consider using tee file_{01..100} instead, so that the filenames are zero-padded to all have the same width: file_001, file_002, ... This has the advantage that lexicographic order will agree with numerical order, and so ls, *, etc, will process them in numerical order. Without this, you have the situation that file_2 comes after file_10 in lexicographic order.
The target of a redirection can only be a single file, not multiple files.
If you want to send the output to multiple files, use tee:
cat | tee file_{1..100}
Don't forget to check man tee. For example, if you want to append to the files, add the -a option (tee -a file_{1..100}).
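A small end-to-end sketch of the tee approach, assuming bash (brace expansion is not part of POSIX sh). It uses three files instead of one hundred and a scratch directory; the name demo and the sample text are illustrative.

```shell
# Work in a scratch directory so nothing existing is overwritten.
demo=$(mktemp -d)
cd "$demo" || exit 1

# tee copies its standard input to every file named on its command line;
# > /dev/null hides the copy that tee also writes to standard output.
printf 'some text\n' | tee file_{1..3} > /dev/null

# -a appends to the files instead of truncating them.
printf 'more text\n' | tee -a file_{1..3} > /dev/null

ls
cat file_2
```

The shell expands file_{1..3} into three separate arguments before tee starts, which is why this works where a redirection with the same braces does not.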
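This writes the text into file1 through file4. A plain redirection (> file{1..4}) fails in bash with "ambiguous redirect", so tee does the writing:
echo "hello you just knew me by kruz" | tee file{1..4} > /dev/null
To remove the files afterwards:
rm file{1..4}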

Find command with quotation marks results in "no such file"

In my directory there are the files:
file1.txt fix.log fixRRRRRR.log fixXXXX.log output.txt
In order to understand the find command, I tried a lot of things; among others, I wanted to use two wildcards. The goal was to find files that start with an f and have an extension starting with an l.
$ find . f*.l*
./file1.txt
./fix.log
./fixRRRRRR.log
./output.txt
./fixXXXX.log
fix.log
fixRRRRRR.log
fixXXXX.log
I read in a forum answer that I should use quotation marks with find, as in find . "f*.l*", with the result:
./file1.txt
./fix.log
./fixRRRRRR.log
./output.txt
./fixXXXX.log
It results in find: ‘f*.l*’: No such file or directory
What am I doing wrong, where is my error in reasoning?
Thanks for an answer.
find doesn't work like that. In general find's call form looks like:
find [entry1] [entry2] ... [expressions ...]
Where an entry is a starting point where find starts the search for files.
In your case, you haven't actually supplied any expressions.
In the first command (without quotes), the shell expands the wildcards to a list of matching files (in the current directory), then passes the list to find as arguments. So find . f*.l* is essentially equivalent to find . fix.log fixRRRRRR.log fixXXXX.log. As a result, find treats all of those arguments as directories/files to search (not patterns to search for), and lists all files under ., (everything) then all files under fix.log (it's not a directory, so that's just the file itself), then all files under fixRRRRRR.log and finally all files under fixXXXX.log.
In the second command (with quotes), find lists all files beneath the current directory (.) and then tries to do the same for a file literally called "f*.l*", which does not exist, hence the error.
What you are most likely looking for is the -name expression, which is used like this:
find . -name "f*.l*"
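The difference between the unquoted and the quoted form can be reproduced in a scratch directory. This is a minimal sketch using the file names from the question; the variable names demo and found are illustrative.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch file1.txt fix.log fixRRRRRR.log fixXXXX.log output.txt

# Unquoted: the shell expands the glob before find runs. Prefixing the
# command with echo shows the argument list find would actually receive.
echo find . f*.l*

# Quoted and used with -name: the pattern reaches find untouched and is
# matched against the base name of every file find visits.
found=$(find . -name 'f*.l*' | sort)
printf '%s\n' "$found"
```

Single quotes work as well as double quotes here; either one keeps the shell from expanding the pattern.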

Iterate through files in a directory, create output files, linux

I am trying to iterate through every file in a specific directory (called sequences), and perform two functions on each file. I know that the functions (the 'blastp' and 'cat' lines) work, since I can run them on individual files. Ordinarily I would have a specific file name as the query, output, etc., but I'm trying to use a variable so the loop can work through many files.
(Disclaimer: I am new to coding.) I believe that I am running into serious problems with trying to use my file names within my functions. As it is, my code will execute, but it creates a bunch of extra unintended files. This is what I intend for my script to do:
Line 1: Iterate through every file in my "sequences" directory. (All of which end with ".fa", if that is helpful.)
Line 3: Recognize the filename as a variable. (I know, I know, I think I've done this horribly wrong.)
Line 4: Run the blastp function using the file name as the argument for the "query" flag, always use "database.faa" as the argument for the "db" flag, and output the result in a new file that has the same name as the initial file, but with ".txt" at the end.
Line 5: Output parts of the output file from line 4 into a new file that has the same name as the initial file, but with "_top_hits.txt" at the end.
for sequence in ./sequences/{.,}*;
do
echo "$sequence";
blastp -query $sequence -db database.faa -out ${sequence}.txt -evalue 1e-10 -outfmt 7
cat ${sequence}.txt | awk '/hits found/{getline;print}' | grep -v "#">${sequence}_top_hits.txt
done
When I ran this code, it gave me six new files derived from each file in the directory (and they were all in the same directory - I'd prefer to have them all in their own folders. How can I do that?). They were all empty. Their suffixes were, ".txt", ".txt.txt", ".txt_top_hits.txt", "_top_hits.txt", "_top_hits.txt.txt", and "_top_hits.txt_top_hits.txt".
If I can provide any further information to clarify anything, please let me know.
If you're only interested in *.fa files I would limit your input to only those matching files like this:
for sequence in sequences/*.fa;
do
I can propose the following improvements:
for fasta_file in ./sequences/*.fa  # ";" is not necessary when "do" is on its own line
do
    # ${variable%something} is $variable with the trailing
    # "something" removed
    # basename path/to/file is the name of the file
    # without the leading directories
    # $(some command) allows you to use the output of the command as a string
    # Combining the above, we can form a string based on our fasta file
    # This string can be useful to name stuff in a clean manner later
    sequence_name=$(basename "${fasta_file%.fa}")
    echo "${sequence_name}"
    # Create a directory for the results for this sequence
    # -p option avoids a failure in case the directory already exists
    mkdir -p "${sequence_name}"
    # Define the name of the file for the results
    # (including our previously created directory in its path)
    blast_results="${sequence_name}/${sequence_name}_blast.txt"
    blastp -query "${fasta_file}" -db database.faa \
        -out "${blast_results}" \
        -evalue 1e-10 -outfmt 7
    # Define a file name for the top hits
    top_hits="${sequence_name}/${sequence_name}_top_hits.txt"
    # alternatively, using "%"
    #top_hits="${blast_results%_blast.txt}_top_hits.txt"
    # No need to cat: awk can take a file as argument
    # Write to ${top_hits} so the file lands in the per-sequence directory
    awk '/hits found/{getline;print}' "${blast_results}" \
        | grep -v "#" > "${top_hits}"
done
I made more intermediate variables, with (hopefully) meaningful names.
I used \ to escape line ends and allow putting commands in several lines.
I hope this improves code readability.
I haven't tested. There may be typos.
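The string operations used in the loop can be tried on their own, without blastp. This is only a sketch: the path ./sequences/sample1.fa is a hypothetical example, and the variable stem is introduced here for illustration.

```shell
# Hypothetical input path, used only to show the string operations.
fasta_file=./sequences/sample1.fa

stem=${fasta_file%.fa}               # strip the trailing ".fa"
sequence_name=$(basename "$stem")    # drop the leading directories
blast_results=$sequence_name/${sequence_name}_blast.txt
top_hits=${blast_results%_blast.txt}_top_hits.txt

printf '%s\n' "$stem" "$sequence_name" "$blast_results" "$top_hits"
```

Printing the derived names like this before running the real commands is a cheap way to check that every output file ends up where it is expected.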
You should be using *.fa if you only want files with a .fa ending. Additionally, if you want to redirect your output to new folders you need to create those directories somewhere using
mkdir 'folder_name'
then you need to point the output option of your command at a file in that folder (blastp uses -out), something like this
'command' -out /path/to/output/folder/output_file
To help you test this script, you can run each line one by one. Make sure each line works by itself before combining them.
One last thing: be careful with your use of semicolons. It should look something like this:
for filename in *.fa; do 'command'; done

Using wildcards to exclude files with a certain suffix

I am experimenting with wildcards in bash and tried to list all the files that start with "xyz" but do not end with ".TXT", but I am getting incorrect results.
Here is the command that I tried:
$ ls -l xyz*[!\.TXT]
It is not listing the files named "xyz" and "xyzTXT" that I have in my directory. However, it lists "xyz1" and "xyz123".
It seems that adding [!\.TXT] after "xyz*" made the shell look for something that starts with "xyz" and has at least one character after it.
Any ideas why this is happening and how to correct the command? I know it can be achieved using other commands, but I am especially interested in knowing why it is failing and whether it can be done using wildcards alone.
These commands will do what you want
shopt -s extglob
ls -l xyz!(*.TXT)
shopt -u extglob
The reason your command doesn't work is that xyz*[!\.TXT], which is equivalent to xyz*[!.TX], means xyz followed by any sequence of characters (*) and finally one character that is not in the set {., T, X}. So it matches 'xyzwhateveryouwant1' or 'xyzwhateveryouwanta', but not 'xyz' itself (there is no final character left to match) and not 'xyzTXT' (its final character is T).
EDIT: whateveryouwant may contain any characters; only the final character is restricted.
I don't think this is doable with only the standard wildcards.
Your command isn't working because it means:
Match everything that starts with xyz, followed by whatever you want, and that does not end with one of the following characters: ., T or X. The second T doesn't count, because what you put inside [] is read as a set of characters and not as a string, as you thought.
You also don't need to escape the ., since it has no special meaning inside a wildcard.
At least, this is my understanding of wildcards.
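For comparison, here is a sketch that shows what the bracket pattern really selects and then filters by suffix with a case statement, which uses the same wildcard syntax and needs no extglob. The file list extends the one in the question with two names ending in .TXT; the names demo, bracket and kept are illustrative.

```shell
demo=$(mktemp -d)
cd "$demo" || exit 1
touch xyz xyz1 xyz123 xyzTXT xyz.TXT xyzfoo.TXT

# The bracket version: requires one final character that is not ".", "T" or "X".
bracket=$(echo xyz*[!.TXT])

# Suffix filter: take every xyz* name and skip those ending in .TXT.
kept=""
for f in xyz*; do
    case $f in
        *.TXT) ;;                  # excluded suffix
        *) kept="$kept $f" ;;
    esac
done

echo "bracket: $bracket"
echo "case:   $kept"
```

The bracket pattern drops xyz and xyzTXT even though neither ends in ".TXT", while the case filter keeps them and drops only the two .TXT names.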
