List files in unix in a specific format to a text file - linux

I want to list the files in a particular folder to a file list, but in a particular format.
For instance, I have the files below in a folder:
/path/file1.csv
/path/file2.csv
/path/file3.csv
I want to create a string in a text file that lists them as below:
-a file1.csv -a file2.csv -a file3.csv
Can someone help me create a script for that? This is what I have so far:
ls /path/* > file_list.lst

The find utility can do this.
find /path/ -type f -printf "-a %P " > file_list.lst
This prints, for each thing that is a file under the given path (recursively), its path relative to the starting point, formatted as in your example.
Note that:
Linux filenames can contain spaces and newlines; this does not deal with those.
The file_list.lst file will have a trailing space but no trailing newline.
The results will not be in a particular order.
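If the trailing space is a problem, a small variation (a sketch, assuming GNU find and sed are available; the throwaway directory here is only for demonstration) trims it:

```shell
# Demo in a throwaway directory (GNU find assumed for -printf).
tmp=$(mktemp -d)
touch "$tmp/file1.csv" "$tmp/file2.csv"
# -printf leaves a trailing blank after the last entry; sed trims it.
find "$tmp" -type f -printf "-a %P " | sed 's/ $//' > file_list.lst
cat file_list.lst
```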

You can just printf them.
printf "-a %s " /path/*
If you plan to use it with a command, you may want to read https://mywiki.wooledge.org/BashFAQ/050 and take an interest in the %q printf format specifier.
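As a sketch of what %q buys you (the file name here is invented, and bash's printf is assumed), the quoted form survives re-parsing by the shell:

```shell
# %q escapes each argument so the result can be reused as shell input.
quoted=$(printf '%q ' 'file with spaces.csv')
echo "$quoted"
# eval-ing the quoted string round-trips the original name:
eval "set -- $quoted"
echo "$1"
```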

Related

How to rename fasta header based on filename in multiple files?

I have a directory with multiple fasta file named as followed:
BC-1_bin_1_genes.faa
BC-1_bin_2_genes.faa
BC-1_bin_3_genes.faa
BC-1_bin_4_genes.faa
etc. (about 200 individual files)
The fasta header look like this:
>BC-1_k127_3926653_6 # 4457 # 5341 # -1 # ID=2_6;partial=01;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.697
I now want to add the filename to the header, since I want to annotate the sequences for each file. I tried the following:
for file in *.faa;
do
sed -i "s/>.*/${file%%.*}/" "$file" ;
done
It worked partially, but it removed the ">" from the header, which is essential for the FASTA format. I tried to modify the "${file%%.*}" part to keep the carrot, but it always called me out on bad substitutions.
I also tried this:
awk '/>/{sub(">","&"FILENAME"_");sub(/\.faa/,x)}1' *.faa
This worked in theory, but it only printed everything to my terminal rather than changing it in the respective files.
Could someone assist with this?
It's not clear whether you want to replace the earlier header, or add to it. Both scenarios are easy to do. Don't replace text you don't want to replace.
for file in ./*.faa;
do
sed -i "s/^>.*/>${file%%.*}/" "$file"
done
will replace the header, but include a leading > in the replacement, effectively preserving it; and
for file in ./*.faa;
do
sed -i "s/^>.*/&${file%%.*}/" "$file"
done
will append the file name at the end of the header (& in the replacement string evaluates to the string we are replacing, again effectively preserving it).
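To see what & does, here is a quick stand-alone demonstration (the header text is invented for the example):

```shell
# & in the replacement stands for the entire matched text, so the
# original header line is kept and the suffix is appended to it.
printf '>BC-1_k127\nMKVL\n' | sed 's/^>.*/&_suffix/'
# prints:
# >BC-1_k127_suffix
# MKVL
```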
For another variation, try
for file in *.faa;
do
sed -i "/^>/s/\$/ ${file%%.*}/" "$file"
done
which says on lines which match the regex ^>, replace the empty string at the end of the line $ with the file name.
Of course, your Awk script could easily be fixed, too. Standard Awk does not have an option to parallel the -i "in-place" option of sed, but you can easily use a temporary file:
for file in ./*.faa;
do
awk '/>/{ $0 = $0 " " FILENAME; sub(/\.faa/,"") }1' "$file" >"$file.tmp" &&
mv "$file.tmp" "$file"
done
GNU Awk also has an -i inplace extension which you could simply add to the options of your existing script if you have GNU Awk.
Since FASTA files typically contain multiple headers, adding to the header rather than replacing all headers in a file with the same string seems more useful, so I changed your Awk script to do that instead.
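A quick self-contained check of the temp-file variant (the sample FASTA content is invented):

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '>BC-1_k127_1\nMKVL\n' > sample.faa
for file in *.faa
do
# Append FILENAME to the header line, then strip its .faa suffix.
awk '/>/{ $0 = $0 " " FILENAME; sub(/\.faa/,"") }1' "$file" >"$file.tmp" &&
mv "$file.tmp" "$file"
done
cat sample.faa
# prints:
# >BC-1_k127_1 sample
# MKVL
```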
For what it's worth, the name of the character ^ is caret (carrot is 🥕). The character > is called greater than or right angle bracket, or right broket or sometimes just wedge.
You just need to detect the pattern to replace and use regex to implement it:
fasta_helper.sh
location=$1
for file in "$location"/*.faa
do
full_filename=${file##*/}
filename="${full_filename%.*}"
# escape special chars
filename=$(echo "$filename" | sed 's_/_\\/_g')
echo "adding file name: $filename to: $full_filename"
sed -i -E "s/^>[^#]+/>$filename /" "$location/$full_filename"
done
usage:
Just pass the folder with fasta files:
bash fasta_helper.sh /foo/bar
Further reading:
Regex: matching up to the first occurrence of a character
Extract filename and extension in Bash
https://unix.stackexchange.com/questions/78625/using-sed-to-find-and-replace-complex-string-preferrably-with-regex
Locating your files
I suggest first identifying your files with the find command or the ls command.
find . -type f -name "*.faa" -printf "%f\n"
A find command to print only the names of files with the extension .faa, including subdirectories of the current directory.
ls -1 *.faa
An ls command to list files with the extension .faa in the current directory.
Processing your files
Once you have the correct files list, iterate over the list and apply sed command.
for fileName in $(find . -type f -name "*.faa" -printf "%f\n"); do
stripedFileName=${fileName/.*/} # strip extension .faa
sed -i "1s|\$| $stripedFileName|" "$fileName" # append value of stripedFileName at end of line 1
done

Shell script how to list file name in ascending order

I am new to linux, writing a bash script below.
The files in the current folder are named 1.jpg, 2.jpg, and so on. I have to process the files sequentially according to their names, but in the loop below I get the file names in a different order.
for i in ./*.jpg
do
filename=$(basename "$i")
echo "filename is ./$filename"
done
output I get is like this
filename is ./10.jpg
filename is ./11.jpg
filename is ./12.jpg
filename is ./13.jpg
filename is ./14.jpg
filename is ./15.jpg
filename is ./16.jpg
filename is ./17.jpg
filename is ./18.jpg
filename is ./19.jpg
filename is ./1.jpg
filename is ./20.jpg
filename is ./21.jpg
filename is ./22.jpg
filename is ./27.jpg
filename is ./28.jpg
filename is ./29.jpg
filename is ./2.jpg
filename is ./3.jpg
filename is ./4.jpg
filename is ./6.jpg
filename is ./7.jpg
filename is ./8.jpg
filename is ./9.jpg
Any assistance on how I can process them in the sequence of names 1.jpg, 2.jpg, etc.?
Pathname expansion (glob expansion) returns a list of filenames sorted according to the collation order of your current locale. In the C locale this is plain ASCII order; in a typical UTF-8 locale, punctuation such as the dot is given a low collation weight. That is visible in the OP's result: 19.jpg sorts before 1.jpg because, with the dot effectively ignored, the 9 of 19.jpg is compared against the j of 1.jpg, and digits collate before letters.
If you want to traverse your files in a different sorting order, then a different approach needs to be taken.
Under the bold assumption that the OP requests to traverse the files in a numeric sorted way, i.e. order the names according to a number at the beginning of the file-name, you can do the following:
while IFS= read -r -d '' file; do
echo "filename: $file"
done < <(find . -maxdepth 1 -type f -name '*.jpg' -print0 | sort -z -t / -k 2 -n)
Here we use find to list all files in the current directory (depth == 1), printing them with a \0 as the separator, and sort to order them numerically on the file name after the leading ./ (-t / -k 2 -n), with -z telling sort that its records are likewise \0-delimited. Instead of a for-loop, we use a while-loop to read the information.
See Bash Pitfall #1 for some details.
Note: sort -z is a GNU extension.
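A quick way to convince yourself of the ordering (a sketch in a throwaway directory, assuming GNU find and sort):

```shell
# Create a throwaway directory with awkwardly ordered names.
tmp=$(mktemp -d)
cd "$tmp"
touch 1.jpg 2.jpg 10.jpg
# -print0 / -z keep names NUL-delimited; -t / -k 2 -n sorts
# numerically on the name after the leading "./".
find . -maxdepth 1 -type f -name '*.jpg' -print0 |
sort -z -t / -k 2 -n |
tr '\0' '\n'
# prints:
# ./1.jpg
# ./2.jpg
# ./10.jpg
```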
Not quite sure if this is what you're asking, but you have the echo inside your loop, which causes each file name to be printed on a separate line.
you can do:
list=""
for i in ./*.jpg
do
filename=$(basename "$i")
list="$list $filename"
done
echo "files: $list"
which would output
files: 1.jpg 2.jpg
Nevertheless, you should clarify your question.
From your requirement to "process them in the sequence of names 1.jpg, 2.jpg etc", this will accomplish that. The sort specifies a numeric key obtained by defining the first field as the string before a "." delimiter.
#!/usr/bin/env bash
shopt -s nullglob
allfiles=(*.jpg)
for f in "${allfiles[@]}"
do
echo "$f"
done | sort -t"." -k1n
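The effect of that sort invocation can be checked in isolation:

```shell
# -t "." makes the part before the first dot field 1;
# -k1n sorts that field numerically.
printf '10.jpg\n2.jpg\n1.jpg\n' | sort -t "." -k1n
# prints:
# 1.jpg
# 2.jpg
# 10.jpg
```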

Search multiple strings from file in multiple files in specific column and output the count in unix shell scripting

I have searched extensively on the internet about this but haven't found much details.
Problem Description:
I am using aix server.
I have a pattern.txt file that contains customer_id for 100 customers in the following sample format:
160471231
765082023
75635713
797649756
8011688321
803056646
I have a directory (/home/aswin/temp) with several files (1.txt, 2.txt, 3.txt and so on) which are pipe(|) delimited. Sample format:
797649756|1001|123270361|797649756|O|2017-09-04 23:59:59|10|123769473
803056646|1001|123345418|1237330|O|1999-02-13 00:00:00|4|1235092
64600123|1001|123885297|1239127|O|2001-08-19 00:00:00|10|1233872
75635713|1001|123644701|75635713|C|2006-11-30 00:00:00|11|12355753
424346821|1001|123471924|12329388|O|1988-05-04 00:00:00|15|123351096
427253285|1001|123179704|12358099|C|2012-05-10 18:00:00|7|12352893
What I need to do is search for all the strings from the pattern.txt file in all the files in the directory, but only in the first column of each file, and list each filename with its number of matches. If the same row matches more than once, it should be counted as 1.
So the output should be something like this (only matches in the first column should count):
1.txt:4
2.txt:3
3.txt:2
4.txt:5
What I have done till now:
cd /home/aswin/temp
grep -srcFf ./pattern.txt * /dev/null >> logfile.txt
This gives the output in the desired format, but it searches for the strings in all columns, not just the first one, so the counts are much higher than expected.
Please help.
If you want to do that with grep, you must change the patterns.
With your command, you search for the patterns in /dev/null, and the output contains /dev/null:0.
I think you meant 2>/dev/null, but that is not needed because you already pass -s to grep.
Your pattern file is in the same directory, so grep searches it too and outputs pattern.txt:6.
All your files are in the same directory, so -r is not needed.
You put the logfile in the same directory, so the second time you run the command, grep searches it too and outputs logfile.txt:0.
If you can modify the pattern file, write each line like ^765082023| (anchored to the first column),
and rename the file so that it no longer ends in .txt.
Then this command gives you what you are looking for:
grep -scf pattern *.txt >>logfile
If you can't modify the pattern file, you can use awk.
awk -F'|' '
NR==FNR{a[$0];next}
FILENAME=="pattern.txt"{next}
$1 in a {b[FILENAME]++}
END{for(i in b){print i":"b[i]}}
' pattern.txt *.txt >>logfile.txt
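To see the Awk approach at work on miniature data (the file contents are invented to match the question's format):

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf '797649756\n75635713\n' > pattern.txt
printf '797649756|1001|x\n75635713|1001|y\n999|1001|797649756\n' > 1.txt
# Only rows whose *first* column is in pattern.txt are counted;
# the third row matches only in a later column and is ignored.
awk -F'|' '
NR==FNR{a[$0];next}
$1 in a {b[FILENAME]++}
END{for(i in b){print i":"b[i]}}
' pattern.txt 1.txt
# prints: 1.txt:2
```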

Looping through a file with path and file names and within these file search for a pattern

I have a file called lookupfile.txt with the following info:
a path per line, including the filename
Within bash I would like to search through the files listed in lookupfile.txt for a pattern: myerrorisbeinglookedat. When found, the matching lines should be output into another record file. All the found results can land in the same file.
Please help.
You can write a single grep statement to achieve this:
grep myerrorisbeinglookedat $(< lookupfile.txt) > outfile
Assuming:
the number of entries in lookupfile.txt is small (tens or hundreds)
there are no white spaces or wildcard characters in the file names
Otherwise:
while IFS= read -r file; do
# print the file names separated by a NULL character '\0'
# to be fed into xargs
printf '%s\0' "$file"
done < lookupfile.txt | xargs -0 grep myerrorisbeinglookedat > outfile
xargs takes the output of the loop, tokenizes it correctly, and invokes the grep command. xargs batches up the files based on operating-system limits in case there is a large number of files.
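A self-contained sketch of the loop-plus-xargs pipeline (the file name with a space and the log contents are invented):

```shell
tmp=$(mktemp -d)
cd "$tmp"
printf 'all good\nmyerrorisbeinglookedat here\n' > 'a file.log'
printf '%s\n' "$tmp/a file.log" > lookupfile.txt
while IFS= read -r file; do
  printf '%s\0' "$file"   # NUL-delimited so xargs -0 keeps the space intact
done < lookupfile.txt | xargs -0 grep myerrorisbeinglookedat > outfile
cat outfile
# prints: myerrorisbeinglookedat here
```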

Building a file index in Linux

I have a filesystem with deeply nested directories. Inside the bottom level directory for any node in the tree is a directory whose name is the guid of a record in a database. This folder contains the binary file(s) (pdf, jpg, etc) that are attached to that record.
Two Example paths:
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
/g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf
In the above example, 123.456.789 and 321.654.987 are guids
I want to build an index of the complete filesystem so that I can create a lookup table in my database to easily map the guid of the record to the absolute path(s) of its attached file(s).
I can easily generate a straight list of files with:
find /g/camm/MOUNT -type f > /g/camm/MOUNT/files.index
but I want to parse the output of each file path into a CSV file which looks like:
GUID ABSOLUTEPATH FILENAME
123.456.789 /g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
321.654.987 /g/camm/MOUNT/raid_fs1/FOO/052014/22/321.654.987/04.20.30--27.04.2014--RJ123.pdf 04.20.30--27.04.2014--RJ123.pdf
I think I need to pipe the output of my find command into xargs and again into awk to process each line of the output into the desired format for the CSV output... but I can't make it work...
Wait for your long-running find to finish, then you
can pass the list of filenames through awk:
awk -F/ '{printf "%s,%s,%s\n",$(NF-1),$0,$NF}' /g/camm/MOUNT/files.index
and this will convert lines like
/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf
into
123.456.789,/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf,04.20.30--27.04.2014--RJ123.pdf
The -F/ splits the line into fields using "/" as the separator; NF is the number of fields, so $NF means the last field and $(NF-1) the next-to-last, which seems to be the directory you want in the first column of the output. I used "," in the printf to separate the output columns, as is typical in a CSV; you can replace it with any character such as a space or ";".
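For instance, running one of the example lines through the command:

```shell
# Split on "/" and emit GUID,full path,filename.
echo '/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf' |
awk -F/ '{printf "%s,%s,%s\n",$(NF-1),$0,$NF}'
# prints:
# 123.456.789,/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/04.20.30--27.04.2014--RJ123.pdf,04.20.30--27.04.2014--RJ123.pdf
```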
I don't think there can be anything much faster than your find command, but
you may be interested in the locate package. It uses the updatedb command, usually run each night by cron, to traverse the filesystem and create a file holding all the filenames in a form that can be easily searched by another command.
The locate command is used to read the database to find matching directories, files, and so on, even using glob wild-card or regex pattern matching. Once tried, it is hard to live without it.
For example, on my system locate -S lists the statistics:
Database /var/lib/mlocate/mlocate.db:
59945 directories
505330 files
30401572 bytes in file names
12809265 bytes used to store database
and I can do
locate rc-dib0700-nec.ko
locate -r rc-.*-nec.ko
locate '*/media/*rc-*-nec.ko*'
to find files like /usr/lib/modules/4.1.6-100.fc21.x86_64/kernel/drivers/media/rc/keymaps/rc-dib0700-nec.ko.xz in no time at all.
You can nearly do what you want with the find's -printf option.
The difficulty is extracting the GUID.
Assuming prefixes have the same length as in your example, I would probably do:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | colrm 1 37 > /g/camm/MOUNT/files.index
Or if the number of / is constant
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | cut -d '/' -f 9- > /g/camm/MOUNT/files.index
Otherwise, I would use sed:
find /g/camm/MOUNT -type f -printf "%h %p %f\n" | sed 's#^[^ ]*/\([^/ ]*\) #\1 #' > /g/camm/MOUNT/files.index
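For reference, here is how a sed of that shape behaves on a sample "%h %p %f" line (the [^ ] character class keeps the match inside the first, directory, field):

```shell
# %h %p %f gives "dir path name"; the sed keeps only the last
# component of the directory field, i.e. the GUID.
echo '/g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789 /g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/RJ123.pdf RJ123.pdf' |
sed 's#^[^ ]*/\([^/ ]*\) #\1 #'
# prints:
# 123.456.789 /g/camm/MOUNT/raid_fs0/FOO/042014/27/123.456.789/RJ123.pdf RJ123.pdf
```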
