Searching for specific beginning pattern in first lines of files only

I am searching for files containing records that begin with a specific pattern, but I am now running into problems with files (bad data) that contain multiple values in that position within the file, which should never be the case (the pattern should match every record in the file, but sometimes doesn't). Below is the current code:
echo "Parsing out list of warehouses contained in file set."
( cd $DATA && grep -l '^ 80' * ) >$TEMP/$program.list.whse80.$$
( cd $DATA && grep -l '^ 61' * ) >$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 68' * ) >>$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 69' * ) >>$TEMP/$program.list.whse61.$$
( cd $DATA && grep -l '^ 01' * ) >$TEMP/$program.list.whse01.$$
.etc...
What is happening is that when a file contains records that begin with both the 61 pattern (with preceding 9 spaces) and the 01 pattern, the same filename is captured in both the 61 list and the 01 list. I would like to grep only the first line of each file, as I have other logic to catch mixed files later in my program.
Many thanks in advance for any assistance.

use head to restrict to top n lines only, for example
head -3 file | grep ...
for globbing files you can do in a for loop
for f in *; do if [ -f "$f" ]; then head -1 "$f" | grep ...; fi; done
If you want to output the file name, grep cannot report it in this pipeline, because it only sees the line that head passes on, not the file. However, you can check grep's exit status and print the file name yourself:
if head -1 "$f" | grep -q pattern; then echo "$f"; fi
Alternatively you can use awk instead of grep
for f in *; do if [ -f "$f" ]; then awk 'NR==1 && /pattern/{print FILENAME}' "$f"; fi; done
replace pattern with your pattern.
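Applied to the warehouse lists from the question, the first-line check could be wrapped in a small function. This is only a sketch: the function name first_line_matches is made up for illustration, and it assumes the patterns are extended regular expressions and that only regular files directly inside the directory matter.

```shell
# Print the names of regular files in directory $1 whose FIRST line matches
# the extended regular expression $2. Lines after the first are never tested.
first_line_matches() {
    dir=$1 pattern=$2
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        # head passes on line 1 only; grep -q just sets the exit status.
        if head -n 1 "$f" | grep -Eq -- "$pattern"; then
            printf '%s\n' "${f##*/}"    # bare file name, like grep -l after cd
        fi
    done
}

# Possible use with the variables from the question (hypothetical):
#   first_line_matches "$DATA" '^ 80'         > "$TEMP/$program.list.whse80.$$"
#   first_line_matches "$DATA" '^ (61|68|69)' > "$TEMP/$program.list.whse61.$$"
```

Merging 61, 68 and 69 into one alternation replaces the three appending grep calls with a single pass over the files.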

Related

Iterate over multiple files, remove those who contains x specific characters

New to Shell. I have more than 10 thousand files and I have to delete files that contain the "<" characters less than 10 times.
wc -l * 2>&1 | while read -r num file; do ((num < 10)) && echo rm "$file"; done
This one removes files if they have fewer than 10 lines, but how do I count the "<" character instead?

With GNU grep, bash and GNU xargs:
#!/bin/bash
grep -cZ '<' * |
while IFS='' read -r -d '' file && read count
do
(( count < 10 )) && printf '%s\0' "$file"
done |
xargs -0r rm
Explanations
grep -cZ outputs a stream of file \0 count \n records.
You process it with a while loop that reads the file (using a NUL-byte delimiter) and the count (using a newline delimiter).
You do your filtering logic and output the files that you want to delete (in the form of NUL-delimited records).
Finally, xargs -0r rm deletes the files.
Here's an alternative with GNU awk and xargs:
awk -v n=10 '
FNR == 1 {
count = 0
}
/</ && ++count >= n {
nextfile
}
ENDFILE {
if (count < n)
printf "%s%c", FILENAME, 0
}
' * |
xargs -0r rm

Using GNU grep (for the -m option, to make it a bit more efficient):
#!/bin/bash
for f in *; do
(( $(grep -Fc -m10 \< "$f") < 10 )) && echo rm "$f"
done
Drop the echo if output looks fine.
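Note that grep -c counts matching lines, not characters, so a line holding several "<" counts once. If every single occurrence should count, something like the following might do. It is a sketch under that assumption; the names count_char and list_sparse_files are invented here, and the second function is a dry run that only prints the rm commands.

```shell
# Count every occurrence of the single literal character $1 in file $2.
# tr -cd deletes all bytes except that character; wc -c counts what is left.
count_char() {
    tr -cd "$1" < "$2" | wc -c | tr -d ' '
}

# Dry run: for each regular file in the current directory with fewer than $1
# occurrences of '<', print the rm command that would delete it.
list_sparse_files() {
    for f in *; do
        [ -f "$f" ] || continue
        if [ "$(count_char '<' "$f")" -lt "$1" ]; then
            printf 'rm -- %s\n' "$f"
        fi
    done
}
```

Running list_sparse_files 10 inside the directory lists the candidates; it runs tr once per file, so it will be slower than the single grep -c call above on ten thousand files.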

Replace filename to a string of the first line in multiple files in bash

I have multiple fasta files, where the first line always contains a > with multiple words, for example:
File_1.fasta:
>KY620313.1 Hepatitis C virus isolate sP171215 polyprotein gene, complete cds
File_2.fasta:
>KY620314.1 Hepatitis C virus isolate sP131957 polyprotein gene, complete cds
File_3.fasta:
>KY620315.1 Hepatitis C virus isolate sP127952 polyprotein gene, complete cds
I would like to take the word starting with sP* from each file and rename each file to this string (for example: File_1.fasta to sP171215.fasta).
So far I have this:
$ for match in "$(grep -ro '>')";do
fname=$("echo $match|awk '{print $6}'")
echo mv "$match" "$fname"
done
But it doesn't work, I always get the error:
grep: warning: recursive search of stdin
I hope you can help me!

you can use something like this:
grep '>' *.fasta | while read -r line ; do
new_name="$(echo "$line" | cut -d' ' -f 6)"
old_name="$(echo "$line" | cut -d':' -f 1)"
mv "$old_name" "$new_name.fasta"
done
It searches the *.fasta files and handles every matching line.
It splits each grep result on spaces and takes the 6th element as the new name.
It splits each grep result on : and takes the first element as the old name.
It moves/renames the old filename to the new filename.

There are several things going on with this code.
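For a start, I don't actually get this particular error, which may be down to different grep versions.
The warning most likely means that your grep reads standard input when -r is given without a file or directory operand; passing an explicit directory, such as ., should avoid it. The quoted '>' is not the problem, since inside quotes the shell does not treat it as a redirection. I would avoid "\>", which GNU grep interprets as an end-of-word anchor.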
Secondly:
fname=$("echo $match|awk '{print $6}'")
The quotes inside serve an unintended purpose: they make the shell look for a single command named after the whole string. Your code should look like this, if anything:
fname="$(echo $match|awk '{print $6}')"
Lastly, to properly retrieve your data, this should be your final code:
grep -Hr '>' . | while IFS= read -r match; do
fname="$(echo "$match" | cut -d: -f1)"
new_fname="$(echo "$match" | grep -o "sP[^ ]*")".fasta
echo mv "$fname" "$new_fname"
done
Explanations:
grep -H -> you want your grep to explicitly use "Include Filename", just in case other shell environments decide to alias grep to grep -h (no filenames)
the output is piped into while read so that the loop body runs once per matching line; for match in "$(...)" would run only once, with the whole output in a single variable.
you don't want to be doing grep -o on your file search, as you want to have both the filename and the "new filename" in one data entry.
Although, I don't see why you would search for '>' and not directly for 'sP', like this:
grep -Hro "sP[0-9]*" . | while IFS= read -r match; do
This is not the exact same behaviour, and has different edge cases, but it just might work for you.

Quite straightforward in (g)awk :
create a file "script.awk":
FNR == 1 {
for (i=1; i<=NF; i++) {
if (index($i, "sP")==1) {
print "mv", FILENAME, $i ".fasta"
nextfile
}
}
}
use it :
awk -f script.awk *.fasta > cmmd.txt
check the content of the output.
mv File_1.fasta sP171215.fasta
mv File_2.fasta sP131957.fasta
if ok, launch rename with . cmmd.txt

For all fasta files in directory, search their first line for the first word starting with sP and rename them using that word as the basename.
Using a bash array:
for f in *.fasta; do
arr=( $(head -1 "$f") )
for word in "${arr[@]}"; do
[[ "$word" =~ ^sP ]] && echo mv "$f" "${word}.fasta" && break
done
done
or using grep:
for f in *.fasta; do
word=$(head -1 "$f" | grep -o "\bsP\w*")
[ -z "$word" ] || echo mv "$f" "${word}.fasta"
done
Note: remove echo after you are ok with testing.
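Once the echo is removed, two files that yield the same sP word would silently overwrite each other. A guarded variant could look like the sketch below; the function names first_sp_word and rename_by_sp_word are made up for illustration, and it assumes the sP word is separated from its neighbours by whitespace on line 1.

```shell
# Print the first whitespace-delimited word starting with "sP" on line 1 of
# file $1, or nothing if line 1 has no such word.
first_sp_word() {
    awk 'NR == 1 { for (i = 1; i <= NF; i++) if ($i ~ /^sP/) { print $i; exit } }
         NR > 1  { exit }' "$1"
}

# Rename each *.fasta file in the current directory to <sP word>.fasta,
# skipping files without an sP word and never overwriting an existing file.
rename_by_sp_word() {
    for f in *.fasta; do
        [ -f "$f" ] || continue
        word=$(first_sp_word "$f")
        [ -n "$word" ] || continue          # no sP word: leave the file alone
        target=$word.fasta
        [ "$f" = "$target" ] && continue    # already named correctly
        if [ -e "$target" ]; then
            printf 'skip %s: %s exists\n' "$f" "$target" >&2
            continue
        fi
        mv -- "$f" "$target"
    done
}
```

Skipped files are reported on stderr, so collisions can be resolved by hand afterwards.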

bash - loop through subdirectories, cat files and rename with directory name

I have a folder structure like this ...
data/
---B1/
name_x_1.gz
name_y_1.gz
name_z_2.gz
name_p_2.gz
---C1
name_s_1.gz
name_t_1.gz
name_u_2.gz
name_v_2.gz
I need to go in to each subdirectory (e.g. B1) and perform the following:
cat *_1.gz > B1_1.gz
cat *_2.gz > B1_2.gz
I'm having problems with the file naming part. I can get in directories using the following:
for d in */; do
cat *_1.gz > $d_1.gz
cat *_2.gz > $d_2.gz
done
However I get an error that $d is a directory -- how do I strip the name to create the concatenated filename?
Thanks

Taking your question verbatim: if you have a variable d that you know ends in / (as is the case in your example), you can strip that last character by writing ${d%/} or, in bash 4.2 or later, ${d:0:-1} (the substring from the beginning up to, but excluding, the last character).
Of course in your case, I would rather write the loop as
for d in *; do
which already creates the names without a trailing slash. But this is probably still not what you want, because d would take the names of the entries in the directory you have cd'ed to, whereas you want the name of the directory itself. You can obtain this with, for instance, $(basename "$PWD"), which turns your loop into:
cd B1
prefix=$(basename "$PWD") # This sets prefix to B1
for f in *
do
# Since your original code indicates that you want to create a *copy* of the file
# with a new name, I do the same here.
cp -v "$f" "${prefix}_$f"
done
You can also use cat, as in your original solution, if you prefer.

If you're using bash, you can use parameter expansion and do everything natively in the shell, without spawning an external process such as basename. The ${dir##*/} expansion used here is also POSIX-compliant.
#!/bin/bash
for dir in data/*; do
cat "$dir/"*_1.gz > "$dir/${dir##*/}_1.gz"
cat "$dir/"*_2.gz > "$dir/${dir##*/}_2.gz"
done
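One thing to watch with the loop above: on a second run, B1_1.gz itself matches *_1.gz and becomes an input to its own cat. Writing the results to a separate directory sidesteps that. The following is a sketch only; concat_by_dir and its two arguments (source directory, existing output directory) are hypothetical names chosen for this example.

```shell
# For each subdirectory of $1, concatenate its *_1.gz and *_2.gz files into
# <subdir>_1.gz and <subdir>_2.gz inside the existing output directory $2.
concat_by_dir() {
    src=$1 out=$2
    for dir in "$src"/*/; do
        [ -d "$dir" ] || continue
        dir=${dir%/}                  # strip the trailing slash left by the glob
        name=${dir##*/}               # last path component only, e.g. B1
        for n in 1 2; do
            set -- "$dir"/*_"$n".gz   # expand the glob into "$@"
            [ -e "$1" ] || continue   # nothing matched: skip this suffix
            cat "$@" > "$out/${name}_$n.gz"
        done
    done
}

# Possible use: mkdir -p merged && concat_by_dir data merged
```

Since gzip accepts concatenated members, the merged files should decompress to the inputs' contents in glob order.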

Sure, just descend into the directory.
# assuming PWD = data/
for d in */; do
(
cd "$d"
cat *_1.gz > "$(basename "$d")"_1.gz
cat *_2.gz > "$(basename "$d")"_2.gz
)
done
how do I strip the name to create the concatenated filename?
The simplest and most portable is with basename.

This requires Ed, which should hopefully be present on your machine. If not, I trust your distribution will have a package for it.
#!/bin/sh
cat >> edprint+.txt << EOF
1p
q
EOF
cat >> edpop+.txt << EOF
1d
wq
EOF
b1="${PWD}/data/B1"
c1="${PWD}/data/C1"
find "${b1}" -maxdepth 1 -type f > b1stack
find "${c1}" -maxdepth 1 -type f > c1stack
while [ $(wc -l b1stack | cut -d' ' -f1) -gt 0 ]
do
b1line=$(ed -s b1stack < edprint+.txt)
b1name=$(basename "${b1line}")
b1suffix=$(echo "${b1name}" | cut -d'_' -f3)
b1fixed=$(echo "B1_${b1suffix}")
mv -v "${b1line}" "${b1}/${b1fixed}"
ed -s b1stack < edpop+.txt
done
while [ $(wc -l c1stack | cut -d' ' -f1) -gt 0 ]
do
c1line=$(ed -s c1stack < edprint+.txt)
c1name=$(basename "${c1line}")
c1suffix=$(echo "${c1name}" | cut -d'_' -f3)
c1fixed=$(echo "C1_${c1suffix}")
mv -v "${c1line}" "${c1}/${c1fixed}"
ed -s c1stack < edpop+.txt
done
rm -v ./edprint+.txt
rm -v ./edpop+.txt
rm -v ./b1stack
rm -v ./c1stack

Find file with largest number of lines in single directory

I'm trying to create a function that only outputs the file with the largest number of lines in a directory (and not any sub-directories). I'm being asked to make use of the wc function but don't really understand how to read each file individually and then sort them just to find the largest. Here is what I have so far:
#!/bin/bash
function sort {
[ $# -ne 1 ] && echo "Invalid number of arguments">&2 && exit 1;
[ ! -d "$1" ] && echo "Invalid input: not a directory">&2 && exit 1;
# Insert function here ;
}
# prompt if wanting current directory
# if yes
# sort $PWD
# if no
#sort $directory

This solution is almost pure Bash (wc is the only external command used):
shopt -s dotglob # Include filenames with initial '.' in globs
shopt -s nullglob # Make globs produce nothing when nothing matches
dir=$1
maxlines=-1
maxfile=
for file in "$dir"/* ; do
[[ -f $file ]] || continue # Skip non-files
[[ -L $file ]] && continue # Skip symlinks
numlines=$(wc -l < "$file")
if (( numlines > maxlines )) ; then
maxfile=$file
maxlines=$numlines
fi
done
[[ -n "$maxfile" ]] && printf '%s\n' "$maxfile"
Remove the shopt -s dotglob if you don't want to process files whose names begin with a dot. Remove the [[ -L $file ]] && continue if you want to process symlinks to files.
This solution should handle all filenames (ones containing spaces, ones containing glob characters, ones beginning with '-', ones containing newlines, ...), but it runs wc for each file so it may be unacceptably slow compared to solutions that feed many files to wc at once if you need to handle directories that have large numbers of files.

How about this:
wc -l * | sort -nr | head -2 | tail -1
wc -l counts lines (you get an error for directories, though), then sort in reverse order treating the first column as a number, then take the first two lines, then the second, as we need to skip over the total line.
wc -l * 2>/dev/null | sort -nr | head -2 | tail -1
The 2>/dev/null throws away all the errors, if you want a neater output.

Use a function like this:
my_custom_sort() {
for i in "${1+$1/}"*; do
[[ -f "$i" ]] && wc -l "$i"
done | sort -n | tail -n1 | cut -d" " -f2
}
And use it with or without directory (in latter case, it uses the current directory):
my_custom_sort /tmp
/tmp/helloworld.txt

How to replace file's names with numbers starting with certain number?

I want files to be named like 177.jpg, 178.jpg and so on starting with 177.jpg.
I used this to rename them from 1 to amount of files:
ls | cat -n | while read n f; do mv "$f" "$n.jpg"; done
How to modify this ? But completely new script also would be great.

Bash can do simple math for you:
mv "$f" $(( n + 176 )).jpg
Just hope no filename contains a newline.
There are safer ways than parsing the output of ls, e.g. iterating over an expanded wildcard:
n=177
for f in * ; do
mv "$f" $(( n++ )).jpg
done
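The loop above will happily move a file onto a name that is still waiting to be renamed (for example if 178.jpg already exists). A more cautious variant might look like this sketch; renumber_jpgs is an invented name, and it assumes that only *.jpg files in the current directory should be touched.

```shell
# Rename every *.jpg in the current directory to N.jpg, counting up from $1.
# An existing target is never overwritten; the function stops with status 1.
renumber_jpgs() {
    n=$1
    for f in *.jpg; do
        [ -f "$f" ] || continue
        target=$n.jpg
        if [ "$f" != "$target" ]; then
            [ -e "$target" ] && { printf '%s already exists\n' "$target" >&2; return 1; }
            mv -- "$f" "$target"
        fi
        n=$((n + 1))
    done
}

# Possible use: cd into the picture directory, then: renumber_jpgs 177
```

The file list comes from the glob, so names with spaces or newlines are handled without parsing ls.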

This should work.
#!/bin/bash
c=177;
for i in $(ls | grep -v '^[0-9]' | grep '\.png$'); # Select only .png files whose names do not start with a digit
do
mv "$i" "$c".png;
(( c=c+1 ));
done
