How to search string references in a specified location?

I'm trying to find the occurrences of a list of elements (from a text file) in a directory.
Below is the Bash code I'm using, but I'm unable to get the output of the grep command onto the console.
#!/bin/bash
FILENAME=$1
count=0
while read LINE
do
    let count++
    echo "$count $LINE"
    grep -r $LINE /home/user/vaishnavi
done < $FILENAME
echo -e "\nTotal $count Lines read"
Output:
1 ASK
2 TELL
3 ORDER
4 NUMBER
5 SIZE
6 BASKET
7 FRUIT
8 VEGGIES
Total 8 Lines read
I'm getting only the list of elements but not their occurrences in the specified location.
Is there anything wrong with my code?
Thanks.

You need to echo the result of the grep, for example:
echo $(grep -r $LINE /home/user/vaishnavi)
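A minimal sketch of the whole loop with that change and with quoting added (the directory path is the one from the question); quoting $LINE keeps multi-word search terms intact:

#!/bin/bash
FILENAME=$1
count=0
while IFS= read -r LINE
do
    let count++
    echo "$count $LINE"
    # Quote both the term and the capture so matches print one per line.
    echo "$(grep -r "$LINE" /home/user/vaishnavi)"
done < "$FILENAME"
echo -e "\nTotal $count lines read"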

Related

How do I use for to loop over potentially-empty lines of output from egrep?

I'm trying to print out the blank lines in a text file, but I also want to print a count so I can see how many blank lines egrep returned, using:
for x in $(egrep '^$' txtfile); do echo "$x"; done
but this doesn't echo or return anything. Is there any way to know how many blank lines the egrep command returned?
for is the wrong tool for this job; the right one (if you don't want to use grep -c but really do want to read each line of output) is a while read loop, as discussed in BashFAQ #1:
#!/usr/bin/env bash
# ^^^^- important: bash, not sh
count=0
while IFS= read -r x; do
    echo "Read a line: <$x>" >&2
    (( ++count ))
done < <(egrep '^$' txtfile)
echo "Read $count lines" >&2

How do I split a string on a pattern at the linux bash prompt and return the last instance of my pattern and everything after

This is my first question on StackOverflow; I hope it's not too noob for this forum. Thanks for your help in advance!!!
[PROBLEM]
I have a Linux bash variable in my bash script with the below content:
[split]
this is a test 1
[split]
this is a test 2
[split]
this is a test 3
this is a test 4
this is a test 5
How can I split this file on the string "[split]" and return the last section after the split?
this is a test 3
this is a test 4
this is a test 5
The last section can vary in length, but it is always at the end of the "string" / "file".
Using awk, set the record separator to a regular expression matching the split string, and print the last record in the END block.
gawk 'BEGIN{ RS="[[]split[]]" } END{ print $0 }' tmp/test.txt
Result assuming input coming from a file:
this is a test 3
this is a test 4
this is a test 5
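Since the question describes the text as living in a bash variable rather than a file, the same awk program can read from a here-string; a minimal sketch, assuming the variable is named s:

# Feed the shell variable to gawk via a here-string instead of a file.
gawk 'BEGIN{ RS="[[]split[]]" } END{ print $0 }' <<< "$s"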
How about this? :)
FILE="test.txt"
NEW_FILE="test_result.txt"
SPLIT="split"
while read line
do
if [[ $line == $SPLIT ]]
then
$(rm ${NEW_FILE})
else
$(echo -e "${line}" >> ${NEW_FILE})
fi
done < $FILE
#!/bin/bash
s="[split]
this is a test 1
[split]
this is a test 2
[split]
this is a test 3
this is a test 4
this is a test 5"
a=()
i=0
while read -r line
do
    a[i]="${a[i]}${line}"$'\n'
    if [ "$line" == "[split]" ]
    then
        let ++i
    fi
done <<< "$s"
echo "${a[-1]}"    # quoted so the embedded newlines survive; a[-1] needs bash 4.3+
I simply read each line of the string into an array, and when I encounter [split] I increment the array index. At the end, I echo the last element.
EDIT:
If you just want the last part, there's no need for an array. You could do something like:
a=""
while read -r line
do
    a+="${line}"$'\n'
    if [ "$line" == "[split]" ]
    then
        a=""    # reset at each marker so only the last section survives
    fi
done <<< "$s"
echo "$a"
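Run against the sample string s, either variant should print the last section:

this is a test 3
this is a test 4
this is a test 5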

bash returning erroneous results after about 36 million lines when iterating through a pair of files - is this a memory error?

I have written a simple script in bash to iterate through a pair of text files to make sure they are properly formatted.
The required format is as follows:
Each file contains millions of ‘records’.
Each record takes up two lines in each file – a header line and a sequence line.
Each header line consists of a ">" symbol, followed by a sample name (alphanumeric string), followed by a period, followed by a unique record identifier number (an integer), followed by a suffix of either '/1' or '/2'.
Each sequence line contains a string of 30-100 A,C,G and T characters (the four DNA nucleotides, if anyone is wondering).
The files are paired, in that the first record in one file corresponds to the first record in the second file, and so forth. The header lines in the two files should be identical, except that in one file they will all have a '/1' suffix and in the other file they will all have a '/2' suffix. The sequence lines can be very different between the two files.
The code I developed is designed to check that (a) the header lines in each record follow the correct format, (b) the header lines in the corresponding records in the two files match (i.e. are identical except for the /1 and /2 suffixes) and (c) the sequence lines contain only A,C,G and T characters.
Example of properly formatted records:
> cat -n file1 | head -4
1 >SRR573705.1/1
2 ATAATCATTTGCCTCTTAAGTGGGGGCTGGTATGAATGGCAAGACGGGAATCTAGCTGTCTCTCCCTTATATCTTGAAGTTAATATTTCTGTGAAGAAGC
3 >SRR573705.2/1
4 CCACTTGTCCCAGTCTGTGCTGCCTGTACAATGGATTAGCTGAGGAAAACTGGCATCCCATGGCCTCAAACAGACGCAGCAAGTCCATGAAGCCATAATT
> cat -n file2 | head -4
1 >SRR573705.1/2
2 TTTCTAACAATTGAATTAGCAACACAAACACTATTGACAAAGCTATATCTTATTTCTACTAAAGCTCGATAGGGTCTTCTCGTCCTGCGATCCCATTCCT
3 >SRR573705.2/2
4 GTATGATGGGTGTGTCAAGGAGCTCAACCATCGTGATAGGCTACCTCATGCATCGAGACAAGATCACATTTAATGAGGCATTTGACATGGTCAGGAAGCA
My code is below. It works perfectly well for small test files containing only a couple of hundred records. When reading a real data file with millions of records, however, it returns nonsensical errors, for example:
Inaccurate header line in read 18214236 of file2
Line 36428471: TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The error above is simply wrong. Line 36,428,471 of file2 is ‘>SRR573705.19887618/2’
The string reported in the error is not even present in file 2. It does, however, appear multiple times in file1, i.e.:
cat -n /file1 | grep 'TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC'
4632838 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
24639990 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
143478526 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The data in the two files seems to match perfectly in the region where the error was returned:
cat -n file1 | head -36428474 | tail
36428465 >SRR573705.19887614/1
36428466 CACCCCAGCATGTTGACCACCCATGCCATTATTTCATGGTATTTTCTTACATTTTGTATATAACAGATGCATTACGTATTATAGCATTGCTTTTCGTAAA
36428467 >SRR573705.19887616/1
36428468 AGATCCTCCTCCTCATCGGTCAGTCGCCAATCCAACAACTCAACCTTCTTCTTCAAGTCACTCAGCCGTCGGCCCGGGACTGCCGTTTCATGATGCCTAT
36428469 >SRR573705.19887617/1
36428470 CAATAGCGTATATTAAAATTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGCGCAGGCTGGCGGTCCGCCGCAAGGCGCGCCACTGCCAGCCTGGCC
36428471 >SRR573705.19887618/1
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428473 >SRR573705.19887619/1
36428474 CCAGCCTGCGCCCAAGATCCAACTACGAGCTTTTTAACTGCAGCAATTTTAATATACGCTATTGGAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTG
>cat -n file2 | head -36428474 | tail
36428465 >SRR573705.19887614/2
36428466 GTAATTTACAGGAATTGTTTACATTCTGAGCAAATAAAACAAATAATTTTAATACACAAACTTGTTGAAAGTTAATTAGGTTTTACGAAAA
36428467 >SRR573705.19887616/2
36428468 GCCGTCGCAGCAACATTTGAGATATCCCGTAAGACGTCTTGAACGGCTGGCTCTGTCTGCTCTCGGAGAACCTGCCGGCTGAACCGGACAGCGCAGACG
36428469 >SRR573705.19887617/2
36428470 CTCGAGTTCCGAAAACCAACGCAATAGAACCGAGGTCCTATTCCATTATTCCATGCTCTGCTGTCCAGGCGGTCGGCCTG
36428471 >SRR573705.19887618/2
36428472 GGACATGGAAACAGAAAATAATGAAAAGACCAAAGAAGATGCACTTGAGGTTGATAAGCCTAAAGG
36428473 >SRR573705.19887619/2
36428474 CCCGACACGGGGAGGTAGTGACGAAAAATAGCAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTACACTTTAAATCCTTTAACGAGGATCTAT
Is there some sort of memory limit in bash that could cause such an error? I have run various versions of this code over multiple files and consistently get this problem after 36,000,000 lines.
My code:
set -u
function fastaConsistencyChecker {
    F_READS=$1
    R_READS=$2
    echo -e "$F_READS"
    echo -e "$R_READS"
    if [[ ! -s $F_READS ]]; then echo -e "File $F_READS could not be found."; exit 0; fi
    if [[ ! -s $R_READS ]]; then echo -e "File $R_READS could not be found."; exit 0; fi
    exec 3<"$F_READS"
    exec 4<"$R_READS"
    line_iterator=1
    read_iterator=1
    while read FORWARD_LINE <&3 && read REVERSE_LINE <&4; do
        if [[ $(( $line_iterator % 2 )) == 1 ]]; then
            ## This is a header line ##
            if [[ ! ( $FORWARD_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/1$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${F_READS}"
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                exit 0
            fi
            if [[ ! ( $REVERSE_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/2$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${R_READS}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            F_Name=${FORWARD_LINE:1:${#FORWARD_LINE}-3}
            R_Name=${REVERSE_LINE:1:${#REVERSE_LINE}-3}
            if [[ $F_Name != $R_Name ]]; then
                echo -e "Record names do not match."
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            line_iterator=$(( $line_iterator + 1 ))
        else
            if [[ ! ( $FORWARD_LINE =~ ^[ATCGNatcgn]+$ ) ]]; then
                echo -e "Ambiguous sequence detected for read ${read_iterator} at line ${line_iterator} in file ${F_READS}"
                exit 0
            fi
            read_iterator=$(( $read_iterator + 1 ))
            line_iterator=$(( $line_iterator + 1 ))
        fi
        unset FORWARD_LINE
        unset REVERSE_LINE
    done
    echo -e "$line_iterator lines and $read_iterator reads"
    echo -e "No errors detected."
    echo -e ""
}
export -f fastaConsistencyChecker
FILE3="filepath/file1"
FILE4="filepath/file2"
fastaConsistencyChecker $FILE3 $FILE4
I think you've proven there's an issue related to memory usage in bash. I think you can accomplish your format verification without running afoul of the memory issue by using text-processing tools from within bash.
#!/bin/bash
if ! [[ $1 && $2 && -s $1 && -s $2 ]]; then
    echo "usage: $0 <file1> <file2>"
    exit 1
fi
set -e
dir=$(mktemp -d)
clean () { rm -fr "$dir"; }
trap clean EXIT
# Join each header line with the sequence line that follows it, tab-separated.
pairs () { sed 'N;s/\n/\t/' "$@"; }
pairs $1 > $dir/$1
pairs $2 > $dir/$2
paste $dir/$1 $dir/$2 | grep -vP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$' && exit 1
exit 0
The sed script takes a line and concatenates it with the next, separated by a tab. This:
>SRR573705.1/1
ATAATCATTTGCCTCTT...
becomes this:
>SRR573705.1/1 ATAATCATTTGCCTCTT...
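That step is easy to verify in isolation; a quick sketch, with printf supplying one sample record:

printf '>SRR573705.1/1\nATAATCATTTGCCTCTT\n' | sed 'N;s/\n/\t/'
# output (tab-separated): >SRR573705.1/1	ATAATCATTTGCCTCTT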
paste takes the first line of file 1 and the first line of file 2 and outputs them as one line, separated by a tab. It does the same for the second lines, and so forth. grep then sees input like this:
>SRR573705.1/1    ATAATCATTTGCCTCT....    >SRR573705.1/2    TTTCTAACAATTGAAT...
The regular expression captures the first identifier and matches the same identifier later in the line with the backreference \1.
The script outputs any lines failing to match the regex due to the -v switch to grep. If lines are output, the script exits with status 1.
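The behaviour of the regex is easy to spot-check; a minimal sketch (grep -P requires GNU grep built with PCRE support):

# A well-formed pair matches (count 1) ...
printf '>SRR573705.1/1\tACGT\t>SRR573705.1/2\tTGCA\n' | grep -cP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$'
# ... while a mismatched identifier does not (count 0).
printf '>SRR573705.1/1\tACGT\t>SRR573705.2/2\tTGCA\n' | grep -cP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$'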

Finding max lines in a file while printing file name and lines separately?

So I keep messing this up, and I think where I was going wrong is that the code I'm writing needs to return only the file name and the number of lines from an argument.
So, using wc, I need something that accepts either 0 or 1 arguments and prints out something like "The file findlines.sh has 4 lines", or, if given ./findlines.sh Desktop/testfile, prints "the file testfile has 5 lines".
I have a few attempts and all of them have failed. I can't seem to figure out how to approach it at all.
Should I echo "The file" and then toss the argument name in and then add another echo for "has the number of lines [lines]"?
Sample input from the terminal would be something like
>findlines.sh
Output:the file findlines.sh has 18 lines
Or maybe
>findlines.sh /home/directory/user/grocerylist
Output: the file grocerylist has 16 lines
#! /bin/sh -
file=${1-findfiles.sh}
lines=$(wc -l < "$file") &&
printf 'The file "%s" has %d lines\n' "$file" "$lines"
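The ${1-findfiles.sh} expansion substitutes the default file name whenever no argument is supplied, so both call styles from the question work. A usage sketch (the line count is assumed from the question's example):

$ ./findlines.sh /home/directory/user/grocerylist
The file "/home/directory/user/grocerylist" has 16 lines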
This should work:
#!/bin/bash
file="findfiles.sh"
if [ $# -ge 1 ]
then
    file=$1
fi
if [ -f "$file" ]
then
    lines=$(wc -l "$file" | awk '{print $1}')
    echo "The file $file has $lines lines"
else
    echo "File not found"
fi
See sch's answer for a shorter example that doesn't use awk.

Parsing Command Output in Bash Script

I want to run a command that gives the following output and parse it:
[VDB VIEW]
[VDB] vhctest
[BACKEND] domain.computername: ENABLED:RW:CONSISTENT
[BACKEND] domain.computername: ENABLED:RW:CONSISTENT
...
I'm only interested in some key words, such as 'ENABLED' etc. I can't just search for ENABLED, as I need to parse one line at a time.
This is my first script, and I want to know if anyone can help me.
EDIT:
I now have:
cmdout=`mycommand`
while read -r line
do
    #check for key words in $line
done < $cmdout
I thought this did what I wanted but it always seems to output the following right before the command output.
./myscript.sh: 29: cannot open ... : No such file
I don't want to write to a file to have to achieve this.
Here is the pseudocode:
cmdout=`mycommand`
loop each line in $cmdout
    if line contains $1
        if line contains $2
            output 1
        else
            output 0
The reason for the error is that
done < $cmdout
treats the contents of $cmdout as a filename.
You can either do:
done <<< "$cmdout"
or
done <<EOF
$cmdout
EOF
or
done < <(mycommand) # without using the variable at all
or
done <<< "$(mycommand)"
or
done <<EOF
$(mycommand)
EOF
or
mycommand | while
...
done
However, the last one creates a subshell and any variables set in the loop will be lost when the loop exits.
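The loss is easy to demonstrate; a minimal sketch, reusing the hypothetical mycommand from the question:

count=0
mycommand | while read -r line
do
    (( ++count ))    # increments inside a subshell
done
echo "$count"    # still prints 0 in the parent shell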
"How can I read a file (data stream, variable) line-by-line (and/or field-by-field)?"
"I set variables in a loop. Why do they suddenly disappear after the loop terminates? Or, why can't I pipe data to read?"
$ cat test.sh
#!/bin/bash
while read line ; do
    if [ `echo "$line" | grep "$1" | wc -l` != 0 ]; then
        if [ `echo "$line" | grep "$2" | wc -l` != 0 ]; then
            echo "output 1"
        else
            echo "output 0"
        fi
    fi
done
USAGE
$ cat in.txt | ./test.sh ENABLED RW
output 1
output 1
This isn't the best solution, but it's a word-by-word translation of what you want; it should give you something to start with, and you can add your own logic.
