How to test for certain characters in a file - linux

I am currently running a script with an if statement. Before the rest of the script runs, I want to make sure the file provided as the first argument contains certain characters.
If the file does not have those characters in the right spots, the script should print "File is Invalid" on the command line.
For the if statement to be true, the file needs to have at least one hyphen and at least one comma in field 1 of line 1.
How would I create an if statement with perhaps a test command to validate those certain characters are present?
Thanks
I'm new to Linux/Unix. This is my homework, so I haven't really tried anything yet, only brainstormed possible solutions.
function usage
{
    echo "usage: $0 filename ..."
    echo "ERROR: $1"
}

if [ $# -eq 0 ]
then
    usage "Please enter a filename"
else
    name="Yaroslav Yasinskiy"
    echo $name
    date
    while [ $# -gt 0 ]
    do
        if [ -f $1 ]
        then
            if <--------- here is where the answer would be
                starting_data=$1
                echo
                echo $1
                cut -f3 -d, $1 > first
                cut -f2 -d, $1 > last
                cut -f1 -d, $1 > id
                sed 's/$/:/' last > last1
                sed '/last:/ d' last1 > last2
                sed 's/^ *//' last2 > last3
                sed '/first/ d' first > first1
                sed 's/^ *//' first1 > first2
                sed '/id/ d' id > id1
                sed 's/-//g' id1 > id2
                paste -d\ first2 last3 id2 > final
                cat final
                echo ''
        else
            echo
            usage "Could not find file $1"
        fi
        shift
    done
fi

In answer to your direct question:
For the if statement to be true, the file needs to have at least one
hyphen in Field 1 line 1 and at least one comma in Field one Line one.
How would I create an if statement with perhaps a test command to
validate those certain characters are present?
Bash provides all the tools you need. While you can call awk, you really just need to read the first line of the file into two variables (say a and b) and then use [[ $a =~ regex ]], where regex is an extended regular expression that verifies that the first field (contained in $a) contains both a '-' and a ','.
For details on the [[ =~ ]] expression, see bash(1) - Linux manual page under the section labeled [[ expression ]].
Let's start with read. When you provide two variables, read puts the first field in the first variable (based on normal word-splitting governed by IFS, the Internal Field Separator, which defaults to $' \t\n' - space, tab, newline) and the rest of the line in the second. So by doing read -r a b you read the first field into a and the rest of the line into b (you don't care about b for your test).
Your regex can be ([-]+.*[,]+|[,]+.*[-]+), which is an (x|y) expression, i.e. x OR y, where x is [-]+.*[,]+ (one or more '-', then anything, then one or more ',') and y is [,]+.*[-]+ (one or more ',', then anything, then one or more '-'). So with the '|' your regex will accept either a hyphen, then zero or more characters, then a comma, or a comma, then zero or more characters, then a hyphen in the first field.
How do you read the line? With simple redirection, e.g.
read -r a b < "$1"
So your conditional test in your script would look something like:
if [ -f $1 ]
then
    read -r a b < "$1"
    if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]   # <-- here is where the ...
    then
        starting_data=$1
        ...
    else
        echo "File is Invalid" >&2           # redirection to 2 (stderr)
    fi
else
    echo
    usage "Could not find file $1"
fi
shift
...
Example Test Files
$ cat valid
dog-food, cat-food, rabbit-food
50lb 16lb 5lb
$ cat invalid
dogfood, catfood, rabbitfood
50lb 16lb 5lb
Example Use/Output
$ read -r a b < valid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file valid
and for the file without the certain characters:
$ read -r a b < invalid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file invalid
Now you really have to concentrate on eliminating the spawning of at least a dozen subshells, where you call cut 3 times, sed 7 times, paste once and then cat. While it is good that you are thinking through what you need to do and getting it working, as mentioned in my comment, any time you are looping you want to reduce the number of subshells spawned to the greatest extent possible. I suspect, as @Mig answered, awk is the proper tool here and can likely eliminate all 12 subshells, replacing them with a single call to awk.
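For example, here is a rough single-awk sketch of the whole cut/sed/paste pipeline above (an assumption-laden illustration: it presumes a comma-separated file whose first line is a header row and whose fields are id, last, first, which is what the cut commands imply; the field separator ' *, *' also absorbs the spaces the sed 's/^ *//' steps were stripping):
awk -F' *, *' 'NR > 1 {           # skip the header row
    id = $1
    gsub(/-/, "", id)             # strip hyphens from the id, like sed "s/-//g"
    print $3, $2 ":", id          # "first last: id", as the paste step builds
}' "$1" > final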

I personally would use awk for this whole task, since you want to test fields and build a string from concatenated fields. Awk is perfect for that.
But here is a small script which shows how you could just test your file's first line:
if [[ $(head -n 1 file.csv | awk '$1~/-/ && $1~/,/ {print "MATCH"}') == 'MATCH' ]]; then
    echo "yes"
else
    echo "no"
fi
It looks like overkill when you're not doing the whole thing in awk, but it works. I am sure there is a way to test with only one regex, but that would involve knowing which flavour of awk you have, because I don't think they all use the same regex engine, so I left it out for the sake of simplicity.
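For what it's worth, alternation is part of POSIX extended regular expressions and is supported by the common awk implementations, so a single-pattern version is possible; a sketch, using the same file.csv as above:
if head -n 1 file.csv | awk '$1 ~ /-.*,|,.*-/ { ok = 1 } END { exit !ok }'; then
    echo "yes"
else
    echo "no"
fi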

Related

extract substrings starting with same pattern in a file

I have a fileA.txt which contains strings:
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
and so on .....
I would like to extract all the substrings which start with RS
Output:
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
I tried something like this:
while read line
do
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            [ -z "$str" ] && str="$word"
        fi
    done
done < fileA.txt
echo "$str"
However, I only get the first string, RS0247, printed out when I do the echo.
Given the three sample lines pasted above in the file f...
Assuming a fixed format:
awk '{print $2"_"$4}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):
awk '{f=1;for (i=1;i<=NF;i++){if($i~/^RS/){a[f]=$i;f++}}print a[1]"_"a[2]}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
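The same program spread out with comments, for readability:
awk '{
    f = 1
    for (i = 1; i <= NF; i++)                 # scan every whitespace-separated field
        if ($i ~ /^RS/) { a[f] = $i; f++ }    # collect fields that start with RS
    print a[1] "_" a[2]                       # join the first two with an underscore
}' f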
Edit 1:
And assuming that you want your own script patched rather than an efficient solution (your original prints only the first match because [ -z "$str" ] && str="$word" assigns str only while it is empty, and the echo sits outside the loop):
#!/bin/bash
while read line
do
    str=""
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            if [[ -z $str ]]
            then
                str=$word
            else
                str+="_${word}"
            fi
        fi
    done
    echo "$str"
done < fileA.txt
Edit 2:
In terms of efficiency: I copied and pasted those 3 lines into fileA.txt 60 times (180 lines total). The runtimes for the three attempts above, in the same order, are:
real 0m0.002s
real 0m0.002s
real 0m0.011s

bash returning erroneous results after about 36 million lines when iterating through a pair of files - is this a memory error?

I have written a simple script in bash to iterate through a pair of text files to make sure they are properly formatted.
The required format is as follows:
Each file contains millions of ‘records’.
Each record takes up two lines in each file – a header line and a sequence line.
Each header line consists of a ">" symbol, followed by a sample name (an alphanumeric string), followed by a period, followed by a unique record identifier number (an integer), followed by a suffix of either '/1' or '/2'.
Each sequence line contains a string of 30-100 A, C, G and T characters (the four DNA nucleotides, if anyone is wondering).
The files are paired, in that the first record in one file corresponds to the first record in the second file, and so forth. The header lines in the two files should be identical, except that in one file they will all have a '/1' suffix and in the other file they will all have a '/2' suffix. The sequence lines can be very different between the two files.
The code I developed is designed to check that (a) the header line in each record follows the correct format, (b) the header lines in corresponding records in the two files match (i.e. are identical except for the /1 and /2 suffixes), and (c) the sequence lines contain only A, C, G and T characters.
Example of properly formatted records:
> cat -n file1 | head -4
1 >SRR573705.1/1
2 ATAATCATTTGCCTCTTAAGTGGGGGCTGGTATGAATGGCAAGACGGGAATCTAGCTGTCTCTCCCTTATATCTTGAAGTTAATATTTCTGTGAAGAAGC
3 >SRR573705.2/1
4 CCACTTGTCCCAGTCTGTGCTGCCTGTACAATGGATTAGCTGAGGAAAACTGGCATCCCATGGCCTCAAACAGACGCAGCAAGTCCATGAAGCCATAATT
> cat -n file2 | head -4
1 >SRR573705.1/2
2 TTTCTAACAATTGAATTAGCAACACAAACACTATTGACAAAGCTATATCTTATTTCTACTAAAGCTCGATAGGGTCTTCTCGTCCTGCGATCCCATTCCT
3 >SRR573705.2/2
4 GTATGATGGGTGTGTCAAGGAGCTCAACCATCGTGATAGGCTACCTCATGCATCGAGACAAGATCACATTTAATGAGGCATTTGACATGGTCAGGAAGCA
My code is below. It works perfectly well for small test files containing only a couple of hundred records. When reading a real data file with millions of records, however, it returns nonsensical errors, for example:
Inaccurate header line in read 18214236 of file2
Line 36428471: TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The error above is simply wrong. Line 36,428,471 of file2 is ‘>SRR573705.19887618/2’
The string reported in the error is not even present in file 2. It does, however, appear multiple times in file1, i.e.:
cat -n /file1 | grep 'TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC'
4632838 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
24639990 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
143478526 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The data in the two files seems to match perfectly in the region where the error was returned:
cat -n file1 | head -36428474 | tail
36428465 >SRR573705.19887614/1
36428466 CACCCCAGCATGTTGACCACCCATGCCATTATTTCATGGTATTTTCTTACATTTTGTATATAACAGATGCATTACGTATTATAGCATTGCTTTTCGTAAA
36428467 >SRR573705.19887616/1
36428468 AGATCCTCCTCCTCATCGGTCAGTCGCCAATCCAACAACTCAACCTTCTTCTTCAAGTCACTCAGCCGTCGGCCCGGGACTGCCGTTTCATGATGCCTAT
36428469 >SRR573705.19887617/1
36428470 CAATAGCGTATATTAAAATTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGCGCAGGCTGGCGGTCCGCCGCAAGGCGCGCCACTGCCAGCCTGGCC
36428471 >SRR573705.19887618/1
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428473 >SRR573705.19887619/1
36428474 CCAGCCTGCGCCCAAGATCCAACTACGAGCTTTTTAACTGCAGCAATTTTAATATACGCTATTGGAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTG
> cat -n file2 | head -36428474 | tail
36428465 >SRR573705.19887614/2
36428466 GTAATTTACAGGAATTGTTTACATTCTGAGCAAATAAAACAAATAATTTTAATACACAAACTTGTTGAAAGTTAATTAGGTTTTACGAAAA
36428467 >SRR573705.19887616/2
36428468 GCCGTCGCAGCAACATTTGAGATATCCCGTAAGACGTCTTGAACGGCTGGCTCTGTCTGCTCTCGGAGAACCTGCCGGCTGAACCGGACAGCGCAGACG
36428469 >SRR573705.19887617/2
36428470 CTCGAGTTCCGAAAACCAACGCAATAGAACCGAGGTCCTATTCCATTATTCCATGCTCTGCTGTCCAGGCGGTCGGCCTG
36428471 >SRR573705.19887618/2
36428472 GGACATGGAAACAGAAAATAATGAAAAGACCAAAGAAGATGCACTTGAGGTTGATAAGCCTAAAGG
36428473 >SRR573705.19887619/2
36428474 CCCGACACGGGGAGGTAGTGACGAAAAATAGCAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTACACTTTAAATCCTTTAACGAGGATCTAT
Is there some sort of memory limit in bash that could cause such an error? I have run various versions of this code over multiple files and consistently get this problem after 36,000,000 lines.
My code:
set -u

function fastaConsistencyChecker {
    F_READS=$1
    R_READS=$2
    echo -e $F_READS
    echo -e $R_READS
    if [[ ! -s $F_READS ]]; then echo -e "File $F_READS could not be found."; exit 0; fi
    if [[ ! -s $R_READS ]]; then echo -e "File $R_READS could not be found."; exit 0; fi
    exec 3<$F_READS
    exec 4<$R_READS
    line_iterator=1
    read_iterator=1
    while read FORWARD_LINE <&3 && read REVERSE_LINE <&4; do
        if [[ $(( $line_iterator % 2 )) == 1 ]]; then
            ## This is a header line ##
            if [[ ! ( $FORWARD_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/1$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${F_READS}"
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                exit 0
            fi
            if [[ ! ( $REVERSE_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/2$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${R_READS}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            F_Name=${FORWARD_LINE:1:${#FORWARD_LINE}-3}
            R_Name=${REVERSE_LINE:1:${#REVERSE_LINE}-3}
            if [[ $F_Name != $R_Name ]]; then
                echo -e "Record names do not match. "
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            line_iterator=$(( $line_iterator + 1 ))
        else
            if [[ ! ( $FORWARD_LINE =~ ^[ATCGNatcgn]+$ ) ]]; then
                echo -e "Ambiguous sequence detected for read ${read_iterator} at line ${line_iterator} in file ${F_READS}"
                exit 0
            fi
            read_iterator=$(( $read_iterator + 1 ))
            line_iterator=$(( $line_iterator + 1 ))
        fi
        unset FORWARD_LINE
        unset REVERSE_LINE
    done
    echo -e "$line_iterator lines and $read_iterator reads"
    echo -e "No errors detected."
    echo -e ""
}
export -f fastaConsistencyChecker

FILE3="filepath/file1"
FILE4="filepath/file2"
fastaConsistencyChecker $FILE3 $FILE4
I think you've proven there's an issue related to memory usage with bash. I think you can accomplish your format verification without running afoul of the memory issue by using text processing tools from bash.
#!/bin/bash

if ! [[ $1 && $2 && -s $1 && -s $2 ]]; then
    echo "usage: $0 <file1> <file2>"
    exit 1
fi

set -e
dir=`mktemp -d`
clean () { rm -fr $dir; }
trap clean EXIT

pairs () { sed 'N;s/\n/\t/' "$@"; }

pairs $1 > $dir/$1
pairs $2 > $dir/$2

paste $dir/$1 $dir/$2 | grep -vP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$' && exit 1
exit 0
The sed script takes a line and concatenates it with the next, separated by a tab. This:
>SRR573705.1/1
ATAATCATTTGCCTCTT...
becomes this:
>SRR573705.1/1 ATAATCATTTGCCTCTT...
The paste takes the first line of file 1 and the first line of file 2 and outputs them as one line, separated by a tab. It does the same for the second line, and so forth. grep then sees input like this:
>SRR573705.1/1    ATAATCATTTGCCTCT....    >SRR573705.1/2    TTTCTAACAATTGAAT...
The regular expression captures the first identifier and matches the same identifier later in the line with the backreference \1.
The script outputs any lines failing to match the regex due to the -v switch to grep. If lines are output, the script exits with status 1.
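A hypothetical invocation (check_pairs.sh is just an illustrative name for the script above; since it re-creates the files it is given under the same names in a temp directory, run it with plain filenames from the directory that holds them):
$ bash check_pairs.sh file1 file2 && echo "format OK" || echo "offending records printed above"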

bash palindrome grep loop if then else missing '

My sysadmin prof just started teaching us bash. He wanted us to write a bash script using grep to find all 3-45 letter palindromes in the Linux dictionary without using reverse. I'm getting an error on my if statement saying I'm missing a '.
UPDATED CODE:
front='\([a-z]\)'
front_s='\([a-z]\)'
numcheck=1
back='\1'
middle='[a-z]'
count=3
while [ $count -ne "45" ]; do
    if [[ $(($count % 2)) == 0 ]]
    then
        front=$front$front_s
        back=+"\\$numcheck$back"
        grep "^$front$back$" /usr/share/dict/words
        count=$((count+1))
    else
        grep "^$front$middle$back$" /usr/share/dict/words
        numcheck=$((numcheck+1))
        count=$((count+1))
    fi
done
You have four obvious problems here:
First, a misplaced and unescaped backslash:
back="\\$numcheck$back" # and not back="$numcheck\$back"
Second is that you only want to increment numcheck if count is odd.
Third: in the line
front=$front$front
you're doubling the number of patterns in front! That yields exponential growth, hence the Argument list too long explosion. To fix this, add a variable, say front_step:
front_step='\([a-z]\)'
front=$front_step
and when you increment front:
front=$front$front_step
With these fixed, you should be good!
The fourth flaw is that grep's back-references may only have one digit: from man grep:
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring
previously matched by the nth parenthesized subexpression of the
regular expression.
In your approach, we'll need up to 22 back-references. That's too much for grep. I doubt there are any such long palindromes, though.
Also, you're grepping the file 43 times… that's a bit too much.
Try this:
#!/bin/bash
for w in `grep -E "^[[:alnum:]]{3,45}$" /usr/share/dict/words`; do
    if [[ "$w" == "`echo $w|sed "s/\(.\)/\1\n/g"|tac|tr -d '\012'`" ]]; then
        echo "$w == is a palindrome"
    fi
done
OR
#!/bin/bash
front='\([a-z]\)'
front_step='\([a-z]\)'
numcheck=1
back='\1'
middle='[a-z]'
count=3
while [ $count -ne "45" ]; do
    if [[ $(($count % 2)) == 0 ]]
    then
        front=$front$front_step   # grow front by one group per even length
        back="\\$numcheck$back"
        grep "^$front$back$" /usr/share/dict/words
    else
        grep "^$front$middle$back$" /usr/share/dict/words
        ## Thanks to gniourf for catching this.
        numcheck=$((numcheck+1))
    fi
    count=$((count+1))
    ## Uncomment the following if you want to see one by one and run script using bash -x filename.sh
    #echo Press any key to continue: ; read toratora;
done

Linux script to search for string in a file

I am a newbie to shell scripting. I have a requirement to read a file line by line and match each line against a specific string. If it matches, print X, and if it doesn't match, print Y.
Here is what I am trying, but I am getting unexpected results: I get 700 lines of output where my /tmp/l1.txt has only 10 lines. Somewhere I am going wrong in the loop. I appreciate your help.
for line in `cat /tmp/l3.txt`
do
    if echo $line | grep "abc.log" ; then
        echo "X" >>/tmp/l4.txt
    else
        echo "Y" >>/tmp/l4.txt
    fi
done
I don't understand the urge to do looping ...
awk '{if($0 ~ /abc\.log/){print "x"}else{print "y"}}' /tmp/l3.txt > /tmp/l4.txt
EDIT after inquiry ...
Of course, your spec wasn't overly precise, and I'm jumping to conclusions regarding your lines' format ... we basically take the whole line that matched abc.log and replace everything up to the directory abc, and everything from /logs to the end of the line, with nothing, which leaves us with clusterX/xyz.
awk '{if($0 ~ /abc\.log/){print gensub(/.+\/abc\/(.+)\/logs/, "\\1", 1)}else{print "y"}}' /tmp/l3.txt > /tmp/l4.txt
cat /tmp/l3.txt | while read line                  # read the entire line into the variable "line"
do
    if [ -n "$(echo "$line" | grep "abc.log")" ]   # if grep produced output (a non-empty string)
    then
        echo "X" >> /tmp/l4.txt                    # echo "X" (or the value of the variable "line") into l4.txt
    else
        echo "Y" >> /tmp/l4.txt                    # if empty, echo "Y" into l4.txt
    fi
done
While read will read the entire line if only one variable is given (in this case "line"); if you have a fixed number of fields, you can specify a variable for each field, e.g. "| while read field1 field2" etc. The -n test checks whether there is a value; -z tests whether it is empty.
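For example (field1, field2 and rest are arbitrary names here), the last variable soaks up whatever is left of the line:
$ echo "one two three four" | while read field1 field2 rest; do
    echo "field1=$field1 field2=$field2 rest=$rest"
done
field1=one field2=two rest=three four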
Why worry about cat and the rest before grep? You can simply test the return of grep and append all matching lines to /tmp/l4.txt, or else append "Y":
[ -f "/tmpfile.tmp" ] && :> /tmpfile.tmp # test for existing tmpfile & truncate
if grep "abc.log" /tmp/13.txt >>tmpfile.tmp ; then # write all matching lines to tmpfile
cat tmpfile.tmp /tmp/14.txt # if grep matched append to /tmp/14.txt
else
echo "Y" >> /tmp/14.txt # write "Y" to /tmp/14.txt
fi
rm tmpfile.tmp # cleanup
Note: if you don't want the result of the grep appended to /tmp/l4.txt, then just replace cat tmpfile.tmp >> /tmp/l4.txt with echo "X" >> /tmp/l4.txt, and you can remove the first and last lines.
I think the "awk" answer above is better. However, if you really need to iterate with a bash loop, you can use:
PATTERN="abc.log"
OUTPUTFILE=/tmp/l4.txt
INPUTFILE=/tmp/l3.txt
while read line
do
    grep -q "$PATTERN" <<< "$line" > /dev/null 2>&1 && echo X || echo Y
done < $INPUTFILE >> $OUTPUTFILE

find string in file using bash

I need to find strings matching some regexp pattern and represent the search result as an array, for iterating through it with a loop. Do I need to use sed? In general I want to replace some strings, but analyse them before replacing.
Using sed and diff:
sed -i.bak 's/this/that/' input
diff input input.bak
GNU sed will create a backup file before substitutions, and diff will show you those changes. However, if you are not using GNU sed:
mv input input.bak
sed 's/this/that/' input.bak > input
diff input input.bak
Another method, using a read loop with bash pattern matching:
pattern="/X"
subst=that
while IFS='' read -r line; do
    if [[ $line = *"$pattern"* ]]; then
        echo "changing line: $line" 1>&2
        echo "${line//$pattern/$subst}"
    else
        echo "$line"
    fi
done < input > output
The best way to do this would be to use grep to get the lines, and populate an array with the result using newline as the internal field separator:
#!/bin/bash

# get just the desired lines
results=$(grep "mypattern" mysourcefile.txt)

# change the internal field separator to be a newline
IFS=$'\n'

# populate an array from the result lines
lines=($results)

# return the third result
echo "${lines[2]}"
You could build a loop to iterate through the results of the array, but a more traditional and simple solution would just be to use bash's iteration:
for line in "${lines[@]}"; do
    echo "$line"
done
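If your bash is version 4 or newer, a sketch that avoids the IFS juggling altogether is the mapfile builtin, which reads one line into each array element:
# read grep's matches straight into an array, one line per element
mapfile -t lines < <(grep "mypattern" mysourcefile.txt)

for line in "${lines[@]}"; do
    echo "$line"
done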
FYI: Here is a similar concept I created for fun. I thought it would be good to show how to loop over a file like this. This is a script where I look at a Linux sudoers file and check that each line contains one of the valid words in my valid_words array list. Of course it ignores comment "#" lines and blank "" lines with sed. In this example we would probably want to print only the invalid lines, but this script prints both.
#!/bin/bash
# -- Inspect a sudoers file, look for valid and invalid lines.
file="${1}"
declare -a valid_words=( _Alias = Defaults includedir )
actual_lines=$(cat "${file}" | wc -l)
functional_lines=$(cat "${file}" | sed '/^\s*#/d;/^\s*$/d' | wc -l)

while read line ;do
    # -- set the line to nothing "" if it has a comment or is an empty line.
    line="$(echo "${line}" | sed '/^\s*#/d;/^\s*$/d')"
    # -- if not set to nothing "", check if the line is valid from our list of valid words.
    if ! [[ -z "$line" ]] ;then
        unset found
        for each in "${valid_words[@]}" ;do
            found="$(echo "$line" | egrep -i "$each")"
            [[ -z "$found" ]] || break;
        done
        [[ -z "$found" ]] && { echo "Invalid=$line"; sleep 3; } || echo "Valid=$found"
    fi
done < "${file}"

echo "actual lines: $actual_lines functional lines: $functional_lines"

Resources