Get Nth character with Sed - string

I'm in the middle of playing Final Fantasy 7, and I'm at the part in the library at Shinra HQ where you have to write down the Nth letter of each book that doesn't seem to belong in the current room (there are 4 such books), ignoring spaces, where N is the number in front of the book's title.
I need a sed script or other command-line to print the title of the book and get the Nth letter in its name.

You don't need sed for that. You can use bash string substitution:
$ book="The Ancients in History"
$ book="${book// /}" # Do a global substitution to remove spaces
$ echo "${book:13:1}" # Start at position 13 (0-indexed) and print 1 character
H
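The two steps can be wrapped in a small reusable function; this is just a sketch, and the name nth_letter is made up for illustration:

```shell
# Hypothetical helper: print the Nth letter (1-based) of a title, ignoring spaces.
nth_letter() {
    local s="${1// /}"              # strip all spaces
    echo "${s:$(( $2 - 1 )):1}"     # convert the 1-based N to a 0-based offset
}

nth_letter "The Ancients in History" 14   # prints H
```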

I figured out how to do it:
echo "The Ancients in History" | sed -r 's/\s//g ; s/^(.{13})(.).*$/\2/'
=> H
NOTE: sed starts counting at 0 instead of 1, so if you want the 14th letter, ask for the 13th one.
Here it is in a shell script:
#!/bin/bash
if [[ -n "$1" ]]; then # string
    if [[ -n "$2" ]]; then # Nth
        echo "Getting character $(( $2 - 1 ))"
        Nth=$(( $2 - 1 ))
        echo "$1" | sed -r "s/\s//g ; s/^(.{$Nth})(.).*$/\2/"
    fi
fi

Comparing strings for alphabetical order in Bash, test vs. double bracket syntax

I am working on a Bash scripting project in which I need to delete one of two files if they have identical content. I should delete the one which comes last in an alphabetical sort; in the example output my professor provided, apple.dat is deleted when the choices are apple.dat and Apple.dat.
if [[ "apple" > "Apple" ]]; then
    echo apple
else
    echo Apple
fi
prints Apple
echo $(echo -e "Apple\napple" | sort | tail -n1)
prints Apple
The ASCII value of a is 97 and of A is 65, so why does the test say A is greater?
The weird thing is that I get opposite results with the older syntax:
if [ "apple" \> "Apple" ]; then
    echo apple
else
    echo Apple
fi
prints apple
and if we try to use the \> in the [[ ]] syntax, it is a syntax error.
How can we correct this for the double-bracket syntax? I have tested this on the school Debian server, my local machine, and my DigitalOcean droplet. On my local Ubuntu 20.04 and on the school server I get the output described above. Interestingly, on my DigitalOcean droplet, which is also an Ubuntu 20.04 server, I get "apple" with both the double- and single-bracket syntax. We are allowed to use either syntax, but I prefer the newer double-bracket form and would rather learn how to make it work than convert my mostly finished script to the older, more POSIX-compliant syntax.
Hints:
$ (LC_COLLATE=C; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [ "apple" \> "Apple" ]; then echo apple; else echo Apple; fi)
apple
but:
$ (LC_COLLATE=C; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
apple
$ (LC_COLLATE=en_US; if [[ "apple" > "Apple" ]]; then echo apple; else echo Apple; fi)
Apple
The difference is that the Bash-specific test [[ ]] uses the locale's collation rules to compare strings, whereas the POSIX test [ ] uses the ASCII value.
From bash man page:
When used with [[, the < and > operators sort lexicographically using the current locale.
When used with test or [, the < and > operators sort lexicographically using ASCII ordering.
I have come up with my own solution to the problem. First, however, I must thank @GordonDavisson and @LéaGris for their help; what I have learned from them is invaluable.
No matter whether the computer or a human locale is used: if, in an alphabetical sort, apple comes after Apple, then it also comes after Banana; and if Banana comes after apple, then Apple comes after apple. So I have come up with the following:
# A function which sorts two words alphabetically with lower case coming after upper case.
# The last word in the sort will be printed twice to demonstrate that this works for both
# the POSIX compliant single bracket test call and the newer double bracket condition
# syntax.
# arg 1: One of two words to sort
# arg 2: One of two words to sort
# Return: 0 upon completion, 1 if incorrect number of args is given
sort_alphabetically() {
    [ $# -ne 2 ] && return 1
    word_1_val=0
    word_2_val=0
    while read -n1 letter; do
        (( word_1_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$1")
    while read -n1 letter; do
        (( word_2_val += $(printf '%d' "'$letter") ))
    done < <(echo -n "$2")
    if [ $word_1_val -gt $word_2_val ]; then
        echo $1
    else
        echo $2
    fi
    if [[ $word_1_val -gt $word_2_val ]]; then
        echo $1
    else
        echo $2
    fi
    return 0
}
sort_alphabetically "apple" "Apple"
sort_alphabetically "Banana" "apple"
sort_alphabetically "aPPle" "applE"
prints:
apple
apple
Banana
Banana
applE
applE
This works by using process substitution to redirect the string into the while loop, reading it one character at a time, and then using printf to get the decimal ASCII value of each character. It is like creating a temporary file from the string, which is automatically destroyed, and reading it one character at a time. The -n flag stops echo from appending a trailing newline, so no newline character is counted in the sum.
From bash man pages:
Process Substitution
Process substitution allows a process's input or output to be referred to using a filename. It takes the form of <(list) or >(list). The process list is run asynchronously, and its input or output appears as a filename. This filename is passed as an argument to the current command as the result of the expansion. If the >(list) form is used, writing to the file will provide input for list. If the <(list) form is used, the file passed as an argument should be read to obtain the output of list. Process substitution is supported on systems that support named pipes (FIFOs) or the /dev/fd method of naming open files.
When available, process substitution is performed simultaneously with parameter and variable expansion, command substitution, and arithmetic expansion.
From a Stack Overflow post about printf:
If the leading character is a single-quote or double-quote, the value shall be the numeric value in the underlying codeset of the character following the single-quote or double-quote.
Note: process substitution is not POSIX compliant, but it is supported by Bash in the way stated in the bash man page.
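Putting the two quoted behaviours together, here is a minimal, self-contained sketch of the technique: process substitution feeds read -n1 one character at a time, and printf with a leading quote converts each character to its code.

```shell
# Sum the character codes of "AB" one character at a time.
total=0
while IFS= read -r -n1 ch; do
    (( total += $(printf '%d' "'$ch") ))   # "'A" -> 65, per the printf rule quoted above
done < <(printf '%s' "AB")                 # the string is presented to read as a file
echo "$total"                              # 65 + 66 = 131
```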
UPDATE: The above does not work in all cases!
The above solution works in many cases however we get some anomalies.
first word    second word    last alphabetically
apple         Apple          apple    (correct)
Apple         apple          apple    (correct)
apPLE         Apple          Apple    (incorrect)
apple         Banana         Banana   (correct)
apple         BANANA         apple    (incorrect)
The following solution gets the results that are needed:
#!/bin/bash
sort_alphabetically() {
    [ $# -ne 2 ] && return 1
    local WORD_1="$1"
    local WORD_2="$2"
    local WORD_1_LOWERED="$(echo -n $1 | tr '[:upper:]' '[:lower:]')"
    local WORD_2_LOWERED="$(echo -n $2 | tr '[:upper:]' '[:lower:]')"
    if [ $(echo -e "$WORD_1\n$WORD_2" | sort | tail -n1) = "$WORD_1" ] ||
       [ $(echo -e "$WORD_1_LOWERED\n$WORD_2_LOWERED" | sort | tail -n1) = "$WORD_1_LOWERED" ]; then
        if [ "$WORD_1_LOWERED" = "$WORD_2_LOWERED" ]; then
            ASCII_VAL_WORD_1=0
            ASCII_VAL_WORD_2=0
            read -n1 FIRST_CHAR_1 < <(echo -n "$WORD_1")
            read -n1 FIRST_CHAR_2 < <(echo -n "$WORD_2")
            while read -n1 character; do
                (( ASCII_VAL_WORD_1 += $(printf '%d' "'$character") ))
            done < <(echo -n $WORD_1)
            while read -n1 character; do
                (( ASCII_VAL_WORD_2 += $(printf '%d' "'$character") ))
            done < <(echo -n $WORD_2)
            if [ $ASCII_VAL_WORD_1 -gt $ASCII_VAL_WORD_2 ] &&
               [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then
                echo "$WORD_1"
            elif [ $ASCII_VAL_WORD_2 -gt $ASCII_VAL_WORD_1 ] &&
                 [ "$FIRST_CHAR_2" \> "$FIRST_CHAR_1" ]; then
                echo "$WORD_2"
            elif [ "$FIRST_CHAR_1" \> "$FIRST_CHAR_2" ]; then
                echo "$WORD_1"
            else
                echo "$WORD_2"
            fi
        else
            echo "$WORD_1"
        fi
    else
        echo "$WORD_2"
    fi
    return 0
}
sort_alphabetically "apple" "Apple"
sort_alphabetically "Apple" "apple"
sort_alphabetically "apPLE" "Apple"
sort_alphabetically "Apple" "apPLE"
sort_alphabetically "apple" "Banana"
sort_alphabetically "apple" "BANANA"
exit 0
prints:
apple
apple
apPLE
apPLE
Banana
BANANA
Change your syntax. if [[ "Apple" -gt "apple" ]] works as expected.

How to test for certain characters in a file

I am currently running a script with an if statement. Before I run the script, I want to make sure the file provided as the first argument has certain characters.
If the file does not have those certain characters in certain spots then the output would be else "File is Invalid" on the command line.
For the if statement to be true, the file needs to have at least one hyphen in Field 1 line 1 and at least one comma in Field one Line one.
How would I create an if statement with perhaps a test command to validate those certain characters are present?
Thanks
I'm new to Linux/Unix; this is my homework, so I haven't really tried anything yet, only brainstormed possible solutions.
function usage
{
    echo "usage: $0 filename ..."
    echo "ERROR: $1"
}

if [ $# -eq 0 ]
then
    usage "Please enter a filename"
else
    name="Yaroslav Yasinskiy"
    echo $name
    date
    while [ $# -gt 0 ]
    do
        if [ -f $1 ]
        then
            if <--------- here is where the answer would be
                starting_data=$1
                echo
                echo $1
                cut -f3 -d, $1 > first
                cut -f2 -d, $1 > last
                cut -f1 -d, $1 > id
                sed 's/$/:/' last > last1
                sed '/last:/ d' last1 > last2
                sed 's/^ *//' last2 > last3
                sed '/first/ d' first > first1
                sed 's/^ *//' first1 > first2
                sed '/id/ d' id > id1
                sed 's/-//g' id1 > id2
                paste -d' ' first2 last3 id2 > final
                cat final
                echo ''
        else
            echo
            usage "Could not find file $1"
        fi
        shift
    done
fi
In answer to your direct question:
For the if statement to be true, the file needs to have at least one
hyphen in Field 1 line 1 and at least one comma in Field one Line one.
How would I create an if statement with perhaps a test command to
validate those certain characters are present?
Bash provides all the tools you need. While you can call awk, you really just need to read the first line of the file into two variables (say a and b) and then use [[ $a =~ regex ]], where the regex is an extended regular expression that verifies that the first field (contained in $a) contains both a '-' and a ','.
For details on the [[ =~ ]] expression, see bash(1) - Linux manual page under the section labeled [[ expression ]].
Let's start with read. When you provide two variables, read will read the first field (based on normal word splitting given by IFS, the Internal Field Separator, default $' \t\n': space, tab, newline) into the first variable and the rest of the line into the second. So by doing read -r a b you read the first field into a and the rest of the line into b (you don't care about b for your test).
Your regex can be ([-]+.*[,]+|[,]+.*[-]+) which is an (x|y), e.g. x OR y expression where x is [-]+.*[,]+ (one or more '-' and one or more ','), your y is [,]+.*[-]+ (one or more ',' and one or more '-'). So by using the '|' your regex will accept either a comma then zero-or-more characters and a hyphen or a hyphen and zero-or-more characters and then a comma in the first field.
How do you read the line? With simple redirection, e.g.
read -r a b < "$1"
So your conditional test in your script would look something like:
if [ -f $1 ]
then
    read -r a b < "$1"
    if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]   # <-- here is where the ...
    then
        starting_data=$1
        ...
    else
        echo "File is Invalid" >&2   # redirection to 2 (stderr)
    fi
else
    echo
    usage "Could not find file $1"
fi
shift
...
Example Test Files
$ cat valid
dog-food, cat-food, rabbit-food
50lb 16lb 5lb
$ cat invalid
dogfood, catfood, rabbitfood
50lb 16lb 5lb
Example Use/Output
$ read -r a b < valid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file valid
and for the file without the certain characters:
$ read -r a b < invalid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file invalid
Now you really have to concentrate on eliminating the spawning of at least a dozen subshells: you call cut three times, sed seven times, paste once, and then cat. While it is good that you are thinking through what you need to do and getting it working, as mentioned in my comment, any time you are looping you want to reduce the number of subshells spawned to the greatest extent possible. I suspect, as @Mig answered, awk is the proper tool here and can likely eliminate all 12 subshells, replacing them with a single call to awk.
I personally would use awk for this whole task, since you want to test fields and build a string from concatenated fields. Awk is perfect for that.
But here is a small script which shows how you could just test your file's first line:
if [[ $(head -n 1 file.csv | awk '$1~/-/ && $1~/,/ {print "MATCH"}') == 'MATCH' ]]; then
    echo "yes"
else
    echo "no"
fi
It looks like overkill when not doing the whole thing in awk, but it works. I am sure there is a way to test with only one regex, but that would involve knowing which flavour of awk you have, because I don't think they all use the same regex engine. Therefore I left it out for the sake of simplicity.
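If the head and command substitution feel heavy, one alternative (a sketch; file.csv is a placeholder name) is to let awk's exit status drive the if directly:

```shell
# Placeholder sample input: field 1 of line 1 contains both '-' and ','.
printf 'dog-food, cat-food, rabbit-food\n50lb 16lb 5lb\n' > file.csv

# awk exits 0 if field 1 of line 1 contains both characters, 1 otherwise.
if awk 'NR == 1 { exit !($1 ~ /-/ && $1 ~ /,/) }' file.csv; then
    echo "yes"
else
    echo "no"
fi
```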

bash returning erroneous results after about 36 million lines when iterating through a pair of files - is this a memory error?

I have written a simple script in bash to iterate through a pair of text files to make sure they are properly formatted.
The required format is as follows:
Each file contains millions of ‘records’.
Each record takes up two lines in each file – a header line and a sequence line.
Each header line consists of a ">" symbol, followed by a sample name (an alphanumeric string), followed by a period, followed by a unique record identifier number (an integer), followed by a suffix of either '/1' or '/2'.
Each sequence line contains a string of 30-100 A,C,G and T characters (the four DNA nucleotides, if anyone is wondering).
The files are paired, in that the first record in one file corresponds to the first record in the second file, and so forth. The header lines in the two files should be identical, except that in one file they will all have a '/1' suffix and in the other file they will all have a '/2' suffix. The sequence lines can be very different between the two files.
The code I developed is designed to check that (a) the header line in each record follows the correct format, (b) the header lines in corresponding records in the two files match (i.e. are identical except for the /1 and /2 suffixes), and (c) the sequence lines contain only A, C, G and T characters.
Example of properly formatted records:
> cat -n file1 | head -4
1 >SRR573705.1/1
2 ATAATCATTTGCCTCTTAAGTGGGGGCTGGTATGAATGGCAAGACGGGAATCTAGCTGTCTCTCCCTTATATCTTGAAGTTAATATTTCTGTGAAGAAGC
3 >SRR573705.2/1
4 CCACTTGTCCCAGTCTGTGCTGCCTGTACAATGGATTAGCTGAGGAAAACTGGCATCCCATGGCCTCAAACAGACGCAGCAAGTCCATGAAGCCATAATT
> cat -n file2 | head -4
1 >SRR573705.1/2
2 TTTCTAACAATTGAATTAGCAACACAAACACTATTGACAAAGCTATATCTTATTTCTACTAAAGCTCGATAGGGTCTTCTCGTCCTGCGATCCCATTCCT
3 >SRR573705.2/2
4 GTATGATGGGTGTGTCAAGGAGCTCAACCATCGTGATAGGCTACCTCATGCATCGAGACAAGATCACATTTAATGAGGCATTTGACATGGTCAGGAAGCA
My code is below. It works perfectly well for small test files containing only a couple of hundred records. When reading a real data file with millions of records, however, it returns nonsensical errors, for example:
Inaccurate header line in read 18214236 of file2
Line 36428471: TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The error above is simply wrong. Line 36,428,471 of file2 is ‘>SRR573705.19887618/2’
The string reported in the error is not even present in file 2. It does, however, appear multiple times in file1, i.e.:
cat -n /file1 | grep 'TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC'
4632838 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
24639990 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
143478526 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
The data in the two files seems to match perfectly in the region where the error was returned:
cat -n file1 | head -36428474 | tail
36428465 >SRR573705.19887614/1
36428466 CACCCCAGCATGTTGACCACCCATGCCATTATTTCATGGTATTTTCTTACATTTTGTATATAACAGATGCATTACGTATTATAGCATTGCTTTTCGTAAA
36428467 >SRR573705.19887616/1
36428468 AGATCCTCCTCCTCATCGGTCAGTCGCCAATCCAACAACTCAACCTTCTTCTTCAAGTCACTCAGCCGTCGGCCCGGGACTGCCGTTTCATGATGCCTAT
36428469 >SRR573705.19887617/1
36428470 CAATAGCGTATATTAAAATTGCTGCAGTTAAAAAGCTCGTAGTTGGATCTTGGGCGCAGGCTGGCGGTCCGCCGCAAGGCGCGCCACTGCCAGCCTGGCC
36428471 >SRR573705.19887618/1
36428472 TGATTTCCTCCATAAGTGCCTTCTCGCACTCAACATCTTGATCACTACGTTCCTCAGCATTCGCCTCTTCTTCTTCTTCCTGTTCCTTTTTTTCATCCTC
36428473 >SRR573705.19887619/1
36428474 CCAGCCTGCGCCCAAGATCCAACTACGAGCTTTTTAACTGCAGCAATTTTAATATACGCTATTGGAGCTGGAATTACCGCGGCTGCTGGCACCAGACTTG
>cat -n file2 | head -36428474 | tail
36428465 >SRR573705.19887614/2
36428466 GTAATTTACAGGAATTGTTTACATTCTGAGCAAATAAAACAAATAATTTTAATACACAAACTTGTTGAAAGTTAATTAGGTTTTACGAAAA
36428467 >SRR573705.19887616/2
36428468 GCCGTCGCAGCAACATTTGAGATATCCCGTAAGACGTCTTGAACGGCTGGCTCTGTCTGCTCTCGGAGAACCTGCCGGCTGAACCGGACAGCGCAGACG
36428469 >SRR573705.19887617/2
36428470 CTCGAGTTCCGAAAACCAACGCAATAGAACCGAGGTCCTATTCCATTATTCCATGCTCTGCTGTCCAGGCGGTCGGCCTG
36428471 >SRR573705.19887618/2
36428472 GGACATGGAAACAGAAAATAATGAAAAGACCAAAGAAGATGCACTTGAGGTTGATAAGCCTAAAGG
36428473 >SRR573705.19887619/2
36428474 CCCGACACGGGGAGGTAGTGACGAAAAATAGCAATACAGGACTCTTTCGAGGCCCTGTAATTGGAATGAGTACACTTTAAATCCTTTAACGAGGATCTAT
Is there some sort of memory limit in bash that could cause such an error? I have run various versions of this code over multiple files and consistently get this problem after 36,000,000 lines.
My code:
set -u
function fastaConsistencyChecker {
    F_READS=$1
    R_READS=$2
    echo -e $F_READS
    echo -e $R_READS
    if [[ ! -s $F_READS ]]; then echo -e "File $F_READS could not be found."; exit 0; fi
    if [[ ! -s $R_READS ]]; then echo -e "File $R_READS could not be found."; exit 0; fi
    exec 3<$F_READS
    exec 4<$R_READS
    line_iterator=1
    read_iterator=1
    while read FORWARD_LINE <&3 && read REVERSE_LINE <&4; do
        if [[ $(( $line_iterator % 2 )) == 1 ]]; then
            ## This is a header line ##
            if [[ ! ( $FORWARD_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/1$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${F_READS}"
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                exit 0
            fi
            if [[ ! ( $REVERSE_LINE =~ ^">"[[:alnum:]]+\.[0-9]+/2$ ) ]]; then
                echo -e "Inaccurate header line in read ${read_iterator} of file ${R_READS}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            F_Name=${FORWARD_LINE:1:${#FORWARD_LINE}-3}
            R_Name=${REVERSE_LINE:1:${#REVERSE_LINE}-3}
            if [[ $F_Name != $R_Name ]]; then
                echo -e "Record names do not match. "
                echo -e "Line ${line_iterator}: ${FORWARD_LINE}"
                echo -e "Line ${line_iterator}: ${REVERSE_LINE}"
                exit 0
            fi
            line_iterator=$(( $line_iterator + 1 ))
        else
            if [[ ! ( $FORWARD_LINE =~ ^[ATCGNatcgn]+$ ) ]]; then
                echo -e "Ambiguous sequence detected for read ${read_iterator} at line ${line_iterator} in file ${F_READS}"
                exit 0
            fi
            read_iterator=$(( $read_iterator + 1 ))
            line_iterator=$(( $line_iterator + 1 ))
        fi
        unset FORWARD_LINE
        unset REVERSE_LINE
    done
    echo -e "$line_iterator lines and $read_iterator reads"
    echo -e "No errors detected."
    echo -e ""
}
export -f fastaConsistencyChecker
FILE3="filepath/file1"
FILE4="filepath/file2"
fastaConsistencyChecker $FILE3 $FILE4
I think you've proven there's an issue related to memory usage in bash. I think you can accomplish your format verification without running afoul of the memory issue by using text-processing tools rather than bash itself.
#!/bin/bash
if ! [[ $1 && $2 && -s $1 && -s $2 ]]; then
echo "usage: $0 <file1> <file2>"
exit 1
fi
set -e
dir=`mktemp -d`
clean () { rm -fr $dir; }
trap clean EXIT
pairs () { sed 'N;s/\n/\t/' "$@"; }
pairs $1 > $dir/$1
pairs $2 > $dir/$2
paste $dir/$1 $dir/$2 | grep -vP '^>(\w+\.\d+)/1\t[ACGT]+\t>\1/2\t[ACGT]+$' && exit 1
exit 0
The sed script takes a line and concatenates it with the next, separated by a tab. This:
>SRR573705.1/1
ATAATCATTTGCCTCTT...
becomes this:
>SRR573705.1/1 ATAATCATTTGCCTCTT...
The paste takes the first line of file 1 and the first line of file 2 and outputs them as one line separated by a tab. It does the same for the second line, and so forth. grep sees input like this:
>SRR573705.1/1    ATAATCATTTGCCTCT...    >SRR573705.1/2    TTTCTAACAATTGAAT...
The regular expression captures the first identifier and matches the same identifier later in the line with the backreference \1.
The script outputs any lines failing to match the regex due to the -v switch to grep. If lines are output, the script exits with status 1.
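Note that -P (PCRE) is a GNU grep extension. Where it is unavailable, the same per-record test can be sketched in awk on the paste output; the field layout below mirrors the pipeline above, and the sample line is illustrative:

```shell
# One paste-joined record per line:
#   $1 = header /1, $2 = sequence 1, $3 = header /2, $4 = sequence 2
printf '>SRR573705.1/1\tACGT\t>SRR573705.1/2\tTTAA\n' |
awk -F'\t' '{
    hdr_ok  = ($1 ~ /^>[[:alnum:]]+\.[0-9]+\/1$/)
    pair_ok = ($3 == substr($1, 1, length($1) - 1) "2")   # same id, /1 -> /2
    seq_ok  = ($2 ~ /^[ACGT]+$/ && $4 ~ /^[ACGT]+$/)
    if (!(hdr_ok && pair_ok && seq_ok)) { print "bad record at line " NR; exit 1 }
}'
echo "checker exit status: $?"
```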

bash palindrome grep loop if then else missing '

My sysadmin prof just started teaching us bash, and he wanted us to write a bash script using grep to find all 3-45 letter palindromes in the Linux dictionary without using reverse. I'm getting an error on my if statement saying I'm missing a '.
UPDATED CODE:
front='\([a-z]\)'
front_s='\([a-z]\)'
numcheck=1
back='\1'
middle='[a-z]'
count=3
while [ $count -ne "45" ]; do
    if [[ $(($count % 2)) == 0 ]]
    then
        front=$front$front_s
        back="\\$numcheck$back"
        grep "^$front$back$" /usr/share/dict/words
        count=$((count+1))
    else
        grep "^$front$middle$back$" /usr/share/dict/words
        numcheck=$((numcheck+1))
        count=$((count+1))
    fi
done
You have four obvious problems here:
First about a misplaced and unescaped backslash:
back="\\$numcheck$back" # and not back="$numcheck\$back"
Second is that you only want to increment numcheck if count is odd.
Third: in the line
front=$front$front
you're doubling the number of patterns in front! Hey, that yields exponential growth, hence the "Argument list too long" explosion. To fix this, add a variable, say front_step:
front_step='\([a-z]\)'
front=$front_step
and when you increment front:
front=$front$front_step
With these fixed, you should be good!
The fourth flaw is that grep's back-references may only have one digit: from man grep:
Back References and Subexpressions
The back-reference \n, where n is a single digit, matches the substring
previously matched by the nth parenthesized subexpression of the
regular expression.
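Within that single-digit limit, the back-reference technique itself works fine; for instance, a hand-written BRE for 5-letter palindromes:

```shell
# \1 and \2 mirror the first two captured letters in reverse order.
printf 'level\napple\nradar\n' | grep '^\([a-z]\)\([a-z]\)[a-z]\2\1$'
# prints:
# level
# radar
```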
In your approach, we'll need up to 22 back-references. That's too many for grep. I doubt there are any palindromes that long, though.
Also, you're grepping the file 43 times… that's a bit too much.
Try this:
#!/bin/bash
for w in `grep -E "^[[:alnum:]]{3,45}$" /usr/share/dict/words`; do
    if [[ "$w" == "`echo $w | sed "s/\(.\)/\1\n/g" | tac | tr -d '\012'`" ]]; then
        echo "$w == is a palindrome"
    fi
done
OR
#!/bin/bash
front='\([a-z]\)'
numcheck=1
back='\1'
middle='[a-z]'
count=3
while [ $count -ne "45" ]; do
    if [[ $(($count % 2)) == 0 ]]
    then
        front=$front$front
        back="\\$numcheck$back"
        grep "^$front$back$" /usr/share/dict/words
    else
        grep "^$front$middle$back$" /usr/share/dict/words
        ## Thanks to gniourf for catching this.
        numcheck=$((numcheck+1))
    fi
    count=$((count+1))
    ## Uncomment the following if you want to see one by one and run script using bash -x filename.sh
    #echo Press any key to continue: ; read toratora;
done

Linux script to search for string in a file

I am newbie to shell scripting. I have a requirement to read a file by line and match for specific string. If it matches, print x and if it doesn't match, print y.
Here is what I am trying, but I am getting unexpected results: about 700 lines of output, where my /tmp/l1.txt has only 10 lines. Somewhere I am looping more than I should. I appreciate your help.
for line in `cat /tmp/l3.txt`
do
    if echo $line | grep "abc.log" ; then
        echo "X" >> /tmp/l4.txt
    else
        echo "Y" >> /tmp/l4.txt
    fi
done
I don't understand the urge to do looping. (Note that `for line in \`cat file\`` iterates over every whitespace-separated word, not every line, which is why your 10-line file produces around 700 results.)
awk '{if($0 ~ /abc\.log/){print "x"}else{print "y"}}' /tmp/13.txt > /tmp/14.txt
EDIT after inquiry ...
Of course, your spec wasn't overly precise, and I'm jumping to conclusions regarding your line format ... we basically take the whole line that matched abc.log and replace everything up to the directory abc and from /logs to the end of the line with nothing, which leaves us with clusterX/xyz.
awk '{if($0 ~ /abc\.log/){print gensub(/.+\/abc\/(.+)\/logs/, "\\1", 1)}else{print "y"}}' /tmp/13.txt > /tmp/14.txt
cat /tmp/l3.txt | while read line   # read the entire line into the variable "line"
do
    if [ -n "`echo "$line" | grep "abc.log"`" ]   # if grep produced output ("-n": non-empty)
    then
        echo "X" >> /tmp/l4.txt   # echo "X" into l4.txt
    else
        echo "Y" >> /tmp/l4.txt   # if empty, echo "Y" into l4.txt
    fi
done
The while read statement will read the entire line if only one variable is given (in this case "line"); if you have a fixed number of fields, you can specify a variable for each field, e.g. "| while read field1 field2" etc. The -n tests whether the string has a value; -z tests whether it is empty.
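A quick, self-contained illustration of the two tests (the strings are arbitrary):

```shell
s="hello"
[ -n "$s" ] && echo "non-empty"   # -n: true when the string has length
s=""
[ -z "$s" ] && echo "empty"       # -z: true when the string is empty
```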
Why worry about cat and the rest before grep? You can simply test the return of grep and append all matching lines to /tmp/14.txt, or append "Y":
[ -f tmpfile.tmp ] && :> tmpfile.tmp                 # test for existing tmpfile & truncate
if grep "abc.log" /tmp/13.txt >> tmpfile.tmp ; then  # write all matching lines to tmpfile
    cat tmpfile.tmp >> /tmp/14.txt                   # if grep matched, append them to /tmp/14.txt
else
    echo "Y" >> /tmp/14.txt                          # write "Y" to /tmp/14.txt
fi
rm tmpfile.tmp                                       # cleanup
Note: if you don't want the result of the grep appended to /tmp/14.txt, then just replace cat tmpfile.tmp >> /tmp/14.txt with echo "X" >> /tmp/14.txt, and you can remove the first and last lines.
I think the "awk" answer above is better. However, if you really need to iterate using a bash loop, you can use:
PATTERN="abc.log"
OUTPUTFILE=/tmp/14.txt
INPUTFILE=/tmp/13.txt
while read line
do
    grep -q "$PATTERN" <<< "$line" && echo X || echo Y
done < $INPUTFILE >> $OUTPUTFILE
(-q already silences grep's output, so an extra redirection to /dev/null is unnecessary.)
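If a bash loop is really required, a [[ ]] pattern match avoids spawning grep for every line entirely; this sketch keeps the question's file names but writes sample input first so it is self-contained:

```shell
# Placeholder input: one matching line, one non-matching line.
printf 'foo abc.log bar\nno match here\n' > /tmp/l3.txt

# Emit X for lines containing abc.log, Y otherwise: no subshell per line.
while IFS= read -r line; do
    if [[ $line == *"abc.log"* ]]; then
        echo "X"
    else
        echo "Y"
    fi
done < /tmp/l3.txt > /tmp/l4.txt

cat /tmp/l4.txt   # X, then Y
```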
