extract substrings starting with the same pattern in a file - linux

I have a fileA.txt which contains strings:
>AMP_16 RS0247 CENPF__ENST00000366955.7__3251__30__0.43333__66.8488__1 RS0255
>AMP_16 RS0332 CENPF__ENST00000366955.7__2262__30__0.43333__67.9513__1 RS0451
>AMP_16 RS0332 CENPF__ENST00000366955.7__1108__30__0.43333__67.4673__1 RS0247
and so on .....
I would like to extract all the substrings which start with RS.
Output:
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
I tried something like this:
while read line
do
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            [ -z "$str" ] && str="$word"
        fi
    done
done < fileA.txt
echo "$str"
However, I only get the first string, RS0247, printed out when I echo.

Given the three sample lines pasted above in the file f...
Assuming a fixed format:
awk '{print $2"_"$4}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
Assuming a flexible format (and that you're only interested in the first two occurrences of fields that start with RS):
awk '{f=1;for (i=1;i<=NF;i++){if($i~/^RS/){a[f]=$i;f++}}print a[1]"_"a[2]}' f
RS0247_RS0255
RS0332_RS0451
RS0332_RS0247
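If a line may carry more than two fields starting with RS (an assumption, not stated in the question), a sketch that joins however many it finds:
awk '{s="";for (i=1;i<=NF;i++) if ($i~/^RS/) s=(s=="" ? $i : s"_"$i); print s}' f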
Edit 1:
And assuming that you want your own script patched rather than an efficient solution:
#!/bin/bash
while read -r line
do
    str=""
    for word in $line
    do
        if [[ "$word" =~ ^RS ]]
        then
            if [[ -z $str ]]
            then
                str=$word
            else
                str+="_${word}"
            fi
        fi
    done
    echo "$str"
done < fileA.txt
Edit 2:
In terms of efficiency: I copied and pasted those 3 lines into fileA.txt 60 times (180 lines total). The runtimes for the three attempts above, in the same order, are:
real 0m0.002s
real 0m0.002s
real 0m0.011s
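As an aside (not one of the three timed attempts), the whole extraction can also be done without a shell loop; a minimal sketch using grep -o and paste, assuming exactly two RS fields per line as in the sample:
grep -oE 'RS[0-9]+' fileA.txt | paste -d_ - -
grep -o prints each match on its own line, and paste -d_ - - joins them back up in pairs.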

Related

How to test for certain characters in a file

I am currently running a script with an if statement. Before I run the script, I want to make sure the file provided as the first argument has certain characters.
If the file does not have those certain characters in certain spots, then the output should be "File is Invalid" on the command line.
For the if statement to be true, the file needs to have at least one hyphen in Field 1 Line 1 and at least one comma in Field 1 Line 1.
How would I create an if statement, perhaps with a test command, to validate that those certain characters are present?
Thanks.
I'm new to Linux/Unix; this is my homework, so I haven't really tried anything, only brainstorming possible solutions.
function usage
{
    echo "usage: $0 filename ..."
    echo "ERROR: $1"
}

if [ $# -eq 0 ]
then
    usage "Please enter a filename"
else
    name="Yaroslav Yasinskiy"
    echo $name
    date
    while [ $# -gt 0 ]
    do
        if [ -f $1 ]
        then
            if <--------- here is where the answer would be
            starting_data=$1
            echo
            echo $1
            cut -f3 -d, $1 > first
            cut -f2 -d, $1 > last
            cut -f1 -d, $1 > id
            sed 's/$/:/' last > last1
            sed '/last:/ d' last1 > last2
            sed 's/^ *//' last2 > last3
            sed '/first/ d' first > first1
            sed 's/^ *//' first1 > first2
            sed '/id/ d' id > id1
            sed 's/-//g' id1 > id2
            paste -d\  first2 last3 id2 > final
            cat final
            echo ''
        else
            echo
            usage "Could not find file $1"
        fi
        shift
    done
fi
In answer to your direct question:
For the if statement to be true, the file needs to have at least one
hyphen in Field 1 line 1 and at least one comma in Field one Line one.
How would I create an if statement with perhaps a test command to
validate those certain characters are present?
Bash provides all the tools you need. While you can call awk, you really just need to read the first line of the file into two variables (say a and b) and then use [[ $a =~ regex ]], where regex is an extended regular expression that verifies that the first field (contained in $a) contains both a '-' and a ','.
For details on the [[ =~ ]] expression, see bash(1) - Linux manual page under the section labeled [[ expression ]].
Let's start with read. When you provide two variables, read will read the first field (based on normal word-splitting given by IFS, the Internal Field Separator, default $' \t\n' - space, tab, newline) into the first variable and the rest of the line into the second. So by doing read -r a b you read the first field into a and the rest of the line into b (you don't care about b for your test).
Your regex can be ([-]+.*[,]+|[,]+.*[-]+), which is an (x|y), e.g. x OR y, expression where x is [-]+.*[,]+ (one or more '-', then anything, then one or more ',') and y is [,]+.*[-]+ (one or more ',', then anything, then one or more '-'). So by using the '|' your regex will accept either a hyphen, zero-or-more characters, and a comma, or a comma, zero-or-more characters, and a hyphen in the first field.
How do you read the line? With simple redirection, e.g.
read -r a b < "$1"
So your conditional test in your script would look something like:
if [ -f $1 ]
then
    read -r a b < "$1"
    if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]    # <-- here is where the ...
    then
        starting_data=$1
        ...
    else
        echo "File is Invalid" >&2    # redirection to 2 (stderr)
    fi
else
    echo
    usage "Could not find file $1"
fi
shift
...
Example Test Files
$ cat valid
dog-food, cat-food, rabbit-food
50lb 16lb 5lb
$ cat invalid
dogfood, catfood, rabbitfood
50lb 16lb 5lb
Example Use/Output
$ read -r a b < valid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file valid
and for the file without the certain characters:
$ read -r a b < invalid
if [[ $a =~ ([-]+.*[,]+|[,]+.*[-]+) ]]; then
    echo "file valid"
else
    echo "file invalid"
fi
file invalid
Now you really have to concentrate on eliminating the spawning of at least a dozen subshells, where you call cut 3 times, sed 7 times, paste once and then cat. While it is good you are thinking through what you need to do and getting it working, as mentioned in my comment, any time you are looping you want to reduce the number of subshells spawned to the greatest extent possible. I suspect, as @Mig answered, awk will be the proper tool that can likely eliminate all 12 subshells and replace them with a single call to awk.
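For a sense of what that single call might look like, here is a sketch (my reconstruction, assuming the first line of the CSV is a header row, which is what the sed deletions of "first", "last:" and "id" suggest) that folds the cut/sed/paste pipeline into one awk program:
awk -F, 'NR > 1 {
    first=$3; last=$2; id=$1
    sub(/^ */, "", first); sub(/^ */, "", last)   # strip leading spaces
    gsub(/-/, "", id)                             # drop hyphens from the id
    print first, last ":", id                     # same layout as the paste
}' "$1"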
I personally would use awk for this whole thing, since you want to test fields and create a string from concatenated fields. Awk is perfect for that.
But here is a small script which shows how you could just test your file's first line:
if [[ $(head -n 1 file.csv | awk '$1~/-/ && $1~/,/ {print "MATCH"}') == 'MATCH' ]]; then
    echo "yes"
else
    echo "no"
fi
It looks like overkill when not doing the whole thing in awk, but it works. I am sure there is a way to test it with only one regex, but that would involve knowing which flavour of awk you have, because I think they don't all use the same regex engine. Therefore I left this out for the sake of simplicity.
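If you'd rather skip the command substitution and string comparison, a variant sketch (same hypothetical file.csv) lets awk's exit status drive the if directly:
# exit 0 (success) only when field 1 of line 1 contains both '-' and ','
# (note: an empty file would also exit 0 here)
if awk 'NR==1 { exit !($1~/-/ && $1~/,/) }' file.csv; then
    echo "yes"
else
    echo "no"
fi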

Unix - Replace column value inside while loop

I have a comma-separated (sometimes tab-separated) text file as below:
parameters.txt:
STD,ORDER,ORDER_START.xml,/DML/SOL,Y
STD,INSTALL_BASE,INSTALL_START.xml,/DML/IB,Y
With the code below I try to loop through the file and do something:
while read line; do
    if [[ $1 = "$(echo "$line" | cut -f 1)" ]] && [[ "$(echo "$line" | cut -f 5)" = "Y" ]] ; then
        # do something...
        if [[ $? -eq 0 ]] ; then
            # code to replace the final flag
        fi
    fi
done < <text_file_path>
I wanted to update the last column of the file to N if the above operation is successful; however, the approaches below are not working for me:
sed 's/$f5/N/'
'$5=="Y",$5=N;{print}'
$(echo "$line" | awk '$5=N')
Update: a few considerations which I missed at first and which give more clarity, apologies!
The parameters file may contain lines with the last-field flag as "N" as well.
The final flag needs to be updated only if the "do something" code has executed successfully.
After looping through all lines, i.e. after exiting the while loop, the flags for all rows are to be set back to "Y".
Perhaps invert the operations and do the processing in awk:
$ awk -v f1="$1" 'BEGIN {FS=OFS=","}
  f1==$1 && $5=="Y" { # do something
  $5="N"}1' file
Not sure what the "do something" operation is; if you need to call another command/script, that's possible as well.
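For instance, a sketch using awk's system() to run a per-line command and flip the flag only when it succeeds (process.sh and passing $2 to it are hypothetical):
awk -v f1="$1" 'BEGIN {FS=OFS=","}
  f1==$1 && $5=="Y" {
    # system() returns the command'"'"'s exit status; 0 means success
    if (system("./process.sh " $2) == 0)
      $5="N"
  }1' file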
with bash:
(
    IFS=,
    while read -ra fields; do
        if [[ ${fields[0]} == "$1" ]] && [[ ${fields[4]} == "Y" ]]; then
            # do something
            fields[4]="N"
        fi
        echo "${fields[*]}"
    done < file | sponge file
)
I run that in a subshell so the effects of altering IFS are localized.
This uses sponge to write back to the same file. You need the moreutils package to use it, otherwise use
done < file > tmp && mv tmp file
Perhaps a bit simpler and less bash-specific:
while IFS= read -r line; do
    case $line in
        "$1",*,Y)
            # do something
            line="${line%Y}N"
            ;;
    esac
    echo "$line"
done < file
To replace ,N at the end of the line($) with ,Y:
sed 's/,N$/,Y/' file
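And to rewrite the file in place rather than printing to stdout, GNU sed's -i option applies (a sketch):
sed -i 's/,N$/,Y/' file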

How do I split a string on a pattern at the Linux bash prompt and return the last instance of my pattern and everything after it

This is my first question on Stack Overflow; I hope it's not too noob for this forum. Thanks for your help in advance!!!
[PROBLEM]
I have a Linux bash variable in my bash script with the below content:
[split]
this is a test 1
[split]
this is a test 2
[split]
this is a test 3
this is a test 4
this is a test 5
How can I split this file on the string "[split]" and return the last section after the split?
this is a test 3
this is a test 4
this is a test 5
The last section can vary in length but it is always at the end of the "string" / "file"
Using awk, set the record separator to a regular expression representing the split string, and print the last record at END:
gawk 'BEGIN{ RS="[[]split[]]" } END{ print $0 }' tmp/test.txt
Result assuming input coming from a file:
this is a test 3
this is a test 4
this is a test 5
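Since the question says the content lives in a bash variable, parameter expansion alone can also do it; a minimal sketch, assuming the variable is named s:
# strip everything up to and including the last [split]
last="${s##*\[split\]}"
printf '%s' "$last"    # note: a leading newline from the input may remain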
How about this? :)
FILE="test.txt"
NEW_FILE="test_result.txt"
SPLIT="[split]"
while read -r line
do
    if [[ $line == "$SPLIT" ]]    # quoted so [split] isn't treated as a glob
    then
        rm -f "${NEW_FILE}"       # start the section over
    else
        echo "${line}" >> "${NEW_FILE}"
    fi
done < "$FILE"
#!/bin/bash
s="[split]
this is a test 1
[split]
this is a test 2
[split]
this is a test 3
this is a test 4
this is a test 5"
a=()
i=0
while read -r line
do
    a[i]="${a[i]}${line}"$'\n'
    if [ "$line" == "[split]" ]
    then
        let ++i
    fi
done <<< "$s"
echo "${a[-1]}"
I simply read each line from the string into an array, and when I encounter [split], I increment the array index. At last, I echo the last element (quoted, so its embedded newlines are preserved).
EDIT:
If you just want the last part, there's no need for an array either. You could do something like:
while read -r line
do
    a+="${line}"$'\n'
    if [ "$line" == "[split]" ]
    then
        a=""
    fi
done <<< "$s"
echo "$a"

find string in file using bash

I need to find strings matching some regexp pattern and represent the search result as an array for iterating through it with a loop. Do I need to use sed? In general I want to replace some strings, but analyse them before replacing.
Using sed and diff:
sed -i.bak 's/this/that/' input
diff input input.bak
GNU sed will create a backup file before substitutions, and diff will show you those changes. However, if you are not using GNU sed:
mv input input.bak
sed 's/this/that/' input.bak > input
diff input input.bak
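Either way, diff then reports exactly which lines were substituted; for example, with a hypothetical one-line input:
$ echo 'this is a test' > input
$ sed -i.bak 's/this/that/' input
$ diff input input.bak
1c1
< that is a test
---
> this is a test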
Another method using grep:
pattern="/X"
subst=that
while IFS='' read -r line; do
if [[ $line = *"$pattern"* ]]; then
echo "changing line: $line" 1>&2
echo "${line//$pattern/$subst}"
else
echo "$line"
fi
done < input > output
The best way to do this would be to use grep to get the lines, and populate an array with the result using newline as the internal field separator:
#!/bin/bash
# get just the desired lines
results=$(grep "mypattern" mysourcefile.txt)
# change the internal field separator to be a newline
IFS=$'\n'
# populate an array from the result lines
lines=($results)
# return the third result
echo "${lines[2]}"
You could build a loop to iterate through the results of the array, but a more traditional and simple solution would just be to use bash's iteration:
for line in "${lines[@]}"; do
    echo "$line"
done
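On bash 4+, mapfile (a.k.a. readarray) avoids the IFS juggling entirely; a minimal sketch of the same idea:
mapfile -t lines < <(grep "mypattern" mysourcefile.txt)
for line in "${lines[@]}"; do
    echo "$line"
done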
FYI: Here is a similar concept I created for fun. I thought it would be good to show how to loop over a file and such with this. This is a script where I look at a Linux sudoers file and check that each line contains one of the valid words in my valid_words array list. Of course it ignores comment "#" and blank "" lines with sed. In this example, we would probably want to print just the Invalid lines, but this script prints both.
#!/bin/bash
# -- Inspect a sudoers file, look for valid and invalid lines.
file="${1}"
declare -a valid_words=( _Alias = Defaults includedir )
actual_lines=$(cat "${file}" | wc -l)
functional_lines=$(cat "${file}" | sed '/^\s*#/d;/^\s*$/d' | wc -l)
while read line ;do
    # -- set the line to nothing "" if it has a comment or is an empty line.
    line="$(echo "${line}" | sed '/^\s*#/d;/^\s*$/d')"
    # -- if not set to nothing "", check if the line is valid against our list of valid words.
    if ! [[ -z "$line" ]] ;then
        unset found
        for each in "${valid_words[@]}" ;do
            found="$(echo "$line" | egrep -i "$each")"
            [[ -z "$found" ]] || break
        done
        [[ -z "$found" ]] && { echo "Invalid=$line"; sleep 3; } || echo "Valid=$found"
    fi
done < "${file}"
echo "actual lines: $actual_lines functional lines: $functional_lines"

merge csv lines if not ending with a pipe

I have a rather large CSV file where each line should end with a pipe (|); if a line doesn't, the next line needs to be combined into it until a pipe is found again. This needs to be done using a shell script.
I got an answer as
awk '!/\|$/{l=l""$0;next}{print l""$0;l=""}' file
But it gives me an error, as the size of each line is quite large for me. I found out that I should be using perl to do this, and have tried something like the below, but it does not produce the desired result.
perl -pe 's/^\n(|\n)/ /gs' input.csv > output.csv
My data looks like
A|1|abc|<xml/>|
|2|def|<xml
>hello world</xml>|
|3|ghi|<xml/>|
And the desired output should be
A|1|abc|<xml/>|
|2|def|<xml>hello world</xml>|
|3|ghi|<xml/>|
Obviously the line size is quite a bit larger than in the sample input here.
Any help would be highly appreciated.
awk '{printf "%s",$0} /[|][[:space:]]*$/ {print ""}'
Print every line without a newline. If the last non-whitespace character is a pipe, you have a complete line so print a newline.
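For example (output.csv is an assumed name), redirecting the merged result to a new file:
awk '{printf "%s",$0} /[|][[:space:]]*$/ {print ""}' input.csv > output.csv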
This should work:
perl -lne 'unless(/\|$/){$line=$line.$_}else{print $line." $_";undef $line}' your_file
If you want to do an in-place replacement, do this:
perl -i -lne 'unless(/\|$/){$line=$line.$_}else{print $line." $_";undef $line}' your_file
This should happily handle all cases for you, and not break on any line length:
#!/bin/bash
newLine=0
IFS=
while read -r -n 1 char; do
    if [[ $char =~ ^$ ]]; then
        # read -n 1 returns an empty char at end of an input line
        if [[ $newLine -eq 1 ]]; then
            newLine=0
            echo '|'    # add a newline
        fi
    elif [[ $char =~ . && ( $newLine -eq 1 ) ]]; then
        newLine=0
        echo -n "|$char"
    elif [[ $char =~ [|] ]]; then
        if [[ $newLine -eq 1 ]]; then
            echo -n '|'
        fi
        newLine=1
    else
        echo -n "$char"
    fi
done < file.txt
Please note that building a lexer by hand in bash is usually a bad idea.