Partial replace with sed command - linux

We have a file with some UTF-16 decimal character codes, and we would like to replace them in the following manner:
Test Line in a file \u343- ? some random words \u1233? 300 \u241? \u208?\cell
The required output is
Test Line in a file \u343- ? some random words UTF16-1233| 300 UTF16-241| UTF16-208|\cell
The requirement is to change \u[0-9]+? to UTF16-[0-9]+|
That is, replace the initial \u with UTF16- and the ending ? with a pipe |.
Please note that if there is any non-digit character between \u and ?, the sequence should be left unchanged.

You can use sed to modify the file in place:
Match \\u([0-9]+)\?:
Match a literal \u, then match and capture one or more digits, then match a literal ?.
Replace with UTF16-\1|:
Replace with the string UTF16-, followed by the captured digits, followed by a pipe |.
$ sed -i -E 's/\\u([0-9]+)\?/UTF16-\1|/g' file
$ cat file
Test Line in a file \u343- ? some random words UTF16-1233| 300 UTF16-241| UTF16-208|\cell
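To try the substitution without touching a file, here is a minimal sketch that pipes the sample line from the question through the same sed expression. The variable names line and result are only for illustration.

```shell
# Sample line from the question; single quotes keep the backslashes literal.
line='Test Line in a file \u343- ? some random words \u1233? 300 \u241? \u208?\cell'

# Same expression as above, without -i, so nothing is modified on disk.
result=$(printf '%s\n' "$line" | sed -E 's/\\u([0-9]+)\?/UTF16-\1|/g')

# \u343- ? is left alone because a non-digit sits between the digits and the ?.
printf '%s\n' "$result"
```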

Related

How to grep the string with specific pattern

I am trying to grep file.txt for lines containing two strings, cp and (target file name), where a line in the file looks like this:
cp (source file name) (target file name)
The problem is that '(target file name)' follows a specific pattern: /path/to/file/TC_12_IT_(6 digits)_(6 digits)_TC_12_TEST _(2 digits).tc12.tc12
I am using the grep command below to find lines containing these two strings:
grep -E cp.*/path/to/file/TC_12_IT_ file.txt
How can I be more specific about (target file name) in the grep command, so that it matches the whole pattern, something like this:
grep -E 'cp.*/path/to/file/TC_12_IT_*_*_TC_12_TEST_*.tc12.tc12' file.txt
Can we use wildcards in grep to search for a string in a file, just as we can use a wildcard like * when listing files, e.g.
ls -lrt TC_12_*_12345678.txt
Please suggest if there are any other ways to achieve this.
More specifically:
grep -P '^cp\s+.+\s+\S+/TC_12_IT_\d{6}_\d{6}_TC_12_TEST _\d{2}[.]tc12[.]tc12$' in_file > out_file
^ : beginning of the line.
\s+ : 1 or more whitespace characters.
.+ : 1 or more of any character.
\S+ : 1 or more non-whitespace characters.
\d{6} : exactly 6 digits.
[.] : literal dot (.). Note that just plain . inside a regular expression means any character, unless it is inside a character class ([.]) or escaped (\.).
$ : end of the line.
SEE ALSO:
GNU grep manual
perlre - Perl regular expressions
Like this, using GNU grep:
grep -P 'cp.*TC_12_IT_\d{6}_\d{6}_TC_12_TEST_\d{2}.tc12.tc12' file
The regular expression matches as follows:
Node            Explanation
cp              'cp'
.*              any character except \n (0 or more times, matching as many as possible)
TC_12_IT_       'TC_12_IT_'
\d{6}           digits (0-9) (6 times)
_               '_'
\d{6}           digits (0-9) (6 times)
_TC_12_TEST_    '_TC_12_TEST_'
\d{2}           digits (0-9) (2 times)
.               any character except \n
tc12            'tc12'
.               any character except \n
tc12            'tc12'
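On the wildcard part of the question: in a regular expression * is a quantifier ("zero or more of the previous item"), not a shell glob, so _* means "any number of underscores". As a rough sketch, the same match can also be written with plain grep -E (no -P needed). The sample lines and the /src/... paths are made up for illustration, and the pattern assumes there is no space inside the file name.

```shell
# Hypothetical sample lines; only the first one follows the full naming pattern.
sample='cp /src/a.txt /path/to/file/TC_12_IT_123456_654321_TC_12_TEST_01.tc12.tc12
cp /src/b.txt /path/to/file/TC_12_IT_12345_654321_TC_12_TEST_01.tc12.tc12
mv /src/c.txt /path/to/file/TC_12_IT_123456_654321_TC_12_TEST_01.tc12.tc12'

# ERE version: [0-9]{6} instead of \d{6}, and \. for a literal dot.
pattern='^cp[[:space:]]+.+[[:space:]]+/path/to/file/TC_12_IT_[0-9]{6}_[0-9]{6}_TC_12_TEST_[0-9]{2}\.tc12\.tc12$'

matches=$(printf '%s\n' "$sample" | grep -E "$pattern")
printf '%s\n' "$matches"
```

The second line is rejected because it has only five digits in the first group, and the third because it does not start with cp.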

How to remove the first three characters from the fasta file header

I have a fasta file like this:
>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc
How do I remove the 'rna-' so it becomes
>XM_00001.1
actact
>XM_00002.1
atcatc
If what you're showing is the file contents, then sed should be able to do this:
sed 's/^>rna-/>/' < inputfile > outputfile
Explanation:
The first character of the command-line to sed is s, which tells sed to do substitution
The / are delimiters
The ^ tells sed to look only at the start of a line
The next >rna- is the pattern to match at the start of a line
The next > is the replacement substituted for the pattern
If, instead, you want to always remove the first four characters after a > as long as they end in -, you could use:
sed 's/^>...-/>/' < inputfile > outputfile
Explanation:
This is similar to above, except the pattern to match at the start of a line is >...-. The pattern is a regexp, where a . matches any single character. So this pattern matches any line starting with >, followed by any three characters, followed by -.
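A quick way to check this is to feed the sample from the question through the first command. This is only a sketch; the variable names fasta and cleaned are placeholders.

```shell
# Sample FASTA content from the question.
fasta='>rna-XM_00001.1
actact
>rna-XM_00002.1
atcatc'

# Strip the "rna-" prefix from header lines; sequence lines do not start
# with ">rna-", so they pass through unchanged.
cleaned=$(printf '%s\n' "$fasta" | sed 's/^>rna-/>/')
printf '%s\n' "$cleaned"
```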

replace sub-string with last special character, being (3rd part) of comma separated string

I have a string with comma separated values, like:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
As you can see, the 3rd comma-separated value sometimes ends with a special character, such as a dash (-). I want to use sed, or preferably perl (with the -i option, so that the existing file is edited in place), to replace that value with the same string in the same position (i.e. the 3rd comma-separated value) but without the trailing special character. So, for the example above, the result should be:
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,
Since the file contains many lines like the one above, I am using a while loop in a bash script to process them one by one. I have assigned the string values above to variables so that I can replace them using perl. My while loop is:
while read mystr
do
myNEWstr=$(echo $mystr | sed s/[_.-]$// | sed s/[__]$// | sed s/[_.-]$//)
perl -pi -e "s/\b$mystr\b/$myNEWstr/g" myFinalFile.txt
done < myInputFile.txt
where:
$mystr is the "SOME-STRING_A_-BLAHBLAH_1-4MP0-"
$myNEWstr result is the "SOME-STRING_A_-BLAHBLAH_1-4MP0"
Note that myInputFile.txt contains the 3rd comma-separated values from myFinalFile.txt. Each of those exact strings ($mystr) is checked for a trailing special character (underscore, dash, dot, or double underscore); if one is present, it is removed to form the new string ($myNEWstr). That new string then replaces the old one in myFinalFile.txt, so that the 3rd comma-separated value ends up without the trailing special character (the dash (-) in the example above), as in the final string shown above.
Thank you.
You could use the following regex:
s/^([^,]*,[^,]*,[^,]*)-,/$1,/
This defines CSV fields as runs of characters other than a comma (empty fields are allowed). We are looking for a dash at the very end of the third CSV field. The regex captures everything up to that point, and then replaces the match with the captured text followed by a comma, omitting the dash.
$ cat t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0-,,,
$ perl -p -e 's/^([^,]*,[^,]*,[^,]*)-,/$1,/' t.txt
742108,SOME-STRING_A_-BLAHBLAH_1-4MP0RTTYE,SOME-STRING_A_-BLAHBLAH_1-4MP0,,,

Delete non-AGTC characters in a text file

I have a text file that should contain only the characters A, G, C, and T. However, it sometimes has a few unknown characters, which I want to delete, except that an N should be replaced with A. Also, I want to escape the lines which start with a > symbol.
So far I only know how to replace N with A, which I do like this:
sed "s/N/A/g" file1.fa >file2.fasta
But I don't know how to do the first task.
Example :
Initial file
first line
AGCCCMCCCN
Target file should be like this
first line
AGCCCCCCA
Any help will be appreciated. Thanks in advance!
You can add more substitutions to your sed command:
sed -e 's/N/A/g' -e 's/[^AGCT>]//g' -e 's/^>/\\>/' -e 's/[^\]>//g' file1.fa > file2.fasta
Pattern 1
-e 's/N/A/g'
Your pattern: first of all, replace all instances of N with A.
Pattern 2
-e 's/[^AGCT>]//g'
Second, replace all characters that aren't A, G, C, T or > with nothing.
Pattern 3
-e 's/^>/\\>/'
Then replace a > at the start of a line with \>
Pattern 4
-e 's/[^\]>//g'
Finally, remove every > that isn't preceded by a \ (note that the character in front of it is part of the match, so it is removed as well).
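If "escape" in the question means "leave header lines alone" (an assumption, since the example does not show a > header), a simpler sketch is to restrict both substitutions to lines that do not start with >. The header text ">first line" below is a made-up example.

```shell
# Hypothetical sample: one header line (assumed to start with ">") and one sequence line.
fasta='>first line
AGCCCMCCCN'

# /^>/! applies each substitution only to lines that do NOT start with ">".
# N becomes A first, then anything other than A, G, C or T is deleted.
cleaned=$(printf '%s\n' "$fasta" | sed -e '/^>/!s/N/A/g' -e '/^>/!s/[^AGCT]//g')
printf '%s\n' "$cleaned"
```

Header lines pass through untouched, so nothing in them is deleted or rewritten.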

Need oneliner to insert character after Nth occurrence of delimiter on 2nd line in unix

Hello, I need a one-liner to insert a character after the Nth occurrence of a delimiter on the 2nd line in Unix. The criteria are these:
Find position of nth occurrence of the delimiter.
Insert character after the nth occurrence.
This is on the 2nd line only.
Note: I am doing this in Linux.
With awk :
INPUT FILE
1 foo bar base
2 foo bar base
CODE
awk 'NR==2{$2=$2"X"; print}' file
You can specify a delimiter with -F
NR==2 selects the line we work on
$2 is the 2nd field, separated by spaces (in this case)
$2=$2"X" is a concatenation
print alone prints the entire line
OUTPUT
2 fooX bar base
Suppose we have the input file:
$ cat file
1 foo bar base
2 foo bar base
To insert the character X after the 3rd occurrence of the delimiter space, use:
$ sed -r '2 s/([^ ]* ){3}/&X/' file
1 foo bar base
2 foo bar Xbase
To make the change to the file in place, use sed's -i option:
sed -i -r '2 s/([^ ]* ){3}/&X/' file
How it works
Consider the sed command:
2 s/([^ ]* ){3}/&X/
The initial 2 instructs sed to apply this command only to the second line.
We are using the s or substitute command. This command has the form s/old/new/ where old and new are:
old is the regular expression ([^ ]* ){3}. This matches everything up to and including the third occurrence of space.
new is the replacement text, &X. The ampersand refers to what we matched in old, which is all the line up to and including the third space. The X is the new character that we are inserting.
This might work for you (GNU sed):
sed '2s/X/&Y/3' file
This inserts Y after the third occurrence of X on the second line only.
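To make N adjustable, the occurrence count can come from a shell variable. The sketch below applies both sed forms shown above to the sample input with a space as the delimiter; the variable n and the other names are hypothetical, and the sed script is double-quoted so that $n expands.

```shell
n=3   # which occurrence of the delimiter (hypothetical parameter)
input='1 foo bar base
2 foo bar base'

# Group form: match n runs of "non-spaces followed by a space", then append X.
by_group=$(printf '%s\n' "$input" | sed -E "2 s/([^ ]* ){$n}/&X/")

# Numeric-flag form: replace only the n-th space on line 2 with "space + X".
by_flag=$(printf '%s\n' "$input" | sed "2 s/ /&X/$n")

printf '%s\n' "$by_group"
printf '%s\n' "$by_flag"
```

For this input both commands leave line 1 alone and turn line 2 into "2 foo bar Xbase".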

Resources