Delete Repeated Characters without back-referencing with SED - linux

Let's say we have a file that contains
HHEELLOO
HHYYPPOOTTHHEESSIISS
and we want to delete repeated characters. To my knowledge we can do this with
s/\([A-Z]\)\1/\1/g
This is a homework problem and the professor said he wants us to try the exercises without back-referencing or extended regular expressions. Is that possible on this one? I would appreciate it if anyone could point me in the right direction, thanks!

The only reasonable way to do this is to use the right tool for the job, in this case tr:
$ tr -s 'A-Z' < file
HELO
HYPOTHESIS
If you were going to use sed for that specific input though, where every character appears exactly twice, then it'd just be:
$ sed 's/\(.\)./\1/g' file
HELO
HYPOTHESIS
If that's not what you're looking for then edit your question to show more truly representative sample input and expected output.
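The two commands above only agree because every character in the sample is doubled. Here is a small sketch of where they diverge; the helper names `squeeze` and `drop_second` and the extra input `HELLO` are made up for illustration and are not from the question.

```shell
#!/bin/sh
# squeeze: collapse each run of a repeated uppercase letter to one character.
squeeze() { tr -s 'A-Z'; }

# drop_second: delete every second character of the line.
# This equals squeeze only when every character appears exactly twice.
drop_second() { sed 's/\(.\)./\1/g'; }

printf 'HHEELLOO\n' | squeeze       # HELO
printf 'HHEELLOO\n' | drop_second   # HELO
printf 'HELLO\n'    | squeeze       # HELO
printf 'HELLO\n'    | drop_second   # HLO (the E is dropped, not a duplicate)
```

So `tr -s` is the safer choice unless the input really is strictly doubled.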

Here's one way:
s/AA/A/g
s/BB/B/g
...
s/ZZ/Z/g
As a one-liner:
sed 's/AA/A/g; s/BB/B/g; ...'
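Nobody wants to type 26 substitutions by hand, so here is a rough sketch that builds the sed script in a loop. Two things are my own assumptions rather than part of the answer above: the helper name `squeeze_upper`, and the pattern `AA*` in place of `AA`. It still uses no back-reference, but it collapses runs of any length, whereas `s/AA/A/g` turns `AAA` into `AA`.

```shell
#!/bin/sh
# Build "s/AA*/A/g", "s/BB*/B/g", ... "s/ZZ*/Z/g", one command per line.
nl='
'
script=""
for c in A B C D E F G H I J K L M N O P Q R S T U V W X Y Z; do
    script="${script}s/${c}${c}*/${c}/g${nl}"
done

# squeeze_upper (hypothetical name): filter stdin through the generated script.
squeeze_upper() { sed "$script"; }

printf 'HHEELLOO\nHHYYPPOOTTHHEESSIISS\n' | squeeze_upper
# HELO
# HYPOTHESIS
```

Each `s` command only looks at one letter, so the order of the 26 commands does not matter.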

Related

Using SED to replace capture group with regex pattern

I need some help with a sed command that I thought would help solve an issue I have. I basically have long text files that look something like this:
>TRINITY_DN112253_co_g1_i2 Len=3873 path=[38000:0-183]
ACTCACGCCCACATAAT
The ACT text blocks continue on, and then there are more blocks of text that follow the same pattern, except that the text after the > differs slightly in its numbers. I want to cut only this header part (the part following the >) down to everything up to the very last "_". The sed command I thought seemed logical is the following:
sed -i 's/>.*/TRINITY.*_/'
However, sed is literally changing each header to TRINITY.*_ rather than capturing the block I thought it would. Any help is appreciated!
Also, just to make things clear: I thought that my sed command would convert the top header block into this:
>TRINITY_DN112253_co_g1_
ACTCACGCCCACATAAT
This might help:
sed '/^>/s/[^_]*$//' file
Output:
>TRINITY_DN112253_co_g1_
ACTCACGCCCACATAAT
See: The Stack Overflow Regular Expressions FAQ
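To spell out why that command works: `[^_]*$` matches the run of non-underscore characters at the end of the line, which is everything after the last "_", and the `/^>/` address restricts the substitution to header lines. Here is a sketch using the sample from the question; the helper names `trim_header` and `trim_all` are made up for the comparison.

```shell
#!/bin/sh
# trim_header: on lines starting with ">", delete everything after the last "_".
trim_header() { sed '/^>/s/[^_]*$//'; }

# trim_all: the same substitution without the /^>/ address, for comparison.
trim_all() { sed 's/[^_]*$//'; }

sample='>TRINITY_DN112253_co_g1_i2 Len=3873 path=[38000:0-183]
ACTCACGCCCACATAAT'

printf '%s\n' "$sample" | trim_header
# >TRINITY_DN112253_co_g1_
# ACTCACGCCCACATAAT

# Without the address, the sequence line (which has no "_") matches
# [^_]*$ in full and comes out empty.
printf '%s\n' "$sample" | trim_all
```

The address is what protects the sequence data, so it is worth keeping even if the headers look unambiguous.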

Embedding quotation marks in command string generated by AWK?

I need to match all instances of strings in one file, with a master list in another. However, if my string is abc I want only that, not abcdef, abc1234 and so on.
So, a word boundary for the regex? Right now, I'm using a simple awk one liner:
cat results_file| sort -k 1| awk -F" " '{ print $1" /home/owner/file_2_search"}'|
xargs -L 1 /bin/grep -i
However, to force a word boundary, I'd need to grep string\b and the quotes (single or double) seem to be required.
In awk, \b is a special character, you need \\b ... And the quoted quotes ... (arg) ... Or am I missing something and overdoing this?
This is a Linux box, so presumably gawk. I have gone over quoting rules for awk, and realize this has got to be simple (and not complex ... but), but am not seeing it.
Had meant to post as an answer, not a comment. Will try to pose a more readable question, but confess to having second thoughts about doing this as a one-liner in the first place -- may be best to follow an alternate method. Appreciate the willingness to help.
--Joe
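One possible direction, offered only as a sketch since the thread above has no answer: grep's `-w` option matches whole words, so no `\b` has to be smuggled through awk and the quoting problem goes away. The file names and contents below are stand-ins created in a temporary directory, not the asker's real files, and `-w` is assumed to be available (it is in GNU and BSD grep).

```shell
#!/bin/sh
# Self-contained demo; results_file and master are stand-ins.
tmp=$(mktemp -d)
printf 'abc 1\nxyz 2\n' > "$tmp/results_file"
printf 'abc\nabcdef\nabc1234\nfoo xyz bar\n' > "$tmp/master"

# Take the first column as the pattern list, one pattern per line.
awk '{ print $1 }' "$tmp/results_file" | sort -u > "$tmp/patterns"

# One grep for all patterns:
#   -i ignore case, -w whole words only, -F fixed strings, -f patterns from a file
matches=$(grep -i -w -F -f "$tmp/patterns" "$tmp/master")
printf '%s\n' "$matches"
# abc
# foo xyz bar

rm -rf "$tmp"
```

This also runs grep once instead of once per string, which matters if the results file is long.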

substitute strings with special characters in a huge file using sed

I'm stuck on this very easy problem (I hope it is for you).
I need to substitute several strings with special characters in a huge file.
I'm trying to use sed and bash because I'm a Linux user, but I've only used sed for "standard" strings so far.
These are the kind of strings that I'm trying to manipulate
(alpha[1],alpha[2]) and diff(A45(i,j),alpha[1])
and the substituting strings would be
(i,j) and dzA45(i,j)
I tried sed -i 's/(alpha[1],alpha[2])/(i,j)/g' $filetowork and
sed -i 's/\(alpha\[1\],alpha\[2\]\)/i,j/g' $filetowork without any success
The second option seems to work for the first kind of string but it doesn't for the second one, why?
Could you please help me? I looked through old Stack Overflow questions without finding anything helpful, unfortunately :(
I just tried on the command line, but
echo "(alpha[1],alpha[2])" | sed 's/(alpha\[1\],alpha\[2\])/(i,j)/'
worked for the first case. Please note that you should not escape ( or ): in sed's default basic regular expressions, \( and \) are what activate groups, while a bare ( or ) matches the literal character.
For the second one
echo "diff(A45(i,j),alpha[1])" | sed 's/diff(A45(i,j),alpha\[1\])/dzA45(i,j)/'
worked for me. Same story: don't escape the parentheses, only the square brackets!
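That also explains the question's "why": `\(alpha\[1\],alpha\[2\]\)` is a group that matches `alpha[1],alpha[2]` without the parentheses, so replacing it with `i,j` leaves the original parentheses around the result and it looks like it worked. The same escaping on the second string would look for text with no literal parentheses in it at all. Here is a sketch combining both substitutions; the helper name `fix_terms` and the sample line are made up.

```shell
#!/bin/sh
# fix_terms (hypothetical name): apply both substitutions in one sed call.
# In basic regular expressions ( ) and , are literal; only [ and ] are escaped.
fix_terms() {
    sed -e 's/(alpha\[1\],alpha\[2\])/(i,j)/g' \
        -e 's/diff(A45(i,j),alpha\[1\])/dzA45(i,j)/g'
}

printf '%s\n' 'f(alpha[1],alpha[2]) + diff(A45(i,j),alpha[1])' | fix_terms
# f(i,j) + dzA45(i,j)
```

For the huge file, the same two `-e` expressions can be given to `sed -i` as in the question, so the file is read only once.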

bash - remove improper words

I have a file with a bunch of words, many of which don't make much sense, such as 'completemakes' or even numbers mixed with letters/words. What I need is a tool to spell check them: if a word exists in the dictionary, leave it; if not, delete it.
What would be a good way of doing this in bash?
Thanks
You can script Aspell.
I had some fun with getting a single quote character in here, but hey, it should be as hard to read as it was to write, right? (assuming your words are listed in words.txt)
awk 'system("grep -i -q " "'"'"'^"$0"$'"'"'" " /usr/share/dict/words") == 0 {print $0};' words.txt
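If the quoting in that one-liner is too much, a possible alternative is to turn the problem around: give grep the dictionary as its pattern list and the word file as its input. This is only a sketch; the helper name `keep_known` is made up, and the demo uses a tiny stand-in dictionary so it does not depend on /usr/share/dict/words being installed.

```shell
#!/bin/sh
# keep_known DICT (hypothetical helper): print the stdin lines that equal
# some line of DICT, ignoring case.
#   -F fixed strings, -x whole-line match, -i ignore case, -f patterns from a file
keep_known() { grep -i -x -F -f "$1"; }

# Tiny stand-in dictionary for the demo.
dict=$(mktemp)
printf 'complete\nmakes\nword\n' > "$dict"

printf 'complete\ncompletemakes\nw0rd\nMakes\n' | keep_known "$dict"
# complete
# Makes

rm -f "$dict"
```

With the real dictionary that would be `keep_known /usr/share/dict/words < words.txt`, which starts one grep process in total rather than one per word.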

CShell word replacement

I have a short text file with the following syntax:
FileName: some name
Version: 3
Length: 45
hello, this is an irrelevant, unimportant text.
So is this line.
Now, I'm trying to write a script that replaces the version number with a given new number.
Does anyone know how to? I really don't mind if it's ugly
thanks,
Udi
Why not just use sed?
sed -i 's/^Version: .*$/Version: 99/' foo.txt
I don't know how to do it in csh. However, at the risk of coming across as one of those annoying people who tells you to use their favourite thing at every opportunity, there are better ways than using csh. The traditional unix command sed is good at this stuff, or a language like Perl is also useful.
perl -p -i -e 's/Version: 3/Version: 4/g;' myfile
should do it.
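Since the question asks for a "given new number", here is a sketch of the sed approach with the number passed in as a parameter. The helper name `set_version` is made up, the wrapper is sh rather than csh syntax, and the new value is assumed to be a plain number so that it needs no escaping inside the sed expression.

```shell
#!/bin/sh
# set_version FILE NEW (hypothetical helper): print FILE with its
# "Version:" line rewritten to NEW. The file itself is not modified.
set_version() {
    sed "s/^Version: .*\$/Version: $2/" "$1"
}

# Demo on a copy of the sample from the question.
f=$(mktemp)
printf 'FileName: some name\nVersion: 3\nLength: 45\nSo is this line.\n' > "$f"
set_version "$f" 7
# FileName: some name
# Version: 7
# Length: 45
# So is this line.
rm -f "$f"
```

The double quotes let the shell expand `$2`, and `\$` keeps the end-of-line anchor intact. To change the file in place, redirect the output to a new file and move it over the original, or use `-i` as in the answer above where your sed supports it.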
