Linux tr command with punctuation

I need to use the tr command to apply ROT13 (shifting each letter 13 places along the alphabet) to both upper- and lower-case letters.
This is what I have come up with:
tr "A-Za-z" "N-ZA-Mn-za-m"
However it now also needs to translate for the punctuation characters.
I've seen someone mention that
[A-Za-z0-9 _.,!"'/$]*
would help me, but I honestly have no clue how to add this to my code.
I am completely new to Linux!

It depends on exactly how you define "rot13". I believe this is sufficient:
http://www.linuxjournal.com/article/2563
If you read the International Obfuscated C Code Contest
(ftp://ftp.uu.net./pub/ioccc/), you frequently see that part of the
hints are coded by a method called rot13. rot13 is a Caesar cypher,
i.e., a cypher in which all letters are shifted some number of places.
For example, a becomes b, b becomes c, ..., y becomes z, and z becomes
a. In rot13 each letter is shifted 13 places. It is a weak cypher, and
to decipher it, you can use rot13 again. You can also use tr to read
the text in this way:
tr a-zA-Z n-za-mN-ZA-M
Note, too, that the quotes (") are only needed when an argument contains a blank or another character that the shell treats specially. Your tr arguments contain neither, so the quotes are optional here. These two commands are functionally identical: tr "A-Za-z" "N-ZA-Mn-za-m" and tr A-Za-z N-ZA-Mn-za-m
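For comparison, here is a minimal Python sketch of the same mapping, using str.maketrans to build a table much like tr's two sets. The function name rot13 is chosen here for illustration. As with tr, any character that is not in the first set (digits, spaces, punctuation) passes through unchanged.

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase

# Same idea as: tr A-Za-z N-ZA-Mn-za-m
ROT13_TABLE = str.maketrans(
    UPPER + LOWER,
    UPPER[13:] + UPPER[:13] + LOWER[13:] + LOWER[:13],
)

def rot13(text):
    """Shift every ASCII letter 13 places; leave everything else as it is."""
    return text.translate(ROT13_TABLE)

print(rot13("Hello, World!"))         # Uryyb, Jbeyq!
print(rot13(rot13("Hello, World!")))  # Hello, World!
```

Applying the function twice returns the original text, which is why the same command both encodes and decodes.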

Related

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3 as it contains internal punctuation
I'm trying to use GNU grep 2.25, but I'm not having any luck.
My initial attempt was this (however, it does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
So I tried this:
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[@]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
However, I need to factor in spaces and -, as these are acceptable inside the string.
I had been trying with the [:punct:] set, but I now see that it contains -, so clearly that will not work.
I currently have a stored procedure in T-SQL to process these files, but I would prefer to preprocess them before loading if possible, as the routine takes several seconds per file.
Has someone been able to achieve something similar?
On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.
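The same 'word space word space number' pattern can be sketched in Python to experiment with it outside grep. Python's re module has no POSIX classes, so this sketch assumes ASCII input and substitutes A-Za-z0-9 for [:alnum:] and string.punctuation for [:punct:]; the names WORD and LINE are illustrative only.

```python
import re
import string

ALNUM = "A-Za-z0-9"                      # stand-in for [:alnum:] (ASCII only)
PUNCT = re.escape(string.punctuation)    # stand-in for [:punct:]

# One alphanumeric, optionally followed by more characters ending in an alphanumeric.
WORD = rf"[{ALNUM}](?:[{ALNUM}{PUNCT}]*[{ALNUM}])?"
LINE = re.compile(rf"{WORD}\s+{WORD}\s+[0-9]+")

data = [
    "example for 1",
    "useful when 1",
    "for. However 1",
    ",boy wonder 1",
    ",hary-horse wondered 2",
    "O'Reilly Books 23",
    "Coelecanths, Dodos Etc 19",
]

# Like grep without -v: print the lines where the pattern is found.
for line in data:
    if LINE.search(line):
        print(line)
```

Because the pattern is not anchored, a line such as ",boy wonder 1" still matches starting at "boy", just as in the grep output above.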
Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
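The ordering problem is easy to reproduce. The following Python sketch (Python's re is used here only as a convenient test bench, not grep itself) compares the fixed-order form with a single character class:

```python
import re

# Each element is optional, but the order is fixed: '!' characters, then '?'.
ordered = re.compile(r"!*\?*")

# A single character class accepts the same characters in any order.
any_order = re.compile(r"[!?]*")

print(bool(ordered.fullmatch("!?")))    # True
print(bool(ordered.fullmatch("?!")))    # False
print(bool(any_order.fullmatch("?!")))  # True
```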
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
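As a rough cross-check of the single-character-class approach, here is a Python sketch that keeps only the lines in which no alphanumeric is immediately followed by punctuation other than "-". It assumes ASCII input, and the names PUNCT_NO_DASH and keep_lines are made up for this example.

```python
import re
import string

# All ASCII punctuation except '-', which is acceptable inside a word.
PUNCT_NO_DASH = re.escape(string.punctuation.replace("-", ""))
ALNUM_THEN_PUNCT = re.compile(rf"[A-Za-z0-9][{PUNCT_NO_DASH}]")

def keep_lines(lines):
    """Mimic grep -v: drop lines where an alphanumeric is followed by punctuation."""
    return [line for line in lines if not ALNUM_THEN_PUNCT.search(line)]

sample = [
    "example for 1",
    "useful when 1",
    "for. However 1",
    ",boy wonder 1",
    ",hary-horse wondered 2",
]
print(keep_lines(sample))
```

Only "for. However 1" is dropped: the leading commas are not preceded by an alphanumeric, and the dash is not in the class.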
Not realizing that a*b*c* matches anything is a common newbie error. Avoid writing an expression in which every element is optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case), then add optional bits of context around it only if you really need to; the fewer of these you need, the faster it will run and the easier it will be to see what it does. As a quick rule of thumb, for deciding whether a line matches, a*bc* is equivalent to just b: leading or trailing optional expressions might as well be omitted, because they do not affect which lines match.
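Both claims can be checked quickly. This sketch uses Python's re module as a stand-in for grep's search semantics (a match anywhere in the string counts):

```python
import re

# Every element is optional, so the empty string at position 0 always matches.
all_optional = re.compile(r"a*b*c*")
print(all(all_optional.search(s) for s in ["abc", "xyz", ""]))  # True

# For a yes/no search, a*bc* and b accept exactly the same inputs.
samples = ["b", "aabcc", "xbx", "ac", "xyz", ""]
with_context = [bool(re.search(r"a*bc*", s)) for s in samples]
bare = [bool(re.search(r"b", s)) for s in samples]
print(with_context == bare)  # True
```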

How to make a Palindrome with a sed command?

I'm trying to write a command that finds all palindromes in a dictionary file.
This is what I have at the moment, which is wrong:
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
Can somebody explain the code as well?
Found the right answer.
sed -n '/^\([a-z]\)\([a-z]\)\2\1$/p' /usr/share/dict/words
I have no idea why I used -.
I also don't have an explanation for the \ after each group.
You can use the grep command as explained here
grep -w '^\(.\)\(.\).\2\1'
Explanation: \(.\)\(.\). matches any three characters and captures the first two; \2\1 then requires the second and the first captured characters to follow, in that order.
This grep command therefore finds only five-letter palindromes.
An extended version is proposed on that page as well; it works correctly for the first line but then crashes. There is surely something worth keeping there, and perhaps adapting.
Guglielmo Bondioni proposed a single RE that finds all palindromes up to 19 characters long using 9 subexpressions and 9 back-references:
grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' file
You can extend this further as much as you want :)
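To see the 19-character limit in practice, the same expression can be tried in Python, whose re module also supports back-references \1 to \9. In this sketch, fullmatch plays the role of the ^ and $ anchors, and the name is_short_palindrome is invented for the example.

```python
import re

# Nine optional captures, an optional middle character, then the captures mirrored.
MIRROR = re.compile(r"(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1")

def is_short_palindrome(word):
    """True if word is a palindrome of at most 19 characters."""
    return MIRROR.fullmatch(word) is not None

print(is_short_palindrome("level"))   # True
print(is_short_palindrome("noon"))    # True
print(is_short_palindrome("linux"))   # False
print(is_short_palindrome("a" * 21))  # False: a palindrome, but longer than 19
```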
Perl to the rescue:
perl -lne 'print if $_ eq reverse' /usr/share/dict/words
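A Python sketch of the same reverse-and-compare idea, for readers who prefer it to Perl. The function names are illustrative, and the commented-out usage assumes the same word list path as above.

```python
def is_palindrome(word):
    """A word is a palindrome if it reads the same reversed."""
    return word == word[::-1]

def palindromes(lines):
    """Return the lines (without trailing newline) that are palindromes."""
    words = (line.rstrip("\n") for line in lines)
    return [word for word in words if is_palindrome(word)]

print(palindromes(["level\n", "linux\n", "noon\n", "a\n"]))  # ['level', 'noon', 'a']

# Usage on a word list (path assumed from the question):
# with open("/usr/share/dict/words") as fh:
#     print("\n".join(palindromes(fh)))
```

Unlike the back-reference regex, this has no length limit.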
Hate to say it, but while regex may be able to cook your breakfast, I don't think it can find a palindrome. According to the all-knowing Wikipedia:
In the automata theory, a set of all palindromes in a given alphabet is a typical example of a language that is context-free, but not regular. This means that it is impossible for a computer with a finite amount of memory to reliably test for palindromes. (For practical purposes with modern computers, this limitation would apply only to incredibly long letter-sequences.)
In addition, the set of palindromes may not be reliably tested by a deterministic pushdown automaton which also means that they are not LR(k)-parsable or LL(k)-parsable. When reading a palindrome from left-to-right, it is, in essence, impossible to locate the "middle" until the entire word has been read completely.
So a regular expression won't be able to solve the problem based on the problem's nature, but a computer program (or sed examples like @NeronLeVelu or @potong) will work.
explanation of your code
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
It selects and prints the lines that consist of:
a first lowercase letter (at the start of the line), followed by -, followed by another lowercase letter (which may be the same as the first), followed by the second captured letter (\2), followed by the first captured letter (\1), and then nothing else (end of line), i.e. Letter1-Letter2Letter2Letter1.
sample:
a-bba
a is first letter
b second letter
b is \2
a is \1
But that is a strange pattern for any real use, unless the input is a very specific dictionary (one limited to such combinations, for example).
This might work for you (GNU sed):
sed -r 'h;s/[^[:alpha:]]//g;H;x;s/\n/&&/;ta;:a;s/\n(.*)\n(.)/\n\2\1\n/;ta;G;/\n(.*)\n\n\1$/IP;d' file
This copies the original string(s) to the hold space (HS), then removes everything but alpha characters from the string(s) and appends this to the HS. The second copy is then reversed and the current string(s) and the reversed copy compared. If the two strings are equal then the original string(s) is printed out otherwise the line is deleted.
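The logic of that sed script (strip non-letters, reverse, compare while ignoring case) can be sketched in Python as follows. This assumes ASCII letters in place of [:alpha:], and the function name is made up for illustration.

```python
import re

def is_loose_palindrome(line):
    """Compare only the letters of the line with their reverse, ignoring case."""
    letters = re.sub(r"[^A-Za-z]", "", line).lower()
    return letters == letters[::-1]

for line in ["A man, a plan, a canal: Panama", "Madam", "Linux"]:
    if is_loose_palindrome(line):
        print(line)  # the original line is printed, not the stripped copy
```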

Replace control characters and spaces with escape sequences

I want to replace control characters (ASCII 0-31) and spaces (ASCII 32) with hex escape codes. For example:
$ escape 'label=My Disc'
label=My\x20Disc
$ escape $'multi\nline\ttabbed string'
multi\x0Aline\x09tabbed\x20string
$ escape '\'
\\
For context, I'm writing a script which reports the status of a DVD drive. Its output is designed to be parsed by another program. My idea is to print each piece of info as a separate space-separated word. For example:
$ ./discStatus --monitor
/dev/dvd: no-disc
/dev/dvd: disc blank writable size=0 capacity=2015385600
/dev/dvd: disc not-blank not-writable size=2015385600 capacity=2015385600
I want to add the disc's label to this output. To fit with the parsing scheme I need to escape spaces and newlines. I might as well do all the other control characters as well.
I'd prefer to stick to bash, sed, awk, tr, etc., if possible. I can't think of a really elegant way to do this with those tools, though. I'm willing to use perl or python if there's no good solution with basic shell constructs and tools.
Here's a Perl one-liner I came up with. It uses /e to run code in the replacements.
perl -pe 's/([\x00-\x20\\])/sprintf("\\x%02X", ord($1))/eg'
A slight deviation from the example in my question: it emits \x5C for backslashes instead of \\.
I would use a higher-level language. There are three different types of replacement going on (single character to multicharacter for the control characters and space, identity for other printable characters, and the special case of doubling the backslash), which I think is too much for awk, sed, and the like to handle simply.
Here's my approach for Python
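import sys

def translate(c):
    """Return the escaped form of a single character."""
    cp = ord(c)
    if cp <= 32:
        # Control characters and space become \xHH (uppercase hex, as in the question).
        return '\\x%02X' % cp
    elif c == '\\':
        # A literal backslash is doubled so the output stays unambiguous.
        return r'\\'
    else:
        return c

if __name__ == '__main__':
    print(''.join(map(translate, sys.argv[1])))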
If speed is a concern, you can replace the translate function with a prebuilt dictionary mapping each character to its desired string representation.
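A minimal sketch of that prebuilt-dictionary idea in Python 3, using str.translate, which accepts a mapping from code points to replacement strings. The names ESCAPES and escape are chosen here, and uppercase hex is used to match the examples in the question.

```python
# Map code points 0-32 (control characters and space) to \xHH.
ESCAPES = {cp: "\\x%02X" % cp for cp in range(33)}
# A literal backslash is doubled.
ESCAPES[ord("\\")] = "\\\\"

def escape(text):
    """Escape control characters, spaces and backslashes in one pass."""
    return text.translate(ESCAPES)

print(escape("label=My Disc"))                  # label=My\x20Disc
print(escape("multi\nline\ttabbed string"))     # multi\x0Aline\x09tabbed\x20string
print(escape("\\"))                             # \\
```

The table is built once, so each call is a single lookup per character rather than a chain of comparisons.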
Wow, it looks like a fairly trivial sed script along the lines of
's|\n|\\n|' for each character you want to substitute.

How to implement Caesar cipher-like text substitution in Vim?

I was doing some puzzle where each English letter is replaced by the one two letters down the alphabet. For example, the word apple is to be transformed into crrng, as a + 2 → c, b + 2 → d, etc.
In Python, I was able to implement this transformation using the maketrans()
string method. I wonder: Is it possible to do the same via search and replace in Vim?
1. If the alphabetic characters are arranged sequentially in the target
encoding (as is the case for ASCII and some alphabets in UTF-8, like
English), one can use the following substitution command:
:%s/./\=nr2char(char2nr(submatch(0))+2)/g
(Before running the command, make sure that the encoding option
is set accordingly.)
However, this replacement implements a non-circular letter shift.
A circular shift can be implemented by two substitutions separately
handling lowercase and uppercase letters:
:%s/\l/\=nr2char(char2nr('a') + (char2nr(submatch(0)) - char2nr('a') + 2) % 26)/g
:%s/\u/\=nr2char(char2nr('A') + (char2nr(submatch(0)) - char2nr('A') + 2) % 26)/g
2. Another way is to translate characters using the tr() function.
Let us assume that the variable a contains lowercase characters
of an alphabet arranged in correct order, and the variable a1 holds
the string of characters corresponding to those in a (below is
an example for English letters).
:let a = 'abcdefghijklmnopqrstuvwxyz'
:let a1 = a[2:] . a[:1]
To avoid typing the whole alphabet by hand, the value of a can be
produced as follows:
:let a = join(map(range(char2nr('a'), char2nr('z')), 'nr2char(v:val)'), '')
Then, to replace each letter on a line by the letter two positions down
the alphabet, one can use the following substitution:
:%s/.*/\=tr(submatch(0), a . toupper(a), a1 . toupper(a1))
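For reference, the table-based approach above corresponds closely to what the question mentions doing in Python with maketrans(). A minimal sketch, assuming ASCII letters and a hypothetical function name caesar, might look like this:

```python
import string

def caesar(text, shift=2):
    """Shift ASCII letters circularly; leave all other characters untouched."""
    shift %= 26
    lower = string.ascii_lowercase
    shifted = lower[shift:] + lower[:shift]   # same idea as a[2:] . a[:1] in Vim
    table = str.maketrans(lower + lower.upper(), shifted + shifted.upper())
    return text.translate(table)

print(caesar("apple"))   # crrng
print(caesar("yz"))      # ab
print(caesar("Zebra!"))  # Bgdtc!
```

The rotated alphabet handles the wrap-around of y and z without any modular arithmetic in the per-character step.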
Yes, \= will execute the function
%s/\(.\)/\=nr2char(char2nr(submatch(1)) + 2)/g
Can't think of anything in vim, but you could use the unix command line utility 'tr' (stands for translate, I believe).
The puzzle you describe is widely known as the Caesar cipher, and is normally implemented via the tr command or sed -e y/. Since y is not available in Vim, you'll need a fairly dirty hack like the one ib proposed; calling tr is much nicer.
Especially considering the corner case of y and z: I assume these should be mapped to a and b, respectively?

Format text with regard to punctuation

How can I format text in a natural language taking punctuation into account? The built-in gq command of Vim, or command line tools, such as fmt or par break lines without regard to punctuation. Let me give you an example,
fmt -w 40 does not give what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40 would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
Of course, there are cases when no punctuation mark is found within the given text width; then it can fall back to the standard text formatting behavior.
The reason why I want this is to get a meaningful diff of text where I can spot which sentence or subsentence changed.
Here is a not very elegant, but working, method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That means I'll accept a result which is more ragged but contains more lines ending in a punctuation mark, as long as the "raggedness" is less than 6 characters. For example, this is OK ("raggedness" is 3 characters).
Wait!
He said.
This is not OK ("raggedness" is more than 6 characters)
Wait!
He said to them.
The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.
Here is the code for this
sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'
I used " _" (space + underscore) as a pair of dummy characters, assuming that underscores do not occur in the text. The result looks quite good:
we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
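The pad, wrap, strip trick can also be sketched in Python with the standard textwrap module. This is only an approximation of the pipeline above: textwrap fills lines greedily, whereas fmt tries to balance line lengths, so the line breaks may differ from the fmt output. The function name punct_wrap and its parameters are invented for this example, and it assumes, as above, that the text contains no underscores.

```python
import re
import textwrap

def punct_wrap(text, width=34, pairs=3):
    """Wrap text so that breaks after punctuation are favoured.

    Each punctuation mark is padded with `pairs` dummy ' _' pairs
    (2 characters each) before wrapping; the dummies are removed afterwards.
    """
    padded = re.sub(r"([.?!,])", lambda m: m.group(1) + " _" * pairs, text)
    lines = []
    for line in textwrap.wrap(padded, width=width):
        # Same clean-up as: sed -e 's/ _//g' -e 's/_ //g'
        cleaned = line.replace(" _", "").replace("_ ", "")
        if cleaned not in ("", "_"):   # skip lines that held only dummies
            lines.append(cleaned)
    return "\n".join(lines)

text = ("we had everything before us, we had nothing before us, "
        "we were all going direct to Heaven, we were all going "
        "direct the other way")
print(punct_wrap(text))
```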

Resources