Format text with regard to punctuation - text

How can I format text in a natural language taking punctuation into account? The built-in gq command of Vim, or command line tools, such as fmt or par break lines without regard to punctuation. Let me give you an example,
fmt -w 40 does not give what I want:
we had everything before us, we had
nothing before us, we were all going
direct to Heaven, we were all going
direct the other way
smart_formatter -w 40 would give:
we had everything before us,
we had nothing before us,
we were all going direct to Heaven,
we were all going direct the other way
Of course, there are cases where no punctuation mark is found within the given text width; then it can fall back to the standard text formatting behavior.
The reason why I want this is to get a meaningful diff of text where I can spot which sentence or subsentence changed.

Here is a not very elegant but working method I finally came up with. Suppose a line break at a punctuation mark is worth 6 characters. That is, I'll accept a result that is more ragged but has more lines ending in a punctuation mark, as long as the extra "raggedness" is less than 6 characters. For example, this is OK ("raggedness" is 3 characters):
Wait!
He said.
This is not OK ("raggedness" is more than 6 characters)
Wait!
He said to them.
The method is to add 6 dummy characters after each punctuation mark, format the text, then remove the dummy characters.
Here is the code for this
sed -e 's/\([.?!,]\)/\1 _ _ _/g' | fmt -w 34 | sed -e 's/ _//g' -e 's/_ //g'
I used " _" (space + underscore) as a pair of dummy characters, assuming they do not occur in the text. The result looks quite good:
we had everything before us,
we had nothing before us,
we were all going direct to
Heaven, we were all going
direct the other way
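For readers who prefer a script, here is a minimal Python sketch of the same padding trick. It is not the pipeline above: punct_wrap and its bonus parameter are names made up for this illustration, and textwrap fills lines greedily, whereas GNU fmt tries to balance line lengths, so the exact breaks can differ from the fmt output shown. It also assumes "_" never occurs in the text.

```python
import re
import textwrap


def punct_wrap(text, width=34, bonus=6):
    """Wrap text so that breaks after punctuation are favoured (sketch)."""
    # Assumption: "_" never occurs in the input text.
    padding = " _" * (bonus // 2)  # " _ _ _" when bonus is 6
    # Step 1: append dummy words after each punctuation mark.
    padded = re.sub(r"([.?!,])", lambda m: m.group(1) + padding, text)
    # Step 2: wrap the padded text (greedy fill, unlike GNU fmt).
    wrapped = textwrap.fill(padded, width=width,
                            break_long_words=False, break_on_hyphens=False)
    # Step 3: drop the dummy words again, line by line.
    cleaned = []
    for line in wrapped.splitlines():
        words = [w for w in line.split() if w != "_"]
        if words:
            cleaned.append(" ".join(words))
    return "\n".join(cleaned)
```

Because the padding is only ever removed, no output line can be longer than the requested width.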

Related

convert the numbers to their names(1=one etc) with SED command

I need to convert digits to their names (e.g. 1=one, 2=two, etc.), but I can only use the sed command. Only standalone one-digit numbers should change; digits that are part of a longer number stay as they are.
sed 's/[1]/one/g; s/[2]/two/g; s/[3]/three/g; s/[4]/four/g; s/[5]/five/g; s/[6]/six/g; s/[7]/seven/g; s/[8]/eight/g; s/[9]/nine/g; s/[0]/zero/g'
Text:
Lo5se eyes get fat shew. Win4ter can indeed letter oppose way change te5nded now. So is imp6rove my charmed picture exposed adapt5ed demands. Received had en4d prod4uced prepared dive5rted strictly off man br55anched. Known 72ye money 6so large decay voice t6here to. Preserved be mr cordially incom88888mode as an. He 3doors qui03ck child an point at. Had sh2a9re vexed front least style off why him.
The result should be:
Lofivese eyes get fat shew. Winfourter can indeed letter oppose way change tefivended now. So is impsixrove my charmed picture exposed adaptfiveed demands. Received had enfourd prodfouruced prepared divefiverted strictly off man br55anched. Known 72ye money sixso large decay voice tsixhere to. Preserved be mr cordially incom88888mode as an. He threedoors qui03ck child an point at. Had shtwoaninere vexed front least style off why him.
With a sed that has -E to enable EREs and recognizes \n as meaning a newline (e.g. GNU sed):
sed -E 's/(^|[^0-9])([0-9])([^0-9]|$)/\1\n\2\n\3/g; s/\n1\n/one/g; s/\n2\n/two/g; s/\n3\n/three/g; s/\n4\n/four/g; s/\n5\n/five/g; s/\n6\n/six/g; s/\n7\n/seven/g; s/\n8\n/eight/g; s/\n9\n/nine/g; s/\n0\n/zero/g' file
Lofivese eyes get fat shew. Winfourter can indeed letter oppose way change tefivended now. So is impsixrove my charmed picture exposed adaptfiveed demands. Received had enfourd prodfouruced prepared divefiverted strictly off man br55anched. Known 72ye money sixso large decay voice tsixhere to. Preserved be mr cordially incom88888mode as an. He threedoors qui03ck child an point at. Had shtwoa9re vexed front least style off why him.
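Note that in the output above "sh2a9re" comes out as "shtwoa9re", not "shtwoaninere": the pattern consumes the character on each side of the digit, so the "a" that follows the 2 is no longer available as the non-digit before the 9. A sketch of the same idea with lookarounds, which test the neighbours without consuming them, is shown here in Python (POSIX sed has no lookarounds, so this is an illustration of the logic rather than a sed answer; NAMES and spell_lone_digits are names invented for the example):

```python
import re

NAMES = ["zero", "one", "two", "three", "four",
         "five", "six", "seven", "eight", "nine"]

# A digit that has no digit directly before or after it.
LONE_DIGIT = re.compile(r"(?<![0-9])[0-9](?![0-9])")


def spell_lone_digits(text):
    """Replace each standalone digit with its English name."""
    return LONE_DIGIT.sub(lambda m: NAMES[int(m.group())], text)
```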
This might work for you (GNU sed):
sed -E 's/[0-9]+/\n&/g;s/\n(.[0-9])/\1/g;s/$/\n1one2two3three4four5five6six7seven8eight9nine0zero/;:a;s/\n(.)(.*\n.*\1([^0-9]+))/\3\2/;ta;P;d' file
Prepend a newline to each group of numbers.
Remove the prepended newline from numbers with two or more digits.
Append a newline and a lookup table to the end of the line.
Use a loop and pattern matching to replace each single digit with its literal, ensuring the lookup table is maintained.
Print the amended current line less the lookup table.

How to edit this file using grep or using cat or using vim or using another tool?

My elder brother, who studies statistics, is writing his thesis in LaTeX. Almost all of the content is written. He kept 5 digits after the decimal point (e.g. 5.55534) for each value used in his calculations. At the last minute his instructor told him to change these to 3 digits after the point (e.g. 5.555), which puts my brother in trouble. Finding and correcting them manually is not easy, so he asked me to help.
I believe there is an easy solution that I just don't know. A portion of the thesis looks like this:
&se($\hat\beta_1$)&0.35581&0.35573&0.35573\\
&mse($\hat\beta_1$)&.12945&.12947&.12947\\
\addlinespace
&$\hat\beta_2$&0.03329&0.03331&0.03331 \\
&se($\hat\beta_2$)&0.01593&0.01592&0.01591\\
&mse($\hat\beta_2$)&.000265&.000264&.000264 \\
\midrule
{n=100} & $\hat\beta_1$&-.52006&-.52001&-.51946\\
&se($\hat\beta_1$)&.22819&.22814&.22795\\
&mse($\hat\beta_1$)&.05247&.05244&.05234\\
\addlinespace
&$\hat\beta_2$&0.03134&0.03134&0.03133 \\
&se($\hat\beta_2$)&0.00979&0.00979&0.00979\\
&mse($\hat\beta_2$)&.000098&.000098&.000098
I want -
&se($\hat\beta_1$)&0.355&0.355&0.355\\
&mse($\hat\beta_1$)&.129&.129&.129\\
......................................................................
........................................................................
........................................................................
Note: Don't be put off by the syntax (this is LaTeX syntax).
If anybody has a solution or suggestion, please share it. Thank you.
In sed:
$ sed 's/\(\.[0-9]\{3\}\)[0-9]*/\1/g' file
&se($\hat\beta_1$)&0.355&0.355&0.355\\
&mse($\hat\beta_1$)&.129&.129&.129\\
i.e. for every run of digits that starts with a period and has at least 3 digits, keep the period and the first three digits and drop the rest. Note that this truncates; it does not round.
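The same substitution can be sketched in Python, in case the file needs to be processed from a script. This mirrors the sed pattern above; truncate_decimals is just a name chosen for the example. Like the sed command it truncates rather than rounds, and it would also shorten any other period-plus-digits run in the file, so check the diff before keeping the result.

```python
import re

# A period, three digits to keep, then the extra digits to drop.
EXTRA_DECIMALS = re.compile(r"(\.[0-9]{3})[0-9]+")


def truncate_decimals(line):
    """Cut every decimal fraction in the line down to three digits."""
    return EXTRA_DECIMALS.sub(r"\1", line)
```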
Here is the command in vim:
:%s/\.\d\{3}\zs\d\+//g
Explanation:
: entering command-mode
% is the range of all lines of the file
s substitution command
\.\d\{3}\zs\d\+ pattern you would like to change
\. literal point (.)
\d\{3} match 3 consecutive digits
\zs start substitution from here
\d\+ one or more digits
g Replace all occurrences in the line
As for grep and cat, they have nothing to do with replacing text; these commands only search and print the contents of files.
What you are looking for is substitution, and many commands on Linux can do that, mainly sed, perl, awk, ex, etc.

Grep expression filter out lines of the form [alnum][punct][alnum]

Hi all my first post is for what I thought would be simple ...
I haven't been able to find an example of a similar problem/solution.
I have thousands of text files with thousands of lines of content in the form
<word><space><word><space><number>
Example:
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
In the above example I want to exclude line 3, as it contains internal punctuation.
I'm trying to use GNU grep 2.25 but am not having any luck.
My initial attempt was this (however, it does not allow the "-" internal to the pattern):
grep -v [:alnum:]*[:punct:]*[:alnum:]* filename
so I tried this instead:
grep -v [:alnum:]*[:space:]*[!]*["]*[#]*[$]*[%]*[&]*[']*[(]*[)]*[*]*[+]*[,]*[.]*[/]*[:]*[;]*[<]*[=]*[>]*[?]*[@]*[[]*[\]*[]]*[^]*[_]*[`]*[{]*[|]*[}]*[~]*[.]*[:space:]*[:alnum:]* filename
However, I need to factor in spaces and "-", as these are acceptable inside the string.
I had been trying the [:punct:] set, but now see that it contains "-", so clearly that will not work.
I do currently have a stored procedure in T-SQL to process these, but I would prefer to preprocess before loading if possible, as the routine takes some seconds per file.
Has someone been able to achieve something similar?
On the face of it, you're looking for the 'word space word space number' schema, assuming 'word' is 'one alphanumeric optionally followed by zero or one occurrences of zero or more alphanumeric or punctuation characters and ending with an alphanumeric', and 'space' is 'one or more spaces' and 'number' is 'one or more digits'.
In terms of grep -E (aka egrep):
grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+'
That contains:
[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?
That detects a word with any punctuation surrounded by alphanumerics, and:
[[:space:]]+
[[:digit:]]+
which look for one or more spaces or digits.
Using a mildly extended data file, this produces:
$ cat data
example for 1
useful when 1
for. However 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$ grep -E '[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:alnum:]]([[:alnum:][:punct:]]*[[:alnum:]])?[[:space:]]+[[:digit:]]+' data
example for 1
useful when 1
,boy wonder 1
,hary-horse wondered 2
O'Reilly Books 23
Coelecanths, Dodos Etc 19
$
It eliminates the for. However 1 line as required.
Your regex contains a long string of ordered optional elements, but that means it will fail if something happens out of order. For example,
[!]*[?]*
will capture !? but not ?! (and of course, a character class containing a single character is just equivalent to that single character, so you might as well say !*?*).
You can instead use a single character class which contains all of the symbols you want to catch. As soon as you see one next to an alphanumeric character, you are done, so you don't need for the regex to match the entire input line.
grep -v '[[:alnum:]][][!"#$%&'"'"'()*+,./:;<=>?@\^_`{|}~]' filename
Also notice how the expression needs to be in single quotes in order for the shell not to interfere with the many metacharacters here. In order for a single-quoted string to include a literal single quote, I temporarily break out into a double-quoted string; see here for an explanation (I call this "seesaw quoting").
In a character class, if the class needs to include ], it needs to be at the beginning of the enumerated list; for symmetry and idiom, I also moved [ next to it.
Moreover, as pointed out by Jonathan Leffler, a POSIX character class name needs to be inside a character class; so to match one character belonging to the [:alnum:] named set, you say [[:alnum:]]. (This means you can combine sets, so [-[:alnum:].] covers alphanumerics plus dash and period.)
If you need to constrain this to match only on the first field, change the [[:alnum:]] to ^[[:alnum:]]\+.
Not realizing that a*b*c* matches anything is a common newbie error. You want to avoid writing an expression where all elements are optional, because it will match every possible string. Focus on what you want to match (the long list of punctuation characters, in your case) and then maybe add optional bits of context around it if you really need to; but the fewer of these you need, the faster it will run, and the easier it will be to see what it does. As a quick rule of thumb, a*bc* is effectively precisely equivalent to just b -- leading or trailing optional expressions might as well not be specified, because they do not affect what is going to be matched.
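To make the two points concrete, here is a small Python sketch. It is an illustration only: Python's re module has no POSIX classes such as [[:alnum:]], so ASCII ranges and string.punctuation stand in for them, and keep_line is a name invented for the example. It shows that an all-optional pattern matches any input, and that a single character class next to an alphanumeric is enough to detect the unwanted lines.

```python
import re
import string

# Every element is optional, so this finds an (empty) match in any string.
ALL_OPTIONAL = re.compile(r"a*b*c*")

# All ASCII punctuation except "-", which is acceptable inside a word.
PUNCT = string.punctuation.replace("-", "")

# One alphanumeric immediately followed by one unwanted punctuation mark.
INTERNAL_PUNCT = re.compile("[A-Za-z0-9][" + re.escape(PUNCT) + "]")


def keep_line(line):
    """True if the line would survive the grep -v style filter."""
    return INTERNAL_PUNCT.search(line) is None
```

With this filter the line "O'Reilly Books 23" is dropped as well, because the apostrophe follows an alphanumeric; whether that is wanted depends on the data.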

How to make a Palindrome with a sed command?

I'm trying to find the code that searches all palindromes in a dictionary file
This is what I have at the moment, which is wrong:
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
Can somebody explain the code as well?
Found the right answer.
sed -n '/^\([a-z]\)\([a-z]\)\2\1$/p' /usr/share/dict/words
I have no idea why I used -.
I also don't have an explanation for the \ after each group.
You can use the grep command as explained here
grep -w '^\(.\)\(.\).\2\1'
Explanation: the grep command matches any three characters using \(.\)\(.\). and then checks whether the second character and then the first character occur again.
The above grep command will find only 5-letter palindromes.
An extended version is proposed on that page as well; it works correctly for the first line but then crashes... there is surely something good to keep and maybe adapt.
Guglielmo Bondioni proposed a single RE that finds all palindromes up to 19 characters long using 9 subexpressions and 9 back-references:
grep -E -e '^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$' file
You can extend this further as much as you want :)
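A quick way to see both the power and the limit of this expression is to try it in Python, whose re module also supports back-references. This is a sketch, not Bondioni's original: the trailing $ is written explicitly here because without an end anchor every group can match the empty string and the pattern accepts any line, and is_short_palindrome is a name made up for the example.

```python
import re

# Nine optional captured characters, an optional middle character,
# then the same nine captures in reverse order, anchored at both ends.
BOUNDED = re.compile(
    r"^(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?)(.?).?\9\8\7\6\5\4\3\2\1$"
)


def is_short_palindrome(word):
    """True for non-empty palindromes of at most 19 characters."""
    return bool(word) and BOUNDED.match(word) is not None
```

A 20-character palindrome is rejected, since it would need a tenth group; each extra pair of characters costs one more group and one more back-reference.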
Perl to the rescue:
perl -lne 'print if $_ eq reverse' /usr/share/dict/words
Hate to say it, but while regex may be able to cook your breakfast, I don't think it can find a palindrome. According to the all-knowing Wikipedia:
In the automata theory, a set of all palindromes in a given alphabet is a typical example of a language that is context-free, but not regular. This means that it is impossible for a computer with a finite amount of memory to reliably test for palindromes. (For practical purposes with modern computers, this limitation would apply only to incredibly long letter-sequences.)
In addition, the set of palindromes may not be reliably tested by a deterministic pushdown automaton which also means that they are not LR(k)-parsable or LL(k)-parsable. When reading a palindrome from left-to-right, it is, in essence, impossible to locate the "middle" until the entire word has been read completely.
So a regular expression won't be able to solve the problem based on the problem's nature, but a computer program (or sed examples like @NeronLeVelu or @potong) will work.
explanation of your code
sed -rn '/^([a-z])-([a-z])\2\1$/p' /usr/share/dict/words
This selects and prints lines that match:
a lowercase letter at the start of the line, followed by -, followed by another lowercase letter (possibly the same as the first), followed by the second letter again (\2), followed by the first letter again (\1), and then nothing else (end of line). In short: Letter1-Letter2Letter2Letter1.
sample:
a-bba
a is first letter
b second letter
b is \2
a is \1
But it is a bit strange for any real work, unless it comes from a very specific dictionary (limited to such combinations, for example).
This might work for you (GNU sed):
sed -r 'h;s/[^[:alpha:]]//g;H;x;s/\n/&&/;ta;:a;s/\n(.*)\n(.)/\n\2\1\n/;ta;G;/\n(.*)\n\n\1$/IP;d' file
This copies the original string(s) to the hold space (HS), then removes everything but alpha characters from the string(s) and appends this to the HS. The second copy is then reversed and the current string(s) and the reversed copy compared. If the two strings are equal then the original string(s) is printed out otherwise the line is deleted.
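The steps described above (strip non-letters, reverse, compare ignoring case) can be sketched in Python as follows. It is an approximation rather than a translation of the sed script: str.isalpha also accepts non-ASCII letters, while [:alpha:] depends on the locale, and requiring at least one letter is an assumption added here so that lines without letters are not reported.

```python
def is_palindrome(line):
    """Compare the letters of the line with their reverse, ignoring case."""
    # Keep letters only, lowercased for a case-insensitive comparison.
    letters = [ch.lower() for ch in line if ch.isalpha()]
    # Assumption: a line with no letters is not treated as a palindrome.
    return bool(letters) and letters == letters[::-1]
```

Used as a filter over a word list, this plays the same role as the Perl one-liner earlier in the thread, with punctuation and case ignored.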

Linux tr command with punctuation

I need to use the tr command to translate for ROT13, (moving along 13 characters in the alphabet) for both upper and lower case
This is what I have come up with
tr "A-Za-z" "N-ZA-Mn-za-m"
However it now also needs to translate for the punctuation characters.
I've seen someone mention that
[A-Za-z0-9 _.,!"'/$]*
would help me, but I honestly have no clue how to add this into my code.
I am completely new to linux!
It depends on exactly how you define "rot13". I believe this is sufficient:
http://www.linuxjournal.com/article/2563
If you read the International Obfuscated C Code Contest
(ftp://ftp.uu.net./pub/ioccc/), you frequently see that part of the
hints are coded by a method called rot13. rot13 is a Caesar cypher,
i.e., a cypher in which all letters are shifted some number of places.
For example, a becomes b, b becomes c, ..., y becomes z, and z becomes
a. In rot13 each letter is shifted 13 places. It is a weak cypher, and
to decipher it, you can use rot13 again. You can also use tr to read
the text in this way:
tr a-zA-Z n-za-mN-ZA-M
Note, too, that the quotes (") are only needed if a string argument contains a blank or another character that is special to the shell. Since your tr arguments contain none, the quotes aren't needed. These two commands are functionally identical: tr "A-Za-z" "N-ZA-Mn-za-m" and tr A-Za-z N-ZA-Mn-za-m
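To see what the tr mapping does to punctuation, here is the same translation table sketched in Python (rot13 is a name chosen for the example). Characters that are not in the table, such as punctuation, digits and spaces, pass through unchanged, which is also how tr treats characters outside its first set.

```python
import string

UPPER = string.ascii_uppercase
LOWER = string.ascii_lowercase

# Same mapping as: tr A-Za-z N-ZA-Mn-za-m
ROT13 = str.maketrans(
    UPPER + LOWER,
    UPPER[13:] + UPPER[:13] + LOWER[13:] + LOWER[:13],
)


def rot13(text):
    """Shift ASCII letters by 13 places; leave everything else alone."""
    return text.translate(ROT13)
```

Applying rot13 twice gives back the original text, as the quoted article says.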

Resources