String Range forward and backward lookaround - string

I am trying to write a script that gets input from a user and returns the input in a formatted area. I have been using the string range function, but it cuts the input at exactly the range that I give, even in the middle of a word. Is there any way to look around the specified range to find the nearest space character and cut the input at that location?
For example, if I have the input of:
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris
My current string range function formats the input with \r\n as such:
Lorem ipsum dolor sit amet, consectetur a
dipisicing elit, sed do eiusmod tempor in
cididunt ut labore et dolore magna aliqua
. Ut enim ad minim veniam, quis nostrud e
xercitation ullamco laboris
As you can see, the word adipisicing on line 1 and the word incididunt on line 2 have been cut off. I am looking for a way to find the closest space to that location. For line 1 the cut would have been before the a; for line 2 it would have been before the i. In some cases it may be after the word.
Is it clear what I am looking for? Any assistance would be great!

The string range operation is pretty stupid; it doesn't know anything about the string it is splitting other than that it contains characters. To get smarter splitting, your best bet is probably an intelligently chosen regular expression:
set s "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod\
tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis\
nostrud exercitation ullamco laboris."
# Up to 40 characters, from word-start, to word-start or end-of-string
set RE {\m.{1,40}(?:\m|\Z)}
# Extract the split-up list of "lines" and print them as lines
puts [join [regexp -all -inline $RE $s] "\n"]
This produces this output for me:
Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris.
Implementing full justification by inserting spaces is left as an exercise for the reader (because it's really quite a lot harder than greedy line splitting!)
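For comparison, the "look backward from the cut point for the nearest space" idea from the question can be sketched without a regular expression. This is a minimal Python sketch rather than Tcl, and the function name wrap_at_spaces and the width of 40 are assumptions made for illustration:

```python
def wrap_at_spaces(text, width):
    """Greedy line splitting: cut each line at the last space at or before `width`."""
    lines = []
    rest = text.strip()
    while len(rest) > width:
        # Look backward from the cut point for the nearest space.
        cut = rest.rfind(" ", 0, width + 1)
        if cut == -1:
            # No space before the limit (a very long word): look forward instead.
            cut = rest.find(" ", width)
            if cut == -1:
                break  # the remainder is a single unbreakable word
        lines.append(rest[:cut])
        rest = rest[cut:].lstrip()
    if rest:
        lines.append(rest)
    return lines


sample = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
          "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
          "ad minim veniam, quis nostrud exercitation ullamco laboris")
wrapped = wrap_at_spaces(sample, 40)
print("\n".join(wrapped))
```

Falling back to a forward search only when no earlier space exists covers the "in some cases it may be after the word" situation mentioned in the question.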

The textutil::adjust module in tcllib is what you need:
package require textutil::adjust
set line "Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris"
set formatted [textutil::adjust::adjust $line -length 41]
puts $formatted
Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna
aliqua. Ut enim ad minim veniam, quis
nostrud exercitation ullamco laboris
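Other languages ship a comparable ready-made helper. As an illustration only (Python, not Tcl, and not part of tcllib), the standard library's textwrap module does the same kind of greedy word wrapping; the width of 40 here is an assumed value:

```python
import textwrap

line = ("Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do "
        "eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim "
        "ad minim veniam, quis nostrud exercitation ullamco laboris")

# fill() wraps on whitespace and joins the resulting lines with newlines.
formatted = textwrap.fill(line, width=40)
print(formatted)
```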


How to replace word in text with link (Pug)?

I am a complete beginner in Pug and I'm trying to solve the following task: every occurrence (in the text) of the following words shall become a link to the destination defined below:
sed -> https://google.com
liq -> https://facebook.com
This works as I expected, but keeps the anchor tag as a string.
- var str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Pulvinar elementum integer sed neque volutpat ac. Facilisis liq odio morbi quis commodo odio aenean . Vel facilisis volutpat liq velit. Viverra aliquet liq sit amet tellus cras adipiscing sed.";
- var url = '#[a(href="https://google.com") sed]'; //I tried with these ones too - var url = 'sed'; var url = 'a(href="https://google.com)sed'
- var res = str.replace(/sed/g, url);
p #{res}
Here is my latest attempt:
mixin link(href, name)
a(href=href target!=attributes.target )= name
//+link('https://google.com', 'Google')(target="blank")
- var str = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Pulvinar elementum integer sed neque volutpat ac. Facilisis liq odio morbi quis commodo odio aenean . Vel facilisis volutpat liq velit. Viverra aliquet liq sit amet tellus cras adipiscing sed.";
- var link = +link('https://google.com', 'Google')(target="blank");
- var res = str.replace(/sed/g, link);
p #{res}
I've never seen Pug mixins used within javascript code blocks like that before. I'm surprised it even outputs the anchor tag at all.
That being said, you can use the unescaped string interpolation syntax (!{variable}) instead of the regular string interpolation syntax (#{variable}) to get the link to render as a link.
In your case:
p !{res}
But keep in mind this word of warning from the Pug documentation:
Caution
Keep in mind that buffering unescaped content into your templates can be mighty risky if that content comes fresh from your users. Never trust user input!
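One way to reduce that risk is to escape the text before inserting the anchor tags, so that only the markup you generate yourself is left unescaped. The sketch below shows the idea in Python rather than Pug/JavaScript. The linkify helper and the whole-word matching are assumptions for illustration; the original code used a plain /sed/g replacement:

```python
import html
import re

# Word-to-URL table taken from the question.
LINKS = {"sed": "https://google.com", "liq": "https://facebook.com"}


def linkify(text, links=LINKS):
    """Escape the text first, then wrap whole-word matches in anchor tags."""
    escaped = html.escape(text)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, links)) + r")\b")

    def to_anchor(match):
        word = match.group(1)
        href = html.escape(links[word], quote=True)
        return '<a href="{}">{}</a>'.format(href, word)

    return pattern.sub(to_anchor, escaped)


print(linkify("integer sed neque <b> liq odio"))
```

The result could then be passed to the template and rendered with !{res}. Note that this simple approach would misfire if a link word happened to match part of an HTML entity name such as amp or lt.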

using AWK to remove characters match with html tag (not regex) [duplicate]

I want to use awk to remove every HTML tag matching this regex: /[<.*.>]/, wherever it is found in any field. I've been trying to make it work with sub or substr, but I am unable to find the correct logic for this.
Input text:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation<br/><div style="margin-top:6px">< b>veniam:< /b>< /div> <br/><div style="margin-top:6px">< b>Confort:< /b></div>Comenzi volan; Cruise-control; Servodirectie; <br/>
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;
If you're not really parsing HTML but instead just want to remove everything between each <...> pair in a text file, then that'd be this with GNU awk for multi-char RS:
$ awk -v RS='<[^>]+>' -v ORS= '1' file
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitationveniam: Confort:Comenzi volan; Cruise-control; Servodirectie;
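The same "delete everything between each <...> pair" idea can be sketched with a regular-expression substitution. This is a Python illustration, not awk, and like the awk command it is not an HTML parser; the strip_tags name is an assumption:

```python
import re

# Same pattern as the awk RS above: "<", one or more non-">" characters, ">".
TAG = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Remove every <...> run; this does not parse HTML."""
    return TAG.sub("", text)


sample = ('exercitation<br/><div style="margin-top:6px">'
          '< b>veniam:< /b>< /div> <br/>')
print(strip_tags(sample))
```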

Using cheerio to select text from paragraph tags (<p>) with no class

I'm using cheerio (cheeriojs) to scrape content from a site which has the following HTML layout.
<div class="foo"></div>
<p></p>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<br><br>
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat.
<br><br>
</p>
I'm able to reach this content using the each function from the docs, traversing the DOM from each element with the "foo" class like so.
$('.foo').each(function(i, el){
//Do something...
$(this).next().next().text()
});
From here I'm able to simply convert this content to a string and retrieve it as I wish; however, the text comes back as one long unformatted string (i.e. a long essay of paragraphs without spacing between the respective paragraphs). Is there a way or trick to retrieve the content whilst keeping its formatting?
I've attempted the following:
var fruits = [];
$('.foo').each(function(i, el){
fruits[i] = $(this).next().next().text();
});
as a way to get the current tag and store its text in an array; however, this isn't much different from my earlier code. I'm assuming this would be possible if the <br> tags had some id or classes, but they don't. Is there a way I can directly target these (<br>) tags to get the text and retrieve it in proper format (i.e. with spacing between paragraphs)? At this juncture, I must ask those who are more familiar and experienced with cheerio whether what I'm trying to do in this particular case is even feasible with cheerio. I'm open to pursuing other routes, and would welcome recommendations for modules/libraries that could make this an easier task.
To recap: I want to retrieve all text between the second <p> tags, maintaining format and spacing as seen on rendered HTML.
Thanks in advance.
If you ask for .text() it will strip formatting. If you ask for .html() it'll return all the content, preserving all the tags.
So change this:
fruits[i] = $(this).next().next().text();
To this:
fruits[i] = $(this).next().next().html();
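Once the inner HTML is available as a string, the <br> runs can serve as paragraph separators. The following is a minimal sketch of that post-processing step in Python rather than JavaScript; the split_paragraphs helper is hypothetical and assumes the only markup inside the <p> is <br> tags:

```python
import re

# One or more consecutive <br> tags (also <br/> or <br />), with surrounding whitespace.
BR_RUN = re.compile(r"\s*(?:<br\s*/?>\s*)+", re.IGNORECASE)


def split_paragraphs(inner_html):
    """Split an inner-HTML string into paragraphs at runs of <br> tags."""
    return [part.strip() for part in BR_RUN.split(inner_html) if part.strip()]


inner = "\nFirst paragraph.\n<br><br>\nSecond paragraph.\n<br><br>\n"
paragraphs = split_paragraphs(inner)
print("\n\n".join(paragraphs))
```

Joining the pieces with a blank line restores the spacing seen in the rendered page.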

delete lines not (containing pattern and 2 lines above pattern)

In Vim I am trying to delete every line in a file except the lines containing a pattern and the 2 lines above each of them. I tried:
:g!/pattern/.-2 d
But it says: invalid range...
What to do?
The command below looks for lines that don't match pattern and deletes them and the two lines above:
:g!/pattern/-2,.d
The command below looks for lines that don't match pattern and deletes the line located two lines above:
:g!/pattern/-2d
Ranges always go downwards, so we use the upper address (-2) first and the lower one (.) second.
That said, you'll most likely get an error if one of the selected lines doesn't have two lines above it.
Then how should I delete all the lines except lines 4, 5 and 6 in the following file: line 1 line 2 line 3 line 4 line 5 line containing pattern line 7?
Like this:
:v/\v(.*\n){,2}.*pattern.*/d
This matches if:
the line contains the pattern, or
next line contains the pattern, or
the 2nd next line contains the pattern.
These lines are kept. All other lines (:v) are deleted.
Example:
"Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat.
Duis aute irure dolor in reprehenderit
in voluptate velit esse cillum dolore eu
fugiat nulla pariatur. Excepteur sint
occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim
id est laborum."
Run:
:v/\v(.*\n){,2}.*labor.*/d
Result:
consectetur adipiscing elit, -2
sed do eiusmod tempor incididunt -1
ut labore et dolore magna aliqua. <-0 labor(e)
Ut enim ad minim veniam, quis nostrud -1
exercitation ullamco laboris nisi ut <-0 labor(is)
occaecat cupidatat non proident, sunt in -2
culpa qui officia deserunt mollit anim -1
id est laborum." <-0 labor(um)
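The selection rule behind that command (keep a line if it, or one of the next two lines, contains the pattern) can also be written out explicitly. This is a sketch in Python rather than Vim script, and the keep_with_context_above name and the above parameter are assumptions for illustration:

```python
import re


def keep_with_context_above(lines, pattern, above=2):
    """Keep each matching line plus up to `above` lines before it."""
    rx = re.compile(pattern)
    keep = set()
    for i, line in enumerate(lines):
        if rx.search(line):
            # Mark the matching line and the lines directly above it.
            keep.update(range(max(0, i - above), i + 1))
    return [line for i, line in enumerate(lines) if i in keep]


lines = ["line 1", "line 2", "line 3", "line 4", "line 5",
         "line containing pattern", "line 7"]
print(keep_with_context_above(lines, "pattern"))
```

Clamping the start of the range at 0 avoids the problem of a match that has fewer than two lines above it.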

Characters prone to word-wrapping

Browsers, when resized, word-wrap text on the fly, right?
Which characters, besides normal spaces, allow a line to be broken?
I know soft hyphens and zero-width spaces also do this. But what others?
e.g.
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do
eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad
minim veniam, quis nostrud exercitation ullamco laboris nisi ut
aliquip ex ea commodo consequat. Duis aute irure dolor in
reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla
pariatur. Excepteur sint occaecat cupidatat non proident, sunt in
culpa qui officia deserunt mollit anim id est laborum.
When resized:
Lorem ipsum dolor sit amet, consectetur
adipisicing elit, sed do eiusmod tempor
incididunt ut labore et dolore magna aliqua.
Ut enim ad minim veniam, quis nostrud
exercitation ullamco laboris nisi ut aliquip
ex ea commodo consequat. Duis aute irure
dolor in reprehenderit in voluptate velit
esse cillum dolore eu fugiat nulla pariatur.
Excepteur sint occaecat cupidatat non proident,
sunt in culpa qui officia deserunt mollit
anim id est laborum.
The following is from the Line Breaking and Word Boundaries section in the latest W3C CSS3 Draft: http://www.w3.org/TR/css3-text/#line-breaking
In most writing systems, in the absence of hyphenation a line break occurs only at word boundaries. Many writing systems use spaces or punctuation to explicitly separate words, and line break opportunities can be identified by these characters. Scripts such as Thai, Lao, and Khmer, however, do not use spaces or punctuation to separate words. Although the zero width space (U+200B) can be used as an explicit word delimiter in these scripts, this practice is not common. As a result, a lexical resource is needed to correctly identify break points in such texts.
In several other writing systems, (including Chinese, Japanese, Yi, and sometimes also Korean) a line break opportunity is based on character boundaries, not word boundaries. In these systems a line can break anywhere except between certain character combinations. Additionally the level of strictness in these restrictions can vary with the typesetting style.
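To make the idea of explicit break opportunities concrete, here is a deliberately simplified sketch in Python. It is not the algorithm browsers use: the BREAK_AFTER set and the break_opportunities function are assumptions for illustration, covering only the space, the zero width space (U+200B), the soft hyphen (U+00AD) and the ordinary hyphen, and ignoring the dictionary-based and character-based rules described above:

```python
# Characters after which this simplified model allows a line break.
BREAK_AFTER = {" ", "\u200b", "\u00ad", "-"}


def break_opportunities(text):
    """Return the indices at which a line could start under this toy model."""
    # The last character is skipped: there is nothing to wrap after it.
    return [i + 1 for i, ch in enumerate(text[:-1]) if ch in BREAK_AFTER]


print(break_opportunities("ab cd\u200bef-gh"))
```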
