How can I repeatedly match from A until B in VIM? - vim

I need to get all text between <Annotation> and </Annotation>, where a word MATCH occurs. How can I do it in VIM?
<Annotation about="MATCH UNTIL </Annotation> " timestamp="0x000463e92263dd4a" href=" 5raS5maS90ZWh0YXZha29rb2VsbWEvbGFza2FyaS8QyrqPk5L9mAI">
<Label name="las" />
<Label name="_cse_6sbbohxmd_c" />
<AdditionalData attribute="original_url" value="MATCH UNTIL </Annotation> " />
</Annotation>
<Annotation about="NO MATCH" href=" Cjl3aWtpLmhlbHNpbmtpLmZpL2Rpc3BsYXkvbWF0aHN0YXRLdXJzc2l0L0thaWtraStrdXJzc2l0LyoQh_HGoJH9mAI">
<Label name="_cse_6sbbohxmd_c" />
<Label name="courses" />
<Label name="kurssit" />
<AdditionalData attribute="original_url" value="NO MATCH" />
</Annotation>
<Annotation about="MATCH UNTIL </ANNOTATION> " score="1" timestamp="0x000463e90f8eed5c" href="CiZtYXRoc3RhdC5oZWx zaW5raS5maS90ZWh0YXZha29rb2VsbWEvKhDc2rv8kP2YAg">
<Label name="_cse_6sbbohxmd_c" />
<Label name="exercises_without_solutions" />
<Label name="tehtäväkokoelma" />
<AdditionalData attribute="original_url" value="MATCH UNTIL </ANNOTATION>" />
</Annotation>

First, a disclaimer: Any attempt to slice and dice XML with regular expressions is fragile; a real XML parser would do better.
The pattern:
\(<Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>\)\@<=\(\(<\/Annotation\)\@!\_.\)\{-}"MATCH\_.\{-}\(<\/Annotation>\)\@=
Let's break it down...
Group 1 is <Annotation\(\s*\w\+="[^"]\{-}"\s\{-}\)*>. It matches the start-tag of the Annotation element. Group 2, which is embedded in Group 1, matches an attribute and may be repeated 0 or more times.
Group 2 is \s*\w\+="[^"]\{-}"\s\{-}. Most of these pieces are commonly used; the most unusual is \{-}, which means non-greedy repetition (*? in Perl-compatible regular expressions). The non-greedy whitespace match at the end is important for performance; without it, Vim will try every possible way to split the whitespace between attributes between the \s* at the end of Group 2 and the \s* at the beginning of the next occurrence of Group 2.
Group 1 is followed by \@<=. This is a zero-width positive look-behind. It prevents the start-tag from being included in the matched text (e.g., for s///).
Group 3 is \(<\/Annotation\)\@!\_.. It includes Group 4, which matches the beginning of the Annotation end-tag. The \@! is a zero-width negative look-ahead and \_. matches any character (including newlines). Together, this group matches at any character except where the Annotation end-tag starts. Group 3 is followed by a non-greedy repetition marker \{-} so that it matches the smallest block of text before MATCH. If you were to use \_. instead of Group 3, the matched text could include the end-tag of an Annotation element that did not include MATCH and continue through into the next Annotation element with MATCH. (Try it.)
The next bit is straightforward: Find MATCH and a minimal number of other characters before the end-tag.
Group 5 is easy: It's the end tag. \@= is a zero-width positive look-ahead, which is included here for the same reason as the \@<= for the start-tag. We have to repeat <\/Annotation rather than use \4 because groups with zero-width modifiers aren't captured.
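For readers who want to try the same idea outside Vim, here is a minimal Python sketch of the "any character, as long as the end-tag does not start here" technique described above. It is not a translation of the Vim pattern: Python's re module only allows fixed-width look-behind, so the start-tag is matched normally and the body is taken from a capture group instead. The sample text is a shortened, made-up variant of the question's data.

```python
import re

# Start tag with quoted attributes, then a body that may not cross an
# end tag before reaching "MATCH, then a look-ahead for the end tag.
BLOCK = re.compile(
    r'<Annotation\b(?:\s+\w+="[^"]*")*\s*>'   # start tag (not captured)
    r'((?:(?!</Annotation).)*?"MATCH.*?)'     # body containing "MATCH
    r'(?=</Annotation>)',                     # stop right before the end tag
    re.DOTALL,                                # let . match newlines, like \_.
)

# Hypothetical sample: one element without MATCH, one with it.
sample = '''<Annotation about="NO MATCH">
<Label name="courses" />
<AdditionalData attribute="original_url" value="NO MATCH" />
</Annotation>
<Annotation about="MATCH here">
<Label name="las" />
<AdditionalData attribute="original_url" value="MATCH here" />
</Annotation>'''

bodies = [m.group(1) for m in BLOCK.finditer(sample)]
print(bodies)
```

As with the Vim version, this remains fragile on real XML; it is only meant to show why the negative look-ahead keeps a match from leaking into the next element.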

Does it have to be done within vim? Could you cheat, and open a second window where you pipe something into more/less that tells you what line number to go to within vim?
-- edit --
I have never done a multi-line match/search in vi[m]. However, to cheat in another window:
perl -n -e 'if ( /<tag/ .. /<\/tag/)' -e '{ print "$.:$_"; }' file.xml | less
will show the elements/blocks for "tag" (or other longer matching names), with line numbers, in less, and you can then search for the other text within each block.
Close enough?
-- edit --
within "less", type
/MATCH
to search for occurrences of MATCH. On the left margin will be the line number where that instance (within the targeted element/tags) is.
within vi[m], type
:n
where "n" is the desired line number.
Of course, if what you really wanted to do was some kind of search/yank/replace, it's more complicated. At that point, awk / perl / ruby (or something similar which meets your tastes ... or xsl?) is really the tool you should be using for the transformation.
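If Perl is not at hand, the same "print the numbered lines of each block" trick can be sketched in Python. The function name numbered_blocks and the default tag are illustrative assumptions, and unlike the Perl one-liner this version uses a word boundary, so it does not match longer tag names.

```python
import re

def numbered_blocks(lines, tag="Annotation"):
    """Yield (line_number, line) for every line inside <tag ...> ... </tag>."""
    start = re.compile(r"<%s\b" % re.escape(tag))
    end = re.compile(r"</%s\b" % re.escape(tag))
    inside = False
    for lineno, line in enumerate(lines, start=1):
        if not inside and start.search(line):
            inside = True
        if inside:
            yield lineno, line
            # Like Perl's `..`, the end condition is tested on the same line.
            if end.search(line):
                inside = False

# Hypothetical input; in practice the lines would come from the XML file.
demo = ["x", "<Annotation a='1'>", "MATCH", "</Annotation>", "y"]
for lineno, line in numbered_blocks(demo):
    print("%d:%s" % (lineno, line))
```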

Related

I would like to use regex to retrieve text between two words. This text has XML tags but isn't XML

For example, I have a bunch of unparsed text from a command that I am looping through, and I would like to get the text between two tags. I've tried (.*?) \([</Location>])$ and nothing happened. Not a single thing. So in this body of text, for example, I need the paths inside the <Location> element:
<?xml version="1.0" encoding="utf-16"?><AppMgmtDigest xmlns="http://schemas.microsoft.com/SystemCenterConfigurationManager/2009/AppMgmtDigest" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"><Application AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="Application_0487d42d-94f8-4424-bd10-693005c74d9c" Version="11"><DisplayInfo DefaultLanguage="en-US"><Info Language="en-US"><Title>Update BeyondTrust</Title><ReleaseDate>2022-01-14</ReleaseDate></Info></DisplayInfo><DeploymentTypes><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"/></DeploymentTypes><Title ResourceId="Res_163096156">Update BeyondTrust</Title><ReleaseDate ResourceId="Res_2088816488">2022-01-14</ReleaseDate><Owners><User Qualifier="LogonName" Id="Admin.MH"/></Owners><Contacts><User Qualifier="LogonName" Id="Admin.MH"/></Contacts></Application><DeploymentType AuthoringScopeId="ScopeId_844389FD-D138-4D2A-BF1E-BFEAB11391B5" LogicalName="DeploymentType_3f86c80a-f4d6-4c63-b066-7c030730456a" Version="11"><Title ResourceId="Res_1162077075">Update BeyondTrust</Title><DeploymentTechnology>GLOBAL/ScriptDeploymentTechnology</DeploymentTechnology><Technology>Script</Technology><Hosting>Native</Hosting><Installer Technology="Script"><ExecutionContext>System</ExecutionContext><Contents><Content ContentId="Content_27d453bb-3439-4440-a90b-ddd731e5a4a7" Version="1"><File Name="PrivilegeManagementConsoleAdapter_x64.msi" Size="7425536"/><File Name="PrivilegeManagementForWindows_x64.msi" Size="21287936"/><File Name="remediate.ps1" Size="3020"/><Location>\\pennoni.com\util\Software\BeyondTrust\PMCloud\application_sccm\</Location><PeerCache>true</PeerCache><OnFastNetwork>Download</OnFastNetwork><OnSlowNetwork>DoNothing</OnSlowNetwork></Content></Contents><DetectAction><Provider>Local</Provider><Args><Arg Name="ExecutionContext" Type="String">System</Arg><Arg Name="MethodBody" 
Type="String"><?xml version="1.0" encoding="utf-16"?>
Basically, in a body of text, I want to retrieve the text between
<Location> pathThatINeed </Location>
here is a solution that uses regex ONLY for the -split operator. this presumes your sample line of text is stored in $Test. [grin]
the code ...
(($Test -split '<location>')[1] -split '</location>')[0]
the result ...
\\pennoni.com\util\Software\BeyondTrust\PMCloud\application_sccm\
Lee Dailey's helpful answer offers a pragmatic solution that is easy to conceptualize.
To offer a single-operation alternative using the regex-based -replace operator:
# $text is assumed to contain the (incomplete) input XML text.
$text -replace '^.*<location>(.*?)</location>.*$', '$1'
Note: If the regex doesn't match the input, the input is returned as-is.
For an explanation of the regex and the ability to experiment with it, see this regex101.com page.
As an aside: As MikeSh notes, in the case at hand the regex can be simplified:
$text -replace '.*<location>(.*)</location>.*', '$1'
The start and end anchors, ^ and $, aren't strictly necessary, because the .* on either end implicitly ensures that a match will capture the entire input string, which is necessary for the logic of the replace operation - however, I've added them for conceptual clarity.
If the assumption is that only one location element is present in the input, (.*), as a greedy subexpression (one that matches as much as possible) works fine, because when the regex engine backtracks to the last instance of </location>, it'll by definition be the only one.
Generally, however, if the intent is to match non-greedily, i.e. only to the next, not the last instance of the subexpression that follows, (.*?) is required - in this case, it is a more readable alternative to [^<]* (match everything up to the next <)
The following example shows when (.*?) is required:
# !! WRONG: Greedy subexpression matches from the start of the
# !! *first* opening tag through the end of the *last* one:
# !! -> 'a</el> <el>b'
'<el>a</el> <el>b</el>' -replace '<el>(.*)</el>', '$1'
# OK: Non-greedy subexpression matches only up to the *next*
# closing tag. However, the regex now matches *twice*.
# -> 'a b'
'<el>a</el> <el>b</el>' -replace '<el>(.*?)</el>', '$1'
# OK: Start the regex with a greedy match-anything subexpression
# in order to limit matching to the *last* element.
# Note: For the reasons explained above, (.*) will
# *technically* do here, but using (.*?) for *conceptual*
# reasons - to signal the intent - is advisable.
# -> 'b'
'<el>a</el> <el>b</el>' -replace '.*<el>(.*?)</el>', '$1'
# OK: Start the regex with a non-greedy match-anything subexpression
# and end it with a greedy one in order to limit matching
# to the *first* element.
# -> 'a'
'<el>a</el> <el>b</el>' -replace '.*?<el>(.*?)</el>.*', '$1'
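The greedy versus non-greedy behavior shown above is not specific to PowerShell. As a rough cross-check, the same four substitutions can be reproduced with Python's re.sub; the variable names are only illustrative.

```python
import re

s = "<el>a</el> <el>b</el>"

# Greedy: runs from the first opening tag to the last closing tag.
greedy = re.sub(r"<el>(.*)</el>", r"\1", s)

# Non-greedy: stops at the next closing tag, so the regex matches twice.
lazy = re.sub(r"<el>(.*?)</el>", r"\1", s)

# Leading greedy .* limits the match to the last element.
last_only = re.sub(r".*<el>(.*?)</el>", r"\1", s)

# Leading lazy .*? plus trailing greedy .* limits it to the first element.
first_only = re.sub(r".*?<el>(.*?)</el>.*", r"\1", s)

print(greedy, lazy, last_only, first_only, sep=" | ")
```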
That should do the trick:
(?<=<Location>).*?(?=</Location>)
Output:
THisismyDesiredText
Explanation:
(?<=): Positive Lookbehind
.*?: Matches any character between zero and unlimited times, as few times as possible (lazy)
(?=): Positive Lookahead
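A small sketch of the look-behind/look-ahead approach, written in Python rather than PowerShell. The input string and the UNC path in it are hypothetical stand-ins for the much longer text in the question.

```python
import re

# Hypothetical, shortened input; the path is a placeholder.
xml_like = r"<Content><Location>\\server\share\app</Location><PeerCache>true</PeerCache></Content>"

# The look-arounds assert the tags without including them in the match.
match = re.search(r"(?<=<Location>).*?(?=</Location>)", xml_like)
path = match.group(0) if match else None
print(path)
```

Because the tags are only asserted, group(0) is already the bare text between them and no capture group is needed.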

Strip characters to the left of a specific character in a pandas column

I have the following data:
key German
0 0:- Profile 1
1 1:- Archetype Realist*in
2 2:- RIASEC Code: R- Realistic
3 3:- Subline Deine Stärke? Du bleibst dir selber treu.
4 4:- Copy Dein Erfolg basiert auf deiner praktischen Ver...
In the "Key" column I would like to remove the numbers and colon dash which follows. This order is always the same (from the left). So for the first row I would like to remove "0:- ", and just leave "Profile 1". I am struggling to find the correct regex expression to do what I want. Originally I tried the following:
df_json['key'] = df_json['key'].map(lambda x: x.strip(':- ')[1])
However, this approach is too restrictive since there can be multiple words in the field.
I would like to use pd.Series.str.replace(), but I can't figure out the correct regular expression to achieve the desired results. Any help would be greatly appreciated.
With your shown samples, please try the following, which uses the replace function of pandas. In short: apply replace to the German column of the dataframe and use the regex ^[0-9]+:-\s+ to replace the matched prefix with an empty string.
df['German'].replace(r'(^[0-9]+:-\s+)', '', regex=True)
Explanation:
^[0-9]+: match starting digits followed by colon here.
:-\s+: Match colon, followed by - followed by 1 or more space occurrences.
What about just using pandas.Series.str.partition instead of regular expressions:
df['German'] = df['German'].str.partition()[2]
This would split the series on the 1st space only and grab the trailing part. As an alternative to partition, you could also just split:
df['German'] = df['German'].str.split(' ', n=1).str[1]
If regex is a must for you, maybe use a lazy quantifier to match up to the 1st space character:
df['German'] = df['German'].replace('^.*? +','', regex=True)
Where:
^ - Start line anchor.
.*? - Any 0+ (lazy) characters other than newline, up to;
' +' - 1+ literal space characters.
Here is an online demo
You need
df_json['key'] = df_json['key'].str.replace(r'^\d+:-\s*', '', regex=True)
See the regex demo and the regex graph:
Details:
^ - start of string
\d+ - one or more digits
: - a colon
- - a hyphen
\s* - zero or more whitespaces
Extract the non-whitespace (\S) and non-digit (\D) characters that sit immediately to the right of the unwanted prefix, using a look-behind. Note that \d in the look-behind matches a single digit only:
df['GermanFiltered']=df['German'].str.extract(r"((?<=^\d\:\-\s)\S+\D+)")
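To see what the anchored prefix pattern ^\d+:-\s* from the answers above does, here is a small sketch that uses only the standard re module, so it runs without pandas. The same pattern string can be passed to Series.str.replace with regex=True. The third row is an invented multi-digit case.

```python
import re

# One or more digits, a colon, a hyphen, then optional whitespace,
# anchored at the start of the string.
PREFIX = re.compile(r"^\d+:-\s*")

rows = [
    "0:- Profile 1",
    "1:- Archetype Realist*in",
    "12:- RIASEC Code: R- Realistic",  # hypothetical multi-digit key
]

cleaned = [PREFIX.sub("", row) for row in rows]
print(cleaned)
```

Because the pattern is anchored with ^, the "R- " later in the third row is left alone.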

How to remove "Certain" (not all) empty lines, or two consecutive empty lines in Notepad++

I know that there is a way to remove empty lines in Notepad++ through "Edit > Line Operations"; however, it either removes ALL empty lines (those that contain no characters at all), OR it also removes lines that contain only blank characters such as spaces.
My goal is to choose which empty lines get deleted. (For example, to delete double empty lines, where two consecutive lines are empty, but not single empty lines. I need to keep the single empty lines.)
Ideally, I would combine this with a second operation that deletes any line containing a specific "String", through a .String. search in "Regexp" mode.
Example:
</section>
<!--One line empty line -->
<section id="phone-number">
<parts level="1">
<part choice="NPXMA" name="NPXMA" desc_en="Phone Number - COUNTRY" desc_fr="Numéro de Téléphone - COUNTRY" costRetail="X.XX" costWholesale="4.50" period="monthly" applyCommission="true" visibleAtSignup="true" visibleAtPlanChange="true" />
<part choice="NSCA00" name="NPXMA" desc_en="Phone Number Setup Fee" desc_fr="frais d'activation de numéro" costRetail="X.XX" costWholesale="X.XX" period="once" applyCommission="false" visibleAtSignup="true" visibleAtPlanChange="true" />
<part choice="NPXMA" name="NPXMA" desc_en="Phone Number - United States" desc_fr="Numéro de Téléphone - États-Unis" costRetail="X.XX" costWholesale="X.XX" period="monthly" applyCommission="true" visibleAtSignup="true" visibleAtPlanChange="true" />
Can anyone help? Is there actually any solution for this?
Thanks
If I understood your question correctly, you can use \r\n to get the required results. It will make sure that it only removes double empty lines (with single ones, I suppose).
The below animated gif explains it.
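Outside Notepad++, the same clean-up can be sketched in Python. This assumes that "removing double empty lines" means collapsing any run of two or more blank lines into a single blank line while leaving lone blank lines untouched; the function name collapse_blank_runs is made up for the example.

```python
import re

# A line break followed by two or more lines that are empty or contain
# only spaces/tabs. Handles both \n and \r\n line endings.
BLANK_RUN = re.compile(r"(\r?\n)(?:[ \t]*\r?\n){2,}")

def collapse_blank_runs(text):
    """Collapse runs of 2+ blank lines into exactly one blank line."""
    return BLANK_RUN.sub(r"\1\1", text)

print(repr(collapse_blank_runs("a\n\n\nb")))   # double blank line collapsed
print(repr(collapse_blank_runs("a\n\nb")))     # single blank line kept
```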

Lua: Search a specific string

Hi all, I have tried all the string patterns and library functions but am still stuck.
I want to get the name of the director from the following string. I have tried string.match, but it matches from the first character it finds in the string.
the string is...
fixstrdirector = {id:39254,cast:[{id:15250,name:Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy,department:Directing,job:Director,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:12415,name:Richard 
Matheson,department:Writing,job:Story,profile_path:null},{id:57113,name:Dan Gilroy,department:Writing,job:Story,profile_path:null},{id:25210,name:Jeremy Leven,department:Writing,job:Story,profile_path:null},{id:17825,name:Shawn Levy,department:Production,job:Producer,profile_path:/7f2f8EXdlWsPYN0HPGcIlG21xU.jpg},{id:34970,name:Susan Montford,department:Production,job:Producer,profile_path:/1XJt51Y9ciPhkHrAYE0j6Jsmgji.jpg},{id:3183,name:Don Murphy,department:Production,job:Producer,profile_path:null},{id:34967,name:Rick Benattar,department:Production,job:Producer,profile_path:null},{id:1126348,name:Eric Hedayat,department:Production,job:Producer,profile_path:null},{id:186721,name:Ron Ames,department:Production,job:Producer,profile_path:null},{id:10956,name:Josh McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:57634,name:Mary McLaglen,department:Production,job:Executive Producer,profile_path:null},{id:23779,name:Jack Rapke,department:Production,job:Executive Producer,profile_path:null},{id:488,name:Steven Spielberg,department:Production,job:Executive Producer,profile_path:/cuIYdFbEe89PHpoiOS9tmo84ED2.jpg},{id:30,name:Steve Starkey,department:Production,job:Executive Producer,profile_path:null},{id:24,name:Robert Zemeckis,department:Production,job:Executive Producer,profile_path:/isCuZ9PWIOyXzdf3ihodXzjIumL.jpg},{id:531,name:Danny Elfman,department:Sound,job:Original Music Composer,profile_path:/pWacZpYPos8io22nEiim7d3wp2j.jpg},{id:18265,name:Mauro Fiore,department:Crew,job:Cinematography,profile_path:null},{id:54271,name:Dean Zimmerman,department:Editing,job:Editor,profile_path:null},{id:25365,name:Richard Hicks,department:Production,job:Casting,profile_path:null},{id:5490,name:David Rubin,department:Production,job:Casting,profile_path:null},{id:52088,name:Tom Meyer,department:Art,job:Production Design,profile_path:null}]}
I have tried string.match(fixstrdirector,"name:(.+),department:Directing")
but it returns everything from the first occurrence of name: onward, instead of just the director's name.
output:
Hope Davis,character:Aunt Debra,order:5,cast_id:10,profile_path:/aIHF11Ss8P0A8JUfiWf8OHPVhOs.jpg},{id:53650,name:Anthony Mackie,character:Finn,order:3,cast_id:11,profile_path:/5VGGJ0Co8SC94iiedWb2o3C36T.jpg},{id:19034,name:Evangeline Lilly,character:Bailey Tallet,order:2,cast_id:12,profile_path:/oAOpJKgKEdW49jXrjvUcPcEQJb3.jpg},{id:6968,name:Hugh Jackman,character:Charlie Kenton,order:0,cast_id:13,profile_path:/wnl7esRbP3paALKn4bCr0k8qaFu.jpg},{id:79072,name:Kevin Durand,character:Ricky,order:4,cast_id:14,profile_path:/c95tTUjx5T0D0ROqTcINojpH6nB.jpg},{id:234479,name:Dakota Goyo,character:Max Kenton,order:1,cast_id:15,profile_path:/7PU6n4fhDuFwuwcYVyRNVEZE7ct.jpg},{id:8986,name:James Rebhorn,character:Marvin,order:6,cast_id:16,profile_path:/ezETMv0YM0Rg6YhKpu4vHuIY37D.jpg},{id:930729,name:Marco Ruggeri,character:Cliff,order:7,cast_id:17,profile_path:/1Ox63ukTd2yfOf1LVJOMXwmeQjO.jpg},{id:19860,name:Karl Yune,character:Tak Mashido,order:8,cast_id:18,profile_path:/qK315vPObCNdywdRN66971FtFez.jpg},{id:111206,name:Olga Fonda,character:Farra Lemkova,order:9,cast_id:19,profile_path:/j1qabOHf3Pf82f1lFpUmdF5XvSp.jpg},{id:53176,name:John Gatins,character:Kingpin,order:10,cast_id:41,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:1126350,name:Sophie Levy,character:Big Sister,order:11,cast_id:42,profile_path:null},{id:1126351,name:Tess Levy,character:Little Sister,order:12,cast_id:43,profile_path:null},{id:1126352,name:Charlie Levy,character:Littlest Sister,order:13,cast_id:44,profile_path:null},{id:187983,name:Gregory Sims,character:Bill Panner,order:14,cast_id:45,profile_path:null}],crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay,profile_path:null},{id:53176,name:John Gatins,department:Writing,job:Screenplay,profile_path:/A2MqnSKVzOuBf8MVfNyve2h2LxJ.jpg},{id:17825,name:Shawn Levy
You're searching from the first occurrence of "name:" until the "department:Directing" with everything in between.
Instead, you need to restrict what can be between the two strings. Here for example I'm saying that the characters that make up the name can only be alphanumeric or a space:
string.match(fixstrdirector,"name:([%w ]+),department:Directing")
Alternatively, given that there's a comma separating the parameters, a better approach would be to search for "name:" followed by any characters other than a comma, followed by "department:Directing":
string.match(fixstrdirector,"name:([^,]+),department:Directing")
Of course that wouldn't work if the name had a comma in it!
Lua patterns provide the - modifier for tasks like the one above. As stated in PiL, Section 20.2:
The + modifier matches one or more characters of the original class.
It will always get the longest sequence that matches the pattern.
Like *, the modifier - also matches zero or more occurrences of
characters of the original class. However, instead of matching the
longest sequence, it matches the shortest one.
Next, when you use . in the pattern, it matches any character at all. Therefore, you'll get the result from the first occurrence of name: until ,department:Directing is found. Since you know that it is JSON data, you can match [^,] instead; that is, non-comma characters.
So, for your case try:
local tAllNames = {}
for sName in fixstrdirector:gmatch( "name:([^,]-),department:Directing" ) do
tAllNames[ #tAllNames + 1 ] = sName
end
and all your required names will be stored in the table tAllNames. An example of the above can be seen at codepad.
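The point that a shortest-match modifier alone does not help, because the match still starts at the first name:, can be checked with a short Python sketch. The crew string is a shortened excerpt of the question's data with the profile_path fields dropped.

```python
import re

# Shortened excerpt of the crew list from the question.
crew = ("crew:[{id:58726,name:Leslie Bohem,department:Writing,job:Screenplay},"
        "{id:17825,name:Shawn Levy,department:Directing,job:Director}]")

# Lazy "any character" still starts at the FIRST name: and runs until
# ,department:Directing, so it captures far too much.
too_wide = re.search(r"name:(.+?),department:Directing", crew).group(1)

# Restricting the capture to non-comma characters keeps it inside one field.
directors = re.findall(r"name:([^,]+),department:Directing", crew)

print(too_wide)
print(directors)
```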

Replace numbers in a string

I have some text files. These texts contain a string like this (a part of the text):
<abbr class="word p1"">dd</abbr>
<img src"D:\Images\1.png">
<abbr class="word p1">dd</abbr>
<img src"D:\ticket\t\1.png">
In each text file, the PNG name (as in D:\Images\1.png) is different, but it is always a number (from 1 to 114), for example 1, 2, 3, 10, ...
I want to replace the text D:\Images\[number].png with a specific text, for example:
string newtext=Replace("D:\Images\[number].png","Something");
How can i do this?
thanks.
Use a regular expression:
string newtext = Regex.Replace(text, @"(D:\\Images\\)\d+(\.png)", "$1Something$2");
It will replace the full match, including D:\Images\ and .png, so $1 and $2 put back what was captured by the parentheses, so that Something only replaces the digits.
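The capture-and-put-back idea can be tried quickly in Python as well. This is only an illustrative sketch, not the C# code: the sample text is adapted from the question, and \g<1> and \g<2> play the role of $1 and $2.

```python
import re

# Sample adapted from the question; raw strings keep the backslashes literal.
text = r'<img src"D:\Images\1.png"> <img src"D:\ticket\t\1.png">'

# Groups 1 and 2 capture the fixed prefix and the extension; only the
# digits between them are replaced.
new_text = re.sub(r"(D:\\Images\\)\d+(\.png)", r"\g<1>Something\g<2>", text)
print(new_text)
```

The second image path is left unchanged because it does not start with D:\Images\.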
Use regular expressions, which in .NET are represented mostly by the Regex class. See these links:
http://www.codeproject.com/Articles/93804/Using-Regular-Expressions-in-C-NET
http://msdn.microsoft.com/en-us/library/ms228595%28v=vs.80%29.aspx

Resources