Deleting duplicate values using find and replace in a text editor

Deleting duplicate values using find and replace in a text editor - text

I messed something up. In my xml, each non preferred term has a preferred term to use:
Something I have done has created some non preffered terms where the preferred term to use is the exact same name as this non preferred term.
<term>
<termId>127699289611384833453kNgWuDxZEK37Lo4QVWZ</termId>
<termUpdate>Add</termUpdate>
<termName>Adenosquamous Carcinoma</termName>
<termType>Nd</termType>
<termStatus>Active</termStatus>
<termApproval>Approved</termApproval>
<termCreatedDate>20110704T09:41:31</termCreatedDatae>
<termCreatedBy>admin</termCreatedBy>
<termModifiedDate>20110704T09:45:17</termModifiedDate>
<termModifiedBy>admin</termModifiedBy>
<relation>
<relationType>USE</relationType>
<termId>1276992897N1537166632rbr7BISWAI93SarY118G</termId>
<termName>Adenosquamous Carcinoma</termName>
</relation>
Is there a text editor with a find and replace function I can use to tell it that if the in =the of the actual term, to just delete the whole ? I looked at the related queries and they mentioned regular expressions, but I've spent ages trying to build them and they are beyond me,
thanks!

It is nearly 3 years too late answering this question, but there are Perl regular expressions which can be indeed used for this task.
Finding and deleting a term block containing same termName in relation as defined above for the term itself is possible with UltraEdit for Windows v21.10.0.1032 and most likely also with other text editors supporting Perl regular expression using a case-sensitive Perl regular expression Replace with search string:
^[ \t]*<term>(?:(?!</term>)[\S\s])+<termName>([^\r\n]+?)</termName>(?:(?!</term>)[\S\s])+<relation>(?:(?!</term>)[\S\s])+<termName>\1</termName>(?:(?!</term>)[\S\s])+</term>[ \t\r]*\n
The replace string is an empty string.
Explanation:
^ ... start every search at beginning of a line.
[ \t]* ... there can be 0 or more spaces or tabs at beginning of the line.
<term> ... this string must be found next on the line.
Next the tricky expression follows which is required to match any character up to next string of interest, but with avoiding matching something in next term block if the remaining expression does not return a positive result on current term block.
(?:(?!</term>)[\S\s])+ ... this expression finds any character because of [\S\s] matching any non whitespace character or any whitespace character. There must be at least 1 character before next fixed string because of the +, but it can be also more characters. Additionally the Perl regular expression must make look ahead on every character matched to check if NOT </term> follows. If right of the currently matched character there is the string </term>, the Perl regexp engine must stop matching any character at current position in stream and continue with next part of the search string. So this expression can match any character, but not beyond </term> and therefore only characters between <term> and </term>. Because of ?: nothing is captured/marked for back referencing by this expression.
<termName> ... this fixed string within a term block must be found next.
([^\r\n]+?) ... matches the characters of the name of the term and captures/marks this string for back referencing. Instead of the negative character class expression [^\r\n], it would be also possible to use another class definition, or just . if a dot does not match new line characters. Also possible would be ([^<]+) if it is not possible that a not encoded opening angle bracket is part of the term name. Character < must be encoded with < according to XML specification within an element's value except within a CDATA block.
</termName> ... this fixed string within a term block must be found next.
(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.
<relation> ... this fixed string within a term block must be found next.
(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.
<termName> ... this fixed string within a term block must be found next.
\1 ... this expression back references the captured/marked term name and therefore the next string must be the same as the name of the term defined above.
</termName> ... this fixed string within a term block must be found next.
(?:(?!</term>)[\S\s])+ ... again any character within a term block up to next fixed string.
</term> ... this fixed string marking end of a term block must be found next.
[ \t\r]*\n ... matches 0 or more spaces, tabs and carriage returns and next a line-feed. So this expression works for a DOS/Windows (CR+LF) and a Unix (only LF) text file.
Also possible with UltraEdit is:
(?s)^[ \t]*<term>(?:(?!</term>).)+<termName>([^<]+?)</termName>(?:(?!</term>).)+<relation>(?:(?!</term>).)+<termName>\1</termName>(?:(?!</term>).)+</term>[ \t\r]*\n
(?s) ... this expression at beginning of the search string changes the behavior of . from matching any character except line terminators to really any character and therefore . is now like [\S\s].

Related

Dialogflow RE2 Regex

I am new here. I wanted to ask a question on using REGEX for an entity in DialogFlow
I wanted the entity to accept all text and spaces except for the symbol *
I have tried to use [A-Za-z0-9 ][^*], but it is not working. Any advice. thanks!

In your Regex expression, [^*] means "capture any character at the start of the line." To refer to a literal asterisk rather than matching any character, you need to use \*
If you want to match a line of letters or numbers as in the [A-Za-z0-9] example you give, but only if that string does not include an asterisk, then this expression should work for you:
^[a-zA-Z0-9]+$
This means "match a whole line of text if it only contains one or more of the characters a-z, A-Z, or 0-9".
If you want to match any character or group of characters in a line except for the asterisk, then you could use something like this:
(?!\*)([a-zA-Z0-9]+)(?<!\*)
The first part is called a "negative lookahead," and it looks forward to ensure we're not matching the asterisk. The last part is called a "negative lookbehind," and it looks backwards to make sure we're not matching the asterisk. The middle part is your "capture group," and confirms that you're matching any letters or numbers in a given string, but excluding the * character.
If this Regex gets input like *abc, it will capture abc. If it encounters abc*, it will still capture abc. If it encounters abc*def, it will capture abc and def separately in two capture groups, because it will break around the asterisk.
This link explains the concept of lookarounds in Regex. You can also use this Regex tester to get started practicing your Regular Expressions with explanations of what each block of characters does.
EDITED TO ADD If you're just interested in matching single characters rather than groups of characters, you can use [A-Za-z0-9] and match any upper or lowercase letter and any single digit. You don't need to exclude the * character, because the character group is already exclusive.
This is a slight duplicate of the question below, so responses here may also help you. Hope this helps!
How can I exclude asterisk in a regex expression

[A-Za-z0-9 ][^*]
What you regex will do is match 2 consecutive characters. First, it will look for anything A-Za-z0-9 . Then, it will look at the negated set that includes *, and will match ANY character except *.
You can type your regex into https://regexr.com/ to see a breakdown of how it matches and test some strings.
For example, your regex would match these:
Aa
AA
a&
A1
0_
But would not match these:
A*
a*
1*
And WOULD NOT match anything longer than 2 characters. If you really want to match any string with any characters except *, this should work:
[^\*]+
What that will do is match any number of consecutive characters that are not *. (The + means match 1 or more characters in the set). It is also a good idea to escape * because it is also a reserved character in regex. Even though most regex parsers are smart enough to know that inside a group you probably mean the literal char *, it is still a best practice to escape it. (And by that same token, you would want to use \s instead of the blank space in your original regex.)

Substitute and change case for program variables

I'm changing some notation in a few source code files.
In particular, variable names using the format
m_variable1
m_anothervariable
should be renamed and reformatted to
mVariable1
mAnotherVariable
That is, substitute m_ with m and make the next character uppercase.
I know how todo simple substitutions, like
%s/m_/m/gc
using vim, but not sure how to add syntax for changing a char to uppercase in a substitute statement?

You can make the first character of variable name uppercase, but I think you can hardly separate words from a consecutive string simply by built-in command.
I hope following command will help you:
:%s/\vm_(\w+)/m\u\1/g
Explaination
\v enables the 'very magic' mode
\u makes the first character of word after it uppercase
\1 references the first captured group
Result
mVariable1
mAnothervariable

How to perform following search and replace in vim?

I have the following string in the code at multiple places,
m_cells->a[ Id ]
and I want to replace it with
c(Id)
where the string Id could be anything including numbers also.

A regular expression replace like below should do:
%s/m_cells->a\[\s\(\w\+\)\s\]/c(\1)/g
If you wish to apply the replacement operation on a number of files you could use the :bufdo command.

Full explanation of #BasBossink's answer (as a separate answer because this won't fit in a comment), because regexes are awesome but non-trivial and definitely worth learning:
In Command mode (ie. type : from Normal mode), s/search_term/replacement/ will replace the first occurrence of 'search_term' with 'replacement' on the current line.
The % before the s tells vim to perform the operation on all lines in the document. Any range specification is valid here, eg. 5,10 for lines 5-10.
The g after the last / performs the operation "globally" - all occurrences of 'search_term' on the line or lines, not just the first occurrence.
The "m_cells->a" part of the search term is a literal match. Then it gets interesting.
Many characters have special meaning in a regex, and if you want to use the character literally, without the special meaning, then you have to "escape" it, by putting a \ in front.
Thus \[ and \] match the literal '[' and ']' characters.
Then we have the opposite case: literal characters that we want to treat as special regex entities.
\s matches white*s*pace (space, tab, etc.).
\w matches "*w*ord" characters (letters, digits, and underscore _).
(. matches any character (except a newline). \d matches digits. There are more...)
If a character is not followed by a quantifier, then exactly one such character matches. Thus, \s will match one space or tab, but not fewer or more.
\+ is a quantifier, and means "one or more". (\? matches 0 or 1; * (with no backslash) matches any number: zero or more. Warning: matching on zero occurrences takes a little getting used to; when you're first learning regexes, you don't always get the results you expected. It's also possible to match on an arbitrary exact number or range of occurrences, but I won't get into that here.)
$ and $ work together to form a "capturing group". This means that we don't just want to match on these characters, we also want to remember them specially so that we can do something with them later. You can have any number of capturing groups, and they can be nested too. You can refer to them later by number, starting at 1 (not 0). Just start counting (escaped) left-parantheses from the left to determine the number.
So here, we are matching a space followed by a group (which we will capture) of at least one "word" character followed by a space, within the square brackets.
Then section between the second and third / is the replacement text.
The "c" is literal.
\1 means the first captured group, which in this case will be the "Id".
In summary, we are finding text that matches the given description, capturing part of it, and replacing the entire match with the replacement text that we have constructed.
Perhaps a final suggestion: c after the final / (doesn't matter whether it comes before or after the 'g') enables *c*onfirmation: vim will highlight the characters to be replaced and will show the replacement text and ask whether you want to go ahead. Great for learning.
Yes, regexes are complicated, but super powerful and well worth learning. Once you have them internalized, they're actually fairly easy. I suggest that, as with learning vim itself, you start with the basics, get fluent in them, and then incrementally add new features to your repertoire.
Good luck and have fun.

Why do I have to escape the final ]

I have a file containing string like this one :
print $hash_xml->{'div'}{'div'}{'div'}[1]...
I want to replace {'div'}{'div'}{'div'}[1] by something else.
So I tried
%s/{'div'}{'div'}{'div'}[1]/by something else/gc
The strings were not found. I though I had to escape the {,},[ and ]
Still string not found.
So I tried to search a single { and it found them.
Then I tried to search {'div'}{'div'}{'div'} and it found it again.
Then {'div'}{'div'}{'div'}[1 was still found.
To find {'div'}{'div'}{'div'}[1]
I had to use %s/{'div'}{'div'}{'div'}[1\]
Why ?
vim 7.3 on Linux

The [] are used in regular expressions to wrap a range of acceptable characters.
When both are supplied unescaped, vim is treating the search string as a regex.
So when you leave it out, or escape the final character, vim cannot interpret a single bracket in a regex context, so does a literal search (basically the best it can do given the search string).
Personally, I would escape the opening and closing square brace to ensure that the meaning is clear.

That's because the [ and ] characters are used to build the search pattern.
See :h pattern and use the help file pattern.txt to try the following experiment:
Searching for the "[9-0]" pattern (without quotes) using /[0-9] will match every digit from 0 to 9 individually (see :h \[)
Now, if you try /\[0-9] or /[0-9\] you will match the whole pattern: a zero, an hyphen and a nine inside square brackets. That's because when you escape one of [ or ] the operator [*] ceases to exist.
Using your search pattern, /{'div'}{'div'}{'div'}[1\] and /{'div'}{'div'}{'div'}\[1] should match the same pattern which is the one you want, while /{'div'}{'div'}{'div'}[1] matches the string {'div'}{'div'}{'div'}1.

In order to avoid being caught by these special characters in regular expressions, you can try using the very magic flag.
E.g.:
:%s/\V{'div'}[1]/replacement/
Notice the \V flag at the beginning of the line.

Because the square brackets mean that vim thinks you're looking for any of the characters inside. This is known as a 'character class'. By escaping either of the square brackets it lets vim know that you're looking for the literal square string ending with '[1]'.
Ideally you should write your expression as:
%s/{'div'}{'div'}{'div'}\[1\]/replacement string/
to ensure that the meaning is completely clear.

What [...] in a bash script stands for?

I am reading this tutorial, and I encountered that bash script uses [...] as a wild card character. So what exactly [...] stands in a bash script?

It's a regex-style character matching syntax; from the Bash Reference Manual, §3.5.8.1 (Pattern Matching):
[...]
Matches any one of the enclosed characters. A pair of characters separated by a hyphen denotes a range expression; any character that sorts between those two characters, inclusive, using the current locale's collating sequence and character set, is matched. If the first character following the ‘[’ is a ‘!’ or a ‘^’ then any character not enclosed is matched. A ‘−’ may be matched by including it as the first or last character in the set. A ‘]’ may be matched by including it as the first character in the set. The sorting order of characters in range expressions is determined by the current locale and the value of the LC_COLLATE shell variable, if set.
For example, in the default C locale, ‘[a-dx-z]’ is equivalent to ‘[abcdxyz]’. Many locales sort characters in dictionary order, and in these locales ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; it might be equivalent to ‘[aBbCcDdxXyYz]’, for example. To obtain the traditional interpretation of ranges in bracket expressions, you can force the use of the C locale by setting the LC_COLLATE or LC_ALL environment variable to the value ‘C’.
Within ‘[’ and ‘]’, character classes can be specified using the syntax [:class:], where class is one of the following classes defined in the posix standard:
alnum alpha ascii blank cntrl digit graph lower
print punct space upper word xdigit
A character class matches any character belonging to that class. The word character class matches letters, digits, and the character ‘_’.
Within ‘[’ and ‘]’, an equivalence class can be specified using the syntax [=c=], which matches all characters with the same collation weight (as defined by the current locale) as the character c.
Within ‘[’ and ‘]’, the syntax [.symbol.] matches the collating symbol symbol.
(emphasis added to the most common usage patterns)

It is used in the tutorial to speak about regular expressions in addition to globbing ('*' and '?'). For example [a-z] regular expression will match one lowercase character.

Actually, what is a wildcard is [abc] for example. It matches one of the three letters.

Develop Reference

node.js excel linux python-3.x azure haskell apache-spark rust .htaccess string