Removing characters from a string in Perl that are not alphanumeric

I've used string replacement in Perl a couple of times to find particular substrings and replace them with something else.
I'm curious whether there is a trick to keep only certain characters. Specifically, I want to remove any characters from the string that are not a-z, A-Z or 0-9.
E.g., a b c !##$%^&*()_~+=[]{}\|;':",./<>? 123 would just be abc123.

Using regex,
s/[^a-zA-Z0-9]//g;
Using translation,
tr/a-zA-Z0-9//dc;
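
For readers who want to check the negated character class outside Perl, here is a minimal sketch of the same idea in Python. The function name keep_alnum is made up for illustration and is not part of the answer above; the pattern itself is the one from the s/// version.

```python
import re

def keep_alnum(text):
    """Drop every character that is not a-z, A-Z or 0-9."""
    # [^...] is a negated class: it matches any character NOT listed.
    return re.sub(r"[^a-zA-Z0-9]", "", text)

# The sample string from the question.
sample = "a b c !##$%^&*()_~+=[]{}\\|;':\",./<>? 123"
print(keep_alnum(sample))  # abc123
```

Note that the underscore is removed too, because the class lists the ranges explicitly instead of using \w.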

Related

Mark words in notepad++ including dash (-)

I would like to mark in Notepad++ the sql scripts in a text log. The sql files have this format in the text:
AAAAAAAA.BBBBBBBBBBB.sql
So I run this expression from the Search menu:
\w*.sql
With it I expect to get BBBBBBBBBBB.sql. The problem is that some script names contain dashes (-), and when that happens I don't get the whole name, only the part after the last dash.
For example, in:
AAAAAAAA.BBBBB-CCCCCCC.sql
I would like to get BBBBB-CCCCCCC.sql, but I just get CCCCCCC.sql
Is there a pattern that matches the whole name?

If the match must not start or end with a hyphen:
\w+(?:-\w+)*\.sql
\w+ Match 1+ word characters
(?:-\w+)* Optionally match - and 1+ word characters
\.sql Match .sql
Note that in your pattern the \w* can also match 0 occurrences and that the . can match any character if it is not escaped.
Another option is a character class that matches either - or a word character, but this would also allow arbitrary mixes such as --a--.sql
[\w-]+\.sql
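
To see how the two patterns differ, here is a small sketch that runs them through Python's re module. This is not Notepad++'s own regex engine, so treat it as an approximation; the names STRICT, LOOSE and find_scripts are invented for this example.

```python
import re

# Word characters, optionally joined by single hyphens, then a literal ".sql".
STRICT = re.compile(r"\w+(?:-\w+)*\.sql")
# Any mix of word characters and hyphens, then a literal ".sql".
LOOSE = re.compile(r"[\w-]+\.sql")

def find_scripts(pattern, text):
    """Return every script name the compiled pattern finds in text."""
    return pattern.findall(text)

print(find_scripts(STRICT, "AAAAAAAA.BBBBB-CCCCCCC.sql"))  # ['BBBBB-CCCCCCC.sql']
print(find_scripts(STRICT, "AAAAAAAA.BBBBBBBBBBB.sql"))    # ['BBBBBBBBBBB.sql']
print(find_scripts(LOOSE, "--a--.sql"))                    # ['--a--.sql']
print(find_scripts(STRICT, "--a--.sql"))                   # []
```

The last two lines show the trade-off: the character class accepts leading and doubled hyphens, while the stricter pattern rejects them.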

What do you understand by this RegEx?

I'm working with VBA and trying to split a string into three columns. Almost all strings look like Company Name 3567782 Agent Name.pdf
With this pattern I want to match all the text before a space and digits (1st group), the digits (2nd group) and all the text after the space and before the .pdf (3rd group).
strPattern = "^(.+)\n(\d{4,10})\n(.+).pdf"
I recall that whitespace in Python is \s, but I saw that in VBA it is \n.
Can you help me find the right pattern for what I'm looking for?

As I put in my comment, I use the https://regex101.com site. There are others but I find this one the most helpful to me.
When I put in your regex
^(.+)\n(\d{4,10})\n(.+).pdf
and test string
Company Name 3567782 Agent Name.pdf
the first thing I notice is that the regex does not match the test string (see right side under MATCH INFORMATION).
Here are a couple things that I saw:
\n is newline, not space. In regex, space is " ".
Your last "." in ".pdf" is not registering as a literal period, it's a token that matches any character. To match a literal period, you need \.
If we change those two things it returns three groups that seem to match what you are looking for.
^(.+) (\d{4,10}) (.+)\.pdf
It looks like for the digits, you are looking for between 4 and 10 digits. If that's correct, it looks like your regex is good. You could put in a handful of example strings into the TEST STRING area and make sure that it works in all cases.
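
As a quick check of the corrected pattern, here is a sketch using Python's re module instead of the VBA regex object, so small engine differences are possible. The helper name split_name is hypothetical.

```python
import re

# Corrected pattern: literal spaces instead of \n, and an escaped dot before "pdf".
FIXED = re.compile(r"^(.+) (\d{4,10}) (.+)\.pdf")
# Original pattern from the question, kept for comparison.
ORIGINAL = re.compile(r"^(.+)\n(\d{4,10})\n(.+).pdf")

def split_name(filename, pattern=FIXED):
    """Return (company, digits, agent) or None if the pattern does not match."""
    match = pattern.match(filename)
    return match.groups() if match else None

name = "Company Name 3567782 Agent Name.pdf"
print(split_name(name))            # ('Company Name', '3567782', 'Agent Name')
print(split_name(name, ORIGINAL))  # None, because the string has no newlines
```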

I'd use either of these:
(?:(?:([a-zA-Z]+\.?)|(\d+)))
This captures runs of letters (a-z, A-Z) greedily, with an optional trailing . to allow for the .pdf, or captures digits.
This version excludes the space ([ ] or \s).
Or keep the search structured so you can control what goes into each column:
^(\w+\s\w+)|(\d+)|(\w+\s\w+\.\w+$)
\b or ^ - word boundary or start of string
(\w+\s\w+) - 1st capture: \w+ matches word characters greedily, followed by one whitespace character (use \s* or \s+ for more), followed again by word characters greedily
|(\d+) - alternation: \d+ captures just the digits
|(\w+\s\w+\.\w+$) - similar to the 1st group, but allows for the '.' of .pdf and is anchored to the end of the string with $.
You could optionally build the '.' into the 1st group like in my top pattern, but for neatness and better control I prefer the 2nd.

remove exponents from a formula string

I would like to isolate all operands from a formula (given as a string) by taking out the arithmetic operators, that is: "+","-","/","*","**2"
the formula string is something like:
"y=A+B1*options+B2*items**2+B3*factor+B4"
I can manage most of the arithmetic operators, but not the exponent part "**2". It has to be a wildcard search or similar (not positional), because the whole formula might change in the future and might also have another exponent (e.g. **5 or **54).
What would be the easiest way to strip "**?" out of the formula where ? can be any number?

To match the pattern you want, use the regex string r"\*\*\d+"
Breakdown:
r"" denotes a raw string literal in Python, so the backslashes reach the regex engine unchanged (see the re module for more info)
\* matches a single * character - because the * is a special character in regex, we escape it with the \
\d matches a digit
+ matches the previous pattern at least once greedily: this means it will try to find at least one digit, then keep finding digits until it can find no more. So, it will match **2, **44382, and so on
As for stripping the pattern from the equation, you can do re.sub(pattern, "", equation) - replacing all instances of the pattern with nothing
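
Putting this together, a minimal sketch could look like the following. The split on "=", "+", "-", "*" and "/" is my assumption about how the operands are to be isolated afterwards; the names strip_exponents and operands are made up for this example.

```python
import re

EXPONENT = re.compile(r"\*\*\d+")  # "**" followed by one or more digits

def strip_exponents(formula):
    """Remove every **<number> from the formula string."""
    return EXPONENT.sub("", formula)

def operands(formula):
    """Assumed follow-up step: split what is left on =, +, -, * and /."""
    parts = re.split(r"[=+\-*/]", strip_exponents(formula))
    return [p for p in parts if p]  # drop empty pieces

formula = "y=A+B1*options+B2*items**2+B3*factor+B4"
print(strip_exponents(formula))  # y=A+B1*options+B2*items+B3*factor+B4
print(operands(formula))
# ['y', 'A', 'B1', 'options', 'B2', 'items', 'B3', 'factor', 'B4']
```

Stripping the exponents first matters: splitting on * alone would leave the exponent digits behind as if they were operands.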

VIM line count in status bar with thousands separator?

Is it possible to display the line count in the VIM status bar with thousands separators, preferably custom thousands separators?
Example:
set statusline=%L
should lead to "1,234,567" instead of "1234567".

I've found a way but it looks a bit crazy:
set statusline=%{substitute(line('$')\,'\\d\\zs\\ze\\%(\\d\\d\\d\\)\\+$'\,'\,'\,'g')}
The first round of backslashes is just for set (I have to escape , and \ itself).
What I'm actually setting the option to is this string:
%{substitute(line('$'),'\d\zs\ze\%(\d\d\d\)\+$',',','g')}
As a format string, this line contains one formatting code, which is %{...}. Everything in ... is evaluated as an expression and the result substituted back in.
The expression I'm evaluating is (spaces added (if I had added them to the real code, I would've had to escape them for set again, forcing yet more backslashes)):
substitute(line('$'), '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g')
This is a call to the substitute function. The arguments are the source string, the regex, the replacement string, and a list of flags.
The string we're starting with is line('$'). This call returns the number of lines in the current buffer (or rather the number of the last line in the buffer). This is what %L normally shows.
The search pattern we're looking for is \d(\d\d\d)+$ (special vim craziness removed), i.e. a digit followed by 1 or more groups of 3 digits, followed by the end of the string. Grouping is spelled \%( \) in vim, and "1 or more" is \+, which gives us \d\%(\d\d\d\)\+$. The last bit of magic is \zs\ze. \zs sets the start of the matched string; \ze sets the end. This works as if everything before \zs were a look-behind pattern and everything after \ze were a look-ahead pattern.
What this amounts to is: We're looking for every position in the source string that is preceded by a digit and followed by exactly N digits (where N is a multiple of 3). This works like starting at the right and going left, skipping 3 digits each time. These are the positions where we need to insert a comma.
That's what the replacement string is: ',' (a comma). Because we're matching a string of length 0, we're effectively inserting into the source string (by replacing '' with ',').
Finally, the g flag says to do this with all matches, not just the first one.
TL;DR:
line('$') gives us the number of lines
substitute(..., '\d\zs\ze\%(\d\d\d\)\+$', ',', 'g') adds commas where we want them
%{ } lets us embed arbitrary expressions into statusline
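
The zero-width matching trick is easier to experiment with outside of a statusline. Below is a sketch of the same idea in Python, where the \zs / \ze pair is written as an explicit look-behind and look-ahead. The function name add_separators and its sep parameter are invented for illustration.

```python
import re

# Zero-width position: preceded by a digit, followed by a multiple of
# three digits up to the end of the string.
GROUP_BOUNDARY = re.compile(r"(?<=\d)(?=(?:\d{3})+$)")

def add_separators(number, sep=","):
    """Insert sep at every thousands boundary of a non-negative integer."""
    # A function is used as replacement so sep is inserted literally.
    return GROUP_BOUNDARY.sub(lambda match: sep, str(number))

print(add_separators(1234567))       # 1,234,567
print(add_separators(1234567, "."))  # 1.234.567
print(add_separators(123))           # 123
```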

Bash get string between 2 6-digit numbers

I have a UTF-8-BOM encoded text file full of lines, most of which start with a 6-10-digit number (the number increases every line) followed by a string.
I want to get each of those "lines" (including the number) to process further in my bash script.
It would be easy to do with just a for loop and sed -n '$line\p', but unfortunately some of the strings I need have line breaks as part of them, so I need a way of extracting the string between two 6+ digit numbers (including the first number), each of which marks a new line.
An example of 3 "lines":
123456\tA random string here
123567\t another string
this time
it goes over
multiple lines
124567\t a normal string again
What I need:
123456\tA random string here
,
123567\t another string
this time
it goes over
multiple lines
and
124567\t a normal string again
A few things:
The strings are not surrounded with "" unfortunately
All numbers the strings contain are <6 digits long, so a >=6 digit number is always the start of a new string line
The number increases, so the number before the string is always lower than the one behind
I'd like to convert all special characters like tabs or line breaks to \t or \n
I need to get the byte length later in the script, so a string must keep its length
I'm still new here, so if I put this in the wrong place or if it was already answered, tell me!

I hope the "UTF-8-BOM encoded" is not a trap.
Here is my proposal if it is not.
bash-3.1$ sed -En '/^[0-9]{6,10}/!{:a;H;n;/^[0-9]{6,10}/!ba;x;s/\n/\\n/g;s/\t/\\t/g;p};/^[0-9]{6,10}/{x;s/\t/\\t/g;1!p;x;h;z;}' input.txt
Output for sample input (with a newline at the end):
123456\tA random string here
123567\t another string\nthis time\nit goes over\nmultiple lines
124567\t a normal string again
I assumed that the relevant 6-10 digits are always at the start of a line;
otherwise it gets trickier.
Note:
The string length will increase by 1 for each newline \n or tab \t,
because the requested "\n" and "\t" are two characters each.
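
For comparison, the same grouping logic can be sketched in Python, which may be easier to follow than the sed one-liner. It makes the same assumption that the 6-10 digits always sit at the start of a line, and it additionally assumes the whole file fits in memory; the function names group_records and escape are made up for this example.

```python
import re

RECORD_START = re.compile(r"^[0-9]{6,10}")  # same test as in the sed script

def group_records(text):
    """Join continuation lines onto the record that started before them."""
    records = []
    for line in text.lstrip("\ufeff").splitlines():  # drop a leading BOM if present
        if RECORD_START.match(line) or not records:
            records.append(line)          # a new numbered record begins
        else:
            records[-1] += "\n" + line    # continuation of the previous record
    return records

def escape(record):
    """Replace real tabs and newlines by the two-character forms \\t and \\n."""
    return record.replace("\t", "\\t").replace("\n", "\\n")

sample = (
    "123456\tA random string here\n"
    "123567\t another string\nthis time\nit goes over\nmultiple lines\n"
    "124567\t a normal string again\n"
)
for record in group_records(sample):
    print(escape(record))
```

Because group_records keeps the real tab and newline characters, the byte length can still be measured on the unescaped record, for example with len(record.encode("utf-8")), before escape is applied.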
