Ruby - Parse a multi-line tab-delimited string into an array of arrays - transform

My apologies if this has already been asked in a Ruby setting. I checked before posting, but to be perfectly honest it has been a very long day, and if I am missing the obvious, I apologize in advance!
I have the following string, which contains a list of software packages installed on a system, and for some reason I am having the hardest time parsing it. I know there has got to be a straightforward means of doing this in Ruby, but I keep coming up short.
I would like to parse the below multi-line, tab-delimited, string into an array of arrays where I can then loop through each array element with an each_with_index and spit out the HTML code into my Rails app.
str = 'Product and/or Software Full Name 5242 [version 6.5.24] [Installed on: 12/31/2015]
Product and/or Software Full Name 5426 [version 22.4] [Installed on: 06/11/2013]
Product and/or Software Full Name 2451 [version 1.63] [Installed on: 12/17/2015]
Product and/or Software Full Name 5225 [version 43.22.51] [Installed on: 11/15/2011]
Product and/or Software Full Name 2420 [version 43.51-r2] [Installed on: 12/31/2015]'
The end result would be an array of arrays with 5 elements like so:
[["Product and/or Software Full Name 5242"],["version 6.5.24"],
["Installed on: 12/31/2015"],["Product and/or Software Full Name 5426"],["version 22.4"],["Installed on: 06/11/2013"],["Product and/or Software Full Name 2451"],["version 1.63"],["Installed on: 12/17/2015"]]
Please Note: Only 3 of 5 arrays are shown for brevity
I would prefer to strip out the brackets from both 'version' and 'Installed on' but I can do that with gsub separately if that cannot easily be baked into an answer.
Last thing is that there won't always be an 'Installed on' entry for every line in the multiline string, so the answer will need to take that into account as applicable.

This ought to do:
expr = /(.+?)\s+\[([^\]]+)\](?:\s+\[([^\]]+)\])?/
str.scan(expr)
The expression is actually a lot less complex than it looks. It looks complex because we're matching square brackets, which have to be escaped, and also using character classes, which are enclosed in square brackets in the regular expression language. All together it adds a lot of noise.
Here it is split up:
expr = /
  (.+?)          # Capture #1: Any characters (non-greedy)
  \s+            # Whitespace
  \[             # Literal '['
  (              # Capture #2:
    [^\]]+       #   One or more characters that aren't ']'
  )
  \]             # Literal ']'
  (?:            # Non-capturing group
    \s+          #   Whitespace
    \[           #   Literal '['
    ([^\]]+)     #   Capture #3 (same as #2)
    \]           #   Literal ']'
  )?             # Preceding group is optional
/x
As you can see, the third part is identical to the second part, except it's in a non-capture group followed by a ? to make it optional.
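To sanity-check the pattern outside Ruby, here is a minimal sketch that runs the same expression through Python's re module. The sample text and the names EXPR, sample and rows are made up for illustration, and the second sample line deliberately has no 'Installed on' part. One difference from Ruby's scan: re.findall reports a missing optional group as an empty string, where Ruby gives nil.

```python
import re

# Same pattern as the Ruby expression above.
EXPR = re.compile(r"(.+?)\s+\[([^\]]+)\](?:\s+\[([^\]]+)\])?")

# Hypothetical tab-delimited sample; the second line lacks "Installed on".
sample = (
    "Product and/or Software Full Name 5242\t[version 6.5.24]\t[Installed on: 12/31/2015]\n"
    "Product and/or Software Full Name 5426\t[version 22.4]\n"
)

# One inner list per input line: [name, version, installed-or-empty].
rows = [list(groups) for groups in EXPR.findall(sample)]

for name, version, installed in rows:
    print(name, "|", version, "|", installed or "(no install date)")
```

Because the brackets sit outside the capture groups, the captured values come back already stripped of them.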
It's worth noting that this may fail if e.g. the product name contains square brackets. If that's a possibility, one potential solution is to include the version and Installed text in the match, e.g.:
expr = /(.+?)\s+\[(version [^\]]+)\](?:\s+\[(Installed [^\]]+)\])?/
P.S. Here's a solution that uses String#split instead:
expr = /\]?\s+\[|\]$/
res = str.each_line.map {|ln| ln.strip.split(expr) }
         .reject {|arr| arr.empty? }
If you have brackets in your product names, a possible workaround here is to specify a minimum number of spaces between parts, e.g.:
expr = /\]?\s{3,}\[|\]$/
...which of course depends on product names never containing three or more consecutive spaces.

Related

LUA -- gsub problems -- passing a variable to the match string isn't working [duplicate]

This question already has an answer here:
How to match a sentence in Lua
(1 answer)
Closed 1 year ago.
Been stuck on this for over a day.
I'm trying to use gsub to extract a portion of an input string. The exact pattern of the input varies in different cases, so I'm trying to use a variable to represent that pattern, so that the same routine - which is otherwise identical - can be used in all cases, rather than separately coding each.
So, I have something along the lines of:
newstring , n = oldstring:gsub(matchstring[i],"%1");
where matchstring[] is an indexed table of the different possible pattern matches, set up so that "%1" will match the target sequence in each matchstring[].
For instance, matchstring[1] might be
"\[User\] <code:%w*>([^<]*)<\\code>.*" -- extract user name from within the <code>...<\code>
while matchstring[2] could be
"\[World\] (%w)* .*" -- extract user name as first word after prefix '[World] '
and matchstring[3] could be
"<code:%w*>([^<]*)<\\code>.*" -- extract username from within <code>...<\code> at start
This does not work.
Yet when, debugging one of the cases, I replace matchstring[i] with the exact same string -- only now passed as a string literal rather than saved in a variable -- it works.
So.. I'm guessing there must be some 'processing' of the string - stripping out special characters or something - when it's sent as a variable rather than a string literal ... but for the life of me I can't figure out how to adjust the matchstring[] entries to compensate!
Help much appreciated...
FACEPALM
Thank you, Piglet, you got me on the right track.
Given how this particular platform processes and passes strings, anything within <...> needed the escape character \ for downstream use, but of course (duh) Lua's gsub itself needs the standard % escape for its own pattern processing.
Much obliged.

Python regular expressions with Foreign characters in python PyQT5

This problem might be very simple but I find it a bit confusing & that is why I need help.
With relevance to this question I posted that got solved, I got a new issue that I just noticed.
Source code:
from PyQt5 import QtCore, QtWidgets
app = QtWidgets.QApplication([])
def scroll():
    #QtCore.QRegularExpression(r'\b'+'cat'+'\b')
    item = listWidget.findItems(r'\bcat\b', QtCore.Qt.MatchRegularExpression)
    for d in item:
        print(d.text())
window = QtWidgets.QDialog()
window.setLayout(QtWidgets.QVBoxLayout())
listWidget = QtWidgets.QListWidget()
window.layout().addWidget(listWidget)
cats = ["love my cat", "catirization", "cat in the clouds", "catść"]
for i, cat in enumerate(cats):
    QtWidgets.QListWidgetItem(f"{i} {cat}", listWidget)
btn = QtWidgets.QPushButton('Scroll')
btn.clicked.connect(scroll)
window.layout().addWidget(btn)
window.show()
app.exec_()
Output GUI:
Now as you can see I am just trying to print out the text data based on the regex r"\bcat\b" when I press the "Scroll" button and it works fine!
Output:
0 love my cat
2 cat in the clouds
3 catść
However... as you can see, item #3 should not be printed, because it obviously does not match the regular expression r"\bcat\b". Yet it is printed, and I think the special foreign characters ść are what make it a match (which they shouldn't, right?).
I'm expecting an output like:
0 love my cat
2 cat in the clouds
Researches I have tried
I found this question and it says something about this \p{L} & based on the answer it means:
If all you want to match is letters (including "international" letters) you can use \p{L}.
To be honest, I'm not sure how to apply that with PyQt5, but I still made some attempts, such as changing the regex to r'\b'+r'\p{cat}'+r'\b'. However, I got this error:
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
QString::contains: invalid QRegularExpression object
Obviously the error says it's not a valid regex. Can someone educate me on how to solve this issue? Thank you!
In general, when you need to make your shorthand character classes and word boundaries Unicode-aware, you need to pass the QRegularExpression.UseUnicodePropertiesOption option to the regex compiler. See the QRegularExpression.UseUnicodePropertiesOption reference:
The meaning of the \w, \d, etc., character classes, as well as the meaning of their counterparts (\W, \D, etc.), is changed from matching ASCII characters only to matching any character with the corresponding Unicode property. For instance, \d is changed to match any character with the Unicode Nd (decimal digit) property; \w to match any character with either the Unicode L (letter) or N (digit) property, plus underscore, and so on. This option corresponds to the /u modifier in Perl regular expressions.
In Python, you could declare it as
rx = QtCore.QRegularExpression(r'\bcat\b', QtCore.QRegularExpression.UseUnicodePropertiesOption)
However, since QListWidget.findItems does not support a QRegularExpression as an argument and only accepts the regex as a string object, you can only use the (*UCP) PCRE verb as an alternative:
r'(*UCP)\bcat\b'
Make sure you define it at the regex beginning.
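To see why the word boundary behaves differently once Unicode properties are switched on, here is a small standalone sketch using only Python's built-in re module rather than Qt. It is an analogy, not the Qt API: re.ASCII stands in for the engine's ASCII-only default, and Python's default handling of str patterns stands in for what (*UCP) enables. The list and variable names are made up for illustration.

```python
import re

items = ["0 love my cat", "1 catirization", "2 cat in the clouds", "3 catść"]

# ASCII-only word characters: 'ś' does not count as a word character,
# so there is a word boundary between 't' and 'ś' and item 3 matches.
ascii_hits = [s for s in items if re.search(r"\bcat\b", s, re.ASCII)]

# Unicode-aware word characters (Python's default for str patterns):
# 'ś' is a letter, so there is no boundary after 'cat' in item 3.
unicode_hits = [s for s in items if re.search(r"\bcat\b", s)]

print(ascii_hits)
print(unicode_hits)
```

The first list reproduces the unwanted output from the question; the second is the expected one.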

Extract/Parse all subdomains and domains from any file content [duplicate]

This question already has answers here:
Extract domain names from a file in Shell [closed]
(2 answers)
Closed 4 years ago.
I'm trying to make a regex for grep that matches only valid domains.
My version works pretty well but matches the following invalid domain:
#subdom..dom.ext
Here is my regex :
echo "#dom.ext" | grep "^#[[:alnum:]]\+[[:alnum:]\-\.]\+[[:alnum:]]\+\.[[:alpha:]]\+\$"
I'm working with bash so I escaped special characters.
Sample that should match :
#subdom.dom.ext
#subsubdom.subdom.dom.ext
#subsub-dom.sub-dom.ext
Thanks for help
A truly complete solution requires more work, but here's an approximation that may work well enough (note that a # prefix is assumed and the input string is expected to start with it):
^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$
You can use this with egrep (or grep -E), but also with [[ ... =~ ... ]], bash's regex-matching operator.
Makes the following assumptions, which are more permissive than actual DNS name constraints:
Only ASCII (non-foreign) letters are allowed - see below for Internationalized Domain Name (IDN) considerations; also, the Punycode (ASCII-compatible) forms of IDNs - e.g., xn--bcher-kva.ch for bücher.ch - are not matched - see below.
There's no limit on the number of nested subdomains.
There's no limit on the length of any label (name component), and no limit on the overall length of the name (for actual limits, see here).
The TLD (last component) is composed of letters only and has a length of at least 2.
Both subdomain and domain names must start with a letter; subdomains are allowed to be single-letter.
Here's a quick test:
re='^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$'
# Quote each sample: an unquoted word starting with # would begin a comment.
for d in '#subdom..dom.ext' '#dom.ext' '#subdom.dom.ext' '#subsubdom.subdom.dom.ext' '#subsub-dom.sub-dom.ext' '#x.org'; do
  [[ $d =~ $re ]] && echo YES || echo NO
done
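If you want to try the same pattern outside the shell, it carries over unchanged to Python's re module. This is only a sketch: the names DOMAIN, samples and results are made up for illustration, and the inputs are the ones from the quick test above.

```python
import re

# Same pattern as above: '#' prefix, one or more labels each followed by a dot,
# then a letters-only TLD of at least two characters.
DOMAIN = re.compile(r"^#(([a-zA-Z](-?[a-zA-Z0-9])*)\.)+[a-zA-Z]{2,}$")

samples = [
    "#subdom..dom.ext",
    "#dom.ext",
    "#subdom.dom.ext",
    "#subsubdom.subdom.dom.ext",
    "#subsub-dom.sub-dom.ext",
    "#x.org",
]

# Map each sample to True/False depending on whether the pattern accepts it.
results = {s: bool(DOMAIN.match(s)) for s in samples}

for s, ok in results.items():
    print("YES" if ok else "NO", s)
```

Only the first sample, with its empty label between the two dots, should be rejected.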
Support for Internationalized Domain Names (IDN) with literal Unicode characters - again, a complete solution requires more work:
A simple improvement to also match IDNs is to replace [a-zA-Z] with [[:alpha:]] and [a-zA-Z0-9] with [[:alnum:]] in the above regex; i.e.:
^#(([[:alpha:]](-?[[:alnum:]])*)\.)+[[:alpha:]]{2,}$
Caveats:
No attempt is made to recognize Punycode-encoded versions of IDNs, which use an ASCII-based encoding with prefix xn--, and which would require decoding afterwards.
As Patrick Mevzek points out, the above can yield both false negatives and false positives (using his examples):
False positive: an invalid Punycode-encoded name such as ab--whatever
False positive: Invalid cross-language names; e.g., cαfe.fr, which uses a Greek letter in a French domain name - a rule that is impossible to enforce via a regex alone.
False negatives: emoji-based names such as 💄.ws (xn--jr8h.ws)
False negative: பரிட்சை is a valid TLD in IANA root today, but will not match [[:alpha:]]{2,}$
... and many more
Not all Unix-like platforms fully support all Unicode letters when matching against [[:alpha:]] or [[:alnum:]]. For instance, using UTF-8-based locales, OS X 10.9.1 apparently only matches Latin diacritics (e.g., ü, á) and Cyrillic characters (in addition to ASCII), whereas Linux 3.2 laudably appears to cover all scripts, including Asian and Arabic ones.
I'm unclear on whether names in right-to-left writing scripts are properly matched.
For the sake of completeness: even though the regex above makes no attempt to enforce length limits, attempting to do so with IDNs would be much more complex, as the length limits apply to the ASCII encoding of the name (via Punycode), not the original.
Tip of the hat to #Alfe and for pointing out the problem with IDNs, and to #Arka for offering a simplified version of the regex to replace the lengthier one I had initially crafted under the mistaken assumption that single-letter domain names must be ruled out.
echo "#dom.ext" | grep -E '^#[a-zA-Z0-9]+([-.]?[a-zA-Z0-9]+)*\.[a-zA-Z]+$'
This did the job.
Use
grep '#[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*\.[[:alpha:]][[:alnum:]\-]*$'

Adding space in a specific position in a string of uppercase and lowercase letters

Dear stackoverflow users,
Many people encounter situations in which they need to modify strings. I have seen many
posts related to string modification. But, I have not come across solutions I am looking
for. I believe my post would be useful for some other R users who will face similar
challenges. I would like to seek some help from R users who are familiar with string
modification.
I have been trying to modify a string like the following.
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
There are four individuals in this string. Family names are in capital letters.
Three out of four family names stay in chunks with first names (e.g., HELLNERJohan).
I want to separate family names and first names adding space (e.g., HELLNER Johan).
I think I need to state something like "Select sequences of uppercase letters, and
add space between the last and second last uppercase letters, if there are lowercase
letters following."
The following post is probably somewhat relevant, but I have not been successful in writing code yet.
Splitting String based on letters case
Thank you very much for your generous support.
The gsub() call below finds and captures two consecutive sub-patterns: the first is a single uppercase letter (the end of a family name), and the second is an uppercase letter followed by a lowercase letter (taken to indicate the start of a first name). Wherever these two groups occur together, they are replaced by themselves with a space inserted between them (the "\\1 \\2" in the call below).
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"
If you want to separate the vector into a vector of names, this splits the string using a regular expression with zero-width lookbehind and lookahead assertions.
strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])",
perl = TRUE)[[1]]
# [1] "Marcus HELLNER" "Johan OLSSON" "Anders SOEDERGREN"
# [4] "Daniel RICHARDSSON"
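For anyone comparing across languages, both steps can be sketched with Python's re module as well. This is an illustration rather than a translation of R's semantics: it uses the ASCII classes [A-Z] and [a-z] in place of [[:upper:]] and [[:lower:]], and splitting on a zero-width match needs Python 3.7 or later. The variable names are made up.

```python
import re

x = "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

# Insert a space between an uppercase letter and a following
# "uppercase then lowercase" pair (the start of a first name).
spaced = re.sub(r"([A-Z])([A-Z][a-z])", r"\1 \2", x)

# Split at the same position using zero-width lookbehind/lookahead assertions.
names = re.split(r"(?<=[A-Z])(?=[A-Z][a-z])", x)

print(spaced)
print(names)
```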

Prolog : Remove extra spaces in a stream of characters

Total newb to Prolog. This one is frustrating me a bit. My 'solution' below is me trying to make Prolog procedural...
This will remove spaces or insert a space after a comma if needed, that is, until a period is encountered:
squish:-get0(C),put(C),rest(C).
rest(46):-!.
rest(32):-get(C),put(C),rest(C).
rest(44):-put(32), get(C), put(C), rest(C).
rest(Letter):-squish.
GOAL: I'm wondering how to remove any whitespace BEFORE the comma as well.
The following works, but it is so wrong on so many levels, especially the 'exit'!
squish:-
get0(C),
get0(D),
iteratesquish(C,D).
iteratesquish(C,D):-
squishing(C,D),
get0(E),
iteratesquish(D,E).
squishing(46,X):-put(46),write('end.'),!,exit.
squishing(32,32):-!.
squishing(32,44):-!.
squishing(32,X):-put(32),!.
squishing(44,32):-put(44),!.
squishing(44,44):-put(44), put(32),!.
squishing(44,46):-put(44), put(32),!.
squishing(44,X):-put(44), put(32),!.
squishing(X,32):-put(X),!.
squishing(X,44):-put(X),!.
squishing(X,46):-put(X),!.
squishing(X,Y):-put(X),!.
Since you are describing lists (in this case: of character codes), consider using DCG notation. For example, to let any comma be followed by a single whitespace, consider using code similar to:
squish([]) --> [].
squish([(0',),(0' )|Rest]) --> [0',], spaces, !, squish(Rest).
squish([L|Ls]) --> [L], squish(Ls).
spaces --> [0' ], spaces.
spaces --> [].
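Example query (recent SWI-Prolog versions treat double-quoted text as a string object rather than a list of character codes by default, so you may need to set the double_quotes flag to codes first):
?- set_prolog_flag(double_quotes, codes).
?- phrase(squish(Ls), "a, b,c"), format("~s", [Ls]).
a, b, c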
So, first focus on a clear declarative description of the relation between character sequences and the desired "clean" string. You can then use SWI-Prolog's library(pio) to read from files via these grammar rules. To remove all spaces preceding commas, you only have to add a single rule to the DCG above (to squish//1), which I leave as an exercise to you. A corner case, of course, is a comma followed by another comma, in which case the requirements are contradictory :-)
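For comparison outside Prolog, the target behaviour (no spaces before a comma, exactly one space after it) can be sketched as a single regular-expression substitution in Python. This is not the DCG rule asked for above, just a way to check expected outputs; the function name squish is borrowed from the question, and only plain space characters are handled.

```python
import re

def squish(text: str) -> str:
    """Drop spaces around each comma, then emit the comma plus one space."""
    return re.sub(r" *, *", ", ", text)

print(squish("a , b,c"))    # spaces before the comma are removed
print(squish("a,    b"))    # runs of spaces after the comma collapse to one
print(squish("a,,b"))       # the contradictory corner case: adjacent commas
```

In this sketch, two adjacent commas simply end up separated by a space, which is one arbitrary way of resolving the corner case mentioned above.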

Resources