Prolog : Remove extra spaces in a stream of characters - string

Total newb to Prolog. This one is frustrating me a bit. My 'solution' below is me trying to make Prolog procedural...
This will remove spaces or insert a space after a comma if needed, that is, until a period is encountered:
squish:-get0(C),put(C),rest(C).
rest(46):-!.
rest(32):-get(C),put(C),rest(C).
rest(44):-put(32), get(C), put(C), rest(C).
rest(Letter):-squish.
GOAL: I'm wondering how to remove any whitespace BEFORE the comma as well.
The following works, but it is so wrong on so many levels, especially the 'exit'!
squish:-
get0(C),
get0(D),
iteratesquish(C,D).
iteratesquish(C,D):-
squishing(C,D),
get0(E),
iteratesquish(D,E).
squishing(46,X):-put(46),write('end.'),!,exit.
squishing(32,32):-!.
squishing(32,44):-!.
squishing(32,X):-put(32),!.
squishing(44,32):-put(44),!.
squishing(44,44):-put(44), put(32),!.
squishing(44,46):-put(44), put(32),!.
squishing(44,X):-put(44), put(32),!.
squishing(X,32):-put(X),!.
squishing(X,44):-put(X),!.
squishing(X,46):-put(X),!.
squishing(X,Y):-put(X),!.

Since you are describing lists (in this case: of character codes), consider using DCG notation. For example, to let any comma be followed by a single whitespace, consider using code similar to:
squish([]) --> [].
squish([(0',),(0' )|Rest]) --> [0',], spaces, !, squish(Rest).
squish([L|Ls]) --> [L], squish(Ls).
spaces --> [0' ], spaces.
spaces --> [].
Example query:
?- phrase(squish(Ls), "a, b,c"), format("~s", [Ls]).
a, b, c
So, first focus on a clear declarative description of the relation between character sequences and the desired "clean" string. You can then use SWI-Prolog's library(pio) to read from files via these grammar rules. To remove all spaces preceding commas, you only have to add a single rule to the DCG above (to squish//1), which I leave as an exercise to you. A corner case, of course, is a comma followed directly by another comma, in which case the requirements are contradictory :-)


Fullname multilingual Regexp

Currently the validation of fullname looks like:
/^[a-zA-Z ]{2,30}$/
But that regexp validates only latin alphabet names. This should be changed in order to handle multilingual characters also. I have tried:
/^(\p{L}\p{M}*){2,30}$/u
But it validates numbers within names also, which is not correct.
As in the first case, use a character class with Unicode as well:
/^[\p{L}\p{M}\p{Zs}]{2,30}$/u
\p{Zs} matches any space separator, such as the regular space and the ideographic (Japanese) space U+3000.
In case you want to prevent space at the start and end, use these negative lookaheads:
/^(?!\p{Zs})(?!.*\p{Zs}$)[\p{L}\p{M}\p{Zs}]{2,30}$/u
See a demo on regex101.com.
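The same rule can be sketched without \p{...} support, for example in Python, whose built-in re module does not understand Unicode property classes. The sketch below checks Unicode general categories directly; the function name is_valid_fullname and its parameters are made up for illustration, and the length is counted in code points.

```python
import unicodedata

def is_valid_fullname(name, min_len=2, max_len=30):
    """Hypothetical helper mirroring [\\p{L}\\p{M}\\p{Zs}]{2,30} with no edge spaces."""
    if not (min_len <= len(name) <= max_len):
        return False

    def is_space(ch):
        # Zs is the "space separator" category (regular space, U+3000, ...).
        return unicodedata.category(ch) == "Zs"

    def is_allowed(ch):
        # Categories starting with L are letters, those starting with M are marks.
        return unicodedata.category(ch)[0] in "LM" or is_space(ch)

    if not all(is_allowed(ch) for ch in name):
        return False
    # Counterpart of the two lookaheads: no space at the start or at the end.
    return not (is_space(name[0]) or is_space(name[-1]))

print(is_valid_fullname("José García"))       # True
print(is_valid_fullname("山田\u3000太郎"))     # True, U+3000 is in Zs
print(is_valid_fullname("John3"))             # False, digits are not allowed
```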

How can I remove all characters inside angle brackets python?

How can I remove all characters inside angle brackets, including the brackets, in a string? How can I also remove all the text between ("\r\n") and ("."+"any 3 characters")? Is this possible? I am currently using the solution by @xkcdjerry,
e.g.
import re
body = """Dear Students roads etc. you place a tree take a snapshot, then when you place a\r\nbuilding, take a snapshot. Place at least 5-6 objects and then have 5-6\r\nsnapshots. Please keep these snapshots with you as everyone will be asked\r\nto share them during the class.\r\n\r\nI am attaching one PowerPoint containing instructions and one video of\r\nexplanation for your reference.\r\n\r\nKind regards,\r\nTeacher Name\r\n zoom_0.mp4\r\n<https://drive.google.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>"""
d = re.compile("\r\n.+?\\....")
body = d.sub('', body)
a = re.compile("<.*?>")
body = a.sub('', body)
print(body)
For some reason the output is fine except that it has:
gle.com/file/d/1UX-klOfVhbefvbhZvIWijaBdQuLgh_-Uru4_1QTkth/view?usp=drive_web>
randomly attached to the end. How can I fix it?
Answer
Your problem can be solved by a regex:
Put this into the Python shell:
import re
a=re.compile("<.*?>")
a.sub('',"Keep this part of the string< Remove this part>Keep This part as well")
Output:
'Keep this part of the stringKeep This part as well'
Second question:
import re
b=re.compile("\r\n.*?\\..{3}")
b.sub('',"Hello\r\nFilename.png")
Output:
'Hello'
Breakdown
Regex is a robust way of finding, replacing, and mutating small strings inside bigger ones. For further reading, consult https://docs.python.org/3/library/re.html. Meanwhile, here are the breakdowns of the regex elements used in this answer:
. means any char (except a newline, by default).
*? means as many of the preceding item as needed, but as few as possible (non-greedy match).
So .*? means any number of characters, but as few as possible.
Note: The reason there is a \\. in the second regex is that a . in the match needs to be escaped by a \, which in its turn needs to be escaped as \\ inside a normal (non-raw) Python string.
The methods:
re.compile(pattern:str) compiles a regex for further use.
regex.sub(repl:str,string:str) replaces every match of regex in string with repl.
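As for the leftover fragment in the question: when the file-name pattern runs first, it can also match the start of the bracketed link ("\r\n<https://drive.goo"), which removes the opening < so the bracket pattern no longer finds anything. A minimal sketch of that effect, using a shortened, made-up body rather than the original message:

```python
import re

# Shortened stand-in for the message body in the question (made up for illustration).
body = "Kind regards,\r\n zoom_0.mp4\r\n<https://drive.google.com/file>"

file_line = re.compile(r"\r\n.+?\....")   # CRLF, some text, a dot, three characters
in_brackets = re.compile(r"<.*?>")         # everything from < to the next >

# Order used in the question: the file-name pattern eats "\r\n<https://drive.goo",
# so the "<" is gone before the bracket pattern gets a chance to match.
question_order = in_brackets.sub("", file_line.sub("", body))

# Removing the bracketed link first avoids the overlap.
swapped_order = file_line.sub("", in_brackets.sub("", body))

print(repr(question_order))  # 'Kind regards,gle.com/file>'
print(repr(swapped_order))   # 'Kind regards,\r\n'
```

Whether swapping the order is enough for a real message depends on what else follows a line break in it, since the file-name pattern matches any text after "\r\n" that contains a dot.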
Hope it helps.

Python3 strip() gets unexpected result

It's a weird problem
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120"
And two strings below:
s1="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\[Content_Types].xml"
s2="D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120\\_rels\.rels"
When I use the command below:
s1.strip(to_be_stripped)
s2.strip(to_be_stripped)
I get these outputs:
'[Content_Types].x'
'_rels\\.'
If I use lstrip(), they will be:
'[Content_Types].xml'
'_rels\\.rels'
Which are the right outputs.
However, if we replace every ProjectKnown with zeus_pipeline:
to_be_stripped="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120"
And:
s2="D:\\Users\\UserKnown\\PycharmProjects\\zeus_pipeline\\PT\\collections\\120\\_rels\.rels"
s2.lstrip(to_be_stripped) will be '.rels'
If I use / instead of \\, nothing goes wrong. I am wondering why this problem happens.
strip isn't meant to remove full strings exactly. Rather, you give it a string, and every character in that string is removed from the start and end of the string to be stripped, until a character that is not in the set is reached.
In your case, the variable to_be_stripped contains the characters m and l, so those are stripped from the end of s1. However, it doesn't contain the character x, so the stripping stops there and no characters beyond that are removed.
Check out this question. The accepted answer is probably more extensive than you need - I like another user's suggestion of using replace instead of strip. This would look like:
s1.replace(to_be_stripped, "")
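The difference can be seen side by side in a short sketch. The helper remove_prefix below is a hypothetical name introduced here for illustration; it removes an exact leading substring only, unlike replace, which would also remove later occurrences.

```python
prefix = "D:\\Users\\UserKnown\\PycharmProjects\\ProjectKnown\\PT\\collections\\120"
s1 = prefix + "\\[Content_Types].xml"

# strip() treats its argument as a set of characters, not as a substring:
# the trailing 'l' and 'm' occur in prefix, so they are removed as well.
stripped = s1.strip(prefix)

def remove_prefix(text, leading):
    """Hypothetical helper: drop `leading` only if text starts with exactly that."""
    if text.startswith(leading):
        return text[len(leading):]
    return text

# Include the path separator in the prefix to get a clean relative name.
relative = remove_prefix(s1, prefix + "\\")

print(stripped)  # [Content_Types].x
print(relative)  # [Content_Types].xml
```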

XML schema restriction pattern for not allowing specific string

I need to write an XSD schema with a restriction on a field, to ensure that
the value of the field does not contain the substring FILENAME at any location.
For example, all of the following must be invalid:
FILENAME
ORIGINFILENAME
FILENAMETEST
123FILENAME456
None of these values should be valid.
In a regular expression language that supports negative lookahead, I could do this by writing /^((?!FILENAME).)*$/ but the XSD pattern language does not support negative lookahead.
How can I implement an XSD pattern restriction with the same effect as /^((?!FILENAME).)*$/ ?
I need to use pattern, because I don't have access to XSD 1.1 assertions, which are the other obvious possibility.
The question XSD restriction that negates a matching string covers a similar case, but in that case the forbidden string is forbidden only as a prefix, which makes checking the constraint easier. How can the solution there be extended to cover the case where we have to check all locations within the input string, and not just the beginning?
OK, the OP has persuaded me that while the other question mentioned has an overlapping topic, the fact that the forbidden string is forbidden at all locations, not just as a prefix, complicates things enough to require a separate answer, at least for the XSD 1.0 case. (I started to add this answer as an addendum to my answer to the other question, and it grew too large.)
There are two approaches one can use here.
First, in XSD 1.1, a simple assertion of the form
not(matches($value, 'FILENAME'))
ought to do the job.
Second, if one is forced to work with an XSD 1.0 processor, one needs a pattern that will match all and only strings that don't contain the forbidden substring (here 'FILENAME').
One way to do this is to ensure that the character 'F' never occurs in the input. That's too drastic, but it does do the job: strings not containing the first character of the forbidden string do not contain the forbidden string.
But what of strings that do contain an occurrence of 'F'? They are fine, as long as no 'F' is followed by the string 'ILENAME'.
Putting that last point more abstractly, we can say that any acceptable string (any string that doesn't contain the string 'FILENAME') can be divided into two parts:
a prefix which contains no occurrences of the character 'F'
zero or more segments, each consisting of an 'F' followed by a string that doesn't begin with 'ILENAME' and doesn't contain any 'F'.
The prefix is easy to match: [^F]*.
The strings that start with F but don't match 'FILENAME' are a bit more complicated; just as we don't want to outlaw all occurrences of 'F', we also don't want to outlaw 'FI', 'FIL', etc. -- but each occurrence of such a dangerous string must be followed either by the end of the string, or by a letter that doesn't match the next letter of the forbidden string, or by another 'F' which begins another region we need to test. So for each proper prefix of the forbidden string, we create a regular expression of the form
$prefix || '([^F' || next-character-in-forbidden-string || ']'
|| '[^F]*)?'
Then we join all of those regular expressions with or-bars.
The end result in this case is something like the following (I have inserted newlines here and there, to make it easier to read; before use, they will need to be taken back out):
[^F]*
((F([^FI][^F]*)?)
|(FI([^FL][^F]*)?)
|(FIL([^FE][^F]*)?)
|(FILE([^FN][^F]*)?)
|(FILEN([^FA][^F]*)?)
|(FILENA([^FM][^F]*)?)
|(FILENAM([^FE][^F]*)?))*
Two points to bear in mind:
XSD regular expressions are implicitly anchored; testing this with a non-anchored regular expression evaluator will not produce the correct results.
It may not be obvious at first why the alternatives in the choice all end with [^F]* instead of .*. Thinking about the string 'FEEFIFILENAME' may help. We have to check every occurrence of 'F' to make sure it's not followed by 'ILENAME'.
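The construction can be checked mechanically. The sketch below builds the pattern in Python and tests it with re.fullmatch, which supplies the implicit anchoring that XSD patterns have. The function name build_exclusion_pattern is made up for illustration, and the sketch assumes an alphanumeric forbidden string whose first character does not occur again later in it (true for FILENAME); for other strings this simple construction would need more care.

```python
import re

def build_exclusion_pattern(forbidden):
    """Hypothetical helper: pattern matching strings that do not contain `forbidden`."""
    first = forbidden[0]
    # Assumptions of this sketch, not of the answer above.
    if not forbidden.isalnum() or first in forbidden[1:]:
        raise ValueError("sketch only handles alphanumeric strings with a unique first character")
    alternatives = []
    for i in range(1, len(forbidden)):
        prefix, next_char = forbidden[:i], forbidden[i]
        # A proper prefix, optionally followed by a character that neither
        # continues the forbidden string nor starts a new candidate region.
        alternatives.append("(%s([^%s%s][^%s]*)?)" % (prefix, first, next_char, first))
    return "[^%s]*(%s)*" % (first, "|".join(alternatives))

pattern = build_exclusion_pattern("FILENAME")

for value in ["FILENAME", "ORIGINFILENAME", "FILENAMETEST", "123FILENAME456", "FEEFIFILENAME"]:
    assert re.fullmatch(pattern, value) is None       # rejected
for value in ["", "FILE", "FILENAM", "FEEFIFILENAM", "MYFILE.NAME"]:
    assert re.fullmatch(pattern, value) is not None   # accepted
```

For FILENAME, the generated pattern is the expression shown above with the newlines removed.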

Adding space in a specific position in a string of uppercase and lowercase letters

Dear stackoverflow users,
Many people encounter situations in which they need to modify strings. I have seen many
posts related to string modification. But, I have not come across solutions I am looking
for. I believe my post would be useful for some other R users who will face similar
challenges. I would like to seek some help from R users who are familiar with string
modification.
I have been trying to modify a string like the following.
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
There are four individuals in this string. Family names are in capital letters.
Three out of four family names stay in chunks with first names (e.g., HELLNERJohan).
I want to separate family names and first names adding space (e.g., HELLNER Johan).
I think I need to state something like "Select sequences of uppercase letters, and
add space between the last and second last uppercase letters, if there are lowercase
letters following."
The following post is probably somewhat relevant, but I have not been successful in writing the code yet.
Splitting String based on letters case
Thank you very much for your generous support.
This works by finding and capturing two consecutive sub-patterns, the first consisting of one upper case letter (the end of a family name), and the next consisting of an upper then a lower-case letter (taken to indicate the start of a first name). Everywhere these two groups are found, they are captured and replaced by themselves with a space inserted between (the "\\1 \\2" in the call below).
x <- "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"
gsub("([[:upper:]])([[:upper:]][[:lower:]])", "\\1 \\2", x)
# "Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON"
If you want to separate the string into a vector of names, this splits the string using a regular expression with zero-width lookbehind and lookahead assertions.
strsplit(x, split = "(?<=[[:upper:]])(?=[[:upper:]][[:lower:]])",
perl = TRUE)[[1]]
# [1] "Marcus HELLNER" "Johan OLSSON" "Anders SOEDERGREN"
# [4] "Daniel RICHARDSSON"
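For comparison, the same two patterns carry over to other regex engines. Below is a rough Python counterpart, not part of the R answer: it assumes ASCII names, so [A-Z] and [a-z] stand in for [[:upper:]] and [[:lower:]], and it needs Python 3.7 or later, where re.split accepts zero-width matches.

```python
import re

x = "Marcus HELLNERJohan OLSSONAnders SOEDERGRENDaniel RICHARDSSON"

# Capture the last capital of a family name and the start of the next first
# name (capital + lowercase letter), then re-emit them with a space between.
spaced = re.sub(r"([A-Z])([A-Z][a-z])", r"\1 \2", x)

# Split at the same position using zero-width lookbehind and lookahead.
names = re.split(r"(?<=[A-Z])(?=[A-Z][a-z])", x)

print(spaced)  # Marcus HELLNER Johan OLSSON Anders SOEDERGREN Daniel RICHARDSSON
print(names)   # ['Marcus HELLNER', 'Johan OLSSON', 'Anders SOEDERGREN', 'Daniel RICHARDSSON']
```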
