Banned Words space sensitivity? - node.js

I have a discord bot: https://github.com/LebryantJohnson/ultimatebot. It has a censor that works great except for one shortcoming. My banned words are listed in banned words.txt. I don't want to curse, so let's say one of the words is Chicken. If I spam Chicken repeatedly without spaces between the repeats, the anti-swear filter won't remove it. Is there a way to get it to remove the message whenever a swear word appears in it, even when it is not separated by spaces?
Discord JS V 11 btw

As mentioned by @Sfue in the comments, you can use Regex to achieve this. In fact, regex could be used to solve both this issue and the issue I helped you with earlier at once. Here's how it would go:
checkProfanity: function(message, bannedWords) {
    var words = message.split(' ');
    for (var word of words) {
        if (bannedWords.some(element => word.match(new RegExp(element, "i")) && element != "")) return true;
    }
    return false;
}
This uses a combination of a few things that weren't in your code before. First off is Array.some(), which is used to check if at least one element in an array passes the test specified by the supplied function. This is very useful in your situation, since you have an array of banned words and you want to check to see if any one of those banned words is present in your message. Second is RegExp, which is used for matching text with a pattern. We are using one RegExp flag here: i (meaning case insensitive). This flag is crucial to solving your earlier issue. Lastly, we pull together our use of all of the components mentioned above with String.match(), which is used to retrieve the result of attempting to match a String to a Regex pattern. Match() paired with our Regex will return a value if any banned word appears in one of the message's words, and in a case insensitive manner due to our i flag, solving both issues at once.
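For comparison, here is a rough Python rendering of the same idea, not the bot's actual code. The function name `check_profanity` is made up for illustration, and the call to `re.escape` is an extra assumption: it treats every banned word as literal text, so a word containing a character such as `.` or `*` is not interpreted as a pattern.

```python
import re

def check_profanity(message, banned_words):
    """Return True if any banned word occurs anywhere in the message."""
    for banned in banned_words:
        if not banned:
            # Skip empty entries, e.g. blank lines in the word list.
            continue
        # re.escape keeps the banned word literal; IGNORECASE mirrors the "i" flag.
        if re.search(re.escape(banned), message, re.IGNORECASE):
            return True
    return False

print(check_profanity("CHICKENchickenChicken", ["chicken", ""]))
print(check_profanity("hello there", ["chicken", ""]))
```

Because the search looks for the banned word anywhere in the text, repeats with no spaces between them are still caught.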

Related

How to get a substring with Regex in Python

I am trying to formulate a regex to get the IDs from the two example strings below:
/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details
/drugs/2/drug-19906/magnesium-moxide-tablet/details
In the first case, I should get 19904-5106 and in the second case 19906.
So far I have tried several patterns; the closest I could get is [drugs/2/drug]-.*\d, but it returns g-19904-5106 and g-19906.
Any help getting rid of the "g-", please?
Thank you in advance.
When writing a regular expression, consider the patterns you see in the data so that you can target them precisely. For example, if you know that your desired IDs always appear in something resembling ABCD-1234-5678 where 1234-5678 is the ID you want, then you can use that. If you also know that your IDs are always digits, then you can refine the search even more.
For your example, using a regex string like
.+?-(\d+(?:-\d+)*)
should do the trick. In a Python script that would look something like the following (my_string stands for one of your paths):
import re

match = re.search(r'.+?-(\d+(?:-\d+)*)', my_string)
if match:
    my_id = match.group(1)
The pattern may vary depending on the depth and complexity of your examples, but that works for both of the ones you provided.
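As a quick check, the pattern can be run against both paths from the question. The helper name `extract_id` is only for illustration.

```python
import re

# Lazily skip to the first hyphen that is followed by digits, then capture
# the digits plus any further "-digits" groups.
ID_PATTERN = re.compile(r'.+?-(\d+(?:-\d+)*)')

def extract_id(path):
    """Return the captured ID, or None when the path has no such ID."""
    match = ID_PATTERN.search(path)
    return match.group(1) if match else None

paths = [
    "/drugs/2/drug-19904-5106/magnesium-oxide-tablet/details",
    "/drugs/2/drug-19906/magnesium-moxide-tablet/details",
]
for path in paths:
    print(extract_id(path))
# prints 19904-5106, then 19906
```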
This is the closest I could find: \d+|.\d+-.\d+

Regex for specific permutations of a word

I am working on a wordle bot and I am trying to match words using regex. I am stuck at a problem where I need to look for specific permutations of a given word.
For example, if the word is "steal" these are all the permutations:
'tesla', 'stale', 'steal', 'taels', 'leats', 'setal', 'tales', 'slate', 'teals', 'stela', 'least', 'salet'.
I had some trouble creating a regex for this, but eventually stumbled on positive lookaheads which solved the issue. regex -
'(?=.*[s])(?=.*[l])(?=.*[a])(?=.*[t])(?=.*[e])'
But, if we are looking for specific permutations, how do we go about it?
For example, words that look like 's[lt]a[lt]e'. The matching words are 'slate', 'stale', and 'state'. But I want to limit the count of l and t in the matched word, which means the output should be 'slate' and 'stale'. One obvious solution is the regex r'slate|stale', but this is not a general solution. I am trying to arrive at a general solution for any scenario, and the positive lookaheads above seemed like a starting point, but I have been unable to get there.
Do we combine positive lookaheads with normal regex?
s(?=.*[lt])a(?=.*[lt])e (Did not work)
Or do we write nested lookaheads or something?
A few more regex that did not work -
s(?=.*[lt]a[tl]e)
s(?=.*[lt])(?=.*[a])(?=.*[lt])(?=.*[e])
I tried to look through the available posts on SO, but could not find anything that would help me understand this. Any help is appreciated.
You could append the regex which matches the permutations of interest to your existing regex. In your sample case, you would use:
(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e
This will match only stale and slate; it won't match state because it fails the lookahead that requires an l in the word.
Note that you don't need the (?=.*s)(?=.*a)(?=.*e) in the above regex as they are required by the part that matches the permutations of interest. I've left them in to keep that part of the regex generic and not dependent on what follows it.
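A small Python check of the combined regex, using a hand-picked candidate list (the list itself is just for illustration):

```python
import re

# Lookaheads require each letter somewhere in the word; the tail fixes the shape.
pattern = re.compile(r'(?=.*s)(?=.*l)(?=.*a)(?=.*t)(?=.*e)s[lt]a[lt]e')

candidates = ['steal', 'stale', 'state', 'slate', 'slale']
matches = [word for word in candidates if pattern.fullmatch(word)]
print(matches)  # ['stale', 'slate']
```

Here 'state' fails the lookahead for l, 'slale' fails the lookahead for t, and 'steal' passes the lookaheads but does not fit the s[lt]a[lt]e shape.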
Note that to allow for duplicated characters you might want to change your lookaheads to something in this form:
(?=(?:[^s]*s){1}[^s]*)
You would change the quantifier on the group to match the number of occurrences of that character which are required.
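One caveat worth checking: as written, that lookahead only guarantees at least that many occurrences. The sketch below adds a `$` inside the lookahead, which is my own assumption about what is wanted, so that the count becomes exact when the lookahead sits at the start of the word. The helper name `exactly` is hypothetical.

```python
import re

def exactly(ch, n):
    """Build a lookahead that requires exactly n occurrences of ch up to the end."""
    c = re.escape(ch)
    # n repetitions of "anything but ch, then ch", then no further ch until the end.
    return rf'(?=(?:[^{c}]*{c}){{{n}}}[^{c}]*$)'

one_t_one_l = re.compile(exactly('t', 1) + exactly('l', 1) + r's[lt]a[lt]e')
two_t = re.compile(exactly('t', 2) + r's[lt]a[lt]e')

words = ['stale', 'slate', 'state']
print([w for w in words if one_t_one_l.fullmatch(w)])  # ['stale', 'slate']
print([w for w in words if two_t.fullmatch(w)])        # ['state']
```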

Using flex to identify variable name without repeating characters

I'm not fully sure how to word my question, so sorry for the rough title.
I am trying to create a pattern that can identify variable names with the following constraints:
Must begin with a letter
First letter may be followed by any combination of letters, numbers, and hyphens
First letter may be followed with nothing
The variable name must not be entirely X's ([xX]+ is a separate identifier in this grammar)
So for example, these would all be valid:
Avariable123
Bee-keeper
Y
E-3
But the following would not be valid:
XXXX
X
3variable
5
I am able to meet the first three requirements with my current identifier, but I am really struggling to change it so that it doesn't pick up variables that are entirely the letter X.
Here is what I have so far: [a-z][a-z0-9\-]* {return (NAME);}
Can anyone suggest a way of editing this to avoid variables that are made up of just the letter X?
The easiest way to handle that sort of requirement is to have one pattern which matches the exceptional string and another pattern, which comes afterwards in the file, which matches all the strings:
[xX]+ { /* matches all-x tokens */ }
[[:alpha:]][[:alnum:]-]* { /* handle identifiers */ }
This works because lex (like almost all of its derivatives) selects the first rule in the file when two patterns match the same longest token.
Of course, you need to know what you want to do with the exceptional symbol. If you just want to accept it as some token type, there's no problem; you just do that. If, on the other hand, the intention was to break it into subtokens, perhaps individual letters, then you'll have to use yyless(), and you might want to switch to a new lexing state in order to avoid repeatedly matching the same long sequence of Xs. But maybe that doesn't matter in your case.
See the flex manual for more details and examples.
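To make the selection rule concrete, here is a small Python emulation of "longest match wins, ties go to the earlier rule". It is not how flex is implemented; the token names `ALL_X` and `NAME` and the function `classify` are placeholders, and the character classes assume ASCII input.

```python
import re

# Rules in priority order, mirroring the order of the patterns in the flex file.
RULES = [
    ("ALL_X", re.compile(r'[xX]+')),
    ("NAME", re.compile(r'[A-Za-z][A-Za-z0-9-]*')),
]

def classify(text):
    """Return (token name, lexeme) for the longest match at the start of text."""
    best_name, best_len = None, 0
    for name, pattern in RULES:
        m = pattern.match(text)
        # Strict '>' means an equally long later rule never replaces an earlier one.
        if m and len(m.group()) > best_len:
            best_name, best_len = name, len(m.group())
    return best_name, text[:best_len]

for sample in ["XXXX", "Xavier", "E-3", "3variable"]:
    print(sample, classify(sample))
```

"XXXX" is matched at full length by both rules, so the earlier all-X rule wins, while "Xavier" goes to the identifier rule because that match is longer.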

String pattern or String manipulation to search and replace a pattern in lua

I get the list of domains on a system and I need to replace only the patterns which contain "domain\username" with '*'.
As of now I am able to mask the domain names with * using string.gsub(), but what pattern should I add to make sure any occurrence of domain\username is replaced with *?
Example:
Suppose the system has two domains, test.com and work-user.com, and two users, admin and guest, and a file contains the following:
User tried to login from TEST\admin; but should have logged in from work-user\user1, No logs present for testing\guest, account.
The domain test.com and WORK-USER.org are active and TESTING domain in inactive.
Then the output should look like this:
User tried to login from *********; but should have logged in from ********\user1, No logs present for testing\*****, account.
The domain ****.com and *********.org are active and TESTING domain in inactive.
Since Testing and user1 are not the domain and username on that system, they should not get replaced.
I have the logic to replace the username and domain name independently in any given format, but when it is the format of domain\username I am not able to replace it.
I have to add some logic\pattern after I get the domain name so it matches the above requirement.
Can you please let me know how to proceed?
I tried the below code:
test_string="User tried to login from TEST\\admin; but should have logged in from work-user\\user1, No logs present for testing\\guest, account. The domain test.com and WORK-USER.org are active and TESTING domain in inactive"
s= "test"
t=( string.gsub(s.."$DNname", "%$(%w+)", {DNname="\\([%w_]+)"}) )
n=( string.gsub(s.."$DNname", "%$(%w+)", {DNname="\\([%a%d]+)([%;%,%.%s]?)"}) )
print (t)
print(n)
r=string.match(test_string,t)
res=string.match(test_string,n)
print(r)
print(res)
It is printing nil, and is not able to match any pattern
First let's talk about why your code doesn't work.
For one thing, both of your patterns contain a backslash, so right away you miss anything that has no backslash:
print(t) -- test\([%w_]+)
print(n) -- test\([%a%d]+)([%;%,%.%s]?)
But there is also another problem. The only thing with a backslash that ought to match in your test message is TEST\admin. But here TEST is all uppercase, and pattern matching is case sensitive, so you will not find it.
The first part of the answer, then, is to make a case-insensitive pattern. This can be done as follows:
s= "[Tt][Ee][Ss][Tt]"
Here I have replaced each letter with the character class that will match either the uppercase or lowercase letter.
What happens if we look for this pattern in the original message, though? We will have an unfortunate problem: we will find testing and TESTING. It looks like you may have already encountered this problem as you wrote "([%;%,%.%s]?)".
The better way to do this is the frontier pattern. (Note that the frontier pattern is an undocumented feature in Lua 5.1. I'm not sure if it is in Lua 5.0 or not. It became a documented feature in Lua 5.2.)
The frontier pattern takes a character set and matches only the empty position between two characters where the previous character is not in the set and the next character is in the set. It sounds complicated, but basically it lets you find the beginnings or endings of words.
To use the frontier pattern, we need to figure out what a domain or username might look like. We may not be able to do this perfectly, but, in practice, being overly greedy should be fine.
s = "%f[%w-][Tt][Ee][Ss][Tt]%f[^%w-]"
This new pattern will match "TEST" and "test", but will not match "TESTING" or "testing".
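If it helps to see the same boundary idea outside Lua, a Python analogue can be written with lookarounds. This is only a comparison, not part of the Lua solution; the character class below is my approximation of Lua's %w plus "-" for ASCII text.

```python
import re

# (?<!WORD) plays the role of %f[%w-]; (?!WORD) plays the role of %f[^%w-].
WORD = r'[A-Za-z0-9-]'
pattern = re.compile(rf'(?<!{WORD})test(?!{WORD})', re.IGNORECASE)

sample = r"TEST\admin testing test.com TESTING my-test"
print(pattern.findall(sample))  # ['TEST', 'test']
```

"testing" and "TESTING" are rejected because a word character follows, and "my-test" is rejected because a hyphen precedes it.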
Before proceeding, let's look at a problem that might occur with a domain like your "work-user". The character "-" has a special meaning in patterns, so we must escape it. All special characters can be escaped by adding a "%" in front. So, our work-user pattern would look like:
s = "%f[%w-][Ww][Oo][Rr][Kk]%-[Uu][Ss][Ee][Rr]%f[^%w-]"
Well, these kinds of patterns are sort of awful to write out, so let's try to write a function to do it for us:
function string_to_pattern(str, frontier_set, ci)
    -- escape magic characters ("-" is written as "%-" so that "+-?" is not
    -- read as a character range, which would also escape digits)
    str = str:gsub("[][^$()%%.*+?%-]", "%%%0")
    if ci then
        -- make the resulting pattern case-insensitive
        str = str:gsub("%a", function(letter)
            return "["..letter:upper()..letter:lower().."]"
        end)
    end
    if frontier_set then
        str = "%f["..frontier_set.."]"..str.."%f[^"..frontier_set.."]"
    end
    return str
end
print(string_to_pattern("work-user", "%w-", true))
-- %f[%w-][Ww][Oo][Rr][Kk]%-[Uu][Ss][Ee][Rr]%f[^%w-]
I'll go ahead and mention the corner case now: this pattern will not match "-work-user" or "work-user-". This may be okay or not depending on what kind of messages get generated. You could take "-" out of the frontier set, but then you would match e.g. "my-work-user". You can decide if this matters, but I haven't thought about how to solve it with Lua's pattern matching language.
Now, how do we replace a match with *'s? This part is pretty easy. The built-in string.gsub function will allow us to replace matches of our patterns with other strings. We just need to generate a replacement string that consists of as many *'s as characters.
function string_to_stars(str)
    return ("*"):rep(str:len())
end
local pattern = string_to_pattern("test", "%w-", true)
print( (test_string:gsub(pattern, string_to_stars)) )
Now, there's a final problem. We can match users in the same way we match domains. For example:
-- note the different frontier_set here
-- I don't know what the parameters for your usernames are,
-- but this matches your code
local pattern = string_to_pattern("admin", "%w_", true)
print( (test_string:gsub(pattern, string_to_stars)) )
However, even if we replace all the domains and usernames separately, the backslash between "TEST" and "admin" in "TEST\admin" will not be replaced. We could do a hack like this:
test_string:gsub("%*\\%*","***")
This would replace "*\*" with "***" in the final output. However, this is not quite robust because it could replace a "*\*" that was in the original message and not a result of our processing. To do things properly, we would have to iterate over all domain+user pairs and do something like this:
test_string:gsub(domain_pattern .. "\\" .. user_pattern, string_to_stars)
Note that this must be done before any other replacements, as otherwise the domain and username will have already been replaced, and can no longer be matched.
Now that the problem is solved in that way, let me suggest an alternative approach that reflects something more like what I would write from scratch. I think it is probably simpler and more readable. Instead of using pattern matching to find our domains and usernames exactly, let's instead just match tokens that could be domains or usernames and then check if they match exactly later.
local message = -- broken into multiple lines only for
-- formatting reasons
"User tried to login from TEST\\admin; but should "
.."have logged in from work-user\\user1, No logs present "
.."for testing\\guest, account. The domain test.com and "
.."WORK-USER.org are active and TESTING domain in inactive"
-- too greedy, but may not matter in your case
local domain_pattern = "%w[%w-]*"
-- again, not sure
local user_pattern = "[%w_]+"
-- for case-insensitivity, call :lower before inserting into the set
local domains = {["test"]=true, ["work-user"]=true}
local users = {["admin"]=true, ["guest"]=true}
local pattern = "(("..domain_pattern..")\\("..user_pattern.."))"
message = message:gsub(pattern, function(whole, domain, user)
    -- only call lower if case-insensitive
    if domains[domain:lower()] and users[user:lower()] then
        return string_to_stars(whole)
    else
        return whole
    end
end)
local function replace_set(message, pattern, set, ci)
    return (message:gsub(pattern, function(str)
        -- look up a lowercased key, but leave the matched text itself untouched
        local key = ci and str:lower() or str
        if set[key] then
            return string_to_stars(str)
        else
            return str
        end
    end))
end
message = replace_set(message, domain_pattern, domains, true)
message = replace_set(message, user_pattern, users, true)
print(message)
Notice how simple the patterns are in this example. We no longer need case-insensitive character classes like "[Tt]" because the case-insensitivity is checked after the matching by forcing both strings to be lowercase with string.lower (which may not be maximally efficient, but, hey, this is Lua). We no longer need to use the frontier pattern because we are guaranteed to get full words because of greedy matching. The backslash case is still weird, but I've handled it in the same "robust" way as I suggested above.
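For readers who want to experiment with the "match tokens, then look them up" idea outside Lua, here is a rough Python sketch of the same flow. It is a port under assumptions, not the code above: the names `DOMAINS`, `USERS`, `stars`, and `mask` are made up, and the ASCII character classes stand in for Lua's %w.

```python
import re

DOMAINS = {"test", "work-user"}   # stored in lowercase for case-insensitive lookup
USERS = {"admin", "guest"}

def stars(text):
    return "*" * len(text)

def mask(message):
    # Step 1: domain\user pairs, done first so that both halves are still intact.
    def pair(m):
        if m.group(1).lower() in DOMAINS and m.group(2).lower() in USERS:
            return stars(m.group(0))
        return m.group(0)
    message = re.sub(r'([A-Za-z0-9][A-Za-z0-9-]*)\\([A-Za-z0-9_]+)', pair, message)

    # Step 2: standalone tokens, checked against a set after matching.
    def single(names):
        return lambda m: stars(m.group(0)) if m.group(0).lower() in names else m.group(0)
    message = re.sub(r'[A-Za-z0-9][A-Za-z0-9-]*', single(DOMAINS), message)
    message = re.sub(r'[A-Za-z0-9_]+', single(USERS), message)
    return message

print(mask(r"login from TEST\admin; work-user\user1, testing\guest, test.com TESTING"))
```

Text that is not replaced keeps its original case, because only the lookup key is lowercased.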
A final note: I don't know exactly why you're doing this, but I can maybe guess that it is to prevent someone from seeing domains or usernames. Replacing them with *'s is not necessarily the best way to go. First, doing matching in these ways could be problematic if your messages are (for example) delimited with letters. This seems unlikely for user-friendly messages, but I don't know whether that's something you should count on when security is at stake. Another thing is that you are not hiding the lengths of the domains or usernames. This can also be a major source of insecurity. For example, a user might reasonably guess that ***** is "admin".

identify common chars in correct order (kind of regular expression) from a array of strings

I am looking for a way to identify the common characters in a set of strings of different lengths. The same problem has been posted here before, and the author was somehow able to find the answer, but I could not follow his solution. I tried to post my query there, but I am not sure whether I will get a reply, so I am posting it as a new question. (The old question is "Find common chars in array of strings, in the right order".)
I am taking the same example from him.
Let's assume "+" is the "wildcard char":
Array(
0 => '48ca135e0$5',
1 => 'b8ca136a0$5',
2 => 'c48ca13730$5',
3 => '48ca137a0$5');
Should return :
$wildcard='+8ca13+0$5';
This looks to me like a standard problem, so I suspect there is a library for this. If not, please shed some light on how to solve it.
I don't think comparing char by char will work (as suggested in the reply), because the matching characters can appear anywhere (e.g. arr1[1] and arr2[3] can be the starting indexes of a matching substring, or the other way around).
regards,
Looks like you're looking for the "longest common substring". The first longest common substring is 8ca13, the second longest is 0$5. Once we have these two strings, you can take any of the strings in the set and replace extra characters with a single +.
http://en.wikipedia.org/wiki/Longest_common_substring_problem
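A minimal brute-force sketch of that idea in Python, assuming it is acceptable to split every string at the first occurrence of the longest common substring and repeat on the left and right remainders. The function names are made up, and this greedy splitting is a heuristic rather than a guaranteed-optimal algorithm.

```python
def longest_common_substring(strings):
    """Try substrings of the shortest string, longest first."""
    shortest = min(strings, key=len)
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in strings):
                return candidate
    return ""

def wildcard(strings, wild="+"):
    """Build a pattern of common substrings joined by a single wildcard char."""
    common = longest_common_substring(strings)
    if not common:
        # Nothing shared: one wildcard if any string still has leftover characters.
        return wild if any(strings) else ""
    lefts, rights = [], []
    for s in strings:
        i = s.index(common)
        lefts.append(s[:i])
        rights.append(s[i + len(common):])
    return wildcard(lefts, wild) + common + wildcard(rights, wild)

samples = ['48ca135e0$5', 'b8ca136a0$5', 'c48ca13730$5', '48ca137a0$5']
print(wildcard(samples))  # +8ca13+0$5
```

Note that a "+" is emitted even when only some of the strings have extra characters at that spot, so here it means "zero or more characters".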
