In ANTLR 4 is there a way to access tokens on the hidden channel (or some other channels) in semantic predicates of the parser?
I would like to send \r\n to the hidden channel since I mostly don't need the EOL characters. In some cases, however, a semantic predicate would need to check whether there is an EOL after a given token.
To be honest, I have no experience with ANTLR 4, but in ANTLR 3 you can use the token source to get all tokens, regardless of the channel. Something similar is certainly possible in version 4 too. I use this feature to restore the original input for AST subtrees (i.e. from token stream start index to end index).
Yes, this can be done. Look at this question and this question for some examples. The first one of these seems to directly address your question about handling EOL "some of the time".
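For illustration, here is a sketch in the Java target (the helper and its name are hypothetical, not taken from the linked answers): send the EOLs to the hidden channel in the lexer, and have the predicate ask the BufferedTokenStream for hidden tokens next to a given token index.

    import java.util.List;
    import org.antlr.v4.runtime.BufferedTokenStream;
    import org.antlr.v4.runtime.Token;

    public final class EolHelper {
        private EolHelper() {}

        /** True if a hidden-channel token containing a line break immediately
         *  follows the token at tokenIndex. */
        public static boolean eolAfter(BufferedTokenStream tokens, int tokenIndex) {
            List<Token> hidden = tokens.getHiddenTokensToRight(tokenIndex, Token.HIDDEN_CHANNEL);
            if (hidden == null) return false;
            for (Token t : hidden) {
                if (t.getText().indexOf('\n') >= 0 || t.getText().indexOf('\r') >= 0) return true;
            }
            return false;
        }
    }

A semantic predicate in the grammar could then call something like { EolHelper.eolAfter((BufferedTokenStream) _input, _input.index() - 1) }?; the exact token index to pass depends on where the predicate sits in the rule.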
Background Information
We use SonarQube to obtain quality metrics for the codebase. SonarQube has flagged over a dozen bugs in our Node.js codebase under rule S6324, all related to an email validation regular expression advocated by a top-ranking website on Google called emailregex.com. The website claims the regex is the RFC 5322 Official Standard. However, the control characters in the regex are flagged by SonarQube for removal because they're non-printable characters. Here is the regex:
(?:[a-z0-9!#$%&'*+/=?^_`{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_`{|}~-]+)*|"(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21\x23-\x5b\x5d-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])*")@(?:(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?\.)+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?|\[(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?|[a-z0-9-]*[a-z0-9]:(?:[\x01-\x08\x0b\x0c\x0e-\x1f\x21-\x5a\x53-\x7f]|\\[\x01-\x09\x0b\x0c\x0e-\x7f])+)\])
And here is the full list of control characters SonarQube complains about:
\x0e, \x0e, \x0c, \x0c, \x0b, \x0c, \x1f, \x01, \x1f, \x01, \x01, \x09, \x08, \x0b, \x0b, \x0e, \x0b, \x08, \x0c, \x0e, \x09, \x01
Regular-Expressions.info's Email page does address a variation of the above regular expression as follows:
The reason you shouldn’t use this regex is that it is overly broad. Your application may not be able to handle all email addresses this regex allows. Domain-specific routing addresses can contain non-printable ASCII control characters, which can cause trouble if your application needs to display addresses...
However, I can't seem to find any information that explains why some sites are adding these non-printable control characters or what they mean by "domain-specific routing addresses". I have looked at some Stack Overflow regex questions and the Stack Overflow Regex Wiki. Control characters don't seem to be addressed.
The Question
Can someone please explain the purpose of these control-characters in the regular expression and possibly supply some examples of when this regular expression is useful?
(Note: Please avoid debates/discussion about what the best/worst regular expression is for validating emails. There doesn't seem to be agreement on that issue, which has been discussed and debated in many places on Stack Overflow and the broader Internet. This question is focused on understanding the purpose of control characters in the regular expression).
Update
I also reached out to the SonarQube community, and no one seems to have any answers.
Update
Still looking for authoritative answers which explain why the email regular expression above is specifically checking for non-printable control characters in email addresses.
There is this in RFC 5322, Section 5, but it's about the message body, not the address:
Security Considerations
Care needs to be taken when displaying messages on a terminal or
terminal emulator. Powerful terminals may act on escape sequences
and other combinations of US-ASCII control characters with a variety
of consequences. They can remap the keyboard or permit other
modifications to the terminal that could lead to denial of service or
even damaged data. They can trigger (sometimes programmable) [...]
The Purpose
Can someone please explain the purpose of these control-characters in the regular expression [...]?
The purpose of those non-printable control characters would be to create a regex that conforms closely to the RFCs defining the email address format.
Just in case anyone is wondering: yes, the control characters in this email regex really do conform to the RFC specs. I think validating this is outside the scope of this question, so I won't quote the spec in detail, but here are links to the relevant sections: 3.2.3 (atoms), 3.2.4 (quoted strings), 3.4 (address specification), 3.4.1 (addr-spec specification), 4.1 (Misc Obsolete Tokens). In summary, the local part of the address is allowed to be a quoted string and the domain part a domain literal, each of which is allowed to contain certain non-printable control characters.
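As a concrete, purely illustrative check in Java, here is the quoted-string branch lifted from the regex above matching a local part that contains the SOH control character (0x01):

    import java.util.regex.Pattern;

    public class QuotedStringDemo {
        public static void main(String[] args) {
            // The quoted-string alternative of the local part, copied from the regex above.
            Pattern quotedLocal = Pattern.compile(
                "\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f\\x21\\x23-\\x5b\\x5d-\\x7f]"
                + "|\\\\[\\x01-\\x09\\x0b\\x0c\\x0e-\\x7f])*\"");

            String localPart = "\"weird\u0001name\""; // a quoted string containing SOH (0x01)
            System.out.println(quotedLocal.matcher(localPart).matches()); // prints: true
        }
    }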
Quoting from SonarQube rule S6324 (emphasis added):
Entries in the ASCII table below code 32 are known as control characters or non-printing characters. As they are not common in JavaScript strings, using these invisible characters in regular expressions is most likely a mistake.
Following a spec is not a mistake. When a lint rule that is usually helpful hits a case in people's code where it is not helpful, people usually just use the lint tool's case-by-case ignore mechanism. I think this addresses the second clause of your bounty, which states:
What is a better alternative that will avoid breaking our site while also passing SonarQube's quality gate?
I.e., use one of the provided mechanisms to make SonarQube ignore those rule violations. You could also choose to opt out of checking that rule entirely, but that's probably overkill.
For SonarQube, use NOSONAR comments to disable warnings on a case-by-case basis.
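For example, a trailing NOSONAR comment on the offending line suppresses findings on that line; the mechanism is the same in JavaScript as in the Java-flavored sketch below (the pattern shown here is deliberately shortened; the real one is the full regex quoted above):

    import java.util.regex.Pattern;

    public class EmailPatterns {
        // Shortened stand-in for the full RFC 5322 regex quoted above; the trailing
        // NOSONAR comment tells SonarQube to ignore issues (here, S6324) on this line.
        static final Pattern QUOTED_LOCAL_PART =
            Pattern.compile("\"(?:[\\x01-\\x08\\x0b\\x0c\\x0e-\\x1f])*\""); // NOSONAR - control chars are intentional per RFC 5322
    }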
Examples of Usefulness
This comes down to context.
If your end goal is purely to validate whether any given email address is a valid email address as defined by the RFCs, then a regex that closely follows the RFC specs is very useful.
That's not everyone's end goal. Quoting from Wikipedia:
Despite the wide range of special characters which are technically valid, organisations, mail services, mail servers and mail clients in practice often do not accept all of them. For example, Windows Live Hotmail only allows creation of email addresses using alphanumerics, dot (.), underscore (_) and hyphen (-). Common advice is to avoid using some special characters to avoid the risk of rejected emails.
There's nothing there that explains why most applications do not fully adhere to the spec, but you could speculate, or you could go and ask their maintainers. For example, considerations such as simplicity could, in someone's context, be seen as more important than full RFC compliance.
If your goal were to check whether a given email address is a valid Hotmail email address, and to reject email addresses that are allowed by the RFCs but not by the subset Hotmail uses, then full RFC compliance would not be necessary (or useful).
My team is using Solr and I have a question regarding it.
There are some search terms which don't give relevant results, or which miss results that should have been displayed. For example:
Searching for Macy's without the apostrophe, like "Macys", doesn't give back any result for Macy's.
Searching for JPMorgan vs JP Morgan gives different results.
Searching for IBM doesn't show results which contain its full name, i.e. International Business Machines.
How can we improve and optimize such cases so that the fix applies generally, even to cases we haven't caught beyond these three?
Any suggestions?
All these issues are related to how you process the incoming text for those fields. You'll have to create a filter chain for the field that processes the input values to do what you want, possibly using multiple fields for different use cases and prioritizing them with qf.
Your first case can be solved by using a PatternReplaceFilter to remove any apostrophes; depending on your use case and tokenizer, you might want to use the CharFilter version, as it processes the text before it's split into multiple tokens.
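In Solr this goes into the field type's analyzer definition in the schema; just to sketch the same chain with the underlying Lucene factories driven from Java (factory names are the registered SPI names in recent Lucene versions, the apostrophe pattern is an assumption):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class ApostropheStripping {
        // Strips straight and curly apostrophes before tokenization, so that
        // "Macy's" and "Macys" end up as the same token.
        static Analyzer build() throws IOException {
            return CustomAnalyzer.builder()
                .addCharFilter("patternReplace", "pattern", "['’]", "replacement", "")
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .build();
        }
    }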
Your second case is a straightforward synonym filter or a WordDelimiterFilter, where you expand JPMorgan to "JP Morgan", or use the WordDelimiterFilter to expand case changes into separate tokens. That'll also allow you to search for JP and get JPMorgan-related entries. These might have different effects on score; use debugQuery=true to see exactly how each term in your query contributes to the score.
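Sketched the same way, again with Lucene's factories from Java (in Solr this is the equivalent WordDelimiterGraphFilterFactory entry in the schema):

    import java.io.IOException;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class CaseChangeSplitting {
        // "JPMorgan" is indexed as "JP" + "Morgan" (and, with preserveOriginal, also
        // as "JPMorgan"), so both "JPMorgan" and "JP Morgan" queries can match.
        static Analyzer build() throws IOException {
            return CustomAnalyzer.builder()
                .withTokenizer("whitespace")
                .addTokenFilter("wordDelimiterGraph",
                    "generateWordParts", "1",
                    "splitOnCaseChange", "1",
                    "preserveOriginal", "1")
                .addTokenFilter("lowercase")
                .build();
        }
    }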
The third case is in general the same as the second case. You'll have to create a decent synonym word list for the terms used; this is usually something you build as you get feedback from your users, from existing dictionaries and from domain knowledge. There's also the option of preprocessing text using NLP, or, in this case, something as simple as indexing the initials of consecutive capitalized words could help.
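A sketch of the synonym side (the conf directory and the synonyms.txt file are assumptions; in Solr the equivalent is a SynonymGraphFilterFactory entry in the schema pointing at the synonyms file):

    import java.io.IOException;
    import java.nio.file.Paths;
    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.custom.CustomAnalyzer;

    public class SynonymExpansion {
        // Assumes conf/synonyms.txt contains a line such as:
        //   ibm, international business machines
        static Analyzer build() throws IOException {
            return CustomAnalyzer.builder(Paths.get("conf"))
                .withTokenizer("standard")
                .addTokenFilter("lowercase")
                .addTokenFilter("synonymGraph",
                    "synonyms", "synonyms.txt",
                    "ignoreCase", "true",
                    "expand", "true")
                .build();
        }
    }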
"Five things everyone should know about Unicode" is a blog post showing how Unicode characters can be used as an attack vector for websites.
The main example given of such a real-world attack is a fake WhatsApp app submitted to the Google Play store using a Unicode non-printable space in the developer name, which made the name unique and allowed it to get past Google's filters. The Mongolian Vowel Separator (U+180E) is one such non-printable space character.
Another vulnerability is to use alternative Unicode characters that look similar. The Mimic tool shows how this can work.
An example I can think of is to protect usernames when registering a new user. You don't want two usernames to be the same or for them to look the same either.
How do you protect against this? Is there a list of these characters out there? Should it be common practice to strip all of these types of characters from all form inputs?
What you are talking about is called a homoglyph attack.
There is a "confusables" list by Unicode here, and also have a look at this. There should be libraries based on these or pontentially other databases. One such library is this one that you can use in Java or Javascript. The same must exist for other languages as well, or you can write one.
The important thing, I think, is not to maintain your own database: the library or service part is easy to build on top of good data.
As for whether you should filter out similar-looking usernames, I think it depends. If there is an incentive for users to try to fake each other's usernames, maybe yes. For many other types of data, maybe there is no point in doing so. There is no generic best practice, I think, other than that you should assess the risk in your application, with your data points.
Also a different approach for a different problem, but what may often work for Unicode input validation is the \w word character class in a regular expression, if your regex engine is Unicode-ready. In such an engine, \w should match all Unicode classes of word characters, i.e. letters, modifiers and connectors in any language, but nothing else (no special characters). This does not protect against homoglyph attacks, but may protect against some injections while keeping your application Unicode-friendly.
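In Java, for instance, \w is ASCII-only unless you ask for Unicode semantics, so the whitelist idea looks like this (a minimal sketch):

    import java.util.regex.Pattern;

    public class WordCharWhitelist {
        public static void main(String[] args) {
            // UNICODE_CHARACTER_CLASS makes \w cover word characters in any script.
            Pattern word = Pattern.compile("^\\w+$", Pattern.UNICODE_CHARACTER_CLASS);

            System.out.println(word.matcher("Müller").matches());     // true
            System.out.println(word.matcher("элч").matches());        // true
            System.out.println(word.matcher("name\u180E").matches()); // false - U+180E is not a word character
        }
    }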
All sanitization works best when you have a whitelist of known safe values, and exclude all others.
ASCII is one such set of characters.
This could be approached in various ways; however, each one might increase the number of false positives, annoying legitimate users. Also, none of them will work for 100% of the cases (even if combined). They will just add an extra layer.
One approach would be to have tables with characters that look similar and check if duplicate names exist. What 'looks similar' means is subjective in many cases, so building such a list might be tricky. This method might produce false positives on certain occasions.
Also, reversing the order of certain letters might trick many users. Checking for anagrams or very similar names can be achieved using algorithms like Jaro-Winkler and Levenshtein distance (i.e., checking if a similar username/company name already exists). Sometimes however, this might be due to a different spelling of some word in some region (e.g., 'centre' vs 'center'), or the name of some company might deliberately contain an anagram. This approach might further increase the number of false positives.
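A minimal sketch of the edit-distance part, assuming Apache Commons Text is on the classpath and a purely hypothetical threshold:

    import org.apache.commons.text.similarity.LevenshteinDistance;

    public class SimilarNameCheck {
        private static final LevenshteinDistance DISTANCE = new LevenshteinDistance();

        // Flag a new name for review when it is within a small edit distance of an
        // existing one (the threshold of 2 is an arbitrary example, not a recommendation).
        static boolean suspiciouslyClose(String newName, String existingName) {
            int d = DISTANCE.apply(newName.toLowerCase(), existingName.toLowerCase());
            return d > 0 && d <= 2;
        }
    }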
Furthermore, as Jonathan mentioned, sanitisation is also a good approach; however, it might not protect against anagrams and may cause issues for legitimate users who want to use some special character.
As the OP also mentioned, special characters can also be stripped. Other parts of the name might need to be stripped as well, for example common suffixes like 'Inc.', '.com', etc.
Finally, the name can be restricted to contain only characters from one language, rather than a mixture of characters from various languages (a more relaxed version of this might disallow mixing characters within the same word, while allowing it when words are separated by a space). Requiring a capital first letter and lower case for the rest of the letters can further improve this approach, as certain lower-case letters (like 'l') may look like upper-case ones (like 'I') when certain fonts are used. Excluding the use of certain symbols (like '|') will enhance this approach further. This solution will annoy certain users who will not be able to use certain names.
A combination of some/all aforementioned approaches can also be used. The selection of the methods and how exactly they will be applied (e.g., you may choose to forbid similar names, or to require moderator approval in case a name is similar, or to not take any action, but simply warn a moderator/administrator) depends on the scenario that you are trying to solve.
I may have an innovative solution to this problem regarding usernames. Obviously, you want to allow ASCII characters, but in some special cases, other characters will be used (different language, as you said).
I think an intuitive way to allow both ASCII and other characters to be used in an username, while being protected against "Unicode Vulnerabilities", would be something like this:
Allow all ASCII characters and disallow other characters, except when there are x or more of these special characters in the username (which suggests the username is in another language). A sketch follows the examples below.
Take for example this:
Whatsapp, Inc + (U+180E) - Not allowed, only has 1 special character.
элч + (U+180E) - Allowed! It has more than x special characters (for example, 3). It can use the Mongolian separator since it's Mongolian.
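A rough sketch of that rule (the threshold x is whatever you choose; this is not meant to replace the checks discussed in the other answers):

    public class NonAsciiHeuristic {
        private static final int MIN_NON_ASCII = 3; // the "x" in the rule above, chosen arbitrarily

        // Accept pure-ASCII names, or names with enough non-ASCII characters to look like a
        // genuinely non-English name rather than a single look-alike substitution.
        static boolean acceptable(String username) {
            long nonAscii = username.codePoints().filter(cp -> cp > 0x7F).count();
            return nonAscii == 0 || nonAscii >= MIN_NON_ASCII;
        }
    }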
Obviously, this does not protect you 100% from these types of vulnerabilities, but it is an efficient method I have been using, ESPECIALLY if you do not mention the existence of this algorithm on the "login" or "register" page: attackers might figure out that you have some algorithm protecting the website from these types of attacks, but if you don't describe it they cannot easily reverse engineer it and find a way to bypass it.
Sorry if this is not an answer you are looking for, just sharing my ideas.
Edit: Or you can use an RNN (recurrent neural network) to detect the language and allow specific characters from that language.
What are the recommended methods for extracting locations from free text?
What I can think of is to use regex rules like "words ... in location". But are there better approaches than this?
Also, I can think of having a lookup hash table with names of countries and cities, and then comparing every extracted token from the text against that hash table.
Does anybody know of better approaches?
Edit: I'm trying to extract locations from tweet text. So the high volume of tweets might also affect my choice of method.
All rule-based approaches will fail (if your text is really "free"). That includes regex, context-free grammars, any kind of lookup... Believe me, I've been there before :-)
This problem is called Named Entity Recognition. Location is one of the 3 most studied classes (with Person and Organization). Stanford NLP has an open source Java implementation that is extremely powerful: http://nlp.stanford.edu/software/CRF-NER.shtml
You can easily find implementations in other programming languages.
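A sketch of that route in Java (the model path is the one shipped in the Stanford NER download; the tweet text is made up):

    import java.util.List;
    import edu.stanford.nlp.ie.AbstractSequenceClassifier;
    import edu.stanford.nlp.ie.crf.CRFClassifier;
    import edu.stanford.nlp.ling.CoreLabel;
    import edu.stanford.nlp.util.Triple;

    public class LocationExtractor {
        public static void main(String[] args) throws Exception {
            // 3-class model: PERSON, ORGANIZATION, LOCATION.
            AbstractSequenceClassifier<CoreLabel> classifier =
                CRFClassifier.getClassifier("classifiers/english.all.3class.distsim.crf.ser.gz");

            String tweet = "Just landed in New York, off to the People's Republic of China next week";
            for (Triple<String, Integer, Integer> span : classifier.classifyToCharacterOffsets(tweet)) {
                if ("LOCATION".equals(span.first())) {
                    System.out.println(tweet.substring(span.second(), span.third()));
                }
            }
        }
    }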
Put all of your valid locations into a sorted list. If you are planning on comparing case-insensitively, make sure the case of your list is already normalized.
Then all you have to do is loop over individual "words" in your input text and at the start of each new word, start a new binary search in your location list. As soon as you find a no-match, you can skip the entire word and proceed with the next.
Possible problem: multi-word locations such as "New York", "3rd Street", "People's Republic of China". Perhaps all it takes, though, is to save the position of the first new word, if you find your bsearch leads you to a (possible!) multi-word result. Then, if the full comparison fails -- possibly several words later -- all you have to do is revert to this 'next' word, in relation to the previous one where you started.
As to what a "word" is: while you are preparing your location list, make a list of all characters that may appear inside locations. Only phrases that contain characters from this list can be considered a valid 'word'.
How fast are the tweets coming in? As in, is it the full Twitter firehose or some filtered queries?
A somewhat more sophisticated approach, similar to what you described, is to use an NLP tool integrated with a gazetteer.
Very few NLP tools will keep up with Twitter rates, and very few do well with Twitter because of all the leetspeak. The NLP can be tuned for precision or recall depending on your needs, to cut down on the lookups performed in the gazetteer.
I recommend looking at Rosoka (also Rosoka Cloud through Amazon AWS) and GeoGravy.
I want to take what people chat about in a chat room and do the following information retrieval:
Get the keywords
Ignore all noise words; keep mainly verbs and nouns
Perform stemming on the keywords so that I don't store the same keyword in many forms
If a synonym keyword is already stored in my storage then the existing synonym should be used instead of the new keyword
Store the processed keyword in persistent storage with a reference to the chat message it was located in and the user who uttered it
With this processed information I want to slowly get an idea of what people are talking about in chat rooms, and then use this to automatically find related chat rooms etc. based on these keywords.
My question to you is as follows: what are the best C/C++ or .NET tools for doing the above?
I partially agree with @larsmans' comment. Your question, in practice, may indeed be more complex than the question you posted.
However, simplifying the question/problem, I guess the answer to your question could be one of Lucene's implementations: Lucene (Java), Lucene.Net (C#) or CLucene (C++).
Following the points in your question:
Lucene would take care of point 1 by using String tokenizers (you can customize or use your own).
For point 2 you could use a TokenFilter like StopFilter so Lucene can read a list of stopwords ("the", "a", "an"...) that it should not use.
For point 3 you could use PorterStemFilter.
Point 4 is a little bit trickier, but could be done using a customized TokenFilter.
Points 1 to 4 are performed in the analysis/tokenization phase, for which an Analyzer is responsible.
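A sketch of such an analyzer in Java Lucene (Lucene.Net and CLucene have close equivalents; import locations shift slightly between Lucene versions):

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.LowerCaseFilter;
    import org.apache.lucene.analysis.StopFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.Tokenizer;
    import org.apache.lucene.analysis.en.EnglishAnalyzer;
    import org.apache.lucene.analysis.en.PorterStemFilter;
    import org.apache.lucene.analysis.standard.StandardTokenizer;

    public class ChatAnalyzer extends Analyzer {
        @Override
        protected TokenStreamComponents createComponents(String fieldName) {
            Tokenizer source = new StandardTokenizer();                           // point 1: tokenization
            TokenStream stream = new LowerCaseFilter(source);
            stream = new StopFilter(stream, EnglishAnalyzer.getDefaultStopSet()); // point 2: noise words
            stream = new PorterStemFilter(stream);                                // point 3: stemming
            return new TokenStreamComponents(source, stream);
        }
    }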
Regarding point 5, in Lucene you can store Documents with fields. A document can have an arbitrary number and mix of fields. So you could create a single Document for each chat room with all its text concatenated, and have another field of the document reference the chatroom it was extracted from. You will end up with a bunch of Lucene documents that you can compare. So you can compare your current chat room with others to see which one is more similar to the one you are on.
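For point 5, a sketch of building such a document (the field names and the concatenated text are placeholders of my choosing):

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.document.TextField;

    public class ChatRoomDocuments {
        static Document forRoom(String roomId, String concatenatedMessages) {
            Document doc = new Document();
            doc.add(new StringField("chatroomId", roomId, Field.Store.YES));         // exact reference back to the room
            doc.add(new TextField("content", concatenatedMessages, Field.Store.NO)); // analyzed with the chat analyzer above
            return doc;
        }
    }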
If all you want is a set of the best keywords to describe a chat room, your needs are closer to an information extraction/automatic summarization/topic spotting task, as @larsmans said. But you can still use Lucene for the parsing/tokenization phase.
*I referenced the Java docs, but CLucene and Lucene.Net have very similar APIs so it won't be much trouble to figure out the differences.