How to match all whitespace? [duplicate]

This question already has an answer here:
Is there a way to use custom patterns such as a regex or functions in a match?
(1 answer)
Closed 2 years ago.
Context: Rust has the match construct, which is really useful for writing a (possibly) exhaustive list of cases and their corresponding results. The problem is: how do I create one case that encompasses many related cases?
Regarding my specific problem, I'm making a lexer which reads a string character-by-character and spits out tokens. Its main function looks like this:
(...)
fn new(input: &str) -> Lexer {
    let characters = input.chars();
    for c in characters {
        let token: Option<Token> = match c {
            '+' => Some(Token::Add),
            '-' => Some(Token::Minus),
            '*' => Some(Token::Mul),
            '/' => Some(Token::Div),
            'e' => Some(Token::EulersNum),
            'π' => Some(Token::Pi),
            '(' => Some(Token::LeftParen),
            ')' => Some(Token::RightParen),
            ' ' | '\t' | '\n' => continue, // Whitespace
            _ => None,
        };
        if token.is_none() {
            continue;
        }
    }
    todo!()
}
(...)
Now, the most important part, for the purposes of this question, is the arm commented with 'Whitespace'. The problem with my handling of whitespace is that it may not correspond to the actual definition of whitespace in a given string format. Sure, I could handle all of the different kinds of ASCII whitespace, but what about Unicode? Making an exhaustive list of whitespaces is not only cumbersome, it also obfuscates the code. It should be left to the language, not to its users.
Is it possible to just match it with a 'Whitespace' expression, such as:
(...)
Whitespace => continue,
(...)
And if so, how do I do it?

You could use char::is_whitespace() in a match guard:
match c {
    '+' => Some(Token::Add),
    '-' => Some(Token::Minus),
    '*' => Some(Token::Mul),
    '/' => Some(Token::Div),
    c if c.is_whitespace() => Some(Token::Whitespace),
    _ => None,
};
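For illustration, here is a minimal, self-contained sketch (not the asker's full lexer) showing that the guard also catches non-ASCII whitespace such as NO-BREAK SPACE (U+00A0):

fn main() {
    for c in "1 +\t2\u{00A0}* 3".chars() {
        match c {
            c if c.is_whitespace() => continue, // skips ASCII and Unicode whitespace alike
            c => print!("{c}"),
        }
    }
    println!(); // prints "1+2*3"
}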

Sure, I could handle all of the different kinds of ASCII whitespace, but what about Unicode?
If you need to handle many string formats or need more complex matching capabilities, use a library that provides its own whitespace predicate.
Otherwise, use functions like char::is_whitespace() if you just need to match against Unicode whitespace.
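As a quick sanity check, char::is_whitespace() follows Unicode's White_Space property, so it already covers far more than the ASCII cases:

fn main() {
    assert!(' '.is_whitespace());
    assert!('\t'.is_whitespace());
    assert!('\u{2003}'.is_whitespace()); // EM SPACE, a non-ASCII whitespace character
    assert!(!'x'.is_whitespace());
}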
Making an exhaustive list of whitespaces is not only cumbersome, it also obfuscates the code. It should be left to the language, not to its users.
No: match (and pattern matching in general) is a general-purpose tool. Rust is not a language specialized for language or string processing, so adding dedicated "whitespace matching" support to pattern matching does not make sense.
Rust has basic ASCII, UTF-8, UTF-16, etc. support for practical reasons, but that's about it. Whether more complex text handling belongs in the standard library is debatable.

Related

What is an efficient way to compare strings while ignoring case?

To compare two Strings, ignoring case, it looks like I first need to convert to a lower case version of the string:
let a_lower = a.to_lowercase();
let b_lower = b.to_lowercase();
a_lower.cmp(&b_lower)
Is there a method that compares strings, ignoring case, without creating the temporary lower case strings, i.e. that iterates over the characters, performs the to-lowercase conversion there and compares the result?
There is no built-in method, but you can write one to do exactly as you described, assuming you only care about ASCII input.
use itertools::{EitherOrBoth::*, Itertools as _}; // 0.9.0
use std::cmp::Ordering;

fn cmp_ignore_case_ascii(a: &str, b: &str) -> Ordering {
    a.bytes()
        .zip_longest(b.bytes())
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.to_ascii_lowercase().cmp(&b.to_ascii_lowercase()),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}
As some comments below have pointed out, case-insensitive comparison is not going to work properly for UTF-8 without operating on the whole string, and even then some case conversions have multiple representations, which could give unexpected results.
With those caveats, the following will work for a lot of extra cases compared with the ASCII version above (e.g. most accented Latin characters) and may be satisfactory, depending on your requirements:
fn cmp_ignore_case_utf8(a: &str, b: &str) -> Ordering {
    a.chars()
        .flat_map(char::to_lowercase)
        .zip_longest(b.chars().flat_map(char::to_lowercase))
        .map(|ab| match ab {
            Left(_) => Ordering::Greater,
            Right(_) => Ordering::Less,
            Both(a, b) => a.cmp(&b),
        })
        .find(|&ordering| ordering != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}
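For illustration, a short usage sketch of the two functions above (assuming both are in scope, along with the itertools import):

fn main() {
    use std::cmp::Ordering;

    assert_eq!(cmp_ignore_case_ascii("Hello", "hELLO"), Ordering::Equal);
    assert_eq!(cmp_ignore_case_utf8("Crème", "CRÈME"), Ordering::Equal);
    // A string that is a strict prefix of the other compares as Less.
    assert_eq!(cmp_ignore_case_ascii("abc", "abcd"), Ordering::Less);
}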
If you are only working with ASCII, you can use eq_ignore_ascii_case:
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Unicode
The best way to support Unicode is to use to_lowercase() or to_uppercase(). Unicode has many caveats, and these functions handle most situations, though some locale-specific strings are still not handled correctly.
let left = "first".to_string();
let right = "FiRsT".to_string();
assert!(left.to_lowercase() == right.to_lowercase());
Efficiency
It is possible to iterate and return on the first non-equal character, so in essence you only allocate one character at a time. However, iterating with the chars function does not account for every situation Unicode can throw at us.
See the answer by Peter Hall for details on this.
ASCII
If you are only working with ASCII, the most efficient option is eq_ignore_ascii_case (as per Ibraheem Ahmed's answer), which does not allocate or copy temporaries.
This is only a good idea if your code controls at least one side of the comparison and you are certain it will only contain ASCII.
assert!("Ferris".eq_ignore_ascii_case("FERRIS"));
Locale
Rust's case functions are best-effort with regard to locales and do not handle all of them. To support proper internationalisation, you will need to look for other crates that do this.

Matching Strings with string literals in Rust

I'm learning Rust and have been stuck on this piece of code that matches a string with some literals for a while.
while done_setup == false {
    println!("Enter the difficulty mode you wish to play (Easy/Medium/Hard):");
    let mut difficulty = String::new();
    io::stdin()
        .read_line(&mut difficulty)
        .expect("Invalid input, aborting");
    match difficulty.as_str() {
        "Easy" => { num_guesses = 10; },
        "Medium" => { num_guesses = 7; },
        "Hard" => { num_guesses = 3; },
        _ => {
            println!("Pls enter a valid difficulty mode!");
            continue;
        },
    }
    println!("You are playing in {} mode, you have {} tries!", difficulty, num_guesses);
    done_setup = true;
}
Apparently the pattern never matches with "Easy", "Medium" or "Hard" since user input ALWAYS flows to the default case. I've read similar SO questions and I understand that String objects are not the same as literals (str), but shouldn't difficulty.as_str() take care of that?
I'm looking for a clean & "proper" way to code this, any suggestions welcome & thanks in advance!
read_line keeps the line terminator, so difficulty will contain a trailing newline (e.g. "Easy\n"), which never compares equal to the literal "Easy".
trim can be used to strip away leading and trailing whitespace before matching.
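A minimal sketch of the fix, reusing the loop from the question: match on the trimmed string and the literals compare equal:

match difficulty.trim() {
    "Easy" => { num_guesses = 10; },
    "Medium" => { num_guesses = 7; },
    "Hard" => { num_guesses = 3; },
    _ => {
        println!("Pls enter a valid difficulty mode!");
        continue;
    },
}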

Implementing String Interpolation in Flex/Bison

I'm currently writing an interpreter for a language I have designed.
The lexer/parser (GLR) is written in Flex/Bison and the main interpreter in D, and everything has been working flawlessly so far.
The thing is I want to also add string interpolation, that is identify string literals that contain a specific pattern (e.g. "[some expression]") and convert the included expression. I think this should be done at parser level, from within the corresponding Grammar action.
My idea is converting/treating the interpolated string as what it would look like with simple concatenation (as it works right now).
E.g.
print "this is the [result]. yay!"
to
print "this is the " + result + ". yay!"
However, I'm a bit confused as to how I could do that in Bison: basically, how do I tell it to re-parse a specific string (while constructing the main AST)?
Any ideas?
You could reparse the string, if you really wanted to, by generating a reentrant parser. You would probably want a reentrant scanner as well, although I suppose you could kludge something together with a default scanner, using flex's buffer stack. Indeed, it's worth learning how to build reentrant parsers and scanners on the general principle of avoiding unnecessary globals, whether or not you need them for this particular purpose.
But you don't really need to reparse anything; you can do the entire parse in one pass. You just need enough smarts in your scanner so that it knows about nested interpolations.
The basic idea is to let the scanner split the string literal with interpolations into a series of tokens, which can easily be assembled into an appropriate AST by the parser. Since the scanner may return more than one token out of a single string literal, we'll need to introduce a start condition to keep track of whether the scan is currently inside a string literal or not. And since interpolations can, presumably, be nested we'll use flex's optional start condition stack, enabled with %option stack, to keep track of the nested contexts.
So here's a rough sketch.
As mentioned, the scanner has extra start conditions: SC_PROGRAM, the default, which is in effect while the scanner is scanning regular program text, and SC_STRING, in effect while the scanner is scanning a string. SC_PROGRAM is only needed because flex does not provide an official interface to check whether the start condition stack is empty; aside from nesting, it is identical to the INITIAL top-level start condition. The start condition stack is used to keep track of interpolation markers ([ and ] in this example), and it is needed because an interpolated expression might use brackets (as array subscripts, for example) or might even include a nested interpolated string. Since SC_PROGRAM is, with one exception, identical to INITIAL, we'll make it an inclusive start condition.
%option stack
%s SC_PROGRAM
%x SC_STRING
%%
Since we're using a separate start condition to analyse string literals, we can also normalise escape sequences as we parse. Not all applications will want to do this, but it's pretty common. But since that's not really the point of this answer, I've left out most of the details. More interesting is the way that embedded interpolation expressions are handled, particularly deeply nested ones.
The end result will be to turn string literals into a series of tokens, possibly representing a nested structure. In order to avoid actually parsing in the scanner, we don't make any attempt to create AST nodes or otherwise rewrite the string literal; instead, we just pass the quote characters themselves through to the parser, delimiting the sequence of string literal pieces:
["] { yy_push_state(SC_STRING); return '"'; }
<SC_STRING>["] { yy_pop_state(); return '"'; }
A very similar set of rules is used for interpolation markers:
<*>"[" { yy_push_state(SC_PROGRAM); return '['; }
<INITIAL>"]" { return ']'; }
<*>"]" { yy_pop_state(); return ']'; }
The second rule above avoids popping the start condition stack if it is empty (as it will be in the INITIAL state). It's not necessary to issue an error message in the scanner; we can just pass the unmatched close bracket through to the parser, which will then do whatever error recovery seems necessary.
To finish off the SC_STRING state, we need to return tokens for pieces of the string, possibly including escape sequences:
<SC_STRING>{
[^[\\"]+ { yylval.str = strdup(yytext); return T_STRING; }
\\n { yylval.chr = '\n'; return T_CHAR; }
\\t { yylval.chr = '\t'; return T_CHAR; }
/* ... Etc. */
\\x[[:xdigit]]{2} { yylval.chr = strtoul(yytext, NULL, 16);
return T_CHAR; }
\\. { yylval.chr = yytext[1]; return T_CHAR; }
}
Returning escaped characters like that to the parser is probably not the best strategy; normally I would use an internal scanner buffer to accumulate the entire string. But it was simple for illustrative purposes. (Some error handling is omitted here; there are various corner cases, including newline handling and the annoying case where the last character in the program is a backslash inside an unterminated string literal.)
In the parser, we just need to insert a concatenation node for interpolated strings. The only complication is that we don't want to insert such a node for the common case of a string literal without any interpolations, so we use two syntax productions, one for a string with exactly one contained piece, and one for a string with two or more pieces:
string : '"' piece '"' { $$ = $2; }
| '"' piece piece_list '"' { $$ = make_concat_node(
prepend_to_list($2, $3));
}
piece : T_STRING { $$ = make_literal_node($1); }
| '[' expr ']' { $$ = $2; }
piece_list
: piece { $$ = new_list($1); }
| piece_list piece { $$ = append_to_list($1, $2); }
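To make the data flow concrete: with the sketch above, the question's example print "this is the [result]. yay!" would reach the parser as a token sequence along these lines (token values in parentheses, assuming result scans as an ordinary expression):

'"'  T_STRING("this is the ")  '['  <tokens of result>  ']'  T_STRING(". yay!")  '"'

The second string production then wraps the two T_STRING pieces and the interpolated expr in a single concatenation node.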

Prolog DCG Building/Recognizing Word Strings from Alphanumeric Characters

So I'm writing simple parsers for some programming languages in SWI-Prolog using Definite Clause Grammars. The goal is to return true if the input string or file is valid for the language in question, or false if the input string or file is not valid.
In almost all of the languages there is an "identifier" predicate. In most of the languages the identifier is defined as one of the following in EBNF: letter { letter | digit } or ( letter | digit ) { letter | digit }, that is to say, in the first case a letter followed by zero or more alphanumeric characters, and in the second case an alphanumeric character followed by zero or more alphanumeric characters.
My input file is split into a list of word strings (i.e. someIdentifier1 = 3 becomes the list [someIdentifier1,=,3]). The reason for the string to be split into lists of words rather than lists of letters is for recognizing keywords defined as terminals.
How do I implement "identifier" so that it recognizes any alphanumeric string, or a string consisting of a letter followed by alphanumeric characters?
Is it possible or necessary to further split the word into letters for this particular predicate only, and if so how would I go about doing this? Or is there another solution, perhaps using SWI-Prolog libraries' built-in predicates?
I apologize for the poorly worded title of this question; however, I am unable to clarify it any further.
First, when you need to reason about individual letters, it is typically most convenient to reason about lists of characters.
In Prolog, you can easily convert atoms to characters with atom_chars/2.
For example:
?- atom_chars(identifier10, Cs).
Cs = [i, d, e, n, t, i, f, i, e, r, '1', '0'].
Once you have such characters, you can use predicates like char_type/2 to reason about properties of each character.
For example:
?- char_type(i, T).
T = alnum ;
T = alpha ;
T = csym ;
etc.
The general pattern to express identifiers such as yours with DCGs can look as follows:
identifier -->
    [L],
    { letter(L) },
    identifier_rest.

identifier_rest --> [].
identifier_rest -->
    [I],
    { letter_or_digit(I) },
    identifier_rest.
You can use this as a building block, and only need to define letter/1 and letter_or_digit/1. This is very easy with char_type/2.
Further, you can of course introduce an argument to relate such lists to atoms.
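For completeness, a minimal sketch of the two missing predicates in terms of char_type/2:

letter(L)          :- char_type(L, alpha).
letter_or_digit(C) :- char_type(C, alnum).

With these in place, a query like ?- atom_chars(someIdentifier1, Cs), phrase(identifier, Cs). succeeds for valid identifiers and fails otherwise.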

What is the easiest way to determine if a character is in Unicode range, in Rust?

I'm looking for the easiest way to determine if a character in Rust is between two Unicode values.
For example, I want to know if a character s is between [#x1-#x8] or [#x10FFFE-#x10FFFF]. Is there a function that does this already?
The simplest way for me to match a character was this:

fn match_char(data: &char) -> bool {
    match *data {
        '\x01'..='\x08' |
        '\u{10FFFE}'..='\u{10FFFF}' => true,
        _ => false,
    }
}
Pattern matching a character was the easiest route for me, compared to a bunch of if statements. It might not be the most performant solution, but it served me very well.
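For instance, a quick usage sketch of match_char as written above:

fn main() {
    assert!(match_char(&'\x05'));
    assert!(match_char(&'\u{10FFFE}'));
    assert!(!match_char(&'a'));
}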
The simplest way, assuming that they are not Unicode categories, is to use the regular comparison operators:

(s >= '\x01' && s <= '\x08') || s == '\u{10FFFE}' || s == '\u{10FFFF}'

(In case you weren't aware of the literal forms of these things: \xXX gives hexadecimal escapes for characters up to 0x7F, and \u{XXXXXX} covers any Unicode scalar value. Constructing the character from a number also works, e.g. char::from_u32(0x10FFFE), which returns an Option<char>; it's just less readable.)
