How to parse &str with named parameters?

I am trying to find the best way to parse a &str and extract out the COMMAND_TYPE and named parameters. The named parameters can be anything.
Here is the proposed string (it can be changed).
COMMAND_TYPE(param1:2222,param2:"the quick \"brown\" fox, blah,", param3:true)
I have been trying a few ways to extract the COMMAND_TYPE, which seems fairly simple:
pub fn parse_command(command: &str) -> Option<String> {
let mut matched = String::new();
let mut chars = command.chars();
while let Some(next) = chars.next() {
if next != '(' {
matched.push(next);
} else {
break;
}
}
if matched.is_empty() {
None
} else {
Some(matched)
}
}
Extracting the parameters from within the brackets seems straightforward too:
pub fn parse_params(command: &str) -> Option<&str> {
    // Position of the first '(' and of the last ')'.
    let start = command.find('(')?;
    let end = command.rfind(')')?;
    // `get` returns None instead of panicking if ')' comes before '('.
    command.get(start + 1..end)
}
I have been looking at the nom crate and that seems fairly powerful (and complicated), so I am not sure if I really need to use it.
How do I extract the named parameters in between the brackets into a HashMap?

Your code seems to work for extracting the command and the full parameter list. If you don't need to parse something more complex than that, you can probably avoid using nom as a dependency.
However, you will probably run into problems if you want to parse each parameter individually: your format seems broken. In your example, there are no escape characters for either the double quotes or the commas, so param2 cannot be extracted cleanly.
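If the format is taken to escape quotes with a backslash (as the \" sequences in the example string suggest), the parameter list can be split by hand without nom. The following is only a minimal sketch under that assumption: split_top_level and parse_named_params are hypothetical names, every value is kept as a String, quotes are stripped, and whitespace outside quotes is dropped.

```rust
use std::collections::HashMap;

/// Splits on commas that are outside double quotes.
/// Assumption: inside quotes, a backslash escapes the next character.
fn split_top_level(params: &str) -> Vec<String> {
    let mut parts = Vec::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut escaped = false;
    for c in params.chars() {
        if escaped {
            // Keep the escaped character itself; the backslash was dropped.
            current.push(c);
            escaped = false;
        } else if in_quotes && c == '\\' {
            escaped = true;
        } else if c == '"' {
            // Quotes delimit a value but are not stored.
            in_quotes = !in_quotes;
        } else if c == ',' && !in_quotes {
            parts.push(std::mem::take(&mut current));
        } else if c.is_whitespace() && !in_quotes {
            // Ignore whitespace between tokens.
        } else {
            current.push(c);
        }
    }
    if !current.is_empty() {
        parts.push(current);
    }
    parts
}

/// Turns `name:value` pairs into a map; pairs without a ':' are skipped.
fn parse_named_params(params: &str) -> HashMap<String, String> {
    let parts = split_top_level(params);
    parts
        .iter()
        .filter_map(|pair| pair.split_once(':'))
        .map(|(name, value)| (name.to_string(), value.to_string()))
        .collect()
}
```

The input would be the text between the brackets, for example the result of parse_params. Typed values (integers, booleans) would need a further step, such as an enum for the value type.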

Related

The best way to enumerate through a string in Rust? (chars() vs as_bytes())

I'm new to Rust, and I'm learning it using the Rust Book.
Recently, I found this function there:
// Returns the number of characters in the first
// word of the given string
fn first_word(s: &String) -> usize {
let bytes = s.as_bytes();
for (i, &item) in bytes.iter().enumerate() {
if item == b' ' {
return i;
}
}
s.len()
}
As you can see, the authors use the String::as_bytes() method to iterate over the string. They then compare each byte with the byte literal b' ' (the space character as a u8) to check whether the end of the first word has been reached.
As far as I know, there is another option, which looks much better:
fn first_word(s: &String) -> usize {
for (i, item) in s.chars().enumerate() {
if item == ' ' {
return i;
}
}
s.len()
}
Here, I'm using String::chars() method, and the function looks much cleaner.
So the question is: is there any difference between these two things? If so, which one is better and why?
If your string happens to be purely ASCII (where there is only one byte per character), the two functions should behave identically.
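However, Rust strings are UTF-8, where a single character can be encoded as multiple bytes. Using s.chars() should therefore be preferred if you want to count characters: your function keeps working as expected when the string contains non-ASCII characters. Keep in mind that the two versions then return different things: the as_bytes() version returns a byte offset, while enumerate() over chars() counts characters.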
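To make the difference concrete, here is a small sketch with three variants. The function names are made up for illustration. The char_indices() variant is an addition not mentioned above: it iterates over characters but still yields byte offsets, which is what slicing needs.

```rust
/// Byte offset of the first space, or the byte length if there is none.
fn first_word_bytes(s: &str) -> usize {
    s.as_bytes()
        .iter()
        .position(|&b| b == b' ')
        .unwrap_or(s.len())
}

/// Number of chars before the first space, or the total char count.
fn first_word_chars(s: &str) -> usize {
    s.chars()
        .position(|c| c == ' ')
        .unwrap_or_else(|| s.chars().count())
}

/// Byte offset of the first space, found while iterating over chars.
fn first_word_char_indices(s: &str) -> usize {
    s.char_indices()
        .find(|&(_, c)| c == ' ')
        .map_or(s.len(), |(i, _)| i)
}
```

For "héllo world", where é takes two bytes, the byte-based variants return 6 and the char-counting variant returns 5. Only a byte offset can be used to slice the string, as in &s[..6].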
As @eggyal points out, Rust has a str::split_whitespace method, which returns an iterator over words and splits on all whitespace (instead of just spaces). You could use it like so:
fn first_word(s: &String) -> usize {
if let Some(word) = s.split_whitespace().next() {
word.len()
}
else {
s.len()
}
}

Attempt to implement sscanf in Rust, failing when passing &str as argument

Problem:
I'm new to Rust, and I'm trying to implement a macro that simulates sscanf from C.
So far it works with any numeric type, but not with strings, as I am already parsing a string.
macro_rules! splitter {
    ($string:expr, $sep:expr) => {
        // A single expression, so the macro can be used on the right-hand side of `let`.
        $string.split($sep).collect::<Vec<&str>>()
    };
}
macro_rules! scan_to_types {
($buffer:expr,$sep:expr,[$($y:ty),+],$($x:expr),+) => {
let res = splitter!($buffer,$sep);
let mut i = 0;
$(
$x = res[i].parse::<$y>().unwrap_or_default();
i+=1;
)*
};
}
fn main() {
let mut a :u8; let mut b :i32; let mut c :i16; let mut d :f32;
let buffer = "00:98;76,39.6";
let sep = [':',';',','];
scan_to_types!(buffer,sep,[u8,i32,i16,f32],a,b,c,d); // this will work
println!("{} {} {} {}",a,b,c,d);
}
This obviously won't work, because the macro expands to a call to parse::<&str>(), and &str does not implement FromStr:
let a :u8; let b :i32; let c :i16; let d :f32; let e :&str;
let buffer = "02:98;abc,39.6";
let sep = [':',';',','];
scan_to_types!(buffer,sep,[u8,i32,&str,f32],a,b,e,d);
println!("{} {} {} {}",a,b,e,d);
$x = res[i].parse::<$y>().unwrap_or_default();
| ^^^^^ the trait `FromStr` is not implemented for `&str`
What I have tried
I have tried comparing types using TypeId, with an if/else condition inside the macro to skip the parsing, but the same error occurs, because both branches are still type-checked for every type, so the expansion is not valid code:
macro_rules! scan_to_types {
($buffer:expr,$sep:expr,[$($y:ty),+],$($x:expr),+) => {
let res = splitter!($buffer,$sep);
let mut i = 0;
$(
if TypeId::of::<$y>() == TypeId::of::<&str>(){
$x = res[i];
}else{
$x = res[i].parse::<$y>().unwrap_or_default();
}
i+=1;
)*
};
}
Is there a way to set conditions or skip a repetition inside a macro? Or is there a better approach to building sscanf with macros? I have already written functions that parse those strings, but I couldn't pass types as arguments or make them generic.
Note before the answer: you probably don't want to emulate sscanf() in Rust. There are many very capable parsers in Rust, so you should probably use one of them.
Simple answer: the simplest way to address your problem is to replace the use of &str with String, which makes your macro compile and run. If your code is not performance-critical, that is probably all you need. If you care about performance and about avoiding allocation, read on.
A downside of String is that under the hood it copies the string data from the string you're scanning into a freshly allocated owned string. Your original approach of using an &str should have allowed for your &str to directly point into the data that was scanned, without any copying. Ideally we'd like to write something like this:
trait MyFromStr {
fn my_from_str(s: &str) -> Self;
}
// when called on a type that impls `FromStr`, use `parse()`
impl<T: FromStr + Default> MyFromStr for T {
fn my_from_str(s: &str) -> T {
s.parse().unwrap_or_default()
}
}
// when called on &str, just return it without copying
impl MyFromStr for &str {
fn my_from_str(s: &str) -> &str {
s
}
}
Unfortunately that doesn't compile, complaining of a "conflicting implementation of trait MyFromStr for &str", even though there is no conflict between the two implementations, as &str doesn't implement FromStr. But the way Rust currently works, a blanket implementation of a trait precludes manual implementations of the same trait, even on types not covered by the blanket impl.
In the future this will be resolved by specialization. Specialization is not yet part of stable Rust, and might not come to stable Rust for years, so we have to think of another solution. In case of macro usage, we can just let the compiler "specialize" for us by creating two traits with the same name. (This is similar to the autoref-based specialization invented by David Tolnay, but even simpler because it doesn't require autoref resolution to work, as we have the types provided explicitly.)
We create separate traits for parsed and unparsed values, and implement them as needed:
trait ParseFromStr {
fn my_from_str(s: &str) -> Self;
}
impl<T: FromStr + Default> ParseFromStr for T {
fn my_from_str(s: &str) -> T {
s.parse().unwrap_or_default()
}
}
pub trait StrFromStr {
fn my_from_str(s: &str) -> &str;
}
impl StrFromStr for &str {
fn my_from_str(s: &str) -> &str {
s
}
}
Then in the macro we just call <$y>::my_from_str() and let the compiler generate the correct code. Since macros are untyped, this works because we never need to provide a single "trait bound" that would disambiguate which my_from_str() we want. (Such a trait bound would require specialization.)
macro_rules! scan_to_types {
($buffer:expr,$sep:expr,[$($y:ty),+],$($x:expr),+) => {
#[allow(unused_assignments)]
{
let res = splitter!($buffer,$sep);
let mut i = 0;
$(
$x = <$y>::my_from_str(&res[i]);
i+=1;
)*
}
};
}
Complete example in the playground.
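Since the playground link is not reproduced here, the following sketch assembles the pieces into one self-contained unit. It is a variation rather than the linked code: the macro pulls fields from the split iterator instead of indexing into a Vec, a missing field is treated as an empty string, and scan_demo is a hypothetical helper added only to show a call.

```rust
use std::str::FromStr;

trait ParseFromStr {
    fn my_from_str(s: &str) -> Self;
}
// Anything that implements `FromStr` is parsed, falling back to the default.
impl<T: FromStr + Default> ParseFromStr for T {
    fn my_from_str(s: &str) -> T {
        s.parse().unwrap_or_default()
    }
}

trait StrFromStr {
    fn my_from_str(s: &str) -> &str;
}
// `&str` is passed through without copying.
impl StrFromStr for &str {
    fn my_from_str(s: &str) -> &str {
        s
    }
}

macro_rules! scan_to_types {
    ($buffer:expr, $sep:expr, [$($y:ty),+], $($x:expr),+) => {{
        let mut fields = $buffer.split($sep);
        $(
            // The compiler picks whichever trait applies to `$y`.
            $x = <$y>::my_from_str(fields.next().unwrap_or(""));
        )+
    }};
}

// Hypothetical helper that exercises the macro with a `&str` field.
fn scan_demo() -> (u8, i32, &'static str, f32) {
    let a: u8;
    let b: i32;
    let e: &str;
    let d: f32;
    let buffer = "02:98;abc,39.6";
    let sep = [':', ';', ','];
    scan_to_types!(buffer, &sep[..], [u8, i32, &str, f32], a, b, e, d);
    (a, b, e, d)
}
```

Both traits must be in scope where the macro is expanded, because the choice between them happens at the call to my_from_str.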

Find Files that Match a Dynamic Pattern

I want to be able to parse all the files in a directory to find the one with the greatest timestamp that matches a user provided pattern.
I.e. if the user runs
$ search /foo/bar/baz.txt
and the directory /foo/bar/ contains files baz.001.txt, baz.002.txt, and baz.003.txt, then the result should be baz.003.txt
At the moment I'm constructing a PathBuf.
Using that to build a Regex.
Then finding all the files in the directory that match the expression.
But it feels like this is a lot of work for a relatively simple problem.
fn find(foo: &str) -> Result<Vec<String>, Box<dyn Error>> {
let mut files = vec![];
let mut path = PathBuf::from(foo);
let base = path.parent().unwrap().to_str().unwrap();
let file_name = path.file_stem().unwrap().to_str().unwrap();
let extension = path.extension().unwrap().to_str().unwrap();
let pattern = format!("{}\\.\\d{{3}}\\.{}", file_name, extension);
let expression = Regex::new(&pattern).unwrap();
let objects: Vec<String> = fs::read_dir(&base)
.unwrap()
.map(|entry| {
entry
.unwrap()
.path()
.file_name()
.unwrap()
.to_str()
.unwrap()
.to_owned()
})
.collect();
for object in objects.iter() {
if expression.is_match(object) {
files.push(String::from(object));
}
}
Ok(files)
}
Is there an easier way to take the file path, generate a pattern, and find all the matching files?
Rust is not really a language suited to quick-and-dirty solutions. Instead, it strongly encourages solutions in which all corner cases are properly handled. This usually does not lead to extremely short code, but you can avoid a lot of boilerplate by relying on external crates. Here is what I would do, assuming you don't already have a "library-wide" error type.
fn find(foo: &str) -> Result<Vec<String>, FindError> {
let path = PathBuf::from(foo);
let base = path
.parent()
.ok_or(FindError::InvalidBaseFile)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let file_name = path
.file_stem()
.ok_or(FindError::InvalidFileName)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let file_extension = path
.extension()
.ok_or(FindError::NoFileExtension)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let pattern = format!(r"{}\.\d{{3}}\.{}", file_name, file_extension);
let expression = Regex::new(&pattern)?;
Ok(
fs::read_dir(&base)?
.map(|entry| Ok(
entry?
.path()
.file_name()
.ok_or(FindError::InvalidFileName)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?
.to_string()
))
.collect::<Result<Vec<_>, FindError>>()?
.into_iter()
.filter(|file_name| expression.is_match(&file_name))
.collect()
)
}
A simplistic definition of FindError could be achieved via the thiserror crate:
use thiserror::Error;
#[derive(Error, Debug)]
enum FindError {
#[error(transparent)]
RegexError(#[from] regex::Error),
#[error("File name has no extension")]
NoFileExtension,
#[error("Not a valid file name")]
InvalidFileName,
#[error("No valid base file")]
InvalidBaseFile,
#[error("An OS string is not valid utf-8")]
OsStringNotUtf8,
#[error(transparent)]
IoError(#[from] std::io::Error),
}
Edit
As pointed out by @Masklinn, you can retrieve the stem and the extension of the file without all that hassle. The errors are handled less precisely (and some corner cases, such as a hidden file without an extension, are handled poorly), but the code is less verbose overall. Choose whichever fits your needs.
fn find(foo: &str) -> Result<Vec<String>, FindError> {
let (file_name, file_extension) = foo
.rsplit_once('.')
.ok_or(FindError::NoExtension)?;
... // the rest is unchanged
}
You probably need to adapt FindError too. You can also get rid of the ok_or case, and just replace it with a .unwrap_or((foo, "")) if you don't really care about it (however this will give surprising results...).
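Both versions of find return every matching name, whereas the question asks for the one with the greatest number. A possible last step, sketched here with the standard library only, is shown below. The function latest_match is a hypothetical helper; it takes names that have already been read from the directory and relies on the three-digit counter being fixed-width, so that string order equals numeric order.

```rust
/// Among names of the form `<stem>.<three digits>.<ext>`, returns the one
/// with the largest number, or None if nothing matches.
fn latest_match<'a>(names: &[&'a str], stem: &str, ext: &str) -> Option<&'a str> {
    names
        .iter()
        .copied()
        .filter(|name| {
            name.strip_prefix(stem)
                .and_then(|rest| rest.strip_prefix('.'))
                .and_then(|rest| rest.strip_suffix(ext))
                .and_then(|rest| rest.strip_suffix('.'))
                .map_or(false, |digits| {
                    digits.len() == 3 && digits.bytes().all(|b| b.is_ascii_digit())
                })
        })
        // Same prefix and fixed width: lexicographic max is the numeric max.
        .max()
}
```

For the directory in the question, latest_match(&names, "baz", "txt") would pick baz.003.txt.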

Efficient String trim

I have a String value and I want to trim() it. I can do something like:
let trimmed = s.trim().to_string();
But that will always create a new String instance, even though in real life the string is much more likely to be already trimmed. In order to avoid the redundant new String creation, I could do something like this:
let ss = s.trim();
let trimmed = if ss.len() == s.len() { s } else { ss.to_string() };
But that is quite verbose. Is there a more concise way to do the above?
I can't think of a more concise way to do it. As it is, it seems maximally concise: You need to tell the compiler to trim the string, and then if the strings are the same length return the original string, otherwise make a new string. There isn't a "trim but return the original string if they're equal" method on String.
That said, you could make your own trait TrimOwned which had such a method, for example (implementation courtesy of StackOverflower):
trait TrimOwned {
fn trim_owned(self) -> Self;
}
impl TrimOwned for String {
fn trim_owned(self) -> Self {
let s = self.trim();
if s.len() == self.len() {
self
} else {
s.to_string()
}
}
}
fn main() {
let left = " left".to_string();
let right = "right ".to_string();
let both = " both ".to_string();
println!("'{}'->'{}'", left.clone(), left.trim_owned());
println!("'{}'->'{}'", right.clone(), right.trim_owned());
println!("'{}'->'{}'", both.clone(), both.trim_owned());
}
Sorry, I cannot give you a more concise version. But I can tell you that there still is more room for premature optimization:
When trimming on the right, you only need to shorten the string, there is obviously no need for reallocating
You can't easily do the "shorten the string" trick when trimming to the left, since a string must always start at index 0 of its allocated area. But when you allocate a new string, you end up copying its contents, so you could as well just move the string content inside the already allocated area.
One nifty and reasonably fast way of implementing this is String::drain:
fn trim_owned(mut trim: String) -> String {
trim.drain(trim.trim_end().len()..);
trim.drain(..(trim.len() - trim.trim_start().len()));
trim
}
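A slight variation on the same idea uses String::truncate for the right-hand side, which makes the "only shorten the string" step explicit. This is a sketch, and trim_in_place is a hypothetical name.

```rust
/// Trims whitespace on both sides while reusing the existing allocation.
fn trim_in_place(mut s: String) -> String {
    // Right side: just shorten the string.
    let end = s.trim_end().len();
    s.truncate(end);
    // Left side: shift the remaining bytes to the front of the buffer.
    let start = s.len() - s.trim_start().len();
    s.drain(..start);
    s
}
```

One way to check that no new buffer is allocated is to compare as_ptr() before and after the call: the pointer stays the same because neither truncate nor drain reallocates.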

How do I get the value and type of a Literal in a procedural macro?

I am implementing a function-like procedural macro which takes a single string literal as an argument, but I don't know how to get the value of the string literal.
If I print the variable, it shows a bunch of fields, which includes both the type and the value. They are clearly there, somewhere. How do I get them?
extern crate proc_macro;
use proc_macro::{TokenStream,TokenTree};
#[proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
let input: Vec<TokenTree> = input.into_iter().collect();
let literal = match &input.get(0) {
Some(TokenTree::Literal(literal)) => literal,
_ => panic!()
};
// can't do anything with "literal"
// println!("{:?}", literal.lit.symbol); says "unknown field"
format!("{:?}", format!("{:?}", literal)).parse().unwrap()
}
#![feature(proc_macro_hygiene)]
extern crate macros;
fn main() {
let value = macros::my_macro!("hahaha");
println!("it is {}", value);
// prints "it is Literal { lit: Lit { kind: Str, symbol: "hahaha", suffix: None }, span: Span { lo: BytePos(100), hi: BytePos(108), ctxt: #0 } }"
}
After running into the same problem countless times already, I finally wrote a library to help with this: litrs on crates.io. It compiles faster than syn and lets you inspect your literals.
use std::convert::TryFrom;
use litrs::StringLit;
use proc_macro::TokenStream;
use quote::quote;
#[proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
let input = input.into_iter().collect::<Vec<_>>();
if input.len() != 1 {
let msg = format!("expected exactly one input token, got {}", input.len());
return quote! { compile_error!(#msg) }.into();
}
let string_lit = match StringLit::try_from(&input[0]) {
// Error if the token is not a string literal
Err(e) => return e.to_compile_error(),
Ok(lit) => lit,
};
// `StringLit::value` returns the actual string value represented by the
// literal. Quotes are removed and escape sequences replaced with the
// corresponding value.
let v = string_lit.value();
// TODO: implement your logic here
}
See the documentation of litrs for more information.
To obtain more information about a literal, litrs uses the Display impl of Literal to obtain a string representation (as it would be written in source code) and then parses that string. For example, if the string starts with 0x one knows it has to be an integer literal, if it starts with r#" one knows it is a raw string literal. The crate syn does exactly the same.
Of course, it seems a bit wasteful to write and run a second parser given that rustc already parsed the literal. Yes, that's unfortunate, and having a better API in proc_macro would be preferable. But right now, I think litrs (or syn, if you are using syn anyway) is the best solution.
(PS: I'm usually not a fan of promoting one's own libraries on Stack Overflow, but I am very familiar with the problem OP is having and I very much think litrs is the best tool for the job right now.)
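The prefix-based detection described above (0x for an integer literal, r#" for a raw string) can be illustrated with a toy classifier over the literal's source text. This is not the code of litrs or syn. LitKind and classify_literal are hypothetical names, and only the few cases mentioned are covered.

```rust
#[derive(Debug, PartialEq)]
enum LitKind {
    HexInteger,
    RawString,
    String,
    Other,
}

/// Classifies a literal from its source representation, e.g. the output of
/// `Literal::to_string()` in a procedural macro.
fn classify_literal(src: &str) -> LitKind {
    if src.starts_with("0x") {
        LitKind::HexInteger
    } else if src.starts_with("r#") || src.starts_with("r\"") {
        LitKind::RawString
    } else if src.starts_with('"') {
        LitKind::String
    } else {
        // Decimal numbers, chars, byte strings, etc. are not handled here.
        LitKind::Other
    }
}
```

A real implementation additionally has to handle suffixes, escape sequences, and the remaining literal kinds, which is the work the crates take over.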
If you're writing procedural macros, I'd recommend that you look into using the crates syn (for parsing) and quote (for code generation) instead of using proc_macro directly, since those are generally easier to deal with.
In this case, you can use syn::parse_macro_input to parse a token stream into any syntactic element of Rust (such as literals, expressions, or functions); it also takes care of the error messages in case parsing fails.
You can use LitStr to represent a string literal, if that's exactly what you need. The .value() function will give you a String with the contents of that literal.
You can use quote::quote to generate the output of the macro, and use # to insert the contents of a variable into the generated code.
use proc_macro::TokenStream;
use syn::{parse_macro_input, LitStr};
use quote::quote;
#[proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
// macro input must be `LitStr`, which is a string literal.
// if not, a relevant error message will be generated.
let input = parse_macro_input!(input as LitStr);
// get value of the string literal.
let str_value = input.value();
// do something with value...
let str_value = str_value.to_uppercase();
// generate code, include `str_value` variable (automatically encodes
// `String` as a string literal in the generated code)
(quote!{
#str_value
}).into()
}
I always want a string literal, so I found this solution that is good enough. Literal implements ToString, which I can then use with .parse().
#[proc_macro]
pub fn my_macro(input: TokenStream) -> TokenStream {
let input: Vec<TokenTree> = input.into_iter().collect();
let value = match &input.get(0) {
Some(TokenTree::Literal(literal)) => literal.to_string(),
_ => panic!()
};
let str_value: String = value.parse().unwrap();
// do whatever
format!("{:?}", str_value).parse().unwrap()
}
I had a similar problem when parsing the doc attribute, which is also represented as a TokenStream. This is not an exact answer, but it may point in the right direction:
fn from(value: &Vec<Attribute>) -> Vec<String> {
let mut lines = Vec::new();
for attr in value {
if !attr.path.is_ident("doc") {
continue;
}
if let Ok(Meta::NameValue(nv)) = attr.parse_meta() {
if let Lit::Str(lit) = nv.lit {
lines.push(lit.value());
}
}
}
lines
}

Resources