Find Files that Match a Dynamic Pattern - rust

I want to be able to parse all the files in a directory to find the one with the greatest timestamp that matches a user provided pattern.
I.e. if the user runs
$ search /foo/bar/baz.txt
and the directory /foo/bar/ contains files baz.001.txt, baz.002.txt, and baz.003.txt, then the result should be baz.003.txt
At the moment I'm constructing a PathBuf.
Using that to build a Regex.
Then finding all the files in the directory that match the expression.
But it feels like this is a lot of work for a relatively simple problem.
fn find(foo: &str) -> Result<Vec<String>, Box<dyn Error>> {
let mut files = vec![];
let mut path = PathBuf::from(foo);
let base = path.parent().unwrap().to_str().unwrap();
let file_name = path.file_stem().unwrap().to_str().unwrap();
let extension = path.extension().unwrap().to_str().unwrap();
let pattern = format!("{}\\.\\d{{3}}\\.{}", file_name, extension);
let expression = Regex::new(&pattern).unwrap();
let objects: Vec<String> = fs::read_dir(&base)
.unwrap()
.map(|entry| {
entry
.unwrap()
.path()
.file_name()
.unwrap()
.to_str()
.unwrap()
.to_owned()
})
.collect();
for object in objects.iter() {
if expression.is_match(object) {
files.push(String::from(object));
}
}
Ok(files)
}
Is there an easier way to take the file path, generate a pattern, and find all the matching files?

Rust is not really a language suited to quick and dirty solutions. Instead, it strongly incentivizes elegant solutions, where all corner cases are properly handled. This usually does not lead to extremely short solutions, but you can avoid too much boilerplate by relying on external crates that factor out a lot of code. Here is what I would do, assuming you don't already have a "library-wide" error type.
fn find(foo: &str) -> Result<Vec<String>, FindError> {
let path = PathBuf::from(foo);
let base = path
.parent()
.ok_or(FindError::InvalidBaseFile)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let file_name = path
.file_stem()
.ok_or(FindError::InvalidFileName)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let file_extension = path
.extension()
.ok_or(FindError::NoFileExtension)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?;
let pattern = format!(r"{}\.\d{{3}}\.{}", file_name, file_extension);
let expression = Regex::new(&pattern)?;
Ok(
fs::read_dir(&base)?
.map(|entry| Ok(
entry?
.path()
.file_name()
.ok_or(FindError::InvalidFileName)?
.to_str()
.ok_or(FindError::OsStringNotUtf8)?
.to_string()
))
.collect::<Result<Vec<_>, FindError>>()?
.into_iter()
.filter(|file_name| expression.is_match(&file_name))
.collect()
)
}
A simplistic definition of FindError could be achieved via the thiserror crate:
use thiserror::Error;
#[derive(Error, Debug)]
enum FindError {
#[error(transparent)]
RegexError(#[from] regex::Error),
#[error("File name has no extension")]
NoFileExtension,
#[error("Not a valid file name")]
InvalidFileName,
#[error("No valid base file")]
InvalidBaseFile,
#[error("An OS string is not valid utf-8")]
OsStringNotUtf8,
#[error(transparent)]
IoError(#[from] std::io::Error),
}
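If you would rather not add the thiserror dependency, roughly the same behaviour can be written by hand. The following is only a sketch with two of the variants (the variant name Io is my own shorthand, not the one used above), and it approximates rather than reproduces what the derive macro generates:

```rust
use std::{error::Error, fmt, io};

// Hand-written stand-in for the derived error type (two variants only).
#[derive(Debug)]
enum FindError {
    NoFileExtension,
    Io(io::Error),
}

impl fmt::Display for FindError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            FindError::NoFileExtension => write!(f, "File name has no extension"),
            // Forward the message of the wrapped I/O error.
            FindError::Io(e) => write!(f, "{}", e),
        }
    }
}

impl Error for FindError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            FindError::Io(e) => Some(e),
            FindError::NoFileExtension => None,
        }
    }
}

// This conversion is what lets `?` turn an io::Error into a FindError.
impl From<io::Error> for FindError {
    fn from(e: io::Error) -> Self {
        FindError::Io(e)
    }
}
```

The From impl is the part that matters for the find function: it is what allows fs::read_dir(...)? to compile when the return type is Result<_, FindError>.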
Edit
As pointed out by @Masklinn, you can retrieve the stem and the extension of the file without all that hassle. It results in less well-handled errors (and some corner cases, such as a hidden file without an extension, get handled poorly), but overall less verbose code. It is up to you to choose depending on your needs.
fn find(foo: &str) -> Result<Vec<String>, FindError> {
let (file_name, file_extension) = foo
.rsplit_once('.')
.ok_or(FindError::NoFileExtension)?;
... // the rest is unchanged
}
You probably need to adapt FindError too. You can also get rid of the ok_or case, and just replace it with a .unwrap_or((foo, "")) if you don't really care about it (however this will give surprising results...).
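Note that the question asks for the single highest-numbered file, while find returns every match. Also, Regex::is_match searches anywhere in the string, so an unanchored pattern would accept a name like xbaz.001.txt.bak as well. Below is a std-only sketch of the matching and selection step. The helper names matches_numbered and newest are hypothetical, and it assumes the counter is always exactly three zero-padded digits, so that the lexicographic maximum is also the numeric maximum:

```rust
/// Returns true when `candidate` is exactly `<stem>.<three digits>.<ext>`.
fn matches_numbered(candidate: &str, stem: &str, ext: &str) -> bool {
    candidate
        .strip_prefix(stem)
        .and_then(|rest| rest.strip_prefix('.'))
        .and_then(|rest| rest.strip_suffix(ext))
        .and_then(|rest| rest.strip_suffix('.'))
        .map_or(false, |digits| {
            digits.len() == 3 && digits.bytes().all(|b| b.is_ascii_digit())
        })
}

/// Picks the matching name with the highest counter, if any.
/// Relies on the zero padding: "003" > "002" as strings.
fn newest<'a>(names: &'a [String], stem: &str, ext: &str) -> Option<&'a String> {
    names
        .iter()
        .filter(|name| matches_numbered(name.as_str(), stem, ext))
        .max()
}
```

With the directory from the question, newest(&names, "baz", "txt") would select baz.003.txt. If you keep the regex instead, wrapping the pattern in ^ and $ and passing the stem and extension through regex::escape gives the same strictness.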

Related

How do I chain functions returning results in Rust?

In the following code I would like to get the 2nd string in a vector args and then parse it into an i32. This code will not compile, however, because I cannot call parse() on the Option value returned by nth().
use std::env;
fn main() {
let args: Vec<String> = env::args().collect();
let a = args.iter().nth(1).parse::<i32>();
}
I know I could just use expect() to unwrap the value before trying to parse it, however I do not want my code to panic. I want a to be a Result value that is an Err if either nth() or parse() fails, and otherwise is an Ok(i32). Is there a way to accomplish this in Rust? Thanks.
It is quite easy if you look in the documentation for either Option or Result. The function you are thinking of is likely and_then, which lets you provide a closure that changes the Ok type and value when one is present, but otherwise leaves an error unchanged. What you do need to do, though, is decide on a common error type to propagate. Since the Option<&String> needs to be turned into an error on a None value, we have to choose a type to use.
Here I provide a brief example with a custom error type. I decided to use .get instead of .iter().nth(1) since it does the same thing and we might as well take advantage of the Vec since you have gone to the work of creating it.
use std::num::ParseIntError;
enum ArgParseError {
NotFound(usize),
InvalidArg(ParseIntError),
}
let args: Vec<String> = env::args().collect();
let a: Result<i32, ArgParseError> = args
.get(1) // Option<&String>
.ok_or_else(|| ArgParseError::NotFound(1)) // Result<&String, ArgParseError>
.and_then(|x: &String| {
x.parse::<i32>() // Result<i32, ParseIntError>
.map_err(|e| ArgParseError::InvalidArg(e)) // Result<i32, ArgParseError>
});
You could try the following.
use std::{env, num::ParseIntError};
enum Error {
ParseIntError(ParseIntError),
Empty,
}
fn main() {
let args: Vec<String> = env::args().collect();
let a: Option<Result<i32, ParseIntError>> = args.iter().nth(1).map(|s| s.parse::<i32>());
let a: Result<i32, Error> = match a {
Some(Ok(a)) => Ok(a),
Some(Err(e)) => Err(Error::ParseIntError(e)),
None => Err(Error::Empty),
};
}
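Another way to express the same chain is the ? operator inside a helper function. This is a sketch, not taken from the answers above: the names ArgError and nth_as_i32 are hypothetical, and the From impl is what allows ? to convert the ParseIntError automatically.

```rust
use std::num::ParseIntError;

// Common error type covering both failure modes.
#[derive(Debug, PartialEq)]
enum ArgError {
    Missing(usize),
    Invalid(ParseIntError),
}

// Lets `?` convert a ParseIntError into an ArgError.
impl From<ParseIntError> for ArgError {
    fn from(e: ParseIntError) -> Self {
        ArgError::Invalid(e)
    }
}

/// Returns the argument at `index` parsed as an i32, without panicking.
fn nth_as_i32(args: &[String], index: usize) -> Result<i32, ArgError> {
    // Option<&String> -> Result<&String, ArgError>
    let raw = args.get(index).ok_or(ArgError::Missing(index))?;
    // Result<i32, ParseIntError> -> Result<i32, ArgError> via From
    Ok(raw.parse::<i32>()?)
}
```

In main this would be called as nth_as_i32(&args, 1), where args is the collected env::args().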

Writing a Vec<String> to files using std::fs::write

I'm writing a program that handles a vector which is combination of numbers and letters (hence Vec<String>). I sort it with the .sort() method and am now trying to write it to a file.
Where strvec is my sorted vector that I'm trying to write using std::fs::write;
println!("Save results to file?");
let to_save: String = read!();
match to_save.as_str() {
"y" => {
println!("Enter filename");
let filename: String = read!();
let pwd = current_dir().into();
write("/home/user/dl/results", strvec);
Rust tells me "the trait AsRef<[u8]> is not implemented for Vec<String>". I've also tried using &strvec.
How do I avoid this/fix it?
When it comes to writing objects to a file, you might want to consider serialization. The most common library for this in Rust is serde. However, in this example, where you want to store a vector of Strings and don't need the file to be human readable (in exchange, the file stays small :P), you can also use bincode:
use std::fs;
use bincode;
fn main() {
let v = vec![String::from("aaa"), String::from("bbb")];
let encoded_v = bincode::serialize(&v).expect("Could not encode vector");
fs::write("file", encoded_v).expect("Could not write file");
let read_v = fs::read("file").expect("Could not read file");
let decoded_v: Vec<String> = bincode::deserialize(&read_v).expect("Could not decode vector");
println!("{:?}", decoded_v);
}
Remember to add bincode = "1.3.3" under dependencies in Cargo.toml
EDIT:
Actually, you can easily save a String to a file, so a simple join() should do:
use std::fs;
fn main() {
let v = vec![
String::from("aaa"),
String::from("bbb"),
String::from("ccc")];
fs::write("file", v.join("\n")).expect("Could not write file");
}
Rust can't write anything besides a &[u8] to a file. There are too many different ways in which data can be interpreted before it gets flattened, so you need to handle all of that ahead of time. For a Vec<String>, it's pretty simple, and you can just use concat to squish everything down to a single String, which can be interpreted as a &[u8] because of its AsRef<[u8]> trait impl.
Another option would be to use join, in case you wanted to add some sort of delimiter between your strings, like a space, comma, or something.
fn main() {
let strvec = vec![
"hello".to_string(),
"world".to_string(),
];
// "helloworld"
std::fs::write("/tmp/example", strvec.concat()).expect("failed to write to file");
// "hello world"
std::fs::write("/tmp/example", strvec.join(" ")).expect("failed to write to file");
}
You can't get a &[u8] from a Vec<String> without copying since a slice must refer to a contiguous sequence of items. Each String will have its own allocation on the heap somewhere, so while each individual String can be converted to a &[u8], you can't convert the whole vector to a single &[u8].
While you can .collect() the vector into a single String and then get a &[u8] from that, this does some unnecessary copying. Consider instead just iterating the Strings and writing each one to the file. With this helper, it's no more complex than using std::fs::write():
use std::path::Path;
use std::fs::File;
use std::io::Write;
fn write_each(
path: impl AsRef<Path>,
items: impl IntoIterator<Item=impl AsRef<[u8]>>,
) -> std::io::Result<()> {
let mut file = File::create(path)?;
for i in items {
file.write_all(i.as_ref())?;
}
// Surface any I/O errors that could otherwise be swallowed when
// the file is closed implicitly by being dropped.
file.sync_all()
}
The bound impl IntoIterator<Item=impl AsRef<[u8]>> is satisfied by both Vec<String> and by &Vec<String>, so you can call this as either write_each("path/to/output", strvec) (to consume the vector) or write_each("path/to/output", &strvec) (if you need to hold on to the vector for later).
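If each string should end up on its own line, the same iterate-and-write idea works with a newline written after every item. This is a minimal sketch: write_lines is a hypothetical name, and it is generic over any Write implementor so it can be pointed at a file, a buffer, or stdout.

```rust
use std::io::{self, Write};

/// Writes each item followed by a newline, without building one big String.
fn write_lines<W: Write>(mut out: W, items: &[String]) -> io::Result<()> {
    for item in items {
        out.write_all(item.as_bytes())?;
        out.write_all(b"\n")?;
    }
    // Push any buffered bytes to the underlying writer and report errors.
    out.flush()
}
```

For a file, wrapping it as std::io::BufWriter::new(std::fs::File::create(path)?) before passing it in avoids issuing one system call per string.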

Rust chunks method with owned values?

I'm trying to perform a parallel operation on several chunks of strings at a time, and I'm having an issue with the borrow checker:
(for context, identifiers is a Vec<String> from a CSV file, client is reqwest and target is an Arc<String> that is write once read many)
use futures::{stream, StreamExt};
use std::sync::Arc;
async fn nop(
person_ids: &[String],
target: &str,
url: &str,
) -> String {
let noop = format!("{} {}", target, url);
let noop2 = person_ids.iter().for_each(|f| {f.as_str();});
"Some text".into()
}
#[tokio::main]
async fn main() {
let target = Arc::new(String::from("sometext"));
let url = "http://example.com";
let identifiers = vec!["foo".into(), "bar".into(), "baz".into(), "qux".into(), "quux".into(), "quuz".into(), "corge".into(), "grault".into(), "garply".into(), "waldo".into(), "fred".into(), "plugh".into(), "xyzzy".into()];
let id_sets: Vec<&[String]> = identifiers.chunks(2).collect();
let responses = stream::iter(id_sets)
.map(|person_ids| {
let target = target.clone();
tokio::spawn( async move {
let resptext = nop(person_ids, target.as_str(), url).await;
})
})
.buffer_unordered(2);
responses
.for_each(|b| async { })
.await;
}
Playground: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=e41c635e99e422fec8fc8a581c28c35e
Given chunks yields a Vec<&[String]>, the compiler complains that identifiers doesn't live long enough because it potentially goes out of scope while the slices are being referenced. Realistically this won't happen because there's an await. Is there a way to tell the compiler that this is safe, or is there another way of getting chunks as a set of owned Strings for each thread?
There was a similarly asked question that used into_owned() as a solution, but when I try that, rustc complains about the slice size not being known at compile time in the request_user function.
EDIT: Some other questions as well:
Is there a more direct way of using target in each thread without needing Arc? From the moment it is created, it never needs to be modified, just read from. If not, is there a way of pulling it out of the Arc that doesn't require the .as_str() method?
How do you handle multiple error types within the tokio::spawn() block? In the real world use, I'm going to receive quick_xml::Error and reqwest::Error within it. It works fine without tokio spawn for concurrency.
Is there a way to tell the compiler that this is safe, or is there another way of getting chunks as a set of owned Strings for each thread?
You can chunk a Vec<T> into a Vec<Vec<T>> without cloning by using the itertools crate:
use itertools::Itertools;
fn main() {
let items = vec![
String::from("foo"),
String::from("bar"),
String::from("baz"),
];
let chunked_items: Vec<Vec<String>> = items
.into_iter()
.chunks(2)
.into_iter()
.map(|chunk| chunk.collect())
.collect();
for chunk in chunked_items {
println!("{:?}", chunk);
}
}
["foo", "bar"]
["baz"]
This is based on the answers here.
Your issue here is that id_sets is a vector of slices that borrow from identifiers. tokio::spawn requires its future to be 'static, so the compiler cannot assume identifiers will still be alive while the spawned task (the async move block) runs.
Your solution to the immediate problem is to convert the Vec<&[String]> to a Vec<Vec<String>> type.
A way of accomplishing that would be:
let id_sets: Vec<Vec<String>> = identifiers
.chunks(2)
.map(|x: &[String]| x.to_vec())
.collect();
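Note that x.to_vec() clones every String. If identifiers is not needed afterwards, the elements can be moved into owned chunks using only the standard library. A sketch, with into_chunks being a hypothetical helper name:

```rust
/// Splits `items` into owned chunks of at most `size` elements, moving
/// each element instead of cloning it.
fn into_chunks<T>(items: Vec<T>, size: usize) -> Vec<Vec<T>> {
    assert!(size > 0, "chunk size must be non-zero");
    let mut chunks = Vec::new();
    let mut iter = items.into_iter().peekable();
    // Keep taking `size` elements until the source is exhausted.
    while iter.peek().is_some() {
        chunks.push(iter.by_ref().take(size).collect());
    }
    chunks
}
```

The result is a Vec<Vec<String>>, so each inner Vec<String> can be moved into its own async move block and borrowed there as &[String].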

How to parse &str with named parameters?

I am trying to find the best way to parse a &str and extract out the COMMAND_TYPE and named parameters. The named parameters can be anything.
Here is the proposed string (it can be changed).
COMMAND_TYPE(param1:2222,param2:"the quick \"brown\" fox, blah,", param3:true)
I have been trying a few ways to extract the COMMAND_TYPE, which seems fairly simple:
pub fn parse_command(command: &str) -> Option<String> {
let mut matched = String::new();
let mut chars = command.chars();
while let Some(next) = chars.next() {
if next != '(' {
matched.push(next);
} else {
break;
}
}
if matched.is_empty() {
None
} else {
Some(matched)
}
}
Extracting the parameters from within the brackets seems straightforward too:
pub fn parse_params(command: &str) -> Option<&str> {
let start = command.find("(");
let end = command.rfind(")");
if start.is_some() && end.is_some() {
Some(&command[start.unwrap() + 1..end.unwrap()])
} else {
None
}
}
I have been looking at the nom crate and that seems fairly powerful (and complicated), so I am not sure if I really need to use it.
How do I extract the named parameters in between the brackets into a HashMap?
Your code seems to work for extracting the command and the full parameter list. If you don't need to parse something more complex than that, you can probably avoid using nom as a dependency.
But you will probably have problems if you want to parse each parameter individually: your format seems broken. In your example, there are no escape characters for either the double quote or the comma. param2 just can't be extracted cleanly.
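If the format is pinned down so that a backslash escapes the next character inside a quoted value, and commas inside quotes are literal text (both are assumptions, not something the question specifies), a small hand-written scanner is enough to fill a HashMap. The function names below are hypothetical, and values are kept as plain strings with the surrounding quotes removed:

```rust
use std::collections::HashMap;

/// Stores one finished `name:value` chunk, splitting on the first colon.
fn store_pair(chunk: &str, map: &mut HashMap<String, String>) {
    if let Some((name, value)) = chunk.split_once(':') {
        map.insert(name.trim().to_string(), value.trim().to_string());
    }
}

/// Parses the text between the brackets into name -> value pairs.
/// Assumes `\` escapes the next character inside double quotes.
fn parse_named_params(params: &str) -> HashMap<String, String> {
    let mut map = HashMap::new();
    let mut current = String::new();
    let mut in_quotes = false;
    let mut chars = params.chars();

    while let Some(c) = chars.next() {
        match c {
            '\\' if in_quotes => {
                // Keep the escaped character, drop the backslash itself.
                if let Some(escaped) = chars.next() {
                    current.push(escaped);
                }
            }
            '"' => in_quotes = !in_quotes,
            // Only a comma outside quotes ends a parameter.
            ',' if !in_quotes => {
                store_pair(&current, &mut map);
                current.clear();
            }
            _ => current.push(c),
        }
    }
    store_pair(&current, &mut map);
    map
}
```

This sketch does no error reporting (an unterminated quote or a chunk without a colon is silently skipped) and it trims whitespace at both ends of every value, so it is a starting point rather than a full parser. The input would be the slice returned by parse_params.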

How to create chaining API after read_to_string was changed to take a buffer?

I'm trying to port my library clog to the latest Rust version.
Rust changed a lot in the previous month and so I'm scratching my head over this code asking myself if there's really no way anymore to write this in a completely chained way?
fn get_last_commit () -> String {
let output = Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.output()
.ok().expect("error invoking git rev-parse");
let encoded = String::from_utf8(output.stdout).ok().expect("error parsing output of git rev-parse");
encoded
}
In an older version of Rust the code could be written like that
pub fn get_last_commit () -> String {
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.spawn()
.ok().expect("failed to invoke rev-parse")
.stdout.as_mut().unwrap().read_to_string()
.ok().expect("failed to get last commit")
}
It seems there is no read_to_string() method anymore that doesn't take a buffer which makes it hard to implement a chaining API unless I'm missing something.
UPDATE
Ok, I figured I can use map to get it chaining.
fn get_last_commit () -> String {
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.output()
.map(|output| {
String::from_utf8(output.stdout).ok().expect("error reading into string")
})
.ok().expect("error invoking git rev-parse")
}
Actually I wonder if I could use and_then but it seems the errors don't line up correctly ;)
As others have said, this was changed to allow reusing buffers/avoiding allocations.
Another alternative is to use read_to_string and manually provide the buffer:
pub fn get_last_commit () -> String {
let mut string = String::new();
Command::new("git")
.arg("rev-parse")
.arg("HEAD")
.spawn()
.ok().expect("failed to invoke rev-parse")
.stdout.as_mut().unwrap()
.read_to_string(&mut string)
.ok().expect("failed to get last commit");
string
}
This API was changed so that you didn't have to re-allocate a new String each time. However, as you've noticed, there's some convenience loss if you don't care about allocation. It might be a good idea to suggest re-adding this back in, like what happened with Vec::from_elem. Maybe open a small RFC?
While it may make sense to try to add this back to the standard library, here's a version of read_to_string that allocates on its own that you can use today:
use std::io::{self, Cursor, Read};
trait MyRead: Read {
fn read_full_string(&mut self) -> io::Result<String> {
let mut s = String::new();
let r = self.read_to_string(&mut s);
r.map(|_| s)
}
}
impl<T> MyRead for T where T: Read {}
fn main() {
let bytes = b"hello";
let mut input = Cursor::new(bytes);
let s = input.read_full_string();
println!("{}", s.unwrap());
}
This should allow you to use the chaining style you had before.
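On recent stable toolchains (Rust 1.65 and later, as far as I know) the standard library also provides a free function, std::io::read_to_string, which allocates the String itself and therefore chains directly. A minimal sketch, where read_trimmed is a hypothetical helper name:

```rust
use std::io::{self, Read};

/// Reads everything from `reader` into a new String and drops the
/// trailing newline that commands like `git rev-parse` emit.
fn read_trimmed<R: Read>(reader: R) -> io::Result<String> {
    io::read_to_string(reader).map(|text| text.trim_end().to_string())
}
```

Any Read implementor can be passed in, for example the ChildStdout taken from a spawned child process or a byte slice in tests.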
