Compress and serialize with Rust: the specific case of `Vec<u8>`

I am currently working on a way to compress data (structs I made) when serializing with serde. Everything works fine except in the specific case of Vec<u8>. I'd like to know if any of you have already run into this problem or have any thoughts to share :)
My goal is to provide a simple way of compressing any part of a struct by adding the #[serde(with="crate::compress")] attribute. Regardless of the underlying data structure, I want it to be compressed with my custom serialize function.
For instance, I want this structure to be serializable with compression :
struct MyCustomStruct {
data: String,
#[serde(with="crate::compress")]
data2: SomeOtherStruct,
#[serde(with="crate::compress")]
data3: Vec<u8>,
}
For now, everything works fine and calls my custom module :
// in compress module
pub fn serialize<T, S>(data: T, serializer: S) -> Result<S::Ok, S::Error>
where
    T: Serialize,
    S: Serializer,
{
    // Simplified version of what happens:
    let serialized_data: Vec<u8> = some_function(data);
    let compressed_data: Vec<u8> = some_other_function(serialized_data);
    Ok(chosen_serializer::serialize(compressed_data, serializer)?)
}
However, I do have a problem when it comes to compressing Vec<u8> fields (such as data3 in the struct above).
Since the data is already a Vec<u8>, I don't need to serialize it and can pass it directly to my compression function. Worse: if I do serialize it, it may not go through serde_bytes, so applying the compression attribute can end up increasing the size!
I also don't want two different functions (one for Vec<u8>, one for everything else), since it would then be up to the user to choose which one to use, and using the wrong one with Vec<u8> would still work but would increase the size instead of decreasing it.
I already thought about / tried a few things but none of them can work :
1) Macro :
Very complicated since it needs to rewrite another macro. It would be a macro writing #[serde(with="crate::compress")] or #[serde(with="crate::compress_vec_u8")] depending on the type. I don't even know if this is possible, a meta-macro? :')
2) Trait implementation
It would be something like that :
trait CompressAndSerialize {
    fn serialize<S>(&self, serializer: S) -> Result<S::Ok, S::Error>
    where
        S: Serializer;
}
impl CompressAndSerialize for Vec<u8> { ... }
impl<T> CompressAndSerialize for T where T: Serialize { ... }
but then I get an error (conflicting implementations of trait CompressAndSerialize for Vec<u8>), which seems normal since there are indeed two implementations that apply to Vec<u8> :/
3) The worst solution, but the one I'm heading for: using TypeId::of::<T>()
and skipping serialization if the data is already a Vec<u8>. I would still have to wrap it in an enum so that deserialization knows whether the data is a Vec<u8> or something else...
Edit: this isn't possible because the type must have a 'static lifetime, which is rarely the case (not in mine, anyway)
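For the trait approach in 2), the conflict comes from the blanket impl over every T: Serialize. A minimal sketch of one way around it, assuming you can accept implementing a dedicated trait per type instead of a blanket impl: the trait name CompressPayload, the payload method, and the hand-rolled byte encoding of SomeOtherStruct are all hypothetical stand-ins (real code would call a serializer there), and only the standard library is used so the sketch stays self-contained.

```rust
use std::borrow::Cow;

/// Hypothetical trait: yields the bytes that would be handed to the compressor.
pub trait CompressPayload {
    fn payload(&self) -> Cow<'_, [u8]>;
}

/// `Vec<u8>` is already a byte buffer, so it is borrowed as-is (no serialization step).
impl CompressPayload for Vec<u8> {
    fn payload(&self) -> Cow<'_, [u8]> {
        Cow::Borrowed(self.as_slice())
    }
}

/// Stand-in for a type that must be serialized before it can be compressed.
pub struct SomeOtherStruct {
    pub id: u32,
}

/// Implemented explicitly for this type (no blanket impl), so there is no overlap
/// with the `Vec<u8>` impl. The encoding here is a placeholder for a real serializer.
impl CompressPayload for SomeOtherStruct {
    fn payload(&self) -> Cow<'_, [u8]> {
        Cow::Owned(self.id.to_le_bytes().to_vec())
    }
}

/// A single generic entry point: callers never choose between two functions.
pub fn payload_len<T: CompressPayload>(value: &T) -> usize {
    value.payload().len()
}
```

The cost of this design is one impl per compressible type (which a small macro could generate); the benefit is that the Vec<u8> case is selected by the type system rather than by the user.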
Sorry this is a bit long and quite specific but I hope maybe one of you will have suggestions on how to deal with this problem :D
Edit: the serde documentation (https://serde.rs/impl-serialize.html#other-special-cases) links to an issue in rust-lang/rust (https://github.com/rust-lang/rust/issues/31844); once that issue is resolved I won't have any problem with the serialization of Vec<u8>, since serde_bytes won't be needed anymore. Too bad this issue has been open since Feb 2016 :'(

Related

Best way to model storing an iterator for a vector inside same struct

Context
I'm a beginner, and, at a high level, what I want to do is store some mutable state (to power a state machine) in a struct with the following constraints
Mutating the state doesn't require the entire struct to be mut (since I'd have to update a ton of callsites to be mut + I don't want every field to be mutable)
The state is represented as an enum, and can, in the right state, store a way to index into the correct position in a vec that's in the same struct
I came up with two different approaches/examples that seem quite complicated and I want to see if there's a way to simplify. Here's some playgrounds that minimally reproduce what I'm exploring
Using a Cell and a usize
#[derive(Clone, Copy)]
enum S {
A,
B(usize),
}
struct Test {
a: Vec<i32>,
b: Cell<S>,
}
where usage looks like this
println!("{}", t.a[index]);
t.b.set(S::B(index + 1));
Using a RefCell and an iterator
enum S<'a> {
A,
B(Iter<'a, i32>),
}
struct Test<'a> {
a: Vec<i32>,
b: RefCell<S<'a>>,
}
where usage looks like this
println!("{}", iter.next().unwrap());
Questions
Is there a better way to model this in general vs. what I've tried?
I like approach #2 with the iterator better in theory since it feels cleaner, but I don't like how it introduces explicit lifetime annotations into the struct...in the actual codebase I'm working on, I'd need to update a ton of callsites to add the lifetime annotation and the tiny bit of convenience doesn't seem worth it. Is there some way to do #2 without introducing lifetimes?
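One possible way to get iterator-like ergonomics without a lifetime parameter is to keep the Cell<usize> from approach #1 and hide the indexing behind a method. This is only a sketch under that assumption: the method names start and next_value are made up for illustration and are not from the original code.

```rust
use std::cell::Cell;

#[derive(Clone, Copy, Debug, PartialEq)]
enum S {
    A,
    B(usize),
}

struct Test {
    a: Vec<i32>,
    b: Cell<S>,
}

impl Test {
    fn new(a: Vec<i32>) -> Self {
        Test { a, b: Cell::new(S::A) }
    }

    /// Enter state B at the first element. Takes `&self`, so callers need no `mut`.
    fn start(&self) {
        self.b.set(S::B(0));
    }

    /// Behaves like `Iterator::next`: returns the current element and advances
    /// the stored index. Returns `None` in state A or once the vec is exhausted.
    fn next_value(&self) -> Option<i32> {
        match self.b.get() {
            S::A => None,
            S::B(index) => {
                let value = self.a.get(index).copied()?;
                self.b.set(S::B(index + 1));
                Some(value)
            }
        }
    }
}
```

Because the state only stores an index, Test carries no borrow of its own Vec and therefore needs no lifetime annotation; the trade-off is a bounds check on each access.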

Ergonomically passing a slice of trait objects

I am converting a variety of types to String when they are passed to a function. I'm not concerned about performance as much as ergonomics, so I want the conversion to be implicit. The original, less generic implementation of the function simply used &[impl Into<String>], but I think that it should be possible to pass a variety of types at once without manually converting each to a string.
The key is that ideally, all of the following cases should be valid calls to my function:
// String literals
perform_tasks(&["Hello", "world"]);
// Owned strings
perform_tasks(&[String::from("foo"), String::from("bar")]);
// Non-string types
perform_tasks(&[1,2,3]);
// A mix of any of them
perform_tasks(&["All", 3, String::from("types!")]);
Some various signatures I've attempted to use:
fn perform_tasks(items: &[impl Into<String>])
The original version fails twice; it can't handle numeric types without manual conversion, and it requires all of the arguments to be the same type.
fn perform_tasks(items: &[impl ToString])
This is slightly closer, but it still requires all of the arguments to be of one type.
fn perform_tasks(items: &[&dyn ToString])
Doing it this way is almost enough, but it won't compile unless I manually add a borrow on each argument.
And that's where we are. I suspect that either Borrow or AsRef will be involved in a solution, but I haven't found a way to get them to handle this situation. For convenience, here is a playground link to the final signature in use (without the needed references for it to compile), alongside the various tests.
The following way works for the first three cases if I understand your intention correctly.
pub fn perform_tasks<I, A>(values: I) -> Vec<String>
where
A: ToString,
I: IntoIterator<Item = A>,
{
values.into_iter().map(|s| s.to_string()).collect()
}
As the other comments pointed out, Rust does not support an array of mixed types. However, you can do one extra step to convert them into a &[&dyn fmt::Display] and then call the same function perform_tasks to get their strings.
let slice: &[&dyn std::fmt::Display] = &[&"All", &3, &String::from("types!")];
perform_tasks(slice);
Here is the playground.
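Put together, the approach above can be sketched as a self-contained snippet. The helper mixed_example is a hypothetical name added only to show the mixed-type case; it binds the owned String to a variable first so the borrow clearly outlives the call.

```rust
use std::fmt::Display;

/// Accepts anything iterable whose items implement `ToString`.
pub fn perform_tasks<I, A>(values: I) -> Vec<String>
where
    A: ToString,
    I: IntoIterator<Item = A>,
{
    values.into_iter().map(|s| s.to_string()).collect()
}

/// Mixed types: each element is explicitly borrowed as a `&dyn Display`
/// trait object, which gives the array a single element type.
pub fn mixed_example() -> Vec<String> {
    let owned = String::from("types!");
    let items: [&dyn Display; 3] = [&"All", &3, &owned];
    perform_tasks(&items)
}
```

Homogeneous calls such as perform_tasks(&["Hello", "world"]) and perform_tasks(&[1, 2, 3]) go through the same function unchanged.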
If I understand your intention right, what you want is like this
fn main() {
let a = 1;
myfn(a);
}
fn myfn(i: &dyn SomeTrait) {
//do something
}
So what you want is for an object to be borrowed implicitly when it is passed as a function argument. However, Rust won't let you borrow implicitly here: borrowing is an important safety mechanism in Rust, and the & helps other programmers quickly identify what is a reference and what is not. Rust therefore requires the explicit & to avoid confusion.
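If the explicit & on every argument is the main annoyance, one option that the answers here do not mention is a small macro that inserts the borrows at the call site. This is a sketch of that idea, not a recommendation: the macro perform_tasks! is hypothetical, and the borrows are still there, just written by the macro instead of by hand.

```rust
use std::fmt::Display;

/// Takes a slice of trait objects, as in the last signature from the question.
pub fn perform_tasks(items: &[&dyn Display]) -> Vec<String> {
    items.iter().map(|item| item.to_string()).collect()
}

/// Hypothetical convenience macro: borrows each argument and casts it to
/// `&dyn Display`, so mixed types can be listed without writing `&` by hand.
/// Macros and functions live in separate namespaces, so the shared name is fine.
macro_rules! perform_tasks {
    ($($item:expr),* $(,)?) => {
        perform_tasks(&[$(&$item as &dyn ::std::fmt::Display),*])
    };
}
```

A call then looks like perform_tasks!("All", 3, String::from("types!")); the temporaries created for the arguments live until the end of the calling statement.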

Refactoring out `clone` when Copy trait is not implemented?

Is there a way to get rid of clone(), given the restrictions I've noted in the comments? I would really like to know if it's possible to use borrowing in this case, where modifying the third-party function signature is not possible.
// We should keep the "data" hidden from the consumer
mod le_library {
pub struct Foobar {
data: Vec<i32> // Something that doesn't implement Copy
}
impl Foobar {
pub fn new() -> Foobar {
Foobar {
data: vec![1, 2, 3],
}
}
pub fn foo(&self) -> String {
let i = third_party(self.data.clone()); // Refactor out clone?
format!("{}{}", "foo!", i)
}
}
// Can't change the signature, suppose this comes from a crate
pub fn third_party(data:Vec<i32>) -> i32 {
data[0]
}
}
use le_library::Foobar;
fn main() {
let foobar = Foobar::new();
let foo = foobar.foo();
let foo2 = foobar.foo();
println!("{}", foo);
println!("{}", foo2);
}
As long as your foo() method accepts &self, it is not possible, because the
pub fn third_party(data: Vec<i32>) -> i32
signature is quite unambiguous: regardless of what this third_party function does, its API states that it needs its own instance of Vec, by value. This precludes using borrowing of any form, and because foo() accepts self by reference, you can't really do anything except clone.
Also, presumably this third_party is written without any weird unsafe hacks, so it is quite safe to assume that the Vec which is passed into it is eventually dropped and deallocated. Therefore, unsafely creating a copy of the original Vec without cloning it (by copying internal pointers) is out of the question: you'll definitely get a use-after-free if you do it.
While your question does not state it, the fact that you want to preserve the original value of data is kind of a natural assumption. If this assumption can be relaxed, and you're actually okay with giving the data instance out and e.g. replacing it with an empty vector internally, then there are several things you can potentially do:
Switch foo(&self) to foo(&mut self), then you can quite easily extract data and replace it with an empty vector.
Use Cell or RefCell to store the data. This way, you can continue to use foo(&self), at the cost of some runtime checks when you extract the value out of a cell and replace it with some default value.
Both these approaches, however, will result in you losing the original Vec. With the given third-party API there is no way around that.
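The two options above can be sketched roughly as follows, assuming it really is acceptable to leave an empty vector behind. The type names FoobarMut and FoobarCell are hypothetical, chosen only to show both variants side by side.

```rust
use std::cell::RefCell;
use std::mem;

// Same by-value signature as the third-party function in the question.
pub fn third_party(data: Vec<i32>) -> i32 {
    data[0]
}

/// Option 1: `foo(&mut self)` moves the Vec out and leaves an empty one behind.
pub struct FoobarMut {
    data: Vec<i32>,
}

impl FoobarMut {
    pub fn new(data: Vec<i32>) -> Self {
        FoobarMut { data }
    }

    pub fn foo(&mut self) -> String {
        let data = mem::take(&mut self.data); // self.data is now empty
        format!("foo!{}", third_party(data))
    }
}

/// Option 2: keep `foo(&self)` by storing the Vec in a RefCell.
pub struct FoobarCell {
    data: RefCell<Vec<i32>>,
}

impl FoobarCell {
    pub fn new(data: Vec<i32>) -> Self {
        FoobarCell { data: RefCell::new(data) }
    }

    pub fn foo(&self) -> String {
        // Swaps in an empty Vec; panics if the cell is currently borrowed.
        let data = self.data.replace(Vec::new());
        format!("foo!{}", third_party(data))
    }
}
```

In both variants a second call to foo() would pass an empty vector, and with this particular third_party (which indexes data[0]) that call would panic, which illustrates why losing the original Vec matters.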
If you still can somehow influence this external API, then the best solution would be to change it to accept &[i32], which can easily be obtained from Vec<i32> with borrowing.
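As a sketch of what that would look like, assuming the third-party signature could be changed to take a slice (which the question says is not currently possible), the clone disappears entirely:

```rust
// Hypothetical revised signature: borrows a slice instead of taking ownership.
pub fn third_party(data: &[i32]) -> i32 {
    data[0]
}

pub struct Foobar {
    data: Vec<i32>,
}

impl Foobar {
    pub fn new() -> Foobar {
        Foobar { data: vec![1, 2, 3] }
    }

    pub fn foo(&self) -> String {
        // `&self.data` is a `&Vec<i32>`, which coerces to `&[i32]`; nothing is copied.
        let i = third_party(&self.data);
        format!("{}{}", "foo!", i)
    }
}
```

Here foo() can be called any number of times, because the Vec never leaves the struct.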
No, you can't get rid of the call to clone here.
The problem here is with the third-party library. As the function third_party is written now, it's true that it could be using an &Vec<i32>; it doesn't require ownership, since it's just moving out a value that's Copy. However, since the implementation is outside of your control, there's nothing preventing the person maintaining the function from changing it to take advantage of owning the Vec. It's possible that whatever it is doing would be easier or require less memory if it were allowed to overwrite the provided memory, and the function writer is leaving the door open to do so in the future. If that's not the case, it might be worth suggesting a change to the third-party function's signature and relying on clone in the meantime.

Adding state to a nom parser

I wrote a parser in nom that is completely stateless, now I need to wrap it in a few stateful layers.
I have a top-level parsing function named alt_fn that will provide me the next bit of parsed output as an enum variant, the details of which probably aren't important.
I have three things I need to do that involve state:
1) I need to conditionally perform a transformation on the output of alt_fn if there is a match in a non-mutable HashMap that is part of my State struct. This should basically be like a map! but as a method call on my struct. Something like this:
named!(alt_fn<AllTags> ,alt!(// snipped for brevity));
fn applyMath(self, i:AllTags)->AllTags { // snipped for brevity }
method!(apply_math<State, &[u8], AllTags>, mut self, call_m!(self.applyMath, call!(alt_fn)));
This currently gives me: error: unexpected end of macro invocation with alt_fn underlined.
2) I need to update the other fields of the state struct with the data I got from the input (such as computing checksums and updating timestamps, etc.), and then transform the output again with this new knowledge. This will probably look like the following:
fn updateState(mut self, i:AllTags) -> AllTags { // snipped for brevity }
method!(update_state<State, &[u8], AllTags>, mut self, call_m!(self.updateState, call_m!(self.applyMath)));
3) I need to call the method from part two repeatedly until all the input is used up:
method!(pub parse<State,&[u8],Vec<AllTags>>, mut self, many1!(update_state));
Unfortunately the nom docs are pretty limited, and I'm not great with macro syntax so I don't know what I'm doing wrong.
When I need to do something complicated with nom, I normally write my own functions.
For example
named!(my_func<T>, <my_macros>);
is equivalent to
fn my_func(i: &[u8]) -> nom::IResult<&[u8], T> {
<my_macros>
}
with the proviso that you must pass i to the macro (see my comment).
Creating your own function means you can have any control flow you want in there, and it will play nice with nom as long as it takes a &[u8] and returns nom::IResult where the output &[u8] is the remaining unparsed raw input.
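To illustrate the shape of such hand-written functions without depending on a particular nom version, here is a sketch that uses a local stand-in result type instead of nom::IResult. Everything in it is an assumption made for illustration: ParseResult, ParseError, the fields of State, and the one-byte "tag" parser standing in for alt_fn. The point is only the pattern: a stateless core, a method that updates state from its output, and a loop that repeats until the input is used up.

```rust
/// Local stand-in for a parser result: on success, the remaining input and the value.
type ParseResult<'a, T> = Result<(&'a [u8], T), ParseError>;

#[derive(Debug, PartialEq)]
enum ParseError {
    Incomplete,
}

/// Hypothetical state carried across parser calls.
#[derive(Default)]
struct State {
    checksum: u8,
    tags_seen: usize,
}

impl State {
    /// Stateless core (placeholder for `alt_fn`): reads one byte as a "tag".
    fn alt_fn(input: &[u8]) -> ParseResult<'_, u8> {
        match input.split_first() {
            Some((&tag, rest)) => Ok((rest, tag)),
            None => Err(ParseError::Incomplete),
        }
    }

    /// Stateful layer: parse one tag, then update the state from it.
    fn update_state<'a>(&mut self, input: &'a [u8]) -> ParseResult<'a, u8> {
        let (rest, tag) = Self::alt_fn(input)?;
        self.checksum = self.checksum.wrapping_add(tag);
        self.tags_seen += 1;
        Ok((rest, tag))
    }

    /// Repeats `update_state` until the input is empty; needs at least one tag.
    fn parse<'a>(&mut self, mut input: &'a [u8]) -> ParseResult<'a, Vec<u8>> {
        let mut tags = Vec::new();
        loop {
            let (rest, tag) = self.update_state(input)?;
            tags.push(tag);
            input = rest;
            if input.is_empty() {
                return Ok((input, tags));
            }
        }
    }
}
```

Because these are ordinary methods, any control flow (conditionals on a HashMap lookup, early returns, loops) can be used between the parsing steps.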
If you need some more info comment and I'll try to improve my answer!

Idiomatic way to parse binary data into primitive types

I've written the following method to parse binary data from a gzipped file using GzDecoder from the Flate2 library
fn read_primitive<T: Copy>(reader: &mut GzDecoder<File>) -> std::io::Result<T>
{
let sz = mem::size_of::<T>();
let mut vec = Vec::<u8>::with_capacity(sz);
let ret: T;
unsafe{
vec.set_len(sz);
let mut s = &mut vec[..];
try!(reader.read(&mut s));
let ptr :*const u8 = s.as_ptr();
ret = *(ptr as *const T)
}
Ok(ret)
}
It works, but I'm not particularly happy with the code, especially with using the dummy vector and the temporary variable ptr. It all feels very inelegant to me and I'm sure there's a better way to do this. I'd be happy to hear any suggestions of how to clean up this code.
Your code allows any copyable T, not just primitives. That means you could try to parse in something with a reference, which is probably not what you want:
#[derive(Clone, Copy)]
struct Foo<'a>(&'a str);
However, the general sketch of your code is what I'd expect. You need a temporary place to store some data, and then you must convert that data to the appropriate primitive (perhaps dealing with endianness issues).
I'd recommend the byteorder library. With it, you call specific methods for the primitive that is required:
reader.read_u16::<LittleEndian>()
Since these methods know the desired size, they can stack-allocate an array to use as the temporary buffer, which is likely a bit more efficient than a heap-allocation. Additionally, I'd suggest changing your code to accept a generic object with the Read trait, instead of the specific GzDecoder.
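The same idea (a stack-allocated buffer of known size, and a generic Read bound instead of GzDecoder) can be sketched with the standard library alone. Note that this is not the byteorder API: the function names read_u16_le and read_u32_le are hypothetical helpers, and the conversion uses the standard from_le_bytes constructors.

```rust
use std::io::{self, Read};

/// Reads a little-endian u16 from any reader, using a stack-allocated buffer.
pub fn read_u16_le<R: Read>(reader: &mut R) -> io::Result<u16> {
    let mut buf = [0u8; 2];
    // `read_exact` fails with an error if fewer than 2 bytes are available,
    // unlike `read`, which may return after filling only part of the buffer.
    reader.read_exact(&mut buf)?;
    Ok(u16::from_le_bytes(buf))
}

/// Same pattern for a little-endian u32.
pub fn read_u32_le<R: Read>(reader: &mut R) -> io::Result<u32> {
    let mut buf = [0u8; 4];
    reader.read_exact(&mut buf)?;
    Ok(u32::from_le_bytes(buf))
}
```

Since the functions only require Read, they accept a GzDecoder<File> as well as an in-memory byte slice, which makes them easy to test; no unsafe code is needed.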
You may also want to look into a serialization library like rustc-serialize or serde to see if they fit any of your use cases.
