Rust embedded binary size

I'm new to Rust, and after many fights with the compiler and borrow checker I am finally nearly finished with my first project. But now I have the problem that the binary is getting too big to fit into the flash of the microcontroller.
I'm using an STM32F103C8 with 64K flash on a BluePill.
At first I was able to fit the code on the microcontroller, but bit by bit I had to enable optimizations and the like. Now I compile with:
[profile.dev]
codegen-units = 1
debug = 0
lto = true
opt-level = "z"
and am able to fit the binary. opt-level = "s" generates a binary that is too big. The error I get then is: rust-lld: error: section '.rodata' will not fit in region 'FLASH': overflowed by 606 bytes
As I have under 1,000 lines of code and, I would say, not particularly unusual dependencies, this seems strange.
There are a few sites with tips on minimizing binary size; as these are not aimed at embedded targets, most of their suggestions are followed anyway.
How can I minimize the binary size while still being able to debug it?
My dependencies are:
[dependencies]
cortex-m = "*"
panic-halt = "*"
embedded-hal = "*"
[dependencies.cortex-m-rtfm]
version = "0.4.3"
features = ["timer-queue"]
[dependencies.stm32f1]
version = "*"
features = ["stm32f103", "rt"]
[dependencies.stm32f1xx-hal]
version = "0.4.0"
features = ["stm32f103", "rt"]
Maybe part of the problem is that, as I noticed, cargo build compiles some sub-dependencies multiple times in different versions.
Inside the memory.x file:
MEMORY
{
FLASH : ORIGIN = 0x08000000, LENGTH = 64K
RAM : ORIGIN = 0x20000000, LENGTH = 20K
}
Rustc version: rustc 1.37.0 (eae3437df 2019-08-13)
Edit: The Rust panic behavior is abort.
The code is viewable at: https://github.com/DarkPhoeniz/rc-switcher-rust

I've run into similar issues and may be able to shed some light on what you can do to reduce the size of the binary you're outputting.
You've already discovered one of them: opt-level = "z". The difference between s and z is the inlining threshold - essentially, the maximum size of a function the compiler still deems worth inlining. z sets this to 25, s to 75. Depending on what you are building, this may or may not yield a significant reduction in size (it primarily affects .rodata and .text).
Another thing you can play with is the panic behavior of your code. If I remember correctly, the stm32 target supports both unwind and abort, with unwind enabled in the dev profile. As I'm sure you can understand, unwinding the stack is a large and costly process in terms of code size. As such, setting panic = "abort" in your Cargo.toml might reduce the binary size a bit further.
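As a minimal sketch of that suggestion (merge it with the profile settings you already have):
[profile.dev]
panic = "abort"
opt-level = "z"
lto = true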
Beyond that, it comes down to manual tuning, and tools like cargo-binutils can be extremely useful for this. Depending on your use case, there may be leftover Debug implementations which are only sporadically needed, and that is definitely something you could act on.
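For example, here is a hypothetical sketch of keeping a Debug derive behind a Cargo feature, so that flash builds drop the formatting code (the debug-impls feature name and the SwitchState type are made up for illustration):
// Hypothetical: declare `debug-impls = []` under [features] in Cargo.toml.
// With the feature disabled, no Debug formatting code is generated for this type.
#[cfg_attr(feature = "debug-impls", derive(Debug))]
pub struct SwitchState {
    pub channel: u8,
    pub on: bool,
}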

A few other general tips for shrinking the binary:
First, the cargo-bloat utility is useful for determining what is taking up space in your binary; you can then make informed decisions about how to modify your code to shrink it down.
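For instance, something along these lines lists the largest functions in the binary (flags per cargo-bloat's documentation; the exact output will vary):
cargo bloat --release -n 10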
Second, I've had significant success by configuring the compiler to optimize all dependencies, but leave the top level crate unoptimized for easier debugging. You can do this by adding the following to your Cargo.toml:
# Optimize all dependencies
[profile.dev.package."*"]
opt-level = "z"
If you want to debug a specific dependency (for example: cortex-m-rt), you can make it unoptimized like so:
# Don't optimize the `cortex-m-rt` crate
[profile.dev.package.cortex-m-rt]
opt-level = 0
# Optimize all the other dependencies
[profile.dev.package."*"]
opt-level = "z"

Related

Explicitly setting the inlining threshold vs. using optimization level "s"

If I'm reading the documentation here right, setting opt-level = "s" in Cargo.toml is equivalent to setting the inlining threshold to 75.
So I would expect the following two Cargo.toml snippets to be equivalent:
[profile.release]
opt-level = "s"
cargo-features = ["profile-rustflags"]
...
[profile.release]
rustflags = ["-C", "inline-threshold=75"]
However, the executable size I get with the second version is almost twice the size of the first version, and it matches the size I get without setting the inline-threshold at all (i.e. using the release build's default of 275).
How do I manually set the inlining threshold to match the behaviour of opt-level = "s"? Yes, I could just use opt-level = "s" itself, but my ultimate goal is to then start tweaking the threshold to see how the performance and the binary size changes.

Why does this bevy project take so long to compile and launch?

I started following this tutorial on how to make a game in bevy. The code compiles fine, but compilation is still pretty slow (I'm honestly not sure if that's normal; it takes around 8 seconds), and when I launch the game, the window goes white (Not Responding) for a few seconds (about the same amount of time as the compile time, maybe a tiny bit less) before properly loading.
Here's my Cargo.toml:
[package]
name = "rustship"
version = "0.1.0"
edition = "2021"
[dependencies]
bevy = "0.8.1"
# Enable a small amount of optimization in debug mode
[profile.dev]
opt-level = 1
# Enable high optimizations for dependencies (incl. Bevy), but not for our code:
[profile.dev.package."*"]
opt-level = 3
[workspace]
resolver = "2"
I tried it with and without the workspace resolver. My rustup toolchain is nightly-x86_64-pc-windows-gnu and I'm using rust-lld to link the program:
[target.nightly-x86_64-pc-windows-gnu]
linker = "rust-lld.exe"
rustflags = ["-Zshare-generics=n"]
According to the official bevy setup guide it should be faster this way. I tried it with rust-lld and without, but it doesn't seem to change anything.
Here's the output of cargo run (with A_NUMBER being a 4-digit number):
AdapterInfo { name: "NVIDIA GeForce RTX 3090", vendor: A_NUMBER, device: A_NUMBER, device_type: DiscreteGpu, backend: Vulkan }
Any ideas on how I can improve the compile time and make the window load directly? My game isn't heavy at all; for now, I'm just loading a sprite. The guy in the tutorial uses macOS and it seems to be pretty fast for him.

rust / cargo workspace: how to specify different profiles for different sub-projects

I have a Rust Cargo workspace that contains different sub-projects:
./
├─Cargo.toml
├─project1/
│ ├─Cargo.toml
│ ├─src/
├─project2/
│ ├─Cargo.toml
│ ├─src/
I would like to build one project optimized for binary size and the other for speed.
From my understanding, profiles can only be tweaked at the root Cargo.toml level, so this configuration, for instance, applies to all my sub-projects.
root Cargo.toml:
[workspace]
members = ["project1", "project2"]
[profile.release]
# less code to include into binary
panic = 'abort'
# optimization over all codebase ( better optimization, slower build )
codegen-units = 1
# optimization for size ( more aggressive )
opt-level = 'z'
# optimization for size
# opt-level = 's'
# link time optimization using using whole-program analysis
lto = true
If I try to apply this configuration in a sub-project's Cargo.toml it doesn't work.
Question: is there a way to configure each project independently?
Thank you in advance.
Edit: I also forgot to say that one project is built with trunk and is a WASM project (I want it to be as small as possible); the other is a backend and I really need it to be built for speed.
Each crate in a workspace can have its own .cargo/config.toml where different profiles can be defined. I've toyed around with this a bit to have one crate for an embedded device, one for a CLI utility to connect to the device over serial, and shared libraries for both of them. Pay attention to the caveat in the docs about needing to be in the crate directory for the config to be read; it won't work from the workspace root.
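As a rough sketch under a layout like the one above (paths and values are assumptions), project1/.cargo/config.toml could hold the size-optimized profile:
# project1/.cargo/config.toml - size-optimized release builds
[profile.release]
opt-level = "z"
lto = true
panic = "abort"
while project2/.cargo/config.toml keeps a speed-oriented one:
# project2/.cargo/config.toml - speed-optimized release builds
[profile.release]
opt-level = 3
codegen-units = 1
Remember to invoke cargo from inside each project directory so its config is actually read.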

How to improve Vec initialization time?

Initializing a Vec in Rust is incredibly slow compared with other languages. For example, the following code
let xs: Vec<u32> = vec![0u32; 1000000];
will translate to
let xs: Vec<u32> = Vec::new();
xs.push(0);
xs.push(0);
xs.push(0);
// ...
one million times. If you compare this to the following code in C:
uint32_t* xs = calloc(1000000, sizeof(uint32_t));
the difference is striking.
I had a little bit more luck with
let mut xs: Vec<u32> = Vec::with_capacity(1000000);
xs.resize(1000000, 0);
but it's still very slow.
Is there any way to initialize a Vec faster?
You are actually performing different operations. In Rust, you are allocating an array of one million zeroes. In C, you are allocating an array of one million zero-length elements (i.e. non-existent elements); I don't believe this actually does anything, as DK pointed out in a comment on your question.
Also, running the code you presented verbatim gave me very comparable times on my laptop when optimizing; however, this is probably because the vector allocation in Rust is being optimized away, as the variable is never used.
cargo build --release
time ../target/release/test
real 0.024s
usr 0.004s
sys 0.008s
and the C:
gcc -O3 test.c
time ./a.out
real 0.023s
usr 0.004s
sys 0.004s
Without --release, the Rust performance drops, presumably because the allocation actually happens. Note that calloc() also checks whether the memory is already zeroed, and doesn't reset memory that is already zero. This makes the execution time of calloc() somewhat dependent on the previous state of your memory.
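To illustrate, here is a minimal, self-contained sketch (plain std, no external crates) that times both forms while actually using the vectors, so the allocations cannot be optimized away; exact timings will of course vary:
use std::time::Instant;

fn main() {
    // The semicolon form allocates the zeroed buffer in one step; for an
    // all-zero integer pattern it can be served by a calloc-like zeroed allocation.
    let t = Instant::now();
    let xs: Vec<u32> = vec![0u32; 1_000_000];
    println!("vec![0; n]: {:?} (len = {})", t.elapsed(), xs.len());

    // Reserving and then resizing avoids repeated reallocations,
    // but still writes the zeroes explicitly.
    let t = Instant::now();
    let mut ys: Vec<u32> = Vec::with_capacity(1_000_000);
    ys.resize(1_000_000, 0);
    println!("with_capacity + resize: {:?} (len = {})", t.elapsed(), ys.len());
}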

How can I compile C code to get a bare-metal skeleton of a minimal RISC-V assembly program?

I have the following simple C code:
void main() {
    int A = 333;
    int B = 244;
    int sum;
    sum = A + B;
}
When I compile this with
$riscv64-unknown-elf-gcc code.c -o code.o
If I want to see the assembly code I use
$riscv64-unknown-elf-objdump -d code.o
But when I explore the assembly code, I see that this generates a lot of code which I assume is for proxy kernel support (I am a newbie to RISC-V). However, I do not want this code to have proxy kernel support, because the idea is to implement only this simple C code on an FPGA.
I read that RISC-V provides three types of compilation: bare-metal mode, newlib with proxy kernel, and RISC-V Linux. According to previous research, the kind of compilation that I should do is bare-metal mode, because I want minimal assembly without support for an operating system or proxy kernel. Assembly functions such as system calls are not required.
However, I have not yet been able to find out how I can compile C code to get a skeleton of a minimal RISC-V assembly program. How can I compile the C code above in bare-metal mode to get a skeleton of minimal RISC-V assembly code?
Warning: this answer is somewhat out of date as of the latest RISC-V Privileged Spec v1.9, which removes the tohost Control/Status Register (CSR), part of the non-standard Host-Target Interface (HTIF) that has since been removed. The current (as of 2016 Sep) riscv-tests instead perform a memory-mapped store to a tohost memory location, which in a tethered environment is monitored by the front-end server.
If you really and truly need/want to run RISC-V code bare-metal, then here are the instructions to do so. You lose a bunch of useful stuff, like printf or FP-trap software emulation, which the riscv-pk (proxy kernel) provides.
First things first - Spike boots up at 0x200. As Spike is the golden ISA simulator model, your core should also boot up at 0x200.
(cough, as of 2015 Jul 13, the "master" branch of riscv-tools (https://github.com/riscv/riscv-tools) is using an older pre-v1.7 Privileged ISA, and thus starts at 0x2000. This post will assume you are using v1.7+, which may require using the "new_privileged_isa" branch of riscv-tools).
So when you disassemble your bare-metal program, it better start at 0x200! If you want to run it on top of the proxy kernel, it better start at 0x10000 (and if Linux, it's something even larger...).
Now, if you want to run bare metal, you're forcing yourself to write up the processor boot code. Yuck. But let's punt on that and pretend that's not necessary.
(You can also look into riscv-tests/env/p for the "virtual machine" description for a physically addressed machine. You'll find the linker script you need and a macros.h describing some initial setup code. Or better yet, look at riscv-tests/benchmarks/common/crt.S.)
Anyways, armed with the above (confusing) knowledge, let's throw that all away and start from scratch ourselves...
hello.s:
.align 6
.globl _start
_start:
    # screw boot code, we're going minimalist
    # mtohost is the CSR in machine mode
    csrw mtohost, 1;
1:
    j 1b
and link.ld:
OUTPUT_ARCH( "riscv" )
ENTRY( _start )
SECTIONS
{
    /* text: test code section */
    . = 0x200;
    .text :
    {
        *(.text)
    }
    /* data: initialized data segment */
    .data :
    {
        *(.data)
    }
    /* end of uninitialized data segment */
    _end = .;
}
Now to compile this…
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -Tlink.ld -o hello hello.s
This compiles to (riscv64-unknown-elf-objdump -d hello):
hello: file format elf64-littleriscv
Disassembly of section .text:
0000000000000200 <_start>:
200: 7810d073 csrwi tohost,1
204: 0000006f j 204 <_start+0x4>
And to run it:
spike hello
It’s a thing of beauty.
The linker script places our code at 0x200. Spike will start at 0x200 and then write a #1 to the control/status register "tohost", which tells Spike "stop running". And then we spin on an address (1: j 1b) until the front-end server has gotten the message and kills us.
It may be possible to ditch the linker script if you can figure out how to tell the compiler to move <_start> to 0x200 on its own.
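One untested possibility: pass the text section address straight to the linker instead of using a script, something like:
riscv64-unknown-elf-gcc -nostdlib -nostartfiles -Wl,-Ttext=0x200 -o hello hello.s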
For other examples, you can peruse the following repositories:
The riscv-tests repository holds the RISC-V ISA tests, which are very minimal (https://github.com/riscv/riscv-tests).
This Makefile has the compiler options: https://github.com/riscv/riscv-tests/blob/master/isa/Makefile
Many of the "virtual machine" description macros and linker scripts can be found in riscv-tests/env (https://github.com/riscv/riscv-test-env).
You can take a look at the "simplest" test at riscv-tests/isa/rv64ui-p-simple.dump.
And you can check out riscv-tests/benchmarks/common for start-up and support code for running bare-metal.
The "extra" code is put there by gcc and is the sort of stuff required for any program. The proxy kernel is designed to be the bare minimum amount of support required to run such things. Once your processor is working, I would recommend running things on top of pk rather than bare-metal.
In the meantime, if you want to look at simple assembly, I would recommend skipping the linking phase with '-c':
riscv64-unknown-elf-gcc code.c -c -o code.o
riscv64-unknown-elf-objdump -d code.o
For examples of running code without pk or linux, I would look at riscv-tests.
I'm surprised no one mentioned gcc -S, which skips the assembling and linking steps altogether and outputs assembly code directly, albeit with a bunch of boilerplate; it may be convenient just for poking around.
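For example (the -O2 flag is just an illustration; any optimization level works):
riscv64-unknown-elf-gcc -S -O2 code.c -o code.s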
