I am trying to run multi-threaded programs/benchmarks on a rocket-chip SoC I generated from Chipyard.
I generated the TutorialConfig SoC given in https://fires.im/isca22-slides-pdf/03_building_custom_socs.pdf, which consists of a Rocket core and a BOOM core.
To check whether I can run a multi-threaded program on this configuration, I compiled the mt-matmul benchmark from riscv-tests after changing the number of cores in the crt.S file.
I ran it using the following command:
make CONFIG=TutorialStarterConfig run-binary BINARY=riscv-tests/benchmarks/mt-matmul.riscv
In the output trace, I can only see 'C0' at the beginning of each line; I assume I would see 'C1' if the program were executing on a second core.
Is this the correct way to run multi-threaded programs with RocketChip SoCs?
Do I need to change anything else in the programs or in the SoC?
Background
I need to run a block-based simulation. I've used OMEdit to create the system, and I call omc to run the simulation using OMPython with zmq for messaging. The simulation works fine, but now I need to move it to a server to simulate the system for long stretches of time.
Since the server is shared among a team of people, it uses slurm to queue the jobs. The server has 32 cores but they asked me to use only 8 while I tune my script and then 24 when I want to run my final simulation.
I've configured slurm to call my script in the following manner:
#!/bin/bash
#
#SBATCH --job-name=simulation_Test_ADC2_pipe_4096s
#SBATCH --output=simulation_Test_ADC2_pipe_4096s.txt
#
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --time=10:00
#SBATCH --mem-per-cpu=10000
source activate /home/jabozzo/conda_envs/py27
#which python
python ./Test_ADC2_pipe_4096s.py
Then I execute the slurm file using sbatch.
Problem
The omc compilation works fine, but when the simulation starts, all 32 cores of the server become loaded, even though it was configured to use only 8.
I've tried
There are compilation and simulation flags that can be passed to omc. I've tried --numProcs (a compilation flag), but it only seems to apply during the compilation process and does not affect the final executable. I've scanned the page of simulation flags looking for something related, but it seems there is no option to change the CPU usage.
The only thing we add when running our OpenModelica testing in parallel is the GC_MARKERS=1 environment variable together with --numProcs=1; this makes our nightly library testing of 10,000 tests run entirely in serial. But GC_MARKERS shouldn't affect simulations unless they allocate extreme amounts of memory. Other than that, OpenModelica simulations are serial unless you use a parallel BLAS/LAPACK/SUNDIALS library, which might use more cores without OpenModelica knowing anything about it; in that case you would need to read the documentation of the library that's consuming your resources.
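As a rough sketch (the .mos script and model names here are made up), the two settings apply at different stages:
omc --numProcs=1 buildModel.mos    # compile time: limit omc's own parallelism
GC_MARKERS=1 ./MyModel             # run time: limit the garbage collector's marker threads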
What's a bit surprising is that slurm allows your process to consume more CPUs than you requested; it could use the taskset command to make the kernel force the process onto only certain CPUs.
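For reference, pinning the job to the first 8 CPUs with taskset would look something like this (using the script name from the question); the affinity mask is inherited by child processes, so the simulation executable that omc generates would be confined as well:
taskset -c 0-7 python ./Test_ADC2_pipe_4096s.py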
My server administrator was unsure whether taskset would interfere with slurm internals, so we found another option: if the executable generated by omc uses OpenMP, we can also limit the number of cores by replacing the last line of the slurm file with:
OMP_NUM_THREADS=8 python ./Test_ADC2_pipe_4096s.py
I'm leaving this answer here to complement sjoelund.se's answer.
I'm working on designing a MIPS processor using Verilog in the ModelSim student version. We developed a C++ tool that converts assembly operations to machine code and saves the result in a .txt file. Is there a way to make ModelSim run this tool when simulation starts?
You can use $system("foo"); to run any system call from SystemVerilog, including invoking your external C++ program, for example $system("echo hi"); or:
$system("path/to/my/cpp_binary.exe arg1 arg2 arg3");
If you wrap it in an initial block, you can run it at the beginning of simulation. See this answer.
I think you meant that you want to simulate an application running on your processor. To do that, you need a testbench that models memory, with all the necessary connections to your processor. Then get your .txt file into a form that can be read by the $readmemh() system task, and load the contents of the file into your memory.
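If you prefer to drive the flow from the shell instead, something like the following could work (the converter name, file names, and top-level module here are all hypothetical):
./asm2machine program.asm > program.txt      # your C++ converter
vsim -c -do "run -all; quit" work.tb_mips    # batch-mode ModelSim run; the testbench $readmemh()s program.txt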
From time to time my Go program crashes.
I tried a few things in order to get core dumps generated for this program:
setting the ulimit on the system: I tried both ulimit -c unlimited and ulimit -c 10000, just in case. After launching my panicking program, I get no core dump.
I also added recover() support in my program and code to log to syslog in case of panic, but I get nothing in syslog.
I am running out of ideas right now.
I must have overlooked something, but I cannot find what; any help would be appreciated.
Thanks! :)
Note that a core dump is generated by the OS when a condition from a certain set is met. These conditions are pretty low-level, like trying to access unmapped memory or trying to execute an opcode the CPU does not know, etc. Under a POSIX operating system such as Linux, when a process does one of these things, an appropriate signal is sent to it, and some of these signals, if not handled by the process, have the default action of generating a core dump, which is done by the OS unless prohibited by a certain limit.
Now observe that this machinery treats a process at the lowest possible level (machine code), but the binaries a Go compiler produces are higher-level than those a C compiler (or assembler) produces, which means certain errors in a process produced by a Go compiler are handled by the Go runtime rather than the OS. For instance, a typical NULL pointer dereference in a process produced by a C compiler usually results in the process being sent the SIGSEGV signal, which then typically results in an attempt to dump the process's core and terminate it. In contrast, when this happens in a process compiled by a Go compiler, the Go runtime kicks in and panics, producing a nice stack trace for debugging purposes.
With these facts in mind, I would try to do this:
Wrap your program in a shell script which first relaxes the limit for core dumps (but see below) and then runs your program with its standard error stream redirected to a file (or piped to the logger binary, etc.) — see the sketch after this list.
The limits a user can tweak have a hierarchy: there are soft and hard limits — see this and this for an explanation. So check that your system does not have a hard limit of 0 for the core dump size, as this would explain why your attempt to raise the limit has no effect.
At least on my Debian systems, when a program dies due to SIGSEGV, this fact is logged by the kernel and is visible in the syslog log files, so try grepping them for hints.
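A minimal version of such a wrapper (the program path and log file are placeholders; GOTRACEBACK=crash is an extra suggestion that makes the Go runtime abort, and thus potentially dump core, on a fatal panic):
#!/bin/sh
ulimit -c unlimited        # raise the soft core-size limit (check the hard limit with: ulimit -Hc)
export GOTRACEBACK=crash   # let a fatal Go panic end in an abort that can produce a core dump
exec /path/to/my_go_program 2>>/tmp/my_go_program.err
To follow the other two hints, ulimit -Sc and ulimit -Hc print the current soft and hard core-size limits, and grep -i segfault /var/log/syslog searches the kernel's segfault records.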
First, please make sure all errors are handled.
For the core dump, you can refer to generate a core dump in linux.
You can use supervisor to restart the program when it crashes.
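A minimal supervisord entry for that (program name and paths are placeholders) might look like:
[program:mygoapp]
command=/path/to/my_go_program
autorestart=true
stderr_logfile=/var/log/mygoapp.err.log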
I'm a little stuck here.
The idea is that I'd like to get a file of every instruction run by a program during its execution. I'd like to do it with just the executable in hand (no source) and be able to determine what operation is occurring on what address, and when.
For example, I'd like to be able to run it on Google Chrome, Firefox, etc.
I want to use this for a performance prediction system I'm working on. I figure that if I can obtain each instruction in the order it is executed on system 1, I can attempt to simulate/model the run time of an identical program being run on system 2, because I'll be able to predict (although I know not with 100% accuracy) L1/L2 cache misses, L1/L2 cache hits, TLB hits/misses, page faults, time taken on floating-point multiplication operations, etc.
I'd like to try to do this on two different systems:
System 1: Ubuntu 10.10 on Intel Core 2 Duo CPU
System 2: Ubuntu 12.04 on system with 2x AMD Sixteen Core Opteron model 6274
(I can definitely change the OSes as necessary, but would prefer to stay with Ubuntu, if possible)
Is this possible, and how could I go about doing it? I know debuggers can step through everything, but I don't have the source available.
I think you can use QEMU (or even Bochs) or Valgrind to monitor every executed instruction. QEMU and Valgrind are x86 binary-translation tools, while Bochs is an interpreter of x86 code. There is a Valgrind tool called cachegrind (plus the kcachegrind GUI) which emulates the cache by instrumenting every memory access and simulating an L1/L2 cache model (sizes may be configured via command-line options).
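For instance, a cachegrind run with explicitly configured cache sizes (the values below are only examples) followed by the annotation step:
valgrind --tool=cachegrind --I1=32768,8,64 --D1=32768,8,64 --LL=3145728,12,64 ./the_program
cg_annotate cachegrind.out.<pid>    # cachegrind fills in the <pid> part of the output file name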
To get deeper (into the pipeline) you may want to look at the free PTLsim (http://www.ptlsim.org/).
I am developing an MPI program on a Linux machine where I do not have sudo/su access. As my program currently segfaults, I would like to examine the core dumps via gdb. Unfortunately, as the program runs as multiple processes, they all write to one core dump. So I would like to be able to append the PID to each separate core dump for every process.
I know there is a way to do it via /proc/sys/kernel/core_pattern, however I do not have access to write to this.
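For reference, the change I would make if I had the access is something like this (%p expands to the PID of the dumping process):
echo 'core.%p' | sudo tee /proc/sys/kernel/core_pattern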
Thanks for any help.
It can be a pain to debug MPI apps on systems configured this way when you do not have root access. One option for working around this is to use Valgrind to get stack traces for your segfault(s). This will only be useful provided that your application fails in a reasonable period of time when slowed down by Valgrind, and that it still segfaults at all in this case.
I usually run MPI apps under Valgrind like this:
% mpiexec -n 5 valgrind -q /path/to/my_app
That will send all of the Valgrind output to standard error. But if you want the output separated into different files, you can get a bit fancier:
% mpiexec -n 5 valgrind -q --log-file='vg_out.%q{PMI_RANK}' /path/to/my_app
That's the setup for MPICH2. I think that for Open MPI you'll need to replace PMI_RANK with OMPI_MCA_ns_nds_vpid, but if that doesn't work for you then you'll need to check with the Open MPI developers on their discussion list. In either case, this will yield N files, where N is the size of MPI_COMM_WORLD, named vg_out.0, vg_out.1, ..., vg_out.$((N-1)), each corresponding to a rank in MPI_COMM_WORLD.
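Applying that substitution for Open MPI (subject to the caveat above) would give:
% mpiexec -n 5 valgrind -q --log-file='vg_out.%q{OMPI_MCA_ns_nds_vpid}' /path/to/my_app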