Halide AOT for OpenCL works fine as static library but not as shared object

I am trying to compile the code below both to a static library and to an object file:
Halide::Func f("f");
Halide::Var x("x");
f(x) = x;
f.gpu_tile(x, 4);
f.bound(x, 0, 16);
Halide::Target target = Halide::get_target_from_environment();
target.set_feature(Halide::Target::OpenCL);
target.set_feature(Halide::Target::Debug);
// f.compile_to_static_library("mylib", {}, "f", target);
// f.compile_to_file("mylib", {}, "f", target);
With static linking everything works fine and the output is correct:
Halide::Buffer<int> output(16);
f(output.raw_buffer());
output.copy_to_host();
std::cout << output(10) << std::endl;
But when I link the object file into a shared object,
gcc -shared -pthread mylib.o -o mylib.so
and then open it from my code (Ubuntu 16.04),
void* handle = dlopen("mylib.so", RTLD_NOW);
int (*func)(halide_buffer_t*);
*(void**)(&func) = dlsym(handle, "f");
func(output.raw_buffer());
I receive a CL_INVALID_MEM_OBJECT error. Here is the debugging log:
CL: halide_opencl_init_kernels (user_context: 0x0, state_ptr: 0x7f1266b5a4e0, program: 0x7f1266957480, size: 1577
load_libopencl (user_context: 0x0)
Loaded OpenCL runtime library: libOpenCL.so
create_opencl_context (user_context: 0x0)
Got platform 'Intel(R) OpenCL', about to create context (t=6249430)
Multiple CL devices detected. Selecting the one with the most cores.
Device 0 has 20 cores
Device 1 has 4 cores
Selected device 0
device name: Intel(R) HD Graphics
device vendor: Intel(R) Corporation
device profile: FULL_PROFILE
global mem size: 1630 MB
max mem alloc size: 815 MB
local mem size: 65536
max compute units: 20
max workgroup size: 256
max work item dimensions: 3
max work item sizes: 256x256x256x0
clCreateContext -> 0x1899af0
clCreateCommandQueue 0x1a26a80
clCreateProgramWithSource -> 0x1a26ab0
clBuildProgram 0x1a26ab0 -D MAX_CONSTANT_BUFFER_SIZE=854799155 -D MAX_CONSTANT_ARGS=8
Time: 1.015832e+02 ms
CL: halide_opencl_run (user_context: 0x0, entry: kernel_f_s0_x___deprecated_block_id_x___block_id_x, blocks: 4x1x1, threads: 4x1x1, shmem: 0
clCreateKernel kernel_f_s0_x___deprecated_block_id_x___block_id_x -> Time: 1.361700e-02 ms
clSetKernelArg 0 4 [0x2e00010000000000 ...] 0
clSetKernelArg 1 8 [0x2149040 ...] 1
Mapped dev handle is: 0x2149040
Error: CL: clSetKernelArg failed: CL_INVALID_MEM_OBJECT
Aborted (core dumped)
Thank you very much for the help! Halide commit c7375fa. I'm happy to provide extra information if necessary.

Solution: in this case the Halide runtime is duplicated (the host executable and the shared object each carry their own copy). Load the shared object with the RTLD_DEEPBIND flag:
void* handle = dlopen("mylib.so", RTLD_NOW | RTLD_DEEPBIND);
RTLD_DEEPBIND (since glibc 2.3.4)
Place the lookup scope of the symbols in this library ahead of the global scope. This means that a self-contained library will use its own symbols in preference to global symbols with the same name contained in libraries that have already been loaded. This flag is not specified in POSIX.1-2001.
https://linux.die.net/man/3/dlopen
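For completeness, here is a minimal sketch of the loading side with the flag applied. It assumes the AOT pipeline was compiled with the name "f" as above, and that HalideBuffer.h (which provides Halide::Runtime::Buffer for AOT clients) is on the include path; error handling is illustrative only.
// loader.cpp -- minimal sketch; build with -std=c++11 or later, link with -ldl on older glibc.
#include <dlfcn.h>
#include <iostream>
#include "HalideBuffer.h"   // Halide::Runtime::Buffer / halide_buffer_t

int main() {
    // RTLD_DEEPBIND keeps the .so bound to its own copy of the Halide runtime.
    void* handle = dlopen("./mylib.so", RTLD_NOW | RTLD_DEEPBIND);
    if (!handle) { std::cerr << dlerror() << "\n"; return 1; }

    using f_t = int (*)(halide_buffer_t*);
    auto f = reinterpret_cast<f_t>(dlsym(handle, "f"));
    if (!f) { std::cerr << dlerror() << "\n"; return 1; }

    Halide::Runtime::Buffer<int> output(16);
    f(output.raw_buffer());
    output.copy_to_host();      // the pipeline ran on the GPU, so copy back
    std::cout << output(10) << std::endl;

    dlclose(handle);
    return 0;
}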

Related

ubiformat in barebox giving timeout

I have a custom i.MX 6UL board with Barebox (partially) functional. The on-board Semper s25hs512t flash is detected (after adding the necessary device ID in drivers/mtd/spi-nor/spi-nor.c).
The problem: my board has no Ethernet or removable SD card. I need to burn the bootloader onto the s25hs512 flash, so I need to format the flash accordingly and copy the files onto it.
My dtsi has:
&qspi {
    pinctrl-names = "default";
    pinctrl-0 = <&pinctrl_qspi>;
    status = "okay";

    flash0: s25hs512t@0 {
        #address-cells = <1>;
        #size-cells = <1>;
        compatible = "spansion,s25hs512t", "jedec,spi-nor";
        spi-max-frequency = <40000000>;
        spi-rx-bus-width = <4>;
        spi-tx-bus-width = <4>;
        reg = <0>;
        spi-mode = <0>;
        m25p,fast-read;
        status = "okay";

        partition@0 {
            label = "barebox";
            reg = <0x00000000 0x00100000>;
        };
        partition@1 {
            label = "barebox-env";
            reg = <0x00100000 0x00040000>;
        };
        partition@2 {
            label = "barebox-of";
            reg = <0x00140000 0x00040000>;
        };
        partition@3 {
            label = "kernel";
            reg = <0x00180000 0x00800000>;
        };
        partition@4 {
            label = "root";
            reg = <0x00980000 0x03640000>;
        };
    };
};
On boot, barebox detects the flash:
Board: Freescale i.MX6 UltraLite Caisteal Board
detected i.MX6 UltraLite revision 1.0
i.MX6 UltraLite unique ID: 241e09d4e317402a
m25p80 s25hs512t@00: s25hs512t (65536 Kbytes). <=====
imx-esdhc 2194000.mmc@2194000.of: registered as mmc1
rng_self_test: RNG software self-test passed
caam 2140000.crypto@2140000.of: Instantiated RNG4 SH0
caam 2140000.crypto@2140000.of: Instantiated RNG4 SH1
malloc space: 0x8eefcf80 -> 0x9ddf9eff (size 239 MiB)
barebox-environment chosen:environment.of: probe failed: No such file or directory
devinfo shows
`-- 21e0000.spi@21e0000.of
`-- s25hs512t@00
`-- m25p0
`-- 0x00000000-0x03ffffff ( 64 MiB): /dev/m25p0
`-- m25p0.barebox
`-- 0x00000000-0x000fffff ( 1 MiB): /dev/m25p0.barebox
`-- m25p0.barebox-env
`-- 0x00000000-0x0003ffff ( 256 KiB): /dev/m25p0.barebox-env
`-- m25p0.barebox-of
`-- 0x00000000-0x0003ffff ( 256 KiB): /dev/m25p0.barebox-of
`-- m25p0.kernel
`-- 0x00000000-0x007fffff ( 8 MiB): /dev/m25p0.kernel
`-- m25p0.root
`-- 0x00000000-0x0363ffff ( 54.3 MiB): /dev/m25p0.root
but when I run ubiformat, I oddly get this:
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiformat /dev/m25p0.barebox -y
ubiformat: m25p0.barebox (nor), size 1048576 bytes (1 MiB), 4 eraseblocks of 262144 bytes (256 KiB), min. I/O size 1 bytes
libscan: scanning eraseblock 3 -- 100 % complete
ubiformat: 1 eraseblocks are supposedly empty
ubiformat: warning!: 3 of 4 eraseblocks contain non-ubifs data
ubiformat: warning!: only 0 of 4 eraseblocks have valid erase counter
ubiformat: erase counter 0 will be used for all eraseblocks
ubiformat: note, arbitrary erase counter value may be specified using -e option
ubiformat: use erase counter 0 for all eraseblocks
ubiformat: formatting eraseblock 3 -- 100 % complete
ERROR: m25p80 s25hs512t@00: flash operation timed out
ERROR: m25p0.barebox: error -110 while writing 262144 bytes to PEB 0:0, written 0 bytes
libubigen: error!: cannot write 262144 bytes
ubiformat: error!: cannot write layout volume
ubiformat: Operation not permitted
Any way ahead from this?
PS: Update
Thanks for the help from @TrentP. I am focusing only on formatting the larger partitions so that I can write the kernel and root partitions, but I have not been able to mount the UBI partition. I get the following issue (read-only filesystem):
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ erase /dev/m25p0.kernel
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiattach /dev/m25p0.kernel
NOTICE: ubi0: scanning is finished
NOTICE: ubi0: empty MTD device detected
NOTICE: ubi0: registering /dev/m25p0.kernel.ubi
NOTICE: ubi0: attached mtd0 (name "m25p0.kernel", size 8 MiB) to ubi0
NOTICE: ubi0: PEB size: 262144 bytes (256 KiB), LEB size: 262016 bytes
NOTICE: ubi0: min./max. I/O unit sizes: 1/256, sub-page size 1
NOTICE: ubi0: VID header offset: 64 (aligned 64), data offset: 128
NOTICE: ubi0: good PEBs: 32, bad PEBs: 0, corrupted PEBs: 0
NOTICE: ubi0: user volume: 0, internal volumes: 1, max. volumes count: 128
NOTICE: ubi0: max/mean erase counter: 1/0, WL threshold: 65536, image sequence number: 1700878141
NOTICE: ubi0: available PEBs: 28, total reserved PEBs: 4, PEBs reserved for bad PEB handling: 0
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubimkvol /dev/m25p0.kernel.ubi kernel 0
NOTICE: ubi0: registering kernel as /dev/m25p0.kernel.ubi.kernel
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ mount -t ubifs /dev/m25p0.kernel.ubi.kernel /mnt/kernel/
ERROR: UBIFS error (ubi0:0): 9de5a2d5: can't format empty UBI volume: read-only mount
ERROR: ubifs ubifs0: probe failed: Read-only file system
mount: Invalid argument
If I use ubiformat, I get this:
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiformat /dev/m25p0.kernel -y
ubiformat: m25p0.kernel (nor), size 8388608 bytes (8 MiB), 32 eraseblocks of 262144 bytes (256 KiB), min. I/O size 1 bytes
libscan: scanning eraseblock 31 -- 100 % complete
ubiformat: warning!: 32 of 32 eraseblocks contain non-ubifs data
ubiformat: warning!: only 0 of 32 eraseblocks have valid erase counter
ubiformat: erase counter 0 will be used for all eraseblocks
ubiformat: note, arbitrary erase counter value may be specified using -e option
ubiformat: use erase counter 0 for all eraseblocks
ubiformat: formatting eraseblock 31 -- 100 % complete
barebox@Freescale i.MX6 UltraLite Caisteal Board:/ ubiattach /dev/m25p0.kernel
NOTICE: ubi0: scanning is finished
ERROR: ubi0 error: ubi_read_volume_table: the layout volume was not found
ERROR: ubi0 error: ubi_attach_mtd_dev: failed to attach mtd0, error -22
failed to attach: Invalid argument
devinfo
Parent: m25p0.kernel
Parameters:
available_pebs: 0 (type: uint32)
bad_peb_count: 0 (type: uint32)
good_peb_count: 32 (type: uint32)
leb_size: 262016 (type: uint32)
max_erase_counter: 2 (type: uint32)
mean_erase_counter: 0 (type: uint32)
min_io_size: 1 (type: uint32)
peb_size: 262144 (type: uint32)
reserved_pebs: 32 (type: uint32) <=== why are all PEBs reserved?
sub_page_size: 1 (type: uint32)
vid_header_offset: 64 (type: uint32)
Any suggestions on what I am doing wrong? I know it's something ridiculously simple, just unknown to me.
You aren't supposed to use ubiformat on the barebox partition. It's too small. That's why it fails.
UBI is a Linux layer for putting UBI filesystems into NAND or NOR flash. The iMX6UL CPU boot ROM does not understand UBI. It can't boot something in a UBI formatted partition. It's for the root filesystem in the root partition.
Read section 8 of the iMX6UL reference manual, especially §8.6 about QuadSPI booting. This will tell you what you must put into flash to make it bootable.
Also look at the barebox_update command, which can be used to flash the bootloader from Barebox. The board needs to support it, and I don't know about your board. The code is in various imx6_bbu_* functions. I'm not sure if qspi is supported, as I only see eMMC/SD, eMMC boot, NAND, and I2C/SPI. The qspi interface isn't the same as a serial EEPROM on one of the eCSPI controllers (again, see RM §8!). But perhaps it would work with an appropriate header already on the image.

any performance penalty to be expected with thread_local?

Using C++11 and/or C11 thread_local, should we expect any performance penalty over non-thread_local storage on x86 (32- or 64-bit) Linux, Red Hat 5 or newer, with a recent g++/gcc (say, version 4 or newer) or clang?
On Ubuntu 18.04 x86_64 with gcc-8.3 (options -pthread -m{arch,tune}=native -std=gnu++17 -g -O3 -ffast-math -falign-{functions,loops}=64 -DNDEBUG) the difference is almost imperceptible:
#include <benchmark/benchmark.h>
struct A { static unsigned n; };
unsigned A::n = 0;
struct B { static thread_local unsigned n; };
thread_local unsigned B::n = 0;
template<class T>
void bm(benchmark::State& state) {
    for(auto _ : state)
        benchmark::DoNotOptimize(++T::n);
}
BENCHMARK_TEMPLATE(bm, A);
BENCHMARK_TEMPLATE(bm, B);
BENCHMARK_MAIN();
Results:
Run on (16 X 5000 MHz CPU s)
CPU Caches:
L1 Data 32 KiB (x8)
L1 Instruction 32 KiB (x8)
L2 Unified 256 KiB (x8)
L3 Unified 16384 KiB (x1)
Load Average: 0.59, 0.49, 0.38
-----------------------------------------------------
Benchmark Time CPU Iterations
-----------------------------------------------------
bm<A> 1.09 ns 1.09 ns 642390002
bm<B> 1.09 ns 1.09 ns 633963210
On x86_64 thread_local variables are accessed relative to fs register. Instructions with such addressing mode are often 2 bytes longer, so theoretically, they can take more time.
On other platforms it depends on how access to thread_local variables is implemented. See ELF Handling For Thread-Local Storage for more details.
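As a rough illustration (not from the original answer), you can compare the code the compiler emits for the two kinds of access; the exact instructions depend on the compiler, optimization level, -fPIC, and the TLS model chosen:
// tls_codegen.cpp -- compile with e.g. "g++ -O2 -S tls_codegen.cpp" and inspect the .s file.
unsigned plain_counter = 0;             // ordinary global
thread_local unsigned tls_counter = 0;  // one instance per thread

// Typically addressed relative to the instruction pointer (or via the GOT with -fPIC).
unsigned bump_plain() { return ++plain_counter; }

// Typically addressed relative to the fs segment register on x86-64
// (local-exec/initial-exec TLS models); other TLS models go through __tls_get_addr.
unsigned bump_tls() { return ++tls_counter; }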

Twice as many page faults when reading from a large malloced array instead of just storing?

I am doing a simple test on monitoring page faults with the code below. What I don't understand is how one line of code doubled my page fault count.
If I use
ptr[i+4096] = 'A'
I get 25,722 page-faults with the perf tool, which is what I expected,
but if I use
tmp = ptr[i+4096]
instead, the page-faults double to 51,322.
I don't know how to explain it. Below is the complete code. Thanks!
#include <stdlib.h>

void do_something() {
    int i;
    char* ptr;
    char tmp;
    ptr = malloc(100*1024*1024);
    int j = 0;
    int k = 0;
    for (i = 0; i < 100*1024*1024; i+=4096) {
        //ptr[i+4096] = 'A' ;
        tmp = ptr[i+4096];
        for (j = 0 ; j < 4096; j++)
            ptr[i+j] = (char) (i & 0xff); // pagefault
    }
    free(ptr);
}

int main(int argc, char* argv[]) {
    do_something();
    return 0;
}
Machine Info:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 40
On-line CPU(s) list: 0-39
Thread(s) per core: 2
Core(s) per socket: 10
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
Stepping: 2
CPU MHz: 3096.188
BogoMIPS: 6197.81
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 25600K
NUMA node0 CPU(s): 0-9,20-29
NUMA node1 CPU(s): 10-19,30-39
3.10.0-514.32.3.el7.x86_64 #1
malloc() will often satisfy requests for memory by asking the OS for new pages, e.g., via mmap. Such pages are generally allocated lazily: no actual page is allocated until the first access.
What happens then depends on the type of the first access: when you do a read first, Linux maps in a shared read-only COW page of zeros to satisfy it, and if you later write, it takes a second fault to allocate the private writeable page.
When you just do the write first, that first step is skipped. That's the usual case, since code generally isn't reading from newly allocated memory, which has undefined contents (at least when you get it from malloc).
Note that the above describes how newly allocated pages work in Linux. When you use malloc there is another layer: malloc will generally try to satisfy requests with blocks the process freed earlier, rather than continually requesting new memory. When memory is re-used, it will generally already be paged in and the above won't apply. Of course, for your initial big allocation of 100 MiB there is no memory to re-use, so you can be sure the allocator is getting it from the OS.
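If you want to see the two-faults-versus-one behaviour without perf, here is a rough sketch of my own (not from the question) that counts minor faults with getrusage() on freshly mmap'ed anonymous memory:
// fault_demo.cpp -- compare fault counts for write-first vs read-then-write.
#include <sys/mman.h>
#include <sys/resource.h>
#include <cstdio>

static long minor_faults() {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main() {
    const size_t len = 100UL * 1024 * 1024;   // same size as in the question
    const size_t page = 4096;

    // Case 1: write first -- roughly one fault per page.
    char* a = (char*)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    long before = minor_faults();
    for (size_t i = 0; i < len; i += page) a[i] = 'A';
    printf("write-first faults:     %ld\n", minor_faults() - before);
    munmap(a, len);

    // Case 2: read first (zero page mapped in), then write (COW break) --
    // roughly two faults per page.
    char* b = (char*)mmap(nullptr, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    volatile char sink;
    before = minor_faults();
    for (size_t i = 0; i < len; i += page) sink = b[i];
    for (size_t i = 0; i < len; i += page) b[i] = 'A';
    printf("read-then-write faults: %ld\n", minor_faults() - before);
    munmap(b, len);
    (void)sink;
    return 0;
}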

How to get right MIPS libc toolchain for embedded device

I've run into a problem (repeatedly) with various companies' embedded Linux products where the GPL source code they provide does not match what is actually running on the system. It's "close", but not quite right, especially with respect to the standard C library they use.
Isn't that a violation of the GPL?
Often this mismatch results in a programmer (like me) cross compiling only to have the device reply cryptically "file not found" or something similar when the program is run.
I'm not alone with this kind of problem -- many people have threads directly and indirectly related to it, e.g.:
Compile parameters for MIPS based codesourcery toolchain?
And I've run into the problem on Sony devices, D-link, and many others. It's very common.
Making a new library is not a good solution, since most systems are ROMFS only, and LD_LIBRARY_PATH is sometimes broken -- so that installing a new library on the device wastes very limited memory and often won't work.
If I knew what the right source code version of the library was, I could go around the manufacturer's carelessness and compile it from the original developer's tree; but how can I find out which version I need when all I have is the binary of the library itself?
For example: I ran readelf -a libc.so.0 on a DSL modem's libc (see below), but I don't see anything here that could tell me exactly which libc it is...
How can I find the name of the source code, or an identifier from the library's binary so I can create a cross compiler using that library? eg: Can anyone tell me what source code this library came from, and how they know?
ELF Header:
Magic: 7f 45 4c 46 01 02 01 00 00 00 00 00 00 00 00 00
Class: ELF32
Data: 2's complement, big endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: DYN (Shared object file)
Machine: MIPS R3000
Version: 0x1
Entry point address: 0x5a60
Start of program headers: 52 (bytes into file)
Start of section headers: 0 (bytes into file)
Flags: 0x1007, noreorder, pic, cpic, o32, mips1
Size of this header: 52 (bytes)
Size of program headers: 32 (bytes)
Number of program headers: 4
Size of section headers: 0 (bytes)
Number of section headers: 0
Section header string table index: 0
There are no sections in this file.
There are no sections to group in this file.
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
REGINFO 0x0000b4 0x000000b4 0x000000b4 0x00018 0x00018 R 0x4
LOAD 0x000000 0x00000000 0x00000000 0x2c9ee 0x2c9ee R E 0x1000
LOAD 0x02c9f0 0x0006c9f0 0x0006c9f0 0x009a0 0x040b8 RW 0x1000
DYNAMIC 0x0000cc 0x000000cc 0x000000cc 0x0579a 0x0579a RWE 0x4
Dynamic section at offset 0xcc contains 19 entries:
Tag Type Name/Value
0x0000000e (SONAME) Library soname: [libc.so.0]
0x00000004 (HASH) 0x18c
0x00000005 (STRTAB) 0x3e9c
0x00000006 (SYMTAB) 0x144c
0x0000000a (STRSZ) 6602 (bytes)
0x0000000b (SYMENT) 16 (bytes)
0x00000015 (DEBUG) 0x0
0x00000003 (PLTGOT) 0x6ce20
0x00000011 (REL) 0x5868
0x00000012 (RELSZ) 504 (bytes)
0x00000013 (RELENT) 8 (bytes)
0x70000001 (MIPS_RLD_VERSION) 1
0x70000005 (MIPS_FLAGS) NOTPOT
0x70000006 (MIPS_BASE_ADDRESS) 0x0
0x7000000a (MIPS_LOCAL_GOTNO) 11
0x70000011 (MIPS_SYMTABNO) 677
0x70000012 (MIPS_UNREFEXTNO) 17
0x70000013 (MIPS_GOTSYM) 0x154
0x00000000 (NULL) 0x0
There are no relocations in this file.
The decoding of unwind sections for machine type MIPS R3000 is not currently supported.
Histogram for bucket list length (total of 521 buckets):
Length Number % of total Coverage
0 144 ( 27.6%)
1 181 ( 34.7%) 27.1%
2 130 ( 25.0%) 66.0%
3 47 ( 9.0%) 87.1%
4 12 ( 2.3%) 94.3%
5 5 ( 1.0%) 98.1%
6 1 ( 0.2%) 99.0%
7 1 ( 0.2%) 100.0%
No version information found in this file.
Primary GOT:
Canonical gp value: 00074e10
Reserved entries:
Address Access Initial Purpose
0006ce20 -32752(gp) 00000000 Lazy resolver
0006ce24 -32748(gp) 80000000 Module pointer (GNU extension)
Local entries:
Address Access Initial
0006ce28 -32744(gp) 00070000
0006ce2c -32740(gp) 00030000
0006ce30 -32736(gp) 00000000
0006ce34 -32732(gp) 00010000
0006ce38 -32728(gp) 0006d810
0006ce3c -32724(gp) 0006d814
0006ce40 -32720(gp) 00020000
0006ce44 -32716(gp) 00000000
0006ce48 -32712(gp) 00000000
Global entries:
Address Access Initial Sym.Val. Type Ndx Name
0006ce4c -32708(gp) 000186c0 000186c0 FUNC bad section index[ 6] __fputc_unlocked
0006ce50 -32704(gp) 000211a4 000211a4 FUNC bad section index[ 6] sigprocmask
0006ce54 -32700(gp) 0001e2b4 0001e2b4 FUNC bad section index[ 6] free
0006ce58 -32696(gp) 00026940 00026940 FUNC bad section index[ 6] raise
...
truncated listing
....
Note:
The rest of this post is a blog showing how I came to ask the question above and to put useful information about the subject in one place.
Don't bother reading it unless you want to know I actually did research the question... in gory detail... and how NOT to answer my question.
The proper (theoretical) way to get a libc program running on (for example) a D-link modem would simply be to get the TRUE source code for the product from the manufacturer, and compile against those libraries.... (It's GPL !? right, so the law is on our side, right?)
For example: I just bought a D-Link DSL-520B modem and a 526B modem -- but found out after the fact that the manufacturer "forgot" to supply linux source code for the 520B but does have it for the 526B. I checked all of the DSL-5xxB devices online for source code & toolchains, finding to my delight that ALL of them (including 526B) -- contain the SAME pre-compiled libc.so.0 with MD5sum of 6ed709113ce615e9f170aafa0eac04a6 . So in theory, all supported modems in the DSL-5xxB family seemed to use the same libc library... and I hoped I might be able to use that library.
But after I figured out how to get the DSL modem itself to send me a copy of the installed /lib/libc.so.0 library -- I found to my disgust that they ALL use a library with an MD5 sum of b8d492decc8207e724a0822641205078. In NEITHER of the modems I bought (supported or not) did I find the same library as the one contained in the source code toolchain.
To verify the toolchain from D-link was defective, I didn't compile a program (the toolchain wouldn't run on my PC anyway as it was the wrong binary format) -- but I found the toolchain had some pre-compiled mips binaries in it already; so I simply downloaded one to the modem and chmod +x -- and (surprise) I got the message "file not found." when I tried to run it ... It won't run.
So, I knew the toolchains were no good immediately, but not exactly why.
I decided to get a newer version of MIPS GCC (a binary release) that should have fewer bugs, more features, and support on most PC platforms. This is the way to GO!
See: Kernel.org pre-compiled binaries
I upgraded to gcc 4.9.0 after selecting the older "mips" version from the above site to get the right FTP page, and set my shell's PATH variable to the /bin directory of the cross compiler once installed.
Then I copied all the header files and libraries from the D-link source code package into the new cross compiler just to verify that it could compile D-link libc binaries. And it did on the first try, compiling "hello world!" with no warnings or errors into a mips 32 big endian binary.
( START EDIT: ) @ChrisStratton points out in the comments (after this post) that my test of the toolchain is inadequate, and that using a newer GCC with an older library -- even though it links properly -- is flawed as a test. I wish there were a way to give him points for his comments -- I've become convinced that he's right; although that makes what D-Link did an even worse problem -- for there's no way to know from the binaries on the modem which GCC they actually used. The GCC used for the kernel isn't necessarily the same as the one used in user space.
In order to test the new compiler's compatibility with the modems and also make tools so I could get a copy of the actual libraries found on the modem: ( END EDIT ) I wrote a program that doesn't use the C library at all (but in two parts): It ran just fine... and the code is attached to show how it can be done.
The first listing is an assembly language program to bypass linking the standard C libraries on MIPS; and the second listing is a program meant to create an octal number dump of a binary file/stream using only the linux kernel. eg: It enables copying/pasting or scripting of binary data over telnet, netcat, etc... via ash/bash or busybox :) like a poor man's uucp.
// substart.S MIPS assembly language bypass of libc startup code
// it just calls main, and then jumps to the exit function
.text
.globl __start
__start: .ent __start
.frame $29, 32, $31
.set noreorder
.cpload $25
.set reorder
.cprestore 16
jal main
j exit
.end __start
// end substart.S
...and...
// octdump.c
// To compile w/o libc :
// mips-linux-gcc stubstart.S octdump.c -nostdlib -o octdump
// To compile with working libc (eg: x86 system) :
// gcc octdump.c -o octdump_x86
#include <syscall.h>
#include <errno.h>
#include <sys/types.h>
int* __errno_location(void) { return &errno; }
#ifdef _syscall1
// define three unix functions (exit,read,write) in terms of unix syscall macros.
_syscall1( void, exit, int, status );
_syscall3( ssize_t, read, int, fd, void*, buf, size_t, count );
_syscall3( ssize_t, write, int, fd, const void*, buf, size_t, count );
#endif
#include <unistd.h>
void oct( unsigned char c ) {
    unsigned int n = c;
    int m=6;
    static unsigned char oval[6]={'\\','\\','0','0','0','0'};
    if (n < 64) { m-=1; n <<= 3; }
    if (n < 64) { m-=1; n <<= 3; }
    if (n < 64) { m-=1; n <<= 3; }
    oval[5]='0'+(n&7);
    oval[4]='0'+((n>>3)&7);
    oval[3]='0'+((n>>6)&7);
    write( STDOUT_FILENO, oval, m );
}

int main(void) {
    char buffer[255];
    int count=1;
    int i;
    while (count>0) {
        count=read( STDIN_FILENO, buffer, 17 );
        if (count>0) write( STDOUT_FILENO, "echo -ne $'",11 );
        for (i=0; i<count; ++i) oct( buffer[i] );
        if (count>0) write( STDOUT_FILENO, "'\n", 2 );
    }
    write( STDOUT_FILENO,"#\n",2);
    return 0;
}
Once mips' octdump was saved (chmod +x) as /var/octdump on the modem, it ran without errors.
(use your imagination about how I got it on there... Dlink's TFTP, & friends are broken.)
I was able to use octdump to copy all the dynamic libraries off the DSL modem and examine them, using an automated script to avoid copy/pasting by hand.
#!/bin/env python
# octget.py
# A program to upload a file off an embedded linux device via telnet
import socket
import time
import sys
import string
if len( sys.argv ) != 4 :
    raise ValueError, "Usage: octget.py IP_OF_MODEM passwd path_to_file_to_get"
o = socket.socket( socket.AF_INET, socket.SOCK_STREAM )
o.connect((sys.argv[1],23)) # The IP address of the DSL modem.
time.sleep(1)
sys.stderr.write( o.recv(1024) )
o.send("admin\r\n");
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send(sys.argv[2]+"\r\n")
time.sleep(0.1)
o.send("sh\r\n")
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send("cd /var\r\n")
time.sleep(0.1)
sys.stderr.write( o.recv(1024) )
o.send("./octdump.x < "+sys.argv[3]+"\r\n" );
sys.stderr.write( o.recv(21) )
get="y"
while get and not ('#' in get):
    get = o.recv(4096)
    get = get.translate( None, '\r' )
    sys.stdout.write( get )
    time.sleep(0.5)
o.close()
The DSL520B modem had the following libraries...
libcrypt.so.0 libpsi.so libutil.so.0 ld-uClibc.so.0 libc.so.0 libdl.so.0 libpsixml.so
... and I thought I might cross compile using these libraries since (at least in theory) -- GCC could link against them; and my problem might be solved.
I made very sure to erase all the incompatible .so libraries from gcc-4.9.0/mips-linux/mips-linux/lib, but kept the generic crt..o files; then I copied the modem's libraries into the cross compiler directory.
But even though the kernel version of the source code and the kernel version on the modem matched -- GCC found undefined symbols in the crt files.... So either the generic crt files or the modem libraries themselves are somehow defective... and I don't know why. Without knowing how to get the exact version of the (uClibc?) library, I'm not sure how I can get the CORRECT source code to recompile the libraries and the crt's from scratch.

RES != CODE + DATA in the output of the top command, why?

What 'man top' says is: RES = CODE + DATA
q: RES -- Resident size (kb)
The non-swapped physical memory a task has used.
RES = CODE + DATA.
r: CODE -- Code size (kb)
The amount of physical memory devoted to executable code, also known as the 'text resident set' size or TRS.
s: DATA -- Data+Stack size (kb)
The amount of physical memory devoted to other than executable code, also known as the 'data resident set' size or DRS.
But when I run 'top -p 4258', I get the following:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ CODE DATA COMMAND
258 root 16 0 3160 1796 1328 S 0.0 0.3 0:00.10 476 416 bash
1796 != 476+416
why?
ps:
linux distribution:
linux-iguu:~ # lsb_release -a
LSB Version: core-2.0-noarch:core-3.0-noarch:core-2.0-ia32:core-3.0-ia32:desktop-3.1-ia32:desktop-3.1-noarch:graphics-2.0-ia32:graphics-2.0-noarch:graphics-3.1-ia32:graphics-3.1-noarch
Distributor ID: SUSE LINUX
Description: SUSE Linux Enterprise Server 9 (i586)
Release: 9
Codename: n/a
kernel version:
linux-iguu:~ # uname -a
Linux linux-iguu 2.6.16.60-0.21-default #1 Tue May 6 12:41:02 UTC 2008 i686 i686 i386 GNU/Linux
I'll explain this with the help of an example of what happens when a program allocates and uses memory. Specifically, this program:
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>
int main(){
    int *data, size, count, i;
    printf( "fyi: your ints are %d bytes large\n", sizeof(int) );
    printf( "Enter number of ints to malloc: " );
    scanf( "%d", &size );
    data = malloc( sizeof(int) * size );
    if( !data ){
        perror( "failed to malloc" );
        exit( EXIT_FAILURE );
    }
    printf( "Enter number of ints to initialize: " );
    scanf( "%d", &count );
    for( i = 0; i < count; i++ ){
        data[i] = 1337;
    }
    printf( "I'm going to hang out here until you hit <enter>" );
    while( getchar() != '\n' );
    while( getchar() != '\n' );
    exit( EXIT_SUCCESS );
}
This is a simple program that asks you how many integers to allocate, allocates them, asks how many of those integers to initialize, and then initializes them. For a run where I allocate 1250000 integers and initialize 500000 of them:
$ ./a.out
fyi: your ints are 4 bytes large
Enter number of ints to malloc: 1250000
Enter number of ints to initialize: 500000
Top reports the following information:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ SWAP CODE DATA COMMAND
<program start>
11129 xxxxxxx 16 0 3628 408 336 S 0 0.0 0:00.00 3220 4 124 a.out
<allocate 1250000 ints>
11129 xxxxxxx 16 0 8512 476 392 S 0 0.0 0:00.00 8036 4 5008 a.out
<initialize 500000 ints>
11129 xxxxxxx 15 0 8512 2432 396 S 0 0.0 0:00.00 6080 4 5008 a.out
The relevant information is:
DATA CODE RES VIRT
before allocation: 124 4 408 3628
after 5MB allocation: 5008 4 476 8512
after 2MB initialization: 5008 4 2432 8512
After I malloc'd 5MB of data, both VIRT and DATA increased by ~5MB, but RES did not. RES did increase after I touched 2MB of the integers I allocated, but DATA and VIRT stayed the same.
VIRT is the total amount of virtual memory used by the process, including what is shared and what is over-committed. DATA is the amount of virtual memory used that isn't shared and that isn't code-text. I.e., it is the virtual stack and heap of the process. RES is not virtual: it is a measurement of how much memory the process is actually using at that specific time.
So in your case, the large inequality CODE+DATA < RES is likely the shared libraries included by the process. In my example (and yours), SHR+CODE+DATA is a closer approximation to RES.
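If you want to see where these numbers come from, /proc/<pid>/statm exposes the raw counters (in pages, typically 4 KiB) from which top's VIRT/RES/SHR/CODE/DATA are derived -- size, resident, shared, text and data respectively. A small sketch of my own (not part of the original answer):
// statm_peek.cpp -- print the /proc/self/statm fields before and after
// allocating and touching memory (values are in pages, typically 4 KiB).
#include <cstdio>
#include <cstdlib>
#include <cstring>

static void show(const char* label) {
    FILE* f = fopen("/proc/self/statm", "r");
    if (!f) { perror("statm"); return; }
    long size, resident, shared, text, lib, data, dt;
    if (fscanf(f, "%ld %ld %ld %ld %ld %ld %ld",
               &size, &resident, &shared, &text, &lib, &data, &dt) == 7)
        printf("%-14s size=%ld resident=%ld shared=%ld text=%ld data=%ld\n",
               label, size, resident, shared, text, data);
    fclose(f);
}

int main() {
    show("at start");
    size_t n = 50UL * 1024 * 1024;
    char* p = (char*)malloc(n);   // 'size' and 'data' grow, 'resident' does not (pages untouched)
    show("after malloc");
    memset(p, 1, n);              // touching the pages makes 'resident' grow
    show("after memset");
    free(p);
    return 0;
}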
Hope this helps.
There's a lot of hand-waving and voodoo associated with top and ps. There are many articles (rants?) online about the discrepancies. E.g., this and this.
This explanation is terrific for resolving some of my queries. Thanks!
Meanwhile, let me try to add what I picked up while studying Linux memory management. If I misunderstand anything, please correct me!
1) Modern OS process concepts are based on virtual memory, and the virtual memory system includes RAM + SWAP; so I think most memory concepts related to processes refer to virtual memory, except where noted below.
2) Any virtual address (page) allocated to a process is in one of the states below:
a) allocated, but with no mapping to any physical memory (something like COW)
b) allocated, already mapped to physical memory
c) allocated, already mapped to swapped-out memory.
3) The fields output by the top command:
a) VIRT -- refers to all virtual memory that the process has the right to access, no matter whether it is already mapped to physical memory or swapped memory, or has no mapping at all.
b) RES -- refers to the virtual addresses already mapped to physical addresses and still in RAM.
c) SWAP -- refers to the virtual addresses already mapped to physical addresses and swapped out into SWAP space.
d) SHR -- refers to the shared memory available to a process (VM?).
e) CODE + DATA -- CODE can be in state 2.b/2.c, and DATA can be in any of the three states 2.a/2.b/2.c; 3.b/3.c together also have a field named "USED".
4) So the calculation may look like:
a) VIRT (VM) = RES (VM in memory) + SWAP (VM in swap) + VM unmapped (DATA, SHR?).
b) USED = RES + SWAP
c) SWAP = CODE (VM in swap) + DATA (VM in swap) + SHR (VM in swap?)
d) RES = CODE (VM in memory) + DATA (VM in memory) + SHR (VM in memory?)
At least the DATA segment still has a "DATA (VM unmapped)" part; this can be observed from the malloc example above. That's a little different from the manpage of the top command, which says "DATA: The amount of physical memory devoted to other than executable code, also known as the Data Resident Set size or DRS". Thanks again.
So the amount of (CODE + DATA + SHR) is usually larger than RES, because at least DATA (VM unmapped) is actually counted in "DATA", unlike what the manpage claims.
Regards,
