Golang: What is etext? - linux

I've started to profile some of my Go 1.2 code, and the top item is always something named 'etext'. I've searched around but couldn't find much information about it, other than that it might relate to call depth in goroutines. However, I'm not using any goroutines, and 'etext' is still taking up 75% or more of the total execution time.
(pprof) top20
Total: 171 samples
128 74.9% 74.9% 128 74.9% etext
Can anybody explain what this is and if there is any way to reduce the impact?

I hit the same problem and then found this: "pprof broken in go 1.2?". To verify that it really is a 1.2 bug, I wrote the following "hello world" program:
package main

import (
    "fmt"
    "testing"
)

func BenchmarkPrintln(t *testing.B) {
    TestPrintln(nil)
}

func TestPrintln(t *testing.T) {
    for i := 0; i < 10000; i++ {
        fmt.Println("hello " + " world!")
    }
}
As you can see, it only calls fmt.Println.
You can compile this with "go test -c .".
Run it with "./test.test -test.bench . -test.cpuprofile=test.prof".
See the result with "go tool pprof test.test test.prof".
(pprof) top10
Total: 36 samples
18 50.0% 50.0% 18 50.0% syscall.Syscall
8 22.2% 72.2% 8 22.2% etext
4 11.1% 83.3% 4 11.1% runtime.usleep
3 8.3% 91.7% 3 8.3% runtime.futex
1 2.8% 94.4% 1 2.8% MHeap_AllocLocked
1 2.8% 97.2% 1 2.8% fmt.(*fmt).padString
1 2.8% 100.0% 1 2.8% os.epipecheck
0 0.0% 100.0% 1 2.8% MCentral_Grow
0 0.0% 100.0% 33 91.7% System
0 0.0% 100.0% 3 8.3% _/home/xxiao/work/test.BenchmarkPrintln
The above result was obtained using Go 1.2.1.
Then I did the same thing using Go 1.1.1 and got the following result:
(pprof) top10
Total: 10 samples
2 20.0% 20.0% 2 20.0% scanblock
1 10.0% 30.0% 1 10.0% fmt.(*pp).free
1 10.0% 40.0% 1 10.0% fmt.(*pp).printField
1 10.0% 50.0% 2 20.0% fmt.newPrinter
1 10.0% 60.0% 2 20.0% os.(*File).Write
1 10.0% 70.0% 1 10.0% runtime.MCache_Alloc
1 10.0% 80.0% 1 10.0% runtime.exitsyscall
1 10.0% 90.0% 1 10.0% sweepspan
1 10.0% 100.0% 1 10.0% sync.(*Mutex).Lock
0 0.0% 100.0% 6 60.0% _/home/xxiao/work/test.BenchmarkPrintln
You can see that the 1.2.1 result does not make much sense: Syscall and etext take most of the time. The 1.1.1 result looks right.
So I'm convinced that it really is a 1.2.1 bug. I switched to Go 1.1.1 in my real project, and I'm satisfied with the profiling results now.

I think Mathias Urlichs is right regarding missing debugging symbols in your cgo code. It's worth noting that some standard packages like net and syscall make use of cgo.
If you scroll down to the bottom of this doc to the section called Caveats, you can see that the third bullet says...
If the program linked in a library that was not compiled with enough symbolic information, all samples associated with the library may be charged to the last symbol found in the program before the library. This will artificially inflate the count for that symbol.
I'm not 100% positive this is what's happening, but I'm betting that this is why etext appears to be so busy (in other words, etext, the linker-defined symbol marking the end of the program's text segment, ends up standing in for a collection of various functions that don't have enough information for pprof to analyze properly).

Related

Swift Combine - Does the zip operator retain all values that it hasn't had a chance to publish?

Here's the code I'm wondering about:
import Combine
import Foundation

final class Foo {
    var subscriptions = Set<AnyCancellable>()

    init() {
        Timer
            .publish(every: 2, on: .main, in: .default)
            .autoconnect()
            .zip(Timer.publish(every: 3, on: .main, in: .default).autoconnect())
            .sink {
                print($0)
            }
            .store(in: &subscriptions)
    }
}
This is the output it produces:
(2020-12-08 15:45:41 +0000, 2020-12-08 15:45:42 +0000)
(2020-12-08 15:45:43 +0000, 2020-12-08 15:45:45 +0000)
(2020-12-08 15:45:45 +0000, 2020-12-08 15:45:48 +0000)
(2020-12-08 15:45:47 +0000, 2020-12-08 15:45:51 +0000)
Would this code eventually crash from memory shortage? It seems like the zip operator is storing every value that it receives but can't yet publish.
zip does not limit its upstream buffer size. You can prove it like this:
import Combine

let ticket = (0 ... .max).publisher
    .zip(Empty<Int, Never>(completeImmediately: false))
    .sink { print($0) }
The (0 ... .max) publisher will try to publish 2^63 values synchronously (that is, before returning control to the Zip subscriber). Run this and watch the memory gauge in Xcode's Debug navigator. It will climb steadily. You probably want to kill it after a few seconds, because it will eventually use up an awful lot of memory and make your Mac unpleasant to use before finally crashing.
If you run it in Instruments for a few seconds, you'll see that all of the allocations happen in this call stack, indicating that Zip internally uses a plain old Array to buffer the incoming values.
66.07 MB 99.8% 174 main
64.00 MB 96.7% 45 Publisher<>.sink(receiveValue:)
64.00 MB 96.7% 42 Publisher.subscribe<A>(_:)
64.00 MB 96.7% 41 Publishers.Zip.receive<A>(subscriber:)
64.00 MB 96.7% 12 Publisher.subscribe<A>(_:)
64.00 MB 96.7% 2 Empty.receive<A>(subscriber:)
64.00 MB 96.7% 2 AbstractZip.Side.receive(subscription:)
64.00 MB 96.7% 2 AbstractZip.receive(subscription:index:)
64.00 MB 96.7% 2 AbstractZip.resolvePendingDemandAndUnlock()
64.00 MB 96.7% 2 protocol witness for Subscription.request(_:) in conformance Publishers.Sequence<A, B>.Inner<A1, B1, C1>
64.00 MB 96.7% 2 Publishers.Sequence.Inner.request(_:)
64.00 MB 96.7% 1 AbstractZip.Side.receive(_:)
64.00 MB 96.7% 1 AbstractZip.receive(_:index:)
64.00 MB 96.7% 1 specialized Array._copyToNewBuffer(oldCount:)
64.00 MB 96.7% 1 specialized _ArrayBufferProtocol._forceCreateUniqueMutableBufferImpl(countForBuffer:minNewCapacity:requiredCapacity:)
64.00 MB 96.7% 1 swift_allocObject
64.00 MB 96.7% 1 swift_slowAlloc
64.00 MB 96.7% 1 malloc
64.00 MB 96.7% 1 malloc_zone_malloc

My matrix multiplication program takes quadruple time when thread count doubles

I wrote this simple program that multiplies matrices. I can specify how
many OS threads are used to run it with the environment variable
OMP_NUM_THREADS. It slows down a lot when the thread count gets
larger than my CPU's physical threads.
Here's the program.
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

#pragma omp parallel for schedule(static)
for (unsigned i = 0; i < DIMENSION; i++)
    for (unsigned j = 0; j < DIMENSION; j++)
        for (unsigned k = 0; k < DIMENSION; k++)
            c[i][k] += a[i][j] * b[j][k];
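(A minimal compilable harness around this kernel, for anyone who wants to reproduce the timings, could look like the sketch below. The DIMENSION value, the array initialization, and the omp_get_wtime timing are illustrative assumptions added here and are not part of the original program.)

#include <stdio.h>
#include <omp.h>

#define DIMENSION 4096

/* Static arrays as in the question; about 3 x 128 MB for DIMENSION = 4096. */
static double a[DIMENSION][DIMENSION], b[DIMENSION][DIMENSION],
              c[DIMENSION][DIMENSION];

int main(void)
{
    /* Fill a and b with something non-trivial; c is zero-initialized. */
    for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++) {
            a[i][j] = (double)i + (double)j;
            b[i][j] = (double)i - (double)j;
        }

    double t0 = omp_get_wtime();

    #pragma omp parallel for schedule(static)
    for (unsigned i = 0; i < DIMENSION; i++)
        for (unsigned j = 0; j < DIMENSION; j++)
            for (unsigned k = 0; k < DIMENSION; k++)
                c[i][k] += a[i][j] * b[j][k];

    /* Print the elapsed time and one result element so the work can't be optimized away. */
    printf("%.3f seconds (c[1][1] = %g)\n", omp_get_wtime() - t0, c[1][1]);
    return 0;
}

Build with gcc -O3 -march=native -fopenmp and control the thread count with OMP_NUM_THREADS as described above.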
My CPU is i7-8750H. It has 12 threads. When the matrices are large
enough, the program is fastest on around 11 threads. It is 4 times as
slow when the thread count reaches 17. Then run time stays about the
same as I increase the thread count.
Here's the results. The top row is DIMENSION. The left column is the
thread count. Times are in seconds. The column with * is when
compiled with -fno-loop-unroll-and-jam.
1024 2048 4096 4096* 8192
1 0.2473 3.39 33.80 35.94 272.39
2 0.1253 2.22 18.35 18.88 141.23
3 0.0891 1.50 12.64 13.41 100.31
4 0.0733 1.13 10.34 10.70 82.73
5 0.0641 0.95 8.20 8.90 62.57
6 0.0581 0.81 6.97 8.05 53.73
7 0.0497 0.70 6.11 7.03 95.39
8 0.0426 0.63 5.28 6.79 81.27
9 0.0390 0.56 4.67 6.10 77.27
10 0.0368 0.52 4.49 5.13 55.49
11 0.0389 0.48 4.40 4.70 60.63
12 0.0406 0.49 6.25 5.94 68.75
13 0.0504 0.63 6.81 8.06 114.53
14 0.0521 0.63 9.17 10.89 170.46
15 0.0505 0.68 11.46 14.08 230.30
16 0.0488 0.70 13.03 20.06 241.15
17 0.0469 0.75 20.67 20.97 245.84
18 0.0462 0.79 21.82 22.86 247.29
19 0.0465 0.68 24.04 22.91 249.92
20 0.0467 0.74 23.65 23.34 247.39
21 0.0458 1.01 22.93 24.93 248.62
22 0.0453 0.80 23.11 25.71 251.22
23 0.0451 1.16 20.24 25.35 255.27
24 0.0443 1.16 25.58 26.32 253.47
25 0.0463 1.05 26.04 25.74 255.05
26 0.0470 1.31 27.76 26.87 253.86
27 0.0461 1.52 28.69 26.74 256.55
28 0.0454 1.15 28.47 26.75 256.23
29 0.0456 1.27 27.05 26.52 256.95
30 0.0452 1.46 28.86 26.45 258.95
Code inside the loop compiles to this on gcc 9.3.1 with
-O3 -march=native -fopenmp. rax starts from 0 and increases by 64
each iteration. rdx points to c[i]. rsi points to b[j]. rdi
points to b[j+1].
vmovapd (%rsi,%rax), %ymm1
vmovapd 32(%rsi,%rax), %ymm0
vfmadd213pd (%rdx,%rax), %ymm3, %ymm1
vfmadd213pd 32(%rdx,%rax), %ymm3, %ymm0
vfmadd231pd (%rdi,%rax), %ymm2, %ymm1
vfmadd231pd 32(%rdi,%rax), %ymm2, %ymm0
vmovapd %ymm1, (%rdx,%rax)
vmovapd %ymm0, 32(%rdx,%rax)
I wonder why the run time increases so much when the thread count
increases.
My estimate says this shouldn't be the case when DIMENSION is 4096.
Here is what I thought before I remembered that the compiler does 2 j loops at a time: each iteration of the j loop needs rows c[i] and b[j].
They are 64KB in total. My CPU has a 32KB L1 data cache and a 256KB L2
cache per 2 threads. The four rows the two hardware threads are working
with don't fit in L1 but fit in L2. So when j advances, c[i] is
read from L2. When the program is run on 24 OS threads, the number of
involuntary context switches is around 29371. Each thread gets
interrupted before it has a chance to finish one iteration of the j
loop. Since 8 matrix rows can fit in the L2 cache, the other software
thread's 2 rows are probably still in L2 when it resumes. So the
execution time shouldn't be much different from the 12 thread case.
However measurements say it's 4 times as slow.
Now I have realized that 2 j loops are done at a time, so each j iteration works on 96KB of memory, and 4 of those working sets can't fit in the 256KB L2 cache. To verify that this is what slows the program down, I compiled the program with -fno-loop-unroll-and-jam and got:
vmovapd ymm0, YMMWORD PTR [rcx+rax]
vfmadd213pd ymm0, ymm1, YMMWORD PTR [rdx+rax]
vmovapd YMMWORD PTR [rdx+rax], ymm0
The results are in the table, in the column marked *. They look like the results when 2 rows are done at a time, which makes me wonder even more: when DIMENSION is 4096, 4 software threads' 8 rows fit in the L2 cache when each thread works on 1 row at a time, but 12 rows don't fit in the L2 cache when each thread works on 2 rows at a time. Why are the run times similar?
I thought maybe it was because the CPU warms up when running with fewer threads and has to slow down later. I ran the tests multiple times, both in order of increasing thread count and of decreasing thread count, and they yield similar results. dmesg doesn't contain anything related to thermal throttling or clock changes.
I tried separately changing 4096 columns to 4104 columns and setting
OMP_PROC_BIND=true OMP_PLACES=cores, and the results are similar.
This problem seems to come from either the CPU caches (due to the bad memory locality) or the OS scheduler (due to more threads than the hardware can simultaneously execute).
I cannot exactly reproduce the same effect on my i5-9600KF processor (with 6 cores and 6 threads) and with a matrix of size 4096x4096. However, similar effects occur.
Here are performance results (with GCC 9.3 using -O3 -march=native -fopenmp on Linux 5.6):
#threads | time (in seconds)
----------------------------
1 | 16.726885
2 | 9.062372
3 | 6.397651
4 | 5.494580
5 | 4.054391
6 | 5.724844 <-- maximum number of hardware threads
7 | 6.113844
8 | 7.351382
9 | 8.992128
10 | 10.789389
11 | 10.993626
12 | 11.099117
24 | 11.283873
48 | 11.412288
We can see that the computation time starts to significantly grow between 5 and 12 cores.
This problem is due to a lot more data being fetched from RAM. Indeed, 161.6 GiB are loaded from memory with 6 threads, while 424.7 GiB are loaded with 12 threads! In both cases, 3.3 GiB are written to RAM. Because my memory throughput is roughly 40 GiB/s, the RAM accesses alone account for about (424.7 + 3.3) / 40 ≈ 10.7 seconds of the 11.1-second run, which is more than 96% of the overall execution time with 12 threads!
If we dig deeper, we can see that the number of L1 cache references and L1 cache misses is the same whatever the number of threads used. Meanwhile, there are a lot more L3 cache misses (as well as more references). Here are the L3 cache statistics:
With 6 threads:  4.4 G loads, 1.1 G load-misses (25% of all LL-cache hits)
With 12 threads: 6.1 G loads, 4.5 G load-misses (74% of all LL-cache hits)
This means that the locality of the memory access is clearly worse with more threads. I guess this is because the compiler is not clever enough to do high-level cache-based optimizations that could reduce RAM pressure (especially when the number of threads is high). You have to do tiling yourself in order to improve the memory locality. You can find a good guide here.
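As an illustration of what such tiling can look like (a sketch only: the tile size BS is an assumption to be tuned for the target cache sizes, and DIMENSION is assumed to be a multiple of BS), the kernel from the question could be blocked like this:

#define BS 64  /* a 64x64 tile of doubles is 32 KB, so a few tiles fit in a 256 KB L2 */

#pragma omp parallel for schedule(static)
for (unsigned ii = 0; ii < DIMENSION; ii += BS)          /* threads own disjoint row blocks of c */
    for (unsigned jj = 0; jj < DIMENSION; jj += BS)
        for (unsigned kk = 0; kk < DIMENSION; kk += BS)
            /* Work on one BS x BS block of a, b and c at a time so the
               data being reused stays in cache instead of streaming from RAM. */
            for (unsigned i = ii; i < ii + BS; i++)
                for (unsigned j = jj; j < jj + BS; j++)
                    for (unsigned k = kk; k < kk + BS; k++)
                        c[i][k] += a[i][j] * b[j][k];

The arithmetic is identical to the original triple loop; only the traversal order changes, which improves locality. Optimized BLAS implementations go much further (register blocking, packing, transposition), as noted below.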
Finally, note that using more threads than the hardware can simultaneously execute is generally not efficient. One problem is that the OS scheduler often places threads on cores poorly and moves them around frequently. The usual way to fix that is to bind software threads to hardware threads using OMP_PROC_BIND=TRUE and the OMP_PLACES environment variable. Another problem is that the threads are then executed with preemptive multitasking while sharing resources (e.g. caches).
PS: please note that BLAS libraries (e.g. OpenBLAS, BLIS, Intel MKL, etc.) are much more optimized than this code, as they already include clever optimizations such as manual vectorization for the target hardware, loop unrolling, multithreading, tiling, and fast matrix transposition when needed. For a 4096x4096 matrix, they are about 10 times faster.

CPU speed changes by a multiple of 2 for short durations

I'm using a raspberry pi and I need really fast performance from my CPU for a certain process.
To achieve that, I added isolcpus=3 to my kernel boot parameters, to isolate the core for this process only.
From looking at /proc/interrupts, it seems that this core's IRQs are also minimal (after isolation).
Now, I'm running this code on the isolated CPU (taskset -p 8 PID):
for (i = 0; i < 254; i++) {
    clock_gettime(CLOCK_REALTIME, &start);
    for (rep = 0; rep < 10000000; rep++) {
    }
    clock_gettime(CLOCK_REALTIME, &end);
    timespec_diff(&start, &end, &diff);
    printf("%d\n", diff.tv_nsec);
}
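(The snippet calls a timespec_diff helper that isn't shown in the question; a typical implementation, included here only so the code is self-contained, looks like this:)

static void timespec_diff(const struct timespec *start,
                          const struct timespec *end,
                          struct timespec *diff)
{
    /* Compute end - start, normalizing tv_nsec into [0, 1e9). */
    diff->tv_sec  = end->tv_sec - start->tv_sec;
    diff->tv_nsec = end->tv_nsec - start->tv_nsec;
    if (diff->tv_nsec < 0) {
        diff->tv_sec -= 1;
        diff->tv_nsec += 1000000000L;
    }
}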
The output I see is:
133562686, 133525447, 133536802, 133525760, 133540134, 133555290, 133540135, 133542218, 133525552, 133524979, 133577791, 133523208, 133525604, 133545916, 87085933, 66719079, 66719339, 66726787, 66719912, 66718870, 66712048, 76724670, 133535917, 133525396, 133528260, 133578416, 133522740, 133525552, 133541177, 133526021, 133553677, 133541906
This is only part of the output. The time is usually consistent at around 133525760 ns, but sometimes it gets faster for a little while, by a factor of 2.
The tasks running on core 3 are:
PID TID CLS RTPRIO NI PRI PSR %CPU STAT WCHAN COMMAND
22 22 TS - 0 19 3 0.0 S - cpuhp/3
23 23 FF 99 - 139 3 0.0 S - migration/3
24 24 TS - 0 19 3 0.0 S - ksoftirqd/3
25 25 TS - 0 19 3 0.0 S - kworker/3:0
26 26 TS - -20 39 3 0.0 S< - kworker/3:0H
1158 1158 TS - -20 39 3 0.0 S< - kworker/3:1H
1159 1159 TS - 0 19 3 0.0 S - kworker/3:1
5907 5907 TS - 0 19 3 99.1 R - a.out
According to ps, the CPU usage of my process varies between 99 and 100 percent (which I also don't understand; why is it not consistently 100%?), so the fact that the time is halved doesn't make sense.
Both speeds are good enough for me, I just need it to be consistent.
Does anyone have an idea why could this happen? Is there any way I can make my loop time consistent?

Can I place my application code on a partition in my RAM?

I want to use RAM instead of the SSD. I'm looking for experienced people to give me some advice about this. I want to mount a partition and put my Rails app into it.
Any ideas?
UPD: I tested the SSD and RAM. I have an OS X machine with 4x4 GB Kingston @ 1333 RAM, an Intel Core i3 @ 2.8 GHz, an OCZ Vertex3 @ 120 GB SSD, and a Seagate ST3000DM001 @ 3 TB HDD. My OS is installed on the SSD, and Ruby with its gems lives in my home folder on the SSD. I created a new Rails app with 10,000 product items in SQLite and a controller with this code:
@products = Product.all
Rails.cache.clear
Tested it with ab (ApacheBench).
SSD
Document Length: 719 bytes
Concurrency Level: 4
Time taken for tests: 39.274 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 130600 bytes
HTML transferred: 71900 bytes
Requests per second: 2546.21
Transfer rate: 3325.35 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 0
Processing: 398 1546 1627
Total: 398 1546 1627
RAM
Document Length: 719 bytes
Concurrency Level: 4
Time taken for tests: 39.272 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 130600 bytes
HTML transferred: 71900 bytes
Requests per second: 2546.33
Transfer rate: 3325.51 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 0
Processing: 366 1546 1645
Total: 366 1546 1645
HDD
Document Length: 719 bytes
Concurrency Level: 4
Time taken for tests: 40.510 seconds
Complete requests: 100
Failed requests: 0
Total transferred: 130600 bytes
HTML transferred: 71900 bytes
Requests per second: 2468.54
Transfer rate: 3223.92 kb/s received
Connection Times (ms)
min avg max
Connect: 0 0 0
Processing: 1193 1596 2400
Total: 1193 1596 2400
So I think the issue is that Ruby and the gems are placed on the SSD and these scripts load slowly from there. I will test on a real server, putting all the Ruby scripts into RAM, with more complicated code or a real application.
ps: sorry for my english :)
You are looking for a ramdisk.
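For example (a sketch only; the mount point and size are placeholders): on Linux you can create a tmpfs-backed ramdisk with mount -t tmpfs -o size=2g tmpfs /mnt/ramdisk and copy the application there; OS X has equivalent RAM-disk tooling. Keep in mind that a ramdisk is volatile, so the app still needs to live on persistent storage and be copied in at boot or deploy time.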

Sphinx claiming memory is too low and my ids are null

I am trying to index about 3,000 documents, but here is what I am getting:
[root@domU-12-31-39-0A-19-CB data]# /usr/local/sphinx/bin/indexer --all
Sphinx 2.0.4-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/usr/local/sphinx/etc/sphinx.conf'...
indexing index 'catalog'...
WARNING: Attribute count is 0: switching to none docinfo
WARNING: collect_hits: mem_limit=0 kb too low, increasing to 12288 kb
WARNING: source catalog: skipped 3558 document(s) with zero/NULL ids
collected 0 docs, 0.0 MB
total 0 docs, 0 bytes
total 0.040 sec, 0 bytes/sec, 0.00 docs/sec
total 1 reads, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
total 5 writes, 0.000 sec, 0.0 kb/call avg, 0.0 msec/call avg
I have it set to rt_mem_limit = 512M, so why is it telling me I don't have enough memory?
rt_mem_limit != mem_limit: they are different settings with different purposes.
mem_limit is the value used by indexer during indexing (http://sphinxsearch.com/docs/current.html#conf-mem-limit); it lives in the 'indexer' section of your config file.
You must have it set too low. Either just leave it out (to use the default of 32M), or change it to a better value.
But you also have no document IDs in your dataset. Check that your sql_query actually works.
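For example (the value is only illustrative), the relevant part of sphinx.conf would look something like:

indexer
{
    mem_limit = 256M
}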

Resources