I made simple program (code below) it make two loop. First loop makes 5900000000 iteration without any complicated calculation. The second loop counts the time of this first. After 3 to 5 iteration outside loop, the time inner loop take much much larger than earlier (result also below).
I tested it on two environment my local laptop MacBook Pro, and AWS EC2.
Do you have idea why it work that way, and what causing slow down?
const calculate = () => {
let a = 0;
for (let i = 0; i < 5900000000; ++i) {
a++
}
};
for (let i = 0; i < 10; ++i) {
console.time();
calculate();
console.timeEnd();
}
Result:
default: 4.875s
default: 5.566s
default: 29.625s
default: 29.805s
default: 29.698s
default: 29.595s
default: 29.733s
default: 29.611s
default: 29.597s
default: 29.476s
Also, I run app in with profilem mode to saw what happen: NODE_ENV=production node --prof src/index.js
Result:
Statistical profiling result from isolate-0x148040000-5612-v8.log, (81904 ticks, 2 unaccounted, 0 excluded).
[Shared libraries]:
ticks total nonlib name
9 0.0% /usr/lib/system/libsystem_pthread.dylib
5 0.0% /usr/lib/system/libsystem_c.dylib
4 0.0% /usr/lib/libc++.1.dylib
2 0.0% /usr/lib/system/libsystem_malloc.dylib
2 0.0% /usr/lib/system/libsystem_kernel.dylib
[JavaScript]:
ticks total nonlib name
81837 99.9% 99.9% LazyCompile: *calculate /Users/jjuszkiewicz/workspace/nodejs/src/index.js:1:19
[C++]:
ticks total nonlib name
19 0.0% 0.0% T node::contextify::CompiledFnEntry::WeakCallback(v8::WeakCallbackInfo<node::contextify::CompiledFnEntry> const&)
17 0.0% 0.0% T node::builtins::BuiltinLoader::CompileFunction(v8::FunctionCallbackInfo<v8::Value> const&)
5 0.0% 0.0% T _semaphore_destroy
1 0.0% 0.0% t std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long)
1 0.0% 0.0% T _mach_port_allocate
[Summary]:
ticks total nonlib name
81837 99.9% 99.9% JavaScript
43 0.1% 0.1% C++
0 0.0% 0.0% GC
22 0.0% Shared libraries
2 0.0% Unaccounted
[C++ entry points]:
ticks cpp total name
20 52.6% 0.0% T node::contextify::CompiledFnEntry::WeakCallback(v8::WeakCallbackInfo<node::contextify::CompiledFnEntry> const&)
17 44.7% 0.0% T node::builtins::BuiltinLoader::CompileFunction(v8::FunctionCallbackInfo<v8::Value> const&)
1 2.6% 0.0% t std::__1::basic_ostream<char, std::__1::char_traits<char> >& std::__1::__put_character_sequence<char, std::__1::char_traits<char> >(std::__1::basic_ostream<char, std::__1::char_traits<char> >&, char const*, unsigned long)
[Bottom up (heavy) profile]:
Note: percentage shows a share of a particular caller in the total
amount of its parent calls.
Callers occupying less than 1.0% are not shown.
ticks parent name
81837 99.9% LazyCompile: *calculate /Users/jjuszkiewicz/workspace/nodejs/src/index.js:1:19
81837 100.0% Function: ~<anonymous> /Users/jjuszkiewicz/workspace/nodejs/src/index.js:1:1
81837 100.0% LazyCompile: ~Module._compile node:internal/modules/cjs/loader:1173:37
81837 100.0% LazyCompile: ~Module._extensions..js node:internal/modules/cjs/loader:1227:37
81837 100.0% LazyCompile: ~Module.load node:internal/modules/cjs/loader:1069:33
81837 100.0% LazyCompile: ~Module._load node:internal/modules/cjs/loader:851:24
Related
Here's the code I'm wondering about:
final class Foo {
var subscriptions = Set<AnyCancellable>()
init () {
Timer
.publish(every: 2, on: .main, in: .default)
.autoconnect()
.zip(Timer.publish(every: 3, on: .main, in: .default).autoconnect())
.sink {
print($0)
}
.store(in: &subscriptions)
}
}
This is the output it produces:
(2020-12-08 15:45:41 +0000, 2020-12-08 15:45:42 +0000)
(2020-12-08 15:45:43 +0000, 2020-12-08 15:45:45 +0000)
(2020-12-08 15:45:45 +0000, 2020-12-08 15:45:48 +0000)
(2020-12-08 15:45:47 +0000, 2020-12-08 15:45:51 +0000)
Would this code eventually crash from memory shortage? It seems like the zip operator is storing every value that it receives but can't yet publish.
zip does not limit its upstream buffer size. You can prove it like this:
import Combine
let ticket = (0 ... .max).publisher
.zip(Empty<Int, Never>(completeImmediately: false))
.sink { print($0) }
The (0 ... .max) publisher will try to publish 263 values synchronously (that, is, before returning control to the Zip subscriber). Run this and watch the memory gauge in Xcode's Debug navigator. It will climb steadily. You probably want to kill it after a few seconds, because it will eventually use up an awful lot of memory and make your Mac unpleasant to use before finally crashing.
If you run it in Instruments for a few seconds, you'll see that all of the allocations happen in this call stack, indicating that Zip internally uses a plain old Array to buffer the incoming values.
66.07 MB 99.8% 174 main
64.00 MB 96.7% 45 Publisher<>.sink(receiveValue:)
64.00 MB 96.7% 42 Publisher.subscribe<A>(_:)
64.00 MB 96.7% 41 Publishers.Zip.receive<A>(subscriber:)
64.00 MB 96.7% 12 Publisher.subscribe<A>(_:)
64.00 MB 96.7% 2 Empty.receive<A>(subscriber:)
64.00 MB 96.7% 2 AbstractZip.Side.receive(subscription:)
64.00 MB 96.7% 2 AbstractZip.receive(subscription:index:)
64.00 MB 96.7% 2 AbstractZip.resolvePendingDemandAndUnlock()
64.00 MB 96.7% 2 protocol witness for Subscription.request(_:) in conformance Publishers.Sequence<A, B>.Inner<A1, B1, C1>
64.00 MB 96.7% 2 Publishers.Sequence.Inner.request(_:)
64.00 MB 96.7% 1 AbstractZip.Side.receive(_:)
64.00 MB 96.7% 1 AbstractZip.receive(_:index:)
64.00 MB 96.7% 1 specialized Array._copyToNewBuffer(oldCount:)
64.00 MB 96.7% 1 specialized _ArrayBufferProtocol._forceCreateUniqueMutableBufferImpl(countForBuffer:minNewCapacity:requiredCapacity:)
64.00 MB 96.7% 1 swift_allocObject
64.00 MB 96.7% 1 swift_slowAlloc
64.00 MB 96.7% 1 malloc
64.00 MB 96.7% 1 malloc_zone_malloc
i using kafka streaming to join streams,but the off heap memory is out of control.
i use jemalloc to find the cause.
first, the rocksdb use offheap memory in high percentage
1806344172 67.4% 67.4% 1806344172 67.4% rocksdb::BlockFetcher::ReadBlockContents
588400270 22.0% 89.4% 588400270 22.0% os::malloc#921040
132120590 4.9% 94.3% 132120590 4.9% rocksdb::Arena::AllocateNewBlock
50331648 1.9% 96.2% 50331648 1.9% init
17587683 0.7% 96.8% 17981107 0.7% rocksdb::VersionSet::ProcessManifestWrites
15688131 0.6% 97.4% 15688131 0.6% rocksdb::WritableFileWriter::WritableFileWriter
12943699 0.5% 97.9% 12943699 0.5% rocksdb::port::cacheline_aligned_alloc
11800800 0.4% 98.4% 12588000 0.5% rocksdb::LRUCacheShard::Insert
8784504 0.3% 98.7% 1811954485 67.6% rocksdb::BlockBasedTable::PartitionedIndexIteratorState::NewSecondaryIterator
7606272 0.3% 99.0% 7606272 0.3% rocksdb::LRUHandleTable::Resize
As time goes by, it was changed
Total: 4502654593 B
3379447055 75.1% 75.1% 3379447055 75.1% os::malloc#921040
620666890 13.8% 88.8% 620666890 13.8% rocksdb::BlockFetcher::ReadBlockContents
142606352 3.2% 92.0% 142606352 3.2% rocksdb::Arena::AllocateNewBlock
129603986 2.9% 94.9% 129603986 2.9% rocksdb::port::cacheline_aligned_alloc
67797317 1.5% 96.4% 67797317 1.5% rocksdb::LRUHandleTable::Resize
50331648 1.1% 97.5% 50331648 1.1% init
32501412 0.7% 98.2% 230760042 5.1% Java_org_rocksdb_Options_newOptions__
18600150 0.4% 98.6% 19255895 0.4% rocksdb::VersionSet::ProcessManifestWrites
16393216 0.4% 99.0% 16393216 0.4% rocksdb::WritableFileWriter::WritableFileWriter
5629242 0.1% 99.1% 5629242 0.1% updatewindow
os::malloc#921040 comsume most of the memory, and always growthing
so anyone can give some help ?
A simple plan:
import qualified Data.ByteString.Lazy.Char8 as BS
main = do
wc <- length . BS.words <$> BS.getContents
print wc
Build for speed:
ghc -fllvm -O2 -threaded -rtsopts Words.hs
More CPUs means more slowly?
$ time ./Words +RTS -qa -N1 < big.txt
331041862
real 0m25.963s
user 0m21.747s
sys 0m1.528s
$ time ./Words +RTS -qa -N2 < big.txt
331041862
real 0m36.410s
user 0m34.910s
sys 0m6.892s
$ time ./Words +RTS -qa -N4 < big.txt
331041862
real 0m42.150s
user 0m55.393s
sys 0m16.227s
For good measure:
$time wc -w big.txt
331041862 big.txt
real 0m8.277s
user 0m7.553s
sys 0m0.529s
Clearly, this is a single-threaded activity. Still, I wonder why it slows down so much.
Also, do you have any tips, how I can make it competitive with wc?
It's GC. Executed your program with +RTS -s and the results told everything.
-N1
D:\>a +RTS -qa -N1 -s < lorem.txt
15470835
4,558,095,152 bytes allocated in the heap
1,746,720 bytes copied during GC
77,936 bytes maximum residency (118 sample(s))
131,856 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8519 colls, 0 par 0.016s 0.021s 0.0000s 0.0001s
Gen 1 118 colls, 0 par 0.000s 0.004s 0.0000s 0.0001s
TASKS: 3 (1 bound, 2 peak workers (2 total), using -N1)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 0.842s ( 0.855s elapsed)
GC time 0.016s ( 0.025s elapsed)
EXIT time 0.016s ( 0.000s elapsed)
Total time 0.874s ( 0.881s elapsed)
Alloc rate 5,410,809,512 bytes per MUT second
Productivity 98.2% of total user, 97.4% of total elapsed
gc_alloc_block_sync: 0
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
-N4
D:\>a +RTS -qa -N4 -s < lorem.txt
15470835
4,558,093,352 bytes allocated in the heap
1,720,232 bytes copied during GC
77,936 bytes maximum residency (113 sample(s))
160,432 bytes maximum slop
4 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
Gen 1 113 colls, 112 par 0.031s 0.027s 0.0002s 0.0099s
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
TASKS: 6 (1 bound, 5 peak workers (5 total), using -N4)
SPARKS: 0 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 0 fizzled)
INIT time 0.000s ( 0.001s elapsed)
MUT time 1.950s ( 1.415s elapsed)
GC time 4.774s ( 1.705s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 6.724s ( 3.121s elapsed)
Alloc rate 2,337,468,786 bytes per MUT second
Productivity 29.0% of total user, 62.5% of total elapsed
gc_alloc_block_sync: 21082
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
The most significant parts are
Tot time (elapsed) Avg pause Max pause
Gen 0 8524 colls, 8524 par 4.742s 1.678s 0.0002s 0.0499s
and
Parallel GC work balance: 1.40% (serial 0%, perfect 100%)
When -threaded switch is on, at runtime ghc will try its best to balance any work among threads as far as possible. Your whole program is a sequential process so the only work can be moved to other threads are GC, while your program in fact cannot be GCed in parallel so these threads wait for one another to complete their job, resulting a lot of time wasted on synchronization.
If you tell the runtime not to balance among threads by +RTS -qm then sometimes -N4 is as fast as -N1.
I just started reading Parallel and Concurrent Programming in Haskell.
I wrote two programs that, I believe, sums up a list in 2 ways:
running rpar (force (sum list))
splitting up the list, running the above command on each list, and adding each
Here's the code:
import Control.Parallel.Strategies
import Control.DeepSeq
import System.Environment
main :: IO ()
main = do
[n] <- getArgs
[single, faster] !! (read n - 1)
single :: IO ()
single = print . runEval $ rpar (sum list)
faster :: IO ()
faster = print . runEval $ do
let (as, bs) = splitAt ((length list) `div` 2) list
res1 <- rpar (sum as)
res2 <- rpar (sum bs)
return (res1 + res2)
list :: [Integer]
list = [1..10000000]
Compile with parallelization enabled (-threaded)
C:\Users\k\Workspace\parallel_concurrent_haskell>ghc Sum.hs -O2 -threaded -rtsopts
[1 of 1] Compiling Main ( Sum.hs, Sum.o )
Linking Sum.exe ...
Results of single Program
C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 1 +RTS -s -N2
50000005000000
960,065,896 bytes allocated in the heap
363,696 bytes copied during GC
43,832 bytes maximum residency (2 sample(s))
57,016 bytes maximum slop
2 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1837 colls, 1837 par 0.00s 0.01s 0.0000s 0.0007s
Gen 1 2 colls, 1 par 0.00s 0.00s 0.0002s 0.0003s
Parallel GC work balance: 0.18% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 1 (0 converted, 0 overflowed, 0 dud, 0 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.27s ( 0.27s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.27s ( 0.28s elapsed)
Alloc rate 3,614,365,726 bytes per MUT second
Productivity 100.0% of total user, 95.1% of total elapsed
gc_alloc_block_sync: 573
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
Run with faster
C:\Users\k\Workspace\parallel_concurrent_haskell>Sum 2 +RTS -s -N2
50000005000000
1,600,100,336 bytes allocated in the heap
1,477,564,464 bytes copied during GC
400,027,984 bytes maximum residency (14 sample(s))
70,377,336 bytes maximum slop
911 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 3067 colls, 3067 par 1.05s 0.68s 0.0002s 0.0021s
Gen 1 14 colls, 13 par 1.98s 1.53s 0.1093s 0.5271s
Parallel GC work balance: 0.00% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.38s ( 1.74s elapsed)
GC time 3.03s ( 2.21s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 3.42s ( 3.95s elapsed)
Alloc rate 4,266,934,229 bytes per MUT second
Productivity 11.4% of total user, 9.9% of total elapsed
gc_alloc_block_sync: 335
whitehole_spin: 0
gen[0].sync: 0
gen[1].sync: 0
Why did single complete in 0.28 seconds, but faster (poorly named, evidently) took 3.95 seconds?
I am no expert in haskell-specific profiling, but I can see several possible problems in faster. You are walking the input list at least three times: once to get its length, once for splitAt (maybe it is twice, I'm not totally sure how this is implemented), and then again to read and sum its elements. In single, the list is walked only once.
You also hold the entire list in memory at once with faster, but with single haskell can process it lazily, and GC as you go. If you look at the profiling output, you can see that faster is copying many more bytes during GC: over 3,000 times more! faster also needed 400MB of memory all at once, where single needed only 40KB at a time. So the garbage collector had a larger space to keep scanning over.
Another big issue: you allocate a ton of new cons cells in faster, to hold the two intermediate sub-lists. Even if it could all be GCed right away, this is a lot of time spent allocating. It's more expensive than just doing the addition to begin with! So even before you start adding, you are already "over budget" compared to simple.
Following amalloy's answer... My machine is slower than yours, and running your single took
Total time 0.41s ( 0.35s elapsed)
I tried:
list = [ 1..10000000]
list1 = [ 1..5000000]
list2 = [ 5000001 .. 10000000 ]
fastest :: IO ()
fastest = print . runEval $ do
res1 <- rpar (sum list1)
res2 <- rpar (sum list2)
return (res1 + res2)
With that I got
c:\Users\peter\Documents\Haskell\practice>parlist 4 +RTS -s -N2
parlist 4 +RTS -s -N2
50000005000000
960,068,544 bytes allocated in the heap
1,398,472 bytes copied during GC
43,832 bytes maximum residency (3 sample(s))
203,544 bytes maximum slop
3 MB total memory in use (0 MB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 1836 colls, 1836 par 0.00s 0.01s 0.0000s 0.0009s
Gen 1 3 colls, 2 par 0.00s 0.00s 0.0002s 0.0004s
Parallel GC work balance: 0.04% (serial 0%, perfect 100%)
TASKS: 4 (1 bound, 3 peak workers (3 total), using -N2)
SPARKS: 2 (0 converted, 0 overflowed, 0 dud, 1 GC'd, 1 fizzled)
INIT time 0.00s ( 0.00s elapsed)
MUT time 0.31s ( 0.33s elapsed)
GC time 0.00s ( 0.01s elapsed)
EXIT time 0.00s ( 0.00s elapsed)
Total time 0.31s ( 0.35s elapsed)
Alloc rate 3,072,219,340 bytes per MUT second
Productivity 100.0% of total user, 90.1% of total elapsed
which is faster...
I've started to profile some of my Go1.2 code and the top item is always something named 'etext'. I've searched around but couldn't find much information about it other than it might relate to call depth in Go routines. However, I'm not using any Go routines and 'etext' is still taking up 75% or more of the total execution time.
(pprof) top20
Total: 171 samples
128 74.9% 74.9% 128 74.9% etext
Can anybody explain what this is and if there is any way to reduce the impact?
I hit the same problem then I found this: pprof broken in go 1.2?. To verify that it is really a 1.2 bug I wrote the following "hello world" program:
package main
import (
"fmt"
"testing"
)
func BenchmarkPrintln( t *testing.B ){
TestPrintln( nil )
}
func TestPrintln( t *testing.T ){
for i := 0; i < 10000; i++ {
fmt.Println("hello " + " world!")
}
}
As you can see it only calls fmt.Println.
You can compile this with “go test –c .”
Run with “./test.test -test.bench . -test.cpuprofile=test.prof”
See the result with “go tool pprof test.test test.prof”
(pprof) top10
Total: 36 samples
18 50.0% 50.0% 18 50.0% syscall.Syscall
8 22.2% 72.2% 8 22.2% etext
4 11.1% 83.3% 4 11.1% runtime.usleep
3 8.3% 91.7% 3 8.3% runtime.futex
1 2.8% 94.4% 1 2.8% MHeap_AllocLocked
1 2.8% 97.2% 1 2.8% fmt.(*fmt).padString
1 2.8% 100.0% 1 2.8% os.epipecheck
0 0.0% 100.0% 1 2.8% MCentral_Grow
0 0.0% 100.0% 33 91.7% System
0 0.0% 100.0% 3 8.3% _/home/xxiao/work/test.BenchmarkPrintln
The above result is got using go 1.2.1
Then I did the same thing using go 1.1.1 and got the following result:
(pprof) top10
Total: 10 samples
2 20.0% 20.0% 2 20.0% scanblock
1 10.0% 30.0% 1 10.0% fmt.(*pp).free
1 10.0% 40.0% 1 10.0% fmt.(*pp).printField
1 10.0% 50.0% 2 20.0% fmt.newPrinter
1 10.0% 60.0% 2 20.0% os.(*File).Write
1 10.0% 70.0% 1 10.0% runtime.MCache_Alloc
1 10.0% 80.0% 1 10.0% runtime.exitsyscall
1 10.0% 90.0% 1 10.0% sweepspan
1 10.0% 100.0% 1 10.0% sync.(*Mutex).Lock
0 0.0% 100.0% 6 60.0% _/home/xxiao/work/test.BenchmarkPrintln
You can see that the 1.2.1 result does not make much sense. Syscall and etext takes most of the time. And the 1.1.1 result looks right.
So I'm convinced that it is really a 1.2.1 bug. And I switched to use go 1.1.1 in my real project and I'm satisfied with the profiling result now.
I think Mathias Urlichs is right regarding missing debugging symbols in your cgo code. Its worth noting that some standard pkgs like net and syscall make use of cgo.
If you scroll down to the bottom of this doc to the section called Caveats, you can see that the third bullet says...
If the program linked in a library that was not compiled with enough symbolic information, all samples associated with the library may be charged to the last symbol found in the program before the library. This will artificially inflate the count for that symbol.
I'm not 100% positive this is what's happening but i'm betting that this is why etext appears to be so busy (in other words etext is a collection of various functions that doesn't have enough information for pprof to analysis properly.).