Testing write speed: ArangoDB...Sync and MSync times - arangodb

So I'm testing some extreme writes: 10k individual inserts in a simple for loop (let's keep the topic simple for a moment). I've got 'wait for sync' turned on for the collection (in this case we need 100% commit when the call returns). Two machines: if I run the loop on my main machine, the same one running the actual unit test, it takes 3 minutes to write the 10k; if I write to my remote machine (same ArangoDB settings), it takes 9 seconds. Is the reason it's taking longer on my local machine that it is also running the unit tests? Or is it due to the sync/msync issues of the drive that the ArangoDB FAQ warns about?
"From the durability point of view, immediate synchronization is of course better, but it means performing an extra system call for each operation. On systems with slow sync/msync,"
Is there a setting or whatever to check on a drive or system to determine what my values are for the sync/msync of the device?
thanks for the help!!

First of all, the actual speed strongly depends on your hard disk.
For example, on a notebook with an SSD under Mac OS X I get:
arangod> t = time(); for (i = 0; i < 1000; i++) db.unsync.save({ name: "Hallo " + i }); time() - t;
0.03408193588256836
arangod> t = time(); for (i = 0; i < 1000; i++) db.sync.save({ name: "Hallo " + i }); time() - t;
6.904788970947266
So writing 1000 documents without sync is about 200x faster.
For a desktop with a spinning hard disk under Linux I get:
arangod> t = time(); for (i = 0; i < 1000; i++) db.unsync.save({ name: "Hallo " + i }); time() - t;
0.08486199378967285
arangod> t = time(); for (i = 0; i < 1000; i++) db.sync.save({ name: "Hallo " + i }); time() - t;
54.90065908432007
Here it is even worse. More than a factor of 600.
Regarding the difference between local and remote: That sounds strange. How do you access the remote machine? Do you use arangosh?

Related

why does a nodejs array shift/push loop run 1000x slower above array length 87369?

Why is the speed of nodejs array shift/push operations not linear in the size of the array? There is a dramatic knee at 87370 that completely crushes the system.
Try this, first with 87369 elements in q, then with 87370. (Or, on a 64-bit system, try 85983 and 85984.) For me, the former runs in .05 seconds; the latter, in 80 seconds -- 1600 times slower. (observed on 32-bit debian linux with node v0.10.29)
q = [];
// preload the queue with some data
for (i = 0; i < 87369; i++) q.push({});
// fetch oldest waiting item and push new item
for (i = 0; i < 100000; i++) {
    q.shift();
    q.push({});
    if (i % 10000 === 0) process.stdout.write(".");
}
64-bit debian linux v0.10.29 crawls starting at 85984 and runs in .06 / 56 seconds. Node v0.11.13 has similar breakpoints, but at different array sizes.
Shift is a very slow operation for arrays, as you need to move all the elements, but V8 is able to use a trick to perform it fast when the array contents fit in a page (1 MB).
Empty arrays start with 4 slots, and as you keep pushing, it will resize the array using the formula 1.5 * (old length + 1) + 16.
var j = 4;
while (j < 87369) {
    j = (j + 1) + Math.floor(j / 2) + 16;
    console.log(j);
}
Prints:
23
51
93
156
251
393
606
926
1406
2126
3206
4826
7256
10901
16368
24569
36870
55322
83000
124517
So your array size ends up actually being 124517 items, which makes it too large to fit in a page.
You can actually preallocate your array just to the right size and it should be able to fast shift again:
var q = new Array(87369); // Fits in a page so fast shift is possible
// preload the queue with some data
for (i=0; i<87369; i++) q[i] = {};
If you need larger than that, use the right data structure.
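For instance, a queue with O(1) enqueue/dequeue can be sketched with a singly linked list instead of an array (illustrative code, not from any library):

```javascript
// Minimal O(1) FIFO queue backed by a singly linked list,
// avoiding Array.prototype.shift entirely.
function Queue() {
  this.head = null;
  this.tail = null;
  this.length = 0;
}

Queue.prototype.push = function (item) {
  const node = { item: item, next: null };
  if (this.tail) this.tail.next = node; // append after current tail
  else this.head = node;                // first element
  this.tail = node;
  this.length++;
};

Queue.prototype.shift = function () {
  if (!this.head) return undefined;
  const item = this.head.item;
  this.head = this.head.next;           // drop the old head in O(1)
  if (!this.head) this.tail = null;
  this.length--;
  return item;
};
```

Unlike array shift, dequeuing here never moves the remaining elements, so performance stays flat no matter how long the queue grows.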
I started digging into the v8 sources, but I still don't understand it.
I instrumented deps/v8/src/builtins.cc:MoveElements (called from Builtin_ArrayShift, which implements the shift with a memmove), and it clearly shows the slowdown: only 1000 shifts per second, because each one takes 1 ms:
AR: at 1417982255.050970: MoveElements sec = 0.000809
AR: at 1417982255.052314: MoveElements sec = 0.001341
AR: at 1417982255.053542: MoveElements sec = 0.001224
AR: at 1417982255.054360: MoveElements sec = 0.000815
AR: at 1417982255.055684: MoveElements sec = 0.001321
AR: at 1417982255.056501: MoveElements sec = 0.000814
of which the memmove is 0.000040 seconds, the bulk is the heap->RecordWrites (deps/v8/src/heap-inl.h):
void Heap::RecordWrites(Address address, int start, int len) {
    if (!InNewSpace(address)) {
        for (int i = 0; i < len; i++) {
            store_buffer_.Mark(address + start + i * kPointerSize);
        }
    }
}
which is (store-buffer-inl.h)
void StoreBuffer::Mark(Address addr) {
    ASSERT(!heap_->cell_space()->Contains(addr));
    ASSERT(!heap_->code_space()->Contains(addr));
    Address* top = reinterpret_cast<Address*>(heap_->store_buffer_top());
    *top++ = addr;
    heap_->public_set_store_buffer_top(top);
    if ((reinterpret_cast<uintptr_t>(top) & kStoreBufferOverflowBit) != 0) {
        ASSERT(top == limit_);
        Compact();
    } else {
        ASSERT(top < limit_);
    }
}
When the code is running slow, there are runs of shift/push ops followed by runs of 5-6 calls to Compact() for every MoveElements. When it's running fast, MoveElements is only called a handful of times at the end, with just a single compaction when it finishes.
I'm guessing memory compaction might be thrashing, but it's not falling in place for me yet.
This bug was reported to Google, who closed it without studying the issue:
https://code.google.com/p/v8/issues/detail?id=3059
When shifting out and calling tasks (functions) from a queue (array)
the GC(?) is stalling for an inordinate length of time.
114467 shifts is OK
114468 shifts is problematic, symptoms occur
The response:
The GC has nothing to do with this, and nothing is stalling either.
Array.shift() is an expensive operation, as it requires all array
elements to be moved. For most areas of the heap, V8 has implemented a
special trick to hide this cost: it simply bumps the pointer to the
beginning of the object by one, effectively cutting off the first
element. However, when an array is so large that it must be placed in
"large object space", this trick cannot be applied as object starts
must be aligned, so on every .shift() operation all elements must
actually be moved in memory.
I'm not sure there's a whole lot we can do about this. If you want a
"Queue" object in JavaScript with guaranteed O(1) complexity for
.enqueue() and .dequeue() operations, you may want to implement your
own.
Edit: I just caught the subtle "all elements must be moved" part -- is RecordWrites not GC but an actual element copy then? The memmove of the array contents is 0.04 milliseconds. The RecordWrites loop is 96% of the 1.1 ms runtime.
Edit: if "aligned" means the first object must be at first address, that's what memmove does. What is RecordWrites?

Perl script execution keeps getting killed - running out of memory

I am trying to execute a perl script that processes a small 12 x 2 text file (approx. 260 bytes) and a large .bedgraph file (at least 1.3 MB in size). From these two files, the script outputs a new bedgraph file.
I have run this script on 3 other .bedgraph files, but when I try to run it on the rest of them, the process keeps getting Killed.
It should take about 20 minutes on average for the perl script to run on each of the .bedgraph files.
I'm running the perl script on my local machine (not from a server). I'm using a Linux OS Ubuntu 12.04 system 64-bit 4GB RAM.
Why does my perl script execution keep getting killed, and how can I fix this?
Here's the script:
# input file handle
open(my $sizes_fh, '<', 'S_lycopersicum_chromosomes.size') or die $!;
# output file handle
open(my $output, '+>', 'tendaysafterbreaker_output.bedgraph') or die $!;

my @array;

while (<$sizes_fh>) {
    chomp;
    my ($chrom1, $size) = split(/\t/, $_);
    @array = (0) x $size;

    open(my $bedgraph_fh, '<', 'Solanum_lycopersicum_tendaysafterbreaker.bedgraph') or die $!;
    while (<$bedgraph_fh>) {
        chomp;
        my ($chrom2, $start, $end, $FPKM) = split(/\t/, $_);
        if ($chrom1 eq $chrom2) {
            for (my $i = $start; $i < $end; $i++) {
                $array[$i] += $FPKM;
            }
        }
    }
    close $bedgraph_fh or warn $!;

    my ($last_start, $last_end) = (0, 0);
    my $last_value = $array[0];

    for (my $i = 1; $i < $#array; $i++) {
        my $curr_val = $array[$i];
        my $curr_pos = $i;
        # if the current value is not equal to the last value
        if ($curr_val != $last_value) {
            $last_value = $curr_val;
            print $output "$chrom1\t$last_start\t$last_end\t$last_value\n";
            $last_start = $last_end = $curr_pos;
        } else {
            $last_end = $i;
        }
    }
}
close $sizes_fh or warn $!;
close $sizes_fh or warn $!;
You are trying to allocate an array of 90,000,000 elements. Perl, due to its flexible typing and other advanced variable features, uses a lot more memory for this than you would expect.
On my (Windows 7) machine, a program that just allocates such an array and does nothing else eats up 3.5 GB of RAM.
There are various ways to avoid this huge memory usage. Here are a couple:
The PDL module for scientific data processing, which is designed to efficiently store huge numeric arrays in memory. This will change the syntax for allocating and using the array, though (and it messes around with Perl's syntax in various other ways).
DBM::Deep is a module that allocates a database in a file--and then lets you access that database through a normal array or hash:
use DBM::Deep;
my @array;
my $db = tie @array, "DBM::Deep", "array.db";
# Now you can use @array like a normal array, but it will be stored in a database.
If you know a bit of C, it is quite simple to offload the array manipulation into low-level code. Using a C array takes less space and is a lot faster. However, you lose nice stuff like bounds checking. Here is an implementation with Inline::C:
use Inline 'C';
...;
__END__
__C__
// note: I don't know if your data contains only ints or doubles. Adjust types as needed.
int array_len = -1; // last index
int *array = NULL;

void make_array(int size) {
    free(array);
    // if this fails, start checking the return value of malloc for != NULL
    array = (int*) malloc(sizeof(int) * size);
    array_len = size - 1;
}

// returns false on bounds error
int array_increment(int start, int end, int fpkm) {
    if ((end - 1) > array_len) return 0;
    int i;
    for (i = start; i < end; i++) {
        array[i] += fpkm;
    }
    return 1;
}

// please check if this is actually equivalent to your code.
// I removed some unnecessary-looking variables.
void loop_over_array(char* chrom1) {
    int i,
        last_start = 0,
        last_end = 0,
        last_value = array[0];
    for (i = 1; i < array_len; i++) { // are you sure not `i <= array_len`?
        if (array[i] != last_value) {
            last_value = array[i];
            // I don't know how to use Perl filehandles from C,
            // so just redirect the output on the command line
            printf("%s\t%d\t%d\t%d\n", chrom1, last_start, last_end, last_value);
            last_start = i;
        }
        last_end = i;
    }
}

void free_array() {
    free(array);
}
Minimal testing code:
use Test::More;
make_array(15);
ok !array_increment(0, 16, 2);
make_array(95_000_000);
ok array_increment(0, 3, 1);
ok array_increment(2, 95_000_000, 1);
loop_over_array("chrom");
free_array();
done_testing;
The output of this test case is
chrom 0 1 2
chrom 2 2 1
(with testing output removed). It may take a second to compile, but after that it should be quite fast.
In the records read from $bedgraph_fh, what's a typical value for $start? Although hashes have more overhead per entry than arrays, you may be able to save some memory if @array starts with a lot of unused entries. e.g., if you have an @array of 90 million elements, but the first 80 million are never used, then there's a good chance you'll be better off with a hash.
Other than that, I don't see any obvious cases of this code holding on to data that's not needed by the algorithm it implements, although, depending on your actual objective, there may be an alternative algorithm that doesn't require as much data to be held in memory.
If you really need to be dealing with a set of 90 million active data elements, though, then your primary options are going to be either buy a lot of RAM or use some form of database. In the latter case, I'd opt for SQLite (via DBD::SQLite) for simplicity and light weight, but YMMV.

How can i measure the overhead due to task migration/load balancing on linux with the real time patch?

I am trying to measure the overhead due to task migration. By overhead I mean the latency involved in such an activity. I know there are separate run queues available for each core and that the kernel periodically checks the run queues to see whether there is an imbalance, waking up a kernel thread (perhaps a higher-priority one) that does the migration.
Could anyone provide me with pointers to kernel source code where I can insert time stamps to measure this value?
Is there any other performance metric I could investigate to get such an overhead?
I remember a previous post that discussed this topic, and someone also posted some code for measuring system overhead.
I see you want to add code to insert time stamps; I'm not sure that's feasible, because task scheduling happens so frequently. I think you can follow that earlier topic.
I saved the source code from the post; thanks to the author!
#include <stdio.h>

// previous readings, kept between calls
static unsigned long long lastTotalUser, lastTotalUserLow,
                          lastTotalSys, lastTotalIdle;

double getCurrentValue() {
    double percent;
    FILE* file;
    unsigned long long totalUser, totalUserLow, totalSys, totalIdle, total;

    file = fopen("/proc/stat", "r");
    fscanf(file, "cpu %llu %llu %llu %llu", &totalUser, &totalUserLow,
           &totalSys, &totalIdle);
    fclose(file);

    if (totalUser < lastTotalUser || totalUserLow < lastTotalUserLow ||
        totalSys < lastTotalSys || totalIdle < lastTotalIdle) {
        // Overflow detection. Just skip this value.
        percent = -1.0;
    } else {
        total = (totalUser - lastTotalUser) + (totalUserLow - lastTotalUserLow) +
                (totalSys - lastTotalSys);
        percent = total;
        total += (totalIdle - lastTotalIdle);
        percent /= total;
        percent *= 100;
    }

    lastTotalUser = totalUser;
    lastTotalUserLow = totalUserLow;
    lastTotalSys = totalSys;
    lastTotalIdle = totalIdle;

    return percent;
}

Poor IO performance under heavy load

It seems that I have a problem with Linux IO performance. Working on a project, I need to clear the whole file from kernel space. I use the following code pattern:
for_each_mapping_page(mapping, index) {
    page = read_mapping_page(mapping, index);
    lock_page(page);
    { kmap // memset // kunmap }
    set_page_dirty(page);
    write_one_page(page, 1);
    page_cache_release(page);
    cond_resched();
}
All works fine, but with large files (~3 GB+ for me) I see my system stall in a strange manner: until this operation completes, I can't run anything. In other words, all the processes that existed before this operation keep running fine, but if I try to start something new while it runs, I see nothing until it has completed.
Is it a kernel IO scheduling issue, or maybe I missed something? And how can I fix this problem?
Thanks.
UPD:
According to Kristof's suggestion I've reworked my code and now it looks like this:
headIndex = soff >> PAGE_CACHE_SHIFT;
tailIndex = eoff >> PAGE_CACHE_SHIFT;

/**
 * doing the exact #headIndex .. #tailIndex range
 */
for (index = headIndex; index < tailIndex; index += nr_pages) {
    nr_pages = min_t(int, ARRAY_SIZE(pages), tailIndex - index);

    for (i = 0; i < nr_pages; i++) {
        pages[i] = read_mapping_page(mapping, index + i, NULL);
        if (IS_ERR(pages[i])) {
            while (i--)
                page_cache_release(pages[i]);
            goto return_result;
        }
    }

    for (i = 0; i < nr_pages; i++)
        zero_page_atomic(pages[i]);

    result = filemap_write_and_wait_range(mapping, index << PAGE_CACHE_SHIFT,
                 ((index + nr_pages) << PAGE_CACHE_SHIFT) - 1);

    for (i = 0; i < nr_pages; i++)
        page_cache_release(pages[i]);

    if (result)
        goto return_result;

    if (fatal_signal_pending(current))
        goto return_result;

    cond_resched();
}
As a result I've got better IO performance, but I still see problems with huge IO activity when doing concurrent disk access as the same user that triggered the operation.
Anyway, thanks for the suggestions.
In essence you're bypassing the kernel's IO scheduler completely.
If you look at the ext2 implementation you'll see it never (well ok, once) calls write_one_page(). For large-scale data transfers it uses mpage_writepages() instead.
This uses the block I/O interface rather than immediately accessing the hardware, which means it passes through the IO scheduler. Large operations will not block the entire system, as the scheduler automatically ensures that other operations are interleaved with the large writes.

Bruteforcing Blackberry PersistentStore?

I am experimenting with Blackberry's Persistent Store, but I have gotten nowhere so far, which is good, I guess.
So I have written a short program that attempts to iterate from 0 to a specific upper bound, searching for persisted objects. Blackberry seems to intentionally slow the loop. Check this out:
String result = "result: \n";
int ub = 3000;
Date start = Calendar.getInstance().getTime();
for (int i = 0; i < ub; i++) {
    PersistentObject o = PersistentStore.getPersistentObject(i);
    if (o.getContents() != null) {
        result += (String) o.getContents() + "\n";
    }
}
result += "end result\n";
result += "from 0 to " + ub + " took "
        + (Calendar.getInstance().getTime().getTime() - start.getTime()) / 1000
        + " seconds";
From 0 to 3000 took 20 seconds. Is this enough to conclude that brute-forcing is not a practical method to breach the Blackberry?
In general, how secure is BB Persistent Store?
It's very secure. If you're only getting 150 tries per second, it's going to take you about 3.9 billion years to try every long value (18446744073709551616 of them).
Even then, it would only find objects that are not secured further with a ControlledAccess object. If an application wraps the persisted data with a ControlledAccess object, it can only be read by the same signed application that stored the object. See the PersistentObject class docs for more information.
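The 3.9-billion-year figure follows from straightforward arithmetic; a quick sketch (the 150 tries/second rate comes from the 3000 lookups in 20 seconds observed above):

```javascript
// Estimate brute-force time over the full 64-bit key space
// at the observed rate of ~150 getPersistentObject calls per second.
const keySpace = Math.pow(2, 64);   // 18446744073709551616 possible long values
const triesPerSecond = 3000 / 20;   // 150, from the timing test above
const seconds = keySpace / triesPerSecond;
const years = seconds / (365.25 * 24 * 3600);
console.log(years.toExponential(1) + ' years'); // roughly 3.9e9 years
```

Even generous speedups don't change the conclusion: at a million tries per second it would still take on the order of half a million years to sweep the key space.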
