Why Group is slower than $group from Aggregate? [duplicate] - node.js

This question already has an answer here:
MongoDB aggregation comparison: group(), $group and MapReduce
(1 answer)
Closed 5 years ago.
I tried to benchmark 3 methods of grouping data: native JS (with underscore), group, and aggregate with $group.
I used this dataset (genus/position of trees in Paris; 237,168 rows, 35 MB).
Here is my test script, and the result is a bit surprising!
┌─────────────┬───────────────┐
│ Method      │ avg time (ms) │
├─────────────┼───────────────┤
│ Pure js     │           897 │
├─────────────┼───────────────┤
│ Group       │          3863 │
├─────────────┼───────────────┤
│ Aggregation │           364 │
└─────────────┴───────────────┘
Why is grouping with group 10x slower than aggregation?
What is group meant to be used for?
And how can I optimise my query further?
Thanks.

The group command uses the same framework as map-reduce, and there are many resources explaining why MR is slower than the aggregation framework. The main reason is that it runs in a separate JavaScript thread, whereas the aggregation framework runs natively on the server.
See details here MongoDB aggregation comparison: group(), $group and MapReduce
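For reference, the "Pure js" method in the benchmark boils down to an in-memory reduce. A minimal sketch, with made-up field names and sample rows (the real dataset's schema isn't shown in the question):

```javascript
// Hypothetical rows mimicking the Paris trees dataset (genus + position).
const trees = [
  { genus: 'Platanus', position: [2.35, 48.85] },
  { genus: 'Aesculus', position: [2.29, 48.86] },
  { genus: 'Platanus', position: [2.37, 48.84] },
];

// Group rows by genus and count them, like _.countBy would.
function countByGenus(rows) {
  return rows.reduce((acc, row) => {
    acc[row.genus] = (acc[row.genus] || 0) + 1;
    return acc;
  }, {});
}

console.log(countByGenus(trees)); // { Platanus: 2, Aesculus: 1 }
```

The server-side equivalent with the aggregation framework would be a pipeline along the lines of `db.trees.aggregate([{ $group: { _id: "$genus", count: { $sum: 1 } } }])`, which runs natively on the server as the answer above explains.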

Related

dsbulk unload missing data

I'm using dsbulk 1.6.0 to unload data from Cassandra 3.11.3.
Each unload results in wildly different row counts. Here are the results of 3 invocations of unload on the same cluster, connecting to the same Cassandra host. The table being unloaded is only ever appended to, and data is never deleted, so a decrease in unloaded rows should not occur. There are 3 Cassandra nodes in the cluster and a replication factor of 3, so all data should be present on the chosen host. Furthermore, these runs were executed in quick succession; the number of newly added rows would be in the hundreds (if any), not in the tens of thousands.
Run 1:
│  total | failed | rows/s | p50ms     | p99ms     | p999ms
│ 10,937 |      7 |     97 | 15,935.46 | 20,937.97 | 20,937.97
│ Operation UNLOAD_20201024-084213-097267 completed with 7 errors in 1 minute and 51 seconds.
Run 2:
│  total | failed | rows/s | p50ms     | p99ms     | p999ms
│ 60,558 |      3 |    266 | 12,551.34 | 21,609.05 | 21,609.05
│ Operation UNLOAD_20201025-084208-749105 completed with 3 errors in 3 minutes and 47 seconds.
Run 3:
│  total | failed | rows/s | p50ms     | p99ms     | p999ms
│ 45,404 |      4 |    211 | 16,664.92 | 30,870.08 | 30,870.08
│ Operation UNLOAD_20201026-084206-791305 completed with 4 errors in 3 minutes and 35 seconds.
It would appear that Run 1 is missing the majority of the data. Run 2 may be closer to complete and Run 3 is missing significant data.
I'm invoking unload as follows:
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE > $DATA_FILE
I'm assuming this isn't expected behaviour for dsbulk. How do I configure it to reliably unload a complete table without errors?
Data could be missing from a host if the host wasn't reachable when the data was written, hints weren't replayed, and you don't run repairs periodically. Also, because DSBulk reads by default with consistency level LOCAL_ONE, different hosts will provide different views (the host you provide is just a contact point: after that, the cluster topology is discovered, and DSBulk selects replicas based on the load balancing policy).
You can force DSBulk to read the data with another consistency level by using the -cl command line option (doc). You can compare results using LOCAL_QUORUM or ALL; in these modes Cassandra will also "fix" the inconsistencies as they are discovered, although this will be much slower and will add load to the nodes because of the writes of repaired data.
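Concretely, the invocation from the question can be extended with -cl to request a stronger consistency level (same hypothetical environment variables as above):

```shell
dsbulk unload -h $CASSANDRA_IP -k $KEYSPACE -t $CASSANDRA_TABLE \
  -cl LOCAL_QUORUM > $DATA_FILE
```

Comparing the row counts of a LOCAL_ONE run against a LOCAL_QUORUM or ALL run should reveal whether replica inconsistency is the cause of the varying totals.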

How to model a tree of computations with Slurm?

I have a simulation that consists of N steps, run sequentially. Each of these steps modifies a global state in memory, until the final step which is the result. It is possible, after a step has run, to write to disk the intermediate state that this step just computed, and to load such an intermediate state instead of starting from scratch. Writing and loading intermediate states has a non-negligible cost.
I want to run many variations of a simulation on a Slurm cluster. Each variation will change the parameter of some of the steps.
Example
Simulation steps
S1 --> S2 --> S3 --> S4
Variations
run1: S2.speed=2, S3.height=12
run2: S2.speed=2, S3.height=20
run3: S2.speed=2, S3.height=40
run4: S2.speed=5, S3.height=12
run5: S2.speed=5, S3.height=80
What I want to do is for the various runs to share common computations, by dumping the intermediate state of the shared steps. This will form a tree of step runs:
S1
├─ S2 (speed=2)
│  ├─ S3 (height=12)
│  │  └─ S4
│  ├─ S3 (height=20)
│  │  └─ S4
│  └─ S3 (height=40)
│     └─ S4
└─ S2 (speed=5)
   ├─ S3 (height=12)
   │  └─ S4
   └─ S3 (height=80)
      └─ S4
I know I can get the result of the 5 runs by running 5 processes:
run1: S1 --> S2 (speed=2) --> S3 (height=12) --> S4
run2: (dump of run1.S2) --> S3 (height=20) --> S4
run3: (dump of run1.S2) --> S3 (height=40) --> S4
run4: (dump of run1.S1) --> S2 (speed=5) --> S3 (height=12) --> S4
run5: (dump of run4.S2) --> S3 (height=80) --> S4
This reduces the computation from 20 steps using a naive approach, to 13 steps with 3 dumps and 4 loads.
Now, my question is how to model this with Slurm, to make the best use of the scheduler?
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after dumping the intermediate state. Run1 will submit run4 after dumping S1, then submit run2 and run3 after dumping S2, and run4 will submit run5 after dumping S2. With this solution, is there any point in declaring the dependency when submitting the job to Slurm?
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. The list of jobs to submit and their dependencies would basically be the tree I drew above (except that the S3/S4 pairs would be merged into a single job). That is 8 jobs to submit instead of 5, but I can submit them all at once from the beginning, with the right dependencies. However, I am not sure what the advantages of this approach would be. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start? Are there advantages from a user point of view to having all the jobs submitted and linked with dependencies (e.g., to cancel all the jobs that depend on the root job)? I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is it possible, or even advisable?
Finally, are there other approaches I did not think about?
Edit
The example I gave is of course simplified a lot. The real simulations will contain hundreds of steps, with about a thousand variations to try. The scalability of the chosen solution is important.
One solution I can think of is that each run is responsible for submitting the jobs of the runs that depend on it, after dumping the intermediate state. With this solution, is there any point in declaring the dependency when submitting the job to Slurm?
This is an approach often followed with simple workflows that involve long-running jobs that must checkpoint and restart.
Another solution I can see is to break the long chains of computation into multiple, dependent jobs. Will Slurm do a better job as a scheduler if it knows the full list of jobs and their dependencies right from the start?
No. Slurm will simply ignore jobs that are not eligible to start because the jobs they depend on have not finished.
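For reference, submitting the whole dependency tree up front can be sketched with sbatch's --dependency flag (the script names are hypothetical):

```shell
# Submit the root job and capture its job id with --parsable.
jid_s1=$(sbatch --parsable s1.sh)

# The two S2 variants start only after S1 completes successfully.
jid_s2a=$(sbatch --parsable --dependency=afterok:$jid_s1 s2_speed2.sh)
jid_s2b=$(sbatch --parsable --dependency=afterok:$jid_s1 s2_speed5.sh)

# Each merged S3+S4 job depends on its S2 parent.
sbatch --dependency=afterok:$jid_s2a s3s4_height12.sh
sbatch --dependency=afterok:$jid_s2a s3s4_height20.sh
sbatch --dependency=afterok:$jid_s2a s3s4_height40.sh
sbatch --dependency=afterok:$jid_s2b s3s4_height12.sh
sbatch --dependency=afterok:$jid_s2b s3s4_height80.sh
```

Cancelling a parent job with scancel will then leave the dependent jobs stuck (or kill them, if --kill-on-invalid-dep is set), which gives the cancellation behaviour mentioned in the question.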
Are there some advantages from a user point of view, to have all the jobs submitted and linked with dependencies (eg, to cancel all the jobs that depend on the root job)?
Yes, but that is marginally useful.
I know I can submit many jobs at once with a job array, but I don't see a way to declare dependencies between jobs of the same array. Is is possible, or even advisable?
No, you cannot set dependencies between jobs of the same array.
Finally, are there other approaches I did not think about?
You could use a workflow management system.
One of the simplest solutions is Makeflow. It uses files that look like classical Makefiles to describe the dependencies between jobs. Then you simply run something like makeflow -T slurm makefile.mf
Another option is Bosco. Bosco offers a few more possibilities and is good for personal use. It is easy to set up and can submit jobs to multiple clusters.
Finally, Fireworks is a very powerful solution. It requires MongoDB and is more suited for lab-wide use, but it can implement very complex logic for job submission/resubmission based on the outputs of jobs, and can handle errors in a clever way. You can, for instance, implement a workflow where a job is submitted with a given value for a given parameter, have Fireworks monitor convergence based on the output file, and cancel and resubmit with another value if the convergence is not satisfactory.
Another possible solution is to use pipeline tools. In the field of bioinformatics, SnakeMake is becoming really popular. SnakeMake is based on GNU Make but written in Python, hence the name SnakeMake. For SnakeMake to work, you specify which output you want, and SnakeMake deduces which rules it has to run to produce that output. One of the nice things about SnakeMake is that it scales really easily from personal laptops to bigger computers, and even clusters (for instance Slurm clusters). Your example would look something like this:
rule all:
    input:
        ['S4_speed_2_height_12.out',
         'S4_speed_2_height_20.out',
         'S4_speed_2_height_40.out',
         'S4_speed_5_height_12.out',
         'S4_speed_5_height_80.out']

rule S1:
    output:
        "S1.out"
    shell:
        "touch {output}"  # do your heavy computations here

rule S2:
    input:
        "S1.out"
    output:
        "S2_speed_{speed}.out"
    shell:
        "touch {output}"

rule S3:
    input:
        "S2_speed_{speed}.out"
    output:
        "S3_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"

rule S4:
    input:
        "S3_speed_{speed}_height_{height}.out"
    output:
        "S4_speed_{speed}_height_{height}.out"
    shell:
        "touch {output}"
We can then ask snakemake to make a pretty image of how it would perform these computations:
Snakemake automatically figures out which output can be used by different rules.
Running this on your local machine is as simple as executing snakemake, and submitting the actions to Slurm is just snakemake --cluster "sbatch". The example I gave is obviously an oversimplification, but SnakeMake is highly customizable (number of threads per rule, memory usage, etc.) and has the advantage of being based on Python. It takes a bit of effort to figure out how everything works in SnakeMake, but I can definitely recommend it.

How to make Spark write by partition and group by a column

I have roughly 100Gb of data that I'm trying to process. The data has the form:
| timestamp | social_profile_id | json_payload     |
|-----------|-------------------|------------------|
| 123       | 1                 | {"json":"stuff"} |
| 124       | 2                 | {"json":"stuff"} |
| 125       | 3                 | {"json":"stuff"} |
I'm trying to split this data frame into folders in S3 by social_profile_id. There are roughly 430,000 social_profile_ids.
I've loaded the data into a Dataset without any problems. However, when I write it out and try to partition it, it takes forever! Here's what I've tried:
messagesDS
  .write
  .partitionBy("socialProfileId")
  .mode(sparkSaveMode)
I don't really care how many files are in each folder at the end of the job. My theory is that each node can group by the social_profile_id, then write out to its respective folder without having to do a shuffle or communicate with other nodes. But this isn't happening as evidenced by the long job time. Ideally the end result would look a little something like this:
├── social_id_1 (only two partitions had id_1 data)
│   ├── partition1_data.parquet
│   └── partition3_data.parquet
├── social_id_2 (more partitions have this data in it)
│   ├── partition3_data.parquet
│   ├── partition4_data.parquet
│   └── etc.
├── social_id_3
│   ├── partition2_data.parquet
│   ├── partition4_data.parquet
│   └── etc.
└── etc.
I've tried increasing the compute resources a few times, both increasing instance sizes and the number of instances. From the Spark UI I can see that the majority of the time is taken by the write operation. It seems that all of the executors are being used, but they take an absurdly long time to execute (e.g., 3-5 hours to write ~150 MB). Any help would be appreciated! Sorry if I mixed up some of the Spark terminology.

Performance test with Taurus

I am new to performance testing and would like to know what the following output from Taurus means (http://websi.te is NOT the real domain name of my test!):
10:53:12 INFO: Test duration: 0:06:54
10:53:12 INFO: Samples count: 1202, 2.08% failures
10:53:12 INFO: Average times: total 26.906, latency 0.132, connect 0.233
10:53:12 INFO: Percentiles:
┌───────────────┬───────────────┐
│ Percentile, % │ Resp. Time, s │
├───────────────┼───────────────┤
│           0.0 │         0.728 │
│          50.0 │        23.631 │
│          90.0 │        43.903 │
│          95.0 │        56.927 │
│          99.0 │        84.351 │
│          99.9 │       104.895 │
│         100.0 │       125.503 │
└───────────────┴───────────────┘
10:53:12 INFO: Request label stats:
┌─────────────────┬────────┬────────┬────────┬───────────────────┐
│ label           │ status │ succ   │ avg_rt │ error             │
├─────────────────┼────────┼────────┼────────┼───────────────────┤
│ http://websi.te │ FAIL   │ 97.92% │ 26.906 │ Moved Permanently │
└─────────────────┴────────┴────────┴────────┴───────────────────┘
For example:
Resp. Time, s: 43.903 at the 90th percentile: does this mean that in 90% of the cases my website responded only after ~44 seconds? That seems impossible, because it responds after 1-2 seconds when I visit it in a web browser.
Is avg_rt (average response time?) about 26 seconds? Impossible.
If I look at the Chromium performance test, most elements (Network, Frames, Scripts) are done after 1000 ms, and the network waterfall is done after about 650 ms.
I have also tested linguee.com with Taurus, and it gives me similar figures:
avg_rt: 15 seconds
50%: 10 seconds
90%: 24 seconds
95%: 56 seconds
Is there a misconception on my side? How is it even possible that 90% of all requests had a response time of up to 24 seconds? Check it yourself: open linguee.com, it loads in about 2000 ms.
Thank you in advance.
EDIT:
My config file looks as follows
execution:
- concurrency: 100
  ramp-up: 1m
  hold-for: 5m
  scenario: quick-test

scenarios:
  quick-test:
    requests:
    - https://www.linguee.com
Resp. Time, s 43.903 at the 90% percentile means that 90% of requests completed within 43.9 seconds (and the remaining 10% took even longer).
avg_rt stands for average response time. It is the arithmetic mean of all sampler durations: the sum of all durations divided by their count. In your case it's about 26 seconds.
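To make the arithmetic behind avg_rt and the percentile table concrete, here is a sketch with made-up sample durations (Taurus/JMeter aggregate real samples the same way in principle):

```javascript
// Made-up durations of individual samples, in seconds.
const durations = [1, 2, 2, 3, 4, 5, 8, 20, 30, 50];

// avg_rt: arithmetic mean of all sample durations.
const avg = durations.reduce((a, b) => a + b, 0) / durations.length;

// Nearest-rank percentile: the value below which p% of samples fall.
function percentile(sorted, p) {
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

const sorted = [...durations].sort((a, b) => a - b);
console.log(avg);                    // 12.5
console.log(percentile(sorted, 90)); // 30: 90% of samples took 30 s or less
```

Note how a few very slow samples (20, 30, 50) pull the mean far above what a "typical" request experiences, which is exactly why percentiles are reported alongside the average.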
With regards to your "impossible" statements:
Your test issued 1202 requests
Your test duration is about 7 minutes
That means you fired roughly 171 requests per minute, i.e. ~2.85 requests per second. That hardly looks like a "load test" to me; most probably your system's performance is the real question mark. You can try opening your system under test in a browser while the test is running, and prepare to be surprised.
So I would start by investigating why your system responds so slowly, and first of all check whether it has enough resources (RAM, CPU, network, disk, etc.). You can do this using, for example, the JMeter PerfMon Plugin.
There are the following possible explanations for your bad results if you're really testing an external website like linguee:
Your network card cannot handle the underlying traffic, so you're not testing the website but your own network
The target website has a DDoS protection mechanism that slows down your requests
Your machine is overloaded
Regarding blazedemo, it seems the website is down or facing performance issues.
Note that you should never load test a website you don't own, as it is considered a DDoS attack.

NodeJS Heap out of memory with long process with database access

I'm building a NodeJs App using Express 4 + Sequelize + a Postgresql database.
I'm using Node v8.11.3.
I wrote a script to load data into my database from a JSON file. I tested the script with a sample of ~30 entities to seed, and it works perfectly.
In reality, I have around 100,000 entities to load from the complete JSON file. My script reads the JSON file and tries to populate the database asynchronously (i.e., all 100,000 entities at the same time).
The result, after a few minutes, is:
<--- Last few GCs --->
[10488:0000018619050A20] 134711 ms: Mark-sweep 1391.6 (1599.7) -> 1391.6 (1599.7) MB, 1082.3 / 0.0 ms allocation failure GC in old space requested
[10488:0000018619050A20] 136039 ms: Mark-sweep 1391.6 (1599.7) -> 1391.5 (1543.7) MB, 1326.9 / 0.0 ms last resort GC in old space requested
[10488:0000018619050A20] 137351 ms: Mark-sweep 1391.5 (1543.7) -> 1391.5 (1520.2) MB, 1311.5 / 0.0 ms last resort GC in old space requested
<--- JS stacktrace --->
==== JS stack trace =========================================
Security context: 0000034170025879 <JSObject>
1: split(this=00000165BEC5DB99 <Very long string[1636]>)
2: attachExtraTrace [D:\Code\backend-lymo\node_modules\bluebird\js\release\debuggability.js:~775] [pc=0000021115C5728E](this=0000003CA90FF711 <CapturedTrace map = 0000033AD0FE9FB1>,error=000001D3EC5EFD59 <Error map = 00000275F61BA071>)
3: _attachExtraTrace(aka longStackTracesAttachExtraTrace) [D:\Code\backend-lymo\node_module...
FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
1: node_module_register
2: v8::internal::FatalProcessOutOfMemory
3: v8::internal::FatalProcessOutOfMemory
4: v8::internal::Factory::NewFixedArray
5: v8::internal::HashTable<v8::internal::SeededNumberDictionary,v8::internal::SeededNumberDictionaryShape>::IsKey
6: v8::internal::HashTable<v8::internal::SeededNumberDictionary,v8::internal::SeededNumberDictionaryShape>::IsKey
7: v8::internal::StringTable::LookupString
8: v8::internal::StringTable::LookupString
9: v8::internal::RegExpImpl::Exec
10: v8::internal::interpreter::BytecodeArrayRandomIterator::UpdateOffsetFromIndex
11: 0000021115A043C1
Finally, some entities had been created, but the process clearly crashed.
I understand that this error is due to memory.
My question is: why doesn't Node pace the work so that it doesn't overshoot memory? Is there a "queue" that limits such explosions?
I identified some workarounds:
Segment the seed into several JSON files
Allow more memory using the --max_old_space_size=8192 option
Proceed sequentially (using synchronous calls)
But none of these solutions satisfies me, and it makes me worry for the future of my app, which is supposed to handle occasionally long operations in production.
What do you think about it?
Node.js just does what you tell it. If you go into some big loop and start up a lot of database operations, then that's exactly what node.js attempts to do. If you start so many operations that you consume too many resources (memory, database resources, files, whatever), then you will run into trouble. Node.js does not manage that for you. It has to be your code that manages how many operations you keep in flight at the same time.
On the other hand, node.js is particularly good at having a bunch of asynchronous operations in flight at the same time and you will generally get better end-to-end performance if you do code it to have more than one operation going at a time. How many you want to have in flight at the same time depends entirely upon the specific code and exactly what the asynchronous operation is doing. If it's a database operation, then it will likely depend upon the database and how many simultaneous requests it does best with.
Here are some references that give you ideas for ways to control how many operations are going at once, including some code examples:
Make several requests to an API that can only handle 20 request a minute
Promise.all consumes all my RAM
Javascript - how to control how many promises access network in parallel
Fire off 1,000,000 requests 100 at a time
Nodejs: Async request with a list of URL
Loop through an api get request with variable URL
Choose proper async method for batch processing for max requests/sec
If you showed your code, we could advise more specifically which technique might fit best for your situation.
Use async.eachOfLimit to run at most X operations at the same time:
var async = require("async");

var myBigArray = [];
var X = 10; // at most 10 operations at the same time

async.eachOfLimit(myBigArray, X, function(element, index, callback){
    // insert one element; MyCollection stands for your model
    MyCollection.insert(element, function(err){
        return callback(err);
    });
}, function(err){
    // all elements have been processed
    if (err) {
        // handle the error
    } else {
        // continue
    }
});
