How can I correctly configure my crawl (crawl-beans.cxml) in Heritrix?

When I started my crawl, I realized it was taking much more time than it should and still had not finished.
I tried to check the process PID from another terminal to see what was going on, but the output was not clear to me; it was all of this form:
REMOVED by Not SEED, Prod or Cat ****
https://(url of a page we wanted to crawl)
If someone understands these lines, it would be great to let me know!
I highly doubt it's the crawl configuration (crawl-beans.cxml), but if someone knows how to deal with it, please let me know.

Going a bit deeper into it, I think I was being silly: it was a PHP site, so the crawl was bound to take time. So the thing is, there's no problem at all.
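For what it's worth, the "REMOVED by ..." lines look like scope filtering at work: some rule (apparently one named "Not SEED, Prod or Cat") decided those URLs were out of scope. The scope is defined by the DecideRuleSequence bean in crawl-beans.cxml. Here is a minimal sketch of that section, assuming the stock Heritrix 3 profile layout (the rule classes and the logToFile property below come from that profile; your job's actual rules will differ):

<bean id="scope" class="org.archive.modules.deciderules.DecideRuleSequence">
  <!-- Log each accept/reject decision, to trace which rule removed a URL -->
  <property name="logToFile" value="true" />
  <property name="rules">
    <list>
      <!-- Start by rejecting everything... -->
      <bean class="org.archive.modules.deciderules.RejectDecideRule" />
      <!-- ...then accept URIs sharing a SURT prefix with the seeds... -->
      <bean class="org.archive.modules.deciderules.surt.SurtPrefixedDecideRule" />
      <!-- ...but cap the link depth -->
      <bean class="org.archive.modules.deciderules.TooManyHopsDecideRule">
        <property name="maxHops" value="20" />
      </bean>
    </list>
  </property>
</bean>

With logToFile enabled, the decisions should end up in a scope log in the job's logs directory, which makes messages like the ones above traceable to a specific rule.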

Related

Locating thread leak in Node.js

So I have this Node application running on Ubuntu, and I noticed that there are lots of threads showing up in pstree -a:
└─node /bin/node --expose-gc -max-old-space-size=256 main.js
└─process.title
├─sh ...
├─sudo ...
│... bunch of scripts i'm doing
├─66*[{process.title}]
└─5*[{node}]
Sometimes there are tens of them, but it can go up to hundreds. I have no idea how they are created or what they are doing, but they are certainly eating up system resources.
This project has complex package dependencies, so it has become extremely hard for me to locate the root cause of this problem. It would be much appreciated if someone could shed some light on this situation.
Thanks to @jfriend00's comment, which really helped me narrow this down.
asyncawait and its underlying node-fibers turned out to be the root cause in my case.
I'm still not quite sure how this happens (it's a widely used module, and I can't seem to find anyone else talking about this), but taking the time to replace all asyncawait usage with native async/await brought the thread count back to 4.
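For anyone hitting the same thing, this is roughly what the replacement looks like. A minimal sketch, assuming code written in the asyncawait style; getUser and db.findUser are made-up names for illustration:

// Before: asyncawait on top of node-fibers (the source of the extra
// threads in this case):
//
//   var async = require('asyncawait/async');
//   var await = require('asyncawait/await');
//   var getUser = async(function (id) {
//     return await(db.findUser(id));
//   });

// After: native async/await (Node >= 7.6), no fibers involved.
// Hypothetical async data source, just so the sketch runs:
var db = { findUser: function (id) { return Promise.resolve({ id: id }); } };

async function getUser(id) {
  return db.findUser(id);
}

getUser(1).then(function (u) { console.log(u); }); // { id: 1 }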

Why does the multiprocessing stop?

I followed the code in the link multiprocessing.Pool() slower than just using ordinary functions to write a multi-process program, but I find that when the length of the data in mainwordlist is relatively large, the code doesn't work. (You can try it by changing xrange(50) to xrange(1000) in the code.)
Actually, the terminal shows that the code is still running, but the process is gone from the top command. Can anyone tell me why? Any comment will be appreciated. Thank you!
I found the following link http://eli.thegreenplace.net/2012/01/16/python-parallelizing-cpu-bound-tasks-with-multiprocessing and reorganized my code. Both versions start from the same method, but the new one avoids the problem above, though I still don't know why. Anyway, it works.

Optimizing a Node.js module for best performance

I'm writing a crawler module which calls itself recursively to download more and more links, depending on a depth option parameter that is passed in.
Besides that, I'm doing more tasks on the downloaded resources (enriching/changing them depending on the configuration passed to the crawler). This process goes on recursively until it's done, which might take a lot of time (or not), depending on the configuration used.
I want to optimize it to be as fast as possible without hindering any Node.js application that uses it. I've set up an Express server where one of the routes launches the crawler for a user-defined (query string) host.
After launching a few crawling sessions for different hosts, I've noticed that I sometimes get really slow responses from other routes that only return simple text. The delay can be anywhere from a few milliseconds to something like 30 seconds, and it seems to happen at random times (well, nothing is random, but I can't pinpoint the cause).
I've read a JetBrains article about CPU profiling using the V8 profiler functionality integrated into WebStorm, but unfortunately it only shows how to collect the information and how to view it; it doesn't give any hints on how to find such problems, so I'm pretty much stuck.
Could anyone guide me here? Any tips on what my crawler might be doing that hinders the Express server (a lot of recursive calls?), or on how to find the hotspots I'm looking for and optimize them?
It's hard to say anything more specific about optimizing code that isn't shown, but I can give some advice relevant to the described situation.
One thing that comes to mind is that you may be running blocking code. Never use deep recursion without breaking it up with setTimeout or setImmediate to give the event loop a chance to run once in a while (process.nextTick won't help here: the nextTick queue is drained before the event loop continues, so recursing through it can still starve I/O).
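As a concrete illustration, here is a minimal sketch of that pattern, with crawlPage as a hypothetical stand-in for whatever function downloads one page and calls back with its outgoing links:

// Recursive crawl step that yields to the event loop between levels.
// crawlPage(url, cb) is hypothetical: it downloads one page and calls
// cb(err, links) with the links found on it.
function crawl(url, depth, done) {
  if (depth === 0) return done(null);
  crawlPage(url, function (err, links) {
    if (err) return done(err);
    // setImmediate defers the next recursion step behind any pending
    // I/O and HTTP events, so e.g. Express routes stay responsive.
    setImmediate(function () {
      var pending = links.length;
      if (pending === 0) return done(null);
      links.forEach(function (link) {
        crawl(link, depth - 1, function () {
          // Child errors are ignored in this sketch; just count completions.
          if (--pending === 0) done(null);
        });
      });
    });
  });
}

The setImmediate call is the important part: it pushes each recursion step to the back of the event queue, so other requests can interleave with the crawler's work instead of waiting for the whole link tree.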

How do Node.js tasks actually run?

I'm trying to figure out exactly how Node.js tasks are run. I understand that there is a main loop that takes requests, queues them up, and moves on. But what exactly then executes those queued-up events/tasks?
Update:
Can somebody actually please explain it? I appreciate people wanting me to script it and figure it out myself, but sometimes it's better to just have something explained rather than creating barriers to learning simple concepts.
You can follow https://github.com/node-inspector/node-inspector
You can use node-inspector to select a script and set some breakpoints; stepping through the callbacks can help you understand the event loop.
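Short of a debugger, a few lines already make the mechanics visible. The short answer to "what executes the queued tasks" is: the same single JavaScript thread, one callback at a time, driven by the event loop; only low-level I/O work is delegated to libuv's thread pool. A minimal sketch whose numbered lines print in exactly the order shown:

console.log('1: synchronous code always runs to completion first');

setTimeout(function () {
  console.log('4: timer callback, picked up later by the event loop');
}, 0);

process.nextTick(function () {
  console.log('3: nextTick queue, drained before the loop continues');
});

console.log('2: still synchronous');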

What can I do to find the cause of pread64/pwrite64 hangs?

My application does some heavy I/O on a raw /dev/sdb block device using pread64/pwrite64. Sometimes it does just fine: a call to pread64/pwrite64 usually takes as little as 50-100µs. But sometimes it takes a whole lot more, up to several seconds.
What can you recommend to find the cause of such a problem?
I have not used it, but I have heard about a tool called latencytop.
When it's hung like that, grab a stackshot. Another option is pstack or lsstack.
And as @Zan pointed out, latencytop could also give you that information.
That might not fully answer your question, but at least you'll know with certainty what it was trying to do when it was hung.
