Can StormCrawler crawl a file system rather than URLs?

Is there a way to use StormCrawler to index files on the file system rather than URLs? We have 5+ million files that need to be crawled and indexed (with ElasticSearch), and the index needs to be updated daily or more frequently. Other crawlers take 50+ hours to crawl the full file set, which makes a daily update cycle impossible.

There is a File protocol available in StormCrawler. If you represent the files as URIs using file://, SC should be able to handle them out of the box.
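As a rough illustration of representing files as file:// URIs, here is a minimal Python sketch that walks a directory tree and writes one URI per line. It assumes the seeds are handed to the crawler as a plain text file with one URL per line; the function names and paths are hypothetical, and nothing here is part of StormCrawler itself.

```python
from pathlib import Path
from typing import Iterator


def file_uris(root: str) -> Iterator[str]:
    """Yield a file:// URI for every regular file under root."""
    # resolve() makes the path absolute, which as_uri() requires.
    for path in Path(root).resolve().rglob("*"):
        if path.is_file():
            # as_uri() percent-encodes spaces and other special characters.
            yield path.as_uri()


def write_seed_file(root: str, seed_path: str) -> int:
    """Write one URI per line to seed_path and return the number written."""
    count = 0
    with open(seed_path, "w", encoding="utf-8") as out:
        for uri in file_uris(root):
            out.write(uri + "\n")
            count += 1
    return count
```

Calling write_seed_file("/data/docs", "/tmp/seeds.txt") (both paths are placeholders) streams the URIs to disk rather than holding millions of them in memory. Keep the seed file outside the directory being walked so it does not list itself.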

Related

Limiting Kismet log files to a size or duration

Looking for a solid way to limit the size of Kismet's database files (*.kismet) through the conf files located in /etc/kismet/. The version of Kismet I'm currently using is 2021-08-R1.
The end state would be to close the database once it reaches a size limit (10MB, for example) or after X minutes of logging. Then a new database is created, connected, and written to. This process would continue until Kismet is killed. This way, rather than having one large database, there will be multiple smaller ones.
In the kismet_logging.conf file there are some timeout options, but that's for expunging old entries in the logs. I want to preserve everything that's being captured, but break the logs into segments as the capture process is being performed.
I'd appreciate anyone's input on how to do this either through configuration settings (some that perhaps don't exist natively in the conf files by default?) or through plugins, or anything else. Thanks in advance!
Two interesting ways:
One could let the old entries be taken out, but reach in with SQL and extract what you want with a time-bound query.
A second way would be to automate restarting Kismet, which is a little less elegant but seems to work.
https://magazine.odroid.com/article/home-assistant-tracking-people-with-wi-fi-using-kismet/
If you read that article carefully, there are lots of bits of interesting information in it.
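The time-bound SQL idea could look roughly like the Python sketch below. It assumes the .kismet file is a SQLite database with a packets table that has ts_sec (Unix seconds), sourcemac and destmac columns; check the schema of your own log before relying on those names. The function name is hypothetical.

```python
import sqlite3
from typing import List, Tuple


def packets_between(db_path: str, start_ts: int, end_ts: int) -> List[Tuple]:
    """Return packet rows with start_ts <= ts_sec < end_ts, oldest first."""
    conn = sqlite3.connect(db_path)
    try:
        # Assumed schema: packets(ts_sec, sourcemac, destmac, ...).
        cur = conn.execute(
            "SELECT ts_sec, sourcemac, destmac FROM packets "
            "WHERE ts_sec >= ? AND ts_sec < ? ORDER BY ts_sec",
            (start_ts, end_ts),
        )
        return cur.fetchall()
    finally:
        conn.close()
```

Running this for consecutive windows (say, every 10 minutes) and saving each result separately gives segmented output even though Kismet itself keeps writing to a single database.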

Apache Nutch: Get list of URLs and not content from the entire web

I'm very new to Apache Nutch. My goal is to start from a list of seed URLs and extract as many URLs (and sub URLs) as I can within a size limit (say no more than 1 million URLs or less than 1 TB of data) using Nutch. I do not need the content of the pages, I only need to save the URLs. Is there any way to do this? Is Nutch the right tool?
Yes, you could use Nutch for this purpose, essentially Nutch does all of what you want.
You need to parse the fetched HTML either way (in order to discover new links, and of course repeat the process). One way to go would be to dump the LinkDB that Nutch keeps into a file using the linkdb command. Or you could use the indexer-links plugin that is available for Nutch 1.x to index your inlinks/outlinks into Solr/ES.
In Nutch you control how many URLs you want to process per round, but this is hardly related to the amount of fetched data. So you'll need to decide when to stop.

Lucene.NET indexing files

What's the best way of indexing pages? I'm adding about 50/60 new pages a day to my website.
Should I index the page when it's created or run a schedule every 15 mins and index in bulk?
I would say it depends on whether you are updating the pages as well. If you can handle indexing them when they change, that would be fine, but at 50/60 pages a day it doesn't seem like your number of files would cause any problems for a scheduled index.

Image storage performance on file system with Nodejs and Mongo

My Node.js application currently stores the uploaded images to the file system with the paths saved into a MongoDB database. Each document, maybe max 2000 in future, has between 4 and 10 images each. I don't believe I need to store the images in the database directly for my usage (I do not need to track versions etc), I am only concerned with performance.
Currently, I store all images in one folder, with the associated paths stored in the database. However, as the number of documents, and hence the number of images, increases, will having so many files in a single folder slow performance?
Alternatively I could have a folder for each document. Does this extra level of folder complexity affect performance? Also, using MongoDB, the obvious folder naming scheme would be to use the ObjectID, but do folder names of that length (24 characters) affect performance? Should I be using a custom ObjectID?
Are there more efficient ways? Thanks in advance.
For simply accessing files, the number of items in a directory does not really affect performance. However, it is common to split out directories for this as getting the directory index can certainly be slow when you have thousands of files. In addition, file systems have limits to the number of files per directory. (What that limit is depends on your file system.)
If I were you, I'd just have a separate directory for each document, and load the images in there. If you are going to have more than 10,000 documents, you might split those a bit. Suppose your hash is 7813258ef8c6b632dde8cc80f6bda62f. It's pretty common to have a directory structure like /7/8/1/3/2/5/7813258ef8c6b632dde8cc80f6bda62f.
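The nested directory layout above can be sketched in Python as follows. The function name, the default depth of 6, and the root parameter are illustrative choices, not something the answer prescribes.

```python
from pathlib import PurePosixPath


def sharded_dir(name: str, depth: int = 6, root: str = "/") -> str:
    """Build a path that nests `name` under one directory per leading character."""
    if len(name) < depth:
        raise ValueError("name is shorter than the requested shard depth")
    # Each of the first `depth` characters becomes its own directory level.
    return str(PurePosixPath(root, *name[:depth], name))


# sharded_dir("7813258ef8c6b632dde8cc80f6bda62f")
# -> "/7/8/1/3/2/5/7813258ef8c6b632dde8cc80f6bda62f"
```

With hexadecimal names each level has at most 16 subdirectories, so a smaller depth is usually enough for a few thousand documents.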

Nutch Recrawl - Storing segments is necessary or not

I am deleting segments after they get indexed. How will Nutch then get the last fetch time of pages while recrawling? Do I need to store them to speed up the recrawl?
The last fetch time is maintained by crawldb and not segments. Segments are useful just from an indexing and searching perspective. Storing them in any form will NOT impact crawling rate.
