I'm new to Nutch. I've crawled a lot of websites from the internet and I want to get the HTML content of the segments. Hence, I dumped them with the following command:
./nutch mergesegs crawl/merged crawl/segments/*
and then:
./nutch readseg -dump crawl/merged/* dumpedContent
Now I have two files in dumpedContent: dump and .dump.crc.
The dump file is too big (82 GB).
How can I dump each original web page into its own file, or at least dump into smaller files?
You're getting one big file because you're merging the segments first with mergesegs. You could instead dump each individual segment into its own file.
At the moment the SegmentReader class doesn't support splitting each individual URL into a separate file, and I'm not sure that's something we would want to support; for really big crawls it would definitely be a problem. Also keep in mind that the -dump option always attaches some metadata to the crawled URL, so you're not getting only the HTML content but also some metadata. For example:
Recno:: 0
URL:: http://example.org
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Wed Oct 25 16:32:14 CEST 2017
Modified time: Thu Jan 01 01:00:00 CET 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata:
_ngt_=1508941926882
_repr_=http://example.org
_pst_=success(1), lastModified=0
_rs_=478
Content-Type=text/html
nutch.protocol.code=200
Content::
Version: -1
url: http://example.org
base: http://example.org
contentType: text/html
metadata: X-Cache=HIT Connection=close Date=Wed, 25 Oct 2017 14:30:53 GMT nutch.crawl.score=0.0 nutch.fetch.time=1508941934366 Accept-Ranges=bytes nutch.segment.name=20171025163209 Cache-Control=max-age=600 Content-Encoding=gzip Vary=Accept-Encoding,Cookie Expires=Wed, 25 Oct 2017 14:40:53 GMT Content-Length=20133 X-Cache-Hits=1 _fst_=33 Age=78 Content-Type=text/html; charset=UTF-8
Content:
...
So you'll need to post-process these files to get the raw HTML.
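If a 1 URL to 1 file split is what you need, one option is to post-process the dump yourself. Here is a minimal sketch (a hypothetical Node.js helper, not part of Nutch) that cuts the dump into one file per record on the "Recno::" marker shown above:

// split_dump.js - hypothetical helper: split a Nutch readseg dump into one file per record.
const fs = require('fs');
const path = require('path');
const readline = require('readline');

const input = process.argv[2];                 // e.g. dumpedContent/dump
const outDir = process.argv[3] || 'split';
fs.mkdirSync(outDir, { recursive: true });

const rl = readline.createInterface({ input: fs.createReadStream(input) });

let out = null;
let count = 0;
rl.on('line', (line) => {
  if (line.startsWith('Recno::')) {            // a new record begins here
    if (out) out.end();
    out = fs.createWriteStream(path.join(outDir, `record-${count++}.txt`));
  }
  if (out) out.write(line + '\n');
});
rl.on('close', () => { if (out) out.end(); });

Each record-N.txt still contains the metadata shown above, so to keep only the HTML you would additionally strip everything up to and including the final "Content:" line of each record.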
Another option is to index the content into Solr/ES with the -addBinaryContent flag, so the raw content of each page is stored in Solr/ES. The advantage is that you can then query for specific URLs and extract the data from Solr/ES into whatever format/files you want.
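For example, a command along these lines (paths are placeholders; check the usage output of bin/nutch index on your version for the exact options) would index the segments with the raw content included:
./nutch index crawl/crawldb -linkdb crawl/linkdb crawl/segments/* -addBinaryContent -base64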
Yet another option is the bin/nutch commoncrawldump feature, which lets you output the content in a different format; I'm not sure right now whether a 1 URL to 1 file mapping is possible with it.
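If you want to try it, the invocation is roughly of this form (flag names are from memory, so verify them against the tool's usage/help output):
./nutch commoncrawldump -outputDir commoncrawl_dump -segment crawl/segments/20171025163209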
Nutch's SegmentReader is a good way to dump all your HTML content into one file. This generally stores the HTML content of your starting URLs (and of their inlinks and outlinks as well).
However, if you need the pages parsed and stored separately, you may want to look into writing Nutch plugins. You can define where and what to store based on what is parsed. I recently tried this, and it works well for storing separate HTML pages in a directory. Hope this helps.
I want to create a TZ file (a compiled time zone file) for storing Unix time zone information out of a TZ string as defined in POSIX 1003.1 section 8.3.
It seems that the time zone compiler zic does not cover this.
For a test I created a file "MYZ.txt", compatible with zic, describing a fantasy time zone "MYZ = My Zone" with UTC offset +0:30 (or +1:30 in summer):
#TZ=MYZ-0:30MYSZ,M3.5.0/2,M10.5.0/3
Rule MYsummer min max - Mar lastSun 2:00w 1:00 MYSZ
Rule MYsummer min max - Oct lastSun 3:00w 0 MYZ
Zone MYZ 0:30 MYsummer %s
The commented-out first line carries the same information as the two Rule lines and the Zone line below it.
zic MYZ.txt -d /tmp does what I want and stores the time zone information into "/tmp/MYZ".
What I would like to use is something like calling
export TZ=MYZ-0:30MYSZ,M3.5.0/2,M10.5.0/3
zic -d /tmp
and getting the same time zone information in "/tmp/MYZ" as above.
Of course I could implement a tool that creates "MYZ.txt" out of the TZ environment variable.
But I guess something similar is already available among the standard UNIX/Linux tools.
This may be related to TZ Variable, custom file.
Thanks for any help,
Tom
In a public records request to a city government, I have gotten back a number of records that are .txt-format e-mails with attachments that appear to be base64. The e-mail attachments are JPEG, PDF, PNG, or DOC files in base64; a shorter example is below. The government officials claim: "The records we released to you are in the form that was available, including the file you noted. From what I have been told, it is likely garbled computer coding. We have no other version of those records."
Questions:
Does someone have to intentionally work at saving the e-mail and attachments in this way so that they are unreadable, thus making the information not public (hiding it)?
Or is this something that can plausibly happen when saving e-mails, producing "garbled computer coding"?
If it is plausible that a computer does it, how?
Is there a way of decoding it?
I have tried a number of online decoders with various settings and have been unsuccessful.
I have made a number of public records requests to this city and department in the past and have never gotten such .txt documents. This public records request concerns a city contract that is problematic.
From: "Steinberg, David (DPW)" <david.steinberg#sfdpw.org>
To: "Goldberg, Jonathan (DPW)" <jonathan.goldberg#sfdpw.org>
Sent: Fri, 17 May 2019 20:40:36 +0000
Subject: Re: SOTF - Education, Outreach and Training Committee, May 21, 2019 hearing
----boundary-LibPST-iamunique-566105023_-_-
Content-Type: image/jpeg
Content-Transfer-Encoding: base64
Content-ID: <image002.jpg@01D50CB4.2B093D10>
Content-Disposition: attachment;
filename*=utf-8''image002.jpg;
filename="image002.jpg"
/9j/4AAQSkZJRgABAQEAYABgAAD/2wBDAAoHBwkHBgoJCAkLCwoMDxkQDw4ODx4WFxIZJCAmJSMg
IyIoLTkwKCo2KyIjMkQyNjs9QEBAJjBGS0U+Sjk/QD3/2wBDAQsLCw8NDx0QEB09KSMpPT09PT09
PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT09PT3/wAARCABJAFEDASIA
AhEBAxEB/8QAHwAAAQUBAQEBAQEAAAAAAAAAAAECAwQFBgcICQoL/8QAtRAAAgEDAwIEAwUFBAQA
AAF9AQIDAAQRBRIhMUEGE1FhByJxFDKBkaEII0KxwRVS0fAkM2JyggkKFhcYGRolJicoKSo0NTY3
ODk6Q0RFRkdISUpTVFVWV1hZWmNkZWZnaGlqc3R1dnd4eXqDhIWGh4iJipKTlJWWl5iZmqKjpKWm
p6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uHi4+Tl5ufo6erx8vP09fb3+Pn6/8QAHwEA
AwEBAQEBAQEBAQAAAAAAAAECAwQFBgcICQoL/8QAtREAAgECBAQDBAcFBAQAAQJ3AAECAxEEBSEx
BhJBUQdhcRMiMoEIFEKRobHBCSMzUvAVYnLRChYkNOEl8RcYGRomJygpKjU2Nzg5OkNERUZHSElK
U1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6goOEhYaHiImKkpOUlZaXmJmaoqOkpaanqKmqsrO0tba3
uLm6wsPExcbHyMnK0tPU1dbX2Nna4uPk5ebn6Onq8vP09fb3+Pn6/9oADAMBAAIRAxEAPwDKooqa
FY47qIXiSiHcPMVRhtuecZ716ZgRojSOqIpZ2OAo6k0+G2muBIYYZJBEu+Taudi9yfSum1bwlDa6
ENa0q7mlh4lVXGGVD0ORzkcVf8ExwaxHqPnb4bsxeVLLE20SI38RHTcMHnvWTqrl5kUo62OO07T5
9VvY7S0VWmkztDNgcDJ5qB0aN2RhhlJUj3Fdp4N0LT/7YZ5bx/t9nK4FtwOBwGz3GCK5/WmhsfFl
20FvE0UNxkQtyjY6g+xOaaqXk0hWsjJorutb0XS9R8OzaxZReVdpGssiQjCklQeVPbBzkVwtVCak
gasFFKFJViMHb1Hf649KSrEFFFFAE9jMba/t5gMlJFOAAc8+/FaPiO7a5uLdXl8xUQhcyb3UFj8p
OAcgY45rKiDtKgiBMhYbQPWtb+3tRe9ubh4rWedwryu8CvgLjGOwA9qzktboaOsuNXNtbaroscAE
NtpweBm5aQbRkkdB16exrN+H979lt9UXaD8isDjvhv8ACuZe7vvtD30hcvMjAu3IKNlSOe3OPal0
++vNMhmmtSqpIuxmbB6EHgHr1/I1n7P3Wu4+bU6HwPe/a/FiMyhMWhjRe+Bjqe5965vUZHutYu3d
gXedyT0HU80+zm1DSLuNrUtFM5CqBg7j6fqOPzqNGujb3ESR5EkgEz7fmzydpPYZBOPUVajaV0K+
h1eka2n9iw2RCRpeySWZduvESqhPtkj6Zrm9F0oahrFvaXLNDC0pieQDgMATtB6ZOMfjUZurloLP
zI4mggfMSbVXcSRnOOTnHWrdtr2raf5ksMixxGd3aMouwuSM8dTjj6UlFxvbqO99zvNYtLPQ9I8j
SbexivJQY4VlUbpePmAJ6nHr1OBXljI0Z2OrKw4IYYIrS1jUNQ1q+SS+SPziuFVFAwPfnPvzVCcy
mU/aGYyYHLNk4xxz6YopQcVqEncjooorYkltYpZ7uKK3jMkzMAiDua0prTV7RZ7p7UKhA81l2uAM
Y5wScc8n9aj8Nf8AIy6d/wBdhT7ObTtJeWe2uJriZoniWPyPLX5hjLHPIGelZSbvZDRTD3Taa+Fz
aqyxlsfdbkgfU/0pmLhtObAb7Ksm1iDxvI6H8BVyAf8AFI3mOQt7CT7DYwoi48J3BPRr6MA+uI2z
/Onf8wJY7PWLwQ3aQbwQxicsgyDxwCarJJfx6k8AiK3ckmxomTBLEYxj8fz5q5fLpp07R/t5ut/2
T/liExt8x/XvVmYs/jTTJsgxTNbvCec+XwF3Z/i45qFIdjAeR1Cxui74TtDEfMMHp6dalZrk2Ekp
j/0eaUqZP9v7xA54/L8ahu2H2245H+ufv/tGtF/+RPj/AOv9v/RYrTohDLW0v9QY3VvbIVyVeRiE
RyRyCWIBP0qrfQ3FvcmK6h8mRVACY4C44x7e9XfEHXT1X/j1FnGYB26fP+O7Oays5A/SiOuoMKKK
KsRZ037UNSt/sH/H1vHldPvfjxSrpl42pf2etuxu9xTysjOev06VBDM9tPHPEcSRMHU+4Oa9EuYo
7bU7jxVGoMJ04Sx+hlPAH5Y/Ospz5WUlc4rT49UtNVeys4z9rcmKSBgrB8ckEHg9M1M1rrWu3TWq
WxkNqShiiVUjiPfp8ueK68Qpbajd+Kto8l9PWWMjp5jDBH6D86xNRnmtfh9phtHZEuZWa6kQ4LOS
eCfr/IVCnd6IdjPvX1vQoIIL62SOKNSsTSwRyDGScBsH1JxUo0HxNcXceo/ZJZJsrIkhdO33cDPT
2qzo80t54H16O9dpLWFA0LOc7XxnAJ98fnWvrmnxajcaLE2s/YJntkVIwrZc8cgggfnSc7OwWuc4
JtdvbybT/skRuirGSM2sSNjuckD161QtLbUL+xlt7aMyW1sxnk5UBDjGSx9h0ruLW8W9+IrRKrqb
Wza3Z5B8zkEfN+tZus29uPB8kOg3HmW1pcEX2BzKf7xPcA/h+VCnrawWMjRLXXruyK6dai4tAxIW
ZEZA3fbu7/Ss/V1v0vyuqIyXAUDaygYXtgDjH0rcsta0q98PWmlandXWnvbE7JoSdjdeTj61meJd
KuNJ1JEuLo3ayxh4pyTll6DP0q4v3tRPYyKKKK2JCtSTxDeyeH00dtn2ZCCDg7sA5xnPTNZdFJpP
cDUfxDeyeH00djH9mQ5BAO4jOcE56Zo0nxFeaRFJBGIp7WQ5eCddyE+vtWXRS5I2tYd2a2reI7zV
rVbRkhtrRTkQW6bVJ9T61DqWt3WqS2kkwRHtUCRmMEcA5BPPXis+ihQS6Cuzb/4Su9GtnVRDbC5M
XlNhTtYepGetVNK1q50h7gwLHIlyhSWOUEqw+mfc/nWfRS5I7WHdm5YeKrixso7RrKxuIYs+WJos
lec9aoarq11rV59pvGUuF2qqjCovoBVKimoRTuhXYUUUVQH/2Q==
----boundary-LibPST-iamunique-566105023_-_---
I got the image by copying your base64 data to a file and
base64 -d file > image.jpeg
on my Debian/Linux.
RFC 2045 section 6 says binary data are to be encoded into US-ASCII characters, so it's quite normal that images are encoded with base64, even though a GUI email reader would not show you the raw data.
When you see such raw data, it means somebody may have exported or copied the e-mails blindly, though that is less likely to happen. (Obviously your example is part of a multipart message, but it is not complete.)
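If you want to script the extraction instead, here is a minimal Node.js sketch (the file names are placeholders; it assumes the attachment body consists of lines containing only base64 characters following the Content-Transfer-Encoding: base64 header, as in your example):

const fs = require('fs');

// Read the released .txt record and pull out the first base64-encoded attachment.
const lines = fs.readFileSync('record.txt', 'utf8').split(/\r?\n/);

const start = lines.findIndex(l => /^Content-Transfer-Encoding:\s*base64/i.test(l));
if (start === -1) throw new Error('no base64 part found');

const b64 = [];
for (let i = start + 1; i < lines.length; i++) {
  const line = lines[i].trim();
  if (/^[A-Za-z0-9+/]+={0,2}$/.test(line)) {
    b64.push(line);                    // part of the encoded body
  } else if (b64.length > 0) {
    break;                             // body ended (e.g. at the MIME boundary)
  }
}

// The filename comes from the Content-Disposition header in your example.
fs.writeFileSync('image002.jpg', Buffer.from(b64.join(''), 'base64'));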
Decoding services are also available on the web, for example here.
I'd like to use ELK to analyze and visualize our GxP logs, created by our stone-old LIMS system.
At least the system runs on SLES, but the whole logging structure is a bit of a mess.
I'll try to give you an impression:
Main_Dir
| Log Dir
|   | Large number of sub dirs with a lot of files in them, some of which may be of interest later
| Archive Dir
|   | [some dirs which I'm not interested in]
|   | gpYYMM   <-- sub dirs created automatically each month: YY = year; MM = month
|   |   | gpDD.log   <-- log file created automatically each day
|   | [more dirs which I'm not interested in]
Important: Each medical examination that I need to track is completely logged in the gpDD.log file that corresponds to the date of the order entry. The duration of the complete examination varies between minutes (if no material is available), several hours or days (e.g. 48 h for a Covid-19 examination), or even several weeks for a microbiological sample. Example: all information about a Covid-19 sample that reached us on December 30th is logged in ../gp2012/gp30.log, even if the examination was performed on January 4th and the validation/creation of the report was finished on January 5th.
Could you please give me some guidance on the right Beat to use (I guess either logbeat or filebeat) and on how to implement the log transfer?
Logstash file input:
input {
file {
path => "/Main Dir/Archive Dir/gp*/gp*.log"
}
}
Filebeat input:
- type: log
paths:
- /Main Dir/Archive Dir/gp*/gp*.log
In both cases the glob path works. However, if you need further processing of the lines, I would suggest using at least Logstash as a passthrough (with the beats input if you do not want to install Logstash on the source machine itself, which is understandable).
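For example, a minimal Logstash passthrough pipeline fed by Filebeat could look like the sketch below (the port, hosts, and index name are assumptions to adjust for your environment):
input {
  beats {
    port => 5044
  }
}
filter {
  # grok/date filters for the gpDD.log line format would go here
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "lims-logs-%{+YYYY.MM.dd}"
  }
}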
The problem I face is simple: I want to keep track of a file/folder even after it has been renamed, deleted, etc. Does NodeJS provide a way to access this kind of information for a file? I've tried the default file system module's fs.stat() and the fs.Stats class: https://nodejs.org/api/fs.html#fs_class_fs_stats . Unfortunately, it does not seem to provide such a unique ID for referencing a particular file.
Does there exist such a solution in NodeJS?
Note: I DO NOT want to generate a unique ID for a file. It's pretty easy to assign a random string to a file and associate the string with it, but I want it the other way around: I want to associate a file with some sort of system-wide identifier.
Any help is appreciated.
Looking at the link https://nodejs.org/api/fs.html#fs_class_fs_stats
Stats {
dev: 2114,
ino: 48064969,
mode: 33188,
nlink: 1,
uid: 85,
gid: 100,
rdev: 0,
size: 527,
blksize: 4096,
blocks: 8,
atime: Mon, 10 Oct 2011 23:24:11 GMT,
mtime: Mon, 10 Oct 2011 23:24:11 GMT,
ctime: Mon, 10 Oct 2011 23:24:11 GMT,
birthtime: Mon, 10 Oct 2011 23:24:11 GMT }
I can see the Unix inode number.
Can two files have the same inode number?
Two files can have the same inode number, but only if they are on different partitions.
Inodes are unique only at the partition level, not across the whole system.
Thus, in addition to the inode number, one also compares the device number.
uniqueFileId[fileName] = stats.dev + '-' + stats.ino;  // combine device and inode as a string key; a numeric sum could collide
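As a minimal sketch (the helper name getFileKey is mine, not part of any Node API), you could build such a key with fs.statSync:

const fs = require('fs');

// Hypothetical helper: build a (device, inode) key for a file.
// Note: an inode is only stable while the file exists on the same filesystem;
// it can be reused after the file is deleted.
function getFileKey(filePath) {
  const stats = fs.statSync(filePath);
  return `${stats.dev}-${stats.ino}`;
}

console.log(getFileKey('/etc/hostname')); // e.g. "2114-48064969"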
We have to identify and extract the date from the given samples:
Oct 4 07:44:45 cli[1290]: PAPI_Send: To: 7f000001:8372 Type:0x4 Timed out.
Oct 4 08:16:01 webui[1278]: USER:admin@192.168.100.205 COMMAND:<wlan ssid-profile "MFI-SSID" > -- command executed successfully
Oct 4 08:16:01 webui[1278]: USER:admin@192.168.100.205 COMMAND:<wlan ssid-profile "MFI-SSID" opmode opensystem > -- command executed successfully
The main problem here is that the date format varies: it may be "Oct 4 2004" or "Oct/04/2004", etc.
Parsing is the best way to handle such problems, so learn about parsing techniques and then apply them in your project. See also: Appropriate design pattern for an event log parser?
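As a starting point, here is a minimal Node.js sketch that pulls the date out of the syslog-style lines shown above (it only covers the "Oct 4 07:44:45" format; other formats such as "Oct/04/2004" would need additional patterns):

// Extract syslog-style timestamps ("Oct 4 07:44:45") from log lines.
const lines = [
  'Oct 4 07:44:45 cli[1290]: PAPI_Send: To: 7f000001:8372 Type:0x4 Timed out.',
  'Oct 4 08:16:01 webui[1278]: USER:admin@192.168.100.205 COMMAND:<wlan ssid-profile "MFI-SSID" > -- command executed successfully',
];

const syslogDate = /^([A-Z][a-z]{2})\s+(\d{1,2})\s+(\d{2}:\d{2}:\d{2})/;

for (const line of lines) {
  const m = line.match(syslogDate);
  if (m) {
    console.log({ month: m[1], day: m[2], time: m[3] });
  }
}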