Viewing data in nutch crawl/segment folder - nutch

I have a question about viewing the data in the crawldb/segments folder. I see there is a content/part-00000 folder inside the segment folder. How do I dump (or view) the data?
This is what I see when I type Esc :%!xxd in the binary file (I removed the hex codes):
SEQ.org.apache.hadoop.io.Text
org.apache.nutch.parse.ParseText.
.org.apache.hadoop.io.compress.
DefaultCodec http://localhost:8001/a.html
and more characters like this.
It does not make much sense. This does not look like the data I have on the local page. Is there another way of looking at this or should I be looking at a different place?

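Run the following command from the Nutch home directory:
bin/nutch readseg -dump crawl/segments/your_segment output -nofetch -noparse -noparsetext
This writes a plain-text dump of the segment into the output directory; each -no... flag leaves that part of the segment out of the dump.
To see which commands Nutch provides, run it without arguments:
bin/nutch
I hope that helps.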
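If there are several segments, the same command can be looped over all of them. This is only a sketch: the function name dump_segments and its three arguments are made up for illustration, and it assumes the usual crawl/segments/<timestamp> layout. The extra -nogenerate and -noparsedata flags are my choice to keep only the raw content; drop them if you want those parts in the dump too.

```shell
# dump_segments NUTCH_HOME CRAWL_DIR OUT_DIR   (hypothetical helper)
# Runs "readseg -dump" once per segment and keeps only the raw content part.
dump_segments() {
    nutch_home=$1
    crawl_dir=$2
    out_dir=$3
    for seg in "$crawl_dir"/segments/*/; do
        # If there are no segments the glob stays literal, so skip it.
        [ -d "$seg" ] || continue
        name=$(basename "$seg")
        "$nutch_home/bin/nutch" readseg -dump "${seg%/}" "$out_dir/$name" \
            -nofetch -nogenerate -noparse -noparsedata -noparsetext
    done
}
```

Called as dump_segments . crawl output from the Nutch home directory, this should leave one dump directory per segment under output.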

Related

Is there a Linux command line utility for getting random data to work with from the web?

I am a Linux newbie and I often find myself working with a bunch of random data.
For example: I would like to work on a sample text file to try out some regular expressions or read some data into gnuplot from some sample data in a csv file or something.
I normally do this by copying and pasting passages from the internet, but I was wondering whether some combination of commands would let me do this without leaving the terminal. I was thinking about using something like the curl command, but I don't exactly know how it works...
To my knowledge there are websites that host content. I would simply like to access them and store their content on my computer.
In conclusion, and as a concrete example: how would I copy a random passage from a website and store it in a file on my system using only the command line? Maybe you can point me in the right direction. Thanks.
You could redirect the output of a curl command into a file e.g.
curl https://run.mocky.io/v3/5f03b1ef-783f-439d-b8c5-bc5ad906cb14 > data-output
Note that I've mocked data in Mocky which is a nice website for quickly mocking an API.
I normally use "Project Gutenberg" which has 60,000+ books freely downloadable online.
So, if I want the full text of "Peter Pan and Wendy" by J.M. Barrie, I'd do:
curl "http://www.gutenberg.org/files/16/16-0.txt" > PeterPan.txt
If you look at the page for that book, you can see how to get it as HTML, plain text, ePUB or UTF-8.
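To get a random passage rather than the whole file, the download can be combined with shuf and sed. A minimal sketch, assuming GNU coreutils for shuf; the function names fetch_text and random_passage are made up for illustration.

```shell
# fetch_text URL FILE: download URL into FILE.
# -f fails on HTTP errors, -sS hides the progress bar but keeps error
# messages, -L follows redirects.
fetch_text() {
    curl -fsSL "$1" -o "$2"
}

# random_passage FILE N: print N consecutive lines of FILE, starting at a
# randomly chosen line.
random_passage() {
    file=$1
    count=$2
    total=$(wc -l < "$file")
    # A file with N lines or fewer is printed whole.
    if [ "$total" -le "$count" ]; then
        cat "$file"
        return
    fi
    # Choose the start so the passage never runs past the end of the file.
    start=$(shuf -i "1-$((total - count + 1))" -n 1)
    sed -n "${start},$((start + count - 1))p" "$file"
}
```

For example, fetch_text "http://www.gutenberg.org/files/16/16-0.txt" PeterPan.txt followed by random_passage PeterPan.txt 20 > sample.txt should leave a 20-line excerpt in sample.txt.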

How to add comments to folder in linux and view them with mouse cursor

I run simulations for various choices of parameters. For each choice I store the resulting data in a folder, like
/home/me/Documents/MyProject/C=10/1.dat
/home/me/Documents/MyProject/C=10/2.dat
/home/me/Documents/MyProject/C=10/3.dat
...
and
/home/me/Documents/MyProject/C=20/1.dat
/home/me/Documents/MyProject/C=20/2.dat
/home/me/Documents/MyProject/C=20/3.dat
...and so forth.
I would like to write a little text file AAA.txt which contains not just the C parameter but all the others too. Then, when viewing the folder which contains the data, I want to hover my cursor over the little file symbol and have a little box appear. This box should show just the content of AAA.txt, so I can quickly check which set of parameters was used in that particular run.
Anyone know how to do this? I use Ubuntu 14.04
I am not aware of a way to give you a custom "tooltip". As an alternative, you could look into creating custom thumbnails for your .dat files.
See here for how to do that with Nautilus, the default file browser for Ubuntu.
Alternatively, you might look into what Gloobus can do for you.

vi text search with exclude (linux)

I have a huge log file (25.3 million lines) with errors and I want to search it without holding down the up button. How can you search this with vi? I've seen similar questions for searching across directories but not within a particular file. At the moment I've searched for a date, e.g. /Tue 15 Jan
and navigated down, but I'm not sure when the problems may have begun (only when they were noticed), so I need a general search.
My general idea from within the file would be /ORA-/!ORA-00020
meaning that I want to find those strings containing "ORA-" but ignoring those that are "ORA-00020". Any idea how this can be done viewing a particular file?
Thanks for any info
You can do
grep -v "ORA-00020" error.log > error2.log
and then search error2.log.
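To keep only the lines that contain some ORA- code while dropping the ORA-00020 ones, two greps can be chained. A small sketch; ora_errors is a made-up function name. Note that a line mentioning both ORA-00020 and another code is dropped as well.

```shell
# ora_errors FILE: print, with line numbers, every line that contains an
# ORA- code, skipping lines that mention ORA-00020.
ora_errors() {
    grep -n 'ORA-' "$1" | grep -v 'ORA-00020'
}

# Inside Vim (not classic vi) a negative lookahead gives a similar search:
#   /ORA-\(00020\)\@!
```

The line numbers from grep -n can then be used in the editor, e.g. :12345 jumps to line 12345.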
I don't know the exact way to do this in vi, but there are other ways to get the result:
You can use cat/grep and write the matches to another file (so you have a much smaller file) and then search in that, if that is acceptable.
You can use vim instead of vi, together with the TagList Vim plugin.
Additional info at:
http://www.thegeekstuff.com/2009/04/ctags-taglist-vi-vim-editor-as-sourece-code-browser/
http://vim-taglist.sourceforge.net/faq.html

How do I view a log file generated by screen (screenlog.0)

So I just found out I can create log files of everything I do in screen (C-a H). Sounds like a nice way to keep track of potential goofs in a particular screen session. However, when I went to try it out the logfile is reported as being a binary file (and can't be viewed like a regular text as such). So am I missing something? A quick man page looksee and searching Google (and SO) turns up nothing about this.
So my question is: How do I generate plain text log files in screen?
Assuming the answer is "What a noob... how about you try making them? RTFM." my question becomes: How do I use less to view screen logfiles I've created (since less screenlog.0 does not work on a binary file)?
EDIT: So cat works fine but less complains that the file is binary... why?
SOLUTION: as jcomeau_ictx helpfully pointed out, you can view these logfiles fine with cat or more, but with less you must add the -r flag: less -r screenlog.0
I just found a screenlog.0 on the net; it is plain text, with some escape sequences. Just 'cat' the file, you should be able to view it just fine.
[after more checking]
Control-A H is what generates the screenlog on my system. And though 'cat' works, you'll miss a lot of data. Use 'more' instead of 'less' to interpret the escape codes.
I found neither less nor more nor cat to be an ideal solution for viewing screenlog files. All of them "replay" some of the control characters, so that e.g. screen deletions as produced by "clear" (I don't remember the corresponding control character) are carried out again, hiding what has been cleared.
What I know works great is "view" or "vi": it just shows the control characters in escaped notation. Probably any other text editor works too (not tested).
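Another option is to strip the escape sequences before viewing, so that plain less works. A rough sketch: strip_ansi is a made-up name, and it only removes the common "ESC [ ... letter" sequences (colours, cursor movement, clear screen) plus trailing carriage returns, so other control codes may remain.

```shell
# strip_ansi FILE: print FILE with ANSI CSI escape sequences and trailing
# carriage returns removed.
strip_ansi() {
    esc=$(printf '\033')   # the ESC character
    cr=$(printf '\r')      # carriage return
    sed -e "s/${esc}\[[0-9;?]*[A-Za-z]//g" -e "s/${cr}\$//" "$1"
}
```

Usage would be something like strip_ansi screenlog.0 | less.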
Start screen with -L to log to a file,
then use tail -f 'logfilename' to monitor that file.

How to download images from "wikimedia search result" using wget?

I need to mirror every image that appears on this page:
http://commons.wikimedia.org/w/index.php?title=Special:Search&ns0=1&ns6=1&ns12=1&ns14=1&ns100=1&ns106=1&redirs=0&search=buitenzorg&limit=900&offset=0
The mirror result should give us the full size images, not the thumbnails.
What is the best way to do this with wget?
UPDATE:
I update the solution below.
Regex is your friend, my friend!
Using grep and wget you'll get this task done pretty fast.
Download the search results page with wget, then run
grep -oP 'class="searchResultImage".+?href="\K[^"]+\.jpg' DownloadedSearchResults.html
(-P needs GNU grep. Plain egrep does not support lookbehind, so \K is used here to drop everything before the link from the match.)
That should give you http://commons.wikimedia.org/ based links to each image's web page. Now, for each of those results, download it and run:
grep -oP 'class="fullImageLink".*?href="\K[^"]+\.jpg' DownloadedSearchResult.jpg
That should give you a direct link to the highest resolution available for that image.
I'm hoping your bash knowledge will do the rest. Good luck.
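If your grep has no PCRE support, the link extraction can also be sketched with awk. This assumes, as the answer above does, that the class attribute appears before the href on the same line; extract_href is a made-up name, and only the first href after the class is taken from each line.

```shell
# extract_href CLASS: read HTML on stdin and, for each line containing
# class="CLASS", print the value of the first href="..." that follows it.
extract_href() {
    awk -v cls="class=\"$1\"" '{
        i = index($0, cls)
        if (i == 0) next
        rest = substr($0, i + length(cls))
        j = index(rest, "href=\"")
        if (j == 0) next
        rest = substr(rest, j + 6)    # skip the href= prefix and its quote
        k = index(rest, "\"")
        if (k == 0) next
        print substr(rest, 1, k - 1)
    }'
}

# Usage sketch (needs network access; page.html is a saved image page):
#   extract_href fullImageLink < page.html | wget -i -
```

Looping this over every image page found in the search results would complete the mirror.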
Came here with the same problem .. found this >> http://meta.wikimedia.org/wiki/Wikix
I don't have access to a linux machine now, so I didn't try it yet.
It is quite difficult to write the whole script in the Stack Overflow editor, so you can find it at the address below. The script only downloads the images on the first page, but you can modify it to automate the download for the other pages.
http://pastebin.com/xuPaqxKW
