Using CrawlDbReader to read Nutch Crawl Data

I am using Nutch 1.4 to implement a focused crawler. Can anyone tell me how to use the Nutch CrawlDbReader, LinkDbReader and SegmentReader APIs in my JSP program so that I can create a custom UI for my project?
Specifically, I need to issue commands like readdb and readseg against the crawl data and get the output through a browser.

Is there something special about these APIs that makes this more than a "pass data from server to client" issue?
You can use the APIs to get the data. Just look at how the nutch launcher script (bin/nutch) invokes them and how each class's main() is built, and do something similar. Then pass the data to the client as XML, as JSON, or in any other way.

Related

How to download a file from a website using a Logic App?

How are you doing?
I'm trying to download an Excel file from a website (specifically DataCamp) so that I can use its data in an automated process, but you have to sign in on the page before you can get the file. I was thinking this would be possible with a JSON query in the HTTP action, but to be honest I don't know where to start (I'm new to Azure).
The process I need to emulate to extract the file would be as follows (I know this could be done with an API or RPA, but I don't have either available for now):
Could you give me some advice on how to get the desired result, or at least where to look? Is this even possible?
Best regards.

If you don't have other options (e.g. the source being available over SFTP), then using an HTTP action should work: pass the body to your next action (e.g. you might want to persist it to a blob if the content is binary).
If your content is "readable" (e.g. JSON or CSV) and you want to load it for processing, then for large files you need to make sure you read it in chunks and load it completely before processing.
Detailed explanation at https://learn.microsoft.com/en-us/azure/logic-apps/logic-apps-handle-large-messages#download-content-in-chunks
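
To make the chunking idea concrete: chunked HTTP downloads generally rely on Range requests, where each request asks for one slice of the file. Below is a minimal sketch in JavaScript of how a file of known size splits into Range header values. This is not Logic Apps code; the helper name byteRanges and the sizes are made up for illustration, and as far as I understand the linked page, Logic Apps handles this for you when the endpoint supports partial content.

```javascript
// Hypothetical helper: split a download of totalBytes into HTTP Range header
// values, each covering at most chunkBytes bytes.
function byteRanges(totalBytes, chunkBytes) {
  if (!Number.isInteger(totalBytes) || totalBytes < 0) {
    throw new RangeError("totalBytes must be a non-negative integer");
  }
  if (!Number.isInteger(chunkBytes) || chunkBytes <= 0) {
    throw new RangeError("chunkBytes must be a positive integer");
  }
  const ranges = [];
  for (let start = 0; start < totalBytes; start += chunkBytes) {
    // Range end offsets are inclusive, hence the "- 1".
    const end = Math.min(start + chunkBytes, totalBytes) - 1;
    ranges.push(`bytes=${start}-${end}`);
  }
  return ranges;
}

console.log(byteRanges(2500, 1000));
// [ 'bytes=0-999', 'bytes=1000-1999', 'bytes=2000-2499' ]
```

Each value would go into the Range header of one GET request; the responses are then concatenated in order before the content is processed.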

How to render an XML page with data in Node.js

I'm building an app, and one of the client's requirements is a page built in XML instead of HTML.
In Node.js, when we want to render a new page we use res.render('/pageName', {pageVariable: variableFromNodeJs}); I tried the same with XML, but it didn't work.
Do you know why, and how I can make it work?

What does the XML look like? If it looks like XHTML, then you can send it directly to the browser "as is", and the browser will handle it fine. If it's your own XML vocabulary, for example
<purchaseOrder>
<orderNumber>17</orderNumber>
<supplierName>Amazon</supplierName>
<value currency="SGD">1600</value>
</purchaseOrder>
then you're going to have to supply a stylesheet to render it. Usually people use XSLT for this (though in simple cases it can also be done with CSS). You can apply the XSLT stylesheet either on the server or in the browser. If you want to use the latest version of XSLT (3.0), then the Saxon-JS implementation [disclaimer: my company's product] is available both on Node.js and in the browser.
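
For the sending side, here is a minimal sketch of building that purchaseOrder document from server-side data and returning it as XML. It assumes Express is what provides res.render in the question; the function names escapeXml and renderPurchaseOrder and the field names on the order object are invented for this example.

```javascript
// Escape the five characters that are special in XML text and attribute values.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build the XML document as a string from plain data.
// The shape of `order` (orderNumber, supplierName, value, currency) is illustrative.
function renderPurchaseOrder(order) {
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    "<purchaseOrder>",
    `  <orderNumber>${escapeXml(order.orderNumber)}</orderNumber>`,
    `  <supplierName>${escapeXml(order.supplierName)}</supplierName>`,
    `  <value currency="${escapeXml(order.currency)}">${escapeXml(order.value)}</value>`,
    "</purchaseOrder>",
  ].join("\n");
}

// In an Express route handler (assumption: the app uses Express):
//   res.type("application/xml").send(renderPurchaseOrder(order));

console.log(renderPurchaseOrder({
  orderNumber: 17, supplierName: "Amazon", value: 1600, currency: "SGD",
}));
```

Setting the content type matters here: without it the browser may treat the response as HTML. Rendering it nicely still needs a stylesheet, as described above.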

Couchdb and dhtmlx library

I have created a small database (CouchDB) and a web page (HTML5 Boilerplate). My end goal is to have the user click a button that retrieves a particular view, which will then be rendered as a table using the dhtmlx library (http://dhtmlx.com/).
At this point I have the page initializing the table (grid) on page load. I am trying to load the data into the table using 'mygrid.load(url,"json")'. The documentation doesn't provide an example of url, but I'm assuming it would be the CouchDB URL of the view. In my case that is: 127.0.0.1:5984/mydata/_design/mydata/_view/details. If I open that URL in a browser, I see the data in JSON format.
{"total_rows":14,"offset":0,"rows":[
{"id":"90e77126ce592105891eba2bd4000143","key":"An","value":"addition to others"},
{"id":"90e77126ce592105891eba2bd4001106","key":"Changed","value":"Directories."},
. . .
{"id":"83001c900adeefe50928a24b98001733","key":"Yeah","value":"CSS kind of working. Guess I have express 3.0"}
]}
Needless to say:
mygrid.load("http://127.0.0.1:5984/mydata/_design/mydata/_view/details","json")
doesn't work. So:
a) Any ideas what I might be doing wrong?
b) Are there better libraries for what I'm trying to do with the grid? dhtmlx seems to be oriented to xml files, but it's what I was given.

Also check whether your HTML is served from http://127.0.0.1:5984. If it is not served from that address and port, then your JavaScript will not be able to issue requests to http://127.0.0.1:5984 at all because of the same-origin policy.
So you either have to serve your HTML from CouchDB directly or use a proxy so that both appear to be served from the same host and port.
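
A quick way to see what "same origin" means in practice: two URLs share an origin only if scheme, host and port all match. The sketch below uses the standard URL class; the helper name isSameOrigin and the port 8000 page URL are made up for illustration.

```javascript
// Two URLs have the same origin when scheme, host and port are all identical.
function isSameOrigin(pageUrl, requestUrl) {
  return new URL(pageUrl).origin === new URL(requestUrl).origin;
}

const viewUrl = "http://127.0.0.1:5984/mydata/_design/mydata/_view/details";

// Page served by some other local web server (hypothetical port 8000): blocked.
console.log(isSameOrigin("http://127.0.0.1:8000/index.html", viewUrl)); // false

// Page served by CouchDB itself, e.g. as an attachment of the design doc: allowed.
console.log(isSameOrigin("http://127.0.0.1:5984/mydata/_design/mydata/index.html", viewUrl)); // true
```

Note that a page opened straight from disk (a file:// URL) also counts as a different origin.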

It looks like dhtmlx supports JSON initialization:
http://www.dhtmlx.com/docs/products/dhtmlxGrid/samples/12_initialization_loading/09_init_grid_json.html
You will probably need to write some custom JavaScript to massage the CouchDB view output into a format that the Grid initializer supports.
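
As a rough sketch of that massaging step: the function below turns the view output shown in the question into a {rows: [{id, data: [...]}]} object, which is the JSON shape I believe dhtmlxGrid expects (check the linked sample to confirm). The function name couchViewToGridJson and the column order (key, then value) are assumptions.

```javascript
// Convert a CouchDB view response into a {rows: [{id, data: [...]}]} object.
// Assumption: the grid has two columns, filled from the view's key and value.
function couchViewToGridJson(view) {
  return {
    rows: view.rows.map((row) => ({
      id: row.id,
      data: [row.key, row.value],
    })),
  };
}

const sampleView = {
  total_rows: 2,
  offset: 0,
  rows: [
    { id: "90e77126ce592105891eba2bd4000143", key: "An", value: "addition to others" },
    { id: "90e77126ce592105891eba2bd4001106", key: "Changed", value: "Directories." },
  ],
};

console.log(JSON.stringify(couchViewToGridJson(sampleView), null, 2));

// In the browser, assuming `mygrid` is already initialized with two columns:
//   fetch(viewUrl)
//     .then((response) => response.json())
//     .then((view) => mygrid.parse(couchViewToGridJson(view), "json"));
```

One thing to watch: if a single document emits several rows in the view, the row ids will repeat, so you may need to build a unique id (for example from id plus key).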

Enabling search for my site that uses rss

My site is here.
The main Flash app uses RSS (or XML) to display data. I'm wondering how I can add search functionality to it. One idea is to create multiple custom RSS feeds, one for each filter and search query, but I think that would be a nightmare when adding more data later on. So I'm wondering if there's another way to do it.
The RSS feed is located here. My site is hosted at edicy.com, and I can't install any server-side extensions; I can only use XHTML, XML, HTML and JavaScript.

Index your data using a search engine like Solr or Sphinx, then have your Flash app post a query to that server and retrieve the results as XML.
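
If you go the Solr route, the query itself is just an HTTP GET against the select handler. Here is a minimal sketch of building such a URL; the host, the core name "feeds" and the helper name buildSolrQueryUrl are placeholders, not a real endpoint.

```javascript
// Build a Solr select URL that returns results as XML.
// baseUrl and core are placeholders for wherever the search server is hosted.
function buildSolrQueryUrl(baseUrl, core, userQuery, rows = 10) {
  const params = new URLSearchParams({
    q: userQuery,       // the search expression typed by the user
    rows: String(rows), // maximum number of results to return
    wt: "xml",          // response format
  });
  return `${baseUrl}/solr/${encodeURIComponent(core)}/select?${params}`;
}

console.log(buildSolrQueryUrl("http://search.example.com:8983", "feeds", "title:flash games"));
```

URLSearchParams takes care of encoding the query, so user input with spaces or colons stays intact on the server side.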

How do I run server-side code from couchdb?

Couchdb is great at storing and serving data, but I'm having some trouble getting to grips with how to do back-end processing with it. GWT, for example, has out of the box support for synchronous and asynchronous call backs, which allow you to run arbitrary Java code on the server. Is there any way to do something like this with couchdb?
For example, I'd like to generate and serve a PDF file when the user clicks a button in a web app. Ideally the workflow would look something like this:
User enters some data
User clicks a generate button
A call is made to the server, and the PDF is generated server side. The server code can be written in any language, but preferably Java.
When PDF generation is finished, the user is prompted to download and save the document.
Is there a way to do this with out of the box couchdb, or is some additional, third-party software required to communicate between the web client and backend data processing code?
EDIT: Looks like I did a pretty poor job of explaining my question. What I'm interested in is essentially serving servlets from CouchDB, similarly to the way that you can serve Java servlets alongside web pages from a war file. I used GWT as an example because it has support for developing the servlets and client-side code together and compiling everything into a single war file. I'd be very interested in something like this because it would make deploying fully functional websites a breeze through CouchDB replication.
By the looks of it, however, the answer to my question is no, you can't serve servlets from couchdb. The database is set up for CRUD style interactions, and any servlet style components need to either be served separately, or done by polling the db for changes and acting accordingly.

Here's what I would propose as the general workflow:
When user clicks Generate: serialize the data they've entered and any other relevant metadata (e.g. priority, username) and POST it to couchdb as a new document. Keep track of the _id of the document.
Code up a background process that monitors couchdb for documents that need processing.
When it sees such a document, have it generate the PDF and attach it to that same couch doc.
Now back to the client side. You could use ajax polling to repeatedly GET the couch doc and test whether it has an attachment or not. If it does, then you can show the user the download link.
Of course the devil is in the details...
Two ways your background process(es) can identify pending documents:
Use the _changes API to monitor for new documents with _rev beginning with "1-"
Make requests on a couchdb view that only returns docs that do not have an "_attachments" property. When there are no documents to process it will return nothing.
Optionally: If you have multiple PDF-making processes working on the queue in parallel, you will want to update the couch doc with a property like {"being-processed":true} and filter these out of the view as well.
Some other thoughts:
I do not recommend using the couchdb externals API for this use case because it (basically) means couchdb and your PDF-generating code must be on the same machine. But it's something to be aware of.
I don't know a thing about GWT, but it doesn't seem necessary to accomplish your goals. Certainly CouchDB can serve any static files (js or other) you want either as attachments to docs in a db or from the filesystem. You could even eval() JSON properties you put into couch docs. So you can use GWT to make ajax calls or whatever but GWT can be completely decoupled from couchdb. Might be simpler that way.
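
The workflow above can be sketched in a few small functions. This is only an illustration under assumptions: the names isPending, attachmentUrl and waitForPdf are invented, getDoc stands for whatever function you use to GET the couch doc as JSON, and the interval and retry counts are arbitrary.

```javascript
// A doc still needs work if it has no attachment yet and no worker has claimed it.
function isPending(doc) {
  const hasAttachment = Object.keys(doc._attachments || {}).length > 0;
  return !hasAttachment && doc["being-processed"] !== true;
}

// CouchDB serves an attachment at {db url}/{doc id}/{attachment name}.
function attachmentUrl(dbUrl, docId, name) {
  return `${dbUrl}/${encodeURIComponent(docId)}/${encodeURIComponent(name)}`;
}

// Client-side polling: call getDoc repeatedly until the doc has an attachment.
// Resolves to the attachment name, or null if it never showed up.
async function waitForPdf(getDoc, { intervalMs = 2000, maxTries = 30 } = {}) {
  for (let attempt = 0; attempt < maxTries; attempt++) {
    const doc = await getDoc();
    const names = Object.keys(doc._attachments || {});
    if (names.length > 0) return names[0];
    // Not ready yet: wait before asking again.
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
  return null;
}

// Browser usage (docUrl is the URL of the document that was POSTed earlier):
//   const name = await waitForPdf(() => fetch(docUrl).then((r) => r.json()));
//   if (name) showDownloadLink(docUrl + "/" + encodeURIComponent(name));
```

The background process would apply the same isPending test (or the equivalent view) to decide which documents to pick up. showDownloadLink in the usage comment is a stand-in for your own UI code.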

GWT has two parts to it. One is the client code, which the GWT compiler translates from Java to JavaScript, and the other is a servlet if you do any RPC. Typically you would run your client code in a browser, and when you make RPC calls you contact a Java servlet engine (such as Tomcat or Jetty), which in turn calls your persistence layer.
GWT does have the ability to do JSON requests over HTTP and coincidentally, this is what CouchDB uses. So in theory it should be possible. (I do not know if anybody has tried it). There would be a couple of issues.
CouchDB would need to serve up the .js files that have the compiled GWT client code.
The main issue I see in your case is that couchDB would need to generate your PDF files, while couchDB is just a storage engine and does not typically do any processing. I guess you could extend it if you are any good with the Erlang programming language.
