Crawl desired/precise data from a site using Nutch and Solr

While running Nutch, I am getting all the data, which is not what I want. I want to fetch data according to div class and div id; that is, I do not want to fetch all the data, only the data I need. Is it possible?

Related

Python: Web scraping, connecting multiple links from the same page

I am looking to extract data from all the "Reactions" on the webpage http://www.genome.jp/dbget-bin/www_bget?cpd:C10453
The code, when executed, should get data from the fields Name, Formula, Reaction, and Pathway. Next, it should open all 3 reactions and collect the data from the fields Name, Definition, and Reaction class.
I tried using Beautiful Soup but could not work out how to extract the data, as there is no specific class for the fields in the HTML.
I assume you have inspected the element on the webpage and noticed that the reaction table cell has the class td21. Assuming that every page is structured like this and you use BS3 or BS4, you should be able to do something like
# get all <td> elements with class "td21", take the first, take every link in it
links = soup.find_all("td", class_="td21")[0].find_all("a")
to get the link elements (warning: the syntax varies between BS3 and BS4!). Have a look at the references below for further information.
With the links you got, you can start new HTTP requests by extracting the href attribute of each link and parsing the results again with BS; see the sketch after the references.
References:
how-to-find-elements-by-class
searching-by-css-class
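Putting the pieces together, a minimal sketch, assuming BS4 and that td21 really is the class marking the "Reaction" row on every such page (both unverified here):
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

url = "http://www.genome.jp/dbget-bin/www_bget?cpd:C10453"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# Take every link inside the first <td class="td21"> cell,
# assumed to be the "Reaction" row of the entry table.
links = soup.find_all("td", class_="td21")[0].find_all("a")

# Follow each reaction link and parse the linked page in turn.
for a in links:
    reaction_url = urljoin(url, a["href"])
    reaction_soup = BeautifulSoup(requests.get(reaction_url).text, "html.parser")
    print(a.get_text(), "->", reaction_soup.title.get_text())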

SharePoint REST call is not returning all fields

Goal: Have a Python program pull data from SharePoint so we can store it in a database.
Issue: I am able to connect to SharePoint and return data, but I am not getting all of the fields I can see when hitting the UI page. The UI page I am hitting is in the list in the REST call, but it is a custom view.
Update: Using renderashtml I was at least able to see some of the data points I am looking for. I would hope there is a better solution than this.
Code:
import sharepy

connection = sharepy.connect("https://{site}.sharepoint.com")
r = connection.get("https://{site}.sharepoint.com/{page}/_api/web/Lists/getbytitle('{list_name}')/items")
print(r.content)
print(r.json())

# I have also tried:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')
# and was able to return data as HTML with:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')/renderashtml
Research: I have taken a look at the REST documentation for SharePoint, and I am under the impression that you cannot return data from a view. The solution I saw was to first hit the view, generate a list of its columns, and use that to build a query against the list. I have tried that, but those fields are not available when I pull the list, even though they are in the view.
https://social.msdn.microsoft.com/forums/sharepoint/en-US/a5815727-925b-4ac5-8a46-b0979a910ebb/query-listitems-by-view-through-rest-api
https://msdn.microsoft.com/en-us/library/office/dn531433.aspx#bk_View
Are you trying to get the data from known fields, or to discover the names of the fields?
Can you get the desired data by listing the fields in a $select?
_api/web/lists/getbytitle('Documents')/items?$select=Title,Created,DateOfBirth
or to get all of the fields:
_api/web/lists/getbytitle('Documents')/items?$select=*
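With the sharepy connection from the question, that would look something like the sketch below (the field names are illustrative, not known to be in your list):
import sharepy

connection = sharepy.connect("https://{site}.sharepoint.com")

# Ask for the needed columns explicitly; view-only columns must be
# requested by their internal field names on the list itself.
r = connection.get(
    "https://{site}.sharepoint.com/{page}/_api/web/Lists"
    "/getbytitle('{list_name}')/items?$select=Title,Created,DateOfBirth"
)
# The response shape assumes the default verbose OData JSON.
for item in r.json()["d"]["results"]:
    print(item["Title"], item["Created"])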

Kibana - programmatically return saved search objects and associated data via REST API

I am currently working on an Excel export tool for Kibana using Node.js. Right now I am trying to figure out whether it is possible to export the data associated with a saved search within my selected Kibana index.
Here is an example of what I am trying to do:
A user provides authorization and selects a project within Kibana that they have access to.
Once a user has selected a project, any saved searches associated with that project are populated into the UI.
The user selects a saved search, a report name, and a date range, and submits the form. The application then makes a request to the Kibana index and returns the data associated with the selected search within the given time range.
I have finished the authorization and the UI, but I am currently stuck trying to figure out how to return the saved search objects within a specific project. I am also unsure how to construct the request to the Kibana index that would return the data associated with the selected saved search within the given time frame.
Does anyone have any experience with something similar to this? I am also very new to Elasticsearch; is this sort of functionality possible?
Answered by a wonderful Elastic team member here:
https://discuss.elastic.co/t/exporting-saved-search-data/90843
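For orientation, on Kibana versions of that era, saved searches are ordinary documents in the .kibana Elasticsearch index, so they can be listed with a plain search request. A sketch in Python, assuming a Kibana 5.x layout and a local Elasticsearch (both assumptions):
import requests

ES = "http://localhost:9200"  # assumed Elasticsearch address

# On 5.x, saved searches live as documents of type "search" in the
# .kibana index; later versions nest them under a "type" field instead.
r = requests.get(ES + "/.kibana/search/_search", params={"size": 100})
for hit in r.json()["hits"]["hits"]:
    print(hit["_id"], "-", hit["_source"].get("title"))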

Parse-Server query fed directly from URL query string

I'd like to know if this is even possible, and if it is possible, what the security ramifications would be.
I want to use JavaScript to build a dynamic URL to query a Parse-Server database.
It appears that it might be possible, based on an earlier Stack Overflow question here and a Node.js doc here.
Here's how I envision it working...
So, a user would be sent (via email/Twitter/etc.) a link created by the above method. Once the user clicked on that URL link, the following would happen automatically:
Step #1: User's system would automatically submit a parse-server query.
Step #2: On success, the user's browser would download a web page which displayed the query results.
Step 1: create the pointer value, i.e. the query pseudo-semantic (a stored description of the query to run).
Step 2: insert the pointer value into a text-type field in a Parse class, say clazz.
Step 2b: send (via Mailgun) a message containing a link like:
express.domain.com/#!clazz/:yALNMbHWoy - where 'yA...oy' is the objectId of the pointer row in the parse/clazz table.
Note that the link is an abstraction only. It is first a URI to an Express route and handler that simply gets a row from parse.clazz. That row contains the semantics for making a Parse query that fetches the full DB complement, which is passed along to the Node template setting up the HTML.
In your Node router, GET /clazz/:oid will look up that Parse row in parse/clazz, using the pointer/text value to format a second Parse query. The response from that second query is your real meat; it can be used by the Express template formatting the HTML response to the original request on "express.domain.com".
Where you ask about "download web page": that is simply Node's response to a GET on a route like GET /clazz.
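Stripped of the Express plumbing, the two-step lookup could look like this against Parse Server's REST API (sketched in Python; the clazz class and its query/targetClass fields are hypothetical names for the pointer row described above, and the host and app ID are placeholders):
import requests

PARSE = "https://express.domain.com/parse"      # assumed Parse Server mount
HEADERS = {"X-Parse-Application-Id": "APP_ID"}  # placeholder application ID

def results_for_link(oid):
    # Step 1: fetch the pointer row that the emailed link identifies.
    row = requests.get(PARSE + "/classes/clazz/" + oid, headers=HEADERS).json()

    # Step 2: run the query stored in the row (assumed here to be a
    # JSON-encoded "where" clause plus the name of the target class).
    r = requests.get(
        PARSE + "/classes/" + row["targetClass"],
        headers=HEADERS,
        params={"where": row["query"]},
    )
    return r.json()["results"]  # hand these to the template rendering the page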

Re-using a list in Node.js

I'm developing a web scraper in Node.js and I'm trying to figure out the best approach to combining data from one list with data from another list.
For example:
Step 1: Scrape data from website A
In this step, I scrape some data using cheerio/request and store it in a list, which is then displayed on the screen in a jQuery data table. The data table has a checkbox next to each scraped row of data, and the value I have assigned to each checkbox is a URL.
Step 2: Scrape again based on URLs chosen in checkboxes
In step 2, the app will scrape another website based on the URLs selected in step 1 and will retrieve some data values from those URLs.
My dilemma
I wish to use some data values that were scraped in step 1 along with some data values scraped in step 2. However, currently the data from step 1 is lost, because my app does not save it anywhere.
Since this is a sort of dynamic search, whereby a user will search for data, scrape it, and then not necessarily want to see it again, I think saving the data into a database would be overkill. So I'm wondering if I should save the list data from step 1 in a session variable and then link the lists up again using the URL (from the checkbox) as the key?
Thanks for your help!
Anthony
If you don't want to save this data, then pass it along as additional inputs of your form; try a hidden field per row whose value is the serialized item, e.g.
<input type="hidden" name="item" value="..." />  <!-- set value to JSON.stringify(item) when rendering -->
I don't think using a database for storage of the scraped content would be overkill.
The ideal points to note in this process would be:
Use a document store like MongoDB to dump your JSON data directly. I propose MongoDB because you get more resources to refer to.
Initially open up a DB connection in Node, and your scraping daemon can reuse it each time it parses an HTTP result using cheerio and dumps the data to the DB.
Once you get the output from the HTTP request to your target URL, the cheerio parse step should take more time than dumping the data to the DB; a sketch of the dump follows below.
I have followed a similar approach for scraping movie data from IMDb, and you can find the corresponding code in this repo, if you are interested. I have also used cheerio, and it's a good choice in my opinion.
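The dump itself is only a few lines in any driver; here is a minimal sketch in Python with pymongo (the Node MongoDB driver calls are analogous; the database, collection, and function names are invented for illustration):
from pymongo import MongoClient

# Open one connection up front and reuse it for every scraped page.
client = MongoClient("mongodb://localhost:27017")
pages = client["scraper"]["pages"]  # illustrative database/collection names

def save_scraped(url, rows):
    # Keying on the URL means a re-scrape replaces the old rows, and the
    # same URL (the checkbox value) links step 1 and step 2 data back up.
    pages.replace_one({"_id": url}, {"_id": url, "rows": rows}, upsert=True)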
