I'm developing a web scraper in Node.js and I'm trying to figure out the best approach to combining data from one list with data from another list.
For example:
Step 1: Scrape data from website A
In this step, I scrape some data using cheerio/request and store it in a list, which is then displayed on the screen in a jQuery data table. The data table has a checkbox next to each scraped row of data, and the value I have assigned to each checkbox is a URL.
Step 2: Scrape again based on URLs chosen in checkboxes
In step 2, the app scrapes another website based on the URLs selected in step 1 and retrieves some data values from those pages.
My dilemma
I wish to use some data values that were scraped in step 1 along with some data values scraped in step 2. However, currently the data from step 1 is lost because it isn't being saved anywhere.
Since this is a sort of dynamic search, where a user will search for data, scrape it, and then not necessarily want to see it again, I think saving the data to a database would be overkill. So I'm wondering: should I save the list data from step 1 into a session variable and then link the two sets up again using the URL (in the checkbox) as the key?
Thanks for your help!
Anthony
If you don't want to persist this data, pass it along as additional inputs in your form; try something like:
<input type="hidden" value='${JSON.stringify(item)}'/>
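A small sketch of that idea, assuming the step-1 items are rendered server-side and keyed by the same URL used for the checkbox value (all names here are illustrative):

// step 1 (rendering): embed each scraped item next to its checkbox
const formRows = items.map((item) =>
  `<input type="checkbox" name="urls" value="${item.url}">
   <input type="hidden" name="item_${encodeURIComponent(item.url)}"
          value='${JSON.stringify(item).replace(/'/g, "&#39;")}'>`
).join('\n');

// step 2 (handling the submit): parse the hidden value back and join it
// with the freshly scraped data, using the URL as the key
function mergeStepData(formBody, step2Results) {
  return step2Results.map((result) => {
    const step1 = JSON.parse(formBody[`item_${encodeURIComponent(result.url)}`]);
    return { ...step1, ...result };
  });
}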
I don't think using a database to store the scraped content would be overkill.
The key points to note in this process:
Use a document store like MongoDB so you can dump your JSON data into it directly. I propose MongoDB because you'll find more resources to refer to.
Open a DB connection in Node once at startup; your scraping daemon can then reuse it each time it parses an HTTP result with cheerio and dumps the data to the DB (see the sketch after these points).
Once you have the output of the HTTP request to your target URL, the cheerio parsing steps should take more time than dumping the data to the DB.
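A minimal sketch of that setup, assuming a local mongod, a "scraper" database and a "scrapes" collection (all names are illustrative):

const { MongoClient } = require('mongodb');
const cheerio = require('cheerio');

const client = new MongoClient('mongodb://localhost:27017');

async function scrapeAndStore(html, sourceUrl) {
  const $ = cheerio.load(html);
  // parse whatever fields you need from the page
  const doc = {
    url: sourceUrl,
    title: $('title').text(),
    scrapedAt: new Date(),
  };
  await client.db('scraper').collection('scrapes').insertOne(doc);
}

async function main() {
  await client.connect(); // open the connection once; every scrape reuses it
  // ... fetch each page and call scrapeAndStore(html, url) ...
  await client.close();
}

main();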
I have followed a similar approach for scraping movie data from IMDb, and you can find the corresponding code in this repo if you are interested. I also used cheerio, and it's a good choice in my opinion.
Related
I need a table that shows about 2.5 million rows from an array that is already created in memory. When I create the table and assign the array to the 'data' property, the browser engine runs out of memory after some (significant) time. I assume that Tabulator creates objects not only for the current virtual DOM portion but for every entry in the array in advance.
So my question: is it possible to provide only the count of rows rather than the entire array, and let Tabulator ask for the content of each row via a callback only when it is needed for rendering? Of course this only makes sense if Tabulator does not keep any data for rows that have scrolled out of view.
I know that this might conflict with some column calculation features or other functionality, but that would be fine for my use case.
The same use case is working with canvas-datagrid, which I have tried before.
If you can use Ajax to get the data, Progressive Ajax Loading will help: it uses the pagination module to make a series of requests for parts of the data set, one at a time, appending each part to the table as it arrives.
Doc is here: http://tabulator.info/docs/4.3/data#ajax-alter
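A rough sketch of that option using Tabulator 4.x settings; the "/api/rows" endpoint is an assumption and must return paginated responses containing last_page and data:

var table = new Tabulator("#row-table", {
  height: "600px",               // a fixed height enables the virtual DOM
  ajaxURL: "/api/rows",          // assumed paginated endpoint
  ajaxProgressiveLoad: "scroll", // request the next page as the user scrolls
  paginationSize: 1000,          // rows fetched per request
  columns: [
    { title: "Id", field: "id" },
    { title: "Value", field: "value" },
  ],
});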
Progressive loading is an option, but you are still going to run into the issue of having two copies of the data in memory. It will happen either automatically in 'load' mode or gradually in 'scroll' mode as you scroll through the table. The best option would seem to be to load the data via, say, a button, using either setData() or replaceData(); the user could then fetch the next or previous set of data in batches.
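A small sketch of that batch idea, assuming an existing table instance and a paginated "/api/rows" endpoint (both names are illustrative):

let batch = 0;
const batchSize = 100000;

document.getElementById("next-batch").addEventListener("click", async () => {
  batch += 1;
  const rows = await fetch(`/api/rows?batch=${batch}&size=${batchSize}`)
    .then((r) => r.json());
  table.replaceData(rows); // replaces the current rows rather than appending
});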
Goal: Have python program pull data from SharePoint so we can store on database.
Issue: I am able to connect to SharePoint and return data, but I am not getting all of the fields I can see when hitting the UI page. The UI page I am hitting is in the list returned by the REST call, but it is a custom view.
Update: Using renderashtml I was at least able to see some of the data points I am looking for, but I would hope there is a better solution than this.
Code:
import sharepy

# authenticate against the SharePoint site
connection = sharepy.connect("https://{site}.sharepoint.com")

# fetch the list items
r = connection.get("https://{site}.sharepoint.com/{page}/_api/web/Lists/getbytitle('{list_name}')/items")
print(r.content)
print(r.json())

# I have also tried:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')

# I was able to return data as HTML with:
# https://{site}.sharepoint.com/{page}/_api/web/lists('{list_id}')/views('{view_id}')/renderashtml
Research: I have taken a look at the REST documentation for SharePoint, and I am under the impression that you cannot return data from a view. The solution I saw was to first hit the view, generate a list of its columns, and use that to build a query against the list. I have tried that, but those fields are not available when I pull the list, even though they are in the view.
https://social.msdn.microsoft.com/forums/sharepoint/en-US/a5815727-925b-4ac5-8a46-b0979a910ebb/query-listitems-by-view-through-rest-api
https://msdn.microsoft.com/en-us/library/office/dn531433.aspx#bk_View
Are you trying to get the data from known fields, or discover the names of the fields?
Can you get the desired data by listing the fields in a select?
_api/web/lists/getbytitle('Documents')/items?$select=Title,Created,DateOfBirth
or to get all of the fields:
_api/web/lists/getbytitle('Documents')/items?$select=*
I am trying to retrieve data from http://www.professorpaddle.com/rivers/riverlist.asp, which automatically defaults to Washington as the state id. However, I want to pull data from the table for Oregon. Can this be done with a parameter? So far I've tried writing a .iqy file with the following, and it still doesn't work:
WEB
1
http://www.professorpaddle.com/rivers/riverlist.asp?hstateid=["oregon"]
Selection=EntirePage
Formatting=All
PreFormattedTextToColumns=True
ConsecutiveDelimitersAsOne=True
SingleBlockTextImport=False
I am new to VBA but open to using it as well.
You've got two issues:
The page you linked uses a POST rather than a GET request, so you have to write the parameters separately, not as part of the URL.
I'm not sure why you're supplying oregon as the hstateid. When examining the request in Chrome's dev tools I see 37 as the value of this parameter.
Also, the brackets around the value indicate that you want to specify a variable value for the parameter, but I think you're just trying to use a static value here.
So your query should start with something like:
WEB
1
http://www.professorpaddle.com/rivers/riverlist.asp
hstateid=37
I'd like to know if this is even possible. And if it is possible, what the security ramifications would be.
I want to use Javascript to build a dynamic URL to query a Parse-Server database.
It appears that it might be possible, based on an earlier Stack Overflow question here and a Node.js doc here.
Here's how I envision it working....
So, a user would be sent (via email/Twitter/etc) a link which was created by above method. Once the user clicked on that URL link, the following would happen automatically:
Step #1: User's system would automatically submit a parse-server query.
Step #2: On success, the user's browser would download a web page which displayed the query results.
Step 1: create the pointer value, i.e. the query pseudo-semantics.
Step 2: insert the pointer value into a text-type field in a Parse class (cls = clazz).
Step 2b: send a message via Mailgun containing a link like:
express.domain.com/#!clazz/:yALNMbHWoy, where 'yA...oy' is the objectId of the pointer row in the parse/clazz table.
Note that the link is an abstraction only. It is first a URI to an Express route and handler function that simply gets a row from parse.clazz. That row contains the semantics for making a Parse query that fetches the full DB complement to pass along to the Node template that builds the HTML.
In your Node router, GET /clazz/:oid will look up that Parse row in parse/clazz and use the pointer/text value to build a second Parse query. The response from that second query is your real meat: it can be used by the Express template that formats the HTML response to the original request on "express.domain.com".
Where you ask about "downloading a web page", that is simply Node's response to a GET on a route like GET /clazz.
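A minimal sketch of that route, assuming the Parse JS SDK, an Express app, and illustrative field names ('targetClass', 'filterValue') on the stored clazz row:

const express = require('express');
const Parse = require('parse/node');

Parse.initialize('APP_ID');                     // assumed application id
Parse.serverURL = 'https://example.com/parse';  // assumed parse-server URL

const app = express();

app.get('/clazz/:oid', async (req, res) => {
  // first query: fetch the stored pointer row that describes what to load
  const pointerRow = await new Parse.Query('clazz').get(req.params.oid);

  // second query: use the stored semantics to fetch the real data
  const dataQuery = new Parse.Query(pointerRow.get('targetClass'));
  dataQuery.equalTo('status', pointerRow.get('filterValue'));
  const rows = await dataQuery.find();

  // render the results into the HTML response (a template engine is assumed)
  res.render('results', { rows });
});

app.listen(3000);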
This seems like a pretty simple thing but I can't find any discussions that really explain how to do it.
I'm building a scraper with MongoDB and Node.js. It runs once daily and scrapes several hundred urls and records to the database. Example:
Scraper goes to this google image search page for "stack overflow"
Scraper gets the top 100 links from this page
A record of the link's URL, img src, page title and domain name is saved to MongoDB.
Here's what I'm trying to achieve:
If the image is no longer in the 100 scraped links, I want to delete it from the database.
If the image is still in the 100 scraped links but details have changed (e.g. a new page title), I want to find the MongoDB record and update it.
If the image doesn't already exist, I want to create a new record.
The bit I'm having trouble with is deleting entries that haven't been scraped. What's the best way to achieve this?
So far my code successfully checks whether entries exist and updates them. It's deleting records that are no longer relevant that I'm having trouble with. The Pastebin link is here:
http://pastebin.com/35cXcXzk
You either need to timestamp items (updating the timestamp on every scrape) and periodically delete items that haven't been updated in a while, or you need to associate items with a particular query. In the latter case, you would gather all of the items previously associated with the query and mark them off as the new results come in. Any items not marked off the list at the end need to be deleted.
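A minimal sketch of the first (timestamp) approach, assuming an "images" collection and a scrapedItems array holding the 100 freshly scraped links (names are illustrative):

// "db" is a connected mongodb Db instance (e.g. from MongoClient.connect)
async function syncScrape(db, scrapedItems) {
  const images = db.collection('images');
  const scrapedAt = new Date();

  // upsert every scraped item, stamping it with this run's timestamp
  for (const item of scrapedItems) {
    await images.updateOne(
      { url: item.url },                 // match on the link's url
      { $set: { ...item, scrapedAt } },  // update the details and the timestamp
      { upsert: true }                   // insert the record if it didn't exist
    );
  }

  // anything this run didn't touch is no longer in the top 100, so delete it
  await images.deleteMany({ scrapedAt: { $lt: scrapedAt } });
}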
Another possibility is to use the new TTL index option in MongoDB 2.4, which lets you set a time to live on documents:
http://docs.mongodb.org/manual/tutorial/expire-data/
This lets the server expire them over time instead of you having to perform big, expensive remove operations.
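For example (a sketch, assuming the same "images" collection and a scrapedAt date field):

// inside an async setup function: documents expire automatically once
// scrapedAt is older than 24 hours
await db.collection('images').createIndex(
  { scrapedAt: 1 },
  { expireAfterSeconds: 60 * 60 * 24 }
);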
Another optimization is to use the power-of-2 sizes option (usePowerOf2Sizes) for collections to avoid the heavy memory fragmentation that write/remove cycles create:
http://docs.mongodb.org/manual/reference/command/collMod/