Unable to extract data using Import.io from Amazon web page where data is loaded into the page via Ajax

Anyone know how to extract data from a webpage using Import.io where the data is loaded into the page via Ajax?
I am unable to extract data from the pages mentioned below.
There is no issue extracting data from the first page, but how do I move on to extract data from the second page?
URL is given below.
http://www.amazon.com/gp/aag/main?ie=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=ATVPDKIKX0DER&orderID=&seller=A13JB7253Q5S1B

The data on that page is deployed using an interesting mix of technologies; it relies heavily on server-side code and JavaScript. That type of page can be a challenge; however, there are always methods to get the data. For example, some sellers have a page like this:
http://www.amazon.co.uk/gp/node/index.html?ie=UTF8&marketplaceID=ATVPDKIKX0DER&me=A2WO1PQ2OIOIGM&merchant=A2WO1PQ2OIOIGM
which is very easy to extract data from, even using the magic algorithm: https://magic.import.io/?site=http:%2F%2Fwww.amazon.co.uk%2Fgp%2Fnode%2Findex.html%3Fie%3DUTF8%26marketplaceID%3DA1F83G8C2ARO7P%26me%3DA2WO1PQ2OIOIGM%26merchant%3DA2WO1PQ2OIOIGM
I had to take off the redirect=true from the URLs before it would work - just an FYI.
Other times some stores don't have such a URL; it's a bit of a pain, and their URLs can be tough to figure out.
We do help some of our enterprise customers build bespoke APIs when the data is very important to them, so do feel free to get in touch. I imagine a larger-scale workaround would be to create a dataset/API based on the categories you are interested in and then to filter that larger dataset down (in Python or CSV style) by seller name. That would probably work!
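For example, the filter-down step could be as small as this sketch (the file name and the "seller" column are assumptions about how your export looks; adjust to your data):

```javascript
// Sketch: filter a bulk-extracted category dataset down to one seller.
// The file name and the "seller" column are assumptions about your export.
const rows = require('./amazon-dataset.json');

const bySeller = rows.filter((row) => row.seller === 'Example Seller Ltd');

console.log(bySeller.length, 'rows for this seller');
```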

I managed to get a static dataset but no API. You can find that dataset at the following GUID: c7c63f1c-7081-4d4a-ad91-afe9789a6620
Thanks

Related

NodeJS PDF Generator: Data Across Multiple Pages

We have a Node.js API that is used to generate PDFs. We use Handlebars server-side to bind data to HTML templates (i.e. header, footer, content), then we use Puppeteer to generate the PDFs from the bound template. This approach works great when the data fits on one page.
However, we are trying to determine the best approach when the data spans multiple pages. We are not quite sure how to do this with the current technology.
With the data sorted, the layout is a four-column table: the first column fills with the sorted data top to bottom, then once that column is full, the data flows into the next column, and the next, until all columns are full. Once all the columns are full, the flow moves to the next page.
Here is an example of what the flow of the data needs to look like across multiple pages:
https://lucid.app/lucidchart/83cf06f7-7e8a-4eb3-8059-fafd598318a6/view?page=0_0#?folder_id=home&browser=icon
Once the pages have been created they will be sent to the consumer as a single PDF.
Will Puppeteer do this paging for us? If so, we have not had much success figuring out how to tell Puppeteer to do it. Do we have to create PDFs for each page and somehow stitch them together into a single document?
We are open to ideas on how to do this with the current technology (Node.js, Puppeteer, Handlebars) or using a different Node.js technology. The only requirement is that it has to be server-side and run in Node.js.
Again, we are open to any ideas on how to do this.
Examples are encouraged.
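For context, a simplified sketch of our current pipeline (template and data shape trimmed down; the CSS column rules are just one idea we have been trying for the column-by-column flow):

```javascript
// Simplified sketch of the Handlebars -> Puppeteer pipeline. The CSS
// column rules are one candidate for the column-by-column flow;
// page.pdf() splits overflowing content across pages on its own.
const handlebars = require('handlebars');
const puppeteer = require('puppeteer');

async function renderPdf(items) {
  // Bind the data to an HTML template with Handlebars.
  const template = handlebars.compile(`
    <html><head><style>
      .grid { column-count: 4; column-gap: 1em; }
      .item { break-inside: avoid; }
    </style></head><body>
      <div class="grid">
        {{#each items}}<div class="item">{{this}}</div>{{/each}}
      </div>
    </body></html>`);

  // Render the bound HTML and print it to a single multi-page PDF.
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(template({ items }), { waitUntil: 'networkidle0' });
  const pdf = await page.pdf({ format: 'Letter', printBackground: true });
  await browser.close();
  return pdf; // one Buffer containing all pages
}
```

What we have not confirmed is whether CSS multi-column layout alone reproduces the fill order in print, or whether we would need to slice the data into per-page chunks in the template and force boundaries with page-break-after.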

Google Docs: Table of content page numbers

We are currently building an application on the Google Cloud Platform which generates reports in Google Docs. For these, it is really important to have a table of contents ... with page numbers. I know this has been a feature request for a few years, and there are add-ons (Paragraph Styles +, which didn't work for us) that provide this solution, but we are considering building this ourselves. If anybody has a suggestion on how we could start with this, it would be a great help!
Thanks,
Best bet is to file a feature request on the product forums.
Currently the only way to do that level of manipulation of a doc to provide a custom TOC is to use Apps Script. It provides access to the document structure sufficient to build and insert a basic table of contents, but I'm not sure there's enough to do paging correctly (unless you force a page break on every page...). There's no method to answer the question of "what page is this element on?"
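For example, a minimal Apps Script sketch of that basic, page-number-less TOC (the heading levels and the placement at the top of the document are assumptions):

```javascript
// Apps Script sketch: build a basic TOC from the document's heading
// paragraphs. Heading levels and placement are assumptions; there is
// no API to attach page numbers.
function insertBasicToc() {
  const body = DocumentApp.getActiveDocument().getBody();
  const headings = body.getParagraphs().filter((p) => {
    const h = p.getHeading();
    return h === DocumentApp.ParagraphHeading.HEADING1 ||
           h === DocumentApp.ParagraphHeading.HEADING2;
  });

  // Insert the entries at the top, last heading first, so they end up
  // in document order.
  headings.reverse().forEach((p) => body.insertParagraph(0, p.getText()));
  body.insertParagraph(0, 'Table of Contents')
      .setHeading(DocumentApp.ParagraphHeading.TITLE);
}
```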
Hacks like writing to a DOCX and converting don't work because TOCs are recognized for what they are and show up without page numbers.
Of course you could write a DOCX or PDF with the TOC as you'd like and upload it as a blob rather than as a Google Doc. It can still be viewed in Drive and such.

Best Approach to Scrape Paginated Results using import.io

There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern - ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting up 17 scrapers for scenario 1 and then actively watching as results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content from second scenario.
Q1 - The free tool from import.io does not have the ability to actively watch the data change over time. What you could do is have the data bulk extracted by the Extractor (with 17 pages this would be really fast) and added to a database. After each entry to the database, the entries could be de-duped or marked as unique. You could do this manually in Excel or programmatically.
Their Enterprise (data as a service) could do this for you.
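The programmatic de-dupe can be as small as this sketch (keying on a url field is an assumption; use whatever uniquely identifies a row in your data):

```javascript
// Sketch of the programmatic de-dupe after each bulk extract.
// Keying on row.url is an assumption; use whatever uniquely
// identifies a row in your data.
function dedupe(rows) {
  const seen = new Set();
  return rows.filter((row) => {
    if (seen.has(row.url)) return false;
    seen.add(row.url);
    return true;
  });
}
```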
Q2 - If there is not a unique URL for each page, the only tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to get the pagination. The result of this extractor will be a list of links, each link corresponding to a page.
This way, every time you run your application and the number of pages changes, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
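Outside of import.io's own tooling, the same two-step pattern looks roughly like this sketch (parsePageLinks and parseItems are hypothetical parsing helpers you would write yourself):

```javascript
// Generic sketch of the two-extractor pattern above, outside import.io.
// parsePageLinks and parseItems are hypothetical helpers you supply;
// fetch is the global available in Node 18+.
async function scrapeAll(firstUrl) {
  // Step 1: discover the page URLs (e.g. ?page_num=1 ... ?page_num=N).
  const firstPage = await fetch(firstUrl).then((r) => r.text());
  const pageUrls = parsePageLinks(firstPage);

  // Step 2: fetch each page and collect its items.
  const results = [];
  for (const url of pageUrls) {
    const html = await fetch(url).then((r) => r.text());
    results.push(...parseItems(html));
  }
  return results; // one combined list, however many pages exist today
}
```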

Pulling two different sets of data from the same document library in a single page SharePoint 2013

I have a document library set up with multiple different categories of document, and I'm using a metadata column to differentiate between them.
I want to be able to display two different document library web parts on a page for different categories of file, side by side. This is simple for one category: I just set up a list view filtered by the metadata column. But when I add a second web part alongside the first, it breaks the first one.
I have no idea why this is happening, but it seems like SharePoint isn't happy with pulling two sets of data from the same document library.
When I am editing the web parts, I can get them to both display the documents I want, but then when I click save, the first web part empties.
Not sure what other information would be useful for diagnosing or helping with the problem, so if I haven't given enough detail let me know. I am familiar with SPD as well as developing through the web interface, so if this needs a more complex solution that's fine with me!
Having spent some more time playing around with this, it struck me that I could probably achieve what I wanted using something other than a Document web part, and I was right.
Instead of using the somewhat inflexible document web part, I created a Content Query web part which only searched within the document library on my site, and filtered by the metadata column.
This way I can create as many queries as I like and they don't interact with each other in weird ways. It also has the advantage of being significantly easier to customise the output without needing to resort to SharePoint Designer.
Content Queries are the answer!

Data Set that can be used for statistics

I need some raw data to visualize with Google Charts and some other APIs. The problem is that I need raw data that includes timestamps too.
For example, visitors visiting a website, i.e. from which device (mobile/computer etc.) they accessed the website, at what time (hours:minutes:seconds:milliseconds), and which links they visited, etc. Please let me know if anyone knows of such dummy raw data on the web.
You can build your own dataset using Google Spreadsheets.
For example, consider the spreadsheet from the link below:
https://docs.google.com/spreadsheet/pub?key=0Aj9J3uCNjN9_dG1rdmNtTlhyNWpkTUVHVHBwRzNWX2c&output=html
If you tweak the link, it can provide you with the JSON representation of the data.
https://docs.google.com/spreadsheet/tq?key=0Aj9J3uCNjN9_dG1rdmNtTlhyNWpkTUVHVHBwRzNWX2c&pub1
Basically, what you have to do in order to get the JSON response is replace the "pub" element with "tq" and remove the "output=html" element at the end, adding "pub1" instead.
With this procedure you should be able to create your own datasource for your tests.
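For example, a quick sketch that pulls the JSON from Node (the substring step assumes the response is wrapped in a google.visualization callback, which may change over time):

```javascript
// Sketch: pull the spreadsheet data via the "tq" URL from Node 18+.
// The substring step assumes the response is wrapped in a
// google.visualization callback and strips that wrapper.
const url = 'https://docs.google.com/spreadsheet/tq?key=0Aj9J3uCNjN9_dG1rdmNtTlhyNWpkTUVHVHBwRzNWX2c&pub1';

fetch(url)
  .then((r) => r.text())
  .then((body) => {
    const json = body.substring(body.indexOf('(') + 1, body.lastIndexOf(')'));
    const data = JSON.parse(json);
    console.log(data.table.rows.length, 'rows');
  });
```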
You can find more information on the Google Chart API documentation:
https://developers.google.com/chart/interactive/docs/spreadsheets
Hope it helps
