We have a Node.js API that is used to generate PDFs. We use Handlebars on the server side to bind data to HTML templates (i.e. header, footer, content), then we use Puppeteer to generate the PDFs from the bound template. This approach works great when the data fits on one page.
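For reference, here is a minimal sketch of our current single-page flow (the function and variable names are placeholders, not our exact code):

```js
const handlebars = require('handlebars');
const puppeteer = require('puppeteer');

// Minimal sketch of our current single-page flow; names are placeholders
async function renderPdf(templateHtml, data) {
  // Bind the data to the HTML template server side
  const html = handlebars.compile(templateHtml)(data);

  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setContent(html, { waitUntil: 'networkidle0' });

  // Print the bound template to a PDF buffer
  const pdf = await page.pdf({ format: 'Letter', printBackground: true });
  await browser.close();
  return pdf;
}
```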
However, we are trying to determine the best approach when the data spans multiple pages. We are not quite sure how to do this with the current technology.
With the data sorted, the layout is a four-column table: the sorted data fills the first column top to bottom, then once that column is full it flows into the next column, and so on until all four columns are full. Once all the columns are full, the data continues on the next page.
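In CSS terms, the flow we are describing looks like a multi-column layout. Something like this hypothetical print stylesheet (the class names are made up) captures the idea, though we have not verified the paging behaviour ourselves:

```js
// Hypothetical print CSS for the four-column flow; class names are made up.
// Our understanding (unverified) is that in print, Chromium fills the columns
// top to bottom on one page and then continues the flow in fresh columns on
// the next page, which seems to match the flow in the diagram.
const columnCss = `
  .directory { column-count: 4; column-gap: 16px; }
  .entry     { break-inside: avoid; } /* keep each entry in one column */
`;
```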
Here is an example of what the flow of the data needs to look like across multiple pages:
https://lucid.app/lucidchart/83cf06f7-7e8a-4eb3-8059-fafd598318a6/view?page=0_0#?folder_id=home&browser=icon
Once the pages have been created, they will be sent to the consumer as a single PDF.
Will Puppeteer do this paging for us? If so, we have not had much success figuring out how to tell Puppeteer to do it. Or do we have to create a PDF for each page and somehow stitch them together into a single document?
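If stitching turns out to be necessary, something like pdf-lib could presumably merge per-page PDF buffers into a single document (a sketch based on pdf-lib's documented API; we have not tried this ourselves):

```js
// Hypothetical sketch: merge an array of per-page PDF buffers with pdf-lib
const { PDFDocument } = require('pdf-lib');

async function mergePdfs(pdfBuffers) {
  const merged = await PDFDocument.create();
  for (const buffer of pdfBuffers) {
    const doc = await PDFDocument.load(buffer);
    // Copy every page of this per-page document into the merged document
    const pages = await merged.copyPages(doc, doc.getPageIndices());
    pages.forEach((page) => merged.addPage(page));
  }
  return merged.save(); // resolves to a Uint8Array of the single PDF
}
```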
We are open to ideas on how to do this with the current stack (Node.js, Puppeteer, Handlebars) or using a different Node.js technology. The only requirement is that it has to run server side in Node.js.
Again, we are open to any ideas on how to do this.
Examples are encouraged.
Related
I was wondering if anyone could point me towards a solution to my issue with PDFTron/WebViewer. I am building a MERN app, and one of the projected features is contract generation. I created my fields in a .docx file and loaded it in with WebViewer.
I tried:
having my data-collecting form on the same page as WebViewer and passing the data via state (...yeah)
separating the contract generation page from the display (WebViewer) page and using localStorage/Redux
lastly, taking advantage of my database: I created a new model/schema and set up the collection so that the contract data gets POSTed into the database from the form page and then fetched from the WebViewer page. Mongo has a TTL flag to remove every record after 1 minute (roughly as sketched below).
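Roughly, the throwaway model looks like this (a simplified sketch; the real schema has more fields):

```js
// Simplified sketch of the throwaway contract model;
// the TTL index removes each record 60 seconds after creation
const mongoose = require('mongoose');

const contractSchema = new mongoose.Schema({
  name: String,
  // ...the rest of the form fields
  createdAt: { type: Date, default: Date.now, expires: 60 },
});

module.exports = mongoose.model('Contract', contractSchema);
```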
I can console.log(contract.name), but it just won't consistently show up in place of {{NAME}} (consistently, because on several occasions it works...).
Any help is greatly appreciated. I can't provide console errors; the log and network tabs don't show any.
It seems like you are trying to fill forms with data. You could check out this guide for form filling: PDFTron JavaScript PDF form filling library.
This is a live example showing how it would work.
Let me know how this works for you.
OK. Basically, I want to have a div template (e.g. for displaying different people and their information: name, picture, information, styling, etc.). Is there a way I can duplicate one template div multiple times to show different data? Would I use JavaScript, or a server-side script like PHP pulling from a SQL database?
This would also make it easier to edit later on.
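To illustrate, this is roughly the kind of duplication I mean (a hypothetical sketch using a `<template>` element and plain JavaScript; the class names and data are made up):

```js
// Hypothetical sketch: clone one template div per record (names are made up)
const people = [
  { name: 'Ada', info: 'Mathematician' },
  { name: 'Grace', info: 'Rear admiral' },
];

// Assumes markup like:
// <template id="person-template">
//   <div class="person"><span class="name"></span><span class="info"></span></div>
// </template>
// <div id="people"></div>
const template = document.querySelector('#person-template');
const container = document.querySelector('#people');

for (const person of people) {
  const card = template.content.cloneNode(true); // deep copy of the template
  card.querySelector('.name').textContent = person.name;
  card.querySelector('.info').textContent = person.info;
  container.appendChild(card);
}
```

(Server side, the equivalent would just be a loop writing out one div per database row.)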
Thanks
There are several websites within the cruise industry that I would like to scrape.
Examples:
http://www.silversea.com/cruise/cruise-results/?page_num=1
http://www.seabourn.com/find-luxury-cruise-vacation/FindCruises.action?cfVer=2&destCode=&durationCode=&dateCode=&shipCodeSearch=&portCode=
In some scenarios, like the first one shown, the results page follows a pattern: ?page_num=1...17. However, the number of results will vary over time.
In the second scenario, the URL does not change with pagination.
At the end of the day, what I'd like to do is to get the results for each website into a single file.
Q1: Is there any alternative to setting up 17 scrapers for scenario 1 and then actively watching as results grow/shrink over time?
Q2: I'm completely stumped about how to scrape content from the second scenario.
Q1: The free tool from import.io does not have the ability to actively watch the data change over time. What you could do is have the data bulk-extracted by the Extractor (with 17 pages this would be really fast) and added to a database. After each entry to the database, the entries could be de-duped or marked as unique. You could do this manually in Excel or programmatically.
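Programmatically, de-duping by a unique key is short in Node, assuming each extracted row carries something unique like its URL (a hypothetical sketch):

```js
// Hypothetical dedupe of extracted entries by a unique key (here, the URL);
// the Map keeps only the last row seen for each URL
const rows = [/* entries from the bulk extract, e.g. { url, title, price } */];
const deduped = [...new Map(rows.map((row) => [row.url, row])).values()];
```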
Their Enterprise (data as a service) could do this for you.
Q2: If there is not a unique URL for each page, the only tool that will paginate the pages for you is the Connector.
I would recommend building an extractor to handle the pagination. The result of this extractor will be a list of links, each link corresponding to a page.
This way, even when the number of pages changes between runs of your application, you will always get all the pages.
After that, make a call for each page to get the data you want.
Extractor 1: Get pages -- Input: The first URL
Extractor 2: Get items (data) -- Input: The result from Extractor 1
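Outside of import.io, the same two-extractor pattern would look roughly like this in Node (a hypothetical sketch; the axios/cheerio stack and the CSS selectors are assumptions about the page markup, not import.io's API):

```js
// Hypothetical two-stage scrape: extract page links first, then the items
const axios = require('axios');
const cheerio = require('cheerio');

// Extractor 1: get the list of page links from the first results page
async function getPageLinks(firstUrl) {
  const { data } = await axios.get(firstUrl);
  const $ = cheerio.load(data);
  // '.pagination a' is a made-up selector; match it to the real markup
  return $('.pagination a')
    .map((_, a) => $(a).attr('href'))
    .get();
}

// Extractor 2: get the items (data) from a single page
async function getItems(pageUrl) {
  const { data } = await axios.get(pageUrl);
  const $ = cheerio.load(data);
  return $('.result')
    .map((_, el) => $(el).text().trim())
    .get();
}

async function scrapeAll(firstUrl) {
  const results = [];
  for (const link of await getPageLinks(firstUrl)) {
    results.push(...(await getItems(link)));
  }
  return results;
}
```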
Anyone know how to extract data from a webpage using Import.io where the data is loaded into the page via Ajax?
I am unable to extract data from the pages mentioned below.
There is no issue extracting data from the first page, but how do I move on to extract data from the second page?
The URL is given below.
http://www.amazon.com/gp/aag/main?ie=UTF8&asin=&isAmazonFulfilled=&isCBA=&marketplaceID=ATVPDKIKX0DER&orderID=&seller=A13JB7253Q5S1B
The data on that page is deployed using an interesting mix of technologies; it relies heavily on server-side code and JavaScript. That type of page can be a challenge; however, there are always methods to get the data. For example, some sellers have a page like this:
http://www.amazon.co.uk/gp/node/index.html?ie=UTF8&marketplaceID=ATVPDKIKX0DER&me=A2WO1PQ2OIOIGM&merchant=A2WO1PQ2OIOIGM
That page is very easy to extract data from, even using the Magic algorithm: https://magic.import.io/?site=http:%2F%2Fwww.amazon.co.uk%2Fgp%2Fnode%2Findex.html%3Fie%3DUTF8%26marketplaceID%3DA1F83G8C2ARO7P%26me%3DA2WO1PQ2OIOIGM%26merchant%3DA2WO1PQ2OIOIGM
I had to take redirect=true off the URLs before it would work, just an FYI.
Other times, stores don't have such a URL; it's a bit of a pain, and their URLs can be tough to figure out.
We do help some of our enterprise customers build bespoke APIs when the data is very important to them, so do feel free to get in touch. I imagine a larger-scale workaround would be to create a dataset/API based on the categories you are interested in and then filter that larger dataset down (Python or CSV style) by seller name. That would probably work!
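For example, once you have the larger dataset exported, filtering it down by seller name is a few lines of Node (a hypothetical sketch; the file name and field names are assumptions):

```js
// Hypothetical sketch: filter an exported category dataset down to one seller
const fs = require('fs');

const rows = JSON.parse(fs.readFileSync('amazon-category-dataset.json', 'utf8'));
const bySeller = rows.filter((row) => row.seller === 'A13JB7253Q5S1B');
fs.writeFileSync('seller-only.json', JSON.stringify(bySeller, null, 2));
```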
I managed to get a static dataset but no API. You can find that dataset at the following GUID: c7c63f1c-7081-4d4a-ad91-afe9789a6620
Thanks
I have a document library set up with multiple different categories of document, and I'm using a metadata column to differentiate between them.
I want to be able to display two different document library web parts on a page, side by side, for different categories of file. This is simple for one category: I just set up a list view filtered by the metadata column. But when I add a second web part alongside the first, it breaks the first one.
I have no idea why this is happening, but it seems like SharePoint isn't happy with pulling two sets of data from the same document library.
When I am editing the web parts, I can get them both to display the documents I want, but when I click save, the first web part empties.
I'm not sure what other information would be useful for diagnosing the problem, so if I haven't given enough detail, let me know. I am familiar with SharePoint Designer as well as developing through the web interface, so if this needs a more complex solution, that's fine with me!
Having spent some more time playing around with this, it struck me that I could probably achieve what I wanted using something other than a Document web part, and I was right.
Instead of using the somewhat inflexible document library web part, I created a Content Query Web Part that searches only within the document library on my site and filters by the metadata column.
This way I can create as many queries as I like, and they don't interact with each other in weird ways. It also makes it significantly easier to customise the output without needing to resort to SharePoint Designer.
Content Queries are the answer!