Using html header/footers with html2pdfrocket - azure-web-app-service

Has anyone had success using html2pdfrocket with custom html header/footers?
I've got the following footer (not the dataUri has be shrunk for posting) that I'm trying to pass to the html2pdfrocket API but I just get a 500 error back.
var generatedFooter = "<div style=\"width:100%;text-align:right;margin-right:25px;padding-bottom:25px;\"><img src=\"data:image/png;base64,iVBOR....\" style=\"width:150px; margin-bottom:25px;\" /></div>";

Turns out that html2pdfrocket doesn't support html strings for the header/footer. Instead I created a webservice that returns a generated html page with the header information.
Just as a further FYI html2pdfrocket and wkhtmltopdf which it is based on require that the header html page has a tag at the top otherwise the bottom of the page in a large view is shown instead.

Related

How to only get the "title" and "main content" of a page using puppeteer?

I'm trying to create a clone of getpocket.com for learning. On that app, every saved link gets converted into a markdown; and it seems like the it's a filtered content with only the page title and body without headers, footers, etc.
I could get the page's title using puppeteer api thru different means:
using page.title()
or get the page's opengraph "og:title"
But how do i get like the summarized version containing only the main content of the page.
Note that i don't know beforehand the "css class" of the main content since i'm planning on just entering a url in a textbox and scrape that site from there.
I have found what i've needed for this scenario.
I used the Readability.js library for making webpages readable by removing some certain html tags. Here's the library.
This library is what mozilla uses behind the scenes when rendering their reader view

scrapy pagination without href

I created a spider that takes the information from the table below, but I can not change to the previous table because it does not have "href", how do I?
https://br.soccerway.com/teams/italy/as-roma/1241/
previous button without href
<a rel="previous" class="previous " id="page_team_1_block_team_matches_summary_7_previous">« anterior</a>
If you look at network inspector in your browser you can see an XHR request being made when you click next button:
That request return json response with html changes:
You need to reverse engineer how your page generated this url (from the first image):
https://br.soccerway.com/a/block_team_matches_summary?block_id=page_team_1_block_team_matches_summary_7&callback_params=%7B%22page%22%3A0%2C%22bookmaker_urls%22%3A%7B%2213%22%3A%5B%7B%22link%22%3A%22http%3A%2F%2Fwww.bet365.com%2Fhome%2F%3Faffiliate%3D365_371546%22%2C%22name%22%3A%22Bet%20365%22%7D%5D%7D%2C%22block_service_id%22%3A%22team_summary_block_teammatchessummary%22%2C%22team_id%22%3A1241%2C%22competition_id%22%3A0%2C%22filter%22%3A%22all%22%2C%22new_design%22%3Afalse%7D&action=changePage&params=%7B%22page%22%3A1%7D
And then you can use that to retrieve following pages.

Node.JS HTML to PDF

I want to convert the HTML to PDF for which I am using this package (https://www.npmjs.com/package/html-pdf).
But in this HTML I am also getting tables and that tables can come in any position in HTML.
So I want to copy the table thead to another page if the table in the HTML exceeds more than 1 page.
Can you please help me on this.
Thanks in advance
You're going to have to modify the CSS of the page to properly handle a printable view.
See also: Repeat table headers in print mode
Try to use Puppeteer to create PDF from HTML
Example from here https://github.com/chuongtrh/html_to_pdf
Or https://github.com/GoogleChrome/puppeteer

Rendering a user modified page using PhantomJS

My use case is: a user goes onto a webpage and modifies it by either filling in a form, populating the page with data from the database, or dragging around some draggables on the page. He can then download the page he modified as pdf. I was thinking of using PhantomJS to do the conversion from html to pdf.
I understand the basic functionality of PhantomJS and got the basic example working but in all the examples I've seen, either a local file or a url is passed in. Example:
page.open('./test.html', function () { ... }
How would I render the page that is getting modified by a user using PhantomJS? I have 2 ideas:
Have the url change as the user modifies the page, and simply pass in the url. For example, the url contains the position of a draggable div.
Send the modified html to back-end, save it, and run PhantomJS
Do these solutions make sense? I'm hoping there would be a simpler way.

WKHTMLTOPDF Dynamic Header on every page

I am trying to produce a PDF file using WKHTMLTOPDF library in NODE for a large HTML file. I need to be able to stuff in some content in the Header and Footer on every page. But the content on the header changes on every page for e.g, have custom numbering in a format like BX008761. The number should increment on every page.
First page will be BX008761, second page BX008762, third BX008763 so on..
I could find a thread which is related..
WKHTMLTOPDF -- Is possible to display dynamic headers?
the above thread states:
"you can feed --header-html almost anything :) Try the following to see my point:
wkhtmltopdf.exe --margin-top 30mm --header-html isitchristmas.com google.fi x.pdf
So isitchristmas.com could be www.yoursite.com/magical/ponies.php"
does the source value provided for --header-html option be called for every page of the PDF rendered or it is called just once for every PDF..?
Appreciate your support.Thank you.
EDIT : I have tried a sample program and confirmed that it will process the value provided for --header-html option on every page rendered with in PDF. I am using a remote service to return the HTML string as a response to the url.
Now it is displaying the html string as is, instead of decoding it.
when the service returns below string:
<html> <body> <span style="color:red" > 123 :: 0 :: 3000025 :: 634943551338828720</span> <body> <html>
then the header on every page is also same as above instead of displaying the text in red color. how do i make the wkhtmltohtml understand that the content it received from service need to be decoded.
appreciate if any one can suggest a workaround.
Thank you.
EDIT : I have used another work around to return a HTML page for the header content. I used essentially a HTTPHandler in asp.net to return a valid response and the issue looks to have addressed the core issue of having a dynamic header on every page.

Resources