Print a webpage to a PDF file using Python - python-3.x

I have a Tableau URL with a grid report in it. I need to print the page to PDF (A3 size) using Python. Is there a way to achieve this? I tried pdfkit and the requests.get method below, but neither gives the proper output.
import requests

url = 'http://tabiisweb.sample.com/'
# Note: this only downloads the raw response body; it does not render the page as a PDF
myfile = requests.get(url, allow_redirects=True, stream=True)
with open('c:/tabfile.pdf', 'wb') as f:
    f.write(myfile.content)
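requests only fetches the server's response; it never renders the page, so the saved file is not a real PDF. One way to actually render the page and print it to A3 is Selenium 4's print_page API. A minimal sketch, assuming Selenium 4 with headless Chrome and reusing the question's placeholder URL:

import base64
from selenium import webdriver
from selenium.webdriver.common.print_page_options import PrintOptions

chrome_opts = webdriver.ChromeOptions()
chrome_opts.add_argument('--headless=new')  # print-to-PDF typically requires headless Chrome
driver = webdriver.Chrome(options=chrome_opts)
driver.get('http://tabiisweb.sample.com/')

opts = PrintOptions()
opts.page_width = 29.7   # A3 width in centimeters
opts.page_height = 42.0  # A3 height in centimeters

pdf_b64 = driver.print_page(opts)  # returns the PDF as a base64-encoded string
with open('c:/tabfile.pdf', 'wb') as f:
    f.write(base64.b64decode(pdf_b64))
driver.quit()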

Related

How to send a pdf object from Databricks to Sharepoint?

INTRO: I have a Databricks notebook where I create a pdf file based on some data.
In order to generate the file I am using the fpdf library:
from fpdf import FPDF, HTMLMixin
Using the library I generate a pdf object which is of type: <__main__.HTML2PDF at 0x7f3b73720fd0>.
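The HTML2PDF class itself is not shown in the question; presumably it follows the usual fpdf HTMLMixin pattern, combining the two imports above. Roughly:

class HTML2PDF(FPDF, HTMLMixin):
    pass

pdf = HTML2PDF()
pdf.write_html('<h1>Report</h1><p>Some data.</p>')  # render HTML markup into the PDF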
My goal now is to send this pdf to a SharePoint folder. To do so I am using the following lines of code:
from office365.runtime.auth.user_credential import UserCredential
from office365.sharepoint.client_context import ClientContext
from office365.sharepoint.files.file import File
# paths
sharepoint_site = "MySharepointSite"
sharepoint_folder = "Shared Documents/General/PDFs/"
sharepoint_user = "aaa@bbb.onmicrosoft.com"
sharepoint_user_pw = "xyz"
sharepoint_folder = sharepoint_folder.strip("/")
# set environment variables
SITE_URL = f"https://sharepoint.com/sites/{sharepoint_site}"
RELATIVE_URL = f"/sites/{sharepoint_site}/{sharepoint_folder}"
# connect to sharepoint
ctx = ClientContext(SITE_URL).with_credentials(UserCredential(sharepoint_user, sharepoint_user_pw))
web = ctx.web
ctx.load(web).execute_query()
# Generate PDF (generate_pdf and row are defined in earlier notebook cells, not shown)
pdf = generate_pdf(ctx, row['ServerRelativeUrl'])
# HERE IS MY ISSUE!
ctx.web.get_folder_by_server_relative_url(sharepoint_folder).upload_file('test.pdf', pdf).execute_query()
PROBLEM: When I reach the last line I get the following error message:
TypeError: Object of type HTML2PDF is not JSON serializable
I believe the pdf object cannot be serialized to JSON, so I am stuck: I do not know how to send the PDF to SharePoint.
QUESTION: Would you be able to suggest a smart and elegant way to achieve my goal, i.e. sending the pdf file to SharePoint?
I was able to solve this problem by saving the pdf as a string, encoding it, and then pushing it to SharePoint:
pdf_binary = pdf.output(dest='S').encode("latin1")
ctx.web.get_folder_by_server_relative_url(sharepoint_folder).upload_file("test.pdf", pdf_binary).execute_query()
Note: if this does not work, try changing the encoding.
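For context: PyFPDF's output(dest='S') returns the document as a str whose characters correspond one-to-one to byte values, which is why the latin-1 round-trip recovers the raw PDF bytes. If you are on the newer fpdf2 fork instead (an assumption about your fpdf version), output() already returns a bytearray, so the conversion simplifies to:

pdf_binary = bytes(pdf.output())  # fpdf2: output() returns the PDF as a bytearray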

Why not full data?

I am trying to get specific span tags from all 3 URLs, but the final CSV file only contains the data from the last URL.
Python code
from selenium import webdriver
from lxml import etree
from bs4 import BeautifulSoup
import time
import pandas as pd
urls = []
for i in range(1, 4):
    if i == 1:
        url = "https://www.coinbase.com/price/s/listed"
        urls.append(url)
    else:
        url = "https://www.coinbase.com/price/s/listed" + f"?page={i}"
        urls.append(url)
print(urls)
for url in urls:
    wd = webdriver.Chrome()
    wd.get(url)
    time.sleep(30)
    resp = wd.page_source
    html = BeautifulSoup(resp, "lxml")
    tr = html.find_all("tr", class_="AssetTableRowDense__Row-sc-14h1499-1 lfkMjy")
    print(len(tr))
    names = []  # note: re-created on every pass through the URL loop, so earlier pages are discarded
    for i in tr:
        name1 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 kwgTEs AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderDark-sc-14h1499-17 cWTMKR").text
        name2 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 bjBkPh AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderLight-sc-14h1499-15 AssetTableRowDense__TickerText-sc-14h1499-16 cdqGcC").text
        names.append([name1, name2])
ns = pd.DataFrame(names)
date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
path = "/Users/paul/jpn traffic/coinbase/coinbase"
ns.to_csv(path + date + date + '.csv', index=None)
The output of the two print() calls shows nothing wrong:
print(urls):
['https://www.coinbase.com/price/s/listed', 'https://www.coinbase.com/price/s/listed?page=2', 'https://www.coinbase.com/price/s/listed?page=3']
print(len(tr)):
26
30
16
So what's wrong with my code? Why don't I get the full data?
BTW, if I want to run my code on a cloud service every day at a given time, which option works best for me as a Python beginner? I don't need to store huge amounts of data in the cloud; I just need the Python script to email the results to my inbox, that's it.
Why not the full data? Because the data is generated from the backend: the site loads it through an API, which is why you cannot get it with BeautifulSoup alone. You can easily get the data using the API URL and requests. To find the API URL, open Chrome DevTools, go to the Network tab, select XHR, and click the request: the Headers tab shows the URL, and the Preview tab shows the data.
Now the data comes through:
import requests

r = requests.get('https://www.coinbase.com/api/v2/assets/search?base=BDT&country=BD&filter=listed&include_prices=true&limit=30&order=asc&page=2&query=&resolution=day&sort=rank')
coinbase = r.json()['data']
for coin in coinbase:
    print(coin['name'])
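Building on that answer, a sketch that walks all three pages through the same API and writes one CSV (the query parameters mirror the answer's URL; the symbol field is an assumption about the API's response shape):

import requests
import pandas as pd

rows = []
for page in range(1, 4):
    r = requests.get(
        'https://www.coinbase.com/api/v2/assets/search',
        params={'filter': 'listed', 'include_prices': 'true', 'limit': 30,
                'order': 'asc', 'page': page, 'resolution': 'day', 'sort': 'rank'},
    )
    for coin in r.json()['data']:
        rows.append([coin['name'], coin.get('symbol')])  # collect across all pages

pd.DataFrame(rows, columns=['name', 'symbol']).to_csv('coinbase.csv', index=None)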

How to fix "businessObject not defined"

I am a newbie to Python and web scraping. To practice, I am just trying to pull some business names from the HTML tags of a website. However, the code is not running; it throws an 'object is not defined' error.
from bs4 import BeautifulSoup
import requests

url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
    businessObject = {
        "BusinessName": business.find('h4', attrs={"class": "groomer-salon-card__name"}).text.encode('utf-8')
    }
print(businessObject)
Expected: I am trying to retrieve the business names from this web page.
Result:
NameError: name 'businessObject' is not defined
When you did
content.find_all('div', attrs={"class": "groomer-salon-card__details"})
you actually got back an empty list, as there was no match.
So when you did
for business in content.find_all('div', attrs={"class": "groomer-salon-card__details"}):
the loop body never ran, and businessObject was never created.
As mentioned in the comments, that is what led to your error.
Content is dynamically loaded from elsewhere in the DOM using JavaScript (along with other DOM modifications). You can still regex out the JavaScript object that contains the content used to update the DOM as you saw it in the browser, then parse it with a JSON parser as follows:
import requests, re, json

url = 'https://marketplace.akc.org/groomers/?location=Michigan&page=1'
response = requests.get(url, timeout=5)
p = re.compile(r'state: (.*?)\n', re.DOTALL)
data = json.loads(p.findall(response.text)[0])
for listing in data['content']['search_results']['pages']['data']:
    print(listing['organization_name'])
If you view the page source of the webpage you will see that the DOM is essentially populated dynamically from top to bottom, with mutation observers monitoring progress.
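A slightly more defensive variant of the same extraction, guarding against the pattern not matching (same regex as above, just using re.search):

m = p.search(response.text)
if m:
    data = json.loads(m.group(1))  # the captured group is the JSON-compatible object literal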

Unable to get the expected html element details using Python

I am trying to scrape a website using Python. I can scrape it successfully, but the expected result is not coming back. I think it has something to do with the JavaScript of the web page.
My Code below:
import re
from bs4 import BeautifulSoup

driver.get("https://my website")
soup = BeautifulSoup(driver.page_source, 'lxml')
all_text = soup.text
ct = all_text.replace('\n', ' ')
cl_text = ct.replace('\t', ' ')
cln_text_t = cl_text.replace('\r', ' ')
cln_text = re.sub(' +', ' ', cln_text_t)
print(cln_text)
Instead of giving me the website details, it gives the output below. Any idea how I could fix this?
html, body {height:100%;margin:0;} You have to enable javascript in your browser to use an application built with Vaadin.........
Why do you need BeautifulSoup at all here? It does not execute JavaScript.
If you need the web page text, you can fetch the document root using the simple XPath selector //html and read the innerText property of the resulting WebElement.
Suggested code change:
driver.get("my website")
root = driver.find_element_by_xpath("//html")
all_text = root.get_attribute("innerText")
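If you are on Selenium 4, where the find_element_by_* helpers were removed, the equivalent is:

from selenium.webdriver.common.by import By

root = driver.find_element(By.XPATH, "//html")
all_text = root.get_attribute("innerText")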

Finding tweet id from parsed html page

I am trying to get tweet id from the parsed HTML. Here is my code:
tweet_ids = []
stat = statnum_parser(page_soup)
name = stat["Full_Name"]
print(page_soup.select("div.tweet"))
for tweet in page_soup.select("div.tweet"):  # doesn't work properly
    if tweet['data-name'] == name:
        tweet_ids.append(tweet['data-tweet-id'])
The if condition checks that the tweet is not a retweet. The for loop does not work properly. Can someone help me?
I am using Selenium and BeautifulSoup.
I figured out the problem: I was not combining Selenium with BeautifulSoup properly. Here is the code to correctly get the HTML content of a static website:
from selenium import webdriver

path_to_chrome_driver = "path_to_your_chrome_driver"
driver = webdriver.Chrome(executable_path=path_to_chrome_driver)
driver.base_url = "URL of the website"
driver.get(driver.base_url)
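From there the rendered source can be handed back to BeautifulSoup; a sketch completing the flow (name and tweet_ids continue the earlier snippet):

from bs4 import BeautifulSoup

page_soup = BeautifulSoup(driver.page_source, "html.parser")
for tweet in page_soup.select("div.tweet"):
    if tweet.get('data-name') == name:  # skip retweets, as in the original check
        tweet_ids.append(tweet['data-tweet-id'])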
