How to encode this string to export it to PDF - python-3.x

I'm trying to export to PDF a string containing UTF-8 characters. I have reduced my code to the following to be able to reproduce the problem:
source = """<html>
<style>
@font-face {
font-family: Preeti;
src: url("preeti.ttf");
}
body {
font-family: Preeti;
}
</style>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
</head>
<body>
This is a test <br/>
सरल
</body>
</html>"""
from xhtml2pdf import pisa
from io import StringIO, BytesIO
bytes_result = BytesIO()
pisa.CreatePDF(source, bytes_result, encoding='UTF-8')
result = StringIO(bytes_result.getvalue().decode('UTF-8', 'ignore'))
f = open('test.pdf', 'wt', encoding='utf-8')
f.write(result.getvalue())
Which produces the following PDF (screenshot omitted):
I have tried a ton of solutions from the internet but I cannot get it right. Any idea what I'm missing?
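For what it's worth, a PDF is binary data, so decoding the buffer with 'ignore' and re-writing it back out as UTF-8 text will corrupt the file. A minimal sketch that writes the raw bytes instead (assuming the same source HTML as above):
from xhtml2pdf import pisa

# write the PDF bytes directly - no text decode/encode round-trip
with open('test.pdf', 'wb') as f:
    pisa.CreatePDF(source, dest=f, encoding='UTF-8')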

Related

Getting meta property with beautifulsoup

I am trying to extract the "og" (OpenGraph) properties from a website. What I want is a list of all the property values in the document that start with "og".
What I've tried is:
soup.find_all("meta", property="og:")
and
soup.find_all("meta", property="og")
But it does not find anything unless I specify the complete tag.
A few examples are:
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
Expected output would be:
l = ["og:video:url", "og:video:secure_url", "og:video:type", "og:video:width", "og:video:height"]
How can I do this?
Thank you
Use the CSS selector meta[property]:
metas = soup.select('meta[property]')
propValue = [v['property'] for v in metas]
print(propValue)
Is this what you want?
from bs4 import BeautifulSoup
sample = """
<html>
<body>
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:url"/>,
<meta content="https://www.youtube.com/embed/Rv9hn4IGofM" property="og:video:secure_url"/>,
<meta content="text/html" property="og:video:type"/>,
<meta content="1280" property="og:video:width"/>,
<meta content="720" property="og:video:height"/>
</body>
</html>
"""
print([m["property"] for m in BeautifulSoup(sample, "html.parser").find_all("meta")])
Output:
['og:video:url', 'og:video:secure_url', 'og:video:type', 'og:video:width', 'og:video:height']
You can check whether "og" exists in the property value as follows:
...
soup = BeautifulSoup(html, "html.parser")
og_elements = [
tag["property"] for tag in soup.find_all("meta", property=lambda t: "og" in t)
]
print(og_elements)
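And if you want only the properties that actually start with "og", rather than merely containing it anywhere, a small variant of the same filter (assuming the same soup object; the t and ... guard skips meta tags with no property attribute):
og_elements = [
    tag["property"]
    for tag in soup.find_all("meta", property=lambda t: t and t.startswith("og"))
]
print(og_elements)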

BeautifulSoup Access Denied parsing error [duplicate]

This question already has answers here:
Scraper in Python gives "Access Denied"
(3 answers)
Closed 2 years ago.
This is the first time I've encountered a site where it wouldn't 'allow me access' to the webpage. I'm not sure why and I can't figure out how to scrape from this website.
My attempt:
import requests
from bs4 import BeautifulSoup
def html(url):
    return BeautifulSoup(requests.get(url).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Output:
<html>
<head>
<title>
Access Denied
</title>
</head>
<body>
<h1>
Access Denied
</h1>
You don't have permission to access "http://www.g2a.com/" on this server.
<p>
Reference #18.4d24db17.1592006766.55d2bc1
</p>
</body>
</html>
I've looked into it for a while now and I found that there is supposed to be some type of token (access, refresh, etc.).
I also found action="/search", but I wasn't sure what to do with just that.
This page requires certain HTTP headers (Accept-Language) to return the content:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.5'}
def html(url):
    return BeautifulSoup(requests.get(url, headers=headers).content, "lxml")
url = "https://www.g2a.com/"
soup = html(url)
print(soup.prettify())
Prints:
<!DOCTYPE html>
<html lang="en-us">
<head>
<link href="polyfill.g2a.com" rel="dns-prefetch"/>
<link href="images.g2a.com" rel="dns-prefetch"/>
<link href="id.g2a.com" rel="dns-prefetch"/>
<link href="plus.g2a.com" rel="dns-prefetch"/>
... and so on.

Can you insert a string literal that contains unicode (like a Chinese character) into a Jinja2 template?

In my Python code (.py file), I set a session variable to Chinese characters:
session.mystring = '你好'
At first, Flask choked. So, at the very top of the file, I put:
# -*- coding: utf-8 -*-
That solved the problem. I could then display the string in my Flask app by just doing:
{{ session.mystring }}
Here's the weird problem. If I write the Chinese characters directly into the template (.html file), like so:
<!-- this works -->
<p>{{ session.mystring }}</p>
<!-- this doesn't -->
<p>你好</p>
The browser (Edge) displays the following error:
'utf-8' codec can't decode byte 0xb5 in position 516: invalid start byte
I tried putting the following at the top of the template:
<head>
<meta charset="utf-8">
</head>
But that doesn't help. Is there any way I can insert the Chinese characters directly as a literal in the Jinja2 template?
EDIT (June 7, 2020):
Made my question clearer by actually putting the Chinese characters into the question (thought I couldn't previously).
This should work by simply adding the line below in the HTML file:
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
Working example with a minimal Flask app:
app.py:
from flask import Flask, render_template

app = Flask(__name__)

@app.route('/')
def test():
    data = '中华人民共和国'
    return render_template('index.html', data=data)

if __name__ == '__main__':
    app.run()
In templates/index.html :
<!DOCTYPE html>
<html lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Document</title>
</head>
<body>
{{ data }}
</body>
</html>
Output: the Chinese characters render correctly (screenshot omitted). If the decode error persists, make sure the .html template file itself is saved as UTF-8: Jinja2 decodes templates as UTF-8, and a stray byte like 0xb5 usually means the file was saved in a different encoding.

What are these different errors I'm getting when scraping using BeautifulSoup?

I'm trying to scrape this site called whoscored.com and here's the simple code I'm using to scrape a particular page of it.
import requests
import pandas as pd
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
page = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"
pageTree = requests.get(page, headers=headers)
pageSoup = BeautifulSoup(pageTree.content, 'lxml')
print(pageSoup)
The code runs just fine but here's what it's returning -
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<title>404 - File or directory not found.</title>
<style type="text/css">
<!--
body{margin:0;font-size:.7em;font-family:Verdana, Arial, Helvetica, sans-serif;background:#EEEEEE;}
fieldset{padding:0 15px 10px 15px;}
h1{font-size:2.4em;margin:0;color:#FFF;}
h2{font-size:1.7em;margin:0;color:#CC0000;}
h3{font-size:1.2em;margin:10px 0 0 0;color:#000000;}
#header{width:96%;margin:0 0 0 0;padding:6px 2% 6px 2%;font-family:"trebuchet MS", Verdana, sans-serif;color:#FFF;background-color:#555555;}
#content{margin:0 0 0 2%;position:relative;}
.content-container{background:#FFF;width:96%;margin-top:8px;padding:10px;position:relative;}
-->
</style>
</head>
<body>
<div id="header"><h1>Server Error</h1></div>
<div id="content">
<div class="content-container"><fieldset>
<h2>404 - File or directory not found.</h2>
<h3>The resource you are looking for might have been removed, had its name
changed, or is temporarily unavailable.</h3>
</fieldset></div>
</div>
<script type="text/javascript">
//<![CDATA[
(function() {
var _analytics_scr = document.createElement('script');
_analytics_scr.type = 'text/javascript'; _analytics_scr.async = true;
_analytics_scr.src = '/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3&ns=1&cb=1578388490';
var _analytics_elem = document.getElementsByTagName('script')[0];
_analytics_elem.parentNode.insertBefore(_analytics_scr, _analytics_elem);
})();
// ]]>
</script></body>
</html>
As you can see, it returns "404 - File or directory not found" and "The resource you are looking for might have been removed, had its name changed, or is temporarily unavailable." There's another bunch of script at the end which I'm not too familiar with.
I have a few ideas why this might be happening. Maybe it's the JavaScript (I see some at the end), or it's due to some sort of counter-measure by the website. However, I'd like to know exactly what the problem is and what I can do to solve it and make sure I'm getting the data I'm trying to scrape from the page - which, by the way, is the entire table.
The little I got from reading similar questions on here is that I need to use Selenium, but I'm not sure how. Any help would be appreciated.
I'm on IDLE. My Python version is 3.7 (64-bit), and my computer is 64-bit.
In your code you have England/Arsenal in the URL, but it has to be England-Arsenal - note the / versus the -.
But the page uses JavaScript, so you can't get the data with BeautifulSoup alone. You will have to use Selenium to control a web browser, which will load the page and run the JavaScript. After the page renders, you can get the HTML from the browser (using Selenium) and pass it to BeautifulSoup to search for your data.
Get tables with Selenium and BeautifulSoup
import selenium.webdriver
from bs4 import BeautifulSoup
url = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"
driver = selenium.webdriver.Firefox()
#driver = selenium.webdriver.Chrome()
driver.get(url)
#print(driver.page_source) # HTML
soup = BeautifulSoup(driver.page_source, 'lxml')
all_tables = soup.find_all('table')
print('len(all_tables):', len(all_tables))
for table in all_tables:
    print(table)
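Note that on a JavaScript-heavy page, driver.page_source can be read before the tables have been filled in. A variant with an explicit wait (a sketch, assuming at least one <table> element eventually appears in the DOM):
import selenium.webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from bs4 import BeautifulSoup

url = "https://www.whoscored.com/Teams/13/RefereeStatistics/England-Arsenal"
driver = selenium.webdriver.Firefox()
driver.get(url)
# wait up to 10 seconds for a <table> element to be present in the DOM
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.TAG_NAME, 'table')))
soup = BeautifulSoup(driver.page_source, 'lxml')
print('len(all_tables):', len(soup.find_all('table')))
driver.quit()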

An uncaught error when trying to update the result of a python script using brython

I'm new to Brython. I got an uncaught error when trying to update my page with the result of a running Python script using Brython.
Uncaught Error
at Object._b_.AttributeError.$factory (eval at $make_exc (brython.js:7017), <anonymous>:31:340)
at attr_error (brython.js:6184)
at Object.$B.$getattr (brython.js:6269)
at CallWikifier0 (eval at $B.loop (brython.js:4702), <anonymous>:34:66)
at eval (eval at $B.loop (brython.js:4702), <anonymous>:111:153)
at $B.loop (brython.js:4702)
at $B.inImported (brython.js:4696)
at $B.loop (brython.js:4708)
at $B.inImported (brython.js:4696)
at $B.loop (brython.js:4708)
Here is my page:
<html>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>
<head>
<script src="/brython-3.7.0rc1/www/src/brython.js"></script>
<script src="/brython-3.7.0rc1/www/src/brython_stdlib.js"></script>
</head>
<body onload="brython()">
<script type="text/python" src="vttToWiki.py"></script>
<div class="wrap">title</div>
<p id="content-title"></p>
<div class="wrap">subtitle</div>
<div id="content-subtitle" class="content-div">
<p id="message" font-color=white>loading...</p>
</div>
<button id="mybutton">click!</button>
</body>
</html>
Here is my Python script. It contacts http://www.wikifier.org/annotate-article, a website for semantic annotation. Given a text, the function CallWikifier gets the annotation result and the corresponding wiki URLs.
import sys, json, urllib.parse, urllib.request
from browser import alert, document, html
element = document.getElementById("content-title")
def CallWikifier(text, lang="en", threshold=0.8):
    # Prepare the URL.
    data = urllib.parse.urlencode([
        ("text", text), ("lang", lang),
        ("userKey", "trnwaxlgwchfgxqqnlfltzdfihiakr"),
        ("pageRankSqThreshold", "%g" % threshold), ("applyPageRankSqThreshold", "true"),
        ("nTopDfValuesToIgnore", "200"), ("nWordsToIgnoreFromList", "200"),
        ("wikiDataClasses", "true"), ("wikiDataClassIds", "false"),
        ("support", "true"), ("ranges", "false"),
        ("includeCosines", "false"), ("maxMentionEntropy", "3")
    ])
    url = "http://www.wikifier.org/annotate-article"
    # Call the Wikifier and read the response.
    req = urllib.request.Request(url, data=data.encode("utf8"), method="POST")
    with urllib.request.urlopen(req, timeout=60) as f:
        response = f.read()
        response = json.loads(response.decode("utf8"))
    # Save the response as a JSON file.
    with open("record.json", "w") as f:
        json.dump(response, f)
    # Update the annotations.
    for annotation in response["annotations"]:
        alert(annotation["title"])
        elt = document.createElement("B")
        txt = document.createTextNode(f"{annotation}\n")
        elt.appendChild(txt)
        element.appendChild(elt)

document["mybutton"].addEventListener("click", CallWikifier("foreign minister has said Damascus is ready " + "to offer a prisoner exchange with rebels."))
If you have any idea about this, please let me know. Thank you for any advice!
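One assumption worth checking: code running in the browser cannot do blocking network I/O the way CPython's urllib.request does, so the HTTP call is a likely source of the error. A sketch of the same POST using Brython's browser.ajax module (the JSON handling and DOM updates would move into the hypothetical on_complete handler):
from browser import ajax
import urllib.parse

def on_complete(req):
    # req.text holds the response body once the request finishes
    print(req.text)

data = urllib.parse.urlencode([("text", "some text"), ("lang", "en")])
req = ajax.ajax()
req.bind("complete", on_complete)
req.open("POST", "http://www.wikifier.org/annotate-article", True)
req.set_header("content-type", "application/x-www-form-urlencoded")
req.send(data)
Note also that open("record.json", "w") cannot write to the local filesystem from inside the browser.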
