How to deal with "lxml.etree.XMLSyntaxError"? - python-3.x

When I use lxml.etree to parse a page, I get errors like the following:
lxml.etree.XMLSyntaxError: Double hyphen within comment: <!--[if lte IE 9 ]>
The beginning of the page looks like this:
<!--[if lt IE 7 ]><html class="ie6 domain_so"><![endif]-->
<!--[if IE 7 ]><html class="ie7 domain_so"><![endif]-->
<!--[if IE 8 ]><html class="ie8 domain_so"><![endif]-->
<!--[if IE 9 ]><html class="ie9 domain_so"><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--><html class="w3c domain_so"><!--<![endif]-->
How can I handle this?
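A common way around this kind of error, assuming the page is ordinary HTML rather than well-formed XML, is to parse it with lxml's HTML parser instead of the strict XML parser. The sketch below is only an illustration: the PAGE string is a hypothetical page modelled on the markup in the question, and the nested comment in its head is an assumed trigger for the "Double hyphen within comment" error, since the failing part of the real page is not shown.

```python
from lxml import etree

# Hypothetical page modelled on the question's markup. The nested comment
# inside <head> is an assumption about what triggers the error.
PAGE = """<!--[if IE 9 ]><html class="ie9 domain_so"><![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--><html class="w3c domain_so"><!--<![endif]-->
<head>
<title>Example</title>
<!--[if lte IE 9 ]><!-- legacy shim --><![endif]-->
</head>
<body><p>Hello</p></body>
</html>"""

# The strict XML parser rejects "--" inside a comment and raises.
try:
    etree.fromstring(PAGE)
    xml_error = None
except etree.XMLSyntaxError as exc:
    xml_error = exc

# The HTML parser recovers from malformed markup instead of raising.
root = etree.fromstring(PAGE, etree.HTMLParser())

print("XML parser failed:", xml_error is not None)
print("HTML root:", root.tag, root.get("class"))
print("Title:", root.findtext(".//title"))
```

lxml.html.fromstring(PAGE) is another entry point to the same lenient HTML parsing. Whether this is appropriate depends on the page really being HTML; for genuine XML input the error usually points at a document that needs fixing.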

Related

Matching Font Size of SVG with Regular Text in a Web Browser

I am trying to get the font size of my svg file to match the font size of my regular text. I have managed to get the fonts themselves to match (both use Open Sans by Steve Matteson), but the sizing is different. My svg file consistently renders a slightly larger font than my regular text.
When I created the svg file (in Inkscape) I set the font size to 22.4px; I have also set the font size of my regular text to 22.4px. I have done some research on this and realize that svgs are quite complicated to render and that browsers have many default stylings, but I have not been able to home in on what exactly causes this discrepancy in font size. (I have tried a browser reset stylesheet, but this did not help; the problem remained exactly as it was without the reset.)
Here is a screenshot comparing the regular text on top with the svg underneath. You can see the svg is slightly larger in font size (this is most visible when comparing the number "1" in the text vs the svg).
HTML:
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<title>Font Comparison Test</title>
<!-- Open Sans font by Steve Matteson -->
<link rel="preconnect" href="https://fonts.gstatic.com">
<link href="https://fonts.googleapis.com/css2?family=Open+Sans+Condensed:wght@300&display=swap" rel="stylesheet">
<link rel="stylesheet" href="css/main.css">
</head>
<body>
<p>
Einige suchen dieselbe in der fortlaufenden Zahlenreihe
von 1 bis 16 und legen den Gliedern dieser Reihe die Töne
auf folgende Weise unter :
</p>
<!-- First graph, series 1 - 16 -->
<img src="svg/1.1 English series 1 - 16.svg" style="width: auto; height: auto;">
</body>
</html>
CSS:
body{
font-family: 'Open Sans Condensed', sans-serif;
background-color: #162e2e;
color: #dfddfe; /* Font colour */
font-size: 22.4px; /* 1.4 = 22.4 px? */
font-weight: 400; /* important for font styling */
}
SVG:
<svg
width="167mm"
height="17mm"
viewBox="0 0 167 17"
version="1.1"
id="svg127">
<g
inkscape:label="Layer 1"
inkscape:groupmode="layer"
id="layer1"
transform="translate(-17.407795,-14.122843)">
<text
xml:space="preserve"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-size:7.90222216px;line-height:1.25;font-family:'Open Sans Condensed';-inkscape-font-specification:'Open Sans Condensed, ';letter-spacing:0px;word-spacing:0px;fill:#dfddfe;fill-opacity:1;stroke:none;stroke-width:0.26458332"
x="18.06776"
y="21.276409"
id="text4998-2-0"><tspan
sodipodi:role="line"
x="18.06776"
y="21.276409"
style="font-style:normal;font-variant:normal;font-weight:normal;font-stretch:normal;font-family:'Open Sans Condensed';-inkscape-font-specification:'Times New Roman, ';stroke-width:0.26458332"
id="tspan5000-5-1">1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16</tspan></text>
I had the same problem. I solved it by setting the svg width and height to "1em".
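The numbers in the question's SVG also allow a quick sanity check of the rendered size. The following is a rough sketch, assuming the image is displayed at its intrinsic size (167mm wide, as with width: auto) and using the CSS absolute unit ratios (1in = 96px = 25.4mm = 72pt). The function names are made up for illustration.

```python
# CSS absolute units: 1in = 96px = 25.4mm = 72pt
PX_PER_INCH = 96.0
MM_PER_INCH = 25.4
PT_PER_INCH = 72.0


def user_units_to_px(value, physical_width_mm, viewbox_width):
    """Convert a length in SVG user units to CSS px.

    Assumes the SVG is rendered at its intrinsic physical width.
    """
    mm_per_unit = physical_width_mm / viewbox_width
    return value * mm_per_unit * PX_PER_INCH / MM_PER_INCH


def pt_to_px(points):
    """Convert typographic points to CSS px."""
    return points * PX_PER_INCH / PT_PER_INCH


# Values taken from the question: width="167mm", viewBox="0 0 167 17",
# font-size:7.90222216px (which is in user units inside the SVG).
svg_font_px = user_units_to_px(7.90222216, physical_width_mm=167, viewbox_width=167)

print(round(svg_font_px, 2))     # 29.87
print(round(pt_to_px(22.4), 2))  # 29.87
```

Under these assumptions the SVG text renders at roughly 29.87px rather than 22.4px, which happens to equal 22.4pt. One possible reading, not confirmed in the thread, is that the size was entered in points rather than pixels in Inkscape.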

StringContext$InvalidEscapeException: invalid escape '\:' not one of [\b, \t, \n, \f, \r, \\, \", \'] when creating an HTML string body

On user sign-up, I want to send an email with an HTML body to the user. I have created the email template as follows:
object EmailHTML {
val body =
s"""
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
...
""".stripMargin
}
And I am using it like:
val html = if(userToken.tokenType == UserTokenType.RegistrationConfirmation){
EmailHTML.body
}else {
EmailHTML.body
}
SignupEmail(subject,from,html)
When I execute the code, I get the following error:
Caused by: scala.StringContext$InvalidEscapeException: invalid escape '\:' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 454 in "
|<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
...
The exception doesn't occur if I remove these lines:
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
But the email I receive is not well formatted! What might be the issue and what is the importance of the line?
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
Another solution you might find helpful is removing the s before """:
val body1 = """
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
...""".stripMargin
Without the s interpolator, the triple-quoted string is not processed for escape sequences or $ interpolation, so it stays exactly as written.
I solved it by adding an extra \, i.e. I changed o\: to o\\: and did the same in the other places. The solution was trivial; I just wasn't thinking straight.

BeautifulSoup different parsers

Could anyone elaborate on the difference between parsers like html.parser and html5lib?
I've stumbled across some weird behavior where html.parser ignores all the tags in a specific place. Look at this code:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)
This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected.
Does anyone know the reason for that?
I've read the documentation, but the explanation of the different parsers is pretty vague.
Also, I've noticed that html5lib drops invalid markup such as nested form tags (when parsing with html5lib, one of the form tags is removed). Is there a way to use html5lib to avoid the above behavior of html.parser and still get invalid tags like nested form tags?
Thanks in advance.
You can use lxml, which is very fast, with either find_all or select to get all the tags.
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)
Or, using select:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
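One thing that may help when comparing parsers: to a non-IE parser, a conditional comment is just a comment, so any tags inside it are comment text rather than elements. The sketch below does not use BeautifulSoup itself. It uses Python's built-in html.parser module, which bs4's 'html.parser' option builds on, and a made-up SAMPLE string (not the markup from the question) to show which tags are reported as real start tags.

```python
from html.parser import HTMLParser


class TagAndCommentCollector(HTMLParser):
    """Record start tags and comment bodies as the parser reports them."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.comments = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)

    def handle_comment(self, data):
        self.comments.append(data)


# Hypothetical markup: one link inside a conditional comment, one outside.
SAMPLE = (
    '<body>'
    '<!--[if lte IE 8]><a href="/old">old IE</a><![endif]-->'
    '<a href="/all">everyone</a>'
    '</body>'
)

collector = TagAndCommentCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.start_tags)  # only the tags outside the comment
print(collector.comments)    # the first link appears here as plain text
```

If the links you are after live inside conditional comments, a possible approach is to pull the comment text out and parse it separately. How malformed or nested comments are split up differs between parsers, which is one likely source of the differences described above.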

requests.get.text error in python 3.6 really need some help here

from bs4 import BeautifulSoup
import requests
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)
I want to get at the content of the p elements inside the div with class="content",
but when I print r.text, a large part of the page is missing.
However, I also found that if I open a text file and write r.text to it, the file contains everything:
doc = open("file104.txt", "w", encoding="utf-8")
doc.write(r.text)
doc.close()
I guess it might be an encoding problem? But it still does not work after I set the encoding to utf-8.
Sorry everybody!
===========================================================================
I finally found that the problem comes from the IPython IDLE; everything is fine if I run the code in PowerShell. I should have tried this earlier...
But I would still like to know what causes this problem!
Use content.decode()
>>> import requests
>>> url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
>>> r = requests.get(url)
>>> TextInfo = r.content.decode('UTF-8')
>>> print(TextInfo)
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html lang="zh-tw"><!--<![endif]-->
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
.....
.....
the guts of the html code
.....
.....
</script>
</body>
</html>
>>>
import urllib.request
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
# Option 1: read the raw response bytes and decode them explicitly
r = urllib.request.urlopen(url).read()
r = r.decode('utf-8')
print(r)
# Option 2: save the page to a file, then read it back and decode it
urllib.request.urlretrieve(url, "myhtml.html")
with open("myhtml.html", "rb") as myhtml:
    print(myhtml.read().decode('utf-8'))

How do I use Paul Irish's Conditional comments in a SharePoint 2010 master page

I want to use Paul Irish's Conditional comments from the Boilerplate HTML template:
<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6"> <![endif]-->
<!--[if IE 7 ]> <html lang="en" class="no-js ie7"> <![endif]-->
<!--[if IE 8 ]> <html lang="en" class="no-js ie8"> <![endif]-->
<!--[if IE 9 ]> <html lang="en" class="no-js ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js"> <!--<![endif]-->
in a SharePoint 2010 masterpage. I have read 'conditional comments don’t always work so well in SP2010'. (not sure what that means!) The advice is to use:
<SharePoint:CSSRegistration Name="foo.css" ConditionalExpression="gte IE 7" runat="server" />
This allows me to use a conditional to load a specific stylesheet, but not to use the conditional html tag in the way Paul Irish suggests. Is there a way to do this, or can I simply paste the code from Boilerplate into the SharePoint masterpage?
I assume that SharePoint masterpages work like the ones in ASP.NET (MVC). Therefore this shouldn't be a problem at all.
<!--[if lt IE 7 ]> <html class="ie6"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8"> <![endif]-->
<!--[if IE 9 ]> <html class="ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html> <!--<![endif]-->
All the preceding code does is set a different CSS class on the HTML tag depending on which browser is accessing the site, so that you can override some styles for a given browser (IE).
site.css
.coloredBackground
{
background-color: #FF0000; /* red */
}
.ie6 .coloredBackground
{
background-color: #00FF00; /* green */
}
.ie8 .coloredBackground
{
background-color: #0000FF; /* blue */
}
In this example, users browsing with Firefox, Opera, Safari, or IE 7, 9 or 10 will see a red background. For IE6 the background color is overridden with green, and for IE8 with blue.
Your CSS registration would look like the following:
<SharePoint:CSSRegistration Name="site.css" runat="server" />
As you can see there is no need to set the ConditionalExpression anymore in the CSS registration, because you are already switching the used style sheet by setting a specific class on the HTML element.
Update:
Another possibility would be to include another style sheet file depending on the browser version using the ConditionalExpression property on the SharePoint aspx control.
<SharePoint:CSSRegistration Name="ie6.css" ConditionalExpression="lt IE 7" runat="server" />
<SharePoint:CSSRegistration Name="ie7.css" ConditionalExpression="IE 7" runat="server" />
<SharePoint:CSSRegistration Name="ie8.css" ConditionalExpression="IE 8" runat="server" />
<SharePoint:CSSRegistration Name="ie9.css" ConditionalExpression="IE 9" runat="server" />
The downside is that you may get css priority issues because the .ie* class is missing on the html element and therefore doesn't overrule the .class-to-override-if-specific-ie-version. You could solve this by using the !important rule in the ie specific style sheet files.
I am very new to SharePoint branding and came across the same problem. I am not an ASP developer, so understanding some of the solutions was a little tough.
What I did was take Paul's conditional statements from the HTML tag and move them to the BODY tag, and it seemed to work perfectly without having to mess with any SP code.
<!--[if lt IE 7]> <body class="no-js lt-ie9 lt-ie8 lt-ie7" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if IE 7]> <body class="no-js lt-ie9 lt-ie8" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if IE 8]> <body class="no-js lt-ie9" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if gt IE 8]><!--> <body class="no-js" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <!--<![endif]-->
Hope that helps.
I used this (from the bottom of Paul's post, dated 2012/1/17) and it worked fine in a SharePoint master page. Please note this is different than what you posted, which was an earlier version of the code.
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class=""> <!--<![endif]-->
My concern is the other things that had to get stripped out that may be needed by SharePoint pages. Here is the original SharePoint HTML tag:
<html lang="<%$Resources:wss,language_value%>" dir="<%$Resources:wss,multipages_direction_dir_value%>" runat="server" xmlns:o="urn:schemas-microsoft-com:office:office" __expr-val-dir="ltr">
Once you start adding that stuff in the HTML tags with the conditional comments the page breaks.
I urge you to also review your HTML/CSS to see if some of the conditional CSS can be removed to help streamline your solution.
I used ASP literal strings to pass the conditional IE comments in my SP2010 masterpage.
That way, I could still pass in the language value used in the original <html> tag, <%$Resources:wss,language_value%>.
<asp:literal ID="html1a" runat="server" Text="&lt;!--[if lt IE 7]&gt; &lt;html class=&quot;no-js lt-ie9 lt-ie8 lt-ie7&quot; lang=&quot;"/><asp:Literal ID="html1b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html1c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html2a" runat="server" Text="&lt;!--[if IE 7]&gt;&lt;html class=&quot;no-js lt-ie9 lt-ie8&quot; lang=&quot;"/><asp:Literal ID="html2b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html2c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html3a" runat="server" Text="&lt;!--[if IE 8]&gt;&lt;html class=&quot;no-js lt-ie9&quot; lang=&quot;"/><asp:Literal ID="html3b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html3c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html4a" runat="server" Text="&lt;!--[if gt IE 8]&gt;&lt;!--&gt; &lt;html class=&quot;no-js&quot; lang=&quot;"/><asp:Literal ID="html4b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html4c" runat="server" Text="&quot;&gt;&lt;!--&lt;![endif]--&gt;"/>
But if someone could further improve this method, it would be appreciated!
With SharePoint 2010 it is best to avoid putting conditionals around the <html> tag. That being said, you can put your conditionals inside the <head> tag and add classes to your html tag using JavaScript. This is not ideal for a regular website, but for SharePoint it is an unobtrusive method that works in a large enterprise SharePoint 2010 site that I help maintain:
...
<html id="Html1" class="no-js" lang="<%$Resources:wss,language_value %>" dir="<%$Resources:wss,multipages_direction_dir_value %>" runat="server" __expr-val-dir="ltr">
<head id="Head1" runat="server">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta http-equiv="Expires" content="0"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- IE version conditionals will apply classes to html element if old IE -->
<!--[if lt IE 7 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie7";</script><![endif]-->
<!--[if lt IE 8 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie8";</script><![endif]-->
<!--[if lt IE 9 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie9";</script><![endif]-->
<!--[if lt IE 10 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie10";</script><![endif]-->
...

Resources