StringContext$InvalidEscapeException: invalid escape '\:' not one of [\b, \t, \n, \f, \r, \\, \", \'] when creating an HTML string body - string

On user sign-up, I want to send an email with an HTML body to the user. I have created the email template as follows:
object EmailHTML {
  val body =
    s"""
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
...
""".stripMargin
}
And I am using it like:
val html = if (userToken.tokenType == UserTokenType.RegistrationConfirmation) {
  EmailHTML.body
} else {
  EmailHTML.body
}
SignupEmail(subject,from,html)
When I execute the code, I get the following error:
Caused by: scala.StringContext$InvalidEscapeException: invalid escape '\:' not one of [\b, \t, \n, \f, \r, \\, \", \'] at index 454 in "
|<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
...
The exception doesn't occur if I remove these lines:
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
But the email I receive is not well formatted! What might be the issue, and why are those lines important?

Another solution you might find helpful is removing the s before """:
val body1 = """
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
<meta http-equiv="Content-Type" content="text/html; charset=Windows-1252">
<meta name="Generator" content="Microsoft Word 15 (filtered medium)">
<!--[if !mso]><style>v\:* {behavior:url(#default#VML);}
o\:* {behavior:url(#default#VML);}
w\:* {behavior:url(#default#VML);}
.shape {behavior:url(#default#VML);}
</style><![endif]-->
...""".stripMargin
Without the s interpolator, the compiler performs no interpolation and processes no escape sequences, so the string stays exactly as written.
Code running at Scastie.

Solved it by adding an extra \, i.e. I changed o\: to o\\: (and likewise in the other places). The solution was trivial; I just wasn't thinking straight.

Related

Using FOR loop and IF for BeautifulSoup in Python

I am trying to pull out the meta description of a few webpages. Below is my code:
URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
        else:
            desc = "Not Found"
            Meta_Description.append(desc)
Now this is returning me the below:
['Not Found',
'Not Found',
'Not Found',
'Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!',
'Not Found',
'Not Found',
'Not Found',
'Not Found']
I want to pull the content where the meta name == 'description'. In case the condition doesn't match, i.e. the page doesn't have a meta tag with name == 'description', it should return Not Found.
Expected Output:
['Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!']
Please suggest.
Let me know if this works for you!
URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
meta_flag = False
for url in URL_List:
    response = requests.get(url, headers=headers)
    meta_flag = False
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
            meta_flag = True
            continue
    if not meta_flag:
        desc = "Not Found"
        Meta_Description.append(desc)
The idea behind the code is that it iterates through all the items in metas; if a 'description' is found, it sets the flag to True, which makes the subsequent if-statement get skipped. If nothing is found after iterating through metas, it appends "Not Found" to Meta_Description.
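The flag pattern above can also be written with Python's for/else, where the else branch runs only when the inner loop finishes without hitting break. A minimal, network-free sketch (plain dicts stand in for the parsed <meta> tags, and the sample contents are made up):

```python
# Two simulated pages: the first has no description meta, the second has one.
pages = [
    [{"charset": "utf-8"}],
    [{"name": "description", "content": "Learn data skills"}],
]

Meta_Description = []
for metas in pages:
    for m in metas:
        if m.get("name") == "description":
            Meta_Description.append(m.get("content"))
            break
    else:
        # Reached only if the loop above completed without a break,
        # i.e. no description meta was found on this page.
        Meta_Description.append("Not Found")

print(Meta_Description)  # one entry per page, never one per tag
```

This avoids the explicit meta_flag variable entirely while producing the same one-entry-per-page result.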
Your result is actually what it's supposed to be.
Your current code
Let's have a look.
for url in URL_List:
For each page in your list,
metas = soup.find_all('meta')
you want all the <meta> tags (regardless of other attributes).
for m in metas:
For each <meta> tag, check if it's a <meta name="description">. If it is, save its content; otherwise, save "Not Found".
HTML <meta>
See more info on MDN's docs. In short, you can use <meta> for more meta needs, and as such, you can have multiple <meta> tags in your HTML document.
Your URLs
If you actually open up your URLs and have a look at the HTML, you'll see that you get
digisapient (1 <meta>)
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
dataquest (a lot more <meta>s)
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta name="robots" content="index, follow" />
<meta name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta name="bingbot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<meta property="og:title" content="Learn Data Science Online and Build Data Skills with Dataquest" />
<meta property="og:description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta property="og:url" content="https://www.dataquest.io/" />
<meta property="og:site_name" content="Dataquest" />
<meta property="article:publisher" content="https://www.facebook.com/dataquestio" />
<meta property="article:modified_time" content="2020-05-14T22:43:29+00:00" />
<meta property="og:image" content="https://www.dataquest.io/wp-content/uploads/2015/02/dataquest-learn-data-science-logo.jpg" />
<meta property="og:image:width" content="1040" />
<meta property="og:image:height" content="520" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:creator" content="@dataquestio" />
<meta name="twitter:site" content="@dataquestio" />
As you can see, the second website has many more such tags in its content, and for every one of those, you go through that last if/else statement; if it doesn't have a name="description", you save a "Not Found" in your results set.
Your actual results
To check what your program is doing at each step, and to better understand the values the variables take over time, I think it's worthwhile to look up debugging and start doing it.
For a quick 'n dirty solution, try writing some messages to the screen as your program's execution progresses, e.g. Downloaded URL http://..., Found 4 <meta> tags, Found 0 <meta name="description".
I suggest you go over your program line by line, with the HTML in the other hand, and try to see what should happen.
Getting there
From your expected results, it seems that you don't actually care about the "Not Found"s, and you know up front that you always want the <meta name="description"> tags.
With CSS Selectors
You can try selecting DOM elements on your own in the browser, using document.querySelectorAll(). Open up your browser's developer tools / JavaScript console and type in, for instance
document.querySelectorAll("meta[name='description']")
to get all of the page's meta tags with a name attribute, which has the value of description. See more on CSS selectors.
Having this in mind is important, because
1. you get a better feel for what you're looking at / trying to do
2. you can actually use this type of selector with BeautifulSoup as well!
With BeautifulSoup
So you can move the check for the name attribute up in the query. Something like
metas = soup.find_all('meta', attrs={"name": "description"})
and that should give you only the tags which have name=description in them. This means that all the other <meta>s will be ignored.
Alternatively, you can keep the current query method (get all <meta>s) and simply ignore the non-matching tags in the if/else statement, i.e. don't save them in your results list. If you want to know that the program is actually doing something, but a tag just doesn't match your required query, you could log a message instead of saving the "Not Found".
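Putting the pieces above together, here is a small runnable sketch of the filter-in-the-query approach (the HTML and its description text are invented for the example):

```python
from bs4 import BeautifulSoup

# A minimal page standing in for the real URLs.
html = """
<head>
  <meta charset="utf-8">
  <meta name="description" content="Shiny example description">
  <meta property="og:title" content="Ignored here">
</head>
"""
soup = BeautifulSoup(html, "html.parser")

# Filter inside the query instead of looping over every <meta> afterwards.
metas = soup.find_all("meta", attrs={"name": "description"})
# Equivalent CSS-selector form:
# metas = soup.select("meta[name='description']")

contents = [m["content"] for m in metas] or ["Not Found"]
print(contents)
```

Because the query only returns matching tags, the `or ["Not Found"]` fallback fires exactly once per page, never once per tag.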

BeautifulSoup different parsers

Could anyone elaborate on the difference between parsers like html.parser and html5lib?
I've stumbled across a weird behavior where html.parser ignores all the tags in a specific place. Look at this code:
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'html.parser')
tags = soup.find_all('a')
print(tags)
This returns an empty list, whereas with html5lib the desired "a" tags are returned as expected.
Does anyone know the reason for that?
I've read the documentation, but the explanation of the different parsers is pretty vague.
Also, I've noticed that html5lib ignores invalid tags like nested form tags. Is there a way to avoid the above behavior with html.parser and also get invalid tags like nested form tags? (When parsing with html5lib, one of the form tags is removed.)
Thanks in advance.
You can use lxml, which is very fast, and then use find_all or select to get all the tags.
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.find_all('a')
print(tags)
OR
from bs4 import BeautifulSoup
html = """
<html>
<head></head>
<body>
<!--[if lte IE 8]> <!-- data-module-name="test"--> <![endif]-->
<![endif]-->
<!--[if lte IE 8]>
<![endif]-->
</body>
</html>
"""
soup = BeautifulSoup(html, 'lxml')
tags = soup.select('a')
print(tags)
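Neither snippet explains the underlying behavior, so here is a hedged sketch of what html.parser does with a conditional comment (the markup and the /old-ie URL are invented for the example): the whole `<!--[if ...]> ... <![endif]-->` block is a single Comment node, so tags written inside it never appear in the parse tree, but they can be recovered by re-parsing the comment's own text.

```python
from bs4 import BeautifulSoup, Comment

html = '<body><!--[if lte IE 8]><a href="/old-ie">link</a><![endif]--></body>'

# html.parser keeps the whole conditional comment as one Comment node,
# so the <a> inside it never becomes a tag in the tree.
soup = BeautifulSoup(html, "html.parser")
hidden = soup.find_all("a")

# The markup can still be recovered by re-parsing the comment's text.
comment = soup.find(string=lambda s: isinstance(s, Comment))
links = BeautifulSoup(str(comment), "html.parser").find_all("a")
print(hidden, links)
```

html5lib follows the HTML5 spec's error-recovery rules instead, which is also why it rewrites invalid structures like nested forms rather than preserving them.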

requests.get.text error in python 3.6 really need some help here

from bs4 import BeautifulSoup
import requests
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)
I want to reach the content in the div (class="content") (p), but when I print r.text a big part of it disappears.
However, I also found that if I open a text file and write r.text into it, the content looks just right in the notebook:
doc = open("file104.txt", "w", encoding="utf-8")
doc.write(r.text)
doc.close()
I guess it might be an encoding problem? But it still doesn't work after encoding in UTF-8.
Sorry everybody!
===========================================================================
I finally found that the problem comes from the IPython/IDLE console; everything is fine if I run the code in PowerShell. I should have tried this earlier...
But I'd still like to know what causes this problem!
Use content.decode()
>>> import requests
>>> url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
>>> r = requests.get(url)
>>> TextInfo = r.content.decode('UTF-8')
>>> print(TextInfo)
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html lang="zh-tw"><!--<![endif]-->
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
.....
.....
the guts of the html code
.....
.....
</script>
</body>
</html>
>>>
from bs4 import BeautifulSoup
import urllib.request
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = urllib.request.urlopen(url).read()
r = r.decode('utf-8')
print(r)
# OR
urllib.request.urlretrieve(url, "myhtml.html")
with open("myhtml.html", encoding="utf-8") as myhtml:
    print(myhtml.read())
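Since the root cause turned out to be the console's codec rather than requests itself, here is a minimal, network-free sketch of the decoding step that r.content.decode('UTF-8') performs (the sample text simply simulates a UTF-8 response body):

```python
# A server sends bytes; the codec chosen to decode them decides what you see.
raw = "中文內容".encode("utf-8")  # UTF-8 bytes, as the server would return them

ok = raw.decode("utf-8")    # correct codec: the original text comes back
bad = raw.decode("latin-1")  # wrong guess: silent mojibake, no error raised
print(ok)
print(bad)
```

When requests guesses the wrong encoding for r.text (or the console cannot display the decoded characters), you get the second outcome; decoding r.content explicitly pins the codec down.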

Selecting using XmlSlurper like a WHERE clause

Given the following HTML snippet, I need to extract the text of the content attribute for the meta tag with attribute name equal to description and the meta tag with attribute property equal to og:title. I've tried what is shown at Groovy: Correct Syntax for XMLSlurper to find elements with a given attribute, but it doesn't seem to work the same in Groovy 1.8.6.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta property="fb:admins" content="100003979125336" />
<meta name="description" content="Shiny embossed blue butterfly" />
<meta name="keywords" content="Fancy That Blue Butterfly Platter" />
<meta property="og:title" content="Fancy That Blue Butterfly Platter" />
Is there a clean way to retrieve these with GPath?
This works with groovy 2.0.1 - I don't have 1.8.6 handy at the moment:
def slurper = new XmlSlurper()
File xmlFile = new File('sample.xml')
def xml = slurper.parseText(xmlFile.text)
println 'description = ' + xml.head.children().find{it.name() == 'meta' && it.@name == 'description'}.@content
println 'og:title = ' + xml.head.children().find{it.name() == 'meta' && it.@property == 'og:title'}.@content

How do I use Paul Irish's Conditional comments in a SharePoint 2010 master page

I want to use Paul Irish's Conditional comments from the Boilerplate HTML template:
<!--[if lt IE 7 ]> <html lang="en" class="no-js ie6"> <![endif]-->
<!--[if IE 7 ]> <html lang="en" class="no-js ie7"> <![endif]-->
<!--[if IE 8 ]> <html lang="en" class="no-js ie8"> <![endif]-->
<!--[if IE 9 ]> <html lang="en" class="no-js ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js"> <!--<![endif]-->
in a SharePoint 2010 masterpage. I have read 'conditional comments don’t always work so well in SP2010'. (not sure what that means!) The advice is to use:
<SharePoint:CSSRegistration Name="foo.css" ConditionalExpression="gte IE 7" runat="server" />
This allows me to use a conditional to load a specific stylesheet, but not to use the conditional html tag in the way Paul Irish suggests. Is there a way to do this, or can I simply paste the code from Boilerplate into the SharePoint master page?
I assume that SharePoint master pages are the same as the ones in ASP.NET (MVC). Therefore this shouldn't be a problem at all.
<!--[if lt IE 7 ]> <html class="ie6"> <![endif]-->
<!--[if IE 7 ]> <html class="ie7"> <![endif]-->
<!--[if IE 8 ]> <html class="ie8"> <![endif]-->
<!--[if IE 9 ]> <html class="ie9"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html> <!--<![endif]-->
All the preceding code does is set a different CSS class on the HTML tag depending on which browser is accessing the site, so that you are able to override some styles for any given browser (IE).
site.css
.coloredBackground
{
    background-color: #FF0000; /* red */
}
.ie6 .coloredBackground
{
    background-color: #00FF00; /* green */
}
.ie8 .coloredBackground
{
    background-color: #0000FF; /* blue */
}
In this example, users browsing with Firefox, Opera, Safari, or IE 7, 9, or 10 will see a red background. For IE6 the background color is overridden with green, and in IE8 with blue.
Your CSS registration would look like the following:
<SharePoint:CSSRegistration Name="site.css" runat="server" />
As you can see there is no need to set the ConditionalExpression anymore in the CSS registration, because you are already switching the used style sheet by setting a specific class on the HTML element.
Update:
Another possibility would be to include another style sheet file depending on the browser version using the ConditionalExpression property on the SharePoint aspx control.
<SharePoint:CSSRegistration Name="ie6.css" ConditionalExpression="lt IE 7" runat="server" />
<SharePoint:CSSRegistration Name="ie7.css" ConditionalExpression="IE 7" runat="server" />
<SharePoint:CSSRegistration Name="ie8.css" ConditionalExpression="IE 8" runat="server" />
<SharePoint:CSSRegistration Name="ie9.css" ConditionalExpression="IE 9" runat="server" />
The downside is that you may get CSS specificity issues, because the .ie* class is missing from the html element and therefore doesn't overrule the class you want to override for a specific IE version. You could solve this by using the !important rule in the IE-specific style sheet files.
I am very new to SharePoint branding and came across the same problem. I am not an ASP developer, so understanding some of the solutions was a little tough.
What I did was take Paul's conditional statements from the HTML tag and move them to the BODY tag, and it seemed to work perfectly without having to mess with any SP code.
<!--[if lt IE 7]> <body class="no-js lt-ie9 lt-ie8 lt-ie7" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if IE 7]> <body class="no-js lt-ie9 lt-ie8" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if IE 8]> <body class="no-js lt-ie9" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <![endif]-->
<!--[if gt IE 8]><!--> <body class="no-js" scroll="no" onload="javascript:_spBodyOnLoadWrapper();"> <!--<![endif]-->
Hope that helps.
I used this (from the bottom of Paul's post, dated 2012/1/17) and it worked fine in a SharePoint master page. Please note this is different from what you posted, which was an earlier version of the code.
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class=""> <!--<![endif]-->
My concern is the other things that had to get stripped out that may be needed by SharePoint pages. Here is the original SharePoint HTML tag:
<html lang="<%$Resources:wss,language_value%>" dir="<%$Resources:wss,multipages_direction_dir_value%>" runat="server" xmlns:o="urn:schemas-microsoft-com:office:office" __expr-val-dir="ltr">
Once you start adding that stuff to the HTML tags inside the conditional comments, the page breaks.
I urge you to also review your HTML/CSS to see if some of the conditional CSS can be removed to help streamline your solution.
I used ASP literal strings to pass the conditional IE comments in my SP2010 masterpage.
That way, I could still pass in the language value used in the original <html> tag, <%$Resources:wss,language_value%>.
<asp:literal ID="html1a" runat="server" Text="&lt;!--[if lt IE 7]&gt; &lt;html class=&quot;no-js lt-ie9 lt-ie8 lt-ie7&quot; lang=&quot;"/><asp:Literal ID="html1b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html1c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html2a" runat="server" Text="&lt;!--[if IE 7]&gt;&lt;html class=&quot;no-js lt-ie9 lt-ie8&quot; lang=&quot;"/><asp:Literal ID="html2b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html2c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html3a" runat="server" Text="&lt;!--[if IE 8]&gt;&lt;html class=&quot;no-js lt-ie9&quot; lang=&quot;"/><asp:Literal ID="html3b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html3c" runat="server" Text="&quot;&gt;&lt;![endif]--&gt;"/><br />
<asp:literal ID="html4a" runat="server" Text="&lt;!--[if gt IE 8]&gt;&lt;!--&gt; &lt;html class=&quot;no-js&quot; lang=&quot;"/><asp:Literal ID="html4b" runat="server" Text="<%$Resources:wss,language_value%>" /><asp:literal ID="html4c" runat="server" Text="&quot;&gt;&lt;!--&lt;![endif]--&gt;"/>
But if someone could further improve this method, it would be appreciated!
With SharePoint 2010 it is best to avoid putting conditionals around the <html> tag. That being said, you can add your conditionals within the <head> tag and add classes to the html tag using JavaScript. This is not ideal for a regular website, but for SharePoint it is an unobtrusive method that works on a large enterprise SharePoint 2010 site that I help maintain:
...
<html id="Html1" class="no-js" lang="<%$Resources:wss,language_value %>" dir="<%$Resources:wss,multipages_direction_dir_value %>" runat="server" __expr-val-dir="ltr">
<head id="Head1" runat="server">
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta http-equiv="Expires" content="0"/>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<!-- IE version conditionals will apply classes to html element if old IE -->
<!--[if lt IE 7 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie7";</script><![endif]-->
<!--[if lt IE 8 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie8";</script><![endif]-->
<!--[if lt IE 9 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie9";</script><![endif]-->
<!--[if lt IE 10 ]><script type="text/javascript">var element = document.getElementById("ctl00_Html1"); element.className += " " + "lt-ie10";</script><![endif]-->
...
