Selecting using XmlSlurper like a WHERE clause - groovy

Given the following HTML snippet, I need to extract the text of the content attribute for the meta tag with attribute name equal to description and the meta tag with attribute property equal to og:title. I've tried what is shown at Groovy: Correct Syntax for XMLSlurper to find elements with a given attribute, but it doesn't seem to work the same in Groovy 1.8.6.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta property="fb:admins" content="100003979125336" />
<meta name="description" content="Shiny embossed blue butterfly" />
<meta name="keywords" content="Fancy That Blue Butterfly Platter" />
<meta property="og:title" content="Fancy That Blue Butterfly Platter" />
Is there a clean way to retrieve these with GPath?

This works with Groovy 2.0.1 - I don't have 1.8.6 handy at the moment:
def slurper = new XmlSlurper()
File xmlFile = new File('sample.xml')
def xml = slurper.parseText(xmlFile.text)
println 'description = ' + xml.head.children().find{it.name() == 'meta' && it.@name == 'description'}.@content
println 'og:title = ' + xml.head.children().find{it.name() == 'meta' && it.@property == 'og:title'}.@content

Related

Selenium handles '@' character very strangely

I'm trying to use Selenium in Python with the standard and the undetected_chrome drivers, in normal and headless browser modes.
I've noticed some strange behaviors:
the normal chrome driver handles special characters fine when sending keys into an HTML input field with the send_keys() function
the undetected_chrome driver does not handle special characters well with the send_keys() function
in normal browser mode, if I send an email address into an HTML input field, like 'abc@xyz.com', the current content of the clipboard is pasted in place of the '@' character
e.g.: if the last string I copied was '123', the entered email address will not be 'abc@xyz.com' but 'abc123xyz.com', which is obviously incorrect
that's why I'm using a workaround: I import pyperclip and put an '@' character on the clipboard before the send_keys() function runs, so that the paste triggered by '@' inserts an '@' correctly
in headless browser mode, if I enter an email address like 'abc@xyz.com' into an HTML input field, my workaround no longer matters; the '@' character is stripped from the email, and the string 'abcxyz.com' is put into the field
I think sending an '@' character shouldn't be that hard by default. Am I doing something wrong here?
Questions:
Can anyone explain these strange behaviors?
How could I send an email address correctly in headless browser mode? (I need to use the undetected_chrome driver because of bot detection)
import os
from selenium import webdriver

self.chrome_options = webdriver.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
prefs = {'download.default_directory': os.path.realpath(self.download_dir_path)}
self.chrome_options.add_experimental_option('prefs', prefs)
self.driver = webdriver.Chrome(options=self.chrome_options)
import undetected_chromedriver as uc
# workaround for the pasting problem: send_keys() pastes the clipboard
# contents in place of the '@' symbol, so put an '@' there beforehand
import pyperclip

pyperclip.copy('@')  # <---- THIS IS THE WORKAROUND FOR THE PASTING PROBLEM
self.chrome_options = uc.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
self.driver = uc.Chrome(options=self.chrome_options)
params = {
    "behavior": "allow",
    "downloadPath": os.path.realpath(self.download_dir_path)
}
self.driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
The versions I'm using (requirements.txt):
selenium==4.3.0
undetected-chromedriver==3.1.5.post4
Instead of using custom configuration options, try a more basic variant first and see if it works correctly. Then figure out which combination of options is causing the issue.
An example (which does work correctly, by the way):
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
driver = uc.Chrome()
driver.execute_script("""
document.documentElement.innerHTML = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<style>
html, body, main {
height: 100%;
}
main {
display: flex;
justify-content: center;
}
</style>
</head>
<body>
<main>
<form>
<input type="text" name="text" />
<input type="email" name="email"/>
<input type="tel" name="tel"/>
</form>
</main>
</body>
</html>`
""")
driver.find_element(By.NAME, "text").send_keys("info#example.com")
driver.find_element(By.NAME, "email").send_keys("info#example.com")
driver.find_element(By.NAME, "tel").send_keys("info#example.com")
This complaint (entering "@" pastes the clipboard contents) has come up here from time to time, but I've never seen a definitive solution. I'd suspect a particular language version of Windows and/or keyboard driver.
The Windows language is the problem, more precisely the keyboard layout, since it changes depending on the language. Switch it to "ENG" and it should work fine. Keyboard layouts where the letter Z sits between "T" and "U" (such as QWERTZ) do not work with the undetected driver; layouts with Z next to "X" (QWERTY) work fine.
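If switching the keyboard layout isn't an option, here is a further workaround sketch (my own suggestion, not from the answers above): skip the synthesized key events entirely, set the field's value with JavaScript, and dispatch an input event so client-side listeners still fire. The By.NAME locator and the address are placeholders:
from selenium.webdriver.common.by import By

# hypothetical fallback: write the value directly and fire an 'input' event,
# sidestepping the keyboard emulation that mangles '@'
email_field = driver.find_element(By.NAME, "email")  # placeholder locator
driver.execute_script(
    "arguments[0].value = arguments[1];"
    "arguments[0].dispatchEvent(new Event('input', {bubbles: true}));",
    email_field,
    "abc@xyz.com",
)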

Using FOR loop and IF for BeautifulSoup in Python

I am trying to pull out the meta description of a few webpages. Below is my code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; headers came from earlier in my script

URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
        else:
            desc = "Not Found"
            Meta_Description.append(desc)
Now this is returning me the below:
['Not Found',
'Not Found',
'Not Found',
'Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!',
'Not Found',
'Not Found',
'Not Found',
'Not Found']
I want to pull the content where the meta name == 'description'. In case the condition doesn't match, i.e. the page doesn't have a meta tag with name == 'description', it should return Not Found.
Expected Output:
['Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!']
Please suggest.
Let me know if this works for you!
URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    meta_flag = False
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
            meta_flag = True
            continue
    if not meta_flag:
        desc = "Not Found"
        Meta_Description.append(desc)
The idea behind the code is that it iterates through all the items in metas; if a 'description' is found, it sets the flag to True, so the subsequent if-statement is skipped. If nothing is found after iterating through metas, it appends "Not Found" to Meta_Description.
Your result is actually what it's supposed to be.
Your current code
Let's have a look.
for url in URL_List:
For each page in your list,
metas = soup.find_all('meta')
you want all the <meta> tags (regardless of other attributes).
for m in metas:
For each <meta> tag, check if it's a <meta name="description">. If it is, save its content; otherwise, save "Not Found".
HTML <meta>
See more info on MDN's docs. In short, <meta> covers many kinds of metadata, so an HTML document can contain multiple <meta> tags.
Your URLs
If you actually open up your URLs and have a look at the HTML, you'll see that you get
digisapient (1 <meta>)
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
dataquest (a lot more <meta>s)
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta name="robots" content="index, follow" />
<meta name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta name="bingbot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<meta property="og:title" content="Learn Data Science Online and Build Data Skills with Dataquest" />
<meta property="og:description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta property="og:url" content="https://www.dataquest.io/" />
<meta property="og:site_name" content="Dataquest" />
<meta property="article:publisher" content="https://www.facebook.com/dataquestio" />
<meta property="article:modified_time" content="2020-05-14T22:43:29+00:00" />
<meta property="og:image" content="https://www.dataquest.io/wp-content/uploads/2015/02/dataquest-learn-data-science-logo.jpg" />
<meta property="og:image:width" content="1040" />
<meta property="og:image:height" content="520" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:creator" content="#dataquestio" />
<meta name="twitter:site" content="#dataquestio" />
As you can see, the second website has many more such tags in its content, and for every one of those, you go through that last if/else statement; if it doesn't have a name="description", you save a "Not Found" in your results set.
Your actual results
To see what your program is doing at any point, and to better understand the values your variables take over time, I think it's worthwhile to look up debugging and start doing it.
For a quick 'n dirty solution, try writing some messages to the screen as your program's execution progresses, e.g. Downloaded URL http://..., Found 4 <meta> tags, Found 0 <meta name="description"> tags.
I suggest you go over your program line by line, with the HTML at hand, and work out what should happen.
Getting there
From your expected results, it seems that you don't actually care about the "Not Found"s, and you know up front that you always want the <meta name="description"> tags.
With CSS Selectors
You can try selecting DOM elements on your own in the browser, using document.querySelectorAll(). Open up your browser's developer tools / JavaScript console and type in, for instance,
document.querySelectorAll("meta[name='description']")
to get all of the page's <meta> tags whose name attribute has the value description. See more on CSS selectors.
Having this in mind is important, because
1. you get a better feel of what you're looking at / trying to do
2. you can actually use this type of selector with BeautifulSoup, as well!
With BeautifulSoup
So you can move the check for the name attribute up into the query. Something like
metas = soup.find_all('meta', attrs={"name": "description"})
and that should give you only the tags which have name="description" in them. This means that all the other <meta>s will be ignored.
Alternatively, you can keep the current query method (get all <meta>s) and just ignore the non-matching tags in the if/else statement, i.e. don't save them in your results list. If you want to know that the code is actually doing something but a tag simply doesn't match your query, you could log a message instead of saving the "Not Found".
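Putting it all together, here is a minimal sketch of the whole loop using the attrs filter from above (select_one with the CSS selector shown earlier would work the same way); the User-Agent value is a placeholder for the question's headers:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder for the question's headers
URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # find the first matching tag, or None if the page has no such <meta>
    meta = soup.find('meta', attrs={'name': 'description'})
    desc = meta.get('content', 'Not Found') if meta else 'Not Found'
    Meta_Description.append(desc)
print(Meta_Description)
This appends exactly one entry per URL, which matches the two-item expected output in the question.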

requests.get.text error in python 3.6 really need some help here

from bs4 import BeautifulSoup
import requests
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)
I want to reach the content in the div ("class=content")(p),
but when I print r.text a big part of it disappears.
However, I found that if I open a text file and write r.text into it, the content is all there when I view the file:
doc = open("file104.txt", "w", encoding="utf-8")
doc.write(r.text)
doc.close()
I guess it might be an encoding problem? But it still doesn't work after encoding in utf-8.
Sorry everybody!
===========================================================================
I finally found that the problem comes from the IPython IDLE; everything is fine if I run the code in PowerShell. I should have tried this earlier....
But I still want to know what causes this problem!
Use content.decode()
>>> import requests
>>> url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
>>> r = requests.get(url)
>>> TextInfo = r.content.decode('UTF-8')
>>> print(TextInfo)
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html lang="zh-tw"><!--<![endif]-->
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
.....
.....
the guts of the html code
.....
.....
</script>
</body>
</html>
>>>
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = urllib.request.urlopen(url).read()
r = r.decode('utf-8')
print(r)

# OR
urllib.request.urlretrieve(url, "myhtml.html")
myhtml = open("myhtml.html", 'rb')
print(myhtml.read())
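For the asker's underlying goal, the text inside the div (class="content"), here is a small sketch building on the snippets above; the div.content p selector mirrors the path described in the question and is an assumption about the page's markup:
import requests
from bs4 import BeautifulSoup

url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
soup = BeautifulSoup(r.text, "html.parser")
# assumed structure: <div class="content"> ... <p>text</p> ... </div>
for p in soup.select("div.content p"):
    print(p.get_text(strip=True))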

add own text inside nested braces + exception

The original question is located here; the current question is about avoiding one problem.
I have this code, which works perfectly with the html_1 data:
from pyparsing import nestedExpr, originalTextFor
html_1 = '''
<html>
<head>
<title><?php echo "title here"; ?></title>
<head>
<body>
<h1 <?php echo "class='big'" ?>>foo</h1>
</body>
</html>
'''
html_2 = '''
<html>
<head>
<title><?php echo "title here"; ?></title>
<head>
<body>
<h1 <?php echo $tpl->showStyle(); ?>>foo</h1>
</body>
</html>
'''
nested_angle_braces = nestedExpr('<', '>')
# for match in nested_angle_braces.searchString(html):
# print(match)
# nested_angle_braces_with_h1 = nested_angle_braces().addCondition(
# lambda tokens: tokens[0][0].lower() == 'h1')
nested_angle_braces_with_h1 = originalTextFor(
nested_angle_braces().addCondition(lambda tokens: tokens[0][0].lower() == 'h1')
)
nested_angle_braces_with_h1.addParseAction(lambda tokens: tokens[0] + 'MY_TEXT')
print(nested_angle_braces_with_h1.transformString(html_1))
The result for the html_1 variable is:
<html>
<head>
<title><?php echo "title here"; ?></title>
<head>
<body>
<h1 <?php echo "class='big'" ?>>MY_TEXTfoo</h1>
</body>
</html>
Here everything is all right, placed as expected: MY_TEXT lands in the right region (inside the h1 tag).
But let's see the result for html_2:
<html>
<head>
<title><?php echo "title here"; ?></title>
<head>
<body>
<h1 <?php echo $tpl->showStyle(); ?>MY_TEXT>foo</h1>
</body>
</html>
Now we get an error: MY_TEXT is placed inside the h1 attribute area, because the PHP code contains a '>' inside "$tpl->".
How can I fix it? I need to get this result in that region:
<h1 <?php echo $tpl->showStyle(); ?>>MY_TEXTfoo</h1>
The solution requires that we define a special expression for PHP tags, which our simple nestedExpr gets confused by.
# define an expression for a PHP tag
php_tag = Literal('<?') + 'php' + SkipTo('?>', include=True)
We'll need more than simple strings now for the opener and closer, including a negative lookahead when matching a '<' to make sure we aren't at the leading edge of a PHP tag:
# define expressions for opener and closer, such that we don't
# accidentally interpret a PHP tag as a nested expr
opener = ~php_tag + Literal("<")
closer = Literal(">")
If opener and closer aren't simple strings, then we need to give a content expression too. Our content will be very simple to define, just PHP tags or other Words of printables, excluding '<' and '>' (you'll end up wrapping this all back up in originalTextFor anyway):
# define nested_angle_braces to potentially contain PHP tag, or
# some other printable (not including '<' or '>' chars)
nested_angle_braces = nestedExpr(opener, closer,
content=php_tag | Word(printables, excludeChars="<>"))
Now if I use nested_angle_braces.searchString to scan html_2, I get:
for tag in originalTextFor(nested_angle_braces).searchString(html_2):
print(tag)
['<html>']
['<head>']
['<title>']
['</title>']
['<head>']
['<body>']
['<h1 <?php echo $tpl->showStyle(); ?>>']
['</h1>']
['</body>']
['</html>']
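For reference, the pieces above assemble into a runnable whole; this is just the answer's fragments plus the imports they need, with the condition and parse action unchanged from the question (html_2 is assumed to be defined as above):
from pyparsing import (Literal, SkipTo, Word, printables, nestedExpr,
                       originalTextFor)

# an expression for a PHP tag
php_tag = Literal('<?') + 'php' + SkipTo('?>', include=True)

# negative lookahead so a '<' starting a PHP tag is not taken as an opener
opener = ~php_tag + Literal('<')
closer = Literal('>')

nested_angle_braces = nestedExpr(
    opener, closer, content=php_tag | Word(printables, excludeChars='<>'))

nested_angle_braces_with_h1 = originalTextFor(
    nested_angle_braces().addCondition(
        lambda tokens: tokens[0][0].lower() == 'h1'))
nested_angle_braces_with_h1.addParseAction(lambda tokens: tokens[0] + 'MY_TEXT')
print(nested_angle_braces_with_h1.transformString(html_2))
With html_2, the transform should now yield the desired line:
<h1 <?php echo $tpl->showStyle(); ?>>MY_TEXTfoo</h1>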

How to disable up/down pan in UIWebView?

I want to have my web view pannable left and right but not up and down.
Is that possible?
Thanks.
OK, well, wrap your one line of HTML like this:
<html>
<head>
<meta name = "viewport" content = "height = device-height, user-scalable = no, width = WIDTH">
</head>
<body>
...YOUR HTML
</body>
</html>
Replace WIDTH with the width of your content, and see how it works.
