I am trying to pull out the meta description of a few webpages. Below is my code:
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; the headers dict isn't shown in the snippet

URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
        else:
            desc = "Not Found"
            Meta_Description.append(desc)
Now this is returning me the below:
['Not Found',
'Not Found',
'Not Found',
'Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!',
'Not Found',
'Not Found',
'Not Found',
'Not Found']
I want to pull the content where the meta name == 'description'. In case the condition doesn't match, i.e., the page doesn't have a meta tag with name == 'description', it should return 'Not Found'.
Expected Output:
['Not Found',
'Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!']
Please suggest.
Let me know if this works for you!
URL_List = ['https://digisapient.com', 'https://dataquest.io']
Meta_Description = []
meta_flag = False
for url in URL_List:
    response = requests.get(url, headers=headers)
    meta_flag = False
    #lower_response_text = response.text.lower()
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    for m in metas:
        if m.get('name') == 'description':
            desc = m.get('content')
            Meta_Description.append(desc)
            meta_flag = True
            continue
    if not meta_flag:
        desc = "Not Found"
        Meta_Description.append(desc)
The idea behind the code is that it will iterate through all the items in metas; if a 'description' is found, it sets the flag to True, thereby skipping the subsequent if-statement. If, after iterating through metas, nothing is found, it then appends "Not Found" to Meta_Description.
Your result is actually what it's supposed to be.
Your current code
Let's have a look.
for url in URL_List:
For each page in your list,
metas = soup.find_all('meta')
you want all the <meta> tags (regardless of other attributes).
for m in metas:
For each <meta> tag, check if it's a <meta name="description">. If it is, save its content; otherwise, save "Not Found".
HTML <meta>
See more info on MDN's docs. In short, <meta> covers many kinds of metadata, and as such, you can have multiple <meta> tags in your HTML document.
Your URLs
If you actually open up your URLs and have a look at the HTML, you'll see that you get
digisapient (1 <meta>)
<meta name="viewport" content="width=device-width, initial-scale=1, user-scalable=no" />
dataquest (a lot more <meta>s)
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta name="robots" content="index, follow" />
<meta name="googlebot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta name="bingbot" content="index, follow, max-snippet:-1, max-image-preview:large, max-video-preview:-1" />
<meta property="og:locale" content="en_US" />
<meta property="og:type" content="website" />
<meta property="og:title" content="Learn Data Science Online and Build Data Skills with Dataquest" />
<meta property="og:description" content="Learn Python, R, and SQL skills. Follow career paths to become a job-qualified data scientist, analyst, or engineer with interactive data science courses!" />
<meta property="og:url" content="https://www.dataquest.io/" />
<meta property="og:site_name" content="Dataquest" />
<meta property="article:publisher" content="https://www.facebook.com/dataquestio" />
<meta property="article:modified_time" content="2020-05-14T22:43:29+00:00" />
<meta property="og:image" content="https://www.dataquest.io/wp-content/uploads/2015/02/dataquest-learn-data-science-logo.jpg" />
<meta property="og:image:width" content="1040" />
<meta property="og:image:height" content="520" />
<meta name="twitter:card" content="summary_large_image" />
<meta name="twitter:creator" content="#dataquestio" />
<meta name="twitter:site" content="#dataquestio" />
As you can see, the second website has many more such tags in its content, and for every one of those, you go through that last if/else statement; if it doesn't have a name="description", you save a "Not Found" in your results set.
Your actual results
To check what your program is doing at any given moment, and to better understand the values the variables take over time, I think it's worthwhile to look up debugging and start doing it.
For a quick 'n dirty solution, try writing some messages to the screen as your program's execution progresses, e.g. Downloaded URL http://..., Found 4 <meta> tags, Found 0 <meta name="description"> tags.
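For instance, a few print() calls dropped into the question's loop would make the flow visible (a sketch, using the names from the code above):

for url in URL_List:
    response = requests.get(url, headers=headers)
    print(f"Downloaded URL {url} (status {response.status_code})")
    soup = BeautifulSoup(response.text, 'lxml')
    metas = soup.find_all('meta')
    print(f"Found {len(metas)} <meta> tags")
    descriptions = [m for m in metas if m.get('name') == 'description']
    print(f'Found {len(descriptions)} <meta name="description"> tags')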
I suggest you go over your program line by line, with the HTML in the other hand, and try to see what should happen.
Getting there
From your expected results, it seems that you don't actually care about the "Not Found"s, and you know up front that you always want the <meta name="description"> tags.
With CSS Selectors
You can try selecting DOM elements on your own in the browser, using document.querySelectorAll(). Open up your browser's developer tools / JavaScript console and type in, for instance
document.querySelectorAll("meta[name='description']")
to get all of the page's meta tags whose name attribute has the value description. See more on CSS selectors.
Having this in mind is important, because
1. you get a better feel for what you're looking at / trying to do
2. you can actually use this type of selector with BeautifulSoup, as well!
With BeautifulSoup
So you can move the check for the name attribute up in the query. Something like
metas = soup.find_all('meta', attrs={"name": "description"})
and that should give you only the tags which have name=description in them. This means that all the other <meta>s will be ignored.
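The CSS selector from the browser experiment works here too, via BeautifulSoup's select():

metas = soup.select("meta[name='description']")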
Alternatively, you can keep the current query method (get all <meta>s) and just ignore the non-matching ones in the if/else statement, i.e. don't save them in your results list. If you want to know that the loop is actually doing something but a tag just doesn't match your required query, you could simply log a message instead of saving the "Not Found".
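Putting it together, a minimal sketch of the whole loop (the headers dict is assumed, since the question doesn't show it):

import requests
from bs4 import BeautifulSoup

URL_List = ['https://digisapient.com', 'https://dataquest.io']
headers = {'User-Agent': 'Mozilla/5.0'}  # assumed

Meta_Description = []
for url in URL_List:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'lxml')
    # find() returns the first matching tag, or None if the page has none
    meta = soup.find('meta', attrs={"name": "description"})
    if meta and meta.get('content'):
        Meta_Description.append(meta['content'])
    else:
        Meta_Description.append("Not Found")

print(Meta_Description)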
Related
I'm trying to use Selenium in Python with the standard and the undetected_chrome drivers, in normal and headless browser modes.
I've noticed some strange behaviors:
the normal chrome driver handles special inputs fine while sending keys into an HTML input field with the send_keys() function
the undetected_chrome driver does not handle the special inputs very well with the send_keys() function
in normal browser mode, if I send an email address into an HTML input field, like 'abc@xyz.com', the current content of the clipboard is pasted in place of the '@' character
e.g.: if I last copied the string '123', the entered email address will not be 'abc@xyz.com' but 'abc123xyz.com', which is obviously incorrect
that's why I'm using a workaround: I import pyperclip and put the '@' character on the clipboard before the send_keys() function runs, so the paste triggered in place of '@' inserts '@' correctly
in headless browser mode, if I enter an email address into an HTML input field, like 'abc@xyz.com', my workaround doesn't matter any more; the '@' character is stripped from the email, and the string 'abcxyz.com' is put into the field
I think sending an '@' character shouldn't be that hard by default. Am I doing something wrong here?
Questions:
Can anyone explain these strange behaviors?
How could I send an email address correctly with headless browser mode? (I need to use undetected_chrome driver because of bots)
import os
from selenium import webdriver

self.chrome_options = webdriver.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
prefs = {'download.default_directory': os.path.realpath(self.download_dir_path)}
self.chrome_options.add_experimental_option('prefs', prefs)
self.driver = webdriver.Chrome(options=self.chrome_options)
import undetected_chromedriver as uc

# workaround for the problem of clipboard text being pasted in place of the '@' symbol when using send_keys()
import pyperclip
pyperclip.copy('@')  # <---- THIS IS THE WORKAROUND FOR THE PASTING PROBLEM

self.chrome_options = uc.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
self.driver = uc.Chrome(options=self.chrome_options)
params = {
    "behavior": "allow",
    "downloadPath": os.path.realpath(self.download_dir_path)
}
self.driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
The versions I'm using (requirements.txt):
selenium==4.3.0
undetected-chromedriver==3.1.5.post4
Instead of using custom configuration options, try a more basic variant first and see if it works correctly. Then figure out which option combination is causing the issue.
An example (which does work correctly, by the way):
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
driver = uc.Chrome()
driver.execute_script("""
document.documentElement.innerHTML = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<style>
html, body, main {
height: 100%;
}
main {
display: flex;
justify-content: center;
}
</style>
</head>
<body>
<main>
<form>
<input type="text" name="text" />
<input type="email" name="email"/>
<input type="tel" name="tel"/>
</form>
</main>
</body>
</html>`
""")
driver.find_element(By.NAME, "text").send_keys("info#example.com")
driver.find_element(By.NAME, "email").send_keys("info#example.com")
driver.find_element(By.NAME, "tel").send_keys("info#example.com")
This complaint (entering '@' pastes clipboard contents) has come up here from time to time, but I've never seen a definitive solution. I'd suspect a particular language version of Windows and/or keyboard driver.
The Windows language is the problem, more precisely the keyboard layout, as it changes depending on the language. Switch it to "ENG" and it should work fine. Keyboard layouts where the letter Z sits between "T" and "U" (QWERTZ) do not work with the undetected driver; a layout with Z next to "X" (QWERTY) works fine.
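If switching the layout isn't an option, one way to sidestep keyboard emulation entirely (an alternative workaround, not from the answers above; the locator is hypothetical) is to set the field's value with JavaScript and fire an input event:

from selenium.webdriver.common.by import By

field = driver.find_element(By.NAME, "email")  # hypothetical field name
driver.execute_script(
    """
    arguments[0].value = arguments[1];
    // fire an input event so client-side frameworks notice the change
    arguments[0].dispatchEvent(new Event('input', { bubbles: true }));
    """,
    field,
    "abc@xyz.com",
)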
from bs4 import BeautifulSoup
import requests
url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = requests.get(url)
r.encoding = "utf-8"
print(r.text)
I want to reach the content in the <div class="content"> (the <p> tags),
but when I print r.text out, a big part of it disappears.
But I also found that if I open a text file and write r.text to it, the content looks just right in the file:
doc = open("file104.txt", "w", encoding="utf-8")
doc.write(r.text)
doc.close()
I guess it might be an encoding problem? But it is still not working after I set the encoding to utf-8.
Sorry everybody!
===========================================================================
I finally found the problem: it comes from the IPython IDLE. Everything is fine if I run the code in PowerShell. I should have tried this earlier....
But I still want to know what causes this problem!
Use content.decode(). r.text decodes the raw bytes using the encoding requests guessed from the response headers, while r.content gives you the raw bytes so you can decode them explicitly:
>>> import requests
>>> url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
>>> r = requests.get(url)
>>> TextInfo = r.content.decode('UTF-8')
>>> print(TextInfo)
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--><html lang="zh-tw"><!--<![endif]-->
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge" />
<meta http-equiv="pragma" content="no-cache" />
<meta http-equiv="cache-control" content="no-cache" />
.....
.....
the guts of the html code
.....
.....
</script>
</body>
</html>
>>>
from bs4 import BeautifulSoup
import urllib.request

url = "https://www.104.com.tw/job/?jobno=5mjva&jobsource=joblist_b_relevance"
r = urllib.request.urlopen(url).read()
r = r.decode('utf-8')
print(r)

# OR
urllib.request.urlretrieve(url, "myhtml.html")
myhtml = open("myhtml.html", 'rb')
print(myhtml.read().decode('utf-8'))
In addition to the strange disappearing-markup issue noted in another StackOverflow question, I'm also noticing weird encoding problems where some characters are replaced with random letters. This seems to happen when there are very long lines in the markup. Here are examples:
Normal behavior
Before processing of Gmail API
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<title>Email Title</title>
</head>
<body>
<p style="font-size: 16px; line-height: 24px; margin-top: 0; margin-bottom: 0; font-family: 'ff-tisa-web-pro, Georgia, serif;">Pinterest mumblecore authentic stumptown, deep v slowcarb skateboard Intelligentsia food truck VHS. Asymmetrical swag raw denim put a bird on it Echo Park. Pinterest four loko lofi forage gentrify cray.</p>
</body>
</html>
After processing of Gmail API (via opening message in Gmail, and selecting Show original).
--001a1133f016eff52804ff2a2885
Content-Type: text/html; charset=ISO-8859-1
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Email Title</title>
</head>
<body>
<p style>Pinterest mumblecore authentic stumptown, deep v slowcarb skateboard Intelligentsia food truck VHS. Asymmetrical swag raw denim put a bird on it Echo Park. Pinterest four loko lofi forage gentrify cray.</p>
</body>
</html>
--001a1133f016eff52804ff2a2885--
In the above example, what happens is what we expected. Once the line length of the p element grows, however, we get unusual behavior.
Weird behavior
Before processing of Gmail API
<html>
<head>
<meta name="viewport" content="width=device-width, initial-scale=1.0"/>
<title>Email Title</title>
</head>
<body>
<p style="font-size: 16px; line-height: 24px; margin-top: 0; margin-bottom: 0; font-family: 'ff-tisa-web-pro, Georgia, serif;">Pinterest mumblecore authentic stumptown, deep v slowcarb skateboard Intelligentsia food truck VHS. Asymmetrical swag raw denim put a bird on it Echo Park. Pinterest four loko lofi forage gentrify cray. Pinterest mumblecore authentic stumptown, deep v slowcarb skateboard Intelligentsia food truck VHS. Asymmetrical swag raw denim put a bird on it Echo Park. Pinterest four loko lofi forage gentrify cray.</p>
</body>
</html>
After processing of Gmail API (via opening message in Gmail, and selecting Show original).
--001a1133547278e12e04ff2a28d8
Content-Type: text/html; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
<html>
<head>
<meta name=3D"viewport" content=3D"width=3Ddevice-width, initial-scale=
=3D1.0">
<title>Email Title</title>
</head>
<body>
<p style>Pinterest mumblecore authentic stumptown, deep v slowcarb skat=
eboard Intelligentsia food truck VHS. Asymmetrical swag raw denim put a bir=
d on it Echo Park. Pinterest four loko lofi forage gentrify cray. Pinterest=
mumblecore authentic stumptown, deep v slowcarb skateboard Intelligentsia =
food truck VHS. Asymmetrical swag raw denim put a bird on it Echo Park. Pin=
terest four loko lofi forage gentrify cray.</p>
</body>
</html>
--001a1133547278e12e04ff2a28d8--
In the above example, the number of characters inside the p element has doubled. Somehow this triggers all sorts of weird encoding issues. Notice that Content-Transfer-Encoding: quoted-printable is added above the markup. Also notice that 3D appears after every =. Also, hard line breaks have been added to the p element, and at the end of each line there is an = sign.
How do I prevent this from happening?
Google stores the email using the standard RFC 822 format, and in this specific case, as you can see in the headers, besides the Content-Type text/html you will find the header Content-Transfer-Encoding: quoted-printable (https://en.wikipedia.org/wiki/Quoted-printable).
So you need to parse the RFC 822 message to get the actual HTML:
Find the correct chunk (the message is formatted like a multipart, using the boundary that you will find in the first headers).
Parse the headers of the chunk to get the transfer encoding (it's not always quoted-printable, so be careful).
Using the encoding from the previous step, decode the body of the chunk.
A sketch of these steps follows below. I hope this answers your question.
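In Python, the standard library's email module already implements all three steps; a minimal sketch, assuming raw_message holds the full RFC 822 text returned by the Gmail API:

import email
from email import policy

def extract_html(raw_message: str) -> str:
    # parse the RFC 822 message; policy.default enables the modern API
    msg = email.message_from_string(raw_message, policy=policy.default)
    # walk() visits every MIME chunk between the boundaries
    for part in msg.walk():
        if part.get_content_type() == "text/html":
            # get_content() transparently undoes the Content-Transfer-Encoding
            # (quoted-printable here, but base64 would work the same way)
            return part.get_content()
    return ""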
<div class="vcard" itemscope itemtype="http://schema.org/LocalBusiness">
<strong class="fn org" itemprop="name">Commercial Office Bangalore</strong><br/>
<span class="adr" itemprop="address" itemscope itemtype="httep:schema.org/PostalAddress">
<span class="street-address"itemprop="streetAddress">330, Raheja Arcade, 1/1 Koramangala Industrial Layout</span><br/>
<span class="locality"itemprop="addressLocality">Bangalore</span>
<span itemprop="postalCode"class="postal-code">560095</span><br/>
<span class="region"itemprop="addressRegion">Karnataka</span><br/>
<span class="country-name">India</span><br/>
<span class="tel id"="phone"itemprop="telephone">+91-80-41101360</span><br/>
<span class="geo"itemprop="geo"itemscope itemtype="http://schema.org/GeoCordinates">
<abbr class="latitude" property="latitude">12.936504</abbr>
<abbr class="longitude" property="longitude">77.6321344</abbr>
<meta itemprop="latitude" content="12.936504" />
<meta itemprop="longitude" content="77.6321344" />
</span>
</span>
I want to make the latitude and longitude invisible to users but visible to the Google bot. What shall I do?
If you mean the Microdata (using Schema.org):
Just remove the abbr elements. You are already giving this data in meta elements (which can be used in the body), which is the correct way to provide data that should/can not be visible on the page:
<meta itemprop="latitude" content="12.936504" />
<meta itemprop="longitude" content="77.6321344" />
(Note that you were using property attributes, but they are not allowed in Microdata, only in RDFa.)
(Also note that you use http://schema.org/GeoCordinates but it should be http://schema.org/GeoCoordinates. And httep:schema.org/PostalAddress is also wrong.)
If you mean the Microformat (using hCard):
You could reuse the meta elements used with Microdata:
<meta class="latitude" itemprop="latitude" content="12.936504" />
<meta class="longitude" itemprop="longitude" content="77.6321344" />
But I’m not sure if all Microformat parsers support this.
The abbr elements with lat and lon can be set to display:none, which does exactly what you are asking for, hiding the content from humans, while still serving it up to bots:
abbr{display:none}
/** or **/
.latitude,
.longitude{display:none}
Although I can't honestly see what's wrapping them in your example. If the span with class .geo is the parent element, you could make that display:none instead.
Given the following HTML snippet, I need to extract the text of the content attribute for the meta tag with attribute name equal to description and the meta tag with attribute property equal to og:title. I've tried what is shown at Groovy: Correct Syntax for XMLSlurper to find elements with a given attribute, but it doesn't seem to work the same in Groovy 1.8.6.
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head>
<meta http-equiv="X-UA-Compatible" content="IE=8" />
<meta property="fb:admins" content="100003979125336" />
<meta name="description" content="Shiny embossed blue butterfly" />
<meta name="keywords" content="Fancy That Blue Butterfly Platter" />
<meta property="og:title" content="Fancy That Blue Butterfly Platter" />
Is there a clean way to retrieve these with GPath?
This works with Groovy 2.0.1 - I don't have 1.8.6 handy at the moment:
def slurper = new XmlSlurper()
File xmlFile = new File('sample.xml')
def xml = slurper.parseText(xmlFile.text)
println 'description = ' + xml.head.children().find{it.name() == 'meta' && it.@name == 'description'}.@content
println 'og:title = ' + xml.head.children().find{it.name() == 'meta' && it.@property == 'og:title'}.@content