Combining links into one output in BeautifulSoup - python-3.x

I am trying to grab all of the links inside a certain div, which I can accomplish. The problem is that every link is printed on a new line. For example:
Home
Wire Wheels
Crimped
I would like it to show: Home,Wire Wheels,Crimped
Is this possible?
Here is the Python code I am using to grab the data:
for crumbs in soup.find('div',{"id":"breadcrumbs"}).find_all('a'):
    crumbs2 = crumbs.text
    print(crumbs2)

You can pass a different line ending to print through its end argument. The default is '\n':
crumbs = list(soup.find('div',{"id":"breadcrumbs"}).find_all('a'))
for ind, crumb in enumerate(crumbs):
    if ind < len(crumbs) - 1:
        ending = {'end': ', '}
    else:
        ending = {}
    print(crumb.text, **ending)
That being said, you should definitely go with @alecxe's answer.
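As a rough sketch of the same idea without the index bookkeeping: print also accepts a sep argument, so the link texts can be unpacked into a single call. The list below is a hypothetical stand-in for the texts collected from the page, and print_on_one_line is a made-up helper name.

```python
def print_on_one_line(items, sep=", ", file=None):
    """Print all items on one line, separated by sep."""
    # sep is written between the arguments; the default end ('\n')
    # is written once, after the last item.
    print(*items, sep=sep, file=file)

# Hypothetical link texts, standing in for crumb.text values.
crumb_texts = ["Home", "Wire Wheels", "Crimped"]
print_on_one_line(crumb_texts)
```

This writes the three texts on one line separated by a comma and a space. Passing sep="," instead gives the exact format asked for in the question.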

Use .get_text() to get the stripped text directly and str.join() to join the strings:
",".join([crumbs.get_text(strip=True)
for crumbs in soup.find('div',{"id":"breadcrumbs"}).find_all('a')])
Also note that soup.find('div',{"id":"breadcrumbs"}).find_all('a') can be simplified to soup.select("#breadcrumbs a").
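A minimal, self-contained sketch of the select-and-join approach follows. The markup is invented for illustration and is not the asker's page, and breadcrumb_line is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real page.
sample_html = """
<div id="breadcrumbs">
  <a href="/">Home</a>
  <a href="/wire-wheels">Wire Wheels</a>
  <a href="/wire-wheels/crimped"> Crimped </a>
</div>
"""

def breadcrumb_line(html, separator=","):
    """Join the text of every link inside #breadcrumbs into one string."""
    soup = BeautifulSoup(html, "html.parser")
    links = soup.select("#breadcrumbs a")
    # strip=True removes the whitespace around each link text.
    return separator.join(link.get_text(strip=True) for link in links)

print(breadcrumb_line(sample_html))
```

If no matching links are found, select returns an empty list and the function returns an empty string rather than raising an error.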

Related

Beautifulsoup span class is returning a blank string

I am trying to print out different things from a Norwegian weather site with BeautifulSoup.
I manage to print out everything I want except one thing, which describes how the weather will be over the next hour.
This contains the text I want to get:
<span class="nowcast-description" data-reactid="59">har opphold nå, det holder seg tørt den neste timen</span>
And I am trying to print it with this:
cond = soup.find(class_='nowcast-description').get_text()
(Screenshot: inspected elements from storm.no/ski, showing some of the elements on the site.)
I print those elements with these lines:
soup = bs4.BeautifulSoup(html, "html.parser")
loc = soup.find(class_='info-text').get_text()
cond = soup.find(class_='nowcast-description').get_text()
temp = soup.find(class_='temperature').get_text()
wind = soup.find(class_='indicator wind').get_text()
I also tested with this line:
cond = soup.select("span.nowcast-description")
but that gives me everything except the text I want from the line.
Site link: https://www.storm.no/ski
I get:
Ski Akershus, 131 moh.
""
2°
3 m/s
The text is loaded dynamically from a script tag. You can use a regex to pull out the object containing all the forecasts and parse it with the hjson library, which copes with the unquoted keys. You need to install hjson, then do the following:
import requests, hjson, re
headers = {'User-Agent':'Mozilla/5.0'}
# Pass the headers so the User-Agent defined above is actually sent
r = requests.get('https://www.storm.no/ski', headers=headers)
p = re.compile(r'window\.__dehydratedState = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(r.text)[0])
print(data['app-container']['current']['forecast']['nowcastDescription'])
You could also regex out the single value directly, but using hjson means you have access to all the other data.
It's because the text under nowcast-description is generated dynamically. If you dump the loaded page:
print(soup.prettify())
you will find only this:
<span class="nowcast-description" data-reactid="59">
</span>
On rough analysis, it seems that the content of this span is loaded from the nowcastDescription field, which is part of window.__dehydratedState.
Because the field is a simple string value, you can try to extract it from that script.
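One way to extract just that field, sketched with the standard library only. The sample markup is a made-up stand-in for the real page (its nesting and the temperature key are assumptions), and extract_nowcast is a hypothetical helper name. The pattern tolerates both quoted and unquoted keys.

```python
import re

# Hypothetical page fragment; the real script content may be shaped differently.
sample_html = '''
<span class="nowcast-description" data-reactid="59"></span>
<script>window.__dehydratedState = {"app-container": {current: {forecast:
{nowcastDescription: "har opphold nå", temperature: 2}}}};</script>
'''

# Key name, optional closing quote, colon, then a double-quoted string value.
NOWCAST_PATTERN = re.compile(r'nowcastDescription"?\s*:\s*"((?:[^"\\]|\\.)*)"')

def extract_nowcast(html):
    """Return the raw nowcastDescription value, or None if it is absent."""
    match = NOWCAST_PATTERN.search(html)
    return match.group(1) if match else None

print(extract_nowcast(sample_html))
```

The captured value is returned as it appears in the script, so any backslash escapes inside it are not decoded. Parsing the whole object, as in the hjson answer, avoids that limitation.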

Extract data from embedded script tag in html

I'm trying to fetch data inside a (big) script tag within HTML. Using BeautifulSoup I can reach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"] and, even more specifically, the last entry in the list (116,2).
Following 1 or 2 did not get the job done.
What I've done so far
base='https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url=requests.get(base)
soup=BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfactory, since the indexing has to be changed each time new numbers are added.
What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to get the last number of the extracted list (i.e. 116,2).
You could regex out the JavaScript object holding that item and then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option, with a second regex that captures the whole list:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
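The last approach can be sketched offline as follows. The script text is a shortened, hypothetical sample rather than the real infogr.am payload, and last_value is a made-up helper name. The sketch also converts the Dutch decimal comma to a float, which the question did not ask for but is often the next step.

```python
import json
import re

# Shortened, hypothetical stand-in for the real script content.
sample_script = (
    'window.infographicData={"elements":[{"data":[[["Datum","jan","feb"],'
    '["Beleidsdekkingsgraad","107,6","116,2"]]]}]};'
)

def last_value(text, label):
    """Return the last entry of the JSON array starting with label, as a float."""
    # Match from ["label" up to the first closing bracket.
    pattern = re.compile(r'\["' + re.escape(label) + r'".*?\]')
    match = pattern.search(text)
    if match is None:
        return None
    row = json.loads(match.group(0))
    # Values use a decimal comma, e.g. "116,2".
    return float(row[-1].replace(",", "."))

print(last_value(sample_script, "Beleidsdekkingsgraad"))
```

Because the pattern stops at the first closing bracket, it assumes none of the values in the row contain a "]" character.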

Why does Beautifulsoup's contents method return text with brackets and quotes, whereas the text method returns just plain text

I am scraping company names from a webpage using Python 3 and BeautifulSoup. When I use "contents" to pull the contents of a tag, it returns the text with brackets and single quotes (e.g. ['Company A']), whereas "text" returns simply Company A. Why do they behave this way? I realize this may be a dumb question, but I'm new and have tried searching around. See below:
entity_name = bsObj2.find(class_='span-16 first')
entity_name_item = entity_name.find('h1')
entity_name_item = entity_name_item.contents
print(entity_name_item)
Returns:
['Company A']
Whereas:
entity_name = bsObj2.find(class_='span-16 first')
entity_name_item = entity_name.find('h1')
entity_name_item = entity_name_item.text
print(entity_name_item)
Returns:
Company A
contents gives you a list with the tag's children, while text gives you the text of a tag.
contents returns the direct children of the tag that was found as a Python list, while text gives you the text of that tag and all its descendants as a Python string. For instance, in the following HTML:
<div>Header
<h1>Header2</h1>
</div>
Parsing it with Beautiful Soup and calling result = soup.find('div') returns a Tag object. Calling contents on that object returns the list of the tag's direct children, i.e. result.contents is ['Header\n', <h1>Header2</h1>, '\n'] and type(result.contents) returns <class 'list'>. But text returns the concatenated text of the tag and its descendants as a string, so you get result.text == 'Header\nHeader2\n' and type(result.text) gets you <class 'str'>.
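The difference can be checked with a short sketch. The markup here is written on one line, without the newlines of the example above, so that the children are easy to count.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<div>Header<h1>Header2</h1></div>", "html.parser")
div = soup.find("div")

children = div.contents  # list of direct children: a string and the <h1> tag
text = div.text          # one str with the text of all descendants

print(type(children).__name__, len(children))
print(type(text).__name__, repr(text))
```

Printing a list shows its brackets and the quotes around each string element, which is why the contents output in the question looks like ['Company A'].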

How to extract text inserted with track-changes in python-docx

I want to extract text from word documents that were edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.
Running the code below, I saw that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text:
import docx
doc = docx.Document('C:\\test track changes.docx')
for para in doc.paragraphs:
    print(para)
    print(para.text)
Is there a way to retrieve the text in tracked insertions (w:ins elements)?
I'm using python-docx 0.8.6, lxml 3.4.0, python 3.4, Win7
Thanks
I had the same problem for years (maybe for as long as this question has existed).
By looking at the "etienned" code posted by @yiftah and at the attributes of Paragraph, I found a way to retrieve the text as it would read after accepting the changes.
The trick is to use p._p.xml to get the XML of the paragraph and then apply the "etienned" code to it (i.e. retrieve all the <w:t> elements from the XML, which covers both regular runs and <w:ins> blocks).
I hope it helps other souls who were as lost as I was:
from docx import Document
try:
    from xml.etree.cElementTree import XML
except ImportError:
    # cElementTree was removed in Python 3.9
    from xml.etree.ElementTree import XML
WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"
def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # iter() replaces getiterator(), which was removed in Python 3.9.
        # Deleted text is stored in <w:delText>, so collecting <w:t> skips it.
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        return p.text
doc = Document("Hello.docx")
for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
Not directly using python-docx; there's no API support yet for tracked changes/revisions.
It's a pretty tricky job, as you'll discover if you search on the element names, perhaps 'open xml w:ins' for a start, which brings up this document as the first result:
https://msdn.microsoft.com/en-us/library/ee836138(v=office.12).aspx
If I needed to do something like that in a pinch I'd get the body element using:
body = document._body._body
and then use XPath on that to return the elements I wanted, something vaguely like this aircode:
from docx.text.paragraph import Paragraph
inserted_ps = body.xpath('./w:ins//w:p')
for p in inserted_ps:
    paragraph = Paragraph(p, None)
    print(paragraph.text)
You'll be on your own for figuring out what XPath expression will get you the paragraphs you want.
opc-diag may be a friend in this, allowing you to quickly scan the XML of the .docx package. http://opc-diag.readthedocs.io/en/latest/index.html
The code below from Etienne worked for me; it works directly with the document's XML (and does not use python-docx):
http://etienned.github.io/posts/extract-text-from-word-docx-simply/
I needed a quick solution to make text surrounded by "smart tags" visible to docx's text property, and found that the solution could also be adapted to make some tracked changes visible.
It uses lxml.etree.strip_tags to remove surrounding "smartTag" and "ins" tags, and promote the contents; and lxml.etree.strip_elements to remove the whole "del" elements.
import docx
import lxml.etree
def para2text(p, quiet=False):
    if not quiet:
        unsafeText = p.text
    # Unwrap smartTag and ins elements (keeping their contents); drop del elements entirely
    lxml.etree.strip_tags(p._p, "{*}smartTag")
    lxml.etree.strip_elements(p._p, "{*}del")
    lxml.etree.strip_tags(p._p, "{*}ins")
    safeText = p.text
    if not quiet:
        if safeText != unsafeText:
            print()
            print('para2text: unsafe:')
            print(unsafeText)
            print('para2text: safe:')
            print(safeText)
            print()
    return safeText
docin = docx.Document(filePath)  # filePath: path to the .docx file, defined elsewhere
for para in docin.paragraphs:
    text = para2text(para)
Beware that this only works for a subset of "tracked changes", but it might be the basis of a more general solution.
If you want to see the xml for a docx file directly: rename it as .zip, extract the "document.xml", and view it by dropping into chrome or your favourite viewer.
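The common idea in the answers above (keep text inside w:ins, ignore text inside w:del) can be tried without a .docx file. This sketch uses only the standard library on a hand-written, hypothetical paragraph fragment; accepted_text is a made-up helper name.

```python
import xml.etree.ElementTree as ET

W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

# Hand-written paragraph with one tracked insertion and one tracked deletion.
sample_paragraph = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:ins w:id="1" w:author="A"><w:r><w:t>new </w:t></w:r></w:ins>'
    '<w:del w:id="2" w:author="A"><w:r><w:delText>old </w:delText></w:r></w:del>'
    '<w:r><w:t>world</w:t></w:r>'
    '</w:p>'
)

def accepted_text(paragraph_xml):
    """Return the paragraph text as it would read with all changes accepted."""
    root = ET.fromstring(paragraph_xml)
    # Inserted runs still use <w:t>; deleted runs use <w:delText>,
    # so iterating over <w:t> alone keeps insertions and skips deletions.
    return "".join(node.text for node in root.iter(W + "t") if node.text)

print(accepted_text(sample_paragraph))
```

Like the answers it is based on, this covers only inserted and deleted runs inside a paragraph, not every kind of tracked change.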

Unable to remove string from text I am extracting from html

I am trying to extract the main article from a web page. I can accomplish the main text extraction using Python's readability module. However the text I get back often contains several &#13 strings (there is a ; at the end of this string but this editor won't allow the full string to be entered (strange!)). I have tried using the python replace function, I have also tried using regular expression's replace function, I have also tried using the unicode encode and decode functions. None of these approaches have worked. For the replace and Regular Expression approaches I just get back my original text with the &#13 strings still present and with the unicode encode decode approach I get back the error message:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 2099: ordinal not in range(128)
Here is the code I am using that takes the initial URL and uses readability to extract the main article. I have left in all my commented-out code that corresponds to the different approaches I have tried to remove the &#13; string. It appears as though &#13 is interpreted to be u'\xa9'.
import urllib
from readability.readability import Document
def find_main_article_text_2():
    #url = 'http://finance.yahoo.com/news/questcor-pharmaceuticals-closes-transaction-acquire-130000695.html'
    url = "http://us.rd.yahoo.com/finance/industry/news/latestnews/*http://us.rd.yahoo.com/finance/external/cbsm/SIG=11iiumket/*http://www.marketwatch.com/News/Story/Story.aspx?guid=4D9D3170-CE63-4570-B95B-9B16ABD0391C&siteid=yhoof2"
    html = urllib.urlopen(url).read()
    readable_article = Document(html).summary()
    readable_title = Document(html).short_title()
    #readable_article.replace("u'\xa9'"," ")
    #print re.sub("&#13;",'',readable_article)
    #unicodedata.normalize('NFKD', readable_article).encode('ascii','ignore')
    print readable_article
    #print readable_article.decode('latin9').encode('utf8'),
    print "There are " ,readable_article.count("&#13;"),"&#13;'s"
    #print readable_article.encode( sys.stdout.encoding , '' )
    #sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
    #sents = sent_tokenizer.tokenize(readable_article)
    #new_sents = []
    #for sent in sents:
    #    unicode_sent = sent.decode('utf-8')
    #    s1 = unicode_sent.encode('ascii', 'ignore')
    #    #s2 = s1.replace("\n","")
    #    new_sents.append(s1)
    #print new_sents
    # u'\xa9'
I have a URL that I have been testing the code with included inside the def. If anybody has any ideas on how to remove this &#13 I would appreciate the help. Thanks, George
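For reference, &#13; is the numeric character reference for a carriage return (U+000D), whereas u'\xa9' is the copyright sign, a different character. The sketch below is written in Python 3, unlike the Python 2 code above, and is not a confirmed fix for that page. It shows a replacement that handles both the escaped and the decoded form; the sample text and the helper name strip_carriage_returns are hypothetical.

```python
def strip_carriage_returns(markup):
    """Remove carriage returns, whether escaped as &#13; or already decoded."""
    # str.replace returns a new string and leaves the original unchanged,
    # so the result has to be assigned or returned to have any effect.
    without_references = markup.replace("&#13;", "")
    return without_references.replace("\r", "")

# Hypothetical extracted text containing both forms.
sample = "First line&#13;\nSecond line\r\n"
cleaned = strip_carriage_returns(sample)
print(repr(cleaned))
```

Note that the commented-out replace call in the question discards its return value and searches for the literal text u'\xa9', so it could not have changed readable_article.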
