Extract data from embedded script tag in html - python-3.x

I'm trying to fetch data inside a (big) script tag within HTML. Using BeautifulSoup I can reach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"] and, even more specifically, the last entry in the list (116,2).
Following 1 or 2 did not get the job done.
What I've done so far
import requests
from bs4 import BeautifulSoup

base = 'https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url = requests.get(base)
soup = BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
# Hard-coded slice of the script text that happens to contain the list
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying, since the indexing has to be changed each time new numbers are added.
What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to catch the last number of the extracted list (i.e. 116,2).

You could regex out the JavaScript object holding that item, then parse it with the json library:
import requests, re, json

r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
# Pick the series that starts with "Beleidsdekkingsgraad" and take its last value
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests, re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests, re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
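If the extracted value is needed as a number rather than a string, note that the source uses a comma as the decimal separator, so it has to be swapped before converting; for example, continuing from the last snippet:
last = json.loads(p.findall(r.text)[0])[-1]  # '116,2'
print(float(last.replace(',', '.')))         # 116.2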

Related

How to append all data to dict instead of last result only?

I'm trying to create a metadata scraper to enrich my e-book collection, but am experiencing some problems. I want to create a dict (or whatever gets the job done) to store the index (only while testing), the path and the series name. This is the code I've written so far:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
        safe.append(pathdict)
        print(pathdict)
    print(safe)
This code is able to go through all the opf files and get the series, index and path; I'm sure of this, since the console output shows the expected values for each file.
However, when I try to store pathdict in safe, no matter where I put the safe.append(pathdict), the list ends up holding only copies of the last result.
What do I have to do so that safe holds all the data shown in the console output?
I have tried everything I could think of, but nothing worked.
Any help is appreciated.
I believe this is the correct way:
from bs4 import BeautifulSoup

def get_opf_path():
    opffile = variables.items
    pathdict = {'index': [], 'path': [], 'series': []}
    safe = []
    x = 0
    for f in opffile:
        x += 1
        pathdict['path'] = f
        pathdict['index'] = x
        with open(f, 'r') as fi:
            soup = BeautifulSoup(fi, 'lxml')
            for meta in soup.find_all('meta'):
                if meta.get('name') == 'calibre:series':
                    pathdict['series'] = meta.get('content')
                    print(pathdict)
                    safe.append(pathdict.copy())
    print(safe)
For two main reasons:
When you do:
pathdict['series'] = meta.get('content')
you are overwriting the last value in pathdict['series'], so I believe this is where you should save.
You also need to make a copy of it; if you don't, it will also change inside the list. When you store the dict you are really storing a reference to it (in this case, a reference to the variable pathdict).
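A minimal demonstration of that reference behaviour (illustrative names, not from the question):
d = {'series': 'A'}
no_copy = [d]            # stores a reference, not a snapshot
with_copy = [d.copy()]   # stores an independent copy
d['series'] = 'B'        # mutate the original dict
print(no_copy)           # [{'series': 'B'}] -- the stored entry changed too
print(with_copy)         # [{'series': 'A'}] -- the copy kept its value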
Note
If you want to print the elements of the list on separate lines you can do something like this:
print(*safe, sep="\n")

Counting words in a webpage is inaccurate

Noob, trying to build a word counter, to count the words displayed on a website. I found some code (counting words inside a webpage), modified it, tried it on Google, and found that it was way off. Other code I tried displayed all of the various HTML tags, which was likewise not helpful. If visible page content reads: "Hello there world," I'm looking for a count of 3. For now, I'm not concerned with words that are in image files (pictures). My modified code is as follows:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
# Page you want to count words from
page = "https://google.com"
# Get the page
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
# creates a dictionary of words and frequency from paragraphs
content_paras = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))
sum_of_paras = sum(content_paras.values())
# We get the words within divs
text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
content_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
sum_of_divs = sum(content_div.values())
words_on_page = sum_of_paras + sum_of_divs
print(words_on_page)
As always, simple answers I can follow are appreciated over complex/elegant ones I cannot, b/c Noob.
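One common approach (not from the question) is to extract the page's visible text in a single pass with get_text() after removing script and style tags, instead of summing separate per-tag counts, which double-counts any text nested inside both a div and a paragraph. A minimal sketch:
import requests
from bs4 import BeautifulSoup

r = requests.get("https://google.com")
soup = BeautifulSoup(r.content, "html.parser")
# Drop non-visible content before extracting text
for tag in soup(["script", "style"]):
    tag.decompose()
words = soup.get_text(separator=" ").split()
print(len(words))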

Beautifulsoup span class is returning a blank string

I am trying to print out different things from a Norwegian weather site with BeautifulSoup.
I manage to print out everything I want except one thing, which mentions how the weather will be for the next hour.
This contains the text I want to get:
<span class="nowcast-description" data-reactid="59">har opphold nå, det holder seg tørt den neste timen</span>
And I am trying to print it with this:
cond = soup.find(class_='nowcast-description').get_text()
[Picture: inspected elements from storm.no/ski]
When printing these:
soup = bs4.BeautifulSoup(html, "html.parser")
loc = soup.find(class_='info-text').get_text()
cond = soup.find(class_='nowcast-description').get_text()
temp = soup.find(class_='temperature').get_text()
wind = soup.find(class_='indicator wind').get_text()
I also tested with this line:
cond = soup.select("span.nowcast-description")
but that gives me everything except what I want from the line.
Site link: https://www.storm.no/ski
I get:
Ski Akershus, 131 moh.
""
2°
3 m/s
It is retrieved dynamically from a script tag. You can regex out the object containing all the forecasts and handle it with the hjson library, due to the unquoted keys. You need to install hjson, then do the following:
import requests, hjson, re
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.storm.no/ski')
p = re.compile(r'window\.__dehydratedState = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(r.text)[0])
print(data['app-container']['current']['forecast']['nowcastDescription'])
You could regex out the value directly as well, but using hjson means you have access to all the other data.
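Since the parsed object is now an ordinary Python dict, you can also explore what else it holds; for instance (path inferred from the lookup above):
print(data['app-container']['current']['forecast'].keys())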
It's because the text under nowcast-description is generated dynamically. If you dump the loaded page:
print(soup.prettify())
you will only find this:
<span class="nowcast-description" data-reactid="59">
</span>
On rough analysis, it seems that the content of this span is loaded from the field nowcastDescription, which is part of window.__dehydratedState.
Because the field is simple JSON, you can try to extract it directly.
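For instance, a minimal sketch of that direct extraction (the exact quoting of the key and value in the page source is an assumption here and should be verified):
import re
import requests

r = requests.get('https://www.storm.no/ski')
# Assumes an unquoted key followed by a quoted string value inside
# window.__dehydratedState, e.g. nowcastDescription:"..."
m = re.search(r"nowcastDescription\s*:\s*['\"](.*?)['\"]", r.text)
if m:
    print(m.group(1))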

Data from a table getting printed to csv in a single line

I've written a script to parse data from the first table of a website, using XPath. By the way, I didn't use the "tr" tag because even without it I can still see the results in the console when printed. When I run my script, the data gets scraped, but it is printed on a single line in the csv file. I can't find the mistake I'm making. Any input on this will be highly appreciated. Here is what I've tried:
import csv
import requests
from lxml import html

url = "https://fantasy.premierleague.com/player-list/"
response = requests.get(url).text
outfile = open('Data_tab.csv', 'w', newline='')
writer = csv.writer(outfile)
writer.writerow(["Player", "Team", "Points", "Cost"])
tree = html.fromstring(response)
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    # tab_r = titles.xpath('.//tr/text()')
    tab_d = titles.xpath('.//td/text()')
    writer.writerow(tab_d)
You might want to add a level of looping, examining each table row in turn.
Try this:
for titles in tree.xpath("//table[@class='ism-table']")[0]:
    for row in titles.xpath('./tr'):
        tab_d = row.xpath('./td/text()')
        writer.writerow(tab_d)
Or, perhaps this:
table = tree.xpath("//table[@class='ism-table']")[0]
for row in table.xpath('.//tr'):
    items = row.xpath('./td/text()')
    writer.writerow(items)
Or you could have the first XPath expression find the rows for you:
rows = tree.xpath("(.//table[@class='ism-table'])[1]//tr")
for row in rows:
    items = row.xpath('./td/text()')
    writer.writerow(items)

python-docx insertion point

I am not sure if I've been missing something obvious, but I have not found anything documented about how one would go about inserting Word elements (tables, for example) at a specific place in a document.
I am loading an existing MS Word .docx document by using:
my_document = Document('some/path/to/my/document.docx')
My use case would be to get the 'position' of a bookmark or section in the document and then proceed to insert tables below that point.
I'm thinking about an API that would allow me to do something along those lines:
insertion_point = my_document.bookmarks['bookmark_name'].position
my_document.add_table(rows=10, cols=3, position=insertion_point+1)
I saw that there are plans to implement something akin to the 'range' object of the MS Word API, which would effectively solve this problem. In the meantime, is there a way to instruct the document object methods where to insert the new elements?
Maybe I can glue some lxml code to find a node and pass that to these python-docx methods? Any help on this subject would be much appreciated! Thanks.
I remembered an old adage, "use the source, Luke!", and was able to figure it out. A post from the python-docx owner on its git project page also gave me a hint: https://github.com/python-openxml/python-docx/issues/7.
The full XML document model can be accessed by using its _document_part._element property. It behaves exactly like an lxml etree element. From there, everything is possible.
To solve my specific insertion point problem, I created a temp docx.Document object which I used to store my generated content.
import docx
from docx.oxml.shared import qn
tmp_doc = docx.Document()
# Generate content in tmp_doc document
tmp_doc.add_heading('New heading', 1)
# more content generation using docx API.
# ...
# Reference the tmp_doc XML content
tmp_doc_body = tmp_doc._document_part._element.body
# You could pretty print it by using:
#print(docx.oxml.xmlchemy.serialize_for_reading(tmp_doc_body))
I then loaded my docx template (containing a bookmark named 'insertion_point') into a second docx.Document object.
doc = docx.Document('/some/path/example.docx')
doc_body = doc._document_part._element.body
#print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
The next step is parsing the doc XML to find the index of the insertion point. I defined a small function for the task at hand, which returns a named bookmark parent paragraph element:
def get_bookmark_par_element(document, bookmark_name):
    """
    Return the named bookmark parent paragraph element. If no matching
    bookmark is found, the result is '1'. If an error is encountered, '2'
    is returned.
    """
    doc_element = document._document_part._element
    bookmarks_list = doc_element.findall('.//' + qn('w:bookmarkStart'))
    for bookmark in bookmarks_list:
        name = bookmark.get(qn('w:name'))
        if name == bookmark_name:
            par = bookmark.getparent()
            if not isinstance(par, docx.oxml.CT_P):
                return 2
            else:
                return par
    return 1
The newly defined function was used to get the bookmark 'insertion_point' parent paragraph. Error control is left to the reader.
bookmark_par = get_bookmark_par_element(doc, 'insertion_point')
We can now use bookmark_par's etree index to insert our tmp_doc generated content at the right place:
bookmark_par_parent = bookmark_par.getparent()
index = bookmark_par_parent.index(bookmark_par) + 1
# Iterate over a static copy: inserting a child into the target tree
# removes it from tmp_doc_body, which would break live iteration.
for child in list(tmp_doc_body):
    bookmark_par_parent.insert(index, child)
    index = index + 1
bookmark_par_parent.remove(bookmark_par)
The document is now finalized, the generated content having been inserted at the bookmark location of an existing Word document.
# Save result
# print(docx.oxml.xmlchemy.serialize_for_reading(doc_body))
doc.save('/some/path/generated_doc.docx')
I hope this can help someone, as the documentation regarding this is yet to be written.
You put [image] as a token in your template document:
for paragraph in document.paragraphs:
    if "[image]" in paragraph.text:
        paragraph.text = paragraph.text.strip().replace("[image]", "")
        run = paragraph.add_run()
        run.add_picture(image_path, width=Inches(3))
You can have a paragraph in a table cell as well; just find the cell and do as above, as shown in the sketch below.
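A minimal sketch of that table-cell variant (image_path and the [image] token are carried over from the snippet above):
from docx.shared import Inches

for table in document.tables:
    for row in table.rows:
        for cell in row.cells:
            for paragraph in cell.paragraphs:
                if "[image]" in paragraph.text:
                    paragraph.text = paragraph.text.replace("[image]", "")
                    run = paragraph.add_run()
                    run.add_picture(image_path, width=Inches(3))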
The python-docx owner suggests how to insert a table into the middle of an existing document:
https://github.com/python-openxml/python-docx/issues/156
Here it is with some improvements:
import re
from docx import Document

def move_table_after(document, table, search_phrase):
    regexp = re.compile(search_phrase)
    for paragraph in document.paragraphs:
        if paragraph.text and regexp.search(paragraph.text):
            tbl, p = table._tbl, paragraph._p
            p.addnext(tbl)
            return paragraph

if __name__ == '__main__':
    document = Document('Existing_Document.docx')
    table = document.add_table(rows=..., cols=...)
    ...
    move_table_after(document, table, "your search phrase")
    document.save('Modified_Document.docx')
Have a look at python-docx-template, which allows jinja2-style template insertion points in a docx file rather than Word bookmarks:
https://pypi.org/project/docxtpl/
https://docxtpl.readthedocs.io/en/latest/
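A minimal usage sketch (the template filename and context keys are illustrative): the template contains jinja2 tags such as {{ company_name }}, which render() replaces with values from a context dict.
from docxtpl import DocxTemplate

doc = DocxTemplate('my_template.docx')
context = {'company_name': 'ACME'}
doc.render(context)
doc.save('generated_doc.docx')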
Thanks a lot for taking the time to explain all of this.
I was going through more or less the same issue. My specific point was how to merge two or more docx documents at the end.
It's not exactly a solution to your problem, but here is the function I came up with:
def combinate_word(main_file, files, output):
    main_doc = Document(main_file)
    for file in files:
        sub_doc = Document(file)
        # Copy the element list first: appending moves each element out of
        # sub_doc, which would otherwise skip items during live iteration.
        for element in list(sub_doc._document_part.body._element):
            main_doc._document_part.body._element.append(element)
    main_doc.save(output)
Unfortunately, it's not yet possible nor easy to copy images with python-docx. I fall back to win32com ...
