Counting words in a webpage is inaccurate - python-3.x

Noob, trying to build a word counter, to count the words displayed on a website. I found some code (counting words inside a webpage), modified it, tried it on Google, and found that it was way off. Other code I tried displayed all of the various HTML tags, which was likewise not helpful. If visible page content reads: "Hello there world," I'm looking for a count of 3. For now, I'm not concerned with words that are in image files (pictures). My modified code is as follows:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
# Page you want to count words from
page = "https://google.com"
# Get the page
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')
# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
# creates a dictionary of words and frequency from paragraphs
content_paras = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))
sum_of_paras = sum(content_paras.values())
# We get the words within divs
text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
content_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
sum_of_divs = sum(content_div.values())
words_on_page = sum_of_paras + sum_of_divs
print(words_on_page)
As always, simple answers I can follow are appreciated over complex/elegant ones I cannot, b/c Noob.
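For what it's worth, one common reason the <p>-plus-<div> totals run high is that nested tags get counted more than once: a <p> inside a <div> contributes its words to both sums, and nested <div>s multiply the effect. A minimal sketch of an alternative, assuming the goal is the visible body text only: strip the tags whose text is never displayed, then walk the remaining text once.
import requests
from bs4 import BeautifulSoup
from string import punctuation

page = "https://google.com"
r = requests.get(page)
soup = BeautifulSoup(r.content, 'html.parser')

# Remove tags whose contents are never displayed to the user
for tag in soup(['script', 'style', 'noscript']):
    tag.decompose()

# get_text() visits each text node exactly once, so nested
# <div>/<p> content is not double-counted
words = [w.strip(punctuation).lower() for w in soup.get_text(separator=' ').split()]
words = [w for w in words if w]  # drop tokens that were pure punctuation
print(len(words))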

Related

Get the count of a phrase in a url using python and bs4

I want to get the count of any phrase appearing in a URL, say https://en.wikipedia.org/wiki/India.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/India'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
Now, I want to get the count of the phrase India is a in the soup. How to go about this?
Please suggest.
This can be done in one of two ways.
First, the common denominator:
texts = soup.find_all(text=True)
cleaned = [t.strip() for t in texts]  # strip whitespace from each text node
counter = 0
Now, if you want to use regex:
import re
regex = re.compile(r'\bIndia is a\b')
for c in cleaned:
    # match() only matches at the start of each text node
    if regex.match(c) is not None:
        counter += 1
I personally don't like using regex except as a last resort, so I would go the longer way:
phrase = 'India is a'
for c in cleaned:
    if phrase == c or phrase + ' ' in c:
        counter += 1
In both cases, print(counter) outputs 6.
Note that, intentionally, these do not count the 3 situations where the phrase is part of a larger phrase (such as India is also); they count only the exact phrase or the phrase followed by a space.
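A possible middle ground (my sketch, not from either answer): parse the page first so markup is excluded, then let a single findall count every occurrence in the visible text. The \b boundaries keep the match from extending into words like "Indian" or into "India is also".
import re
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/India'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# get_text() joins the visible text nodes into one string
text = soup.get_text(separator=' ')
count = len(re.findall(r'\bIndia is a\b', text))
print(count)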
I tried the following and it worked fine:
import re
import requests
url = 'https://en.wikipedia.org/wiki/India'
response = requests.get(url)
response_text = response.text
keyword = 'India is a'
match = re.findall(keyword, response_text)
count = len(match)
print(count)
Output is 9.
Note that this searches the raw HTML rather than the parsed text, so it will look into <head>, <body>, and elsewhere (including markup), which is why the count is higher than the 6 found above.

Beautifulsoup span class is returning a blank string

I am trying to print out different things from a Norwegian weather site with BeautifulSoup.
I manage to print out everything I want except one thing, which mentions how the weather will be in the next hour.
This contains the text I want to get:
<span class="nowcast-description" data-reactid="59">har opphold nå, det holder seg tørt den neste timen</span>
And I am trying to print it with this:
cond = soup.find(class_='nowcast-description').get_text()
[Screenshot: inspected elements from storm.no/ski, showing some of the elements on the site.]
When printing these:
soup = bs4.BeautifulSoup(html, "html.parser")
loc = soup.find(class_='info-text').get_text()
cond = soup.find(class_='nowcast-description').get_text()
temp = soup.find(class_='temperature').get_text()
wind = soup.find(class_='indicator wind').get_text()
I also tested with this line:
cond = soup.select("span.nowcast-description")
but that gives me everything except what I want from the line.
Site link: https://www.storm.no/ski
I get:
Ski Akershus, 131 moh.
""
2°
3 m/s
It is retrieved dynamically from a script tag. You can regex out the object containing all the forecasts and handle it with the hjson library (hjson is needed because the keys are unquoted). You need to install hjson, then do the following:
import requests, hjson, re
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://www.storm.no/ski')
p = re.compile(r'window\.__dehydratedState = (.*?);', re.DOTALL)
data = hjson.loads(p.findall(r.text)[0])
print(data['app-container']['current']['forecast']['nowcastDescription'])
You could regex out the value directly as well, but using hjson means you have access to all the other data.
That's because the text under nowcast-description is generated dynamically. If you dump the loaded page:
print(soup.prettify())
you will find only this:
<span class="nowcast-description" data-reactid="59">
</span>
On rough analysis, it seems that the content of this span is loaded from the field nowcastDescription, which is part of window.__dehydratedState.
Because the field sits in a simple JSON-like structure, you can try to extract the value from it.
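If all you need is that one string, here is a minimal sketch (my assumption about how the field is formatted in the page source, not code from either answer) that pulls just that field with a regex and skips the full hjson parse:
import re
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.storm.no/ski', headers=headers)

# Assumes the field appears as nowcastDescription: "value" in the
# dehydrated state; the key itself may be unquoted
m = re.search(r'nowcastDescription"?\s*:\s*"([^"]*)"', r.text)
if m:
    print(m.group(1))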

Extract data from embedded script tag in html

I'm trying to fetch data inside a (big) script tag within HTML. By using BeautifulSoup I can reach the necessary script, yet I cannot get the data I want.
What I'm looking for inside this tag resides within a list called "Beleidsdekkingsgraad", more specifically
["Beleidsdekkingsgraad","107,6","107,6","109,1","109,8","110,1","111,5","112,5","113,3","113,3","114,3","115,7","116,3","116,9","117,5","117,8","118,1","118,3","118,4","118,6","118,8","118,9","118,9","118,9","118,5","118,1","117,8","117,6","117,5","117,1","116,7","116,2"], and even more specifically the last entry in the list (116,2).
Following 1 or 2 did not get the case done.
What I've done so far
import requests
from bs4 import BeautifulSoup

base = 'https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed'
url = requests.get(base)
soup = BeautifulSoup(url.text, 'html.parser')
all_scripts = soup.find_all('script')
all_scripts[3].get_text()[1907:2179]
This, however, is not satisfying, since the indexing has to be changed every time new numbers are added.
What I'm looking for is, first, an easy way to extract the list from the script tag and, second, a way to catch the last number of the extracted list (i.e. 116,2).
You could regex out the JavaScript object holding that item, then parse it with the json library:
import requests,re,json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'window\.infographicData=(.*);')
data = json.loads(p.findall(r.text)[0])
result = [i for i in data['elements'][1]['data'][0] if 'Beleidsdekkingsgraad' in i][0][-1]
print(result)
Or do the whole thing with regex:
import requests,re
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'\["Beleidsdekkingsgraad".+?,"([0-9,]+)"\]')
print(p.findall(r.text)[0])
Another option:
import requests,re, json
r = requests.get('https://e.infogr.am/pob_dekkingsgraadgrafiek?src=embed#async_embed')
p = re.compile(r'(\["Beleidsdekkingsgraad".+?"\])')
print(json.loads(p.findall(r.text)[0])[-1])
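One follow-up worth noting: the values use a Dutch-style decimal comma, so if the last entry is needed as a number rather than a string, the comma has to be swapped for a dot first:
value = '116,2'
number = float(value.replace(',', '.'))
print(number)  # 116.2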

Python3: How to get the English word from a URL?

I use this code:
import urllib.request
fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
x = 'alt'
for item in mystr.split():
    if x in item:
        print(item.strip())
I get the Thai words from this code, but I don't know how to get the English words. Thanks.
If you want to get words from the table you should use a parsing library like BeautifulSoup4. Here is an example of how you can parse this (I'm using requests to fetch the page and beautifulsoup to parse the data):
First, using dev tools in your browser, identify the table with the content you want to parse. The table with translations has the servicesT class attribute, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that you need to get all rows that contain translations for Thai words. If you look at the page's source you will notice that the first few <tr> rows are headers, so we will omit them. After that we will get all <td> elements from each row (in that table there are always 3 <td> elements) and fetch the words from them (in this table the words are actually nested in <span> and <a> tags).
table_rows = table.findAll('tr')
# We will skip the first 3 rows because they do not
# contain the information we need
for tr in table_rows[3:]:
    # Finding all <td> elements
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        # Get tag with Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get tag with English word
        english_word_tag = row_columns[1].find('span')
        if thai_word_tag:
            thai_word = thai_word_tag.text
        if english_word_tag:
            english_word = english_word_tag.text
        # Printing our fetched words
        print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I've also noticed that the data inside the table does not always have translations, so you should keep that in mind when scraping. You can also use the Requests-HTML library to parse the data (it supports the pagination that is present in the table on the page you want to scrape), as sketched below.
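A hedged sketch of that Requests-HTML variant, assuming the same servicesT table layout as above (note that its find() takes CSS selectors rather than BeautifulSoup-style arguments):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then')

# find() takes a CSS selector; first=True returns a single element
table = r.html.find('table.servicesT', first=True)
for row in table.find('tr')[3:]:  # skip the header rows, as above
    cells = row.find('td')
    if len(cells) >= 2:
        print((cells[0].text, cells[1].text))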

How to fix 'ValueError("input must have more than one sentence")' Error

I'm writing a script that takes a website URL and downloads it using Beautiful Soup. It then uses gensim.summarization to summarize the text, but I keep getting ValueError("input must have more than one sentence") even though the text has more than one sentence. The first section of the script, which downloads the text, works, but I can't get the second part to summarize the text.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize
from gensim.summarization.textcleaner import split_sentences
#===========================================
print("(Insert URL)")
url = input()
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce,'lxml')
#===========================================
print(soup.title.string)
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text.replace('.', '.\n')
        text = split_sentences(text)
        text = summarize(str(text))
        text = text.encode('utf-8', 'ignore')
        #===========================================
        file.write(text + '\n\n'.encode('utf-8'))
After the script is run, it should create a .txt file with the summarized text in it, in whatever folder the .py file is located.
You should not use split_sentences() before passing the text to summarize() since summarize() takes a string (with multiple sentences) as input.
In your code you are first turning your text into a list of sentences (using split_sentences()) and then converting that back to a string (with str()). The result of this is a string like "['First sentence', 'Second sentence']". It doesn't make sense to pass this on to summarize().
Instead you should simply pass your raw text as input:
text = summarize(text)
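Applied to the script above, that advice might look like the following sketch (my restructuring, not a verbatim fix from the answer): gather all the paragraph text first, then call summarize() once on the whole document. This also avoids feeding summarize() individual paragraphs that are too short to have more than one sentence.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize

url = input("(Insert URL) ")
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Join every paragraph so summarize() sees one multi-sentence text
raw_text = ' '.join(p.text for p in soup.find_all('p'))
summary = summarize(raw_text)

with open(soup.title.string + '.txt', 'wb') as file:
    file.write(summary.encode('utf-8', 'ignore'))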