len() function not giving correct character number - python-3.x

I am trying to find the number of characters in a string, but for some strange reason len() only returns 1.
Here is an example of my output:
WearWorks is a haptics design company that develops products and
experiences that communicate information through touch. Our first product,
Wayband, is a wearable tactile navigation device for the blind and visually
impaired.
True
1
Here is my code:
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url="https://www.wear.works/"
response=requests.get(url)
html=response.content
soup=BeautifulSoup(html,'html.parser')
#reference https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
# getting rid of the script and style elements in the html
for script in soup(["script", "style"]):
    script.extract()  # rip it out
    # print(script)
# get text
# grabbing the first chunk of text
text = soup.get_text()[0]
print(isinstance(text, str))
print(len(text))
print(text)

The problem is the line text = soup.get_text()[0]. Change it to text = soup.get_text() and check the result. Indexing a string with [0] returns only its first character, which is why len() reports 1.
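To illustrate the difference, here is a minimal sketch that uses a short hard-coded string as a stand-in for the page text (the string is an assumption, not the actual output of soup.get_text()):

```python
# Hypothetical stand-in for the string returned by soup.get_text()
page_text = "WearWorks is a haptics design company."

first_char = page_text[0]            # indexing returns a one-character string
print(isinstance(first_char, str))   # still a str, so the isinstance check passes
print(len(first_char))               # length of that single character

print(len(page_text))                # length of the whole string
```

Because a single character is still a str, the isinstance check in the question cannot reveal the problem; only the length does.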

Related

How to find a span containing a particular word

I am using BeautifulSoup to parse a webpage. Now I would like to read the Index value 31811.75 from the span:
<span>Underlying Index: <b style="font-size:1.2em;">BANKNIFTY 31811.75</b> </span>
Unfortunately the span lacks any other identifiers such as a class. I followed the solutions mentioned on a similar question, but I don't seem to get the whole text:
>>> print(soup.body(text=re.compile('Underlying')))
['Underlying Index: ']
I would like to use the keyword Underlying to extract the text present in the span. How can I do this?
I created a synthetic HTML document that also contains a span we don't want to find. The decimal is then extracted from the matched span's text using re.findall().
from bs4 import BeautifulSoup
import re
html = """
<html><body>
<span>unwanted</span>
<span>Underlying Index: <b style="font-size:1.2em;">BANKNIFTY 31811.75</b> </span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")
index = re.findall(r"\d+\.\d+", soup.find(lambda tag: tag.name == "span" and "Underlying" in tag.text).text)
index[0] if len(index) == 1 else None  # re.findall() returns a list; take the first decimal found. Could default to 0.0 instead of None
output
'31811.75'
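The attempt in the question returned only 'Underlying Index: ' because a text search matches the text node itself, not the tag that encloses it. An alternative sketch, assuming the same synthetic HTML as in the answer, is to locate that text node and then step up to its parent span:

```python
import re
from bs4 import BeautifulSoup

html = """
<html><body>
<span>unwanted</span>
<span>Underlying Index: <b style="font-size:1.2em;">BANKNIFTY 31811.75</b> </span>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# A string search matches the text node, not the tag that contains it
text_node = soup.find(string=re.compile("Underlying"))

# .parent is the enclosing <span>; get_text() also includes the nested <b> text
span_text = text_node.parent.get_text()

# Pull out the first decimal number, or None if nothing matched
match = re.search(r"\d+\.\d+", span_text)
index_value = match.group() if match else None
print(index_value)
```

Either approach should work here; the lambda version filters on tags directly, while this one starts from the matched text and walks up one level.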

Counting words in a webpage is inaccurate

Noob, trying to build a word counter, to count the words displayed on a website. I found some code (counting words inside a webpage), modified it, tried it on Google, and found that it was way off. Other code I tried displayed all of the various HTML tags, which was likewise not helpful. If visible page content reads: "Hello there world," I'm looking for a count of 3. For now, I'm not concerned with words that are in image files (pictures). My modified code is as follows:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
# Page you want to count words from
page = "https://google.com"
# Get the page
r = requests.get(page)
soup = BeautifulSoup(r.content)
# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True)) for s in soup.findAll('p'))
# creates a dictionary of words and frequency from paragraphs
content_paras = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))
sum_of_paras = sum(content_paras.values())
# We get the words within divs
text_div = (''.join(s.findAll(text=True)) for s in soup.findAll('div'))
content_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
sum_of_divs = sum(content_div.values())
words_on_page = sum_of_paras + sum_of_divs
print(words_on_page)
As always, simple answers I can follow are appreciated over complex/elegant ones I cannot, b/c Noob.
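One reason the count comes out too high is that findAll('div') also returns nested divs, so text inside an inner div (or inside a p within a div) is counted once for every enclosing element. A minimal sketch of an alternative, assuming a small hard-coded page in place of the live request, is to remove the non-visible elements and split the text of the whole document only once:

```python
from bs4 import BeautifulSoup

# Hypothetical page standing in for r.content
html = """
<html><head><title>Demo</title><style>p {color: red;}</style></head>
<body>
<div><div><p>Hello there world</p></div></div>
<script>var hidden = "not visible";</script>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Remove elements whose text is not shown in the page body
for tag in soup(["script", "style", "title"]):
    tag.extract()

# Take the text of the whole document once, so nested tags are not counted twice
words = soup.get_text(separator=" ").split()
print(len(words))
```

This only counts text present in the downloaded HTML; anything a site adds later with JavaScript would not be included.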

How to fix 'ValueError("input must have more than one sentence")' Error

I'm writing a script that takes a website URL and downloads the page using Beautiful Soup. It then uses gensim.summarization to summarize the text, but I keep getting ValueError("input must have more than one sentence") even though the text has more than one sentence. The first part of the script, which downloads the text, works, but I can't get the second part to summarize it.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize
from gensim.summarization.textcleaner import split_sentences
#===========================================
print("(Insert URL)")
url = input()
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce,'lxml')
#===========================================
print(soup.title.string)
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text.replace('.', '.\n')
        text = split_sentences(text)
        text = summarize(str(text))
        text = text.encode('utf-8', 'ignore')
        #===========================================
        file.write(text+'\n\n'.encode('utf-8'))
After the script runs, it should create a .txt file containing the summarized text in whatever folder the .py file is located.
You should not use split_sentences() before passing the text to summarize() since summarize() takes a string (with multiple sentences) as input.
In your code you are first turning your text into a list of sentences (using split_sentences()) and then converting that back to a string (with str()). The result of this is a string like "['First sentence', 'Second sentence']". It doesn't make sense to pass this on to summarize().
Instead you should simply pass your raw text as input:
text = summarize(text)
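To see why the intermediate list causes trouble, this small sketch (with two made-up sentences) shows what str() produces from a list of sentences:

```python
# Hypothetical sentences standing in for the output of split_sentences()
sentences = ["First sentence.", "Second sentence."]

# str() on a list gives its printed representation, brackets and quotes included
as_string = str(sentences)
print(as_string)

# Joining the sentences gives back plain multi-sentence text instead
joined = " ".join(sentences)
print(joined)
```

The first string carries the list's brackets and quotes, which is not the plain multi-sentence text that summarize() expects.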

How to change datatype after parsing?

Here is my full working code.
When imported into Excel the data is imported as text.
How do I change the data type?
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
driver = webdriver.PhantomJS()
driver.get("http://www.cbr.ru/hd_base/dv/?P1=4")
driver.find_element_by_id('UniDbQuery_FromDate').clear()
driver.find_element_by_id('UniDbQuery_FromDate').send_keys('13.01.2013')
driver.find_element_by_id('UniDbQuery_ToDate').clear()
driver.find_element_by_id('UniDbQuery_ToDate').send_keys('12.12.2017')
driver.find_element_by_id("UniDbQuery_searchbutton").click()
z=driver.page_source
driver.quit()
soup=BeautifulSoup(z)
x=[]
for tag in soup.tbody.findAll('td'):
    x.append(tag.text)
y=x[1::2]
d=pd.Series(y)
You can use the .astype method for series:
d = d.astype(int)
or whatever datatype you're trying to convert to in place of int. Note that this will give you an error if there's anything in your series that it can't convert - you may need to drop null values or strip whitespace first if that happens.
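As a rough sketch of that cleanup, assuming the scraped cells look like the made-up strings below (the real values from the page may be formatted differently), whitespace can be stripped before converting, and pd.to_numeric with errors="coerce" turns anything it cannot parse into NaN instead of raising:

```python
import pandas as pd

# Hypothetical scraped cell text; the real values may differ
d = pd.Series([" 12.5 ", "7", "", "3.25\n"])

# Strip surrounding whitespace, then convert; unparseable entries become NaN
cleaned = d.str.strip()
numbers = pd.to_numeric(cleaned, errors="coerce")

# Drop the entries that could not be converted
numbers = numbers.dropna()
print(numbers.dtype)
print(numbers.tolist())
```

If the values are known to be clean, d.astype(float) on the stripped series is enough; the coerce variant is just more forgiving.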

(Python)- How to store text extracted from HTML table using BeautifulSoup in a structured python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to an xls file and operate on the elements (for example, if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text
text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get
n
That means the list holds individual characters rather than whole values.
Please let me know what I am doing wrong.
That is because Ear_yield_text is a string rather than a list, so iterating over it yields one character at a time. Assuming the values in the text are separated by newlines, you can split it directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now printing list_Ear_yield gives this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
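Building on that, here is a minimal sketch of the comparison mentioned in the question, assuming the text has one value per line as in the output shown (the blank-line filtering and the float conversion are additions, not part of the answer):

```python
# Hypothetical text, mirroring the row printed in the question
Ear_yield_text = "Earnings Yield\n0.01\n-0.59\n-0.33\n-1.23\n-0.11\n"

# Split into lines and discard empty entries such as the one after the final newline
list_Ear_yield = [line.strip() for line in Ear_yield_text.split("\n") if line.strip()]

# The first entry is the row label; the rest are numbers
label = list_Ear_yield[0]
values = [float(v) for v in list_Ear_yield[1:]]

if values[0] > values[1]:
    print(label, "decreased from the first value to the second")
```

Converting to float matters for the comparison: as strings, the values would be compared character by character rather than numerically.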
