I have a task: to get the raw text from an HTML page. After parsing the HTML, I receive a string with a lot of '\n' symbols. When I try to replace them with an empty string, the replace function doesn't work. Here is my code:
from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen('http://shakespeare.mit.edu/lear/full.html') as response:
    lear_bytes = response.read()

lear_html = str(lear_bytes)
soup = BeautifulSoup(lear_html, 'html.parser')
lear_txt_dirty = soup.get_text()
lear_txt_clean = lear_txt_dirty.replace('\n', '')
print(lear_txt_clean)
When sorting out string problems, it's useful to print the repr of the string, so you can see what's really there. Replacing your print with:
#print(lear_txt_clean)
print("Num newlines", lear_txt_clean.count('\n'))
print(repr(lear_txt_clean[:80]))
I get
Num newlines 0
"b'\\n \\n \\n King Lear: Entire Play\\n \\n \\n \\n \\n \\n\\n\\nKing Lear\\n\\n Shakesp"
You are processing a python byte representation of the text, not the real text. In your code, lear_bytes is a bytes object but lear_html = str(lear_bytes) doesn't decode the object, it gives you a python representation of the bytes object. Instead, you should just let BeautifulSoup have the raw bytes and let it sort it out:
from bs4 import BeautifulSoup
import urllib.request

with urllib.request.urlopen('http://shakespeare.mit.edu/lear/full.html') as response:
    soup = BeautifulSoup(response.read(), 'html.parser')

lear_txt_dirty = soup.get_text()
lear_txt_clean = lear_txt_dirty.replace('\n', '')
print(lear_txt_clean[:80])
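To see the difference directly, compare str() on a bytes object with an explicit decode() (a minimal standalone illustration, not tied to the page above):

```python
raw = b'King Lear\n'

# str() on bytes produces the *representation*: the b'...' wrapper and the
# backslash-n are literal characters in the resulting string
as_repr = str(raw)
print(as_repr)          # b'King Lear\n'
print('\n' in as_repr)  # False: it contains a backslash and an 'n', not a newline

# decode() produces the actual text, with a real newline character
as_text = raw.decode('utf-8')
print('\n' in as_text)  # True
```

That is why counting '\n' in the decoded text finds real newlines, while the str() version only ever contains the two-character escape sequence.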
How can I cut the output from a response using Python requests?
The output looks like:
...\'\n});\nRANDOMDATA\nExt.define...
or
...\'\n});\nOTHERRANDOMDATA\nExt.define...
And I only want to print out the RANDOMDATA.

import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
You can use the re module's findall() function to search for all occurrences of a regular expression in a string.
The following code just searches for the string RANDOMDATA in the response returned by the requests.get() function:
import re
import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)
ar = re.findall('RANDOMDATA', str(response.content))
if ar:
    print(ar[0])
This link would be helpful to learn about regular expressions
Additionally, if you have a variable data containing a string to be searched, and a variable t containing a string to search for, you can use:
import re
arr = re.findall(t,data)
to return all the occurrences of t in data, and:

idx = data.find(t)

to get the index of the first occurrence of t in data (note that find() is a plain string method, not part of re, and returns -1 if t is not found).
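If the goal is to extract the varying data itself rather than just confirm the literal RANDOMDATA is present, a capture group between the fixed markers works; a sketch assuming the }); and Ext.define markers from the output above really delimit the data:

```python
import re

# Sample body; the }); and Ext.define markers are taken from the
# output shown in the question
content = "...'\n});\nRANDOMDATA\nExt.define..."

# Capture whatever sits on the line between the two fixed markers
match = re.search(r'\}\);\n(.+)\nExt\.define', content)
if match:
    print(match.group(1))  # RANDOMDATA
```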
I'm stuck trying to transform just one word from Unicode into a plain string. I looked for answers but none helped me solve this simple problem.
I've already tried the following links:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch03s18.html
Convert a Unicode string to a string in Python (containing extra symbols)
How to convert unicode string into normal text in python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mpgo.mp.br/coliseu/concursos/inscricoes_abertas')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table', attrs={'class':'grid'})
text = table.get_text()
text_str = text[0:7]
text_str = text_str.encode('utf-8')
test_str = 'Nenhum'
test_str = test_str.encode('utf-8')
if text_str == test_str:
    print('Ok they are equal')
else:
    print(id(text_str))
    print(id(test_str))
    print(type(test_str))
    print(type(test_str))
    print(test_str)
    print(test_str)
My expected result is text_str being equal to test_str.
Welcome to SO. You have a typo in your debug output: the last four printed values are all test_str instead of some being text_str.
Otherwise you would have noticed that the variable you read in contains:
'\nNenhum'
So either change your slice to text_str = text[1:7], or set your test string accordingly:
test_str = '\nNenhum'
It works. Happy hacking...
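A slightly more robust variant is to strip the whitespace before comparing, so the code does not depend on there being exactly one leading newline (a sketch reusing the strings from this question; the words after 'Nenhum' are made up):

```python
# Stand-in for table.get_text(); the leading '\n' is the culprit,
# and the text after 'Nenhum' is hypothetical
text = '\nNenhum concurso'

# Slicing the raw text keeps the newline, so the comparison fails
print(text[0:7] == 'Nenhum')              # False: text[0:7] is '\nNenhum'

# Stripping first avoids hardcoding the offset
print(text.strip().startswith('Nenhum'))  # True
```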
I want to read all the web pages, extract the text from them, and then remove whitespace and punctuation. My goal is to combine all the words from all the web pages and produce a dictionary that counts the number of times each word appears across all the web pages.
Following is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import re

def web_parsing(filename):
    with open(filename, "r") as df:
        urls = df.readlines()
    for url in urls:
        uClient = ureq(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        par = page_soup.findAll('p')
        for node in par:
            #print(node)
            text = ''.join(node.findAll(text=True))
            #text = text.lower()
            #text = re.sub(r"[^a-zA-Z-0-9 ]","",text)
            text = text.strip()
            print(text)
The output I got is:
[Paragraph1]
[paragraph2]
[paragraph3]
.....
What I want is:
[Paragraph1 paragraph2 paragraph 3]
Now, if I split the text here it gives me multiple lists:
[paragraph1], [paragraph2], [paragraph3]..
I want all the words of all the paragraphs of all the webpages in one list.
Any help is appreciated.
As far as I understood your question, you have a list of nodes from which you can extract a string. You then want these strings merged into a single string. This can simply be done by creating an empty string and then adding the subsequent strings to it.
result = ""
for node in par:
    text = ''.join(node.findAll(text=True)).strip()
    result += text + ' '
result = result.strip()
print(result)    # "Paragraph1 Paragraph2 Paragraph3"
print([result])  # ["Paragraph1 Paragraph2 Paragraph3"]
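An equivalent and slightly more idiomatic pattern is to collect the strings in a list and join them once at the end, which also yields the flat word list the question asks for (a sketch with plain strings standing in for the extracted node texts):

```python
# Stand-ins for the strings extracted from each <p> node
paragraphs = ['Paragraph1 ', '\nparagraph2', 'paragraph3\n']

# strip each piece, then let join() insert exactly one space between them
result = ' '.join(p.strip() for p in paragraphs)
print(result)  # Paragraph1 paragraph2 paragraph3

# split() then gives every word across all paragraphs in one list,
# ready for counting with a dictionary
words = result.split()
print(words)   # ['Paragraph1', 'paragraph2', 'paragraph3']
```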
I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield = soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to an xls file and operate on the elements (for example, check if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text

text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item and check:
print(list_Ear_yield[3])
I expect the output to be -0.33 but I get:

n

That means the list takes in individual characters and not the full words.
Please let me know where I am going wrong.
That is because your Ear_yield_text is a string rather than a list; iterating over a string yields one character at a time. Assuming the text has newlines in it, you can do this directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will get this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
# Ex1
# Number of datasets currently listed on data.gov
# http://catalog.data.gov/dataset
import requests
import re
from bs4 import BeautifulSoup

page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
results = re.search([0-9][0-9][0-9],[0-9][0-9][0-9], value
print(value)
The code is above. I want to find text matching the regex [0-9][0-9][0-9],[0-9][0-9][0-9]
inside the text held in the variable 'value'.
How can I do this?
Based on ShellayLee's suggestion I changed it to:
import requests
import re
from bs4 import BeautifulSoup

page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
my_match = re.search(r'\d\d\d,\d\d\d', value)
print(my_match)
I'm still getting an error:
Traceback (most recent call last):
File "ex1.py", line 19, in
my_match = re.search(r'\d\d\d,\d\d\d', value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
You need some basics of regex in Python. A regex in Python is represented as a string, and the re module provides functions like match, search, and findall which take such a string as an argument and treat it as a pattern.
In your case, the pattern [0-9][0-9][0-9],[0-9][0-9][0-9] can be represented as:
my_pattern = r'\d\d\d,\d\d\d'
then used like:

my_match = re.search(my_pattern, value_text)

where \d matches a digit (the same as [0-9]), and the leading r makes it a raw string, so the backslashes are not treated as escape characters. Also note that re.search expects a string, but your value is the list of tags returned by find_all(); that mismatch is what raises your TypeError. Pass the tag text instead, for example value[0].get_text(). The search function returns a match object, or None if nothing matched.
I suggest you walk through some tutorials first to clear up any further confusion. The official HOWTO is well written:
https://docs.python.org/3.6/howto/regex.html
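Putting this together with the code in the question, a working search looks like the following (a sketch with a hardcoded snippet standing in for value[0].get_text(), since the live page's markup and wording may differ):

```python
import re

# Stand-in for value[0].get_text(); the wording here is hypothetical
value_text = '123,456 datasets found'

# The raw-string pattern matches three digits, a comma, three digits
my_match = re.search(r'\d\d\d,\d\d\d', value_text)
if my_match:
    print(my_match.group(0))  # 123,456
```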