I can't convert a Unicode string into a plain string - python-3.x

I'm stuck trying to transform a single word from Unicode into a plain string. I've looked for answers, but none of them helped me solve this simple problem.
I've already tried the following links:
https://www.oreilly.com/library/view/python-cookbook/0596001673/ch03s18.html
Convert a Unicode string to a string in Python (containing extra symbols)
How to convert unicode string into normal text in python
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mpgo.mp.br/coliseu/concursos/inscricoes_abertas')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table', attrs={'class': 'grid'})
text = table.get_text()
text_str = text[0:7]
text_str = text_str.encode('utf-8')
test_str = 'Nenhum'
test_str = test_str.encode('utf-8')
if text_str == test_str:
    print('Ok they are equal')
else:
    print(id(text_str))
    print(id(test_str))
    print(type(test_str))
    print(type(test_str))
    print(test_str)
    print(test_str)
My expected result is for text_str to equal test_str.

Welcome to SO. You have a typo in your debug output: the last four values printed are all test_str instead of text_str. Had the output been correct, you would have noticed that the variable you read in contains:
'\nNenhum'
So either change your slice to text_str = text[1:7], or set your test string accordingly:
test_str = '\nNenhum'
Then it works. Happy hacking...
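For reference, a minimal sketch of the fix (assuming the page still renders that table and its text still starts with 'Nenhum'): strip the whitespace before slicing, and note that in Python 3 two str values compare fine without encoding either side.
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.mpgo.mp.br/coliseu/concursos/inscricoes_abertas')
soup = BeautifulSoup(r.content, 'html.parser')
table = soup.find('table', attrs={'class': 'grid'})

# get_text() starts with '\n' here, so strip before slicing
text_str = table.get_text().strip()[0:6]
print(text_str == 'Nenhum')  # True, no encode() needed in Python 3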

Related

How to extract text between specific letters from a string in Python(3.9)?

How can I extract, from a string in Python, a value that sits inside the text between two known markers?
e.g.
"Kahoot : ID:1234567 Name:RandomUSERNAME"
I want to receive the 1234567 and the RandomUSERNAME in 2 different variables.
One way I found is to copy everything between "ID:" and the following space, and everything after "Name:" until the end of the text.
How do I code this?
If I haven't explained this correctly, tell me; I don't know how to ask/format this question! Sorry for any inconvenience!
If the text always follows the same format you could just split the string. Alternatively, you could use regular expressions with the re library.
Using split:
string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
string = string.split(" ")
id = string[2][3:]
name = string[3][5:]
print(id)
print(name)
Using re:
import re

string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
# The lookbehind/lookahead match the value without including the labels.
id = re.search(r'(?<=ID:).*?(?=\s)', string).group(0)
name = re.search(r'(?<=Name:).*', string).group(0)
print(id)
print(name)
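If you prefer a single pattern, named groups can capture both values in one pass; this is just a sketch and assumes the Name value contains no spaces.
import re

string = "Kahoot : ID:1234567 Name:RandomUSERNAME"
# One pattern, two named groups: everything after 'ID:' up to whitespace,
# and everything after 'Name:' that is not whitespace.
match = re.search(r'ID:(?P<id>\S+)\s+Name:(?P<name>\S+)', string)
if match:
    print(match.group('id'))    # 1234567
    print(match.group('name'))  # RandomUSERNAME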

Get the count of a phrase in a url using python and bs4

I want to get the count of any phrase appearing in the page at a URL, say https://en.wikipedia.org/wiki/India.
import requests
from bs4 import BeautifulSoup
url = 'https://en.wikipedia.org/wiki/India'
r = requests.get(url)
soup = BeautifulSoup(r.text,'lxml')
Now, I want to get the count of the phrase India is a in the soup. How do I go about this? Please suggest.
This can be done in one of two ways.
First, the common denominator:
texts = soup.find_all(text=True)
cleaned = ["".join(t.strip()) for t in texts]
counter=0
Now, if you want to use regex:
import re
regex = re.compile(r'\bIndia is a\b')
for c in cleaned:
    if regex.match(c) is not None:
        counter += 1
I, personally, don't like using regex except as a last resort, so I would go the longer way:
phrase = 'India is a'
for c in cleaned:
    if phrase == c or phrase + ' ' in c:
        counter += 1
In both cases, print(counter) outputs 6.
Note that, intentionally, these do not count the 3 situations where the phrase is part of a larger phrase (such as India is also); it counts only the exact phrase or the phrase followed by a space.
I tried the code below and it worked fine:
import re
import requests

url = 'https://en.wikipedia.org/wiki/India'
response = requests.get(url)
response_text = response.text
keyword = 'India is a'
match = re.findall(keyword, response_text)  # search the raw HTML
count = len(match)
count
Output is 9. Note that this searches the raw HTML, so it will look into <head>, <body> and elsewhere in the markup, and it also counts the phrase when it appears inside longer phrases such as India is also, which accounts for the difference from the 6 counted above.
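If you only care about the visible text and are happy to count every substring occurrence (including inside longer phrases like India is also), a simpler sketch is to count directly on the text extracted by BeautifulSoup:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/India'
soup = BeautifulSoup(requests.get(url).text, 'lxml')

# str.count() counts substring occurrences in the visible text only,
# with the markup stripped out
print(soup.get_text().count('India is a'))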

How to replace question marks with words

I have Arabic tweets and I want to replace question marks and exclamation marks with their Arabic word equivalents. I tried the code below using regular expressions, but nothing happens. I am using a Jupyter notebook.
def replace_questionmark(tweet):
    text = re.sub("!", "تعجب", tweet)
    text = re.sub('استفهام', '؟', tweet)
    return tweet

data_df['clean text'] = data_df['Text'].apply(lambda x: replace_questionmark(x))
The following code solves your problem. Your function never returns its work: the substitution results are assigned to text but you return the unchanged tweet, the second re.sub operates on tweet instead of text, and its arguments are swapped (the order is pattern, replacement, string).
import pandas as pd

Text = [u'I am feeling good !', u'I am testing this code ؟']
data_df = pd.DataFrame(columns=['Text'], data=Text)

def replace_questionmark(tweet):
    text = tweet.replace(u'!', u'تعجب')
    text = text.replace(u'؟', u'استفهام')
    return text  # in Python 3, return the str directly; no encode() needed

data_df['clean text'] = data_df['Text'].apply(lambda x: replace_questionmark(x))
print(data_df)
Output
                       Text                      clean text
0       I am feeling good !        I am feeling good تعجب
1  I am testing this code ؟  I am testing this code استفهام
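If you want to stay with re.sub as in the original attempt, the same function looks like this; note the argument order is re.sub(pattern, replacement, string), and each call must operate on the previous result:
import re

def replace_questionmark(tweet):
    text = re.sub('!', 'تعجب', tweet)    # pattern first, then replacement
    text = re.sub('؟', 'استفهام', text)  # keep working on text, not tweet
    return text

data_df['clean text'] = data_df['Text'].apply(replace_questionmark)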

How to fix 'ValueError("input must have more than one sentence")' Error

I'm writing a script that takes a website URL and downloads it using Beautiful Soup. It then uses gensim.summarization to summarize the text, but I keep getting ValueError("input must have more than one sentence") even though the text has more than one sentence. The first section of the script, which downloads the text, works, but I can't get the second part to summarize the text.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize
from gensim.summarization.textcleaner import split_sentences
#===========================================
print("(Insert URL)")
url = input()
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce,'lxml')
#===========================================
print(soup.title.string)
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text.replace('.', '.\n')
        text = split_sentences(text)
        text = summarize(str(text))
        text = text.encode('utf-8', 'ignore')
        #===========================================
        file.write(text + '\n\n'.encode('utf-8'))
After the script is run, it should create a .txt file with the summarized text in whatever folder the .py file is located in.
You should not use split_sentences() before passing the text to summarize() since summarize() takes a string (with multiple sentences) as input.
In your code you are first turning your text into a list of sentences (using split_sentences()) and then converting that back to a string (with str()). The result of this is a string like "['First sentence', 'Second sentence']". It doesn't make sense to pass this on to summarize().
Instead you should simply pass your raw text as input:
text = summarize(text)
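Putting it together, a minimal sketch of the corrected script (assuming the article body lives in the page's <p> tags) collects all paragraphs first and calls summarize() once on the combined string:
import bs4 as bs
import urllib.request
from gensim.summarization import summarize

url = input("(Insert URL) ")
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce, 'lxml')

# Join every paragraph into one multi-sentence string, then summarize once
full_text = '\n'.join(p.text for p in soup.find_all('p'))
summary = summarize(full_text)

with open(soup.title.string + '.txt', 'w', encoding='utf-8') as file:
    file.write(summary)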

How do I combine paragraphs of web pages (from a text file containing urls)?

I want to read all the web pages, extract the text from them, and then remove white space and punctuation. My goal is to combine the words from all the web pages and produce a dictionary that counts the number of times each word appears across all of them.
Following is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import re
def web_parsing(filename):
    with open(filename, "r") as df:
        urls = df.readlines()
    for url in urls:
        uClient = ureq(url)
        page_html = uClient.read()
        uClient.close()
        page_soup = soup(page_html, "html.parser")
        par = page_soup.findAll('p')
        for node in par:
            #print(node)
            text = ''.join(node.findAll(text=True))
            #text = text.lower()
            #text = re.sub(r"[^a-zA-Z-0-9 ]","",text)
            text = text.strip()
            print(text)
The output I got is:
[Paragraph1]
[paragraph2]
[paragraph3]
.....
What I want is:
[Paragraph1 paragraph2 paragraph 3]
Now, if I split the text here it gives me multiple lists:
[paragraph1], [paragraph2], [paragraph3]..
I want all the words of all the paragraphs of all the webpages in one list.
Any help is appreciated.
As far as I understood your question, you have a list of nodes from which you can extract strings, and you want those strings merged into a single string. This can simply be done by creating an empty string and then appending each subsequent string to it.
result = ""
for node in par:
text = ''.join(node.finAll(text=True)).strip()
result += text
print(result) # "Paragraph1 Paragraph2 Paragraph3"
prin([result]) # ["Paragraph1 Paragraph2 Paragraph3"]
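Since the stated goal is a dictionary of word counts across all pages, collections.Counter does that directly once the combined string exists; this sketch reuses the result variable from above and the question's commented-out regex for stripping punctuation:
import re
from collections import Counter

words = re.sub(r"[^a-zA-Z-0-9 ]", "", result.lower()).split()
word_counts = Counter(words)        # dict-like mapping word -> occurrences
print(word_counts.most_common(10))  # the ten most frequent words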
