How to fix 'ValueError("input must have more than one sentence")' Error - python-3.x

I'm writing a script that takes a website URL and downloads the page using Beautiful Soup. It then uses gensim.summarization to summarize the text, but I keep getting ValueError("input must have more than one sentence") even though the text has more than one sentence. The first part of the script, which downloads the text, works, but I can't get the second part to summarize the text.
import bs4 as bs
import urllib.request
from gensim.summarization import summarize
from gensim.summarization.textcleaner import split_sentences
#===========================================
print("(Insert URL)")
url = input()
sauce = urllib.request.urlopen(url).read()
soup = bs.BeautifulSoup(sauce,'lxml')
#===========================================
print(soup.title.string)
with open(soup.title.string + '.txt', 'wb') as file:
    for paragraph in soup.find_all('p'):
        text = paragraph.text.replace('.', '.\n')
        text = split_sentences(text)
        text = summarize(str(text))
        text = text.encode('utf-8', 'ignore')
        #===========================================
        file.write(text+'\n\n'.encode('utf-8'))
After the script is run, it should create a .txt file containing the summarized text in whatever folder the .py file is located.

You should not use split_sentences() before passing the text to summarize() since summarize() takes a string (with multiple sentences) as input.
In your code you are first turning your text into a list of sentences (using split_sentences()) and then converting that back to a string (with str()). The result of this is a string like "['First sentence', 'Second sentence']". It doesn't make sense to pass this on to summarize().
Instead you should simply pass your raw text as input:
text = summarize(text)
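To see why the round trip through str() is a problem, the following minimal sketch uses a hand-written list as a stand-in for the output of split_sentences() (the sample sentences are made up for illustration). Converting the list back with str() yields the list's printed representation, brackets and quotes included, rather than the original text. Joining the sentences would restore plain text, but passing the raw text directly avoids the detour altogether.

```python
# Hypothetical stand-in for what split_sentences() returns: a list of sentences.
sentences = ["First sentence.", "Second sentence."]

# str() on a list produces its repr, not the original text.
as_string = str(sentences)
print(as_string)

# Joining the items gives back ordinary text.
rejoined = " ".join(sentences)
print(rejoined)
```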

Related

Counting words in a webpage is inaccurate

Noob, trying to build a word counter, to count the words displayed on a website. I found some code (counting words inside a webpage), modified it, tried it on Google, and found that it was way off. Other code I tried displayed all of the various HTML tags, which was likewise not helpful. If visible page content reads: "Hello there world," I'm looking for a count of 3. For now, I'm not concerned with words that are in image files (pictures). My modified code is as follows:
import requests
from bs4 import BeautifulSoup
from collections import Counter
from string import punctuation
# Page you want to count words from
page = "https://google.com"
# Get the page
r = requests.get(page)
soup = BeautifulSoup(r.content)
# We get the words within paragraphs
text_p = (''.join(s.findAll(text=True))for s in soup.findAll('p'))
# creates a dictionary of words and frequency from paragraphs
content_paras = Counter((x.rstrip(punctuation).lower() for y in text_p for x in y.split()))
sum_of_paras = sum(content_paras.values())
# We get the words within divs
text_div = (''.join(s.findAll(text=True))for s in soup.findAll('div'))
content_div = Counter((x.rstrip(punctuation).lower() for y in text_div for x in y.split()))
sum_of_divs = sum(content_div.values())
words_on_page = sum_of_paras + sum_of_divs
print(words_on_page)
As always, simple answers I can follow are appreciated over complex/elegant ones I cannot, b/c Noob.
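One likely contributor to the inflated count is double counting: a <p> that sits inside a <div> has its text counted once by the paragraph pass and again by the div pass, and nested <div> elements repeat their inner text for every level. A minimal sketch of counting each piece of text once follows. It is an illustration under assumptions, not a drop-in fix: it uses the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, the VisibleTextParser class and count_visible_words function are hypothetical names, the sample HTML is made up, and text hidden by CSS or generated by JavaScript is not handled.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text nodes once each, skipping <script> and <style> contents."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Every text node is reported exactly once, however deeply it is nested.
        if self._skip_depth == 0:
            self.chunks.append(data)


def count_visible_words(html):
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    # Join with spaces so words from neighbouring elements stay separate.
    return len(" ".join(parser.chunks).split())


# Made-up sample page: the paragraph is nested in a div, with script and style noise.
sample_html = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><div><p>Hello there world</p></div>"
    "<script>var x = 1;</script></body></html>"
)
word_count = count_visible_words(sample_html)
print(word_count)
```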

How to extract text from a PDF file using Python? I have never done this and can't get the DOM of the PDF file

This is my PDF file: "https://drive.google.com/open?id=1M9k1AO17ZSwT6HTrTrB-uz85ps3WL1wS"
Can someone help me extract its text? Searching on SO gave me some clues about extracting text with these libraries: PyPDF2, PyPDF2.pdf, PageObject, u_, ContentStream, b_, TextStringObject, but I can't figure out how to use them.
Please help me extract the text with some explanation so I can understand the code, and tell me how to read the DOM of a PDF file.
You need to install some libraries:
pip install PyPDF2
pip install textract
pip install nltk
This will download the libraries you require to parse PDF documents and extract keywords. In order to do this, make sure your PDF file is stored within the folder where you’re writing your script.
Start up your favourite editor and type:
Note: All lines starting with # are comments.
Step 1: Import all libraries:
import PyPDF2
import textract
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
Step 2: Read PDF File
# Write a for-loop to open many files -- leave a comment if you'd
# like to learn how
filename = 'enter the name of the file here'
# open() allows you to read the file
pdfFileObj = open(filename, 'rb')
# The pdfReader variable is a readable object that will be parsed
# (PdfFileReader, numPages, getPage and extractText are the older PyPDF2
# names; PyPDF2 3.x replaced them with PdfReader, pages and extract_text)
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
# Knowing the number of pages allows us to parse through all
# the pages
num_pages = pdfReader.numPages
count = 0
text = ""
# The while loop will read each page
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
# PyPDF2 cannot read scanned files, so check whether it returned any
# words. If it did not, run the OCR library textract to convert the
# scanned/image-based PDF file into text. textract.process() returns
# bytes, so decode the result.
if text == "":
    text = textract.process(filename, method='tesseract', language='eng').decode('utf-8')
# Now we have a text variable which contains all the text derived
# from our PDF file. Type print(text) to see what it contains. It
# likely contains a lot of spaces, possibly junk such as '\n' etc.
# Now, we will clean our text variable, and return it as a list of keywords.
Step 3: Convert text into keywords
# The word_tokenize() function will break our text phrases into
# individual words
tokens = word_tokenize(text)
# We'll create a new list which contains punctuation we wish to clean
punctuations = ['(',')',';',':','[',']',',']
# We initialize the stopwords variable, which is a list of words like
# "The", "I", "and", etc. that don't hold much value as keywords
stop_words = stopwords.words('english')
# We create a list comprehension which only returns a list of words
# that are NOT IN stop_words and NOT IN punctuations.
keywords = [word for word in tokens if word not in stop_words and word not in punctuations]
Now you have keywords for your file stored as a list. You can do whatever you want with it. Store it in a spreadsheet if you want to make the PDF searchable, or parse a lot of files and conduct a cluster analysis. You can also use it to create a recommender system for resumes for jobs ;)

Python3:How to get title eng from url?

I use this code:
import urllib.request
fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
x = 'alt'
for item in mystr.split():
    if (x) in item:
        print(item.strip())
I get the Thai word from this code, but I don't know how to get the English word. Thanks.
If you want to get words from the table, you should use a parsing library like BeautifulSoup4. Here is an example of how you can parse it (I'm using requests to fetch the data and BeautifulSoup to parse it):
First, using the dev tools in your browser, identify the table with the content you want to parse. The table with translations has the servicesT class attribute, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that you need to get all rows that contain translations for Thai words. If you look at the page's source, you will notice that the first few <tr> rows contain only headers, so we will omit them. After that we will get all <td> elements from each row (in that table there are always 3 <td> elements) and fetch the words from them (in this table the words are actually nested in <span> and <a>).
table_rows = table.findAll('tr')
# Skip the first 3 rows because they do not
# contain the information we need
for tr in table_rows[3:]:
    # Find all <td> elements
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        # Get tag with Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get tag with English word
        english_word_tag = row_columns[1].find('span')
        # Some rows lack a translation, so print only complete pairs
        if thai_word_tag and english_word_tag:
            thai_word = thai_word_tag.text
            english_word = english_word_tag.text
            # Print the fetched words
            print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I've also noticed that the data inside the table does not always include translations, so keep that in mind when scraping. You can also use the Requests-HTML library to parse the data (it supports pagination, which is present in the table on the page you want to scrape).

How do I combine paragraphs of web pages (from a text file containing urls)?

I want to read all the web pages, extract the text from them, and then remove whitespace and punctuation. My goal is to combine all the words from all the webpages and produce a dictionary that counts the number of times a word appears across all the web pages.
Following is my code:
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as ureq
import re
def web_parsing(filename):
    with open(filename, "r") as df:
        urls = df.readlines()
        for url in urls:
            uClient = ureq(url)
            page_html = uClient.read()
            uClient.close()
            page_soup = soup(page_html, "html.parser")
            par = page_soup.findAll('p')
            for node in par:
                #print(node)
                text = ''.join(node.findAll(text = True))
                #text = text.lower()
                #text = re.sub(r"[^a-zA-Z-0-9 ]","",text)
                text = text.strip()
                print(text)
The output I got is:
[Paragraph1]
[paragraph2]
[paragraph3]
.....
What I want is:
[Paragraph1 paragraph2 paragraph 3]
Now, if I split the text here it gives me multiple lists:
[paragraph1], [paragraph2], [paragraph3]..
I want all the words of all the paragraphs of all the webpages in one list.
Any help is appreciated.
As far as I understand your question, you have a list of nodes, and you can extract a string from each of them. You then want these strings merged into a single string. This can simply be done by creating an empty string and appending each subsequent string to it, followed by a space so that the paragraphs do not run together.
result = ""
for node in par:
    text = ''.join(node.findAll(text=True)).strip()
    result += text + " "
result = result.strip()
print(result) # Paragraph1 Paragraph2 Paragraph3
print([result]) # ['Paragraph1 Paragraph2 Paragraph3']
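Since the stated goal is a dictionary of word counts across all pages, the combined text can then be lowercased, stripped of punctuation, and counted. The following is a minimal sketch under assumptions: the paragraphs list is a made-up stand-in for the stripped text of the scraped <p> nodes, and the regular expression (keep lowercase letters, digits and spaces) is just one possible cleaning rule.

```python
import re
from collections import Counter

# Hypothetical stand-in for the stripped text of each <p> node.
paragraphs = ["Hello, world!", "Hello again.", "World of words"]

# Join with a space so the last word of one paragraph does not fuse
# with the first word of the next.
combined = " ".join(paragraphs)

# Lowercase, then drop everything except letters, digits and spaces.
cleaned = re.sub(r"[^a-z0-9 ]", "", combined.lower())

# One flat list of words, and a dictionary-like count of each word.
words = cleaned.split()
word_counts = Counter(words)
print(words)
print(word_counts)
```

To cover several pages, the same paragraphs list could be extended inside the loop over URLs before the counting step runs once at the end.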

len() function not giving correct character number

I am trying to figure out the number of characters in a string but for some strange reason len() is only giving me back 1.
here is an example of my output
WearWorks is a haptics design company that develops products and
experiences that communicate information through touch. Our first product,
Wayband, is a wearable tactile navigation device for the blind and visually
impaired.
True
1
here is my code
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
url="https://www.wear.works/"
response=requests.get(url)
html=response.content
soup=BeautifulSoup(html,'html.parser')
#reference https://stackoverflow.com/questions/328356/extracting-text-from-html-file-using-python
# getting rid of the script and style elements in the html
for script in soup(["script", "style"]):
    (script.extract()) # rip it out
    # print(script)
# get text
# grabbing the first chunk of text
text = soup.get_text()[0]
print(isinstance(text, str))
print(len(text))
print(text)
The problem is text = soup.get_text()[0]. Change it to text = soup.get_text() and have a look. With [0] you're indexing the string, which gives you only its first character, so len() returns 1.
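A small sketch of the difference, using a short literal string as a stand-in for what soup.get_text() returns (the sample text is borrowed from the output in the question):

```python
# Stand-in for the string returned by soup.get_text().
page_text = "WearWorks is a haptics design company"

# Indexing with [0] yields a one-character string, so its length is 1.
first_char = page_text[0]
print(len(first_char))

# Using the whole string gives the real character count.
print(len(page_text))
```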
