Finding regex-patterned text inside a Python variable - python-3.x

# Ex1
# Number of datasets currently listed on data.gov
# http://catalog.data.gov/dataset
import requests
import re
from bs4 import BeautifulSoup
page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
results = re.search([0-9][0-9][0-9],[0-9][0-9][0-9], value
print(value)
The code is above. I want to find text matching the regex [0-9][0-9][0-9],[0-9][0-9][0-9] inside the text stored in the variable 'value'.
How can I do this?
Based on ShellayLee's suggestion I changed it to:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
my_match = re.search(r'\d\d\d,\d\d\d', value)
print(my_match)
I am still getting an error:
Traceback (most recent call last):
File "ex1.py", line 19, in
my_match = re.search(r'\d\d\d,\d\d\d', value)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object

You need some basics of regex in Python. A regex in Python is represented as a string, and the re module provides functions like match, search, and findall which take such a string and treat it as a pattern.
In your case, the pattern [0-9][0-9][0-9],[0-9][0-9][0-9] can be represented as:
my_pattern = r'\d\d\d,\d\d\d'
then used like
my_match = re.search(my_pattern, value_text)
where \d matches a digit (the same as [0-9]). The leading r makes the pattern a raw string, so the backslashes in it are not treated as escape characters.
The search function returns a match object if the pattern is found, and None otherwise.
I suggest you walk through a tutorial first to clear up any remaining confusion. The official HOWTO is well written:
https://docs.python.org/3.6/howto/regex.html
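Putting it together, here is a minimal sketch of the fixed script. find_all returns a ResultSet (a list of tags), not a string, which is why re.search raised the TypeError; the tag's text has to be extracted first (the exact markup of the page is an assumption here):
import re
import requests
from bs4 import BeautifulSoup

page = requests.get("http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')

# find returns a single tag (or None); get_text() gives its text content
tag = soup.find(class_='new-results')
value_text = tag.get_text() if tag else ''

my_match = re.search(r'\d\d\d,\d\d\d', value_text)
if my_match:
    print(my_match.group())  # e.g. '123,456'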

Related

Python3 requests - cut response

How can I cut the output from a response using Python requests?
The output looks like:
...\'\n});\nRANDOMDATA\nExt.define...
or
...\'\n});\nOTHERRANDOMDATA\nExt.define...
I only want to print out the RANDOMDATA part.
req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print (response.content)
You can use the re module's findall() function to search for all occurrences of a regular expression in a string.
The following code just searches for the literal string RANDOMDATA in the response returned by requests.get():
import re
import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
print(response.content)

ar = re.findall('RANDOMDATA', str(response.content))
if ar:
    print(ar[0])
The Python documentation for the re module is helpful for learning about regular expressions.
Additionally, if you have a variable data containing the string to be searched and a variable t containing the string to search for, you can use:
import re
arr = re.findall(t, data)
to return all the occurrences of t in data (note that t is treated as a regex pattern, not a plain string), and:
arr = data.find(t)
to get the index of the first occurrence of t in data (or -1 if it does not occur).
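If the response really is delimited by }); and Ext.define as in the sample output above, a capture group can pull out whatever sits between them. A sketch under that assumption (the endpoint URL is a placeholder):
import re
import requests

req = "https://example.org/endpoint"
response = requests.get(req, verify=False)
text = response.content.decode('utf-8', errors='replace')

# capture whatever sits between the '});' and 'Ext.define' markers
m = re.search(r"\}\);\n(.*?)\nExt\.define", text, re.S)
if m:
    print(m.group(1))  # the RANDOMDATA part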

Stemming and Lemmatization on Array

I don't quite understand why I cannot lemmatize or stem. I tried converting the array to a string, but I had no luck.
This is my code.
import bs4, re, string, nltk, numpy as np, pandas as pd
from nltk.stem import PorterStemmer
from bs4 import BeautifulSoup as soup
from urllib.request import urlopen

news_url = "https://news.google.com/news/rss"
client = urlopen(news_url)
xml_page = client.read()
client.close()
soup_page = soup(xml_page, "xml")
news_list = soup_page.findAll("item")
limit = 19
corpus = []
# Print news title, url and publish date
for index, news in enumerate(news_list):
    #print(news.title.text)
    #print(index+1)
    corpus.append(news.title.text)
    if index == limit:
        break
#print(corpus)
df = pd.DataFrame(corpus, columns=['News'])
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')
def normalize_document(doc):
    # lowercase and remove special characters/whitespace
    # (flags must go in the flags argument; the fourth positional
    # argument of re.sub is count, not flags)
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, flags=re.I | re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc
normalize_corpus = np.vectorize(normalize_document)
norm_corpus = normalize_corpus(corpus)
norm_corpus
The error appears with the next lines I add:
stemmer = PorterStemmer()
sentences = nltk.sent_tokenize(norm_corpus)
# Stemming
for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)
Once I insert these lines, I get the following error:
TypeError: cannot use a string pattern on a bytes-like object
I think if I solve the error with stemming it will be the same solution to my error with lemmatization.
The type of norm_corpus is numpy.ndarray, not a string, and sent_tokenize expects a string, hence the error. You need to convert norm_corpus to a list of strings to get rid of it.
What I don't understand is why you would vectorize the document before stemming. Is there a problem with doing it the other way around, i.e. stemming first and then vectorizing? The error should be resolved then.
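A minimal sketch of that conversion, reusing the norm_corpus produced above:
import nltk
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# np.vectorize returns a numpy array; turn each element into a
# plain Python string before handing it to the NLTK tokenizers
norm_corpus = [str(doc) for doc in norm_corpus]

for i in range(len(norm_corpus)):
    words = nltk.word_tokenize(norm_corpus[i])
    words = [stemmer.stem(word) for word in words]
    norm_corpus[i] = ' '.join(words)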

Need Help Looping Through URLs with Beautiful Soup

I'm trying to scrape the names of all companies listed on this site. Each page (14 in total) shows the names of 80 companies. Each URL has start=241&count=80&first=2009&last=2018 at the end, where start is the first row of the page. I'm trying to step through every 80 companies, which loops through each page, and scrape the company names. However, every time I try, I get this error on the second pass through the loop:
File "beautiful_soup_2.py", line 10, in <module>
name_table = (soup.findAll('table')[4])
File "C:\Users\adamm\Downloads\Python\lib\site-packages\bs4\element.py", line 1807, in __getattr__
"ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
However, if I remove the loop and manually enter a URL where start=81, 161, 241, etc., the request returns the list of companies on the page.
My code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
for x in range(1, 1042, 80):
    sauce = ('https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%20%3D%2010-12b%20OR%20form-type%3D10-12b%2Fa&start={}&count=80&first=2009&last=2018'.format(x))
    source_link = urlopen(sauce).read()
    soup = soup(source_link, 'lxml')
    name_table = (soup.findAll('table')[4])
    table_rows = name_table.findAll('tr')
    for row in table_rows:
        cols = row.findAll('td')
        cols = [x.text.strip() for x in cols]
        print(cols)
This is driving me crazy, so any help is much appreciated.
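Judging from the traceback, a likely culprit is the line soup = soup(source_link, 'lxml'): it rebinds the name soup, so on the second pass the BeautifulSoup class is gone, the call lands on the first page's soup object (calling a soup object acts like find_all and returns a ResultSet), and the next line then fails. A sketch that keeps the class under its own name:
from urllib.request import urlopen
from bs4 import BeautifulSoup

for x in range(1, 1042, 80):
    sauce = ('https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%20%3D%2010-12b'
             '%20OR%20form-type%3D10-12b%2Fa&start={}&count=80&first=2009&last=2018'.format(x))
    source_link = urlopen(sauce).read()
    page_soup = BeautifulSoup(source_link, 'lxml')  # fresh soup, no name shadowing
    name_table = page_soup.find_all('table')[4]
    for row in name_table.find_all('tr'):
        cols = [td.text.strip() for td in row.find_all('td')]
        print(cols)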

(Python)- How to store text extracted from HTML table using BeautifulSoup in a structured python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to an xls file and operate on the elements (for example, if Earnings_Yield[0] > Earnings_Yield[1]).
So I write:
import html2text

text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get
n
That means the list contains individual characters rather than whole words.
Please let me know where I am going wrong.
That is because your Ear_yield_text is a string rather than a list, so iterating over it yields one character at a time. Assuming the text contains newlines, you can split it directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will be given this result
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
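If you then want to compare the numeric entries, as in the Earnings_Yield[0] > Earnings_Yield[1] example, one sketch is to drop any empty lines, skip the header, and convert the rest to floats:
# keep only non-empty lines; html2text output often contains blanks
values = [v.strip() for v in list_Ear_yield if v.strip()]

# values[0] is the 'Earnings Yield' header; the rest are numbers
numbers = [float(v) for v in values[1:]]
if numbers[0] > numbers[1]:
    print('first value is larger')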

How to solve ValueError is not in List? It's in the list

How do I solve the ValueError: ... is not in list problem? I don't understand what is wrong with my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://uk.reuters.com/business/quotes/financialHighlights?symbol=AAPL.O")
bsObj = BeautifulSoup(html,"html.parser")
tag = bsObj.findAll("td")
tagList = []
for tagItem in tag:
    tagList.append(tagItem)
print(tagList.index("<td>Dec</td>"))
Error:
Traceback (most recent call last):
File "/Users/home/Desktop/development/x/code.py", line 11, in <module>
print(tagList.index("<td>Dec</td>"))
ValueError: '<td>Dec</td>' is not in list
Process finished with exit code 1
You're creating a list of <class 'bs4.element.Tag'> objects. Their string representation seems to match the string you're looking for, but the objects are not equal to it since they have different types.
(Note that printing your list yields [<td>Dec</td>, <td>Dec</td>], with no quotes, while printing the same list built from strings yields ['<td>Dec</td>', '<td>Dec</td>'].)
Quick fix: build your list from strings:
for tagItem in tag:
    tagList.append(str(tagItem))
or as a list comprehension:
tagList = [str(tagItem) for tagItem in tag]
Now index works and returns 0.
Note that you could keep your list unconverted (keeping the Tag objects rather than coercing them to strings) and use the following to find the index of the first element whose string form matches:
print(next(i for i,x in enumerate(tagList) if str(x)=="<td>Dec</td>"))
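One caveat with that approach: next() raises StopIteration when nothing matches, so pass a default if the tag might be absent:
idx = next((i for i, x in enumerate(tagList) if str(x) == "<td>Dec</td>"), -1)
print(idx)  # -1 means no matching tag was found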
