Need Help Looping Through URLs with Beautiful Soup - python-3.x

I'm trying to scrape the names of all companies listed on this site. Each page (14 in total) shows the names of 80 companies. Each URL ends with start=241&count=80&first=2009&last=2018, where start is the first row of the page. I'm trying to loop over start in steps of 80, which steps through each page, and scrape the names of the companies. However, every time I try, I get this error on the second pass through the loop:
  File "beautiful_soup_2.py", line 10, in <module>
    name_table = (soup.findAll('table')[4])
  File "C:\Users\adamm\Downloads\Python\lib\site-packages\bs4\element.py", line 1807, in __getattr__
    "ResultSet object has no attribute '%s'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?" % key
AttributeError: ResultSet object has no attribute 'findAll'. You're probably treating a list of items like a single item. Did you call find_all() when you meant to call find()?
However, if I remove the loop and manually enter a URL where start=81, 161, 241, etc., the script returns the list of companies on that page.
My code so far:
from urllib.request import urlopen
from bs4 import BeautifulSoup as soup
for x in range(1,1042,80):
    sauce = ('https://www.sec.gov/cgi-bin/srch-edgar?text=form-type%20%3D%2010-12b%20OR%20form-type%3D10-12b%2Fa&start={}&count=80&first=2009&last=2018'.format(x))
    source_link = urlopen(sauce).read()
    soup = soup(source_link, 'lxml')
    name_table = (soup.findAll('table')[4])
    table_rows = name_table.findAll('tr')
    for row in table_rows:
        cols = row.findAll('td')
        cols = [x.text.strip() for x in cols]
        print(cols)
This is driving me crazy, so any help is much appreciated.
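A likely cause, judging from the traceback: the line soup = soup(source_link, 'lxml') rebinds the name soup, which was imported as an alias for the BeautifulSoup class. On the first pass the name still refers to the class, so a parsed document comes back. On the second pass soup is that document, and calling a parsed document is shorthand for find_all(), which returns a ResultSet with no findAll attribute. Below is a minimal sketch of a fix that keeps the class name and the parsed document apart. The helper names page_urls, extract_rows and scrape_all are hypothetical, and 'html.parser' is used instead of 'lxml' only to avoid the extra dependency.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup

BASE_URL = (
    "https://www.sec.gov/cgi-bin/srch-edgar"
    "?text=form-type%20%3D%2010-12b%20OR%20form-type%3D10-12b%2Fa"
    "&start={}&count=80&first=2009&last=2018"
)


def page_urls(first=1, stop=1042, step=80):
    """Yield one search URL per results page (start=1, 81, 161, ...)."""
    for start in range(first, stop, step):
        yield BASE_URL.format(start)


def extract_rows(html, table_index=4):
    """Return the stripped cell texts of every row in the chosen table."""
    # The parsed document gets its own name, so the class is never overwritten.
    page = BeautifulSoup(html, "html.parser")
    tables = page.find_all("table")
    if len(tables) <= table_index:
        return []  # the layout differs from what the question describes
    rows = []
    for row in tables[table_index].find_all("tr"):
        rows.append([cell.text.strip() for cell in row.find_all("td")])
    return rows


def scrape_all():
    """Fetch every page and print its rows (needs network access)."""
    for url in page_urls():
        for cols in extract_rows(urlopen(url).read()):
            print(cols)
```

Calling scrape_all() runs the whole loop. Whether the fifth table (index 4) is still the right one depends on the site's current layout, which is not verified here.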

Related

Trying to extract links by fetching URLs from a CSV file with requests.get, but getting "TypeError: 'NoneType' object is not subscriptable"

Code:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
links = pd.read_csv('C:\\Users\\acer\\Desktop\\hindustan_pages.csv',encoding = 'latin',dtype=str)
for i in range(1,3):
    link = links.iloc[i,0]
    r = requests.get(link)
    soup = BeautifulSoup(r.text,'lxml')
    div = soup.find('div',{"id":"company_list_grid"})
    for links in div.find_all('th',{"id":"c_name"}):
        link = links.find('a')
        print("https://www.hindustanyellowpages.in/Ahmedabad" + link['href'][2:])
I get this error:
Traceback (most recent call last):
  File "C:\Users\acer\AppData\Local\Programs\Python\Python37\hindustanyellowpages.py", line 8, in <module>
    link = links.iloc[i,0]
TypeError: 'NoneType' object is not subscriptable
Please help me sort this out.
Found the issue, but first:
1) When pandas reads the CSV file, it assumes the first row is a header. Row 1 of your CSV would therefore never be processed, because it is stored as the column name. Pass header=None to tell pandas that there is no header row. Also note that indices start at 0, so your code as written starts with the link at row 2 of the DataFrame. If you want it to start at row 1, you need range(0,2).
2) The TypeError: 'NoneType' object is not subscriptable is raised because links, which holds the DataFrame you iterate over, gets overwritten at for links in div.find_all('th',{"id":"c_name"}):. When the outer loop reaches the next row, links is no longer a DataFrame but the last element returned by div.find_all('th',{"id":"c_name"}). Beautiful Soup treats links.iloc on a tag as a search for a child <iloc> tag, finds none, and returns None, which cannot be indexed. To fix this, rename the DataFrame to links_df:
from bs4 import BeautifulSoup
import requests
import csv
import pandas as pd
# Renamed to links_df and added header=None
links_df = pd.read_csv('C:\\Users\\acer\\Desktop\\hindustan_pages.csv',encoding = 'latin',dtype=str, header=None)
#for i, row in links_df.iterrows(): <---- if you want to process the whole dataframe, you can use .iterrows instead of for i in range()
for i in range(1,3):
    link = links_df.iloc[i,0] #<--- refers to a row in your links_df
    r = requests.get(link)
    soup = BeautifulSoup(r.text,'lxml')
    div = soup.find('div',{"id":"company_list_grid"})
    for links in div.find_all('th',{"id":"c_name"}):
        link = links.find('a')
        print("https://www.hindustanyellowpages.in/Ahmedabad" + link['href'][2:])
Output:
https://www.hindustanyellowpages.in/Ahmedabad/Hetal-Mandap-Decorators/Satellite
https://www.hindustanyellowpages.in/Ahmedabad/Radhe-Krishna-Event-Management/Bapunagar
https://www.hindustanyellowpages.in/Ahmedabad/Amiraj-Decorators/Maninagar
https://www.hindustanyellowpages.in/Ahmedabad/Hiral-Handicraft/Saijpur-Bogha
https://www.hindustanyellowpages.in/Ahmedabad/S-D-Traders/Teen-Darwaja
https://www.hindustanyellowpages.in/Ahmedabad/Agro-Net-Plast/Kathwada
https://www.hindustanyellowpages.in/Ahmedabad/Shree-Krishna-Suppliers/Naroda
https://www.hindustanyellowpages.in/Ahmedabad/Bulakhidas-Vitthaldas/Panchkuva
https://www.hindustanyellowpages.in/Ahmedabad/Nagindas-AND-Sons-(Patva)/Relief-Road
https://www.hindustanyellowpages.in/Ahmedabad/New-Rahul-Electricals-OR-Dhruv-Light-Decoration-AND-Sound-System/Subhash-Bridge
https://www.hindustanyellowpages.in/Ahmedabad/Bhairavi-Craft/Thaltej
https://www.hindustanyellowpages.in/Ahmedabad/Saath-Sangath-Party-Plot/Rakanpur
https://www.hindustanyellowpages.in/Ahmedabad/Poonam-Light-Decoration-And-Electricals/Naroda
https://www.hindustanyellowpages.in/Ahmedabad/Muku-Enterprise/Isanpur
https://www.hindustanyellowpages.in/Ahmedabad/Malaviya-Decoration-Service/Nikol
https://www.hindustanyellowpages.in/Ahmedabad/Hariprabha-Enterprises/Paldi
https://www.hindustanyellowpages.in/Ahmedabad/Festo-Craft/Thaltej
https://www.hindustanyellowpages.in/Ahmedabad/Jay-Jognimata-Decoration/Kubernagar
https://www.hindustanyellowpages.in/Ahmedabad/Maruti-Decorators/Ranip
https://www.hindustanyellowpages.in/Ahmedabad/Krishna-Light-and-Decoration/Gandhi-Road
https://www.hindustanyellowpages.in/Ahmedabad/Shree-Ambika-Light-Decoration/Ghatlodiya
https://www.hindustanyellowpages.in/Ahmedabad/R-S-Power-House/Gandhi-Road
https://www.hindustanyellowpages.in/Ahmedabad/New-Amit-Stores/Paldi
https://www.hindustanyellowpages.in/Ahmedabad/Gayatri-Cetera-s-and-Decoration/Ranip
https://www.hindustanyellowpages.in/Ahmedabad/Poonam-lights-and-decoration/Navrangpura
https://www.hindustanyellowpages.in/Ahmedabad/Shree-Ambica-Engg-Works/Naroda
https://www.hindustanyellowpages.in/Ahmedabad/Vedant-Industries/Vatva-Gidc
https://www.hindustanyellowpages.in/Ahmedabad/Honest-Traders/Narol
https://www.hindustanyellowpages.in/Ahmedabad/Sai-Samarth-Industries/Nana-Chiloda
https://www.hindustanyellowpages.in/Ahmedabad/R-N-Industries/Odhav
https://www.hindustanyellowpages.in/Ahmedabad/Jay-Ambe-Enterprise/S-G-Road
https://www.hindustanyellowpages.in/Ahmedabad/Satyam-Enterprise/Narol
https://www.hindustanyellowpages.in/Ahmedabad/Maniar-And-Co/Rakhial
https://www.hindustanyellowpages.in/Ahmedabad/Shiv-Shakti-Plastic-Industries/Amraivadi
https://www.hindustanyellowpages.in/Ahmedabad/Dinesh-Engineering-Works/Kathwada
https://www.hindustanyellowpages.in/Ahmedabad/MARUTI-ENGINEERS-OR-MARUTI-SAND-BLASTING/Odhav
https://www.hindustanyellowpages.in/Ahmedabad/Cranetech-Equipments/Vatva
https://www.hindustanyellowpages.in/Ahmedabad/Suyog-Engineering-AND-Fabricators/Chhatral-GIDC
https://www.hindustanyellowpages.in/Ahmedabad/Sanghvii-Conveyors/Thaltej
https://www.hindustanyellowpages.in/Ahmedabad/Siddh-Kripa-Steel-Fab/Kathwada
https://www.hindustanyellowpages.in/Ahmedabad/Ashirvad-Industries/Vatva-Gidc
https://www.hindustanyellowpages.in/Ahmedabad/G-I-Lokhandwala/Chandola
https://www.hindustanyellowpages.in/Ahmedabad/M-P-Electrical-Engineering--OR--M-P-Crane/Naroda-Gidc
https://www.hindustanyellowpages.in/Ahmedabad/Krishna-Enterprise/Geeta-Mandir
https://www.hindustanyellowpages.in/Ahmedabad/Nmtg-Mechtrans-Techniques-Pvt-Ltd/Naroda-Gidc
https://www.hindustanyellowpages.in/Ahmedabad/Shree-Balaad-Handling-Works/Odhav
https://www.hindustanyellowpages.in/Ahmedabad/DLM-Enterprise/Vastral
https://www.hindustanyellowpages.in/Ahmedabad/Alfa-Engineers--OR--Everest-Sanitary/Amaraivadi
https://www.hindustanyellowpages.in/Ahmedabad/Parv-Engineering-Equipments/Vatva
https://www.hindustanyellowpages.in/Ahmedabad/Vikrant-Equipments/Kathwada
Additional:
To process only the first 10 rows, use:
for i, row in links_df.head(10).iterrows():
    link = links_df.iloc[i,0]
or
for i, row in links_df.iloc[:10].iterrows():
    link = links_df.iloc[i,0]
Try this (it assumes the column is named link):
link = df.link.iloc[i]
If I am not mistaken, your DataFrame is named links, so:
link = links.link.iloc[i]
Let me know if this helps.
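The 'NoneType' error in this thread can be reproduced without the CSV file or the website. The sketch below uses a made-up HTML fragment and a plain list as a stand-in for the DataFrame (both are assumptions for illustration) to show what happens once the inner loop has rebound the name links to a tag.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment shaped like the markup the question's code expects.
html = (
    '<table><tr><th id="c_name">'
    '<a href="../Some-Shop/Area">Some Shop</a>'
    "</th></tr></table>"
)
soup = BeautifulSoup(html, "html.parser")

links = ["stand-in for the DataFrame"]  # placeholder, not a real DataFrame
for links in soup.find_all("th", {"id": "c_name"}):
    pass  # after this loop, links is the last <th> tag, not the list

# Attribute access on a tag searches for a child tag with that name,
# so links.iloc behaves like links.find("iloc") and yields None.
lookup = links.iloc
print(lookup)

try:
    lookup[1, 0]
except TypeError as exc:
    message = str(exc)
    print(message)
```

Giving the loop variable a different name from the DataFrame avoids the problem entirely.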

Python 3: How to get the English title from a URL?

I use this code:
import urllib.request
fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)
x = 'alt'
for item in mystr.split():
    if (x) in item:
        print(item.strip())
I get the Thai word from this code, but I don't know how to get the English word. Thanks.
If you want to get words from a table, you should use a parsing library such as BeautifulSoup4. Here is an example of how you can parse this page (I'm using requests to fetch the data and BeautifulSoup to parse it):
First, use the dev tools in your browser to identify the table with the content you want to parse. The table with translations has the class attribute servicesT, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that you need to get all rows that contain translations for Thai words. If you look at the page's source, you will notice that the first few <tr> rows contain only headers, so we skip them. Then we get all <td> elements from each row (in that table there are always 3 <td> elements) and fetch the words from them (in this table the words are actually nested in <span> and <a> tags).
table_rows = table.findAll('tr')
# Skip the first 3 rows because they do not
# contain the information we need
for tr in table_rows[3:]:
    # Find all <td> elements
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        # Get tag with Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get tag with English word
        english_word_tag = row_columns[1].find('span')
        # Use None when a tag is missing so values from a previous row are not reused
        thai_word = thai_word_tag.text if thai_word_tag else None
        english_word = english_word_tag.text if english_word_tag else None
        # Print the fetched words
        print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I also noticed that the table does not always contain a translation, so keep that in mind when scraping. You can also use the Requests-HTML library to parse the data (it supports pagination, which the table on this page uses).
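Since some rows have no translation, here is a small offline sketch of one way to handle that case. The HTML is a made-up fragment modelled on the description above, not copied from the site, and the function name word_pairs and its header_rows parameter are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one header row, one complete row, one row without English.
SAMPLE_HTML = """
<table class="servicesT">
  <tr><th>Thai</th><th>English</th></tr>
  <tr><td><span><a href="#">แมว</a></span></td><td><span>cat</span></td><td></td></tr>
  <tr><td><span><a href="#">หมา</a></span></td><td></td><td></td></tr>
</table>
"""


def word_pairs(html, header_rows=1):
    """Return (thai, english) tuples; english is None when the cell is empty."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", {"class": "servicesT"})
    if table is None:
        return []
    pairs = []
    for tr in table.find_all("tr")[header_rows:]:
        cells = tr.find_all("td")
        if len(cells) < 2:
            continue
        thai_tag = cells[0].select_one("span > a")
        english_tag = cells[1].find("span")
        if thai_tag is None:
            continue  # nothing to pair a translation with
        english = english_tag.text.strip() if english_tag else None
        pairs.append((thai_tag.text.strip(), english))
    return pairs


print(word_pairs(SAMPLE_HTML))
```

On the real page the number of header rows to skip is 3 according to the answer, so header_rows would need to be set accordingly.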

How to print the table from a website using python script?

Here is my python script so far.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
my_url = 'my_company_website'
#opening up connection, grabbing the page
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
#html parsing
page_soup = soup(page_html, "html.parser")
#grabs each product
containers = page_soup.findAll("div",{"class":"navigator-content"})
print (containers)
After this, the markup shown in Inspect Element looks like this:
<div class="issue-table-container">
<div>
<table id="issuetable" class>
<thead>...</thead>
<tbody>...</tbody> (this contains all the information I want to print)
</table>
How do I print the table and export it to CSV?
For each of the containers you should grab the table [1], then find the body of the table and iterate over its rows [2], and build a line for your CSV file from the table cells (td) [3]:
with open("issues.csv", "w") as csvfile:  # "issues.csv" is an example file name
    for container in containers:
        table = container.find(id="issuetable")  # [1]
        # If you are sure of the structure, or the tables have unique ids and there is only one table per container, you can also do:
        # table = container.table  # [1]
        for tr in table.tbody.find_all("tr"):  # [2]
            line = ""
            for td in tr.find_all("td"):  # [3]
                line += td.text + ","  # add the cell text to the line, followed by the separator of your choice (here a comma)
            csvfile.write(line[:-1] + "\n")  # write the line; in text mode "\n" is translated to the platform's line ending
There are different ways of navigating the soup tree, depending on your needs and on how flexible your script needs to be.
Have a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/ and check out the find / find_all sections.
Good luck!
/Teo
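Joining cells with commas by hand breaks as soon as a cell itself contains a comma. As an alternative sketch (not the approach in the answer above), the standard csv module can do the quoting. The HTML below is a made-up fragment shaped like the snippet in the question, and table_to_csv is a hypothetical helper name.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical markup shaped like the snippet in the question.
SAMPLE_HTML = """
<div class="navigator-content">
  <div class="issue-table-container">
    <table id="issuetable">
      <thead><tr><th>Key</th><th>Summary</th></tr></thead>
      <tbody>
        <tr><td>ABC-1</td><td>Login fails, sometimes</td></tr>
        <tr><td>ABC-2</td><td>Add export</td></tr>
      </tbody>
    </table>
  </div>
</div>
"""


def table_to_csv(html, out_file):
    """Write each body row of the table with id 'issuetable' as one CSV line."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", id="issuetable")
    writer = csv.writer(out_file)
    for tr in table.tbody.find_all("tr"):
        # csv.writer quotes any cell that contains the delimiter.
        writer.writerow([td.get_text(strip=True) for td in tr.find_all("td")])


buffer = io.StringIO()  # in-memory stand-in for a real file
table_to_csv(SAMPLE_HTML, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open(path, "w", newline=""), as the csv module documentation recommends.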

(Python)- How to store text extracted from HTML table using BeautifulSoup in a structured python list

I parse a webpage using beautifulsoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it to an xls file and operate on the elements (for example, if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text
text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
I expected my web data to have gone into the list, so I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get
n
That means the list holds individual characters and not whole words.
Please let me know what I am doing wrong.
That is because Ear_yield_text is a string rather than a list, and iterating over a string yields one character at a time. Assuming the text contains newlines, you can do this directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield, you will get this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
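The split gives a list of strings, so a comparison such as list_Ear_yield[1] > list_Ear_yield[2] would compare text, not numbers. Here is a minimal sketch of converting the values to floats, assuming the first non-blank line is the label and every other line is a number. The function name parse_row and the sample string are made up for illustration.

```python
def parse_row(text):
    """Split a text row into its label and a list of float values."""
    # Drop blank lines and surrounding whitespace before converting.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    label, raw_values = lines[0], lines[1:]
    return label, [float(value) for value in raw_values]


# Sample text shaped like the output shown in the question.
sample = "Earnings Yield\n0.01\n-0.59\n-0.33\n-1.23\n-0.11\n"
label, values = parse_row(sample)
print(label, values)
print(values[0] > values[1])
```

If a cell can hold something that is not a number, float() raises ValueError, so real data may need a try/except around the conversion.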

Finding a regex patterned text inside a python variable

# Ex1
# Number of datasets currently listed on data.gov
# http://catalog.data.gov/dataset
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
results = re.search([0-9][0-9][0-9],[0-9][0-9][0-9], value
print(value)
The code is above. I want to find text matching the regex [0-9][0-9][0-9],[0-9][0-9][0-9]
inside the text held in the variable 'value'.
How can I do this?
Based on ShellayLee's suggestion I changed it to:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get(
"http://catalog.data.gov/dataset")
soup = BeautifulSoup(page.content, 'html.parser')
value = soup.find_all(class_='new-results')
my_match = re.search(r'\d\d\d,\d\d\d', value)
print(my_match)
I am still getting an error:
Traceback (most recent call last):
  File "ex1.py", line 19, in <module>
    my_match = re.search(r'\d\d\d,\d\d\d', value)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/re.py", line 182, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or bytes-like object
You need some basics of regular expressions in Python. A regex in Python is written as a string, and the re module provides functions such as match, search, and findall, which take that string as an argument and treat it as a pattern.
In your case, the pattern [0-9][0-9][0-9],[0-9][0-9][0-9] can be written as:
my_pattern = r'\d\d\d,\d\d\d'
and then used like this:
my_match = re.search(my_pattern, value_text)
where \d means a digit (for ASCII input, the same as [0-9]) and value_text must be a string. The r prefix means that backslashes in the string are not treated as escape characters.
The search function returns a match object, or None if nothing matches.
I suggest you walk through a tutorial first to clear up any further confusion. The official HOWTO is well written:
https://docs.python.org/3.6/howto/regex.html
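The TypeError in the question's traceback appears to come from passing value, which is the ResultSet returned by find_all, straight to re.search, which only accepts a string or bytes. Here is a minimal offline sketch of turning the result into text first. The HTML fragment and the number in it are made up, and the real page's structure may differ.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical markup; not taken from the live page.
SAMPLE_HTML = '<div class="new-results">194,708 datasets found</div>'

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
value = soup.find_all(class_="new-results")  # a ResultSet (list of tags), not a string

# Join the text of every matched tag into one string before searching.
value_text = " ".join(tag.get_text() for tag in value)

my_match = re.search(r"\d{3},\d{3}", value_text)
if my_match:
    count_text = my_match.group()  # the matched text, here "194,708"
    count = int(count_text.replace(",", ""))
    print(count_text, count)
```

The pattern \d{3},\d{3} is equivalent to \d\d\d,\d\d\d and only matches counts with exactly six digits, so a pattern such as [\d,]+ may be a better fit if the count can grow or shrink.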
