Why is .get('href') returning "None" on a bs4.element.Tag? - python-3.x

I'm pulling together a dataset to do analysis on. The goal is to parse a table on an SEC webpage and pull out the link in a row that has the text "SC 13D" in it. This needs to be repeatable so I can automate it across a large list of links I have in a database. I know this code is not the most Pythonic, but I hacked it together to get what I need out of the table, except for the link in the table row. How can I extract the href value from the table row?
I tried doing a .findAll on 'tr' instead of 'td' in the table, but couldn't figure out how to search on "SC 13D" and pop the element from the list of table rows the way I could after the .findAll('td'). I also tried to just get the anchor tag with the link in it, using .get('a') instead of .get('href') (included in the code below), but it also returns "None".
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

url = 'https://www.sec.gov/Archives/edgar/data/1050122/000101143807000336/0001011438-07-000336-index.htm'
html = urllib.request.urlopen(url, context=ctx).read()
soup = BeautifulSoup(html, 'html.parser')

table = soup.find('table', {'summary': 'Document Format Files'})
rows = table.findAll("td")

i = 0
pos = 0
for row in rows:
    if "SC 13D" in row:
        pos = i
        break
    else:
        i = i + 1

linkpos = pos - 1
linkelement = rows[linkpos]
print(linkelement.get('a'))
print(linkelement.get('href'))
The expected result is that the link in linkelement gets printed. The actual result is "None".

It is because your a tag is nested inside your td tag: .get() looks up a tag's attributes, while .find() searches its descendants. You just have to do:
linkelement = rows[linkpos]
a_element = linkelement.find('a')
print(a_element.get('href'))

Switch your .get to .find. You want to find the <a> tag, then print its href attribute:
print(linkelement.find('a')['href'])
Or access the tag directly and call .get on it:
print(linkelement.a.get('href'))
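For reference, here is a minimal sketch of the row-based search the question mentions trying. It assumes the same table variable from the question's code, and that on this EDGAR page the matching row's link sits in an <a> inside that row (assumptions about the page layout, not tested against every filing):

for row in table.findAll('tr'):
    # Match the row by its visible text instead of tracking cell positions
    if 'SC 13D' in row.get_text():
        link = row.find('a')
        if link is not None:
            print(link.get('href'))
        break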

Related

How to convert a <br> tag to a comma/new column when scraping a website with Python?

I'm trying to scrape the website below. I can get all of the data I need off of it by using the code below. However, the 'br' tags are creating issues for me. I'd prefer for them to be treated as an indicator for a new column in my data frame.
Here is the website: directory.ccnecommunity.org/...
I tried BeautifulSoup and got invalid tags. It didn't work too well.
My thought was to remove every tag except 'br' and then go back and replace them with commas, but too much other markup got pulled in rather than just the plain text.
Code:
import pandas as pd

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1'
table = pd.read_html(url)
table = pd.concat(table[1:-1])
table.columns = table.iloc[0]
table = table.iloc[1:-1]
print(table)
I want each indentation in the tables/school section to be a new column in my data frame. I can deal with naming and cleaning them later. I'm using Selenium to get the URLs because the search page is JavaScript. Would using Selenium for this be better? I can always export to CSV and read it back in with pandas. Any help or tips would be appreciated.
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup
import requests
import re

url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1'
page_html = requests.get(url).text
page_soup = BeautifulSoup(page_html, "html.parser")
tables = page_soup.find_all("table", id="finder")

reformattable = []
reg = re.compile(r"(<[\/]?br[\/]?>)+")
for table in tables:
    reformattable.append(re.sub(reg, "<td>", str(table)))

dflist = []
for table in reformattable:
    dflist.append(pd.read_html(str(table)))

info = [dflist[i][0] for i in np.arange(len(dflist))]
stats = [dflist[i][1] for i in np.arange(len(dflist))]

adjInfo = []
for df in info:
    adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop=True))

adjStats = []
for df in stats:
    df.drop(columns=1, inplace=True)
    df.dropna(inplace=True)
    df[3] = df[0] + ' ' + df[2]
    adjStats.append(df[3])

combo = []
for p1, p2 in zip(adjInfo, adjStats):
    combo.append(pd.concat([p1, p2]))

finaldf = pd.concat([combo[i] for i in np.arange(len(combo))], axis=1)
finaldf
So this gives you exactly what you want. Let's go over it.
After inspecting the website we can see that each section is a table with the id finder, so we look for those using BeautifulSoup. Next we had to reformat the <br> tags to make the tables easier to load into a DataFrame, so I replaced every run of <br> tags with a single <td> tag (see the small sketch below).
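To illustrate that substitution on its own, here is a standalone sketch run on a made-up snippet (the snippet is invented for the example, not taken from the page):

import re

reg = re.compile(r"(<[\/]?br[\/]?>)+")
snippet = "<td>Program A<br>Program B<br/><br>Program C</td>"
# Each run of one or more <br>/<br/> tags collapses to a single <td>
print(re.sub(reg, "<td>", snippet))
# prints: <td>Program A<td>Program B<td>Program C</td>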
Another issue with the website is that each section is broken up into two tables, so we get two DataFrames per section. To make cleaning easier, I split them into the info and stats DataFrame lists.
adjInfo and adjStats simply clean the DataFrames and put them in lists. Next we recombine the information into single columns for each section and put it in combo.
Finally we take all the columns in combo and concat them to get our finaldf.
EDIT
To loop:
finaldf = pd.DataFrame()
for changeinurl in url:
    # fix it to however you manipulated the url for your loop
    url = 'http://directory.ccnecommunity.org/reports/rptAccreditedPrograms_New.asp?state=AL&sFullName=Alabama&sProgramType=1'
    page_html = requests.get(url).text
    page_soup = BeautifulSoup(page_html, "html.parser")
    tables = page_soup.find_all("table", id="finder")
    reformattable = []
    reg = re.compile(r"(<[\/]?br[\/]?>)+")
    for table in tables:
        reformattable.append(re.sub(reg, "<td>", str(table)))
    dflist = []
    for table in reformattable:
        dflist.append(pd.read_html(str(table)))
    info = [dflist[i][0] for i in np.arange(len(dflist))]
    stats = [dflist[i][1] for i in np.arange(len(dflist))]
    adjInfo = []
    for df in info:
        adjInfo.append(pd.concat([df[i] for i in np.arange(len(df.columns))]).dropna().reset_index(drop=True))
    adjStats = []
    for df in stats:
        df.drop(columns=1, inplace=True)
        df.dropna(inplace=True)
        df[3] = df[0] + ' ' + df[2]
        adjStats.append(df[3])
    combo = []
    for p1, p2 in zip(adjInfo, adjStats):
        combo.append(pd.concat([p1, p2]))
    df = pd.concat([combo[i] for i in np.arange(len(combo))], axis=1).reset_index(drop=True).T
    # DataFrame.append returns a new frame rather than mutating in place,
    # so reassign instead of calling finaldf.append(df) and discarding the result
    finaldf = pd.concat([finaldf, df])

How to fix "cannot set a row with mismatched columns" error in pandas

I'm creating a web scraper for a project of mine, scraping job listings from Indeed. I'm able to get all the data that I need. Now I'm having a problem creating a DataFrame so I can save the data to a CSV file.
I have searched for the error and tried many possible solutions, but I keep getting the same one. I'd appreciate any suggestions on the code or the error. Thank you.
ValueError: cannot set a row with mismatched columns
import requests
import bs4
from bs4 import BeautifulSoup
import pandas as pd
import time

max_results_per_city = 30
city_set = ['New+York','Chicago']
columns = ["city", "job_title", "company_name", "location", "summary"]
database = pd.DataFrame(columns=columns)

for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            c = div.findAll('div', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.loc[num] = job_post

database.to_csv("test.csv")
Reproducing your code, it was not extracting the location, and the database assignment sits at the wrong indentation level. So fix the location lookup to c = div.findAll(name='span', attrs={'class': 'location'}). Here's a version that makes it work:
database = []
for city in city_set:
    for start in range(0, max_results_per_city, 10):
        page = requests.get('https://www.indeed.com/jobs?q=computer+science&l=' + str(city) + '&start=' + str(start))
        time.sleep(1)
        soup = BeautifulSoup(page.text, "lxml")
        for div in soup.find_all(name="div", attrs={"class":"row"}):
            #num = (len(sample_df) + 1)
            job_post = []
            job_post.append(city)
            for a in div.find_all(name="a", attrs={"data-tn-element":"jobTitle"}):
                job_post.append(a["title"])
            company = div.find_all(name="span", attrs={"class":"company"})
            if len(company) > 0:
                for b in company:
                    job_post.append(b.text.strip())
            else:
                sec_try = div.find_all(name="span", attrs={"class":"result-link-source"})
                for span in sec_try:
                    job_post.append(span.text)
            c = div.findAll(name='span', attrs={'class': 'location'})
            for span in c:
                job_post.append(span.text)
            d = div.findAll('div', attrs={'class': 'summary'})
            for span in d:
                job_post.append(span.text.strip())
            database.append(job_post)

df00 = pd.DataFrame(database)
df00.shape
df00.columns = columns
df00.to_csv("test.csv", index=False)
This error is caused by the number of columns not matching the amount of data in at least one row.
I see a number of issues: where is sample_df initialized, and where are you adding data to database? Those are the big ones that pop out.
I'd restructure your code: job_post looks like your row-level list, so append it to a table-level list; at the end of each loop call table.append(job_post) instead of sample_df.loc[num] = job_post,
then after your loop you can call pd.DataFrame(table, columns=columns).
A note: make sure you're adding None, NaN or "" when your scraper can't find data, otherwise your row length won't match your column length, which is what is causing the error. A sketch of that padding step is below.
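For instance (a sketch using the names from the advice above; padding with None is just one reasonable choice of placeholder):

table = []
# ... inside the scraping loop, after filling job_post ...
job_post += [None] * (len(columns) - len(job_post))  # pad short rows to full width
table.append(job_post)
# ... after the loop ...
df = pd.DataFrame(table, columns=columns)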

Python 3: How to get the English words from a URL?

I use this code:
import urllib.request

fp = urllib.request.urlopen("https://english-thai-dictionary.com/dictionary/?sa=all")
mybytes = fp.read()
mystr = mybytes.decode("utf8")
fp.close()
print(mystr)

x = 'alt'
for item in mystr.split():
    if x in item:
        print(item.strip())
I get the Thai words from this code but I don't know how to get the English words. Thanks.
If you want to get words from a table you should use a parsing library like BeautifulSoup4. Here is an example of how you can parse this (I'm using requests to fetch the page and BeautifulSoup to parse the data):
First, using the dev tools in your browser, identify the table with the content you want to parse. The table with the translations has a servicesT class attribute, which occurs only once in the whole document:
import requests
from bs4 import BeautifulSoup
url = 'https://english-thai-dictionary.com/dictionary/?sa=all;ftlang=then'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Get table with translations
table = soup.find('table', {'class':'servicesT'})
After that we need to get all the rows that contain translations for Thai words. If you look at the page's source you will notice that the first few <tr> rows contain only headers, so we will omit them. After that we will get all <td> elements from each row (in that table there are always 3 <td> elements) and fetch the words from them (in this table the words are actually nested in <span> and <a> tags).
table_rows = table.findAll('tr')
# We will skip the first 3 rows because those do not
# contain the information we need
for tr in table_rows[3:]:
    # Find all <td> elements
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        # Get the tag with the Thai word
        thai_word_tag = row_columns[0].select_one('span > a')
        # Get the tag with the English word
        english_word_tag = row_columns[1].find('span')
        if thai_word_tag:
            thai_word = thai_word_tag.text
        if english_word_tag:
            english_word = english_word_tag.text
        # Print our fetched words
        print((thai_word, english_word))
Of course, this is a very basic example of what I managed to parse from the page, and you should decide for yourself what you want to scrape. I've also noticed that the data inside the table does not always include a translation, so you should keep that in mind when scraping. You can also use the Requests-HTML library to parse the data (it supports pagination, which is present in the table on the page you want to scrape).
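If you'd rather collect the words in a structure than print them, here is a small variation on the loop above (a sketch; defaulting missing translations to None is an assumption about how you want the gaps represented):

pairs = []
for tr in table_rows[3:]:
    row_columns = tr.findAll('td')
    if len(row_columns) >= 2:
        thai_word_tag = row_columns[0].select_one('span > a')
        english_word_tag = row_columns[1].find('span')
        # Fall back to None so rows without a translation stay aligned
        thai_word = thai_word_tag.text if thai_word_tag else None
        english_word = english_word_tag.text if english_word_tag else None
        pairs.append((thai_word, english_word))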

Use bs4 to scrape specific html table among several tables in same page

So I want to scrape the last table titled "Salaries" on this website http://www.baseball-reference.com/players/a/alberma01.shtml
import urllib.request
from bs4 import BeautifulSoup

url = 'http://www.baseball-reference.com/players/a/alberma01.shtml'
r = urllib.request.urlopen(url).read()
soup = BeautifulSoup(r, 'html.parser')
I've tried
div = soup.find('div', id='all_br-salaries')
and
div = soup.find('div', attrs={'id': 'all_br-salaries'})
When I print div I see the data from the table but when I try something like:
div.find('thead')
div.find('tbody')
I get nothing. My question is how can I select the table correctly so I can iterate over the tr/td & th tags to extract the data?
The reason? The HTML for that table is — don't ask me why — in a comment field. Therefore, dig the HTML out of the comment, turn that into soup and mine the soup in the usual way.
>>> import requests
>>> page = requests.get('http://www.baseball-reference.com/players/a/alberma01.shtml').text
>>> from bs4 import BeautifulSoup
>>> table_code = page[page.find('<table class="sortable stats_table" id="br-salaries"'):]
>>> soup = BeautifulSoup(table_code, 'lxml')
>>> rows = soup.findAll('tr')
>>> len(rows)
14
>>> for row in rows[1:]:
... row.text
...
'200825Baltimore\xa0Orioles$395,000? '
'200926Baltimore\xa0Orioles$410,000? '
'201027Baltimore\xa0Orioles$680,0002.141 '
'201128Boston\xa0Red\xa0Sox$875,0003.141 '
'201229Boston\xa0Red\xa0Sox$1,075,0004.141contracts '
'201330Cleveland\xa0Indians$1,750,0005.141contracts '
'201431Houston\xa0Astros$2,250,0006.141contracts '
'201532Chicago\xa0White\xa0Sox$1,500,0007.141contracts '
'201532Houston\xa0Astros$200,000Buyout of contract option'
'201633Chicago\xa0White\xa0Sox$2,000,0008.141 '
'201734Chicago\xa0White\xa0Sox$250,000Buyout of contract option'
'2017 StatusSigned thru 2017, Earliest Free Agent: 2018'
'Career to date (may be incomplete)$11,385,000'
EDIT: I found that this was in a comment field by opening the HTML for the page in the Chrome browser and then looking down through it for the desired table: the table's markup sits inside a comment that starts with an opening <!--.
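An alternative, if you'd rather not slice the raw page string: BeautifulSoup exposes comment nodes as Comment objects, so you can pull out the commented-out table and re-parse it. A sketch along the same lines as the answer above, not the answer's own code:

import requests
from bs4 import BeautifulSoup, Comment

page = requests.get('http://www.baseball-reference.com/players/a/alberma01.shtml').text
soup = BeautifulSoup(page, 'lxml')

# Find the comment node that contains the salaries table, then parse
# the comment's text as its own document
for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
    if 'id="br-salaries"' in comment:
        salary_soup = BeautifulSoup(comment, 'lxml')
        for row in salary_soup.findAll('tr'):
            print(row.text)
        break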

(Python) - How to store text extracted from an HTML table using BeautifulSoup in a structured Python list

I parse a webpage using BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("webpage url")
soup = BeautifulSoup(page.content, 'html.parser')
I find the table and print the text
Ear_yield= soup.find(text="Earnings Yield").parent
print(Ear_yield.parent.text)
And then I get the output of a single row in a table
Earnings Yield
0.01
-0.59
-0.33
-1.23
-0.11
I would like this output to be stored in a list so that I can write it out to xls and operate on the elements (for example, if Earnings Yield[0] > Earnings Yield[1]).
So I write:
import html2text

text1 = Ear_yield.parent.text
Ear_yield_text = html2text.html2text(text1)
list_Ear_yield = []
for i in Ear_yield_text:
    list_Ear_yield.append(i)
Thinking that my web data has gone into the list, I print the fourth item to check:
print(list_Ear_yield[3])
I expect the output to be -0.33, but I get:
n
That means the list takes in individual characters and not the full words. Please let me know where I am going wrong.
That is because your Ear_yield_text is a string rather than a list, and iterating over a string yields one character at a time. Assuming the text has newlines in it, you can do this directly:
list_Ear_yield = Ear_yield_text.split('\n')
Now if you print list_Ear_yield you will get this result:
['Earnings Yield', '0.01', '-0.59', '-0.33', '-1.23', '-0.11']
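From there, to compare the values numerically as in your Earnings Yield[0] > Earnings Yield[1] example, one more small step (a sketch; it assumes the list looks exactly like the output above, a label followed by clean numeric strings):

label, *numbers = list_Ear_yield
values = [float(x) for x in numbers]  # '0.01', '-0.59', ... become floats
if values[0] > values[1]:
    print(label, 'first value is larger')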
