How to scrape a table and its links - python-3.x

What I want to do is to take the following website
https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html
view-source:https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html
And pick the year of execution, follow the Last Statement link, and retrieve the statement... perhaps I would create two dictionaries, both with the execution number as key.
Afterwards, I would classify the statements by length, and also "flag" the ones that were refused or simply not given.
Finally, everything would be compiled into a SQLite database, and I would display a graph showing how many statements, clustered by type, were given each year.
Beautiful Soup seems to be the path to follow; I'm already having trouble just printing the year of execution... Of course, I'm not ultimately interested in printing the years of execution, but it seems like a good way of checking whether my code is at least locating the tags I want.
tags = soup('td')
for tag in tags:
    print(tag.get('href', None))
Why does the previous code only print None?
Thanks beforehand.

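To answer the direct question first: the loop prints None because <td> elements carry no href attribute; the links sit on the <a> anchors inside the cells. A minimal sketch of locating them (assuming requests and BeautifulSoup are available):
import requests
from bs4 import BeautifulSoup

url = "https://www.tdcj.texas.gov/death_row/dr_executed_offenders.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# href lives on the <a> anchors inside the table cells, not on the <td> tags
for anchor in soup.select("td a"):
    print(anchor.get("href"))
That said, there is a less fiddly route for the table itself.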
Use pandas to get and manipulate the table. The links are static; by that I mean they can easily be recreated from the offender's first and last name.
Then you can use requests and BeautifulSoup to scrape each offender's last statement, which are quite moving.
Here's how:
import requests
import pandas as pd

def clean(first_and_last_name: list) -> str:
    # Strip suffixes, apostrophes and spaces before lowercasing so the
    # replacements actually match, e.g. ["Ruiz, Jr.", "Rolando"] -> "ruizrolando"
    name = "".join(first_and_last_name)
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")

df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(columns={"Link": "Offender Information", "Link.1": "Last Statement URL"}, inplace=True)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

df.to_csv("offenders.csv", index=False)
This gets you an offenders.csv file containing the original table columns plus the two reconstructed URL columns.
EDIT:
I actually went ahead and added the code that fetches all offenders' last statements.
import random
import time

import pandas as pd
import requests
from lxml import html

base_url = "https://www.tdcj.texas.gov/death_row"
response = requests.get(f"{base_url}/dr_executed_offenders.html")
statement_xpath = '//*[@id="content_right"]/p[6]/text()'

def clean(first_and_last_name: list) -> str:
    # Strip suffixes, apostrophes and spaces before lowercasing so the
    # replacements actually match.
    name = "".join(first_and_last_name)
    name = name.replace(", Jr.", "").replace(", Sr.", "").replace("'", "")
    return name.replace(" ", "").lower()

def get_last_statement(statement_url: str) -> str:
    page = requests.get(statement_url).text
    statement = html.fromstring(page).xpath(statement_xpath)
    text = next(iter(statement), "")
    return " ".join(text.split())

df = pd.read_html(response.text, flavor="bs4")
df = pd.concat(df)
df.rename(
    columns={"Link": "Offender Information", "Link.1": "Last Statement URL"},
    inplace=True,
)

df["Offender Information"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}.html", axis=1)

df["Last Statement URL"] = df[
    ["Last Name", "First Name"]
].apply(lambda x: f"{base_url}/dr_info/{clean(x)}last.html", axis=1)

offender_data = list(
    zip(
        df["First Name"],
        df["Last Name"],
        df["Last Statement URL"],
    )
)

statements = []
for item in offender_data:
    *names, url = item
    print(f"Fetching statement for {' '.join(names)}...")
    statements.append(get_last_statement(statement_url=url))
    time.sleep(random.randint(1, 4))

df["Last Statement"] = statements
df.to_csv("offenders_data.csv", index=False)
This will take a couple of minutes, because the code "sleeps" for anywhere between 1 and 4 seconds after each request so the server doesn't get abused.
Once this is done, you'll end up with a .csv file with all offenders' data and their statements, if there was one.
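As for the rest of your plan (classifying the statements and compiling everything into a SQLite database), here is a minimal sketch building on the offenders_data.csv produced above. It assumes the table's Date column holds the execution date; the length threshold and the "declined/none given" keywords are assumptions to tune:
import sqlite3
import pandas as pd

df = pd.read_csv("offenders_data.csv")

# Year of execution, assuming the "Date" column from the scraped table.
df["Year"] = pd.to_datetime(df["Date"], errors="coerce").dt.year

def classify(statement) -> str:
    # Assumed categories and thresholds; adjust to taste.
    if not isinstance(statement, str) or not statement.strip():
        return "none given"
    if "decline" in statement.lower() or "refus" in statement.lower():
        return "declined"
    return "short" if len(statement.split()) < 50 else "long"

df["Statement Type"] = df["Last Statement"].apply(classify)

# Compile everything into SQLite.
with sqlite3.connect("executions.db") as conn:
    df.to_sql("offenders", conn, if_exists="replace", index=False)

# Statements per year, clustered by type, ready to plot.
print(df.groupby(["Year", "Statement Type"]).size().unstack(fill_value=0))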

Related

Python code to loop through a list of postcodes and get the GP practices for those postcodes by scraping the yellow pages (Australia)

The code below gives me the following error:
ValueError: Length mismatch: Expected axis has 0 elements, new values have 1 elements
on the df.columns = ["GP Practice Name"] line.
I tried
import pandas as pd
import requests
from bs4 import BeautifulSoup

postal_codes = ["2000", "2010", "2020", "2030", "2040"]
places_by_postal_code = {}

def get_places(postal_code):
    url = f"https://www.yellowpages.com.au/search/listings?clue={postal_code}&locationClue=&latitude=&longitude=&selectedViewMode=list&refinements=category:General%20Practitioner&selectedSortType=distance"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    places = soup.find_all("div", {"class": "listing-content"})
    return [place.find("h2").text for place in places]

for postal_code in postal_codes:
    places = get_places(postal_code)
    places_by_postal_code[postal_code] = places

df = pd.DataFrame.from_dict(places_by_postal_code, orient='index')
df.columns = ["GP Practice Name"]

df = pd.DataFrame(places_by_postal_code.values(), index=places_by_postal_code.keys(), columns=["GP Practice Name"])
print(df)
and was expecting a list of GPs for the postcodes specified in the postal_codes variable.
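For what it's worth, that ValueError usually means the DataFrame ended up with zero columns, i.e. every list in places_by_postal_code came back empty (likely because the site doesn't return the listings to a plain requests call). Assuming get_places does return names, a sketch that sidesteps the length mismatch by building one row per practice instead of one column per position:
rows = [
    (postal_code, name)
    for postal_code, names in places_by_postal_code.items()
    for name in names
]
df = pd.DataFrame(rows, columns=["Postcode", "GP Practice Name"])
print(df)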

Store data in a correct format after scraping in Python, post processing the csv data

Scraping is done; now I have to post-process the data stored in a CSV file.
import pandas as pd
import uuid

def postprocess():
    clean = ['Home','Shri','Smt','Dr','Mr','Mrs','Ms','Contact','Facebook','account','photo','Of','Biography','Email','Address','Roles','To','Read','more','Blog','Vacancies','Advertisement','Advertise','Holding','Second','Estimates','Fax','Find','Close','Links','Key','Figures','Stats','Statistics','Household','Full','name','Parks','Open','Menu','Languages','Opinion','Education','Address','Latest','Activity','NA','Father','Husband']
    banned = ['january','february','march','april','may','june','july','september','october','november','december','monday','tuesday','wednesday','thursday','friday','saturday','sunday']
    try:
        df = pd.read_csv("result.csv", names=['Prefix','Name','Designation','Nation','Created_date_time'], encoding='unicode_escape')
    except:
        return
    df = df.applymap(str)
    df.replace(to_replace='nan', value='', inplace=True)
    print("Postprocessing...")
    df.dropna(subset=['Name'], inplace=True)
    df.drop_duplicates(subset=['Name'], keep='first', inplace=True)
    df.reset_index(drop=True, inplace=True)
    for count, _ in enumerate(df['Name']):
        if len(df['Name'][count]) < 4:
            df.drop(count, axis=0, inplace=True)
            continue
        if df['Prefix'][count].isnumeric() or len(df['Prefix'][count]) > 15:
            df.at[count, 'Prefix'] = ''
        if 'Home' in df['Prefix'][count]:
            df.at[count, 'Prefix'] = df['Name'][count].replace('Home', '')
        for item in df['Name'][count].split(' '):
            if len(item) < 2:
                df.at[count, 'Name'] = df['Name'][count].replace(item, '')
            for word in clean:
                if word in item:
                    new_val = df['Name'][count].replace(word, '').strip()
                    df.at[count, 'Name'] = new_val
            for ban in banned:
                if ban in item.lower():
                    df.drop(count, axis=0, inplace=True)
                    break
            else:
                continue
            break
    df.dropna(subset=['Name'], inplace=True)
    df.drop_duplicates(subset=['Name'], keep='first', inplace=True)
    df.reset_index(drop=True, inplace=True)
    uuids = []
    for _ in range(len(df)):
        uuids.append(uuid.uuid4().hex)
    df.insert(0, '_id', uuids)
    print("Postprocessing complete. Database contains {} rows".format(len(df)))
    df.to_csv("result.csv", index=False)
I am not getting the required result.
What I am getting is:
id Prefix Name Designation Nation Created_date_time
random id font size present(some random words) India this is correct
I am not getting the prefix, the name is not correct, the country shows up under Designation, and the date and time show up under Nation.
What I actually want is:
id Prefix Name Designation Nation Created_date_time
random id Mr Ankur Singh Analyst India date and time
Is there any suggestion for this... what do I need to correct, or what can I do?
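One thing worth checking before blaming postprocess(): the symptoms (country under Designation, date under Nation) suggest that some rows in result.csv are written with fewer than five fields, so everything shifts left. A quick, minimal check of the raw file (assuming it is the same result.csv the function reads):
import csv
from collections import Counter

with open("result.csv", newline="", encoding="utf-8", errors="replace") as f:
    field_counts = Counter(len(row) for row in csv.reader(f))

# If the rows don't all have 5 fields, the columns are shifting when the
# scraper writes the file, and that is what needs fixing first.
print(field_counts)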

TypeError: 'str' object does not support item assignment, pandas operation

I am trying to grab some financial data off a financial website. I want to manipulate df['__this value__']. I did some research myself, and I understand the error fine, but I really have no idea how to fix it. This is what my code looks like:
import requests
import bs4
import os
import pandas as pd

def worker_names(code=600110):
    ......
    df = pd.DataFrame({'class': name_list})
    worker_years(df)

def worker_years(df, code=600110, years=None):
    if years is None:
        years = ['2019', '2018', '2017', '2016']
    url = 'http://quotes.money.163.com/f10/dbfx_'\
        + str(code) + '.html?date='\
        + str(years) + '-12-31,'\
        + str(years) + '-09-30#01c08'
    ......
    df['{}-12-31'.format(years)] = number_list  # this is where the problem is
    df = df.drop_duplicates(subset=['class'], keep=False)
    df.to_csv(".\\__fundamentals__\\{:0>6}.csv".format(code),
              index=False, encoding='GBK')
    print(df)

if __name__ == '__main__':
    pd.set_option('display.max_columns', None)
    pd.set_option('display.unicode.ambiguous_as_wide', True)
    pd.set_option('display.unicode.east_asian_width', True)
    fundamental_path = '.\\__fundamentals__'
    stock_path = '.\\__stock__'
    worker_names(code=600110)
Is there any way I can work around this? Please help! Thanks all!
Your line df['{}-12-31'.format(years)] = number_list is a good example of trying to use the whole list on both sides of the assignment at once: years is the entire list here, not a single year. Build the column from one year at a time and concatenate. Try this:
df_year = pd.DataFrame({'{}'.format(year): number_list})
df = pd.concat([df, df_year], axis=1)
When working with dataframes, there are many different ways to get the same result.
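In context, a minimal sketch of that idea inside worker_years (the scraping hidden by the ...... is elided here too; fetch_numbers_for is a hypothetical stand-in for it):
import pandas as pd

def worker_years(df, code=600110, years=None):
    if years is None:
        years = ['2019', '2018', '2017', '2016']
    for year in years:  # one year at a time, not the whole list
        url = ('http://quotes.money.163.com/f10/dbfx_' + str(code)
               + '.html?date=' + year + '-12-31,' + year + '-09-30#01c08')
        number_list = fetch_numbers_for(url)  # hypothetical helper for the elided scraping
        year_col = pd.DataFrame({'{}-12-31'.format(year): number_list})
        df = pd.concat([df, year_col], axis=1)
    return df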

Using langdetect output to be imported into a new column in my dataframe

Being rather new to programming with Python, I tried to language-detect segments of text in a pandas data frame.
So first I made a function for the 'langdetect' package
import pandas as pd
from langdetect import detect

def language_detect(x):
    lang = detect(x)
    print(lang)
My second step would be to feed in the data frame for processing. All the segments that need detecting are in separate rows in the dataframe under the same column header.
result = [language_detect(x) for x in df['column_name']]
df['l_detect'] = pd.append(result)
In the output I see the texts being recognized properly.
But when I try to print result, it returns only the value 'None' for every entry.
So my questions are:
Why do I get 'None' when the print output from the function shows the right values?
How can I attach this to my current data frame, since when I try to append it I get 'None' in every field as well?
Thanks in advance.
The problem is that result is empty because your function language_detect() doesn't return anything (it is only printing the results).
import pandas as pd
from langdetect import detect
lst = [('this is a test', 1), ('what language is this?', 4), ('stackoverflow is a website', 23)]
df = pd.DataFrame(lst, columns = ['text', 'something'])
def language_detect(x):
    lang = detect(x)
    print(lang)
result = [language_detect(x) for x in df['text']]
result
#Output:[None, None, None]
Just give it a return value:
def language_detect(x):
    lang = detect(x)
    return lang
df['l_detect'] = df['text'].apply(language_detect)
df.head()
#Output:
#                         text  something l_detect
#0              this is a test          1       en
#1      what language is this?          4       en
#2  stackoverflow is a website         23       en
and it will work as expected.
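If some of the segments can be empty or contain no detectable text, it may also be worth guarding the call, since detect() raises an exception in that case; a small sketch of that variant:
from langdetect.lang_detect_exception import LangDetectException

def language_detect(x):
    try:
        return detect(x)
    except LangDetectException:
        # e.g. empty strings or strings with no language features
        return None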

EmptyDataError: No columns to parse from file. (Generate files with "for" in Python)

The following code obtains specific data from an internet financial portal (Morningstar). I obtain data from different companies, in this case from Dutch companies. Each one is represented by a ticker.
import pandas as pd
import numpy as np

def financials_download(ticker, report, frequency):
    if frequency == "A" or frequency == "a":
        frequency = "12"
    elif frequency == "Q" or frequency == "q":
        frequency = "3"
    url = 'http://financials.morningstar.com/ajax/ReportProcess4CSV.html?&t='+ticker+'&region=usa&culture=en-US&cur=USD&reportType='+report+'&period='+frequency+'&dataType=R&order=desc&columnYear=5&rounding=3&view=raw&r=640081&denominatorView=raw&number=3'
    df = pd.read_csv(url, skiprows=1, index_col=0)
    return df

def ratios_download(ticker):
    url = 'http://financials.morningstar.com/ajax/exportKR2CSV.html?&callback=?&t='+ticker+'&region=usa&culture=en-US&cur=USD&order=desc'
    df = pd.read_csv(url, skiprows=2, index_col=0)
    return df

holland = ("AALBF", "ABN", "AEGOF", "AHODF", "AKZO", "ALLVF", "AMSYF", "ASML", "KKWFF", "KDSKF", "GLPG", "GTOFF", "HINKF", "INGVF", "KPN", "NN", "LIGHT", "RANJF", "RDLSF", "RDS.A", "SBFFF", "UNBLF", "UNLVF", "VOPKF", "WOLTF")

def finance(country):
    for ticker in country:
        frequency = "a"
        df1 = financials_download(ticker, 'bs', frequency)
        df2 = financials_download(ticker, 'is', frequency)
        df3 = ratios_download(ticker)
        d1 = df1.loc['Total assets']
        if np.any("EBITDA" in df2.index) == True:
            d2 = df2.loc["EBITDA"]
        else:
            d2 = None
        if np.any("Revenue USD Mil" in df3.index) == True:
            d3 = df3.loc["Revenue USD Mil"]
        else:
            d3 = df3.loc["Revenue EUR Mil"]
        d4 = df3.loc["Operating Margin %"]
        d5 = df3.loc["Return on Assets %"]
        d6 = df3.loc["Return on Equity %"]
        d7 = df3.loc["EBT Margin"]
        d8 = df3.loc["Net Margin %"]
        d9 = df3.loc["Free Cash Flow/Sales %"]
        if d2 is not None:
            d1 = d1.to_frame().T
            d2 = d2.to_frame().T
            d3 = d3.to_frame().T
            d4 = d4.to_frame().T
            d5 = d5.to_frame().T
            d6 = d6.to_frame().T
            d7 = d7.to_frame().T
            d8 = d8.to_frame().T
            d9 = d9.to_frame().T
            df_new = pd.concat([d1, d2, d3, d4, d5, d6, d7, d8, d9])
        else:
            d1 = d1.to_frame().T
            d3 = d3.to_frame().T
            d4 = d4.to_frame().T
            d5 = d5.to_frame().T
            d6 = d6.to_frame().T
            d7 = d7.to_frame().T
            d8 = d8.to_frame().T
            d9 = d9.to_frame().T
            df_new = pd.concat([d1, d3, d4, d5, d6, d7, d8, d9])
        df_new.to_csv(ticker + '.csv')
The problem is that when I use a for loop so that it goes through all the tickers of the variable holland and generates a csv document for each of them, it returns the following error:
File "pandas/_libs/parsers.pyx", line 565, in
pandas._libs.parsers.TextReader.__cinit__ (pandas\_libs\parsers.c:6260)
EmptyDataError: No columns to parse from file
On the other hand, it runs without error if I just select one company ticker after another.
I'd really appreciate it if you could help me.
When you run your script several times, it fails on different tickers and different calls. This indicates that the problem is not associated with a specific ticker, but rather that the call to the CSV reader doesn't always return a value that can be read into the data frame. You can address this problem by using Python's error handling routines, e.g. for your financials_download function:
df = ""
i = 0
#some data in df?
while len(df) == 0:
#try to download data and load them into df
try:
df = pd.read_csv(url, skiprows=1, index_col=0)
#not successful? Count failed attempts
except:
i += 1
print("Trial", i, "failed")
#five attempts failed? Unlikely that this server will respond
if i == 5:
print("ticker", ticker, ": server is down")
break
#print("downloaded", ticker)
#print("financial download data frame:")
#print(df)
This tries five times to retrieve the data for the ticker and, if that fails, it prints a message that it was not successful. But now you have to deal with this situation in your main program and adjust it, because some of the data frames will be empty.
For this kind of basic debugging, I would like to point you to a blog post.
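For instance, one way to handle it in finance(), sketched under the assumption that financials_download and ratios_download are changed to return None once the retries give up:
def finance(country):
    for ticker in country:
        frequency = "a"
        df1 = financials_download(ticker, 'bs', frequency)
        df2 = financials_download(ticker, 'is', frequency)
        df3 = ratios_download(ticker)
        # Skip this ticker if any download failed or came back empty.
        if any(d is None or len(d) == 0 for d in (df1, df2, df3)):
            print("skipping", ticker, "- incomplete data")
            continue
        ...  # build d1..d9 and df_new as before, then write ticker + '.csv'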
