I have been trying to scrape the data from the Website section of the various cities' vcard table on Wikipedia, but somehow I get the results for the Coordinates section, which is located at the beginning of the table.
I have tried specifying "Website" while selecting the specific tags in the table.
def getAdditionalInfo(url):
    try:
        # PageContent is a helper defined elsewhere that returns a BeautifulSoup object for the URL
        city_page = PageContent('https://en.wikipedia.org' + url)
        table = city_page.find('table', {'class': 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('th')
                if (link and (link.get_text().strip() == 'Website')):
                    read_content = True
            elif ((tr.get('class') == ['mergedbottomrow']) or tr.get('class') == ['mergedrow'] and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
        return additional_details
    except Exception as error:
        print('Error occurred: {}'.format(error))
        return []
I want to append this data to a new column that shows the website link for each city's official page, which I would get from this function.
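For context, this is roughly the final step I have in mind, as a minimal sketch (assuming a pandas DataFrame df that already holds the /wiki/... paths in a url column; the names here are illustrative):

import pandas as pd

# Hypothetical frame holding the article paths collected earlier.
df = pd.DataFrame({'city': ['Paris', 'London'],
                   'url': ['/wiki/Paris', '/wiki/London']})

# getAdditionalInfo returns a list; keep the first entry (the website) if any.
df['website'] = df['url'].apply(lambda u: (getAdditionalInfo(u) or [None])[0])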
With bs4 4.7.1 you can use :contains to target the "Website" table header and then grab the href attribute of the element in the adjacent td. Clearly there are other cases where this pattern could match, so perhaps some other form of validation is required on the input values.
You could add an additional class selector for the vcard if you wish: result = soup.select_one('.vcard th:contains(Website) + td > [href]')
import requests
from bs4 import BeautifulSoup as bs

cities = ['Paris', 'Frankfurt', 'London']
base = 'https://en.wikipedia.org/wiki/'

with requests.Session() as s:
    for city in cities:
        r = s.get(base + city)
        soup = bs(r.content, 'lxml')
        result = soup.select_one('th:contains(Website) + td > [href]')
        if result is None:
            print(city, 'selector failed to find url')
        else:
            print(city, result['href'])
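One caveat: newer soupsieve releases bundled with bs4 deprecate :contains() in favour of :-soup-contains(), so on a recent install the equivalent selector would be (same behaviour, just the newer pseudo-class name):

result = soup.select_one('.vcard th:-soup-contains(Website) + td > [href]')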
If I understand the problem correctly, you want to extract the official URL of the city from Wikipedia:
import requests
from bs4 import BeautifulSoup

def getAdditionalInfo(url):
    soup = BeautifulSoup(requests.get('https://en.wikipedia.org' + url).text, 'lxml')
    for th in soup.select('.vcard th'):
        if not th.text.lower() == 'website':
            continue
        yield th.parent.select_one('td').text

cities = ['/wiki/Paris', '/wiki/London', '/wiki/Madrid']
for city in cities:
    for info in getAdditionalInfo(city):
        print(f'{city}: {info}')
This prints:
/wiki/Paris: www.paris.fr
/wiki/London: london.gov.uk
/wiki/Madrid: www.madrid.es
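If the full href (with scheme) is preferred over the displayed text, the anchor inside the td can be read instead; a small variation on the generator above (getOfficialSite is just an illustrative name):

import requests
from bs4 import BeautifulSoup

def getOfficialSite(url):
    # Same traversal as getAdditionalInfo above, but yield the href of the
    # anchor in the Website row rather than the visible cell text.
    soup = BeautifulSoup(requests.get('https://en.wikipedia.org' + url).text, 'lxml')
    for th in soup.select('.vcard th'):
        if th.text.strip().lower() != 'website':
            continue
        a = th.parent.select_one('td a[href]')
        if a:
            yield a['href']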
Related
I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to pull out all links to PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website there is no data type listed on the links. How do I select all links under "Report and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up with an empty list. Any help in the right direction is appreciated.
What you'll need to do is iterate through the pages, or what I did was just iterate through the year parameter. Once you get the list for a year, get the link of each report, then within each link find the PDF link. You then use that PDF link to write to file:
from bs4 import BeautifulSoup
import requests
import os

# Gather all the report links, year by year
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and get the PDFs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))
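A small aside on the file handling above: the existence check can be hoisted out of the download loop by letting os.makedirs tolerate an existing directory, for example:

import os

folder_location = 'C:/test/pdfDownloads/'
# exist_ok=True makes this a no-op when the folder already exists,
# so it can run once before the loop instead of on every PDF.
os.makedirs(folder_location, exist_ok=True)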
My database table looks like this
I have a web crawler that fetches news from the website and I am trying to store it in this table. I have used the Scrapy and Beautiful Soup libraries. The code below shows my crawler logic.
import requests
from bs4 import BeautifulSoup
import os
import datetime
import cx_Oracle

def scrappy(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        title = soup.find('title').text.split('|')[0]
        time = soup.find('span', attrs={'class': 'time_cptn'}).find_all('span')[2].contents[0]
        full_text = soup.find('div', attrs={'class': 'article_content'}).text.replace('Download The Times of India News App for Latest India News', '')
    except:
        return ('', '', '', '')
    else:
        return (title, time, url, full_text)

def pathmaker(name):
    path = "Desktop/Web_Crawler/CRAWLED_DATA/{}".format(name)
    try:
        os.makedirs(path)
    except OSError:
        pass
    else:
        pass

def filemaker(folder, links_all):
    #k=1
    for link in links_all:
        scrapped = scrappy(link)
        #textfile=open('Desktop/Web_Crawler/CRAWLED_DATA/{}/text{}.txt'.format(x,k),'w+')
        #k+=1
        Title = scrapped[0]
        Link = scrapped[2]
        Dates = scrapped[1]
        Text = scrapped[3]
        con = cx_Oracle.connect('shivams/tiger#127.0.0.1/XE')
        cursor = con.cursor()
        sql_query = "insert into newsdata values(:1,:2,:3,:4)"
        cursor.executemany(sql_query, [Title, Link, Dates, Text])
        con.commit()
        cursor.close()
        con.close()
        #textfile.write('Title\n{}\n\nLink\n{}\n\nDate & Time\n{}\n\nText\n{}'.format(scrapped[0],scrapped[2],scrapped[1],scrapped[3]))
        #textfile.close()
    con.close()

folders_links = [('India', 'https://timesofindia.indiatimes.com/india'), ('World', 'https://timesofindia.indiatimes.com/world'), ('Business', 'https://timesofindia.indiatimes.com/business'), ('Homepage', 'https://timesofindia.indiatimes.com/')]

for x, y in folders_links:
    pathmaker(x)
    r = requests.get(y)
    soup = BeautifulSoup(r.text, 'html.parser')
    if x != 'Homepage':
        links = soup.find('div', attrs={'class': 'main-content'}).find_all('span', attrs={'class': 'twtr'})
        links_all = ['https://timesofindia.indiatimes.com' + links[x]['data-url'].split('?')[0] for x in range(len(links))]
    else:
        links = soup.find('div', attrs={'class': 'wrapper clearfix'})
        total_links = links.find_all('a')
        links_all = []
        for p in range(len(total_links)):
            if 'href' in str(total_links[p]) and '.cms' in total_links[p]['href'] and 'http' not in total_links[p]['href'] and 'articleshow' in total_links[p]['href']:
                links_all += ['https://timesofindia.indiatimes.com' + total_links[p]['href']]
    filemaker(x, links_all)
Earlier I was creating text files and storing the news in them, but now I want to store it in the database for my web application to access. My database logic is in the filemaker function. I am trying to insert the values into the table, but it's not working and giving various types of errors. I followed other posts on the site, but they didn't work in my case. Can anyone help me with this? Also, I am not sure if that is the correct way to insert CLOB data, as I am using it for the first time. Need help.
The problem is that executemany() expects a list of rows (each row itself a sequence of bind values), so passing one flat list of four values makes cx_Oracle treat each value as a separate row. For a single row you can do the following:
cursor.execute(sql_query, [Title, Link, Dates, Text])
or, if you build up a list of these values, you can then do the following:
allValues = []
allValues.append([Title, Link, Dates, Text])
cursor.executemany(sql_query, allValues)
Hope that explains things!
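On the CLOB part of the question: cx_Oracle will accept an ordinary Python string for a CLOB column, but for long article text it is safer to tell the driver which bind variable is a CLOB via setinputsizes. A minimal sketch, assuming the fourth column of newsdata is the CLOB (the sample values below are placeholders):

import cx_Oracle

# Placeholder values standing in for one scraped article.
Title, Dates, Link, Text = 'Some headline', 'Apr 1, 2019, 10:00 IST', 'https://example.com/articleshow/1.cms', 'Full article text...'

con = cx_Oracle.connect('shivams/tiger@127.0.0.1/XE')  # note '@', not '#'
cursor = con.cursor()

sql_query = "insert into newsdata values (:1, :2, :3, :4)"
# Mark the fourth bind variable (the article body) as a CLOB so long strings
# bind correctly; the other positions keep their default types.
cursor.setinputsizes(None, None, None, cx_Oracle.CLOB)
cursor.execute(sql_query, [Title, Link, Dates, Text])

con.commit()
cursor.close()
con.close()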
I am having an issue getting the text of a field from a web page using Python 3 and bs4. Code below.
import requests
from bs4 import BeautifulSoup
import pandas as pd

page = requests.get("https://www.mlssoccer.com/players")
content = page.content
soup = BeautifulSoup(content, "html.parser")
data = soup.find('div', class_='item-list')
names = []
for player in data:
    name = data.find_all('div', class_='name')
    names.append(name)
df = pd.DataFrame({'player': names})
The code works (i.e. it executes), but I get the HTML tags in the output rather than the text of the field (the player name). I tried:
name = data.find_all('div', class_ = 'name').text
in the for loop, but that doesn't work either.
Any pointers or references would be appreciated.
What you get from find_all is a ResultSet, so yes, you need to use .text to retrieve the name data you want, but that won't work on the whole set. Therefore you need a for loop to retrieve the elements one by one.
However, the div actually contains an a tag, so you need to dig further into it with find('a').
for player in data:
    name = data.find_all('div', class_='name')
    for obj in name:
        names.append(obj.find('a').text)
You only need to loop once; use .text to get the text inside the element:
....
soup = BeautifulSoup(content, "html.parser")
data = soup.findAll('a', class_='name_link' )
names=[]
for player in data:
    names.append(player.text)
.....
There are some questions like this online, but I looked at them and none of them have helped me. I am currently working on a script that pulls an item name from http://www.supremenewyork.com/shop/all/accessories
I want it to pull this information from Supreme UK, but I'm having trouble with the proxy stuff; right now I'm struggling with this script, and every time I run it I get the error listed in the title.
Here is my script:
import requests
from bs4 import BeautifulSoup

URL = ('http://www.supremenewyork.com/shop/all/accessories')
proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')

for item in soup.find_all('div', class_='inner-article'):
    name = soup.find('h1', itemprop='name').text
    print(name)
I always get this error, and when I run the script without the .text at the end of itemprop='name' I just get a bunch of Nones, like this:
None
None
None
etc.
There are exactly as many Nones as there are items available to print.
Here we go, I've commented the code that I've used below.
The reason we use class_='something' is that the word class is reserved in Python.
import requests
from bs4 import BeautifulSoup

URL = ('http://www.supremenewyork.com/shop/all/accessories')

#UK_Proxy1 = '178.62.13.163:8080'
#proxies = {
#    'http': 'http://' + UK_Proxy1,
#    'https': 'https://' + UK_Proxy1
#}
#proxy_script = requests.get(URL, proxies=proxies).text

proxy_script = requests.get(URL).text
soup = BeautifulSoup(proxy_script, 'lxml')

thetable = soup.find('div', class_='turbolink_scroller')
items = thetable.find_all('div', class_='inner-article')

for item in items:
    only_text = item.h1.a.text
    # by doing .<tag> we extract information just from that tag
    # example: bsobject = <html><body><b>ey</b></body></html>
    # if we print bsobject.body.b it will return `<b>ey</b>`
    color = item.p.a.text
    print(only_text, color)
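On the proxy part mentioned in the question: requests takes a proxies mapping keyed by scheme, so once a working UK proxy is known, the commented-out lines can be re-enabled along these lines (the address below is just the placeholder from the snippet above, not a verified proxy):

import requests

URL = 'http://www.supremenewyork.com/shop/all/accessories'

UK_Proxy1 = '178.62.13.163:8080'  # placeholder, not a verified working proxy
proxies = {
    'http': 'http://' + UK_Proxy1,
    'https': 'https://' + UK_Proxy1,
}

# A timeout keeps a dead proxy from hanging the request indefinitely.
proxy_script = requests.get(URL, proxies=proxies, timeout=10).text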
I am trying to extract a table into pandas from a website that is automatically updated on a regular basis. I tried:
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
website = 'http://www.dallasfirerescue.com/active_incidents.html'
req = Request(website)
abc = urlopen(req)
raw = abc.read().decode("utf-8")
page = raw.replace('<!-->', '')
soup = BeautifulSoup(page, "html.parser")
table = soup.find("table")
print (table)
It gives me None
Your link didn't work for me, but here is a great example of how to download data from an HTML table into Python.
# import libraries
import requests
from bs4 import BeautifulSoup

# query the website and return the html to the variable 'page'
page = requests.get("https://www.aucklandairport.co.nz/flights").text
soup = BeautifulSoup(page)

tbody = soup.find('tbody')
rows = tbody.findAll('tr', {'class': 'flight-toggle'})  # find tr whose class = flight-toggle

for tr in rows:
    cols = tr.findAll('td', class_=lambda x: x != 'logo')  # find td whose class != logo (exclude the first td)
    dv0 = cols[0].find('div').findAll('div')  # flight, carrier, origin under second td
    flight, carrier, origin = [c.text.strip() for c in dv0]
    dv1 = cols[1].find('div').findAll('div')  # date, scheduled under third td
    date, scheduled = [c.text.strip() for c in dv1]
    dv2 = cols[2].find('div').findAll('div')  # estimated, status under fourth td
    estimated, status = [c.text.strip() for c in dv2[1:]]  # exclude the first div
    print(flight, carrier, origin, date, scheduled, estimated, status)
See the links below for more info.
http://srome.github.io/Parsing-HTML-Tables-in-Python-with-BeautifulSoup-and-pandas/
https://pythonprogramminglanguage.com/web-scraping-with-pandas-and-beautifulsoup/
The content of that page is generated dynamically, so you can't grab it with a plain HTTP request. You need to use a browser simulator instead. Here is how you can achieve that; I used selenium in this case:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get('http://www.dallasfirerescue.com/active_incidents.html')
soup = BeautifulSoup(driver.page_source, "lxml")

table = soup.find(class_="CSVTable")
for tr in table.find_all("tr"):
    data = [item.text.strip() for item in tr.find_all("td")]
    print(data)

driver.quit()
When you execute the above script, the data from the table on that webpage will be printed row by row.
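Since the original goal was to get the table into pandas, the rendered source from selenium can also be handed straight to pandas.read_html; a minimal sketch along the same lines (it assumes the incident table is the first table pandas finds in the page):

import pandas as pd
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.dallasfirerescue.com/active_incidents.html')

# read_html parses every <table> in the rendered markup into a DataFrame.
tables = pd.read_html(driver.page_source)
driver.quit()

df = tables[0]
print(df.head())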