Unable to insert data into table using python and oracle database - python-3.x

My database table looks like this
I have a web crawler that fetches news from the website, and I am trying to store it in this table. I have used the requests and Beautiful Soup libraries. The code below shows my crawler logic.
import requests
from bs4 import BeautifulSoup
import os
import datetime
import cx_Oracle

def scrappy(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        title = soup.find('title').text.split('|')[0]
        time = soup.find('span', attrs={'class':'time_cptn'}).find_all('span')[2].contents[0]
        full_text = soup.find('div', attrs={'class':'article_content'}).text.replace('Download The Times of India News App for Latest India News','')
    except:
        return ('','','','')
    else:
        return (title,time,url,full_text)

def pathmaker(name):
    path = "Desktop/Web_Crawler/CRAWLED_DATA/{}".format(name)
    try:
        os.makedirs(path)
    except OSError:
        pass
    else:
        pass

def filemaker(folder,links_all):
    #k=1
    for link in links_all:
        scrapped = scrappy(link)
        #textfile=open('Desktop/Web_Crawler/CRAWLED_DATA/{}/text{}.txt'.format(x,k),'w+')
        #k+=1
        Title = scrapped[0]
        Link = scrapped[2]
        Dates = scrapped[1]
        Text = scrapped[3]
        con = cx_Oracle.connect('shivams/tiger@127.0.0.1/XE')
        cursor = con.cursor()
        sql_query = "insert into newsdata values(:1,:2,:3,:4)"
        cursor.executemany(sql_query,[Title,Link,Dates,Text])
        con.commit()
        cursor.close()
        con.close()
        #textfile.write('Title\n{}\n\nLink\n{}\n\nDate & Time\n{}\n\nText\n{}'.format(scrapped[0],scrapped[2],scrapped[1],scrapped[3]))
        #textfile.close()
    con.close()

folders_links = [('India','https://timesofindia.indiatimes.com/india'),('World','https://timesofindia.indiatimes.com/world'),('Business','https://timesofindia.indiatimes.com/business'),('Homepage','https://timesofindia.indiatimes.com/')]

for x,y in folders_links:
    pathmaker(x)
    r = requests.get(y)
    soup = BeautifulSoup(r.text, 'html.parser')
    if x != 'Homepage':
        links = soup.find('div', attrs={'class':'main-content'}).find_all('span', attrs={'class':'twtr'})
        links_all = ['https://timesofindia.indiatimes.com' + links[x]['data-url'].split('?')[0] for x in range(len(links))]
    else:
        links = soup.find('div', attrs={'class':'wrapper clearfix'})
        total_links = links.find_all('a')
        links_all = []
        for p in range(len(total_links)):
            if 'href' in str(total_links[p]) and '.cms' in total_links[p]['href'] and 'http' not in total_links[p]['href'] and 'articleshow' in total_links[p]['href']:
                links_all += ['https://timesofindia.indiatimes.com' + total_links[p]['href']]
    filemaker(x,links_all)
Earlier I was creating text files and storing the news in them, but now I want to store it in the database so my web application can access it. My database logic is in the filemaker function. I am trying to insert the values into the table, but it is not working and gives various types of errors. I followed other posts on the website, but they did not work in my case. Can anyone help me with this? Also, I am not sure if this is the correct way to insert CLOB data, as I am using it for the first time. Need help.

You can do the following:
cursor.execute(sql_query, [Title, Link, Dates, Text])
or, if you build up a list of these values, you can then do the following:
allValues = []
allValues.append([Title, Link, Dates, Text])
cursor.executemany(sql_query, allValues)
Hope that explains things!
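For the CLOB part of the question, here is a minimal sketch of how the filemaker loop could look with a single connection and an explicit CLOB bind. It assumes newsdata has four columns in the order title, link, date, text with the last one a CLOB (as described in the question); cx_Oracle can usually accept a plain Python string for a CLOB column, and cursor.setinputsizes lets you declare that bind type up front:
import cx_Oracle

# connection details copied from the question; adjust to your environment
con = cx_Oracle.connect('shivams/tiger@127.0.0.1/XE')
cursor = con.cursor()
sql_query = "insert into newsdata values (:1, :2, :3, :4)"

# declare the fourth bind variable as a CLOB for the long article text
cursor.setinputsizes(None, None, None, cx_Oracle.CLOB)

for link in links_all:
    title, dates, url, text = scrappy(link)   # scrappy returns (title, time, url, full_text)
    # execute() inserts one row; executemany() would take a list of such rows
    cursor.execute(sql_query, [title, url, dates, text])

con.commit()
cursor.close()
con.close()
Opening the connection once outside the loop also avoids reconnecting for every article.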

Related

Beautiful Soup Value not extracting properly

Recently I was working with Python and Beautiful Soup to extract some data and put it into a pandas DataFrame.
I used Beautiful Soup to extract some of the hotel data from the website booking.com.
I was able to extract some of the attributes correctly, without any empty values.
Here is my code snippet:
def get_Hotel_Facilities(soup):
    try:
        title = soup.find_all("div", attrs={"class":"db29ecfbe2 c21a2f2d97 fe87d598e8"})
        new_list = []
        # Inner NavigableString Object
        for i in range(len(title)):
            new_list.append(title[i].text.strip())
    except AttributeError:
        new_list = ""
    return new_list
The above code is my function to retrieve the facilities of a hotel and return them as a list.
page_no = 0
d = {"Hotel_Name":[], "Hotel_Rating":[], "Room_type":[], "Room_price":[], "Room_sqft":[], "Facilities":[], "Location":[]}
while (page_no <= 25):
    URL = f"https://www.booking.com/searchresults.html?aid=304142&label=gen173rf-1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ&sid=2214b1422694e7b065e28995af4e22d9&sb=1&sb_lp=1&src=theme_landing_index&src_elem=sb&error_url=https%3A%2F%2Fwww.booking.com%2Fhotel%2Findex.html%3Faid%3D304142%26label%3Dgen173rf1FCAEoggI46AdIM1gDaGyIAQGYATG4ARfIAQzYAQHoAQH4AQKIAgGiAg1wcm9qZWN0cHJvLmlvqAIDuAKwwPadBsACAdICJDU0NThkNDAzLTM1OTMtNDRmOC1iZWQ0LTdhOTNjOTJmOWJlONgCBeACAQ%26sid%3D2214b1422694e7b065e28995af4e22d9%26&ss=goa&is_ski_area=0&checkin_year=2023&checkin_month=1&checkin_monthday=13&checkout_year=2023&checkout_month=1&checkout_monthday=14&group_adults=2&group_children=0&no_rooms=1&b_h4u_keep_filters=&from_sf=1&offset{page_no}"
    new_webpage = requests.get(URL, headers=HEADERS)
    soup = BeautifulSoup(new_webpage.content, "html.parser")
    links = soup.find_all("a", attrs={"class":"e13098a59f"})
    for link in links:
        new_webpage = requests.get(link.get('href'), headers=HEADERS)
        new_soup = BeautifulSoup(new_webpage.content, "html.parser")
        d["Hotel_Name"].append(get_Hotel_Name(new_soup))
        d["Hotel_Rating"].append(get_Hotel_Rating(new_soup))
        d["Room_type"].append(get_Room_type(new_soup))
        d["Room_price"].append(get_Price(new_soup))
        d["Room_sqft"].append(get_Room_Sqft(new_soup))
        d["Facilities"].append(get_Hotel_Facilities(new_soup))
        d["Location"].append(get_Hotel_Location(new_soup))
    page_no += 25
The above code is the main one, where the while loop traverses the linked pages and retrieves the URLs of the pages. After retrieving them, it goes to every page to retrieve the corresponding attributes.
I was able to retrieve the rest of the attributes correctly, but not the facilities: only some of the room facilities are being returned and some are not.
Here is my output below after making it into a pandas DataFrame.
Facilities o/p image
Please help me with this problem: why are some facilities coming through and some not?
P.S.: The facilities are available on the website.
I have tried using all the corresponding classes and attributes for retrieval, but I am not getting the facilities column properly.
Probably as a preventive measure, the HTML fetched by the requests doesn't seem to be consistent in its layout or even its contents.
There might be more possible selectors, but try:
def get_Hotel_Facilities(soup):
    selectors = ['div[data-testid="property-highlights"]', '#facilities',
                 '.hp-description~div div.important_facility']
    new_list = []
    for sel in selectors:
        for sect in soup.select(sel):
            new_list += list(sect.stripped_strings)
    return list(set(new_list))  # set <--> unique
But even with this, the results are inconsistent. E.g.: I tested on this page with
for i in range(10):
    soup = BeautifulSoup(cloudscraper.create_scraper().get(url).content)
    fl = get_Hotel_Facilities(soup) if soup else []
    print(f'[{i}] {len(fl)} facilities: {", ".join(fl)}')
(But the inconsistencies might be due to using cloudscraper - maybe you'll get better results with your headers?)
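If the page just needs to be re-fetched until a richer layout comes back, a small retry wrapper is one pragmatic workaround. This is only a sketch, assuming your existing HEADERS dict and the get_Hotel_Facilities function above; the attempt count is arbitrary:
import requests
from bs4 import BeautifulSoup

def fetch_facilities(url, headers, attempts=5):
    # re-fetch the page a few times and keep the longest facilities list seen
    best = []
    for _ in range(attempts):
        resp = requests.get(url, headers=headers)
        soup = BeautifulSoup(resp.content, "html.parser")
        facilities = get_Hotel_Facilities(soup)
        if len(facilities) > len(best):
            best = facilities
    return best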

Scraping reports from a website using BeautifulSoup in python

I am trying to download reports from a company's website, https://www.investorab.com/investors-media/reports-presentations/. In the end, I would like to download all the available reports.
I have next to no experience in web scraping, so I have some trouble defining the correct search pattern. Previously I have needed to pull out all links containing PDFs, i.e. I could use soup.select('div[id="id-name"] a[data-type="PDF"]'). But for this website, no data type is listed for the links. How do I select all links under "Report and presentations"? Here is what I have tried, but it returns an empty list:
from bs4 import BeautifulSoup
import requests
url = "https://www.investorab.com/investors-media/reports-presentations/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'lxml')
# Select all reports, publication_dates
reports = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] a[href]')
pub_dates = soup.select('div[class="html not-front not-logged-in no-sidebars page-events-archive i18n-en"] div[class="field-content"]')
I would also like to select all publication dates, but that also ends up as an empty list. Any help in the right direction is appreciated.
What you'll need to do is iterate through the pages, or what I did was just iterate through the year parameter. Once you get the list for the year, get the link of each report, then within each link, find the pdf link. You'll then use that pdf link to write to file:
from bs4 import BeautifulSoup
import requests
import os

# Gets all the links
linkList = []
url = 'https://vp053.alertir.com/v3/en/events-archive?'
for year in range(1917, 2021):
    query = 'type%5B%5D=report&type%5B%5D=annual_report&type%5B%5D=cmd&type%5B%5D=misc&year%5Bvalue%5D%5Byear%5D=' + str(year)
    response = requests.get(url + query)
    soup = BeautifulSoup(response.text, 'html.parser')
    links = soup.find_all('a', href=True)
    linkList += [link['href'] for link in links if 'v3' in link['href']]
    print('Gathered links for year %s.' % year)

# Go to each link and get the pdfs within them
print('Downloading PDFs...')
for link in linkList:
    url = 'https://vp053.alertir.com' + link
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    for pdflink in soup.select("a[href$='.pdf']"):
        folder_location = 'C:/test/pdfDownloads/'
        if not os.path.exists(folder_location):
            os.mkdir(folder_location)
        try:
            filename = os.path.join(folder_location, pdflink['href'].split('/')[-1])
            with open(filename, 'wb') as f:
                f.write(requests.get('https://vp053.alertir.com' + pdflink['href']).content)
            print('Saved: %s' % pdflink['href'].split('/')[-1])
        except Exception as ex:
            print('%s not saved. %s' % (pdflink['href'], ex))

Scraping wikipedia infobox geography vcard

I have been trying to scrape the data from the Website section of the various cities' vcard tables on Wikipedia, but somehow I get the results for the Coordinates section, which is located at the beginning of the table.
I have tried specifying "Website" while selecting the specific tags in the table.
def getAdditionalInfo(url):
    try:
        city_page = PageContent('https://en.wikipedia.org' + url)
        table = city_page.find('table', {'class' : 'infobox geography vcard'})
        additional_details = []
        read_content = False
        for tr in table.find_all('tr'):
            if (tr.get('class') == ['mergedtoprow'] and not read_content):
                link = tr.find('th')
                if (link and (link.get_text().strip() == 'Website')):
                    read_content = True
            elif ((tr.get('class') == ['mergedbottomrow']) or tr.get('class') == ['mergedrow'] and read_content):
                additional_details.append(tr.find('td').get_text().strip('\n'))
        return additional_details
    except Exception as error:
        print('Error occurred: {}'.format(error))
        return []
I want to append this data as a new column showing the website link for each city's official page, which I would be getting from this function.
With bs4 4.7.1 you can use :contains to target the Website table header and then get the href attribute of the a tag in the next td. Clearly there are other cases where this pattern could match, so perhaps some other form of validation is required on the input values.
You could add an additional class selector for the vcard if you wish: result = soup.select_one('.vcard th:contains(Website) + td > [href]')
Python
import requests
from bs4 import BeautifulSoup as bs

cities = ['Paris', 'Frankfurt', 'London']
base = 'https://en.wikipedia.org/wiki/'

with requests.Session() as s:
    for city in cities:
        r = s.get(base + city)
        soup = bs(r.content, 'lxml')
        result = soup.select_one('th:contains(Website) + td > [href]')
        if result is None:
            print(city, 'selector failed to find url')
        else:
            print(city, result['href'])
If I understand the problem correctly, you want to extract the official URL of each city from Wikipedia:
import requests
from bs4 import BeautifulSoup

def getAdditionalInfo(url):
    soup = BeautifulSoup(requests.get('https://en.wikipedia.org' + url).text, 'lxml')
    for th in soup.select('.vcard th'):
        if not th.text.lower() == 'website':
            continue
        yield th.parent.select_one('td').text

cities = ['/wiki/Paris', '/wiki/London', '/wiki/Madrid']
for city in cities:
    for info in getAdditionalInfo(city):
        print(f'{city}: {info}')
This prints:
/wiki/Paris: www.paris.fr
/wiki/London: london.gov.uk
/wiki/Madrid: www.madrid.es

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snippet of my code, which successfully returns the data I want from the first table, but does not find anything in the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd

# settings
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

gamelog_offense = []

# scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
    try:
        col = row.findAll('td')
        Pass_cmp = col[4].get_text()
        Pass_att = col[5].get_text()
        gamelog_offense.append([Pass_cmp, Pass_att])
        cnt += 1
    except:
        pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is completely contained within one big comment (why?). I've managed to extract this comment into a single comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment in the same way I'm successfully parsing the first table. I've tried using the ideas from similar Stack Overflow questions (selenium, phantomjs)... no luck.
import bs4

defense = soup.find(id="all_defense")
for item in defense.children:
    if isinstance(item, bs4.element.Comment):
        big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find it helpful. Many thanks to @TomasCarvalho for directing me to a solution. I was able to pass the big comment as HTML into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (Note: the try/except is because some of the teams have no gamelog, and you can't call .children on a NoneType.)
try:
    defense = soup.find(id="all_defense")
    for item in defense.children:
        if isinstance(item, bs4.element.Comment):
            html = item
    Dsoup = BeautifulSoup(html, features="html.parser")
except:
    html = ''
    Dsoup = BeautifulSoup(html, features="html.parser")
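For completeness, reusing the original row loop on the new soup instance then looks roughly like this (only a sketch: the column indices are just carried over from the offense code above and may need adjusting for the defensive stats you actually want):
gamelog_defense = []
cnt = 0
for row in Dsoup.findAll('tr'):
    try:
        col = row.findAll('td')
        # same pattern as the first table; pick whichever columns you need
        stat_1 = col[4].get_text()
        stat_2 = col[5].get_text()
        gamelog_defense.append([stat_1, stat_2])
        cnt += 1
    except:
        pass
print("Finished writing with " + str(cnt) + " records")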

Optimizing a webcrawl

The following crawl, though very short, is painfully slow. I mean, "Pop in a full-length feature film," slow.
def bestActressDOB():
    # create empty bday list
    bdays = []
    # for every base url
    for actress in getBestActresses("https://en.wikipedia.org/wiki/Academy_Award_for_Best_Actress"):
        # use actress list to create unique actress url
        URL = "http://en.wikipedia.org" + actress
        # connect to html
        html = urlopen(URL)
        # create soup object
        bsObj = BeautifulSoup(html, "lxml")
        # get text from <span class="bday">
        try:
            bday = bsObj.find("span", {"class":"bday"}).get_text()
        except AttributeError:
            print(URL)
        bdays.append(bday)
        print(bday)
    return bdays
It grabs the name of every actress nominated for an Academy Award from a table on one Wikipedia page, then converts that to a list and uses those names to create URLs to visit each actress's wiki page, where it grabs her date of birth. The data will be used to calculate the age at which each actress was nominated for, or won, the Academy Award for Best Actress. Beyond Big O, is there a way to speed this up in real time? I have little experience with this sort of thing, so I am unsure of how normal this is. Thoughts?
Edit: Requested sub-routine
def getBestActresses(URL):
    bestActressNomineeLinks = []
    html = urlopen(URL)
    try:
        soup = BeautifulSoup(html, "lxml")
        table = soup.find("table", {"class":"wikitable sortable"})
    except AttributeError:
        print("Error creating/navigating soup object")
    table_row = table.find_all("tr")
    for row in table_row:
        first_data_cell = row.find_all("td")[0:1]
        for datum in first_data_cell:
            actress_name = datum.find("a")
            links = actress_name.attrs['href']
            bestActressNomineeLinks.append(links)
    #print(bestActressNomineeLinks)
    return bestActressNomineeLinks
I would recommend trying a faster computer or even running it on a service like Google Cloud Platform, Microsoft Azure, or Amazon Web Services. There is no code that will make it go faster.
