Scraping: issues receiving table data when looping with request - python-3.x

UPDATE:
The code only works intermittently. There are 2000+ cryptos, and at the moment I have 492 unique files with their history.
When I run one of the skipped URLs on its own, it works, so I've narrowed the problem down to something in the request for the content.
Is it possible to make sure the table I'm interested in is fully loaded before the code continues?
UPDATE:
I got it working properly. I think there is a limit on the number of requests you can make per second or minute on the website I'm trying to scrape.
I put a delay of 3 seconds between every request and NOW IT WORKS!!!!
Thanks to both of you for the help. Even though neither answer was a direct fix, they put me on the right track to figuring it out.
from bs4 import BeautifulSoup
import requests
import pandas as pd
import time

def scraping(url):
    global line
    content = requests.get(url).content
    soup = BeautifulSoup(content, 'html.parser')
    table = soup.find('table', {'class': 'table'})
    if not table:
        print(url)
        return
    data = [[td.text.strip() for td in tr.findChildren('td')] for tr in table.findChildren('tr')]
    df = pd.DataFrame(data)
    df.drop(df.index[0], inplace=True)
    df[0] = pd.to_datetime(df[0])
    for i in range(1, 7):
        df[i] = pd.to_numeric(df[i].str.replace(",", "").str.replace("-", ""))
    df.columns = ['Date', 'Open', 'High', 'Low', 'Close', 'Volume', 'Market Cap']
    df.set_index('Date', inplace=True)
    df.sort_index(inplace=True)
    return df.to_csv(line + '_historical_data.csv')

with open("list_of_urls.txt") as file:
    for line in file:
        time.sleep(3)
        line = line.strip()
        start = "https://coinmarketcap.com/currencies/"
        end = "/historical-data/?start=20000101&end=21000101"
        url = start + line + end
        scraping(url)
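Since the fix in the update turned out to be request pacing, a flat 3-second sleep can be generalized into a retry with exponential backoff. The helper below is a sketch of that idea; the function names and the delay schedule are illustrative assumptions, not anything coinmarketcap documents:

```python
import time

# Hypothetical helper sketching the rate-limit workaround from the update:
# space requests out and wait progressively longer after each attempt that
# looks throttled (non-200 status). Names and thresholds are illustrative.
def backoff_delays(retries=3, base=3.0):
    # e.g. 3s, 6s, 12s between successive attempts
    return [base * (2 ** i) for i in range(retries)]

def polite_get(url, fetch, retries=3, base=3.0):
    """fetch is a callable like lambda u: requests.get(u).
    Returns the response content on HTTP 200, sleeping longer
    after each failed attempt before giving up."""
    for delay in backoff_delays(retries, base):
        resp = fetch(url)
        if resp.status_code == 200:
            return resp.content
        time.sleep(delay)  # possibly throttled; wait longer and retry
    raise RuntimeError("giving up on " + url)

print(backoff_delays(3))  # [3.0, 6.0, 12.0]
```

With this in place, `scraping()` could call `polite_get(url, lambda u: requests.get(u))` instead of `requests.get(url).content`, and the fixed `time.sleep(3)` in the loop becomes a fallback rather than the only defense.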

It could be a 404 (URL not found), or the page may have no table. To debug, normalize your loop and print the name of the crypto currently being processed:
table = soup.find('table', {'class': 'table'})
if not table:
    print('no table')
    return

You may call findChildren() only if the returned table and tr objects are not NoneType, as follows:
data = [[td.text.strip() for td in tr.findChildren('td') if td] for tr in table.findChildren('tr') if tr] if table else []
if len(data) > 0:
    # process your data here
Hope it helps.

Related

Read a table from google docs without using API

I want to read a Google Sheets table in Python, but without using the API.
I tried with BytesIO and BeautifulSoup.
I know about the solution with gspread, but I need to read the table without a token, using only the URL.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from lxml.etree import tostring
from io import BytesIO
import requests
req = requests.get('https://docs.google.com/spreadsheets/d/sheetId/edit#gid=', auth=('email', 'password'))
page = req.text
Here I get HTML code like <!doctype html><html lang="en-US" dir="ltr"><head><base href="h and so on...
I also tried the BeautifulSoup library, but the result is the same.
For reading a table from html, you can use pandas.read_html.
If it's an unrestricted spreadsheet like this one, you probably don't even need requests - you can just directly pass the URL to read_html. [view dataframe]
import pandas as pd
sheetId = '1bQo1an4yS1tSOMDhmUTGYtUlgnHDQ47EmIcj4YyuIxo' ## REPLACE WITH YOUR SHEETID
sheetUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}'
sheetDF = pd.read_html(
    sheetUrl, attrs={'class': 'waffle'}, skiprows=1
)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
If it's not unrestricted, then the way you're using requests.get would not work either, because you're not passing the auth argument correctly. I don't think there is any way to log in to Google with just requests.auth. You could log in to Drive in your browser, open a sheet, copy the request to https://curlconverter.com/, and paste the result into your code from there.
import pandas as pd
import requests
from bs4 import BeautifulSoup

sheetUrl = 'YOUR_SHEET_URL'
cookies = {}  # PASTE FROM https://curlconverter.com/ (a dict)
headers = {}  # PASTE FROM https://curlconverter.com/ (a dict)
req = requests.get(sheetUrl, cookies=cookies, headers=headers)

# in one line, no error-handling:
# sheetDF = pd.read_html(req.text, attrs={'class': 'waffle'}, skiprows=1)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')

# req.raise_for_status() # raise error if request fails
if req.status_code != 200:
    print(req.status_code, req.reason, '- failed to get', sheetUrl)

soup = BeautifulSoup(req.content, 'html.parser')
wTable = soup.find('table', class_="waffle")
if wTable is not None:
    dfList = pd.read_html(str(wTable), skiprows=1)  # skiprows=1 skips the top row with column names A, B, C...
    sheetDF = dfList[0]  # read_html returns a LIST of dataframes
    sheetDF = sheetDF.drop(['1'], axis='columns')  # drop the column of row numbers 1, 2, 3...
    sheetDF = sheetDF.dropna(axis='rows', how='all')  # drop empty rows
    sheetDF = sheetDF.dropna(axis='columns', how='all')  # drop empty columns
    ### WHATEVER ELSE YOU WANT TO DO WITH THE DATAFRAME ###
else:
    print('COULD NOT FIND TABLE')
However, please note that the cookies are probably only good for up to 5 hours maximum (and then you'll need to paste new ones), and that if there are multiple sheets in one spreadsheet, you'll only be able to scrape the first sheet with requests/pandas. So, it would be better to use the API for restricted or multi-sheet spreadsheets.
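As a side note: for an unrestricted spreadsheet, Google also exposes a plain-CSV export endpoint per tab, which skips HTML parsing entirely. A minimal sketch (the sheetId and gid below are placeholders you'd substitute with your own):

```python
import pandas as pd

# Build the CSV export URL for a public sheet. 'YOUR_SHEET_ID' and the
# gid are placeholders; gid=0 is usually the first tab.
sheetId = 'YOUR_SHEET_ID'
gid = 0
csvUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}/export?format=csv&gid={gid}'
print(csvUrl)
# sheetDF = pd.read_csv(csvUrl)  # uncomment once sheetId is a real, public sheet
```

Unlike the read_html approach, this also works per-tab via the gid, but it still requires the sheet to be publicly readable.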

Python 3.7 Issue with append function

I'm learning Python and decided to adapt code from an example to scrape Craigslist data to look at prices of cars: https://towardsdatascience.com/web-scraping-craigslist-a-complete-tutorial-c41cea4f4981
I created a Jupyter notebook and modified the code for my use. I reproduced the same error when running the code in Spyder with Python 3.7.
I'm running into an issue at line 116:
File "C:/Users/UserM/Documents/GitHub/learning/Spyder Python Craigslist Scrape Untitled0.py", line 116
post_prices.append(post_price)
I receive a "SyntaxError: invalid syntax". Any help appreciated. Thanks.
# -*- coding: utf-8 -*-
"""
Created on Wed Oct 2 12:26:06 2019
"""
#import get to call a get request on the site
from requests import get
#get the first page of the Chicago car prices
response = get('https://chicago.craigslist.org/search/cto?bundleDuplicates=1') #eliminate duplicates and show owner only sales
from bs4 import BeautifulSoup
html_soup = BeautifulSoup(response.text, 'html.parser')
#get the macro-container for the housing posts
posts = html_soup.find_all('li', class_= 'result-row')
print(type(posts)) #to double check that I got a ResultSet
print(len(posts)) #to double check I got 120 (elements/page
#grab the first post
post_one = posts[0]
#grab the price of the first post
post_one_price = post_one.a.text
post_one_price.strip()
#grab the time of the post in datetime format to save on cleaning efforts
post_one_time = post_one.find('time', class_= 'result-date')
post_one_datetime = post_one_time['datetime']
#title is a and that class, link is grabbing the href attribute of that variable
post_one_title = post_one.find('a', class_='result-title hdrlnk')
post_one_link = post_one_title['href']
#easy to grab the post title by taking the text element of the title variable
post_one_title_text = post_one_title.text
#the neighborhood is grabbed by finding the span class 'result-hood' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-hood').text
#the price is grabbed by finding the span class 'result-price' and pulling the text element from that
post_one_hood = posts[0].find('span', class_='result-price').text
#build out the loop
from time import sleep
import re
from random import randint #avoid throttling by not sending too many requests one after the other
from warnings import warn
from time import time
from IPython.core.display import clear_output
import numpy as np
#find the total number of posts to find the limit of the pagination
results_num = html_soup.find('div', class_= 'search-legend')
results_total = int(results_num.find('span', class_='totalcount').text) #pulled the total count of posts as the upper bound of the pages array
#each page has 119 posts so each new page is defined as follows: s=120, s=240, s=360, and so on. So we need to step in size 120 in the np.arange function
pages = np.arange(0, results_total+1, 120)
iterations = 0
post_timing = []
post_hoods = []
post_title_texts = []
post_links = []
post_prices = []
for page in pages:
    #get request
    response = get("https://chicago.craigslist.org/search/cto?bundleDuplicates=1"
                   + "s="             #the parameter for defining the page number
                   + str(page)        #the page number in the pages array from earlier
                   + "&hasPic=1"
                   + "&availabilityMode=0")
    sleep(randint(1,5))
    #throw warning for status codes that are not 200
    if response.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, response.status_code))
    #define the html text
    page_html = BeautifulSoup(response.text, 'html.parser')
    #define the posts
    posts = html_soup.find_all('li', class_= 'result-row')
    #extract data item-wise
    for post in posts:
        if post.find('span', class_ = 'result-hood') is not None:
            #posting date
            #grab the datetime element 0 for date and 1 for time
            post_datetime = post.find('time', class_= 'result-date')['datetime']
            post_timing.append(post_datetime)
            #neighborhoods
            post_hood = post.find('span', class_= 'result-hood').text
            post_hoods.append(post_hood)
            #title text
            post_title = post.find('a', class_='result-title hdrlnk')
            post_title_text = post_title.text
            post_title_texts.append(post_title_text)
            #post link
            post_link = post_title['href']
            post_links.append(post_link)
            #removes the \n whitespace from each side, removes the currency symbol, and turns it into an int
            #test removed: post_price = int(post.a.text.strip().replace("$", ""))
            post_price = int(float((post.a.text.strip().replace("$", ""))) #does this work??
            post_prices.append(post_price)
    iterations += 1
    print("Page " + str(iterations) + " scraped successfully!")

print("\n")
print("Scrape complete!")

import pandas as pd
eb_apts = pd.DataFrame({'posted': post_timing,
                        'neighborhood': post_hoods,
                        'post title': post_title_texts,
                        'URL': post_links,
                        'price': post_prices})
print(eb_apts.info())
eb_apts.head(10)
Welcome to StackOverflow. Usually when you see syntax errors in previously working code, it means you've either messed up indentation, forgotten to terminate a string somewhere, or missed a closing bracket.
You can tell this is the case when a line of apparently fine code throws a syntax error: the line before isn't terminated properly, and the interpreter is giving you a hint about where to look.
In this case, you're short a parenthesis on the line before:
post_price = int(float((post.a.text.strip().replace("$", "")))
should be
post_price = int(float((post.a.text.strip().replace("$", ""))))
or delete the extra parenthesis after float:
post_price = int(float(post.a.text.strip().replace("$", "")))
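The "error reported on the next line" behavior can be reproduced without running the scraper: compile a two-line snippet containing the unbalanced parenthesis, and Python raises a SyntaxError even though the append() line itself is fine. A small demonstration (the variable names are stand-ins for the originals):

```python
# Compile (not execute) a snippet with the same unclosed int( paren as in
# the question; the parser flags a SyntaxError, historically pointing at
# the line AFTER the one missing the parenthesis.
snippet = (
    'post_price = int(float((text.strip().replace("$", "")))\n'
    'post_prices.append(post_price)\n'
)
try:
    compile(snippet, "<demo>", "exec")
    raised = False
except SyntaxError:
    raised = True
print(raised)  # True
```

Note that recent Python versions (3.10+) improved this and now often report "'(' was never closed" on the offending line itself, which makes this class of bug much easier to spot.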

Unable to insert data into table using python and oracle database

My database table looks like this.
I have a web crawler that fetches news from the website, and I am trying to store the results in this table. I used the requests and BeautifulSoup libraries. The code below shows my crawler logic.
import requests
from bs4 import BeautifulSoup
import os
import datetime
import cx_Oracle

def scrappy(url):
    try:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        title = soup.find('title').text.split('|')[0]
        time = soup.find('span', attrs={'class':'time_cptn'}).find_all('span')[2].contents[0]
        full_text = soup.find('div', attrs={'class':'article_content'}).text.replace('Download The Times of India News App for Latest India News','')
    except:
        return ('','','','')
    else:
        return (title, time, url, full_text)

def pathmaker(name):
    path = "Desktop/Web_Crawler/CRAWLED_DATA/{}".format(name)
    try:
        os.makedirs(path)
    except OSError:
        pass
    else:
        pass

def filemaker(folder, links_all):
    #k=1
    for link in links_all:
        scrapped = scrappy(link)
        #textfile=open('Desktop/Web_Crawler/CRAWLED_DATA/{}/text{}.txt'.format(x,k),'w+')
        #k+=1
        Title = scrapped[0]
        Link = scrapped[2]
        Dates = scrapped[1]
        Text = scrapped[3]
        con = cx_Oracle.connect('shivams/tiger@127.0.0.1/XE')
        cursor = con.cursor()
        sql_query = "insert into newsdata values(:1,:2,:3,:4)"
        cursor.executemany(sql_query, [Title, Link, Dates, Text])
        con.commit()
        cursor.close()
        con.close()
        #textfile.write('Title\n{}\n\nLink\n{}\n\nDate & Time\n{}\n\nText\n{}'.format(scrapped[0],scrapped[2],scrapped[1],scrapped[3]))
        #textfile.close()
    con.close()

folders_links = [('India','https://timesofindia.indiatimes.com/india'),('World','https://timesofindia.indiatimes.com/world'),('Business','https://timesofindia.indiatimes.com/business'),('Homepage','https://timesofindia.indiatimes.com/')]

for x, y in folders_links:
    pathmaker(x)
    r = requests.get(y)
    soup = BeautifulSoup(r.text, 'html.parser')
    if x != 'Homepage':
        links = soup.find('div', attrs={'class':'main-content'}).find_all('span', attrs={'class':'twtr'})
        links_all = ['https://timesofindia.indiatimes.com'+links[x]['data-url'].split('?')[0] for x in range(len(links))]
    else:
        links = soup.find('div', attrs={'class':'wrapper clearfix'})
        total_links = links.find_all('a')
        links_all = []
        for p in range(len(total_links)):
            if 'href' in str(total_links[p]) and '.cms' in total_links[p]['href'] and 'http' not in total_links[p]['href'] and 'articleshow' in total_links[p]['href']:
                links_all += ['https://timesofindia.indiatimes.com'+total_links[p]['href']]
    filemaker(x, links_all)
Earlier I was creating text files and storing the news in them, but now I want to store it in the database for my web application to access. My database logic is in the filemaker function. I am trying to insert the values into the table, but it's not working and gives various types of errors. I followed other posts on the website, but they didn't work in my case. Can anyone help me with this? Also, I am not sure if this is the correct way to insert CLOB data, as I am using it for the first time. Need help.
executemany() expects a sequence of rows, but you are passing the bind values of a single row. Either insert one row at a time with execute():
cursor.execute(sql_query, [Title, Link, Dates, Text])
or, if you build up a list of rows, use executemany():
allValues = []
allValues.append([Title, Link, Dates, Text])
cursor.executemany(sql_query, allValues)
Hope that explains things!
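To see the row-shape difference concretely, here is a sketch using sqlite3, which follows the same DB-API pattern (cx_Oracle's placeholders are :1, :2, ... instead of ?; the table and values are made up for the demo):

```python
import sqlite3

# Illustrative in-memory table mirroring the newsdata schema.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE newsdata (title TEXT, link TEXT, dates TEXT, body TEXT)")

row = ("Title", "http://example.com", "2019-10-02", "Full text...")

# execute() takes the bind values of ONE row:
cur.execute("INSERT INTO newsdata VALUES (?, ?, ?, ?)", row)

# executemany() takes a SEQUENCE OF rows:
all_rows = [row, row]
cur.executemany("INSERT INTO newsdata VALUES (?, ?, ?, ?)", all_rows)

con.commit()
cur.execute("SELECT COUNT(*) FROM newsdata")
print(cur.fetchone()[0])  # 3
```

Passing a single flat row to executemany() makes the driver treat each scalar as a row of bind values, which is why it errors out.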

Web Scraping with Try: Except: in For Loop

I have written the code below attempting to practice web scraping with Python, Pandas, etc. In general there are four steps I am trying to follow to achieve my desired output:
1. Get a list of names to append to a base URL
2. Create a list of player-specific URLs
3. Use the player URLs to scrape tables
4. Add the player name to the scraped table to keep track of which player belongs to which stats - i.e. in each row of the table, add a column with the name of the player whose page was scraped
I was able to get steps 1 and 2 working. The components of step 3 seem to work, but I believe something is wrong with my try/except, because if I run just the line of code to scrape a specific playerUrl, the tables DataFrame populates as expected. The first player scraped has no data, so I believe the error catching is failing.
For step 4 I really haven't been able to find a solution. How do I add the name to the table as the for loop iterates?
Any help is appreciated.
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find('div', class_ = 'item-list')
names = []
for player in data:
    name = data.find_all('div', class_ = 'name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))

### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = []
x = 0
for name in (names):
    playerList = names
    newUrl = url + str(playerList[x])
    print("Gathering url..."+newUrl)
    playerUrl.append(newUrl)
    x += 1

### now take the list of urls and gather stats tables
tbls = []
i = 0
for url in (playerUrl):
    try:  ### added the try, except because some players have no stats table
        tables = pd.read_html(playerUrl[i], header = 0)[2]
        tbls.append(tables)
        i += 1
    except Exception:
        continue
There is a lot of redundancy in your script. You can clean it up as follows. I've used select() instead of find_all() to shake off the verbosity in the first place. To get rid of that IndexError, you can make use of the continue keyword as I've shown below:
import requests
import pandas as pd
from bs4 import BeautifulSoup

base_url = "https://www.mlssoccer.com/players?page=0"
url = 'https://www.mlssoccer.com/players/'

res = requests.get(base_url)
soup = BeautifulSoup(res.text, 'lxml')
names = []
for player in soup.select('.item-list .name a'):
    names.append(player.get_text(strip=True).replace(" ", "-"))

playerUrl = {}
for name in names:
    playerUrl[name] = f'{url}{name}'

tbls = []
for url in playerUrl.values():
    if len(pd.read_html(url)) <= 2: continue
    tables = pd.read_html(url, header=0)[2]
    tbls.append(tables)

print(tbls)
You can do a couple of things to improve your code and get steps 3 and 4 done:
(i) When using the for name in names loop, there is no need to use explicit indexing; just use the variable name.
(ii) Save the player's name and its corresponding URL as a dict, with the name as the key. Then in steps 3/4 you can use that name.
(iii) Construct a DataFrame from each parsed HTML table and append the player's name to it as a column. Save each data frame individually.
(iv) Finally, concatenate these data frames into a single one.
Here is your code modified with above suggested changes:
import requests
import pandas as pd
from bs4 import BeautifulSoup

### get the player data to create player specific urls
res = requests.get("https://www.mlssoccer.com/players?page=0")
soup = BeautifulSoup(res.content, 'html.parser')
data = soup.find('div', class_ = 'item-list')
names = []
for player in data:
    name = data.find_all('div', class_ = 'name')
    for obj in name:
        names.append(obj.find('a').text.lower().lstrip().rstrip().replace(' ','-'))

### create a list of player specific urls
url = 'https://www.mlssoccer.com/players/'
playerUrl = {}
for name in names:
    newUrl = url + str(name)
    print("Gathering url..."+newUrl)
    playerUrl[name] = newUrl

### now take the list of urls and gather stats tables
tbls = []
for name, url in playerUrl.items():
    try:
        tables = pd.read_html(url, header = 0)[2]
        df = pd.DataFrame(tables)
        df['Player'] = name
        tbls.append(df)
    except Exception as e:
        print(e)
        continue

result = pd.concat(tbls)
print(result.head())
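The tag-then-concatenate pattern in steps (iii) and (iv) can be checked offline with toy tables standing in for the scraped ones (the column names and player slugs here are made up for the demo):

```python
import pandas as pd

# Two stand-ins for tables returned by pd.read_html, one per player.
t1 = pd.DataFrame({"G": [10], "A": [2]})
t2 = pd.DataFrame({"G": [5], "A": [1]})

frames = []
for name, tbl in [("player-a", t1), ("player-b", t2)]:
    tbl = tbl.copy()
    tbl["Player"] = name   # step (iii): tag each table with its player
    frames.append(tbl)

# step (iv): stack the tagged tables into one frame
result = pd.concat(frames, ignore_index=True)
print(list(result["Player"]))  # ['player-a', 'player-b']
```

Because every row carries its player name, the combined frame can later be grouped or filtered per player without losing track of the source.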

Using Beautifulsoup to parse a big comment?

I'm using BS4 to parse this webpage:
You'll notice there are two separate tables on the page. Here's the relevant snippet of my code, which successfully returns the data I want from the first table, but does not find anything in the second table:
# import packages
import urllib3
import certifi
from bs4 import BeautifulSoup
import pandas as pd

#settings
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs=certifi.where())

gamelog_offense = []

#scrape the data and write the .csv files
url = "https://www.sports-reference.com/cfb/schools/florida/2018/gamelog/"
response = http.request('GET', url)
soup = BeautifulSoup(response.data, features="html.parser")
cnt = 0
for row in soup.findAll('tr'):
    try:
        col = row.findAll('td')
        Pass_cmp = col[4].get_text()
        Pass_att = col[5].get_text()
        gamelog_offense.append([Pass_cmp, Pass_att])
        cnt += 1
    except:
        pass
print("Finished writing with " + str(cnt) + " records")
Finished writing with 13 records
I've verified that the data from the SECOND table is contained within the soup (I can see it!). After lots of troubleshooting, I've discovered that the entire second table is contained within one big comment (why?). I've managed to extract this comment into a single Comment object using the code below, but can't figure out what to do with it after that to extract the data I want. Ideally, I'd like to parse the comment the same way I'm successfully parsing the first table. I've tried ideas from similar Stack Overflow questions (selenium, phantomjs)... no luck.
import bs4

defense = soup.find(id="all_defense")
for item in defense.children:
    if isinstance(item, bs4.element.Comment):
        big_comment = item
print(big_comment)
<div class="table_outer_container">
<div class="overthrow table_container" id="div_defense">
...and so on....
Posting an answer here in case others find it helpful. Many thanks to @TomasCarvalho for directing me to a solution. I was able to pass the big comment as HTML into a second soup instance using the following code, and then just use the original parsing code on the new soup instance. (Note: the try/except is there because some teams have no gamelog, and you can't call .children on a NoneType.)
try:
    defense = soup.find(id="all_defense")
    for item in defense.children:
        if isinstance(item, bs4.element.Comment):
            html = item
    Dsoup = BeautifulSoup(html, features="html.parser")
except:
    html = ''
    Dsoup = BeautifulSoup(html, features="html.parser")
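For anyone wanting to test the technique without hitting the live site, here is a self-contained sketch using a toy HTML string that mimics sports-reference's commented-out table: the table is invisible to a normal find() until the comment's text is re-parsed as its own soup.

```python
from bs4 import BeautifulSoup, Comment

# Toy stand-in for the real page: a table hidden inside an HTML comment.
html = """
<div id="all_defense">
<!--
<table id="defense">
  <tr><th>A</th><th>B</th></tr>
  <tr><td>1</td><td>2</td></tr>
</table>
-->
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.find("table"))  # None: the table is commented out

# Pull the Comment node out of the wrapper div and re-parse its text.
wrapper = soup.find(id="all_defense")
comment = next(c for c in wrapper.children if isinstance(c, Comment))
Dsoup = BeautifulSoup(comment, "html.parser")

rows = [[cell.get_text() for cell in tr.find_all(["th", "td"])]
        for tr in Dsoup.find_all("tr")]
print(rows)  # [['A', 'B'], ['1', '2']]
```

The second soup behaves exactly like the first one did for the uncommented table, so the original row-parsing loop works on it unchanged.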
