I want to read a Google Sheets table in Python, but without using the API.
I tried with BytesIO and BeautifulSoup.
I know about the solution with gspread, but I need to read the table without a token, using only the URL.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from lxml.etree import tostring
from io import BytesIO
import requests
req=requests.get('https://docs.google.com/spreadsheets/d/sheetId/edit#gid=', auth=('email', 'password'))
page = req.text
Here I get HTML code like <!doctype html><html lang="en-US" dir="ltr"><head><base href="h and so on...
I also tried the BeautifulSoup library, but the result is the same.
For reading a table from HTML, you can use pandas.read_html.
If it's an unrestricted spreadsheet like this one, you probably don't even need requests - you can just pass the URL directly to read_html.
import pandas as pd
sheetId = '1bQo1an4yS1tSOMDhmUTGYtUlgnHDQ47EmIcj4YyuIxo' ## REPLACE WITH YOUR SHEETID
sheetUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}'
sheetDF = pd.read_html(
    sheetUrl, attrs={'class': 'waffle'}, skiprows=1
)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
If it's not unrestricted, then the way you're using requests.get would not work either, because you're not passing the auth argument correctly. I actually don't think there is any way to log in to Google with just requests.auth. You could log in to Drive in your browser, open a sheet, copy the request to https://curlconverter.com/, and paste the result into your code from there.
import pandas as pd
import requests
from bs4 import BeautifulSoup

sheetUrl = 'YOUR_SHEET_URL'
cookies = {}  # PASTE FROM https://curlconverter.com/
headers = {}  # PASTE FROM https://curlconverter.com/
req = requests.get(sheetUrl, cookies=cookies, headers=headers)

# in one line, no error-handling:
# sheetDF = pd.read_html(req.text, attrs={'class': 'waffle'}, skiprows=1)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')

# req.raise_for_status()  # raise an error if the request fails
if req.status_code != 200:
    print(req.status_code, req.reason, '- failed to get', sheetUrl)

soup = BeautifulSoup(req.content, 'html.parser')
wTable = soup.find('table', class_='waffle')
if wTable is not None:
    dfList = pd.read_html(str(wTable), skiprows=1)  # skiprows=1 skips the top row with column letters A, B, C...
    sheetDF = dfList[0]  # read_html returns a LIST of dataframes
    sheetDF = sheetDF.drop(['1'], axis='columns')  # drop the column of row numbers 1, 2, 3...
    sheetDF = sheetDF.dropna(axis='rows', how='all')  # drop empty rows
    sheetDF = sheetDF.dropna(axis='columns', how='all')  # drop empty columns
    ### WHATEVER ELSE YOU WANT TO DO WITH THE DATAFRAME ###
else:
    print('COULD NOT FIND TABLE')
However, please note that the cookies are probably only good for up to 5 hours maximum (and then you'll need to paste new ones), and that if there are multiple sheets in one spreadsheet, you'll only be able to scrape the first sheet with requests/pandas. So, it would be better to use the API for restricted or multi-sheet spreadsheets.
Related
I'm trying to get specific span tags from all 3 URLs, but the final CSV file only shows the data from the last URL.
Python code:
from selenium import webdriver
from lxml import etree
from bs4 import BeautifulSoup
import time
import pandas as pd
urls = []
for i in range(1, 4):
    if i == 1:
        url = "https://www.coinbase.com/price/s/listed"
        urls.append(url)
    else:
        url = "https://www.coinbase.com/price/s/listed" + f"?page={i}"
        urls.append(url)
print(urls)

for url in urls:
    wd = webdriver.Chrome()
    wd.get(url)
    time.sleep(30)
    resp = wd.page_source
    html = BeautifulSoup(resp, "lxml")
    tr = html.find_all("tr", class_="AssetTableRowDense__Row-sc-14h1499-1 lfkMjy")
    print(len(tr))
    names = []
    for i in tr:
        name1 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 kwgTEs AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderDark-sc-14h1499-17 cWTMKR").text
        name2 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 bjBkPh AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderLight-sc-14h1499-15 AssetTableRowDense__TickerText-sc-14h1499-16 cdqGcC").text
        names.append([name1, name2])

ns = pd.DataFrame(names)
date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
path = "/Users/paul/jpn traffic/coinbase/coinbase"
ns.to_csv(path + date + date + '.csv', index=None)
The results of the two print() calls look fine:
print(urls):['https://www.coinbase.com/price/s/listed', 'https://www.coinbase.com/price/s/listed?page=2', 'https://www.coinbase.com/price/s/listed?page=3']
print(len(tr))
26
30
16
So what's wrong with my code? Why don't I get the full data?
BTW, if I want to run my code on a cloud service every day at a given time, which service works better for me as a beginner Python learner? I don't need to store huge data in the cloud; I just need the Python script to send emails to my inbox, that's it.
Why no data? Because the data is generated dynamically: the site loads it from an API, which is why you can't get it with BeautifulSoup alone. You can easily get the data using the API URL and requests. To find the API URL, open Chrome DevTools, go to the Network tab, filter by XHR, and click the Headers tab to see the URL; click the Preview tab to see the data.
Now the data comes through:
import requests

r = requests.get('https://www.coinbase.com/api/v2/assets/search?base=BDT&country=BD&filter=listed&include_prices=true&limit=30&order=asc&page=2&query=&resolution=day&sort=rank')
coinbase = r.json()['data']
for coin in coinbase:
    print(coin['name'])
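If you want to reproduce the original goal (all 3 pages in one CSV), a minimal sketch could look like the following. Only the 'name' field is confirmed above; the 'symbol' field and the page loop are assumptions on my part, so check them against the Preview tab in DevTools:

import time
import requests
import pandas as pd

base_url = ('https://www.coinbase.com/api/v2/assets/search'
            '?base=BDT&country=BD&filter=listed&include_prices=true'
            '&limit=30&order=asc&query=&resolution=day&sort=rank')

names = []
for page in range(1, 4):  # pages 1-3, as in the original urls list
    r = requests.get(base_url + f'&page={page}')
    r.raise_for_status()
    for coin in r.json()['data']:
        # 'name' is confirmed above; 'symbol' is my assumption for the ticker column
        names.append([coin['name'], coin.get('symbol')])

ns = pd.DataFrame(names, columns=['name', 'ticker'])
date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
ns.to_csv(f'coinbase{date}.csv', index=None)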
I can't get the code below to navigate through the disclaimer page of the website; I think the issue is how I try to collect the cookie.
I want to use requests rather than Selenium.
import requests
import pandas as pd
from pandas import read_html
# open the page with the disclaimer just to get the cookies
disclaimer = "https://umm.gassco.no/disclaimer"
disclaimerdummy = requests.get(disclaimer)
# open the actual page and use the cookies from the fake page opened before
actualpage = "https://umm.gassco.no/disclaimer/acceptDisclaimer"
actualpage2 = requests.get(actualpage, cookies=disclaimerdummy.cookies)
# store the content of the actual page in text format
actualpagetext = (actualpage2.text)
# identify relevant data sources by looking at the 'msgTable' class in the webpage code
# This is where the tables with the realtime data can be found
gasscoflow = read_html(actualpagetext, attrs={"class": "msgTable"})
# create the dataframes for the two relevant tables
Table0 = pd.DataFrame(gasscoflow[0])
Table1 = pd.DataFrame(gasscoflow[1])
Table2 = pd.DataFrame(gasscoflow[2])
Table3 = pd.DataFrame(gasscoflow[3])
Table4 = pd.DataFrame(gasscoflow[4])
Looking at the website, it only has 2 tables, and you could use a session to carry cookies across requests instead of storing them in a variable. Follow the code below to get all your expected data; it prints only the last 2 rows of each table because I used tail, but you can modify it to get whatever you need from those tables.
import requests
import pandas as pd
from pandas import read_html

s = requests.session()
s1 = s.get("https://umm.gassco.no")  # first request stores the session cookies
s2 = s.get("https://umm.gassco.no/disclaimer/acceptDisclaimer?")  # accept the disclaimer with the same session
data = read_html(s2.text, attrs={"class": "msgTable"})
t0 = pd.DataFrame(data[0])
t1 = pd.DataFrame(data[1])
print(t0.tail(2))
print(t1.tail(2))
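If you want to keep the full tables rather than just the last rows, a small follow-up sketch (the CSV file names are my own choice):

# save every parsed table to its own CSV file
for i, frame in enumerate(data):
    frame.to_csv(f'gassco_table{i}.csv', index=False)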
Let me know if you have any questions :)
So for a project I'm working on, I'm creating an API to interface with my school's course finder, and I'm struggling to grab the data from the HTML table it stores the data in without using Selenium. I was able to pull the HTML data initially using Selenium, but my instructor says he would prefer I use the BeautifulSoup4 and MechanicalSoup libraries. I got as far as submitting a search and grabbing the HTML table the data is stored in, but I'm not sure how to iterate through the data in that table the way I did with my Selenium code below.
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import Select
from selenium.webdriver.chrome.options import Options

Chrome_Options = Options()
Chrome_Options.add_argument("--headless")  # allows program to run without opening a chrome window

driver = webdriver.Chrome()
driver.get("https://winnet.wartburg.edu/coursefinder/")  # sets the Selenium driver

select = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Term"))
term_options = select.options
#for index in range(0, len(term_options) - 1):
#    select.select_by_index(index)

lst = []

DeptSelect = Select(driver.find_element_by_id("ctl00_ContentPlaceHolder1_FormView1_DropDownList_Department"))
DeptSelect.select_by_visible_text("History")  # finds the desired department

search = driver.find_element_by_name("ctl00$ContentPlaceHolder1$FormView1$Button_FindNow")
search.click()  # sends query

table_id = driver.find_element_by_id("ctl00_ContentPlaceHolder1_GridView1")
rows = table_id.find_elements_by_tag_name("tr")
for row in rows:  # creates a list of lists containing our data
    col_lst = []
    col = row.find_elements_by_tag_name("td")
    for data in col:
        lst.append(data.text)

def chunk(l, n):  # generator that partitions our lists neatly
    print("chunking...")
    for i in range(0, len(l), n):
        yield l[i:i + n]

n = 16  # each list contains 16 items regardless of contents or search
uberlist = list(chunk(lst, n))  # call chunk fn to partition the list

with open('class_data.txt', 'w') as handler:  # output of scraped data
    print("writing file...")
    for listitem in uberlist:
        handler.write('%s\n' % listitem)

driver.close()  # ends and closes Selenium control over the browser
This is my MechanicalSoup code, and I'm wondering how I can take the data from the HTML in a similar way to what I did above with Selenium.
import mechanicalsoup
import requests
from lxml import html
from lxml import etree
import pandas as pd

def text(elt):
    return elt.text_content().replace(u'\xa0', u' ')

# This will use MechanicalSoup to grab the form, submit it and find the data table
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)

Searchform = browser.select_form()
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
response1 = browser.submit_selected()  # this progresses to the second form

dataURL = browser.get_url()  # get URL of the second form w/ data
dataURL2 = 'https://winnet.wartburg.edu/coursefinder/Results.aspx'
pageContent = requests.get(dataURL2)

tree = html.fromstring(pageContent.content)
dataTable = tree.xpath('//*[@id="ctl00_ContentPlaceHolder1_GridView1"]')

rows = []  # initialize a collection of rows
for row in dataTable[0].xpath(".//tr")[1:]:  # add new rows to the collection
    rows.append([cell.text_content().strip() for cell in row.xpath(".//td")])

df = pd.DataFrame(rows)  # load the collection into a dataframe
print(df)

# XPath to the table
# //*[@id="ctl00_ContentPlaceHolder1_GridView1"]
# //*[@id="ctl00_ContentPlaceHolder1_GridView1"]/tbody
Turns out I was passing the wrong thing when using MechanicalSoup. I passed the new page's contents to a variable called table and used .find('table') on it to retrieve just the table HTML rather than the full page's HTML. From there I used table.get_text().split('\n') to make essentially a giant list of all the rows.
I also dabbled with setting the form filters, which worked as well.
import mechanicalsoup
from bs4 import BeautifulSoup

# Sets the StatefulBrowser object to winnet, then grabs the form
browser = mechanicalsoup.StatefulBrowser()
winnet = "http://winnet.wartburg.edu/coursefinder/"
browser.open(winnet)
Searchform = browser.select_form()

# Selects the submit button and sets the filter options.
Searchform.choose_submit('ctl00$ContentPlaceHolder1$FormView1$Button_FindNow')
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$TextBox_keyword', "")  # Keyword searches by class title. Inputting a string will search by that string, ignoring any stored nonsense in the page.
# ACxxx course codes have 3 spaces after them, THIS IS REQUIRED. Except the 'All' value for not searching by a department does not.
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Department", 'All')  # The department list takes the course codes as inputs and displays the full name
Searchform.set("ctl00$ContentPlaceHolder1$FormView1$DropDownList_Term", "2020 Winter Term")  # Term dropdown takes a string value that is exactly the term date.
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_MeetingTime', 'all')  # Takes the weekly class time as a string. Need to retrieve the list of options from the page
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_EssentialEd', 'none')  # Takes a small string signalling the EE req, or 'all' or 'none'. 'none' doesn't select an option and 'all' selects all courses w/ an EE
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_CulturalDiversity', 'none')  # Cultural Diversity: takes none, C, D or all
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_WritingIntensive', 'none')  # options are none or WI
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_PassFail', 'none')  # Pass/Fail takes 'none' or 'PF'
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$CheckBox_OpenCourses', False)  # Checkbox, it's True or False
Searchform.set('ctl00$ContentPlaceHolder1$FormView1$DropDownList_Instructor', '0')  # 0 is for none selected, otherwise it is a string of numbers (instructor ID?)

# Submits the form and grabs the results.
browser.submit_selected()  # Submits the form. Retrieves results.
table = browser.get_current_page().find('table')  # Finds the result table
print(type(table))
rows = table.get_text().split('\n')  # List of all class rows split by \n.
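If you'd rather end up with a dataframe than a flat list of strings, one option (a sketch on my side, not part of the original approach) is to hand the same table element to pandas.read_html, as in the other answers above:

import pandas as pd

# str(table) is the HTML of the result table found above;
# read_html returns a LIST of dataframes, so take the first one
dfList = pd.read_html(str(table))
classDF = dfList[0]
print(classDF.head())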
I have encountered a problem with the Google Ads report and I have no clue how to fix it... I use the following code to extract the data from Google Ads via an API call.
import sys
from googleads import adwords
import pandas as pd
import numpy as np
import io

output = io.StringIO()

def main(client):
    # Initialize appropriate service.
    report_downloader = client.GetReportDownloader(version='v201809')

    # Create report query.
    report_query = (adwords.ReportQueryBuilder()
                    .Select('CampaignId', 'AdGroupId', 'Id', 'Criteria',
                            'CriteriaType', 'FinalUrls', 'Impressions', 'Clicks',
                            'Cost')
                    .From('CRITERIA_PERFORMANCE_REPORT')
                    .Where('Status').In('ENABLED', 'PAUSED')
                    .During('LAST_7_DAYS')
                    .Build())

    # You can provide a file object to write the output to. For this
    # demonstration we use sys.stdout to write the report to the screen.
    report_downloader.DownloadReportWithAwql(
        report_query, 'CSV', output, skip_report_header=False,
        skip_column_header=False, skip_report_summary=False,
        include_zero_impressions=True)

    output.seek(0)
    df = pd.read_csv(output)
    df = df.to_csv('results.csv')

if __name__ == '__main__':
    # Initialize client object.
    adwords_client = adwords.AdWordsClient.LoadFromStorage()
    main(adwords_client)
The code works as expected, pulls the data, and saves it in a CSV file. However, when I access the columns it prints just one column, 'CRITERIA_PERFORMANCE_REPORT (Nov 5, 2019-Nov 11, 2019)'. When I open the CSV file (results.csv), the report name sits in the first row and the real column names are in the second row.
I have tried to drop the first row with df.drop(df.index[0]) to access the rest of the data, however nothing seems to be working. Is there any way I can remove the first row, or use the second row as the column names, which is the result I expected?
Thanks in advance.
I'm able to eliminate the header there with the following download request:
report_downloader.DownloadReportWithAwql(
    report_query, 'CSV', output, skip_report_header=True,
    skip_column_header=False, skip_report_summary=True,
    include_zero_impressions=True
)
I think if you include skip_report_header=True, skip_report_summary=True you'll get what you want.
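Alternatively, if you keep the report header and summary in the download, you could make pandas skip them on the read side so that the second row becomes the column names. This is a sketch reusing the same output StringIO from your code, not something specific to googleads:

import pandas as pd

output.seek(0)
# skiprows=1 drops the report-name line so the next line becomes the header;
# skipfooter=1 drops the trailing summary row (skipfooter requires the python engine)
df = pd.read_csv(output, skiprows=1, skipfooter=1, engine='python')
df.to_csv('results.csv', index=False)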
I wish to parse the results table from a local sporting event (the page basically just contains a table), but when I try with the script below I just get the "menu", not the actual result list. What am I missing?
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd
site = "https://rittresultater.no/nb/sb_tid/923?pv2=11027&pv1=U"
html = urlopen(site)
soup = BeautifulSoup(html, "lxml") #BeautifulSoup(urlopen(html, "lxml"))
table = soup.select("table")
df = pd.read_html(str(table))[0]
print(df)
This is happening because there are two <table>s on that page. You can either query on the class name of the table you want (in this case .table-condensed) using the class_ parameter of the find() function, or you can just grab the second table in the list of all tables using the find_all() function.
Solution 1:
table = soup.find('table', class_='table-condensed')
print(table)
Solution 2:
tables = soup.find_all('table')
print(tables[1])
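Either way, you can then feed the selected table back into pandas to get the dataframe you were after; a small sketch building on Solution 1:

import pandas as pd

table = soup.find('table', class_='table-condensed')  # the results table from Solution 1
df = pd.read_html(str(table))[0]  # read_html returns a list of dataframes; take the first
print(df.head())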