Why not full data? - python-3.x

I'm trying to get all the specific span tags from all 3 URLs, but the final CSV file only shows the data from the last URL.
Python code
from selenium import webdriver
from lxml import etree
from bs4 import BeautifulSoup
import time
import pandas as pd

urls = []
for i in range(1, 4):
    if i == 1:
        url = "https://www.coinbase.com/price/s/listed"
        urls.append(url)
    else:
        url = "https://www.coinbase.com/price/s/listed" + f"?page={i}"
        urls.append(url)
print(urls)

for url in urls:
    wd = webdriver.Chrome()
    wd.get(url)
    time.sleep(30)
    resp = wd.page_source
    html = BeautifulSoup(resp, "lxml")
    tr = html.find_all("tr", class_="AssetTableRowDense__Row-sc-14h1499-1 lfkMjy")
    print(len(tr))
    names = []
    for i in tr:
        name1 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 kwgTEs AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderDark-sc-14h1499-17 cWTMKR").text
        name2 = i.find("span", class_="TextElement__Spacer-hxkcw5-0 cicsNy Header__StyledHeader-sc-1xiyexz-0 bjBkPh AssetTableRowDense__StyledHeader-sc-14h1499-14 AssetTableRowDense__StyledHeaderLight-sc-14h1499-15 AssetTableRowDense__TickerText-sc-14h1499-16 cdqGcC").text
        names.append([name1, name2])

ns = pd.DataFrame(names)
date = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
path = "/Users/paul/jpn traffic/coinbase/coinbase"
ns.to_csv(path + date + date + '.csv', index=None)
The output of the two print() calls looks fine:
print(urls):['https://www.coinbase.com/price/s/listed', 'https://www.coinbase.com/price/s/listed?page=2', 'https://www.coinbase.com/price/s/listed?page=3']
print(len(tr))
26
30
16
So what's wrong with my code? Why don't I get the full data?
BTW, if I want to run my code on a cloud service every day at a given time, which option works best for a beginner Python learner? I don't need to store huge amounts of data in the cloud; I just need the Python script to send an email to my inbox, that's it.

Why no data? The data is loaded from a back-end API, so it isn't in the HTML that BeautifulSoup sees. You can easily get the data using the API URL and requests. To find the API URL, open Chrome DevTools, go to the Network tab, filter by XHR, click the request and check the Headers tab for the URL; the Preview tab shows the data.
Now the data comes through:
import requests

r = requests.get('https://www.coinbase.com/api/v2/assets/search?base=BDT&country=BD&filter=listed&include_prices=true&limit=30&order=asc&page=2&query=&resolution=day&sort=rank')
coinbase = r.json()['data']
for coin in coinbase:
    print(coin['name'])
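If you want all three pages of the original question in a single CSV, you can loop over the page parameter of the same endpoint and collect everything before writing the file. A minimal sketch, assuming the page parameter pages through the same listing as above; the output filename is just an example:

import requests
import pandas as pd

api_url = ('https://www.coinbase.com/api/v2/assets/search'
           '?base=BDT&country=BD&filter=listed&include_prices=true'
           '&limit=30&order=asc&page={}&query=&resolution=day&sort=rank')

names = []
for page in range(1, 4):
    resp = requests.get(api_url.format(page))
    resp.raise_for_status()          # fail loudly if a page can't be fetched
    for coin in resp.json()['data']:
        names.append(coin['name'])   # 'name' is the field used in the snippet above

pd.DataFrame(names, columns=['name']).to_csv('coinbase_all_pages.csv', index=False)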

Related

Read a table from google docs without using API

I want to read a Google Sheets table in Python, but without using the API.
I tried BytesIO and BeautifulSoup.
I know about the solution with gspread, but I need to read the table without a token, using only the URL.
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
from lxml.etree import tostring
from io import BytesIO
import requests
req=requests.get('https://docs.google.com/spreadsheets/d/sheetId/edit#gid=', auth=('email', 'password'))
page = req.text
Here I just get HTML code like <!doctype html><html lang="en-US" dir="ltr"><head><base href="h and so on...
I also tried BeautifulSoup, but the result is the same.
For reading a table from HTML, you can use pandas.read_html.
If it's an unrestricted spreadsheet like this one, you probably don't even need requests - you can just pass the URL directly to read_html.
import pandas as pd

sheetId = '1bQo1an4yS1tSOMDhmUTGYtUlgnHDQ47EmIcj4YyuIxo'  ## REPLACE WITH YOUR SHEETID
sheetUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}'
sheetDF = pd.read_html(
    sheetUrl, attrs={'class': 'waffle'}, skiprows=1
)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')
If it's not unrestricted, then the way you're using requests.get would not work either, because you're not passing the auth argument correctly - and I don't think there is any way to log in to Google with just requests.auth. You could log in to Drive in your browser, open a sheet, copy the request to https://curlconverter.com/ and paste the generated cookies and headers into your code from there.
import pandas as pd
import requests
from bs4 import BeautifulSoup

sheetUrl = 'YOUR_SHEET_URL'
cookies = {}  # PASTE_FROM_https://curlconverter.com/ (a dict of cookies)
headers = {}  # PASTE_FROM_https://curlconverter.com/ (a dict of headers)

req = requests.get(sheetUrl, cookies=cookies, headers=headers)

# in one line, no error-handling:
# sheetDF = pd.read_html(req.text, attrs={'class': 'waffle'}, skiprows=1)[0].drop(['1'], axis=1).dropna(axis=0, how='all').dropna(axis=1, how='all')

# req.raise_for_status()  # raise an error if the request fails
if req.status_code != 200:
    print(req.status_code, req.reason, '- failed to get', sheetUrl)

soup = BeautifulSoup(req.content, 'html.parser')
wTable = soup.find('table', class_="waffle")
if wTable is not None:
    dfList = pd.read_html(str(wTable), skiprows=1)  # skiprows=1 skips the top row with column names A, B, C...
    sheetDF = dfList[0]  # read_html returns a LIST of dataframes
    sheetDF = sheetDF.drop(['1'], axis='columns')  # drop the column of row numbers 1, 2, 3...
    sheetDF = sheetDF.dropna(axis='rows', how='all')  # drop empty rows
    sheetDF = sheetDF.dropna(axis='columns', how='all')  # drop empty columns
    ### WHATEVER ELSE YOU WANT TO DO WITH THE DATAFRAME ###
else:
    print('COULD NOT FIND TABLE')
However, please note that the cookies are probably only good for up to 5 hours at most (after which you'll need to paste new ones), and that if a spreadsheet has multiple sheets, you'll only be able to scrape the first one with requests/pandas. So it's better to use the API for restricted or multi-sheet spreadsheets.
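As a side note (not part of the original answer): for unrestricted spreadsheets there is also the CSV export endpoint that public Google Sheets expose, which avoids HTML parsing entirely. A minimal sketch; the sheet id and gid are placeholders you'd replace with your own:

import pandas as pd

sheetId = 'YOUR_SHEET_ID'
gid = 0  # the number after #gid= in the sheet URL (the first tab is usually 0)
exportUrl = f'https://docs.google.com/spreadsheets/d/{sheetId}/export?format=csv&gid={gid}'
sheetDF = pd.read_csv(exportUrl)  # works only if the sheet is shared as "anyone with the link"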

Output from web scraping with bs4 returns empty lists

I am trying to scrape specific information from a 25-page website, but when I run my code I get empty lists. My output is supposed to be a dictionary with the scraped information. Any help would be appreciated.
# Loading libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd
import mitosheet

# Assigning column names using class_ names
name_selector = "af885_1iPzH"
old_price_selector = "f6eb3_1MyTu"
new_price_selector = "d7c0f_sJAqi"
discount_selector = "._6c244_q2qap"

# Placeholder list
data = []

# Looping over each page
for i in range(1, 26):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    website = requests.get(url)
    soup = BeautifulSoup(website.content, 'html.parser')
    name = soup.select(name_selector)
    old_price = soup.select(old_price_selector)
    new_price = soup.select(new_price_selector)
    discount = soup.select(discount_selector)

    # Combining the elements into a zipped list to be able to pull the data simultaneously
    for names, old_prices, new_prices, discounts in zip(name, old_price, new_price, discount):
        dic = {"Phone Names": names.getText(), "New Prices": new_prices.getText(), "Old Prices": old_prices.getText(), "Discounts": discounts.getText()}
        data.append(dic)

data
I tested the below and it works for me, returning 40 name values.
I wasn't able to get the values using Beautiful Soup, only directly through Selenium.
If you decide to use Chrome and PyCharm as I have, then:
Open Chrome, click the three dots near the top right, then Settings and About Chrome to see your Chrome version. Download the corresponding driver here and save it in the PyCharm PATH folder.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

# Assigning column names using class_ names
name_selector = "af885_1iPzH"

# Looping over each page
for i in range(1, 27):
    url = "https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i)
    driver.get(url)
    xPath = './/*[@class="' + name_selector + '"]'
    name = driver.find_elements(By.XPATH, xPath)
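Those WebElement objects still need .text to become usable strings, and since the page is JavaScript-rendered it helps to give it a moment before reading them. A minimal sketch continuing the snippet above; the wait time and output filename are assumptions, not part of the original answer:

import time
import pandas as pd

all_names = []
for i in range(1, 27):
    driver.get("https://www.konga.com/category/phones-tablets-5294?brand=Samsung&page=" + str(i))
    time.sleep(3)  # crude wait for the JavaScript-rendered listings; an explicit wait would be more robust
    elements = driver.find_elements(By.XPATH, './/*[@class="' + name_selector + '"]')
    all_names.extend(el.text for el in elements)

driver.quit()
pd.DataFrame(all_names, columns=["Phone Names"]).to_csv("konga_names.csv", index=False)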

Scraping Site Data without Selenium

Currently I am trying to pull CMS historical data from their site. I have working code that pulls the download links from the page. My problem is that the links are split across multiple pages, and I need to iterate through all the available pages to extract every download link. The obvious choice would be to use Selenium to click through to the next pages, but due to company policy I cannot run Selenium in this environment. Is there a way to go through the pages and extract the links without it? The website does not expose the POST link once you try to go to the next page, and I am out of ideas for reaching the next page without that link or Selenium.
Current working code to pull the links from the first page:
import pandas as pd
from datetime import datetime
#from selenium import webdriver
from lxml import html
import requests

def http_request_get(url, session=None, payload=None, parse=True):
    """ Sends a GET HTTP request to a website and returns its HTML content and full url address. """
    if payload is None:
        payload = {}
    if session:
        content = session.get(url, params=payload, verify=False, headers={"content-type": "text"})
    else:
        content = requests.get(url, params=payload, verify=False, headers={"content-type": "text"})
    content.raise_for_status()  # Raise HTTPError for bad requests (4xx or 5xx)
    if parse:
        return html.fromstring(content.text), content.url
    else:
        return content.text, content.url

def get_html(link):
    """
    Returns parsed html.
    """
    page_parsed, _ = http_request_get(url=link, payload={'t': ''}, parse=True)
    return page_parsed

cmslink = "https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report"
content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)

linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
headers = linkTable[0].xpath('//a/@href')
df1 = pd.DataFrame(headers, columns=['links'])
df1SubSet = df1[df1['links'].str.contains('contract-summary', case=False)]
These are the two URLs that will give you all 166 entries. I have also changed the condition for capturing the hrefs. Give this a try.
cmslinks = [
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=0',
    'https://www.cms.gov/Research-Statistics-Data-and-Systems/Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/Monthly-Contract-and-Enrollment-Summary-Report?items_per_page=100&items_per_page_options%5B5%5D=5%20per%20page&items_per_page_options%5B10%5D=10%20per%20page&items_per_page_options%5B25%5D=25%20per%20page&items_per_page_options%5B50%5D=50%20per%20page&items_per_page_options%5B100%5D=100%20per%20page&combine=&page=1']

df = pd.DataFrame()
for cmslink in cmslinks:
    print(cmslink)
    content, _ = http_request_get(url=cmslink, payload={'t': ''}, parse=True)
    linkTable = content.cssselect('td[headers="view-dlf-1-title-table-column"]')[0]
    headers = linkTable[0].xpath("//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    df1 = pd.DataFrame(headers, columns=['links'])
    df = df.append(df1)
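If you don't want to hard-code the two URLs, you can keep requesting the next page until a page no longer has matching links. A minimal sketch reusing http_request_get from above; it assumes the ?page= parameter keeps working when the long items_per_page_options parameters are dropped:

rows = []
page = 0
while True:
    pagedLink = ('https://www.cms.gov/Research-Statistics-Data-and-Systems/'
                 'Statistics-Trends-and-Reports/MCRAdvPartDEnrolData/'
                 'Monthly-Contract-and-Enrollment-Summary-Report'
                 f'?items_per_page=100&combine=&page={page}')
    content, _ = http_request_get(url=pagedLink, payload={'t': ''}, parse=True)
    hrefs = content.xpath("//td[@headers='view-dlf-1-title-table-column']"
                          "//a[contains(text(),'Contract Summary') or contains(text(),'Monthly Enrollment by CPSC')]/@href")
    if not hrefs:  # no matching links on this page, so we've run out of pages
        break
    rows.extend(hrefs)
    page += 1

df = pd.DataFrame(rows, columns=['links'])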

How can I scrape data that isn't in the page source?

scrape.py
# code to scrape the links from the html
from bs4 import BeautifulSoup
import urllib.request

data = open('scrapeFile', 'r')
html = data.read()
data.close()

soup = BeautifulSoup(html, features="html.parser")

# code to extract links
links = []
for div in soup.find_all('div', {'class': 'main-bar z-depth-1'}):
    # print(div.a.get('href'))
    links.append('https://godamwale.com' + str(div.a.get('href')))
print(links)

file = open("links.txt", "w")
for link in links:
    file.write(link + '\n')
    print(link)
I have successfully got the list of links using this code. But when I try to scrape the data from those links, their HTML pages don't contain the data in the page source, which makes extracting it hard. I have used the Selenium driver, but it didn't work well for me.
I want to scrape the data from the link below, which has HTML sections containing customer details, licence and automation, commercial details, floor-wise details and operational details. I want to extract these data with name, location, contact number and type.
https://godamwale.com/list/result/591359c0d6b269eecc1d8933
If someone finds a solution, please share it.
Using Developer Tools in your browser, you'll notice that whenever you visit that link there is a request to https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933 which returns a JSON response that probably contains the data you're looking for.
Python 2.x:
import urllib2, json
contents = json.loads(urllib2.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read())
print contents
Python 3.x:
import urllib.request, json
contents = json.loads(urllib.request.urlopen("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933").read().decode('UTF-8'))
print(contents)
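Since the rest of this thread already uses requests, the same JSON endpoint can of course also be fetched with it; a minimal equivalent sketch:

import requests

resp = requests.get("https://godamwale.com/public/warehouse/591359c0d6b269eecc1d8933")
resp.raise_for_status()  # raise an error if the endpoint doesn't return 200
contents = resp.json()
print(contents)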
Here you go. The main problem with the site seems to be that it takes time to load, which is why it was returning an incomplete page source. You have to wait until the page loads completely; notice the time.sleep(8) line in the code below:
from bs4 import BeautifulSoup
import requests
from selenium import webdriver
import time

CHROMEDRIVER_PATH = r"C:\Users\XYZ\Downloads\Chromedriver.exe"
wd = webdriver.Chrome(CHROMEDRIVER_PATH)
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")
time.sleep(8)  # wait until the page loads completely

soup = BeautifulSoup(wd.page_source, 'lxml')
props_list = []
propvalues_list = []

div = soup.find_all('div', {'class': 'row'})
for childtags in div[6].findChildren('div', {'class': 'col s12 m4 info-col'}):
    props = childtags.find("span").contents
    props_list.append(props)
    propvalue = childtags.find("p", recursive=True).contents
    propvalues_list.append(propvalue)

print(props_list)
print(propvalues_list)
Note: the code will return the construction details in 2 separate lists.
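A fixed time.sleep(8) either wastes time or is too short on a slow connection; an explicit wait is more reliable. A minimal sketch, assuming the same 'info-col' class marks the content that needs to finish loading (the driver path is the same placeholder as above):

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wd = webdriver.Chrome(r"C:\Users\XYZ\Downloads\Chromedriver.exe")
wd.get("https://godamwale.com/list/result/591359c0d6b269eecc1d8933")

# wait up to 20 seconds for the detail columns to appear instead of sleeping a fixed 8 seconds
WebDriverWait(wd, 20).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, "div.col.s12.m4.info-col"))
)
soup = BeautifulSoup(wd.page_source, 'lxml')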

Web Scraping reviews - Flipkart

I am trying to extract the entire review of a product (the remaining half of each review is only displayed after clicking Read More), but I am still not able to do so: it does not show the full content of a review, which only appears after clicking the Read More option. Below is the code, which clicks the Read More option and also gets data from the website.
import requests
from bs4 import BeautifulSoup
from selenium import webdriver

response = requests.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")
data = BeautifulSoup(response.content, 'lxml')

chromepath = r"C:\Users\Mohammed\Downloads\chromedriver.exe"
driver = webdriver.Chrome(chromepath)
driver.get("https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page=2&pid=MOBF85V7A6PXETAX")

d = driver.find_element_by_class_name("_1EPkIx")
d.click()

title = data.find_all("p", {"class": "_2xg6Ul"})
text1 = data.find_all("div", {"class": "qwjRop"})
name = data.find_all("p", {"class": "_3LYOAd _3sxSiS"})

for t2, t, t1 in zip(title, text1, name):
    print(t2.text, '\n', t.text, '\n', t1.text)
To get the full reviews, it is necessary to click those READ MORE buttons to unwrap the rest. As you have already used Selenium in combination with BeautifulSoup, I've modified the script to follow that logic. The script first clicks the READ MORE buttons; once that is done, it parses all the titles and reviews. You can now get the titles and reviews from multiple pages (up to 4 pages).
import time
from bs4 import BeautifulSoup
from selenium import webdriver

link = "https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page={}&pid=MOBF85V7A6PXETAX"
driver = webdriver.Chrome()  # If necessary, define the chrome path explicitly

for page_num in range(1, 5):
    driver.get(link.format(page_num))
    [item.click() for item in driver.find_elements_by_class_name("_1EPkIx")]
    time.sleep(1)

    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("._3DCdKt"):
        title = items.select_one("p._2xg6Ul").text
        review = ' '.join(items.select_one(".qwjRop div:nth-of-type(2)").text.split())
        print(f'{title}\n{review}\n')

driver.quit()
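If you'd rather keep the results than print them, you can collect the title/review pairs and write them to a CSV, as done elsewhere in this thread. A minimal self-contained variant of the loop above; the output filename is just an example:

import time
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver

link = "https://www.flipkart.com/poco-f1-graphite-black-64-gb/product-reviews/itmf8fyjyssnt25c?page={}&pid=MOBF85V7A6PXETAX"
driver = webdriver.Chrome()

rows = []
for page_num in range(1, 5):
    driver.get(link.format(page_num))
    [item.click() for item in driver.find_elements_by_class_name("_1EPkIx")]
    time.sleep(1)
    soup = BeautifulSoup(driver.page_source, 'lxml')
    for items in soup.select("._3DCdKt"):
        rows.append({
            "title": items.select_one("p._2xg6Ul").text,
            "review": ' '.join(items.select_one(".qwjRop div:nth-of-type(2)").text.split()),
        })
driver.quit()

pd.DataFrame(rows).to_csv("flipkart_reviews.csv", index=False)  # example output file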
