Iterate over all IDs and crawl each page's table into a dataframe in Python - python-3.x

I want to click into each item on each page from this link, then crawl all the info from that page (the part in the red circle) and save it in dataframe format.
In order to loop over all the pages, I need to iterate over all the Ids in the second link, but I don't know where to get them.
https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=956244
I started with the following code, but I get a "No tables found" error even though I can see the table in the page source:
import requests
import json
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
url = 'https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=1044647'
website_url = requests.get(url, verify=False).text
soup = BeautifulSoup(website_url, 'html.parser')
table = soup.find('table', {'class': 'ewb-claim table'})
# https://stackoverflow.com/questions/51090632/python-excel-export
df = pd.read_html(str(table))[0]
print(df.head(5))
Output:
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/urllib3/connectionpool.py:858: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
InsecureRequestWarning)
Traceback (most recent call last):
File "<ipython-input-13-d95369a5d235>", line 12, in <module>
df = pd.read_html(str(table))[0]
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 915, in read_html
keep_default_na=keep_default_na)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/io/html.py", line 749, in _parse
raise_with_traceback(retained)
File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pandas/compat/__init__.py", line 385, in raise_with_traceback
raise exc.with_traceback(traceback)
ValueError: No tables found
Could someone help? Thanks.
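For what it's worth, the traceback hides what actually went wrong: if soup.find matches nothing it returns None, and str(None) is the literal string 'None', so read_html has no table to parse. A minimal sketch that makes this visible, assuming the table is simply absent from the HTML that requests receives (e.g. it is rendered by JavaScript):
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.pudong.gov.cn/shpd/InfoOpen/InfoDetail.aspx?Id=1044647'
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')

# Guard the lookup: str(None) is 'None', which read_html cannot parse
table = soup.find('table', {'class': 'ewb-claim table'})
if table is None:
    print('Table not in the fetched HTML; it may be rendered client-side.')
else:
    df = pd.read_html(str(table))[0]
    print(df.head(5))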

Related

How can I grab a URL located in the Network > Headers tab without Selenium?

So I want to grab a URL located in the Network tab > Headers, but the specified URL is nowhere to be seen.
Here is my code so far:
import requests
from bs4 import BeautifulSoup
link = 'https://www.tiktok.com/#docfrankhere/video/7128498731908893995?is_from_webapp=1&sender_device=pc&web_id=7111780332651578886'
r = requests.get(link)
response = r.status_code
head = r.headers['Request URL']
print(head)
Output
Traceback (most recent call last):
File "C:\Users\SEBAA\Desktop\copyu.py", line 10, in <module>
head = r.headers['Request URL']
File "C:\Users\SEBAA\AppData\Local\Programs\Python\Python39\lib\site-packages\requests\structures.py", line 54, in __getitem__
return self._store[key.lower()][1]
KeyError: 'request url'
So even when I do
head = r.headers
print(head)
the output doesn't show me the URL I want to grab.
I'm new to this; any help would be appreciated.
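A note on what the Network tab shows, with a minimal sketch: "Request URL" is browser UI, not a response header, so r.headers will never contain it. requests exposes the URL the request ended up at directly on the Response object:
import requests

link = 'https://www.tiktok.com/#docfrankhere/video/7128498731908893995?is_from_webapp=1&sender_device=pc&web_id=7111780332651578886'
r = requests.get(link)

# The final URL after any redirects
print(r.url)
# Each redirect hop along the way, if there were any
for resp in r.history:
    print(resp.status_code, resp.url)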

Scraping values using Python

I want to scrape the "Short interest" and "% of float shorted" values from https://www.marketwatch.com/investing/stock/aapl. I have tried using XPath, but it returns an error.
import requests
from lxml import html
import pandas as pd
import numpy as np
url = "https://www.marketwatch.com/investing/stock/aapl"
page = requests.get(url)
tree = html.fromstring(page.content)
per_of_float_shorted = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
short_interest = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
print(short_interest)
print(per_of_float_shorted)
Traceback (most recent call last):
File "scraper.py", line 11, in <module>
per_of_float_shorted = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')[0].text()
IndexError: list index out of range
Thank you.
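Two things worth checking, sketched below: lxml elements expose .text as an attribute (or the .text_content() method), not a .text() call, and an empty XPath result means the fetched markup differs from what the browser's DevTools show. A browser-like User-Agent sometimes helps; the header value here is an assumption, not anything MarketWatch documents:
import requests
from lxml import html

url = "https://www.marketwatch.com/investing/stock/aapl"
# Hypothetical User-Agent; some sites serve different markup to bare clients
page = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
tree = html.fromstring(page.content)

nodes = tree.xpath('/html/body/div[4]/div[5]/div[1]/div[1]/div/ul/li[14]/span[1]')
if not nodes:
    print("XPath matched nothing; the fetched HTML differs from the browser's DOM")
else:
    # .text_content() gathers the element's text; .text() would raise TypeError
    print(nodes[0].text_content())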

How to read an HTML table as a dataframe (urllib.error.URLError: <urlopen error unknown url type: https>)?

I would appreciate it if you could let me know how to convert an HTML table into a dataframe.
import pandas as pd
df = pd.read_html('https://www.iasplus.com/en/resources/ifrs-topics/use-of-ifrs', header = None)
Error:
C:\Users\t\Anaconda3\python.exe C:/Users/t/Downloads/hyperopt12.py
Traceback (most recent call last):
File "C:/Users/t/Downloads/hyperopt12.py", line 12, in <module>
df = pd.read_html('https://www.iasplus.com/en/resources/ifrs-topics/use-of-ifrs', header = None)
File "C:\Users\t\Anaconda3\lib\site-packages\pandas\io\html.py", line 1094, in read_html
displayed_only=displayed_only)
File "C:\Users\t\Anaconda3\lib\site-packages\pandas\io\html.py", line 916, in _parse
raise_with_traceback(retained)
File "C:\Users\t\Anaconda3\lib\site-packages\pandas\compat\__init__.py", line 420, in raise_with_traceback
raise exc.with_traceback(traceback)
urllib.error.URLError: <urlopen error unknown url type: https>
Thanks in advance.
You need to find the right table on the page to read. read_html returns a list of DataFrame objects. See the documentation here.
import pandas as pd
tables = pd.read_html('https://www.iasplus.com/en/resources/ifrs-topics/use-of-ifrs', header = None)
df = tables[2]
df
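If the URLError persists (for example because of a stray character in the URL string, or an environment where urllib cannot open https URLs), a common workaround is to fetch the page with requests and hand the HTML string to read_html; a sketch:
import requests
import pandas as pd

url = 'https://www.iasplus.com/en/resources/ifrs-topics/use-of-ifrs'
html = requests.get(url).text          # requests handles the https transport
tables = pd.read_html(html, header=None)  # parse the fetched markup directly
df = tables[2]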

Python requests.get giving InvalidSchema error

Another one for you.
Trying to scrape a list of URLs from a CSV file. This is my code:
from bs4 import BeautifulSoup
import requests
import csv
with open('TeamRankingsURLs.csv', newline='') as f_urls, open('TeamRankingsOutput.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    for line in csv_urls:
        page = requests.get(line[0]).text
        soup = BeautifulSoup(page, 'html.parser')
        results = soup.findAll('div', {'class' :'LineScoreCard__lineScoreColumnElement--1byQk'})
        for r in range(len(results)):
            csv_output.writerow([results[r].text])
...Which gives me the following error:
Traceback (most recent call last):
File "TeamRankingsScraper.py", line 11, in <module>
page = requests.get(line[0]).text
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 72, in get
return request('get', url, params=params, **kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\api.py", line 58, in request
return session.request(method=method, url=url, **kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 512, in request
resp = self.send(prep, **send_kwargs)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 616, in send
adapter = self.get_adapter(url=request.url)
File "C:\Users\windowshopr\AppData\Local\Programs\Python\Python36\lib\site-packages\requests\sessions.py", line 707, in get_adapter
raise InvalidSchema("No connection adapters were found for '%s'" % url)
requests.exceptions.InvalidSchema: No connection adapters were found for 'https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15'
My CSV file is just a list of several URLs in column A (i.e. https://www...).
(The div class I'm trying to scrape doesn't exist on that page, but that's not where the problem is. At least I don't think so. I just need to update that once I can get it to read from the CSV file.)
Any suggestions? This code works on another project, but for some reason I'm having issues with this new URL list. Thanks a lot!
From the traceback: requests.exceptions.InvalidSchema: No connection adapters were found for 'https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15'
There is a stray (invisible) character at the start of that URL; it should begin exactly with https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15
So first parse the CSV and strip any stray characters before the http/https with a regex. That should solve your problem.
If you want to resolve the problem with this particular URL while reading the CSV, do:
import regex as re
strin = "https://www.teamrankings.com/mlb/stat/runs-per-game?date=2018-04-15"
re.sub(r'.*http', 'http', strin)
This will give you a correct URL that requests can handle.
Since you asked for the full fix with the path accessible in the loop, here's what you could do:
from bs4 import BeautifulSoup
import requests
import csv
import regex as re
with open('TeamRankingsURLs.csv', newline='') as f_urls, open('TeamRankingsOutput.csv', 'w', newline='') as f_output:
    csv_urls = csv.reader(f_urls)
    csv_output = csv.writer(f_output)
    for line in csv_urls:
        page = re.sub(r'.*http', 'http', line[0])
        page = requests.get(page).text
        soup = BeautifulSoup(page, 'html.parser')
        results = soup.findAll('div', {'class' :'LineScoreCard__lineScoreColumnElement--1byQk'})
        for r in range(len(results)):
            csv_output.writerow([results[r].text])
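As a follow-up sketch: the stray character is very often a UTF-8 byte order mark at the start of the file (Excel adds one when saving CSVs), and opening the file with the utf-8-sig codec strips it without any regex:
import csv

with open('TeamRankingsURLs.csv', newline='', encoding='utf-8-sig') as f_urls:
    for line in csv.reader(f_urls):
        url = line[0].strip()  # utf-8-sig has already removed any BOM
        print(url)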

"NameError: name 'datetime' is not defined" with datetime imported

I know there are a lot of "datetime is not defined" posts, but they all seem to involve forgetting the obvious import of datetime. I can't figure out why I'm getting this error. When I run each step in IPython it works fine, but the method doesn't:
import requests
import datetime
def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    datetime.datetime.fromtimestamp()
    return df
This code produces this error:
Traceback (most recent call last):
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2963, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
File "<ipython-input-29-4f015e05113f>", line 1, in <module>
rv.get_prices(30, 'ETH')
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 161, in get_prices
for symbol in symbols:
File "C:\Users\20115619\Desktop\projects\testDash\Revas.py", line 50, in daily_price_historical
df = pd.DataFrame(data)
File "C:\Users\20115619\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 4372, in __getattr__
return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'time'
df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
I think that line is the problem.
Your DataFrame df at the end of that line doesn't have the attribute .time.
For what it's worth I'm on Python 3.6.0 and this runs perfectly for me:
import requests
import datetime
import pandas as pd
def daily_price_historical(symbol, comparison_symbol, limit=1, aggregate=1, exchange='', allData='true'):
    url = 'https://min-api.cryptocompare.com/data/histoday?fsym={}&tsym={}&limit={}&aggregate={}&allData={}'\
        .format(symbol.upper(), comparison_symbol.upper(), limit, aggregate, allData)
    if exchange:
        url += '&e={}'.format(exchange)
    page = requests.get(url)
    data = page.json()['Data']
    df = pd.DataFrame(data)
    df['timestamp'] = [datetime.datetime.fromtimestamp(d) for d in df.time]
    # I don't have the following function, but it's not needed to run this
    # datetime.datetime.fromtimestamp()
    return df
df = daily_price_historical('BTC', 'ETH')
print(df)
Note: I commented out the line that calls an external function I do not have. Perhaps you have a global variable causing a problem?
Update as per the comments:
I'd use join instead to make the URL:
url = "".join(["https://min-api.cryptocompare.com/data/histoday?fsym=", str(symbol.upper()), "&tsym=", str(comparison_symbol.upper()), "&limit=", str(limit), "&aggregate=", str(aggregate), "&allData=", str(allData)])
