When you open the URL below in a browser,
http://www.kianfunds2.com/%D8%A7%D8%B1%D8%B2%D8%B4-%D8%AF%D8%A7%D8%B1%D8%A7%DB%8C%DB%8C-%D9%87%D8%A7-%D9%88-%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF-%D9%88%D8%A7%D8%AD%D8%AF-%D9%87%D8%A7
you see a purple icon named "Copy". When you click this icon, you get a complete table that you can paste into Excel. How can I read this table as input in Python?
My code is below, and it shows nothing:
import requests
from bs4 import BeautifulSoup
url = "http://www.kianfunds2.com/" + "ارزش-دارایی-ها-و-تعداد-واحد-ها"
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")
table = soup.find("a", class_="dt-button buttons-copy buttons-html5")
I don't want to use Selenium, because it takes a lot of time. Please use Beautiful Soup.
To me it seems unnecessary to use any sort of web scraping here. Since you can download the data as a file anyway, there is no point in going through the parsing you would need to reconstruct that data by scraping.
Instead, you could just download the data and read it into a pandas DataFrame. You will need pandas installed; if you have Anaconda, you may already have it on your computer, otherwise you can install it with:
conda install pandas
More Information on Installing Pandas
With pandas you can read the data directly from the Excel sheet:
import pandas as pd
df = pd.read_excel("dataset.xlsx")
pandas.read_excel documentation
If this causes difficulties, you can still convert the Excel sheet to a CSV and use pd.read_csv. Note that you will want to use the correct encoding.
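For instance, a sketch of the CSV route (the file name is hypothetical, and utf-8 is a guess that usually fits the Persian headers):
import pandas as pd

# Hypothetical file name; point it at your exported file
df = pd.read_csv("dataset.csv", encoding="utf-8")
print(df.head())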
If you want to use BeautifulSoup for some reason, you might want to look into how to parse tables.
For a normal table, you would first correctly identify the content you want to scrape. The table on that specific website has the id "arzeshdarayi". It is also the only table on that page, so you could just as well select it by its <table> tag:
table = soup.find("table", id="arzeshdarayi")
table = soup.select("#arzeshdarayi")
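For a static table, a minimal sketch for pulling the rows out of the table found above might look like this (using the find variant; note it will not return the data rows on this particular page, for the reason explained below):
rows = []
for tr in table.find_all("tr"):
    # Collect the stripped text of every header/data cell in the row
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)
print(rows)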
However, the table on the website you provided has only a static header; the data is rendered by JavaScript, so BeautifulSoup won't be able to retrieve it. You can, however, request the JSON object that the JavaScript works with
and, once again, read it in as a DataFrame:
import requests
import pandas as pd

r = requests.get("http://www.kianfunds2.com/json/gettables.ashx?get=arzeshdarayi")
data = r.json()
df = pd.DataFrame.from_dict(data)
If you really want to scrape it, you will need some sort of browser simulation, so that the JavaScript is evaluated before you access the HTML.
This answer recommends Requests-HTML, a very high-level approach to web scraping that brings together Requests and BeautifulSoup and renders JavaScript. Your code would look somewhat like this:
from requests_html import HTMLSession

session = HTMLSession()
url = "http://www.kianfunds2.com/ارزش-دارایی-ها-و-تعداد-واحد-ها"
r = session.get(url)

# Render the website, including the JavaScript
# (uses Chromium, which is downloaded on first execution)
r.html.render(sleep=1)

# Find the table by its id and take only the first result
table = r.html.find("#arzeshdarayi")[0]

# Loop through the table rows, extracting the headings and data
for items in table.find("tr"):
    # Take item.text for every header/data cell in the row
    data = [item.text for item in items.find("th,td")[:-1]]
    print(data)
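If you want those rows in a pandas DataFrame instead of printing them, a small extension of the loop above (treating the first row as the header is an assumption about this table's layout):
import pandas as pd

rows = [[item.text for item in items.find("th,td")[:-1]]
        for items in table.find("tr")]

# Assumption: the first row contains the column headers
df = pd.DataFrame(rows[1:], columns=rows[0])
print(df)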
I need to get and store the PM2.5 and PM10 values from the table at https://app.cpcbccr.com/AQI_India/. I use BeautifulSoup4 to scrape the web page, but the parsed HTML I get is different from the actual page: the rows of the table body are missing.
I wrote the code to get the table rows and table data, but since my parsed HTML is missing the rows of the table body, it couldn't find them. For now I only have this code to inspect my parsed HTML:
from bs4 import BeautifulSoup
import requests
import pandas as pd
url = "https://app.cpcbccr.com/AQI_India/"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
with open("Desktop/soup.html", "a") as dumpfile:
    dumpfile.write(str(soup))
How can I get all of the table? Thanks in advance.
Try the code below. I have implemented a scraping script for https://app.cpcbccr.com/AQI_India/ that goes through the site's API. Using requests you can hit the API, and it will send back a result that you then parse as JSON.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
def scrap_air_quality_index():
    payload = 'eyJzdGF0aW9uX2lkIjoic2l0ZV8zMDEiLCJkYXRlIjoiMjAyMC0wNy0yNFQ5OjAwOjAwWiJ9:'
    session = requests.Session()
    response = session.post('https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters', data=payload, verify=False)
    result = json.loads(response.text)
    extracted_metrics = result['metrics']
    print(extracted_metrics)
I found the API URL https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters by checking the API calls in the network section of the browser's developer tools. The request takes an additional mandatory parameter, the payload; without it you will not be able to get the data. You can extend the script to save the data to a .csv or Excel file.
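As a sketch of that extension (assuming you change the function to return extracted_metrics instead of printing it; the exact shape of the metrics object is not verified here):
import pandas as pd

metrics = scrap_air_quality_index()  # now returns result['metrics']
df = pd.DataFrame(metrics)
df.to_csv("aqi_metrics.csv", index=False)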
Is there any logic to which tags should be used in scraping?
Right now I'm just doing trial and error on different tag variations to see which works. It takes a lot of time and is really frustrating. I can't understand the logic as to why some tags work and some don't. For example, the code below works fine:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test1 = soup.find_all('div', attrs={'id':'app'})
print(test1)
However, with just a slight change to the code, the result is empty:
import requests
from bs4 import BeautifulSoup
import pandas as pd
page = requests.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')
soup = BeautifulSoup(page.content, 'html.parser')
test2 = soup.find_all('div', attrs={'id':'YDC-Lead-Stack-Composite'})
print(test2)
Is there any logical explanation why the first example (test1) returns values and why the second example (test2) doesn't return any value?
Is there an efficient way to know which tags will work?
It looks to me like you're trying to scrape a React web app, which is impossible via the usual web-scraping methods.
If you view the raw source (before the scripts are executed), you'll find that the app content is not there, as it is rendered by JavaScript that fetches the data.
There are two options here:
Find out if there is an API you can query (instead of scraping)
Load the page in a browser and use Selenium to scrape (see https://selenium-python.readthedocs.io/getting-started.html); a minimal sketch follows below.
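For the second option, a minimal Selenium sketch might look like this (assuming Selenium 4 and a local Chrome/chromedriver install; the id is the one from your second example):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # needs chromedriver available on your PATH
driver.implicitly_wait(10)   # give the scripts time to render
driver.get('https://finance.yahoo.com/quote/IWDA.AS?p=IWDA.AS&.tsrc=fin-srch')

# By now the JavaScript has run, so the element exists in the live DOM
element = driver.find_element(By.ID, 'YDC-Lead-Stack-Composite')
print(element.text)

driver.quit()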
I have very little experience but a lot of persistence. One of my hobbies is football, and I help out at a local team. I'm a big fan of statistics, and unfortunately the only way to collect data from the local football federation is by web scraping. I have seen that Python with the beautifulsoup package can help, but I cannot identify the tags, as the data I need is, I believe, inside a table.
My goal is to automatically collect the information to build up a database with players, fixtures, teams, referees, ... and build stats such as how many times a player has been in the starting line-up, when teams are most likely to score a goal, when they are most likely to concede one, ...
I have two links for a reference.
The first is for the general fixtures of a given group.
The second is for the details of a given match.
Any clues on where to start with the parsing would be great.
You should first get into web scraping using native Python libraries like requests to contact the page you want. Then, depending on the page, you should use bs4 (BeautifulSoup) to find what you are looking for inside it. Then you want to transform and refine that information into variables (lists or DataFrames, based on your needs) and finally save those variables into a dataset. Here is an example:
import requests
from bs4 import BeautifulSoup  # assuming you already have bs4 installed

page = requests.get('http://www.DESTINATIONSITE.com/BLAHBLAH')  # placeholder URL; the scheme is required
soup = BeautifulSoup(page.text, 'html.parser')
soup_str = str(soup.text)
Up to this point, you have the text of the entire page at hand. Now you want to use regular expressions or other Python code to separate the information you want from soup_str and store it in variables.
For the rest of the process, if you want to save that data, I suggest you look into libraries like pandas.
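For instance, a sketch of those last two steps (the regular expression and column name are made up for illustration, not taken from your pages):
import re
import pandas as pd

# Hypothetical pattern: pick out score-like fragments such as "2 - 1"
scores = re.findall(r'\d+\s*-\s*\d+', soup_str)

df = pd.DataFrame({'score': scores})
df.to_csv('matches.csv', index=False)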
I have made some progress:
# import libraries
import pandas as pd
import requests
from bs4 import BeautifulSoup

# send request
url = 'http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# read acta
acta_text = []
acta_text_element = soup.find_all(class_='acta-table')
for item in acta_text_element:
    acta_text.append(item.text)
When I print the items, I get many \n characters.
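One way to avoid the stray newlines is to let BeautifulSoup strip the whitespace per element, e.g.:
# get_text(strip=True) trims each string; the separator keeps cells apart
for item in acta_text_element:
    acta_text.append(item.get_text(separator=' ', strip=True))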
Mahdi_J thanks for the initial response. I have already started the approach:
import requests
from bs4 import BeautifulSoup
r = requests.get("http://fcf.cat/acta/1920/futbol-11/infantil-primera-divisio/grup-11/1i/sant-ildefons-ue-b/1i/lhospitalet-centre-esports-c", headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
soup = BeautifulSoup(c,"html.parser")
What I need is some support on where to start with the parsing. From the URL I need to collect the players for both teams, the goals, and the bookings, to save them elsewhere. What I'm having trouble with is how to parse; I have very little skill and need a starting point.
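As a starting point, a sketch along these lines might help (acta-table is the class from the earlier snippet; which of the tables holds players, goals, or bookings is an assumption you will need to verify in the page source):
tables = soup.find_all('table', class_='acta-table')

# Dump every table's rows as clean cell lists to see which table is which
for i, tbl in enumerate(tables):
    print(f'--- table {i} ---')
    for row in tbl.find_all('tr'):
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        print(cells)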
I am fairly new to using Python to collect data from the web. I am interested in writing a script that collects data from an XML web page. Here is the address:
https://www.w3schools.com/xml/guestbook.asp
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("/guestbook/guest/fname")
print(guest)
I am not certain why this returns an empty list. I've tried numerous variations of the xpath expression, so I'm losing confidence that my overall structure is correct.
For context, I want to write something that will parse the entire XML page and produce a CSV that can be used in other programs. I'm starting with the basics to make sure I understand how the various packages work. Thank you for any help.
This should do it
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("//guestbook/guest/fname")
for i in guest:
    print(i.text)
In the xpath, you need a double slash (//) at the beginning. Also, this returns a list of elements; the text of each element can be extracted using .text.
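Since you mentioned wanting a CSV in the end, a sketch of that step (only the fname field here; you would extend the xpath for the other fields):
import csv

with open("guestbook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fname"])
    for i in guest:
        writer.writerow([i.text])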
I can't get a JavaScript table with BeautifulSoup; it returns an empty array.
I tried to get data from this page.
https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Historical-Daily?sc_lang=en#select4=1&select5=2&select3=0&select2=3&select1=24
import requests, json
text = requests.get("https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Historical-Daily?sc_lang=en#select4=0&select5=2&select3=0&select2=3&select1=24")
data = json.loads(text)
print(data['Scty'])
There is another URL you can use, found by looking at the network tab. A little string manipulation on the response text gives you a string that can be loaded with json and that contains everything on the page (including all four drop-down geographies). There is no need for bs4; you can extract everything you want with the json library.
import requests
import json
r = requests.get('https://www.hkex.com.hk/eng/csm/DailyStat/data_tab_daily_20190425e.js?_=1556252093686')
data = json.loads(r.text.replace('tabData = ',''))
For example, you can then drill down through the parsed object to reach the first row of the table on the landing page.
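A sketch for exploring the structure before committing to a path (no key names are assumed here; you discover them as you go):
# Look at the top-level keys first
print(data.keys())

# Then inspect each value's type to decide where to drill down next
for key, value in data.items():
    print(key, type(value))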