Web-scraping a JavaScript table with Python BeautifulSoup - python-3.x

I can't scrape a JavaScript-rendered table with BeautifulSoup; it keeps coming back empty.
I tried to get data from this page:
https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Historical-Daily?sc_lang=en#select4=1&select5=2&select3=0&select2=3&select1=24
import requests, json

# Note: this approach fails because the URL returns the HTML page,
# not JSON, so json.loads() has nothing valid to parse.
text = requests.get("https://www.hkex.com.hk/Mutual-Market/Stock-Connect/Statistics/Historical-Daily?sc_lang=en#select4=0&select5=2&select3=0&select2=3&select1=24")
data = json.loads(text.text)
print(data['Scty'])

There is another URL you can use, found by looking at the network tab. A little string manipulation on the response text gives you a string that can be loaded with json and contains everything on the page (including all four drop-down geographies). There is no need for bs4; you can extract everything you want with the json library by exploring the parsed object.
import requests
import json

r = requests.get('https://www.hkex.com.hk/eng/csm/DailyStat/data_tab_daily_20190425e.js?_=1556252093686')
# The response is a JavaScript assignment ("tabData = {...}"); stripping
# the "tabData = " prefix leaves a string that json.loads can parse.
data = json.loads(r.text.replace('tabData = ',''))
For example, to find the path to the first row of the table on the landing page, start by exploring the structure of the parsed object.
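A minimal exploration sketch (the key names inside tabData are not verified here, so print what actually comes back before hard-coding a path):
# Inspect the top level first, then drill down one level at a time
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))
elif isinstance(data, list):
    print(len(data), type(data[0]))
    print(data[0])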


I am trying to use fitz (PyMuPDF) to extract data from a PDF that contains text in a very unstructured format, but it's returning None at the first step

Here's the code I have been trying with the output:
import fitz
import pandas as pd

doc = fitz.open('xyz.pdf')
page1 = doc[0]
# Extract the page's words as (x0, y0, x1, y1, word, ...) tuples
words = page1.get_text("words")
first_annots = []
# first_annot is None when the page has no annotations,
# so accessing .rect raises an AttributeError
rec = page1.first_annot.rect
rec
Output:
AttributeError: 'NoneType' object has no attribute 'rect'
The output I am expecting is for all text rectangles to be identified so that each can be accessed separately.
Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
Independent of your overall intention (to parse unstructured text):
Accessing the page's annotations via page.first_annot makes no sense at all here.
Your exception is caused by the fact that the page has no annotations, and therefore page.first_annot is of course None.
Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
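If the goal is the rectangles of the text itself, note that the word tuples returned by get_text("words") already carry bounding boxes. A minimal sketch (reusing the question's placeholder filename 'xyz.pdf'):
import fitz  # PyMuPDF

doc = fitz.open('xyz.pdf')
page1 = doc[0]
# Each tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no)
for x0, y0, x1, y1, word, *rest in page1.get_text("words")[:10]:
    print(fitz.Rect(x0, y0, x1, y1), word)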

Parsed HTML using Python of a web page is different than the actual page

I need to get and store the PM2.5 and PM10 values from the table at https://app.cpcbccr.com/AQI_India/. I use BeautifulSoup4 to scrape the web page, but the parsed HTML I get is different from the actual page: the table body that is populated in the browser comes back empty.
I wrote the code to get the table rows and table data, but since my parsed HTML is missing the rows of the table body, it couldn't find them. For now I only have this to inspect my parsed HTML:
from bs4 import BeautifulSoup
import requests
import pandas as pd

url = "https://app.cpcbccr.com/AQI_India/"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
with open("Desktop/soup.html", "a") as dumpfile:
    dumpfile.write(str(soup))
How can I get all of the table? Thanks in advance.
Try the code below. I have implemented the scraping script for https://app.cpcbccr.com/AQI_India/ via its API instead. Using requests you can hit the API, and it sends back a result which you then parse as JSON.
import json
import requests
from urllib3.exceptions import InsecureRequestWarning

requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

def scrap_air_quality_index():
    # Request body captured from the browser's network tab
    payload = 'eyJzdGF0aW9uX2lkIjoic2l0ZV8zMDEiLCJkYXRlIjoiMjAyMC0wNy0yNFQ5OjAwOjAwWiJ9:'
    session = requests.Session()
    response = session.post('https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters', data=payload, verify=False)
    result = json.loads(response.text)
    extracted_metrics = result['metrics']
    print(extracted_metrics)
I checked the API calls in the browser's network tab, which is where I found the API URL https://app.cpcbccr.com/aqi_dashboard/aqi_all_Parameters that I'm using to get the data. It takes a mandatory payload parameter; without it you will not get any data back. You can extend the script to save the data to a .csv or Excel file.
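As a side note, the payload appears to be base64-encoded JSON (decoding it yields a station_id and a date), so you can likely build payloads for other stations or dates yourself. A sketch, assuming the endpoint accepts the same encoding and trailing colon:
import base64
import json

params = {"station_id": "site_301", "date": "2020-07-24T9:00:00Z"}
# Compact separators match the decoded form of the captured payload;
# the trailing ':' mirrors the original string
payload = base64.b64encode(json.dumps(params, separators=(',', ':')).encode()).decode() + ':'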

Download data from xml using xpath - returns empty list

I am fairly new to using Python to collect data from the web. I am interested in writing a script that collects data from an XML web page. Here is the address:
https://www.w3schools.com/xml/guestbook.asp
import requests
from lxml import html
url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("/guestbook/guest/fname")
print(guest)
I am not certain why this is returning an empty list. I've tried numerous variations of the XPath expression, so I'm losing confidence that my overall structure is correct.
For context, I want to write something that will parse the entire XML web page and return a CSV that can be used within other programs. I'm starting with the basics to make sure I understand how the various packages work. Thank you for any help.
This should do it
import requests
from lxml import html

url = "https://www.w3schools.com/xml/guestbook.asp"
page = requests.get(url)
extractedHtml = html.fromstring(page.content)
guest = extractedHtml.xpath("//guestbook/guest/fname")
for i in guest:
    print(i.text)
In the XPath, you need a double slash at the beginning: // selects matching nodes anywhere in the document, while a single leading / anchors the path at the document root, which here is the page's html element rather than guestbook. Also, this returns a list of elements; the text of each element can be extracted using .text.
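Since the end goal is a CSV, here is a sketch extending the same approach; it assumes each guest entry has fname and lname children (adjust the field list to what the page actually contains):
import csv
import requests
from lxml import html

page = requests.get("https://www.w3schools.com/xml/guestbook.asp")
tree = html.fromstring(page.content)

with open("guestbook.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["fname", "lname"])
    for guest in tree.xpath("//guestbook/guest"):
        # findtext returns the text of the first matching child, or None
        writer.writerow([guest.findtext("fname"), guest.findtext("lname")])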

Use pdfplumber to find text in PDF, return page number, then return table

I downloaded 42 PDFs which are each formatted similarly. Each has various tables, one of which is labeled "Campus Reported Incidents." That particular table is on a different page in each PDF. I want to write a function that will search for the page that has "Campus Reported Incidents" and scrape that table so that I can put it into a dataframe.
I figured that I could use PDFPlumber to search for the string "Campus Reported Incidents" and return the page number. I would then write a function that uses the page number to scrape the table I want, and I would loop that function through every PDF. However, I keep on getting the error "argument is not iterable" or "type object is not subscriptable." I looked through the PDFPlumber documentation but it didn't help my problem.
Here is one example of code that I tried:
url = "pdfs/example.pdf"
import pdfplumber

pdf = pdfplumber.open(url)
for page in range[0:len(pdf.pages)]:
    if 'Total number of physical restraints' in pdf.pages[page]:
        print(pdf.page_number)
I see this post was from a while ago, but maybe this response will still help you or someone else.
The error comes from the way you are looping through the pages: range[...] subscripts the range type itself instead of calling it, which is why you're seeing the "type object is not subscriptable" error message. Instead, enumerate through the pages. The i gives you access to the index (the current count in the loop), and pg gives you access to the page object in the PDF pages. I didn't use the pg variable below, but you could use it instead of pages[i] if you want.
The code below should print the tables from each page, and it also collects them so you can manipulate them further.
import pdfplumber

pdf_file = "pdfs/example.pdf"
tables = []
with pdfplumber.open(pdf_file) as pdf:
    pages = pdf.pages
    for i, pg in enumerate(pages):
        # extract_tables() returns a list of the tables found on the page
        tbl = pages[i].extract_tables()
        print(f'{i} --- {tbl}')
        tables.append(tbl)
This has nothing to do with pdfplumber.
It should be range(), not range[].
Please try the below:
url = "pdfs/example.pdf"
import pdfplumber

pdf = pdfplumber.open(url)
for page in range(len(pdf.pages)):
    # Membership must be tested against the page's extracted text,
    # not the Page object itself
    if 'Total number of physical restraints' in (pdf.pages[page].extract_text() or ''):
        # page_number belongs to the page, not the pdf object
        print(pdf.pages[page].page_number)
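Putting the fixes together with the original goal (find the page labeled "Campus Reported Incidents", then grab its table), here is a sketch built on pdfplumber's extract_text() and extract_tables(); the path and label come from the question, and treating the first row as the header is an assumption that may need adjusting per PDF:
import pdfplumber
import pandas as pd

def table_for_label(pdf_path, label="Campus Reported Incidents"):
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            if label in (page.extract_text() or ""):
                tables = page.extract_tables()
                if tables:
                    # First row as header, remaining rows as data
                    return pd.DataFrame(tables[0][1:], columns=tables[0][0])
    return None

df = table_for_label("pdfs/example.pdf")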

Web scraping with BS4: unable to get table

When you open the URL below in a browser,
http://www.kianfunds2.com/%D8%A7%D8%B1%D8%B2%D8%B4-%D8%AF%D8%A7%D8%B1%D8%A7%DB%8C%DB%8C-%D9%87%D8%A7-%D9%88-%D8%AA%D8%B9%D8%AF%D8%A7%D8%AF-%D9%88%D8%A7%D8%AD%D8%AF-%D9%87%D8%A7
you see a purple icon named "copy". When you click this icon, you get the complete table, which you can paste into Excel. How can I get this table as input in Python?
My code is below, and it shows nothing:
import requests
from bs4 import BeautifulSoup
url = "http://www.kianfunds2.com/" + "ارزش-دارایی-ها-و-تعداد-واحد-ها"
result = requests.get(url)
soup = BeautifulSoup(result.content, "html.parser")
table = soup.find("a", class_="dt-button buttons-copy buttons-html5")
I don't want to use Selenium, because it takes a lot of time. Please use Beautiful Soup.
To me it seems unnecessary to use any sort of web scraping here. Since you can download the data as a file anyway, there is no point in parsing the page just to reconstruct data you can already download.
Instead you could just download the data and read it into a pandas dataframe. You will need to have pandas installed; if you have Anaconda, you might already have it on your computer, otherwise you may need to install it:
conda install pandas
More Information on Installing Pandas
With pandas you can read the data directly from the Excel sheet:
import pandas as pd
df = pd.read_excel("dataset.xlsx")
pandas.read_excel documentation
If that causes difficulties, you can still convert the Excel sheet to a CSV and use pd.read_csv. Note that you'll want to use the correct encoding.
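A sketch of the CSV route (the filename is a placeholder, and the encoding is an assumption; try utf-8-sig or cp1256 if the Persian text comes out garbled):
import pandas as pd

# encoding is a guess for the Persian text; adjust to match the exported file
df = pd.read_csv("dataset.csv", encoding="utf-8-sig")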
In case you want to use BeautifulSoup for some reason: you might want to look into how to parse tables.
For a normal table, you would want to correctly identify the content you want to scrape. The table on that specific website has the id "arzeshdarayi". It is also the only table on that page, so you could also select it via its <table> tag.
table = soup.find("table", id="arzeshdarayi")
# or, with a CSS selector (select returns a list of matches):
table = soup.select("#arzeshdarayi")
However, the table on the website you provided has only a static header; the data is rendered by JavaScript, so BeautifulSoup won't be able to retrieve it. You can instead request the JSON object that the JavaScript works with and, once again, read it in as a dataframe:
import requests
import pandas as pd

r = requests.get("http://www.kianfunds2.com/json/gettables.ashx?get=arzeshdarayi")
data = r.json()
df = pd.DataFrame.from_dict(data)
In case you really want to scrape it, you will need some sort of browser simulation, so the JavaScript is evaluated before you access the HTML.
This answer recommends using Requests-HTML, a very high-level approach to web scraping that brings together Requests and BeautifulSoup and renders JavaScript. Your code would look somewhat like this:
from requests_html import HTMLSession

session = HTMLSession()
url = "http://www.kianfunds2.com/ارزش-دارایی-ها-و-تعداد-واحد-ها"
r = session.get(url)

# Render the website including javascript
# Uses Chromium (will be downloaded on first execution)
r.html.render(sleep=1)

# Find the table by its id and take only the first result
table = r.html.find("#arzeshdarayi")[0]

# Loop through the table rows,
# extracting the headings and data from each row
for items in table.find("tr"):
    # Take item.text for all elements in the row
    data = [item.text for item in items.find("th,td")[:-1]]
    print(data)
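If you want the scraped rows in a dataframe afterwards, collect them first; a sketch continuing from the loop above, assuming the first row holds the headings:
import pandas as pd

rows = [[item.text for item in items.find("th,td")[:-1]]
        for items in table.find("tr")]
df = pd.DataFrame(rows[1:], columns=rows[0])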
