Here's my situation: I want to go to Yahoo's NHL site, here: http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01
The above link is for the scores from April 1st, 2013.
I'm trying to get my Python code to display the scores from the various table rows, along with the names of the teams.
However, I'm not good at referencing HTML elements when using XPath, and I suspect my code itself may be quite wrong too.
Here it is:
from lxml import etree
from urllib.request import urlopen

# fetch the page and parse it into an element tree
data = urlopen('http://sports.yahoo.com/nhl/scoreboard?d=2013-04-01').read()
result = etree.HTML(data)

for tr in result.xpath(''):
    print(tr)
The "for tr in result.xpath(''):" is left blank in the parentheses due to the issue I listed above.
It's a lot to cover. Sorry about that.
Here's the code I have been trying, along with the output:
import fitz  # PyMuPDF
import pandas as pd

doc = fitz.open('xyz.pdf')
page1 = doc[0]
words = page1.get_text("words")  # list of word tuples with their rectangles
first_annots = []
rec = page1.first_annot.rect  # this line raises the exception
rec
Output: an AttributeError, because page1.first_annot is None and has no .rect.
The output I am expecting is for all the text rectangles to be identified and addressable separately.
Here's where I found the code that I am implementing: https://www.analyticsvidhya.com/blog/2021/06/data-extraction-from-unstructured-pdfs/
Independent from your overall intention (to parse unstructured text):
Accessing the page's annotations via page.first_annot makes no sense at all.
Your exception is caused by the fact that the page has no annotations, and therefore page.first_annot is None of course.
Again: whether or not there are annotations has nothing to do with the text of the page. Simply do not access page.first_annot.
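To get the text rectangles themselves, no annotations are needed; the word list you already extracted carries them. A minimal sketch (each tuple from get_text("words") begins with the word's bounding box):

import fitz  # PyMuPDF

doc = fitz.open('xyz.pdf')
page1 = doc[0]

# every word tuple is (x0, y0, x1, y1, word, block_no, line_no, word_no)
for x0, y0, x1, y1, word, block_no, line_no, word_no in page1.get_text("words"):
    rect = fitz.Rect(x0, y0, x1, y1)  # the rectangle enclosing this word
    print(rect, word)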
(Disclaimer: I'm a newbie, I'm sorry if this problem is really obvious)
Hello,
I built a little script that first finds certain parts of the HTML markup within a local file and then displays the information without the HTML tags.
I used bs4 and find_all / get_text for this. Take a look:
from bs4 import BeautifulSoup

with open("/Users/user1/Desktop/testdatapython.html") as fp:
    soup = BeautifulSoup(fp, "lxml")

titleResults = soup.find_all('span', attrs={'class': 'caption-subject'})
firstResult = titleResults[0]
firstStripped = firstResult.get_text()
print(firstStripped)
This actually works so far, but I want to do it for all values of titleResults, not only the first one; I can't call get_text on the whole list at once.
What would be the best way to accomplish this? The number of values in titleResults is always changing, since the local HTML file is only a sample.
Thank you in advance!
P.S. I already looked at this related thread, but sadly it wasn't enough to understand or solve the problem:
BeautifulSoup get_text from find_all
find_all returns a list, so iterate over it:

for result in titleResults:
    stripped = result.get_text()
    print(stripped)
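If you want the stripped strings collected rather than printed, a list comprehension does the same thing in one line:

allStripped = [result.get_text() for result in titleResults]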
I was following a web scraping tutorial from here: http://docs.python-guide.org/en/latest/scenarios/scrape/
It looks pretty straightforward, and before I did anything else I just wanted to see if the sample code would run. I'm trying to find the URIs for the images on this site.
http://www.bvmjets.com/
This actually might be a really bad example. I was trying to do this with a more complex site but decided to dumb it down a bit so I could understand what was going on.
Following the instructions, I got the XPath for one of the images.
/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img
The whole script looks like:
from lxml import html
import requests
page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)
images = tree.xpath('/html/body/div/div/table/tbody/tr[4]/td/p[1]/a/img')
print(images)
But when I run this, the result is an empty list. I've looked at the XPath docs and I've tried various alterations to the xpath, but I get nothing each time.
I don't think I can answer your question directly, but I noticed the images on the page you are targeting are sometimes wrapped differently. I'm unfamiliar with XPath myself, and wasn't able to get the number selector to work, despite this post. Here are a couple of examples to try:
tree.xpath('//html//body//div//div//table//tr//td//div//a//img[@src]')
or
tree.xpath('//table//tr//td//div//img[@src]')
or
tree.xpath('//img[@src]') # 68 images
The key to this is building up slowly. Find all the images, then find the image wrapped in the tag you are interested in, etc., until you are confident you can find only the images you are interested in.
Note that the [@src] allows us to now access the source of that image. Using this post we can now download any/all images we want:
import shutil
from lxml import html
import requests

page = requests.get('http://www.bvmjets.com/')
tree = html.fromstring(page.content)

cool_images = tree.xpath('//a[@target="_blank"]//img[@src]')
source_url = page.url + cool_images[5].attrib['src']
path = 'cool_plane_image.jpg'  # path on disk

r = requests.get(source_url, stream=True)
if r.status_code == 200:
    with open(path, 'wb') as f:
        r.raw.decode_content = True
        shutil.copyfileobj(r.raw, f)
I would highly recommend looking at Beautiful Soup. It has helped my own amateur web scraping ventures. Have a look at this post for a relevant starting point.
This may not be the answer you are looking for, but hopefully it is a starting point / of some use to you. Best of luck!
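As a rough sketch of what that looks like in Beautiful Soup (same page, same idea; untested against this particular site, so treat it as an assumption):

import requests
from bs4 import BeautifulSoup

page = requests.get('http://www.bvmjets.com/')
soup = BeautifulSoup(page.content, 'lxml')

# collect the src attribute of every image on the page
sources = [img['src'] for img in soup.find_all('img', src=True)]
print(sources)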
I am using Python BeautifulSoup to extract some data from a famous song site.
Here is the snippet of code:
import requests
from bs4 import BeautifulSoup

url = 'https://gaana.com/playlist/gaana-dj-bollywood-top-50-1'
res = requests.get(url)
while res.status_code != 200:
    try:
        res = requests.get(url)
    except:
        pass
print(res)

soup = BeautifulSoup(res.text, 'lxml')
songs = soup.find_all('meta', {'property': 'music:song'})
print(songs[0])
Here is the sample output:
<Response [200]>
<meta content="https://gaana.com/song/o-saathi" property="music:song"/>
Now I want to extract the URL inside the content attribute as a string, so that I can use that URL further in my program.
Can someone please help me?
It's in the comments, but I just want to explain: BeautifulSoup returns most results as a list or other iterable object. You show that you understand this in your code by using songs[0], but in this case what's been returned behaves like a dictionary.
As explained in this StackOverflow post, you need to query not only songs[0] but also the property within it (the two together form a key-value pair, the chief way to get data out of a dictionary).
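Concretely, a minimal sketch of that lookup on the result you already have:

# the returned Tag supports dictionary-style attribute access
song_url = songs[0]['content']
print(song_url)  # https://gaana.com/song/o-saathi

# or collect every song URL in the playlist
all_urls = [tag['content'] for tag in songs]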
Last note: while I've been a big fan of BeautifulSoup4 for basic web scraping, you may consider the lxml library. It's pretty well documented; to really take advantage of it you have to learn its XPath syntax, which is sort of like regex for XML/HTML; but for advanced scraping it's probably the best option short of Selenium, and it returns cleaner data than bs4.
Good luck!
I am a design researcher. I have several .txt files which contain 75-100 quotations to which I have given various tags like so:
<q 69_A F exercises positive> Well I think it’s very good. I thought that the exercises that Rosy did was very good. I looked at it a few times. I listened and I paid attention but I didn’t really do it on the regular. I didn’t do the exercises on a regular basis. </q>
I am trying to list all the tags ("69_a", "exercises", "positive") using BeautifulSoup. But instead of giving me an output which looks like this:
69_a
exercises
positive
It is giving me an output which looks like this:
q
q
q
q
Finished...
Can you please help me fix this? I have a lot of qualitative data that I want to put through this. The objective is to export all the quotes to a .xlsx file and sort using pivot tables.
from bs4 import BeautifulSoup

file_object = open('Angela_Q_2.txt', 'r')
soup = BeautifulSoup(file_object.read(), "lxml")
tag = soup.findAll('name')
for tag in soup.findAll(True):
    print(tag.name)
print('Finished')
What you want to list are called attributes, not tags. To access a tag's attributes, use its .attrs value.
Use below as shown:
from bs4 import BeautifulSoup

contents = '<q tag1 tag2>Quote1</q>some other text<q tag1 tag3>quote2</q>'
soup = BeautifulSoup(contents, 'lxml')
for tag in soup.findAll('q'):
    print(tag.attrs)
    print(tag.contents)
print('Finished')
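Since you mentioned exporting the quotes to a .xlsx file for pivot tables, here is a minimal sketch of that step with pandas (assumes pandas and openpyxl are installed; the file and column names are just placeholders):

from bs4 import BeautifulSoup
import pandas as pd

with open('Angela_Q_2.txt', 'r') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

rows = []
for tag in soup.findAll('q'):
    # iterating tag.attrs yields the attribute names, e.g. 69_a, exercises, positive
    rows.append({'tags': ' '.join(tag.attrs), 'quote': tag.get_text(strip=True)})

pd.DataFrame(rows).to_excel('quotes.xlsx', index=False)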