How to remove <strong> </strong> from the result - python-3.x

I am trying to get the list of college names from an online dataset table (search result). The college names sit between the <strong> and </strong> tags, and I am not sure how to remove those tags from the result.
geo_table = soup.find('table',{'id':'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges=geo_table.findAll('strong')
Colleges
I think the problem is that I am extracting the wrong part, because <strong> only marks the line as bold. Where should I look for the college name?
This is a sample output:
href="?s=IL+MA+PA&p=14.0802+14.0801+14.3901&l=91+92+93+94&id=211440"

To fetch the href value, find all the <a> tags, iterate over them, and read each tag's href attribute. To fetch the college name, find the <strong> tag inside each link and take its text.
geo_table = soup.find('table', {'id': 'ctl00_cphCollegeNavBody_ucResultsMain_tblResults'})
Colleges = geo_table.findAll('a')
for college in Colleges:
    print('href :' + college['href'])
    print('college Name : ' + college.find('strong').text)
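As a rough, self-contained sketch of the same idea, the snippet below runs against a small made-up table rather than the real search results page. The HTML, the table id "results", and the college names are assumptions for illustration. It also skips links that have no <strong> child, a case the loop above would fail on.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the real results table (not the actual page markup).
html = """
<table id="results">
  <tr><td><a href="?id=1"><strong>Example College A</strong></a></td></tr>
  <tr><td><a href="?id=2"><strong>Example College B</strong></a></td></tr>
  <tr><td><a href="?page=2">Next page</a></td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
table = soup.find("table", {"id": "results"})

colleges = []
for link in table.find_all("a"):
    strong = link.find("strong")
    if strong is None:
        # Links such as pagination have no <strong> child; skip them.
        continue
    # get_text() returns only the text, without the surrounding tags.
    colleges.append((link["href"], strong.get_text(strip=True)))

for href, name in colleges:
    print("href:", href, "| college name:", name)
```

Collecting (href, name) pairs in a list instead of printing inside the loop makes the result easy to reuse later.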

Related

Scrapy parse is returning an empty array, regardless of yield

I am brand new to Scrapy, and I could use a hint here. I realize that there are quite a few similar questions, but none of them seem to fix my problem. I have the following code written for a simple web scraper:
import scrapy
from ScriptScraper.items import ScriptItem
class ScriptScraper(scrapy.Spider):
    name = "script_scraper"
    allowed_domains = ["https://proplay.ws"]
    start_urls = ["https://proplay.ws/dramas/"]
    def parse(self, response):
        for column in response.xpath('//div[@class="content-column one_fourth"]'):
            text = column.xpath('//p/b/text()').extract()
            item = ScriptItem()
            item['url'] = "test"
            item['title'] = text
            yield item
I will want to do some more involved scraping later, but right now, I'm just trying to get the scraper to return anything at all. The HTML for the site I'm trying to scrape looks like this:
<div class="content-column one_fourth">
::before
<p>
<b>
All dramas
<br>
(in alphabetical
<br>
order):
</b>
</p>
...
</div>
and I am running the following command in the Terminal:
scrapy parse --spider=script_scraper -c parse_ITEM -d 2 https://proplay.ws/dramas/
According to my understanding of Scrapy, the code I have written should be yielding the text "All dramas"; however, it is yielding an empty array instead. Can anyone give me a hint as to why this is not producing the expected yield? Again, I apologize for the repetitive question.
Your XPath expressions do not select the data you want to extract. If you want the item in the first row of the first column, the expression should be:
item = {}
item['text'] = response.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()').extract()[0]
extract() returns all the matches for the expression as a list. If you want only the first match, use extract()[0] or extract_first().
See https://devhints.io/xpath for more on XPath.
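The effect of the positional predicates can be tried outside a crawl. The sketch below uses lxml on a hand-written fragment modelled on the HTML in the question. Note that lxml is a stand-in here, not Scrapy's own selector API, and the second row and second column are made up.

```python
from lxml import html

# Hand-written fragment based on the question's markup; extra rows are assumptions.
page = html.fromstring("""
<html><body>
<div class="content-column one_fourth">
  <p><b>All dramas<br>(in alphabetical<br>order):</b></p>
  <p><b>Second row</b></p>
</div>
<div class="content-column one_fourth">
  <p><b>Another column</b></p>
</div>
</body></html>
""")

# Without [1] predicates, the <b> text nodes of every row in every column are returned.
all_texts = page.xpath('//div[@class="content-column one_fourth"]/p/b/text()')

# With predicates, only the text nodes of the first row of the first column remain.
first_row = page.xpath('//div[@class="content-column one_fourth"][1]/p[1]/b/text()')

# xpath() always returns a list, so index it to get a single string.
first_text = first_row[0].strip()
print(first_text)
```

Because the <b> element contains <br> tags, its text is split into several text nodes, which is why indexing the first one yields only "All dramas".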

How to parse only the second span tag in an HTML document using python bs4

I want to parse only one span tag in my HTML document. There are three sibling span tags without any class or id. I am targeting only the second one, using BeautifulSoup 4.
Given the following html document:
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
I tried:
for spn in soup.findAll('span'):
    data = spn[1].text
but it didn't work. The expected result is the text of the second span stored in a variable:
data = "city, state"
I would also like to know how to get the first and second spans concatenated in one variable.
You are trying to index an individual span (a Tag instance). Get rid of the for loop and index the findAll result instead, i.e.
>>> soup.findAll('span')[1]
<span>city, state</span>
You can get the first and second tags together using:
>>> soup.findAll('span')[:2]
[<span>35456 street</span>, <span>city, state</span>]
or, as a string:
>>> "".join([str(tag) for tag in soup.findAll('span')[:2]])
'<span>35456 street</span><span>city, state</span>'
Another option:
data = soup.select_one('div > span:nth-of-type(2)').get_text(strip=True)
print(data)
Output:
city, state
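Putting the pieces together, here is a minimal sketch that runs on the HTML from the question. It joins the text of the first two spans rather than their markup; the ", " separator is an arbitrary choice, not something the question specifies.

```python
from bs4 import BeautifulSoup

html = """
<div class="adress">
<span>35456 street</span>
<span>city, state</span>
<span>zipcode</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.find_all("span")

# Index the list of spans, not an individual span.
data = spans[1].get_text(strip=True)

# Join the text (not the tags) of the first two spans.
first_two = ", ".join(span.get_text(strip=True) for span in spans[:2])

print(data)
print(first_two)
```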

Can't access dynamic element on webpage

I can't access a text box on a webpage; it's a dynamic element. I've tried filtering on many attributes in the XPath, but the number that changes in the id and name seems to be the only unique part of the element's XPath. Every filter I try matches at least 3 elements. I've been trying for 2 days and really need some help here.
from selenium import webdriver
# clicks on a button
def click_btn(submit_xpath):
    submit_box = driver.find_element_by_xpath(submit_xpath)
    submit_box.click()
    driver.implicitly_wait(7)
    return
# sends text to a text box
def send_text_to_box(box_xpath, text):
    box = driver.find_element_by_xpath(box_xpath)
    box.send_keys(text)
    driver.implicitly_wait(3)
    return
descr = "Can't send this text"
# the number in the id is the part of the XPath that changes
send_text_to_box('//*[@id="textfield-1285-inputEl"]', descr)
Edit: it works now with the following XPath: //input[contains(@id, 'textfield') and contains(@aria-readonly, 'false') and contains(@class, 'x-form-invalid-field-default')]. Luckily, I found something specific to this element.
You can match on a partial string instead of the exact id. That is, in place of
send_text_to_box('//*[@id="textfield-1285-inputEl"]', descr), please try send_text_to_box('//*[contains(@id,"inputEl")]', descr).
If several elements have the string 'inputEl' in their id, look for something else that stays constant (some other attribute, perhaps). Otherwise, find another element that you can locate reliably and traverse from it to the required input.
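One way to check a partial-match XPath before handing it to Selenium is to run it against a saved copy of the markup. The sketch below does that with lxml on a made-up fragment. The ids of the second and third inputs, the class names, and the combined condition are assumptions, not the real page. It shows how an extra condition can narrow several matches down to one.

```python
from lxml import html

# Hypothetical fragment: three inputs share the "inputEl" suffix.
fragment = html.fromstring("""
<div>
  <input id="textfield-1285-inputEl" aria-readonly="false" class="x-form-field">
  <input id="combo-1290-inputEl" aria-readonly="true" class="x-form-field">
  <input id="datefield-1301-inputEl" aria-readonly="false" class="x-form-field">
</div>
""")

# A single contains() on the constant suffix still matches all three inputs.
loose = fragment.xpath('//*[contains(@id, "inputEl")]')

# Combining the constant parts of the id excludes the other widgets.
strict = fragment.xpath(
    '//input[contains(@id, "textfield") and contains(@id, "inputEl")]'
)

print(len(loose), [el.get("id") for el in strict])
```

The same XPath strings can then be passed to the Selenium helper from the question once they match exactly one element.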

obtain en-US title tag text

I'm trying to obtain the text in only the title@lang=en-US elements in an XML file.
This code obtains all the title text for all languages.
entries = root.xpath('//prefix:new-item', namespaces={'prefix': 'http://mynamespace'})
for entry in entries:
    all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
    for title in all_titles:
        print(title.text)
I tried this code to get the title@lang=en-US text, but it does not work.
all_titles = entry.xpath('./prefix:title', namespaces={'prefix': 'http://mynamespace'})
for title in all_titles:
    test = title.xpath("@lang='en-US'")
    print(test)
How do I obtain the text for only the English-language items?
The expression
//prefix:title[lang('en')]
will select all the English-language titles. Specifically:
title elements that have an xml:lang attribute identifying the title as English, for example <title xml:lang="en-US"> or <title xml:lang="en-GB">
title elements within some container that identifies all the contents as English, for example <section xml:lang="en-US"><title/></section>.
If you specifically want only US English titles, excluding other forms of English, then you can use the predicate [lang('en-US')].
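A small sketch of both predicates with lxml follows. The namespace URI and the new-item and title element names come from the question; the root element name, the sample titles, and the assumption that the file marks language with xml:lang are made up for illustration.

```python
from lxml import etree

# Hypothetical document; assumes the language is marked with xml:lang.
xml = """<root xmlns="http://mynamespace">
  <new-item>
    <title xml:lang="en-US">Hello</title>
    <title xml:lang="fr-FR">Bonjour</title>
    <title xml:lang="en-GB">Hullo</title>
  </new-item>
</root>"""

root = etree.fromstring(xml)
ns = {"prefix": "http://mynamespace"}

# lang('en') matches any English variant (en-US, en-GB, ...).
english = [t.text for t in root.xpath("//prefix:title[lang('en')]", namespaces=ns)]

# lang('en-US') matches US English only.
us_only = [t.text for t in root.xpath("//prefix:title[lang('en-US')]", namespaces=ns)]

print(english)
print(us_only)
```

If the file uses a plain lang attribute instead of xml:lang, lang() will not see it, and an attribute test such as [@lang='en-US'] would be needed instead.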

Using BeautifulSoup to separate the hrefs and the anchor text

I'm using Python3 with Beautiful Soup 4 to separate hrefs from the text itself. Like:
<a href="yoursite.com">LINK</a>
I want to (1) extract and print yoursite.com, and then (2) get LINK.
If anyone could help me that would be great!
Locate the a element by, say, class name; use dictionary-like access to attributes; .get_text() to get the link text:
a = soup.find("a", class_="sample-class") # or soup.select_one("a.sample-class")
print(a["href"])
print(a.get_text())
A tag may have any number of attributes. The tag <b class="boldest"> has an attribute “class” whose value is “boldest”. You can access a tag’s attributes by treating the tag like a dictionary:
tag['class']
# u'boldest'
A string corresponds to a bit of text within a tag. Beautiful Soup
uses the NavigableString class to contain these bits of text:
tag.string
# u'Extremely bold'
You can find this in the Beautiful Soup documentation.
