I am trying to extract text from these documents (i.e. doc1, doc2). I just need the text under the Item 1 header.
What I have tried so far is shown below:
import html2text
from bs4 import BeautifulSoup as BS

soup = BS(response.text, 'html.parser')

# Locate the anchors that mark the "Item 1" and "Item 2" headers
# (the .css() calls assume a Scrapy-style response object).
startid = BS(response.css('tr:contains("Item\xa01"), tr:contains("Item 1."), *:contains("ITEM 1")')[0].css('a').get('')).find('a').attrs
endid = BS(response.css('tr:contains("Item\xa02"), tr:contains("Item 2."), *:contains("ITEM 2")')[0].css('a').get('')).find('a').attrs

# Accumulate everything between the two anchors' parent elements
html = ''
for tag in soup.select('a', startid)[0].parent.next_siblings:
    if soup.select('a', endid)[0].parent == tag:
        break
    else:
        html += str(tag)

h = html2text.HTML2Text()
h.ignore_links = True
print(h.handle(html))
Again, I just want the text under the Item 1 portion.
If you run:
import requests

r = requests.get('https://www.sec.gov/Archives/edgar/data/0000001800/000104746915001377/a2222655z10-k.htm')
print(r.text[1532:(1532 + 571)])
The output is:
To allow for equitable access to all users, SEC reserves the right to limit requests originating from undeclared automated tools. Your request has been identified as part of a network of automated tools outside of the acceptable policy and will be managed until action is taken to declare your traffic.</p>\n\n<p>Please declare your traffic by updating your user agent to include company specific information.</p>\n\n\n<p>For best practices on efficiently downloading information from SEC.gov, including the latest EDGAR filings, visit <a href="https://www.sec.gov/developer"
If you look at https://www.sec.gov/developer, it links off to https://www.sec.gov/edgar/sec-api-documentation.
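In practice, "declare your traffic" just means sending a User-Agent header with company details. A minimal sketch, where the company name and email are placeholders to replace with your own:

import requests

# SEC asks automated tools to identify themselves in the User-Agent header.
# The company name and email below are placeholders -- substitute your own.
headers = {'User-Agent': 'Sample Company Name admin@samplecompany.com'}
r = requests.get(
    'https://www.sec.gov/Archives/edgar/data/0000001800/000104746915001377/a2222655z10-k.htm',
    headers=headers,
)
print(r.status_code)  # expect 200 once the request is declared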
So for CIK 0000001800 you should be trying https://data.sec.gov/submissions/CIK0000001800.json, which contains...
{"cik":"1800","entityType":"operating","sic":"2834
","sicDescription":"Pharmaceutical Preparations","
insiderTransactionForOwnerExists":1,"insiderTransa
ctionForIssuerExists":1,"name":"ABBOTT LABORATORIE
S","tickers":["ABT"],"exchanges":["NYSE"],"ein":"3
60698440","description":"","website":"","investorW
ebsite":"","category":"Large accelerated filer","f
iscalYearEnd":"1231","stateOfIncorporation":"IL","
stateOfIncorporationDescription":"IL","addresses":
{"mailing":{"street1":"100 ABBOTT PARK ROAD","stre
et2":null,"city":"ABBOTT PARK","stateOrCountry":"I
L","zipCode":"60064-3500","stateOrCountryDescripti
on":"IL"},"business":{"street1":"100 ABBOTT PARK R
OAD","street2":null,"city":"ABBOTT PARK","stateOrC
ountry":"IL","zipCode":"60064-3500","stateOrCountr
yDescription":"IL"}},"phone":"2246676100","flags":
"","formerNames":[],"filings":{"recent":{"accessio
nNumber":["0001415889-21-004019","0001415889-21-00
4018","0001415889-21-003917","0001415889-21-003804
","0001104659-21-100055","0001415889-21-003773","0
001415889-21-003748","0001104659-21-094680","00014
15889-21-003516","0001415889-21-003514","000141588
9-21-003513","0001415889-21-003512","0001415889-21
-003509","0001415889-21-003503","0001415889-21-003
428","0001415889-21-003425","0001415889-21-003423"
,"0001415889-21-003418","0001104659-21-086325","00
01415889-21-002958","0001415889-21-002831","000141
5889-21-002830","0001104659-21-0763........
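Once your traffic is declared, fetching and walking that JSON is straightforward. A minimal sketch (the User-Agent is a placeholder, and zipping against a parallel 'form' list assumes the full response, which includes one alongside 'accessionNumber'):

import requests

# Placeholder User-Agent -- replace with your own company info per SEC policy.
headers = {'User-Agent': 'Sample Company Name admin@samplecompany.com'}
url = 'https://data.sec.gov/submissions/CIK0000001800.json'
data = requests.get(url, headers=headers).json()

# filings -> recent holds parallel lists, one entry per filing.
recent = data['filings']['recent']
for accession, form in zip(recent['accessionNumber'][:5], recent['form'][:5]):
    print(form, accession)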
I'm working with HTML loosely structured like this:
...
<div class='TL-dsdf2323...'>
  <a href='/link1/'>
    (more stuff)
  </a>
  <a href='/link2/'>
    (more stuff)
  </a>
</div>
...
I want to be able to return all of the hrefs contained within this particular div. So far, it seems like I am able to locate the proper div:
div = driver.find_elements_by_xpath("//div[starts-with(@class, 'TL')]")
This is where I'm hitting a wall though. I've gone through other posts and tried several options such as
links = div.find_elements_by_xpath("//a[starts-with(@href, '/link')]")
and
div.find_element_by_partial_link_text('/link')
but I keep returning empty lists. Any idea where I'm going wrong here?
Edit:
Here's a picture of the actual HTML. I simplified the div class name from ThumbnailLayout to TL and the href /listing to /link.
As @mr_mooo_cow pointed out in a comment, a delay was needed in order to extract the links. Here is the final working code:

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Wait up to 10 seconds for the anchors to be present before collecting them
a_tags = WebDriverWait(driver, 10).until(
    EC.presence_of_all_elements_located((By.XPATH, "//a[starts-with(@href, '/listing')]"))
)
links = []
for link in a_tags:
    links.append(link.get_attribute('href'))
Can you try something like this:
links = div.find_elements_by_xpath("//a[starts-with(@href, '/link') and parent::div[starts-with(@class, 'TL')]]")
The parent:: axis references the parent element in XPath. I haven't tested this, so let me know if it doesn't work.
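One more thing worth checking: an XPath that begins with // searches the entire document even when called on an element, so div.find_elements_by_xpath("//a[...]") quietly ignores the div you already located. A sketch of the scoped version (untested):

# The leading ".//" makes the query relative to the element it is called on.
div = driver.find_element_by_xpath("//div[starts-with(@class, 'TL')]")
a_tags = div.find_elements_by_xpath(".//a[starts-with(@href, '/link')]")
links = [a.get_attribute('href') for a in a_tags]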
<a class="link__f5415c25" href="/profiles/people/1515754-andrea-jung" title="Andrea Jung">
I have the above HTML element and tried using
driver.find_elements_by_class_name('link__f5415c25')
and
driver.get_attribute('href')
but it doesn't work at all. I expected to extract the values in href.
How can I do that? Thanks!
You have to first locate the element, then retrieve the attribute href, like so:
href = driver.find_element_by_class_name('link__f5415c25').get_attribute('href')
If there are multiple links associated with that class name, you can try something like:
eList = driver.find_elements_by_class_name('link__f5415c25')
hrefList = []
for e in eList:
    hrefList.append(e.get_attribute('href'))
for href in hrefList:
    print(href)
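Equivalently, a compact sketch using a list comprehension:

hrefList = [e.get_attribute('href')
            for e in driver.find_elements_by_class_name('link__f5415c25')]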
I am trying to get some information from a page. Here is how I set it up:
import bs4
import requests

url = 'https://kith.com/collections/all/products.atom'
r = requests.get(url)
soup = bs4.BeautifulSoup(r.text, "xml")
I am trying to get the 'size' for each product on this page. The size is tagged as <title>, which is located in the <variant> tag inside the <entry> tag.
items = soup.find_all('entry')
for item in items:
    variants = item.find_all('variant')
    for variant in variants:
        size = variant.title.text
        print('size: ' + str(size))
For some reason, instead of size: 3 it prints out size: 33. When run a third time it prints out size: 333, and so on. Why is it repeating itself, and how can I fix it?
I am trying to scrape a webpage for address data (the highlighted street address shown in the linked image) using the find() function of the BeautifulSoup library. Most online tutorials only provide examples where the data can be easily pinpointed to a certain class; however, for this particular site, the street address is an element within a larger class="dataCol col02 inlineEditWrite" and I'm not sure how to get at it with the find() function.
What would be the arguments to find() to get the street address in this example? Any help would be greatly appreciated.
This should get you started. It will find every div element with the class "dataCol col02 inlineEditWrite", then search for td elements within each one and print the first td element's text:
divTags = soup.find_all("div", {"class": "dataCol col02 inlineEditWrite"})
for tag in divTags:
    tdTags = tag.find_all("td")
    print(tdTags[0].text)
The above example assumes you want to print the first td element from every div with the class "dataCol col02 inlineEditWrite"; otherwise:
divTag = soup.find("div", {"class": "dataCol col02 inlineEditWrite"})
tdTags = divTag.find_all("td")
print(tdTags[0].text)
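An equivalent sketch using a CSS selector (untested against the real page); select_one returns the first match, and the descendant selector picks the first td inside a div carrying all three classes:

# Dots chain the classes: the div must have dataCol, col02, and inlineEditWrite.
tdTag = soup.select_one('div.dataCol.col02.inlineEditWrite td')
print(tdTag.text)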