Extracting text of <p></p> with BeautifulSoup

I am trying to get the news article from this link. My code is:
import requests
from bs4 import BeautifulSoup
def get_news_details(news_url):
    source = requests.get(news_url)
    plain_text = source.text
    soup = BeautifulSoup(plain_text, "html.parser")
    content = soup.findAll('div', {'class' : 'big-img-box'})
    print(content[0].findAll('p'))
The result shows:
[<p></p>, <p></p>, <p></p>, <p></p>, <p></p>, <p></p>]
And the value of content is:
<div class="big-img-box">
<div class="left-imgs">
<figure>
<img alt="iOS developer hints possibility of 4K Apple TV" class="img-responsive" src="http://www.aninews.in/contentimages/detail/appletv.jpg"/>
<figcaption><span class="heading-inner-span"></span></figcaption>
</figure>
<div class="mb10"></div>
</div>
<p></p> New York [USA], August 6 <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a>: The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a <span class="highlights"> 4K Apple TV</span> with high dynamic range (HDR) support for both <span class="highlights"> HDR10 </span> and <span class="highlights"> Dolby Vision</span>.<p></p> While the current range of Apple's TV set-top box is incompatible to 4K technology, <span class="highlights">iOS</span> developer <span class="highlights"> Guilherme Rambo</span> revealed that the company is hinting an adoption of the ultra high-definition format, reports <span class="highlights">The Verge</span>.<p></p> Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year.<p></p> It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like <span class="highlights"> Netflix</span> and <span class="highlights"> Amazon</span> support the two high-definition formats.<p></p> Last month, iTunes started listing movies as supporting 4K and <span class="highlights"> HDR</span> in users' purchase histories, thus providing more thrust to the speculations of the 4K <span class="highlights"> Apple</span> TV. <a class="highlights" href="http://aninews.in/" target="_blank">(ANI)</a><p></p>
</div>
I can get a somewhat clumsy version of the article with content[0].text, but I cannot format it.
When I inspect the webpage in Chrome, the article appears to be written inside <p>article_text</p> tags, whereas in content it appears as <p></p>article_text. If the former version were present in soup, I could get my desired output. What should be done?

It depends on what you mean by formatting. You can make it 'tidier' in fairly simple ways.
>>> import bs4
>>> import requests
>>> page = requests.get('http://www.aninews.in/newsdetail-Nw/MzI4NDIy/ios-developer-hints-possibility-of-4k-apple-tv.html').content
>>> soup = bs4.BeautifulSoup(page, 'lxml')
>>> big_img_box = soup.select('.big-img-box')
Get all the text and strip away white space.
>>> big_img_box[0].text.strip()
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision. While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge. Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year. It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats. Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)"
Go beyond this and remove longer strings of interior white space.
>>> import re
>>> re.sub(r'\s{2,}', ' ', big_img_box[0].text.strip())
"New York [USA], August 6 (ANI): The latest designs from Apple's HomePod firmware revealed that the tech giant is hinting the launch of a 4K Apple TV with high dynamic range (HDR) support for both HDR10 and Dolby Vision. While the current range of Apple's TV set-top box is incompatible to 4K technology, iOS developer Guilherme Rambo revealed that the company is hinting an adoption of the ultra high-definition format, reports The Verge. Reports of the new range of Apple TV have surfaced time and again over the past few months, starting February this year. It is said that implementing the HDR and 4K content will prove to b beneficial for the company, rather than a simpler resolution, since popular online movie and television platforms like Netflix and Amazon support the two high-definition formats. Last month, iTunes started listing movies as supporting 4K and HDR in users' purchase histories, thus providing more thrust to the speculations of the 4K Apple TV. (ANI)"

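If the goal is one string per paragraph rather than one long string, note that with html.parser (the parser used in the question) each <p></p> comes out empty and the article text sits next to it as sibling nodes, so the empty tags can serve as paragraph separators. A minimal sketch, assuming a shortened stand-in for the div printed in the question; split_paragraphs is a hypothetical helper name, not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup, Tag

# Shortened stand-in for the markup printed in the question (assumption:
# the real page has the same shape, just with longer text).
html = """<div class="big-img-box">
<div class="left-imgs"></div>
<p></p> New York [USA], August 6 <a class="highlights" href="http://aninews.in/">(ANI)</a>: Apple is hinting at a <span class="highlights"> 4K Apple TV</span>.<p></p> Reports have surfaced time and again.<p></p>
</div>"""


def split_paragraphs(box):
    """Return the text found between consecutive <p> tags inside box."""
    paragraphs, current = [], []

    def flush():
        # Collapse runs of whitespace and drop empty chunks.
        text = " ".join("".join(current).split())
        if text:
            paragraphs.append(text)
        current.clear()

    for child in box.children:
        if isinstance(child, Tag) and child.name == "p":
            flush()
            # Keep any text a different parser may have left inside the tag.
            current.append(child.get_text())
        elif isinstance(child, Tag) and child.name == "div":
            continue  # skip the image block
        elif isinstance(child, Tag):
            current.append(child.get_text())  # inline <a>, <span>, ...
        else:
            current.append(str(child))  # bare text node
    flush()
    return paragraphs


soup = BeautifulSoup(html, "html.parser")
box = soup.find("div", {"class": "big-img-box"})
for paragraph in split_paragraphs(box):
    print(paragraph)
```

Each printed line is one paragraph, which can then be joined with whatever separator the desired formatting needs.
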
Related

Steam API json output isn't being displayed cleanly

I've been trying to get the output from the Steam API to display in my Django app.
My goal is to create a "portfolio" of games that I own for a project, using the information in the Steam API. I'm fairly new to Django, but know enough to be dangerous. I have been banging my head against this for a week...
My question is: how does one pull specific items from the JSON and display the data correctly inside the template? I cannot seem to locate a tutorial that fits the schema of the output from this API. Any help (and maybe a brief explanation) would be greatly appreciated.
For example, how can I capture the minimum system requirements from the API and display them in a div on my page?
This is my views.py:
import requests
from django.shortcuts import render
def index(request):
    response = requests.get('https://store.steampowered.com/api/appdetails/?appids=218620').json()
    return render(request, 'index.html', {'response': response})
While it actually returns data, you'll notice that the data is littered with HTML tags and other stray characters.
output:
</ul>', 'short_description': 'PAYDAY 2 is an action-packed, four-player co-op shooter that once again lets gamers don the masks of the original PAYDAY crew - Dallas, Hoxton, Wolf and Chains - as they descend on Washington DC for an epic crime spree.', 'supported_languages': 'English<strong>*</strong>, German, French, Italian, Spanish - Spain, Dutch, Russian, Japanese, Simplified Chinese, Spanish - Latin America, Korean<br><strong>*</strong>languages with full audio support', 'header_image': 'https://cdn.akamai.steamstatic.com/steam/apps/218620/header.jpg?t=1646834144', 'website': 'http://www.paydaythegame.com/', 'pc_requirements': {'minimum': '<strong>Minimum:</strong><br><ul class="bb_ul"><li><strong>OS:</strong>Windows 7<br>\t</li><li><strong>Processor:</strong>2 GHz Intel Dual Core Processor<br>\t</li><li><strong>Memory:</strong>4 GB RAM<br>\t</li><li><strong>Graphics:</strong>Nvidia & AMD (512MB VRAM)<br>\t</li><li><strong>DirectX®:</strong>9.0c<br>\t</li><li><strong>Storage:</strong>83 GB available space<br>\t</li><li><strong>Sound:</strong>DirectX 9.0c compatible</li></ul>', 'recommended': '<strong>Recommended
Here is my index.html:
{% extends 'base.html' %}
{% block content %}
{{response}}
{% endblock %}
If you add the 'safe' filter in the Django template like this:
{% extends 'base.html' %}
{% block content %}
{{ response | safe }}
{% endblock %}
it cleans up the displayed output considerably, but stray characters still show up in odd places, e.g.:
', 'short_description': 'PAYDAY 2 is an action-packed, four-player co-op shooter that once again lets gamers don the masks of the original PAYDAY crew - Dallas, Hoxton, Wolf and Chains - as they descend on Washington DC for an epic crime spree.', 'supported_languages': 'English\*, German, French, Italian, Spanish - Spain, Dutch, Russian, Japanese, Simplified Chinese, Spanish - Latin America, Korean *languages with full audio support', 'header_image': 'https://cdn.akamai.steamstatic.com/steam/apps/218620/header.jpg?t=1646834144', 'website': 'http://www.paydaythegame.com/', 'pc_requirements': {'minimum': 'Minimum:OS:Windows 7 \t Processor:2 GHz Intel Dual Core Processor
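One way to pull a specific item out of the response is to index into the parsed JSON in the view and pass only that piece to the template, instead of printing the whole dictionary. A minimal sketch, assuming the payload is keyed by the app id with the game's fields under "data" (the usual shape of appdetails responses, though not visible in the truncated output above). The names sample and minimum_requirements are hypothetical, and stripping tags with regular expressions is a simplification that only suits small, regular fragments like this one:

```python
import re
from html import unescape

# Hand-built stand-in for the parsed JSON; the nesting under the app id
# and "data" is an assumption, the inner string is copied from the output.
sample = {
    "218620": {
        "success": True,
        "data": {
            "pc_requirements": {
                "minimum": '<strong>Minimum:</strong><br><ul class="bb_ul">'
                           '<li><strong>OS:</strong>Windows 7<br>\t</li>'
                           '<li><strong>Processor:</strong>2 GHz Intel Dual Core Processor<br>\t</li>'
                           '<li><strong>Memory:</strong>4 GB RAM<br>\t</li></ul>',
            },
        },
    }
}


def minimum_requirements(payload, app_id):
    """Return the minimum requirements as a list of plain-text lines."""
    fragment = payload[str(app_id)]["data"]["pc_requirements"]["minimum"]
    lines = []
    for item in re.findall(r"<li>(.*?)</li>", fragment, flags=re.S):
        text = unescape(re.sub(r"<[^>]+>", "", item))  # drop the inner tags
        lines.append(" ".join(text.split()))           # tidy tabs and spaces
    return lines


reqs = minimum_requirements(sample, 218620)
for line in reqs:
    print(line)
```

In the view, the returned list could be placed in the render context and looped over with {% for %} inside the div, which avoids the need for the safe filter altogether.
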

Beautiful Soup extracting all text between two tags, some of the text within the div is tagless

I have been at this for a while now. I am using BeautifulSoup4 and trying to get all the text between the first <h3> tag and the first <h1> tag. The problem is that .next and nextSibling both appear to skip tagless text.
This is what I am trying as of now
start = article_content(text="Description")[0]
end = soup.find_next(class_='title')
spell_description = ""
item = start.next
while item != end:
    try:
        new_text = item.extract()
        if new_text != '':
            strip_newline = str(re.sub(r'\n\s*\n', '\n\n', new_text))
            spell_description += strip_newline + '\n'
    except Exception as e:
        print(e)
    item = item.next
This is the HTML
<h3 class="framing">Description</h3>
You detect magical auras. The amount of information revealed depends on how long you study a particular area or subject.
<br>
<br>
<i>1st Round</i>: Presence or absence of magical auras.
<br>
<br>
<i>2nd Round</i>: Number of different magical auras and the power of the most potent aura.
<br>
<br>
<i>3rd Round</i>: The strength and location of each aura. If the items or creatures bearing the auras are in line of sight, you can make Knowledge (arcana) skill checks to determine the school of magic involved in each. (Make one check per aura: DC 15 + spell level, or 15 + 1/2 caster level for a nonspell effect.) If the aura eminates from a magic item, you can attempt to identify its properties (see Spellcraft).
<br>
<br>
Magical areas, multiple types of magic, or strong local magical emanations may distort or conceal weaker auras.
<br>
<br>
<i>Aura Strength</i>: An aura's power depends on a spell's functioning spell level or an item's caster level; see the accompanying table. If an aura falls into more than one category, <i>detect magic</i> indicates the stronger of the two.
<br>
<br>
<i>Lingering Aura</i>: A magical aura lingers after its original source dissipates (in the case of a spell) or is destroyed (in the case of a magic item). If detect magic is cast and directed at such a location, the spell indicates an aura strength of dim (even weaker than a faint aura). How long the aura lingers at this dim level depends on its original power:
<br>
<br>
<table class="inner">
<tbody>
<tr>
<td><b>Original Strength</b></td>
<td><b>Duration of Lingering Aura</b></td>
</tr>
<tr>
<td>Faint</td>
<td>1d6 rounds</td>
</tr>
<tr>
<td>Moderate</td>
<td>1d6 minutes</td>
</tr>
<tr>
<td>Strong</td>
<td>1d6 x 10 minutes</td>
</tr>
<tr>
<td>Overwhelming</td>
<td>1d6 days</td>
</tr>
</tbody>
</table>
<br>Outsiders and elementals are not magical in themselves, but if they are summoned, the conjuration spell registers. Each round, you can turn to detect magic in a new area. The spell can penetrate barriers, but 1 foot of stone, 1 inch of common metal, a thin sheet of lead, or 3 feet of wood or dirt blocks it.<br><br><i>Detect magic</i> can be made permanent with a <i>permanency</i> spell.<br><br>
</table>
<h1 class="title"><img src="images\PathfinderSocietySymbol.gif" title="PFS Legal" style="margin:3px 3px 0px 3px;"> Detect Magic, Greater</h1>
output as of now
'You detect magical auras. The amount of information revealed depends on how long you study a particular area or subject.\n'
expected output
You detect magical auras. The amount of information revealed depends on how long you study a particular area or subject.
1st Round: Presence or absence of magical auras.
2nd Round: Number of different magical auras and the power of the most potent aura.
3rd Round: The strength and location of each aura. If the items or creatures bearing the auras are in line of sight, you can make Knowledge (arcana) skill checks to determine the school of magic involved in each. (Make one check per aura: DC 15 + spell level, or 15 + 1/2 caster level for a nonspell effect.) If the aura eminates from a magic item, you can attempt to identify its properties (see Spellcraft).
Magical areas, multiple types of magic, or strong local magical emanations may distort or conceal weaker auras.
Aura Strength: An aura's power depends on a spell's functioning spell level or an item's caster level; see the accompanying table. If an aura falls into more than one category, detect magic indicates the stronger of the two.
Lingering Aura: A magical aura lingers after its original source dissipates (in the case of a spell) or is destroyed (in the case of a magic item). If detect magic is cast and directed at such a location, the spell indicates an aura strength of dim (even weaker than a faint aura). How long the aura lingers at this dim level depends on its original power:
Original Strength
Duration of Lingering Aura
Faint
1d6 rounds
Moderate
1d6 minutes
Strong
1d6 x 10 minutes
Overwhelming
1d6 days

I would use BeautifulSoup to find the first h3 element, then follow next_sibling (an attribute, not a method) from node to node until the h1 is reached, collecting the text of each NavigableString and tag along the way, like this:
from bs4 import BeautifulSoup, NavigableString, Comment
html = """
<html><p>stuff you dont want</p>
<h3 class="framing">Description</h3>
You detect magical auras. The amount of information revealed depends on how long you study a particular area or subject.
<br>
<br>
<i>1st Round</i>: Presence or absence of magical auras.
<br>
<br>
<i>2nd Round</i>: Number of different magical auras and the power of the most potent aura.
<br>
<br>
<i>3rd Round</i>: The strength and location of each aura. If the items or creatures bearing the auras are in line of sight, you can make Knowledge (arcana) skill checks to determine the school of magic involved in each. (Make one check per aura: DC 15 + spell level, or 15 + 1/2 caster level for a nonspell effect.) If the aura eminates from a magic item, you can attempt to identify its properties (see Spellcraft).
<br>
<br>
Magical areas, multiple types of magic, or strong local magical emanations may distort or conceal weaker auras.
<br>
<br>
<i>Aura Strength</i>: An aura's power depends on a spell's functioning spell level or an item's caster level; see the accompanying table. If an aura falls into more than one category, <i>detect magic</i> indicates the stronger of the two.
<br>
<br>
<i>Lingering Aura</i>: A magical aura lingers after its original source dissipates (in the case of a spell) or is destroyed (in the case of a magic item). If detect magic is cast and directed at such a location, the spell indicates an aura strength of dim (even weaker than a faint aura). How long the aura lingers at this dim level depends on its original power:
<br>
<br>
<table class="inner">
<tbody>
<tr>
<td><b>Original Strength</b></td>
<td><b>Duration of Lingering Aura</b></td>
</tr>
<tr>
<td>Faint</td>
<td>1d6 rounds</td>
</tr>
<tr>
<td>Moderate</td>
<td>1d6 minutes</td>
</tr>
<tr>
<td>Strong</td>
<td>1d6 x 10 minutes</td>
</tr>
<tr>
<td>Overwhelming</td>
<td>1d6 days</td>
</tr>
</tbody>
</table>
<br>Outsiders and elementals are not magical in themselves, but if they are summoned, the conjuration spell registers. Each round, you can turn to detect magic in a new area. The spell can penetrate barriers, but 1 foot of stone, 1 inch of common metal, a thin sheet of lead, or 3 feet of wood or dirt blocks it.<br><br><i>Detect magic</i> can be made permanent with a <i>permanency</i> spell.<br><br>
</table>
<h1 class="title"><img src="images\PathfinderSocietySymbol.gif" title="PFS Legal" style="margin:3px 3px 0px 3px;"> Detect Magic, Greater</h1>
<p>stuff you dont want</p></html>"""
soup = BeautifulSoup(html, 'lxml')
ret = []
tag = soup.find('h3')
while True:
    if tag is None or tag.name == 'h1':
        break
    if isinstance(tag, NavigableString) and not isinstance(tag, Comment):
        ret.append(tag.strip())
    elif tag.text is not None:
        ret.append(tag.text.strip())
    tag = tag.next_sibling
print("\n".join(ret))
Outputs:
Description
You detect magical auras. The amount of information revealed depends on how long you study a particular area or subject.
1st Round
: Presence or absence of magical auras.
2nd Round
: Number of different magical auras and the power of the most potent aura.
3rd Round
: The strength and location of each aura. If the items or creatures bearing the auras are in line of sight, you can make Knowledge (arcana) skill checks to determine the school of magic involved in each. (Make one check per aura: DC 15 + spell level, or 15 + 1/2 caster level for a nonspell effect.) If the aura eminates from a magic item, you can attempt to identify its properties (see Spellcraft).
Magical areas, multiple types of magic, or strong local magical emanations may distort or conceal weaker auras.
Aura Strength
: An aura's power depends on a spell's functioning spell level or an item's caster level; see the accompanying table. If an aura falls into more than one category,
detect magic
indicates the stronger of the two.
Lingering Aura
: A magical aura lingers after its original source dissipates (in the case of a spell) or is destroyed (in the case of a magic item). If detect magic is cast and directed at such a location, the spell indicates an aura strength of dim (even weaker than a faint aura). How long the aura lingers at this dim level depends on its original power:
Original Strength
Duration of Lingering Aura
Faint
1d6 rounds
Moderate
1d6 minutes
Strong
1d6 x 10 minutes
Overwhelming
1d6 days
Outsiders and elementals are not magical in themselves, but if they are summoned, the conjuration spell registers. Each round, you can turn to detect magic in a new area. The spell can penetrate barriers, but 1 foot of stone, 1 inch of common metal, a thin sheet of lead, or 3 feet of wood or dirt blocks it.
Detect magic
can be made permanent with a
permanency
spell.
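The output above starts a new line at every node, so inline elements such as <i> split sentences apart. To get closer to the expected output in the question, one option is to treat only <br> as a line break and glue everything between two breaks together. A sketch under assumptions: the HTML is a shortened stand-in for the question's markup, html.parser is used instead of lxml, and a <table> would still be flattened into a single line by this approach:

```python
from bs4 import BeautifulSoup, Comment, Tag

# Shortened stand-in for the markup in the question.
html = """<h3 class="framing">Description</h3>
You detect magical auras.
<br>
<br>
<i>1st Round</i>: Presence or absence of magical auras.
<br>
<br>
<i>Aura Strength</i>: If an aura falls into more than one category, <i>detect magic</i> indicates the stronger of the two.
<br>
<h1 class="title"> Detect Magic, Greater</h1>"""

soup = BeautifulSoup(html, "html.parser")
lines, current = [], []


def flush():
    """Emit the text gathered since the last <br> as one line."""
    text = " ".join("".join(current).split())
    if text:
        lines.append(text)
    current.clear()


# Start after the <h3> so the heading itself is not included.
node = soup.find("h3").next_sibling
while node is not None and not (isinstance(node, Tag) and node.name == "h1"):
    if isinstance(node, Comment):
        pass                              # ignore HTML comments
    elif isinstance(node, Tag) and node.name == "br":
        flush()                           # a <br> ends the current line
    elif isinstance(node, Tag):
        current.append(node.get_text())   # inline tag such as <i>
    else:
        current.append(str(node))         # tagless text
    node = node.next_sibling
flush()

print("\n".join(lines))
```

Because consecutive <br> tags produce empty chunks that are dropped, each logical paragraph ends up on exactly one line.
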

Value doesn't show up when retrieved using requests.get

I'm using Python to scrape a retailer's website for its HTML. I'm looking for the data and attributes of their air conditioning products, such as Energy Efficiency, Constant or Variable Type, etc. So I used requests.get(), and afterwards I plan to filter the data using regex or bs4.
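import requests
# portals is the set of product-page URLs compiled earlier
file_number = 0
for portal in portals:
    item = requests.get(portal)
    item_text = item.text
    file_number += 1
    # zfill is a str method, so the counter has to be converted first
    file_name = "blah" + str(file_number).zfill(4) + ".txt"
    with open(file_name, "w", encoding="utf8") as file:
        file.write(item_text)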
I could retrieve all the HTML pages from the set() I've compiled. However, the product price is missing. This piece of information is present if I go to the page and right-click --> inspect.
The example below is just one instance of the differences. The two files are the same, except that all references to the prices are omitted. (Just a wild guess: the price could appear slightly differently depending on who's shopping, which is why they're stored separately somehow.)
I'd also be glad to hear any suggestions for improving the code; I'm brand new to Python!
requests.get() version of info
<div class="p-price">
<strong class="J-p-32965125681"></strong> <span>X <span class="J-buy-num"></span></span>
</div>
vs
right-click --> inspect version of info
<div class="p-price">
<strong class="J-p-32965125681">￥3499.00</strong> <span>X <span class="J-buy-num"></span></span>
</div>
Thank you so much!
By the way, as a disclaimer, the site's robots.txt says:
User-agent: *
Disallow: /?*
And I'm not crawling any page that has "?" in its URL, so...

Web scraping is tricky!
At first glance, the values seem to be added via JavaScript. In that case, you'll need a headless browser or an extension to scrape the DOM after the page has finished loading, rather than the skeleton HTML page that the site serves.
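Before reaching for a headless browser, it can help to confirm that the value really is absent from the HTML that requests receives. A small sketch, assuming the two snippets from the question as input; price_is_missing is a hypothetical helper, and the lookup simply mirrors the markup shown there:

```python
from bs4 import BeautifulSoup

# The two versions of the price block quoted in the question.
static_html = """<div class="p-price">
<strong class="J-p-32965125681"></strong> <span>X <span class="J-buy-num"></span></span>
</div>"""

rendered_html = """<div class="p-price">
<strong class="J-p-32965125681">￥3499.00</strong> <span>X <span class="J-buy-num"></span></span>
</div>"""


def price_is_missing(html):
    """Return True when the price element is absent or empty."""
    soup = BeautifulSoup(html, "html.parser")
    box = soup.find("div", class_="p-price")
    strong = box.find("strong") if box else None
    return strong is None or not strong.get_text(strip=True)


print(price_is_missing(static_html))    # skeleton page fetched by requests
print(price_is_missing(rendered_html))  # DOM after the scripts have run
```

Pages for which the helper returns True are the ones that need a browser-driven fetch rather than a plain requests.get().
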

Scratches appear on vector logo on Safari

I'm developing an e-commerce platform for one of my customers and have to implement a retina-responsive logo (vector format, SVG). Everything works fine in Firefox, Chrome and IE, but in Safari (iOS and OS X) many scratches appear. Here is a screenshot:
http://i.imgur.com/mpGdXD8.png?1
Regarding the code:
<div class="row">
<div id="header_logo" class="col-xs-12 col-sm-{4+$warehouse_vars.logo_width}{if isset($warehouse_vars.logo_position) && !$warehouse_vars.logo_position} col-sm-push-{4-$warehouse_vars.logo_width/2} centered-logo {/if}">
<a href="{if $force_ssl}{$base_dir_ssl}{else}{$base_dir}{/if}" title="{$shop_name|escape:'html':'UTF-8'}">
<img class="logo img-responsive" src="{$img_dir}/logo.svg" data-fallback="{$logo_url}" alt="{$shop_name|escape:'html':'UTF-8'}" />
</a>
</div>
The e-commerce solution is based on PrestaShop 1.6.0.14.
The logo was initially built in Photoshop and then vectorized with Inkscape.
Many Thanks for your help.

I recreated the logo for you with Inkscape:
http://www.mediafire.com/download/5adsdwl9a3zeidu/Mytech_Logo.svg
Download it and test it; it should work.
When you vectorize a bitmap with Inkscape, it usually creates many layers with different colors, which makes the file very large. Compare the file size of your logo with the one I sent you, which is around 10 KB. Vectorized files can also lag in some browsers because of all those layers and the background transparency; IE9 and some other browsers probably have this problem too.
Instead, try to create the logo directly in Inkscape. Vectorizing is usually not the best option if you want to use the result on the web, for the reasons above:
It creates very large files.
If you use the "smooth" option, the borders will look blurred.
If you don't use it, the edges will look rough, which defeats the purpose of a retina logo: in the end there won't be any difference between a .jpg and an .svg file.
Hope this helps!

How to load a html text in to textfield in iOS?

I am new to iOS development. I want to load HTML text into a text view with one hyperlink. My text looks like this:
<html>
<head>
<meta charset="UTF-8">
<title>Untitled Document</title>
</head>
<body>
<p>As the name suggest itself “Trueman India” will cover icons of India. Our national Magazine “Trueman India” is an expansion to our business, it is an addition to our Trueman group of Companies. Trueman group of companies was started initially with the Diamond business with the company name DTC Diamonds and Apple Diamonds in the year 1975. Later with the growth of the company, we entered into entertainment Industry under the banner, “Trueman Entertainment” in the year 2010 and “Trueman Theatres in the year 2012. Today “TRUEMAN” group of companies consists of Trueman Entertainment, Trueman Theatres and Trueman Foundation which has been our recent discovery. Now we have come with print media called “Trueman India” a National Magazine which will cover Bollywood news, Interviews of Celebrities and some part of it would be dedicated for our real heroes “Policemen” and Politicians. </p><p> If you are interested in advertising on Trueman India, click here for ad rates and details. Your input is so important to the success of this site for you and the real true man around the world. </p>
</body>
</html>
I know that UIWebView is the better option, but when I use it the hyperlink opens in the same web view, not in Safari. I want the link to open in Safari.
Please give me a solution.

You can catch the hyperlink tap in webView:shouldStartLoadWithRequest:navigationType:, return NO, and then open the link in Safari.
