python3 soup: replace HTML element content and save to file - python-3.x

How can I replace the text content of an HTML tag in a file and save the result to another file?
For example, there is a file index.html:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<p itemprop="someprop">SOME BIG TEXT</p>
</body>
</html>
I need to replace the text "SOME BIG TEXT" in the "p" tag with "ANOTHER BIG TEXT". This is what I have tried:
from bs4 import BeautifulSoup

with open("index.html", "r") as file:
    fcontent = file.read()

sp = BeautifulSoup(fcontent, 'lxml')
t = 'new_text_for_replacement'
print(sp.replace(sp.find(itemprop="someprop").text, t))
What am I doing wrong?
Thank you

Use open() on the output file to write to it.
from bs4 import BeautifulSoup

with open('index.html', 'r') as file:
    fcontent = file.read()

sp = BeautifulSoup(fcontent, 'html.parser')
t = 'new_text_for_replacement'

# replace the paragraph using the `replace_with` method
sp.find(itemprop='someprop').replace_with(t)

# open another file for writing
with open('output.html', 'w') as fp:
    # write the current soup content
    fp.write(sp.prettify())
If you want to replace just the inner content of the paragraph instead of the paragraph element itself, you can set the .string property.
sp.find(itemprop='someprop').string = t
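For completeness, a minimal end-to-end sketch of that variant (same file names as above; output.html is just an example output path):

from bs4 import BeautifulSoup

with open('index.html', 'r') as file:
    sp = BeautifulSoup(file.read(), 'html.parser')

# replace only the text inside the tag, keeping the <p> element and its attributes
sp.find(itemprop='someprop').string = 'ANOTHER BIG TEXT'

with open('output.html', 'w') as fp:
    fp.write(str(sp))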

The problem lies in the way you are searching for the criteria. Try changing the following code:
print(sp.replace(sp.find(itemprop="someprop").text,t))
to this:
print(sp.replace(sp.find({"itemprop":"someprop"}).text,t))
hopefully, this helps
(PS: based on your question, I'm assuming that you only have one thing to replace)
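For reference, a runnable sketch of this idea: BeautifulSoup objects have no .replace() method, so this variant passes the attribute filter via attrs= and performs the replacement with replace_with() (my assumption of what the answer above intends):

from bs4 import BeautifulSoup

with open("index.html", "r") as file:
    sp = BeautifulSoup(file.read(), "lxml")

t = 'new_text_for_replacement'
# find the tag by its itemprop attribute and swap its content for the new text
sp.find(attrs={"itemprop": "someprop"}).replace_with(t)
print(sp)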

Related

Selenium handles '@' character very strangely

I'm trying to use Selenium in Python with the standard and the undetected_chrome drivers, in normal and headless browser modes.
I've noticed some strange behaviors:
- the normal chrome driver works fine with special inputs while sending keys into an HTML input field with the send_keys() function
- the undetected_chrome driver does not handle special inputs very well with the send_keys() function
- in normal browser mode, if I send an email address into an HTML input field, like 'abc@xyz.com', the current content of the clipboard is pasted in place of the '@' character
- e.g.: if the last string I copied was '123', the entered email address will not be 'abc@xyz.com' but 'abc123xyz.com', which is obviously incorrect
- that's why I'm using a workaround: I import pyperclip and put the '@' character on the clipboard before the send_keys() function runs, so that the '@' character gets pasted correctly
- in headless browser mode, if I enter an email address into an HTML input field, like 'abc@xyz.com', my workaround no longer matters; the '@' character is stripped from the email and the string 'abcxyz.com' is put into the field
I think sending an '@' character shouldn't be that hard by default. Am I doing something wrong here?
Questions:
Can anyone explain these strange behaviors?
How could I send an email address correctly with headless browser mode? (I need to use undetected_chrome driver because of bots)
from selenium import webdriver

self.chrome_options = webdriver.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
prefs = {'download.default_directory': os.path.realpath(self.download_dir_path)}
self.chrome_options.add_experimental_option('prefs', prefs)
self.driver = webdriver.Chrome(options=self.chrome_options)

import undetected_chromedriver as uc

# workaround for the problem of the clipboard content being pasted in place of the '@' symbol when using send_keys()
import pyperclip
pyperclip.copy('@')  # <---- THIS IS THE WORKAROUND FOR THE PASTING PROBLEM

self.chrome_options = uc.ChromeOptions()
if self.headless:
    self.chrome_options.add_argument('--headless')
    self.chrome_options.add_argument('--disable-gpu')
self.driver = uc.Chrome(options=self.chrome_options)
params = {
    "behavior": "allow",
    "downloadPath": os.path.realpath(self.download_dir_path)
}
self.driver.execute_cdp_cmd("Page.setDownloadBehavior", params)
The versions I'm using (requirements.txt):
selenium==4.3.0
undetected-chromedriver==3.1.5.post4
Instead of using custom configuration options, try a more basic variant first and see if it works correctly. Then figure out which option combination is causing the issue.
An example (which does work correctly, by the way):
import undetected_chromedriver as uc
from selenium.webdriver.common.by import By
driver = uc.Chrome()
driver.execute_script("""
document.documentElement.innerHTML = `
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Document</title>
<style>
html, body, main {
height: 100%;
}
main {
display: flex;
justify-content: center;
}
</style>
</head>
<body>
<main>
<form>
<input type="text" name="text" />
<input type="email" name="email"/>
<input type="tel" name="tel"/>
</form>
</main>
</body>
</html>`
""")
driver.find_element(By.NAME, "text").send_keys("info#example.com")
driver.find_element(By.NAME, "email").send_keys("info#example.com")
driver.find_element(By.NAME, "tel").send_keys("info#example.com")
This complaint (entering "@" pastes the clipboard contents) has come up here from time to time, but I've never seen a definitive solution. I'd suspect a particular language version of Windows and/or keyboard driver.
The Windows language is the problem, more precisely the keyboard layout, as it changes depending on the language. Switch it to "ENG" and it should work fine. Keyboard layouts where the letter Z is between "T" and "U" (QWERTZ) do not work with the undetected driver; keyboard layouts that have Z next to "X" (QWERTY) work fine.
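If switching the OS keyboard layout is not an option, a commonly used workaround (a sketch, not taken from the answers above) is to bypass send_keys() for the problematic field and set its value from JavaScript, then dispatch an input event so the page notices the change:

from selenium.webdriver.common.by import By

# illustrative only: the field name and email address are made up;
# set the value directly, avoiding send_keys() keyboard-layout handling
email_field = driver.find_element(By.NAME, "email")
driver.execute_script(
    "arguments[0].value = arguments[1];"
    "arguments[0].dispatchEvent(new Event('input', { bubbles: true }));",
    email_field,
    "abc@xyz.com",
)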

Beautifulsoup: filter "find_all" results, limited to .jpeg files via regex

I would like to acquire some pictures from a forum. The find_all results give me most of what I want, which are jpeg files. However, it also gives me a few gif files, which I do not want. Another problem is that the gif file is an attachment, not a valid link, and it causes trouble when I save the files.
soup_imgs = soup.find(name='div', attrs={'class':'t_msgfont'}).find_all('img', alt="")
for i in soup_imgs:
    src = i['src']
    print(src)
I tried to exclude the gif files from my find_all search, but it was no use; both the jpeg and gif files are in the same section. What should I do to filter my results? Please give me some help, chief. I am pretty much an amateur at coding; playing with Python is just a hobby of mine.
You can filter it via a regular expression. Please refer to the following example. Hope this helps.
import re
from bs4 import BeautifulSoup
data='''<html>
<body>
<h2>List of images</h2>
<div class="t_msgfont">
<img src="img_chania.jpeg" alt="" width="460" height="345">
<img src="wrongname.gif" alt="">
<img src="img_girl.jpeg" alt="" width="500" height="600">
</div>
</body>
</html>'''
soup=BeautifulSoup(data, "html.parser")
soup_imgs = soup.find('div', attrs={'class':'t_msgfont'}).find_all('img', alt="" ,src=re.compile(".jpeg"))
for i in soup_imgs:
    src = i['src']
    print(src)
Try the following, which I suspect you can shorten. It uses the ends-with operator ($) to specify that the src attribute value of the child img elements ends with .jpg (edited from jpeg to jpg in light of the OP's comment that it is actually jpg):
srcs = [item['src'] for item in soup.select("div.t_msgfont img[alt=''][src$='.jpg']")]
Have a look at shortening the selector (I can't without seeing the HTML in question); you may well get away with something like
srcs = [item['src'] for item in soup.select(".t_msgfont [alt=''][src$='.jpg']")]
or even
srcs = [item['src'] for item in soup.select(".t_msgfont [src$='.jpg']")]
I would suggest you use requests-html to find the image resources on the page.
It's pretty simple compared to BeautifulSoup + requests.
Here's the code to do it.
from requests_html import HTMLSession

session = HTMLSession()
resp = session.get(url)  # url is the address of the forum page to scrape

for i in resp.html.absolute_links:
    if i.endswith('.jpeg'):
        print(i)

beautifulsoup get value of attribute using get_attr method

I'd like to print all items in the list except those whose style attribute contains the following value: "text-align: center".
test = soup.find_all("p")
for x in test:
if not x.has_attr('style'):
print(x)
Essentially, return all items in the list where style is not equal to "text-align: center". Probably just a small error here, but is it possible to specify the value of style in has_attr?
Just check if the specific style is present in the Tag's style attribute. style is not considered a multi-valued attribute, and the entire string inside the quotes is the value of the style attribute. Using x.get("style", '') instead of x['style'] also handles the case in which there is no style attribute and avoids a KeyError.
for x in test:
    if 'text-align: center' not in x.get("style", ''):
        print(x)
You can also use list comprehension to skip a few lines.
test=[x for x in soup.find_all("p") if 'text-align: center' not in x.get("style",'')]
print(test)
If you want to consider a different approach, you could use the :not() selector:
from bs4 import BeautifulSoup as bs
html = '''
<html>
<head>
<title>Try jsoup</title>
</head>
<body>
<p style="color:green">This is the chosen paragraph.</p>
<p style="text-align: center">This is another paragraph.</p>
</body>
</html>
'''
soup = bs(html, 'lxml')
items = [item.text for item in soup.select('p:not([style="text-align: center"])')]
print(items)

Using XPath, select node without text sibling

I want to extract some HTML elements with python3 and the HTML parser provided by lxml.
Consider this HTML:
<!DOCTYPE html>
<html>
<body>
<span class="foo">
<span class="bar">bar</span>
foo
</span>
</body>
</html>
Consider this program:
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from lxml import html
tree = html.fromstring('html from above')
bars = tree.xpath("//span[@class='bar']")
print(bars)
print(html.tostring(bars[0], encoding="unicode"))
In a browser, the query selector "span.bar" selects only the span element. This is what I desire. However, the above program produces:
[<Element span at 0x7f5dd89a4048>]
<span class="bar">bar</span>foo
It looks like my XPath does not actually behave like a query selector and the sibling text node is picked up next to the span element. How can I adjust the XPath to select only the bar element, but not the text "foo"?
Notice that the XML tree model in lxml (as well as in the standard module xml.etree) has the concept of a tail: text nodes located after an element (i.e., its following siblings) are stored as the tail of that element. So your XPath correctly returns the span element, but according to the tree model, it has a tail which holds the text 'foo'.
As a workaround, assuming that you don't want to use the tree model further, simply clear the tail before printing:
>>> bars[0].tail = ''
>>> print(html.tostring(bars[0], encoding="unicode"))
<span class="bar">bar</span>

How to replace selected string from content editable div?

I'm trying to replace chapter titles in a contenteditable="true" div tag using Python and selenium-webdriver. First I search for the chapter title, which is usually on the first line, then I replace it with an empty value and save, but the change is not saved after refreshing the browser, even though I can see the code working. Here is my code:
## getting the content editable div tag
input_field = driver.find_element_by_css_selector('.trumbowyg-editor')

### getting the innerHTML of the content editable div
chapter_html = input_field.get_attribute('innerHTML')
chapter_content = input_field.get_attribute('innerHTML')

if re.search('<\w*>', chapter_html):
    chapter_content = re.split('<\w*>|</\w*>', chapter_html)
    first_chapter = chapter_content[1]
    ### replacing first_chapter with ''
    chapter_replace = chapter_html.replace(first_chapter, '')
    ### writing back the innerHTML without the first_chapter string
    driver.execute_script("arguments[0].innerHTML = arguments[1];", input_field, chapter_replace)
    time.sleep(1)

## click on the save button
driver.find_element_by_css_selector('.btn.save-button').click()
How can I handle this? It works when I do it manually (I mean, it probably can't be a site problem/bug)... Please help.
The relevant HTML is the following:
<div class="trumbowyg-editor" dir="ltr" contenteditable="true">
<p>Chapter 1</p>
<p> There is some text</p>
<p> There is some text</p>
<p> There is some text</p>
</div>
As per the HTML you have shared, to replace the chapter title with an empty value you have to induce WebDriverWait with the expected_conditions clause set to visibility_of_element_located, and you can use the following block of code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

page_number = WebDriverWait(driver, 20).until(EC.visibility_of_element_located((By.XPATH, "//div[@class='trumbowyg-editor' and @contenteditable='true']/p[contains(.,'Chapter')]")))
driver.execute_script("arguments[0].removeAttribute('innerHTML')", page_number)
# or
driver.execute_script("arguments[0].removeAttribute('innerText')", page_number)
# or
driver.execute_script("arguments[0].removeAttribute('textContent')", page_number)
