Search entire text for images - string

I have a problem with a project.
I need to search a string for images.
I want to get the source of the image and modify the html form of the img tag.
For example the image form is:
and I want to change it to:
<div class="col-md-3">
<hr class="visible-sm visible-xs tall" />
<a class="img-thumbnail lightbox pull-left" href="upload/uploader/up_164.jpg" data-plugin-options='{"type":"image"}' title="Image title">
<img class="img-responsive" width="215" src="upload/uploader/up_164.jpg"><span class="zoom"><i class="fa fa-search"></i>
</span></a>
I have done part of this.
I can find the image and change its HTML, but I cannot loop this over all the images found in the string.
My code goes like this.
Using the following function, I get the substring between two strings:
// Get substring between
function GetBetween($var1="",$var2="",$pool){
$temp1 = strpos($pool,$var1)+strlen($var1);
$result = substr($pool,$temp1,strlen($pool));
$dd=strpos($result,$var2);
if($dd == 0){
$dd = strlen($result);
}
return substr($result,0,$dd);
}
And then I get the image tag from the string
$imageFile = GetBetween("img","/>",$newText);
The next step was to extract the source of the image:
$imageSource = GetBetween('src="','\"',$imageFile);
And for the last part I call str_replace to do the job:
$newText = str_replace('oldform', 'newform', $newText);
The problem is that when there is more than one image, I cannot loop this process.
Thank you in advance.

The simplest and safest way to read XML is to use an XML parser.
I think it will also save you a lot of time.
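To illustrate the parser approach, here is a minimal sketch in Python using only the standard library (in PHP, DOMDocument with getElementsByTagName plays a similar role). The TEMPLATE string, the ImgCollector class and the rewrite_images function are hypothetical names made up for this example, and the template is a shortened version of the target markup from the question. The parser reports every img tag, so the loop over all images comes for free.

```python
from html.parser import HTMLParser

# Hypothetical, shortened version of the target markup from the question.
TEMPLATE = (
    '<a class="img-thumbnail lightbox pull-left" href="{src}" title="Image title">'
    '<img class="img-responsive" width="215" src="{src}"></a>'
)


class ImgCollector(HTMLParser):
    """Records (original tag text, src) for every <img> tag in the input."""

    def __init__(self):
        super().__init__()
        self.images = []

    def handle_starttag(self, tag, attrs):
        # Tags written as <img ... /> also end up here, because the default
        # handle_startendtag calls handle_starttag.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append((self.get_starttag_text(), src))


def rewrite_images(text):
    """Replaces every <img> found by the parser with the template markup."""
    collector = ImgCollector()
    collector.feed(text)
    collector.close()
    for original, src in collector.images:
        text = text.replace(original, TEMPLATE.format(src=src))
    return text


sample = '<p>one <img src="a.jpg" /> two <img src="b.jpg"> three</p>'
print(rewrite_images(sample))
```

The sketch swaps tags by plain string replacement to stay short. A DOM-based rewrite would be more robust if the same tag text can also appear inside comments or scripts.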

Related

Generate Selector from source code, for Scrapy

I am trying to create a CSS selector from the source code of a dynamic web page. I have tried with no results with:
response.css('seller-info#region *::text').get()
response.css('seller-info > region *::text').get()
response.css('.seller-info#region ::text').get()
response.css('seller-info#region ::text').get()
response.css('seller-info > region ::text').get()
response.css('seller-info:contains("to extract")::text').get()
response.css('.seller-info:contains("to extract")::text').get()
response.css('.seller-info:contains("to extract") *::text').get()
response.css('seller-info:contains("to extract") *::text').get()
Response of each: "None"
I need the text: "to extract"
*The region name is repeated in other code trees
Source code
<seller-info
username='glorious'
ispro='true'
region="to extract"
phoneurl='/pg/0.gif"'
storeurl=""
seniority=''
category="1220"
phonevisible='true'
>
<div slot="avatar">
<div class="seller-info__header--icon-container">
<i class="icon-yapo icon-briefcase "></i>
</div>
</div>
</seller-info>
The data you are trying to extract from your source code is a tag attribute value, not tag text:
region = response.css("seller-info[region]::attr(region)").get()
or:
region = response.css("seller-info::attr(region)").get()
Selectors like tagname::text are meant to extract the text between opening and closing tags, as in <tagname> text to extract </tagname>.
Your <seller-info> tag has no text of its own to select. Like an img tag, it stores its data in its attributes.
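If you want to confirm this outside Scrapy, a rough check with Python's standard html.parser shows where the value lives. This is only a sketch: the AttrAndTextProbe class is a hypothetical helper, and the markup is a shortened copy of the source from the question.

```python
from html.parser import HTMLParser


class AttrAndTextProbe(HTMLParser):
    """Collects the attributes of <seller-info> and any non-blank text nodes."""

    def __init__(self):
        super().__init__()
        self.attrs = {}
        self.texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "seller-info":
            self.attrs = dict(attrs)

    def handle_data(self, data):
        # Whitespace between tags is ignored; only real text is recorded.
        if data.strip():
            self.texts.append(data.strip())


# Shortened copy of the markup from the question.
html = """<seller-info username='glorious' region="to extract">
<div slot="avatar"><i class="icon-yapo icon-briefcase "></i></div>
</seller-info>"""

probe = AttrAndTextProbe()
probe.feed(html)
probe.close()
print(probe.attrs.get("region"))  # the value is stored in the attribute
print(probe.texts)                # text nodes that ::text could have matched
```

The text list stays empty for this markup, which is why every ::text selector in the question returned None.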

Selenium Can't Find Element Returning None or []

I'm having trouble accessing an element. Here is my code:
driver.get(url)
desc = driver.find_elements_by_xpath('//p[@class="somethingcss xxx"]')
I also tried another method, like this:
desc = driver.find_elements_by_class_name('somethingcss xxx')
The element I am trying to find looks like this:
<div data-testid="descContainer">
<div class="abc1123">
<h2 class="xxx">The Description<span data-tid="prodTitle">The Description</span></h2>
<p data-id="paragraphxx" class="somethingcss xxx">sometext here
<br>text
<br>
<br>text
<br> and several text with
<br> tag below
</p>
</div>
<!--and another div tag below-->
I want to extract the p tag inside div class="abc1123", but I get no result: it only returns [] when I try get_attribute or try to extract its text.
When I extract another element with a different class using this method, it works perfectly.
Does anyone know why I can't access these elements?
Try the following CSS selector to locate the p tag:
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").text)
Or use get_attribute("textContent"):
print(driver.find_element_by_css_selector("p[data-id^='paragraph'][class^='somethingcss']").get_attribute("textContent"))

How to click on Web check box using Excel VBA?

How do I check the table checkbox?
I tried clicking.
ie.Document.getElementsByClassName("x-grid3-hd-checker").Checked = True
<div class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" unselectable="on" style="">
<a class="x-grid3-hd-btn" href="#"></a>
<div class="x-grid3-hd-checker"> </div>
<img class="x-grid3-sort-icon" src="/javascript/extjs/resources/images/default/s.gif">
</div>
I can't see a checkbox in the HTML code. But you are using getElementsByClassName() the wrong way for your case. getElementsByClassName() returns a node collection. If you need a specific node, you must get it by its index in the node collection. The first element has index 0.
Please note that the div tag with class="x-grid3-hd-inner x-grid3-hd-checker x-grid3-hd-checker-on" is also included in the node collection, because x-grid3-hd-checker is one of its space-separated class names.
If you want to check this:
<div class="x-grid3-hd-checker"> </div>
Your code needs the second element of the node collection, which has index 1:
ie.Document.getElementsByClassName("x-grid3-hd-checker")(1).Checked = True
But if there are more tags with the class name "x-grid3-hd-checker", the above line won't work. I can't say any more until you post more HTML and VBA code. A link to the site would be best.

Python 3 BeautifulSoup4 search for text in source page

I want to search for every '1' in the source code and print its location, e.g. <div id="yeahboy">1</div>. The '1' could be replaced by any other string. I want to see the tag around that string.
Consider this context, for example (*):
from bs4 import BeautifulSoup
html = """<root>
<div id="yeahboy">1</div>
<div id="yeahboy">2</div>
<div id="yeahboy">3</div>
<div>
<span class="nested">1</span>
</div>
</root>"""
soup = BeautifulSoup(html, "html.parser")
You can use find_all(), passing True to indicate that you want only element nodes (instead of their child text nodes) and text="1" to indicate that the element must have text content equal to "1" (or any other text you want to search for):
for element1 in soup.find_all(True, text="1"):
    print(element1)
Output :
<div id="yeahboy">1</div>
<span class="nested">1</span>
*) For OP: in future questions, try to give a context, just like the example above. That makes your question more concrete and easier to answer, because people don't have to create a context on their own, which may turn out not to be relevant to the situation you actually have.

HTMLPurifier allow attributes

I'm trying to stop HTMLPurifier from stripping tag attributes, but without success so far, and I'm going crazy.
$config = HTMLPurifier_Config::createDefault();
$config->set('Core.Encoding', 'UTF-8');
$config->set('Core.CollectErrors', true);
$config->set('HTML.TidyLevel', 'medium');
$config->set('HTML.Doctype', 'XHTML 1.0 Transitional');
$config->set('URI.DisableExternalResources', false);
$config->set('HTML.Allowed', 'table[border|width|style],tbody,tr,td,th,img[style|src|alt],span[style],p[style],ul,ol,li,strong,em,sup,sub');
$PHTML = new HTMLPurifier($config);
echo htmlspecialchars($PHTML->purify($html));
// The input string:
"Some <span style="text-decoration: underline;">cool text</span> <img src="http://someurl.com/images/logo.png" alt="" />.
// The output string:
"Some <span>cool text</span> <img src="%5C" alt="" />.
I want to allow the given attributes for the specified elements, as defined in the HTML.Allowed option.
Turn off magic quotes. (Note the %5C, which is a URL-encoded backslash.)
Bit of a late suggestion, but I've run into a similar issue with HTMLPurifier stripping style attributes even though they were configured in the HTML.Allowed setting.
The solution I found requires that you also configure CSS.AllowedProperties which looks a bit like this:
$config->set('CSS.AllowedProperties', 'text-align,text-decoration,width,height');
Use this in conjunction with HTML.Allowed:
$config->set('HTML.Allowed', 'img[src|alt|style],span[style]');
I hope someone else finds this useful. You can read more about CSS.AllowedProperties in the HTMLPurifier configuration documentation.
