Start scraping only after and before certain element [closed] - python-3.x

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
Here's what the HTML looks like:
<h4>Categories</h4>
<ul>
<li>Cars</li>
<li>Bikes</li>
<li>Planes</li>
</ul>
<h4>Brands</h4>
<ul>
<li>Audi</li>
<li>BMW</li>
<li>Mercedes</li>
</ul>
<h4>FAQ</h4>
<ul>
<li>FAQ1</li>
<li>FAQ2</li>
<li>FAQ3</li>
</ul>
I'm trying to extract only the brands using Scrapy. Nothing distinguishes the categories section from the brands section except the <h4> that begins each one. There are also many categories and brands, so hardcoding them is impractical.

You can use the following or following-sibling axis.
For instance, to get the brands, locate the desired h4 element by its text and then select the first ul sibling after it with following-sibling:
//h4[. = 'Brands']/following-sibling::ul[1]/li/text()
Demo from the Scrapy shell:
$ scrapy shell ./index.html
>>> response.xpath("//h4[. = 'Brands']/following-sibling::ul[1]/li/text()").extract()
['Audi', 'BMW', 'Mercedes']
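Since the question mentions that there are many sections and hardcoding is impractical, the same "each h4 starts a new section" idea can also be applied to every heading at once. The following is only a sketch, not Scrapy code: it uses Python's standard-library html.parser instead of XPath, and the names SectionParser and sections_by_heading are hypothetical helpers made up for this illustration. It assumes the page has the flat h4/ul/li layout shown in the question.

```python
from html.parser import HTMLParser

# Markup copied from the question.
HTML = """
<h4>Categories</h4>
<ul>
<li>Cars</li>
<li>Bikes</li>
<li>Planes</li>
</ul>
<h4>Brands</h4>
<ul>
<li>Audi</li>
<li>BMW</li>
<li>Mercedes</li>
</ul>
<h4>FAQ</h4>
<ul>
<li>FAQ1</li>
<li>FAQ2</li>
<li>FAQ3</li>
</ul>
"""


class SectionParser(HTMLParser):
    """Group <li> texts under the most recent <h4> heading (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sections = {}    # heading text -> list of item texts
        self._heading = None  # text of the last <h4> seen
        self._tag = None      # "h4" or "li" while inside one, else None

    def handle_starttag(self, tag, attrs):
        self._tag = tag if tag in ("h4", "li") else None

    def handle_endtag(self, tag):
        self._tag = None

    def handle_data(self, data):
        text = data.strip()
        if not text or self._tag is None:
            return
        if self._tag == "h4":
            # A new heading opens a new section.
            self._heading = text
            self.sections.setdefault(text, [])
        elif self._heading is not None:
            # An <li> belongs to whichever heading came last.
            self.sections[self._heading].append(text)


def sections_by_heading(html):
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    return parser.sections


sections = sections_by_heading(HTML)
print(sections["Brands"])
```

With the sections collected this way, the brands are just one dictionary lookup, and new headings need no code changes. Inside a Scrapy spider, the XPath above remains the more direct route.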

Related

Scrape the salary from indeed.com [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 4 years ago.
I am trying to scrape salary from indeed.com using beautiful soup.
The salary is given as:
<div id="vjs-jobinfo">
<div id="vjs-jobtitle">Senior Data Scientist/ Machine learning engineer</div>
<div>
<span id="vjs-cn">Intellify</span>
<span id="vjs-loc"> - Sydney NSW</span>
</div>
<div>
<span>$120,000 - $160,000 a year</span>
-
<span>Full-time, Part-time</span>
</div>
</div>
My solution:
new_soup = BeautifulSoup(new_html, 'html.parser', from_encoding='utf-8')
for titles in new_soup.find_all('div',{'id':'vjs-jobtitle'}):
    print(titles.text)
    print('\n')
for company_name in new_soup.find_all('span',{'id':'vjs-cn'}):
    print(company_name.text)
    print('\n')
for company_location in new_soup.find_all('span',{'id':'vjs-loc'}):
    print(company_location.text)
But I can't get the salary, because its span has no attributes to select on. Can anyone help, please?
You can use CSS-style selectors like:
new_soup.select_one("div#vjs-jobinfo div:nth-of-type(3)").findChild().text
Another option: since you know that the salary is in the third span tag, you can access it directly by index:
all_span=new_soup.find_all("span")
salary=all_span[2].getText()
#$120,000 - $160,000 a year
EDIT: Since you know the salary must start with a dollar sign, you can also find it with a regex:
import re
salary=new_soup.find('span', text=re.compile(r'^\$')).getText()
#$120,000 - $160,000 a year
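Once the salary text has been extracted, the regex idea can be pushed one step further to turn the string into numbers. This is a minimal sketch using only the standard-library re module; parse_salary and SALARY_RE are hypothetical names, and the pattern assumes the salary is formatted like the example in the question ("$low - $high a period"), which may not hold for every posting.

```python
import re

# Assumed format: "$120,000 - $160,000 a year" or a single figure like "$35 an hour".
SALARY_RE = re.compile(r"^\$([\d,]+)(?:\s*-\s*\$([\d,]+))?\s+an?\s+(\w+)$")


def parse_salary(text):
    """Return (low, high, period), or None if the text is not in the assumed format."""
    match = SALARY_RE.match(text.strip())
    if match is None:
        return None
    low = int(match.group(1).replace(",", ""))
    # A single figure is treated as a range with equal bounds.
    high = int(match.group(2).replace(",", "")) if match.group(2) else low
    return low, high, match.group(3)


print(parse_salary("$120,000 - $160,000 a year"))
print(parse_salary("Full-time, Part-time"))
```

Returning None for non-matching text makes it easy to skip spans such as "Full-time, Part-time" that sit next to the salary.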

kentico site and DesignMode css conflicts [closed]

Closed. This question needs details or clarity. It is not currently accepting answers.
Closed 6 years ago.
I add my custom CSS (all Compass/SASS based) to the site from the master page, rather than including it in the site settings. In Design View, my custom CSS overrides the styles from DesignMode.css.
My master page is loading in my compiled CSS this way:
<link href="/CMSPages/GetResource.ashx?stylesheetfile=/KFF/SalesForce/main.css" type="text/css" rel="stylesheet" />
What is the best way to isolate my CSS from DesignMode.css?
Might want to take a look at this article
I think you can also do something like this in your master page:
{% ViewMode == "LiveSite" ? "StyleSheet Link": "" %}

svg to inline svg data compiler [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Closed 7 years ago.
I was wondering whether any compiler can parse a CSS or SCSS file and replace all references to *.svg files with inline SVG data. For example, this:
body { background-image:
url("/assest/svg/test.svg");
}
would become
body { background-image:
url("data:image/svg+xml;utf8,<svg xmlns='http://www.w3.org/2000/svg' width='10' height='10'><linearGradient id='gradient'><stop offset='10%' stop-color='%23F00'/><stop offset='90%' stop-color='%23fcc'/> </linearGradient><rect fill='url(%23gradient)' x='0' y='0' width='100%' height='100%'/></svg>");
}
I'm looking for a way to make a portable CSS file without any dependencies. So far I have found https://github.com/jkphl/, but my tests did not show that data inlining works. Any ideas?
Compass allows you to inline the image with the inline-image helper...
background-image: inline-image("/assest/svg/test.svg");
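If Compass is not available, the same substitution can be approximated with a small script. The sketch below is not the Compass helper and not an existing tool: inline_svgs and URL_RE are hypothetical names, the regex only handles simple url(...) references ending in .svg, and paths are assumed to resolve relative to a root directory you pass in. It emits base64 data URIs rather than the utf8 form shown in the question.

```python
import base64
import re
import tempfile
from pathlib import Path

# Matches url(x.svg), url("x.svg") and url('x.svg'); anything fancier is left alone.
URL_RE = re.compile(r"""url\(\s*(["']?)([^"')]+\.svg)\1\s*\)""")


def inline_svgs(css, root):
    """Replace url(*.svg) references in css with base64 data URIs.

    References whose file cannot be found under root are left unchanged.
    """
    def replace(match):
        svg_path = Path(root) / match.group(2).lstrip("/")
        if not svg_path.is_file():
            return match.group(0)
        encoded = base64.b64encode(svg_path.read_bytes()).decode("ascii")
        return f'url("data:image/svg+xml;base64,{encoded}")'

    return URL_RE.sub(replace, css)


# Demo with a throwaway directory that mimics the path from the question.
SVG = "<svg xmlns='http://www.w3.org/2000/svg' width='10' height='10'/>"
with tempfile.TemporaryDirectory() as root:
    svg_file = Path(root) / "assest" / "svg" / "test.svg"
    svg_file.parent.mkdir(parents=True)
    svg_file.write_text(SVG, encoding="utf-8")
    css = 'body { background-image: url("/assest/svg/test.svg"); }'
    result = inline_svgs(css, root)

print(result)
```

Base64 avoids having to percent-encode characters such as # inside the SVG, at the cost of a somewhat larger stylesheet.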

Plotting Multiple barcharts using Flot API [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Closed 8 years ago.
Hi, I am using the Flot charts API to show data as bar charts in my application. I have a requirement to show the data in a bar chart grouped by category, with Pre data and Post data, something like the picture below:
Sample Diagram
Please tell me how to structure the data to plot the bar chart.
You need a plugin for this.
Check OrderBars and then use data like:
var series = [];
series.push({
    data: [], // your raw data
    bars: {
        order: 0
    }
});
series.push({
    data: [], // your raw data
    bars: {
        order: 1
    }
});
Example: http://jsfiddle.net/ZRXP5/
My example uses MooTools, but you can find the jQuery version (.js file) in the link above.

Robots.TXT and Meta Tag Robots [closed]

Closed. This question is off-topic. It is not currently accepting answers.
Closed 10 years ago.
I want to make sure I understand this:
This: <meta content="noindex, nofollow" name="robots" /> in the <head> of a webpage
is the same as:
Disallow: /example-page.html in the Robots.txt
Right?
Yes, if you are talking about the <head> of example-page.html. The only difference is that when the restriction is in the meta tag, the spider still requests the page. This can matter if the page is generated by a server-side script and you count how many times it was displayed, or gather any other information about visits to it (from access logs, for example).
A well-behaved bot from a mainstream search engine will access the page, read the meta tag, and then not index it. With the record in robots.txt, the page is not requested at all by the generic spider, or by the one named in the User-agent section of robots.txt.
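The robots.txt half of this can be illustrated with Python's standard-library urllib.robotparser, which is roughly how a well-behaved crawler decides whether to request a URL at all. This is only a sketch: the bot name ExampleBot and the example.com URLs are made up, and the rules are fed in as a list of lines instead of being downloaded from a real site.

```python
from urllib.robotparser import RobotFileParser

# Rules equivalent to the robots.txt record from the question,
# applied to every crawler via "User-agent: *".
rules = RobotFileParser()
rules.parse([
    "User-agent: *",
    "Disallow: /example-page.html",
])

# A polite crawler runs this check BEFORE making any request.
blocked = rules.can_fetch("ExampleBot", "https://example.com/example-page.html")
allowed = rules.can_fetch("ExampleBot", "https://example.com/other-page.html")

print(blocked)  # False: the page is never requested
print(allowed)  # True: the page may be fetched
```

A meta robots tag has no counterpart to this check: the crawler only sees the tag after it has fetched and parsed the page, which is why the request still shows up in the access logs.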

Resources