I'm having trouble parsing an HTML element inside an li tag.
This is my code:
from bs4 import BeautifulSoup
import requests
sess = requests.Session()
url = 'http://example.com'
page = sess.get(url)
page = BeautifulSoup(page.text)
soap = page.select('li.item')
print(soap.find('h3').text)
This is html code:
...
<li class="item">
<strong class="item-type">design</strong>
<h3 class="item-title">Item title</h3>
<p class="item-description">
Lorem ipsum dolor sit amet, dicam partem praesent vix ei, ne nec quem omnium cotidieque, omnes deseruisse efficiendi sit te. Mei putant postulant id. Cibo doctus eligendi at vix. Eos nisl exerci mediocrem cu, nullam pertinax petentium sea et. Vim affert feugait an.
</p>
</li>
...
There are more than 10 li tags; I just pasted one of them.
Output error:
Traceback (most recent call last):
File "test.py", line 10, in <module>
print(soap.find('h3').text)
AttributeError: 'list' object has no attribute 'find'
Thanks to @DaveJ, this method worked:
[s.find('h3').text for s in soap]
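For reference, select() always returns a list of matching tags, even when only one element matches, which is why the list comprehension above works. A minimal sketch of both options, assuming the same HTML as above:

from bs4 import BeautifulSoup

html = '<li class="item"><h3 class="item-title">Item title</h3></li>'
soup = BeautifulSoup(html, 'html.parser')

# select() returns a list of Tags, so iterate over it...
for item in soup.select('li.item'):
    print(item.find('h3').text)

# ...or use select_one(), which returns the first match (or None).
first = soup.select_one('li.item')
if first is not None:
    print(first.find('h3').text)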
My code is not printing all the strings I want, and I am unsure how to change that.
I am trying to scrape all the strings, including values like 460 hp # 7000 rpm, which it is currently not scraping. Ideally the strings in the strong elements are kept separate. I have tried adding another .next_sibling and changing the br to p and strong; these just return an error.
The HTML is as follows:
<div class="specs-content">
<p>
<strong>Displacement:</strong>
" 307 cu in, 5038 "
<br>
<strong>Power:</strong>
" 460 hp # 7000 rpm "
<br>
<strong>Torque:</strong>
" 420 lb-ft # 4600 rpm "
</p>
<p>
<strong>TRANSMISSION:</strong>
" 10-speed automatic with manual shifting mode "
</p>
<p>
<strong>CHASSIS</strong>
<br>
" Suspension (F/R): struts/multilink "
<br>
" Brakes (F/R): 15.0-in vented disc/13.0-in vented disc "
<br>
" Tires: Michelin Pilot Sport 4S, F: 255/40ZR-19 (100Y) R: 275/40ZR-19 (105Y) "
</p>
</div>
I have written the following code thus far:
import requests
from bs4 import BeautifulSoup
URL = requests.get('https://www.LinkeHere.com')
soup = BeautifulSoup(URL.text, 'html.parser')
FindClass = soup.find(class_='specs-content')
FindElement = FindClass.find_all('br')
for Specs in FindElement:
    Specs = Specs.next_sibling
    print(Specs.string)
This returns:
Power:
Torque:
Suspension (F/R): struts/multilink
Brakes (F/R): 13.9-in vented disc/13.0-in vented disc
Tires: Michelin Pilot Sport 4S, 255/40ZR-19 (100Y)
You can use the get_text() method, passing a newline \n as the separator argument:
from bs4 import BeautifulSoup
html = """THE ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all(class_="specs-content"):
    print(tag.get_text(strip=True, separator="\n").replace('"', ""))
Output:
Displacement:
307 cu in, 5038
Power:
460 hp # 7000 rpm
Torque:
420 lb-ft # 4600 rpm
TRANSMISSION:
10-speed automatic with manual shifting mode
CHASSIS
Suspension (F/R): struts/multilink
Brakes (F/R): 15.0-in vented disc/13.0-in vented disc
Tires: Michelin Pilot Sport 4S, F: 255/40ZR-19 (100Y) R: 275/40ZR-19 (105Y)
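If you also need each strong label kept separate from its value, one option is to walk the strong tags and read the text node that follows each one. A sketch, assuming the same snippet is stored in html as above (the quote-stripping mirrors the sample data):

from bs4 import BeautifulSoup

html = """THE ABOVE HTML SNIPPET"""
soup = BeautifulSoup(html, "html.parser")

for strong in soup.select(".specs-content strong"):
    label = strong.get_text(strip=True)
    value = strong.next_sibling
    # The node after a <strong> may be whitespace or a <br>;
    # keep it only when it is a non-empty text node.
    if value is not None and isinstance(value, str) and value.strip():
        print(label, value.strip().strip('"').strip())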
I want to replace VERSION placeholders in a file with a variable version value, but I'm running into the below error:
def versions = ["8.8.0", "9.9.0"]
versions.each { version ->
def file = new File("$Path/test.url")
def fileText = file.replaceAll("VERSION", "${version}")
file.write(fileText);
Error:
groovy.lang.MissingMethodException: No signature of method: java.io.File.replaceAll() is applicable for argument types: (java.lang.String, org.codehaus.groovy.runtime.GStringImpl) values: [VERSION, 8.8.0]
I'm a newbie to the Groovy DSL and not sure what I'm missing; any suggestions appreciated!
Another way is to use the Groovy File .text property:
def f = new File('sample-file.txt')
f.text = f.text.replaceAll('VERSION', '8.8.0')
And as @cfrick mentioned, there is not much point in performing the replace operation for multiple versions, as only the first one will actually find the VERSION string.
Running the above on a sample file:
─➤ groovy solution.groovy
─➤
will result in the string being replaced:
─➤ diff sample-file.txt_original sample-file.txt
1c1
< Dolore magna aliqua. VERSION Ut enim ad minim veniam.
---
> Dolore magna aliqua. 8.8.0 Ut enim ad minim veniam.
where diff is a Linux tool for comparing two files.
I'm attempting to scrape Indeed.com and want to get the information pertaining to each job from its respective div. The response prints out in the terminal, but when I write to a file or run the spider I get a blank file and no items returned. How do I fix this?
I've tried changing my XPaths to be relative to the container they pull from, and it still runs blank.
def parse(self, response):
    html = response.body
    container3 = response.xpath(".//div[contains(@class, 'jobsearch-SerpJobCard unifiedRow row result clickcard')]").extract()
    print(container3)
    with open('container.txt', 'w') as cont:
        cont.write(container3)
        cont.close()
    title = Selector(response=container3).xpath(".//*[@class='title']/a/@title").get()
    titles = container3.xpath(".//*[@class='title']/a/@title").getall()
    locations = container3.xpath(".//*[@class='sjcl']/span/text()").getall()
    companies = container3.xpath(".//*[@class='company']/a/text()").getall()
    summarys = container3.xpath(".//*[@class='summary']/.").getall()
    links = response.css("div.title a::attr(href)").getall()
    webscrape = WebscrapeItem()
    webscrape['title'] = []
    webscrape['company'] = []
    webscrape['location'] = []
    webscrape['desc'] = []
    webscrape['link'] = []
    for link in links:
        self.links.append('https://www.indeed.com/' + link)
        webscrape['link'].append('https://www.indeed.com/' + link)
    for title, local in itertools.zip_longest(titles, locations):
        webscrape['title'].append(title)
        webscrape['location'].append(local)
    for suma, com in itertools.zip_longest(summarys, companies):
        webscrape['desc'].append(suma)
        webscrape['company'].append(com)
    yield webscrape
container3 output:
<div class="jobsearch-SerpJobCard unifiedRow row result clickcard" id="pj_23e4270b7501bb9b" data-jk="23e4270b7501bb9b" data-empn="5625259597886418" data-ci="291406065">\n\n <div class="title">\n <a target="_blank" id="sja2" href="/pagead/clk?mo=r&ad=-6NYlbfkN0AGcPE08CwaySIkGkcc_oP1ITgH03VIz0r4xVHFv1QhAqfdykiPOMynTjgufJX7HvDowBKp7j-7NHJP9GOjbo56Vjxh5NURcHO8VKHA2Y_kPQaP89uziwg10G1Cy7gxqliSnkyvAjNozb3dIZaFvs20PbgIEbVp-Hlps87Ix3AR1T6shfkApixB3pFjOLL7mVL86YGAk8ZDtjg1RSW02V3Z21NoirneOsjdmwulvgL84YrSuUydYlJaqi5F8aPMUi7pz0h9-mKPlGF9g2xadVCCe2GDYCw9Svjigifq0j5m6WWsToS9ZsU4_uJu3ZNLRr92Eiwq9QHaT2tJcVrjqtO1X7Lz2bHVDj0RBD_MvoO_FmG0_Sr_tCm8gCxu55S7Vk4GEi0nBslmfj4br8hgZ1AuLs4D_XWmJF6MErKJSgPJFZWn7X2SAlVC&p=2&fvj=1&vjs=3" onmousedown="sjomd(\'sja2\'); clk(\'sja2\');" onclick=" setRefineByCookie([]); sjoc(\'sja2\', 0); convCtr(\'SJ\')" rel="noopener nofollow" title="EMS Executive Director" class="jobtitle turnstileLink " data-tn-element="jobTitle">\n EMS Executive Director</a>\n\n </div>\n\n <div class="sjcl">\n <div>\n <span class="company">\n <a data-tn-element="companyName" class="turnstileLink" target="_blank" href="/cmp/Remsa-1" onmousedown="this.href = appendParamsOnce(this.href, \'from=SERP&campaignid=serp-linkcompanyname&fromjk=23e4270b7501bb9b&jcid=1075eae744bf7959\')" rel="noopener">\n REMSA</a></span>\n\n <a data-tn-element="reviewStars" data-tn-variant="cmplinktst2" class="turnstileLink slNoUnderline " href="/cmp/Remsa-1/reviews" title="Remsa reviews" onmousedown="this.href = appendParamsOnce(this.href, \'?campaignid=cmplinktst2&from=SERP&jt=EMS+Executive+Director&fromjk=23e4270b7501bb9b&jcid=1075eae744bf7959\');" target="_blank" rel="noopener">\n <span class="ratings" aria-label="3.9 out of 5 star rating"><span class="rating" style="width:44.4px"><!-- --></span></span>\n<span class="slNoUnderline">7 reviews</span>\n </a>\n </div>\n<div id="recJobLoc_23e4270b7501bb9b" class="recJobLoc" data-rc-loc="United States" style="display: none"></div>\n\n <div class="location ">United States</div>\n </div>\n\n <div class="summary">\n Responsible for the <b>financial</b>, operational and management performance of Healthcare services for the company. Directs daily operations in support of the mission…</div>
I expect each 'jobsearch-SerpJobCard unifiedRow row result clickcard' div to be extracted into a list, and then to pull the titles, locations, companies, and summaries from that list using relative XPaths.
However, what I'm getting is a blank container3 and no items returned. Here is the response.text info from the finished spider:
"{\"status\": \"ok\", \"items\": [], \"items_dropped\": [], \"stats\": {\"downloader/request_bytes\": 1132, \"downloader/request_count\": 3, \"downloader/request_method_count/GET\": 2, \"downloader/request_method_count/POST\": 1, \"downloader/response_bytes\": 1012262, \"downloader/response_count\": 3, \"downloader/response_status_count/200\": 2, \"downloader/response_status_count/404\": 1, \"finish_reason\": \"finished\", \"finish_time\": \"2019-08-21 06:29:40\", \"log_count/DEBUG\": 3, \"log_count/ERROR\": 1, \"log_count/INFO\": 8, \"log_count/WARNING\": 1, ...
Check this out, it works. Two things changed: the class value no longer includes clickcard (which not every result card has), and the loop runs over the selector list directly rather than over the plain strings that .extract() returns, which you cannot run XPath on:
for item in response.xpath('//div[@class="jobsearch-SerpJobCard unifiedRow row result"]'):
    titles = item.xpath(".//*[@class='title']/a/@title").getall()
    print(titles)
    locations = item.xpath(".//*[@class='sjcl']/span/text()").getall()
    print(locations)
Output
['Python Developer Freshers Trainees', 'Python Developer', 'Python Developer', 'Python Developer', 'Python Developers', 'Software Trainee', 'Python\\Django Developer', 'Hiring 2016 / 2017 / 2018 / 2019 freshers as software trainee', 'Python/Django Developer', 'Senior Python Developer']
['Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala', 'Kochi, Kerala']
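Applying the same idea back to the item pipeline, here is a sketch of a parse() that yields one item per job card instead of building parallel lists. It assumes your WebscrapeItem defines the same title, company, location, desc, and link fields:

def parse(self, response):
    for card in response.xpath('//div[contains(@class, "jobsearch-SerpJobCard")]'):
        item = WebscrapeItem()
        # Relative XPaths run against each card selector, not the whole page.
        item['title'] = card.xpath(".//*[@class='title']/a/@title").get()
        item['company'] = card.xpath(".//*[@class='company']/a/text()").get(default='').strip()
        item['location'] = card.xpath(".//*[@class='sjcl']//span/text()").get()
        item['desc'] = ' '.join(card.xpath(".//*[@class='summary']//text()").getall()).strip()
        link = card.css('div.title a::attr(href)').get()
        item['link'] = response.urljoin(link) if link else None
        yield item

Yielding per-card items also keeps each title paired with its own company and location, so there is no need for itertools.zip_longest.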
Here is the function that scrambles the words in a file:
import itertools as it
import random as rdm

def get_permuted_lines(word_list):
    '''Takes a list of all the words in the file, in the order they appear,
    and returns another list holding the scrambled words in the same order.'''
    # final_list will hold all the scrambled words
    final_list = []
    for word in word_list:
        # words of length <= 3 should not be scrambled
        if len(word) <= 3:
            final_list.append(word)
        else:
            if len(word) == 4 and (word.endswith('.') or word.endswith(',')):
                final_list.append(word)
            elif len(word) == 5 and word.endswith('\n'):
                final_list.append(word)
            else:
                # if a word ends with ",\n"
                if word.endswith(',\n'):
                    first_letter, *middle_letters, last_letter = word[0], word[1:-3], word[-3:len(word)]
                    perm_list = list(it.permutations(middle_letters, len(middle_letters)))
                    join_tup_words = [''.join(tup) for tup in perm_list]
                    final_list.append(first_letter + join_tup_words[rdm.randint(0, len(join_tup_words) - 1)] + last_letter)
                # if a word ends with ".\n"
                elif word.endswith('.\n'):
                    first_letter, *middle_letters, last_letter = word[0], word[1:-3], word[-3:len(word)]
                    perm_list = list(it.permutations(middle_letters, len(middle_letters)))
                    join_tup_words = [''.join(tup) for tup in perm_list]
                    final_list.append(first_letter + join_tup_words[rdm.randint(0, len(join_tup_words) - 1)] + last_letter)
                # for the remaining words
                else:
                    first_letter, *middle_letters, last_letter = word
                    perm_list = list(it.permutations(middle_letters, len(middle_letters)))
                    join_tup_words = [''.join(tup) for tup in perm_list]
                    final_list.append(first_letter + join_tup_words[rdm.randint(0, len(join_tup_words) - 1)] + last_letter)
    return final_list

def read_write(fname):
    '''Reads from the file fname and writes to a new file called fname + "Scrumble.txt" after creating it.'''
    with open(fname, 'r') as f:
        lines = f.read()
    # getting a list of the scrambled words in the order they appear in the file
    permuted_words = get_permuted_lines(lines.split(' '))
    # joining all the words to form lines
    join_words_list = ' '.join(permuted_words)
    # creating a new file with the name fname + "Scrumble.txt"
    new_file = fname[:-4] + 'Scrumble.txt'
    with open(new_file, 'w') as f:
        f.write(join_words_list)
    with open(new_file, 'r') as f:
        print(f.read())

if __name__ == '__main__':
    # file_name is the name of the file we want to scramble
    file_name = input('enter the file_name: ')
    read_write(file_name)
I have tried the same program with the re and random modules, which works fine; using only the random module also does the task. But the itertools.permutations() version only works for files with a small number of lines (say 3), not more.
How can I fix this?
You have a combinatorial explosion at hand using permutations. Your text probably has some long words in it:
from itertools import permutations
from datetime import datetime

for n in range(1, 15):
    g = ''.join("k" * n)
    start = datetime.now()
    print()
    print(f' "{g}" feed to permutations leads to {len(list(permutations(g)))} results taking {(datetime.now() - start).total_seconds() * 1000} ms')
Output:
"k" feed to permutations leads to 1 results taking 0.0 ms
"kk" feed to permutations leads to 2 results taking 0.0 ms
"kkk" feed to permutations leads to 6 results taking 0.0 ms
"kkkk" feed to permutations leads to 24 results taking 0.0 ms
<snipp>
"kkkkkkkkk" feed to permutations leads to 362880 results taking 78.126 ms
"kkkkkkkkkk" feed to permutations leads to 3628800 results taking 703.131 ms
"kkkkkkkkkkk" feed to permutations leads to 39916800 results taking 8920.826 ms
... laptop freezes ...
For me the freeze happens at around 12 characters: a word of n letters produces n! permutations, so the list grows factorially.
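You can see the count without materializing the list; a quick check with math.factorial restates the numbers above:

import math

# n! grows explosively: the list for a 12-letter word alone would
# hold 479,001,600 tuples before a single one is picked at random.
for n in range(1, 15):
    print(n, math.factorial(n))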
How to avoid it: do not use permutations - use a simple shuffle:
import random

def removeSpaces(textList):
    return ' '.join(textList)

def addSpaces(text):
    return text.split(" ")

def needsScrambling(word):
    stripped = word.strip(",.?!")
    return len(stripped) > 3 and stripped.isalpha()

def scramble(words):
    def scrambleWord(oneWord):
        prev = ""
        suff = ""
        if oneWord[0] in ",.?!":
            prev = oneWord[0]
            oneWord = oneWord[1:]
        if oneWord[-1] in ",.?!\n":
            suff = oneWord[-1]
            oneWord = oneWord[:-1]
        return ''.join([prev, oneWord[0], *random.sample(oneWord[1:-1], k=len(oneWord) - 2), oneWord[-1], suff])
    return [scrambleWord(w) if needsScrambling(w) else w for w in words]

def doIt(t):
    return removeSpaces(scramble(addSpaces(t)))

demoText = "Non eram nescius, Brute, cum, quae summis ingeniis exquisitaque doctrina philosophi" + ' \n' + \
           "Graeco sermone tractavissent, ea Latinis litteris mandaremus, fore ut hic noster labor in varias" + ' \n' + \
           "reprehensiones incurreret. nam quibusdam, et iis quidem non admodum indoctis, totum hoc displicet" + ' \n' + \
           "philosophari. quidam autem non tam id reprehendunt, si remissius agatur, sed tantum studium tamque" + ' \n' + \
           "multam operam ponendam in eo non arbitrantur. erunt etiam, et ii quidem eruditi Graecis litteris," + ' \n' + \
           "contemnentes Latinas, qui se dicant in Graecis legendis operam malle consumere. postremo aliquos" + ' \n' + \
           "futuros suspicor, qui me ad alias litteras vocent, genus hoc scribendi, etsi sit elegans, personae" + ' \n' + \
           "tamen et dignitatis esse negent." + ' \n\n' + \
           "[2] Contra quos omnis dicendum breviter existimo. Quamquam philosophiae quidem vituperatoribus" + ' \n' + \
           "satis responsum est eo libro, quo a nobis philosophia defensa et collaudata est, cum esset" + ' \n' + \
           "accusata et vituperata ab Hortensio. qui liber cum et tibi probatus videretur et iis, quos" + ' \n' + \
           "ego posse iudicare arbitrarer, plura suscepi veritus ne movere hominum studia viderer, retinere" + ' \n' + \
           "non posse. Qui autem, si maxime hoc placeat, moderatius tamen id volunt fieri, difficilem" + ' \n' + \
           "quandam temperantiam postulant in eo, quod semel admissum coerceri reprimique non potest, ut" + ' \n' + \
           "propemodum iustioribus utamur illis, qui omnino avocent a philosophia, quam his, qui rebus" + '\n' + \
           "infinitis modum constituant in reque eo meliore, quo maior sit, mediocritatem desiderent." + '\n' + \
           "Source: https://la.wikisource.org/wiki/De_finibus_bonorum_et_malorum/Liber_Primus"

print(doIt(demoText))
Output:
Non eram niseucs, Bture, cum, quae siumms igeninis euaqusixtqie driocnta phlpoishoi
Graeco srmenoe tevsricstanat, ea Lniatis liteirts mnurdeaams, fore ut hic noestr lbaor in varias
reprehensiones icrenruert. nam qubsuidam, et iis qeuidm non audmdom itdnoics, toutm hoc dsieiplct
philosophari. qaduim autem non tam id rneedrunepht, si rmesisuis auatgr, sed tntaum sutuidm tqmaue
multam oaerpm pednnoam in eo non attuirranbr. eurnt etaim, et ii qideum edriuti Garceis liettris,
contemnentes Laanits, qui se dinact in Gecrias lgednies orpeam mllae coermusne. psormeto aliuqos
futuros sospciur, qui me ad ailas ltreatis vcnoet, geuns hoc sdrbcneii, etsi sit eaegnls, psneroae
tamen et dgiainitts esse nenegt.
[2] Conrta quos oinms dnuicedm betievrr esimtxio. Qumuqaam pooihhslipae qeduim vupaiteuoirbtrs
satis rnupessom est eo libro, quo a noibs psiohoiplha densfea et cduoallata est, cum esest
accusata et vtiaterupa ab Hirntseoo. qui liebr cum et tbii purotbas videertur et iis, qous
ego posse irucdiae aaeribtrrr, pulra seuspci vterius ne mrovee hmiuonm sduita vdeerir, rntreeie
non pssoe. Qui ateum, si mixmae hoc pclaaet, mairtdueos teamn id vnlout ferii, dciffeiilm
quandam tnmreeaiptam pasounltt in eo, quod smeel aidsmsum cercroei rimriqepue non pteost, ut
propemodum itriuosiubs uuamtr iills, qui omnino aocevnt a pshoihloipa, qaum his, qui rebus
infinitis mdoum caustinontt in rquee eo mierole, quo miaor sit, meretiicodatm desiderent.
I am trying to use lzma to compress and decompress some data in memory. I know that the following approach works:
import lzma
s = 'Lorem ipsum dolor'
bytes_in = s.encode('utf-8')
print(s)
print(bytes_in)
# Compress
bytes_out = lzma.compress(data=bytes_in, format=lzma.FORMAT_XZ)
print(bytes_out)
# Decompress
bytes_decomp = lzma.decompress(data=bytes_out, format=lzma.FORMAT_XZ)
print(bytes_decomp)
The output is:
Lorem ipsum dolor
b'Lorem ipsum dolor'
b'\xfd7zXZ\x00\x00\x04\xe6\xd6\xb4F\x02\x00!\x01\x16\x00\x00\x00t/\xe5\xa3\x01\x00\x10Lorem ipsum dolor\x00\x00\x00\x00\xddq\x8e\x1d\x82\xc8\xef\xad\x00\x01)\x112\np\x0e\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
b'Lorem ipsum dolor'
However, I notice that using lzma.LZMACompressor gives different results. With the following code:
import lzma
s = 'Lorem ipsum dolor'
bytes_in = s.encode('utf-8')
print(s)
print(bytes_in)
# Compress
lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
lzc.compress(bytes_in)
bytes_out = lzc.flush()
print(bytes_out)
# Decompress
bytes_decomp = lzma.decompress(data=bytes_out, format=lzma.FORMAT_XZ)
print(bytes_decomp)
I get this output:
Lorem ipsum dolor
b'Lorem ipsum dolor'
b'\x01\x00\x10Lorem ipsum dolor\x00\x00\x00\x00\xddq\x8e\x1d\x82\xc8\xef\xad\x00\x01)\x112\np\x0e\x1f\xb6\xf3}\x01\x00\x00\x00\x00\x04YZ'
The program then fails on the decompress call (line 18) with _lzma.LZMAError: Input format not supported by decoder.
I have 3 questions here:
How come the output of lzma.compress is so much longer than that of lzma.LZMACompressor.compress, even though it seemingly does the same thing?
In the second example, why does the decompressor complain about invalid format?
How can I get the second example to decompress correctly?
In your second example you're dropping part of the compressed stream: the return value of lzc.compress(bytes_in) is discarded, so bytes_out only gets the flushed tail. That is also why it is shorter and why the decoder rejects it; the XZ stream header emitted along with the first compress() output is missing. This, on the other hand, works:
lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
bytes_out = lzc.compress(bytes_in) + lzc.flush()
print(bytes_out)
Note that the first example really is equivalent, since the source of lzma.compress is:
def compress(data, format=FORMAT_XZ, check=-1, preset=None, filters=None):
    """Compress a block of data.

    Refer to LZMACompressor's docstring for a description of the
    optional arguments *format*, *check*, *preset* and *filters*.

    For incremental compression, use an LZMACompressor instead.
    """
    comp = LZMACompressor(format, check, preset, filters)
    return comp.compress(data) + comp.flush()
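If you actually need incremental behavior, collect the output of every compress() call as well as the flush(), and mirror it with an LZMADecompressor. A minimal sketch, with a made-up chunk list for illustration:

import lzma

chunks = [b'Lorem ipsum ', b'dolor']

# Concatenate everything the compressor emits, including the flushed tail.
lzc = lzma.LZMACompressor(format=lzma.FORMAT_XZ)
compressed = b''.join(lzc.compress(c) for c in chunks) + lzc.flush()

# Feed the stream back through an incremental decompressor.
lzd = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
restored = lzd.decompress(compressed)
assert restored == b''.join(chunks)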