Python & BS4 pagination loop - python-3.x
I'm new to web scraping and I'm trying to do it on this page https://www.metrocuadrado.com/bogota.
The idea is to extract all the information. So far I have been able to do it with only one page but I do not know how to do it with pagination. Is there any way to do it based on the code I already have?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# opening up connection, grabbing html
my_url = 'https://www.metrocuadrado.com/bogota'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div",{"class":"detail_wrap"})
filename = "metrocuadrado.csv"
f = open(filename, "w")
headers= "propertytype, businestype, cityname, neighborhood, description, price, area\n"
f.write(headers)
for container in containers:
    property_type = container["propertytype"]
    busines_type = container["businestype"]
    city_name = container["cityname"]
    neighborhood_location = container["neighborhood"]
    description = container.div.a.img["alt"]
    price_container = container.findAll("span", {"itemprop":"price"})
    price = price_container[0].text
    area_container = container.findAll("div", {"class":"m2"})
    area = area_container[0].p.span.text
    print("property_type: " + property_type)
    print("busines_type: " + busines_type)
    print("city_name: " + city_name)
    print("neighborhood_location: " + neighborhood_location)
    print("description: " + description)
    print("price: " + price)
    print("area: " + area)
    f.write(property_type + "," + busines_type + "," + city_name + "," + neighborhood_location + "," + description.replace(",", "|") + "," + price + "," + area + "\n")
f.close()
You are going to need to scrape each page, likely in a loop. Do this by figuring out what the request is that fetches page 2, page 3 and so on. You can work that out by looking at the page source code, or by using your browser's developer tools and watching the network calls.
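For illustration, here is a minimal sketch of such a loop built on the code from the question. The ?page= query parameter and the stopping condition are assumptions, not the site's confirmed pagination scheme; confirm the real pattern in the network tab before relying on it:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

base_url = 'https://www.metrocuadrado.com/bogota'
all_containers = []
page_number = 1

while True:
    # assumed URL pattern -- confirm the real one in your browser's network tab
    page_url = base_url if page_number == 1 else base_url + '?page=' + str(page_number)
    uClient = uReq(page_url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")
    containers = page_soup.findAll("div", {"class": "detail_wrap"})

    # stop once a page comes back with no listings
    if not containers:
        break

    all_containers.extend(containers)
    page_number += 1

# all_containers now holds the listings from every page;
# run your existing extraction / CSV loop over it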
Related
For Loop to CSV Leading to Uneven Rows in Python
Still learning Python, so apologies if this is an extremely obvious mistake. I've been trying to figure it out for hours now though and figured I'd see if anyone can help out. I've scraped a hockey website for their ice skate name and price and have written it to a CSV. The only problem is that when I write it to CSV the rows for the name column (listed as Gear) and the Price column are not aligned. It goes:

Gear Name 1
Row Space
Price
Row Space
Gear Name 2

It would be great to align the gear and price rows next to each other. I've attached a link to a picture of the CSV as well if that helps.

import requests
from bs4 import BeautifulSoup as Soup

webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')

filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)

for gear in parser.find_all("div", {"class": "details"}):
    gearname = gear.find_all("div", {"class": "name"}, "a")
    gearnametext = gearname[0].text
    gearprice = gear.find_all("div", {"class": "price"}, "a")
    gearpricetext = gearprice[0].text
    print (gearnametext)
    print (gearpricetext)
    f.write(gearnametext + "," + gearpricetext)

[What the uneven rows look like][1]

[1]: https://i.stack.imgur.com/EG2f2.png
With Python 3 I would recommend using with open(filename, 'w') as f: and calling strip() on your texts before you write() them to the file. Note that write() does not add a line break for you, so whatever mode you open the file in, you have to append a line break to each line you are writing.

Example

import requests
from bs4 import BeautifulSoup as Soup

webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')

filename = "gear1.csv"
headers = "Gear,Price\n"

with open(filename, 'w') as f:
    f.write(headers)
    for gear in parser.find_all("div", {"class": "details"}):
        gearnametext = gear.find("div", {"class": "name"}).text.strip()
        gearpricetext = gear.find("div", {"class": "price"}).text.strip()
        f.write(gearnametext + "," + gearpricetext + "\n")

Output

Gear,Price
Bauer Vapor X3.7 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Senior,$119.99
Bauer Vapor Hyperlite Ice Hockey Skates - Senior,$999.98 - $1149.98
CCM Jetspeed FT475 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Intermediate,$109.99
...
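As a further aside (a sketch, not part of the answer above), Python's built-in csv module takes care of the delimiters and line endings for you, and also protects the output if a product name ever contains a comma. The selectors below are the same ones used in the answer's code:

import csv
import requests
from bs4 import BeautifulSoup as Soup

webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
parser = Soup(webpage_response.content, 'html.parser')

with open("gear.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Gear", "Price"])  # header row
    for gear in parser.find_all("div", {"class": "details"}):
        name = gear.find("div", {"class": "name"}).text.strip()
        price = gear.find("div", {"class": "price"}).text.strip()
        writer.writerow([name, price])  # csv handles quoting and line endings for you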
I've noticed that gearnametext contains two \n characters inside the string. You should try the str.replace() method to remove the \n characters, which are what create the jump to the next line. Try with:

import requests
from bs4 import BeautifulSoup as Soup

webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')

filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)

for gear in parser.find_all("div", {"class": "details"}):
    gearname = gear.find_all("div", {"class": "name"}, "a")
    gearnametext = gearname[0].text.replace('\n','')
    gearprice = gear.find_all("div", {"class": "price"}, "a")
    gearpricetext = gearprice[0].text
    print (gearnametext)
    print (gearpricetext)
    f.write(gearnametext + "," + gearpricetext)

Inside the loop I changed the line that extracts the gear name to: gearnametext = gearname[0].text.replace('\n','').
Request to a website with a URL around 5000 characters long
On this website https://www.bezrealitky.cz i try request for information but the url is around 5000 length long and i get nothing in output. I try this code: import gspread import requests from bs4 import BeautifulSoup cast_2 = "7460372%2C50.0646836%5D%2C%5B14.7483696%2C50.0612238%5D%2C%5B14.7483383%2C50.0583969%5D%2C%5B14.7546239%2C50.0529376%5D%2C%5B14.7515918%2C50.04738%5D%2C%5B14.7528839%2C50.0403354%5D%2C%5B14.7493543%2C50.0391631%5D%2C%5B14.7512053%2C50.0370347%5D%2C%5B14.7504488%2C50.0345038%5D%2C%5B14.7529152%2C50.0334157%5D%2C%5B14.7549521%2C50.0273482%5D%2C%5B14.7510315%2C50.0233121%5D%2C%5B14.7613824%2C50.0237208%5D%2C%5B14.7636167%2C50.0221496%5D%2C%5B14.7637296%2C50.0164491%5D%2C%5B14.7615743%2C50.012857%5D%2C%5B14.7706651%2C50.0100383%5D%2C%5B14.7881162%2C50.0135281%5D%2C%5B14.7975388%2C50.0122003%5D%2C%5B14.8027277%2C50.0136821%5D%2C%5B14.8061398%2C50.0114449%5D%2C%5B14.8061581%2C50.0065423%5D%2C%5B14.8078741%2C50.0060327%5D%2C%5B14.8079901%2C50.0075185%5D%2C%5B14.8115355%2C50.0091465%5D%2C%5B14.8167724%2C50.00864%5D%2C%5B14.8310856%2C50.0172897%5D%2C%5B14.8338134%2C50.0171025%5D%2C%5B14.8342407%2C50.0124198%5D%2C%5B14.8362383%2C50.011968%5D%2C%5B14.8368206%2C50.016655%5D%2C%5B14.8543707%2C50.0161536%5D%2C%5B14.8565756%2C50.0177261%5D%2C%5B14.8570572%2C50.0213736%5D%2C%5B14.8498872%2C50.0278437%5D%2C%5B14.8536689%2C50.0320382%5D%2C%5B14.8620365%2C50.0331695%5D%2C%5B14.8699823%2C50.029833%5D%2C%5B14.8686147%2C50.0169813%5D%2C%5B14.8666129%2C50.013744%5D%2C%5B14.8682312%2C50.0096072%5D%2C%5B14.883544%2C50.0078683%5D%2C%5B14.8816725%2C50.0059146%5D%2C%5B14.8807235%2C50.0006693%5D%2C%5B14.8870952%2C50.0005446%5D%2C%5B14.8867784%2C49.9967618%5D%2C%5B14.890288%2C49.9920084%5D%2C%5B14.9000404%2C49.9936546%5D%2C%5B14.9089142%2C49.989327%5D%2C%5B14.9090253%2C49.9904781%5D%2C%5B14.9180098%2C49.9911811%5D%2C%5B14.9212725%2C49.9857218%5D%2C%5B14.9192363%2C49.9845855%5D%2C%5B14.9203582%2C49.983578%5D%2C%5B14.9269916%2C49.9858337%5D%2C%5B14.928416%2C49.9890091%5D%2C%5B14.9306597%2C49.9900138%5D%2C%5B14.9311636%2C49.9882144%5D%2C%5B14.9367484%2C49.9906851%5D%2C%5B14.9419242%2C49.9895189%5D%2C%5B14.9359828%2C49.9863591%5D%2C%5B14.9382795%2C49.981983%5D%2C%5B14.931343%2C49.9780358%5D%2C%5B14.9303346%2C49.9724941%5D%2C%5B14.932246%2C49.9674614%5D%2C%5B14.9280618%2C49.9579488%5D%2C%5B14.9331609%2C49.9577085%5D%2C%5B14.9306572%2C49.9527402%5D%2C%5B14.9274264%2C49.9509742%5D%2C%5B14.9302193%2C49.9495788%5D%2C%5B14.9292025%2C49.9481072%5D%2C%5B14.9263141%2C49.9481936%5D%2C%5B14.9261212%2C49.9456787%5D%2C%5B14.9275244%2C49" cast_3 = 
".9455488%5D%2C%5B14.9257431%2C49.9451039%5D%2C%5B14.9243516%2C49.9420148%5D%2C%5B14.9336951%2C49.9390724%5D%2C%5B14.9285746%2C49.9375499%5D%2C%5B14.9232954%2C49.9391386%5D%2C%5B14.9185643%2C49.9523745%5D%2C%5B14.9095933%2C49.9529501%5D%2C%5B14.8981181%2C49.9476337%5D%2C%5B14.8907048%2C49.9392604%5D%2C%5B14.8915933%2C49.936355%5D%2C%5B14.8864724%2C49.9375382%5D%2C%5B14.8867596%2C49.9356867%5D%2C%5B14.8899145%2C49.9313207%5D%2C%5B14.8933734%2C49.9300241%5D%2C%5B14.8964743%2C49.923446%5D%2C%5B14.9025013%2C49.919721%5D%2C%5B14.9047702%2C49.9164821%5D%2C%5B14.9176309%2C49.9175431%5D%2C%5B14.9164319%2C49.9200692%5D%2C%5B14.9175303%2C49.9225778%5D%2C%5B14.9289145%2C49.917223%5D%2C%5B14.9330489%2C49.9168551%5D%2C%5B14.9372387%2C49.9075293%5D%2C%5B14.9352755%2C49.9066415%5D%2C%5B14.9411617%2C49.9029621%5D%2C%5B14.940379%2C49.8993594%5D%2C%5B14.935669%2C49.8952156%5D%2C%5B14.9384554%2C49.8917643%5D%2C%5B14.9472443%2C49.8880358%5D%2C%5B14.9528711%2C49.8883229%5D%2C%5B14.9545243%2C49.8889728%5D%2C%5B14.9555799%2C49.8961729%5D%2C%5B14.9607454%2C49.8965317%5D%2C%5B14.9606301%2C49.8945635%5D%2C%5B14.9635334%2C49.8942765%5D%2C%5B14.9624936%2C49.8932596%5D%2C%5B14.9674574%2C49.8925867%5D%2C%5B14.9724662%2C49.8948229%5D%2C%5B14.9714967%2C49.8966132%5D%2C%5B14.9725729%2C49.8995754%5D%2C%5B14.9759147%2C49.898875%5D%2C%5B14.9769214%2C49.9004331%5D%2C%5B14.9846763%2C49.9007413%5D%2C%5B14.9865037%2C49.9003017%5D%2C%5B14.986104%2C49.8976845%5D%2C%5B14.9911552%2C49.8945319%5D%2C%5B14.9874514%2C49.8896809%5D%2C%5B14.9946711%2C49.8895076%5D%2C%5B14.9915024%2C49.8850037%5D%2C%5B14.9947409%2C49.8846719%5D%2C%5B14.9958491%2C49.8802898%5D%2C%5B14.999003%2C49.8823417%5D%2C%5B15.0053515%2C49.8830424%5D%2C%5B15.0099309%2C49.8795241%5D%2C%5B15.0124418%2C49.8801023%5D%2C%5B15.011424%2C49.8899134%5D%2C%5B15.01354%2C49.8915423%5D%2C%5B15.0122471%2C49.8921134%5D%2C%5B15.0142633%2C49.8934424%5D%2C%5B15.0140542%2C49.8954869%5D%2C%5B15.0189515%2C49.8949809%5D%2C%5B15.0194431%2C49.8989414%5D%2C%5B15.0243348%2C49.9005027%5D%2C%5B15.020474%2C49.905695%5D%2C%5B15.0219726%2C49.9095088%5D%2C%5B15.0272693%2C49.9116921%5D%2C%5B15.0263371%2C49.9142298%5D%2C%5B15.022275%2C49.9153146%5D%2C%5B15.0219879%2C49.9204434%5D%2C%5B15.023551%2C49.9231595%5D%2C%5B15.029706%2C49.9239071%5D%2C%5B15.0292901%2C49.9265814%5D%2C%5B15.0353043%2C49.927282%5D%2C%5B15" cast_4 = 
".0400271%2C49.9230311%5D%2C%5B15.0390876%2C49.9223017%5D%2C%5B15.0426921%2C49.9214656%5D%2C%5B15.042195%2C49.919563%5D%2C%5B15.056001%2C49.9151051%5D%2C%5B15.0694879%2C49.9139673%5D%2C%5B15.0699502%2C49.9155857%5D%2C%5B15.0747277%2C49.9154931%5D%2C%5B15.0746215%2C49.9175953%5D%2C%5B15.0802883%2C49.9195617%5D%2C%5B15.078314%2C49.9228952%5D%2C%5B15.0796324%2C49.923096%5D%2C%5B15.0901438%2C49.9218556%5D%2C%5B15.0929845%2C49.9233777%5D%2C%5B15.103518%2C49.9187582%5D%2C%5B15.1029777%2C49.9314127%5D%2C%5B15.1144736%2C49.9345202%5D%2C%5B15.1102589%2C49.9420706%5D%2C%5B15.1158032%2C49.9422821%5D%2C%5B15.1165203%2C49.944272%5D%2C%5B15.1145739%2C49.945346%5D%2C%5B15.1171915%2C49.9488564%5D%2C%5B15.1276201%2C49.9512419%5D%2C%5B15.1274718%2C49.955674%5D%2C%5B15.1517372%2C49.9531124%5D%2C%5B15.1526448%2C49.9567041%5D%2C%5B15.1586225%2C49.9561615%5D%2C%5B15.1598926%2C49.962716%5D%2C%5B15.1662636%2C49.9630819%5D%2C%5B15.1715054%2C49.9584597%5D%2C%5B15.1774664%2C49.9591787%5D%2C%5B15.1810824%2C49.956233%5D%2C%5B15.1810663%2C49.9526501%5D%2C%5B15.1862035%2C49.9530225%5D%2C%5B15.1952391%2C49.9578357%5D%2C%5B15.2115243%2C49.9592322%5D%2C%5B15.2139904%2C49.9620232%5D%2C%5B15.2170339%2C49.9610558%5D%2C%5B15.2230008%2C49.9622295%5D%2C%5B15.2265423%2C49.9759596%5D%2C%5B15.2333309%2C49.975372%5D%2C%5B15.2348486%2C49.97927%5D%2C%5B15.2407609%2C49.978409%5D%2C%5B15.2397699%2C49.9807084%5D%2C%5B15.2425184%2C49.9817658%5D%2C%5B15.2446314%2C49.9818782%5D%2C%5B15.2442988%2C49.9775396%5D%2C%5B15.2472652%2C49.9768984%5D%2C%5B15.2480568%2C49.9718284%5D%2C%5B15.2507013%2C49.9708684%5D%2C%5B15.2517455%2C49.9680948%5D%2C%5B15.2577153%2C49.9664238%5D%2C%5B15.2596658%2C49.9691599%5D%2C%5B15.2620682%2C49.9693487%5D%2C%5B15.2582713%2C49.9773754%5D%2C%5B15.2607681%2C49.9811908%5D%2C%5B15.2624639%2C49.9805064%5D%2C%5B15.2673975%2C49.988205%5D%2C%5B15.2760691%2C49.9856889%5D%2C%5B15.2782403%2C49.9882659%5D%2C%5B15.2673699%2C49.9908658%5D%2C%5B15.268852%2C49.9939161%5D%2C%5B15.2710964%2C49.9936841%5D%2C%5B15.2693211%2C49.994915%5D%2C%5B15.2713724%2C49.9974748%5D%2C%5B15.2748592%2C49.9976762%5D%2C%5B15.2784299%2C49.9949788%5D%2C%5B15.2806409%2C50.0003848%5D%2C%5B15.2903084%2C50.0004699%5D%2C%5B15.2904481%2C50.0038247%5D%2C%5B15.2942844%2C50.0041151%5D%2C%5B15.2950376%2C50.003145%5D%2C%5B15.2912638%2C50.0001124%5D%2C%5B15.297798%2C49.9966362%5D%2C%5B15" cast_5 = 
".3030433%2C49.9978618%5D%2C%5B15.3065558%2C50.0010961%5D%2C%5B15.3176203%2C49.9958783%5D%2C%5B15.3209257%2C49.9957084%5D%2C%5B15.3218971%2C49.9940926%5D%2C%5B15.3251451%2C49.9942954%5D%2C%5B15.310632%2C50.000222%5D%2C%5B15.3115761%2C50.0052775%5D%2C%5B15.3181268%2C50.0036175%5D%2C%5B15.3095883%2C50.0068186%5D%2C%5B15.3288149%2C50.0102639%5D%2C%5B15.3278347%2C50.0129677%5D%2C%5B15.3228156%2C50.014841%5D%2C%5B15.3226442%2C50.0161805%5D%2C%5B15.3168893%2C50.0158068%5D%2C%5B15.3168707%2C50.0183363%5D%2C%5B15.3099008%2C50.0173573%5D%2C%5B15.3100671%2C50.0223491%5D%2C%5B15.3146543%2C50.0224925%5D%2C%5B15.3181523%2C50.0207887%5D%2C%5B15.3233162%2C50.0270015%5D%2C%5B15.3292159%2C50.0240114%5D%2C%5B15.3339433%2C50.0244362%5D%2C%5B15.3371379%2C50.0271514%5D%2C%5B15.3376541%2C50.031498%5D%2C%5B15.340518%2C50.0339298%5D%2C%5B15.346957%2C50.0329283%5D%2C%5B15.3532778%2C50.0345715%5D%2C%5B15.3610469%2C50.0318515%5D%2C%5B15.3655293%2C50.0283415%5D%2C%5B15.3784025%2C50.0246007%5D%2C%5B15.3942223%2C50.0162445%5D%2C%5B15.3969556%2C50.0185625%5D%2C%5B15.4022783%2C50.0183338%5D%2C%5B15.4020888%2C50.0197968%5D%2C%5B15.3932339%2C50.0206583%5D%2C%5B15.3920018%2C50.0223698%5D%2C%5B15.385031%2C50.0243347%5D%2C%5B15.383268%2C50.0266595%5D%2C%5B15.3821418%2C50.0248152%5D%2C%5B15.3815274%2C50.0269813%5D%2C%5B15.3804665%2C50.0253754%5D%2C%5B15.3792426%2C50.025917%5D%2C%5B15.3743428%2C50.0294937%5D%2C%5B15.3734636%2C50.0347859%5D%2C%5B15.3695889%2C50.0346392%5D%2C%5B15.3666384%2C50.0394245%5D%2C%5B15.3628116%2C50.041087%5D%2C%5B15.3730635%2C50.0432036%5D%2C%5B15.3780732%2C50.0472218%5D%2C%5B15.3856527%2C50.0472561%5D%2C%5B15.388892%2C50.0443888%5D%2C%5B15.39216%2C50.0442043%5D%2C%5B15.3896159%2C50.0475198%5D%2C%5B15.3943902%2C50.0459441%5D%2C%5B15.402595%2C50.0465414%5D%2C%5B15.4014034%2C50.0483683%5D%2C%5B15.4035434%2C50.0505257%5D%2C%5B15.3962372%2C50.0486151%5D%2C%5B15.394535%2C50.0500037%5D%2C%5B15.3969398%2C50.0520049%5D%2C%5B15.3940358%2C50.0504431%5D%2C%5B15.3896418%2C50.052281%5D%2C%5B15.3833997%2C50.0519173%5D%2C%5B15.3753385%2C50.0557371%5D%2C%5B15.3768323%2C50.0574795%5D%2C%5B15.3807552%2C50.0576628%5D%2C%5B15.3805605%2C50.0605402%5D%2C%5B15.3881422%2C50.061425%5D%2C%5B15.3925269%2C50.0591033%5D%2C%5B15.3946303%2C50.0603602%5D%2C%5B15.3940834%2C50.0616654%5D%2C%5B15.3961776%2C50.0607414%5D%2C%5B15.4125822%2C50.0630548%5D%2C%5B15" cast_6 = 
".4134746%2C50.0744661%5D%2C%5B15.4067539%2C50.0804816%5D%2C%5B15.4054916%2C50.0830808%5D%2C%5B15.4066544%2C50.0837547%5D%2C%5B15.4042526%2C50.0860421%5D%2C%5B15.4052905%2C50.0870901%5D%2C%5B15.4088927%2C50.0854556%5D%2C%5B15.4160104%2C50.0890967%5D%2C%5B15.4203786%2C50.0871824%5D%2C%5B15.4280014%2C50.0880659%5D%2C%5B15.4236774%2C50.0883806%5D%2C%5B15.4155121%2C50.0939248%5D%2C%5B15.4098499%2C50.0906237%5D%2C%5B15.4075168%2C50.0919355%5D%2C%5B15.4094691%2C50.0929791%5D%2C%5B15.402386%2C50.0951011%5D%2C%5B15.4009451%2C50.0939897%5D%2C%5B15.4015589%2C50.0959031%5D%2C%5B15.3991191%2C50.0954995%5D%2C%5B15.4078486%2C50.1053316%5D%2C%5B15.4139267%2C50.1076966%5D%2C%5B15.4190626%2C50.1062858%5D%2C%5B15.4204394%2C50.1043623%5D%2C%5B15.419641%2C50.1029451%5D%2C%5B15.425487%2C50.0992095%5D%2C%5B15.4372403%2C50.101709%5D%2C%5B15.4341712%2C50.103413%5D%2C%5B15.4323829%2C50.110828%5D%2C%5B15.4373797%2C50.1098872%5D%2C%5B15.4382373%2C50.112052%5D%2C%5B15.440637%2C50.1119275%5D%2C%5B15.4348981%2C50.1152388%5D%2C%5B15.4399748%2C50.1187849%5D%2C%5B15.4422332%2C50.1186898%5D%2C%5B15.448004%2C50.1245509%5D%2C%5B15.4394162%2C50.1263351%5D%2C%5B15.4424804%2C50.1272179%5D%2C%5B15.4402373%2C50.1285949%5D%2C%5B15.4386249%2C50.1267957%5D%2C%5B15.4242543%2C50.1302151%5D%2C%5B15.4136279%2C50.1360966%5D%2C%5B15.4162801%2C50.1388282%5D%2C%5B15.4153957%2C50.1404655%5D%2C%5B15.4101445%2C50.1389653%5D%2C%5B15.4035614%2C50.1410463%5D%2C%5B15.4057903%2C50.1385907%5D%2C%5B15.4032532%2C50.1349821%5D%2C%5B15.3908308%2C50.1380372%5D%2C%5B15.3896295%2C50.1404446%5D%2C%5B15.3818727%2C50.1455547%5D%2C%5B15.3666429%2C50.1437912%5D%2C%5B15.3314292%2C50.1459439%5D%2C%5B15.3022217%2C50.1566421%5D%2C%5B15.2999343%2C50.159518%5D%2C%5B15.3008151%2C50.1577516%5D%2C%5B15.2919217%2C50.1559134%5D%2C%5B15.2913616%2C50.1515941%5D%2C%5B15.2873242%2C50.1529242%5D%2C%5B15.2850402%2C50.1494766%5D%2C%5B15.2794866%2C50.1508381%5D%2C%5B15.2808718%2C50.1481203%5D%2C%5B15.2717813%2C50.1493956%5D%2C%5B15.2693875%2C50.1511556%5D%2C%5B15.2720351%2C50.1479877%5D%2C%5B15.2764349%2C50.1466513%5D%2C%5B15.2813847%2C50.147562%5D%2C%5B15.2838828%2C50.1461274%5D%2C%5B15.2826681%2C50.1444837%5D%2C%5B15.2784628%2C50.143719%5D%2C%5B15.2796582%2C50.1433122%5D%2C%5B15.2780194%2C50.1409157%5D%2C%5B15.282405%2C50.140849%5D%2C%5B15.2784695%2C50.1381623%5D%2C%5B15.2708307%2C50.1391665%5D%2C%5B15" cast_7 = 
".2668258%2C50.1375857%5D%2C%5B15.269911%2C50.1346072%5D%2C%5B15.2762845%2C50.1329987%5D%2C%5B15.2759419%2C50.1303474%5D%2C%5B15.2793229%2C50.1302371%5D%2C%5B15.281164%2C50.1264982%5D%2C%5B15.2845801%2C50.1261406%5D%2C%5B15.2848885%2C50.1230899%5D%2C%5B15.2828851%2C50.1228826%5D%2C%5B15.2837849%2C50.1179389%5D%2C%5B15.2768798%2C50.117716%5D%2C%5B15.2771028%2C50.1144981%5D%2C%5B15.272524%2C50.1150153%5D%2C%5B15.2713081%2C50.1121539%5D%2C%5B15.2725418%2C50.1105403%5D%2C%5B15.2695258%2C50.1068804%5D%2C%5B15.2674912%2C50.1101226%5D%2C%5B15.2588749%2C50.1105492%5D%2C%5B15.2588528%2C50.109051%5D%2C%5B15.2609659%2C50.1088692%5D%2C%5B15.2592838%2C50.1065069%5D%2C%5B15.2536841%2C50.1067928%5D%2C%5B15.2539585%2C50.1090625%5D%2C%5B15.2525796%2C50.1049205%5D%2C%5B15.2435934%2C50.1060842%5D%2C%5B15.2453384%2C50.098772%5D%2C%5B15.2423646%2C50.0989078%5D%2C%5B15.2423883%2C50.1011124%5D%2C%5B15.2393165%2C50.1013005%5D%2C%5B15.2393185%2C50.1024704%5D%2C%5B15.2334966%2C50.1021718%5D%2C%5B15.2160559%2C50.1071886%5D%2C%5B15.2144357%2C50.1094093%5D%2C%5B15.2039163%2C50.1073611%5D%2C%5B15.2041415%2C50.1086931%5D%2C%5B15.1852359%2C50.1097565%5D%2C%5B15.1833813%2C50.1090549%5D%2C%5B15.1841281%2C50.1078717%5D%2C%5B15.1874227%2C50.1095254%5D%2C%5B15.1907831%2C50.1079763%5D%2C%5B15.1922237%2C50.1051987%5D%2C%5B15.191224%2C50.1027416%5D%2C%5B15.1841377%2C50.1017764%5D%2C%5B15.1760019%2C50.1032842%5D%2C%5B15.1773527%2C50.0987535%5D%2C%5B15.1731946%2C50.0978941%5D%2C%5B15.1751755%2C50.0963999%5D%2C%5B15.1743437%2C50.0952196%5D%2C%5B15.1696704%2C50.0961228%5D%2C%5B15.1711029%2C50.0927787%5D%2C%5B15.169623%2C50.0915337%5D%2C%5B15.1653342%2C50.0951398%5D%2C%5B15.1613706%2C50.0942165%5D%2C%5B15.1581873%2C50.1000534%5D%2C%5B15.1550458%2C50.0984016%5D%2C%5B15.1465104%2C50.0981989%5D%2C%5B15.1465095%2C50.0995308%5D%2C%5B15.1357627%2C50.0982693%5D%2C%5B15.1313868%2C50.0962456%5D%2C%5B15.1253349%2C50.0992604%5D%2C%5B15.1258172%2C50.1009654%5D%2C%5B15.1204803%2C50.1016614%5D%2C%5B15.1218891%2C50.0947834%5D%2C%5B15.1188048%2C50.0935699%5D%2C%5B15.1155421%2C50.0831571%5D%2C%5B15.1093235%2C50.0843775%5D%2C%5B15.1081288%2C50.0825619%5D%2C%5B15.0999571%2C50.0878321%5D%2C%5B15.1013205%2C50.0922107%5D%2C%5B15.0915809%2C50.0931084%5D%2C%5B15.0920168%2C50.0949287%5D%2C%5B15.0863782%2C50.0966826%5D%2C%5B15.086481%2C50.0944002%5D%2C%5B15.0826752%2C50.0941637%5D%2C%5B15" cast_8 = 
".068619%2C50.0973604%5D%2C%5B15.0703038%2C50.101897%5D%2C%5B15.0740585%2C50.1044897%5D%2C%5B15.0764522%2C50.1039785%5D%2C%5B15.0736102%2C50.1060177%5D%2C%5B15.0629614%2C50.1074685%5D%2C%5B15.0565328%2C50.105559%5D%2C%5B15.0564676%2C50.1038553%5D%2C%5B15.0533182%2C50.1015993%5D%2C%5B15.0516594%2C50.1040076%5D%2C%5B15.0458835%2C50.1020048%5D%2C%5B15.0448312%2C50.1041812%5D%2C%5B15.0466144%2C50.1126407%5D%2C%5B15.0457175%2C50.1149513%5D%2C%5B15.0343113%2C50.1170375%5D%2C%5B15.0345083%2C50.114516%5D%2C%5B15.0301261%2C50.1146343%5D%2C%5B15.0264935%2C50.1113655%5D%2C%5B15.0170356%2C50.1140104%5D%2C%5B15.0119161%2C50.11321%5D%2C%5B15.0045907%2C50.1162478%5D%2C%5B15.0045718%2C50.1108536%5D%2C%5B15.0027007%2C50.1105543%5D%2C%5B15.004638%2C50.1101266%5D%2C%5B15.0039322%2C50.1047413%5D%2C%5B15.005155%2C50.1043048%5D%2C%5B14.998137%2C50.0967009%5D%2C%5B14.9596509%2C50.1052427%5D%2C%5B14.955716%2C50.1030399%5D%2C%5B14.9618513%2C50.0988074%5D%2C%5B14.9624361%2C50.0959271%5D%2C%5B14.9534678%2C50.0972712%5D%2C%5B14.9537379%2C50.093282%5D%2C%5B14.9511217%2C50.0933379%5D%2C%5B14.9534407%2C50.0882163%5D%2C%5B14.948147%2C50.0876311%5D%2C%5B14.9480446%2C50.0931762%5D%2C%5B14.9450619%2C50.089111%5D%2C%5B14.9445069%2C50.0901939%5D%2C%5B14.9419088%2C50.089393%5D%2C%5B14.941649%2C50.0905889%5D%2C%5B14.9436269%2C50.090921%5D%2C%5B14.9292988%2C50.0919269%5D%2C%5B14.9302586%2C50.0950085%5D%2C%5B14.9272723%2C50.095255%5D%2C%5B14.9267884%2C50.0914298%5D%2C%5B14.9249778%2C50.0917708%5D%2C%5B14.9268745%2C50.0926059%5D%2C%5B14.9258074%2C50.0979741%5D%2C%5B14.9314785%2C50.1019029%5D%2C%5B14.9320943%2C50.1066867%5D%2C%5B14.9413214%2C50.1105662%5D%2C%5B14.9474381%2C50.1112535%5D%2C%5B14.9460171%2C50.1142574%5D%2C%5B14.9409307%2C50.117369%5D%2C%5B14.9450905%2C50.1184205%5D%2C%5B14.9422467%2C50.119131%5D%2C%5B14.9419262%2C50.1213203%5D%2C%5B14.9377646%2C50.1221107%5D%2C%5B14.9293773%2C50.1215114%5D%2C%5B14.9277818%2C50.1181327%5D%2C%5B14.91945%2C50.1202597%5D%2C%5B14.9064761%2C50.1161503%5D%2C%5B14.9114208%2C50.1130913%5D%2C%5B14.8973812%2C50.1101211%5D%2C%5B14.9026135%2C50.1041702%5D%2C%5B14.8862235%2C50.1046495%5D%2C%5B14.882639%2C50.1016048%5D%2C%5B14.8737914%2C50.0991642%5D%2C%5B14.8736091%2C50.0977705%5D%2C%5B14.8640719%2C50.096482%5D%2C%5B14.8380862%2C50.1029933%5D%2C%5B14.8353495%2C50.1065255%5D%2C%5B14.8203159%2C50.1075386%5D%2C%5B14.8178407%2C50.10837%5D%2C%5B14" cast_9 = ".8174777%2C50.1109162%5D%2C%5B14.8078363%2C50.1109486%5D%2C%5B14.8082064%2C50.1099036%5D%2C%5B14.8065928%2C50.1097325%5D%2C%5B14.8031871%2C50.1151496%5D%2C%5B14.7909845%2C50.112214%5D%2C%5B14.791537%2C50.1159915%5D%2C%5B14.7898356%2C50.1161527%5D%2C%5B14.7888326%2C50.1127876%5D%2C%5B14.7836591%2C50.1126201%5D%2C%5B14.7851984%2C50.1109994%5D%2C%5B14.783707%2C50.1101127%5D%2C%5B14.7799914%2C50.1100845%5D%2C%5B14.7794878%2C50.1121285%5D%2C%5B14.7765771%2C50.112093%5D%2C%5B14.7678175%2C50.1091971%5D%2C%5B14.7649339%2C50.1046049%5D%2C%5B14.7632564%2C50.1067252%5D%2C%5B14.7619325%2C50.1062309%5D%2C%5B14.7632773%2C50.1032313%5D%2C%5B14.7594198%2C50.099077%5D%2C%5B14.7363085%2C50.0943267%5D%2C%5B14.7384606%2C50.0924892%5D%2C%5B14.7376811%2C50.0912459%5D%2C%5B14.7305199%2C50.0868433%5D%5D%5D%7D%7D¢er=%5B15.181636455691432%2C50.02066883782007%5D&zoom=11.111617917065713&locationInput=okres%20Kol%C3%ADn%2C%20St%C5%99edo%C4%8Desk%C3%BD%20kraj%2C%20St%C5%99edn%C3%AD%20%C4%8Cechy%2C%20%C4%8Cesko&limit=15" URL = 
'https://www.bezrealitky.cz/vyhledat#offerType=prodej&estateType=byt&disposition=3-kk%2C3-1%2C4-kk%2C4-1%2C5-kk%2C5-1%2C6-kk%2C6-1%2C7-kk%2C7-1&ownership=&construction=&equipped=&balcony=&order=timeOrder_desc&boundary=%7B%22type%22%3A%22Feature%22%2C%22properties%22%3A%7B%7D%2C%22geometry%22%3A%7B%22type%22%3A%22Polygon%22%2C%22coordinates%22%3A%5B%5B%5B14.7305199%2C50.0868433%5D%2C%5B14.7347907%2C50.0851368%5D%2C%5B14.7335681%2C50.0820115%5D%2C%5B14.7365932%2C50.0818778%5D%2C%5B14.7369494%2C50.0836348%5D%2C%5B14.7440628%2C50.0834312%5D%2C%5B14.7442786%2C50.081928%5D%2C%5B14.7491301%2C50.0819783%5D%2C%5B14.7501216%2C50.0801764%5D%2C%5B14.7422156%2C50.0785401%5D%2C%5B14.7385155%2C50.0708815%5D%2C%5B14.753382%2C50.0710775%5D%2C%5B14.7541288%2C50.0684731%5D%2C%5B14.7520195%2C50.067327%5D%2C%5B14.7522295%2C50.0657011%5D%2C%5B14.' + cast_2 + cast_3 + cast_4 + cast_5 + cast_6 + cast_7 + cast_8 + cast_9

page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

results = soup.find_all('div', attrs={'class':'product__body-new'})

for job_data in results:
    cena = job_data.find('strong', attrs={'class':'product__value'})
    cena_final = cena.text.strip()
    print(cena_final)

It runs, but it does not print the price in <strong class="product__value">5.990.000 Kč</strong> or the other values; it starts and ends immediately. I need it to print:

Kolín, Středočeský kraj
Prodej bytu 2+kk, 46 m²
3.480.000 Kč
Jateční, Kolín, Středočeský kraj
Prodej bytu 3+1, 73 m²
4.500.000 Kč
That is because the page does not return any divs with a class of "product__body-new".
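One quick way to confirm that is to inspect the raw HTML requests actually receives before parsing it. Below is a small sketch (the shortened URL and file name are just for illustration); if the marker is missing from the raw response, the listings are most likely rendered client-side by JavaScript, in which case you would need the site's underlying data/API call or a browser-driven tool such as Selenium instead of plain requests:

import requests

URL = 'https://www.bezrealitky.cz/vyhledat'  # shortened for illustration; use your full URL here
page = requests.get(URL)

# if this prints False, the class simply is not in the HTML that requests receives,
# so soup.find_all() correctly returns an empty list
print('product__body-new' in page.text)

# dump the response to a file and open it in a browser/editor
# to see what the server actually returned
with open('raw_response.html', 'w', encoding='utf-8') as f:
    f.write(page.text)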
How to keep clicking a "show more" button until the full page is generated, and then collect the data links?
My goal is to collect as many profile links as possible on Khan Academy and then scrape some specific data from each of those profiles and store it in a CSV file. Here is my script that gets the profile links, scrapes the specific data from each profile, and stores it in a csv file:

from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re

session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')

#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
    profile_list.append(final_profile_link)

#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)

#for each profile link, scrape the specific data and store them into the csv
for link in profile_list:
    print("Scrapping ",link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup=BeautifulSoup(r.html.html,'html.parser')
    user_info_table=soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates=points=videos='NA'
    user_socio_table=soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
    #might change answers to answer because when it's 1 it's putting NA instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value]='NA'
    user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span',class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date='NA'
    else:
        last_activity_date='NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
    #might change answers to answer because when it's 1 it's putting NA instead

f.close()

This first script should work fine. Now, my problem is that it only finds about 40 profile links: print(len(profile_list)) returns 40. If I could click on the show more button (on https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms), then I would get more profile links (and thus more profiles to scrape).
This script keeps clicking on the show more button until there is no show more button left:

import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException  # needed for the except clause below

driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")
driver.implicitly_wait(10)

def showmore(self):
    while True:
        try:
            driver.implicitly_wait(5)
            showmore = self.find_element_by_class_name("button_1eqj1ga-o_O-shared_1t8r4tr-o_O-default_9fm203")
            showmore.click()
        except NoSuchElementException:
            break

showmore(driver)

This second script should also work fine. My question is: how can I merge these two scripts? How can I make BeautifulSoup, Selenium and Requests work together? In other words, how can I use the second script to get the full page and then feed it into the first script?
"My question is: how can I merge these two scripts? How can I make BeautifulSoup, Selenium and Requests work together?"

You don't need to. Selenium alone can do all of the required actions as well as get the required data. Another alternative is to use Selenium for the actions (such as clicking), get the page_source and let BeautifulSoup do the parsing. I have used the second option. Please note that this is because I am more comfortable with BeautifulSoup, not because Selenium can't get the required data.

Merged Script

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, StaleElementReferenceException
from bs4 import BeautifulSoup
import re

driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")

while True:
    try:
        showmore = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="v/what-are-algorithms-panel"]/div[1]/div/div[6]/div/div[4]/button')))
        showmore.click()
    except TimeoutException:
        break
    except StaleElementReferenceException:
        break

soup = BeautifulSoup(driver.page_source, 'html.parser')

#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list = []
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link = 'https://www.khanacademy.org' + text_link_nodiscussion
    profile_list.append(final_profile_link)

#remove duplicates
#remove the below line if you want the duplicates
profile_list = list(set(profile_list))

#print number of profiles we got
print(len(profile_list))

#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)

#for each profile link, scrape the specific data and store them into the csv
for link in profile_list:
    #to avoid scraping the same profile multiple times
    #print each profile link we are about to scrape
    print("Scrapping ", link)
    driver.get(link)
    #wait for content to load
    #if profile does not exist skip
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//*[@id="widget-list"]/div[1]/div[1]')))
    except TimeoutException:
        continue
    soup = BeautifulSoup(driver.page_source, 'html.parser')
    user_info_table = soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates, points, videos = [tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates = points = videos = 'NA'
    user_socio_table = soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys = ['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
    #might change answers to answer because when it's 1 it's putting NA instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value] = 'NA'
    user_calendar = soup.find('div', class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span', class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date = 'NA'
    else:
        last_activity_date = 'NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")

f.close()  # added here so the file is flushed and closed once the loop finishes

Sample console Output

551
Scrapping https://www.khanacademy.org/profile/kaid_888977072825430260337359/
Scrapping https://www.khanacademy.org/profile/kaid_883316191998827325047066/
Scrapping https://www.khanacademy.org/profile/kaid_1174374133389372329315932/
Scrapping https://www.khanacademy.org/profile/kaid_175131632601098270919916/
Scrapping https://www.khanacademy.org/profile/kaid_120532771190025953629523/
Scrapping https://www.khanacademy.org/profile/kaid_443636490088836886070300/
Scrapping https://www.khanacademy.org/profile/kaid_1202505937095267213741452/
Scrapping https://www.khanacademy.org/profile/kaid_464949975690601300556189/
Scrapping https://www.khanacademy.org/profile/kaid_727801603402106934190616/
Scrapping https://www.khanacademy.org/profile/kaid_808370995413780397188230/
Scrapping https://www.khanacademy.org/profile/kaid_427134832219441477944618/
Scrapping https://www.khanacademy.org/profile/kaid_232193725763932936324703/
Scrapping https://www.khanacademy.org/profile/kaid_167043118118112381390423/
Scrapping https://www.khanacademy.org/profile/kaid_17327330351684516133566/
...

Sample File Output (khanscraptry1.csv)

link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
https://www.khanacademy.org/profile/kaid_888977072825430260337359/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Tuesday Dec 8 2015
https://www.khanacademy.org/profile/kaid_883316191998827325047066/,5 years ago,2152299,513,10,884,34,16,82,108,1290,360,Monday Aug 27 2018
https://www.khanacademy.org/profile/kaid_1174374133389372329315932/,NA,NA,NA,2,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_175131632601098270919916/,NA,NA,NA,173,19,2,0,NA,NA,128,3,Thursday Feb 7 2019
https://www.khanacademy.org/profile/kaid_120532771190025953629523/,NA,NA,NA,9,0,3,18,NA,NA,4,4,Tuesday Oct 11 2016
https://www.khanacademy.org/profile/kaid_443636490088836886070300/,7 years ago,3306784,987,10,231,49,11,8,156,10,NA,Sunday Jul 22 2018
https://www.khanacademy.org/profile/kaid_1202505937095267213741452/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Thursday Apr 28 2016
https://www.khanacademy.org/profile/kaid_464949975690601300556189/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 16 2018
https://www.khanacademy.org/profile/kaid_727801603402106934190616/,5 years ago,2927634,1049,6,562,332,9,NA,NA,20,NA,NA
https://www.khanacademy.org/profile/kaid_808370995413780397188230/,NA,NA,NA,NA,19,192,0,NA,NA,52,NA,Saturday Jan 19 2019
https://www.khanacademy.org/profile/kaid_427134832219441477944618/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Tuesday Sep 18 2018
https://www.khanacademy.org/profile/kaid_232193725763932936324703/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Monday May 15 2017
https://www.khanacademy.org/profile/kaid_167043118118112381390423/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 1 2019
https://www.khanacademy.org/profile/kaid_17327330351684516133566/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_146705727466233630898864/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Thursday Apr 5 2018
How to ask f.write() to put NA's if there is no data in beautifulsoup?
My goal is to scrape some specific data from multiple profile pages on Khan Academy and put the data in a CSV file. Here is the code to scrape one specific profile page and put it in a csv:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')

user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]

user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number

filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()

This code is working fine with this specific link ('https://www.khanacademy.org/profile/DFletcher1990/'). However, when I change the link to another profile on Khan Academy, for example 'https://www.khanacademy.org/profile/Kkasparas/', I get this error:

KeyError: 'project help requests'

This is normal, because on the profile "https://www.khanacademy.org/profile/Kkasparas/" there is no project help requests value (and no project help replies either). Thus data['project help requests'] and data['project help replies'] don't exist and can't be written to the csv file. My goal is to run this script over many profile pages, so I would like to know how to put an NA in every case where I do not get the data for a variable, and then write the NA's to the csv file. In other words, I would like to make my script work for any kind of user profile page. Many thanks in advance for your contributions :)
You could define a new list with all the possible headers and set the value of any key that is not present to 'NA' before writing it to the file.

full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'

Also, a gentle reminder to provide fully working code in your question. user_socio_table was not defined in the question; I had to look up your previous question to get that.

The full code would be:

from bs4 import BeautifulSoup
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')

user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]

data = {}
user_socio_table=soup.find_all('div', class_='discussion-stat')
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number

full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'

filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()

Output - khanscraptry1.csv

date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0

Change to the following lines if user_info_table is not present:

if user_info_table is not None:
    dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates=points=videos='NA'
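As an alternative (a sketch, not part of the original answer, and reusing the variables data, full_data_keys, dates, points, videos and f from the full code above), you could skip the pre-fill loop and let dict.get() supply the 'NA' default at write time:

# data.get(key, 'NA') returns 'NA' whenever a key is missing, so profiles without
# e.g. 'project help requests' no longer raise a KeyError
row = [data.get(key, 'NA') for key in full_data_keys]
f.write(dates + "," + points.replace(",", "") + "," + videos + "," + ",".join(row) + "\n")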
AttributeError: 'str' object has no attribute 'text' python 2.7
I know there are many questions like this, but the answers are all specific and only fix the problem for that person's particular script. I am currently trying to print a bunch of info from supremenewyork.com, from the UK website. This script can successfully print all the info I want from Supreme US, but when I added the proxy code I started to get a lot of errors. I know the proxy code works because I tested it on a small script and it was able to pull info that was on Supreme UK and didn't exist on Supreme US. Here is my script:

import requests
from bs4 import BeautifulSoup

UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')

proxies = {
    'http': 'http://' + UK_Proxy1 + '',
    'https': 'http://' + UK_Proxy2 + '',
}

categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','hats','bags','accessories','shoes','skate']
catNumb = 0
altArray = []
nameArray = []
styleArray = []

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get((cUrl.text), proxies=proxies)
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        style = item_soup.find('p', itemprop='model').text
        print alt +(' --- ')+ name +(' --- ')+ style
        altArray.append(alt)
        nameArray.append(name)
        styleArray.append(style)

print altArray
print nameArray
print styleArray

I am getting this error when I execute the script:

AttributeError: 'str' object has no attribute 'text'

with the error pointing towards:

proxy_script = requests.get((cUrl.text), proxies=proxies)

I recently made a change to the script which sort of fixed it... It was able to print the categories but none of the info between them (which I NEED); it just printed ****************jackets**************, ****shirts******, etc. Here is what I changed:

import requests
from bs4 import BeautifulSoup

# make sure proxy is http and port 8080
UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')

proxies = {
    'http': 'http://' + UK_Proxy1 + '',
    'https': 'http://' + UK_Proxy2 + '',
}

categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','bags','accessories','shoes','skate']
catNumb = 0
altArray = []
nameArray = []
styleArray = []

for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        style = item_soup.find('p', itemprop='model').text
        print alt +(' --- ')+ name +(' --- ')+ style
        altArray.append(alt)
        nameArray.append(name)
        styleArray.append(style)

print altArray
print nameArray
print styleArray

I put .text at the end and it sort of worked... How do I fix it so it prints the info I want?
I think you are missing something: your cUrl is a string, not a Response object, so it has no .text attribute. I guess you want:

proxy_script = requests.get(cUrl, proxies=proxies).text
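To make the distinction concrete, here is a small illustrative sketch (not part of the original answer) showing which object actually has a .text attribute:

import requests

cUrl = 'http://www.supremenewyork.com/shop/all/jackets'  # a plain str -- has no .text attribute
response = requests.get(cUrl)                             # a requests Response -- has .text

print(type(cUrl))          # <type 'str'> on Python 2.7
print(type(response))      # <class 'requests.models.Response'>
print(len(response.text))  # .text is the decoded HTML body, ready for BeautifulSoup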