Python & BS4 pagination loop - python-3.x

I'm new to web scraping and I'm trying to do it on this page https://www.metrocuadrado.com/bogota.
The idea is to extract all the information. So far I have been able to do it with only one page but I do not know how to do it with pagination. Is there any way to do it based on the code I already have?
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
# opening up connection, grabbing html
my_url = 'https://www.metrocuadrado.com/bogota'
uClient = uReq(my_url)
page_html = uClient.read()
uClient.close()
# html parser
page_soup = soup(page_html, "html.parser")
# grabs each product
containers = page_soup.findAll("div",{"class":"detail_wrap"})
filename = "metrocuadrado.csv"
f = open(filename, "w")
headers= "propertytype, businestype, cityname, neighborhood, description, price, area\n"
f.write(headers)
for container in containers:
    # these keys are read as HTML attributes of the container div
    property_type = container["propertytype"]
    busines_type = container["businestype"]
    city_name = container["cityname"]
    neighborhood_location = container["neighborhood"]
    description = container.div.a.img["alt"]
    price_container = container.findAll("span",{"itemprop":"price"})
    price = price_container[0].text
    area_container = container.findAll("div",{"class":"m2"})
    area = area_container[0].p.span.text
    print("property_type: " + property_type)
    print("busines_type: " + busines_type)
    print("city_name: " + city_name)
    print("neighborhood_location: " + neighborhood_location)
    print("description: " + description)
    print("price: " + price)
    print("area: " + area)
    f.write(property_type + "," + busines_type + "," + city_name + "," + neighborhood_location + "," + description.replace(",", "|") + "," + price + "," + area + "\n")
f.close()

You are going to need to scrape each page, most likely in a loop. Do this by figuring out what request fetches page 2, page 3, and so on; you can work that out by looking at the page source or by watching the network calls in your browser's developer tools.
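As a rough sketch only: assuming the listing pages can be reached through a numbered page parameter (the ?page=N parameter below is a placeholder, confirm the real request in your browser's network tab), the single-page code can be wrapped in a loop that stops when a page returns no containers:
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

base_url = 'https://www.metrocuadrado.com/bogota'
filename = "metrocuadrado.csv"
with open(filename, "w") as f:
    f.write("propertytype, businestype, cityname, neighborhood, description, price, area\n")
    page = 1
    while True:
        # hypothetical pagination parameter -- check the real request in dev tools
        uClient = uReq(base_url + '?page=' + str(page))
        page_soup = soup(uClient.read(), "html.parser")
        uClient.close()
        containers = page_soup.findAll("div", {"class": "detail_wrap"})
        if not containers:
            # an empty page means we have run out of results
            break
        for container in containers:
            # same per-container extraction and f.write(...) as in the code above
            pass
        page += 1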

Related

For Loop to CSV Leading to Uneven Rows in Python

Still learning Python, so apologies if this is an extremely obvious mistake. I've been trying to figure it out for hours now though and figured I'd see if anyone can help out.
I've scraped a hockey website for ice skate names and prices and written them to a CSV. The only problem is that when I write to the CSV, the rows for the name column (listed as Gear) and the Price column are not aligned. It goes:
Gear Name 1
Row Space
Price
Row Space
Gear Name 2
It would be great to align the gear and price rows next to each other. I've attached a link to a picture of the CSV as well if that helps.
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)
for gear in parser.find_all("div", {"class": "details"}):
    gearname = gear.find_all("div", {"class": "name"}, "a")
    gearnametext = gearname[0].text
    gearprice = gear.find_all("div", {"class": "price"}, "a")
    gearpricetext = gearprice[0].text
    print (gearnametext)
    print (gearpricetext)
    f.write(gearnametext + "," + gearpricetext)
[What the uneven rows look like][1]
[1]: https://i.stack.imgur.com/EG2f2.png
With Python 3 I would recommend using with open(filename, 'w') as f: and calling strip() on your text before you write() it to the file.
You also have to add a line break to each line you write; otherwise everything ends up running together on one row.
Example
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear1.csv"
headers = "Gear,Price\n"
with open(filename, 'w') as f:
    f.write(headers)
    for gear in parser.find_all("div", {"class": "details"}):
        gearnametext = gear.find("div", {"class": "name"}).text.strip()
        gearpricetext = gear.find("div", {"class": "price"}).text.strip()
        f.write(gearnametext + "," + gearpricetext+"\n")
Output
Gear,Price
Bauer Vapor X3.7 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Senior,$119.99
Bauer Vapor Hyperlite Ice Hockey Skates - Senior,$999.98 - $1149.98
CCM Jetspeed FT475 Ice Hockey Skates - Senior,$249.99
Bauer X-LP Ice Hockey Skates - Intermediate,$109.99
...
I've noticed that gearnametext contains two \n characters inside the string. You should use str.replace() to remove the \n characters that are causing the jump to the next line. Try:
import requests
from bs4 import BeautifulSoup as Soup
webpage_response = requests.get('https://www.purehockey.com/c/ice-hockey-skates-senior?')
webpage = (webpage_response.content)
parser = Soup(webpage, 'html.parser')
filename = "gear.csv"
f = open(filename, "w")
headers = "Gear, Price"
f.write(headers)
for gear in parser.find_all("div", {"class": "details"}):
    gearname = gear.find_all("div", {"class": "name"}, "a")
    gearnametext = gearname[0].text.replace('\n','')
    gearprice = gear.find_all("div", {"class": "price"}, "a")
    gearpricetext = gearprice[0].text
    print (gearnametext)
    print (gearpricetext)
    f.write(gearnametext + "," + gearpricetext)
Inside the loop I changed the line that builds the gear name to: gearnametext = gearname[0].text.replace('\n','').

Try request on website with around 5000 length in url

On the website https://www.bezrealitky.cz I'm trying to request information, but the URL is around 5000 characters long and I get nothing in the output.
I tried this code:
import gspread
import requests
from bs4 import BeautifulSoup
cast_2 = "7460372%2C50.0646836%5D%2C%5B14.7483696%2C50.0612238%5D%2C%5B14.7483383%2C50.0583969%5D%2C%5B14.7546239%2C50.0529376%5D%2C%5B14.7515918%2C50.04738%5D%2C%5B14.7528839%2C50.0403354%5D%2C%5B14.7493543%2C50.0391631%5D%2C%5B14.7512053%2C50.0370347%5D%2C%5B14.7504488%2C50.0345038%5D%2C%5B14.7529152%2C50.0334157%5D%2C%5B14.7549521%2C50.0273482%5D%2C%5B14.7510315%2C50.0233121%5D%2C%5B14.7613824%2C50.0237208%5D%2C%5B14.7636167%2C50.0221496%5D%2C%5B14.7637296%2C50.0164491%5D%2C%5B14.7615743%2C50.012857%5D%2C%5B14.7706651%2C50.0100383%5D%2C%5B14.7881162%2C50.0135281%5D%2C%5B14.7975388%2C50.0122003%5D%2C%5B14.8027277%2C50.0136821%5D%2C%5B14.8061398%2C50.0114449%5D%2C%5B14.8061581%2C50.0065423%5D%2C%5B14.8078741%2C50.0060327%5D%2C%5B14.8079901%2C50.0075185%5D%2C%5B14.8115355%2C50.0091465%5D%2C%5B14.8167724%2C50.00864%5D%2C%5B14.8310856%2C50.0172897%5D%2C%5B14.8338134%2C50.0171025%5D%2C%5B14.8342407%2C50.0124198%5D%2C%5B14.8362383%2C50.011968%5D%2C%5B14.8368206%2C50.016655%5D%2C%5B14.8543707%2C50.0161536%5D%2C%5B14.8565756%2C50.0177261%5D%2C%5B14.8570572%2C50.0213736%5D%2C%5B14.8498872%2C50.0278437%5D%2C%5B14.8536689%2C50.0320382%5D%2C%5B14.8620365%2C50.0331695%5D%2C%5B14.8699823%2C50.029833%5D%2C%5B14.8686147%2C50.0169813%5D%2C%5B14.8666129%2C50.013744%5D%2C%5B14.8682312%2C50.0096072%5D%2C%5B14.883544%2C50.0078683%5D%2C%5B14.8816725%2C50.0059146%5D%2C%5B14.8807235%2C50.0006693%5D%2C%5B14.8870952%2C50.0005446%5D%2C%5B14.8867784%2C49.9967618%5D%2C%5B14.890288%2C49.9920084%5D%2C%5B14.9000404%2C49.9936546%5D%2C%5B14.9089142%2C49.989327%5D%2C%5B14.9090253%2C49.9904781%5D%2C%5B14.9180098%2C49.9911811%5D%2C%5B14.9212725%2C49.9857218%5D%2C%5B14.9192363%2C49.9845855%5D%2C%5B14.9203582%2C49.983578%5D%2C%5B14.9269916%2C49.9858337%5D%2C%5B14.928416%2C49.9890091%5D%2C%5B14.9306597%2C49.9900138%5D%2C%5B14.9311636%2C49.9882144%5D%2C%5B14.9367484%2C49.9906851%5D%2C%5B14.9419242%2C49.9895189%5D%2C%5B14.9359828%2C49.9863591%5D%2C%5B14.9382795%2C49.981983%5D%2C%5B14.931343%2C49.9780358%5D%2C%5B14.9303346%2C49.9724941%5D%2C%5B14.932246%2C49.9674614%5D%2C%5B14.9280618%2C49.9579488%5D%2C%5B14.9331609%2C49.9577085%5D%2C%5B14.9306572%2C49.9527402%5D%2C%5B14.9274264%2C49.9509742%5D%2C%5B14.9302193%2C49.9495788%5D%2C%5B14.9292025%2C49.9481072%5D%2C%5B14.9263141%2C49.9481936%5D%2C%5B14.9261212%2C49.9456787%5D%2C%5B14.9275244%2C49"
cast_3 = ".9455488%5D%2C%5B14.9257431%2C49.9451039%5D%2C%5B14.9243516%2C49.9420148%5D%2C%5B14.9336951%2C49.9390724%5D%2C%5B14.9285746%2C49.9375499%5D%2C%5B14.9232954%2C49.9391386%5D%2C%5B14.9185643%2C49.9523745%5D%2C%5B14.9095933%2C49.9529501%5D%2C%5B14.8981181%2C49.9476337%5D%2C%5B14.8907048%2C49.9392604%5D%2C%5B14.8915933%2C49.936355%5D%2C%5B14.8864724%2C49.9375382%5D%2C%5B14.8867596%2C49.9356867%5D%2C%5B14.8899145%2C49.9313207%5D%2C%5B14.8933734%2C49.9300241%5D%2C%5B14.8964743%2C49.923446%5D%2C%5B14.9025013%2C49.919721%5D%2C%5B14.9047702%2C49.9164821%5D%2C%5B14.9176309%2C49.9175431%5D%2C%5B14.9164319%2C49.9200692%5D%2C%5B14.9175303%2C49.9225778%5D%2C%5B14.9289145%2C49.917223%5D%2C%5B14.9330489%2C49.9168551%5D%2C%5B14.9372387%2C49.9075293%5D%2C%5B14.9352755%2C49.9066415%5D%2C%5B14.9411617%2C49.9029621%5D%2C%5B14.940379%2C49.8993594%5D%2C%5B14.935669%2C49.8952156%5D%2C%5B14.9384554%2C49.8917643%5D%2C%5B14.9472443%2C49.8880358%5D%2C%5B14.9528711%2C49.8883229%5D%2C%5B14.9545243%2C49.8889728%5D%2C%5B14.9555799%2C49.8961729%5D%2C%5B14.9607454%2C49.8965317%5D%2C%5B14.9606301%2C49.8945635%5D%2C%5B14.9635334%2C49.8942765%5D%2C%5B14.9624936%2C49.8932596%5D%2C%5B14.9674574%2C49.8925867%5D%2C%5B14.9724662%2C49.8948229%5D%2C%5B14.9714967%2C49.8966132%5D%2C%5B14.9725729%2C49.8995754%5D%2C%5B14.9759147%2C49.898875%5D%2C%5B14.9769214%2C49.9004331%5D%2C%5B14.9846763%2C49.9007413%5D%2C%5B14.9865037%2C49.9003017%5D%2C%5B14.986104%2C49.8976845%5D%2C%5B14.9911552%2C49.8945319%5D%2C%5B14.9874514%2C49.8896809%5D%2C%5B14.9946711%2C49.8895076%5D%2C%5B14.9915024%2C49.8850037%5D%2C%5B14.9947409%2C49.8846719%5D%2C%5B14.9958491%2C49.8802898%5D%2C%5B14.999003%2C49.8823417%5D%2C%5B15.0053515%2C49.8830424%5D%2C%5B15.0099309%2C49.8795241%5D%2C%5B15.0124418%2C49.8801023%5D%2C%5B15.011424%2C49.8899134%5D%2C%5B15.01354%2C49.8915423%5D%2C%5B15.0122471%2C49.8921134%5D%2C%5B15.0142633%2C49.8934424%5D%2C%5B15.0140542%2C49.8954869%5D%2C%5B15.0189515%2C49.8949809%5D%2C%5B15.0194431%2C49.8989414%5D%2C%5B15.0243348%2C49.9005027%5D%2C%5B15.020474%2C49.905695%5D%2C%5B15.0219726%2C49.9095088%5D%2C%5B15.0272693%2C49.9116921%5D%2C%5B15.0263371%2C49.9142298%5D%2C%5B15.022275%2C49.9153146%5D%2C%5B15.0219879%2C49.9204434%5D%2C%5B15.023551%2C49.9231595%5D%2C%5B15.029706%2C49.9239071%5D%2C%5B15.0292901%2C49.9265814%5D%2C%5B15.0353043%2C49.927282%5D%2C%5B15"
cast_4 = ".0400271%2C49.9230311%5D%2C%5B15.0390876%2C49.9223017%5D%2C%5B15.0426921%2C49.9214656%5D%2C%5B15.042195%2C49.919563%5D%2C%5B15.056001%2C49.9151051%5D%2C%5B15.0694879%2C49.9139673%5D%2C%5B15.0699502%2C49.9155857%5D%2C%5B15.0747277%2C49.9154931%5D%2C%5B15.0746215%2C49.9175953%5D%2C%5B15.0802883%2C49.9195617%5D%2C%5B15.078314%2C49.9228952%5D%2C%5B15.0796324%2C49.923096%5D%2C%5B15.0901438%2C49.9218556%5D%2C%5B15.0929845%2C49.9233777%5D%2C%5B15.103518%2C49.9187582%5D%2C%5B15.1029777%2C49.9314127%5D%2C%5B15.1144736%2C49.9345202%5D%2C%5B15.1102589%2C49.9420706%5D%2C%5B15.1158032%2C49.9422821%5D%2C%5B15.1165203%2C49.944272%5D%2C%5B15.1145739%2C49.945346%5D%2C%5B15.1171915%2C49.9488564%5D%2C%5B15.1276201%2C49.9512419%5D%2C%5B15.1274718%2C49.955674%5D%2C%5B15.1517372%2C49.9531124%5D%2C%5B15.1526448%2C49.9567041%5D%2C%5B15.1586225%2C49.9561615%5D%2C%5B15.1598926%2C49.962716%5D%2C%5B15.1662636%2C49.9630819%5D%2C%5B15.1715054%2C49.9584597%5D%2C%5B15.1774664%2C49.9591787%5D%2C%5B15.1810824%2C49.956233%5D%2C%5B15.1810663%2C49.9526501%5D%2C%5B15.1862035%2C49.9530225%5D%2C%5B15.1952391%2C49.9578357%5D%2C%5B15.2115243%2C49.9592322%5D%2C%5B15.2139904%2C49.9620232%5D%2C%5B15.2170339%2C49.9610558%5D%2C%5B15.2230008%2C49.9622295%5D%2C%5B15.2265423%2C49.9759596%5D%2C%5B15.2333309%2C49.975372%5D%2C%5B15.2348486%2C49.97927%5D%2C%5B15.2407609%2C49.978409%5D%2C%5B15.2397699%2C49.9807084%5D%2C%5B15.2425184%2C49.9817658%5D%2C%5B15.2446314%2C49.9818782%5D%2C%5B15.2442988%2C49.9775396%5D%2C%5B15.2472652%2C49.9768984%5D%2C%5B15.2480568%2C49.9718284%5D%2C%5B15.2507013%2C49.9708684%5D%2C%5B15.2517455%2C49.9680948%5D%2C%5B15.2577153%2C49.9664238%5D%2C%5B15.2596658%2C49.9691599%5D%2C%5B15.2620682%2C49.9693487%5D%2C%5B15.2582713%2C49.9773754%5D%2C%5B15.2607681%2C49.9811908%5D%2C%5B15.2624639%2C49.9805064%5D%2C%5B15.2673975%2C49.988205%5D%2C%5B15.2760691%2C49.9856889%5D%2C%5B15.2782403%2C49.9882659%5D%2C%5B15.2673699%2C49.9908658%5D%2C%5B15.268852%2C49.9939161%5D%2C%5B15.2710964%2C49.9936841%5D%2C%5B15.2693211%2C49.994915%5D%2C%5B15.2713724%2C49.9974748%5D%2C%5B15.2748592%2C49.9976762%5D%2C%5B15.2784299%2C49.9949788%5D%2C%5B15.2806409%2C50.0003848%5D%2C%5B15.2903084%2C50.0004699%5D%2C%5B15.2904481%2C50.0038247%5D%2C%5B15.2942844%2C50.0041151%5D%2C%5B15.2950376%2C50.003145%5D%2C%5B15.2912638%2C50.0001124%5D%2C%5B15.297798%2C49.9966362%5D%2C%5B15"
cast_5 = ".3030433%2C49.9978618%5D%2C%5B15.3065558%2C50.0010961%5D%2C%5B15.3176203%2C49.9958783%5D%2C%5B15.3209257%2C49.9957084%5D%2C%5B15.3218971%2C49.9940926%5D%2C%5B15.3251451%2C49.9942954%5D%2C%5B15.310632%2C50.000222%5D%2C%5B15.3115761%2C50.0052775%5D%2C%5B15.3181268%2C50.0036175%5D%2C%5B15.3095883%2C50.0068186%5D%2C%5B15.3288149%2C50.0102639%5D%2C%5B15.3278347%2C50.0129677%5D%2C%5B15.3228156%2C50.014841%5D%2C%5B15.3226442%2C50.0161805%5D%2C%5B15.3168893%2C50.0158068%5D%2C%5B15.3168707%2C50.0183363%5D%2C%5B15.3099008%2C50.0173573%5D%2C%5B15.3100671%2C50.0223491%5D%2C%5B15.3146543%2C50.0224925%5D%2C%5B15.3181523%2C50.0207887%5D%2C%5B15.3233162%2C50.0270015%5D%2C%5B15.3292159%2C50.0240114%5D%2C%5B15.3339433%2C50.0244362%5D%2C%5B15.3371379%2C50.0271514%5D%2C%5B15.3376541%2C50.031498%5D%2C%5B15.340518%2C50.0339298%5D%2C%5B15.346957%2C50.0329283%5D%2C%5B15.3532778%2C50.0345715%5D%2C%5B15.3610469%2C50.0318515%5D%2C%5B15.3655293%2C50.0283415%5D%2C%5B15.3784025%2C50.0246007%5D%2C%5B15.3942223%2C50.0162445%5D%2C%5B15.3969556%2C50.0185625%5D%2C%5B15.4022783%2C50.0183338%5D%2C%5B15.4020888%2C50.0197968%5D%2C%5B15.3932339%2C50.0206583%5D%2C%5B15.3920018%2C50.0223698%5D%2C%5B15.385031%2C50.0243347%5D%2C%5B15.383268%2C50.0266595%5D%2C%5B15.3821418%2C50.0248152%5D%2C%5B15.3815274%2C50.0269813%5D%2C%5B15.3804665%2C50.0253754%5D%2C%5B15.3792426%2C50.025917%5D%2C%5B15.3743428%2C50.0294937%5D%2C%5B15.3734636%2C50.0347859%5D%2C%5B15.3695889%2C50.0346392%5D%2C%5B15.3666384%2C50.0394245%5D%2C%5B15.3628116%2C50.041087%5D%2C%5B15.3730635%2C50.0432036%5D%2C%5B15.3780732%2C50.0472218%5D%2C%5B15.3856527%2C50.0472561%5D%2C%5B15.388892%2C50.0443888%5D%2C%5B15.39216%2C50.0442043%5D%2C%5B15.3896159%2C50.0475198%5D%2C%5B15.3943902%2C50.0459441%5D%2C%5B15.402595%2C50.0465414%5D%2C%5B15.4014034%2C50.0483683%5D%2C%5B15.4035434%2C50.0505257%5D%2C%5B15.3962372%2C50.0486151%5D%2C%5B15.394535%2C50.0500037%5D%2C%5B15.3969398%2C50.0520049%5D%2C%5B15.3940358%2C50.0504431%5D%2C%5B15.3896418%2C50.052281%5D%2C%5B15.3833997%2C50.0519173%5D%2C%5B15.3753385%2C50.0557371%5D%2C%5B15.3768323%2C50.0574795%5D%2C%5B15.3807552%2C50.0576628%5D%2C%5B15.3805605%2C50.0605402%5D%2C%5B15.3881422%2C50.061425%5D%2C%5B15.3925269%2C50.0591033%5D%2C%5B15.3946303%2C50.0603602%5D%2C%5B15.3940834%2C50.0616654%5D%2C%5B15.3961776%2C50.0607414%5D%2C%5B15.4125822%2C50.0630548%5D%2C%5B15"
cast_6 = ".4134746%2C50.0744661%5D%2C%5B15.4067539%2C50.0804816%5D%2C%5B15.4054916%2C50.0830808%5D%2C%5B15.4066544%2C50.0837547%5D%2C%5B15.4042526%2C50.0860421%5D%2C%5B15.4052905%2C50.0870901%5D%2C%5B15.4088927%2C50.0854556%5D%2C%5B15.4160104%2C50.0890967%5D%2C%5B15.4203786%2C50.0871824%5D%2C%5B15.4280014%2C50.0880659%5D%2C%5B15.4236774%2C50.0883806%5D%2C%5B15.4155121%2C50.0939248%5D%2C%5B15.4098499%2C50.0906237%5D%2C%5B15.4075168%2C50.0919355%5D%2C%5B15.4094691%2C50.0929791%5D%2C%5B15.402386%2C50.0951011%5D%2C%5B15.4009451%2C50.0939897%5D%2C%5B15.4015589%2C50.0959031%5D%2C%5B15.3991191%2C50.0954995%5D%2C%5B15.4078486%2C50.1053316%5D%2C%5B15.4139267%2C50.1076966%5D%2C%5B15.4190626%2C50.1062858%5D%2C%5B15.4204394%2C50.1043623%5D%2C%5B15.419641%2C50.1029451%5D%2C%5B15.425487%2C50.0992095%5D%2C%5B15.4372403%2C50.101709%5D%2C%5B15.4341712%2C50.103413%5D%2C%5B15.4323829%2C50.110828%5D%2C%5B15.4373797%2C50.1098872%5D%2C%5B15.4382373%2C50.112052%5D%2C%5B15.440637%2C50.1119275%5D%2C%5B15.4348981%2C50.1152388%5D%2C%5B15.4399748%2C50.1187849%5D%2C%5B15.4422332%2C50.1186898%5D%2C%5B15.448004%2C50.1245509%5D%2C%5B15.4394162%2C50.1263351%5D%2C%5B15.4424804%2C50.1272179%5D%2C%5B15.4402373%2C50.1285949%5D%2C%5B15.4386249%2C50.1267957%5D%2C%5B15.4242543%2C50.1302151%5D%2C%5B15.4136279%2C50.1360966%5D%2C%5B15.4162801%2C50.1388282%5D%2C%5B15.4153957%2C50.1404655%5D%2C%5B15.4101445%2C50.1389653%5D%2C%5B15.4035614%2C50.1410463%5D%2C%5B15.4057903%2C50.1385907%5D%2C%5B15.4032532%2C50.1349821%5D%2C%5B15.3908308%2C50.1380372%5D%2C%5B15.3896295%2C50.1404446%5D%2C%5B15.3818727%2C50.1455547%5D%2C%5B15.3666429%2C50.1437912%5D%2C%5B15.3314292%2C50.1459439%5D%2C%5B15.3022217%2C50.1566421%5D%2C%5B15.2999343%2C50.159518%5D%2C%5B15.3008151%2C50.1577516%5D%2C%5B15.2919217%2C50.1559134%5D%2C%5B15.2913616%2C50.1515941%5D%2C%5B15.2873242%2C50.1529242%5D%2C%5B15.2850402%2C50.1494766%5D%2C%5B15.2794866%2C50.1508381%5D%2C%5B15.2808718%2C50.1481203%5D%2C%5B15.2717813%2C50.1493956%5D%2C%5B15.2693875%2C50.1511556%5D%2C%5B15.2720351%2C50.1479877%5D%2C%5B15.2764349%2C50.1466513%5D%2C%5B15.2813847%2C50.147562%5D%2C%5B15.2838828%2C50.1461274%5D%2C%5B15.2826681%2C50.1444837%5D%2C%5B15.2784628%2C50.143719%5D%2C%5B15.2796582%2C50.1433122%5D%2C%5B15.2780194%2C50.1409157%5D%2C%5B15.282405%2C50.140849%5D%2C%5B15.2784695%2C50.1381623%5D%2C%5B15.2708307%2C50.1391665%5D%2C%5B15"
cast_7 = ".2668258%2C50.1375857%5D%2C%5B15.269911%2C50.1346072%5D%2C%5B15.2762845%2C50.1329987%5D%2C%5B15.2759419%2C50.1303474%5D%2C%5B15.2793229%2C50.1302371%5D%2C%5B15.281164%2C50.1264982%5D%2C%5B15.2845801%2C50.1261406%5D%2C%5B15.2848885%2C50.1230899%5D%2C%5B15.2828851%2C50.1228826%5D%2C%5B15.2837849%2C50.1179389%5D%2C%5B15.2768798%2C50.117716%5D%2C%5B15.2771028%2C50.1144981%5D%2C%5B15.272524%2C50.1150153%5D%2C%5B15.2713081%2C50.1121539%5D%2C%5B15.2725418%2C50.1105403%5D%2C%5B15.2695258%2C50.1068804%5D%2C%5B15.2674912%2C50.1101226%5D%2C%5B15.2588749%2C50.1105492%5D%2C%5B15.2588528%2C50.109051%5D%2C%5B15.2609659%2C50.1088692%5D%2C%5B15.2592838%2C50.1065069%5D%2C%5B15.2536841%2C50.1067928%5D%2C%5B15.2539585%2C50.1090625%5D%2C%5B15.2525796%2C50.1049205%5D%2C%5B15.2435934%2C50.1060842%5D%2C%5B15.2453384%2C50.098772%5D%2C%5B15.2423646%2C50.0989078%5D%2C%5B15.2423883%2C50.1011124%5D%2C%5B15.2393165%2C50.1013005%5D%2C%5B15.2393185%2C50.1024704%5D%2C%5B15.2334966%2C50.1021718%5D%2C%5B15.2160559%2C50.1071886%5D%2C%5B15.2144357%2C50.1094093%5D%2C%5B15.2039163%2C50.1073611%5D%2C%5B15.2041415%2C50.1086931%5D%2C%5B15.1852359%2C50.1097565%5D%2C%5B15.1833813%2C50.1090549%5D%2C%5B15.1841281%2C50.1078717%5D%2C%5B15.1874227%2C50.1095254%5D%2C%5B15.1907831%2C50.1079763%5D%2C%5B15.1922237%2C50.1051987%5D%2C%5B15.191224%2C50.1027416%5D%2C%5B15.1841377%2C50.1017764%5D%2C%5B15.1760019%2C50.1032842%5D%2C%5B15.1773527%2C50.0987535%5D%2C%5B15.1731946%2C50.0978941%5D%2C%5B15.1751755%2C50.0963999%5D%2C%5B15.1743437%2C50.0952196%5D%2C%5B15.1696704%2C50.0961228%5D%2C%5B15.1711029%2C50.0927787%5D%2C%5B15.169623%2C50.0915337%5D%2C%5B15.1653342%2C50.0951398%5D%2C%5B15.1613706%2C50.0942165%5D%2C%5B15.1581873%2C50.1000534%5D%2C%5B15.1550458%2C50.0984016%5D%2C%5B15.1465104%2C50.0981989%5D%2C%5B15.1465095%2C50.0995308%5D%2C%5B15.1357627%2C50.0982693%5D%2C%5B15.1313868%2C50.0962456%5D%2C%5B15.1253349%2C50.0992604%5D%2C%5B15.1258172%2C50.1009654%5D%2C%5B15.1204803%2C50.1016614%5D%2C%5B15.1218891%2C50.0947834%5D%2C%5B15.1188048%2C50.0935699%5D%2C%5B15.1155421%2C50.0831571%5D%2C%5B15.1093235%2C50.0843775%5D%2C%5B15.1081288%2C50.0825619%5D%2C%5B15.0999571%2C50.0878321%5D%2C%5B15.1013205%2C50.0922107%5D%2C%5B15.0915809%2C50.0931084%5D%2C%5B15.0920168%2C50.0949287%5D%2C%5B15.0863782%2C50.0966826%5D%2C%5B15.086481%2C50.0944002%5D%2C%5B15.0826752%2C50.0941637%5D%2C%5B15"
cast_8 = ".068619%2C50.0973604%5D%2C%5B15.0703038%2C50.101897%5D%2C%5B15.0740585%2C50.1044897%5D%2C%5B15.0764522%2C50.1039785%5D%2C%5B15.0736102%2C50.1060177%5D%2C%5B15.0629614%2C50.1074685%5D%2C%5B15.0565328%2C50.105559%5D%2C%5B15.0564676%2C50.1038553%5D%2C%5B15.0533182%2C50.1015993%5D%2C%5B15.0516594%2C50.1040076%5D%2C%5B15.0458835%2C50.1020048%5D%2C%5B15.0448312%2C50.1041812%5D%2C%5B15.0466144%2C50.1126407%5D%2C%5B15.0457175%2C50.1149513%5D%2C%5B15.0343113%2C50.1170375%5D%2C%5B15.0345083%2C50.114516%5D%2C%5B15.0301261%2C50.1146343%5D%2C%5B15.0264935%2C50.1113655%5D%2C%5B15.0170356%2C50.1140104%5D%2C%5B15.0119161%2C50.11321%5D%2C%5B15.0045907%2C50.1162478%5D%2C%5B15.0045718%2C50.1108536%5D%2C%5B15.0027007%2C50.1105543%5D%2C%5B15.004638%2C50.1101266%5D%2C%5B15.0039322%2C50.1047413%5D%2C%5B15.005155%2C50.1043048%5D%2C%5B14.998137%2C50.0967009%5D%2C%5B14.9596509%2C50.1052427%5D%2C%5B14.955716%2C50.1030399%5D%2C%5B14.9618513%2C50.0988074%5D%2C%5B14.9624361%2C50.0959271%5D%2C%5B14.9534678%2C50.0972712%5D%2C%5B14.9537379%2C50.093282%5D%2C%5B14.9511217%2C50.0933379%5D%2C%5B14.9534407%2C50.0882163%5D%2C%5B14.948147%2C50.0876311%5D%2C%5B14.9480446%2C50.0931762%5D%2C%5B14.9450619%2C50.089111%5D%2C%5B14.9445069%2C50.0901939%5D%2C%5B14.9419088%2C50.089393%5D%2C%5B14.941649%2C50.0905889%5D%2C%5B14.9436269%2C50.090921%5D%2C%5B14.9292988%2C50.0919269%5D%2C%5B14.9302586%2C50.0950085%5D%2C%5B14.9272723%2C50.095255%5D%2C%5B14.9267884%2C50.0914298%5D%2C%5B14.9249778%2C50.0917708%5D%2C%5B14.9268745%2C50.0926059%5D%2C%5B14.9258074%2C50.0979741%5D%2C%5B14.9314785%2C50.1019029%5D%2C%5B14.9320943%2C50.1066867%5D%2C%5B14.9413214%2C50.1105662%5D%2C%5B14.9474381%2C50.1112535%5D%2C%5B14.9460171%2C50.1142574%5D%2C%5B14.9409307%2C50.117369%5D%2C%5B14.9450905%2C50.1184205%5D%2C%5B14.9422467%2C50.119131%5D%2C%5B14.9419262%2C50.1213203%5D%2C%5B14.9377646%2C50.1221107%5D%2C%5B14.9293773%2C50.1215114%5D%2C%5B14.9277818%2C50.1181327%5D%2C%5B14.91945%2C50.1202597%5D%2C%5B14.9064761%2C50.1161503%5D%2C%5B14.9114208%2C50.1130913%5D%2C%5B14.8973812%2C50.1101211%5D%2C%5B14.9026135%2C50.1041702%5D%2C%5B14.8862235%2C50.1046495%5D%2C%5B14.882639%2C50.1016048%5D%2C%5B14.8737914%2C50.0991642%5D%2C%5B14.8736091%2C50.0977705%5D%2C%5B14.8640719%2C50.096482%5D%2C%5B14.8380862%2C50.1029933%5D%2C%5B14.8353495%2C50.1065255%5D%2C%5B14.8203159%2C50.1075386%5D%2C%5B14.8178407%2C50.10837%5D%2C%5B14"
cast_9 = ".8174777%2C50.1109162%5D%2C%5B14.8078363%2C50.1109486%5D%2C%5B14.8082064%2C50.1099036%5D%2C%5B14.8065928%2C50.1097325%5D%2C%5B14.8031871%2C50.1151496%5D%2C%5B14.7909845%2C50.112214%5D%2C%5B14.791537%2C50.1159915%5D%2C%5B14.7898356%2C50.1161527%5D%2C%5B14.7888326%2C50.1127876%5D%2C%5B14.7836591%2C50.1126201%5D%2C%5B14.7851984%2C50.1109994%5D%2C%5B14.783707%2C50.1101127%5D%2C%5B14.7799914%2C50.1100845%5D%2C%5B14.7794878%2C50.1121285%5D%2C%5B14.7765771%2C50.112093%5D%2C%5B14.7678175%2C50.1091971%5D%2C%5B14.7649339%2C50.1046049%5D%2C%5B14.7632564%2C50.1067252%5D%2C%5B14.7619325%2C50.1062309%5D%2C%5B14.7632773%2C50.1032313%5D%2C%5B14.7594198%2C50.099077%5D%2C%5B14.7363085%2C50.0943267%5D%2C%5B14.7384606%2C50.0924892%5D%2C%5B14.7376811%2C50.0912459%5D%2C%5B14.7305199%2C50.0868433%5D%5D%5D%7D%7D&center=%5B15.181636455691432%2C50.02066883782007%5D&zoom=11.111617917065713&locationInput=okres%20Kol%C3%ADn%2C%20St%C5%99edo%C4%8Desk%C3%BD%20kraj%2C%20St%C5%99edn%C3%AD%20%C4%8Cechy%2C%20%C4%8Cesko&limit=15"
URL = 'https://www.bezrealitky.cz/vyhledat#offerType=prodej&estateType=byt&disposition=3-kk%2C3-1%2C4-kk%2C4-1%2C5-kk%2C5-1%2C6-kk%2C6-1%2C7-kk%2C7-1&ownership=&construction=&equipped=&balcony=&order=timeOrder_desc&boundary=%7B%22type%22%3A%22Feature%22%2C%22properties%22%3A%7B%7D%2C%22geometry%22%3A%7B%22type%22%3A%22Polygon%22%2C%22coordinates%22%3A%5B%5B%5B14.7305199%2C50.0868433%5D%2C%5B14.7347907%2C50.0851368%5D%2C%5B14.7335681%2C50.0820115%5D%2C%5B14.7365932%2C50.0818778%5D%2C%5B14.7369494%2C50.0836348%5D%2C%5B14.7440628%2C50.0834312%5D%2C%5B14.7442786%2C50.081928%5D%2C%5B14.7491301%2C50.0819783%5D%2C%5B14.7501216%2C50.0801764%5D%2C%5B14.7422156%2C50.0785401%5D%2C%5B14.7385155%2C50.0708815%5D%2C%5B14.753382%2C50.0710775%5D%2C%5B14.7541288%2C50.0684731%5D%2C%5B14.7520195%2C50.067327%5D%2C%5B14.7522295%2C50.0657011%5D%2C%5B14.' + cast_2 + cast_3 + cast_4 + cast_5 + cast_6 + cast_7 + cast_8 + cast_9
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
results = soup.find_all('div', attrs={'class':'product__body-new'})
for job_data in results:
    cena = job_data.find('strong', attrs={'class':'product__value'})
    cena_final = cena.text.strip()
    print(cena_final)
It runs, but it does not show the price in <strong class="product__value">5.990.000 Kč</strong> or the other values; it starts and finishes immediately.
I need it to print:
Kolín, Středočeský kraj
Prodej bytu 2+kk, 46 m²
3.480.000 Kč
Jateční, Kolín, Středočeský kraj
Prodej bytu 3+1, 73 m²
4.500.000 Kč
That is because the page does not return any divs with a class of "product__body-new".
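A quick way to confirm that yourself (a small sketch reusing the page and soup objects from the question) is to check whether the class name appears in the raw HTML at all; if it does not, the listings are most likely rendered client-side, so requests alone will never see them:
results = soup.find_all('div', attrs={'class': 'product__body-new'})
print(len(results))                          # 0 -> no such divs in the static HTML
print('product__body-new' in page.text)      # False -> markup is added later by JavaScript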

How to infinite click on show more button to generate a full page and then collect data links?

My goal is to collect as many profile links as possible on Khan Academy and then scrape some specific data from each of those profiles into a CSV file.
Here is my script: it gets the profile links, scrapes the specific data from each profile, and stores everything in a CSV file.
from bs4 import BeautifulSoup
from requests_html import HTMLSession
import re
session = HTMLSession()
r = session.get('https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
    profile_list.append(final_profile_link)
#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
#for each profile link, scrape the specific data and store them into the csv
for link in profile_list:
    print("Scrapping ",link)
    session = HTMLSession()
    r = session.get(link)
    r.html.render(sleep=5)
    soup=BeautifulSoup(r.html.html,'html.parser')
    user_info_table=soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates=points=videos='NA'
    user_socio_table=soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks'] #might change answers to answer because when it's 1 it's putting NA instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value]='NA'
    user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span',class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date='NA'
    else:
        last_activity_date='NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n") #might change answers to answer because when it's 1 it's putting NA instead
f.close()
This first script should work fine. Now, my problem is that it only finds about 40 profile links: print(len(profile_list)) returns 40.
If I could click the show more button (on https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms), I would get more profile links (and thus more profiles to scrape).
This second script clicks the show more button over and over until there is no show more button left:
import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException
driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")
driver.implicitly_wait(10)
def showmore(self):
    while True:
        try:
            driver.implicitly_wait(5)
            showmore = self.find_element_by_class_name("button_1eqj1ga-o_O-shared_1t8r4tr-o_O-default_9fm203")
            showmore.click()
        except NoSuchElementException:
            break
showmore(driver)
This second script should also work fine.
My question is: how can I merge these two scripts? How do I make BeautifulSoup, Selenium and Requests work together?
In other words: how can I apply the second script to build the full page and then feed it into the first script?
My question is: how can I merge these two scripts? How do I make BeautifulSoup, Selenium and Requests work together?
You don't need to. Selenium alone can perform all of the required actions and also get the data you need. Another alternative is to use Selenium for the actions (such as clicking), grab the page_source, and let BeautifulSoup do the parsing. I have used the second option. Please note that this is because I am more comfortable with BeautifulSoup, not because Selenium can't get the data.
Merged Script
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException,StaleElementReferenceException
from bs4 import BeautifulSoup
import re
driver = webdriver.Chrome() #watch out, change if you are not using Chrome
driver.get("https://www.khanacademy.org/computing/computer-science/algorithms/intro-to-algorithms/v/what-are-algorithms")
while True:
    try:
        showmore=WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'//*[@id="v/what-are-algorithms-panel"]/div[1]/div/div[6]/div/div[4]/button')))
        showmore.click()
    except TimeoutException:
        break
    except StaleElementReferenceException:
        break
soup=BeautifulSoup(driver.page_source,'html.parser')
#find the profile links
profiles = soup.find_all(href=re.compile("/profile/kaid"))
profile_list=[]
for links in profiles:
    links_no_list = links.extract()
    text_link = links_no_list['href']
    text_link_nodiscussion = text_link[:-10]
    final_profile_link ='https://www.khanacademy.org'+text_link_nodiscussion
    profile_list.append(final_profile_link)
#remove duplicates
#remove the below line if you want the duplicates
profile_list=list(set(profile_list))
#print number of profiles we got
print(len(profile_list))
#create the csv file
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date\n"
f.write(headers)
#for each profile link, scrape the specific data and store them into the csv
for link in profile_list:
    #to avoid scraping the same profile multiple times
    #print each profile link we are about to scrape
    print("Scrapping ",link)
    driver.get(link)
    #wait for content to load
    #if profile does not exist skip
    try:
        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH ,'//*[@id="widget-list"]/div[1]/div[1]')))
    except TimeoutException:
        continue
    soup=BeautifulSoup(driver.page_source,'html.parser')
    user_info_table=soup.find('table', class_='user-statistics-table')
    if user_info_table is not None:
        dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
    else:
        dates=points=videos='NA'
    user_socio_table=soup.find_all('div', class_='discussion-stat')
    data = {}
    for gettext in user_socio_table:
        category = gettext.find('span')
        category_text = category.text.strip()
        number = category.previousSibling.strip()
        data[category_text] = number
    full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks'] #might change answers to answer because when it's 1 it's putting NA instead
    for header_value in full_data_keys:
        if header_value not in data.keys():
            data[header_value]='NA'
    user_calendar = soup.find('div',class_='streak-calendar-scroll-container')
    if user_calendar is not None:
        last_activity = user_calendar.find('span',class_='streak-cell filled')
        try:
            last_activity_date = last_activity['title']
        except TypeError:
            last_activity_date='NA'
    else:
        last_activity_date='NA'
    f.write(link + "," + dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "," + last_activity_date + "\n")
f.close()
Sample console Output
551
Scrapping https://www.khanacademy.org/profile/kaid_888977072825430260337359/
Scrapping https://www.khanacademy.org/profile/kaid_883316191998827325047066/
Scrapping https://www.khanacademy.org/profile/kaid_1174374133389372329315932/
Scrapping https://www.khanacademy.org/profile/kaid_175131632601098270919916/
Scrapping https://www.khanacademy.org/profile/kaid_120532771190025953629523/
Scrapping https://www.khanacademy.org/profile/kaid_443636490088836886070300/
Scrapping https://www.khanacademy.org/profile/kaid_1202505937095267213741452/
Scrapping https://www.khanacademy.org/profile/kaid_464949975690601300556189/
Scrapping https://www.khanacademy.org/profile/kaid_727801603402106934190616/
Scrapping https://www.khanacademy.org/profile/kaid_808370995413780397188230/
Scrapping https://www.khanacademy.org/profile/kaid_427134832219441477944618/
Scrapping https://www.khanacademy.org/profile/kaid_232193725763932936324703/
Scrapping https://www.khanacademy.org/profile/kaid_167043118118112381390423/
Scrapping https://www.khanacademy.org/profile/kaid_17327330351684516133566/
...
Sample File Output (khanscraptry1.csv)
link, date_joined, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx, last_date
https://www.khanacademy.org/profile/kaid_888977072825430260337359/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Tuesday Dec 8 2015
https://www.khanacademy.org/profile/kaid_883316191998827325047066/,5 years ago,2152299,513,10,884,34,16,82,108,1290,360,Monday Aug 27 2018
https://www.khanacademy.org/profile/kaid_1174374133389372329315932/,NA,NA,NA,2,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_175131632601098270919916/,NA,NA,NA,173,19,2,0,NA,NA,128,3,Thursday Feb 7 2019
https://www.khanacademy.org/profile/kaid_120532771190025953629523/,NA,NA,NA,9,0,3,18,NA,NA,4,4,Tuesday Oct 11 2016
https://www.khanacademy.org/profile/kaid_443636490088836886070300/,7 years ago,3306784,987,10,231,49,11,8,156,10,NA,Sunday Jul 22 2018
https://www.khanacademy.org/profile/kaid_1202505937095267213741452/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Thursday Apr 28 2016
https://www.khanacademy.org/profile/kaid_464949975690601300556189/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 16 2018
https://www.khanacademy.org/profile/kaid_727801603402106934190616/,5 years ago,2927634,1049,6,562,332,9,NA,NA,20,NA,NA
https://www.khanacademy.org/profile/kaid_808370995413780397188230/,NA,NA,NA,NA,19,192,0,NA,NA,52,NA,Saturday Jan 19 2019
https://www.khanacademy.org/profile/kaid_427134832219441477944618/,NA,NA,NA,2,0,0,0,NA,NA,0,0,Tuesday Sep 18 2018
https://www.khanacademy.org/profile/kaid_232193725763932936324703/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Monday May 15 2017
https://www.khanacademy.org/profile/kaid_167043118118112381390423/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Friday Mar 1 2019
https://www.khanacademy.org/profile/kaid_17327330351684516133566/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,NA
https://www.khanacademy.org/profile/kaid_146705727466233630898864/,NA,NA,NA,NA,0,0,0,NA,NA,0,0,Thursday Apr 5 2018

How to ask f.write() to put NA's if there is no data in beautifulsoup?

My goal is to scrape some specific data from multiple profile pages on Khan Academy and put the data into a CSV file.
Here is the code to scrape one specific profile page and put it on a csv:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/DFletcher1990/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
user_socio_table=soup.find_all('div', class_='discussion-stat')
data = {}
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
This code works fine with this specific link ('https://www.khanacademy.org/profile/DFletcher1990/').
Now, when I change the link to another Khan Academy profile, for example 'https://www.khanacademy.org/profile/Kkasparas/',
I get this error:
KeyError: 'project help requests'
This is normal, because on the profile "https://www.khanacademy.org/profile/Kkasparas/" there is no project help requests value (and no project help replies either).
Thus data['project help requests'] and data['project help replies'] don't exist and can't be written to the CSV file.
My goal is to run this script over many profile pages.
So I would like to know how to put an NA in every case where I don't get the data for a variable, and then write those NA's to the CSV file.
In other words: I would like to make my script work for any kind of user profile page.
Many thanks in advance for your contributions :)
You could define a list of all the possible keys and set any key that is not present to 'NA' before writing to the file.
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'
Also, a gentle reminder to provide fully working code in your question: user_socio_table was not defined in the question, and I had to look up your previous question to find it.
Full code would be
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('https://www.khanacademy.org/profile/Kkasparas/')
r.html.render(sleep=5)
soup=BeautifulSoup(r.html.html,'html.parser')
user_info_table=soup.find('table', class_='user-statistics-table')
dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
data = {}
user_socio_table=soup.find_all('div', class_='discussion-stat')
for gettext in user_socio_table:
    category = gettext.find('span')
    category_text = category.text.strip()
    number = category.previousSibling.strip()
    data[category_text] = number
full_data_keys=['questions','votes','answers','flags raised','project help requests','project help replies','comments','tips and thanks']
for header_value in full_data_keys:
    if header_value not in data.keys():
        data[header_value]='NA'
filename = "khanscraptry1.csv"
f = open(filename, "w")
headers = "date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx\n"
f.write(headers)
f.write(dates + "," + points.replace("," , "") + "," + videos + "," + data['questions'] + "," + data['votes'] + "," + data['answers'] + "," + data['flags raised'] + "," + data['project help requests'] + "," + data['project help replies'] + "," + data['comments'] + "," + data['tips and thanks'] + "\n")
f.close()
Output - khanscraptry1.csv
date, points, videos, questions, votes, answers, flags, project_request, project_replies, comments, tips_thx
6 years ago,1527829,1123,25,100,2,0,NA,NA,0,0
Change to the following lines if user_info_table is not present
if user_info_table is not None:
    dates,points,videos=[tr.find_all('td')[1].text for tr in user_info_table.find_all('tr')]
else:
    dates=points=videos='NA'

AttributeError: 'str' object has no attribute 'text' python 2.7

I know there are many questions like this, but the answers are all specific and only fix that person's particular script.
I am currently trying to print a bunch of info from the UK version of supremenewyork.com.
This script can successfully print all the info I want from Supreme US, but when I added the proxy code I started to get a lot of errors.
I know the proxy code works, because I tested it in a small script and it was able to pull info that exists on Supreme UK but not on Supreme US.
Here is my script.
import requests
from bs4 import BeautifulSoup
UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')
proxies = {
'http': 'http://' + UK_Proxy1 + '',
'https': 'http://' + UK_Proxy2 + '',
}
categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','hats','bags','accessories','shoes','skate']
catNumb = 0
altArray = []
nameArray = []
styleArray = []
for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get((cUrl.text), proxies=proxies)
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        style = item_soup.find('p', itemprop='model').text
        print alt +(' --- ')+ name +(' --- ')+ style
        altArray.append(alt)
        nameArray.append(name)
        styleArray.append(style)
print altArray
print nameArray
print styleArray
I am getting this error when I execute the script:
AttributeError: 'str' object has no attribute 'text', with the error pointing at the line
proxy_script = requests.get((cUrl.text), proxies=proxies)
I recently made a change to the script which sort of fixed it: it was able to print the categories, but none of the info between them (which I need); it just printed ****************jackets**************, ****shirts******, etc. Here is what I changed:
import requests
from bs4 import BeautifulSoup
# make sure proxy is http and port 8080
UK_Proxy1 = raw_input('UK http Proxy1: ')
UK_Proxy2 = raw_input('UK http Proxy2: ')
proxies = {
'http': 'http://' + UK_Proxy1 + '',
'https': 'http://' + UK_Proxy2 + '',
}
categorys = ['jackets','shirts','tops_sweaters','sweatshirts','pants','shorts','t-shirts','hats','bags','accessories','shoes','skate']
catNumb = 0
altArray = []
nameArray = []
styleArray = []
for cat in categorys:
    catStr = str(categorys[catNumb])
    cUrl = 'http://www.supremenewyork.com/shop/all/' + catStr
    proxy_script = requests.get(cUrl, proxies=proxies).text
    bSoup = BeautifulSoup(proxy_script, 'lxml')
    print('\n*******************"'+ catStr.upper() + '"*******************\n')
    catNumb += 1
    for item in bSoup.find_all('div', class_='inner-article'):
        url = item.a['href']
        alt = item.find('img')['alt']
        req = requests.get('http://www.supremenewyork.com' + url)
        item_soup = BeautifulSoup(req.text, 'lxml')
        name = item_soup.find('h1', itemprop='name').text
        style = item_soup.find('p', itemprop='model').text
        print alt +(' --- ')+ name +(' --- ')+ style
        altArray.append(alt)
        nameArray.append(name)
        styleArray.append(style)
print altArray
print nameArray
print styleArray
I put .text at the end and it sort of worked. How do I fix it so it prints the info I want?
I think you're missing something: your cUrl is a string, not a Response object. I guess you want:
proxy_script = requests.get(cUrl, proxies=proxies).text
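For context (a small illustration, not part of the original script): .text belongs on the Response object that requests.get() returns, not on the URL string itself, so the working pattern is:
response = requests.get(cUrl, proxies=proxies)   # Response object returned by requests
html = response.text                             # str containing the page HTML
bSoup = BeautifulSoup(html, 'lxml')              # parse the HTML as before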
