How to extract information using BeautifulSoup from a particular site - python-3.x

My objective is to extract info from the site https://shopopenings.com/merchant-search after entering the PIN code of the respective area, and to copy all the info from there, i.e. whether each outlet is open or closed. There has to be a loop.

This site has an underlying API that you can use to get JSON responses. To find the endpoints and what is expected as request and response, you can use the network tab of the Firefox or Chrome developer tools.
import json
import requests
SEARCH_ADDRESS = "California City, CA 93505"
urlEndpoint_AutoComplete = "https://shopopenings.com/api/autocomplete"
urlEndpoint_Search = "https://shopopenings.com/api/search"
search_Location = {"type":"address", "searchText":SEARCH_ADDRESS, "language":"us"}
locations = requests.post(urlEndpoint_AutoComplete, data=search_Location)
local = json.loads(locations.text)[0] # get first address
local["place_address"] = local.pop("name") # fix key name for next post request
local["place_id"] = local.pop("id") # fix key name for next post request
local["shopTypes"] = ["ACC", "ARA", "AFS", "AUT", "BTN", "BWL", "BKS", "AAC",
"CEA", "CSV", "DPT", "DIS", "DSC", "DLS", "EQR", "AAF", "GHC", "GRO", "HBM",
"HIC", "AAM", "AAX", "MER", "MOT", "BMV", "BNM", "OSC", "OPT", "EAP", "SHS",
"GSF", "SGS", "TEV", "TOY", "TAT", "DVG", "WHC", "AAW"]
local["range"] = 304.8
local["language"] = "us"
results = requests.post(urlEndpoint_Search, data=local)
print(json.loads(results.text))
{'center': {'latitude': 35.125801, 'longitude': -117.9859038},
'range': '304.8',
'merchants': [{'mmh_id': '505518130',
'latitude': 35.125801,
'longitude': -117.9859,
'shopName': 'Branham M Branham Mtr',
'shopFullAddressString': 'California City, CA',
'isOpen': False,
'nfc': False,
'shopType': 'AUT',
'distance': 0.34636329,
'country': 'USA'},
{'mmh_id': '591581670',
'latitude': 35.125442,
'longitude': -117.986083,
'shopName': 'One Stop Market',
'shopFullAddressString': '7990 California City Blvd, California City, CA 93505-2518',
'isOpen': True,
'nfc': True,
'shopType': 'AFS',
'distance': 43.04766933,
'country': 'USA'},
...
...
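Once you have the parsed response, reporting open/closed status for every outlet is one loop over the merchants list. A minimal sketch, run against a trimmed copy of the response shown above:

```python
import json

# A trimmed snippet of the search response shown above (two merchants).
results_text = '''{"center": {"latitude": 35.125801, "longitude": -117.9859038},
 "range": "304.8",
 "merchants": [
   {"mmh_id": "505518130", "shopName": "Branham M Branham Mtr",
    "shopFullAddressString": "California City, CA", "isOpen": false, "shopType": "AUT"},
   {"mmh_id": "591581670", "shopName": "One Stop Market",
    "shopFullAddressString": "7990 California City Blvd, California City, CA 93505-2518",
    "isOpen": true, "shopType": "AFS"}]}'''

# Loop over every merchant and report whether it is open or closed.
for merchant in json.loads(results_text)["merchants"]:
    status = "open" if merchant["isOpen"] else "closed"
    print(f'{merchant["shopName"]}: {status}')
```

In the real script, `results_text` would be `results.text` from the search request above.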

I think you could use Selenium to control the navigation and the entering of the PIN, then use BeautifulSoup to work with the page source after your actions. Here is the documentation; it's easy enough to get you started.
Selenium -- https://selenium-python.readthedocs.io/
BeautifulSoup -- https://readthedocs.org/projects/beautiful-soup-4/
Enjoy!!

Related

How to fetch only parts of json file in python3 requests module

So, I am writing a program in Python to fetch data from the Google Classroom API using the requests module. I am getting the full JSON response from the classroom as follows:
{'announcements': [{'courseId': '#############', 'id': '###########', 'text': 'This is a test','state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/##########/p/###########', 'creationTime': '2021-04-11T10:25:54.135Z', 'updateTime': '2021-04-11T10:25:53.029Z', 'creatorUserId': '###############'}, {'courseId': '############', 'id': '#############', 'text': 'Hello everyone', 'state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/#############/p/##################', 'creationTime': '2021-04-11T10:24:30.952Z', 'updateTime': '2021-04-11T10:24:48.880Z', 'creatorUserId': '##############'}, {'courseId': '##################', 'id': '############', 'text': 'Hello everyone', 'state': 'PUBLISHED', 'alternateLink': 'https://classroom.google.com/c/##############/p/################', 'creationTime': '2021-04-11T10:23:42.977Z', 'updateTime': '2021-04-11T10:23:42.920Z', 'creatorUserId': '##############'}]}
I was actually unable to convert this into a pretty format, so I'm pasting it as I got it from the HTTP request. What I actually wish to do is request only the first few announcements (say 1, 2, 3, whatever, depending upon the requirement) from the service, while what I'm getting is every announcement (as in the sample, 3 announcements) made since the classroom was created. Now, I believe that fetching all the announcements might make the program slower, so I would prefer to get only the required ones. Is there any way to do this by passing some arguments or anything? There are a few direct functions provided by Google Classroom; however, I came across those a little later and have already written everything using the requests module, so switching would require changing a lot of things, which I would like to avoid. However, if unavoidable, I would go that route as well.
Answer:
Use the pageSize field to limit the number of responses you want in the announcements: list request, with an orderBy parameter of updateTime asc.
More Information:
As per the documentation:
orderBy: string
Optional sort ordering for results. A comma-separated list of fields with an optional sort direction keyword. Supported field is updateTime. Supported direction keywords are asc and desc. If not specified, updateTime desc is the default behavior. Examples: updateTime asc, updateTime
and:
pageSize: integer
Maximum number of items to return. Zero or unspecified indicates that the server may assign a maximum.
So, let's say you want the first 3 announcements for a course, you would use a pageSize of 3, and an orderBy of updateTime asc:
# Copyright 2021 Google LLC.
# SPDX-License-Identifier: Apache-2.0
service = build('classroom', 'v1', credentials=creds)
order_by = "updateTime asc"
page_size = 3
# Call the Classroom API (announcements.list also needs the ID of the course)
results = service.courses().announcements().list(
    courseId=course_id, pageSize=page_size, orderBy=order_by).execute()
or an HTTP request example:
GET https://classroom.googleapis.com/v1/courses/[COURSE_ID]/announcements
?orderBy=updateTime%20asc
&pageSize=3
&key=[YOUR_API_KEY] HTTP/1.1
Authorization: Bearer [YOUR_ACCESS_TOKEN]
Accept: application/json
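Since the question sticks with the requests module, the same two parameters can simply go into the query string of a plain GET. A minimal sketch of building that request (stdlib only; `COURSE_ID` and the bearer token are placeholders, not real values):

```python
from urllib.parse import urlencode

base = "https://classroom.googleapis.com/v1/courses/{}/announcements"
course_id = "COURSE_ID"  # placeholder for a real course ID
params = {"orderBy": "updateTime asc", "pageSize": 3}

# Build the full URL with the paging/sorting parameters attached.
url = base.format(course_id) + "?" + urlencode(params)
print(url)

# The actual call would also carry an OAuth bearer token, e.g.:
# requests.get(url, headers={"Authorization": "Bearer " + access_token})
```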
References:
Method: announcements.list | Classroom API | Google Developers

How to structure (take out) JSON from a text string. (Python)

I have a text string/script which I took out from a webpage. I would like to clean/structure that string so that I get only the JSON out of it. But it is so long that I lost track of the beginning and the end of the JSON within the text. Can anyone help me out, or advise an online tool which can help find the beginning and end of the JSON in that text? Many thanks.
window.__NUXT__=function(e,l,a,t,r,s,i,o,n,d){return{layout:s,data:[{product:{active_gtin:"5711555000616",active_supplier:"0000009002",active_supplier_product_id:"000000000091052931-EA",brand:"Prosonic",description:"Prosonic 32\" TV med Android og Full-HD opløsning. Android styresystemet giver dig let adgang til Netflix, Viaplay og TV2 Play samt mange andre apps og med indbygget Chromecast kan du let caste indhold til TV'et.",display_list_price:l,display_sales_price:l,energy_class:"A+",energy_class_color_code:"lev_3",energy_label:i,erp_product_id:o,gallery_images:[i,"https://sg-dam.imgix.net/services/assets.img/id/13a13e85-efe7-48eb-bb6c-953abc94fb08/size/original","https://sg-dam.imgix.net/services/assets.img/id/e0c39be1-eb82-4652-88f4-992226390a3f/size/original","https://sg-dam.imgix.net/services/assets.img/id/9bc81449-64ba-44c0-b691-31b22bf5dc91/size/original"],hybris_code:n,id:n,image_primary:"https://sg-dam.imgix.net/services/assets.img/id/f8d59494-3da7-4cb7-9dd8-e8d16577e7c4/size/original",in_stock_stores_count:15,is_approved_for_sale:t,is_exposed:t,is_reservable:t,name:'Prosonic 32" 32and6021 LED tv',online_from:16000344e5,online_to:2534022108e5,primary_category_path:"/elektronik/tv",product_url:"/produkter/prosonic-32-32and6021-led-tv/100553115/",sales_price:e,show_discount_message:a,sku:o,specifications:'[{"features":[{"code":"text-TvMemory","label":"Tekst TV hukommelse","value":"1000"}],"label":"Tekst TV hukommelse"},{"features":[{"code":"tvFeatures","label":"TV funktioner","value":"Netflix"},{"code":"tvFeatures","label":"TV funktioner","value":"SmartTV"},{"code":"tvFeatures","label":"TV funktioner","value":"Wi-Fi indbygget"}],"label":"TV funktioner"},{"features":[{"code":"TV.tvApps","label":"TV Apps","value":"Amazon"},{"code":"TV.tvApps","label":"TV Apps","value":"Apple TV"},{"code":"TV.tvApps","label":"TV Apps","value":"Blockbuster"},{"code":"TV.tvApps","label":"TV Apps","value":"Boxer"},{"code":"TV.tvApps","label":"TV 
Apps","value":"Dplay"},{"code":"TV.tvApps","label":"TV Apps","value":"DR TV"},{"code":"TV.tvApps","label":"TV Apps","value":"Google Play Store"},{"code":"TV.tvApps","label":"TV Apps","value":"HBO Nordic"},{"code":"TV.tvApps","label":"TV Apps","value":"Min Bio"},{"code":"TV.tvApps","label":"TV Apps","value":"Netflix"},{"code":"TV.tvApps","label":"TV Apps","value":"Rakuten TV"},{"code":"TV.tvApps","label":"TV Apps","value":"SF Anytime"},{"code":"TV.tvApps","label":"TV Apps","value":"Skype"},{"code":"TV.tvApps","label":"TV Apps","value":"Spotify"},{"code":"TV.tvApps","label":"TV Apps","value":"TV2 play"},{"code":"TV.tvApps","label":"TV Apps","value":"Viaplay"},{"code":"TV.tvApps","label":"TV Apps","value":"YouSee"},{"code":"TV.tvApps","label":"TV Apps","value":"Youtube"}],"label":"TV Apps"},{"features":[{"code":"connectivity.videoConnectivity","label":"Video tilslutning","value":"composite"}],"label":"Video tilslutning"},{"features":[{"code":"screen.monitorLanguageList","label":"Skærmsprog","value":"Dansk"}],"label":"Skærmsprog"},{"features":[{"code":"builtInSpeakers.soundFunction","label":"Lydfunktioner","value":"Bluetooth"}],"label":"Lydfunktioner"},{"features":[{"code":"productionYear","label":"Produktionsår","value":"2.020"}],"label":"Produktionsår"},{"features":[{"code":"electronics.manufacturerNum","label":"Producentens Varenummer","value":"32AND6021"}],"label":"Producentens Varenummer"},{"features":[{"code":"TV.hdrLOV","label":"HDR","value":"HDR 10"}],"label":"HDR"},{"features":[{"code":"TV.isSleepTimerPresent","label":"Sleep timer","value":"Ja"}],"label":"Sleep timer"},{"features":[{"code":"isPVRFunctionPresent","label":"PVR funktion","value":"Ja"}],"label":"PVR funktion"},{"features":[{"code":"accessoriesIncluded","label":"Tilbehør inkluderet","value":"stand og remote"}],"label":"Tilbehør 
inkluderet"},{"features":[{"code":"screenTechnologyDesc","label":"Skærmteknologi","value":"LED"}],"label":"Skærmteknologi"},{"features":[{"code":"tvTunerList","label":"TV-tuners","value":"CI+"},{"code":"tvTunerList","label":"TV-tuners","value":"DVB-C"},{"code":"tvTunerList","label":"TV-tuners","value":"DVB-S"},{"code":"tvTunerList","label":"TV-tuners","value":"DVB-T2"},{"code":"tvTunerList","label":"TV-tuners","value":"MPEG4 tuner"}],"label":"TV-tuners"},{"features":[{"code":"TV.vesaStandardList","label":"Vægbeslag Vesa standard","value":"75x75"}],"label":"Vægbeslag Vesa standard"},{"features":[{"code":"connectivity.hdmiCount","label":"Antal HDMI","value":"3"}],"label":"Antal HDMI"},{"features":[{"code":"builtInSpeakers.speakerEffect","label":"Højtalereffekt","value":"12"}],"label":"Højtalereffekt"},{"features":[{"code":"usbCount","label":"Antal USB stik","value":"1"}],"label":"Antal USB stik"},{"features":[{"code":"TVResolution","label":"TV opløsning","value":"Full HD"}],"label":"TV opløsning"},{"features":[{"code":"picturePlayers.supportedImageFormats","label":"Understøttede Billed Formater","value":"JPG,BMP,PNG,GIF"}],"label":"Understøttede Billed Formater"},{"features":[{"code":"scartCount","label":"Antal scartstik","value":"0"}],"label":"Antal scartstik"},{"features":[{"code":"connectivity.usbcount2","label":"Antal USB 2.0 porte","value":"1"}],"label":"Antal USB 2.0 porte"},{"features":[{"code":"Color","label":"Produktfarve","value":"sort"}],"label":"Produktfarve"},{"features":[{"code":"TV.isWatchAndTimerFunctionOnOffPresent","label":"Ur og timerfunktion til\\/fra","value":"Ja"}],"label":"Ur og timerfunktion til\\/fra"},{"features":[{"code":"TV.isAutomaticChannelSearchAvailable","label":"Automatisk kanalsøgning","value":"Ja"}],"label":"Automatisk kanalsøgning"},{"features":[{"code":"screen.screenResolution","label":"Skærmopløsning","value":"Full-HD 1920 x 1080"}],"label":"Skærmopløsning"},{"features":[{"code":"TV.software","label":"TV 
software","value":"Android"}],"label":"TV software"},{"features":[{"code":"connectivity.connectivityDesc","label":"Andre tilslutningsmuligheder","value":"Composite, Audio in, VGA, optisk lyd ud,"}],"label":"Andre tilslutningsmuligheder"},{"features":[{"code":"TV.twinTuner","label":"Twin Tuner","value":"Nej"}],"label":"Twin Tuner"},{"features":[{"code":"picturePlayers.supportedVideoFileFormats","label":"Understøttede videofil formater","value":".MPG .MPEG.DAT.VOB.MKV.MP4 \\/ .M4A \\/ .M4V.MOV.FLV.3GP \\/ 3GPP.TS \\/ .M2TS.RMVB .RM.AVI.ASF .WMV.WEBM"}],"label":"Understøttede videofil formater"},{"features":[{"code":"isInternetBrowserPresent","label":"Internet browser","value":"Ja"}],"label":"Internet browser"},{"features":[{"code":"wirelessConnectivityOptionList","label":"Trådløse tilslutningsmuligheder","value":"Bluetooth"},{"code":"wirelessConnectivityOptionList","label":"Trådløse tilslutningsmuligheder","value":"Wi-Fi indbygget"}],"label":"Trådløse tilslutningsmuligheder"}]',step_product_id:"GR14425172",stock_count_online:2874,stock_count_status_online:"in_stock",stock_type:"NORMAL",summary:"Med Android og indbygget 
Chromecast",msg_sales_price_per_unit:l,package_display_sales_price:l,promotion_text:e,f_campaign_name:[]},loadingProduct:a}],error:e,state:{User:{UID:l,isLoggedIn:a,nickname:l,address:{firstName:l,lastName:l,address:l,postalCode:l,city:l,mobile:l,email:l,country:l},isDeliveryMethodSet:a,lastSeenProducts:[],wishlistProducts:[]},Tracking:{trackedOrders:[],activeRoute:e,oldRoute:e,cookieConsentGiven:a,initialRouteTracked:a},Search:{showDrawer:a,hideGlobalSearch:a,query:l,queryString:l,queries:[],brands:[],categories:[]},Products:{products:[]},ProductDialog:{showType:a,productId:e,quantity:e,error:e},plugins:{Cart:{checkoutErrorPlugin:{},productDialogPlugin:{}},TechnicalError:{technicalErrorPlugin:{}},Tracking:{gtmPlugin:{},gtmHandlers:{appInitializedHandler:{},bannerClickedHandler:{},bannerViewedHandler:{},checkoutStepChangedHandler:{},clickCollectCompletedHandler:{},cookieConsentGivenHandler:{},externalLinkClickedHandler:{},helpers:{},notFoundPageViewedHandler:{},orderCompletedHandler:{},plpProductsViewedHandler:{},productAddedHandler:{},productClickedHandler:{},productDetailViewedHandler:{},productQuantityChangeHandler:{},productRemovedHandler:{},recommendationsClickedHandler:{},recommendationsViewedHandler:{},routeChangedHandler:{},siteSearchHandler:{}}},User:{userPlugin:{}}},Payment:{paymentMethod:e,termsAccepted:a},OAuth:{accessToken:e,expiry:0,timestamp:e,trackingId:e},Navigation:{hierarchy:e,path:[],loading:a,lastFetchedTopNode:l},Layout:{eyebrow:{default:e},footer:{default:e},layout:s},InfoBar:{infoBars:[],infoBarMappers:{}},Delivery:{isFetchingPickups:a,deliveries:{},pickups:{},selectedDeliveries:{}},ClickCollect:{loading:a,showDrawer:a,baseMapLocation:e,stores:[],selectedStore:e,product:e,quantity:1,form:{name:l,email:l,countryDialCode:"45",phone:l,terms:a},reservation:e,error:a,filters:{inStockOnly:t}},Checkout:{panelState:{userInfo:{},delivery:{},payment:{mustVisit:t},store:{}},desiredPanel:"auto",panelValidators:{}},Cart:{data:{id:l,lineItems:[],totalLineI
temsQuantity:0,totalSalesPrice:r,totalShippingSalesPrice:r,employeeNumber:e,loyaltyNumber:e,deliveries:[],totalLineItemSalesPrice:r,totalLineItemListPrice:r,totalLineItemDiscount:r,totalShippingListPrice:r,totalShippingPriceDiscount:r,orderNumber:e,totalSalesPriceNumber:0,isActive:t,isAllLineItemsValid:t,shippingAddress:d,billingAddress:d,hash:l,discountCodes:[],source:"USER_DEVICE"},loading:{},error:e,assistedSalesMode:a,assistedSalesStoreNumber:e},Breadcrumb:{categoryTree:{},productCategory:l,lookupBreadcrumbTasks:{},currentCategoryPage:[],helpers:{}}},serverRendered:t}}(null,"",!1,!0,"0,00","default","https://sg-dam.imgix.net/services/assets.img/id/87a045c1-0923-4575-81ce-fd9b7c3bfbf6/size/original","91052931-EA","100553115",void 0)
You can use a regex to get the JSONs from your string.
I have used this pattern: {(?:[^{}]*{[^{]*})*[^{}]*}
The above regex only handles JSON nested one level deep.
Code:
import re
import json

input_data = """window.__NUXT__=funct ... A","100553115",void 0)"""

def json_validate(input_str):
    founds = re.findall(r"{(?:[^{}]*{[^{]*})*[^{}]*}", input_str)
    valid_jsons = []
    for x in founds:
        try:
            valid_jsons.append(json.loads(x))
        except json.JSONDecodeError:
            continue
    return valid_jsons

getting_jsons = json_validate(input_data)
for one_json in getting_jsons:
    print(one_json)
print(len(getting_jsons))
It can find several (32) valid JSONs in your string:
>>> python3 test.py
{'features': [{'code': 'text-TvMemory', 'label': 'Tekst TV hukommelse', 'value': '1000'}], 'label': 'Tekst TV hukommelse'}
{'features': [{'code': 'tvFeatures', 'label': 'TV funktioner', 'value': 'Netflix'}, {'code': 'tvFeatures', 'label': 'TV funktioner', 'value': 'SmartTV'}, {'code': 'tvFeatures', 'label': 'TV funktioner', 'value': 'Wi-Fi indbygget'}], 'label': 'TV funktioner'}
{'features': [{'code': 'TV.tvApps', 'label': 'TV Apps', 'value': 'Amazon'}, {'code ...
I have found another solution which approaches the issue from a totally different direction: https://stackoverflow.com/a/54235803/11502612
I have tested the code from the above answer and I got the same output, which suggests the result is correct (probably).
Would it not be easier to do something like
import json
data = json.loads(your_string)
Then iterate over it to find the values. Alternatively you can look for the value locations with
find("{")
Don't know if this is what you're looking for, but thought it may spark an idea / alternative view.
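Building on the find("{") idea: json.JSONDecoder.raw_decode can parse a JSON value starting at a given index and report where it ends, so you can scan for each "{" and try to decode from there. A sketch (the sample string is a made-up stand-in for the real page script):

```python
import json

def extract_json_objects(text):
    """Yield every valid JSON object embedded in text."""
    decoder = json.JSONDecoder()
    pos = 0
    while True:
        start = text.find("{", pos)
        if start == -1:
            return
        try:
            obj, end = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            pos = start + 1  # not valid JSON here; keep scanning
        else:
            yield obj
            pos = end  # skip past the decoded object

# Made-up stand-in for the scraped script text:
sample = 'window.__NUXT__=function(e){return {"brand":"Prosonic","in_stock":true}}(null)'
print(list(extract_json_objects(sample)))
```

Unlike the regex, raw_decode handles arbitrarily deep nesting, since the real JSON parser does the matching.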

find_all() in BeautifulSoup returns empty ResultSet

I am trying to scrape data from a website to practice web scraping, but find_all() returns an empty ResultSet. How can I resolve this issue?
#importing required modules
import requests,bs4
#sending request to the server
req = requests.get("https://www.udemy.com/courses/search/?q=python")
# checking the status on the request
print(req.status_code)
req.raise_for_status()
#converting using BeautifulSoup
soup = bs4.BeautifulSoup(req.text,'html.parser')
#Trying to scrape the particular div with the class but returning 0
container = soup.find_all('div',class_='popover--popover--t3rNO popover--popover-hover--14ngr')
#trying to print the number of container returned.
print(len(container))
Output :
200
0
See my comment about it being entirely JavaScript-driven content. Modern websites often use JavaScript to make HTTP requests to the server and grab data on demand when needed. If you disable JavaScript here, which you can easily do in Chrome from the settings reached when you inspect the page, you will see that NO text is available on this website, which is probably much different to IMDb, as you pointed out. If you check the BeautifulSoup-parsed HTML, you'll see you don't have any of the actual page content derived with JavaScript.
There are two ways to get data from a JavaScript-rendered website:
Mimic the HTTP request to the server
Use a browser automation package like Selenium
The first option is better and more efficient, as the second option is more brittle and not great for larger data sets.
Fortunately, Udemy is getting the data you want from an API endpoint: it uses JavaScript to make HTTP requests to that endpoint, and the response gets fed back to the browser.
Code Example
import requests
cookies = {
'__udmy_2_v57r': '4f711b308da548b49394854a189d3179',
'ud_firstvisit': '2020-05-29T13:48:56.584511+00:00:1jefNY:9F1BJVEUJpv7gmNPgYNini76UaE',
'existing_user': 'true',
'optimizelyEndUserId': 'oeu1590760136407r0.2130390415126655',
'EUCookieMessageShown': 'true',
'_ga': 'GA1.2.1359933509.1590760142',
'_pxvid': '26d89ed1-a1b3-11ea-9179-cb750fa4136b',
'_ym_uid': '1585144165890161851',
'_ym_d': '1590760145',
'__ssid': 'd191bc02a1063fd2c75fbab525ededc',
'stc111655': 'env:1592304425%7C20200717104705%7C20200616111705%7C1%7C1014616:20210616104705|uid:1590760145861.374775813.04725504.111655.1839745362:20210616104705|srchist:1069270%3A1%3A20200629134905%7C1014624%3A1592252104%3A20200716201504%7C1014616%3A1592304425%3A20200717104705:20210616104705|tsa:0:20200616111705',
'ki_t': '1590760146239%3B1592304425954%3B1592304425954%3B3%3B5',
'ki_r': 'aHR0cHM6Ly93d3cuZ29vZ2xlLmNvbS8%3D',
'IR_PI': '00aea1e6-9da9-11ea-af3a-42010a24660a%7C1592390825988',
'_gac_UA-12366301-1': '1.1592304441.CjwKCAjw26H3BRB2EiwAy32zhfcltNEr_HHFK5JRaJar5qxUn4ifG9FVFctWyTUXigNZvKeOCz7PgxoCAfAQAvD_BwE',
'csrftoken': 'pPOdtdbH0HPaHvDfAZMzEOdvWqKZuQWufu8dUrEeXuy5mOOrnFRbWZ9vq8Dfd2ts',
'__cfruid': 'f1963d736e3891a2e307ebc9f918c89065ffe40f-1596962093',
'__cfduid': 'df4d951c87bc195c73b2f12b5e29568381597085850',
'ud_cache_price_country': 'GB',
'ud_cache_device': 'desktop',
'ud_cache_language': 'en',
'ud_cache_logged_in': '0',
'ud_cache_release': '0804b40d37e001f97dfa',
'ud_cache_modern_browser': '1',
'ud_cache_marketplace_country': 'GB',
'ud_cache_brand': 'GBen_US',
'ud_cache_version': '1',
'ud_cache_user': '',
'seen': '1',
'eventing_session_id': '66otW5O9TQWd5BYq1_etrA-1597087737933',
'ud_cache_campaign_code': '',
'exaff': '%7B%22start_date%22%3A%222020-08-09T08%3A52%3A04.083577Z%22%2C%22code%22%3A%22_7fFXpljNdk-m3_OJPaWBwAQc5gVKutaSg%22%2C%22merchant_id%22%3A39197%2C%22aff_type%22%3A%22LS%22%2C%22aff_id%22%3A60680%7D:1k5D3W:2PemPLTm4xaHixBYRvRyBaAukL4',
'evi': 'SlFfLh4RBzwTSVBjXFdHehNJUGMYQE99HVFdIExYQ3gARVY8QkAWIEEDCXsVQEd0BEsJexVAA24LQgdjGANXdgZBG3ETH1luRBdHKBoHV3ZKURl5XVBXdkpRXWNUU1luRxIJe1lTQXhMDgdjHRAFbgsICXNWVk1uCwgJN0xYRGATBUpjVFVEdAEOB2NcWkR+E0lQYxhAT30dUV0gTFhCfAhDVm1MUEJ0B1EROkwUV3YAXwk3D0BPewFAHzxCQEd0BUcJexVAA24LQgdjGANXdgZCHHETTld+BkUdY1QZVzoTSRptTBQUbgtFEnleHwhgEwBcY1QZV34HShtjVBlXOhNJE21MFBRuC0UceV4fWW4DSxh3TFgObkdREXBCQAMtE0kccFtUCGATQR54VkBPNxMFCXtfTlc6UFERd1tUTTEdURlzX1JXdkpRXWNUU1luRxIJe1tXQnpMXwlzVldDbgsICTdMWEdgEwVKY1RVRHUJDgdjXFdCdBNJUGMYQE99HVFdIExYQ3kCQ1Y8Ew==',
'ud_rule_vars': 'eJyFjkuOwyAQBa9isZ04agyYz1ksIYxxjOIRGmhPFlHuHvKVRrPItvWqus4EXT4EDJP9jSViyobPktKRgZqc4GrkmmmuBHdU6YlRqY1P6RgDMQ05D2SOueCDtZPDMNT7QDrooAXRdrqhzHBlRL8XUjPgXwAGYCC7ulpdRX3acglPA8bvPwbVgm6g4p0Bvqeyhsh_BkybXyxmN8_R21J9vvpcjm5cn7ZDTidc7G2xxnvlm87hZwvlU7wE2VP1en0hlyuoG10j:1k5D3W:nxRv-tyLU7lxhsF2jRYvkJA53uM',
}
headers = {
'authority': 'www.udemy.com',
'x-udemy-cache-release': '0804b40d37e001f97dfa',
'x-udemy-cache-language': 'en',
'x-udemy-cache-user': '',
'x-udemy-cache-modern-browser': '1',
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
'accept': 'application/json, text/plain, */*',
'x-udemy-cache-brand': 'GBen_US',
'x-udemy-cache-version': '1',
'x-requested-with': 'XMLHttpRequest',
'x-udemy-cache-logged-in': '0',
'x-udemy-cache-price-country': 'GB',
'x-udemy-cache-device': 'desktop',
'x-udemy-cache-marketplace-country': 'GB',
'x-udemy-cache-campaign-code': '',
'sec-fetch-site': 'same-origin',
'sec-fetch-mode': 'cors',
'sec-fetch-dest': 'empty',
'referer': 'https://www.udemy.com/courses/search/?q=python',
'accept-language': 'en-US,en;q=0.9',
}
params = (
('q', 'python'),
('skip_price', 'false'),
)
response = requests.get('https://www.udemy.com/api-2.0/search-courses/', headers=headers, params=params, cookies=cookies)
ids = []
titles = []
durations = []
ratings = []
for a in response.json()['courses']:
    title = a['title']
    duration = int(a['estimated_content_length']) / 60
    rating = a['rating']
    id = str(a['id'])
    titles.append(title)
    ids.append(id)
    durations.append(duration)
    ratings.append(rating)
clean_ids = ','.join(ids)
params2 = (
('course_ids', clean_ids),
('fields/[pricing_result/]', 'price,discount_price,list_price,price_detail,price_serve_tracking_id'),
)
response = requests.get('https://www.udemy.com/api-2.0/pricing/', params=params2)
data = response.json()['courses']
prices = []
for a in ids:
    price = response.json()['courses'][a]['price']['amount']
    prices.append(price)
data = zip(titles, durations, ratings, prices)
for a in data:
    print(a)
Output
('Learn Python Programming Masterclass', 56.53333333333333, 4.54487, 14.99)
('The Python Mega Course: Build 10 Real World Applications', 25.3, 4.51476, 16.99)
('Python for Beginners: Learn Python Programming (Python 3)', 2.8833333333333333, 4.4391, 17.99)
('The Python Bible™ | Everything You Need to Program in Python', 9.15, 4.64238, 17.99)
('Python for Absolute Beginners', 3.066666666666667, 4.42209, 14.99)
('The Modern Python 3 Bootcamp', 30.3, 4.64714, 16.99)
('Python for Finance: Investment Fundamentals & Data Analytics', 8.25, 4.52908, 12.99)
('The Complete Python Course | Learn Python by Doing', 35.31666666666667, 4.58885, 17.99)
('REST APIs with Flask and Python', 17.033333333333335, 4.61233, 12.99)
('Python for Financial Analysis and Algorithmic Trading', 16.916666666666668, 4.53173, 12.99)
('Python for Beginners with Examples', 4.25, 4.27316, 12.99)
('Python OOP : Four Pillars of OOP in Python 3 for Beginners', 2.6166666666666667, 4.46451, 12.99)
('Python Bootcamp 2020 Build 15 working Applications and Games', 32.13333333333333, 4.2519, 14.99)
('The Complete Python Masterclass: Learn Python From Scratch', 32.36666666666667, 4.39151, 16.99)
('Learn Python MADE EASY : A Concise Python Course in Python 3', 2.1166666666666667, 4.76601, 12.99)
('Complete Python Web Course: Build 8 Python Web Apps', 15.65, 4.37577, 13.99)
('Python for Excel: Use xlwings for Data Science and Finance', 16.116666666666667, 4.92293, 12.99)
('Python 3 Network Programming - Build 5 Network Applications', 12.216666666666667, 4.66143, 12.99)
('The Complete Python & PostgreSQL Developer Course', 21.833333333333332, 4.5664, 12.99)
('The Complete Python Programmer Bootcamp 2020', 13.233333333333333, 4.63859, 12.99)
Explanation
There are two ways to do this; re-engineering the requests, shown here, is the more efficient solution. To get the necessary information, you'll need to inspect the page and look at which HTTP requests return which information. You can do this through the network tools --> XHR when you inspect the page. You can see there are two requests that give you information. My suggestion would be to look at the previews of the responses on the right-hand side when you select a request. The first gives you the titles, durations and ratings; the second gives you the prices, and you need the ids of the courses from the first request to ask for them.
I usually copy the cURL of the HTTP requests the JavaScript invokes into curl.trillworks.com, which converts the necessary headers, parameters and cookies to Python format.
The first request requires headers, cookies and parameters. The second request only requires the parameters.
The response you get is a JSON object; response.json() converts this into a Python dictionary. You have to do a bit of digging in this dictionary to get what you want, but for each item in response.json()['courses'] all the necessary data for each 'card' on the website is there. So we do a for loop around where the data sits in the dictionary. I would play around with response.json() until you get a feel for what the object gives you, to understand the code.
The duration comes in minutes, so I've done a quick conversion to hours here. Also, the ids need to be strings because in the second request we use them as parameters to get the prices for the courses. We join the ids into a comma-separated string and feed that as a parameter.
The second request then gives us the necessary prices; again you have to go digging in the dictionary object, and I suggest you do this yourself to confirm where the price is nested.
We zip up the data to combine all the lists, and then I've done a for loop to print it all. You could feed this into pandas if you wanted, etc.
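The zip-up step at the end is just combining the parallel lists into rows. A minimal illustration with made-up sample values (not real Udemy data):

```python
# Made-up sample values standing in for the scraped lists.
titles = ['Learn Python Programming Masterclass', 'The Modern Python 3 Bootcamp']
durations = [56.53, 30.3]
ratings = [4.54, 4.65]
prices = [14.99, 16.99]

# zip pairs up the i-th element of each list into one row per course.
rows = list(zip(titles, durations, ratings, prices))
for row in rows:
    print(row)

# The same rows drop straight into pandas if you want a table:
# import pandas as pd
# df = pd.DataFrame(rows, columns=['title', 'duration_hours', 'rating', 'price'])
```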
To get the required data you need to send requests to the appropriate API. For that you need to create a Session:
import requests
s = requests.Session()
cookies = s.get('https://www.udemy.com').cookies
headers={"Referer": "https://www.udemy.com/courses/search/?q=python&skip_price=false"}
for page_counter in range(1, 500):
    data = s.get('https://www.udemy.com/api-2.0/search-courses/?p={}&q=python&skip_price=false'.format(page_counter), cookies=cookies, headers=headers).json()
    for course in data['courses']:
        params = {'course_ids': [str(course['id']), ],
                  'fields/[pricing_result/]': ['price', ]}
        title = course['title']
        price = s.get('https://www.udemy.com/api-2.0/pricing/', params=params, cookies=cookies).json()['courses'][str(course['id'])]['price']['amount']
        print({'title': title, 'price': price})

TypeError: Object of type 'Location' is not JSON serializable

I am using the geopy library for my Flask web app. I want to save the user location, which I get from my modal (HTML form), in my database (I am using MongoDB), but every single time I get this error:
TypeError: Object of type 'Location' is not JSON serializable
Here's the code:
@app.route('/register', methods=['GET', 'POST'])
def register_user():
    if request.method == 'POST':
        login_user = mongo.db.mylogin
        existing_user = login_user.find_one({'email': request.form['email']})
        # final_location = geolocator.geocode(session['address'].encode('utf-8'))
        if existing_user is None:
            hashpass = bcrypt.hashpw(
                request.form['pass'].encode('utf-8'), bcrypt.gensalt())
            login_user.insert({'name': request.form['username'], 'email': request.form['email'], 'password': hashpass, 'address': request.form['add'], 'location': session['location']})
            session['password'] = request.form['pass']
            session['username'] = request.form['username']
            session['address'] = request.form['add']
            session['location'] = geolocator.geocode(session['address'])
            flash(f"You are Registered as {session['username']}")
            return redirect(url_for('home'))
        flash('Username is taken !')
        return redirect(url_for('home'))
    return render_template('index.html')
Please help; let me know if you want more info.
According to the geopy documentation, the geocode function will "Return a location point by address", i.e. a geopy.location.Location object.
Json serialize support by default the following types:
Python | JSON
dict | object
list, tuple | array
str, unicode | string
int, long, float | number
True | true
False | false
None | null
All the other objects/types are not JSON serializable by default, and therefore you need to define how to serialize them yourself.
geopy.location.Location.raw
Location’s raw, unparsed geocoder response. For details on this,
consult the service’s documentation.
Return type: dict or None
You might be able to use the raw property of the Location (the geolocator.geocode return value), and this value will be JSON serializable.
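Another option, if you'd rather keep passing the Location object around, is to give json.dumps a default= callable that converts any non-serializable object. A sketch with a stand-in class (the real geopy Location exposes address, latitude and longitude attributes, but FakeLocation here is purely illustrative):

```python
import json

class FakeLocation:
    """Stand-in for geopy.location.Location, just for illustration."""
    def __init__(self, address, latitude, longitude):
        self.address = address
        self.latitude = latitude
        self.longitude = longitude

def encode_location(obj):
    # Called by json.dumps for any object it cannot serialize itself.
    if isinstance(obj, FakeLocation):
        return {'address': obj.address, 'point': [obj.latitude, obj.longitude]}
    raise TypeError(f'Object of type {type(obj).__name__} is not JSON serializable')

loc = FakeLocation('Flatiron Building, 175, 5th Avenue', 40.7410861, -73.9896298)
print(json.dumps({'location': loc}, default=encode_location))
```

With the real class you would test `isinstance(obj, geopy.location.Location)` instead.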
Location is indeed not JSON serializable: there are many properties in this object and there is no single way to represent a location, so you'd have to choose one yourself.
What type of value do you expect to see in the location key of the response?
Here are some examples:
Textual address
In [9]: json.dumps({'location': geolocator.geocode("175 5th Avenue NYC").address})
Out[9]: '{"location": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America"}'
Point coordinates
In [10]: json.dumps({'location': list(geolocator.geocode("175 5th Avenue NYC").point)})
Out[10]: '{"location": [40.7410861, -73.9896298241625, 0.0]}'
Raw Nominatim response
(That's probably not what you want to expose in your API, assuming you want to preserve an ability to change geocoding service to another one in future, which might have a different raw response schema).
In [11]: json.dumps({'location': geolocator.geocode("175 5th Avenue NYC").raw})
Out[11]: '{"location": {"place_id": 138642704, "licence": "Data \\u00a9 OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright", "osm_type": "way", "osm_id": 264768896, "boundingbox": ["40.7407597", "40.7413004", "-73.9898715", "-73.9895014"], "lat": "40.7410861", "lon": "-73.9896298241625", "display_name": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America", "class": "tourism", "type": "attraction", "importance": 0.74059885426854, "icon": "https://nominatim.openstreetmap.org/images/mapicons/poi_point_of_interest.p.20.png"}}'
Textual address + point coordinates
In [12]: location = geolocator.geocode("175 5th Avenue NYC")
...: json.dumps({'location': {
...: 'address': location.address,
...: 'point': list(location.point),
...: }})
Out[12]: '{"location": {"address": "Flatiron Building, 175, 5th Avenue, Flatiron District, Manhattan Community Board 5, Manhattan, New York County, New York, 10010, United States of America", "point": [40.7410861, -73.9896298241625, 0.0]}}'

Is there a way to iterate over a list to add into the selenium code?

I am trying to iterate over a large list of dealership names and cities. I want to have it refer back to the list and loop over each entry and get the results separately.
#this is only a portion of the dealers; the rest are in a file
Dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord', 'Advantage Audi' ]
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
driver.find_element_by_xpath("""//*[@id="findTypeaheadInput"]""").send_keys("Mossy Ford")
driver.find_element_by_xpath("""//*[@id="nearTypeaheadInput"]""").send_keys("San Diego, CA")
driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
#contact_names= driver.find_elements_by_xpath('/html/body/div[1]/div/div/div/div[2]/div/div[5]/div/div[1]/div[1]/div/div/ul[1]')
#print(contact_names)
#print("Query Link: ", driver.current_url)
#driver.quit()
from selenium import webdriver
dealers= ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord']
cities = ['San Diego, CA', 'Rio Vista, CA', 'Concord, CA']
driver=webdriver.Chrome("C:\\Users\\kevin\\Anaconda3\\chromedriver.exe")
driver.set_page_load_timeout(30)
driver.get("https://www.bbb.org/")
driver.maximize_window()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/div[2]/button").click()
for d in dealers:
    driver.find_element_by_xpath("""//*[@id="findTypeaheadInput"]""").send_keys("dealers")
    for c in cities:
        driver.find_element_by_xpath("""//*[@id="nearTypeaheadInput"]""").send_keys("cities")
driver.find_element_by_xpath("""/html/body/div[1]/div/div/div/div[2]/div[1]/div/div[2]/div/form/div[2]/button""").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath("/html/body/div[1]/div/div/div/div[2]/div/div[2]/div[2]/div[1]/div[6]/div").click()
driver.implicitly_wait(10)
driver.find_element_by_xpath('/html/body/div[1]/div/div/div/div[2]/div[2]/div/div/div[1]/div/div[2]/div/div[2]/a').click()
contact_names= driver.find_elements_by_class_name('styles__UlUnstyled-sc-1fixvua-1 ehMHcp')
print(contact_names)
print("Query Link: ", driver.current_url)
driver.quit()
I want to be able to go to each of these different dealerships pages and pull all of their details then loop thru the rest. I am just struggling with the ideas of for loops within selenium.
It's better to create a dictionary mapping each dealer to its city and loop through it:
Dealers_Cities_Dict = {
    "Mossy Ford": "San Diego, CA",
    "Abel Chevrolet Pontiac Buick": "City",
    "Acura of Concord": "City",
    "Advantage Audi": "City"
}
for dealer, city in Dealers_Cities_Dict.items():
    # This is where the code sits
    driver.find_element_by_id("findTypeaheadInput").send_keys(dealer)
    driver.find_element_by_id("nearTypeaheadInput").send_keys(city)
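If the data already lives in two parallel lists, as in the question, zip pairs them up positionally and builds the same mapping (list values taken from the question):

```python
# Parallel lists from the question; zip pairs entries by position
dealers = ['Mossy Ford', 'Abel Chevrolet Pontiac Buick', 'Acura of Concord']
cities = ['San Diego, CA', 'Rio Vista, CA', 'Concord, CA']

dealers_cities = dict(zip(dealers, cities))

for dealer, city in dealers_cities.items():
    # In the real script these values would feed send_keys(dealer) / send_keys(city)
    print(dealer, '->', city)
```

Note that between iterations you would also need to return to the search form (e.g. driver.get("https://www.bbb.org/") again) and, if reusing the same inputs, clear them with .clear() before sending the next keys; otherwise each send_keys appends to the previous text.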

Resources