I am trying to scrape formularylookup.com, a site with information on the market for pharmaceuticals.
It requires a login:
username: -
password: -
I need the information for the medicine called Rybelsus.
Looking at Inspect -> Network -> XHR, I suspect there is an easy way to get the required data from this page:
https://formularylookup.com/Formulary/Coverage?ProductId=237171&ProductName=Rybelsus&ChannelId=1&DrugTypeId=3&StateId=all&Options=SummaryCoverages
I found this site, which might give an idea of how to connect to formularylookup.com, but I am very inexperienced with connecting to APIs.
Here's my code:
import requests
from bs4 import BeautifulSoup
url ="https://api.mmitnetwork.com/Formulary/v1/Products?Name=rybelsus"
params = {
"ProductId":"237171",
"productSearch":"Rybelsus"}
headers = {
"authorization":"Bearer H-oa4ULGls2Cpu8U6hX4myixRoFIPxfj",
"Access-Token":"H-oa4ULGls2Cpu8U6hX4myixRoFIPxfj",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36",
"X-Requested-With": "XMLHttpRequest",
"Host": "formularylookup.com",
"X-NewRelic-ID": "XAYCVFZSGwcGU1lXBAI="
}
res = requests.get(url ,params=params ,headers = headers)
soup = BeautifulSoup(res.content, "lxml")
print(soup.prettify())
Which gives me the following response:
<!DOCTYPE html>
<html>
<head>
<title>
The resource cannot be found.
</title>
<meta content="width=device-width" name="viewport"/>
<style>
body {font-family:"Verdana";font-weight:normal;font-size: .7em;color:black;}
p {font-family:"Verdana";font-weight:normal;color:black;margin-top: -5px}
b {font-family:"Verdana";font-weight:bold;color:black;margin-top: -5px}
H1 { font-family:"Verdana";font-weight:normal;font-size:18pt;color:red }
H2 { font-family:"Verdana";font-weight:normal;font-size:14pt;color:maroon }
pre {font-family:"Consolas","Lucida Console",Monospace;font-size:11pt;margin:0;padding:0.5em;line-height:14pt}
.marker {font-weight: bold; color: black;text-decoration: none;}
.version {color: gray;}
.error {margin-bottom: 10px;}
.expandable { text-decoration:underline; font-weight:bold; color:navy; cursor:hand; }
@media screen and (max-width: 639px) {
pre { width: 440px; overflow: auto; white-space: pre-wrap; word-wrap: break-word; }
}
@media screen and (max-width: 479px) {
pre { width: 280px; }
}
</style>
</head>
<body bgcolor="white">
<span>
<h1>
Server Error in '/' Application.
<hr color="silver" size="1" width="100%"/>
</h1>
<h2>
<i>
The resource cannot be found.
</i>
</h2>
</span>
<font face="Arial, Helvetica, Geneva, SunSans-Regular, sans-serif ">
<b>
Description:
</b>
HTTP 404. The resource you are looking for (or one of its dependencies) could have been removed, had its name changed, or is temporarily unavailable. Please review the following URL and make sure that it is spelled correctly.
<br/>
<br/>
<b>
Requested URL:
</b>
/Formulary/v1/Products
<br/>
<br/>
<hr color="silver" size="1" width="100%"/>
<b>
Version Information:
</b>
Microsoft .NET Framework Version:4.0.30319; ASP.NET Version:4.6.1590.0
</font>
</body>
</html>
<!--
[HttpException]: The controller for path '/Formulary/v1/Products' was not found or does not implement IController.
at System.Web.Mvc.DefaultControllerFactory.GetControllerInstance(RequestContext requestContext, Type controllerType)
at System.Web.Mvc.DefaultControllerFactory.CreateController(RequestContext requestContext, String controllerName)
at System.Web.Mvc.MvcHandler.ProcessRequestInit(HttpContextBase httpContext, IController& controller, IControllerFactory& factory)
at System.Web.Mvc.MvcHandler.BeginProcessRequest(HttpContextBase httpContext, AsyncCallback callback, Object state)
at System.Web.HttpApplication.CallHandlerExecutionStep.System.Web.HttpApplication.IExecutionStep.Execute()
at System.Web.HttpApplication.ExecuteStep(IExecutionStep step, Boolean& completedSynchronously)
-->
<!--
This error page might contain sensitive information because ASP.NET is configured to show verbose error messages using <customErrors mode="Off"/>. Consider using <customErrors mode="On"/> or <customErrors mode="RemoteOnly"/> in production environments.-->
Update: I get a 404 error and I am not sure why.
The code below should help:
import requests
headers = {
'Accept': '*/*',
'X-Requested-With': 'XMLHttpRequest',
'Access-Token': '7Lq-KkDx2fCO_3kG90pLEpBS9Ssh62IQ',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.87 Safari/537.36',
'Is-Session-Expired': 'false',
'Referer': 'https://formularylookup.com/',
}
response = requests.get('https://formularylookup.com/Formulary/Coverage?ProductId=237171&ProductName=Rybelsus&ChannelId=1&DrugTypeId=3&StateId=AL&Options=SummaryCoverages', headers=headers)
print(response.json())
Note: the 'Is-Session-Expired': 'false' header is essential; without it you will get a 404 error.
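As a small illustration of the answer above, the query string and headers can be built from named values rather than pasted as one long URL. This is only a sketch: `build_coverage_request` is a hypothetical helper, the token is a placeholder you would copy from your own logged-in session, and the parameter values are the ones from the URL in the answer.

```python
from urllib.parse import urlencode

BASE_URL = "https://formularylookup.com/Formulary/Coverage"


def build_coverage_request(token, product_id, product_name, state_id="AL"):
    """Return (url, headers) for the coverage endpoint used in the answer."""
    query = {
        "ProductId": product_id,
        "ProductName": product_name,
        "ChannelId": 1,
        "DrugTypeId": 3,
        "StateId": state_id,
        "Options": "SummaryCoverages",
    }
    headers = {
        "Accept": "*/*",
        "X-Requested-With": "XMLHttpRequest",
        "Access-Token": token,  # placeholder: use the token from your own session
        "Is-Session-Expired": "false",  # the header the answer calls essential
        "Referer": "https://formularylookup.com/",
    }
    return f"{BASE_URL}?{urlencode(query)}", headers


url, headers = build_coverage_request("YOUR-ACCESS-TOKEN", 237171, "Rybelsus")
print(url)
```

The resulting pair can be passed straight to `requests.get(url, headers=headers)`; changing `state_id` is then a one-argument change.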
import requests
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-GB,en-US;q=0.9,en;q=0.8",
"Dnt": "1",
"Host": "httpbin.org",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0",
}
url="https://app.shkolo.bg/dashboard"
res = requests.get(url,headers=headers)
print(res)
This throws a 403 response.
Any idea why?
I've only just started using the requests module, so I cannot provide much more information.
By removing every header except the User-Agent, I managed to get a reply:
import requests
url = "https://app.shkolo.bg/dashboard"
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:83.0) Gecko/20100101 Firefox/83.0'}
result = requests.get(url, headers=headers)
print(result.content.decode())
# Begin of the output
<!DOCTYPE HTML>
<html lang="en-US">
<head>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
Note that this answer is very similar to the one in this post.
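One plausible reason the trimmed request succeeds (an assumption, not something confirmed in the thread): the original headers carried `Host: httpbin.org`, which does not match app.shkolo.bg. requests fills in the Host header from the URL on its own, so a copied Host value from another site can only do harm. A minimal sketch of a guard against that; `strip_mismatched_host` is a hypothetical helper name:

```python
from urllib.parse import urlsplit


def strip_mismatched_host(url, headers):
    """Return a copy of headers without a Host entry that disagrees with url."""
    target = urlsplit(url).netloc.lower()
    cleaned = {}
    for name, value in headers.items():
        # Drop a Host header that names a different site than the URL
        if name.lower() == "host" and value.lower() != target:
            continue
        cleaned[name] = value
    return cleaned


copied = {"Host": "httpbin.org", "User-Agent": "Mozilla/5.0"}
safe = strip_mismatched_host("https://app.shkolo.bg/dashboard", copied)
print(safe)
```

The cleaned dictionary can then be passed as `headers=` to `requests.get`.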
The website "https://www.interlinecenter.com/" makes a request to "http://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y" to load HTML content into an iframe. I am making the exact same request with the same headers the browser sends, but I am not getting the same content.
Here is the code I am using:
import requests
from lxml import html  # assumed source of html.fromstring used below

url='http://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y'
header = {
'Host': 'cs.cruisebase.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:70.0) Gecko/20100101 Firefox/70.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Language': 'en-US,en;q=0.5',
'Accept-Encoding': 'gzip, deflate, br',
'Referer': 'https://www.interlinecenter.com/',
'Connection': 'keep-alive',
'Cookie': 'visid_incap_312345=yt2dprI6SuGoy44xQsnF36dOwV0AAAAAQUIPAAAAAAAqm0pG5WAWOGjtyY8GOrLv; __utma=15704100.1052110012.1572947038.1574192877.1575447075.6; __utmz=15704100.1575447075.6.6.utmcsr=interlinecenter.com|utmccn=(referral)|utmcmd=referral|utmcct=/; ASP.NET_SessionId=pzd3a0l5kso41hhbqf3jiqlg; nlbi_312345=/7dzbSeGvDjg2/oY/eQfhwAAAACv806Zf3m7TsjHAou/y177; incap_ses_1219_312345=tMxeGkIPugj4d1gaasLqECHE5l0AAAAAg1IvjaYhEfuSIYLXtc2f/w==; LastVisitedClient=605; AWSELB=85D5DF550634E967F245F317B00A8C32EB84DA2B6B927E6D5CCB7C26C3821788BFC50D95449A1BA0B0AFD152140A70F5EA06CBB8492B21E10EC083351D7EBC4C68F086862A; incap_ses_500_312345=6PJ9FxwJ3gh0vta6kVvwBthz510AAAAAvUZPdshu8GVWM2sbkoUXmg==; __utmb=15704100.2.10.1575447075; __utmc=15704100; __utmt_tt=1',
'Upgrade-Insecure-Requests': '1',
'Cache-Control': 'max-age=0'
}
response = requests.get(url, timeout=10, headers=header)
byte_data = response.content
source_code = html.fromstring(byte_data)
print(response)
print(byte_data)
This is the response I am getting:
<Response [200]>
<html style="height:100%"><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"><meta name="format-detection" content="telephone=no"><meta name="viewport" content="initial-scale=1.0"><meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1"></head><body style="margin:0px;height:100%"><iframe id="main-iframe" src="/_Incapsula_Resource?SWUDNSAI=9&xinfo=10-99927380-0%200NNN%20RT%281575456049298%202%29%20q%280%20-1%20-1%200%29%20r%281%20-1%29%20B12%284%2c316%2c0%29%20U2&incident_id=500000240101726326-477561257670738314&edet=12&cinfo=04000000&rpinfo=0" frameborder=0 width="100%" height="100%" marginheight="0px" marginwidth="0px">Request unsuccessful. Incapsula incident ID: 500000240101726326-477561257670738314</iframe></body></html>
I need to scrape the data at "https://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y".
Note: I don't want to use the Selenium WebDriver to get the data. Any help will be much appreciated, thanks!
Did you try getting the headers by loading the target URL directly?
I sent a GET request to https://cs.cruisebase.com/cs/forms/hotdeals.aspx?skin=605&nc=y with the following headers, and I was able to get the complete response.
headers = {
'Accept':'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
'Accept-Encoding':'gzip, deflate',
'Accept-Language':'en-GB,en;q=0.9,en-US;q=0.8,hi;q=0.7,la;q=0.6',
'Cache-Control':'no-cache',
'Connection':'keep-alive',
'Cookie':'ENTER COOKIES',
'DNT':'1',
'Host':'cs.cruisebase.com',
'Pragma':'no-cache',
'Upgrade-Insecure-Requests':'1',
'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36',
}
I have left the Cookie field as a placeholder; you will have to enter your own cookies, otherwise the page won't load. You can copy them from Chrome.
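If you copy the Cookie header from Chrome as a single string, it can be turned into a dictionary and handed to requests through its `cookies=` argument instead of being pasted into the headers. A minimal sketch, where `parse_cookie_header` is a hypothetical helper and the cookie values are made up for illustration:

```python
def parse_cookie_header(raw):
    """Split a raw 'name=value; name2=value2' Cookie header into a dict."""
    cookies = {}
    for pair in raw.split(";"):
        # Split on the first '=' only, because values may contain '=' themselves
        name, sep, value = pair.strip().partition("=")
        if sep and name:
            cookies[name] = value
    return cookies


raw = "ASP.NET_SessionId=abc123; LastVisitedClient=605; token=xy=="
cookies = parse_cookie_header(raw)
print(cookies)
```

Usage would look like `requests.get(url, headers=headers, cookies=cookies)`. Session cookies are usually short-lived, so expect to refresh them from the browser from time to time.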
My Python code works when run on localhost, but it doesn't work on AWS EC2. Why?
My code is simple: it sends an HTTPS request with the POST method. It works on localhost, but fails on AWS EC2.
I tried a different HTTPS request and it works, so the network is OK.
The code and the received content are below:
#!/usr/bin/python
# -*- coding:utf-8 -*-
import random
import requests
url = 'https://www.haidilao.com/eportal/ui?moduleId=5&pageId=9c8cf76c4ca84fc686ca11aaa936f5c7&struts.portlet.action=/portlet/map-portlet!getMapDotDataByRegion.action&random='+str(random.random())
header = {
'host':'www.haidilao.com',
'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 Safari/537.36',
'content-type': 'application/x-www-form-urlencoded; charset=UTF-8',
'accept': 'text/plain, */*; q=0.01',
'accept-encoding': 'gzip, deflate, br',
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8'
}
data = {
'queryContent': 'CA',
'country':'6bf4038f63234217aecf93668a63f04b',
'myLat':'',
'myLng':''
}
def test():
    # requests form-encodes a dict passed as `data`, so no manual urlencode is needed
    req = requests.post(url, headers=header, data=data, verify=False)
    print(req.status_code)
    print(req.headers)
    print(req.text)

test()
Content-Type: text/html
Content-Length: 60416
Connection: close
Date: Sat, 10 Aug 2019 13:02:31 GMT
Server: nginx
Last-Modified: Fri, 09 Aug 2019 14:05:06 GMT
ETag: "5d4d7d92-ec00"
Accept-Ranges: bytes
Vary: Accept-Encoding
X-Cache: Miss from cloudfront
Via: 1.1 46dd9ae2d97161deaefbdceeae5f57ac.cloudfront.net (CloudFront)
X-Amz-Cf-Pop: SIN2-C1
X-Amz-Cf-Id: XNkaD2emKes3BpaY3ZVSGb1bxlnsHD1KZeHCZPXnOcspTaYXXjVzKA==
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<title>sorry,the channel has expired!</title>
<link rel="stylesheet" type="text/css" href="/eportal/uiFramework/css/tip.css">
</head>
<body>
<div class="easysite-error">
<img src="/eportal/uiFramework/images/picNews/tip.png" />
<font class="easysite-404"></font>
<font>sorry,the channel has expired!</font>
</div>
</body>
</html>
I'm trying to make an HTTPS proxy server, but I can't make a connection to any server.
Edit: this is the part of the code that runs after a client connects to my server: it takes the client's message and tries to forward it to the web. The message below is from a Firefox client trying to connect to Google:
import socket
import ssl

sock = socket.create_connection(('google.com', 443))
# Note: ssl.wrap_socket was removed in Python 3.12; newer code uses SSLContext.wrap_socket
ssl_sock = ssl.wrap_socket(sock)
fullData = b''
ssl_sock.send(b'CONNECT www.google.com:443 HTTP/1.1\r\nUser-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:67.0) Gecko/20100101 Firefox/67.0\r\nProxy-Connection: keep-alive\r\nConnection: keep-alive\r\nHost: www.google.com:443\r\n\r\n')
while True:
    # receive data from web server
    data = ssl_sock.recv(4026)
    print(data)
    if len(data) > 0:
        fullData += data
    else:
        break
# clientSock is the socket of the connected browser (defined elsewhere)
clientSock.send(fullData)
Google should return an OK message, but instead it returns an error:
HTTP/1.1 405 Method Not Allowed
Content-Type: text/html; charset=UTF-8
Referrer-Policy: no-referrer
Content-Length: 1592
Date: Fri, 24 May 2019 05:28:17 GMT
Alt-Svc: quic=":443"; ma=2592000; v="46,44,43,39"
Connection: close
<!DOCTYPE html>
<html lang=en>
<meta charset=utf-8>
<meta name=viewport content="initial-scale=1, minimum-scale=1, width=device-width">
<title>Error 405 (Method Not Allowed)!!1</title>
<style>
*{margin:0;padding:0}html,code{font:15px/22px arial,sans-serif}html{background:#fff;color:#222;padding:15px}body{margin:7% auto 0;max-width:390px;min-height:180px;padding:30px 0 15px}* > body{background:url(//www.google.com/images/errors/robot.png) 100% 5px no-repeat;padding-right:205px}p{margin:11px 0 22px;overflow:hidden}ins{color:#777;text-decoration:none}a img{border:0}@media screen and (max-width:772px){body{background:none;margin-top:0;max-width:none;padding-right:0}}#logo{background:url(//www.google.com/images/branding/googlelogo/1x/googlelogo_color_150x54dp.png) no-repeat;margin-left:-5px}@media only screen and (min-resolution:192dpi){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat 0% 0%/100% 100%;-moz-border-image:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) 0}}@media only screen and (-webkit-min-device-pixel-ratio:2){#logo{background:url(//www.google.com/images/branding/googlelogo/2x/googlelogo_color_150x54dp.png) no-repeat;-webkit-background-size:100% 100%}}#logo{display:inline-block;height:54px;width:150px}
</style>
<a href=//www.google.com/><span id=logo aria-label=Google></span></a>
<p><b>405.</b> <ins>That\xe2\x80\x99s an error.</ins>
<p>The request method <code>CONNECT</code> is inappropriate for the URL <code>/</code>. <ins>That\xe2\x80\x99s all we know.</ins>
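A possible explanation, not taken from the thread: CONNECT is addressed to a proxy, not to the origin server. The browser sends it to the proxy, and the proxy is expected to open a plain TCP connection to the named host and port, answer the browser with a 200 response, and then relay raw bytes in both directions while the browser performs TLS through the tunnel. Forwarding the CONNECT line to Google over TLS makes Google treat it as an ordinary request, which is consistent with the 405 above. A minimal sketch under those assumptions; `parse_connect_target` and `tunnel` are hypothetical helper names, and `client_sock` stands for the already accepted browser socket:

```python
import select
import socket


def parse_connect_target(request: bytes):
    """Extract (host, port) from the first line of a CONNECT request."""
    request_line = request.split(b"\r\n", 1)[0].decode("ascii")
    method, target, _version = request_line.split(" ")
    if method != "CONNECT":
        raise ValueError(f"expected CONNECT, got {method}")
    host, _, port = target.rpartition(":")
    return host, int(port)


def tunnel(client_sock, request: bytes):
    """Open a raw TCP tunnel for a CONNECT request and relay bytes both ways."""
    host, port = parse_connect_target(request)
    upstream = socket.create_connection((host, port))  # plain TCP, no TLS here
    # Tell the browser the tunnel is ready; it then performs TLS itself.
    client_sock.sendall(b"HTTP/1.1 200 Connection established\r\n\r\n")
    sockets = [client_sock, upstream]
    try:
        while True:
            readable, _, _ = select.select(sockets, [], [])
            for src in readable:
                data = src.recv(4096)
                if not data:
                    return  # one side closed the connection
                dst = upstream if src is client_sock else client_sock
                dst.sendall(data)
    finally:
        upstream.close()


request = b"CONNECT www.google.com:443 HTTP/1.1\r\nHost: www.google.com:443\r\n\r\n"
print(parse_connect_target(request))
```

In this design the proxy never sees the decrypted traffic; it only copies bytes between the two sockets.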
I am trying hard to submit this form, with no success.
The form is supposed to redirect and return a new URL with the PDF.
Here is how to access the page in question:
1. Start with Search Page
2. Click on the Document Type tab
3. Enter LP, click Search
4. Click View
5. Click Get Image
The View PDF button is the one that I'm interested in.
I need to mimic multipart form data, which looks like this:
<form name="courtform" action="http://oris.co.palm-beach.fl.us:8080/PdfServlet/PdfServlet27" method="post" enctype="multipart/form-data">
<input type="hidden" name="hostURL" value="http://oris.co.palm-beach.fl.us/or_web1/" size="60">
<input type="hidden" name="pdfPath" value="\\wcp01zfs-03.clerk.local\files2\ORISPDF\" size="60">
<input type="hidden" name="pdfURL" value="http://oris.co.palm-beach.fl.us/pdf/" size="60">
<input type="hidden" name="pages" value="1" size="60">
<!--<input type="hidden" name="pages" value="1" size="60">-->
<input type="hidden" name="id" value="22590889" size="60">
<input type="hidden" name="mpages" value="1" size="60">
<input type="hidden" name="doc_id" value="22590889" size="60">
<input type="hidden" name="page1" value="image_from_file.asp?imageurl=\\ors_fs\ORImage\O\30336\O.30336.1200.0001.tif" size="60">
<input type="hidden" name="WaterMarkText" value="1" size="60">
<input name="button" type="button" value="View PDF" onclick="javascript:ValidateAndSubmit(this.form)">
Here is part of my Scrapy code responsible for this request:
def get_image(self, response):
    # inspect_response(response, self)
    url = 'http://oris.co.palm-beach.fl.us:8080/PdfServlet/PdfServlet27'
    headers = {
        'Connection': 'keep-alive',
        'origin': "http://oris.co.palm-beach.fl.us",
        'upgrade-insecure-requests': "1",
        'dnt': "1",
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        'cache-control': "max-age=0",
        'Accept-Encoding': 'gzip,deflate',
    }
    id = response.xpath("//input[@name='doc_id']/@value").extract_first()
    body = {
        'WaterMarkText': '0',
        'hostURL': 'http://oris.co.palm-beach.fl.us/or_web1/',
        'mpages': '1',
        'page1': 'image_from_file.asp?imageurl=\\ors_fs\\ORImage\\O\\30338\\O.30338.0268.0001.tif',
        'pages': '1',
        'pdfPath': '\\wcp01zfs-03.clerk.local\\files2\\ORISPDF\\',
        'pdfURL': 'http://oris.co.palm-beach.fl.us/pdf/',
    }
    body['doc_id'] = id
    body['id'] = id
    me = MultipartEncoder(fields=body, boundary='------WebKitFormBoundarygGHlhpHs08goICxO')
    me_body = me.to_string()
    headers['Content-Type'] = me.content_type
    headers['Content-Length'] = me.len
    yield scrapy.Request(url, method='POST', body=me_body, callback=self.get_pdf, headers=headers)
    yield {'body': me_body}

def get_pdf(self, response):
    inspect_response(response, self)
Whenever I run the code I get a 400 response.
How do I mimic this form correctly?
UPDATE:
It appears I do not need to provide Content-Length manually.
After I removed it, the request worked just once and then reverted to a 404 error.
Is the boundary supposed to be new for every request? From what I have read, it does not need to be, since it is just a divider with no other purpose.
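On the boundary question: in multipart/form-data the boundary is only a delimiter announced in the Content-Type header. It does not have to change between requests, but it must not occur inside any field value. To make the wire format concrete, here is a standard-library sketch that encodes plain text fields by hand; `encode_multipart` is a hypothetical helper, not part of Scrapy or requests_toolbelt, and it generates a fresh random boundary simply because that is the easiest way to avoid collisions with the data.

```python
import uuid


def encode_multipart(fields):
    """Encode a dict of text fields as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = "----FormBoundary" + uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")  # each part starts with "--" + boundary
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")  # blank line separates part headers from the value
        lines.append(str(value))
    lines.append(f"--{boundary}--")  # closing delimiter has a trailing "--"
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


body, content_type = encode_multipart({"pages": "1", "doc_id": "22590889"})
print(content_type)
print(body.decode("utf-8"))
```

Printing the body like this is also a quick way to compare your request with the payload shown in the browser's network tab.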
I had to automate the entire process of filling the form and now it seems to work just fine.
def get_image(self, response):
    # inspect_response(response, self)
    item = response.meta['item']
    url = 'http://oris.co.palm-beach.fl.us:8080/PdfServlet/PdfServlet27'
    headers = {
        'Connection': 'keep-alive',
        'origin': "http://oris.co.palm-beach.fl.us",
        'upgrade-insecure-requests': "1",
        'dnt': "1",
        'user-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
        'accept': "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3",
        'cache-control': "max-age=0",
        'Accept-Encoding': 'gzip,deflate',
    }
    body = {}
    # Generate body from form
    for i in response.xpath("//form[@name='courtform']/input"):
        name = i.xpath(".//@name").extract_first()
        val = i.xpath(".//@value").extract_first()
        body[name] = val
    # Remove watermark from PDF
    body['WaterMarkText'] = '0'
    me = MultipartEncoder(fields=body, boundary='----WebKitFormBoundarygGHghpHs08goICxO')
    me_body = me.to_string()
    headers['Content-Type'] = me.content_type
    yield scrapy.Request(url, method='POST', body=me_body, callback=self.get_pdf, headers=headers, meta={'item': item})
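The idea of generating the body from the form itself can also be tried outside Scrapy. The sketch below uses only the standard library's `html.parser`; `FormFieldCollector` is a hypothetical class name, and unlike the XPath loop above it collects hidden inputs only, so the "View PDF" button is skipped. The sample HTML is a shortened, made-up version of the form in the question.

```python
from html.parser import HTMLParser


class FormFieldCollector(HTMLParser):
    """Collect name/value pairs of hidden inputs inside one named form."""

    def __init__(self, form_name):
        super().__init__()
        self.form_name = form_name
        self.in_form = False
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form" and attrs.get("name") == self.form_name:
            self.in_form = True
        elif tag == "input" and self.in_form and attrs.get("type") == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


sample = (
    '<form name="courtform" method="post">'
    '<input type="hidden" name="pages" value="1">'
    '<!--<input type="hidden" name="pages" value="2">-->'
    '<input type="hidden" name="doc_id" value="22590889">'
    '<input name="button" type="button" value="View PDF">'
    "</form>"
)
collector = FormFieldCollector("courtform")
collector.feed(sample)
fields = collector.fields
fields["WaterMarkText"] = "0"  # same override as in the answer
print(fields)
```

Inputs inside HTML comments are ignored by the parser, which matches what a browser would submit.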