How can I unshorten a URL in python 3 - python-3.x

import http.client
import urllib.parse

def unshorten_url(url):
    parsed = urllib.parse.urlparse(url)
    h = http.client.HTTPConnection(parsed.netloc)
    resource = parsed.path
    if parsed.query != "":
        resource += "?" + parsed.query
    h.request('HEAD', resource)
    response = h.getresponse()
    # Use integer division: in Python 3, response.status / 100 is a float (e.g. 3.01) and never equals 3
    if response.status // 100 == 3 and response.getheader('Location'):
        return unshorten_url(response.getheader('Location'))  # changed to process chains of short URLs
    else:
        return url

unshorten_url("http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34")
Input:
http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34  # yes, the same URL is returned
Output URL I need after unshortening: https://ec.europa.eu/esco/portal/occupation?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Foccupation%2F00030d09-2b3a-4efd-87cc-c4ea39d27c34&conceptLanguage=en&full=true#&uri=http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34

As you can see, I have two URLs: the short URL, which is my input, and the full URL. To achieve the required output URL, I identified a pattern from a set of URLs of the same kind, wrote the following code, and got the required output.
my_url = "http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34"
a="https://ec.europa.eu/esco/portal/occupationuri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Foccupation%2F"
b = my_url.split("/")[-1]
URL = a+ b+ "&conceptLanguage=en&full=true#&uri=" + my_url
The output, i.e. the required full URL, is URL:
URL = "https://ec.europa.eu/esco/portal/occupation?uri=http%3A%2F%2Fdata.europa.eu%2Fesco%2Foccupation%2F00030d09-2b3a-4efd-87cc-c4ea39d27c34&conceptLanguage=en&full=true#&uri=http://data.europa.eu/esco/occupation/00030d09-2b3a-4efd-87cc-c4ea39d27c34"

Related

Validating if user input is already sanitised

I've been working on a mini project with Python 3 and tkinter recently that is used to sanitise URLs and IP addresses. I've hit a roadblock with my function that I cannot work out. What I am trying to achieve is:
Has a user entered a URL such as http://www.google.com or https://www.google.com and if so, sanitise as:
hxxp[:]//www[.]google[.]com or hxxps[:]//www[.]google[.]com
Has a user entered an IP address such as 192.168.1.1 or http://192.168.1.1 and sanitise as:
192[.]168[.]1[.]1 or hxxp[:]//192[.]168[.]1[.]1
Has a user entered already sanitised input? Is there unsanitised input along with it? If so, just sanitise the unsanitised input and print them to the results output Textbox.
I have included a screenshot of what is currently happening to my normal input, after input is sanitised and how I want to handle the above issues.
Also: Is the .strip() in the OutputTextbox.insert line redundant?
I appreciate any help and recommendations!
def printOut():
    outputTextbox.delete("1.0", "end")
    url = inputTextbox.get("1.0", "end-1c")
    if len(url) == 0:
        tk.messagebox.showerror("Error", "Please enter content to sanitise")
    if "hxxp" and "[:]" and "[.]" in url or "hxxps" and "[:]" and "[.]" in url:
        outputTextbox.insert("1.0", url, "\n".strip())
        pass
    elif "http" and ":" and "." in url:
        url = url.replace("http", "hxxp")
        url = url.replace(":", "[:]")
        url = url.replace(".", "[.]")
        outputTextbox.insert("1.0", url, "\n".strip())
    elif "https" and ":" and "." in url:
        url = url.replace("https", "hxxps")
        url = url.replace(":", "[:]")
        url = url.replace(".", "[.]")
        outputTextbox.insert("1.0", url, "\n".strip())
    elif "http" and ":" and range(0, 10) and "." in url or range(0, 10) and "." in url:
        url = url.replace("http", "hxxp")
        url = url.replace(".", "[.]")
        outputTextbox.insert("1.0", url, "\n".strip())
An expression like A and B and C in url evaluates as A and B and (C in url), not as a check that A, B, and C are all found in url.
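A quick illustration of the pitfall (a standalone sketch, not part of the original answer):

url = "http://www.google.com"

# Non-empty string literals are truthy, so only the last operand is tested against url
print("hxxp" and "[:]" and "[.]" in url)           # False -- effectively just "[.]" in url
# To test that every substring is present, use all() with a generator expression
print(all(s in url for s in ("http", ":", ".")))   # True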
You can use re (regex module) to achieve what you want:
import re

def printOut():
    outputTextbox.delete("1.0", "end")
    url = inputTextbox.get("1.0", "end-1c")
    if len(url) == 0:
        messagebox.showerror("Error", "Please enter content to sanitise")
    result = url.replace("http", "hxxp")
    result = re.sub(r"([^\[]):([^\]])", r"\1[:]\2", result)
    result = re.sub(r"([^\[])\.([^\]])", r"\1[.]\2", result)
    outputTextbox.insert("end", result, "\n")
There may be a better regex for this.
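For a quick sanity check of the substitutions outside tkinter (illustrative only, not part of the original answer):

import re

url = "https://www.google.com"
result = url.replace("http", "hxxp")
result = re.sub(r"([^\[]):([^\]])", r"\1[:]\2", result)
result = re.sub(r"([^\[])\.([^\]])", r"\1[.]\2", result)
print(result)  # hxxps[:]//www[.]google[.]com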

python requests.get() does not refresh the page

I have a piece of Python 3 code that fetches a webpage every 10 seconds which gives back some JSON information:
import time
import requests

s = requests.Session()

while True:
    r = s.get(currenturl)
    data = r.json()
    datetime = data['Timestamp']['DateTime']
    value = data['PV']
    print(str(datetime) + ": " + str(value) + "W")
    time.sleep(10)
The output of this code is:
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
2020-10-13T13:26:53: 888W
As you can see, the DateTime does not change with every iteration. When I refresh the page manually in my browser it does get updated every time.
I have tried adding
Cache-Control: max-age=0
to the headers of my request but that does not resolve the issue.
Even when explicitly setting everything to None at the end of the loop, the same issue remains:
while True:
    r = s.get(currenturl, headers={'Cache-Control': 'no-cache'})
    data = r.json()
    datetime = data['Timestamp']['DateTime']
    value = data['PV']
    print(str(datetime) + ": " + str(value) + "W")
    time.sleep(10)
    counter += 1
    r = None
    data = None
    datetime = None
    value = None
How can I "force" a refresh of the page with requests.get()?
It turns out this particular website doesn't continuously refresh on its own, unless the request comes from its parent url.
r = s.get(currenturl, headers={'Referer' : 'https://originalurl.com/example'})
I had to include the original parent URL as referer. Now it works as expected:
2020-10-13T15:32:27: 889W
2020-10-13T15:32:37: 889W
2020-10-13T15:32:47: 884W
2020-10-13T15:32:57: 884W
2020-10-13T15:33:07: 894W
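If every request needs the same Referer, it can also be set once on the session so it is sent automatically. This is a small sketch along the lines of the answer above; the URLs are the placeholders used in the question, not real endpoints.

import requests

currenturl = 'https://example.com/data.json'  # placeholder for the real endpoint
s = requests.Session()
s.headers.update({'Referer': 'https://originalurl.com/example'})  # sent with every request on this session
r = s.get(currenturl)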

python download file into memory and handle broken links

I'm using the following code to download a file into memory :
import requests

if 'login_before_download' in obj.keys():
    with requests.Session() as session:
        session.post(obj['login_before_download'], verify=False)
        request = session.get(obj['download_link'], allow_redirects=True)
else:
    request = requests.get(obj['download_link'], allow_redirects=True)

print("downloaded {} into memory".format(obj[download_link_key]))
file_content = request.content
obj is a dict that contains the download_link and another key that indicates whether I need to log in to the page to create a cookie.
The problem with my code is that if the URL is broken and there isn't any file to download, I still get the HTML content of the page instead of identifying that the download failed.
Is there any way to identify that the file wasn't downloaded?
I found the following solution at this URL:
import requests

def is_downloadable(url):
    """
    Does the url contain a downloadable resource
    """
    h = requests.head(url, allow_redirects=True)
    header = h.headers
    content_type = header.get('content-type')
    if 'text' in content_type.lower():
        return False
    if 'html' in content_type.lower():
        return False
    return True

print(is_downloadable('https://www.youtube.com/watch?v=9bZkp7q19f0'))
# >> False
print(is_downloadable('http://google.com/favicon.ico'))
# >> True
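Another common check, sketched here as a complement to the answer above (not from the original answer; obj is the dict from the question): inspect the status code and the Content-Type before trusting request.content.

import requests

r = requests.get(obj['download_link'], allow_redirects=True)
r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
if 'html' in r.headers.get('Content-Type', '').lower():
    raise ValueError("Got an HTML page instead of a downloadable file")
file_content = r.content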

Merge a string to URL in python

I have to concatenate a hardcoded path (a string) to a URL so that the result is itself a URL.
url (which doesn't end with "/") + "/path/to/file/" = new_url
I tried concatenation using urljoin and also tried simple string concatenation, but the result is not a URL that can be reached (not that the URL address is invalid).
mirror_url = "http://amazonlinux.us-east-
2.amazonaws.com/2/core/latest/x86_64/mirror.list"
response = requests.get(mirror_url)
contents_in_url = response.content
## returns a URL as shown below but of string type which cannot be
##concatenated to another string type which could be requested as a valid
##URL.
'http://amazonlinux.us-east- 2.amazonaws.com/2/core/2.0/x86_64/8cf736cd3252ada92b21e91b8c2a324d05b12ad6ca293a14a6ab7a82326aec43'
path_to_add_to_url = "/repodata/primary.sqlite.gz"
final_url = contents_in_url + path_to_add_to_url
Desired result, without omitting any part of the path to that file:
final_url = "http://amazonlinux.us-west-2.amazonaws.com/2/core/2.0/x86_64/8cf736cd3252ada92b21e91b8c2a324d05b12ad6ca293a14a6ab7a82326aec43/repodata/primary.sqlite.gz"
You need to get the contents of the first response via response.text, not response.content:
import requests

mirror_url = "http://amazonlinux.us-east-2.amazonaws.com/2/core/latest/x86_64/mirror.list"
response = requests.get(mirror_url)
contents_in_url = response.text.strip()
path_to_add_to_url = "/repodata/primary.sqlite.gz"
response = requests.get(contents_in_url + path_to_add_to_url)

with open('primary.sqlite.gz', 'wb') as f_out:
    f_out.write(response.content)

print('Downloading done.')
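As an aside (my illustration, not part of the original answer), this also shows why urljoin on its own did not help here: a relative path that starts with "/" replaces the existing path rather than being appended to it, so plain concatenation is the right move.

from urllib.parse import urljoin

base = "http://amazonlinux.us-east-2.amazonaws.com/2/core/2.0/x86_64/8cf736cd3252ada92b21e91b8c2a324d05b12ad6ca293a14a6ab7a82326aec43"
print(urljoin(base, "/repodata/primary.sqlite.gz"))
# -> http://amazonlinux.us-east-2.amazonaws.com/repodata/primary.sqlite.gz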

Lambda/boto3/python loop

This code acts as an early-warning system for ADFS failures and works fine when run locally. The problem is that when I run it in Lambda, it loops non-stop.
In short:
lambda_handler() runs pagecheck()
pagecheck() produces the info needed then passes 2 lists (msgdet_list, error_list) and an int (error_count) to notification().
notification() collates and prints the output. The output is two key variables (notificationheader and notificationbody).
I've #commentedOut the SNS piece which would usually email the info, and am using print() to instead send the info to CloudWatch logs until I can get the loop sorted. Logs:
(CloudWatch logs screenshot)
If I run this locally, it produces a clean single output. In Lambda, the function will loop until it times out. It's almost like every time the lists are updated, they're passed to the notification() module and it's run. I can limit the function time, but would rather fix the code!
Cheers,
tac
# This Python/boto3/Lambda script sends a request to an Office 365 landing page, parses the returned details to
# confirm a successful redirect to the organisation's ADFS homepage, authenticates that the homepage is correct,
# raises any errors, and sends a consolidated report to an AWS SNS topic.
# Run once to produce pageserver and htmlchar values for the global variables.
# Import required modules
import boto3
import urllib.request
from urllib.request import Request, urlopen
from datetime import datetime
import time
import re
import sys
# Global variables to be set
url = "https://outlook.com/CONTOSSO.com"
adfslink = "https://sts.CONTOSSO.com/adfs/ls/?client-request-id="
# Input after first run
pageserver = "Microsoft-HTTPAPI/2.0 Microsoft-HTTPAPI/2.0"
htmlchar = 18600
# Input AWS SNS ARN
snsarn = 'arn:aws:sns:ap-southeast-2:XXXXXXXXXXXXX:Daily_Check_Notifications_CONTOSSO'
sns = boto3.client('sns')
def pagecheck():
    # Present the request to the webpage as if coming from a user in a browser
    user_agent = 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'
    values = {'name': 'user'}
    headers = {'User-Agent': user_agent}
    data = urllib.parse.urlencode(values)
    data = data.encode('ascii')

    # "Null" the Message Detail and Error lists
    msgdet_list = []
    error_list = []

    request = Request(url)
    req = urllib.request.Request(url, data, headers)
    response = urlopen(request)
    with urllib.request.urlopen(request) as response:
        # Get the URL. This gets the real URL.
        acturl = response.geturl()
        msgdet_list.append("\nThe Actual URL is:")
        msgdet_list.append(str(acturl))
        if adfslink not in acturl:
            error_list.append(str("Redirect Fail"))

        # Get the HTTP response code
        httpcode = response.code
        msgdet_list.append("\nThe HTTP code is: ")
        msgdet_list.append(str(httpcode))
        if httpcode // 200 != 1:
            error_list.append(str("No HTTP 2XX Code"))

        # Get the Headers as a dictionary-like object
        headers = response.info()
        msgdet_list.append("\nThe Headers are:")
        msgdet_list.append(str(headers))
        if response.info() == "":
            error_list.append(str("Header Error"))

        # Get the date of request and compare to UTC (DD MMM YYYY HH MM)
        date = response.info()['date']
        msgdet_list.append("The Date is: ")
        msgdet_list.append(str(date))
        returndate = str(date.split()[1:5])
        returndate = re.sub(r'[^\w\s]', '', returndate)
        returndate = returndate[:-2]
        currentdate = datetime.utcnow()
        currentdate = currentdate.strftime("%d %b %Y %H%M")
        if returndate != currentdate:
            date_error = ("Date Error. Returned Date: ", returndate, "Expected Date: ", currentdate, "Times in UTC (DD MMM YYYY HH MM)")
            date_error = str(date_error)
            date_error = re.sub(r'[^\w\s]', '', date_error)
            error_list.append(str(date_error))

        # Get the server
        headerserver = response.info()['server']
        msgdet_list.append("\nThe Server is: ")
        msgdet_list.append(str(headerserver))
        if pageserver not in headerserver:
            error_list.append(str("Server Error"))

        # Get all HTML data and confirm no major change to content size by character length (global var: htmlchar).
        html = response.read()
        htmllength = len(html)
        msgdet_list.append("\nHTML Length is: ")
        msgdet_list.append(str(htmllength))
        msgdet_list.append("\nThe Full HTML is: ")
        msgdet_list.append(str(html))
        msgdet_list.append("\n")
        if htmllength // htmlchar != 1:
            error_list.append(str("Page HTML Error - incorrect # of characters"))
        if adfslink not in str(acturl):
            error_list.append(str("ADFS Link Error"))

    error_list.append("\n")
    error_count = len(error_list)
    if error_count == 1:
        error_list.insert(0, 'No Errors Found.')
    elif error_count == 2:
        error_list.insert(0, 'Error Found:')
    else:
        error_list.insert(0, 'Multiple Errors Found:')

    # Pass completed results and data to the notification() module
    notification(msgdet_list, error_list, error_count)
# Use AWS SNS to create a notification email with the additional data generated
def notification(msgdet_list, error_list, errors):
    datacheck = str("\n".join(msgdet_list))
    errorcheck = str("\n".join(error_list))
    notificationbody = str(errorcheck + datacheck)
    if errors > 1:
        result = 'FAILED!'
    else:
        result = 'passed.'
    notificationheader = ('The daily ADFS check has been marked as ' + result + ' ' + str(errors) + ' ' + str(error_list))
    if result != 'passed.':
        # message = sns.publish(
        #     TopicArn = snsarn,
        #     Subject = notificationheader,
        #     Message = notificationbody
        # )
        # Output result to CloudWatch logstream
        print('Response: ' + notificationheader)
    else:
        print('passed')
    sys.exit()
# Trigger the Lambda handler
def lambda_handler(event, context):
    aws_account_ids = [context.invoked_function_arn.split(":")[4]]
    pagecheck()
    return "Successful"
    sys.exit()
Your CloudWatch logs contain the following error message:
Process exited before completing request
This is caused by invoking sys.exit() in your code. Locally your Python interpreter will just terminate when encountering such a sys.exit().
AWS Lambda, on the other hand, expects a Python function to simply return and treats sys.exit() as an error. Because your function was probably invoked asynchronously, AWS Lambda retries the execution twice.
To solve your problem, replace the occurrences of sys.exit() with return or, even better, just remove the sys.exit() calls, as there are already implicit returns in the places where you use sys.exit().
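A minimal sketch of the suggested fix, using a simplified stand-in for the original functions (the real code keeps all of its other logic unchanged):

# Simplified stand-in for notification(): return normally instead of calling sys.exit(),
# so Lambda sees a clean completion and does not retry the invocation.
def notification(result, notificationheader):
    if result != 'passed.':
        print('Response: ' + notificationheader)
    else:
        print('passed')
    # previously: sys.exit() -- removed so control returns to the caller

def lambda_handler(event, context):
    notification('passed.', 'The daily ADFS check has been marked as passed.')
    return "Successful"  # the sys.exit() that followed this return was unreachable anyway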
