Python 3: download a document from the web - python-3.x

I am new to Python 3 and I am trying to download a document after logging in to a website.
I have two URLs that let me log in and then instantly download the document:
https://www.xxxcompany.com/login.action?loginname=name&password=psw
https://www.xxxcompany.com/doc_download_all.action?ID=37887&edition=PD&Year=2018&Month=10&Day=5&&CLI=&transferNumber=&inOut=C&deviceType=A&minDuration=0&maxDuration=0&sortType=0&sortAsc=1&showAdv=0&viewtype=0&subPage=M&RMID=-1&updateRMID=&updateRecordID=&customField1=
Here is my code. It definitely does not work, and it does not print the status code. Did I misunderstand some concept? Please help me solve the problem. Thank you so much!
from lxml import html
import webbrowser
import requests

def login():
    with requests.session() as s:
        # fetch the login page
        s.get(url1)
        print(s.status_code)  # check whether the login succeeded
        s.get(url2)  # download the doc

You need to write the data to a file:
import requests

url = "http://www.xxxx.com/xxx/xxxx/sample.doc"

with requests.Session() as se:
    req = se.get(url)
    with open(url.split("/")[-1], "wb") as doc:
        doc.write(req.content)
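Two further notes on the original snippet: status_code is an attribute of the Response object returned by s.get(), not of the Session itself, and login() is never actually called, which is why nothing prints. A minimal sketch combining the login session with the file write, using the asker's two URLs from above as url1 and url2:

import requests

url1 = "https://www.xxxcompany.com/login.action?loginname=name&password=psw"
url2 = "https://www.xxxcompany.com/doc_download_all.action?ID=37887..."  # full query string as above

def login_and_download():
    with requests.Session() as s:
        resp = s.get(url1)       # log in; the session keeps any cookies it receives
        print(resp.status_code)  # status_code lives on the Response, not the Session
        doc = s.get(url2)        # download the doc over the same authenticated session
        with open("download.doc", "wb") as f:
            f.write(doc.content)

login_and_download()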

Related

Fetch HTML resources using urllib with authentication

There is a public HTML webpage where some content is available, and with urllib I am able to fetch its contents. But there is another version where, if I am logged in as a user, additional content is available. In order to fetch the logged-in page I use the solution provided on the portal, along with urllib basic auth:
import urllib.request

passman = urllib.request.HTTPPasswordMgrWithDefaultRealm()
passman.add_password(None, url, username, password)
authhandler = urllib.request.HTTPBasicAuthHandler(passman)
opener = urllib.request.build_opener(authhandler)
urllib.request.install_opener(opener)
res = urllib.request.urlopen(url)
res_body = res.read()
text = res_body.decode('utf-8')
When I check the contents, I do not see what is additionally available for a logged-in user. I would like to know how to correct this. I am using Python 3.
Thanks
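One hedged observation: HTTPBasicAuthHandler only helps if the server actually uses HTTP Basic authentication. Many sites use a form-based login instead, in which case you have to POST the form fields and carry the session cookie across requests. A minimal stdlib sketch, assuming hypothetical endpoint and form field names:

import http.cookiejar
import urllib.parse
import urllib.request

# hypothetical URLs and field names; inspect the site's login form for the real ones
login_url = "https://example.com/login"
page_url = "https://example.com/members"

cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))

form = urllib.parse.urlencode({"username": "user", "password": "pass"}).encode()
opener.open(login_url, data=form)        # log in; the session cookie lands in cj
res_body = opener.open(page_url).read()  # fetch the logged-in version of the page
print(res_body.decode("utf-8"))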

How to get the destination URL from a URL that redirects multiple times in Python?

I am trying to make a web scraper. I would like to get the destination URL from a query URL. But it redirects many times.
This is my URL:
https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO
The destination URL should be:
https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO
But I am getting https://www.jw.org/ins/library/videos/?item=pub-jwb_201812_16_VIDEO&appLanguage=INS as the redirected URL.
I tried this code:
import requests
url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'
s = requests.get(url)
print(s.url)
The redirect is made using JavaScript. It is not a server-side redirect, so requests does not follow it.
You can get the URL using Selenium
from selenium import webdriver
import time

browser = webdriver.Chrome()
url = 'https://data.jw-api.org/mediator/finder?lang=INS&item=pub-jwb_201812_16_VIDEO'
browser.get(url)
time.sleep(5)  # give the JavaScript redirect time to fire
print(browser.current_url)
browser.quit()
Outputs
https://www.jw.org/ins/library/videos/#ins/mediaitems/VODOrgLegal/pub-jwb_201812_16_VIDEO
If you are building a scraper, I would suggest checking out scrapy-splash (https://github.com/scrapy-plugins/scrapy-splash) or requests-html (https://github.com/psf/requests-html).
You can do this pretty easily using requests:
import requests
# this DOI link redirects to the page for a research paper with the given DOI code
destination = requests.get("http://doi.org/10.1080/07435800.2020.1713802")

# prints "https://www.tandfonline.com/doi/full/10.1080/07435800.2020.1713802",
# the final URL after requests follows the redirects from the initial doi.org link
print(destination.url)
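Note this works because doi.org issues ordinary server-side HTTP redirects, which requests follows by default; it would not resolve a JavaScript redirect like the jw.org one above. When the redirects are server-side, you can also inspect the whole chain through response.history:

import requests

resp = requests.get("http://doi.org/10.1080/07435800.2020.1713802")
for hop in resp.history:  # each intermediate redirect response, in order
    print(hop.status_code, hop.url)
print("final:", resp.url)  # the URL the chain ends on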

SharePoint and Python 3.7

I'm trying to access an internal SharePoint site and haven't been able to figure out how to do it. I've seen some examples and came up with some code, but I got a 401 error out of it. I obviously replaced USERNAME and PASSWORD with my actual credentials, yet I still got a 401. My goal is to access the SharePoint site and grab lists of data (which span multiple pages); right now I'm just trying to access the site and proceed from there.
import requests
from requests.auth import HTTPBasicAuth
from requests_ntlm import HttpNtlmAuth
response = requests.get("http://share-internal.division.com/teams/alpha/Components/Lists/Types/AllItems.aspx", auth=HttpNtlmAuth('DOMAIN\\USERNAME','PASSWORD'))
print (response.status_code)
I also tried a request with HTTP Basic auth, such as:
import requests
from requests.auth import HTTPBasicAuth
response = requests.get("http://share-internal.Server.com/teams/Beta/Components/Lists/Workt/AllItems.aspx", auth=HTTPBasicAuth(USERNAME, PASSWORD))
print (response.status_code)
To be transparent: I censored parts of the URLs to maintain privacy.
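A hedged diagnostic sketch: on-premises SharePoint commonly authenticates with NTLM, and the standard /_api/web REST endpoint is a lightweight way to verify credentials before scraping list pages. The site URL below is the asker's censored one; everything else is ordinary requests and requests_ntlm usage:

import requests
from requests_ntlm import HttpNtlmAuth

site = "http://share-internal.division.com/teams/alpha"

session = requests.Session()
session.auth = HttpNtlmAuth("DOMAIN\\USERNAME", "PASSWORD")  # NTLM expects DOMAIN\user

# /_api/web returns basic site metadata when authentication succeeds
response = session.get(site + "/_api/web",
                       headers={"Accept": "application/json;odata=verbose"})
print(response.status_code)  # 200 means the credentials and auth scheme are right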

Download files through GitLab API that has been manually uploaded

We have a self-hosted GitLab server and are working on automating our builds and releases. We have many old releases that were built before we started using GitLab CI, and some of them should be included in a release package for a certain piece of software. The releases are not located on any easily accessible server, so it would be very convenient if they could be served from our GitLab server.
It is possible to access tags through the API and to get artifacts from build jobs, but it doesn't seem possible to add build artifacts manually, so this approach won't work for old releases.
It is possible to upload files to the release notes of a tag. These are very simple to download through the web page, but I can't find any way of downloading them through the API. There is this API endpoint:
https://docs.gitlab.com/ee/api/projects.html#upload-a-file
but there is no "download-a-file".
Is there an easy way to upload files to our self-hosted GitLab and then to download them through the API?
All of our repositories have their visibility set to private. If you try to access a link like this one without being logged in:
http://www.example.com/group/my-project/uploads/443568a8641b1b48fc983daea27d36c0/myfile.zip
Then you get redirected to the login page.
Downloading files from the GitLab API is not possible. Here is GitLab's issue to track this feature:
https://gitlab.com/gitlab-org/gitlab/issues/25838
------
As a workaround for now, I am using a Python script to simply log in to the page and fetch the file. This requires a username and password, so I set up a new user that only has access to the specific project where it is needed; that should be safe enough to use in this project's build script.
import argparse
import sys
import mechanize

def login_and_download(url, user, password):
    br = mechanize.Browser()
    br.set_handle_robots(False)
    br.open(url)
    br.select_form(id='new_user')
    br['user[login]'] = user
    br['user[password]'] = password
    response = br.submit()
    filename = url.split('/')[-1]  # last part of the URL
    with open(filename, 'wb') as f:
        f.write(response.get_data())

def main():
    parser = argparse.ArgumentParser(description='Downloads files from a private git repo')
    parser.add_argument('url', help='The URL to download')
    parser.add_argument('user', help='The username to use')
    parser.add_argument('password', help='The password to use')
    args = parser.parse_args()
    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)
    login_and_download(args.url, args.user, args.password)

if __name__ == '__main__':
    main()
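Invocation from a build script would then look something like this (hypothetical script name and credentials, with the upload URL format shown earlier):

python download_upload.py "http://www.example.com/group/my-project/uploads/443568a8641b1b48fc983daea27d36c0/myfile.zip" buildbot SECRET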
Here is another example of a Python script, also from issue 25838. It translates this curl call:
curl --request GET 'http://127.0.0.1:3000/root/my-project/uploads/d3d527ac296ae1b636434f44c7f57272/IMG_3935.jpg' \
--header 'Authorization: Bearer YOUR_API_TOKEN'
import requests
from bs4 import BeautifulSoup

with requests.Session() as http_session:
    # my login page url
    login_url = [LOGIN_URL]
    login_page = http_session.get(login_url)
    login_inputs = BeautifulSoup(login_page.content, features='html.parser').find_all('input')
    # get the authenticity token from the login page
    for login_input in login_inputs:
        if 'token' in login_input['name']:
            auth_token = login_input['value']
            break
    # send some user-agent header just in case
    headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"}
    # see your login page source for the required data ...
    data = {
        "username": [MYUSERNAME],
        "password": [MYPASSWORD],
        "authenticity_token": auth_token,
        "remember_me": "1",
        "commit": "Sign in"
    }
    # ... as well as the login endpoint
    login_endpoint = [LOGIN_ENDPOINT_URL]
    # authenticate
    http_session.post(login_endpoint, data=data, headers=headers)
    # myfile_url is the url for my wanted file
    myfiledata = http_session.get(myfile_url).content
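As a small follow-up, the downloaded bytes still need to be written to disk; continuing inside the same session block, naming the file after the URL's last segment mirrors the mechanize script above:

    # still inside the "with requests.Session() as http_session:" block
    filename = myfile_url.split('/')[-1]
    with open(filename, 'wb') as f:
        f.write(myfiledata)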

Adding credentials (username and password) and fetching a URL in Python

How can I supply credentials (username and password) and then fetch the URL of that particular website in Python? I need help implementing this in my project.
Example: suppose the first page is the login page; when we provide the credentials and they are validated, we are redirected to the homepage, and we then want to pick up the URL of that homepage.
You can use the os package (note that HTTP_HOST and REQUEST_URI are CGI environment variables, so this only works server-side, where the web server sets them):
import os

def current_url():
    # HTTP_HOST and REQUEST_URI are set by the web server for each request
    url = os.environ['HTTP_HOST']
    uri = os.environ['REQUEST_URI']
    return url + uri
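For the client-side scenario the question actually describes (log in, follow the redirect, read the homepage URL), a hedged sketch with requests, using hypothetical endpoint and form field names:

import requests

# hypothetical login endpoint and field names; adjust them to the real site
login_url = "https://example.com/login"
credentials = {"username": "user", "password": "pass"}

with requests.Session() as s:
    resp = s.post(login_url, data=credentials)
    # after a successful login the server typically redirects to the homepage;
    # requests follows server-side redirects, and resp.url is the final URL
    print(resp.url)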
