Web scraping: get a CSV file from a URL address - python-3.x

I have a silly issue. I have an address that generates a CSV file immediately when I paste it into the browser, but I need to do it with Python code, so I tried something like this:
import urllib.request

url = 'https://www.quandl.com/api/v3/datasets/WSE/TSGAMES.csv?column_index=4&start_date=2018-01-01&end_date=2018-12-31&collapse=monthly&transform=rdiff&api_key=AZ964MpikzEYAyLGfJD2Y'
csv = urllib.request.urlopen(url).read()
with open('file.csv', 'wb') as fx:  # bytes, hence mode 'wb'
    fx.write(csv)
But I got an error:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Bad Request
Do you know the reason, and could you help? Thanks for any help!

Edit: I should state that your link did not work for me, and my Quandl API key is different from yours.
This is pretty easy to do with the requests module:
import requests

filename = 'test_file.csv'
link = 'your link here'
data = requests.get(link)  # request the link; response 200 = success
with open(filename, 'wb') as f:
    f.write(data.content)  # write the content of the request to the file; 'with' closes it automatically
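If the original URL still returns Bad Request, one thing worth trying is letting requests build and encode the query string itself. A minimal sketch, with the parameter names taken from the question's URL and YOUR_API_KEY as a placeholder for a valid Quandl key:
import requests

base = 'https://www.quandl.com/api/v3/datasets/WSE/TSGAMES.csv'
params = {
    'column_index': 4,
    'start_date': '2018-01-01',
    'end_date': '2018-12-31',
    'collapse': 'monthly',
    'transform': 'rdiff',
    'api_key': 'YOUR_API_KEY',  # placeholder, not a real key
}
response = requests.get(base, params=params)  # requests percent-encodes the parameters
response.raise_for_status()  # raises a descriptive HTTPError on 4xx/5xx responses
with open('file.csv', 'wb') as f:
    f.write(response.content)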

That link doesn't work for me. Try it like this (generic example).
from urllib.request import urlopen
from io import StringIO
import csv

data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)
with open('C:/Users/Excel/Desktop/example.csv', 'w', newline='') as myFile:  # newline='' avoids blank rows on Windows
    writer = csv.writer(myFile)
    writer.writerows(csvReader)
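If pandas is available, the same download-and-save round trip fits in two lines. A minimal sketch, assuming the URL serves plain CSV:
import pandas as pd

# read_csv accepts a URL directly and parses the response as CSV
df = pd.read_csv('http://pythonscraping.com/files/MontyPythonAlbums.csv')
df.to_csv('example.csv', index=False)  # write a local copy, dropping the index column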

Related

Download PDF file using python3

I want to download this website's PDF file using python3: https://qingarchives.npm.edu.tw/index.php?act=Display/image/207469Zh18QEz#74l
This might accomplish what you're trying to achieve:
import requests

# URL to be downloaded
url = "https://cfm.ehu.es/ricardo/docs/python/Learning_Python.pdf"

def download_pdf(url, file_name):
    # Send GET request
    response = requests.get(url)
    # Save the PDF
    with open(file_name, "wb") as f:
        f.write(response.content)

download_pdf(url, 'myDownloadedFile.pdf')
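For large PDFs it may be worth streaming the response to disk instead of holding the whole file in memory. A minimal sketch of the same idea with stream=True (the chunk size is an arbitrary choice):
import requests

def download_pdf_streamed(url, file_name, chunk_size=8192):
    # Stream the body in chunks rather than loading the whole PDF at once
    with requests.get(url, stream=True) as response:
        response.raise_for_status()
        with open(file_name, 'wb') as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)

download_pdf_streamed('https://cfm.ehu.es/ricardo/docs/python/Learning_Python.pdf', 'myDownloadedFile.pdf')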
Or, using urllib from the standard library:
from urllib import request

response = request.urlretrieve("https://cfm.ehu.es/ricardo/docs/python/Learning_Python.pdf", "learning_python.pdf")
or
import wget

URL = "https://cfm.ehu.es/ricardo/docs/python/Learning_Python.pdf"
response = wget.download(URL, "learning_python.pdf")

Convert Web URL to Image

I am trying to take a screenshot of a URL, but it takes a screenshot of the gateway instead because of restricted entry. So I tried adding an ID and password to open the link, but it does not work for some reason. Could you help?
import requests
import urllib.parse
import getpass

BASE = 'https://mini.s-shot.ru/1024x0/JPEG/1024/Z100/?'  # we can modify size, format, zoom as needed
url = 'https://mail.google.com/mail/'  # or whatever link you need
url = urllib.parse.quote_plus(url)
print(url)
Id = "XXXXXX"
key = getpass.getpass('Password :: ')
path = 'target1.jpg'
response = requests.get(BASE + url + Id + key, stream=True)
if response.status_code == 200:
    with open(path, 'wb') as file:
        for chunk in response:
            file.write(chunk)
Thanks!
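The screenshot service itself most likely cannot log in to a page behind a form or OAuth login such as Gmail. If the gateway in front of the target page uses plain HTTP Basic authentication, though, requests can send the credentials directly. A minimal sketch under that assumption (the URL is a hypothetical gateway-protected page):
import requests
import getpass

url = 'https://example.com/protected/page'  # hypothetical gateway-protected page
user = 'XXXXXX'
password = getpass.getpass('Password :: ')
# auth=(user, password) only works for HTTP Basic auth; form logins need a session flow
response = requests.get(url, auth=(user, password), stream=True)
if response.status_code == 200:
    with open('target1.jpg', 'wb') as file:
        for chunk in response.iter_content(chunk_size=8192):
            file.write(chunk)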

Get the name of an Instagram profile and the date of a post with Python

I'm in the process of learning python3 and I'm trying to solve a simple task. I want to get the account name and the post date from an Instagram link.
import requests
from bs4 import BeautifulSoup
html = requests.get('https://www.instagram.com/p/BuPSnoTlvTR')
soup = BeautifulSoup(html.text, 'lxml')
item = soup.select_one("meta[property='og:description']")
name = item.find_previous_sibling().get("content").split("•")[0]
print(name)
This code works sometimes with links like this: https://www.instagram.com/kingtop
But I need it to also work with image posts like this: https://www.instagram.com/p/BuxB00KFI-x/
That's all I could come up with, but it is not working, and I can't get the date either.
Do you have any ideas? I appreciate any help.
I found a way to get the account name. Now I'm trying to find a way to get the upload date:
import requests
from bs4 import BeautifulSoup
import time
from requests.exceptions import HTTPError

start = time.time()
file = open('users.txt', 'r', encoding="ISO-8859-1")
urls = file.readlines()
output = open('output2.txt', 'a')  # open once instead of reopening on every iteration
for url in urls:
    url = url.strip('\n')
    try:
        req = requests.get(url)
        req.raise_for_status()
    except HTTPError as http_err:
        output.write('not found\n')
    except Exception as err:
        output.write('not found\n')
    else:
        soup = BeautifulSoup(req.text, "lxml")
        the_url = soup.select("[rel='canonical']")[0]['href']
        the_url2 = the_url.replace('https://www.instagram.com/', '')
        head, sep, tail = the_url2.partition('/')
        output.write(head + '\n')
output.close()
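For the upload date, one possibility is the structured data Instagram has historically embedded in each post page. A minimal sketch, assuming the page still serves an application/ld+json script with an uploadDate field (Instagram changes its markup often, so this may need adjusting):
import json
import requests
from bs4 import BeautifulSoup

req = requests.get('https://www.instagram.com/p/BuxB00KFI-x/')
soup = BeautifulSoup(req.text, 'lxml')
script = soup.find('script', type='application/ld+json')
if script is not None:
    data = json.loads(script.string)
    print(data.get('uploadDate'))  # ISO 8601 timestamp, if the field is present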

Download images with requests by generated links

I am not able to download JavaScript-generated images and store them on my local machine. The code below does not raise any errors, but there are no images in my folder. I have already tried many approaches; here is one of them:
import os
import requests

path = "http://my.site.com/page/oo?_b=9V2FG34519CV2N56SLK567943N25J82V"
os.makedirs('C:/Images', exist_ok=True)
print('Download images %s...' % path)
res = requests.get(path)
res.raise_for_status()
# Derive a file name from the URL path (strip the query string)
file_name = os.path.basename(path.split('?')[0])
imageFile = open(os.path.join('C:/Images', file_name), 'wb')
for chunk in res.iter_content(100000):
    imageFile.write(chunk)
imageFile.close()
I have been trying to solve this problem for two days, so I would be grateful if somebody could help me!
Once you've got the image URL, you can use requests and shutil to save the image to a given directory (working for me for the given URL):
import shutil
import requests
import os

url = 'http://epub.hpo.hu/e-kutatas/aa?_p=A554F6BCDBCEA51EFF1E0E17E777F3AC'
out_file_path = os.path.join('C:/Images', 'image.jpg')  # choose where to save the image; adjust as needed
response = requests.get(url, stream=True)
with open(out_file_path, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
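Since the links in the question are generated by JavaScript, the response may be an HTML error page rather than an image, which would explain files that never appear. A minimal sketch that checks the Content-Type header before saving (the output file name is an assumption):
import requests

url = 'http://epub.hpo.hu/e-kutatas/aa?_p=A554F6BCDBCEA51EFF1E0E17E777F3AC'
response = requests.get(url)
content_type = response.headers.get('Content-Type', '')
if response.status_code == 200 and content_type.startswith('image/'):
    with open('image.jpg', 'wb') as out_file:  # assumed output name
        out_file.write(response.content)
else:
    print('Not an image:', response.status_code, content_type)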

Save HTML Source Code to File

How can I copy the source code of a website into a text file in Python 3?
EDIT:
To clarify my issue, here's what I have:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')
I get the following error for the f.write() function:
builtins.TypeError: must be str, not bytes
import urllib.request

site = urllib.request.urlopen('http://somesite.com')
data = site.read()
file = open("file.txt", "wb")  # open file in binary mode
file.write(data)  # write the bytes in one call
file.close()
Untested but should work.
EDIT: Updated for python3
Try this.
import urllib.request

def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')
It is easier, but if you still want to do it your way, this is the solution:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read().decode('utf-8')  # decode the response bytes into a str
    f.write(pagetext)
    f.close()

extractHTML('https://www.google.com')
Your script gave an error saying the argument must be a str, not bytes; decoding the bytes with .decode('utf-8') fixes that (wrapping them in str() would only produce the b'...' representation).
Next I got an error saying no host was given: you forgot to include // after the scheme, and since Google is a secure site you should use https: rather than http:.
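A slightly more robust variant decodes with whatever charset the server declares instead of assuming one. A minimal sketch using the charset from the response headers, falling back to UTF-8:
import urllib.request

def extract_html(url):
    with urllib.request.urlopen(url) as page:
        charset = page.headers.get_content_charset() or 'utf-8'  # server-declared charset, UTF-8 as fallback
        pagetext = page.read().decode(charset)
    with open('temphtml.txt', 'w', encoding=charset) as f:
        f.write(pagetext)

extract_html('https://www.google.com')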
Perhaps you wanted to create something like this:
import urllib.request

class ExtractHtml():
    def Page(self):
        print("enter the web page name starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        file = open("D://python_projects/output.txt", "wb")
        file.write(data)
        file.close()

w = ExtractHtml()
w.Page()
