How can I copy the source code of a website into a text file in Python 3?
EDIT:
To clarify my issue, here's what I have:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = page.read()
    f.write(pagetext)
    f.close()

extractHTML('http:www.google.com')
I get the following error for the f.write() function:
builtins.TypeError: must be str, not bytes
import urllib.request

site = urllib.request.urlopen('http://somesite.com')
data = site.read()
file = open("file.txt", "wb")  # open the file in binary mode
file.write(data)  # write() accepts the bytes object directly
file.close()
Untested but should work.
EDIT: Updated for python3
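If you prefer to let context managers handle the closing, an equivalent sketch (same placeholder URL as above) might look like this:
import urllib.request

# Sketch only: 'http://somesite.com' is the placeholder URL from the answer above.
with urllib.request.urlopen('http://somesite.com') as site:
    data = site.read()  # raw bytes

with open("file.txt", "wb") as out_file:  # binary mode, so the bytes can be written directly
    out_file.write(data)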
Try this.
import urllib.request

def extractHTML(url):
    urllib.request.urlretrieve(url, 'temphtml.txt')
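If you also want to inspect what came back, urlretrieve returns the local file name and the response headers; a small usage sketch (the URL is taken from the corrected call later in this thread):
import urllib.request

# Usage sketch: urlretrieve returns the local path and an HTTPMessage of headers.
path, headers = urllib.request.urlretrieve('https://www.google.com', 'temphtml.txt')
print(path, headers.get_content_type())  # e.g. temphtml.txt text/html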
That is easier, but if you still want to do it your way, here is the solution:
import urllib.request

def extractHTML(url):
    f = open('temphtml.txt', 'w')
    page = urllib.request.urlopen(url)
    pagetext = str(page.read())
    f.write(pagetext)
    f.close()

extractHTML('https://www.google.com')
Your script gave an error saying the argument must be a string, so just convert the bytes to a string with str().
Next I got an error saying no host was given. Google is a secure site, so use https: rather than http:, and most importantly you forgot to include the // after https:.
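As a side note (my addition, not part of the original answer): str() on bytes gives you the repr, with a b'...' prefix and escaped newlines. If you want the decoded page text instead, a variant of the same function might look like this (utf-8 is an assumption; the real charset is in the response headers):
import urllib.request

def extractHTML(url):
    page = urllib.request.urlopen(url)
    # utf-8 is an assumption; the true encoding can be read from
    # page.headers.get_content_charset().
    pagetext = page.read().decode('utf-8')
    with open('temphtml.txt', 'w', encoding='utf-8') as f:
        f.write(pagetext)

extractHTML('https://www.google.com')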
Probably you wanted to create something like this:
import urllib.request

class ExtractHtml():
    def Page(self):
        print("Enter the web page name, starting with 'http://': ")
        url = input()
        site = urllib.request.urlopen(url)
        data = site.read()
        file = open("D://python_projects/output.txt", "wb")
        file.write(data)
        file.close()

w = ExtractHtml()
w.Page()
I'm trying to get the download link from the button on this page, but when I open the download link that my code produces, I get an error page instead of the file.
I noticed that if I manually click the button and open the link in a new page, the csrfKey part of the link is always the same, whereas when I run the code I get a different key every time. Here's my code:
from bs4 import BeautifulSoup
import requests
import re

def GetPage(link):
    source_new = requests.get(link).text
    soup_new = BeautifulSoup(source_new, 'lxml')
    container_new = soup_new.find_all(class_='ipsButton')
    for data_new in container_new:
        # print(data_new)
        headline = data_new  # display text
        match = re.findall('download', str(data_new), re.IGNORECASE)
        if match:
            print(f'{headline["href"]}\n')

if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'
    GetPage(link)
Before you get to the actual download links for the files, you need to agree to the Terms and Conditions. So, you need to fake this step with requests and then parse the page you get next.
Here's how:
import requests
from bs4 import BeautifulSoup

if __name__ == '__main__':
    link = 'https://eci.gov.in/files/file/10985-5-number-and-types-of-constituencies/'
    with requests.Session() as connection:
        r = connection.get("https://eci.gov.in/")
        confirmation_url = BeautifulSoup(
            connection.get(link).text, 'lxml'
        ).select_one(".ipsApp .ipsButton_fullWidth")["href"]
        fake_agree_to_continue = connection.get(
            confirmation_url.replace("?do=download", "?do=download&confirm=1")
        ).text
        download_links = [
            a["href"] for a in
            BeautifulSoup(
                fake_agree_to_continue, "lxml"
            ).select(".ipsApp .ipsButton_small")[1:]
        ]
        for download_link in download_links:
            response = connection.get(download_link)
            file_name = (
                response
                .headers["Content-Disposition"]
                .replace('"', "")
                .split(" - ")[-1]
            )
            print(f"Downloading: {file_name}")
            with open(file_name, "wb") as f:
                f.write(response.content)
This should output:
Downloading: Number And Types Of Constituencies.pdf
Downloading: Number And Types Of Constituencies.xls
And save two files: a .pdf and a .xls.
Hello, I am trying to make a program that automatically goes to imgur, enters the name that you typed, and downloads the top 10 images. Everything is working except for the part that uses the os library. When I call os.listdir(), after nine files it won't show any more files. I tried googling and found nothing, so if you see something that I messed up, please tell me. Thanks in advance.
Here is the code sample:
#! python3
import requests, os, sys
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

os.chdir('imgur/')

browser = webdriver.Chrome(executable_path=r'C:\Users\{YOUR USERNAME}\Downloads\chromedriver.exe')
browser.get('https://imgur.com/')
browser.maximize_window()

search_bar = browser.find_element_by_tag_name('input')
search_bar.send_keys('happy')
search_bar.send_keys(Keys.ENTER)

pictures = browser.find_elements_by_tag_name('img')

for i in range(1, 11):
    res = requests.get(pictures[i].get_attribute('src'))
    try:
        res.raise_for_status()
    except:
        print("Link doesn't exist")
    if os.listdir() == []:
        picture = open('picture1.png', 'wb')
    else:
        picture = open('picture' + str(int(os.listdir()[-1][7:-4]) + 1) + '.png', 'wb')
    print(os.listdir())
    for chunk in res.iter_content(100000):
        picture.write(chunk)
    picture.close()
os.listdir(".") #u missed adding address
I have a silly issue. I have an address that generates a CSV file immediately when I paste it into the browser, but I need to do it from Python code, so I tried something like this:
import urllib.request

url = 'https://www.quandl.com/api/v3/datasets/WSE/TSGAMES.csv?column_index=4&start_date=2018-01-01&end_date=2018-12-31&collapse=monthly&transform=rdiff&api_key=AZ964MpikzEYAyLGfJD2Y'
csv = urllib.request.urlopen(url).read()
with open('file.csv', 'wb') as fx:  # bytes, hence mode 'wb'
    fx.write(csv)
But I got an error:
raise HTTPError(req.full_url, code, msg, hdrs, fp)
HTTPError: Bad Request
Do you know the reason, and could you help? Thanks for any help!
Edit: I should state that your link did not work for me, and my Quandl API key is different from yours.
This is pretty easy to do with the requests module:
import requests

filename = 'test_file.csv'
link = 'your link here'
data = requests.get(link)  # request the link; response 200 = success

with open(filename, 'wb') as f:
    f.write(data.content)  # write the content of the response to the file; the with block closes it
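One optional hardening step (my suggestion, not part of the original answer): call raise_for_status() so a bad link raises an exception instead of silently writing an error page into the file:
import requests

filename = 'test_file.csv'
link = 'your link here'

data = requests.get(link)
data.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses

with open(filename, 'wb') as f:
    f.write(data.content)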
That link doesn't work for me. Try it like this (generic example).
from urllib.request import urlopen
from io import StringIO
import csv

data = urlopen("http://pythonscraping.com/files/MontyPythonAlbums.csv").read().decode('ascii', 'ignore')
dataFile = StringIO(data)
csvReader = csv.reader(dataFile)

with open('C:/Users/Excel/Desktop/example.csv', 'w') as myFile:
    writer = csv.writer(myFile)
    writer.writerows(csvReader)
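A small refinement to consider (my addition, not from the original answer): the csv module docs recommend opening the output file with newline='' so csv.writer doesn't insert blank lines between rows on Windows. Only the final with block above changes:
with open('C:/Users/Excel/Desktop/example.csv', 'w', newline='') as myFile:
    writer = csv.writer(myFile)
    writer.writerows(csvReader)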
I am running python3 on a Ubuntu machine and have noticed that the following block of code is fickle. Sometimes it runs just fine, other times it produces a segmentation fault. I don't understand why. Can someone explain what might be going on?
Basically, the code tries to read the list of S&P 500 companies from Wikipedia and write the list of tickers to a file in the same directory as the script. If no connection to Wikipedia can be established, the script instead tries to read an existing list from that file.
from urllib import request
from urllib.error import URLError
from bs4 import BeautifulSoup
import os
import pickle
import dateutil.relativedelta as dr
import sys

sys.setrecursionlimit(100000)

def get_standard_and_poors_500_constituents():
    fname = (
        os.path.abspath(os.path.dirname(__file__)) + "/sp500_constituents.pkl"
    )
    try:
        # URL request, URL opener, read content.
        req = request.Request(
            "http://en.wikipedia.org/wiki/List_of_S%26P_500_companies"
        )
        opener = request.urlopen(req)
        # Convert bytes to UTF-8.
        content = opener.read().decode()
        soup = BeautifulSoup(content, "lxml")
        # The HTML table we actually need is the first one.
        tables = soup.find_all("table")
        external_class = tables[0].findAll("a", {"class": "external text"})
        c = [ext.string for ext in external_class if not "reports" in ext]
        with open(fname, "wb") as f:
            pickle.dump(c, f)
    except URLError:
        with open(fname, "rb") as f:
            c = pickle.load(f)
    finally:
        return c

sp500_constituents = get_standard_and_poors_500_constituents()
spdr_etf = "SPY"
sp500_index = "^GSPC"

def main():
    X = get_standard_and_poors_500_constituents()
    print(X)

if __name__ == "__main__":
    main()
At the moment I am working with this code:
from bs4 import BeautifulSoup
import glob
import os
import re
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()

def trade_spider():
    os.chdir(r"C:\Users\6930p\FLO'S DATEIEN\Master FAU\Sommersemester 2016\02_Masterarbeit\04_Testumgebung\01_Probedateien für Analyseaspekt\Independent Auditors Report")
    with stdout2file("output.txt"):
        for file in glob.iglob('**/*.html', recursive=True):
            with open(file, encoding="utf8") as f:
                contents = f.read()
                soup = BeautifulSoup(contents, "html.parser")
                for item in soup.findAll("ix:nonfraction"):
                    if re.match(".*AuditFeesExpenses", item['name']):
                        print(file.split(os.path.sep)[-1], end="| ")
                        print(item['name'], end="| ")
                        print(item.get_text())

trade_spider()
So far this works perfectly. But now I am stuck on another issue. If I search within a folder which has no subfolders, only files, this works without problems. However, if I try to run this code on a folder that has subfolders, it doesn't work (it prints nothing!).
Furthermore, I would like to have my results printed into a .txt file without the whole path in it. The result should look like:
Filename.html| RegEx match| HTML text
I do get this result already, but only in PyCharm and not in a separate .txt file.
To sum up, I have two questions:
How can I also walk through subfolders in my defined directory?
-> Would os.walk() be an option for that?
How can I print my results into a .txt file?
-> Would sys.stdout work for that?
Any help appreciated on this issue!
UPDATE:
It only prints the results of the first file into my "output.txt" file (at least I think it is the first, as it is the last file in my only subfolder and recursive=True is set). Any idea why it is not looping through all the other files?
UPDATE 2: Question resolved! The final code can be seen above!
For walking in subdirectories, there are two options:
Use ** with glob and the argument recursive=True (glob.glob('**/*.html', recursive=True)). This only works in Python 3.5+. I would also recommend using glob.iglob instead of glob.glob if the directory tree is large.
Use os.walk and check the filenames (whether they end in ".html") manually or with fnmatch.filter, as sketched below.
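A hedged sketch of that second option, with a placeholder root directory:
import os
import fnmatch

root_dir = '.'  # placeholder for your Independent Auditors Report folder
for dirpath, dirnames, filenames in os.walk(root_dir):
    for filename in fnmatch.filter(filenames, '*.html'):
        full_path = os.path.join(dirpath, filename)
        print(full_path)  # or open and parse it with BeautifulSoup as above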
Regarding the printing into a file, there are again several ways:
Just execute the script and redirect stdout, i.e. python3 myscript.py >myfile.txt
Replace calls to print with a call to the .write() method of a file object opened in write mode.
Keep using print, but give it the argument file=myfile, where myfile is again a writable file object; both of these are sketched below.
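Hedged sketches of those last two options, with placeholder file names and text:
# Option 2: call .write() on a file object opened in write mode.
with open('output.txt', 'w') as myfile:
    myfile.write('Filename.html| AuditFeesExpenses| some text\n')

# Option 3: keep using print, but point it at the file.
with open('output.txt', 'w') as myfile:
    print('Filename.html| AuditFeesExpenses| some text', file=myfile)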
Edit: Maybe the most unobtrusive method would be the following. First, include this somewhere:
import contextlib

@contextlib.contextmanager
def stdout2file(fname):
    import sys
    f = open(fname, 'w')
    sys.stdout = f
    yield
    sys.stdout = sys.__stdout__
    f.close()
And then, in front of the line in which you loop over the files, add this line (and indent the loop body appropriately):
with stdout2file("output.txt"):