urlopen is not working for python3 [closed]

I am trying to fetch data from a URL. I have tried the following in Python 2.7:
import urllib2 as ul
response = ul.urlopen("http://in.bookmyshow.com/")
page_content = response.read()
print page_content
This works fine, but when I try it in Python 3.4 it throws an error:
File "C:\Python34\lib\urllib\request.py", line 161, in urlopen
return opener.open(url, data, timeout)
I am using:
import urllib.request
response = urllib.request.urlopen('http://in.bookmyshow.com/')
data = response.read()
print data

It works for me (Python 3.4.3). You need to use print(data) in Python 3.
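For reference, the original snippet adapted to Python 3 looks like this:
import urllib.request
response = urllib.request.urlopen('http://in.bookmyshow.com/')
data = response.read()  # bytes, not str
print(data)  # print is a function in Python 3; use data.decode() if you want text instead of bytes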
As a side note, you may also want to consider requests, which makes it much easier to interact over HTTP(S).
>>> import requests
>>> r = requests.get('http://in.bookmyshow.com/')
>>> r.ok
True
>>> plaintext = r.text
Finally, if you want to get data from such complicated pages (which are intended to be displayed, as opposed to an API), you should have a look at Scrapy which will make your life easier as well.

Related

Beautifulsoup scraping Tripadvisor does not work [closed]

I'm a beginner with Python and BeautifulSoup for web scraping, and I ran into an issue scraping the Tripadvisor site for reviews: the code never finishes running and returns no results, yet the same code works on other sites. Please help; here is the code I'm using:
import requests
import urllib.request
from bs4 import BeautifulSoup
r = requests.get('https://www.tripadvisor.fr/Hotel_Review-g295424-d302457-Reviews-Burj_Al_Arab-Dubai_Emirate_of_Dubai.html', auth=('user', 'pass'))
print(r.text)
This returns data:
import requests
# from bs4 import BeautifulSoup
r = requests.get('https://www.tripadvisor.fr/Hotel_Review-g295424-d302457-Reviews-Burj_Al_Arab-Dubai_Emirate_of_Dubai.html')
print(r.text)
The beautifulsoup package is not used yet, since what you want to extract from the page is unspecified in the question.
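As a minimal sketch of that next step (it only prints the page <title> to confirm parsing works; swap in whatever selectors you actually need):
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.tripadvisor.fr/Hotel_Review-g295424-d302457-Reviews-Burj_Al_Arab-Dubai_Emirate_of_Dubai.html')
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.title.string)  # the page <title>, just to confirm the HTML was fetched and parsed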

read python code from Github and execute locally [duplicate]

This question already has an answer here:
How do I execute a python script that is stored on the internet?
I want to create a local python script that reads code from Github and executes it on my computer.
This will ensure that the latest version of code is always being used.
Here is the raw python code:
https://raw.githubusercontent.com/bensharkey3/Guess-The-Number/master/Guess%20the%20number%20game.py
I have tried this, but it doesn't work:
with open(script) as file:
    data = file.read()
also
exec(script)
I feel like this should be easy to do, but I can't figure it out! Any help is much appreciated.
You have the URL, you have the file reading and you have the execution. You're just missing the step where you download the file.
Easiest is to use urllib in Python 3:
import urllib.request
code = 'https://raw.githubusercontent.com/bensharkey3/Guess-The-Number/master/Guess%20the%20number%20game.py'
response = urllib.request.urlopen(code)
data = response.read()
exec(data)
Note that relying on the URL is a pretty fragile way to ensure you have the latest code. Much better would be to use git to pull the latest, but this should at least get you started.
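If you do go the git route, a rough sketch could look like the following (the local clone path and script name are placeholders, not taken from the question):
import os
import subprocess
repo_dir = '/path/to/Guess-The-Number'  # hypothetical local clone of the repo
script = 'Guess the number game.py'  # the script file inside that clone
subprocess.run(['git', 'pull'], cwd=repo_dir, check=True)  # fetch the latest code first
with open(os.path.join(repo_dir, script)) as f:
    exec(f.read())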

Urllib not reading entire page

I have been trying to use urllib.request to read a webpage. The problem I found though was that at least in some instances urllib wasn't able to read the whole page. By this I mean that it returned some html-text but not all that was on the webpage. Can anyone explain why this happens and what you can do instead? An example of my code, the way I used it is here:
import urllib.request
open = urllib.request.urlopen('http://SomeSite.com')
page = open.read()

(PYTHON) Manipulating certain portions of URL at user's request [closed]

The download link I want to manipulate is below:
http://hfrnet.ucsd.edu/thredds/ncss/grid/HFR/USWC/6km/hourly/RTV/HFRADAR,_US_West_Coast,_6km_Resolution,_Hourly_RTV_best.ncd?var=u&var=v&north=47.20&west=-126.3600&east=-123.8055&south=37.2500&horizStride=1&time_start=2015-11-01T00%3A00%3A00Z&time_end=2015-11-03T14%3A00%3A00Z&timeStride=1&addLatLon=true&accept=netcdf
I want to make the region, resolution, coordinate, and date portions of that URL into variables, so I can ask the user what coordinates and data set they want. This way I can download different data sets with the same script. I would also like to use the same variables to name the downloaded file, e.g. USWC6km20151101-20151103.
I did some research and learned that I can use the urllib.parse and urllib2, but when I try experimenting with them, it says "no module named urllib.parse."
I can use webbrowser.open() to download the file, but manipulating the URL is giving me problems.
THANK YOU!!
Instead of urllib you can use the requests module, which makes downloading content much easier. The part that does the actual work is just 4 lines long.
# first install this module
import requests

# parameters to change
location = {
    'part': 'USWC',
    'part2': '_US_West_Coast',
    'km': '6km',
    'north': '45.0000',
    'west': '-120.0000',
    'east': '-119.5000',
    'south': '44.5000',
    'start': '2016-10-01',
    'end': '2016-10-02'
}

# this is template for .format() method to generate links (very naive method)
link_template = "http://hfrnet.ucsd.edu/thredds/ncss/grid/HFR/{part}/{km}/hourly/RTV/\
HFRADAR,{part2},_{km}_Resolution,_Hourly_RTV_best.ncd?var=u&var=v&\
north={north}&west={west}&east={east}&south={south}&horizStride=1&\
time_start={start}T00:00:00Z&time_end={end}T16:00:00Z&timeStride=1&addLatLon=true&accept=netcdf"

# some debug info
link = link_template.format(**location)
file_name = location['part'] + location['km'] + location['start'].replace('-', '') + '-' + location['end'].replace('-', '')
print("Link: ", link)
print("Filename: ", file_name)

# try to open webpage
response = requests.get(link)
if response.ok:
    # open file for writing in binary mode
    with open(file_name, mode='wb') as file_out:
        # write response to file
        file_out.write(response.content)
Probably the next step would be running this in a loop over a list of location dicts, or reading the locations from a CSV file.
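That loop could look something like this (a sketch that reuses link_template and the requests import from the snippet above; the first dict repeats the answer's example and the second uses the values from the URL in the question):
locations = [
    {'part': 'USWC', 'part2': '_US_West_Coast', 'km': '6km',
     'north': '45.0000', 'west': '-120.0000', 'east': '-119.5000',
     'south': '44.5000', 'start': '2016-10-01', 'end': '2016-10-02'},
    {'part': 'USWC', 'part2': '_US_West_Coast', 'km': '6km',
     'north': '47.20', 'west': '-126.3600', 'east': '-123.8055',
     'south': '37.2500', 'start': '2015-11-01', 'end': '2015-11-03'},
]
for location in locations:
    link = link_template.format(**location)
    file_name = (location['part'] + location['km']
                 + location['start'].replace('-', '') + '-'
                 + location['end'].replace('-', ''))
    response = requests.get(link)
    if response.ok:
        with open(file_name, mode='wb') as file_out:
            file_out.write(response.content)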

Best tool for text extraction from PDF in Python 3.4 [closed]

I am using Python 3.4 and need to extract all the text from a PDF and then use it for text processing.
All the answers I have seen suggest options for Python 2.7.
I need something in Python 3.4.
You need to install the PyPDF2 package to be able to work with PDFs in Python. PyPDF2 can extract text and images, and the text is returned as a Python string. To install it, run pip install PyPDF2 from the command line. The package name is case-sensitive, so type the 'y' in lowercase and the other letters in uppercase.
import PyPDF2
reader = PyPDF2.PdfReader('my_file.pdf')
print(len(reader.pages)) # gives '56'
page = reader.pages[9]  # pages are zero-indexed, so this is the tenth page
page.extract_text()
The last statement returns all the text available on that page of 'my_file.pdf'.
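If you want the text of the whole document rather than a single page, a small loop over reader.pages (the same PyPDF2 API used above) will do it:
import PyPDF2
reader = PyPDF2.PdfReader('my_file.pdf')
# join the text of every page; extract_text() can return an empty string for pages without text
full_text = '\n'.join(page.extract_text() or '' for page in reader.pages)
print(full_text)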
pdfminer.six ( https://github.com/pdfminer/pdfminer.six ) has also been recommended elsewhere and is intended to support Python 3. I can't personally vouch for it though, since it failed during installation on macOS. (There's an open issue for that and it seems to be a recent problem, so there might be a quick fix.)
Complementing @Sarah's answer: PDFMiner is a pretty good choice. I have been using it for quite some time, and so far it works well for extracting the text content from a PDF. What I did is create a function that uses the CLI client from pdfminer and saves the output into a variable (which I can then use elsewhere). I am using Python 3.6, and the function does the required job, so maybe this can work for you:
def pdf_to_text(filepath):
    print('Getting text content for {}...'.format(filepath))
    process = subprocess.Popen(['pdf2txt.py', filepath], stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    stdout, stderr = process.communicate()
    if process.returncode != 0 or stderr:
        raise OSError('Executing the command for {} caused an error:\nCode: {}\nOutput: {}\nError: {}'.format(filepath, process.returncode, stdout, stderr))
    return stdout.decode('utf-8')
You will have to import the subprocess module of course: import subprocess
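Used, for example, like this (assuming the function above is defined in the same file and 'my_file.pdf' is a placeholder path):
import subprocess  # needed by pdf_to_text
text = pdf_to_text('my_file.pdf')
print(text[:500])  # show the first 500 characters of the extracted text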
slate3k is good for extracting text. I've tested it with a few PDF files using Python 3.7.3, and it's a lot more accurate than PyPDF2, for instance. It's a fork of slate, which is a wrapper for PDFMiner. Here's the code I am using:
import slate3k as slate

with open('Sample.pdf', 'rb') as f:
    doc = slate.PDF(f)

doc
# prints the full document as a list of strings
# each element of the list is a page in the document
doc[0]
# prints the first page of the document
Credit to this comment on GitHub:
https://github.com/mstamy2/PyPDF2/issues/437#issuecomment-400491342
from pdfreader import SimplePDFViewer
pdfFileObj = open('/tmp/Test-test-test.pdf', 'rb')
viewer = SimplePDFViewer(pdfFileObj)
viewer.navigate(1)     # go to page 1
viewer.render()        # parse the page's content
viewer.canvas.strings  # the text strings found on the page
