I have Python 3.3 installed.
I use the example from their site:
import urllib.request
response = urllib.request.urlopen('http://python.org/')
html = response.read()
The only thing that happens when I run it is that I get this:
======RESTART=========
I know I am a rookie, but I figured the example from Python's own website should work.
It doesn't. What am I doing wrong? Eventually I want to run the script from the website below, but I think urllib is not going to work as written on that site. Can someone tell me if the code will work with Python 3.3?
http://flowingdata.com/2007/07/09/grabbing-weather-underground-data-with-beautifulsoup/
I think I see what's probably going on. You're likely using IDLE, and when it starts a new run of a program, it prints the
======RESTART=========
line to tell you that a fresh program is starting. That means that all the variables currently defined are reset and/or deleted, as appropriate.
Since your program didn't print any output, you didn't see anything.
The two lines I suggested adding were just tests to figure out what was going on; they're not needed in general. [Unless the window itself is automatically closing, which it shouldn't.] But as a rule, if you want to see output, you'll have to print what you're interested in.
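For example, adding a print to the snippet from the question would make something show up in the shell:
import urllib.request

response = urllib.request.urlopen('http://python.org/')
html = response.read()
print(html[:200])  # show the first 200 bytes of the page so there is visible output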
Your example works for me. However, I suggest using requests instead of urllib.
To simplify the example you linked to, it would look like:
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.wunderground.com/history/airport/KBUF/2007/12/16/DailyHistory.html")
soup = BeautifulSoup(resp.text, "html.parser")
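From there you can pull whatever you need out of the soup. As a quick sanity check (not specific to the weather table), this should print the page title:
print(soup.title.string)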
I am trying to write a simple script that, given an arbitrary URL, will return the title tag of that website. Because many of the URLs I want to resolve need JavaScript enabled, I need to use something like requests_html's render function to do this. However, I have hit an issue with the library where the example URL below never terminates. I have tried the timeout arg of the render call and it did not work. Can anyone help me figure out how to get this to time out properly, or suggest some other workaround to make sure it doesn't get stuck?
This is my current code that does not terminate (it gets stuck on the render call):
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://shan-shui-inf.lingdong.works/')
# render with JS
r.html.render(sleep=1, keep_page=True)
# Also does not work: r.html.render(sleep=1, keep_page=True, timeout=3)
title = r.html.find('title', first=True).full_text
I have already tried solutions like: Timeout on a function call and Python timeout decorator which still did not timeout strangely enough.
NOTE: I am using Python 3.7.4 64-bit on Windows 10.
I would suggest putting r.session.close() at the end. This worked for me.
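Combining that suggestion with the snippet from the question, the whole thing would look roughly like this (same URL as above; the close() call follows the suggestion in this answer):
from requests_html import HTMLSession

session = HTMLSession()
r = session.get('http://shan-shui-inf.lingdong.works/')
r.html.render(sleep=1, keep_page=True)
title = r.html.find('title', first=True).full_text
print(title)
r.session.close()  # close the session so the script can exit cleanly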
Ok I'm quite late here,
This is what I've done:
pip install -U pyppeteer
(pip installed the 0.2.6 version for me)
Then it worked somehow
(unrelated)
If you want the Chromium browser to actually appear on screen, you'll need to edit requests_html.py (somewhere in site-packages) around line 714 and change headless=True to headless=False.
So I have this GUI that I made with tkinter and everything works well. What it does is connect to servers and send commands for both Linux and Windows. I went ahead and used pyinstaller to create a windowed GUI without a console, and when I try to use a specific function for sending Windows commands, it fails. If I create the GUI with a console that pops up before the GUI, it works like a charm. What I'm trying to figure out is how to get my GUI to work with the console being invisible to the user.
The part of my code that has the issue revolves around subprocess. To spare you all from the 400+ lines of code I wrote, I'm providing the specific code that has issues. Here is the snippet:
def rcmd_in(server):
    import subprocess as sp
    for i in command_list:
        result = sp.run(['C:/"Path to executable"/rcmd.exe', '\\\\' + server, i],
                        universal_newlines=True, stdout=sp.PIPE, stderr=sp.STDOUT)
        print(result.stdout)
The argument 'server' is passed from another function that calls 'rcmd_in', and 'command_list' is a mutable list created at the root of the code, accessible to all functions.
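For context, a call site would look something like this (the server name and the commands here are just placeholders):
command_list = ['hostname', 'ipconfig /all']  # placeholder commands
rcmd_in('SERVER01')                           # placeholder server name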
Now, I have done my due diligence. I scoured multiple searches and, using info from this link: recipe-subprocess, came up with an edit to my code that attempts to run it with the console invisible. Here is what the edit looks like:
def rcmd_in(server):
    import subprocess as sp
    import os, os.path
    si = sp.STARTUPINFO()
    si.dwFlags |= sp.STARTF_USESHOWWINDOW
    for i in command_list:
        result = sp.run(['C:/"Path to executable"/rcmd.exe', '\\\\' + server, i],
                        universal_newlines=True, stdin=sp.PIPE, stdout=sp.PIPE,
                        stderr=sp.STDOUT, startupinfo=si, env=os.environ)
        print(result.stdout)
The problem I have now is that when it runs, an error of "Error:8 - Internal error -109" pops up. Let me add that I tried the functions 'call()', 'Popen()', and others, but only 'run()' seems to work.
I've reached a point where my brain hurts and I could use some help. Any suggestions? As always, I am forever grateful for anyone's help. Thanks in advance!
I figured it out and it only took me 5 days! :D
It looks like the reason the function fails comes down to how Windows handles stdin. I found a post that helped me edit my code to work with pyinstaller -w (--noconsole). Here is the updated code:
def rcmd_in(server):
    import subprocess as sp
    si = sp.STARTUPINFO()
    si.dwFlags |= sp.STARTF_USESHOWWINDOW
    for i in command_list:
        result = sp.Popen(['C:/"Path to executable"/rcmd.exe', '\\\\' + server, i],
                          universal_newlines=True, stdin=sp.PIPE, stdout=sp.PIPE,
                          stderr=sp.PIPE, startupinfo=si)
        print(result.stdout.read())
Note the change from 'run()' to 'Popen()'. The 'run()' function will not work with the print statement at the end. Also, for those of you who are curious, the 'si' variable I created prevents 'subprocess' from opening a console when run from a GUI. I hope this will be useful to someone struggling with this. Cheers
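For anyone who still wants to stick with run(), a variant that hands the child a valid stdin via DEVNULL is another common pattern under pyinstaller -w. This is only a sketch (the path is a placeholder), not the exact code the answer above settled on:
import subprocess as sp

def rcmd_in(server):
    si = sp.STARTUPINFO()
    si.dwFlags |= sp.STARTF_USESHOWWINDOW
    for i in command_list:
        result = sp.run(['C:/Path to executable/rcmd.exe', '\\\\' + server, i],
                        universal_newlines=True,
                        stdin=sp.DEVNULL,   # give the child a valid stdin even though there is no console
                        stdout=sp.PIPE,
                        stderr=sp.STDOUT,
                        startupinfo=si)
        print(result.stdout)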
My basic idea is to make a simple script in order to get geographic coordinates using this site: https://mycurrentlocation.net/.
When I run my script, the attribute column is empty and the program doesn't return the list correctly: check the list
Please help me! :)
import requests,time
from bs4 import BeautifulSoup as bs
site="https://mycurrentlocation.net/"
def track_the_location():
    page = requests.get(site)
    page = bs(page.text, "html.parser")
    latitude = page.find_all("td", {"id": "latitude"})
    longitude = page.find_all("td", {"id": "longitude"})
    accuracy = page.find_all("td", {"id": "accuracy"})
    List = [latitude.text, longitude.text, accuracy.text]
    return List

print(track_the_location())
I think the problem is that you are running it from a local script. This behaves differently from a browser because it doesn't provide any location information to the website. The needed information is supplied at runtime by the browser, so you actually need to simulate a browser session to get the data, as long as the site doesn't offer an API where you can specify that information manually.
A possible solution would be selenium, as it helps you simulate a browser session. Here's the selenium documentation for further reading.
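For illustration only, a minimal selenium version might look roughly like the sketch below. It reuses the td ids from the question's code and assumes the page fills them in once JavaScript runs; note that the browser will still ask for permission to share your location:
from selenium import webdriver
import time

driver = webdriver.Firefox()
driver.get("https://mycurrentlocation.net/")
time.sleep(5)  # give the page time to run its JavaScript and fill in the table
latitude = driver.find_element_by_id("latitude").text
longitude = driver.find_element_by_id("longitude").text
accuracy = driver.find_element_by_id("accuracy").text
driver.quit()
print([latitude, longitude, accuracy])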
Hope I could help you. Have a nice day.
I am trying to visually depict my topics in Python using pyLDAvis. However, I am unable to view the graph. Do I have to view the graph in the browser, or will it pop up upon execution? Below is my code:
import pyLDAvis
import pyLDAvis.gensim as gensimvis
print('Pyldavis ....')
vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
pyLDAvis.display(vis_data)
The program stays in execution mode continuously after running the above commands. Where should I view my graph? Or where will it be stored? Is it integrated only with the IPython notebook? Kindly guide me through this.
P.S. My Python version is 3.5.
This does not work:
pyLDAvis.display(vis_data)
This will work for you:
pyLDAvis.show(vis_data)
I'm facing the same problem now.
EDIT:
My script looks as follows:
first part:
import pyLDAvis
import pyLDAvis.sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

print('start script')
tf_vectorizer = CountVectorizer(strip_accents='unicode', stop_words='english', lowercase=True,
                                token_pattern=r'\b[a-zA-Z]{3,}\b', max_df=0.5, min_df=10)
dtm_tf = tf_vectorizer.fit_transform(docs_raw)
lda_tf = LatentDirichletAllocation(n_topics=20, learning_method='online')
print('fit')
lda_tf.fit(dtm_tf)
second part:
print('prepare')
vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)
print('display')
pyLDAvis.display(vis_data)
The problem is in the line "vis_data = (...)". If I run the script, it prints 'prepare' and keeps running after that without printing anything else (so it never reaches the line print('display')).
The funny thing is, when I run the whole script it gets stuck on that line, but when I run the first part, go to my console, and execute just the single line "vis_data = pyLDAvis.sklearn.prepare(lda_tf, dtm_tf, tf_vectorizer)", it completes in a couple of seconds.
As for the graph, I saved it as HTML ("simple") and use the HTML file to view the graph.
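For reference, pyLDAvis can write the visualization straight to a file, which you can then open in any browser:
pyLDAvis.save_html(vis_data, 'lda_visualization.html')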
I ran into the same problem (I use PyCharm as my IDE). The problem is that pyLDAvis is developed for IPython (see the docs, https://media.readthedocs.org/pdf/pyldavis/latest/pyldavis.pdf, page 3).
My fix/workaround:
make a dict of lda_tf, dtm_tf, tf_vectorizer (e.g., pyLDAvis_dict)
dump the dict to a file (e.g., mydata_pyLDAvis.pkl)
read the pkl file into a notebook (I did get some deprecation info from pyLDAvis, but that had no effect on the end result)
play around with pyLDAvis in the notebook
if you're happy with the view, dump it into html
The cause is (most likely) that pyLDAvis expects continuous user interaction (including a user-initiated "exit"). However, I would rather dump data from a smart IDE and read it into Jupyter than develop/code in a Jupyter notebook. That's pretty much like going back to before-emacs times.
From experience, this approach works quite nicely for other plotting routines.
If you get the module error for pyLDAvis.gensim, then try this one instead:
import pyLDAvis.gensim_models
You get the error because of a newer version of pyLDAvis, which renamed the gensim module to gensim_models.
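Assuming the same ldamodel, doc_term_matrix, and dictionary as in the question above, the updated import and call would look like:
import pyLDAvis
import pyLDAvis.gensim_models as gensimvis

vis_data = gensimvis.prepare(ldamodel, doc_term_matrix, dictionary)
pyLDAvis.show(vis_data)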
I ran into a very weird problem when trying to parse data from a JavaScript-rendered website. Maybe it's because I am not an expert in web development.
Here is what happened:
I am trying to get all the comments data from The Globe and Mail. If you check its source code, there is no way to parse the comments data from it with Python; everything is rendered by JavaScript.
However, there is a magic tool, the "Gigya" API, which can return all the comments from a JS-rendered website: Gigya getComments method
When I used these lines of code in my Python Scrapy spider, it returned all the comments.
data = {"categoryID": self.categoryID,
"streamID": streamId,
"APIKey": self.apikey,
"callback": "foo",
"threadLimit": 1000 # assume all the articles have no more then 1000 comments
}
r = urlopen("http://comments.us1.gigya.com/comments.getComments", data=urlencode(data).encode("utf-8"))
comments_lst = loads(r.read().decode("utf-8"))["comments"]
However, The Globe and Mail is updating their website, and all the comments posted before Nov. 28 have been hidden from the web for now. That's why on the sample URL I am showing here, you can only see 2 comments, because they were posted after Nov. 28. And these 2 new comments have the new feature, the "React" button.
The weird thing is, right now when I run my code, I can get all those hidden hundreds of comments published before Nov. 28, but cannot get the new comments we can see on the website now.
I have tried all the Gigya comment-related methods and none of them worked; the other Gigya methods don't look like anything helpful...
Is there any way to solve this problem?
Or at least, do you know why I can get all the hidden comments but cannot get the visible new comments that have the new features?
Finally, I solved the problem with the Python selenium library; it's free and it's super cool.
So, it seems that although we cannot see the content in the source code of a JS-rendered website, it does in fact end up in the rendered HTML, where we can parse it.
First of all, I installed Firebug on Firefox. With this add-on, I can see the rendered HTML of the URL, and it makes it very easy to locate the content: just search for keywords in Firebug.
Then I wrote the code like this:
from selenium import webdriver
import time

def main():
    comment_urls = [
        "http://www.theglobeandmail.com/opinion/a-fascists-win-americas-moral-loss/article32753320/comments/"
    ]
    for comment_url in comment_urls:
        driver = webdriver.Firefox()
        driver.get(comment_url)
        time.sleep(5)
        htmlSource = driver.page_source
        clk = driver.find_element_by_css_selector('div.c3qHyJD')
        clk.click()
        reaction_counts = driver.find_elements_by_class_name('c2oytXt')
        for rc in reaction_counts:
            print(rc.text)

if __name__ == "__main__":
    main()
The data I am parsing here is content that cannot be found in the HTML until you click the reaction image on the website. What makes selenium super cool is the click() method. Once you've found the element you want to click, just call this method, and the generated elements will appear in the HTML and become parsable. Super cool!