HTTP Error 403 Forbidden - when downloading nltk data [duplicate] - python-3.x

This question already has answers here:
Getting 405 error while trying to download nltk data
(2 answers)
Closed 5 years ago.
I am facing a problem accessing NLTK data. I have tried nltk.download(), but the GUI comes up with an HTTP Error 403: Forbidden. I have also tried installing from the command line, as described here:
python -m nltk.downloader all
and got this error:
C:\Python36\lib\runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour warn(RuntimeWarning(msg))
[nltk_data] Error loading all: HTTP Error 403: Forbidden.
I have also gone through How do I download NLTK data? and Failed loading english.pickle with nltk.data.load.

The problem is coming from the NLTK download server. If you look at the GUI's config, it is pointing to this link:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
If you access this link in the browser, you get this message:
Error 403 Forbidden.
Forbidden.
Guru Mediation:
Details: cache-lcy1125-LCY 1501134862 2002107460
Varnish cache server
So, I was going to file an issue on GitHub, but someone else already did that here: https://github.com/nltk/nltk/issues/1791
A workaround was suggested here: https://github.com/nltk/nltk/issues/1787.
Based on the discussion on GitHub:
It seems like the Github is down/blocking access to the raw content on
the repo.
The suggested workaround is to manually download as follows:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
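Note that NLTK looks for the package folders (corpora/, tokenizers/, and so on) directly under the directories listed in nltk.data.path, and in the gh-pages archive those folders typically sit under a packages/ subdirectory. A minimal sketch for pointing NLTK at the manually downloaded data and verifying it can be found (the path below is just an example):
import nltk

# Example path: wherever the unzipped gh-pages archive ended up.
# The package folders (corpora/, tokenizers/, ...) live under "packages/".
nltk.data.path.append("/home/username/nltk_data/nltk_data-gh-pages/packages")

# Raises LookupError if the data still cannot be located.
print(nltk.data.find("tokenizers/punkt"))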
People also suggested using an alternative index, as follows:
python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

Go to nltk/downloader.py and change the default URL:
DEFAULT_URL = 'http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml'
to
DEFAULT_URL = 'http://nltk.github.com/nltk_data/'
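Instead of editing the source, the downloader can also be pointed at a different index at runtime. A minimal sketch, assuming the Downloader constructor still accepts server_index_url (the option the -u flag shown earlier maps to), and reusing the alternative index URL mentioned above:
import nltk.downloader

# Alternative index URL suggested in the discussion above.
ALT_INDEX = "https://pastebin.com/raw/D3TBY4Mj"

# Equivalent to: python -m nltk.downloader -u <ALT_INDEX> punkt
downloader = nltk.downloader.Downloader(server_index_url=ALT_INDEX)
downloader.download("punkt")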

For me the best solution is:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
The alternative solution did not work for me:
python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

Related

MNIST dataset not found

I'm trying to run this command (which should download a dataset of images) in the terminal:
wget http://deeplearning.net/data/mnist/mnist.pkl.gz
This is the first step in a lot of deep learning guides, but I am getting:
bash: line 1: syntax error near unexpected token `newline'
bash: line 1: `<!DOCTYPE html>'
--2020-11-21 19:35:21-- http://deeplearning.net/data/mnist/mnist.pkl.gz
Resolving deeplearning.net (deeplearning.net)... 132.204.26.28
Connecting to deeplearning.net (deeplearning.net)|132.204.26.28|:80... connected.
HTTP request sent, awaiting response... 404 Not Found
2020-11-21 19:35:21 ERROR 404: Not Found.
Can you please help me understand the issue?
I think their servers are currently down.
Meanwhile, you can manually download the whole dataset from here
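Once you have mnist.pkl.gz from a mirror, loading it needs only the Python standard library (plus NumPy to unpickle the arrays); a minimal sketch, assuming the file sits in the working directory:
import gzip
import pickle

# The pickle was written under Python 2, hence encoding="latin1".
with gzip.open("mnist.pkl.gz", "rb") as f:
    train_set, valid_set, test_set = pickle.load(f, encoding="latin1")

# Each split is an (images, labels) pair of NumPy arrays.
images, labels = train_set
print(images.shape, labels.shape)  # expected: (50000, 784) (50000,)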

GitLab API - Unable to access file which is within a directory

I have a GitLab project which is set up as follows:
myserver.com/SuperGroup/SubGroup/SubGroupProject
The project tree consists of a top-level txt file and a txt file within a directory. I get the tree from the GitLab API with:
myserver.com/api/v4/projects/1/repository/tree?recursive=true
[{"id":"aba61143388f605d3fe9de9033ecb4575e4d9b69","name":"myDirectory","type":"tree","path":"myDirectory","mode":"040000"},{"id":"0e3a2b246ab92abac101d0eb2e96b57e2d24915d","name":"1stLevelFile.txt","type":"blob","path":"myDirectory/1stLevelFile.txt","mode":"100644"},{"id":"3501682ba833c3e50addab55e42488e98200b323","name":"top_level.txt","type":"blob","path":"top_level.txt","mode":"100644"}]
If I request the contents for top_level.txt they are returned without any issue via:
myserver.com/api/v4/projects/1/repository/files/top_level.txt?ref=master
However, I am unable to access myDirectory/1stLevelFile.txt with any API call I try, e.g.:
myserver.com/api/v4/projects/1/repository/files/"myDirectory%2F1stLevelFile.txt"?ref=master
and,
myserver.com/api/v4/projects/1/repository/files/"myDirectory%2F1stLevelFile%2Etxt"?ref=master
Results in:
Not Found The requested URL /api/v4/projects/1/repository/files/myDirectory/1stLevelFile.txt was not found on this server.
Apache/2.4.25 (Debian) Server at myserver.com Port 443
myserver.com/api/v4/projects/1/repository/files/"myDirectory/1stLevelFile.txt"?ref=master and,
myserver.com/api/v4/projects/1/repository/files?ref=master&path=myDirectory%2F1stLevelFile.txt
Results in:
error "404 Not Found"
The versions of the components are:
GitLab 10.6.3-ee
GitLab Shell 6.0.4
GitLab Workhorse v4.0.0
GitLab API v4
Ruby 2.3.6p384
Rails 4.2.10
postgresql 9.6.8
According to my research there was a similar bug which was fixed with the 10.0.0 update.
I also added my SSH key, although I doubt it has any effect, following this advice from someone with the same issue in PHP.
Solution:
I eventually solved it by adjusting the Apache installation on the server.
Just follow these instructions: https://gitlab.com/gitlab-org/gitlab-ce/issues/35079#note_76374269
Judging from your code, I assume you are using curl.
If that is the case, why are you adding double quotes to your file path? The docs do not include them.
Can you test it like this, please?
curl --request GET --header 'PRIVATE-TOKEN: XXXXXXXXX' myserver.com/api/v4/projects/1/repository/files/myDirectory%2F1stLevelFile%2Etxt?ref=master
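If you are calling the API from Python instead of curl, the same idea applies: the file path must be percent-encoded as a single URL segment. A minimal sketch using requests (host and token are placeholders):
import requests
from urllib.parse import quote

GITLAB_API = "https://myserver.com/api/v4"
TOKEN = "XXXXXXXXX"  # placeholder private token
FILE_PATH = "myDirectory/1stLevelFile.txt"

# quote(..., safe="") turns "/" into %2F, so the whole path is one segment,
# which is what GET /projects/:id/repository/files/:file_path expects.
url = f"{GITLAB_API}/projects/1/repository/files/{quote(FILE_PATH, safe='')}"
resp = requests.get(url, headers={"PRIVATE-TOKEN": TOKEN}, params={"ref": "master"})
print(resp.status_code)
print(resp.json())  # on success, includes the base64-encoded file content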

Python requests hangs when script launched through crontab

I've got a Python script which downloads data in JSON format over HTTP. If I run the script from the command line using the requests module, the HTTP connection is successful and the data is downloaded without any issues. But when I try to launch the script as a crontab job, the HTTP connection throws a timeout after a while. Could anyone please tell me what is going on here? I am currently downloading the data via a bash script first and then running the Python script from within that bash script. But this is nonsense! Thank you so much!
Using: 3.6.1 |Anaconda custom (64-bit)| (default, May 11 2017, 13:09:58) \n[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
P.S.: I haven't found any posts regarding this issue. If there is already an answer for this on some other post, then please accept my apologies.
This is an excerpt from my code. It times out when running requests.get(url):
try:
    response = requests.get(url)
    messages = response.json()["Messages"]
except requests.exceptions.Timeout:
    logging.critical("TIMEOUT received when connecting to HTTP server.")
except requests.exceptions.ConnectionError:
    logging.critical("CONNECTION ERROR received when connecting to HTTP server.")
I just found the answer to my question. I've defined the proxy being used and then used it like this in my code:
HTTP_PROXY="http://your_proxy:proxy_port"
PROXY_DICT={"http":HTTP_PROXY}
response = requests.get(url, proxies=PROXY_DICT)
Reference:
Proxies with Python 'Requests' module
Thank you all for your understanding. I guess I should have done a thorough search before posting. Sorry.
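One more observation on the excerpt above: requests has no default timeout, so the except Timeout branch only fires if a timeout is passed explicitly; without one, a stalled connection (for example, a proxy that cron's minimal environment does not know about) simply hangs. A hedged sketch combining the proxy fix with an explicit timeout (URL and proxy are placeholders):
import logging
import requests

url = "http://example.com/data.json"  # placeholder for the real endpoint
HTTP_PROXY = "http://your_proxy:proxy_port"  # same placeholder as above
PROXY_DICT = {"http": HTTP_PROXY, "https": HTTP_PROXY}

try:
    # timeout=(connect, read) in seconds; without it requests can wait forever.
    response = requests.get(url, proxies=PROXY_DICT, timeout=(10, 60))
    messages = response.json()["Messages"]
except requests.exceptions.Timeout:
    logging.critical("TIMEOUT received when connecting to HTTP server.")
except requests.exceptions.ConnectionError:
    logging.critical("CONNECTION ERROR received when connecting to HTTP server.")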

ERROR: CCurlFile::FillBuffer - Failed: HTTP response code said error(22) xbmc

I'm trying to host and distribute an xbmc addon on my site. I've made a repository which points to the directory where the addon zip file is. In the same folder I have an XML file which describes the addon, so the addon name and description are recognized by xbmc.
However, when trying to install the addon, it shows 0% download progress and then the progress indicator disappears, resulting in the following error in the xbmc.log file:
ERROR: CCurlFile::FillBuffer - Failed: HTTP response code said error(22)
According to the curl errors page, this happens when:
CURLE_HTTP_RETURNED_ERROR (22)
This is returned if CURLOPT_FAILONERROR is set TRUE and the HTTP
server returns an error code that is >= 400.
Based on that, I assume the error may be caused by misconfigured access permissions (perhaps I need to change some .htaccess configuration?).
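One quick way to see which status code the web server actually returns for the zip, before digging into .htaccess, is a small check from Python (the URL below is a placeholder for the addon zip on my site):
import requests

# Placeholder URL for the addon zip hosted on the site.
zip_url = "https://example.com/repo/my.addon/my.addon-1.0.0.zip"

resp = requests.head(zip_url, allow_redirects=True)
# Anything >= 400 here is what curl reports as CURLE_HTTP_RETURNED_ERROR (22).
print(resp.status_code, resp.headers.get("Content-Type"))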
Please help.
I solved this on my own eventually. Apparently the file structure was wrong; I needed to follow the file structure described in section 4.3 here in order for it to work.

wget and curl somehow modifying bencode file when downloading

Okay, so I have a bit of a weird problem going on that I'm not entirely sure how to explain... Basically, I am trying to decode a bencoded file (a .torrent file). I have tried 4 or 5 different scripts found via Google and S.O. with no luck (they fail with errors such as "this is not a dictionary", or similar output errors).
Now, I am downloading the .torrent file like so:
wget http://link_to.torrent file
# and have also tried with curl like so
curl -C - -O http://link_to.torrent
and am concluding that something is happening to the file when I download it in this way.
The reason I say this is that I found a site which will decode a .torrent file you upload and display the info contained in the file. However, when I upload a .torrent file that was downloaded using one of the methods above (rather than by clicking the link in a browser), it does not work either.
So, has anyone experienced a similar problem using one of these methods and found a solution, or can anyone explain why this is happening?
I can't find much online about it, nor do I know of a workaround that I can use on my server.
Update:
Okay, as suggested by #coder543, I compared the file size of the download through the browser vs. wget. They are not the same size: downloading with wget results in a smaller file, so clearly the problem is with wget and curl, not something else... Ideas?
Update 2:
Okay, so I have tried this a few times now and I am narrowing down the problem a little bit: it only seems to occur with torcache and torrage links. Links from other sites seem to work as expected... so here are some links and my results from the three different methods:
*** different sizes ***
http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
wget -> 7345, curl -> 7345, browser download -> 7376
*** same size ***
http://isohunt.com/torrent_details/224634397/south+park?tab=summary
wget -> 7491, curl -> 7491, browser download -> 7491
*** different sizes ***
http://torcache.net/torrent/B00BA420568DA54A90456AEE90CAE7A28535FACE.torrent?title=[kickass.to]the.simpsons.s24e12.hdtv.x264.lol.eztv
wget -> 4890, curl -> 4890, browser download -> 4985
*** same size ***
http://h33t.com/download.php?id=cc1ad62bbe7b68401fe6ca0fbaa76c4ed022b221&f=Game%20of%20Thrones%20S03E10%20576p%20HDTV%20x264-DGN%20%7B1337x%7D.torrent
wget -> 30632, curl -> 30632, browser download -> 30632
*** same size ***
http://dl7.torrentreactor.net/download.php?id=9499345&name=ubuntu-13.04-desktop-i386.iso
wget -> 32324, curl -> 32324, browser download -> 32324
*** different sizes ***
http://torrage.com/torrent/D7497C2215C9448D9EB421A969453537621E0962.torrent
wget -> 7856, curl -> 7556, browser download -> 7888
So it seems to work well on some sites, but not on sites which rely on torcache.net and torrage.com to supply the files. It would be nice if I could just use other sites that don't rely directly on the caches, but I am working with the bitsnoop API (which pulls all its data from torrage.com), so that's not really an option. Anyway, if anyone has any idea how to solve this problem, or steps to take towards finding a solution, it would be greatly appreciated!
Even if someone can just reproduce the results, it would be appreciated!
... My server is 12.04 LTS on a 64-bit architecture, and the laptop I tried the actual download comparison on is the same.
For the file retrieved using the command line tools I get:
$ file 6760F0232086AFE6880C974645DE8105FF032706.torrent
6760F0232086AFE6880C974645DE8105FF032706.torrent: gzip compressed data, from Unix
And sure enough, decompressing using gunzip will produce the correct output.
Looking into what the server sends, gives interesting clue:
$ wget -S http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
--2013-06-14 00:53:37-- http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
Resolving torrage.com... 192.121.86.94
Connecting to torrage.com|192.121.86.94|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Connection: keep-alive
Content-Encoding: gzip
So the server does report it's sending gzip compressed data, but wget and curl ignore this.
curl has a --compressed switch which will correctly decompress the data for you. This should be safe to use even for uncompressed files: it just tells the HTTP server that the client supports compression, and curl looks at the received headers to decide whether the response actually needs decompression or not.
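If you end up fetching these files from Python instead of wget/curl, requests (via urllib3) honours the Content-Encoding header and decompresses the gzip body transparently, so the saved file should match the browser download. A minimal sketch, reusing one of the torrage URLs from the question:
import requests

url = "http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent"

resp = requests.get(url)
# resp.content is already decompressed bencoded data when the server
# answered with Content-Encoding: gzip.
with open("download.torrent", "wb") as f:
    f.write(resp.content)

print(resp.headers.get("Content-Encoding"), len(resp.content))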
