wget and curl somehow modifying bencode file when downloading - linux

Okay, so I have a bit of a weird problem going on that I'm not entirely sure how to explain. Basically, I am trying to decode a bencoded file (a .torrent file). I have tried 4 or 5 different scripts found via Google and S.O. with no luck (they return errors like "this is not a dictionary" or similar output errors).
Now I am downloading the .torrent file like so:
wget http://link_to.torrent file
and have also tried with curl, like so:
curl -C - -O http://link_to.torrent
and am concluding that something is happening to the file when I download it in this way.
The reason for this is that I found a site which decodes an uploaded .torrent file online and displays the info contained in it. However, when I feed it a .torrent file downloaded using one of the methods above, rather than by just clicking the link in a browser, it does not work either.
So, has anyone experienced a similar problem using one of these methods and found a solution, or can anyone even explain why this is happening?
I can't find much online about it, nor do I know of a workaround that I can use on my server.
Update:
Okay, as suggested by @coder543, I compared the file size of the browser download vs. wget. They are not the same: downloading with wget results in a smaller file size, so clearly the problem is with wget & curl and not with something else. Ideas?
Update 2:
Okay, so I have tried this a few times now and am narrowing down the problem a little: it only seems to occur with torcache and torrage links. Links from other sites seem to work as expected. Here are some links and my results from the three different methods:
*** different sizes ***
http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
wget -> 7345, curl -> 7345, browser download -> 7376
*** same size ***
http://isohunt.com/torrent_details/224634397/south+park?tab=summary
wget -> 7491, curl -> 7491, browser download -> 7491
*** different sizes ***
http://torcache.net/torrent/B00BA420568DA54A90456AEE90CAE7A28535FACE.torrent?title=[kickass.to]the.simpsons.s24e12.hdtv.x264.lol.eztv
wget -> 4890, curl -> 4890, browser download -> 4985
*** same size ***
http://h33t.com/download.php?id=cc1ad62bbe7b68401fe6ca0fbaa76c4ed022b221&f=Game%20of%20Thrones%20S03E10%20576p%20HDTV%20x264-DGN%20%7B1337x%7D.torrent
wget -> 30632, curl -> 30632, browser download -> 30632
*** same size ***
http://dl7.torrentreactor.net/download.php?id=9499345&name=ubuntu-13.04-desktop-i386.iso
wget -> 32324, curl -> 32324, browser download -> 32324
*** different sizes ***
http://torrage.com/torrent/D7497C2215C9448D9EB421A969453537621E0962.torrent
wget -> 7856, curl -> 7556, browser download -> 7888
So it seems to work well on some sites, but not on sites that rely on torcache.net and torrage.com to supply the files. It would be nice if I could just use other sites that don't rely directly on the caches; however, I am working with the bitsnoop API (which pulls all its data from torrage.com), so that's not really an option. Anyway, if anyone has any idea how to solve this problem, or steps to take toward finding a solution, it would be greatly appreciated!
Even if anyone can just reproduce the results, it would be appreciated!
My server runs Ubuntu 12.04 LTS on 64-bit architecture, and the laptop I ran the download comparison on is the same.

For the file retrieved using the command-line tools, I get:
$ file 6760F0232086AFE6880C974645DE8105FF032706.torrent
6760F0232086AFE6880C974645DE8105FF032706.torrent: gzip compressed data, from Unix
And sure enough, decompressing it with gunzip produces the correct output.
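For instance, a minimal sketch using the file inspected above (gunzip -c ignores the missing .gz suffix and writes to stdout):
# Assumption: this is the file fetched with wget above.
gunzip -c 6760F0232086AFE6880C974645DE8105FF032706.torrent > decoded.torrent
file decoded.torrent   # should now report a BitTorrent file, not gzip data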
Looking into what the server sends gives an interesting clue:
$ wget -S http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
--2013-06-14 00:53:37-- http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent
Resolving torrage.com... 192.121.86.94
Connecting to torrage.com|192.121.86.94|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.0 200 OK
Connection: keep-alive
Content-Encoding: gzip
So the server does report that it's sending gzip-compressed data, but wget and curl ignore the Content-Encoding header and save the compressed bytes as-is.
curl has a --compressed switch which will correctly decompress the data for you. This should be safe to use even for uncompressed files: it just tells the HTTP server that the client supports compression, and curl then checks the received headers to see whether the response actually needs decompression.
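For example, using the torrage URL from the test above:
# --compressed adds Accept-Encoding to the request and transparently
# decompresses the body when the response is actually gzip-encoded.
curl --compressed -O http://torrage.com/torrent/6760F0232086AFE6880C974645DE8105FF032706.torrent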

Related

wget a raw file from GitHub from a private repo

I am trying to get a raw file from a private GitHub project using wget. If my project is public, it is very simple.
For a public repo: this is my repo URL (you don't have to click on it to answer this question):
https://github.com/samirtendulkar/profile_rest_api/blob/master/deploy/server_setup.sh
I click "Raw".
After I click Raw, my URL looks like this:
https://raw.githubusercontent.com/samirtendulkar/profile_rest_api/master/deploy/server_setup.sh (Notice only the word "raw" is added to the URL)
which is awesome. I then do:
ubuntu@ip-172-31-39-47:~$ wget https://raw.githubusercontent.com/samirtendulkar/profile_rest_api/master/deploy/server_setup.sh
When I do ls, it shows that the file has been downloaded:
ubuntu@ip-172-31-39-47:~$ ls
'server_setup.sh'
For a private repo, the raw file URL comes with a token:
https://github.com/samirtendulkar/my_project/blob/master/deploy/server_setup.sh
So far so good. Now when I click Raw, my URL changes and has a token in it, along with the "raw" prefix:
https://raw.githubusercontent.com/samirtendulkar/my_project/master/deploy/server_setup.sh?token=AkSv7SycSHacUNlSEZamo6hpMAI6ZhsLks5b4uFuwA%3D%3D
The URL has this extra parameter: ?token=AkSv7SycSHacUNlSEZamo6hpMAI6ZhsLks5b4uFuwA%3D%3D
My wget does not work. How do I fix this issue? By the way, when I say it does not work, I mean that instead of ls showing
ubuntu@ip-172-31-39-47:~$ ls
'server_setup.sh'
it shows something else, which keeps me from running further commands like
ubuntu@ip-172-31-39-47:~$ chmod +x server_setup.sh
and
ubuntu@ip-172-31-39-47:~$ sudo ./server_setup.sh
which I need to get the project onto AWS.
The token comes from the Personal Access Tokens section; you can find the details in GitHub's settings.
Under Personal Access Tokens, you can create one and pick the first scope, "repo", to give the token access control over your private repos.
The following line solved my problem, which was not being able to download the file. Hope this helps:
wget --header 'Authorization: token PERSONAL_ACCESS_TOKEN_HERE' https://raw.githubusercontent.com/repoOwner/repoName/master/folder/filename
You can use wget's -O option when you're downloading just one file at a time:
wget -O server_setup.sh https://raw.githubusercontent.com/samirtendulkar/my_project/master/deploy/server_setup.sh?token=AkSv7SycSHacUNlSEZamo6hpMAI6ZhsLks5b4uFuwA%3D%3D
The downside is that you have to know the output file name, but I think that's OK if I understand your question well.
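If you prefer curl, here is an equivalent sketch using the same placeholder token and path as the wget command further above:
# Assumption: the token has the "repo" scope described earlier.
curl -H 'Authorization: token PERSONAL_ACCESS_TOKEN_HERE' \
     -o server_setup.sh \
     https://raw.githubusercontent.com/repoOwner/repoName/master/folder/filename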

GitLab API - Unable to access file which is within a directory

I have a GitLab project which is set up as follows:
myserver.com/SuperGroup/SubGroup/SubGroupProject
The tree of the project consists of a top-level txt file and a txt file within a directory. I get the tree from the GitLab API with:
myserver.com/api/v4/projects/1/repository/tree?recursive=true
[{"id":"aba61143388f605d3fe9de9033ecb4575e4d9b69","name":"myDirectory","type":"tree","path":"myDirectory","mode":"040000"},{"id":"0e3a2b246ab92abac101d0eb2e96b57e2d24915d","name":"1stLevelFile.txt","type":"blob","path":"myDirectory/1stLevelFile.txt","mode":"100644"},{"id":"3501682ba833c3e50addab55e42488e98200b323","name":"top_level.txt","type":"blob","path":"top_level.txt","mode":"100644"}]
If I request the contents of top_level.txt, they are returned without any issue via:
myserver.com/api/v4/projects/1/repository/files/top_level.txt?ref=master
However, I am unable to access myDirectory/1stLevelFile.txt with any API call I try. E.g.:
myserver.com/api/v4/projects/1/repository/files/"myDirectory%2F1stLevelFile.txt"?ref=master
and,
myserver.com/api/v4/projects/1/repository/files/"myDirectory%2F1stLevelFile%2Etxt"?ref=master
Results in:
Not Found The requested URL /api/v4/projects/1/repository/files/myDirectory/1stLevelFile.txt was not found on this server.
Apache/2.4.25 (Debian) Server at myserver.com Port 443
myserver.com/api/v4/projects/1/repository/files/"myDirectory/1stLevelFile.txt"?ref=master and,
myserver.com/api/v4/projects/1/repository/files?ref=master&path=myDirectory%2F1stLevelFile.txt
Results in:
error "404 Not Found"
The versions of the components are:
GitLab 10.6.3-ee
GitLab Shell 6.0.4
GitLab Workhorse v4.0.0
GitLab API v4
Ruby 2.3.6p384
Rails 4.2.10
postgresql 9.6.8
According to my research, there was a similar bug that was fixed with the 10.0.0 update.
I also added my SSH key, following advice from a similar issue in PHP, although I doubt it has any effect.
Solution:
I eventually solved it by adjusting the Apache installation on the server.
Just follow these instructions: https://gitlab.com/gitlab-org/gitlab-ce/issues/35079#note_76374269
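For reference, the fix in those instructions boils down to letting encoded slashes (%2F) pass through Apache to GitLab undecoded; a minimal sketch, assuming Apache fronts your GitLab instance (file locations vary by installation):
# Add inside the GitLab <VirtualHost> block:
#   AllowEncodedSlashes NoDecode
# then validate the config and reload Apache:
sudo apachectl configtest && sudo systemctl reload apache2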
Judging by your code, I assume you are using curl.
If that is the case, why are you adding double quotes around your file path?
The docs do not contain them.
Can you test it like this, please?
curl --request GET --header 'PRIVATE-TOKEN: XXXXXXXXX' myserver.com/api/v4/projects/1/repository/files/myDirectory%2F1stLevelFile%2Etxt?ref=master
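If you only need the file contents, API v4 also has a raw endpoint; a sketch using the same encoded path (this returns the raw blob instead of a JSON document with base64 content):
curl --request GET --header 'PRIVATE-TOKEN: XXXXXXXXX' 'myserver.com/api/v4/projects/1/repository/files/myDirectory%2F1stLevelFile%2Etxt/raw?ref=master'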

HTTP Error 403 Forbidden - when downloading nltk data [duplicate]

This question already has answers here:
Getting 405 error while trying to download nltk data
(2 answers)
Closed 5 years ago.
I am facing a problem accessing NLTK data. I have tried nltk.download(); the GUI page comes up with an HTTP Error 403: Forbidden. I have also tried to install it from the command line, as provided here:
python -m nltk.downloader all
and get this error:
C:\Python36\lib\runpy.py:125: RuntimeWarning: 'nltk.downloader' found in sys.modules after import of package 'nltk', but prior to execution of 'nltk.downloader'; this may result in unpredictable behaviour
  warn(RuntimeWarning(msg))
[nltk_data] Error loading all: HTTP Error 403: Forbidden.
I have also gone through "How do I download NLTK data?" and "Failed loading english.pickle with nltk.data.load".
The problem is coming from the NLTK download server. If you look at the GUI's config, it points to this link:
https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml
If you access this link in the browser, you get this message:
Error 403 Forbidden.
Forbidden.
Guru Mediation:
Details: cache-lcy1125-LCY 1501134862 2002107460
Varnish cache server
So, I was going to file an issue on GitHub, but someone else already did that here: https://github.com/nltk/nltk/issues/1791
A workaround was suggested here: https://github.com/nltk/nltk/issues/1787.
Based on the discussion on GitHub:
It seems like GitHub is down or blocking access to the raw content on the repo.
The suggested workaround is to manually download as follows:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
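To check that the manual install worked, a quick sketch (note: depending on the NLTK version, you may need to move the contents of the archive's packages/ subdirectory up into nltk_data itself):
# Should print the punkt path rather than raising a LookupError.
python -c "import nltk; print(nltk.data.find('tokenizers/punkt'))"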
People also suggested using an alternative index as follows:
python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt
Go to nltk/downloader.py and change the default URL:
DEFAULT_URL = 'http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml'
to
DEFAULT_URL = 'http://nltk.github.com/nltk_data/'
For me the best solution is:
PATH_TO_NLTK_DATA=/home/username/nltk_data/
wget https://github.com/nltk/nltk_data/archive/gh-pages.zip
unzip gh-pages.zip
mv nltk_data-gh-pages/ $PATH_TO_NLTK_DATA
The alternative solution is not working for me:
python -m nltk.downloader -u https://pastebin.com/raw/D3TBY4Mj punkt

File downloaded by curl but not by node.js

So I'm trying to download, through Node.js, a file that opens fine in the browser and even downloads fine in tools like curl.
But Node.js just fails to download the file for some reason. I tried downloading the file through the request module in Node and through a Node CLI module called download-cli. Both of them fail with either a 400 or a 404 response, yet the file downloads fine through regular tools like curl.
What could be the issue? I have tried setting the user agent to that of Firefox (where it opens just fine), but that doesn't do the trick. I'm assuming the problem isn't the user agent anyway, since curl works with its own non-browser user agent.
The URL in question can be any URL from alicdn, but let's take this one as an example:
https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-часы-поддержка-синхронизации-notifier-sim-карты-подключение-bluetooth-для-android-apple-iphone.jpg_640x640.jpg
Here's the response from running the above URL through the Node download-cli tool and through Invoke-WebRequest in PowerShell.
PS C:\code> download https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-часы-поддержка-синхронизации-notifier-sim-карты-подключение-bluetooth-для-android-apple-iphone.jpg_640x640.jpg
Couldn't connect to https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-часы-поддержка-синхронизации-notifier-sim-карты-подключение-bluetooth-для-android-apple-iphone.jpg_640x640.jpg (404)
PS C:\code> curl https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-часы-поддержка-синхронизации-notifier-sim-карты-подключение-bluetooth-для-android-apple-iphone.jpg_640x640.jpg
StatusCode : 200
StatusDescription : OK
Content : {255, 216, 255, 224...}
RawContent : HTTP/1.1 200 OK
X-Application-Context: fileserver2-download:prod:7001
From-Req-Dns-Type: NA,NA
SERVED-FROM: 72.247.178.95
Connection: keep-alive
Network_Info: DE_FRANKFURT_16509
Timing-Allow-Ori...
Headers : {[X-Application-Context, fileserver2-download:prod:7001], [From-Req-Dns-Type, NA,NA], [SERVED-FROM, 72.247.178.95],
[Connection, keep-alive]...}
RawContentLength : 114927
Okay, so I tried downloading the file through Node's native http module, through the popular request module, AND through a Node-based CLI tool called download-cli. Every one of them had the same response.
So I fired up Wireshark to see exactly where the requests differ, and it turns out that tools like curl and Invoke-WebRequest escape the path before making a GET request, but Node's native module doesn't. That was the only difference. Using the escaped URL works fine.
Invoke-WebRequest's GET path:
GET /kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-%D1%87%D0%B0%D1%81%D1%8B-%D0%BF%D0%BE%D0%B4%D0%B4%D0%B5%D1%80%D0%B6%D0%BA%D0%B0-%D1%81%D0%B8%D0%BD%D1%85%D1%80%D0%BE%D0%BD%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D0%B8-notifier-sim-%D0%BA%D0%B0%D1%80%D1%82%D1%8B-%D0%BF%D0%BE%D0%B4%D0%BA%D0%BB%D1%8E%D1%87%D0%B5%D0%BD%D0%B8%D0%B5-bluetooth-%D0%B4%D0%BB%D1%8F-android-apple-iphone.jpg_640x640.jpg HTTP/1.1
Node's GET path:
GET /kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-G0AK-?>445#6:0-A8=E#>=870F88-notifier-sim-:0#BK-?>4:;NG5=85-bluetooth-4;O-android-apple-iphone.jpg_640x640.jpg HTTP/1.1
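A sketch of the resulting workaround: percent-encode the URL (with the standard encodeURI built-in) before handing it to Node's native https module:
# encodeURI percent-encodes the Cyrillic path segments the same way
# curl and Invoke-WebRequest do before sending the GET.
node -e '
const https = require("https");
https.get(encodeURI(process.argv[1]), res => { console.log(res.statusCode); res.resume(); });
' "https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-часы-поддержка-синхронизации-notifier-sim-карты-подключение-bluetooth-для-android-apple-iphone.jpg_640x640.jpg"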
Why didn't you do it like this?
$url='https://ae01.alicdn.com/kf/HTB1ftVmPVXXXXXUXVXXq6xXFXXXG/Langtek-smart-watch-gt12-%D1%87%D0%B0%D1%81%D1%8B-%D0%BF%D0%BE%D0%B4%D0%B4%D0%B5%D1%80%D0%B6%D0%BA%D0%B0-%D1%81%D0%B8%D0%BD%D1%85%D1%80%D0%BE%D0%BD%D0%B8%D0%B7%D0%B0%D1%86%D0%B8%D0%B8-notifier-sim-%D0%BA%D0%B0%D1%80%D1%82%D1%8B-%D0%BF%D0%BE%D0%B4%D0%BA%D0%BB%D1%8E%D1%87%D0%B5%D0%BD%D0%B8%D0%B5-bluetooth-%D0%B4%D0%BB%D1%8F-android-apple-iphone.jpg_640x640.jpg'
Invoke-WebRequest -Uri $url -OutFile C:\temp\android-apple-iphone.jpg_640x640.jpg

Shell script wget download from S3 - Forbidden error

I am trying to download a file from Amazon's S3 using a shell script and the wget command. The file in question has public permissions, and I am able to download it using a standard browser. So far this is what I have in the script:
wget --no-check-certificate -P /tmp/soDownloads https://s3-eu-west-1.amazonaws.com/myBucket/myFolder/myFile.so
cp /tmp/soDownloads/myFile.so /home/hadoop/lib/native
The problem is a bit odd to me. While I am able to download the file directly from the terminal (just typing the wget command), an error pops up when I try to execute the shell script that contains the very same command line (the script is run with sh myScript.sh).
--2014-06-26 07:33:57-- https://s3-eu-west-1.amazonaws.com/myBucket/myFolder/myFile.so%0D
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... XX.XXX.XX.XX
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|XX.XXX.XX.XX|:443... connected.
WARNING: cannot verify s3-eu-west-1.amazonaws.com's certificate, issued by ‘/C=US/O=VeriSign, Inc./OU=VeriSign Trust Network/OU=Terms of use at https://www.verisign.com/rpa (c)10/CN=VeriSign Class 3 Secure Server CA - G3’:
Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 403 Forbidden
2014-06-26 07:33:57 ERROR 403: Forbidden.
Now, I am aware this could just be a beginner error on my side, but I am not able to detect any misspelling or error of any kind. I would appreciate any help you can provide to solve this issue.
As a note, I am running the script on an EC2 instance provided by Amazon's Elastic MapReduce framework, in case that has something to do with the issue.
I suspect that the editor you used to write that script has left you a little "gift."
The command line isn't the same. Look closely:
--2014-06-26 07:33:57-- ... myFolder/myFile.so%0D
^^^ what's this about?
That's the URL encoding for ASCII CR (carriage return), decimal 13, hex 0x0D. You have an embedded carriage return character in the script that shouldn't be there; wget sees it as the last character in the URL and sends it to S3.
If you view the file with the less utility or an editor like vi, this stray character might show up as ^M... or, if they're all over the file, when you open it with vi you should see this at the bottom of the screen:
"foo" [dos] 1L, 5C
^^^^^
If you see that, then inside vi...
:set ff=unix[enter]
:x[enter]
...will convert the line endings, and save the file in what should be a usable format, if this is really the problem you're having.
If you're editing files on Windows, you'll want to use an editor that understands how to save files with Unix newlines, not carriage returns.
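If you'd rather fix the file from the shell than inside vi, either of these sketches works:
# Strip carriage returns, writing to a new file:
tr -d '\r' < myScript.sh > myScript_unix.sh
# Or, if the dos2unix utility is installed, convert in place:
dos2unix myScript.sh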
