Is it possible to read only the first N bytes from an HTTP server using a Linux command?

Here is the question.
Given the URL http://www.example.com, can we read only the first N bytes of the page?
Using wget, we can download the whole page.
Using curl, there is -r: 0-499 specifies the first 500 bytes. That seems to solve the problem, but:
You should also be aware that many HTTP/1.1 servers do not have this feature enabled, so that when you attempt to get a range, you'll instead get the whole document.
Using urllib in Python: there is a similar question here, but according to Konstantin's comment, is that really true?
Last time I tried this technique it failed because it was actually impossible to read only a specified amount of data from the HTTP server, i.e. you implicitly read the whole HTTP response and only then read the first N bytes out of it. So in the end you ended up downloading the whole 1 GB malicious response.
So the question is: how can we read the first N bytes from an HTTP server in practice?
Regards & Thanks

You can do it natively with the following curl command (no need to download the whole document). According to the curl man page:
RANGES
HTTP 1.1 introduced byte-ranges. Using this, a client can request to get only one or more subparts of a specified document. curl
supports this with the -r flag.
Get the first 100 bytes of a document:
curl -r 0-99 http://www.get.this/
Get the last 500 bytes of a document:
curl -r -500 http://www.get.this/
curl also supports simple ranges for FTP files. In that case you can only specify the start and stop positions.
Get the first 100 bytes of a document using FTP:
curl -r 0-99 ftp://www.get.this/README
It works for me even with a Java web app deployed to GigaSpaces.
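A quick way to check whether a given server actually honours ranges (a sketch; the URL is a placeholder) is to look at the status code it returns for a range request: 206 Partial Content means the range was served, while 200 means you are getting the whole document anyway.
curl -s -o /dev/null -w '%{http_code}\n' -r 0-99 http://www.example.com/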

curl <url> | head -c 499
or
curl <url> | dd bs=1 count=499
should do it.
There are also simpler utils with perhaps broader availability, like
netcat host 80 <<"HERE" | dd bs=1 count=499 of=output.fragment
GET /urlpath/query?string=more&bloddy=stuff
HERE

You should also be aware that many HTTP/1.1 servers do not have this feature enabled, so that when you attempt to get a range, you'll instead get the whole document.
You will have to fetch the whole page anyway, so you can get it with curl and pipe it to head, for example.
head
-c, --bytes=[-]N
print the first N bytes of each file; with the leading '-', print all but the last N bytes of each file

I came here looking for a way to measure the server's processing time, which I thought I could do by telling curl to stop downloading after 1 byte or so.
For me, the better solution turned out to be to make a HEAD request, since this usually lets the server process the request as normal but does not return a response body:
time curl --head <URL>
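If you want a finer breakdown than wrapping the command in time, curl can report its own timings via -w (a sketch; the variable names assume a reasonably recent curl, and the URL is a placeholder):
curl -s -o /dev/null --head -w 'time_starttransfer: %{time_starttransfer}s  time_total: %{time_total}s\n' http://www.example.com/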

Make a socket connection. Read the bytes you want. Close, and you're done.
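A minimal sketch of that idea using bash's built-in /dev/tcp, so no extra tools are needed (host, path, and byte count are placeholders; note this reads the raw response, headers included):
exec 3<>/dev/tcp/www.example.com/80
printf 'GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n' >&3
head -c 500 <&3
exec 3<&-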

Related

Linux bash script to get own internet IP address

I know I've gotten quite rusty when it comes to bash coding, especially the more elaborate trickery needed for the awk or sed parts.
I do have a script that logs the IP address currently in use for the interwebs.
It gets that by using either wget -qO - URL or lynx -dump URL.
The easiest one was a site that returned the IP address in plain text and nothing else. Unfortunately, that site no longer exists.
The code was as simple as can be:
IP=$(wget -qO - http://cfaj.freeshell.org/ipaddr.cgi)
But alas! That now returns nothing because the site is gone, as lynx can tell us:
$ lynx -dump http://cfaj.freeshell.org/ipaddr.cgi
Looking up cfaj.freeshell.org
Unable to locate remote host cfaj.freeshell.org.
Alert!: Unable to connect to remote host.
lynx: Can't access startfile http://cfaj.freeshell.org/ipaddr.cgi
Some other sites I used to retrieve for the same purpose no longer work either.
The one I want to use now is a German-language one; not that I care one way or the other, it could be in Greek or Mandarin for all I care. I only want the IP address itself extracted, but like I said, my coding skills have gotten rusty.
Here is the relevant part of what lynx -dump returns
[33]powered by
Ihre IP-Adresse lautet:
178.24.x.x
Ihre IPv6-Adresse lautet:
Ihre System-Informationen:
when running it as follows:
lynx -dump https://www.wieistmeineip.de/
Now, I need either awk or sed to extract the 178.24.x.x part. (I know it can be done with Python or Perl as well, but neither is part of a standard install on my Linux, while awk and sed are.)
Since the script is there to extract the IP address, sed or awk needs to do the following:
Search for "Ihre IP-Adresse lautet:"
Skip the next line.
Skip the whitespace at the beginning.
Return only what is left of that line (without the LF at the end).
In the example above (which shows only the relevant part of the lynx dump; the whole dump is much larger, but everything above and below it is irrelevant), "178.24.x.x" is what should be returned.
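In other words, something roughly like this is what I am after (an untested sketch based on the dump above, so I have no idea how robust it is against layout changes):
lynx -dump https://www.wieistmeineip.de/ | awk '/Ihre IP-Adresse lautet:/ {found=1; next} found && NF {print $1; exit}'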
Any help to get my log-ip script back into working order is greatly appreciated.
I have currently collected some other working URLs that report back one's own internet IP. Any of these can also be used, but the area around the reported IP will differ from the example above. These are:
https://meineipinfo.de/
http://www.wie-ist-meine-ip.net/
https://www.dein-ip-check.de/
https://whatismyipaddress.com/
https://www.whatismyip.org/
https://www.whatismyip.net/
https://mxtoolbox.com/whatismyip/
https://www.whatismyip.org/my-ip-address
https://meineipadresse.de/
Even DuckDuckGo returns the IP address when asked, e.g.: https://duckduckgo.com/?q=ip+address&ia=answer
At least I know of no way of getting my own public IP address without retrieving an outside URL that reports that very IP address back to me.
You can do:
wget -O - v4.ident.me 2>/dev/null && echo
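If curl is what you have at hand, the equivalent would presumably be:
curl -s v4.ident.me && echo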
So, if you have a VM at some cloud provider, you can solve this easily. I wrote a small Go app that echoes back the HTTP request it receives. For instance:
$ curl 167.99.63.182:8888
Method ->
GET
Protocol ->
HTTP/1.1
Headers ->
User-Agent: [curl/7.54.0]
Accept: [*/*]
Content length (in Bytes) ->
0
Remote address ->
179.XXXXX
Payload
####################
####################
The remote address is the address the app received the request from, hence your IP.
And in case you are wondering: yes, 167.99.63.182 is the IP of the server, and you can curl it right now and check. I am disclosing the IP because I have been bombarded by brute-force attacks for as long as I can remember anyway, and the machine does not hold anything worth breaking into.
Not exactly without relying on external services, but you could use dig to reach out to the resolver at opendns.com:
dig +short myip.opendns.com @resolver1.opendns.com
I think this is easier to integrate into a script.

How can I query all records with timestamp older than 30 days using cURL?

I want to fetch all records (from Solr) with a timestamp older than 30 days via a cURL command.
What I have tried:
curl -g "http://localhost:8983/solr/input_records/select?q=timestamp:[* TO NOW/DAY-30DAYS]"
I do not understand why this does not work; it simply returns nothing. If I replace '[* TO NOW/DAY-30DAYS]' with an actual value, it retrieves that record.
Additional relevant information: this is how to delete all records older than 30 days (it works). Again, I do not want to delete the data, just fetch it.
curl -g "http://localhost:8983/solr/input_records/update?commit=true" -H "Content-Type: text/xml" --data-binary "<delete><query>timestamp:[* TO NOW/DAY-30DAYS]</query></delete>"
Thanks in advance!
This is happening because your request is not properly URL-encoded. Most likely the problem is the spaces: they need to be replaced with %20, and the same applies to other special characters.
Try this:
curl -g "http://localhost:8983/solr/input_records/select?q=timestamp:[*%20TO%20NOW/DAY-30DAYS]
A further addition to Mysterion's answer:
Since you are doing this with curl, you are running into the URL-encoding issue.
If you just enter
http://localhost:8983/solr/input_records/select?q=timestamp:[* TO NOW/DAY-30DAYS]
in your browser (Chrome or others),
the URL encoding is handled automatically by the browser and you get your response as expected.
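If you would rather not encode the query by hand, curl can do it for you (a sketch, assuming a curl build that has --data-urlencode; -G makes curl send the encoded data as GET query parameters):
curl -G "http://localhost:8983/solr/input_records/select" --data-urlencode "q=timestamp:[* TO NOW/DAY-30DAYS]"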

How to set max request body size in arangodb server?

I'm trying to do an 'arangorestore' operation on my local server. When I start it, I see:
ERROR internal error: got error from server: HTTP 413 (Request Entity Too Large)
How do I configure the server so that 'arangorestore' works properly?
ArangoDB raises a 413 error when the server receives a request body bigger than the maximum allowed value (512 MB). arangorestore has a --batch-size option; it should cap the batch size at the maximum allowed value automatically, but you can use the option explicitly to set a lower batch size.
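For example, something along these lines (a sketch; the dump directory name is a placeholder, and the value for --batch-size, which if I recall correctly is given in bytes, is just an example):
arangorestore --input-directory "dump" --batch-size 16777216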
If you're doing line-based imports of a large file, you can easily split the file into sub-files with a fixed number of lines each using the Unix split command:
split -l 1000000 my-huge-import.json import-split
split has other options as well. Then you can loop over the output files, named import-splitaa, import-splitab, etc., and call curl on each.
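A rough sketch of that loop (assuming a local server on the default port and the ArangoDB HTTP import API; the collection name is a placeholder):
for f in import-split*; do
  curl --data-binary @"$f" "http://localhost:8529/_api/import?collection=mycollection&type=documents"
done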

curl chunky parser error

"Received problem 3 in the chunky parser"
I can't for the life of me find what "problem 3" in curl refers to. I'm sure it has to do with the format of the chunk I'm sending from the app server to curl, but I can't figure out what is wrong with the chunk because I can't tell what "problem 3" is.
Any ideas?
The number you see there is CHUNKE_BAD_CHUNK from the CHUNKcode enum in lib/http_chunks.h in the libcurl source code. From a quick look, it seems to be used mostly when a CR or LF is missing from the chunked data.
I would recommend you investigate the raw HTTP content stream to see what the problem is with the chunked format. RFC 2616, section 3.6.1, documents it.
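One way to inspect the raw stream is to have curl dump it without decoding it (a sketch; --raw disables content and transfer decoding, and the URL is a placeholder):
curl --raw -s http://www.example.com/chunked-endpoint | hexdump -C | head -n 40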
There is a post similar to yours. Again, I'm not sure what you're trying to send across, so I cannot point out the problem, but have a look at this:
Why is this warning being shown: "Received problem 2 in the chunky parser"?
Hope this helps!
So, I ran into this with a CGI program.
Long story short, the CGI script was written in Python; it printed the chunk header using the length of the string, then sent the data to the client with:
print data,
This appends a space, making the data one byte longer than the chunk header says it is. I fixed this by changing that line to:
stdout.write( data )
A hexdump of the data out of the CGI script was the tool that finally told me what was going on.

Compare two websites and see if they are "equal?"

We are migrating web servers, and it would be nice to have an automated way to check some of the basic site structure to see if the rendered pages are the same on the new server as the old server. I was just wondering if anyone knew of anything to assist in this task?
Get the formatted output of both sites (here we use w3m, but lynx can also work):
w3m -dump http://google.com 2>/dev/null > /tmp/1.html
w3m -dump http://google.de 2>/dev/null > /tmp/2.html
Then use wdiff; it can give you a percentage of how similar the two texts are.
wdiff -nis /tmp/1.html /tmp/2.html
It can also be easier to see the differences using colordiff.
wdiff -nis /tmp/1.html /tmp/2.html | colordiff
Excerpt of output:
Web Images Vidéos Maps [-Actualités-] Livres {+Traduction+} Gmail plus »
[-iGoogle |-]
Paramètres | Connexion
Google [hp1] [hp2]
[hp3] [-Français-] {+Deutschland+}
[ ] Recherche
avancéeOutils
[Recherche Google][J'ai de la chance] linguistiques
/tmp/1.html: 43 words 39 90% common 3 6% deleted 1 2% changed
/tmp/2.html: 49 words 39 79% common 9 18% inserted 1 2% changed
(It actually served google.com in French... funny.)
The "common" percentage values show how similar the two texts are. Plus, you can easily see the differences word by word (instead of line by line, which can be cluttered).
The catch is how to check the 'rendered' pages. If the pages don't have any dynamic content, the easiest way is to generate hashes for the files using the md5sum or sha1sum commands and check them against the new server.
If the pages have dynamic content, you will have to download the site using a tool like wget:
wget --mirror http://thewebsite/thepages
and then use diff as suggested by Warner, or do the hash comparison again. I think diff may be the best way to go, since even a one-character change will change the hash.
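A rough sketch of the hash comparison (directory names are placeholders):
( cd old-mirror && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/old.md5
( cd new-mirror && find . -type f -exec md5sum {} + | sort -k 2 ) > /tmp/new.md5
diff /tmp/old.md5 /tmp/new.md5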
I've created some PHP code that does what Weboide suggests here. Thanks, Weboide!
The paste is here:
http://pastebin.com/0V7sVNEq
Using the open source tool recheck-web (https://github.com/retest/recheck-web), there are two possibilities:
Create a Selenium test that checks all of your URLs on the old server, creating Golden Masters. Then run that test against the new server and see how they differ.
Use the free and open-source Chrome extension (https://github.com/retest/recheck-web-chrome-extension), which internally uses recheck-web to do the same: https://chrome.google.com/webstore/detail/recheck-web-demo/ifbcdobnjihilgldbjeomakdaejhplii
For both solutions you currently need to list all relevant URLs manually. In most situations, this shouldn't be a big problem. recheck-web will compare the rendered websites and show you exactly where they differ (e.g. a different font, different meta tags, even different link URLs). And it gives you powerful filters to let you focus on what is relevant to you.
Disclaimer: I have helped create recheck-web.
Copy the files to the same server in /tmp/directory1 and /tmp/directory2 and run the following command:
diff -r /tmp/directory1 /tmp/directory2
For all intents and purposes, you can put them in your preferred location with your preferred naming convention.
Edit 1
You could potentially use lynx -dump or wget and run a diff on the results.
Short of rendering each page, taking screen captures, and comparing those screenshots, I don't think it's possible to compare the rendered pages.
However, it is certainly possible to compare the downloaded website after downloading recursively with wget.
wget [option]... [URL]...
-m
--mirror
Turn on options suitable for mirroring. This option turns on recursion and time-stamping, sets infinite recursion depth and keeps FTP directory listings. It is currently equivalent to -r -N -l inf --no-remove-listing.
The next step would then be to do the recursive diff that Warner recommended.
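Putting it together, a rough sketch (hostnames and target directories are placeholders):
wget --mirror -p --convert-links -P old http://old-server.example.com/
wget --mirror -p --convert-links -P new http://new-server.example.com/
diff -r old/old-server.example.com new/new-server.example.com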
