Split Header and Content in curl - linux

curl -L -i google.com
I want to split the HEADER and CONTENT of the response into two variables.
curl -I google.com
curl -L google.com
I can't use these two because I'm going to use it with 10000+ links.
Both header and content can contain three or more blank lines, so splitting on blank lines won't always work.
I found the answer:
b=$(curl -LsD h google.com)
h=$(<h)
echo "$h$b"
This code works too
curl -sLi google.com |
awk -v bl=1 'bl{bl=0; h=($0 ~ /HTTP\/1/)} /^\r?$/{bl=1} {print $0>(h?"header":"body")}'
header=$(<header)
body=$(<body)
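A minimal sketch of the same awk idea, wrapped in a function so it can be tried without a network connection. The helper name split_response, the temp directory, and the canned response are assumptions for illustration; the only change to the logic is that the status-line test is anchored as ^HTTP/ followed by a digit, so that an "HTTP/2 200" status line would also be recognized.

```shell
#!/bin/sh
# split_response DIR: hypothetical helper, not part of curl.
# Reads a `curl -sLi`-style stream on stdin and writes DIR/header and DIR/body.
split_response() {
    dir=$1
    : > "$dir/header"    # make sure both files exist even if one stays empty
    : > "$dir/body"
    awk -v hdr="$dir/header" -v body="$dir/body" '
        BEGIN   { bl = 1 }                             # start of input counts as "after a blank line"
        bl      { bl = 0; h = ($0 ~ /^HTTP\/[0-9]/) }  # a status line opens a header block
        /^\r?$/ { bl = 1 }                             # blank line: the next line may switch blocks
                { print > (h ? hdr : body) }
    '
}

# Canned response standing in for `curl -sLi URL`: one redirect, then a body
# that itself contains blank lines.
tmp=$(mktemp -d)
printf 'HTTP/1.1 301 Moved Permanently\r\nLocation: http://www.example.com/\r\n\r\nHTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html>\n\n\n<p>hi</p>\n</html>\n' |
    split_response "$tmp"

header=$(cat "$tmp/header")
body=$(cat "$tmp/body")
printf '%s\n' "$body"
```

For real use, the printf would be replaced by curl -sLi "$url". One caveat remains: a body line that starts with HTTP/ and a digit directly after a blank line would still be filed under the header.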

Related

Combining two bash commands

I found this code
host raspberrypi | grep 'address' | cut -d' ' -f4
which gives the Pi's IP address,
and this
wget --post-data="PiIP=1.2.3.4" http://dweet.io/dweet/for/cycy42
which sends 1.2.3.4 off to a dweet.io stream.
How can I get the output of the first command to replace the 1.2.3.4 in the second, please?
Save the output of the first command in a variable:
ip=$(host raspberrypi | grep 'address' | cut -d' ' -f4)
wget --post-data="PiIP=$ip" http://dweet.io/dweet/for/cycy42
By the way, if your Raspberry Pi is running Raspbian,
there is a much cleaner way to get the IP address:
hostname -I
Simplifying the commands to:
ip=$(hostname -I)
wget --post-data="PiIP=$ip" http://dweet.io/dweet/for/cycy42
Making that a one-liner:
wget --post-data="PiIP=$(hostname -I)" http://dweet.io/dweet/for/cycy42
UPDATE
So it seems hostname -I gives slightly different output for you.
In that case you can use this:
ip=$(hostname -I | awk '{print $1}')
To make it a one-liner, you can insert this into the second line just like I did in the earlier example.
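As a rough illustration of why the awk step helps: hostname -I can print several addresses on one line, and command substitution would pass all of them along. The sample string below is an assumed stand-in for that output, and the wget call is left as a comment so the snippet runs offline.

```shell
#!/bin/sh
# Assumed sample of `hostname -I` output: several addresses on one line.
sample='192.168.1.23 10.0.0.5 fd00::1 '

# Keep only the first whitespace-separated field.
ip=$(printf '%s\n' "$sample" | awk '{print $1}')

# Build the POST body exactly as the wget command would receive it.
post_data="PiIP=$ip"
echo "$post_data"

# On the Pi itself this would become:
#   wget --post-data="PiIP=$(hostname -I | awk '{print $1}')" http://dweet.io/dweet/for/cycy42
```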

Linux Terminal: Finding number of lines longer than x

I come to you with a problem that has me stumped. I'm attempting to find the number of lines in a file (in this case, the html of a certain site) longer than x (which, in this case, is 80).
For example: google.com (by checking with wc -l) has 7 lines, two of which are longer than 80 (checking with awk '{print NF}'). I'm trying to find a way to check how many lines are longer than 80, and then output that number.
My command so far looks like this:
wget -qO - google.com | awk '{print NF}' | sort -g
I was thinking of just counting which lines have values larger than 80, but I can't figure out the syntax for that. Perhaps 'awk'? Maybe I'm going about this the clumsiest way possible and have hit a wall for a reason.
Thanks for the help!
Edit: The unit of measurement is characters. The command should be able to find the number of lines with more than 80 characters in them.
If you want the number of lines that are longer than 80 characters (your question is missing the units), grep is a good candidate:
grep -c '.\{80\}'
So:
wget -qO - google.com | grep -c '.\{80\}'
outputs 6.
Blue Moon's answer (in its original version) will print the number of fields, not the length of the line. Since the default field separator in awk is ' ' (space) you will get a word count, not the length of the line.
Try this:
wget -qO - google.com | awk '{ if (length($0) > 80) count++; } END{print count+0}'
Using awk:
wget -qO - google.com | awk 'NF>80{count++} END{print count}'
This gives 2 as output as there are two lines with more than 80 fields.
If you mean number of characters (I presumed fields based on what you have in the question) then:
wget -qO - google.com | awk 'length($0)>80{c++} END{print c}'
which gives 6.
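The approaches above can be compared on a small local file instead of a live page. This is only a sketch: count_long_lines and make_line are hypothetical helpers and the sample file is made up. It also shows a boundary detail: '.\{80\}' matches lines with at least 80 characters, so "more than 80" corresponds to '.\{81\}'.

```shell
#!/bin/sh
# count_long_lines MAX: hypothetical helper; prints how many stdin lines
# have more than MAX characters.
count_long_lines() {
    awk -v max="$1" 'length($0) > max { c++ } END { print c + 0 }'
}

# make_line N: print one line of N "x" characters (test data only).
make_line() {
    awk -v n="$1" 'BEGIN { while (n-- > 0) s = s "x"; print s }'
}

sample=$(mktemp)
{ make_line 5; make_line 80; make_line 81; make_line 120; } > "$sample"

n_awk=$(count_long_lines 80 < "$sample")   # strictly more than 80 characters
n_grep81=$(grep -c '.\{81\}' "$sample")    # at least 81, i.e. more than 80
n_grep80=$(grep -c '.\{80\}' "$sample")    # at least 80, so the 80-char line counts too
echo "$n_awk $n_grep81 $n_grep80"
rm -f "$sample"
```

With live input the helper would be used as wget -qO - google.com | count_long_lines 80.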

wget spider returns all URLs twice -- where is the bug?

I was looking for a script to create a URL list for a sitemap and found this one:
wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
> urllist.txt
The result is:
http://sld.tld/
http://sld.tld/
http://sld.tld/home.html
http://sld.tld/home.html
http://sld.tld/news.html
http://sld.tld/news.html
...
Every URL entry is saved twice. How should the script be changed to fix this?
If you look at the output of wget when you use the --spider flag, it'll look something like:
Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Connecting to www.google.com|173.194.75.103|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Remote file exists and could contain links to other resources -- retrieving.
--2013-04-12 22:01:03-- http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
HTTP request sent, awaiting response... 200 OK
It checks if the link is there (thus prints out a --), then it has to download it to look for additional links (thus the second --). This is why it shows up (at least twice) when you use --spider.
Compare that to without --spider:
Location: http://www.google.com/intl/en/about/products/ [following]
--2013-04-12 22:00:49-- http://www.google.com/intl/en/about/products/
Reusing existing connection to www.google.com:80.
So you only get one line that starts with --.
You can remove the --spider flag but you could still get duplicates. If you really don't want any duplicates, add a | sort | uniq to your command:
wget --spider --force-html -r -l1 http://sld.tld 2>&1 \
| grep '^--' | awk '{ print $3 }' \
| grep -v '\.\(css\|js\|png\|gif\|jpg\|ico\|txt\)$' \
| sort | uniq > urllist.txt
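If the crawl order matters, sort | uniq can be swapped for an awk filter that keeps only the first occurrence of each line. A small sketch on a made-up log file (the log lines below are an assumed, trimmed stand-in for wget's output, and grep -E is used here only to avoid relying on the \| extension):

```shell
#!/bin/sh
# Stand-in for the log of `wget --spider --force-html -r -l1 ... 2>&1`.
log=$(mktemp)
cat > "$log" <<'EOF'
Spider mode enabled. Check if remote file exists.
--2013-04-12 22:01:03--  http://sld.tld/news.html
HTTP request sent, awaiting response... 200 OK
--2013-04-12 22:01:03--  http://sld.tld/news.html
--2013-04-12 22:01:04--  http://sld.tld/home.html
--2013-04-12 22:01:04--  http://sld.tld/home.html
--2013-04-12 22:01:05--  http://sld.tld/style.css
EOF

# Same pipeline as above, but deduplicating in first-seen order.
urls=$(grep '^--' "$log" | awk '{ print $3 }' |
    grep -Ev '\.(css|js|png|gif|jpg|ico|txt)$' |
    awk '!seen[$0]++')
printf '%s\n' "$urls"
rm -f "$log"
```

The expression !seen[$0]++ is true only the first time a given line appears, so later repeats are dropped without sorting.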

Curl "write out" value of specific header

I am currently writing a bash script and I'm using curl. What I want to do is get one specific header of a response.
Basically I want this command to work:
curl -I -w "%{etag}" "server/some/resource"
Unfortunately it seems as if the -w, --write-out option only has a set of variables it supports and can not print any header that is part of the response. Do I need to parse the curl output myself to get the ETag value or is there a way to make curl print the value of a specific header?
Obviously something like
curl -sSI "server/some/resource" | grep 'ETag:' | sed -r 's/.*"(.*)".*/\1/'
does the trick, but it would be nicer to have curl filter the header.
The variables specified for "-w" are not directly connected to the http header.
So it looks like you have to "parse" them on your own:
curl -I "server/some/resource" | grep -Fi etag
You can print a specific header with a single sed or awk command, but HTTP headers use CRLF line endings.
curl -sI stackoverflow.com | tr -d '\r' | sed -En 's/^Content-Type: (.*)/\1/p'
With awk you can add FS=": " if the values contain spaces:
awk 'BEGIN {FS=": "}/^Content-Type/{print $2}'
The other answers use the -I option and parse the output. It's worth noting that -I changes the HTTP method to HEAD. (The long opt version of -I is --head). Depending on the field you're after and the behaviour of the web server, this may be a distinction without a difference. Headers like Content-Length may be different between HEAD and GET. Use the -X option to force the desired HTTP method and still only see the headers as the response.
curl -sI http://ifconfig.co/json | awk -v FS=": " '/^Content-Length/{print $2}'
18
curl -X GET -sI http://ifconfig.co/json | awk -v FS=": " '/^Content-Length/{print $2}'
302
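The parsing ideas from these answers can be folded into one small function. This is a sketch, not a curl feature: header_value is a hypothetical helper, and the canned headers below stand in for the output of curl -sI. It strips the CR, compares the header name case-insensitively (servers may send "etag" or "ETag"), and splits on the first colon only, so values that contain ": " survive intact.

```shell
#!/bin/sh
# header_value NAME: hypothetical helper; reads raw response headers on stdin
# and prints the value of every header called NAME (case-insensitive).
header_value() {
    tr -d '\r' | awk -v name="$1" '
        BEGIN { name = tolower(name) }
        {
            i = index($0, ":")                    # split on the first colon only
            if (i && tolower(substr($0, 1, i - 1)) == name) {
                v = substr($0, i + 1)
                sub(/^[ \t]+/, "", v)             # drop leading whitespace in the value
                print v
            }
        }'
}

# Canned headers standing in for `curl -sI "server/some/resource"`.
sample_headers() {
    printf 'HTTP/1.1 200 OK\r\nContent-Type: text/html; charset=utf-8\r\netag: "abc123"\r\n\r\n'
}

etag=$(sample_headers | header_value ETag)
ctype=$(sample_headers | header_value content-type)
echo "$etag"
echo "$ctype"
```

Against a real server the call would look like curl -sI "server/some/resource" | header_value ETag.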

CURL Progress Bar: How to pipe and extract numbers only using grep?

This is what I have so far:
[my1#graf home]$ curl -# -o f1.flv 'http://osr.com/f1.flv' | grep -o '*[0-9]*'
####################################################################### 100.0%
I wish to use grep to extract only the percentage from the progress bar that curl outputs.
I think my regex is not correct, and I am also not sure whether this grep will keep up with the percentage being continuously updated.
What I am trying to do is basically get curl to give me only the percentage number as the output and nothing else.
Thank you for any help.
With curl 7.36.0 (should also work for other versions) you can extract the percentage in the following way:
curl ... 2>&1 -# | stdbuf -oL tr '\r' '\n' | grep -o '[0-9]*\.[0-9]'
Here ... stands for options/filenames. This outputs a sequence of percentage numbers.
Curl uses carriage returns \r in its output, so you need tr to transform them first into \n because grep is line oriented. You also need to modify output buffer settings with stdbuf to get the percentage numbers immediately after curl outputs them.
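To see what the tr and grep stages do without downloading anything, the progress output can be simulated. The fake_progress function below is an assumed stand-in for the stream that curl -# ... 2>&1 produces; stdbuf is left out because it only affects how soon lines appear during a live download.

```shell
#!/bin/sh
# Canned stand-in for curl's progress bar: updates separated by \r.
fake_progress() {
    printf '\r###                   10.0%%\r###########           55.5%%\r#################### 100.0%%\n'
}

# Turn each \r-separated update into its own line, then keep only the number.
percentages=$(fake_progress | tr '\r' '\n' | grep -o '[0-9]*\.[0-9]')
last=$(printf '%s\n' "$percentages" | tail -n 1)
printf '%s\n' "$percentages"
```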
You can't get the progress info like that through grep; it doesn't make sense.
curl writes the progress bar to stderr, so you have to redirect to stdout before you can grep it:
$ curl -# -o f1.flv 'http://osr.com/f1.flv' 2>&1 | grep 1 | less
results in:
^M 0.0
%^M######################################################################## 100.
0%^M######################################################################## 100
.0%^M######################################################################## 10
0.0%
Are you expecting a continual stream of numbers that you are redirecting somewhere else? Or do you expect to grab the numbers at a single point?
If it's the former, this sort of half-assedly works on a small file:
$ curl -# -o f1.flv 'http://osr.com/f1.flv' 2>&1 | sed 's/#//g' -
100.0% 0.0%
But it's useless on a large file. The output doesn't print until the download is finished, probably because curl seems to be sending ^H's to the terminal. There might be a better way to sed it, but I wouldn't hold my breath.
$ curl -# -o l.tbz 'ftp://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2009/06/2009-06-02-05-mozilla-1.9.1/firefox-3.5pre.en-US.linux-x86_64.tar.bz2' 2>&1 | sed 's/#//g' -
100.0%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
Try this:
curl source -o dest -# 2> tmp&
grep -o ".....%" tmp | tail -n1
You need to use .* not * in your regex.
grep -o '.*[0-9].*'
That will catch all text though, so maybe try:
grep -oE '[0-9]+'
