I am trying to use the program cURL inside of Bash to download a webpage's source code. I am having difficulty when the page uses something more complex than simple static HTML. For example, I am trying to view the source of the following page with this command:
curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones"
However, the result of this doesn't match the source code Firefox shows when I click "View source". I believe it is because there are JavaScript elements on the page, but I cannot be sure.
For example, I cannot do:
curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" | grep "Access to 4G speeds"
Even though that phrase is clearly present in the Firefox source. I tried looking through the man pages, but I don't know enough about the problem to figure out a possible solution.
A good answer will explain why this is not working the way I expect it to, and will offer a solution using curl or another tool that can be run from a Linux box.
EDIT:
At the suggestion below I have also included a user-agent switch, with no success:
curl "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" -A "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.3) Gecko/20100423 Ubuntu/10.04 (lucid) Firefox/3.6.3" | grep -i "Sorry"
I don't see the "Access to 4G speed" thing in the first place when I go to that page.
The two most likely culprits for this difference are cookies and your user-agent.
You can specify cookies manually using either curl or wget. Dump out your cookies from Firefox using whatever plugin you want, or just put
javascript:prompt('',document.cookie);
in your location bar
Then read through the man pages for wget or curl to see how to include that cookie.
EDIT:
It appears to be what I thought, a missing cookie.
curl --cookie "INSERT THE COOKIE YOU GOT HERE" "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" | grep "Access to 4G"
As stated above, you can grab whatever your cookie is with javascript:prompt('',document.cookie) and copy the default text that comes up. Make sure you're on the Sprint page when you put that in the location bar (otherwise you'll end up with the wrong website's cookie).
EDIT 2
The reason your browser cookie and your shell cookie were different was the difference in the interaction that took place.
The reason I didn't see the Access to 4G speed thing you were talking about in the first place was that I hadn't entered my zip code.
If you want to have a constantly relevant cookie, you can force curl to do whatever is required to obtain that cookie, in this case, entering a zip code.
In curl, you can do this with multiple requests, holding the retrieved cookies in a cookie jar:
[stackoverflow] curl --help | grep cookie
-b/--cookie <name=string/file> Cookie string or file to read cookies from (H)
-c/--cookie-jar <file> Write cookies to this file after operation (H)
-j/--junk-session-cookies Ignore session cookies read from file (H)
So simply specify a cookie jar, send the request that submits the zip code, then work away.
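As a sketch (the zip-code form's URL and field name below are made up; record the real ones from a browser session first), the two requests might look like this:
# First request: submit the zip code and store the session cookies.
# "SubmitZip" and "zipcode" are placeholders for whatever the real form uses.
curl -c sprint-cookies.txt -d "zipcode=90210" "http://shop.sprint.com/NASApp/onlinestore/en/Action/SubmitZip"
# Second request: reuse the stored cookies to fetch the phone listing.
curl -b sprint-cookies.txt "http://shop.sprint.com/NASApp/onlinestore/en/Action/DisplayPhones?INTNAV=ATG:HE:Phones" | grep "Access to 4G"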
If you are getting different source code from the same URL, the server is most likely sniffing your user agent and serving browser-specific code.
JavaScript can act on the DOM and do all sorts of things, but if you use 'view source' the code will be exactly the same as what your browser first received (before DOM manipulation).
Related
I'm practicing some web scraping using Node and cheerio on the William Hill website, but when I get to a certain point in the code it almost stops: even though the div is full of HTML and inspect element shows this, calling .html() on it returns an empty result, and any targeting of elements within this div returns null.
const request = require('request')
const cheerio = require('cheerio')

request('https://sports.williamhill.com/betting/en-gb/football/competitions/OB_TY295/English-Premier-League/matches/OB_MGMB/Match-Betting', (error, response, html) => {
    if (!error && response.statusCode == 200) {
        const $ = cheerio.load(html)
        const bet = $('#football div[data-test-id="events-group"]')
        console.log(bet.html())
    }
})
I'm completely new to web scraping so I hope this makes sense, and please if possible try to 'dumb down' your answers as much as possible. Thank you
There are many protections that webmasters deploy against screen scraping. A few are: limiting requests per IP address, looking for specific information in the headers (like browser type), and some even require certain types of cookies to be present.
As AbdulSohu pointed out, curl returns nothing (neither would a direct request or even a JavaScript fetch) because the request lacks whatever the web server needs before it will give you the HTML. This approach is also very brittle because websites can change their HTML at any time.
Selenium is an option, but if you want to dig into it, start investigating the minimum you'd need to return something from that site using request by adding appropriate headers to fool the web server into thinking you're a browser.
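For example (just a sketch; the header values here are guesses, and the site may still return an empty shell if the markup is rendered client-side by JavaScript):
curl "https://sports.williamhill.com/betting/en-gb/football/competitions/OB_TY295/English-Premier-League/matches/OB_MGMB/Match-Betting" \
  -A "Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Firefox/60.0" \
  -H "Accept: text/html,application/xhtml+xml" \
  -H "Accept-Language: en-GB,en;q=0.9"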
Good luck and have fun!
I tried the following to really see if I could get something following the specific link you posted:
curl "https://sports.williamhill.com/betting/en-gb/football/competitions/OB_TY295/English-Premier-League/matches/OB_MGMB/Match-Betting"
If you don't know what curl is, you can learn about it a bit more here. The command is supposed to return the html content for this specific page to me, in my terminal. What did I get?
<html></html>
So, basically, nothing.
In general, when you try to target specific containers and divs on an html page specified by a very specific hyperlink, you can run into many issues. For instance, the website administrators could change what the divs are called, the hyperlinks could redirect elsewhere etc.
What is less likely to happen, though, is that they change the structure of the website. It can still happen, just less often. So, in this case, it would probably be really beneficial if you could just write a Selenium program to browse and scrape the page like a human would with inspect element. For instance, I found this beginner tutorial.
I'm a Twitch streamer and I'm running a bot named "Nightbot", which can interact with users in my stream's chat area. They can type a command such as "!hello" and, in response to that, I can tell the Nightbot to load up a URL and post the text from that URL into the chat.
But the text needs to change each time I play a new game, so the text must be editable. And it can't be a file, because the nightbot expects the url to return just plain text.
So I can't use a file hosting service. Please don't recommend that I save a text file to some free hosting service and put my text into the file.
What I need is a very simple string of text that is hosted online, which can be edited, and which can be accessed by a URL. Why the literal *eck is that so impossible or unreasonable? I thought we live in 2018.
I spent the entire day trying to learn Heroku, and when that turned out to be unreasonably complicated, I spent some hours trying Microsoft's Azure. Holy moly it turned into connecting storage services, picking price tiers, and do I want that to run on a windows or linux server? And how many gigs of space do I need, and will I be paying by the second? Come on I just need to save an editable string of text online, probably just 100 characters long! Why so difficult!
I guess what I'm looking for is something as easy as tinyurl, but for editable text strings online... just go there and type in the name for my variable, and boom, it gives me a url to update it, and a url to download it. Total time required: less than one minute.
WARNING: both solutions are publicly accessible and thus also editable. You don't want inappropriate text to display on your stream, so keep the link secret. Even so, there is no guarantee it stays secret.
Solution 1 (simple and editable via the web UI if you create an account)
You could just use pastebin.com. Here you can put public/unlisted text.
When you use pastebin.com/raw/ + the id of your paste, you get plain text back.
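You can check from a terminal that the raw link returns plain text (the paste id here is made up):
curl "https://pastebin.com/raw/aBcDeF12"
That raw URL is the one you would hand to Nightbot.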
Solution 2 (a bit more complicated, but more flexible)
You can use JSON Blob
This website allows you to host JSON and create/edit/get a string. The content has to be valid JSON, but if you put "" around your text it is. (If you use a curl command to change the text it doesn't have to be valid JSON; only when you use the website to edit the text does it have to be.)
First off, you create your string and save it. Then you can access the string by doing a GET request on a URL like this: https://jsonblob.com/api/ + blob id
Example:
https://jsonblob.com/api/758d88a3-5e59-11e8-a54b-2b3610209abd
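To see exactly what your bot would receive, you can fetch that example URL with curl (assuming the blob still exists):
curl https://jsonblob.com/api/758d88a3-5e59-11e8-a54b-2b3610209abd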
To edit your text you have to do a PUT request to the same URL, but with the text you want it to change to.
Example command to change text (I used curl, because that's easy for me):
curl -i -X "PUT" -d 'This is new text' -H "Content-Type: application/json" -H "Accept: application/json" https://jsonblob.com/api/jsonBlob/758d88a3-5e59-11e8-a54b-2b3610209abd
You could also use a tool like POSTMAN to do the PUT request.
For more in-depth instructions on how to use JSON Blob you can go to their website: https://jsonblob.com/api
I have a URL with an output that needs to be reformatted and entered back in the browser.
We have a server that passes caller ID and we can specify a URL to launch with the caller ID included.
I.e. googledotcom/search?="{callerID}". If this is set in the URL manager it would return a Google search for "Jackson Steve" when a call is received from Steve Jackson.
**edit: the tag {callerID} that is passed from our server cannot be edited in any way because of Asterisk dial plan issues.
The issue is our customer database will only handle name searches in the format "Jackson, Steve". Without the comma the search comes back empty.
How would I take the name passed from caller ID, create a script to insert a comma and resubmit that URL in the browser?
Basically I need a way to convert "https://www.google.com/#q=name+name" to "https://www.google.com/#q=name,+name" via an automatic script or process. The comma after the first name is the change that needs to be made.
Should this be sent to a website running javascript/html where it formats the caller id name then resubmits or should this somehow be handled by a local script on a computer with something along the lines of autohotkey?
Possibly use some sort of redirect on a web page? Send "Name Name" to mywebsiteDOTcom/urlformat/, write a script that inserts a comma after the first name, and redirect the user to myuserdatabaseDOTcom/search?"Name, Name".
Replace the space with a comma; that should be the easiest solution.
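For example, as a rough shell sketch (it assumes the simple two-word "Last+First" case and just inserts a comma after the first name):
url="https://www.google.com/#q=Jackson+Steve"
# Insert a comma before the first "+", i.e. right after the last name.
echo "$url" | sed 's/+/,+/'
# -> https://www.google.com/#q=Jackson,+Steve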
Using Firefox, you could write a Greasemonkey script that is automatically run when you hit the target site (Chrome, I believe, has a similar addon named Tampermonkey; not sure about IE). In the JavaScript, you could examine the URL and either do a straight comma-for-space replacement as Danyal suggested, or do something a little more elegant with some regex matching, then auto-navigate the browser to the corrected URL. Once the script is installed, this process happens automatically.
PS. What process launches the browser? Can't you capture the name and format it before you launch the browser window? That would seem a much better approach than trying to format it after the fact.
On a website I have access to, there are some input fields. In the sixth field I need to enter an input string from a list of 10000 strings; a new page then appears, for which I just need to count the number of lines. Finally I would like to get a table with two columns: input string and number of resulting lines. Since I would otherwise have to enter the information manually for all 10000 different strings, I wonder what the best approach is to enter a string into a generic form field and get the resulting text. I have heard about curl but I am not sure whether it is the easiest option.
P.S.
Example of the interactive way: I type some string of words into Google search and then I get a new page with the search results. Previously I have entered my Google username and password, so the results will probably be filtered according to my profile.
Example of the non-interactive way: A script somehow supplies my user information and search query and saves the search results to some text file. Imagine the same idea but for a more complicated website like this.
What you want to do is send an HTTP POST with specific data. This can be done with any proper HTTP client code, one of which is libcurl (or the pycurl binding, or even the curl command line tool). In the response to the POST you probably get a redirect and then the results, or you need to do a separate request for the results; then you're done and go back to do the next POST. Repeat until all POSTs are done.
What you may need to take into account is that you may have to deal with cookies and possibly follow a redirect from the POST. A good approach is to record a "manual session" as done with a browser (use Firebug or LiveHTTPHeaders etc.) and then use that recording to help you repeat the same thing with an HTTP client.
A decent tutorial to get some starting up details on this kind of work can be found here: http://curl.haxx.se/docs/httpscripting.html
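As a minimal sketch with the curl command line tool (the form URL and field name are placeholders for whatever your recorded browser session shows), each iteration could look like:
# Keep cookies across requests, follow any redirect after the POST,
# then count the lines of the resulting page.
curl -s -L -b cookies.txt -c cookies.txt \
     -d "field6=SOME_INPUT_STRING" \
     "https://example.com/path/to/form-action" | wc -l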
You could also use JMeter to run all the POSTs. You can use the CSV input to supply the 10000 strings. Then you save the results as XML and extract the necessary data.
One of our most common error situations in a web application is that a user executes a GET request where the form should have submitted a POST.
I realize that the most likely case is that they got to the page (by way of a POST) and then clicked in the address bar and hit enter.
The question is, are there other less obvious ways that this could happen (print previews/back button, etc)?
We have never been able to reproduce the problem consistently. It happens for different users and different pages, and it doesn't happen very often (maybe once a week).
EDIT:
There is no data submitted with the GET request and therefore the processing page crashes out.
I was having a similar issue, although it doesn't sound like it was exactly yours. I had a form that was being submitted via Ajax and shouldn't ever use the default submit. Now and then I was receiving errors from a GET done on the form. It shouldn't be possible to submit this form; however, certain versions of Safari for Mac were submitting it on Enter. I just added a jQuery form.submit() catch on it to prevent the default functionality. I also set the action to a page that wouldn't result in an error (in case of a lack of JavaScript).
As you said, the problem is intermittent, so a form method mistakenly set to GET instead of POST can be ruled out. But yes, you are right: if the user presses Enter in the address bar it will be a GET request. A back-button request depends on the last request made; if it was a POST, any good browser will prompt you about resubmission, and if it was a GET there is no prompt and the page is brought back (possibly from cache).
Maybe you can use Firebug (track requests in the Net tab) or Fiddler and do some tests with different users/pages to see if you can reproduce it; it may simply be pressing Enter in the address bar.
Edit:
Also, GET is always supposed to 'retrieve information', so if the browser is missing something or needs something it will issue a GET. Check the IIS log for those GET requests and try them in a browser; if they contain query-string values for ViewState and EventValidation, then they really are malformed requests that turned from POST into GET, assuming the form method is not explicitly set to GET.
I believe that an answer to the question "what are the reasons for a browser executing a GET rather than a POST" does not help to solve the problem of receiving a GET on a form where you expect a POST. But to answer that question anyway: you receive a GET because the browser sends a GET, and a GET can be sent to any page that accepts a POST. If the user actually uses the form, the browser sends a POST. Thus your server has to be prepared to handle the GET, and it should handle it in the same manner as a POST with invalid parameters. My suggestion:
If you receive a GET, display the form again.
If you receive a POST with invalid data, display the form again and notify the user that he has to enter the data in a specific way.
Maybe a trivial answer, but that's what I would do. Don't know if it adds to the discussion.
Wrong; the most obvious reason why you get a GET instead of a POST is that you're doing a GET instead of a POST.
A less obvious reason is you forgot to specify method="post" in one of your forms. The HTML standard specifies that the default method is GET, so if you want to POST, you must specify method="post" explicitly.
Scrutinize all of your form tags and make sure every one of them explicitly specifies method="post".
EDIT: When you click in the address bar and press Enter, yes, it's true that the browser will GET the page, but your form wouldn't be submitted that way, since the browser treats the URL just like a copy-pasted URL: a new session without any additional information sent to the server (except cookies).