Weird POST requests in IIS logs

I know that Stack Overflow is not the correct place to post this question, but I already posted it on Server Fault and that place seems generally dead.
--
I noticed weird log entries (unless there's something I don't understand) in my IIS (7.5) logs.
It's an online dictionary with user-friendly URL rewriting, and most requests are GETs. However, I noticed weird POST requests being made by a person who is trying to crawl our content (tens of thousands of such requests):
2013-11-09 20:39:27 GET /dict/mylang/word1 - y.y.y.y Mozilla/5.0+(compatible;+Googlebot/2.1;++http://www.google.com/bot.html) - 200 296
2013-11-09 20:39:29 GET /dict/mylang/word2 - z.z.z.z Mozilla/5.0+(iPhone;+CPU+iPhone+OS+6_0+like+Mac+OS+X)+AppleWebKit/536.26+(KHTML,+like+Gecko)+Version/6.0+Mobile/10A5376e+Safari/8536.25+(compatible;+Googlebot-Mobile/2.1;++http://www.google.com/bot.html) - 200 468
2013-11-09 20:39:29 POST /dict/mylang/word3 - x.x.x.x - - 200 2593
The first two requests are legitimate. As for the third request, I don't think I have allowed cross-domain POSTs, if that is what the third log line means.
All those POST requests take that much time for reasons unknown to me. I would like to know how those POST requests are possible and how I can stop them.
P.S. I have masked the IPs on purpose.
Any help would be appreciated! Thank you in advance.
Blocking POST in general is not an option, since I use AJAX extensively. I want to know how he makes this kind of POST request and how to stop him. I've got tens of thousands of requests; I constantly ban IP ranges through the firewall, but he just hops between proxies.
This is how a normal POST request (through AJAX) happens:
2013-11-10 10:16:54 POST /dict/mylang/displaySem.php - 85.73.156.122 Mozilla/5.0+(Windows+NT+6.1;+rv:25.0)+Gecko/20100101+Firefox/25.0 http://www.mydomain.com/dict/mylang/randomword 200 171

HTTP allows anyone to send a POST request to your site. Your application (not IIS) should check whether it is a valid request before starting the long processing algorithm.
Some common validation methods are:
If you think he is directly POSTing to your site using an automated script, you could use a CAPTCHA to make it hard for him: http://en.wikipedia.org/wiki/CAPTCHA
If you think he is hijacking other people's sessions, you can use a CSRF token field in your form: http://en.wikipedia.org/wiki/Cross-site_request_forgery
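For the CSRF approach, here is a minimal PHP sketch of what that could look like; the handler name displaySem.php is taken from the log above, but the token mechanics are an assumption about your stack, not a description of it:

<?php
// Page that renders the dictionary entry and will later fire the AJAX POST.
// Generate a per-session token and embed it for the client-side script to send back.
session_start();
if (empty($_SESSION['csrf_token'])) {
    $_SESSION['csrf_token'] = bin2hex(random_bytes(32)); // random_bytes() needs PHP 7+
}
echo '<meta name="csrf-token" content="' . htmlspecialchars($_SESSION['csrf_token']) . '">';

<?php
// displaySem.php - reject POSTs that never loaded one of your pages first.
session_start();
$sent = $_POST['csrf_token'] ?? '';
if (empty($_SESSION['csrf_token']) || !hash_equals($_SESSION['csrf_token'], $sent)) {
    http_response_code(403);
    exit;
}
// ...only now run the expensive dictionary lookup...

The crawler would then have to fetch a page and keep a session alive before each POST, which makes bulk scraping noticeably more expensive.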

Related

400 with file_get_contents() or curl_init()

Most sites' source can be fetched with a simple request, usually via file_get_contents() or curl_init().
I've tried a lot of combinations of stream_context_create() and curl_setopt(), and none returned anything other than 400 Bad Request.
Is there an explanation for why some sites (like https://phys.org/) do not return the source code via the methods quoted above?
Note: if you were able to get the source of the example (https://phys.org/) using file_get_contents() or curl_init(), or any other method with PHP, please post the code. Thanks.
Some websites validate whether the request comes from a real/allowed client (bot or user).
There can be multiple reasons for this.
Maybe bots are sending too many requests, or the specific site is behind a paywall/firewall. But there are many other people who can explain it better than me.
Here are some known examples of how they do it:
Some sites require requests to carry an API token.
The Google APIs are a great example.
Some sites validate the User-Agent.
It looks like your example site is doing this.
When I send a custom User-Agent header, the result returns an error.
And of course, some sites can check the user's IP address :)
I believe for your example there should be a good way to get a result.
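To illustrate the User-Agent point, here is a minimal PHP curl sketch that sends a browser-like User-Agent header. Whether https://phys.org/ actually accepts it depends on whatever checks they run, so treat this as a starting point rather than a guaranteed fix:

<?php
// Fetch a page while presenting a browser-like User-Agent (illustrative values only).
$ch = curl_init('https://phys.org/');
curl_setopt_array($ch, [
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_FOLLOWLOCATION => true,
    CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
    CURLOPT_HTTPHEADER     => ['Accept: text/html,application/xhtml+xml'],
]);
$html = curl_exec($ch);
if ($html === false) {
    echo 'curl error: ' . curl_error($ch);
} else {
    echo 'HTTP ' . curl_getinfo($ch, CURLINFO_HTTP_CODE) . ', ' . strlen($html) . ' bytes received';
}
curl_close($ch);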

GET Request URL Capability

I recently began working with JavaScript and am looking at the various GET and POST requests one can send to a server.
For GET, as far as I know, all of the query information is contained in the URL that the user triggers. On the server side this has to be parsed to retrieve the necessary parameters.
I was just wondering how larger and more detailed requests are handled with the GET method. For instance, what if I had millions and millions of parameters that make up my whole request? Would they all be jumbled into the URL? Is there a limit to the number of unique URLs one can have? I read this post:
How do URL shorteners guarantee unique URLs when they don't expire?
I would really like some more input.
Thank You!
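In practice, very large requests don't travel in the URL at all: servers and browsers cap URL length (IIS, for example, limits the query string to 2048 bytes by default), so bulky data is sent in a POST body instead. A small PHP sketch contrasting the two, using a made-up endpoint:

<?php
// Illustrative only: https://example.com/search is a made-up endpoint.
$params = ['q' => 'dictionary', 'lang' => 'en', 'page' => 3];

// GET: everything is encoded into the URL itself, so URL-length limits apply.
$url = 'https://example.com/search?' . http_build_query($params);
$viaGet = file_get_contents($url);

// POST: the same data travels in the request body, which has no such practical limit.
$context = stream_context_create([
    'http' => [
        'method'  => 'POST',
        'header'  => 'Content-Type: application/x-www-form-urlencoded',
        'content' => http_build_query($params),
    ],
]);
$viaPost = file_get_contents('https://example.com/search', false, $context);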

Receiving "400 Bad Request" for /oauth/access_token

I have a client ID approved for the public_content scope. To get an access token, I send a request to www.instagram.com:
GET /oauth/authorize?client_id=MyClientId&redirect_uri=MyRedirectURL&response_type=code&scope=likes+comments+public_content HTTP/1.1
After authentication, the browser redirects me to MyRedirectURL and I can get the code from the URL.
With this code I send a request to api.instagram.com:
POST /oauth/access_token HTTP/1.1
client_id=MyClientId&client_secret=MyClientSecret&grant_type=authorization_code&redirect_uri=MyRedirectURL&code=CodeFromURL
But sometimes I get the response HTTP/1.1 400 Bad Request.
This situation continues for a few hours, and sometimes for a day. Interestingly, the problem is very unstable: I may have two client apps making identical requests, and one app will work fine while the other fails at the same time. It looks like a problem somewhere in the Instagram infrastructure.
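For reference, the same token-exchange POST expressed as a PHP curl call (the values are the same placeholders as above):

<?php
// Sketch of the token-exchange request described above; all values are placeholders.
$ch = curl_init('https://api.instagram.com/oauth/access_token');
curl_setopt_array($ch, [
    CURLOPT_POST           => true,
    CURLOPT_RETURNTRANSFER => true,
    CURLOPT_POSTFIELDS     => http_build_query([
        'client_id'     => 'MyClientId',
        'client_secret' => 'MyClientSecret',
        'grant_type'    => 'authorization_code',
        'redirect_uri'  => 'MyRedirectURL', // must match the registered redirect URI exactly
        'code'          => 'CodeFromURL',   // the one-time code from the redirect
    ]),
]);
$body   = curl_exec($ch);
$status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
curl_close($ch);
echo $status . ': ' . $body; // a 400 body usually carries an error_message explaining why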
Instagram no longer supports custom schemes for callback URLs. That was my problem; I changed it to https and the problem was solved.
I think you should refer to this Instagram documentation.
You may also receive responses with an HTTP response code of 400 (Bad Request) if we detect spammy behavior by a person using your app. These errors are unrelated to rate limiting.
It seems like we cannot use http://localhost/... in the callback URL. Instagram may have restricted it.
It worked for me when I added the live IP of my AWS server, for example http://xx.xx.xx.xx/.. instead of localhost.

At the server I keep getting socket.io requests constantly, every 2 seconds, but Google Analytics shows no one is here.

I've removed all socket.io code, but someone either hasn't reloaded the page for days or something else is going on. For some reason the server is getting bombarded with socket.io requests, which are failing because I removed all the code on both the client and the server. However, they are still coming. What can I do? Block the IPs?
I can't change the domain name, which is a given. I can't think of any options; they're coming from about 6 different IPs. They would have been legitimate requests some weeks ago, but not now.
Are you worried that handling these requests will impede your server's performance? The only legitimate reason I can think of is that someone's browser cache hasn't been cleared properly since the update, assuming you enabled caching on your Express server.
If your intention is to improve performance, I suggest putting that path high in the Express method chain so that the server can end the request as quickly as possible and minimize the load on the server.
If you want those users to become aware that their requests are invalid, you could route the path to a JavaScript file that redirects the current page to another document. On that document, have directions that instruct them to clear their browser cache in order to properly update their client.
Hope that helps.

How can I verify that javascript and images are being cached?

I want to verify that the images, CSS, and JavaScript files that are part of my page are being cached by my browser. I've used Fiddler and Google Page Speed, and it's unclear whether either is giving me the information I need. Fiddler shows the HTTP 304 response for images, CSS, and JavaScript, which should tell the browser to use the cached copy. Google Page Speed shows the 304 response but doesn't show a transfer size of zero; instead it shows the full file size of the resource. Note also that I have seen Google Page Speed report a 200 response but then put the word (cache) next to the 200 (so the status is 200 (cache)), which doesn't make a lot of sense.
Any other suggestions as to how I can verify whether the server is sending back images, CSS, and JavaScript after they've already been retrieved and cached by a previous page hit?
In-browser HTTP debuggers are probably the easiest to use in your situation. Try HttpFox for Firefox, or Opera, which has Dragonfly built in. Both of these indicate when the local browser cache has been used.
If you appear to be getting conflicting information, then Wireshark/tcpdump will show you whether the objects are being downloaded or not, as it monitors the actual network packets being transmitted and received. If you haven't looked at network traces before, this might be a little confusing at first.
In Fiddler, check that the response body (for images, CSS) is empty. Also make sure the max-age in your Cache-Control header is long enough. Most browsers (Safari, Firefox) have good traffic analyzer tools.
Your server's access logs can give you a lot of information on how effective your caching strategy is.
Let's say you have an HTML page /home.html, which references /some.js and /lookandfeel.css. For a given time period, aggregate the number of requests to all three files.
If your caching is effective, you should see a huge number of requests for home.html but very few for the CSS or JS. Somewhere in between is when you see an identical number of requests for all three, but the CSS and JS return 304s. The worst case is when you are only seeing 200s.
Obviously, you have to know your application to do such a study. The JS and CSS files may be shared across multiple pages, which may complicate your analysis. But the general idea still holds.
The advantage of such a study is that you can find out how effective your caching strategy is for your users, as opposed to 'Is caching working on my machine?'. However, this is no substitute for using an HTTP proxy / Fiddler.
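A throwaway PHP sketch of that kind of aggregation, assuming a common/combined log format where the request line sits between double quotes (the file name is an assumption; adjust the parsing to your server's actual log layout):

<?php
// Count hits per asset in an access log.
$counts = ['/home.html' => 0, '/some.js' => 0, '/lookandfeel.css' => 0];

foreach (file('access.log') as $line) {
    // Common/combined log format keeps the request line in quotes: "GET /path HTTP/1.1"
    if (preg_match('/"(?:GET|POST) (\S+)/', $line, $m) && isset($counts[$m[1]])) {
        $counts[$m[1]]++;
    }
}
print_r($counts); // many /home.html hits but few JS/CSS hits suggests caching is working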
An HTTP 304 response is forbidden from having a body. Hence the full response isn't sent; instead you just get back the headers of the 304 response. But the round trip itself isn't free, so sending proper expiration information is good practice for performance: it avoids making the conditional request that returns the 304 in the first place.
http://www.fiddler2.com/redir/?id=httpperf explains this topic in some detail.
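As a concrete illustration of "sending proper expiration information", a PHP script that serves a generated asset could emit freshness headers like these (the one-week lifetime is an arbitrary example value, and the file name reuses the example above):

<?php
// Serve a stylesheet with explicit freshness information so repeat visitors skip
// even the conditional request that would otherwise return a 304.
$maxAge = 7 * 24 * 3600; // one week, an arbitrary example
header('Content-Type: text/css');
header('Cache-Control: public, max-age=' . $maxAge);
header('Expires: ' . gmdate('D, d M Y H:i:s', time() + $maxAge) . ' GMT');
header('Last-Modified: ' . gmdate('D, d M Y H:i:s', filemtime('lookandfeel.css')) . ' GMT');
readfile('lookandfeel.css');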
