Using same cache from different static servers - browser

I'm thinking about adding another static server to a web app, so I'd have static1.domain.tld and static2.domain.tld.
The point would be to use different domains in order to load static content faster (more parallel connections at the same time), but what troubles me is: how do I get the user's browser to see static1.domain.tld/images/whatever.jpg and static2.domain.tld/images/whatever.jpg as the same file?
Is there a trick to accomplish this with headers, or will I have to define which file lives on which server?

No, there's no way to tell the browser that two URLs are the same -- the browser caches by full URL.
What you can do is make sure you always use the same URL for the same image. E.g., all images whose names start with A-M go on server 1, and N-Z go on server 2. For a real implementation, I'd use a hash based on the name or something like that, but there are probably libraries that do that kind of thing for you.

You need to have both servers able to respond to requests sent to static.domain.tld. I've seen a number of ways of achieving this, but they're all rather low level. The two I'm aware of:
Use a DNS round-robin so that the mapping of hostnames to IP addresses changes over time; very large websites often use variations on this so that content is actually served from a CDN closer to the client.
Use a hacked router config so that an IP address is answered by multiple machines (with different MAC addresses); this is very effective in practice, but requires the machines to be physically close.
You can also do the spreading out at the "visible" level by directing requests to different servers based on something that might as well be random (e.g., a particular bit of the MD5 hash of the path). Best of all, these techniques use independent parts of the software stack, so you can use them in any combination you want.

This serverfault question will give you a lot of information:
Best way to load balance across multiple static file servers for even bandwidth distribution?

Related

How to block rawgit.com to access my website server

I think my website has been injected with a script that uses rawgit.com. Recently my website has been running very slowly, with the browser's status bar showing "Transferring data from rawgit.com..." or "Read rawgit.com...". I have never used RawGit to serve raw files directly from GitHub. I can see they are using the https://cdn.rawgit.com/ domain to serve files.
I would like my website to block everything related to these domains. How can I achieve that?
As I said in the comments, you are going about this problem in the wrong way. If your site already includes sources you do not recognise or allow, you are already compromised and your main focus should be on figuring out how you got compromised, and how much access an attacker may have gotten. Based on how much access they have gotten, you may need to scrap everything and restore a backup.
The safest thing to do is to bring the server offline while you investigate. Make sure that you still have access to the systems you need (e.g. SSH), but block every other remote IP. Just "blocking rawgit.com" blocks one of the symptoms you can see and allows the attacker to change their attack while you are fumbling with that.
I do not recommend only blocking rawgit.com, not even as your first move to counter this problem, but if you want, you can use the Content-Security-Policy header. You can whitelist the URLs you do expect and thus block the ones you do not. See MDN for more information.
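A minimal sketch of building such a Content-Security-Policy header; the allowed source list is a placeholder you would replace with your own origins. Anything not listed (including rawgit.com and cdn.rawgit.com) is then blocked by the browser:

```python
# Placeholder whitelist: your own origin plus any CDN you actually use.
ALLOWED_SCRIPT_SOURCES = ["'self'", "https://cdn.example.com"]

def csp_header() -> tuple:
    """Return the (name, value) header pair to attach to every response."""
    policy = "script-src " + " ".join(ALLOWED_SCRIPT_SOURCES)
    return ("Content-Security-Policy", policy)
```

How you attach the header depends on your stack (web server config, framework middleware, or a PHP `header()` call), but the policy string is the same everywhere.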

How to find out my site is being scraped?

How to find out my site is being scraped?
I have some points:
Network bandwidth occupation, causing throughput problems (matches if a proxy is used).
When querying a search engine for keywords, new references appear to other, similar resources with the same content (matches if a proxy is used).
Multiple requests from the same IP.
High request rate from a single IP. (By the way: what is a normal rate?)
Headless or weird user agent (matches if a proxy is used).
Requests at predictable (equal) intervals from the same IP.
Certain support files are never requested, e.g. favicon.ico, various CSS and JavaScript files (matches if a proxy is used).
The client's request sequence, e.g. the client accesses pages that are not directly reachable (matches if a proxy is used).
Would you add more to this list?
What points might fit/match if a scraper uses proxying?
As a first note, consider whether it's worthwhile to provide an API for bots in the future. If you are being crawled by another company and it is information you want to provide to them anyway, it makes your website valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on who is crawling you.
Second, coming from personal experience (I created web crawlers for quite a while), you can generally tell immediately by tracking which browser accessed your website. If they are using one of the automated ones, or one built from a development language, it will be uniquely different from your average user's. Not to mention tracking the log file and updating your .htaccess to ban them (if that's what you are looking to do).
Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, also for some thoughts on how to identify them.
How to block bad unidentified bots crawling my website?
I would also add analysis of when requests by the same people are made. For example, if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule, and hence likely to be scraping.
Possibly add analysis of how many pages each user session has touched. For example, if a particular user on a particular day has browsed to every page in your site and you deem this unusual, then perhaps it's another indicator.
It feels like you need a range of indicators and need to score them and combine the score to show who is most likely scraping.
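A minimal sketch of such combined scoring; the specific indicators mirror the list above, but the weights and thresholds are illustrative assumptions, not measured values:

```python
from dataclasses import dataclass

@dataclass
class ClientStats:
    requests_per_minute: float
    interval_stddev: float       # seconds between requests; near 0 = machine-like
    fetched_support_files: bool  # favicon.ico, CSS, JS ever requested?
    pages_covered_fraction: float  # share of the whole site visited in a day

def scraper_score(s: ClientStats) -> float:
    """Combine several weak indicators into one score per client."""
    score = 0.0
    if s.requests_per_minute > 60:       # sustained high rate
        score += 2.0
    if s.interval_stddev < 0.5:          # suspiciously regular timing
        score += 2.0
    if not s.fetched_support_files:      # pages fetched without assets
        score += 1.0
    if s.pages_covered_fraction > 0.8:   # crawled nearly the whole site
        score += 1.0
    return score
```

A client scoring above some threshold (say 3) would then be flagged for manual review rather than blocked outright, since each indicator alone can produce false positives.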

Secure verification of location claims by mobile app

What algorithm or set of heuristics can a server and a mobile app use so that the server can always be fairly certain that the app is used within the boundaries of a given geographic region (e.g. a country)? How can the server ensure that app users outside of the defined region can not falsely claim that they are inside the region?
You can't be 100% sure that the user isn't reporting a fake location; you can only make faking it as difficult as possible. You should implement several checks depending on the data you have access to:
1) the user's IP address (the user can use a proxy)
2) the device's GPS coordinates (they can be spoofed)
3) the locale of the device (not a reliable indicator)
One of the most secure checks (but also not 100%) is sending the user an SMS with a confirmation code, which they have to type into the app.
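The server side of that SMS check might look like the sketch below. Actually sending the message goes through an SMS provider's API and is out of scope here; the six-digit code format is an assumption:

```python
import hmac
import secrets

def make_code() -> str:
    """Generate a 6-digit confirmation code to send via SMS."""
    return f"{secrets.randbelow(1_000_000):06d}"

def verify_code(expected: str, submitted: str) -> bool:
    # Constant-time comparison avoids leaking digits via response timing.
    return hmac.compare_digest(expected, submitted)
```

The phone number's country prefix gives you a (weak) location signal in itself, which is part of why this check is hard to fake cheaply at scale.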
One of the most sophisticated algorithms known to me is in Google Play (which is how some apps are only available in certain countries). It checks parameters such as the IP address and the user's mobile operator, among others, but there are tools (like Market Enabler) and techniques that can trick the system.
If you don't want to use Google Play or other such ways, the best option is to use Cloudflare. I say best because, first, it costs nothing performance-wise and cost-wise; second, it is easy to use; and third, you need it anyway if you expect a large number of users, since it provides nice tools such as a static cache, an optimizer, analytics, user blocking, country blocking, etc.
Once you sign up for a free Cloudflare account, you can set up your server's public IP address there so that all traffic comes through the Cloudflare proxy network.
After that everything is pretty straightforward: you can install the Cloudflare module on your server.
In your app, you can get the visitor's country code from the server request variable HTTP_CF_IPCOUNTRY; for example, $_SERVER['HTTP_CF_IPCOUNTRY'] in PHP. It will give you AU for Australia (ISO 3166-1 country codes). It doesn't matter what language you use.
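A minimal sketch of gating requests by that header, shown here in Python rather than PHP; the allowed-country set and the plain dict of request variables are assumptions about your app:

```python
# ISO 3166-1 alpha-2 codes for the regions you allow; placeholder values.
ALLOWED_COUNTRIES = {"AU", "NZ"}

def request_allowed(server_vars: dict) -> bool:
    """Check the Cloudflare-provided country code against the whitelist."""
    country = server_vars.get("HTTP_CF_IPCOUNTRY", "")
    return country.upper() in ALLOWED_COUNTRIES
```

Note the header is only trustworthy if all traffic really does pass through Cloudflare; if the origin IP is reachable directly, a client can set the header itself.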
The Cloudflare IP database is frequently updated and seems very reliable for detecting a user's geolocation without performance overhead.
You also get free protection from attacks, plus free cache and CDN features for fast loading, etc.
I have used several other ways, but none of them was quite reliable.
If your app runs without a server, you can still put a file on a server and make a call to the remote URL to get the user's country on each request.
Apart from the things that #bzz mentioned, you can read the SSIDs of the user's WiFi networks; services like http://www.skyhookwireless.com/ provide an API (I think with browser plugins, I am not sure) which you can use to get a location by submitting the WiFi SSIDs.
If you need the user to be within a specific region all the time while using the app, you'll probably end up using all the options together. If you just need a one-time check, the SMS-based approach is the best one IMO.
For accessing the WiFi SSID, refer to this; still, you cannot be 100% sure.

Identifying requests made by Chrome extensions?

I have a web application that has some pretty intuitive URLs, so people have written some Chrome extensions that use these URLs to make requests to our servers. Unfortunately, these extensions cause problems for us, hammering our servers, issuing malformed requests, etc., so we are trying to figure out how to block them, or at least make it difficult enough to craft requests to our servers that people are dissuaded from using these extensions (we provide an API they should use instead).
We've tried adding some custom headers to requests and junk-json-preamble to responses, but the extension authors have updated their code to match.
I'm not familiar with Chrome extensions, so what sort of access to the host page do they have? Can they call JavaScript functions on the host page? Is there a special header the browser includes to distinguish host-page requests from extension requests? Can the host page inspect the list of extensions and deny certain ones?
Some options we've considered are:
Rate-limiting QPS by user, but the problem is not all queries are equal, and extensions typically kick off several expensive queries that look like user entered queries.
Restricting the amount of server time a user can use, but the problem is that users might hit this limit by just navigating around or running expensive queries several times.
Adding static custom headers/response text, but they've updated their code to mimic our code.
Figuring out some sort of token (probably cryptographic in some way) we include in our requests that the extension can't easily guess. We minify/obfuscate our JS, so we are OK with embedding it in the JS source code (since the variable name it would have would be hard to guess).
I realize this may not be a 100% solvable problem, but we hope to either gain an upper hand in combating it or make it sufficiently hard to scrape our UI that fewer people do it.
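The cryptographic-token option from the list above could be sketched like this, assuming an HMAC over a timestamp that the server embeds in the page and each request must echo back; the secret and TTL values are placeholders:

```python
import hashlib
import hmac
import time

SECRET = b"server-side-secret"  # placeholder; never shipped to the client
TOKEN_TTL = 300                 # seconds a token stays valid (assumption)

def issue_token(now=None) -> str:
    """Embed this in the served page; clients attach it to API requests."""
    ts = int(now if now is not None else time.time())
    sig = hmac.new(SECRET, str(ts).encode(), hashlib.sha256).hexdigest()
    return f"{ts}:{sig}"

def verify_token(token: str, now=None) -> bool:
    """Reject requests whose token is missing, forged, or expired."""
    try:
        ts_str, sig = token.split(":", 1)
        ts = int(ts_str)
    except ValueError:
        return False
    expected = hmac.new(SECRET, ts_str.encode(), hashlib.sha256).hexdigest()
    fresh = (now if now is not None else time.time()) - ts <= TOKEN_TTL
    return fresh and hmac.compare_digest(expected, sig)
```

This only raises the bar: an extension that executes your page's JS can still read the token out of the DOM, but it can no longer mint valid requests offline or replay old ones past the TTL.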
Welp, guess nobody knows. In the end we just sent a custom header and started tracking who wasn't sending it.

Strategy for spreading image downloads across domains?

I am working on a PHP wrapper for the Google Image Charts API service. It supports serving images from multiple domains, such as:
http://chart.googleapis.com
http://0.chart.googleapis.com
http://1.chart.googleapis.com
...
Numeric range is 0-9, so 11 domains available in total.
I want to automatically track the count of images generated and rotate domains for the best performance in the browser. However, Google itself only vaguely recommends:
...you should only need this if you're loading perhaps five or more charts on a page.
What should my strategy be? Should I just change the domain every N images, and what would a good N value be in the context of modern browsers?
Is there a point where it would make sense to reuse a domain rather than introduce a new one (to save a DNS lookup)?
I don't have a specific number of images in mind; since this is open-source, publicly available code, I would like to implement a generic solution rather than optimize for my specific needs.
Considerations:
Is one host faster than another?
Does the browser limit connections per host?
How long does it take for the browser to resolve a DNS name?
As you want to make this a component, I'd suggest you make it able to use multiple strategies for finding the host name. This will not only allow you to have different strategies but also to test them against each other.
Also, you might want to add support for JavaScript libraries that can render the data on the page in the future, so you might want to stay modular anyway.
Variants:
Pick one domain name and stick with it, hardcoded: http://chart.googleapis.com
Pick one domain name out of many, stick with it: e.g. http://#.chart.googleapis.com
Like 2 but start to rotate the name after some images.
Like 3, but add some JavaScript at the end of the page that resolves the DNS of the remaining hostnames in the background so that they're cached for the next request (provide it the hostnames not used so far).
Then you can make your library configurable, so you don't need to hardcode the values; instead you provide a default configuration.
Then you can add the strategy as configuration so someone who implements can decide over it.
Then you can make the component offer to load the configuration from outside. For example, if you create a WordPress plugin, the plugin can store the configuration and offer the plugin user an admin interface to change the settings.
As the configuration already includes which strategy to follow, you have completely handed responsibility to the consumer of the component, and you can more easily integrate different usage scenarios for different websites or applications.
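The pluggable-strategy idea might be sketched like this, shown in Python for brevity rather than the wrapper's PHP; the strategy names and the builder class are assumptions, not an existing API:

```python
from typing import Callable

# A strategy maps (chart query string, running image count) to a host name.
HostStrategy = Callable[[str, int], str]

def fixed_host(args: str, count: int) -> str:
    """Variant 1: one hardcoded domain."""
    return "chart.googleapis.com"

def rotate_every_n(n: int) -> HostStrategy:
    """Variant 3: move to the next numbered domain every n images."""
    def strategy(args: str, count: int) -> str:
        return f"{(count // n) % 10}.chart.googleapis.com"
    return strategy

class ChartUrlBuilder:
    """Consumer picks the strategy via configuration; default is variant 1."""
    def __init__(self, strategy: HostStrategy = fixed_host):
        self.strategy = strategy
        self.count = 0

    def url(self, args: str) -> str:
        host = self.strategy(args, self.count)
        self.count += 1
        return f"http://{host}/chart?{args}"
```

Because each strategy has the same signature, the consumers can swap them through configuration alone, and you can benchmark strategies against each other without touching the builder.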
I don't exactly understand the request to rotate domains. I guess it makes sense in that your browser may only allow X open requests to a given domain at once, so if you have 10 images served from chart.googleapis.com, you may need to wait for the first to finish downloading before beginning to receive the fifth, and so on.
The problem with rotating domains randomly is that then you defeat browser caching entirely. If an image is served from 1.chart.googleapis.com on one page load and then from 7.chart.googleapis.com on the next page load, the cached chart is invalidated and the user needs to wait for it to be requested, generated, and downloaded all over again.
The best solution I can think of is determining the domain algorithmically from the request. If it's in a function, you can MD5 the arguments somehow, convert the digest to an integer, and then serve the image from {$result % 10}.chart.googleapis.com.
Probably a little overkill, but you at least can guarantee that a given image will always be served from the same server.
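That deterministic scheme can be sketched as follows, in Python rather than the wrapper's PHP; the function name is hypothetical:

```python
import hashlib

def chart_host(chart_args: str) -> str:
    """Hash the chart parameters and pin the image to one of the
    ten numbered Google chart domains."""
    digest = hashlib.md5(chart_args.encode("utf-8")).hexdigest()
    shard = int(digest, 16) % 10
    return f"{shard}.chart.googleapis.com"
```

Identical chart parameters always map to the same host, so the browser cache keeps working across page loads while distinct charts still spread across the ten domains.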
