Website being spammed - bots

A client has a basic html website with a few pdfs on it.
The server is reporting 10 GB of downloads in a single day.
I have looked at the IP/visitor reports. There are numerous downloads of the PDFs pushing the bandwidth up. I have blocked the offending IPs.
Is there any way to stop this?

You could require a CAPTCHA before allowing the PDFs to be downloaded.
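A minimal sketch of that idea, assuming the PDFs can be served through a small Python (Flask) handler and that Google reCAPTCHA is used on the download form; the secret key and directory below are placeholders:

```python
# Serve a PDF only after a successful CAPTCHA check (sketch, not production code).
import requests
from flask import Flask, request, send_from_directory, abort

app = Flask(__name__)
RECAPTCHA_SECRET = "your-secret-key"    # placeholder
PDF_DIR = "/var/www/site/pdfs"          # placeholder: PDFs moved out of the web root

@app.route("/download/<path:filename>", methods=["POST"])
def download(filename):
    # The download form posts the reCAPTCHA token as 'g-recaptcha-response'.
    token = request.form.get("g-recaptcha-response", "")
    resp = requests.post(
        "https://www.google.com/recaptcha/api/siteverify",
        data={"secret": RECAPTCHA_SECRET, "response": token},
        timeout=5,
    )
    if not resp.json().get("success"):
        abort(403)  # failed or missing CAPTCHA: no download, no bandwidth spent
    return send_from_directory(PDF_DIR, filename, as_attachment=True)
```

For this to help, the PDFs must no longer be reachable at their old direct URLs (e.g. moved outside the web root); otherwise the bots will simply keep fetching them directly.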

Related

As a file hosting provider, how do you prevent phishing?

We develop a service much like Dropbox or Google Drive or S3 where our customers can host their files and get direct links to them with our domain, e.g. www.ourservice.com/customer_site/path/to/file.
Recently we started receiving emails from our server provider saying that we host phishing .html files. It turned out that some of our customers had indeed uploaded such files; of course we removed them and blocked those customers.
But we would like to prevent this from happening at all, mainly because it hurts our Google Search ranking and because we never intended to be a host for phishing scripts.
How do other file hosting providers solve this? I guess we could run some checks (anti-virus, anti-phishing, etc.) when customers upload their files, but that seems tedious given how many uploads there are. Another option is to periodically check all files, or all new files, but I'd like to ask whether I'm missing an easier solution.
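For what it's worth, scanning files once at upload time is usually cheaper than periodically re-scanning everything. A rough sketch in Python, assuming ClamAV's clamscan binary is installed; the password-form heuristic and function names are purely illustrative:

```python
# Flag suspicious uploads at write time (sketch; the heuristic is illustrative only).
import re
import subprocess
from pathlib import Path

def looks_like_phishing_html(path: Path) -> bool:
    """Very crude heuristic: an HTML file that asks for a password."""
    if path.suffix.lower() not in {".html", ".htm"}:
        return False
    text = path.read_text(errors="ignore").lower()
    return "<form" in text and re.search(r'type\s*=\s*["\']?password', text) is not None

def has_malware(path: Path) -> bool:
    """clamscan exits with 0 for clean files and 1 when something was detected."""
    result = subprocess.run(["clamscan", "--no-summary", str(path)],
                            capture_output=True, text=True)
    return result.returncode == 1

def should_quarantine(path: Path) -> bool:
    """True if the uploaded file should be held for manual review."""
    return has_malware(path) or looks_like_phishing_html(path)
```

Hooked into the upload handler, this would quarantine suspect files for review rather than publishing them immediately; it won't catch everything, but it reduces the abuse reports that arrive after the fact.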

Why are these tracking sites informed when visitors come to my website?

A few months ago I published a website built with Django on AWS. A few weeks ago I installed the Disconnect plugin in my Firefox browser, and now I see that several tracking sites are informed when visitors come to my site. Why is this happening?
I don't like this: I'm not tracking my visitors, I don't want to, and I don't want tracking sites being informed when my readers visit. This is a partial list of the tracking sites:
doubleclick.net
viglink.com
bluekai.com
exelator.com
pippio.com
rezync.com
narrative.io
v12group.com
etc.
Some of them store cookies and use local storage.
I want to avoid or block these sites so my users are not tracked. How can I do that? Are they related to the robots that index my site?
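If the requests turn out to come from third-party scripts included in my own templates, one direction I'm considering is a Content-Security-Policy header that tells browsers not to contact hosts I haven't allowed; a minimal Django middleware sketch, untested:

```python
# middleware.py -- add a Content-Security-Policy header so browsers refuse to
# contact third-party hosts the page has not explicitly allowed (sketch).

class ContentSecurityPolicyMiddleware:
    def __init__(self, get_response):
        self.get_response = get_response

    def __call__(self, request):
        response = self.get_response(request)
        # 'self' restricts scripts, images and frames to this site's own origin;
        # loosen individual directives if the site relies on a CDN or web fonts.
        response["Content-Security-Policy"] = (
            "default-src 'self'; script-src 'self'; img-src 'self' data:"
        )
        return response
```

Registering this in MIDDLEWARE would stop the browser from contacting any host not listed in the policy, though it would also break legitimate third-party resources (fonts, CDNs) unless those are added to it.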
Thanks!

Need ideas on how one can spam a website, crawl it, and waste its resources

I am working on a startup that basically serves websites. Sorry, I can't reveal many details about it.
I need some ideas on how spammers and crawler developers think when attacking a website and, if possible, ways to prevent such attacks.
We have come up with some basic ideas like:
1. Include a small JS file in the sites that sends an ACK to our servers once all the assets are loaded. Some crawlers/bots only visit websites to download specific things like images or articles; in such cases, our JS won't be triggered. Our logs will record the resources requested by each IP and whether our JS was triggered, and we can then whitelist or blacklist IPs based on that.
2. Like email services do, we will load a 1x1 px image on the client side via an API call. In simple words, we won't add the "img" tag directly in our HTML, but rather JS that calls an API on our server, which returns the image to the client.
3. We also have a method to detect good bots, like Google's crawler that indexes our pages, so we can differentiate between good bots and bad bots that just waste our resources (see the sketch after this list).
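A minimal sketch of how point 3 could work, based on the standard reverse-then-forward DNS check for verifying Googlebot; the function name is ours and the comments only illustrate the idea:

```python
# Verify a claimed Googlebot with reverse DNS, then confirm with a forward lookup (sketch).
import socket

def is_verified_googlebot(ip: str) -> bool:
    try:
        host, _, _ = socket.gethostbyaddr(ip)            # reverse DNS lookup
    except socket.herror:
        return False
    if not host.endswith((".googlebot.com", ".google.com")):
        return False                                     # wrong domain: not Google
    try:
        return socket.gethostbyname(host) == ip          # forward lookup must match
    except socket.gaierror:
        return False

# A genuine Googlebot IP passes both checks; a bot that merely fakes the
# Googlebot user agent fails at the reverse-DNS step.
```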
We are at a very basic level. In fact, all our code does right now is log each IP and the assets requested by that IP in Elasticsearch.
So we need ideas on how people spam/crawl websites via crawlers/bots/etc., so that we can come up with a solution. If possible, please also mention the pros and cons of your ideas and ways to defend against them.
Thanks in advance. If you share your ideas, you'll be helping a startup that will be doing a lot of good stuff.

By hosting some assets on a server and others on a CDN, will a browser negotiate more connections?

Short question
I'm trying to speed up a static website (HTML, CSS, JS). The site also has many images that weigh a lot, and during testing it seems the browser is struggling to manage all these connections.
Apart from the normal compression, I'm wondering whether hosting my site files (HTML, CSS, JS) on the VPS and moving all the images onto a CDN would let the browser negotiate with two servers and be quicker overall. (Obviously there is going to be no Mbps speed gain, as that is limited by the user's connection, but I'm wondering if this would allow me to have more open connections and thus a quicker TTFB.)
Long question with background
I've got a site that I'm working on speeding up; the site itself is all client-side HTML, JS, and CSS. The speed issues tend to be a) the size of the files and b) the quantity of files, i.e. there are lots of images, and they each weigh a lot.
I'm going to do all the usual stuff: combine the images used for the UI into a sprite sheet, compress all images using JPEGmini, etc.
I've moved the site to a VPS, which has made a big difference. The next move I'm pondering is setting up a CDN to host the images. I'm not doing it for geographical distribution or load balancing (although that is an added benefit); I was wondering whether the bottleneck on downloading assets would shrink if the user's browser fetched all the site files (HTML, JS, CSS) from the VPS while at the same time fetching the images from the CDN. Is that where the bottleneck is, i.e. the user's browser can only open so many connections to one server at a time, but with two servers it could negotiate the same number of connections on both servers concurrently?
I'm guessing there could also be an issue with load on the server, but for testing I'm using a 2 GB multi-core VPS that no one else is on, so that shouldn't be a problem during my tests.
Context
Traditionally, web browsers place a limit on the number of simultaneous connections a browser can make to one domain. These limits were established in 1999 in the HTTP/1.1 specification by the Internet Engineering Task Force (IETF). The intent of the limit was to avoid overloading web servers and to reduce internet congestion. The limit written into the specification was no more than two simultaneous connections with any server or proxy; modern browsers typically raise this to around six per host.
Solution to that limitation: Domain Sharding
Domain sharding is a technique to accelerate page load times by tricking browsers into opening more simultaneous connections than are normally allowed.
See this article for an example of using multiple subdomains to multiply parallel connections to the CDN.
So instead of:
http://cdn.domain.com/img/brown_sheep.jpg
http://cdn.domain.com/img/green_sheep.jpg
The browser can use parallel connections by using a subdomain:
http://cdn.domain.com/img/brown_sheep.jpg
http://cdn1.domain.com/img/green_sheep.jpg
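If you do shard, it helps for each asset to map to the same subdomain every time, so the browser cache isn't defeated by the same image arriving from different hosts. A small sketch of a deterministic, hash-based mapping; the host names are placeholders:

```python
# Deterministically assign each asset path to one of N CDN subdomains (sketch).
import zlib

SHARDS = ["cdn.domain.com", "cdn1.domain.com", "cdn2.domain.com"]  # placeholders

def shard_url(path: str) -> str:
    # Hash the path so the same image is always served from the same host,
    # keeping it cacheable instead of being re-downloaded once per shard.
    host = SHARDS[zlib.crc32(path.encode()) % len(SHARDS)]
    return f"http://{host}{path}"

# shard_url("/img/brown_sheep.jpg") always returns the same subdomain for that path.
```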
Present: beware of Sharding
You might consider the downsides of domain sharding, because it isn't necessary and can even hurt performance under SPDY. SPDY is supported by Chrome, Firefox, Opera, and IE 11, so if those are the browsers most of your visitors use, you might want to skip domain sharding.

Video and song streaming website development

Guys, I need your help here.
I got a new project in PHP to build a website. The basic functionality: the admin can upload songs and videos from the backend, and the frontend will display songs, videos, categories, and other music-related content. I'll be using CI or CakePHP, but my main concern is hosting. Traffic will be 5K users per day. What is the best hosting solution for a site where users will stream songs and videos online? Do I need to buy a dedicated server, or will cloud or a VPS do?
Also, if you have built this kind of website, please suggest some guidelines.
For 5K daily users it is generally better to go for a dedicated server, as the machine is solely yours. If the number of viewers increases tomorrow, you won't run into the slow loading times that shared cloud hosting can bring, which make a bad impression on users who then leave your site and don't return.
