One of our advertising networks for a site I administer and develop is requesting the following:
We have been working on increasing performance on XXXX.com and our team feels that if we can set up the following CNAME on that domain it will help increase rates:
srv.XXXX.com d2xf3n3fltc6dl.XXXX.net
Could you create this record with your domain registrar? The reason we need you to create this CNAME is to preserve domain transparency within our RTB. Once we get this set up, I will make some modifications in your account that should have some great results.
Would this not open up our site to cross-site scripting vulnerabilities? Wouldn't malicious code be able to masquerade as coming from our site to bypass same-origin policy protection in browsers? I questioned him on this and this was his response:
First off let me address the benefits. The reason we would like you to create this CNAME is to increase domain transparency within our RTB. Many times when ads are fired, JS is used to scrape the URL and pass it to the buyer. We have found this method to be inefficient because sometimes the domain information does not reach the marketplace. This causes an impression (or hit) to show up as “uncategorized” rather than as “XXXX.com”, and this results in lower rates because buyers pay up to 80% less for uncategorized inventory. By creating the CNAME we are ensuring that your domain shows up 100% of the time, and we usually see CPM and revenue increases of 15-40% as a result.
I am sure you are asking yourself why other ad networks don’t do this. The reason is that this is not a very scalable solution, because as you can see, we have to work with each publisher to get this set up. Unlike big box providers like Adsense and Lijit, OURCOMPANY is focused on maximizing revenue for a smaller amount of quality publishers, rather than just getting our tags live on as many sites as possible. We take the time and effort to offer these kinds of solutions to maximize revenue for all parties.
In terms of security risks, they are minimal to none. You will simply be pointing a subdomain of XXXX.com to our ad creative server. We can’t use this to run scripts on your site, or access your site in any way.
Adding the CNAME is entirely up to you. We will still work our hardest to get the best rates possible, with or without that. We have just seen great results with this for other publishers, so I thought that I would reach out and see if it was something you were interested in.
This whole situation raised red flags with me, but it is really outside of my knowledge of security. Can anyone offer any insight into this, please?
This would enable cookies set at the XXXX.com level to be read by each site, but it would not allow other Same-Origin Policy actions unless both sites opt in. Both sites would have to set document.domain = 'XXXX.com'; in client-side script to allow scripted access between the two origins.
From MDN:
Mozilla distinguishes a document.domain property that has never been set from one explicitly set to the same domain as the document's URL, even though the property returns the same value in both cases. One document is allowed to access another if they have both set document.domain to the same value, indicating their intent to cooperate, or neither has set document.domain and the domains in the URLs are the same (implementation). Were it not for this special policy, every site would be subject to XSS from its subdomains (for example, https://bugzilla.mozilla.org could be attacked by bug attachments on https://bug*.bugzilla.mozilla.org).
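To make that opt-in concrete, here is a minimal sketch in TypeScript, assuming (hypothetically) that a page on www.XXXX.com frames a page served from srv.XXXX.com once the CNAME is in place. Note that setting document.domain is deprecated in modern browsers.

```typescript
// Sketch only: hypothetical page paths, assuming www.XXXX.com frames a page
// served from srv.XXXX.com (the proposed CNAME target).

// --- script running on https://www.XXXX.com/index.html ---
document.domain = "XXXX.com"; // the parent document explicitly opts in

const frame = document.querySelector<HTMLIFrameElement>("#ad-frame")!;
frame.addEventListener("load", () => {
  try {
    // This read only succeeds if the framed document has *also* set
    // document.domain = "XXXX.com"; otherwise the same-origin policy
    // blocks it with a SecurityError.
    console.log(frame.contentDocument?.title);
  } catch (err) {
    console.log("Blocked by the same-origin policy:", err);
  }
});

// --- script running on https://srv.XXXX.com/ad.html (the ad server) ---
// document.domain = "XXXX.com"; // the framed document must opt in as well
```

In other words, the CNAME alone does not grant the ad server scripted access to your pages; both documents would have to cooperate explicitly.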
Related
I'm using a web platform for my real estate business that "due to technical reasons" cannot offer subdomains. Instead, if an individual in my company wants credit for the leads that come in due to their own marketing efforts, they would be required to manually add a URL parameter to every link they share (i.e. ?agent=xxxxx). This is clearly absurd.
I could write a Chrome plugin or bookmarklet that adds the agent= parameter for them, but this isn't a universal solution.
Is it possible to host a "faux" domain which would function like its own website, but pull all of its resources from my main website (while adding in the URL parameter that triggers the tracking cookie)?
Hope this makes sense.
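For instance, is something like the following forwarding sketch sensible? The hostnames and agent ID below are just placeholders, and I assume in practice I would also need to forward cookies and handle absolute links.

```typescript
// Minimal forwarding sketch; hostnames and the agent ID are placeholders.
import http from "node:http";

const MAIN_SITE = "https://www.example-platform.com"; // assumed main site
const AGENT_ID = "xxxxx";                              // assumed agent ID

http.createServer(async (req, res) => {
  // Rebuild the requested path against the main site and append ?agent=...
  const target = new URL(req.url ?? "/", MAIN_SITE);
  target.searchParams.set("agent", AGENT_ID);

  const upstream = await fetch(target, {
    headers: { accept: String(req.headers.accept ?? "*/*") },
  });

  res.writeHead(upstream.status, {
    "content-type": upstream.headers.get("content-type") ?? "text/html",
  });
  res.end(Buffer.from(await upstream.arrayBuffer()));
}).listen(8080);
```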
How to find out if my site is being scraped?
I have some points so far:
Network bandwidth consumption, causing throughput problems (matches if a proxy is used).
When querying a search engine for my keywords, new references appear to other, similar resources with the same content (matches if a proxy is used).
Multiple requests from the same IP.
A high request rate from a single IP (by the way, what is a normal rate?).
A headless or unusual user agent (matches if a proxy is used).
Requests at predictable (equal) intervals from the same IP.
Certain supporting files are never requested, e.g. favicon.ico or various CSS and JavaScript files (matches if a proxy is used).
The client's request sequence, e.g. the client accesses pages that are not directly reachable (matches if a proxy is used).
Would you add more to this list?
Which points might fit/match if a scraper uses a proxy?
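For reference, here is how I currently check the request-rate and regular-interval points from my access log. The thresholds are just guesses, and turning raw log lines into (IP, timestamp) pairs is left out.

```typescript
// Rough sketch of the request-rate and regular-interval checks per IP.
// Thresholds are guesses; parsing raw log lines into (ip, timeMs) pairs
// is left out here.

type Hit = { ip: string; timeMs: number };

function suspiciousIps(hits: Hit[]): string[] {
  const byIp = new Map<string, number[]>();
  for (const h of hits) {
    const times = byIp.get(h.ip) ?? [];
    times.push(h.timeMs);
    byIp.set(h.ip, times);
  }

  const flagged: string[] = [];
  for (const [ip, times] of byIp) {
    times.sort((a, b) => a - b);

    // Indicator: high request rate from a single IP.
    const spanMinutes = Math.max((times[times.length - 1] - times[0]) / 60_000, 1 / 60);
    const ratePerMinute = times.length / spanMinutes;

    // Indicator: requests at predictable (near-equal) intervals.
    const gaps = times.slice(1).map((t, i) => t - times[i]);
    const mean = gaps.reduce((s, g) => s + g, 0) / Math.max(gaps.length, 1);
    const variance = gaps.reduce((s, g) => s + (g - mean) ** 2, 0) / Math.max(gaps.length, 1);
    const veryRegular = gaps.length >= 10 && Math.sqrt(variance) < 0.1 * mean;

    if (ratePerMinute > 60 || veryRegular) flagged.push(ip); // thresholds are guesses
  }
  return flagged;
}
```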
As a first note, consider whether it's worthwhile to provide an API for bots in the future. If you are being crawled by another company and it is information you want to provide to them anyway, your website is clearly valuable to them. Creating an API would reduce your server load substantially and give you 100% clarity on who is crawling you.
Second, coming from personal experience (I wrote web crawlers for quite a while), you can generally tell immediately by tracking which browser accessed your website. If they are using one of the automated tools, or an HTTP client from a development language, it will be uniquely different from your average user. Not to mention watching the log file and updating your .htaccess to ban them (if that's what you are looking to do).
Other than that, it's usually fairly easy to spot: repeated, very consistent opening of pages.
Check out this other post for more information on how you might want to deal with them, and for some thoughts on how to identify them:
How to block bad unidentified bots crawling my website?
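If banning them is what you're after, the same idea in application code (rather than .htaccess) looks roughly like this. The user-agent patterns are only examples, and a determined scraper will simply spoof a browser user agent, which is why you still want the log monitoring above.

```typescript
// Toy user-agent filter; the patterns below are illustrative, not a
// complete or authoritative bot list.
import http from "node:http";

const BLOCKED_AGENTS = [/curl/i, /python-requests/i, /scrapy/i, /wget/i, /^$/];

http.createServer((req, res) => {
  const ua = req.headers["user-agent"] ?? "";
  if (BLOCKED_AGENTS.some((pattern) => pattern.test(ua))) {
    res.writeHead(403);
    res.end("Forbidden");
    return;
  }
  res.writeHead(200, { "content-type": "text/plain" });
  res.end("Hello");
}).listen(8080);
```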
I would also add analysis of when the requests by the same people are made. For example, if the same IP address requests the same data at the same time every day, it's likely the process is on an automated schedule, and hence is likely to be scraping.
Possibly add analysis of how many pages each user session has touched. For example, if a particular user on a particular day has browsed to every page on your site and you deem this unusual, then perhaps it's another indicator.
It feels like you need a range of indicators, a score for each, and a way to combine the scores to show who is most likely to be scraping.
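A sketch of that scoring idea, with made-up indicator names, weights, and cut-off:

```typescript
// Sketch of combining indicators into a single suspicion score per client.
// Indicator names, weights, and the cut-off are made up for illustration.

type Indicators = {
  highRequestRate: boolean;   // e.g. from the access-log analysis above
  regularIntervals: boolean;
  noAssetRequests: boolean;   // never fetches favicon.ico / CSS / JS
  oddUserAgent: boolean;
  sameTimeEveryDay: boolean;
  fullSiteTraversal: boolean;
};

const WEIGHTS: Record<keyof Indicators, number> = {
  highRequestRate: 2,
  regularIntervals: 2,
  noAssetRequests: 1,
  oddUserAgent: 1,
  sameTimeEveryDay: 2,
  fullSiteTraversal: 1,
};

function suspicionScore(ind: Indicators): number {
  return (Object.keys(WEIGHTS) as (keyof Indicators)[]).reduce(
    (sum, key) => sum + (ind[key] ? WEIGHTS[key] : 0),
    0,
  );
}

// Example: treat anything scoring 4 or more (an arbitrary cut-off) as a
// likely scraper worth manual review.
const score = suspicionScore({
  highRequestRate: true,
  regularIntervals: true,
  noAssetRequests: true,
  oddUserAgent: false,
  sameTimeEveryDay: false,
  fullSiteTraversal: false,
});
console.log(score >= 4 ? "likely scraper" : "probably fine");
```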
Assume http://chaseonline.chase.com is a real URL with a web server sitting behind it, i.e. this URL resolves to an IP address, or probably several, so that there can be a lot of identical servers allowing load balancing of client requests.
I guess that Chase probably buys up URLs that are "close" in the URL namespace (how should "close" be defined here? Lexicographically? I think that is not trivial, because it depends on an ordering one defines on top of URL strings... never mind this comment).
Suppose that one of the URLs (http://mychaseonline.chase.com, http://chaseonline.chase.ua, http://chaseonline.chase.ru, etc.) is "free" (not yet bought). I buy one of these free URLs and write a phishing/spoofing server that sits behind my URL and renders the same screen as https://chaseonline.chase.com/.
I work to get my URL indexed (hopefully) at least as high as or higher than the real one (http://chaseonline.chase.com). Chances are (hopefully) most bank clients/users won't notice my bogus URL, and I start collecting <user id, password> tuples. I then use my server as a client in relation to the real bank server, http://chaseonline.chase.com, and log in with each of my collected <user id, password> tuples to create mischief.
Is this a cross-site request forgery? How would one prevent this from occurring?
What I'm hearing in your description is a phishing attack, albeit with slightly more complexity. Let's address some of these points.
2) It is really hard to buy up all the URLs, especially when you take into consideration variations such as Unicode look-alikes, or even just simple kerning tricks. For example, the letters r and n next to each other look a lot like an m at a quick glance: welcome to chаse.rnobile.com! So with that said, I'd guess that most companies just buy the obvious domains.
4) Getting your URL indexed higher than the real one is, I'll posit, impossible. Google et al. are likely sophisticated enough to keep that type of thing from happening. One approach to getting above Chase in the SERPs would be to buy AdWords for something like "Bank Online With Chase." But there again, I'd assume that the search engines have a decent filtering/fraud-prevention mechanism to catch this type of thing.
Mostly, you'd be better off keeping your server from being indexed, since that would simply attract attention. Because this type of thing will be shut down, I presume most phishing attacks go for large numbers of small 'fish' (larger ROI) or small numbers of large 'fish' (think targeted phishing attacks against execs, bank employees, etc.).
I think you offer up an interesting idea in point 4: there's nothing to stop a man-in-the-middle attack wherein your site delegates out to the target site for each request. The difficulty with that approach is that you'd spend a ton of resources on creating a replica website. When you think of most hacking as a business trying to maximize its ROI, a lot of the "this is what I'd do if I were a hacker" ideas go away.
If I were to do this type of thing, I'd provide a login facade, have the user provide me their credentials, and then redirect to the main site on POST to my server. This way I get your credentials and you think there's just been an error on the form. I'm then free to pull all the information off of your banking site at my leisure.
There's nothing cross-site about this. It's a simple forgery.
It fails for a number of reasons: lack of security (your site isn't HTTPS), malware protection vendors explicitly check against this kind of abuse, Google won't rank your forgery above highly popular sites, and finally banks with a real sense of security use 2 Factor Authentication. The login token you'd get for my bank account is valid for a few seconds, literally, and can't be used for anything but logging in.
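On the two-factor point: a time-based one-time password (RFC 6238) is derived from a shared secret and the current 30-second time step, which is why a phished code expires almost immediately. A minimal sketch of the mechanism, not any particular bank's scheme:

```typescript
// Minimal RFC 6238 TOTP sketch, not any particular bank's implementation.
import { createHmac } from "node:crypto";

function totp(secret: Buffer, stepSeconds = 30, digits = 6, nowMs = Date.now()): string {
  const counter = BigInt(Math.floor(nowMs / 1000 / stepSeconds));
  const msg = Buffer.alloc(8);
  msg.writeBigUInt64BE(counter);

  const hmac = createHmac("sha1", secret).update(msg).digest();
  const offset = hmac[hmac.length - 1] & 0x0f;        // dynamic truncation (RFC 4226)
  const code = hmac.readUInt32BE(offset) & 0x7fffffff;
  return String(code % 10 ** digits).padStart(digits, "0");
}

// The server checks the submitted code against the current time step (and
// usually one step either side), so a code captured by a fake login page
// becomes useless within a minute or so.
const secret = Buffer.from("12345678901234567890"); // RFC 6238 SHA-1 test secret
console.log(totp(secret));
```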
I'm currently developing a web application that has one feature which allows input from anonymous users (no authorization required). I realize that this may carry security risks such as repeated arbitrary inputs (e.g. spam) or users posting malicious content, so to remedy this I'm trying to create a system that keeps track of what each anonymous user has posted.
So far all I can think of is tracking by IP, but that may not be viable due to dynamic IPs. Are there any other solutions for anonymous user tracking?
I would recommend requiring them to answer a CAPTCHA before posting, or after an unusual number of posts from a single IP address.
"A CAPTCHA is a program that protects websites against bots by generating and grading tests >that humans can pass but current computer programs cannot. For example, humans can read >distorted text as the one shown below, but current computer programs can't"
That way the spammers have to be actual humans, which will slow the firehose to a level where you can weed out anything that does get through.
http://www.captcha.net/
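A sketch of the "CAPTCHA after an unusual number of posts from one IP" idea, with an in-memory counter and arbitrary thresholds; the CAPTCHA check itself is whatever service you plug in.

```typescript
// Sketch: require a CAPTCHA once an IP exceeds a posting threshold within
// a time window. The threshold, window, and in-memory store are
// illustrative; a real deployment would use a shared store such as Redis.

const WINDOW_MS = 10 * 60 * 1000; // 10 minutes (assumption)
const FREE_POSTS = 3;             // posts allowed before a CAPTCHA (assumption)

const recentPosts = new Map<string, number[]>();

function needsCaptcha(ip: string, now = Date.now()): boolean {
  const times = (recentPosts.get(ip) ?? []).filter((t) => now - t < WINDOW_MS);
  recentPosts.set(ip, times); // drop entries outside the window
  return times.length >= FREE_POSTS;
}

function recordPost(ip: string, now = Date.now()): void {
  recentPosts.set(ip, [...(recentPosts.get(ip) ?? []), now]);
}

// Usage in a request handler (pseudo-flow):
//   if (needsCaptcha(clientIp) && !captchaSolved(request)) ask for a CAPTCHA;
//   else { recordPost(clientIp); accept the post; }
```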
There are two main ways: client-side and server-side. Tracking the IP is all I can think of server-side; client-side there are more accurate options, but they are all under the user's control, and the user can re-anonymise themselves (it's their machine, after all): cookies and storage come to mind.
Drop a cookie with an ID on it. Sure, cookies can be deleted, but this at least gives you something.
My suggestion is:
Use cookies for tracking user identity. As you yourself have said, due to dynamic IP addresses, you can't reliably use those for tracking identity.
To detect and curb spam, use the IP + browser user-agent combination.
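A minimal sketch combining both points: a random ID cookie for identity, plus an IP + user-agent hash as the key for spam counting. The cookie name and attributes are my own choices, not a standard.

```typescript
// Sketch: a random ID cookie for per-visitor identity plus an IP + user-agent
// key for spam throttling. Cookie name and attributes are assumptions.
import http from "node:http";
import { randomUUID, createHash } from "node:crypto";

http.createServer((req, res) => {
  // 1. Identity: reuse the anon_id cookie if present, otherwise set one.
  const cookies = Object.fromEntries(
    (req.headers.cookie ?? "").split(";").map((c) => c.trim().split("=")),
  );
  const anonId = cookies["anon_id"] ?? randomUUID();
  if (!cookies["anon_id"]) {
    res.setHeader("Set-Cookie", `anon_id=${anonId}; Path=/; HttpOnly; SameSite=Lax`);
  }

  // 2. Spam key: IP + user agent, hashed so it can be used as a counter key.
  const ip = req.socket.remoteAddress ?? "unknown";
  const ua = req.headers["user-agent"] ?? "unknown";
  const spamKey = createHash("sha256").update(`${ip}|${ua}`).digest("hex");

  res.end(`visitor ${anonId}, spam bucket ${spamKey.slice(0, 12)}`);
}).listen(8080);
```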
I've noticed that some email services (like gmail or my school's webmail) will redirect links (or used to) in the email body. So when I put "www.google.com" in the body of my email, and I check that email in gmail or something, the link says something like "gmail.com/redirect?www.google.com".
This was very confusing for me and the people I emailed (like my parents, who are not familiar with computers). I always clicked on the link anyway, but why is this service used? (I'm also worried that maybe my information was being sent somewhere... Do I have anything to worry about? Is something being stored before the redirect?)
Sorry if this is unwarranted paranoia. I am just curious about why some things work the way they do.
Wikipedia has a good article on URL redirection. From the article:
Logging outgoing links
The access logs of most web servers keep detailed information about where visitors came from and how they browsed the hosted site. They do not, however, log which links visitors left by. This is because the visitor's browser has no need to communicate with the original server when the visitor clicks on an outgoing link. This information can be captured in several ways. One way involves URL redirection. Instead of sending the visitor straight to the other site, links on the site can direct to a URL on the original website's domain that automatically redirects to the real target. This technique bears the downside of the delay caused by the additional request to the original website's server. As this added request will leave a trace in the server log, revealing exactly which link was followed, it can also be a privacy issue. The same technique is also used by some corporate websites to implement a statement that the subsequent content is at another site, and therefore not necessarily affiliated with the corporation. In such scenarios, displaying the warning causes an additional delay.
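In code, the technique the article describes is just an endpoint that records the outgoing target and then redirects. A minimal sketch; the /redirect path, the u parameter, and the log file are assumptions, not how Gmail actually implements it:

```typescript
// Minimal "log outgoing links, then redirect" sketch. The path, query
// parameter, and logging are assumptions, not Gmail's actual mechanism.
import http from "node:http";
import { appendFileSync } from "node:fs";

http.createServer((req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost:8080");

  if (url.pathname === "/redirect") {
    const target = url.searchParams.get("u");
    if (!target) {
      res.writeHead(400);
      res.end("missing u parameter");
      return;
    }
    // The extra hop is what lets the site record which link was followed.
    appendFileSync("outgoing.log", `${new Date().toISOString()} ${target}\n`);
    res.writeHead(302, { Location: target });
    res.end();
    return;
  }

  // A page whose outgoing links all pass through the redirector.
  res.writeHead(200, { "content-type": "text/html" });
  res.end(`<a href="/redirect?u=${encodeURIComponent("https://www.google.com")}">Google</a>`);
}).listen(8080);
```

A real implementation would also validate or sign the target so the endpoint can't be abused as an open redirect.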
So, yes, Google (and Facebook and Twitter, which do this too) is logging where its service is taking you. This is important for a variety of reasons: it lets them know how their service is being used, shows trends in data, allows links to be monetized, etc.
As far as your concerns, my personal opinion is that, if you're on the internet, you're being tracked. All the time. If this is concerning to you, I would recommend communicating differently. However, for the most part, I think it's not worth worrying about.
This redirection is a dereferrer: it avoids disclosing the URL in the HTTP Referer field to third-party sites, since that URL can contain sensitive data such as a session ID.
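As a generic illustration of how such a dereferrer can work (not necessarily how any particular webmail provider does it), the intermediate page can declare a no-referrer policy before sending the browser on, so the target site never sees the original URL:

```typescript
// Generic dereferrer sketch: the interstitial document declares
// Referrer-Policy: no-referrer, so navigations it triggers carry no
// Referer header (and therefore no session data from the original URL).
import http from "node:http";

http.createServer((req, res) => {
  const url = new URL(req.url ?? "/", "http://localhost:8080");
  // NOTE: a real implementation must validate and escape the target to
  // avoid open-redirect and HTML-injection problems.
  const target = url.searchParams.get("u") ?? "/";

  res.writeHead(200, {
    "content-type": "text/html",
    "referrer-policy": "no-referrer",
  });
  res.end(
    `<meta http-equiv="refresh" content="0; url=${target}">` +
      `<p>Redirecting to <a href="${target}" rel="noreferrer">${target}</a>…</p>`,
  );
}).listen(8080);
```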