Images with unknown content: Dangerous for a browser? - security

Let's say I allow users to link to any images they like. The link would be checked for syntactical correctness, escaping etc., and then inserted in an <img src="..."/> tag.
Are there any known security vulnerabilities, e.g. by someone linking to "evil.example.com/evil.jpg", and evil.jpg contains some code that will be executed due to a browser bug or something like that?
(Let's ignore CSRF attacks - it must suffice that I will only allow URLs with typical image file suffixes.)
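Roughly the kind of check I have in mind (a minimal sketch; the scheme whitelist, suffix list and escaping rules are just illustrative):
function buildImgTag(url) {
  var parsed;
  try {
    parsed = new URL(url);               // throws on syntactically invalid URLs
  } catch (e) {
    return null;
  }
  if (!/^https?:$/.test(parsed.protocol)) return null;            // no javascript:, data:, ...
  if (!/\.(jpe?g|png|gif)$/i.test(parsed.pathname)) return null;  // typical image suffixes only
  var escaped = url.replace(/&/g, '&amp;').replace(/"/g, '&quot;')
                   .replace(/</g, '&lt;').replace(/>/g, '&gt;');
  return '<img src="' + escaped + '" alt="" />';
}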

Security risks in image files crop up from time to time. Here's an example: https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-22_11-5388621.html?tag=nl.e019. It's an old article, so obviously these things have been rolling around for a while.
While it's impossible to say for sure that something is always safe/never safe, so far it sounds like the risks have been relatively low, and are patched by the image viewer manufacturers pretty quickly. IMO the best test is how often you hear about actual problems occurring. This threat vector has been a known possibility for years, but hasn't really become widespread. Given the extent to which people link images in public forums, I'd expect it to become a big problem pretty quickly, if it was a realistic sort of attack.

There was a JPEG buffer overrun some time ago. Also, you have to account for images that actually contain code, so that you don't execute that code.

Yes, this could be a problem. There are quite a few known exploits that work by using vulnerabilities in the image rendering code of the browser or the OS, including remote execution vulnerabilities. It might not be the easiest flaw to exploit, but it is definitely a concern.
An example of such a vulnerability: http://www.securityfocus.com/bid/14282/discuss (but you could find tons of other vulnerabilities of the same type).
I think I remember a high-visibility site having exactly this kind of vulnerability exploited. An advertisement image was displayed from some third-party ad provider, and the image had not been checked. Thousands of users were compromised... I can't find the story anymore, sorry.

So there are potential buffer overflows when handling untrusted data, however you get to it. Also, if you insert untrusted data in the form of URLs into your page, then there is a risk of XSS flaws. However, I guess you want to know why browsers flag it as a problem:
Off the top of my head:
I believe referrer information is still sent in this case. Even if you don't and won't ever use URL rewriting for sessions, you are still exposing information that was previously confidential. chris_l: True, I just did a quick test - the browser (FF 3.5) sends the Referer header.
You cannot be sure that the image data returned will not be misleading. Wrong text on buttons, for instance, or, in bigger images, spoofed instructions.
Image size may change the layout. Image loading could even be delayed to shift the page at a critical moment. chris_l: Good point! I should always set width and height (they could be determined by the server when the user posts the image - that would work as long as the image doesn't change... better ideas?)
Images can be used for AJAX-like functionality. chris_l: Please expand on this point - how does that work? (See the sketch after this list.)
Browsers will flag an issue with your site, hiding other problems and conditioning users to accept slack security practices. chris_l: That's definitely an important problem, when the site uses HTTPS.
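To expand on the AJAX-like point, a hypothetical sketch (the tracker URL and payload are made up): a page can push data to any third-party server just by constructing an image URL; the payload rides in the query string and the response is never inspected.
// An image request used as a one-way data channel: no XMLHttpRequest needed,
// and any reachable endpoint can receive the data.
function beacon(data) {
  var img = new Image();
  img.src = 'https://tracker.example/pixel.gif?d=' + encodeURIComponent(data);
}
beacon('page=checkout&total=42');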

UPDATE: Proven wrong! (Cookies are scoped to the domain that set them, so the embedding site's session ID is never sent to the image host.)
You also have to consider that cookies (that means session IDs) are also sent to the server where the image is located. So another server gets your session ID. If the image actually contains PHP code, it could steal the session ID:
For example you include:
<img src="http://example.com/somepic.jpg" alt="" />
On the server at http://example.com there's an .htaccess file containing the following:
RewriteRule ^somepic\.jpg$ evilscript.php
then the pic is actually a PHP file that generates the image, but also does some evil stuff, like session stealing or whatever...
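For illustration only, a rough Node.js analogue of that idea (the original example uses PHP; the file name and port are made up): a script answers the image request with a real picture while logging whatever the browser sends along. What it actually receives is the Referer plus cookies scoped to its own domain, which is why the claim above was proven wrong.
const http = require('http');
const fs = require('fs');
http.createServer(function (req, res) {
  if (req.url === '/somepic.jpg') {
    // Log what the browser sent: the Referer of the embedding page and any
    // cookies for THIS host - not the embedding site's session cookie.
    console.log('Referer:', req.headers['referer']);
    console.log('Cookies:', req.headers['cookie']);
    res.writeHead(200, { 'Content-Type': 'image/jpeg' });
    fs.createReadStream('real-picture.jpg').pipe(res); // assumed to exist on disk
  } else {
    res.writeHead(404);
    res.end();
  }
}).listen(8080);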

Related

Which points should be noted and observed while building a web application so that it functions well on most web browsers?

I'm working especially with Java web applications (mostly with JSF, JavaServer Faces). I'm less concerned with the rest of the technologies.
Since different web browsers behave somewhat differently from one another, any web application should be designed so that it renders and executes in a defined way on most browsers (though perhaps not all). Which points should be kept in mind to design a web application so that it functions almost identically on most browsers?
What are the major differences among different browsers which should be noted by web application developers?
You have to check all of these points before developing web applications in any language...
Almost all web developers (ahem! – perhaps that should read “quite a lot of web developers”) are aware of the need to check how their site looks in a variety of browsers. How far you go obviously depends on the resources available – not everyone is in a position to check Windows, Mac, Unix and Linux platforms. The minimum test would probably be:
Firefox, as that has the best standards compliance and is the second most-used browser;
Internet Explorer for Windows – currently the most widely used browser. It is essential to check both versions 6 and 7, as version 7 fixed quite a lot of bugs in 6 but introduced a new set of its own. (Microsoft is however still kicking developers in the teeth by not making it possible to install both versions on the same computer; you will either need two computers or one of the work-arounds available on the net.) Version 5 should preferably also be checked; as of spring 2008 the number of users is not yet negligible. However it is now uncommon enough that you needn’t worry about cosmetic issues; as long as the site is readable that should be sufficient.
Opera – growing in popularity due to its speed and pretty good standards compliance.
For some time I also recommended checking Netscape 4 as well, as it often produces radically different results from any other browser, and was very popular for a long time. However the number of users of this bug-ridden browser is now so small (under 0.1% and decreasing) that it can now probably safely be ignored.
Check printed pages
Print some of the pages on a normal printer (i.e. with a paper size of A4 or Letter) and check that they appear sensibly. Due to the somewhat limited formatting options available for printing, you probably can’t achieve an appearance comparable to a document produced by a word-processor, but you should at least be able to read the text easily, and not have lines running off the right-hand side of the page. It is truly extraordinary how many site authors fail to think of this most elementary of operations.
You should also consider using CSS to adjust the appearance of the page when printed. For example you could – probably should – suppress the printing of information which is not relevant to the printed page, such as navigation bars. This can be done using the “@media print” rule, or by pulling in a print-only stylesheet with “@import”.
Some sites provide separate “printer friendly” versions of their pages, which the user can select and print. While this may occasionally be necessary as a last resort, it significantly increases the amount of work needed to maintain the site, is inconvenient for the reader and shouldn’t usually be needed.
Switch Javascript off
There are unfortunately quite a number of Internet sites which abuse Javascript by, for example, generating unwanted pop-ups and irritating animations. There are also a number of Javascript-related security holes in browsers, especially Internet Explorer. As a result a lot of readers switch Javascript off – indeed I often do myself. (I have a page giving the reasons in more detail.) Some organisations even block the usage of Javascript completely. Furthermore few, if any, search engines support Javascript.
It is therefore important to check that your site still functions with Javascript disabled. A lot of sites rely – quite unnecessarily – on Javascript for navigation, with the result that the lack of Javascript renders the site unusable.
Clearly if you need it for essential content, that functionality will be lost. But there is no reason why the basic text of the site should be unavailable.
Avoid nearly-meaningless messages like “Javascript needed to view this site”. If you have something worth showing, tell the user what it is, e.g. “enable Javascript to see animation of solar system”.
Switch plug-ins off
The considerations for plug-ins (such as Flash or Java) are very similar to those for Javascript above. Check the site with any plug-ins disabled. The basic text and navigation should still work.
Interest the reader sufficiently, and he might just go to the trouble of down-loading the plug-in. Greet him with a blank screen or a “You need Flash to read this site” message and he will probably go away, never to return.
Switch images off
If scanning a number of sites quickly for information, many readers (including myself) switch images off, for quick loading. Other people cannot view images. So switch images off and check that the site is readable and navigable. This means, in particular, checking that sensible ALT texts have been provided for images. (This check is similar to using a text browser, but not quite the same).
It's worth taking a look at this link for more info.

Keep SVGs from Being Accessed by User

I'm putting together a mobile version of a webpage which consists entirely of client art. For the old-fashioned desktop version, I just used PNGs, but I really wanted to use SVG for mobile. SVGZ would be smaller and resolution independent, so it seemed like a perfect use case.
But the client is worried that, once his art is online in SVG, anyone could download the files and use his art illegally (he's had stuff he worked on pirated before, so he takes this pretty seriously.) This had never occurred to me until he brought it up, but the SVG would basically be his original source art.
I was wondering if there's any way to prevent the SVG files from being accessed by the user. As far I know this is impossible -- making the files available to the user-agent means making them available to the user -- but I wanted to ask around to be sure.
Thanks for any help.
No, this is impossible. If a web browser can request the files for display, then any computer anywhere can request the files and save the direct results.
Serving up intentionally degraded artwork (e.g. rasterization) is the only way to prevent people from having the originals. Of course, a determined thief could still re-trace the PNG and get a vectorized, resolution-independent close approximation of the original.
Your client could alternatively (a small sketch of the first two ideas follows this list):
Include copyright comments in the source, proving ownership. (Yes, a thief could delete these.)
Include 'hidden' elements (0% opacity or placed under another item), proving ownership. (Yes, a thief could delete these.)
Use data steganography in the source SVG to watermark it (e.g. vary the decimal values in a path in a manner minor enough to not affect the result, but still embed custom data). (Yes, any thief suspecting this could lower decimal precision or transform all values in a manner that might remove this.)
Trust in the law to protect his works, or provide a recourse if they are stolen.
Trust in the goodness of most of mankind to not do this.
Decide that theft is the sincerest form of flattery, and not worry about it. :)
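As mentioned above, here is a minimal Node.js sketch of the first two ideas (the file names and copyright text are placeholders; a thief can of course strip both marks):
const fs = require('fs');
// Read the original SVG, then stamp an ownership comment and an invisible
// zero-opacity text element right after the opening <svg> tag.
const svg = fs.readFileSync('artwork.svg', 'utf8');
const marked = svg.replace(/<svg[^>]*>/, function (openingTag) {
  return openingTag +
    '\n  <!-- Copyright 2011 Example Artist. All rights reserved. -->' +
    '\n  <text x="0" y="0" opacity="0" font-size="1">Copyright Example Artist</text>';
});
fs.writeFileSync('artwork.marked.svg', marked);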

When designing a website, do you need to consider users who disable CSS?

Have we finally got to the point where we assume CSS2, and hope for CSS3?
(Not looking for discussion, if the answer is "yes, you idiot", go for it...)
You should always take into consideration users who
A. use screen readers and text-only browsers
B. are on mobile devices
C. are not human (i.e. search engine spiders)
By having a good separation of content and style, you should be able to address each of these with ease. As far as users who have CSS disabled, in this day and age, I don't think a designer should concern themselves over it too much. It's certainly not worth spending a significant amount of time and resources on.
What is your target audience and what is your cost for supporting (or not supporting) certain clients?
In addition to the fine points made by pst and ttreat31, I'll add that using semantic markup will generally let your document be readable with CSS disabled (i.e. using the browser's default CSS).
There may be a few quirks (forms come to mind), but generally I find with my own pages, they are plenty readable.
You, and your business, will probably survive if you require CSS. But you'll probably do better if you DON'T require it.
By catering for non-CSS cases, you'll write better markup, with better-structured content. You'll mitigate cross-browser problems, and develop a more robust API. Search engines will be able to parse and 'understand' your content that much better.
Allowing for 'no CSS' is much more about the philosophies relating to web standards and good coding practises than it is actually about the common final rendering.
I don't take any effort to help users who disable CSS or javascript. If I worked on a site which counted on attracting new customers and had lots of first time hits, then I would probably try and give non-javascript users a scaled down set of features. But I would never bother with users who disable CSS. I think that is probably a very small minority.
I often surf in the terminal using links or lynx when my computer is overloaded and I just can't have Firefox, Java, and some Flash applications taking half of my RAM. Text-only browsers don't have advanced CSS or Javascript support.
Many server administrators might do a similar thing, as most servers are headless, and some administrators might be too lazy to open their other laptop just for a quick browse. People using screen readers get a similar view to a text-only browser, except the content is read aloud instead of displayed as text.
When using text browsers, I wouldn't expect any fancy colors or tables; usually I just need some quick information. So, IMO, you should at least make all the page's essential information available as plain HTML.

how can I protect scraping of certain data on my web pages? [closed]

I want to protect only certain numbers that are displayed after each request. There are about 30 such numbers. I was planning to have images generated in place of those numbers, but if the image is not warped as with a captcha, won't scripts be able to decipher the number anyway? Also, how much of a performance hit would loading images be vs text?
The only way to make sure bad-guys don't get your data is not to share it with anyone. Any other solution is essentially entering an arms race with the screen-scrapers. At one point or another, one of you will find the arms-race too costly to continue. If the data you are sharing has any perceptible value, then probably the screen-scrapers will be very determined.
It's not possible.
You use javascript and encrypt the page, using document.write() calls after decrypting. I either scrape from the browser's display or feed the page through a JS engine to get the output.
You use Flash. I can poke into the flash file and get the values. You encrypt them in the flash and I can just run it then grab the output from the interpreter's display as a sequence of images.
You use images and I can just feed them through an OCR.
You're in an arms race. What you need to do is make your information so useful and your pages so easy to use that you become the authority source. It's also handy to change your output formats regularly to keep up, but screen scrapers can handle that unless you make fairly radical changes. Radical changes drive users away because the page is continually unfamiliar to them.
Your image solution won't help much, and images are far less efficient. A number is usually only a few bytes long in HTML encoding. Images start at a few hundred bytes and expand to 1k or more depending on how large you want them. Images also will not render in the font the user has selected for their browser window, and are useless to people who use assistive devices (visually impaired people).
Apart from the images, you could display the numbers using JavaScript or flash.
You could also use CSS to position individual digits using various combinations of absolute or relative positions.
You could also use JavaScript to help you create these DIVs.
The point is just to obfuscate enough that it becomes really hard.
One more solution is to use images of segments or single dots and re-construct the images of the digits using CSS, a bit like a dot-matrix display.
You could litter the source of the page with these absolutely positioned DIVs and again make it more difficult to reconstruct by creating them dynamically.
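For example, a rough sketch of that positioned-DIV idea (the number, offsets and shuffled order are made up):
// Render the number 4721 as absolutely positioned DIVs, emitted in shuffled
// order so the digits never appear consecutively in the page source.
var digits = [
  { ch: '2', left: 20 },
  { ch: '4', left: 0 },
  { ch: '1', left: 30 },
  { ch: '7', left: 10 }
];
var container = document.createElement('div');
container.style.position = 'relative';
container.style.height = '20px';
digits.forEach(function (d) {
  var el = document.createElement('div');
  el.textContent = d.ch;
  el.style.position = 'absolute';
  el.style.left = d.left + 'px';
  container.appendChild(el);
});
document.body.appendChild(container);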
At any rate, you can't stop a determined scraper from getting to the data: it doesn't take a lot to automate a web browser and take screenshots that can be fed to an OCR.
There is nothing stopping anyone from paying someone pennies to get the data manually anyway.
The point is: how determined are your opponents (users?).
It's a bit like the software protection business: making things hard enough that you would deter casual 'pirates' is not too hard, and it's a fairly good approach in general.
However, if there is much value in the data you present, there is nothing you can really do to protect it.
All you can do is make it hard enough so that casual 'thieves' will prefer to continue paying for your services rather than circumvent it.
Javascript would probably be the easiest to implement, but you could get really creative: have large blocks of numbers with only certain ones viewable, by placing layers on top of the invalid numbers, blending the wrong numbers into the background, or making them invisible via CSS and semi-randomly generated class names.
I can't believe I'm promoting a common malware scripting tactic, but...
You could encode the numbers and have Javascript decode and render them at runtime.
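A minimal sketch of that, with a made-up value (the page source only ever contains the encoded string):
// 'NzM=' is base64 for "73"; the real value appears only after the script runs,
// so a scraper has to execute (or at least decode) the Javascript to see it.
var encoded = 'NzM=';
var span = document.createElement('span');
span.textContent = atob(encoded);
document.body.appendChild(span);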
Generate an image containing those numbers and display the image. :-)
I think you guys are being too reactive with these solutions. Javascript, Captcha, even litigation and the DMCA process don't address the complex adaptive nature of web scraping and data theft. Don't you think the "ideal" solution to prevent malicious bots and website scraping would be something working in a real-time proactive mitigation strategy? Very similar to a Content Protection Network. Just say'n.
Examples:
IBM - IBM ISS Data Security Services
DISTIL - www.distil.it
Can you provide a little more detail on what it is you're doing? Certainly there's a performance hit to create an image instead of dumping out the text of a number, but how often would you be doing this per day?
Using JavaScript is the same as using text. It's trivial to reverse engineer.
Use animated numbers in Flash. It may not be foolproof, but it would make it harder to crack.
What about posting a lot of dummy numbers and showing the right ones with external CSS? Just as long as the scraper doesn't start to parse the external CSS.
Don't output the numbers, i.e. prefix
echo $secretNumber;
with //.
For all those that recommend using Javascript, or CSS to obfuscate the numbers, well there's probably a way around it. Firefox has a plugin called abduction. Basically what it does is saves the page to a file as an image. You could probably modify this plugin to save the image, and then analyze the image to find out the secret number that is trying to be hidden.
Basically, if there's enough incentive behind scraping these numbers from the page, then it will be done. Otherwise, just post a regular number, and make it easier on your users so they won't have to worry so much about not being able to copy and paste the number, or other such problems the result from this trickery.
Just do something unexpected and weird (different every time) with the CSS box model. Force them to actually use a browser-backed screen scraper.
I don't think this is possible. You can make their job harder (use images, as some suggested here), but that is all you can do; you can't stop a determined person from getting the data. If you don't want them to scrape your data, don't publish it, as simple as that...
Assuming these numbers are updated often (if they aren't then protecting them is completely moot as a human can just transcribe them by hand) you can limit automated scraping via throttling. An automated script would have to hit your site often to check for updates, if you can limit these checks you win, without resorting to obfuscation.
For pointers on throttling see this question.
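For illustration, a very crude in-memory, per-IP throttle sketch in Node.js (the limits, port and response are assumptions; a real setup would want shared storage and smarter accounting):
const http = require('http');
const WINDOW_MS = 60 * 1000;  // one-minute window (assumed)
const MAX_HITS = 10;          // assumed limit per IP per window
const hits = new Map();       // ip -> timestamps of recent requests
http.createServer(function (req, res) {
  const ip = req.socket.remoteAddress;
  const now = Date.now();
  const recent = (hits.get(ip) || []).filter(t => now - t < WINDOW_MS);
  recent.push(now);
  hits.set(ip, recent);
  if (recent.length > MAX_HITS) {
    res.writeHead(429, { 'Content-Type': 'text/plain' });
    res.end('Too many requests - slow down.');
    return;
  }
  res.writeHead(200, { 'Content-Type': 'text/plain' });
  res.end('Here are the protected numbers: ...'); // placeholder response
}).listen(8080);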

Search by hash?

I had the idea of a search engine that would index web items like other search engines do now but would only store the file's title, url and a hash of the contents.
This way it would be easy to find items on the web if you already had them and didn't know where they came from or wanted to know all the places that something appeared.
More useful for non-textual items like images, executables and archives.
I was wondering if there is already something similar?
Check out the Wikipedia page on locality-sensitive hashing. There's also a good page hosted by a researcher at MIT.
In general, there are several flavors available: hashes for strings (such as simhash), sets or 0/1 features (such as min-wise hashes), and for real vectors.
The main trick for numerical hashes is basically dimension reduction, so far. For strings, the idea is to come up with a representation that's robust in the face of minor edits.
I'm also doing a little research in this field, although I guess stackoverflow might not be the right place for nascent work.
The question seems to focus on exact match hashes, which we understand better than nearest-neighbor approaches, and are indeed worthwhile, especially if people can share tags and other metadata that way.
As #rjmunro notes, hash-based searching is a popular idea in the P2P world, and Bitzi did pretty much this, though they have shut down and their Bitpedia (Digital Media Encyclopedia) isn't hosted there any more, though some of it at least is still available at Archive.org.
Bitzi also produced software like Bitcollider (SourceForge.net),
and the Magnet URI scheme, which allows for specifying a file by hash and is thus a content-based identifier. Various applications support searching at various databases via Magnet URIs as described at that Wikipedia page.
The same idea is popular in the password-cracking scene - see e.g. findmyhash - Python script to crack hashes using online services etc.
Going a step further, I think it would be great if there were databases and online repositories identifying content by hash and providing tags and other metadata about the content from various perspectives. Then I could leave my music collection in its pristine state (no wasted backup space and time), but still tag them myself and add other metadata, via external tag databases. If my applications knew how to grab the tags, it would seem much better than the current system where we modify and copy around big files just to move tags from e.g. my desktop to my phone.
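As a minimal sketch of the exact-match flavour, here's how such a lookup key could be computed in Node.js (the database URL is made up):
const crypto = require('crypto');
const fs = require('fs');
// Hash the file contents only - the name, tags and other metadata stay external.
function contentHash(path) {
  const data = fs.readFileSync(path);
  return crypto.createHash('sha256').update(data).digest('hex');
}
const key = contentHash('track01.mp3');                         // hypothetical local file
console.log('lookup key:', key);
console.log('query:', 'https://hashdb.example/lookup/' + key);  // hypothetical service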
See a related idea at Metadata Independent Hashing for Media Identification & P2P Transfer Optimisation (pdf).
Well, for images, there's http://tineye.com, which will one-up that, and find you similar images too.
It's not a bad idea. Sometimes I find myself stumbling upon some file and trying to figure out where it came from :) But how are you going to track an item's sources? Content can be obtained by various means - web browser, download manager, or simply by copying from a network share.
If I understand your proposal right, http://bitzi.com/ has done this for a while.

Resources