How to crawl a group of websites looking for CSP issues?

We have started implementing CSP headers on some client sites, initially in report-only mode. I'm reluctant to switch it to enforce mode because we may break a client's website if we missed something.
Is there a script (similar to cURL) which I can use to check for CSP issues (similar to the Google Webmaster Tools -> Console), so I can scan entire sites quickly?

In addition to being able to crawl, such a script would need to apply the CSP rules and report or log violations in some way. I don't know of any such script. There are also differences between browsers in how CSP is implemented, both in CSP level and in various implementation choices, so even if you can scan the entire website without findings, there could theoretically still be violations reported by other browsers.
If your main concern is the sources being loaded, you could build a script that extracts all the source URLs. This would still be a time-consuming task: you would have to write and fine-tune a regex to first filter out comments and XML namespace URLs, then apply another regex to find the actual source URLs. The output would have to be checked against your policy, likely a manual process. Finally, you would also need to check script and style files to find potential violations there. I have done this in a script looking for http references in client code; it is a massive task that will likely consume more hours than updating the CSP as violations occur.
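As a rough sketch of that extraction step only (assuming Node.js 18+ for the built-in fetch; the URL list and the naive regex are placeholders and no substitute for a real HTML parser):

// extract-sources.js - hypothetical helper; assumes Node.js 18+ (built-in fetch)
const targets = ['https://example.com/', 'https://example.com/about/'];

// Naive attribute regex: good enough for a first pass, not a real HTML parser
const attrPattern = /(?:src|href)\s*=\s*["']([^"']+)["']/gi;

async function main() {
  for (const url of targets) {
    const html = await (await fetch(url)).text();
    const sources = new Set();
    for (const match of html.matchAll(attrPattern)) {
      sources.add(match[1]);
    }
    console.log(url);
    for (const src of sources) console.log('  ' + src);
  }
}

main();

The output still has to be compared against your policy by hand, which is exactly the manual step described above.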
As you already have a CSP in report-only mode, your confidence in the policy will grow over time as real users interact with the websites. Keep the report-only policy and add a more permissive enforced policy alongside it. As confidence grows, make the enforced policy stricter by bringing it closer to the report-only version. Whenever violations occur, check whether they can be reproduced or are just something that happens for individual clients due to specific browsers, browser extensions, rewrites by proxies, etc.
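For example, both headers can be sent side by side; the directive values below are placeholders, not a recommended policy:

Content-Security-Policy: default-src 'self' https:; report-uri /csp-report
Content-Security-Policy-Report-Only: default-src 'self'; script-src 'self'; report-uri /csp-report

The enforced header stays deliberately loose while the report-only header carries the stricter policy you are working towards.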

You can scan an entire site quickly by using a simple JavaScript page:
<!DOCTYPE html>
<html>
<head></head>
<body>
Current Url: <span id='url'></span><br>
<iframe id='site' src='about:blank' onload='reload()' width='100%' height='400px'></iframe>
<script nonce="test">
var page = 0;
var pages = [ // URLs taken from sitemap.xml
  'https://example.com', 'https://example.com/csp/', 'https://example.com/es/',
  'https://example.com/en/', 'https://example.com/csp/about/', 'https://example.com/contacts/'
];
// Each time the iframe finishes loading, advance to the next URL in the list.
function reload() {
  if (page < pages.length) {
    document.getElementById('url').innerHTML = pages[page];
    document.getElementById('site').src = pages[page];
    page = page + 1;
  }
  else {
    document.getElementById('url').innerHTML = 'All done';
    document.getElementById('site').onload = "";
    document.getElementById('site').src = "about:blank";
  }
}
</script>
</body>
</html>
Since the scanned website is loaded in an <iframe>, you need to remove the frame-ancestors directive and the X-Frame-Options header if they are present.
This script can be used in any browser, but onclick/onmouseover/etc. events will not fire, and forms will not be filled in or submitted. Therefore, not all possible CSP violations will be detected by scanning.
You should collect and review violation reports for two weeks or more before switching the CSP to enforce mode.
You can also inject this script into the scanned website itself and skip the <iframe> (listen for the page's load event instead of the iframe's onload).
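If you do not already have somewhere to send the reports, a minimal collection endpoint can be as simple as the following sketch (assuming Node.js with Express installed; the /csp-report path and logging are placeholders):

// csp-report-endpoint.js - minimal sketch; assumes Node.js with Express installed
const express = require('express');
const app = express();

// CSP reports are posted as JSON, typically with the application/csp-report content type
app.use(express.json({ type: ['application/csp-report', 'application/json'] }));

app.post('/csp-report', (req, res) => {
  // Log the raw report; a real endpoint would store and aggregate these
  console.log(JSON.stringify(req.body, null, 2));
  res.status(204).end();
});

app.listen(3000, () => console.log('CSP report collector listening on :3000'));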

Related

Expire workboxjs cached items based on querystring

I'm using Workbox to pre-cache and cache resources in my SharePoint project.
SharePoint uses a lot of its own JS / CSS out of the box that I would like to have cached.
SharePoint renders the src tags for JS and CSS with a revision ID appended to the querystring,
something like this:
<link rel="stylesheet" type="text/css" href="/_layouts/15/1046/styles/SuiteNav.css?rev=tyIeEoGrLkQjn4siLhDMLw%3D%3DTAG0"/>
<script type="text/javascript" src="/_layouts/15/1046/initstrings.js?rev=mwvYlbIyUbEbxtCpAg383w%3D%3DTAG0"></script>
I would like to expire those resources based on the revision (rev) querystring.
Can this be done with anything out of the box in Workbox, or will I need something like a custom plugin?
Can you point me at the documentation / an example?
Thank you in advance.
If you're dealing with resources that are not available at build time, because, for instance, those URLs only exist on the Sharepoint server, then you'll have to use runtime caching to handle them.
I'd recommend going with a cache-first strategy for resources like those that have revisioning information in their URL, since they'll (hopefully!) only change if their URL changes.
In terms of making sure that you eventually expire the older entries, that's a bit trickier to do aggressively. What I'd suggest instead is that you use a more conservative approach to expiration, with a maximum number of entries in your cache that is some multiple of the expected number of URLs that will be valid at any given time. For instance, if you know that there are likely to be ~10 URLs that are the "most recent" at any given time, then configuring cache expiration to start expiring entries once you've reached ~30 or so items in the cache would be reasonable.
Putting that all together, you could configure Workbox's runtime caching as follows (using the new v3 syntax):
workbox.routing.registerRoute(
  // Customize this pattern as needed.
  new RegExp('/_layouts/15/1046/'),
  workbox.strategies.cacheFirst({
    cacheName: 'sharepoint-assets',
    plugins: [
      new workbox.expiration.Plugin({maxEntries: 30}),
    ],
  })
);

How do I safely insert potentially malformed HTML into a JSF page?

I'm working on a JSF page that needs to have the corporate privacy policy updated. Rather than copy-and-paste the new text, I'd prefer to have the PrimeFaces dialog that displays it link to the privacy policy elsewhere. So, I'm doing this:
<p:dialog id="dlgPrivacyPolicy">
<ui:include src="https://cdn.mycompany.com/privacy/en-us/privacy.htm"/>
</p:dialog>
The problem is, the HTML on that page is slightly malformed; there's a <meta> tag that isn't closed. This causes my JSF page to fail to compile.
I could track down whoever maintains that page and ask them to correct it, but that's a band-aid. If any more malformed HTML shows up on that page, it will crash mine. And having my page fail to load because the privacy policy didn't close a tag just isn't acceptable.
Is there a safe way for me to insert potentially malformed HTML into my page? Or am I stuck with copying and pasting if I really want to avoid that issue?
If you don't want XHTML compilation problems, you should not include the malformed page on the server side but on the client side, for example by running an AJAX request for it and inserting it using the innerHTML of the dlgPrivacyPolicy div.
Using jQuery:
$.ajax({
  url: "https://cdn.mycompany.com/privacy/en-us/privacy.htm"
})
.done(function( html ) {
  $( "#dlgPrivacyPolicy" ).html( html );
});
Considering your requirements (mentioned in your question and comments) I'd suggest using jsoup: you can fetch the HTML content server-side, sanitize it, and then use the sanitized content on your page. The sanitizing step is completely up to you (and jsoup's great capabilities), and can include removing unused/unsafe parts of the page (e.g. headers, CSS, etc.) as required.
I'm afraid that including a complete HTML page verbatim is always going to be painful. There's the risk of malformed HTML, or the page might do funny things like overwrite CSS styles, pollute global JavaScript scope or whatever.
I think the only clean, maintainable solution will be to agree on some kind of (web) service that provides the privacy policy in a well-defined format (HTML, XHTML, whatever) suitable for inclusion elsewhere. This also makes sure the provider of the privacy policy does not suddenly decide to change the URL, or include a popup or similar. The important point is that the service is an official service with agreed-upon rules.
If you cannot get that service, you'll have to find workarounds. The best I can think of would be to filter the policy through some tolerant HTML parser on your side to fix it (at runtime, or as part of the build). Then you can also fix things like over-eager CSS rules or bad Javascript, as applicable.
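As a sketch of that workaround (assuming a Node.js build step with the jsdom package installed; the file names are placeholders), a tolerant parser can re-serialize the fetched page into well-formed markup before it ever reaches the JSF compiler:

// fix-privacy-html.js - build-step sketch; assumes Node.js 18+ and `npm install jsdom`
const { JSDOM } = require('jsdom');
const fs = require('fs');

async function main() {
  const res = await fetch('https://cdn.mycompany.com/privacy/en-us/privacy.htm');
  const html = await res.text();

  // jsdom tolerates unclosed tags (like the stray <meta>) and builds a well-formed DOM
  const dom = new JSDOM(html);

  // Serialize just the body back out as well-formed XML suitable for a Facelets include
  const xhtml = new dom.window.XMLSerializer().serializeToString(dom.window.document.body);
  fs.writeFileSync('privacy-fragment.xhtml', xhtml);
}

main();

You would then include the generated fragment instead of the remote page, at the cost of the policy only being as fresh as your last build.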

Does comet server data in the iframe just accumulate?

I push [script]dosomething()[/script] tags into the iframe for my comet server using chunked data, but the script tags just continue to accumulate forever. How do I wipe them after every script tag?
Wipe script tag
P.S.: If you want to wipe script tags, it is probably better to follow the "Wipe script tag without reconnecting" approach in the answer below.
I believe you should close the connection after some time (not ideal; see the other answer below instead), which automatically frees up the memory associated with that request. You then of course need to reconnect. This page even says something else:
"Page Streaming" means the browser
discovers server changes almost
immediately. This opens up the
possibility of real-time updates in
the browser, and allows for
bi-directional information flow.
However, it's quite a departure from
standard HTTP usage, which leads to
several problems. First, there are
unfortunate memory implications,
because the Javascript keep
accumulating, and the browser must
retain all of that in its page model.
In a rich application with lots of
updates, that model is going to grow
quickly, and at some point a page
refresh will be necessary in order to
avoid hard drive swapping, or a worse
fate.
This advises reloading the page, which is also an option. But I think closing that connection (iframe) and reconnecting might also work.
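A minimal sketch of that reconnect approach (the element id and streaming endpoint are placeholder names) would be to tear the iframe down once it has grown too large and create a fresh one pointing at the same endpoint:

// Sketch: periodically replace the forever-frame to release accumulated script tags.
// 'comet-frame' and '/comet/stream' are placeholder names.
function reconnectCometFrame() {
  var old = document.getElementById('comet-frame');
  if (old) {
    old.parentNode.removeChild(old); // closes the chunked connection and frees its DOM
  }
  var frame = document.createElement('iframe');
  frame.id = 'comet-frame';
  frame.style.display = 'none';
  frame.src = '/comet/stream';
  document.body.appendChild(frame);
}

// For example, reconnect every few minutes.
setInterval(reconnectCometFrame, 5 * 60 * 1000);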
Comet has a lot of problems you need to hack around:
As you can read on this wiki page, it also has problems with a "reliable error handling method, and the impossibility of tracking the state of the request calling process."
Also, Internet Explorer needs to be sent some junk to get the process started (see http://cometdaily.com/2007/11/05/the-forever-frame-technique/).
That's why I again recommend using socket.io (see below), which takes care of all this nonsense.
Socket.io
I advise you to use socket.io instead, which is a very good product. It is supported by all major browsers. As you can see, it supports a lot of transports (XHR, WebSockets, etc.) and chooses the best one available in your browser for the best performance.
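As a minimal sketch of the server side (assuming Node.js and `npm install socket.io`; the port and event name are placeholders):

// comet-socketio.js - minimal sketch; assumes Node.js and `npm install socket.io`
const { Server } = require('socket.io');
const io = new Server(3000, { cors: { origin: '*' } });

io.on('connection', (socket) => {
  // Push messages to the client without any iframe / script-tag tricks
  socket.emit('message', 'Hello World');
});

On the client, with the socket.io client library loaded, you would connect with io('http://localhost:3000') and register a handler with socket.on('message', ...) instead of relying on injected script tags.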
Wipe script tag without reconnecting
You can remove the script tag every time it is executed by adding some code to each chunk the server prints.
<script type="text/javascript">
// Calls your message handler
app.handle("Hello World");
// Removes this script element
var scripts = document.getElementsByTagName("script"),
    elem = scripts[scripts.length - 1];
elem.parentNode.removeChild(elem);
</script>
Compressed version
<script type="text/javascript">
app.handle("Hello World");
(function(){var a=document.getElementsByTagName("script"),a=a[a.length-1];a.parentNode.removeChild(a)})();
</script>
But the hidden iframe, or forever iframe, is too annoying to use, as Alfred mentioned. Personally, I think this classical way makes Comet look graceless and charmless.
jQuery Stream
My recommendation is to use jQuery Stream, which provides a unified two-way communication interface over WebSocket and HTTP. It is a lightweight client-side JavaScript library, like jQuery itself.
The enhanced iframe transport used by jQuery Stream differs from the classical one in several ways: it requires a text/plain response containing only messages instead of a text/html response, and it empties the response every time it is handled.
According to some users' tests, Internet Explorer 8 using the enhanced iframe transport has no problem with messages of several megabytes (unlike Firefox using XMLHttpRequest as the transport, which really struggles).

How to check if my website is being accessed by a crawler?

How can I check whether a certain page is being accessed by a crawler or a script that fires continuous requests?
I need to make sure that the site is only being accessed from a web browser.
Thanks.
This question is a great place to start:
Detecting 'stealth' web-crawlers
Original post:
It would take a bit of work to engineer a solution.
I can think of three things to look for right off the bat:
One, the user agent. If the spider is Google or Bing or anything else legitimate, it will identify itself.
Two, if the spider is malicious, it will most likely emulate the headers of a normal browser. Fingerprint it: if it claims to be IE, use JavaScript to check for an ActiveX object.
Three, take note of what it is accessing and how regularly. If the content takes the average human X seconds to view, you can use that as a starting point when trying to determine whether it is humanly possible to consume the data that fast. This is tricky; you will most likely have to rely on cookies, since an IP can be shared by multiple users.
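As a rough illustration of that third idea (a sketch only, assuming a Node.js/Express server; the cookie name and thresholds are arbitrary), you could track how quickly requests arrive per visitor cookie and flag anything implausibly fast:

// rate-check.js - sketch of per-visitor request-rate tracking (Express middleware)
const express = require('express');
const crypto = require('crypto');
const app = express();

const lastSeen = new Map();          // visitorId -> timestamp of previous request
const MIN_HUMAN_INTERVAL_MS = 2000;  // arbitrary: humans rarely load pages faster than this

app.use((req, res, next) => {
  // Parse a simple visitor cookie (or assign one on the first visit)
  const cookies = Object.fromEntries(
    (req.headers.cookie || '').split(';').map(c => c.trim().split('=')).filter(p => p[0])
  );
  let id = cookies.visitorId;
  if (!id) {
    id = crypto.randomUUID();
    res.setHeader('Set-Cookie', 'visitorId=' + id + '; HttpOnly; Path=/');
  }

  const now = Date.now();
  const previous = lastSeen.get(id);
  lastSeen.set(id, now);

  if (previous !== undefined && now - previous < MIN_HUMAN_INTERVAL_MS) {
    // Suspiciously fast; a real app might log, throttle, or challenge instead of blocking
    console.log('Possible crawler:', id, req.path);
  }
  next();
});

app.get('/', (req, res) => res.send('Hello'));
app.listen(3000);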
You can use the robots.txt file to block access to crawlers, or you can use JavaScript to detect the browser agent and switch based on that. If I understood correctly, the first option is more appropriate, so:
User-agent: *
Disallow: /
Save that as robots.txt at the site root, and no well-behaved crawler should check your site.
I had a similar issue in my web application because I created some bulky data in the database for each user that browsed the site, and crawlers were causing loads of useless data to be created. However, I didn't want to deny access to crawlers because I wanted my site indexed and found; I just wanted to avoid creating useless data and to reduce the crawl time.
I solved the problem in the following ways:
First, I used the HttpBrowserCapabilities.Crawler property from the .NET Framework (since 2.0), which indicates whether the browser is a search engine web crawler. You can access it from anywhere in the code:
ASP.NET C# code behind:
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;
ASP.NET HTML:
Is crawler? = <%=HttpContext.Current.Request.Browser.Crawler %>
ASP.NET Javascript:
<script type="text/javascript">
var isCrawler = <%=HttpContext.Current.Request.Browser.Crawler.ToString().ToLower() %>
</script>
The problem with this approach is that it is not 100% reliable against unidentified or masked crawlers, but it may be useful in your case.
After that, I had to find a way to distinguish between automated robots (crawlers, screen scrapers, etc.) and humans, and I realised that the solution required some kind of interactivity, such as clicking on a button. Some crawlers do process JavaScript, and it is quite likely they would trigger the onclick event of a button element, but not of a non-interactive element such as a div. The following is the HTML / JavaScript code I used in my web application www.so-much-to-do.com to implement this feature:
<div
  class="all rndCorner"
  style="cursor:pointer;border:3;border-style:groove;text-align:center;font-size:medium;font-weight:bold"
  onclick="$TodoApp.$AddSampleTree()">
  Please click here to create your own set of sample tasks to do
</div>
This approach has been working impeccably until now, although crawlers could be changed to be even more clever, maybe after reading this article :D

Is it possible to detect Internet Explorer Enhanced Security Configuration in javascript?

Is there any method to tell from javascript if the browser has "enhanced security configuration" enabled?
I keep running into problems with certain controls not working from within dynamically loaded content. This only happens with browsers running on Windows Server 2003/2008 systems - even when I add the server to the "trusted" zone.
Maybe somebody has already developed a method for accomplishing this task?
Thanks in advance
Instead of testing for IE ESC directly, we can test for its effects.
I found that with ESC enabled the onclick events of dynamically added content would not fire.
So I am testing those events directly.
var IEESCEnabled = true;
var testButton = $("<button style=\"display: none;\" onclick=\"IEESCEnabled = false; alert('No problems here.');\">Test IE ESC</button>");
testButton.click();
if (IEESCEnabled) {
  alert("We have a problem.");
}
<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.11.1/jquery.min.js"></script>
In my application a test like this forwards the user to a page explaining their issue. It is accompanied by a noscript element to check that they have JavaScript running at all.
I don't think it's possible, and if it still is, then that's a bug that might sooner or later be fixed.
One of the main points of this "extra security" was for the client to have it but not to be detected by the servers, thus leaving them no way to know when to try to circumvent it and when not.
Isn't JavaScript disabled when using Enhanced Security Configuration?
Then if you only want to display a message to the user, simply display a message in normal HTML and hide it with JavaScript, so only users without JavaScript will see it. If you need to handle it on the server side (e.g. outputting a different version of your website), simply include JavaScript to redirect users to your JavaScript-enabled version. Users without JavaScript will remain on the non-JS page.
If only scriptable ActiveX controls are disabled, the same method applies: simply insert an ActiveX control and try to script it; if that fails you can redirect, show a message, etc.
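A minimal sketch of that check (IE-only; the ProgID is just a commonly available one to test with, and the redirect target is a placeholder) might look like this:

<script type="text/javascript">
// Sketch: try to instantiate a common scriptable ActiveX control (IE only).
// If it throws, scripting of ActiveX is blocked or unavailable.
var activeXAvailable = false;
try {
  new ActiveXObject("Microsoft.XMLHTTP");
  activeXAvailable = true;
} catch (e) {
  activeXAvailable = false;
}
if (!activeXAvailable) {
  // Placeholder: redirect to a page explaining the restriction
  window.location.href = "/no-activex.html";
}
</script>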
The above of course doesn't detect Enhanced Security Configuration per se, only the symptoms that occur when it is enabled. So it probably wouldn't be able to distinguish between users using Enhanced Security Configuration and users who simply have JS/ActiveX disabled or use a browser that doesn't support scripting in the first place.
I think you can look for SV1 in the user agent string.
