a customer representative suggested that I try posting these questions here.
We spent some time monitoring issues with DocuSign loading slowly. While it was not slow every time, when it was slow it seemed to hang at a particular point in the process.
Below is a screenshot of a trace we ran in the browser; note the element which took 52 seconds to load. When loading was slow, it seemed to hang on this particular element. Can you offer any reasons why it could sometimes take 52 seconds or more to load this part?
We also have some other questions:
There seems to be continuous font downloading (2 or 3 MB in size) throughout the page-loading process. This occurs every time. Why is this, and can it be avoided?
Why do we sometimes see Seattle as the connection site when most of the time it is Chicago?
We noticed that DocuSign asks for permission to know our location. Does this location factor into where the document is downloaded from? Is the location also used in embedded signing processes?
Thank you for your assistance.
Unfortunately, without a bit more detail I am not entirely sure I can tell you why the page was loading so slowly. Is this consistent? If so, is it always the same document (perhaps a template?) where you see this slowness?
As for your other three questions:
In doing my own test and decrypting the web traffic via Fiddler, I see the fonts being loaded for each individual tag rather than once for the entire document. This is most likely because each tag has its own attributes that can be set (font included).
DocuSign's data centers are in Seattle, Chicago, and Dallas. DocuSign traffic can come from any of these three data centers, as the system exists synchronously in all three locations. More info can be found here.
DocuSign geo-location simply leverages the location capability of HTML5-enabled browsers, but the signer's IP address is recorded either way. It has no impact on which data center the traffic comes from. It is also included in the embedded signing process. It can be disabled on a per-brand basis in the Signing Resource File by setting the node DocuSign_DisableLocationAwareness to true.
One way to see how many people downloaded your extension is to look at the statistics in the Chrome Web Store.
Another way is to add a chrome.runtime.onInstalled.addListener inside background.js and send information to a server each time somebody installs the extension.
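For reference, the listener I am talking about looks roughly like this (the endpoint below is just a placeholder, not my real server):

// background.js -- rough sketch of the counting approach described above.
// https://example.com/track-install is a placeholder endpoint; the server
// behind it tallies one installation per unique IP address.
chrome.runtime.onInstalled.addListener(function () {
  fetch('https://example.com/track-install', { method: 'POST' });
});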
My problem is that the numbers collected by these two methods are not the same.
The number of downloads reported in the Chrome Web Store is lower than the number of installations collected by the second method (counting unique IP addresses).
Why? Can anybody explain it?
The Chrome Web Store also takes uninstalls into account, while your method only counts installs.
You can also detect uninstalls by setting a URL to open on uninstall and tallying it on your server. See https://developer.chrome.com/extensions/runtime#method-setUninstallURL
With that, the numbers will match more closely. Still not perfect, as the store sometimes takes weeks to add the stats for a day.
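A rough sketch of that idea, assuming a hypothetical tally endpoint on your own server:

// background.js -- Chrome opens this URL when the extension is uninstalled,
// so the server behind it (a placeholder here) can count the uninstall.
chrome.runtime.setUninstallURL('https://example.com/uninstalled');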
The number of unique IP addresses is not a reliable indicator of users, because users may have a dynamic address (instead of a static IP address that does not change), and multiple users may be sharing one IP address (behind a NAT or proxy).
And chrome.runtime.onInstalled is not just triggered upon a new installation of your extension, but also when the browser/extension is updated.
So, your way of counting unique users is flawed (and given the small number of users, it is likely that your method is overestimating the number of users).
The Chrome Web Store dashboard (for developers only) provides the number of daily installations (probably measured by counting the number of on-demand CRX downloads).
The Chrome Web Store publicly shows the number of weekly users (measured by counting the number of update checks per week).
This number is not the number of active weekly users, and probably over-estimates the number of actual users.
For example, I have an extension that used to have 1.7k users. Because the extension became obsolete, I published an update that sends a ping to my server and removes the extension itself (using chrome.management.uninstallSelf). Every week, I receive at most a few pings, yet the CWS claims that the extension has about 400 weekly users (these users probably disabled my extension; consequently the extension cannot remove itself but Chrome still checks for updates).
Accurately counting the number of users
If you want to know the number of installations, look at the CWS dashboard. If you want to continue to use the onInstalled method, at the very least check whether details.reason === 'install'.
If you want to have the most reliable indicator of "user", generate a random identifier and store it in chrome.storage.sync. Include this ID in requests to the server (for sample code, see Getting unique ClientID from chrome extension?).
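Putting those two suggestions together, a minimal sketch could look like this (the endpoint and the ID format are illustrative, not taken from the linked answer):

// background.js -- count only fresh installs, and identify a "user" by a
// random ID stored in chrome.storage.sync instead of by IP address.
chrome.runtime.onInstalled.addListener(function (details) {
  if (details.reason !== 'install') {
    return; // ignore 'update', 'chrome_update', etc.
  }
  chrome.storage.sync.get('clientId', function (items) {
    var clientId = items.clientId;
    if (!clientId) {
      // Illustrative ID generation; the linked answer shows other approaches.
      clientId = Date.now() + '-' + Math.random().toString(36).slice(2);
      chrome.storage.sync.set({ clientId: clientId });
    }
    // Placeholder endpoint that records one installation per client ID.
    fetch('https://example.com/track-install?id=' + encodeURIComponent(clientId));
  });
});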
Recently, I introduced server pings in one of my extensions to measure the number of users per Chrome version in a given day/week. In this effort, I prioritized the privacy of users over the accuracy of statistics (by storing the random ID in localStorage (which is not synchronized) and refreshing this ID at every major browser update).
If you want to learn more about the code behind it, see https://github.com/Rob--W/pdfjs-telemetry.
On some of our pages, we display some statistics like number of times that page has been viewed today, number of times it's been viewed the past week, etc. Additionally, we have an overall statistics page where we list the pages, in order, that have been viewed the most.
Today, we just insert these pageviews and event counts into our database as they happen. We also send them to Google Analytics via normal page tracking and their API. Ideally, instead of querying our database for these stats to display on our webpages, we would just query Google Analytics' API. Google Analytics does a FAR better job figuring out who the real uniques are and avoids counting people who artificially inflate their pageview counts (we allow people to create pages on our site).
So the question is if it's possible to use Google Analytics' API for updating the statistics on our webpages? If I cache the results is it more feasible? Or just occasionally update our stats? I absolutely love Google Analytics for our site metrics, but maybe there's a better solution for this particular need?
So the question is if it's possible to use Google Analytics' API for updating the statistics on our webpages?
Yes, it is. But the authentication process and the XML returned may slow things down. You can speed it up by limiting the rows/columns returned. Also, authentication for the way you want to display the data (if I understood you correctly) would require you to use the client authentication method: you send the username and password, so security is an issue.
I have done exactly what you described but had to put a loading graphic on the page for the stats.
If I cache the results is it more feasible? Or just occasionally update our stats?
Either one, but caching seems like it would work well, especially since GA data is not real-time anyway. You could make the API call and store (or process and then store) the returned XML for display later.
I haven't done this, but I think I might give it a go. It could even run as a scheduled job.
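A bare-bones sketch of that caching/scheduled-job idea (Node-style JavaScript; fetchAnalyticsStats() is a hypothetical helper that wraps the GA API call, authentication and parsing):

// Refresh the stats on a schedule and serve the cached copy to page requests,
// so visitors never wait on the Google Analytics API directly.
var cachedStats = null;

function refreshStats() {
  // fetchAnalyticsStats() is a placeholder for your GA API wrapper.
  fetchAnalyticsStats()
    .then(function (stats) { cachedStats = stats; })
    .catch(function (err) { console.error('GA refresh failed:', err); });
}

refreshStats();                             // warm the cache at startup
setInterval(refreshStats, 15 * 60 * 1000);  // then refresh every 15 minutes

// Request handlers then read only from cachedStats, never from the API.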
I absolutely love Google Analytics for our site metrics, but maybe there's a better solution for this particular need?
There are some third-party solutions (googling should root them out) but money and feasibility should be considered.
I'm developing a website and am sensitive to people screen scraping my data. I'm not worried about scraping one or two pages -- I'm more concerned about someone scraping thousands of pages as the aggregate of that data is much more valuable than a small percentage would be.
I can imagine strategies to block users based on heavy traffic from a single IP address, but the Tor network sets up many circuits that essentially mean a single user's traffic appears to come from different IP addresses over time.
I know that it is possible to detect Tor traffic: when I installed Vidalia with its Firefox extension, google.com presented me with a captcha.
So, how can I detect such requests?
(My website's in ASP.NET MVC 2, but I think any approach used here would be language independent)
I'm developing a website and am sensitive to people screen scraping my data
Forget about it. If it's on the web and someone wants it, it will be impossible to stop them from getting it. The more restrictions you put in place, the more you'll risk ruining user experience for legitimate users, who will hopefully be the majority of your audience. It also makes code harder to maintain.
I'll post countermeasures to any ideas future answers propose.
You can check their IP address against a list of Tor exit nodes. I know for a fact this won't even slow down someone who is interested in scraping your site. Tor is too slow; most scrapers won't even consider it. There are tens of thousands of open proxy servers that can easily be scanned for, or a list can be purchased. Proxy servers are nice because you can thread them or rotate them if your request cap gets hit.
Google has been abused by Tor users, and most of the exit nodes are on Google's blacklist; that's why you are getting a captcha.
Let me be perfectly clear: THERE IS NOTHING YOU CAN DO TO PREVENT SOMEONE FROM SCRAPING YOUR SITE.
By design of the Tor network components, it is not possible for the receiver to find out whether the requester is the original source or just a relay.
The behaviour you saw with Google was probably caused by a different security measure. Google detects if a logged-in user changes its IP and presents a captcha just in case, to prevent harmful interception and also to allow the continuation of the session if an authenticated user really did change their IP (by re-logging on to their ISP, etc.).
I know this is old, but I got here from a Google search, so I figured I'd get to the root concerns in the question. I develop web applications, but I also do a ton of abusing and exploiting other people's. I'm probably the guy you're trying to keep out.
Detecting Tor traffic really isn't the route you want to go here. You can detect a good number of open proxy servers by parsing request headers, but you've also got Tor, high-anonymity proxies, SOCKS proxies, cheap VPNs marketed directly to spammers, botnets and countless other ways to break rate limits.
If your main concern is a DDoS effect, don't worry about it. Real DDoS attacks take either muscle or some vulnerability that puts strain on your server. No matter what type of site you have, you're going to be flooded with hits from spiders as well as bad people scanning for exploits. Just a fact of life. In fact, this kind of logic on the server almost never scales well and can be the single point of failure that leaves you open to a real DDoS attack.
This can also be a single point of failure for your end users (including friendly bots). If a legitimate user or customer gets blocked you've got a customer service nightmare and if the wrong crawler gets blocked you're saying goodbye to your search traffic.
If you really don't want anybody grabbing your data, there are some things you can do. If it's blog content or something, I generally say either don't worry about it or have summary-only RSS feeds if you need feeds at all. The danger with scraped blog content is that it's actually pretty easy to take an exact copy of an article, spam links to it and rank it while knocking the original out of the search results. At the same time, because it's so easy, people aren't going to put effort into targeting specific sites when they can scrape RSS feeds in bulk.
If your site is more of a service with dynamic content, that's a whole other story. I actually scrape a lot of sites like this to "steal" huge amounts of structured proprietary data, but there are options to make it harder. You can limit the requests per IP, but that's easy to get around with proxies. For some real protection, relatively simple obfuscation goes a long way. If you try to do something like scrape Google results or download videos from YouTube, you'll find there's a lot to reverse engineer. I do both of these, but 99% of people who try fail because they lack the knowledge to do it. They can scrape proxies to get around IP limits, but they're not breaking any encryption.
As an example, as far as I remember, a Google results page comes with obfuscated JavaScript that gets injected into the DOM on page load, and then some kind of tokens are set that you have to parse out. Then there's an AJAX request with those tokens that returns obfuscated JS or JSON that's decoded to build the results, and so on and so on. This isn't hard to do on your end as the developer, but the vast majority of potential thieves can't handle it. Most of the ones that can won't put in the effort. I do this to wrap really valuable Google services, but for most other services I just move on to lower-hanging fruit at different providers.
Hope this is useful for anyone coming across it.
I think the focus on how it is 'impossible' to prevent a determined and technically savvy user from scraping a website is given too much significance. @Drew Noakes states that the website contains information that, when taken in aggregate, has some 'value'. If a website has aggregate data that is readily accessible by unconstrained anonymous users, then yes, preventing scraping may be near 'impossible'.
I would suggest that the problem to be solved is not how to prevent users from scraping the aggregate data, but rather what approaches could be used to remove the aggregate data from public access, thereby eliminating the target of the scrapers without the need to do the 'impossible': prevent scraping.
The aggregate data should be treated like proprietary company information. Proprietary company information is generally not available publicly to anonymous users in aggregate or raw form. I would argue that the solution to prevent the taking of valuable data is to restrict and constrain access to the data, not to prevent scraping of it once it is presented to the user.
1] User accounts/access – no one should ever have access to all the data within a given time period (data/domain specific). Users should be able to access data that is relevant to them, but clearly, from the question, no user would have a legitimate purpose to query all the aggregate data. Without knowing the specifics of the site, I suspect that a legitimate user may need only some small subset of the data within some time period. Requests that significantly exceed typical user needs should be blocked or throttled, so as to make scraping prohibitively time consuming and the scraped data potentially stale.
2] Operations teams often monitor metrics to ensure that large, distributed, complex systems are healthy. Unfortunately, it becomes very difficult to identify the causes of sporadic and intermittent problems, and often it is even difficult to identify that there is a problem as opposed to normal operational fluctuation. Operations teams often deal with statistically analysed historical data taken from numerous metrics, comparing them to current values to help identify significant deviations in system health, be it uptime, load, CPU utilization, etc.
Similarly, requests from users for data in amounts significantly greater than the norm could help identify individuals who are likely to be scraping data; such an approach can even be automated and extended further to look across multiple accounts for patterns that indicate scraping. User 1 scrapes 10%, user 2 scrapes the next 10%, user 3 scrapes the next 10%, and so on... Patterns like that (and others) could provide strong indicators of malicious use of the system by a single individual or a group utilizing multiple accounts (a rough sketch of this kind of throttling appears after point 3] below).
3] Do not make the raw aggregate data directly accessible to end-users. Specifics matter here, but simply put, the data should reside on back end servers and be retrieved utilizing some domain-specific API. Again, I am assuming that you are not just serving up raw data, but rather responding to user requests for some subsets of the data. For example, if the data you have is detailed population demographics for a particular region, a legitimate end user would be interested in only a subset of that data. For example, an end user may want to know addresses of households with teenagers that reside with both parents in multi-unit housing, or data on a specific city or county. Such a request would require the processing of the aggregate data to produce a resultant data set that is of interest to the end-user. It would be prohibitively difficult to scrape every resultant data set retrieved from numerous potential permutations of the input query and reconstruct the aggregate data in its entirety. A scraper would also be constrained by the website's security, taking into account the # of requests/time, the total data size of the resultant data set, and other potential markers. A well-developed API incorporating domain-specific knowledge would be critical in ensuring that the API is comprehensive enough to serve its purpose but not overly general so as to return large raw data dumps.
The incorporation of user accounts into the site, the establishment of usage baselines for users, the identification and throttling of users (or other mitigation approaches) who deviate significantly from typical usage patterns, and the creation of an interface for requesting processed/digested result sets (vs raw aggregate data) would create significant complexities for malicious individuals intent on stealing your data. It may be impossible to prevent scraping of website data, but the 'impossibility' is predicated on the aggregate data being readily accessible to the scraper. You can't scrape what you can't see. So unless your aggregate data is raw unprocessed text (for example library e-books), end users should not have access to the raw aggregate data. Even in the library e-book example, significant deviation from acceptable usage patterns, such as requesting a large number of books in their entirety, should be blocked or throttled.
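As a rough illustration of the throttling idea in points 1] and 2] above, a per-account request counter might look something like this (Express-style JavaScript; the limits, the in-memory store and the auth assumptions are all placeholders):

// Block or flag accounts whose request volume far exceeds a typical user's
// needs within a time window. A real system would persist these counts and
// compare them against per-user baselines rather than a fixed constant.
var WINDOW_MS = 60 * 60 * 1000;      // 1-hour window
var MAX_REQUESTS_PER_WINDOW = 500;   // arbitrary example limit
var counters = {};                   // userId -> { count, windowStart }

function throttleByAccount(req, res, next) {
  var userId = req.user && req.user.id;  // assumes auth middleware ran earlier
  if (!userId) return res.status(401).end();

  var now = Date.now();
  var entry = counters[userId];
  if (!entry || now - entry.windowStart > WINDOW_MS) {
    entry = counters[userId] = { count: 0, windowStart: now };
  }
  entry.count++;

  if (entry.count > MAX_REQUESTS_PER_WINDOW) {
    // Candidate for review: usage well above normal patterns.
    return res.status(429).send('Request limit exceeded');
  }
  next();
}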
You can detect Tor users using TorDNSEL - https://www.torproject.org/projects/tordnsel.html.en.
You can just use this command-line/library - https://github.com/assafmo/IsTorExit.
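For illustration, checking against the Tor Project's published exit-node list could look roughly like this (Node.js; the list URL is an assumption on my part and may change, so verify it before relying on it):

// Download the bulk exit-node list periodically and check incoming IPs against it.
var https = require('https');

var torExitNodes = new Set();

function refreshExitList() {
  // Assumed URL for the Tor Project's bulk exit list.
  https.get('https://check.torproject.org/torbulkexitlist', function (res) {
    var body = '';
    res.on('data', function (chunk) { body += chunk; });
    res.on('end', function () {
      torExitNodes = new Set(body.split('\n').map(function (line) { return line.trim(); }));
    });
  }).on('error', function (err) { console.error('Exit list refresh failed:', err); });
}

refreshExitList();
setInterval(refreshExitList, 60 * 60 * 1000); // refresh hourly

function isTorExit(ip) {
  return torExitNodes.has(ip);
}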
I have this problem. I have a web page with adult content, and for the past several months I have had PPC advertisements on it. I've noticed a big difference between the ad company's statistics for my page, Google Analytics data, and AWStats data on my server.
For example, the ad company tells me that I have 10K pageviews per day, Google Analytics tells me that I have 15K pageviews, and in AWStats it's around 13K pageviews. Which system should I trust? Should I write my own (and reinvent the wheel again)? If so, how? :)
The joke is that I have another web page with "normal" content (an MMORPG fan site), and those numbers are more or less equal in all three systems (ad company, GA, AWStats). Do you think it's because it's not an adult-oriented page?
And a final question that is totally off-topic: do you know of an ad company that pays per impression and doesn't mind adult sites?
Thanks for the answers!
First, you should make sure not to mix up »hits«, »files«, »visits« and »unique visits«. They all have different meanings and are sometimes named differently. I recommend looking up some definitions if you are confused about the terms.
AWStats probably has the most accurate statistics, because it has access to the web server's access.log. Unfortunately, a cached page (cached by the browser, by an ISP's proxy, or by your own caching server) might not produce a hit on the web server. Especially if your site is served with good caching hints that don't force revalidation and you are running your own web cache (e.g. Squid) in front of your site, the number will be considerably lower, because it only measures the work of the web server.
On the other hand, Google Analytics is only able to count requests from users who haven't blocked Google Analytics and have JavaScript enabled (but it will count pages served by a web cache). So this count can be influenced by the user, but isn't affected by web caches.
The ad company is probably simply counting the number of requests they get from your site (probably based on their access.log). So, to be counted there, the ad must not be cached and must not be blocked by the user.
So, as you can see, it's not that easy to get a single correct value. But as long as you use the measured values in comparison to those from the previous months, you should get at least a (nearly) correct rate of growth.
And your porn site probably serves a large amount of static content (e.g. images from disk), and most web servers are really good at serving caching hints automatically for static files. Your MMORPG site, on the other hand, probably consists mostly of dynamic scripts (PHP?) which don't send any caching hints at all, and web servers aren't able to determine caching headers for dynamic content automatically. That's at least my explanation, without knowing your application and server configuration :)
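To make the caching-hints point concrete, here is a rough Express-style sketch (purely illustrative; the routes are made up) of why static and dynamic responses end up being counted differently:

var express = require('express');
var app = express();

// Static files: the server adds validators/expiry automatically, so browsers
// and proxies can answer repeat requests without hitting the origin, and
// those repeat views never show up in access.log.
app.use(express.static('public', { maxAge: '1d', etag: true }));

// Dynamic pages: no caching hints unless you set them yourself, so every
// view produces a request that AWStats can count.
app.get('/stats-page', function (req, res) {
  res.set('Cache-Control', 'no-cache'); // force revalidation on each view
  res.send('<html><body>dynamically generated content</body></html>');
});

app.listen(3000);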
I wonder how high-traffic websites handle traffic logging. For example, a website like myspace.com receives a lot of hits, and I can imagine it would take a lot of space to log all those requests. So do they log every single request, or how do they handle this?
If you view source on a MySpace page, you get the answer:
<script type="text/javascript">
var pageTracker = _gat._getTracker("UA-6293770-1");
pageTracker._setDomainName(".myspace.com");
pageTracker._setSampleRate("1"); //sets sampling rate to 1 percent
pageTracker._trackPageview();
</script>
That script means they're using Google Analytics.
They can't just gauge traffic using IIS logs because they may sell ads to third parties, and third parties won't take your word for how much traffic you get. They want independent numbers from a separate company, and that's where Google Analytics comes in.
Just for future reference - whenever you've got a question about how a web site is doing something, try viewing the source. You'd be amazed at what you can find there in plain view.
We had a similar issue with our Intranet, which is used by hundreds of people. Disk activity was huge and performance was suffering.
The short answer is asynchronous, non-blocking logging.
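A bare-bones sketch of what asynchronous, non-blocking logging can look like (Node.js, illustrative only): buffer entries in memory and flush them off the request path, so handlers never wait on disk I/O.

var fs = require('fs');

var buffer = [];

function log(entry) {
  // Called from request handlers: push to memory and return immediately.
  buffer.push(JSON.stringify(entry));
}

// Flush the buffered entries to disk on a timer, off the request path.
setInterval(function () {
  if (buffer.length === 0) return;
  var lines = buffer.splice(0, buffer.length).join('\n') + '\n';
  fs.appendFile('access.log', lines, function (err) {
    if (err) console.error('Log flush failed:', err);
  });
}, 1000);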
Probably like Google Analytics.
Use JavaScript to load a page on a different server, etc.
I don't know how they track it, since I don't work there. I'm pretty sure they have enough storage to record every little thing about their users if they wanted to.
If I were them, I would use AWStats if I just wanted to know basic stuff about my users.
It is more likely that they have developed their own scripts for tracking their users. Stuff they would log:
-ip_address
-referrer
-time
-browser
-OS
and so on. Then a script to see different data about the user, varying by day, week, or month. As brulak said, something along the lines of Analytics, but since they have access to the actual database, they can learn much more about their users.
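A sketch of what logging those fields might look like (Express-style; saveLogEntry() is a stand-in for an insert into whatever database they use):

var express = require('express');
var app = express();

function saveLogEntry(entry) {
  // Stand-in for an INSERT into their own logging table.
  console.log(entry);
}

// Record the basics for every request.
app.use(function (req, res, next) {
  saveLogEntry({
    ip_address: req.ip,
    referrer: req.get('Referer') || '',
    time: new Date().toISOString(),
    user_agent: req.get('User-Agent') || '', // browser and OS are parsed from this
  });
  next();
});

app.listen(3000);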
ZXTM traffic shaping and logging, speaking from experience here
I'd be extremely surprised if they didn't log every single request, yes, and operations with particularly high traffic volumes usually roll their own log-management solutions against the raw server logs, in some form or other -- sometimes as simple batch-type processes, sometimes as complete subsystems.
One company I worked for, back in the dot-com heyday, got upwards of twenty million pageviews a day; for that site (actually a set of them, running across a few dozen machines in all, as I recall), our ops team wrote a quite sophisticated, clustered solution in C that parsed, translated (into relational storage), compressed and distributed the logs daily. Log files, especially verbose ones, pile up fast, and the commercial solutions available at the time just couldn't cut it.
If by logging you mean collecting server-related information (request and response times, DB and CPU usage per request, etc.), I think they sample only 10% or 1% of the traffic. That gives the same results (providing developers with auditing information) without filling up the disks or slowing the site down.
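For illustration, sampling at the logging layer can be as simple as this (mirroring the 1% sample rate in the MySpace snippet above; writeLogEntry() is a placeholder for the real sink):

var SAMPLE_RATE = 0.01; // keep roughly 1% of requests

function writeLogEntry(entry) {
  // Placeholder: send to the real logging/auditing backend.
  console.log(entry);
}

function maybeLog(entry) {
  if (Math.random() < SAMPLE_RATE) {
    writeLogEntry(entry);
  }
}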