Reactive, long-running sequences and persistence in the cloud - azure

I am to build a kind of website tracking system.
Think of a website where users click on various links – a unique user id and an identifier of the page tracks all page views.
Now, a single user might view 20 pages, some relevant and some not. What I want to track is whether a user follows a specific “path”, for example “Home Page” -> “Product A Page” -> “Get more info page” -> “Buy” -> “Paid”. There might be other page views in between each of these steps; the important thing is whether a user follows a given pattern.
In addition, I need to measure time between each step (each page view has a timestamp).
I have been playing around with Reactive Extensions, but I am not an expert in the area so I would like to hear if this would be a job for the Reactive Framework or if other technologies are more suitable?
I imagine a server receiving a stream of website page views and then some fancy reactive LINQ queries that capture the events (this is where I need some help).
The next question that comes to mind is how to host this behind a load balancer (on Windows Azure). If you run two instances and the “Home Page” view goes to instance 1 and the “Product A Page” view goes to instance 2, how do they communicate about this? Or should some kind of sharding, e.g. per user id, be enforced?
Lastly, what about persistence? How should the data be stored? Should you store it using an event queue pattern and then load everything into memory when you “play back” after a server restart?
I know that was a lot of questions, but I do love the philosophy behind Reactive Extensions; I just cannot get my head around how to “put into production in the cloud” :)
Thanks!
Casper

There are lots of solutions out there in this space already that you can integrate into your platform. Are you sure you're not reinventing the wheel? Google Analytics has functionality similar to this. If you need to go your own way, then SQL Server StreamInsight might be a better fit.
For behind-the-firewall solutions, also look at http://piwik.org/ (free, open-source) and http://www.haveamint.com/.
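If you do end up building this yourself, the path check from the question can be sketched without any Rx dependency. The snippet below is only a minimal illustration in plain JavaScript, not the Rx or StreamInsight approach: `createFunnelTracker`, `shardForUser`, the `FUNNEL` list and the `{ userId, page, timestamp }` event shape are all names made up for this example. It keeps one small state record per user, ignores page views that are not the next expected step, and reports the time between steps once the path completes.

```javascript
// Hypothetical funnel, taken from the example path in the question.
const FUNNEL = ["Home Page", "Product A Page", "Get more info page", "Buy", "Paid"];

// Creates a tracker that remembers, per user, how far along `steps` they are.
// Assumes events for a given user arrive in timestamp order.
function createFunnelTracker(steps) {
  const progress = new Map(); // userId -> { nextStep, timestamps }

  // Feed one page view. Returns a completion record when the user has
  // finished the whole path, otherwise null.
  function onPageView({ userId, page, timestamp }) {
    const state = progress.get(userId) || { nextStep: 0, timestamps: [] };

    // Any page that is not the next expected step is simply ignored,
    // so unrelated views in between do not break the path.
    if (page !== steps[state.nextStep]) {
      return null;
    }

    state.timestamps.push(timestamp);
    state.nextStep += 1;

    if (state.nextStep < steps.length) {
      progress.set(userId, state);
      return null;
    }

    // Path complete: compute the time between consecutive steps and reset.
    progress.delete(userId);
    const durationsMs = [];
    for (let i = 1; i < state.timestamps.length; i++) {
      durationsMs.push(state.timestamps[i] - state.timestamps[i - 1]);
    }
    return { userId, durationsMs };
  }

  return { onPageView };
}

// Assumed sharding helper: maps a user id to one of `instanceCount` instances
// so that all page views of one user land on the same tracker.
function shardForUser(userId, instanceCount) {
  let hash = 0;
  for (const ch of String(userId)) {
    hash = (hash * 31 + ch.codePointAt(0)) >>> 0; // keep it an unsigned 32-bit value
  }
  return hash % instanceCount;
}
```

Since all state is keyed by user id, routing every event for a given user to the same instance (for example with something like `shardForUser`) would keep each user's state on one machine, and feeding stored page views back through `onPageView` in timestamp order should rebuild the same state after a restart.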

Related

Simple web crawler / scraper for deals

For fun and maybe for profit, I want to implement the following:
a scheduled or manually triggered process that logs into all my bank accounts
the process knows each bank site's structure and goes through the "cashback/partner deals" pages
all deal information is collected in one place and in one format
when I'm going to buy something, I can quickly see if any of my cards has a special offer for that place, so that I can pay with the card that offers the best deal. Ideally, this should happen on my Android phone. That doesn't imply a standalone app though; e.g. you could search inside an email or a Google Doc or anything.
Any ideas on implementation? Don't limit yourself. Suggest crazy things, as long as they work.
PS: I did look for an existing website that offers something similar, but they all seem to focus on the cards themselves rather than on specific deals for stores.
Depending on the bank website, your code may need to be able to execute JavaScript to interact with it. Take a look at CasperJS for the web scraping part.

What server should I use if I am expecting over 20k visits/day?

I recently launched a fantasy football online game for the English Premier League called Myfpl11.com, and I want to know which server I should choose if I am expecting 20k visits a day. My visits are going up and I want the site to keep performing smoothly. Please help.
This is probably the wrong area of StackExchange to ask this sort of question. However ...
The first thing you should do is get prepared to scale horizontally instead of vertically. If you keep growing you will soon grow out of any single server that you purchase.
Instead, start looking at ways to modify your website so it can work across multiple systems. If you're experiencing load issues on the server you currently have, spin up a second, identical instance and move the database to it. You will then have two servers: one dedicated to the database (which will really help it do its job) and one dedicated to serving traffic.
From there, look at how you can scale out to multiple web processes and databases, and add caching layers.
You can add cloudflare.com as your DNS service, which will help by caching your assets better. Most importantly, it will deliver a technical-issues page to your users if your site does fall over at any stage. This is really helpful for saving face, because users get an actual page with a message instead of a continually loading white page.
Look at using services like digitalocean.com or linode.com (both very affordable and great staff) where you can easily add/remove resources as you need them.

Statistics usage of a database

Is there a way to monitor statistics on usage of documents within a database?
I have a Lotus Notes database hosted on a local server. I know I can get some info from 'User Detail...' in the Info tab of the Database properties (right-click on the database in Domino Designer), which basically shows me which user accessed the database and which CRUD action was performed, but I was looking for something more in-depth, i.e. which document in particular is read the most and by whom.
Since this is StackOverflow, not SuperUser or ServerFault, I'm going to treat this as a programming question. (On those other sites, they would tell you that tracking actions at the document level is not built into Notes and Domino's functionality, but there are some 3rd party add-on products that can do it for you.)
You can implement tracking features down to the document level in Notes and Domino using the Extension Manager API portion of the Notes C API. There is also a free package on the OpenNTF.org web site, called TriggerHappy, which provides a framework for using the Extension Manager features to call Java agents when events that you want to track occur. This can make it significantly easier to accomplish what you want, but it will not scale as well for large user bases.
You should also bear in mind that since Notes and Domino are designed for use in a distributed environment in which users can do their work in local replica databases, a tracking mechanism that is based on an Extension Manager plugin running on the server may not see changes at the moment that users make them. Instead, it might see them when those changes replicate from the user's computer to the server -- and replication does not guarantee that order is preserved, so the server might see some things happen in a different order than what the user actually did.
Have a look at Activity Trends; see the Notes help.
If you need more detail, you have to implement it yourself.

storing quick analytics using redis and node.js

I am new to Redis and would like to store web analytics for a website, both globally and per user activity.
Below is what I am stuck with.
// to get all unique ips
client.sadd('visitors',ip);
// to record hits per ip
client.hincrby('hits',ip,1);
The above works fine so far, and I do get the number of different IPs and a hit counter per IP.
the problem comes to store the activities made by each ip. i.e. Storing the link he clicked, searches he did, with datetime
Can someone please shed some light on how best to manage this?
Thanks
the problem comes to store the activities made by each
You will need a separate structure for storing these.
The simplest rational structure is to have a "list of actions by session". Take a look at the sorted sets commands which provide a basic framework for creating a list of actions within a session.
This will get you something quickly. However, this is probably not what you really want. In fact redis is probably not useful for this at all.
If you want to re-trace an entire site visit you really want to connect to some sort of true analytics framework. There are dozens of website tracking tools that provide this type of functionality, so it's not really clear that building one is very efficient.
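If you do go the sorted-set route suggested above, a minimal sketch could look like the following. It assumes a callback-style client with lowercase command methods like the `client` in the question (`zadd` and `zrangebyscore` mirror the Redis commands of the same name); `actionKey`, `recordAction`, `actionsBetween` and the `actions:<ip>` key layout are names invented for this example, and a session id could be used in place of the IP. The timestamp is used as the score so that actions come back in time order.

```javascript
// Hypothetical key layout: one sorted set of actions per visitor IP.
function actionKey(ip) {
  return "actions:" + ip;
}

// Stores one action (e.g. a click or a search) with its time as the score.
// The timestamp is also kept inside the member, because sorted-set members
// must be unique and two identical clicks would otherwise collapse into one.
function recordAction(client, ip, type, detail, when) {
  const timestamp = when.getTime();
  const member = JSON.stringify({ type, detail, timestamp });
  client.zadd(actionKey(ip), timestamp, member);
}

// Reads back all actions of one IP between two dates, oldest first.
function actionsBetween(client, ip, from, to, callback) {
  client.zrangebyscore(actionKey(ip), from.getTime(), to.getTime(), (err, members) => {
    if (err) {
      return callback(err);
    }
    callback(null, members.map((m) => JSON.parse(m)));
  });
}
```

With this in place, something like `recordAction(client, ip, 'search', 'red shoes', new Date())` would sit next to the existing `sadd` and `hincrby` calls.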

SPA Architecture questions

This post is intended to start a deeper discussion on Single Page Applications for the web. There are questions that do not seem to have a clear answer in most resources on the subject.
The ones I have in mind are:
Authorization and authentication.
With the entire web app on the client, it may make calls to the server from any of its functions, even those that the user does not have rights to. The fact that the user cannot see a menu does not prevent that person from invoking JavaScript functions. This is easily handled in an MVC app, for example, by using controllers that validate user rights to a specific function based on, say, a cookie. However, some SPA apps just use a single controller with Breeze or Web API, which makes server-side authorization impossible.
Memory management on the client
For small sample apps this is not an issue, but imagine an app with hundreds of screens, or an app with a single screen that pulls thousands of records over the course of one day. With persistent caching one could imagine large memory issues, especially on under-powered devices with little RAM, like phones or tablets. How can a group of developers head down the SPA route without a clear vision for handling memory management?
Three Tier deployment
Some IT departments will never allow applications with a connection string to a database located on front end web servers. Every SPA demo I have seen is structured exactly like that, including Breeze or Web Api for that matter.
Unobtrusive validation.
It would require developers to use MVC partial views and controllers instead of just HTML files, which seems to fly in the face of SPA concepts, while it provides a very robust way to easily incorporate validation and UI to support it into web applications.
Exposing primary integer based keys in the url.
This is a no-no in OWASP.
As a result, SPA applications "seem" to target areas with few security requirements and small feature sets. What do you think?
Thanks.
@Sergey - I think this is just too broad a question for StackOverflow. S.O. isn't a discussion forum; it's a place to go for specific answers. So while your questions are potentially valid, I don't think you should hold out much hope for deep substantive responses here.
May I add, in the friendliest possible way, that your sweeping, unsupported, and negative statements make you look like a troll. You're not a troll, are you, Sergey?
On the chance that you are in fact authentically concerned, I offer a few quick reactions, particularly as they pertain to Breeze.
Authorization. In Web API you can authorize at the method level. The ApiController base class has a User property that returns the IPrincipal. So whether you have one controller or many (and you can have many in Breeze if you want), the granularity is method level, not just class level.
Memory management. Desktop developers have coped with this concern for years. It may cause you some astonishment if you've always developed traditional web apps where process lifetimes are brief. But long-running processes are not news to those of us who built large apps in desktop technologies such as WinForms, WPF, and Silverlight. The issues and solutions are much the same in the land of HTML and JavaScript.
Layers on the backend. You've been looking at demos too long. Yes most demos dump everything into one project running on one server. We assume you know how to refactor the server to meet scaling, performance and security requirements for your environment. Our demos are concerned mostly with front-end SPA development. We do dabble at the service boundary to show how data flow through a service API, through an ORM, through to the database. We thought it sufficient to identify these distinct layers and leave as an exercise for the reader the comparatively trivial matter of moving these layers to different tiers. We may have to re-visit that assumption someday. But does anyone seriously believe that there are significant obstacles to distributing layers/responsibilities across server-side tiers? Really? Like what?
Unobtrusive validation. When most people start using the word "unobtrusive" in connection with HTML, they are usually making a point about keeping JavaScript out of the HTML. Perhaps that's what you mean too, in which case SPA developers everywhere agree ... and that's why there are numerous "unobtrusive validation" libraries available. HTML 5 validation, jQuery validation and Knockout validation come to mind. All of them are in the SPA developer's toolkit and none of them "require developers to use MVC partial views and controllers". What gives you the impression that a SPA would need any server-side resources of any kind to implement validation with JavaScript-free HTML markup?
Ids as security risk. Really? This is bogus. The key value is no more a security risk than any other data value. Millions of applications - not just SPAs - communicate key values to the client, both in the URL and in the body. It's standard in REST APIs. It's standard in ODATA. And you want to dismiss them all by saying that they "target areas with few security requirements and small feature sets"? Good luck with that. I think you'll have to do better than rest your case on a link to a relatively obscure organization's entire web site.
I have built some SPA applications, ranging from small to large (over 100 scripts and views). Only a handful of them had every view accessible to the public. The rest went through a strict access structure. It was simple to return a 401 Unauthorized from the server and have the client handle the 401 by redirecting to the login screen. Mr. Ward and Mr. Papa put it right. Get out of the demo mode and try to find solutions to the issues you come across. I have watched John Papa's SPA course on Pluralsight and gone through numerous articles and applications on Breeze, and I have to tell you, none of my applications use Breeze to do queries from the client side, because YOU DON'T NEED TO!!
Moreover, I have only extended what I have learnt and come up with my own way of solving problems. This is not an answer to your queries, but I can only provide a short comment. No technique is perfect and there is no ONE way to do everything. My server side is locked down where it needs to be locked down, my routes on the client side are locked down (if using Durandal, take a look at guardRoute), my scripts are minified and my images are sprited (if there is a word like that). All in all, SPA is a great technique; you just have to find solutions to the quirks!
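To make the 401 handling mentioned above concrete, here is a rough sketch of what the client side could look like. It is not taken from any of the applications described here: `handleApiStatus`, `LOGIN_ROUTE` and the `navigate` callback are hypothetical names, and the real login route depends on the application and router in use.

```javascript
// Hypothetical login route; the real one depends on the application.
const LOGIN_ROUTE = "#/login";

// Call this with the HTTP status of every API response.
// Returns true when the caller may go on processing the response,
// false when the user has been sent to the login screen instead.
function handleApiStatus(status, navigate) {
  if (status === 401) {
    // The server rejected the call, so the client only has to redirect.
    navigate(LOGIN_ROUTE);
    return false;
  }
  return true;
}
```

The point of keeping it this small is that the actual access decision stays on the server; the client only reacts to the status code it gets back.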

Resources