OpenAM embedded DS - openam

I have a openAM server running in the production. They run with multiple servers and fronted by Load Balancer.
The server is being congigured to use embedded openDJ as configuration store and external openSJ as the user data store.
In the production environment, we observed that in the openAM server (with embedded openDJ) the db/userRoot file keep growing and memory keep increasing as well.
I', wondering why the db/userRoot folder size keep increasing? Isn't it the user data is being allocated at the external openDJ, with the embedded openDJ size still keep increasing until 50GB ++
What sort of data stored in the the embedded openDJ?

Related

Auto suggest, Azure Webapp & .Net core WebAPI iMemoryCache

Tech Stack
Azure WebApp
.Net core 2.1 WebApi
We have around 4k reference data which is used during auto suggest lookup, so in this i was wondering whether i should cache this data on WebApp or should always get it from database / 3rd party API.
I know i can use RedisCache to solve this issue, but i would like to know how Azure WebApp works when it comes to caching, it will have memory pressure? When? Yes then scale-up is the only solution?
We are using IMemoryCache in .net Core to store reference data and it expires on daily basis or when Azure WebApp is restarted (So 1st user will get delay till it gets all data in cache).
Data size is in range of 500KB - 1MB & sometimes goes till 3MB+.
What is the best approach?
iMemoryCache is not suggested when using WebApps because it is tightly bound to your application instance, so if you try to scale out your app (in case of load surges during the day) your caching mechanism will be broken.
RedisCache is pretty much a dictionary, key-value pairs.
It is very fast on look-ups but it could be very slow in some other operations like a GetAllKeys when it has to run through the whole cache. That will bring your cache server to its knees, so it needs to be handled carefully.
It will not put any significant pressure in the memory consumption of your app, you only need to have a static client. The rest is handled by the redis server.
If you plan to scale up your application (give more RAM and CPU resources to your one running instance) the iMemory cache is probably fine.
If you plan to scale out (create multiple instances of your application), that is strongly suggested for all stateless applications, then RedisCache (or any other distributed cache) is an one way for you if you need a caching mechanism.
Value and key max size is 512MB so you are on the safe side regarding value data size.
Attention
Be sure to use the Connection multiplexer as it is suggested in the official documentation because it automatically re-establishes the connection in case it is lost. That was a bug earlier, when redis cache server was going into maintenance your calls where redirected to the fail over instance but the connection was failing, so you needed to restart your application.

Server configuration for virtual servers

From what I'm aware, best practices regarding server configuration is to have a separate server for your web server and your database server. The web server would have the relevant ports (80 and 443) exposed to internet traffic, while the database server would be completely blocked from access for everything except for networked connections between the web server.
As far as I'm aware, there are three main reasons for doing this:
1) Any sort of attack that allowed someone to gain access to the web server would ideally not allow them access to the server which actually contained the data, so the data becomes more secure. Also, an attack that brings down the web server could theoretically be resolved by rerouting traffic to a different web server connected to the same database server, thus minimizing downtime.
2) Resources consumed by the web server are separated from those for the database server so, for example, a large amount of traffic on the web server would not slow down performance on the database server
3) The machines are somewhat more scalable, since, for example, if the problem ends up being that there is too much load on the web server, it can be upgraded independently of the database server
Can these same benefits be gained by using a single machine with two VMs installed on it, one for the web server and one for the database server? The third benefit is of course not possible, since there's only one physical server. But how about the first two?
It seems to me like the first benefit is still intact, since an attack that gains access to the web server would theoretically not allow similar access to the database server, and, as far as I'm aware, gaining access to the virtual machine does not provide access to it's parent physical machine.
The second benefit seems somewhat less relevant, since there is really only one pool of resources that are being split between the two machines, so in fact, it would probably be better to not split them, as it would allow a bigger pool to draw from.
Am I overthinking this? Is the fact that there is only one physical machine pretty much the end of the story, and splitting it into two separate VMs a complete waste of time and resources?

Which Dedicated Cache configuration to use?

A large e-commerce site is looking to switch its session cache from Shared cache to dedicated cache.
It is usually running on medium-size servers (5-6)... During busy times, it's running on 20 medium servers. During the very busy times, it is not unreasonable to have 2000+ requests per second to the site
Is co-located cache good enough here or must cache be in the dedicate worker role?
Also, must high-availability be enabled for session data? The site relies upon session data to be present for good user experience. But the cache is persisted to Azure blob storage, so I'm not sure I totally get the high-availability option
The use of dedicated roles depends on how many roles you want to run, and whether or not the memory usage of your web roles determines if they scale. For example, if your web roles are always pushing memory usage, and it is memory and not CPU that is the trigger for scaling out - then consider using dedicated roles for the cache, as your web roles can then handle the load for longer. If your web roles are cpu intensive, then dedicating memory on each role to the cache may be preferred. You also need to consider that if running in dedicated roles, you need more than one role to handle the load and availability, so even during non-busy times, you will have at least 3 roles running the cache (but possibly fewer web roles). You may also want to use dedicated cache if you do lots of deployments or scaling down - where roles are shut down intentionally and frequently.
One consideration on co-located role caching is that if you had sticky sessions the latency would be lower, as the item is on the same machine. Unfortunately, the Azure load balancer is round robin, and not sticky at all, so the chance that a session gets back to the same machine is low (1/5 of the time for 5 roles). This means that most of the time the cache item will be fetched from another role in the cluster, so co-located latency benefits are lost.
The cache is distributed and in-memory - there is no blob storage that I am aware of (except for 'cluster's runtime state' - whatever that is. An item loaded into cache is made available to other machines on the cluster from the machine that it is stored (in memory) on (a read from machine B to machine A does not also store it on machine A - see comment below). Cached items are always in memory only, and the cache size is limited by available memory.
The high availability option copies the item to a separate machine (not storage), so if one machine fails, there is still a copy somewhere. High availability will also use more memory, as an item uses memory in two different places. The chances of failure maybe low enough for your e-commerce app - if an item is not cached (either through failure or expiry) it may be reconstructed from persisted data. If you are, for example, keeping the basket in cache and not persisted to storage, you don't want it lost if a role recycles - in which case high availability may be the best option.
Great answer #SimonMunro however in my experience the Azure Co-located Cache is not fit for production. Our load testing has shown us that when a server is recycled that it takes an exceptional long period of time for a the cache to recover. We have coded against this by fetching the data from our database however our site grinds to a halt due to the stress on the database. This not only happens when a node is recycled; but also if you scale your cloud services up and down; and even when you perform a VIP swap.
We have performed the same tests using the Azure dedicated cache and have found it to handle the situation of a cache worker role recycling with little to no effect to the performance of the site. It is my recommendation is to use the Azure Dedicated Cache in all cases if you want your site to perform.

spikes in traffic

I have a bare bones setup on Amazon, and wanted to know which is the better approach coming out of the gate on a new site, where we anticipate a spike of traffic occasionally (from tech press) before we gradually build up 'real' membership traffic to a reasonable level.
I currently am toying with two starter options:
1) Do I have 1 node app (micro ec2) pointing to a redis-server AND mongod (EC2 server) (which mounts one combined 10G EBS).
Or
2) do I have 1 node app (micro ec2) running redis-server and mongod locally (but with 2 10G EBS mounts, 1 for redis and 1 for mongo).
If traffic went crazy (tech press etc), which is easiest/fastest to scale to handle the spike in traffic. I anticipate equal read writes for mongo and redis btw, and I have no caching (other than that provided by cloudfront assets like images and some css)
I can't speak to Redis, but for MongoDB you'll want to be sure that you run on an instance with sufficient RAM to hold your "working set" of data in memory. "Working set" means, roughly, the full set of data that your application accesses frequently -- for instance, consider Twitter -- the working set of Twitter data is the most recent set of status updates across all users, as this is what is shown on web pages and what Twitter provides via its APIs. For your application, the definition of working set may differ.
Mongo uses memory-mapped files for data access, which means that its performance is great when there is enough memory to hold the data you are accessing frequently, and can degrade when there is not. If you expect your data set to grow beyond about 2.5 gigabytes, you will also want to ensure that you are on a 64-bit instance -- on 32-bit instances, Mongo is limited to around 2.5 gigabytes of data, due to the limited memory address space available on such a platform. For more on MongoDB on EC2, see the Mongo docs on EC2 deployment on the wiki.
I would also caution against using EC2 Micro instances in your production environment. The nature of Micros is that they have "burstable" but very limited CPU resources. If you get a spike of traffic due to tech press, it's likely that your application would be limited by EC2 to a very low amount of available CPU, which will cause performance to suffer. You can mitigate this to a certain extent with load balancing and many Micro instances, but it may be more cost-effective and less complex to simply use Large instances for both Mongo/Redis and your application servers.
You may want to have a look at this question, since IMO, the answer also applies to your situation:
Benefits of deploying multiple instances for serving/data/cache
I would never put mongod and redis-server on the same box. MongoDB is meant to swap due to its usage of memory mapped files, and will generate swapping activity if the data cannot fit in RAM. Redis does not use data structures which are compatible with swapping (like MongoDB does with btrees), and will become unresponsive if its memory is swapped out. Currently, there is no easy way to lock Redis in memory.
So I would put Redis and the app server on the same box, and isolate MongoDB on its own box.
Depending on the size of the data you want to store in Redis, I would pick a large or a small EC2 instance. Redis works well in 32 bits, but memory is limited. For MongoDB, a 64 bits box is almost mandatory. In any case, I would avoid micro instances like the plague.

Architecture recommendation for load-balanced ASP.NET site

UPDATE 2009-05-21
I've been testing the #2 method of using a single network share. It is resulting in some issues with Windows Server 2003 under load:
http://support.microsoft.com/kb/810886
end update
I've received a proposal for an ASP.NET website that works as follows:
Hardware load-balancer -> 4 IIS6 web servers -> SQL Server DB with failover cluster
Here's the problem...
We are choosing where to store the web files (aspx, html, css, images). Two options have been proposed:
1) Create identical copies of the web files on each of the 4 IIS servers.
2) Put a single copy of the web files on a network share accessible by the 4 web servers. The webroots on the 4 IIS servers will be mapped to the single network share.
Which is the better solution?
Option 2 obviously is simpler for deployments since it requires copying files to only a single location. However, I wonder if there will be scalability issues since four web servers are all accessing a single set of files. Will IIS cache these files locally? Would it hit the network share on every client request?
Also, will access to a network share always be slower than getting a file on a local hard drive?
Does the load on the network share become substantially worse if more IIS servers are added?
To give perspective, this is for a web site that currently receives ~20 million hits per month. At recent peak, it was receiving about 200 hits per second.
Please let me know if you have particular experience with such a setup. Thanks for the input.
UPDATE 2009-03-05
To clarify my situation - the "deployments" in this system are far more frequent than a typical web application. The web site is the front end for a back office CMS. Each time content is published in the CMS, new pages (aspx, html, etc) are automatically pushed to the live site. The deployments are basically "on demand". Theoretically, this push could happen several times within a minute or more. So I'm not sure it would be practical to deploy one web server at time. Thoughts?
I'd share the load between the 4 servers. It's not that many.
You don't want that single point of contention either when deploying nor that single point of failure in production.
When deploying, you can do them 1 at a time. Your deployment tools should automate this by notifying the load balancer that the server shouldn't be used, deploying the code, any pre-compilation work needed, and finally notifying the load balancer that the server is ready.
We used this strategy in a 200+ web server farm and it worked nicely for deploying without service interruption.
If your main concern is performance, which I assume it is since you're spending all this money on hardware, then it doesn't really make sense to share a network filesystem just for convenience sake. Even if the network drives are extremely high performing, they won't perform as well as native drives.
Deploying your web assets are automated anyway (right?) so doing it in multiples isn't really much of an inconvenience.
If it is more complicated than you're letting on, then maybe something like DeltaCopy would be useful to keep those disks in sync.
One reason the central share is bad is because it makes the NIC on the share server the bottleneck for the whole farm and creates a single point of failure.
With IIS6 and 7, the scenario of using a network single share across N attached web/app server machines is explicitly supported. MS did a ton of perf testing to make sure this scenario works well. Yes, caching is used. With a dual-NIC server, one for the public internet and one for the private network, you'll get really good performance. The deployment is bulletproof.
It's worth taking the time to benchmark it.
You can also evaluate a ASP.NET Virtual Path Provider, which would allow you to deploy a single ZIP file for the entire app. Or, with a CMS, you could serve content right out of a content database, rather than a filesystem. This presents some really nice options for versioning.
VPP For ZIP via #ZipLib.
VPP for ZIP via DotNetZip.
In an ideal high-availability situation, there should be no single point of failure.
That means a single box with the web pages on it is a no-no. Having done HA work for a major Telco, I would initially propose the following:
Each of the four servers has it's own copy of the data.
At a quiet time, bring two of the servers off-line (i.e., modify the HA balancer to remove them).
Update the two off-line servers.
Modify the HA balancer to start using the two new servers and not the two old servers.
Test that to ensure correctness.
Update the two other servers then bring them online.
That's how you can do it without extra hardware. In the anal-retentive world of the Telco I worked for, here's what we would have done:
We would have had eight servers (at the time, we had more money than you could poke a stick at). When the time came for transition, the four offline servers would be set up with the new data.
Then the HA balancer would be modified to use the four new servers and stop using the old servers. This made switchover (and, more importantly, switchback if we stuffed up) a very fast and painless process.
Only when the new servers had been running for a while would we consider the next switchover. Up until that point, the four old servers were kept off-line but ready, just in case.
To get the same effect with less financial outlay, you could have extra disks rather than whole extra servers. Recovery wouldn't be quite as quick since you'd have to power down a server to put the old disk back in, but it would still be faster than a restore operation.
Use a deployment tool, with a process that deploys one at a time and the rest of the system keeps working (as Mufaka said). This is a tried process that will work with both content files and any compiled piece of the application (which deploy causes a recycle of the asp.net process).
Regarding the rate of updates this is something you can control. Have the updates go through a queue, and have a single deployment process that controls when to deploy each item. Notice this doesn't mean you process each update separately, as you can grab the current updates in the queue and deploy them together. Further updates will arrive to the queue, and will be picked up once the current set of updates is over.
Update: About the questions in the comment. This is a custom solution based on my experience with heavy/long processes which needs their rate of updates controlled. I haven't had the need to use this approach for deployment scenarios, as for such dynamic content I usually go with a combination of DB and cache at different levels.
The queue doesn't need to hold the full information, it just need to have the appropriate info (ids/paths) that will let your process pass the info to start the publishing process with an external tool. As it is custom code, you can have it join the information to be published, so you don't have to deal with that in the publishing process/tool.
The DB changes would be done during the publishing process, again you just need to know where the info for the required changes is and let the publishing process/tool handle it. Regarding what to use for the queue, the main ones I have used is msmq and a custom implementation with info in sql server. The queue is just there to control the rate of the updates, so you don't need anything specially targeted at deployments.
Update 2: make sure your DB changes are backwards compatible. This is really important, when you are pushing changes live to different servers.
I was in charge of development for a game website that had 60 million hits a month. The way we did it was option #1. User did have the ability to upload images and such and those were put on a NAS that was shared between the servers. It worked out pretty well. I'm assuming that you are also doing page caching and so on, on the application side of the house. I would also deploy on demand, the new pages to all servers simultaneously.
What you gain on NLB with the 4IIS you loose it with the BottleNeck with the app server.
For scalability I'll recommend the applications on the front end web servers.
Here in my company we are implementing that solution. The .NET app in the front ends and an APP server for Sharepoint + a SQL 2008 Cluster.
Hope it helps!
regards!
We have a similar situation to you and our solution is to use a publisher/subscriber model. Our CMS app stores the actual files in a database and notifies a publishing service when a file has been created or updated. This publisher then notifies all the subscribing web applications and they then go and get the file from the database and place it on their file systems.
We have the subscribers set in a config file on the publisher but you could go the whole hog and have the web app do the subscription itself on app startup to make it even easier to manage.
You could use a UNC for the storage, we chose a DB for convenience and portability between or production and test environments (we simply copy the DB back and we have all the live site files as well as the data).
A very simple method of deploying to multiple servers (once the nodes are set up correctly) is to use robocopy.
Preferably you'd have a small staging server for testing and then you'd 'robocopy' to all deployment servers (instead of using a network share).
robocopy is included in the MS ResourceKit - use it with the /MIR switch.
To give you some food for thought you could look at something like Microsoft's Live Mesh
. I'm not saying it's the answer for you but the storage model it uses may be.
With the Mesh you download a small Windows Service onto each Windows machine you want in your Mesh and then nominate folders on your system that are part of the mesh. When you copy a file into a Live Mesh folder - which is the exact same operation as copying to any other foler on your system - the service takes care of syncing that file to all your other participating devices.
As an example I keep all my code source files in a Mesh folder and have them synced between work and home. I don't have to do anything at all to keep them in sync the action of saving a file in VS.Net, notepad or any other app initiates the update.
If you have a web site with frequently changing files that need to go to multiple servers, and presumably mutliple authors for those changes, then you could put the Mesh service on each web server and as authors added, changed or removed files the updates would be pushed automatically. As far as the authors go they would just be saving their files to a normal old folder on their computer.
Assuming your IIS servers are running Windows Server 2003 R2 or better, definitely look into DFS Replication. Each server has it's own copy of the files which eliminates a shared network bottleneck like many others have warned against. Deployment is as simple as copying your changes to any one of the servers in the replication group (assuming a full mesh topology). Replication takes care of the rest automatically including using remote differential compression to only send the deltas of files that have changed.
We're pretty happy using 4 web servers each with a local copy of the pages and a SQL Server with a fail over cluster.

Resources