I am looking for a system or library for node.js that can log information about client access on every remote server and automatically gather that information on a central log server for later analysis. The remote servers will have write-only access, while the central server will accumulate a lot of data to read.
I hope there is a solution using a distributed NoSQL database, like MongoDB.
However, I have not found how to set it up.
For example, I hope that cleaning old data can be initiated on the central log server (once the data has been processed) and that entries for old dates can be removed on the remote servers with little overhead.
Currently we log into files and use a Hadoop system for log analysis.
But I think we need to accumulate the data in a database.
Winston, currently the best logging framework for node.js, has an option to log into MongoDB or CouchDB.
Scribe could be what you're looking for. There are node packages too.
I have never checked it out so I'd be interested in reading your thoughts in the comments if you investigate it and find it good/bad, easy/hard to setup, etc.
MongoDB or any other distributed database will not solve the problem.
An in-house project must be created.
Some features of MongoDB for consideration:
Capped Collections are actually a way to lose data. They may be good for short history.
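To make the point about capped collections concrete, here is a minimal sketch of the behaviour in plain JavaScript. It does not use the MongoDB API; the `CappedLog` class is a hypothetical stand-in that only mimics the "fixed size, oldest entries overwritten" idea.

```javascript
// Illustration only: a fixed-size buffer that mimics how a capped
// collection discards its oldest entries. This is not the MongoDB API.
class CappedLog {
  constructor(maxEntries) {
    this.maxEntries = maxEntries;
    this.entries = [];
  }

  insert(entry) {
    this.entries.push(entry);
    // Once the cap is exceeded, the oldest entry is silently dropped.
    if (this.entries.length > this.maxEntries) {
      this.entries.shift();
    }
  }
}

const log = new CappedLog(3);
for (const id of [1, 2, 3, 4, 5]) {
  log.insert({ id });
}
console.log(log.entries.map((e) => e.id)); // [3, 4, 5]: entries 1 and 2 are gone
```

If the central server has not collected entries 1 and 2 before they are overwritten, they are lost, which is why this only suits short history.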
I am learning how to use socket.io and nodejs. In this answer they explain how to store users who are online in an array in nodejs. This is done without storing them in the database. How reliable is this?
Is data stored in the server reliable? Does the data always stay the way it is intended?
Is it advisable to even store data in the server? I am thinking of a scenario where there are millions of users.
Is it that there is always one instance of the server running even when the app is served from different locations? If not, will storing data in the server bring up inconsistencies between the different server instances?
Congrats on your learning so far! I hope you're having fun with it.
Is data stored in the server reliable? Does the data always stay the way it is intended?
No, storing data on the server is generally not reliable enough, unless you manage your server in its entirety. With managed services, storing data on the server should never be done because it could easily be wiped by the party managing your server.
Is it advisable to even store data in the server? I am thinking of a scenario where there are millions of users.
It is not advisable at all; you need a DB of some sort.
Is it that there is always one instance of the server running even when the app is served from different locations? If not, will storing data in the server bring up inconsistencies between the different server instances?
Typically, the server is always running and has some basic information about its configuration stored locally. When scaling, hosted services are able to increase the processing capacity automatically and handle load balancing in the background, so there can be more than one instance. Whenever the server retrieves data for you, it requests it from the database, and the data is then loaded into RAM (memory). In the example of the users, you would store the user data in a table or document (relational database vs. document-oriented database) and then load it into memory to manipulate it using 'functions'.
Additionally, to learn more about your 'data inconsistency' concern, look up concurrency as it pertains to databases, and data race conditions.
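As a rough illustration of the inconsistency between server instances, the sketch below runs two "instances" side by side. Everything here is hypothetical: `ServerInstance` is an invented class, and a plain `Set` stands in for the external database.

```javascript
// Each server process keeps its own memory, so an in-memory list of
// online users is only visible to the instance that recorded it.
class ServerInstance {
  constructor(sharedStore) {
    this.localOnline = new Set();   // lives only in this process
    this.sharedStore = sharedStore; // stand-in for an external database
  }

  userConnected(userId) {
    this.localOnline.add(userId);
    this.sharedStore.add(userId);
  }
}

// A Set stands in for the database here; a real deployment would use
// an actual external store instead.
const database = new Set();
const instanceA = new ServerInstance(database);
const instanceB = new ServerInstance(database);

instanceA.userConnected("alice");
instanceB.userConnected("bob");

console.log(instanceA.localOnline.has("bob"));   // false: A never saw bob
console.log(instanceB.sharedStore.has("alice")); // true: the store is shared
```

The local sets also disappear whenever a process restarts, while the shared store survives.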
Hope that helps!
I have a nodejs project that uses Couchbase as its database.
I am wondering whether I should store the temporary data in
1. Redis
or in
2. Couchbase directly.
As far as I know there is a socket delay with Couchbase, so I think storing temporary data in Redis while storing the permanent data in Couchbase is better.
Does anyone have experience with this?
Your comments are welcome.
I'm a big Redis fan, but in this situation I would use Couchbase only.
Couchbase is rather efficient, and comparable to the performance of memcached when the working set of your data fits in memory. Most of the time, an extra caching layer on top of Couchbase is not useful.
That said, if you really need a caching layer, or simply some storage for temporary data, you can simply create a memcached bucket hosted in the Couchbase cluster. So you would have an "eventually persistent" bucket for your persistent data, and a memcached bucket for the temporary data.
The bucket types are described here:
http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/#data-storage
In that context, adding Redis as an extra storage layer does not really make sense.
Couchbase has a managed cache built into it, even for Couchbase buckets. So it already has a caching layer and adding another one on top just sounds superfluous.
I am not sure what you mean by a socket delay in Couchbase. Can you perhaps explain more about that? That is not something I have ever seen before, and it sticks out as suspect to me. I would troubleshoot this and figure out what it is before adding Redis to the mix and having yet another layer to manage and code against. Without knowing more about the socket delay, it is difficult to make more recommendations.
It's an old question, but I'll have my take at it as well, if nothing else then for the people coming across it via google, just as I did.
I agree with the accepted answer, in that Couchbase keeps the most recently used documents in RAM. In that respect, it does the same as Redis. The advantage of Couchbase is of course that the data can reliably spill over the RAM limit, and the server disk limit, automatically, by adding more nodes.
However, I have a project where I am considering using Redis alongside Couchbase. It's basically intended as a caching server, but for the "calculated" items, such as HTML snippets. Couchbase is a fantastic document store, but building lists and other structures doesn't come that easily, especially not without a lot of views. So I'm thinking of using Redis as a temporary datastore for the ad-hoc data manipulation needed, and Couchbase as the main datastore.
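The "cache for calculated items" idea is essentially a cache-aside pattern. Below is a minimal sketch of it; both stores are plain `Map` objects standing in for Couchbase and Redis, and `renderSnippet` and `getSnippet` are hypothetical names. No real client API is used.

```javascript
// Cache-aside sketch. Both stores are plain Maps standing in for the
// real services; no Redis or Couchbase client API is used here.
const documentStore = new Map([["user:1", { name: "Ada", posts: 3 }]]);
const snippetCache = new Map();
let renderCount = 0;

// Hypothetical "calculated item": an HTML snippet built from a document.
function renderSnippet(doc) {
  renderCount += 1;
  return `<li>${doc.name} (${doc.posts} posts)</li>`;
}

function getSnippet(key) {
  if (snippetCache.has(key)) {
    return snippetCache.get(key); // cache hit: no recomputation
  }
  const doc = documentStore.get(key);
  if (doc === undefined) {
    return null; // nothing to render for an unknown key
  }
  const html = renderSnippet(doc);
  snippetCache.set(key, html);
  return html;
}

getSnippet("user:1");
getSnippet("user:1");
console.log(renderCount); // 1: the second call was served from the cache
```

A real setup would also need to invalidate or expire the cached snippet when the underlying document changes.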
We have huge log files (hundreds of gigabytes) on multiple web servers that need to be searched in real time. These log files are written to multiple times per second by different apps. We recently installed a Hadoop cluster on some servers for this purpose. To implement search on these logs, I have thought of this design: a process running on the web servers creates an inverted index of the logs, caches it in memory (on the web servers themselves), and pushes it to HDFS via Flume to be stored in Hive when the cache is full (much like an LRU cache). This helps in two ways when something is searched for: the most recent logs are returned from the in-memory cache, which is fast, and older logs are returned from disk. Since users want to see the latest logs first, this technique works. Can somebody verify whether this design will work and scale properly? Are there any better alternatives around?
Thanks
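For readers following along, the in-memory part of this design could look roughly like the sketch below. It is a simplification under stated assumptions: `RecentLogIndex` is a hypothetical class, the `flush` callback stands in for the push to HDFS via Flume, and the whole buffer is flushed at once rather than evicted entry by entry as a true LRU cache would.

```javascript
// In-memory inverted index over recent log lines. When the number of
// buffered lines reaches the limit, the buffer is handed to a flush
// callback (standing in for the Flume/HDFS push) and reset.
class RecentLogIndex {
  constructor(maxLines, flush) {
    this.maxLines = maxLines;
    this.flush = flush;
    this.lines = [];
    this.index = new Map(); // term -> array of line numbers
  }

  add(line) {
    const lineNo = this.lines.length;
    this.lines.push(line);
    // Index each distinct term of the line once.
    const terms = new Set(line.toLowerCase().split(/\W+/).filter(Boolean));
    for (const term of terms) {
      if (!this.index.has(term)) this.index.set(term, []);
      this.index.get(term).push(lineNo);
    }
    if (this.lines.length >= this.maxLines) {
      this.flush(this.lines, this.index);
      this.lines = [];
      this.index = new Map();
    }
  }

  // Returns matching lines, newest first.
  search(term) {
    const hits = this.index.get(term.toLowerCase()) || [];
    return hits.map((n) => this.lines[n]).reverse();
  }
}

const flushed = [];
const recent = new RecentLogIndex(3, (lines) => flushed.push(lines));
recent.add("GET /home 200");
recent.add("GET /login 500");
console.log(recent.search("get")); // [ 'GET /login 500', 'GET /home 200' ]
```

Searches that miss the in-memory index would then fall through to the older data on disk.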
You could store the inverted index in HBase to provide more real-time access to your older logs.
HBase would also likely be a viable alternative to your in-memory cache. You could do this if you wanted to unify the storage platform instead of having it split up. It will obviously be slower than memcached or redis.
A completely different approach could be using Lucene/Solr to index your logs. This has a lot of nice features out of the box for searching.
I have an ASP.NET MVC app hosted at webhost4life.
What's a good way to save logs?
I have access to the FTP account I upload the site to. Should I effectively just do
File.AppendAllText("log.txt", "Ooops, we have an error" + e.Message);
Or is there a better way? Send e-mail? save log into a database?
I always try to log to a database and fall back on a file if the database is inaccessible (perhaps that's the cause of the exception). This allows you to run queries and reporting against the log directly and find out what the problem is immediately. You can also run a "health check" against the application by storing critical exceptions and marking them, etc.
Avoid writing to the file system; this can generate collisions/race conditions between threads that are attempting to write to the same file. Databases are wonderful solutions for this problem, and provide some nice benefits such as being able to generate reports easily from normalized data.
Also, what sort of information are you logging? The IIS logs are very detailed. Saving information that is already available in those logs duplicates work (the server writes its logs, and then you write your own), which of course incurs a performance hit.
Okay, here's the scenario. I have a utility that processes tons of records, and enters information to the Database accordingly.
It works on these records in multi-threaded batches. Each such batch writes to the same log file for creating a workflow trace for each record. Potentially, we could be making close to a million log writes in a day.
Should this log be made into a database residing on another server? Considerations:
The obvious disadvantage of multiple threads writing to the same log file is that the log messages are shuffled amongst each other. In the database, they can be grouped by batch id.
Performance - which would slow down the batch processing more: writing to a local file, or sending log data to a database on another server on the same network? Theoretically, the log file is faster, but is there a gotcha here?
Are there any optimizations that can be done on either approach?
Thanks.
The interesting question, should you decide to log to the database, is where do you log database connection errors?
If I'm logging to a database, I always have a secondary log location (file, event log, etc) in case there are communication errors. It really does make it easier to diagnose issues later on.
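A secondary log location can be wired in with a simple try/catch. The sketch below is only an illustration: `writeToDatabase` is a placeholder for whatever database call the application makes, `fallbackFile` is a hypothetical path, and the call is synchronous for brevity (a real database client would usually be asynchronous).

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Hypothetical fallback location; adjust to your environment.
const fallbackFile = path.join(os.tmpdir(), "app-fallback.log");

// `writeToDatabase` is a placeholder for whatever database call the
// application uses; it is expected to throw on connection problems.
function logWithFallback(writeToDatabase, message) {
  try {
    writeToDatabase(message);
    return "database";
  } catch (err) {
    // Record both the original message and why the primary sink failed.
    const line = `${new Date().toISOString()} ${message} (db error: ${err.message})\n`;
    fs.appendFileSync(fallbackFile, line);
    return "file";
  }
}

const brokenDb = () => {
  throw new Error("connection refused");
};
console.log(logWithFallback(brokenDb, "Order 42 failed")); // "file"
```

Keeping the database error next to the original message is what makes the later diagnosis easier.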
One thing that comes to mind is that you could have each thread writing to its own log file and then do a daily batch run to combine them.
If you are logging to a database you will probably need to do some tuning and optimization, especially if the DB is across the network. At the least you will need to reuse the DB connections.
Furthermore, do you have any specific need to have the log in a database? If all you need is a "grep", then I don't think you gain much by logging into a database.
I second the other answers here, depends on what you are doing with the data.
We have two scenarios here:
The majority of the logging is to a DB since admin users for the products we build need to be able to view them in their nice little app with all the bells and whistles.
We log all of our diagnostics and debug info to file. We have no need for really "prettifying" it and TBH, we don't even often need it, so we just log and archive for the most part.
I would say if the user is doing anything with it, then log to a DB; if it's for you, then a file will probably suffice.
Not sure if it helps, but there's also a utility called Microsoft LogParser that you can supposedly use to parse text-based log files and use them as if they were a database. From the website:
Log parser is a powerful, versatile tool that provides universal query access to text-based data such as log files, XML files and CSV files, as well as key data sources on the Windows® operating system such as the Event Log, the Registry, the file system, and Active Directory®. You tell Log Parser what information you need and how you want it processed. The results of your query can be custom-formatted in text based output, or they can be persisted to more specialty targets like SQL, SYSLOG, or a chart. Most software is designed to accomplish a limited number of specific tasks. Log Parser is different... the number of ways it can be used is limited only by the needs and imagination of the user. The world is your database with Log Parser.
I haven't used the program myself, but it seems quite interesting!
Or how about logging to a queue? That way you can switch out pollers whenever you like to log to different things. It makes things like rolling over and archiving log files very easy. It's also nice because you can add pollers that log to different things, for example:
a poller that looks for error messages and posts them to your FogBugz account
a poller that looks for access violations ('x tried to access /foo/y/bar.html') to a 'hacking attempts' file
etc.
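A minimal in-process sketch of the queue-plus-pollers idea follows. The arrays `errorReports` and `accessViolations` are stand-ins for the bug tracker and the 'hacking attempts' file; all names are hypothetical, and a production setup would more likely use a real message queue.

```javascript
// A tiny in-process queue: producers push log entries, and any number
// of pollers drain it and route entries wherever they like.
const queue = [];
const errorReports = [];     // stand-in for posting to a bug tracker
const accessViolations = []; // stand-in for a 'hacking attempts' file

const pollers = [
  (entry) => {
    if (entry.level === "error") errorReports.push(entry.message);
  },
  (entry) => {
    if (entry.message.includes("tried to access")) {
      accessViolations.push(entry.message);
    }
  },
];

function log(level, message) {
  queue.push({ level, message });
}

// Drain the queue, giving every poller a chance to see every entry.
function poll() {
  while (queue.length > 0) {
    const entry = queue.shift();
    for (const poller of pollers) poller(entry);
  }
}

log("info", "x tried to access /foo/y/bar.html");
log("error", "database unreachable");
poll();
console.log(errorReports, accessViolations);
```

Swapping a poller in or out only means editing the `pollers` array; the code that calls `log` stays unchanged.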
Database - since you mentioned multiple threads. Synchronization as well as filtered retrieval are my reasons for my answer.
See if you have a performance problem before deciding to switch to files
"Knuth: Premature optimization is the root of all evil" I didn't get any further in that book... :)
There are ways you can work around the limitations of file logging.
You can always start each log entry with a thread id of some kind, and grep out the individual thread ids. Or a different log file for each thread.
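The thread-id prefix approach might be sketched like this. The `[worker-N]` format and both function names are assumptions chosen for illustration; filtering by prefix plays the role of grepping the file.

```javascript
// Prefix every entry with a worker/batch id so interleaved lines can
// be separated again later (the equivalent of grepping for the id).
function formatEntry(workerId, message) {
  return `[worker-${workerId}] ${message}`;
}

function entriesFor(workerId, lines) {
  // The closing bracket and space keep worker 1 from matching worker 10.
  const prefix = `[worker-${workerId}] `;
  return lines.filter((line) => line.startsWith(prefix));
}

const sharedLog = [
  formatEntry(1, "record 10 loaded"),
  formatEntry(2, "record 11 loaded"),
  formatEntry(1, "record 10 saved"),
];

console.log(entriesFor(1, sharedLog));
// [ '[worker-1] record 10 loaded', '[worker-1] record 10 saved' ]
```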
I've logged to database in the past, in a separate thread at a lower priority. I must say, queryability is very valuable when you're trying to figure out what went wrong.
How about logging to a database file, say a SQLite database? I think it can handle multi-threaded writes, although that may also have its own performance overhead.
I think it depends greatly on what you are doing with the log files afterwards.
Of the two operations writing to the log file will be faster - especially as you are suggesting writing to a database on another server.
However if you are then trying to process and search the log files on a regular basis then the best place to do this would be a database.
If you use a logging framework like log4net they often provide simple config file based ways of redirecting input to file or database.
I like Gaius' answer. Put all the log statements in a threadsafe queue and then process them from there. For DB you could batch them up, say 100 log statements in one batch and for file you could just stream them into the file as they come into the queue.
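The batching part could be sketched as follows. `BatchingLogger` is a hypothetical class and `writeBatch` is a placeholder for a bulk insert into the database; the batch size of 100 comes from the suggestion above.

```javascript
// Collect log statements and hand them to `writeBatch` in groups.
// `writeBatch` is a placeholder for a bulk insert into the database.
class BatchingLogger {
  constructor(batchSize, writeBatch) {
    this.batchSize = batchSize;
    this.writeBatch = writeBatch;
    this.pending = [];
  }

  log(message) {
    this.pending.push(message);
    if (this.pending.length >= this.batchSize) this.flush();
  }

  // Also call this on shutdown so a partial batch is not lost.
  flush() {
    if (this.pending.length === 0) return;
    const batch = this.pending;
    this.pending = [];
    this.writeBatch(batch);
  }
}

const batches = [];
const logger = new BatchingLogger(100, (batch) => batches.push(batch.length));
for (let i = 0; i < 250; i++) logger.log(`statement ${i}`);
logger.flush();
console.log(batches); // [ 100, 100, 50 ]
```

One trade-off to keep in mind: anything still pending when the process crashes is lost, so a timed flush in addition to the size limit may be worth adding.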
File or Db? As many others say; it depends on what you need the log file for.