Generic graphing and charting solutions - statistics

I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier: taking the above example, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2, and so on.
Now, to make sure I'm clearly stating my problem, I don't want a utility that collects these statistics. I'd like something that stores them, but, this is also not mandatory, I can always store them in MySQL, or such.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion charts, and the usual pie and bar charts). Essentially, a nice visualization package that lets me make use of all this data. I'd be collecting data from multiple services and multiple applications, and the datapoints will be of varying resolution.
Some of the data will include multiple layers of nesting, some none. For example, CPU would go down to server IP and CPU#, whereas memory would only go down to server IP, but would include a different identifier (free/used/cached) as the "secondary" identifier. Something like average request latency might not have a secondary identifier at all, as in the case of ping. What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be great: adding an extra identifier on top of IP/CPU#, namely the process name. I think the advantages of that are obvious.
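To make the shape of the data concrete, here is roughly what I picture a datapoint looking like (a quick Python sketch; the names are illustrative, not a requirement):

from dataclasses import dataclass

@dataclass
class Datapoint:
    service: str          # e.g. "cpu", "memory", "latency"
    identifiers: tuple    # zero or more nested identifiers, e.g. ("10.0.0.5", "core1", "nginx")
    value: float
    timestamp: float      # epoch seconds, with sub-second resolution

points = [
    Datapoint("cpu",     ("10.0.0.5", "core1"), 42.0,  1262304000.123),
    Datapoint("memory",  ("10.0.0.5", "free"),  512.0, 1262304000.500),
    Datapoint("latency", (),                    3.2,   1262304001.000),  # no secondary identifier
]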
For some applications, we might collect data at a very narrow scope, focusing on every aspect; in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful: the general one to quickly say "something just went wrong", and the detailed one to say "why?".
Further, it would be nice if the charting application threw out "bad" values. That is, if for some reason our monitoring program started reporting 300% CPU usage on a single core for 10 seconds, it'd be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer, though, so it's not a requirement at all.
Finally, being able to compare two points in time, or two different client identifiers of the same service, without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in one of the following: PHP, Python, C/C++, or C#, as these are languages I'm familiar with. It doesn't have to be open source, and it doesn't have to be a library; I'm open to using whatever fits my purpose best.
More of a P.S. than a requirement: I'd like pretty charts that are easy for non-technical people to understand and act upon (and that they like looking at!).
I'm open to clarifying, and, in advance, thanks for your time!

I am pretty sure that Protovis meets all your requirements, but it has a bit of a learning curve. You are meant to learn by example, and there are plenty of examples to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like filter out your "bad" values.

Related

SuiteScript 2.0: How to write efficient code?

I am new to SuiteScript and want to make our code more efficient. Looking at our code, it seems there are many scripts of the same type for the same record type; for example, 3 client scripts on a sales order. Is it bad practice to roll all of these scripts into the same script?
I also want to centralise the code. For the last 3 years I've written in C#, and any reusable code was placed in a relevant class. I want to rewrite the code we have in the same manner. For example, any method that has to do with a sales order would be placed in a module called SalesOrderServices. This module could then be added to any of the scripts, and its methods would all be available if needed. My concern is that this would make things less efficient, due to loading everything in the services module even when most of it is not needed. So, as a second part to this question: is this a good idea, or will it make our code less efficient?
There is quite a lot to consider in a question like this, and there won't be one correct answer, but I'll chime in with my perspective.
It is not necessarily bad practice to combine the similar scripts into one, but it also is not bad practice to keep them separate. That's really a decision only you and your team can make, based on what is most efficient for you to maintain.
I do think you are right to want to break out any reusable functionality into separate modules, but I would be careful putting "everything related to a Sales Order" into one module. My personal preference is to design and group code based on features and business processes rather than around record types. If you try to modularize based on record type, what happens when you have an approval process that touches both Purchase Orders and Vendor Bills? Where will that live? I prefer small, focused modules rather than large monolithic ones, but that is just my preference. That doesn't work best for everyone and every team.
Have you proven that loading additional modules or Script records is a performance bottleneck for your system? I would be very surprised if that were the case, and so I would caution against premature optimization of these sorts of things. There are many facets of NetSuite that operate on the order of seconds and are out of your control, so saving a few micro- or milliseconds here and there isn't going to do anything appreciable for your users.

How do I create an array of resources using Jena?

I am using Jena and Java to read a CSV file. For each line of the file there is a subject resource. Two subject resources, on adjacent lines, might share the same value of a field in the line (e.g. both lines have the same process id). In this case, I need to combine the two subject resources, as each one represents a sub-process in production (for example).
My question is: how can I reference those two resources dynamically so that I can combine them? My idea was that when I find they share the same property, I store them in an array of subject resources. Is that the right approach?
This question would be a lot easier to answer if you could show some sample data. As it is, I think you're focusing on the wrong bit of the question. If you can decide clearly what it means to have two rows in your CSV with identical process, and then you decide how you're going to encode that meaning in your RDF model, then the question of how to write the code - as an array or whatever - will be much clearer.
For example (and I'm going to make up some data here - as I said, it would be easier if you show an actual example), suppose your CSV contains:
processId,startTime,endTime
123,15:22:00,15:23:00
123,16:22:00,16:25:00
So process 123 apparently has two start and end time pairs. If you model this naively in RDF, you'll end up with a confusing model:
process:process123
    a :Process;
    process:start "15:22:00"^^xsd:time;
    process:end "15:23:00"^^xsd:time;
    process:start "16:22:00"^^xsd:time;
    process:end "16:25:00"^^xsd:time;
    .
which would suggest that one process had two start times (and two end times), which looks nonsensical. However, it might be that in reality you have a single process with multiple episodes, suggesting one way to model it; or a periodic process which occurs at different times; or, as you suggested, sub-processes of a parent process. Or something else entirely (I'm only guessing; I don't know your domain). Once you are clear what the data means, you can produce a suitable RDF model. For example, an episodic process might be:
process:process123
    a :Process;
    process:episode [
        a process:Episode;
        process:start "15:22:00"^^xsd:time;
        process:end "15:23:00"^^xsd:time;
    ];
    process:episode [
        a process:Episode;
        process:start "16:22:00"^^xsd:time;
        process:end "16:25:00"^^xsd:time;
    ]
    .
Once the modelling is clear in your mind, I think you can see that the question of how to produce the desired RDF triples from Java code - and whether or not you need an array - is much clearer. Equally importantly, you can think in terms of the JUnit tests you would write to check whether your code is behaving correctly.
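To sketch the code side, here is the episodic model built programmatically. I'm showing Python's rdflib rather than Jena purely for brevity (in Jena the equivalents are Model.createResource, Model.createProperty and Resource.addProperty); the namespace URI is made up:

from rdflib import Graph, Namespace, BNode, Literal
from rdflib.namespace import RDF, XSD

PROC = Namespace("http://example.org/process#")   # made-up namespace

g = Graph()
g.bind("process", PROC)

proc = PROC.process123
g.add((proc, RDF.type, PROC.Process))

# One blank-node episode per CSV row sharing the process id -- no array needed,
# just a lookup from process id to its resource while you stream the rows:
for start, end in [("15:22:00", "15:23:00"), ("16:22:00", "16:25:00")]:
    episode = BNode()
    g.add((proc, PROC.episode, episode))
    g.add((episode, RDF.type, PROC.Episode))
    g.add((episode, PROC.start, Literal(start, datatype=XSD.time)))
    g.add((episode, PROC.end, Literal(end, datatype=XSD.time)))

print(g.serialize(format="turtle"))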

Meaning of Leaky Abstraction?

What does the term "Leaky Abstraction" mean? (Please explain with examples. I often have a hard time grokking a mere theory.)
Here's a meatspace example:
Automobiles have abstractions for drivers. In its purest form, there's a steering wheel, accelerator and brake. This abstraction hides a lot of detail about what's under the hood: engine, cams, timing belt, spark plugs, radiator, etc.
The neat thing about this abstraction is that we can replace parts of the implementation with improved parts without retraining the user. Let's say we replace the distributor cap with electronic ignition, and we replace the fixed cam with a variable cam. These changes improve performance but the user still steers with the wheel and uses the pedals to start and stop.
It's actually quite remarkable... a 16 year old or an 80 year old can operate this complicated piece of machinery without really knowing much about how it works inside!
But there are leaks. The transmission is a small leak. In an automatic transmission you can feel the car lose power for a moment as it switches gears, whereas in a CVT you feel smooth torque all the way up.
There are bigger leaks, too. If you rev the engine too fast, you may do damage to it. If the engine block is too cold, the car may not start or it may have poor performance. And if you crank the radio, headlights, and AC all at the same time, you'll see your gas mileage go down.
It simply means that your abstraction exposes some of the implementation details, or that you need to be aware of the implementation details when using the abstraction. The term is attributed to Joel Spolsky, circa 2002. See the Wikipedia article for more information.
A classic example is network libraries that allow you to treat remote files as if they were local. The developer using this abstraction must be aware that network problems may cause it to fail in ways that local files do not. You then need to write code specifically to handle errors that fall outside the abstraction the network library provides.
Wikipedia has a pretty good definition for this:
A leaky abstraction refers to any implemented abstraction, intended to reduce (or hide) complexity, where the underlying details are not completely hidden
Or, in other words for software: it's when you can observe implementation details of a feature via limitations or side effects in the program.
A quick example would be C# / VB.NET closures and their inability to capture ref / out parameters. The reason they cannot be captured is an implementation detail of how the lifting process occurs. This is not to say, though, that there is a better way of doing it.
Here's an example familiar to .NET developers: ASP.NET's Page class attempts to hide the details of HTTP operations, particularly the management of form data, so that developers don't have to deal with posted values (because it automatically maps form values to server controls).
But if you wander beyond the most basic usage scenarios, the Page abstraction begins to leak, and it becomes hard to work with pages unless you understand the class's implementation details.
One common example is dynamically adding controls to a page: the values of dynamically-added controls won't be mapped for you unless you add them at just the right time, before the underlying engine maps the incoming form values to the appropriate controls. When you have to learn that, the abstraction has leaked.
Well, in a way it is a purely theoretical thing, though not unimportant.
We use abstractions to make things easier to comprehend. I may operate on a string class in some language to hide the fact that I'm dealing with an ordered set of characters that are individual items. I deal with an ordered set of characters to hide the fact that I'm dealing with numbers. I deal with numbers to hide the fact that I'm dealing with 1s and 0s.
A leaky abstraction is one that doesn't hide the details it's meant to hide. If I call string.Length on a 5-character string in Java or .NET, I could get any answer from 5 to 10, because of an implementation detail: what those languages call characters are really UTF-16 code units, each of which can represent either a whole character or half of one. The abstraction has leaked. Not leaking it, though, would mean that finding the length either requires more storage space (to store the real length) or changes from O(1) to O(n) (to work out what the real length is). If I care about the real answer (often you don't), I need to work from knowledge of what is really going on.
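To see that leak concretely, here is a small illustration (Python counts code points, so we can use it to show the UTF-16 code-unit count that a Java/.NET length would report):

s = "a\U0001F600b"                      # 3 user-visible characters; the middle one is outside the BMP
print(len(s))                           # 3 -- Python 3 counts code points
print(len(s.encode("utf-16-le")) // 2)  # 4 -- UTF-16 code units, which is what Java's
                                        # String.length() and .NET's String.Length report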
More debatable are cases where a method or property lets you get at the inner workings. Whether those are abstraction leaks, or well-defined ways to move to a lower level of abstraction, can sometimes be a matter people disagree on.
I'll continue in the vein of giving examples by using RPC.
In the ideal world of RPC, a remote procedure call should look like a local procedure call (or so the story goes). It should be completely transparent to the programmer such that when they call SomeObject.someFunction() they have no idea if SomeObject (or just someFunction for that matter) are locally stored and executed or remotely stored and executed. The theory goes that this makes programming simpler.
The reality is different because there's a HUGE difference between making a local function call (even if you're using the world's slowest interpreted language) and:
calling through a proxy object
serializing your parameters
making a network connection (if not already established)
transmitting the data to the remote proxy
having the remote proxy restore the data and call the remote function on your behalf
serializing the return value(s)
transmitting the return values to the local proxy
reassembling the serialized data
returning the response from the remote function
In time alone, that's about three (or more!) orders of magnitude of difference. Those three-plus orders of magnitude will make a huge difference in performance, one that will make your abstraction of a procedure call leak rather obviously the first time you mistakenly treat an RPC as a real function call. Further, a real function call, barring serious problems in your code, will have very few failure points beyond implementation bugs. An RPC call has all of the following possible problems, slathered on as failure cases over and above what you'd expect from a regular local call:
you might not be able to instantiate your local proxy
you might not be able to instantiate your remote proxy
the proxies may not be able to connect
the parameters you send may not make it intact or at all
the return value the remote sends may not make it intact or at all
So now your RPC call which is "just like a local function call" has a whole buttload of extra failure conditions you don't have to contend with when doing local function calls. The abstraction has leaked again, even harder.
In the end RPC is a bad abstraction because it leaks like a sieve at every level -- when successful and when failing both.
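Here is a quick sketch of what that leak looks like from the caller's side, using Python's standard xmlrpc client (the URL and method name are made up):

import xmlrpc.client

# Constructing the proxy looks just like constructing a local object...
proxy = xmlrpc.client.ServerProxy("http://example.com:8000/")

try:
    result = proxy.some_function(42)   # ...and calling it looks just like a local call
except xmlrpc.client.Fault as err:
    print("the remote function itself failed:", err.faultString)
except xmlrpc.client.ProtocolError as err:
    print("HTTP/transport-level failure:", err.errcode, err.errmsg)
except OSError as err:
    print("network failure (refused, unreachable, timed out):", err)

# None of those except-blocks would be needed for a local function call.
# That is the leak.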
What is abstraction?
Abstraction is a way of simplifying the world.
It means you don't have to worry about what is actually happening under the hood.
Example: Flying a 737/747 is "abstracted" away
Planes are complicated systems involving jet engines, oxygen systems, electrical systems, landing gear systems, etc.
...but the pilot doesn't have to worry about it... all of that is "abstracted away". The only thing a pilot needs to focus on is the yoke (i.e. the steering wheel of the plane).
He pushes the yoke left to go left, and right to go right etc.
....that is in an ideal world. In reality, flying a plane is much more complicated. Because many details ARE NOT "abstracted away".
Leaky Abstractions in 737 Example
Pilots in reality have to worry about a LOT of things: wind speed, thrust, angles of attack, fuel, altitude, weather problems, angles of descent. Computers can help the pilot with these tasks, but not everything is automated or simplified... not everything is "abstracted away".
e.g. If the pilot pulls up too hard on the column - the plane will obey, but then the plane might stall, and that's really bad.
In other words, it is not enough for the pilot to simply control the steering wheel without knowing anything else.........nooooo.......the pilot must know about the underlying risks and limitations of the plane before the pilot flies one.......the pilot must know how the plane works, and how the plane flies; the pilot must know implementation details..... that pulling up too hard will lead to a stall, or that landing too steeply will destroy the plane etc.
Those things are not abstracted away. A lot of things are abstracted, but not everything. The abstraction is "leaky".
Leaky Abstractions in Code
......it's the same thing in your code. If you don't know the underlying implementation details, then you're gonna have problems.
ORMs abstract a lot of the hassle in dealing with database queries, but if you've ever done something like:
User.all.each do |user|
  puts user.name # let's print each user's name
end
Then you will realise that's a nice way to kill your app. You need to know that calling User.all with 25 million users is going to spike your memory usage and cause problems. You need to know some underlying details. The abstraction is leaky.
Another example is the Django ORM's many-to-many sample:
Notice in the sample API usage that you need to .save() the base Article object a1 before you can add Publication objects to its many-to-many attribute. And notice that updating the many-to-many attribute saves to the underlying database immediately, whereas updating a singular attribute is not reflected in the DB until .save() is called.
The abstraction is that we are working with an object graph, where single-value attributes and multi-value attributes are just attributes. But the implementation as a relational-database-backed data store leaks... the integrity system of the RDBMS shows through the thin veneer of an object interface.
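Condensed from that Django docs sample (it assumes the docs' Article model with a publications = models.ManyToManyField(Publication) field):

p1 = Publication(title="The Python Journal")
p1.save()

a1 = Article(headline="Django lets you build Web apps easily")
# a1.publications.add(p1)   # would fail here: a1 has no primary key yet,
#                           # so no row can be written to the join table
a1.save()                   # now the Article row exists...
a1.publications.add(p1)     # ...and this writes to the join table immediately

a1.headline = "Updated headline"   # a singular attribute: nothing is saved yet...
a1.save()                          # ...until the explicit save()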
At some point, guided by your scale and execution needs, you will have to get familiar with the implementation details of your abstraction framework in order to understand why it behaves the way it behaves.
For example, consider this SQL query:
SELECT id, first_name, last_name, age, subject FROM student_details;
And its alternative:
SELECT * FROM student_details;
Now, they do look like logically equivalent solutions, but the performance of the first one is typically better, because specifying the individual column names avoids fetching columns you don't need.
It's a trivial example, but eventually it comes back to Joel Spolsky's line:
All non-trivial abstractions, to some degree, are leaky.
At some point, when you reach a certain scale in your operation, you will want to optimize the way your DB (SQL) works. To do that, you will need to know how relational databases work. It was abstracted away from you in the beginning, but it's leaky; you need to learn it at some point.
Assume we have the following code in a library:
Object[] fetchDeviceColorAndModel(String serialNumberOfDevice)
{
    // fetch the device color and model from the DB...
    // ...then pack them into an Object[]: color at index 0, model at index 1.
    return new Object[] { color, model };
}
When the consumer calls the API, they get back an Object[]. The consumer has to know that the first element of the array holds the color value and the second holds the model value. Here the abstraction has leaked from the library into the consumer's code.
One solution is to return an object which encapsulates the model and color of the device. The consumer can then ask that object for the model and color values.
DeviceColorAndModel fetchDeviceColorAndModel(String serialNumberOfTheDevice)
{
    // fetch the device color and model from the DB, then wrap them in a typed object.
    return new DeviceColorAndModel(color, model);
}
Leaky abstraction is all about encapsulating state. A very simple example of a leaky abstraction:
$currentTime = new DateTime();
$bankAccount1->setLastRefresh($currentTime);
$bankAccount2->setLastRefresh($currentTime);
$currentTime->setTimestamp($aTimestamp); // oops: this also changes the lastRefresh of both accounts

class BankAccount {
    // ...
    public function setLastRefresh(DateTime $lastRefresh)
    {
        // leaky: stores a reference to a mutable object the caller still holds
        $this->lastRefresh = $lastRefresh;
    }
}
and the right way (not a leaky abstraction):
class BankAccount
{
    // ...
    public function setLastRefresh(DateTime $lastRefresh)
    {
        // defensive copy: later mutation of the caller's object can't reach us
        $this->lastRefresh = clone $lastRefresh;
    }
}

Data clean up: are there libraries of common permutations that we can use? Or is there a better approach?

We are working on clean-up and analysis of a lot of human-entered customer data. We need to decide programmatically whether two addresses (for example) are the same, even though the data was entered with slight variations.
Right now we run each address through fairly simplistic string replacement (replacing "avenue" with "ave", for example), concatenate the fields, and compare the results. We are doing something similar with names.
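For concreteness, what we do today is roughly equivalent to this Python sketch (the abbreviation list is truncated; the real one is longer):

import re

ABBREVIATIONS = {
    "avenue": "ave",
    "street": "st",
    "road": "rd",
    "north": "n",
}

def normalize(address):
    # Lowercase, strip punctuation, then apply word-level replacements.
    words = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(w, w) for w in words)

def same_address(a, b):
    return normalize(a) == normalize(b)

print(same_address("123 N. Main Street", "123 North Main St"))  # True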
At the very least, it seems like our list of search-replace values should already exist somewhere.
Or perhaps you can suggest a totally different and superior way to detect matches?
For the addresses, you could run them through Google's Maps API and get a geocode for each one. Then, if the geocodes are the same, the place is the same. I believe they allow 10k hits/day/IP for free.
It's unlikely that you'd come up with anything better on your own.
http://code.google.com/apis/maps/
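Something like this, for example (endpoint details are from memory and may differ from the API version linked above; YOUR_KEY is a placeholder):

import json
import urllib.parse
import urllib.request

def geocode(address, api_key):
    # Returns a (lat, lng) tuple for an address, or None if it can't be resolved.
    url = ("https://maps.googleapis.com/maps/api/geocode/json?"
           + urllib.parse.urlencode({"address": address, "key": api_key}))
    with urllib.request.urlopen(url) as response:
        data = json.load(response)
    if data.get("status") != "OK":
        return None
    location = data["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])

# Two entries refer to the same place if they geocode to the same point:
# geocode("123 Main Street, Springfield", "YOUR_KEY") == geocode("123 Main St, Springfield", "YOUR_KEY")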
Soundex and its variants might be a good start as are other approaches suggested by that Wikipedia page.
Essentially you're trying to measure how similar two strings are, and there are a lot of different ways to do it. The Dice coefficient could work fairly well for what you're doing, although it is a somewhat costly operation.
http://en.wikipedia.org/wiki/Dice_coefficient
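A minimal bigram-based version, to make the idea concrete:

def bigrams(s):
    s = s.lower()
    return set(s[i:i + 2] for i in range(len(s) - 1))

def dice(a, b):
    # Dice coefficient over character bigrams: 2 * |A & B| / (|A| + |B|)
    A, B = bigrams(a), bigrams(b)
    if not A and not B:
        return 1.0
    return 2 * len(A & B) / (len(A) + len(B))

print(dice("123 Main Avenue", "123 Main Ave"))  # 0.88: high similarity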
If you want a more comprehensive list of string similarity measures try here:
http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
At work I help write software that verifies addresses (for SmartyStreets).
Address validation is a really tricky operation. In fact, the USPS has designated certain companies which are certified to provide this service. I would not recommend (even if I were in your shoes) that you attempt this on your own. As mentioned, Google does some address parsing, but it only approximates the address. Google, Yahoo, and similar services will not verify the accuracy of the address data.
So you'll need a CASS-Certified approach to this problem. I would suggest something like the LiveAddress API (for point-of-entry validation) or Certified Scrubbing (for existing lists or databases of addresses). Both are CASS-Certified by the USPS and will do what you require.

I have a list of names, some of them are fake, I need to use NLP and Python 3.1 to keep the real names and throw out the fake names

I have no clue where to start on this. I've never done any NLP and have only programmed in Python 3.1, which I have to use. I'm looking at the site http://www.linkedin.com and I have to gather all of the public profiles. Some of them have very fake names, like 'aaaaaa k dudujjek', and I've been told I can use NLP to find the real names. Where would I even start?
This is a difficult problem to solve, and one which starts with acquiring valid given name & surname lists.
How large is the set of names that you're evaluating, and where do they come from? These are both important things for you to consider. If you're evaluating a small set of "American" names, your valid name lists will differ greatly from lists of Japanese or Indian names, for instance.
Your idea of scraping LinkedIn is on the right track, but you were right to catch the fake profile/name flaw. A better website would probably be something like IMDB (perhaps scraping names by iterating over different birth years), or Wikipedia's lists of most popular given names and most common surnames.
When it comes down to it, this is a precision vs. recall problem: in order to miss fewer fakes, you're inevitably going to throw out some real names. If you loosen up your restrictions, you'll get more fakes, but you'll also throw out fewer real names.
Several possibilities here, but the most obvious seems to be HMMs, i.e. Hidden Markov Models. NLTK includes [at least] one module for HMMs, although I must admit I have never used it.
Another possible snag is that, AFAIK, NLTK has not yet been ported to Python 3.0.
That said, and while I'm quite keen on using NLP techniques where applicable, I think that a process using several paradigms, including some NLP tricks, may be a better solution for this particular problem. For example, storing even a reduced dictionary of common family names (and first names) in a traditional database may offer both a more reliable and a more computationally efficient way of filtering a significant portion of the input data, leaving precious CPU resources to be spent on the less obvious cases.
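A sketch of that dictionary-first filter (the two name files are placeholders you would fill from census lists or similar; it uses only the standard library, so it runs on Python 3.1):

def load_names(path):
    # One lowercase name per line, e.g. from a census given-name or surname list.
    with open(path, encoding="utf-8") as f:
        return set(line.strip().lower() for line in f if line.strip())

GIVEN_NAMES = load_names("given_names.txt")   # placeholder file names
SURNAMES = load_names("surnames.txt")

def probably_real(full_name):
    parts = full_name.lower().split()
    if len(parts) < 2:
        return False
    # Reject obvious keyboard-mash tokens like "aaaaaa".
    if any(len(set(p)) == 1 and len(p) > 2 for p in parts):
        return False
    # Accept if either end of the name appears in a known-name list;
    # everything else goes on to the slower NLP stage.
    return parts[0] in GIVEN_NAMES or parts[-1] in SURNAMES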
I am afraid this problem is not solvable if your list is even minimally 'open'. If the names are, e.g., customers from a small, traditionally-acting population, you might end up with a few hundred names for thousands of people. But generally you can hardly predict what is a real name and what is not, however unusual an Arabic, Chinese, or Bantu name may look in a sample of, say, South English rural neighborhood names. I mean, 'Ng' is a common Cantonese surname, and 'O' is common in Korea, so assumptions may fail. There is a place in Austria called 'Fucking', so even looking out for four-letter words is no guarantee of success.
What you could do is work through a sufficiently big sample of such names and sort them out manually; then use all kinds of text-processing tools and collect metrics. Maybe you can derive a certain likelihood for a name to be recognized as fake; maybe it will not be viable. You will never get beyond likelihoods here, though.
As an aside: years ago we used Google Maps and the telephone directory for validating customer data. If Google Maps could find the place, we called the address validated. It is clear that under stricter requirements, true validation must go much further. Let's not forget that validating such data is much more a social question than a linguistic one.
