I am looking for an alternative spreadsheet to Excel, preferably but not necessarily open source, that allows a programmer to create a plugin that can update cells in the sheet from an external data source in real time. The spreadsheet would then internally compute all dependent calculation chains upon change of value.
This is similar functionality to what the RTD method does with Microsoft Excel. The rate of external data change could be moderate to high (whatever such relativistic terms mean).
Also the reverse process would be useful, i.e. detecting a change in cells and then sending that information to a plugin that can communicate with external processes.
Any recommendations or experience in trying this?
I am afraid you will not find any. The main consumers of the real-time spreadsheets (grids) are big banks and they usually invest in their own solutions. [Because they can afford and they used to see it as their advantage over the competition] Some of the solutions are very dated, but still going strong! Three years ago I worked on a system which was written in C++ (with TibcoRv as a backbone) and it was already five years old. It is still alive and kicking.
One of the strong points of the bespoke grid are "Excel-like formulae" where a user can use a field from the provided data dictionary. So rather than reference cells, you reference data from your systems. It makes formulae easier to implement and read. And of course you can export or share them; users really like that.
The following could be of some help:
http://www.dadisp.com
http://www.quantrix.com
http://www.resolversystems.com/products/
http://pyspread.sourceforge.net/
http://matrex.sourceforge.net/
This may not exactly satisfy your real time requirement but worth exploring.
Related
I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size some steel members for lifting beams. The set back is that I need to do some trial and error in the selection of the member until I get one that gets as close to the allowable limits as possible.
What I'm hoping to improve on is to develop a function that based upon a length and weight variable that I enter, the program runs a loop and automatically selects the best member size(s) based upon a list of the members and their physical properties. Is this possible?
Yeah, depending on the complexity, either a simple search through parameters (less than, more than etc) might bring you the answer. You can do it quite easily via Pandas library. Just load up the excel as pandas DataFrame (pandas.read_excel()), which then will allow you to perform the searches on that DataFrame object.
If you want to run some optimization algo, you should look into SciPy's optimize to get what you're looking for based on the input data (it handles unconstrained and constrained functions).
Of course, the question you've stated is quite general, so I only pointed the direction. More info would be better.
I have over 6,000 excel sheets. While all the sheets describe the same thing, they are independently formatted. They all have between 9 and 13 columns, but they are out of order, the column names are independently misspelled, and they may or may not have a second, or third, column header.
I am currently trying in python to read cells in a left-down-right-up motion to attempt to locate the same data, but there is physically too many differences in structure names, column ordering, and data definitions to lock them in one a time. Is there a tool that I can use to read these documents and conform them to a single format, via a rapid mapping function?
Thanks much.
Thanks
Wow, it's the Ultimate Data Horror Story.
I want to ask how you ever let it get this way... but I actually don't want to know; I'm already going to have nightmares about this.
It's like that Hoarding show on TV, but with data.
No, I'm afraid that if you can't even identify a pattern then there's no magic function that will be able to either.
But that doesn't mean it's a lost cause. It's just going to need some human interaction, and there are ways to minimize the pain.
What you need is a custom interface that will load the documents one by one, and will walk a human through clicking each relevant column or area, and then automatically load the next document.
There would also need to be buttons for sorting things like obvious garbage sheets (blanks?), "unknowns" (that get put in a folder for advanced research later), and other "unpredictables" may come up during the process.
Also, perhaps once you get into it, you'll notice a pattern you're not thinking of, like maybe *"the person who handled the files from 2002 to 2004 set them up this way"*, or, "when Budget is misspelled, it's always either Bugdet or Budteg".
In this scope, little patterns like that can make a big difference.
Depending on your coding skills, you may or may not need outside assistance with this. I assume this is not data that can just get thrown out, or you wouldn't be asking...
If each document took an average of 20 seconds to process, that would be about 33 hours in total. An hour a day an it's done in a month. Or someone full-time, and it's done in a week.
Do you have a budget you can throw at this? Data archaeology is an actual thing! Hell, I'll do it for you for the right price... (wouldn't break the bank, depending on how urgent it is, of course!)
Either way, this ain't going to be fun for "someone"...
In my current application, which is quite simple, I have a class responsible for keeping track of all the customers. It adds new customers, updates details, assigns managers etc. All in all, it performs some logic and keeps track of what is there.
Now, I need to display this database in a complicated way, with certain string formatting, column widths etc. Should this behaviour be part of an external class that would take a list of cusomers, or should it rather stay within the Database itself, as it is the "closest" to the customer objects? What are pros/cons of both approaches? What makes more sense semantically?
Thx in advance.
I would use an external class because the code will be reusable for other applications. If you tie the formatting to the database, then you'll end up with very specific code that will cost you time further on down the road.
That being said, the elegant solution is not always the fastest. If there's a looming deadline, and you find it easier and faster to hard code your formatting info, then it may be more prudent to brute force your way through it so you don't end up with an angry client, an unfinished project and an unpaid invoice.
I'm looking for a generic charting solution, ideally not a hosted one that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd like to also extend the concept of a client identifier further, taking the above example further, I'd like to store statistics for each core separately, so, another identifier would be Core 1/Core 2..
Now, to make sure I'm clearly stating my problem, I don't want a utility that collects these statistics. I'd like something that stores them, but, this is also not mandatory, I can always store them in MySQL, or such.
What I'm looking for is something that takes values such as these, and charts them nicely, in a multitude of ways (timelines, motion, and the usual ones [pie, bar..]). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services, multiple applications, and the datapoints will be of varying resolution. Some of the data will include multiple layers of nesting, some none. (For example, CPU would go down to Server IP, CPU#, whereas memory would only be Server IP, but would include a different identifier, i.e free/used/cached as the "secondary' identifier. Something like average request latency might not have a secondary identifier at all, in the case of ping). What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be great: adding an extra identifier on top of ip/cpu#, namely, process name. I think the advantages of that are obvious.
For some applications, we might collect data at a very narrow scope, focusing on every aspect, in other cases, it might be a more general statistic. When stuff goes wrong, both come in useful, the first to quickly say "something just went wrong", and the second to say "why?".
Further, it would be a nice thing if the charting application threw out "bad" values, that is, if for some reason our monitoring program started to throw values of 300% CPU used on a single core for 10 seconds, it'd be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data-layer though, so its not a requirement at all.
Finally, comparing two points in time, or comparing two different client identifiers of the same service etc without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in (one of the following) PHP, Python, C/C++, C#, as these are languages I'm familiar with. It doesn't have to be open source, it doesn't have to be a library, I'm open to using whatever fits my purpose the best.
More of a P.S than a requirement: I'd like to have pretty charts that are easy for non-technical people to understand, and act upon too (and like looking at!).
I'm open to clarifying, and, in advance, thanks for your time!
I am pretty sure that protovis meets all your requirements. But it has a bit of a learning curve. You are meant to learn by examples, and there are plenty to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like get rid of your "Bad" values.
Our customers use 500+ applications and we would like to integrate these applications with our. What is the best way to do that? These applications are time registration applications and common for most of them is that they can export to csv or similar, some of them are actually home-brewed excel sheets where time is registered.
The best idea so far is to create our own excel sheet, which can be used to integrate with all these applications. The integrations could be in the form of cells containing something like ='[c:\export.csv]rawdata'!$A$3 Where export.csv is the csv file exported from the time registration applications. Can you see a better way to integrate against all these applications? It should be mentioned that almost all our customers have Microsoft Office.
Edit: Answers to the excellent questions from Pontus Gagge:
How similar are the data in the different applications?
I assume that since they time registration applications, they will have some similarities, but I assume that some will register the how long time one has worked in total for a whole month, while others will spesify for each day. If Excel is chosen, I believe that many of the differences could be ironed out using basic formulas.
What quality is the data?
The quality of the data can vary so basic validation must be undertaken, a good way is also to make it transparent for the customers, how our application understands their input, so they are responsible.
How large amounts of data are you talking about?
There will be information about the time worked for up to 50 employees.
Is the integration one-way only?
Yes
With what frequency should information be transferred?
Once per month (when they need to pay salaries).
How often do the applications themselves change, and how often does your product change?
If their application is a home-brewed Excel sheet, then I assume it will change once a year (due for example a mistake someone). If it is a standard proper time registration application, then I do not believe they are updated more often than every fifth year or so, as it is a very stabile concept.
Should the integration be fully automatic or can your end users trigger a data transfer?
They can surely trigger data transfer. The users are often dedicated to the process so they can be trained at doing it, which means that they could make up to, say 30, mouse clicks in order to integrate each month.
Will the customers have somebody to monitor the integrations?
As we have many customers, many of them should be able to undertake the integration themselves. We will though be able to assist them over the telephone. We cannot, though undertake the integration ourselves because we would then be responsible for any errors due to user mistakes, etc.
Does the phrase 'integration spaghetti' mean anything to you...?
I am looking for ideas from the best chefs to cook a nice large portion of that.
You need to come up with a common data format, and a way to translate the individual data formats to the common format. There's really no way around this - any solution you come up with will have to do this in one way or the other. It's the essential complexity of what you're doing.
The bigger issue is actually variances within the source data, in terms of how things like dates are stored, missing columns, etc. Doing a generic conversion for CSV to move columns around is comparatively easy.
I would also look at CSV and then use an OLEDB connection against the CSV file for importing.
If you try to make something that can interface to any data structure in the universe (and 500 is plenty close enough), it is guaranteed to be a maintenance nightmare. Instead I would approach this from multiple angles:
Devise an interface into which a human can enter this data already in the proper format. With 500+ clients, I'd make this a small, raw but functional browser based site that users can use to enter this information manally. This is the fall-back. At the end of the day, a human can re-key the information into the site and solve the import issue. Ideally, everyone would use this instead of their own format. Data entry people are cheap.
Similar to above, but expanded, I would develop a standard application or standardize on an off-the-shelf application that can be used to replace their existing format. This might take more time than #1. The goal would be to only do one-time imports of these varying data schemas into the application and be done with them for good.
The nice thing about spreadsheets is that you can do anything anywhere. The bad thing about spreadsheets is that you can do anything anywhere. With CSV or a spreadsheet there is simply no way to enforce data integrity and thus consistency (which is the primary goal) on the data. If the source data is already in a database, then that is obviously simpler.
I would be inclined to use database format into which each of these files need to be converted rather than a spreadsheet (e.g. use something like Jet (MDB)). If you have non-Windows users then that will make it harder and you might have to use a spreadsheet. The problem is that it is too easy for the user to change their source structure, break their upload and come crying to you. If a given end user has a resident expert, they can find a way of importing the data into that database format . If you are that expert, then I would on a case-by-case basis, write something that would import into that database format. XML would be the other choice, but that will likely take more coding than an import/export into a database format.
Standardization of the apps (even having all the sources in a database format instead of a spreadsheet would help) and control over the data schema is the ultimate goal rather than permitting a gazillion formats. There really is no nice answer other than standardization. Otherwise, you are having to write a converter for every Tom-Dick-and-Harry format and again when someone changes the source format.
With a multitude of data sources mapping each one correctly to an intermediate format is not trivial. Regular expressions are good with a finite set of known data formats. Multipass can help when data is ambiguous without context (month,day fields and have several days of data), and also help defeat data entry errors. But it seems as this data is connected to salaries there needs a good reliable transfer.
An import configuring trick
Get the customer to make a set of training data in the application. It should have a "predefined unique date" and each subsequent data field have a number corresponding to the target data field in your application. On importing your application needs to recognise the predefined date, determine the unique translation required and effect the displaying/saving of this "mapping key", and stop the import. eg If you expect "Duration hours" in field two then get the user to enter 2 in the relevant field which might be "Attendance hours".
On subsequent runs, and with the mapping definition key, import becomes a fairly easy process of translation.
Note on terms
"predefined date" - must be historical, say founding date of your company?, might need to be in PC clock settable range.
"mapping key" - could be string of hex digits and nybble based so tractable to workout
The entered code can be extended to signify required conversions ie customer's application has durations in days and your application expects it in hours.
Interfacing with windows programs (in order if increasing fragility)
Ye Olde saving as CSV file
Print to operating system printer that is setup as a text file/pdf, then scavenge the data out of that
Extract data via the application interface control, typically ActiveX for several windows programs ie like Matlab's Spreadsheet Link
Read native file format xls format ie like Matlab's xlsread
Add an additional intermediate spreadsheet sheet that has extended cell references ie ='[filename]rawdata'!$A$3
Have a look at Teiid by JBoss: http://jboss.org/teiid
Also consider using SOA - e.g., if you're on Java, try JBoss SOA platform: http://www.jboss.com/resources/soa/?intcmp=1004
Use a simple XML format. A non-technical person can easily understand a simple XML format (and could even identify basic problems with XML documents that are not well-formed).
Maybe use a DTD (or even better an XML schema) to do very basic validation, and then supplement this with an XSL stylesheet to do more validation with better error reporting. (An XSL stylesheet simply converts from XML to something else and so can be generate readable error messages.)
The advantage of this approach is that web browsers such as Internet Explorer can apply the XSL stylesheets. A customer need only spend at most a day enhancing their applications or writing excel macros to generate the XML data in the format that you specify.
Recent versions of Excel have support for converting spreadsheet data to XML, and can even validate against schemas.
Once the data passes the XSL validation checks, you have validated XML data.
If you have heaps of data and heaps of money, you could look at existing data management and cleansing tools:
http://www-01.ibm.com/software/data/infosphere/datastage
http://www-01.ibm.com/software/data/infosphere/qualitystage
But even then, you'll likely need to follow kyoryu's suggestion assuming you have 500+ data formats. The problem isn't your side. You need them to standardize their output formats if you have no control over their apps. CSV is likely the easiest. You could even send them a excel template to help them along.