I am using Microsoft Excel 2016 for data analysis. I can carry out the test fine, but how do I produce the t value along with my p-value?
The function T.INV can be used to get the critical values. The documentation describes it as returning "the left-tailed inverse of the Student's t-distribution." Thus, e.g., T.INV(0.05,9) returns -1.83311. Taking ABS(T.INV(0.05,9)) = 1.83311 gives the right-tailed version, which is the one more traditionally used in statistics books.
There is also a function T.INV.2T, which you might prefer if you are running a two-tailed t-test.
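If you want to sanity-check these numbers outside Excel, the same quantiles are available from SciPy's t distribution. This is just a cross-check (assuming you have SciPy installed), not part of Excel itself:

    from scipy import stats

    # Left-tailed critical value, matching T.INV(0.05, 9).
    print(stats.t.ppf(0.05, df=9))           # approx -1.8331

    # Two-tailed critical value, matching T.INV.2T(0.05, 9).
    print(stats.t.ppf(1 - 0.05 / 2, df=9))   # approx 2.2622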
The dots in function names like T.INV are unusual among Excel naming conventions. The reason is that for about a decade Excel shipped statistical functions that were mathematically correct but whose implementations were not numerically stable. These tended to be fine in 99% of cases but could behave poorly when applied to values several standard deviations away from the mean. Eventually Microsoft responded to persistent criticism and, with Excel 2010, completely revamped its statistical functions. The old versions were kept as-is for reasons of backward compatibility. The new versions have similar names in most cases but use a dot: T.INV, for example, is a more modern, more numerically stable version of the old TINV. Whenever you have a choice between two similarly named statistical functions, use the one with the dot.
I'm a mechanical engineer, and I have developed a pretty cool spreadsheet that I use to size steel members for lifting beams. The setback is that I need to do some trial and error in selecting the member until I get one that comes as close to the allowable limits as possible.
What I'm hoping to do is develop a function where, based on the length and weight values I enter, the program runs a loop over a list of members and their physical properties and automatically selects the best member size(s). Is this possible?
Yes. Depending on the complexity, a simple search through the parameters (less than, greater than, etc.) might get you the answer. You can do it quite easily via the pandas library: just load the Excel file as a pandas DataFrame (pandas.read_excel()), which then allows you to perform the searches on that DataFrame object.
If you want to run an optimization algorithm, you should look into SciPy's optimize module to get what you're looking for based on the input data (it handles both unconstrained and constrained problems).
Of course, the question you've stated is quite general, so I can only point you in a direction; more detail would make a more specific answer possible.
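For instance, a minimal sketch of the filtering approach might look like the following (the file name and column names here are made up -- substitute whatever your member table actually uses):

    import pandas as pd

    # Load the member table; assumes one row per member size with its
    # physical properties as columns (hypothetical names).
    members = pd.read_excel("steel_members.xlsx")

    def best_member(required_modulus, max_weight_per_ft):
        # Keep only members that satisfy the allowable limits...
        ok = members[(members["section_modulus"] >= required_modulus)
                     & (members["weight_per_ft"] <= max_weight_per_ft)]
        # ...and return the lightest one, i.e. the closest to the limits.
        return ok.sort_values("weight_per_ft").head(1)

    print(best_member(required_modulus=25.0, max_weight_per_ft=40.0))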
I'm trying to use fuzzy lookup to match a list of correct names against a set of "dirty" names, but apparently VBA only uses one core of my processor, and it takes too much time because I am running it on at least 5,000 names.
Here's a link to the fuzzy code: https://www.mrexcel.com/forum/excel-questions/195635-fuzzy-matching-new-version-plus-explanation.html#post955137
I also researched "multithreading" solutions for VBA and found that there's no native way of doing it, but someone built an alternative using some scripts.
Here's the link for the multithreading vba script tool: https://analystcave.com/excel-vba-multithreading-tool/
Now, all I need to do is integrate the lookup code into this multithreading script so that it speeds up the processing of this function. I am assuming that this is possible, right?
Can someone help me with this? I only learned VBA through googling and reading other people's code, and this VBA multithreading tool is quite complicated for a beginner like me.
Thank you very much!
I'm not qualified to address the multithreading, but about your speed issue: are you running the code directly on the spreadsheet?
A better method is to import the entire table or range into an array and run the code on it there, while it's in memory -- it runs MUCH faster that way. Then paste the results back into the spreadsheet.
Here's some info on pulling the data into an array:
Creating an Array from a Range in VBA
http://www.cpearson.com/excel/ArraysAndRanges.aspx
You'll have to fiddle with the rest of your code, but basically you'll treat the array as if it were a table.
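If you're open to stepping outside VBA entirely, the divide-up-the-work idea is also only a few lines in Python's standard library. This is a rough sketch with made-up names, not a drop-in replacement for the fuzzy-matching code you linked:

    import difflib
    from multiprocessing import Pool

    clean_names = ["Andrew Hill", "Jane Doe"]  # the master list

    def best_match(dirty):
        # Closest clean name (or None) by stdlib similarity matching.
        hits = difflib.get_close_matches(dirty, clean_names, n=1, cutoff=0.6)
        return dirty, hits[0] if hits else None

    if __name__ == "__main__":
        dirty_names = ["Andy Hil", "Jane Do"]  # your ~5,000 names in practice
        with Pool() as pool:                   # one worker per CPU core
            print(pool.map(best_match, dirty_names))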
Below is an excerpt from Microsoft's website. I believe their C#-based Fuzzy Lookup add-in for Excel is multithreaded and much faster than the code you linked. Why reinvent the wheel when we have a better option available?
The Fuzzy Lookup Add-In for Excel was developed by Microsoft Research and performs fuzzy matching of textual data in Microsoft Excel. It can be used to identify fuzzy duplicate rows within a single table or to fuzzy join similar rows between two different tables. The matching is robust to a wide variety of errors including spelling mistakes, abbreviations, synonyms and added/missing data. For instance, it might detect that the rows “Mr. Andrew Hill”, “Hill, Andrew R.” and “Andy Hill” all refer to the same underlying entity, returning a similarity score along with each match. While the default configuration works well for a wide variety of textual data, such as product names or customer addresses, the matching may also be customized for specific domains or languages. The following libraries are required and will be installed if necessary:
.NET 4.5
VSTO 4.0
I am in the process of writing a document analyzing a large codebase for quality and maintainability. As part of this report I wish to include a count of the number of references an assembly makes to another assembly within the solution. This will give an idea of how tightly coupled each assembly is to another.
Is there a tool in Visual Studio 2015 Enterprise (or 3rd Party Plug-In) that can give me this number?
So far I have tried Visual Studio's Code Map tool, but this appears to just generate a visualization with arrows, which I would then have to count manually; furthermore, it only appears to go down to class/struct level, not to the number of individual references within each class/struct.
NDepend (http://www.ndepend.com/) offers this functionality. It can also be quite helpful in more general terms for the type of exploratory quality analysis you describe.
You can use FxCop / Code Analysis to do this; it has a number of maintainability rules, of which the most interesting for you would probably be:
CA1506: Avoid excessive class coupling
This rule measures class coupling by counting the number of unique type references that a type or method contains.
I believe the thresholds are 80 for a class and 30 for a method.
It's relatively easy to set up; basically you just need to configure it on a project. Opening the ruleset lets you choose which rules to run (and whether they are treated as warnings or errors) -- there are many, many rules.
To expand upon Nicole's answer, I have tested the trial of NDepend and I believe I have found the figures I was looking for in something they call the "Dependency Matrix". My understanding of it is as follows.
The numbers in green are a count of how many times the assembly in the current row references the assembly relating to the number in the current column. The numbers in blue are a count of how many times the assembly in the current row is referenced by the assembly relating to the number in the current column. Since an assembly cannot make an external reference to itself, no numbers can appear on the diagonal line.
What I do not understand, however, is why, for example, the number in cell 0, 4 is 93 but the number in cell 4, 0 is 52; shouldn't these numbers be equal? Assembly 0 is used by assembly 4 exactly as many times as assembly 4 uses assembly 0 -- how can these numbers be different?
UPDATE: I have watched a Pluralsight video on this tool and found out that the number in the green box represents how many methods in the referencing assembly make reference to the referenced assembly, while the number in the corresponding blue box represents how many methods in the referenced assembly are used by the referencing assembly. Neither number precisely represents the number of calls one assembly makes to another (since a method can contain multiple references), but I believe it provides a sufficient level of granularity anyway, since methods should conform to the SRP and thus all references within a method should relate to a single behaviour.
I can't find anything useful for determining whether a language is third generation or fourth. All I find are vague statements like "higher level" and "closer to English"; some sources say 4GLs are domain-specific languages like SQL, and others say they can be general purpose. I'm really confused.
If 2GLs are the Assembly languages and 5GLs are the inference languages like Prolog, how do you determine if a programming language is a 3GL or a 4GL?
Most use of the terms was pure marketing -- "Oh, you're still using a third generation language? That's so last week!"
Behind that, there was a tiny bit of technical meaning though (at least for a little while, though many "4GLs" ignored it). The basic difference was (supposed to be) that third-generation languages allowed you to manipulate only individual data items, whereas fourth-generation languages allowed you to manipulate groups of items as a group rather than individually.
Two obvious examples of this are SQL and APL. In SQL, you mostly work with sets. The result of a query is a set (not exactly a mathematical set, but at least somewhat similar). You can use and manipulate that set as a whole, merge it with other sets, etc. Until or unless you're exposing it to the outside world (e.g., with a cursor) you don't have to deal with the individual records/rows/tuples that make up that set.
In APL you get somewhat the same idea, except you're working with arrays instead of sets. To get an idea of what this means, suppose you wanted to "rotate" an array so the current first element moved to the end and every other element shifted ahead a spot. To do that in a typical 3GL (Fortran, Pascal, C, etc.) you'd write a loop that worked with the individual elements of the array. In APL, however, a single operator will do that to the array as a whole, all in one operation. Even operations that work with individual items are generally trivial to apply to an entire array at once with the / (reduce) operator, so, for example, the sum of all the elements in an array named a is just +/a.
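Python with NumPy (purely a stand-in here, not part of the original discussion) gives a rough feel for the distinction, with an explicit loop playing the 3GL role and whole-array operations approximating the APL style:

    import numpy as np

    a = np.array([1, 2, 3, 4, 5])

    # 3GL style: an explicit loop over the individual elements.
    rotated = a.copy()
    first = rotated[0]
    for i in range(len(rotated) - 1):
        rotated[i] = rotated[i + 1]
    rotated[-1] = first

    # "4GL" style: one whole-array operation does the same rotation...
    assert (np.roll(a, -1) == rotated).all()

    # ...and a whole-array reduction mirrors APL's +/a.
    total = a.sum()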
There are some pretty serious problems with the general idea of the distinction, though. One is that it placed a heavy emphasis on syntax: the actions involved obviously required things like loops internally, so the distinction amounted to a syntax for an implicit loop. Another problem was that in quite a few cases you ended up with a blend of the two generations -- e.g., most BASICs could treat a string as a single thing but required loops for similar operations on arrays/matrices. Finally, there was a problem of relevance: in a few special cases (like SQL) being able to work with a group/set/array of data as a whole really made a big difference, but for the most part it did little to let people think and work with higher-level abstractions (which was, at least apparently, the original intent).
That was compounded by a move toward languages that blurred the distinction between what was built in and what was part of a library. In C++, most ML-family languages, etc., it's trivial to write a function with arbitrary actions (including but not limited to loops) and attach it to an operator that's essentially indistinguishable from one built into the language.
It was a catchy phrase with a meaning most couldn't explain and even fewer cared about -- a prime candidate for being turned into pure marketspeak, usually translated roughly as: "you should pay me a lot for my slow, ugly, buggy CRUD application generator."
"Language generations" were a hot buzzword in the 1980s and early 1990s. They were always ill-defined, and little used in actual academic discourse.
The terms had little meaning at the time, and none now.
I'm looking for a generic charting solution (ideally not a hosted one) that provides the following features:
Charting a tuple of values where the values are:
1) A service identifier (e.g. CPU usage)
2) A client identifier within that service (e.g. server IP)
3) A value
4) A timestamp with millisecond/second resolution.
Optional:
I'd also like to extend the concept of a client identifier: taking the above example, I'd like to store statistics for each core separately, so another identifier would be Core 1/Core 2, and so on.
Now, to make sure I'm stating my problem clearly: I don't want a utility that collects these statistics. I'd like something that stores them, but even this is not mandatory -- I can always store them in MySQL or the like.
What I'm looking for is something that takes values such as these and charts them nicely, in a multitude of ways (timelines, motion, and the usual ones -- pie, bar, etc.). Essentially, a nice visualization package that allows me to make use of all this data. I'd be collecting data from multiple services and multiple applications, and the data points will be of varying resolution. Some of the data will include multiple layers of nesting, some none. For example, CPU would go down to server IP and CPU#, whereas memory would only go down to server IP but would include a different identifier, i.e. free/used/cached, as the "secondary" identifier; something like average request latency might not have a secondary identifier at all, as in the case of ping. What I'm trying to get across is that having multiple layers of identifiers would be great. To add one final example of where multiple identifiers would be useful: adding an extra identifier on top of IP/CPU#, namely the process name. I think the advantages of that are obvious.
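To make the shape of the data concrete, here is a minimal sketch of the kind of tuples I mean and one basic timeline view of them (pandas with matplotlib installed, purely as an illustration, not a tool I'm asking about):

    import pandas as pd

    # Each observation: (service, client identifier(s), value, timestamp).
    rows = [
        ("cpu",    "10.0.0.1/core1", 42.0, "2010-11-01 12:00:00.250"),
        ("cpu",    "10.0.0.1/core2", 57.0, "2010-11-01 12:00:00.250"),
        ("memory", "10.0.0.1/free",  11.5, "2010-11-01 12:00:01.000"),
    ]
    df = pd.DataFrame(rows, columns=["service", "client", "value", "timestamp"])
    df["timestamp"] = pd.to_datetime(df["timestamp"])

    # One line per client identifier over time -- the basic timeline view.
    cpu = df[df["service"] == "cpu"]
    cpu.pivot_table(index="timestamp", columns="client", values="value").plot()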
For some applications we might collect data at a very narrow scope, focusing on every aspect; in other cases it might be a more general statistic. When things go wrong, both come in useful: one to quickly say "something just went wrong", and the other to say "why?".
Further, it would be nice if the charting application threw out "bad" values -- that is, if for some reason our monitoring program started reporting 300% CPU usage on a single core for 10 seconds, it would be nice if the charts themselves didn't reflect it in the long run. Some sort of smoothing, maybe? This could obviously be done at the data layer instead, so it's not a requirement at all.
Finally, comparing two points in time, or comparing two different client identifiers of the same service, etc., without too much effort would be great.
I'm not partial to any specific language, although I'd prefer something in PHP, Python, C/C++, or C#, as these are the languages I'm familiar with. It doesn't have to be open source, and it doesn't have to be a library; I'm open to using whatever fits my purpose best.
More of a P.S. than a requirement: I'd like pretty charts that are easy for non-technical people to understand, act upon, and enjoy looking at!
I'm open to clarifying, and, in advance, thanks for your time!
I am pretty sure that protovis meets all your requirements, but it has a bit of a learning curve. You are meant to learn by example, and there are plenty of examples to work from. It makes some pretty nice graphs by default. Every value can be a function, so you can do things like filter out your "bad" values.
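On the "bad" values point: as the question notes, that cleanup can also live at the data layer before anything is charted. A hedged pandas sketch of the idea (illustrative only, nothing to do with protovis itself):

    import pandas as pd

    cpu = pd.Series([40.0, 41.0, 300.0, 300.0, 42.0, 43.0])  # glitchy readings

    # Clamp obviously impossible values (a single core can't exceed 100%),
    # then smooth with a rolling median so a brief glitch doesn't dominate
    # the long-run chart.
    cleaned = cpu.clip(upper=100.0)
    smoothed = cleaned.rolling(window=3, center=True, min_periods=1).median()
    print(smoothed)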