Frequency analysis - statistics

Suppose I have an experimental group and a control group, and two categories (attractive, less attractive) whose frequencies are noted. What type of statistical tool should I use?
Experimental group Attractive category = 35, Less attractive category = 15
Control group Attractive category = 25, Less attractive category = 25

Try Fisher's exact test, assuming the case and control groups are independent.
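As an aside, here is a minimal sketch of that test in Python with scipy; the 2x2 table is simply the counts given in the question:

from scipy.stats import fisher_exact

# 2x2 contingency table:
#                attractive   less attractive
# experimental       35             15
# control            25             25
table = [[35, 15],
         [25, 25]]

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, p-value = {p_value:.4f}")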


Formal UML representation of reshaping a data frame

For documenting the restructuring of a data table from "wide" format (one column per criterion holding its score) to "long" format (one score column plus one criterion column), my first reaction was to use a UML class diagram.
I am aware that by changing the structure of the data table, the class attributes have not changed.
My first question is whether the wide or the long version is the more correct representation of the data table?
My second question is whether it would make sense to relate the two representations - and if so, by which relationship?
My third question is whether something other than a UML class diagram would be more suitable for documenting the reshaping (data preprocessing before showing the distribution as a box plot in R).
You jumped a little too fast from the table to the UML. This makes your question confusing, because what is wide as a table is represented long as a class, and vice versa.
Reformulating your problem, it appears that you are refactoring some tables. The wide table shows several values for the same student in the same row. This means that the maximum number of exercises is fixed by the table structure:
ID   Ex1  Ex2  Ex3  ...  Ex N
------------------------------
111   A    A    A   ...   A
119   A    C    -   ...   D
127   B    F    B   ...   F
The long table has fewer columns, and each row shows only 1 specific score of 1 specific student:
ID # Score
---------------
111 1 A
111 2 A
111 3 A
...
111 N A
119 1 A
119 2 C
...
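For reference, the reshape being documented is a one-liner in most data tools. A minimal sketch in Python with pandas (the column names Ex1..Ex3 are just placeholders; in the question's R context, reshape2::melt or tidyr::pivot_longer does the same thing):

import pandas as pd

# wide: one column per exercise
wide = pd.DataFrame({
    "ID":  [111, 119, 127],
    "Ex1": ["A", "A", "B"],
    "Ex2": ["A", "C", "F"],
    "Ex3": ["A", "-", "B"],
})

# long: one row per (student, exercise) score
long = wide.melt(id_vars="ID", var_name="Exercise", value_name="Score")
print(long.sort_values(["ID", "Exercise"]))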
You can model this structure in a UML class diagram. But in UML, the table layout doesn't matter: that is an issue for the ORM mapping, and you could perfectly well have one class model (with an attribute or an association having a multiplicity 1..N) that could be implemented using either the wide or the long version. If the multiplicity were 1..*, only the long option would work.
Now to your questions:
Both representations are correct; they just have different characteristics. The wide one is inflexible, since the maximum number of scores is fixed by the table structure. Also, adding a new score in fact requires updating an existing record (so the concurrency behaviour of the two models is not the same). The long one is a little more complex to use if you want to show the history of a student's scores in a single row.
Yes, it makes sense to relate both, especially if you are documenting a transformation of the first into the second.
UML would not necessarily add value here. If you are really concerned with tables and values, you could just as well use an Entity/Relationship diagram. But UML has the advantage of allowing database modelling as well, and it lets you add behavioural aspects, if not now, then later. You could consider using the non-standard «table» stereotype to clarify that you are modelling a table (i.e. a low-level view of your design).

How to specify number of instances in UML Class diagram

Is there a way in a UML class diagram to indicate how many instances of a given class will be present in your system?
I know you can indicate the multiplicity of a relationship between classes:
Dog * ----------- 1 Yard 1 ----------- * Tree
But is there a common way to visually depict that there is exactly say, five instances of Yard in the model?
You cannot express this directly with UML, but you can with OCL (the Object Constraint Language). It would be an invariant of Yard, like:
context Yard inv: Yard.allInstances()->size()=5
OCL is a language designed to express formal constraints when modelling with UML.
According to this explanation of UML multiplicities, it is perfectly valid to use any natural numbers for the number of elements. Also, if the lower bound is equal to the upper bound, you can describe them using just one number (e.g. 1..1 is equivalent to 1).
So you can for example have:
Yard 1 ----------- 5 Tree

Create (mathematical) function from set of predefined values

I want to create an Excel table that will help me when estimating implementation times for tasks that I am given. To do so, I derived 4 categories in which I individually rate each task from 1 to 10.
Those are: complexity of the system (simple scripts or entire business systems), state of requirements (well defined or very soft), knowledge about the system (how much I know about the system and the code base), and plan for implementation (do I know what to do, or do I have no plan of what to do or where to start).
After rating each task in these categories, I want a resulting factor indicating how expensive the task will be and how long it will likely take, as a very rough estimate that I can give my bosses.
What I thought about doing
I thought to create a function where I define the inputs and then get the result in the form of a number, see:
| a | b | c | d | Result |
| 1 | 1 | 1 | 1 | 160 |
| 5 | 5 | 5 | 5 | 80 |
| 10 | 10 | 10 | 10 | 2 |
And I want to create a function that, when given a, b, c, d will produce the results above for the extreme cases (max, min, avg) and of course any values (float) in between.
How can I go about doing this? I imagine this is some form of polynomial problem, but how can I actually create the function that creates these results?
I have tasks like this often, so it would be cool to have a sort of pattern to follow whenever I need to create such functions for any number of parameters and results.
I tried using Wolfram Alpha's interpolating polynomial command for this, but the result is just a mess of extremely large fractions...
How can I create this function properly with reasonable results?
While writing this edit, I realize this may be better suited over at programmers.SE - If no one answers here, I will move the question there.
You don't have enough data as it is. The simplest formula which takes into account all your four explanatory variables would be linear:
x0 + x1*a + x2*b + x3*c + x4*d
If you formulate a set of equations for this, you have three equations but five unknowns, which means that you don't have a unique solution. On the other hand, the data points which you did provide are proof of the fact that the relation between scores and time is not exactly linear. So you might have to look at some family of functions which is even more complex, and therefore has even more parameters to tune. While it would be easy to tune parameters to match the input, that choice would be pretty arbitrary, and therefore without predictive power.
So while your system of four distinct scores might be useful in the long run, I'd not use that at the moment. I'd suggest you collect some more data points, see how long a given task actually did take you, and only use that fine-grained a model once you have enough data points to fit all of its parameters.
In the meantime, aggregate all four numbers into a single number, e.g. by taking their average. Then decide on a formula, e.g. a quadratic one:
182 - 22.9*a + 0.49*a*a
That's a fair fit for your requirements, and not too complex or messy. But the choice of function, i.e. a polynomial one, is still pretty arbitrary. So revisit that choice once you have more data. Note that this polynomial is almost the one Wolfram Alpha found for your data:
1642/9 - 344/15*a + 22/45*a*a
I only converted these rational numbers to decimal notation, which I truncated pretty early on since all of this is very rough in any case.
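To make this concrete, here is a minimal sketch (in Python with numpy, purely as an illustration, not the asker's Excel setup) of fitting such a quadratic to the three aggregated points and reproducing the coefficients above:

import numpy as np

# aggregated score (average of a, b, c, d) -> estimated effort
scores = np.array([1.0, 5.0, 10.0])
effort = np.array([160.0, 80.0, 2.0])

# a quadratic through three points is an exact interpolation
coeffs = np.polyfit(scores, effort, deg=2)   # highest degree first
print(coeffs)                                # approx. [0.489, -22.933, 182.444]

# evaluate the fitted polynomial for any aggregated score in between
print(np.polyval(coeffs, 3.0))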
On the whole, this question appears more suited to CrossValidated than to Programmers SE, in my opinion. But don't bother them unless you have sufficient data to actually fit a model.

Stronger boosting by date in Solr

Boosting by a date field in Solr is defined as:
{!boost b=recip(ms(NOW,datefield),3.16e-11,1,1)}
I looked everywhere (examples: Solr Dismax Config for Boost Scoring and Solr boost for multivalued date field; they all reference the SolrRelevancyFAQ), and the same definition is used everywhere. But I found that this is not boosting my results sufficiently. How can I make this date boosting stronger?
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
And the Solr debug output is way too confusing for me to understand the problem.
Now, this is not a huge problem. 99% of queries work fine and produce expected results, so it's not like Solr is not working at all; I just found this situation very confusing and don't know how to proceed.
recip(x, m, a, b) implements f(x) = a/(x*m + b) with:
x: the document age in ms, defined as ms(NOW,<datefield>).
m: a constant that defines the time scale used to apply the boost. It should be chosen relative to what you consider an old document age (a reference_time) in milliseconds. For example, choosing a reference_time of 1 year (3.16e10 ms) means using its inverse: 3.16e-11 (1/3.16e10, rounded).
a and b are constants (defined arbitrarily).
x*m = 1 when the document is 1 reference_time old (multiplier = a/(1+b)).
x*m ≈ 0 when the document is new, resulting in a value close to a/b.
Using the same value for a and b ensures the multiplier does not exceed 1 for recent documents.
With a = b = 1, a document that is 1 reference_time old gets a multiplier of about 1/2, a document that is 2 reference_times old gets about 1/3, and so on.
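To see the effect of these constants, here is a small illustrative sketch in Python that evaluates the recip multiplier for documents of various ages, with a 1-year and a 6-month reference_time:

# multiplier applied by recip(ms(NOW,datefield), m, a, b) = a / (age_ms * m + b)
def recip_multiplier(age_years, m, a=1.0, b=1.0):
    age_ms = age_years * 3.16e10          # roughly 3.16e10 ms per year
    return a / (age_ms * m + b)

for m, label in [(3.16e-11, "1-year reference"), (6.33e-11, "6-month reference")]:
    print(label, [round(recip_multiplier(age, m), 3) for age in (0, 0.5, 1, 2, 5)])
# the shorter reference_time makes the multiplier fall off faster with document age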
How to make date boosting stronger?
Increase m: choose a lower reference_time, for example 6 months, which gives m = 6.33e-11. Compared to a 1-year reference, the multiplier decreases twice as fast as the document age increases.
Decreasing a and b expands the response curve of the function. This can be very aggressive; see this example (page 8).
Apply a boost to the boost function itself with the bf (Boost Functions) parameter (this is a dismax parameter, so it requires the DisMax or eDisMax query parser), e.g.:
bf=recip(ms(NOW,datefield),3.16e-11,1,1)^2.0
It is important to note a few things:
bf is an additive boost and acts as a bonus added to the score of newer documents.
{!boost b} is a multiplicative boost and acts more as a penalty applied to the score of older documents.
A bf score (the "bonus" added to the global score) is calculated independently of the relevancy score (the global score), meaning that a resultset with higher scores may not be impacted as much as a resultset with lower scores. In contrast, multiplicative boosts affect scores the same way regardless of the resultset relevancy, that's why it is usually preferred.
Do not use recip() for dates more than one reference_time in the future or it will yield negative values.
See also this very insightful post by Nolan Lawson on Comparing boost methods in Solr.
User is searching for two keywords. Both items contain both keywords (in same order) in both title and description. Neither of the keywords is repeated.
Well, from your example it is clear that your results have landed in a tie situation. To understand this confusing debug output and devise a tie-breaker policy, it is important to understand dismax.
With DisMax queries, the different terms of the user input are executed against different fields; if many of them hit (the term appears in different fields in the same document), the hit that scores highest is used. But what happens with the other sub-queries that hit in that document for the term? Well, that's what the tie parameter defines. DisMax calculates the score for a term query as:
score = [score of the top scoring subquery] + tie * (sum of other hitting subqueries)
Consequently, the tie parameter is a value between 0 and 1 that defines whether DisMax will only consider the maximum hit score for a term (tie=0), all the hits for a term (tie=1), or something between those two extremes.
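As an illustration of that formula alone (a sketch in Python, not Solr's actual implementation), this is how the tie parameter blends the per-field subquery scores:

# dismax per-term score: top subquery score plus tie * (sum of the other hits)
def dismax_term_score(subquery_scores, tie):
    top = max(subquery_scores)
    return top + tie * (sum(subquery_scores) - top)

scores = [2.0, 1.5, 0.5]                 # e.g. hits in title, description, tags
print(dismax_term_score(scores, 0.0))    # 2.0 -> only the best field counts
print(dismax_term_score(scores, 1.0))    # 4.0 -> all matching fields count
print(dismax_term_score(scores, 0.3))    # 2.6 -> something in between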
The boost parameter is very similar to the bf parameter, but instead of adding its result to the final score, it will multiply it. This is only available in the Extended Dismax Query Parser or the Lucid Query Parser.
There is an interesting article Comparing Boost Methods of SOLR which may be useful to you.
References for this answer:
Advanced Apache Solr boosting: a case study
Using Solr’s Dismax Tie Parameter
There is an example very well presented in ReciprocalFloatFunction that will give you a clear view of how the boosting recipe works. If you find that dismax does not offer you enough control over the boosting, you will have to do some tinkering with BoostQParserPlugin.
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.

Matching Based on Arbitrary Categories and Similarity Measures

I have a customer database where customers have certain attributes and a customer type. The collection of attributes can vary (though they do come from a finite set), and when I look at a new customer of unknown type with given attributes, I would like to determine which type s/he belongs to. For example, say I have these customers already in the DB:
Customer | Type | Attributes
---------|------|---------------
1        | A    | 44,32,5,'X'
2        | A    | 3,32,66,'A'
3        | B    | 6,32,'A','B'
4        | C    | 47,31,2,'H'
5        | C    | 14,32,2,'O'
6        | C    | 2,'C'
7        | A    | 44
When I receive a new customer who has attributes, for example, 3,32,2, I would like to determine which type this customer belongs to, and the code should report its confidence (as a percentage) in this match.
What is the best method to use here? Something statistical, a method based on an affinity matrix of some kind, or a recommendation-engine-style approach based on Pearson correlation coefficients? Sample or pseudocode would be most welcome, but any and all ideas are fine.
One way to solve this problem is to use Naive Bayes.
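A minimal sketch of that idea in Python with scikit-learn, using the question's example data; encoding each attribute set as binary present/absent features (and reading the predicted class probabilities as the confidence) is my assumption, not something prescribed by the question:

from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import MultiLabelBinarizer

# each known customer: a set of attribute values and a type label
attributes = [
    [44, 32, 5, 'X'],
    [3, 32, 66, 'A'],
    [6, 32, 'A', 'B'],
    [47, 31, 2, 'H'],
    [14, 32, 2, 'O'],
    [2, 'C'],
    [44],
]
types = ['A', 'A', 'B', 'C', 'C', 'C', 'A']

# encode each attribute set as a binary presence/absence vector
mlb = MultiLabelBinarizer()
X = mlb.fit_transform([[str(a) for a in attrs] for attrs in attributes])

model = BernoulliNB()
model.fit(X, types)

# classify a new customer (attributes 3, 32, 2) and report the confidence per type
new = mlb.transform([[str(a) for a in [3, 32, 2]]])
for cls, p in zip(model.classes_, model.predict_proba(new)[0]):
    print(f"type {cls}: {p:.1%}")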
