So I am trying to do a toy example where I know the factors in advance and I want to back them out using FactorAnalysis or PCA in scikit-learn.
Let's say I have defined 4 random X factors and 10 Y dependent variables:
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, FactorAnalysis

# number of obs
N=10000
n_factors=4
n_variables=10
# 4 Random Factors ~N(0,1)
X=np.random.normal(size=(N,n_factors))
# Loadings for 10 Y dependent variables
loadings=pd.DataFrame(np.round(np.random.normal(0,2,size=(n_factors,n_variables)),2))
# Y without unique variance
Y_hat=X.dot(loadings)
There is no random noise here, so if I run PCA it shows that 4 factors explain all the variance, as one would expect:
pca=PCA(n_components=n_factors)
pca.fit(Y_hat)
np.cumsum(pca.explained_variance_ratio_)
array([0.47940185, 0.78828548, 0.93573719, 1. ])
So far so good. In the next step I ran the factor analysis and reconstituted Y from the calculated loadings and factor scores:
fa=FactorAnalysis(n_components=n_factors, random_state=0,rotation=None)
X_fa = fa.fit_transform(Y_hat)
loadings_fa=pd.DataFrame(fa.components_)
Y_hat_fa=X_fa.dot(loadings_fa)+np.mean(Y_hat,axis=0)
print((Y_hat_fa-Y_hat).max())
print((Y_hat_fa-Y_hat).min())
6.039613253960852e-13
-5.577760475716786e-13
So the original variables and the variables reconstituted from FA match almost exactly.
However, the loadings don't match at all, and neither do the factors:
loadings_fa-loadings
0 1 2 3 4 5 6 7 8 9
0 1.70402 -3.37357 3.62861 -0.85049 -6.10061 11.63636 3.06843 -6.89921 4.17525 3.90106
1 -1.38336 5.00735 0.04610 1.50830 0.84080 -0.44424 -1.52718 3.53620 3.06496 7.13725
2 1.63517 -1.95932 2.71208 -2.34872 -2.10633 4.50955 3.45529 -1.44261 0.03151 0.37575
3 0.27463 3.89216 2.00659 -2.18016 1.99597 -1.85738 2.34128 6.40504 -0.55935 4.13107
From quick calculations, the factors from FA are not even well correlated with the original factors.
I am looking for a good theoretical explanation of why I can't back out the original factors and loadings; I am not necessarily looking for a code example.
I have a custom (discrete) probability distribution defined somewhat in the form f(x) / sum(f(x') for x' in a given discrete set X). Also, 0 <= x <= 1.
I have been trying to implement it in Python 3.8.2, and the problem is that the numerator and denominator both come out to be really small, and Python's floating-point representation just takes them as 0.0.
After calculating these probabilities, I need to sample a random element from an array, where each index may be selected with the corresponding probability from the distribution. So if my distribution is [p1,p2,p3,p4] and my array is [a1,a2,a3,a4], then the probability of selecting a2 is p2, and so on.
So how can I implement this in an elegant and efficient way?
Is there any way I could use np.random.beta() in this case? The only differences between the beta distribution and my actual distribution are the normalization constant and the fact that the domain is restricted to a few points.
Note: The probability mass function defined above is actually in the form given by Bayes' theorem, and f(x)=x^s*(1-x)^f, where s and f are fixed numbers for a given iteration. So the exact problem is that, when s or f become really large, this thing goes to 0.
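For a sense of scale, even moderate exponents already underflow; the numbers below are purely illustrative, not my real data:

s, f = 2000, 3000
print(0.4**s * (1 - 0.4)**f)   # prints 0.0: both factors underflow in double precision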
You could well compute things by working with logs. The point is that while both the numerator and denominator might underflow to 0, their logs won't unless your numbers are really astonishingly small.
You say
f(x) = x^s*(1-x)^t
so
log f(x) = s*log(x) + t*log(1-x)
and you want to compute, say
p = f(x) / Sum{ y in X | f(y)}
so
p = exp( log f(x) - log sum{ y in X | f(y) } )
  = exp( log f(x) - log sum{ y in X | exp( log f(y) ) } )
The only difficulty is in computing the second term, but this is a common problem, for example here
On the other hand, computing logsumexp is easy enough to do by hand.
We want
S = log( sum{ i | exp(l[i])})
if L is the maximum of the l[i] then
S = log( exp(L)*sum{ i | exp(l[i]-L)})
= L + log( sum{ i | exp( l[i]-L)})
The last sum can be computed as written, because each term is now between 0 and 1 so there is no danger of overflow, and one of the terms (the one for which l[i]==L) is 1, and so if other terms underflow, that is harmless.
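Putting the pieces together, here is a minimal Python sketch (the function name, candidate points xs and the exponents are illustrative assumptions, not values from the question): it computes log f, normalizes with the max trick just described, and then draws an index with numpy's choice. scipy.special.logsumexp would also do the normalization step for you.

import numpy as np

def sample_index(xs, s, t, rng=np.random.default_rng()):
    """Draw one index of xs with probability proportional to x**s * (1-x)**t."""
    xs = np.asarray(xs, dtype=float)
    logf = s * np.log(xs) + t * np.log1p(-xs)   # log f(x) = s*log(x) + t*log(1-x)
    L = logf.max()
    logZ = L + np.log(np.exp(logf - L).sum())   # log of the normalizing sum, via the max trick
    p = np.exp(logf - logZ)                     # probabilities are now ordinary numbers in [0, 1]
    return rng.choice(len(xs), p=p)

# illustrative candidate points and exponents large enough to underflow a naive computation
xs = np.linspace(0.05, 0.95, 10)
i = sample_index(xs, s=800, t=1200)
print(i, xs[i])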
This may, however, lose a little accuracy. A refinement would be to recognize the set A of indices where
l[i] >= L - eps   (eps a user-set parameter, e.g. 1)
and then compute
N = Sum{ i in A | exp(l[i]-L)}
B = log1p( Sum{ i not in A | exp(l[i]-L)}/N)
S = L + log( N) + B
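And a corresponding sketch of that refinement, again in Python (the function name is mine; eps is the user-set parameter mentioned above):

import numpy as np

def logsumexp_refined(l, eps=1.0):
    """log(sum(exp(l))): near-maximal terms are summed directly, the rest folded in via log1p."""
    l = np.asarray(l, dtype=float)
    L = l.max()
    A = l >= L - eps                            # the set A of indices with l[i] >= L - eps
    N = np.exp(l[A] - L).sum()
    B = np.log1p(np.exp(l[~A] - L).sum() / N)
    return L + np.log(N) + B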
I wrote a Prolog program to find all solutions to any '8 Out of 10 Cats Does Countdown' numbers round. I am happy with the result. However, the solutions are not unique. I tried distinct/1 and reduced/1 from the solution_sequences library. They did not produce unique solutions.
The problem is simple. You have a given list of six numbers [n1,n2,n3,n4,n5,n6] and a target number (R). Calculate R from any arbitrary combination of n1 to n6 using only +,-,*,/. You do not have to use all the numbers, but you can only use each number once. If two solutions are identical, only one should be generated and the other discarded.
Sometimes there are equivalent results with a different arrangement, such as:
(100+3)*6*75/50+25
(100+3)*75*6/50+25
Does anyone have any suggestions to eliminate such redundancy?
Each solution is a nested term of operators and integers, for example +(2,*(4,-(10,5))). Such a solution is an unbalanced binary tree with arithmetic operators at the root and internal nodes and numbers at the leaf nodes. In order to have unique solutions, no two trees should be equivalent.
The Code:
:- use_module(library(lists)).
:- use_module(library(solution_sequences)).

solve(L,R,OP) :-
    findnsols(10,OP,solve_(L,R,OP),S),
    print_solutions(S).

solve_(L,R,OP) :-
    distinct(find_op(L,OP)),
    R =:= OP.

find_op(L,OP) :-
    select(N1,L,Ln),
    select(N2,Ln,[]),
    N1 > N2,
    member(OP,[+(N1,N2), -(N1,N2), *(N1,N2), /(N1,N2), N1, N2]).
find_op(L,OP) :-
    select(N,L,Ln),
    find_op(Ln,OP_),
    OP_ > N,
    member(OP,[+(OP_,N), -(OP_,N), *(OP_,N), /(OP_,N), OP_]).

print_solutions([]).
print_solutions([A|B]) :-
    format('~w~n',A),
    print_solutions(B).
Test:
solve([25,50,75,100,6,3],952,X)
Result
(100+3)*6*75/50+25 <- s1
((100+6)*3*75-50)/25 <- s2
(100+3)*75*6/50+25 <- s1
((100+6)*75*3-50)/25 <- s2
(100+3)*75/50*6+25 <- s1
true.
This code uses select/3 from the "lists" library.
UPDATE: Generating solutions using a DCG
The following is an attempt to generate solutions using a DCG. I was able to generate a more exhaustive solution set than with the previously posted code. In a way, using a DCG resulted in more correct and elegant code. However, it is much more difficult to 'guess' what the code is doing.
The issue of redundant solutions still persists.
:- use_module(library(lists)).
:- use_module(library(solution_sequences)).

s(L) --> [L].
s(+(L,Ls)) --> [L], s(Ls).
s(*(L,Ls)) --> [L], s(Ls), {L =\= 1, Ls =\= 1, Ls =\= 0}.
s(-(L,Ls)) --> [L], s(Ls), {L =\= Ls, Ls =\= 0}.
s(/(L,Ls)) --> [L], s(Ls), {Ls =\= 1, Ls =\= 0}.
s(-(Ls,L)) --> [L], s(Ls), {L =\= Ls}.
s(/(Ls,L)) --> [L], s(Ls), {L =\= 1, Ls =\= 0}.

solution_list([N,H|[]],S) :-
    phrase(s(S),[N,H]).
solution_list([N,H|T],S) :-
    phrase(s(S),[N,H|T]);
    solution_list([H|T],S).

solve(L,R,S) :-
    permutation(L,X),
    solution_list(X,S),
    R =:= S.
Does anyone have any suggestions to eliminate such redundancy?
I suggest defining a sorting weight on each node (inner or leaf). The number resulting from reducing (evaluating) the child node could be used, although ties will appear. These can be broken by additionally looking at the topmost operations, sorting * before + for example. Actually, one would like a sorting operation for which a "tie" means "exactly the same subtree of arithmetic operations".
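Since the question is in Prolog but the weighting idea is language-agnostic, here is a minimal Python sketch of it (the tuple representation and helper names are mine): operands of the commutative operators + and * are sorted by the value of their subtree, with the printed form as a tie-breaker. One extra step beyond pure sorting is needed, namely flattening chains of the same commutative operator, because (a*6)*75 and (a*75)*6 differ structurally even after their children are sorted.

from fractions import Fraction

# An expression is either an integer or a tuple (op, left, right).
def evaluate(e):
    if not isinstance(e, tuple):
        return Fraction(e)
    op, a, b = e
    x, y = evaluate(a), evaluate(b)
    return {'+': x + y, '-': x - y, '*': x * y, '/': x / y}[op]

def operands(op, e):
    """Flatten a chain of the same commutative operator into its list of operands."""
    if isinstance(e, tuple) and e[0] == op:
        return operands(op, e[1]) + operands(op, e[2])
    return [e]

def canonical(e):
    """A canonical form under which equivalent solution trees compare equal."""
    if not isinstance(e, tuple):
        return e
    op, a, b = e
    if op in '+*':
        # the sorting weight: the value of each operand subtree, printed form breaks ties
        args = sorted(operands(op, a) + operands(op, b),
                      key=lambda s: (evaluate(s), repr(s)))
        return (op,) + tuple(canonical(s) for s in args)
    return (op, canonical(a), canonical(b))

s1 = ('+', ('/', ('*', ('*', ('+', 100, 3), 6), 75), 50), 25)   # (100+3)*6*75/50+25
s2 = ('+', ('/', ('*', ('*', ('+', 100, 3), 75), 6), 50), 25)   # (100+3)*75*6/50+25
print(canonical(s1) == canonical(s2))   # True: the two redundant solutions collapse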
Since the OP is only seeking hints to help solve the problem:
Use DCG as a generator. (SWI-Prolog) (Prolog DCG Primer)
a. For a more refined version of using DCGs as a generator, look for examples that use length/2. When you understand why, you might see a beam of light shining down on you for a few moments (the light beam is a video-gaming thing).
Use a constraint solver (SWI-Prolog) (CLP(FD) and CLP(ℤ): Prolog Integer Arithmetic) (Understanding CLP(FD) Prolog code of N-queens problem)
Since your solutions are constrained to the 6 numbers and the operators are always binary (+,-,*,/), it is possible to enumerate the unique binary trees. If you know about OEIS, then you can find related links that can help you solve this problem, but you need to give OEIS a sequence. To get a sequence for use with OEIS, draw the trees for N from 2 to 5 and then enter that sequence into OEIS and see what you get, e.g.
N is the number of leaf (*) nodes.
N=2 ( 1 way to draw the tree )
  -
 / \
*   *
N=3 ( 2 ways to draw the tree )
    -         -
   / \       / \
  -   *     *   -
 / \           / \
*   *         *   *
So the sequence starts with 1,2 ...
Hint - This page (link died) shows the images of the trees to see if you have done it correctly. In the description I use N to count the number of leaves (*), but on this page they use N to count the number of internal nodes (-). If we call my N N1 and the page N N2, then the relation is N2 = N1 - 1
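If you would rather let code do the counting, a short recursive count (a sketch, not part of the original hint; the helper name is mine) reproduces the sequence to paste into OEIS. Here N is the number of leaf (*) nodes, as in the description above.

from functools import lru_cache

@lru_cache(maxsize=None)
def tree_shapes(leaves):
    """Number of shapes of a binary tree in which every internal node has exactly two children."""
    if leaves == 1:
        return 1
    # split the leaves between the left and the right subtree in every possible way
    return sum(tree_shapes(k) * tree_shapes(leaves - k) for k in range(1, leaves))

print([tree_shapes(n) for n in range(2, 7)])   # the sequence to enter into OEIS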
This might be a Hamiltonian Cycle (Wolfram MathWorld) (Hamiltonianicity of the Tower of Hanoi Problem). Remember that there is a relation between binary trees and the Tower of Hanoi, but in your case there are added constraints. I don't know if the constraints eliminate a solution as a Hamiltonian cycle.
Also, don't think of building the final answer from a combination of any number and operator; instead build subsets of operators and numbers, and then use those subsets to build the answer. You constrain at the start, not at the end.
Or, put another way, don't think of combinations at the start, but of permutations of combinations (not sure if that is the correct pattern, but it is in the ballpark), and then use that to build the tree.
Using the PATH solver directly, I fail to solve the problem presented below. The original problem was sourced from https://prod.sandia.gov/techlib-noauth/access-control.cgi/2015/155584.pdf , which seems to claim the problem was solved. Using a nonlinear formulation, it is possible to solve it.
Whether this is a versioning issue in Pyomo or PATH is difficult to tell.
I am running pyomo 5.5.x and pathampl sourced from http://pages.cs.wisc.edu/~ferris/path.html
from pyomo.environ import *
from pyomo.mpec import *
model = ConcreteModel()
model.x1 = Var()
model.x2 = Var()
model.x3 = Var()
model.f1 = Complementarity(expr=complements(model.x1 >= 0,model.x1 + 2*model.x2 + 3*model.x3 >= 1))
model.f2 = Complementarity(expr=complements(model.x2 >= 0,model.x2 - model.x3 >= -1))
model.f3 = Complementarity(expr=complements(model.x3 >= 0,model.x1 + model.x2 >= -1))
from pyomo.opt import SolverFactory
opt = SolverFactory("pathampl")
results = opt.solve(model, load_solutions=True, tee=True)
#sends results to stdout
results.write()
Corresponding error message:
*** EXIT - infeasible.
Major Iterations. . . . 0
Minor Iterations. . . . 0
Restarts. . . . . . . . 0
Crash Iterations. . . . 0
Gradient Steps. . . . . 0
Function Evaluations. . 0
Gradient Evaluations. . 0
Basis Time. . . . . . . 0.000000
Total Time. . . . . . . 0.000000
Residual. . . . . . . . inf
WARNING: Loading a SolverResults object with a warning status into
model=unknown;
message from solver=Path 4.7.01\x3a Infeasible.; 0 iterations (0 for
crash); 0 pivots.; 0 function, 0 gradient evaluations.
# ==========================================================
# = Solver Results =
# ==========================================================
# ----------------------------------------------------------
# Problem Information
# ----------------------------------------------------------
Problem:
- Lower bound: -inf
Upper bound: inf
Number of objectives: 1
Number of constraints: 0
Number of variables: 6
Sense: unknown
# ----------------------------------------------------------
# Solver Information
# ----------------------------------------------------------
Solver:
- Status: warning
Message: Path 4.7.01\x3a Infeasible.; 0 iterations (0 for crash); 0 pivots.; 0 function, 0 gradient evaluations.
Termination condition: infeasible
Id: 201
Error rc: 0
Time: 0.37000012397766113
# ----------------------------------------------------------
# Solution Information
# ----------------------------------------------------------
Solution:
- number of solutions: 0
number of solutions displayed: 0
Displaying Solution
In the absence of someone writing a better answer, you might try using SolverFactory('mpec_nlp').solve(model) to see what happens.
If you enjoy reading *.nl files, you can also model.write('tmp.nl') to see what is generated via the AMPL interface.
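For reference, a minimal sketch of those two suggestions, assuming the model object from the question has already been built:

from pyomo.opt import SolverFactory

# reformulate the complementarity conditions and solve the result with an NLP solver
results = SolverFactory('mpec_nlp').solve(model)
results.write()

# write out the AMPL-format .nl file to inspect what is sent to the solver
model.write('tmp.nl')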
As per Bethany Nicholson's post above, using PATH 4.7.04 will solve the problem. For some reason 4.7.01 returns an error.
Qi Chen's reply will solve the problem, but does not use PATH.
I have a list of values increasing exponentially. I was asked to compute multiple coefficients of variation (CVs) from them. You might agree with me that the CV is meant for the whole set of numbers, and that dividing the set into subgroups and calculating a CV for each subgroup seems unreasonable. Is there any statistical idea behind multiple CVs, and if there is, how can a histogram be made from the CVs? I mean, what would the bins of the histogram be? I appreciate the answers in advance.
I agree with you - it does not make sense to me to calculate multiple CVs for one dataset unless there's some inferential reason for doing so.
That being said, there might actually be a reason for considering sub-groups of a dataset. In the field of statistics, context is everything. My first thought is to ask your colleague why they want you to proceed that way. Maybe there's a good reason, maybe they don't have as full a grasp of stats as you do; regardless, it should be an enlightening conversation to have.
If you do decide to go this route, here's some R code that might help (R is great - flexible, powerful, and free):
# first, simulating some fake data (100 values of measurement & group for 10 groups)
x <- rnorm(100, mean=10, sd=1)
group <- sample(LETTERS[1:10], 100, replace=T)
# first few values of each
head(data.frame(x, group))
x group
1 10.778480 F
2 9.274193 B
3 9.639143 G
4 9.080369 I
5 10.727895 D
6 10.850306 G
# this is the part you'd actually need...
# calculating the sd & avgs for each group
sds <- tapply(x, group, sd)
avgs <- tapply(x, group, mean)
# then the cv
cvs <- sds/avgs
cvs
A B C D E F G H I J
0.07859528 0.07570556 0.09370247 0.12552468 0.08897856 0.11044543 0.10947615 0.10323379 0.08908262 0.09729945
# and if you want a histogram, R makes it pretty easy
hist(cvs)