I am interested in applying the levenshteinSim function from the RecordLinkage package to vectors of strings (there's a good discussion of the function here).
Imagine that I have a vector called codes: "A","B","C","D",etc.;
And a vector called tests: "A","B","C","D",etc.
Using sapply to test a particular value in 'tests' against the vector of codes,
sapply(codes,levenshteinSim,str2=tests[1])
I would expect to get a list or vector (my apologies if I make terminological mistakes): [score1] [score2] [score3].
Unfortunately, the output is a comparison of the value in tests[1] against c("A","B","C","D", ...) as a whole -- a single value.
Ultimately, I want to apply the two vectors against one another to produce a matrix of size len1 x len2 -- but I don't want to move forward until I understand what I'm doing wrong.
Can anyone provide guidance?
I'm not sure where the problem lies:
library(RecordLinkage)
sapply(codes,levenshteinSim,str2=tests)
A B C D
[1,] 1 0 0 0
[2,] 0 1 0 0
[3,] 0 0 1 0
[4,] 0 0 0 1
When str2 is just one item, you get a length 4 vector.
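For readers outside R, here is a hedged Python sketch of the intended computation. levenshteinSim is documented as 1 minus the edit distance divided by the longer string's length; the goal is the full codes x tests score matrix (the helper names here are my own, not from RecordLinkage):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def levenshtein_sim(a: str, b: str) -> float:
    """Mimics RecordLinkage::levenshteinSim: 1 - dist / max length."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

codes = ["A", "B", "C", "D"]
tests = ["A", "B", "C", "D"]
# Full pairwise score matrix, one row per code, one column per test.
matrix = [[levenshtein_sim(c, t) for t in tests] for c in codes]
```

With these single-letter inputs the result is the same identity-like matrix shown in the R output above.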
I would like to create a random, square table of 0's and 1's like so
0 1 0
0 0 1
0 1 0
but only bigger (around 14 x 14). The diagonal should be all 0's and the table should be symmetric across the diagonal. For example, this would not be good:
0 1 1
0 0 0
0 1 0
I would also like to have control over the number of 1's that appear, or at least the probability that 1's appear.
I only need to make a few such tables, so this does not have to be fully automated by any means. I do not mind at all doing a lot of the work by hand.
If possible, I would greatly prefer doing this without coding in VBA (small code in cells is okay of course) since I do not know it at all.
Edit: amended so as to return a symmetrical array, as requested by the OP.
=LET(λ,RANDARRAY(ξ,ξ),IF(1-MUNIT(ξ),GESTEP(MOD(MMULT(λ,TRANSPOSE(λ)),1),1-ζ),0))
where ξ is the side length of the returned square array and ζ is an approximation of the probability that a non-diagonal entry within that array is unity.
As ξ increases, so does the accuracy of ζ.
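Outside Excel, the same object -- a symmetric 0/1 matrix with a zero diagonal and a tunable probability of 1's -- is a few lines in, say, Python (a sketch of the idea, not a translation of the formula; n and p play the roles of ξ and ζ above):

```python
import random

def random_symmetric_binary(n, p, seed=None):
    """Symmetric n x n 0/1 matrix with zero diagonal; each
    off-diagonal pair (i, j)/(j, i) is 1 with probability p."""
    rng = random.Random(seed)
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Draw once per unordered pair, mirror across the diagonal.
            m[i][j] = m[j][i] = int(rng.random() < p)
    return m

m = random_symmetric_binary(14, 0.3, seed=1)
```

Because each unordered pair is drawn exactly once, symmetry is exact rather than approximate, and p is the true per-entry probability.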
My data set looks like this:
P T O
1 1 0
1 0 1
1 1 1
0 1 0
1 1 0
My doubt is that we only have two values, i.e. zero and one. That would logically mean that correlation cannot compute the level of significance. My assumption is that in this case the coefficient would be calculated based on the occurrence of the four combinations, i.e. {(1,1),(1,0),(0,1),(0,0)}, rather than on the magnitude of change in the variables. But conversely, if the coefficient works only on magnitude of change, then is this the right method for my data set?
Could anyone tell me if I am on the right track, or whether calculating such a coefficient yields no significance?
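For what it's worth, for two binary variables Pearson's r reduces to the phi coefficient, which can be computed directly from the four combination counts the question mentions. A sketch in Python (the helper name is mine; P and T are the first two columns of the data above):

```python
import math

def phi_coefficient(x, y):
    """Pearson correlation of two binary vectors, computed from the
    2x2 contingency counts n11, n10, n01, n00 (the phi coefficient)."""
    n11 = sum(a and b for a, b in zip(x, y))
    n10 = sum(a and not b for a, b in zip(x, y))
    n01 = sum((not a) and b for a, b in zip(x, y))
    n00 = sum((not a) and (not b) for a, b in zip(x, y))
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom else float("nan")

P = [1, 1, 1, 0, 1]
T = [1, 0, 1, 1, 1]
```

phi_coefficient(P, T) here equals -0.25, exactly what Pearson's formula gives on the same 0/1 vectors -- so the "combination counts" view and the "magnitude of change" view coincide for binary data.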
I have an asymmetric matrix.
A B C D
A 0 0 1 0
B 1 0 0 1
C 0 0 0 0
D 1 1 1 0
I'm trying to switch rows and columns to make it into triangular form.
Like:
C A D B
C 0 1 1 0
A 0 0 1 1
D 0 0 0 1
B 0 0 1 0
Someone shared some VBA code used in Microsoft Excel. According to a note in that code, I found a paper ("Algorithm 529: Permutations to Block Triangular Form"), published in 1978 and implemented in Fortran. I also found a paper (Implementation of Tarjan's Algorithm for the Block Triangularization of a Matrix) which might describe the concept.
I looked at NumPy but didn't find such a function. I'm wondering whether there is a ready-made routine in some package to complete this process. Thanks a lot.
BTW, I don't see a triangular matrix in your example.
The problem is NP-complete, so you can simply generate permutations of rows and columns until a triangular matrix is reached (or try to implement the algorithm from the linked article).
Article name, for the record:
Obtaining a Triangular Matrix by Independent Row-Column Permutations
Guillaume Fertin, Irena Rusu, Stéphane Vialette
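A minimal brute-force version of that suggestion in Python (a sketch -- fine for a 4x4 example, hopeless for large matrices given the NP-completeness noted above):

```python
from itertools import permutations

def triangularize(m):
    """Brute-force search for independent row/column permutations
    that make the 0/1 matrix m upper triangular (zeros strictly
    below the diagonal). Exponential: only viable for small n."""
    n = len(m)
    for rows in permutations(range(n)):
        for cols in permutations(range(n)):
            p = [[m[r][c] for c in cols] for r in rows]
            if all(p[i][j] == 0 for i in range(n) for j in range(i)):
                return rows, cols, p
    return None  # no triangularizing permutation pair exists

m = [[0, 0, 1, 0],
     [1, 0, 0, 1],
     [0, 0, 0, 0],
     [1, 1, 1, 0]]
result = triangularize(m)
```

For the 4x4 matrix from the question a solution does exist (note that, unlike the example in the question, rows and columns may end up in different orders, which is what the linked article allows).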
In Mozilla's docs for feColorMatrix it is stated that:
The SVG filter element changes colors based on a
transformation matrix. Every pixel's color value (represented by an
[R,G,B,A] vector) is matrix multiplied to create a new color.
However in feColorMatrix there are 5 columns, not 4.
In an excellent article that can be considered as a classical reference it is stated that:
The matrix here is actually calculating a final RGBA value in its
rows, giving each RGBA channel its own RGBA channel. The last number
is a multiplier.
But that does not explain a lot. As far as I understand, since applying the filter modifies exactly the R, G, B and A channels and nothing else, there's no need for this additional parameter. There's indirect evidence for that in the article itself: all of the numerous feColorMatrix-based filter examples provided have zeroes as the fifth component. Also, why is it a multiplier?
In another famous article it is stated that:
For the other rows, you are creating each of the rgba output values as
the sum of the rgba input values multiplied by the corresponding
matrix value, plus a constant.
Calling it a constant that gets added makes more sense than calling it a multiplier; however, it's still unclear what the fifth component in feColorMatrix stands for and what is unachievable without it -- so that would be my question.
My last hope was the w3c reference but it's surprisingly vague as well.
The specification is clear although you do need to understand matrix math. The fifth column is a fixed offset. It's useful because if you want to add a specific amount of R/G/B/A to your output, that column is the only way to do it. Or if you want to recolor something to a specific color, that's also the way to do it.
For example - if you have multiple opaque colors in your input, but you want to recolor everything to rgba(255,51,0,1) then this is the matrix you would use.
0 0 0 0 1.0
0 0 0 0 0.2
0 0 0 0 0
0 0 0 1 0
aka
<feColorMatrix type="matrix" values="0 0 0 0 1 0 0 0 0 0.2 0 0 0 0 0 0 0 0 1 0"/>
Try out these sliders for yourself:
https://codepen.io/mullany/pen/qJCDk
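To make the arithmetic concrete, here is a small Python sketch (my own illustration, not the browser's implementation) of how a 4x5 feColorMatrix maps one pixel: each output channel is the dot product of the input RGBA with the row's first four entries, plus the fifth-column offset:

```python
def apply_color_matrix(values, rgba):
    """Apply an feColorMatrix-style 4x5 matrix (20 values, row-major)
    to one [R, G, B, A] pixel with channels in 0..1. Each output
    channel = weighted sum of the four input channels + offset."""
    out = []
    for row in range(4):
        m = values[row * 5:(row + 1) * 5]
        out.append(sum(c * v for c, v in zip(m[:4], rgba)) + m[4])
    return out

# The recoloring matrix from the answer above: zero weights, so the
# input color is ignored and every channel comes from the offsets
# (1.0, 0.2, 0 = rgba(255, 51, 0)); the alpha row passes alpha through.
recolor = [0, 0, 0, 0, 1,
           0, 0, 0, 0, 0.2,
           0, 0, 0, 0, 0,
           0, 0, 0, 1, 0]
```

Without the fifth column every output channel would be forced to 0 whenever all its weights are 0, so a flat recolor like this is exactly the case that is unachievable without it.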
I have distance vectors from a sample program and am trying to quantify how similar they are. I was using Euclidean distance between sample groups (each value belongs to a bucket; we compare bucket by bucket), which works fine. But too many comparisons need to be done for a large number of samples.
I was wondering if there is an efficient way to build an index to compare the samples. The samples look like this:
Sample:1 = {25 0 17 3 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
Sample:2 = {25 1 16 2 6 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
Sample:3 = {25 3 16 2 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0}
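One standard way to avoid all-pairs comparison is a metric index. As a minimal pure-Python illustration (names are mine): precomputing each sample's distance to a single pivot lets the triangle inequality rule out candidates before any exact Euclidean distance is computed; production libraries build k-d trees or ball trees on the same principle:

```python
import math

def euclid(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

class PivotIndex:
    """Toy metric index: store each sample's distance to one pivot.
    The triangle inequality |d(pivot,s) - d(pivot,q)| <= d(s,q)
    lets us skip samples that cannot lie within `radius` of q."""
    def __init__(self, samples):
        self.samples = samples
        self.pivot = samples[0]
        self.pivot_dist = [euclid(self.pivot, s) for s in samples]

    def within(self, query, radius):
        dq = euclid(self.pivot, query)
        hits = []
        for s, dp in zip(self.samples, self.pivot_dist):
            if abs(dp - dq) <= radius:          # cheap pruning test
                if euclid(s, query) <= radius:  # exact check
                    hits.append(s)
        return hits

samples = [
    [25, 0, 17, 3, 5] + [0] * 16,
    [25, 1, 16, 2, 6] + [0] * 16,
    [25, 3, 16, 2, 4] + [0] * 16,
]
idx = PivotIndex(samples)
```

The pruning test is O(1) per sample, so only the survivors pay for a full 21-bucket distance computation.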
There exist many ways to characterise the "difference between two distributions". A specific and targeted answer requires more details concerning e.g. the underlying probability distribution(s).
It all depends on how you define a difference between two distributions. To give you two ideas:
A Kolmogorov-Smirnov test is a non-parametric test, that measures the "distance" between two cumulative/empirical distribution functions.
The Kullback-Leibler divergence measures the "distance" between two distributions in the language of information theory as a change in entropy.
Update [a year later]
Upon revisiting this post it might be important to emphasise a few things:
The standard two-sample Kolmogorov-Smirnov (KS) test assumes the underlying distribution to be continuous. For discrete data (which the data from the original post seems to be), an alternative may be to use a bootstrap version of the two-sample KS test, as in Matching::ks.boot. More details can be found on e.g. Cross Validated: Can I use Kolmogorov-Smirnov to compare two empirical distributions? and on Wikipedia: Two-sample Kolmogorov–Smirnov test.
Provided the sample data in the original post is representative, I don't think there will be a very meaningful answer from either a KS-statistic based test or the KL divergence (or really any other test for that matter). The reason being that the values from every sample are essentially all zero (to be precise, >80% of the values are zero). That in combination with the small sample size of 21 values per sample means that there is really not a lot "left" to characterise any underlying distribution.
In more general terms (and ignoring the limitations pointed out in the previous point), to calculate the KL divergence for all pairwise combinations one could do the following
library(entropy)
library(tidyverse)
# lst is assumed to be a list of numeric count vectors, one per sample
expand.grid(Var1 = seq_along(lst), Var2 = seq_along(lst)) %>%
rowwise() %>%
mutate(KL = KL.empirical(lst[[Var1]], lst[[Var2]]))
Since the KL divergence is not symmetric, we need to calculate both the upper and lower triangular parts of the pairwise KL divergence matrix. To reduce compute time one could instead use a symmetrised KL divergence, which requires filling in only the upper or the lower triangular part of that matrix (the symmetrised versions themselves still require both directions, i.e. KL(1->2) and KL(2->1), but this may be done in one optimised routine).
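For comparison, a minimal Python sketch of the same quantities (the additive smoothing constant eps is my own addition; it is needed because the samples contain zero counts, for which the raw KL divergence is infinite):

```python
import math

def kl_divergence(p_counts, q_counts, eps=1e-9):
    """KL(P||Q) from raw count vectors; a small additive smoothing
    term keeps zero counts from making the divergence infinite."""
    p = [c + eps for c in p_counts]
    q = [c + eps for c in q_counts]
    ps, qs = sum(p), sum(q)
    return sum((pi / ps) * math.log((pi / ps) / (qi / qs))
               for pi, qi in zip(p, q))

def symmetrised_kl(a, b):
    """Jeffreys divergence KL(a||b) + KL(b||a): symmetric in a and b,
    so only one triangle of the pairwise matrix needs computing."""
    return kl_divergence(a, b) + kl_divergence(b, a)

# The first two samples from the original post.
s1 = [25, 0, 17, 3, 5] + [0] * 16
s2 = [25, 1, 16, 2, 6] + [0] * 16
```

As noted above, with >80% zero counts per sample the resulting numbers should be interpreted with caution regardless of the implementation.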