Untrained sentiment analysis, need help with capturing sentiment variation statistically [closed] - statistics

Closed 10 years ago.
The question may be vague, but I will try to word it as well as possible.
I came up with a crude algorithm that scores whether a sentence (part of a review snippet) is positive, negative, or neutral (call this the sentence's EQ). For 5 sentences I have per-sentence scores on [-100, 100], and the review as a whole has to be rated on a [0, 5] scale:
(0, 39.88)
(1, 73.07)
(2, 69.65)
(3, 51.43)
(4, 76.74)
The choice that I am struggling with is what method should I choose to now compute the overall rating for the review snippet.
I researched a little and tried two options
1) 50th percentile (median): for the data above I got about 70, which maps to roughly 4.2 on the 0-5 scale. The results are good, but the sad part is that the percentile doesn't capture how the EQ varied from one sentence to the next in the snippet (it works on sorted data, so the variation is lost).
2) Lagrange polynomial: here the value came out close to 69. The problem with this approach is that I evaluate it at the middle of the X-range (in this case 2), so it also fails to capture the variation in sentence EQ (the end points barely matter; it mostly returns a mid-range value).
Any ideas on a method that captures the EQ variation across the snippet and still gives a single value I can use as the overall sentiment?
Perhaps something like the trendline Excel draws could be used?
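Not part of the original question, but a minimal sketch of the median-and-rescale approach on the data above, plus the per-sentence spread that the median discards (the mean/std alternative is just an illustration, not a recommendation from the thread):
```python
import numpy as np

# Per-sentence EQ scores from the question, on [-100, 100]
eq = np.array([39.88, 73.07, 69.65, 51.43, 76.74])

def to_review_scale(x):
    """Map a value from [-100, 100] onto the [0, 5] review scale."""
    return (x + 100.0) / 200.0 * 5.0

median_eq = np.median(eq)                 # ~69.65, close to the 70 in the question
print("median rating:", to_review_scale(median_eq))   # ~4.24

# One hedged alternative: keep a central tendency but also report the spread,
# so the sentence-to-sentence variation is not silently thrown away.
print("mean rating:", to_review_scale(eq.mean()))
print("per-sentence spread (std, EQ units):", eq.std(ddof=1))
```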

If you are interested in untrained/unsupervised sentiment analysis, read this classic paper by Peter Turney which uses an unsupervised approach achieving an accuracy of around 75% - http://nparc.cisti-icist.nrc-cnrc.gc.ca/npsi/ctrl?action=rtdoc&an=8914166
Sentiment Analysis is fun!

Related

how to explain this decision tree interpretability question?

[Figures 5 and 6: the two decision trees being compared.]
The two pictures above show the two decision trees.
Question is: It is often claimed that a strength of decision trees is their interpretability.
Is this always justified? Refer to Figures 5 and 6 to help with your answer.
I think the point of the question is that a decision tree is interpretable only if its depth is relatively small. The second tree is very deep, i.e. for a single prediction you have to process a large number of splitting decisions. You therefore lose interpretability, because the explanation for any prediction is an intersection of too many conditions for a human user to follow.
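Not part of the original answer, but a small sketch of the point using scikit-learn: the deeper the tree, the more conditions a human has to read to follow a single prediction (the dataset and depths are arbitrary choices for illustration).
```python
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for max_depth in (3, None):  # shallow tree vs. fully grown tree
    tree = DecisionTreeClassifier(max_depth=max_depth, random_state=0).fit(X, y)
    # Nodes visited to classify the first sample = number of conditions a
    # human would have to read to follow that one prediction.
    path = tree.decision_path(X[:1])
    print(f"max_depth={max_depth}: depth={tree.get_depth()}, "
          f"conditions for one prediction={path.nnz - 1}")
```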

interpretation of SVD for text mining topic analysis

Background
I'm learning about text mining by building my own text mining toolkit from scratch - the best way to learn!
SVD
The Singular Value Decomposition is often cited as a good way to:
Visualise high dimensional data (word-document matrix) in 2d/3d
Extract key topics by reducing dimensions
I've spent about a month learning about the SVD. I must admit much of the online tutorials, papers, university lecture slides, and even proper printed textbooks are not that easy to digest.
Here's my understanding so far: SVD demystified (blog)
I think I have understood the following:
Any (real) matrix can be decomposed into 3 multiplied matrices using SVD, A = U⋅S⋅V^T (unique up to the usual sign and ordering conventions)
S is a diagonal matrix of singular values, in descending order of magnitude
U and V^T are matrices of orthonormal vectors
I understand that we can reduce the dimensions by filtering out less significant information, i.e. zeroing the smaller elements of S and reconstructing the data. If I wanted to reduce to 2 dimensions, I'd keep only the 2 top-left-most elements of the diagonal S to form a new matrix S'
My Problem
To see the documents projected onto the reduced dimension space, I've seen people use S'⋅V^T. Why? What's the interpretation of S'⋅V^T?
Similarly, to see the topics, I've seen people use U⋅S'. Why? What's the interpretation of this?
My limited school maths tells me I should look at these as transformations (rotation, scale) ... but that doesn't help clarify it either.
Update:
I've added an update to my blog explanation at SVD demystified (blog) which reflects the rationale from one of the textbooks I looked at to explain why S'.V^T is a document view, and why U.S' is a word view. Still not really convinced ...
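Not from the original post, but a small numpy sketch of the two projections being asked about, on a toy word-document matrix (the matrix values are made up for illustration):
```python
import numpy as np

# Toy word-document matrix: rows = words, columns = documents.
A = np.array([[2., 0., 1., 0.],
              [1., 1., 0., 0.],
              [0., 2., 0., 1.],
              [0., 0., 1., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                       # keep the 2 largest singular values
S_k = np.diag(s[:k])        # this is the S' from the question
U_k, Vt_k = U[:, :k], Vt[:k, :]

# S' . V^T : each column gives the coordinates of one document in the
# k-dimensional "topic" space (the document view).
doc_coords = S_k @ Vt_k      # shape (k, n_docs)

# U . S' : each row gives the coordinates of one word in the same
# k-dimensional space (the word view).
word_coords = U_k @ S_k      # shape (n_words, k)

print(doc_coords)
print(word_coords)
```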

Geodesic computation on triangle meshes? [closed]

Closed 5 months ago.
I am trying to find the distance between two points on a triangulated surface (geodesic distance). It looks like a basic operation but is not trivial. So I am wondering if there are any libraries that do this? My google-fu failed, so I would greatly appreciate any pointers.
(I am aware of CGAL, scipy.spatial, but I couldn't find anything in the docs, let me know if I missed something there)
There are many implementations for computing geodesic distance on a triangle mesh. Some are approximate and some are exact.
1. Fast Marching method. This method is approximate; in practice the average error is below 1%. You can refer to Gabriel Peyre's implementation of the fast marching method in Matlab:
http://www.mathworks.com/matlabcentral/fileexchange/6110-toolbox-fast-marching
2. MMP method, proposed by [1] and implemented in [2]. This method is exact and the code is at https://code.google.com/p/geodesic/ (the same one mentioned in the comment by Ante). A disadvantage is that when the mesh is large, the MMP method consumes a lot of memory, O(n^2), where n is the number of vertices.
3. CH method, proposed in [3] and improved and implemented in [4]. This method is exact and consumes less memory than the MMP method. The code is at https://sites.google.com/site/xinshiqing/knowledge-share
4. Heat method, proposed in [5]. One implementation is at https://github.com/dgpdec/course
This method is approximate and requires a preprocessing step. It is faster than the Fast Marching method.
[1] Joseph S. B. Mitchell, David M. Mount, and Christos H. Papadimitriou. 1987. The discrete geodesic problem. SIAM J. Comput. 16, 4 (August 1987), 647-668.
[2] Vitaly Surazhsky, Tatiana Surazhsky, Danil Kirsanov, Steven Gortler, Hugues Hoppe. Fast exact and approximate geodesics on meshes. ACM Trans. Graphics (SIGGRAPH), 24(3), 2005.
[3] Chen, J. and Han, Y. 1990. Shortest paths on a polyhedron. In SCG '90: Proceedings of the Sixth Annual Symposium on Computational Geometry. ACM Press, New York, NY, USA, 360-369.
[4] Shi-Qing Xin and Guo-Jin Wang. 2009. Improving Chen and Han's algorithm on the discrete geodesic problem. ACM Trans. Graph. 28, 4, Article 104 (September 2009), 8 pages.
[5] Crane K., Weischedel C., Wardetzky M. Geodesics in heat: a new approach to computing distance based on heat flow. ACM Transactions on Graphics (TOG), 2013, 32(5): 152.
Just to add to the previous answer by wxnfifth: the Fast Marching method can be used not on its own, but as a first step to obtain a good approximation of the geodesic path, which is then iteratively improved as follows:
1. Compose the strip of triangles containing the existing approximation of the path.
2. Find the shortest path within the strip, a task that can be solved exactly in linear time, for example by the Shortest Paths in Polygons method by Wolfgang Mulzer.
3. If that path passes through a vertex on the boundary of the triangle strip, then the path along the other side of the vertex is considered; if it is shorter, the strip is updated and the algorithm is restarted from step 2.
As for libraries where this is implemented, one can consider the open-source MeshLib, specifically the function computeSurfacePath. There is even a short video showing it at work on a sample mesh.
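None of the answers above include code. As a rough baseline, and not one of the exact methods listed, here is a sketch that approximates geodesic distance by running Dijkstra over the mesh edge graph with scipy; it overestimates the true geodesic distance because paths are restricted to mesh edges:
```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def edge_graph_distance(vertices, faces, source):
    """Approximate geodesic distances from `source` to all vertices by
    shortest paths along mesh edges (an upper bound on the true geodesic)."""
    vertices = np.asarray(vertices, dtype=float)
    faces = np.asarray(faces, dtype=int)

    # Unique undirected edges of the triangulation.
    edges = np.sort(np.vstack([faces[:, [0, 1]],
                               faces[:, [1, 2]],
                               faces[:, [2, 0]]]), axis=1)
    edges = np.unique(edges, axis=0)
    weights = np.linalg.norm(vertices[edges[:, 0]] - vertices[edges[:, 1]], axis=1)

    n = len(vertices)
    graph = coo_matrix((weights, (edges[:, 0], edges[:, 1])), shape=(n, n))
    # directed=False lets each edge be traversed in both directions.
    return dijkstra(graph, directed=False, indices=source)

# Tiny example: two triangles forming a unit square.
V = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0]]
F = [[0, 1, 2], [0, 2, 3]]
print(edge_graph_distance(V, F, source=0))  # distances from vertex 0
```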

How to know one system is significantly better than another one?

I am studying lexical semantics. I have 65 pairs of synonyms with their sense relatedness. The dataset is derived from the paper:
Rubenstein, Herbert, and John B. Goodenough. "Contextual correlates of synonymy." Communications of the ACM 8.10 (1965): 627-633.
I extract sentences containing those synonyms, turn the neighbouring words appearing in those sentences into vectors, calculate the cosine distance between the different vectors, and finally compute the Pearson correlation between the distances I calculate and the sense relatedness given by Rubenstein and Goodenough.
Suppose I get a Pearson correlation of 0.79 for Method 1 and 0.78 for Method 2. How do I test whether Method 1 is significantly better than Method 2?
Well, strictly speaking this is not a programming question, but since it is unanswered on the other Stack Exchange sites, I'll describe the approach I would take.
There are other benchmarks for similar tasks that you can run your approaches on. You can check how your method performs on those benchmarks and analyze the results. Some methods capture similarity more, others relatedness, and some both.
Here is the link to the WordVec Demo, which automatically scores your vectors and gives you the results.
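The answer above does not address the significance question directly. One common approach (my suggestion, not something the answer proposes) is a paired bootstrap over the 65 word pairs, comparing the two correlations on each resample; a minimal sketch, assuming the per-pair similarity scores from each method are available as arrays:
```python
import numpy as np
from scipy.stats import pearsonr

def paired_bootstrap_corr_diff(gold, scores1, scores2, n_boot=10000, seed=0):
    """Bootstrap the difference in Pearson correlation with the gold
    relatedness scores between two methods evaluated on the same pairs."""
    gold, scores1, scores2 = map(np.asarray, (gold, scores1, scores2))
    rng = np.random.default_rng(seed)
    n = len(gold)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)          # resample the 65 pairs with replacement
        r1 = pearsonr(gold[idx], scores1[idx])[0]
        r2 = pearsonr(gold[idx], scores2[idx])[0]
        diffs[b] = r1 - r2
    # Two-sided 95% confidence interval for the difference r1 - r2;
    # if it excludes 0, the difference is significant at roughly the 5% level.
    return np.percentile(diffs, [2.5, 97.5])

# Usage (hypothetical arrays of length 65):
# lo, hi = paired_bootstrap_corr_diff(gold_relatedness, method1_sims, method2_sims)
```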

Efficient Random Texture Sampling in OpenGL ES 2.0 [closed]

Closed 11 years ago.
Is there any efficient way to fetch texture data in a random way? That is, I'd like to use a texture as a look-up table and I need random access to its elements. Therefore I'd be sampling it in a random fashion. Is it a completely lost cause?
Random access is a basic feature of GLSL. E.g.
vec2 someLocation = ... whatever you like ...;
vec4 sampledColour = texture2D(sampler, someLocation);
Depending on your hardware, it may cost more to read a texture if you've calculated the sample locations directly in the pixel shader rather than out in the vertex shader and allowed them to be interpolated automatically as a varying, but that's just an immutable hardware cost relating to the decreased predictability of what you're doing.
You could always pass another texture containing random values to the shader and sample from that. That will give you the same random value for each texture coordinate, but if you don't want that you can multiply the coordinate by a uniform seed that you update each frame.
