How to understand the process of transforming research data into certain distribution (not statistical distribution)? - statistics

Statistics is not my major and English is not my native language. I tried to apply for data analysis or data science work in industry. However, I do not know how to describe my research process below in a concise and professional way. I highly appreciated if you could provide me such help.
Background: I simulating properties of materials using different research packages, such as LAMMPS. The simulated data are only coordinates of atoms. Below are my data analysis.
step 1: clean the data to make sure the data complete and atom ID is unique and not exchangeable at different time moments (timesteps).
step 2: Calculated the neighbor atoms' distance of each center atom to find the target species (a configuration formed by several target atoms, such as Al-O-H, Si-O-H, Al-O-H2, H3-O)
step 3: count the amount of species as functions of space and/or time and draw the species distribution as functions of space and/or time, lifetime distribution of species.
NOTE: such distribution is different from statistical distribution, such as Normal Distribution, Binomial Distribution.
step 4: Based on above distribution, the correlation between species would be explored and interpreted.
After above steps, I study the mechanism behind based on materials selves and local environment.
Could anyone point out how to understand above steps in statistical terms or data analytic terms or others?
I sincerely appreciate your time and help.

Related

Simple Binary Text Classification

I seek the most effective and simple way to classify 800k+ scholarly articles as either relevant (1) or irrelevant (0) in relation to a defined conceptual space (here: learning as it relates to work).
Data is: title & abstract (mean=1300 characters)
Any approaches may be used or even combined, including supervised machine learning and/or by establishing features that give rise to some threshold values for inclusion, among other.
Approaches could draw on the key terms that describe the conceptual space, though simple frequency count alone is too unreliable. Potential avenues might involve latent semantic analysis, n-grams, ..
Generating training data may be realistic for up to 1% of the corpus, though this already means manually coding 8,000 articles (1=relevant, 0=irrelevant), would that be enough?
Specific ideas and some brief reasoning are much appreciated so I can make an informed decision on how to proceed. Many thanks!
Several Ideas:
Run LDA and get document-topic and topic-word distributions say (20 topics depending on your dataset coverage of different topics). Assign the top r% of the documents with highest relevant topic as relevant and low nr% as non-relevant. Then train a classifier over those labelled documents.
Just use bag of words and retrieve top r nearest negihbours to your query (your conceptual space) as relevant and borrom nr percent as not relevant and train a classifier over them.
If you had the citations you could run label propagation over the network graph by labelling very few papers.
Don't forget to make the title words different from your abstract words by changing the title words to title_word1 so that any classifier can put more weights on them.
Cluster the articles into say 100 clusters and then choose then manually label those clusters. Choose 100 based on the coverage of different topics in your corpus. You can also use hierarchical clustering for this.
If it is the case that the number of relevant documents is way less than non-relevant ones, then the best way to go is to find the nearest neighbours to your conceptual space (e.g. using information retrieval implemented in Lucene). Then you can manually go down in your ranked results until you feel the documents are not relevant anymore.
Most of these methods are Bootstrapping or Weakly Supervised approaches for text classification, about which you can more literature.

Calculate expected color temperature of daylight

I have a location (latitude/longitude) and a timestamp (year/month/day/hour/minute).
Assuming clear skies, is there an algorithm to loosely estimate the color temperature of sunlight at that time and place?
If I know what the weather was at that time, is there a suggested way to modify the color temperature for the amount of cloud cover at that time?
I suggest taking a look at this paper which has nice practical implementation for CG applications:
A Practical Analytic Model for Daylight A. J. Preetham Peter Shirley Brian Smits
Abstract
Sunlight and skylight are rarely rendered correctly in computer
graphics. A major reason for this is high computational expense.
Another is that precise atmospheric data is rarely available. We
present an inexpensive analytic model that approximates full spectrum
daylight for various atmospheric conditions. These conditions are
parameterized using terms that users can either measure or estimate.
We also present an inexpensive analytic model that approximates the
effects of atmosphere (aerial perspective). These models are fielded
in a number of conditions and intermediate results verified against
standard literature from atmospheric science. Our goal is to achieve
as much accuracy as possible without sacrificing usability.
Both compressed postscript and pdf files of the paper are available.
Example code is available.
Color images from the paper are shown below.
Link only answers are discouraged but I can not post neither sufficient portion of the article nor any complete C++ code snippet here as both are way too big. Following the link you can find both right now.

Interpolation technique for weirdly spaced point data

I have a spatial dataset that consists of a large number of point measurements (n=10^4) that were taken along regular grid lines (500m x 500m) and some arbitrary lines and blocks in between. Single measurements taken with a spacing of about 0.3-1.0m (varying) along these lines (see example showing every 10th point).
The data can be assumed to be normally distributed but shows a strong small-scale variability in some regions. And there is some trend with elevation (r=0.5) that can easily be removed.
Regardless of the coding platform, I'm looking for a good or "the optimal" way to interpolate these points to a regular 25 x 25m grid over the entire area of interest (5000 x 7000m). I know about the wide range of kriging techniques but I wondered if somebody has a specific idea on how to handle the "oversampling along lines" with rather large gaps between the lines.
Thank you for any advice!
Leo
Kriging technique does not perform well when the points to interpolate are taken on a regular grid, because it is necessary to have a wide range of different inter-points distances in order to well estimate the covariance model.
Your case is a bit particular... The oversampling over the lines is not a problem at all. The main problem is the big holes you have in your grid. If think that these holes will create problems whatever the interpolation technique you use.
However it is difficult to predict a priori if kriging will behave well. I advise you to try it anyway.
Kriging is only suited for interpolating. You cannot extrapolate with kriging metamodel, so that you won't be able to predict values in the bottom left part of your figure for example (because you have no point here).
To perform kriging, I advise you to use the following tools (depending the languages you're more familiar with):
DiceKriging package in R (the one I use preferably)
fields package in R (which is more specialized on spatial fields)
DACE toolbox in matlab
Bonus: a link to a reference book about kriging which is available online: http://www.gaussianprocess.org/
PS: This type of question is more statistics oriented than programming and may be better suited to the stats.stackexchange.com website.

Data mining for significant variables (numerical): Where to start?

I have a trading strategy on the foreign exchange market that I am attempting to improve upon.
I have a huge table (100k+ rows) that represent every possible trade in the market, the type of trade (buy or sell), the profit/loss after that trade closed, and 10 or so additional variables that represent various market measurements at the time of trade-opening.
I am trying to find out if any of these 10 variables are significantly related to the profits/losses.
For example, imagine that variable X ranges from 50 to -50.
The average value of X for a buy order is 25, and for a sell order is -25.
If most profitable buy orders have a value of X > 25, and most profitable sell orders have a value of X < -25 then I would consider the relationship of X-to-profit as significant.
I would like a good starting point for this. I have installed RapidMiner 5 in case someone can give me a specific recommendation for that.
A Decision Tree is perhaps the best place to begin.
The tree itself is a visual summary of feature importance ranking (or significant variables as phrased in the OP).
gives you a visual representation of the entire
classification/regression analysis (in the form of a binary tree),
which distinguishes it from any other analytical/statistical
technique that i am aware of;
decision tree algorithms require very little pre-processing on your data, no normalization, no rescaling, no conversion of discrete variables into integers (eg, Male/Female => 0/1); they can accept both categorical (discrete) and continuous variables, and many implementations can handle incomplete data (values missing from some of the rows in your data matrix); and
again, the tree itself is a visual summary of feature importance ranking
(ie, significant variables)--the most significant variable is the
root node, and is more significant than the two child nodes, which in
turn are more significant than their four combined children. "significance" here means the percent of variance explained (with respect to some response variable, aka 'target variable' or the thing
you are trying to predict). One proviso: from a visual inspection of
a decision tree you cannot distinguish variable significance from
among nodes of the same rank.
If you haven't used them before, here's how Decision Trees work: the algorithm will go through every variable (column) in your data and every value for each variable and split your data into two sub-sets based on each of those values. Which of these splits is actually chosen by the algorithm--i.e., what is the splitting criterion? The particular variable/value combination that "purifies" the data the most (i.e., maximizes the information gain) is chosen to split the data (that variable/value combination is usually indicated as the node's label). This simple heuristic is just performed recursively until the remaining data sub-sets are pure or further splitting doesn't increase the information gain.
What does this tell you about the "importance" of the variables in your data set? Well importance is indicated by proximity to the root node--i.e., hierarchical level or rank.
One suggestion: decision trees handle both categorical and discrete data usually without problem; however, in my experience, decision tree algorithms always perform better if the response variable (the variable you are trying to predict using all other variables) is discrete/categorical rather than continuous. It looks like yours is probably continuous, in which case in would consider discretizing it (unless doing so just causes the entire analysis to be meaningless). To do this, just bin your response variable values using parameters (bin size, bin number, and bin edges) meaningful w/r/t your problem domain--e.g., if your r/v is comprised of 'continuous values' from 1 to 100, you might sensibly bin them into 5 bins, 0-20, 21-40, 41-60, and so on.
For instance, from your Question, suppose one variable in your data is X and it has 5 values (10, 20, 25, 50, 100); suppose also that splitting your data on this variable with the third value (25) results in two nearly pure subsets--one low-value and one high-value. As long as this purity were higher than for the sub-sets obtained from splitting on the other values, the data would be split on that variable/value pair.
RapidMiner does indeed have a decision tree implementation, and it seems there are quite a few tutorials available on the Web (e.g., from YouTube, here and here). (Note, I have not used the decision tree module in R/M, nor have i used RapidMiner at all.)
The other set of techniques i would consider is usually grouped under the rubric Dimension Reduction. Feature Extraction and Feature Selection are two perhaps the most common terms after D/R. The most widely used is PCA, or principal-component analysis, which is based on an eigen-vector decomposition of the covariance matrix (derived from to your data matrix).
One direct result from this eigen-vector decomp is the fraction of variability in the data accounted for by each eigenvector. Just from this result, you can determine how many dimensions are required to explain, e.g., 95% of the variability in your data
If RapidMiner has PCA or another functionally similar dimension reduction technique, it's not obvious where to find it. I do know that RapidMiner has an R Extension, which of course let's you access R inside RapidMiner.R has plenty of PCA libraries (Packages). The ones i mention below are all available on CRAN, which means any of the PCA Packages there satisfy the minimum Package requirements for documentation and vignettes (code examples). I can recommend pcaPP (Robust PCA by Projection Pursuit).
In addition, i can recommend two excellent step-by-step tutorials on PCA. The first is from the NIST Engineering Statistics Handbook. The second is a tutorial for Independent Component Analysis (ICA) rather than PCA, but i mentioned it here because it's an excellent tutorial and the two techniques are used for the similar purposes.

What are "Factor Graphs" and what are they useful for?

A friend is using Factor Graphs to do text mining (identifying references to people in text), and it got me interested in this tool, but I'm having a hard time finding an intuitive explanation of what Factor Graphs are and how to use them.
Can anyone provide an explanation of Factor Graphs that isn't math heavy, and which focusses on practical applications rather than abstract theory?
They are used extensively for breaking down a problem into pieces. One very interesting application of factor graphs (and message passing on them) is the XBox Live TrueSkill algorithm. I wrote extensively about it on my blog where I tried to go for an introductory explanation rather than an overly academic one.
A factor graph is the graphical representation of the dependencies between variables and factors (parts of a formula) that are present in a particular kind of formula.
Suppose you have a function f(x_1,x_2,...,x_n) and you want to compute the marginalization of this function for some argument x_i, thus summing over all assignments to the remaining formula. Further f can be broken into factors, e.g.
f(x_1,x_2,...,x_n)=f_1(x_1,x_2)f_2(x_5,x_8,x_9)...f_k(x_1,x_10,x_11)
Then in order to compute the marginalization of f for some of the variables you can use a special algorithm called sum product (or message passing), that breaks the problem into smaller computations. For this algortithm, it is very important which variables appear as arguments to which factor. This information is captured by the factor graph.
A factor graph is a bipartite graph with both factor nodes and variable nodes. And there is an edge between a factor and a variable node if the variable appears as an argument of the factor. In our example there would be an edge between the factor f_2 and the variable x_5 but not between f_2 and x_1.
There is a great article: Factor graphs and the sum-product algorithm.
Factor graph is math model, and can be explained only with math equations. In nutshell it is way to explain complex relations between interest variables in your model. Example: A is temperature, B is pressure, components C,D,E are depends on B,A in some way, and component K is depends on B,A. And you want to predict value K based on A and B. So you know only visible states. Basic ML libraries don't allow to model such structure. Neural network do it better. And Factor Graph is exactly solve that problem.
Factor graph is an example of deep learning. When it is impossible to present model with features and output, Factor models allow to build hidden states, layers and complex structure of variables to fit real world behavior. Examples are Machine translation alignment, fingerprint recognition, co-reference etc.

Resources