I'm designing a programming language, purely for fun, and I want to add as many experimental features as I can, just to make programming in it something completely different, though not in a bad way like Brainf*ck or Malbolge.
However, I seem to be quite bad at coming up with new ideas myself, and I'm sure there are tons of things out there that have been talked about but never really tried.
What experimental language features or concepts exist at the moment that are not implemented in mainstream languages?
E.g. if I had asked this in, say, 1960, an answer could have been "object-oriented programming".
I'm sure there are a lot of unimplemented ideas that computer scientists have (recently) come up with; at least, so I've been told.
One current area of research is dependent types. There are still a lot of things you can do with those that haven't been done yet.
DWIMNWIS (Do What I Mean, Not What I Say).
More seriously, an absolutely great feature (that would be very hard to impossible to implement) would be the ability to prove that a program contains no concurrency issues.
As far as MUST-HAVE features go, I'd say lambdas and first-class functions. Not exactly new or theoretical (consider how old Lisp is this year), but powerful and missing from many languages.
Non-sequential evaluation (an idea coming from dataflow programming), where expressions are evaluated as soon as their dependencies are fulfilled, so that:
print(msg)
msg = "Hello World"
is a valid program. All variables would be akin to cells in a spreadsheet. Iteration could then look like this:
print i
range(1..100) => i
# prints 1 to 100
It would be interesting to study the implications of the change-propagation property of such a paradigm. However, designing such a language is a heavy challenge: it starts to become messy when you think about conditionals, iteration, etc., and the synchronization issues that may appear.
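A minimal sketch of that spreadsheet-like idea in Python (all the names here are my own, and conditionals/iteration are deliberately ignored): each cell re-evaluates and propagates as soon as its dependencies have values, so declaration order stops mattering.

# Toy dataflow cells: a cell's formula runs once all of its
# dependencies have received values, regardless of statement order.
class Cell:
    def __init__(self, formula=None, deps=()):
        self.formula = formula        # callable computing this cell's value
        self.deps = list(deps)        # cells this one depends on
        self.dependents = []          # cells that depend on this one
        self.value = None
        for d in self.deps:
            d.dependents.append(self)

    def set(self, value):
        self.value = value
        self._propagate()

    def _propagate(self):
        for cell in self.dependents:
            # toy rule: a cell fires once every dependency has a value
            if all(d.value is not None for d in cell.deps):
                cell.value = cell.formula(*(d.value for d in cell.deps))
                cell._propagate()

# The out-of-order "print(msg); msg = ..." program from above:
msg = Cell()
out = Cell(formula=lambda m: print(m), deps=[msg])
msg.set("Hello World")   # only now does the print fire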
Read ACM and IEEE publications for research ideas.
Here's an idea, and if anyone implements it they must give me copyright credit! Automatic, built-in, arbitrary matrix templates, where these matrices are not the mathematical type but more of a storage or structure type. Depending on the template parameters, such a matrix could be as simple as a scalar value or as complex as a near-infinite-dimensional entity; in theory the computer could permit anything short of running out of memory, but in practice it would be limited by the architecture, the OS, and the compiler's intrinsic size of its integer or integral type.
So one could have a low-dimensional volumetric matrix that is only 3-dimensional yet runs out of memory, because it might look like Matrix<type, 1000000000,1000000000,1000000000> matOutOfBounds, long before a much higher-dimensional matrix such as Matrix<type, 2,2,2,2,2,2,2,2> (an 8D Volumetric Matrix) would.
Simplification can be done for "perfect matrices": those where every dimension has exactly the same number of elements, regardless of how many dimensions there are. For example, <3,3>, <3,3,3>, and <3,3,3,3> are all perfect matrices. The simplification would be that Matrix<type, 4^4> is the same as Matrix<type, 4,4,4,4>, giving a 4x4x4x4 4D Volumetric Matrix with 256 elements in a 4D structure. Likewise, Matrix<float, 360, 60, 60> would be a 3D Volumetric Matrix with many elements, structured the way our clocks and compasses operate (360 degrees to a full circle, 60 minutes, 60 seconds), except that there are that many storage elements of floats.
What follows currently looks like a possible C++ library that someone would include in their projects, but the idea here is making it a built-in language type. Then anyone with your language and compiler could use these at will, with any number of dimensions, as this template portrays:
// Header declaration
template<typename ClassType, unsigned int... Dims>
class matrix {
}; // No need to show the body; just the declaration, for the concept

// User code would be:
matrix<float, 2,3,4,5,7> mat; // creates a 2x3x4x5x7 matrix: a 5-dimensional Volumetric Matrix

// Default type
matrix<int> mat2; // creates a 1x1 matrix that would in essence be a scalar
Now, what I've shown is current C++ syntax for variadic templates. The idea here is that these kinds of matrix containers would be built-in types!
Want to make them mathematical? Sure, that's fine, but the user would have to define their own algorithms, methods, functions, or routines to do so.
The reason those would have to be defined independently is this:
mat<float, 3,3,3> mat1; // 3x3x3 3D Volumetric Matrix - 27 elements
mat<float, 5,5> mat2;   // 5x5 2D Linear-Quadratic (Area) Matrix - 25 elements
mat<int, 6,7,8> mat3;   // 6x7x8 3D Volumetric Matrix - 336 elements
mat<bool, 8> mat4;      // 1x8 1D Linear Matrix (Array); transpose?
mat4::transpose; // built in - now an 8x1 matrix
mat4::transpose; // back to 1x8
class TheMotherLoad { /* many members and methods */ };
// ...
mat<TheMotherLoad*, 9,9,9,9,9,9,9,9,9> mat9;
// simplified version
mat<TheMotherLoad*, 9^9> mat9;
// A 9-dimensional Volumetric container. The first three 9s form a cube
// with sides 9 cells long; the volume 9^3 is the first order of what a
// Volumetric Matrix is. Anything less is linear or quadratic, whether a
// scalar, translation, array, line, point, vector, rotation, or area.
// Now that we have a cube of 729 elements, the next three 9s (the 4th,
// 5th & 6th dimensions) act as another 9x9x9 matrix surrounding the
// first 3 dimensions. Finally, the 7th, 8th & 9th dimensions define the
// "outer matrix", which also has 9x9x9 elements. So in total the number
// of elements in this matrix is 729^3.
The properties of a given matrix determine which types of math operations can be applied to it, so this has to be done externally.
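To pin down the storage idea, here is a small, language-agnostic sketch in Python (the function name and the row-major layout are my assumptions, not part of the proposal) of how an N-dimensional container like matrix<float,2,3,4,5,7> could map an index tuple onto flat storage:

from functools import reduce
from operator import mul

# Row-major flattening: index (i0, i1, ..., in) into a matrix with
# dimensions (d0, d1, ..., dn) maps to a single offset in flat storage.
def flat_offset(index, dims):
    offset = 0
    for i, d in zip(index, dims):
        offset = offset * d + i
    return offset

dims = (2, 3, 4, 5, 7)                 # the matrix<float,2,3,4,5,7> example
total = reduce(mul, dims, 1)           # 840 elements of flat storage
storage = [0.0] * total
storage[flat_offset((1, 2, 3, 4, 6), dims)] = 1.0   # last cell -> offset 839

Whatever math is later defined on top of it, this flattening is all the "storage or structure type" itself needs to provide.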
Most recent object detection methods rely on a convolutional neural network. They create a feature map by running input data through a feature extraction step, then add more convolutional layers to output a set of values like so (this set is from YOLO, but other architectures like SSD differ slightly):
pobj: probability of being an object
c1, c2 ... cn: indicating which class the object belongs to
x, y, w, h: bounding box of the object
However, one particular box cannot be multiple objects. That is, wouldn't having a high value for, say, c1 mean that the values for all the others, c2 ... cn, would be low? So why use different values for c1, c2 ... cn? Couldn't they all be represented by a single value in, say, [0,1], where each class has its own sub-range: 0-0.2 is c1, 0.2-0.4 is c2, and so on?
This would reduce the dimension of the output from NxNx(5+C) (5 for the probability and bounding box, plus C, one for each class) to NxNx(5+1) (5 as before and 1 for the class).
Thank you
Short answer: NO! That is almost certainly not an acceptable solution. It sounds like your core question is: why is a single value in the range [0,1] not a sufficient, compact output for object classification? As a clarification, I'd say this doesn't really have to do with single-shot detectors; the outputs from 2-stage detectors and almost all classification networks follow this same 1D embedding structure. As a secondary clarification, I'd say that many 1-stage networks also don't output pobj in their original implementations (YOLO is the main one that does, but RetinaNet does not, and I believe SSD does not either).
An object's class is a categorical attribute. A standard classification problem assumes that the set of possible classes is flat (i.e. no class is a subclass of any other), mutually exclusive (each example falls into only a single class), and unrelated (not quite the right term, but essentially meaning that no class is any more or less related to any other class).
This assumed attribute structure is well represented by an orthonormal encoding vector of the same length as the set of possible attributes. A vector [1,0,0,0] is no more similar to [0,1,0,0] than it is to [0,0,0,1] in this space.
(As an aside, a separate branch of ML problems called multilabel classification removes the mutual exclusivity constraint, so [0,1,1,0] and [0,1,1,1] would both be valid label predictions. In this space, class or label combinations COULD be construed as more or less related, since they share constituent labels or "basis vectors" in the orthonormal categorical attribute space. But enough digression.)
A single, continuous variable output for class destroys the assumption that all classes are unrelated. In fact, it assumes that the relation between any two classes is exact and quantifiable! What an assumption! Consider attempting to arrange the classes of, let's say, the ImageNet classification task along a single dimension. Bus and car should be close, no? Let's say 0.1 and 0.2, respectively, in our 1D embedding range of [0,1]. Zebra must be far away from them, maybe at 0.8. But it should be close to zebrafish (0.82), no? Is a striped shirt closer to a zebra or a bus? Is the moon more similar to a bicycle or a trumpet? And is a zebra really 5 times more similar to a zebrafish than a bus is to a car? The exercise is immediately, patently absurd. A 1D embedding space for object class is not sufficiently rich to capture the differences between object classes.
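To make the geometric point concrete, here is a small NumPy sketch (the class list and the 1D positions are made up for illustration): one-hot vectors are all equidistant, while any 1D placement forces an arbitrary similarity ordering.

import numpy as np

# One-hot encoding: every pair of distinct classes is equally far apart.
one_hot = np.eye(4)   # rows: bus, car, zebra, zebrafish
d_onehot = np.linalg.norm(one_hot[:, None] - one_hot[None, :], axis=-1)
print(d_onehot)       # every off-diagonal distance is sqrt(2)

# 1D embedding: distances now assert exact relations between classes.
scalar = np.array([0.1, 0.2, 0.8, 0.82])   # bus, car, zebra, zebrafish
d_scalar = np.abs(scalar[:, None] - scalar[None, :])
print(d_scalar)       # zebra-zebrafish (0.02) vs bus-car (0.10): a 5x claim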
Why can't we just place object classes randomly in the continuous range [0,1]? In a theoretical sense nothing is stopping you, but the loss surface of the network would become horrendously, unmanageably non-convex, and conventional approaches to training the network would fail. Not to mention that the network architecture would have to encode extremely non-linear activation functions to predict the extremely hard boundaries between neighboring classes in the 1D space, resulting in a very brittle, non-generalizable model.
From here, the nuanced reader might suggest that in fact, some classes ARE related to one another (i.e. the unrelated assumption of the standard classification problem is not really correct). Bus and car are certainly more related than bus and trumpet, no? Without devolving into a critique on the limited usefulness of strict ontological categorization of the world, I'll simply suggest that in many cases there is an information embedding that strikes a middle ground. A vast field of work has been devoted to finding embedding spaces that are compact (relative to the exhaustive enumeration of "everything is its own class of 1") but still meaningful. This is the work of principal component analysis and object appearance embedding in deep learning.
Depending on the particular problem, you may be able to take advantage of a more nuanced embedding space better suited towards the final task you hope to accomplish. But in general, canonical deep learning tasks such as classification / detection ignore this nuance in the hopes of designing solutions that are "pretty good" generalized over a large range of problem spaces.
For the object classification head, a cross-entropy loss function is usually used; it operates on the probability distribution to compute the difference between the ground truth (a one-hot encoded vector) and the predicted class scores.
On the other hand, you are proposing a different way of encoding the ground-truth class labels, which would then be used with some custom loss function, say an L1/L2 loss. This looks theoretically plausible, but it might not be as good as the cross-entropy function in terms of model convergence/optimization.
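For illustration, a toy comparison of the two heads in NumPy (4 classes; the logits, bin centers, and scalar prediction are arbitrary numbers I made up):

import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.1])   # raw network outputs for 4 classes
true_class = 0

# Standard head: softmax + cross-entropy against a one-hot ground truth.
probs = np.exp(logits) / np.exp(logits).sum()
ce_loss = -np.log(probs[true_class])

# Proposed head: one scalar output, classes mapped to bins inside [0, 1].
bin_centers = np.array([0.1, 0.3, 0.5, 0.7])   # c1..c4 as ranges of width 0.2
pred_scalar = 0.55                             # the scalar the network predicted
l1_loss = abs(pred_scalar - bin_centers[true_class])

# Note the side effect: the L1 loss punishes predicting c4-when-truth-is-c1
# harder than c2-when-truth-is-c1, imposing an ordering the classes lack.
print(ce_loss, l1_loss)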
I have two 3D variables for each time step (so for each variable I have N 3D matrices var(Nx,Ny,Nz)). I want to construct the two-point statistics, but I guess I'm doing something wrong.
The two-point statistics formula is
$$R_{\phi\psi}(\mathbf{x}_r,\mathbf{x}) = \frac{\langle \phi(\mathbf{x}_r)\,\psi(\mathbf{x})\rangle}{\sqrt{\langle \phi^2(\mathbf{x}_r)\rangle\,\langle \psi^2(\mathbf{x})\rangle}},$$
where $\mathbf{x}_r$ is the reference point and $\mathbf{x}$ is the independent variable.
I know that the theoretical formulation of a two-point cross correlation is the one written above.
For the sake of simplicity, let's ignore the normalization; I'm focusing on the numerator, which is the part I'm struggling with.
So, my two variables are two 3D matrices, with the notation phi(x,y,z) = phi(i,j,k), and the same for psi.
My aim is to compute a 3D correlation given a certain reference point Reference_Point = (xr,yr,zr), but I guess I'm doing something wrong. I'm trying this in MATLAB and my results are not accurate. From some research online it seems that I should use convolutions or FFTs, but I can't find a theoretical framework that explains how to do that, and why the formulation above should in practice be implemented with a conv or FFT. Moreover, I would like to implement my cross-correlation in the spatial domain, not the frequency domain, and with the convolution approach I don't understand how to choose the reference point.
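For what it's worth, the literal spatial-domain transcription of the numerator for one fixed reference point is just an average over snapshots (sketched here in NumPy with made-up array names; no FFT involved):

import numpy as np

# phi, psi: shape (N, Nx, Ny, Nz) -- N snapshots of each 3D variable.
N, Nx, Ny, Nz = 10, 16, 16, 16
rng = np.random.default_rng(0)
phi = rng.standard_normal((N, Nx, Ny, Nz))
psi = rng.standard_normal((N, Nx, Ny, Nz))

xr, yr, zr = 8, 8, 8   # reference point x_r

# Numerator <phi(x_r) * psi(x)>: average the product over the N
# snapshots, for every independent point x at once.
numerator = np.mean(phi[:, xr, yr, zr, None, None, None] * psi, axis=0)
# numerator has shape (Nx, Ny, Nz): one value per point x

The FFT/convolution route only matters when you want this for all reference points at once; for a single x_r the direct average above is the formula verbatim.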
Thank you so much in advance for your reply.
The term interpolation is usually used with mathematical functions when determining a function for given values, which makes perfect sense. I don't see how that applies to strings; what is being interpolated? Am I missing something obvious?
Interpolation in mathematics is simply working out the things between two points(a). For example, cubic spline fitting over a series of points will give you a curve of some description (I consider a straight line to be a degenerate curve here, so don't bother pointing out that some formulae generate such a beast) between each pair of points, even though you have no actual data there.
Contrast this with extrapolation, which will give you data beyond the endpoints. An example of that is seeing that, based on history, stock market indices rise at x percent per annum, so in a hundred years they will be much higher than they are now.
So it's a short step to the most likely explanation of why variable substitution within strings is called interpolation: you're changing things within the bounds of the data:
xyzzy="42"
plugh="abc${xyzzy}xyz"
// now plugh is equal to "abc42xyz"
(a) The actual roots of the word are the Latin inter + polare, translating to "within" and "polish" (in the sense of modify or improve). See here for more detail.
I have some vectors of experimental data that I need to massage, for example:
{
{0, 61237, 131895, 194760, 249935},
{0, 61939, 133775, 197516, 251018},
{0, 60919, 131391, 194112, 231930},
{0, 60735, 131015, 193584, 249607},
{0, 61919, 133631, 197186, 250526},
{0, 61557, 132847, 196143, 258687},
{0, 61643, 133011, 196516, 249891},
{0, 62137, 133947, 197848, 251106}
}
Each vector is the result of one run and consists of five numbers, which are times at which an object passes each of five sensors. Over the measurement interval the object's speed is constant (the sensor-to-sensor intervals are different because the sensor spacings are not uniform). From one run to the next the sensors' spacing remains the same, but the object's speed will vary a bit from one run to the next.
If the sensors were perfect, each vector ought to simply be a scalar multiple of any other vector (in proportion to the ratio of their speeds). But in reality each sensor will have some "jitter" and trigger early or late by some small random amount. I am trying to analyze how good the sensors themselves are, i.e. how much "jitter" is there in the measurements they give me?
So I think I need to do the following: scale each vector, and also shift it a bit (adding or subtracting a fixed amount to each of its five elements). Then the StandardDeviation of each column will describe the amount of "noise" or "jitter" in that sensor. The amount by which each vector is scaled, and the amount by which each vector is shifted, have to be chosen to minimize the standard deviations of the columns.
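Concretely, the objective could be sketched like this (in Python with SciPy rather than Mathematica, purely to pin down the math; pinning the first run to scale 1, shift 0 is my assumption, made to remove the trivial rescaling degeneracy):

import numpy as np
from scipy.optimize import minimize

runs = np.array([   # the 8 runs x 5 sensors from above
    [0, 61237, 131895, 194760, 249935],
    [0, 61939, 133775, 197516, 251018],
    [0, 60919, 131391, 194112, 231930],
    [0, 60735, 131015, 193584, 249607],
    [0, 61919, 133631, 197186, 250526],
    [0, 61557, 132847, 196143, 258687],
    [0, 61643, 133011, 196516, 249891],
    [0, 62137, 133947, 197848, 251106]], dtype=float)
n = len(runs)
weights = np.ones(5)   # per-sensor weights; zero an entry to exclude a suspect sensor

def objective(params):
    # one scale and one shift per run; run 0 is pinned to s=1, t=0
    s = np.concatenate([[1.0], params[:n - 1]])
    t = np.concatenate([[0.0], params[n - 1:]])
    aligned = runs * s[:, None] + t[:, None]
    return (weights * aligned.std(axis=0)).sum()   # weighted sum of column SDs

x0 = np.concatenate([np.ones(n - 1), np.zeros(n - 1)])
res = minimize(objective, x0, method="Powell")
# the per-column SDs of the aligned runs at res.x estimate each sensor's jitter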
It seemed to me that Mathematica probably has a good toolkit for getting this done; in fact, I thought I might have found the answer with Standardize[], but it seems to be oriented towards processing a list of scalars, not a list of lists like I have (or at least I can't figure out how to apply it to my case here).
So I am looking for hints about which library function(s) I might use to solve this problem, or perhaps the hint I need to work out the algorithm myself. Perhaps part of my problem is that I can't figure out where to look: is what I have here a "signal processing" problem, a data manipulation or data mining problem, a minimization problem, or maybe a relatively standard statistical function that I simply haven't heard of before?
(As a bonus, I would like to be able to control the weighting function used to optimize this scaling/shifting; e.g. in my data above I suspect that sensor #5 is having problems, so I would like the fit to consider only the SDs of sensors 1-4 when doing the scaling/shifting.)
I can't comment much on your algorithm itself, as data analysis is not my forte. However, from what I understand, you're trying to characterize the timing variations in each sensor. Since the data from a single sensor is in a single column of your matrix, I'd suggest transposing it and mapping Standardize onto each set of data. In other words,
dat = (* your data *)
Standardize /@ Transpose[dat]
To put it back in columnar form, Transpose the result. To exclude your last sensor from this process, simply use Part ([[ ]]) and Span (;;)
Standardize /@ Transpose[dat][[ ;; -2 ]]
Or, Most
Standardize /@ Most[Transpose[dat]]
Thinking about it, I think you're going to have a hard time separating out the timing jitter from variation in velocity. Can you intentionally vary the velocity?
I am using Octave, and I would like to use the anderson_darling_test from the Octave Forge Statistics package to test whether two vectors of data are drawn from the same statistical distribution. Furthermore, the reference distribution is unlikely to be "normal". This reference distribution will be the known distribution, per the help for the above function: "If you are selecting from a known distribution, convert your values into CDF values for the distribution and use 'uniform'."
My question therefore is: how would I convert my data values into CDF values for the reference distribution?
Some background information for the problem: I have a vector of raw data values from which I extract the cyclic component (this will be the reference distribution); I then wish to compare this cyclic component with the raw data itself to see if the raw data is essentially cyclic in nature. If the null hypothesis that the two are the same can be rejected, I will then know that most of the movement in the raw data is not due to cyclic influences but is due to either trend or just noise.
If your data has a specific distribution, for instance beta(3,3), then
p = betacdf(x, 3, 3)
will be uniform by the definition of a CDF. If you want to transform it to a normal, you can just call the inverse CDF function
x = norminv(p, 0, 1)
on the uniform p. Once transformed, use your favorite test. I'm not sure I understand your data, but you might consider using a Kolmogorov-Smirnov test instead, which is a nonparametric test of distributional equality.
Your approach is misguided in multiple ways. Several points:
The Anderson-Darling test implemented in Octave Forge is a one-sample test: it requires one vector of data and a reference distribution. The distribution should be known in advance, not estimated from the data. While you quote the help file correctly about using a CDF and the "uniform" option for a distribution that is not built in, you are ignoring the next sentence of the same help file:
Do not use "uniform" if the distribution parameters are estimated from the data itself, as this sharply biases the A^2 statistic toward smaller values.
So, don't do it.
Even if you found or wrote a function implementing a proper two-sample Anderson-Darling or Kolmogorov-Smirnov test, you would still be left with a couple of problems:
Your samples (the data and the cyclic part estimated from the data) are not independent, and these tests assume independence.
Given your description, I assume there is some sort of time predictor involved. So even if the distributions coincided, that would not mean they coincide at the same time points, because comparing distributions collapses over time.
The distribution of cyclic trend + error would not be expected to be the same as the distribution of the cyclic trend alone. Suppose the trend is sin(t); then it will never go above 1. Now add a normally distributed random error term with standard deviation 0.1 (small, so that the trend is dominant). Obviously you could get values well above 1.
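A quick numerical illustration of that point (a throwaway sketch in Python; the numbers are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0, 20 * np.pi, 100_000)
trend = np.sin(t)                              # never exceeds 1
noisy = trend + rng.normal(0.0, 0.1, t.size)   # trend plus small error

print(trend.max())          # at most 1.0, by construction
print(noisy.max())          # comfortably above 1
print((noisy > 1).mean())   # a small but clearly non-zero fraction

So even a dominant cyclic component plus mild noise has a visibly different distribution from the cyclic component alone.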
We do not have enough information to figure out the proper thing to do, and it is not really a programming question anyway. Look up time series theory; separating cyclic components is a major topic there. But many reasonable analyses will probably be based on the residuals (observed value minus the value predicted from the cyclic component). You will still have to be careful about autocorrelation and other complexities, but at least it will be a move in the right direction.