How to convert from string to numeric value in R - string

i'm looking to graph the heights of a group of people in R
B/c a height like 5-11 comes in as a string, I was wondering if there were any tips on how to convert to a number so it could be graphed.

It also works with base R functions:
sapply(strsplit(x, "-"), function(x) {x <- as.numeric(x); x[1] + x[2] / 12})

I am assuming your data is a factor. If this is correct you can edit all the levels to numeric values then convert the factor to a numeric vector.
levels(data$heights) #Levels of the factor
levels(data$heights)<-c(5*12+10,5*12+11,6*12,6*12+1) #Renaming factors in numerical make sure
##the numbers are in the same order as in your levels
data$heights<-as.numeric(levels(data$heights))[data$heights] #Changing factor GPA to numeric vector GPA
data$heights
Reminder that this works if your R read your data as a factor

require(stringr)
Split <- str_split(x, "-")
sapply(Split, function(x) x[1] + (x[2] / 11))

Related

Octave fplot abs looks very strange

f = #(x)(abs(x))
fplot(f, [-1, 1]
Freshly installed octave, with no configuration edited. It results in the following image, where it looks as if it is constant for a while around 0, looking more like a \_/ than a \/:
Why does it look so different from a usual plot of the absolute value near 0? How can this be fixed?
Since fplot is written in Octave it is relatively easy to read. Its location can be found using the which command. On my system this gives:
octave:1> which fplot
'fplot' is a function from the file /usr/share/octave/5.2.0/m/plot/draw/fplot.m
Examining fplot.m reveals that the function to be plotted, f(x), is evaluated at n equally spaced points between the given limits. The algorithm for determining n starts at line 192 and can be summarised as follows:
n is initially chosen to be 8 (unless specified differently by the user)
Construct a vector of arguments using a coarser grid of n/2 + 1 points:
x0 = linspace (limits(1), limits(2), n/2 + 1)'
(The linspace function will accept a non-integer value for the number of points, which it rounds down)
Calculate the corresponding values:
y0 = f(x0)
Construct a vector of arguments using a grid of n points:
x = linspace (limits(1), limits(2), n)'
Calculate the corresponding values:
y = f(x0)
Construct a vector of values corresponding to the members of x but calculated from x0 and y0 by linear interpolation using the function interp1():
yi = interp1 (x0, y0, x, "linear")
Calculate an error metric using the following formula:
err = 0.5 * max (abs ((yi - y) ./ (yi + y + eps))(:))
That is, err is proportional to the maximum difference between the calculated and linearly interpolated values.
If err is greater than tol (2e-3 unless specified by the user) then put n = 2*(n-1) and repeat. Otherwise plot(x,y).
Because abs(x) is essentially a pair of straight lines, if x0 contains zero then the linearly interpolated values will always exactly match their corresponding calculated values and err will be exactly zero, so the above algorithm will terminate at the end of the first iteration. If x doesn't contain zero then plot(x,y) will be called on a set of points that doesn't include the 'cusp' of the function and the strange behaviour will occur.
This will happen if the limits are equally spaced either side of zero and floor(n/2 + 1) is odd, which is the case for the default values (limits = [-5, 5], n = 8).
The behaviour can be avoided by choosing a combination of n and limits so that either of the following is the case:
a) the set of m = floor(n/2 + 1) equally spaced points doesn't include zero or
b) the set of n equally spaced points does include zero.
For example, limits equally spaced either side of zero and odd n will plot correctly . This will not work for n=5, though, because, strangely, if the user inputs n=5, fplot.m substitutes 8 for it (I'm not sure why it does this, I think it may be a mistake). So fplot(#abs, [-1, 1], 3) and fplot(#abs, [-1, 1], 7) will plot correctly but fplot(#abs, [-1, 1], 5) won't.
(n/2 + 1) is odd, and therefore x0 contains zero for symmetrical limits, only for every 2nd even n. This is why it plots correctly with n=6 because for that value n/2 + 1 = 4, so x0 doesn't contain zero. This is also the case for n=10, 14, 18 and so on.
Choosing slightly asymmetrical limits will also do the trick, try: fplot(#abs, [-1.1, 1.2])
The documentation says: "fplot works best with continuous functions. Functions with discontinuities are unlikely to plot well. This restriction may be removed in the future." so it is probably a bug/feature of the function itself that can't be fixed except by the developers. The ordinary plot() function works fine:
x = [-1 0 1];
y = abs(x);
plot(x, y);
The weird shape comes from the sampling rate, i.e. at how many points the function is evaluated. This is controlled by the parameter N of fplot The default call seems to accidentally skip x=0, and with fplot(#abs, [-1, 1], N=5) I get the same funny shape like you:
However, trying out different values of N can yield the correct shape, try e.g. fplot(#abs, [-1, 1], N=6):
Although in general I would suggest to use way higher numbers, like N=100.

Search and remove algorithm

Say you have an ordered array of values representing x coordinates.
[0,25,50,60,75,100]
You might notice that without the 60, the values would be evenly spaced (25). This would be indicative of a repeating pattern, something that I need to extract using this list (regardless of the length and the values of the list). In this particular example, the algorithm should find and remove the 60.
There are no time or space complexity requirements.
Both the values in the list and the ideal spacing (e.g 25) are unknown. So the algorithm must obtain this by looking at the values. In addition, the number of values, and where the outliers are in the array are not guaranteed. There may be more than one outlier. The algorithm should return a list with the outliers removed. Extra points if the algorithm uses a threshold for the spacing.
Edit: Here is an example image
Here there is one outlier on the x axis. (green-line) There are two on the y axis. The x-coordinates of the array represent the rho of the line on that axis.
arr = [0,25,50,60,75,100]
First construct the distances array
dist = np.array([arr[i+1] - arr[i] for (i, _) in enumerate(arr) if i < len(arr)-1])
print(dist)
>> [25 25 10 15 25]
Now I'm using np.where and np.percentile to cut the array in 3 part: the main , the upper values and the lower values. I arbitrary set them to 5%.
cond_sup = np.where(dist > np.percentile(dist, 95))
print(cond_sup)
>> (array([]),)
cond_inf = np.where(dist < np.percentile(dist, 5))
print(cond_inf)
>> (array([2]),)
You now got indexes where the value is different from the others.
So, dist[2] has a problem, which mean by construction the problem is between arr[2] and arr[2+1]
I don't know if you want to remove 1 or more numbers from this array. So I think the way to solve this problem will be like this:
array A[] = [0,25,50,60,75,100];
sort array (if needed).
create a new array B[] with value i-th: B[i] = A[i+1] - A[i]
find the value of B[] elements that appear most time. It's will be our distance.
find i such that A[i+1]-A[i] != distance
find k (k>i and k min) such that A[i+k]-A[i] == distance
so, we need remove A[i+1] => A[i+k-1]
I hope it is right.

Trying to end up with two decimal points on a float, but keep getting 0.0

I have a float and would like to limit to just two decimals.
I've tried format(), and round(), and still just get 0, or 0.0
x = 8.972990688205408e-05
print ("x: ", x)
print ("x using round():", round(x))
print ("x using format():"+"{:.2f}".format(x))
output:
x: 8.972990688205408e-05
x using round(): 0
x using format():0.00
I'm expecting 8.98, or 8.97 depending on what method used. What am I missing?
You are using the scientific notation. As glhr pointed out in the comments, you are trying to round 8.972990688205408e-05 = 0.00008972990688205408. This means trying to round as type float will only print the first two 0s after the decimal points, resulting in 0.00. You will have to format via 0:.2e:
x = 8.972990688205408e-05
print("{0:.2e}".format(x))
This prints:
8.97e-05
You asked in one of your comments on how to get only the 8.97.
This is the way to do it:
y = x*1e+05
print("{0:.2f}".format(y))
output:
8.97
In python (and many other programming language), any number suffix with an e with a number, it is power of 10 with the number.
For example
8.9729e05 = 8.9729 x 10^3 = 8972.9
8.9729e-05 = 8.9729 x 10^-3 = 0.000089729
8.9729e0 = 8.9729 x 10^0 = 8.9729
8.972990688205408e-05 8.972990688205408 x 10^-5 = 0.00008972990688205408
8.9729e # invalid syntax
As pointed out by other answer, if you want to print out the exponential round up, you need to use the correct Python string format, you have many choices to choose from. i.e.
e Floating point exponential format (lowercase, precision default to 6 digit)
e Floating point exponential format (uppercase, precision default to 6 digit).
g Same as "e" if exponent is greater than -4 or less than precision, "f" otherwise
G Same as "E" if exponent is greater than -4 or less than precision, "F" otherwise
e.g.
x = 8.972990688205408e-05
print('{:e}'.format(x)) # 8.972991e-05
print('{:E}'.format(x)) # 8.972991E-05
print('{:.2e}'.format(x)) # 8.97e-05
(Update)
OP asked a way to remove the exponent "E" number. Since str.format() or "%" notation just output a string object, break the "e" notation out of the string will do the trick.
'{:.2e}'.format(x).split("e") # ['8.97', '-05']
print('{:.2e}'.format(x).split('e')[0]) # 8.97
If I understand correctly, you only want to round the mantissa/significand? If you want to keep x as a float and output a float, just specify the precision when calling round:
x = round(8.972990688205408e-05,7)
Output:
8.97e-05
However, I recommend converting x with the decimal module first, which "provides support for fast correctly-rounded decimal floating point arithmetic" (see this answer):
from decimal import Decimal
x = Decimal('8.972990688205408e-05').quantize(Decimal('1e-7')) # output: 0.0000897
print('%.2E' % x)
Output:
8.97E-05
Or use the short form of the format method, which gives the same output:
print(f"{x:.2E}")
rount() returns closest multiple of 10 to the power minus ndigits,
so there is no chance you will get 8.98 or 8.97. you can check here also.

gam in mgcv R with big number of covariates

I would like to know if there is another way to write the function:
gam(VariableResponse ~ s(CovariateName1) + s(CovariateName2) + ... + s(CovariateName100),
family = gaussian(link = identity), data = MyData)
in mgcv package without typing 100 covariates' name as above?
Supposing that in MyData I have only VariableResponse in column 1, CovariateName1 in column 2, etc.
Many thank!
Yes, use the brute force approach to generate a formula by pasting together the covariate names with the strings 's(' and ')' and then collapsing the whole things with ' + '. The convert the resultant string to a formula and pass that to gam(). You may need to fix issues with the formula's environment if gam() can't find the variable you name as it is going to do some NSE on the formula to identify which terms need smooths estimating and hence need to be replaced by a basis expansion.
library(mgcv)
set.seed(2) ## simulate some data...
df <- gamSim(1, n=400, dist = "normal", scale = 2)
> names(df)
[1] "y" "x0" "x1" "x2" "x3" "f" "f0" "f1" "f2" "f3"
We'll ignore the last 5 of those columns for the purposes of this example
df <- df[1:5]
Make the formula
fm <- paste('s(', names(df[ -1 ]), ')', sep = "", collapse = ' + ')
fm <- as.formula(paste('y ~', fm))
Now fit the model
m <- gam(fm, data = df)
> m
Family: gaussian
Link function: identity
Formula:
y ~ s(x0) + s(x1) + s(x2) + s(x3)
Estimated degrees of freedom:
2.5 2.4 7.7 1.0 total = 14.6
GCV score: 4.050519
You do have to be careful about fitting GAMs this way however; concurvity (the nonlinear counterpart to multicolinearlity in linear models) can cause catastrophically bad estimates of smooth functions.

Converting N strings to a common target string in maximum of K edits

I've a set of string [S1 S2 S3 ... Sn] and I'm to count all such target strings T such that each one of S1 S2... Sn can be converted into T within a total of K edits. All the strings are of fixed length L and an edit here is hamming distance.
All I've is sort of brute force approach.
so, If my alphabet size is 4, I've sample space of O(4^L) and it takes O(L) time to check each one of them. I can't seem to bring down the complexity from exponential to some poly or pseudo-poly! Is there any way to prune down the sample space to do better?
I tried to visualize it as in a L-dimensional vector space. I've been given N points and have to count all the points whose sum of distance from the given N points is less than or equal to K. i.e. d1 + d2 + d3 +...+ dN <= K
Is there any known geometric algorithm which solves this or similar problem with a better complexity? Kindly point me in the right direction or any hints are appreciated.
Thank you
You can do this efficiently with dynamic programming.
The key idea is that you don't need to enumerate all possible target strings, you just need to know how many ways targets are possible with K edits considering only the string indicies after I.
alphabet = 'abcd'
s = [ 'aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']
# use memoized from http://wiki.python.org/moin/PythonDecoratorLibrary
#memoized
def count(edits_left, index):
if index == -1 and edits_left >= 0:
return 1
if edits_left < 0:
return 0
ret = 0
for char in alphabet:
edits_used = 0
for mutate_str in s:
if mutate_str[index] != char:
edits_used += 1
ret += count(edits_left - edits_used, index - 1)
return ret
Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.
In general for a string S of length L, there are a total of C(L,K) (binomial coefficient) positions that can be substituted and therefore (ALPHABET_SIZE^K)*C(L,K) target strings T from a Hamming Distance of K.
Binomial Coefficient can be computed quite easily using Dynamic Programming and the Pascal Triangle... No need to get crazy into factoriel etc...
Now that one string case is treated, dealing with multiple strings is a little bit more tricky since you might double count targets. Intuitively though if S1 is K far from S2 then both string will generate the same set of target so you don't double count in this case. This last statement might be a long shot that's why I made sure to say "intuitively" :)
Hope it helps,

Resources