as.numeric with comma decimal separators? - string

I have a large vector of strings of the form:
Input = c("1,223", "12,232", "23,0")
etc. That's to say, decimals separated by commas, instead of periods. I want to convert this vector into a numeric vector. Unfortunately, as.numeric(Input) just outputs NA.
My first instinct would be to go to strsplit, but it seems to me that this will likely be very slow. Does anyone have any idea of a faster option?
There's an existing question that suggests read.csv2, but the strings in question are not directly read in that way.

as.numeric(sub(",", ".", Input, fixed = TRUE))
should work.

The readr package has a function to parse numbers from strings. You can set many options via the locale argument.
For comma as decimal separator you can write:
readr::parse_number(Input, locale = readr::locale(decimal_mark = ","))

scan(text=Input, dec=",")
## [1] 1.223 12.232 23.000
But it depends on how long your vector is. I used rep(Input, 1e6) to make a long vector and my machine just hangs. 1e4 is fine, though. #adibender's solution is much faster. If we run on 1e4, a lot faster:
Unit: milliseconds
expr min lq median uq max neval
adibender() 6.777888 6.998243 7.119136 7.198374 8.149826 100
sebastianc() 504.987879 507.464611 508.757161 510.732661 517.422254 100

Also, if you are reading in the raw data, the read.table and all the associated functions have a dec argument. eg:
read.table("file.txt", dec=",")
When all else fails, gsub and sub are your friends.

Building on #adibender solution:
input = '23,67'
as.numeric(gsub(
# ONLY for strings containing numerics, comma, numerics
"^([0-9]+),([0-9]+)$",
# Substitute by the first part, dot, second part
"\\1.\\2",
input
))
I guess that is a safer match...

As stated by , it's way easier to do this while importing a file.
Thw recently released reads package has a very useful features, locale, well explained here, that allows the user to import numbers with comma decimal mark using locale = locale(decimal_mark = ",") as argument.

The answer by adibender does not work when there are multiple commas.
In that case the suggestion from use554546 and answer from Deena can be used.
Input = c("1,223,765", "122,325,000", "23,054")
as.numeric(gsub("," ,"", Input))
ouput:
[1] 1223765 122325000 23054
The function gsub replaces all occurences. The function sub replaces only the first.

Related

Lua unicode, using string.sub() with two-byted chars

As example: I want remove the first 2 letters from the string "ПРИВЕТ" and "HELLO." one of these are containing only two-byted unicode symbols.
Trying to use string.sub("ПРИВЕТ") and string.sub("HELLO.")
Got "РИВЕТ" and "LLO.".
string.sub() removed 2 BYTES(not chars) from these strings. So i want to know how to get the removing of the chars
Something, like utf8.sub()
The key standard function for this task is utf8.offset(s,n), which gives the position in bytes of the start of the n-th character of s.
So try this:
print(string.sub(s,utf8.offset(s,3),-1))
You can define utf8.sub as follows:
function utf8.sub(s,i,j)
i=utf8.offset(s,i)
j=utf8.offset(s,j+1)-1
return string.sub(s,i,j)
end
(This code only works for positive j. See http://lua-users.org/lists/lua-l/2014-04/msg00590.html for the general case.)
There is https://github.com/Stepets/utf8.lua, a pure lua library, which expand standard function to support utf8 string.
I found a simpler solution (the the solution using offset() didnt work for me for all cases):
function utf8.sub(s, i, j)
return utf8.char(utf8.codepoint(s, i, j))
end

String matching without using builtin functions

I want to search for a query (a string) in a subject (another string).
The query may appear in whole or in parts, but will not be rearranged. For instance, if the query is 'da', and the subject is 'dura', it is still a match.
I am not allowed to use string functions like strfind or find.
The constraints make this actually quite straightforward with a single loop. Imagine you have two indices initially pointing at the first character of both strings, now compare them - if they don't match, increment the subject index and try again. If they do, increment both. If you've reached the end of the query at that point, you've found it. The actual implementation should be simple enough, and I don't want to do all the work for you ;)
If this is homework, I suggest you look at the explanation which precedes the code and then try for yourself, before looking at the actual code.
The code below looks for all occurrences of chars of the query string within the subject string (variables m; and related ii, jj). It then tests all possible orders of those occurrences (variable test). An order is "acceptable" if it contains all desired chars (cond1) in increasing positions (cond2). The result (variable result) is affirmative if there is at least one acceptable order.
subject = 'this is a test string';
query = 'ten';
m = bsxfun(#eq, subject.', query);
%'// m: test if each char of query equals each char of subject
[ii jj] = find(m);
jj = jj.'; %'// ii: which char of query is found within subject...
ii = ii.'; %'// jj: ... and at which position
test = nchoosek(1:numel(jj),numel(query)).'; %'// test all possible orders
cond1 = all(jj(test) == repmat((1:numel(query)).',1,size(test,2)));
%'// cond1: for each order, are all chars of query found in subject?
cond2 = all(diff(ii(test))>0);
%// cond2: for each order, are the found chars in increasing positions?
result = any(cond1 & cond2); %// final result: 1 or 0
The code could be improved by using a better approach as regards to test, i.e. not testing all possible orders given by nchoosek.
Matlab allows you to view the source of built-in functions, so you could always try reading the code to see how the Matlab developers did it (although it will probably be very complex). (thanks Luis for the correction)
Finding a string in another string is a basic computer science problem. You can read up on it in any number of resources, such as Wikipedia.
Your requirement of non-rearranging partial matches recalls the bioinformatics problem of mapping splice variants to a genomic sequence.
You may solve your problem by using a sequence alignment algorithm such as Smith-Waterman, modified to work with all English characters and not just DNA bases.
Is this question actually from bioinformatics? If so, you should tag it as such.

Extract decimal part in a string in C#

I have a string in which the data is getting loaded in this format. "float;#123456.0300000" from which i need to extract only 123456.03 and trim all other starting and ending characters. May be it is very basic that i am missing, please let me know how can i do that. Thanks.
If the format is always float;# followed by the bit you want, then it's fairly simple:
// TODO: Validate that it actually starts with "float;#". What do you want
// to do if it doesn't?
string userText = originalText.Substring("float;#".Length);
// We wouldn't want to trim "300" to "3"
if (userText.Contains("."))
{
userText = userText.TrimEnd('0');
// Trim "123.000" to "123" instead of "123."
if (userText.EndsWith("."))
{
userText = userText.Substring(0, userText.Length - 1);
}
}
But you really need to be confident in the format - that there won't be anything else you need to remove, etc.
In particular:
What's the maximum number of decimal places you want to support?
Might there be thousands separators?
Will the decimal separator always be a period?
Can the value be negative?
Is an explicit leading + valid?
Do you need to handle scientific format (1.234e5 etc)?
Might there be trailing characters?
Might the trailing characters include digits?
Do you need to handle non-ASCII digits?
You may well want to use a regular expression if you need anything more complicated than the code above, but you'll really want to know as much as you can about your possible inputs before going much further. (I'd strongly recommend writing unit tests for every form of input you can think of.)
Why not just do:
String s = "float;#123456.0300000";
int i = int.Parse(s.Substring(s.IndexOf('#') + 1));
I haven't tested this, but it should be close.
Assuming the format doesn't change, just find the character after the '#' and turn it into an int, so you lose everything after the '.'.
If you need it as a string, then just call the ToString method.

I am trying to display variable names and num2str representations of their values in matlab

I am trying to produce the following:The new values of x and y are -4 and 7, respectively, using the disp and num2str commands. I tried to do this disp('The new values of x and y are num2str(x) and num2str(y) respectively'), but it gave num2str instead of the appropriate values. What should I do?
Like Colin mentioned, one option would be converting the numbers to strings using num2str, concatenating all strings manually and feeding the final result into disp. Unfortunately, it can get very awkward and tedious, especially when you have a lot of numbers to print.
Instead, you can harness the power of sprintf, which is very similar in MATLAB to its C programming language counterpart. This produces shorter, more elegant statements, for instance:
disp(sprintf('The new values of x and y are %d and %d respectively', x, y))
You can control how variables are displayed using the format specifiers. For instance, if x is not necessarily an integer, you can use %.4f, for example, instead of %d.
EDIT: like Jonas pointed out, you can also use fprintf(...) instead of disp(sprintf(...)).
Try:
disp(['The new values of x and y are ', num2str(x), ' and ', num2str(y), ', respectively']);
You can actually omit the commas too, but IMHO they make the code more readable.
By the way, what I've done here is concatenated 5 strings together to form one string, and then fed that single string into the disp function. Notice that I essentially concatenated the string using the same syntax as you might use with numerical matrices, ie [x, y, z]. The reason I can do this is that matlab stores character strings internally AS numeric row vectors, with each character denoting an element. Thus the above operation is essentially concatenating 5 numeric row vectors horizontally!
One further point: Your code failed because matlab treated your num2str(x) as a string and not as a function. After all, you might legitimately want to print "num2str(x)", rather than evaluate this using a function call. In my code, the first, third and fifth strings are defined as strings, while the second and fourth are functions which evaluate to strings.

Modifying a character in a string in Lua

Is there any way to replace a character at position N in a string in Lua.
This is what I've come up with so far:
function replace_char(pos, str, r)
return str:sub(pos, pos - 1) .. r .. str:sub(pos + 1, str:len())
end
str = replace_char(2, "aaaaaa", "X")
print(str)
I can't use gsub either as that would replace every capture, not just the capture at position N.
Strings in Lua are immutable. That means, that any solution that replaces text in a string must end up constructing a new string with the desired content. For the specific case of replacing a single character with some other content, you will need to split the original string into a prefix part and a postfix part, and concatenate them back together around the new content.
This variation on your code:
function replace_char(pos, str, r)
return str:sub(1, pos-1) .. r .. str:sub(pos+1)
end
is the most direct translation to straightforward Lua. It is probably fast enough for most purposes. I've fixed the bug that the prefix should be the first pos-1 chars, and taken advantage of the fact that if the last argument to string.sub is missing it is assumed to be -1 which is equivalent to the end of the string.
But do note that it creates a number of temporary strings that will hang around in the string store until garbage collection eats them. The temporaries for the prefix and postfix can't be avoided in any solution. But this also has to create a temporary for the first .. operator to be consumed by the second.
It is possible that one of two alternate approaches could be faster. The first is the solution offered by Paŭlo Ebermann, but with one small tweak:
function replace_char2(pos, str, r)
return ("%s%s%s"):format(str:sub(1,pos-1), r, str:sub(pos+1))
end
This uses string.format to do the assembly of the result in the hopes that it can guess the final buffer size without needing extra temporary objects.
But do beware that string.format is likely to have issues with any \0 characters in any string that it passes through its %s format. Specifically, since it is implemented in terms of standard C's sprintf() function, it would be reasonable to expect it to terminate the substituted string at the first occurrence of \0. (Noted by user Delusional Logic in a comment.)
A third alternative that comes to mind is this:
function replace_char3(pos, str, r)
return table.concat{str:sub(1,pos-1), r, str:sub(pos+1)}
end
table.concat efficiently concatenates a list of strings into a final result. It has an optional second argument which is text to insert between the strings, which defaults to "" which suits our purpose here.
My guess is that unless your strings are huge and you do this substitution frequently, you won't see any practical performance differences between these methods. However, I've been surprised before, so profile your application to verify there is a bottleneck, and benchmark potential solutions carefully.
You should use pos inside your function instead of literal 1 and 3, but apart from this it looks good. Since Lua strings are immutable you can't really do much better than this.
Maybe
"%s%s%s":format(str:sub(1,pos-1), r, str:sub(pos+1, str:len())
is more efficient than the .. operator, but I doubt it - if it turns out to be a bottleneck, measure it (and then decide to implement this replacement function in C).
With luajit, you can use the FFI library to cast the string to a list of unsigned charts:
local ffi = require 'ffi'
txt = 'test'
ptr = ffi.cast('uint8_t*', txt)
ptr[1] = string.byte('o')

Resources