I have a .csv file, in which the numbers are formatted according to da_DK locale (i.e. a comma is used instead of a period as decimal point separator, among other things), so it looks something like this:
"5000","0,00","5,25", ....
I'd like to use a command line application to convert all the numbers in the file in one go, so the output is "C" (or POSIX) locale (i.e. dot/period is used as decimal separator):
"5000","0.00","5.25", ....
... keeping the decimal places as they are (i.e. "0,00" should be converted to "0.00", not "0" or "0.") and leaving all other data/formatting unchanged.
I am aware that there is numfmt, which should allow something like:
$ LC_ALL=en_DK.utf8 numfmt --from=iec --grouping 22123,11
22.123,11
... however, numfmt can only convert between units, not between locales (once LC_ALL is specified, the input number has to conform to it too, just like the output).
I'd ultimately like something that is CSV-agnostic - that is, can parse through a text file, find all substrings that match a format of a number in the given input locale (i.e. the program would deduce from a string like "5000","0,00","5,25","hello".... three locale-specific numeric substrings 5000, 0,00 and 5,25), convert and replace these substrings, and leave everything else as is; but as an alternative, I'd also like to know about a CSV-aware approach (i.e., all fields are parsed row by row, and then content of each field is checked if it matches a locale-specific numeric string).
Ok, I did find a way to do this in Perl, and it's not exactly trivial; an example (csv-agnostic) script which converts a test string is pasted below. Ultimately it prints:
Orig string: "AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"
Conv string: "AO900-020","Hello","World","5000","0.00","5.25","stk","","1","0.00","Test 2","42234.12","","","0.00","","","","5.25"
... which is basically what I wanted to achieve; but there may be edge cases here which would be undesirable. Maybe better to use something like this with a tool like csvfix or csvtool, or use a Perl csv library directly in code.
Still, here is the code:
#!/usr/bin/env perl
use warnings;
use strict;
use locale;
use POSIX qw(setlocale locale_h LC_ALL);
use utf8;
use Number::Format qw(:subs); # sudo perl -MCPAN -e 'install Number::Format'
use Data::Dumper;
use Scalar::Util::Numeric qw(isint); # sudo perl -MCPAN -e 'install Scalar::Util::Numeric'
my $old_locale;
# query and save the old locale
$old_locale = setlocale(LC_ALL);
# list of (installed) locales: bash$ locale -a
setlocale(LC_ALL, "POSIX");
# localeconv() returns "a reference to a hash of locale-dependent info"
# dereference here:
#%posixlocalesettings = %{localeconv()};
#print Dumper(\%posixlocalesettings);
# or without dereference:
my $posixlocalesettings = localeconv();
# the $posixlocalesettings has only 'decimal_point' => '.';
# force also thousands_sep to '', else it will be comma later on, and grouping will be made regardless
$posixlocalesettings->{'thousands_sep'} = '';
print Dumper($posixlocalesettings);
#~ my $posixNumFormatter = new Number::Format %args;
# thankfully, Number::Format seems to accept as argument same kind of hash that localeconv() returns:
my $posixNumFormatter = new Number::Format(%{$posixlocalesettings});
print Dumper($posixNumFormatter);
setlocale(LC_ALL, "en_DK.utf8");
my $dklocalesettings = localeconv();
print Dumper($dklocalesettings);
# Get some of locale's numeric formatting parameters
my ($thousands_sep, $decimal_point, $grouping) =
# @{localeconv()}{'thousands_sep', 'decimal_point', 'grouping'};
@{$dklocalesettings}{'thousands_sep', 'decimal_point', 'grouping'};
# grouping and mon_grouping are packed lists
# of small integers (characters) telling the
# grouping (thousand_seps and mon_thousand_seps
# being the group dividers) of numbers and
# monetary quantities. The integers’ meanings:
# 255 means no more grouping, 0 means repeat
# the previous grouping, 1-254 means use that
# as the current grouping. Grouping goes from
# right to left (low to high digits). In the
# below we cheat slightly by never using anything
# else than the first grouping (whatever that is).
my @grouping = unpack("C*", $grouping);
print "en_DK.utf8: thousands_sep $thousands_sep; decimal_point $decimal_point; grouping " .join(", ", @grouping). "\n";
my $inputCSVString = '"AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"';
# Character set modifiers
# /d, /u , /a , and /l , available starting in 5.14, are called the character set modifiers;
# /l sets the character set to that of whatever Locale is in effect at the time of the execution of the pattern match.
while ($inputCSVString =~ m/[[:digit:]]+/gl) { # doesn't take locale in account
print "A Found '$&'. Next attempt at character " . (pos($inputCSVString)+1) . "\n";
}
print "----------\n";
#~ while ($inputCSVString =~ m/(\d{$grouping[0]}($|$thousands_sep))+/gl) {
#~ while ($inputCSVString =~ m/(\d)(\d{$grouping[0]}($|$thousands_sep))+/gl) {
# match a string that starts with digit, and contains only digits, thousands separators and decimal points
# note - it will NOT match negative numbers
while ($inputCSVString =~ m/\d[\d$thousands_sep$decimal_point]+/gl) {
my $numstrmatch = $&;
my $unnumstr = unformat_number($numstrmatch); # should unformat according to current locale ()
my $posixnumstr = $posixNumFormatter->format_number($unnumstr);
print "B Found '$numstrmatch' (unf: '$unnumstr', form: '$posixnumstr'). Next attempt at character " . (pos($inputCSVString)+1) . "\n";
}
sub convertNumStr{
my $numstrmatch = $_[0];
my $unnumstr = unformat_number($numstrmatch);
# if an integer, return as is so it doesn't change trailing zeroes, if the number is a label
# ($decimal_point is the en_DK decimal point read from localeconv() above)
if ( (isint $unnumstr) && ( $numstrmatch !~ m/\Q$decimal_point\E/) ) { return $numstrmatch; }
#~ print "--- $unnumstr\n";
# find the length of the string after the decimal point - the precision
my $precision_strlen = length( substr( $numstrmatch, index($numstrmatch, $decimal_point)+1 ) );
# must manually spec precision and trailing zeroes here:
my $posixnumstr = $posixNumFormatter->format_number($unnumstr, $precision_strlen, 1);
return $posixnumstr;
}
# e modifier to evaluate perl Code
(my $replaceString = $inputCSVString) =~ s/(\d[\d$thousands_sep$decimal_point]+)/"".convertNumStr($1).""/gle;
print "Orig string: " . $inputCSVString . "\n";
print "Conv string: " . $replaceString . "\n";
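The CSV-aware alternative mentioned above (parse row by row, then check each field) could look roughly like the sketch below. It is in Python rather than Perl, purely as an illustration, and it does not consult the system locale at all: the da_DK-style pattern (`.` as thousands separator in groups of three, `,` as decimal point) is hard-coded as an assumption, and the names `DK_NUMBER`, `dk_to_posix` and `convert_csv` are made up for this example. It also assumes every field should be quoted on output, as in the sample data.

```python
import csv
import io
import re

# Assumed da_DK-style number: optional sign, digits with optional "." groups
# of three, optional "," followed by decimals. Note that a field like "1.234"
# is therefore read as the integer 1234.
DK_NUMBER = re.compile(r"-?(?:\d{1,3}(?:\.\d{3})+|\d+)(?:,\d+)?")

def dk_to_posix(field):
    """Convert a field only if the whole field is a da_DK-style number."""
    if not DK_NUMBER.fullmatch(field):
        return field
    # Drop thousands separators, then turn the decimal comma into a dot.
    # Decimal places are preserved because this is plain string editing.
    return field.replace(".", "").replace(",", ".")

def convert_csv(text):
    """Parse CSV text field by field and re-emit it with every field quoted."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in csv.reader(io.StringIO(text)):
        writer.writerow([dk_to_posix(f) for f in row])
    return out.getvalue()

sample = ('"AO900-020","Hello","World","5000","0,00","5,25","stk","","1",'
          '"0,00","Test 2","42.234,12","","","0,00","","","","5,25"\n')
print(convert_csv(sample), end="")
```

Because only whole fields are tested, labels such as "AO900-020" or "Test 2" are never touched, which avoids some of the edge cases of the regex-over-raw-text approach.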
Updated: this removes the dot in digits.digits and turns digits,digits into digits.digits, anywhere in the text:
sed -e 's/\([0-9]\+\)\.\([0-9]\+\)/\1\2/g' -e 's/\([0-9]\+\),\([0-9]\+\)/\1.\2/g'
Orig string: "AO900-020","Hello","World","5000","0,00","5,25","stk","","1","0,00","Test 2","42.234,12","","","0,00","","","","5,25"
Conv string: "AO900-020","Hello","World","5000","0.00","5.25","stk","","1","0.00","Test 2","42234.12","","","0.00","","","","5.25"
(same example i/o as OP's perl answer)
note: this would be very bad if you have any unquoted fields in your csv.
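For comparison, here is a slightly more defensive take on the same CSV-agnostic idea, sketched in Python. As far as I can tell, the two sed expressions turn a number with more than one thousands group, such as 1.234.567,89, into 1234.567.89, because the first substitution only removes one dot per number. The pattern below is an assumption about what a da_DK-style number looks like (it is not derived from the locale), and `DK_NUMBER` and `dk_to_posix_text` are hypothetical names. The lookarounds refuse to match inside longer tokens, so an ambiguous unquoted run like 5,25,3,00 is left unchanged rather than mangled.

```python
import re

# Assumed da_DK-style number that is not glued to other word characters,
# dots or commas on either side.
DK_NUMBER = re.compile(
    r"(?<![\w.,])"                     # nothing number-like just before
    r"(?:\d{1,3}(?:\.\d{3})+|\d+)"     # grouped or plain integer part
    r"(?:,\d+)?"                       # optional decimal part
    r"(?![\w.,])"                      # nothing number-like just after
)

def dk_to_posix_text(text):
    """Rewrite every da_DK-style number found in free text to POSIX form."""
    return DK_NUMBER.sub(
        lambda m: m.group(0).replace(".", "").replace(",", "."), text
    )

print(dk_to_posix_text('"Test 2","42.234,12","0,00","1.234.567,89"'))
```

This still cannot tell a number from a label that merely looks like one, so the CSV-aware route is safer when the column layout is known.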
So say I have a string with some underscores like hi_there.
Is there a way to auto-convert that string into "hi there"?
(the original string, by the way, is a variable name that I'm converting into a plot title).
Surprising that no-one has yet mentioned strrep:
>> strrep('string_with_underscores', '_', ' ')
ans =
string with underscores
which should be the official way to do simple string replacements. For such a simple case, regexprep is overkill: yes, regular expressions are Swiss Army knives that can do almost anything, but they come with a long manual. The string indexing shown by AndreasH only works for replacing single characters; it cannot do this:
>> s = 'string*-*with*-*funny*-*separators';
>> strrep(s, '*-*', ' ')
ans =
string with funny separators
>> s(s=='*-*') = ' '
Error using ==
Matrix dimensions must agree.
As a bonus, it also works for cell-arrays with strings:
>> strrep({'This_is_a','cell_array_with','strings_with','underscores'},'_',' ')
ans =
'This is a' 'cell array with' 'strings with' 'underscores'
Try this Matlab code for a string variable 's'
s(s=='_') = ' ';
If you ever have to do anything more complicated, say doing a replacement of multiple variable length strings,
s(s == '_') = ' ' will be a huge pain. If your replacement needs ever get more complicated consider using regexprep:
>> regexprep({'hi_there', 'hey_there'}, '_', ' ')
ans =
'hi there' 'hey there'
That being said, in your case @AndreasH.'s solution is the most appropriate and regexprep is overkill.
A more interesting question is why you are passing variables around as strings?
regexprep() may be what you're looking for and is a handy function in general.
regexprep('hi_there','_',' ')
This takes the first argument string and replaces every instance of the second argument with the third. In this case it replaces all underscores with a space.
In Matlab strings are vectors, so performing simple string manipulations can be achieved using standard operators e.g. replacing _ with whitespace.
text = 'variable_name';
text(text=='_') = ' '; % replace all occurrences of underscore with whitespace
=> text = variable name
I know this was already answered; however, in my case I was looking for a way to correct plot titles so that I could include a filename (which could have underscores). I wanted the underscores to be printed as underscores, NOT interpreted as subscripts. So, using the great info above, I replaced each underscore with an escaped underscore rather than with a space.
For example:
% Have the user select a file:
[infile inpath]=uigetfile('*.txt','Get some text file');
figure
% this is a problem for filenames with underscores
title(infile)
% this correctly displays filenames with underscores
title(strrep(infile,'_','\_'))
I would like to concatenate strings. I tried using strcat:
x = 5;
m = strcat('is', num2str(x))
but this function removes trailing white-space characters from each string. Is there another MATLAB function to perform string concatenation which maintains trailing white-space?
You can use horzcat instead of strcat:
>> strcat('one ','two')
ans =
onetwo
>> horzcat('one ','two')
ans =
one two
Alternatively, if you're going to be substituting numbers into strings, it might be better to use sprintf:
>> x = 5;
>> sprintf('is %d',x)
ans =
is 5
How about
strcat({' is '},{num2str(5)})
that gives (as a 1-by-1 cell array, since strcat keeps whitespace for cell array inputs)
' is 5'
Have a look at the final example in the strcat documentation: try using horizontal array concatenation instead of strcat:
m = ['is ', num2str(x)]
Also, have a look at sprintf for more information on string formatting (leading/trailing spaces etc.).
How about using strjoin?
x = 5;
m ={'is', num2str(x)};
strjoin(m, ' ')
What spaces does this not take into account? Only the spaces you haven't mentioned! Did you mean:
m = strcat( ' is ',num2str(x) )
perhaps ?
Matlab isn't going to guess (a) that you want spaces or (b) where to put the spaces it guesses you want.