Splitting a Comma-Separated String in Scala: Missing Trailing Empty Strings? - string

I have a data file in csv format.
I am trying to split each line using the basic split command line.split(',')
But when I get a string like this "2,,",
instead of returning an array as thus Array(2,"","")
I just get an Array: Array(2).
I am most definitely missing something basic, could someone help point out the correct way to split a comma separated string here?

This is inherited from Java. You can achieve behavior you want by using the split(String regex, int limit) overload:
"2,,".split(",", -1) // = Array(2, "", "")
Note the String instead of Char.
As explained by the Java Docs, the limit parameter is used as follows:
The limit parameter controls the number of times the pattern is
applied and therefore affects the length of the resulting array. If
the limit n is greater than zero then the pattern will be applied at
most n - 1 times, the array's length will be no greater than n, and
the array's last entry will contain all input beyond the last matched
delimiter. If n is non-positive then the pattern will be applied as
many times as possible and the array can have any length. If n is zero
then the pattern will be applied as many times as possible, the array
can have any length, and trailing empty strings will be discarded.
Source
Using split(separator: Char) will call the overload above, using a limit of zero.

Related

How to remove the last three characters from a string with a dynamic length in Azure Data Factory?

I have a string variable in Azure Data Factory (v2) from which I want to remove the last 3 characters. Initially I thought using the substring operation, it requires startIndex and length parameters. However the startIndex can vary because the string does not have a fixed length. Any suggestions on how to tackle this?
Example, remove last 3 characters, string length varies:
'abcdef' -> 'abc'
'helloworld' -> 'hellowo'
'2020_18' -> '2020'
We can use length function to get the length of the string.
Then we can use substring function, the expression like substring('String',0, length-3).
In ADF, we can use the expression #substring('helloworld',0,sub(length('helloworld'),3)) to remove the last 3 characters

Find the minimal lexographical string formed by merging two strings

Suppose we are given two strings s1 and s2(both lowercase). We have two find the minimal lexographic string that can be formed by merging two strings.
At the beginning , it looks prettty simple as merge of the mergesort algorithm. But let us see what can go wrong.
s1: zyy
s2: zy
Now if we perform merge on these two we must decide which z to pick as they are equal, clearly if we pick z of s2 first then the string formed will be:
zyzyy
If we pick z of s1 first, the string formed will be:
zyyzy which is correct.
As we can see the merge of mergesort can lead to wrong answer.
Here's another example:
s1:zyy
s2:zyb
Now the correct answer will be zybzyy which will be got only if pick z of s2 first.
There are plenty of other cases in which the simple merge will fail. My question is Is there any standard algorithm out there used to perform merge for such output.
You could use dynamic programming. In f[x][y] store the minimal lexicographical string such that you've taken x charecters from the first string s1 and y characters from the second s2. You can calculate f in bottom-top manner using the update:
f[x][y] = min(f[x-1][y] + s1[x], f[x][y-1] + s2[y]) \\ the '+' here represents
\\ the concatenation of a
\\ string and a character
You start with f[0][0] = "" (empty string).
For efficiency you can store the strings in f as references. That is, you can store in f the objects
class StringRef {
StringRef prev;
char c;
}
To extract what string you have at certain f[x][y] you just follow the references. To udapate you point back to either f[x-1][y] or f[x][y-1] depending on what your update step says.
It seems that the solution can be almost the same as you described (the "mergesort"-like approach), except that with special handling of equality. So long as the first characters of both strings are equal, you look ahead at the second character, 3rd, etc. If the end is reached for some string, consider the first character of the other string as the next character in the string for which the end is reached, etc. for the 2nd character, etc. If the ends for both strings are reached, then it doesn't matter from which string to take the first character. Note that this algorithm is O(N) because after a look-ahead on equal prefixes you know the whole look-ahead sequence (i.e. string prefix) to include, not just one first character.
EDIT: you look ahead so long as the current i-th characters from both strings are equal and alphabetically not larger than the first character in the current prefix.

split string by char

scala has a standard way of splitting a string in StringOps.split
it's behaviour somewhat surprised me though.
To demonstrate, using the quick convenience function
def sp(str: String) = str.split('.').toList
the following expressions all evaluate to true
(sp("") == List("")) //expected
(sp(".") == List()) //I would have expected List("", "")
(sp("a.b") == List("a", "b")) //expected
(sp(".b") == List("", "b")) //expected
(sp("a.") == List("a")) //I would have expected List("a", "")
(sp("..") == List()) // I would have expected List("", "", "")
(sp(".a.") == List("", "a")) // I would have expected List("", "a", "")
so I expected that split would return an array with (the number a separator occurrences) + 1 elements, but that's clearly not the case.
It is almost the above, but remove all trailing empty strings, but that's not true for splitting the empty string.
I'm failing to identify the pattern here. What rules does StringOps.split follow?
For bonus points, is there a good way (without too much copying/string appending) to get the split I'm expecting?
For curious you can find the code here.https://github.com/scala/scala/blob/v2.12.0-M1/src/library/scala/collection/immutable/StringLike.scala
See the split function with the character as an argument(line 206).
I think, the general pattern going on over here is, all the trailing empty splits results are getting ignored.
Except for the first one, for which "if no separator char is found then just send the whole string" logic is getting applied.
I am trying to find if there is any design documentation around these.
Also, if you use string instead of char for separator it will fall back to java regex split. As mentioned by #LRLucena, if you provide the limit parameter with a value more than size, you will get your trailing empty results. see http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String,%20int)
You can use split with a regular expression. I´m not sure, but I guess that the second parameter is the largest size of the resulting array.
def sp(str: String) = str.split("\\.", str.length+1).toList
Seems to be consistent with these three rules:
1) Trailing empty substrings are dropped.
2) An empty substring is considered trailing before it is considered leading, if applicable.
3) First case, with no separators is an exception.
split follows the behaviour of http://docs.oracle.com/javase/7/docs/api/java/lang/String.html#split(java.lang.String)
That is split "around" the separator character, with the following exceptions:
Regardless of anything else, splitting the empty string will always give Array("")
Any trailing empty substrings are removed
Surrogate characters only match if the matched character is not part of a surrogate pair.

Groovy split without final trim

I'm parsing a CVS file like the following:
"07555555555",25.70,18/11/2010,01/03/2011,N,133,0,36,,896,537,547,,Mr,John,Doe,,
"07555555555",10.15,26/01/2011,01/03/2011,N,16,0,100,,896,537,547,,Mrs,Jane,Doe,,jane#doe.com
The thing is that when using a script like this:
file.eachLine{ line ->
items = line.split(",")
println items.length
}
The result is like the following:
16
18
Which makes me thing that the split function removes a final values. I need it to have all the items even if they are empty. Any idea?
I think you want to do:
items = line.split(',', -1)
to make sure you get all the tokens
(according to the javadoc):
The limit parameter controls the
number of times the pattern is applied
and therefore affects the length of
the resulting array. If the limit n is
greater than zero then the pattern
will be applied at most n - 1 times,
the array's length will be no greater
than n, and the array's last entry
will contain all input beyond the last
matched delimiter. If n is
non-positive then the pattern will be
applied as many times as possible and
the array can have any length. If n is
zero then the pattern will be applied
as many times as possible, the array
can have any length, and trailing
empty strings will be discarded.

Array of Strings in Fortran 77

I've a question about Fortran 77 and I've not been able to find a solution.
I'm trying to store an array of strings defined as the following:
character matname(255)*255
Which is an array of 255 strings of length 255.
Later I read the list of names from a file and I set the content of the array like this:
matname(matcount) = mname
EDIT: Actually mname value is hardcoded as mname = 'AIR' of type character*255, it is a parameter of a function matadd() which executes the previous line. But this is only for testing, in the future it will be read from a file.
Later on I want to print it with:
write(*,*) matname(matidx)
But it seems to print all the 255 characters, it prints the string I assigned and a lot of garbage.
So that is my question, how can I know the length of the string stored?
Should I have another array with all the lengths?
And how can I know the length of the string read?
Thanks.
You can use this function to get the length (without blank tail)
integer function strlen(st)
integer i
character st*(*)
i = len(st)
do while (st(i:i) .eq. ' ')
i = i - 1
enddo
strlen = i
return
end
Got from here: http://www.ibiblio.org/pub/languages/fortran/ch2-13.html
PS: When you say: matname(matidx) it gets the whole string(256) chars... so that is your string plus blanks or garbage
The function Timotei posted will give you the length of the string as long as the part of the string you are interested in only contains spaces, which, if you are assigning the values in the program should be true as FORTRAN is supposed to initialize the variables to be empty and for characters that means a space.
However, if you are reading in from a file you might pick up other control characters at the end of the lines (particularly carriage return and/or line feed characters, \r and/or \n depending on your OS). You should also toss those out in the function to get the correct string length. Otherwise you could get some funny print statements as those characters are printed as well.
Here is my version of the function that checks for alternate white space characters at the end besides spaces.
function strlen(st)
integer i,strlen
character st*(*)
i = len(st)
do while ((st(i:i).eq.' ').or.(st(i:i).eq.'\r').or.
+ (st(i:i).eq.'\n').or.(st(i:i).eq.'\t'))
i = i - 1
enddo
strlen = i
return
end
If there are other characters in the "garbage" section this still won't work completely.
Assuming that it does work for your data, however, you can then change your write statement to look like this:
write(*,*) matname(matidx)(1:strlen(matname(matidx)))
and it will print out just the actual string.
As to whether or not you should use another array to hold the lengths of the string, that is up to you. the strlen() function is O(n) whereas looking up the length in a table is O(1). If you find yourself computing the lengths of these static strings often, it may improve performance to compute the length once when they are read in, store them in an array and look them up if you need them. However, if you don't notice the slowdown, I wouldn't worry about it.
Depending on the compiler that you are using, you may be able to use the trim() intrinsic function to remove any leading/trailing spaces from a string, then process it as you normally would, i.e.
character(len=25) :: my_string
my_string = 'AIR'
write (*,*) ':', trim(my_string), ':'
should print :AIR:.
Edit:
Better yet, it looks like there is a len_trim() function that returns the length of a string after it has been trimmed.
intel and Compaq Visual Fortran have the intrinsic function LEN_TRIM(STRING) which returns the length without trailing blanks or spaces.
If you want to suppress leading blanks or spaces, use "Adjust Left" i.e. ADJUSTF(STRING)
In these FORTRANs I also note a useful feature: If you pass a string in to a function or subroutine as an argument, and inside the subroutine it is declared as CHARACTER*(*), then
using the LEN(STRING) function in the subroutine retruns the actual string length passed in, and not the length of the string as declared in the calling program.
Example:
CHARACTER*1000 STRING
.
.
CALL SUBNAM(STRING(1:72)
SUBROUTINE SYBNAM(STRING)
CHARACTER*(*) STRING
LEN(STRING) will be 72, not 1000

Resources