How to efficiently split a string into lines in J? - j

I'm trying to parse a large-ish CSV file in J, and here's the line-splitting routine that I came up with:
splitlines =: 3 : 0
NB. y is the input string
nl_positions =. (y = (10 { a.)) NB. 1 if the character in that position is a newline, 0 otherwise
nl_idx =. (# i.##) nl_positions NB. A list of newline indexes in the input string
prev_idx =. (# nl_idx) {. 0 , nl_idx NB. The list above, shifted one position to the right, with 0 as the first element
result =. ''
for_i. nl_idx do. NB. For each newline
to_drop =. i_index { prev_idx NB. The number of characters from the start of the string to skip
to_take =. i - to_drop NB. The number of characters in the current line
result =. result , < (to_take {. to_drop }. y) NB. Take the current line, box it and add to the result
end.
)
It's really slow, though. The performance monitor shows that line 8 takes the longest, likely because of all the memory allocation when dropping and taking elements and extending the result list:
Time (seconds)
┌────────┬────────┬─────┬─────────────────────────────────────────┐
│all │here │rep │splitlines │
├────────┼────────┼─────┼─────────────────────────────────────────┤
│0.000011│0.000011│ 1│monad │
│0.003776│0.003776│ 1│[1] nl_positions=.(y=(10{a.)) │
│0.012429│0.012429│ 1│[2] nl_idx=.(#i.##)nl_positions │
│0.000144│0.000144│ 1│[3] prev_idx =.(#nl_idx){.0,nl_idx │
│0.000002│0.000002│ 1│[4] result=.'' │
│0.027566│0.027566│ 1│[5] for_i. nl_idx do. │
│0.940466│0.940466│20641│[6] to_drop=.i_index{prev_idx │
│0.011238│0.011238│20641│[7] to_take=.i-to_drop │
│4.310495│4.310495│20641│[8] result=.result,<(to_take{.to_drop}.y)│
│0.006926│0.006926│20641│[9] end. │
│5.313052│5.313052│ 1│total monad │
└────────┴────────┴─────┴─────────────────────────────────────────┘
Is there a better way to do this?
I'm looking for a way to:
Slice a list without memory allocation
Maybe replace the whole for loop with one single array instruction

If I understand correctly, you are currently just wanting to split a string containing multiple lines, into the separate lines. (I imagine splitting the lines into fields will be the next step at some later stage?)
The key primitive that does the heavy lifting for most of what you are wanting to do is cut (;.). For example:
<;._2 InputString NB. box each segment terminated by the last character in the string
<;._1 InputString NB. box each segment of InputString starting with the first character in the string
cut;._2 InputString NB. for each segment terminated by the last character, box the pieces separated by 1 or more spaces
Other related resources that you may find useful are: splitstring, freads, the tables/dsv and tables/csv addons. freads and splitstring are both available in the standard library (post J6).
'b' freads 'myfile.txt' NB. returns contents of myfile.txt boxed by the last character (equivalent to <;._2 freads 'myfile.txt')
'","' splitstring InputString NB. boxed sub-strings of input string delimited by left argument
The tables/dsv and tables/csv addons can be installed using the Package Manager. Once installed they can be used to split lines and fields within lines as follows:
require 'tables/csv'
readcsv 'myfile.csv'
',' readdsv 'myfile.txt'
TAB readdsv 'myfile.txt'
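For readers who think more easily in Python, here is a rough sketch of what <;._2 and <;._1 compute. The function names cut_terminated and cut_started are made up for this illustration; this only mirrors the splitting behaviour and says nothing about how J implements cut internally.

```python
def cut_terminated(text):
    """Mimic <;._2 : the LAST character is the delimiter and ends each segment."""
    if not text:
        return []
    fret = text[-1]
    # Every segment ends with the delimiter, so split() leaves one
    # trailing empty piece after the final delimiter; drop it.
    return text.split(fret)[:-1]


def cut_started(text):
    """Mimic <;._1 : the FIRST character is the delimiter and starts each segment."""
    if not text:
        return []
    fret = text[0]
    # Every segment begins with the delimiter, so split() leaves one
    # leading empty piece before the first delimiter; drop it.
    return text.split(fret)[1:]


lines = cut_terminated("a,b\nc,d\n\ne,f\n")
fields = cut_started(",a,b,,c")
```

Empty segments are kept in both cases, which matters for CSV files that contain blank lines or empty fields.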

Related

Replace atom in array of strings

Suppose I have an array of strings "31,793.1" "29,798.6" "30,455.7" "29,700.9"
How do I replace , with nothing to give me "31793.1" "29798.6" "30455.7" "29700.9"?
Another example is to replace - in "-5" "-3" "-4" with _ to give "_5" "_3" "_4" .
"31,793.1" "29,798.6" "30,455.7" "29,700.9" would not be an "array of strings" in J. I will suppose that you have a line like this and you want to end up with an array of numbers:
data =: '"31,793.1" "29,798.6" "30,455.7" "29,700.9" "-5"'
NB. Convert commas to "null" and '-'s to '_'s
NB. rplc works in pairs 'old';'new'
]data =: data rplc ',';'';'-';'_'
"31793.1" "29798.6" "30455.7" "29700.9" "_5"
NB. remove '"'s
]data =: data rplc '"';''
31793.1 29798.6 30455.7 29700.9 _5
Normally now you would have to split on whitespace (there are many ways to do this) but converting to numbers using ". takes care of this here:
NB. convert the cleaned string to a numeric list
]data =: ". data
31793.1 29798.6 30455.7 29700.9 _5
+/ data
121743
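As a cross-check of the same clean-then-convert steps, here is a small Python sketch. The helper name parse_numbers is hypothetical; Python already reads a leading - as a negative sign, so only the commas and the double quotes have to be stripped.

```python
data = '"31,793.1" "29,798.6" "30,455.7" "29,700.9" "-5"'


def parse_numbers(line):
    """Strip thousands separators and quotes, then convert each token to float."""
    cleaned = line.replace(",", "").replace('"', "")
    # split() with no argument splits on runs of whitespace
    return [float(token) for token in cleaned.split()]


numbers = parse_numbers(data)
total = sum(numbers)
```

The total is 121743.3; J displays 121743 above because its default print precision is 6 significant digits.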

Shifting string using Julia

This HackerRank challenge is to shift a string using Julia (the programming language) and return the result. I have a function that takes 3 arguments.
shiftStrings("string", leftShifts, rightShifts)
Left Shift: A single circular rotation of the string in which the first character becomes the last character and all other characters are shifted one index to the left. For example, abcde becomes bcdea after 1 left shift and cdeab after 2 left shifts.
Right Shift: A single circular rotation of the string in which the last character becomes the first character and all other characters are shifted to the right. For example, abcde becomes eabcd after 1 right shift and deabc after 2 right shifts.
I passed only 3 out of 13 test cases. Here is my solution. Please let me know if there is a better one.
Please refer to this question, where it was solved in Python:
How to shift characters in a string? - Hackerrank challenge
If you take a look at the question you linked they already had an answer to that in python.
def getShiftedString(s, leftShifts, rightShifts):
    i = (leftShifts - rightShifts) % len(s)
    return s[i:] + s[:i]
If you want to shift a string to the right and then to the left you just need the difference of both. I.e. if you shift 3 to the right and 3 to the left it's like you didn't change anything. "abcde" 3 to the left -> "deabc" 3 to the right -> "abcde".
Therefore, this leads to:
function shiftStrings(str, l, r)
    i = mod(l - r, length(str))
    str[i+1:end] * str[1:i]
end
Note:
i is the net number of left shifts (the left shifts minus the right shifts, taken modulo the string length)
Python indexing starts from 0, whereas Julia indexing starts from 1, and the modulo is 0 whenever l - r is a multiple of the length; that's why the slices are [i+1:end] and [1:i]
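To convince yourself that only the net shift matters, here is a small Python sketch that compares one-step-at-a-time rotation with the single-slice formula. The names shift_stepwise and shift_net are invented for this check, and the empty-string guard is an addition that the answers above do not include.

```python
def shift_stepwise(s, left, right):
    """Apply each single rotation literally, as the challenge describes it."""
    for _ in range(left):
        s = s[1:] + s[:1]      # first character moves to the end
    for _ in range(right):
        s = s[-1:] + s[:-1]    # last character moves to the front
    return s


def shift_net(s, left, right):
    """Rotate once by the net number of left shifts."""
    if not s:
        return s               # avoid a modulo by zero on the empty string
    i = (left - right) % len(s)
    return s[i:] + s[:i]


one_left = shift_net("abcde", 1, 0)
two_right = shift_net("abcde", 0, 2)
```

Python's % returns a non-negative result for a positive modulus, just like Julia's mod, so a net right shift needs no special case.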
If the string contains multi-byte UTF-8 characters, you can formulate it like this:
function shiftStrings(str, l, r)
    i = mod(l - r, length(str))
    i == 0 && return str  # nothing to rotate; also avoids indexing indices[0]
    indices = collect(eachindex(str))
    str[indices[i+1]:end] * str[1:indices[i]]
end
UTF-8 characters do not always occupy 1 byte each, which is why the character positions differ from the String indices. (String indices count bytes, and some characters like the German "ö" need more than 1 byte.)
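The gap between character positions and byte offsets can be seen in a short Python sketch. The helper char_start_offsets is hypothetical and uses 0-based byte offsets, whereas Julia's eachindex yields the corresponding 1-based indices.

```python
def char_start_offsets(text):
    """Return the 0-based byte offset at which each character starts in UTF-8."""
    offsets = []
    position = 0
    for ch in text:
        offsets.append(position)
        position += len(ch.encode("utf-8"))  # 1 to 4 bytes per character
    return offsets


word = "schön"
n_chars = len(word)                   # counts characters
n_bytes = len(word.encode("utf-8"))   # counts bytes
offsets = char_start_offsets(word)
```

Here "ö" takes 2 bytes, so the character after it starts one byte later than its character position suggests.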
function getShiftedString(s, leftShifts, rightShifts)
    len=length(s)
    lr=mod(leftShifts,len)
    rr=len-mod(rightShifts,len)
    ls1=SubString(s,lr+1,length(s))
    ls2=SubString(s,1,lr)
    ls=ls1*ls2
    rs1=SubString(ls,rr+1,length(s))
    rs2=SubString(ls,1,rr)
    rs=rs1*rs2
end

Finding mean of ascii values in a string MATLAB

The string I am given is as follows:
scrap1 =
a le h
ke fd
zyq b
ner i
You'll notice there are 2 blank positions in each row, each holding a space (ASCII 32). I need to find the mean ASCII value in each column without taking the spaces (32) into account. So first I would convert to numeric codes with double(scrap1), but then how do I find the mean without taking the spaces into account?
If it's only the ASCII 32 you want to omit:
d = double(scrap1);
result = mean(d(d~=32)); %// logical indexing to remove unwanted value, then mean
You can remove the intermediate spaces in the string with scrap1(scrap1 == ' ') = ''; This replaces any space in the input with an empty string. Then you can do the conversion to double and average the result. See here for other methods.
You can probably use a regular expression ("\s") to find the spaces and ignore them:
findSpace = regexp(scrap1, '\s', 'ignore')
% I am not sure about the 'ignore' option; this is just what comes to mind. You can read more about regexp by typing doc regexp.
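Since the question asks for a mean per column, here is a minimal Python sketch of that variant, assuming all rows have the same length (as in a MATLAB char matrix). The function name and the sample rows are made up for illustration; they are not the scrap1 data from the question.

```python
import math


def column_means_ignoring_spaces(rows):
    """Mean character code of each column, skipping spaces (code 32)."""
    means = []
    for column in zip(*rows):            # walk the rows column by column
        codes = [ord(ch) for ch in column if ch != " "]
        # A column made only of spaces has no defined mean; use NaN
        means.append(sum(codes) / len(codes) if codes else math.nan)
    return means


sample = ["ab ", " cd"]                  # hypothetical 2x3 character matrix
means = column_means_ignoring_spaces(sample)
```

For the sample, the first column contains only 'a' (97), the second averages 'b' and 'c' (98 and 99), and the third contains only 'd' (100).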

Output without boxes

My first hour with the (fascinating) J language, and things are looking good, but I've met my first problem I can't analyse with the help of the documentation.
How to count characters in a string:
y =. 'hello world'
(~. y) ; " 0 (+/ " 1 =y)
┌─┬─┐
│h│1│
├─┼─┤
│e│1│
├─┼─┤
│l│3│
├─┼─┤
│o│2│
├─┼─┤
│ │1│
├─┼─┤
│w│1│
├─┼─┤
│r│1│
├─┼─┤
│d│1│
└─┴─┘
But how can I lose the boxes? I gather that ; joins two items in a horizontal box, but it seems hard to do this without the box characters. How do I get:
h 1
e 1
l 3
and so on? Is it a presentation option?
Well, for starters you must note that you can't group elements of different types in a matrix. The boxing here helps because it puts all elements on the same footing: all boxes.
One thing you can do is format (":) to text and stitch (,.) instead of link (;):
(~.y),. (2 ": "0 +/ "1 = y)
The 2 in 2 ": serves as a format indicator, 2 digits-wide, padded with a blank space.
UPDATE:
Or in the style of Eelvex:
({. , 2&":&#) /. ~ y
There are a few elements that might be new here. Append (,) is used instead of stitch (,.) because each result of key (/.) is produced separately: we're not sticking two arrays side by side, but rather creating one big array one row at a time.
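The row-at-a-time idea behind key can be sketched in Python: group equal characters in order of first appearance, then build one text row per group from the character and its count formatted 2 wide. The names char_counts and format_counts are invented for this sketch.

```python
def char_counts(text):
    """Count each character, keeping the order of first appearance."""
    counts = {}                      # dicts preserve insertion order
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


def format_counts(text):
    """One row per distinct character: the character, then its count 2 wide."""
    return [f"{ch}{n:2d}" for ch, n in char_counts(text).items()]


rows = format_counts("hello world")
table = "\n".join(rows)
```

Because every row is plain text of the same width, no boxes are needed to hold the character and the number together.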

How to break a big lua string into small ones

I have a big string (a base64 encoded image) and it is 1050 characters long. How can I append a big string formed of small ones, like this in C
function GetIcon()
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
According to Programming in Lua 2.4 Strings:
We can delimit literal strings also by matching double square brackets [[...]]. Literals in this bracketed form may run for several lines, may nest, and do not interpret escape sequences. Moreover, this form ignores the first character of the string when this character is a newline. This form is especially convenient for writing strings that contain program pieces; for instance,
page = [[
<HTML>
<HEAD>
<TITLE>An HTML Page</TITLE>
</HEAD>
<BODY>
Lua
[[a text between double brackets]]
</BODY>
</HTML>
]]
This is the closest thing to what you are asking for, but using the above method keeps the newlines embedded in the string, so this will not work directly.
You can also do this with string concatenation (using ..):
value = "long text that" ..
" I want to carry over" ..
"onto multiple lines"
Most answers here solve this issue at run-time and not at compile-time.
Lua 5.2 introduces the escape sequence \z to solve this problem elegantly without incurring any run-time expense.
> print "This is a long \z
>> string with \z
>> breaks in between, \z
>> and is spanning multiple lines \z
>> but still is a single string only!"
This is a long string with breaks in between, and is spanning multiple lines but still is a single string only!
\z skips all subsequent whitespace characters in a string literal [1], up to the first non-whitespace character. This works for non-multiline literal text too.
> print "This is a simple \z string"
This is a simple string
From Lua 5.2 Reference Manual
The escape sequence '\z' skips the following span of white-space characters, including line breaks; it is particularly useful to break and indent a long literal string into multiple lines without adding the newlines and spaces into the string contents.
1: All escape sequences, including \z, work only on short literal strings ("…", '…') and, understandably, not on long literal strings ([[...]], etc.)
I'd put all chunks in a table and use table.concat on it. This avoids the creation of new strings at every concatenation. For example (ignoring the per-string overhead in Lua):
-- bytes used
foo="1234".. -- 4 = 4
"4567".. -- 4 + 4 + 8 = 16
"89ab" -- 16 + 4 + 12 = 32
-- | | | \_ grand total after concatenation on last line
-- | | \_ second operand of concatenation
-- | \_ first operand of concatenation
-- \_ total size used until last concatenation
As you can see, this explodes pretty rapidly. It's better to:
foo=table.concat{
"1234",
"4567",
"89ab"}
Which will take about 3*4+12=24 bytes.
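The byte counts above follow a simple accounting model, which can be sketched in Python to see how it scales. This only reproduces the arithmetic of the comments (each concatenation costs its second operand plus the newly built result); it is not a measurement of Lua's actual memory use, and both function names are hypothetical.

```python
def bytes_for_chained_concat(chunks):
    """Cumulative bytes under the model: each step adds operand + new result."""
    total = 0
    current_len = 0
    for index, chunk in enumerate(chunks):
        if index == 0:
            total = current_len = len(chunk)   # the first literal alone
        else:
            current_len += len(chunk)          # size of the new result string
            total += len(chunk) + current_len  # second operand + result
    return total


def bytes_for_single_join(chunks):
    """All chunks once, plus one final result of the combined length."""
    combined = sum(len(chunk) for chunk in chunks)
    return combined + combined


chunks = ["1234", "4567", "89ab"]
chained = bytes_for_chained_concat(chunks)
joined = bytes_for_single_join(chunks)
```

Under this model the chained cost grows quadratically with the number of chunks, while the single join grows linearly: 100 chunks of 4 bytes give 20596 versus 800.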
Have you tried the string.sub(s, i [, j]) function?
You may like to look here:
http://lua-users.org/wiki/StringLibraryTutorial
This:
return "Bigggg string 1"\
"continuation of string"\
"continuation of string"\
"End of string"
C/C++ syntax causes the compiler to see it all as one large string. It is generally used for readability.
The Lua equivalent would be:
return "Bigggg string 1" ..
"continuation of string" ..
"continuation of string" ..
"End of string"
Do note that the C/C++ syntax is compile-time, while the Lua equivalent likely does the concatenation at runtime (though the compiler could theoretically optimize it). It shouldn't be a big deal though.
