J string manipulation using only builtins

You are given a string like ))()(())(, and you wish to remove all instances of () from the string, which in this case means these two instances:
))()(())(
  ^^ ^^
leaving only ))()(.
I know you can use the library function stringreplace, or you could load up a regex library, but what I want to know is: is there a concise way of accomplishing this with the J builtin operators exclusively?
I should clarify that my own solution was:
#~(-.@+._1&|.)@('()'&E.)
which I consider verbose -- so any similar solutions would not qualify as "concise" in my book. I'm really asking if there is a way to use a builtin (or maybe a simple combination of 2) to solve this directly. I expect the answer is no.

I think you are right that there is no ultra-concise way of expressing the operation you want to perform using just J primitives. The version I came up with was very much like the one Dan suggested above.
However, given that a built-in library verb rplc (based on stringreplace) performs exactly the operation you are after, I'm not sure why it would be better to replace it with a primitive.
'))()(())(' rplc '()';''
))()(
Having said that, if you can come up with a compelling case, then there is probably no reason it couldn't be added.

Not sure how concise it is, but I think that this will work:
deparen=. (-.@:(+/)@:(_1&|. ,: ])@:E. # ])
'()' deparen '))()(())('
))()(
Essentially the work is done by -.@:(+/)@:(_1&|. ,: ])@:E. to create a bit string that removes the '()' instances using # (Copy) on the right argument.
E. identifies the starting positions of '()' as a bit string. Shift and laminate to get the positions of both '(' and ')', add the two rows together to get 1 1 wherever there is a '()', and then negate so these positions become 0 0 and are removed by Copy.
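To make that concrete, here is a trace of the intermediate bit strings, written out long-hand rather than as a tacit verb:
'()' E. '))()(())('
0 0 1 0 0 1 0 0 0
+/ (_1&|. ,: ]) '()' E. '))()(())('
0 0 1 1 0 1 1 0 0
(-. +/ (_1&|. ,: ]) '()' E. '))()(())(') # '))()(())('
))()(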

Related

Determine when an SQL alias can be an open name

What would be the highest-performing implementation to determine if a string that represents an SQL alias needs to be wrapped in double-quotes?
Presently, in pg-promise I am wrapping every alias in double-quotes, to play it safe. I am looking to make the output SQL neater and shorter, if possible.
And I am divided on which approach is best:
to use a regular expression, somehow
to do a direct algorithm with strings
not to change it at all, if there are reasons for that
Basically, I am looking to improve function as.alias, if possible, not to wrap aliases into double quotes when it is not needed.
What have I tried so far...
I thought at first to do it only for the 99% of all cases - not to add double-quotes when your alias is the most typical one, just a simple word:
function skipQuotes(alias) {
    const m = alias.match(/[A-Z]+|[a-z]+/);
    return m && m[0] === alias;
}
This only checks that the alias is a single word in either all upper case or all lower case, but not a combination of the two.
SOLUTION
Following the answer, I ended up with an implementation that should cover 99% of all practical use cases, which is what I was trying to achieve:
const m = alias.match(/[a-z_][a-z0-9_$]*|[A-Z_][A-Z0-9_$]*/);
if (m && m[0] === alias) {
    // double quotes will be skipped
} else {
    // double quotes will be added
}
i.e. the surrounding double quotes are not added when the alias uses a simple syntax:
it is a same-case single word, without spaces
it can contain underscores, and can start with one
it can contain digits and $, but cannot start with those
Removing double quotes is admirable -- it definitely makes queries easier to read. The rules are pretty simple. A "valid" identifier consists of:
Letters (including diacritical marks), numbers, underscore, and dollar sign.
Starts with a letter (including diacriticals) or underscore.
Is not a reserved word.
(I think I have this summarized correctly. The real rules are in the documentation.)
The first two are readily implemented using regular expressions. The last probably wants a reference table for lookup (and the list varies by Postgres release -- although less than you might imagine).
Otherwise, the identifier needs to be surrounded by escape characters. Postgres uses double quotes (which is ANSI standard).
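As a rough sketch of those rules (not pg-promise's actual implementation: RESERVED is a tiny placeholder set rather than the full Postgres keyword list, and the character class only loosely approximates "letters including diacritics"):
const RESERVED = new Set(['select', 'from', 'where', 'group', 'order']); // placeholder only

function needsQuotes(identifier) {
    // letters (non-ASCII loosely allowed), digits, underscore and $; must not start with a digit or $
    const simple = /^[A-Za-z_\u0080-\uffff][A-Za-z0-9_$\u0080-\uffff]*$/.test(identifier);
    return !simple || RESERVED.has(identifier.toLowerCase());
}

// needsQuotes('total_amount') -> false; needsQuotes('order') -> true; needsQuotes('2nd') -> true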
One reason you may want to do this is because Postgres converts identifiers to lower case for comparison. So, the following works fine:
select xa, Xa, xA, "xa"
from (select 1 as Xa) y
However, this does not work:
select Xa
from (select 1 as "Xa") y
Nor does:
select "Xa"
from (select 1 as Xa) y
In fact, there is no way to refer to "Xa" without using quotes (at least none that I can readily think of).
Enforcing the discipline of exact matches can be a good thing or a bad thing. I find that one discipline too many: I admit to often ignoring case when writing "casual" code; it is just simpler to type without capitalization (or using double quotes). For more formal code, I try to be consistent.
On the other hand, the rules do allow:
select "Xa", "aX", ax
from (select 1 as "Xa", 2 as "aX", 3 as AX) y
(This returns 1, 2, 3.)
This is a naming convention that I would be happy to see disallowed.

What's the point of nesting brackets in Lua?

I'm currently teaching myself Lua for iOS game development, since I've heard lots of very good things about it. I'm really impressed by the level of documentation there is for the language, which makes learning it that much easier.
My problem is that I've found a Lua concept that nobody seems to have a "beginner's" explanation for: nested brackets for quotes. For example, I was taught that long strings with escaped single and double quotes like the following:
string_1 = "This is an \"escaped\" word and \"here\'s\" another."
could also be written without the overall surrounding quotes. Instead one would simply replace them with double brackets, like the following:
string_2 = [[This is an "escaped" word and "here's" another.]]
Those both make complete sense to me. But I can also write the string_2 line with "nested brackets," which include equal signs between both sets of the double brackets, as follows:
string_3 = [===[This is an "escaped" word and "here's" another.]===]
My question is simple. What is the point of the syntax used in string_3? It gives the same result as string_1 and string_2 when given as an input to print(), so I don't understand why nested brackets even exist. Can somebody please help a noob (me) gain some perspective?
It would be used if your string contains a substring that is equal to the delimiter. For example, the following would be invalid:
string_2 = [[This is an "escaped" word, the characters ]].]]
Therefore, in order for it to work as expected, you would need to use a different string delimiter, like in the following:
string_3 = [===[This is an "escaped" word, the characters ]].]===]
I think it's safe to say that not many string literals contain the substring ]], so in most cases there is no reason to use the above syntax.
It helps to, well, nest them:
print [==[malucart[[bbbb]]]bbbb]==]
Will print:
malucart[[bbbb]]]bbbb
But if that's not useful enough, you can use them to put whole programs in a string:
loadstring([===[print "o m g"]===])()
Will print:
o m g
I personally use them for my static/dynamic library implementation. If you don't know whether the program contains a closing bracket with the same number of =s, you can determine a safe level with something like this:
local c = 0
-- raise the level until the closing sequence "]" .. ("=" repeated c times) .. "]" no longer occurs in prog
while string.find(prog, "]" .. string.rep("=", c) .. "]", 1, true) do
    c = c + 1
end
-- do stuff
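Putting that together, a sketch of a complete helper might look like the following (quote is a hypothetical name; the extra check on the tail guards against a payload that ends in a bracket):
local function quote(prog)
    local c = 0
    -- pick a level c such that the closing delimiter cannot occur inside prog or collide with its last character
    while string.find(prog, "]" .. string.rep("=", c) .. "]", 1, true)
          or prog:sub(-(c + 1)) == "]" .. string.rep("=", c) do
        c = c + 1
    end
    local eq = string.rep("=", c)
    return "[" .. eq .. "[" .. prog .. "]" .. eq .. "]"
end

print(quote('print "o m g"'))  --> prints [[print "o m g"]]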

Sanitize string for comparison in Matlab

This is a follow-up question to this one, which considered evalc instead of fiddling with file descriptors manually. Below you can see an example of poor sanitization. I want to remove things such as trailing characters, whitespace, and newlines, which usually cause unexpected results. Is there a ready sanitization command to do this?
EDU>> a
a =
1 +1*{x} -1*{y}*{z}
EDU>> b
b =
1 +1*{x} -1*{y}*{z}
EDU>> isequal(a,b)
ans =
0
I don't know whether there exists any ready, robust implementation, but this works pretty well:
xx=@(x)regexprep(x,'\s',''); isequal(xx(a),xx(b))
where I use an anonymous function to remove whitespace oddities such as trailing spaces and newlines that are often hard to spot in the command window.
Also, the commands such as strtrim() and deblank() can be useful to you in removing trailing characters.

How do you specify range to end of list?

Consider the following statement:
process.text.readLines[3..<-1]
It seems like it should work. Essentially, strip off the first two elements of the array. However, the range operator is confused by the ending -1, since it's less than the starting index. You can easily solve this problem by storing the array as a variable and replacing -1 with size(), but that requires an extra line and the definition of a variable. Any other ideas how to express this easily?
I believe you could do:
process.text.readLines()[ 2..-1 ]
or:
process.text.readLines().drop( 2 )
This will also do the trick:
process.text.readLines().with { it[2..size()-1] }
It's longer than simply calling drop as suggested above, but it might read a little better depending on the larger context. with lets you get around defining a new variable.
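To make the behaviour concrete, here is a quick illustrative example on a made-up list (the names are invented for the sketch):
def lines = ['a', 'b', 'c', 'd', 'e']
assert lines[2..-1] == ['c', 'd', 'e']                      // -1 is the last index, counted from the end
assert lines.drop(2) == ['c', 'd', 'e']                     // drop the first two elements
assert lines.with { it[2..size() - 1] } == ['c', 'd', 'e']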

In R, how do I replace a string that contains a certain pattern with another string?

I'm working on a project involving cleaning a list of data on college majors. I find that a lot are misspelled, so I was looking to use the function gsub() to replace the misspelled ones with its correct spelling. For example, say 'biolgy' is misspelled in a list of majors called Major. How can I get R to detect the misspelling and replace it with its correct spelling? I've tried gsub('biol', 'Biology', Major) but that only replaces the first four letters in 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but that doesn't detect other forms of misspellings of 'biology'.
Thank you!
You should either define some nifty regular expression or use agrep from the base package. The stringr package is another option; I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me.
Anyway, agrep should do the trick:
agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1
EDIT:
You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:
biologySp = c("biolgy", "biologee", "bologee", "bugs")
for (sp in biologySp) {
    Major = gsub(sp, "Biology", Major)
}
If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching.
The wikipedia page on approx. string matching might be useful, and try searching R-help for some of the key terms.
http://en.wikipedia.org/wiki/Approximate_string_matching
You could first match the majors against a list of available majors; any that do not match are then the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if a value is similar to a correct one you will get a match).
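A rough sketch of that two-step idea (known_majors and Major are placeholder names):
known_majors = c("Biology", "Physics", "Chemistry")   # placeholder list of valid majors
bad = which(!(Major %in% known_majors))               # indices of likely misspellings
for (i in bad) {
    hit = agrep(Major[i], known_majors, ignore.case = TRUE, value = TRUE)
    if (length(hit) == 1) Major[i] = hit              # only replace unambiguous matches
}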
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:
> levenshtein.distance("physcs",c("biology","physics","geography"))
biology physics geography
7 1 9
If you get identical minima then flip a coin:
> levenshtein.distance("biolsics",c("biology","physics","geography"))
biology physics geography
4 4 8
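To pick the closest match programmatically, something like this works (subjects is a made-up vector; note that which.min simply takes the first minimum rather than literally flipping a coin on ties):
library(vwr)
subjects = c("biology", "physics", "geography")
d = levenshtein.distance("physcs", subjects)
subjects[which.min(d)]   # "physics"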
example 1a) perl/linux regex: 's/oldstring/newstring/'
example 1b) R equivalent of 1a: srcstring=sub(oldstring, newstring, srcstring)
example 2a) perl/linux regex: 's/oldstring//'
example 2b) R equivalent of 2a: srcstring=sub(oldstring, "", srcstring)
