When learning (or relearning) a language, a significant amount of time goes into learning the functions for basic operations. For example, suppose I want to reverse a string. In one language, it may be as simple as myString.reverse(). In Python, it is myString[::-1]. In other languages, you may have to create an array, iterate through the string adding the characters in reverse order, and then convert it back to a string. What would be extremely useful is a reference where, if you know the name of the function in one language, you could find the equivalent in another. Googling or searching StackOverflow doesn't solve this problem very well at the moment, as you usually have to try a large number of different queries. I guess I am thinking of some kind of wiki system. Are there any websites that do this?
It sounds like you're looking for Rosetta Code. There is in fact a page on reversing a string.
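For illustration, here is the contrast the question describes, sketched in Python (the loop version stands in for languages that lack a built-in reverse):

    s = "hello"
    print(s[::-1])              # built-in slicing: 'olleh'

    # The manual approach you might need in a language without such a helper:
    chars = []
    for c in s:
        chars.insert(0, c)      # prepend each character to reverse the order
    print("".join(chars))       # 'olleh'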
When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program? In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
If there isn't any way to access this info at runtime, is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
Finally, does MoarVM manage the number of Strands in a given Str? I often hear (or say) that one of Raku's super powers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n). For example,
(^$n).map({~rand}).join
seems like it would create a Str with a length proportional to $n that consists of $n Strands – and, if I'm understanding the data structure correctly, that means that indexing into this Str would require checking the length of each Strand, for a time complexity of O(n). But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case? Or have I misunderstood something more basic?
When running Raku code on Rakudo with the MoarVM backend, is there any way to print information about how a given Str is stored in memory from inside the running program?
My educated guess is yes, as described below for App::MoarVM modules. That said, my education came from a degree I started at the Unseen University, and a wizard had me expelled for guessing too much, so...
In particular, I am curious whether there's a way to see how many Strands currently make up the Str (whether via Raku introspection, NQP, or something that accesses the MoarVM level; does such a thing even exist at runtime?).
I'm 99.99% sure strands are purely an implementation detail of the backend, and there'll be no Raku or NQP access to that information without MoarVM specific tricks. That said, read on.
If there isn't any way to access this info at runtime
I can see there is access at runtime via MoarVM.
is there a way to get at it through output from one of Rakudo's command-line flags, such as --target, or --tracing? Or through a debugger?
I'm 99.99% sure there are multiple ways.
For example, there's a bunch of strand debugging code in MoarVM's ops.c file starting with #define MVM_DEBUG_STRANDS ....
Perhaps more interesting are what appears to be a veritable goldmine of sophisticated debugging and profiling features built into MoarVM. Plus what appear to be Rakudo specific modules that drive those features, presumably via Raku code. For a dozen or so articles discussing some aspects of those features, I suggest reading timotimo's blog. Browsing github I see ongoing commits related to MoarVM's debugging features for years and on into 2021.
Finally, does MoarVM manage the number of Strands in a given Str?
Yes. I can see that the string handling code (some links are below), which was written by samcv (extremely smart and careful) and, I believe, reviewed by jnthn, has logic limiting the number of strands.
I often hear (or say) that one of Raku's super powers is that it can index into Unicode strings in O(1) time, but I've been thinking about the pathological case, and it feels like it would be O(n).
Yes, if a backend that supported strands did not manage the number of strands.
But for MoarVM I think the intent is to set an absolute upper bound with #define MVM_STRING_MAX_STRANDS 64 in MoarVM's MVMString.h file, and logic that checks against that (and other characteristics of strings; see this else if statement as an exemplar). But the logic is sufficiently complex, and my C chops sufficiently meagre, that I am nowhere near being able to express confidence in that, even if I can say that that appears to be the intent.
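To make the concern about indexing concrete, here is a rough conceptual sketch in Python of a strand-ed string; it illustrates the general idea only, not MoarVM's actual data structure:

    class StrandedStr:
        def __init__(self, strands):
            self.strands = strands            # list of ordinary strings ("strands")

        def char_at(self, index):
            # Walk the strands, subtracting each strand's length, until the index
            # falls inside one of them -- linear in the number of strands.
            for strand in self.strands:
                if index < len(strand):
                    return strand[index]
                index -= len(strand)
            raise IndexError(index)

    s = StrandedStr(["0.12", "0.98", "0.40"])
    print(s.char_at(6))   # '9'

Capping the number of strands (or flattening once a bound is hit) is what keeps indexing from degrading like this.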
For example, (^$n).map({~rand}).join seems like it would create a Str with a length proportional to $n that consists of $n Strands
I'm 95% confident that the strings constructed by simple joins like that will be O(1).
This is based on me thinking that a Raku/NQP level string join operation is handled by MVM_string_join, and my attempts to understand what that code does.
But I know that it's possible to flatten a Strand-ed Str; would MoarVM do something like that in this case?
If you read the code you will find it's doing very sophisticated handling.
Or have I misunderstood something more basic?
I'm pretty sure I will have misunderstood something basic so I sure ain't gonna comment on whether you have. :)
As far as I understand it, the fact that MoarVM implements strands (i.e., concatenating two strings will only result in the creation of a strand that consists of "references" to the original strings) is really just that: an implementation detail.
You can implement the Raku Programming Language without needing to implement strands. Therefore there is no way to introspect this, at least to my knowledge.
There has been a PR to expose the nqp:: op that would actually concatenate strands into a single string, but that has been refused / closed: https://github.com/rakudo/rakudo/pull/3975
Working on a problem connected with analytic number theory, I want to make some simple computer experiments in order to examine some theoretical conjectures. The algorithms are very simple: they contain standard arithmetic operations and factorials, but I would like to find values depending on a parameter. For instance, if I understand correctly, the problem with such calculations at the WolframAlpha service is that I cannot write an expression depending on a parameter once and then just change the value of the parameter. But that is what I need. I am new to programming; long ago I used some old languages like Algol, but I am not aware of the modern situation with simple computer experiments. So, my goal is to calculate some simple expressions for multiple values of a parameter, preferably by installing some simple software or by using an online service. How could this be done?
If my question is perceived as off topic, I would much appreciate any further recommendations before it is closed.
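For what it's worth, this kind of parameterized calculation is only a few lines in something like Python; the expression below is just a placeholder example, not the asker's actual formula:

    from math import factorial

    def f(n):
        # Example expression using factorials: a partial sum approximating e.
        return sum(1 / factorial(k) for k in range(n + 1))

    # Evaluate the same expression for several values of the parameter.
    for n in (5, 10, 20, 50):
        print(n, f(n))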
I'd like to create a left pad function in a programming language. The function pads a string with leading characters to a specified total length. Strings are UTF-16 encoded in this language.
There are a few things in Unicode that make it complicated:
Surrogates: 2 surrogate code units = 1 Unicode character
Combining characters: 1 non-combining character + any number of combining characters = 1 visible character
Invisible characters: 1 invisible character = 0 visible characters
What other factors have to be taken into consideration, and how would they be dealt with?
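To make the counting issues concrete, here is a small Python illustration; the question is about a UTF-16-based language, so the UTF-16 code-unit count is shown by encoding explicitly:

    s = "e\u0301"          # 'e' + combining acute accent: 2 code points, 1 visible character
    emoji = "\U0001F600"   # a single code point outside the BMP

    print(len(s))                                # 2 code points
    print(len(emoji))                            # 1 code point
    print(len(emoji.encode("utf-16-le")) // 2)   # 2 UTF-16 code units (a surrogate pair)

    # A naive left pad by code points gets the visible width wrong:
    print(s.rjust(5, "*"))   # padded to 5 code points, but only 4 visible characters

A correct left pad has to decide which of these units (code units, code points, grapheme clusters, or display width) it is actually padding to.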
When you’re first starting out trying to understand something, it’s really frustrating. We’ve all been there. But while it’s very easy to call it stupid, and to call everyone who made it stupid, you’re not going to get very far doing that. With an attitude like that, you’re implying that people who do understand it are also stupid for wasting their time on something so obviously stupid. After calling the people who do understand it stupid, it’s extremely unlikely that anyone who does understand it will take the time to explain it to you.
I understand the frustration. Unicode’s really complicated and it was a huge pain for me before I understood it and it’s still a pain for a lot of things I don’t have experience with. But the reason it’s so complicated isn’t because the people who made it were stupid and trying to ruin your life. It’s complicated because it attempts to provide a standard way of representing every human writing system ever used. Writing systems are insanely complicated, and throughout history developing a new and different writing system has been a fairly standard part of identifying yourself as a different culture from the people across the river or over the next mountain range. You yourself start off by identifying yourself as Hungarian based on the language you speak. Having once tried to pronounce a Hungarian professor’s name, I know that Hungarian is very complicated compared to English, just as English is very complicated compared to Hungarian. How would you feel if I was having trouble with Hungarian and asked you, “Boy, Hungarian sure is a stupid language! It must have been designed by idiots! By the way, how do I pronounce this word??”
There’s just no way to express something that’s inherently complicated in a very simple way. Human writing systems are inherently complicated and intentionally different from each other. As complicated as Unicode is, it’s better than what people had to do before, when instead of one single complicated standard there were multiple complicated standards in every country and you’d have to understand all of the different ‘standards.’
I’m not sure what your general life strategy is, but what I usually do when I don’t understand something is to pick up a few textbooks on the topic, read them through, and work out the examples. A good textbook will not only tell you how things are and what you need to do, but also how they got to be that way and why you need to do what you need to do.
I found Unicode Demystified to be an excellent book, and the newer book Unicode Explained has even higher ratings on Amazon.
I realise this question has been asked many times across Stack Overflow and across the web, in fact, I have about 20 tabs open just now with apparent solutions to this problem.
The thing is every single answer says something along the lines of
You could use Regex, but it's not a good idea and doesn't reliably work, but I won't offer any alternatives.
So my question is this - Is there really no reliable, definitive way we can extract URLs from text?
Regular Expressions are extremely powerful tools. Like most powerful tools, they are seriously misunderstood, dangerous in the hands of many of their users, and the best answer to certain tasks. Matching known patterns in strings is what they exist for. Once you have a good URL pattern in hand it will work all the time in the context it was designed for. The reason everyone shies away from using them is that creating a good URL pattern for a specific context is difficult work. The pattern will vary by the execution environment (e.g., operating system for file: URLs), by the programming language and/or library in use, etc.
For the specific case of HTTP URLs, there is a clear definition that is mostly adhered to, and you can build a reliable regular expression from it with almost any language or library.
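As a hedged illustration, here is a minimal Python sketch of pulling http(s) URLs out of free text; the pattern is deliberately simple and will over-match some edge cases, which is exactly why a production pattern has to be tuned for its context:

    import re

    # Very simple pattern: scheme followed by any run of non-space, non-quote characters.
    URL_RE = re.compile(r"""https?://[^\s<>"']+""")

    text = "See https://example.com/page?id=1 and also http://test.org."
    print(URL_RE.findall(text))
    # ['https://example.com/page?id=1', 'http://test.org.']  <- note the trailing '.' it grabbed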
If you want to extract URLs from an arbitrary string, there is no other choice than using a regex.
In fact, the URI scheme is defined (see http://en.wikipedia.org/wiki/URI_scheme), and if you go through all its aspects, a regex is very reliable.
Is there really no reliable, definitive way we can extract URLs from text?
Well, anything that arrives as string-formatted data needs careful exception handling. That said, once you have that handling in place it should work fine.
A regexp based on a URI scheme may do the trick, and may look something like:
<a href="(?<url>http://.*?)".*>(?<text>.+?)<\/a>
That's a .NET regexp though, so you may need to modify it to work in your platform's language.
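For instance, the same idea translated to Python's re syntax (named groups are written (?P<name>...) there); this is only a sketch for simple anchor tags, not a general HTML parser:

    import re

    ANCHOR_RE = re.compile(r'<a href="(?P<url>http://.*?)".*?>(?P<text>.+?)</a>')

    html = '<a href="http://example.com">Example</a>'
    m = ANCHOR_RE.search(html)
    if m:
        print(m.group("url"), m.group("text"))   # http://example.com Example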
I'm coding a query engine to search through a very large sorted index file. So here is my plan: use a binary search scan together with Levenshtein distance word comparison for a match. Is there a better or faster way than this? Thanks.
You may want to look into Tries, and in many cases they are faster than binary search.
If you were searching for exact words, I'd suggest a big hash table, which would give you results in a single lookup.
Since you're looking at similar words, maybe you can group the words into many files by something like their soundex, giving you much shorter lists of words to compute the distances to. http://en.wikipedia.org/wiki/Soundex
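Here is a rough Python sketch of that bucketing idea, using a simplified Soundex and a plain dynamic-programming Levenshtein; both are stand-ins to show the shape of the approach, not tuned implementations:

    from collections import defaultdict

    def soundex(word):
        # Simplified Soundex: first letter plus up to three consonant digit codes.
        codes = {}
        for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                               ("l", "4"), ("mn", "5"), ("r", "6")):
            for ch in letters:
                codes[ch] = digit
        word = word.lower()
        digits = []
        prev = codes.get(word[0], "")
        for ch in word[1:]:
            code = codes.get(ch, "")
            if code and code != prev:
                digits.append(code)
            prev = code
        return (word[0].upper() + "".join(digits) + "000")[:4]

    def levenshtein(a, b):
        # Classic row-by-row edit-distance DP.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    # Group the word list into buckets keyed by Soundex code...
    buckets = defaultdict(list)
    for word in ("robert", "rupert", "rubin", "ashcraft"):
        buckets[soundex(word)].append(word)

    # ...so a query only computes distances within its own (much shorter) bucket.
    query = "robort"
    candidates = buckets.get(soundex(query), [])
    print(sorted(candidates, key=lambda w: levenshtein(query, w)))   # ['robert', 'rupert']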
In your shoes, I would not reinvent the wheel - rather I'd reach for the appropriate version of the Berkeley DB (now owned by Oracle, but still open-source just as it was back when it was owned and developed by the UC at Berkeley, and later when it was owned and developed by Sleepycat;-).
The native interfaces are C and Java (haven't tried the latter actually), but the Python interface is also pretty good (actually better now that it's not in Python's standard library any more, as it can better keep pace with upstream development;-), C++ is of course not a problem, etc etc -- I'm pretty sure you can use it from most any language.
And, you get your choice of "BTree" (actually more like a B*Tree) and hash (as well as other approaches that don't help in your case) -- benchmark both with realistic data, btw, you might be surprised (one way or another) at performance and storage costs.
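As a stand-in illustration of the kind of dict-like, persistent key-value interface being recommended here, Python's standard dbm module looks like this (it is not Berkeley DB itself, just a sketch of the usage pattern):

    import dbm

    # Open (or create) a persistent key/value store on disk.
    with dbm.open("index.db", "c") as db:
        db[b"hello"] = b"offset:12345"      # keys and values are bytes
        print(db[b"hello"])                 # b'offset:12345'
        print(b"missing" in db)             # False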
If you need to throw multiple machines at your indexing problem (because it becomes too large and heavy for a single one), a distributed hash table is a good idea -- the original one was Chord but there are many others now (unfortunately my first-hand experience is currently limited to proprietary ones so I can't really advise you here).
After your comment on David's answer, I'd say that you need two different indexes:
the 'inverted index', where you keep all the words, each with a list of places found
an index into that file, to quickly find any word. Should easily fit in RAM, so it can be a very efficient structure, like a Hash table or a Red/Black tree. I guess the first index isn't updated frequently, so maybe it's possible to get a perfect hash.
or, just use Xapian, Lucene, or any other such library. Several are widely used and optimized.
Edit: I don't know much about word-comparison algorithms but I guess most aren't compatible with hashing. In that case, R/B Trees or Tries might be the best way.
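As a toy illustration of the two-level idea, here is a minimal in-memory inverted index in Python; on disk the postings would live in the large sorted file and only word-to-offset entries would stay in RAM:

    from collections import defaultdict

    documents = ["the quick fox", "the lazy dog", "quick brown dog"]

    # The 'inverted index': word -> list of (document id, position) places found.
    postings = defaultdict(list)
    for doc_id, text in enumerate(documents):
        for position, word in enumerate(text.split()):
            postings[word].append((doc_id, position))

    # In memory the dict itself is the fast lookup structure; on disk you'd keep
    # only word -> file offset here and seek into the postings file.
    print(postings["quick"])   # [(0, 1), (2, 0)]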