How do I group similar strings in R? [closed] - string

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
Questions asking us to recommend or find a tool, library or favorite off-site resource are off-topic for Stack Overflow as they tend to attract opinionated answers and spam. Instead, describe the problem and what has been done so far to solve it.
Closed 9 years ago.
I have a database with ~5,000 locality names, most of which are repetitions with typos, permutations, abbreviations, etc. I would like to group them by similarity to speed up further processing. Ideally, I would convert each variation into a "platonic form" and put two columns side by side, with the original and platonic forms. I've read about multiple sequence alignment, but this seems to be used mostly in bioinformatics, for sequences of DNA, RNA, or peptides, and I'm not sure it will work well with place names. Does anyone know of a library that would help me do this in R? Or which of the many algorithm variations might be easiest to adapt?
EDIT: How do I do that in R? So far, I'm using the adist() function, which gives me a matrix of distances between each pair of strings (although it doesn't treat translocations the way I think it should; see the comment below). The next step, which I'm working on right now, is to turn this matrix into a grouping/clustering of similar-enough values. Thanks in advance!
EDIT: To solve the translocation problem, I wrote a small function that takes all the words with more than 2 characters, removes any remaining punctuation, sorts them, and pastes them back into a single string.
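sep <- function(linha) {
  resp <- strsplit(linha, " |/|-")            # split on spaces, slashes and hyphens
  resp <- unlist(resp)
  resp <- gsub(",|;|\\.", "", resp)           # strip leftover punctuation
  resp <- sort(resp[which(nchar(resp) > 2)])  # keep words longer than 2 characters, sorted
  paste0(resp, collapse = " ")
}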
Then I apply this to every row of my table:
locs[,9] <- apply(locs,1,function(x) sep(x[1])) # 1=original data; 9=new data
and finally apply adist() to the new column to create the distance table.
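The remaining step, turning pairwise distances into groups, could be sketched as follows. This is a minimal illustration in Python rather than R, and it is not the asker's code: the names `normalise`, `levenshtein` and `group_similar` are hypothetical, and the threshold `max_dist` is an assumption that would need tuning on real data. It mirrors the word-sorting idea above and then links any two names whose edit distance is within the threshold (single-linkage grouping).

```python
import re
from itertools import combinations


def normalise(name: str) -> str:
    """Sort the words longer than 2 characters, dropping punctuation."""
    words = re.split(r"[ /-]", name)
    words = [re.sub(r"[,;.]", "", w) for w in words]
    return " ".join(sorted(w for w in words if len(w) > 2))


def levenshtein(a: str, b: str) -> int:
    """Plain edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]


def group_similar(names, max_dist=2):
    """Single-linkage grouping: names within max_dist end up in one group."""
    keys = [normalise(n) for n in names]
    parent = list(range(len(names)))

    def find(i):
        # Follow parent links to the group's representative.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(names)), 2):
        if levenshtein(keys[i], keys[j]) <= max_dist:
            parent[find(i)] = find(j)

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())


names = ["Sao Paulo, Brazil", "Brazil - Sao Paulo",
         "Sao Paolo Brazil", "Rio de Janeiro"]
for group in group_similar(names):
    print(group)
```

Comparing every pair is quadratic in the number of names, so for ~5,000 names this pure-Python version would be slow; in R, passing the adist() matrix to a clustering routine would likely be the more natural route.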

There's a built-in function called adist() that computes the edit distance between two strings.
It's like using agrep(), except that it returns the distance instead of whether the words match according to some approximate matching criterion.
For the special case of words that can be interchanged around a comma (e.g. "hello,world" should be close to "world,hello"), here's a quick hack. You can modify the function pretty easily if you have other special cases.
adist_special <- function(word1, word2) {
  # Also try word2 with the two comma-separated halves swapped
  swapped <- gsub("(.*),(.*)", "\\2,\\1", word2)
  min(adist(word1, word2), adist(word1, swapped))
}
adist("hello,world", "world,hello")
# 8
adist_special("hello,world", "world,hello")
# 0


want someone to do my homework for me [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
Want to improve this question? Update the question so it focuses on one problem only by editing this post.
Closed 6 years ago.
# Change this program to output each letter of the text entered on a separate line
# with space in front of each letter, increasing to the right for each letter.
# For example, 'Python' input should output:
# P
#  y
#   t
#    h
#     o
#      n
text = input('Enter text: ')
for letter in text:
    print(letter)
I already tried to look online for a solution, but there is none.
This code is for homework, but I can't figure it out. Help would be appreciated.
As you haven't told us what you have tried so far, or what you have tried to learn in order to figure it out, I'm not going to post the Python code. Instead, let's think about what the program should be doing. Your assignment statement gives a broad overview, but as a programmer you need to take this overview and turn it into a set of smaller instructions. These smaller steps do not have to be in code. They can be in whatever form you like, even plain English.
For functional analysis (which is what you are doing for this problem) start with the inputs and outputs then fill in the stuff in-between.
1) Input: a string
X) Output: multiple lines with a single character and whitespace
Now, how do you want to get from 1 to X? You already have the code to loop through each letter and print it, but you are missing two things to get to your required output.
A) A way to place a string on a new line
B) A way to add whitespace to a string
I'll give you a couple of hints. A) is something that is extremely common in almost any programming language. In fact, it is done the exact same way in any language that you are likely to use. Another hint: your final output will be a single string that spans multiple lines. Ask yourself how a word processor or text editor works with blank lines.
B) is a little more tricky, as there are a couple of nifty Python tricks that make it simpler to do than in other languages. I'm guessing that you already know what happens when you add two numbers together, such as 3 + 5. But what happens when you add two strings together, such as "foo" + "bar"? Notice how the behavior of the + operator for numbers is completely different from its behavior for strings? That difference in behavior applies to other common operators as well. Play around with the other three common mathematical operators on both strings and numbers to see what happens.
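The experiment suggested in these hints might start like the following small sketch. It only shows how two operators behave on numbers versus strings, plus the newline character; the variable names are made up for illustration, and it deliberately stops short of the assignment itself.

```python
# The same operators behave differently on numbers and on strings.
number_sum = 3 + 5          # arithmetic addition
joined = "foo" + "bar"      # string concatenation

number_product = 3 * 5      # arithmetic multiplication
repeated = "ab" * 3         # string repetition

# "\n" is the newline character: one string can span several lines.
two_lines = "first" + "\n" + "second"

print(number_sum, joined)
print(number_product, repeated)
print(two_lines)
```

Trying `-` or `/` between two strings raises a `TypeError`, which is itself a useful observation about which operators are defined for text.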

Haskell library like SymPy? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I need to manipulate expressions like 1 + sqrt(3) and do basic arithmetic like addition, subtraction, and division. I'd like the result to be in some sort of canonical form so that it can be used as a key in a map. Turning 1 + sqrt(3) into a float is not feasible due to roundoff problems.
I used SymPy for this task in Python. Is there an equivalent native library for Haskell?
Please check out the numbers package. If all you need is to store exact numbers like "1 + √3", you may want to use Data.Number.CReal instead of symbolic arithmetic. It stores the expression and can compute it to an arbitrary number of digits when needed.
Prelude Data.Number.CReal> let cx = 1 + sqrt (3 :: CReal)
Prelude Data.Number.CReal> showCReal 400 cx
"2.7320508075688772935274463415058723669428052538103806280558069794519330169088000370811461867572485756756261414154067030299699450949989524788116555120943736485280932319023055820679748201010846749232650153123432669033228866506722546689218379712270471316603678615880190499865373798593894676503475065760507566183481296061009476021871903250831458295239598329977898245082887144638329173472241639845878553977"
There is also a Data.Number.Symbolic module in the package but the description says "It's mainly useful for debugging".
It seems you are looking for a Computer Algebra System (CAS) in Haskell. In spite of so many references to algebraic objects in the names of Haskell packages/modules, I've never heard of a general-purpose and well-maintained CA system in Haskell (like SymPy or Sage in Python).
However in the list of Computer Algebra Systems on Wikipedia I've found a reference to
DoCon. The Algebraic Domain Constructor
It uses a non-standard license, but I dare say it is still Open Source (though with rename and attribution requirements). As of July 2010 docon-2.11 still builds with GHC 6.12.1 and runs demos/tests (I only had to insert a LANGUAGE FlexibleContexts pragma in one file of the demo).
DoCon is well documented (362 pages of the Manual). Its Manual is packed inside of the zip with sources, so I put it online separately for convenience:
DoCon 2.11 Manual.ps
Please look through to check if it suits your needs.
Check out the cyclotomic package, which implements exact arithmetic on the cyclotomic numbers. These include the square roots of all rational numbers (hence in particular 1+sqrt(3)), though not every algebraic number, and the key operations (like equality) are decidable.
They do not provide an Ord instance (for the same reason the complex numbers do not), but one can implement a non-semantic instance if all one needs is to use them as keys in a lookup table. You may want to contact the author about how to do this correctly, as there may be some invariants that are not obvious (e.g. one may need to be careful about zeros in the coeffs map).
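To make the "canonical form as a map key" idea concrete, here is a minimal sketch, written in Python for illustration rather than Haskell, and not taken from any of the packages mentioned. It assumes the only radical involved is sqrt(3), so every value can be stored as a pair of rationals (a, b) meaning a + b*sqrt(3). The class name `QSqrt3` and the helper `q` are hypothetical.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass(frozen=True)
class QSqrt3:
    """Exact number a + b*sqrt(3) with rational a and b.

    Fractions are always stored in lowest terms, so the pair (a, b) is a
    canonical form; the frozen dataclass supplies __eq__ and __hash__.
    """
    a: Fraction
    b: Fraction

    def __add__(self, other):
        return QSqrt3(self.a + other.a, self.b + other.b)

    def __sub__(self, other):
        return QSqrt3(self.a - other.a, self.b - other.b)

    def __mul__(self, other):
        # (a + b√3)(c + d√3) = (ac + 3bd) + (ad + bc)√3
        return QSqrt3(self.a * other.a + 3 * self.b * other.b,
                      self.a * other.b + self.b * other.a)

    def __truediv__(self, other):
        # 1/(c + d√3) = (c - d√3) / (c² - 3d²); the norm is zero only for 0.
        norm = other.a ** 2 - 3 * other.b ** 2
        if norm == 0:
            raise ZeroDivisionError("division by zero in Q(sqrt 3)")
        return self * QSqrt3(other.a / norm, -other.b / norm)


def q(a, b=0):
    """Hypothetical convenience constructor: q(1, 1) is 1 + sqrt(3)."""
    return QSqrt3(Fraction(a), Fraction(b))


x = q(1, 1)                       # 1 + sqrt(3)
table = {x * x: "square of x"}    # exact values work as dictionary keys
print(table[q(4, 2)])             # (1 + √3)² = 4 + 2√3
```

In Haskell the same representation would be a data type with two Rational fields and derived Eq and Ord instances, which is enough for Data.Map keys. This does not generalise to arbitrary nested radicals, which is where a real CAS or the cyclotomic package comes in.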

Why is this an invalid Turing machine? [closed]

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
This question does not appear to be about programming within the scope defined in the help center.
Closed 1 year ago.
Whilst doing exam revision I am having trouble answering the following question from the book, "An Introduction to the Theory of Computation" by Sipser. Unfortunately there's no solution to this question in the book.
Explain why the following is not a legitimate Turing machine.
M = {
The input is a polynomial p over variables x1, ..., xn
Try all possible settings of x1, ..., xn to integer values
Evaluate p on all of these settings
If any of these settings evaluates to 0, accept; otherwise reject.
}
This is driving me crazy! I suspect it is because the set of integers is infinite? Does this somehow exceed the alphabet's allowable size?
Although this is quite an informal way of describing a Turing machine, I'd say the problem is one of the following:
otherwise reject - I agree with Welbog on that. Since you have a countably infinite set of possible settings, the machine can never know whether a setting on which p evaluates to 0 is still to come, and it will loop forever if it doesn't find one. Only when such a setting is encountered may the machine stop. That last statement is useless and will never be executed, unless of course you limit the machine to a finite set of integers.
The code order: I would read this pseudocode as "first write all possible settings down, then evaluate p on each one" and there's your problem:
Again, with an infinite set of possible settings, not even the first part will ever terminate, because there is never a last setting to write down before continuing with the next step. In this case, not only can the machine never say "there is no 0 setting", it can never even start evaluating to find one. This, too, would be solved by limiting the integer set.
Anyway, I don't think the problem is the alphabet's size. You wouldn't use an infinite alphabet, since your integers can be written in decimal, binary, etc., and those only use a (very) finite alphabet.
I'm a bit rusty on Turing machines, but I believe your reasoning is correct, i.e. the set of integers is infinite, therefore you cannot compute them all. I am not sure how to prove this theoretically though.
However, the easiest way to get your head around Turing machines is to remember "Anything a real computer can compute, a Turing machine can also compute.". So, if you can write a program that given a polynomial can solve your 3 questions, you will be able to find a Turing machine which can also do it.
I think the problem is with the very last part: otherwise reject.
According to countable set basics, any finite Cartesian product of countable sets is itself countable. In your case, the possible settings are n-tuples of integers, and the set of all such tuples is countable. It is therefore possible to try every combination of values one after another. (That is to say, without missing any combination.)
Also, computing the result of p on a given set of inputs is also possible.
And entering an accepting state when p evaluates to 0 is also possible.
However, since there is an infinite number of input vectors, you can never reject the input. Therefore no Turing machine can follow all of the rules defined in the question. Without that last rule, it is possible.
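The argument above can be illustrated with a short sketch: enumerate all integer tuples in an order that reaches each one after finitely many steps, and accept as soon as a root turns up. The function names and the `max_tries` cap are assumptions added for illustration; without the cap, the search simply never returns when p has no integer root, which is exactly why the "otherwise reject" step can never be reached.

```python
from itertools import count, product


def integer_tuples(n):
    """Yield every tuple in Z^n exactly once, ordered by growing max |x_i|.

    Assumes n >= 1.
    """
    yield (0,) * n
    for bound in count(1):
        for t in product(range(-bound, bound + 1), repeat=n):
            if max(abs(x) for x in t) == bound:  # skip tuples already seen
                yield t


def find_integer_root(p, n, max_tries=None):
    """Return an integer root of p if the search finds one.

    With max_tries=None this loops forever when no root exists. With a cap,
    None means "gave up", not "no root exists".
    """
    for tries, t in enumerate(integer_tuples(n)):
        if max_tries is not None and tries >= max_tries:
            return None
        if p(*t) == 0:
            return t


root = find_integer_root(lambda x, y: x * x + y * y - 25, 2)
print(root)                                                     # some integer point on the circle
print(find_integer_root(lambda x: x * x - 2, 1, max_tries=1000))  # None: gave up
```

A procedure like this accepts every polynomial that has an integer root but never rejects the others, so it recognizes the language without deciding it.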

Finite questions

Is there a finite number of questions that can be asked about a specific language (and/or topic)? For example, for T-SQL, given that there are only so many commands, can there be only a limited number of non-repetitive questions? If so, can you use that to determine sizing for a site like Stack Overflow and to determine the probability of a new question being a repeat of a prior one? If there is a finite number, how would you determine or calculate it? For instance, T-SQL has x commands, and each one can have a set of relevant questions (syntax, example of use, etc.), so could the number of questions be x times the potential questions per command times some relevant variation, or something like that?
No, since, theoretically, programs can be of infinite length, and this site is not just about language commands, but programs developed with those languages.
I'm pretty sure Turing says no, and if you don't believe him then Gödel might have something to say about it.
A Stack Overflow question is expressed as a finite-length sequence of bytes. One could in principle read the question body as an integer, expressed lowest digit first, in base 256 (or larger, if you wish to think of it as Unicode). With a little care over trailing zero bytes (for example, by using digits 1 to 256 instead of 0 to 255), this is a bijection between questions and whole numbers. Therefore the set of all Stack Overflow questions has a countably infinite cardinality (how do I typeset \aleph_0 in SO?).
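The correspondence between byte strings and whole numbers can be written out explicitly. The sketch below is one possible construction, not something from the answer itself: it uses "bijective base 256" (digits 1 to 256 rather than 0 to 255) so that strings differing only in trailing zero bytes still map to different integers. The function names are hypothetical.

```python
def encode(question: bytes) -> int:
    """Map a byte string to a whole number, first byte = lowest digit."""
    n = 0
    for byte in reversed(question):
        n = n * 256 + (byte + 1)   # digits run from 1 to 256
    return n


def decode(n: int) -> bytes:
    """Inverse of encode: recover the byte string from the number."""
    out = bytearray()
    while n > 0:
        n, digit = divmod(n - 1, 256)
        out.append(digit)
    return bytes(out)


text = "Why do we call strings strings?".encode("utf-8")
number = encode(text)
print(number)
print(decode(number).decode("utf-8"))
```

Because every whole number decodes to exactly one byte string and vice versa, the set of possible question bodies is countably infinite, matching the answer's conclusion.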

The History Behind the Definition of a 'String' [closed]

Closed. This question is not about programming or software development. It is not currently accepting answers.
This question does not appear to be about a specific programming problem, a software algorithm, or software tools primarily used by programmers. If you believe the question would be on-topic on another Stack Exchange site, you can leave a comment to explain where the question may be able to be answered.
Closed 2 months ago.
Locked. This question and its answers are locked because the question is off-topic but has historical significance. It is not currently accepting new answers or interactions.
I had never thought about it until recently, but I'm not sure why we call strings strings. I am a .NET programmer, but I believe the concept of strings exists in virtually every programming language.
Outside of programming, I don't believe I've heard the word string used to describe words or letters. A quick Google of, 'Define: string' yields a bunch of definitions that have nothing to do with the concept of letters, words, or anything of the nature associated to programming.
My guess is that, back in the day, strings were really just arrays of characters of a particular length, often with a delimiting character at the end. But I don't see a natural transition from 'character array' to string.
Can someone offer some insight into why we call strings strings?
My assumption has always been that the programming term originated from the following definition of the word "string" (from Merriam-Webster):
(1): a series of things arranged in or as if in a line <a string of cars> <a string of names>
(2): a sequence of like items (as bits, characters, or words)
Since a string in programming is simply an ordered sequence of characters, referring to this as a "string of characters" (or simply "string") seems like the most probable origin.
From this reference:
The 1971 OED (p. 3097) quotes an 1891 Century Dictionary on a source in the Milwaukee Sentinel of 11 Jan. 1898 (section 3, p. 1) to the effect that this is a compositor's term. Printers would paste up the text that they had generated in a long strip of characters. (Presumably, they were paid by the foot, not by the word!) The quote says that it was not unusual for compositors to create more than 1500 (characters?) per hour.
From searching through the ACM bibliography it seems the word string acquired its meaning in computer science during the 1960s. At the beginning a string is a general kind of sequence or list, e.g. A command language for handling strings of symbols from 1958.
This article explicitly mentions "character strings" in 1964.
Unfortunately I can't access the full texts, which are behind a toll booth.
I had guessed that "string" was in use by mathematicians long before its adoption in programming languages. Turing machines effectively operate on strings. Turing may not have used the term, but it is used everywhere in automata textbooks, going back decades.
The earliest reference I could find was a fragment in Google books of a 1944 article "Recursively enumerable sets of positive integers and their decision problems" by logician Emil Post in Bulletin of the AMS. Fortunately, AMS provides online archives of complete articles free for download. Here is a link: http://www.ams.org/journals/bull/1944-50-05/S0002-9904-1944-08111-1/S0002-9904-1944-08111-1.pdf
I think there is little doubt that he is using "string" in the conventional sense used in computer science. P. 286: "For working purposes, we introduce the letter b, and consider "strings" of 1's and b's such as 11b1bb1. An operation on such strings such as "b1bP produces P1bb1" we term a normal operation. This particular normal operation is applicable only to strings starting with b1b, and the derived string is then obtained from the given string by first removing the initial b1b, and then tacking on 1bb1 at the end. Thus b1bb becomes b1bb1."
I suspect it's because string originally meant just a sequence of data values: "I'll just string these together" etc. These values didn't have to be characters. One very common use for this general concept happened to be a sequence of characters, and this took over as the general meaning of the word.
The earliest reference I could find in computing is from March 1963's METEOR: A LISP Interpreter for String Transformations by Daniel G. Bobrow at MIT's AI Labs.
However, definition 15d. in the Oxford English Dictionary is:
Computing A linear sequence of records or data.
... and with a first quotation from a 1956 Journal of the Association for Computing Machinery:
Areas are set aside for shuttling strings of control fields back and forth until a completely sorted sequence is obtained.
This use naturally follows on from definition 15c.:
Math., etc. A sequence of symbols or linguistic elements in a definite order.
... and first used in Clarence Irving Lewis and Cooper Harold Langford's Symbolic Logic (1932):
Propositions are not strings of marks, or series of sounds, except incidentally.
This in turn follows on from many other, much earlier definitions for things in a line.
The word was originally used to differentiate between a set of values to which the particular order of elements doesn't matter (for instance, a set of random samples of measurements) and another that could only have its meaning preserved when the order is also preserved. Originally a string could be a set of any kind of values, but since in the post-mainframe era a string of characters is by far the most common kind, the fact that the values are characters became a "default".
A string is a sequence of discrete objects (usually char).
Given that, I would probably venture a guess that it may have to do with a metaphor related to "string of pearls". Each bead on the string is a single character.
It's called a string because it's actually an array of char-type elements.
That being said, they are "stringing together" (or is it strung together) via this array, which turns them into a "string".
