With data.table, return between certain characters into a new column - string

I have a feeling this might be a simple question, but although I've searched through SO for a bit now and found many interesting related Q&As, I'm still stumped.
Here's what I need to learn (in honesty, I'm playing with the kaggle Titanic dataset, but I want to use data.table)...
Let's say you have the following data.table:
dt <- data.table(name=c("Johnston, Mr. Bob", "Stone, Mrs. Mary", "Hasberg, Mr. Jason"))
I want my output to be JUST the titles "Mr.", "Mrs.", and "Mr." -- heck we can leave out the period as well.
I've been playing around (all night) and discovered that using regular expressions might hold the answer, but I've only been able to get that to work on a single string, not with the whole data.table.
For example,
substr(dt$name[1], gregexpr(",.", dt$name[1]), gregexpr("[.]", dt$name[1]))
Returns:
[1] ", Mr."
Which is cool, and I can do some further processing to get rid of the ", " and ".", but the optimist (/optimizer) in me feels that that's ugly, gross, and inefficient.
Besides, even if I wanted to settle on that, (it pains me to admit) I don't know how to apply that into the J of data.table....
So, how do I add a column to dt called "Title", that contains:
[1] "Mr"
[2] "Mrs"
[3] "Mr"
I firmly believe that if I'm able to use regular expressions to select and extract data within a data.table that I will probably use this 100x a day. So thank you in advance for helping me figure out this pivotal technique.
PS. I'm an excel refugee, in excel I would just do this:
=mid(data, find(", ", data), find(".", data))

Umm.. I may have figured it out:
dt[, Title:=sub(".*?, (.*?)[.].*", "\\1", name)]
But I'm going to leave this here in case anyone else needs help, or perhaps there's an even better way of doing this!
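For anyone who wants to see what that pattern does outside of R, here is a minimal sketch of the same regular expression in Python. The names TITLE_PATTERN and extract_title are made up for illustration; the idea is only that the lazy `.*?, ` skips up to the first comma and space, the capture group grabs everything up to the first period, and the whole string is replaced by that group.

```python
import re

# Same idea as the sub() call above: capture the text between ", " and the first "."
# (TITLE_PATTERN and extract_title are hypothetical names for this sketch.)
TITLE_PATTERN = re.compile(r".*?, (.*?)\..*")


def extract_title(name):
    # Replace the whole string with the captured group, leaving just the title.
    return TITLE_PATTERN.sub(r"\1", name)


names = ["Johnston, Mr. Bob", "Stone, Mrs. Mary", "Hasberg, Mr. Jason"]
titles = [extract_title(n) for n in names]
print(titles)
```

If a name has no ", " or no ".", the pattern does not match and the name comes back unchanged, which is also how sub() behaves in R.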

You can use the stringr package
library(stringr)
str_extract(dt$name, "M.+\\.")
[1] "Mr." "Mrs." "Mr."
Different variations on the regular expression will let you extract other titles, like Dr., Master, or Reverend which may also be of interest to you.
To get all characters between "," and "." (inclusive) you can use
str_extract(dt$name, ",.+\\.")
and then remove the first and last characters of the result with str_sub (also from stringr package).
But as I think about it more, I might use grepl to create indicator variables for all the different titles that are in the Titanic dataset. For example
dr_ind <- grepl("Dr|Doctor", dt$name)
titled_ind <- grepl("Count|Countess|Baron", dt$name)
etc.


Is this code right? If there is an error, could you guide me?

I am a complete noob to programming, and I got an assignment from the online course where I am learning. My code also gives the output, but it is not the same as the instructor's method. This works for me, and this method was easy for me.
But I am not sure whether this is the right method or whether I have done something wrong. Could someone help?
I was not allowed to use choice().
import random
names_string = input("Give me everybody's names, separated by a comma. ")
names = names_string.split(", ")
a=len(names)
random_name=random.randint(0,a)
print(f"{names[random_name]} is going to pay the bill")
Welcome!
First of all, you need to describe exactly what problem you are facing.
It looks like you have not even tried to run that code. I recommend testing your code in an online interpreter, for example:
https://www.online-python.com/
Secondly, the "import" statement must be in lower case, not "Import".
Finally, the code only works if the names are separated by a comma followed by a space (", "), the same string that is passed to split. For example, "a,b,c" is not going to work, but "a, b, c" does.
Also, random.randint(0,a) includes both endpoints (a itself may be returned), so it must be a-1 to avoid IndexError: list index out of range.
Fixed code:
import random
names_string = input("Give me everybody's names, separated by a comma. ")
names = names_string.split(", ")
a=len(names)
random_name=random.randint(0,a-1)
print(f"{names[random_name]} is going to pay the bill")
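As a variation on the fixed code, here is a minimal sketch that also tolerates input without a space after each comma. The function name pick_payer and its rng parameter are made up for illustration; it uses random.randrange, which excludes the upper bound, so no "-1" is needed.

```python
import random


def pick_payer(names_string, rng=random):
    # Split on commas and strip whitespace, so "a,b,c" and "a, b, c" both work.
    names = [part.strip() for part in names_string.split(",") if part.strip()]
    if not names:
        raise ValueError("no names given")
    # randrange(n) returns an integer from 0 to n-1, so the index is always valid.
    index = rng.randrange(len(names))
    return names[index]


print(f"{pick_payer('Ann, Bob,Cy')} is going to pay the bill")
```

Passing a seeded random.Random instance as rng makes the result repeatable, which is handy when checking the code.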
No, the code is not right. The error can be solved by replacing "Import" with "import".

Way to find a number at the end of a string in Smalltalk

I have different commands my program is reading in (i.e., print, count, min, max, etc.). These words can also include a number at the end of them (i.e., print3, count1, min2, max6, etc.). I'm trying to figure out a way to extract the command and the number so that I can use both in my code.
I'm struggling to figure out a way to find the last element in the string in order to extract it, in Smalltalk.
You didn't say which incarnation of Smalltalk you use, so I will explain what I would do in Pharo, which is the one I'm familiar with.
As someone who has been playing with Pharo for a few months at most, I can tell you that the sheer number of classes and methods available can feel overwhelming at first, but the environment actually makes it easy to find things. For example, when you know the exact input and output you want, but don't know whether a method already exists somewhere, or what it is called, the Finder allows you to search by giving an example. You can open it from the World menu.
By default it searches for selectors (method names) matching your input terms.
That default is not what we need right now, so change the option in the upper right box to "Examples", and type in the search field an example of the input followed by the output you want, separated by a ".". The input example I used was the string 'max6', followed by the desired result, the number 6. Pharo then gives a list of methods that match.
To find what would return the text part, you can make a new search, changing the example output from the number 6 to the string 'max'.
Fortunately, there are several built-in methods matching the description of your problem.
There are more elegant ways, I suppose, but you can make use of the fact that String>>#asNumber only parses the part it can recognize. So you can do
'print31' reversed asNumber asString reversed asNumber
to give you 31. That only works if there actually is a number at the end.
This is one of those cases where we can presume the input data has a specific form, i.e., the only digits appear at the end of the string, and you want all of those digits. In that case it's not too hard to do, really, just:
numText := 'Kalahari78' select: [ :each | each isDigit ].
num := numText asInteger. "78"
To get the rest of the string without the digits, you can just use this:
'Kalahari78' withoutTrailingDigits. "Kalahari"
As some of the Pharo "OGs" pointed out, you can take a look at the String class (just type CMD-Return, type in String, hit Return) and you will find an amazing number of methods for all kinds of things. Usually you can get some ideas from those. But then there are times when you really just need an answer!
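For comparison only, the same split (command name plus optional trailing number) can be sketched outside Smalltalk with one regular expression. This is a Python illustration, not Pharo code, and the function name split_command is made up. It assumes the number, if present, is a run of digits at the very end of the string.

```python
import re


def split_command(token):
    # Lazy prefix, then as many trailing digits as possible; fullmatch anchors both ends.
    match = re.fullmatch(r"(.*?)(\d*)", token, flags=re.DOTALL)
    name, digits = match.groups()
    # Return None for the number when the command has no trailing digits.
    return name, (int(digits) if digits else None)


for command in ["print31", "count1", "max"]:
    print(split_command(command))
```

Unlike selecting every digit in the string, this only takes digits from the end, so a digit in the middle of a command name stays part of the name.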

Questions regarding Python replace specific texts

I'm writing a script in Python to scrape another website, and I have run into a problem that I have not yet figured out how to resolve.
Say I have set this particular string to be replaced with something else:
word_replace_1 = 'dv'
namelist = soup.title.string.replace(word_replace_1,'11dv')
The script works fine when the titles are dv234, dv123, etc.
The output will be 11dv234, 11dv123.
However, if the titles are dv234 mixed with dvab123, then even though I did not set dvab to be replaced with anything, the script replaces it with 11dvab123. What should I do here?
Also, if the title is a combination of letters, numbers, and Korean characters, say DAV123ㄱㄴㄷ,
how exactly should I make it output only DAV123, and add a - between the letters and the numbers?
Python - making a function that would add "-" between letters
This gives me the idea of adding - between all characters, but is there a method to add - only between a letter and a number?
The only way I can think of at the moment is creating a table of replacements, for example something like this:
word_replace_3 = 'a1'
word_replace_4 = 'a2'
.......
and then print them out as
namelist3 = soup.title.string.replace(word_replace_3,'a-1').replace(word_replace_4,'a-2')
This is just slow and not efficient. What would be the best method to resolve this?
Thanks.
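One possible approach, sketched here under assumptions, is to use regular expressions from the re module instead of str.replace. The function names add_prefix and normalize_code are made up for illustration, and the sketch assumes that "dv" should only be prefixed when it starts a word and is directly followed by a digit, and that only ASCII letters and digits should be kept.

```python
import re


def add_prefix(title):
    # Prefix "dv" with "11" only when it starts a word and a digit follows,
    # so "dv234" changes but "dvab123" is left alone.
    return re.sub(r"\bdv(?=\d)", "11dv", title)


def normalize_code(title):
    # Keep only ASCII letters and digits (this drops the Korean characters).
    cleaned = re.sub(r"[^A-Za-z0-9]", "", title)
    # Insert "-" at every position where a letter is followed by a digit.
    return re.sub(r"(?<=[A-Za-z])(?=\d)", "-", cleaned)


print(add_prefix("dv234"), add_prefix("dvab123"))
print(normalize_code("DAV123ㄱㄴㄷ"))
```

The lookahead `(?=\d)` checks for a digit without consuming it, which is what avoids a table of replacements such as 'a1', 'a2', and so on.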

Netlogo: Creating subsets of agentsets of a particular breed

I am still new to NetLogo, and I cannot find an explanation for this in the documentation.
I am trying to create a subset of an agentset that only contains one type of breed. It would seem that I could use "with" to perform this, but for some reason that does not work.
This code works:
ask link-neighbors with [shape = "person"][
set pmt (pmt + dist)
]
But this code does not:
ask link-neighbors with [breed = "psngrs"][
set pmt (pmt + dist)
]
How can I create a subset of an agentset with only this particular breed?
Thanks!
This question is showing as unanswered, though Alan gave the correct answer in a comment. So, just so I stop clicking on this thinking it is unanswered, I'm going to reiterate what he said as an answer. Alan, if you add your comment as answer, I'll happily delete mine.
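Anyway, just get rid of the quotes around the breed name. The breed variable holds the breed's agentset rather than a string, so comparing it to "psngrs" never matches. Like so: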
link-neighbors with [breed = psngrs]

In R, how do I replace a string that contains a certain pattern with another string?

I'm working on a project involving cleaning a list of data on college majors. I find that a lot are misspelled, so I was looking to use the function gsub() to replace the misspelled ones with its correct spelling. For example, say 'biolgy' is misspelled in a list of majors called Major. How can I get R to detect the misspelling and replace it with its correct spelling? I've tried gsub('biol', 'Biology', Major) but that only replaces the first four letters in 'biolgy'. If I do gsub('biolgy', 'Biology', Major), it works for that case alone, but that doesn't detect other forms of misspellings of 'biology'.
Thank you!
You should either define some nifty regular expression, or use agrep from the base package. The stringr package is another option; I know that people use it, but I'm a huge fan of regular expressions, so it's a no-no for me.
Anyway, agrep should do the trick:
agrep("biol", "biology")
[1] 1
agrep("biolgy", "biology")
[1] 1
EDIT:
You should also use ignore.case = TRUE, but be prepared to do some bookkeeping "by hand"...
You can set up a vector of all the possible misspellings and then do a loop over a gsub call. Something like:
biologySp = c("biolgy","biologee","bologee","bugs")
for(sp in biologySp){
Major = gsub(sp,"Biology",Major)
}
If you want to do something smarter, see if there are any fuzzy matching packages on CRAN, or something that uses 'soundex' matching....
The Wikipedia page on approximate string matching might be useful, and try searching R-help for some of the key terms.
http://en.wikipedia.org/wiki/Approximate_string_matching
You could first match the majors against a list of available majors; any that do not match would then be the likely misspellings. Then use the agrep function to match these against the known majors again (agrep does approximate matching, so if a value is similar to a correct one you will get a match).
The vwr package has methods for string matching:
http://ftp.heanet.ie/mirrors/cran.r-project.org/web/packages/vwr/index.html
so your best bet might be to use the string with the minimum Levenshtein distance from the possible subject strings:
> levenshtein.distance("physcs",c("biology","physics","geography"))
  biology   physics geography
        7         1         9
If you get identical minima then flip a coin:
> levenshtein.distance("biolsics",c("biology","physics","geography"))
  biology   physics geography
        4         4         8
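To make the minimum-distance idea concrete, here is a small self-contained sketch in Python rather than R. It does not use the vwr package; the functions levenshtein and closest are written out by hand for illustration, using the standard dynamic-programming definition of edit distance.

```python
def levenshtein(a, b):
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def closest(word, candidates):
    # Pick the candidate with the smallest edit distance; on a tie, min() keeps the first.
    return min(candidates, key=lambda candidate: levenshtein(word, candidate))


majors = ["biology", "physics", "geography"]
print(closest("physcs", majors))
```

Note that on a tie this sketch simply returns the first candidate instead of flipping a coin, so ties may still deserve a manual check.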
example 1a) perl/linux regex: 's/oldstring/newstring/'
example 1b) R equivalent of 1a: srcstring=sub(oldstring, newstring, srcstring)
example 2a) perl/linux regex: 's/oldstring//'
example 2b) R equivalent of 2a: srcstring=sub(oldstring, "", srcstring)
