Count word occurrences in R - string

Is there a function for counting the number of times a particular keyword is contained in a dataset?
For example, if dataset <- c("corn", "cornmeal", "corn on the cob", "meal") the count would be 3.

Let's assume for the moment that you want the number of elements containing "corn":
length(grep("corn", dataset))
[1] 3
After you get the basics of R down better you may want to look at the "tm" package.
EDIT: I realize that this time around you wanted any-"corn" but in the future you might want to get word-"corn". Over on r-help Bill Dunlap pointed out a more compact grep pattern for gathering whole words:
grep("\\<corn\\>", dataset)

Another quite convenient and intuitive way to do it is to use the str_count function of the stringr package:
library(stringr)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for mere occurrences of the pattern:
str_count(dataset, "corn")
# [1] 1 1 1 0
# for occurrences of the word alone:
str_count(dataset, "\\bcorn\\b")
# [1] 1 0 1 0
# summing it up
sum(str_count(dataset, "corn"))
# [1] 3

You can also do something like the following, which counts only the elements that are exactly equal to "corn" (1 for the example data, not 3):
length(dataset[which(dataset=="corn")])

I'd just do it with string division like:
library(roperators)
dataset <- c("corn", "cornmeal", "corn on the cob", "meal")
# for each vector element:
dataset %s/% 'corn'
# for everything:
sum(dataset %s/% 'corn')

You can use the str_count function from the stringr package to count how many times a keyword occurs in each element of a character vector.
The pattern argument of the str_count function accepts a regular expression that can be used to specify the keyword.
The regular expression syntax is very flexible and allows matching whole words as well as character patterns.
For example the following code will count all occurrences of the string "corn" and will return 3:
sum(str_count(dataset, regex("corn")))
To match complete words use:
sum(str_count(dataset, regex("\\bcorn\\b")))
The "\b" is used to specify a word boundary. When using the str_count function, the default definition of a word boundary treats an apostrophe as a boundary. So if your dataset contains the string "corn's", it would be matched and included in the result.
To prevent words containing an apostrophe from being counted, use the regex function with parameter uword = T. This will cause the regular expression engine to use the Unicode TR 29 definition of word boundaries. See http://unicode.org/reports/tr29/tr29-4.html. This definition does not consider an apostrophe as a word boundary.
The following code will give the number of times the word "corn" occurs. Words such as "corn's" will not be included.
sum(str_count(dataset, regex("\\bcorn\\b", uword = T)))

Related

How to substitute a repeating character with the same number of a different character in regex python?

Assume there's a string
"An example striiiiiing with other words"
I need to replace the 'i's with '*'s like 'str******ng'. The number of '*' must be same as 'i'. This replacement should happen only if there are consecutive 'i' greater than or equal to 3. If the number of 'i' is less than 3 then there is a different rule for that. I can hard code it:
import re
text = "An example striiiiing with other words"
out_put = re.sub(re.compile(r'i{3}', re.I), r'*'*3, text)
print(out_put)
# An example str***iing with other words
But the number of i's could be any number greater than 3. How can we do that using regex?
The i{3} pattern only matches exactly three consecutive i's anywhere in the string. You need i{3,} to match three or more i's. However, to make it all work, you need to pass a callable as the replacement argument to re.sub; it receives each match, so you can take the length of the matched text and multiply the replacement character accordingly.
Also, it is advisable to declare the regex outside of re.sub, or just use a string pattern since patterns are cached.
Here is the code that fixes the issue:
import re
text = "An example striiiiing with other words"
rx = re.compile(r'i{3,}', re.I)
out_put = rx.sub(lambda x: r'*'*len(x.group()), text)
print(out_put)
# => An example str*****ng with other words
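The callable-replacement idea can be wrapped in a small helper. This is only a sketch: the function name mask_runs and its parameters (char, min_run, mask) are made up for illustration and are not part of the original answer. It assumes a single character is being masked and keeps the case-insensitive matching used above.

```python
import re

def mask_runs(text, char="i", min_run=3, mask="*"):
    """Replace each run of `char` at least `min_run` long with an equally long run of `mask`.

    Hypothetical helper for illustration; `char` must be a single character.
    """
    if len(char) != 1:
        raise ValueError("char must be a single character")
    # re.escape guards against characters that are special in a regex, e.g. "." or "*".
    rx = re.compile(re.escape(char) + "{%d,}" % min_run, re.I)
    # The lambda receives each match object; the replacement is as long as the matched run.
    return rx.sub(lambda m: mask * len(m.group()), text)

print(mask_runs("An example striiiiing with other words"))
# An example str*****ng with other words
```

Runs shorter than min_run, such as the single i in "with", are left alone, so a separate rule for them can be applied afterwards.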

Is there anything else used instead of slicing the String?

This is one of the practice problems from Problem solving section of Hackerrank. The problem statement says
Steve has a string of lowercase characters in range ascii[‘a’..’z’]. He wants to reduce the string to its shortest length by doing a series of operations. In each operation he selects a pair of adjacent lowercase letters that match, and he deletes them.
For example : 'aaabbccc' -> 'ac' , 'abba' -> ''
I have tried solving this using slicing of strings but this gives me timeout runtime error on larger strings. Is there anything else to be used?
My code:
s = list(input())
i=1
while i<len(s):
    if s[i]==s[i-1]:
        s = s[:i-1]+s[i+1:]
        i = i-2
    i+=1
if len(s)==0:
    print("Empty String")
else:
    print(''.join(s))
This gives me terminated due to timeout message.
Thanks for your time :)
Building a new immutable string with each slice can be expensive, as it has O(N) linear cost in the length of the string. Consider processing "aa" * int(1e6). You will write on the order of 1e12 characters to memory by the time you're finished.
Take a moment (well, take linear time) to copy each character over to a mutable list element:
[c for c in giant_string]
Then you can perform dup processing by writing a tombstone of "" to each character you wish to delete, using just constant time.
Finally, in linear time you can scan through the survivors using "".join( ... )
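One common way to get constant-time deletions without any slicing is to keep the surviving characters on a stack. Note that this is a different technique from the tombstone scheme described above, shown here only as a sketch; the function name reduce_pairs is an assumption made for illustration.

```python
def reduce_pairs(s):
    """Repeatedly delete adjacent equal pairs; runs in time linear in len(s)."""
    stack = []  # characters that have survived so far
    for c in s:
        if stack and stack[-1] == c:
            # c matches the most recent survivor: the pair cancels out.
            stack.pop()
        else:
            stack.append(c)
    return "".join(stack)

for sample in ("aaabbccc", "abba"):
    # An empty result is reported the way the problem statement asks.
    print(reduce_pairs(sample) or "Empty String")
# ac
# Empty String
```

Each character is pushed once and popped at most once, so no intermediate strings or list copies are created until the final join.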
One other possible solution is to use regex. The pattern ([a-z])\1 matches a duplicate lowercase letter. The implementation would involve something like this:
import re
s = input()
pattern = re.compile(r'([a-z])\1')
while pattern.search(s):      # while a match is found
    s = pattern.sub('', s)    # remove all matches from s
print(s or "Empty String")
I'm not an expert at efficiency, but this seems to write fewer strings to memory than your solution. For the case of "aa" * int(1e6) that J_H mentioned, it will only write one, thanks to pattern.sub replacing all occurrences at once.
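How many passes the regex loop needs depends on the input: a single substitution removes only the pairs that are adjacent at that moment, so nested pairs take one pass per level. The sketch below counts the passes; the names PAIR and reduce_with_regex are assumptions made for illustration.

```python
import re

PAIR = re.compile(r'([a-z])\1')  # one lowercase letter followed by itself

def reduce_with_regex(s):
    """Return the reduced string and the number of substitution passes used."""
    passes = 0
    while PAIR.search(s):
        s = PAIR.sub('', s)  # removes every currently adjacent pair in one pass
        passes += 1
    return s, passes

print(reduce_with_regex("aa" * 5))    # ('', 1): all pairs are side by side
print(reduce_with_regex("abcddcba"))  # ('', 4): one pass per nesting level
```

For deeply nested input each pass rebuilds the whole string, so the total work can still grow quadratically with the length.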

How to match a part of string before a character into one variable and all after it into another

I have a problem splitting a string into two parts on a special character.
For example:
12345#data
or
1234567#data
The first part has 5-7 characters and is separated by "#" from the second part, which holds other data (letters, numbers, it doesn't matter what).
I need to store the two parts on each side of # in two variables:
x = 12345
y = data
without "#" character.
I was looking for some Lua string function like splitOn("#") or substring until character, but I haven't found that.
Use string.match and captures.
Try this:
s = "12345#data"
a,b = s:match("(.+)#(.+)")
print(a,b)
See this documentation:
First of all, although Lua does not have a split function in its standard library, it does have string.gmatch, which can be used instead of a split function in many cases. Unlike a split function, string.gmatch takes a pattern to match the non-delimiter text, instead of the delimiters themselves.
It is easily achievable with the help of a negated character class with string.gmatch:
local example = "12345#data"
for i in string.gmatch(example, "[^#]+") do
  print(i)
end
The [^#]+ pattern matches one or more characters other than # (so, in effect, it "splits" the string on the single character #).

Efficient way to insert characters between other characters in a string

What is an efficient way in MATLAB to replace/insert one symbol (in series of symbols) with several others that correspond to the one that is being replaced?
For example, consider having a string Eq: Eq = 'A*exp(-((x-xc)/w)^2)'. Is there a way to replace * with .*, / with ./, \ with .\, and ^ with .^ without writing four separate strrep() lines?
Regular expressions will do the job nicely. Regular expressions simply find patterns in text. You specify what kind of pattern you are looking for by a regular expression, and the output gives you the locations of where the pattern occurred.
For our particular case, not only do we want to find where patterns occur, we also want to replace those patterns with something else. Specifically, use the function regexprep from MATLAB to replace matches in a string with something else. What you want to do is replace all *, /, \ and ^ symbols by adding a . in front of each.
How regexprep works is that the first input is the string you're looking at, and the second input is a pattern that you're trying to find. In our case, we want to find any of *, /, \ and ^. To specify this pattern, you put those desired symbols in [] brackets. Regular expressions reserve \ as an escape symbol that makes a special character be read literally. As such, you need to use \\ for the \ character and \^ for the ^ character. The third input is what you want to replace each match with. In our case, we simply want to reuse each matched character, but we add a . at the beginning of the match. This is done by doing \.$0 in the regular expression syntax. $0 refers to the entire text of the match... which here is just the single matched symbol from the pattern. . is also a reserved character in regular expressions, so we prepend this symbol with a \ character.
Without further ado:
>> Eq = 'A*exp(-((x-xc)/w)^2)';
>> out = regexprep(Eq, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2)
The pattern we are looking for is [*/\\\^], which means that we want to find any of *, /, \ (denoted as \\ in regex) and ^ (denoted as \^ in regex). We want to find any of these symbols and replace them with the same symbol by adding a . character in front - \.$0.
As a more complicated example, let's make sure that we include all of the symbols you're looking for in a sample equation:
>> A = 'A*exp(-((x-xc)/w)^2) \ b^2';
>> out = regexprep(A, '[*/\\\^]', '\.$0')
out =
A.*exp(-((x-xc)./w).^2) .\ b.^2
I'd go with regexp as in rayryeng's answer. But here's another approach, just to provide an alternative.
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
[~, jj] = sort([1:numel(Eq) ii-.5]); %// will be used to properly order the result
result = [Eq repmat('.',1,numel(ii))]; %// insert dots at the end
result = result(jj); %// properly order the result
And a variant:
ops = '*/\^'; %// operators that need a dot
ii = find(ismember(Eq, ops)); %// find where dots should be inserted
jj = sort([1:numel(Eq) ii-.5]); %// dot locations are marked with fractional part
result = Eq(ceil(jj)); %// repeat characters where the dots will be placed
result(mod(jj,1)>0) = '.'; %// place dots at indices with fractional part
The vectorize function already does almost all of what you want except that it does not convert mldivide (\) to ldivide (.\).
By "efficient," do you mean fewer lines of code or faster? Regular expressions are almost always slower than other approaches and less readable. I don't think they're necessary or a good choice in this case. If you only need to convert your string once, then speed is less of a concern than readability (strrep will still be faster). If you need to do it many times, this simple code that you alluded to is 4–5 times faster than regexprep for short strings like your example (and much faster for longer strings):
out = strrep(Eq,'*','.*');
out = strrep(out,'/','./');
out = strrep(out,'\','.\');
out = strrep(out,'^','.^');
If you want one line, use:
out = strrep(strrep(strrep(strrep(Eq,'*','.*'),'/','./'),'\','.\'),'^','.^');
which will also be slightly faster still. Or create your own version of vectorize and call that.
Where regular expressions shine is in more complex cases, e.g., if your string is already partially vectorized: Eq = 'A.*exp(-((x-xc)/w)^2)'. Even still, the vectorize function just uses strrep and then calls strfind to "remove any possible '..*', '../', etc." and replace them with the proper element-wise operators because it's faster (symbolic math strings can get very large, for example).

How can I remove repeated characters in a string with R?

I would like to implement a function with R that removes repeated characters in a string. For instance, say my function is named removeRS, so it is supposed to work this way:
removeRS('Buenaaaaaaaaa Suerrrrte')
Buena Suerte
removeRS('Hoy estoy tristeeeeeee')
Hoy estoy triste
My function is going to be used with strings written in Spanish, so it is not that common (or at least correct) to find words that have more than three successive vowels. No bother about the possible sentiment behind them. Nonetheless, there are words that can have two successive consonants (especially ll and rr), but we could skip this from our function.
So, to sum up, this function should replace the letters that appear at least three times in a row with just that letter. In one of the examples above, aaaaaaaaa is replaced with a.
Could you give me any hints to carry out this task with R?
I did not think very carefully on this, but this is my quick solution using references in regular expressions:
gsub('([[:alpha:]])\\1+', '\\1', 'Buenaaaaaaaaa Suerrrrte')
# [1] "Buena Suerte"
() captures a letter first, \\1 refers to that letter, + means to match it once or more; put all these pieces together, we can match a letter two or more times.
To include other characters besides letters, replace [[:alpha:]] with a regex matching whatever you wish to include.
I think you should pay attention to the ambiguities in your problem description. This is a first stab, but it clearly does not work with "Good Luck" in the manner you desire:
removeRS <- function(str) paste(rle(strsplit(str, "")[[1]])$values, collapse="")
removeRS('Buenaaaaaaaaa Suerrrrte')
#[1] "Buena Suerte"
Since you want to replace letters that appear AT LEAST 3 times, here is my solution:
gsub("([[:alpha:]])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
#[1] "Buenna Suertee"
As you can see the 4 "a" have been reduced to only 1 a, the 3 r have been reduced to 1 r but the 2 n and the 2 e have not been changed.
As suggested above, you can replace [[:alpha:]] with any character class such as [a-zA-KM-Z], or list specific characters, e.g. [yQ], if you want your code to affect only repetitions of y and Q. Note that | is not an "or" operator inside square brackets; it is a literal character there, so it should be left out.
gsub("([ae])\\1{2,}", "\\1", "Buennaaaa Suerrrtee")
# [1] "Buenna Suerrrtee"
# triple r are not affected and there are no triple e.
