R extract time components from semi-standard strings - string

Setup
I have a column of durations stored as strings in a dataframe. I want to convert them to an appropriate time object, probably POSIXlt. Most of the strings are easy to parse using this method:
> data <- data.frame(time.string = c(
+ "1 d 2 h 3 m 4 s",
+ "10 d 20 h 30 m 40 s",
+ "--"))
> data$time.span <- strptime(data$time.string, "%j d %H h %M m %S s")
> data$time.span
[1] "2012-01-01 02:03:04" "2012-01-10 20:30:40" NA
Missing durations are coded "--" and need to be converted to NA - this already happens but should be preserved.
The challenge is that the string drops zero-valued elements. Thus the desired value 2012-01-01 02:00:14 would be the string "1 d 2 h 14 s". However, this string parses to NA with the simple parser:
> data2 <- data.frame(time.string = c(
+ "1 d 2 h 14 s",
+ "10 d 20 h 30 m 40 s",
+ "--"))
> data2$time.span <- strptime(data2$time.string, "%j d %H h %M m %S s")
> data2$time.span
[1] NA "2012-01-10 20:30:40" NA
Questions
What is the "R Way" to handle all the possible string formats? Perhaps test for and extract each element individually, then recombine?
Is POSIXlt the right target class? I need duration free from any specific start time, so the addition of false year and month data (2012-01-) is troubling.
Solution
@mplourde definitely had the right idea with dynamic creation of a formatting string based on testing various conditions in the date format. The addition of cut(Sys.Date(), breaks='years') as the baseline for the datediff was also good, but failed to account for a critical quirk in as.POSIXct(). Note: I'm using base R 2.11; this may have been fixed in later versions.
The output of as.POSIXct() changes dramatically depending on whether or not a date component is included:
> x <- "1 d 1 h 14 m 1 s"
> y <- "1 h 14 m 1 s" # Same string, no date component
> format (x) # as specified below
[1] "%j d %H h %M m %S s"
> format (y)
[1] "%H h %M m %S s"
> as.POSIXct(x,format=format) # Including the date baselines at year start
[1] "2012-01-01 01:14:01 EST"
> as.POSIXct(y,format=format) # Excluding the date baselines at today start
[1] "2012-06-26 01:14:01 EDT"
Thus the second argument for the difftime function should be:
The start of the first day of the current year if the input string has a day component
The start of the current day if the input string does not have a day component
This can be accomplished by changing the unit parameter on the cut function:
parse.time <- function (x) {
  x <- as.character (x)
  break.unit <- ifelse(grepl("d",x),"years","days") # chooses cut() unit
  format <- paste(c(if (grepl("d", x)) "%j d",
                    if (grepl("h", x)) "%H h",
                    if (grepl("m", x)) "%M m",
                    if (grepl("s", x)) "%S s"), collapse=" ")
  if (nchar(format) > 0) {
    difftime(as.POSIXct(x, format=format),
             cut(Sys.Date(), breaks=break.unit),
             units="hours")
  } else {NA}
}

difftime objects are time duration objects that can be added to either POSIXct or POSIXlt objects. Maybe you want to use this instead of POSIXlt?
Regarding the conversion from strings to time objects, you could do something like this:
data <- data.frame(time.string = c(
"1 d 1 h",
"30 m 10 s",
"1 d 2 h 3 m 4 s",
"2 h 3 m 4 s",
"10 d 20 h 30 m 40 s",
"--"))
f <- function(x) {
  x <- as.character(x)
  format <- paste(c(if (grepl('d', x)) '%j d',
                    if (grepl('h', x)) '%H h',
                    if (grepl('m', x)) '%M m',
                    if (grepl('s', x)) '%S s'), collapse=' ')
  if (nchar(format) > 0) {
    if (grepl('%j d', format)) {
      # '%j 1' is day 0. We add a day so that x = '1 d' means 24hrs.
      difftime(as.POSIXct(x, format=format) + as.difftime(1, units='days'),
               cut(Sys.Date(), breaks='years'),
               units='hours')
    } else {
      as.difftime(x, format, units='hours')
    }
  } else { NA }
}
data$time.span <- sapply(data$time.string, FUN=f)
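For comparison, the underlying idea (pull out each unit on its own and treat absent units as zero) can be sketched without date formats at all. The snippet below is in Python rather than R and is only an illustration: parse_duration and UNIT_KEYS are made-up names, and it assumes every component looks like a number followed by one of the unit letters d, h, m, s.

```python
import re
from datetime import timedelta

# Hypothetical mapping from the unit letters in the strings to timedelta keywords.
UNIT_KEYS = {"d": "days", "h": "hours", "m": "minutes", "s": "seconds"}

def parse_duration(text):
    """Return a timedelta for strings like '1 d 2 h 14 s', or None if no unit is found."""
    # Each match is a (number, unit letter) pair; units that are absent simply do not match.
    parts = re.findall(r"(\d+)\s*([dhms])\b", text)
    if not parts:
        return None  # e.g. the missing-value marker "--"
    return timedelta(**{UNIT_KEYS[unit]: int(value) for value, unit in parts})

for s in ["1 d 2 h 3 m 4 s", "1 d 2 h 14 s", "30 m 10 s", "--"]:
    print(repr(s), "->", parse_duration(s))
```

Because the result is a pure duration, no artificial year or month is attached, which is the concern raised in the question.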

I think you will have better luck with lubridate:
From Dates and Times Made Easy with lubridate:
5.3. Durations
...
The length of a duration is invariant to leap years, leap seconds, and daylight savings time
because durations are measured in seconds. Hence, durations have consistent lengths and
can be easily compared to other durations. Durations are the appropriate object to use when
comparing time based attributes, such as speeds, rates, and lifetimes.
lubridate uses the difftime class from base R for durations. Additional difftime methods
have been created to facilitate this.
...
Duration objects can be easily created with the helper functions dyears(), dweeks(), ddays(), dhours(), dminutes(), and dseconds(). The d in the title stands for duration and distinguishes these objects from period objects, which are discussed in Section 5.4. Each object creates a duration in seconds using the estimated relationships given above.
That said, I haven't (yet) found a function to parse a string into a duration.
You might also take a look at Ruby's Chronic to see how elegant time parsing can be. I haven't found a library like this for R.

Related

Print and for in 1 line Python 3.6.4

I wrote a passport program and it works, but the file was 7 kB, which was too much for me, so I started making it shorter.
I ran into a problem, and the code below is a simplified version of it that doesn't work. Can I do this for loop inside the print call?
C = ["A","B","C","D","E","F","G"]
print(C[N] for N in range(0,7))
Questions=["What's your last name?","What's your first name?","On what day are you born? (dd/mm/yyyy + hh:mm)","What's your place of birth?","What's your nationality?","What language(s) do you speak?","What's your sex? (W/M)"]
Data=[input(Questions[N]+"\n") for N in range(0,7)]
Keys = ["Last name","First name","Birthday","Birthplace","Nationality","Language(s)","Sex"]
for N in range(0,7):
    print(repr(N+1)+") "+Keys[N]+": "+Data[N])
(Above) The full program: first it asks you 7 questions (see the 'Questions' list), stores the answers in the list 'Data', and then tells you what you typed. It should look like this (below):
What's your last name?
A
What's your first name?
B
On what day are you born? (dd/mm/yyyy + hh:mm)
C
What's your place of birth?
D
What's your nationality?
E
What language(s) do you speak?
G
What's your sex? (W/M)
H
1) Last name: A
2) First name: B
3) Birthday: C
4) Birthplace: D
5) Nationality: E
6) Language: G
7) Sex: H
Question: Can I do the code below in 1 line?
for N in range(0,7):
    print(repr(N+1)+") "+Keys[N]+": "+Data[N])
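A minimal sketch of one way to do this, assuming Python 3.6 or later. A call such as print(C[N] for N in range(0,7)) hands print a single generator object, so it shows the generator itself rather than its items; unpacking with * or joining the pieces into one string avoids that. The helper name format_lines is made up for illustration, and the lists are shortened.

```python
keys = ["Last name", "First name", "Birthday"]
data = ["A", "B", "C"]

def format_lines(keys, data):
    """Build the numbered 'N) key: value' lines as a single string."""
    return "\n".join(
        f"{i}) {key}: {value}"
        for i, (key, value) in enumerate(zip(keys, data), start=1)
    )

# Unpacking passes each list item to print as a separate argument.
print(*["A", "B", "C"])
# The whole summary in one print call, one line per key.
print(format_lines(keys, data))
```

The same generator expression can be written directly inside the print call if a helper function is not wanted.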

Wrong float precision using textscan in Matlab

I have devised a function in MATLAB that allows me (or so I thought) to extract data from a text file that looks like this (at least the beginning):
G1 50
G2 50
M-0.35 0
M-0.05 0.013
M3.3 0.1
M9.75 0.236
M17.15 0.425
M25.85 0.666
M35.35 0.958
The idea is to match the letter I have with its position with a vector (because only the values next to M are really interesting to me), and get the two other numbers in a vector.
The end of the treatment works well, but the values I get by the end of my code are sometimes far from the real ones.
For instance, instead of [0 0.013 0.1 0.236 0.425 0.666 0.958] I get [0 0.013 0.1010 0.237 0.426 0.666 0.959].
This is not such an issue, but the problem is much worse for the first column: instead of a maximum at 119, it doesn't reach 90. I had code that worked properly with integers, but now that I'm using floats it fails every time.
I will try to display only the interesting parts of the code:
nom_essai='test.txt'
fid1 = fopen(nom_essai, 'rt');
tableau = textscan(fid1, '%s %.5f ', 'HeaderLines', 1, 'CollectOutput', true); %There are a few lines that I skip because they give the parameters, I get them with another line of the code
colonne_force=tableau{1}; %get the first column
colonne_deplacement=tableau{2}; %get the second column
indice=2*found_G+found_F+3*found_R; %this is the result of the treatment on colonne_force to match an index with the letter, which helps me keep the period next to G and the 2 values next to M.
force=linspace(0,0,length(n_indices)); %initialisation
deplacement=linspace(0,0,length(n_indices)); %initialisation
temps=linspace(0,0,length(n_indices)); %initialisation
for k=1:length(colonne_force) %%%%k is for the length of my vectors, while j is for the length of the columns
    if indice(k)==2 %a G is found => sampling period
        T=colonne_deplacement(k); %to keep the period next to G
    elseif indice(k)==1 %an F is found : skip it
    elseif indice(k)==3 %an R is found : skip it
    else %an M is found : I need to get the values on these lines
        j=j+1;
        deplacement(j)=colonne_deplacement(k); %I keep the value on the second column
        M=strsplit(colonne_force{k},'M'); %I get the string 'MXXX'
        force(j)=str2double(M{2}); %I recover this string without the M, and convert the number to double
    end
end
The kind of precision I would like to have is to keep values like [M108.55 23.759] with up to 3 digits.
Thank you in advance, feel free to ask for any information if I failed to give only the part of the code that contains the problem.
Modifying your code a bit:
nom_essai='test.txt';
fid1 = fopen(nom_essai, 'rt');
tableau = textscan(fid1, '%s %f ', 'HeaderLines', 1, 'CollectOutput', true); % Changed to %f so as not to lose significant figures
colonne_force = tableau{1}; % get the first column
colonne_deplacement=tableau{2}; % get the second column
% Check if has M
hasM = cellfun(@(x) any(x == 'M'), colonne_force);
column2 = colonne_deplacement(hasM);
column1 = colonne_force(hasM);
column1 = cellfun(@(x) str2double(x(2:end)), column1); % delete M and convert to double
The precision is retained:

R - create new variable using if statement

I am trying to create a new variable in a data table using an if statement: if a string variable contains a substring, then the new variable equals a numerical value.
My data:
N X
1 aa1aa
2 bb2bb
3 cc-1bb
...
The data frame contains several thousand rows.
The result I need is a new column containing the numerical value found within the string (column X):
N X Y
1 aa1aa 1
2 bb2bb 2
3 cc-1bb -1
I was trying with
for (i in 1:length(mydata)){
if (grep('1', mydata$X) == TRUE) {
mydata$Y <- 1 }
but I'm not sure I'm even on the right track... Any help, please?
This should work on more of your extended samples. Basically it takes out everything that's not a letter from the middle of the string.
X <- c("aa1aa", "bb2bb", "cc-1bb","aa+0.5b","fg-0.25h")
gsub("^[a-z]+([^a-z]*)[a-z]+$","\\1",X,perl=T)
#[1] "1" "2" "-1" "+0.5" "-0.25"
Using the example data from @Paulo you can use gsub from base R...
d$Y <- gsub( "[^0-9]" , "" , d$X )
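Note that deleting every non-digit character also deletes a leading minus sign and any decimal point, so "cc-1bb" comes out as "1". If the sign matters, matching the number itself rather than stripping everything else is one option. The sketch below is in Python purely to illustrate the pattern; NUMBER and extract_number are made-up names, and it assumes each string holds at most one number.

```python
import re

# Optional sign, optional integer part, optional decimal point, at least one digit.
NUMBER = re.compile(r"[-+]?\d*\.?\d+")

def extract_number(text):
    """Return the first signed number found in text as a float, or None if there is none."""
    match = NUMBER.search(text)
    return float(match.group()) if match else None

for s in ["aa1aa", "bb2bb", "cc-1bb", "aa+0.5b", "fg-0.25h"]:
    print(s, "->", extract_number(s))
```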
something like this?
d <- data.frame(N = 1:3,
X = c('aa1aa', 'bb2bb', 'cc-1bb'),
stringsAsFactors = FALSE)
library(stringr)
d$Y <- as.numeric(str_extract_all(d$X,"\\(?[0-9,.]+\\)?"))
d
N X Y
1 1 aa1aa 1
2 2 bb2bb 2
3 3 cc-1bb 1
EDIT - Speed test
The gsub approach provided by @Simon is much faster than stringr
library(microbenchmark)
# 30000 lines data.frame
d1 <- data.frame(N = 1:30000,
X = rep(c('aa1aa', 'bb2bb', 'cc-1bb'), 10000),
stringsAsFactors = FALSE)
stringr
microbenchmark(as.numeric(str_extract_all(d1$X,"\\(?[0-9,.]+\\)?")),
times = 10L)
Unit: seconds
expr min lq median uq max neval
as.numeric(str_extract_all(d1$X, "\\\\(?[0-9,.]+\\\\)?")) 2.677408 2.75283 2.76473 2.781083 2.796648 10
base gsub
microbenchmark(gsub( "[^0-9]" , "" , d1$X ), times = 10L)
Unit: milliseconds
expr min lq median uq max neval
gsub("[^0-9]", "", d1$X) 44.95564 45.05358 45.07238 45.10201 45.23645 10

How to convert number into string in agda?

I need to write something to convert a number into a string in Agda. I found that someone had previously asked how to go the other way, from a string to a number:
Agda: parse a string with numbers
I thought about using it backwards,
row-to-stringh : (m : ℕ) → string
row-to-stringh 0 = "0"
row-to-stringh 1 = "1"
row-to-stringh 2 = "2"
row-to-stringh 3 = "3"
row-to-stringh 4 = "4"
row-to-stringh 5 = "5"
row-to-stringh 6 = "6"
row-to-stringh 7 = "7"
row-to-stringh 8 = "8"
row-to-stringh 9 = "9"
row-to-stringh _ = ""
but it is not good enough. When the number is greater than 9, it just converts it into "" instead of the number itself. Can someone help me with this?
If you don't want to implement this function yourself, there's a show function in the standard library.
If you want to write it yourself: the usual way of converting a number into a string is to extract the digits by repeatedly dividing with a remainder. For example (remainders are written in parens):
7214 / 10 = 721 (4)
721 / 10 = 72 (1)
72 / 10 = 7 (2)
7 / 10 = 0 (7)
You then just collect the remainders into a list, reverse it and convert the digits to chars. It might be tempting to try this in Agda as well, however, you'll run into problems with the termination checker.
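Outside Agda, where termination does not have to be proved, the repeated-division procedure described above is only a few lines. The following Python sketch is an illustration only; digits_by_division and show are names chosen here, not library functions.

```python
def digits_by_division(n):
    """Return the decimal digits of a non-negative integer, most significant first."""
    if n == 0:
        return [0]
    remainders = []
    while n > 0:
        n, r = divmod(n, 10)   # quotient carries on, remainder is the next digit
        remainders.append(r)   # collected least significant digit first
    remainders.reverse()
    return remainders

def show(n):
    """Convert the digits to characters and join them into a string."""
    return "".join(chr(ord("0") + d) for d in digits_by_division(n))

print(digits_by_division(7214))
print(show(7214))
```

The while loop is exactly the part a termination checker cannot accept without help: nothing in its syntax shows that n shrinks on every step.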
Firstly, you'll have to convince it that divMod (that is, division with remainder - modulus) terminates. You can just hardcode the divisor into the function and convincing the termination checker becomes easy.
The hard part is showing that repeatedly dividing the number by 10 actually terminates. This will most likely involve some rather complex tricks (such as well founded recursion).
If you want to know how it's done this way, take a look at the implementation linked above. Anyways, there's a bit less efficient but much simpler way of doing this.
Let's represent digits by a list of natural numbers.
Digits = List ℕ
We would like to write a function addOne, that (as the name suggests) adds one to a number represented by a list of digits, that is:
addOne : Digits → Digits
For this, we'll use the primitive pen & paper method: add one to the least significant digit; if the result is less than 10, we are done; if it isn't, write 0 and carry the 1 to the next digit. So, here's our carry:
data Carry : Set where
+0 : Carry
+1 : Carry
And here's the function that performs the addition - the second Carry argument can be thought of as a carry from the addition of previous two digits.
ripple-carry : Digits → Carry → Digits
ripple-carry ns +0 = ?
ripple-carry [] +1 = ?
ripple-carry (n ∷ ns) +1 with suc n ≤? 9
... | yes _ = ?
... | no _ = ?
The actual implementation is an exercise - use the description given above. Just note that we store digits in reverse order (this allows for more efficient and easier implementation). For example, 123 is represented by 3 ∷ 2 ∷ 1 ∷ [] and 0 by [].
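To make the pen-and-paper rule concrete without giving away the Agda exercise, here is the same add-one step sketched in Python on a reversed digit list. The name add_one is chosen for this illustration; the representation (least significant digit first, empty list for zero) follows the description above.

```python
def add_one(digits):
    """Add one to a number stored as a list of digits, least significant first."""
    result = list(digits)        # work on a copy, leave the input untouched
    for i, d in enumerate(result):
        if d + 1 <= 9:
            result[i] = d + 1    # no carry needed, so we are done
            return result
        result[i] = 0            # write 0 and carry the 1 to the next digit
    result.append(1)             # carry ran off the end (covers the empty list too)
    return result

# Counting up from zero 123 times gives 3, 2, 1 in reversed order.
n = []
for _ in range(123):
    n = add_one(n)
print(n)
```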
We can recover the addOne function:
addOne : Digits → Digits
addOne n = ripple-carry n +1
The rest is just plumbing.
toDigits : ℕ → Digits
toDigits zero = []
toDigits (suc n) = addOne (toDigits n)
show : ℕ → String
show 0 = "0"
show n = (fromList ∘ map convert ∘ reverse ∘ toDigits) n
  where
  convert : ℕ → Char
  convert 0 = '0'
  convert 1 = '1'
  convert 2 = '2'
  convert 3 = '3'
  convert 4 = '4'
  convert 5 = '5'
  convert 6 = '6'
  convert 7 = '7'
  convert 8 = '8'
  convert 9 = '9'
  convert _ = ' ' -- Never happens.
Used modules:
open import Data.Char
open import Data.List
open import Data.Nat
open import Data.String
open import Function
open import Relation.Nullary
I did some testing and it turns out that this method is actually fairly efficient (especially when compared to the function from the standard library).
The algorithm presented above needs to access O(n) digits (addOne needs to access only one digit in 90% of cases, two digits in 9%, three in 0.9%, etc) for a given number n. Unless we have some faster primitive operations (such as _+_ using Haskell's Integer behind the scenes), this is about the fastest we can get - we are working with unary numbers after all.
The standard library uses the repeated division mentioned above, which is also (unless my math is wrong) O(n). However, this does not count the handling of proofs, which adds enormous overhead and slows it to a crawl. Let's do a comparison:
open import Data.Nat
open import Data.Nat.Show
open import Function
open import IO
main = (run ∘ putStrLn ∘ show) n
And here are times for the compiled code (using C-c C-x C-c in Emacs). show from standard library:
n time
———————————————
1000 410 ms
2000 2690 ms
3000 8640 ms
If we use show as defined above, we get:
n time
———————————————
100000 26 ms
200000 41 ms
300000 65 ms

algorithm/code in R to find pattern from any position in a string

I want to find the pattern from any position in any given string such that the pattern repeats for a threshold number of times at least.
For example for the string "a0cc0vaaaabaaaabaaaabaa00bvw" the pattern should come out to be "aaaab". Another example: for the string "ff00f0f0f0f0f0f0f0f0000" the pattern should be "0f".
In both cases threshold has been taken as 3 i.e. the pattern should be repeated for at least 3 times.
If someone can suggest an optimized method in R for finding a solution to this problem, please do share with me. Currently I am achieving this by using 3 nested loops, and it's taking a lot of time.
Thanks!
Use regular expressions, which are made for this type of stuff. There may be more optimized ways of doing it, but in terms of easy to write code, it's hard to beat. The data:
vec <- c("a0cc0vaaaabaaaabaaaabaa00bvw","ff00f0f0f0f0f0f0f0f0000")
The function that does the matching:
find_rep_path <- function(vec, reps) {
  regexp <- paste0(c("(.+)", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=T))
  substr(match, 1, nchar(match) / reps)
}
And some tests:
sapply(vec, find_rep_path, reps=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "aaaab" "0f0f"
sapply(vec, find_rep_path, reps=5L)
# $a0cc0vaaaabaaaabaaaabaa00bvw
# character(0)
#
# $ff00f0f0f0f0f0f0f0f0000
# [1] "0f"
Note that with threshold as 3, the actual longest pattern for the second string is 0f0f, not 0f (reverts to 0f at threshold 5). In order to do this, I use back references (\\1), and repeat these as many times as necessary to reach threshold. I need to then substr the result because annoyingly base R doesn't have an easy way to get just the captured sub expressions when using perl compatible regular expressions. There is probably a not too hard way to do this, but the substr approach works well in this example.
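For readers who want to see the back-reference trick in isolation, here is a hedged sketch of the same regular expression in Python, where the captured group can be read directly and no substr step is needed. The function name find_rep_pattern is made up for this illustration.

```python
import re

def find_rep_pattern(s, reps):
    """Return the first substring that repeats reps times in a row, or None."""
    # "(.+)" captures a candidate; each "\1" demands one more copy right after it.
    pattern = "(.+)" + r"\1" * (reps - 1)
    match = re.search(pattern, s)
    return match.group(1) if match else None

for s in ["a0cc0vaaaabaaaabaaaabaa00bvw", "ff00f0f0f0f0f0f0f0f0000"]:
    print(s, "->", find_rep_pattern(s, 3), "/", find_rep_pattern(s, 5))
```

Because the group is greedy, the engine tries the longest candidate first at the leftmost position where any match exists, which is why the second string yields 0f0f at three repetitions.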
Also, as per the discussion in @G. Grothendieck's answer, here is the version with the cap on length of pattern, which is just adding the limit argument and the slight modification of the regexp.
find_rep_path <- function(vec, reps, limit) {
  regexp <- paste0(c("(.{1,", limit,"})", rep("\\1", reps - 1L)), collapse="")
  match <- regmatches(vec, regexpr(regexp, vec, perl=T))
  substr(match, 1, nchar(match) / reps)
}
sapply(vec, find_rep_path, reps=3L, limit=3L)
# a0cc0vaaaabaaaabaaaabaa00bvw ff00f0f0f0f0f0f0f0f0000
# "a" "0f"
find.string finds substring of maximum length subject to (1) substring must be repeated consecutively at least th times and (2) substring length must be no longer than len.
reps <- function(s, n) paste(rep(s, n), collapse = "") # repeat s n times
find.string <- function(string, th = 3, len = floor(nchar(string)/th)) {
  for(k in len:1) {
    pat <- paste0("(.{", k, "})", reps("\\1", th-1))
    r <- regexpr(pat, string, perl = TRUE)
    if (attr(r, "capture.length") > 0) break
  }
  if (r > 0) substring(string, r, r + attr(r, "capture.length")-1) else ""
}
and here are some tests. The last test processes the entire text of James Joyce's Ulysses in 1.4 seconds on my laptop:
> find.string("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
>
> joyce <- readLines("http://www.gutenberg.org/files/4300/4300-8.txt")
> joycec <- paste(joyce, collapse = " ")
> system.time(result <- find.string(joycec, len = 25))
user system elapsed
1.36 0.00 1.39
> result
[1] " Hoopsa boyaboy hoopsa!"
ADDED
Although I developed my answer before having seen BrodieG's, as he points out they are very similar to each other. I have added some features of his to the above to get the solution below and tried the tests again. Unfortunately when I added the variation of his code the James Joyce example no longer works although it does work on the other two examples shown. The problem seems to be in adding the len constraint to the code and may represent a fundamental advantage of the code above (i.e. it can handle such a constraint and such constraints may be essential for very long strings).
find.string2 <- function(string, th = 3, len = floor(nchar(string)/th)) {
  pat <- paste0(c("(.", "{1,", len, "})", rep("\\1", th-1)), collapse = "")
  r <- regexpr(pat, string, perl = TRUE)
  ifelse(r > 0, substring(string, r, r + attr(r, "capture.length")-1), "")
}
> find.string2("a0cc0vaaaabaaaabaaaabaa00bvw")
[1] "aaaab"
> find.string2("ff00f0f0f0f0f0f0f0f0000")
[1] "0f0f"
> system.time(result <- find.string2(joycec, len = 25))
user system elapsed
0 0 0
> result
[1] "w"
REVISED The James Joyce test that was supposed to be testing find.string2 was actually using find.string. This is now fixed.
This function is not optimized (though it is fast), but I think it is a more R-like way to do this.
1. Get all patterns of a certain length > threshold: vectorized using mapply and substr.
2. Get the occurrences of these patterns and extract the one with maximum occurrence: vectorized using str_locate_all.
3. Repeat steps 1-2 for all lengths and take the one with maximum occurrence.
Here is my code. I am creating two functions, one for steps 1-2 and one for step 3:
library(stringr)
ss = "ff00f0f0f0f0f0f0f0f0000"
ss <- "a0cc0vaaaabaaaabaaaabaa00bvw"
find_pattern_length <-
  function(length=1,ss){
    patt = mapply(function(x,y) substr(ss,x,y),
                  1:(nchar(ss)-length),
                  (length+1):nchar(ss))
    res = str_locate_all(ss,unique(patt))
    ll = unlist(lapply(res,length))
    list(patt = patt[which.max(ll)],
         rep = max(ll))
  }
get_pattern_threshold <-
  function(ss,threshold =3 ){
    res <-
      sapply(seq(threshold,nchar(ss)),find_pattern_length,ss=ss)
    res[,which.max(res['rep',])]
  }
some tests:
get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',5)
$patt
[1] "0f0f0"
$rep
[1] 6
> get_pattern_threshold('ff00f0f0f0f0f0f0f0f0000',2)
$patt
[1] "f0"
$rep
[1] 18
Since you want at least three repetitions, there is a nice O(n^2) approach.
For each possible pattern length d cut string into parts of length d. In case of d=5 it would be:
a0cc0
vaaaa
baaaa
baaaa
baa00
bvw
Now look at each pair of subsequent strings A[k] and A[k+1]. If they are equal, then there is a pattern of at least two repetitions. Then go further (k+2, k+3) and so on. Finally, you also check whether the suffix of A[k-1] and the prefix of A[k+n] fit (where k+n is the first string that doesn't match).
Repeat it for each d starting from some upper bound (at most n/3).
You have n/3 possible lengths, then n/d strings of length d to check for each d. It should give complexity O(n (n/d) d)= O(n^2).
Maybe not optimal but I found this cutting idea quite neat ;)
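A hedged sketch of a closely related O(n^2) scan, in Python for illustration. It is a variant, not the chunk-cutting procedure itself: instead of cutting the string into blocks of length d, it counts consecutive positions i where s[i] equals s[i+d]. A run of (threshold-1)*d such positions means a block of length d is repeated threshold times in a row. The function name longest_repeated_pattern is made up here.

```python
def longest_repeated_pattern(s, threshold=3):
    """Longest block that occurs `threshold` times consecutively, or "" if none."""
    n = len(s)
    for d in range(n // threshold, 0, -1):      # try long patterns first
        run = 0                                 # consecutive i with s[i] == s[i + d]
        for i in range(n - d):
            if s[i] == s[i + d]:
                run += 1
                if run >= (threshold - 1) * d:
                    # The periodic stretch covers threshold * d characters ending at i + d.
                    start = i - (threshold - 1) * d + 1
                    return s[start:start + d]
            else:
                run = 0
    return ""

print(longest_repeated_pattern("a0cc0vaaaabaaaabaaaabaa00bvw"))
print(longest_repeated_pattern("ff00f0f0f0f0f0f0f0f0000"))
```

Each value of d costs one pass over the string, and there are at most n/threshold values of d, which gives the quadratic bound.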
For a bounded pattern (i.e. not huge) it's best, I think, to just create all possible substrings first and then count them. This is if the sub-patterns can overlap. If not, change the step fun in the loop.
pat="a0cc0vaaaabaaaabaaaabaa00bvw"
len=nchar(pat)
thr=3
reps=floor(len/2)
# all poss strings up to half length of pattern
library(stringr)
library(zoo) # rollapply() comes from the zoo package
pat=str_split(pat, "")[[1]][-1]
str.vec=vector()
for(win in 2:reps)
{
  str.vec= c(str.vec, rollapply(data=pat,width=win,FUN=paste0, collapse=""))
}
# the max length string repeated at least 3 times
tbl=table(str.vec)
tbl=tbl[tbl>=3]
tbl[which.max(nchar(names(tbl)))]
aaaabaa
3
NB Whilst I'm lazy and append/grow the str.vec here in a loop, for a larger problem I'm pretty sure the actual length of str.vec is predetermined by the length of the pattern if you care to work it out.
Here is my solution. It's not optimized (for example, it grows the vector with patterns <- c() ; patterns <- c(patterns, x)) and can be improved, but it is simpler than yours, I think.
I can't tell exactly which pattern should be returned (I just return the most frequent one), but you can adjust the code to what you want.
str <- "a0cc0vaaaabaaaabaaaabaa00bvw"
findPatternMax <- function(str){
  nb <- nchar(str):1
  length.patt <- rev(nb)
  patterns <- c()
  for (i in 1:length(nb)){
    for (j in 1:nb[i]){
      patterns <- c(patterns, substr(str, j, j+(length.patt[i]-1)))
    }
  }
  patt.max <- names(which(table(patterns) == max(table(patterns))))
  return(patt.max)
}
findPatternMax(str)
> findPatternMax(str)
[1] "a"
EDIT :
Maybe you want the returned pattern to have a minimum length?
Then you can add an nchar.patt parameter, for example:
nchar.patt <- 2 #For a pattern of 2 char min
nb <- nb[length.patt >= nchar.patt]
length.patt <- length.patt[length.patt >= nchar.patt]

Resources