How to aid Smaz in further compressing repeating characters?

Smaz can compress short strings (< 100 bytes) where other compression tools fail.
But it has a weakness: it does not optimize runs of repeating characters by itself.
For example the string "this is a short string" compresses fine:
\x9b8\xac>\xbb\xf2>\xc3F
It is 9 bytes long. But a short string with repeating characters is a problem. For example, the string "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's" compresses into this:
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe'\n
It is still smaller, but the many "\x04"'s look like a waste of space.
I've been thinking about counting the occurrences of a letter and replacing the run with a sort of "bookmark". For example, "aaaaaaaaaa" with ten "a" occurrences becomes "a//10".
This is a test Python snippet I wrote off the top of my head; it is very ugly as of now:
a = set("this is a string with many aaaaaaaaaaaaaaaaaaaaaa's")
b = "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
for i in a:
    if i+i in b:  # the character occurs at least twice in a row
        o = b.count(i) - 2  # hack: only right because exactly 2 other "a"s lie outside the run
        s = i*o
        c = b.replace(s, i+'//'+str(o))
print(c)
It then becomes:
this is a string with many a//22's
Smaz compressed
\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\xc5\xc5\xff\x0222'\n
My worry is: what if the string contains a URL? Is it safe to use "//" as the escape sequence? And then there are regex strings. How can the marker be escaped in those cases?
Finally, my clear and concise question is: How do you safely shorten repeating characters that Smaz doesn't compress by itself?

Here's an example of safe compression of repeating bytes. My result for your data example
"this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
is:
"this is a string with many \x16a's"
It's 31 bytes long, a 39% reduction. "\x16" represents the one byte hexadecimal (22 decimal) count of repeating "a"'s.
What result do you get if you "Smaz" my result?
My result for your Smaz output example
"\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe"
is:
"\x9b8\xac>\xc3F\xf3\xe3\xad\x01\tG\x16\x04\xfe"
It's 15 bytes long, a 56% reduction. "\x16" represents the one byte hexadecimal (22 decimal) count of repeating compressed "\x04"'s ("a"'s).
Here's my code in Go.
package main
import (
"fmt"
)
func Compress(src []byte) (dst []byte) {
for len(src) > 0 {
c := src[0]
n := 1
for ; n < len(src) && src[n] == c; n++ {
}
src = src[n:]
for n > 0 {
m := (n-1)%31 + 1
n -= m
if m == 1 && !(1 <= c && c <= 31) {
dst = append(dst, c)
} else {
dst = append(dst, byte(m), c)
}
}
}
return dst
}
func Decompress(src []byte) (dst []byte) {
for i := 0; i < len(src); i++ {
n, c := byte(1), src[i]
if i+1 < len(src) && (1 <= c && c <= 31) {
n, c = c, src[i+1]
i++
}
for j := byte(0); j < n; j++ {
dst = append(dst, c)
}
}
return dst
}
func test(data string) {
src := []byte(data)
fmt.Printf("%d %q\n", len(src), src)
compress := Compress(src)
fmt.Printf("%d %q\n", len(compress), compress)
decompress := Decompress(compress)
fmt.Printf("%d %q\n", len(decompress), decompress)
fmt.Println(string(Decompress(Compress(src))) == string(src))
}
func main() {
data := "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
test(data)
fmt.Println()
smaz := "\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe"
test(smaz)
}
Output:
51 "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
31 "this is a string with many \x16a's"
51 "this is a string with many aaaaaaaaaaaaaaaaaaaaaa's"
true
34 "\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe"
15 "\x9b8\xac>\xc3F\xf3\xe3\xad\x01\tG\x16\x04\xfe"
34 "\x9b8\xac>\xc3F\xf3\xe3\xad\tG\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\x04\xfe"
true

How do you safely shorten repeating characters that Smaz doesn't compress by itself?
You can't without changing the Smaz algorithm and being incompatible with Smaz.
Smaz is purpose-built to be effective on small strings because its dictionary is universal and pre-computed. Other compression schemes need to build up a dictionary that is data set dependent, and they typically need a few hundred bytes of input before you see positive returns. Repeating sequences are rare in short strings.
For your proposed Smaz variant with run length encoding scheme to work you would have to take up one of the 256 precious byte slots Smaz reserves for its codes. You could change one of the byte slots to mean "a byte indicating length to follow, followed by the byte to be repeated" - i.e., 3 bytes to communicate [REPEAT BYTE] [BYTE indicating 2 - 257 times] [BYTE CODE TO REPEAT]. You could reassign the Smaz byte code 253 from its present meaning of ".com" for the purpose of run-length encoding. But be aware that your compression will be slightly less effective for general data with ".com".
Also be aware that searching for repeating sequences in a hypothetical Smaz variant with run-length encoding would necessarily take more CPU compute time for the backtracking compression.
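To make the three-byte idea above concrete, here is a minimal sketch of a standalone run-length pass over an arbitrary byte stream. It is not Smaz and is not compatible with it; the marker value 253 is simply the slot suggested above, and the names rleEncode, rleDecode and minRun are made up for illustration. A literal marker byte in the input is always written as a run, so the decoder can never misread it.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// marker is an assumed reserved byte value (the ".com" slot mentioned above).
const marker = 253

// minRun: a run costs 3 bytes to encode, so only runs of 4+ save space.
const minRun = 4

// rleEncode rewrites long runs as [marker][count][byte], count in 1..255.
// A literal marker byte is always emitted in that form, even for count 1.
func rleEncode(src []byte) []byte {
	var dst []byte
	for i := 0; i < len(src); {
		c := src[i]
		n := 1
		for i+n < len(src) && src[i+n] == c && n < 255 {
			n++
		}
		if n >= minRun || c == marker {
			dst = append(dst, marker, byte(n), c)
		} else {
			for j := 0; j < n; j++ {
				dst = append(dst, c)
			}
		}
		i += n
	}
	return dst
}

// rleDecode reverses rleEncode and reports truncated input.
func rleDecode(src []byte) ([]byte, error) {
	var dst []byte
	for i := 0; i < len(src); i++ {
		if src[i] != marker {
			dst = append(dst, src[i])
			continue
		}
		if i+2 >= len(src) {
			return nil, errors.New("rle: truncated run")
		}
		n, c := int(src[i+1]), src[i+2]
		dst = append(dst, bytes.Repeat([]byte{c}, n)...)
		i += 2
	}
	return dst, nil
}

func main() {
	// The Smaz output from the question: 11 bytes, 22 x 0x04, then 0xfe.
	in := append([]byte("\x9b8\xac>\xc3F\xf3\xe3\xad\tG"), bytes.Repeat([]byte{0x04}, 22)...)
	in = append(in, 0xfe)
	enc := rleEncode(in)
	dec, err := rleDecode(enc)
	fmt.Println(len(in), len(enc), err == nil && bytes.Equal(dec, in)) // 34 15 true
}
```

Because the marker is escaped rather than forbidden, this works as a wrapper around unmodified Smaz output, at the cost of 3 bytes for every literal 253 in the stream.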

Related

Can I make a prefilled string in golang with make or new?

I am trying to optimize my stringpad library in Go. So far the only way I have found to fill a string (actually bytes.Buffer) with a known character value (ex. 0 or " ") is with a for loop.
the snippet of code is:
// PadLeft pads string on left side with p, c times
func PadLeft(s string, p string, c int) string {
var t bytes.Buffer
if c <= 0 {
return s
}
if len(p) < 1 {
return s
}
for i := 0; i < c; i++ {
t.WriteString(p)
}
t.WriteString(s)
return t.String()
}
I believe that the larger the pad, the more times the t buffer's memory gets copied. Is there a more elegant way to make a buffer of known size that is initialized with a known value?

You can only use make() and new() to allocate buffers (byte slices or arrays) that are zeroed. You may use composite literals to obtain slices or arrays that initially contain non-zero values, but you can't describe the initial values dynamically (indices must be constants).
Take inspiration from the similar but very efficient strings.Repeat() function. It repeats the given string with given count:
func Repeat(s string, count int) string {
// Since we cannot return an error on overflow,
// we should panic if the repeat will generate
// an overflow.
// See Issue golang.org/issue/16237
if count < 0 {
panic("strings: negative Repeat count")
} else if count > 0 && len(s)*count/count != len(s) {
panic("strings: Repeat count causes overflow")
}
b := make([]byte, len(s)*count)
bp := copy(b, s)
for bp < len(b) {
copy(b[bp:], b[:bp])
bp *= 2
}
return string(b)
}
strings.Repeat() does a single allocation to obtain a working buffer (a byte slice, []byte) and uses the builtin copy() function to copy the repeatable string. Notably, it copies the already-filled part of the working buffer onto the rest of it, so the filled length doubles on each pass: if the string has already been copied 4 times, the next copy makes it 8 times, and so on. This minimizes the number of calls to copy(). The solution also takes advantage of the fact that copy() can copy bytes from a string without first converting it to a byte slice.
What we want is something similar, but we want the result to be prepended to a string.
We can account for that, simply allocating a buffer that is used inside Repeat() plus the length of the string we're left-padding.
The result (without checking the count param):
func PadLeft(s, p string, count int) string {
ret := make([]byte, len(p)*count+len(s))
b := ret[:len(p)*count]
bp := copy(b, p)
for bp < len(b) {
copy(b[bp:], b[:bp])
bp *= 2
}
copy(ret[len(b):], s)
return string(ret)
}
Testing it:
fmt.Println(PadLeft("aa", "x", 1))
fmt.Println(PadLeft("aa", "x", 2))
fmt.Println(PadLeft("abc", "xy", 3))
Output (try it on the Go Playground):
xaa
xxaa
xyxyxyabc
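Since the version above skips the parameter checks, here is a minimal sketch of a checked variant. It swaps in a different technique, strings.Builder with a single up-front Grow(), instead of the doubling copy; the name PadLeftChecked is made up for illustration, and overflow of len(p)*count is deliberately not handled.

```go
package main

import (
	"fmt"
	"strings"
)

// PadLeftChecked is a hypothetical variant of PadLeft that returns s
// unchanged when there is nothing to pad with.
func PadLeftChecked(s, p string, count int) string {
	if count <= 0 || p == "" {
		return s
	}
	var sb strings.Builder
	// Reserve the final size once so the writes below never reallocate.
	sb.Grow(len(p)*count + len(s))
	for i := 0; i < count; i++ {
		sb.WriteString(p)
	}
	sb.WriteString(s)
	return sb.String()
}

func main() {
	fmt.Println(PadLeftChecked("abc", "xy", 3)) // xyxyxyabc
	fmt.Println(PadLeftChecked("aa", "x", 0))   // aa
}
```

This keeps the loop from the question but removes the repeated buffer growth; whether it beats the doubling copy for large pads would need a benchmark.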
See similar / related question: Is there analog of memset in go?

Go: convert rune (string) to string representation of the binary

This is just in case someone else is learning Golang and is wondering how to convert from a string to a string representation in binary.
Long story short, I have been looking at the standard library without being able to find the right call. So I started with something similar to the following:
func RuneToBinary(r rune) string {
var buf bytes.Buffer
b := []int64{128, 64, 32, 16, 8, 4, 2, 1}
v := int64(r)
for i := 0; i < len(b); i++ {
t := v-b[i]
if t >= 0 {
fmt.Fprintf(&buf, "1")
v = t
} else {
fmt.Fprintf(&buf, "0")
}
}
return buf.String()
}
This is all well and dandy, but after a couple of days looking around I found that I should have been using the fmt package instead and just format the rune with %b:
var r rune
fmt.Printf("input: %b ", r)
Is there a better way to do this?
Thanks

Standard library support
fmt.Printf("%b", r) - this solution is already very compact and easy to write and understand. If you need the result as a string, you can use the analog Sprintf() function:
s := fmt.Sprintf("%b", r)
You can also use the strconv.FormatInt() function which takes a number of type int64 (so you first have to convert your rune) and a base where you can pass 2 to get the result in binary representation:
s := strconv.FormatInt(int64(r), 2)
Note that in Go rune is just an alias for int32, the 2 types are one and the same (just you may refer to it by 2 names).
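As a quick illustration of the two standard library options, the sketch below prints the same rune both ways, plus a zero-padded form using the %08b width flag. The helper name runeBits is made up for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// runeBits is a hypothetical helper wrapping strconv.FormatInt with base 2.
func runeBits(r rune) string {
	return strconv.FormatInt(int64(r), 2)
}

func main() {
	r := 'A' // code point 65
	fmt.Println(runeBits(r))     // 1000001
	fmt.Printf("%08b\n", r)      // 01000001 (padded to 8 digits)
	fmt.Println(runeBits('世')) // 100111000010110 (U+4E16)
}
```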
Doing it manually ("Simple but Naive"):
If you'd want to do it "manually", there is a much simpler solution than your original. You can test the lowest bit with r & 0x01 == 0 and shift all bits with r >>= 1. Just "loop" over all bits and append either "1" or "0" depending on the bit:
Note this is just for demonstration, it is nowhere near optimal regarding performance (generates "redundant" strings):
func RuneToBin(r rune) (s string) {
if r == 0 {
return "0"
}
for digits := []string{"0", "1"}; r > 0; r >>= 1 {
s = digits[r&1] + s
}
return
}
Note: negative numbers are not handled by the function. If you also want to handle negative numbers, you can check the sign first, proceed with the absolute value, and start the return value with a minus '-' sign. This also applies to the other manual solution below.
Manual Performance-wise solution:
For a fast solution we shouldn't append strings. Since strings in Go are just byte slices encoded using UTF-8, appending a digit is just appending the byte value of the rune '0' or '1' which is just one byte (not multi). So we can allocate a big enough buffer/array (rune is 32 bits so max 32 binary digits), and fill it backwards so we won't even have to reverse it at the end. And return the used part of the array converted to string at the end. Note that I don't even call the built-in append function to append the binary digits, I just set the respective element of the array in which I build the result:
func RuneToBinFast(r rune) string {
if r == 0 {
return "0"
}
b, i := [32]byte{}, 31
for ; r > 0; r, i = r>>1, i-1 {
if r&1 == 0 {
b[i] = '0'
} else {
b[i] = '1'
}
}
return string(b[i+1:])
}
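For completeness, here is one way the sign handling described in the note could look on top of the array-based approach. This is a sketch, not part of the original answer; the name RuneToBinSigned is made up. It works on the magnitude as a uint32 so that the smallest int32 value does not overflow when negated.

```go
package main

import "fmt"

// RuneToBinSigned is a hypothetical extension of RuneToBinFast that also
// accepts negative values and prefixes them with '-'.
func RuneToBinSigned(r rune) string {
	if r == 0 {
		return "0"
	}
	// Convert to uint32 first; negating the unsigned value yields the
	// magnitude even for the minimum int32, which has no positive twin.
	u := uint32(r)
	if r < 0 {
		u = -u
	}
	var b [33]byte // up to 32 digits plus an optional sign
	i := len(b)
	for ; u > 0; u >>= 1 {
		i--
		b[i] = '0' + byte(u&1)
	}
	if r < 0 {
		i--
		b[i] = '-'
	}
	return string(b[i:])
}

func main() {
	fmt.Println(RuneToBinSigned(5))  // 101
	fmt.Println(RuneToBinSigned(-5)) // -101
}
```

The output format matches what fmt's %b verb prints for negative integers (a minus sign followed by the magnitude), not a two's-complement bit pattern.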

How to convert []int8 to string

What's the best way (fastest performance) to convert from []int8 to string?
For []byte we could do string(byteslice), but for []int8 it gives an error:
cannot convert ba (type []int8) to type string
I got the ba from SliceScan() method of *sqlx.Rows that produces []int8 instead of string
Is this solution the fastest?
func B2S(bs []int8) string {
ba := []byte{}
for _, b := range bs {
ba = append(ba, byte(b))
}
return string(ba)
}
EDIT my bad, it's uint8 instead of int8.. so I can do string(ba) directly.

Note beforehand: The asker first stated that input slice is []int8 so that is what the answer is for. Later he realized the input is []uint8 which can be directly converted to string because byte is an alias for uint8 (and []byte => string conversion is supported by the language spec).
You can't convert slices of different types, you have to do it manually.
Question is what type of slice should we convert to? We have 2 candidates: []byte and []rune. Strings are stored as UTF-8 encoded byte sequences internally ([]byte), and a string can also be converted to a slice of runes. The language supports converting both of these types ([]byte and []rune) to string.
A rune is a unicode codepoint. And if we try to convert an int8 to a rune in a one-to-one fashion, it will fail (meaning wrong output) if the input contains characters which are encoded to multiple bytes (using UTF-8) because in this case multiple int8 values should end up in one rune.
Let's start from the string "世界" whose bytes are:
fmt.Println([]byte("世界"))
// Output: [228 184 150 231 149 140]
And its runes:
fmt.Println([]rune("世界"))
// [19990 30028]
It's only 2 runes and 6 bytes. So obviously 1-to-1 int8->rune mapping won't work, we have to go with 1-1 int8->byte mapping.
byte is alias for uint8 having range 0..255, to convert it to []int8 (having range -128..127) we have to use -256+bytevalue if the byte value is > 127 so the "世界" string in []int8 looks like this:
[-28 -72 -106 -25 -107 -116]
The backward conversion what we want is: bytevalue = 256 + int8value if the int8 is negative but we can't do this as int8 (range -128..127) and neither as byte (range 0..255) so we also have to convert it to int first (and back to byte at the end). This could look something like this:
if v < 0 {
b[i] = byte(256 + int(v))
} else {
b[i] = byte(v)
}
But actually since signed integers are represented using 2's complement, we get the same result if we simply use a byte(v) conversion (which in case of negative numbers this is equivalent to 256 + v).
Note: Since we know the length of the slice, it is much faster to allocate a slice with this length and just set its elements using indexing [] and not calling the built-in append function.
So here is the final conversion:
func B2S(bs []int8) string {
b := make([]byte, len(bs))
for i, v := range bs {
b[i] = byte(v)
}
return string(b)
}
Try it on the Go Playground.
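To see the two's complement argument in action, the following sketch round-trips the "世界" example. S2B is a made-up inverse helper added only for the demonstration; B2S is repeated so the snippet is self-contained.

```go
package main

import "fmt"

// B2S converts each int8 to the byte with the same bit pattern.
func B2S(bs []int8) string {
	b := make([]byte, len(bs))
	for i, v := range bs {
		b[i] = byte(v)
	}
	return string(b)
}

// S2B is a hypothetical inverse: it reinterprets each byte of s as an int8.
func S2B(s string) []int8 {
	out := make([]int8, len(s))
	for i := 0; i < len(s); i++ {
		out[i] = int8(s[i]) // e.g. byte 228 becomes -28
	}
	return out
}

func main() {
	raw := S2B("世界")
	fmt.Println(raw)      // [-28 -72 -106 -25 -107 -116]
	fmt.Println(B2S(raw)) // 世界
}
```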

Not entirely sure it is the fastest, but I haven't found anything better.
Change ba := []byte{} to ba := make([]byte, 0, len(bs)), so at the end you have:
func B2S(bs []int8) string {
ba := make([]byte,0, len(bs))
for _, b := range bs {
ba = append(ba, byte(b))
}
return string(ba)
}
This way the append function will never try to insert more data than fits in the slice's underlying array, and you avoid unnecessary copying to a bigger array.

What is sure from "Convert between slices of different types" is that you have to build the right slice from your original []int8.
I ended up using rune (int32 alias) (playground), assuming that the uint8 values were all simple ASCII characters. That is obviously an over-simplification and icza's answer has more on that.
Plus the SliceScan() method ended up returning []uint8 anyway.
package main
import (
"fmt"
)
func main() {
s := []int8{'a', 'b', 'c'}
b := make([]rune, len(s))
for i, v := range s {
b[i] = rune(v)
}
fmt.Println(string(b))
}
But I didn't benchmark it against using a []byte.

Use the unsafe package:
func B2S(bs []int8) string {
	// Reinterpret the []int8 slice header as []byte; string() then copies the bytes.
	return strings.TrimRight(string(*(*[]byte)(unsafe.Pointer(&bs))), "\x00")
}

How to generate a random string of a fixed length in Go?

I want a random string of characters only (uppercase or lowercase), no numbers, in Go. What is the fastest and simplest way to do this?

Paul's solution provides a simple, general solution.
The question asks for "the fastest and simplest way". Let's address the fastest part too. We'll arrive at our final, fastest code in an iterative manner. Benchmarks for each iteration can be found at the end of the answer.
All the solutions and the benchmarking code can be found on the Go Playground. The code on the Playground is a test file, not an executable. You have to save it into a file named XX_test.go and run it with
go test -bench . -benchmem
Foreword:
The fastest solution is not a go-to solution if you just need a random string. For that, Paul's solution is perfect. This answer is for when performance does matter. That said, the first 2 steps (Bytes and Remainder) might be an acceptable compromise: they cut the time per call by roughly 40% (see exact numbers in the II. Benchmark section), and they don't increase complexity significantly.
Having said that, even if you don't need the fastest solution, reading through this answer might be adventurous and educational.
I. Improvements
1. Genesis (Runes)
As a reminder, the original, general solution we're improving is this:
func init() {
rand.Seed(time.Now().UnixNano())
}
var letterRunes = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func RandStringRunes(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letterRunes[rand.Intn(len(letterRunes))]
}
return string(b)
}
2. Bytes
If the characters to choose from and assemble the random string contains only the uppercase and lowercase letters of the English alphabet, we can work with bytes only because the English alphabet letters map to bytes 1-to-1 in the UTF-8 encoding (which is how Go stores strings).
So instead of:
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
we can use:
var letters = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
Or even better:
const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
Now this is already a big improvement: letters can now be a const (there are string constants but there are no slice constants). As an extra gain, the expression len(letters) will also be a const! (The expression len(s) is constant if s is a string constant.)
And at what cost? Nothing at all. Strings can be indexed, which yields their bytes: exactly what we want.
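A small way to convince yourself of both points (constant length, byte indexing) is the sketch below. The array declaration only compiles because len(letters) is a constant expression; the variable name seen is made up for illustration.

```go
package main

import "fmt"

const letters = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

// An array length must be a constant expression, so this line would not
// compile if len(letters) were evaluated at run time.
var seen [len(letters)]bool

func main() {
	fmt.Println(len(seen))          // 52
	fmt.Println(letters[0])         // 97: indexing a string yields a byte
	fmt.Printf("%c\n", letters[26]) // A
}
```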
Our next destination looks like this:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
func RandStringBytes(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Intn(len(letterBytes))]
}
return string(b)
}
3. Remainder
Previous solutions get a random number to designate a random letter by calling rand.Intn() which delegates to Rand.Intn() which delegates to Rand.Int31n().
This is much slower compared to rand.Int63() which produces a random number with 63 random bits.
So we could simply call rand.Int63() and use the remainder after dividing by len(letterBytes):
func RandStringBytesRmndr(n int) string {
b := make([]byte, n)
for i := range b {
b[i] = letterBytes[rand.Int63() % int64(len(letterBytes))]
}
return string(b)
}
This works and is significantly faster, the disadvantage is that the probability of all the letters will not be exactly the same (assuming rand.Int63() produces all 63-bit numbers with equal probability). Although the distortion is extremely small as the number of letters 52 is much-much smaller than 1<<63 - 1, so in practice this is perfectly fine.
To make this easier to understand: let's say you want a random number in the range of 0..5. Using 3 random bits, this would produce the numbers 0..1 with double the probability of the numbers in the range 2..5. Using 5 random bits, numbers in range 0..1 would occur with 6/32 probability and numbers in range 2..5 with 5/32 probability, which is now closer to the desired uniform distribution. Increasing the number of bits makes the difference less significant; when reaching 63 bits, it is negligible.
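The 0..5 example can be checked by brute force. The sketch below enumerates every value that fits in a given number of bits and counts how often each remainder occurs; remainderCounts is a made-up helper name.

```go
package main

import "fmt"

// remainderCounts is a hypothetical helper: for every value representable
// in the given number of bits, it tallies value % n.
func remainderCounts(bits uint, n int) []int {
	counts := make([]int, n)
	for v := 0; v < 1<<bits; v++ {
		counts[v%n]++
	}
	return counts
}

func main() {
	fmt.Println(remainderCounts(3, 6)) // [2 2 1 1 1 1]
	fmt.Println(remainderCounts(5, 6)) // [6 6 5 5 5 5]
}
```

With 3 bits the first two remainders are twice as likely; with 5 bits the gap shrinks to 6 versus 5 out of 32.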
4. Masking
Building on the previous solution, we can maintain the equal distribution of letters by using only as many of the lowest bits of the random number as are required to represent the number of letters. So for example if we have 52 letters, it requires 6 bits to represent it: 52 = 110100b. So we will only use the lowest 6 bits of the number returned by rand.Int63(). And to maintain equal distribution of letters, we only "accept" the number if it falls in the range 0..len(letterBytes)-1. If the lowest bits are greater, we discard it and query a new random number.
Note that the chance of the lowest bits being greater than or equal to len(letterBytes) is less than 0.5 in general (0.25 on average), which means that even when this happens, each retry sharply reduces the chance of still not having a good number. After n repetitions, the chance that we still don't have a good index is much less than pow(0.5, n), and this is just an upper estimate. In the case of 52 letters, the chance that the 6 lowest bits are not good is only (64-52)/64 = 0.19, which means, for example, that the chance of not having a good number after 10 repetitions is about 5e-8.
So here is the solution:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
)
func RandStringBytesMask(n int) string {
b := make([]byte, n)
for i := 0; i < n; {
if idx := int(rand.Int63() & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i++
}
}
return string(b)
}
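The acceptance arithmetic behind the masking step can be reproduced with a few lines. This is only a sketch of the calculation, with made-up helper names idxBits and rejectProb; it assumes the alphabet has at least one letter.

```go
package main

import (
	"fmt"
	"math"
	"math/bits"
)

// idxBits returns how many low bits are needed to index n letters (n >= 1).
func idxBits(n int) int {
	return bits.Len(uint(n - 1))
}

// rejectProb returns the chance that a masked value falls outside 0..n-1.
func rejectProb(n int) float64 {
	span := 1 << idxBits(n)
	return float64(span-n) / float64(span)
}

func main() {
	n := 52
	p := rejectProb(n)
	fmt.Println(idxBits(n))               // 6
	fmt.Println(p)                        // 0.1875
	fmt.Printf("%.1e\n", math.Pow(p, 10)) // 5.4e-08: ten rejections in a row
}
```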
5. Masking Improved
The previous solution only uses the lowest 6 bits of the 63 random bits returned by rand.Int63(). This is a waste as getting the random bits is the slowest part of our algorithm.
If we have 52 letters, that means 6 bits code a letter index. So 63 random bits can designate 63/6 = 10 different letter indices. Let's use all those 10:
const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
const (
letterIdxBits = 6 // 6 bits to represent a letter index
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
letterIdxMax = 63 / letterIdxBits // # of letter indices fitting in 63 bits
)
func RandStringBytesMaskImpr(n int) string {
b := make([]byte, n)
// A rand.Int63() generates 63 random bits, enough for letterIdxMax letters!
for i, cache, remain := n-1, rand.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = rand.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return string(b)
}
6. Source
The Masking Improved is pretty good, not much we can improve on it. We could, but not worth the complexity.
Now let's find something else to improve. The source of random numbers.
There is a crypto/rand package which provides a Read(b []byte) function, so we could use that to get as many bytes as we need with a single call. This wouldn't help in terms of performance, as crypto/rand implements a cryptographically secure pseudorandom number generator, so it's much slower.
So let's stick to the math/rand package. The rand.Rand uses a rand.Source as the source of random bits. rand.Source is an interface which specifies a Int63() int64 method: exactly and the only thing we needed and used in our latest solution.
So we don't really need a rand.Rand (either explicit or the global, shared one of the rand package), a rand.Source is perfectly enough for us:
var src = rand.NewSource(time.Now().UnixNano())
func RandStringBytesMaskImprSrc(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return string(b)
}
Also note that this last solution doesn't require you to initialize (seed) the global Rand of the math/rand package as that is not used (and our rand.Source is properly initialized / seeded).
One more thing to note here: package doc of math/rand states:
The default Source is safe for concurrent use by multiple goroutines.
So the default source is slower than a Source that may be obtained by rand.NewSource(), because the default source has to provide safety under concurrent access / use, while rand.NewSource() does not offer this (and thus the Source returned by it is more likely to be faster).
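The flip side is that a Source from rand.NewSource() must not be shared between goroutines without synchronization. If the generator is called concurrently, one option is to guard the source with a mutex, as in the sketch below. The lockedSource type and the RandStringLocked name are made up for illustration, and the locking gives back part of the speed advantage; a per-goroutine source would avoid that.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

const (
	letterIdxBits = 6                    // 6 bits to represent a letter index
	letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
	letterIdxMax  = 63 / letterIdxBits   // # of letter indices fitting in 63 bits
)

// lockedSource is a hypothetical wrapper that serializes access to a Source.
type lockedSource struct {
	mu  sync.Mutex
	src rand.Source
}

func (l *lockedSource) Int63() int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.src.Int63()
}

var safeSrc = &lockedSource{src: rand.NewSource(time.Now().UnixNano())}

// RandStringLocked is the "Masking Improved" loop reading from safeSrc.
func RandStringLocked(n int) string {
	b := make([]byte, n)
	for i, cache, remain := n-1, safeSrc.Int63(), letterIdxMax; i >= 0; {
		if remain == 0 {
			cache, remain = safeSrc.Int63(), letterIdxMax
		}
		if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
			b[i] = letterBytes[idx]
			i--
		}
		cache >>= letterIdxBits
		remain--
	}
	return string(b)
}

func main() {
	var wg sync.WaitGroup
	for g := 0; g < 4; g++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			fmt.Println(RandStringLocked(12))
		}()
	}
	wg.Wait()
}
```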
7. Utilizing strings.Builder
All previous solutions return a string whose content is first built in a slice ([]rune in Genesis, and []byte in subsequent solutions), and then converted to string. This final conversion has to make a copy of the slice's content, because string values are immutable, and if the conversion would not make a copy, it could not be guaranteed that the string's content is not modified via its original slice. For details, see How to convert utf8 string to []byte? and golang: []byte(string) vs []byte(*string).
Go 1.10 introduced strings.Builder. strings.Builder is a new type we can use to build contents of a string similar to bytes.Buffer. Internally it uses a []byte to build the content, and when we're done, we can obtain the final string value using its Builder.String() method. But what's cool in it is that it does this without performing the copy we just talked about above. It dares to do so because the byte slice used to build the string's content is not exposed, so it is guaranteed that no one can modify it unintentionally or maliciously to alter the produced "immutable" string.
So our next idea is to not build the random string in a slice, but with the help of a strings.Builder, so once we're done, we can obtain and return the result without having to make a copy of it. This may help in terms of speed, and it will definitely help in terms of memory usage and allocations.
func RandStringBytesMaskImprSrcSB(n int) string {
sb := strings.Builder{}
sb.Grow(n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
sb.WriteByte(letterBytes[idx])
i--
}
cache >>= letterIdxBits
remain--
}
return sb.String()
}
Do note that after creating a new strings.Builder, we called its Builder.Grow() method, making sure it allocates a big-enough internal slice (to avoid reallocations as we add the random letters).
8. "Mimicking" strings.Builder with package unsafe
strings.Builder builds the string in an internal []byte, the same as we did ourselves. So basically doing it via a strings.Builder has some overhead, the only thing we switched to strings.Builder for is to avoid the final copying of the slice.
strings.Builder avoids the final copy by using package unsafe:
// String returns the accumulated string.
func (b *Builder) String() string {
return *(*string)(unsafe.Pointer(&b.buf))
}
The thing is, we can also do this ourselves, too. So the idea here is to switch back to building the random string in a []byte, but when we're done, don't convert it to string to return, but do an unsafe conversion: obtain a string which points to our byte slice as the string data.
This is how it can be done:
func RandStringBytesMaskImprSrcUnsafe(n int) string {
b := make([]byte, n)
// A src.Int63() generates 63 random bits, enough for letterIdxMax characters!
for i, cache, remain := n-1, src.Int63(), letterIdxMax; i >= 0; {
if remain == 0 {
cache, remain = src.Int63(), letterIdxMax
}
if idx := int(cache & letterIdxMask); idx < len(letterBytes) {
b[i] = letterBytes[idx]
i--
}
cache >>= letterIdxBits
remain--
}
return *(*string)(unsafe.Pointer(&b))
}
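On newer toolchains the same zero-copy conversion can be written with dedicated helpers instead of casting the slice header. The sketch below assumes Go 1.20 or later, where unsafe.String and unsafe.SliceData were added; bytesToString is a made-up helper name. The same caveat applies as above: the byte slice must not be modified after the conversion.

```go
package main

import (
	"fmt"
	"unsafe"
)

// bytesToString is a hypothetical helper (requires Go 1.20+). The returned
// string shares memory with b, so b must not be written to afterwards.
func bytesToString(b []byte) string {
	if len(b) == 0 {
		return ""
	}
	return unsafe.String(unsafe.SliceData(b), len(b))
}

func main() {
	b := []byte("random")
	s := bytesToString(b)
	fmt.Println(s, len(s)) // random 6
}
```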
(9. Using rand.Read())
Go 1.7 added a rand.Read() function and a Rand.Read() method. We might be tempted to use these to read as many bytes as we need in one step, in order to achieve better performance.
There is one small "problem" with this: how many bytes do we need? We could say: as many as the number of output letters. We would think this is an upper estimation, as a letter index uses less than 8 bits (1 byte). But at this point we are already doing worse (as getting the random bits is the "hard part"), and we're getting more than needed.
Also note that to maintain equal distribution of all letter indices, there might be some "garbage" random data that we won't be able to use, so we would end up skipping some data, and thus end up short when we go through all the byte slice. We would need to further get more random bytes, "recursively". And now we're even losing the "single call to rand package" advantage...
We could "somewhat" optimize the usage of the random data we acquire from rand.Read(). We may estimate how many bytes (bits) we'll need. 1 letter requires letterIdxBits bits, and we need n letters, so we need n * letterIdxBits / 8.0 bytes, rounded up. We can calculate the probability of a random index not being usable (see above), so we could request more so that it will "more likely" be enough (if it turns out it's not, we repeat the process). We can process the byte slice as a "bit stream" for example, for which we have a nice 3rd party lib: github.com/icza/bitio (disclosure: I'm the author).
But Benchmark code still shows we're not winning. Why is it so?
The answer to the last question is that rand.Read() uses a loop and keeps calling Source.Int63() until it fills the passed slice. That is exactly what the RandStringBytesMaskImprSrc() solution does, without the intermediate buffer and without the added complexity. That's why RandStringBytesMaskImprSrc() remains on the throne. Yes, RandStringBytesMaskImprSrc() uses an unsynchronized rand.Source unlike rand.Read(). But the reasoning still applies, as is proven if we use Rand.Read() instead of rand.Read() (the former is also unsynchronized).
II. Benchmark
All right, it's time for benchmarking the different solutions.
Moment of truth:
BenchmarkRunes-4                      2000000   723 ns/op   96 B/op   2 allocs/op
BenchmarkBytes-4                      3000000   550 ns/op   32 B/op   2 allocs/op
BenchmarkBytesRmndr-4                 3000000   438 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMask-4                  3000000   534 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImpr-4             10000000   176 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImprSrc-4          10000000   139 ns/op   32 B/op   2 allocs/op
BenchmarkBytesMaskImprSrcSB-4        10000000   134 ns/op   16 B/op   1 allocs/op
BenchmarkBytesMaskImprSrcUnsafe-4    10000000   115 ns/op   16 B/op   1 allocs/op
Just by switching from runes to bytes, we immediately have 24% performance gain, and memory requirement drops to one third.
Getting rid of rand.Intn() and using rand.Int63() instead gives another 20% boost.
Masking (and repeating in case of big indices) slows down a little (due to repetition calls): -22%...
But when we make use of all (or most) of the 63 random bits (10 indices from one rand.Int63() call): that speeds up big time: 3 times.
If we settle with a (non-default, new) rand.Source instead of rand.Rand, we again gain 21%.
If we utilize strings.Builder, we gain a tiny 3.5% in speed, but we also achieved 50% reduction in memory usage and allocations! That's nice!
Finally if we dare to use package unsafe instead of strings.Builder, we again gain a nice 14%.
Comparing the final to the initial solution: RandStringBytesMaskImprSrcUnsafe() is 6.3 times faster than RandStringRunes(), uses one sixth of the memory and half as many allocations. Mission accomplished.

You can just write code for it. This code can be a little simpler if you want to rely on the letters all being single bytes when encoded in UTF-8.
package main
import (
"fmt"
"time"
"math/rand"
)
func init() {
rand.Seed(time.Now().UnixNano())
}
var letters = []rune("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func randSeq(n int) string {
b := make([]rune, n)
for i := range b {
b[i] = letters[rand.Intn(len(letters))]
}
return string(b)
}
func main() {
fmt.Println(randSeq(10))
}

Use package uniuri, which generates cryptographically secure uniform (unbiased) strings.
Disclaimer: I'm the author of the package

Simple solution for you, with least duplicate result:
import (
"fmt"
"math/rand"
"time"
)
func randomString(length int) string {
rand.Seed(time.Now().UnixNano())
b := make([]byte, length+2)
rand.Read(b)
return fmt.Sprintf("%x", b)[2 : length+2]
}
Check it out in the PlayGround

Two possible options (there might be more of course):
You can use the crypto/rand package that supports reading random byte arrays (from /dev/urandom) and is geared towards cryptographic random generation. see http://golang.org/pkg/crypto/rand/#example_Read . It might be slower than normal pseudo-random number generation though.
Take a random number and hash it using md5 or something like this.
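
A rough sketch of both options, in case it helps. The function names secureHex and hashedHex are invented for this example, and both return hex characters rather than letters. Note that the second option is only as unpredictable as the math/rand number that is hashed, so it should not be treated as cryptographically secure.

```go
package main

import (
	"crypto/md5"
	crand "crypto/rand"
	"encoding/hex"
	"fmt"
	mrand "math/rand"
	"strconv"
)

// secureHex (hypothetical name) sketches option 1: read n bytes from
// crypto/rand and hex-encode them into 2*n characters.
func secureHex(n int) (string, error) {
	b := make([]byte, n)
	if _, err := crand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashedHex (hypothetical name) sketches option 2: hash a pseudo-random
// number with MD5. The result is always 32 hex characters.
func hashedHex() string {
	sum := md5.Sum([]byte(strconv.FormatInt(mrand.Int63(), 10)))
	return hex.EncodeToString(sum[:])
}

func main() {
	s, err := secureHex(8)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)           // 16 hex characters
	fmt.Println(hashedHex()) // 32 hex characters
}
```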

If you want cryptographically secure random numbers, and the exact charset is flexible (say, base64 is fine), you can calculate exactly how many random bytes you need from the desired output size.
Base 64 text is 1/3 longer than base 256. (2^8 vs 2^6; 8bits/6bits = 1.333 ratio)
import (
"crypto/rand"
"encoding/base64"
"math"
)
func randomBase64String(l int) string {
buff := make([]byte, int(math.Ceil(float64(l)/float64(1.33333333333))))
rand.Read(buff)
str := base64.RawURLEncoding.EncodeToString(buff)
return str[:l] // strip 1 extra character we get from odd length results
}
Note: you can also use RawStdEncoding if you prefer + and / characters to - and _
If you want hex, base 16 is 2x longer than base 256. (2^8 vs 2^4; 8bits/4bits = 2x ratio)
import (
"crypto/rand"
"encoding/hex"
"math"
)
func randomBase16String(l int) string {
buff := make([]byte, int(math.Ceil(float64(l)/2)))
rand.Read(buff)
str := hex.EncodeToString(buff)
return str[:l] // strip 1 extra character we get from odd length results
}
However, you could extend this to any arbitrary character set if you have a base256 to baseN encoder for your character set. You can do the same size calculation with how many bits are needed to represent your character set. The ratio calculation for any arbitrary charset is: ratio = 8 / log2(len(charset)).
Both of these solutions are secure, simple, should be fast, and don't waste your crypto entropy pool.
Here's the playground showing it works for any size. https://play.golang.org/p/_yF_xxXer0Z
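
The size calculation for an arbitrary charset can be written down as a small helper. This is only a sketch: randomBytesNeeded is a made-up name, and it computes the number of bytes that carry enough bits for l characters, which matches the base64 and hex cases above. For charset sizes that are not a power of two, a real baseN encoder may consume its input differently, so treat the result as an estimate.

```go
package main

import (
	"fmt"
	"math"
)

// randomBytesNeeded (hypothetical helper) returns how many random bytes
// carry enough bits for l characters from a charset of the given size.
// Each character needs log2(charsetSize) bits; each byte supplies 8 bits.
func randomBytesNeeded(l, charsetSize int) int {
	bitsPerChar := math.Log2(float64(charsetSize))
	return int(math.Ceil(float64(l) * bitsPerChar / 8))
}

func main() {
	fmt.Println(randomBytesNeeded(10, 64)) // base64: 10 * 6 / 8 = 7.5, rounded up to 8
	fmt.Println(randomBytesNeeded(10, 16)) // hex: 10 * 4 / 8 = 5
}
```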

Another version, inspired from generate password in JavaScript crypto:
package main
import (
"crypto/rand"
"fmt"
)
var chars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-"
func shortID(length int) string {
ll := len(chars)
b := make([]byte, length)
rand.Read(b) // generates len(b) random bytes
for i := 0; i < length; i++ {
b[i] = chars[int(b[i])%ll]
}
return string(b)
}
func main() {
fmt.Println(shortID(18))
fmt.Println(shortID(18))
fmt.Println(shortID(18))
}
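
One caveat about the modulo step: chars has 63 characters and 256 is not a multiple of 63, so the byte values 252 to 255 wrap around and the first four characters come up slightly more often than the rest. If that matters for your use case, rejection sampling is one way around it. A sketch, with shortIDUnbiased and idChars as invented names:

```go
package main

import (
	"crypto/rand"
	"fmt"
)

const idChars = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890-"

// shortIDUnbiased (hypothetical name) discards random bytes that would
// make some characters more likely than others under the modulo.
func shortIDUnbiased(length int) (string, error) {
	n := len(idChars)        // 63
	limit := 256 - (256 % n) // 252: largest multiple of n not above 256
	out := make([]byte, 0, length)
	buf := make([]byte, length+length/4+1) // a little extra to cover rejections
	for len(out) < length {
		if _, err := rand.Read(buf); err != nil {
			return "", err
		}
		for _, v := range buf {
			if int(v) >= limit {
				continue // rejected: keeping it would favor the first 256%n characters
			}
			out = append(out, idChars[int(v)%n])
			if len(out) == length {
				break
			}
		}
	}
	return string(out), nil
}

func main() {
	id, err := shortIDUnbiased(18)
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```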

Following icza's wonderfully explained solution, here is a modification of it that uses crypto/rand instead of math/rand.
const (
letterBytes = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" // 52 possibilities
letterIdxBits = 6 // 6 bits to represent 64 possibilities / indexes
letterIdxMask = 1<<letterIdxBits - 1 // All 1-bits, as many as letterIdxBits
)
func SecureRandomAlphaString(length int) string {
result := make([]byte, length)
bufferSize := int(float64(length)*1.3)
for i, j, randomBytes := 0, 0, []byte{}; i < length; j++ {
if j%bufferSize == 0 {
randomBytes = SecureRandomBytes(bufferSize)
}
if idx := int(randomBytes[j%bufferSize] & letterIdxMask); idx < len(letterBytes) { // index by bufferSize so no random byte is reused before the refill
result[i] = letterBytes[idx]
i++
}
}
return string(result)
}
// SecureRandomBytes returns the requested number of bytes using crypto/rand
func SecureRandomBytes(length int) []byte {
var randomBytes = make([]byte, length)
_, err := rand.Read(randomBytes)
if err != nil {
log.Fatal("Unable to generate random bytes")
}
return randomBytes
}
If you want a more generic solution, that allows you to pass in the slice of character bytes to create the string out of, you can try using this:
// SecureRandomString returns a string of the requested length,
// made from the byte characters provided (only ASCII allowed).
// Uses crypto/rand for security. Will panic if len(availableCharBytes) > 256.
func SecureRandomString(availableCharBytes string, length int) string {
// Compute bitMask
availableCharLength := len(availableCharBytes)
if availableCharLength == 0 || availableCharLength > 256 {
panic("availableCharBytes length must be greater than 0 and less than or equal to 256")
}
var bitLength byte
var bitMask byte
for bits := availableCharLength - 1; bits != 0; {
bits = bits >> 1
bitLength++
}
bitMask = 1<<bitLength - 1
// Compute bufferSize
bufferSize := length + length / 3
// Create random string
result := make([]byte, length)
for i, j, randomBytes := 0, 0, []byte{}; i < length; j++ {
if j%bufferSize == 0 {
// Random byte buffer is empty, get a new one
randomBytes = SecureRandomBytes(bufferSize)
}
// Mask bytes to get an index into the character slice
if idx := int(randomBytes[j%bufferSize] & bitMask); idx < availableCharLength { // index by bufferSize so no random byte is reused before the refill
result[i] = availableCharBytes[idx]
i++
}
}
return string(result)
}
If you want to pass in your own source of randomness, it would be trivial to modify the above to accept an io.Reader instead of using crypto/rand.
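
For what it's worth, here is one way the io.Reader variant might look. It is a sketch, not a drop-in replacement for the code above: randomStringFrom and secureRandomString are hypothetical names, and it returns an error instead of panicking.

```go
package main

import (
	"crypto/rand"
	"fmt"
	"io"
)

// randomStringFrom (hypothetical name) builds a string of the requested
// length from charset, taking its randomness from r.
// charset must be ASCII and contain between 1 and 256 characters.
func randomStringFrom(r io.Reader, charset string, length int) (string, error) {
	n := len(charset)
	if n == 0 || n > 256 {
		return "", fmt.Errorf("charset length must be in 1..256, got %d", n)
	}
	// Smallest all-ones mask that covers the indexes 0..n-1.
	mask := 1
	for mask < n {
		mask <<= 1
	}
	mask--
	result := make([]byte, 0, length)
	buf := make([]byte, length+length/3+1)
	for len(result) < length {
		if _, err := io.ReadFull(r, buf); err != nil {
			return "", err
		}
		for _, v := range buf {
			// Masked values that fall outside the charset are skipped.
			if idx := int(v) & mask; idx < n {
				result = append(result, charset[idx])
				if len(result) == length {
					break
				}
			}
		}
	}
	return string(result), nil
}

// secureRandomString (hypothetical name) plugs in crypto/rand as the source.
func secureRandomString(charset string, length int) (string, error) {
	return randomStringFrom(rand.Reader, charset, length)
}

func main() {
	s, err := secureRandomString("abcdefghijklmnopqrstuvwxyz", 12)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Any other io.Reader works the same way, for example a seeded *math/rand.Rand in tests where reproducible output is wanted.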

Here is my way. Use math/rand or crypto/rand as you wish.
func randStr(len int) string {
buff := make([]byte, len)
rand.Read(buff)
str := base64.StdEncoding.EncodeToString(buff)
// Base 64 can be longer than len
return str[:len]
}

Here is a simple and performant solution for a cryptographically secure random string.
package main
import (
"crypto/rand"
"unsafe"
"fmt"
)
var alphabet = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ")
func main() {
fmt.Println(generate(16))
}
func generate(size int) string {
b := make([]byte, size)
rand.Read(b)
for i := 0; i < size; i++ {
b[i] = alphabet[b[i] % byte(len(alphabet))]
}
return *(*string)(unsafe.Pointer(&b))
}
Benchmark
Benchmark 95.2 ns/op 16 B/op 1 allocs/op

func Rand(n int) (str string) {
b := make([]byte, n)
rand.Read(b)
str = fmt.Sprintf("%x", b)
return
}

I usually do it like this if it takes an option to capitalize or not
func randomString(length int, upperCase bool) string {
rand.Seed(time.Now().UnixNano())
var alphabet string
if upperCase {
alphabet = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
} else {
alphabet = "abcdefghijklmnopqrstuvwxyz"
}
var sb strings.Builder
l := len(alphabet)
for i := 0; i < length; i++ {
c := alphabet[rand.Intn(l)]
sb.WriteByte(c)
}
return sb.String()
}
and like this if you don't need capital letters
func randomString(length int) string {
rand.Seed(time.Now().UnixNano())
var alphabet string = "abcdefghijklmnopqrstuvwxyz"
var sb strings.Builder
l := len(alphabet)
for i := 0; i < length; i++ {
c := alphabet[rand.Intn(l)]
sb.WriteByte(c)
}
return sb.String()
}

If you are willing to add a few characters to your pool of allowed characters, you can make the code work with anything which provides random bytes through an io.Reader. Here we are using crypto/rand.
// len(encodeURL) == 64. A random byte x takes one of 256 values (0 <= x <= 255),
// and 256 is a multiple of 64, so x % 64 has an even distribution.
const encodeURL = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-_"
// RandAsciiBytes creates and fills a slice of length n with characters from
// a-zA-Z0-9_-. It panics if there are any problems getting random bytes.
func RandAsciiBytes(n int) []byte {
output := make([]byte, n)
// We will take n bytes, one byte for each character of output.
randomness := make([]byte, n)
// read all random
_, err := rand.Read(randomness)
if err != nil {
panic(err)
}
// fill output
for pos := range output {
// get random item
random := uint8(randomness[pos])
// random % 64
randomPos := random % uint8(len(encodeURL))
// put into output
output[pos] = encodeURL[randomPos]
}
return output
}
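
To see why a 64-character pool gives an even distribution while, say, a 62-character pool would not, you can count how the 256 possible byte values spread over the indexes. A small sketch; indexCounts is a name made up for this illustration:

```go
package main

import "fmt"

// indexCounts (hypothetical helper) tallies how many of the 256 possible
// byte values land on each index when reduced modulo n.
func indexCounts(n int) []int {
	counts := make([]int, n)
	for v := 0; v < 256; v++ {
		counts[v%n]++
	}
	return counts
}

func main() {
	c64 := indexCounts(64)
	c62 := indexCounts(62)
	fmt.Println(c64[0], c64[63]) // 4 4: every index is hit by exactly 4 byte values
	fmt.Println(c62[0], c62[61]) // 5 4: indexes 0..7 are hit 5 times, the rest 4 times
}
```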

This is sample code I used to generate a certificate number in my app.
func GenerateCertificateNumber() string {
CertificateLength := 7
t := time.Now().String()
CertificateHash, err := bcrypt.GenerateFromPassword([]byte(t), bcrypt.DefaultCost)
if err != nil {
fmt.Println(err)
}
// Make a Regex we only want letters and numbers
reg, err := regexp.Compile("[^a-zA-Z0-9]+")
if err != nil {
log.Fatal(err)
}
processedString := reg.ReplaceAllString(string(CertificateHash), "")
fmt.Println(string(processedString))
CertificateNumber := strings.ToUpper(string(processedString[len(processedString)-CertificateLength:]))
fmt.Println(CertificateNumber)
return CertificateNumber
}

/*
korzhao
*/
package rand
import (
crand "crypto/rand"
"math/rand"
"sync"
"time"
"unsafe"
)
// Doesn't share the rand library globally, reducing lock contention
type Rand struct {
Seed int64
Pool *sync.Pool
}
var (
MRand = NewRand()
randlist = []byte("abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890")
)
// init random number generator
func NewRand() *Rand {
p := &sync.Pool{New: func() interface{} {
return rand.New(rand.NewSource(getSeed()))
},
}
mrand := &Rand{
Pool: p,
}
return mrand
}
// get the seed
func getSeed() int64 {
return time.Now().UnixNano()
}
func (s *Rand) getrand() *rand.Rand {
return s.Pool.Get().(*rand.Rand)
}
func (s *Rand) putrand(r *rand.Rand) {
s.Pool.Put(r)
}
// get a random number
func (s *Rand) Intn(n int) int {
r := s.getrand()
defer s.putrand(r)
return r.Intn(n)
}
// bulk get random numbers
func (s *Rand) Read(p []byte) (int, error) {
r := s.getrand()
defer s.putrand(r)
return r.Read(p)
}
func CreateRandomString(len int) string {
b := make([]byte, len)
_, err := MRand.Read(b)
if err != nil {
return ""
}
for i := 0; i < len; i++ {
b[i] = randlist[b[i]%(62)]
}
return *(*string)(unsafe.Pointer(&b))
}
24.0 ns/op 16 B/op 1 allocs/op

As a follow-up to icza's brilliant solution, below I am using rand.Reader
func RandStringBytesMaskImprRandReaderUnsafe(length uint) (string, error) {
const (
charset = "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
charIdxBits = 6 // 6 bits to represent a letter index
charIdxMask = 1<<charIdxBits - 1 // All 1-bits, as many as charIdxBits
charIdxMax = 63 / charIdxBits // # of letter indices fitting in 63 bits
)
buffer := make([]byte, length)
charsetLength := len(charset)
max := big.NewInt(int64(1 << uint64(charsetLength)))
limit, err := rand.Int(rand.Reader, max)
if err != nil {
return "", err
}
for index, cache, remain := int(length-1), limit.Int64(), charIdxMax; index >= 0; {
if remain == 0 {
limit, err = rand.Int(rand.Reader, max)
if err != nil {
return "", err
}
cache, remain = limit.Int64(), charIdxMax
}
if idx := int(cache & charIdxMask); idx < charsetLength {
buffer[index] = charset[idx]
index--
}
cache >>= charIdxBits
remain--
}
return *(*string)(unsafe.Pointer(&buffer)), nil
}
func BenchmarkBytesMaskImprRandReaderUnsafe(b *testing.B) {
b.ReportAllocs()
b.ResetTimer()
const length = 16
b.RunParallel(func(pb *testing.PB) {
for pb.Next() {
RandStringBytesMaskImprRandReaderUnsafe(length)
}
})
}

package main
import (
"encoding/base64"
"fmt"
"math/rand"
"time"
)
// customEncodeURL is like `base64.encodeURL`
// except it's made up entirely of uppercase characters:
const customEncodeURL = "ABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKLMNOPQRSTUVWXYZABCDEFGHIJKL"
// Random generates a random string.
// It is not cryptographically secure.
func Random(n int) string {
b := make([]byte, n)
rand.Seed(time.Now().UnixNano())
_, _ = rand.Read(b) // docs say that it always returns a nil error.
customEncoding := base64.NewEncoding(customEncodeURL).WithPadding(base64.NoPadding)
return customEncoding.EncodeToString(b)
}
func main() {
fmt.Println(Random(16))
}

const (
chars = "0123456789_abcdefghijkl-mnopqrstuvwxyz" //ABCDEFGHIJKLMNOPQRSTUVWXYZ
charsLen = len(chars)
mask = 1<<6 - 1
)
var rng = rand.NewSource(time.Now().UnixNano())
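// RandStr returns a random string of the specified length.
func RandStr(ln int) string {
/* chars has 38 characters.
* rng.Int63() yields 63 random bits per call; we consume 6 bits (2^6=64) at a time, so each call can be used 10 times.
*/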
buf := make([]byte, ln)
for idx, cache, remain := ln-1, rng.Int63(), 10; idx >= 0; {
if remain == 0 {
cache, remain = rng.Int63(), 10
}
buf[idx] = chars[int(cache&mask)%charsLen]
cache >>= 6
remain--
idx--
}
return *(*string)(unsafe.Pointer(&buf))
}
BenchmarkRandStr16-8 20000000 68.1 ns/op 16 B/op 1 allocs/op

How to get the number of characters in a string

How can I get the number of characters of a string in Go?
For example, if I have a string "hello" the method should return 5. I saw that len(str) returns the number of bytes and not the number of characters so len("£") returns 2 instead of 1 because £ is encoded with two bytes in UTF-8.

You can try RuneCountInString from the utf8 package, which returns the number of runes in a string.
As illustrated in this script, the byte length of "World" might be 6 (when written in Chinese: "世界"), but the rune count of "世界" is 2:
package main
import "fmt"
import "unicode/utf8"
func main() {
fmt.Println("Hello, 世界", len("世界"), utf8.RuneCountInString("世界"))
}
Phrozen adds in the comments:
Actually you can do len() over runes by just type casting.
len([]rune("世界")) will print 2. At least in Go 1.3.
And with CL 108985 (May 2018, for Go 1.11), len([]rune(string)) is now optimized. (Fixes issue 24923)
The compiler detects len([]rune(string)) pattern automatically, and replaces it with for r := range s call.
Adds a new runtime function to count runes in a string.
Modifies the compiler to detect the pattern len([]rune(string))
and replaces it with the new rune counting runtime function.
name                               old time/op  new time/op  delta
RuneCount/lenruneslice/ASCII       27.8ns ± 2%  14.5ns ± 3%  -47.70%
RuneCount/lenruneslice/Japanese    126ns ± 2%   60 ns ± 2%   -52.03%
RuneCount/lenruneslice/MixedLength 104ns ± 2%   50 ns ± 1%   -51.71%
Stefan Steiger points to the blog post "Text normalization in Go"
What is a character?
As was mentioned in the strings blog post, characters can span multiple runes.
For example, an 'e' and '◌́' (acute "\u0301") can combine to form 'é' ("e\u0301" in NFD). Together these two runes are one character.
The definition of a character may vary depending on the application.
For normalization we will define it as:
a sequence of runes that starts with a starter,
a rune that does not modify or combine backwards with any other rune,
followed by possibly empty sequence of non-starters, that is, runes that do (typically accents).
The normalization algorithm processes one character at a time.
Using that package and its Iter type, the actual number of "characters" would be:
package main
import "fmt"
import "golang.org/x/text/unicode/norm"
func main() {
var ia norm.Iter
ia.InitString(norm.NFKD, "école")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Printf("Number of chars: %d\n", nc)
}
Here, this uses the Unicode Normalization form NFKD "Compatibility Decomposition"
Oliver's answer points to UNICODE TEXT SEGMENTATION as the only way to reliably determine default boundaries between certain significant text elements: user-perceived characters, words, and sentences.
For that, you need an external library like rivo/uniseg, which does Unicode Text Segmentation.
That will actually count "grapheme cluster", where multiple code points may be combined into one user-perceived character.
package main
import (
"fmt"
"github.com/rivo/uniseg"
)
func main() {
gr := uniseg.NewGraphemes("👍🏼!")
for gr.Next() {
fmt.Printf("%x ", gr.Runes())
}
// Output: [1f44d 1f3fc] [21]
}
Two graphemes, even though there are three runes (Unicode code points).
You can see other examples in "How to manipulate strings in GO to reverse them?"
👩🏾‍🦰 alone is one grapheme, but, from unicode to code points converter, 4 runes:
👩: woman (1f469)
dark skin (1f3fe)
ZERO WIDTH JOINER (200d)
🦰: red hair (1f9b0)

There is a way to get count of runes without any packages by converting string to []rune as len([]rune(YOUR_STRING)):
package main
import "fmt"
func main() {
russian := "Спутник и погром"
english := "Sputnik & pogrom"
fmt.Println("count of bytes:",
len(russian),
len(english))
fmt.Println("count of runes:",
len([]rune(russian)),
len([]rune(english)))
}
count of bytes: 30 16
count of runes: 16 16

I should point out that none of the answers provided so far give you the number of characters as you would expect, especially when you're dealing with emojis (but also some languages like Thai, Korean, or Arabic). VonC's suggestions will output the following:
fmt.Println(utf8.RuneCountInString("🏳️‍🌈🇩🇪")) // Outputs "6".
fmt.Println(len([]rune("🏳️‍🌈🇩🇪"))) // Outputs "6".
That's because these methods only count Unicode code points. There are many characters which can be composed of multiple code points.
Same for using the Normalization package:
var ia norm.Iter
ia.InitString(norm.NFKD, "🏳️‍🌈🇩🇪")
nc := 0
for !ia.Done() {
nc = nc + 1
ia.Next()
}
fmt.Println(nc) // Outputs "6".
Normalization is not really the same as counting characters and many characters cannot be normalized into a one-code-point equivalent.
masakielastic's answer comes close but only handles modifiers (the rainbow flag contains a modifier which is thus not counted as its own code point):
fmt.Println(GraphemeCountInString("🏳️‍🌈🇩🇪")) // Outputs "5".
fmt.Println(GraphemeCountInString2("🏳️‍🌈🇩🇪")) // Outputs "5".
The correct way to split Unicode strings into (user-perceived) characters, i.e. grapheme clusters, is defined in the Unicode Standard Annex #29. The rules can be found in Section 3.1.1. The github.com/rivo/uniseg package implements these rules so you can determine the correct number of characters in a string:
fmt.Println(uniseg.GraphemeClusterCount("🏳️‍🌈🇩🇪")) // Outputs "2".

If you need to take grapheme clusters into account, use the regexp or unicode package. Counting the number of code points (runes) or bytes is also needed for validation, since the length of a grapheme cluster is unlimited. If you want to eliminate extremely long sequences, check whether the sequences conform to the stream-safe text format.
package main
import (
"regexp"
"unicode"
"strings"
)
func main() {
str := "\u0308" + "a\u0308" + "o\u0308" + "u\u0308"
str2 := "a" + strings.Repeat("\u0308", 1000)
println(4 == GraphemeCountInString(str))
println(4 == GraphemeCountInString2(str))
println(1 == GraphemeCountInString(str2))
println(1 == GraphemeCountInString2(str2))
println(true == IsStreamSafeString(str))
println(false == IsStreamSafeString(str2))
}
func GraphemeCountInString(str string) int {
re := regexp.MustCompile("\\PM\\pM*|.")
return len(re.FindAllString(str, -1))
}
func GraphemeCountInString2(str string) int {
length := 0
checked := false
index := 0
for _, c := range str {
if !unicode.Is(unicode.M, c) {
length++
if checked == false {
checked = true
}
} else if checked == false {
length++
}
index++
}
return length
}
func IsStreamSafeString(str string) bool {
re := regexp.MustCompile("\\PM\\pM{30,}")
return !re.MatchString(str)
}

There are several ways to get a string length:
package main
import (
"bytes"
"fmt"
"strings"
"unicode/utf8"
)
func main() {
b := "这是个测试"
len1 := len([]rune(b))
len2 := bytes.Count([]byte(b), nil) -1
len3 := strings.Count(b, "") - 1
len4 := utf8.RuneCountInString(b)
fmt.Println(len1)
fmt.Println(len2)
fmt.Println(len3)
fmt.Println(len4)
}

Depends a lot on your definition of what a "character" is. If "rune equals a character " is OK for your task (generally it isn't) then the answer by VonC is perfect for you. Otherwise, it should be probably noted, that there are few situations where the number of runes in a Unicode string is an interesting value. And even in those situations it's better, if possible, to infer the count while "traversing" the string as the runes are processed to avoid doubling the UTF-8 decode effort.
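
As a sketch of what counting "while traversing" can mean: if you already loop over the string to process it, keep a counter in the same loop instead of calling a rune-counting function separately. The function upperAndCount below is an invented example of such processing.

```go
package main

import (
	"fmt"
	"unicode"
)

// upperAndCount (hypothetical example) upper-cases s and returns the rune
// count from the same pass, so the UTF-8 data is decoded only once.
func upperAndCount(s string) (string, int) {
	out := make([]rune, 0, len(s))
	n := 0
	for _, r := range s { // range decodes one rune per iteration
		out = append(out, unicode.ToUpper(r))
		n++
	}
	return string(out), n
}

func main() {
	up, n := upperAndCount("h\u00e9llo")
	fmt.Println(up, n) // 5 runes, although the input is 6 bytes long
}
```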

I tried to make the normalization a bit faster:
en, _ = glyphSmart(data)
func glyphSmart(text string) (int, int) {
gc := 0
dummy := 0
for ind, _ := range text {
gc++
dummy = ind
}
dummy = 0
return gc, dummy
}
