Go's String Internals: UTF-8 and Common Operations
Emily Parker
Product Engineer · Leapcell

Go's approach to strings is elegant and pragmatic. Unlike some languages that might treat strings as simple byte arrays or implicitly assume ASCII, Go natively embraces UTF-8. This design choice simplifies working with multilingual text and avoids common pitfalls related to character encoding. This article will thoroughly explore how Go represents strings internally using UTF-8 and demonstrate common and efficient ways to manipulate them.
The Immutable Nature of Go Strings
First and foremost, it's crucial to understand that Go strings are immutable. Once a string is created, its content cannot be changed. Any operation that appears to modify a string, such as concatenation or trimming, actually creates a new string. This immutability simplifies concurrency and ensures data integrity, as multiple goroutines can safely read the same string without fear of modification.
A Go string is essentially a read-only slice of bytes. Its underlying representation is a two-word data structure: a pointer to the byte array that holds the string's content and an integer representing its length.
    // Internal representation of a string (conceptual, not directly accessible)
    type StringHeader struct {
        Data uintptr // Pointer to the underlying byte array
        Len  int     // Length of the string in bytes
    }
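The two-word layout can be observed from ordinary code. The following is a minimal sketch, assuming Go 1.20 or later for unsafe.StringData; the helper names headerWords and sameStart are made up for this illustration, and the pointer comparison reflects how the standard compiler behaves rather than a documented guarantee.

```go
package main

import (
	"fmt"
	"unsafe"
)

// headerWords reports how many machine words a string value occupies.
// (Hypothetical helper for this sketch.)
func headerWords() uintptr {
	var s string
	return unsafe.Sizeof(s) / unsafe.Sizeof(uintptr(0))
}

// sameStart reports whether a and b begin at the same address in memory.
// (Hypothetical helper; unsafe.StringData requires Go 1.20+.)
func sameStart(a, b string) bool {
	return unsafe.StringData(a) == unsafe.StringData(b)
}

func main() {
	s := "hello, world"
	prefix := s[:5] // a new string value over the same bytes

	fmt.Println(headerWords())        // 2
	fmt.Println(sameStart(s, prefix)) // true
	fmt.Println(len(s), len(prefix))  // 12 5

	// s[0] = 'H' would not compile: string bytes cannot be assigned to.
}
```

Only the length differs between s and prefix; the data pointer is reused, which is why taking a substring is cheap.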
UTF-8: Go's Native Encoding
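Go's commitment to UTF-8 is fundamental. Go source files are UTF-8, so string literals written as ordinary text are UTF-8 encoded (a string value can still hold arbitrary bytes, for example through \x escapes). This means working directly with characters from various languages, such as Chinese, Japanese, or Korean, or with emojis, is seamless.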
UTF-8 is a variable-width encoding. This means different characters can take up different numbers of bytes.
- ASCII characters (U+0000 to U+007F) occupy 1 byte.
- Most European characters (e.g., 'é', 'ñ') occupy 2 bytes.
- Common CJK characters (Chinese, Japanese, Korean) occupy 3 bytes.
- Some rare characters or emojis can occupy 4 bytes.
Let's illustrate this with an example:
    package main

    import (
        "fmt"
        "unicode/utf8"
    )

    func main() {
        s1 := "hello"       // ASCII only
        s2 := "你好世界"        // Chinese characters
        s3 := "Go Gopher 🤘" // Unicode, including emoji

        fmt.Printf("String: \"%s\", Length (bytes): %d\n", s1, len(s1))
        fmt.Printf("String: \"%s\", Length (bytes): %d\n", s2, len(s2))
        fmt.Printf("String: \"%s\", Length (bytes): %d\n", s3, len(s3))

        fmt.Println("\n--- Counting Runes (Characters) ---")
        fmt.Printf("String: \"%s\", Length (runes): %d\n", s1, utf8.RuneCountInString(s1))
        fmt.Printf("String: \"%s\", Length (runes): %d\n", s2, utf8.RuneCountInString(s2))
        fmt.Printf("String: \"%s\", Length (runes): %d\n", s3, utf8.RuneCountInString(s3))
    }
Output:
String: "hello", Length (bytes): 5
String: "你好世界", Length (bytes): 12
String: "Go Gopher 🤘", Length (bytes): 14
--- Counting Runes (Characters) ---
String: "hello", Length (runes): 5
String: "你好世界", Length (runes): 4
String: "Go Gopher 🤘", Length (runes): 11
Notice the difference between len(s) and utf8.RuneCountInString(s):
- len(s) returns the number of bytes in the string.
- utf8.RuneCountInString(s) returns the number of runes (Unicode code points) in the string. This is usually what you mean when you say "length" of a string, although a single user-perceived character can still be made up of several code points.
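Rune count and "character" count can still diverge when combining marks are involved. A small sketch, assuming a hypothetical helper byteAndRuneCount, compares the precomposed "é" with the same visible letter written as "e" plus a combining accent:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns len(s) and the number of code points in s.
// (Hypothetical helper for this sketch.)
func byteAndRuneCount(s string) (bytes int, runes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combining := "e\u0301"  // e followed by U+0301 COMBINING ACUTE ACCENT

	fmt.Println(byteAndRuneCount(precomposed)) // 2 1
	fmt.Println(byteAndRuneCount(combining))   // 3 2
	fmt.Println(precomposed == combining)      // false
}
```

Both strings render as the same letter, yet they differ in bytes, in runes, and under ==. Text that must compare equal in such cases needs Unicode normalization, which is outside the standard library packages used here.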
Iterating Over Strings
Because strings are UTF-8 encoded byte sequences, indexing into them with a classic for loop yields individual bytes, not characters.
    str := "你好"
    for i := 0; i < len(str); i++ {
        fmt.Printf("Byte at index %d: %x\n", i, str[i])
    }
    // Output:
    // Byte at index 0: e4
    // Byte at index 1: bd
    // Byte at index 2: a0
    // Byte at index 3: e5
    // Byte at index 4: a5
    // Byte at index 5: bd
To iterate over Unicode code points (runes), Go provides a special for...range loop construct for strings:
    str := "你好Go 🌎"
    for i, r := range str {
        fmt.Printf("Code point '%c' (U+%04X) at byte index %d\n", r, r, i)
    }
    // Output:
    // Code point '你' (U+4F60) at byte index 0
    // Code point '好' (U+597D) at byte index 3
    // Code point 'G' (U+0047) at byte index 6
    // Code point 'o' (U+006F) at byte index 7
    // Code point ' ' (U+0020) at byte index 8
    // Code point '🌎' (U+1F30E) at byte index 9
The for...range loop correctly decodes UTF-8 sequences into rune values: i is the starting byte index of the rune, and r is the rune (an alias for int32).
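The decoding that for...range performs can be approximated by hand with utf8.DecodeRuneInString. This is a sketch of the idea rather than the runtime's actual implementation, and decodeOffsets is a name invented for the example:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeOffsets walks s one rune at a time and returns the starting byte
// index of every decoded rune. (Hypothetical helper for this sketch.)
func decodeOffsets(s string) []int {
	var offsets []int
	for i := 0; i < len(s); {
		_, size := utf8.DecodeRuneInString(s[i:])
		offsets = append(offsets, i)
		i += size // advance by the encoded width of the rune just decoded
	}
	return offsets
}

func main() {
	fmt.Println(decodeOffsets("你好Go")) // [0 3 6 7]

	// An invalid byte decodes as utf8.RuneError (U+FFFD) with width 1,
	// so the loop always makes progress.
	r, size := utf8.DecodeRuneInString("\xff")
	fmt.Println(r == utf8.RuneError, size) // true 1
}
```

Ranging over a string treats invalid bytes the same way: each one produces U+FFFD and the index advances by a single byte.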
Common String Operations
Go's standard library, particularly the strings and strconv packages, provides a rich set of functions for string manipulation.
1. String Conversion
- String to Byte Slice: A string can be converted to a []byte slice, which can then be mutated. The conversion copies the bytes into a new underlying array, so the original string is unaffected.

      s := "Hello"
      b := []byte(s)
      b[0] = 'h'             // Mutate the byte slice
      fmt.Println(string(b)) // Convert back to string (creates new string) -> "hello"

- Byte Slice to String: Converting a []byte to a string creates a new string by copying the bytes.

      b := []byte{'G', 'o'}
      s := string(b)
      fmt.Println(s) // "Go"

- String to Rune Slice: Converting a string to a []rune slice allows direct manipulation of individual characters. This also creates a new slice.

      s := "你好"
      r := []rune(s)
      r[0] = '您'             // Change the first character
      fmt.Println(string(r)) // Convert back to string -> "您好"
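A common use of the []rune conversion is reversing a string by code point. The sketch below assumes a hypothetical helper named reverseRunes and also shows that the original string is left untouched, since the conversion works on a copy:

```go
package main

import "fmt"

// reverseRunes returns s with its code points in reverse order.
// (Hypothetical helper; it reverses code points, not grapheme clusters.)
func reverseRunes(s string) string {
	r := []rune(s) // copies and decodes s into a mutable slice
	for i, j := 0, len(r)-1; i < j; i, j = i+1, j-1 {
		r[i], r[j] = r[j], r[i]
	}
	return string(r) // re-encodes the runes as a new UTF-8 string
}

func main() {
	s := "你好Go"
	fmt.Println(reverseRunes(s)) // oG好你
	fmt.Println(s)               // 你好Go (unchanged)
}
```

Reversing the bytes instead would scramble the multi-byte sequences and produce invalid UTF-8.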
2. Concatenation
String concatenation in Go creates a new string. While the + operator is convenient for small numbers of concatenations, it can be inefficient for many operations due to repeated memory allocations and copies.
Inefficient Concatenation:
    var s string
    for i := 0; i < 1000; i++ {
        s += "a" // Each += creates a new string
    }
    // This performs roughly 1000 string allocations, each one copying
    // everything built so far.
Efficient Concatenation with strings.Builder:
For building strings iteratively, strings.Builder is highly recommended. It minimizes reallocations by maintaining an internal byte buffer.
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        var sb strings.Builder
        sb.Grow(1000) // Optional: Pre-allocate capacity if you know the approximate final size
        for i := 0; i < 1000; i++ {
            sb.WriteString("a")
        }
        finalString := sb.String()
        fmt.Println("Length of built string:", len(finalString))
        // This performs far fewer allocations and copies, resulting in better performance.
    }
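When the parts are already in a slice, strings.Join does the same job in one call. The sketch below places it next to a hand-written equivalent, assuming a hypothetical helper joinWithBuilder, to show what pre-sizing the builder looks like in practice:

```go
package main

import (
	"fmt"
	"strings"
)

// joinWithBuilder concatenates parts with sep between them, sizing the
// builder once up front. (Hypothetical helper for this sketch.)
func joinWithBuilder(parts []string, sep string) string {
	if len(parts) == 0 {
		return ""
	}
	// Compute the exact final size so Grow reserves it in one step.
	n := len(sep) * (len(parts) - 1)
	for _, p := range parts {
		n += len(p)
	}
	var sb strings.Builder
	sb.Grow(n)
	for i, p := range parts {
		if i > 0 {
			sb.WriteString(sep)
		}
		sb.WriteString(p)
	}
	return sb.String()
}

func main() {
	parts := []string{"a", "b", "c"}
	fmt.Println(joinWithBuilder(parts, ", ")) // a, b, c
	fmt.Println(strings.Join(parts, ", "))    // a, b, c
}
```

In everyday code strings.Join is the simpler choice; a builder is more useful when pieces arrive incrementally or need formatting along the way.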
3. Substring Extraction
Because strings are byte sequences, slicing produces a new string value that shares the underlying byte array with the original; no bytes are copied. However, be cautious with byte indices when dealing with multi-byte runes.
    s := "你好世界" // 12 bytes, 4 runes

    sub1 := s[0:6] // "你好" - Correct for the first two runes (3 bytes each)
    sub2 := s[0:7] // "你好" plus one stray byte - Incorrect, splits a multi-byte rune; prints with a replacement character '�'
    fmt.Println(sub1)
    fmt.Println(sub2)

    // To get a substring by rune count or for safe slicing, convert to []rune:
    r := []rune(s)
    subRune1 := string(r[0:2]) // "你好"
    subRune2 := string(r[2:])  // "世界"
    fmt.Println(subRune1)
    fmt.Println(subRune2)
Important: Direct slicing s[start:end] always works on byte indices. If start or end falls in the middle of a multi-byte UTF-8 sequence, the resulting substring will contain invalid UTF-8 and display replacement characters when printed. For robust, character-aware slicing, convert to []rune first.
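Converting to []rune copies the whole string. An alternative, sketched here under the assumption of a hypothetical helper runeSubstring, is to walk the string with for...range to find the byte offsets and then slice the original:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeSubstring returns the code points of s in the half-open range
// [start, end), counted in runes. Out-of-range bounds are clamped.
// (Hypothetical helper for this sketch.)
func runeSubstring(s string, start, end int) string {
	if start < 0 {
		start = 0
	}
	if end <= start {
		return ""
	}
	startByte, endByte := len(s), len(s)
	n := 0 // index of the rune that begins at byte offset i
	for i := range s {
		if n == start {
			startByte = i
		}
		if n == end {
			endByte = i
			break
		}
		n++
	}
	return s[startByte:endByte]
}

func main() {
	s := "你好世界"
	fmt.Println(runeSubstring(s, 0, 2))      // 你好
	fmt.Println(runeSubstring(s, 1, 3))      // 好世
	fmt.Println(runeSubstring(s, 2, 100))    // 世界
	fmt.Println(utf8.ValidString(s[0:7]))    // false: the cut lands inside a rune
}
```

The result shares memory with s, so no rune slice is allocated; the cost is a linear scan up to the end position.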
4. Searching and Replacing
The strings package offers various functions for searching and replacing:
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        text := "Go is a great language. Go is simple."

        // Contains
        fmt.Println("Contains 'Go':", strings.Contains(text, "Go")) // true

        // Index (results are byte offsets)
        fmt.Println("Index of 'great':", strings.Index(text, "great"))   // 8
        fmt.Println("Last index of 'Go':", strings.LastIndex(text, "Go")) // 24

        // HasPrefix, HasSuffix
        fmt.Println("Starts with 'Go':", strings.HasPrefix(text, "Go"))         // true
        fmt.Println("Ends with 'simple.':", strings.HasSuffix(text, "simple.")) // true

        // Replace
        newText := strings.Replace(text, "Go", "Golang", 1) // Replace first occurrence
        fmt.Println("Replaced once:", newText)              // Golang is a great language. Go is simple.

        newTextAll := strings.ReplaceAll(text, "Go", "Golang") // Replace all occurrences
        fmt.Println("Replaced all:", newTextAll)               // Golang is a great language. Golang is simple.
    }
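The positions returned by strings.Index are byte offsets, which matters as soon as the text is not pure ASCII. The following sketch, assuming a hypothetical helper runeIndex, converts a byte offset into a rune offset:

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// runeIndex behaves like strings.Index but reports the position in runes.
// It returns -1 when substr is not present. (Hypothetical helper.)
func runeIndex(s, substr string) int {
	i := strings.Index(s, substr)
	if i < 0 {
		return -1
	}
	// Count the code points that precede the match.
	return utf8.RuneCountInString(s[:i])
}

func main() {
	s := "你好世界"
	fmt.Println(strings.Index(s, "世")) // 6 (byte offset)
	fmt.Println(runeIndex(s, "世"))     // 2 (rune offset)
	fmt.Println(runeIndex(s, "Go"))    // -1
}
```

The byte offset is the one to use for slicing (s[6:] is "世界"); the rune offset is mainly useful for reporting positions to people.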
5. Case Conversion
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := "Hello World"
        fmt.Println(strings.ToLower(s)) // hello world
        fmt.Println(strings.ToUpper(s)) // HELLO WORLD

        // strings.ToUpper/ToLower are Unicode-aware but apply language-neutral rules.
        // For language-specific casing (e.g., Turkish dotted/dotless 'i'), use
        // strings.ToUpperSpecial/ToLowerSpecial with unicode.TurkishCase.
    }
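To make the Turkish case concrete, here is a short sketch using strings.ToUpperSpecial with unicode.TurkishCase, plus strings.EqualFold for case-insensitive comparison. The wrapper turkishUpper is a name chosen for the example:

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// turkishUpper upper-cases s using Turkish casing rules, where a dotted
// 'i' maps to 'İ' (U+0130). (Hypothetical wrapper for this sketch.)
func turkishUpper(s string) string {
	return strings.ToUpperSpecial(unicode.TurkishCase, s)
}

func main() {
	fmt.Println(strings.ToUpper("istanbul")) // ISTANBUL
	fmt.Println(turkishUpper("istanbul"))    // İSTANBUL

	// EqualFold compares under Unicode case folding without
	// allocating upper- or lower-cased copies.
	fmt.Println(strings.EqualFold("Go", "GO")) // true
}
```

For comparisons, strings.EqualFold is usually preferable to lowering both sides and using ==, since it avoids building two temporary strings.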
6. Trimming
Remove leading/trailing whitespace or specified characters.
    package main

    import (
        "fmt"
        "strings"
    )

    func main() {
        s := " Hello World \n"
        fmt.Printf("Trimmed space: \"%s\"\n", strings.TrimSpace(s)) // "Hello World"

        s2 := "abccbaHelloabccba"
        // Trim characters from the beginning and end based on the cutset
        fmt.Printf("Trimmed cutset: \"%s\"\n", strings.Trim(s2, "abc")) // "Hello"
    }
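Because Trim, TrimLeft, and TrimRight take a set of characters rather than a substring, they are easy to misuse for removing a fixed prefix or suffix. This sketch contrasts them with strings.TrimPrefix and strings.TrimSuffix; stripSuffix is a wrapper named for the example:

```go
package main

import (
	"fmt"
	"strings"
)

// stripSuffix removes one exact trailing occurrence of suffix, if present.
// (Hypothetical wrapper around strings.TrimSuffix.)
func stripSuffix(name, suffix string) string {
	return strings.TrimSuffix(name, suffix)
}

func main() {
	// TrimRight treats ".go" as the set {'.', 'g', 'o'} and keeps removing
	// trailing characters from that set, including the 'o' in "hello".
	fmt.Println(strings.TrimRight("hello.go", ".go")) // hell

	// TrimSuffix removes the literal suffix exactly once.
	fmt.Println(stripSuffix("hello.go", ".go")) // hello

	// TrimPrefix is the counterpart for the start of the string.
	fmt.Println(strings.TrimPrefix("v1.2.3", "v")) // 1.2.3
}
```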
Performance Considerations
While Go makes working with strings straightforward, understanding the underlying mechanics helps in writing performant code:
- Immutability and Copies: Operations that produce new content (concatenation, replacement, conversion to and from []byte or []rune) allocate a new string and copy bytes. Slicing is the exception: it reuses the underlying byte array. Frequent allocation in performance-critical loops adds memory and garbage collection overhead.
- strings.Builder for building strings: Prefer strings.Builder for constructing strings from many smaller parts.
- []byte vs. string conversions: Converting between string and []byte generally involves copying data. If you're building a string that you only need to process as bytes, consider sticking to []byte throughout the operation.
- Rune-aware vs. Byte-wise operations: Operations on []rune slices are often more computationally expensive than basic byte-wise operations because they involve UTF-8 decoding and encoding. Choose the right tool for the job. If you just need to work with bytes (e.g., network protocols, file serialization), use []byte. If you need to manipulate characters, use []rune or for...range for strings.
- Benchmarking: When performance is paramount, always benchmark your string operations to understand their actual impact.
    // Save this in a file whose name ends in _test.go so that go test picks it up.
    package main

    import (
        "bytes"
        "strings"
        "testing"
    )

    func benchmarkConcatenation(b *testing.B, strategy string) {
        s := "some_string_part_"
        num := 1000 // Number of concatenations

        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            switch strategy {
            case "plus":
                result := ""
                for j := 0; j < num; j++ {
                    result += s
                }
                _ = result
            case "strings.Builder":
                var sb strings.Builder
                sb.Grow(len(s) * num) // Optimize by pre-allocating
                for j := 0; j < num; j++ {
                    sb.WriteString(s)
                }
                _ = sb.String()
            case "bytes.Buffer": // Another alternative, less common for pure strings
                var buf bytes.Buffer
                buf.Grow(len(s) * num)
                for j := 0; j < num; j++ {
                    buf.WriteString(s)
                }
                _ = buf.String()
            }
        }
    }

    func BenchmarkConcatenationPlus(b *testing.B) {
        benchmarkConcatenation(b, "plus")
    }

    func BenchmarkConcatenationStringsBuilder(b *testing.B) {
        benchmarkConcatenation(b, "strings.Builder")
    }

    func BenchmarkConcatenationBytesBuffer(b *testing.B) {
        benchmarkConcatenation(b, "bytes.Buffer")
    }

    // How to run this benchmark:
    // go test -bench=. -benchmem -run=none

    // Example output (illustrative; exact figures depend on the machine,
    // Go version, and input sizes):
    // goos: darwin
    // goarch: arm64
    // pkg: example/string_bench
    // BenchmarkConcatenationPlus-8               162     7077677 ns/op    799981 B/op    1000 allocs/op
    // BenchmarkConcatenationStringsBuilder-8   19782       59114 ns/op      4088 B/op       4 allocs/op
    // BenchmarkConcatenationBytesBuffer-8      18042       67073 ns/op      4088 B/op       4 allocs/op
The benchmark results clearly demonstrate the significant performance advantage of strings.Builder (and bytes.Buffer) over repeated + concatenation, especially in terms of allocations and memory usage.
Conclusion
Go's string handling is a strong testament to its design philosophy: simplicity, safety, and efficiency. By standardizing on UTF-8, it sidesteps many common pitfalls of internationalization. Understanding that strings are immutable, byte-oriented slices internally, and judiciously using the for...range loop for character iteration or strings.Builder for efficient construction, empowers Go developers to write robust and performant code for any textual data. Embrace Go's string model, and you'll find working with text a far more pleasant experience.