Go: Split Long Text By Byte Count

Motivation: Need to split text

I'm not a professional Go developer, but I do like Go. I taught myself some of the language by contributing to open source projects, and since then Go has become my language of choice for personal projects.

Recently, I encountered an interesting problem while working on one of my personal projects: I needed to split a long text by byte count. The desired result is a slice of strings in which each element is less than or equal to the byte size that's passed in.

Naive/Wrong approach: Split by number of characters

At first glance, I thought I could just use the character count, which assumes a single character always takes one byte. Well, this turned out to be a terrible idea.

Background: UTF-8 and Bytes

In Go (as well as in most major programming languages), text is usually encoded as UTF-8. According to this brilliant article, UTF-8 stores text in 8-bit bytes:

In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

Conveniently, all the English characters fit into the 0-127 range, which is exactly the range ASCII uses. Therefore, if all the characters in a string are English characters, the assumption of 1 byte per character happens to hold.

However, it is wrong to make that assumption in general. In my case, I need to deal with Chinese characters, which complicates things.

Discovery: String and Rune in Go

In Go, a string is in effect a read-only slice of bytes. This implies that if you index into a string in a simple for loop and print each element, you won't necessarily get what you expect.

Take the following program as an example:

package main

import "fmt"

func main() {
	const sample = "Hello"

	for i := 0; i < len(sample); i++ {
		fmt.Printf("% x", sample[i])
	}
}

The above program prints 48 65 6c 6c 6f, the hex representation of each byte in the string. If you convert them to decimal, you get 72 101 108 108 111; each character fits in 8 bits, i.e. 1 byte.

However, this will not be true for Chinese characters or any other non-Latin characters. As an example, consider the following program:

package main

import "fmt"

func main() {
	const sample = "國語"

	for i := 0; i < len(sample); i++ {
		fmt.Printf("% x", sample[i])
	}
}

The above code will print e5 9c 8b e8 aa 9e. As you can see, these two Chinese characters take a total of 6 bytes (!), meaning each one of them takes 3 bytes.

There are several implications to this:

  1. If you want to create chunks from a string, you can't simply count bytes and cut when the given limit is reached; doing so will most likely produce an invalid Unicode string. To see why, consider the string 汽,車 (汽車 means car in Chinese; here it has an ASCII comma in the middle).

    Say we want to create chunks of 3 bytes each. The first chunk, 汽, takes exactly 3 bytes. The comma takes 1 byte and the character 車 after it takes 3 bytes, so if we just add bytes up and cut, the second chunk combines the comma's single byte with only the first two of 車's three bytes, and the result is an invalid string.

  2. In UTF-8 encoding (as defined today in RFC 3629), a single character takes at most 4 bytes, but you can't know how many bytes a character actually takes without decoding it. Fortunately, Go's range loop reads one UTF-8-encoded rune per iteration, and the unicode/utf8 package provides a handy utf8.RuneLen function to calculate a rune's length in bytes.

    As an example:

    package main

    import (
    	"fmt"
    	"unicode/utf8"
    )

    func main() {
    	const sample = "汽,車"

    	for index, runeValue := range sample {
    		fmt.Printf("%#U starts at byte position %d and occupies %d bytes\n", runeValue, index, utf8.RuneLen(runeValue))
    	}
    }


    The result will be as follows:

    U+6C7D '汽' starts at byte position 0 and occupies 3 bytes
    U+002C ',' starts at byte position 3 and occupies 1 bytes
    U+8ECA '車' starts at byte position 4 and occupies 3 bytes

Solution

Based on the above discoveries, I developed my own solution for creating chunks of a string. Below is my solution:

package main

import (
	"errors"
	"unicode/utf8"
)

func chunksByte(s string, chunkSize int) ([]string, error) {
	if len(s) <= chunkSize {
		return []string{s}, nil
	}

	currentLen, currentStart := 0, 0

	chunks := make([]string, 0)

	for i, ch := range s {
		runeLen := utf8.RuneLen(ch)
		if runeLen == -1 {
			continue
		}

		// We can't safely create chunks if any single rune is larger
		// than the chunk size, since such a rune can never fit.
		if runeLen > chunkSize {
			return nil, errors.New("rune size larger than chunk size")
		}

		currentLen += runeLen
		if currentLen > chunkSize {
			chunks = append(chunks, s[currentStart:i])
			currentLen = runeLen
			currentStart = i
		}
	}

	// Always append the last chunk.
	chunks = append(chunks, s[currentStart:])

	return chunks, nil
}

This is by no means an efficient (or production-ready) solution, so use/copy it at your own risk, but it works for me.

Here are some of the test cases I created, and they all worked as expected.

package main

import "fmt"

func main() {
	fmt.Println(chunksByte("abcd", 2))
	//[ab cd] <nil>
	//byte size for each item: [2 2]

	fmt.Println(chunksByte("汽", 2))
	//[] rune size larger than chunk size

	fmt.Println(chunksByte("汽a,車,車", 3))
	//[汽 a, 車 , 車] <nil>
	//byte size for each item: [3 2 3 1 3]

	fmt.Println(chunksByte("汽車,車", 3))
	//[汽 車 , 車] <nil>
	//byte size for each item: [3 3 1 3]
}
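Beyond eyeballing the printed chunks, two properties should always hold for any correct chunker: every chunk is valid UTF-8 and no longer than the limit, and the chunks concatenate back to the original input. Here is a sketch of such a check (my own addition, not from the project); chunksByte is inlined, with a per-rune size guard, so the snippet runs on its own:

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// chunksByte splits s into chunks of at most chunkSize bytes,
// never cutting a rune in half.
func chunksByte(s string, chunkSize int) ([]string, error) {
	if len(s) <= chunkSize {
		return []string{s}, nil
	}
	currentLen, currentStart := 0, 0
	chunks := make([]string, 0)
	for i, ch := range s {
		runeLen := utf8.RuneLen(ch)
		if runeLen > chunkSize {
			return nil, errors.New("rune size larger than chunk size")
		}
		currentLen += runeLen
		if currentLen > chunkSize {
			chunks = append(chunks, s[currentStart:i])
			currentLen = runeLen
			currentStart = i
		}
	}
	chunks = append(chunks, s[currentStart:])
	return chunks, nil
}

// checkChunks verifies that every chunk is valid UTF-8 and within the
// size limit, and that the chunks rebuild the original input.
func checkChunks(s string, chunkSize int) error {
	chunks, err := chunksByte(s, chunkSize)
	if err != nil {
		return err
	}
	joined := ""
	for _, c := range chunks {
		if !utf8.ValidString(c) {
			return fmt.Errorf("invalid UTF-8 chunk %q", c)
		}
		if len(c) > chunkSize {
			return fmt.Errorf("chunk %q exceeds %d bytes", c, chunkSize)
		}
		joined += c
	}
	if joined != s {
		return errors.New("chunks do not rebuild the input")
	}
	return nil
}

func main() {
	for _, s := range []string{"abcd", "汽a,車,車", "汽車,車"} {
		fmt.Println(checkChunks(s, 3)) // <nil> for each input
	}
}
```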

References:

  • https://go.dev/blog/strings
  • https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses