Character Types in Golang

Introduction

In the previous section, we discussed commonly used numeric types. In this section, we will learn about character types in Go.

Knowledge Points:

ASCII Encoding
UTF-8 Encoding
Unicode Character Set
byte
rune

ASCII Encoding

In the early days of computing, ASCII (American Standard Code for Information Interchange) was the standard encoding format. It represents each character using 7 bits and can therefore represent 128 (2^7) characters. Codes 0 to 31 and code 127 are control characters that cannot be displayed, while codes 32 to 126 are the space, uppercase and lowercase letters, digits, and punctuation marks. See the table for details.
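The ranges described above can be checked with a few lines of Go. This is only a minimal sketch; the helper names isASCIIControl and isASCIIPrintable are made up for illustration and are not part of the standard library.

```go
package main

import "fmt"

// isASCIIControl reports whether b is a non-printable ASCII control
// code (0 to 31, or 127). Hypothetical helper for illustration.
func isASCIIControl(b byte) bool {
	return b < 32 || b == 127
}

// isASCIIPrintable reports whether b is a printable ASCII code (32 to 126).
// Hypothetical helper for illustration.
func isASCIIPrintable(b byte) bool {
	return b >= 32 && b <= 126
}

func main() {
	// A few sample codes: NUL, line feed, space, 'A', '~', DEL.
	for _, b := range []byte{0, 10, 32, 65, 126, 127} {
		fmt.Printf("%3d control=%t printable=%t\n", b, isASCIIControl(b), isASCIIPrintable(b))
	}
}
```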

As computers evolved, the need to support different languages arose. ASCII encoding was inadequate for this purpose. Consequently, different languages developed their own encoding formats, such as GB2312 for Simplified Chinese, EUC-KR for Korean, and KOI8-R for Russian.

However, with many language families in the world, it became necessary to have a single encoding format that could unify all languages. Unicode was created to fulfill this need.

Unicode Character Set

In 1991, the Unicode Consortium released the first version of the Unicode character set. Its goal was to unify all languages into a single encoding format, enabling computers worldwide to display and process text more easily and avoiding compatibility issues in multilingual environments.

However, Unicode was only a character set; it defined character codes but not how they were stored. This led to difficulties in widespread adoption for a long time, until the rise of the Internet.

UTF-8 Encoding

As the Internet grew, UTF-8, an encoding that implements Unicode, gained popularity. It is a variable-length encoding: different characters occupy different numbers of bytes, from 1 to 4.

For example, English letters fall within the ASCII range and are represented by 1 byte. The character 'y' (Unicode value 121) takes up 1 byte.

Most everyday Chinese characters take up 3 bytes. For example, the character '实' (Unicode value 23454) takes up 3 bytes.

However, some Chinese characters take up 4 bytes. There are over 100,000 Chinese characters, but according to the diagram below, 3 bytes can represent only slightly over 60,000 characters (code points up to U+FFFF), so a small number of Chinese characters require 4 bytes.
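The byte lengths mentioned above can be confirmed with utf8.RuneLen from the standard library. In this sketch, encodedLen is only a thin illustrative wrapper, and U+20000 (a Chinese character from a less common block) is picked here as an example of a 4-byte character; it does not appear in the text above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns how many bytes UTF-8 needs to store the code point r.
// Illustrative wrapper around utf8.RuneLen.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	fmt.Println(encodedLen('y'))          // 1
	fmt.Println(encodedLen('实'))          // 3
	fmt.Println(encodedLen('\U00020000')) // 4
}
```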

Another advantage of UTF-8 encoding is that it is backward compatible with ASCII encoding. In fact, ASCII is a subset of UTF-8. The first 128 characters in UTF-8 correspond one-to-one with ASCII characters. This means that software originally using ASCII can continue to be used with little or no modification. Because of these advantages, UTF-8 has gradually become the preferred encoding format.
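To see this compatibility in practice, the sketch below prints the bytes of an ASCII-only string, which are exactly the ASCII codes of its letters. The helper isASCII is a hypothetical name; it relies on utf8.RuneSelf (0x80), the standard library constant below which every byte is a complete one-byte character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII reports whether every byte of s is below utf8.RuneSelf (0x80),
// meaning the UTF-8 text is also valid ASCII. Hypothetical helper.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println([]byte("Go"))    // [71 111], the ASCII codes of 'G' and 'o'
	fmt.Println(isASCII("Go"))   // true
	fmt.Println(isASCII("Go语言")) // false
}
```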

Two of Go's creators, Rob Pike and Ken Thompson, also invented UTF-8, so Go has a special affinity for it. Go requires source code files to be saved in UTF-8 encoding, and UTF-8 is the preferred encoding when working with text. Moreover, the standard library provides many functions for UTF-8 encoding and decoding.
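As a brief illustration of that standard library support, the following sketch uses three functions from the unicode/utf8 package. The sample string and the wrapper name firstRune are chosen here only for demonstration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune decodes the first character of s and reports its width in bytes.
// Illustrative wrapper around utf8.DecodeRuneInString.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "实Go"

	fmt.Println(len(s))                    // 5 bytes: 3 + 1 + 1
	fmt.Println(utf8.RuneCountInString(s)) // 3 characters

	r, size := firstRune(s)
	fmt.Printf("%c %d\n", r, size) // 实 3

	fmt.Println(utf8.ValidString(s)) // true
}
```

Note that len counts bytes, not characters, which is why the two numbers differ.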

byte and rune

byte is an alias for uint8 and occupies one byte (8 bits). It can hold 256 (2^8) distinct values, 0 to 255, which is enough for every character in the ASCII table. Characters outside that range, such as Chinese characters, do not fit in a byte, so we use the rune type for them.
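The size limit can be made concrete with a small sketch. The helper fitsInByte is hypothetical; it simply compares a code point against math.MaxUint8 (255). The last line shows that the same one-character string is three byte values but a single rune value.

```go
package main

import (
	"fmt"
	"math"
)

// fitsInByte reports whether the code point r can be stored in a byte.
// Hypothetical helper for illustration.
func fitsInByte(r rune) bool {
	return r >= 0 && r <= math.MaxUint8
}

func main() {
	fmt.Println(fitsInByte('L')) // true: 'L' is 76
	fmt.Println(fitsInByte('实')) // false: '实' is 23454

	s := "实"
	fmt.Println(len([]byte(s)), len([]rune(s))) // 3 1
}
```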

Create a new file called byte.go and enter the following code:

cd ~/project
touch byte.go

package main

import "fmt"

func main() {
    var a byte = 76
    fmt.Printf("Value of a: %c\n", a)

    var b uint8 = 76
    fmt.Printf("Value of b: %c\n", b)

    var c byte = 'L'
    fmt.Printf("Value of c: %c\n", c)
}

After running the program, the following result will be output:

go run byte.go

Value of a: L
Value of b: L
Value of c: L

The %c placeholder outputs the character for a given value. The byte and uint8 variables produce the same output when their values are the same. Referring to the ASCII table, the value of the letter 'L' is 76; if we used the integer placeholder %d instead, the output would be 76.

Therefore, byte in Go is equivalent to the integer type uint8. The same applies to rune, which is equivalent to int32 and so covers a different range of integer values.

rune is an alias for int32 and occupies four bytes (32 bits). It holds a single Unicode code point, so it can represent any Unicode character, including Chinese characters and emoji.

Update the byte.go file with the following code:

package main

import "fmt"

func main() {
    var a rune = '😊' // Smile emoji
    fmt.Printf("Value of a: %c\n", a)

    var b int32 = 9829 // Decimal representation of a Unicode character (Heart symbol)
    fmt.Printf("Value of b: %c\n", b)
    var c rune = 0x1F496 // Hexadecimal representation of a Unicode character (Sparkling heart emoji)
    fmt.Printf("Value of c: %c\n", c)

    var d rune = '\u0041' // Unicode character represented by its code point (Capital letter 'A')
    fmt.Printf("Value of d: %c\n", d)
    var e rune = '\U0001F609' // Unicode character represented by its code point (Winking face emoji)
    fmt.Printf("Value of e: %c\n", e)
}

After running the program, the following output will be displayed:

go run byte.go

Note: Run the program in the Desktop or WebIDE Terminal, but avoid running it in the Terminal Tab located at the top of the LabEx VM.

Value of a: 😊
Value of b: ♥
Value of c: 💖
Value of d: A
Value of e: 😉

Variable a represents the smile emoji '😊'.
Variable b is initialized with the decimal representation of a Unicode character (9829), which corresponds to the heart symbol '♥'.
Variable c is initialized with the hexadecimal representation of a Unicode character (0x1F496), which corresponds to the sparkling heart emoji '💖'.
Variable d represents the capital letter 'A' using the Unicode format \u0041.
Variable e represents the winking face emoji '😉' using the \U format with the code point 0001F609.
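All of these notations are just different spellings of an integer code point. The sketch below compares several of them and prints the conventional U+XXXX form with the %U placeholder; the helper name codePoint is an assumption made for this example.

```go
package main

import "fmt"

// codePoint formats r in the conventional U+XXXX notation.
// Illustrative helper built on the %U placeholder.
func codePoint(r rune) string {
	return fmt.Sprintf("%U", r)
}

func main() {
	// Decimal, hexadecimal, and \u escape all name the heart symbol.
	fmt.Println(9829 == '♥', 0x2665 == '♥', '\u2665' == '♥') // true true true
	fmt.Println(codePoint(9829))                             // U+2665

	// The same holds for code points that need the longer \U escape.
	fmt.Println(0x1F496 == '\U0001F496')  // true
	fmt.Println(codePoint('\U0001F496')) // U+1F496
}
```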

Note: In Go, single quotes and double quotes are not the same. Single quotes denote a single character (a rune literal), while double quotes denote a string. Use single quotes when assigning a character to a byte or rune variable; assigning a double-quoted string instead causes a compile error.
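The difference between the two kinds of quotes shows up in the types involved. The following is a small sketch; describe is a hypothetical helper, and the commented-out line shows the kind of assignment that the compiler rejects.

```go
package main

import "fmt"

// describe returns the type that fmt reports for v. Hypothetical helper.
func describe(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	c := 'A' // single quotes: a rune literal
	s := "A" // double quotes: a string

	fmt.Println(describe(c), describe(s)) // int32 string
	fmt.Println(string(c) == s)           // true: converting the rune gives the same text
	fmt.Println(s[0] == byte(c))          // true: the string's only byte is 65

	// var b byte = "A" // would not compile: a string cannot be assigned to a byte
}
```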

Quiz

Now, let's reinforce what we have learned. Create a new file called rune.go and enter the following code. Complete the code so that the hexadecimal number 0x1F648 is assigned to the variable a and the program correctly outputs its value.

Requirements: The file rune.go should be placed in the ~/project directory.
Hint: If you write the value as a character literal, code points longer than four hexadecimal digits need the \U escape with eight digits, for example '\U0001F648'.

package main

import "fmt"

func main() {
    var a rune = 0x1F648
    fmt.Printf("The value of a is: %c\n", a)
}

Summary

Let's recap what we have learned in this section:

ASCII uses 7 bits per character, stored in one byte, and can represent 128 characters.
UTF-8 is an encoding of the Unicode character set. It is variable-length: a character takes 1 to 4 bytes depending on its code point.
The byte data type can be used to represent ASCII characters, while the rune data type can be used to represent Unicode characters.

In this section, we first explained ASCII, UTF-8, and Unicode. Then, we explained the relationship between the character data types byte and rune and the integer data types.