Introduction
In the previous section, we discussed commonly used numeric types. In this section, we will learn about character types in Go.
Knowledge Points:
- ASCII Encoding
- UTF-8 Encoding
- Unicode Character Set
- byte
- rune
In the early days of computing, ASCII (American Standard Code for Information Interchange) was the standard encoding. It uses 7 bits per character and can therefore represent 128 (2^7) characters. Codes 0 to 31 and code 127 are control characters that cannot be displayed, while codes 32 to 126 are the printable characters: the space, uppercase and lowercase letters, digits, and punctuation marks. See the table for details.
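The split between control codes and printable codes can be checked with a short program. This is a minimal sketch; the helper name isPrintableASCII is hypothetical and exists only for this illustration.

```go
package main

import "fmt"

// isPrintableASCII is a hypothetical helper for this sketch: it reports
// whether an ASCII code is in the printable range 32..126.
func isPrintableASCII(code int) bool {
	return code >= 32 && code <= 126
}

func main() {
	printable := 0
	for code := 0; code < 128; code++ { // 7 bits give 2^7 = 128 codes
		if isPrintableASCII(code) {
			printable++
		}
	}
	fmt.Println("printable codes:", printable)   // 95
	fmt.Println("control codes:", 128-printable) // 33
}
```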
As computers evolved, the need to support different languages arose. ASCII encoding was inadequate for this purpose. Consequently, different languages developed their own encoding formats, such as GB2312 for Simplified Chinese, EUC-KR for Korean, and KOI8-R for Russian.
However, with many language families in the world, it became necessary to have a single encoding format that could unify all languages. Unicode was created to fulfill this need.
In 1991, the Unicode Consortium released the first version of the Unicode character set. Its goal was to unify all languages into a single encoding format, enabling computers worldwide to display and process text more easily and avoiding compatibility issues in multilingual environments.
However, Unicode was only a character set; it defined character codes but not how they were stored. This led to difficulties in widespread adoption for a long time, until the rise of the Internet.
As the Internet developed, UTF-8, an encoding that implements Unicode, gained popularity. It is a variable-length encoding: different characters occupy different numbers of bytes, from 1 to 4.
For example, English letters fall within the ASCII range and are represented by 1 byte. The character 'y' (Unicode value 121) takes up 1 byte.
Most Chinese characters in everyday use take up 3 bytes. For example, the character '实' (Unicode value 23454) takes up 3 bytes.
However, some Chinese characters take up 4 bytes. There are over 100,000 Chinese characters, but according to the diagram below, 3 bytes can represent only slightly over 60,000 characters, so the remaining, less common Chinese characters require 4 bytes.
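The byte counts above can be verified with the standard unicode/utf8 package. This is a minimal sketch: encodedLen is a hypothetical wrapper introduced here, and U+20000 was picked as one example of a Chinese character beyond the 3-byte range.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen is a hypothetical wrapper around utf8.RuneLen, which reports
// how many bytes UTF-8 needs to encode a code point.
func encodedLen(codePoint rune) int {
	return utf8.RuneLen(codePoint)
}

func main() {
	fmt.Println(encodedLen(121))     // 1: 'y', inside the ASCII range
	fmt.Println(encodedLen(23454))   // 3: the Chinese character from the text
	fmt.Println(encodedLen(0xFFFF))  // 3: the largest value that fits in 3 bytes
	fmt.Println(encodedLen(0x20000)) // 4: a rarer Chinese character (U+20000)
}
```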
Another advantage of UTF-8 encoding is that it is backward compatible with ASCII encoding. In fact, ASCII is a subset of UTF-8. The first 128 characters in UTF-8 correspond one-to-one with ASCII characters. This means that software originally using ASCII can continue to be used with little or no modification. Because of these advantages, UTF-8 has gradually become the preferred encoding format.
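One way to see this compatibility is to look at the bytes of an ASCII-only string: each byte is simply the ASCII code of the character. The sketch below assumes a hypothetical helper, isASCII, that checks every byte against utf8.RuneSelf (0x80), the first value that is no longer a single-byte character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// isASCII is a hypothetical helper: it reports whether every byte of s
// is below utf8.RuneSelf (0x80), that is, whether s is plain ASCII.
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println([]byte("Go"))        // [71 111]: identical to the ASCII codes
	fmt.Println(isASCII("Go"))       // true
	fmt.Println(isASCII("Go\u5b9e")) // false: the last character needs 3 bytes
}
```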
Two of the creators of the Go programming language, Rob Pike and Ken Thompson, also invented UTF-8, so Go has a special affinity for UTF-8. Go requires source code files to be saved in UTF-8 encoding, and UTF-8 is the preferred encoding when operating on text. Moreover, the standard library provides many functions for UTF-8 encoding and decoding.
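A few of those standard library functions live in the unicode/utf8 package. The following sketch shows three of them; firstRune is a hypothetical helper added for illustration, and the sample string is an arbitrary choice.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// firstRune is a hypothetical helper: it decodes the first UTF-8 encoded
// character of s and returns it with the number of bytes it occupies.
func firstRune(s string) (rune, int) {
	return utf8.DecodeRuneInString(s)
}

func main() {
	s := "\u2665 Go" // a heart symbol (U+2665) followed by ASCII text

	fmt.Println(utf8.ValidString(s))       // true: s is well-formed UTF-8
	fmt.Println(len(s))                    // 6: length in bytes
	fmt.Println(utf8.RuneCountInString(s)) // 4: length in characters

	r, size := firstRune(s)
	fmt.Printf("%U uses %d bytes\n", r, size) // U+2665 uses 3 bytes
}
```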
byte is an alias for uint8 and occupies one byte (8 bits). It can represent every character in the ASCII table. However, because a byte can hold only 256 (2^8) distinct values, characters outside that range, such as Chinese characters, require the rune type.
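To see why one byte is not enough, compare the same string viewed as bytes and as runes. This is a minimal sketch; byteAndRuneCount is a hypothetical helper, and the sample string is chosen only for illustration.

```go
package main

import "fmt"

// byteAndRuneCount is a hypothetical helper: it returns how many bytes
// and how many runes the string s contains.
func byteAndRuneCount(s string) (int, int) {
	return len([]byte(s)), len([]rune(s))
}

func main() {
	s := "L\u5b9e" // 'L' followed by the Chinese character U+5B9E

	fmt.Println([]byte(s)) // [76 229 174 158]: one byte, then three bytes
	fmt.Println([]rune(s)) // [76 23454]: one rune per character

	b, r := byteAndRuneCount(s)
	fmt.Println(b, r) // 4 2
}
```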
Create a new file called byte.go:

cd ~/project
touch byte.go

Then enter the following code:

package main

import "fmt"

func main() {
	var a byte = 76
	fmt.Printf("Value of a: %c\n", a)

	var b uint8 = 76
	fmt.Printf("Value of b: %c\n", b)

	var c byte = 'L'
	fmt.Printf("Value of c: %c\n", c)
}
After running the program, the following result will be output:
go run byte.go
Value of a: L
Value of b: L
Value of c: L
The %c placeholder outputs a value as a character. The byte type and the uint8 type produce the same output when their values are the same. The ASCII table shows that the value of the letter 'L' is 76, and printing the same variable with the integer placeholder %d outputs 76.
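The effect of the placeholders can be seen by formatting one value several ways. In this sketch, describe is a hypothetical helper; the %x placeholder (hexadecimal) is included only for comparison.

```go
package main

import "fmt"

// describe is a hypothetical helper: it formats b as a character (%c),
// a decimal integer (%d), and a hexadecimal integer (%x).
func describe(b byte) string {
	return fmt.Sprintf("%c %d %x", b, b, b)
}

func main() {
	fmt.Println(describe('L')) // L 76 4c
	fmt.Println(describe(76))  // L 76 4c
}
```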
Therefore, byte in Go is simply the integer type uint8 under another name. The same applies to rune and int32; the only difference is the range of integer values each can represent.
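Because both types are integers, they can be compared with integer values and used in arithmetic. The sketch below illustrates this; nextLetter is a hypothetical helper, not part of the lab files.

```go
package main

import "fmt"

// nextLetter is a hypothetical helper: because byte is an integer type,
// adding 1 moves to the next ASCII code.
func nextLetter(b byte) byte {
	return b + 1
}

func main() {
	var b byte = 'L'
	var r rune = 'L'

	fmt.Printf("%T %T\n", b, r)       // uint8 int32
	fmt.Printf("%c\n", nextLetter(b)) // M
	fmt.Println(b == uint8(76))       // true
	fmt.Println(r == int32(76))       // true
}
```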
rune is an alias for int32 and occupies four bytes (32 bits). It holds a single Unicode code point, so it can represent characters that do not fit in a byte, such as Chinese characters and emoji.
Update the byte.go file with the following code:
package main

import "fmt"

func main() {
	var a rune = '😀' // Smile emoji
	fmt.Printf("Value of a: %c\n", a)

	var b int32 = 9829 // Decimal representation of a Unicode character (Heart symbol)
	fmt.Printf("Value of b: %c\n", b)

	var c rune = 0x1F496 // Hexadecimal representation of a Unicode character (Sparkling heart emoji)
	fmt.Printf("Value of c: %c\n", c)

	var d rune = '\u0041' // Unicode character represented by its code point (Capital letter 'A')
	fmt.Printf("Value of d: %c\n", d)

	var e rune = '\U0001F609' // Unicode character represented by its code point (Winking face emoji)
	fmt.Printf("Value of e: %c\n", e)
}
After running the program, the following output will be displayed:
go run byte.go
Note: Run the program in the Desktop or WebIDE Terminal, but avoid running it in the Terminal Tab located at the top of the LabEx VM.
Value of a: 😀
Value of b: ♥
Value of c: 💖
Value of d: A
Value of e: 😉
Note: In Go, single quotes and double quotes are not interchangeable. Single quotes denote a single character, while double quotes denote a string. Therefore, use single quotes when assigning a character to a byte or rune variable; using double quotes causes a compile error.
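The difference between the two kinds of quotes shows up in the type each literal receives. This is a minimal sketch; kind is a hypothetical helper that reports the type with the %T placeholder.

```go
package main

import "fmt"

// kind is a hypothetical helper: it names the Go type a literal takes
// when no type is declared.
func kind(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	fmt.Println(kind('L')) // int32: a character (rune) literal
	fmt.Println(kind("L")) // string

	var b byte = 'L' // allowed: 'L' is a character constant that fits in a byte
	s := "L"
	fmt.Println(s[0] == b) // true: the string's only byte has the same value
}
```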
Now, let's reinforce what we have learned. Create a new file called rune.go and enter the following code. Complete the code so that the hexadecimal number 0x1F648 is assigned to the variable a and the program correctly outputs its value.

rune.go should be placed in the ~/project directory.

package main

import "fmt"

func main() {
	var a rune = 0x1F648
	fmt.Printf("The value of a is: %c\n", a)
}
Let's recap what we have learned in this section:
We first explained ASCII, Unicode, and UTF-8. Then we explained how the character types byte and rune relate to the integer types uint8 and int32.