Introduction
In the previous section, we discussed commonly used numerical types. In this section, we will learn about character types in Go.
Knowledge Points:
- ASCII Encoding
- UTF-8 Encoding
- Unicode Character Set
- byte
- rune
In the previous section, we discussed commonly used numerical types. In this section, we will learn about character types in Go.
Knowledge Points:
In the early days of computers, the ASCII (American Standard Code for Information Interchange) encoding format was used. It represented characters using 7 bits and could represent 128 (2^7) characters. The characters from bit 0 to 31 and bit 127 represented control characters that could not be displayed, while characters from bit 32 to 126 represented everyday uppercase and lowercase letters, numbers, and punctuation marks. See the table for details.
As computers evolved, the need to support different languages arose. ASCII encoding was inadequate for this purpose. Consequently, different languages developed their own encoding formats, such as GB2312 for Simplified Chinese, EUC_KR for Korean, and KOI8-R for Russian.
However, with many language families in the world, it became necessary to have a single encoding format that could unify all languages. Unicode was created to fulfill this need.
In 1991, the Unicode Consortium released the first version of the Unicode character set. Its goal was to unify all languages into a single encoding format, enabling computers worldwide to display and process text more easily and avoiding compatibility issues in multilingual environments.
However, Unicode was only a character set; it defined character codes but not how they were stored. This led to difficulties in widespread adoption for a long time, until the rise of the Internet.
With the continual development of the Internet, UTF-8, a Unicode implementation encoding, gained popularity. It is a variable-length encoding, meaning that different symbols can have different byte lengths in UTF-8.
For example, for English letters, which fall into the ASCII range, they are represented by 1 byte. The character 'y' (Unicode value 121) takes up 1 byte.
For everyday use, most Chinese characters take up 3 bytes. For example, the character 'åŪ' (Unicode value 23454) takes up 3 bytes.
However, there are some Chinese characters that take up 4 bytes. This is because there are over 100,000 Chinese characters, but according to the diagram below, 3 bytes can only represent slightly over 60,000 characters, so a small number of Chinese characters require 4 bytes to be represented.
Another advantage of UTF-8 encoding is that it is backward compatible with ASCII encoding. In fact, ASCII is a subset of UTF-8. The first 128 characters in UTF-8 correspond one-to-one with ASCII characters. This means that software originally using ASCII can continue to be used with little or no modification. Because of these advantages, UTF-8 has gradually become the preferred encoding format.
The creators of the Go language, Rob Pike and Ken Thompson, also invented UTF-8, so Go has a special affinity for UTF-8. Go requires source code files to be saved in UTF-8 encoding. When operating on text characters, UTF-8 encoding is the preferred choice. Moreover, the standard library provides many functions related to UTF-8 encoding and decoding.
byte
is an alias for uint8
and occupies one byte (8 bits). It can be used to represent all characters in the ASCII table. However, because byte can represent a limited range of values (256 or 2^8), when dealing with composite characters such as Chinese characters, we need to use the rune
type.
Create a new file called byte.go
and enter the following code:
package main
import "fmt"
func main() {
var a byte = 76
fmt.Printf("Value of a: %c\n", a)
var b uint8 = 76
fmt.Printf("Value of b: %c\n", b)
var c byte = 'L'
fmt.Printf("Value of c: %c\n", c)
}
After running the program, the following result will be output:
Value of a: L
Value of b: L
Value of c: L
The %c
placeholder is used to output characters. It can be seen that the byte type and uint8 type produce the same output when their values are the same. By referring to the ASCII table, it can be seen that the ASCII value of the letter 'a' is 97. When we use the integer placeholder %d
to output the value, it is also 97.
Therefore, it is evident that byte
in Go is equivalent to uint8
in integer types. The same applies to rune
, but it represents a different range of integer values.
rune
is an alias for int32
and occupies four bytes (32 bits). It is used to represent composite characters, such as Chinese characters. Here's an example:
package main
import "fmt"
func main() {
// Declare a Unicode character and use the %c placeholder to output it
var a rune = 'ð' // Smile emoji
fmt.Printf("Value of a: %c\n", a)
// Declare Unicode characters using decimal and hexadecimal notation and output them
// In hexadecimal notation, the prefix 0x or 0X is added before the number
var b int32 = 9829 // Decimal representation of a Unicode character (Heart symbol)
fmt.Printf("Value of b: %c\n", b)
var c rune = 0x1F496 // Hexadecimal representation of a Unicode character (Sparkling heart emoji)
fmt.Printf("Value of c: %c\n", c)
// Declare characters using the Unicode format \u and \U
// \u is followed by a 4-digit hexadecimal number
// \U is followed by an 8-digit hexadecimal number
var d rune = '\u0041' // Unicode character represented by its code point (Capital letter 'A')
fmt.Printf("Value of d: %c\n", d)
var e rune = '\U0001F609' // Unicode character represented by its code point (Winking face emoji)
fmt.Printf("Value of e: %c\n", e)
}
After running the program, the following output will be displayed:
Value of a: ð
Value of b: âĨ
Value of c: ð
Value of d: A
Value of e: ð
Note: In Go, single quotes and double quotes are not the same. Single quotes are used to represent characters, while double quotes are used to declare strings. Therefore, single quotes should be used when declaring byte
and rune
types, or an error will occur. Please try it out for yourself.
Now, let's reinforce what we have learned. Create a new file called rune.go
and enter the following code. Complete the code so that the hexadecimal number 1F648
is assigned to the variable a
and the program correctly outputs its value.
Requirements: The file rune.go
should be placed in the ~/project
directory.
Hint: For long hexadecimal numbers, a specified format must be used.
package main
import "fmt"
func main() {
var a
fmt.Printf("Guess who I am?\n", a)
}
Let's recap what we have learned in this section:
In this section, we first explained ASCII, UTF-8, and Unicode. Then, we explained the relationship between the character types byte and rune and the integer types. After learning about characters as an appetizer, we will move on to strings in the next section.