Exploring Character Types in Go

Introduction

In the previous section, we discussed commonly used numerical types. In this section, we will learn about character types in Go.

Knowledge Points:

ASCII Encoding
UTF-8 Encoding
Unicode Character Set
byte
rune

ASCII Encoding

In the early days of computers, the ASCII (American Standard Code for Information Interchange) encoding format was used. It represents each character with 7 bits, so it can encode 128 (2^7) characters. Codes 0 to 31 and code 127 are control characters that cannot be displayed, while codes 32 to 126 cover the space, uppercase and lowercase letters, digits, and punctuation marks. See the table for details.
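The split between control codes and printable codes can be checked with a few lines of Go. This is a minimal sketch; the helper name isASCIIControl is hypothetical and chosen only for this illustration.

```go
package main

import "fmt"

// isASCIIControl reports whether code is a non-printable ASCII control code
// (0 to 31, or 127). The name is hypothetical, used only in this sketch.
func isASCIIControl(code byte) bool {
	return code < 32 || code == 127
}

func main() {
	// Walk the full 7-bit range and count each kind of code.
	control, printable := 0, 0
	for code := 0; code < 128; code++ {
		if isASCIIControl(byte(code)) {
			control++
		} else {
			printable++
		}
	}
	fmt.Println("control codes:", control)     // 33
	fmt.Println("printable codes:", printable) // 95
}
```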

As computers spread around the world, they needed to support languages other than English, and ASCII was inadequate for this purpose. Consequently, different languages developed their own encoding formats, such as GB2312 for Simplified Chinese, EUC-KR for Korean, and KOI8-R for Russian.

However, with many language families in the world, it became necessary to have a single encoding format that could unify all languages. Unicode was created to fulfill this need.

Unicode Character Set

In 1991, the Unicode Consortium released the first version of the Unicode character set. Its goal was to unify all languages into a single encoding format, enabling computers worldwide to display and process text more easily and avoiding compatibility issues in multilingual environments.

However, Unicode was only a character set: it assigned each character a number (its code point) but did not define how those numbers were stored as bytes. This hindered widespread adoption for a long time, until the rise of the Internet.

UTF-8 Encoding

As the Internet grew, UTF-8, an encoding that implements Unicode, gained popularity. It is a variable-length encoding, meaning that different characters can occupy different numbers of bytes.

For example, English letters fall into the ASCII range and are represented by 1 byte. The character 'y' (Unicode value 121) takes up 1 byte.

Most Chinese characters in everyday use take up 3 bytes. For example, the character '实' (Unicode value 23454) takes up 3 bytes.

However, some Chinese characters take up 4 bytes. There are over 100,000 Chinese characters, but according to the diagram below, 3 bytes can represent only slightly over 60,000 characters, so a small number of Chinese characters require 4 bytes.
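These byte lengths can be verified with utf8.RuneLen from the standard library's unicode/utf8 package. The sketch below wraps it in a hypothetical helper, encodedLen, and uses U+20000 (a Chinese character outside the 3-byte range) as an assumed example of a 4-byte character.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedLen returns the number of bytes UTF-8 needs to encode r.
// It is a thin, hypothetical wrapper around utf8.RuneLen.
func encodedLen(r rune) int {
	return utf8.RuneLen(r)
}

func main() {
	fmt.Println(encodedLen('y'))          // 1: ASCII range
	fmt.Println(encodedLen('实'))          // 3: common Chinese character
	fmt.Println(encodedLen('\U00020000')) // 4: code point above U+FFFF
}
```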

Another advantage of UTF-8 encoding is that it is backward compatible with ASCII encoding. In fact, ASCII is a subset of UTF-8. The first 128 characters in UTF-8 correspond one-to-one with ASCII characters. This means that software originally using ASCII can continue to be used with little or no modification. Because of these advantages, UTF-8 has gradually become the preferred encoding format.
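The ASCII compatibility is easy to observe: converting an ASCII-only string to bytes yields exactly the ASCII codes. In this small sketch, utf8Bytes is a hypothetical helper name.

```go
package main

import "fmt"

// utf8Bytes returns the raw UTF-8 bytes of s. Go strings are stored as
// bytes, so this is only a conversion. The helper name is hypothetical.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

func main() {
	// 'G' is ASCII 71 and 'o' is ASCII 111; UTF-8 stores them unchanged.
	fmt.Println(utf8Bytes("Go")) // [71 111]
}
```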

Two of the creators of the Go language, Rob Pike and Ken Thompson, also invented UTF-8, so Go has a special affinity for UTF-8. Go requires source code files to be saved in UTF-8 encoding. When operating on text, UTF-8 is the preferred choice. Moreover, the standard library provides many functions for UTF-8 encoding and decoding.
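As an illustration of that standard library support, the sketch below uses three functions from the unicode/utf8 package: RuneCountInString, ValidString, and DecodeRuneInString. The helper describe and the sample strings are assumptions made for this example.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// describe reports the byte length, the character (rune) count, and the
// UTF-8 validity of s. The helper name is hypothetical.
func describe(s string) (nBytes, nRunes int, valid bool) {
	return len(s), utf8.RuneCountInString(s), utf8.ValidString(s)
}

func main() {
	// "Go实" holds two 1-byte characters and one 3-byte character.
	b, r, ok := describe("Go实")
	fmt.Println(b, r, ok) // 5 3 true

	// Decode the first character of a string and get its width in bytes.
	first, size := utf8.DecodeRuneInString("实y")
	fmt.Printf("%c %d\n", first, size) // 实 3
}
```

Note that len counts bytes, not characters, which is why the byte length and the rune count differ here.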

byte and rune

byte is an alias for uint8 and occupies one byte (8 bits). It can represent every character in the ASCII table. However, because a byte can hold only 256 (2^8) distinct values, characters outside that range, such as Chinese characters, require the rune type.

Create a new file called byte.go and enter the following code:

package main

import "fmt"

func main() {
    var a byte = 76
    fmt.Printf("Value of a: %c\n", a)

    var b uint8 = 76
    fmt.Printf("Value of b: %c\n", b)

    var c byte = 'L'
    fmt.Printf("Value of c: %c\n", c)
}

After running the program, the following result will be output:

Value of a: L
Value of b: L
Value of c: L

The %c placeholder is used to output characters. The byte type and the uint8 type produce the same output when their values are the same. The ASCII table shows that the value of the letter 'L' is 76, which is why the integer 76 and the character literal 'L' print the same character. If we replace %c with the integer placeholder %d, the output is 76 in all three cases.

Therefore, it is evident that byte in Go is equivalent to uint8 in integer types. The same applies to rune, but it represents a different range of integer values.
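The integer nature of both types can be seen by printing the same value with %c, %d, and %T (which prints the type). In this minimal sketch, nextChar is a hypothetical helper that relies on characters being plain integers.

```go
package main

import "fmt"

// nextChar returns the character whose code is one greater than c.
// The helper name is hypothetical; it works because byte is an integer type.
func nextChar(c byte) byte {
	return c + 1
}

func main() {
	var c byte = 'L'
	fmt.Printf("%c %d %T\n", c, c, c) // L 76 uint8

	var r rune = 'L'
	fmt.Printf("%c %d %T\n", r, r, r) // L 76 int32

	// Arithmetic on a character yields another character code.
	fmt.Printf("%c\n", nextChar(c)) // M
}
```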

rune is an alias for int32 and occupies four bytes (32 bits). It can hold any Unicode code point, including characters outside the ASCII range such as Chinese characters and emoji. Here's an example:

package main

import "fmt"

func main() {
    // Declare a Unicode character and use the %c placeholder to output it
    var a rune = '😊' // Smile emoji
    fmt.Printf("Value of a: %c\n", a)

    // Declare Unicode characters using decimal and hexadecimal notation and output them
    // In hexadecimal notation, the prefix 0x or 0X is added before the number
    var b int32 = 9829 // Decimal representation of a Unicode character (Heart symbol)
    fmt.Printf("Value of b: %c\n", b)
    var c rune = 0x1F496 // Hexadecimal representation of a Unicode character (Sparkling heart emoji)
    fmt.Printf("Value of c: %c\n", c)

    // Declare characters using the Unicode format \u and \U
    // \u is followed by a 4-digit hexadecimal number
    // \U is followed by an 8-digit hexadecimal number
    var d rune = '\u0041' // Unicode character represented by its code point (Capital letter 'A')
    fmt.Printf("Value of d: %c\n", d)
    var e rune = '\U0001F609' // Unicode character represented by its code point (Winking face emoji)
    fmt.Printf("Value of e: %c\n", e)
}

After running the program, the following output will be displayed:

Value of a: 😊
Value of b: ♥
Value of c: 💖
Value of d: A
Value of e: 😉

Variable a represents the smile emoji '😊'.
Variable b is initialized with the decimal representation of a Unicode character (9829), which corresponds to the heart symbol '♥'.
Variable c is initialized with the hexadecimal representation of a Unicode character (0x1F496), which corresponds to the sparkling heart emoji '💖'.
Variable d represents the capital letter 'A' using the Unicode format \u0041.
Variable e represents the winking face emoji '😉' using the \U format with the code point 0001F609.

Note: In Go, single quotes and double quotes are not the same. Single quotes denote a single character, while double quotes denote a string. Therefore, single quotes must be used when assigning a character to a byte or rune variable; otherwise, a compile error will occur. Please try it out for yourself.
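The difference between the two kinds of quotes shows up in the types Go infers. The sketch below uses a hypothetical helper, typeName, built on the %T placeholder; the commented-out line is one that would fail to compile.

```go
package main

import "fmt"

// typeName returns the Go type of v as text, using the %T placeholder.
// The helper name is hypothetical.
func typeName(v interface{}) string {
	return fmt.Sprintf("%T", v)
}

func main() {
	fmt.Println(typeName('A')) // int32 (rune is an alias for int32)
	fmt.Println(typeName("A")) // string

	var b byte = 'A' // a character literal can be assigned to a byte
	fmt.Println(b)   // 65

	// var bad byte = "A" // does not compile: a string is not a byte
}
```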

Quiz

Now, let's reinforce what we have learned. Create a new file called rune.go and enter the following code. Complete the code so that the hexadecimal number 1F648 is assigned to the variable a and the program correctly outputs its value.

Requirements: The file rune.go should be placed in the ~/project directory.

Hint: A hexadecimal code point with more than four digits cannot be written with \u; it needs either the 0x prefix or the \U format with eight hexadecimal digits.

package main

import "fmt"

func main() {
    var a
    fmt.Printf("Guess who I am?\n", a)
}

Summary

Let's recap what we have learned in this section:

ASCII uses 7 bits per character, so each character fits in one byte, and it can represent 128 characters.
UTF-8 is an encoding of the Unicode character set. It is a variable-length encoding: the number of bytes depends on the character being represented.
The byte type can be used to represent ASCII characters, while the rune type can be used to represent Unicode characters.

In this section, we first explained ASCII, UTF-8, and Unicode. Then, we explained the relationship between the character types byte and rune and the integer types. After learning about characters as an appetizer, we will move on to strings in the next section.