Text Processing Fundamentals

This tutorial is from the open-source community.

Introduction

This section introduces ways to work with text.

Representing Literal Text

String literals are written in programs with quotes.

## Single quote
a = 'Yeah but no but yeah but...'

## Double quote
b = "computer says no"

## Triple quotes
c = '''
Look into my eyes, look into my eyes, the eyes, the eyes, the eyes,
not around the eyes,
don't look around the eyes,
look into my eyes, you're under.
'''

Normally strings may only span a single line. Triple quotes capture all text enclosed across multiple lines including all formatting.

There is no difference between using single (') versus double (") quotes. However, the same type of quote used to start a string must be used to terminate it.
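
As a small illustration of why both quote styles exist, the sketch below picks whichever style avoids clashing with quotes inside the text. The variable names are made up for this example.

```python
# Pick the quote style that avoids clashing with quotes inside the text.
s1 = "don't panic"            # apostrophe inside double quotes
s2 = 'She said "hello"'       # double quotes inside single quotes

# Both styles produce the same kind of object with the same value.
same = 'spam' == "spam"       # True

# A triple-quoted string keeps its line breaks as part of the text.
poem = '''first line
second line'''
line_count = len(poem.splitlines())   # 2
```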

String escape codes

Escape codes are used to represent control characters and characters that can't be easily typed directly at the keyboard. Here are some common escape codes:

'\n'      Line feed
'\r'      Carriage return
'\t'      Tab
'\''      Literal single quote
'\"'      Literal double quote
'\\'      Literal backslash
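
The following sketch (with illustrative variable names) shows that each escape code stands for a single character, even though it is typed as two, and how print() and repr() display the same string differently.

```python
# An escape code is two keystrokes in the source but one character in the string.
newline_len = len('\n')        # 1
backslash_len = len('\\')      # 1

# print() shows the characters themselves; repr() shows the escaped form.
row = 'name\tprice\nIBM\t91.1'
print(row)
print(repr(row))

# Escaping a quote lets it appear inside a string that uses the same quote style.
quote = 'It\'s here'           # same text as "It's here"
```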

String Representation

Each character in a string is stored internally as a so-called Unicode "code-point" which is an integer. You can specify an exact code-point value using the following escape sequences:

a = '\xf1'          ## a = 'ñ'
b = '\u2200'        ## b = '∀'
c = '\U0001D122'    ## c = '𝄢'
d = '\N{FOR ALL}'   ## d = '∀'

The Unicode Character Database is a reference for all available character codes.
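
To explore code points interactively, the built-in ord() and chr() functions convert between a character and its integer code point, and the standard unicodedata module looks up official character names. A minimal sketch with illustrative variable names:

```python
import unicodedata

code = ord('\u2200')                 # 8704, which is 0x2200
char = chr(0x2200)                   # the FOR ALL symbol
name = unicodedata.name('\xf1')      # 'LATIN SMALL LETTER N WITH TILDE'

# Different escape sequences can describe the same character.
same = '\u2200' == '\N{FOR ALL}'     # True
```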

String Indexing

Strings work like an array for accessing individual characters. You use an integer index, starting at 0. Negative indices specify a position relative to the end of the string.

a = 'Hello world'
b = a[0]          ## 'H'
c = a[4]          ## 'o'
d = a[-1]         ## 'd' (end of string)

You can also slice or select substrings specifying a range of indices with :.

d = a[:5]     ## 'Hello'
e = a[6:]     ## 'world'
f = a[3:8]    ## 'lo wo'
g = a[-5:]    ## 'world'

The character at the ending index is not included. Missing indices assume the beginning or ending of the string.
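
The excluded end index has some convenient consequences, sketched below with illustrative variable names: a[:n] and a[n:] split a string without overlap, and slices quietly clip out-of-range indices while single-index lookups raise an error.

```python
a = 'Hello world'

# Because the end index is excluded, a[:n] and a[n:] split the string cleanly.
head, tail = a[:5], a[5:]        # 'Hello', ' world'
rejoined = head + tail           # 'Hello world'

# For indices inside the string, a[i:j] has j - i characters.
middle = a[3:8]                  # 'lo wo', 5 characters

# Slices tolerate out-of-range indices; single-index lookups do not.
clipped = a[6:100]               # 'world'
try:
    a[100]
except IndexError as err:
    message = str(err)           # 'string index out of range'
```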

String operations

Concatenation, length, membership and replication.

## Concatenation (+)
a = 'Hello' + 'World'   ## 'HelloWorld'
b = 'Say ' + a          ## 'Say HelloWorld'

## Length (len)
s = 'Hello'
len(s)                  ## 5

## Membership test (`in`, `not in`)
t = 'e' in s            ## True
f = 'x' in s            ## False
g = 'hi' not in s       ## True

## Replication (s * n)
rep = s * 5             ## 'HelloHelloHelloHelloHello'

String methods

Strings have methods that perform various operations with the string data.

Example: stripping any leading / trailing white space.

s = '  Hello '
t = s.strip()     ## 'Hello'

Example: Case conversion.

s = 'Hello'
l = s.lower()     ## 'hello'
u = s.upper()     ## 'HELLO'

Example: Replacing text.

s = 'Hello world'
t = s.replace('Hello', 'Hallo')    ## 'Hallo world'

More string methods:

Strings have a wide variety of other methods for testing and manipulating the text data. This is a small sample of methods:

s.endswith(suffix)     ## Check if string ends with suffix
s.find(t)              ## First occurrence of t in s (-1 if not found)
s.index(t)             ## First occurrence of t in s (ValueError if not found)
s.isalpha()            ## Check if characters are alphabetic
s.isdigit()            ## Check if characters are numeric
s.islower()            ## Check if characters are lower-case
s.isupper()            ## Check if characters are upper-case
s.join(slist)          ## Join a list of strings using s as delimiter
s.lower()              ## Convert to lower case
s.replace(old,new)     ## Replace text
s.rfind(t)             ## Last occurrence of t in s (-1 if not found)
s.rindex(t)            ## Last occurrence of t in s (ValueError if not found)
s.split([delim])       ## Split string into list of substrings
s.startswith(prefix)   ## Check if string starts with prefix
s.strip()              ## Strip leading/trailing space
s.upper()              ## Convert to upper case
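
A few of the methods listed above are easiest to understand side by side. The sketch below, using a made-up ticker string, contrasts find() with index() when the text is missing and shows split() and join() undoing each other.

```python
s = 'AAPL,IBM,MSFT'

pos = s.find('IBM')              # 5
missing = s.find('GOOG')         # -1, find() signals "not found" with -1

missing_raises = False
try:
    s.index('GOOG')              # index() raises an exception instead
except ValueError:
    missing_raises = True

parts = s.split(',')             # ['AAPL', 'IBM', 'MSFT']
joined = ' | '.join(parts)       # 'AAPL | IBM | MSFT'
restored = ','.join(parts)       # 'AAPL,IBM,MSFT'
```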

String Mutability

Strings are "immutable" or read-only. Once created, the value can't be changed.

>>> s = 'Hello World'
>>> s[1] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

All operations and methods that manipulate string data always create new strings.
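
One way to get the effect of changing a character is to build a new string from slices. This is only a sketch with illustrative names; the second name alias is there to show that the original string is left untouched.

```python
s = 'Hello World'
alias = s                        # second name for the same string

# Build a new string instead of assigning to s[1].
s = s[:1] + 'a' + s[2:]          # 'Hallo World'

# The original string is unchanged; only the name s was rebound.
original = alias                 # 'Hello World'

# Methods behave the same way: they return a new string.
upper = alias.upper()            # 'HELLO WORLD', alias is still 'Hello World'
```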

String Conversions

Use str() to convert any value to a string. The result is a string holding the same text that would have been produced by the print() function.

>>> x = 42
>>> str(x)
'42'
>>>
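
A common reason to call str() is that + only joins strings to strings. The sketch below uses illustrative values to show the conversion and the error you get without it.

```python
shares = 100
price = 91.1

# + only joins strings, so numbers must be converted first.
msg = 'Shares: ' + str(shares)       # 'Shares: 100'

as_text = str(price)                 # '91.1'
as_list = str([1, 2, 3])             # '[1, 2, 3]'

needs_str = False
try:
    'Shares: ' + shares              # mixing str and int is an error
except TypeError:
    needs_str = True
```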

Byte Strings

A string of 8-bit bytes, commonly encountered with low-level I/O, is written as follows:

data = b'Hello World\r\n'

By placing a lowercase b before the opening quote, you specify that it is a byte string as opposed to a text string.

Most of the usual string operations work.

len(data)                         ## 13
data[0:5]                         ## b'Hello'
data.replace(b'Hello', b'Cruel')  ## b'Cruel World\r\n'

Indexing is a bit different because it returns byte values as integers.

data[0]   ## 72 (ASCII code for 'H')

Conversion to/from text strings.

text = data.decode('utf-8') ## bytes -> text
data = text.encode('utf-8') ## text -> bytes

The 'utf-8' argument specifies a character encoding. Other common values include 'ascii' and 'latin1'.
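
The choice of encoding matters as soon as the text contains non-ASCII characters. The following sketch, using an example word chosen for illustration, encodes the same text three ways.

```python
text = 'ma\xf1ana'                   # 6 characters, the third is n with tilde

utf8 = text.encode('utf-8')          # b'ma\xc3\xb1ana', 7 bytes
latin1 = text.encode('latin1')       # b'ma\xf1ana', 6 bytes

roundtrip = utf8.decode('utf-8')     # back to the original text

ascii_failed = False
try:
    text.encode('ascii')             # the accented letter has no ASCII encoding
except UnicodeEncodeError:
    ascii_failed = True
```

Decoding with a different encoding than the one used to encode generally produces garbled text or an error, so it helps to keep track of which encoding your data uses.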

Raw Strings

Raw strings are string literals with an uninterpreted backslash. They are specified by prefixing the initial quote with a lowercase "r".

>>> rs = r'c:\newdata\test' ## Raw (uninterpreted backslash)
>>> rs
'c:\\newdata\\test'

The string is the literal text enclosed inside, exactly as typed. This is useful in situations where the backslash has special significance, such as file names and regular expressions.
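
To see what the r prefix changes, compare the same characters typed with and without it. This is a small sketch with illustrative variable names.

```python
plain = 'c:\newdata\test'        # \n and \t become a newline and a tab
raw = r'c:\newdata\test'         # backslashes are kept as typed

plain_len = len(plain)           # 13
raw_len = len(raw)               # 15

# The raw string has the same value as a normal literal with doubled backslashes.
escaped = 'c:\\newdata\\test'
same = raw == escaped            # True
```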

f-Strings

A string with formatted expression substitution.

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> a = f'{name:>10s} {shares:10d} {price:10.2f}'
>>> a
'       IBM        100      91.10'
>>> b = f'Cost = ${shares*price:0.2f}'
>>> b
'Cost = $9110.00'
>>>

Note: This requires Python 3.6 or newer. The meaning of the format codes is covered later.
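
Fixed field widths are what make f-strings handy for tabular output. In the sketch below, the rows list is made-up sample data; each value is right-aligned in a 10-character field, so every formatted line has the same length.

```python
rows = [('IBM', 100, 91.1), ('AAPL', 50, 123.45)]

# Same format specification as above, applied to every row.
lines = [f'{name:>10s} {shares:10d} {price:10.2f}'
         for name, shares, price in rows]

for line in lines:
    print(line)

widths = [len(line) for line in lines]   # [32, 32]
```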

In these exercises, you'll experiment with operations on Python's string type. You should do this at the Python interactive prompt where you can easily see the results. Important note:

In exercises where you are supposed to interact with the interpreter, >>> is the interpreter prompt that you get when Python wants you to type a new statement. Some statements in the exercise span multiple lines--to get these statements to run, you may have to hit 'return' a few times. Just a reminder that you DO NOT type the >>> when working these examples.

Start by defining a string containing a series of stock ticker symbols like this:

>>> symbols = 'AAPL,IBM,MSFT,YHOO,SCO'
>>>

Exercise 1.13: Extracting individual characters and substrings

Strings are arrays of characters. Try extracting a few characters:

>>> symbols[0]
?
>>> symbols[1]
?
>>> symbols[2]
?
>>> symbols[-1]        ## Last character
?
>>> symbols[-2]        ## Negative indices are from end of string
?
>>>

In Python, strings are read-only.

Verify this by trying to change the first character of symbols to a lower-case 'a'.

>>> symbols[0] = 'a'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object does not support item assignment
>>>

Exercise 1.14: String concatenation

Although string data is read-only, you can always reassign a variable to a newly created string.

Try the following statement which concatenates a new symbol "GOOG" to the end of symbols:

>>> symbols = symbols + 'GOOG'
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCOGOOG'
>>>

Oops! That's not what you wanted. Fix it so that the symbols variable holds the value 'AAPL,IBM,MSFT,YHOO,SCO,GOOG'.

>>> symbols = ?
>>> symbols
'AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

Add 'HPQ' to the front of the string:

>>> symbols = ?
>>> symbols
'HPQ,AAPL,IBM,MSFT,YHOO,SCO,GOOG'
>>>

In these examples, it might look like the original string is being modified, in an apparent violation of strings being read only. Not so. Operations on strings create an entirely new string each time. When the variable name symbols is reassigned, it points to the newly created string. Afterwards, the old string is destroyed since it's not being used anymore.

Exercise 1.15: Membership testing (substring testing)

Experiment with the in operator to check for substrings. At the interactive prompt, try these operations:

>>> 'IBM' in symbols
?
>>> 'AA' in symbols
True
>>> 'CAT' in symbols
?
>>>

Why did the check for 'AA' return True?

Exercise 1.16: String Methods

At the Python interactive prompt, try experimenting with some of the string methods.

>>> symbols.lower()
?
>>> symbols
?
>>>

Remember, strings are always read-only. If you want to save the result of an operation, you need to place it in a variable:

>>> lowersyms = symbols.lower()
>>>

Try some more operations:

>>> symbols.find('MSFT')
?
>>> symbols[13:17]
?
>>> symbols = symbols.replace('SCO','DOA')
>>> symbols
?
>>> name = '   IBM   \n'
>>> name = name.strip()    ## Remove surrounding whitespace
>>> name
?
>>>

Exercise 1.17: f-strings

Sometimes you want to create a string and embed the values of variables into it.

To do that, use an f-string. For example:

>>> name = 'IBM'
>>> shares = 100
>>> price = 91.1
>>> f'{shares} shares of {name} at ${price:0.2f}'
'100 shares of IBM at $91.10'
>>>

Modify the mortgage.py program from Exercise 1.10 to create its output using f-strings. Try to make it so that output is nicely aligned.

Exercise 1.18: Regular Expressions

One limitation of the basic string operations is that they don't support any kind of advanced pattern matching. For that, you need to turn to Python's re module and regular expressions. Regular expression handling is a big topic, but here is a short example:

>>> text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'
>>> ## Find all occurrences of a date
>>> import re
>>> re.findall(r'\d+/\d+/\d+', text)
['3/27/2018', '3/28/2018']
>>> ## Replace all occurrences of a date with replacement text
>>> re.sub(r'(\d+)/(\d+)/(\d+)', r'\3-\1-\2', text)
'Today is 2018-3-27. Tomorrow is 2018-3-28.'
>>>

For more information about the re module, see the official documentation at https://docs.python.org/library/re.html.
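
As a further sketch of what the re module offers, re.search() returns a match object for the first match, and re.sub() also accepts a function as the replacement. The helper name iso is made up for this example.

```python
import re

text = 'Today is 3/27/2018. Tomorrow is 3/28/2018.'

# re.search() finds only the first match and exposes its groups.
m = re.search(r'(\d+)/(\d+)/(\d+)', text)
first = m.group(0)                       # '3/27/2018'
month, day, year = m.groups()            # ('3', '27', '2018')

# A replacement function can compute the new text, here zero-padding the date.
def iso(match):
    mo, d, y = match.groups()
    return f'{y}-{int(mo):02d}-{int(d):02d}'

result = re.sub(r'(\d+)/(\d+)/(\d+)', iso, text)
# 'Today is 2018-03-27. Tomorrow is 2018-03-28.'
```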

Commentary

As you start to experiment with the interpreter, you often want to know more about the operations supported by different objects. For example, how do you find out what operations are available on a string?

Depending on your Python environment, you might be able to see a list of available methods via tab-completion. For example, try typing this:

>>> s = 'hello world'
>>> s.<tab key>
>>>

If hitting tab doesn't do anything, you can fall back to the built-in dir() function. For example:

>>> s = 'hello'
>>> dir(s)
['__add__', '__class__', '__contains__', ..., 'find', 'format',
'index', 'isalnum', 'isalpha', 'isdigit', 'islower', 'isspace',
'istitle', 'isupper', 'join', 'ljust', 'lower', 'lstrip', 'partition',
'replace', 'rfind', 'rindex', 'rjust', 'rpartition', 'rsplit',
'rstrip', 'split', 'splitlines', 'startswith', 'strip', 'swapcase',
'title', 'translate', 'upper', 'zfill']
>>>

dir() produces a list of all operations that can appear after the (.). Use the help() command to get more information about a specific operation:

>>> help(s.upper)
Help on built-in function upper:

upper(...)
    S.upper() -> string

    Return a copy of the string S converted to uppercase.
>>>
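
Since dir() returns an ordinary list of strings, you can filter it like any other list. The sketch below hides the names that start with an underscore; the exact contents depend on your Python version, so no full output is shown.

```python
s = 'hello'

# Keep only the names that do not start with an underscore.
public = [name for name in dir(s) if not name.startswith('_')]

has_upper = 'upper' in public                              # True
tests = [name for name in public if name.startswith('is')]  # isalpha, isdigit, ...

print(public)
```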

Summary

Congratulations! You have completed the Strings lab. You can practice more labs in LabEx to improve your skills.