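How to Remove Non-Alphanumeric Characters from Python Strings | Text Data Cleaning

Introduction

Working with strings is a fundamental skill in Python programming. You often need to clean text data by removing special characters, punctuation, or other non-alphanumeric characters. This step is essential in many applications, including data analysis, natural language processing, and web development.

This tutorial walks through several ways to filter non-alphanumeric characters out of Python strings. By the end, you will be able to turn messy text into clean, structured data that is easy to process in your Python applications.

Python String Basics and Alphanumeric Characters

Before filtering non-alphanumeric characters, let's look at what strings and alphanumeric characters are in Python.

What Is a Python String?

A Python string is a sequence of characters enclosed in quotes. You can define strings with single quotes ('), double quotes ("), or triple quotes (''' or """).

Let's create a new Python file to experiment with strings. In the WebIDE, click the "New File" icon in the explorer panel to create a new file in the /home/labex/project directory. Name the file string_basics.py.

Add the following code to the file.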

## Different ways to define strings in Python
string1 = 'Hello, World!'
string2 = "Python Programming"
string3 = '''This is a
multiline string.'''

## Display each string
print("String 1:", string1)
print("String 2:", string2)
print("String 3:", string3)

To run this file, open a terminal (if one is not already open) and run the following command.

python3 /home/labex/project/string_basics.py

You should see output similar to the following.

String 1: Hello, World!
String 2: Python Programming
String 3: This is a
multiline string.

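What Are Alphanumeric Characters?

Alphanumeric characters include:

Letters (A-Z, a-z)
Digits (0-9)

All other characters (punctuation, whitespace, symbols, and so on) are considered non-alphanumeric.

Let's create another file to check whether characters are alphanumeric. Create a new file named alphanumeric_check.py with the following content.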

## Check if characters are alphanumeric
test_string = "Hello123!@#"

print("Testing each character in:", test_string)
print("Character | Alphanumeric?")
print("-" * 24)

for char in test_string:
    is_alnum = char.isalnum()
    print(f"{char:^9} | {is_alnum}")

## Check entire strings
examples = ["ABC123", "Hello!", "12345", "a b c"]
print("\nChecking entire strings:")
for ex in examples:
    print(f"{ex:10} | {ex.isalnum()}")

Run this file.

python3 /home/labex/project/alphanumeric_check.py

The output shows which characters are alphanumeric and which are not.

Testing each character in: Hello123!@#
Character | Alphanumeric?
------------------------
    H     | True
    e     | True
    l     | True
    l     | True
    o     | True
    1     | True
    2     | True
    3     | True
    !     | False
    @     | False
    #     | False

Checking entire strings:
ABC123     | True
Hello!     | False
12345      | True
a b c      | False

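As you can see, the isalnum() method returns True for letters and digits and False for all other characters. Called on a whole string, it returns True only if the string is non-empty and every character is alphanumeric. This makes it useful for identifying non-alphanumeric characters. Note that isalnum() is Unicode-aware: it also returns True for letters and digits outside the ASCII range, such as é or 한.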
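Because isalnum() follows Unicode character properties, its idea of a letter or digit is broader than A-Z, a-z, and 0-9. The following is a minimal sketch of that difference. The helper name is_ascii_alnum is hypothetical and not part of the lab files, and str.isascii() assumes Python 3.7 or later.

```python
def is_ascii_alnum(ch):
    """Return True only for ASCII letters and digits (hypothetical helper)."""
    return ch.isascii() and ch.isalnum()

# A mix of ASCII and non-ASCII characters
samples = ["a", "7", "é", "한", "_", " "]

print("Char | isalnum() | ASCII-only")
for ch in samples:
    print(f"{ch!r:5}| {str(ch.isalnum()):10}| {is_ascii_alnum(ch)}")
```

If your data should contain only ASCII letters and digits, a check like this (or the regular expression approach later in this lab) may be a better fit than isalnum() alone.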

Filtering with String Methods

Python provides several built-in string methods that help filter non-alphanumeric characters. In this step, you will explore these methods and write your own filtering functions.

Using a Generator Expression

One common way to filter characters is to pass a generator expression to str.join(). Let's create a new file named string_filter.py.

## Using a generator expression to filter non-alphanumeric characters

def filter_alphanumeric(text):
    ## Keep only alphanumeric characters
    filtered_text = ''.join(char for char in text if char.isalnum())
    return filtered_text

## Test the function with different examples
test_strings = [
    "Hello, World!",
    "Python 3.10 is amazing!",
    "Email: user@example.com",
    "Phone: (123) 456-7890"
]

print("Original vs Filtered:")
print("-" * 40)

for text in test_strings:
    filtered = filter_alphanumeric(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

Run this file.

python3 /home/labex/project/string_filter.py

You should see the following output.

Original vs Filtered:
----------------------------------------
Original: Hello, World!
Filtered: HelloWorld
----------------------------------------
Original: Python 3.10 is amazing!
Filtered: Python310isamazing
----------------------------------------
Original: Email: user@example.com
Filtered: Emailuserexamplecom
----------------------------------------
Original: Phone: (123) 456-7890
Filtered: Phone1234567890
----------------------------------------

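The filter_alphanumeric() function iterates over each character in the string and keeps only the characters that pass the isalnum() check.

Using the filter() Function

Python's built-in filter() function offers another way to get the same result. Let's add this approach to the file.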

## Add to the string_filter.py file

def filter_alphanumeric_using_filter(text):
    ## Using the built-in filter() function
    filtered_text = ''.join(filter(str.isalnum, text))
    return filtered_text

print("\nUsing the filter() function:")
print("-" * 40)

for text in test_strings:
    filtered = filter_alphanumeric_using_filter(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

In the WebIDE, open string_filter.py and append the code above to the end of the file. Then run it again.

python3 /home/labex/project/string_filter.py

You will see that both approaches produce the same results for these test strings.
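Both approaches apply the same str.isalnum test to every character, so they should agree on any input. The sketch below checks this on a few samples. The function names filter_with_generator and filter_with_builtin are illustrative stand-ins for the two functions in string_filter.py.

```python
def filter_with_generator(text):
    # Generator expression: keep characters that pass isalnum()
    return ''.join(char for char in text if char.isalnum())

def filter_with_builtin(text):
    # filter() applies str.isalnum to each character of the string
    return ''.join(filter(str.isalnum, text))

samples = ["Hello, World!", "Phone: (123) 456-7890", "", "café #1"]

for text in samples:
    # Raises AssertionError if the two approaches ever disagree
    assert filter_with_generator(text) == filter_with_builtin(text)

print("Both approaches agree on all samples.")
```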

Custom Filtering

Sometimes you want to keep certain non-alphanumeric characters while removing the rest. Let's add a function that lets you specify which additional characters to keep.

## Add to the string_filter.py file

def custom_filter(text, keep_chars=""):
    ## Keep alphanumeric characters and any characters specified in keep_chars
    filtered_text = ''.join(char for char in text if char.isalnum() or char in keep_chars)
    return filtered_text

print("\nCustom filtering (keeping spaces and @):")
print("-" * 40)

for text in test_strings:
    filtered = custom_filter(text, keep_chars=" @")
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

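Append this code to the end of string_filter.py and run it again.

python3 /home/labex/project/string_filter.py

You will now see that spaces and the @ sign are kept in the filtered results. This is useful when you need to preserve specific formatting or special characters.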
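The choice of keep_chars determines what survives. The following minimal sketch repeats the custom_filter logic so that it runs on its own. The variable names are illustrative only.

```python
def custom_filter(text, keep_chars=""):
    # Same logic as the lab's custom_filter, repeated so the sketch is self-contained
    return ''.join(ch for ch in text if ch.isalnum() or ch in keep_chars)

# Keeping " @" preserves spaces and the @ sign but still drops the period
email_line = custom_filter("Email: user@example.com", keep_chars=" @")

# Keeping "-" preserves the hyphen in the phone number
phone_line = custom_filter("Phone: (123) 456-7890", keep_chars="-")

print(email_line)  # Email user@examplecom
print(phone_line)  # Phone123456-7890
```

Note that the period in the email address is removed in the first call. To keep the address intact, you would add "." to keep_chars as well.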

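Using Regular Expressions for Text Cleaning

Regular expressions (regex) provide a powerful way to identify and manipulate patterns in text. Python's re module provides functions for working with regular expressions.

A Basic Regular Expression for Character Filtering

Let's create a new file named regex_filter.py.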

## Using regular expressions to filter non-alphanumeric characters
import re

def filter_with_regex(text):
    ## Replace all non-alphanumeric characters with an empty string
    filtered_text = re.sub(r'[^a-zA-Z0-9]', '', text)
    return filtered_text

## Test the function with different examples
test_strings = [
    "Hello, World!",
    "Python 3.10 is amazing!",
    "Email: user@example.com",
    "Phone: (123) 456-7890"
]

print("Original vs Regex Filtered:")
print("-" * 40)

for text in test_strings:
    filtered = filter_with_regex(text)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 40)

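The regular expression pattern [^a-zA-Z0-9] matches any character that is not an ASCII uppercase letter, lowercase letter, or digit. The re.sub() function replaces every match with an empty string. Unlike isalnum(), this pattern also removes letters and digits outside the ASCII range.

Run the file.

python3 /home/labex/project/regex_filter.py

You should see output similar to the following.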

Original vs Regex Filtered:
----------------------------------------
Original: Hello, World!
Filtered: HelloWorld
----------------------------------------
Original: Python 3.10 is amazing!
Filtered: Python310isamazing
----------------------------------------
Original: Email: user@example.com
Filtered: Emailuserexamplecom
----------------------------------------
Original: Phone: (123) 456-7890
Filtered: Phone1234567890
----------------------------------------
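The pattern [^a-zA-Z0-9] only treats ASCII letters and digits as alphanumeric. If your text contains letters from other scripts, one possible alternative is the character class [\W_], which relies on Python's Unicode-aware \w. The sketch below compares the two. The sample text is made up for illustration, and \w does not match exactly the same characters as isalnum() in every edge case.

```python
import re

text = "Café 한글 123_abc!"

# ASCII-only: removes é and the Korean characters along with punctuation
ascii_only = re.sub(r'[^a-zA-Z0-9]', '', text)

# Unicode-aware: \W matches non-word characters; the underscore is listed
# separately because \w counts it as a word character
unicode_aware = re.sub(r'[\W_]', '', text)

print(ascii_only)     # Caf123abc
print(unicode_aware)  # Café한글123abc
```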

Custom Patterns with Regular Expressions

Regular expressions allow more complex patterns and replacements. Let's add a function that accepts a custom pattern.

## Add to the regex_filter.py file

def custom_regex_filter(text, pattern=r'[^a-zA-Z0-9]', replacement=''):
    ## Replace characters matching the pattern with the replacement
    filtered_text = re.sub(pattern, replacement, text)
    return filtered_text

print("\nCustom regex filtering (keeping spaces and some punctuation):")
print("-" * 60)

## Keep alphanumeric chars, spaces, and @.
custom_pattern = r'[^a-zA-Z0-9\s@\.]'

for text in test_strings:
    filtered = custom_regex_filter(text, pattern=custom_pattern)
    print(f"Original: {text}")
    print(f"Filtered: {filtered}")
    print("-" * 60)

The pattern [^a-zA-Z0-9\s@\.] matches any character that is not an alphanumeric character, whitespace (\s), an @ sign, or a period. Add this code to regex_filter.py and run it again.

python3 /home/labex/project/regex_filter.py
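Writing character classes by hand is error-prone when the characters to keep include regex metacharacters such as -, ], or ^. One way to avoid this, sketched below, is to build the pattern with re.escape(). The helper name build_keep_pattern is hypothetical and not part of regex_filter.py.

```python
import re

def build_keep_pattern(keep_chars=""):
    """Compile a pattern matching everything except ASCII letters, digits, and keep_chars."""
    # re.escape() neutralizes characters such as '-', ']' or '^' that are special inside [...]
    return re.compile(f"[^a-zA-Z0-9{re.escape(keep_chars)}]")

email_pattern = build_keep_pattern(" @.")
phone_pattern = build_keep_pattern("-")

print(email_pattern.sub('', "Email: user@example.com"))  # Email user@example.com
print(phone_pattern.sub('', "Phone: (123) 456-7890"))    # Phone123456-7890
```

Compiling the pattern once with re.compile() also lets you reuse it across many strings without repeating the pattern text.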

Identifying Non-Alphanumeric Characters

Sometimes you want to find out which non-alphanumeric characters a string contains. Let's add a function that identifies them.

## Add to the regex_filter.py file

def identify_non_alphanumeric(text):
    ## Find all non-alphanumeric characters in the text
    non_alphanumeric = re.findall(r'[^a-zA-Z0-9]', text)
    ## Return unique characters as a set
    return set(non_alphanumeric)

print("\nIdentifying non-alphanumeric characters:")
print("-" * 40)

for text in test_strings:
    characters = identify_non_alphanumeric(text)
    print(f"Text: {text}")
    print(f"Non-alphanumeric characters: {characters}")
    print("-" * 40)

Add this code to regex_filter.py and run it again.

python3 /home/labex/project/regex_filter.py

The output shows which non-alphanumeric characters appear in each string. This helps you understand what needs to be filtered from your data.
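A set tells you which characters are present but not how often they occur. If frequency matters, a small variation using collections.Counter is one option. The function name count_non_alphanumeric below is illustrative and not part of the lab files.

```python
import re
from collections import Counter

def count_non_alphanumeric(text):
    """Count how often each non-alphanumeric (ASCII) character occurs."""
    return Counter(re.findall(r'[^a-zA-Z0-9]', text))

counts = count_non_alphanumeric("Phone: (123) 456-7890")

# most_common() lists characters from most to least frequent
for char, n in counts.most_common():
    print(f"{char!r}: {n}")
```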

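Real-World Text Cleaning Applications

Now that you have learned several ways to filter non-alphanumeric characters, let's apply these techniques to real-world scenarios.

Cleaning User Input

User input often contains unexpected characters that need to be cleaned. To demonstrate this, let's create a file named text_cleaning_app.py.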

## Text cleaning application for user input
import re

def clean_username(username):
    """Clean a username by removing everything except letters, digits, and underscores"""
    return re.sub(r'[^a-zA-Z0-9_]', '', username)

def clean_search_query(query):
    """Preserves alphanumeric chars and spaces, replaces multiple spaces with one"""
    ## First, replace non-alphanumeric chars (except spaces) with empty string
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', query)
    ## Then, replace multiple spaces with a single space
    cleaned = re.sub(r'\s+', ' ', cleaned)
    ## Finally, strip leading and trailing spaces
    return cleaned.strip()

## Simulate user input
usernames = [
    "john.doe",
    "user@example",
    "my username!",
    "admin_123"
]

search_queries = [
    "python   programming",
    "how to filter?!  special chars",
    "$ regex      examples $",
    "   string methods   "
]

## Clean and display usernames
print("Username Cleaning:")
print("-" * 40)
for username in usernames:
    cleaned = clean_username(username)
    print(f"Original: {username}")
    print(f"Cleaned:  {cleaned}")
    print("-" * 40)

## Clean and display search queries
print("\nSearch Query Cleaning:")
print("-" * 40)
for query in search_queries:
    cleaned = clean_search_query(query)
    print(f"Original: '{query}'")
    print(f"Cleaned:  '{cleaned}'")
    print("-" * 40)

Run this file.

python3 /home/labex/project/text_cleaning_app.py
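To confirm that the cleaning functions behave as intended, you could add a few assertion-based checks. The sketch below repeats the two functions in condensed form so that it runs on its own. The expected values were derived by hand from the patterns above.

```python
import re

def clean_username(username):
    # Keep ASCII letters, digits, and underscores
    return re.sub(r'[^a-zA-Z0-9_]', '', username)

def clean_search_query(query):
    # Drop everything except letters, digits, and whitespace,
    # then collapse whitespace runs and trim the ends
    cleaned = re.sub(r'[^a-zA-Z0-9\s]', '', query)
    return re.sub(r'\s+', ' ', cleaned).strip()

assert clean_username("john.doe") == "johndoe"
assert clean_username("my username!") == "myusername"
assert clean_username("admin_123") == "admin_123"

assert clean_search_query("python   programming") == "python programming"
assert clean_search_query("$ regex      examples $") == "regex examples"
assert clean_search_query("   string methods   ") == "string methods"

print("All cleaning checks passed.")
```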

Processing File Data

Let's create a sample text file and clean it. First, create a file named sample_data.txt with the following content.

User1: john.doe@example.com (Active: Yes)
User2: jane_smith@example.com (Active: No)
User3: admin#123@system.org (Active: Yes)
Notes: Users should change their passwords every 90 days!

You can create this file with the WebIDE editor. Now let's create a file named file_cleaner.py to clean this data.

## File cleaning application
import re

def extract_emails(text):
    """Extract email addresses from text"""
    ## Simple regex for email extraction
    email_pattern = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'
    return re.findall(email_pattern, text)

def extract_usernames(text):
    """Extract the username part from email addresses"""
    emails = extract_emails(text)
    usernames = [email.split('@')[0] for email in emails]
    return usernames

def clean_usernames(usernames):
    """Clean usernames by removing non-alphanumeric characters"""
    return [re.sub(r'[^a-zA-Z0-9]', '', username) for username in usernames]

## Read the sample data file
try:
    with open('/home/labex/project/sample_data.txt', 'r') as file:
        data = file.read()
except FileNotFoundError:
    print("Error: sample_data.txt file not found!")
    raise SystemExit(1)

## Process the data
print("File Cleaning Results:")
print("-" * 50)
print("Original data:")
print(data)
print("-" * 50)

## Extract emails
emails = extract_emails(data)
print(f"Extracted {len(emails)} email addresses:")
for email in emails:
    print(f"  - {email}")

## Extract and clean usernames
usernames = extract_usernames(data)
cleaned_usernames = clean_usernames(usernames)

print("\nUsername extraction and cleaning:")
for i, (original, cleaned) in enumerate(zip(usernames, cleaned_usernames)):
    print(f"  - User {i+1}: {original} → {cleaned}")

print("-" * 50)

Run this file.

python3 /home/labex/project/file_cleaner.py
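The email pattern in file_cleaner.py is deliberately simple, and it is worth seeing how it behaves on the third sample line, where the part before the @ sign contains a # character. The sketch below applies the same pattern to the sample data inline instead of reading the file. The results in the comments were worked out by hand from the pattern.

```python
import re

# Same simple pattern as in file_cleaner.py
EMAIL_PATTERN = r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+'

sample = """User1: john.doe@example.com (Active: Yes)
User2: jane_smith@example.com (Active: No)
User3: admin#123@system.org (Active: Yes)
Notes: Users should change their passwords every 90 days!"""

emails = re.findall(EMAIL_PATTERN, sample)
usernames = [email.split('@')[0] for email in emails]
cleaned = [re.sub(r'[^a-zA-Z0-9]', '', name) for name in usernames]

print(emails)   # ['john.doe@example.com', 'jane_smith@example.com', '123@system.org']
print(cleaned)  # ['johndoe', 'janesmith', '123']
```

Because # is not in the allowed character set, the match for the third line starts at 123 and the admin prefix is lost. Whether that is acceptable depends on your data, so inspect the extracted values before cleaning them further.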

Performance Comparison

Different filtering methods can have different performance characteristics. To compare them, let's create a file named performance_test.py.

## Performance comparison of different filtering methods
import re
import time

def filter_with_loop(text):
    """Filter using a simple loop"""
    result = ""
    for char in text:
        if char.isalnum():
            result += char
    return result

def filter_with_comprehension(text):
    """Filter using a generator expression passed to str.join"""
    return ''.join(char for char in text if char.isalnum())

def filter_with_filter_function(text):
    """Filter using the built-in filter function"""
    return ''.join(filter(str.isalnum, text))

def filter_with_regex(text):
    """Filter using regular expressions"""
    return re.sub(r'[^a-zA-Z0-9]', '', text)

def filter_with_translate(text):
    """Filter using string.translate"""
    ## Build a translation table that deletes every non-alphanumeric ASCII character;
    ## characters outside the ASCII range are left unchanged
    from string import ascii_letters, digits
    allowed = ascii_letters + digits
    translation_table = str.maketrans('', '', ''.join(c for c in map(chr, range(128)) if c not in allowed))
    return text.translate(translation_table)

## Generate test data (a string with a mix of alphanumeric and other characters)
test_data = "".join(chr(i) for i in range(33, 127)) * 1000  ## ASCII printable characters repeated

## Define the filtering methods to test
methods = [
    ("Simple Loop", filter_with_loop),
    ("Generator Expression", filter_with_comprehension),
    ("Filter Function", filter_with_filter_function),
    ("Regular Expression", filter_with_regex),
    ("String Translate", filter_with_translate)
]

print("Performance Comparison:")
print("-" * 60)
print(f"Test data length: {len(test_data)} characters")
print("-" * 60)
print(f"{'Method':<20} | {'Time (seconds)':<15} | {'Characters Removed':<20}")
print("-" * 60)

## Test each method
for name, func in methods:
    start_time = time.perf_counter()  ## high-resolution timer intended for measuring durations
    result = func(test_data)
    end_time = time.perf_counter()

    execution_time = end_time - start_time
    chars_removed = len(test_data) - len(result)

    print(f"{name:<20} | {execution_time:<15.6f} | {chars_removed:<20}")

print("-" * 60)

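Run this file.

python3 /home/labex/project/performance_test.py

The output shows how long each method took on this test data and how many characters it removed. Timings from a single run vary between machines and between runs, so treat them as a rough guide. The differences matter most when you process large volumes of text.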
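For steadier measurements, one option is the standard library's timeit module, which runs each function many times. The following is a minimal sketch comparing two of the methods. The data size and the repeat counts are arbitrary choices made so that the example finishes quickly.

```python
import re
import timeit

def filter_with_join(text):
    return ''.join(filter(str.isalnum, text))

def filter_with_regex(text):
    return re.sub(r'[^a-zA-Z0-9]', '', text)

# Smaller than the lab's test data so the sketch finishes quickly
test_data = "".join(chr(i) for i in range(33, 127)) * 100

calls_per_round = 20
for name, func in [("filter() + join", filter_with_join), ("re.sub", filter_with_regex)]:
    # repeat() returns one total time per round; the minimum is the least disturbed round
    timings = timeit.repeat(lambda: func(test_data), number=calls_per_round, repeat=5)
    print(f"{name:<16} | best of 5: {min(timings) / calls_per_round:.6f} s per call")
```

Taking the minimum over several rounds reduces the influence of other processes running on the machine, but the absolute numbers will still differ from system to system.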

Summary

In this lab, you learned several ways to filter non-alphanumeric characters from Python strings.

String Methods: Use Python's built-in string methods, such as isalnum(), to check and filter characters.
Comprehensions and filter(): Build cleaned strings with a comprehension-style generator expression or the built-in filter() function.
Regular Expressions: Use Python's re module for powerful pattern matching and replacement.
Real-World Applications: Apply these techniques to practical scenarios such as cleaning user input, processing file data, and comparing performance.

These techniques are essential for text processing tasks in many areas, including:

Data cleaning for data analysis and machine learning
Natural Language Processing (NLP)
Web scraping and data extraction
User input validation in web applications

Having practiced these methods, you can now turn messy text data into a clean, structured format that is easy to analyze and process in your Python applications.
