Advanced Techniques
Approximate String Matching
Levenshtein Distance
def levenshtein_distance(s1, s2):
    # Make s1 the longer string so each row is as short as possible
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)

    # The distance to an empty string is the length of the other string
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row

    return previous_row[-1]

# Example
print(levenshtein_distance("python", "pyth0n"))  # Prints 1 (one substitution)
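A raw edit distance is hard to compare across strings of different lengths. One common convention, sketched here as an assumption rather than something this tutorial prescribes, is to divide the distance by the longer length to get a score between 0 and 1. The names `edit_distance`, `similarity`, and `is_match`, and the 0.8 threshold, are illustrative choices.

```python
def edit_distance(s1, s2):
    """Levenshtein distance computed with a single rolling row."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    previous_row = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current_row = [i]
        for j, c2 in enumerate(s2, start=1):
            current_row.append(min(
                previous_row[j] + 1,               # delete c1
                current_row[j - 1] + 1,            # insert c2
                previous_row[j - 1] + (c1 != c2),  # substitute c1 with c2
            ))
        previous_row = current_row
    return previous_row[-1]


def similarity(s1, s2):
    """Map the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(s1), len(s2))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - edit_distance(s1, s2) / longest


def is_match(s1, s2, threshold=0.8):
    """Hypothetical helper: the threshold is an arbitrary illustrative value."""
    return similarity(s1, s2) >= threshold


print(similarity("python", "pyth0n"))
print(is_match("python", "pyth0n"))  # True
print(is_match("python", "java"))    # False
```

The right threshold depends on the data: short strings lose a large share of their score for a single edit, so a fixed cutoff may need tuning per use case.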
Phonetic Matching
Soundex Algorithm
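def soundex(name):
    # Convert to uppercase and remove non-alphabetic characters
    name = ''.join(filter(str.isalpha, name.upper()))
    if not name:
        return ''  # Nothing to encode

    # Letter groups and their Soundex digits
    encoding = {
        'BFPV': '1', 'CGJKQSXZ': '2',
        'DT': '3', 'L': '4',
        'MN': '5', 'R': '6'
    }

    def code_for(char):
        for letters, digit in encoding.items():
            if char in letters:
                return digit
        return ''  # Vowels, H, W, Y and other letters have no code

    # Keep the first letter and remember its code to skip adjacent duplicates
    result = name[0]
    previous = code_for(name[0])

    # Encode remaining letters
    for char in name[1:]:
        code = code_for(char)
        if code and code != previous:
            result += code
        # H and W do not separate letters with the same code; vowels do
        if char not in 'HW':
            previous = code

    # Pad or truncate to 4 characters
    return (result + '000')[:4]

# Example
print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163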
Regular Expression Matching
import re

def advanced_string_match(pattern, text):
    # Case-insensitive search anywhere in the text
    return re.search(pattern, text, re.IGNORECASE) is not None

# Example
patterns = [
    r'\bpython\b',  # Whole word match
    r'prog.*lang',  # Partial match with wildcards
]
test_strings = [
    "I love Python programming",
    "Programming languages are awesome"
]
for pattern in patterns:
    for text in test_strings:
        print(f"Pattern: {pattern}, Text: {text}")
        print(f"Match: {advanced_string_match(pattern, text)}")
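Because `re.search` interprets its first argument as a pattern, search terms that come from user input can change meaning when they contain characters such as `.` or `+`. The following is a minimal sketch, assuming the goal is a case-insensitive whole-word search for literal text. `make_word_matcher` is a hypothetical helper name; `re.escape` and `re.compile` are standard library functions.

```python
import re


def make_word_matcher(word):
    """Return a function that tests for `word` as a whole word, ignoring case."""
    # re.escape makes characters such as '+' or '.' match literally.
    # Lookarounds are used instead of \b because \b does not behave as a
    # word boundary next to non-word characters like '+'.
    pattern = re.compile(
        r'(?<!\w)' + re.escape(word) + r'(?!\w)',
        re.IGNORECASE,
    )
    return lambda text: pattern.search(text) is not None


has_cpp = make_word_matcher("C++")
print(has_cpp("I write c++ and Python"))  # True
print(has_cpp("The grade was C+"))        # False
```

Compiling once and reusing the matcher also avoids re-parsing the pattern when the same term is checked against many strings.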
Matching Workflow
graph TD
A[Input Strings] --> B{Matching Technique}
B -->|Levenshtein| C[Calculate Edit Distance]
B -->|Soundex| D[Generate Phonetic Code]
B -->|Regex| E[Apply Pattern Matching]
C --> F[Determine Similarity]
D --> F
E --> F
F --> G[Match Result]
Comparison of Advanced Techniques
| Technique | Use case | Complexity | Performance |
| --- | --- | --- | --- |
| Levenshtein | Edit distance | O(mn), with m and n the string lengths | Moderate |
| Soundex | Phonetic matching | O(n) | Fast |
| Regex | Pattern matching | Varies | Depends on the pattern |
Machine Learning Approach
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def ml_string_similarity(s1, s2):
    # Build TF-IDF vectors over word tokens, then compare them by cosine
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([s1, s2])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]

# Example
# Prints 0.0: the two strings share no word tokens
print(ml_string_similarity("machine learning", "ml techniques"))
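With word-level tokens, which is the `TfidfVectorizer` default, two strings that share no whole word score 0.0 however similar they look. The sketch below does not use scikit-learn and applies no TF-IDF weighting. It is a plain-Python illustration, under that stated simplification, of cosine similarity over character bigram counts, which tolerates small spelling differences. The function names are illustrative.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text, n=2):
    """Count the overlapping character n-grams of a lowercased string."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def ngram_cosine_similarity(s1, s2, n=2):
    """Cosine similarity between the n-gram count vectors of two strings."""
    v1, v2 = char_ngrams(s1, n), char_ngrams(s2, n)
    # A Counter returns 0 for missing keys, so unshared n-grams add nothing
    dot = sum(count * v2[gram] for gram, count in v1.items())
    norm1 = sqrt(sum(c * c for c in v1.values()))
    norm2 = sqrt(sum(c * c for c in v2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # a string shorter than n has no n-grams
    return dot / (norm1 * norm2)


print(ngram_cosine_similarity("python", "pyth0n"))
print(ngram_cosine_similarity("machine learning", "machine learnin"))
```

In scikit-learn itself, character n-grams are available through the vectorizer's `analyzer` and `ngram_range` parameters, which keeps the TF-IDF weighting that this sketch omits.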
Key Takeaways
- Advanced string matching goes beyond exact comparisons
- Different techniques suit different scenarios
- LabEx recommends choosing techniques based on your specific requirements