Advanced techniques
Fuzzy string matching
Levenshtein distance
```python
def levenshtein_distance(s1, s2):
    # Keep s2 as the shorter string so each row stays small
    if len(s1) < len(s2):
        return levenshtein_distance(s2, s1)
    if len(s2) == 0:
        return len(s1)

    previous_row = range(len(s2) + 1)
    for i, c1 in enumerate(s1):
        current_row = [i + 1]
        for j, c2 in enumerate(s2):
            insertions = previous_row[j + 1] + 1
            deletions = current_row[j] + 1
            substitutions = previous_row[j] + (c1 != c2)
            current_row.append(min(insertions, deletions, substitutions))
        previous_row = current_row
    return previous_row[-1]


# Example
print(levenshtein_distance("python", "pyth0n"))  # 1 (one substitution: o -> 0)
```
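A raw edit distance is hard to compare across strings of different lengths, so in practice it is often scaled to a score between 0 and 1 and checked against a threshold. The following is a minimal sketch of that idea, not part of the original lesson: the helper names `edit_distance`, `similarity` and `best_match`, and the default threshold of 0.8, are assumptions chosen for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance computed with two rolling rows."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # cell above
                current[j - 1] + 1,            # cell to the left
                previous[j - 1] + (ca != cb),  # diagonal: substitution or match
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """Scale the distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def best_match(query, candidates, threshold=0.8):
    """Return the closest candidate, or None if none reaches the threshold."""
    if not candidates:
        return None
    best = max(candidates, key=lambda c: similarity(query, c))
    return best if similarity(query, best) >= threshold else None


# Hypothetical usage: pick the closest known word for a misspelled query
print(similarity("python", "pyth0n"))
print(best_match("pythn", ["python", "cython", "jython"]))
```

Dividing by the longer length keeps the score symmetric. The threshold is a tuning knob: a stricter value trades missed matches for fewer false positives.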
Phonetic matching
Soundex algorithm
```python
def soundex(name):
    # Convert to uppercase and remove non-alphabetic characters
    name = ''.join(filter(str.isalpha, name.upper()))
    if not name:
        return ''

    # Map each consonant group to its Soundex digit
    encoding = {
        'BFPV': '1', 'CGJKQSXZ': '2',
        'DT': '3', 'L': '4',
        'MN': '5', 'R': '6'
    }
    codes = {ch: digit for group, digit in encoding.items() for ch in group}

    # Keep first letter and remember its code so an immediate repeat is skipped
    result = name[0]
    previous = codes.get(name[0], '')
    for char in name[1:]:
        code = codes.get(char, '')
        if code and code != previous:
            result += code
        # H and W do not separate letters with the same code; vowels do
        if char not in 'HW':
            previous = code

    # Pad or truncate to 4 characters
    return (result + '000')[:4]


# Example
print(soundex("Robert"))  # R163
print(soundex("Rupert"))  # R163
```
Regular expression matching
```python
import re


def advanced_string_match(pattern, text):
    # Case-insensitive partial match
    return re.search(pattern, text, re.IGNORECASE) is not None


# Example
patterns = [
    r'\bpython\b',  # Whole word match
    r'prog.*lang',  # Partial match with wildcards
]
test_strings = [
    "I love Python programming",
    "Programming languages are awesome"
]
for pattern in patterns:
    for text in test_strings:
        print(f"Pattern: {pattern}, Text: {text}")
        print(f"Match: {advanced_string_match(pattern, text)}")
```
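When the search term comes from user input rather than a hand-written pattern, characters such as `.` or `+` would be read as regex operators. The sketch below shows one way to handle that with `re.escape` and a precompiled pattern. The helper name `compile_literal` and its `whole_word` flag are hypothetical, introduced only for this example.

```python
import re


def compile_literal(term, whole_word=True):
    """Build a case-insensitive pattern that matches `term` literally."""
    escaped = re.escape(term)  # neutralise ., *, + and other metacharacters
    if whole_word:
        # Lookarounds are used instead of \b so that terms ending in a
        # non-word character (for example "C++") still match as whole words
        escaped = rf"(?<!\w){escaped}(?!\w)"
    return re.compile(escaped, re.IGNORECASE)


# Hypothetical usage: compile once, reuse across many texts
pattern = compile_literal("C++")
for text in ["I use c++ daily", "abc++ is not a language"]:
    print(text, "->", pattern.search(text) is not None)
```

Compiling once and reusing the pattern object avoids re-parsing the expression for every text.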
Matching workflow
```mermaid
graph TD
    A[Input Strings] --> B{Matching Technique}
    B -->|Levenshtein| C[Calculate Edit Distance]
    B -->|Soundex| D[Generate Phonetic Code]
    B -->|Regex| E[Apply Pattern Matching]
    C --> F[Determine Similarity]
    D --> F
    E --> F
    F --> G[Match Result]
```
Comparison of advanced techniques

| Technique | Use case | Complexity | Performance |
| --- | --- | --- | --- |
| Levenshtein | Edit distance | O(mn) | Moderate |
| Soundex | Phonetic matching | O(n) | Fast |
| Regex | Pattern matching | Varies | Depends on the pattern |
Machine learning approach
```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity


def ml_string_similarity(s1, s2):
    # Fit TF-IDF on just the two strings, then compare their vectors
    vectorizer = TfidfVectorizer()
    tfidf_matrix = vectorizer.fit_transform([s1, s2])
    return cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])[0][0]


# Example: the default word-level tokens do not overlap here, so the score is 0.0
print(ml_string_similarity("machine learning", "ml techniques"))
```
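Word-level vectors only register similarity when two strings share whole words, so typos and abbreviations score poorly. A common workaround is to compare character n-grams instead. The following is a simplified, dependency-free sketch of that idea using raw n-gram counts rather than TF-IDF weights. The function names `char_ngrams` and `ngram_cosine` and the default `n=3` are assumptions made for illustration. In scikit-learn, a comparable effect can usually be obtained by passing `analyzer="char_wb"` and an `ngram_range` to `TfidfVectorizer`.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lower-cased, space-padded string."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def ngram_cosine(s1, s2, n=3):
    """Cosine similarity between raw n-gram count vectors (no IDF weighting)."""
    a, b = char_ngrams(s1, n), char_ngrams(s2, n)
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical usage: a typo still scores high, unrelated text scores low
print(ngram_cosine("machine learning", "machine learnin"))
print(ngram_cosine("machine learning", "ml techniques"))
```

Padding each string with a space lets n-grams capture word starts and ends, which helps short strings produce at least a few features.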
Key takeaways
- Advanced string matching goes beyond exact comparisons.
- Different techniques suit different scenarios.
- LabEx recommends choosing techniques based on your specific requirements.