# Practical Frequency Analysis

## Real-World Applications

### Text Processing and Analysis
```python
from collections import Counter

def analyze_text_frequency(filename):
    with open(filename, 'r') as file:
        text = file.read().lower()

    words = text.split()

    # Comprehensive frequency analysis
    total_words = len(words)
    unique_words = len(set(words))
    word_freq = Counter(words)

    return {
        'total_words': total_words,
        'unique_words': unique_words,
        'top_words': word_freq.most_common(5),
        'word_diversity': unique_words / total_words
    }

# Example usage
result = analyze_text_frequency('sample_document.txt')
print(result)
```
### Frequency Analysis Workflow

```mermaid
graph TD
    A[Load Text Data] --> B[Preprocess Text]
    B --> C[Tokenize Words]
    C --> D[Calculate Frequencies]
    D --> E[Generate Insights]
    E --> F[Visualize/Report Results]
```
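The workflow above can be sketched as a small pipeline. The function names here are illustrative, not from any specific library:

```python
from collections import Counter
import re

def preprocess(text):
    # Lowercase and strip punctuation (a simple, illustrative cleanup)
    return re.sub(r'[^a-z\s]', ' ', text.lower())

def tokenize(text):
    return text.split()

def run_frequency_pipeline(raw_text, top_n=3):
    # Load -> Preprocess -> Tokenize -> Count -> Report
    words = tokenize(preprocess(raw_text))
    freq = Counter(words)
    return freq.most_common(top_n)

print(run_frequency_pipeline("Python, python; DATA data data!"))
# [('data', 3), ('python', 2)]
```

Each stage stays a separate function so that, for real projects, a stage can be swapped out (for example, replacing `tokenize` with a proper NLP tokenizer) without touching the rest of the pipeline.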
### Key Analysis Techniques

| Technique       | Purpose               | Method                  |
|-----------------|-----------------------|-------------------------|
| Word Count      | Basic frequency       | Count occurrences       |
| TF-IDF          | Term importance       | Weighted frequency      |
| N-gram Analysis | Context understanding | Multiple word sequences |
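As a sketch of the n-gram technique from the table, bigram frequencies can be counted with a sliding window over the token list (the helper below is hypothetical, not a library call):

```python
from collections import Counter

def ngram_frequencies(text, n=2):
    # Slide a window of n words across the token list
    words = text.lower().split()
    ngrams = zip(*(words[i:] for i in range(n)))
    return Counter(' '.join(gram) for gram in ngrams)

bigrams = ngram_frequencies("python is fun and python is powerful")
print(bigrams.most_common(2))
# 'python is' occurs twice; every other bigram occurs once
```

Unlike single-word counts, n-grams preserve local word order, which is why the table lists them under context understanding.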
### Advanced Frequency Filtering

```python
from collections import Counter

def advanced_frequency_filter(text, min_length=3, min_frequency=2):
    words = text.lower().split()
    word_freq = Counter(words)

    # Keep only words that are long enough and frequent enough
    filtered_words = {
        word: freq for word, freq in word_freq.items()
        if len(word) >= min_length and freq >= min_frequency
    }

    return dict(sorted(filtered_words.items(), key=lambda x: x[1], reverse=True))

sample_text = "python programming is fun python is powerful programming is exciting"
filtered_frequencies = advanced_frequency_filter(sample_text)
print(filtered_frequencies)
```
### Natural Language Processing Techniques

```python
from sklearn.feature_extraction.text import CountVectorizer

def extract_text_features(documents):
    vectorizer = CountVectorizer(max_features=10)
    frequency_matrix = vectorizer.fit_transform(documents)

    return {
        'feature_names': vectorizer.get_feature_names_out(),
        'frequency_matrix': frequency_matrix.toarray()
    }

documents = [
    "python is great",
    "python programming is awesome",
    "data science with python"
]

features = extract_text_features(documents)
print(features)
```
### Performance Considerations

- Use efficient data structures
- Implement caching mechanisms
- Optimize memory usage for large datasets
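One way to apply the caching idea above is `functools.lru_cache`, which memoizes repeated analyses of identical input. This is a minimal sketch; the function name is illustrative:

```python
from collections import Counter
from functools import lru_cache

@lru_cache(maxsize=128)
def cached_word_frequencies(text):
    # Return a tuple (hashable and immutable) so the result is safe to cache
    return tuple(Counter(text.lower().split()).most_common())

# First call computes; second call with the same text is a cache hit
print(cached_word_frequencies("go go python go"))
print(cached_word_frequencies("go go python go"))
print(cached_word_frequencies.cache_info())
```

Caching only pays off when the same text is analyzed repeatedly; for very large corpora, bounding `maxsize` matters more than raising it, since each cached entry holds a full frequency table in memory.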
### Visualization Strategies

```python
import matplotlib.pyplot as plt

def visualize_word_frequencies(freq_dict):
    plt.figure(figsize=(10, 5))
    plt.bar(freq_dict.keys(), freq_dict.values())
    plt.title('Word Frequencies')
    plt.xlabel('Words')
    plt.ylabel('Frequency')
    plt.xticks(rotation=45)
    plt.tight_layout()
    plt.show()
```
### LabEx Recommendation

At LabEx, we emphasize practical skills in frequency analysis, combining theoretical knowledge with hands-on coding experience to solve real-world text processing challenges.