Sampling Methods in Python
Overview of Sampling Libraries
Python offers multiple libraries for random sampling, each with unique capabilities:
| Library | Key Features | Best Used For |
|---|---|---|
| random | Basic sampling | Simple random selections |
| numpy.random | Advanced statistical sampling | Scientific computing |
| pandas | DataFrame sampling | Data analysis |
| sklearn.utils | Machine learning sampling | Model training |
Random Module Sampling Techniques
Simple Random Sampling
import random
## Generate a list
data = list(range(1, 100))
## Random sample without replacement
sample_without_replacement = random.sample(data, 10)
## Random sample with replacement (elements may repeat)
sample_with_replacement = random.choices(data, k=10)
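The difference between the two modes is easy to check: sampling without replacement never repeats an element, while sampling with replacement can. The sketch below assumes a dedicated `random.Random` instance with an arbitrary seed of 42, so the run is repeatable without touching the module-level generator.

```python
import random

# A dedicated generator; the seed 42 is an arbitrary choice for repeatability.
rng = random.Random(42)

data = list(range(1, 100))

# Without replacement: every element appears at most once.
without_replacement = rng.sample(data, 10)

# With replacement: the same element may be drawn more than once.
with_replacement = rng.choices(data, k=10)

print(len(set(without_replacement)) == 10)   # unique by construction
print(len(set(with_replacement)) <= 10)      # duplicates are possible
```

Using a separate `Random` instance keeps the seeding local, which helps when other code in the same program also draws random numbers.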
Weighted Sampling
import random
## Weighted sampling
items = ['apple', 'banana', 'cherry']
weights = [0.5, 0.3, 0.2]
weighted_sample = random.choices(items, weights=weights, k=5)
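One way to confirm that the weights behave as expected is to draw a large sample and compare the observed share of each item with its weight. The sketch below reuses the same items and weights; the seed and the draw count of 10,000 are arbitrary assumptions.

```python
import random
from collections import Counter

rng = random.Random(0)  # arbitrary seed, used only for repeatability

items = ['apple', 'banana', 'cherry']
weights = [0.5, 0.3, 0.2]

# Draw many items so the observed shares settle near the weights.
draws = rng.choices(items, weights=weights, k=10_000)
counts = Counter(draws)

for item, weight in zip(items, weights):
    share = counts[item] / len(draws)
    print(f"{item}: expected {weight:.2f}, observed {share:.3f}")
```

The weights passed to `random.choices` are relative, so `[5, 3, 2]` would give the same distribution as `[0.5, 0.3, 0.2]`.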
NumPy Sampling Methods
import numpy as np
## Set random seed for reproducibility
np.random.seed(42)
## Generate random sample
data = np.arange(100)
random_sample = np.random.choice(data, size=10, replace=False)
## Uniform distribution sampling
uniform_sample = np.random.uniform(0, 1, 10)
## Normal distribution sampling
normal_sample = np.random.normal(0, 1, 10)
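NumPy's documentation recommends the `Generator` API, created with `np.random.default_rng`, for new code in place of the legacy `np.random.seed` functions. The following is a minimal sketch of the same three draws using a generator; the seed 42 is an arbitrary assumption.

```python
import numpy as np

# Generator with an arbitrary seed; independent of np.random.seed().
rng = np.random.default_rng(42)

data = np.arange(100)

# Same three draws as the legacy calls, made through the generator.
random_sample = rng.choice(data, size=10, replace=False)
uniform_sample = rng.uniform(0, 1, 10)
normal_sample = rng.normal(0, 1, 10)

# A second generator with the same seed reproduces the first draw.
rng_again = np.random.default_rng(42)
repeat_sample = rng_again.choice(data, size=10, replace=False)
print(np.array_equal(random_sample, repeat_sample))
```

Passing the generator object around explicitly avoids hidden global state, so two parts of a program cannot disturb each other's random streams.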
Pandas Sampling Techniques
import pandas as pd
import numpy as np
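## Create sample DataFrame with a categorical column to stratify on
df = pd.DataFrame(np.random.rand(100, 3), columns=['A', 'B', 'C'])
df['group'] = np.repeat(['x', 'y'], 50)
## Random sample of rows
random_rows = df.sample(n=10)
## Stratified sampling: 3 rows from each group.
## Grouping on a continuous column such as 'A' would produce one-row groups,
## and sample(n=3) would raise an error.
stratified_sample = df.groupby('group').sample(n=3)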
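Stratified sampling needs a column with repeated values to group on, and it is often done proportionally so that each stratum keeps its share of the data. The sketch below assumes a hypothetical `group` column with unequal group sizes and pandas 1.1 or later, which provides `DataFrameGroupBy.sample`; the seed passed as `random_state` is arbitrary.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a categorical 'group' column with unequal group sizes.
df = pd.DataFrame({
    'group': np.repeat(['x', 'y', 'z'], [60, 30, 10]),
    'value': np.arange(100),
})

# Take 20% of each group; random_state is an arbitrary seed.
stratified = df.groupby('group').sample(frac=0.2, random_state=0)

# Rows per group in the sample: 12 for x, 6 for y, 2 for z.
group_counts = stratified['group'].value_counts().sort_index()
print(group_counts)
```

Sampling by fraction preserves the 60/30/10 proportions of the original groups, which a fixed `n` per group would not.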
Sampling Workflow
graph TD
A[Raw Data] --> B{Sampling Method}
B --> |Simple Random| C[random.sample]
B --> |Weighted| D[random.choices]
B --> |Scientific| E[numpy.random]
B --> |DataFrame| F[DataFrame.sample]
Advanced Sampling Scenarios
Reservoir Sampling
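Reservoir sampling draws a uniform random sample of k items in a single pass, without knowing the length of the input in advance, which makes it efficient for large or streaming datasets: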
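import random

def reservoir_sampling(iterator, k):
    ## Keep a uniform random sample of k items from a stream of unknown length
    reservoir = []
    for i, item in enumerate(iterator):
        if len(reservoir) < k:
            ## Fill the reservoir with the first k items
            reservoir.append(item)
        else:
            ## Keep the new item with probability k / (i + 1)
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir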
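To show the function at work on a stream, the sketch below feeds it a generator that is never held in memory as a whole. It uses a variant with a hypothetical `rng` parameter, an assumption added here so that a seeded `random.Random` instance can make the result repeatable; the name `reservoir_sample`, the stream contents, and the seed are all illustrative.

```python
import random

def reservoir_sample(iterable, k, rng=random):
    """Return k items chosen uniformly from an iterable of unknown length."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir

# A generator stands in for a stream; only k items are stored at any time.
stream = (n * n for n in range(100_000))
sample = reservoir_sample(stream, 5, rng=random.Random(7))
print(sample)

# If the stream holds fewer than k items, all of them are returned.
short = reservoir_sample(range(3), 5)
print(short)
```

Memory use stays at k items regardless of how long the stream is, which is the main reason to prefer this approach over collecting the data and calling `random.sample`.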
Best Practices
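- Set a random seed, or pass a seeded generator, so results are reproducible
- Choose the method that matches the task: with or without replacement, weighted, stratified, or streaming
- Consider computational complexity: reservoir sampling, for example, makes one pass over the data and keeps only k items in memory
- Validate that the sample is representative, for example by comparing its summary statistics with those of the full dataset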
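A simple representativeness check compares summary statistics of the sample with those of the full dataset. The sketch below uses only the standard library; the population (10,000 normally distributed values), the sample size of 500, and the seed are hypothetical choices made for illustration.

```python
import random
import statistics

rng = random.Random(1)  # arbitrary seed

# Hypothetical population: 10,000 values from a normal distribution.
population = [rng.gauss(50, 10) for _ in range(10_000)]
sample = rng.sample(population, 500)

pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
pop_sd = statistics.stdev(population)
sample_sd = statistics.stdev(sample)

print(f"mean:  population {pop_mean:.2f}, sample {sample_mean:.2f}")
print(f"stdev: population {pop_sd:.2f}, sample {sample_sd:.2f}")
```

Close agreement in mean and standard deviation is a quick sanity check, not proof of representativeness; for skewed or categorical data, comparing quantiles or category shares is likely to be more informative.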
LabEx recommends experimenting with different sampling techniques to understand their nuances.