Working with Text Data
In this lab, we will explore how to work with text data using scikit-learn, a popular machine learning library in Python. We will learn how to load text data, preprocess it, extract features, train a model, and evaluate its performance.
Machine Learning, scikit-learn
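For the "Working with Text Data" lab, the feature-extraction step can be illustrated with a bag-of-words count matrix. The following is a minimal, dependency-free sketch and not the lab's own code; the helper names `tokenize` and `bag_of_words` and the two toy documents are assumptions made for illustration. In scikit-learn this role is typically played by a vectorizer class.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, count_matrix) for a list of raw strings.

    Each row of count_matrix holds the token counts of one document,
    with columns ordered like the sorted vocabulary.
    """
    tokenized = [tokenize(doc) for doc in documents]
    vocabulary = sorted({tok for doc in tokenized for tok in doc})
    column = {tok: i for i, tok in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized:
        row = [0] * len(vocabulary)
        for tok, n in Counter(doc).items():
            row[column[tok]] = n
        matrix.append(row)
    return vocabulary, matrix


# Hypothetical toy corpus, used only to show the shape of the output.
docs = ["The cat sat on the mat.", "The dog chased the cat!"]
vocab, counts = bag_of_words(docs)
```

Each document becomes a fixed-length vector of counts, which is the form a classifier needs for training and evaluation.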
Wikipedia PageRank with Randomized SVD
In this lab, we will analyze the graph of links between Wikipedia articles to rank the articles by relative importance according to eigenvector centrality. The traditional way to compute the principal eigenvector is the power iteration method. Here we will instead use Martinsson's Randomized SVD algorithm as implemented in scikit-learn.
Machine Learning, scikit-learn
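The power iteration method mentioned in the "Wikipedia PageRank with Randomized SVD" lab can be sketched on a tiny link graph. This is a plain-Python illustration of the traditional approach, not the randomized SVD used in the lab. The function name `power_iteration_rank`, the damping factor of 0.85, and the four-node graph are assumptions chosen for the example.

```python
def power_iteration_rank(links, damping=0.85, tol=1e-10, max_iter=1000):
    """Rank nodes of a directed graph by repeated score propagation.

    links maps each node to the list of nodes it links to.
    Returns a dict mapping node -> score, with scores summing to 1.
    """
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Every node receives a small uniform share (the "teleport" term).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A node without outgoing links spreads its score evenly.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical link graph: C is linked to by A, B and D.
toy_links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = power_iteration_rank(toy_links)
```

On a graph the size of Wikipedia, each iteration is a sparse matrix-vector product, which is the cost the lab's randomized SVD approach offers an alternative to.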
Hierarchical Clustering with Connectivity Constraints
This lab demonstrates how to perform hierarchical clustering with connectivity constraints using the Scikit-learn library in Python. In hierarchical clustering, clusters are formed by recursively merging or splitting them based on the distance between them. Connectivity constraints can be used to restrict the formation of clusters based on the connectivity between data points, which can result in more meaningful clusters.
Machine Learning, scikit-learn
Class Probabilities with VotingClassifier
In this lab, we will learn how to plot the class probabilities calculated by the VotingClassifier in Scikit-Learn. We will use three different classifiers (LogisticRegression, GaussianNB, and RandomForestClassifier) and average their predicted probabilities using the VotingClassifier. We will then visualize the probability weighting by fitting each classifier on the training set and plotting the predicted class probabilities for the first sample in the dataset.
Machine Learning, scikit-learn
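The probability averaging described in the "Class Probabilities with VotingClassifier" lab reduces to a weighted mean of the per-classifier probability vectors. The sketch below shows only that arithmetic in plain Python. The function `soft_vote`, the probability values, and the weights are hypothetical and are not produced by fitted models.

```python
def soft_vote(probabilities, weights=None):
    """Weighted average of per-classifier class-probability vectors.

    probabilities: one list of class probabilities per classifier,
    all for the same sample. weights: optional per-classifier weights.
    """
    if weights is None:
        weights = [1.0] * len(probabilities)
    total = sum(weights)
    n_classes = len(probabilities[0])
    return [
        sum(w * p[k] for w, p in zip(weights, probabilities)) / total
        for k in range(n_classes)
    ]


# Hypothetical predicted probabilities for one sample (two classes).
per_classifier = [
    [0.8, 0.2],  # first classifier
    [0.6, 0.4],  # second classifier
    [0.3, 0.7],  # third classifier
]

averaged = soft_vote(per_classifier)
weighted = soft_vote(per_classifier, weights=[1, 1, 5])

# The predicted class is the one with the highest averaged probability.
predicted_equal = max(range(len(averaged)), key=averaged.__getitem__)
predicted_weighted = max(range(len(weighted)), key=weighted.__getitem__)
```

With equal weights the first class wins. Giving the third classifier a weight of 5 flips the prediction, which is the effect a probability-weighting plot makes visible.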
Topic Extraction with NMF and LDA
In this lab, we will apply Non-negative Matrix Factorization (NMF) and Latent Dirichlet Allocation (LDA) on a corpus of documents to extract additive models of the topic structure of the corpus. The output will be a plot of topics, each represented as a bar plot using the top few words based on weights.
Machine Learning, scikit-learn
Multiclass ROC Evaluation with Scikit-Learn
This lab demonstrates how to use the Receiver Operating Characteristic (ROC) metric to evaluate the quality of multiclass classifiers with the Scikit-learn library.
Machine Learning, scikit-learn
Sparse Coding with Precomputed Dictionary
In this lab, we will learn how to represent a signal as a sparse combination of Ricker wavelets using sparse coding methods. The Ricker wavelet (also known as the Mexican hat, or the second derivative of a Gaussian) is not a particularly good kernel for representing piecewise constant signals such as the one used in this lab. The example therefore shows how much adding atoms of different widths matters, which motivates learning the dictionary to best fit your type of signal.
Machine Learning, scikit-learn
Recursive Feature Elimination with Cross-Validation
In this lab, we will go through a step-by-step process of implementing Recursive Feature Elimination with Cross-Validation (RFECV) using scikit-learn. RFECV is used for feature selection, which is the process of selecting a subset of relevant features for use in model construction. We will use a classification task with 15 features, out of which 3 are informative, 2 are redundant, and 10 are non-informative.
Machine Learning, scikit-learn
ROC with Cross Validation
In this lab, we will learn how to estimate and visualize the variance of the Receiver Operating Characteristic (ROC) metric using cross-validation in Python. ROC curves are used in binary classification to measure the performance of a model by plotting the true positive rate (TPR) against the false positive rate (FPR). We will use the Scikit-learn library to load the iris dataset, create noisy features, and classify the dataset with Support Vector Machine (SVM). We will then plot the ROC curves with cross-validation and calculate the mean Area Under the Curve (AUC) to see the variability of the classifier output when the training set is split into different subsets.
Machine Learning, scikit-learn
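The ROC curve and AUC used in the "ROC with Cross Validation" lab can be computed by hand for a small example. The sketch below sweeps a decision threshold over the scores, records (FPR, TPR) points, and integrates with the trapezoidal rule. The function names and the four-sample data are assumptions for illustration, and the code assumes both classes are present in `y_true`.

```python
def roc_curve_points(y_true, scores):
    """Return (fpr, tpr) lists obtained by lowering the threshold step by step.

    y_true holds 0/1 labels; scores holds the classifier's confidence
    that each sample is positive.
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Visit samples from the highest score to the lowest.
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for pos, i in enumerate(order):
        if y_true[i] == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after all samples tied at this score are counted.
        is_last = pos == len(order) - 1
        if is_last or scores[order[pos + 1]] != scores[i]:
            fpr.append(fp / negatives)
            tpr.append(tp / positives)
    return fpr, tpr


def auc(fpr, tpr):
    """Area under the curve via the trapezoidal rule."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2.0
        for k in range(1, len(fpr))
    )


# Hypothetical labels and scores for four samples.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_curve_points(y_true, scores)
area = auc(fpr, tpr)
```

Repeating this computation on each cross-validation fold and comparing the resulting curves gives the variance estimate the lab visualizes.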
Polynomial Kernel Approximation with Scikit-Learn
This lab demonstrates how to use polynomial kernel approximation in scikit-learn to efficiently generate approximations of the polynomial kernel feature space. These approximations are used to train linear classifiers that approach the accuracy of kernelized ones. We will use the Covtype dataset, which contains 581,012 samples with 54 features each, distributed among 7 classes. The goal of this dataset is to predict forest cover type from cartographic variables only (no remotely sensed data). After loading, we transform it into a binary classification problem to match the version of the dataset on the LIBSVM webpage, which was the one used in the original paper.
Machine Learning, scikit-learn
Digit Classification with RBM Features
This lab focuses on using a Bernoulli Restricted Boltzmann Machine (RBM) to classify handwritten digits. The RBM feature extractor is combined with a logistic regression classifier to predict the digits. The dataset consists of greyscale images in which pixel values can be interpreted as degrees of blackness on a white background.
Machine Learning, scikit-learn
Sparse Signal Recovery with Orthogonal Matching Pursuit
Orthogonal Matching Pursuit (OMP) is a method for recovering a sparse signal from a noisy measurement encoded with a dictionary. In this lab, we will use scikit-learn to implement OMP to recover a sparse signal from a noisy measurement.
Machine Learning, scikit-learn
Joint Feature Selection with Multi-Task Lasso
In this lab, we will explore how to perform joint feature selection using the multi-task Lasso algorithm. We will use scikit-learn, a popular Python machine learning library, to generate some sample data and fit models to it. We will then plot the results of the models to see how they compare.
Machine Learning, scikit-learn
Gradient Boosting Monotonic Constraints
This is a step-by-step tutorial to demonstrate the effect of monotonic constraints on a gradient boosting estimator. Gradient boosting is a popular machine learning technique used for regression and classification tasks. In this tutorial, we will build an artificial dataset and use a gradient boosting estimator to demonstrate the effect of monotonic constraints on the model's predictions.
Machine Learning, scikit-learn
Clustering Analysis with Silhouette Method
Clustering is a popular unsupervised learning technique that involves grouping similar data points together based on their features. The Silhouette Method is a commonly used technique to determine the optimal number of clusters in a dataset. In this lab, we will use the Silhouette Method to determine the optimal number of clusters using the KMeans algorithm.
Machine Learning, scikit-learn
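The score behind the "Clustering Analysis with Silhouette Method" lab can be written out directly. For each point, `a` is the mean distance to the other members of its own cluster and `b` is the smallest mean distance to any other cluster; the silhouette coefficient is `(b - a) / max(a, b)`. The sketch below is a plain-Python illustration that compares two hand-written labelings rather than running KMeans. The function name, points, and labelings are assumptions, and the code expects at least two clusters.

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)

    total = 0.0
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            continue  # a singleton cluster contributes a coefficient of 0
        # a: mean distance to the other points of the same cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical data: two tight groups that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # groups match the geometry
bad = silhouette_score(points, [0, 1, 0, 1])   # groups cut across it
```

Scores near 1 indicate compact, well-separated clusters, and negative scores indicate points that sit closer to another cluster. Choosing the number of clusters then amounts to computing this score for each candidate clustering and keeping the highest.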
Outlier Detection with LOF
The Local Outlier Factor (LOF) algorithm is an unsupervised machine learning method that is used to detect anomalies in data. It computes the local density deviation of a given data point with respect to its neighbors and considers as outliers the samples that have a substantially lower density than their neighbors.
Machine Learning, scikit-learn
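The density comparison described in the "Outlier Detection with LOF" lab can be sketched in a few lines of plain Python. This is a simplified illustration, not scikit-learn's implementation: it takes exactly `k` neighbours per point even when distances are tied, and it assumes no duplicate points, which would cause a division by zero. The function name, `k=2`, and the five sample points are assumptions for the example.

```python
import math


def local_outlier_factor(points, k=2):
    """Return one LOF score per point; values well above 1 suggest outliers."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    neighbors = []   # indices of the k nearest neighbours of each point
    k_distance = []  # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_distance.append(dist[i][order[k - 1]])

    # Local reachability density: inverse of the mean reachability distance,
    # where reach(i, j) = max(k_distance[j], dist[i][j]).
    lrd = []
    for i in range(n):
        reach = [max(k_distance[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean density of the neighbours divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: four corners of a unit square plus one distant point.
sample = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
lof = local_outlier_factor(sample, k=2)
```

The four corner points have the same density as their neighbours and score 1, while the distant point has a much lower density than its neighbours and scores far above 1.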
Hierarchical Clustering with Scikit-Learn
In this lab, we will be using Python's scikit-learn library to perform hierarchical clustering on a few toy datasets. Hierarchical clustering is a method of clustering where you build a hierarchy of clusters, either in a top-down or bottom-up fashion. The goal of hierarchical clustering is to find clusters of points that are similar to each other, and dissimilar to points in other clusters.
Machine Learning, scikit-learn
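The bottom-up variant described in the "Hierarchical Clustering with Scikit-Learn" lab starts with every point in its own cluster and repeatedly merges the two closest clusters. The sketch below uses single linkage (the distance between two clusters is the distance between their closest members) purely as an assumed choice for illustration. The function name and toy points are hypothetical, and the quadratic search is meant for readability, not speed.

```python
import math


def agglomerative(points, n_clusters):
    """Merge clusters bottom-up until n_clusters remain.

    Returns a list of clusters, each a list of point indices.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        _, x, y = best
        # Merge cluster y into cluster x and drop y.
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Hypothetical toy data: two pairs of nearby points and one isolated point.
toy_points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(toy_points, n_clusters=3)
```

Recording the order and distance of each merge, instead of stopping at a fixed count, yields the full hierarchy that a dendrogram displays.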
Sparse Signal Regression with L1-Based Models
In this lab, we will demonstrate how to use L1-based regression models to deal with high-dimensional and sparse signals. In particular, we will compare three popular models: Lasso, Automatic Relevance Determination (ARD), and ElasticNet. We will use a synthetic dataset to compare these models in terms of fitting time, R² score, and sparsity of the estimated coefficients.
Machine Learning, scikit-learn