Exploring Scikit-Learn Datasets and Estimators

This tutorial is from the open-source community.

Introduction

In this lab, we will explore two core ideas in scikit-learn, a popular machine learning library for Python: how datasets are laid out, and the estimator object. We will learn that datasets are represented as 2D arrays and how to reshape data into that form for scikit-learn. We will also look at estimator objects, which learn from data and, for many algorithms, make predictions.

Understanding Datasets

Scikit-learn represents datasets as 2D arrays, where the first axis represents the samples and the second axis represents the features. Let's take a look at an example using the iris dataset:

from sklearn import datasets

iris = datasets.load_iris()
data = iris.data
print(data.shape)

Output:

(150, 4)

The iris dataset consists of 150 observations of irises, with each observation described by 4 features. The shape of the data array is (150, 4).
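
To make the (n_samples, n_features) layout more concrete, the following sketch inspects a few other attributes of the object returned by load_iris. The target array is not discussed in this lab; it is shown here only to illustrate that there is one label per row of data.

```python
from sklearn import datasets

iris = datasets.load_iris()

# Feature matrix: one row per flower, one column per measurement
print(iris.data.shape)
print(iris.feature_names)

# Target vector: one class label per sample (row) in iris.data
print(iris.target.shape)
print(iris.target_names)
```

The first axis of iris.data and the only axis of iris.target have the same length, because each sample has exactly one label.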

Reshaping Data

Sometimes the data is not initially in the shape that scikit-learn requires. In such cases, we need to preprocess it into the (n_samples, n_features) shape. One example is the digits dataset, which consists of 1797 8x8 images of handwritten digits:

digits = datasets.load_digits()
print(digits.images.shape)

Output:

(1797, 8, 8)

To use this dataset with scikit-learn, we need to reshape each 8x8 image into a feature vector of length 64:

data = digits.images.reshape((digits.images.shape[0], -1))
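
As a quick sanity check, the sketch below repeats the reshape and confirms that no pixel values are lost: a flattened row can be reshaped back into the original 8x8 image. The variable names n_samples and restored are illustrative choices, not part of the scikit-learn API.

```python
import numpy as np
from sklearn import datasets

digits = datasets.load_digits()

# Flatten each 8x8 image into a row of 64 features
n_samples = digits.images.shape[0]
data = digits.images.reshape((n_samples, -1))
print(data.shape)

# Reshaping a row back to 8x8 recovers the original image
restored = data[0].reshape(8, 8)
print(np.array_equal(restored, digits.images[0]))
```

Passing -1 as the second dimension lets NumPy infer the number of features (here 8 * 8 = 64) from the size of the array.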

Estimator Objects

Estimator objects in scikit-learn are used to learn from data and make predictions. They can be classification, regression, or clustering algorithms, or transformers that extract useful features from raw data. Let's create a simple example of an estimator object:

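from sklearn.base import BaseEstimator

class Estimator(BaseEstimator):
    def __init__(self, param1=0, param2=0):
        # Store each parameter unchanged under the same name
        self.param1 = param1
        self.param2 = param2

    def fit(self, data):
        # Placeholder: a real estimator would learn from data here
        # By convention, fit returns the estimator itself
        return self

estimator = Estimator()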
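
The skeleton above does not learn anything yet. As a minimal sketch of what a working estimator might look like, the hypothetical ColumnMeanEstimator below learns the mean of each feature. The class name, the scale parameter, and the mean_ attribute are invented for illustration; only BaseEstimator and its get_params method come from scikit-learn.

```python
import numpy as np
from sklearn.base import BaseEstimator


class ColumnMeanEstimator(BaseEstimator):
    """Hypothetical estimator that learns the mean of each feature."""

    def __init__(self, scale=1.0):
        # Constructor arguments are stored unchanged under the same name
        self.scale = scale

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        # Quantities learned from data get a trailing underscore
        self.mean_ = self.scale * X.mean(axis=0)
        return self


X = np.array([[1.0, 2.0], [3.0, 4.0]])
est = ColumnMeanEstimator(scale=1.0).fit(X)

print(est.get_params())  # provided by BaseEstimator
print(est.mean_)
```

Inheriting from BaseEstimator is what supplies get_params here; it works by inspecting the arguments of __init__, which is why each argument is stored under its own name.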

Fitting Data

The core API that every scikit-learn estimator exposes is the fit method. It takes a dataset (usually a 2D array) as input; supervised estimators additionally take the target values. To fit an estimator to data, call its fit method:

estimator.fit(data)
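
For comparison, here is a sketch of the same call on one of scikit-learn's built-in estimators. KNeighborsClassifier and the choice of n_neighbors=3 are assumptions made for this illustration; the lab itself does not prescribe a particular algorithm.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target

knn = KNeighborsClassifier(n_neighbors=3)

# A supervised estimator is fitted on the features and the targets
fitted = knn.fit(X, y)
print(fitted is knn)  # fit returns the estimator itself

# Once fitted, the estimator can predict labels for samples
predictions = knn.predict(X[:5])
print(predictions)
```

Because fit returns the estimator, calls can be chained, as in KNeighborsClassifier().fit(X, y).predict(X).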

Estimator Parameters

Estimator objects can have parameters that affect their behavior. These parameters can be set when the estimator is instantiated or by modifying the corresponding attribute. Let's set some parameters for our example estimator:

estimator = Estimator(param1=1, param2=2)
print(estimator.param1)

Output:

1
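
Built-in estimators behave the same way. The sketch below uses SVC purely as an example and also shows set_params and get_params, which scikit-learn estimators provide in addition to plain attribute access; the specific values of C and kernel are arbitrary.

```python
from sklearn.svm import SVC

# Set parameters at instantiation
svc = SVC(C=1.0, kernel="rbf")
print(svc.C)

# Change a parameter by modifying the attribute
svc.C = 5.0

# Or change several parameters at once with set_params
svc.set_params(C=10.0, kernel="linear")
params = svc.get_params()
print(params["C"], params["kernel"])
```

None of these calls trains the model; parameters set this way only take effect the next time fit is called.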

Estimated Parameters

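When an estimator is fitted, it estimates parameters from the data. By convention, these estimated parameters are stored as attributes of the estimator object whose names end with an underscore. In the line below, estimated_param_ is a placeholder name: the skeleton Estimator defined earlier does not set it, so substitute an attribute that your fitted estimator actually exposes. For example: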

print(estimator.estimated_param_)
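
As a concrete illustration with a built-in estimator, the sketch below fits LinearRegression to a small synthetic dataset generated from y = 2x + 1. The data is made up for this example; coef_ and intercept_ are the estimated parameters that LinearRegression exposes after fitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2x + 1 exactly
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2.0 * X.ravel() + 1.0

reg = LinearRegression()

# Estimated parameters do not exist until fit has been called
print(hasattr(reg, "coef_"))

reg.fit(X, y)

# After fitting, the slope and intercept are available
print(reg.coef_, reg.intercept_)
```

The fitted slope and intercept should be close to 2 and 1, up to floating-point rounding.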

Summary

In this lab, we learned how scikit-learn represents datasets, how to reshape data into the (n_samples, n_features) form, and what estimator objects are. We explored fitting an estimator to data, setting its parameters, and accessing its estimated parameters. These concepts underpin nearly every statistical learning task in scikit-learn.
