culebra.tools.Dataset class

class Dataset(*files: tuple[str | PathLike[str] | TextIO], output_index: int | None = None, sep: str = '\\s+')

Bases: Base

Create a dataset.

A dataset can be stored in a single file or split across two files. When a single file is used, output_index must indicate which column stores the output values. If output_index is omitted, the dataset is assumed to be composed of two consecutive files, the first containing the input columns and the second containing the output column. In this case, only the first column of the second file is loaded (just one output value per sample).

If no files are provided, an empty dataset is returned.

Parameters:
  • files (tuple[str | PathLike[str] | TextIO]) – Files containing the dataset. If output_index is omitted, two files are necessary, the first containing the input columns and the second containing the output column. Otherwise, a single file holds the whole dataset (input and output columns)

  • output_index (int) – If the dataset is provided in a single file, this parameter indicates which column in the file contains the output values. Otherwise this parameter must be omitted (set to None) to express that inputs and outputs are stored in two different files. Its default value is None

  • sep (str) – Column separator used within the files. Defaults to DEFAULT_SEP

Raises:
  • TypeError – If output_index is neither None nor an integer

  • TypeError – If sep is not a string

  • IndexError – If output_index is out of range

  • RuntimeError – If output_index is None and only one file is provided

  • RuntimeError – When loading a dataset composed of two files, if the file containing the input columns and the file containing the output column do not have the same number of rows.

  • RuntimeError – If any file is empty

Returns:

The dataset

Return type:

Dataset
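The single-file loading mode described above amounts to splitting each row into input and output columns by index. The following is a minimal sketch in plain Python, not culebra's implementation; the sample rows and the output_index value are hypothetical:

```python
# Sketch of how a one-file dataset is split into inputs and outputs
# by column index, mirroring the output_index parameter.
rows = [
    "5.1 3.5 1.4 0",
    "4.9 3.0 1.3 1",
]
output_index = 3  # hypothetical column holding the output values

inputs, outputs = [], []
for line in rows:
    cols = line.split()  # sep='\s+' splits on any run of whitespace
    outputs.append(float(cols[output_index]))
    inputs.append(
        [float(c) for i, c in enumerate(cols) if i != output_index]
    )
```

With two files, the same column extraction is applied, but the inputs and the output come from separate files with matching row counts.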

Class methods

classmethod Dataset.load(filename: str) → Base

Load a serialized object from a file.

Parameters:

filename (str) – The file name.

Returns:

The loaded object
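load and dump (below) are serialization counterparts: dump writes an object to a file and load restores it. The round trip can be pictured with Python's pickle module; whether culebra serializes with pickle internally is an assumption made only for this illustration:

```python
import os
import pickle
import tempfile

# Round-trip sketch analogous to Dataset.dump(filename) / Dataset.load(filename).
obj = {"inputs": [[0.1, 0.2]], "outputs": [1]}
path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")

with open(path, "wb") as f:
    pickle.dump(obj, f)      # cf. Dataset.dump(filename)

with open(path, "rb") as f:
    loaded = pickle.load(f)  # cf. Dataset.load(filename)
```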

classmethod Dataset.load_from_uci(name: str | None = None, id_number: int | None = None) → Dataset

Load the dataset from the UCI ML repository.

The dataset can be identified by either its id_number or its name, but only one of these should be provided.

If the dataset has more than one output column, only the first column is considered.

Parameters:
  • name (str) – Dataset name, or a substring of it, optional

  • id_number (int) – Dataset ID for UCI ML Repository, optional

Raises:

RuntimeError – If the dataset cannot be loaded

Returns:

The dataset

Return type:

Dataset

Properties

property Dataset.inputs: ndarray

Input data of the dataset.

Return type:

ndarray

property Dataset.num_feats: int

Number of features in the dataset.

Return type:

int

property Dataset.outputs: ndarray

Output data of the dataset.

Return type:

ndarray

property Dataset.size: int

Number of samples in the dataset.

Return type:

int
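These four properties are mutually consistent: inputs is a two-dimensional array of shape (size, num_feats), and outputs holds one value per sample. With NumPy arrays standing in for a loaded dataset (the values are hypothetical):

```python
import numpy as np

# Stand-in arrays for Dataset.inputs and Dataset.outputs
inputs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
outputs = np.array([0, 1, 0])

num_feats = inputs.shape[1]  # number of input columns
size = inputs.shape[0]       # number of samples, == len(outputs)
```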

Methods

Dataset.append_random_features(num_feats: int, random_seed: int | None = None) → Dataset

Return a new dataset with some random features appended.

Parameters:
  • num_feats (int) – Number of random features to be appended (greater than 0)

  • random_seed (int) – Random seed for the random generator, defaults to None

Raises:
  • TypeError – If the number of random features is not an integer

  • ValueError – If the number of random features is not greater than 0

Returns:

The new dataset

Return type:

Dataset
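Appending random features amounts to stacking extra randomly generated columns onto the input matrix. A NumPy sketch of the idea, not culebra's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)  # plays the role of random_seed
inputs = np.zeros((3, 2))        # 3 samples, 2 original features
extra = rng.random((3, 4))       # 4 random features in [0, 1)
augmented = np.hstack((inputs, extra))
```

This is useful, for instance, to test whether a feature selector discards the irrelevant appended columns.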

Dataset.drop_missing() → Dataset

Drop samples with missing values.

Returns:

A clean dataset

Return type:

Dataset
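Dropping samples with missing values corresponds to filtering out every row that contains a NaN, roughly as follows (an illustration, not culebra's implementation):

```python
import numpy as np

inputs = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
keep = ~np.isnan(inputs).any(axis=1)  # rows without missing values
clean = inputs[keep]                  # the NaN row is removed
```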

Dataset.dump(filename: str) → None

Serialize this object and save it to a file.

Parameters:

filename (str) – The file name.

Dataset.normalize() → Dataset

Normalize the dataset between 0 and 1.

Returns:

A normalized dataset

Return type:

Dataset
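Normalizing to the [0, 1] range corresponds to a per-column min-max transform, roughly as sketched below (an assumption about the method; culebra's exact procedure may differ):

```python
import numpy as np

inputs = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
mins = inputs.min(axis=0)
maxs = inputs.max(axis=0)
normalized = (inputs - mins) / (maxs - mins)  # each column now spans [0, 1]
```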

Dataset.oversample(n_neighbors: int = 5, random_seed: int | None = None) → Dataset

Oversample all classes but the majority class.

All classes but the majority class are oversampled to match the number of samples of the majority class. SMOTE is used for oversampling, but if any class has fewer than n_neighbors samples, RandomOverSampler is applied first.

Parameters:
  • n_neighbors (int) – Number of nearest neighbors used by SMOTE, defaults to 5

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

An oversampled dataset

Return type:

Dataset
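The balancing target can be illustrated with plain random duplication, which is what RandomOverSampler does (SMOTE instead synthesizes new samples by interpolating between neighbors; this sketch covers only the simpler duplication step):

```python
import numpy as np

rng = np.random.default_rng(0)
outputs = np.array([0, 0, 0, 0, 1])    # class 1 is the minority
counts = np.bincount(outputs)
deficit = counts.max() - counts[1]     # samples class 1 is short of
minority = np.flatnonzero(outputs == 1)
extra = rng.choice(minority, size=deficit, replace=True)
balanced = np.concatenate((outputs, outputs[extra]))  # both classes equal now
```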

Dataset.remove_outliers(prop: float = 0.05, random_seed: int | None = None) → Dataset

Remove the outliers.

Parameters:
  • prop (float) – Proportion of samples to be considered outliers, defaults to 0.05

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

A clean dataset

Return type:

Dataset
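One simple way to picture prop-based outlier removal is to discard the given proportion of samples that lie farthest from the center of the data. The sketch below uses distance from the median; culebra's actual outlier-detection method may well differ, so treat this only as intuition for the prop parameter:

```python
import numpy as np

values = np.array([1.0, 1.1, 0.9, 1.05, 50.0])  # 50.0 is an obvious outlier
prop = 0.2                                      # proportion to discard
k = int(np.ceil(prop * len(values)))
dist = np.abs(values - np.median(values))       # distance from the median
keep = np.sort(np.argsort(dist)[: len(values) - k])
clean = values[keep]                            # the outlier is gone
```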

Dataset.scale() → Dataset

Scale the features using statistics that are robust to outliers.

Returns:

A scaled dataset

Return type:

Dataset
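Robust scaling typically centers each feature on its median and divides by its interquartile range, so extreme values do not dominate the scaling statistics. A NumPy sketch of this standard technique (an assumption about the method, not culebra's exact implementation):

```python
import numpy as np

inputs = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100.0 is an outlier
median = np.median(inputs, axis=0)
q1, q3 = np.percentile(inputs, [25, 75], axis=0)
scaled = (inputs - median) / (q3 - q1)  # centered on 0, IQR-scaled
```

Unlike min-max normalization, the outlier shifts the median and IQR only slightly, so the bulk of the data keeps a sensible scale.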

Dataset.select_features(feats: Sequence[int]) → Dataset

Return a new dataset containing only the selected features.

Parameters:

feats (Sequence[int]) – Indices of the selected features

Returns:

The new dataset

Return type:

Dataset
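Selecting features is column indexing on the input matrix; the output column is unaffected. A NumPy sketch with hypothetical data:

```python
import numpy as np

inputs = np.arange(12).reshape(3, 4)  # 3 samples, 4 features
feats = [0, 2]                        # indices of the selected features
selected = inputs[:, feats]           # keep only columns 0 and 2
```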

Dataset.split(test_prop: float, random_seed: int | None = None) → tuple[Dataset, Dataset]

Split the dataset.

Parameters:
  • test_prop (float) – Proportion of the dataset used as test data. The remaining samples will be returned as training data

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

The training and test datasets

Return type:

tuple[Dataset, Dataset]
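The split can be pictured as a seeded shuffle of the sample indices followed by a cut at test_prop. This is a sketch of the idea, not culebra's implementation (which may, for example, stratify by class):

```python
import numpy as np

rng = np.random.default_rng(0)  # plays the role of random_seed
size = 10                       # hypothetical number of samples
test_prop = 0.3

perm = rng.permutation(size)    # shuffled sample indices
n_test = round(test_prop * size)
test_idx, train_idx = perm[:n_test], perm[n_test:]
```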