culebra.tools.Dataset class

class Dataset(*files: str | PathLike[str] | TextIO, output_index: int | None = None, sep: str = '\\s+')

Create a dataset.

A dataset can be stored in either one file or two files. If a single file is used, output_index must indicate which column stores the output values. If output_index is None (its default value), the dataset is assumed to be composed of two consecutive files, the first containing the input columns and the second storing the output column. In this case only the first column of the second file is loaded (just one output value per sample).

If no files are provided, an empty dataset is returned.

Parameters:
  • files (Sequence of path-like objects, urls or file-like objects, optional) – Files containing the dataset. If output_index is None, two files are necessary, the first one containing the input columns and the second one containing the output column. Otherwise, only one file is used to access the whole dataset (input and output columns).

  • output_index (int, optional) – If the dataset is provided in a single file, this parameter indicates which column in the file contains the output values. Otherwise it must be None, to express that inputs and outputs are stored in two different files. Its default value is None.

  • sep (str, optional) – Column separator used within the files. Defaults to DEFAULT_SEP

Raises:
  • TypeError – If output_index is neither None nor an int

  • TypeError – If sep is not a string

  • IndexError – If output_index is out of range

  • RuntimeError – If output_index is None and only one file is provided

  • RuntimeError – When loading a dataset composed of two files, if the file containing the input columns and the file containing the output column do not have the same number of rows.

  • RuntimeError – If any file is empty

Returns:

The dataset

Return type:

Dataset
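
The two file layouts described for the constructor can be illustrated without culebra. The sketch below is plain Python, not the library's implementation: read_rows, split_columns and pair_files are hypothetical helpers and the sample text is made up. It shows how output_index separates the output column from the inputs in a single file, and how two files are paired row by row when output_index is None.

```python
import io
import re


def read_rows(stream, sep=r"\s+"):
    """Split each non-blank line of a text stream into float columns."""
    return [
        [float(tok) for tok in re.split(sep, line.strip())]
        for line in stream
        if line.strip()
    ]


def split_columns(rows, output_index):
    """One-file layout: pull the output column out of every row.

    Assumes a non-negative output_index (hypothetical simplification).
    """
    inputs = [row[:output_index] + row[output_index + 1:] for row in rows]
    outputs = [row[output_index] for row in rows]
    return inputs, outputs


def pair_files(input_rows, output_rows):
    """Two-file layout: keep only the first column of the second file."""
    if len(input_rows) != len(output_rows):
        raise RuntimeError("Input and output files differ in row count")
    return input_rows, [row[0] for row in output_rows]


# One file, with the output stored in column 2
single = io.StringIO("1.0 2.0 0\n3.0 4.0 1\n")
inputs, outputs = split_columns(read_rows(single), output_index=2)

# Two files: inputs first, outputs second (extra output columns are ignored)
in_file = io.StringIO("1.0 2.0\n3.0 4.0\n")
out_file = io.StringIO("0 9\n1 9\n")
inputs2, outputs2 = pair_files(read_rows(in_file), read_rows(out_file))
```

Both layouts yield the same inputs and outputs here, which is the equivalence the constructor description implies.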

Class methods

classmethod Dataset.load_pickle(filename: str) Base

Load a pickled object from a file.

Parameters:

filename (str) – The file name.

Raises:

classmethod Dataset.load_from_uci(name: str | None = None, id: int | None = None) Dataset

Load the dataset from the UCI ML repository.

The dataset can be identified by either its id or its name, but only one of these should be provided.

If the dataset has more than one output column, only the first column is considered.

Parameters:
  • name (str) – Dataset name, or substring of name

  • id (int) – Dataset ID for UCI ML Repository

Raises:

RuntimeError – If the dataset cannot be loaded

Returns:

The dataset

Return type:

Dataset

Properties

property Dataset.num_feats: int

Get the number of features in the dataset.

Type:

int

property Dataset.size: int

Get the number of samples in the dataset.

Type:

int

property Dataset.inputs: ndarray

Get the input data of the dataset.

Type:

numpy.ndarray

property Dataset.outputs: ndarray

Get the output data of the dataset.

Type:

numpy.ndarray

Methods

Dataset.save_pickle(filename: str) None

Pickle this object and save it to a file.

Parameters:

filename (str) – The file name.

Raises:

Dataset.normalize() Dataset

Normalize the dataset between 0 and 1.

Returns:

A normalized dataset

Return type:

Dataset
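
As an illustration of normalizing between 0 and 1, here is a minimal plain-Python sketch. It assumes per-feature min-max rescaling, which this page does not confirm as the method culebra uses; min_max_normalize is a hypothetical helper and the data is made up.

```python
def min_max_normalize(rows):
    """Rescale every column of a row-major table to the range [0, 1].

    Assumption: per-column min-max scaling; constant columns map to 0.0.
    """
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        [
            (value - low) / span if span else 0.0
            for value, low, span in zip(row, lows, spans)
        ]
        for row in rows
    ]


data = [[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]]
normalized = min_max_normalize(data)
```

After this transformation the smallest value of each feature is 0 and the largest is 1.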

Dataset.scale() Dataset

Scale the features in a way that is robust to outliers.

Returns:

A scaled dataset

Return type:

Dataset
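
One common way to scale features robustly is to centre each feature on its median and divide by its interquartile range. The sketch below assumes that approach; this page does not state which statistics culebra uses, and robust_scale is a hypothetical helper written with the standard library only.

```python
import statistics


def robust_scale(rows):
    """Centre each column on its median and divide by its interquartile range.

    Assumption: median/IQR scaling; columns with zero IQR map to 0.0.
    """
    columns = list(zip(*rows))
    medians = [statistics.median(col) for col in columns]
    iqrs = []
    for col in columns:
        q1, _, q3 = statistics.quantiles(col, n=4, method="inclusive")
        iqrs.append(q3 - q1)
    return [
        [
            (value - med) / iqr if iqr else 0.0
            for value, med, iqr in zip(row, medians, iqrs)
        ]
        for row in rows
    ]


# The last sample is an outlier; it does not distort the scale of the others
data = [[1.0], [2.0], [3.0], [4.0], [100.0]]
scaled = robust_scale(data)
```

Because the median and the quartiles ignore extreme values, the four ordinary samples keep a sensible spread while the outlier remains clearly separated.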

Dataset.drop_missing() Dataset

Drop samples with missing values.

Returns:

A clean dataset

Return type:

Dataset
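
The effect of dropping incomplete samples can be sketched in plain Python. The example assumes that a missing value is represented as NaN, which this page does not specify; drop_missing below is a hypothetical stand-alone function, not the library method.

```python
import math


def drop_missing(inputs, outputs):
    """Keep only the samples whose inputs and output contain no NaN."""
    kept = [
        (row, out)
        for row, out in zip(inputs, outputs)
        if not math.isnan(out) and not any(math.isnan(v) for v in row)
    ]
    clean_inputs = [row for row, _ in kept]
    clean_outputs = [out for _, out in kept]
    return clean_inputs, clean_outputs


nan = float("nan")
inputs = [[1.0, 2.0], [nan, 3.0], [4.0, 5.0]]
outputs = [0.0, 1.0, nan]
clean_inputs, clean_outputs = drop_missing(inputs, outputs)
```

Inputs and outputs are filtered together so that the remaining rows stay aligned.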

Dataset.remove_outliers(prop: float = 0.05, random_seed: int | None = None) Dataset

Remove the outliers.

Parameters:
  • prop (float) – Expected outlier proportion per class, defaults to 0.05

  • random_seed (int, optional) – Random seed for the random generator, defaults to None

Returns:

A clean dataset

Return type:

Dataset

Dataset.oversample(n_neighbors: int | None = 5, random_seed: int | None = None) Dataset

Oversample all classes but the majority class.

All classes but the majority class are oversampled until they match the number of samples of the majority class. SMOTE is used for oversampling, but if any class has fewer than n_neighbors samples, RandomOverSampler is applied first.

Parameters:
  • n_neighbors (int, optional) – Number of neighbors for SMOTE, defaults to 5.

  • random_seed (int, optional) – Random seed for the random generator, defaults to None

Returns:

An oversampled dataset

Return type:

Dataset
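
To make the balancing target concrete, the following sketch applies plain random oversampling (duplicating existing samples with replacement), not SMOTE. It is a simplified stand-in written with the standard library; random_oversample is a hypothetical helper and the data is made up.

```python
import random
from collections import Counter


def random_oversample(inputs, outputs, random_seed=None):
    """Duplicate minority-class samples until every class matches the majority."""
    rng = random.Random(random_seed)
    counts = Counter(outputs)
    target = max(counts.values())  # size of the majority class
    new_inputs, new_outputs = list(inputs), list(outputs)
    for label, count in counts.items():
        pool = [row for row, out in zip(inputs, outputs) if out == label]
        for _ in range(target - count):
            new_inputs.append(list(rng.choice(pool)))
            new_outputs.append(label)
    return new_inputs, new_outputs


inputs = [[0.0], [0.1], [0.2], [0.3], [5.0]]
outputs = [0, 0, 0, 0, 1]
bal_inputs, bal_outputs = random_oversample(inputs, outputs, random_seed=1)
class_counts = Counter(bal_outputs)
```

SMOTE differs in that it synthesizes new samples by interpolating between neighbors instead of copying existing ones, which is why it needs enough samples per class.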

Dataset.select_features(feats: Sequence[int]) Dataset

Return a new dataset only with some selected features.

Parameters:

feats (Sequence of int) – Indices of the selected features

Returns:

The new dataset

Return type:

Dataset
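
Feature selection by index amounts to keeping a subset of the input columns. A minimal plain-Python sketch follows; select_features here is a hypothetical stand-alone function operating on a list of rows, not the library method.

```python
def select_features(rows, feats):
    """Return new rows containing only the columns listed in feats."""
    return [[row[i] for i in feats] for row in rows]


inputs = [[0.1, 0.2, 0.3, 0.4], [1.1, 1.2, 1.3, 1.4]]
selected = select_features(inputs, feats=[0, 2])
```

The original rows are left untouched, mirroring the fact that the method returns a new dataset.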

Dataset.append_random_features(num_feats: int, random_seed: int | None = None) Dataset

Return a new dataset with some random features appended.

Parameters:
  • num_feats (An int greater than 0) – Number of random features to be appended

  • random_seed (int, optional) – Random seed for the random generator, defaults to None

Raises:
  • TypeError – If the number of random features is not an integer

  • ValueError – If the number of random features is not greater than 0

Returns:

The new dataset

Return type:

Dataset
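
The argument checks and the seeding behaviour documented above can be sketched as follows. This is a plain-Python approximation: it assumes the random features are drawn uniformly from [0, 1), which this page does not state, and append_random_features is a hypothetical stand-alone function.

```python
import random


def append_random_features(rows, num_feats, random_seed=None):
    """Return new rows with num_feats random columns appended to each row."""
    if not isinstance(num_feats, int):
        raise TypeError("num_feats must be an integer")
    if num_feats <= 0:
        raise ValueError("num_feats must be greater than 0")
    rng = random.Random(random_seed)
    # Assumption: uniform values in [0, 1)
    return [
        list(row) + [rng.random() for _ in range(num_feats)]
        for row in rows
    ]


inputs = [[1.0, 2.0], [3.0, 4.0]]
extended = append_random_features(inputs, num_feats=3, random_seed=0)
repeated = append_random_features(inputs, num_feats=3, random_seed=0)
```

Passing the same random_seed twice produces identical random columns, which is useful for reproducible experiments.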

Dataset.split(test_prop: float, random_seed: int | None = None) Tuple[Dataset, Dataset]

Split the dataset.

Parameters:
  • test_prop (float) – Proportion of the dataset used as test data. The remaining samples will be returned as training data

  • random_seed (int, optional) – Random seed for the random generator, defaults to None

Raises:

Returns:

The training and test datasets

Return type:

tuple of Dataset
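
A seeded train/test split can be sketched in plain Python. The example assumes a simple shuffled, non-stratified split with the test size rounded to the nearest integer; culebra may split differently. The split function below is a hypothetical stand-alone helper that returns the training part first and the test part second, matching the order documented above.

```python
import random


def split(inputs, outputs, test_prop, random_seed=None):
    """Shuffle the sample indices and split them into training and test parts."""
    indices = list(range(len(inputs)))
    random.Random(random_seed).shuffle(indices)
    n_test = round(len(inputs) * test_prop)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([inputs[i] for i in train_idx], [outputs[i] for i in train_idx])
    test = ([inputs[i] for i in test_idx], [outputs[i] for i in test_idx])
    return train, test


inputs = [[float(i)] for i in range(10)]
outputs = [i % 2 for i in range(10)]
train, test = split(inputs, outputs, test_prop=0.3, random_seed=42)
```

Every sample ends up in exactly one of the two parts, and repeating the call with the same random_seed gives the same partition.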