culebra.tools.Dataset class

class Dataset(*files: tuple[str | PathLike[str] | TextIO], output_index: int | None = None, sep: str = '\\s+')

Bases: Base

Create a dataset.

A dataset can be stored in a single file or split across two files. When a single file is used, output_index must indicate which column stores the output values. If output_index is omitted, the dataset is assumed to be composed of two consecutive files, the first containing the input columns and the second containing the output column. In this case, only the first column of the second file is loaded (just one output value per sample).

If no files are provided, an empty dataset is returned.

Parameters:
  • files (tuple[str | PathLike[str] | TextIO]) – Files containing the dataset. If output_index is omitted, two files are necessary, the first containing the input columns and the second containing the output column. Otherwise, a single file holds the whole dataset (input and output columns)

  • output_index (int) – If the dataset is provided in a single file, this parameter indicates which column in the file contains the output values. Otherwise this parameter must be omitted (set to None) to express that inputs and outputs are stored in two different files. Its default value is None

  • sep (str) – Column separator used within the files. Defaults to DEFAULT_SEP

Raises:
  • TypeError – If output_index is neither None nor an integer

  • TypeError – If sep is not a string

  • IndexError – If output_index is out of range

  • RuntimeError – If output_index is None and only one file is provided

  • RuntimeError – When loading a dataset composed of two files, if the file containing the input columns and the file containing the output column do not have the same number of rows.

  • RuntimeError – If any file is empty

Returns:

The dataset

Return type:

Dataset
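The single-file loading mode described above amounts to splitting each row into input and output columns by index. The following is a minimal sketch in plain Python, not culebra's implementation; the sample rows and the output_index value are hypothetical:

```python
# Sketch of how a one-file dataset is split into inputs and outputs
# by column index, mirroring the output_index parameter.
rows = [
    "5.1 3.5 1.4 0",
    "4.9 3.0 1.3 1",
]
output_index = 3  # hypothetical column holding the output values

inputs, outputs = [], []
for line in rows:
    cols = line.split()  # sep='\s+' splits on any run of whitespace
    outputs.append(float(cols[output_index]))
    inputs.append(
        [float(c) for i, c in enumerate(cols) if i != output_index]
    )
```

With two files, the same column extraction is applied, but the inputs and the output come from separate files with matching row counts.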

Class methods

classmethod Dataset.load(filename: str) → Base

Load a serialized object from a file.

Parameters:

filename (str) – The file name.

Returns:

The loaded object
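load and dump (below) are serialization counterparts: dump writes an object to a file and load restores it. The round trip can be pictured with Python's pickle module; whether culebra serializes with pickle internally is an assumption made only for this illustration:

```python
import os
import pickle
import tempfile

# Round-trip sketch analogous to Dataset.dump(filename) / Dataset.load(filename).
obj = {"inputs": [[0.1, 0.2]], "outputs": [1]}
path = os.path.join(tempfile.mkdtemp(), "dataset.pkl")

with open(path, "wb") as f:
    pickle.dump(obj, f)      # cf. Dataset.dump(filename)

with open(path, "rb") as f:
    loaded = pickle.load(f)  # cf. Dataset.load(filename)
```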

classmethod Dataset.load_from_uci(name: str | None = None, id_number: int | None = None) → Dataset

Load the dataset from the UCI ML repository.

The dataset can be identified by either its id_number or its name, but only one of these should be provided.

If the dataset has more than one output column, only the first column is considered.

Parameters:
  • name (str) – Dataset name, or a substring of it, optional

  • id_number (int) – Dataset ID for UCI ML Repository, optional

Raises:

RuntimeError – If the dataset cannot be loaded

Returns:

The dataset

Return type:

Dataset

Properties

property Dataset.inputs: ndarray

Input data of the dataset.

Return type:

ndarray

property Dataset.num_feats: int

Number of features in the dataset.

Return type:

int

property Dataset.outputs: ndarray

Output data of the dataset.

Return type:

ndarray

property Dataset.size: int

Number of samples in the dataset.

Return type:

int
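These four properties are mutually consistent: inputs is a two-dimensional array of shape (size, num_feats), and outputs holds one value per sample. With NumPy arrays standing in for a loaded dataset (the values are hypothetical):

```python
import numpy as np

# Stand-in arrays for Dataset.inputs and Dataset.outputs
inputs = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])
outputs = np.array([0, 1, 0])

num_feats = inputs.shape[1]  # number of input columns
size = inputs.shape[0]       # number of samples, == len(outputs)
```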

Methods

Dataset.append_random_features(num_feats: int, random_seed: int | None = None) → Dataset

Return a new dataset with some random features appended.

Parameters:
  • num_feats (int) – Number of random features to be appended (greater than 0)

  • random_seed (int) – Random seed for the random generator, defaults to None

Raises:
  • TypeError – If the number of random features is not an integer

  • ValueError – If the number of random features is not greater than 0

Returns:

The new dataset

Return type:

Dataset
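Appending random features amounts to stacking extra randomly generated columns onto the input matrix. A NumPy sketch of the idea, not culebra's implementation:

```python
import numpy as np

rng = np.random.default_rng(42)  # plays the role of random_seed
inputs = np.zeros((3, 2))        # 3 samples, 2 original features
extra = rng.random((3, 4))       # 4 random features in [0, 1)
augmented = np.hstack((inputs, extra))
```

This is useful, for instance, to test whether a feature selector discards the irrelevant appended columns.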

Dataset.drop_missing() → Dataset

Drop samples with missing values.

Returns:

A clean dataset

Return type:

Dataset
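Dropping samples with missing values corresponds to filtering out every row that contains a NaN, roughly as follows (an illustration, not culebra's implementation):

```python
import numpy as np

inputs = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0]])
keep = ~np.isnan(inputs).any(axis=1)  # rows without missing values
clean = inputs[keep]                  # the NaN row is removed
```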

Dataset.dump(filename: str) → None

Serialize this object and save it to a file.

Parameters:

filename (str) – The file name.

Dataset.normalize() → Dataset

Normalize the dataset between 0 and 1.

Returns:

A normalized dataset

Return type:

Dataset
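Normalizing to the [0, 1] range corresponds to a per-column min-max transform, roughly as sketched below (an assumption about the method; culebra's exact procedure may differ):

```python
import numpy as np

inputs = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
mins = inputs.min(axis=0)
maxs = inputs.max(axis=0)
normalized = (inputs - mins) / (maxs - mins)  # each column now spans [0, 1]
```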

Dataset.oversample(n_neighbors: int = 5, random_seed: int | None = None) → Dataset

Oversample all classes but the majority class.

All classes but the majority class are oversampled to match the number of samples of the majority class. SMOTE is used for oversampling, but if any class has fewer than n_neighbors samples, RandomOverSampler is applied first.

Parameters:
  • n_neighbors (int) – Number of nearest neighbors used by SMOTE, defaults to 5

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

An oversampled dataset

Return type:

Dataset
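The balancing target can be illustrated with plain random duplication, which is what RandomOverSampler does (SMOTE instead synthesizes new samples by interpolating between neighbors; this sketch covers only the simpler duplication step):

```python
import numpy as np

rng = np.random.default_rng(0)
outputs = np.array([0, 0, 0, 0, 1])    # class 1 is the minority
counts = np.bincount(outputs)
deficit = counts.max() - counts[1]     # samples class 1 is short of
minority = np.flatnonzero(outputs == 1)
extra = rng.choice(minority, size=deficit, replace=True)
balanced = np.concatenate((outputs, outputs[extra]))  # both classes equal now
```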

Dataset.remove_outliers(prop: float = 0.05, random_seed: int | None = None) → Dataset

Remove the outliers.

Parameters:
  • prop (float) – Proportion of samples to be considered outliers, defaults to 0.05

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

A clean dataset

Return type:

Dataset
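One simple way to picture prop-based outlier removal is to discard the given proportion of samples that lie farthest from the center of the data. The sketch below uses distance from the median; culebra's actual outlier-detection method may well differ, so treat this only as intuition for the prop parameter:

```python
import numpy as np

values = np.array([1.0, 1.1, 0.9, 1.05, 50.0])  # 50.0 is an obvious outlier
prop = 0.2                                      # proportion to discard
k = int(np.ceil(prop * len(values)))
dist = np.abs(values - np.median(values))       # distance from the median
keep = np.sort(np.argsort(dist)[: len(values) - k])
clean = values[keep]                            # the outlier is gone
```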

Dataset.scale() → Dataset

Scale the features using statistics that are robust to outliers.

Returns:

A scaled dataset

Return type:

Dataset
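Robust scaling typically centers each feature on its median and divides by its interquartile range, so extreme values do not dominate the scaling statistics. A NumPy sketch of this standard technique (an assumption about the method, not culebra's exact implementation):

```python
import numpy as np

inputs = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100.0 is an outlier
median = np.median(inputs, axis=0)
q1, q3 = np.percentile(inputs, [25, 75], axis=0)
scaled = (inputs - median) / (q3 - q1)  # centered on 0, IQR-scaled
```

Unlike min-max normalization, the outlier shifts the median and IQR only slightly, so the bulk of the data keeps a sensible scale.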

Dataset.select_features(feats: Sequence[int]) → Dataset

Return a new dataset containing only the selected features.

Parameters:

feats (Sequence[int]) – Indices of the selected features

Returns:

The new dataset

Return type:

Dataset
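Selecting features is column indexing on the input matrix; the output column is unaffected. A NumPy sketch with hypothetical data:

```python
import numpy as np

inputs = np.arange(12).reshape(3, 4)  # 3 samples, 4 features
feats = [0, 2]                        # indices of the selected features
selected = inputs[:, feats]           # keep only columns 0 and 2
```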

Dataset.split(test_prop: float, random_seed: int | None = None) → tuple[Dataset, Dataset]

Split the dataset.

Parameters:
  • test_prop (float) – Proportion of the dataset used as test data. The remaining samples will be returned as training data

  • random_seed (int) – Random seed for the random generator, defaults to None

Returns:

The training and test datasets

Return type:

tuple[Dataset, Dataset]
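The split can be pictured as a seeded shuffle of the sample indices followed by a cut at test_prop. This is a sketch of the idea, not culebra's implementation (which may, for example, stratify by class):

```python
import numpy as np

rng = np.random.default_rng(0)  # plays the role of random_seed
size = 10                       # hypothetical number of samples
test_prop = 0.3

perm = rng.permutation(size)    # shuffled sample indices
n_test = round(test_prop * size)
test_idx, train_idx = perm[:n_test], perm[n_test:]
```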