culebra.tools.Dataset class
- class Dataset(*files: tuple[str | PathLike[str] | TextIO], output_index: int | None = None, sep: str = '\\s+')
Bases: Base
Create a dataset.
Datasets can be stored in a single file or in two files. If a single file is used, output_index must indicate which column stores the output values. If output_index is omitted, the dataset is assumed to be split across two consecutive files, the first containing the input columns and the second containing the output column. In this case only the first column of the second file is loaded (just one output value per sample).
If no files are provided, an empty dataset is returned.
- Parameters:
files (tuple[str | PathLike[str] | TextIO]) – Files containing the dataset. If output_index is omitted, two files are required: the first containing the input columns and the second containing the output column. Otherwise, a single file provides the whole dataset (input and output columns)
output_index (int) – If the dataset is provided in a single file, this parameter indicates which column of the file contains the output values. Otherwise it must be omitted (set to None) to indicate that inputs and outputs are stored in two different files. Defaults to None
sep (str) – Column separator used within the files. Defaults to DEFAULT_SEP
- Raises:
TypeError – If sep is not a string
IndexError – If output_index is out of range
RuntimeError – If output_index is None and only one file is provided
RuntimeError – When loading a dataset composed of two files, if the file containing the input columns and the file containing the output column do not have the same number of rows
RuntimeError – If any file is empty
- Returns:
The dataset
- Return type:
Dataset
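The two file layouts described above can be sketched with the standard library alone. This is not culebra's implementation: the helper names (read_columns, split_single_file, join_two_files) are hypothetical, and the sketch assumes purely numeric columns and a regular-expression separator such as the '\s+' default shown in the signature.

```python
import io
import re


def read_columns(stream, sep=r"\s+"):
    """Split each non-blank line of a text stream into float columns."""
    rows = []
    for line in stream:
        line = line.strip()
        if line:
            rows.append([float(value) for value in re.split(sep, line)])
    return rows


def split_single_file(stream, output_index, sep=r"\s+"):
    """Single-file layout: the column at output_index holds the outputs."""
    rows = read_columns(stream, sep)
    if not rows:
        raise RuntimeError("empty file")
    inputs, outputs = [], []
    for row in rows:
        # range(...)[i] normalizes negative indices and raises IndexError
        # when output_index is out of range
        idx = range(len(row))[output_index]
        outputs.append(row[idx])
        inputs.append(row[:idx] + row[idx + 1:])
    return inputs, outputs


def join_two_files(inputs_stream, outputs_stream, sep=r"\s+"):
    """Two-file layout: only the first column of the second file is used."""
    inputs = read_columns(inputs_stream, sep)
    outputs = [row[0] for row in read_columns(outputs_stream, sep)]
    if not inputs or not outputs:
        raise RuntimeError("empty file")
    if len(inputs) != len(outputs):
        raise RuntimeError("input and output files differ in number of rows")
    return inputs, outputs


# Same data expressed in both layouts
X1, y1 = split_single_file(io.StringIO("1.0 2.0 0\n3.0 4.0 1\n"), output_index=2)
X2, y2 = join_two_files(
    io.StringIO("1.0 2.0\n3.0 4.0\n"),
    io.StringIO("0 9\n1 9\n"),  # the second column is ignored
)
```

Going by the signature above, the corresponding culebra calls would take the form Dataset(file, output_index=2) for the single-file layout and Dataset(inputs_file, outputs_file) for the two-file layout.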
Class methods
- classmethod Dataset.load(filename: str) → Base
Load a serialized object from a file.
- Parameters:
filename (str) – The file name.
- Returns:
The loaded object
- Raises:
TypeError – If filename is not a valid file name
ValueError – If the filename extension is not SERIALIZED_FILE_EXTENSION
- classmethod Dataset.load_from_uci(name: str | None = None, id_number: int | None = None) → Dataset
Load the dataset from the UCI ML repository.
The dataset can be identified by either its id_number or its name, but only one of these should be provided.
If the dataset has more than one output column, only the first column is considered.
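- Parameters:
name (str) – Name of the dataset, defaults to None
id_number (int) – ID number of the dataset, defaults to None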
- Raises:
RuntimeError – If the dataset cannot be loaded
- Returns:
The dataset
- Return type:
Dataset
Properties
Methods
- Dataset.append_random_features(num_feats: int, random_seed: int | None = None) → Dataset
Return a new dataset with some random features appended.
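- Parameters:
num_feats (int) – Number of random features to append
random_seed (int) – Random seed for the random generator, defaults to None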
- Raises:
TypeError – If the number of random features is not an integer
ValueError – If the number of random features is not greater than 0
- Returns:
The new dataset
- Return type:
Dataset
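A minimal sketch of the behavior described for append_random_features, written as a standalone function over lists of rows. The distribution of the random values is not stated on this page, so uniform values in [0, 1) are an assumption here, as is the use of Python's random module.

```python
import random


def append_random_features(inputs, num_feats, random_seed=None):
    """Return new rows with num_feats random columns appended to each row."""
    if not isinstance(num_feats, int):
        raise TypeError("num_feats must be an integer")
    if num_feats <= 0:
        raise ValueError("num_feats must be greater than 0")
    rng = random.Random(random_seed)  # seeded generator for reproducibility
    # Assumption: uniform values in [0, 1); the original rows are not modified
    return [row + [rng.random() for _ in range(num_feats)] for row in inputs]


data = [[1.0, 2.0], [3.0, 4.0]]
first = append_random_features(data, 3, random_seed=0)
second = append_random_features(data, 3, random_seed=0)
```

With the same seed the two calls produce identical rows, and data itself is left unchanged, which matches the "return a new dataset" wording.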
- Dataset.drop_missing() → Dataset
Drop samples with missing values.
- Returns:
A clean dataset
- Return type:
Dataset
- Dataset.dump(filename: str) → None
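The effect of dropping samples with missing values can be illustrated as follows. The sketch assumes missing values are represented as NaN, in either the inputs or the output, which this page does not state explicitly.

```python
import math


def drop_missing(inputs, outputs):
    """Keep only the samples whose inputs and output contain no NaN."""
    kept = [
        (row, out)
        for row, out in zip(inputs, outputs)
        if not any(math.isnan(value) for value in row) and not math.isnan(out)
    ]
    return [row for row, _ in kept], [out for _, out in kept]


nan = float("nan")
X_clean, y_clean = drop_missing(
    [[1.0, 2.0], [nan, 5.0], [3.0, 4.0]],
    [0.0, 1.0, nan],
)
```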
Serialize this object and save it to a file.
- Parameters:
filename (str) – The file name.
- Raises:
TypeError – If filename is not a valid file name
ValueError – If the filename extension is not SERIALIZED_FILE_EXTENSION
- Dataset.normalize() → Dataset
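The dump/load round trip, including the extension check, might look roughly like the sketch below. Both the serialization format (pickle) and the extension value are assumptions: this page names SERIALIZED_FILE_EXTENSION without giving its value, so a hypothetical ASSUMED_EXTENSION constant stands in for it.

```python
import os
import pickle
import tempfile

# Hypothetical placeholder; the real SERIALIZED_FILE_EXTENSION value
# is not given on this page
ASSUMED_EXTENSION = ".pkl"


def check_filename(filename):
    """Validate the file name in the way the Raises sections describe."""
    if not isinstance(filename, str):
        raise TypeError("filename must be a string")
    if os.path.splitext(filename)[1] != ASSUMED_EXTENSION:
        raise ValueError(f"filename must end in {ASSUMED_EXTENSION}")
    return filename


def dump(obj, filename):
    """Serialize obj to filename (pickle is an assumption)."""
    with open(check_filename(filename), "wb") as handle:
        pickle.dump(obj, handle)


def load(filename):
    """Load a previously serialized object from filename."""
    with open(check_filename(filename), "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "dataset" + ASSUMED_EXTENSION)
    original = {"inputs": [[1.0, 2.0]], "outputs": [0]}
    dump(original, path)
    restored = load(path)
```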
Normalize the dataset between 0 and 1.
- Returns:
A normalized dataset
- Return type:
Dataset
- Dataset.oversample(n_neighbors: int = 5, random_seed: int | None = None) → Dataset
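Normalizing between 0 and 1 is commonly done with min-max scaling, x' = (x - min) / (max - min). The sketch below applies it per feature column; whether culebra scales each feature independently, and how it treats constant columns, is an assumption here.

```python
def min_max_normalize(inputs):
    """Rescale every feature column to the [0, 1] interval."""
    columns = list(zip(*inputs))
    lows = [min(column) for column in columns]
    spans = [max(column) - min(column) for column in columns]
    return [
        # Assumption: a constant column (span 0) is mapped to 0.0
        [(value - low) / span if span else 0.0
         for value, low, span in zip(row, lows, spans)]
        for row in inputs
    ]


scaled = min_max_normalize([[1.0, 10.0], [2.0, 20.0], [3.0, 40.0]])
```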
Oversample all classes but the majority class.
All classes but the majority class are oversampled to equal the number of samples of the majority class.
SMOTE is used for oversampling, but if any class has fewer than n_neighbors samples, RandomOverSampler is applied first
- Dataset.remove_outliers(prop: float = 0.05, random_seed: int | None = None) → Dataset
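The balancing target (every class grown to the size of the majority class) can be illustrated with plain random oversampling, the fallback idea mentioned above. This sketch duplicates existing samples at random and does not implement SMOTE, which instead synthesizes new samples by interpolating between neighbors. The function name and list-based data layout are illustrative only.

```python
import random
from collections import Counter


def random_oversample(inputs, outputs, random_seed=None):
    """Duplicate random samples until every class matches the majority."""
    rng = random.Random(random_seed)
    counts = Counter(outputs)
    target = max(counts.values())  # size of the majority class
    new_inputs, new_outputs = list(inputs), list(outputs)
    for label, count in counts.items():
        members = [row for row, out in zip(inputs, outputs) if out == label]
        for _ in range(target - count):
            new_inputs.append(list(rng.choice(members)))  # copy of a sample
            new_outputs.append(label)
    return new_inputs, new_outputs


X = [[0.0], [0.1], [0.2], [1.0]]
y = [0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y, random_seed=1)
```

After the call both classes have three samples; the majority class is left untouched.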
Remove the outliers.
- Parameters:
prop (float) – Expected outlier proportion per class, defaults to DEFAULT_OUTLIER_PROPORTION
random_seed (int) – Random seed for the random generator, defaults to None
- Returns:
A clean dataset
- Return type:
Dataset
- Dataset.select_features(feats: Sequence[int]) → Dataset
Return a new dataset containing only the selected features.
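Feature selection by index amounts to keeping the listed input columns, in the order given. A minimal sketch over lists of rows (the standalone function is illustrative and is not culebra's code):

```python
def select_features(inputs, feats):
    """Return new rows holding only the columns whose indices are in feats."""
    return [[row[index] for index in feats] for row in inputs]


subset = select_features([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], [0, 2])
```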

