culebra.tools.Dataset class
- class Dataset(*files: str | PathLike[str] | TextIO, output_index: int | None = None, sep: str = '\\s+')
Create a dataset.
Datasets can be stored in a single file or in two files. If a single file is used, output_index must indicate which column stores the output values. If output_index is None (its default value), the dataset is assumed to consist of two consecutive files, the first one containing the input columns and the second one storing the output column. In this case only the first column of the second file is loaded (just one output value per sample). If no files are provided, an empty dataset is returned.
- Parameters:
files (Sequence of path-like objects, URLs or file-like objects, optional) – Files containing the dataset. If output_index is None, two files are necessary, the first one containing the input columns and the second one containing the output column. Otherwise, a single file provides the whole dataset (input and output columns)
output_index (int, optional) – If the dataset is provided in a single file, this parameter indicates which column of the file contains the output values. Otherwise it must be set to None to express that inputs and outputs are stored in two different files. Its default value is None
sep (str, optional) – Column separator used within the files. Defaults to DEFAULT_SEP
- Raises:
TypeError – If sep is not a string
IndexError – If output_index is out of range
RuntimeError – If output_index is None and only one file is provided
RuntimeError – When loading a dataset composed of two files, if the file containing the input columns and the file containing the output column do not have the same number of rows
RuntimeError – If any file is empty
- Returns:
The dataset
- Return type:
Dataset
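The two file layouts described for the constructor can be illustrated without the library. The following is a minimal sketch in plain Python, not culebra's implementation: the helper names read_table, from_one_file and from_two_files are hypothetical, values are assumed to be numeric, and output_index is assumed to be a non-negative column position.

```python
import io
import re


def read_table(handle, sep=r"\s+"):
    """Parse a text file into rows of floats, splitting columns on sep."""
    rows = []
    for line in handle:
        line = line.strip()
        if line:
            rows.append([float(token) for token in re.split(sep, line)])
    if not rows:
        raise RuntimeError("the file is empty")
    return rows


def from_one_file(handle, output_index, sep=r"\s+"):
    """Split a single table into inputs and outputs at output_index."""
    rows = read_table(handle, sep)
    if not 0 <= output_index < len(rows[0]):
        raise IndexError("output_index is out of range")
    inputs = [row[:output_index] + row[output_index + 1:] for row in rows]
    outputs = [row[output_index] for row in rows]
    return inputs, outputs


def from_two_files(inputs_handle, outputs_handle, sep=r"\s+"):
    """Read the inputs from one file and the outputs from another."""
    inputs = read_table(inputs_handle, sep)
    # Only the first column of the second file is used as output
    outputs = [row[0] for row in read_table(outputs_handle, sep)]
    if len(inputs) != len(outputs):
        raise RuntimeError("input and output files differ in number of rows")
    return inputs, outputs


# One file: the output values are stored in column 0
one_file = from_one_file(io.StringIO("0 1.5 2.5\n1 3.5 4.5\n"), output_index=0)

# Two files: inputs first, then outputs (extra output columns are ignored)
two_files = from_two_files(
    io.StringIO("1.5 2.5\n3.5 4.5\n"),
    io.StringIO("0 9\n1 9\n"),
)
```

Both calls yield the same inputs and outputs, which is the point of supporting the two layouts.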
Class methods
- classmethod Dataset.load_pickle(filename: str) → Base
Load a pickled object from a file.
- Parameters:
filename (str) – The file name.
- Raises:
TypeError – If filename is not a valid file name
ValueError – If the filename extension is not PICKLE_FILE_EXTENSION
- classmethod Dataset.load_from_uci(name: str | None = None, id: int | None = None) → Dataset
Load the dataset from the UCI ML repository.
The dataset can be identified by either its id or its name, but only one of these should be provided.
If the dataset has more than one output column, only the first column is considered.
- Parameters:
name (str, optional) – The dataset name
id (int, optional) – The dataset id
- Raises:
RuntimeError – If the dataset cannot be loaded
- Returns:
The dataset
- Return type:
Dataset
Properties
Methods
- Dataset.save_pickle(filename: str) → None
Pickle this object and save it to a file.
- Parameters:
filename (str) – The file name.
- Raises:
TypeError – If filename is not a valid file name
ValueError – If the filename extension is not PICKLE_FILE_EXTENSION
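The checks documented for save_pickle and load_pickle can be sketched with the standard pickle module. This is only an illustration under stated assumptions, not the library's code: the ".pkl" value given to PICKLE_FILE_EXTENSION here is hypothetical (the real constant is defined by the library), and check_filename is a hypothetical helper.

```python
import os
import pickle
import tempfile

# Hypothetical value, the real constant is defined by the library
PICKLE_FILE_EXTENSION = ".pkl"


def check_filename(filename):
    """Validate the type and the extension of a pickle file name."""
    if not isinstance(filename, str):
        raise TypeError("filename is not a valid file name")
    if os.path.splitext(filename)[1] != PICKLE_FILE_EXTENSION:
        raise ValueError(f"the extension must be {PICKLE_FILE_EXTENSION}")
    return filename


def save_pickle(obj, filename):
    """Pickle obj and save it to filename."""
    with open(check_filename(filename), "wb") as handle:
        pickle.dump(obj, handle)


def load_pickle(filename):
    """Load a pickled object from filename."""
    with open(check_filename(filename), "rb") as handle:
        return pickle.load(handle)


# Round trip through a temporary directory
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "dataset" + PICKLE_FILE_EXTENSION)
    original = {"inputs": [[1.5, 2.5]], "outputs": [0]}
    save_pickle(original, path)
    restored = load_pickle(path)
```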
- Dataset.normalize() → Dataset
Normalize the dataset between 0 and 1.
- Returns:
A normalized dataset
- Return type:
Dataset
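One common way to scale data between 0 and 1 is per-column min-max scaling. The sketch below assumes that interpretation. The source does not state which formula normalize uses, and min_max_normalize is a hypothetical helper working on plain lists of rows.

```python
def min_max_normalize(rows):
    """Scale every column of rows to the [0, 1] range."""
    columns = list(zip(*rows))
    lows = [min(column) for column in columns]
    highs = [max(column) for column in columns]
    return [
        [
            # Constant columns are mapped to 0 to avoid dividing by zero
            (value - low) / (high - low) if high > low else 0.0
            for value, low, high in zip(row, lows, highs)
        ]
        for row in rows
    ]


scaled = min_max_normalize([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
```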
- Dataset.drop_missing() → Dataset
Drop samples with missing values.
- Returns:
A clean dataset
- Return type:
Dataset
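Dropping samples with missing values can be sketched as a row filter. The example assumes that missing values are encoded as NaN, which the source does not specify. The function below is a standalone illustration, not the method's implementation.

```python
import math


def drop_missing(inputs, outputs):
    """Keep only the samples whose inputs and output contain no NaN."""
    clean_inputs, clean_outputs = [], []
    for row, out in zip(inputs, outputs):
        if any(math.isnan(value) for value in (*row, out)):
            continue
        clean_inputs.append(list(row))
        clean_outputs.append(out)
    return clean_inputs, clean_outputs


nan = float("nan")
clean_inputs, clean_outputs = drop_missing(
    [[1.0, 2.0], [nan, 3.0], [4.0, 5.0]],
    [0.0, 1.0, nan],
)
```

Only the first sample survives here: the second has a missing input and the third a missing output.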
- Dataset.remove_outliers(prop: float = 0.05, random_seed: int | None = None) → Dataset
Remove the outliers.
- Dataset.oversample(n_neighbors: int | None = 5, random_seed: int | None = None) → Dataset
Oversample all classes but the majority class.
All classes but the majority class are oversampled to match the number of samples of the majority class. SMOTE is used for oversampling, but if any class has fewer than n_neighbors samples, RandomOverSampler is applied first.
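The balancing goal can be illustrated with plain random oversampling, the simpler of the two techniques named above. This sketch duplicates existing minority samples until every class matches the majority class. It does not implement SMOTE, which synthesizes new samples by interpolating between neighbors, and random_oversample is a hypothetical standalone helper.

```python
import random
from collections import Counter


def random_oversample(inputs, outputs, random_seed=None):
    """Duplicate minority samples until all classes have equal size."""
    rng = random.Random(random_seed)
    counts = Counter(outputs)
    target = max(counts.values())
    new_inputs = [list(row) for row in inputs]
    new_outputs = list(outputs)
    for label, count in counts.items():
        members = [row for row, out in zip(inputs, outputs) if out == label]
        # The majority class needs no extra samples (target - count == 0)
        for _ in range(target - count):
            new_inputs.append(list(rng.choice(members)))
            new_outputs.append(label)
    return new_inputs, new_outputs


balanced_inputs, balanced_outputs = random_oversample(
    [[0.0], [0.1], [0.2], [1.0]],
    [0, 0, 0, 1],
    random_seed=0,
)
```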
- Dataset.select_features(feats: Sequence[int]) → Dataset
Return a new dataset with only the selected features.
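Feature selection amounts to keeping a subset of the input columns. A minimal sketch on plain lists, assuming feats holds zero-based column indices (the standalone function is hypothetical and leaves its argument untouched):

```python
def select_features(rows, feats):
    """Return new rows containing only the columns listed in feats."""
    return [[row[index] for index in feats] for row in rows]


all_rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
selected = select_features(all_rows, [0, 2])
```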
- Dataset.append_random_features(num_feats: int, random_seed: int | None = None) → Dataset
Return a new dataset with some random features appended.
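- Parameters:
num_feats (int) – Number of random features to be appended
random_seed (int, optional) – Random seed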
- Raises:
TypeError – If the number of random features is not an integer
ValueError – If the number of random features is not greater than 0
- Returns:
The new dataset
- Return type:
Dataset
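The argument checks and the effect of appending random features can be sketched as follows. The distribution of the random values is an assumption (uniform in [0, 1)), since the source does not specify it, and the standalone function is hypothetical.

```python
import random


def append_random_features(rows, num_feats, random_seed=None):
    """Return new rows with num_feats random columns appended."""
    if isinstance(num_feats, bool) or not isinstance(num_feats, int):
        raise TypeError("the number of random features must be an integer")
    if num_feats <= 0:
        raise ValueError("the number of random features must be greater than 0")
    rng = random.Random(random_seed)
    # Assumption: random values are drawn uniformly from [0, 1)
    return [
        list(row) + [rng.random() for _ in range(num_feats)]
        for row in rows
    ]


base_rows = [[1.0, 2.0], [3.0, 4.0]]
extended = append_random_features(base_rows, 2, random_seed=0)
```

Passing the same random_seed reproduces the same appended columns, which helps when comparing experiments.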