Data Module
BalancedBatchSampler
Bases: Sampler[List[int]]
A custom PyTorch Sampler that avoids creating a final batch of size 1.
This sampler behaves like a standard BatchSampler but with a key
difference in handling the last batch. If the last batch would normally
have a size of 1, this sampler redistributes the last two batches to be
of roughly equal size. For example, if a dataset of 129 samples is used
with a batch size of 128, instead of yielding batches of [128, 1], it
will yield two balanced batches, such as [65, 64].
This is particularly useful for avoiding issues with layers like
BatchNorm, which require batch sizes greater than 1, without having to
drop data (drop_last=True).
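The redistribution logic can be sketched as follows. This is a minimal illustration of the behavior described above, not the library's actual implementation:

```python
def balanced_batch_sizes(n_samples: int, batch_size: int) -> list:
    """Compute batch sizes, splitting the last two batches evenly
    whenever the final batch would otherwise contain a single sample."""
    if n_samples <= batch_size:
        return [n_samples]
    sizes = [batch_size] * (n_samples // batch_size)
    remainder = n_samples % batch_size
    if remainder == 1:
        # Merge the last full batch with the orphan sample,
        # then split the pool into two roughly equal halves.
        pool = batch_size + 1
        sizes = sizes[:-1] + [pool - pool // 2, pool // 2]
    elif remainder > 1:
        sizes.append(remainder)
    return sizes

print(balanced_batch_sizes(129, 128))  # → [65, 64]
```

With 129 samples and a batch size of 128, the orphan batch of 1 is avoided by yielding `[65, 64]` instead of `[128, 1]`, exactly as in the example above.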
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_source` | `Sized` | The dataset to sample from. | *required* |
| `batch_size` | `int` | The target number of samples in each batch. | *required* |
| `shuffle` | `bool` | If `True`, the sampler shuffles the indices at the start of each epoch. | `True` |
Source code in `src/autoencodix/data/_sampler.py`
__init__(data_source, batch_size, shuffle=True)
Initializes the BalancedBatchSampler.

Args:
- `data_source`: The dataset to sample from.
- `batch_size`: The target number of samples in each batch.
- `shuffle`: If `True`, the sampler shuffles the indices at the start of each epoch.
Source code in `src/autoencodix/data/_sampler.py`
__iter__()
Returns an iterator over batches of indices.
Source code in `src/autoencodix/data/_sampler.py`
__len__()
Returns the total number of batches in an epoch.
Source code in `src/autoencodix/data/_sampler.py`
DataFilter
Preprocesses dataframes, including filtering and scaling.
This class separates the filtering logic that needs to be applied consistently across train, validation, and test sets from the scaling logic that is typically fitted on the training data and then applied to the other sets.
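The fit-on-train, apply-everywhere pattern this class encapsulates can be sketched with plain z-score scaling. This is an illustrative sketch only; the actual scaler and its parameters are configured via `data_info`:

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(5.0, 2.0, size=(100, 3))  # training split
test = rng.normal(5.0, 2.0, size=(20, 3))    # held-out split

# Fit the scaling statistics on the training split only ...
mean, std = train.mean(axis=0), train.std(axis=0)

# ... then apply the *same* transformation to every split,
# so the test data never influences the fitted parameters.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std
```

The training split is now standardized, while the test split is scaled with the training statistics, so its mean is only approximately zero.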
Attributes:

| Name | Type | Description |
|---|---|---|
| `data_info` | | Configuration object containing preprocessing parameters. |
| `filtered_features` | `Optional[Set[str]]` | Set of features to keep after filtering on the training data. `None` initially. |
| `_scaler` | | The fitted scaler object. `None` initially. |
| `ontologies` | | Ontology information, if provided for Ontix. |
| `config` | | Configuration object containing default parameters. |
Source code in `src/autoencodix/data/_filter.py`
available_methods
property
Lists all available filtering methods.
Returns:

| Type | Description |
|---|---|
| `List[str]` | List of available filtering method names. |
__init__(data_info, config, ontologies=None)
Initializes the DataFilter with a configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data_info` | `DataInfo` | Configuration object containing preprocessing parameters. | *required* |
| `config` | `DefaultConfig` | Configuration object containing default parameters. | *required* |
| `ontologies` | `Optional[tuple]` | Ontology information, if provided for Ontix. | `None` |
Source code in `src/autoencodix/data/_filter.py`
filter(df, genes_to_keep=None)
Applies the configured filtering method to the dataframe.
This method is intended to be called on the training data to determine
which features to keep. The filtered_features attribute will be set
based on the result.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `DataFrame` | Input dataframe to be filtered (typically the training set). | *required* |
| `genes_to_keep` | `Optional[List]` | A list of gene names to explicitly keep. If provided, other filtering methods will be ignored. | `None` |

Returns:

| Type | Description |
|---|---|
| `Tuple[Union[Series, DataFrame], List[str]]` | A tuple containing the filtered dataframe and a list of the column names (features) that were kept. |
Raises:

| Type | Description |
|---|---|
| `KeyError` | If some genes in `genes_to_keep` are not present in the dataframe. |
Source code in `src/autoencodix/data/_filter.py`
fit_scaler(df)
Fits the scaler to the input dataframe (typically the training set).
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `Union[Series, DataFrame]` | Input dataframe to fit the scaler on. | *required* |

Returns:

| Type | Description |
|---|---|
| `Any` | The fitted scaler object. |
Source code in `src/autoencodix/data/_filter.py`
scale(df, scaler)
Applies the fitted scaler to the input dataframe.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `df` | `Union[Series, DataFrame]` | Input dataframe to be scaled. | *required* |
| `scaler` | `Any` | The fitted scaler object. | *required* |

Returns:

| Type | Description |
|---|---|
| `Union[Series, DataFrame]` | Scaled dataframe. |
Source code in `src/autoencodix/data/_filter.py`
DataPackage
dataclass
Represents a data package containing multiple types of data.
Source code in `src/autoencodix/data/datapackage.py`
__getitem__(key)
Allow dictionary-like access to top-level attributes.
Source code in `src/autoencodix/data/datapackage.py`
__iter__()
Make DataPackage iterable, yielding (key, value) pairs.
For dictionary attributes, yields nested items as (parent_key.child_key, value).
Source code in `src/autoencodix/data/datapackage.py`
__setitem__(key, value)
Allow dictionary-like item assignment to top-level attributes.
Source code in `src/autoencodix/data/datapackage.py`
format_shapes()
Format the shape dictionary in a clean, readable way.
Source code in `src/autoencodix/data/datapackage.py`
get_common_ids()
Get the common sample IDs across modalities that have data.
Returns:

| Type | Description |
|---|---|
| `List[str]` | List of sample IDs that are present in all modalities with data. |
Source code in `src/autoencodix/data/datapackage.py`
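Conceptually, finding the common IDs is an order-preserving intersection across all modalities that actually hold data. A sketch of that idea (illustrative only, not the actual implementation):

```python
def common_ids(id_lists):
    """Return sample IDs present in every non-empty modality,
    preserving the order of the first non-empty list."""
    non_empty = [ids for ids in id_lists if ids]
    if not non_empty:
        return []
    shared = set(non_empty[0]).intersection(*map(set, non_empty[1:]))
    return [i for i in non_empty[0] if i in shared]

# Empty modalities are skipped; order follows the first modality.
print(common_ids([["s1", "s2", "s3"], ["s3", "s1"], []]))  # → ['s1', 's3']
```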
get_modality_key(direction)
Get the first key for a specific direction's modality.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `direction` | `str` | Either `'from'` or `'to'`. | *required* |

Returns:

| Type | Description |
|---|---|
| `Optional[str]` | First key of the modality dictionary, or `None` if empty. |
Source code in `src/autoencodix/data/datapackage.py`
get_n_samples()
Get the number of samples for each data type in nested dictionary format.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Dict[str, int]]` | Dictionary with nested structure: `{modality_type: {sub_key: count}}` |
Source code in `src/autoencodix/data/datapackage.py`
is_empty()
Check if the data package is empty.
Source code in `src/autoencodix/data/datapackage.py`
shape()
Get the shape of the data for each data type in nested dictionary format.
Returns:

| Type | Description |
|---|---|
| `Dict[str, Dict[str, Any]]` | Dictionary with nested structure: `{modality_type: {sub_key: shape}}` |
Source code in `src/autoencodix/data/datapackage.py`
DataPackageSplitter
Splits DataPackage objects into training, validation, and testing sets.
Supports paired and unpaired (translation) splitting.
Attributes:

| Name | Type | Description |
|---|---|---|
| `data_package` | | The original DataPackage to split. |
| `config` | | The configuration settings for the splitting process. |
| `indices` | | The indices for each split (train/val/test). |
Source code in `src/autoencodix/data/_datapackage_splitter.py`
split()
Splits the underlying DataPackage into train, valid, and test subsets.

Returns:
A dictionary containing the split data packages for `"train"`, `"valid"`, and `"test"`. Each entry contains a `"data"` key with the DataPackage and an `"indices"` key with the corresponding indices.

Raises:
- `ValueError`: If no data package is available for splitting.
- `TypeError`: If indices are not provided for the unpaired translation case.
Source code in `src/autoencodix/data/_datapackage_splitter.py`
DataSplitter
Splits data into train, validation, and test sets, and validates the splits.
Custom splits can also be provided. Empty splits are allowed (e.g. `test_ratio=0`); this may raise an error later in the pipeline if that split is expected to be non-empty, but it allows more flexible usage (e.g. when the user only wants to run the fit step).
Constraints:
1. Split ratios must sum to 1.
2. Each non-empty split must have at least `min_samples_per_split` samples.
3. Any split ratio must be <= 1.0.
4. Custom splits must contain `'train'`, `'valid'`, and `'test'` keys and non-overlapping indices.
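Under these constraints, a ratio-based split reduces to slicing a shuffled index array. A simplified sketch (the class adds validation and custom-split handling on top of this):

```python
import numpy as np

def ratio_split(n_samples, train_ratio, valid_ratio, test_ratio, seed=42):
    """Partition sample indices into train/valid/test by ratio."""
    assert abs(train_ratio + valid_ratio + test_ratio - 1.0) < 1e-9
    indices = np.random.default_rng(seed).permutation(n_samples)
    n_train = int(n_samples * train_ratio)
    n_valid = int(n_samples * valid_ratio)
    return {
        "train": indices[:n_train],
        "valid": indices[n_train:n_train + n_valid],
        "test": indices[n_train + n_valid:],  # absorbs rounding leftovers
    }

splits = ratio_split(100, 0.8, 0.1, 0.1)
print({k: len(v) for k, v in splits.items()})  # → {'train': 80, 'valid': 10, 'test': 10}
```

Note that a ratio of 0 simply yields an empty index array for that split, matching the flexible-usage behavior described above.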
Attributes:

| Name | Type | Description |
|---|---|---|
| `_config` | | Configuration object containing split ratios. |
| `_custom_splits` | | Optional pre-defined split indices. |
Source code in `src/autoencodix/data/_datasplitter.py`
__init__(config, custom_splits=None)
Initialize DataSplitter with configuration and optional custom splits.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `config` | `DefaultConfig` | Configuration object containing split ratios. | *required* |
| `custom_splits` | `Optional[Dict[str, ndarray]]` | Pre-defined split indices. | `None` |
Source code in `src/autoencodix/data/_datasplitter.py`
split(n_samples)
Split data into train, validation, and test sets.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `n_samples` | `int` | Total number of samples in the dataset. | *required* |

Returns:

| Type | Description |
|---|---|
| `Dict[str, ndarray]` | Dictionary containing indices for each split, with empty arrays for splits with `ratio=0`. |

Raises:

| Type | Description |
|---|---|
| `ValueError` | If the resulting splits would violate size constraints. |
Source code in `src/autoencodix/data/_datasplitter.py`
DatasetContainer
dataclass
A container for datasets used in training, validation, and testing.
Attributes:

| Name | Type | Description |
|---|---|---|
| `train` | `Dataset` | The training dataset. |
| `valid` | `Dataset` | The validation dataset. |
| `test` | `Dataset` | The testing dataset. |
Source code in `src/autoencodix/data/_datasetcontainer.py`
__getitem__(key)
Allows dictionary-like access to datasets.
Source code in `src/autoencodix/data/_datasetcontainer.py`
__setitem__(key, value)
Allows dictionary-like assignment of datasets.
Source code in `src/autoencodix/data/_datasetcontainer.py`
GeneralPreprocessor
Bases: BasePreprocessor
Preprocessor for handling multi-modal data.
Attributes:

| Name | Type | Description |
|---|---|---|
| `_datapackage_dict` | `Optional[Dict[str, Any]]` | Dictionary holding DataPackage objects for each data split. |
| `_dataset_container` | `Optional[DatasetContainer]` | Container holding processed datasets for each split. |
| `_reverse_mapping_multi_bulk` | `Dict[str, Dict[str, Tuple[List[int], List[str]]]]` | Reverse mapping for multi-bulk data reconstruction. |
| `_reverse_mapping_multi_sc` | `Dict[str, Dict[str, Tuple[List[int], List[str]]]]` | Reverse mapping for multi-single-cell data reconstruction. |
Source code in `src/autoencodix/data/general_preprocessor.py`
ImageDataset
Bases: TensorAwareDataset
A custom PyTorch dataset that handles image data with proper dtype conversion.
Attributes:

| Name | Type | Description |
|---|---|---|
| `raw_data` | | List of ImgData objects containing original image data and metadata. |
| `config` | | Configuration object for dataset settings. |
| `mytype` | | Enum indicating the dataset type (set to `DataSetTypes.IMG`). |
| `data` | | List of image tensors converted to the appropriate dtype. |
| `sample_ids` | | List of identifiers for each sample. |
| `split_indices` | | Optional numpy array of indices for splitting the dataset. |
| `feature_ids` | | Optional list of identifiers for each feature (set to `None` for images). |
| `metadata` | | Optional pandas DataFrame containing additional metadata. |
Source code in `src/autoencodix/data/_image_dataset.py`
__getitem__(idx)
Gets the item at the given index; the data is already converted to the proper dtype.

Returns:
Tuple of `(index, image tensor, sample_id)`.
Source code in `src/autoencodix/data/_image_dataset.py`
__init__(data, config, split_indices=None, metadata=None)
Initializes the dataset.

Args:
- `data`: List of image data objects.
- `config`: Configuration object.
- `split_indices`: Optional indices for splitting the dataset.
- `metadata`: Optional DataFrame containing additional metadata.
Source code in `src/autoencodix/data/_image_dataset.py`
get_input_dim()
Gets the input dimension of the dataset's feature space.
Returns:

| Type | Description |
|---|---|
| `Tuple[int, ...]` | The input dimension of the dataset's feature space. |
Source code in `src/autoencodix/data/_image_dataset.py`
ImagePreprocessor
Bases: GeneralPreprocessor
Preprocessor for cross-modal data, handling multiple data types and their transformations.
Attributes:

| Name | Type | Description |
|---|---|---|
| `data_config` | | Configuration specific to data handling and preprocessing. |
| `dataset_dicts` | | Dictionary holding datasets for different splits (train/test/valid). |
Source code in `src/autoencodix/data/_image_processor.py`
preprocess(raw_user_data=None, predict_new_data=False)
Preprocess the data according to the configuration.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `raw_user_data` | `Optional[DataPackage]` | The raw data package provided by the user. | `None` |
| `predict_new_data` | `bool` | Flag indicating if new data is being predicted. | `False` |

Returns:
A DatasetContainer with processed training, validation, and test datasets.
Source code in `src/autoencodix/data/_image_processor.py`
ImgData
dataclass
Stores image data along with its associated metadata.
Attributes:

| Name | Type | Description |
|---|---|---|
| `img` | | The image data as a NumPy array. |
| `sample_id` | `str` | A unique identifier for the image sample. |
| `annotation` | `Union[Series, DataFrame]` | Annotations or metadata related to the image (as a Series or DataFrame). |
Source code in `src/autoencodix/data/_imgdataclass.py`
MultiModalDataset
Bases: BaseDataset, Dataset
Handles multiple datasets of different modalities.
Attributes:

| Name | Type | Description |
|---|---|---|
| `datasets` | | Dictionary of datasets for each modality. |
| `n_modalities` | | Number of modalities. |
| `sample_to_modalities` | | Mapping from sample IDs to available modalities. |
| `sample_ids` | `List[Any]` | List of all unique sample IDs across modalities. |
| `config` | | Configuration object. |
| `data` | | Data from the first modality (for compatibility). |
| `feature_ids` | | Feature IDs (currently `None`, to be implemented). |
| `_id_to_idx` | | Reverse lookup tables for sample IDs to indices per modality. |
| `paired_sample_ids` | | List of sample IDs that have data in all modalities. |
| `unpaired_sample_ids` | | List of sample IDs that do not have data in all modalities. |
Source code in `src/autoencodix/data/_multimodal_dataset.py`
is_fully_paired
property
Returns True if all samples are fully paired across all modalities (no unpaired samples).
__init__(datasets, config)
Initialize the MultiModalDataset.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `datasets` | `Dict[str, BaseDataset]` | Dictionary of datasets for each modality. | *required* |
| `config` | `DefaultConfig` | Configuration object. | *required* |
Source code in `src/autoencodix/data/_multimodal_dataset.py`
NaNRemover
Removes NaN values from multi-modal datasets.
This object identifies and removes NaN values from various data structures commonly used in single-cell and multi-modal omics, including AnnData, MuData, and Pandas DataFrames. It supports processing of X matrices, layers, and observation annotations within AnnData objects, as well as handling bulk and annotation data within a DataPackage.
Attributes:

| Name | Type | Description |
|---|---|---|
| `config` | | Configuration object containing settings for data processing. |
| `relevant_cols` | | List of columns in metadata to check for NaNs. |
Source code in `src/autoencodix/data/_nanremover.py`
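For plain DataFrames, the core operation is a row-wise drop restricted to the relevant columns. A sketch with pandas (the class extends this idea to AnnData/MuData X matrices, layers, and observation annotations):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gene_a": [1.0, np.nan, 3.0],
    "gene_b": [4.0, 5.0, 6.0],
    "note":   [None, "x", "y"],  # not in relevant_cols: NaNs tolerated here
})

# Only columns listed here are checked for NaNs.
relevant_cols = ["gene_a", "gene_b"]

# Drop only rows with NaNs in the columns that matter for training.
clean = df.dropna(subset=relevant_cols)
print(len(clean))  # → 2
```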
__init__(config)
Initializes the NaNRemover with configuration settings.

Args:
- `config`: Configuration object containing settings for data processing.
Source code in `src/autoencodix/data/_nanremover.py`
remove_nan(data)
Removes NaN values from all applicable DataPackage components.
Iterates through the bulk data, annotation data, and multi-modal single-cell data (MuData and AnnData objects) within the provided DataPackage and removes rows/columns/entries containing NaN values.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `DataPackage` | The DataPackage object containing multi-modal data. | *required* |

Returns:

| Type | Description |
|---|---|
| `DataPackage` | The DataPackage object with NaN values removed from its components. |
Source code in `src/autoencodix/data/_nanremover.py`
NumericDataset
Bases: TensorAwareDataset
A custom PyTorch dataset that handles tensors.
Attributes:

| Name | Type | Description |
|---|---|---|
| `data` | | The input features as a `torch.Tensor`. |
| `config` | | Configuration object containing settings for data processing. |
| `sample_ids` | | Optional list of sample identifiers. |
| `feature_ids` | | Optional list of feature identifiers. |
| `metadata` | | Optional pandas DataFrame containing metadata. |
| `split_indices` | | Optional numpy array for data splitting. |
| `mytype` | | Enum indicating the dataset type (set to `DataSetTypes.NUM`). |
Source code in `src/autoencodix/data/_numeric_dataset.py`
__getitem__(index)
Retrieves a single sample and its corresponding label.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `index` | `int` | Index of the sample to retrieve. | *required* |

Returns:

| Type | Description |
|---|---|
| `Union[Tuple[Union[Tensor, int], Union[Tensor, 'ImgData'], Any], Dict[str, Tuple[Any, Tensor, Any]]]` | A tuple containing the index, the data sample, and its label; or a dictionary mapping keys to such tuples when multiple uncombined data sources exist at this step. |
Source code in `src/autoencodix/data/_numeric_dataset.py`
__init__(data, config, sample_ids=None, feature_ids=None, metadata=None, split_indices=None)
Initialize the dataset
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `data` | `Union[Tensor, ndarray, spmatrix]` | Input features. | *required* |
| `config` | `DefaultConfig` | Configuration object. | *required* |
| `sample_ids` | `Union[None, List[Any]]` | Optional sample identifiers. | `None` |
| `feature_ids` | `Union[None, List[Any]]` | Optional feature identifiers. | `None` |
| `metadata` | `Optional[Union[Series, DataFrame]]` | Optional metadata. | `None` |
| `split_indices` | `Optional[Union[Dict[str, Any], List[Any], ndarray]]` | Optional split indices. | `None` |
Source code in `src/autoencodix/data/_numeric_dataset.py`
__len__()
Returns the number of samples (rows) in the dataset
Source code in `src/autoencodix/data/_numeric_dataset.py`
SingleCellFilter
Filters and scales single-cell data, returning a MuData object with synchronized metadata.
Attributes:

| Name | Type | Description |
|---|---|---|
| `data_info` | | Configuration for filtering and scaling (can be a single DataInfo or a dict of DataInfo per modality). |
| `total_features` | | Total number of features to keep across all modalities. |
| `config` | | Configuration object containing settings for data processing. |
| `_is_data_info_dict` | | Internal flag indicating if `data_info` is a dictionary. |
Source code in `src/autoencodix/data/_sc_filter.py`
__init__(data_info, config)
Initializes the single-cell filter.

Args:
- `data_info`: Either a single data_info object for all modalities or a dictionary of data_info objects, one per modality.
- `config`: Configuration object containing settings for data processing.
Source code in `src/autoencodix/data/_sc_filter.py`
distribute_features_across_modalities(mudata, total_features)
Distributes a total number of features across modalities evenly.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `mudata` | `MuData` | Multi-modal data object. | *required* |
| `total_features` | `Optional[int]` | Total number of features to distribute across all modalities. | *required* |

Returns:

| Type | Description |
|---|---|
| `Dict[str, int]` | Dictionary mapping modality keys to the number of features to keep. |
Source code in `src/autoencodix/data/_sc_filter.py`
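An even distribution with remainder handling can be sketched as follows (illustrative only; the real method works on the MuData object's modality keys):

```python
def distribute_features(modalities, total_features):
    """Split a feature budget evenly across modalities,
    giving leftover features to the first few modalities."""
    base, extra = divmod(total_features, len(modalities))
    return {
        mod: base + (1 if i < extra else 0)
        for i, mod in enumerate(modalities)
    }

# 100 features across three modalities: the remainder of 1 goes to the first.
print(distribute_features(["rna", "atac", "protein"], 100))
# → {'rna': 34, 'atac': 33, 'protein': 33}
```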
general_postsplit_processing(mudata, gene_map, scaler_map=None)
Process single-cell data with proper MuData handling.

Args:
mudata: Input multi-modal data container
gene_map: Optional override of genes to keep per modality
scaler_map: Optional pre-fitted scalers per modality and layer

Returns: Processed MuData with filtered and scaled modalities.
Source code in src/autoencodix/data/_sc_filter.py (lines 278-367)
presplit_processing(multi_sc)
Process each modality independently to filter cells based on min_genes.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| multi_sc | Union[MuData, Dict[str, MuData]] | Either a single MuData object or a dictionary of MuData objects. | required |
Returns: A dictionary mapping modality keys to processed MuData objects.
Source code in src/autoencodix/data/_sc_filter.py (lines 120-140)
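As a rough illustration of the min_genes criterion, here is a minimal NumPy sketch (the real method works per modality on MuData objects, and min_genes would come from the configuration):

```python
import numpy as np

def filter_cells_min_genes(X: np.ndarray, min_genes: int) -> np.ndarray:
    """Keep only cells (rows) expressing at least `min_genes` genes,
    counting a gene as expressed when its value is nonzero."""
    genes_per_cell = (X > 0).sum(axis=1)
    return X[genes_per_cell >= min_genes]

X = np.array([[1, 0, 3],   # 2 genes expressed -> kept
              [0, 0, 2],   # 1 gene expressed  -> dropped
              [5, 1, 0]])  # 2 genes expressed -> kept
print(filter_cells_min_genes(X, min_genes=2).shape)  # (2, 3)
```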
sc_postsplit_processing(mudata, gene_map=None)
Process each modality independently to filter genes based on the X layer, then consistently apply the same filtering to all layers.

Args:
mudata: Input multi-modal data container
gene_map: Optional override of genes to keep per modality
Returns:
| Type | Description |
|---|---|
| MuData | The processed multi-modal data |
| Dict[str, List[str]] | The genes kept per modality |
Source code in src/autoencodix/data/_sc_filter.py (lines 191-261)
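The "select genes on X, apply everywhere" idea can be sketched with plain NumPy arrays standing in for layers (the actual gene-selection criterion and MuData handling are more involved; a min_cells threshold is assumed here purely for illustration):

```python
import numpy as np

def filter_genes_all_layers(layers, min_cells):
    """Choose genes from the 'X' layer (expressed in >= min_cells cells)
    and apply the same column mask to every layer to keep them aligned."""
    keep = (layers["X"] > 0).sum(axis=0) >= min_cells
    return {name: mat[:, keep] for name, mat in layers.items()}, keep

layers = {
    "X":      np.array([[1, 0, 2], [3, 0, 0]]),
    "counts": np.array([[1, 0, 2], [3, 0, 0]]),
}
filtered, keep = filter_genes_all_layers(layers, min_cells=1)
print(filtered["counts"].shape)  # gene 1 is never expressed -> (2, 2)
```

Applying the single mask derived from X to every layer is what keeps the layers column-aligned after filtering.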
StackixDataset
Bases: NumericDataset
Dataset for handling multiple modalities in Stackix models.
This dataset holds individual BaseDataset objects for multiple data modalities and provides a consistent interface for accessing them during training. It's designed to work specifically with StackixTrainer.
Attributes:
| Name | Type | Description |
|---|---|---|
| dataset_dict | | Dictionary mapping modality names to dataset objects |
| modality_keys | | List of modality names |
Source code in src/autoencodix/data/_stackix_dataset.py (lines 8-114)
__getitem__(index)
Get a single sample and its label from the dataset.
Returns the data from the first modality to maintain compatibility with the BaseDataset interface, while still supporting multi-modality access through dataset_dict.

Args:
index: Index of the sample to retrieve
Returns:
| Type | Description |
|---|---|
| Union[Tuple[Tensor, Any], Dict[str, Tuple[Tensor, Any]]] | Dictionary of (data tensor, label) pairs for each modality |
Source code in src/autoencodix/data/_stackix_dataset.py (lines 78-96)
__init__(dataset_dict, config)
Initialize a StackixDataset instance.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset_dict | Dict[str, BaseDataset] | Dictionary mapping modality names to dataset objects | required |
| config | DefaultConfig | Configuration object | required |
Raises:
| Type | Description |
|---|---|
| ValueError | If the datasets dictionary is empty or if modality datasets have different numbers of samples |
| NotImplementedError | If the datasets have incompatible shapes for concatenation |
Source code in src/autoencodix/data/_stackix_dataset.py (lines 21-72)
__len__()
Return the number of samples in the dataset.
Source code in src/autoencodix/data/_stackix_dataset.py (lines 74-76)
get_modality_item(modality, index)
Get a sample for a specific modality.

Args:
modality: The modality name to retrieve data from
index: Index of the sample to retrieve
Returns:
| Type | Description |
|---|---|
| Tuple[Tensor, Any] | Tuple of (data tensor, label) for the specified modality and sample index |
Raises:
| Type | Description |
|---|---|
| KeyError | If the requested modality doesn't exist in the dataset |
Source code in src/autoencodix/data/_stackix_dataset.py (lines 98-114)
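The access pattern StackixDataset provides can be imitated with a toy stand-in (hypothetical class, illustration only; the real datasets return tensors rather than strings):

```python
class ToyStackixDataset:
    """Toy stand-in for StackixDataset: one list of (data, label) pairs
    per modality, all modalities aligned on the sample index."""

    def __init__(self, dataset_dict):
        if not dataset_dict:
            raise ValueError("dataset_dict must not be empty")
        if len({len(d) for d in dataset_dict.values()}) != 1:
            raise ValueError("modalities must have the same number of samples")
        self.dataset_dict = dataset_dict
        self.modality_keys = list(dataset_dict)

    def __len__(self):
        return len(next(iter(self.dataset_dict.values())))

    def __getitem__(self, index):
        # Dictionary of (data, label) pairs, one entry per modality.
        return {key: ds[index] for key, ds in self.dataset_dict.items()}

    def get_modality_item(self, modality, index):
        if modality not in self.dataset_dict:
            raise KeyError(f"unknown modality: {modality}")
        return self.dataset_dict[modality][index]

ds = ToyStackixDataset({
    "rna":  [("r0", 0), ("r1", 1)],
    "atac": [("a0", 0), ("a1", 1)],
})
print(len(ds), ds.get_modality_item("atac", 1))  # 2 ('a1', 1)
```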
StackixPreprocessor
Bases: BasePreprocessor
Preprocessor for Stackix architecture, which handles multiple modalities separately.
Unlike GeneralPreprocessor which combines all modalities, StackixPreprocessor keeps modalities separate for individual VAE training in the Stackix architecture.
Attributes:
config: Configuration parameters for preprocessing and model architecture
_datapackage: Dictionary storing processed data splits
_dataset_container: Container for processed datasets by split
Source code in src/autoencodix/data/_stackix_preprocessor.py (lines 18-335)
__init__(config, ontologies=None)
Initialize the StackixPreprocessor with the given configuration.

Args:
config: Configuration parameters for preprocessing
ontologies: Optional ontologies for data processing
Source code in src/autoencodix/data/_stackix_preprocessor.py (lines 30-39)
format_reconstruction(reconstruction, result=None)
Takes the reconstructed tensor and the modality it comes from, and uses dataset_dict to rebuild the format of the original datapackage; instead of the original values, the .data attribute is populated with the reconstructed tensor (as a pd.DataFrame or MuData object).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| reconstruction | Any | The reconstructed tensor | required |
| result | Optional[Result] | Optional Result object containing additional information | None |
Returns: DataPackage with reconstructed data in original format
Source code in src/autoencodix/data/_stackix_preprocessor.py (lines 203-236)
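For the DataFrame case, populating the .data attribute with the reconstruction amounts to re-wrapping the tensor with the original index and columns. A minimal pandas sketch (the helper name and the shape check are assumptions, not the library's API):

```python
import numpy as np
import pandas as pd

def rewrap_reconstruction(reconstruction: np.ndarray,
                          template: pd.DataFrame) -> pd.DataFrame:
    """Give a reconstructed array the original sample index and feature
    columns so it looks like the input data again."""
    if reconstruction.shape != template.shape:
        raise ValueError("reconstruction does not match the original shape")
    return pd.DataFrame(reconstruction,
                        index=template.index, columns=template.columns)

original = pd.DataFrame(np.zeros((2, 3)),
                        index=["s1", "s2"], columns=["g1", "g2", "g3"])
recon = rewrap_reconstruction(np.ones((2, 3)), original)
print(list(recon.columns))  # ['g1', 'g2', 'g3']
```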
preprocess(raw_user_data=None, predict_new_data=False)
Execute preprocessing steps for Stackix architecture.
Args:
raw_user_data: Raw user data to preprocess, or None to use self._datapackage
predict_new_data: Flag indicating if new data is being predicted
Returns:
| Type | Description |
|---|---|
| DatasetContainer | Container with MultiModalDataset for each split |
Raises:
| Type | Description |
|---|---|
| TypeError | If datapackage is None after preprocessing |
Source code in src/autoencodix/data/_stackix_preprocessor.py (lines 41-76)
TensorAwareDataset
Bases: BaseDataset
Handles dtype mapping and tensor conversion logic.
Source code in src/autoencodix/data/_numeric_dataset.py (lines 13-113)
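A sketch of what dtype mapping typically means here: choosing a tensor dtype from the NumPy dtype of the input. The mapping below is an assumption (float32 for features, int64 for integer labels, PyTorch's usual defaults), expressed as dtype-name strings so the sketch needs no torch import; the library's actual mapping may differ.

```python
import numpy as np

# Assumed mapping: floating-point data -> float32, integer data -> int64.
NUMPY_TO_TORCH = {
    "float64": "torch.float32",
    "float32": "torch.float32",
    "int64":   "torch.int64",
    "int32":   "torch.int64",
    "bool":    "torch.bool",
}

def target_torch_dtype(arr: np.ndarray) -> str:
    """Name of the torch dtype a NumPy array would be converted to,
    falling back to float32 for unknown dtypes."""
    return NUMPY_TO_TORCH.get(arr.dtype.name, "torch.float32")

print(target_torch_dtype(np.zeros(3)))  # torch.float32
```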
XModalPreprocessor
Bases: GeneralPreprocessor
Preprocessor for cross-modal data, handling multiple data types and their transformations.
Attributes:
| Name | Type | Description |
|---|---|---|
| data_config | | Configuration specific to data handling. |
| dataset_dicts | | Dictionary holding datasets for different splits (train, test, valid). |
Source code in src/autoencodix/data/_xmodal_preprocessor.py (lines 18-162)
__init__(config, ontologies=None)
Initializes the XModalPreprocessor.

Args:
config: Configuration object for the preprocessor.
ontologies: Optional ontologies for data processing.
Source code in src/autoencodix/data/_xmodal_preprocessor.py (lines 27-36)
preprocess(raw_user_data=None, predict_new_data=False)
Preprocess the data according to the configuration.

Args:
raw_user_data: Optional raw data provided by the user.
predict_new_data: Flag indicating if new data is being predicted.
Source code in src/autoencodix/data/_xmodal_preprocessor.py (lines 38-73)