Fix `ReduceONPlateau` wrong logic

mirror of https://github.com/qurator-spk/eynollah.git synced 2025-10-27 07:44:12 +01:00

# Training Script Improvements

## Learning Rate Management Fixes

### 1. ReduceLROnPlateau Implementation
- Fixed the learning rate reduction mechanism by replacing the manual epoch loop with a single `model.fit()` call
- This ensures proper tracking of validation metrics across epochs
- Configured with:
  ```python
  reduce_lr = ReduceLROnPlateau(
      monitor='val_loss',
      factor=0.2,        # Multiply the rate by 0.2 on a plateau (was 0.5)
      patience=3,        # Wait 3 epochs without improvement (was 4)
      min_lr=1e-6,       # Minimum learning rate
      min_delta=1e-5,    # Minimum change to be considered improvement
      verbose=1
  )
  ```
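
Why the single `model.fit()` call matters can be shown with a small pure-Python stand-in for the plateau rule. This is a sketch, not the Keras class: `PlateauTracker` and the loss values are made up for illustration, and it assumes the old loop called `model.fit()` once per epoch, which would reset the callback's `best` and `wait` state at the start of every call.

```python
import math


class PlateauTracker:
    """Simplified stand-in for the plateau rule; not the Keras class itself."""

    def __init__(self, lr, factor=0.2, patience=3, min_lr=1e-6, min_delta=1e-5):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.min_delta = min_delta
        self.reset()

    def reset(self):
        # Assumption: an equivalent reset happens whenever a new fit() call starts.
        self.best = math.inf
        self.wait = 0

    def on_epoch_end(self, val_loss):
        if val_loss < self.best - self.min_delta:
            # Improvement: remember it and restart the patience counter.
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                # Plateau confirmed: shrink the rate, but not below min_lr.
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0


val_losses = [0.50, 0.40, 0.40, 0.40, 0.40, 0.40]  # made-up plateau

# Single fit() call: state survives across epochs, so the plateau is detected.
single = PlateauTracker(lr=2e-5)
for loss in val_losses:
    single.on_epoch_end(loss)

# One fit() call per epoch: state is reset each time, so patience never runs out.
looped = PlateauTracker(lr=2e-5)
for loss in val_losses:
    looped.reset()
    looped.on_epoch_end(loss)

print(single.lr, looped.lr)
```

In the single-call case the rate drops once after three stalled epochs. In the per-epoch case it never changes, because every epoch looks like the first one.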

### 2. Warmup Implementation
- Added learning rate warmup using TensorFlow's native scheduling
- Gradually increases learning rate from 1e-6 to target (2e-5) over 5 epochs
- Helps stabilize initial training phase
- Implemented using `PolynomialDecay` schedule:
  ```python
  lr_schedule = tf.keras.optimizers.schedules.PolynomialDecay(
      initial_learning_rate=warmup_start_lr,
      decay_steps=warmup_epochs * steps_per_epoch,
      end_learning_rate=learning_rate,
      power=1.0  # Linear ramp: the rate rises because end_learning_rate > initial_learning_rate
  )
  ```
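
The ramp can be checked by hand with the polynomial formula for `power=1.0`. The sketch below is plain Python rather than the TensorFlow schedule, and `steps_per_epoch=100` is an assumed value chosen only to make the numbers easy to read.

```python
def warmup_lr(step, start_lr=1e-6, target_lr=2e-5, warmup_epochs=5, steps_per_epoch=100):
    """Linear warmup from start_lr to target_lr, then constant at target_lr."""
    decay_steps = warmup_epochs * steps_per_epoch
    step = min(step, decay_steps)          # the schedule holds its end value afterwards
    remaining = 1.0 - step / decay_steps   # 1.0 at the start, 0.0 once warmup is done
    return (start_lr - target_lr) * remaining + target_lr


for step in (0, 250, 500, 1000):
    print(step, warmup_lr(step))
```

Under these assumptions the rate starts at 1e-6, reaches roughly 1.05e-5 halfway through warmup, and stays at 2e-5 from step 500 onward.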

### 3. Early Stopping
- Added early stopping to prevent overfitting
- Configured with:
  ```python
  early_stopping = EarlyStopping(
      monitor='val_loss',
      patience=6,
      restore_best_weights=True,
      verbose=1
  )
  ```
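
To make the effect of `patience=6` concrete, here is a simplified pure-Python version of the stopping rule. It is a sketch with made-up loss values, not the Keras implementation; `early_stopping_outcome` is a hypothetical helper.

```python
def early_stopping_outcome(val_losses, patience=6):
    """Return (stop_epoch, best_epoch), both 1-based.

    stop_epoch is None when training runs through every epoch.
    """
    best = float("inf")
    best_epoch = None
    wait = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            # New best validation loss: remember the epoch and reset the counter.
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch


losses = [0.90, 0.70, 0.60] + [0.65] * 10  # improves for 3 epochs, then stalls
print(early_stopping_outcome(losses))
```

With these values training stops after epoch 9, and the weights from epoch 3 are the ones that `restore_best_weights=True` would bring back.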

## Model Saving Improvements

### 1. Epoch-based Model Saving
- Implemented custom `ModelCheckpointWithConfig` to save both model and config
- Saves after each epoch with corresponding config.json
- Maintains compatibility with original script's saving behavior
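
The idea behind `ModelCheckpointWithConfig` can be sketched without any framework dependency. The class below is not the code from this commit: `EpochCheckpointWithConfig`, the `save_model` callable, and the `model_<epoch>` directory naming are all assumptions made for illustration.

```python
import json
import tempfile
from pathlib import Path


class EpochCheckpointWithConfig:
    """Framework-agnostic sketch of a per-epoch checkpoint that also writes config.json."""

    def __init__(self, save_model, config, output_dir):
        self.save_model = save_model        # callable that writes the model into a directory
        self.config = config                # training configuration as a plain dict
        self.output_dir = Path(output_dir)

    def on_epoch_end(self, epoch, logs=None):
        # Hypothetical naming scheme: one directory per epoch.
        target = self.output_dir / f"model_{epoch}"
        target.mkdir(parents=True, exist_ok=True)
        self.save_model(target)
        with open(target / "config.json", "w") as fh:
            json.dump(self.config, fh, indent=4)
        return target


# Demo with a dummy save function that only records where it was asked to save.
saved_to = []
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = EpochCheckpointWithConfig(saved_to.append, {"learning_rate": 2e-5}, tmp)
    first = checkpoint.on_epoch_end(0)
    written_config = json.loads((first / "config.json").read_text())

print(first.name, written_config)
```

In a Keras training script the same `on_epoch_end` body would presumably live in a callback subclass, with the model's own save method in place of the dummy callable.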

### 2. Best Model Saving
- Saves a model once training ends; which one depends on how training finished:
  - If early stopping triggers: saves the best model from training
  - If early stopping does not trigger: saves the final model

## Configuration
All parameters are configurable through the JSON config file:
```json
{
    "reduce_lr_enabled": true,
    "reduce_lr_monitor": "val_loss",
    "reduce_lr_factor": 0.2,
    "reduce_lr_patience": 3,
    "reduce_lr_min_lr": 1e-6,
    "reduce_lr_min_delta": 1e-5,
    "early_stopping_enabled": true,
    "early_stopping_monitor": "val_loss",
    "early_stopping_patience": 6,
    "early_stopping_restore_best_weights": true,
    "warmup_enabled": true,
    "warmup_epochs": 5,
    "warmup_start_lr": 1e-6
}
```
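
One way such flags could be turned into callback arguments is sketched below. The key names come from the JSON above. The `callback_kwargs` helper and the fallback defaults are assumptions, not code from this commit.

```python
def callback_kwargs(config):
    """Collect keyword arguments for each callback that the config enables."""
    selected = {}
    if config.get("reduce_lr_enabled", False):
        selected["reduce_lr"] = {
            "monitor": config.get("reduce_lr_monitor", "val_loss"),
            "factor": config.get("reduce_lr_factor", 0.2),
            "patience": config.get("reduce_lr_patience", 3),
            "min_lr": config.get("reduce_lr_min_lr", 1e-6),
            "min_delta": config.get("reduce_lr_min_delta", 1e-5),
        }
    if config.get("early_stopping_enabled", False):
        selected["early_stopping"] = {
            "monitor": config.get("early_stopping_monitor", "val_loss"),
            "patience": config.get("early_stopping_patience", 6),
            "restore_best_weights": config.get("early_stopping_restore_best_weights", True),
        }
    return selected


example = {
    "reduce_lr_enabled": True,
    "reduce_lr_factor": 0.2,
    "reduce_lr_patience": 3,
    "early_stopping_enabled": False,
}
print(callback_kwargs(example))
```

Each resulting dict can then be unpacked into the matching constructor, for example `ReduceLROnPlateau(**selected["reduce_lr"])`, so a disabled feature never creates a callback.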

## Benefits
1. More stable training with proper learning rate management
2. Better handling of training plateaus
3. Automatic saving of best model
4. Maintained compatibility with existing config saving
5. Improved training monitoring and control

Author: johnlockejrr (committed by GitHub)
Date: 2025-05-17 23:24:40 +03:00
Parent: 7661080899
Commit: f298643fcf
Signature: no known key found for this signature in database (GPG key ID: B5690EEEBB952194)

2 changed files with 44 additions and 41 deletions

									
train_no_patches_448x448.json (7 lines changed)

    @@ -7,7 +7,7 @@
         "input_width" : 448,
         "weight_decay" : 1e-4,
         "n_batch" : 4,
    -    "learning_rate": 5e-5,
    +    "learning_rate": 2e-5,
         "patches" : false,
         "pretraining" : true,
         "augmentation" : true,
    @@ -39,8 +39,9 @@
         "dir_output": "runs/sam_41_mss_npt_448x448",
         "reduce_lr_enabled": true,
         "reduce_lr_monitor": "val_loss",
    -    "reduce_lr_factor": 0.5,
    -    "reduce_lr_patience": 4,
    +    "reduce_lr_factor": 0.2,
    +    "reduce_lr_patience": 3,
    +    "reduce_lr_min_delta": 1e-5,
         "reduce_lr_min_lr": 1e-6,
         "early_stopping_enabled": true,
         "early_stopping_monitor": "val_loss",