MACEFittingParameters

class MACEFittingParameters(experiment_name, training_data_filepath=None, testing_data_filepath=None, isolated_atom_energies=None, validation_fraction=None, batch_size=None, validation_batch_size=None, max_number_of_epochs=None, patience=None, device=None, distributed_training=None, random_seed=None, keep_checkpoints=None, number_of_workers=None, default_dtype=None, number_of_channels=None, max_l_equivariance=None, distance_cutoff=None, energy_key=None, forces_key=None, stress_key=None, energy_weight=None, forces_weight=None, stress_weight=None, loss_function_type=None, compute_stress=None, evaluation_interval=None, error_measure=None, model_type=None, number_of_interactions=None, correlation_order=None, max_ell=None, number_of_radial_basis_functions=None, scaling_type=None, learning_rate=None, weight_decay=None, restart_from_last_checkpoint=None, use_exponential_moving_average=None, exponential_moving_average_decay=None, scheduler_patience=None, gradient_clipping_threshold=None, save_for_cpu=None, log_directory=None, results_directory=None, checkpoint_directory=None, foundation_model_path=None, use_multiheads_finetuning=None, pretrained_head_train_file=None, number_of_samples_from_pretrained_head=None, mp_data_path=None, mp_descriptors_path=None, additional_parameters=None)

A parameter class for passing settings for the training of MACE models [1][2]. The class acts as a QuantumATK interface to the mace-torch package functionality. The parameters are used for setting up from-scratch fitting or fine-tuning of MACE models to custom data. The default parameters mostly follow the default settings in the open source mace-torch package, with a few exceptions chosen to increase ease of use. Advanced settings not covered by the parameters available via this class can be set using the additional_parameters dictionary, via keys corresponding to advanced open source MACE arguments.
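
For illustration, a minimal construction sketch using only parameters documented below (the experiment name is the only required argument; the values shown are the documented defaults, made explicit):

# Minimal sketch: experiment_name is required; all other settings fall
# back to the documented defaults.
mace_fp = MACEFittingParameters(
    experiment_name='my_mace_experiment',
    distance_cutoff=5.0*Angstrom,  # documented default, shown explicitly
    batch_size=8,                  # documented default, shown explicitly
)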

Parameters:
  • experiment_name (str) – The name of the experiment.

  • training_data_filepath (str) – The path to the training data.
    Default: ‘train.xyz’

  • testing_data_filepath (str) – The path to the testing data.
    Default: None

  • isolated_atom_energies (dict of Element: PhysicalQuantity of type energy) – The energy of an isolated atom for each species. The dictionary should contain the species as keys and the energies as values (see the example after this parameter list).
    Default: {}

  • validation_fraction (float) – The fraction of the training data to use for validation.
    Default: 0.1

  • batch_size (int) – The batch size for training.
    Default: 8

  • validation_batch_size (int) – The batch size for validation.
    Default: 8

  • max_number_of_epochs (int) – The maximum number of epochs to train for.
    Default: 200

  • patience (int) – The number of epochs to wait before early stopping.
    Default: 40

  • device (Automatic | MACEParameterOptions.DEVICE option) – The device to train on.
    Default: Automatic

  • distributed_training (bool) – Whether to use distributed (multi GPU) training.
    Default: False

  • random_seed (int) – The random seed used for splitting the provided training data into validation and training data sets.
    Default: 123

  • keep_checkpoints (bool) – Whether to keep checkpoints.
    Default: True

  • number_of_workers (int) – The number of workers for data loading.
    Default: 0

  • default_dtype (MACEParameterOptions.DTYPE option of type str) – The default torch data type.
    Default: MACEParameterOptions.DTYPE.FLOAT64

  • number_of_channels (int) – The number of embedding channels.
    Default: 128

  • max_l_equivariance (int) – The max L equivariance of the message.
    Default: 1

  • distance_cutoff (PhysicalQuantity of type length) – The radial cutoff distance for each node.
    Default: 5.0*Angstrom

  • energy_key (str) – The key for the energy in the dataset.
    Default: ‘REF_energy’

  • forces_key (str) – The key for the forces in the dataset.
    Default: ‘REF_forces’

  • stress_key (str) – The key for the stress in the dataset.
    Default: ‘REF_stress’

  • energy_weight (float) – The weight of the energy loss.
    Default: 1

  • forces_weight (float) – The weight of the forces loss.
    Default: 100

  • stress_weight (float) – The weight of the stress loss.
    Default: 1

  • loss_function_type (MACEParameterOptions.LOSS_TYPE option of type str) – The loss function type to use.
    Default: MACEParameterOptions.LOSS_TYPE.WEIGHTED

  • compute_stress (bool) – Whether to compute the stress.
    Default: False

  • evaluation_interval (int) – The epoch interval at which evaluations should occur.
    Default: 1

  • error_measure (MACEParameterOptions.ERROR_MEASURE option of type str) – Type of error table produced at the end of the training.
    Default: MACEParameterOptions.ERROR_MEASURE.PER_ATOM_RMSE

  • model_type (MACEParameterOptions.MODEL_TYPE option of type str) – The model type to train.
    Default: MACEParameterOptions.MODEL_TYPE.MACE

  • number_of_interactions (int) – The number of interactions.
    Default: 2

  • correlation_order (int) – The correlation order at each layer.
    Default: 3

  • max_ell (int) – The highest ell of spherical harmonics.
    Default: 3

  • number_of_radial_basis_functions (int) – The number of radial basis functions.
    Default: 8

  • scaling_type (MACEParameterOptions.SCALING_TYPE option of type str) – The type of scaling to the output.
    Default: MACEParameterOptions.SCALING_TYPE.RMS_FORCES_SCALING

  • learning_rate (float) – The learning rate of the optimizer.
    Default: 0.005

  • weight_decay (float) – The weight decay (L2 penalty).
    Default: 5e-7

  • restart_from_last_checkpoint (bool) – Whether to restart training from the last saved checkpoint. To ensure easy restartability/resubmittability from the job tool, this setting has to be activated from the first submission.
    Default: True

  • use_exponential_moving_average (bool) – Whether to use exponential moving average.
    Default: False

  • exponential_moving_average_decay (float) – The decay rate of the exponential moving average.
    Default: 0.99

  • scheduler_patience (int) – The patience of the scheduler.
    Default: 5

  • gradient_clipping_threshold (float) – The gradient clipping threshold for clipping large gradients during training.
    Default: 10

  • save_for_cpu (bool) – Whether to save a model loadable on CPU.
    Default: True

  • log_directory (str) – The directory to save logs.
    Default: None

  • results_directory (str) – The directory to save results.
    Default: None

  • checkpoint_directory (str) – The directory to save checkpoints.
    Default: None

  • foundation_model_path (str) – The path to the foundation MACE model to finetune.
    Default: None

  • use_multiheads_finetuning (bool) – Whether to use multiheads finetuning.
    Default: False

  • pretrained_head_train_file (str) – The path to the pretrained head train file for multiheads finetuning.
    Default: None

  • number_of_samples_from_pretrained_head (int) – The number of samples from the pretrained head for multiheads finetuning.
    Default: 10000

  • mp_data_path (str) – The path to the MP data for finetuning an MP foundation model.
    Default: ‘’

  • mp_descriptors_path (str) – The path to the MP descriptors for finetuning an MP foundation model.
    Default: ‘’

  • additional_parameters (dict) – Additional parameters for the MACE model. Here parameters that have not been included in the parameter class, but that are available in the mace-torch package, can be set as dictionary entries. Keys must match mace-torch input parameter names exactly and values must be in the right format as expected by the open source package.
    Default: {}
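
As a sketch of the expected isolated_atom_energies format (the element symbols and energy values below are purely illustrative, not computed reference values):

# Illustrative isolated-atom reference energies; real values should be
# computed with the same calculator used for the training data.
isolated_atom_energies = {
    Silicon: -107.25*eV,  # illustrative value only
    Oxygen: -432.10*eV,   # illustrative value only
}
mace_fp = MACEFittingParameters(
    experiment_name='mace_with_reference_energies',
    isolated_atom_energies=isolated_atom_energies,
)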

additionalParameters()
Returns:

Additional parameters for the MACE model.

Return type:

dict

batchSize()
Returns:

The batch size for training.

Return type:

int

checkpointDirectory()
Returns:

The directory to save checkpoints.

Return type:

str

computeStress()
Returns:

Whether to compute the stress.

Return type:

bool

correlationOrder()
Returns:

The correlation order at each layer.

Return type:

int

defaultDtype()
Returns:

The default torch data type.

Return type:

str

device()
Returns:

The device to train on.

Return type:

str

distanceCutoff()
Returns:

The radial cutoff distance for each node.

Return type:

PhysicalQuantity of type length

distributedTraining()
Returns:

Whether to use distributed (multi GPU) training.

Return type:

bool

energyKey()
Returns:

The key for the energy in the dataset.

Return type:

str

energyWeight()
Returns:

The weight of the energy loss.

Return type:

float

errorMeasure()
Returns:

Type of error table produced at the end of the training.

Return type:

str

evaluationInterval()
Returns:

The epoch interval at which evaluations should occur.

Return type:

int

experimentName()
Returns:

The name of the experiment.

Return type:

str

exponentialMovingAverageDecay()
Returns:

The decay rate of the exponential moving average.

Return type:

float

forcesKey()
Returns:

The key for the forces in the dataset.

Return type:

str

forcesWeight()
Returns:

The weight of the forces loss.

Return type:

float

foundationModelPath()
Returns:

The path to the foundation MACE model to be finetuned.

Return type:

str

gradientClippingThreshold()
Returns:

The gradient clipping threshold.

Return type:

float

isolatedAtomEnergies()
Returns:

The energy of an isolated atom for each species.

Return type:

dict of {Element: PhysicalQuantity of type energy}

keepCheckpoints()
Returns:

Whether to keep checkpoints.

Return type:

bool

learningRate()
Returns:

The learning rate of the optimizer.

Return type:

float

logDirectory()
Returns:

The directory to save logs.

Return type:

str

lossFunctionType()
Returns:

The loss function type to use.

Return type:

str

maxEll()
Returns:

The highest ell of spherical harmonics.

Return type:

int

maxLEquivariance()
Returns:

The max L equivariance of the message.

Return type:

int

maxNumberOfEpochs()
Returns:

The maximum number of epochs to train for.

Return type:

int

modelType()
Returns:

The model type to train.

Return type:

str

mpDataPath()
Returns:

The path to the MP data for finetuning an MP foundation model.

Return type:

str

mpDescriptorsPath()
Returns:

The path to the MP descriptors for finetuning an MP foundation model.

Return type:

str

nlinfo()
Returns:

The nlinfo.

Return type:

dict

numberOfChannels()
Returns:

The number of embedding channels.

Return type:

int

numberOfInteractions()
Returns:

The number of interactions.

Return type:

int

numberOfRadialBasisFunctions()
Returns:

The number of radial basis functions.

Return type:

int

numberOfSamplesFromPretrainedHead()
Returns:

The number of samples from the pretrained head to use for Multihead Replay Finetuning.

Return type:

int

numberOfWorkers()
Returns:

The number of workers for data loading.

Return type:

int

patience()
Returns:

The number of epochs to wait before early stopping.

Return type:

int

pretrainedHeadTrainFile()
Returns:

The path to the pretrained head train file.

Return type:

str

randomSeed()
Returns:

The random seed.

Return type:

int

restartFromLastCheckpoint()
Returns:

Whether to restart training from the last saved checkpoint.

Return type:

bool

resultsDirectory()
Returns:

The directory to save results.

Return type:

str

saveForCpu()
Returns:

Whether to save a model loadable on CPU.

Return type:

bool

scalingType()
Returns:

The type of scaling to the output.

Return type:

str

schedulerPatience()
Returns:

The patience of the scheduler.

Return type:

int

stressKey()
Returns:

The key for the stress in the dataset.

Return type:

str

stressWeight()
Returns:

The weight of the stress loss.

Return type:

float

testingDataFilepath()
Returns:

The path to the testing data.

Return type:

str

trainingDataFilepath()
Returns:

The path to the training data.

Return type:

str

uniqueString()

Return a unique string representing the state of the object.

useExponentialMovingAverage()
Returns:

Whether to use exponential moving average.

Return type:

bool

useMultiheadsFinetuning()
Returns:

Whether to use multiheads finetuning.

Return type:

bool

validationBatchSize()
Returns:

The batch size for validation.

Return type:

int

validationFraction()
Returns:

The fraction of the training data to use for validation.

Return type:

float

weightDecay()
Returns:

The weight decay (L2 penalty).

Return type:

float

Usage Examples

The MACEFittingParameters class is used for choosing settings for training MACE Machine Learned Force Field (MLFF) models [1][2] using the MachineLearnedForceFieldTrainer. Once trained, the model is automatically converted into a QuantumATK-compatible format, and a TremoloXCalculator can be set up using the trained model.

There are three main usage scenarios for the MACEFittingParameters class.

  1. Training a MACE model from scratch.

This is the most straightforward way to train a MACE model: the model is trained only on data provided by the user. The model size and hyperparameters are fully customizable, allowing users to define their own optimal tradeoff between model accuracy and performance. The example below shows a simple training script that trains a MACE model from scratch, highlighting some of the most important parameters.


# Setup MACE parameters for training from scratch
mace_fp = MACEFittingParameters(
    # Name of the model - has to be set
    experiment_name='mace_experiment1',

    # Most important model size parameters (affects accuracy and speed)
    max_l_equivariance=0,
    number_of_channels=64,
    distance_cutoff=4.0*Angstrom,

    # Weights for the different parts of the loss term
    energy_weight=1.0,
    forces_weight=100.0,
    stress_weight=1.0,
    loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,

    # Most relevant other parameters regarding training setup
    validation_fraction=0.2, # Ratio of training data used for validation
    max_number_of_epochs=200, # Number of epochs to train for
    patience=50, # Number of epochs without improvement before training stops early
    batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
    validation_batch_size=4, # Generally, higher means quicker validation, but too large a batch size can cause GPU memory errors
    random_seed=123, # Can be used for experimenting with the influence on different data splits
    default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and training speed
    device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available

    # Include stress if desired and present in the training data
    compute_stress=True,
)

# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
    fitting_parameters=mace_fp,
    training_sets=training_set,
    calculator=calculator
)

# Run the training
mlfft.train()

  2. Training with Naive Finetuning.

This is the simplest way to finetune a MACE model. The model is trained on a new dataset, with the model weights and hyperparameters initialized from the values of the pretrained model. Any hyperparameter settings related to the neural network architecture provided in the finetuning MACEFittingParameters object will be overwritten. This ensures that the model size and architecture remain consistent with the pretrained model.


# Setup MACE parameters for training with Naive Finetuning
mace_fp = MACEFittingParameters(
    # Name of the model - has to be set
    experiment_name='mace_experiment2',

    # Weights for the different parts of the loss term and type of loss
    energy_weight=1.0,
    forces_weight=100.0,
    stress_weight=1.0,
    loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,

    # Most relevant parameters regarding training setup
    validation_fraction=0.2, # Ratio of training data used for validation
    max_number_of_epochs=30, # Number of epochs to train for - keep low for finetuning
    batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
    validation_batch_size=4, # Generally, higher means quicker validation, but too large a batch size can cause GPU memory errors
    random_seed=123, # Can be used for experimenting with the influence on different data splits
    default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and training speed
    device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available

    # When using naive finetuning on a custom model (or on a MP foundation model)
    foundation_model_path='/path/to/model/data/on/run/machine/my_previous_mace_model.model',
)

# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
    fitting_parameters=mace_fp,
    training_sets=training_set,
    calculator=calculator
)

# Run the training
mlfft.train()

  3. Training with Multihead Replay Finetuning.

This is the most advanced way to finetune a MACE model. It is also the recommended approach for finetuning any Materials Project (MP) foundation model [3][4], as this usually leads to more robust and stable models. The model is trained on both the new dataset and (some of) the data used to train the original model, in separate model heads. The model heads have different readout weights but share the remaining model weights to be optimized during the fitting. This ensures that the fitting takes both data sets into account, learning from the new data while retaining the knowledge of the original model as well as possible. The model weights and the hyperparameters are initialized from the values of the pretrained model. Any hyperparameter settings related to the neural network architecture provided in the finetuning MACEFittingParameters object will be overwritten. This ensures that the model size and architecture remain consistent with the pretrained model.


# Setup MACE parameters for training with Multihead Replay Finetuning
mace_fp = MACEFittingParameters(
    # Name of the model - has to be set
    experiment_name='mace_experiment3',

    # Weights for the different parts of the loss term and type of loss
    energy_weight=1.0,
    forces_weight=100.0,
    stress_weight=1.0,
    loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,

    # Most relevant parameters regarding training setup
    validation_fraction=0.2, # Ratio of training data used for validation
    max_number_of_epochs=20, # Number of epochs to train for - keep low for finetuning
    batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
    validation_batch_size=4, # Generally, higher means quicker validation, but too large a batch size can cause GPU memory errors
    random_seed=123, # Can be used for experimenting with the influence on different data splits
    default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and speed
    device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available

    # When using multihead replay finetuning on MP foundation models
    use_multiheads_finetuning=True,
    foundation_model_path='/path/to/model/data/on/run/machine/mace_agnesi_small.model',
    number_of_samples_from_pretrained_head=15000, # Number of samples from the original model training data to use for the multihead replay finetuning
    pretrained_head_train_file='mp', # Use 'mp' for an MP foundation model; use the path to an XYZ file with training data for other custom models
    mp_data_path='/path/to/model/data/on/run/machine/mp_traj_combined.xyz',
    mp_descriptors_path='/path/to/model/data/on/run/machine/descriptors.npy',
)

# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
    fitting_parameters=mace_fp,
    training_sets=training_set,
    calculator=calculator
)

# Run the training
mlfft.train()

The example scripts, including a full training setup with MachineLearnedForceFieldTrainer and TremoloXCalculator for retrieving the trained models as calculators, are available for download:

  1. mace_example1.py

  2. mace_example2.py

  3. mace_example3.py

Notes

The MACEFittingParameters class provides an interface for training MACE MLFFs [1][2], utilizing the open source mace-torch package developed by the MACE authors. The framework provides a multitude of settings that aid the creation of accurate and performant MLFFs.

In order to use any MACE training functionality, at least one TrainingSet with precomputed energies, forces, and, if desired, stress must be available. The calculator used for computing the TrainingSet data is likewise assumed to be either attached to the TrainingSet or defined via the calculator parameter of the MachineLearnedForceFieldTrainer class, unless isolated atom energies are passed explicitly in the MACEFittingParameters. Importantly, the calculator should use spin-polarized calculations in order to produce accurate isolated-atom reference energies. If previous MACE model training data is stored in XYZ files, these files can be passed into the MACEFittingParameters object using the training_data_filepath and testing_data_filepath arguments. However, this requires the isolated atom energies to be provided explicitly through the isolated_atom_energies argument, as sketched below. In QuantumATK, it is recommended to supply TrainingSets to the MachineLearnedForceFieldTrainer class for a smooth training experience.
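
A minimal sketch of this XYZ-based setup; the file paths and the reference energy below are illustrative only:

# Train directly from existing XYZ files instead of TrainingSets.
mace_fp = MACEFittingParameters(
    experiment_name='mace_from_xyz',
    training_data_filepath='train.xyz',  # illustrative path
    testing_data_filepath='test.xyz',    # illustrative path
    # Required when bypassing TrainingSets; the value is illustrative only.
    isolated_atom_energies={Silicon: -107.25*eV},
)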

The MACEFittingParameters class is a simplified version of the complete set of MACE training parameters available in the open source package, with more self-explanatory parameter names and a reduced number of parameters to simplify general user interaction. MACE training parameters that have been left out of the MACEFittingParameters class, but that are available in the open source training framework, can still be passed to the MACEFittingParameters object as key-value pairs in a Python dictionary supplied via the additional_parameters parameter, as sketched below.
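
In this sketch, 'amsgrad' is used as an assumed example of a mace-torch argument name; keys must exactly match the arguments accepted by the installed mace-torch version:

mace_fp = MACEFittingParameters(
    experiment_name='mace_advanced',
    # Forwarded to mace-torch; 'amsgrad' is an assumed example key.
    additional_parameters={'amsgrad': True},
)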

Most of the parameters have default values corresponding to the defaults in the open source package, but some have been adapted in the MACEFittingParameters class for ease of use and more appropriate settings for most QuantumATK users. All used default values are documented in the parameter list above.

The mp_descriptors_path and mp_data_path parameters have been introduced in order to avoid automatic downloading of the MACE descriptors and training data files containing the Materials Project data used for training MP foundation models. In order to perform Multihead Replay Finetuning with MP foundation models, these parameters therefore have to be set to local paths where the MACE descriptors and training data files are stored on the cluster/location where the training will be performed - see the Multihead Replay Finetuning example script above. These files have to be downloaded/transferred to the compute file system from the MACE-foundations repository [4] in order to make finetuning from MP foundation models possible. There, it is also possible to check which models are currently recommended as starting points for finetuning.

The MP foundation models [3], their underlying training data [5], and (in the case of MP foundation models) their descriptors file, all of which are required for performing Multihead Replay Finetuning, are available via the mace-foundations repo of ACESuit [4]. It should be noted that the training data was calculated with VASP calculators and that full compatibility with QuantumATK-native calculators is not guaranteed. This is therefore a prime use case for Multihead Replay Finetuning, which supports heads using different DFT definitions. Total energies may differ, but derivative quantities such as forces and stress show good agreement between similar levels of theory. This means that, in terms of absolute energies, pretrained models may have to partially relearn different energy scales, but the interatomic relationships learned by a foundation model remain a strong foundation for continued learning even when the level or type of theory changes. In practice, the different model heads have separate weights for the readout layer but otherwise share the remaining model weights.

After training a model via the general training flow outlined above, the resulting model can be used directly with the TremoloXCalculator, as shown in the downloadable example scripts, and is loadable via the MachineLearnedForceFieldCalculator block in the Workflow Builder. While the main use case within QuantumATK utilizes the .qatkpt output file, the regular MACE .model output file also remains available after training.
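
A hypothetical sketch of this final step; the loading helper shown here is a placeholder, and the downloadable example scripts contain the authoritative TremoloXCalculator setup:

# NOTE: 'loadTrainedMACEPotential' is a hypothetical placeholder for the
# actual loading step of the trained '.qatkpt' parameter file.
potential_set = loadTrainedMACEPotential('mace_experiment1.qatkpt')
calculator = TremoloXCalculator(parameters=potential_set)
bulk_configuration.setCalculator(calculator)  # configuration defined elsewhere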

GPU acceleration of MACE training using CUDA is supported. The default way to activate GPU acceleration is by checking the “Enable proprietary GPU acceleration” option in the Job Manager or by using the atkpython_gpu command instead of atkpython, as described in the technical notes.

To utilize multiple GPUs for training, enable the distributed_training parameter and ensure that the training script runs with a single process per GPU, as sketched below.
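
A minimal sketch of such a multi-GPU setup; launching one process per GPU (via the job tool or an MPI launcher) is assumed to be handled outside the script:

mace_fp = MACEFittingParameters(
    experiment_name='mace_multi_gpu',
    distributed_training=True,  # enable distributed (multi GPU) training
    device=Automatic,           # each process picks up its assigned GPU
)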