MACEFittingParameters¶
- class MACEFittingParameters(experiment_name, training_data_filepath=None, testing_data_filepath=None, isolated_atom_energies=None, validation_fraction=None, batch_size=None, validation_batch_size=None, max_number_of_epochs=None, patience=None, device=None, distributed_training=None, random_seed=None, keep_checkpoints=None, number_of_workers=None, default_dtype=None, number_of_channels=None, max_l_equivariance=None, distance_cutoff=None, energy_key=None, forces_key=None, stress_key=None, energy_weight=None, forces_weight=None, stress_weight=None, loss_function_type=None, compute_stress=None, evaluation_interval=None, error_measure=None, model_type=None, number_of_interactions=None, correlation_order=None, max_ell=None, number_of_radial_basis_functions=None, scaling_type=None, learning_rate=None, weight_decay=None, restart_from_last_checkpoint=None, use_exponential_moving_average=None, exponential_moving_average_decay=None, scheduler_patience=None, gradient_clipping_threshold=None, save_for_cpu=None, log_directory=None, results_directory=None, checkpoint_directory=None, foundation_model_path=None, use_multiheads_finetuning=None, pretrained_head_train_file=None, number_of_samples_from_pretrained_head=None, mp_data_path=None, mp_descriptors_path=None, additional_parameters=None)¶
A parameter class for passing settings for training MACE models [1][2]. The class acts as a QuantumATK interface to the mace-torch package functionality. The parameters are used for setting up from-scratch fitting or fine-tuning of MACE models to custom data. The default parameters mostly follow the default settings in the open source mace-torch package, with some exceptions to increase ease of use. Advanced settings not covered by the parameters available via this class can be set using the additional_parameters dictionary, via keys corresponding to advanced open source MACE arguments.
- Parameters:
experiment_name (str) – The name of the experiment.
training_data_filepath (str) – The path to the training data. Default: ‘train.xyz’
testing_data_filepath (str) – The path to the testing data. Default: None
isolated_atom_energies (dict of Element: PhysicalQuantity of type energy) – The energy of an isolated atom for each species. The dictionary should contain the species as keys and the energy as values. Default: {}
validation_fraction (float) – The fraction of the training data to use for validation. Default: 0.1
batch_size (int) – The batch size for training. Default: 8
validation_batch_size (int) – The batch size for validation. Default: 8
max_number_of_epochs (int) – The maximum number of epochs to train for. Default: 200
patience (int) – The number of epochs to wait before early stopping. Default: 40
device (Automatic | MACEParameterOptions.DEVICE option) – The device to train on. Default: Automatic
distributed_training (bool) – Whether to use distributed (multi GPU) training. Default: False
random_seed (int) – The random seed used for splitting the provided training data into validation and training data sets. Default: 123
keep_checkpoints (bool) – Whether to keep checkpoints. Default: True
number_of_workers (int) – The number of workers for data loading. Default: 0
default_dtype (MACEParameterOptions.DTYPE option of type str) – The default torch data type. Default: MACEParameterOptions.DTYPE.FLOAT64
number_of_channels (int) – The number of embedding channels. Default: 128
max_l_equivariance (int) – The max L equivariance of the message. Default: 1
distance_cutoff (PhysicalQuantity of type length) – The radial cutoff distance for each node. Default: 5.0*Angstrom
energy_key (str) – The key for the energy in the dataset. Default: ‘REF_energy’
forces_key (str) – The key for the forces in the dataset. Default: ‘REF_forces’
stress_key (str) – The key for the stress in the dataset. Default: ‘REF_stress’
energy_weight (float) – The weight of the energy loss. Default: 1
forces_weight (float) – The weight of the forces loss. Default: 100
stress_weight (float) – The weight of the stress loss. Default: 1
loss_function_type (MACEParameterOptions.LOSS_TYPE option of type str) – The loss function type to use. Default: MACEParameterOptions.LOSS_TYPE.WEIGHTED
compute_stress (bool) – Whether to compute the stress. Default: False
evaluation_interval (int) – The epoch interval at which evaluations should occur. Default: 1
error_measure (MACEParameterOptions.ERROR_MEASURE option of type str) – The type of error table produced at the end of the training. Default: MACEParameterOptions.ERROR_MEASURE.PER_ATOM_RMSE
model_type (MACEParameterOptions.MODEL_TYPE option of type str) – The model type to train. Default: MACEParameterOptions.MODEL_TYPE.MACE
number_of_interactions (int) – The number of interactions. Default: 2
correlation_order (int) – The correlation order at each layer. Default: 3
max_ell (int) – The highest ell of spherical harmonics. Default: 3
number_of_radial_basis_functions (int) – The number of radial basis functions. Default: 8
scaling_type (MACEParameterOptions.SCALING_TYPE option of type str) – The type of scaling applied to the output. Default: MACEParameterOptions.SCALING_TYPE.RMS_FORCES_SCALING
learning_rate (float) – The learning rate of the optimizer. Default: 0.005
weight_decay (float) – The weight decay (L2 penalty). Default: 5e-7
restart_from_last_checkpoint (bool) – Whether to restart training from the last saved checkpoint. To ensure that a job can be restarted or resubmitted easily from the job tool, this setting has to be activated from the first submission. Default: True
use_exponential_moving_average (bool) – Whether to use exponential moving average. Default: False
exponential_moving_average_decay (float) – The decay rate of the exponential moving average. Default: 0.99
scheduler_patience (int) – The patience of the scheduler. Default: 5
gradient_clipping_threshold (float) – The threshold for clipping large gradients during training. Default: 10
save_for_cpu (bool) – Whether to save a model loadable on CPU. Default: True
log_directory (str) – The directory to save logs. Default: None
results_directory (str) – The directory to save results. Default: None
checkpoint_directory (str) – The directory to save checkpoints. Default: None
foundation_model_path (str) – The path to the foundation MACE model to finetune. Default: None
use_multiheads_finetuning (bool) – Whether to use multiheads finetuning. Default: False
pretrained_head_train_file (str) – The path to the pretrained head train file for multiheads finetuning. Default: None
number_of_samples_from_pretrained_head (int) – The number of samples from the pretrained head for multiheads finetuning. Default: 10000
mp_data_path (str) – The path to the MP data for finetuning an MP foundation model. Default: ‘’
mp_descriptors_path (str) – The path to the MP descriptors for finetuning an MP foundation model. Default: ‘’
additional_parameters (dict) – Additional parameters for the MACE model. Here parameters that have not been included in the parameter class, but that are available in the mace-torch package, can be set as dictionary entries. Keys must match mace-torch input parameter names exactly and values must be in the right format as expected by the open source package. Default: {}
- additionalParameters()¶
- Returns:
Additional parameters for the MACE model.
- Return type:
dict
- batchSize()¶
- Returns:
The batch size for training.
- Return type:
int
- checkpointDirectory()¶
- Returns:
The directory to save checkpoints.
- Return type:
str
- computeStress()¶
- Returns:
Whether to compute the stress.
- Return type:
bool
- correlationOrder()¶
- Returns:
The correlation order at each layer.
- Return type:
int
- defaultDtype()¶
- Returns:
The default torch data type.
- Return type:
str
- device()¶
- Returns:
The device to train on.
- Return type:
str
- distanceCutoff()¶
- Returns:
The radial cutoff distance for each node.
- Return type:
PhysicalQuantity of type length
- distributedTraining()¶
- Returns:
Whether to use distributed (multi GPU) training.
- Return type:
bool
- energyKey()¶
- Returns:
The key for the energy in the dataset.
- Return type:
str
- energyWeight()¶
- Returns:
The weight of the energy loss.
- Return type:
float
- errorMeasure()¶
- Returns:
Type of error table produced at the end of the training.
- Return type:
str
- evaluationInterval()¶
- Returns:
The epoch interval at which evaluations should occur.
- Return type:
int
- experimentName()¶
- Returns:
The name of the experiment.
- Return type:
str
- exponentialMovingAverageDecay()¶
- Returns:
The decay rate of the exponential moving average.
- Return type:
float
- forcesKey()¶
- Returns:
The key for the forces in the dataset.
- Return type:
str
- forcesWeight()¶
- Returns:
The weight of the forces loss.
- Return type:
float
- foundationModelPath()¶
- Returns:
The path to the foundation MACE model to be finetuned.
- Return type:
str
- gradientClippingThreshold()¶
- Returns:
The gradient clipping threshold.
- Return type:
float
- isolatedAtomEnergies()¶
- Returns:
The energy of an isolated atom for each species.
- Return type:
dict of {Element: PhysicalQuantity of type energy}
- keepCheckpoints()¶
- Returns:
Whether to keep checkpoints.
- Return type:
bool
- learningRate()¶
- Returns:
The learning rate of the optimizer.
- Return type:
float
- logDirectory()¶
- Returns:
The directory to save logs.
- Return type:
str
- lossFunctionType()¶
- Returns:
The loss function type to use.
- Return type:
str
- maxEll()¶
- Returns:
The highest ell of spherical harmonics.
- Return type:
int
- maxLEquivariance()¶
- Returns:
The max L equivariance of the message.
- Return type:
int
- maxNumberOfEpochs()¶
- Returns:
The maximum number of epochs to train for.
- Return type:
int
- modelType()¶
- Returns:
The model type to train.
- Return type:
str
- mpDataPath()¶
- Returns:
The path to the MP data for finetuning an MP foundation model.
- Return type:
str
- mpDescriptorsPath()¶
- Returns:
The path to the MP descriptors for finetuning an MP foundation model.
- Return type:
str
- nlinfo()¶
- Returns:
The nlinfo.
- Return type:
dict
- numberOfChannels()¶
- Returns:
The number of embedding channels.
- Return type:
int
- numberOfInteractions()¶
- Returns:
The number of interactions.
- Return type:
int
- numberOfRadialBasisFunctions()¶
- Returns:
The number of radial basis functions.
- Return type:
int
- numberOfSamplesFromPretrainedHead()¶
- Returns:
The number of samples from the pretrained head to use for Multihead Replay Finetuning.
- Return type:
int
- numberOfWorkers()¶
- Returns:
The number of workers for data loading.
- Return type:
int
- patience()¶
- Returns:
The number of epochs to wait before early stopping.
- Return type:
int
- pretrainedHeadTrainFile()¶
- Returns:
The path to the pretrained head train file.
- Return type:
str
- randomSeed()¶
- Returns:
The random seed.
- Return type:
int
- restartFromLastCheckpoint()¶
- Returns:
Whether to restart training from last saved checkpoint.
- Return type:
bool
- resultsDirectory()¶
- Returns:
The directory to save results.
- Return type:
str
- saveForCpu()¶
- Returns:
Whether to save a model loadable on CPU.
- Return type:
bool
- scalingType()¶
- Returns:
The type of scaling to the output.
- Return type:
str
- schedulerPatience()¶
- Returns:
The patience of the scheduler.
- Return type:
int
- stressKey()¶
- Returns:
The key for the stress in the dataset.
- Return type:
str
- stressWeight()¶
- Returns:
The weight of the stress loss.
- Return type:
float
- testingDataFilepath()¶
- Returns:
The path to the testing data.
- Return type:
str
- trainingDataFilepath()¶
- Returns:
The path to the training data.
- Return type:
str
- uniqueString()¶
Return a unique string representing the state of the object.
- useExponentialMovingAverage()¶
- Returns:
Whether to use exponential moving average.
- Return type:
bool
- useMultiheadsFinetuning()¶
- Returns:
Whether to use multiheads finetuning.
- Return type:
bool
- validationBatchSize()¶
- Returns:
The batch size for validation.
- Return type:
int
- validationFraction()¶
- Returns:
The fraction of the training data to use for validation.
- Return type:
float
- weightDecay()¶
- Returns:
The weight decay (L2 penalty).
- Return type:
float
Usage Examples¶
The MACEFittingParameters class is used for choosing settings for training MACE Machine Learned Force Field (MLFF) models [1][2] using the MachineLearnedForceFieldTrainer. Once trained, the model is automatically converted into a QuantumATK-compatible format, and a TremoloXCalculator can be set up using the trained model.
There are three main usage scenarios of the MACEFittingParameters class.
Training a MACE model from scratch.
This is the most straightforward way to train a MACE model, where the model is trained only on data provided by the user. The model size and hyperparameters are fully customizable, allowing users to choose their own tradeoff between model accuracy and performance. The example below shows a simple training script that trains a MACE model from scratch, highlighting some of the most important parameters.
# Setup MACE parameters for training from scratch
mace_fp = MACEFittingParameters(
# Name of the model - has to be set
experiment_name='mace_experiment1',
# Most important model size parameters (affects accuracy and speed)
max_l_equivariance=0,
number_of_channels=64,
distance_cutoff=4.0*Ang,
# Weights for the different parts of the loss term
energy_weight=1.0,
forces_weight=100.0,
stress_weight=1.0,
loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,
# Most relevant other parameters regarding training setup
validation_fraction=0.2, # Ratio of training data used for validation
max_number_of_epochs=200, # Number of epochs to train for
patience=50, # Number of epochs without improvement before training is stopped early
batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
validation_batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
random_seed=123, # Can be used for experimenting with the influence on different data splits
default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and training speed
device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available
# Include stress if desired and present in the training data
compute_stress=True,
)
# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
fitting_parameters=mace_fp,
training_sets=training_set,
calculator=calculator
)
# Run the training
mlfft.train()
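All settings, including defaults that were not set explicitly, can be inspected on the constructed parameter object through the camelCase query methods listed above. A minimal sketch, reusing the mace_fp object from the example:
# Inspect the effective settings of the parameter object.
# Values not set explicitly fall back to the documented defaults.
print(mace_fp.experimentName())   # 'mace_experiment1'
print(mace_fp.batchSize())        # 4 (set above)
print(mace_fp.patience())         # 50 (set above)
print(mace_fp.learningRate())     # 0.005 (documented default)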
Training with Naive Finetuning.
This is the simplest way to finetune a MACE model. The model is trained on a new dataset, with model weights and hyperparameters initialized from the values of the pretrained model. Any hyperparameter settings related to the neural network architecture provided in the finetuning MACEFittingParameters object will be overwritten. This ensures that the model size and architecture remain consistent with the pretrained model.
# Setup MACE parameters for training with Naive Finetuning
mace_fp = MACEFittingParameters(
# Name of the model - has to be set
experiment_name='mace_experiment2',
# Weights for the different parts of the loss term and type of loss
energy_weight=1.0,
forces_weight=100.0,
stress_weight=1.0,
loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,
# Most relevant parameters regarding training setup
validation_fraction=0.2, # Ratio of training data used for validation
max_number_of_epochs=30, # Number of epochs to train for - keep low for finetuning
batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
validation_batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
random_seed=123, # Can be used for experimenting with the influence on different data splits
default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and training speed
device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available
# When using naive finetuning on a custom model (or on a MP foundation model)
foundation_model_path='/path/to/model/data/on/run/machine/my_previous_mace_model.model',
)
# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
fitting_parameters=mace_fp,
training_sets=training_set,
calculator=calculator
)
# Run the training
mlfft.train()
Training with Multihead Replay Finetuning.
This is the most advanced way to finetune a MACE model. It is also the recommended approach for finetuning any Materials Project foundation model [3][4], as this usually leads to more robust and stable models. The model is trained on both a new dataset and (some of) the data used for training the original model, in separate model heads. The model heads have different readout weights but share the remaining model weights to be optimized during the fitting. This ensures that fitting takes both data sets into account, learning from new data while retaining the knowledge of the original model as well as possible. The model weights and the hyperparameters are initialized from the values of the pretrained model. Any hyperparameter settings related to the neural network architecture provided in the finetuning MACEFittingParameters object will be overwritten. This ensures that the model size and architecture remain consistent with the pretrained model.
# Setup MACE parameters for training with Multihead Replay Finetuning
mace_fp = MACEFittingParameters(
# Name of the model - has to be set
experiment_name='mace_experiment3',
# Weights for the different parts of the loss term and type of loss
energy_weight=1.0,
forces_weight=100.0,
stress_weight=1.0,
loss_function_type=MACEParameterOptions.LOSS_TYPE.UNIVERSAL,
# Most relevant parameters regarding training setup
validation_fraction=0.2, # Ratio of training data used for validation
max_number_of_epochs=20, # Number of epochs to train for - keep low for finetuning
batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
validation_batch_size=4, # Generally, higher means quicker training, but too large a batch size can cause GPU memory errors
random_seed=123, # Can be used for experimenting with the influence on different data splits
default_dtype=MACEParameterOptions.DTYPE.FLOAT64, # Can be adjusted for shifting tradeoff between model accuracy and speed
device=Automatic, # Device can be set explicitly but Automatic is recommended to automatically use the GPU if available
# When using multihead replay finetuning on MP foundation models
use_multiheads_finetuning=True,
foundation_model_path='/path/to/model/data/on/run/machine/mace_agnesi_small.model',
number_of_samples_from_pretrained_head=15000, # Number of samples from the original model training data to use for the multihead replay finetuning
pretrained_head_train_file='mp', # Use 'mp' for MP foundation type model. Use path to xyz file with training data for other custom models
mp_data_path='/path/to/model/data/on/run/machine/mp_traj_combined.xyz',
mp_descriptors_path='/path/to/model/data/on/run/machine/descriptors.npy',
)
# Setup ML model training object
mlfft = MachineLearnedForceFieldTrainer(
fitting_parameters=mace_fp,
training_sets=training_set,
calculator=calculator
)
# Run the training
mlfft.train()
The example scripts, including a full training setup with MachineLearnedForceFieldTrainer and TremoloXCalculator for retrieving the trained models as calculators, are available for download.
Notes¶
The MACEFittingParameters class provides an interface for training MACE MLFFs [1][2], utilizing the open source mace-torch package developed by the authors. The framework provides a multitude of settings that aid the creation of accurate and performant MLFFs.
In order to use any MACE training functionality, it is assumed that at least one TrainingSet is available with precomputed energies, forces, and, if desired, stress. The calculator used for calculating the TrainingSet data is similarly assumed to be either attached to the TrainingSet or defined in the calculator parameter of the MachineLearnedForceFieldTrainer class, unless isolated atom energies are passed explicitly in the MACEFittingParameters. Importantly, the calculator should use spin-polarized calculations in order to create accurate isolated atom energy reference values.
If previous MACE model training data is stored in XYZ files, these files can be explicitly passed into the MACEFittingParameters object using the training_data_filepath and testing_data_filepath arguments. However, this requires the isolated atom energies to be explicitly provided through the isolated_atom_energies argument.
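A minimal sketch of this setup is shown below. The element symbols and energy values are placeholders for illustration only; use isolated atom energies computed for your own species with a spin-polarized calculator.
# Train from existing XYZ files; the isolated atom energies must then
# be provided explicitly (placeholder values shown below).
mace_fp = MACEFittingParameters(
    experiment_name='mace_from_xyz',
    training_data_filepath='train.xyz',
    testing_data_filepath='test.xyz',
    isolated_atom_energies={
        Silicon: -107.25*eV,  # placeholder value
        Oxygen: -432.10*eV,   # placeholder value
    },
)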
In QuantumATK, it is recommended to supply TrainingSets to the
MachineLearnedForceFieldTrainer class for a smooth training experience.
The MACEFittingParameters class is a simplified version of the complete set of MACE training parameters available in the open source package, with more self-explanatory parameter names and a reduced total number of parameters to simplify general user interaction. MACE training parameters that have been left out of the MACEFittingParameters class, but that are available in the open source training framework, can still be passed into the MACEFittingParameters object by supplying them as key-value pairs in a Python dictionary via the additional_parameters parameter.
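For example, an advanced optimizer flag could be forwarded as in the sketch below. The specific key ('amsgrad') is an assumption about the arguments available in the installed mace-torch version; keys must match the open source argument names exactly.
# Forward advanced mace-torch settings not exposed as named parameters.
# The key below is an illustrative assumption - verify the exact
# argument names against your installed mace-torch version.
mace_fp = MACEFittingParameters(
    experiment_name='mace_advanced',
    additional_parameters={
        'amsgrad': True,  # assumed mace-torch optimizer flag
    },
)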
Most of the parameters have default values corresponding to the defaults in the open source
package, but some have been adapted in the MACEFittingParameters
class for ease of use and more
appropriate settings for most QuantumATK users. All used default values are documented in the
parameter list above.
The mp_descriptors_path and mp_data_path parameters have been introduced in order to avoid automatic downloading of the MACE descriptors and training data files containing the Materials Project data used for assembling the training set of the MP foundation models. In order to perform Multihead Replay Finetuning with MP foundation models, these parameters therefore have to be set to local paths where the MACE descriptors and training data files are stored on the cluster/location where the training will be performed - see the Multihead Replay Finetuning example script above. These files have to be downloaded/transferred to the compute file system from the MACE-foundations repository [4] in order to make finetuning from MP foundation models possible. There, it is also possible to check which models are currently recommended as starting points for finetuning.
The MP foundation models [3], their underlying training data [5], and (in the case of MP foundation models) their descriptors file, all of which are required for performing Multihead Replay Finetuning, are available via the mace-foundations repo of ACESuit [4]. It should be noted that the training data was calculated with VASP calculators and that total compatibility with QuantumATK-native calculators is not guaranteed. This is therefore a prime use case for Multihead Replay Finetuning, which supports heads using different DFT definitions. The total energies may differ, but derivative quantities such as forces and stress show good agreement for similar levels of theory. This means that, in terms of absolute energies, pretrained models may have to somewhat relearn different energy scales, but the inter-atomic relationships learned by a foundation model still provide a strong foundation for continued learning, even when the level or type of theory changes. In practice, the different model heads will have separate weights for the readout layer, but otherwise share the remaining model weights.
After training a model via the general training flow outlined above, the resulting model can be used directly with the TremoloXCalculator, as shown in the downloadable example scripts, and is loadable via the MachineLearnedForceFieldCalculator block in the Workflow Builder. While the main use case within QuantumATK utilizes the .qatkpt output file, the regular MACE .model output file will remain available after training.
GPU acceleration of MACE training using CUDA is supported. The default way to activate GPU acceleration is by checking the “Enable proprietary GPU acceleration” option in the Job Manager or by using the atkpython_gpu command instead of atkpython, as described in the technical notes.
To utilize multiple GPUs for training, enable the distributed_training
parameter and ensure that
the training script runs with a single process per GPU.
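A minimal sketch of requesting multi-GPU training via the parameter object is shown below; the launch mechanics of one process per GPU depend on the cluster setup and are not shown.
# Request distributed (multi GPU) training. The training script must be
# launched with one process per GPU (cluster dependent).
mace_fp = MACEFittingParameters(
    experiment_name='mace_multi_gpu',
    distributed_training=True,
    device=Automatic,  # automatically use GPUs when available
)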