TrainingSet

class TrainingSet(configurations, sample_size=None, calculator=None, recalculate_training_data=None, log_filename_prefix=None, data_tag=None)

Class for storing a set of training configurations.

Parameters:
  • configurations (MoleculeConfiguration | BulkConfiguration | sequence of [MoleculeConfiguration | BulkConfiguration] | Trajectory | MDTrajectory | ConfigurationDataContainer | MomentTensorPotentialTraining | SurfaceProcessSimulation | NudgedElasticBand | Table) – The training set, either as a sequence of individual configurations or stored within a trajectory object. When using a trajectory object, energy, force and stress data may also be included.

  • sample_size (int) – The number of configurations to select from the provided list. The selected configurations are evenly spaced across the list.
    Default: All configurations.

  • calculator (Calculator) – The calculator that was used to calculate the energy, force and stress data on the trajectory object, if present.
    Default: Energy, force and stress data are ignored.

  • recalculate_training_data (bool | None) – Flag to enforce or avoid recalculation of training data. If set, this flag will take precedence over the calculator. If not set, the data will be automatically recalculated if the specified calculator is different from the reference calculator in the MomentTensorPotentialTraining object.
    Default: None

  • log_filename_prefix (str) – Filename prefix for the logging output of the tasks associated with this set.
    Default: Defined by the MomentTensorPotentialTraining object.

  • data_tag (str) – Label for this training set to enable selection of different data in MTP fitting.
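
The even spacing described for sample_size can be pictured with a small plain-Python sketch. It does not use QuantumATK, and the exact selection rule applied by TrainingSet is not stated in this reference. The function name evenly_spaced_indices and the choice to always include the first and last configuration are assumptions made for illustration only.

```python
def evenly_spaced_indices(n_configurations, sample_size=None):
    """Pick `sample_size` indices spread evenly over range(n_configurations).

    Hypothetical helper: one plausible reading of "spaced out evenly",
    not the rule implemented by TrainingSet.
    """
    if sample_size is None:
        # Mirrors the documented default: use all configurations.
        return list(range(n_configurations))
    if sample_size <= 0:
        raise ValueError("sample_size must be a positive integer")
    if sample_size >= n_configurations:
        return list(range(n_configurations))
    if sample_size == 1:
        return [0]
    # Integer arithmetic keeps the first and last index and avoids rounding surprises.
    last = n_configurations - 1
    return [(i * last) // (sample_size - 1) for i in range(sample_size)]


indices = evenly_spaced_indices(10, 4)
print(indices)
```

With 10 configurations and sample_size=4, this sketch selects indices 0, 3, 6 and 9.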

calculator()
Returns:

The calculator that was used to calculate the energy, force and stress data on the trajectory object, or None if energy, force and stress data is not present or should be ignored.

Return type:

Calculator | None

configurations()
Returns:

The configurations in the training set. If the configurations have the same elements they are returned as an MDTrajectory, otherwise the configurations are returned as a ConfigurationDataContainer.

Return type:

ConfigurationDataContainer | MDTrajectory
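
The choice between the two return types of configurations() can be sketched in plain Python. This is an illustration only: it assumes that "the same elements" means every configuration has an identical, ordered list of element symbols, which this reference does not spell out. The function container_kind is hypothetical and returns the type name as a string instead of building a QuantumATK object.

```python
def container_kind(element_lists):
    """Return the name of the container type this sketch would pick.

    `element_lists` holds one sequence of element symbols per configuration.
    Assumption: "same elements" means identical ordered element lists.
    """
    if not element_lists:
        raise ValueError("at least one configuration is required")
    first = list(element_lists[0])
    if all(list(elements) == first for elements in element_lists[1:]):
        return "MDTrajectory"
    return "ConfigurationDataContainer"


print(container_kind([["Si", "Si"], ["Si", "Si"]]))
print(container_kind([["Si", "O"], ["Si", "Si"]]))
```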

dataTag()
Returns:

The selection tag added to the data in the training set.

Return type:

str

logFilenameIdentifier()
Returns:

Filename identifier for the logging output of the tasks associated with this set, or None if it hasn’t been set yet.

Return type:

str | None

logFilenamePrefix()
Returns:

Filename prefix for the logging output of the tasks associated with this set, or None if it is to be defined by the MomentTensorPotentialTraining object.

Return type:

str | LogToStdOut | None

nlinfo()
Returns:

The training set information.

Return type:

dict

recalculateTrainingData()
Returns:

A flag signaling whether the training data should be recalculated or not.

Return type:

bool

referenceConfigurations()
Returns:

The list of reference configurations that identify the training set.

Return type:

list of [MoleculeConfiguration | BulkConfiguration]

sampleSize()
Returns:

The number of training configurations for each combination of list parameters.

Return type:

int

classmethod supportedConfigurationTypes()

Return a list of supported configurations.

Returns:

List of supported configurations.

Return type:

list

uniqueString()

Return a unique string representing the state of the object.

Usage Examples

Set up an MTP training by reading pre-calculated training data and passing it as TrainingSet objects:

# Import a pre-calculated training dataset.
training_data = nlread('training_data_precalculated.hdf5')[-1]
training_set_precalc = TrainingSet(training_data, recalculate_training_data=False)

# Import a training dataset with a calculator to check for calculation consistency.
# In this case the calculator is different to the one given in the MomentTensorPotentialTraining
# object, and so the training data will be re-calculated.
calculator = LCAOCalculator(exchange_correlation=GGA.PBE)
training_data = nlread('training_data_recalculate.hdf5')[-1]
training_set_recalc = TrainingSet(training_data, calculator=calculator)

# Import a training dataset with a calculator to check for calculation consistency.
# In this case the calculator is the same as the one given in the MomentTensorPotentialTraining
# object, and so the training data will not be re-calculated.
calculator = LCAOCalculator(exchange_correlation=HybridGGA.HSE06)
training_data = nlread('training_data_same_calculator.hdf5')[-1]
training_set_same_calc = TrainingSet(training_data, calculator=calculator)

# Set up MTP training and run the training data calculation and MTP training.
moment_tensor_potential_training = MomentTensorPotentialTraining(
    filename='mtp_study.hdf5',
    object_id='training',
    training_sets=[training_set_precalc, training_set_recalc, training_set_same_calc],
    calculator=LCAOCalculator(exchange_correlation=HybridGGA.HSE06),
    calculate_stress=True,
    fitting_parameters_list=fitting_parameters,  # fitting_parameters is assumed to be defined earlier in the script.
)
moment_tensor_potential_training.update()

TrainingSet_example.py

Here, the first training dataset is included as is, without re-calculating any of the energy, force or stress values. For the second training dataset, a calculator is given. As this calculator is different to the one given in the MomentTensorPotentialTraining object, the training data will be re-calculated. For the third training dataset, the same calculator is given as in the MomentTensorPotentialTraining object. In this case, the training data will not be re-calculated, provided that it is present for every configuration.

The following example shows how to save the training data from a completed MomentTensorPotentialTraining object in a TrainingSet and read it again to be used in another training.

# Run the training calculations using the MomentTensorPotentialTraining
moment_tensor_potential_training = MomentTensorPotentialTraining(
    filename='mtp_study',
    object_id='training',
    training_sets=training_sets,
    calculator=calculator,
    calculate_stress=True,
)
moment_tensor_potential_training.update()

# Pack the completed MomentTensorPotentialTraining object in a TrainingSet and save it.
training_set = TrainingSet(
    moment_tensor_potential_training,
    calculator=calculator,
    recalculate_training_data=False,
)

nlsave('mtp_training_set1.hdf5', training_set)

# Read it again and combine with other training sets from disk to run a new fit.
training_sets = [
    nlread('mtp_training_set1.hdf5', TrainingSet)[0],
    nlread('mtp_training_set2.hdf5', TrainingSet)[0],
    nlread('mtp_training_set3.hdf5', TrainingSet)[0],
]
new_mtp_training = MomentTensorPotentialTraining(
    filename='mtp_study_from_training_sets.hdf5',
    object_id='training',
    training_sets=training_sets,
    calculator=calculator,
    calculate_stress=True,
    fitting_parameters=MomentTensorPotentialFittingParameters(),
    train_test_split=0.95,
)
new_mtp_training.update()
nlprint(new_mtp_training)

Notes

The TrainingSet class can be used to include existing data in the training data of another MomentTensorPotentialTraining. Supported sources are individual configurations, trajectories from MD or optimization runs, ConfigurationDataContainer and SurfaceProcessSimulation objects, and pre-calculated MomentTensorPotentialTraining objects.

This can be useful when combining existing training data from different sources or projects to fit a new Moment Tensor Potential. It can also be used to efficiently calculate DFT data for existing unlabeled configurations, i.e. configurations without DFT energies, forces, and stresses.

By default, training data is re-calculated using the calculator in the MomentTensorPotentialTraining object. To keep the original data in the training set, set the argument recalculate_training_data to False. This prevents re-calculation of the data regardless of the calculator settings. In this case, if any training data is missing, an error is raised in the MomentTensorPotentialTraining object.

It is also possible to make re-calculation depend on the consistency of the calculator settings. The argument calculator takes the calculator that was used to generate the energy, force and stress data. If this calculator is the same as the calculator given in the MomentTensorPotentialTraining object, the original data is kept and not re-calculated. If the calculators differ, or if training data is missing, the dataset is re-calculated using the calculator of the MomentTensorPotentialTraining object.
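
The re-calculation rules described above can be summarized as a small decision function. This is a plain-Python sketch of the documented behavior, not the QuantumATK implementation: the function should_recalculate and its arguments are hypothetical, calculators are represented by simple strings, and data_complete stands for "energy, force and stress data is present for every configuration".

```python
def should_recalculate(recalculate_training_data, set_calculator,
                       training_calculator, data_complete):
    """Return True if the training data would be re-calculated.

    Hypothetical helper following the rules stated in the notes.
    """
    # An explicitly set flag takes precedence over the calculator comparison.
    if recalculate_training_data is not None:
        if not recalculate_training_data and not data_complete:
            # The notes state that missing data raises an error in this case.
            raise ValueError("training data is missing and re-calculation is disabled")
        return recalculate_training_data
    # Without a calculator on the training set, stored data is ignored.
    if set_calculator is None:
        return True
    # Otherwise keep the data only if the calculators match and nothing is missing.
    return set_calculator != training_calculator or not data_complete


# The three training sets from the first usage example, with strings as stand-ins.
print(should_recalculate(False, None, "HSE06", True))     # pre-calculated set
print(should_recalculate(None, "PBE", "HSE06", True))     # different calculator
print(should_recalculate(None, "HSE06", "HSE06", True))   # same calculator
```

In this sketch, the three calls correspond to the pre-calculated, re-calculated, and same-calculator training sets of the first usage example and give False, True, and False respectively.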

It is generally recommended to save training data as TrainingSet objects, which can easily be read and combined with other TrainingSet objects in a list or Table to be re-used in other training scenarios (as shown in the example above).