ForceFieldTrainingSetGenerator¶

Included in QATK.MLFF

class ForceFieldTrainingSetGenerator(filename, object_id, training_sets, calculator=None, correction_calculator=None, calculate_stress=None, ignore_non_converged_configurations=None, scaled_spins=None, number_of_processes_per_task=None, log_filename_prefix=None, configurations_per_task=None)¶

Study for creating a dataset for training a force field.

Parameters:

filename (str) – The full or relative filename path the Study object should be saved to. See nlsave().
object_id (str) – The name of the study that the Study object should be saved to within the file. This needs to be a unique name in this file. See nlsave().
training_sets (RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | AlloyTrainingParameters | MolecularConfigurationsParameters | MolecularDynamicsSnapshotsParameters | TrainingSet | sequence of [RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | AlloyTrainingParameters | MolecularConfigurationsParameters | MolecularDynamicsSnapshotsParameters] | TrainingSet | Table) – The list of training sets to generate as part of the complete dataset.
calculator (Calculator) – The calculator to use for the calculations.
Default: LCAOCalculator().
correction_calculator (Calculator) – The calculator to correct energy, force and stress data with. The dataset will have the reference calculator’s energy, force and stress data minus the correction calculator’s.
Default: None.
calculate_stress (bool) – Whether the stress will be calculated and added to the output. Only applies to bulk configurations.
Default: True.
ignore_non_converged_configurations (bool) – Flag to state whether non-converged configurations are ignored in fitting.
Default: True.
scaled_spins (list of tuples, each containing a PeriodicTableElement object and a scaled spin value; for non-collinear spin systems each tuple has four entries: atom index, scaled spin, theta, and phi, where theta and phi are spherical coordinates given as PhysicalQuantity objects in units of degrees or radians) – The initial scaled spins for each type of element.
Default: None.
number_of_processes_per_task (int | None | ProcessesPerNode) – The number of processes that will be used to execute each task. If the total number of processes does not divide evenly into the tasks, some tasks may have fewer than this number of processes. If None, all available processes execute each task collaboratively.
Default: None.
log_filename_prefix (str | LogToStdOut) – General filename prefix for the logging output of the study tasks. If LogToStdOut, all logging will instead be sent to standard output.
Default: 'dataset_generation_'.
configurations_per_task (int) – The number of configurations packed into each bundle task. For most applications, the default value is recommended because it balances parallelization efficiency, restart granularity, and disk reading during task execution. This parameter is ignored for training sets of type CrystalInterface.
Default: Determined automatically with a linear bundle scaling based on the average number of atoms per configuration.

addTrainingSet(training_set)¶

Add a new training set to the study.

Parameters:: training_set (RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | MolecularConfigurationsParameters | AlloyTrainingParameters | MolecularDynamicsSnapshotsParameters | TrainingSet) – The training set to add.

availableDataTags()¶

Returns:: The available tags that have been set to select different datasets.
Return type:: list

calculateStress()¶

Returns:: Whether the stress will be calculated and added to the output. Only applies to bulk configurations.
Return type:: bool

calculator()¶

Returns:: The calculator to use for the calculations.
Return type:: Calculator

combinedCalculator()¶

Returns:: The combined calculator that calculates the reference calculator’s energy, force and stress data minus the correction calculator’s if a correction calculator is set. Otherwise None is returned.
Return type:: Calculator | None
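
The subtraction performed when a correction calculator is set can be illustrated with plain Python data. This is only a sketch of the arithmetic described above (reference minus correction for energy, forces, and stress). The function name subtract_correction and the dictionary layout are hypothetical and are not part of the QuantumATK API.

```python
def subtract_correction(reference, correction):
    """Return reference minus correction for energy, forces, and stress.

    Both arguments are hypothetical plain dictionaries, not QuantumATK objects.
    """
    return {
        "energy": reference["energy"] - correction["energy"],
        # Forces: one [fx, fy, fz] row per atom.
        "forces": [
            [r - c for r, c in zip(ref_atom, cor_atom)]
            for ref_atom, cor_atom in zip(reference["forces"], correction["forces"])
        ],
        # Stress: stored here as a flat list of components.
        "stress": [r - c for r, c in zip(reference["stress"], correction["stress"])],
    }


reference = {"energy": -10.0, "forces": [[0.5, 0.0, -0.25]], "stress": [1.0, 2.0, 3.0]}
correction = {"energy": -2.0, "forces": [[0.25, 0.0, -0.25]], "stress": [0.5, 0.5, 0.5]}
labels = subtract_correction(reference, correction)
```

A model trained on such labels learns only the difference between the two calculators, so the correction calculator's contribution has to be added back when the model is used.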

configurationTypesAreConsistent(training_sets)¶

Check that the configuration types in all the training sets are the same.

Parameters:: training_sets (RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | AlloyTrainingParameters | MolecularConfigurationsParameters | MolecularDynamicsSnapshotsParameters | TrainingSet | sequence of [RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | AlloyTrainingParameters | MolecularConfigurationsParameters | MolecularDynamicsSnapshotsParameters] | TrainingSet | Table) – The list of training sets to generate as part of the complete dataset.
Returns:: True if all training sets contain the same type of configurations, False otherwise.
Return type:: bool
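
As a rough illustration of what such a consistency check amounts to, the sketch below compares the Python types of a collection of stand-in objects. The classes Bulk and Molecule and the function types_are_consistent are hypothetical placeholders. The actual method inspects the configurations held by the training set objects, and its treatment of an empty input is not documented here.

```python
class Bulk:
    """Hypothetical stand-in for a periodic (bulk) configuration."""


class Molecule:
    """Hypothetical stand-in for a molecular configuration."""


def types_are_consistent(configurations):
    """Return True if all configurations share one type.

    Assumption: an empty collection is treated as consistent.
    """
    return len({type(c) for c in configurations}) <= 1


all_bulk = types_are_consistent([Bulk(), Bulk(), Bulk()])
mixed = types_are_consistent([Bulk(), Molecule()])
```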

configurationsPerTask()¶

Returns:: The number of configurations per task specified by the user, or None if assigned automatically.
Return type:: int | None
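
To make the idea of bundling concrete, the following sketch packs a list of configurations into tasks of a fixed size. It is an assumption-laden illustration. The helper bundle_configurations is hypothetical, and the automatic default (a linear scaling with the average number of atoms per configuration) is not reproduced because its formula is not given here.

```python
def bundle_configurations(configurations, configurations_per_task):
    """Split a list into consecutive bundles of at most the given size."""
    if configurations_per_task < 1:
        raise ValueError("configurations_per_task must be at least 1")
    return [
        configurations[start:start + configurations_per_task]
        for start in range(0, len(configurations), configurations_per_task)
    ]


# Ten placeholder configurations, four per task: the last bundle is smaller.
bundles = bundle_configurations(list(range(10)), 4)
```

Larger bundles mean fewer tasks and less bookkeeping, while smaller bundles lose less work when a run is interrupted and restarted.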

continueTrainingSets(log_filename_prefix=None, recalculate_training_data=None)¶

Return a list with two items: the first is a training set that includes all the converged configurations, and the second is a training set with all the non-converged configurations.

Parameters:

log_filename_prefix (str) – Prefix for the log file for completed and ignored data.
recalculate_training_data (bool) – Whether or not the energy, forces and stresses are recalculated for the converged configurations. Note: The TrainingSet containing the non-converged configurations is always recalculated.
Default: False.

Returns:

List of training sets containing available converged and non-converged configurations.

Return type:

list
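
The split and the recalculation rule described above can be sketched with plain dictionaries. The names split_by_convergence and needs_recalculation and the record layout are hypothetical. The sketch only mirrors the documented behaviour: converged configurations come first in the list, and non-converged ones are always recalculated.

```python
def split_by_convergence(records):
    """Return [converged, non_converged], mirroring the documented order."""
    converged = [r for r in records if r["converged"]]
    non_converged = [r for r in records if not r["converged"]]
    return [converged, non_converged]


def needs_recalculation(record, recalculate_training_data=False):
    """Non-converged records are always recalculated; converged ones only on request."""
    return recalculate_training_data or not record["converged"]


records = [
    {"name": "config_0", "converged": True},
    {"name": "config_1", "converged": False},
    {"name": "config_2", "converged": True},
]
converged, non_converged = split_by_convergence(records)
to_recalculate = [r["name"] for r in records if needs_recalculation(r)]
```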

correctionCalculator()¶

Returns:: The calculator to correct energy, force and stress data with. The dataset will have the reference calculator’s energy, force and stress data minus the correction calculator’s.
Return type:: Calculator

dependentStudies()¶

Returns:: The list of dependent studies.
Return type:: list of Study

filename()¶

Returns:: The filename where the study object is stored.
Return type:: str

generatedTrainingSet(data_tags=None)¶

The complete output training dataset with the calculated energy, force, and stress (E/F/S) data. This result is only available after the study has been updated.

Parameters:: data_tags (str | list | None) – One or more tags used to identify which configurations are returned.
Default: None, which returns all available configurations.
Returns:: The complete training dataset as a TrainingSet. If not available, returns None.
Return type:: TrainingSet | None
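
The way a data_tags argument of type str | list | None can be normalized and applied is sketched below. This is an assumption about the selection semantics (a configuration is returned if its tag matches any requested tag). The function select_by_tags and the (tag, item) pairs are hypothetical and do not reflect how the study stores its data.

```python
def select_by_tags(tagged_items, data_tags=None):
    """Return the items whose tag is requested; None selects everything."""
    if data_tags is None:
        return [item for _, item in tagged_items]
    if isinstance(data_tags, str):
        # A single tag is treated like a one-element list.
        data_tags = [data_tags]
    wanted = set(data_tags)
    return [item for tag, item in tagged_items if tag in wanted]


tagged_items = [("bulk", "c0"), ("surface", "c1"), ("bulk", "c2"), ("md", "c3")]
everything = select_by_tags(tagged_items)
bulk_only = select_by_tags(tagged_items, "bulk")
two_tags = select_by_tags(tagged_items, ["surface", "md"])
```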

generatedTrajectory(data_tags=None, efs_only=True)¶

The complete output training dataset, with the calculated energy, force, and stress (E/F/S) data, returned as a trajectory. This result is only available after the study has been updated.

Parameters:

data_tags (str | list | None) – One or more tags used to identify which configurations are returned.
Default: None, which returns all available configurations.
efs_only (bool) – Flag to disable reading of correction data. This significantly speeds up construction of the trajectory.
Default: True.

Returns:

The complete training dataset as a Trajectory. If not available, returns None.

Return type:

Trajectory | None

ignoreNonConvergedConfigurations()¶

Returns:: Whether or not configurations are ignored if the calculation of energy does not converge. Only applies to calculators that use an SCF method.
Return type:: bool

ignoredConfigurations()¶

Return the configurations that are ignored because the calculation of energy, force and stress did not converge.

Returns:: The training configurations that did not converge.
Return type:: ConfigurationDataContainer | None

logFilenamePrefix()¶

Returns:: The filename prefix for the logging output of the study.
Return type:: str | LogToStdOut

nlinfo()¶

Information about the dataset generator study.

Returns:: A dictionary with study information.
Return type:: dict

nlprint(stream=None)¶

Print a string containing an ASCII table useful for plotting the Study object.

Parameters:: stream (python stream) – The stream the table should be written to.
Default: NLPrintLogger()

numberOfProcessesPerTask()¶

Returns:: The number of processes to be used to execute each task. If None, all available processes execute each task collaboratively.
Return type:: int | None | ProcessesPerNode

numberOfProcessesPerTaskResolved()¶

Returns:: The number of processes to be used to execute each task. Default values are resolved based on the current execution settings.
Return type:: int

objectId()¶

Returns:: The name of the study object in the file.
Return type:: str

saveToFileAfterUpdate()¶

Returns:: Whether the study is automatically saved after it is updated.
Return type:: bool

scaledSpins()¶

Returns:: The scaled spins of the atoms.
Return type:: list

classmethod supportedTrainingSetTypes()¶

Return a tuple of supported training set types.

Returns:: The supported training set types.
Return type:: tuple of training set types

trainingSets()¶

Returns:: The list of training sets to use or generate based on the reference configurations.
Return type:: list of [RandomDisplacementsParameters | CrystalInterfaceTrainingParameters | AlloyTrainingParameters | MolecularConfigurationsParameters | MolecularDynamicsSnapshotsParameters | TrainingSet]

uniqueString()¶: Return a unique string representing the state of the object.

update()¶: Run the calculations for the study object.

Usage Examples¶

Setting up a ForceFieldTrainingSetGenerator for quartz:

# Set up lattice
lattice = Hexagonal(4.916*Angstrom, 5.4054*Angstrom)

# Define elements
elements = [Silicon, Silicon, Silicon, Oxygen, Oxygen, Oxygen, Oxygen, Oxygen,
           Oxygen]

# Define coordinates
fractional_coordinates = [[ 0.4697        ,  0.            ,  0.            ],
                         [ 0.            ,  0.4697        ,  0.666666666667],
                         [ 0.5303        ,  0.5303        ,  0.333333333333],
                         [ 0.4135        ,  0.2669        ,  0.1191        ],
                         [ 0.2669        ,  0.4135        ,  0.547567      ],
                         [ 0.7331        ,  0.1466        ,  0.785767      ],
                         [ 0.5865        ,  0.8534        ,  0.214233      ],
                         [ 0.8534        ,  0.5865        ,  0.452433      ],
                         [ 0.1466        ,  0.7331        ,  0.8809        ]]

# Set up configuration
reference_configuration = BulkConfiguration(
    bravais_lattice=lattice,
    elements=elements,
    fractional_coordinates=fractional_coordinates
)

# Define calculator for E/F/S data calculations.
calculator = LCAOCalculator()

# In this specific example, the default displacement protocol for crystals is used
training_sets = crystalTrainingRandomDisplacements(
    reference_configuration,
    supercell_repetitions_list=[(2, 2, 2), (3, 3, 3)],
    sample_size_per_stage=10,
)

# Set up training data generation.
force_field_training_set_generator = ForceFieldTrainingSetGenerator(
    filename='study_training_data.hdf5',
    object_id='fftsg',
    training_sets=training_sets,
    calculator=calculator,
    calculate_stress=True,
)
force_field_training_set_generator.update()

# Retrieve TrainingSet labeled by DFT calculations - here using the
# ForceFieldTrainingSetGenerator API directly.
generated_trainingset = force_field_training_set_generator.generatedTrainingSet()

# This TrainingSet can then be used for training machine-learned force fields.
# For example, an MTP can be trained with this TrainingSet by:

# Set up non-linear coefficients with optimization.
non_linear_coefficients_parameters = NonLinearCoefficientsParameters(
    perform_optimization=True,
)

# Set up parameters to use in the MTP fitting.
fitting_parameters = MomentTensorPotentialFittingParameters(
    basis_size=PredefinedBasisSmall,
    outer_cutoff_radii=3.0*Angstrom,
    mtp_filename='mtp_study.mtp',
    non_linear_coefficients_parameters=non_linear_coefficients_parameters,
)

# Set up and run the MTP training.
machine_learned_force_field_trainer = MachineLearnedForceFieldTrainer(
    fitting_parameters=fitting_parameters,
    training_sets=generated_trainingset,
    calculator=calculator,
    train_test_split=0.8,
)
machine_learned_force_field_trainer.train()

ForceFieldTrainingSetGenerator_example1.py

Several objects holding pre-calculated training data from different reference calculators can be loaded and passed to another ForceFieldTrainingSetGenerator object, which recalculates the data with a new reference calculator where requested:

training_sets = []

# Load another ForceFieldTrainingSetGenerator (or old MomentTensorPotentialTraining) object with precalculated data.
# For the configurations in this dataset, we keep the energy, forces, and stress data, if available.
mtp_training_data_input = nlread('mtp_training_data.hdf5', ForceFieldTrainingSetGenerator)[0]
training_sets.append(
    TrainingSet(mtp_training_data_input, recalculate_training_data=False)
)

# Load a Trajectory object with data from another calculator.
# For the configurations in this dataset, we recalculate the training data with the new calculator.
trajectory_training_data_input = nlread('trajectory_training_data.hdf5', Trajectory)[0]
training_sets.append(
    TrainingSet(trajectory_training_data_input, recalculate_training_data=True)
)

# Define the new reference calculator. It is reused for the MTP training below.
calculator = LCAOCalculator()

# Set up training data generation.
force_field_training_set_generator = ForceFieldTrainingSetGenerator(
    filename='study_training_data.hdf5',
    object_id='fftsg',
    training_sets=training_sets,
    calculator=calculator,
    calculate_stress=True,
)
force_field_training_set_generator.update()

# Retrieve TrainingSet now labeled by compatible LCAO calculations.
generated_trainingset = force_field_training_set_generator.generatedTrainingSet()

# This TrainingSet can then be used for training machine-learned force fields.
# For example, an MTP can be trained with this TrainingSet by:

# Set up non-linear coefficients with optimization.
non_linear_coefficients_parameters = NonLinearCoefficientsParameters(
    perform_optimization=True,
)

# Set up parameters to use in the MTP fitting.
fitting_parameters = MomentTensorPotentialFittingParameters(
    basis_size=PredefinedBasisSmall,
    outer_cutoff_radii=3.0*Angstrom,
    mtp_filename='mtp_study.mtp',
    non_linear_coefficients_parameters=non_linear_coefficients_parameters,
)

# Set up and run the MTP training.
machine_learned_force_field_trainer = MachineLearnedForceFieldTrainer(
    fitting_parameters=fitting_parameters,
    training_sets=generated_trainingset,
    calculator=calculator,
    train_test_split=0.8,
)
machine_learned_force_field_trainer.train()

ForceFieldTrainingSetGenerator_example2.py

Notes¶

Note

Study objects behave differently from analysis objects. See the Study object overview for more details.

This class implements generation and calculation of training data for Machine-Learned Force-Field (MLFF) models such as Moment Tensor Potential [1] and MACE [2][3].

To generate and calculate training data, different possibilities are provided:

RandomDisplacementsParameters or crystalTrainingRandomDisplacements: A series of random atomic displacements and strain to sample the phase space around an equilibrium configuration.
MolecularDynamicsSnapshotsParameters: Snapshots from molecular dynamics simulations run with either the final reference calculator or a different calculator, such as a force field or a fast high-throughput DFT calculator. In the latter case, energies, forces, and stress are recalculated using the reference calculator for a subset of the snapshots from the MD trajectory.
TrainingSet: A set of available training configurations with or without pre-calculated reference energy, forces, and stress. If no reference data is provided, it will automatically be recalculated using the given reference calculator.
CrystalInterfaceTrainingParameters: Used to generate training configurations by building interfaces from two crystalline bulk materials.
MolecularConfigurationsParameters: Class for generating a set of training configurations using a combination of sampling different torsion angles and atomic displacements.
AlloyTrainingParameters: Class for generating a set of alloy training configurations.

The calculation of reference data in the ForceFieldTrainingSetGenerator object can be parallelized efficiently over different process groups via the keyword number_of_processes_per_task.
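
How a pool of processes might be divided into groups can be sketched as follows. This is a simplified model under stated assumptions, not the scheduler used by the study. The function process_groups is hypothetical, and it only reproduces the documented behaviour: None means all processes work on each task together, and an uneven division leaves one smaller group.

```python
def process_groups(total_processes, number_of_processes_per_task=None):
    """Return the sizes of the process groups that execute tasks concurrently."""
    if number_of_processes_per_task is None:
        # All available processes execute each task collaboratively.
        return [total_processes]
    full_groups, remainder = divmod(total_processes, number_of_processes_per_task)
    groups = [number_of_processes_per_task] * full_groups
    if remainder:
        # The leftover processes form one group smaller than requested.
        groups.append(remainder)
    return groups


single_group = process_groups(8)
uneven_groups = process_groups(10, 4)
even_groups = process_groups(8, 2)
```

In this model, 10 processes with 4 processes per task give two full groups and one group of 2, so three tasks can run at the same time.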