TrainingSet

Included in QATK.MLFF

class TrainingSet(configurations, sample_size=None, calculator=None, recalculate_training_data=None, log_filename_prefix=None, data_tag=None)

Class for storing a set of training configurations.

Parameters:
  • configurations (MoleculeConfiguration | BulkConfiguration | sequence of [MoleculeConfiguration | BulkConfiguration] | Trajectory | MDTrajectory | ConfigurationDataContainer | MomentTensorPotentialTraining | SurfaceProcessSimulation | NudgedElasticBand | Table | ForceFieldTrainingSetGenerator) –

    The training set, either as a sequence of individual configurations or stored within a trajectory object. When using a trajectory object, energy, force and stress data may also be included.

    If a Table is provided, it must contain one InstanceColumn with configurations (BulkConfiguration or MoleculeConfiguration). Any additional columns in the table will be stored as additional properties with the training data. Property columns can contain:

    • Configuration-level scalars: FloatColumn, IntegerColumn, QuantityColumn with one value per configuration, e.g., band gaps or formation energies

    • Atom-wise scalars: InstanceColumn containing arrays with shape (n_atoms,), one array per configuration

  • sample_size (int) – The number of configurations to use from the provided list, evenly spaced.
    Default: All configurations.

  • calculator (Calculator) – The calculator that was used to calculate the energy, force and stress data on the trajectory object, if present.
    Default: energy, force and stress data is ignored.

  • recalculate_training_data (bool | None) – Flag to enforce or avoid recalculation of training data. If set, this flag will take precedence over the calculator. If not set, the data will be automatically recalculated if the specified calculator is different from the reference calculator in the MomentTensorPotentialTraining object.
    Default: None

  • log_filename_prefix (str) – Filename prefix for the logging output of the tasks associated with this set.
    Default: Defined by the MomentTensorPotentialTraining object.

  • data_tag (str) – Label for this training set to enable selection of different data in MTP fitting.
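
The effect of sample_size described above can be illustrated with a small standalone sketch. The exact selection rule used by TrainingSet is not stated here, so the helper below (evenly_spaced_indices, a hypothetical name) shows only one plausible way of picking evenly spaced indices; it does not use any QuantumATK API.

```python
def evenly_spaced_indices(n_configurations, sample_size=None):
    """Return indices of an evenly spaced subset of range(n_configurations).

    Illustration only: the rule actually used by TrainingSet may differ.
    """
    if sample_size is None or sample_size >= n_configurations:
        # Default: use all configurations.
        return list(range(n_configurations))
    if sample_size < 1:
        raise ValueError("sample_size must be a positive integer")
    if sample_size == 1:
        return [0]
    # Spread the picks from the first to the last index (integer arithmetic).
    last = n_configurations - 1
    return [(i * last) // (sample_size - 1) for i in range(sample_size)]


print(evenly_spaced_indices(10, 4))  # [0, 3, 6, 9]
```

With this rule the first and last configurations are always included, which is an assumption of the sketch rather than documented behavior.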

additionalProperties()
Returns:

Dictionary of additional properties stored with the training data, or None if no additional properties were specified. Properties are retrieved from the ConfigurationDataContainer if available.

Return type:

dict of (str: list) | None

additionalProperty(property_name, index=None)

Retrieve an additional property from the training set.

Parameters:
  • property_name (str) – Name of the property to retrieve.

  • index (int | None) – Index of the configuration for which to retrieve the property. If None, returns all property values.
    Default: None

Returns:

The property value(s) as numpy array or PhysicalQuantity, or None if property doesn’t exist. When index is None, returns a list of values.

Return type:

numpy.ndarray | PhysicalQuantity | list | None

additionalPropertyNames()

Get the names of all additional properties stored in this training set.

Returns:

List of property names in sorted order, or empty list if no properties are stored.

Return type:

list of str

calculator()
Returns:

The calculator that was used to calculate the energy, force and stress data on the trajectory object, or None if energy, force and stress data is not present or should be ignored.

Return type:

Calculator | None

classmethod concatenate(training_sets, ignore_calculator=False)

Create a new training set by concatenating multiple training sets.

This method creates a new TrainingSet object containing all configurations from all input training sets, without modifying the original training sets.

Parameters:
  • training_sets (sequence of TrainingSet) – A sequence of training sets to concatenate. Must contain at least two training sets.

  • ignore_calculator (bool) – Flag to ignore differences in the calculators of the training sets.
    Default: False

Returns:

A new TrainingSet containing all configurations from all input training sets.

Return type:

TrainingSet

configurations()
Returns:

The configurations in the training set. If the configurations have the same elements they are returned as an MDTrajectory, otherwise the configurations are returned as a ConfigurationDataContainer.

Return type:

ConfigurationDataContainer | MDTrajectory

dataTag()
Returns:

The selection tag added to the data in the training set.

Return type:

str

extend(other, ignore_calculator=False)

Extend the training set with another training set.

Parameters:
  • other (TrainingSet) – The other training set to append to this one.

  • ignore_calculator (bool) – Flag to ignore differences in the calculators of the two training sets.
    Default: False

logFilenameIdentifier()
Returns:

Filename identifier for the logging output of the tasks associated with this set, or None if it hasn’t been set yet.

Return type:

str | None

logFilenamePrefix()
Returns:

Filename prefix for the logging output of the tasks associated with this set, or None if it is to be defined by the MomentTensorPotentialTraining object.

Return type:

str | LogToStdOut | None

nlinfo()
Returns:

The training set information.

Return type:

dict

recalculateTrainingData()
Returns:

A flag signaling whether the training data should be recalculated or not.

Return type:

bool

referenceConfigurations()
Returns:

The list of reference configurations that identify the training set.

Return type:

list of [MoleculeConfiguration | BulkConfiguration]

sampleSize()
Returns:

The number of training configurations for each combination of list parameters.

Return type:

int

classmethod supportedConfigurationTypes()

Return a list of supported configurations.

Returns:

List of supported configurations.

Return type:

list

trainTestSplit(train_size=0.9, random_seed=None, shuffle=True)

Split the training set into training and test sets.

Parameters:
  • train_size (float | int) – Fraction of the dataset to use for training (between 0 and 1), or absolute number of training samples (integer).
    Default: 0.9

  • random_seed (int | None) – Random seed for reproducible splits. If None, the split will be random.
    Default: None

  • shuffle (bool) – Whether to shuffle the data before splitting. If False, the first train_size samples will be used for training.
    Default: True

Returns:

A tuple of (train_set, test_set) as TrainingSet objects.

Return type:

tuple of (TrainingSet, TrainingSet)
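
The meaning of the train_size, random_seed and shuffle arguments can be sketched on plain index lists. This is not the QuantumATK implementation: the function name train_test_split_indices is hypothetical, and the rounding of a fractional train_size is an assumption.

```python
import random


def train_test_split_indices(n_samples, train_size=0.9, random_seed=None, shuffle=True):
    """Split range(n_samples) into (train_indices, test_indices)."""
    if isinstance(train_size, float):
        if not 0.0 < train_size < 1.0:
            raise ValueError("a float train_size must lie between 0 and 1")
        # Assumption: a fraction is rounded to the nearest whole sample count.
        n_train = round(train_size * n_samples)
    else:
        if not 0 <= train_size <= n_samples:
            raise ValueError("an integer train_size must not exceed n_samples")
        n_train = train_size
    indices = list(range(n_samples))
    if shuffle:
        # A fixed seed makes the shuffle, and hence the split, reproducible.
        random.Random(random_seed).shuffle(indices)
    # Without shuffling, the first n_train samples form the training part.
    return indices[:n_train], indices[n_train:]


train, test = train_test_split_indices(10, train_size=0.8, shuffle=False)
print(train, test)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Calling the function twice with the same random_seed gives the same split, while random_seed=None gives a different shuffle on each call.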

uniqueString()

Return a unique string representing the state of the object.

Usage Examples

Setup of an MTP training by reading pre-calculated training data and passing it as TrainingSet:

# The calculator given here is the same as the one in the ForceFieldTrainingSetGenerator
# object, and so the training data will not be re-calculated.
calculator = LCAOCalculator(exchange_correlation=HybridGGA.HSE06)
training_data = nlread('training_data_same_calculator.hdf5')[0]
training_set_same_calc = TrainingSet(training_data, calculator=calculator)

# Set up new training set generator using the precalculated, calculator compatible
# TrainingSet objects and recalculate the last TrainingSet contents with the given
# calculator.
force_field_training_set_generator = ForceFieldTrainingSetGenerator(
    filename='Combine_and_recalculate_training_data.hdf5',
    object_id='fftsg',
    training_sets=[training_set_precalc, training_set_recalc, training_set_same_calc],
    calculator=calculator,
    calculate_stress=True,
)
force_field_training_set_generator.update()

# Retrieve TrainingSet labeled by compatible DFT calculations - here using the
# ForceFieldTrainingSetGenerator API directly.
generated_trainingset = force_field_training_set_generator.generatedTrainingSet()

# This TrainingSet can then be used for training machine-learned force fields.
# For example, an MTP can be trained with this TrainingSet by:

# Set up non-linear coefficients with optimization.
non_linear_coefficients_parameters = NonLinearCoefficientsParameters(
    perform_optimization=True,
)

# Set up parameters to use in the MTP fitting.
fitting_parameters = MomentTensorPotentialFittingParameters(
    basis_size=PredefinedBasisSmall,
    outer_cutoff_radii=3.0*Angstrom,
    mtp_filename='mtp_study.mtp',
    non_linear_coefficients_parameters=non_linear_coefficients_parameters,
)

# Set up and run the MTP training.
machine_learned_force_field_trainer = MachineLearnedForceFieldTrainer(
    fitting_parameters=fitting_parameters,
    training_sets=generated_trainingset,
    calculator=calculator,
    train_test_split=0.8,
)
machine_learned_force_field_trainer.train()

TrainingSet_example.py

Here, the first training dataset is included as is, without recalculating any of the energy, force or stress values. For the second training dataset, a calculator is given. As this calculator is different to the one given in the ForceFieldTrainingSetGenerator, the training data will be recalculated. In the third training set the same calculator is given as in the ForceFieldTrainingSetGenerator object. Here, if training data is given for each configuration, the training data will not be re-calculated.

The following example shows how to save the training data from a completed ForceFieldTrainingSetGenerator object in a TrainingSet and read it again to be used in another training.

# Run the training calculations using the ForceFieldTrainingSetGenerator
force_field_training_set_generator = ForceFieldTrainingSetGenerator(
    filename='study_training_data.hdf5',
    object_id='fftsg',
    training_sets=training_sets,
    calculator=calculator,
    calculate_stress=True,
)
force_field_training_set_generator.update()

# Pack the completed ForceFieldTrainingSetGenerator object in a TrainingSet.
training_set = TrainingSet(
    force_field_training_set_generator,
    calculator=calculator,
    recalculate_training_data=False,
)

# This TrainingSet can be used directly for training machine-learned force
# fields using the MachineLearnedForceFieldTrainer class.

# It can also be saved to disk for later use or recombination with other
# TrainingSet objects.
nlsave('mlff_training_set1.hdf5', training_set)

# Read it again and combine with other training sets from disk to run a new fit.
training_sets = [
    nlread('mlff_training_set1.hdf5', TrainingSet)[0],
    nlread('mlff_training_set2.hdf5', TrainingSet)[0],
    nlread('mlff_training_set3.hdf5', TrainingSet)[0],
]

# Train machine-learned force field on combined data with fitting parameters
# set according to the type of model to be trained (e.g. MTP here).
new_mlff_training = MachineLearnedForceFieldTrainer(
    training_sets=training_sets,
    calculator=calculator,
    fitting_parameters=MomentTensorPotentialFittingParameters(),
    train_test_split=0.95,
)
new_mlff_training.train()
model_evaluator = new_mlff_training.modelEvaluator()
nlprint(model_evaluator)

Additional Properties

TrainingSet supports storing additional properties beyond the standard energy, forces, and stress data. This is particularly useful for property prediction models that learn to predict quantities like band gaps, formation energies, or atomic charges.

Additional properties can be passed using a Table with the configurations:


# Example: Creating a TrainingSet with additional properties using Table

# Assume we have a list of configurations
configurations = [...]  # List of BulkConfiguration or MoleculeConfiguration instances

# Configuration-level property (one scalar value per configuration)
bandgaps = [1.0*eV, 1.1*eV, 1.3*eV]  # One value per configuration

# Atom-wise properties (one array per configuration, with one value per atom)
# Each configuration may have a different number of atoms
atomic_charges = [
    [0.1, -0.1, 0.2, -0.2]*e,  # 4 atoms in first configuration
    [0.15, -0.15]*e,  # 2 atoms in second configuration
    [0.1, -0.1, 0.1, -0.1]*e,  # 4 atoms in third configuration
]

# Create a Table with configurations and properties
table = Table()

# Add configuration column
table.addInstanceColumn(
    key="configuration",
    title="Configuration",
    types=(BulkConfiguration, MoleculeConfiguration),
)

# Add configuration-level scalar property
table.addQuantityColumn(key="bandgap", title="Band Gap", unit=eV)

# Add atom-wise property column
table.addInstanceColumn(
    key="atomic_charges",
    title="Atomic Charges",
    types=(PhysicalQuantity,),
)

# Populate the table
for configuration, bandgap, charges in zip(configurations, bandgaps, atomic_charges):
    table.append(
        configuration=configuration,
        bandgap=bandgap,
        atomic_charges=charges,
    )

# Create TrainingSet from Table - additional properties are automatically included
training_set = TrainingSet(configurations=table)

# Access additional properties
print("Available properties:", training_set.additionalPropertyNames())

# Retrieve all values for a property
all_bandgaps = training_set.additionalProperty("bandgap")
print(f"Band gaps: {all_bandgaps}")

# Retrieve property for a specific configuration
bandgap_0 = training_set.additionalProperty("bandgap", index=0)
print(f"Band gap of first configuration: {bandgap_0}")

# Retrieve atom-wise property
charges_0 = training_set.additionalProperty("atomic_charges", index=0)
print(f"Atomic charges of first configuration: {charges_0}")

TrainingSet_additional_properties_example.py

Additional properties can be:

  • Configuration-level scalars: Single values per configuration (e.g., band gap, total energy) added via FloatColumn, IntegerColumn, or QuantityColumn

  • Atom-wise properties: Arrays with one value per atom (e.g., atomic charges, magnetic moments) added via InstanceColumn with PhysicalQuantity or numpy.ndarray type
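
The two kinds of additional properties differ only in the shape of each entry. The standalone sketch below checks that distinction on plain Python data; classify_property is a hypothetical helper, and plain floats stand in for PhysicalQuantity values.

```python
from numbers import Real


def classify_property(values, atom_counts):
    """Classify a property column as 'configuration-level' or 'atom-wise'.

    values holds one entry per configuration; atom_counts holds the number
    of atoms in each configuration.
    """
    if len(values) != len(atom_counts):
        raise ValueError("need exactly one entry per configuration")
    if all(isinstance(value, Real) for value in values):
        # One scalar per configuration, e.g. a band gap.
        return "configuration-level"
    for value, n_atoms in zip(values, atom_counts):
        # Atom-wise entries must have shape (n_atoms,).
        if isinstance(value, Real) or len(value) != n_atoms:
            raise ValueError("atom-wise entries must have one value per atom")
    return "atom-wise"


atom_counts = [4, 2, 4]
bandgaps = [1.0, 1.1, 1.3]
charges = [[0.1, -0.1, 0.2, -0.2], [0.15, -0.15], [0.1, -0.1, 0.1, -0.1]]
print(classify_property(bandgaps, atom_counts))  # configuration-level
print(classify_property(charges, atom_counts))   # atom-wise
```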

Notes

The TrainingSet class can be used to include existing configurations, MD, optimization, ConfigurationDataContainer, or SurfaceProcessSimulation trajectories, as well as pre-calculated ForceFieldTrainingSetGenerator objects in the training data of another ForceFieldTrainingSetGenerator.

This can be useful when combining existing training data from different sources or projects to fit a new machine-learned force field. It can also be used to efficiently re-calculate DFT data for existing unlabeled configurations, i.e., configurations without DFT energy, forces, and stress.

By default, training data is re-calculated using the calculator in the ForceFieldTrainingSetGenerator object. To keep the original data in the training set, set the argument recalculate_training_data to False. This stops re-calculation of data regardless of calculator settings; if any training data is missing in this case, the ForceFieldTrainingSetGenerator object raises an error. It is also possible to let the consistency of calculator settings decide whether data is re-calculated. The argument calculator takes the calculator used to generate the energy, force and stress data. If this calculator is the same as the calculator given in the ForceFieldTrainingSetGenerator object, the original data is kept and not re-calculated. If the calculators differ, or if training data is missing, the dataset is re-calculated using the ForceFieldTrainingSetGenerator calculator.
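
The decision rules in the paragraph above can be summarized as a small standalone function. This is a sketch of the described logic, not QuantumATK code: should_recalculate is a hypothetical name, calculators are represented by plain strings, and ValueError is an assumed stand-in for whatever error the generator actually raises.

```python
def should_recalculate(recalculate_training_data, set_calculator,
                       generator_calculator, data_is_complete):
    """Return True if the training data would be re-calculated.

    recalculate_training_data: True, False, or None (flag not set).
    set_calculator: calculator given to the TrainingSet, or None.
    generator_calculator: calculator of the ForceFieldTrainingSetGenerator.
    data_is_complete: whether every configuration has training data.
    """
    if recalculate_training_data is not None:
        # An explicit flag takes precedence over the calculator comparison.
        if not recalculate_training_data and not data_is_complete:
            raise ValueError("training data is missing but recalculation is disabled")
        return recalculate_training_data
    # Flag not set: keep the data only if the calculators match and no data is missing.
    return set_calculator != generator_calculator or not data_is_complete


print(should_recalculate(None, "HSE06", "HSE06", True))   # False
print(should_recalculate(None, "PBE", "HSE06", True))     # True
print(should_recalculate(False, "PBE", "HSE06", True))    # False
```

A TrainingSet created without a calculator corresponds to set_calculator=None, which never matches the generator calculator and therefore leads to re-calculation.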

It is generally recommended to save training data as TrainingSet objects. They can easily be read and combined with other TrainingSet objects in a list or Table, or by calling extend on an existing TrainingSet object with another compatible TrainingSet object. This allows flexibility and re-use of the data in other training scenarios (as shown in the example above).