MolecularConfigurationsParameters

Included in QATK.MLFF

class MolecularConfigurationsParameters(reference_configurations, sample_size=None, monte_carlo_calculator=None, monte_carlo_temperature=None, max_rotated_bonds=None, overlap_ratio=None, max_displacement=None, sample_cis_trans=None, generate_bulk_configurations=None, random_seed=None, log_filename_prefix=None, data_tag=None)

Class for storing parameters for generating a set of training configurations using a combination of sampling different torsion angles and atomic displacements.

Parameters:
  • reference_configurations (MoleculeConfiguration | BulkConfiguration | DeviceConfiguration | SurfaceConfiguration | sequence of configurations) – One or more reference configurations to be used to generate the training set.

  • sample_size (int) – The total number of configurations to return.
    Default: 20

  • monte_carlo_calculator (Calculator | None) – The calculator to use for the molecular dynamics simulations. Giving None disables Monte Carlo structure selection.
    Default: None.

  • monte_carlo_temperature (PhysicalQuantity of type temperature) – Temperature used in the Monte Carlo selection.
    Default: 300K.

  • max_rotated_bonds (int) – The maximum number of bonds rotated in each new configuration.
    Default: Rotate all available bonds in each configuration.

  • overlap_ratio (float) – The acceptable VDW radius overlap between two atoms.
    Default: 0.7.

  • max_displacement (PhysicalQuantity of type length) – The maximum displacement in the atomic random displacement.
    Default: 0.1 Angstrom.

  • sample_cis_trans (bool) – Whether or not cis and trans orientations of SP2 bonds are sampled.
    Default: True.

  • generate_bulk_configurations (bool) – Whether or not the returned dataset contains bulk configurations.
    Default: False.

  • random_seed (int) – The random seed used for generating the displacements.
    Default: Generated automatically.

  • log_filename_prefix (str) – Filename prefix for the logging output of the tasks associated with this set.
    Default: Defined by the MomentTensorPotentialTraining object.

  • data_tag (str) – Label for this training set to enable selection of different data in MTP fitting.

dataTag()
Returns:

The selection tag added to the data in the training set.

Return type:

str

generateBulkConfigurations()
Returns:

Whether or not the returned configurations are Bulk or Molecules.

Return type:

bool

logFilenameIdentifier()
Returns:

Filename identifier for the logging output of the tasks associated with this set, or None if it hasn’t been set yet.

Return type:

str | None

logFilenamePrefix()
Returns:

Filename prefix for the logging output of the tasks associated with this set, or None if it is to be defined by the MomentTensorPotentialTraining object.

Return type:

str | LogToStdOut | None

maxDisplacement()
Returns:

The maximum random displacement for each atom.

Return type:

PhysicalQuantity

maxRotatedBonds()
Returns:

The maximum number of bonds rotated in each trial configuration.

Return type:

int

monteCarloCalculator()
Returns:

The calculator used to calculate Monte Carlo energies.

Return type:

Calculator | None

monteCarloTemperature()
Returns:

The temperature used in the Monte Carlo selection

Return type:

PhysicalQuantity of type temperature

overlapRatio()
Returns:

The minimum acceptable VDW radii overlap ratio.

Return type:

float

randomSeed()
Returns:

The random seed used in the random number generator.

Return type:

int | None

referenceConfigurations()
Returns:

The list of reference configurations to be used to generate the training set.

Return type:

list of [MoleculeConfiguration | BulkConfiguration]

sampleCisTrans()
Returns:

Whether or not cis and trans orientations of SP2 bonds are sampled.

Return type:

bool

sampleSize()
Returns:

The number of training configurations for each combination of list parameters.

Return type:

int

strainTensors()
Returns:

The list of strains to apply to each configuration (None for no strain).

Return type:

list of [numpy.ndarray | None]

supercellRepetitionsList()
Returns:

The list of supercells to construct for each configuration, given as the number of repetitions of the bulk unit cell along the (a, b, c) directions.

Return type:

list of tuple (size 3) of int

systemSizes()
Returns:

The system sizes to be used for the training set.

Return type:

int | sequence of int

uniqueString()

Return a unique string representing the state of the object.

Usage Examples

Setup of a training set of C2F 4 and C2F 6 using MolecularConfigurationsParameters.

setVerbosity(MinimalLog)

# -------------------------------------------------------------
# Fluoroethane configuration
# -------------------------------------------------------------
# Define elements
elements = [Carbon, Carbon, Fluorine, Fluorine, Fluorine, Fluorine, Fluorine, Fluorine]

# Define coordinates
cartesian_coordinates = [[-0.003263937084, -0.            ,  0.            ],
                         [ 1.516263937084, -0.            , -0.            ],
                         [-0.469715381813,  0.            , -1.301600235398],
                         [-0.469715381813, -1.127218869426,  0.650800117699],
                         [-0.469715381813,  1.127218869426,  0.650800117699],
                         [ 1.982715381813,  1.127218869426, -0.650800117699],
                         [ 1.982715381813,  0.            ,  1.301600235398],
                         [ 1.982715381813, -1.127218869426, -0.650800117699]]*Angstrom

# Set up configuration
C2F6_configuration = MoleculeConfiguration(
    elements=elements,
    cartesian_coordinates=cartesian_coordinates
)
C2F6_configuration.findBonds()

# -------------------------------------------------------------
# Fluoroethene configuration
# -------------------------------------------------------------
# Define elements
elements = [Carbon, Carbon, Fluorine, Fluorine, Fluorine, Fluorine]

# Define coordinates
cartesian_coordinates = [[ 0.089971642147,  0.            , -0.            ],
                         [ 1.423028357853,  0.            ,  0.            ],
                         [-0.594324695679,  0.            , -1.173915650421],
                         [-0.594324695679,  0.            ,  1.173915650421],
                         [ 2.107324695679,  0.            ,  1.173915650421],
                         [ 2.107324695679,  0.            , -1.173915650421]]*Angstrom

# Set up configuration
C2F4_configuration = MoleculeConfiguration(
    elements=elements,
    cartesian_coordinates=cartesian_coordinates
)
C2F4_configuration.findBonds()

# -------------------------------------------------------------
# Specify molecular training set
# -------------------------------------------------------------
training_set = MolecularConfigurationsParameters(
    [C2F6_configuration, C2F4_configuration],
    sample_size=100,
    generate_bulk_configurations=True
)

# Generate and label training data with the DFT calculator.
force_field_training_set_generator = ForceFieldTrainingSetGenerator(
    filename='Carbon_Fluorine_training_data.hdf5',
    object_id='fftsg',
    training_sets=training_set,
    calculator=LCAOCalculator(),
    calculate_stress=True,
)
force_field_training_set_generator.update()

# The generated training set can then be obtained from the study object directly
# or by retrieving it from study object saved in the HDF5 file in separate MLFF
# model training scripts. 
generated_trainingset = force_field_training_set_generator.generatedTrainingSet()

Molecule_training_data_generation.py

This script generates 100 configurations each of C2F 4 and C2F 6. The energy, force and stress of each configuration is then calculated using density functional theory. The MolecularConfigurationsParameters object defines the recipe for creating these configurations. The ForceFieldTrainingSetGenerator class is then used to do the actual generation of the training data and labeling it according to the reference DFT calculator. The generated training set data can then be used for training machine-learned force fields via the MachineLearnedForceFieldTrainer class.

Notes

The MolecularConfigurationsParameters generates a trajectory of configurations that can be used to a train a machine-learned force field. Starting with one or more configurations as input, new configurations are generated by rotating available bond torsions and adding random displacements to the atoms. Bonds that are part of cyclic structures, such as the carbon-carbon bond in cyclohexane are not freely rotated so as to not break the ring structure. Double and triple bonds are also not freely rotated because of their typically high rotational barriers. Double bonds are rotated over a smaller range centered on the planar geometry.

To avoid larger molecules being twisted onto itself while sampling the torsions, distances between non-bonded atoms are checked for each configuration. Configurations where inter-atomic distances are shorter than the factor of the van der Waals radius given by the parameter overlap_ratio are rejected. An additional Monte Carlo selection can also be performed on configurations with rotated torsions. A calculator for evaluating the configuration energy and a Monte Carlo selection temperature must be specified. This is done using the arguments monte_carlo_calculator and monte_carlo_temperature respectively. If a calculator is not given then then Monte Carlo selection is not performed.

The number of bonds rotated in each trial structure can be set using the argument max_rotated_bonds. Limiting the number of bonds rotated at each step can make it easier to find non-overlapping structures when fitting a machine-learned force field for a polymer or other large molecule. For configurations with double bonds, both cis and trans orientations of the bond are sampled by default. Setting the argument sample_cis_trans disables this sampling and turns off bond rotation for double bonds.

In some cases, such as fitting a machine-learned force field for surface process simulations, molecular and bulk training sets may be combined together. In these cases it is more convenient for the molecular training sets to also contain bulk configurations. Setting the argument generate_bulk_configurations to True generates a training set of molecules that are placed in periodic boundary conditions with a large buffer.

After defining the MolecularConfigurationsParameters it can be used with ForceFieldTrainingSetGenerator for doing the actual generation of the training data. This data can then in turn be used as training set inputs when fitting a machine-learned force field with MachineLearnedForceFieldTrainer.