MolecularFingerprints.jl API Reference

Index of Available Functions and Types

Core Interface

MolecularFingerprints.fingerprint - Function
fingerprint(smiles::String, calc::AbstractCalculator)

Calculate the fingerprint for a single SMILES string using the provided calc.

Arguments

  • smiles: A string representing the molecule in SMILES format.
  • calc: A subtype of AbstractCalculator defining the fingerprint type.

Returns

  • A fingerprint representation (the specific type depends on calc).
fingerprint(mol::Nothing, calc::AbstractCalculator)

Handle cases where the molecule is invalid or could not be parsed.

Arguments

  • mol: A nothing value indicating an invalid molecule.
  • calc: A subtype of AbstractCalculator defining the fingerprint type.

Returns

  • A default empty fingerprint based on the calculator type.
fingerprint(smiles_list::Vector{String}, calc::AbstractCalculator)

Calculate fingerprints for a collection of SMILES strings.

This method uses multithreading to process the list. Julia fixes the number of threads at startup, so set JULIA_NUM_THREADS (or pass --threads) before launching Julia.

Arguments

  • smiles_list: A vector of SMILES strings.
  • calc: The calculator instance to apply to each molecule.

Returns

  • Vector: A collection of fingerprints, typed according to the first successful calculation.
Note

This function is thread-parallelized using Threads.@threads.
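The threading pattern can be sketched as follows. This is a minimal illustration, not the package's implementation: toy_fingerprint and batch_fingerprints are hypothetical names, and the "fingerprint" is just the string length so that the example runs without any chemistry dependencies.

```julia
using Base.Threads

# Hypothetical stand-in for a real fingerprint calculation.
toy_fingerprint(smiles::String) = UInt32(length(smiles))

function batch_fingerprints(smiles_list::Vector{String})
    results = Vector{UInt32}(undef, length(smiles_list))
    # Each iteration writes only to its own slot, so no locking is needed.
    @threads for i in eachindex(smiles_list)
        results[i] = toy_fingerprint(smiles_list[i])
    end
    return results
end

fps = batch_fingerprints(["CCO", "C1=CC=CC=C1", "O"])
println(nthreads(), " thread(s) used; results: ", fps)
```

The result order matches the input order regardless of how iterations are scheduled, because each result is stored at the index of its input.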

fingerprint(mol::MolGraph, calc::MHFP)

Calculates the MHFP fingerprint of the given molecule and returns it as a vector of UInt32 values.

For more information on the MHFP (MinHash fingerprint) and its algorithm, see the documentation of the MHFP calculator type MHFP.

Arguments

  • mol::MolGraph: Molecular graph, for which the fingerprint is to be calculated
  • calc::MHFP: MHFP calculator object, contains settings and parameters for the calculation

Example

julia> using MolecularGraph  # required to define MolGraph objects

julia> benzene = smilestomol("C1=CC=CC=C1")
{6, 6} simple molecular graph SMILESMolGraph

julia> calc = MHFP(3, 0, true, fp_size=2048, seed=42)  # radius, min_radius, rings, ...
MHFP(3, 0, true, 2048, 42, ...)

julia> fingerprint(benzene, calc)
2048-element Vector{UInt32}:
 0x48039e21
          ⋮
 0x6f0c88d1
fingerprint(mol::MolGraph, calc::ECFP{N}) where N

Generate an ECFP (Extended-Connectivity Fingerprint) for a molecule.

This function implements the Morgan/ECFP algorithm as described in the original paper and matching the RDKit implementation. It generates circular fingerprints by iteratively expanding atomic neighborhoods up to the specified radius.

Algorithm Overview

  1. Compute initial atom invariants (layer 0)
  2. For each layer up to the specified radius:
    • Expand atomic neighborhoods by one bond
    • Hash neighborhood information to create new invariants
    • Detect and eliminate duplicate neighborhoods
    • Store unique atomic environments
  3. Map all environment hashes to bit positions in the fingerprint
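Step 3 can be sketched as follows. This is a minimal illustration, assuming that the bit position is simply the hash modulo the fingerprint length; fold_hashes is a hypothetical helper and not part of the package.

```julia
# Fold 32-bit environment hashes into a fixed-length bit vector.
function fold_hashes(hashes::Vector{UInt32}, nbits::Int)
    fp = falses(nbits)
    for h in hashes
        fp[h % nbits + 1] = true   # +1 because Julia arrays are 1-based
    end
    return fp
end

# 5 and 2053 collide for nbits = 2048, so only three bits end up set.
fp = fold_hashes(UInt32[0, 5, 2053, 0xffffffff], 2048)
println(count(fp))
```

Collisions like the one above are inherent to folded fingerprints: distinct environments may share a bit, and a larger N makes this less likely.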

Arguments

  • mol::MolGraph: Input molecular graph
  • calc::ECFP{N}: ECFP calculator specifying radius and fingerprint size

Returns

  • BitVector: Binary fingerprint of length N with bits set for detected molecular features

Examples

julia> using MolecularFingerprints, MolecularGraph

julia> mol = smilestomol("CCO");  # Ethanol

julia> fp_calc = ECFP{2048}(2);   # ECFP4 with 2048 bits

julia> fp = fingerprint(mol, fp_calc);

julia> length(fp)
2048

julia> fp isa BitVector
true

References

  • Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model., 50(5), 742-754.
  • RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L257
fingerprint(mol::MolGraph, calc::TopologicalTorsion)

Get the topological torsion fingerprint of the molecule mol as a sparse integer vector. The fingerprint is calculated from the molecular structure using paths of length pathLength.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • calc::TopologicalTorsion: struct containing parameters for fingerprint computation

Returns

  • SparseVector: Fingerprint as Sparse Integer Vector of fixed length with nonzero entries set through molecular features of simple paths and cycles
fingerprint(mol::MolGraph, calc::TopologicalTorsionHashed)

Get the hashed topological torsion fingerprint of the molecule mol as a sparse integer vector. The fingerprint is calculated from the molecular structure using paths of length pathLength.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • calc::TopologicalTorsionHashed: struct containing parameters for fingerprint computation

Returns

  • SparseVector: Fingerprint as Sparse Integer Vector of length nBits with nonzero entries set through molecular features of simple paths and cycles
fingerprint(mol::MolGraph, calc::TopologicalTorsionHashedAsBitVec)

Get the topological torsion fingerprint of the molecule mol as a BitVector. The fingerprint is calculated from the molecular structure using paths of length pathLength.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • calc::TopologicalTorsionHashedAsBitVec: struct containing parameters for fingerprint computation

Returns

  • BitVector: Binary fingerprint of length nBits with bits set through molecular features of simple paths and cycles of fixed length

Type Hierarchy

Abstract Types

MolecularFingerprints.AbstractFingerprint - Type
AbstractFingerprint <: AbstractCalculator

Abstract type for calculators that produce representations of molecular features (e.g., MACCS, ECFP).

Unlike descriptors, fingerprints typically represent the presence or absence of specific substructures or patterns within a molecule.


Concrete Types

MolecularFingerprints.ECFP - Type
ECFP{N}(radius)

Extended-Connectivity Fingerprint (ECFP) calculator.

ECFPs are circular fingerprints encoding a local molecular environment around each atom up to a specified radius. This implementation closely follows the RDKit algorithm.

Fields

  • radius::R: The maximum number of bonds to traverse from each atom (default: 2)

Type Parameters

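  • N: The size of the fingerprint bit vector (1024 if omitted, as in the examples below)
  • R: The integer type of radius, taken from the argument (Int64 if radius is omitted)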

Examples

julia> ECFP()
ECFP{1024, Int64}(2)

julia> ECFP(3)
ECFP{1024, Int64}(3)

julia> ECFP{512}()
ECFP{512, Int64}(2)

julia> ECFP{2048}(Int8(3))
ECFP{2048, Int8}(3)
MolecularFingerprints.MHFP - Type
MHFP(
    radius::Int = 3,
    min_radius::Int = 1,
    rings::Bool = true;
    fp_size::Int = 2048,
    seed::Int = 42
)

MHFP (MinHash fingerprint) calculator. Contains settings and parameters for MHFP fingerprint generation.

Algorithm description

The MHFP fingerprint is a vector of UInt32's, calculated for a given molecule by:

  1. generating the "molecular shingling" of the molecule, which is a set of strings, containing:
    1. The SMILES strings of all rings in the smallest set of smallest rings (SSSR) of the molecule (optional, corresponds to setting rings=true in the MHFP calculator object),
    2. The SMILES strings of the circular substructures of radii min_radius to radius around each heavy atom of the molecule. Note: if min_radius=0, the corresponding substructures are just the atoms themselves.
  2. Hashing the molecular shingling, which consists of:
    1. Converting each string to a 32-bit integer using SHA-1 (and only using the first 32 bits of the hashed result)
    2. Applying the MinHash scheme to the set of 32-bit integers to generate the final fingerprint. The exact formula is given in the original authors' paper. It takes a vector of 32-bit integers as input and depends on two vectors a and b, each of length k, which is also the length of the resulting fingerprint vector. The two vectors are sampled at random but must be identical for fingerprints to be comparable. Note: the vectors a, b and their length k are stored in the fields of MHFP calculators, where they are named _a, _b and fp_size, respectively.
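The hashing step can be sketched as follows. This is an illustration under stated assumptions, not the package's code: it assumes the common MinHash form ((a*x + b) mod p) restricted to 32 bits, a Mersenne prime p = 2^61 - 1, and a little-endian reading of the first four SHA-1 bytes. The names shingle_to_u32 and minhash are hypothetical, and the output is not meant to reproduce the package's fingerprints.

```julia
using SHA, Random

# First four bytes of the SHA-1 digest as a UInt32 (byte order is an assumption).
function shingle_to_u32(s::AbstractString)
    d = sha1(codeunits(s))
    return UInt32(d[1]) | (UInt32(d[2]) << 8) | (UInt32(d[3]) << 16) | (UInt32(d[4]) << 24)
end

# MinHash: for each pair (a[i], b[i]) keep the minimum permuted hash value.
function minhash(values::Vector{UInt32}, a::Vector{UInt64}, b::Vector{UInt64};
                 prime::UInt64 = (UInt64(1) << 61) - 1,     # assumed constant
                 max_hash::UInt64 = (UInt64(1) << 32) - 1)  # assumed constant
    fp = Vector{UInt32}(undef, length(a))
    for i in eachindex(a)
        best = max_hash
        for x in values
            # Widen to UInt128 so that a*x + b cannot overflow before the modulo.
            h = UInt64((UInt128(a[i]) * x + b[i]) % prime) & max_hash
            best = min(best, h)
        end
        fp[i] = UInt32(best)
    end
    return fp
end

rng = MersenneTwister(42)          # same seed => same a and b => comparable fingerprints
k = 8
max_hash = (UInt64(1) << 32) - 1
a = rand(rng, UInt64(1):max_hash, k)
b = rand(rng, UInt64(0):max_hash, k)

shingling = ["c1ccccc1", "cc", "ccc"]
fp = minhash(shingle_to_u32.(shingling), a, b)
println(fp)
```

Because each entry is a minimum over the set, the result does not depend on the order of the shingling, and two fingerprints are only comparable when they were built with the same a and b.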

Parameters:

  • radius::Int: The maximum radius of circular substructures around each heavy atom of a molecule that are to be included in the fingerprint. Recommended values are 2 or 3 according to the original authors, with 3 (default) giving best results.
  • min_radius::Int: The minimum radius of circular substructures around each heavy atom of a molecule that are to be considered. Will be 1 (default) in most cases, however 0 is also valid; in this case information about the heavy atoms of the molecules is included explicitly in the fingerprints. The original paper only considers the case min_radius=1.
  • rings::Bool: If true (default), ring information is included explicitly in the fingerprints. This matches the original authors' description of the fingerprint in their paper.

Keyword arguments

  • fp_size::Int: length of the fingerprint, and therefore also the length of the random vectors a and b used in the hashing process. Default is 2048, as recommended by the original authors in their paper.
  • seed::Int: seed for the generation of the random vectors a and b which are used in the hashing process. Must be the same for comparable fingerprints. Default is 42.

Internal fields of the calculator

Also contains the fields _mersenne_prime, _max_hash, _a and _b, which are internal and cannot be set explicitly. The first two are constants; the last two are random vectors generated automatically from the given seed.

Example

julia> smiles_benzene = "C1=CC=CC=C1"
"C1=CC=CC=C1"

julia> calc = MHFP(3, 0, true, fp_size=2048, seed=42)  # radius, min_radius, rings, ...
MHFP(3, 0, true, 2048, 42, ...)

julia> fingerprint(smiles_benzene, calc)
2048-element Vector{UInt32}:
 0x48039e21
          ⋮
 0x6f0c88d1

MolecularFingerprints.MACCS - Type
MACCS(count::Bool=false, sparse::Bool=false)

MACCS (Molecular ACCess System) fingerprint calculator.

Arguments

  • count: If false, produces a boolean vector (presence/absence). If true, produces a count-based fingerprint.
  • sparse: If false, produces a dense representation. If true, produces a sparse representation.


Miscellaneous

MolecularFingerprints.AccumTuple - Type
AccumTuple(;
    bits::BitVector,
    invariant::UInt32,
    atom_index::Int
)

Internal structure for tracking and comparing atomic neighborhoods during ECFP generation.

Used to detect duplicate neighborhoods and maintain consistency with RDKit's algorithm by storing bond connectivity patterns along with invariant hashes.

Fields

  • bits::BitVector: Bit representation of the bond neighborhood
  • invariant::UInt32: Hash invariant for this neighborhood
  • atom_index::Int: Index of the central atom
MolecularFingerprints.MorganAtomEnv - Type
MorganAtomEnv(;
    code::UInt32,
    atom_id::Int,
    layer::Int
)

Internal structure representing a Morgan atom environment.

Stores the hash code, atom identifier (index), and layer/radius for each atomic environment encountered during ECFP fingerprint generation.

Fields

  • code::UInt32: Hash code representing the atomic environment
  • atom_id::Int: Identifier of the central atom
  • layer::Int: Radius/layer at which this environment was computed
MolecularFingerprints.TopologicalTorsionHashed - Type
TopologicalTorsionHashed(pathLength::Int=4, nBits::Int = 2048)

Topological Torsion fingerprint calculator.

Arguments

  • pathLength: Length of the paths in the molecular graph to consider, default is 4
  • nBits::Int: length of fingerprint vector, default is 2048

References

  • "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" by Nilakantan, Bauman and Dixon
MolecularFingerprints.TopologicalTorsionHashedAsBitVec - Type
TopologicalTorsionHashedAsBitVec(pathLength::Int=4, nBits::Int = 2048, nBitsPerEntry::Int = 4)

Topological Torsion fingerprint calculator.

nBits must be a multiple of nBitsPerEntry.

Arguments

  • pathLength: Length of the paths in the molecular graph to consider, default is 4
  • nBits::Int: length of fingerprint vector, default is 2048
  • nBitsPerEntry::Int: number of bits to use for each torsion, default is 4

References

  • "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" by Nilakantan, Bauman and Dixon
MolecularFingerprints.TTFPHelper - Method
TTFPHelper(mol::MolGraph, pathLength::Int, size::UInt64, codeFunction::F, nBits::Int = typemax(Int)) where {F}

This function loops over all simple paths of length pathLength and all cycles of length pathLength - 1 in the molecular graph. For each atom in a path it obtains a number, the "atom code". From the atom codes of a path it derives an index, and the entry of a sparse integer vector at that index is incremented. The hashed version obtains the index as TTFPCode % nBits. Since a % b = a whenever 0 <= a < b, the unhashed version uses the default nBits = typemax(Int), which turns the modulo into a no-op.
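The role of the nBits default can be sketched as follows. This is a minimal illustration with made-up codes; count_codes is a hypothetical helper, and the +1 index offset is an assumption made for Julia's 1-based arrays.

```julia
using SparseArrays

# Count codes into a sparse vector; `nbits` acts as the modulus.
function count_codes(codes::Vector{Int}, len::Int, nbits::Int = typemax(Int))
    fp = spzeros(Int, len)
    for c in codes
        fp[c % nbits + 1] += 1   # with nbits = typemax(Int) the modulo changes nothing
    end
    return fp
end

codes = [5, 2053, 5]
plain  = count_codes(codes, 4096)         # unhashed: codes are used directly
hashed = count_codes(codes, 2048, 2048)   # hashed: 2053 % 2048 == 5, so it collides with 5
println(nnz(plain), " vs ", nnz(hashed), " nonzero entries")
```

The same counting routine serves both variants; only the modulus differs, which is the design choice the default argument expresses.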

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • pathLength::Int: length of walks from the molecular graph used to calculate the fingerprint
  • size::UInt64: length of fingerprint vector
  • codeFunction::F: function which calculates the index from the path codes
  • nBits::Int: either equal to size or just a large dummy value

Returns

  • SparseVector: Sparse Integer Vector as basic topological torsion fingerprint or hashed fingerprint

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/4b92c2fa8c41410191cceae6f469b4b9fb980d2b/Code/GraphMol/Fingerprints/AtomPairs.cpp#L159
MolecularFingerprints.calculateAtomCode - Method
calculateAtomCode(degree::Int, piBond::Int, atomicNumber::Int)

Calculates an integer code for an atom from its number of non-hydrogen branches, its number of pi bonds, and its atomic number.

Arguments

  • degree::Int: number of non-hydrogen branches
  • piBond::Int: number of pi bonds
  • atomicNumber::Int: atomic number

Returns

  • UInt32: code for each atom in path from which "pathCodes" will be calculated

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L45
MolecularFingerprints.canonicalize - Method
canonicalize(pathCodes::Vector)

Canonicalization ensures that different SMILES strings of the same molecule yield the same fingerprint, as described in https://depth-first.com/articles/2021/10/06/molecular-graph-canonicalization/.

Arguments

  • pathCodes::Vector: Vertex indices of an n-path or a ring from the molecular graph

Returns

  • Bool: if true, reverse this path; if false, keep its current order
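One way to make such a decision is to compare a path with its reverse. The sketch below assumes a plain lexicographic comparison, which may differ in detail from the rule the package uses; should_reverse is a hypothetical name.

```julia
# Returns true when the reversed sequence sorts before the original one.
should_reverse(codes::Vector{<:Integer}) = isless(reverse(codes), codes)

println(should_reverse([3, 1, 2]))   # [2, 1, 3] sorts before [3, 1, 2]
println(should_reverse([1, 5, 9]))   # already the smaller of the two orders
println(should_reverse([1, 2, 1]))   # a palindrome never needs reversing
```

With this rule a path and its mirror image always end up in the same order, so both traversal directions map to one code sequence.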

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L111-L123
MolecularFingerprints.ecfp_hash - Method
ecfp_hash(v::AbstractVector{UInt32})

Generate a hash value from a vector of UInt32 values.

Iteratively combines all values in the vector using the ECFP hash combining algorithm to produce a single hash value representing the entire vector.

Arguments

  • v::AbstractVector{UInt32}: Vector of values to hash

Returns

  • UInt32: Hash value representing the input vector

Examples

julia> using MolecularFingerprints

julia> MolecularFingerprints.ecfp_hash(UInt32[1, 2, 3])
0xfb58d153

julia> MolecularFingerprints.ecfp_hash(UInt32[])
0x00000000

julia> result = MolecularFingerprints.ecfp_hash(UInt32[42, 100, 200]);

julia> result isa UInt32
true

References

Boost hash implementation, as provided by RDKit: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/RDGeneral/hash/hash.hpp

MolecularFingerprints.ecfp_hash_combine - Method
ecfp_hash_combine(seed::UInt32, value::UInt32)

Combine two hash values using the boost hash_combine algorithm.

This function implements the hash combining strategy used in RDKit's ECFP implementation, which is based on the boost C++ library's hash_combine function.

Arguments

  • seed::UInt32: Current hash seed value
  • value::UInt32: New value to combine into the hash

Returns

  • UInt32: Combined hash value

Examples

julia> using MolecularFingerprints

julia> MolecularFingerprints.ecfp_hash_combine(UInt32(0), UInt32(42))
0x9e3779e3

julia> result = MolecularFingerprints.ecfp_hash_combine(UInt32(100), UInt32(200));

julia> result isa UInt32
true
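The combining step can be sketched in a few lines. The sketch below assumes the classic boost hash_combine formula with 32-bit wrap-around arithmetic and a starting seed of 0 for the vector hash. The names hash_combine_sketch and hash_sketch are hypothetical, but under these assumptions the sketch reproduces the values shown in the examples for ecfp_hash_combine and ecfp_hash.

```julia
# Boost-style hash_combine on 32-bit values; UInt32 arithmetic wraps modulo 2^32.
function hash_combine_sketch(seed::UInt32, value::UInt32)
    return xor(seed, value + 0x9e3779b9 + (seed << 6) + (seed >> 2))
end

# Fold a whole vector into one hash, starting from seed 0.
function hash_sketch(v::AbstractVector{UInt32})
    seed = UInt32(0)
    for x in v
        seed = hash_combine_sketch(seed, x)
    end
    return seed
end

println(hash_combine_sketch(UInt32(0), UInt32(42)))   # 0x9e3779e3
println(hash_sketch(UInt32[1, 2, 3]))                 # 0xfb58d153
```

The result depends on the order of the elements, so callers that need an order-independent hash have to sort their input first.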

References

Boost hash implementation, as provided by RDKit: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/RDGeneral/hash/hash.hpp

MolecularFingerprints.getAtomCodes - Method
getAtomCodes(mol::MolGraph)

Gets vector with atom codes of each atom in the molecular graph

Arguments

  • mol::MolGraph: the molecule for which to calculate the atom codes

Returns

  • Vector: contains all atom codes of the molecular graph sorted by vertex numbers
MolecularFingerprints.getPathsOfLengthN - Method
getPathsOfLengthN(mol::MolGraph, N::Int)

Finds all simple paths of length N and cycles of length N - 1 in the Molecular Graph.

Arguments

  • mol::MolGraph: the molecule from which to extract the walks
  • N::Int: length of the walks, i.e. the number of vertices in a walk

Returns

  • Vector: contains all simple paths of length N and all cycles of length N - 1 of the molecular graph
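The path enumeration can be sketched with a depth-first search on an adjacency list. This is a simplified illustration that covers simple paths only and leaves out cycle detection; simple_paths is a hypothetical name, and the direction filter at the end is just one possible way to drop mirrored duplicates.

```julia
# Enumerate simple paths with exactly `n` vertices in an undirected graph
# given as an adjacency list. Each path is found once per direction.
function simple_paths(adj::Vector{Vector{Int}}, n::Int)
    paths = Vector{Vector{Int}}()
    function extend!(path)
        if length(path) == n
            push!(paths, copy(path))
            return
        end
        for nb in adj[path[end]]
            nb in path && continue   # never revisit a vertex: keeps the path simple
            push!(path, nb)
            extend!(path)
            pop!(path)               # backtrack
        end
    end
    for v in eachindex(adj)
        extend!([v])
    end
    return paths
end

# Chain 1-2-3-4 (the heavy-atom skeleton of butane)
adj = [[2], [1, 3], [2, 4], [3]]
all4 = simple_paths(adj, 4)
one_direction = filter(p -> p[1] < p[end], all4)
println(one_direction)
```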
MolecularFingerprints.getTTFPCode - Method
getTTFPCode(pathCodes::Vector)

Calculates an integer from the atom codes of a path. This integer serves as the index of the fingerprint entry that is incremented by 1.

Arguments

  • pathCodes::Vector: contains a code generated from the atom codes of the atoms in a path

Returns

  • Vector: Vector which will be used to index basic topological torsion fingerprint

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L125-L136
MolecularFingerprints.getTTFPCodeHashed - Method
getTTFPCodeHashed(pathCodes::Vector)

Calculates an integer from the atom codes of a path. This integer is used to find the index of the hashed fingerprint entry that is incremented by 1.

Arguments

  • pathCodes::Vector: contains a code generated from the atom codes of the atoms in a path

Returns

  • Vector: Vector which will be used to index hashed topological torsion fingerprint

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L156-L167
MolecularFingerprints.getTopologicalTorsionFP - Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int, nBits::Int, nBitsPerEntry::Int)

This function transforms the sparse integer vector of the hashed fingerprint into a BitVector.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • pathLength::Int: length of walks from the molecular graph used to calculate the fingerprint
  • nBits::Int: length of fingerprint vector
  • nBitsPerEntry::Int: number of bits to use for each torsion

Returns

  • BitVector: Binary fingerprint of length nBits with bits set through molecular features of simple paths and cycles
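One common way to turn counts into bits is a unary encoding, in which every count entry owns a block of nBitsPerEntry bits. The sketch below assumes that scheme. The thresholds the package actually applies are not stated here, so this is an illustration only, and counts_to_bits is a hypothetical name.

```julia
using SparseArrays

# Unary count encoding: entry i with count c sets the first min(c, per_entry)
# bits of its block. The thresholds are an assumption made for illustration.
function counts_to_bits(counts::SparseVector{Int,Int}, per_entry::Int)
    bits = falses(length(counts) * per_entry)
    idxs, vals = findnz(counts)
    for (i, c) in zip(idxs, vals)
        for j in 1:min(c, per_entry)
            bits[(i - 1) * per_entry + j] = true
        end
    end
    return bits
end

counts = sparsevec([1, 3], [2, 9], 512)   # 512 entries * 4 bits per entry = 2048 bits
bits = counts_to_bits(counts, 4)
println(length(bits), " bits, ", count(bits), " set")
```

Under such a scheme the bit vector length is the number of count entries times nBitsPerEntry, which is consistent with the requirement that nBits be a multiple of nBitsPerEntry.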

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/4b92c2fa8c41410191cceae6f469b4b9fb980d2b/Code/GraphMol/Fingerprints/AtomPairs.cpp#L312
MolecularFingerprints.getTopologicalTorsionFP - Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int, nBits::Int)

Get the Topological Torsion Fingerprint of a molecule as a sparse Int Vector of length nBits.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • pathLength::Int: length of walks from the molecular graph used to calculate the fingerprint
  • nBits::Int: length of fingerprint vector

Returns

  • SparseVector: Fingerprint as Sparse Integer Vector of length nBits with nonzero entries set through molecular features of simple paths and cycles
MolecularFingerprints.getTopologicalTorsionFP - Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int)

Get the Topological Torsion Fingerprint of a molecule as a sparse Int Vector.

Arguments

  • mol::MolGraph: the molecule for which to calculate the fingerprint
  • pathLength::Int: length of walks from the molecular graph used to calculate the fingerprint

Returns

  • SparseVector: Fingerprint as Sparse Integer Vector of fixed length with nonzero entries set through molecular features of simple paths and cycles
MolecularFingerprints.get_atom_invariants - Method
get_atom_invariants(smiles::AbstractString)
get_atom_invariants(mol::AbstractMolGraph)

Calculate atomic invariants for ECFP fingerprint generation.

The atomic invariants are properties of an atom that don't depend on initial atom numbering, based on the Daylight atomic invariants. This implementation follows the RDKit approach.

Arguments

  • smiles::AbstractString: SMILES string representation of a molecule
  • mol::AbstractMolGraph: Molecular graph structure

Returns

  • Vector{UInt32}: Hash invariants for each atom in the molecule

Invariant Components

The computed invariants include (in order):

  1. Atomic number
  2. Total degree (number of neighbors including implicit hydrogens)
  3. Total number of hydrogens (implicit + explicit)
  4. Atomic charge
  5. Delta mass (difference from standard isotope mass)
  6. Ring membership indicator (1 if atom is in a ring, omitted otherwise)

Examples

julia> using MolecularFingerprints, MolecularGraph

julia> invariants = MolecularFingerprints.get_atom_invariants("CCO");

julia> length(invariants)  # 3 atoms: C, C, O
3

julia> all(x -> x isa UInt32, invariants)
true

References

RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L244

MolecularFingerprints.get_bond_invariants - Method
get_bond_invariants(mol::MolGraph)

Compute bond type invariants for all bonds in a molecule.

Arguments

  • mol::MolGraph: Input molecular graph

Returns

  • Vector{UInt32}: Bond type codes for each bond in the molecule

Known Issue

The edge properties provided by MolecularGraph.jl are not in the same order as in RDKit. This results in different hashes and, ultimately, in different fingerprints for larger molecules compared to RDKit. As this would require rework on the smilestomol algorithm provided by MolecularGraph.jl, a fix for this issue is currently not in scope of this project.

References

RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L126

MolecularFingerprints.handleRings - Method
handleRings(path::Vector)

Since every ring is found several times, all but one occurrence must be discarded. Only the ring that starts at the lowest-numbered vertex is kept.

Arguments

  • path::Vector: Vertex indices of a ring from the molecular graph

Returns

  • Bool: if true, keep this path; if false, discard it
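The selection rule can be sketched in one line. This is a minimal illustration of the "lowest-numbered start vertex" criterion only; keep_ring is a hypothetical name, and any further handling, such as the two traversal directions of a ring, is not covered.

```julia
# Keep a ring only in the rotation that starts at its lowest-numbered vertex.
keep_ring(path::Vector{Int}) = first(path) == minimum(path)

rotations = [[2, 5, 7], [5, 7, 2], [7, 2, 5]]   # the same ring, found three times
kept = filter(keep_ring, rotations)
println(kept)
```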
MolecularFingerprints.has_any_bond - Method
has_any_bond(mol, s1::Symbol, s2::Symbol) -> Bool

N~S - check whether there is at least one bond between atoms (s1::Symbol, s2::Symbol) in molecule (mol)

MolecularFingerprints.has_bond - Method

has_bond(mol, s1::Symbol, s2::Symbol, order::Int) -> Bool

Check whether a bond exists between atoms (s1::Symbol, s2::Symbol) with given order in molecule (mol)

MolecularFingerprints.has_path3 - Method
has_path3(mol, s1::Symbol, s2::Symbol, s3::Symbol) -> Bool

A~B~C - check whether there is a path of length 3 between atoms (s1::Symbol, s2::Symbol, s3::Symbol) in molecule (mol)

MolecularFingerprints.has_strict_path3 - Method
has_strict_path3(mol, s1::Symbol, s2::Nothing, s3::Symbol) -> Bool

A~X~C - check whether there is a path of length 3 between atoms (s1::Symbol, s3::Symbol) in molecule (mol), where X can be any atom

MolecularFingerprints.helper_copy_mol - Method
helper_copy_mol(mol::MolGraph)

Creates a copy of the given molecule, and disables automatic kekulization on update.

This serves two goals:

  • User-given molecules are not modified. This is important, as hydrogens are removed before creating the MHFP fingerprint, which is a side effect the user may not want on their given molecule. Furthermore, the molecule is equipped with custom on_init and on_update functions (see the second point below), which is not something the user should have to consider.
  • Automatic kekulization of the molecule upon modification (such as creating a subgraph) is disabled. This matters because kekulization does not always succeed on the substructures we generate, as some of their properties are considered invalid. To achieve this, the MolGraph object is given custom on_init and on_update functions: on_update omits kekulization (which it otherwise includes by default), while on_init adds kekulization so that it is performed at least once, upon initialization.

Remark

MolecularGraph.jl defines MolState{T, F1, F2} as a parametric struct, where F1 and F2 are the types of the two functions on_init and on_update, respectively. This means that the fields on_init and on_update cannot be modified in place: once a molecule is initialized, F1 and F2 have fixed types, namely typeof(<the current on_init function>) and analogously for on_update. Since the new functions have types like typeof(<the new on_init function>), replacing them in place is not possible. Instead, we create a new MolGraph object. In MolecularGraph v0.22.0 this has been changed (our current dependency is v0.21.1), and F1, F2 are no longer type parameters. If MolecularFingerprints is ported to that version in the future, the function could be simplified to mol=copy(mol); mol.state.on_init = <new_on_init_function> etc.

MolecularFingerprints.helper_custom_on_init! - Method
helper_custom_on_init!(mol::SimpleMolGraph)

Custom function defining the actions to be performed on a MolGraph on initialization. In particular, as we skip kekulization in our custom on_update! function, we add kekulization on initialization, so that it is done at least once. Other than that, the function is copied from the default smiles_on_init! from https://github.com/mojaie/MolecularGraph.jl/blob/1c4498363381cdfd6162368f33d54d67dd3f1e04/src/smarts/base.jl#L61C1-L64C4. Note that this is the function in the 0.21.1 release of MolecularGraph (which is what MolecularFingerprints uses as a dependency), and newer releases include an additional step, check_valence.

MolecularFingerprints.helper_custom_on_update! - Method
helper_custom_on_update!(mol::SimpleMolGraph)

Custom function for which actions are to be performed on a MolGraph when properties change. In particular, we skip kekulization, as it will not always be possible on our substructures (since they may have invalid molecular properties). Other than that, the function is copied from the default smiles_on_update! from https://github.com/mojaie/MolecularGraph.jl/blob/1c4498363381cdfd6162368f33d54d67dd3f1e04/src/smarts/base.jl#L66C1-L77C4. Note that this is the function in the 0.21.1 release of MolecularGraph (which is what MolecularFingerprints uses as a dependency).

MolecularFingerprints.mhfp_hash_from_molecular_shingling - Method
mhfp_hash_from_molecular_shingling(shingling::Vector{String}, calc::MHFP)

Calculate the MinHash values from a given Molecular shingling.

The given calculator contains parameters such as the length of the random vectors a and b used in the hashing scheme, as well as the seed used to generate them. The algorithm is described in more detail in the original authors' paper.

MolecularFingerprints.mhfp_shingling_from_mol - Method
mhfp_shingling_from_mol(
    mol::MolGraph,
    calc::MHFP)

Calculate the "molecular shingling" of a given molecule.

A molecular shingling is a vector of "SMILES"-strings, calculated from the ring structures and atom types of the molecule (optional), and the circular substructures around each heavy (=non-hydrogen) atom of the molecule.

Arguments

  • mol::MolGraph: the molecule for which to calculate the shingling.
  • calc::MHFP: fingerprint "calculator" object, containing the relevant parameters for the fingerprint calculation, e.g., the radii of the circular substructures to be considered and whether to include ring information explicitly in the fingerprints
MolecularFingerprints.numPiBonds - Method
numPiBonds(mol::MolGraph)

Calculates the number of pi bonds of every atom in the molecular graph

Arguments

  • mol::MolGraph: the molecule for which to calculate the number of pi bonds

Returns

  • Vector: number of pi bonds of each atom in the molecular graph sorted by vertex numbers

References

  • RDKit implementation: https://github.com/rdkit/rdkit/blob/d3d4170e7cf5513835e00eb9739aadffca6c3a4e/Code/GraphMol/Atom.cpp#L934
MolecularFingerprints.rdkit_bond_type - Method
rdkit_bond_type(bond::SMILESBond)

Convert a SMILES bond to RDKit's bond type encoding.

Maps bond properties to integer codes matching RDKit's bond type enumeration.

Arguments

  • bond::SMILESBond: Input bond object

Returns

  • Int: Bond type code (1-6 for single to hextuple, 12 for aromatic, 20 for other, 21 for zero)

Known Issues

Due to differences in the internal representation of bonds within MolecularGraph.jl, we currently only support the most common bond types (1 to 6).

References

RDKit bond types: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Bond.h#L55

MolecularFingerprints.smiles_from_circular_substructures - Method
smiles_from_circular_substructures(
    mol::MolGraph,
    radius::Int,
    min_radius::Int)

Return vector of SMILES strings of circular substructures around all atoms of a molecule.

For each atom of the given molecule, extract the substructures of radii min_radius to radius, and generate their corresponding SMILES strings.

MolecularFingerprints.smiles_from_rings - Method
smiles_from_rings(mol::MolGraph)

Return vector containing SMILES strings of all rings in the SSSR of the given molecule.

SSSR stands for the smallest set of smallest rings of the molecule.

Note: This function uses the function sssr from MolecularGraph.jl, which returns a "true" smallest set of smallest rings of the given molecule. The original implementation of the MHFP algorithm, however, uses the "symmetrized SSSR", which in some cases is non-minimal, i.e., contains an additional ring. The RDKit function for the symmetrized SSSR is not available in MolecularGraph.jl or in RDKitMinimalLib, which is why the standard SSSR is used. In most cases this has no effect, but for some molecules, such as cubane, it does.

MolecularFingerprints.smiles_to_neutralized_mol - Method
smiles_to_neutralized_mol(smiles_string::String)

Convert a SMILES string to a neutralized MolGraph instance. This function identifies the largest fragment in the SMILES string, removes charges from common organic elements, and returns the corresponding MolGraph.

Arguments

  • smiles_string: A string representing the molecule in SMILES format.

Returns

  • A MolGraph instance of the neutralized largest fragment.
MolecularFingerprints.tanimoto_similarity - Method
tanimoto_similarity(a::BitVector, b::BitVector)

Calculate the Tanimoto similarity coefficient (Jaccard index) between two fingerprints. Formula: c / (|a| + |b| - c), where |a| and |b| are the numbers of set bits in a and b, and c is the number of bits set in both.
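The formula can be sketched directly on BitVectors. This is a minimal illustration: tanimoto_sketch is a hypothetical name, and returning 0.0 for two empty fingerprints is an assumed convention that the package may handle differently.

```julia
function tanimoto_sketch(x::BitVector, y::BitVector)
    length(x) == length(y) || throw(DimensionMismatch("fingerprints differ in length"))
    c = count(x .& y)                  # bits set in both fingerprints
    denom = count(x) + count(y) - c    # bits set in at least one fingerprint
    return denom == 0 ? 0.0 : c / denom
end

x = BitVector(Bool[1, 1, 0, 1])
y = BitVector(Bool[1, 0, 0, 1])
println(tanimoto_sketch(x, y))   # 2 shared bits out of 3 set in total
```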
