MolecularFingerprints.jl API Reference
Index of Available Functions and Types
- MolecularFingerprints.AbstractCalculator
- MolecularFingerprints.AbstractDescriptor
- MolecularFingerprints.AbstractFingerprint
- MolecularFingerprints.AccumTuple
- MolecularFingerprints.ECFP
- MolecularFingerprints.MACCS
- MolecularFingerprints.MHFP
- MolecularFingerprints.MorganAtomEnv
- MolecularFingerprints.TopologicalTorsion
- MolecularFingerprints.TopologicalTorsionHashed
- MolecularFingerprints.TopologicalTorsionHashedAsBitVec
- MolecularFingerprints.TTFPHelper
- MolecularFingerprints.bond_order_sum
- MolecularFingerprints.calculateAtomCode
- MolecularFingerprints.canonicalize
- MolecularFingerprints.count_CH3
- MolecularFingerprints.count_atom
- MolecularFingerprints.count_neighbors
- MolecularFingerprints.ecfp_hash
- MolecularFingerprints.ecfp_hash_combine
- MolecularFingerprints.fingerprint
- MolecularFingerprints.getAtomCodes
- MolecularFingerprints.getPathsOfLengthN
- MolecularFingerprints.getTTFPCode
- MolecularFingerprints.getTTFPCodeHashed
- MolecularFingerprints.getTopologicalTorsionFP
- MolecularFingerprints.getTopologicalTorsionFP
- MolecularFingerprints.getTopologicalTorsionFP
- MolecularFingerprints.get_atom_invariants
- MolecularFingerprints.get_bond_invariants
- MolecularFingerprints.handleRings
- MolecularFingerprints.has_NH
- MolecularFingerprints.has_OH
- MolecularFingerprints.has_any_bond
- MolecularFingerprints.has_atom
- MolecularFingerprints.has_atom_in_set
- MolecularFingerprints.has_bond
- MolecularFingerprints.has_path3
- MolecularFingerprints.has_ring
- MolecularFingerprints.has_ring_of_size
- MolecularFingerprints.has_strict_path3
- MolecularFingerprints.helper_copy_mol
- MolecularFingerprints.helper_custom_on_init!
- MolecularFingerprints.helper_custom_on_update!
- MolecularFingerprints.internal_implicit_hydrogens
- MolecularFingerprints.is_CH2
- MolecularFingerprints.is_CH3
- MolecularFingerprints.max_valence
- MolecularFingerprints.mhfp_hash_from_molecular_shingling
- MolecularFingerprints.mhfp_shingling_from_mol
- MolecularFingerprints.nonH_neighbors
- MolecularFingerprints.numPiBonds
- MolecularFingerprints.rdkit_bond_type
- MolecularFingerprints.safe_atom_symbol
- MolecularFingerprints.safe_smilestomol
- MolecularFingerprints.smiles_from_atoms
- MolecularFingerprints.smiles_from_circular_substructures
- MolecularFingerprints.smiles_from_rings
- MolecularFingerprints.smiles_to_neutralized_mol
- MolecularFingerprints.tanimoto_similarity
Core Interface
MolecularFingerprints.fingerprint — Function
fingerprint(smiles::String, calc::AbstractCalculator)
Calculate the fingerprint for a single SMILES string using the provided calc.
Arguments
- smiles: A string representing the molecule in SMILES format.
- calc: A subtype of AbstractCalculator defining the fingerprint type.
Returns
- A fingerprint representation (the specific type depends on calc).
fingerprint(mol::Nothing, calc::AbstractCalculator)
Handle cases where the molecule is invalid or could not be parsed.
Arguments
- mol: A nothing value indicating an invalid molecule.
- calc: A subtype of AbstractCalculator defining the fingerprint type.
Returns
- A default empty fingerprint based on the calculator type.
fingerprint(smiles_list::Vector{String}, calc::AbstractCalculator)
Calculate fingerprints for a collection of SMILES strings.
This method uses multithreading to process the list. Ensure that JULIA_NUM_THREADS is set appropriately before running the code.
Arguments
- smiles_list: A vector of SMILES strings.
- calc: The calculator instance to apply to each molecule.
Returns
Vector: A collection of fingerprints, typed according to the first successful calculation.
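As a rough illustration of the multithreaded pattern described above, the following sketch maps a function over a list of SMILES strings with Threads.@threads. The names toy_fingerprint and threaded_map are hypothetical stand-ins and not part of MolecularFingerprints; the result vector is typed as Any for simplicity, which is an assumption and may differ from how the package types its output.

```julia
# Hypothetical stand-in for a per-molecule fingerprint routine:
# returns nothing for an empty string to mimic an unparsable input.
toy_fingerprint(s::String) = isempty(s) ? nothing : UInt32(length(s))

# Apply f to every item, one slot per iteration, so threads never write
# to the same element.
function threaded_map(f, items::Vector{String})
    results = Vector{Any}(undef, length(items))
    Threads.@threads for i in eachindex(items)
        results[i] = f(items[i])
    end
    return results
end

smiles = ["CCO", "c1ccccc1", ""]
fps = threaded_map(toy_fingerprint, smiles)
println(Threads.nthreads(), " thread(s) used; results: ", fps)
```

The number of threads is fixed when Julia starts, for example via the JULIA_NUM_THREADS environment variable or the --threads command-line flag.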
fingerprint(mol::MolGraph, calc::MHFP)
Calculates the MHFP fingerprint of the given molecule and returns it as a vector of UInt32 values.
For more information on the MHFP (MinHash fingerprint) and its algorithm, see the documentation of the MHFP calculator type MHFP.
Arguments
- mol::MolGraph: Molecular graph for which the fingerprint is to be calculated
- calc::MHFP: MHFP calculator object, contains settings and parameters for the calculation
Example
julia> using MolecularGraph # required to define MolGraph objects
julia> benzene = smilestomol("C1=CC=CC=C1")
{6, 6} simple molecular graph SMILESMolGraph
julia> calc = MHFP(3, 0, true, fp_size=2048, seed=42) # radius, min_radius, rings, ...
MHFP(3, 0, true, 2048, 42, ...)
julia> fingerprint(benzene, calc)
2048-element Vector{UInt32}:
0x48039e21
⋮
0x6f0c88d1
fingerprint(mol::MolGraph, calc::ECFP{N}) where N
Generate an ECFP (Extended-Connectivity Fingerprint) for a molecule.
This function implements the Morgan/ECFP algorithm as described in the original paper and matching the RDKit implementation. It generates circular fingerprints by iteratively expanding atomic neighborhoods up to the specified radius.
Algorithm Overview
- Compute initial atom invariants (layer 0)
- For each layer up to the specified radius:
  - Expand atomic neighborhoods by one bond
  - Hash neighborhood information to create new invariants
  - Detect and eliminate duplicate neighborhoods
  - Store unique atomic environments
- Map all environment hashes to bit positions in the fingerprint
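The last step, mapping environment hashes to bit positions, can be sketched as a simple fold by modulo. This is an illustrative assumption about how such a mapping is commonly done, not a copy of the package's code; fold_hashes is a hypothetical name.

```julia
# Fold 32-bit environment hashes into a bit vector of length n.
# The "+ 1" converts the 0-based remainder to Julia's 1-based indexing.
function fold_hashes(hashes::Vector{UInt32}, n::Int)
    bits = falses(n)
    for h in hashes
        bits[Int(h % UInt32(n)) + 1] = true
    end
    return bits
end

# 0 and 2048 collide in a 2048-bit fingerprint; 5 sets its own bit.
folded = fold_hashes(UInt32[0, 2048, 5], 2048)
println(count(folded))
```

Collisions like the one in this example are inherent to folded fingerprints: distinct environments can set the same bit.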
Arguments
- mol::MolGraph: Input molecular graph
- calc::ECFP{N}: ECFP calculator specifying radius and fingerprint size
Returns
BitVector: Binary fingerprint of length N with bits set for detected molecular features
Examples
julia> using MolecularFingerprints, MolecularGraph
julia> mol = smilestomol("CCO"); # Ethanol
julia> fp_calc = ECFP{2048}(2); # ECFP4 with 2048 bits
julia> fp = fingerprint(mol, fp_calc);
julia> length(fp)
2048
julia> fp isa BitVector
true
References
- Rogers, D., & Hahn, M. (2010). Extended-connectivity fingerprints. J. Chem. Inf. Model., 50(5), 742-754.
- RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L257
fingerprint(mol::MolGraph, calc::TopologicalTorsion)
Compute the topological torsion fingerprint of the molecule mol as a sparse integer vector. The fingerprint is calculated from the molecular structure using paths of length pathLength.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- calc::TopologicalTorsion: struct containing parameters for fingerprint computation
Returns
SparseVector: Fingerprint as Sparse Integer Vector of fixed length with nonzero entries set through molecular features of simple paths and cycles
fingerprint(mol::MolGraph, calc::TopologicalTorsionHashed)
Compute the hashed topological torsion fingerprint of the molecule mol as a sparse integer vector. The fingerprint is calculated from the molecular structure using paths of length pathLength.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- calc::TopologicalTorsionHashed: struct containing parameters for fingerprint computation
Returns
SparseVector: Fingerprint as Sparse Integer Vector of length nBits with nonzero entries set through molecular features of simple paths and cycles
fingerprint(mol::MolGraph, calc::TopologicalTorsionHashedAsBitVec)
Compute the topological torsion fingerprint of the molecule mol as a BitVector. The fingerprint is calculated from the molecular structure using paths of length pathLength.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- calc::TopologicalTorsionHashedAsBitVec: struct containing parameters for fingerprint computation
Returns
BitVector: Binary fingerprint of length nBits with bits set through molecular features of simple paths and cycles of fixed length
Type Hierarchy
Abstract Types
MolecularFingerprints.AbstractCalculator — Type
AbstractCalculator
Supertype for all molecular property calculators. Subtypes should implement specific calculation logic for molecular properties.
MolecularFingerprints.AbstractFingerprint — Type
AbstractFingerprint <: AbstractCalculator
Abstract type for calculators that produce representations of molecular features (e.g., MACCS, ECFP).
Unlike descriptors, fingerprints typically represent the presence or absence of specific substructures or patterns within a molecule.
MolecularFingerprints.AbstractDescriptor — Type
AbstractDescriptor <: AbstractCalculator
Abstract type for calculators that produce scalar or numerical molecular properties (e.g., LogP, Molecular Weight, TPSA).
Concrete Types
MolecularFingerprints.ECFP — Type
ECFP{N}(radius)
Extended-Connectivity Fingerprint (ECFP) calculator.
ECFPs are circular fingerprints encoding a local molecular environment around each atom up to a specified radius. This implementation closely follows the RDKit algorithm.
Fields
radius::R: The maximum number of bonds to traverse from each atom (default: 2)
Type Parameters
N: The size of the fingerprint bit vector
Examples
julia> ECFP()
ECFP{1024, Int64}(2)
julia> ECFP(3)
ECFP{1024, Int64}(3)
julia> ECFP{512}()
ECFP{512, Int64}(2)
julia> ECFP{2048}(Int8(3))
ECFP{2048, Int8}(3)
MolecularFingerprints.MHFP — Type
MHFP(
radius::Int = 3,
min_radius::Int = 1,
rings::Bool = true,
fp_size::Int = 2048,
seed::Int = 42
)
MHFP (MinHash fingerprint) calculator. Contains settings and parameters for MHFP fingerprint generation.
Algorithm description
The MHFP fingerprint is a vector of UInt32's, calculated for a given molecule by:
- Generating the "molecular shingling" of the molecule, which is a set of strings containing:
  - The SMILES strings of all rings in the smallest set of smallest rings (SSSR) of the molecule (optional, corresponds to setting rings=true in the MHFP calculator object),
  - The SMILES strings of the circular substructures of radii min_radius to radius around each heavy atom of the molecule. Note: if min_radius=0, the corresponding substructures are just the atoms themselves.
- Hashing the molecular shingling, which consists of:
  - Converting each string to a 32-bit integer using SHA1 (and only using the first 32 bits of the hashed result)
  - Applying the MinHash scheme to the set of 32-bit integers in order to generate the final fingerprint. The exact formula is given in the original authors' paper, but we note here that it takes a vector of 32-bit integers as input, and is furthermore dependent on two vectors a and b, each of a given length k, which is also the length of the resulting fingerprint vector. The two vectors are sampled at random, but must be the same for comparable fingerprints. Note: the vectors a, b and their length k are stored in the fields of MHFP calculators, where they are named _a, _b and fp_size, respectively.
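The hashing stage above can be sketched with Julia's standard libraries. This is a minimal illustration, not the package's implementation: the function names, the way a and b are sampled, the little-endian reading of the SHA1 digest, and the constants 2^61 - 1 and 2^32 - 1 are all assumptions chosen for the example, and the values it produces are not expected to match those of MolecularFingerprints.

```julia
using SHA, Random

# First 32 bits of the SHA1 digest, read as a little-endian integer
# (the byte order is an assumption made for this sketch).
function sha1_u32(s::AbstractString)
    d = sha1(s)
    return UInt32(d[1]) | (UInt32(d[2]) << 8) | (UInt32(d[3]) << 16) | (UInt32(d[4]) << 24)
end

# MinHash over a set of strings: for each of the k hash functions
# h_i(x) = ((a_i * x + b_i) mod prime) & max_hash, keep the minimum over all x.
function minhash_sketch(shingling, k::Int; seed::Int = 42)
    prime = (UInt64(1) << 61) - 1       # Mersenne prime 2^61 - 1 (assumed)
    max_hash = (UInt64(1) << 32) - 1    # largest 32-bit value (assumed)
    rng = MersenneTwister(seed)
    a = rand(rng, UInt64(1):max_hash, k)
    b = rand(rng, UInt64(0):max_hash, k)
    fp = fill(UInt32(max_hash), k)
    for s in shingling
        x = UInt64(sha1_u32(s))
        for i in 1:k
            # a[i] * x + b[i] stays below 2^64 because all operands are below 2^32
            h = ((a[i] * x + b[i]) % prime) & max_hash
            fp[i] = min(fp[i], UInt32(h))
        end
    end
    return fp
end

fp1 = minhash_sketch(["c1ccccc1", "C", "CC"], 64)
fp2 = minhash_sketch(["CC", "C", "c1ccccc1"], 64)
println(fp1 == fp2)  # the shingling is treated as a set, so order does not matter
```

Because a and b are derived from the seed, two fingerprints are only comparable when they were generated with the same seed and the same length k.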
Parameters:
- radius::Int: The maximum radius of circular substructures around each heavy atom of a molecule that are to be included in the fingerprint. Recommended values are 2 or 3 according to the original authors, with 3 (default) giving best results.
- min_radius::Int: The minimum radius of circular substructures around each heavy atom of a molecule that are to be considered. Will be 1 (default) in most cases, however 0 is also valid; in this case information about the heavy atoms of the molecules is included explicitly in the fingerprints. The original paper only considers the case min_radius=1.
- rings::Bool: If true (default), information about rings in the molecules is included in the fingerprints explicitly. This matches the original authors' description of the fingerprint in their paper.
Keyword arguments
- fp_size::Int: Length of the fingerprint, and therefore also the length of the random vectors a and b used in the hashing process. Default is 2048, as recommended by the original authors in their paper.
- seed::Int: Seed for generating the random vectors a and b used in the hashing process. Must be the same for comparable fingerprints. Default is 42.
Internal fields of the calculator
The calculator also contains the fields _mersenne_prime, _max_hash, _a and _b, which are internal and cannot be set explicitly. The first two are constants, and the last two are random vectors which are generated automatically based on the given seed.
Example
julia> smiles_benzene = "C1=CC=CC=C1"
"C1=CC=CC=C1"
julia> calc = MHFP(3, 0, true, fp_size=2048, seed=42) # radius, min_radius, rings, ...
MHFP(3, 0, true, 2048, 42, ...)
julia> fingerprint(smiles_benzene, calc)
2048-element Vector{UInt32}:
0x48039e21
⋮
0x6f0c88d1
References
MolecularFingerprints.MACCS — Type
MACCS(count::Bool=false, sparse::Bool=false)
MACCS (Molecular ACCess System) fingerprint calculator.
Arguments
- count: If false, produces a boolean vector (presence/absence). If true, produces a count-based fingerprint.
- sparse: If false, produces a dense representation. If true, produces a sparse representation.
References
MolecularFingerprints.TopologicalTorsion — Type
TopologicalTorsion(pathLength::Int=4)
Topological Torsion fingerprint calculator.
Arguments
pathLength: Length of the paths in the molecular graph to consider, default is 4
Miscellaneous
MolecularFingerprints.AccumTuple — Type
AccumTuple(;
bits::BitVector,
invariant::UInt32,
atom_index::Int
)
Internal structure for tracking and comparing atomic neighborhoods during ECFP generation.
Used to detect duplicate neighborhoods and maintain consistency with RDKit's algorithm by storing bond connectivity patterns along with invariant hashes.
Fields
- bits::BitVector: Bit representation of the bond neighborhood
- invariant::UInt32: Hash invariant for this neighborhood
- atom_index::Int: Index of the central atom
MolecularFingerprints.MorganAtomEnv — Type
MorganAtomEnv(;
code::UInt32,
atom_id::Int,
layer::Int
)
Internal structure representing a Morgan atom environment.
Stores the hash code, atom identifier (index), and layer/radius for each atomic environment encountered during ECFP fingerprint generation.
Fields
- code::UInt32: Hash code representing the atomic environment
- atom_id::Int: Identifier of the central atom
- layer::Int: Radius/layer at which this environment was computed
MolecularFingerprints.TopologicalTorsionHashed — Type
TopologicalTorsionHashed(pathLength::Int=4, nBits::Int = 2048)
Hashed Topological Torsion fingerprint calculator.
Arguments
- pathLength: Length of the paths in the molecular graph to consider, default is 4
- nBits::Int: length of fingerprint vector, default is 2048
References
- "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" by Nilakantan, Bauman and Dixon
MolecularFingerprints.TopologicalTorsionHashedAsBitVec — Type
TopologicalTorsionHashedAsBitVec(pathLength::Int=4, nBits::Int = 2048, nBitsPerEntry::Int = 4)
Topological Torsion fingerprint calculator that returns the hashed fingerprint as a bit vector.
nBits must be a multiple of nBitsPerEntry.
Arguments
- pathLength: Length of the paths in the molecular graph to consider, default is 4
- nBits::Int: length of fingerprint vector, default is 2048
- nBitsPerEntry::Int: number of bits to use for each torsion, default is 4
References
- "Topological Torsion: A New Molecular Descriptor for SAR Applications. Comparison with Other Descriptors" by Nilakantan, Bauman and Dixon
MolecularFingerprints.TTFPHelper — Method
TTFPHelper(mol::MolGraph, pathLength::Int, size::UInt64, codeFunction::F, nBits::Int = typemax(Int)) where {F}
Loops over all simple paths of length pathLength and all cycles of length pathLength - 1 of the molecular graph and computes a number, the "atom code", for each atom in a path. From these codes it derives the index at which an entry of a sparse integer vector is incremented. For the hashed version, the index is TTFPCode % nBits. Since a % b == a whenever 0 < a < b, the default nBits = typemax(Int) leaves the code unchanged, which is the desired behavior for the unhashed version.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- pathLength::Int: length of the walks in the molecular graph used to calculate the fingerprint
- size::UInt64: length of fingerprint vector
- codeFunction::F: function which calculates the index from the path codes
- nBits::Int: either equal to size or just a large dummy value
Returns
- SparseVector: Sparse integer vector holding the basic topological torsion fingerprint or the hashed fingerprint
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/4b92c2fa8c41410191cceae6f469b4b9fb980d2b/Code/GraphMol/Fingerprints/AtomPairs.cpp#L159
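The index computation for the hashed case can be illustrated with a small sketch that accumulates counts in a sparse vector. The function count_vector and the example codes are hypothetical; only the modulo step and the increment by one are taken from the description above.

```julia
using SparseArrays

# Accumulate path codes into a sparse count vector of length nbits.
# The "+ 1" shifts the 0-based remainder to Julia's 1-based indexing.
function count_vector(codes::Vector{UInt64}, nbits::Int)
    fp = spzeros(Int, nbits)
    for code in codes
        idx = Int(code % UInt64(nbits)) + 1
        fp[idx] += 1
    end
    return fp
end

# 5 and 2053 land on the same entry for nbits = 2048; 7 is unchanged by the modulo.
counts = count_vector(UInt64[5, 2053, 7], 2048)
println(nnz(counts), " nonzero entries")
```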
MolecularFingerprints.bond_order_sum — Method
bond_order_sum(mol, v) -> Int
Compute the sum of bond orders for atom v in molecule (mol)
MolecularFingerprints.calculateAtomCode — Method
calculateAtomCode(degree::Int, piBond::Int, atomicNumber::Int)
Calculates an integer code for an atom of a molecule from its number of non-hydrogen branches, its number of pi bonds and its atomic number
Arguments
- degree::Int: number of non-hydrogen branches
- piBond::Int: number of pi bonds
- atomicNumber::Int: atomic number
Returns
UInt32: code for each atom in path from which "pathCodes" will be calculated
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L45
MolecularFingerprints.canonicalize — Method
canonicalize(pathCodes::Vector)
Canonicalization is done to obtain the same fingerprint for different SMILES strings of the same molecule, as described in https://depth-first.com/articles/2021/10/06/molecular-graph-canonicalization/.
Arguments
- pathCodes::Vector: Vertex indices of an n-path or a ring from the molecular graph
Returns
- Bool: if true, reverse this path; if false, keep it in its current order
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L111-L123
MolecularFingerprints.count_CH3 — Method
count_CH3(mol) -> Int
Count how many CH3 groups are in molecule (mol)
MolecularFingerprints.count_atom — Method
count_atom(mol, sym::Symbol) -> Int
Count how many atoms (sym::Symbol) are in molecule (mol)
MolecularFingerprints.count_neighbors — Method
count_neighbors(mol, neigh, sym::Symbol) -> Int
Count how many neighbors (sym::Symbol) are in the neighbor list (neigh) of molecule (mol)
MolecularFingerprints.ecfp_hash — Method
ecfp_hash(v::AbstractVector{UInt32})
Generate a hash value from a vector of UInt32 values.
Iteratively combines all values in the vector using the ECFP hash combining algorithm to produce a single hash value representing the entire vector.
Arguments
v::AbstractVector{UInt32}: Vector of values to hash
Returns
UInt32: Hash value representing the input vector
Examples
julia> using MolecularFingerprints
julia> MolecularFingerprints.ecfp_hash(UInt32[1, 2, 3])
0xfb58d153
julia> MolecularFingerprints.ecfp_hash(UInt32[])
0x00000000
julia> result = MolecularFingerprints.ecfp_hash(UInt32[42, 100, 200]);
julia> result isa UInt32
true
References
Boost hash implementation, as provided by RDKit: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/RDGeneral/hash/hash.hpp
MolecularFingerprints.ecfp_hash_combine — Method
ecfp_hash_combine(seed::UInt32, value::UInt32)
Combine two hash values using the boost hash_combine algorithm.
This function implements the hash combining strategy used in RDKit's ECFP implementation, which is based on the boost C++ library's hash_combine function.
Arguments
- seed::UInt32: Current hash seed value
- value::UInt32: New value to combine into the hash
Returns
UInt32: Combined hash value
Examples
julia> using MolecularFingerprints
julia> MolecularFingerprints.ecfp_hash_combine(UInt32(0), UInt32(42))
0x9e3779e3
julia> result = MolecularFingerprints.ecfp_hash_combine(UInt32(100), UInt32(200));
julia> result isa UInt32
true
References
Boost hash implementation, as provided by RDKit: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/RDGeneral/hash/hash.hpp
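For orientation, the classic 32-bit boost hash_combine formula can be written in a few lines. The sketch below uses hypothetical names (hash_combine_sketch, hash_vector_sketch) and assumes a starting seed of zero; it reproduces the example values shown in this reference, but it is not taken from the package source.

```julia
# Classic boost-style combine on 32-bit values; UInt32 arithmetic wraps on overflow.
hash_combine_sketch(seed::UInt32, value::UInt32) =
    xor(seed, value + 0x9e3779b9 + (seed << 6) + (seed >> 2))

# Fold a whole vector into one hash, starting from a zero seed (assumed).
function hash_vector_sketch(v::AbstractVector{UInt32})
    seed = UInt32(0)
    for x in v
        seed = hash_combine_sketch(seed, x)
    end
    return seed
end

println(hash_combine_sketch(UInt32(0), UInt32(42)))   # 0x9e3779e3
println(hash_vector_sketch(UInt32[1, 2, 3]))          # 0xfb58d153
```

The constant 0x9e3779b9 is derived from the golden ratio and spreads consecutive input values across the 32-bit range.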
MolecularFingerprints.getAtomCodes — Method
getAtomCodes(mol::Graph)
Returns a vector with the atom code of each atom in the molecular graph
Arguments
mol::MolGraph: the molecule for which to calculate the atom codes
Returns
Vector: contains all atom codes of the molecular graph sorted by vertex numbers
MolecularFingerprints.getPathsOfLengthN — Method
getPathsOfLengthN(mol::MolGraph, N::Int)
Finds all simple paths of length N and cycles of length N - 1 in the molecular graph.
Arguments
- mol::MolGraph: the molecule from which to extract the walks
- N::Int: length of the walks, meaning number of vertices in walk
Returns
Vector: contains all simple paths of length N or cycles of length N - 1 of the molecular graph
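To make the notion of "simple paths with N vertices" concrete, here is a small depth-first enumeration on a plain adjacency list. It is a generic sketch, not the package's algorithm: simple_paths is a hypothetical name, the graph is a toy chain rather than a MolGraph, cycles are left out, and N is assumed to be at least 2.

```julia
# Enumerate simple paths with exactly n vertices (n >= 2) in an undirected graph.
# adj[v] lists the neighbors of vertex v.
function simple_paths(adj::Vector{Vector{Int}}, n::Int)
    paths = Vector{Vector{Int}}()
    function extend!(path)
        if length(path) == n
            # every undirected path is reached from both ends; keep one direction
            path[1] < path[end] && push!(paths, copy(path))
            return
        end
        for w in adj[path[end]]
            w in path && continue   # simple path: no vertex is visited twice
            push!(path, w)
            extend!(path)
            pop!(path)
        end
    end
    for v in eachindex(adj)
        extend!([v])
    end
    return paths
end

# Chain 1-2-3-4 (e.g. the heavy atoms of butane)
chain = [[2], [1, 3], [2, 4], [3]]
println(simple_paths(chain, 3))
```

For the chain above, the only four-vertex path is the chain itself, which corresponds to the single torsion of the toy molecule.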
MolecularFingerprints.getTTFPCode — Method
getTTFPCode(pathCodes::Vector)
Calculates an integer from the atom codes of a path. The integer serves as the index at which the fingerprint is increased by 1.
Arguments
- pathCodes::Vector: contains a code generated from the atom codes of the atoms of a path
Returns
Vector: Vector which will be used to index basic topological torsion fingerprint
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L125-L136
MolecularFingerprints.getTTFPCodeHashed — Method
getTTFPCodeHashed(pathCodes::Vector)
Calculates an integer from the atom codes of a path. The integer is used to find the index at which the hashed fingerprint is increased by 1.
Arguments
- pathCodes::Vector: contains a code generated from the atom codes of the atoms of a path
Returns
Vector: Vector which will be used to index hashed topological torsion fingerprint
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/e598f608fe620e88689efdff615beb4bc761d697/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L156-L167
MolecularFingerprints.getTopologicalTorsionFP — Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int, nBits::Int, nBitsPerEntry::Int)
Transforms the sparse integer vector of the hashed fingerprint into a BitVector.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- pathLength::Int: length of the walks in the molecular graph used to calculate the fingerprint
- nBits::Int: length of fingerprint vector
- nBitsPerEntry::Int: number of bits to use for each torsion
Returns
BitVector: Binary fingerprint of length nBits with bits set through molecular features of simple paths and cycles
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/4b92c2fa8c41410191cceae6f469b4b9fb980d2b/Code/GraphMol/Fingerprints/AtomPairs.cpp#L312
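One plausible way to turn counts into bits, using nBitsPerEntry bits per entry, is a thermometer-style encoding in which bit i of an entry is set once the count reaches i. The sketch below is an assumption about the general idea and is not copied from the package or from RDKit; counts_to_bits is a hypothetical name.

```julia
# Expand each count into bits_per_entry bits: bit i of block j is set when
# counts[j] >= i, so larger counts set more bits, saturating at bits_per_entry.
function counts_to_bits(counts::AbstractVector{<:Integer}, bits_per_entry::Int)
    bits = falses(length(counts) * bits_per_entry)
    for (j, c) in enumerate(counts)
        for i in 1:bits_per_entry
            if c >= i
                bits[(j - 1) * bits_per_entry + i] = true
            end
        end
    end
    return bits
end

# Three entries with counts 0, 2 and 5, four bits each
expanded = counts_to_bits([0, 2, 5], 4)
println(expanded)
```

Under this scheme a count vector of length nBits / nBitsPerEntry expands to exactly nBits bits, which is consistent with the requirement that nBits be a multiple of nBitsPerEntry.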
MolecularFingerprints.getTopologicalTorsionFP — Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int, nBits::Int)
Get the Topological Torsion Fingerprint of a molecule as a sparse integer vector of length nBits.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- pathLength::Int: length of the walks in the molecular graph used to calculate the fingerprint
- nBits::Int: length of fingerprint vector
Returns
SparseVector: Fingerprint as Sparse Integer Vector of length nBits with nonzero entries set through molecular features of simple paths and cycles
MolecularFingerprints.getTopologicalTorsionFP — Method
getTopologicalTorsionFP(mol::MolGraph, pathLength::Int)
Get the Topological Torsion Fingerprint of a molecule as a sparse integer vector.
Arguments
- mol::MolGraph: the molecule for which to calculate the fingerprint
- pathLength::Int: length of the walks in the molecular graph used to calculate the fingerprint
Returns
SparseVector: Fingerprint as Sparse Integer Vector of fixed length with nonzero entries set through molecular features of simple paths and cycles
MolecularFingerprints.get_atom_invariants — Method
get_atom_invariants(smiles::AbstractString)
get_atom_invariants(mol::AbstractMolGraph)
Calculate atomic invariants for ECFP fingerprint generation.
The atomic invariants are properties of an atom that don't depend on initial atom numbering, based on the Daylight atomic invariants. This implementation follows the RDKit approach.
Arguments
- smiles::AbstractString: SMILES string representation of a molecule
- mol::AbstractMolGraph: Molecular graph structure
Returns
Vector{UInt32}: Hash invariants for each atom in the molecule
Invariant Components
The computed invariants include (in order):
- Atomic number
- Total degree (number of neighbors including implicit hydrogens)
- Total number of hydrogens (implicit + explicit)
- Atomic charge
- Delta mass (difference from standard isotope mass)
- Ring membership indicator (1 if atom is in a ring, omitted otherwise)
Examples
julia> using MolecularFingerprints, MolecularGraph
julia> invariants = MolecularFingerprints.get_atom_invariants("CCO");
julia> length(invariants) # 3 atoms: C, C, O
3
julia> all(x -> x isa UInt32, invariants)
true
References
RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L244
MolecularFingerprints.get_bond_invariants — Method
get_bond_invariants(mol::MolGraph)
Compute bond type invariants for all bonds in a molecule.
Arguments
mol::MolGraph: Input molecular graph
Returns
Vector{UInt32}: Bond type codes for each bond in the molecule
Known Issue
The edge properties provided by MolecularGraph.jl are not in the same order as in RDKit. This results in different hashes and, ultimately, in different fingerprints for larger molecules compared to RDKit. As this would require rework on the smilestomol algorithm provided by MolecularGraph.jl, a fix for this issue is currently not in scope of this project.
References
RDKit implementation: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Fingerprints/MorganGenerator.cpp#L126
MolecularFingerprints.handleRings — Method
handleRings(path::Vector)
Since every ring is found several times during path enumeration, all but one copy must be discarded. Only the ring which starts at the lowest-numbered vertex is kept.
Arguments
- path::Vector: Vertex indices of a ring from the molecular graph
Returns
- Bool: if true, keep this path; if false, discard it
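The rule "keep only the ring that starts at its lowest-numbered vertex" can be expressed as a one-line predicate. The name keep_ring is hypothetical and the sketch ignores any further checks the package may perform, such as traversal direction.

```julia
# Keep a ring only when its first vertex is also its smallest vertex.
keep_ring(path::Vector{Int}) = first(path) == minimum(path)

# The same three-membered ring, listed from each of its vertices
rotations = [[1, 4, 3], [4, 3, 1], [3, 1, 4]]
println(filter(keep_ring, rotations))
```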
MolecularFingerprints.has_NH — Method
has_NH(mol) -> Bool
Check whether molecule (mol) has NH group
MolecularFingerprints.has_OH — Method
has_OH(mol) -> Bool
Check whether molecule (mol) has OH group
MolecularFingerprints.has_any_bond — Method
has_any_bond(mol, s1::Symbol, s2::Symbol) -> Bool
N~S - check whether there is at least one bond between atoms (s1::Symbol, s2::Symbol) in molecule (mol)
MolecularFingerprints.has_atom — Method
has_atom(mol, sym::Symbol) -> Bool
Check whether atom (sym::Symbol) is contained in molecule (mol)
MolecularFingerprints.has_atom_in_set — Method
has_atom_in_set(mol, syms::Set{Symbol}) -> Bool
Check whether at least one atom of molecule (mol) is in set of atoms (syms::Set{Symbol})
MolecularFingerprints.has_bond — Method
has_bond(mol, s1::Symbol, s2::Symbol, order::Int) -> Bool
Check whether a bond exists between atoms (s1::Symbol, s2::Symbol) with given order in molecule (mol)
MolecularFingerprints.has_path3 — Method
has_path3(mol, s1::Symbol, s2::Symbol, s3::Symbol) -> Bool
A~B~C - check whether there is a path of length 3 between atoms (s1::Symbol, s2::Symbol, s3::Symbol) in molecule (mol)
MolecularFingerprints.has_ring — Method
has_ring(mol) -> Bool
Check whether molecule (mol) has at least one ring
MolecularFingerprints.has_ring_of_size — Method
has_ring_of_size(mol, n::Int) -> Bool
Check whether molecule (mol) has a ring of given size (n::Int)
MolecularFingerprints.has_strict_path3 — Method
has_strict_path3(mol, s1::Symbol, s2::Nothing, s3::Symbol) -> Bool
A~X~C - check whether there is a path of length 3 between atoms (s1::Symbol, s3::Symbol) in molecule (mol), where X can be any atom
MolecularFingerprints.helper_copy_mol — Method
helper_copy_mol(mol::MolGraph)
Creates a copy of the given molecule, and disables automatic kekulization on update.
This serves two goals:
- User-given molecules are not modified. This is important, as hydrogens are removed before creating the MHFP fingerprint, which is a side effect the user may not want on their given molecule. Furthermore, the molecule is equipped with custom on_init and on_update functions (see second point below), which is not something the user should have to consider.
- Automatic kekulization of the molecule upon modification (such as creating a subgraph) is disabled. This is important as kekulization is not always successful on the substructures we generate, as some of their properties are considered invalid. This is done as follows: we give the MolGraph object custom on_init and on_update functions. The latter omits kekulization (which is otherwise included by default), while the former adds kekulization, so that it is performed at least once, upon initialization.
Remark
MolecularGraph.jl has defined MolState{T, F1, F2} as a generic type struct, where F1 and F2 are the types of the two functions on_init and on_update, respectively. This means that we cannot modify the fields on_init and on_update in-place: once a molecule is initialized, F1 and F2 have fixed types, namely typeof(<the current on_init function>) and analogously for on_update. Since the new functions have types like typeof(<the new on_init function>), replacing in-place is not possible. Instead, we create a new MolGraph object. In MolecularGraph v0.22.0, this has been changed (our current dependency is v0.21.1), and F1, F2 are no longer generic types. If MolecularFingerprints is ported to use this version in the future, this function could be simplified to mol=copy(mol); mol.state.on_init = <new_on_init_function> etc.
MolecularFingerprints.helper_custom_on_init! — Method
helper_custom_on_init!(mol::SimpleMolGraph)
Custom function defining which actions are performed on a MolGraph on initialization. In particular, as we skip kekulization in our custom on_update! function, we add kekulization on initialization, so that it has been done at least once. Other than that, the function is copied from the default smiles_on_init! from https://github.com/mojaie/MolecularGraph.jl/blob/1c4498363381cdfd6162368f33d54d67dd3f1e04/src/smarts/base.jl#L61C1-L64C4. Note that this is the function in the 0.21.1 release of MolecularGraph (which is what MolecularFingerprints uses as a dependency); newer releases include an additional step check_valence.
MolecularFingerprints.helper_custom_on_update! — Method
helper_custom_on_update!(mol::SimpleMolGraph)
Custom function defining which actions are performed on a MolGraph when properties change. In particular, we skip kekulization, as it will not always be possible on our substructures (since they may have invalid molecular properties). Other than that, the function is copied from the default smiles_on_update! from https://github.com/mojaie/MolecularGraph.jl/blob/1c4498363381cdfd6162368f33d54d67dd3f1e04/src/smarts/base.jl#L66C1-L77C4. Note that this is the function in the 0.21.1 release of MolecularGraph (which is what MolecularFingerprints uses as a dependency).
MolecularFingerprints.internal_implicit_hydrogens — Method
internal_implicit_hydrogens(mol, v) -> Int
Count how many implicit (invisible) hydrogens atom v has in molecule (mol)
MolecularFingerprints.is_CH2 — Method
is_CH2(mol, v) -> Bool
Check whether atom v is in group CH2 in molecule (mol)
MolecularFingerprints.is_CH3 — Method
is_CH3(mol, v) -> Bool
Check whether atom v is in group CH3 in molecule (mol)
MolecularFingerprints.max_valence — Method
max_valence(sym::Symbol) -> Int
Returns valence for given atom symbol (sym::Symbol)
MolecularFingerprints.mhfp_hash_from_molecular_shingling — Method
mhfp_hash_from_molecular_shingling(shingling::Vector{String}, calc::MHFP)
Calculate the MinHash values from a given molecular shingling.
The given calculator contains parameters such as the length of the random vectors a, b that are used in the hashing scheme, as well as the seed used when generating them. The algorithm is described in more detail in the original authors' paper.
MolecularFingerprints.mhfp_shingling_from_mol — Method
mhfp_shingling_from_mol(
mol::MolGraph,
calc::MHFP)
Calculate the "molecular shingling" of a given molecule.
A molecular shingling is a vector of SMILES strings, calculated from the ring structures and atom types of the molecule (optional), and the circular substructures around each heavy (=non-hydrogen) atom of the molecule.
Arguments
- mol::MolGraph: the molecule for which to calculate the shingling.
- calc::MHFP: fingerprint "calculator" object, containing the relevant parameters for the fingerprint calculation, e.g., the radii of the circular substructures to be considered and whether to include ring information explicitly in the fingerprints
MolecularFingerprints.nonH_neighbors — Method
nonH_neighbors(mol, v) -> Vector{Int}
Get neighbors of atom v in molecule (mol) which are NOT hydrogen
MolecularFingerprints.numPiBonds — Method
numPiBonds(mol::MolGraph)
Calculates the number of pi bonds of every atom in the molecular graph
Arguments
mol::MolGraph: the molecule for which to calculate the number of pi bonds
Returns
Vector: number of pi bonds of each atom in the molecular graph sorted by vertex numbers
References
- RDKit implementation: https://github.com/rdkit/rdkit/blob/d3d4170e7cf5513835e00eb9739aadffca6c3a4e/Code/GraphMol/Atom.cpp#L934
MolecularFingerprints.rdkit_bond_type — Method
rdkit_bond_type(bond::SMILESBond)
Convert a SMILES bond to RDKit's bond type encoding.
Maps bond properties to integer codes matching RDKit's bond type enumeration.
Arguments
bond::SMILESBond: Input bond object
Returns
Int: Bond type code (1-6 for single to hextuple, 12 for aromatic, 20 for other, 21 for zero)
Known Issues
Due to differences in the internal representation of bonds within MolecularGraph.jl, we currently only support the most common bond types (1 to 6).
References
RDKit bond types: https://github.com/rdkit/rdkit/blob/Release202509_4/Code/GraphMol/Bond.h#L55
MolecularFingerprints.safe_atom_symbol — Method
safe_atom_symbol(atom)
Returns the atom symbol, always as a Symbol (e.g. :C).
MolecularFingerprints.safe_smilestomol — Method
safe_smilestomol(smiles::String)
Attempts to parse a SMILES string. Returns nothing if parsing fails instead of crashing the entire thread.
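The "return nothing instead of throwing" pattern can be sketched generically with try/catch. The helper safe_apply is hypothetical, and parse(Int, ...) stands in for a SMILES parser so that the example runs without any chemistry package.

```julia
# Call f(x); if it throws, swallow the error and return nothing instead.
function safe_apply(f, x)
    try
        return f(x)
    catch
        return nothing
    end
end

to_int(s) = parse(Int, s)          # stand-in for a parser that may throw
println(safe_apply(to_int, "42"))  # 42
println(safe_apply(to_int, "C1CC") === nothing)  # true
```

Returning nothing pairs naturally with a method that dispatches on Nothing, so that invalid inputs can be handled without interrupting a batch run.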
MolecularFingerprints.smiles_from_atoms — Method
smiles_from_atoms(mol::MolGraph)
Return vector containing SMILES strings of all atoms of the given molecule.
MolecularFingerprints.smiles_from_circular_substructures — Method
smiles_from_circular_substructures(
mol::MolGraph,
radius::Int,
min_radius::Int)
Return vector of SMILES strings of circular substructures around all atoms of a molecule.
For each atom of the given molecule, extract the substructures of radii min_radius to radius, and generate their corresponding SMILES strings.
MolecularFingerprints.smiles_from_rings — Method
smiles_from_rings(mol::MolGraph)
Return vector containing SMILES strings of all rings in the SSSR of the given molecule.
SSSR stands for the smallest set of smallest rings of the molecule.
Note: This function uses the function sssr from MolecularGraph.jl, which returns a "true" smallest set of smallest rings of the given molecule. However, the original implementation of the MHFP algorithm uses the "symmetrized SSSR", which in some cases is non-minimal, i.e., contains an additional ring. The RDKit function to get the symmetrized SSSR is not available in MolecularGraph.jl or in RDKitMinimalLib, which is why the standard SSSR is used. In most cases, this will not have any effect, but for some molecules, such as cubane, it will.
MolecularFingerprints.smiles_to_neutralized_mol — Method
smiles_to_neutralized_mol(smiles_string::String)
Convert a SMILES string to a neutralized MolGraph instance. This function identifies the largest fragment in the SMILES string, removes charges from common organic elements, and returns the corresponding MolGraph.
Arguments
- smiles_string: A string representing the molecule in SMILES format.
Returns
- A MolGraph instance of the neutralized largest fragment.
MolecularFingerprints.tanimoto_similarity — Method
tanimoto_similarity(a::BitVector, b::BitVector)
Calculate the Tanimoto similarity coefficient (Jaccard index) between two fingerprints. Formula: c / (a + b - c), where a and b here denote the numbers of set bits in the two fingerprints and c is the number of bits set in both (the intersection count).
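The formula can be sketched directly on BitVectors. The function below is an illustrative version under its own name, tanimoto_sketch; in particular, returning 1.0 when both fingerprints are empty is an assumed convention and may differ from what the package does.

```julia
# Tanimoto coefficient: |A ∩ B| / (|A| + |B| - |A ∩ B|)
function tanimoto_sketch(a::BitVector, b::BitVector)
    length(a) == length(b) || throw(DimensionMismatch("fingerprints differ in length"))
    c = count(a .& b)                 # bits set in both fingerprints
    denom = count(a) + count(b) - c   # bits set in at least one fingerprint
    return denom == 0 ? 1.0 : c / denom   # empty vs. empty: assumed to be 1.0
end

x = BitVector(Bool[1, 1, 0, 0])
y = BitVector(Bool[0, 1, 1, 0])
println(tanimoto_sketch(x, y))  # one shared bit out of three set overall
```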