Similarity Search: Finding Chemically Related Molecules

In this tutorial, we will find the most chemically similar molecules in a database relative to a "query" molecule.


1. Setup and Library Loading

We will use the same core libraries: MolecularFingerprints.jl for generating descriptors and MolecularGraph.jl for parsing molecules.

using LinearAlgebra
using MolecularFingerprints
using MolecularGraph

# A small "database" of molecules
database_smiles = [
    "CCO",                # Ethanol
    "CC(=O)Oc1ccccc1C(=O)O", # Aspirin
    "c1ccccc1",           # Benzene
    "CCCCCCCC",           # Octane
    "CN1C=NC2=C1C(=O)N(C(=O)N2C)C", # Caffeine
    "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O"  # Ibuprofen
]

# Our Query Molecule: Acetaminophen (Paracetamol)
query_smiles = "CC(=O)Nc1ccc(O)cc1"

2. Fingerprinting the Database

To compare molecules, we must first map them into the same vector space. We'll use Morgan Fingerprints (ECFP4), which capture the local neighborhood of atoms.

# Convert database to molecules and then to fingerprints
db_mols = [smilestomol(s) for s in database_smiles]
featurizer = ECFP{2}(2) # Radius 2 (ECFP4 equivalent)

# Generate a list of BitVectors
db_fps = [fingerprint(m, featurizer) for m in db_mols]

# Process the Query Molecule
query_mol = smilestomol(query_smiles)
query_fp = fingerprint(query_mol, featurizer)

3. Calculating Tanimoto Similarity

In cheminformatics, the tanimoto_similarity Coefficient (or Jaccard Index) is the industry standard for comparing bit-vector fingerprints. It measures the intersection of features divided by the union.

A score of 1.0 means the fingerprints are identical; 0.0 means they share no common features.

# Function to calculate tanimoto_similarity between two BitVectors
function tanimoto_similarity_similarity(fp1, fp2)
    intersection = sum(fp1 .& fp2)
    union = sum(fp1 .| fp2)
    return intersection / union
end

# Calculate similarity between query and every item in the database
similarities = [tanimoto_similarity(query_fp, db_fp) for db_fp in db_fps]

# Pair the SMILES with their scores for easy reading
results = collect(zip(database_smiles, similarities))

4. Ranking and Results

Finally, we sort our database based on the similarity score to find the closest matches.

# Sort by similarity score in descending order
sort!(results, by = x -> x[2], rev = true)

println("Similarity Search Results for Query: $query_smiles")
println("-"^45)
for (smiles, score) in results
    println("Score: $(round(score, digits=3)) | Molecule: $smiles")
end