Getting Started with MolecularFingerprints.jl

This guide will help you set up your environment and compute your first molecular representations.

Installation

We recommend using Julia's built-in package manager (Pkg) to manage dependencies. Choose the method that best fits your workflow:

Option 1: Sandbox (Trial)

Best for a quick "Hello World" or testing a specific feature without modifying your global environment. The temporary environment is removed when the Julia process exits.

using Pkg
Pkg.activate(temp=true)
Pkg.add(url="https://github.com/LukaszSztukiewicz/MolecularFingerprints.jl")
using MolecularFingerprints

Option 2: Project-Specific (Recommended)

Best for building reproducible research or production pipelines. Your project's direct dependencies are recorded in a Project.toml file, and the exact resolved versions are recorded in Manifest.toml.

using Pkg
Pkg.activate(".")  # use the project in the current directory
Pkg.add(url="https://github.com/LukaszSztukiewicz/MolecularFingerprints.jl")
using MolecularFingerprints

Usage

Molecular fingerprints act as the feature extraction step in a pipeline. The API is designed to be functional: you construct a Calculator (the model, for example ECFP()) and apply it to your Data by passing both to fingerprint.

Basic Pipeline

using MolecularFingerprints

# 1. Input: SMILES string (Benzene)
smiles = "C1=CC=CC=C1"

# 2. Calculators: this package implements four types of fingerprints.
# Each can be customized with parameters; here we use the default settings.
ecfp_calc = ECFP() # Extended Connectivity Fingerprints
mhfp_calc = MHFP() # MinHash Fingerprints
torsion_calc = TopologicalTorsion() # Topological Torsion Fingerprints
maccs_calc = MACCS() # MACCS Keys

# 3. Execution: Compute the fingerprint for each type
ecfp_vector = fingerprint(smiles, ecfp_calc)
mhfp_vector = fingerprint(smiles, mhfp_calc)
torsion_vector = fingerprint(smiles, torsion_calc)
maccs_vector = fingerprint(smiles, maccs_calc)

# 4. Analysis: Find indices of active features

# ECFP returns a BitVector, so findall lists the indices of the active bits
println("ECFP active bits: ", findall(ecfp_vector))

# MACCS also returns a BitVector, so findall works the same way
println("MACCS active bits: ", findall(maccs_vector))

# MHFP returns a Vector{Int64} in which every entry is non-zero,
# so all 2048 indices are listed
println("MHFP active bits: ", findall(mhfp_vector .!= 0))

# TopologicalTorsion returns a SparseArrays.SparseVector{Int32, Int64};
# findnz returns its stored indices and values, and [1] selects the indices
using SparseArrays
println("Topological Torsion active bits: ", SparseArrays.findnz(torsion_vector)[1])
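
The two lookup idioms used above can be tried without the package. The following is a minimal sketch on hand-made toy vectors; the names toy_bits and toy_sparse and their contents are illustrative assumptions, not real fingerprints. It only shows what findall and findnz return for the container types mentioned above.

```julia
using SparseArrays

# Toy stand-in for a bit fingerprint: 8 bits with positions 2, 4, and 5 set.
toy_bits = falses(8)
toy_bits[[2, 4, 5]] .= true

# findall on a BitVector returns the indices of the `true` entries.
active_bits = findall(toy_bits)

# Toy stand-in for a sparse count fingerprint of length 10
# with stored entries at indices 3 and 7.
toy_sparse = sparsevec([3, 7], Int32[1, 2], 10)

# findnz returns a tuple (indices, values) of the stored entries.
indices, counts = findnz(toy_sparse)

println("active bits: ", active_bits)
println("stored indices: ", indices)
println("stored values: ", counts)
```

Taking the first element of the findnz result, as in the pipeline above, keeps only the indices and discards the stored values.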

High-Throughput Processing

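For large datasets, the package provides a vectorized implementation that leverages multithreading. Julia only uses the threads it was started with, so launch it with multiple threads (for example, julia --threads=auto) to benefit.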

# A list of SMILES (e.g., from a CSV)
dataset = ["CCO", "C1=CC=CC=C1", "CC(=O)O"]

# Any calculator works here; ECFP with default settings is used as an example
calc = ECFP()

# The vectorized call automatically parallelizes over available threads
batch_vectors = fingerprint(dataset, calc)
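
To make the threading idea concrete, here is a minimal sketch of what parallelizing a per-molecule computation over a list can look like in plain Julia. It is not the package's implementation: toy_feature and threaded_map are hypothetical helpers, and toy_feature merely counts uppercase 'C' characters as a stand-in for a real fingerprint computation.

```julia
using Base.Threads: @threads, nthreads

# Hypothetical stand-in for a per-molecule computation.
# It is NOT the package's fingerprint function.
toy_feature(smiles::AbstractString) = count(==('C'), smiles)

# Hypothetical helper: apply `f` to every item, splitting the loop
# iterations across the threads Julia was started with.
function threaded_map(f, items)
    results = Vector{Int}(undef, length(items))
    @threads for i in eachindex(items)
        results[i] = f(items[i])  # each iteration writes to its own slot
    end
    return results
end

dataset = ["CCO", "C1=CC=CC=C1", "CC(=O)O"]
println("threads available: ", nthreads())

carbon_counts = threaded_map(toy_feature, dataset)
println("toy features: ", carbon_counts)
```

With a single thread the loop simply runs serially and produces the same result, so the thread count reported by nthreads() is worth checking before benchmarking a large batch.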

If you have never used molecular fingerprints before, see Explanation for an introduction to the concept.

For more detailed examples and advanced usage, please refer to the API Reference and tutorials on Solubility Prediction and Similarity Search.