protclust

A Python library for working with protein sequence data, providing:

Clustering capabilities via MMseqs2
Machine learning dataset creation with cluster-aware splits
Protein sequence embeddings and feature extraction

Requirements

protclust requires MMseqs2, installed and available on the command line. Install it using one of the following methods:

Installation Options for MMseqs2

Homebrew:
```
brew install mmseqs2
```

Conda:

```
conda install -c conda-forge -c bioconda mmseqs2
```

Docker:
```
docker pull ghcr.io/soedinglab/mmseqs2
```

Static Build (the AVX2 build is shown; SSE4.1 and SSE2 builds are also available):

```
wget https://mmseqs.com/latest/mmseqs-linux-avx2.tar.gz
tar xvfz mmseqs-linux-avx2.tar.gz
export PATH=$(pwd)/mmseqs/bin/:$PATH
```

MMseqs2 must be accessible via the mmseqs command in your system's PATH. If the library cannot detect MMseqs2, it will raise an error.
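To confirm that MMseqs2 is visible before importing protclust, a quick check with the Python standard library is enough. This is a minimal sketch; `find_mmseqs` is a hypothetical helper written for this example and is not part of protclust, whose own detection logic may differ.

```python
import shutil


def find_mmseqs(command="mmseqs"):
    """Return the absolute path of the MMseqs2 executable, or None if it is not on PATH."""
    return shutil.which(command)


path = find_mmseqs()
if path is None:
    print("mmseqs not found on PATH; install MMseqs2 before using protclust")
else:
    print(f"Found MMseqs2 at {path}")
```

If the check reports that `mmseqs` is missing after a static-build install, the `export PATH=...` line most likely was not applied in the current shell session.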

Installation

You can install protclust using pip:

```
pip install protclust
```

To install from source, clone the repository and run the following from its root directory:

```
pip install -e .
```

For development, also install the testing and linting tools:

```
pip install pytest pytest-cov pre-commit ruff
```

Features

Sequence Clustering and Dataset Creation

```
import pandas as pd
from protclust import clean, cluster, split, set_verbosity

# Enable detailed logging (optional)
set_verbosity(verbose=True)

# Example data
df = pd.DataFrame({
    "id": ["seq1", "seq2", "seq3", "seq4"],
    "sequence": ["ACDEFGHIKL", "ACDEFGHIKL", "MNPQRSTVWY", "MNPQRSTVWY"]
})

# Clean data
clean_df = clean(df, sequence_col="sequence")

# Cluster sequences
clustered_df = cluster(clean_df, sequence_col="sequence", id_col="id")

# Split data into train and test sets
train_df, test_df = split(clustered_df, group_col="representative_sequence", test_size=0.3)

print("Train set:\n", train_df)
print("Test set:\n", test_df)

# Or use the combined function
from protclust import train_test_cluster_split
train_df, test_df = train_test_cluster_split(df, sequence_col="sequence", id_col="id", test_size=0.3)
```
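The point of a cluster-aware split is that no cluster contributes rows to both sets. The sketch below illustrates that idea in plain Python; it is not protclust's implementation. `group_split` is a hypothetical helper, and the `representative_sequence` values are made-up labels chosen for the example.

```python
import random
from collections import defaultdict


def group_split(rows, group_key, test_size=0.3, seed=0):
    """Split rows so that every group lands entirely in train or entirely in test."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_key]].append(row)

    # Shuffle group names reproducibly so the assignment is not order-dependent.
    names = sorted(groups)
    random.Random(seed).shuffle(names)

    target = test_size * len(rows)
    train, test = [], []
    for name in names:
        # Fill the test set with whole groups until it reaches the target size.
        if len(test) < target:
            test.extend(groups[name])
        else:
            train.extend(groups[name])
    return train, test


# Made-up cluster labels: seq1/seq2 share one cluster, seq3/seq4 another.
rows = [
    {"id": "seq1", "representative_sequence": "seq1"},
    {"id": "seq2", "representative_sequence": "seq1"},
    {"id": "seq3", "representative_sequence": "seq3"},
    {"id": "seq4", "representative_sequence": "seq3"},
]
train, test = group_split(rows, "representative_sequence", test_size=0.3)

train_groups = {r["representative_sequence"] for r in train}
test_groups = {r["representative_sequence"] for r in test}
assert train_groups.isdisjoint(test_groups)  # no cluster is shared between the sets
```

Because whole clusters are moved at once, the achieved test fraction can differ from the requested one. With two clusters of two sequences each, a requested `test_size` of 0.3 yields a 50/50 split here.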

Advanced Splitting Options

```
# These examples reuse df and clustered_df from the previous section.

# Three-way split
from protclust import train_test_val_cluster_split
train_df, val_df, test_df = train_test_val_cluster_split(
    df, sequence_col="sequence", test_size=0.2, val_size=0.1
)

# K-fold cross-validation with cluster awareness
from protclust import cluster_kfold
folds = cluster_kfold(df, sequence_col="sequence", n_splits=5)

# MILP-based splitting with property balancing
# Assumes clustered_df already contains the property columns named in balance_cols.
from protclust import milp_split
train_df, test_df = milp_split(
    clustered_df,
    group_col="representative_sequence",
    test_size=0.3,
    balance_cols=["molecular_weight", "hydrophobicity"]
)
```
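Cluster-aware k-fold follows the same principle as the two-way split: each cluster belongs to exactly one fold. A minimal sketch of one way to do this, using a greedy largest-first assignment, is shown below. `group_kfold` is a hypothetical helper and the cluster labels are invented; the return format of `cluster_kfold` is not documented here, so none is assumed.

```python
from collections import Counter


def group_kfold(groups, n_splits):
    """Assign each group label to one fold, keeping fold sizes roughly balanced.

    `groups` holds one group label per row. Returns a dict of label -> fold index.
    """
    sizes = Counter(groups)
    fold_sizes = [0] * n_splits
    assignment = {}
    # Place the largest groups first, each into the currently smallest fold.
    for label, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        fold = fold_sizes.index(min(fold_sizes))
        assignment[label] = fold
        fold_sizes[fold] += size
    return assignment


groups = ["c1", "c1", "c1", "c2", "c2", "c3", "c4"]
assignment = group_kfold(groups, n_splits=2)

for fold in range(2):
    # Rows whose cluster is assigned to this fold form the held-out set.
    test_idx = [i for i, g in enumerate(groups) if assignment[g] == fold]
    train_idx = [i for i, g in enumerate(groups) if assignment[g] != fold]
    print(f"fold {fold}: train={train_idx} test={test_idx}")
```

The greedy heuristic balances fold sizes only approximately. Balancing additional properties at the same time is the kind of problem an optimization-based approach such as `milp_split` targets.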

Protein Embeddings

```
# Basic embeddings
from protclust.embeddings import blosum62, aac, property_embedding, onehot

# Add BLOSUM62 embeddings
df_with_blosum = blosum62(df, sequence_col="sequence")

# Generate embeddings with ESM models (requires extra dependencies)
from protclust.embeddings import embed_sequences

# ESM embedding
df_with_esm = embed_sequences(df, "esm", sequence_col="sequence")

# Saving embeddings to HDF5
df_with_refs = embed_sequences(
    df,
    "esm",
    sequence_col="sequence",
    use_hdf=True,
    hdf_path="embeddings.h5"
)

# Retrieve embeddings
from protclust.embeddings import get_embeddings
embeddings = get_embeddings(df_with_refs, "esm", hdf_path="embeddings.h5")
```
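For intuition about the simpler embeddings, the sketch below computes an amino acid composition vector in plain Python. It assumes that `aac` stands for amino acid composition, which the name suggests but this README does not state. `amino_acid_composition` is a hypothetical function, and its residue ordering and handling of non-standard characters may not match the library.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues, ordered by one-letter code


def amino_acid_composition(sequence):
    """Return a 20-element list with the fraction of each standard residue."""
    sequence = sequence.upper()
    counted = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counted)
    if total == 0:
        # Empty input or no standard residues: return an all-zero vector.
        return [0.0] * len(AMINO_ACIDS)
    return [c / total for c in counted]


vector = amino_acid_composition("ACDEFGHIKL")
print(len(vector))
```

Characters outside the 20 standard residues are ignored here, so the fractions are taken over recognized residues only. The result is a fixed-length vector regardless of sequence length.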

Parameters

Common parameters for clustering functions:

df: Pandas DataFrame containing sequence data
sequence_col: Column name containing sequences
id_col: Column name containing unique identifiers
min_seq_id: Minimum sequence identity threshold (0.0-1.0, default 0.3)
coverage: Minimum alignment coverage (0.0-1.0, default 0.5)
cov_mode: Coverage mode (0-3, default 0)
test_size: Desired fraction of data in test set (default 0.2)
random_state: Random seed for reproducibility
tolerance: Acceptable deviation from desired split sizes (default 0.05)
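The three clustering thresholds correspond to options of the MMseqs2 command line: `--min-seq-id`, `-c`, and `--cov-mode`. The sketch below builds an `mmseqs easy-cluster` argument list from them. It is an illustration only: `mmseqs_cluster_args` is a hypothetical helper, the file names are placeholders, and this README does not say which MMseqs2 workflow protclust actually invokes.

```python
def mmseqs_cluster_args(fasta_path, out_prefix, tmp_dir,
                        min_seq_id=0.3, coverage=0.5, cov_mode=0):
    """Build an argument list for `mmseqs easy-cluster` from the parameters above."""
    if not 0.0 <= min_seq_id <= 1.0:
        raise ValueError("min_seq_id must be between 0.0 and 1.0")
    if not 0.0 <= coverage <= 1.0:
        raise ValueError("coverage must be between 0.0 and 1.0")
    return [
        "mmseqs", "easy-cluster", fasta_path, out_prefix, tmp_dir,
        "--min-seq-id", str(min_seq_id),  # minimum sequence identity
        "-c", str(coverage),              # minimum alignment coverage
        "--cov-mode", str(cov_mode),      # how coverage is measured
    ]


print(" ".join(mmseqs_cluster_args("input.fasta", "clusters", "tmp")))
# prints: mmseqs easy-cluster input.fasta clusters tmp --min-seq-id 0.3 -c 0.5 --cov-mode 0
```

Raising `min_seq_id` produces more, smaller clusters, which makes the train and test sets more similar to each other; lowering it gives a stricter separation.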

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Fork the repository
Create your feature branch (git checkout -b feature/amazing-feature)
Run tests (pytest tests/)
Commit your changes (git commit -m 'Add some amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use protclust in your research, please cite:

```
@software{protclust,
  author = {Michael Scutari},
  title = {protclust: Protein Sequence Clustering and ML Dataset Creation},
  url = {https://github.com/michaelscutari/protclust},
  version = {0.1.0},
  year = {2025},
}
```
