import os
from typing import List
from pprint import pprint
from biokb_coconut import get_session
from biokb_coconut import models
from biokb_coconut.api import schemas
os.environ.pop("CONNECTION_STR", None) # to make sure no environment variable is used
'mysql+pymysql://biokb_user:biokb_password@localhost:3306/biokb'
Query and search data¶
BioKb-Coconut uses SQLAlchemy to define the database schema for storing chemical compound data from the Coconut database. The following diagram illustrates the main entities and their relationships:
The central entity in the data model is the Compound, which represents a chemical compound in Coconut. Other key entities include ChemicalData, Structure, Name, Relation, Reference, DatabaseAccession, Source, Comment, and Status. Each entity has its own set of attributes and relationships with other entities.
Overview¶
You can query the database using SQLAlchemy.
First import the data using the import_data function. You can skip this, if you have already done so. This will download the Coconut data files, parse them, and populate the database. Depending on your system and internet connection, this may take some time.
from biokb_coconut import import_data
import_data()
Example Query¶
The next cell builds and executes a SQLAlchemy query to fetch up to three terpenoid compounds associated with the organism Achillea millefolium that satisfy simple drug‑likeness constraints, then loads the results into a pandas DataFrame for inspection.
- Opens a database session using the get_session context manager to ensure proper cleanup.
- Constructs a select with labeled columns from Compound, Organism, and NpClassifierClass.
- Joins ORM relationships to link compounds to their organisms and NP Classifier classes.
- Applies filters:
- Organism name equals “Achillea millefolium”
- lipinski_rule_of_five_violations equals 0
- molecular_weight less than 300
- np_classifier_class_name matches “%Terpenoids%” (case‑insensitive)
- Limits the result set to 3 records to keep the query lightweight.
- Executes the statement, collects rows, and converts them into a pandas DataFrame named df.
- Displays df, which contains the columns compound_name, organism_name, and np_classifier_class_name and currently holds three entries.
from sqlalchemy import select
import pandas as pd
with get_session() as session:
stmt = (
select(
models.Compound.name.label("compound_name"),
models.Organism.name.label("organism_name"),
models.NpClassifierClass.name.label("np_classifier_class_name"),
)
.join(models.Compound.organisms)
.join(models.NpClassifierClass)
.where(
models.Organism.name == "Achillea millefolium",
models.Compound.lipinski_rule_of_five_violations == 0,
models.Compound.molecular_weight < 300,
models.NpClassifierClass.name.ilike("%Terpenoids%"),
)
.limit(3)
)
compounds = session.execute(stmt).all()
df = pd.DataFrame(compounds)
df
INFO:biokb_coconut.db.manager:Engine Engine(sqlite:////home/ceb/.biokb/biokb.db)
| compound_name | organism_name | np_classifier_class_name | |
|---|---|---|---|
| 0 | GAMMA-TERPINENE | Achillea millefolium | Monocyclic monoterpenoids |
| 1 | Artecanin | Achillea millefolium | Guaiane sesquiterpenoids |
| 2 | 2,5,5-trimethyl-1,3,6-heptatriene | Achillea millefolium | Irregular monoterpenoids |