BINN - Biologically Informed Neural Network
This notebook demonstrates how a BINN can be created.
First, read some test data. This requires an input file and a pathway file, which correspond to the input layer and the intermediary (hidden) layers of the model. Optionally, a translation file can be supplied that maps the input entities to the intermediary layers.
In this example, the input layer consists of proteins identified by UniProt IDs, and the intermediary layers consist of biological pathways identified by Reactome IDs. The translation file maps the UniProt IDs to the Reactome IDs.
import pandas as pd
input_data = pd.read_csv("../data/test_data.tsv", sep="\t")
translation = pd.read_csv("../data/translation.tsv", sep="\t")
pathways = pd.read_csv("../data/pathways.tsv", sep="\t")
input_data.head()
| | PeptideSequence | Charge | Decoy | Protein | CK_P1912_146 | CK_P1912_147 | CK_P1912_148 | CK_P1912_150 | CK_P1912_151 | CK_P1912_152 | ... | TM_M2012_191 | TM_M2012_192 | TM_M2012_196 | TM_M2012_197 | TM_M2012_198 | TM_M2012_199 | TM_M2012_200 | TM_M2012_202 | TM_M2012_203 | RetentionTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VDRDVAPGTLC(UniMod:4)DVAGWGIVNHAGR | 3 | False | P00746 | 7238870.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3749.82 |
| 1 | VDRDVAPGTLC(UniMod:4)DVAGWGIVNHAGR | 4 | False | P00746 | 2681940.0 | 2634110.0 | 2297470.0 | 1935300.0 | 2181160.0 | 2615960.0 | ... | NaN | 519698.0 | NaN | NaN | NaN | NaN | NaN | 2221730.0 | NaN | 3593.61 |
| 2 | VDTVDPPYPR | 2 | False | P04004 | 28535800.0 | 34874600.0 | 34586900.0 | 25820800.0 | 24657400.0 | 30830100.0 | ... | 12486000.0 | 11995900.0 | 24003800.0 | 9802000.0 | 6933130.0 | 7297560.0 | 4328240.0 | 13002400.0 | 4716600.0 | 2502.15 |
| 3 | AVTEQGAELSNEER | 2 | False | P27348 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 340523.0 | 336960.0 | 435119.0 | 257422.0 | NaN | NaN | 1790.84 |
| 4 | VDVIPVNLPGEHGQR | 2 | False | P02751 | 652100.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3158.43 |
5 rows × 202 columns
pathways.head()
| | parent | child |
|---|---|---|
| 0 | R-HSA-109581 | R-HSA-109606 |
| 1 | R-HSA-109581 | R-HSA-169911 |
| 2 | R-HSA-109581 | R-HSA-5357769 |
| 3 | R-HSA-109581 | R-HSA-75153 |
| 4 | R-HSA-109582 | R-HSA-140877 |
translation.head()
| | Unnamed: 0 | input | translation |
|---|---|---|---|
| 0 | 1323 | A0A075B6P5 | R-HSA-166663 |
| 1 | 1324 | A0A075B6P5 | R-HSA-173623 |
| 2 | 1325 | A0A075B6P5 | R-HSA-198933 |
| 3 | 1326 | A0A075B6P5 | R-HSA-202733 |
| 4 | 1327 | A0A075B6P5 | R-HSA-2029481 |
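The translation table is simply a lookup between input entities and hidden-layer nodes. A minimal sketch of the kind of join this mapping enables, using made-up miniature tables (the values below are hypothetical, not taken from the files above):

```python
import pandas as pd

# Miniature stand-ins for input_data and translation (hypothetical values).
input_data = pd.DataFrame({"Protein": ["P00746", "P04004", "P27348"]})
translation = pd.DataFrame({
    "input": ["P00746", "P00746", "P04004", "Q99999"],
    "translation": ["R-HSA-140877", "R-HSA-109606", "R-HSA-75153", "R-HSA-166663"],
})

# Keep only mapping rows whose protein actually occurs in the data;
# Q99999 is dropped because it is absent from input_data.
mapped = translation[translation["input"].isin(input_data["Protein"])]
print(mapped["translation"].tolist())
```

Note that one protein can map to several pathways (P00746 above), so the resulting connectivity is many-to-many.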
The first step is to create the network as described above.
from binn import Network
network = Network(
input_data=input_data,
pathways=pathways,
mapping=translation,
    input_data_column="Protein",  # specify the column for entities in input data
    source_column="child",  # defined by our pathways file
    target_column="parent",
)
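Conceptually, the pathway file's parent/child edge list defines a directed graph whose levels become the model's hidden layers. A minimal sketch of that idea (plain Python, not the library's implementation), using a few edges from the table above:

```python
# Edge list: (parent, child) pairs from the pathways table.
edges = [
    ("R-HSA-109581", "R-HSA-109606"),
    ("R-HSA-109581", "R-HSA-169911"),
    ("R-HSA-109582", "R-HSA-140877"),
]

# Group children under each parent.
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

# Roots (nodes that never appear as a child) sit closest to the output layer.
all_children = {c for _, c in edges}
roots = [p for p in children if p not in all_children]
print(roots)
```

Walking this graph from the mapped input proteins up to the roots yields one set of nodes per depth, which is where the layer sizes in the model below come from.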
Thereafter we can create a BINN (model). The BINN is implemented in PyTorch Lightning and takes the network as an input argument, along with some other arguments.
from binn import BINN
binn = BINN(
network=network,
n_layers=5,
dropout=0.2,
validate=False,
)
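The core idea behind a biologically informed layer is a linear layer whose weight matrix is masked by the protein-to-pathway connectivity, so connections absent from the graph carry no weight. An illustrative NumPy sketch of that mechanism (assumed from the concept, not taken from the BINN source):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 3, 2                   # 3 proteins -> 2 pathways
mask = np.array([[1, 0, 1],          # pathway 0 <- proteins 0 and 2
                 [0, 1, 0]])         # pathway 1 <- protein 1
weights = rng.normal(size=(n_out, n_in))

x = rng.normal(size=(4, n_in))       # batch of 4 samples
out = x @ (weights * mask).T         # masked linear transform
print(out.shape)                     # (4, 2)
```

In the real model each of the `n_layers` hidden layers gets its own mask derived from one level of the pathway graph, followed by batch norm, dropout, and a Tanh activation, as the printed architecture below shows.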
binn.layers
BINN is on the device: cpu
Sequential(
  (Layer_0): Linear(in_features=448, out_features=564, bias=True)
  (BatchNorm_0): BatchNorm1d(564, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_0): Dropout(p=0.2, inplace=False)
  (Tanh 0): Tanh()
  (Layer_1): Linear(in_features=564, out_features=444, bias=True)
  (BatchNorm_1): BatchNorm1d(444, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_1): Dropout(p=0.2, inplace=False)
  (Tanh 1): Tanh()
  (Layer_2): Linear(in_features=444, out_features=285, bias=True)
  (BatchNorm_2): BatchNorm1d(285, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_2): Dropout(p=0.2, inplace=False)
  (Tanh 2): Tanh()
  (Layer_3): Linear(in_features=285, out_features=116, bias=True)
  (BatchNorm_3): BatchNorm1d(116, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_3): Dropout(p=0.2, inplace=False)
  (Tanh 3): Tanh()
  (Layer_4): Linear(in_features=116, out_features=28, bias=True)
  (BatchNorm_4): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_4): Dropout(p=0.2, inplace=False)
  (Tanh 4): Tanh()
  (Output layer): Linear(in_features=28, out_features=2, bias=True)
)
binn.trainable_params
6072
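The count of 6072 is far below what fully dense layers of the same widths would contain, which suggests only unmasked connections are counted as trainable. A quick sanity check of the dense total, using the layer widths from the printed model (the counting logic here is our sketch, not the library's code):

```python
# Layer widths read off the printed Sequential model: input, 5 hidden, output.
widths = [448, 564, 444, 285, 116, 28, 2]

# A dense Linear(i, o) has i*o weights plus o biases.
dense = sum(i * o + o for i, o in zip(widths, widths[1:]))
print(dense)  # 667431 -- two orders of magnitude above the sparse 6072
```

The gap between the dense count and 6072 is a direct measure of how sparse the pathway-derived connectivity is.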
Looking at the layer names, we see that these correspond to the input and intermediary layers in the model.
layers = binn.layer_names
layers[0][0]
'A0M8Q6'
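In other words, `layer_names` holds one list of node names per layer, so unit `j` of layer `i` is named `layers[i][j]`. A tiny illustration with hypothetical stand-in lists (only `A0M8Q6` appears in the output above; the other IDs are examples):

```python
# Hypothetical stand-in for binn.layer_names: one list of names per layer.
layer_names = [
    ["A0M8Q6", "P00746", "P04004"],    # input layer: UniProt IDs
    ["R-HSA-109606", "R-HSA-140877"],  # first hidden layer: Reactome IDs
]

# Unit j of layer i is named layer_names[i][j]:
print(layer_names[0][0])  # A0M8Q6
print(layer_names[1][1])  # R-HSA-140877
```

This indexing is what later lets per-unit quantities (e.g. feature attributions) be tied back to specific proteins and pathways.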