BINN - Biologically Informed Neural Network
This notebook demonstrates how a BINN can be created.
First, read some test data. This requires an input file and a pathway file, which correspond to the input layer and the intermediary (hidden) layers of the model. Optionally, a translation file can be supplied that maps the input entities to the intermediary layers.
In this example, the input layer consists of proteins identified by UniProt IDs, and the intermediary layers consist of biological pathways identified by Reactome IDs. The translation file maps the UniProt IDs to the Reactome IDs.
import pandas as pd
input_data = pd.read_csv("../data/test_data.tsv", sep="\t")
translation = pd.read_csv("../data/translation.tsv", sep="\t")
pathways = pd.read_csv("../data/pathways.tsv", sep="\t")
input_data.head()
| | PeptideSequence | Charge | Decoy | Protein | CK_P1912_146 | CK_P1912_147 | CK_P1912_148 | CK_P1912_150 | CK_P1912_151 | CK_P1912_152 | ... | TM_M2012_191 | TM_M2012_192 | TM_M2012_196 | TM_M2012_197 | TM_M2012_198 | TM_M2012_199 | TM_M2012_200 | TM_M2012_202 | TM_M2012_203 | RetentionTime |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | VDRDVAPGTLC(UniMod:4)DVAGWGIVNHAGR | 3 | False | P00746 | 7238870.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3749.82 |
| 1 | VDRDVAPGTLC(UniMod:4)DVAGWGIVNHAGR | 4 | False | P00746 | 2681940.0 | 2634110.0 | 2297470.0 | 1935300.0 | 2181160.0 | 2615960.0 | ... | NaN | 519698.0 | NaN | NaN | NaN | NaN | NaN | 2221730.0 | NaN | 3593.61 |
| 2 | VDTVDPPYPR | 2 | False | P04004 | 28535800.0 | 34874600.0 | 34586900.0 | 25820800.0 | 24657400.0 | 30830100.0 | ... | 12486000.0 | 11995900.0 | 24003800.0 | 9802000.0 | 6933130.0 | 7297560.0 | 4328240.0 | 13002400.0 | 4716600.0 | 2502.15 |
| 3 | AVTEQGAELSNEER | 2 | False | P27348 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | 340523.0 | 336960.0 | 435119.0 | 257422.0 | NaN | NaN | 1790.84 |
| 4 | VDVIPVNLPGEHGQR | 2 | False | P02751 | 652100.0 | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 3158.43 |
5 rows × 202 columns
pathways.head()
| | parent | child |
|---|---|---|
| 0 | R-HSA-109581 | R-HSA-109606 |
| 1 | R-HSA-109581 | R-HSA-169911 |
| 2 | R-HSA-109581 | R-HSA-5357769 |
| 3 | R-HSA-109581 | R-HSA-75153 |
| 4 | R-HSA-109582 | R-HSA-140877 |
translation.head()
| | Unnamed: 0 | input | translation |
|---|---|---|---|
| 0 | 1323 | A0A075B6P5 | R-HSA-166663 |
| 1 | 1324 | A0A075B6P5 | R-HSA-173623 |
| 2 | 1325 | A0A075B6P5 | R-HSA-198933 |
| 3 | 1326 | A0A075B6P5 | R-HSA-202733 |
| 4 | 1327 | A0A075B6P5 | R-HSA-2029481 |
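The translation table is simply a lookup between input entities and hidden-layer nodes. A minimal sketch of the kind of join this mapping enables, using made-up miniature tables (the values below are hypothetical, not taken from the files above):

```python
import pandas as pd

# Miniature stand-ins for input_data and translation (hypothetical values).
input_data = pd.DataFrame({"Protein": ["P00746", "P04004", "P27348"]})
translation = pd.DataFrame({
    "input": ["P00746", "P00746", "P04004", "Q99999"],
    "translation": ["R-HSA-140877", "R-HSA-109606", "R-HSA-75153", "R-HSA-166663"],
})

# Keep only mapping rows whose protein actually occurs in the data;
# Q99999 is dropped because it is absent from input_data.
mapped = translation[translation["input"].isin(input_data["Protein"])]
print(mapped["translation"].tolist())
```

Note that one protein can map to several pathways (P00746 above), so the resulting connectivity is many-to-many.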
The first step is to create the network as described above.
from binn import Network
network = Network(
input_data=input_data,
pathways=pathways,
mapping=translation,
    input_data_column="Protein",  # specify the column for entities in input data
    source_column="child",  # defined by our pathways file
    target_column="parent",
)
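Conceptually, the pathway file's parent/child edge list defines a directed graph whose levels become the model's hidden layers. A minimal sketch of that idea (plain Python, not the library's implementation), using a few edges from the table above:

```python
# Edge list: (parent, child) pairs from the pathways table.
edges = [
    ("R-HSA-109581", "R-HSA-109606"),
    ("R-HSA-109581", "R-HSA-169911"),
    ("R-HSA-109582", "R-HSA-140877"),
]

# Group children under each parent.
children = {}
for parent, child in edges:
    children.setdefault(parent, []).append(child)

# Roots (nodes that never appear as a child) sit closest to the output layer.
all_children = {c for _, c in edges}
roots = [p for p in children if p not in all_children]
print(roots)
```

Walking this graph from the mapped input proteins up to the roots yields one set of nodes per depth, which is where the layer sizes in the model below come from.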
Thereafter we can create a BINN (model). The BINN is implemented in PyTorch Lightning and takes the network as an input argument, along with some other arguments.
from binn import BINN
binn = BINN(
network=network,
n_layers=5,
dropout=0.2,
validate=False,
)
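The core idea behind a biologically informed layer is a linear layer whose weight matrix is masked by the protein-to-pathway connectivity, so connections absent from the graph carry no weight. An illustrative NumPy sketch of that mechanism (assumed from the concept, not taken from the BINN source):

```python
import numpy as np

rng = np.random.default_rng(0)

n_in, n_out = 3, 2                   # 3 proteins -> 2 pathways
mask = np.array([[1, 0, 1],          # pathway 0 <- proteins 0 and 2
                 [0, 1, 0]])         # pathway 1 <- protein 1
weights = rng.normal(size=(n_out, n_in))

x = rng.normal(size=(4, n_in))       # batch of 4 samples
out = x @ (weights * mask).T         # masked linear transform
print(out.shape)                     # (4, 2)
```

In the real model each of the `n_layers` hidden layers gets its own mask derived from one level of the pathway graph, followed by batch norm, dropout, and a Tanh activation, as the printed architecture below shows.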
binn.layers
BINN is on the device: cpu
Sequential(
  (Layer_0): Linear(in_features=448, out_features=564, bias=True)
  (BatchNorm_0): BatchNorm1d(564, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_0): Dropout(p=0.2, inplace=False)
  (Tanh 0): Tanh()
  (Layer_1): Linear(in_features=564, out_features=444, bias=True)
  (BatchNorm_1): BatchNorm1d(444, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_1): Dropout(p=0.2, inplace=False)
  (Tanh 1): Tanh()
  (Layer_2): Linear(in_features=444, out_features=285, bias=True)
  (BatchNorm_2): BatchNorm1d(285, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_2): Dropout(p=0.2, inplace=False)
  (Tanh 2): Tanh()
  (Layer_3): Linear(in_features=285, out_features=116, bias=True)
  (BatchNorm_3): BatchNorm1d(116, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_3): Dropout(p=0.2, inplace=False)
  (Tanh 3): Tanh()
  (Layer_4): Linear(in_features=116, out_features=28, bias=True)
  (BatchNorm_4): BatchNorm1d(28, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (Dropout_4): Dropout(p=0.2, inplace=False)
  (Tanh 4): Tanh()
  (Output layer): Linear(in_features=28, out_features=2, bias=True)
)
binn.trainable_params
6072
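The count of 6072 is far below what fully dense layers of the same widths would contain, which suggests only unmasked connections are counted as trainable. A quick sanity check of the dense total, using the layer widths from the printed model (the counting logic here is our sketch, not the library's code):

```python
# Layer widths read off the printed Sequential model: input, 5 hidden, output.
widths = [448, 564, 444, 285, 116, 28, 2]

# A dense Linear(i, o) has i*o weights plus o biases.
dense = sum(i * o + o for i, o in zip(widths, widths[1:]))
print(dense)  # 667431 -- two orders of magnitude above the sparse 6072
```

The gap between the dense count and 6072 is a direct measure of how sparse the pathway-derived connectivity is.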
Looking at the layer names, we see that these correspond to the input and intermediary layers in the model.
layers = binn.layer_names
layers[0][0]
'A0M8Q6'
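In other words, `layer_names` holds one list of node names per layer, so unit `j` of layer `i` is named `layers[i][j]`. A tiny illustration with hypothetical stand-in lists (only `A0M8Q6` appears in the output above; the other IDs are examples):

```python
# Hypothetical stand-in for binn.layer_names: one list of names per layer.
layer_names = [
    ["A0M8Q6", "P00746", "P04004"],    # input layer: UniProt IDs
    ["R-HSA-109606", "R-HSA-140877"],  # first hidden layer: Reactome IDs
]

# Unit j of layer i is named layer_names[i][j]:
print(layer_names[0][0])  # A0M8Q6
print(layer_names[1][1])  # R-HSA-140877
```

This indexing is what later lets per-unit quantities (e.g. feature attributions) be tied back to specific proteins and pathways.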