Loading Data
DPKS can load data from a variety of different proteomic processing pipelines directly. If you have a filetype that you would like to be able to parse directly into DPKS, please let us know.
The QuantMatrix
is the main entry point to all analysis in DPKS. A new QuantMatrix
object can be instantiated with your input data and a design matrix by passing the file paths:
quant_matrix = QuantMatrix(
quantification_file="path_to_quant_file.tsv",
design_matrix_file="path_to_design_matrix_file.tsv"
)
Or by passing in a pandas
DataFrame
:
quant_data = pd.read_csv(
"path_to_quant_file.tsv",
sep="\t"
)
design_matrix = pd.read_csv(
"path_to_design_matrix_file.tsv",
sep="\t"
)
quant_matrix = QuantMatrix(
quantification_file=quant_data,
design_matrix_file=design_matrix
)
This is particularly useful if you want to process your data (reformat, filter, etc.) in someway before loading it into DPKS. The ability to pass in files or DataFrames
directly to the QuantMatrix
object provides some flexibility in the type of data that you can load, making it easy to write custom parsers for new result file types.
Tip
If you encounter errors during parsing, it is useful to first load your data as DataFrame
s to first verify that
everything is formatted correctly
Generic Input
Quantitative Data
DPKS accepts a generic results file that you can reformat your own data to if there is not a built-in parser available.
Column | Description |
---|---|
PrecursorId | A unique identifier generally composed of the Peptide Sequence (with mods) and the charge. |
Charge | The precursor charge. |
PeptideSequence | The modified peptide sequence. |
Decoy (Optional) | Indicating if the precursor is a decoy (used for filtering). |
RetentionTime | The retention time of the precursor. |
Protein | The protein accession code associated with the precursor. |
PeptideQValue (Optional) | The global peptide level q-value (used for filtering). |
ProteinQValue (Optional) | The global protein level q-value (used for filtering). |
Sample Columns (Many Columns) | All other columns containing quantification data for your samples. |
If you already have controlled for the global FDR, you do not need to include the Decoy, PeptideQValue, or ProteinQValue columns.
A generic file format may look like this:
PeptideSequence | Charge | Decoy | Protein | RetentionTime | PeptideQValue | ProteinQValue | SAMPLE_1.osw | SAMPLE_2.osw | SAMPLE_3.osw |
---|---|---|---|---|---|---|---|---|---|
PEPTIK | 4 | 0 | P00352 | 5736.15 | 7.81e-06 | 0.0001169 | 29566.2 | 59295.7 | 24536.4 |
EFMEEVIQR | 2 | 0 | P04275 | 3155.5 | 9.41e-06 | 0.0001169 | 69900.3 | 195571.0 | 403947.0 |
SSSGTPDLPVLLTDLK | 2 | 0 | P00352 | 5386.69 | 7.815e-06 | 0.000116 | 115684.0 | 132524.0 | 217962.0 |
Note
If you want to pass already quantified Proteins you could do this:
Protein | SAMPLE_1.osw | SAMPLE_2.osw | SAMPLE_3.osw |
---|---|---|---|
P00352 | 29566.2 | 59295.7 | 24536.4 |
P04275 | 69900.3 | 195571.0 | 403947.0 |
P00352 | 115684.0 | 132524.0 | 217962.0 |
Design Matrix
A basic design matrix will have 2 main columns:
Column | Description |
---|---|
Sample (Required) | A list of the samples. This helps differentiate between sample columns and annotation columns in the QuantMatrix |
Group (Optional) | The group the sample belongs to. Used in differential testing and explainable machine learning. |
A minimal design matrix for the above input examples could look like this:
Sample |
---|
SAMPLE_1.osw |
SAMPLE_2.osw |
SAMPLE_3.osw |
And an example using the Group
column:
sample | group |
---|---|
AAS_P2009_167 | 6 |
AAS_P2009_169 | 4 |
AAS_P2009_176 | 6 |
AAS_P2009_178 | 4 |
AAS_P2009_187 | 4 |
AAS_P2009_194 | 6 |
AAS_P2009_196 | 4 |
AAS_P2009_203 | 6 |
AAS_P2009_205 | 4 |
AAS_P2009_212 | 6 |
AAS_P2009_214 | 4 |
AAS_P2009_221 | 6 |
AAS_P2009_230 | 6 |
AAS_P2009_232 | 4 |
AAS_P2009_239 | 6 |
AAS_P2009_241 | 4 |
AAS_P2009_248 | 6 |
AAS_P2009_250 | 4 |
DIANN
You can load data directly from DIA-NN using the long-format diann-output.tsv
file that is generated. The samples in your design matrix column should match the Run
column in the DIA-NN output, but other columns can be indicated if desired.
Additionally, if you have used MBR, the correct columns will be used to filter precursors at the indicated FDR threshold.
quant_matrix = QuantMatrix(
quantification_file=quant_file,
design_matrix_file=simple_design,
quant_type="diann",
diann_qvalue=0.01
)