2.4. Format of the file with a gene expression table

Gene expression data are represented as a table consisting of rows (genes) and columns (fields). Each row in the table contains a set of data for one gene. The fields are the same for all genes (rows). As a rule, columns list gene expression parameters measured under various conditions (in tissues, organs, cell cultures, etc.). In addition, some columns may show additional information (both numerical and textual) related to the genes (e.g. gene names).

In general, a column (field) can hold one of four main types of content:

IVALUE
value - integer
FVALUE
value - floating-point number
WORD
value - word (a piece of text without spaces)
STRING
value - string (a piece of text that may contain spaces)

Fields are completely defined by their type and name.
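
As an illustration only, the four field types could be modelled in Python as shown below. This is a minimal sketch and not part of SelTag: the names Field, FIELD_TYPES and convert_value are hypothetical, and returning None for an empty value is an assumption.

```python
from dataclasses import dataclass

FIELD_TYPES = ("IVALUE", "FVALUE", "WORD", "STRING")


@dataclass(frozen=True)
class Field:
    """A field is identified by its name and its type."""
    name: str
    type: str

    def __post_init__(self):
        if self.type not in FIELD_TYPES:
            raise ValueError(f"unknown field type: {self.type!r}")


def convert_value(field, text):
    """Convert the raw text of one table cell according to the field type."""
    if text == "":
        return None  # assumption: an empty cell is reported as None
    if field.type == "IVALUE":
        return int(text)
    if field.type == "FVALUE":
        return float(text)
    if field.type == "WORD" and any(ch.isspace() for ch in text):
        raise ValueError(f"{field.name}: a WORD must not contain spaces")
    return text  # WORD and STRING values are kept as text


print(convert_value(Field("GENEID", "IVALUE"), "402"))
print(convert_value(Field("TISSUECANCER0", "FVALUE"), "5.60"))
print(convert_value(Field("NAME", "WORD"), "GENE04675"))
```

Keeping the name and the type together in one object mirrors the statement that a field is completely defined by these two properties.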

2.4.1. Basic format of the SelTag input file

The basic format of the SelTag input file is the following:

; May contain comment in any line of the file
NAME<tab>WORD
GENEID<tab>IVALUE
TISSUECANCER0<tab>FVALUE
TISSUECANCER1<tab>FVALUE
TISSUENORMAL0<tab>FVALUE
TISSUENORMAL1<tab>FVALUE
TISSUENORMAL2<tab>FVALUE
DATA
GENE04675<tab>402<tab>6.00<tab>5.60<tab>5.97<tab>6.00<tab>6.00
GENE46890<tab>794<tab>2.77<tab>3.22<tab>5.65<tab>5.68<tab>5.68
GENE23794<tab>404<tab>5.97<tab>5.97<tab>6.00<tab>5.60<tab>5.97

In this example, <tab> stands for the tab character (the 'Tab' key on the keyboard).

The first lines, above the "DATA" line, contain the description of the data format. In this part of the file, each line contains a description of one field: field name and type.

The lines below the "DATA" line contain the gene expression data. Each line corresponds to a single gene. Field values are separated by tabs. Two consecutive tabs denote an empty field.
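
The layout described above can be read with a few lines of Python. The sketch below is not SelTag code: parse_table and CONVERTERS are hypothetical names, and it assumes that a comment is a whole line starting with ";", that blank lines are skipped, and that an empty field is stored as None.

```python
CONVERTERS = {"IVALUE": int, "FVALUE": float, "WORD": str, "STRING": str}


def parse_table(text):
    """Return (fields, rows) read from text in the basic format."""
    fields = []      # (name, type) pairs in column order
    rows = []        # one dict per gene
    in_data = False  # becomes True after the DATA line
    for line in text.splitlines():
        if line == "" or line.startswith(";"):
            continue  # assumption: blank lines and ';' lines are skipped
        if not in_data:
            if line.strip() == "DATA":
                in_data = True
                continue
            name, type_name = line.split("\t")
            if type_name not in CONVERTERS:
                raise ValueError(f"unknown field type: {type_name!r}")
            fields.append((name, type_name))
            continue
        cells = line.split("\t")
        if len(cells) != len(fields):
            raise ValueError(f"expected {len(fields)} cells, got {len(cells)}")
        row = {}
        for (name, type_name), cell in zip(fields, cells):
            # An empty cell (two consecutive tabs) is stored as None.
            row[name] = CONVERTERS[type_name](cell) if cell else None
        rows.append(row)
    return fields, rows


SAMPLE = "\n".join([
    "; May contain comment in any line of the file",
    "NAME\tWORD",
    "GENEID\tIVALUE",
    "TISSUECANCER0\tFVALUE",
    "TISSUECANCER1\tFVALUE",
    "DATA",
    "GENE04675\t402\t6.00\t5.60",
    "GENE46890\t794\t\t3.22",  # double tab: TISSUECANCER0 is empty
])

fields, rows = parse_table(SAMPLE)
print(fields)
print(rows)
```

The sample is shortened to four fields; the second data line uses a double tab to leave TISSUECANCER0 empty.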

2.4.2. Groups of fields

Fields of the same type can be combined into a group of fields. A group is defined by its name and the list of fields it contains. Grouping usually reflects the functional meaning of the fields. For example, the fields that represent gene expression levels in tumor tissues can belong to the "Cancer tissues" group. A field may belong to several groups, i.e. groups may overlap.

The format description can include descriptions of groups of fields. A group description begins with a "#GROUP" line, which defines the name of the group. The following lines list the fields included in the group. A group description should end with an "#ENDGROUP" line.

An example of the data format with defined groups of fields:

; May contain comment in any line of the file
NAME<tab>WORD
GENEID<tab>IVALUE
TISSUECANCER0<tab>FVALUE
TISSUECANCER1<tab>FVALUE
TISSUENORMAL0<tab>FVALUE
TISSUENORMAL1<tab>FVALUE
TISSUENORMAL2<tab>FVALUE
#GROUP<tab>Cancer tissues
TISSUECANCER0
TISSUECANCER1
#ENDGROUP
#GROUP<tab>Arbitrary group
TISSUECANCER1
TISSUECANCER2
TISSUENORMAL0
TISSUENORMAL1
#ENDGROUP
DATA
GENE04675<tab>402<tab>6.00<tab>5.60<tab>5.97<tab>6.00<tab>6.00
GENE46890<tab>794<tab>2.77<tab>3.22<tab>5.65<tab>5.68<tab>5.68
GENE23794<tab>404<tab>5.97<tab>5.97<tab>6.00<tab>5.60<tab>5.97

This data format defines two groups: "Cancer tissues" (includes the TISSUECANCER0 and TISSUECANCER1 fields) and "Arbitrary group" (includes the TISSUECANCER1, TISSUECANCER2, TISSUENORMAL0, and TISSUENORMAL1 fields).
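
The group blocks of this example could be collected as follows. This is a sketch under assumptions, not SelTag's own reader: parse_groups is a hypothetical name, and treating a "#GROUP" line inside an open group as an error is an assumption.

```python
def parse_groups(header_lines):
    """Return {group name: [field names]} from the format description lines."""
    groups = {}
    current = None  # name of the group currently being read, if any
    for line in header_lines:
        if line.startswith("#GROUP"):
            if current is not None:
                raise ValueError(f"group {current!r} has no #ENDGROUP line")
            _, name = line.split("\t", 1)
            current = name.strip()
            groups[current] = []
        elif line.strip() == "#ENDGROUP":
            if current is None:
                raise ValueError("#ENDGROUP without a matching #GROUP")
            current = None
        elif current is not None:
            groups[current].append(line.strip())
        # Lines outside a group are field definitions and are ignored here.
    if current is not None:
        raise ValueError(f"group {current!r} has no #ENDGROUP line")
    return groups


HEADER = [
    "NAME\tWORD",
    "GENEID\tIVALUE",
    "#GROUP\tCancer tissues",
    "TISSUECANCER0",
    "TISSUECANCER1",
    "#ENDGROUP",
    "#GROUP\tArbitrary group",
    "TISSUECANCER1",
    "TISSUECANCER2",
    "TISSUENORMAL0",
    "TISSUENORMAL1",
    "#ENDGROUP",
]

groups = parse_groups(HEADER)
# Fields that belong to both groups, i.e. the overlap between them.
shared = set(groups["Cancer tissues"]) & set(groups["Arbitrary group"])
print(groups)
print(shared)
```

Here the intersection of the two groups contains only TISSUECANCER1, which shows how one field can belong to several groups.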

Data can be loaded into SelTag from files containing tables in simpler formats, or retrieved directly from tables, provided that they exactly match the format of your table: both must have the same number of columns with identical names, and the columns in the data file must be in the same order as the columns in your table.
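
The matching rule can be checked before loading. The function below is only an illustrative sketch; column_mismatches is a hypothetical name and does not correspond to any SelTag function.

```python
def column_mismatches(table_columns, file_columns):
    """List the reasons why the columns of a data file do not match the table."""
    problems = []
    if len(table_columns) != len(file_columns):
        problems.append(
            f"column count differs: {len(table_columns)} vs {len(file_columns)}"
        )
    # Compare names position by position, so that order matters as well.
    pairs = zip(table_columns, file_columns)
    for position, (expected, found) in enumerate(pairs, start=1):
        if expected != found:
            problems.append(
                f"column {position}: expected {expected!r}, found {found!r}"
            )
    return problems


table = ["NAME", "GENEID", "TISSUECANCER0"]
print(column_mismatches(table, ["NAME", "GENEID", "TISSUECANCER0"]))
print(column_mismatches(table, ["GENEID", "NAME", "TISSUECANCER0"]))
```

An empty list means that the number, names and order of the columns all agree.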