Pscan Help


Input
Submitting gene sets
Submitting gene sets and your own matrices
Output
Reading p-values
Comparing the results of the same matrix on different gene sets
Resetting the interface

Input:

Submit a list of gene or transcript identifiers: RefSeq for human, mouse, and Drosophila (e.g. NM_000546); TAIR for Arabidopsis (e.g. AT1G08810); SGD for yeast (e.g. YPL248C). Then specify the source organism and the region you want analyzed, relative to the annotated transcription start site (TSS). If your list uses other descriptors (official gene name, Affy ID, etc.), you can use this tool for a quick conversion.
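Before pasting a long list, it can help to check that every line looks like one of the identifier types named above. The sketch below is a hypothetical pre-check and is not part of Pscan: the function names are made up for illustration, and the regular expressions are assumptions generalised from the three example identifiers, so Pscan's own validation may accept or reject different forms.

```python
import re

# Assumed patterns, generalised from the examples NM_000546, AT1G08810, YPL248C.
# They are illustrative only; Pscan's real rules are not documented here.
ID_PATTERNS = {
    "RefSeq": re.compile(r"^N[MR]_\d+(\.\d+)?$"),
    "TAIR": re.compile(r"^AT[1-5CM]G\d{5}$", re.IGNORECASE),
    "SGD": re.compile(r"^Y[A-P][LR]\d{3}[WC](-[A-Z])?$", re.IGNORECASE),
}


def classify_id(identifier):
    """Return the name of the first pattern that matches, or None."""
    token = identifier.strip()
    for name, pattern in ID_PATTERNS.items():
        if pattern.match(token):
            return name
    return None


def check_ids(lines):
    """Split a pasted list into (id, type) pairs and unrecognised ids."""
    recognised, unrecognised = [], []
    for line in lines:
        token = line.strip()
        if not token:
            continue  # ignore blank lines in the pasted list
        kind = classify_id(token)
        if kind:
            recognised.append((token, kind))
        else:
            unrecognised.append(token)
    return recognised, unrecognised
```

Identifiers that end up in the unrecognised list (an official gene name, for instance) are the ones that would need converting first.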
With the "Select Descriptors" option you can choose whether the analysis is performed with the TFBS matrices available in the JASPAR or TRANSFAC databases, or with a matrix you upload yourself. In the latter case, prepare a plain-text file containing one or more matrices in the following format:
>matrix1
A_1 A_2 ..... A_n
C_1 C_2 ..... C_n
G_1 G_2 ..... G_n
T_1 T_2 ..... T_n
>matrix2
A_1 A_2 ..... A_n
C_1 C_2 ..... C_n
G_1 G_2 ..... G_n
T_1 T_2 ..... T_n

...and so on, where A_i, C_i, G_i, and T_i are the frequencies of the four nucleotides in column i of the matrix. The values can be integers or floating-point numbers; they are automatically rescaled so that each column sums to one. An example matrix file is available on the main page. Note that matrix names can contain only letters and digits.
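The format and the column rescaling described above can be sketched as follows. This is only an illustration under stated assumptions, not Pscan's own parser: the function names are hypothetical, values are assumed to be whitespace-separated, and the error handling is a guess at what a strict reader of this format would reject.

```python
def parse_matrices(text):
    """Parse '>name' headers, each followed by four rows (A, C, G, T)."""
    matrices = {}
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    i = 0
    while i < len(lines):
        header = lines[i]
        if not header.startswith(">"):
            raise ValueError(f"expected '>name' header, got: {header!r}")
        name = header[1:].strip()
        # Matrix names may contain only letters or digits.
        if not name.isalnum():
            raise ValueError(f"invalid matrix name: {name!r}")
        # float() raises ValueError on any non-numeric token.
        rows = [[float(x) for x in ln.split()] for ln in lines[i + 1:i + 5]]
        if len(rows) != 4 or len({len(r) for r in rows}) != 1:
            raise ValueError(f"matrix {name!r} needs four rows of equal length")
        matrices[name] = rows
        i += 5
    return matrices


def normalise_columns(rows):
    """Rescale each column so that its four values sum to one."""
    totals = [sum(col) for col in zip(*rows)]
    if any(t <= 0 for t in totals):
        raise ValueError("every column needs a positive total")
    return [[v / t for v, t in zip(row, totals)] for row in rows]
```

Running a file through a check like this before uploading shows whether raw counts and pre-normalised frequencies lead to the same matrix.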

Example: submitting gene sets

In the right-hand column of the main page several datasets are available for testing the interface. Clicking on any link opens a page with a list of RefSeq gene IDs, which can be copied and pasted into the text box of the input form.



For example, click on the NFkB100 link. It contains a set of genes for which the binding of NFkB in the promoter region has been determined experimentally via ChIP on Chip. The 100 indicates that all the genes of the set are NFkB targets; in the NFkBxx sets, xx percent of the genes are NFkB targets, while the others have been replaced by random genes to assess the performance of the algorithm. Open the NFkB100 link, and copy and paste the identifiers into the input text box.
Below the input box, you can choose the source organism (human, in this case), the region you want to analyse with respect to the TSS, and the matrix database you want to use (JASPAR, TRANSFAC, or one or more matrices you upload). For NFkB100, leave all the options as they are.
If you want to submit a human gene set together with its orthologs in mouse, paste all the identifiers (for both human and mouse genes) into the input box, and select "Human and Mouse" as source organism. Note: Pscan does not check whether the orthology annotations are correct!
Click "Run!"



A confirmation message appears in the text box under the "Run" button. Within a few seconds, the results appear in the middle column of the page.



Similar experiments can be performed with all the NFkBxx target files.

Example: submitting gene sets and matrices

For the NRF1 sequence sets you also have to upload the NRF1 binding site matrix, since it is not included in the publicly available TRANSFAC matrices.
The matrix is contained in this file. Save it on your computer. On the input page, copy a set of NRF1 target genes from any NRFxx file into the input text box, and select "User Defined". A file upload box will appear just below. Click on "browse", and locate the NRF1 matrix file you just saved. Finally, click on "Run".

In this case the computation takes longer, since the program has to scan the whole promoter set (here, the whole human promoter set) to build the background statistics used to assess the significance of the results. Once again, the results appear in the middle column; if there is any problem with the matrix file, an error message appears instead in the text box under the "Run" button. From here on everything works as in the previous example (see the Output section), except that the detailed results page cannot provide any external link for your matrix.

Output:

When you click the "Run" button, after a few seconds ("User Defined" matrices can take longer, since the program has to scan the whole set of promoters of an organism to build a background model) the result of the computation will appear in the middle column of the page, together with a small image (the "heatmap") on the right.



The output shows the ranking of the matrices selected according to their enrichment p-values. At the top of the column there's a link for downloading the results in text format as well as the number of matrices used to analyze the sequences (see below, section "Reading the p-value").

By clicking on a matrix name, you can open a dedicated page showing the detailed results for that matrix: the matrix itself, its logo (at the bottom), its information content, and links to its database entry and to the PubMed entry (PMID) describing its generation. A simple graphic shows the average matching value of the matrix on the sequences analyzed, compared with the average matching value and standard deviation on the whole promoter set of the same organism (same regions w.r.t. the TSS as selected). Two further boxes follow. The one on the left ("Sample statistics") reports the statistics of the matrix on the current dataset: the z-test p-value, the Bonferroni-corrected p-value (see the Reading p-values section for further info), the mean and standard deviation of the matching score on the current dataset, and the dataset size. These last values are useful for comparing the results of different datasets with the "Compare with.." box next to the statistics box, as explained in the Comparing the results of the same matrix on different gene sets section.
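To make the relation between these numbers concrete, here is a minimal sketch of a plain one-sample z-test that compares the sample mean with the background mean and standard deviation. It is an assumption that this textbook form approximates what Pscan reports; the function name and the one-sided (upper-tail) choice are illustrative, and the exact computation is described in the Pscan article.

```python
import math


def z_test(sample_mean, sample_size, background_mean, background_sd):
    """One-sample z-test of a sample mean against background statistics.

    Returns the z-score and the upper-tail p-value P(Z >= z).
    Textbook formula, assumed here for illustration only.
    """
    standard_error = background_sd / math.sqrt(sample_size)
    z = (sample_mean - background_mean) / standard_error
    # Upper tail of the standard normal distribution.
    p = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p
```

For instance, a set of 100 promoters with mean score 0.82 against a background mean of 0.80 and standard deviation of 0.10 gives a z-score of about 2 under this formula: the same difference in means weighs more as the dataset grows.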



By clicking on the "Report Occurrences" button at the bottom of the "Matrix Info" table you can retrieve, for each submitted gene, the best-matching oligo, its score (from 0 to 1), and its position w.r.t. the annotated TSS. Occurrences are sorted according to their score. The "Text Results" button allows you to download the occurrence table in text format. At the bottom right of the page two diagrams appear, showing the distribution of 1) the positions of the best occurrences w.r.t. the TSS and 2) the scores of the best occurrences. Notice, for example, how the image below shows that most of the predicted sites are clustered in the -100 to +50 region. Predictions are colored according to their score (red = high).



It might happen that two different RefSeq IDs correspond to the same TSS (e.g. the two transcripts differ only in splicing). In that case the same oligo appears twice in the list, with identical score and position. Note, however, that duplicate input promoters are filtered out automatically by the program so as not to bias the statistical evaluation.
The "heatmap" image shows, in a microarray-like fashion, the contribution of each input gene to the score of each matrix. Red spots correspond to positive contributions to the z-score and green spots to negative ones; black spots are close to the genome-wide average score of the matrix.



Reading the p-value: when is a result significant?

Pscan associates with each matrix a p-value that is used for ranking the set of matrices used in the analysis. The p-value should be read as:

If we take as many random genes from the same organism as in the input set, and the corresponding regions selected in the input, what is the probability of having the same score (enrichment) obtained in the input set?

Or - more simply - what is the probability of having the same result by chance?
Our experimental tests have shown that the z-test p-value computed by Pscan corresponds to the experimental one (see article) and is not an underestimate, as it is in many other similar methods. Thus, typical p-value thresholds used in statistical testing can be used quite safely to report significant results. However, keep in mind that if n matrices are used, Pscan performs n independent statistical tests. Thus, if 100 tests are performed and a p-value threshold of 0.01 is used, you can expect about one of the 100 tests to have a p-value below the threshold by chance! To be on the safe side, you can use a Bonferroni-corrected significance threshold (or a Bonferroni-corrected p-value), as follows.
Let T be your significance threshold (typically 0.05 or 0.01). Let n be the number of matrices used by Pscan (reported at the top of the output column). To maintain the same level of significance you can use T/n as the threshold or, alternatively, apply the threshold T not to the "raw" p-value but to the "Bonferroni corrected" p-value reported in the detailed matrix output. See this page for further explanations.
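The arithmetic above can be written out as a short sketch. The function names are hypothetical, and capping the corrected p-value at 1 is a common convention assumed here rather than something this page states about Pscan's output.

```python
def bonferroni_threshold(threshold, n_tests):
    """Per-test threshold T/n that keeps the overall level at T."""
    return threshold / n_tests


def bonferroni_p_value(p_value, n_tests):
    """Corrected p-value n*p, capped at 1 (assumed convention)."""
    return min(1.0, p_value * n_tests)


def expected_false_positives(threshold, n_tests):
    """Expected number of tests below the threshold by chance alone."""
    return threshold * n_tests
```

With T = 0.01 and n = 100 matrices, the expected number of chance hits is 1 and the corrected threshold is 0.0001. Comparing the raw p-value with T/n and comparing the corrected p-value with T lead to the same decision.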

Comparing different input gene sets:

In the detailed output for a given matrix, you can compare the results obtained with the matrix on the gene set just submitted with the results the same matrix produced on another gene set. The latter could be a "negative" gene set (or, vice versa, the current one could be the negative set and the other one the "positive" set; the order does not matter). To perform the comparison, fill in the "Compare with..." box fields with the mean, standard deviation, and sample size of the other analysis. For the current analysis, you can find these values in the "Sample Data Statistics" box or in the overall text output that can be downloaded from the main output page. Warning: make sure that the values you enter are correct, and especially that they were obtained with the same matrix! Once you have clicked the "Go!" button, an output window will pop up and report whether either of the two means is significantly higher than the other, together with a confidence p-value computed with a Welch t-test (see Supplementary Material for further info).
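For readers who want to see what a Welch t-test does with the three values entered for each gene set, here is a minimal sketch of the standard formulas for the t statistic and the Welch-Satterthwaite degrees of freedom. The function name is hypothetical, and whether Pscan computes the test in exactly this way is an assumption; the Supplementary Material remains the reference.

```python
import math


def welch_t(mean1, sd1, n1, mean2, sd2, n2):
    """Welch t statistic and degrees of freedom from summary statistics.

    Standard textbook formulas; the two sets may have unequal
    variances and unequal sizes.
    """
    v1 = sd1 ** 2 / n1  # squared standard error of the first mean
    v2 = sd2 ** 2 / n2  # squared standard error of the second mean
    t = (mean1 - mean2) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, df
```

A positive t means the first set has the higher mean. Turning t and df into a p-value requires the Student t distribution; if SciPy is available, scipy.stats.ttest_ind_from_stats with equal_var=False performs the whole test from the same six numbers.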





Resetting the interface

At any moment, you can return to the initial web page (with empty input boxes and no results in the middle column) by clicking the "Reset" button in the left-hand column (next to "Run!").