Frequently Asked Questions (FAQs)
Yes. The data files you upload for analysis as well as any analysis results, are not downloaded or examined in any way by the administrators, unless required for system maintenance and troubleshooting. All files will be deleted automatically after 72 hours, and no archives or backups are kept unless you have registered an account and saved the analysis. You are advised to download your results immediately after performing an analysis.
EcoToxXplorer is designed to analyze EcoToxChip data. Previous versions of EcoToxXplorer supported processing and analyzing raw RNA-seq data. These functions are now accessible through the EcoToxChip Project Galaxy page, and ExpressAnalyst.
Registering on EcoToxXplorer allows you to save up to 10 projects that will be stored in the system for 10 months. You will be able to reload the work state of previous projects to resume previous analysis.
Files should be .txt or .csv format with two columns, the first for well ID and the second for raw Ct values. Each file should have the same well IDs. All values in the Ct column should be numeric, except for non-detects or missing values which should be denoted with "NA".
Troubleshooting: if you get an error saying that there are special characters in your file, try change the type of .csv or .txt. Mac allows .csv and .txt formats with UTF-8 encoding, which can cause problems with our file uploader. Change to the regular format using "Save As" and selecting the .csv or .txt without UTF-8 encoding.
Download this example data set to see how uploaded files should be formatted.
The QA/QC checks examine the strand synthesis efficiency (Delta-Ct of RTC-PPC), the presence of genomic DNA contamination (GDC), and whether housekeeping genes are stable across all conditions. If the first two checks do not meet the default thresholds consistently across all samples, and are only off by a little, we recommend proceeding with the analysis with caution. However, if one or several of the samples significantly fail these quality checks while others pass, we recommend removing the problematic samples using the "Data Editor".
For the housekeeping genes, there are two options. If many of the housekeeping genes are unstable, we recommend using quantile normalization. If only one or two of the housekeeping genes fail, we recommend removing these housekeeping genes using the "Select options" link on the normalization page, and proceeding with delta-Ct normalization.
Normalizing the data accounts for systematic technical sources of variation so that biologically-driven changes in gene expression can be better detected between samples. There are two options for normalization in EcoToxXplorer: delta-Ct and quantile. Delta-Ct is the most commonly used normalization for qPCR data, and so we suggest starting with this method for differential expression analysis, however quantile normalization is becoming more popular as qPCR arrays get larger. Also, if the housekeeping genes are not stable across conditions, quantile normalization can still be used since it does not use the housekeeping genes as internal references. Quantile normalization is the only option for dose-response modeling because the curve fitting algorithm requires only positive values, and delta-Ct produces both positive and negative values.
Potential outlier samples can be identified from PCA plots. The potential outlier will distinguish itself as the one located far away from the major clusters formed by the remaining samples. To deal with outliers, the first thing is to check the QA/QC wells to see if there were some obvious technical problems. If they are flagged and the sample appears as an outlier in the PCA plot, the sample should be removed from the input data using the "Data Editor" on the Quality Check page.
Imputation assumes that missing and high Ct values represent situations of very low or no expression. However, normalization and differential expression analysis algorithms do not perform well when there are missing entries in the data matrix, and so we aim to replace them with values that represent very low expression. Giving all missing measurements the exact same value causes problems during differential expression, potentially leading to artificially low p-values.
The first step is to decide the Ct cut-off for reliable measurements, as accuracy tends to decrease at high cycle numbers. Then, all missing values and values above the cut-off are replaced by randomly drawn values from a normal distribution that has a mean of the Ct cut-off and a standard deviation of the original data surrounding this cut-off. For example, in the image above, the Ct-cutoff was chosen to remain at the default of 35. Then, values are randomly drawn from the distribution of the data around this point.
Genes are removed either because the well filter, or if there were too few values for differential expression analysis. The filter is determined by the "Filter wells" drop-down menu on the normalization page. It is for removing QC wells and for automatically detecting likely outlier measurements.
ANOVA and Kruskal Wallis do not work with less than three replicates per condition, and so genes with two or fewer replicates for any experimental condition are removed. This issue is not encountered if missing values are imputed. You can see which genes are removed using the processing summary table.
Limma is a popular method for differential analysis that was first developed for microarray differential analysis. It addresses the problem of low sample sizes typical to whole-transcriptome studies by using the whole expression profile to make more stable estimates of gene expression variance. EdgeR and DESeq2 were both developed to analyze RNAseq data. All three methods are well-established and should give similar results. Please note:
In differential expression analysis, you should first determine whether any of the metadata encode blocking factors, then decide on how to classify individual samples into groups, and finally decide which groups of samples should be compared to each other using statistical tests. Let's assume that none of your metadata are blocking factors (more on that later) and try to understand how selecting primary and secondary factors creates different groups of samples. Consider the "Estrogen" example data, generated in a study that measured gene expression at multiple time points in breast cancer cells in which the estrogen receptor (ER) was either present or absent. Here, the metadata are "ER" and "TIME". As the figure below shows, selecting "ER" as the primary factor divides the data into two groups because "ER" has two different levels ('present' and 'absent'). Selecting "TIME" as the secondary factor results in four groups because the two primary groups are split based on the two time points. If there were three time points, each primary group would be split into three groups, resulting in six groups overall.
The defined groups can now be compared to find genes that are differentially expressed between them (more details on this in later sections). In some experimental designs, we aren't interested in finding the genes that are differentially expressed between the groups defined by the secondary factor because it is a blocking factor. Examples of blocking factors are subject IDs when multiple samples were taken from the same subject (e.g. paired samples, multiple tissue types), or batches of samples that were measured at different times or in different locations. If you indicate that your secondary factor is a blocking factor, EcoToxXplorer will conduct comparisons within the groups that it defines, which typically improves the accuracy of the overall result.
This means you do not have enough samples to perform the analysis you specified, usually when combining two metadata in an independent two-factor analysis (no blocking factors). In this case, the total number of groups will be the product of the number of levels in each metadata factor (i.e. if the primary metadata contains 3 levels, and the secondary metadata contains 4, the total number of groups will be 3 * 4 = 12). We recommend a minimum of 3 samples per group, therefore at least 36 samples are required in order to perform a 3 x 4 two-factor analysis.
In this case, you should focus on a single primary metadata and leave the seconday metadata as "Not available", and perform differential analysis with regard to individual metadata. You can then choose the other metadata as the primary metadata and perform the analysis again. If there are no or very few significant genes identified, it is most likely that incorporating the secondary metadata into the analysis will not affect the result.
A pair-wise comparison tests for genes that are differentially expressed between any pair of groups. For example, take three groups A, B, and C. The "all pairwise" comparison will contrast A-B, A-C, and B-C. A time-series comparison will only contrast consecutive pairs of groups, so in our example only A-B and B-C. Time-series are commonly used when gene expression was measured at multiple time points, or after treatments with varying concentrations/durations.
A nested comparison allows you to determine which genes respond differently to a treatment condition, respective to some other metadata. For example, consider the experimental design described in the section on multiple metadata where cells with and without an ER were measured at 10hrs and at 48 hrs. To find the genes that respond differently over time in the ER vs. noER cells, you would perform a nested comparison. First, compare ER10-ER48 to find the genes that are differentially expressed in cells with an ER (ERgenes). Next, compare noER10-noER48 to find the genes differentially expressed in cells with no ER (noERgenes). Finally, to find the genes that respond differently over time in ER vs. noER cells, compare ERgenes-noERgenes.
Selecting "Interaction only" will return significant results from only the ERgenes-noERgenes contrast. Otherwise the full model is returned, which is the combination of significant genes from the ER10-ER48, noER10-noER48, and the ERgenes-noERgenes contrasts.
Gene-level BMDs are calculated using six steps.
More details are given for each of these steps in the below FAQs.
The statistical models are the same models the the US Environmental Protection Agency and the National Toxicology Program have used in their studies on 'omics dose-response analysis. They are the same models available in BMDExpress, and are the ones recommended in the peer-reviewed report that outlines the National Toxicology Program Approach to Genomic Dose-Response Modeling. Exp2 - Exp4 are four different forms of exponential models. Poly2, Poly3, and Poly4 are polynomial models with degree 2, 3, and 4. More details and the full mathematical forms can be found in the NTP report linked above.
The NTP recommendations say to use all of the models other than Poly3 and Poly4. While these higher degree polynomials often give good fits, there are concerns that allowing too many changes of direction may over fit the data. However, of the remaining models (Exp2-5, Linear, Polynomial degree 2, Power, and Hill), only Poly 2 allows for non-monotonic behaviour. Thus, if you expect non-monotonic behaviour, you may wish to include higher order polynomials. The NTP report suggests that future implementations of dose-response software should include the ability to constrain Poly3 and Poly4 models to only change direction once. We plan to include this feature in future updates to EcoToxXplorer.
A statistical model has a "lack-of-fit" if it fails to adequately explain the relationship between the x (dose) and y (response) variables. In a lack-of-fit statistical test, the null hypothesis is that the model fits the data. Thus, in these tests a significant p-value indicates that there is evidence that the model does not adequately explain the data, and so here we check that the p-value is greater than the significance threshold . Significance thresholds commonly range from 0.05 to 0.5 depending on the desired stringency. The NTP recommendations give a threshold of 0.10.
After applying the lack-of-fit p-value threshold, there may be several statistical models that fit the data well. From these remaining models, the one with the lowest AIC (Akaike information criterion) is selected as the best fit. The AIC is a measure of prediction error that penalizes models with more parameters. This means that if there are two models that do an equally good job of explaining the data, the model with fewer parameters will be selected.
The AIC is not displayed in the results table on the curve fitting page, but it is included in the bmd.csv file that can be downloaded from the Analysis Pipeline side panel.
BMR means "benchmark response", which is the pre-determined response level that is considered "adverse" or "significant". The dose that corresponds to the BMR is defined as the benchmark dose (BMD). In an ideal case, we would know which change in response variable is physiologically significant and potentially toxic, however this is rarely known for individual genes. For 'omics dose-response analysis, the NTP recommended approach defines the BMR for a gene as the mean of the control gene expression values, plus or minus a certain number of standard deviations of the control values. The number of standard deviations is the BMR factor, and increasing this parameter increases the absolute value of the BMR for each gene.
In the figure above, the BMR factor is one since this is the number of standard deviations that was used to calculate the BMR.
The BMDl and BMDu are the lower and upper limits of the 95% Wald confidence interval of the BMD, extracted from the model-fit object. The BMDl is sometimes used instead of the BMD since it is a more conservative estimate that depends on the uncertainty of the model fit to the data.
There are several quality criteria applied to the BMDs to filter out low-confidence or otherwise undesirable BMD estimates:
Additionally, for some genes, the expression values never exceed the BMR and thus EcoToxXplorer cannot compute a BMD.
The intial curve fitting and BMD calculation are done at the gene level (geneBMD). However, the main statistic of interest is usually the dose at which the whole-transcriptome is responding to chemical exposure, or the tPOD. The NTP report defines the tPOD based on pathway enrichment analysis of the geneBMDs. However, with many ecological species these pathway-based tPODs are unstable due to the low number of annotated gene sets compared to popular mammalian model organisms, and thus it has been proposed to estimate a statistical tPOD based on the distribution of geneBMDs. There are two different methods for estimating statistical tPODs in EcoToxXplorer:
Gene set analysis is used to identify significantly overrepresented pathways in the list of geneBMDs. Pathway-level BMDs (pathBMDs) are calculated as the bootstrapped median of the geneBMDs in that pathway. The National Toxicology Program Approach to Genomic Dose-Response Modeling defines the omicBMD as the lowest pathBMD.
The pathway heatmap values are calculated through a series of steps:
The purpose of these steps are to produce a visually appealing heatmap that clearly shows how the expression of each pathway gene compares to the others.
By default, fold-change values are from the comparison selected on the "Feature Selection" page. To use the interactive visual analytics tools to view other results, navigate back to this page and update the drop-down menu.
All gene set analysis in EcoToxXplorer is over-representation analysis, computed using hypergeometric tests. This test computes the probability that the list of differentially expressed genes contains more results from a particular gene set than would be expected if the list of DEGs were drawn randomly. EcoToxXplorer supports gene set analysis of EcoToxModules and EcoToxProcesses. Other gene set libraries such as Gene Ontology do not contain enough representation from each pathway since the EcoToxChips have only 384 wells. Since the EcoToxModules and Processes are much larger, this is less of an issue.
You can drag your mouse over a section of the smaller overview on the left side of the heatmap page. The section that you highlight will appear in the focus view. Since gene set analysis is generally performed on the focus view, this allows you to visually identify clusters of genes with distinct expression patterns, view them in more detail in the focus view, and try to understand which biological processes they represent using enrichment analysis.
The following steps were taken to define the EcoToxModule gene sets:
See more details here ...
Either the mean or median of the abs(log2FC) for every gene on the EcoToxChip is computed to get a summary score for the whole chip. The colour is then determined by comparing it to the thresholds. This is the same method that is used to determine the colour for each process and module.
You will be logged off in seconds.
Do you want to continue your session? |