Frequently Asked Questions (FAQs)
Yes. The data files you upload for analysis, as well as any analysis results, are not downloaded or examined in any way by the administrators unless required for system maintenance and troubleshooting. All files are deleted automatically after 72 hours, and no archives or backups are kept unless you have registered an account and saved the analysis. You are advised to download your results immediately after performing an analysis.
EcoToxXplorer accepts data from 6 species, in the following formats:
There is a 50MB limit for the uploaded data. For gene expression profiles with 20 000 genes, this corresponds to about 300 samples. Note - since DESeq2 requires high computational resources, there is a 50 sample limit for this option.
It is critical to properly label your data so that they can be recognized and compared. The following common IDs are supported:
The gene expression data also should contain sample names in the first line. Each sample name should be unique. The class labels of experimental conditions should be in a new line beginning with "#CLASS". Multiple class labels can be indicated by adding a colon and its name (for example, "#CLASS:cancer_type" and "#CLASS:stage"). For meta-analysis, the same set of labels must be used for ALL datasets.
Here is a good tutorial on how to generate tab delimited text files from the Excel Spreadsheet program. When you open your data using any text editor (for example, WordPad), it should look like the following:
#NAME	Sample1	Sample2	Sample3	Sample4	Sampl5	Sampl6	Sample7	Sample8
#CLASS	case	case	case	case	control	control	control	control
Gene1	-3.06	-2.25	-1.15	-6.64	0.4	1.08	1.22	1.02
Gene2	-1.36	-0.67	-0.17	-0.97	-2.32	-5.06	0.28	1.32
Gene3	1.61	-0.27	0.71	-0.62	0.14	0.11	0.98
Gene4	0.93	1.29	-0.23	-0.74	-2	-1.25	1.07	1.27
#NAME	Sample1	Sample2	Sample3	Sample4	Sampl5	Sampl6	Sample7	Sample8
#CLASS:CANCER	case	case	case	case	control	control	control	control
#CLASS:SEX	F	F	M	M	F	M	F	M
Gene1	-3.06	-2.25	-1.15	-6.64	0.4	1.08	1.22	1.02
Gene2	-1.36	-0.67	-0.17	-0.97	-2.32	-5.06	0.28	1.32
Gene3	1.61	-0.27	0.71	-0.62	0.14	0.11	0.98
Gene4	0.93	1.29	-0.23	-0.74	-2	-1.25	1.07	1.27
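If you want to check a file against this layout before uploading it, a short script can help. The following is only a minimal sketch in Python, not the parser EcoToxXplorer uses: the function name `parse_expression_table` and the default label name `CLASS` are illustrative assumptions, and the checks simply mirror the rules described above (unique sample names, one value per sample in every row).

```python
import csv
import io


def parse_expression_table(text):
    """Parse tab-delimited text with #NAME and #CLASS header rows.

    Hypothetical helper for illustration; not part of EcoToxXplorer.
    """
    samples, classes, genes = [], {}, {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        key, values = row[0].strip(), row[1:]
        if key == "#NAME":
            samples = values
        elif key.startswith("#CLASS"):
            # "#CLASS" or "#CLASS:label"; unnamed labels get a default name
            _, _, label = key.partition(":")
            classes[label or "CLASS"] = values
        else:
            genes[key] = [float(v) for v in values]
    if len(set(samples)) != len(samples):
        raise ValueError("sample names must be unique")
    for name, values in list(classes.items()) + list(genes.items()):
        if len(values) != len(samples):
            raise ValueError(
                f"row {name!r} has {len(values)} values, expected {len(samples)}"
            )
    return samples, classes, genes
```

A row with fewer values than there are samples raises an error in this sketch, so it can be used to spot incomplete rows before upload.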
You have three options:
EcoToxXplorer supports four different types of files (.sif, .txt (edge list), .graphml and .json).
Please click on the following links to see example files supported:
Registering on EcoToxXplorer allows you to save up to 10 projects that will be stored in the system for 10 months. You will be able to reload the work state of previous projects to resume previous analysis.
Microarray data provides probe-level expression measurements and RNA-seq data provides exon-level or transcript-level (i.e. different isoforms of the same gene) expression measurements. However, current functional annotations are mainly assigned at the gene or protein level. Therefore, when multiple probes or transcripts are mapped to the same gene, they need to be summarized into a single value for that gene. At the Gene Annotation step, users can choose to use the averages or medians of multiple probe intensities (microarray), or sums of counts from multiple transcripts (RNA-seq) to perform gene-level summarization.
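The idea of gene-level summarization can be sketched in a few lines of Python. This is an illustration only, under the assumption that the data are held as a mapping from probe or transcript ID to a list of per-sample values; the names `summarize_to_genes` and `feature_to_gene` are hypothetical, and dropping unannotated features is a choice made for this sketch rather than documented EcoToxXplorer behaviour.

```python
from collections import defaultdict
from statistics import mean, median


def summarize_to_genes(rows, feature_to_gene, method="mean"):
    """Collapse probe- or transcript-level rows into one row per gene.

    rows: {feature_id: [value per sample]}
    feature_to_gene: {feature_id: gene_id}
    method: "mean" or "median" (microarray), "sum" (RNA-seq counts)
    """
    reduce_fn = {"mean": mean, "median": median, "sum": sum}[method]
    grouped = defaultdict(list)
    for feature_id, values in rows.items():
        gene = feature_to_gene.get(feature_id)
        if gene is not None:  # assumption: features without a gene are dropped
            grouped[gene].append(values)
    # Combine sample by sample across all features mapped to the same gene
    return {
        gene: [reduce_fn(column) for column in zip(*profiles)]
        for gene, profiles in grouped.items()
    }
```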
The purpose of filtering is to increase the statistical power of differential expression analysis by removing genes that are less likely to be informative. Please refer to the paper Independent filtering increases detection power for high-throughput experiments for a detailed discussion and benchmark tests.
Low variance filter: removes genes whose expression values change little across samples and thus have very low variance. Genes are ranked by their variance from low to high, and you can exclude a certain percentile of genes with the lowest variance by adjusting the "Variance filter" slider. The study referenced above suggests that up to 50% of genes can be removed based on their variance with improved results.
Low abundance filter: removes genes with very low abundance, which are not measured reliably and may not be biologically important. You can exclude genes below a certain threshold by adjusting the "Low abundance" slider. The study referenced above suggests that 10% of genes can be removed based on their abundance with improved results.
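The two filters can be sketched as rank-based cut-offs. The Python below is a simplified illustration, not the EcoToxXplorer implementation: it assumes both sliders are percentages of genes to drop, applies the abundance filter before the variance filter, and uses the mean as the abundance measure. The function names are hypothetical.

```python
from statistics import mean, pvariance


def drop_lowest(genes, score, percent):
    """Drop the given percentage of genes with the lowest score."""
    ranked = sorted(genes, key=lambda gene: score(genes[gene]))
    n_drop = int(len(ranked) * percent / 100)
    return {gene: genes[gene] for gene in ranked[n_drop:]}


def filter_genes(genes, variance_percent, abundance_percent):
    """Apply a low-abundance filter, then a low-variance filter.

    genes: {gene_id: [value per sample]}
    """
    kept = drop_lowest(genes, mean, abundance_percent)
    return drop_lowest(kept, pvariance, variance_percent)
```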
Normalizing the data accounts for systematic technical sources of variation so that biologically-driven changes in gene expression can be better detected between samples. A gene expression normalization method should be chosen unless the data has already been normalized, in which case the user should select "None".
All of the normalization methods available on EcoToxXplorer are well-established and have been used in many previous studies. They are based on slightly different assumptions about the underlying distributions, but should produce relatively similar results. If you are concerned about significant differences between normalization methods, you can try out more than one and visualize the results using the provided plots.
Tip: if you are not sure whether the data is already log transformed, you can easily figure this out by visualizing the data (e.g. with a boxplot). For microarray data, log transformed data values are usually less than 16. For RNA-seq data with 1 million reads, log2(1,000,000) is less than 20. Therefore, if all data values are below 20, it is reasonable to assume that the data has already been log transformed.
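The same rule of thumb can be applied programmatically. This is a rough heuristic sketch in Python, with a hypothetical function name; the default bound of 20 comes from the reasoning above, and the result should be treated as a hint rather than proof.

```python
def looks_log_transformed(genes, upper_bound=20):
    """Return True if every expression value is below the bound.

    genes: {gene_id: [value per sample]}
    """
    largest = max(value for values in genes.values() for value in values)
    return largest < upper_bound
```

Raw counts or raw intensities typically contain values in the hundreds or thousands, so a single large value is enough for this check to return False.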
Potential outlier samples can be identified from PCA plots. The potential outlier will distinguish itself as the one located far away from the major clusters formed by the remaining samples. To deal with outliers, the first thing is to check if the sample was measured properly. In many cases, outliers are the result of operational errors during the analytical process. If those values cannot be corrected, the sample should be removed from the input data and the analysis re-started.
Limma is a popular method for differential analysis that was first developed for microarray differential analysis. It addresses the problem of low sample sizes typical to whole-transcriptome studies by using the whole expression profile to make more stable estimates of gene expression variance. EdgeR and DESeq2 were both developed to analyze RNAseq data. All three methods are well-established and should give similar results. Please note:
In differential expression analysis, you should first determine whether any of the metadata encode blocking factors, then decide on how to classify individual samples into groups, and finally decide which groups of samples should be compared to each other using statistical tests. Let's assume that none of your metadata are blocking factors (more on that later) and try to understand how selecting primary and secondary factors creates different groups of samples. Consider the "Estrogen" example data, generated in a study that measured gene expression at multiple time points in breast cancer cells in which the estrogen receptor (ER) was either present or absent. Here, the metadata are "ER" and "TIME". As the figure below shows, selecting "ER" as the primary factor divides the data into two groups because "ER" has two different levels ('present' and 'absent'). Selecting "TIME" as the secondary factor results in four groups because the two primary groups are split based on the two time points. If there were three time points, each primary group would be split into three groups, resulting in six groups overall.
The defined groups can now be compared to find genes that are differentially expressed between them (more details on this in later sections). In some experimental designs, we aren't interested in finding the genes that are differentially expressed between the groups defined by the secondary factor because it is a blocking factor. Examples of blocking factors are subject IDs when multiple samples were taken from the same subject (e.g. paired samples, multiple tissue types), or batches of samples that were measured at different times or in different locations. If you indicate that your secondary factor is a blocking factor, EcoToxXplorer will conduct comparisons within the groups that it defines, which typically improves the accuracy of the overall result.
This means you do not have enough samples to perform the analysis you specified, usually when combining two metadata in an independent two-factor analysis (no blocking factors). In this case, the total number of groups will be the product of the number of levels in each metadata factor (i.e. if the primary metadata contains 3 levels, and the secondary metadata contains 4, the total number of groups will be 3 * 4 = 12). We recommend a minimum of 3 samples per group, therefore at least 36 samples are required in order to perform a 3 x 4 two-factor analysis.
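To check a design before running the analysis, you can count the samples in each combination of factor levels. The sketch below is illustrative Python with hypothetical function names; the default of 3 samples per group is the recommendation stated above.

```python
from collections import Counter
from itertools import product
from math import prod


def required_samples(levels_per_factor, min_per_group=3):
    """Return (number of groups, minimum total samples) for a design."""
    n_groups = prod(levels_per_factor)
    return n_groups, n_groups * min_per_group


def undersized_groups(primary, secondary, min_per_group=3):
    """List factor-level combinations with too few samples.

    primary, secondary: per-sample labels, in the same sample order.
    """
    counts = Counter(zip(primary, secondary))
    all_groups = product(sorted(set(primary)), sorted(set(secondary)))
    return {g: counts[g] for g in all_groups if counts[g] < min_per_group}
```

For the 3 x 4 example above, `required_samples([3, 4])` gives 12 groups and 36 samples.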
In this case, you should focus on a single primary metadata and leave the secondary metadata as "Not available", and perform differential analysis with regard to individual metadata. You can then choose the other metadata as the primary metadata and perform the analysis again. If there are no or very few significant genes identified, it is most likely that incorporating the secondary metadata into the analysis will not affect the result.
A pair-wise comparison tests for genes that are differentially expressed between any pair of groups. For example, take three groups A, B, and C. The "all pairwise" comparison will contrast A-B, A-C, and B-C. A time-series comparison will only contrast consecutive pairs of groups, so in our example only A-B and B-C. Time-series are commonly used when gene expression was measured at multiple time points, or after treatments with varying concentrations/durations.
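The difference between the two comparison types is easiest to see by listing the contrasts each one generates. The Python below is a small illustration with hypothetical function names; it assumes the groups are supplied in their natural order (for example, increasing time or dose).

```python
from itertools import combinations


def all_pairwise(groups):
    """Contrast every pair of groups."""
    return [f"{a}-{b}" for a, b in combinations(groups, 2)]


def time_series(groups):
    """Contrast consecutive groups only."""
    return [f"{a}-{b}" for a, b in zip(groups, groups[1:])]
```

With groups A, B and C, `all_pairwise` yields A-B, A-C and B-C, while `time_series` yields only A-B and B-C. The gap widens quickly: n groups give n(n-1)/2 pairwise contrasts but only n-1 consecutive ones.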
A nested comparison allows you to determine which genes respond differently to a treatment condition, respective to some other metadata. For example, consider the experimental design described in the section on multiple metadata where cells with and without an ER were measured at 10hrs and at 48 hrs. To find the genes that respond differently over time in the ER vs. noER cells, you would perform a nested comparison. First, compare ER10-ER48 to find the genes that are differentially expressed in cells with an ER (ERgenes). Next, compare noER10-noER48 to find the genes differentially expressed in cells with no ER (noERgenes). Finally, to find the genes that respond differently over time in ER vs. noER cells, compare ERgenes-noERgenes.
Selecting "Interaction only" will return significant results from only the ERgenes-noERgenes contrast. Otherwise the full model is returned, which is the combination of significant genes from the ER10-ER48, noER10-noER48, and the ERgenes-noERgenes contrasts.
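One common way to think about the ERgenes-noERgenes contrast is as a "difference of differences" for each gene. The sketch below shows that arithmetic on group means in Python. It is an interpretation offered for intuition, assuming log-scale expression values; it does not perform any statistical test and is not taken from the EcoToxXplorer source. The function name is hypothetical.

```python
from statistics import mean


def nested_contrasts(er10, er48, noer10, noer48):
    """Return (ER effect, noER effect, interaction) for one gene.

    Each argument is a list of log-scale expression values for one group.
    """
    er_effect = mean(er10) - mean(er48)        # ER10-ER48
    noer_effect = mean(noer10) - mean(noer48)  # noER10-noER48
    interaction = er_effect - noer_effect      # difference of differences
    return er_effect, noer_effect, interaction
```

A gene that changes by the same amount over time in both cell types has an interaction near zero, even if both individual effects are large.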
PCA (principal component analysis) and t-SNE (t-distributed stochastic neighbor embedding) are both popular dimension reduction techniques. In PCA, each principal component is the linear combination of the measured variables (here, genes) that explains the greatest amount of remaining variance in the data, after accounting for previously computed principal components. Unlike PCA, t-SNE is a stochastic method that captures non-linear relationships between samples. This means that each run of t-SNE will generate slightly different results.
The interactive PCA visualization summarizes all the data into the first three principal components (PCs). Each data point in the Scores Plot represents a sample. Samples that are close together are more similar to each other. The colors of these data points are based on the factor labels. Users can change the colors according to either of the two factor labels.
Each data point in the Loadings Plot represents a feature. When scores and loadings plots are viewed from the identical perspective, the direction of separation on the scores plot can be explained by the corresponding features on the same directions - i.e. features on the two ends of the direction contribute more to the pattern of separation.
Gene-level BMDs are calculated using six steps.
More details are given for each of these steps in the FAQs below.
The statistical models are the same models that the US Environmental Protection Agency and the National Toxicology Program have used in their studies on 'omics dose-response analysis. They are the same models available in BMDExpress, and are the ones recommended in the peer-reviewed report that outlines the National Toxicology Program Approach to Genomic Dose-Response Modeling. Exp2 - Exp5 are four different forms of exponential models. Poly2, Poly3, and Poly4 are polynomial models with degree 2, 3, and 4. More details and the full mathematical forms can be found in the NTP report linked above.
The NTP recommendations say to use all of the models other than Poly3 and Poly4. While these higher degree polynomials often give good fits, there are concerns that allowing too many changes of direction may overfit the data. However, of the remaining models (Exp2-5, Linear, Poly2, Power, and Hill), only Poly2 allows for non-monotonic behaviour. Thus, if you expect non-monotonic behaviour, you may wish to include higher order polynomials. The NTP report suggests that future implementations of dose-response software should include the ability to constrain Poly3 and Poly4 models to only change direction once. We plan to include this feature in future updates to EcoToxXplorer.
A statistical model has a "lack-of-fit" if it fails to adequately explain the relationship between the x (dose) and y (response) variables. In a lack-of-fit statistical test, the null hypothesis is that the model fits the data. Thus, in these tests a significant p-value indicates that there is evidence that the model does not adequately explain the data, and so here we check that the p-value is greater than the significance threshold. Significance thresholds commonly range from 0.05 to 0.5 depending on the desired stringency. The NTP recommendations give a threshold of 0.10.
After applying the lack-of-fit p-value threshold, there may be several statistical models that fit the data well. From these remaining models, the one with the lowest AIC (Akaike information criterion) is selected as the best fit. The AIC is a measure of prediction error that penalizes models with more parameters. This means that if there are two models that do an equally good job of explaining the data, the model with fewer parameters will be selected.
The AIC is not displayed in the results table on the curve fitting page, but it is included in the bmd.csv file that can be downloaded from the Analysis Pipeline side panel.
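The selection logic described above (lack-of-fit filter first, then lowest AIC) can be sketched as follows. This Python example is illustrative only: the dictionary keys `model`, `lof_p` and `aic` are a hypothetical layout, not the column names of bmd.csv, and the `aic` helper uses the standard definition AIC = 2k - 2 ln(L), where k is the number of parameters and L the maximized likelihood.

```python
def aic(n_parameters, log_likelihood):
    """Standard Akaike information criterion."""
    return 2 * n_parameters - 2 * log_likelihood


def select_best_model(fits, lof_threshold=0.10):
    """Pick the lowest-AIC model among those passing the lack-of-fit test.

    fits: list of dicts with 'model', 'lof_p' and 'aic' keys.
    Returns None if no model passes.
    """
    passing = [fit for fit in fits if fit["lof_p"] > lof_threshold]
    if not passing:
        return None
    return min(passing, key=lambda fit: fit["aic"])
```

Note that a model with the lowest AIC overall is still rejected if its lack-of-fit p-value does not exceed the threshold.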
BMR means "benchmark response", which is the pre-determined response level that is considered "adverse" or "significant". The dose that corresponds to the BMR is defined as the benchmark dose (BMD). In an ideal case, we would know which change in response variable is physiologically significant and potentially toxic, however this is rarely known for individual genes. For 'omics dose-response analysis, the NTP recommended approach defines the BMR for a gene as the mean of the control gene expression values, plus or minus a certain number of standard deviations of the control values. The number of standard deviations is the BMR factor, and increasing this parameter increases the absolute value of the BMR for each gene.
In the figure above, the BMR factor is one since this is the number of standard deviations that was used to calculate the BMR.
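As a worked example of this definition, the BMR bounds for one gene can be computed from its control values. The Python sketch below is a minimal illustration with a hypothetical function name; it assumes the sample standard deviation is used, which may differ from the estimator used by EcoToxXplorer.

```python
from statistics import mean, stdev


def benchmark_response(control_values, bmr_factor=1.0):
    """Return the (lower, upper) BMR for one gene.

    BMR = mean(control) -/+ bmr_factor * SD(control)
    """
    center = mean(control_values)
    spread = stdev(control_values)  # assumption: sample standard deviation
    return center - bmr_factor * spread, center + bmr_factor * spread
```

For control values 1, 2 and 3 (mean 2, standard deviation 1), a BMR factor of 1 gives bounds of 1 and 3, and a factor of 2 gives bounds of 0 and 4. The BMD is then the dose at which the fitted curve first crosses one of these bounds.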
The BMDl and BMDu are the lower and upper limits of the 95% confidence interval of the BMD, computed using the profile likelihood method. The BMDl is sometimes used instead of the BMD since it is a more conservative estimate that depends on the uncertainty of the model fit to the data.
There are several quality criteria applied to the BMDs to filter out low-confidence or otherwise undesirable BMD estimates:
Additionally, for some genes, the expression values never exceed the BMR and thus EcoToxXplorer cannot compute a BMD.
The initial curve fitting and BMD calculation are done at the gene level (geneBMD). However, the main statistic of interest is usually the dose at which the whole transcriptome is responding to chemical exposure, or the omicBMD. The NTP report defines the omicBMD based on pathway enrichment analysis of the geneBMDs (see the Pathway BMD Analysis FAQ section for more details). However, with many ecological species these pathway-based omicBMDs are unstable due to the low number of annotated gene sets compared to popular mammalian model organisms, and thus it has been proposed to estimate a statistical omicBMD based on the distribution of geneBMDs. There are three different methods for estimating statistical omicBMDs in EcoToxXplorer:
Gene set analysis is used to identify significantly overrepresented pathways in the list of geneBMDs. Pathway-level BMDs (pathBMDs) are calculated as the bootstrapped median of the geneBMDs in that pathway. The National Toxicology Program Approach to Genomic Dose-Response Modeling defines the omicBMD as the lowest pathBMD.
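The calculation can be sketched as follows. This Python example is an approximation for illustration: the exact bootstrap procedure is not specified here, so the sketch assumes resampling with replacement and takes the median of the resampled medians. The function names, the number of resamples and the fixed seed are all assumptions.

```python
import random
from statistics import median


def bootstrapped_median(gene_bmds, n_boot=1000, seed=0):
    """Median of the medians of bootstrap resamples of the geneBMDs."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    medians = [
        median(rng.choices(gene_bmds, k=len(gene_bmds))) for _ in range(n_boot)
    ]
    return median(medians)


def omic_bmd(path_bmds):
    """Return the (pathway, pathBMD) pair with the lowest pathBMD.

    path_bmds: {pathway_name: pathBMD}
    """
    pathway = min(path_bmds, key=path_bmds.get)
    return pathway, path_bmds[pathway]
```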
EcoToxXplorer uses overrepresentation analysis (ORA) to compute gene set enrichment statistics. ORA is a statistical technique to identify gene sets or pathways that have a significant overlap with a gene list of interest. In EcoToxXplorer, hypergeometric tests are used to compute the p-values.
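For readers unfamiliar with the hypergeometric test, the p-value is the probability of seeing at least the observed overlap if the gene list had been drawn at random from the gene universe. The Python below is a textbook sketch of that calculation, not the EcoToxXplorer code; the function and parameter names are hypothetical.

```python
from math import comb


def ora_p_value(n_universe, n_pathway, n_list, n_overlap):
    """Upper-tail hypergeometric p-value, P(X >= n_overlap).

    n_universe: number of genes in the background
    n_pathway:  number of background genes in the pathway
    n_list:     number of genes in the list of interest
    n_overlap:  number of list genes that are in the pathway
    """
    total = comb(n_universe, n_list)
    upper = min(n_pathway, n_list)
    tail = sum(
        comb(n_pathway, k) * comb(n_universe - n_pathway, n_list - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

For example, with a universe of 10 genes, a pathway of 5 genes and a list of 5 genes that all fall in the pathway, the p-value is 1/252.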
The NTP approach recommends no α cut-off, but to only consider pathways with at least 3 genes or 5% of the total pathway genes.
EcoToxXplorer supports enrichment analysis with gene sets from the Gene Ontology, PANTHER, KEGG, Reactome, and MSigDB databases. Note - not all gene set libraries are available for all species.
The GO:BP, GO:MF, and GO:CC gene sets include the complete set of Gene Ontology terms (> 45 000) for the biological process, molecular function, and cellular component categories. The PANTHER:BP, PANTHER:MF, and PANTHER:CC are reduced sets of GO terms ("GO slims") that have been manually chosen based on the PANTHER protein classification system. Briefly, the PANTHER project has created > 15 000 phylogenetic trees that encode the evolutionary relationships within protein families. Subsets of GO terms were chosen that best reflect the function gain or loss along the branches of the PANTHER trees for each of the BP, MF, and CC categories. In general, GO slims can simplify the interpretation of enrichment analysis results because they reduce the number of highly similar GO terms.
The KEGG and Reactome gene sets are networks of molecular interactions that represent biological pathways and processes. Reactome pathways are created through a process similar to scientific peer review, where different experts create and review the pathway organization, and all interactions contain references to the primary literature. KEGG pathways are also based on molecular interactions in the primary literature, but are accompanied by an extensive ortholog mapping that allows KEGG pathways to be rapidly extended to additional species based on genome sequence homology.
The pathway heatmap values are calculated through a series of steps:
The purpose of these steps is to produce a visually appealing heatmap that clearly shows how the expression of each pathway gene compares to the others.