
Statistical analysis (comparison) of time course experiments

I have data sets that represent multiple measurements over time. The data sets come from biological experiments in which we measure something in N individual cells, with around 250 measurements over a time frame of around one minute. For each of the time points, we have summary statistics (mean, SD, N) that we can plot to create a curve; this curve will thus be the average curve of, say, 30 individual curves from individual cells, and error bars will be either the SD or SEM of those individual curves.

Now, we have several such "average" curves that represent, e.g. "treated" versus "non-treated" cells. We need to statistically evaluate whether the curves are significantly different.

What I have done so far are multiple t-tests ("one per row") between two curves - this gives me individual p values, one for each time point, which looks like a reasonable start. In principle it tells me from what time point in the measurement the curves are significantly different (if they are at all).

Is there a tried-and-true statistical method to get a single p-value to determine whether two such curves are significantly different? Can anyone let me know (in layman's terms) how to do it?

edit: For illustration, here is a typical result - we have means, SD and N for every time point, but show only a few SDs here as error bars to not totally obscure the "average" line. What we'd like to have is one value that describes the significance of the difference between two lines, e.g. of the blue and the red ones.

three example graphs

  • statistical-significance
  • data-visualization
  • descriptive-statistics
  • multivariate-normal-distribution


  • You could have a look at the tag functional-data-analysis. Maybe you can show us some plots? – kjetil b halvorsen ♦ Commented May 10, 2022 at 20:05
Is there a tried-and-true statistical method to get a single p-value to determine whether two such curves are significantly different?

Unless you have some theoretical form for the change in signal over time, this seems to be an ideal application of generalized least squares regression. You model the individual observations as a flexible function of the combination of time and treatment, doing so in a way that takes the correlations of observations within individual cells into account. That is sometimes called the "growth curve model," a name that seems to describe your example data pretty well. Frank Harrell's course notes and book explain the approach with an extended example in Chapter 7.

Using 250 separate t-tests is both inefficient and potentially misleading. With a p < 0.05 cutoff for "significance" you would expect about a dozen false positives just by chance. Also, that doesn't directly model the time course or account for the fact that you have repeated measurements on individual cells.

You can accomplish what you want with a single generalized least squares model. Each observation is modeled individually, with one row per observation annotated with time, treatment, a cell ID, and any other covariates you want to evaluate.

The time course is fit with just a few coefficients, for example via a regression spline, versus the 250 separate coefficients you implicitly use now. With fairly simple, non-wiggly curves as you show, you only need a few "knots" to break the time axis into time periods for this. Within each time period between a pair of knots, values are fit with a cubic polynomial. At each knot, the polynomials from the time periods on both sides are constrained to be continuous and smoothly joined. From the data you show you might even be able to fit a single low-order polynomial over the entire time range. The idea is the same whether you use a single polynomial or a spline: describe the time course with just a few coefficients whose values you estimate directly from the data.

To evaluate the effect of treatment, you include as predictors in the model both a term for treatment by itself and an interaction between treatment and time, including interactions with the spline/polynomial terms modeling time. The coefficient for treatment would determine whether there was any difference between treated and control at a reference time point, typically at time = 0. The interaction terms allow the shape and time course to differ between the treatments.

To account for the correlations of observations within individual cells over time, you specify an overall structure for those correlations that is incorporated into the modeling. The Harrell references show how to do that in the context of generalized least squares. Mixed models with "random effects" for cells might be used instead to this end, but I don't see any advantage for your application.

"To get a single p-value to determine whether two such curves are significantly different," you do a test that evaluates all of the coefficients associated with treatment to determines whether any are different from 0. If you work with the tools in Harrell's rms package and its Gls() interface to generalized least squares, that's readily done with that package's anova() function. If you have a particular interest in the differences at 50 seconds, you can get error estimates that include information from all of the observations, not just those taken at 50 seconds.


  • Thank you very much for the detailed answer - there is quite a bit I first have to digest before I can get into action. One thing I do not (yet) understand is how, from modeling the curves (that is easy in my case), I can get to a statistical evaluation. In particular, I have the feeling (correct me if I am totally wrong) that by modeling the curves I will lose the information from the repeated measurements per time point (the error bars/standard deviation and n). Disregarding this information cannot be correct... – frofa Commented May 15, 2022 at 16:11
  • @frofa if you use all the individual observations, annotated as I suggest, then the statistical model will incorporate information from all the observations. You don't lose anything, as all of the observations contribute to estimating the coefficients that define the curves. – EdM Commented May 15, 2022 at 16:29
  • Thanks again. We do, of course, have all individual observations, but as we already have the summary statistics, we would very much like to work with those instead. This would scale much more easily to a larger number of experiments, and also allow us to use normalized data (as we do to create the average curves and error bars in my example). – frofa Commented May 16, 2022 at 10:54
  • @frofa using the summary statistics time point by time point discards all the information about trajectories over time within each cell. The data-storage and computational requirements to work with individual observations don't seem to be overwhelming here. Including the time 0 observation as a covariate helps normalize, and you can display 0-1 "normalized" data after the statistical tests based on generalized least squares. – EdM Commented May 16, 2022 at 11:24
  • @frofa I suppose you could consider weighted least squares (WLS) while modeling treatment and time as suggested for generalized least squares. That's a "specialization of generalized least squares." You weight each summarized time-point value by the inverse of its estimated variance. But, depending on the nature of the data, that's not only an inefficient use of the data but it can also be misleading, as it ignores the correlation structure within the data. – EdM Commented May 16, 2022 at 11:35
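As a rough sketch of that WLS idea (which the comment above cautions against), assuming a summary data frame summ with hypothetical columns avg, sd, n, time and treatment:

# Hedged sketch: WLS on per-time-point summaries. Each summarized mean
# gets weight n/sd^2, the inverse of its estimated variance sd^2/n.
library(splines)

fit_wls <- lm(avg ~ treatment * ns(time, df = 4),
              data = summ, weights = n / sd^2)
# F-tests for the treatment terms; note this ignores the within-cell
# correlation structure entirely, as warned in the comment above.
anova(fit_wls)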



  • Short Report
  • Open access
  • Published: 19 March 2010

Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays

  • Colin S Gillespie 1,2
  • Guiyuan Lei 1,2
  • Richard J Boys 1,2
  • Amanda Greenall 2,3
  • Darren J Wilkinson 1,2

BMC Research Notes volume 3, Article number: 81 (2010)


Large scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models.

Using the open source CRAN and Bioconductor repositories for R, we provide an example analysis and protocol which illustrate a variety of methods that can be used to analyse time-course microarray data. In particular, we highlight how to construct appropriate contrasts to detect differentially expressed genes and how to generate plausible pathways from the data. A maintained version of the R commands can be found at http://www.mas.ncl.ac.uk/~ncsg3/microarray/ .

Conclusions

CRAN and Bioconductor are stable repositories that provide a wide variety of appropriate statistical tools to analyse time course microarray data.

Introduction

As experimental costs decrease, large scale microarray experiments are becoming increasingly routine, particularly those which track a number of different cell lines through time. This is because time-course information provides valuable insight into the dynamic mechanisms underlying the biological processes being observed. However, a proper statistical analysis of time-course data requires the use of more sophisticated tools and complex statistical models. For example, problems due to multiple comparisons are increased by catering for changing effects over time. In this case study, we demonstrate how to analyse time-course microarray data by investigating a data set on yeast. We discuss issues related to normalisation, extraction of probesets for specific species, chip quality and differential expression. We also discuss network inference in the Additional file 1 . The freely available software system R (see [ 1 , 2 ]) has many benefits for analysing data of this type and so throughout the analysis we give the R commands that produce the numerical/graphical output shown in this paper. A maintained version of the R commands can be found at http://www.mas.ncl.ac.uk/~ncsg3/microarray/ .

Description of the data

The data were collected according to the experimental protocol described in [ 3 ]. Briefly, three biological replicates were studied for each of a wild-type (WT) yeast strain and a strain carrying the cdc13-1 temperature-sensitive mutation (in which telomere uncapping is induced by growth at temperatures above around 27°C). These replicates were sampled initially at 23°C (at which cdc13-1 has essentially WT telomeres) and then at 1, 2, 3 and 4 hours after a shift to 30°C to induce telomere uncapping. The thirty resulting RNA samples were hybridised to Affymetrix yeast2 arrays. The microarray data are available in the ArrayExpress database (see [ 4 ]) under accession number E-MEXP-1551.

Loading microarray data into Bioconductor

Installing bioconductor and associated packages.

Assuming that R is already installed, Bioconductor is fairly straightforward to install using its installation script, viz:

> url = 'http://bioconductor.org/biocLite.R'
> source(url)
> biocLite()

This installs a number of base packages, including affy , affyPLM , limma , and gcrma (see [ 5 – 7 ]). Additional non-standard packages can also be easily installed. For example, the additional packages needed for this paper can be installed by using

> #From Bioconductor
> biocLite(c('ArrayExpress', 'Mfuzz', 'timecourse', 'yeast2.db', 'yeast2probe'))
> #From CRAN
> install.packages(c('GeneNet', 'gplots'))

Bioconductor packages are updated regularly on the web and so users can easily update their currently installed packages by starting a new R session and then using

> update.packages(repos = biocinstallRepos())

See [ 8 ] for further details on installation.

A list of packages used in this paper is given in the Additional file 1 .

Entering data into Bioconductor

The data used in this paper can be downloaded from ArrayExpress into R using the commands

> library(ArrayExpress)
> yeast.raw = ArrayExpress('E-MEXP-1551')

Unfortunately due to changes in the ArrayExpress website, the ArrayExpress package for Bioconductor 2.4 (the default version for R 2.9) produces an error and so we must use the package in Bioconductor 2.5 (the default version for R 2.10). Details for downloading the latest ArrayExpress package can be found in the Additional file 1 .

A brief description of the yeast.raw object can be obtained by using the print(yeast.raw) command:

AffyBatch object
size of arrays = 496 × 496 features (3163 kb)
cdf = Yeast_2 (10928 affyids)
number of samples = 30
number of genes = 10928
annotation = yeast2

If the Affymetrix microarray data sets have been downloaded into a single directory, then the .cel files can be loaded into R using the ReadAffy command.
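A sketch (not from the paper), assuming the .cel files sit in the current working directory:

> #Read all .cel files found in the working directory
> yeast.raw = ReadAffy()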

Also available from ArrayExpress are the experimental conditions. However, some preprocessing is necessary:

> ph = yeast.raw@phenoData
> exp_fac = data.frame(data_order = seq(1, 30),
+                      strain = ph@data$Factor.Value.GENOTYPE.,
+                      replicates = ph@data$Factor.Value.INDIVIDUAL.,
+                      tps = ph@data$Factor.Value.TIME.)
> levels(exp_fac$strain) = c('m', 'w')
> exp_fac = with(exp_fac, exp_fac[order(strain, replicates, tps), ])
> exp_fac$replicate = rep(c(1, 2, 3), each = 5, 2)

The data frame exp_fac stores the information, such as strain, time and replicate, that is needed for the statistical analysis.

Note that there are two yeast species on this chip, S. pombe and S. cerevisiae . Also, amongst the 10,928 probesets (with each probeset having 11 probe pairs), there are 5,900 S. cerevisiae probesets.

Pre-processing

Extraction of S. cerevisiae probesets.

As these microarrays contain probesets for both S. cerevisiae and S. pombe , we first need to extract the S. cerevisiae data before normalisation. This can be done by filtering out the S. pombe data using the s_cerevisiae.msk file from the Affymetrix website (see [ 9 ]). Note that users first need to register with the Affymetrix website before downloading this file. Also note that in our analysis, the transcript id, i.e. the systematic ORF name (obtained from [ 10 ]), is used for genes with no name.

We obtain a data frame containing lists of S. cerevisiae genes, probes and transcripts (using the function ExtractIDs() in the Additional file 1 ) as follows

> #Read in the mask file
> s_cer = read.table('s_cerevisiae.msk', skip = 2, stringsAsFactors = FALSE)
> probe_filter = s_cer[[1]]
> source('ExtractIDs.R')
> c_df = ExtractIDs(probe_filter)

We also need to restrict the view of yeast.raw to the x- and y-coordinates of the S. cerevisiae probesets in the cdf environment by using

> #Get the raw dataset for S. cerevisiae only
> library(affy)
> library(yeast2probe)
> source('RemoveProbes.R')
> cleancdf = cleancdfname(yeast.raw@cdfName)
> RemoveProbes(probe_filter, cleancdf, 'yeast2probe')

Note that the commands in RemoveProbes.R are listed in the Additional file 1. Thus the attributes of yeast.raw, obtained via print(yeast.raw), are now

size of arrays = 496 × 496 features (3167 kb)
cdf = Yeast_2 (5900 affyids)
number of genes = 5900

and the number of genes (actually probesets here) is 5,900 now that the S. pombe probesets have been removed.

Data Quality Assessment

Before any formal statistical analysis, it is important to check for data quality. Initially, we might examine the perfect and mismatch probe-level data to detect anomalies. Images of the first five arrays can be obtained using

> op = par(mfrow = c(3, 2))
> for (i in 1:5) {
+    plot_title = paste('Strain:', exp_fac$strain[i], 'Time:', exp_fac$tps[i])
+    d = exp_fac$data_order[i]
+    image(yeast.raw[, d], main = plot_title)
+ }

These commands produce the image shown in the Additional file 1 : Figure S2. Data quality can be assessed by examining such images for anything that appears non-random such as rings, shadows, lines and strong variations in shade. The images for our data set do not appear to have any non-random structure and so data quality is probably high.

Another useful quality assessment tool is to examine density plots of the probe intensities. The command

> d = exp_fac$data_order[1:5]
> hist(yeast.raw[, d], lwd = 2, ylab = 'Density', xlab = 'Log (base 2) intensities')

produces the image shown in the Additional file 1 : Figure S3. Typically, differences in spread and position are corrected by normalisation. However, the appearance of significant multi-modality in the distribution, or of many outlying observations, is indicative of poor data quality.

Other exploratory data analysis techniques that should be carried out include MA plots, in which two microarrays are compared by plotting, for each probe, the log intensity difference against the average log intensity. It is also of interest to examine RNA degradation (see [ 6 ]), although [ 11 ] casts some doubt on the validity of this method. For detailed instructions on how to carry out both of these methods in R, see [ 12 , 13 ].
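As a rough sketch (not part of the original protocol), both checks can be run with functions from the affy package:

> #MA plot comparing the first two arrays (sketch)
> MAplot(yeast.raw[, 1:2], pairs = TRUE)
> #RNA degradation summary and plot (sketch)
> rna.deg = AffyRNAdeg(yeast.raw)
> summaryAffyRNAdeg(rna.deg)
> plotAffyRNAdeg(rna.deg)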

Normalising Microarray Data

There are a number of methods for normalising microarray data. Two of the most popular methods are GeneChip RMA (GCRMA) and Robust Multi-array Average (RMA); see [ 14 , 15 ]. Essentially, GCRMA and RMA differ in how they deal with background noise, with GCRMA using a more sophisticated correction algorithm. However, the approach adopted by GCRMA means that it can be time-consuming to use with large data sets, in contrast to RMA. A potential drawback of using RMA is that it assumes that the overall levels of expression are similar for each array. However, this assumption may be invalid if, for example, mutant cells have a radically different level of transcriptional activity from the WT. For further information regarding normalising microarray data sets, see for example [ 16 , 17 ].

Since we have thirty microarray data sets and believe that the levels of transcriptional activity are similar across strains, we will use the RMA normalisation method. This technique normalises across the set of hybridizations at the probe level. The data can be normalised via

> yeast.rma = rma(yeast.raw)
> yeast.matrix = exprs(yeast.rma)[, exp_fac$data_order]
> cnames = paste(exp_fac$strain, exp_fac$tps, sep = ' ')
> colnames(yeast.matrix) = cnames
> exp_fac$data_order = 1:30

The normalisation procedure consists of three steps: model-based background correction, quantile normalisation and robust averaging. The aim of quantile normalisation is to make the distribution of probe intensities the same for each array in a set of arrays. We illustrate its effect by studying boxplots of the raw S. cerevisiae data against their normalised counterparts, shown in the Additional file 1 : Figure S4. Boxplots provide a useful graphical view of data distributions, showing the median, quartiles, maximum and minimum values. The boxplot command is in the affyPLM package and so the figure is produced by using

> library(affyPLM)
> par(mfrow = c(1, 2))
> #Raw data intensities
> boxplot(yeast.raw, col = 'red', main = "")
> #Normalised intensities
> boxplot(yeast.rma, col = 'blue')

Principal Component Analysis

Principal component analysis (PCA) is useful in exploratory data analysis as it can reduce the number of variables to consider whilst still retaining much of the variability in the data. In particular, PCA is useful for identifying patterns in the data. Essentially, principal components partition the data into orthogonal linear components which explain different contributions to the variability in the data. The first component explains the largest contribution to variability in the original dataset, that is, retains most information, with the second component explaining the next largest contribution to variability, and so on. The following commands calculate the principal components

> yeast.PC = prcomp(t(yeast.matrix))
> yeast.scores = predict(yeast.PC)

which we can then plot using

> #Plot of the first two principal components
> plot(yeast.scores[, 1], yeast.scores[, 2],
+      xlab = 'PC 1', ylab = 'PC 2',
+      pch = rep(seq(1, 5), 6),
+      col = as.numeric(exp_fac$strain))
> legend(-20, -4, pch = 1:5, cex = 0.6, c('t 0', 't 60', 't 120', 't 180', 't 240'))

Figure 1 highlights a clear (and expected) time effect in the mutant yeast which is not present in the wild-type strain. In particular, mutant samples are clustered by their time points; for example, the three mutant replicates at time point 4 are clustered at the bottom right of the figure.

figure 1

A plot of the first two principal components. The red symbols correspond to the wild-type strain.

Identifying differentially expressed genes

In this experiment, interest lies in differences in gene expression over time between the wild-type and mutant yeast strains. It is expected that the wild-type expression level is independent of time. Also we anticipate that the mutant expressions at time t = 0 are the same as the wild-type expression level. This hypothesis is supported by the PCA plot in Figure 1 .

There are currently two main packages available to detect differentially expressed genes using this kind of data: the timecourse package and the limma package. We illustrate how to analyse these data using both packages.

Using the timecourse package

This package assesses treatment differences by comparing time-course mean profiles allowing for variability both within and between time points. It uses the multivariate empirical Bayes model proposed by [ 18 ].

Further details of the timecourse package can be found in [ 19 ]. After installing the timecourse library, we construct a size matrix describing the replication structure using

> library(timecourse)
> size = matrix(3, nrow = 5900, ncol = 2)
> c.grp = as.character(exp_fac$strain)
> t.grp = as.numeric(exp_fac$tps)
> r.grp = as.character(exp_fac$replicate)
> MB.2D = mb.long(yeast.matrix, times = 5, method = '2', reps = size,
+                 condition.grp = c.grp, time.grp = t.grp, rep.grp = r.grp)

The top (say) one hundred genes can be extracted via

> gene_positions = MB.2D$pos.HotellingT2[1:100]
> gnames = rownames(yeast.matrix)
> gene_probes = gnames[gene_positions]

The expression profiles can also be easily obtained. The profile for the top ranked expression is found using

> plotProfile(MB.2D, ranking = 1, gnames = rownames(yeast.matrix))

and is shown in the Additional file 1 : Figure S5.

Using the limma package

The limma package uses the moderated t-statistic described by [ 7 , 20 ]. The function lmFit within the limma library fits a linear model for each gene for a given series of arrays, where the coefficients of the fitted models describe the differences between the RNA sources hybridised to the arrays. Precisely, we fit the model E[y_g] = X α_g, where y_g = (y_{g,1}, ..., y_{g,n})^T contains the expression values for gene g across the n arrays, X is a design matrix describing key features of the experimental design used, and α_g is the coefficient vector for gene g.

In the analysis studied here, the yeast data consist of n = 30 arrays. The entries in the columns of X depend on the experimental design used: there are two yeast strains (mutant and wild type), each measured at five separate time points, and we are interested in comparing the gene expressions between mutant and wild-type strains over time. Thus we seek a linear model describing the ten strain × time combinations by determining values for the ten coefficients in the coefficient vector α_g. We label these ten coefficients as ('m0', 'm60', 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240'), where the first five represent the levels of the mutant strain at the five time points (0, 1, 2, 3 and 4 hours after the temperature shift) and the remaining five are the equivalent versions for the wild-type strain. Statistically speaking, the model has a single factor with ten levels.

The design matrix X links these factor levels to the data in the arrays: its entries are zero except where an array contributes an observation to a particular strain × time combination. For example, array 26 measures the expression of the first wild-type microarray at time t = 0 and so contributes an observation to level 'w0', the sixth strain × time combination; thus the entry in row 26, column 6 of the design matrix is X(26, 6) = 1. Further, the arrays are arranged in groups of three replicates, so the overall experimental structure (expt_structure below) has three arrays on level 'm0', then three arrays on 'm60', and so on. Setting up the factor levels and the design matrix is done in R by using

> library (limma)

> expt_structure = factor ( colnames (yeast.matrix))

> #Construct the design matrix

> X = model.matrix (~0 + expt_structure)

> colnames (X) = c ('m0', 'm60', 'm120', 'm180', 'm240', 'w0', 'w60', 'w120', 'w180', 'w240')

and then the coefficient vector α g is estimated via the command

> lm.fit = lmFit(yeast.matrix, X)

Determining the differentially expressed genes amounts to studying contrasts of the various strain × time levels, as described by a contrast matrix C. For these data, we are mainly interested in differences at the later time points, and so a possible set of contrasts to investigate is that of differences between the mutant and wild-type strains at each time point, that is, ('m60-w60', 'm120-w120', 'm180-w180', 'm240-w240'). The limma package allows complete flexibility over the choice of contrasts; however, this necessarily adds a level of complexity. The values in the vector of contrasts, β_g = C^T α_g for gene g, describe the size of the difference between strains at each time point. The relevant R commands are

> mc = makeContrasts('m60-w60', 'm120-w120', 'm180-w180', 'm240-w240', levels = X)
> c.fit = contrasts.fit(lm.fit, mc)
> eb = eBayes(c.fit)

The final command uses the eBayes function to produce moderated t-statistics which assess whether individual contrast values β_{gj} are plausibly zero, corresponding to no significant evidence of a difference between strains at time point j. The moderated t-statistic is constructed using a shrinkage approach and so is not as sensitive as the standard t-statistic to small sample sizes. It also gives a moderated F-statistic which can be used to test whether all contrasts are zero simultaneously, that is, whether there is no difference between strains at any time point.

Ranking differentially expressed genes

There are a number of ways to rank the differentially expressed genes. For example, they can be ranked according to their log-fold change

> #see help(toptable) for more options

> toptable(eb, sort.by = 'logFC')

or by using F -statistics

> topTableF(eb)

The advantage of using F -statistics over the log fold change is that the F -statistic takes into account variability and reproducibility, in addition to fold-change.

Our analysis is based on a large number of statistical tests, and so we must correct for this multiple testing. In our example we use the (very) conservative Bonferroni correction, since we have a large number of differentially expressed genes and the resulting corrected list is still long. Another common method of correcting for multiple testing is to use the false discovery rate (fdr) (use the command ?p.adjust to obtain further details). The following commands rank genes according to their (corrected) F-statistic p-value and annotate the output by indicating the direction of the change for each contrast for each gene: +1 for up-regulated expression (mutant type having higher expression than wild type at a particular time point), -1 for down-regulated expression and 0 for no significant change.

> modFpvalue = eb$F.p.value
> #Change 'bonferroni' to 'fdr' to use the false discovery rate as a cut-off
> indx = p.adjust(modFpvalue, method = 'bonferroni') < 0.05
> sig = modFpvalue[indx]
> #No. of significantly differentially expressed genes
> nsiggenes = length(sig)
> results = decideTests(eb, method = 'nestedF')
> modF = eb$F
> modFordered = order(modF, decreasing = TRUE)
> #Retrieve the significant probes and genes
> c_rank_probe = c_df$probe[modFordered[1:nsiggenes]]
> c_rank_genename = c_df$genename[modFordered[1:nsiggenes]]
> #Create a list and write to a file
> updown = results[modFordered[1:nsiggenes], ]
> write.table(cbind(c_rank_probe, c_rank_genename, updown),
+             file = 'updown.csv', sep = ',', row.names = FALSE, col.names = FALSE)

The following code (adapted from lecture material found at [ 13 ]) plots the time course expression for the top one hundred differentially expressed genes according to their F -statistic (see Figure 2 ).

figure 2

Time course expression levels for the top 9 differentially expressed genes, ranked by their F-statistic.

> #Rank of Probesets, also output gene names
> par(mfrow = c(3, 3), ask = TRUE, cex = 0.5)
> for (i in 0:99) {
+    indx = rank(modF) == nrow(yeast.matrix) - i
+    id = c_df$probe[indx]
+    name = c_df$genename[indx]
+    genetitle = paste(sprintf('%.30s', id), sprintf('%.30s', name), 'Rank =', i + 1)
+    exprs.row = yeast.matrix[indx, ]
+    plot(0, pch = NA, xlim = range(0, 240), ylim = range(exprs.row),
+         ylab = 'Expression', xlab = 'Time', main = genetitle)
+    for (j in 1:6) {
+       pch_value = as.character(exp_fac$strain[5 * j])
+       points(c(0, 60, 120, 180, 240), exprs.row[(5 * j - 4):(5 * j)],
+              type = 'b', pch = pch_value)
+    }
+ }

When interpreting rank orderings based on statistical significance, it is important to bear in mind that a statistically significant differential expression is not always biologically meaningful. For example, Figure 2 contains RNR2 . This gene is highly significant because of low variation in its time course. However, the actual difference in expression levels between the wild-type and mutant strains is relatively small. We address this problem in the next section.

Comparison of the timecourse and limma packages

Both packages have different strengths. One advantage of the timecourse package over the limma package is that it allows for correlation between repeated measurements on the same experimental unit, thereby reducing false positives and false negatives; these false positives/negatives are a significant problem when the variance-covariance matrix is poorly estimated. An advantage of the limma package is that it offers more flexibility by allowing users to construct different contrasts. In general we might expect both packages to produce fairly similar lists of, say, the top 100 probesets. In the analysis of the yeast data, we can determine the overlap of the top 100 probesets by using

> N = 100
> gene_positions = MB.2D$pos.HotellingT2[1:N]
> tc_top_probes = gnames[gene_positions]
> lm_top_probes = c_df$probe[modFordered[1:N]]
> length(intersect(tc_top_probes, lm_top_probes))

The result is a moderately large overlap of fifty-three probesets. We note that changing the ranking method in the limma package also yields results similar to those given by the timecourse library.

Two fold-change list

When looking for "interesting" genes it can be helpful to restrict attention to those differentially expressed genes that are both statistically significant and of biological interest. This objective can be achieved by considering only significant genes which show, say, at least a two-fold change in their expression level. This gene list is obtained using the following code (adapted from [ 12 ])

> #Obtain the maximum fold change but keep the sign
> maxfoldchange = function(foldchange)
+    foldchange[which.max(abs(foldchange))]
> difference = apply(eb$coeff, 1, maxfoldchange)
> pvalue = eb$F.p.value
> lodd = -log10(pvalue)
> #hfc: high fold-change
> nd = (abs(difference) > log(2, 2))
> ordered_hfc = order(abs(difference), decreasing = TRUE)
> hfc = ordered_hfc[1:length(difference[nd])]
> np = p.adjust(pvalue, method = 'bonferroni') < 0.05
> #lpv: low p-value (large F-value)
> ordered_lpv = order(abs(pvalue), decreasing = FALSE)
> lpv = ordered_lpv[1:length(pvalue[np])]
> oo = union(lpv, hfc)
> ii = intersect(lpv, hfc)

Figure 3 contains a "volcano" plot which illustrates the effect of using different levels of fold change and significance thresholds. The figure is produced by using the following code

figure 3

Volcano plot showing the Bonferroni cut-off and the two-fold change.

> #Construct a volcano plot using moderated F-statistics
> par(cex = 0.5)
> plot(difference[-oo], lodd[-oo], xlim = range(difference), ylim = range(lodd))
> points(difference[hfc], lodd[hfc], pch = 18)
> points(difference[lpv], lodd[lpv], pch = 1)
> #Add the cut-off lines
> abline(v = log(2, 2), col = 5); abline(v = -log(2, 2), col = 5)
> abline(h = -log10(0.05 / 5900), col = 5)
> text(min(difference) + 1, -log10(0.05 / 5900) + 0.2, 'Bonferroni cut off')
> text(1, max(lodd) - 1, paste(length(ii), 'intersects'))

Cluster Analysis

Biological insight can be gained by determining groups of differentially expressed genes, that is, groups of genes which increase or decrease simultaneously. This can be achieved by using cluster analysis.

Traditional cluster analysis

In this section, we separate the top fifty differentially expressed genes into groups of similar pattern (clusters). Clearly different genes will have different overall levels of expression and so we first standardise their measurements by taking the expression level of the mutant strain (at each time point) relative to the wild-type at time t = 0:

> c_probe_data = yeast.matrix[ii, ]
> #Average of WT
> wt_means = apply(c_probe_data[, 16:30], 1, mean)
> m = matrix(nrow = dim(c_probe_data)[1], ncol = 5)
> for (i in 1:5) {
+    mut_rep = c(i, i + 5, i + 10)
+    m[, i] = apply(c_probe_data[, mut_rep], 1, mean) - wt_means
+ }
> colnames(m) = sort(unique(exp_fac$tps))

The heatmap in Figure 4 is obtained by using the function heatmap.2 from the library gplots via the following code

figure 4

Clustering of the top fifty differentially expressed genes. Red and green correspond to up- and down-regulation respectively.

> library(gplots)
> #Cluster the top 50 genes
> heatmap.2(m[1:50, ], dendrogram = 'row', Colv = FALSE, col = greenred(75),
+           key = FALSE, keysize = 1.0, symkey = FALSE, density.info = 'none',
+           trace = 'none', colsep = rep(1:10), sepcolor = 'white', sepwidth = 0.05,
+           hclustfun = function(c) { hclust(c, method = 'average') },
+           labRow = NA, cexCol = 1)

Figure 4 shows the relative expression levels for the mutant strain at each time point ('0', '60', '120', '180', '240'). As expected, the relative expression levels at time t = 0 are very similar. However, as time progresses, groupings of genes appear whose levels are up-regulated (red) or down-regulated (green). Note that the intensity of the colour corresponds to the magnitude of the relative expression. Gene names appear on the right side of the figure and on the left side, the cluster dendrogram shows which genes have similar expression. The dendrogram suggests that there are perhaps six to ten clusters.

Soft clustering

Soft clustering methods have the advantage that a probe can be assigned to more than one cluster. Furthermore, it is possible to grade cluster membership within particular groupings. Soft clustering is considered more robust when dealing with noisy data; for more details see [ 21 , 22 ]. The Mfuzz package implements soft clustering using a fuzzy c-means algorithm. Analysing the data for c = 8 clusters is achieved by using

> library(Mfuzz)
> tmp_expr = new('ExpressionSet', exprs = m)
> cl = mfuzz(tmp_expr, c = 8, m = 1.25)
> mfuzz.plot(tmp_expr, cl = cl, mfrow = c(2, 4), new.window = FALSE)

Of course, it is usually not clear how many clusters there are (or should be) within a dataset and so the sensitivity of conclusions to the choice of number of clusters ( c ) should always be investigated. For example, if c is chosen to be too large then some clusters will appear sparse and this might suggest choosing a smaller value of c . Figure 5 shows the profiles of the eight clusters obtained from the Mfuzz package. The probes present within each cluster can be found by using

> cluster = 1
> cl[[4]][, cluster]

figure 5

Eight clusters obtained using the Mfuzz package.

The response to telomere uncapping in cdc13-1 strains was expected to share features in common with responses to cell cycle progression, environmental stress, DNA damage and other types of telomere damage. The statistical analysis determined lists of probesets associated with genes involved in all of these processes. The techniques used focussed on making best use of the temporal information in time-course data. The use of cdc13-1 strains, which uncap telomeres quickly and synchronously, also allowed the identification of genes involved in the acute response to telomere damage. This case study has demonstrated the power of R/Bioconductor to analyse time-course microarray data. Whilst the statistical analysis of such data is still an active research area, this paper presents some of the cutting-edge tools that are available to the life science community. All software discussed in this article is free, with many of the packages being open-source and subject to on-going development.

References

1. R Development Core Team: R: A Language and Environment for Statistical Computing. 2009, Vienna, Austria. [http://www.r-project.org]

2. Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, Hornik K, Hothorn T, Huber W, Iacus S, Irizarry R, Leisch F, Li C, Maechler M, Rossini AJ, Sawitzki G, Smith C, Smyth G, Tierney L, Yang JYH, Zhang J: Bioconductor: open software development for computational biology and bioinformatics. Genome Biology. 2004, 5: R80. doi:10.1186/gb-2004-5-10-r80

3. Greenall A, Lei G, Swan DC, James K, Wang L, Peters H, Wipat A, Wilkinson DJ, Lydall D: A genome wide analysis of the response to uncapped telomeres in budding yeast reveals a novel role for the NAD+ biosynthetic gene BNA2 in chromosome end protection. Genome Biology. 2008, 9: R146. doi:10.1186/gb-2008-9-10-r146

4. Parkinson H, Kapushesky M, Kolesnikov N, Rustici G, Shojatalab M, Abeygunawardena N, Berube H, Dylag M, Emam I, Farne A, Holloway E, Lukk M, Malone J, Mani R, Pilicheva E, Rayner TF, Rezwan F, Sharma A, Williams E, Bradley XZ, Adamusiak T, Brandizi M, Burdett T, Coulson R, Krestyaninova M, Kurnosov P, Maguire E, Neogi SG, Rocca-Serra P, Sansone SA, Sklyar N, Zhao M, Sarkans U, Brazma A: ArrayExpress update - from an archive of functional genomics experiments to the atlas of gene expression. Nucleic Acids Research. 2009, 37: D868-72. doi:10.1093/nar/gkn889

5. Gautier L, Cope L, Bolstad BM, Irizarry RA: affy - analysis of Affymetrix GeneChip data at the probe level. Bioinformatics. 2004, 20 (3): 307-15. doi:10.1093/bioinformatics/btg405

6. Bolstad BM, Collin F, Brettschneider J, Simpson K, Cope L, Irizarry RA, Speed TP: Bioinformatics and Computational Biology Solutions Using R and Bioconductor. 2005, New York: Springer-Verlag, Statistics for Biology and Health, 33-48.

7. Smyth GK: Limma: linear models for microarray data. Bioinformatics and Computational Biology Solutions Using R and Bioconductor. Edited by: Gentleman R, Carey V, Dudoit S, Irizarry R, Huber W. 2005, New York: Springer, 397-420.

8. Bioconductor: Installation instructions. [http://www.bioconductor.org/docs/install]

9. Pombe and Cerevisiae Filter. [http://www.affymetrix.com/Auth/support/downloads/mask_files/s_cerevisiae.zip]

10. Yeast Annotation File. [http://www.affymetrix.com/Auth/analysis/downloads/na24/ivt/Yeast_2.na24.annot.csv.zip]

11. Archer KJ, Dumur CI, Joel SE, Ramakrishnan V: Assessing quality of hybridized RNA in Affymetrix GeneChip experiments using mixed-effects models. Biostatistics. 2006, 7 (2): 198-212.

12. Gregory Alvord W, Roayaei JA, Quiñones OA, Schneider KT: A microarray analysis for differential gene expression in the soybean genome using Bioconductor and R. Briefings in Bioinformatics. 2007, 8: 415-31. doi:10.1093/bib/bbm043

13. Analysis of Gene Expression Data Short Course, JSM2005. [http://bioinf.wehi.edu.au/marray/jsm2005/]

14. Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics. 2003, 4: 249-64.

15. Wu Z, Irizarry RA, Gentleman R, Martinez-Murillo R, Spencer R: A Model Based Background Adjustment for Oligonucleotide Expression Arrays. Journal of the American Statistical Association. 2004, 99: 909-917. doi:10.1198/016214504000000683

16. Harr B, Schlötterer C: Comparison of algorithms for the analysis of Affymetrix microarray data as evaluated by co-expression of genes in known operons. Nucleic Acids Research. 2006, 34: e8. doi:10.1093/nar/gnj010

17. Labbe A, Roth MP, Carmichael PH, Martinez M: Impact of gene expression data pre-processing on expression quantitative trait locus mapping. BMC Proceedings. 2007, 1 (Suppl 1): S153. doi:10.1186/1753-6561-1-s1-s153

18. Tai YC, Speed TP: A multivariate empirical Bayes statistic for replicated microarray time course data. The Annals of Statistics. 2006, 34: 2387-2412. doi:10.1214/009053606000000759

19. Tai YC: timecourse: Statistical Analysis for Developmental Microarray Time Course Data. 2007, [http://www.bioconductor.org]

20. Smyth GK: Linear models and empirical Bayes methods for assessing differential expression in microarray experiments. Statistical Applications in Genetics and Molecular Biology. 2004, 3: Article 3. doi:10.2202/1544-6115.1027

21. Futschik ME, Carlisle B: Noise-robust soft clustering of gene expression time-course data. Journal of Bioinformatics and Computational Biology. 2005, 3: 965-88. doi:10.1142/S0219720005001375

22. Futschik M: Mfuzz: Soft clustering of time series gene expression data. 2007, [http://itb.biologie.hu-berlin.de/~futschik/software/R/Mfuzz/index.html]


Acknowledgements

We wish to thank Dan Swan (Newcastle University Bioinformatics Support Unit) and David Lydall for helpful discussions. The authors are affiliated with the Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN) at Newcastle University, which is supported jointly by the Biotechnology and Biological Sciences Research Council (BBSRC) and the Engineering and Physical Sciences Research Council (EPSRC).

Author information

Authors and affiliations.

School of Mathematics & Statistics, Newcastle University, Newcastle upon Tyne, NE1 7RU, UK

Colin S Gillespie, Guiyuan Lei, Richard J Boys & Darren J Wilkinson

Centre for Integrated Systems Biology of Ageing and Nutrition (CISBAN), Newcastle University, UK

Colin S Gillespie, Guiyuan Lei, Richard J Boys, Amanda Greenall & Darren J Wilkinson

Institute for Ageing and Health, Newcastle University, Campus for Ageing and Vitality, Newcastle upon Tyne, NE4 5PL, UK

Amanda Greenall


Corresponding author

Correspondence to Colin S Gillespie .

Additional information

Competing interests.

The authors declare that they have no competing interests.

Authors' contributions

AG conducted the microarray experiments. All authors participated in the analysis of the data and in the writing of the manuscript.

Colin S Gillespie, Guiyuan Lei contributed equally to this work.

Electronic supplementary material


Additional file 1: Additional R commands and analysis. 1. R commands for extracting S. cerevisiae ids, removing unwanted probesets and converting probesets to genes. 2. R commands for genetic regulatory network inference. 3. A list of R packages used in this manuscript. 4. Additional figures. (PDF 424 KB)


This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License ( http://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


About this article

Cite this article.

Gillespie, C.S., Lei, G., Boys, R.J. et al. Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays. BMC Res Notes 3 , 81 (2010). https://doi.org/10.1186/1756-0500-3-81


Received : 06 July 2009

Accepted : 19 March 2010

Published : 19 March 2010

DOI : https://doi.org/10.1186/1756-0500-3-81


Keywords

  • Limma Package
  • Soft Cluster
  • Normalise Microarray Data
  • Complex Statistical Model
  • Proper Statistical Analysis

  • Published: 26 February 2015

Points of Significance

Split plot design

  • Naomi Altman 1
  • Martin Krzywinski 2

Nature Methods volume 12, pages 165–166 (2015)


Subjects: Research data, Statistical methods

When some factors are harder to vary than others, a split plot design can be efficient.


We have already seen that varying two factors simultaneously provides an effective experimental design for exploring the main (average) effects and interactions of the factors 1 . However, in practice, some factors may be more difficult to vary than others at the level of experimental units. For example, drugs given orally are difficult to administer to individual tissues, but observations on different tissues may be done by biopsy or autopsy. When the factors can be nested, it is more efficient to apply a difficult-to-change factor to the units at the top of the hierarchy and then apply the easier-to-change factor to a nested unit. This is called a split plot design.

The term “split plot” derives from agriculture, where fields may be split into plots and subplots. It is instructive to review completely randomized design (CRD) and randomized complete block design (RCBD) 2 and show how these relate to split plot design. Suppose we are studying the effect of irrigation amount and fertilizer type on crop yield. We have access to eight fields, which can be treated independently and without proximity effects ( Fig. 1a ). If applying irrigation and fertilizer is equally easy, we can use a complete 2 × 2 factorial design and assign levels of both factors randomly to fields in a balanced way (each combination of factor levels is equally represented).

figure 1

(a) In CRD, levels of irrigation and fertilizer are assigned to plots of land (experimental units) in a random and balanced fashion. (b) In RCBD, similar experimental units are grouped (for example, by field) into blocks and treatments are distributed in a CRD fashion within the block. (c) If irrigation is more difficult to vary on a small scale and fields are large enough to be split, a split plot design becomes appropriate. Irrigation levels are assigned to whole plots by CRD and fertilizer is assigned to subplots using RCBD (irrigation is the block). (d) If the fields are large enough, they can be used as blocks for two levels of irrigation. Each field is composed of two whole plots, each composed of two subplots. Irrigation is assigned to whole plots using RCBD (blocked by field) and fertilizer assigned to subplots using RCBD (blocked by irrigation).

If our land is divided into two large fields that may differ in some way, we can use the field as a blocking factor ( Fig. 1b ). Within each block, we again perform a complete 2 × 2 factorial design: irrigation and fertilizer are assigned to each of the four smaller fields within the large field, leading to an RCBD with field as the block. Each combination of irrigation and fertilizer is balanced within the large field.

So far, we have not considered whether managing levels of irrigation and fertilizer require the same effort. If varying irrigation on a small scale is difficult, it makes more sense to irrigate larger areas of land than in Figure 1a and then vary the fertilizer accordingly to maintain a balanced design. If our land is divided into four fields (whole plots), each of which can be split into two subplots ( Fig. 1c ), we would assign irrigation to whole plots using CRD. Within a whole plot, fertilizer would be distributed across subplots using RCBD, randomly and balanced within whole plots with a given irrigation level. Irrigation is the whole plot factor and fertilizer is the subplot factor. It is important to note that all split plot experiments include at least one RCBD subexperiment, with the whole plot factor acting as a block.

Assigning levels of irrigation to fields at random neglects any heterogeneity among the fields. For example, if the land is divided into two large fields ( Fig. 1b ), it is best to consider each as a block. Within each block, we consider half of the field as a whole plot and irrigate using RCBD ( Fig. 1d ). As before, the fertilizer is assigned to subplots using RCBD. The designs in Figure 1c and Figure 1d vary only in how the whole plot factor levels are assigned: by CRD or RCBD.

Because split plot designs are based on RCBD, the two can be easily confused. For example, why is Figure 1b not considered a split plot design with field index being the whole plot factor? The answer involves whether we are interested in specific levels of the factor or are using it for blocking purposes. In Figure 1b , the field is a blocking factor because it is used to control the variability of the plots, not as a systematic effect. We use these two fields to generalize to all fields. In Figure 1c , irrigation is a whole plot factor and not a blocking factor because we are studying the specific levels of irrigation.

The terms “whole plot” and “subplot” translate naturally from agricultural to biological context, where split plot designs are common. Many factors, such as diet or housing conditions, are more easily applied to large groups of experimental subjects, making them suitable at the whole plot level. In other experiments, factors that are sampled hierarchically or from the same individual (tissue, cell or time points) can act as subplot factors. Figure 2 illustrates split plot designs in a biological context.

figure 2

(a) A two-factor, split plot animal experiment design. The whole plot is represented by a mouse assigned to drug, and tissues represent subplots. (b) Biological variability coming from nuisance factors, such as weight, can be addressed by blocking the whole plot factor, whose levels are now sampled using RCBD. (c) With three factors, the design is split-split plot. The housing unit is the whole plot experimental unit, each subject to a different temperature. Temperature is assigned to housing using CRD. Within each whole plot, the design shown in b is performed. Drug and tissue are subplot and sub-subplot units. Replication is done by increasing the number of housing units.

Suppose that we wish to determine the in vivo effect of a drug on gene expression in two tissues. We assign mice to one of two drug treatments using CRD. The mouse is the whole plot experimental unit and the drug is the whole plot factor. Both tissues are sampled from each mouse. The tissue is the subplot factor and each mouse acts as a block for the tissue subplot factor; this is the RCBD component ( Fig. 2a ). The mouse itself can be considered a random factor used to sample biological variability and increase the external validity of the experiment. If we suspect environmental variability, we can group the mice by their housing unit ( Fig. 2b ), just as we did whole plots by field ( Fig. 1d ). The housing unit is now a blocking factor for the drug, which is applied to mice using RCBD. Other ways to group mice might be by weight, familial relationship or genotype.

Sensitivity in detecting effects of the subplot factor as well as interactions is generally greater than for a corresponding completely randomized factorial design in which only one tissue is measured in each mouse. This is because tissue comparisons are made within mouse. However, because comparing the whole plot factor (drug) is done between subjects, the sensitivity for the whole plot factor is similar to that of a completely randomized design. Applying blocking at the whole plot level, such as housing ( Fig. 2b ), can improve sensitivity for the whole plot factor similarly to using an RCBD. Compared to a split plot design, the completely randomized design is both more expensive (twice as many mice are required) and less efficient (mouse variability will not cancel, and thus the tissue and interaction effects will include mouse-to-mouse variability).

The experimental unit at the whole plot level does not have to correspond to an individual. It can be one level above the individual in the hierarchy, such as a group or enclosure. For example, suppose we are interested in adding temperature as one of the factors to the study in Figure 2b . Since it is more practical to control the temperature of the housing unit than of individual mice, we use cage as the whole plot ( Fig. 2c ). Temperature is the whole plot factor and cage is the experimental unit at the whole plot level. As in Figure 2a , we use CRD to assign the whole plot factor (temperature) levels to whole plots (cages). Mice are now experimental units at the subplot level and the drug is now a subplot factor. Because we have three layers in the hierarchy of factors, tissue is at the sub-subplot level and the design is split-split plot. In Figure 2b , the cage is a block used to control variability because the effects of housing are not of specific interest to us. By contrast, in Figure 2c , specific levels of the temperature factor are of interest so it is part of the plot factor hierarchy.

Care must be taken to not mistake a split plot design for CRD. For example, an inadvertent split plot 3 can result if some factor levels are not changed between experiments. If the analysis treats all experiments as independent, then we can expect mistakes in conclusions about the significance of effects.

With two factors, more complicated designs are also possible. For example, we might expose the whole mouse to a drug (factor A ) in vivo and then expose two liver samples to different in vitro treatments (factor B ). In this case, the two liver samples from the same mouse form a block, which is nested in mouse 4 .

The split plot CRD design ( Fig. 2a ) is commonly used as the basis for a repeated measures design, which is a type of time course design. The most basic time course includes time as one of the factors in a two-factor design. In a completely randomized time course experiment, different mice are used at each of the measurement times t 1 , t 2 and t 3 after initial treatment ( Fig. 3a ). If the same mouse is used at each time and the mice are assigned at random to the levels of a (time-invariant) factor, the design becomes a repeated measures design ( Fig. 3b ) because the measurements are nested within mouse. The time of measurement is the subplot factor. The repeated measures version of the design that uses housing as a block ( Fig. 2b ) is shown in Figure 3c . As before, housing is the block and drug is the whole plot factor, but now time is the subplot factor. If we include tissue type, the design becomes a split-split plot, with tissue being subplot and time sub-subplot ( Fig. 3d ).
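As a hedged sketch (not the authors' code), the repeated measures design of Figure 3b can also be fit as a mixed model that makes the within-mouse correlation explicit, e.g. with nlme; all variable names and effect sizes below are invented.

```r
library(nlme)

# Hypothetical longitudinal data: 12 mice, drug assigned by CRD, 3 time points
set.seed(2)
d <- expand.grid(mouse = factor(1:12), time = 1:3)
d$drug <- factor(ifelse(as.integer(d$mouse) <= 6, "treated", "control"))
d$y <- 5 + 1.5 * (d$drug == "treated") + 0.5 * d$time +
  rnorm(12)[as.integer(d$mouse)] + rnorm(nrow(d), sd = 0.4)

# Random intercept per mouse (the whole plot unit) plus an AR(1) correlation
# across the repeated measurements within each mouse
fit <- lme(y ~ drug * factor(time), random = ~ 1 | mouse,
           correlation = corAR1(form = ~ time | mouse), data = d)
anova(fit)  # drug is tested between mice; time and drug:time within mice
```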

figure 3

( a ) Basic time course design, in which time is one of the factors. Each measurement uses a different mouse. ( b ) In a repeated measures design, mice are followed longitudinally. Drug is assigned to mice using CRD. Time is the subplot factor. ( c ) Drug is blocked by housing. ( d ) A three-factor, repeated measures split-split plot design, now including tissue. Tissue is subplot and time is sub-subplot.

Split plot designs are analyzed using ANOVA. Because comparisons at the whole plot level have different variability than those at the subplot level, the ANOVA table contains two sources of error, MS wp and MS sp , the mean square associated with whole plots and subplots, respectively ( Table 1 ). This difference occurs because the subplot factor is always compared within a block, while the whole plot factor is compared between the whole plots. For example, in Figure 2a , variation between mice cancels out when comparing tissues but not when comparing drugs. Analogously to a two-factor ANOVA 1 , we calculate the sums of squares and mean squares in a split plot ANOVA. For example, in a split plot with RCBD, given n blocks of blocking factor bl ( Table 1 ) at the whole plot level and a and b levels of whole plot factor A and subplot factor B , MS bl = SS bl /( n − 1), where SS bl is the sum of squared deviations of the average across each block relative to the grand mean times the number of measurements contributing to each average ( a × b ). Similarly, SS A uses the average across levels of A and the multiple is n × b . The analysis at the whole plot level is essentially the same as in a one-way ANOVA with blocking: the subplot values are considered subsamples. The associated MS sp is usually lower than in a factorial design, which improves the sensitivity in detecting A × B interactions.
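To mirror the mean-square bookkeeping just described, here is a small R sketch that computes SS bl and SS A exactly as in the text, from a hypothetical balanced layout; it is illustrative only.

```r
# Hypothetical balanced split plot: n blocks, a whole plot and b subplot levels
set.seed(3)
n <- 4; a <- 2; b <- 3
d <- expand.grid(block = factor(1:n), A = factor(1:a), B = factor(1:b))
d$y <- rnorm(nrow(d), mean = 10)

grand_mean  <- mean(d$y)
block_means <- tapply(d$y, d$block, mean)
A_means     <- tapply(d$y, d$A, mean)

# SS_bl: squared deviations of the block averages from the grand mean,
# times the a*b measurements contributing to each block average
SS_bl <- a * b * sum((block_means - grand_mean)^2)
MS_bl <- SS_bl / (n - 1)

# SS_A: same idea with averages across levels of A and multiple n*b
SS_A <- n * b * sum((A_means - grand_mean)^2)
MS_A <- SS_A / (a - 1)
```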

Split plot designs are helpful when it is difficult to vary all factors simultaneously, and, if factors that require more time or resources can be identified, split plot designs can offer cost savings. This type of design is also useful for cases when the investigator wishes to expand the scope of the experiment: a factor can be added at the whole plot level without sacrificing sensitivity in the subplot factor.

References

1. Krzywinski, M., Altman, N. & Blainey, P. Nat. Methods 11, 1187–1188 (2014).
2. Krzywinski, M. & Altman, N. Nat. Methods 11, 699–700 (2014).
3. Ganju, J. & Lucas, J.M. J. Stat. Plan. Infer. 81, 129–140 (1999).
4. Krzywinski, M., Altman, N. & Blainey, P. Nat. Methods 11, 977–978 (2014).


Altman, N. & Krzywinski, M. Split plot design. Nat. Methods 12, 165–166 (2015). https://doi.org/10.1038/nmeth.3293


Integrative analysis of time course metabolic data and biomarker discovery

Takoua Jendoubi & Timothy M. D. Ebbels

BMC Bioinformatics 21, 11 (2020)


Metabolomics time-course experiments provide the opportunity to understand the changes to an organism by observing the evolution of metabolic profiles in response to internal or external stimuli. Along with other omic longitudinal profiling technologies, these techniques have great potential to uncover complex relations between variations across diverse omic variables and provide unique insights into the underlying biology of the system. However, many statistical methods currently used to analyse short time-series omic data are i) prone to overfitting, ii) do not fully take into account the experimental design, iii) do not make full use of the multivariate information intrinsic to the data, or iv) are unable to uncover multiple associations between different omic data. The model we propose is an attempt to i) overcome overfitting by using a weakly informative Bayesian model, ii) capture experimental design conditions through a mixed-effects model, iii) model interdependencies between variables by augmenting the mixed-effects model with a conditional auto-regressive (CAR) component, and iv) identify potential associations between heterogeneous omic variables by using a horseshoe prior.

We assess the performance of our model on synthetic and real datasets and show that it can outperform comparable models for metabolomic longitudinal data analysis. In addition, our proposed method provides the analyst with new insights on the data as it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We also show through simulation that our model is fairly robust against inaccuracies in metabolite assignments. On real data, we demonstrate that the number of profiled metabolites slightly affects the predictive ability of the model.

Conclusions

Our single-model approach to longitudinal analysis of metabolomics data simultaneously supports integrative analysis and biomarker discovery. In addition, it aids interpretation by allowing analysis at the pathway level. An accompanying R package for the model has been developed using the probabilistic programming language Stan . The package offers user-friendly functions for simulating data, fitting the model, assessing model fit and postprocessing the results. The main aim of the R package is to offer metabolomics scientists freely accessible resources for integrative longitudinal analysis, including easy-to-use visualization functions that help applied researchers interpret results.

Over the past few years, there has been significant development in high-throughput omics technologies, e.g. metabolomics, transcriptomics, genomics, epigenomics and proteomics, along with a growing interest in joint modeling of multi-omic data [ 1 , 2 ]. In metabolomics, several approaches are used to understand the response of a biological system as a function of an internal or external perturbation by monitoring "the chemical fingerprints that specific cellular processes leave behind" [ 3 ]. These chemical fingerprints are most commonly interrogated in terms of metabolite (i.e. low-molecular-weight molecule) concentrations, structures and transformation pathways (i.e. sets of chemical reactions) in order to identify biomarkers related to the studied process. Biomarker discovery consists of identifying a metabolite that has significant association patterns with a particular phenotype (disease, clinical variables, physical trait, etc.) and that can thus be used as an indicator of that specific phenotype. Typical experimental platforms use analytical techniques such as nuclear magnetic resonance spectroscopy (NMR) [ 4 ] and mass spectrometry (MS) [ 5 ] to generate appropriate spectral metabolomic profiles of the studied biological system.

Metabolomic datasets are characterized by high correlation structures at different levels. For example, chromatographic correlation between two spectral peaks often results from adduct or isotopic effects, whereas other correlation structures can result from peaks that represent related molecules operating within networks of chemical reactions across multiple injections. In addition, further correlation structure is present in longitudinal metabolomic studies due to repeated measurements of observations over time. Additional challenges include not only the low number of time points and samples compared to the number of metabolic variables, but also the integration of different omic data with the metabolomic data.

Metabolomic time series are often short due to experimental costs or ethical considerations. Typically, fewer than 10 time points are available, compared to a large number of metabolic variables profiled at each time point, e.g. hundreds of metabolic variables for targeted experiments and thousands for untargeted experiments. For this reason, the number of temporal patterns that can arise is limited (due to the limited number of degrees of freedom). Some temporal patterns may be repeated, and thus these patterns can be induced by random processes. Models fitted to a small number of data points are prone to overfitting, i.e. the model is very sensitive to small fluctuations. This can lead to a poor fit to new data and thus a high generalisation error. It is also important to consider the number of parameters of the statistical model and to use the simplest models in order to avoid overparametrisation. Finally, monitoring variables within and between multiple omic types can substantially enhance the understanding of the underlying biological mechanism and provide a systems biology perspective, as these omic variables represent entities that are often involved in related cellular processes [ 1 , 2 ]. For all these reasons, metabolomics scientists need robust models which allow cautious interpretation of the data. Models are needed which integrate heterogeneous omic data and take into account both experimental conditions and biological variation.

There is a growing interest in longitudinal experiments for heterogeneous omics data and in statistical models to infer biomarkers of a particular treatment or disease over time. Some approaches aim to infer influential or significant metabolites using dynamic metabolomic data under the assumption that metabolites are independent. These models include fitting smoothing-spline mixed effects models (SME) to time curves [ 6 ] and linear mixed effects models augmented with a variable selection step [ 7 ]. However, metabolomic data exhibit a rich correlation structure, which is biologically relevant and should be modelled.

Seemingly unrelated regression accounts for metabolite correlation by using correlated regression errors and can be used to identify biologically significant metabolites [ 8 , 9 ]. In gene expression data analysis, [ 10 ] recently proposed to use confirmatory factor analysis to capture gene-pathway relationships and a conditional autoregressive model to capture relationships between a set of pathways where a network has been constructed based on KEGG [ 11 ] pathways. The latter accounts for biological variation in the data and aids interpretation.

In the metabolomics literature, traditional frameworks for metabolomic data analysis use dimensionality reduction techniques, namely principal component analysis (PCA), partial least squares (PLS) [ 12 ] and PLS-derived models (OPLS [ 13 ], O2PLS [ 14 ], OnPLS [ 15 ]), to take into account high correlations between metabolites. Extensions of PLS to O2PLS and OnPLS allow for integrative analysis of heterogeneous omic data. One attraction of PCA- and PLS-derived models is the ability to visually assess whether or not there is a time effect in the data and to identify metabolites that change over time by inspecting the time trajectory of each metabolite [ 16 ].

Extensions to PCA and PLS for longitudinal analysis include lagged and dynamic PCA (PLS), where a backshift matrix is introduced to take time dependency into account [ 17 , 18 ]. Similarly, [ 19 ] used a set of piecewise orthogonal projections to latent structures to describe changes between neighbouring time points. PARAFAC [ 20 ] is a multi-linear unsupervised decomposition method that can account for the multi-way variation seen in dynamic metabolomics data. Similar models such as ASCA [ 21 ] and APCA [ 22 ] also seek to account for temporal variation by combining analysis of variance (ANOVA) with PCA. Recently, dynamic probabilistic PCA (DPPCA) was proposed in [ 23 ] as a generative probabilistic model of longitudinal metabolomic data in which a stochastic volatility model is used for the latent variables. The main drawback of these approaches is that further techniques, such as variable importance scores, have to be applied separately in order to identify biomarkers, and they do not take heterogeneous data (i.e. data from different omic techniques) into account.

In this paper, we are interested in dose-response time course experiments where additional omic variables (bacteria, genes, transcripts, etc.) are monitored along with metabolites in the context of biomarker discovery. The main contribution of our work is a single probabilistic generative model that i) can overcome overfitting via the use of weakly informative priors, ii) makes use of mixed effects models to capture the experimental design, iii) models metabolite interactions using pathway information through a conditional auto-regressive (CAR) component, and iv) uncovers multiple associations between metabolites and other omic variables by using a horseshoe prior. An additional benefit of our approach is that it naturally yields a list of perturbed metabolic pathways.

We denote a metabolomics data set by \(\boldsymbol{X} \in \mathbb{R}^{N \times T \times M}\), where N is the number of individuals, T the number of time points and M the number of metabolite variables (henceforth termed metabolites for simplicity). \(\boldsymbol{Y} \in \mathbb{R}^{N \times T \times K}\) is an additional continuous omic data set measured along with X , where K is the number of associated omic variables. The set of N individuals consists of a set of cases and controls. Throughout the paper, index i always runs through individuals, index t runs through time points, index m runs through metabolites and index k runs through Y variables. Vector quantities are written in bold. Matrices are written in bold capitals. Our goal is to build a simple model that can identify biomarkers of a specific treatment over time, taking into account the multiple sources of variation in the data.
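In R, the language of the accompanying package, this layout might be represented with three-dimensional arrays; a toy sketch with invented dimensions:

```r
# Toy dimensions only: N individuals, T time points, M metabolites, K omic variables
N <- 22; Tp <- 7; M <- 40; K <- 3       # Tp avoids masking R's shorthand T for TRUE
X <- array(rnorm(N * Tp * M), dim = c(N, Tp, M))  # metabolomics data
Y <- array(rnorm(N * Tp * K), dim = c(N, Tp, K))  # additional omic data
X[3, 5, ]  # all metabolite measurements for individual i = 3 at time t = 5
```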

The model is built on three levels (see “ Methods ” section for details on full model): First, a CAR component to capture interaction between metabolites. Second, a variable selection model to uncover associations between metabolites and Y data. Third, a mixed effects component to model the experimental design. This yields the following hierarchical model:

where the mean metabolite level μ it is a function of covariates of sample i at time point t ; the influence of metabolite j on metabolite m is captured through the coefficients c mj , elements of the matrix \(\boldsymbol{C}(\boldsymbol{\phi}) = \sum_{p=1}^{P}\phi_{p} \boldsymbol{G}_{p}\boldsymbol{A}_{p}\), where P is the number of pathways (see " Methods " section for details). α m represents the treatment effect for metabolite m , γ im represents individual perturbations for metabolite m , ν itm represents temporal effects for metabolite m of individual i at time point t , and β m quantifies interactions between metabolite m and other omic variables. The indicator e stands for {controls, cases} and ϕ quantifies pathway perturbation. In particular, by specifying different dependence parameters for metabolite interactions in cases and controls, the model is able to identify perturbed pathways by comparing ϕ cases and ϕ controls .

We refer to our model as “ iCARH ” model for “ integrative CAR Horseshoe ” model. The model we propose is implemented in the iCARH package and freely available from the Comprehensive R Archive Network ( CRAN ).

In the following sections, we perform experiments on both synthetic and real data to investigate whether our algorithm gives reasonable solutions. We first try our method on a simulated dataset in the " Simulation study " section to get an understanding of its performance. In the " Case study " section, we test our method on a dataset comprising metabolomic and bacterial composition profiles from a drug treatment experiment. A fully worked reproducible example using the iCARH package on a publicly available dataset is available in Additional File 1. All experiments were run on a computer with an Intel i7 processor running at 2.8 GHz using 16 Gb of RAM.

Simulation study

To get a better understanding of our method and test its applicability, we first apply our approach to synthetic datasets. We mainly focus on the ability of our model to infer pathway perturbation.

In this simulation, we first extract from the KEGG database the proportion of pathways in which a single metabolite is involved. We then use these proportions to randomly generate a binary matrix Z with dimensions M × P indicating random assignments of metabolites to pathways where M is the number of metabolites, P is the number of pathways. Each design matrix A p is then equal to z p z p T where z p is the p th column of Z . We simulate pathway perturbation according to an indicator variable ω as follows: If ω =1 then pathways are perturbed hence \(\phi _{p}^{\text {controls}}\) and \(\phi _{p}^{\text {cases}}\) are simulated from normal distributions with different means. If ω =0 then \(\phi _{p}^{\text {controls}}=\phi _{p}^{\text {cases}}\) and pathways are not perturbed.
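A minimal R sketch of this simulation step, with invented stand-ins for the KEGG-derived membership proportions; the zeroed diagonal follows the adjacency convention given in the " Methods " section.

```r
set.seed(4)
M <- 40; P <- 11
prop <- runif(P, 0.05, 0.3)  # invented stand-in for KEGG membership proportions

# Binary M x P matrix of random metabolite-to-pathway assignments
Z <- sapply(prop, function(p) rbinom(M, 1, p))

# Design matrices A_p = z_p z_p^T, with the diagonal zeroed to match the
# zero-diagonal adjacency matrices used in the model
A <- lapply(seq_len(P), function(p) {
  Ap <- tcrossprod(Z[, p])
  diag(Ap) <- 0
  Ap
})
```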

The rest of the parameters are set as follows: number of bacterial variables K =1, number of metabolites M =40, number of time points T =7, number of samples N =22, number of pathways P =11, global parameter τ fixed to 1.2 (to induce a medium degree of sparsity), and parameters ν itm , γ im , μ itm and multi-omics time course data \(\boldsymbol{x}_{it}^{e}\) simulated according to Eqs. 1 , 2 , 11 and 12 , respectively (see " Methods " section). Finally, we generated 10 datasets according to the simulations above in order to assess how our model infers perturbed pathways.

We set non-informative uniform priors on \(\alpha _{m}, \sigma _{\gamma _{im}}, \theta _{m}, \sigma _{\mu _{m}}^{2}\) . We set an informative prior on \(\sigma ^{2}_{\gamma _{m}} \sim \text {inverse-gamma} \left (1, 0.1 \right)\) as we expect low variability amongst biological samples of the same group.

Assessing uniform and beta-like priors

We first compare inference of the model under a uniform prior for \(\phi ^{e}_{p}\) and the prior in Eq. 5 (See Fig.  1 ). Inference is done using 2000 iterations of Hamiltonian Monte Carlo sampling and 1000 warm-up iterations.

figure 1

Left : Boxplots of Area Under the Curve (AUC) for pathway perturbation inference with a uniform prior on ϕ compared to a beta-like prior on ϕ for 10 simulated datasets. We infer perturbations based on the posterior distribution of ϕ controls − ϕ cases . Right : Distribution of ϕ controls − ϕ cases under the uniform prior (white) and the beta-like prior (shaded). The true perturbed pathways are printed in bold on the x axis

We show here, through simulation, that an informative beta-like prior performs better than a non-informative uniform prior in inferring significant pathways. The left plot in Fig.  1 shows boxplots of the 95% confidence interval of the Area Under the Curve (AUC) for pathway perturbation inference across 10 simulated datasets, with a uniform prior on ϕ and a beta-like prior on ϕ . We infer perturbations based on the posterior probability that ϕ controls and ϕ cases are different, i.e. the 95% credible interval of ϕ controls − ϕ cases does not contain zero. The AUC values for the beta-like distribution are significantly higher than the AUC values for the uniform distribution. On average, pathway perturbation inference under the uniform distribution reduces to a random guess, with an average AUC of 0.53. This is likely due to the lack of variance of the uniform distribution. The right plot in Fig.  1 shows the posterior distributions of ϕ controls − ϕ cases under the uniform prior (white) and the beta-like prior (shaded) for each pathway. The true perturbed pathways are printed in bold on the x axis. In the following section, we assess pathway inference against design inaccuracies.
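As a hedged sketch of how such an AUC could be computed (not the authors' exact code), assume a matrix phi_diff of posterior draws of ϕ controls − ϕ cases (iterations × pathways) and a logical vector truth flagging the truly perturbed pathways; both are invented here.

```r
library(pROC)

set.seed(5)
P <- 11
truth <- rep(c(TRUE, FALSE), length.out = P)             # invented ground truth
phi_diff <- sapply(seq_len(P), function(p)               # invented posterior draws
  rnorm(2000, mean = ifelse(truth[p], 0.15, 0), sd = 0.1))

# Decision rule from the text: perturbed if the 95% credible interval excludes 0
ci <- apply(phi_diff, 2, quantile, probs = c(0.025, 0.975))
perturbed_call <- ci[1, ] > 0 | ci[2, ] < 0

# A continuous score (posterior mean over sd) gives the full ROC curve and AUC
score <- abs(colMeans(phi_diff)) / apply(phi_diff, 2, sd)
auc(roc(response = as.numeric(truth), predictor = score))
```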

Assessing pathway inference against design inaccuracies

It is very common in metabolomics data to find metabolites that are correlated but not in the same KEGG pathway. In the following simulation we assess how inaccuracies in the covariance structure between metabolites and the design matrices A p affect the iCARH model. We used the 10 datasets from the previous simulation and perturbed the design matrices by selecting a random fraction of metabolites in each pathway. We then randomly (falsely) assigned these metabolites to no pathway, or to different pathways. We similarly ran the model for 2000 iterations of Hamiltonian Monte Carlo sampling and 1000 warm-up iterations for each of the fractions {0, 0.18, 0.35, 0.44, 0.5, 0.62} of perturbed metabolites. Finally, in the same fashion, we assessed perturbations based on the 95% credible interval of ϕ controls − ϕ cases . Figure  2 shows a series of average Receiver Operating Characteristic (ROC) curves across the 10 datasets for each of these fractions. On average, the performance of our model reduces to a random guess (AUC of 0.5) if 50% of the metabolites in each pathway are perturbed. The AUC of our model reaches 0.97 if no metabolites are perturbed and is about 0.88 if 18% of the metabolites in each pathway are perturbed.

figure 2

Average Receiver Operating Characteristic (ROC) curves for pathway perturbation inference across 10 datasets for different fractions of "falsely" assigned metabolites. Boxplots show the 95% confidence intervals of the true positive rate

In this section, we test our model on actual metabolomic data and 16S bacterial profile data. In this study we are interested in the influence of the diabetes drug metformin on a non-diabetic model. Metformin is the first-line medicine to treat type 2 diabetes. It has also been suggested that metformin has anti-cancer [ 24 ], cardiovascular [ 25 ] and anti-aging [ 26 ] effects. Because of their very large metabolic capacity, the gut bacteria can influence the toxicity and metabolism of drugs. Here, we are particularly looking for metabolic biomarkers indicative of microbiota changes as a result of treatment.

The study design is as follows: metabolic profiles of 24 animals were acquired at 9 equally spaced timepoints using different mass spectrometry techniques on plasma samples. Bacterial profiles were acquired using Illumina MiSeq [ 27 ]. The study allowed for two groups of 12 animals, where the drug was administered to the second group (timepoints 3 to 7), allowing for an acclimatization period (timepoints 1 and 2) and a recovery period (timepoints 8 and 9). As metabolites are mapped to pathways, prior filtering and/or metabolite annotation needs to be performed beforehand. After data processing and metabolite identification, a total of 56 metabolites and 6 bacteria species were further analyzed using our model. Preliminary investigation of the data shows observable associations between correlations and inter-pathway and intra-pathway metabolite membership ( Fig.  3 ), which motivates fitting the iCARH model to the data. Inference is done using 2000 iterations of Hamiltonian Monte Carlo sampling and 1000 warm-up iterations.

figure 3

Figure shows boxplots of correlations between metabolites that are part of the same pathway (intra-pathway correlations) compared to correlations between metabolites that belong to different pathways (inter-pathway correlations) in the metformin study (see " Case study " section). Although correlations are not very close to 1, there is an empirically observed correlation signature related to whether metabolites are in the same pathway or not

We assess the performance of our model for different values of τ using the Watanabe-Akaike information criterion (WAIC). Tested values of τ comprise 1, 1.2, 5 and 10, with corresponding WAIC values of 7317.296, 7322.798, 7317.457 and 7316.476, respectively. The WAIC values are very similar for the different values of τ , which suggests using the most selective model, τ =10, as it is the simplest, i.e. the one with the smallest number of selected variables.
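For readers who want to reproduce this kind of comparison, WAIC can be computed from a Stan fit with the loo package; the sketch below assumes stanfit objects fit_tau1 and fit_tau10 exist and that the model stores the pointwise log-likelihood in a generated quantity named log_lik (both are assumptions, not guaranteed by the iCARH package).

```r
library(loo)

# Pointwise log-likelihood matrices extracted from two hypothetical Stan fits
ll1  <- extract_log_lik(fit_tau1,  parameter_name = "log_lik")
ll10 <- extract_log_lik(fit_tau10, parameter_name = "log_lik")

# Similar WAIC values would favour the sparser tau = 10 model
loo_compare(waic(ll1), waic(ll10))
```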

Assessing model fit

In order to assess our model fit, we perform posterior predictive checks of our model compared to DPPCA [ 23 ]. The DPPCA model is a multivariate model using PCA, where the PCA scores are modelled via a stochastic volatility model. In the Bayesian framework, posterior predictive checks consist of comparing data simulated from the posterior predictive distribution with the observed data. The mean absolute deviations (MADs) are computed between the observed covariance matrix and the covariance matrix of the simulated data. The experiment was repeated for different numbers of metabolites selected randomly from the whole dataset. As metabolites are also predictors under the CAR model, the performance is expected to improve when the number of metabolites increases. The process was also replicated for inference using the DPPCA model [ 23 ]. Figure  4 shows the MADs of our model and the DPPCA model. Although the MADs for the DPPCA model decrease as the number of metabolites increases, they remain higher than the MADs for the iCARH model. Overall, our model clearly outperforms the DPPCA model.

figure 4

Posterior predictive checks for mean absolute deviation (MAD) compared to DPPCA for different numbers of metabolites included. Vertical bars show the 95% confidence intervals of the MADs. The MADs decrease as the number of metabolites increases. Our model performs clearly better than the DPPCA model

In addition to posterior predictive checks, normality checks are another way to assess whether the observed results are not mainly a product of misspecified priors. Specifically, goodness of fit was checked by using \(\Psi^{-1}_{e} \left(\boldsymbol{x}_{it} - \boldsymbol{\mu}_{it} \right) \sim N\left(0, \boldsymbol{I}_{M}\right)\), where Ψ e denotes the Cholesky factor of \(\left(\boldsymbol{I}_{M} - \boldsymbol{C}(\boldsymbol{\phi}_{e})\right)^{-1} \sigma^{2}\). Zero mean and normality were thus checked for \(\Psi^{-1}_{e} \left(\boldsymbol{x}_{it} - \boldsymbol{\mu}_{it} \right)\) (see Fig.  5 ).
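A minimal R sketch of this check under invented stand-ins for the posterior estimates of one experimental group; only the whitening-and-QQ-plot mechanics are shown.

```r
set.seed(6)
M <- 5
# Invented stand-ins for the estimated C(phi_e) and sigma^2 of one group
C_hat <- matrix(0, M, M); C_hat[cbind(1:4, 2:5)] <- 0.1
C_hat <- C_hat + t(C_hat)
sigma2_hat <- 1

Sigma <- solve(diag(M) - C_hat) * sigma2_hat  # (I_M - C(phi_e))^{-1} sigma^2
Psi   <- t(chol(Sigma))                       # lower-triangular Cholesky factor

x  <- MASS::mvrnorm(100, mu = rep(0, M), Sigma = Sigma)  # stand-in observations
mu <- matrix(0, 100, M)                                  # stand-in fitted means

# Whitened residuals Psi^{-1}(x - mu) should look like iid N(0, 1) draws
resid_w <- t(apply(x - mu, 1, function(r) solve(Psi, r)))
qqnorm(as.vector(resid_w)); qqline(as.vector(resid_w))
```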

figure 5

Right and left panels show model fit assessment for controls and cases for metformin data. Left : quantile-quantile normal plot of \(\Psi ^{-1}_{\text {cases}} \left (\boldsymbol {x}_{it} - \boldsymbol {\mu }_{it} \right)\) . Right : quantile-quantile normal plot of \(\Psi ^{-1}_{\text {controls}} \left (\boldsymbol {x}_{it} - \boldsymbol {\mu }_{it} \right)\)

Data results

In the standard model, α m represents an indicator variable for the treatment effect. The treatment variable can also be continuous, as in this data example (drug measurements), and modeled by \(\boldsymbol{\alpha}_{m} = \beta^{\alpha}_{m} \boldsymbol{y}_{\text{drug}}\). The treatment effect can then be summarized simply by \(\beta^{\alpha}_{m}\). Figure  6 shows boxplots of 95% credible intervals of posterior means of \(\beta^{\alpha}_{m}\) for metabolites 13 to 31. We are mainly interested in "metabolite 27" as it is associated with bacteria species 2.

figure 6

Estimates of effects of treatment on metabolite profiles are captured by \(\beta ^{\alpha }_{m}\) for the metformin data. The figure depicts boxplots of 95% credible intervals of posterior means of \(\beta ^{\alpha }_{m}\) . Only part of the data is plotted as we are mainly interested in “metabolite 27”

Figures  7 and 8 show the posterior distributions of ϕ e for each pathway and the estimates of the effects of bacteria on metabolites. Results in the " Simulation study " section suggest comparing the covariance structure of metabolites in the observed data with the covariance induced by the design matrices in order to get an a priori idea of the robustness of pathway inference (see Fig.  2 ). For a correlation threshold of 0.3, about 25% of the metabolites are misspecified in the design matrices, which corresponds to an AUC of around 0.8 according to Fig.  2 . If we set a higher correlation threshold, a lower number of metabolites are misspecified. For example, for a correlation threshold of 0.5, only 8% of the metabolites are misspecified. This supports the use of the iCARH model for pathway perturbation inference on this data.

figure 7

Posterior distributions of ϕ e for each pathway and each treatment group for the metformin dataset. Posterior distributions of ϕ e for some pathways are flatter for the controls than for the cases, which might be indicative of mild pathway alterations. These pathways correspond as well to the top 11 pathways inferred by MetaboAnalyst [ 30 ]. The p -values are the p -values returned by MetaboAnalyst for each pathway

figure 8

Metformin data: estimates of the effects of bacteria on metabolite profiles are captured by β m . The figure depicts boxplots of 95% credible intervals of posterior means of β m . Different metabolites present significant changes along with some bacterial profiles. For example, "metabolite 27" is positively associated with bacteria species 2 but negatively associated with bacteria species 5, and can be considered an indicator of changes in the abundance of these bacteria species

Estimates of the effects of bacteria on metabolite profiles are captured by β m . Some metabolites present significant changes along with the bacterial profiles. For example, "metabolite 27", a hydroxy fatty acid, is associated with alterations in the abundance of 4 bacteria species. Figure  7 shows that, as a result of treatment, KEGG pathways are not significantly altered. However, the distributions of ϕ controls for the "fatty acids biosynthesis" and "biosynthesis of unsaturated fatty acids" KEGG pathways are remarkably flatter than the distributions of ϕ cases . These pathways involve the previously identified hydroxy fatty acid metabolite. Our analysis confirms previously reported studies suggesting that hydroxy fatty acids might be produced by the gut microbiome [ 28 , 29 ]. On the other hand, results from MetaboAnalyst [ 30 ] give p -values between 6.3×10 −2 and 1.8×10 −10 , indicating that all pathways respond significantly to treatment except one, glycerophospholipid metabolism. We think that this discrepancy in the results is due to the way iCARH and MetaboAnalyst model pathway perturbation: whilst MetaboAnalyst considers that mean-level changes of one or more metabolite concentrations involved in a pathway indicate perturbation of that pathway, iCARH considers that changes in the covariance structure of metabolites in the same pathway are indicative of pathway perturbation.

Identifying biomarkers in time course metabolic data and inferring significant associations with heterogeneous omic variables is extremely challenging due to the several sources of variation in the data. In addition, existing methods developed to analyse such data are very scarce and have the limitations of i) overfitting to the few available data points, ii) confounding the experimental and longitudinal variation, iii) ignoring the metabolite interactions or iv) ignoring the effects of other omic variables. In this paper, the model we have developed combines several approaches to take into account the different aspects of the data, namely the small number of time points, the experimental variation captured by μ it , the interactions between metabolites captured by ϕ and the interactions with additional omic variables captured by β m .

Our results demonstrate that our model successfully addresses the main questions of a metabolomic study. Most importantly, our model is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and find significant associations with additional omic variables. We have shown that providing an informative prior on metabolic pathways and an informative prior over the parameter ϕ is a significant improvement over the DPPCA model. In particular, our model is more robust to the slight variations usually observed in short time series data, thanks to the small number of covariance parameters that need to be estimated compared to DPPCA. We have also shown through simulation that an informative beta-like prior performs better than a non-informative uniform prior in inferring significant pathways. On different real data, we have investigated how the number of profiled metabolites can affect the predictive ability of the model and carried out a fully reproducible application of iCARH .

Several potential extensions arise naturally from our model. In terms of the metabolite interactions component, many research questions can arise. Alternative strategies for modeling metabolite interactions can be examined, such as modeling the non-zero elements of the adjacency matrix C of each pathway as random variables. This strategy was adopted in the CAR literature by [ 31 , 32 ] to take into account step changes in spatial variation. Step changes can potentially be useful to model changes in metabolite correlations as a result of treatment. Lee [ 33 ] provides an overview of different CAR models used in spatial modeling. The proposed models can be adapted to fit into the metabolomics literature.

Another potential extension concerns the source for pathway annotation and modeling. In this paper, the KEGG pathways are used, but this poses certain limitations regarding the performance of the model, due to some shortcomings in the database (missing compounds, inaccuracies in the database, etc.). In this sense, extending the tool to support software formats and applications that enable assembly of superpathways (SBML [ 34 ], OWL [ 35 ], KEGGConverter [ 36 ] or KENEV [ 37 ]) would probably increase its performance in terms of accuracy and computational efficiency, as the adjacency matrix would be known a priori and its sparsity would be stronger.

From a practical point of view, the model has been fitted using HMC sampling but takes a large amount of time (about 1 h), mostly because of the variable selection procedure and metabolite interdependence. This could be addressed by using variational Bayes. In fact, variational Bayes inference procedures offer cost-effective inference by means of principled approximations and appealing computational time for high dimensional data. A variational Bayes inference of CAR models was proposed by [ 38 ] for high dimensional data, and a variational Bayes approach for variable selection was recently proposed by [ 39 ].

Metabolomics longitudinal profiling techniques are essential for understanding the effect of a drug or a disease across time and can provide an enhanced understanding of the underlying biology of the system. In a data integration framework, we have illustrated the use of the CAR model to incorporate metabolite interactions in the model, and of the horseshoe prior to identify associations with heterogeneous omic variables obtained by other omic techniques. The combination of the CAR and horseshoe levels yields the "integrative CAR Horseshoe" (iCARH) model, which we presented in this article. Our model is accompanied by an R package with various easy-to-use visualization functions for applied researchers.

The iCARH model has various appealing features: it is able to identify metabolic biomarkers related to treatment, infer perturbed pathways as a result of treatment and identify potential associations between heterogeneous omic variables. Clearly, these appealing features open up further research topics.

In this section we describe the theoretical details behind the three levels of our iCARH model: metabolite dependencies, integrative analysis with other omics data, and experimental design.

Metabolite dependencies

In any integrative biological model, it is useful to be able to interpret the model at a systems level, e.g. according to functional groups of biological molecules, rather than attempting to interpret results for individual molecules. Metabolic pathways are the most widely used groupings for this type of analysis in metabolomics, and have been widely used to interpret experimental data, usually by performing over-representation or enrichment analyses [ 40 – 44 ]. Since pathways are regulated in a coordinated fashion, it is natural to assume that the levels of metabolites which are members of the same pathway may be correlated. This dependence, though weak, is observable in associations between correlations and network distance [ 45 – 47 ], and is also observable in the real data used in our study (see Fig.  3 ). We therefore incorporate a pathway-based correlation component into our model via a CAR approach, in a similar fashion to [ 10 ]. The extent of pathway-specific correlations will vary according to the experimental system and assay, and may in some cases provide little extra information. Nonetheless, including such a pathway-based component can greatly increase the interpretability of the resulting model beyond one which does not include such grouping information. In this context, some metabolite peaks in the data need to be identified in order to be mapped to pathways. As will become clear in the CAR model below, if metabolites are not identified and hence cannot be mapped to pathways, no pathway-induced correlation is assumed.

We assume that the concentration of each metabolite is linearly influenced by the concentration levels of metabolites in the same pathway. Linear dependencies have been investigated in genomics in order to uncover functional modules [ 45 ]. Modeling linear associations is appealing as it captures the overall trend and is also less prone to overfitting small amounts of data. Let \(\boldsymbol{C} \in \mathbb{R}^{M \times M}\) be the design matrix quantifying metabolite interactions, such that the diagonal elements c mm =0 and the off-diagonal elements c mj are non-zero if metabolites m and j are in the same pathway and 0 otherwise. Thus, metabolite levels can be expressed as:

where x it ,− m represents the measurements of metabolites of sample i at time point t excluding metabolite m , and μ itm is a function of covariates of sample i for metabolite m at time point t taking into account additional variation in the data (see " Integrative analysis " and " Experimental design " sections). If we define I M as the identity matrix of order M , the joint distribution of x it can be explicitly written as [ 48 ]:

An important output of our modeling procedure is the identification of which pathways are "on" or "off" as an effect of treatment. In the CAR literature, the design matrix C can be modeled as a scaled product of a diagonal weight matrix and an adjacency matrix. In order to infer which pathways are perturbed, we construct the distance matrix based on the individual contribution of each pathway. To be precise, we define \(\boldsymbol{C}(\boldsymbol{\phi}) = \sum_{p=1}^{P}\phi_{p} \boldsymbol{G}_{p}\boldsymbol{A}_{p}\), where P is the number of pathways. The distance matrices A p are zero-diagonal symmetric adjacency matrices with elements \(a_{mj}^{p}\) equal to the inverse of the length of the shortest path between metabolites m and j if they are in pathway p and 0 otherwise. A path between two metabolites consists of the reactions that lead from one metabolite to the other, and the shortest path is the path that contains the smallest number of reactions. The diagonal matrices G p comprise the reciprocal of the number of neighbors of each metabolite in pathway p , i.e. \(\left(g_{mm}^{p}\right)^{-1}= \sum_{j=1}^{M} (a_{mj}>0)\), so that the squared partial correlation between two metabolites, \(\text{cor}\left(x_{itm}, x_{itj} \vert \boldsymbol{x}_{it,-(m,j)} \right)^{2} \propto \phi_{p}^{2} g_{mm}^{p} g_{jj}^{p}\), is reduced when more metabolites from the same pathway are profiled [ 48 ]. The vector of coefficients \(\boldsymbol{\phi} = \lbrace \phi_{p} \rbrace_{p=1}^{P}\) is estimated from the data. It is referred to as the spatial-dependence parameter in the CAR literature. In the context of this work, \(\boldsymbol{\phi}\) quantifies pathway contribution; for example, ϕ 1 =0 indicates no contribution from the first pathway.
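A sketch of how one term of C ( ϕ ) could be assembled in R, using igraph to get shortest path lengths on an invented four-metabolite reaction graph; this follows the definitions above but is not the package's internal code.

```r
library(igraph)

# Toy pathway: 4 metabolites linked by 3 reactions (invented edges)
g  <- graph_from_edgelist(cbind(c(1, 2, 3), c(2, 3, 4)), directed = FALSE)
sp <- distances(g)  # shortest path lengths between all metabolite pairs

# A_p: inverse shortest path length off the diagonal, zero on the diagonal
Ap <- ifelse(is.finite(sp) & sp > 0, 1 / sp, 0)

# G_p: diagonal matrix of reciprocals of each metabolite's number of neighbours
Gp <- diag(1 / rowSums(Ap > 0))

# One pathway's contribution to C(phi) = sum_p phi_p G_p A_p
phi_p  <- 0.2
C_term <- phi_p * Gp %*% Ap
```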

Under the CAR setting, we draw the reader's attention to the fact that if pathway information is not available (i.e. all or some metabolites are not identified), then no pathway-induced correlation is assumed in the data, and inference is performed with the design matrix C in Eq. 4 set to a zero matrix. Hence, metabolites are assumed to be independent, as the covariance matrix between metabolites is diagonal in this case.

The model needs to comply with the condition that I M − C(ϕ) is positive definite. If we assume that pathways are a priori equally perturbed, ϕ p must fall in the interval \(\left (\frac {1}{P\xi _{p}^{1}}, \frac {1}{P\xi _{p}^{2}} \right)\) where \(\xi _{p}^{1}\) and \(\xi _{p}^{2}\) are the minimum and maximum eigenvalues of G p A p , respectively. In practice, strong interaction between observed metabolites of pathway p is reproduced in CAR models only when the scaling parameter ϕ p is quite close to one of the boundaries \(\frac {1}{P\xi _{p}^{1}}, \frac {1}{P\xi _{p}^{2}}\) . Hence, we use a beta-type prior for ϕ p that places substantial mass on large values of | ϕ p | [ 49 ]:

where B is the beta function. The parameter σ 2 captures variance heterogeneity in metabolite intensities and is given an inverse gamma prior G ( ψ , ψ −1). This prior provides 2 ψ pseudo-observations in addition to NT available observations. In order to build a reasonably informative prior we set ψ = N × T /4.
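Continuing the toy example above, the admissible interval for ϕ p can be computed from the extreme eigenvalues of G p A p ; a small sketch (P is set to the 11 pathways used in the simulations):

```r
# Eigenvalues of G_p A_p are real for this structure; Re() guards against
# numerical noise from the non-symmetric product
ev  <- Re(eigen(Gp %*% Ap, only.values = TRUE)$values)
P   <- 11
xi1 <- min(ev); xi2 <- max(ev)   # minimum and maximum eigenvalues

# phi_p must fall in this interval for I_M - C(phi) to stay positive definite
phi_interval <- c(1 / (P * xi1), 1 / (P * xi2))
```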

Integrative analysis

In this section, we turn our attention to modeling the association between heterogeneous omic variables, such as transcripts and metabolites. Association between omic variables involves complex processes where often only a few variables are significant, which motivates the use of shrinkage priors for integrative analysis and cross-omics biomarker discovery (see [ 50 ] for a review of shrinkage priors). [ 51 ] proposed the "horseshoe" prior, a prior based on a scale mixture of normals where the scale parameters are modeled as the product of a global shrinkage (scale) parameter and a local shrinkage (scale) parameter. This definition allows for additional flexibility, where sparsity can be controlled at a global level for each metabolite (i.e. how many coefficients are non-zero?) and at a local level for each coefficient (i.e. which coefficients are non-zero?). The horseshoe prior has been widely recognized and extended by the statistical community since its introduction by [ 51 ], as it benefits from various desirable properties such as a simple analytic form, easy computation and preservation of significant coefficients (no over-shrinkage) [ 52 ]. In order to model the association between heterogeneous omic variables, we extend the horseshoe prior via the following hierarchical shrinkage model, introducing an additional variable τ to control the overall sparsity level across all metabolites:

where α m represents the treatment effect for metabolite m , \(\gamma_{im} \sim N\left(0, \sigma_{\gamma_{m}}^{2} \right)\) represents individual perturbations for metabolite m , and \(\nu_{itm} \vert \nu_{i,t-1,m} \sim N\left(\theta_{m} \nu_{i,t-1,m}, \sigma_{\nu_{m}}^{2}\right)\) follows an auto-regressive process and represents temporal effects for metabolite m of individual i at time point t . β m is a vector of dimension K that quantifies interactions between metabolite m and the other omic variables encoded in the vector y it of dimension K . λ mk is called the local shrinkage parameter, whilst \(\sigma^{2}_{\beta_{m}}\) is the global shrinkage parameter. St + denotes the half Student-t distribution with τ degrees of freedom. For τ =1, this prior reduces to the horseshoe prior [ 51 ]. Intuitively, for small values of λ mk the coefficient β mk is very close to 0, while for relevant variables λ mk will be large. In addition, \(\sigma_{\beta_{m}}\) controls the overall shrinkage level, i.e. the vector β m is sparser for small values of \(\sigma_{\beta_{m}}\).
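For intuition, here is a sketch that draws coefficients for one metabolite from this hierarchy; the placement of τ in the coefficient scale is read off the definition of κ mk below, so treat the exact scaling as an assumption.

```r
set.seed(7)
K <- 50; tau <- 1; sigma_beta <- 1

# Local scales: half Student-t with tau degrees of freedom
# (tau = 1 gives the half-Cauchy of the original horseshoe)
lambda <- abs(rt(K, df = tau))

# Coefficients: scale mixture of normals, assumed variance lambda^2 sigma_beta^2 / tau
beta <- rnorm(K, mean = 0, sd = lambda * sigma_beta / sqrt(tau))

summary(abs(beta))  # most draws are shrunk near zero; a few stay large
```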

Define \(\kappa_{mk} = \frac{1}{1+ \lambda_{mk}^{2} \sigma^{2}_{\beta_{m}}/ \tau }\) as a random shrinkage coefficient such that κ mk ≈0 when λ mk is large and κ mk ≈1 when λ mk is small. This transformation implies the following prior distribution on κ mk :

This prior density is shown in Fig.  9 for different values of \(\sigma_{\beta_{m}}\) and τ . It reduces to a Beta ( τ /2,1/2) distribution if \(\sigma_{\beta_{m}}=1\), and to a Beta (1/2,1/2), whose density looks like a horseshoe, if in addition τ =1. When τ increases, the Beta ( τ /2,1/2) distribution skews towards 1, which increases the global shrinkage power. The expectation of β m given Y , κ m , τ , μ tm can be expressed as:

figure 9

Shrinkage prior \(\mathrm {p} \left (\kappa _{mk} \vert \tau, \sigma _{\beta _{m}} \right)\) on κ mk for different values of \(\sigma _{\beta _{m}}\) and τ . The prior distribution skews towards 1 if τ increases or \(\sigma _{\beta _{m}}\) decreases (shrinkage). It skews towards 0 if τ decreases or \(\sigma _{\beta _{m}}\) increases (no shrinkage)

where \(\Sigma_{m} = \left(\frac{\sigma_{\nu_{m}}^{2}}{1-\theta_{m}^{2}} + \sigma_{\gamma_{m}}^{2} \right) \boldsymbol{I}_{N}\) and Υ m is a diagonal matrix of order K with elements 1/ κ mk −1. Equation ( 10 ) introduces a penalty term τ Υ m , where Υ m is a metabolite-specific penalty term introduced by the horseshoe prior and τ is a global penalty term. Precisely, τ captures the overall sparsity level amongst all metabolites. The expectation of β m given Y , κ m , τ , μ tm is very similar to the estimate of β m under ridge regression, where the penalty τ Υ m would simply reduce to τ times the identity matrix.

The global sparsity level can be controlled using τ . Increasing the global sparsity level is a desirable property in omic studies, as we usually deal with a large number of omic variables of which only a few are important. In Appendix A we discuss how τ can be fixed a priori.
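The Beta( τ /2, 1/2) reduction at \(\sigma_{\beta_{m}}=1\) is easy to verify by Monte Carlo; a short sketch:

```r
set.seed(8)
n <- 1e5; tau <- 2; sigma_beta <- 1

lambda <- abs(rt(n, df = tau))                     # half Student-t local scales
kappa  <- 1 / (1 + lambda^2 * sigma_beta^2 / tau)  # shrinkage coefficients

# For sigma_beta = 1, kappa should match a Beta(tau/2, 1/2) distribution
qqplot(kappa, rbeta(n, tau / 2, 1 / 2)); abline(0, 1)
```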

figure 10

Plates diagram of the iCARH model. Fixed variables are represented by squares, random variables by circles and observations are shaded. For clarity, all variances \(\sigma ^{2}, \sigma _{\beta _{m}}^{2}, \sigma _{\gamma _{m}}^{2}, \sigma _{\nu _{m}}^{2}\) are not represented in the diagram

Experimental design

The covariance structure might change drastically as a result of treatment if the latter affects relationships between metabolites. The model can be extended to take into account the experimental design. As specified in the previous section, α m captures the treatment effect for metabolite m , γ im represents individual perturbations for metabolite m , \(\nu _{itm} \vert \nu _{i,t-1,m} \sim N\left (\theta _{m} \nu _{i,t-1,m}, \sigma _{\nu _{m}}^{2}\right)\) represents temporal effects for metabolite m of individual i at time point t in Eq. 7 . In addition, we allow covariance structures C ( ϕ e ) to be different for the control samples and the cases where e ∈ {cases, controls} designates experimental groups. This yields the overall hierarchical model:

A key point of this model is that by specifying different dependence parameters for metabolite interactions in cases and controls, the model is able to identify perturbed pathways by comparing ϕ cases and ϕ controls .

Appendix A: Global sparsity

When prior knowledge is available, specifying τ a priori can optimize the inference and, additionally, provide a more informative prior on λ mk . If we fix \(\mathrm{p}\left(\sigma_{\beta_{m}}^{2}\right) \propto 1/\sigma_{\beta_{m}}^{2}\), integrating over \(\sigma_{\beta_{m}}\) gives the expected value of κ mk as:

where \( G_{\cdot,\cdot }^{\cdot,\cdot } \) is Meijer’s G-function [ 53 ]. The equation above can be used to fix τ a priori by defining the expected proportion of shrunk coefficients. In practice, different values of τ are plugged into the equation above to get the desired proportion of shrunk coefficients. However, many definite integrals can be obtained using the tables of Meijer functions in [ 54 ] for special values of parameters.

Appendix B: Model summary

Table 1 depicts a summary of model parameters specifying parameters of interest, other inferred parameters and user specified parameters. Figure 10 shows the plates diagram of the iCARH model where fixed variables are represented by squares, random variables by circles and observations are shaded. For clarity, all variances \(\sigma ^{2}, \sigma _{\beta _{m}}^{2}, \sigma _{\gamma _{m}}^{2}, \sigma _{\nu _{m}}^{2}\) are not represented in the diagram.

The choice of the gamma-type distribution for \(\sigma_{\beta_{m}}^{2}, \sigma_{\gamma_{m}}^{2}, \sigma_{\nu_{m}}^{2}\) follows the same principle used in the " Model " section for σ 2 . For each variance parameter, the prior provides a number of pseudo-observations equal to half the number of available observations, e.g. \(\sigma_{\nu_{m}}^{2}\) has a G ( T /4, T /4−1) prior such that it provides T /2 pseudo-observations in addition to the T observations, so that the prior is reasonably informative.

Availability of data and materials

The proposed method has been implemented in the R package iCARH that is available from CRAN . A fully worked example with publicly available data is in Additional file  1 .

Abbreviations

CAR : Conditional auto-regressive (model)

DPPCA : Dynamic probabilistic PCA

iCARH : Integrative conditional auto-regressive horseshoe model

PCA : Principal component analysis

PLS : Partial least squares

O2PLS : Two-way orthogonal partial least squares

Joyce AR, Palsson BØ. The model organism as a system: integrating’omics’ data sets. Nat Rev Mol Cell Biol. 2006; 7(3):198–210.

CAS   PubMed   Google Scholar  

Ebrahim A, Brunk E, Tan J, O’brien EJ, Kim D, Szubin R, Lerman JA, Lechner A, Sastry A, Bordbar A, et al. Multi-omic data integration enables discovery of hidden biological regularities. Nat Commun. 2016; 7. https://doi.org/10.1038/ncomms13091 .

Daviss B. Growing pains for metabolomics: the newest’omic science is producing results–and more data than researchers know what to do with. The Scientist. 2005; 19(8):25–29.

Google Scholar  

Reo NV. Nmr-based metabolomics. Drug Chem Toxicol. 2002; 25(4):375–82.

Dettmer K, Aronov PA, Hammock BD. Mass spectrometry-based metabolomics. Mass Spectrom Rev. 2007; 26(1):51–78.

CAS   PubMed   PubMed Central   Google Scholar  

Berk M, Ebbels T, Montana G. A statistical framework for biomarker discovery in metabolomic time course data. Bioinformatics. 2011; 27(14):1979–85.

Mei Y, Kim SB, Tsui KL. Linear-mixed effects models for feature selection in high-dimensional NMR spectra. Expert Syst Appl. 2009; 36(3 PART 1):4703–8. https://doi.org/10.1016/j.eswa.2008.06.032 .

Chen C, Deng L, Wei S, Nagana Gowda GA, Gu H, Chiorean EG, Abu Zaid M, Harrison ML, Pekny JF, Loehrer PJ, Zhang D, Zhang M, Raftery D. Exploring metabolic profile differences between colorectal polyp patients and controls using seemingly unrelated regression. J Proteome Res. 2015; 14(6):2492–9. https://doi.org/10.1021/acs.jproteome.5b00059 .

Chen C, Nagana Gowda GA, Zhu J, Deng L, Gu H, Chiorean EG, Abu Zaid M, Harrison M, Zhang D, Zhang M, Raftery D. Altered metabolite levels and correlations in patients with colorectal cancer and polyps detected using seemingly unrelated regression analysis. Metabolomics. 2017; 13(11):125. https://doi.org/10.1007/s11306-017-1265-0 .

PubMed   PubMed Central   Google Scholar  

Pham LM, Carvalho L, Schaus S, Kolaczyk ED. Perturbation Detection Through Modeling of Gene Expression on a Latent Biological Pathway Network: A Bayesian hierarchical approach. J Am Stat Assoc. 2015; 1459(July 2016):1–61. https://doi.org/10.1080/01621459.2015.1110523 , http://arxiv.org/abs/arXiv:1409.0503v1.

Kanehisa M, Goto S. Kegg: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000; 28(1):27–30.

Wold S, Martens H, Wold H. The multivariate calibration problem in chemistry solved by the pls method. Matrix Pencils. 1983:286–93. https://doi.org/10.1007/bfb0062108 .

Trygg J, Wold S. Orthogonal projections to latent structures (o-pls). J Chemometr. 2002; 16(3):119–28.

CAS   Google Scholar  

Trygg J, Wold S. O2-pls, a two-block (x–y) latent variable regression (lvr) method with an integral osc filter. J Chemometr. 2003; 17(1):53–64.

Löfstedt T, Trygg J. OnPLS – a novel multiblock method for the modelling of predictive and orthogonal variation. J Chemometr. 2011; 25(8):441–55.

Antti H, Bollard M, Ebbels T, Keun H, Lindon J, Nicholson J, Holmes E. Batch statistical processing of 1H NMR-derived urinary spectral data. J Chemometr. 2002; 16(8-10):461–8.

Kaspar MH, Ray WH. Dynamic PLS modelling for process control. Chem Eng Sci. 1993; 48(20):3447–61.

Ku W, Storer RH, Georgakis C. Disturbance detection and isolation by dynamic principal component analysis. Chemom Intell Lab Syst. 1995; 30(1):179–96.

Rantalainen M, Cloarec O, Ebbels TM, Lundstedt T, Nicholson JK, Holmes E, Trygg J. Piecewise multivariate modelling of sequential metabolic profiling data. BMC Bioinformatics. 2008; 9(1):105.

Bro R. PARAFAC. Tutorial and applications. Chemom Intell Lab Syst. 1997; 38(2):149–71.

Smilde AK, Jansen JJ, Hoefsloot HC, Lamers R-JA, Van Der Greef J, Timmerman ME. ANOVA-simultaneous component analysis (ASCA): a new tool for analyzing designed metabolomics data. Bioinformatics. 2005; 21(13):3043–8.

Harrington PdB, Vieira NE, Espinoza J, Nien JK, Romero R, Yergey AL. Analysis of variance–principal component analysis: a soft tool for proteomic discovery. Anal Chim Acta. 2005; 544(1-2):118–27.

Nyamundanda G, Gormley IC, Brennan L. A dynamic probabilistic principal components model for the analysis of longitudinal metabolomics data. J R Stat Soc Ser C (Appl Stat). 2014; 63(5):763–82.

Sahra IB, Le Marchand-Brustel Y, Tanti J-F, Bost F. Metformin in cancer therapy: a new perspective for an old antidiabetic drug? Mol Cancer Ther. 2010; 9(5):1092–9.


UK Prospective Diabetes Study (UKPDS) Group. Effect of intensive blood-glucose control with metformin on complications in overweight patients with type 2 diabetes (UKPDS 34). Lancet. 1998; 352(9131):854–65.

Anisimov VN, Berstein LM, Egormin PA, Piskunova TS, Popovich IG, Zabezhinski MA, Tyndyk ML, Yurova MV, Kovalenko IG, Poroshina TE, et al. Metformin slows down aging and extends life span of female SHR mice. Cell Cycle. 2008; 7(17):2769–73.

Rapin A, Pattaroni C, Marsland BJ, Harris NL. Microbiota analysis using an Illumina MiSeq platform to sequence 16S rRNA genes. Curr Protoc Mouse Biol. 2017:100–29. https://doi.org/10.1002/cpmo.29.

Kishino S, Takeuchi M, Park S-B, Hirata A, Kitamura N, Kunisawa J, Kiyono H, Iwamoto R, Isobe Y, Arita M, et al. Polyunsaturated fatty acid saturation by gut lactic acid bacteria affecting host lipid composition. Proc Natl Acad Sci. 2013; 110(44):17808–13.

Kimura I, Ozawa K, Inoue D, Imamura T, Kimura K, Maeda T, Terasawa K, Kashihara D, Hirano K, Tani T, et al. The gut microbiota suppresses insulin-mediated fat accumulation via the short-chain fatty acid receptor GPR43. Nat Commun. 2013; 4:1829.

Chong J, Soufan O, Li C, Caraus I, Li S, Bourque G, Wishart DS, Xia J. MetaboAnalyst 4.0: towards more transparent and integrative metabolomics analysis. Nucleic Acids Res. 2018; 46(W1):486–94.

Lee D, Mitchell R. Locally adaptive spatial smoothing using conditional auto-regressive models. J R Stat Soc Ser C (Appl Stat). 2013; 62(4):593–608.

Rushworth A, Lee D, Sarran C. An adaptive spatiotemporal smoothing model for estimating trends and step changes in disease risk. J R Stat Soc Ser C (Appl Stat). 2017; 66(1):141–57. https://doi.org/10.1111/rssc.12155 , http://arxiv.org/abs/1411.0924.

Lee D. A comparison of conditional autoregressive models used in Bayesian disease mapping. Spat Spatio-temporal Epidemiol. 2011; 2(2):79–89.

Hucka M, Finney A, Sauro HM, Bolouri H, Doyle JC, Kitano H, Arkin AP, Bornstein BJ, Bray D, Cornish-Bowden A, et al. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics. 2003; 19(4):524–31.

Bleasby A, Akrigg D, Attwood T. OWL – a non-redundant composite protein sequence database. Nucleic Acids Res. 1994; 22(17):3574.

Moutselos K, Kanaris I, Chatziioannou A, Maglogiannis I, Kolisis FN. KEGGconverter: a tool for the in-silico modelling of metabolic networks of the KEGG pathways database. BMC Bioinformatics. 2009; 10(1):324.

Pilalis E, Koutsandreas T, Valavanis I, Athanasiadis E, Spyrou G, Chatziioannou A. KENeV: a web-application for the automated reconstruction and visualization of the enriched metabolic and signaling super-pathways deriving from genomic experiments. Comput Struct Biotechnol J. 2015; 13:248–55.

Harrison LM, Green GG. A Bayesian spatiotemporal model for very large data sets. NeuroImage. 2010; 50(3):1126–41.

Ormerod JT, You C, Müller S, et al. A variational Bayes approach to variable selection. Electron J Stat. 2017; 11(2):3549–94.

Xia J, Wishart DS. MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics. 2010; 26(18):2342–4.

Kamburov A, Cavill R, Ebbels TM, Herwig R, Keun HC. Integrated pathway-level analysis of transcriptomics and metabolomics data with IMPaLA. Bioinformatics. 2011; 27(20):2917–8.

Chagoyen M, Pazos F. Tools for the functional interpretation of metabolomic experiments. Brief Bioinforma. 2012; 14(6):737–44.

Kankainen M, Gopalacharyulu P, Holm L, Orešič M. MPEA – metabolite pathway enrichment analysis. Bioinformatics. 2011; 27(13):1878–9.

Gao J, Tarcea VG, Karnovsky A, Mirel BR, Weymouth TE, Beecher CW, Cavalcoli JD, Athey BD, Omenn GS, Burant CF, et al. Metscape: a cytoscape plug-in for visualizing and interpreting metabolomic data in the context of human metabolic networks. Bioinformatics. 2010; 26(7):971–3.

Walther D, Strassburg K, Durek P, Kopka J. Metabolic pathway relationships revealed by an integrative analysis of the transcriptional and metabolic temperature stress-response dynamics in yeast. Omics J Integr Biol. 2010; 14(3):261–74.

Gipson GT, Tatsuoka KS, Sokhansanj BA, Ball RJ, Connor SC. Assignment of ms-based metabolomic datasets via compound interaction pair mapping. Metabolomics. 2008; 4(1):94–103.

Krumsiek J, Suhre K, Illig T, Adamski J, Theis FJ. Gaussian graphical modeling reconstructs pathway reactions from high-throughput metabolomics data. BMC Syst Biol. 2011; 5(1):21.

Cressie N, Wikle CK. Statistics for Spatio-temporal Data. Wiley; 2015. https://doi.org/10.1111/j.1538-4632.2012.00859.x. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1538-4632.2012.00859.x.

Banerjee S, Carlin BP, Gelfand AE. Hierarchical Modeling and Analysis for Spatial Data. CRC Press; 2014.

Richardson S, Bottolo L, Rosenthal JS. Bayesian models for sparse regression analysis of high dimensional data. Bayesian Stat. 2010; 9:539–69.

Carvalho CM, Polson NG, Scott JG. Handling sparsity via the horseshoe. In: Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics. PMLR: Florida; 2009. p. 73–80. http://proceedings.mlr.press/v5/carvalho09a.html.

Van Der Pas S, Kleijn B, Van Der Vaart A, et al. The horseshoe estimator: Posterior concentration around nearly black vectors. Electron J Stat. 2014; 8(2):2585–618.

Meijer CS. Über Whittakersche bzw. Besselsche Funktionen und deren Produkte. Nieuw Arch Wiskd, II Ser. 1936; 18(4):10–39.

Brychkov YA. Handbook of Special Functions: Derivatives, Integrals, Series and Other Formulas: CRC Press; 2008. https://doi.org/10.1201/9781584889571 .

Brunk E, George KW, Alonso-Gutierrez J, Thompson M, Baidoo E, Wang G, Petzold CJ, McCloskey D, Monk J, Yang L, et al. Characterizing strain variation in engineered E. coli using a multi-omics-based workflow. Cell Syst. 2016; 2(5):335–46.


Acknowledgements

Not Applicable.

Availability and requirements

Project name : iCARH.

Project home page : https://cran.r-project.org/web/packages/iCARH

Operating system(s) : Platform independent.

Programming language : R, Stan.

Other requirements : R 3.5 or higher, Stan 2.18 or higher.

License : GNU GPL.

Any restrictions to use by non-academics : None.

Infrastructure support for this work was provided by the NIHR Imperial Biomedical Research Centre. TJ was supported by a Wellcome Trust ISSF Ph.D. studentship. The funding body played no role in the design of the study; the collection, analysis, and interpretation of data; or the writing of the manuscript.

Author information

Authors and Affiliations

Epidemiology and Biostatistics, School of Public Health, Imperial College London, Norfolk Place, London, W2 1PG, UK

Takoua Jendoubi

Statistics Section, Department of Mathematics, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK

Department of Surgery and Cancer, Imperial College London, South Kensington Campus, London, SW7 2AZ, UK

Timothy M. D. Ebbels


Contributions

TJ and TE jointly conducted the research and wrote the paper. All authors read and approved the manuscript.

Corresponding author

Correspondence to Takoua Jendoubi.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

Timothy M. D. Ebbels is also a member of the editorial board (Associate Editor) of BMC Bioinformatics.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Additional file 1.

Worked example. We illustrate a fully reproducible application of the iCARH package to a publicly available dataset from [55].

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.


About this article

Cite this article

Jendoubi, T., Ebbels, T.M. Integrative analysis of time course metabolic data and biomarker discovery. BMC Bioinformatics 21, 11 (2020). https://doi.org/10.1186/s12859-019-3333-0


Received: 29 September 2019

Accepted: 19 December 2019

Published: 09 January 2020

DOI: https://doi.org/10.1186/s12859-019-3333-0


Keywords

  • Metabolomics
  • Biomarker discovery
  • Bayesian inference




Time course regulatory analysis based on paired expression and chromatin accessibility data

Zhana Duren

1 Department of Statistics, Stanford University, Stanford, California 94305, USA;

Jingxue Xin

2 CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China;

3 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming, 650223, China;

Wing Hung Wong

4 Department of Biomedical Data Science, Bio-X Program, Center for Personal Dynamic Regulomes, Stanford University, Stanford, California 94305, USA


A time course experiment is a widely used design in the study of cellular processes such as differentiation or response to stimuli. In this paper, we propose time course regulatory analysis (TimeReg) as a method for the analysis of gene regulatory networks based on paired gene expression and chromatin accessibility data from a time course. TimeReg can be used to prioritize regulatory elements, to extract core regulatory modules at each time point, to identify key regulators driving changes of the cellular state, and to causally connect the modules across different time points. We applied the method to analyze paired chromatin accessibility and gene expression data from a retinoic acid (RA)–induced mouse embryonic stem cell (mESC) differentiation experiment. The analysis identified 57,048 novel regulatory elements regulating cerebellar development, synapse assembly, and hindbrain morphogenesis, which substantially extended our knowledge of cis-regulatory elements during differentiation. Using single-cell RNA-seq data, we showed that the core regulatory modules can reflect the properties of different subpopulations of cells. Finally, the driver regulators are shown to be important in clarifying the relations between modules across adjacent time points. As a second example, applying our method to Ascl1-induced direct reprogramming from fibroblast to neuron time course data identified Id1/2 as driver regulators of the early stage of reprogramming.

In time course expression analysis, gene expression is measured at multiple time points during a natural biological process such as spontaneous differentiation of progenitor cells, or during an induced biological process such as cellular response to a stimulus or treatment (Storey et al. 2005). In the last two decades, many methods have been developed to infer gene regulatory networks (GRNs) from time course gene expression data, for example, information theory–based methods (Margolin et al. 2006; Hempel et al. 2011; Kinney and Atwal 2014), Bayesian network–based methods (Perrin et al. 2003; Zou and Conzen 2005), ordinary differential equation–based methods (Bansal et al. 2006; Wang et al. 2006), and permutation-based methods (Hempel et al. 2011). Such analysis is popular because the expression data, which are inexpensive to measure, can provide a rich description of the changes in cellular state during the time course. However, because the regulation of gene expression involves the interaction of transcription factors with DNA in regions of open chromatin, the measurement of gene expression alone is not sufficient for the study of regulation (Duren et al. 2017; Miraldi et al. 2019). Much deeper understanding can be revealed by time course regulatory analysis, in which both gene expression and chromatin accessibility are measured at each time point in a time course experiment. With the advent of cost-effective technologies for measuring chromatin accessibility (e.g., the Assay for Transposase-Accessible Chromatin using sequencing [ATAC-seq]) (Buenrostro et al. 2013), paired expression and accessibility data are now becoming available in many time course experiments, such as FOS-induced neuronal activities (Su et al. 2017), epidermal development (Li et al. 2019), myeloid cell differentiation (Ramirez et al. 2017), early cardiomyocyte differentiation (Liu et al. 2017), iPSC reprogramming (Wapinski et al. 2017; Cao et al. 2018), and induced neuron reprogramming (Wapinski et al. 2017; Cao et al. 2018). Here, we present a methodology for the analysis of data from studies with such experimental designs.

Figure 1 presents an outline of our methodology (for detail, see Methods). First, we infer a context-specific GRN from matched ATAC-seq and RNA-seq data at each time point to output reliable regulatory relations. Using the inferred GRN, we define two types of scores for regulatory relations. The regulatory strength of a transcription factor (TF) on a target gene (TG) is quantified by the trans-regulation score (TRS), which is calculated by integrating information from the multiple regulatory elements (REs) that may mediate the activity of the TF in regulating the TG. Here, a prior TF-TG correlation across external public data (Supplemental Table S1) is included in the TRS definition to distinguish TFs sharing the same binding motif (i.e., TFs from the same family). The regulatory strength of an RE on a target gene is quantified by the cis-regulation score (CRS), which is calculated by integrating the TRS of TFs with binding potential on the RE. Based on these scores, we use non-negative matrix factorization to extract the core regulatory modules that characterize different biological processes and/or subpopulations of cells. Finally, we identify driver TFs (i.e., TFs driving expression changes between adjacent time points) as TFs whose TRS scores are much higher on up-regulated genes than on other genes. This allows us to causally connect ancestor–descendant regulatory modules across time points (Methods). Our methodology for time course regulatory analysis is implemented in the software Time Course Regulatory Analysis (TimeReg), which is freely available (https://github.com/SUwonglab/TimeReg).

Figure 1. Schematic overview of Time Course Regulatory Analysis (TimeReg) based on paired gene expression and chromatin accessibility data. (A) TimeReg uses a three-step framework to infer a high-quality gene regulatory network. Step 1: PECA2 infers a context-specific GRN from matched ATAC-seq and RNA-seq data at a single time point, outputting TF-TG and RE-TG regulatory matrices. Step 2: NMF decomposes the regulatory matrix and extracts the core regulatory modules at each time point. Step 3: Driver regulators are identified (Methods). (B) Overview of the PECA2 method. (C) Schematic overview of the major results on two data sets.

In this paper, we will validate the utility of TRS and CRS by comparison with TF-TG relations and RE-TG relations defined by data from independent gene knockdown, ChIP-seq, and HiChIP experiments. After validating these key concepts, we will apply our method to study a time course of retinoic acid (RA)–induced differentiation of mouse embryonic stem cells (mESCs), in which gene expression and chromatin accessibility are measured at baseline (day 0) and at days 2, 4, 10, and 20 after RA treatment. We will extract core regulatory modules at each time point, map the modules across time points to trace their developmental trajectory, and validate the core regulatory modules with single-cell RNA-seq data. As a second illustration, we will apply our method to Ascl1-induced direct reprogramming from fibroblast to neuron time course data to detect heterogeneity, explore the regulatory dynamics, and identify driver regulators in the reprogramming process. By applying our method to different biological problems, we will illustrate that our methodology is capable of providing reliable regulatory relations in time course experiments and dissecting the dynamics at the network level.

Chromatin accessibility and expression dynamics of retinoic acid–induced mESC differentiation

Mouse ESCs were induced to differentiate by treatment with retinoic acid (RA). We harvested cells at days 0, 2, 4, 10, and 20 (mESC, D2, D4, D10, and D20) and performed ATAC sequencing (ATAC-seq) and RNA sequencing (RNA-seq) to measure paired chromatin accessibility and gene expression time course data (Fig. 2A). Regulatory elements (REs) at each time point are identified and quantified by their openness scores in the same way as in Duren et al. (2017). In total, 174,059 REs are obtained across all time points. From the gene expression data, we selected 7975 genes that show at least a twofold expression change and a maximum expression level >10. These sets of REs and genes are used for all subsequent analyses. From the results of principal component analysis (PCA) and Pearson's correlation coefficient (PCC) analysis on ATAC-seq data of twofold dynamic REs (Fig. 2B,C), our genome-wide gene expression and chromatin accessibility profiling shows high reproducibility among biological replicates. We also find a sharp change in the chromatin accessibility landscape during the time course, whereas the changes in gene expression are more moderate (Fig. 2D,E). The change in accessibility between day 0 and day 2 is particularly large. This suggests that the immediate responses to RA treatment are large changes in chromatin accessibility, which then induce subsequent gene expression changes in the time course.
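As a hedged sketch of the gene filter and replicate QC just described (synthetic stand-in data; none of this is the authors' code):

```r
# Dynamic-gene filter (>= twofold change, maximum expression > 10) followed
# by the QC PCA and Pearson-correlation computation on log expression.
set.seed(1)
fpkm <- matrix(rexp(2000 * 10, rate = 0.05), nrow = 2000,
               dimnames = list(paste0("gene", 1:2000), paste0("sample", 1:10)))

gmax <- apply(fpkm, 1, max)
gmin <- apply(fpkm, 1, min)
keep <- (gmax / pmax(gmin, 1e-6)) >= 2 & gmax > 10
dynamic <- fpkm[keep, ]

pca <- prcomp(t(log2(1 + dynamic)))     # samples as points, genes as features
plot(pca$x[, 1:2], pch = 19)            # replicate clustering, cf. Fig. 2B
cor_mat <- cor(log2(1 + dynamic))       # sample-sample Pearson correlations
```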

Figure 2. Genome-wide profiling of gene expression and chromatin accessibility during RA induction reveals the landscape of the RA-driven lineage transition. (A) Schematic outline of the study design. (B,C) PCA and heatmap of the Pearson's correlation matrix on ATAC-seq data. (D,E) PCA and heatmap of the Pearson's correlation matrix on RNA-seq data. (F) Enriched GO terms in the top 200 specific genes at each time point. The horizontal axis is −log10(P-value), and the number behind each bar represents fold enrichment.

We performed Gene Ontology (GO) term enrichment analysis on the 200 most specifically expressed genes (Methods) at each time point. Because the differentiation is induced by RA, as a positive control we check whether the GO term “response to retinoic acid” is enriched at day 2. Indeed, this term is highly enriched at day 2 (P = 3.67 × 10−10, fold change = 6.51). This gives us confidence that the enrichment analysis on specifically expressed genes is capable of detecting biologically relevant signals. Figure 2F presents the most significantly enriched GO terms at each time point. It suggests that RA induced differentiation into multiple cell types, with neuronal cells arising early in the time course and glial lineages emerging later. The enrichment of “digestion” on day 20 suggests that the differentiation also gave rise to cell types beyond neurons and glia (Fig. 2F).
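A hedged sketch of the kind of GO enrichment test reported here, a hypergeometric p-value plus fold enrichment for one term (toy gene sets; a real analysis would use curated GO annotations):

```r
# Hypergeometric enrichment of one gene set (term) in a list of hits.
go_enrich <- function(hits, term_genes, universe) {
  q <- length(intersect(hits, term_genes))   # term genes among the hits
  m <- length(term_genes)                    # term genes in the universe
  n <- length(universe) - m                  # non-term genes
  k <- length(hits)
  p <- phyper(q - 1, m, n, k, lower.tail = FALSE)
  fold <- (q / k) / (m / length(universe))
  c(p.value = p, fold.enrichment = fold)
}

universe <- paste0("g", 1:7975)
set.seed(1)
go_enrich(hits = sample(universe, 200),
          term_genes = sample(universe, 50),
          universe = universe)
```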

Context-specific inference of gene regulatory relations by the PECA2 method

We developed PECA2 as a method to infer gene regulatory relations (TF-TG relations and RE-TG relations) in a cellular context based on gene expression and chromatin accessibility data in that context (Methods). First, the trans-regulation score (TRS) for a given TF-TG pair is defined by integrating information from the multiple REs that may mediate the activity of the TF in regulating the TG (Fig. 1B; Methods). Before applying it to analyze our time course data, we first validate the usefulness of TRS using gene perturbation experiments. On mESCs, we performed shRNA knockdown (separately) of the transcription factors Pou5f1, Sox2, Nanog, Esrrb, and Stat3 and measured gene expression changes following the knockdown. Taking the most differentially expressed genes (FDR < 0.01; see Guan et al. 2019) following the knockdown as the true target genes of each TF, we calculated the area under the ROC curve (AUC) of target prediction based on ranking by TRS. As a comparison, we collected ChIP-seq data for these TFs on mESC and used the potential target gene score from BETA (Wang et al. 2013) to predict the target genes. Figure 3A shows that TRS-based prediction has substantially higher AUCs than predictions based on ChIP-seq data. We also compared our method with several baseline methods (different combinations of features from expression and accessibility data) (Methods) on TF-TG prediction. TRS-based predictions achieve much higher accuracy than these baseline methods (Supplemental Fig. S1). These results validate the usefulness of TRS in predicting TF-TG relations.
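A small sketch of the AUC computation for TRS-based target ranking, using the rank-sum identity for the AUC (toy scores and labels standing in for the TRS values and knockdown-derived targets):

```r
# AUC = (sum of positive ranks - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
auc_from_scores <- function(score, is_target) {
  r <- rank(score)
  n_pos <- sum(is_target)
  n_neg <- sum(!is_target)
  (sum(r[is_target]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
trs <- c(rnorm(100, mean = 1), rnorm(900))     # targets score higher on average
is_target <- rep(c(TRUE, FALSE), c(100, 900))
auc_from_scores(trs, is_target)                # well above 0.5
```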

Figure 3. PECA2 infers accurate gene regulation supported by ChIP-seq, shRNA knockdown, and HiChIP experiments. (A) Comparison of PECA2 TRS with ChIP-seq experiments on five important regulators in mESC, taking the knockdown data as ground truth. Shapes represent different transcription factors and colors represent different methods: red, PECA2; black, ChIP-seq. (B) Conservation score distributions of REs predicted to regulate at least one gene versus randomly selected REs. (C,D) Validation of PECA2-predicted RE-TG pairs by HiChIP experiments on mESC and RA D4. Background RE-TG pairs are randomly selected to have the same distance distribution as the predicted RE-TG pairs. “Fold” represents the fold change of the average read count of predicted versus background RE-TG pairs.

After we have the trans-regulation score, we define the cis-regulation score (CRS) for each RE-TG pair by integrating the TRS of TFs with binding potential on the RE (Fig. 1B; Methods). To validate the usefulness of CRS, we downloaded the publicly available H3K27ac HiChIP data on mESC (Mumbach et al. 2017) and also performed H3K27ac HiChIP experiments on RA D4 (Zeng et al. 2019). The RE-TG pairs from CRS-based prediction (Methods) have 17.72 and 15.30 HiChIP reads on average on mESC and RA D4, respectively. As a control, we selected random RE-TG pairs (under the constraint that they have the same distance distribution as our predicted RE-TG pairs) from all candidate RE-TG pairs. The RE-TG pairs from the control groups have 5.17 and 4.23 HiChIP read counts on average on mESC and RA D4, respectively (Fig. 3C,D). HiChIP counts are much higher in CRS-predicted RE-TG pairs (one-tailed Wilcoxon rank-sum test, ESC: P-value = 1.4412 × 10−53, fold = 3.4250; RA D4: P-value = 1.1025 × 10−56, fold = 3.616). We also constructed two more control RE-TG sets and compared their read counts with our predicted RE-TG pairs; these controls were selected from RE-TG pairs that have the same TG expression distribution and the same distribution of the number of ends covered by H3K27ac ChIP-seq peaks, respectively. Our predicted RE-TG pairs have much higher read counts than both control groups (Supplemental Fig. S2). Finally, when we examined the sequences of the REs in the RE-TG pairs, the CRS-predicted REs have significantly higher sequence conservation than those from the control group (one-tailed Wilcoxon rank-sum test, P-value = 1.3801 × 10−33, fold change = 1.4425) (Fig. 3B). These results validate the usefulness of CRS in predicting RE-TG relations. Taken together, PECA2 provides high-quality, genome-wide, and context-specific inference of gene regulatory relations for each single time point with paired gene expression and chromatin accessibility data. PECA2 is available at https://github.com/SUwonglab/PECA.
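A hedged sketch of the distance-matched control construction (one plausible implementation, not the authors' code): bin all candidate RE-TG pairs by distance and sample controls matching the bin counts of the predicted pairs.

```r
matched_controls <- function(pred_dist, cand_dist, n_bins = 20) {
  breaks <- unique(quantile(c(pred_dist, cand_dist),
                            probs = seq(0, 1, length.out = n_bins + 1)))
  pb <- cut(pred_dist, breaks, include.lowest = TRUE)
  cb <- cut(cand_dist, breaks, include.lowest = TRUE)
  unlist(lapply(levels(pb), function(b) {
    idx <- which(cb == b)                      # candidates in this distance bin
    idx[sample.int(length(idx), sum(pb == b), replace = TRUE)]
  }))
}

set.seed(1)
pred_dist <- rexp(500, rate = 1 / 5e4)         # distances of predicted pairs
cand_dist <- rexp(5000, rate = 1 / 1e5)        # distances of all candidates
ctrl_idx <- matched_controls(pred_dist, cand_dist)
# One-tailed rank-sum comparison of HiChIP read counts, as in the text:
# wilcox.test(reads_pred, reads_cand[ctrl_idx], alternative = "greater")
```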

Identification and annotation of novel regulatory elements by CRS

To identify important REs, we analyzed REs and their target genes. First, we annotated the REs into three groups: (1) promoters, (2) known enhancers (from the mouse ENCODE Project, including developmental stage enhancers), and (3) novel enhancers (Supplemental Table S2). We found that 7% of REs are promoters, 60% are known enhancers, and the remaining 33% are novel enhancers (Fig. 4A). About 69% of the genes are regulated by both known and novel enhancers. To understand the functions of the known and novel enhancers, we divided the genes into two groups (targets of known enhancers and targets of novel enhancers) depending on whether the associated RE with the maximum CRS score is a known or a novel enhancer. On average over the different time points, known enhancers have 4604.80 target genes and novel enhancers have 3370.20 target genes (Fig. 4B). We found that the novel enhancers with maximum CRS scores are highly conserved, with a mean conservation score of 0.2034. The conservation score of those enhancers is about twofold higher than that of random regions, higher than the mean conservation of known enhancers (conservation score = 0.1753), and comparable to that of open promoter regions (conservation score = 0.2368) (Fig. 4C). At each time point, we chose the top 500 specifically expressed genes (based on the gene specificity score) (Methods) among targets of known enhancers and of novel enhancers, respectively. Figure 4D shows the GO enrichment score (defined as the geometric mean of fold change and −log10[P-value]) of known versus novel enhancers' targets on RA D10. GO terms such as cerebellar granular layer development, synapse assembly, regulation of AMPA receptor activity, and hindbrain morphogenesis are enriched only in novel enhancers' targets, not in known enhancers' targets. These results suggest that the ENCODE Project's enhancer annotation for genes associated with those GO terms is still incomplete; the mouse ENCODE Project includes brain tissue but not specific neuronal cellular contexts. By combining experimental and computational analysis, we identified new enhancers whose inclusion will greatly enhance the interpretation of data from the time course.

Figure 4. Identification and annotation of novel enhancers. (A) Pie chart of promoters, known enhancers, and novel enhancers among REs. (B) Bar plot of the numbers of known and novel enhancers at different time points. (C) Conservation score distributions of different sets of REs. (D) Comparison of GO enrichment for known versus novel enhancers' targets at D10. The top 500 specifically expressed genes in each group were chosen for GO analysis. The x-axis represents the enrichment score for known enhancers' targets, and the y-axis that for novel enhancers' targets; each dot represents one GO term. The enrichment score is defined as the geometric mean of fold change and −log10(P-value).

Core regulatory modules of TFs and TGs reveal subpopulation characteristics

To better understand the regulatory network inferred by PECA2, we divided the TF-TG network, which is represented by a TRS matrix (rows represent TFs and columns represent TGs), into several modules (dense subnetworks) using a non-negative matrix factorization (NMF)–based module detection method (Brunet et al. 2004; Wang et al. 2008) (Methods; Fig. 5A). We used the RA D4 data to illustrate and validate this approach.

Figure 5. Core regulatory modules extracted from gene regulatory networks are supported by subpopulations from single-cell RNA-seq data. (A) Heatmaps of reordered normalized TRS scores at D2 to D20. The black lines mark the modules detected by NMF. (B) Heatmap of D4 normalized TRS on selected specific genes and TFs. (C) Mean expression pattern of genes from the three RA D4 modules across the developmental stages of seven tissues. (D) Mean expression pattern of genes from the three RA D4 modules over the RA time course. (E) Distribution of the maximum-expressing subpopulations (from single-cell RNA-seq data) of the top 500 module-specific genes at RA D4.

NMF analysis of the TRS matrix at D4 yielded three modules, in which the TRS scores between TFs and TGs within the same module are much higher than those between different modules. That is, we have divided the TF-TG network into three groups of nodes with dense connections internally and sparser connections between groups. Figure 5B shows this pattern for selected TFs and TGs. We found that genes from different modules have different expression patterns in mouse tissue development data (Gorkin et al. 2017) as well as in the RA induction data (Fig. 5C,D). Module 1 contains TFs such as Ascl1, Ebf1, Nr2f1, and Lhx1, which were previously reported to be relevant to neuronal development. Using data from Gorkin et al. (2017), we also found that Module 1–associated TGs have a high and increasing expression pattern in embryonic brain development (Fig. 5C, top). Module 2 contains mesodermal and endodermal regulators such as Sox17, Gata4, Gata6, Foxa2, and Hnf4a. Consistently, target genes in Module 2 are highly expressed in the liver, lung, heart, kidney, intestine, and stomach, but weakly expressed in the brain (Fig. 5C, middle). Module 3–associated TFs such as Pou2f1, Hmga1 (Nishino et al. 2008), Sox2 (Graham et al. 2003), and Pax6 (Sansom et al. 2009) are known to be involved in the maintenance of neural stem and progenitor cells. Consistently, we found that the expression of Module 3–associated TGs decreases during embryonic development in all tissues (Fig. 5C, bottom). These pieces of evidence show that the biological function of the module-specific regulators is well-matched to the cellular context suggested by the expression pattern of the corresponding target genes.

Because the cell population after 4 d of differentiation is expected to be a mixture of subpopulations of cells representing different developmental lineages, the modules identified above may reveal the regulatory characteristics of these subpopulations. We used single-cell RNA-seq data to test this hypothesis. In a previous study (Duren et al. 2018), we performed scRNA-seq experiments on day 4 of the RA-induced time course and found three distinct subpopulations of cells. We selected the top 500 module-specific genes (Methods) in each module and assigned each gene to one of the three subpopulations (the one with the maximum expression). We found that 76.20% of Module 1–specific genes are matched to subpopulation 1, 75.00% of Module 2–specific genes are matched to subpopulation 2, and 79.40% of Module 3–specific genes are matched to subpopulation 3 (Fig. 5E). This result shows that the modules are well-matched to the subpopulations and can therefore help us elucidate the regulatory relations within the subpopulations. In other words, our analysis of bulk expression and accessibility data provides useful information on subpopulation-specific gene regulatory networks.

Finally, we applied the same analysis to each time point and found three modules in D2, three in D4, four in D10, and four in D20 (Fig. 5A; Supplemental Figs. S3–S5).

Associations of regulatory modules across time points

With these functionally meaningful regulatory modules indicating subpopulations at each time point, we next investigated the relationships among them across time points. Figure 6A shows the similarities between modules at adjacent time points. The three modules in D2 are well-matched to the three modules in D4 (Jaccard similarities of 0.77, 0.82, and 0.85), and the four modules in D10 are well-matched to the four modules in D20 (Jaccard similarities of 0.69, 0.74, 0.88, and 0.66). This suggests that the analysis of a set of matched modules across time points may reveal useful biological insight. There are four modules in D10 but only three in D4. Modules 2 and 3 in D4 are similar to Modules 2 and 3 in D10 (Jaccard similarities of 0.74 and 0.88, respectively). However, it is not obvious how Modules 1 and 4 in D10 are related to the D4 modules.
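The module matching itself reduces to Jaccard similarity between module gene sets; a minimal sketch with toy gene sets:

```r
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

mods_d2 <- list(m1 = paste0("g", 1:100),  m2 = paste0("g", 101:200))
mods_d4 <- list(m1 = paste0("g", 11:110), m2 = paste0("g", 120:220))
outer(seq_along(mods_d2), seq_along(mods_d4),
      Vectorize(function(i, j) jaccard(mods_d2[[i]], mods_d4[[j]])))
```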

Figure 6. Core regulatory modules decompose the mixture populations consistently across time points. (A) Jaccard similarity of modules between neighboring time points. Line width represents the Jaccard similarity; the value is labeled on the line if it is >0.1. (B) GO analysis of modules at each time point. The top 100 specifically expressed genes of each module at each time point were selected for GO enrichment analysis. The x-axis represents time points and the y-axis fold enrichment; circle size represents the enrichment P-value. (C) Mean expression pattern of genes from D10_Module1 and D10_Module4 across the developmental stages of seven tissues. (D) Mean expression pattern of genes from D10_Module1 and D10_Module4 over the RA time course. (E) Expression pattern of the glial marker gene Gfap over the RA time course. (F) Heatmap of D10 normalized TRS on selected specific genes and TFs.

To explore the function of the modules, we extracted the top 100 most specifically expressed genes from each module (Methods) at each time point and performed GO enrichment analysis (Fig. 6B). Epithelium development and cardiovascular system development are enriched in Module 2 at earlier time points. In Module 2 at D20, digestion is enriched, while the enrichment of epithelium development and cardiovascular system development decreases. These enriched GO terms suggest that Module 2 involves endoderm and mesoderm development, consistent with the function of subpopulation 2 from the scRNA-seq data (Duren et al. 2018). Stem cell population maintenance is enriched in Module 3 with a downward trend over time, consistent with the expression pattern of Module 3–specific genes in the RA induction and tissue development data (Fig. 5C,D), indicating that the stem cell population is becoming smaller. Neural tube closure is also enriched in Module 3, which suggests that Module 3 contains neural stem cells.

To see the difference between Module 1 and Module 4 in D10, we checked the expression pattern of their genes in the mouse developmental stage data and the RA time course data (Fig. 6C,D). In the developmental stage data, genes from Module 1 are consistently highly expressed in the forebrain, whereas genes from Module 4 show an upward trend. These results suggest that both Module 1 and Module 4 are brain-related cell populations, but that Module 1 is much earlier than Module 4 in development. In the RA induction time course data, we see a similar pattern (Fig. 6D). From the GO enrichment analysis, we find that they are enriched in different functions. Module 4 in D10 and D20 is enriched in glial cell differentiation, which is not enriched in any module at earlier time points, whereas Module 1 at all time points is enriched in axon guidance (Supplemental Fig. S6). The GO enrichment analysis thus shows that Module 1 is neuron-related and Module 4 is glia-related.

Next, we checked the specific trans-regulators of these two modules (Fig. 6F). We find that genes from Module 1 are regulated by Ebf1, Ascl1, Nr2f1, and Lhx1, whereas genes from Module 4 are regulated by Olig1, Sox10, and Sox8. From the similarity of regulators and enriched GO terms on target genes, Module 1 in D10 is very similar to Module 1 in D4; Module 1 therefore corresponds to the neuron subpopulation. The GO enrichment analysis and module-specific regulators indicate that the newly generated module in D10 (Module 4) is the glial population. If this is true, glial marker genes should be highly expressed at D10 but lowly expressed at D4. We examined the expression pattern of the astrocyte marker gene Gfap, which is specific to Module 4. Indeed, Gfap is highly expressed at D10 but almost not expressed at D4 (Fig. 6E). From these results, we conclude that Module 4 corresponds to the glial population and emerged between D4 and D10. To clarify the origin of Module 4 further, we need to identify the regulators that drive the expression and accessibility changes. We will return to this question subsequently.

Driver regulators shed light on cell lineage transition

Under RA treatment, mESCs differentiated (or transitioned) into three different subpopulations after 2 d. To explore the regulatory mechanism behind the transition, we identified driver regulators in each module. Driver regulators are defined as TFs that (1) are up-regulated during the transition from day 0 to day 2 by at least 1.5-fold, and (2) have a TRS score on up-regulated genes significantly higher than that on non-up-regulated genes (one-tailed rank-sum test, FDR < 0.05). Figure 7A gives the driver regulators of the three modules at RA day 2. In Module 2, driver regulators include Gata4, Gata6, Rxra, Sox17, Foxa2, and Hnf4a. Sox17, Foxa2, and Hnf4a are involved in endoderm development; Gata4 and Gata6 are known to be involved in mesoderm development. We find that Rxra, an important cofactor of retinoic acid receptors, is involved in endoderm and mesoderm development. The expression level of Rxra is not very high at day 2 (FPKM 14.47, ranked 3897 of 7975 dynamically expressed genes, top 48.87%), but it is identified as a driver regulator. We find that the expression of Rxra is correlated with up-regulated genes (on ENCODE data, rank-sum test, P = 6.44 × 10−104) (Fig. 7B). Furthermore, the motif of RXRA is enriched in REs associated with up-regulated genes (rank-sum test, P = 1.98 × 10−26) (Fig. 7C). Previous literature indicates that knockout of Rxra leads to an abnormal phenotype in the cardiovascular system (Sucov et al. 1994; Chen et al. 1998; Mascrez et al. 2009). Even though its expression level is not very high, Rxra is likely to be an important driver of the transition from the stem cell state to the mesoderm and endoderm state. In Module 1, the driver regulators are neuron-related factors such as Ascl1, Nr2f1, Pou3f2-4, the Hox family, the Meis family, the Pbx family, Rarb, and other factors. In Module 3, the driver regulators are Pax6, Dbx1, Gli1, Gli3, and other factors. Pax6 is known to be important in neural stem cell development. We also find that the developing brain homeobox factor Dbx1 is one of the important drivers.
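A hedged sketch of this driver-regulator filter (toy TRS matrix and expression vectors; thresholds follow the definition above):

```r
# A TF is a driver if it is up-regulated >= 1.5-fold between adjacent time
# points AND its TRS on up-regulated genes exceeds that on other genes
# (one-tailed rank-sum test, BH-adjusted FDR < 0.05).
driver_tfs <- function(trs, expr_t0, expr_t1, up_genes) {
  fold_up <- expr_t1 / pmax(expr_t0, 1e-6) >= 1.5
  pvals <- apply(trs, 1, function(row)
    wilcox.test(row[up_genes], row[!up_genes],
                alternative = "greater")$p.value)
  rownames(trs)[fold_up & p.adjust(pvals, method = "BH") < 0.05]
}

set.seed(1)
trs <- matrix(rexp(20 * 500), 20, 500,
              dimnames = list(paste0("TF", 1:20), NULL))
up_genes <- rep(c(TRUE, FALSE), c(100, 400))
trs[1:5, up_genes] <- trs[1:5, up_genes] + 1   # five TFs favor up genes
expr_t0 <- rexp(20)
expr_t1 <- expr_t0 * rep(c(2, 1), each = 10)   # first ten TFs up-regulated
driver_tfs(trs, expr_t0, expr_t1, up_genes)    # should return TF1..TF5
```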

Figure 7. Identification of driver regulators reveals ancestor–descendant fates for regulatory modules. (A) Driver regulators of D2_Module1, D2_Module2, and D2_Module3. The x-axis is log2 fold change; the y-axis is −log10(P-value). (B) Distribution of Rxra's PCC with up-regulated genes versus randomly selected genes. (C) Distribution of RXRA's motif enrichment on REs of up-regulated genes versus randomly selected genes. (D) Distribution of D10_Module4 driver TF expression (blue bars) and their RE openness (red bars) at D4. Expression or openness above the median is labeled “Expressed,” below the first decile “Not exp,” and the remainder “Low exp.” (E) Distribution of the Z-score of TRS between the top 50 module-specific TFs and the driver TFs of D10_Module4. (F) Expression distribution of Module 4's driver regulators in D4 scRNA-seq data. Columns represent subpopulations identified from scRNA-seq data.

Figure 8. TimeReg suggests the developmental trajectory of the subpopulations in the RA-driven lineage transition. Schematic overview of the subpopulations at each stage. Gray lines represent the similarity of neighboring regulatory modules; orange lines represent the ancestor–descendant mapping among regulatory modules. TFs on the orange lines are the important driver regulators causally connecting the modules.

To determine the origin (ancestor) of the new population in D10, we checked the expression at D4 of the driver regulators of D10_Module4 (Sox8, Nfix, Olig1, Nfib, Hoxc8, Nkx2-2, Foxo6, Cpeb1, Lbx1, Hoxa6, Pou6f1, and Rfx4). We find that 11 of the 12 driver regulators are not yet expressed at D4 (expression above the median level of all genes is counted as expressed) (Fig. 7D). However, 118 of 174 (67.82%) REs of those driver TFs are already open at D4 (Fig. 7D). So although the driver TFs are not yet expressed, it is still possible to determine the ancestor of Module 4 by comparing the TRS scores of module-specific TFs targeting those driver TFs. Figure 7E shows the distribution of normalized TRS scores (Z-scores) of the top 50 module-specific TFs of each module targeting the Module 4 driver TFs. The results show that TFs from Module 3 in D4 have significantly higher TRS scores than those from Module 1 and Module 2 (one-tailed rank-sum test, P-values of 0.0014 and 1.97 × 10−6, respectively), which indicates that the new module in D10 is likely to have arisen from Module 3 in D4. To test this hypothesis, we compared the expression of these driver TFs in scRNA-seq data from D4. The driver regulators are more highly expressed in subpopulation 3 than in subpopulations 1 and 2 (Fig. 7F), which is consistent with our hypothesis, because Module 3 was matched to subpopulation 3 in our previous analysis (Fig. 5E). As further validation, we compared the expression of the neural stem cell markers Sox2 and Nes in the three subpopulations at D4 (Supplemental Fig. S7B,C). In most subpopulation 3 cells, Sox2 and Nes are expressed, which indicates that subpopulation 3 is enriched for neural stem cells from which the glial cells in D10 emerged.

Based on driver regulators, it is possible to construct the developmental path of the subpopulations. We define a transition score between the modules at adjacent time points to construct the ancestor–descendant map for the subpopulations in the time course (Methods). The ancestor–descendant mapping results are topologically similar to the mapping based on module similarities (Fig. 6A), except for the transition to Module 4 in D10 (Fig. 8). After RA induction, three subpopulations (neuron, mesoderm and endoderm, and neural stem cell) are generated between mESC and D2. The glial subpopulation is generated from neural stem cells between D4 and D10.

TimeReg identifies novel driver regulators of direct reprogramming from fibroblasts to neurons

In addition to mESC differentiation, we applied TimeReg to study the regulatory network of direct lineage reprogramming, which converts mouse embryonic fibroblasts (MEFs) to induced neuronal (iN) cells; its underlying mechanism for overcoming epigenetic barriers is fundamental to differentiation and development. A previous study (Wapinski et al. 2013) found that Ascl1 is sufficient to induce neurons from fibroblasts and generated time course gene expression data for MEF-to-iN reprogramming. A follow-up study generated time course chromatin accessibility data (Wapinski et al. 2017), observed rapid chromatin changes in response to Ascl1 at day 5, and identified Zbtb18, Sox8, and Dlx3 as key TFs downstream from Ascl1 for the major transition at day 5. We collected gene expression data and chromatin accessibility data of Ascl1-induced reprogramming from these two previous studies. The two levels of time course data can be paired at day 0, day 2, and day 5, which allows us to run the TimeReg integrative analysis on the iN reprogramming data. We asked how the cell triggers follow-up signals, remodels chromatin structure genome-wide, and turns lineage-specific gene expression on and off within the first 48 h.

Core regulatory modules are consistent with single-cell data

We find two well-separated modules in the day 2 network (Fig. 9A). Module 1 contains TFs such as Ascl1, Id2, Sox4, Sox8, Sox11, Dlx3, and Zbtb18, and TGs such as Cplx2, Dner, Nkain1, and Eda2r. Module 2 contains TFs such as Klf4, Tead3, AP1 complex factors, Egr1, and Prrx1, and TGs such as Myof, Emp1, Actn1, and Bst2. In the reprogramming time course data, Module 1–specific genes show an increasing pattern, whereas Module 2–specific genes show a decreasing pattern (Supplemental Fig. S8A). Furthermore, in public ENCODE data, Module 1–specific genes have a high and increasing expression pattern in embryonic brain development and low and decreasing patterns in heart, kidney, liver, and lung tissue development, whereas Module 2–specific genes show the opposite trend (Supplemental Fig. S8B). GO enrichment analysis shows that Module 1–specific genes are enriched in neuron-related functions and Module 2–specific genes are enriched in muscle- and epithelium-related functions, consistent with the expression patterns across developmental stages (Supplemental Fig. S8C). Collectively, TimeReg reveals two functional modules, consistent with our expectation that the neuron subpopulation is induced to become larger and the other populations are repressed to become smaller. This finding is supported by the single-cell RNA-seq data (Treutlein et al. 2016) of this Ascl1 reprogramming, which show two well-separated subpopulations (Fig. 9B). We find that 90.40% of Module 1–specific genes are matched to subpopulation 1 (Fig. 9C) and 88.80% of Module 2–specific genes are matched to subpopulation 2 (each gene mapped to the subpopulation with its maximum expression). In the single-cell RNA-seq data, the neuron-specific gene Ascl1 is specifically expressed in subpopulation 1 and the muscle-specific gene Myof is specifically expressed in subpopulation 2 (Fig. 9D,E). These results show that the modules identified by TimeReg indeed capture subpopulation characteristics and can therefore help us elucidate the regulatory relations within the subpopulations.

Figure 9. TimeReg analysis of direct reprogramming from fibroblast to neuron. (A) Heatmap of reordered normalized TRS scores at day 2. The black line marks the modules detected by NMF. (B) t-SNE plot of single-cell RNA-seq data; color represents cluster label. (C) Distribution of the maximum-expressing subpopulations (from single-cell RNA-seq data) of the top 500 module-specific genes at day 2. (D,E) Expression of module-specific genes on the t-SNE plot. (F,G) Module 1–specific genes are enriched in ASCL1 ChIP-seq target genes. (H) List of driver regulators of Module 1 at day 2.

Neuron-related module-specific genes are enriched in Ascl1 ChIP-seq targets

As the reprogramming is induced by the pioneer factor Ascl1, Module 1–specific regulators and target genes should contain Ascl1 target genes. To validate the module-specific regulators and module-specific genes, we compared them with the Ascl1 target genes from ChIP-seq data at day 2. Both Module 1–specific TFs and Module 1–specific target genes overlap significantly with the Ascl1 target genes (Fisher's exact test, P-values of 3.22 × 10−7 and 9.35 × 10−62, odds ratios of 4.90 and 5.02, respectively) (Fig. 9F,G).

Discovery of novel driver regulators

The TimeReg analysis identified eight driver regulators for the neuron-related Module 1 at day 2 (Fig. 9H). Ascl1 ranked first in the list, which means our method successfully detected this super-regulator. Among the other seven driver regulators, six contain Ascl1 ChIP-seq peaks in their regulatory regions. Dlx3 and Sox8 ranked second and third, respectively, consistent with the fact that they were previously reported to be important regulators of reprogramming at day 5 (Wapinski et al. 2017). Beyond these known regulators, our method detected some novel driver regulators. The fourth-ranking regulator is Id2, which was not reported in either of the two previous studies based on gene expression or chromatin accessibility alone. Id1, an important paralog of Id2, is also identified as a driver regulator. Previous research shows that dimerization of ID1/2 with the bHLH family factor MYOD prevents MYOD from binding DNA and inhibits muscle differentiation (Sun et al. 1991; Jen et al. 1992). Therefore, activation of Id1/2 by Ascl1 would inhibit muscle differentiation and drive the cell toward the neural state, making Id1 and Id2 likely to be important driver regulators of the reprogramming process. Another interesting driver regulator is Tcf12, which is known to be involved in the initiation of neuronal differentiation (Rebhan et al. 1997). By integrating gene expression and chromatin accessibility data, TimeReg successfully identified these driver regulators, which may play an important role within the first 48 h of reprogramming and provide new insights into its early regulation.

In this paper, we proposed TimeReg, a time course regulatory analysis tool for paired gene expression and chromatin accessibility data. To reduce dimensionality, TimeReg integrates the expression and chromatin accessibility levels, aggregates the REs and TFs to quantitatively infer TF-TG and RE-TG regulatory strengths, and narrows the analysis down to the regulatory module level by highlighting the important role of driver TFs. The GRN is validated by experimental data, and the core regulatory modules characterize different subpopulations of cells. Based on the driver regulators' sequential expression and binding on chromatin-accessible regions, we causally connect the modules (subpopulations) at adjacent time points and reveal the developmental trajectory of RA-induced differentiation.

Inferring ancestor–descendant fates from developmental time courses is a challenging problem, especially in the presence of large gaps (≥48 h) between time points (Schiebinger et al. 2019). In this study, we found a newly generated population at D10 that was absent at the previous time point, D4. It is challenging to infer ancestor–descendant relations across this 6-d gap in development even with single-cell data at each time point. From the scRNA-seq data at D4, we see that ∼72% of the D10 Module 4–specific genes are highly expressed in subpopulation 1 (only 15% are highly expressed in subpopulation 3) (Supplemental Fig. S7A), but the driver regulators are more highly expressed in subpopulation 3 (Fig. 7F). There are two reasons why Module 4–specific genes are more highly expressed in subpopulation 1 than in subpopulation 3: (1) many expressed genes are shared between neurons and glia, and (2) glia-specific genes are not yet expressed at D4. For these reasons, expression-based analysis cannot map the ancestor of Module 4 correctly. By exploiting paired expression and accessibility data in the time course, our method is capable of inferring such mappings reliably.

We note that because knockdown of a TF may result in secondary changes in addition to those caused directly by the knocked-down TF, TF ChIP-seq data would not be expected to predict all of the transcriptional changes. In contrast, ATAC-seq data before and after the knockdown could reveal changes related to the knocked-down TF as well as secondary effects caused by changes in other TFs. Therefore, time course experiments that collect paired RNA-seq and ATAC-seq data before and after knockdown of key TFs will be a powerful approach to gene regulatory analysis.

Finally, we discuss the limitations of our method. The incorporation of prior knowledge from external data into our model has allowed us to greatly reduce the complexity of the model. The caveat is that the set of cellular contexts used in the prior calculation is currently incomplete, which may cause modeling bias. The validation results reported above show that, in spite of this, our method is already useful for many types of inference and prediction. We expect that the bias associated with the use of external data will be further minimized as these data become more complete. One possible extension of this work is to identify the union of all modules across multiple time points simultaneously, which would describe the module dynamics more clearly.

Gene regulatory network inference from a single time point by PECA2

Our previous method PECA takes paired expression and chromatin accessibility data across diverse cellular contexts as input, models how trans- and cis-regulatory elements work together to affect gene expression in a context-specific manner, and outputs the transcriptional regulatory network with TF-RE-TG as the building block (Duren et al. 2017). PECA2 aims to infer the regulatory network in a new cellular context, different from those used in training the model, by selecting active REs, specifically expressed TFs, and expressed TGs in this context. Unlike PECA, which requires data from diverse cellular contexts as input, PECA2 requires only one paired data set (i.e., one sample) as input to infer the gene regulatory network. PECA2 first improves the TF-TG accuracy by taking the combinatorial regulation among cis-regulatory elements into account and aggregating the REs used by the same TF to regulate a given TG. An upstream TF regulating a TG should satisfy three conditions: (1) the TF should be expressed, (2) motifs of the TF should be enriched in the REs of this TG, and (3) the TF should be coexpressed with this TG across diverse cellular contexts. By combining this information, we define the trans-regulation score (TRS) between the ith TF and the jth TG as

\[ \mathrm{TRS}_{ij} = \mathrm{TFA}_i \times R_{ij} \times \Big(\sum_{k} B_{ik}\,\widetilde{\mathrm{RE}}_k\, I_{kj}\Big) \times \widetilde{\mathrm{TG}}_j, \]

where \(B_{ik}\) is the motif binding strength of the ith TF on the kth RE, defined as the sum of the binding strengths (motif position weight matrix–based log-odds probabilities; see the HOMER software for details) of all binding sites on this RE; and \(\widetilde{\mathrm{RE}}_k = \mathrm{RE}_k \times \mathrm{RE}_k/\mathrm{median}(\mathrm{RE}_k)\) represents the normalized accessibility of the kth RE. The first factor, \(\mathrm{RE}_k\), represents the actual accessibility of the RE, and the second factor represents its accessibility relative to the median accessibility level of this RE in external data: if an RE is accessible in the given cellular context and its accessibility is also much higher than its level in other contexts, then this RE is specifically accessible in the given cellular context. \(I_{kj}\) represents the interaction strength between the kth RE and the jth TG, which is learned from the PECA model on diverse cellular contexts. \(\widetilde{\mathrm{TG}}_j = \mathrm{TG}_j \times \mathrm{TG}_j/\mathrm{median}(\mathrm{TG}_j)\) represents the normalized expression level of the jth TG. \(\mathrm{TFA}_i\) represents the activity of the ith TF (the geometric mean of its normalized expression and its motif enrichment score on open regions); \(R_{ij}\) is the expression correlation of the ith TF and the jth TG across diverse cellular contexts. A higher regulation score \(\mathrm{TRS}_{ij}\) implies that the jth TG is more likely to be regulated by the ith TF. To decide which \(\mathrm{TRS}_{ij}\) are statistically significant, we randomly select some \(\{TF_i - TG_j\}\) pairs and take these pairs as negative controls. We choose the threshold of the regulation score by controlling the false discovery rate (FDR) at 0.001.
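A minimal matrix-form sketch of this TRS computation as reconstructed above (toy inputs; this is an illustration, not the authors' implementation):

```r
# B: TF x RE binding strength; Imat: RE x TG interaction strength;
# re_norm, tg_norm: normalized accessibility and expression; tfa: TF
# activity; R: TF-TG expression correlation across external contexts.
set.seed(1)
nTF <- 5; nRE <- 20; nTG <- 8
B       <- matrix(rexp(nTF * nRE), nTF, nRE)
Imat    <- matrix(rexp(nRE * nTG), nRE, nTG)
re_norm <- rexp(nRE)
tg_norm <- rexp(nTG)
tfa     <- rexp(nTF)
R       <- matrix(runif(nTF * nTG), nTF, nTG)

# re_norm * Imat scales row k of Imat by re_norm[k]; the outer product
# tfa %o% tg_norm supplies the TFA_i and normalized-TG_j factors.
TRS <- (tfa %o% tg_norm) * R * (B %*% (re_norm * Imat))
```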

PECA2 then improves the RE-TG accuracy by taking the TF combinatorial regulation into account and aggregating the TFs using a given RE. We define cis -regulation score (CRS) for RE-TG pairs based on the binding TFs’ TRS as
$$\mathrm{CRS}_{kj} = \sum_i B_{ik}\,\widetilde{\mathrm{RE}}_k\,\mathrm{TRS}_{ij}.$$
We approximate the distribution of the non-zero log2(1 + CRS) values by a normal distribution and predict RE-TG associations by selecting the pairs with P-value < 0.05.
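A minimal sketch of this cutoff, assuming `crs` is a hypothetical 1-D NumPy array of CRS values for candidate RE-TG pairs; the normal fit to the non-zero log2(1 + CRS) values follows the description above.

```python
import numpy as np
from scipy import stats

def significant_pairs(crs, alpha=0.05):
    """Flag RE-TG pairs whose log2(1 + CRS) is high under a normal fit."""
    x = np.log2(1 + crs)
    nz = x[x > 0]                                # fit on non-zero scores only
    mu, sd = nz.mean(), nz.std(ddof=1)
    pvals = stats.norm.sf(x, loc=mu, scale=sd)   # one-sided upper-tail P-values
    return pvals < alpha
```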

Regulatory module detection by matrix factorization

To detect key TF-TG subnetworks (core modules) from the TRS, we use non-negative matrix factorization (NMF). Before factorization, we apply the following transformations to the TRS matrix: (1) log2 transformation, TRS~ = log2(1 + TRS); (2) normalization across rows and columns, normalized TRS = Z(TRS~) + Z(TRS~^T)^T, where Z(x) denotes the Z-transformation of each row of matrix x; and (3) setting negative values to zero. We assign TFs and TGs to clusters based on the maximum factor loading. To determine the number of modules, we run NMF 50 times and calculate consistency based on the cophenetic correlation coefficient (Brunet et al. 2004). We choose K from the range 2-7, selecting the K that gives the most consistent results.
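A minimal sketch of the preprocessing and factorization steps with scikit-learn's NMF; it follows the three transformations listed above, but the function name and defaults are illustrative rather than the authors' implementation (the consensus clustering over 50 runs and the cophenetic-coefficient choice of K are omitted).

```python
import numpy as np
from scipy.stats import zscore
from sklearn.decomposition import NMF

def detect_modules(TRS, k, seed=0):
    T = np.log2(1 + TRS)                       # (1) log2 transform
    T = zscore(T, axis=1) + zscore(T, axis=0)  # (2) row-wise + column-wise Z
    T = np.clip(T, 0, None)                    # (3) negative values to zero
    model = NMF(n_components=k, init="random", random_state=seed, max_iter=500)
    W = model.fit_transform(T)                 # TF loadings (one row per TF)
    H = model.components_                      # TG loadings (one column per TG)
    tf_module = W.argmax(axis=1)               # assign by maximum factor loading
    tg_module = H.argmax(axis=0)
    return tf_module, tg_module
```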

Module-specific genes

Given a gene and its corresponding module, we define a module specificity score for the gene by comparing the normalized TRS of in-module TFs and out-module TFs (one-tailed rank-sum test). If the in-module TFs' normalized TRS scores are significantly higher than those of the out-module TFs, then the gene is specific to this module. The module specificity score is defined as the product of -log10(P-value) and the mean TRS score with the TFs in the same module. We take the 500 genes with the highest module specificity scores as module-specific genes. Similarly, we define module-specific TFs by comparing the normalized TRS of in-module genes and out-module genes.

Ancestor–descendant mapping via driver TF

To map the modules at adjacent time points, we define a transition score TS_ijt for the i-th module at the t-th time point and the j-th module at the (t+1)-th time point as the mean normalized TRS score from source TF set S_i,t to target TF set T_j,t+1:
$$\mathrm{TS}_{ijt} = \frac{1}{\lvert S_{i,t}\rvert\,\lvert T_{j,t+1}\rvert} \sum_{h \in S_{i,t}} \sum_{l \in T_{j,t+1}} \mathrm{NTRS}_{hlt},$$
where the normalized TRS, NTRS_hlt, is the Z-score of the TRS: the TRS minus the mean TRS score of the h-th TF, divided by the standard deviation of the h-th TF's TRS. The source TF set S_i,t is the top 50 module-specific TFs from the i-th module of the t-th time point. The target TF set T_j,t+1 is the top 20 driver TFs from the j-th module of the (t+1)-th time point; if there are fewer than 20 driver TFs, we add top-ranking module-specific TFs until T_j,t+1 contains 20 TFs. An ancestor-descendant relation should have a higher transition score, so we map the j-th module at the (t+1)-th time point to the module at the t-th time point with the maximum transition score.
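A minimal sketch of the mapping step, assuming `NTRS` is a hypothetical array of Z-scored TRS values between TFs at time t, and `sources`/`targets` are lists of integer index arrays for the module-specific source TF sets and driver-based target TF sets:

```python
import numpy as np

def map_modules(NTRS, sources, targets):
    """Map each module at time t+1 to its maximum-transition-score ancestor.

    NTRS    : (n_tf, n_tf) Z-scored TRS from source TFs (rows) to target TFs (cols)
    sources : list of index arrays, one per module at time t
    targets : list of index arrays, one per module at time t+1
    """
    ts = np.array([[NTRS[np.ix_(S, T)].mean() for S in sources]
                   for T in targets])   # (n_modules_t+1, n_modules_t)
    return ts.argmax(axis=1)            # ancestor module index for each t+1 module
```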

Baseline methods for TRS score

We have five baseline methods for the TRS: Pearson's correlation coefficient (PCC), summation of TF binding strength in nearby open regions (BO), summation of TF binding strength in interacting regions (BI), summation of TF binding strength in interacting open regions (BOI), and the combination of PCC with BOI. The PCC-based TRS is defined as the PCC between TF expression and TG expression across diverse cellular contexts. The binding-based baseline TRS are defined as follows:
$$\mathrm{BO}_{ij} = \sum_{k \in \text{open}} B_{ik}\,e^{-d_{kj}/d_0}, \qquad \mathrm{BI}_{ij} = \sum_{k \in \text{interacting}} B_{ik}\,e^{-d_{kj}/d_0}, \qquad \mathrm{BOI}_{ij} = \sum_{k \in \text{open} \cap \text{interacting}} B_{ik}\,e^{-d_{kj}/d_0},$$
where d_kj represents the distance between the k-th RE and the j-th TG, and d_0 is a constant, with 40 kb as the default value. We only consider REs within 200 kb of the TGs.

Identification and quantification of REs

We define regulatory elements (REs) as the union of open peaks from the ATAC-seq data (called by MACS2) at each time point; over the whole induction process, we obtain 174,059 REs in total. To quantify the accessibility of the peaks, we calculate an openness score for each sample in each region. Given a region of length L, we treat it as foreground and denote by X the count of reads in the region. To remove the effect of sequencing depth, we choose a background region of length L_0 and denote by Y the count of reads in this background window. The openness score is defined as the fold change of read counts per base pair:
$$O = \frac{X / L}{(Y + \delta) / L_0},$$
where δ is a pseudocount (the default value of δ is 5 in our implementation).
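Under the reconstruction above, the openness calculation is a one-liner; the function below is an illustrative sketch, not the authors' code, and its arguments are hypothetical scalars or NumPy arrays.

```python
def openness(X, L, Y, L0, delta=5.0):
    """Fold change of reads per base pair, foreground region vs background.

    X, L  : read count and length of the foreground region
    Y, L0 : read count and length of the background region
    delta : pseudocount guarding against an empty background window
    """
    return (X / L) / ((Y + delta) / L0)
```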

Gene specificity and GO analysis

To select specific genes at each time point, we define a gene specificity score based on expression in the current condition and the median expression level in publicly available diverse-context data and in the RA time course data. The gene specificity of the i-th gene at the t-th time point is defined as follows:
$$\mathrm{GS}_{it} = \frac{\mathrm{FPKM}_{i,t}}{\mathrm{median}(\mathrm{FPKM}_{i,\mathrm{public}})} \times \frac{\mathrm{FPKM}_{i,t}}{\mathrm{median}(\mathrm{FPKM}_{i,\mathrm{RA}})},$$
where median(FPKM_i,public) represents the median expression level of the i-th gene in public data, median(FPKM_i,RA) the median expression level of the i-th gene in the RA time course data, and FPKM_i,t the expression level of the i-th gene at the t-th time point. We select the top 200 specific genes and perform GO enrichment analysis with PANTHER version 14.1 (Thomas et al. 2003).

Reprogramming data analysis

For the TimeReg analysis of the reprogramming data, we set K to [1, 2, 2] at the three time points, respectively. ASCL1 ChIP-seq target genes are genes with an ASCL1 ChIP-seq peak on their promoters or enhancers. Driver TFs are ranked by -log10(P-value) × fold, where fold is the mean TRS score of up-regulated genes divided by the mean TRS score of non-up-regulated genes.

Experimental design of retinoic acid–induced mESC differentiation

The mouse ES cell line R1 was obtained from the American Type Culture Collection (ATCC cat. no. SCRC-1036). The mESCs were first expanded on a previously irradiated MEF feeder layer. Subculturing was then carried out on 0.1% bovine gelatin-coated tissue culture plates. Cells were propagated in mESC medium consisting of Knockout DMEM supplemented with 15% knockout serum replacement, 100 µM nonessential amino acids, 0.5 mM beta-mercaptoethanol, 2 mM GlutaMAX, and 100 units/mL penicillin-streptomycin, with the addition of 1000 units/mL LIF (ESGRO, Millipore).

mESCs were differentiated using the hanging drop method (Wang and Yang 2008). Trypsinized cells were suspended in differentiation medium (mESC medium without LIF) to a concentration of 50,000 cells/mL. Then, 20 µL drops (~1000 cells) were placed on the lid of a bacterial plate, and the lid was inverted. After 48 h of incubation, embryoid bodies (EBs) that had formed at the bottom of the drops were collected and placed in the wells of a six-well ultralow-attachment plate with fresh differentiation medium containing 0.5 µM retinoic acid for up to 20 d, with the medium changed daily.

We followed the ATAC-seq protocol published by Buenrostro et al. (2013) with the following modifications. The EBs were first treated with 0.25% trypsin + EDTA for 10-15 min at 37°C with pipetting. The pellet was then resuspended in the transposase reaction mix (25 µL 2× TD buffer, 2.5 µL transposase, and 22.5 µL nuclease-free water) and incubated for 30 min at 37°C. After purification, DNA fragments were amplified using a 1:30 dilution of 25 µM Nextera Universal PCR primer and Index primer under the following conditions: 5 min at 72°C; 30 sec at 98°C; and a total of 10 cycles of 10 sec at 98°C, 30 sec at 63°C, and 1 min at 72°C. The libraries for mESC, D2, D4, and D20 were sequenced on a NextSeq with 50-bp paired-end reads; the D10 library was sequenced on a HiSeq with 100-bp paired-end reads.

Total RNA was extracted using the Qiagen RNeasy mini kit. Libraries were constructed using the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs) with the following modifications. mRNA was first isolated from 1 µg of total RNA using the NEBNext Poly(A) mRNA Magnetic Isolation Module, then fragmented for 12 min at 94°C before first- and second-strand cDNA synthesis. The double-stranded cDNA was end repaired and ligated with the NEBNext adaptor, followed by AMPure XP bead purification (Beckman Coulter). Each library was amplified using NEBNext Universal PCR primer and Index primer (for details, see NEBNext multiplex oligos for Illumina [E7335, New England Biolabs]) under the following conditions: 30 sec at 98°C; six cycles of 10 sec at 98°C, 30 sec at 65°C, and 30 sec at 72°C; and a final extension of 5 min at 72°C. Additional PCR cycles (four to six) were necessary to obtain enough DNA for sequencing. Finally, equal amounts of DNA from each library were pooled, and 400-bp fragments were selected on 2% E-Gel SizeSelect gels (Thermo Fisher Scientific) and purified with AMPure XP beads. The library was sequenced on an Illumina HiSeq with 100-bp paired-end reads.

Experimental design of shRNA knockdown in mESC

For plasmids and shRNA, the pSuper.puro vector, containing the human H1 RNA promoter for ectopic expression of small hairpin RNA (shRNA), was obtained from Oligoengine. The shRNA constructs for each target were designed using the Clontech RNAi Target Sequence Selector program. DNA oligonucleotides were synthesized by Elim Biopharmaceutical, Inc. To anneal the oligos, DNA was incubated as follows: 30 sec at 95°C; 2 min at 72°C; 2 min at 37°C; and 2 min at 25°C. The annealed oligo inserts were then ligated into the pSuper.puro vector linearized with BglII and either HindIII or XhoI restriction enzymes. Following transformation, positive clones were selected by digestion with EcoRI and XhoI restriction enzymes and electrophoresis on a 1% agarose gel. In addition, the presence of the correct insert in the pSuper.puro vector was confirmed by sequencing before transfection into mammalian cells (sequencing primer: 5′-AAGATGGCTGTGAGGGACAG). The shRNA constructs used in this study are listed in Supplemental Figure S9A. The primers used to measure target gene knockdown efficiency by qRT-PCR are shown in Supplemental Figure S9B.

RNA interference (RNAi) experiments were performed with Nucleofector technology. Briefly, 12 µg of plasmid DNA was transfected into 3.5 × 10^6 mouse ES cells using the Mouse ES Cell Nucleofector kit (Lonza). After nucleofection, the cells were incubated in 500 µL of warm ES medium for 15 min and then split into four gelatin-coated 60-mm tissue culture plates containing 5 mL of warm ES medium. Puromycin selection (1 µg/mL) was introduced 18 h later, and the medium was changed daily. At 30 h, 48 h, and 72 h after puromycin selection, cells were collected for RNA isolation.

To minimize genomic DNA contamination, RNA was extracted with the RNeasy mini kit (Qiagen), and on-column DNA digestion was performed using the RNase-free DNase kit (Qiagen). Microarray hybridizations were performed on MouseRef-8 v2.0 expression BeadChip arrays (Illumina). To prepare the samples, 200 ng of total RNA was reverse transcribed, followed by T7 RNA polymerase-based linear amplification using the Illumina TotalPrep RNA Amplification kit (Applied Biosystems). After amplification, 750 ng of biotin-labeled cRNA was hybridized to gene-specific probes attached to the beads, and the expression levels of transcripts were measured simultaneously.

Software availability

Software and processed data for the RA time course are available on GitHub (https://github.com/SUwonglab/TimeReg, https://github.com/SUwonglab/PECA) and as Supplemental Code.

Data access

All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/) under accession number GSE136312.

Competing interest statement

The authors declare no competing interests.


Acknowledgments

We thank Lingjie Li, Shining Ma, and Zhanying Feng for helpful discussions. We also thank Miranda Lin Li for her help with the experiments. This work was supported by grants R01HG010359 and P50HG007735 from the National Institutes of Health (NIH). Y.W. was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 11871463 and 61671444 and by the Strategic Priority Research Program of the Chinese Academy of Sciences (no. XDB13000000).

Author contributions: W.H.W. conceived the project. Z.D. designed the analytical approach, performed data analysis with the help of J.X., and wrote the software. X.C. performed all biological experiments. Y.W. and W.H.W. supervised the research. All authors wrote, revised, and contributed to the final manuscript.

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.257063.119.

Freely available online through the Genome Research Open Access option.

  • Bansal M, Gatta GD, Di Bernardo D. 2006. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22: 815–822. doi:10.1093/bioinformatics/btl003
  • Brunet JP, Tamayo P, Golub TR, Mesirov JP. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci 101: 4164–4169. doi:10.1073/pnas.0308531101
  • Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10: 1213–1218. doi:10.1038/nmeth.2688
  • Cao S, Yu S, Li D, Ye J, Yang X, Li C, Wang X, Mai Y, Qin Y, Wu J, et al. 2018. Chromatin accessibility dynamics during chemical induction of pluripotency. Cell Stem Cell 22: 529–542.e5. doi:10.1016/j.stem.2018.03.005
  • Chen J, Kubalak SW, Chien KR. 1998. Ventricular muscle-restricted targeting of the RXRα gene reveals a non-cell-autonomous requirement in cardiac chamber morphogenesis. Development 125: 1943–1949.
  • Duren Z, Chen X, Jiang R, Wang Y, Wong WH. 2017. Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci 114: E4914–E4923. doi:10.1073/pnas.1704553114
  • Duren Z, Chen X, Zamanighomi M, Zeng W, Satpathy AT, Chang HY, Wang Y, Wong WH. 2018. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc Natl Acad Sci 115: 7723–7728. doi:10.1073/pnas.1805681115
  • Gorkin DU, Barozzi I, Zhang Y, Lee AY, Li B, Zhao Y, Wildberg A, Ding B, Zhang B, Wang M, et al. 2017. Systematic mapping of chromatin state landscapes during mouse development. bioRxiv doi:10.1101/166652
  • Graham V, Khudyakov J, Ellis P, Pevny L. 2003. SOX2 functions to maintain neural progenitor identity. Neuron 39: 749–765. doi:10.1016/S0896-6273(03)00497-5
  • Guan L, Chen X, Wong WH. 2019. Detecting strong signals in gene perturbation experiments: an adaptive approach with power guarantee and FDR control. J Am Stat Assoc doi:10.1080/01621459.2019.1635484
  • Hempel S, Koseska A, Kurths J, Nikoloski Z. 2011. Inner composition alignment for inferring directed networks from short time series. Phys Rev Lett 107: 054101. doi:10.1103/PhysRevLett.107.054101
  • Jen Y, Weintraub H, Benezra R. 1992. Overexpression of Id protein inhibits the muscle differentiation program: in vivo association of Id with E2A proteins. Genes Dev 6: 1466–1479. doi:10.1101/gad.6.8.1466
  • Kinney JB, Atwal GS. 2014. Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci 111: 3354–3359. doi:10.1073/pnas.1309933111
  • Li L, Wang Y, Torkelson JL, Shankar G, Pattison JM, Zhen HH, Fang F, Duren Z, Xin J, Gaddam S, et al. 2019. TFAP2C- and p63-dependent networks sequentially rearrange chromatin landscapes to drive human epidermal lineage commitment. Cell Stem Cell 24: 271–284.e8. doi:10.1016/j.stem.2018.12.012
  • Liu Q, Jiang C, Xu J, Zhao MT, Van Bortle K, Cheng X, Wang G, Chang HY, Wu JC, Snyder MP. 2017. Genome-wide temporal profiling of transcriptome and open chromatin of early cardiomyocyte differentiation derived from hiPSCs and hESCs. Circ Res 121: 376–391. doi:10.1161/CIRCRESAHA.116.310456
  • Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. 2006. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7: S7. doi:10.1186/1471-2105-7-S1-S7
  • Mascrez B, Ghyselinck NB, Chambon P, Mark M. 2009. A transcriptionally silent RXRα supports early embryonic morphogenesis and heart development. Proc Natl Acad Sci 106: 4272–4277. doi:10.1073/pnas.0813143106
  • Miraldi ER, Pokrovskii M, Watters A, Castro DM, De Veaux N, Hall JA, Lee JY, Ciofani M, Madar A, Carriero N, et al. 2019. Leveraging chromatin accessibility for transcriptional regulatory network inference in T helper 17 cells. Genome Res 29: 449–463. doi:10.1101/gr.238253.118
  • Mumbach MR, Satpathy AT, Boyle EA, Dai C, Gowen BG, Cho SW, Nguyen ML, Rubin AJ, Granja JM, Kazane KR, et al. 2017. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nat Genet 49: 1602–1612. doi:10.1038/ng.3963
  • Nishino J, Kim I, Chada K, Morrison SJ. 2008. Hmga2 promotes neural stem cell self-renewal in young but not old mice by reducing p16Ink4a and p19Arf expression. Cell 135: 227–239. doi:10.1016/j.cell.2008.09.017
  • Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché-Buc F. 2003. Gene networks inference using dynamic Bayesian networks. Bioinformatics 19: ii138–ii148. doi:10.1093/bioinformatics/btg1071
  • Ramirez RN, El-Ali NC, Mager MA, Wyman D, Conesa A, Mortazavi A. 2017. Dynamic gene regulatory networks of human myeloid differentiation. Cell Syst 4: 416–429.e3. doi:10.1016/j.cels.2017.03.005
  • Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. 1997. GeneCards: integrating information about genes, proteins and diseases. Trends Genet 13: 163. doi:10.1016/S0168-9525(97)01103-7
  • Sansom SN, Griffiths DS, Faedo A, Kleinjan DJ, Ruan Y, Smith J, Van Heyningen V, Rubenstein JL, Livesey FJ. 2009. The level of the transcription factor Pax6 is essential for controlling the balance between neural stem cell self-renewal and neurogenesis. PLoS Genet 5: e1000511. doi:10.1371/journal.pgen.1000511
  • Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould J, Liu S, Lin S, Berube P, et al. 2019. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176: 928–943.e22. doi:10.1016/j.cell.2019.01.006
  • Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW. 2005. Significance analysis of time course microarray experiments. Proc Natl Acad Sci 102: 12837–12842. doi:10.1073/pnas.0504609102
  • Su Y, Shin J, Zhong C, Wang S, Roychowdhury P, Lim J, Kim D, Ming GL, Song H. 2017. Neuronal activity modifies the chromatin accessibility landscape in the adult brain. Nat Neurosci 20: 476–483. doi:10.1038/nn.4494
  • Sucov HM, Dyson E, Gumeringer CL, Price J, Chien KR, Evans RM. 1994. RXRα mutant mice establish a genetic basis for vitamin A signaling in heart morphogenesis. Genes Dev 8: 1007–1018. doi:10.1101/gad.8.9.1007
  • Sun XH, Copeland NG, Jenkins NA, Baltimore D. 1991. Id proteins Id1 and Id2 selectively inhibit DNA binding by one class of helix-loop-helix proteins. Mol Cell Biol 11: 5603–5611. doi:10.1128/MCB.11.11.5603
  • Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13: 2129–2141. doi:10.1101/gr.772403
  • Treutlein B, Lee QY, Camp JG, Mall M, Koh W, Shariati SAM, Sim S, Neff NF, Skotheim JM, Wernig M, et al. 2016. Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 534: 391–395. doi:10.1038/nature18323
  • Wang X, Yang P. 2008. In vitro differentiation of mouse embryonic stem (mES) cells using the hanging drop method. J Vis Exp e825. doi:10.3791/825
  • Wang Y, Joshi T, Zhang XS, Xu D, Chen L. 2006. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 22: 2413–2420. doi:10.1093/bioinformatics/btl396
  • Wang RS, Zhang S, Wang Y, Zhang XS, Chen L. 2008. Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Neurocomputing 72: 134–141. doi:10.1016/j.neucom.2007.12.043
  • Wang S, Sun H, Ma J, Zang C, Wang C, Wang J, Tang Q, Meyer CA, Zhang Y, Liu XS. 2013. Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nat Protoc 8: 2502–2515. doi:10.1038/nprot.2013.150
  • Wapinski OL, Vierbuchen T, Qu K, Lee QY, Chanda S, Fuentes DR, Giresi PG, Ng YH, Marro S, Neff NF, et al. 2013. Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155: 621–635. doi:10.1016/j.cell.2013.09.028
  • Wapinski OL, Lee QY, Chen AC, Li R, Corces MR, Ang CE, Treutlein B, Xiang C, Baubet V, Suchy FP, et al. 2017. Rapid chromatin switch in the direct reprogramming of fibroblasts to neurons. Cell Rep 20: 3236–3247. doi:10.1016/j.celrep.2017.09.011
  • Zeng W, Chen X, Duren Z, Wang Y, Jiang R, Wong WH. 2019. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nat Commun 10: 4613. doi:10.1038/s41467-019-12547-1
  • Zou M, Conzen SD. 2005. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21: 71–79. doi:10.1093/bioinformatics/bth463
3.4 - Steps in Designing a Study

Here is a classical view of how a study proceeds.

  • Determine the objective of your study.
  • Determine the experimental conditions (treatments).
  • Determine the measurement platform.
  • Determine the biological sampling design.
  • Determine the biological and technical replicates.
  • Determine the sequencing depth.
  • Determine the tissue pool.
  • Determine the measurement design.
  • Do the study.

What's missing here is the study sequence. You do a study, analyze the data, and generate hypotheses, which lead to the next experiment. We talk about an experiment as if it is done in isolation, but this is almost never the case. We hope that any flaws in our earlier findings will be exposed by the experiments that follow and that any correct findings will be strengthened. We often hope that later studies will refine our results; for example, after finding that a gene is over-expressed in tumor tissue, we might find that this is limited to a region of the tissue or a time point in tumor development. Most statistical analysis is done on a "one experiment at a time" basis, but scientific advances require a series of experiments. A particular experimental design might seem optimal for a single experiment in the series, but its efficacy has to be judged in the context of its value as part of the entire series.

Let's look at some of the steps in more detail.

1. Determine the Objective of Your Study

The level of replication should match the objectives. For differential analysis, we compare the mean difference between treatments to the variability among replicates. For clustering, we average among replicates at the appropriate level. In either case, we need to decide on the appropriate level of replication. For example, for a study of tumor development in mice we might decide to use one tumor per mouse in several mice. However, if the mice typically develop several tumors, we might prefer to sample two tumors per mouse to determine whether within-mouse variability is smaller than between-mouse variability. If the tumors appear to be quite spatially heterogeneous, we might consider sampling several spatial locations within the tumor.

We also need to think ahead about the type of analysis that will be done. For example, if we are doing a differential analysis, we will do a t-test or something like a t-test, so we will need the means and the variances. If we are doing a clustering analysis, i.e., clustering tumors based on their gene expression, we would usually average the tumors of the same type before clustering.

As discussed earlier, you need to balance biological versus technical replicates. We usually don't need technical replicates unless we are interested in knowing what the measurement error is. However, when biological replication is expensive compared to technical replication (for example, when the specimens are very rare, e.g., mammoth tissue, or difficult to sample, e.g., the growth tip of a root hair), then technical replication can help by reducing measurement error.

2. Determine the Experiment Conditions (treatments) Under Study.

You need to determine the factors in your experiment, their levels and which combinations you will use as treatments.

Typically our studies are comparative, with two or more treatments. Factor levels should be selected that you expect to produce biologically meaningful differences. For example, if you are studying exposure to a toxin under experimental control, it is useful to have some a priori information about the exposures that will produce effects ranging from minimal to highly toxic. If this type of information is not available, it is useful to run a small pilot experiment to determine the appropriate doses.

If there is some sort of stimulus, it is good to have a control condition. Control conditions are not always clearly defined; for example, if a drug is injected into mice, should the control be "no injection", "injection of saline solution", or some other preparation, such as "injection with drug solvent"? When using organisms with some type of mutation, the wild type from the same genetic background is usually used as the control. Ethical standards also need to be considered: in animal and especially human studies, placebos cannot always ethically be used as controls. In these cases, the "control" should be the gold-standard treatment, against which a newer treatment can be compared.

Many experiments have a single factor with two or more levels. In this case, each level is a treatment, and many levels may be used. When there are two or more factors, the usual design is the factorial (also called crossed) design, which uses all level combinations as treatments.

Example of a factorial design with two genotypes, two tissue types, and three times

For a study with two genotypes (G and g), two tissue types (T and t), at three times (a, b, and c), the treatments would be all 12 combinations: GTa, GTb, GTc, Gta, Gtb, Gtc, gTa, gTb, gTc, gta, gtb, gtc.

If you use the entire factorial setup, you will have all 2 × 2 × 3 = 12 combinations as part of your experiment. Since replication is required, this experiment needs at least 24 observations to replicate everything.
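A minimal sketch of enumerating the crossed treatment set with Python's itertools; the level codes match the example above.

```python
from itertools import product

genotypes = ["G", "g"]
tissues = ["T", "t"]
times = ["a", "b", "c"]

# The factorial (crossed) design takes every combination of factor levels.
treatments = ["".join(combo) for combo in product(genotypes, tissues, times)]
print(len(treatments))  # 12
print(treatments)       # ['GTa', 'GTb', 'GTc', 'Gta', ..., 'gtb', 'gtc']
```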

You can see that designs with multiple factors at multiple levels can be very expensive due to the number of treatments. More compact designs, which do not use all combinations but are almost as informative, can be constructed, but they are seldom used in biology (although they are heavily used in industry). If you need to vary more than three factors in an experiment, seek statistical assistance.

Statisticians say a design is balanced if there are equal numbers of replicates at all levels of all factors. Balance turns out to confer robustness against assumptions about the distributions of the populations: problems such as nonnormality and differences in variance matter less when sample sizes are equal. However, we cannot always achieve equal sample sizes: some factor levels may be difficult to sample or expensive to apply (for example, rare genotypes or expensive drugs); equipment fails, mice die of causes unrelated to the experiment, and nucleotide samples degrade or are lost. Unbalanced experiments can be analyzed statistically, but the interpretation of badly unbalanced experiments may be difficult and may require additional statistical input.

Time is often a factor in experiments that involve biological processes such as growth or tissue damage. In biology, experiments over time are often called "time-course" experiments; statisticians may refer to them differently. When independent biological entities are sampled at the different times (say, groups of plants that are pulled up at weeks 1 and 3), the statistician considers time a factor like any other and assigns the samples at random to the sampling times. However, when a biological entity is sampled repeatedly over time, this is called a longitudinal or repeated measures design. When a time course design has repeated measurements on the same individual, this needs to be taken into account in the statistical analysis. We usually assume that there is correlation among the samples from the same individual, which reduces their variability compared to samples taken from different individuals. Use of repeated measures can improve power for detecting differences by reducing the variability of the comparisons (thereby improving the false negative rate), but ignoring the correlation in the analysis leads to a larger false positive rate. Biologists need to distinguish between the two types of time course studies.

In time course experiments, the times do not need to be equally spaced. Just as in "dosing" studies, it is important to have some a priori idea of which times will be interesting. It is also important, when possible, to start with samples that are in some sense synchronized, so that "time 0" means the same thing for every sample. However, even when the samples are synchronized at the start of the study, they may drift out of synchrony as time goes on, which will lead to greater variability at later time points.

Finally, in multifactor studies we may need to redefine the levels of some factors. For example, suppose we are growing larvae at different temperatures: we may need to measure time by growth stage rather than elapsed time to make meaningful comparisons. Suppose we are looking at exposures to a toxin and we have susceptible and resistant genotypes: should the dose of toxin be the level, or is it more meaningful to use some measure of damage? The choice of levels should be determined by the biological hypotheses under study.

3. Determine the Measurement Platform

At some point, usually while you are figuring out budgets, you will need to decide what the measurement platform should be. Over time, more platforms are becoming available and measurement costs are rapidly decreasing; however, data storage and analysis costs are not declining. Choice of platform is a complex decision.

The platform and reagent manufacturers and the technical staff at the microarray or sequencing center are valuable sources of information about the pros and cons of the measurement platforms. At Penn State, the staff of the Genomics Core in the Huck Institutes of the Life Sciences are both knowledgeable and helpful. That being said, much innovative work is being done in research labs developing novel methodology with these technologies.

Cost and accuracy are major considerations in selecting technologies and experimental designs within technologies. There are several components to cost: sample preparation, running the samples, and handling the resulting data. There are also less tangible costs that come from not getting all the required information from the experiment.

You need to consider the cost of 'doing it right' versus the cost of:

  • repeating the entire experiment,
  • validating all findings on another platform, or
  • losing publication or grant opportunities, or losing precedence.

Often what is cheap on the surface is not cheap in the end. It is worth putting in the time to figure out the best experiment, the best platform, and so on before you do the experiment. Even if the experiment is inexpensive to run, if you can't publish it you might as well not have done it.

In this course we are going to discuss the two currently most popular platforms: microarrays and short-read high throughput sequencing. Single-molecule sequencing is now also available and is relatively high throughput (hundreds of thousands of reads). However, millions of reads (sequences) are required for statistical analyses such as differential expression analysis or genotyping. In any case, as long as the data are counts of molecules or fragments, the analysis should be quite similar to the analysis of short-read data.

Microarrays

Microarrays for model species are very reliable now. Gene expression microarrays can be fairly readily designed for unsequenced species using newer sequencing technologies. They are very good for comparative analysis, but you can't find what is not on the array. You can also get differential hybridization for reasons that are not directly due to gene expression; for instance, sequence variation, splice variation, and other factors create differential hybridization.

Microarrays are being phased out for experimental work but remain cost-effective for many situations. Data management and data analysis costs are minimal. Since probe development and printing are fairly inexpensive, microarrays can also be used for non-model species.

High Throughput Short-read Sequencing

High throughput short-read sequencing produces more information than microarrays. It has a higher dynamic range, i.e., you can measure the low-end and high-end information more accurately. It can be applied to any biological entity, including mixtures of organisms such as gut microflora. High throughput sequencing does not require knowing what you are looking for and is not affected by genetic variation (except for large insertions or deletions). The costs have come down dramatically in the past couple of years, due to ultra-high throughput and multiplexing (the ability to label multiple samples and sequence them simultaneously).

Data management is the biggest challenge. The datasets are huge (several gigabytes), so storing and moving the raw data is difficult. The sequences are identified by mapping to a reference genome or transcriptome; if there is no reference, one must be created, which has its own unique challenges. For some studies, such as cancer transcriptomes, sequence variability may be high, so considerable expertise and computing must be applied before the data are in a format suitable for statistical analysis.

High throughput sequencing is currently the platform of choice for most research studies.  There are many design considerations after deciding to use short-read sequencing.  Some of these are discussed in [1].

4. Determine the Biological Sampling Design

The sampling design consists of several components: determining how the samples will be collected, determining whether blocking is appropriate, and determining the number and types of replicates.

A block is a set of samples that, in the absence of treatment effects, have a natural clustering. Usually, the clustering means that there is less variability within blocks than between blocks. For example, in our colon cancer example, a person would be a block, with the potential for taking two samples per person. There are many reasons why blocks arise. Some have to do with taking two tissue samples from the same individual. Another might be the setup of the experiment: mice that are in the mouse house at the same time share the same environmental conditions, and so are more similar than mice housed at different times. Plants grown in the same field may be more similar than plants grown in different fields. Blocks may also be induced by technical issues when samples are taken or processed in different labs, at different times, or by different personnel.

It is important to have a strong protocol for sample collection so that biases and extra variability are not introduced by, for example, different technicians handling the samples in slightly different ways, or samples being collected under different environmental conditions. Reducing variability by use of a strong protocol provides internal validity of the results: if the samples are collected and handled identically, then observed treatment differences should actually be due to the treatment. However, having too little variability in the samples can reduce external validity: the experimental results cannot be replicated at another time or in another lab, because the new samples differ in some way that had been controlled by the protocol.

Balance between internal and external validity can be achieved by blocking. Samples in the same block should be very similar in the absence of treatment effects, while different blocks should reflect the variability of the population. For example, if the environmental conditions in the greenhouse are important for gene expression in leaves, samples can be collected over several days, with enough samples on a single day to perform an entire replicate of the experiment. This way, slight changes in humidity, temperature, lighting, and other environmental factors will be sampled over the days, while being kept very uniform at each sampling time. It might be even better if the plants can be grown in several batches, with a complete replicate of the experiment done on each batch, which would reflect variability in the growth conditions. A common mistake is to confound a blocking factor with a treatment. For example, it might be more convenient to house one genotype at each lab, but if you want to compare genotypes, you will then not be able to determine whether the observed differences are due to genotype or to lab differences.

In some situations, it is not possible to replicate an entire experiment in a block because the number of treatments is greater than the number of samples in the block. For example, in twin studies we have only two samples (the sibs) in each block. Another example is two-sample microarrays, where we induce blocking by hybridizing two samples to the same microarray probes. When the entire experiment cannot be done in a single block, special designs called incomplete block designs can be used to provide some of the variance reduction due to blocks while maintaining the balance of the experiment. Complete block designs and balanced incomplete block designs are very amenable to statistical analysis.

Regardless of whether you are using complete or incomplete blocks, if the randomization has been done within blocks, then the block needs to be recorded. Blocking induces a type of correlation (the intraclass correlation) similar to the correlation in repeated measures designs; if it is not accounted for in the statistical analysis, the false positive and false negative rates will be inflated. A note on jargon: many biologists consider one replicate of the entire experiment to be an experiment and might state that they did three experiments. A statistician will call this three replicates or three blocks.

5. Determine the Number of Biological and Technical Replicates

Biological replication is used for statistical inference: the more biological replicates you have, the higher your power. Because part of every measurement is measurement error, if you have enough biological replicates there is no reason to take technical replicates, unless you really want to know the size of the measurement error. What is the right sample size? I used to say that it is always more than anyone can afford, but with the decreased cost of measurement, some large medical studies in humans now have quite large sample sizes. However, in the scientific setting, we generally have to balance the cost of the current experiment against the cost of the follow-up experiments that might validate or refine our results. For this reason, we often set a sample size in advance and then try to estimate the power of the experiment to detect differences of various sizes.

At minimum you should have three biological replicates per treatment. That being said, dozens and dozens of experiments have only two replicates. These experiments have very high false discovery and false non-discovery rates. However, they can provide rough measures of biological processes and can serve as pilot experiments to guide sample size estimation and experimental design for experiments with better replication.

Technical replication is essential when part of the experimental objective is to measure the measurement error. When the objectives are purely biological, technical replication is not as effective as biological replication. However, in some circumstances technical replicates may be much less expensive, for example if the specimens are rare or the treatments are difficult to apply. In those cases, technical replication can be used to reduce the measurement error by averaging the replicates. Typically, when answering biological questions, we want to compare the treatment effects to the natural biological variability of our samples. For this reason, we need to sample (and estimate) the full biological variability. Measurement error is then only one component of the sample variability, and the variance reduction due to averaging technical replicates quickly declines in importance.

[Figure: plots of the effect of technical (left) and biological (right) replication on the SE of the mean]

In the left figure, the black line shows the effect of technical replication on the SE of the mean when the biological variation and measurement variation are the same. The red line shows the effect of technical replication when the biological variation is four times larger than the error variation, and the blue line shows the effect when it is four times smaller. Even with very large error variation, the effect of technical replication quickly asymptotes at the magnitude of the biological variation. In the right figure, the black line shows the effect of biological replication on the SE of the mean when the biological variation and measurement variation are the same. The red line shows the effect of biological replication when the biological variation is four times larger than the error variation, and the blue line shows the effect when it is four times smaller. These curves asymptote at zero.
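The arithmetic behind these curves is simple enough to sketch: with n_bio biological replicates each measured n_tech times, the SE of a treatment mean is sqrt(sigma_bio^2/n_bio + sigma_err^2/(n_bio * n_tech)), so technical replication can only shrink the second term. A minimal illustration (the numbers are made up):

```python
import numpy as np

def se_mean(sigma_bio, sigma_err, n_bio, n_tech=1):
    """SE of a treatment mean with n_bio biological and n_tech technical reps."""
    return np.sqrt(sigma_bio**2 / n_bio + sigma_err**2 / (n_bio * n_tech))

# With fixed biological replication, adding technical replicates
# asymptotes at sqrt(sigma_bio^2 / n_bio) rather than going to zero.
for n_tech in (1, 2, 4, 8, 100):
    print(n_tech, round(se_mean(sigma_bio=1.0, sigma_err=1.0,
                                n_bio=3, n_tech=n_tech), 3))
```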

It is important to keep track of the blocks, biological replicates, and technical replicates when recording sample information, as these features of the experiment are used for estimating variability. One source of irreproducible statistical analyses is lost sample information.

In sequencing studies, another important determinant of measurement error is the amount of sequencing done, i.e., the number of fragments sequenced per sample, or sequencing depth. Increasing the sequencing depth (getting more reads per sample) increases power much as technical replication does: it reduces the measurement error relative to the treatment effects, but, just like technical replication, there is a non-zero asymptote for the total error. Sequencing depth must be traded off against additional biological replication.

The question still remains, how big should the sample be?

Appropriate sample sizes can be determined from a power analysis, which requires an estimate of the total variation and the sizes of the effects you hope to detect. You also need to specify the p-value at which you will reject the null hypothesis and the desired power. For simple cases such as the t-test, R and other statistical software can provide an estimate.
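For example, a minimal sketch of a two-sample t-test power calculation with Python's statsmodels; the effect size and targets here are illustrative (effect_size is the difference in means divided by the estimated standard deviation):

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size that detects a standardized effect
# of 1 SD at alpha = 0.05 with 80% power.
n_per_group = TTestIndPower().solve_power(effect_size=1.0, alpha=0.05,
                                          power=0.8, alternative="two-sided")
print(round(n_per_group, 1))  # roughly 17 per group
```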

In a high throughput experiment, each feature can have a different variance. Even if we agree that a 2-fold difference is an appropriate target for effect sizes that should be detectable with high probability, the differences in variance mean that we cannot hope to have the same power for every feature. One way to handle this is to use data from a previous similar experiment and estimate the distribution of power to detect a 2-fold difference from the distribution of feature variances. However, this type of analysis does not take into account the effect of simultaneously testing thousands of features, which reduces the power to detect differences for each feature.

It is important to store extra tissue until your paper is published. You never know when a sample may be destroyed in transit, which may cause imbalance in your design or loss of replication for some treatment. There are protocols for storing tissue samples in different locations to safeguard against, for example, power outages.

6. Determine the Pool of Genetic Material

When we do a comparative study of each feature over several tissues, we have to think hard about what we are actually comparing. I will take gene expression as an example, because that is the quantity I understand best.

Suppose we have two mouse liver tissue samples and measure the quantity of mRNA produced by Gene 1 in both. What does it mean if the quantity is higher in the first mouse? Even if we took the same size sample (by volume or by weight), we might not get the same yield of mRNA. Perhaps we should use a cell sorter, take the same number of cells from each liver, and try to obtain all of the mRNA from those cells, but this is seldom possible. Even in modern "single cell" studies, we are unlikely to capture every mRNA molecule in each cell. So perhaps the higher quantity in the first mouse is due to having more cells, or a better yield from the extraction kit, and so on.

Often in tissue studies, a bioanalyzer is used to compare the total RNA (or total mRNA) of the samples before the expression of each feature is quantified. The quantification device (microarray or sequencer) is usually loaded with roughly equal aliquots after the samples have been standardized to the same total. This means that if there is a difference in total expression (by weight, volume, cell, etc.), it is lost at this step.

Essentially, the analysis assumes that the total should be the same for each sample, but that the samples differ in how expression is distributed among features. Standardization is a great way to reduce a source of technical variability if this assumption is true, but it can erase the true picture if it is not. For example, when I arrived at Penn State, the Pugh lab was investigating the transcription mechanism in yeast and turning off transcription in some samples. This could not be detected using standard methods of sample standardization.

In single cell analysis we measure the nucleic acid content of a single cell. More typical samples are tissues, organs, or even whole organisms. Clearly these samples are pools of genetic material from many cells. We cannot assume that expression in all of these cells is the same; in fact, one of the fascinating discoveries of single cell studies is how variable cells can be within a single tissue in a single individual. Tissue samples are like a (possibly weighted) average of the cells in the sample.

In some studies we also pool tissues or nucleic acid samples across individuals. If we have equal, well-mixed samples (based on, e.g., volume, weight, or mRNA molecules), then the tissue pool is like an average sample. Gene expression in the pool will be like the average expression of the individuals and hence less variable than the individuals in the pool. When sample labeling is a significant part of the cost, this can provide considerable savings, because the entire pooled sample is labeled together. In this case, a biological replicate would be another pool of tissues or nucleic acid samples from an independent set of individuals of the same size, and the statistical analysis would treat the number of pools as the sample size (not the total number of individuals).

A common mistake in pooling is to pool all the samples and then take multiple samples from the pool. These are technical replicates of the pooled sample and cannot be used to assess biological variability. This is exactly the right thing to do when testing measurement protocols (as the samples should be identical except for measurement differences) and exactly the wrong thing to do for biological studies. On the other hand, if the budget allows 50 mice to be raised but only 10 sequencing samples, 10 pools of 5 mice provide almost the same variance reduction as individual measurements of the 50 mice, with the 10 pools providing an estimate of the biological variation averaged over the 5 mice in each pool. With sequencing technologies, individual nucleic acid samples can be barcoded with a unique label before pooling and sequencing (multiplexing), which allows the investigator to sort the sequences that originated in each sample. When the expensive step is sequencing, this is a great cost-saver, giving all the advantages of keeping the samples separate and of pooling. These days, both labeling and sequencing are expensive; along with sample size, the investigator has to balance the costs of barcoding and sequencing.
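The variance arithmetic behind the 10-pools-of-5 claim can be sketched directly; sigma_bio is the between-individual SD, sigma_m the per-measurement SD, and the numbers are illustrative:

```python
import numpy as np

def se_pooled(sigma_bio, sigma_m, n_pools, k):
    """SE of a treatment mean from n_pools pools of k individuals each.
    Pooling averages biological variation within each pool; each pool
    still carries one measurement's worth of error."""
    return np.sqrt((sigma_bio**2 / k + sigma_m**2) / n_pools)

def se_individual(sigma_bio, sigma_m, n):
    """SE of a treatment mean from n individually measured samples."""
    return np.sqrt((sigma_bio**2 + sigma_m**2) / n)

print(round(se_pooled(1.0, 0.3, n_pools=10, k=5), 3))   # 10 pools of 5 mice
print(round(se_individual(1.0, 0.3, n=50), 3))          # 50 individual mice
```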

Once we have all of our nucleic acid samples, we need to lay out the measurement design.

7.  Determine the Measurement Design

Measurement design has to do with how we arrange the samples in the measurement instrument. Although the analyses of microarray and sequencing data are quite different, the issues for measurement design are similar.

The main issue is the block correlation induced by measuring samples at the same time on the same instrument. For example, a common design for microarrays is a set of 24 microarrays laid out in a 4-by-6 grid on a slide. For the popular "two-channel" microarrays, each microarray can hybridize simultaneously to two samples with different labels. In this setup, we need some idea of the correlations we would induce if we took 96 technical replicates (supposedly identical) and hybridized them to 48 microarrays on two slides, with two samples per microarray. We might expect samples on the same microarray to be more similar than samples on different microarrays on the same slide, which in turn would be more similar than samples on different slides. If this is true, we would want to balance our samples across the arrays and slides. A common simplifying assumption is that there is no slide effect, i.e., two microarrays on the same slide are as variable as two microarrays on different slides. We also usually assume that two samples on the same microarray are less variable than two samples on different microarrays. The practical implication of these two assumptions is that the individual microarrays should be treated as blocks. If we have only two treatments, we should have one replicate of each on every microarray; if we have more treatments, we should use some type of incomplete block design.

Another common design for microarrays is 24 microarrays on a slide, but with only one sample per microarray. Under the same assumptions, we do not have a blocking factor, and samples can be assigned at random to the microarrays.

With short-read sequencing platforms, we typically have multiple sequencing lanes on a sequencing plate. (In the discussion above, replace "microarray" with "lane" and "slide" with "plate".) Multiplexing allows us to run multiple samples, identified by unique labels, in each lane. The typical assumption is that there are no plate effects and that multiplexing does not induce a correlation in the percentage of reads in each sample originating from each feature. Multiplexing does affect the total number of reads retrieved from each sample, but this is accounted for in the statistical analysis, so the samples can be treated as independent and lane does not need to be treated as a block.

The assumptions above have to be tested each time a new platform is developed. This is one of the uses of technical experiments, which use technical replication.

Many special designs have been proposed for two-channel platforms such as two-channel microarrays to handle the blocks of size two and also to balance any measurement biases induced by the labels. Some discussion of these can be found in [2] and [3], for example. For single-channel microarrays, however, we can treat the sample measurements as independent. This means that we do not have to consider blocking at the measurement level, as long as the samples are all run at the same time. If the samples are run in batches, the batches should be considered blocks and should contain entire replicates of the experiment.

Although there seems to be little evidence of label or lane effects for sequencing data, there are two proposals for measurement designs that treat the lanes as blocks and guard against the loss of an entire lane of data. Both are feasible due to the possibility of multiplexing. Multiplexing with 96 samples is now routine, and 384 is possible on some platforms. One suggestion is to use the lane as a block, placing one replicate from each treatment in the lane, so that the number of lanes is the same as the number of replicates [4]. Another suggestion is to multiplex the entire experiment (all replicates) into a single lane [5].

Multiplexing reduces the number of reads per sample. Current high throughput systems produce about 200 million reads per lane, so multiplexing 96 samples yields about 200/96 ≈ 2 million reads per sample, which is not sufficient for most studies. Thus the designs above call for technical replicates of the blocks in multiple lanes if needed. This can be particularly effective in the second design: a single lane can be run as a pilot, and if the reads per sample are not well balanced, the mixture can be rebalanced (without relabeling the samples) before running the additional lanes. In both designs, the loss of a lane decreases the replication of the study (and thus the power) but ensures that data will be available for all treatments. As well, if labeled nucleotides are stored, another lane of data can be run later with the batch effect confounded with the lane effect rather than with treatment, which is preferable for the statistical analysis.

[1 ] Honaas, Loren, Krzywinski, Martin, Altman, Naomi (2015) Study Design for Sequencing Studies . in: Methods in Statistical Genomics . Mathe, E. and Davis, S. (editors). Springer. (in press)

[2] Kerr, M. K. & Churchill, G. A. (2001a). Experimental design for gene expression microarrays. Biostatistics 2,183–201.

[3] Altman, N. S. & Hua, J. (2006). Extending the loop design for 2-channel microarray experiments. Genetical Research 88(3), 153–163.


[4] Nettleton, D. (2014). Design of RNA sequencing experiments. In: Datta, S. & Nettleton, D. (eds.), Statistical Analysis of Next Generation Sequencing Data. Springer.

[5] Auer, P. L. & Doerge, R. W. (2010). Statistical design and analysis of RNA sequencing data. Genetics 185(2), 405–416.





Graphical models are widely used to study biological networks. Interventions on network nodes are an important feature of many experimental designs for the study of biological networks. In this paper we put forward a causal variant of dynamic Bayesian networks (DBNs) for the purpose of modeling time-course data with interventions. The models inherit the simplicity and computational efficiency of DBNs but allow interventional data to be integrated into network inference. We show empirical results, on both simulated and experimental data, that demonstrate the need to appropriately handle interventions when interventions form part of the design.

Simon E. F. Spencer. Steven M. Hill. Sach Mukherjee. "Inferring network structure from interventional time-course experiments." Ann. Appl. Stat. 9 (1) 507 - 524, March 2015. https://doi.org/10.1214/15-AOAS806


Time course analysis with DESeq2

Approximate time: 20 minutes

Learning Objectives

  • Discuss time course analyses with DESeq2

Time course analyses with LRT

Although static measurements of gene expression remain popular, capturing biological processes over time is essential to reflect their dynamic nature, particularly when patterns are complex and not simply ascending or descending. When working with this type of data, the Likelihood Ratio Test (LRT) is especially helpful. We can use the LRT to explore whether there are any significant differences across a series of timepoints and further evaluate differences observed between sample classes.

For example, suppose we have an experiment looking at the effect of treatment over time on mice of two different genotypes. We could use a design formula for our ‘full model’ that would include the major sources of variation in our data: genotype , treatment , time , and our main condition of interest, which is the difference in the effect of treatment over time ( treatment:time ).

NOTE: This is just example code for our hypothetical experiment. You should not run this code .
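To make the note above concrete, here is a minimal sketch of such a full model; the objects counts and meta, and the metadata column names genotype, treatment, and time, are assumptions for this hypothetical experiment:

    library(DESeq2)
    # Full model: main effects plus the treatment-by-time interaction of interest
    dds <- DESeqDataSetFromMatrix(countData = counts,
                                  colData   = meta,
                                  design    = ~ genotype + treatment + time + treatment:time)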

To perform the LRT test, we also need to provide a reduced model, that is the full model without the treatment:time term:
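Sketched under the same assumptions, the reduced model drops only the interaction term:

    # Reduced model: everything in the full model except treatment:time
    reduced_model <- ~ genotype + treatment + time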

Then, we could run the LRT by using the following code:
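Continuing the sketch with the hypothetical objects above (DESeq2's DESeq() function does accept test = "LRT" together with a reduced formula):

    # The LRT compares the fit of the full and reduced models for each gene;
    # genes whose fit improves significantly under the full model show a
    # treatment:time effect
    dds_lrt <- DESeq(dds, test = "LRT", reduced = reduced_model)
    res_lrt <- results(dds_lrt)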

To understand what kind of gene expression patterns will be identified as differentially expressed, we have a few examples below. In the plots below we have Time on the x-axis and gene expression on the y-axis. In this dataset there are two samples for each time point, one having undergone some treatment (red) and the other without (blue).

This first figure depicts the type of gene that will not be identified as differentially expressed. Here, GeneX is differentially expressed across the time points; however, there is no difference in that expression pattern between the treatment groups.

[Figure: GeneX changes over time, but its pattern is the same in both treatment groups]

The type of gene expression patterns we do expect the LRT to return are those that exhibit differences in the effect of treatment over time. In the example below, GeneX displays a different expression pattern over time for the two treatment groups.

[Figure: GeneX follows a different time profile in each of the two treatment groups]

Continuing with our example dataset, after running the LRT we can determine the set of significant genes using a threshold of padj < 0.05. The next step would be to sort those genes into groups based on shared expression patterns, and we could do this using degPatterns() . Here, you will notice that we make use of the col argument since we have two groups that we are comparing to one another.
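A sketch of that step, assuming an rlog-transformed expression matrix rld_mat (a hypothetical name) and the res_lrt object from above; degPatterns() comes from the DEGreport package:

    library(DEGreport)
    # Keep genes significant in the LRT at padj < 0.05
    sig_genes <- rownames(res_lrt)[which(res_lrt$padj < 0.05)]
    # Cluster their expression by shared pattern over time; 'col' splits the
    # profiles by the second grouping variable (here an assumed "treatment" column)
    clusters <- degPatterns(rld_mat[sig_genes, ],
                            metadata = meta,
                            time = "time",
                            col  = "treatment")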

Depending on what type of shared expression profiles exist in your data, you can then extract the groups of genes associated with the patterns of interest and move on to functional analysis for each of the gene groups of interest.

This lesson has been developed by members of the teaching team at the Harvard Chan Bioinformatics Core (HBC) . These are open access materials distributed under the terms of the Creative Commons Attribution license (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

  • Materials and hands-on activities were adapted from RNA-seq workflow on the Bioconductor website


This is my problem. I have data from label-free proteomics experiment. I have 4 strains with 5 time points in each and 3 replicates. One of the strains is a negative control and expresses nothing, while 3 others are induced to express different proteins at time point 0. Overall I have complete data (all strains, time points and replicates) for around 1000 proteins.

What I'm interested in is finding which genes have significantly different profiles of expression in the other 3 strains compared to negative control. And I'm not sure what to do? Should I just make a t-test of negative control vs. each of the strains in all 5 time points? Or is there a more sophisticated solution?

There are a number of software packages for time series microarray experiments, many based on clustering methods such as hierarchical clustering. Please do a Google search for the detailed methods. You may try software like STEM, a tool for the analysis of short time series gene expression data, on your data: https://www.cs.cmu.edu/~jernst/stem/

The problem in my case is how to discover what is significant, not how to cluster. But thanks anyway.


You can try EDGE for time-series analysis (and I think SAM also has a time-series module):

http://www.genomine.org/edge/index.html

I have also used the 'timecourse' package in R and found it to be quite useful:

http://www.bioconductor.org/packages/2.12/bioc/html/timecourse.html

Thanks, EDGE seems partially useful for the task. The problem is, I'm comparing 4 strains, and it can't tell me which one is different compared to which one...

True, but you could either compare each of the 3 strains to the control separately and/or pool the 3 strains compared to the control. I'm guessing that it isn't really important how much the 3 altered strains vary from each other (although you could continue to do pair-wise comparisons, if needed).


I think ANOVA is a better choice in this case than an all-versus-all t-test, which will inflate the type I error due to multiple testing. If you suspect the data deviate strongly from the assumptions, you can use the Kruskal-Wallis non-parametric alternative instead. Both methods are available in R.

But with ANOVA I won't know which of the strains is significantly different from negative control. It can also miss the situation where Mut1 and Mut2 are significantly different from each other, but neither of them are significantly different from the negative control.

If you want to find out which group is contributing, you can also run a post-hoc test individually for the genes coming up in the ANOVA. There is a little bit of speculation involved in your second assumption; it depends on how you specify the model formula. But in general, ANOVA should have more power than an all-versus-all t-test in such a setting.
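A minimal sketch of that workflow for a single protein, assuming a long-format data frame protein_df with columns abundance and strain (hypothetical names); in practice you would loop this over all ~1000 proteins and adjust the resulting p-values for multiple testing:

    # One-way ANOVA across strains, then a post-hoc test on significant hits
    fit <- aov(abundance ~ strain, data = protein_df)
    summary(fit)     # overall F-test: does any strain differ?
    TukeyHSD(fit)    # pairwise strain comparisons, adjusted for multiplicity

    # Non-parametric alternative if the assumptions look doubtful
    kruskal.test(abundance ~ strain, data = protein_df)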


The limma user's guide has a section on how to analyze time course data with linear models using splines, which might be of interest to you.

If your protein expression data come from peptide counts, you may want to run the data through voom (also explained in the limma user's guide) before doing the linear model fitting.
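A sketch of the spline approach described in the limma user's guide, assuming a count matrix counts and vectors strain and time describing the samples (all hypothetical names):

    library(limma)
    library(splines)
    # A natural cubic spline basis lets each strain have its own smooth time curve
    X <- ns(time, df = 3)
    design <- model.matrix(~ strain * X)
    # voom converts counts to log2-CPM with precision weights before linear modelling
    v   <- voom(counts, design)
    fit <- eBayes(lmFit(v, design))
    # Moderated F-test on the strain:spline interaction terms:
    # do the time curves differ between strains?
    topTable(fit, coef = grep(":", colnames(design)))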


I sometimes take another approach: cluster analysis using BioLayout Express. Normalize your data, check on their website which data format BioLayout accepts, run the MCL clustering, and then view the clusters as plotted graphs. Look for a cluster whose gene expression pattern matches your hypothesis across your samples and time points.


Russia after Kemerovo fire: Tuleyev’s 'brawlers', Putin's expectations and spontaneous support campaigns

Political experts declare the failure of the existing local government in Kemerovo Oblast

Almost all of the news flow on 27 March was concentrated on the consequences of the tragedy in Kemerovo. While a crowd of thousands demanded that the authorities resign – and some of its participants listened to the vice-governor accuse them of seeking publicity – the head of Kemerovo Oblast finally apologised for what had happened: not in front of the angry crowd, but in front of President of Russia Vladimir Putin. Residents of the region, in turn, were ultimately given only the label 'brawlers'. Realnoe Vremya tells the details, with political experts' forecasts of how the situation will develop.

''Young man, do you want to capitalise on this grief?''

The morning in Kemerovo started with a spontaneous demonstration of thousands of locals who demanded an investigation into the causes of the fire in the Winter Cherry Trade and Leisure Centre and the resignation of the regional government together with Governor of Kemerovo Oblast Aman Tuleyev. In brief, the people's demands came down to two slogans: ''Truth!'' and ''Out!'' Vice-Governor of Kuzbass Sergey Tsivilyov and First Deputy Governor Vladimir Chernov came out to answer to the relatives of the dead, the victims, and onlookers.

People who had gathered spoke into the microphone about what concerned them and ''talked'' with Tsivilyov. At one point the vice-governor, who was expected to take over as governor of Kemerovo Oblast, tried to take the microphone away from a heartbroken man, accompanying his actions with the question: ''Young man, do you want to capitalise on this grief?'' It then became known that this young man, Igor Vostrikov, had lost his sister, his wife and his three little children – aged seven, five and two – in the fire. Later Tsivilyov apologised to the people.

''I apologise once again to those who are in this difficult situation. And I kneel in front of these people,'' the vice-governor addressed the crowd, and he really did kneel. The crowd immediately shouted ''Shame!'' and ''No need!'', but in the end part of the crowd welcomed Tsivilyov's gesture with applause.


Vice-Governor of Kuzbass Sergey Tsivilyov came out to answer to the relatives of the dead, the victims, and onlookers. Photo: Aleksandra Kryazheva (sputniknews.kz)

Inappropriate prank and morgue raid

People at the demonstration repeatedly asked for real figures on the dead to be announced – on 26 March, claims that some 300-400 people had died in the fire and that the authorities were diligently hiding it began to spread on social networks. In response, Sergey Tsivilyov said that, according to relatives, 64 people were missing. Earlier the Investigative Committee had denied the information about hundreds of victims. During the demonstration, the residents themselves compiled a list in which anyone could enter the names of missing relatives and acquaintances.

Mayor of Kemerovo Ilya Seredyuk, in turn, gathered residents who were willing to go through local morgues and visit the fire scene to see for themselves that the information about hundreds of dead did not correspond to reality.

In the end, the media reported that Ukrainian phone prankster Evgeny Volnov was responsible for spreading the false information: he had posted on his YouTube channel a recording of a call to a Kemerovo morgue made in the name of an emergency ministry employee. In the recording he asks how many places are left in the morgue and feigns indignation that, supposedly, 300 people died in the fire. According to newspapers, he not only played this ''trick'' but also, in the name of the presidential administration, sent strangers to collect the flowers and toys the mourners had left.


Aman Tuleyev did not appear in front of the residents but was present at the meeting with President of Russia Vladimir Putin. Photo: kremlin.ru

Accusations instead of apologies

The governor of the region himself, Aman Tuleyev, did not appear in front of the residents but was present at the meeting with President of Russia Vladimir Putin, who landed in Kemerovo on the morning of 27 March. After laying flowers at the memorial to the fire victims near the shopping centre, the president visited victims in hospital and then gathered Minister of Emergency Situations Vladimir Puchkov, Minister of Health Veronika Skvortsova, Chairman of the Investigative Committee Aleksandr Bastrykin, Plenipotentiary Envoy to the Siberian Federal District Sergey Menyailo and Tuleyev himself.

At the meeting, the governor apologised for the first time for what had happened – though to the president, not to the people.

''Mr Putin, you personally called me. Many thanks again. I personally apologise for what happened on our territory. But when I reported to you, I said 27 people were missing. Then it started to snowball,'' Tuleyev told the president.

Notably, the word ''brawlers'' – the governor's characterisation of the mourners at the demonstration – was deleted from the version of the transcript published on the government's website.

''There are about 200 people there today. It's not relatives of the dead, it's those who brawl,'' was how Tuleyev explained the people's discontent to the president, casting them as an opposition force.

Meanwhile, residents on Freedom Square were demanding to meet with Vladimir Putin in person.

Putin met with citizens in the building of the forensic laboratory, and the conversation with residents of Kemerovo was far from easy: citizens demanded the resignation of Governor Tuleyev, while also thanking the president for his support.

Worldwide mourning

Meanwhile, mourning for the dead began to be declared, both officially and unofficially, in other cities across Russia. Later Vladimir Putin signed a decree declaring 28 March a day of nationwide mourning.

Russians had also been holding spontaneous memorial gatherings since 26 March. According to Meduza, the school in Beslan where 186 children died in 2004 put up the words ''Kemerovo. We Grieve'' the day after the fire. Memorials also appeared in other Russian cities and abroad on 27 March. People brought flowers and toys in Magadan, Moscow, Saint Petersburg, Khabarovsk, Ekaterinburg, Belogorsk, Yakutsk, Irkutsk, Kursk… the list is constantly growing. Kazan paid tribute to the memory of the dead near the fountain in Lenin Garden and in the city centre, near the Trust sculpture of a girl, a boy and a dog. Two pieces of paper were placed there. One says: ''Kids, forgive us!'', the other: ''Kemerovo, we're with you!''

Citizens of Kazan now gather spontaneously on Baumana street, lighting candles and bringing flowers.

Sympathisers are bringing flowers to Russian embassies and consulates abroad, among them those in Prague, Yerevan, Odessa and Tel Aviv.

Channel One employees ran a solidarity campaign and shot a video with a poster #Kemerovowearewithyou.

''Tsivilyov has thrown himself onto a pillbox today. But he didn't hit the pillbox''

Realnoe Vremya asked political experts to assess the current situation in the region and forecast further development of events. Director of Political Expert Group Konstantin Kalachev thinks the existing system of the power vertical under Tuleyev ''turned out to be rotten inside''.


Vice-Governor of Kuzbass Sergey Tsivilyov has repeatedly been named one of the main candidates to replace Aman Tuleyev. Konstantin Kalachev notes that at the demonstration today he ''buried'' himself as a public politician.

''Director General of Kolmar Tsivilyov was supposed to replace Tuleyev. This is why he threw himself onto a pillbox today. But he didn't hit the pillbox. Figuratively speaking, even standing on his knees with a candle probably won't erase the fact that he never learned to speak the same language as the people. And, in fact, he wasn't received with affection,'' says the expert.

He says Tsivilyov's accusation that Igor Vostrikov was seeking publicity is a serious mistake.

''It's a serious mistake. In general, if he had handled today's situation, he would have been the next governor. Now it's not so clear,'' supposes Kalachev. In the political expert's opinion, the mayor of Kemerovo can also expect resignation after the tragedy.

Director of the Regional Politics Development Centre Ilya Graschenkov also forecasts Ilya Seredyuk's departure, while adding that the mayor of Kemerovo is the only official who speaks the same language as the residents. Even so, the population's loyalty may not save him from departure, the newspaper's interlocutor supposes.

From the expert's point of view, the question of Tuleyev's resignation had long been settled but dragged on because the current governor was promoting his own heir while Moscow promoted its own. Now that the tragedy has made it obvious that Tuleyev is not fulfilling his obligations, the political expert believes a new governor will be appointed only in several months' time, so that the decision does not look like one made under public pressure.


Online assessment in the age of artificial intelligence

Alexander Stanoyevitch | Open access | Published: 19 August 2024 | Volume 3, article number 126 (2024)

Online education, while not a new phenomenon, underwent a monumental shift during the COVID-19 pandemic, pushing educators and students alike into the uncharted waters of full-time digital learning. With this shift came renewed concerns about the integrity of online assessments. Amidst a landscape rapidly being reshaped by online exam/homework assistance platforms, whose stocks soared as students availed themselves of their questionable exam assistance, and by the emergence of sophisticated artificial intelligence tools like ChatGPT, traditional methods of assessment faced unprecedented challenges. This paper presents the results of an observational study, using data from an introductory statistics course taught every semester by the author, and delves into the proliferation of cheating methods. Analyzing exam scores from before and after the introduction of ChatGPT, the research unpacks the extent of cheating and provides strategies to counteract this trend. The findings starkly illustrate significant increases in exam scores when exams of similar difficulty were administered online rather than in person (pre-Covid); the format, difficulty, and grading of the exams remained the same throughout. Although randomized controlled experiments are generally more reliable than observational studies, we indicate when we present the data why an experiment would not be feasible for this research. In addition to presenting these findings, the paper offers insights, based on the author's extensive experience, to guide educators in crafting more secure online assessments in this new era, both for introductory-level courses and for more advanced ones. The results are relevant to introductory courses in any subject that can use multiple-choice exams, but the recommendations for upper-level courses will be relevant primarily to STEM subjects. The research underscores the pressing need to reinvent assessment techniques to uphold the integrity of online education.


1 Introduction

Distance learning dates back to 1728, when Caleb Phillips advertised his shorthand course, offered by mail correspondence, in the Boston Gazette newspaper; see [ 1 ]. The internet gave distance learning a tremendous boost by greatly reducing the turn-around times between professor-student correspondence, and the development of new technologies has significantly increased the quality of online courses in the years since. The single most significant catalyst for online education came with the pandemic, when every instructor, from primary education through graduate school, had to quickly develop sufficient expertise to offer online versions of all of their courses.

The pandemic coincided with perhaps the most profound revolution in distance learning ever. Students and instructors who had never had any experience with online learning suddenly became full-time users. Debates flourished, and studies of the effectiveness of online courses came out in droves. A nice study of the impact of Covid-19 on 309 courses given at eight colleges and universities spanning four continents is given in Bartolic et al. [ 2 ].

A crucial concern regarding online courses is the integrity of assessments. Before the pandemic, many instructors resisted the idea of offering their courses online, fearing that assessments would be compromised. The reality is that, compared to administering exams in a classroom setting under the watchful eye of a proctor, allowing students to take exams online at home makes it almost impossible to prevent cheating. There are evident actions students might take during an online exam, with or without permission: opening their books and notes, consulting the web, collaborating with classmates, or seeking assistance from advanced students. Several companies that ostensibly advertise homework help had no qualms about helping students cheat on exams. During the pandemic, several of these companies and their shareholders profited immensely from this dubious practice; one of them saw its market capitalization nearly quadruple from the onset of the pandemic to its zenith. For a modest monthly fee, students can obtain one-on-one assistance from a tutor – typically low-paid and based overseas – to complete their exams, and gain access to an extensive database of solved exam problems across various subjects. A colleague of the author purchased such a subscription from one of these companies to understand the full extent to which it could undermine online assessment, and the findings were alarming: this instructor discovered that their exam questions, complete with answers, appeared on the platform the very next day!

The focus of this paper is assessing student performance in online courses when the exams are administered without any sort of online proctoring. In a face-to-face class, when we proctor an exam in the classroom, we have complete control over what resources students can use. For example, for a math exam we typically allow calculators, but neither laptops nor cellphones. If the same exam is given online without supervision, there is much that cannot be controlled: despite instructions, students might use their cellphones or computers to access online resources or even network with classmates on the exam questions. This paper will provide compelling evidence of the prevalence of cheating on online exams using data from the author's introductory large lecture statistics course, which he has been teaching every fall and spring semester for over a decade, with 120–150 students each semester.

Another stumbling block in ensuring the integrity of online assessments is the recent emergence of artificial intelligence-driven large language models, most notably ChatGPT, which is developed and hosted by OpenAI (in which Microsoft is a major investor). Google has developed its own large language model, known as Gemini. Unlike services that employ tutors and maintain a database of exam questions and solutions, AI-based large language models have been trained on vast data sets and can answer questions from a myriad of contexts with remarkable precision. ChatGPT offers free access to an earlier version (GPT-3.5) with a limit on the number of questions within a certain timeframe. Subscriptions are available that eliminate these restrictions and use a more powerful model (GPT-4).

Figure 1, sourced from OpenAI's technical report introducing GPT-4, illustrates its performance across a range of well-known exams:

Figure 1: This figure is taken from the GPT-4 Technical Report by OpenAI (the company behind ChatGPT) accompanying the introduction of GPT-4 in March 2023. The performance percentiles on various tests are depicted: the blue bars represent the previous version, GPT-3.5, and the green bars represent GPT-4. For instance, on the GRE-Verbal exam (a national test used for graduate school admissions), GPT-3.5 scored better than approximately 60% of the students taking the exam, whereas GPT-4 outperformed about 98%.

This graphic shows that almost all standardized exams can no longer be administered without proper proctoring.

In this paper, we explore the present landscape and the future of assessment in online courses, considering third-party online homework/exam assistance platforms and ChatGPT. Often it is neither practical nor possible to proctor exams in an online format, so this issue holds significant importance for the future of online education. Our purpose here is to provide guidance and advice for giving unproctored online exams in the era of large language models such as ChatGPT. Most of this advice stems from the author's extensive experience and findings in using online assessments over the past four years, in mathematics and statistics courses at both the introductory and advanced undergraduate levels. The key takeaway is that while it remains feasible to conduct online assessments, ensuring the integrity of exams requires a substantial amount of additional effort. This state of affairs will certainly impact university and individual instructor decisions on whether to use online assessment.

2 Review of related literature

For many years before the pandemic, internet-based distance learning had been offered and developed at many universities throughout the world. It has made education more equitable by allowing students to take college classes (and earn degrees) without having to make a major and economically difficult move or waste time on long commutes, and by letting them schedule classes around their jobs or family obligations. According to the Oxford Learning College [ 1 ], online courses date back to 1965 (before the internet), when the University of Alberta used networked IBM 1500 computers. Massive Open Online Courses (MOOCs) were first offered through MIT in 2012. In 2019, the last year before the pandemic hit, 36.9% of higher education students were taking online courses; see [ 3 ]. With the increasing prevalence of online education, much research has been done over the years. The meta-study by Bernard et al. [ 4 ] examined 232 separate studies of attitudes, achievement and retention outcomes and contains numerous useful research paper references. Horspool and Lange [ 5 ] compared online versus face-to-face learning and examined students' perceptions, behaviors, and success rates. Their findings indicate that schedule flexibility is a major factor when students choose an online course over a face-to-face option, and that students in both formats felt they experienced high-quality communication with the instructor. They also found that online students study more at home than face-to-face students but have more limited peer-to-peer communication. The authors present several recommendations, based on their findings, to improve the online course experience. The Horizon 2020 TeSLA Project [ 6 ], funded by the European Union, is a valuable resource for online learning; in particular, the publications from this project contain a wealth of literature on assessment topics related to those of this paper. In an effort to assist institutions that are hesitant to adopt online course offerings, Fidalgo et al. [ 7 ] conducted an extensive survey of undergraduate students in three countries to find out their feelings and concerns about online courses (the survey was done before Covid-19). Shortly into the pandemic, Gammage et al. [ 8 ] reviewed assessment security practices for safeguarding academic integrity at several different universities.

Before the onset of COVID-19, the prevalence of online course offerings was shaped by student demand and the willingness of faculty and universities to provide this instructional mode. Online assessment has consistently posed challenges, with difficulties in ensuring assessment integrity discouraging many faculty members from embracing online education. With COVID-19's arrival, faculty found themselves unprepared in the midst of the spring semester (or quarter) of 2020. Traditional face-to-face instruction came to a sudden halt, compelling all instructors to rapidly transition to online teaching. Unfortunately, the outcomes were not always favorable.

Determining the exact extent to which students cheat, or even anticipating all possible methods they might employ, is challenging. Yet studies indicate a notable surge in the number of students admitting to cheating on online exams during the COVID-19 pandemic. A recent meta-study by Newton and Essex [ 9 ], encompassing 19 studies and 4,672 participants from as far back as 2012, revealed that self-reported online exam cheating rose from a pre-COVID rate of 29.9% to 54.7% during the pandemic. The primary reason students cited for cheating was the mere availability of an opportunity. It's pivotal to recognize that such studies, dependent on voluntary responses, can introduce bias, potentially underestimating the real figures. Nevertheless, the substantial uptick during the pandemic is of significant concern. Noorbehbahani et al. [ 10 ] carried out a meta-study reviewing 58 papers on online cheating from 2010 to 2021. Their work aimed to summarize patterns in cheating types, detection methods, and prevention strategies in online environments.

Dendir and Maxwell [ 11 ] compared student scores on the same online exams when some were administered without proctoring and others with online proctoring. They made this comparison in two different courses, Economics and Geography, across a total of three high-stakes exams in each class. Their analysis showed significantly higher scores for the non-proctored exams than for the online proctored exams, across both courses and all three exams. Many college instructors make use of test banks to create multiple choice exams. Such questions can promptly appear in the databases of companies that provide assistance to students on exams and homework. Golden and Kohlbeck [ 12 ] experimented with their online accounting exams and found that when test bank questions were modified by paraphrasing, student scores on them dropped on average from 80% to 69%. This paper was published before the release of ChatGPT, which can serve as another powerful tool for students to cheat on exams.

This paper will showcase the outcomes of experiments designed to gauge student cheating, particularly with the utilization of online student test assistance platforms during the pandemic, in the context of the author's large introductory statistics lecture. The findings indicate that cheating via such platforms was rampant. Each semester, the author typically offers one upper-level course alongside the introductory class, with a smaller enrollment. The specialized nature of the upper-level course makes it harder for students to cheat using such platforms. However, the advent of ChatGPT exposes both course types to novel cheating techniques. This paper will delve into the diverse methods students resort to for cheating and the strategies to mitigate them. Although the primary focus is on mathematics and statistics courses, the insights should be invaluable to instructors across various disciplines.

Holden et al. [ 13 ] suggest distinct strategies to curb student cheating in online exams, differing from our propositions. Their expertise lies in non-STEM courses, where they recommend three types of video surveillance techniques and AI-driven plagiarism detection methods. However, video surveillance can be time-consuming, and it may pose ethical and legal challenges. Jia and He [ 14 ] at Beijing University crafted an AI-based proctoring system tailored for mathematics courses, utilizing automated video analysis through AI algorithms. They found it to be cost-efficient and highly effective in curbing student cheating in online evaluations.

A strategy closer to our paper's methodology is proposed by Nguyen et al. [ 15 ]. In the realm of Chemistry exams, they introduced several economical assessment formats that substantially reduce cheating while still meeting learning objectives.

While large language models can inadvertently assist students in cheating during online exams, they can also augment instruction in numerous beneficial ways, as demonstrated by Jeon and Lee [ 16 ].

3 Methods and results

One of the main purposes of this paper is to present evidence showing the extent of student cheating in the author's introductory statistics class when it is offered online compared to when it is offered in the classroom. This is an observational study: the author has given this large lecture statistics course every fall and spring semester for the past 10 years, with enrollments in the range of 120–150. Each semester there are four exams and a final exam, all multiple choice. Great care is taken to ensure that the difficulty of each exam remains the same each semester, and the grading is done completely objectively and simply (each answer gets full credit if correct and no credit if wrong; there is no partial credit or need for any subjective assessment). We collected grade distributions for each exam over all the semesters. Prior to the pandemic, these distributions were quite similar for each exam across semesters, but once the pandemic hit and the exams were administered online, the scores increased markedly.

The most reliable statistical methods for collecting data are randomized controlled experiments, but such a design would not be feasible to test our hypothesis. We would need to announce during registration that students who sign up for this class would be randomly assigned to one of two groups: one group would take their exams in class and the other online (but otherwise the classes would be identical). Few if any students would be willing to sign up for such a class, and even if they did, as our research will show, the students taking the exams online would have much higher scores (on the same exams), and this would lead to chaos. Another design might be to offer two different classes, one in which the students would take their exams online and the other where they would take the exams in class, with the exams and other aspects of the class identical. This design, however, would be flawed, since there would be many lurking variables and differences stemming from which students choose which format. So a retrospective observational study, as we do in this paper, appears to be the best method for performing such comparisons.

As discussed above, we will present the different ways that students have been able to cheat and the prevalence of such cheating via the exam and homework help offered by several companies. We will also explore the new challenges introduced by ChatGPT.

The author has also been teaching an upper-level mathematics course in machine learning every spring semester since before the pandemic. This class typically has an enrollment of 20 to 30. The exams were all take-home exams but very different in nature from the multiple choice introductory statistics exams: the questions involve mixes of mathematical and conceptual reasoning, and some involve computer programming. The recent availability of large language models like ChatGPT has introduced some problems in assessment; ChatGPT, for example, is capable of writing computer programs. This paper will also discuss ways to prevent students from using such large language models on exams, and these mitigation methods have been tested over several semesters.

4 Introductory large lecture statistics course

4.1 Part 1A: Pre-ChatGPT, introductory-level large lecture statistics course

The author’s venture into online teaching and assessment began abruptly in the middle of the spring semester of 2020 when the pandemic-induced lockdowns commenced and universities suspended all on-campus classes. All faculty at the author’s (California State University) campus were provided with some online teaching tools (including Zoom, Blackboard, and Camtasia) and were given just one week to prepare to continue all of their courses online, for an indefinite amount of time. Initially, our university anticipated this arrangement to last less than a month, an estimate that, in hindsight, proved greatly optimistic.

Prior to the pandemic, the author had essentially no experience with online education. Although there were opportunities to give online courses, the author’s reservations with online assessment along with his preference for in-person rather than online interactions deterred any further exploration of this instructional method. The swift shift to online was thrust upon us all, requiring urgent preparation and numerous decisions. The university aimed for maximum flexibility: faculty could conduct their lessons synchronously (live Zoom sessions), asynchronously (pre-recorded lectures), or a mix of both. For live sessions, instructors had the option to record and post these lessons. However, legal complications arose: was there a need to obtain student consent to upload recordings of live classes in which they actively participated? With limited time to address every issue, most began with a tentative plan and adapted as circumstances evolved. Maintaining integrity of exams and assessments was a major concern for many and for the author in particular.

For the lower-level large lecture class, the author used the same format for exams as in the face-to-face setting: multiple choice questions. These were now given on the Blackboard system with no proctoring, rather than in a proctored classroom. Whereas in the in-class setting the author did not allow the use of books, notes, or cellphones (not to mention computers), there was really no way to prevent this in the online setting, so the author had to assume that students would be using all of these resources (despite whatever instructions were given). In addition, students could work with each other or get help from others in completing these exams. Exam collaboration with classmates can be made more difficult by scrambling both the question order and the order of the multiple-choice answers for each individual exam, a feature available on Blackboard and other course management systems.

Blackboard, as well as Canvas (another widely used system), has a feature called a lockdown browser. This tool aims to prevent students from accessing other websites during their exams and blocks actions like copying and pasting. Although the author used this feature on his exams, there are ways around it, for example by opening a different browser. Using the lockdown browser also introduces potential complications. At times it might malfunction, inadvertently preventing students from completing their exams, often through no fault of their own. While the malfunction rate isn't alarmingly high (in the author's experience it hovers around 2–3%), in a class of 140 students this equates to 3–5 individuals per exam who require special accommodation. Moreover, the need for the instructor to remain available during exams to address technical glitches, and to arrange new times for students to complete or retake their exams, undermines one of the intended benefits of automated online exams. Implementing this feature to reduce cheating opportunities can introduce significant collateral challenges, both in terms of time and inconvenience.

Navigating the initial semester of Covid, while simultaneously learning the intricacies of online teaching, proved to be a substantial challenge. Most of the author's energy was channeled into determining the most effective methods to deliver lessons and engage students. Foremost among his concerns was devising strategies to conduct exams with the utmost integrity. It was evident that certain measures were essential to minimize cheating. One such measure was requiring that online exams be conducted within a narrow time slot. At first glance this may seem straightforward, since all students should presumably be free during the class's designated time slot, but the unpredictable nature of technology meant that some students would inevitably encounter IT issues, necessitating buffer time to accommodate them. Within Blackboard, there are primarily two features to regulate the duration of an exam: one is the window during which the exam link is accessible, and the other is an inherent time limit on the exam itself.

With the first exam, the author gave too much time and kept the exam difficulty level similar to that of an in-person exam. Both turned out to be major errors. The results are nicely illustrated by two bar plots of grade distributions for the students in the same class, on their first exam (administered in person) versus their third exam (administered online); see Fig. 2 below.

NOTE: For readers outside of the United States: Grades in most US colleges and universities (as well as in primary and secondary schools) are letter grades which typically have the following meaning: A = outstanding, B = good, C = satisfactory, D = unsatisfactory, and F = failing. Grades are converted numerically with A = 4 points, B = 3 points, C = 2 points, D = 1 point, and F = 0. Using these numerical conversions, grade point averages for particular students (or groups thereof) and of exams can be computed.

Based on the author’s impression of student understanding and participation during the online lectures as well as the extremely higher grades with the online exams, this discrepancy does not seem indicative of enhanced remote learning proficiency. Although this does not present a definitive proof that students were cheating with the online exams, the author feels that the elevated grades on Exam 3 can largely be attributed to the increased opportunities for students to cheat during online assessments. There are other possible explanations, e.g., perhaps during the pandemic, students were able to study better at home with less distractions and reduced time spent on transportation. A controlled randomized experiment would reveal the true reason for the much higher grades on (the same) exams when given online, but, as we pointed out earlier, it would not be feasible to perform such an experiment.

Upon observing the results of Exam 3 during Spring 2020, it became apparent to the author that adjustments were necessary. A shorter exam duration was introduced, and the difficulty level was elevated. While these changes yielded a more “normal” grade distribution in Exam 4 (like the one on the left of Fig.  2 ), they risked unduly penalizing genuine students who refrained from any dishonest practices.

In the Fall 2020 semester some new trends became evident. Typically, the author would reuse questions on exams so as to compare student performance and gauge how certain modifications in teaching may impact student learning. The first three exams (out of a total of four plus a final exam) had grade distributions similar to the typical one (left of Fig. 2). This was attributed to the adjustments the author made based on what was learned the previous semester. It also gave a good indication that the use of real-time tutors on exam assistance platforms was not drastically affecting the grade distribution; perhaps most students who used such platforms on exams did not feel they needed a live tutor, relying instead on the large question/answer banks that the platforms provide.

For the fourth exam, the author performed a simple experiment: in Fall 2020 he gave exactly the same fourth exam as was used in a semester several years back, when the course was given face-to-face. Sure enough, compared with the typical symmetric distribution (left of Fig. 2) of all pre-Covid exams, the grade distribution for the online exam was ridiculously skewed: A was now the most common grade; see Fig. 3.

Figure 2: The left distribution had been rather typical for all exams over the many semesters that the author has taught his large lecture statistics class, following a bell-shaped distribution. The distribution on the right comes from the author's very first online exam in Spring 2020. It has many more A's and B's and noticeably fewer C's and D's.

Figure 3: The ridiculous distribution for Exam 4 of Fall 2020, which, as an experiment, used the same questions as a previous semester's exam from a few years back. This provides compelling evidence that exam questions get added to third-party exam assistance platforms' databases and that most students were using such platforms to cheat.

The difference between the distribution in Fig. 3 and the typical symmetric distribution for the author's pre-Covid Exam 4 is even more extreme once the exam curves are taken into account. All exams are typically curved, with very similar curves for each exam across all pre-Covid semesters; but since the numerical scores were so much higher on the online exams, their curves were less generous.

For Exam 4, which tends to be the most difficult of all of the exams (including the final exam), the scores tend to be lower and so the curve is more generous. Here are the curves for Exam 4 both before and during Covid:

Pre-Covid curve for (face-to-face) Exam 4: A: 75–100, B: 63–74, C: 45–62, D: 30–44.

Covid curve for (online) Exam 4: A: 90–100, B: 80–89, C: 70–79, D: 50–69.

This shows that the differences in grades were even more extreme than the distributions indicate. For example, pre-Covid on this exam a 75% was the bottom of the A grade range, whereas for the online version a 90% was needed for the same bottom A. Indeed, if the Covid curve for this exam were used to assign grades to the students who took the exam face-to-face, the distribution would be skewed to the right (see the right side of Fig. 4), with about half of the students getting an F.
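To make the comparison concrete, here is a small R sketch applying both sets of cutoffs listed above; the helper function is ours, introduced purely for illustration:

    # Map a percent score to a letter grade given lower cutoffs for D, C, B, A
    score_to_grade <- function(score, cuts, grades = c("F", "D", "C", "B", "A"))
      grades[findInterval(score, cuts) + 1]

    pre_covid <- c(30, 45, 63, 75)   # face-to-face Exam 4 curve
    covid     <- c(50, 70, 80, 90)   # online Exam 4 curve
    score_to_grade(75, pre_covid)    # "A": 75% is the bottom of the A range
    score_to_grade(75, covid)        # "C": the same score under the online curve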

Figure 4: A comparison of the grade distributions for the same multiple choice exam given online and unproctored (left) versus face-to-face and proctored in the classroom (right), if the same curves were used.

For the online exam it would have been reasonable to make the curve even less generous, e.g., A: 95–100. But the instructor would never make exam curves stricter than the "standard curve," and before giving online exams there had never been a need to do so, since the exams were always sufficiently challenging to necessitate some sort of downward curve.

To counteract this, the author crafted a final exam using entirely new questions, aiming to level the playing field and diminish the advantages for those relying on external aids.

Moving on into the Spring 2021 semester, when the author taught the same large lecture statistics course online, additional experiments and modifications were introduced.

Two Experiments : The author crafted two distinct modifications for roughly half of the questions on the first exam:

(1) The modified question retained the original wording, and the multiple-choice answers remained unchanged. However, one numerical parameter (e.g., a standard deviation) was adjusted, altering the correct answer to another existing option.

(2) The original question's wording was preserved, but the initially correct multiple-choice answer was replaced. Along with this, a numerical parameter (e.g., a standard deviation) was adjusted to change the correct answer to a different option.

The results were quite illuminating. For Experiment (1) the average percentage of students who chose the previously correct (but now incorrect) answer for such questions was 72%! This is corroborating evidence that most students were using third party online platforms to cheat and that many did not even read the actual questions! The 72% figure probably underrepresents the number of students who cheated in this way, as some astute learners might have read the questions and cross-referenced with their online platform to validate their initial interpretations.

In Experiment (2), students solely depending on third party online platforms were likely taken aback. Some even audaciously emailed the instructor, asserting that certain multiple-choice questions lacked the correct answers (which, indeed, were present).

4.2 Part 1B: Introductory-level large lecture statistics courses after ChatGPT

The emergence of Chat GPT has added further challenges for educators striving to uphold the integrity of online exams. As outlined in the introduction, Chat GPT can accurately answer numerous questions when they're directly inputted into the application. Educators are advised to test some of their online or take-home exam questions on Chat GPT to gauge the accuracy of its responses.

The capability of Chat GPT as well as other evolving large language models continues to improve. Questions for which Chat GPT gives the correct answer should be changed or replaced. We will give some general suggestions, but we first present the following rather revealing example.

Original Draft Exam Question: Suppose that two dice are rolled and the numbers displayed on top are added up. The probability that this sum is at most 3 is…

The correct answer is: 1/12.

ChatGPT's answer: 1/12 (= 3/36) – ChatGPT got the correct answer!

Modified Draft Exam Question: Suppose that we have two pyramid shaped 4-sided dice, with the faces numbered 1 through 4. Suppose that the two dice are rolled and the numbers displayed on the bottom are added up. The probability that this sum is at most 3 is…

The correct answer is…3/16.

Chat GPT's Response: Incorrect.

Despite the current version of Chat GPT not answering the modified question accurately, its capability and efficiency continue to improve. It might be able to answer similar questions correctly in the future.
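For readers who want to verify the arithmetic in the modified question, here is a short enumeration in R (nothing assumed beyond base R):

    # All 16 equally likely outcomes for two fair 4-sided dice
    outcomes <- expand.grid(d1 = 1:4, d2 = 1:4)
    sum(outcomes$d1 + outcomes$d2 <= 3)    # 3 outcomes: (1,1), (1,2), (2,1)
    mean(outcomes$d1 + outcomes$d2 <= 3)   # 3/16 = 0.1875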

4.3 Timeline for the online classes above

Below is a timeline that summarizes the exam and grade information presented above:

Spring 2020 Semester (Covid-19 started): Exams switched mid-semester from face-to-face to online. Exam 3 (the first online exam) had much higher grades than usual. Adjustments were made for remaining exams to mitigate this (shorter duration, increased difficulty).

Fall 2020 Semester (first full online semester): The author continued with the above strategy, as well as some new ones that will be elaborated upon in the following section (recommendations), for the first three exams and the final exam. For Exam 4 an experiment was done: the same exam used several years earlier was given again, and the grades turned out ridiculously high.

Spring 2021 Semester (another online semester for this class): The exams continued to be modified as explained above, which kept the exam results more down to earth.

Advice for making online exams for introductory statistics and related classes : The above discussion shows the variety of online resources that students can use to cheat on unsupervised online or take-home exams.

No Repetition : Third party online exam/homework assistance platforms have demonstrated that it's imperative not to reuse questions. Even those used just a day earlier can end up on such platforms. For integrity, either:

Draft entirely new questions.

Modify previous ones by adjusting parameters and/or altering the multiple-choice options.

Tackling Live Tutoring: It's a bit more challenging to address students who employ live tutors during exams. However, imposing strict time constraints can discourage this behavior.

Chat GPT Screening: As the efficiency of tools like Chat GPT increases, it's crucial for instructors to keep abreast of such new technology and to screen exam questions using a Chat GPT session to ensure they aren’t easily answered by such platforms.

Incorporating Graphics and Unique Symbols: Questions that reference specific graphics or use specialized Greek letters in unique fonts can pose a challenge for platforms like Chat GPT. It adds an additional layer of difficulty in deciphering and interpreting the questions.

While these strategies involve more preparation, the integrity they bring to the examination process is invaluable. It's a trade-off: increased time in question creation versus the time saved from manual grading. Many educators, when given a choice, might still lean towards supervised, in-person examinations. However, for purely online courses, these precautions become essential.

5 Guidelines for online assessments in advanced mathematics and computer science courses

In general, upper-level STEM classes are not as susceptible to online cheating as more standard lower-level classes. Many upper-level classes use specialized vocabulary and definitions that can vary even within the same course, depending on the instructor. Such vocabulary variations also occur in introductory-level courses, but on a much smaller scale. Upper-level math (post-calculus) and computer science questions also appear on third-party online exam/homework assistance platforms, but they are more difficult for platform tutors to crack and keep up with.

Specialized Vocabulary and Concepts

Upper-level courses frequently employ specific terminology and concepts that might differ based on the instructor or the curriculum.

Variations in vocabulary make it more challenging for third party online exam/homework assistance platforms to provide accurate solutions.

Computational Questions

While engines like Chat GPT and Wolfram Alpha can solve certain computational questions, exam setters should ensure that the questions are crafted uniquely to prevent direct solutions from these platforms.

Proof-based Questions

Questions that require proofs, common in many advanced math courses, and some computer science subjects, are generally tougher for both third party online exam/homework assistance platforms and Chat GPT to handle.

Coding Questions

Chat GPT can generate computer code in essentially any programming language. For questions requiring a computer code/program, the author imposes the following requirements. Programs must run and produce output, and if the instructor tests the program, it should produce the same output. Students must provide their code in a format that can be copied and pasted. The code may use only functions and protocols that were introduced in class; new functions and constructs are allowed, but any such usage must be thoroughly explained with examples. Students are also informed that if their code appears to be beyond the scope of the class's programming skills, the instructor may call them in after the exam to fully explain how their code works.

Prior to Chat GPT, the author allowed students to use any new code strategies that were not taught in class, without restriction; indeed, students are encouraged to go beyond expectations and study more programming methods. But when Chat GPT writes a computer program (in whatever programming language you ask for), it will feel free to use ANY functions and methods it sees fit, and sometimes the functions used require the installation of additional packages in order to run. This makes it fairly easy to detect when a student has used Chat GPT: most often the program will not run at all (in the author's machine learning course, the software R was used), and/or the program will use constructs that are more advanced than, or simply different from, those taught in class. Thus it should be required that the programs students provide actually be executable (and that students provide the resulting output). A student who used Chat GPT to write their code will most often become flummoxed if asked to explain any advanced constructs it contains. To avoid the necessity of such encounters to verify whether the code was indeed the student's creation, it is simpler not to accept solutions that violate either of the above directions (or perhaps to stipulate that such solutions can earn no more than, say, 25% of the point value).
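The "only functions introduced in class" requirement can be partially automated. The following Python sketch is illustrative only: the whitelist and the regex-based notion of an R function call are assumptions (R's syntax is richer than this), so it is a screening aid rather than a verdict.

```python
import re

# Hypothetical whitelist of R functions introduced in class,
# maintained by the instructor.
taught = {"c", "mean", "sd", "read.csv", "lm", "predict", "plot", "summary"}

def flag_unfamiliar_calls(r_source: str) -> set[str]:
    # Match identifiers immediately followed by "(" -- a rough proxy for
    # function calls in R source (dots are legal in R names).
    calls = set(re.findall(r"\b([A-Za-z][A-Za-z0-9._]*)\s*\(", r_source))
    return calls - taught

submission = 'library(glmnet)\nfit <- cv.glmnet(x, y)\nplot(fit)'
print(flag_unfamiliar_calls(submission))  # {'library', 'cv.glmnet'}
```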

Course-Specific Assessments

In other upper-level courses such as Cryptography, Discrete Mathematics, and Graph Theory, it is vital to frame questions based on the specific teachings of the course. This not only tests students' understanding but also mitigates the impact of online test support services.

In conclusion, while advanced courses present their own set of challenges for online assessments, a keen focus on course-specific teachings and a proactive approach towards leveraging technology can help ensure exam integrity and fairness.

The rapid transformation of the educational sector, catalyzed by unforeseen circumstances such as the pandemic, underscores the importance of adaptability and forethought. With distance learning not merely being an option but a necessity, online assessment, once considered a subsidiary component of education, has gained paramount importance. Our exploration unveils a multifaceted reality: the sheer potential of online learning, the unprecedented challenges posed by third party online exam/homework assistance platforms and ChatGPT, and the undeterred perseverance of educators in ensuring academic integrity.

Our findings highlight an essential paradox. The very tools, like AI and internet platforms, that hold the promise to revolutionize and democratize education, also present profound challenges to academic integrity. As technology continues its relentless march forward, the academic community needs to be one step ahead, always innovating and strategizing to maintain the sanctity of education.

While third party online exam/homework assistance platforms exploit the vulnerabilities of the current online assessment system for profit, AI models like ChatGPT offer a glimpse into the future of information accessibility. They underscore the urgency with which educators and institutions must rethink and reshape assessment methodologies. The traditional paradigms of assessments, predominantly reliant on rote memory and standard problem-solving, may increasingly become obsolete. Instead, a greater shift towards analytical thinking, application, and synthesis may not only align better with real-world challenges but also be less susceptible to the shortcuts technology offers.

From introductory courses to advanced levels, the narrative remains consistent: there is no single foolproof method, no silver bullet. Instead, the solution lies in a combination of approaches—continuous adaptation, employing course-specific nuances, leveraging technology judiciously, and fostering an academic culture grounded in integrity.

In the author's experience, the most important aspect of an online class to improve is student participation and peer communication. Students are noticeably more reluctant to participate in an online class than in an in-person class. There are many plausible reasons for this; for example, knowing that the class will be recorded and posted online might make students more hesitant to speak up. The author is working on better understanding and mitigating this problem.

As we forge ahead into an increasingly digital future, it is imperative to view these challenges not as insurmountable barriers but as catalysts, driving us towards a more resilient, adaptive, and effective educational paradigm. It serves as a stark reminder that in the evolving dance between technology and education, staying static is not an option. The future of education demands foresight, adaptability, and a relentless commitment to academic excellence and integrity.

Readers might be curious about how the author's department and university are dealing with online courses now that the learning experiences of the pandemic are behind us. The author's campus belongs to the California State University system, which comprises 23 campuses throughout the state and is the largest university system in the United States. While particular policies vary by department and campus, the system's higher administration has made clear its preference to return mostly to face-to-face instruction. At our college, in order to be offered online, any course must be approved by the department, the college dean, and then a university-wide committee. This academic year, our mathematics department has only two courses approved for online instruction, one of which is the machine learning course that the author is currently teaching. A notable exception is our introductory statistics class, the main online course analyzed in this paper: although it is the department's largest course, with over 40 sections offered every semester, at present no online sections are being offered, and the author continues to teach the (only) large lecture for it. As discussed, online courses have many advantages for democratizing education, but at the same time there are important assessment issues that need to be further examined, so this topic will be an important area of research in years to come.

Data availability

All data for this paper are included in the paper, either in raw form or, where more suitable, via graphical summaries.

Code availability

Not applicable.


Apart from the author's university salary, there was no additional funding for this paper.

Author information

Authors and Affiliations

Department of Mathematics, California State University-Dominguez Hills, 1000 East Victoria Street, Carson, CA, 90747, USA

Alexander Stanoyevitch


Contributions

A.S. is the sole author.

Corresponding author

Correspondence to Alexander Stanoyevitch .

Ethics declarations

Competing interests

The author declares no competing interests.






Time course regulatory analysis based on paired expression and chromatin accessibility data

  • Zhana Duren 1,5,
  • Xi Chen 1,5,
  • Jingxue Xin 1,
  • Yong Wang 2,3 and
  • Wing Hung Wong 1,4
  • 1 Department of Statistics, Stanford University, Stanford, California 94305, USA;
  • 2 CEMS, NCMIS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100080, China;
  • 3 Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China;
  • 4 Department of Biomedical Data Science, Bio-X Program, Center for Personal Dynamic Regulomes, Stanford University, Stanford, California 94305, USA

5 These authors contributed equally to this work.

  • Corresponding authors: ywang@amss.ac.cn, whwong@stanford.edu

A time course experiment is a widely used design in the study of cellular processes such as differentiation or response to stimuli. In this paper, we propose time course regulatory analysis (TimeReg) as a method for the analysis of gene regulatory networks based on paired gene expression and chromatin accessibility data from a time course. TimeReg can be used to prioritize regulatory elements, to extract core regulatory modules at each time point, to identify key regulators driving changes of the cellular state, and to causally connect the modules across different time points. We applied the method to analyze paired chromatin accessibility and gene expression data from a retinoic acid (RA)–induced mouse embryonic stem cell (mESC) differentiation experiment. The analysis identified 57,048 novel regulatory elements regulating cerebellar development, synapse assembly, and hindbrain morphogenesis, which substantially extended our knowledge of cis-regulatory elements during differentiation. Using single-cell RNA-seq data, we showed that the core regulatory modules can reflect the properties of different subpopulations of cells. Finally, the driver regulators are shown to be important in clarifying the relations between modules across adjacent time points. As a second example, applying our method to Ascl1-induced direct reprogramming from fibroblast to neuron time course data identified Id1/2 as driver regulators of the early stage of reprogramming.

In time course expression analysis, gene expression is measured at multiple time points during a natural biological process such as spontaneous differentiation of progenitor cells, or during an induced biological process such as cellular response to a stimulus or treatment (Storey et al. 2005). In the last two decades, many methods have been developed to infer gene regulatory networks (GRNs) from time course gene expression data, for example, information theory–based methods (Margolin et al. 2006; Hempel et al. 2011; Kinney and Atwal 2014), Bayesian network–based methods (Perrin et al. 2003; Zou and Conzen 2005), ordinary differential equation–based methods (Bansal et al. 2006; Wang et al. 2006), and permutation-based methods (Hempel et al. 2011). Such analysis is popular because expression data, which are inexpensive to measure, can provide a rich description of the changes in cellular state during the time course. However, because the regulation of gene expression involves the interaction of transcription factors with DNA in regions of open chromatin, measuring gene expression alone is not sufficient for the study of regulation (Duren et al. 2017; Miraldi et al. 2019). Much deeper understanding can be revealed by time course regulatory analysis, in which both gene expression and chromatin accessibility are measured at each time point of a time course experiment. With the advent of cost-effective technologies for measuring chromatin accessibility (i.e., the Assay for Transposase-Accessible Chromatin using sequencing [ATAC-seq]) (Buenrostro et al. 2013), paired expression and accessibility data are now becoming available from many time course experiments, such as FOS-induced neuronal activities (Su et al. 2017), epidermal development (Li et al. 2019), myeloid cell differentiation (Ramirez et al. 2017), early cardiomyocyte differentiation (Liu et al. 2017), iPSC reprogramming (Wapinski et al. 2017; Cao et al. 2018), and induced neuron reprogramming (Wapinski et al. 2017; Cao et al. 2018). Here, we present a methodology for the analysis of data from studies with such experimental designs.

Figure 1 presents an outline of our methodology (for details, see Methods). First, we infer a context-specific GRN from matched ATAC-seq and RNA-seq data at each time point to obtain reliable regulatory relations. Using the inferred GRN, we define two types of scores for regulatory relations. The regulatory strength of a transcription factor (TF) on a target gene (TG) is quantified by the trans-regulation score (TRS), which is calculated by integrating information from the multiple regulatory elements (REs) that may mediate the activity of the TF in regulating the TG. Here, a prior TF-TG correlation across external public data (Supplemental Table S1) is included in the TRS definition to distinguish TFs sharing the same binding motif (i.e., TFs from the same family). The regulatory strength of an RE on a target gene is quantified by the cis-regulation score (CRS), which is calculated by integrating the TRS of TFs with binding potential on the RE. Based on these scores, we use non-negative matrix factorization to extract the core regulatory modules that characterize different biological processes and/or subpopulations of cells. Finally, we identify driver TFs (i.e., TFs driving expression changes between adjacent time points) as TFs whose TRS scores on up-regulated genes are much larger than on other genes. This allows us to causally connect ancestor–descendant regulatory modules across time points (Methods). Our methodology for time course regulatory analysis is implemented in the software Time Course Regulatory Analysis (TimeReg), which is freely available (https://github.com/SUwonglab/TimeReg).
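For readers who want a concrete picture of the aggregation step, here is an illustrative sketch in Python. It is not the actual PECA2 TRS formula (that is defined in the paper's Methods and includes a prior TF-TG correlation term); it only shows the general shape of "integrate, over the REs that may mediate a TF's activity, the TF's binding and the RE's openness to score each TF-TG pair":

```python
import numpy as np

def trs_like_score(binding, openness, re_gene_weight):
    """Illustrative TF-to-target aggregation, NOT the PECA2 formula.
    binding:        (n_TF, n_RE)  motif-binding strength of each TF on each RE
    openness:       (n_RE,)       chromatin accessibility of each RE
    re_gene_weight: (n_RE, n_TG)  weight linking each RE to each target gene
    returns:        (n_TF, n_TG)  aggregated TF-TG regulatory score
    """
    return (binding * openness[None, :]) @ re_gene_weight

rng = np.random.default_rng(0)
score = trs_like_score(rng.random((5, 20)), rng.random(20), rng.random((20, 8)))
print(score.shape)  # (5, 8)
```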


Schematic overview of Time Course Regulatory Analysis (TimeReg) based on paired gene expression and chromatin accessibility data. ( A ) TimeReg proposes a three-step framework to infer a high-quality gene regulatory network. Step 1: PECA2 infers context-specific GRN from matched ATAC-seq and RNA-seq data at a single time point to output TF-TG and RE-TG regulatory matrix. Step 2: NMF decomposes the regulatory matrix and extracts the core regulatory modules at each time point. Step 3: Driver regulators are identified (Methods). ( B ) Overview of PECA2 method. ( C ) Schematic overview of the major results on two data sets.

In this paper, we will validate the utility of TRS and CRS by comparison with TF-TG relations and RE-TG relations defined by data from independent gene knockdown, ChIP-seq, and HiChIP experiments. After validating these key concepts, we will apply our method to study a time course of retinoic acid (RA)–induced differentiation of mouse embryonic stem cells (mESCs), in which gene expression and chromatin accessibility are measured at baseline (day 0) and at days 2, 4, 10, and 20 after RA treatment. We will extract core regulatory modules on each time point, map the modules across time points to trace their development trajectory, and validate the core regulatory modules by single-cell RNA-seq data. As a second illustration, we will apply our method on Ascl1 -induced direct reprogramming from fibroblast to neuron time course data to detect heterogeneity, explore the regulatory dynamics, and identify driver regulators in the reprogramming process. By applying our method on different biological problems, we will illustrate that our methodology is capable of providing reliable regulatory relations in time course experiments and dissecting the dynamics at network level.

Chromatin accessibility and expression dynamics of retinoic acid–induced mESC differentiation

Mouse ESCs were induced to differentiate by treatment with retinoic acid (RA). We harvested cells at days 0, 2, 4, 10, and 20 (mESC, D2, D4, D10, and D20) and performed ATAC sequencing (ATAC-seq) and RNA sequencing (RNA-seq) to measure paired chromatin accessibility and gene expression time course data (Fig. 2A). Regulatory elements (REs) at each time point are identified and quantified by their openness scores in the same way as in Duren et al. (2017). In total, 174,059 REs are obtained across all time points. From the gene expression data, we selected 7975 genes that have at least a twofold expression change and a maximum expression level >10. These sets of REs and genes are used for all subsequent analyses. From the results of principal component analysis (PCA) and Pearson's correlation coefficient (PCC) on ATAC-seq data of twofold dynamic REs (Fig. 2B,C), our genome-wide gene expression and chromatin accessibility profiles show high reproducibility among biological replicates. We also find a sharp change in the chromatin accessibility landscape during the time course, whereas the changes in gene expression are more moderate (Fig. 2D,E). The change in accessibility between day 0 and day 2 is particularly large. This suggests that the immediate response to RA treatment is a large change in chromatin accessibility, which then induces the subsequent gene expression changes in the time course.
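The stated gene filter is simple to reproduce. A minimal sketch with simulated stand-in data (the max/min convention for "twofold expression change" is an assumption; the paper states only the fold-change and expression thresholds):

```python
import numpy as np

# Stand-in expression matrix: genes x time points.
expr = np.abs(np.random.default_rng(1).normal(20, 15, size=(10000, 5)))

max_expr = expr.max(axis=1)
min_expr = np.maximum(expr.min(axis=1), 1e-9)   # guard against division by zero
# Keep genes with >= twofold change and maximum expression > 10.
keep = (max_expr / min_expr >= 2.0) & (max_expr > 10.0)
print(keep.sum(), "genes retained")
```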

Genome-wide profiling of gene expression and chromatin accessibility during RA induction reveals landscape for RA-driven lineage transition. ( A ) Schematic outline of study design. ( B , C ) PCA and heat map of the Pearson's correlation matrix on ATAC-seq data. ( D , E ) PCA and heatmap of the Pearson's correlation matrix on RNA-seq data. ( F ) Enriched GO terms in the top 200 specific genes at each time point. The horizontal axis is −log 10 ( P -value) and the number behind the bar represents fold enrichment.

We performed Gene Ontology (GO) term enrichment analysis on the 200 most specifically expressed genes (Methods) at each time point. Because the differentiation is induced by RA, as a positive control we checked whether the GO term "response to retinoic acid" is enriched at day 2. Indeed, this term is highly enriched at day 2 (P = 3.67 × 10−10, fold change = 6.51). This gives us confidence that the enrichment analysis on specifically expressed genes is capable of detecting biologically relevant signals. Figure 2F presents the most significantly enriched GO terms at each time point. It suggests that RA induced differentiation into multiple cell types, with neuronal cells arising early in the time course and glial lineages emerging later. The enrichment of "digestion" at day 20 suggests that the differentiation also gave rise to cell types beyond neurons and glia (Fig. 2F).

Context-specific inference of gene regulatory relations by the PECA2 method

We developed PECA2 as a method to infer gene regulatory relations (TF-TG relations and RE-TG relations) in a cellular context based on gene expression and chromatin accessibility data in that context (Methods). First, the trans-regulation score (TRS) for a given TF-TG pair is defined by integrating information from the multiple REs that may mediate the activity of the TF in regulating the TG (Fig. 1B; Methods). Before applying it to analyze our time course data, we first validated the usefulness of TRS using gene perturbation experiments. In mESCs, we performed shRNA knockdown (separately) of the transcription factors Pou5f1, Sox2, Nanog, Esrrb, and Stat3 and measured gene expression changes following the knockdown. Taking the most differentially expressed genes following the knockdown (FDR < 0.01; see Guan et al. 2019) as the true target genes of each TF, we calculated the area under the ROC curve (AUC) of target prediction based on ranking by TRS. As a comparison, we collected ChIP-seq data for these TFs in mESC and used the potential target gene score from BETA (Wang et al. 2013) to predict the target genes. Figure 3A shows that TRS-based prediction has substantially higher AUCs than predictions based on ChIP-seq data. We also compared our method with several baseline methods (different combinations of features from expression and accessibility data) (Methods) on TF-TG prediction. TRS-based predictions achieve much higher accuracy than these baseline methods (Supplemental Fig. S1). These results validated the usefulness of TRS in predicting TF-TG relations.
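The validation logic is essentially a ranking problem: knockdown-responsive genes are the positives and the TF's TRS over all genes is the ranking score. A sketch with simulated stand-in data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(2)
n_genes = 5000
# Stand-in "true targets": genes differentially expressed after knockdown.
is_kd_responsive = rng.random(n_genes) < 0.05
# Stand-in TRS, mildly enriched in the true targets.
trs = rng.random(n_genes) + 0.5 * is_kd_responsive

print("AUC:", roc_auc_score(is_kd_responsive, trs))
```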

PECA2 infers accurate gene regulation supported by ChIP-seq, shRNA knockdown, and HiChIP experiments. ( A ) Comparison of PECA2 TRS with ChIP-seq experiment on five important regulators in mESC by taking the knockdown data as ground truth. Shapes represent different transcription factors and colors represent different methods. Red represents results from PECA2 and black represents results from ChIP-seq experiment. ( B ) Conservation score distribution comparison between REs predicted to regulate at least one gene and randomly selected REs. ( C , D ) Validation of PECA2 predicted RE-TG pairs by the HiChIP experiment on mESC and RA D4. Background RE-TG pairs are randomly selected to have the same distance distribution as the predicted RE-TG pairs. “Fold” represents fold change of average read count of predicted RE-TG pairs versus background RE-TG pairs.

After we have the trans-regulation score, we define the cis-regulation score (CRS) for each RE-TG pair by integrating the TRS of TFs with binding potential on the RE (Fig. 1B; Methods). To validate the usefulness of CRS, we downloaded the publicly available H3K27ac HiChIP data on mESC (Mumbach et al. 2017) and also performed H3K27ac HiChIP experiments on RA D4 (Zeng et al. 2019). The RE-TG pairs from CRS-based prediction (Methods) have 17.72 and 15.30 HiChIP reads on average on mESC and RA D4, respectively. As a control, we selected random RE-TG pairs (under the constraint that they have the same distance distribution as our predicted RE-TG pairs) from all candidate RE-TG pairs. The RE-TG pairs from the control groups have 5.17 and 4.23 HiChIP read counts on average on mESC and RA D4, respectively (Fig. 3C,D). HiChIP counts are much higher for CRS-predicted RE-TG pairs (one-tailed Wilcoxon rank-sum test, ESC: P-value = 1.4412 × 10−53, fold = 3.4250; RA D4: P-value = 1.1025 × 10−56, fold = 3.616). We also constructed two more control RE-TG sets and compared their read counts with those of our predicted RE-TG pairs. These controls are selected from RE-TG pairs that have the same TG expression distribution and the same distribution of the number of ends covered by H3K27ac ChIP-seq peaks, respectively. Our predicted RE-TG pairs have much higher read counts than both control groups (Supplemental Fig. S2). Finally, when we examined the sequences of the REs in the RE-TG pairs, the CRS-predicted REs have significantly higher sequence conservation than those from the control group (one-tailed Wilcoxon rank-sum test, P-value = 1.3801 × 10−33, fold change = 1.4425) (Fig. 3B). These results validated the usefulness of CRS in predicting RE-TG relations. Taken together, PECA2 provides high-quality, genome-wide, and context-specific inference of gene regulatory relations for each single time point with paired gene expression and chromatin accessibility data. PECA2 is available at https://github.com/SUwonglab/PECA.
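Constructing a distance-matched background is the key step in this comparison. A sketch of one way to do it (the quantile binning is an assumption; the paper does not specify its matching procedure):

```python
import numpy as np

def distance_matched_background(pred_dist, cand_dist, rng, n_bins=20):
    """Sample candidate RE-TG pairs so the background matches the RE-TG
    distance distribution of the predicted pairs. Returns indices into
    cand_dist. Binning scheme is an illustrative assumption."""
    bins = np.quantile(pred_dist, np.linspace(0, 1, n_bins + 1))
    picked = []
    for lo, hi in zip(bins[:-1], bins[1:]):
        need = np.sum((pred_dist >= lo) & (pred_dist <= hi))
        pool = np.where((cand_dist >= lo) & (cand_dist <= hi))[0]
        if len(pool) and need:
            picked.append(rng.choice(pool, size=need, replace=True))
    return np.concatenate(picked)

rng = np.random.default_rng(3)
background = distance_matched_background(rng.exponential(5e4, 300),
                                         rng.exponential(1e5, 20000), rng)
print(len(background))
```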

Identification and annotation of novel regulatory elements by CRS

To identify important REs, we analyzed REs and their target genes. First, we annotated the REs into three groups: (1) promoters, (2) known enhancers (from the mouse ENCODE Project, including developmental stage enhancers), and (3) novel enhancers (Supplemental Table S2). We found that 7% of REs are promoters, 60% are known enhancers, and the remaining 33% are novel enhancers (Fig. 4A). About 69% of the genes are regulated by both known and novel enhancers. To understand the functions of the known and novel enhancers, we divided the genes into two groups (targets of known enhancers and targets of novel enhancers) depending on whether the associated RE with the maximum CRS score is a known enhancer or a novel enhancer. On average over the different time points, known enhancers have 4604.80 target genes and novel enhancers have 3370.20 target genes (Fig. 4B). We found that the novel enhancers with maximum CRS scores are highly conserved, with a mean conservation score of 0.2034. The conservation score of those enhancers is about twofold higher than that of random regions, higher than the mean conservation of known enhancers (conservation score = 0.1753), and comparable to that of open promoter regions (conservation score = 0.2368) (Fig. 4C). At each time point, we chose the top 500 specifically expressed genes (based on gene specificity score) (Methods) targeted by known enhancers and by novel enhancers, respectively. Figure 4D shows the GO enrichment score (defined as the geometric mean of fold change and −log10[P-value]) of known versus novel enhancers' targets at RA D10. GO terms such as cerebellar granular layer development, synapse assembly, regulation of AMPA receptor activity, and hindbrain morphogenesis are enriched only in the novel enhancers' targets, not in the known enhancers' targets. These results suggest that the ENCODE Project's enhancer annotation for the genes associated with those GO terms is still incomplete, presumably because the mouse ENCODE Project includes brain tissue but not specific neuronal cellular contexts. By combining experimental and computational analysis, we identified new enhancers whose inclusion will greatly enhance the interpretation of data from the time course.

Identification and annotation of novel enhancers. ( A ) Pie charts of the promoter, known enhancers, and novel enhancers in REs. ( B ) Bar plot of numbers of known enhancers and novel enhancers at different time points. ( C ) Conservation score distribution of different sets of REs. ( D ) Comparison of GO enrichment on known enhancers’ targets and novel enhancers’ targets on D10. Top 500 specifically expressed genes in each group are chosen to perform GO analysis. The x -axis represents enrichment score on known enhancers’ targets, and the y -axis represents that on novel enhancers’ targets. Each dot represents one GO term. The enrichment score is defined as the geometric mean of fold change and −log 10 P -value.

Core regulatory modules of TFs and TGs reveal subpopulation characteristics

To better understand the regulatory network inferred by PECA2, we divided the TF-TG networks, which are represented by a TRS matrix (rows represent TFs and columns represent TGs), into several modules (dense subnetworks) by a non-negative matrix factorization (NMF) (Brunet et al. 2004; Wang et al. 2008) based module detection method (Methods; Fig. 5A). We used the RA D4 data to illustrate and validate this approach.
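A minimal sketch of NMF-based module detection on a stand-in TRS matrix, using scikit-learn (the rank K and the argmax assignment rule are illustrative assumptions; the paper's Methods define the actual module detection procedure):

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(4)
trs = rng.random((60, 400))           # stand-in non-negative TRS matrix (TFs x TGs)

K = 3
model = NMF(n_components=K, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(trs)          # TF loadings, shape (60, K)
H = model.components_                 # TG loadings, shape (K, 400)

tf_module = W.argmax(axis=1)          # assign each TF to its strongest factor
tg_module = H.argmax(axis=0)          # assign each TG likewise
print(np.bincount(tf_module), np.bincount(tg_module))
```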

Core regulatory modules extracted from gene regulatory networks are supported by subpopulation from single-cell RNA-seq data. ( A ) Heatmap of reordered normalized TRS scores at D2 to D20. The black line represents the detected modules from NMF. ( B ) Heatmap of D4 normalized TRS on selected specific genes and TFs. ( C ) Mean expression pattern of genes from three different modules of RA D4 on the developmental stage of seven tissues. ( D ) Mean expression pattern of genes from three different modules of RA D4 on RA time course. ( E ) Distribution of RA D4 top 500 module-specific genes’ maximum expressed subpopulations from single-cell RNA-seq data.

NMF analysis of the TRS matrix at D4 yielded three modules, in which the TRS scores between TFs and TGs within the same module are much higher than those between different modules. This means we have divided the TF-TG network into three groups of nodes with dense connections internally and sparser connections between groups. Figure 5B shows this pattern for selected TFs and TGs. We found that genes from different modules have different expression patterns in mouse tissue development data (Gorkin et al. 2017) as well as in the RA induction data (Fig. 5C,D). Module 1 contains TFs such as Ascl1, Ebf1, Nr2f1, and Lhx1, which were previously reported to be relevant to neuronal development. Using data from Gorkin et al. (2017), we also found that Module 1–associated TGs have a high and increasing expression pattern in embryonic brain development (Fig. 5C, top). Module 2 contains mesodermal and endodermal regulators such as Sox17, Gata4, Gata6, Foxa2, and Hnf4a. Consistently, target genes in Module 2 are highly expressed in the liver, lung, heart, kidney, intestine, and stomach, but lowly expressed in the brain (Fig. 5C, middle). Module 3–associated TFs such as Pou2f1, Hmga1 (Nishino et al. 2008), Sox2 (Graham et al. 2003), and Pax6 (Sansom et al. 2009) are known to be involved in the maintenance of neural stem and progenitor cells. Consistently, we found that the expression of Module 3–associated TGs decreases during embryonic development in all tissues (Fig. 5C, bottom). These pieces of evidence show that the biological functions of the module-specific regulators are well matched to the cellular context suggested by the expression patterns of the corresponding target genes.

Because the cell population after 4 d of differentiation is expected to be a mixture of subpopulations of cells representing different developmental lineages, the modules identified above may reveal the regulatory characteristics of these subpopulations. We used single-cell RNA-seq data to test this hypothesis. In a previous study (Duren et al. 2018), we performed scRNA-seq experiments on day 4 of the RA-induced time course and found three distinct subpopulations of cells. We selected the top 500 module-specific genes (Methods) in each module and assigned each gene to one of the three subpopulations (the one with the maximum expression). We found that 76.20% of Module 1–specific genes are matched to subpopulation 1, 75.00% of Module 2–specific genes are matched to subpopulation 2, and 79.40% of Module 3–specific genes are matched to subpopulation 3 (Fig. 5E). This result shows that the modules are well matched to the subpopulations and therefore can help us elucidate the regulatory relations within the subpopulations. In other words, our analysis of bulk expression and accessibility data provided useful information on subpopulation-specific gene regulatory networks.

Finally, we applied the same analysis for each time point and found that there are three modules in D2, three modules in D4, four modules in D10, and four modules in D20 ( Fig. 5 A; Supplemental Figs. S3–S5 ).

Associations of regulatory modules across time points

With these functionally meaningful regulatory modules indicating subpopulations at each time point, we next investigated the relationships among them across time points. Figure 6A shows the similarities between modules at adjacent time points. The three modules in D2 are well matched to the three modules in D4 (Jaccard similarities of 0.77, 0.82, and 0.85), and the four modules in D10 are well matched to the four modules in D20 (Jaccard similarities of 0.69, 0.74, 0.88, and 0.66). This suggests that analyzing a set of matched modules across time points may reveal useful biological insight. There are four modules in D10 but only three in D4. Modules 2 and 3 in D4 are similar to Modules 2 and 3 in D10 (Jaccard similarities of 0.74 and 0.88, respectively). However, it is not obvious how Modules 1 and 4 in D10 are related to the D4 modules.
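Module matching across time points reduces to Jaccard similarity on module memberships. A tiny sketch (toy gene sets built from TFs named in the text; the real computation uses the full module memberships):

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two module member sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

# Toy example: compare a D4 module with a D10 module by member overlap.
d4_module2 = {"Sox17", "Gata4", "Gata6", "Foxa2", "Hnf4a"}
d10_module2 = {"Sox17", "Gata4", "Gata6", "Foxa2", "Hnf4a", "Rxra"}
print(round(jaccard(d4_module2, d10_module2), 2))  # 0.83
```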

Core regulatory modules decompose the mixture populations consistently across time points. ( A ) Jaccard similarity of modules between neighboring time points. Line width represents the Jaccard similarity, and the similarity value is labeled on the line if it is >0.1. ( B ) GO analysis on modules at each time points. Top 100 specifically expressed genes of each module at each time point are selected for GO enrichment analysis. The x -axis represents time points, and the y -axis represents fold enrichment. The size of circles represents the enrichment P -value. ( C ) Mean expression pattern of genes from D10_Module1 and D10_Module4 on developmental stage of seven tissues. ( D ) Mean expression pattern of genes from D10_Module1 and D10_Module4 on RA time course. ( E ) Expression pattern of the glial marker gene Gfap on RA time course. ( F ) Heatmap of D10 normalized TRS on selected specific genes and TFs.

To explore the function of the modules, we extracted the top 100 most specifically expressed genes from each module (Methods) at each time point and performed GO enrichment analysis (Fig. 6B). Epithelium development and cardiovascular system development are enriched in Module 2 at the earlier time points. In Module 2 at D20, digestion is enriched, whereas the enrichment levels of epithelium development and cardiovascular system development decrease. These enriched GO terms suggest that Module 2 is involved in endoderm and mesoderm development, which is consistent with the function of subpopulation 2 from the scRNA-seq data (Duren et al. 2018). Stem cell population maintenance is enriched in Module 3 with a downward trend over time, which is consistent with the expression pattern of Module 3–specific genes in the RA induction and tissue development data (Fig. 5C,D), indicating that the stem cell population is becoming smaller. Neural tube closure is also enriched in Module 3, which suggests that Module 3 contains neural stem cells.

To see the difference between Module 1 and Module 4 in D10, we checked the expression patterns of their genes in the mouse developmental stage data and in the RA time course data (Fig. 6C,D). Across the developmental stages, genes from Module 1 are always highly expressed in the forebrain, whereas genes from Module 4 show an upward trend. These results suggest that both Module 1 and Module 4 are brain-related cell populations, but Module 1 is much earlier than Module 4 in development. In the RA induction time course data, we see a similar pattern (Fig. 6D). The GO enrichment analysis shows that they are enriched in different functions: Module 4 in D10 and D20 is enriched in glial cell differentiation, which is not enriched in any module at the earlier time points, whereas Module 1 is enriched in axon guidance at all time points (Supplemental Fig. S6). Thus, the GO enrichment analysis indicates that Module 1 is neuron related and Module 4 is glia related.

Next, we checked the specific trans-regulators of these two modules (Fig. 6F). We find that genes from Module 1 are regulated by Ebf1, Ascl1, Nr2f1, and Lhx1, whereas genes from Module 4 are regulated by Olig1, Sox10, and Sox8. From the similarity of regulators and the enriched GO terms of the target genes, Module 1 in D10 is very similar to Module 1 in D4; therefore, Module 1 corresponds to the neuron subpopulation. The GO enrichment analysis and module-specific regulators indicate that the newly generated module in D10 (Module 4) is the glial population. If this were true, glial marker genes should be highly expressed in D10 but lowly expressed in D4. We examined the expression pattern of the astrocyte marker gene Gfap, which is specific to Module 4. Indeed, Gfap is very highly expressed in D10 but almost not expressed in D4 (Fig. 6E). From these results, we conclude that Module 4 corresponds to the glial population and emerged between D4 and D10. To further clarify the origin of Module 4, we need to identify the regulators that drive the expression and accessibility changes. We return to this question below.

Driver regulators shed light on cell lineage transition

Under RA treatment, mESCs differentiated (or transitioned) into three different subpopulations after 2 d. To explore the regulatory mechanism behind the transition, we identified driver regulators in each module. Driver regulators are defined as TFs that (1) are up-regulated during the transition from day 0 to day 2 by at least 1.5-fold, and (2) have TRS scores on up-regulated genes that are significantly higher than those on non-up-regulated genes (one-tailed rank-sum test, FDR < 0.05). Figure 7A gives the driver regulators of the three modules at RA day 2. In Module 2, driver regulators include Gata4, Gata6, Rxra, Sox17, Foxa2, and Hnf4a. Sox17, Foxa2, and Hnf4a are involved in endoderm development, and Gata4 and Gata6 are known to be involved in mesoderm development. We find that Rxra, an important cofactor of retinoic acid receptors, is involved in endoderm and mesoderm development. The expression level of Rxra is not very high at day 2 (FPKM 14.47, ranked 3897 of the 7975 dynamically expressed genes, top 48.87%), yet it is identified as a driver regulator. We find that the expression of Rxra is correlated with the up-regulated genes (on ENCODE data, rank-sum test, P = 6.44 × 10−104) (Fig. 7B). Furthermore, the motif of RXRA is enriched in REs associated with up-regulated genes (rank-sum test, P = 1.98 × 10−26) (Fig. 7C). Previous literature indicates that knocking out Rxra leads to an abnormal phenotype in the cardiovascular system (Sucov et al. 1994; Chen et al. 1998; Mascrez et al. 2009). Even though its expression level is not very high, Rxra is likely to be an important driver of the transition from the stem cell state to the mesoderm and endoderm state. In Module 1, the driver regulators are neuron-related factors such as Ascl1, Nr2f1, Pou3f2-4, the Hox family, the Meis family, the Pbx family, Rarb, and others. In Module 3, the driver regulators are Pax6, Dbx1, Gli1, Gli3, and others. Pax6 is known to be important in neural stem cell development. We also find that the developing brain homeobox factor Dbx1 is one of the important drivers.
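The two stated criteria translate directly into code. A sketch with simulated data (scipy.stats.false_discovery_control, used here for the Benjamini-Hochberg correction, requires SciPy ≥ 1.11; the exact FDR procedure used in the paper is not specified in this excerpt):

```python
import numpy as np
from scipy.stats import mannwhitneyu, false_discovery_control

def driver_tfs(trs, fold_change, up_genes, fdr=0.05, min_fc=1.5):
    """Sketch of the stated driver-regulator criteria: a TF qualifies if it is
    up-regulated by at least 1.5-fold AND its TRS on up-regulated genes is
    significantly higher than on other genes (one-tailed rank-sum, FDR < 0.05).
    trs: (n_TF, n_TG); fold_change: (n_TF,); up_genes: boolean (n_TG,)."""
    pvals = np.array([
        mannwhitneyu(row[up_genes], row[~up_genes], alternative="greater").pvalue
        for row in trs
    ])
    qvals = false_discovery_control(pvals)      # Benjamini-Hochberg by default
    return np.where((fold_change >= min_fc) & (qvals < fdr))[0]

rng = np.random.default_rng(5)
up = rng.random(500) < 0.2
trs = rng.random((40, 500)); trs[:5, up] += 0.5    # plant 5 drivers
print(driver_tfs(trs, np.full(40, 2.0), up))
```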

Identification of driver regulators reveals ancestor–descendant fates for regulatory modules. ( A ) Driver regulators of D2_Module1, D2_Module2, and D2_Module3. The x -axis is log 2 fold change; the y -axis is −log 10 ( P -value). ( B ) Distribution comparison of Rxra ’s PCC with up-regulated genes and randomly selected genes. ( C ) Distribution comparison of RXRA’s motif enrichment on REs of up-regulated genes and randomly selected genes. ( D ) Distribution of D10_Module4 driver TF expression (blue bar) and their RE openness (red bar) on D4. Expression or openness greater than the median are labeled as “Expressed,” less than decile are labeled as “Not exp,” and the remaining are labeled as “Low exp.” ( E ) Distribution of Z -score of TRS between top 50 module-specific TFs and driver TFs of D10_Module4. ( F ) Expression distribution of the Module4's driver regulators on D4 scRNA-seq data. Columns represent subpopulations identified from scRNA-seq data.

To determine the origin (ancestor) of the new population in D10, we checked the expression in D4 of the driver regulators of D10_Module4 (Sox8, Nfix, Olig1, Nfib, Hoxc8, Nkx2-2, Foxo6, Cpeb1, Lbx1, Hoxa6, Pou6f1, and Rfx4). We find that 11 of the 12 driver regulators are not yet expressed in D4, where "expressed" means above the median expression level of all genes (Fig. 7D). However, 118 of the 174 (67.82%) REs of those driver TFs are already open in D4 (Fig. 7D). So although the driver TFs are not yet expressed, it is still possible to determine the ancestor of Module 4 by comparing the TRS scores of module-specific TFs targeting those driver TFs. Figure 7E shows the distribution of the normalized TRS scores (Z-scores) of the top 50 module-specific TFs of each module targeting the Module 4 driver TFs. The results show that TFs from Module 3 in D4 have significantly higher TRS scores than those from Module 1 and Module 2 (one-tailed rank-sum test, P-values of 0.0014 and 1.97 × 10−6, respectively), which indicates that the new module in D10 likely arose from Module 3 in D4. To test this hypothesis, we compared the expression of these driver TFs in the scRNA-seq data from D4. The driver regulators are more highly expressed in subpopulation 3 than in subpopulations 1 and 2 (Fig. 7F), which is consistent with our hypothesis, because Module 3 was matched to subpopulation 3 in our previous analysis (Fig. 5E). As further validation, we compared the expression of the neural stem cell markers Sox2 and Nes in the three subpopulations in D4 (Supplemental Fig. S7B,C). In most of the subpopulation 3 cells, Sox2 and Nes are expressed, which indicates that subpopulation 3 is enriched for the neural stem cells from which the glial cells in D10 emerged.

Based on driver regulators, it is possible to construct the developmental path of the subpopulations. We define a transition score between modules at adjacent time points to construct the ancestor–descendant map for the subpopulations in the time course (Methods). The ancestor–descendant mapping results are topologically similar to the mapping based on module similarities (Fig. 6A), except for the transition to Module 4 in D10 driven by the TFs discussed above (Fig. 8). After RA induction, three subpopulations—neuron, mesoderm and endoderm, and neural stem cell—are generated between mESC and D2. The glial subpopulation is generated from the neural stem cells between D4 and D10.

TimeReg suggests the developmental trajectory of the subpopulations for RA-driven lineage transition. Schematic overview of the subpopulations at each stage. The gray line represents the similarity of neighboring regulatory modules. The orange line represents the ancestor–descendant mapping among regulatory modules. TFs on the orange lines represent the important driver regulators causally connecting the modules.

TimeReg identifies novel driver regulators of direct reprogramming from fibroblasts to neurons

In addition to mESC differentiation, we applied TimeReg to study the regulatory network of direct lineage reprogramming, which can convert mouse embryonic fibroblasts (MEFs) to induced neuronal (iN) cells; the underlying mechanism for overcoming epigenetic barriers is fundamental for differentiation and development. A previous study (Wapinski et al. 2013) found that Ascl1 is sufficient to induce neurons from fibroblasts and generated time course gene expression data for MEF-to-iN reprogramming. A follow-up study generated time course chromatin accessibility data (Wapinski et al. 2017), observed rapid chromatin changes in response to Ascl1 at day 5, and identified Zbtb18, Sox8, and Dlx3 as key TFs downstream from Ascl1 for the major transition at day 5. We collected gene expression data and chromatin accessibility data of Ascl1-induced reprogramming from these two previous studies. The two levels of time course data can be paired on day 0, day 2, and day 5, which allows us to run the TimeReg integrative analysis on these iN reprogramming data. We asked how the cell triggers the follow-up signals, remodels the chromatin structure genome-wide, and turns lineage-specific gene expression on and off within 48 h at the early stage.

Core regulatory modules are consistent with single-cell data

We find two well-separated modules in the day 2 network (Fig. 9A). Module 1 contains TFs such as Ascl1, Id2, Sox4, Sox8, Sox11, Dlx3, and Zbtb18, and TGs such as Cplx2, Dner, Nkain1, and Eda2r. Module 2 contains TFs such as Klf4, Tead3, AP1 complex factors, Egr1, and Prrx1, and TGs such as Myof, Emp1, Actn1, and Bst2. In the reprogramming time course data, Module 1–specific genes show an increasing pattern, whereas Module 2–specific genes show a decreasing pattern (Supplemental Fig. S8A). Furthermore, in public ENCODE data, Module 1–specific genes have a high and increasing expression pattern in embryonic brain development and low and decreasing patterns in heart, kidney, liver, and lung tissue development, whereas Module 2–specific genes show the opposite trend (Supplemental Fig. S8B). GO enrichment analysis shows that Module 1–specific genes are enriched in neuron-related functions and Module 2–specific genes are enriched in muscle- and epithelium-related functions, consistent with their expression patterns across developmental stages (Supplemental Fig. S8C). Collectively, TimeReg reveals two functional modules, consistent with our expectation that the neuron subpopulation is induced to become larger while the other populations are repressed and become smaller. This finding is supported by the single-cell RNA-seq data (Treutlein et al. 2016) of this Ascl1 reprogramming, which show two well-separated subpopulations (Fig. 9B). We find that 90.40% of Module 1–specific genes are matched to subpopulation 1 (Fig. 9C) and 88.80% of Module 2–specific genes are matched to subpopulation 2 (each gene mapped to the subpopulation with its maximum expression). In the single-cell RNA-seq data, the neuron-specific gene Ascl1 is specifically expressed in subpopulation 1 and the muscle-specific gene Myof is specifically expressed in subpopulation 2 (Fig. 9D,E). These results show that the modules identified by TimeReg indeed capture subpopulation characteristics and therefore can help us elucidate the regulatory relations within the subpopulations.

TimeReg analysis on direct reprogramming from fibroblast to neuron. ( A ) Heatmap of reordered normalized TRS scores at day 2. The black line represents the detected modules from NMF. ( B ) t-SNE plot of single-cell RNA-seq data. Color represents clustering label. ( C ) Distribution of day 2 top 500 module-specific genes’ maximum-expressed subpopulations from single-cell RNA-seq data. ( D , E ) Expression of module-specific genes on t-SNE plot. ( F , G ) Module 1–specific genes are enriched in ASCL1 ChIP-seq target genes. ( H ) List of driver regulators of Module 1 in day 2.

Neuron-related module-specific genes are enriched in Ascl1 ChIP-seq targets

Because the reprogramming is induced by the pioneer factor Ascl1, the Module 1–specific regulators and target genes should contain Ascl1 target genes. To validate the module-specific regulators and module-specific genes, we compared them with the Ascl1 target genes from ChIP-seq data at day 2. Both Module 1–specific TFs and Module 1–specific target genes overlap significantly with the Ascl1 target genes (Fisher's exact test, P-values of 3.22 × 10−7 and 9.35 × 10−62, odds ratios of 4.90 and 5.02, respectively) (Fig. 9F,G).

Discovery of novel driver regulators

The TimeReg analysis identified eight driver regulators for the neuron-related Module 1 at day 2 (Fig. 9H). Ascl1 ranked first in the list, which means our method successfully detected this super-regulator. Among the other seven driver regulators, six contain Ascl1 ChIP-seq peaks in their regulatory regions. Dlx3 and Sox8 ranked second and third, respectively, consistent with the fact that they were previously reported to be important regulators of reprogramming at day 5 (Wapinski et al. 2017). Beyond these known regulators, our method detected some novel driver regulators. The fourth-ranking regulator is Id2, which was not reported in either of the two previous studies, which analyzed the gene expression and chromatin accessibility levels separately. Id1, an important paralog of Id2, is also identified as a driver regulator. Previous research shows that dimerization of ID1/2 with the bHLH family factor MYOD prevents MYOD from binding to DNA and inhibits muscle differentiation (Sun et al. 1991; Jen et al. 1992). Therefore, activation of Id1/2 by Ascl1 would inhibit muscle differentiation and drive the cell into the neural state, making Id1 and Id2 important driver regulators of the reprogramming process. Another interesting driver regulator is Tcf12, which is known to be involved in the initiation of neuronal differentiation (Rebhan et al. 1997). By integrating gene expression and chromatin accessibility data, TimeReg successfully identified these driver regulators, which may play an important role in reprogramming within 48 h and provide new insights into early regulation.

In this paper, we proposed TimeReg, a time course regulatory analysis tool for paired gene expression and chromatin accessibility data. To reduce dimensionality, TimeReg integrates the expression and chromatin accessibility levels, aggregates the REs and TFs to quantitatively infer TF-TG and RE-TG regulatory strengths, and narrows the analysis down to the regulatory module level by highlighting the important role of driver TFs. The GRN is validated by experimental data, and the core regulatory modules characterize different subpopulations of cells. Based on the driver regulators' sequential expression and binding on chromatin-accessible regions, we causally connect the modules (subpopulations) at adjacent time points and reveal the developmental trajectory of RA-induced differentiation.

Inferring ancestor–descendant fates from developmental time courses is a challenging problem, especially in the presence of large gaps (≥48 h) between time points (Schiebinger et al. 2019). In this study, we found a newly generated population at D10 that was absent at the previous time point, D4. It is challenging to infer ancestor–descendant relations across this 6-d gap in development even with single-cell data at each time point. From the scRNA-seq data on D4, we see that ∼72% of the D10 Module 4–specific genes are highly expressed in subpopulation 1 (only 15% are highly expressed in subpopulation 3) (Supplemental Fig. S7A), but the driver regulators are more highly expressed in subpopulation 3 (Fig. 7F). There are two reasons why Module 4–specific genes are more highly expressed in subpopulation 1 than in subpopulation 3: (1) many expressed genes are shared between neurons and glia, and (2) glia-specific genes are not yet expressed in D4. For these reasons, expression-based analysis alone cannot map the ancestor of Module 4 correctly. By exploiting paired expression and accessibility data in the time course, our method is capable of inferring such mappings reliably.

We note that because knockdown of a TF may cause secondary changes in addition to those directly attributable to the knocked-down TF, ChIP-seq data for that TF would not be expected to predict all of the transcriptional changes. Conversely, ATAC-seq data collected before and after the knockdown can reveal both the changes related to the knocked-down TF and the secondary effects caused by changes in other TFs. Therefore, time course experiments that collect paired RNA-seq and ATAC-seq data before and after knockdown of key TFs will be a powerful approach to gene regulatory analysis.

Finally, we discuss the limitations of our method. Incorporating prior knowledge from external data into our model has allowed us to greatly reduce its complexity. The caveat is that the cellular contexts used in the prior calculation are currently incomplete, which may introduce modeling bias. The validation results reported above show that, in spite of this, our method is already useful for many types of inference and prediction. We expect the bias associated with the use of external data to shrink further as these data become more complete. One possible extension of this work is to identify the union of all modules across multiple time points simultaneously, which would describe the module dynamics more clearly.

Gene regulatory network inference from a single time point by PECA2

We approximate the distribution of the nonzero log2(1 + CRS) values by a normal distribution and predict RE-TG associations by selecting the pairs with P-value < 0.05.
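As a rough sketch of this selection step, assuming a nonnegative RE-by-gene matrix of cis-regulatory scores (the variable names, the placeholder data, and the use of the upper tail are our assumptions; the text above specifies only the normal approximation and the 0.05 cutoff):

```python
import numpy as np
from scipy import stats

# Hypothetical input: crs[i, j] is the cis-regulatory score of RE i on target gene j.
rng = np.random.default_rng(0)
crs = rng.exponential(size=(1000, 200))  # placeholder data for illustration only

log_crs = np.log2(1 + crs)
nonzero = log_crs[log_crs > 0]

# Fit a normal distribution to the nonzero transformed scores.
mu, sigma = nonzero.mean(), nonzero.std(ddof=1)

# Upper-tail P-value of each RE-TG pair under the fitted normal
# (assumed here); keep pairs with P < 0.05 as predicted associations.
pvals = stats.norm.sf(log_crs, loc=mu, scale=sigma)
re_tg_pairs = np.argwhere((log_crs > 0) & (pvals < 0.05))
```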

Regulatory module detection by matrix factorization

Module-specific genes.

Given a gene and its corresponding module, we define a module specificity score for the gene by comparing the normalized TRS of in-module TFs with that of out-module TFs (one-tailed rank-sum test). If the normalized TRS scores of the in-module TFs are significantly higher than those of the out-module TFs, the gene is specific to the module. The module specificity score is defined as the product of −log10(P-value) and the mean TRS score over the TFs in the same module. We take the 500 genes with the highest module specificity scores as module-specific genes. Similarly, we define module-specific TFs by comparing the normalized TRS of in-module genes with that of out-module genes.
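A minimal sketch of this score for a single gene, assuming a vector of normalized TRS values for all TFs acting on that gene and a module label per TF (all names here are hypothetical):

```python
import numpy as np
from scipy.stats import ranksums  # rank-sum (Wilcoxon) test

def module_specificity(trs_gene, tf_modules, module):
    """Module specificity score of one gene for one module.

    trs_gene: normalized TRS of every TF on this gene (1-D array).
    tf_modules: module label of each TF (array, same length as trs_gene).
    module: the module being scored.
    """
    in_mod = trs_gene[tf_modules == module]
    out_mod = trs_gene[tf_modules != module]
    # One-tailed test: are in-module TRS scores higher than out-module scores?
    _, p = ranksums(in_mod, out_mod, alternative='greater')
    # Score = -log10(P) times the mean TRS over in-module TFs.
    return -np.log10(p) * in_mod.mean()
```

Genes would then be ranked by this score within each module and the top 500 taken as module-specific, per the description above.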

Ancestor–descendant mapping via driver TF

Baseline methods for TRS score, identification and quantification of REs, gene specificity and GO analysis, reprogramming data analysis

For the TimeReg analysis of the reprogramming data, we set K to [1, 2, 2] for the three time points, respectively. ASCL1 ChIP-seq target genes are defined as genes with an ASCL1 ChIP-seq peak on their promoters or enhancers. Driver TFs are ranked by −log10(P-value) × fold, where fold is the mean TRS score of the up-regulated genes divided by the mean TRS score of the non-up-regulated genes.
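A minimal sketch of this driver ranking for one TF; the text does not name the test behind the P-value here, so reusing the one-tailed rank-sum test from the module specificity score is our assumption:

```python
import numpy as np
from scipy.stats import ranksums

def driver_tf_score(trs_tf, upregulated):
    """Ranking score for one TF: -log10(P) * fold.

    trs_tf: TRS of this TF across all genes (1-D array).
    upregulated: boolean mask marking the up-regulated genes.
    """
    up, rest = trs_tf[upregulated], trs_tf[~upregulated]
    # Assumed test: one-tailed rank-sum, as in the module specificity score.
    _, p = ranksums(up, rest, alternative='greater')
    # fold = mean TRS on up-regulated genes / mean TRS on the rest.
    fold = up.mean() / rest.mean()
    return -np.log10(p) * fold
```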

Experimental design of retinoic acid–induced mESC differentiation

The mouse ES cell line R1 was obtained from the American Type Culture Collection (ATCC cat. no. SCRC-1036). The mESCs were first expanded on a previously irradiated MEF feeder layer. Subculturing was then carried out on 0.1% bovine gelatin-coated tissue culture plates. Cells were propagated in mESC medium consisting of Knockout DMEM supplemented with 15% knockout serum replacement, 100 µM nonessential amino acids, 0.5 mM beta-mercaptoethanol, 2 mM GlutaMAX, and 100 units/mL penicillin-streptomycin, with the addition of 1000 units/mL LIF (ESGRO, Millipore).

mESCs were differentiated using the hanging drop method (Wang and Yang 2008). Trypsinized cells were suspended in differentiation medium (mESC medium without LIF) to a concentration of 50,000 cells/mL. Then, 20-µL drops (∼1000 cells each) were placed on the lid of a bacterial plate, and the lid was inverted. After 48 h of incubation, the embryoid bodies (EBs) that formed at the bottom of the drops were collected and placed in the wells of a six-well ultralow-attachment plate with fresh differentiation medium containing 0.5 µM retinoic acid for up to 20 d, with the medium changed daily.

We followed the ATAC-seq protocol published by Buenrostro et al. (2013) with the following modifications. The EBs were first treated with 0.25% trypsin + EDTA for 10–15 min at 37°C with pipetting. The pellet was then resuspended in the transposase reaction mix (25 µL 2× TD buffer, 2.5 µL transposase, and 22.5 µL nuclease-free water) and incubated for 30 min at 37°C. After purification, DNA fragments were amplified using a 1:30 dilution of 25 µM Nextera Universal PCR primer and Index primer under the following conditions: 5 min at 72°C; 30 sec at 98°C; then 10 cycles of 10 sec at 98°C, 30 sec at 63°C, and 1 min at 72°C. The libraries for mESC, D2, D4, and D20 were sequenced on a NextSeq with 50-bp paired-end reads; the library for D10 was sequenced on a HiSeq with 100-bp paired-end reads.

Total RNA was extracted using the Qiagen RNeasy mini kit. Libraries were constructed using the NEBNext Ultra RNA Library Prep Kit for Illumina (New England Biolabs) with the following modifications. mRNA was first isolated from 1 µg of total RNA using the NEBNext Poly(A) mRNA Magnetic Isolation Module and then fragmented for 12 min at 94°C before first- and second-strand cDNA synthesis. The double-stranded cDNA was end repaired and ligated with the NEBNext adaptor, followed by AMPure XP bead purification (Beckman Coulter). Each library was amplified using the NEBNext Universal PCR primer and Index primer (for details, see NEBNext multiplex oligos for Illumina [E7335, New England Biolabs]) under the following conditions: 30 sec at 98°C; six cycles of 10 sec at 98°C, 30 sec at 65°C, and 30 sec at 72°C; and a final extension of 5 min at 72°C. Additional PCR cycles (four to six) were necessary to obtain enough DNA for sequencing. Finally, equal amounts of DNA from each library were pooled, and 400-bp fragments were selected on 2% E-Gel SizeSelect gels (Thermo Fisher Scientific) and purified with AMPure XP beads. The pooled library was sequenced on an Illumina HiSeq with 100-bp paired-end reads.

Experimental design of shRNA knockdown in mESC

For plasmids and shRNA, the pSuper.puro vector containing the human H1 RNA promoter for ectopic expression of small hairpin RNA (shRNA) was obtained from Oligoengine. The shRNA constructs for each target were designed using the Clontech RNAi Target Sequence Selector program. DNA oligonucleotides were synthesized by Elim Biopharmaceutical, Inc. To anneal the oligos, the DNA was incubated under the following conditions: 30 sec at 95°C, 2 min at 72°C, 2 min at 37°C, and 2 min at 25°C. The annealed oligo inserts were then ligated into the pSuper.puro vector linearized with BglII and either HindIII or XhoI restriction enzymes. Following transformation, positive clones were identified by digestion with EcoRI and XhoI restriction enzymes and electrophoresis on a 1% agarose gel. In addition, the presence of the correct insert within the pSuper.puro vector was confirmed by sequencing before transfection into mammalian cells (sequencing primer: 5′-AAGATGGCTGTGAGGGACAG). The shRNA constructs used in this study are listed in Supplemental Figure S9A. The primers used to measure target gene knockdown efficiency by qRT-PCR are shown in Supplemental Figure S9B.

RNA interference (RNAi) experiments were performed with Nucleofector technology. Briefly, 12 µg of plasmid DNA was transfected into 3.5 × 10 6 mouse ES cells using the Mouse ES cell Nucleofector kit (Lonza). After nucleofection, the cells were incubated in 500 µL warm ES medium for 15 min. Then, the cells were split into four gelatin-coated 60-mm tissue culture plates containing 5 mL of warm ES medium. Puromycin selection was introduced 18 h later at 1 µg/mL, and the medium was changed daily. Then at 30 h, 48 h, and 72 h after puromycin selection, the cells were collected for RNA isolation.

To minimize genomic DNA contamination, RNA was extracted with the RNeasy mini kit (Qiagen), and on-column DNA digestion was performed using the RNase-free DNase kit (Qiagen). Microarray hybridizations were performed on MouseRef-8 v2.0 expression BeadChip arrays (Illumina). To prepare each sample, 200 ng of total RNA was reverse transcribed, followed by T7 RNA polymerase–based linear amplification using the Illumina TotalPrep RNA Amplification kit (Applied Biosystems). After amplification, 750 ng of biotin-labeled cRNA was hybridized to gene-specific probes attached to the beads, and the expression levels of the transcripts were measured simultaneously.

Software availability

Software and processed data for the RA time course are available on GitHub (https://github.com/SUwonglab/TimeReg, https://github.com/SUwonglab/PECA) and as Supplemental Code.

Data access

All raw and processed sequencing data generated in this study have been submitted to the NCBI Gene Expression Omnibus (GEO; https://www.ncbi.nlm.nih.gov/geo/ ) under accession number GSE136312.

Competing interest statement

The authors declare no competing interests.

Acknowledgments

We thank Lingjie Li, Shining Ma, and Zhanying Feng for helpful discussions. We also thank Miranda Lin Li for her help with the experiments. This work was supported by grants R01HG010359 and P50HG007735 from the National Institutes of Health (NIH). Y.W. was supported by the National Natural Science Foundation of China (NSFC) under grant nos. 11871463 and 61671444 and by the Strategic Priority Research Program of the Chinese Academy of Sciences (no. XDB13000000).

Author contributions: W.H.W. conceived the project. Z.D. designed the analytical approach, performed data analysis with the help of J.X., and wrote the software. X.C. performed all biological experiments. Y.W. and W.H.W. supervised the research. All authors wrote, revised, and contributed to the final manuscript.

[Supplemental material is available for this article.]

Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.257063.119 .

Freely available online through the Genome Research Open Access option.

  • Received September 23, 2019.
  • Accepted March 9, 2020.
  • © 2020 Duren et al.; Published by Cold Spring Harbor Laboratory Press

This article, published in Genome Research , is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/ .

References

  • Bansal M, Gatta GD, Di Bernardo D. 2006. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics 22: 815–822. doi:10.1093/bioinformatics/btl003
  • Brunet JP, Tamayo P, Golub TR, Mesirov JP. 2004. Metagenes and molecular pattern discovery using matrix factorization. Proc Natl Acad Sci 101: 4164–4169. doi:10.1073/pnas.0308531101
  • Buenrostro JD, Giresi PG, Zaba LC, Chang HY, Greenleaf WJ. 2013. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nat Methods 10: 1213–1218. doi:10.1038/nmeth.2688
  • Cao S, Yu S, Li D, Ye J, Yang X, Li C, Wang X, Mai Y, Qin Y, Wu J, et al. 2018. Chromatin accessibility dynamics during chemical induction of pluripotency. Cell Stem Cell 22: 529–542.e5. doi:10.1016/j.stem.2018.03.005
  • Chen J, Kubalak SW, Chien KR. 1998. Ventricular muscle-restricted targeting of the RXRα gene reveals a non-cell-autonomous requirement in cardiac chamber morphogenesis. Development 125: 1943–1949.
  • Duren Z, Chen X, Jiang R, Wang Y, Wong WH. 2017. Modeling gene regulation from paired expression and chromatin accessibility data. Proc Natl Acad Sci 114: E4914–E4923. doi:10.1073/pnas.1704553114
  • Duren Z, Chen X, Zamanighomi M, Zeng W, Satpathy AT, Chang HY, Wang Y, Wong WH. 2018. Integrative analysis of single-cell genomics data by coupled nonnegative matrix factorizations. Proc Natl Acad Sci 115: 7723–7728. doi:10.1073/pnas.1805681115
  • Gorkin DU, Barozzi I, Zhang Y, Lee AY, Li B, Zhao Y, Wildberg A, Ding B, Zhang B, Wang M, et al. 2017. Systematic mapping of chromatin state landscapes during mouse development. bioRxiv doi:10.1101/166652
  • Graham V, Khudyakov J, Ellis P, Pevny L. 2003. SOX2 functions to maintain neural progenitor identity. Neuron 39: 749–765. doi:10.1016/S0896-6273(03)00497-5
  • Guan L, Chen X, Hung Wong W. 2019. Detecting strong signals in gene perturbation experiments: an adaptive approach with power guarantee and FDR control. J Am Stat Assoc doi:10.1080/01621459.2019.1635484
  • Hempel S, Koseska A, Kurths J, Nikoloski Z. 2011. Inner composition alignment for inferring directed networks from short time series. Phys Rev Lett 107: 054101. doi:10.1103/PhysRevLett.107.054101
  • Jen Y, Weintraub H, Benezra R. 1992. Overexpression of Id protein inhibits the muscle differentiation program: in vivo association of Id with E2A proteins. Genes Dev 6: 1466–1479. doi:10.1101/gad.6.8.1466
  • Kinney JB, Atwal GS. 2014. Equitability, mutual information, and the maximal information coefficient. Proc Natl Acad Sci 111: 3354–3359. doi:10.1073/pnas.1309933111
  • Li L, Wang Y, Torkelson JL, Shankar G, Pattison JM, Zhen HH, Fang F, Duren Z, Xin J, Gaddam S, et al. 2019. TFAP2C- and p63-dependent networks sequentially rearrange chromatin landscapes to drive human epidermal lineage commitment. Cell Stem Cell 24: 271–284.e8. doi:10.1016/j.stem.2018.12.012
  • Liu Q, Jiang C, Xu J, Zhao MT, Van Bortle K, Cheng X, Wang G, Chang HY, Wu JC, Snyder MP. 2017. Genome-wide temporal profiling of transcriptome and open chromatin of early cardiomyocyte differentiation derived from hiPSCs and hESCs. Circ Res 121: 376–391. doi:10.1161/CIRCRESAHA.116.310456
  • Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A. 2006. ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 7: S7. doi:10.1186/1471-2105-7-S1-S7
  • Mascrez B, Ghyselinck NB, Chambon P, Mark M. 2009. A transcriptionally silent RXRα supports early embryonic morphogenesis and heart development. Proc Natl Acad Sci 106: 4272–4277. doi:10.1073/pnas.0813143106
  • Miraldi ER, Pokrovskii M, Watters A, Castro DM, De Veaux N, Hall JA, Lee JY, Ciofani M, Madar A, Carriero N, et al. 2019. Leveraging chromatin accessibility for transcriptional regulatory network inference in T helper 17 cells. Genome Res 29: 449–463. doi:10.1101/gr.238253.118
  • Mumbach MR, Satpathy AT, Boyle EA, Dai C, Gowen BG, Cho SW, Nguyen ML, Rubin AJ, Granja JM, Kazane KR, et al. 2017. Enhancer connectome in primary human cells identifies target genes of disease-associated DNA elements. Nat Genet 49: 1602–1612. doi:10.1038/ng.3963
  • Nishino J, Kim I, Chada K, Morrison SJ. 2008. Hmga2 promotes neural stem cell self-renewal in young but not old mice by reducing p16Ink4a and p19Arf expression. Cell 135: 227–239. doi:10.1016/j.cell.2008.09.017
  • Perrin BE, Ralaivola L, Mazurie A, Bottani S, Mallet J, d'Alché-Buc F. 2003. Gene networks inference using dynamic Bayesian networks. Bioinformatics 19: ii138–ii148. doi:10.1093/bioinformatics/btg1071
  • Ramirez RN, El-Ali NC, Mager MA, Wyman D, Conesa A, Mortazavi A. 2017. Dynamic gene regulatory networks of human myeloid differentiation. Cell Syst 4: 416–429.e3. doi:10.1016/j.cels.2017.03.005
  • Rebhan M, Chalifa-Caspi V, Prilusky J, Lancet D. 1997. GeneCards: integrating information about genes, proteins and diseases. Trends Genet 13: 163. doi:10.1016/S0168-9525(97)01103-7
  • Sansom SN, Griffiths DS, Faedo A, Kleinjan DJ, Ruan Y, Smith J, Van Heyningen V, Rubenstein JL, Livesey FJ. 2009. The level of the transcription factor Pax6 is essential for controlling the balance between neural stem cell self-renewal and neurogenesis. PLoS Genet 5: e1000511. doi:10.1371/journal.pgen.1000511
  • Schiebinger G, Shu J, Tabaka M, Cleary B, Subramanian V, Solomon A, Gould J, Liu S, Lin S, Berube P, et al. 2019. Optimal-transport analysis of single-cell gene expression identifies developmental trajectories in reprogramming. Cell 176: 928–943.e22. doi:10.1016/j.cell.2019.01.006
  • Storey JD, Xiao W, Leek JT, Tompkins RG, Davis RW. 2005. Significance analysis of time course microarray experiments. Proc Natl Acad Sci 102: 12837–12842. doi:10.1073/pnas.0504609102
  • Su Y, Shin J, Zhong C, Wang S, Roychowdhury P, Lim J, Kim D, Ming GL, Song H. 2017. Neuronal activity modifies the chromatin accessibility landscape in the adult brain. Nat Neurosci 20: 476–483. doi:10.1038/nn.4494
  • Sucov HM, Dyson E, Gumeringer CL, Price J, Chien KR, Evans RM. 1994. RXRα mutant mice establish a genetic basis for vitamin A signaling in heart morphogenesis. Genes Dev 8: 1007–1018. doi:10.1101/gad.8.9.1007
  • Sun XH, Copeland NG, Jenkins NA, Baltimore D. 1991. Id proteins Id1 and Id2 selectively inhibit DNA binding by one class of helix-loop-helix proteins. Mol Cell Biol 11: 5603–5611. doi:10.1128/MCB.11.11.5603
  • Thomas PD, Campbell MJ, Kejariwal A, Mi H, Karlak B, Daverman R, Diemer K, Muruganujan A, Narechania A. 2003. PANTHER: a library of protein families and subfamilies indexed by function. Genome Res 13: 2129–2141. doi:10.1101/gr.772403
  • Treutlein B, Lee QY, Camp JG, Mall M, Koh W, Shariati SAM, Sim S, Neff NF, Skotheim JM, Wernig M, et al. 2016. Dissecting direct reprogramming from fibroblast to neuron using single-cell RNA-seq. Nature 534: 391–395. doi:10.1038/nature18323
  • Wang X, Yang P. 2008. In vitro differentiation of mouse embryonic stem (mES) cells using the hanging drop method. J Vis Exp e825. doi:10.3791/825
  • Wang Y, Joshi T, Zhang XS, Xu D, Chen L. 2006. Inferring gene regulatory networks from multiple microarray datasets. Bioinformatics 22: 2413–2420. doi:10.1093/bioinformatics/btl396
  • Wang RS, Zhang S, Wang Y, Zhang XS, Chen L. 2008. Clustering complex networks and biological networks by nonnegative matrix factorization with various similarity measures. Neurocomputing 72: 134–141. doi:10.1016/j.neucom.2007.12.043
  • Wang S, Sun H, Ma J, Zang C, Wang C, Wang J, Tang Q, Meyer CA, Zhang Y, Liu XS. 2013. Target analysis by integration of transcriptome and ChIP-seq data with BETA. Nat Protoc 8: 2502–2515. doi:10.1038/nprot.2013.150
  • Wapinski OL, Vierbuchen T, Qu K, Lee QY, Chanda S, Fuentes DR, Giresi PG, Ng YH, Marro S, Neff NF, et al. 2013. Hierarchical mechanisms for direct reprogramming of fibroblasts to neurons. Cell 155: 621–635. doi:10.1016/j.cell.2013.09.028
  • Wapinski OL, Lee QY, Chen AC, Li R, Corces MR, Ang CE, Treutlein B, Xiang C, Baubet V, Suchy FP, et al. 2017. Rapid chromatin switch in the direct reprogramming of fibroblasts to neurons. Cell Rep 20: 3236–3247. doi:10.1016/j.celrep.2017.09.011
  • Zeng W, Chen X, Duren Z, Wang Y, Jiang R, Wong WH. 2019. DC3 is a method for deconvolution and coupled clustering from bulk and single-cell genomics data. Nat Commun 10: 4613. doi:10.1038/s41467-019-12547-1
  • Zou M, Conzen SD. 2005. A new dynamic Bayesian network (DBN) approach for identifying gene regulatory networks from time course microarray data. Bioinformatics 21: 71–79. doi:10.1093/bioinformatics/bth463
