Section 2: Replicate-Experiment Study
From Assay Guidance Wiki
Overview
It is important to verify that the assay results are reproducible, i.e. that the variability of the key endpoints of the assay is acceptably low. In addition, if the results of the assay are to be combined with those previously reported by another assay, then it should be verified that the two labs produce equivalent results. In this section, we define how to quantify assay variability and determine assay equivalence. It is important to read the entire section below to understand the rationale for the statistical methods employed in calculating reproducibility of potency and efficacy. We strongly recommend consulting a statistician before designing the experiments described below to estimate variability.
Rationale
Replicate-Experiment studies are used to formally evaluate the within-run assay variability and formally compare the new assay to the existing (old) assay. They also allow a preliminary assessment of the overall or between-run assay variability, but two runs are not enough to adequately assess overall variability. Post-production methods (Section III) are used to formally evaluate the overall variability in the assay. Note that the Replicate-Experiment study is a diagnostic and decision tool used to establish that the assay is ready to go into production by showing that the endpoints of the assay are reproducible over a range of potencies. It is not intended as a substitute for post-production monitoring or to provide an estimate of the overall Minimum Significant Ratio (MSR).
It may seem counter-intuitive to call the differences between two independent assay runs “within-run”. However, the terminology results from the way those terms are defined. Experimental variation is categorized into two distinct components: between-run and within-run sources. Consider the following examples:
- If there is variation in the concentrations of buffer components between 2 runs then the assay results could be affected. However, assuming that the same buffer is used with all compounds within the run, each compound will be equally affected and so the difference will only show up when comparing one run to another run, i.e. in two runs one run will appear higher on average than the other run. This variation is called between-run variation.
- If the concentration of the compound in the stock plate varies from the target concentration then all wells where that compound is used will be affected. However, wells used to test other compounds will be unaffected. This type of variation is called within-run as the source of variation affects different compounds in the same run differently.
- Some sources of variability affect both within- and between-run variation. For example, in a FLIPR assay cells are plated and then incubated for 24-72 hours to achieve a target cell density, taking into account the doubling time of the cells. If the doubling time equals the incubation time and the target density is 30,000 cells/well, then 15,000 cells/well are plated. But even if exactly 15,000 cells are placed in each well, there won’t be exactly 30,000 cells in each well after 24 hours. Some wells will be lower and some will be higher than the target. These differences are within-run, as not all wells are equally affected. Now suppose that in a particular run only 13,000 cells/well are initially plated. Then the wells will on average have fewer than 30,000 cells after 24 hours, and since all wells are affected this is between-run variation. Thus cell density has both within- and between-run sources of variation.
The total variation is the sum of both sources of variation. When comparing two compounds across runs, one must take into account both the within-run and between-run sources of variation. But when comparing two compounds in the same run, one need only take into account the within-run sources, since, by definition, the between-run sources affect both compounds equally.
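One way to write this down is sketched below. It assumes the two sources are independent and additive on the analysis scale (log-potency or efficacy); the symbols are introduced here for illustration and are not taken from the original text. Let y_{ij} be the result for compound i in run j, b_j the between-run effect with variance σ_b², and e_{ij} the within-run effect with variance σ_w².

```latex
\begin{align*}
y_{ij} &= \mu_i + b_j + e_{ij} \\
\operatorname{Var}(y_{ij}) &= \sigma_b^2 + \sigma_w^2
  && \text{(total variation)} \\
\operatorname{Var}(y_{ij} - y_{kj}) &= 2\sigma_w^2
  && \text{(two compounds, same run: } b_j \text{ cancels)} \\
\operatorname{Var}(y_{ij} - y_{kl}) &= 2\left(\sigma_b^2 + \sigma_w^2\right)
  && \text{(two compounds, different runs)}
\end{align*}
```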
In a Replicate-Experiment study the between-run sources of variation cause one run to be on average higher than the other run. However, it would be very unlikely that the difference between the two runs were exactly the same for every compound in the study. These individual compound “differences from the average difference” are caused by the within-run sources of variation. The higher the within-run variability the greater the individual compound variation in the assay runs.
The analysis approach used in the Replicate-Experiment study is to estimate and factor out between-run variability, and then estimate the magnitude of within-run variability.
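The idea can be illustrated with a small simulation. This is a sketch under stated assumptions, not part of the original procedure: the function name, the default values, and the normal noise model are all hypothetical. A constant run shift moves the mean of the per-compound differences but leaves their standard deviation untouched, which is why the standard deviation of the differences isolates the within-run component.

```python
import random
import statistics

def simulate_two_runs(n_compounds=25, run_shift=0.10, within_sd=0.10, seed=0):
    """Simulate log10 potencies for the same compounds in two runs.

    run_shift : between-run effect, added to every compound in run 2
    within_sd : SD of within-run noise, drawn independently per compound and run
    (All values are hypothetical and chosen only for illustration.)
    """
    rng = random.Random(seed)
    true_log = [rng.uniform(-9.0, -5.0) for _ in range(n_compounds)]
    run1 = [t + rng.gauss(0.0, within_sd) for t in true_log]
    run2 = [t + run_shift + rng.gauss(0.0, within_sd) for t in true_log]
    return run1, run2

run1, run2 = simulate_two_runs()
diffs = [a - b for a, b in zip(run1, run2)]

# The mean difference estimates the between-run shift (here -run_shift).
mean_diff = statistics.mean(diffs)
# The SD of the differences depends only on the within-run noise.
sd_diff = statistics.stdev(diffs)
print(f"mean difference = {mean_diff:.3f}, SD of differences = {sd_diff:.3f}")
```

Rerunning with a different `run_shift` and the same seed changes `mean_diff` but not `sd_diff`.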
Procedure (Steps)
All assays should have a reproducibility comparison (Steps 1-3). If the assay is to replace an existing assay and combine the data then an assay comparison study should also be done (Steps 4 and 5).
- Select 20-30 compounds that have potencies covering the concentration range being tested and, if applicable, efficacy measures that cover the range of interest. The compounds should be well spaced over these ranges.
- All of the compounds should be run in each of two runs of the assay.
- Compare the two runs as per Section D.3-D.6.
- All compounds should be run in a single run of the previous assay.
- Compare the results of the two labs by analyzing the first run of the new assay with the single run of the previous assay.
Analysis (Potency)
For the reproducibility comparison, paste the potency values from the two runs into the Run 1 and Run 2 data columns. All tests are conducted by the spreadsheet, and there are additional plots and diagnostics available to assist in judging the results. For the assay comparison study, paste the potency values for the first run of the new assay into the Run 1 column and the potency values for the (single) run of the previous assay into the Run 2 column. Potency values should be calculated according to the methods of Section III.
The points below describe and define the terms used in the template and the acceptance criterion discussed in the Diagnostic Tests section below.
- Compute the difference in log-potency (= first – second) between the first and second run for each compound. Let $\bar{d}$ and $s_d$ be the sample mean and standard deviation of the difference in log-potency. Since ratios of EC50 values (relative potencies) are more meaningful than differences in potency (1 and 3, 10 and 30, 100 and 300 have the same ratio but not the same difference), we take logs in order to analyze ratios as differences.
- Compute the Mean-Ratio: $MR = 10^{\bar{d}}$. This is the geometric average fold difference in potency between two runs.
- Compute the Ratio Limits: $RLs = 10^{\bar{d} \pm 2 s_d / \sqrt{n}}$, where n is the number of compounds. This is the 95% confidence interval for the Mean-Ratio.
- Compute the Minimum Significant Ratio: $MSR = 10^{2 s_d}$. This is the smallest potency ratio between two compounds that is statistically significant.
- Compute the Limits of Agreement: $LsA = 10^{\bar{d} \pm 2 s_d}$. Most of the compound potency ratios (approximately 95%) should fall within these limits.
- For each compound compute the Ratio (= first/second) of the two potencies, and the Geometric Mean potency: $GM = \sqrt{\text{first} \times \text{second}}$.
Items 2-6 can be combined into one plot: the Ratio-GM plot. An example is in Figure 1. The points represent the compounds; the blue-solid, green long-dashed and red short-dashed lines represent the MR, RLs and LsA values respectively.
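The six quantities above can be computed in a few lines. The following is a minimal sketch, not the spreadsheet template itself. It assumes base-10 logs and the commonly used multiplier of 2 for the 95% limits; the function name and the example potencies are hypothetical.

```python
import math
import statistics

def potency_agreement(run1, run2):
    """Replicate-Experiment statistics for paired potencies (same compounds, same units).

    Assumes base-10 logs and a multiplier of 2 for the 95% limits.
    """
    if len(run1) != len(run2) or len(run1) < 2:
        raise ValueError("need the same compounds (at least 2) in both runs")
    # Item 1: per-compound difference in log-potency (first - second).
    diffs = [math.log10(a) - math.log10(b) for a, b in zip(run1, run2)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    half_ci = 2 * s_d / math.sqrt(n)
    return {
        "MR": 10 ** d_bar,                                          # item 2
        "RLs": (10 ** (d_bar - half_ci), 10 ** (d_bar + half_ci)),  # item 3
        "MSR": 10 ** (2 * s_d),                                     # item 4
        "LsA": (10 ** (d_bar - 2 * s_d), 10 ** (d_bar + 2 * s_d)),  # item 5
        "ratio": [a / b for a, b in zip(run1, run2)],               # item 6
        "GM": [math.sqrt(a * b) for a, b in zip(run1, run2)],       # item 6
    }

# Hypothetical EC50 values (nM) for five compounds in two runs.
stats = potency_agreement([12.0, 150.0, 3.1, 870.0, 45.0],
                          [10.0, 180.0, 2.6, 990.0, 39.0])
print({k: v for k, v in stats.items() if k in ("MR", "RLs", "MSR", "LsA")})
```

Plotting `ratio` against `GM` on log-log axes, with horizontal lines at `MR`, `RLs` and `LsA`, gives the Ratio-GM plot.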
Figure 1 shows the desired result of pure chance variation in the difference in activities between runs. The blue solid line shows the geometric mean potency ratio, i.e. the average relationship between the first and second run. The green long-dashed lines show the 95% confidence limits of the mean ratio. These limits should contain the value 1.0, as they do in this case. The red short-dashed lines indicate the limits of agreement between runs. They indicate the individual compound variation between the first and second run. You should see all, or almost all, of the points fall within the red dashed lines. The lower line should be above 0.33 and the upper line should be below 3.0, i.e. individual compounds should differ by less than 3-fold between runs in either direction. The MSR should be less than 3.0, as it is in this example.

Figure 1. Potency Ratio versus GM Potency. This is a typical example for an acceptable assay: The MR=0.90, RLs=(0.78-1.03) [contains the value 1.0], MSR=1.86 [under 3.0], LsA=(0.48-1.67) [between 0.33 and 3.0].
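As a quick consistency check on the values in the caption, and assuming the Limits of Agreement are defined as the Mean-Ratio divided and multiplied by the MSR (the usual definition on the log scale), the reported limits follow directly from the reported MR and MSR:

```latex
\begin{align*}
LsA_{\text{lower}} &= \frac{MR}{MSR} = \frac{0.90}{1.86} \approx 0.48 \\
LsA_{\text{upper}} &= MR \times MSR = 0.90 \times 1.86 \approx 1.67
\end{align*}
```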
Diagnostic Tests and Acceptance Criterion (Potency)
- If the MSR ≥ 3 then there is poor individual agreement between the two runs. This problem occurs when the within-run variability of the assay is too high. See Figure 2(a) below for an illustration. An assay meets the MSR acceptance criterion if the (within-run) MSR < 3.
- If the Ratio Limits do not contain the value 1, then there is a statistically significant average difference between the two runs. Within a lab (Step 3) this is due to high between-run assay variability. Between labs (Step 5), this could be due to a systematic difference between labs, or high between-run variability in one or both labs. See Figure 2(b) below for an illustration. Note that it is possible with a very “tight” assay (i.e. one with a very low MSR) or with a large set of compounds to have a statistically significant result for this test that is not very material, i.e., the actual MR is small enough to be ignorable. If the result is statistically significant then examine the MR. If it is between 0.67 and 1.5 then the average difference between runs is less than 50% and is deemed immaterial. However, in Figure 2(b) the MR=2.01, indicating a 101% difference between runs, which is too high to be considered “equivalent”. Note that there is no direct requirement for the MR, but values that are this extreme are unlikely to pass the Limits of Agreement criterion in step 3 below.
- The MR and the MSR are combined into a single interval referred to as the Limits of Agreement. An assay that has a high MSR, an MR different from 1, or both will tend to have poor agreement of results between the two runs. An assay meets the Limits of Agreement acceptance criterion if both the upper and lower limits of agreement are between 0.33 and 3.0. Note that the assays depicted in Figures 2(a) and 2(b) do not have Limits of Agreement inside the acceptance region and thus do not meet the acceptance criterion.
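The three diagnostics can be expressed as simple checks on the summary statistics. The sketch below is illustrative only: the function name and result keys are hypothetical, and treating the 0.33 and 3.0 bounds as strict inequalities is an assumption, since the text does not say how a value exactly on the boundary is handled.

```python
def potency_acceptance(mr, rls, msr, lsa):
    """Apply the potency diagnostics to precomputed summary statistics.

    mr  : Mean-Ratio
    rls : (lower, upper) Ratio Limits
    msr : Minimum Significant Ratio
    lsa : (lower, upper) Limits of Agreement
    """
    rl_low, rl_high = rls
    la_low, la_high = lsa
    # Test 2: Ratio Limits that exclude 1 indicate a significant average shift.
    shift_significant = not (rl_low <= 1.0 <= rl_high)
    return {
        # Test 1: within-run MSR criterion.
        "msr_ok": msr < 3.0,
        # Test 3: both Limits of Agreement inside (0.33, 3.0); strictness is assumed.
        "lsa_ok": la_low > 0.33 and la_high < 3.0,
        "mean_shift_significant": shift_significant,
        # A significant shift is deemed immaterial when MR is between 0.67 and 1.5.
        "mean_shift_material": shift_significant and not (0.67 <= mr <= 1.5),
    }

# Values reported in the Figure 1 caption.
figure1 = potency_acceptance(mr=0.90, rls=(0.78, 1.03), msr=1.86, lsa=(0.48, 1.67))
print(figure1)
```

For the Figure 1 values, both acceptance criteria are met and the average shift is not statistically significant.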


Analysis (Efficacy)
The points below describe and define the terms used in the template and the acceptance criterion discussed in the Diagnostic Tests section. Note that the methods described here are intended for functional full/partial agonist assays and non-competitive antagonist assays. Some potentiator assays, as well as assays normalized by fold stimulation, may best be analyzed with the techniques described in the potency section rather than the methods described here. Consult a statistician for the best method of analysis.
- Compute the difference in efficacy (= first – second) between the first and second run for each compound. Let $\bar{d}$ and $s_d$ be the sample mean and standard deviation of the difference in efficacy.
- Compute the Mean-Difference: $MD = \bar{d}$. This is the average difference in efficacy between the two runs.
- Compute the Difference Limits: $DLs = \bar{d} \pm 2 s_d / \sqrt{n}$, where n is the number of compounds. This is a 95% confidence interval for the Mean-Difference.
- Compute the Minimum Significant Difference: $MSD = 2 s_d$. This is the smallest efficacy difference between two compounds that is statistically significant.
- Compute the Limits of Agreement: $LsA = \bar{d} \pm 2 s_d$. Most of the compound efficacy differences (approximately 95%) should fall within these limits.
- For each compound compute the Difference (= first – second) of the two efficacies, and the Mean efficacy (average of first and second).
Items 2-6 can be combined onto one plot: the Difference-Mean plot (not shown). The plot is very similar to the Ratio-GM plot except that both axes are on the linear scale instead of the log scale.
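The efficacy quantities mirror the potency ones but stay on the linear scale. The following is a minimal sketch, assuming the commonly used multiplier of 2 for the 95% limits; the function name and the example efficacies are hypothetical.

```python
import math
import statistics

def efficacy_agreement(run1, run2):
    """Replicate-Experiment statistics for paired efficacies (e.g. % maximum response).

    Works on the linear scale; assumes a multiplier of 2 for the 95% limits.
    """
    if len(run1) != len(run2) or len(run1) < 2:
        raise ValueError("need the same compounds (at least 2) in both runs")
    # Item 1: per-compound difference in efficacy (first - second).
    diffs = [a - b for a, b in zip(run1, run2)]
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    half_ci = 2 * s_d / math.sqrt(n)
    return {
        "MD": d_bar,                               # item 2
        "DLs": (d_bar - half_ci, d_bar + half_ci), # item 3
        "MSD": 2 * s_d,                            # item 4
        "LsA": (d_bar - 2 * s_d, d_bar + 2 * s_d), # item 5
        "difference": diffs,                       # item 6
        "mean": [(a + b) / 2 for a, b in zip(run1, run2)],  # item 6
    }

# Hypothetical % efficacy values for five compounds in two runs.
stats = efficacy_agreement([98.0, 75.0, 52.0, 88.0, 31.0],
                           [95.0, 79.0, 50.0, 91.0, 28.0])
print({k: v for k, v in stats.items() if k in ("MD", "DLs", "MSD", "LsA")})
```

Plotting `difference` against `mean` on linear axes, with horizontal lines at `MD`, `DLs` and `LsA`, gives the Difference-Mean plot.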
Diagnostic Tests (Efficacy)
Generally the same two problems discussed under potency need to be judged for efficacy as well. However, a general acceptance criterion for efficacy has not been established, as there is no consensus on efficacy standards, and for most projects potency is the primary property of interest. As guidelines, the MD should be less than 5 (i.e., less than 5% average difference between runs) and the MSD should be less than 20 (e.g., 20% activity). More importantly, the MD and MSD should be used to judge the appropriateness of any efficacy CSFs a project may have. For example, if the CSF for efficacy is >80%, and the MSD is 30%, then the assay will fail too many efficacious compounds - a 90%-active compound would fall below the CSF 25% of the time. A more appropriate CSF in this situation would be 70% or even 60%.
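The 25% figure in the example can be reproduced with a normal-distribution calculation. This is a sketch under explicit assumptions that the text does not state: measured efficacy is taken to be normally distributed around the true value, and the standard deviation is taken to be MSD/2 = 15. The function name is hypothetical.

```python
import math

def prob_below_threshold(true_value, threshold, sd):
    """P(measured value < threshold) for a normally distributed measurement."""
    z = (threshold - true_value) / sd
    # Standard normal CDF written with the error function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

msd = 30.0
sd_assumed = msd / 2.0  # ASSUMPTION: SD implied by the MSD; not stated in the text

# A 90%-active compound measured against CSFs of 80%, 70% and 60%.
p_fail_80 = prob_below_threshold(90.0, 80.0, sd_assumed)
p_fail_70 = prob_below_threshold(90.0, 70.0, sd_assumed)
p_fail_60 = prob_below_threshold(90.0, 60.0, sd_assumed)
print(f"{p_fail_80:.3f} {p_fail_70:.3f} {p_fail_60:.3f}")
```

Under these assumptions the failure rate is about 25% at a CSF of 80%, falling to under 10% at 70% and under 3% at 60%, which is the motivation for relaxing the CSF.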
Summary of Acceptance Criteria
- In Step 3 conduct reproducibility and equivalence tests for potency comparing the two runs in the new lab. The assay should pass both tests (MSR < 3 and both Limits of Agreement should be between 0.33 and 3.0).
- In Step 5 conduct reproducibility and equivalence tests for potency comparing the first run of the new lab to the single run of the old lab. The assays should pass both tests to be declared equivalent (Limits of Agreement between 0.33 and 3.0).
- For full/partial agonist assays and non-competitive antagonist assays, repeat points 1 and 2 for efficacy. Use the informal guidelines discussed above, and project efficacy CSFs to judge acceptability of results.
Notes
- If a project is very new, there may not be 20-30 unique active compounds (where active means some measurable activity above the minimum threshold of the assay). In that case it is acceptable to run compounds more than once to get an acceptable sample size. For example, if there are only 10 active compounds then run each compound twice. However, when doing so, (a) it is important to biologically evaluate them as though they were different compounds, including the preparation of separate serial dilutions, and (b) label the compounds “a”, “b” etc. so that it is clear in the test-retest analyses which results are being compared across runs.
- Functional assays need to be compared for both potency (EC50) and efficacy (%maximum response). This may well require a few more compounds in those cases.
- In binding assays, it is best to compare Ki’s, and in functional antagonist assays it is best to compare Kb’s.
- An assay may pass the reproducibility assessment (Steps 1-3 in the procedure [Section D.2]) but fail the assay comparison study (Steps 4-5 in the procedure [Section D.2]). The assay comparison study may fail either because of an MR different from 1 or because of a high “MSR” in the assay comparison study. If it is the former, then there is a potency shift between the assays. You should assess the values in the assays to ascertain their validity (e.g. which assay’s results compare best to those reported in the literature?). If it fails because the “MSR” in the assay comparison study is too large (but the new assay passes the reproducibility study), then the old assay lacks reproducibility. In either case, if the problem is with the old assay, then the team should consider rerunning key compounds in the new assay to provide results comparable to those of compounds subsequently run in the new assay.

















