The criteria for evaluating research studies often include large sample size. It is assumed that studies with large sample sizes are more meaningful than those with fewer participants. This chapter explores biases associated with the traditional application of null hypothesis testing. Statisticians now challenge the idea that retention of the null hypothesis signifies that a treatment is not effective. A finding associated with an exact probability value of p = 0.049 is not meaningfully different from one in which p = 0.051. Yet the interpretation of these two studies can differ dramatically, including in the likelihood of publication. Large studies are not necessarily more accurate or less biased; in fact, biases in sampling strategy are amplified in studies with large sample sizes. These problems are of increasing concern in the era of big data and the analysis of electronic health records. Studies that are overpowered (because of very large sample sizes) are capable of identifying statistically significant differences that are of no clinical importance.
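To make the final point concrete, here is a minimal sketch (invented numbers, not taken from the chapter) showing that with a very large sample even a clinically negligible difference yields a small p-value:

```python
# Illustration only: a true standardized difference of d = 0.01 is clinically
# negligible, yet with 500,000 participants per group it is routinely "significant".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 500_000
control = rng.normal(0.00, 1.0, n)   # standardized outcome, no effect
treated = rng.normal(0.01, 1.0, n)   # tiny true effect

t, p = stats.ttest_ind(treated, control)
print(f"d = 0.01, n = {n:,} per group -> t = {t:.2f}, p = {p:.2g}")
```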
It is a long-standing view that educational institutions sustain democracy by building an engaged citizenry. However, recent scholarship has seriously questioned whether going to college increases political participation. While these studies have been ingenious in using natural experiments to credibly estimate the causal effect of college, most have produced estimates with high statistical uncertainty. I contend that college matters: I argue that, taken together, prior effect estimates are as compatible with a positive effect as with a null effect. Furthermore, analyzing two panel datasets of $n \approx 10,000$ young US voters with a well-powered difference-in-differences design, I find that attending college leads to a substantive increase in voter turnout. Importantly, these findings are consistent with the statistically uncertain but positive estimates in previous studies. This calls for updating our view of the education-participation relationship, suggesting that statistical uncertainty in prior studies may have concealed that college education has substantive civic returns.
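For readers unfamiliar with the design, here is a minimal difference-in-differences sketch on simulated two-wave panel data; the variable names and the five-point effect are illustrative assumptions, not the paper's data or code.

```python
# Simulated two-wave panel: a pre-existing turnout gap (selection into college)
# plus a 5-point causal effect of college in the post period.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(7)
n = 10_000
college = rng.binomial(1, 0.5, n)

base = 0.35 + 0.10 * college                          # selection: baseline gap
pre  = rng.binomial(1, base)                          # wave 1 turnout
post = rng.binomial(1, base + 0.05 + 0.05 * college)  # common trend + college effect

df = pd.DataFrame({
    "person":  np.repeat(np.arange(n), 2),
    "post":    np.tile([0, 1], n),
    "college": np.repeat(college, 2),
    "turnout": np.column_stack([pre, post]).ravel(),
})

# The coefficient on post:college is the difference-in-differences estimate
# (about 0.05 here); standard errors are clustered by person.
did = smf.ols("turnout ~ post * college", data=df).fit(
    cov_type="cluster", cov_kwds={"groups": df["person"]})
print(did.params["post:college"], did.bse["post:college"])
```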
In this article, we highlight how simulation methods can be used to analyze the power of economic experiments. We provide the powerBBK package, programmed for experimental economists, which can be used to perform such simulations in STATA. Power can be simulated using a single command line for various statistical tests (nonparametric and parametric), estimation methods (linear, binary, and censored regression models), treatment variables (binary, continuous, time-invariant or time-varying), sample sizes, experimental periods, and other design features (within- or between-subjects designs). The package can be used to predict the minimum sample size required to reach a user-specified level of power, to maximize the power of a design given a researcher-supplied budget constraint, or to compute the power to detect a user-specified treatment order effect in within-subjects designs. The package can also be used to compute the probability of sign errors, that is, the probability of rejecting the null hypothesis in the wrong direction, as well as the share of rejections pointing in the wrong direction. The powerBBK package is provided as an .ado file along with a help file, both of which can be downloaded here (http://www.bbktools.org).
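The underlying idea can be sketched outside of STATA as well. The following Python example (a generic illustration of simulation-based power, not the powerBBK syntax) simulates an experiment repeatedly, counts how often the test rejects, and scans candidate sample sizes; all parameter values are assumptions.

```python
import numpy as np
from scipy import stats

def simulated_power(n_per_arm, effect_size, n_sims=2000, alpha=0.05, seed=0):
    """Share of simulated experiments in which a two-sample t-test rejects H0."""
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        control = rng.normal(0.0, 1.0, n_per_arm)
        treated = rng.normal(effect_size, 1.0, n_per_arm)
        rejections += stats.ttest_ind(treated, control)[1] < alpha
    return rejections / n_sims

# Scan sample sizes to find the smallest reaching ~80% power for d = 0.3.
for n in (100, 150, 200, 250):
    print(n, simulated_power(n, effect_size=0.3))
```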
Behavioral and psychological researchers have shown strong interest in investigating contextual effects (i.e., the influences of combinations of individual- and group-level predictors on individual-level outcomes). The present research provides generalized formulas for determining the sample size needed to investigate contextual effects at a desired level of statistical power and width of confidence interval. These formulas are derived within a three-level random intercept model that includes one predictor/contextual variable at each level, so as to simultaneously cover the various kinds of contextual effects in which researchers may be interested. The relative influences of the indices included in the formulas on the standard errors of contextual effect estimates are investigated with the aim of further simplifying sample size determination procedures. In addition, simulation studies are performed to investigate the finite-sample behavior of the calculated statistical power, showing that sample sizes estimated from the derived formulas can be both positively and negatively biased owing to the complex effects of unreliability of contextual variables, multicollinearity, and violation of the assumption of known variances. Thus, it is advisable to compare estimated sample sizes under various specifications of the indices and to evaluate their potential bias, as illustrated in the example.
In this rejoinder, we examine some of the issues Peter Bentler, Eunseong Cho, and Jules Ellis raise. We suggest a methodologically solid way to construct a test indicating that the importance of the particular reliability method used is minor, and we discuss future topics in reliability research.
The two statistical approaches commonly used in the analysis of dyadic and group data, multilevel modeling and structural equation modeling, are reviewed. Next considered are three different models for dyadic data, focusing mostly on the very popular actor–partner interdependence model (APIM). We further consider power analyses for the APIM as well as the partition of nonindependence. We then present an overview of the analysis of over-time dyadic data, considering growth-curve models, the stability-and-influence model, and the over-time APIM. After that, we turn to group data and focus on considerations of the analysis of group data using multilevel modeling, including a discussion of the social relations model, which is a model of dyadic data from groups of persons. The final topic concerns measurement equivalence of constructs across members of different types in dyadic and group studies.
Chapter 7 introduces statistical power and effect size in hypothesis testing. Guidelines for interpreting effect size are provided, along with other means of increasing statistical power. Point estimation and interval estimation, and their relationship to population parameter estimates and the hypothesis-testing process, are considered. Statistical significance is highly sensitive to large sample sizes. This means that researchers, in addition to selecting desired statistical significance p-values, need to know the magnitude of the treatment effect, or the effect size, of the behavior under consideration. Effect size determines sample size, and sample size is intimately related to statistical power, the likelihood of rejecting a false null hypothesis.
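As a concrete illustration of how these quantities interlock (an assumed example using the statsmodels library, not a tool named in the chapter), fixing any three of effect size, alpha, power, and sample size determines the fourth:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to detect a medium effect (d = 0.5) with 80% power.
n_needed = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(round(n_needed))   # roughly 64 per group

# Power actually achieved for the same effect with only 20 participants per group.
print(analysis.solve_power(effect_size=0.5, alpha=0.05, nobs1=20))
```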
Biodiversity monitoring programmes should be designed with sufficient statistical power to detect population change. Here we evaluated the statistical power of monitoring to detect declines in the occupancy of forest birds on Christmas Island, Australia. We fitted zero-inflated binomial models to 3 years of repeat detection data (2011, 2013 and 2015) to estimate single-visit detection probabilities for four species of concern: the Christmas Island imperial pigeon Ducula whartoni, Christmas Island white-eye Zosterops natalis, Christmas Island thrush Turdus poliocephalus erythropleurus and Christmas Island emerald dove Chalcophaps indica natalis. We combined detection probabilities with maps of occupancy to simulate data collected over the next 10 years for alternative monitoring designs and for different declines in occupancy (10–50%). Specifically, we explored how the number of sites (60, 128, 300, 500), the interval between surveys (1–5 years), the number of repeat visits (2–4 visits) and the location of sites influenced power. Power was high (> 80%) for the imperial pigeon, white-eye and thrush for most scenarios, except for when only 60 sites were surveyed or a 10% decline in occupancy was simulated over 10 years. For the emerald dove, which is the rarest of the four species and has a patchy distribution, power was low in almost all scenarios tested. Prioritizing monitoring towards core habitat for this species only slightly improved power to detect declines. Our study demonstrates how data collected during the early stages of monitoring can be analysed in simulation tools to fine-tune future survey design decisions.
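The logic of such a simulation can be sketched in a few lines. This simplified Python example (illustrative only: it replaces the authors' zero-inflated binomial models with a naive two-proportion comparison, and all parameter values are assumptions) simulates detection histories before and after a decline in occupancy and counts how often the decline is flagged.

```python
import numpy as np
from statsmodels.stats.proportion import proportions_ztest

def power_to_detect_decline(n_sites=128, visits=3, p_detect=0.5,
                            psi0=0.6, decline=0.3, n_sims=1000, seed=0):
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_sims):
        def survey(psi):
            # A site counts as occupied if the species is detected on >= 1 visit.
            occupied = rng.binomial(1, psi, n_sites)
            detections = rng.binomial(visits, p_detect, n_sites) * occupied
            return (detections > 0).sum()
        before = survey(psi0)
        after = survey(psi0 * (1 - decline))
        _, p = proportions_ztest([before, after], [n_sites, n_sites])
        rejections += p < 0.05
    return rejections / n_sims

print(power_to_detect_decline(n_sites=60))    # fewer sites -> lower power
print(power_to_detect_decline(n_sites=300))
```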
The current study explored the impact of genetic relatedness differences (ΔH) and sample size on the performance of nonclassical ACE models, with a focus on same-sex and opposite-sex twin groups. The ACE model is a statistical model that posits that additive genetic factors (A), common environmental factors (C), and specific (or nonshared) environmental factors plus measurement error (E) account for individual differences in a phenotype. By extending Visscher’s (2004) least squares paradigm and conducting simulations, we illustrated how the genetic relatedness of same-sex twins (HSS) influences the statistical power of additive genetic estimates (A), AIC-based model performance, and the frequency of negative estimates. We found that larger HSS and increased sample sizes were associated with greater power to detect additive genetic components, improved model performance, and a reduction in negative estimates. We also found that the common solution of fixing the common-environment correlation for sex-limited effects to .95 caused slightly worse model performance under most circumstances. Further, negative estimates were shown to be possible and were not always indicative of a failed model; rather, they sometimes pointed to low power or model misspecification. Researchers using kin pairs with ΔH less than .5 should carefully consider performance implications and conduct comprehensive power analyses. Our findings provide valuable insights and practical guidelines for those working with nontwin kin pairs or in situations where zygosity is unavailable, as well as areas for future research.
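A back-of-the-envelope version of why ΔH matters (a deliberate simplification, not Visscher's full least-squares derivation): if each kin group's correlation is r = H·a² + c², then a² is recovered as (r1 − r2)/ΔH, so sampling error in the correlations is amplified by 1/ΔH. The sketch below uses invented correlations.

```python
def ace_from_correlations(r1, r2, h1=1.0, h2=0.5):
    """Falconer-style ACE estimates from the correlations of two kin groups."""
    delta_h = h1 - h2
    a2 = (r1 - r2) / delta_h     # additive genetic share; error scales with 1/delta_h
    c2 = r1 - h1 * a2            # common environment share
    e2 = 1.0 - a2 - c2           # unique environment plus measurement error
    return a2, c2, e2

print(ace_from_correlations(0.60, 0.40))                    # classic MZ vs DZ (delta_H = 0.5)
print(ace_from_correlations(0.50, 0.40, h1=0.5, h2=0.25))   # e.g. DZ vs half-siblings (delta_H = 0.25)
```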
This chapter provides an accessible introduction to experimental methods for social and behavioral scientists. We cover the process of experimentation from generating hypotheses through to statistical analyses. The chapter discusses classical issues (e.g., experimental design, selecting appropriate samples) but also more recent developments that have attracted the attention of experimental researchers. These issues include replication, preregistration, online samples, and power analyses. We also discuss the strengths and weaknesses of experimental methods. We conclude by noting that, for many research questions, experimental methods provide the strongest test of hypothesized causal relationships. Furthermore, well-designed experiments can elicit the same mental processes as in the real world; this typically makes them generalizable to new people and real-life situations.
Clinical trials address the problem of confounding bias through randomization. Examples from antidepressant research are given of how clinicians see clinical benefit in practice, whereas randomized clinical trials (RCTs) show that most of the benefit has to do with placebo effects rather than the drug pharmacology clinicians would assume. The conduct of clinical trials is discussed, including the risks of false-positive and false-negative results depending on how data are analyzed or how sample size is planned. Common errors are identified and criticized.
Despite the particular relevance of statistical power to animal welfare studies, we noticed an apparent lack of sufficient information reported in papers published in Animal Welfare to facilitate post hoc calculation of statistical power for use in meta-analyses. We therefore conducted a survey of all papers published in Animal Welfare in 2009 to assess compliance with relevant instructions to authors, the level of statistical detail reported and the interpretation of results regarded as statistically non-significant. In general, we found good levels of compliance with the instructions to authors except in relation to the level of detail reported for the results of each test. Although not requested in the instructions to authors, exact P-values were reported for just over half of the tests; however, effect size was not explicitly reported for any test, there was no reporting of a priori statistical analyses to determine sample size, and there was no formal assessment of non-significant results in relation to type II errors. As a first step towards addressing this, we recommend more reporting of a priori power analyses, more comprehensive reporting of the results of statistical analysis and the explicit consideration of possible statistical power issues when interpreting P-values. We also advocate the calculation of effect sizes and their confidence intervals and a greater emphasis on the interpretation of the biological significance of results rather than just their statistical significance. This will enhance the efforts that are currently being made to comply with the 3Rs, particularly the principle of reduction.
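To illustrate the kind of reporting being advocated (the data and the large-sample standard-error approximation are illustrative assumptions, not taken from the surveyed papers), an effect size can be reported with a confidence interval alongside the P-value:

```python
import numpy as np
from scipy import stats

def cohens_d_with_ci(x, y, alpha=0.05):
    nx, ny = len(x), len(y)
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    d = (np.mean(x) - np.mean(y)) / pooled_sd
    se = np.sqrt((nx + ny) / (nx * ny) + d**2 / (2 * (nx + ny)))  # large-sample approximation
    z = stats.norm.ppf(1 - alpha / 2)
    return d, (d - z * se, d + z * se)

rng = np.random.default_rng(3)
treated, control = rng.normal(0.4, 1, 30), rng.normal(0.0, 1, 30)
print(cohens_d_with_ci(treated, control))   # effect size with 95% confidence interval
```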
Previous research has suggested that statistical power is suboptimal in many biomedical disciplines, but it is unclear whether power is better in trials for particular interventions, disorders, or outcome types. We therefore performed a detailed examination of power in trials of psychotherapy, pharmacotherapy, and complementary and alternative medicine (CAM) for mood, anxiety, and psychotic disorders.
Methods
We extracted data from the Cochrane Database of Systematic Reviews (Mental Health). We focused on continuous efficacy outcomes and estimated power to detect predetermined effect sizes (standardized mean difference [SMD] = 0.20–0.80, primary SMD = 0.40) and meta-analytic effect sizes (ESMA). We performed meta-regression to estimate the influence of including underpowered studies in meta-analyses.
Results
We included 256 reviews with 10 686 meta-analyses and 47 384 studies. Statistical power for continuous efficacy outcomes was very low across intervention and disorder types (overall median [IQR] power for SMD = 0.40: 0.32 [0.19–0.54]; for ESMA: 0.23 [0.09–0.58]), only reaching conventionally acceptable levels (80%) for SMD = 0.80. Median power to detect the ESMA was higher in treatment-as-usual (TAU)/waitlist-controlled (0.49–0.63) or placebo-controlled (0.12–0.38) trials than in trials comparing active treatments (0.07–0.13). Adequately-powered studies produced smaller effect sizes than underpowered studies (B = −0.06, p ⩽ 0.001).
Conclusions
Power to detect both predetermined and meta-analytic effect sizes in psychiatric trials was low across all interventions and disorders examined. Consistent with the presence of reporting bias, underpowered studies produced larger effect sizes than adequately-powered studies. These results emphasize the need to increase sample sizes and to reduce reporting bias against studies reporting null results to improve the reliability of the published literature.
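To give a sense of the scale involved, a rough normal-approximation sketch (with illustrative sample sizes, not figures from the reviewed trials) shows the power of a two-arm trial to detect SMD = 0.40:

```python
from scipy.stats import norm

def approx_power(smd, n_per_arm, alpha=0.05):
    z_crit = norm.ppf(1 - alpha / 2)
    ncp = smd * (n_per_arm / 2) ** 0.5   # noncentrality of the two-sample z statistic
    return norm.sf(z_crit - ncp)

for n in (25, 50, 100, 200):
    print(n, round(approx_power(0.40, n), 2))
# Roughly 100 participants per arm are needed before power reaches ~0.80 for SMD = 0.40.
```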
Sample size planning (SSP) is vital for efficient studies that yield reliable outcomes. Hence, guidelines emphasize the importance of SSP. The present study investigates the practice of SSP in current trials for depression.
Methods
Seventy-eight randomized controlled trials published between 2013 and 2017 were examined. Impact of study design (e.g. number of randomized conditions) and study context (e.g. funding) on sample size was analyzed using multiple regression.
Results
Overall, sample size during pre-registration, during SSP, and in published articles was highly correlated (r's ≥ 0.887). Simultaneously, only 7–18% of explained variance related to study design (p = 0.055–0.155). This proportion increased to 30–42% by adding study context (p = 0.002–0.005). The median sample size was N = 106, with higher numbers for internet interventions (N = 181; p = 0.021) compared to face-to-face therapy. In total, 59% of studies included SSP, with 28% providing basic determinants and 8–10% providing information for comprehensible SSP. Expected effect sizes exhibited a sharp peak at d = 0.5. Depending on the definition, 10.2–20.4% implemented intense assessment to improve statistical power.
Conclusions
Findings suggest that investigators achieve their planned sample sizes and that pre-registration rates are increasing. During study planning, however, study context appears more important than study design. Study context therefore needs to be emphasized in the present discussion, as it can help to explain the relatively stable trial numbers of the past decades. Acknowledging this situation, there are indications that digital psychiatry (e.g. Internet interventions or intense assessment) can help to mitigate the challenge of underpowered studies. The article includes a short guide for efficient study planning.
Field experiments with survey outcomes are experiments where outcomes are measured by surveys but treatments are delivered by a separate mechanism in the real world, such as by mailers, door-to-door canvasses, phone calls, or online ads. Such experiments combine the realism of field experimentation with the ability to measure psychological and cognitive processes that play a key role in theories throughout the social sciences. However, common designs for such experiments are often prohibitively expensive and vulnerable to bias. In this chapter, we review how four methodological practices currently uncommon in such experiments can dramatically reduce costs and improve the accuracy of experimental results when at least two are used in combination: (1) online surveys recruited from a defined sampling frame, (2) with at least one baseline wave prior to treatment, (3) with multiple items combined into an index to measure outcomes and, (4) when possible, a placebo control for the purpose of identifying which subjects can be treated. We provide a general and extensible framework that allows researchers to determine the most efficient mix of these practices in diverse applications. We conclude by discussing limitations and potential extensions.
The use of experiments to study the behavior of political elites in institutions has a long history and is once again becoming an active field of research. I review that history, noting that government officials within political institutions frequently use random assignment to test for policy effects and to encourage compliance. Scholars of political institutions have generally been slower than practitioners to embrace the use of experiments, though there has been remarkable growth in experimentation by scholars to study political elites. I summarize the domains in which scholars have most commonly used experiments, commenting on how researchers have seized opportunities to leverage random assignment. I highlight design challenges including limited sample sizes, answering theoretically-driven questions while partnering with public officials or others, and the difficulty of conducting replications. I then implore scholars to be bold in using experiments to study political institutions while also being mindful of ethical considerations.
Publication bias and p-hacking are threats to the scientific credibility of experiments. If positive results are more likely to be published than null results conditional on the quality of the study design, then effect sizes in meta-analyses will be inflated and false positives will be more likely. Publication bias also has other corrosive effects as it creates incentives to engage in questionable research practices such as p-hacking. How can these issues be addressed such that the credibility of experiments is improved in political science? This chapter discusses seven specific solutions, which can be enforced by both formal institutions and informal norms.
This chapter reviews the use of chess in business, health, and education. In business, chess has been used for educational purposes and as a model for evaluating game-theoretic aspects of decision problems. In health, there are applications of chess to address problems such as attention deficit hyperactivity disorder (ADHD), neurodegenerative disorders, or schizophrenia. In education, chess has become widely used as a pedagogical method thought to entail educational benefits for languages, mathematics, concentration, self-control, or the development of socio-affective competences. Some recent studies suggest significantly higher levels of academic performance among schoolchildren and adolescents enrolled in chess-based teaching or who practice chess on a regular basis, when compared with students uninvolved in chess playing or chess instruction. Another set of studies, however, questions the purported benefits of chess training for formal education. According to this latter point of view, both conceptual and methodological concerns substantially weaken the available evidence on the association of chess training with academic achievement. Two of these issues, addressed in greater depth within this chapter, relate to the transfer of abilities across domains and to the concept of statistical power.
Dr Nick Martin has made enormous contributions to the field of behavior genetics over the past 50 years. Of his many seminal papers that have had a profound impact, we focus on his early work on the power of twin studies. He was among the first to recognize the importance of sample size calculation before conducting a study to ensure sufficient power to detect the effects of interest. The elegant approach he developed, based on the noncentral chi-squared distribution, has been adopted by subsequent researchers for other genetic study designs, and today remains a standard tool for power calculations in structural equation modeling and other areas of statistical analysis. The present brief article discusses the main aspects of his seminal paper, and how it led to subsequent developments, by him and others, as the field of behavior genetics evolved into the present era.
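The core of that approach is easy to state (the sketch below uses assumed numbers, not those in the original paper): the fit statistic obtained by fitting the misspecified model to data expected under the alternative, scaled by sample size, serves as the noncentrality parameter, and power is the tail probability of the corresponding noncentral chi-squared distribution beyond the null critical value.

```python
from scipy.stats import chi2, ncx2

def power_from_ncp(ncp, df, alpha=0.05):
    crit = chi2.ppf(1 - alpha, df)   # critical value under the central (null) distribution
    return ncx2.sf(crit, df, ncp)    # P(noncentral chi-squared > critical value)

# If fitting the reduced model to the expected covariance matrix contributes a
# chi-squared of 0.02 per pair, power grows with the number of pairs:
for n_pairs in (200, 500, 1000):
    print(n_pairs, round(power_from_ncp(0.02 * n_pairs, df=1), 2))
```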
This article provides an accessible tutorial with concrete guidance for how to start improving research methods and practices in your lab. Following recent calls to improve research methods and practices within and beyond the borders of psychological science, resources have proliferated across book chapters, journal articles, and online media. Many researchers are interested in learning more about cutting-edge methods and practices but are unsure where to begin. In this tutorial, we describe specific tools that help researchers calibrate their confidence in a given set of findings. In Part I, we describe strategies for assessing the likely statistical power of a study, including when and how to conduct different types of power calculations, how to estimate effect sizes, and how to think about power for detecting interactions. In Part II, we provide strategies for assessing the likely type I error rate of a study, including distinguishing clearly between data-independent (“confirmatory”) and data-dependent (“exploratory”) analyses and thinking carefully about different forms and functions of preregistration.
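As one concrete example of the calibration discussed in Part I (an illustrative simulation with assumed numbers, not material from the tutorial): a per-cell sample that gives roughly 80% power for a main effect gives far less power for an interaction of the same magnitude.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, d, n_sims = 64, 0.5, 2000
hits_main = hits_inter = 0

for _ in range(n_sims):
    # Main-effect study: two groups differing by d (~80% power at n = 64 per group).
    g0, g1 = rng.normal(0, 1, n), rng.normal(d, 1, n)
    hits_main += stats.ttest_ind(g1, g0)[1] < 0.05

    # 2 x 2 study with n per cell: the effect of size d exists only when b = 1,
    # so the interaction contrast also equals d.
    c00, c10 = rng.normal(0, 1, n), rng.normal(0, 1, n)
    c01, c11 = rng.normal(0, 1, n), rng.normal(d, 1, n)
    contrast = (c11.mean() - c01.mean()) - (c10.mean() - c00.mean())
    se = np.sqrt(4 / n)              # SD of the contrast when sigma = 1
    hits_inter += abs(contrast / se) > 1.96

print("main-effect power  ~", hits_main / n_sims)
print("interaction power  ~", hits_inter / n_sims)
```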