TL;DR: A linear function and a normal error are at the heart of many econometric models, including the classical small-sample linear regression model and the probit, ordered probit, multinomial pro...
Abstract: At the heart of many econometric models are a linear function and a normal error. Examples include the classical small-sample linear regression model and the probit, ordered probit, multinomial pro...
TL;DR: The KHB method, implemented in the user-written program khb, is a general decomposition method that is unaffected by the rescaling or attenuation bias that arises in cross-model comparisons in nonlinear models; it can be extended to other models in the generalized linear model family.
Abstract: In a series of recent articles, Karlson, Holm, and Breen (Breen, Karlson, and Holm, 2011, http://papers.ssrn.com/sol3/papers.cfm?abstractid=1730065; Karlson and Holm, 2011, Research in Stratification and Social Mobility 29: 221-237; Karlson, Holm, and Breen, 2010, http://www.yale.edu/ciqle/Breen Scaling %20effects.pdf) have developed a method for comparing the estimated coefficients of two nested nonlinear probability models. In this article, we describe this method and the user-written program khb, which implements the method. The KHB method is a general decomposition method that is unaffected by the rescaling or attenuation bias that arises in cross-model comparisons in nonlinear models. It recovers the degree to which a control variable, Z, mediates or explains the relationship between X and a latent outcome variable, Y*, underlying the nonlinear probability model. It also decomposes effects of both discrete and continuous variables, applies to average partial effects, and provides analytically derived statistical tests. The method can be extended to other models in the generalized linear model family.
TL;DR: Estimating the mechanisms that connect explanatory variables with the explained variable, also known as "mediation analysis", is central to a variety of social-science fields, especially psychology.
Abstract: Estimating the mechanisms that connect explanatory variables with the explained variable, also known as “mediation analysis,” is central to a variety of social-science fields, especially psychology...
TL;DR: A new postestimation command, xblc, is presented that facilitates tabular and graphical presentation of the relationship between a quantitative covariate and the response variable; cubic splines are given special emphasis.
Abstract: The use of flexible models for the relationship between a quantitative covariate and the response variable can be limited by the difficulty in interpreting the regression coefficients. In this article, we present a new postestimation command, xblc, that facilitates tabular and graphical presentation of these relationships. Cubic splines are given special emphasis. We illustrate the command through several worked examples using data from a large study of Swedish men on the relation between physical activity and the occurrence of lower urinary tract symptoms.
TL;DR: A general-purpose multivariate decomposition command for nonlinear response models that incorporates several recent contributions to overcome various problems dealing with path dependence and identification is developed.
Abstract: We developed a general-purpose multivariate decomposition command for nonlinear response models that incorporates several recent contributions to overcome various problems dealing with path depende...
TL;DR: The switch_probit command uses the maximum likelihood method to fit a binary choice model with binary endogenous regressors.
Abstract: In this article, we describe the switch_probit command, which implements the maximum likelihood method to fit the model of the binary choice with binary endogenous regressors.
TL;DR: In this article, a new command, gformula, is described, which is an implementation of the g-computation procedure and is used to estimate the causal effect of time-varying exposures on an outcome in the pre...
Abstract: This article describes a new command, gformula, that is an implementation of the g-computation procedure. It is used to estimate the causal effect of time-varying exposures on an outcome in the pre...
TL;DR: In this paper, the authors identify and illustrate some shortcomings of the poisson command in Stata: the command fails to check for the existence of the estimates and is very sensitive to numerical problems.
Abstract: In this note we identify and illustrate some shortcomings of the poisson command in Stata. Specifically, we point out that the command fails to check for the existence of the estimates and show that it is very sensitive to numerical problems. While these are serious problems that may prevent users from obtaining estimates, or even produce spurious and misleading results, we show that the informed user often has available simple work-arounds for these problems.
TL;DR: The geocode and traveltime commands use Google Maps to provide spatial information for data; geocode allows users to generate latitude and longitude.
Abstract: This article describes geocode and traveltime, two commands that use Google Maps to provide spatial information for data. The geocode command allows users to generate latitude and longitude for var...
TL;DR: The Stata 11 margins command makes it easier to estimate adjusted risk ratios, and the robust variance option for xtpoisson, fe provides correct confidence intervals for adjusted risk ratios from matched-cohort data.
Abstract: The Stata 11 margins command makes it easier to estimate adjusted risk ratios, and the new robust variance option for xtpoisson, fe provides correct confidence intervals for adjusted risk ratios from matched-cohort data.
TL;DR: The bpbounds and bpboundsi commands implement the nonparametric bounds derived by Balke and Pearl (1997, Journal of the American Statistical Association 92: 1172-1176) for the intervention probabilities and the average causal effect in a model in which the instrument, intermediate/treatment, and outcome variables are all binary, together with several extensions of these bounds.
Abstract: Instrumental variables can be used to make inferences about causal effects in the presence of unmeasured confounding. For a model in which the instrument, intermediate/treatment, and outcome variables are all binary, Balke and Pearl (1997, Journal of the American Statistical Association 92: 1172–1176) derived nonparametric bounds for the intervention probabilities and the average causal effect. We have implemented these bounds in two commands: bpbounds and bpboundsi. We have also implemented several extensions to these bounds. One of these extensions applies when the instrument and outcome are measured in one sample and the instrument and intermediate are measured in another sample. We have also implemented the bounds for an instrument with three categories, as is common in Mendelian randomization analyses in epidemiology and for the case where a monotonic effect of the instrument on the intermediate can be assumed. In each case, we calculate the instrumental-variable inequality constraints as a check for gross violations of the instrumental-variable conditions. The use of the commands is illustrated with a re-creation of the original Balke and Pearl analysis and with a Mendelian randomization analysis. We also give a simulated example to demonstrate that the instrumental-variable inequality constraints can both detect and fail to detect violations of the instrumental-variable conditions.
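The instrumental-variable inequality check described in the abstract above can be illustrated with a short Python sketch. It is not the bpbounds implementation; it only evaluates the well-known instrumental inequality for discrete variables (for every treatment level x, the sum over outcomes y of the maximum over instrument levels z of P(X=x, Y=y | Z=z) must not exceed 1). The function names, the dictionary layout, and the two example distributions are hypothetical and chosen purely for illustration.

```python
def iv_inequality_sums(p):
    """p[z][(x, y)] = P(X=x, Y=y | Z=z) for binary x and y.

    Returns, for each x, the sum over y of the maximum over z of p[z][(x, y)].
    """
    z_levels = list(p)
    return {
        x: sum(max(p[z].get((x, y), 0.0) for z in z_levels) for y in (0, 1))
        for x in (0, 1)
    }


def violates_iv_inequality(p, tol=1e-12):
    """True if any of the sums exceeds 1, i.e. the IV conditions cannot hold."""
    return any(s > 1.0 + tol for s in iv_inequality_sums(p).values())


# Hypothetical distribution that satisfies the constraints.
example_ok = {
    0: {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3},
    1: {(0, 0): 0.2, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.5},
}

# Hypothetical distribution in which Z changes Y while X never changes,
# which is incompatible with the exclusion restriction.
example_bad = {
    0: {(0, 0): 1.0},
    1: {(0, 1): 1.0},
}

print(violates_iv_inequality(example_ok))
print(violates_iv_inequality(example_bad))
```

As the abstract notes, passing this check does not establish that the instrument is valid; the constraints only flag gross violations.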
TL;DR: Treatment effects may vary with the observed characteristics of the treated, often with important implications; this paper introduces commands that adjust predicted values and the Average Treatment Effect (ATE) when treatreg is used to estimate treatment interactions.
Abstract: Treatment effects may vary with the observed characteristics of the treated, often with important implications. In the context of experimental data, a growing literature deals with the problem of specifying treatment interaction terms that most effectively capture this variation. Some of the results of this literature are now implemented in Stata. With non-experimental (observational) data, and in particular when selection into treatment depends on unmeasured factors, treatment effects can be estimated using Stata's treatreg command. Although not originally designed for this purpose, treatreg can be used to consistently estimate treatment interactions parameters. In the presence of interactions, however, adjustments are required to generate predicted values and estimate the Average Treatment Effect (ATE). This paper introduces commands that perform this adjustment for the case of multiplicative interactions and shows the adjustment that is required for more complicated interactions.
TL;DR: This article presents the rcpoisson command for right-censored count-data models with a constant (Terza 1985, Economics Letters 18: 361-365) and variable censoring threshold (Caudill and Mixon 1995, Empirical...
Abstract: I present the rcpoisson command for right-censored count-data models with a constant (Terza 1985, Economics Letters 18: 361–365) and variable censoring threshold (Caudill and Mixon 1995, Empirical ...
TL;DR: In this paper, the authors describe the current and new features of survci, including pointwise confidence intervals for the cumulative hazard function and the support of stratified Cox models, as well as factor variables; available as of Stata 11.
Abstract: A graphical representation of the pointwise confidence intervals allows a researcher to easily assess the precision of estimators. In the absence of covariates, the official command sts graph can be used to plot these intervals for the survivor function or the cumulative hazard function; however, in the presence of covariates, sts graph is insufficient. The user-written command survci can be used to plot the pointwise intervals for the survivor function after the Cox model. In this article, I describe the current and new features of survci. The new features include pointwise confidence intervals for the cumulative hazard function and the support of stratified Cox models, as well as factor variables; available as of Stata 11. I describe the methods used in calculating pointwise confidence intervals in the Cox model for both the covariate-adjusted survivor function and the covariate-adjusted cumulative hazard function. I also demonstrate the syntax of survci using Stata's example cancer dataset, cancer.dta.
TL;DR: The use of multidimensional scaling methods for visualizing social networks in Stata is described and illustrated and limitations of the approach are discussed.
Abstract: I describe and illustrate the use of multidimensional scaling methods for visualizing social networks in Stata. The procedure is implemented in the netplot command. I discuss limitations of the approach and sketch possibilities for improvement.
TL;DR: This simulation study compares multiple imputation through chained equations with multivariate normal imputation for a continuous predictor, compares the small-sample approximations to the multiple-imputation degrees of freedom proposed by Barnard and Rubin (1999, Biometrika 86: 948-955); Lipsitz, Parzen, and Zhao (2002, Journal of Statistical Computation and Simulation 72: 309-318); and Reiter (2007, Biometrika 94: 502-508), and asks whether the sampling distribution of the t statistics is in fact a Student's t distribution with the specified degrees of freedom.
Abstract: Incomplete data is a common complication in applied research. In this study, we use simulation to compare two approaches to the multiple imputation of a continuous predictor: multiple imputation through chained equations and multivariate normal imputation. This study extends earlier work by being the first to 1) compare the small-sample approximations to the multiple-imputation degrees of freedom proposed by Barnard and Rubin (1999, Biometrika 86: 948-955); Lipsitz, Parzen, and Zhao (2002, Journal of Statistical Computation and Simulation 72: 309-318); and Reiter (2007, Biometrika 94: 502-508) and 2) ask if the sampling distribution of the t statistics is in fact a Student's t distribution with the specified degrees of freedom. In addition to varying the imputation method, we varied the number of imputations (m = 5, 10, 20, 100) that were averaged over 500,000 replications to obtain the combined estimates and standard errors for a linear model that regressed the log price of a home on its age (years) and size (square feet) in a sample of 25 observations. Six age values were randomly set equal to missing for each replication. As assessed by the absolute percentage and relative percentage bias, the two approaches performed similarly. The absolute bias of the regression coefficients for age and size was roughly −0.1% across the levels of m for both approaches; the absolute bias for the constant was 0.6% for the chained-equations approach and 1.0% for the multivariate normal model. The absolute biases of the standard errors for age, size, and the constant were 0.2%, 0.3%, and 1.2%, respectively. In general, the relative percentage bias was slightly smaller for the chained-equations approach.
Graphical and numerical inspection of the empirical sampling distributions for the three t statistics suggested that the area from the shoulder to the tail was reasonably well approximated by a t distribution and that the small-sample approximations to the multiple-imputation degrees of freedom proposed by Barnard and Rubin and by Reiter performed satisfactorily.
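To make the degrees-of-freedom comparison in the abstract above concrete, the following Python sketch pools m imputed estimates with Rubin's rules and computes both the large-sample degrees of freedom and the Barnard and Rubin small-sample adjustment. It is a minimal illustration, not the code used in the study. The input numbers are invented, and the complete-data degrees of freedom of 22 is an assumption (25 observations minus 3 coefficients); the function name pool_rubin is hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances, df_complete):
    """Pool m point estimates and their squared standard errors.

    Returns the pooled estimate, total variance, large-sample degrees of
    freedom, and the Barnard-Rubin small-sample degrees of freedom.
    Requires at least two imputations with non-identical estimates.
    """
    m = len(estimates)
    q_bar = mean(estimates)                 # pooled point estimate
    u_bar = mean(variances)                 # within-imputation variance
    b = variance(estimates)                 # between-imputation variance (divisor m - 1)
    t = u_bar + (1 + 1 / m) * b             # total variance
    lam = (1 + 1 / m) * b / t               # share of variance due to missingness
    df_large = (m - 1) / lam ** 2           # large-sample degrees of freedom
    df_obs = (df_complete + 1) / (df_complete + 3) * df_complete * (1 - lam)
    df_br = 1 / (1 / df_large + 1 / df_obs)  # Barnard-Rubin adjustment
    return q_bar, t, df_large, df_br


# Invented example: m = 5 imputations, complete-data df assumed to be 22.
q_bar, t, df_large, df_br = pool_rubin(
    [1.0, 1.2, 0.8, 1.1, 0.9], [0.04] * 5, df_complete=22
)
print(q_bar, t, df_large, df_br)
```

In this invented example the adjusted degrees of freedom fall below both the complete-data value and the large-sample value, which is the behavior the small-sample correction is designed to produce.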
TL;DR: This column is a tour of functions that might easily be missed or underestimated in Stata, with a potpourri of tips, tricks, and examples for a wide range of basic problems.
Abstract: Functions are the unsung heroes of Stata. This column is a tour of functions that might easily be missed or underestimated, with a potpourri of tips, tricks, and examples for a wide range of basic problems.
TL;DR: In this article, the authors describe two commands, fetchyahooquotes and fetchyahookeystats, that import historical financial data and key current financial statistics from Yahoo! Finance.
Abstract: In this article, we describe two commands—fetchyahooquotes and fetchyahookeystats—that import historical financial data and key current financial statistics from Yahoo! Finance.
TL;DR: Several Stata programs estimate a transformed version of parameters such as standard deviations or correlations rather than the parameters themselves; this Stata tip illustrates how to recover these parameters, test hypotheses, and impose constraints.
Abstract: There are several models in Stata that estimate coefficients other than regular “regression-like” coefficients. Often these have an interpretation as a standard deviation or a correlation of either error terms or random coefficients. Examples are xtmixed or heckman (see [XT] xtmixed or [R] heckman). Sometimes, we want access to these coefficients to perform a test or to impose a constraint. However, getting access to these parameters is not always straightforward. The problem is that Stata programs often do not estimate these coefficients directly, but instead estimate a transformed version of those parameters. In this Stata tip, I will illustrate how to recover these parameters, test hypotheses, and impose constraints.
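The back-transformation idea in the abstract above can be sketched in Python. The example assumes the parameterization that heckman reports, an inverse hyperbolic tangent of the correlation (athrho) and a log of the standard deviation (lnsigma), and uses the delta method for the standard errors. It is an illustration of the arithmetic only, not the approach taken in the tip itself; the function names and the input values are hypothetical.

```python
import math


def rho_from_athrho(athrho, se_athrho):
    """Recover rho = tanh(athrho) and its delta-method standard error."""
    rho = math.tanh(athrho)
    # d tanh(a) / da = 1 - tanh(a)^2
    se_rho = (1.0 - rho ** 2) * se_athrho
    return rho, se_rho


def sigma_from_lnsigma(lnsigma, se_lnsigma):
    """Recover sigma = exp(lnsigma) and its delta-method standard error."""
    sigma = math.exp(lnsigma)
    # d exp(a) / da = exp(a)
    se_sigma = sigma * se_lnsigma
    return sigma, se_sigma


# Hypothetical estimates on the transformed scale.
rho, se_rho = rho_from_athrho(0.5, 0.1)
sigma, se_sigma = sigma_from_lnsigma(0.0, 0.05)
print(rho, se_rho, sigma, se_sigma)
```

One consequence of the transformation is that a test of rho = 0 is equivalent to a test of athrho = 0, whereas a test of sigma = 1 corresponds to lnsigma = 0.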
TL;DR: In this article, the authors discuss generalized method of moments (GMM) estimation of the covariance structure of longitudinal data on earnings, and introduce and illustrate a Stata program that facilitates the implementation of the GMM approach in this context.
Abstract: In this article, we discuss generalized method of moments estimation of the covariance structure of longitudinal data on earnings, and we introduce and illustrate a Stata program that facilitates the implementation of the generalized method of moments approach in this context. The program, gmmcovearn, estimates a variety of models that encompass those most commonly used by labor economists. These include models where the permanent component of earnings follows a random growth or random walk process and where the transitory component can follow either an AR(1) or an ARMA(1,1) process. In addition, time-factor loadings and cohort-factor loadings may be incorporated in the transitory and permanent components.
TL;DR: In this article, the authors present a command that enables Stata users to estimate Poisson-lognormal hurdle models using adaptive Gauss-Hermite quadrature to approximate the likelihood function and evaluate the performance of the estimator in Monte Carlo experiments.
Abstract: Hurdle models based on the zero-truncated Poisson-lognormal distribution are rarely used in applied work, although they incorporate some advantages compared with their negative binomial alternatives. I present a command that enables Stata users to estimate Poisson-lognormal hurdle models. I use adaptive Gauss-Hermite quadrature to approximate the likelihood function, and I evaluate the performance of the estimator in Monte Carlo experiments. The model is applied to the number of doctor visits in a sample of the U.S. Medical Expenditure Panel Survey.
TL;DR: The mstat and mtest commands perform inference based on the M statistic, which compares the interpoint distance distribution across groups of observations and can be used to test whether two groups have the same (spatial) distribution.
Abstract: We implement the commands mstat and mtest to perform inference based on the M statistic, a statistic that can be used to compare the interpoint distance distribution across groups of observations. The analyses are based on the study of the interpoint distances between n points in a k-dimensional setting to produce a one-dimensional real-valued test statistic. The locations are distributed in a region of the plane. When we consider all interpoint distances, the dependencies among them are difficult to express analytically, but their distribution is informative, and the M statistic can be built to summarize one aspect of this information. The two commands can be used on a wide class of datasets to test the null hypothesis that two groups have the same (spatial) distribution. mstat and mtest return the exact M test statistic. Moreover, mtest executes a Monte Carlo-type permutation test, which returns the empirical p-value together with its confidence interval. This is the command to use in most situations, because the convergence of M to its asymptotic chi-squared distribution is slow. Both commands can be used to obtain graphical output of the empirical density function of the interpoint distance distributions in the two groups and the two-dimensional map of the n observations in the plane. The descriptions of the commands are accompanied by examples of applications with real and simulated data. We run the test on the Alt and Vach grave site dataset (Manjourides and Pagano, forthcoming, Statistics in Medicine) and reject the null hypothesis, in contradiction to other published analyses. We also show how to adapt the techniques to discrete datasets with more than one unit in each location. Finally, we report an extensive application on breast cancer data in Massachusetts; in the application, we show the compatibility of the M commands with Pisati's spmap package.
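The Monte Carlo permutation idea mentioned in the abstract above can be sketched in Python. This is not the M statistic and not the mtest implementation: the statistic used here (mean between-group distance minus the average within-group distance) is a simple stand-in chosen only to show how an empirical p-value is obtained by reshuffling group labels. All function names and the example points are hypothetical.

```python
import math
import random
from itertools import combinations


def within_mean(points):
    """Mean Euclidean distance over all pairs of distinct points in one group."""
    d = [math.dist(p, q) for p, q in combinations(points, 2)]
    return sum(d) / len(d)


def between_mean(a, b):
    """Mean Euclidean distance over all pairs with one point from each group."""
    d = [math.dist(p, q) for p in a for q in b]
    return sum(d) / len(d)


def separation(a, b):
    """Stand-in statistic: large when the two groups occupy different regions."""
    return between_mean(a, b) - 0.5 * (within_mean(a) + within_mean(b))


def permutation_pvalue(a, b, n_perm, rng):
    """Empirical p-value from n_perm random relabelings of the pooled points."""
    observed = separation(a, b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if separation(pooled[:len(a)], pooled[len(a):]) >= observed:
            hits += 1
    # The observed labeling counts as one of the permutations.
    return (hits + 1) / (n_perm + 1)


# Two hypothetical, clearly separated clusters of nine points each.
group_a = [(0.1 * i, 0.1 * j) for i in range(3) for j in range(3)]
group_b = [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(3) for j in range(3)]
p_value = permutation_pvalue(group_a, group_b, 499, random.Random(1))
print(p_value)
```

Because the p-value is estimated from a finite number of permutations, it carries Monte Carlo error, which is why the abstract reports a confidence interval alongside it.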
TL;DR: This column reviews some basic Stata techniques helpful for data management tasks, including the use of subscripts, summarize, by:, sum(), cond(), and egen.
Abstract: Many problems in data management center on relating values to values in other observations, either within a dataset as a whole or within groups such as panels. This column reviews some basic Stata ...
TL;DR: The simontwostage command calculates the critical values and sample sizes for two-stage designs for phase II oncology trials; options are provided to determine the minimax and optimal designs proposed by Simon (1989, Controlled Clinical Trials 10: 1-10) and admissible designs described by Jung et al. (2004, Statistics in Medicine 23: 561-569).
Abstract: This article describes a new Stata command called simontwostage, which calculates the critical values and sample sizes for two-stage designs for phase II oncology trials. Options are provided to determine the minimax and optimal designs proposed by Simon (1989, Controlled Clinical Trials 10: 1–10) and admissible designs described by Jung et al. (2004, Statistics in Medicine 23: 561–569). Furthermore, nonstochastic and stochastic curtailment rules can be implemented in both stages of the trial, and the properties of the curtailed designs can be examined.
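The operating characteristics of a two-stage design of the kind described above can be computed directly from binomial probabilities. The Python sketch below is not the simontwostage implementation and does not search for minimax or optimal designs; it only evaluates a design that is supplied by the user. It assumes the usual decision rule (stop after stage 1 if the number of responses is at most r1; otherwise continue to n patients in total and declare the treatment promising if total responses exceed r). The function names and the example design values are illustrative inputs.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


def two_stage_properties(r1, n1, r, n, p):
    """Return (probability of early termination, expected sample size,
    probability of declaring the treatment promising) at response rate p."""
    pet = binom_cdf(r1, n1, p)
    expected_n = n1 + (1 - pet) * (n - n1)
    n2 = n - n1
    reject = 0.0
    for x1 in range(r1 + 1, n1 + 1):
        need = r - x1  # stage-2 responses must exceed this value
        tail = 1.0 if need < 0 else max(0.0, 1.0 - binom_cdf(need, n2, p))
        reject += binom_pmf(x1, n1, p) * tail
    return pet, expected_n, reject


# Illustrative design: r1/n1 = 1/10, r/n = 5/29, evaluated at p = 0.10.
pet, expected_n, alpha = two_stage_properties(1, 10, 5, 29, 0.10)
print(pet, expected_n, alpha)
```

Calling the same function at the response rate of interest under the alternative hypothesis gives the power of the design, since the third return value is then the probability of correctly declaring the treatment promising.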
TL;DR: This article introduces a command that generates a random sample from any user-specified distribution function using numeric methods that make this command very generic.
Abstract: Generating random samples in Stata is very straightforward if the distribution drawn from is uniform or normal. With any other distribution, an inverse method can be used; but even in this case, the user is limited to the built-in functions. For any other distribution functions, their inverse must be derived analytically or numerical methods must be used if analytical derivation of the inverse function is tedious or impossible. In this article, I introduce a command that generates a random sample from any user-specified distribution function using numeric methods that make this command very generic.
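The inverse method with a numerically inverted distribution function, as described in the abstract above, can be sketched in Python. This is not the command introduced in the article; bisection is used here as one simple numerical inversion technique, and the article may use a different one. The function names, the search bracket, and the exponential example are assumptions made for illustration.

```python
import math
import random


def invert_cdf(cdf, u, lo, hi, tol=1e-10):
    """Solve cdf(x) = u on [lo, hi] by bisection; cdf must be nondecreasing."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def sample_from_cdf(cdf, n, lo, hi, rng):
    """Draw n values by applying the numerical inverse to uniform draws."""
    return [invert_cdf(cdf, rng.random(), lo, hi) for _ in range(n)]


def exp_cdf(x):
    """Example distribution function: exponential with rate 1."""
    return 1.0 - math.exp(-x)


rng = random.Random(12345)
draws = sample_from_cdf(exp_cdf, 5000, 0.0, 50.0, rng)
print(sum(draws) / len(draws))
```

The bracket [lo, hi] must cover essentially all of the probability mass; values beyond hi can never be drawn, so a bracket that is too narrow truncates the distribution.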
TL;DR: This article describes regression-based methods for analyzing multiple-source data in Stata, illustrated with self-reported and nurse-measured body mass index from the BROMS Cohort Study of Swedish adolescents, related to smoking onset.
Abstract: Multiple-source data are often collected to provide better information of some underlying construct that is difficult to measure or likely to be missing. In this article, we describe regression-based methods for analyzing multiple-source data in Stata. We use data from the BROMS Cohort Study, a cohort of Swedish adolescents who collected data on body mass index that was self-reported and that was measured by nurses. We draw together into a single frame of reference both source reports and relate these to smoking onset. This unified method has two advantages over traditional approaches: 1) the relative predictiveness of each source can be assessed and 2) all subjects contribute to the analysis. The methods are applicable to other areas of epidemiology where multiple-source reports are used.
TL;DR: The problem of handling Roman numerals in Stata is used to illustrate issues arising in the handling of classification codes in character string form and their numeric equivalents; regular expressions are especially valuable for this problem.
Abstract: The problem of handling Roman numerals in Stata is used to illustrate issues arising in the handling of classification codes in character string form and their numeric equivalents The solutions in