Sample Weight

National Longitudinal Study of Adolescent Health: Strategies to Perform a Design-Based Analysis Using the Add Health Data

[...]

Kim Chantala¹, Joyce Tabor

University of North Carolina at Chapel Hill¹

1 Jan 2010

TL;DR: This paper presents a 7-step process for performing analysis of Add Health data using any software package designed to handle complex surveys, and describes the characteristics and data elements needed by the survey software packages.


Abstract: The Add Health Study is a nationally representative, probability-based survey of adolescents in grades 7 through 12 conducted between 1994 and 1996. The sample design used to collect the data complicates the analysis. Failing to account for this complexity may result in biased parameter estimates and incorrect variance estimates. Hence, you must correct for design effects and unequal probability of selection to ensure that your results are nationally representative with unbiased estimates.

Specialized, “user-friendly” statistical software is now available for analyzing data from complex surveys. SUDAAN and STATA are two examples of this type of software. Using both SUDAAN and STATA, we show you how to incorporate characteristics of the sample design into an analysis so that your estimates and standard errors are unbiased.

We first present a simplified description of the Add Health sampling process, including the sample attributes and data elements needed to analyze the data correctly when the unit of analysis is either the school or the adolescent. A brief description of statistical software available to analyze survey data follows, along with “code templates” you can use as a guide in doing your own analysis using SUDAAN or STATA. Next, we present a 7-step process for performing analysis of Add Health data using any software package designed to handle complex surveys. We conclude with an example that applies this process in both STATA and SUDAAN. Results are compared with an analysis from SAS to show how ignoring the design effects can lead to misleading conclusions.

Overview

The Add Health data collection was designed as a cluster sample in which the clusters were sampled with unequal probability. While reducing the cost of data collection, this design complicates the statistical analysis because the observations are no longer independent and identically distributed.
To analyze the data correctly, you must use survey software packages specifically designed to handle observations that are not independent and identically distributed. The purpose of this document is to provide a strategy to correctly analyze the Add Health data. To do this, we describe the characteristics and data elements needed by the survey software packages. We conclude by providing examples using two of the survey software packages, SUDAAN and STATA. All tables, figures, and examples were created using the contractual dataset.

Design Characteristics of the Add Health Data

This section describes how the sampling strategy has influenced the structure of the data. We will focus on why some of the adolescents in our dataset do not have sample weights. The details of the sampling strategy are beyond the scope of this paper, but can be found in the document “Grand Sample Weight” by Roger Tourangeau and Hee-Choon Shin.

Overview of Sample Selection

An overview of the Add Health sampling method is shown in Figure 1.

Figure 1. Sampling Structure for Add Health Study

A sample of 80 high schools and 52 middle (feeder) schools from the U.S. was selected with unequal probability of selection. Thus, school became the cluster identifier or primary sampling unit (PSU). Administrators of these 132 schools were asked to fill out a questionnaire describing the characteristics of these schools. Adolescents attending these schools were eligible for selection into any of the three panels of data: the In-School Questionnaire (1994-1995), the Wave I In-Home Questionnaire (1995), and the Wave II In-Home Questionnaire (1996). Students attending participating schools filled out the In-School Questionnaire. Samples of students from the school rosters and those filling out the In-School Questionnaire were then selected to participate in the in-home data collection phase.

Figure 2. In-Home Questionnaire Target Populations

These samples, shown in Figure 2, have the following characteristics.

• Core: a nearly self-weighting sample. Schools were chosen with probability proportional to size and a fixed number of students (~200) were selected from each school. Because the nonresponse rate varied from school to school, some of this self-weighting property is lost, so we consider the core to be a nearly self-weighting sample. This is why we needed to develop core weights. Even with a self-weighting property, you must still account for the clustering of the sample. Because of this, there is no advantage to analyzing the core instead of the grand sample.
• Saturation Sample: all students from 16 schools. Two large schools were chosen for adolescent network analysis; in 14 small schools, all students were included because of the small enrollment.
• Disabled Sample. Eligibility for this sample was determined by responses to several questions on the In-School Questionnaire.
• Ethnic Samples: High Education Black, Cuban, Puerto Rican, Chinese. Eligibility was determined by race/ethnicity listed on the In-School Questionnaire.
• Genetic Samples: identical and fraternal twins, full siblings, half siblings, unrelated adolescent pairs in the same home. Eligibility was based on responses to the household grid in the In-School Questionnaire.

Availability of Sample Weights

It is important to note that the adolescents in the Add Health Study were selected for two different analytical purposes:

• Analyses to provide nationally representative estimates
• Specialized genetic analyses

Figure 3 illustrates these groups for the three panels of data.

Figure 3. Genetic Sample and the Nationally Representative Sample

The most striking feature of this data schematic is that only the adolescents selected to be in the group that can be used to make nationally representative estimates have sample weights.
Because the sample size was too small for genetic analyses using only this group, we had to augment the genetic sample with students who were not part of the sampling plan. Thus, weights could not always be computed for adolescents who were selected for the specialized genetic analysis.

A Note on Weights Needed for Analyzing Pairs of Respondents

Some of the analyses you might be interested in involve serendipitous pairs of respondents. This might include friends as well as twins or siblings. For example, the Add Health data includes respondents and their friends who both filled out the survey. You might be predicting an outcome using information from both the respondent’s and the friend’s surveys. There is no simple answer to the proper weight to use when your analysis includes observations that are based on data from two different respondents. To correctly compute the weight for a pair, we need the joint inclusion probability of the pair; the weight for the pair is the inverse of that joint inclusion probability. To compute this weight, we must go back to the details of the sample selection process for both of the individuals and their schools. This can vary for each type of pair (pairs of friends, siblings, twins, and romantic partners), so the method of computing the weight might be different for each type of pair. We are currently working on this problem and will make the pairs weights available as soon as we are confident of the proper method needed to compute them.

Specifying the Design Structure of the Add Health Data

Next we discuss the information that must be known about the design to use the survey software. This information is listed in Table 1.

Design Type: Specify With Replacement as the Design Type

The information needed to make finite population corrections for analyzing the dataset as a “without replacement design” is not available. However, we can assume that the schools were selected with replacement. The variance estimation technique is derived using large-sample theory, which justifies our assumption of with-replacement sampling even though schools were not placed back on the list before the next school was selected.

Stratum Variable: Use REGION

The Add Health sampling plan did not include a stratification variable. However, a post-stratification adjustment was made to the sample weights so that region of country (variable REGION) could be used as a post-stratification variable. This involved using the total number of schools on the sampling frame for each region (Northeast, Midwest, South, and West) of the country. For each region, an adjustment was made to the initial school weights so that the sum of the school weights was equal to the total number of schools on the sampling frame.

Cluster Variable or Primary Sampling Unit (PSU): Use the School Identifier

This is the variable named PSUSCID for the In-School, Wave I, and Wave II data. The sampling units in the Add Health Study are middle and high schools from the United States, hence the School Identifier is the appropriate variable to use as the cluster or PSU variable.

Table 1. Variables for Correcting for Design Effects in Contractual Dataset

Design Type = With Replacement

                     Unit = School    Unit = Adolescent
                     School Admin     In-School       Wave I          Wave II
                     N = 164          N = 90,118*     N = 20,745*     N = 14,738*
Strata variable      REGION           REGION          REGION          REGION
Cluster variable     PSUSCID          PSUSCID         PSUSCID         PSUSCID
Weight variable      SCHADMWT         SCHWGTPS        GSWGT1          GSWGT2
# with weights       130              83,135          18,924          13,570
# missing weights    34               6,983           1,821           1,168
Mean of weights      250.4650         269.979
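
The design elements listed in Table 1 (stratum REGION, cluster PSUSCID, and a weight such as GSWGT1) can be illustrated with a small Python sketch. This is not one of the SUDAAN or STATA templates the paper refers to. It assumes the standard with-replacement, between-PSU (Taylor linearization) variance estimator for a weighted mean, which is the kind of estimator survey packages typically apply to a design specified this way. The function name, the outcome name `y`, and the toy records are hypothetical, and dropping records without a weight is an assumption made only for this illustration.

```python
from collections import defaultdict
from math import sqrt


def design_based_mean(rows, y="y", weight="GSWGT1",
                      stratum="REGION", psu="PSUSCID"):
    """Weighted mean and its with-replacement linearized standard error.

    rows is a list of dicts keyed by the Table 1 variable names.
    """
    # Assumption for this sketch: records without a sample weight are
    # outside the nationally representative sample and are dropped.
    rows = [r for r in rows if r.get(weight) is not None]

    total_w = sum(r[weight] for r in rows)
    mean = sum(r[weight] * r[y] for r in rows) / total_w

    # Sum the linearized values w * (y - mean) / W within each PSU,
    # keeping the PSUs grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for r in rows:
        z = r[weight] * (r[y] - mean) / total_w
        psu_totals[r[stratum]][r[psu]] += z

    # Between-PSU variance within each stratum, no finite population
    # correction (the with-replacement assumption).
    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            raise ValueError("each stratum needs at least two PSUs")
        z_bar = sum(totals.values()) / n_h
        variance += n_h / (n_h - 1) * sum((t - z_bar) ** 2
                                          for t in totals.values())
    return mean, sqrt(variance)


# Hypothetical toy data: two regions, two schools in each.
toy = [
    {"REGION": "West", "PSUSCID": "A", "GSWGT1": 1.0, "y": 1.0},
    {"REGION": "West", "PSUSCID": "A", "GSWGT1": 1.0, "y": 3.0},
    {"REGION": "West", "PSUSCID": "B", "GSWGT1": 2.0, "y": 2.0},
    {"REGION": "South", "PSUSCID": "C", "GSWGT1": 1.0, "y": 4.0},
    {"REGION": "South", "PSUSCID": "D", "GSWGT1": 1.0, "y": 0.0},
    {"REGION": "South", "PSUSCID": "D", "GSWGT1": None, "y": 9.0},
]

mean, se = design_based_mean(toy)
print(round(mean, 4), round(se, 4))  # prints: 2.0 0.6667
```

In this toy data the two West schools have the same weighted deviation from the overall mean, so all of the estimated variance comes from the difference between the two South schools. An analysis that treated the five weighted records as independent observations would not reflect this school-level structure.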


359 citations

Year	Papers
2021	5
2020	10
2019	14
2018	6
2017	6
2016	4


Papers published on a yearly basis

Papers

National Longitudinal Study of Adolescent Health Strategies to Perform a Design-Based Analysis Using the Add Health Data

The effects of sample size and heating rate on the kinetics of the thermal decomposition of CaCO3

A proposed sampling constant for use in geochemical analysis.

On the Utilization of Sample Weights in Latent Variable Models.

Estimation of effective sample size when analysing powders with diffuse reflectance near-infrared spectrometry

