1. What is the focus of nonparametric kernel regression in the context of using external datasets for internal data analysis?
Nonparametric kernel regression is a well-established approach that requires no assumption on the regression function between Y and U beyond some smoothness conditions. When external datasets are available to support internal data analysis, the focus is on using summary information from those external datasets to obtain better fitted regression values at each observed U in the internal dataset, and thereby a better estimate of the regression function. This helps overcome the curse of dimensionality for kernel-type methods. The proposed constrained kernel (CK) regression method is a two-step process: the first step applies a constrained optimization procedure to obtain fitted regression values, and the second step applies standard kernel regression to estimate the entire regression function. Performance is measured by the asymptotic mean integrated square error (AMISE). Theoretical and empirical studies show that CK regression can significantly improve estimation accuracy over the standard kernel method, which uses no external information, especially when the sample size of the external dataset is at least comparable to that of the internal dataset.
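For reference, the standard kernel method that CK regression is compared against can be sketched as a Nadaraya-Watson estimator. This is only a minimal illustration, assuming a one-dimensional U and a Gaussian kernel; the function names, the simulated data, and the bandwidth value are hypothetical and do not come from the source, and the constrained first step of CK regression is not implemented here.

```python
import numpy as np

def gaussian_kernel(t):
    """Standard normal density, used here as the kernel (an assumption)."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def nw_estimate(u_eval, U, Y, b):
    """Nadaraya-Watson estimate of E(Y | U = u) at each point of u_eval."""
    u_eval = np.atleast_1d(np.asarray(u_eval, dtype=float))
    U = np.asarray(U, dtype=float)
    Y = np.asarray(Y, dtype=float)
    # weights[r, i] = K((u_r - U_i) / b); each row is normalised below.
    weights = gaussian_kernel((u_eval[:, None] - U[None, :]) / b)
    return weights @ Y / weights.sum(axis=1)

# Simulated internal data (hypothetical): Y = sin(2*pi*U) + noise.
rng = np.random.default_rng(0)
U = rng.uniform(0.0, 1.0, size=200)
Y = np.sin(2.0 * np.pi * U) + rng.normal(scale=0.3, size=200)

grid = np.linspace(0.1, 0.9, 5)
mu_hat = nw_estimate(grid, U, Y, b=0.05)
```

The estimator is a locally weighted average of the responses, so it needs no parametric form for the regression function, only a bandwidth b that controls how local the averaging is.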
2. How can k-fold cross-validation be applied for bandwidth selection?
K-fold cross-validation can be applied for bandwidth selection by partitioning the internal dataset into k approximately equal-sized subsets $G_1, \ldots, G_k$. For each subset $G_j$, the estimator $\hat{\mu}^{(-j)}_{CK}(U_i)$ is computed with bandwidths $l$ and $b$ from the data that remain after excluding $\{(Y_i, U_i), i \in G_j\}$. The selected bandwidths are the pair $(l, b)$ that minimizes $CV(l, b) = \sum_{j=1}^{k} \sum_{i \in G_j} \{\hat{\mu}^{(-j)}_{CK}(U_i) - Y_i\}^2$ over a reasonable range of candidate values. This provides an effective way to optimize the bandwidths for a given sample size n.
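The selection procedure can be sketched as a grid search over (l, b). The sketch below is illustrative only: `two_stage_fit` is a hypothetical, unconstrained stand-in for the CK estimator (it smooths twice but applies no external constraint), and the kernel, candidate grids, and simulated data are assumptions rather than the authors' choices.

```python
import numpy as np
from itertools import product

def nw(u_eval, U, Y, h):
    """Nadaraya-Watson smoother with a Gaussian kernel (1-D covariate)."""
    w = np.exp(-0.5 * ((u_eval[:, None] - U[None, :]) / h) ** 2)
    return w @ Y / w.sum(axis=1)

def two_stage_fit(U_train, Y_train, U_test, l, b):
    """Hypothetical unconstrained stand-in for the CK estimator.

    Stage 1 smooths Y at the training points with bandwidth l; stage 2
    smooths those fitted values with bandwidth b to predict at U_test.
    """
    fitted = nw(U_train, U_train, Y_train, l)
    return nw(U_test, U_train, fitted, b)

def kfold_cv(U, Y, l_grid, b_grid, k=5, seed=0):
    """Return the (l, b) pair minimising the k-fold CV criterion, and its value."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(U)), k)  # G_1, ..., G_k
    best_pair, best_cv = None, np.inf
    for l, b in product(l_grid, b_grid):
        cv = 0.0
        for G_j in folds:
            train = np.ones(len(U), dtype=bool)
            train[G_j] = False  # exclude {(Y_i, U_i), i in G_j}
            pred = two_stage_fit(U[train], Y[train], U[G_j], l, b)
            cv += np.sum((pred - Y[G_j]) ** 2)
        if cv < best_cv:
            best_pair, best_cv = (l, b), cv
    return best_pair, best_cv

# Simulated internal data (hypothetical).
rng = np.random.default_rng(1)
U = rng.uniform(0.0, 1.0, size=150)
Y = np.sin(2.0 * np.pi * U) + rng.normal(scale=0.3, size=150)

l_grid = [0.02, 0.05, 0.1]
b_grid = [0.02, 0.05, 0.1]
(l_hat, b_hat), cv_min = kfold_cv(U, Y, l_grid, b_grid)
```

Swapping `two_stage_fit` for an actual CK fit would leave the cross-validation loop unchanged, since the loop only needs predictions at the held-out points.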
3. How is bias in the kernel estimator handled?
The main technical difficulty in handling bias in the kernel estimator of $\mu(u)$ is estimating the asymptotic bias: $B_K(u)$ for standard kernel estimation and $B_{CK}(u)$ for constrained kernel estimation. These biases are not zero unless $c = 0$, yet $c > 0$ is what leads to the best convergence rate. Because bias estimation is challenging, the idea of undersmoothing can be applied instead: for confidence intervals, bandwidths smaller than those selected by CV are chosen, so that the bias becomes small relative to the standard error. For constrained kernel estimation, bandwidths $c_l l$ and $c_b b$ are used in the first and second stages, respectively. The confidence interval for $\mu(u)$ is $[\hat{\mu}_{CK}(u) - z_\alpha \hat{V}_{CK}^{1/2}(u),\ \hat{\mu}_{CK}(u) + z_\alpha \hat{V}_{CK}^{1/2}(u)]$, where $\hat{V}_{CK}(u)$ is the variance estimator and $\hat{\sigma}^2_{CK}(u)$ is the variance estimator for the standard kernel. When $\sigma^2(u)$ does not depend on $u$, a simplified estimator is used. The performance of this confidence interval is examined through simulation in Section 3.2.
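The undersmoothing idea can be illustrated for the standard kernel estimator. The sketch below is not the CK variance estimator from the source: it assumes a one-dimensional U, a Gaussian kernel, a variance that does not depend on u, and it reads $z_\alpha$ as the 1 - alpha/2 standard normal quantile. The function names, the shrinkage factor `c`, and the simulated data are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def nw_weights(u, U, h):
    """Normalised Gaussian-kernel weights w_i(u); they sum to 1."""
    w = np.exp(-0.5 * ((u - U) / h) ** 2)
    return w / w.sum()

def undersmoothed_ci(u, U, Y, h_cv, c=0.5, alpha=0.05):
    """Pointwise interval for mu(u) from a kernel fit at bandwidth c * h_cv."""
    h = c * h_cv  # undersmoothing: c < 1 shrinks the CV-selected bandwidth
    w = nw_weights(u, U, h)
    mu_hat = float(w @ Y)
    # Constant-variance estimate from in-sample residuals (an assumption;
    # in-sample residuals tend to understate the variance for small h).
    fitted = np.array([nw_weights(ui, U, h) @ Y for ui in U])
    sigma2_hat = float(np.mean((Y - fitted) ** 2))
    # For a linear smoother, Var(sum_i w_i Y_i | U) = sigma^2 * sum_i w_i^2.
    v_hat = sigma2_hat * float(np.sum(w ** 2))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * np.sqrt(v_hat)
    return mu_hat - half_width, mu_hat + half_width

# Simulated internal data (hypothetical).
rng = np.random.default_rng(2)
U = rng.uniform(0.0, 1.0, size=200)
Y = np.sin(2.0 * np.pi * U) + rng.normal(scale=0.3, size=200)

lo, hi = undersmoothed_ci(0.5, U, Y, h_cv=0.1)
```

The interval is centred at the point estimate with no bias correction, which is exactly why the bandwidth is shrunk: a smaller bandwidth trades a wider interval for a smaller bias.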
4. How does heterogeneity affect internal and external data populations?
Heterogeneity between the populations for internal and external data can affect the estimation of regression functions. When the populations differ, the observed internal data (Y, U, R) and external data (Y, X) are considered separately. The goal is to estimate the regression function for the internal data population, assuming the constraint $\sum_{i=1}^{N} R_i \{\hat{\beta}_g^\top g(X_i) - \mu_i\} g(X_i) = 0$. To address heterogeneity, the conditions $E(Y \mid X, R = 1) = E(Y \mid X, R = 0)$ and $X \perp R$ are required. These conditions allow for the estimation of $\beta_g$ in (2.15) and of $E\{\beta_g^\top g(X) g(X) \mid R = 1\}$. External information of the form $E[E\{Y \mid X, R = 1\} g(X) \mid R = 1] = E\{\mu_1(U) g(X) \mid R = 1\}$ is then available to extend the method. Additionally, generalized estimating equations (GEE) can be used to generalize the approach to different types of external data, such as logistic regression for discrete responses. For GEE summary-level information, the constraint $\sum_{i=1}^{N} R_i H(\hat{\beta}, \mu_i, X_i) = 0$ can be applied.
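A short derivation may help show why both parts of the condition matter. It rests on assumptions that are not stated in the text above: that $\beta_g$ is the least-squares coefficient of Y on g(X) in the external population (R = 0), that X is a sub-vector of U, and that $\mu_1(U) = E(Y \mid U, R = 1)$. Under those assumptions, the external coefficient satisfies the same moment condition in the internal population:

```latex
\begin{align*}
0 &= E\big[\{Y - \beta_g^\top g(X)\}\, g(X) \mid R = 0\big]
   && \text{(assumed definition of } \beta_g\text{)} \\
  &= E\big[\{E(Y \mid X, R = 0) - \beta_g^\top g(X)\}\, g(X) \mid R = 0\big]
   && \text{(conditioning on } X\text{)} \\
  &= E\big[\{E(Y \mid X, R = 1) - \beta_g^\top g(X)\}\, g(X) \mid R = 0\big]
   && \text{(equal conditional means)} \\
  &= E\big[\{E(Y \mid X, R = 1) - \beta_g^\top g(X)\}\, g(X) \mid R = 1\big]
   && (X \perp R) \\
  &= E\big[\{\mu_1(U) - \beta_g^\top g(X)\}\, g(X) \mid R = 1\big]
   && \text{(tower property, } X \subset U\text{)}
\end{align*}
```

The last line is the population counterpart of a sample constraint summed over the observations with $R_i = 1$. Without $X \perp R$ the fourth step fails, and without equal conditional means the third step fails.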