TL;DR: Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.
Abstract: Background: The accurate prediction of where faults are likely to occur in code can help direct test effort, reduce costs, and improve the quality of software. Objective: We investigate how the context of models, the independent variables used, and the modeling techniques applied influence the performance of fault prediction models. Method: We used a systematic literature review to identify 208 fault prediction studies published from January 2000 to December 2010. We synthesize the quantitative and qualitative results of 36 studies which report sufficient contextual and methodological information according to the criteria we develop and apply. Results: The models that perform well tend to be based on simple modeling techniques such as Naive Bayes or Logistic Regression. Combinations of independent variables have been used by models that perform well. Feature selection has been applied to these combinations when models are performing particularly well. Conclusion: The methodology used to build models seems to be influential to predictive performance. Although there are a set of fault prediction studies in which confidence is possible, more studies are needed that use a reliable methodology and which report their context, methodology, and performance comprehensively.
TL;DR: The author argues that much of the difficulty with limited dependent variables comes from a focus on structural parameters, such as index coefficients, instead of causal effects, and proposes several simple strategies to accommodate binary endogenous regressors.
Abstract: Applied economists have long struggled with the question of how to accommodate binary endogenous regressors in models with binary and nonnegative outcomes. I argue here that much of the difficulty with limited dependent variables comes from a focus on structural parameters, such as index coefficients, instead of causal effects. Once the object of estimation is taken to be the causal effect of treatment, several simple strategies are available. These include conventional two-stage least squares, multiplicative models for conditional means, linear approximation of nonlinear causal models, models for distribution effects, and quantile regression with an endogenous binary regressor. The estimation strategies discussed in the article are illustrated by using multiple births to estimate the effect of childbearing on employment status and hours of work.
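As a rough illustration of the "conventional two-stage least squares" strategy mentioned above: with a single binary instrument and no covariates, 2SLS reduces to the Wald estimator, the ratio of the instrument's effect on the outcome to its effect on the treatment. The sketch below uses simulated data, not the multiple-births data from the article; the data-generating process, the true effect of 2.0, and all function names are assumptions made for the example.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simulate(n=20000, true_effect=2.0, seed=0):
    """Simulate (z, d, y): binary instrument z, binary endogenous treatment d,
    outcome y. The unobserved confounder u drives both d and y (hypothetical DGP)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = 1 if rng.random() < 0.5 else 0       # instrument, assigned at random
        u = rng.gauss(0.0, 1.0)                  # unobserved confounder
        d = 1 if u + z > 0.5 else 0              # treatment depends on z and u
        y = true_effect * d + u + rng.gauss(0.0, 1.0)
        rows.append((z, d, y))
    return rows


def naive_difference(rows):
    """Treated-minus-control difference in mean outcomes (biased by u)."""
    treated = [y for _, d, y in rows if d == 1]
    control = [y for _, d, y in rows if d == 0]
    return mean(treated) - mean(control)


def wald_iv(rows):
    """Wald / 2SLS estimate: reduced-form difference over first-stage difference."""
    y1 = mean([y for z, _, y in rows if z == 1])
    y0 = mean([y for z, _, y in rows if z == 0])
    d1 = mean([d for z, d, _ in rows if z == 1])
    d0 = mean([d for z, d, _ in rows if z == 0])
    return (y1 - y0) / (d1 - d0)


rows = simulate()
naive = naive_difference(rows)
iv = wald_iv(rows)
print(f"naive difference: {naive:.2f}, IV (Wald) estimate: {iv:.2f}")
```

In this simulation the naive comparison overstates the effect because treated units have systematically higher values of the confounder, while the Wald ratio should land close to the assumed true effect.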
TL;DR: In this article, the authors evaluated the performance of dynamic panel models on corporate finance data, in which the dependent variable may be clustered or censored and independent variables may be missing, correlated with one another, or endogenous.
Abstract: Dynamic panel models play a natural role in several important areas of corporate finance, but the combination of fixed effects and lagged dependent variables introduces serious econometric bias. Several methods of counteracting these biases are available and these methodologies have been tested on small datasets with independent, normally-distributed explanatory variables. However, no one has evaluated the methods’ performance with corporate finance data, in which the dependent variable may be clustered or censored and independent variables may be missing, correlated with one another, or endogenous. We find that the data’s properties substantially affect the estimators’ performances. We provide evidence about the impact of various data set characteristics on the estimators, so that researchers can determine the best approach for their datasets.
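The bias that arises from combining fixed effects with a lagged dependent variable (often called Nickell bias) can be seen in a small simulation. This is only a sketch on idealized simulated data, so it does not reproduce the clustered, censored, or missing-data conditions the paper studies; the autoregressive coefficient of 0.5, the panel sizes, and the function names are assumptions for the example.

```python
import random


def simulate_panel(n_units, n_periods, rho=0.5, burn_in=50, seed=1):
    """Simulate y_it = alpha_i + rho * y_i,t-1 + e_it for each unit (hypothetical DGP)."""
    rng = random.Random(seed)
    panel = []
    for _ in range(n_units):
        alpha = rng.gauss(0.0, 1.0)      # unit fixed effect
        y = alpha / (1.0 - rho)          # start at the unit's long-run mean
        series = []
        for t in range(burn_in + n_periods):
            y = alpha + rho * y + rng.gauss(0.0, 1.0)
            if t >= burn_in:             # discard burn-in draws
                series.append(y)
        panel.append(series)
    return panel


def within_estimator(panel):
    """Fixed-effects (within) estimate of rho: demean y_t and y_t-1 per unit,
    then take the pooled OLS slope of demeaned y_t on demeaned y_t-1."""
    num = 0.0
    den = 0.0
    for series in panel:
        current = series[1:]
        lagged = series[:-1]
        mean_current = sum(current) / len(current)
        mean_lagged = sum(lagged) / len(lagged)
        for c, l in zip(current, lagged):
            num += (l - mean_lagged) * (c - mean_current)
            den += (l - mean_lagged) ** 2
    return num / den


# Short panel: many units, few periods (typical of firm-level data)
rho_short = within_estimator(simulate_panel(n_units=2000, n_periods=6))
# Long panel: the bias shrinks as the number of periods grows
rho_long = within_estimator(simulate_panel(n_units=200, n_periods=101))
print(f"true rho: 0.50, short panel: {rho_short:.2f}, long panel: {rho_long:.2f}")
```

The short-panel estimate should fall well below the true coefficient even with thousands of units, because the bias depends on the number of time periods rather than the number of units.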
TL;DR: A survey of the existing literature on the specification and estimation of dynamic spatial panel data models, a collection of models for spatial panels extended to include time and space lags of the dependent and independent variables, serial and spatial error autocorrelation, and spatial-specific and time-period-specific effects.
Abstract: This paper provides a survey of the existing literature on the specification and estimation of dynamic spatial panel data models, a collection of models for spatial panels extended to include one or more of the following variables and/or error terms: a dependent variable lagged in time, a dependent variable lagged in space, a dependent variable lagged in both space and time, independent variables lagged in time, independent variables lagged in space, serial error autocorrelation, spatial error autocorrelation, spatial-specific and time-period-specific effects. The survey also examines the reasoning behind different model specifications and the purposes for which they can be used, which should be useful for practitioners.
TL;DR: In this article, the authors review the theoretical and empirical rationale for network dependence and spatial externalities as embodied in spatially lagged variables, arguing that failing to acknowledge their presence at least leads to biased inference, can be a cause of inconsistent estimation, and leads to an incorrect understanding of true causal processes.
Abstract: Spatial econometrics has been criticized by some economists because some model specifications have been driven by data-analytic considerations rather than having a firm foundation in economic theory. In particular, this applies to the so-called W matrix, which is integral to the structure of endogenous and exogenous spatial lags, and to spatial error processes, and which are almost the sine qua non of spatial econometrics. Moreover, it has been suggested that the significance of a spatially lagged dependent variable involving W may be misleading, since it may be simply picking up the effects of omitted spatially dependent variables, incorrectly suggesting the existence of a spillover mechanism. In this paper, we review the theoretical and empirical rationale for network dependence and spatial externalities as embodied in spatially lagged variables, arguing that failing to acknowledge their presence at least leads to biased inference, can be a cause of inconsistent estimation, and leads to an incorrect understanding of true causal processes.
TL;DR: In this paper, the problem of how to analyze the data when the mediator or dependent variable is categorical is discussed and a solution is recommended that is both optimal in its statistical qualities as well as practical and easily implemented: compute zMediation.
TL;DR: In this article, the authors quantify the effect of space-time diffusion on the demand for cigarettes over the 30-year period from 1963 to 1992, where the spatial dependence is motivated by a bootlegging effect: buyers of cigarettes near state borders purchase in neighboring states if there is a price advantage to doing so.
TL;DR: This paper identified 327 dependent/outcome variables used in 257 empirical family business studies in 1998-2009 and developed a numerical taxonomy with seven clusters (performance, strategy, social and economic impact, governance, succession, family business roles, and family dynamics) plotted along two dimensions (business-family and short-term-long-term).
Abstract: To promote theoretical development in family business research, this research identified 327 dependent/outcome variables used in 257 empirical family business studies in 1998-2009. In four studies, the authors categorized outcome variables, developed a numerical taxonomy with seven clusters (performance, strategy, social and economic impact, governance, succession, family business roles, and family dynamics) plotted along two dimensions (business–family and short-term–long-term), validated their research, and identified missing outcome variables and variables that deserve more attention. Experts agree that family business roles, succession, and family dynamics make the family business domain unique and that noneconomic performance and family-specific topics deserve more attention.
TL;DR: In this article, the authors explore the use of control variables in management research, as reflected in both macro and micro management studies published in four leading management journals and find that it is not at all uncommon for the control variables included in studies to account for more variance than the main effects.
Abstract: This study explores the use of control variables in management research, as reflected in both macro and micro management studies published in four leading management journals. Based on a review of 812 empirical articles published from 2005 to 2009—a much larger sample than was employed by earlier studies of control variables—the authors make several important observations. One key finding is that, given it is not at all uncommon for the control variables included in studies to account for more variance than the main effects, it is surprising how infrequently adequate justification for inclusion is provided. In addition, even when justification is provided, often no expectation of the nature of the relationship between control and dependent variables is offered. The authors also make several recommendations for both authors and reviewers. The most important may be to avoid simple mimicry of others and think more deeply about the theoretical foundation for the control variables included in empirical studies.
TL;DR: A basic review of the data screening and assumption testing issues relevant to exploratory and confirmatory factor analysis along with practical advice for conducting analyses that are sensitive to these concerns is provided.
Abstract: We provide a basic review of the data screening and assumption testing issues relevant to exploratory and confirmatory factor analysis along with practical advice for conducting analyses that are sensitive to these concerns. Historically, factor analysis was developed for explaining the relationships among many continuous test scores, which led to the expression of the common factor model as a multivariate linear regression model with observed, continuous variables serving as dependent variables, and unobserved factors as the independent, explanatory variables. Thus, we begin our paper with a review of the assumptions for the common factor model and data screening issues as they pertain to the factor analysis of continuous observed variables. In particular, we describe how principles from regression diagnostics also apply to factor analysis. Next, because modern applications of factor analysis frequently involve the analysis of the individual items from a single test or questionnaire, an important focus of this paper is the factor analysis of items. Although the traditional linear factor model is well-suited to the analysis of continuously distributed variables, commonly used item types, including Likert-type items, almost always produce dichotomous or ordered categorical variables. We describe how relationships among such items are often not well described by product-moment correlations, which has clear ramifications for the traditional linear factor analysis. An alternative, non-linear factor analysis using polychoric correlations has become more readily available to applied researchers and thus more popular. Consequently, we also review the assumptions and data-screening issues involved in this method. Throughout the paper, we demonstrate these procedures using an historic data set of nine cognitive ability variables.
TL;DR: In this paper, the authors explored the factors determining extrinsic and intrinsic rewards and their impact on employee performance, along with actions to encourage commercial banks to consider a more systematic and structured approach to acknowledging employees' efforts, which would in turn foster a high-performance culture in the commercial banks of Bangladesh.
Abstract: This study seeks to examine the relationship between rewards and employee performance as well as to identify the relationship between extrinsic and intrinsic rewards. The study explored the factors determining extrinsic and intrinsic rewards and their impact on employee performance, along with actions to encourage commercial banks to consider a more systematic and structured approach to acknowledging employees' efforts, which would in turn foster a high-performance culture in the commercial banks of Bangladesh. Frequency tables and graphs based on descriptive statistics were used in the study to provide information on demographic variables. The results are investigated in terms of descriptive statistics followed by inferential statistics on the variables. A total of 200 questionnaires were distributed to employees of the commercial bank and a total of 180 employees completed the questionnaire properly. The results indicate that there is a statistically significant relationship between all of the independent variables and the dependent variable, employee work performance, and that all the independent variables have a positive influence on employee work performance.
TL;DR: The results demonstrate the importance of including static and dynamic non-climate variables in addition to climate variables in species distribution models designed to predict future change in a species' habitat or distribution as a result of climate change.
TL;DR: In this article, a generalized Hoeffding-Sobol decomposition is used to measure the sensitivity of the output with respect to the input variables, and the estimation of these new indices is discussed.
Abstract: In this paper, we consider a regression model built on dependent variables. This regression models an input-output relationship. Under boundedness assumptions on the joint distribution function of the input variables, we show that a generalized Hoeffding-Sobol decomposition is available. This leads to new indices measuring the sensitivity of the output with respect to the input variables. We also study and discuss the estimation of these new indices.
TL;DR: In this article, the authors used data on the US banking sector from official web sources of the US Federal Reserve System; the correlation and regression tests employed show that the research model used is of good statistical health.
Abstract: The non-performing loan rate is the most important issue for banks' survival. Many factors are responsible for this ratio. Some of them are firm-level issues and some are macroeconomic measures. This study, however, is based on a blend of both. It considers Real GDP per Capita, Inflation, and Total Loans as independent variables, and the Non-Performing Loan Ratio as the dependent variable. The study uses data on the US banking sector from official web sources of the US Federal Reserve System. The years from 1985 to 2010 constitute the study period. The correlation and regression tests employed show that the research model used is of good statistical health. All the selected independent variables have a significant impact on the dependent variable; however, the values of the coefficients are not very high. Banks should control and amend their credit advancement policy with respect to the mentioned variables to achieve a lower non-performing loan ratio.
TL;DR: In this article, the authors discuss spatial regression models in empirical studies of public choice issues using impacts arising from population migration on the provision of county-level government services as an illustration, and examine the direct and indirect (spatial spillover) effects which taken together determine the total impact (on the dependent variable) arising from a change in the explanatory variables.
Abstract: It is frequently assumed that regional observations on local government behavior, voters, regional taxes, etc. can be analyzed using ordinary least-squares (OLS) methods. We discuss spatial regression models in empirical studies of public choice issues using impacts arising from population migration on the provision of county-level government services as an illustration. Spatial regressions allow an examination of the direct and indirect (spatial spillover) effects which taken together determine the total impact (on the dependent variable) arising from a change in the explanatory variables. This decomposition should be quite useful in empirical public choice studies.
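The direct, indirect, and total effects described above can be sketched for one common specification. The example assumes a spatial autoregressive model y = ρWy + Xβ + ε, for which the matrix of partial derivatives of y with respect to one explanatory variable is (I − ρW)⁻¹β; the average diagonal element is then the direct effect and the average row sum is the total effect. The abstract does not state which specification the authors use, so the model, the four-county weight matrix, and the values ρ = 0.4 and β = 1.0 are all assumptions for illustration.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def sar_effects(W, rho, beta):
    """Average direct, indirect, and total effects of one regressor in an
    assumed SAR model, computed from S = (I - rho * W)^-1 * beta."""
    n = len(W)
    a = [[(1.0 if i == j else 0.0) - rho * W[i][j] for j in range(n)]
         for i in range(n)]
    inv = invert(a)
    direct = beta * sum(inv[i][i] for i in range(n)) / n    # own-region impact
    total = beta * sum(sum(row) for row in inv) / n         # average row sum
    return direct, total - direct, total


# Hypothetical example: four counties on a line, row-normalized contiguity weights
W = [
    [0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0],
    [0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 1.0, 0.0],
]
direct, indirect, total = sar_effects(W, rho=0.4, beta=1.0)
print(f"direct: {direct:.3f}, indirect: {indirect:.3f}, total: {total:.3f}")
```

Because each row of W sums to one, the total effect in this setup equals β / (1 − ρ), which gives a quick check on the computation. The direct effect exceeds β because of feedback: a change in one county affects its neighbors, which in turn affect the original county.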
TL;DR: In this paper, a lab-in-the-field experiment in rural Malawi involving large stakes was conducted to examine intertemporal choice revision and its determinants, and they found that revisions of money allocations toward the present are positively associated with measures of present-bias from an earlier baseline survey, as well as the (randomly assigned) closeness in time to the first possible date of money disbursement.
Abstract: The very poor in developing countries often make intertemporal choices that seem at odds with their individual self-interest. There are many possible reasons why. This paper investigates several of these reasons with a lab-in-the-field experiment in rural Malawi involving large stakes. It makes two contributions. First, it constructs a new dependent variable: revisions of prior choices regarding the allocation of future income. This allows us to directly examine intertemporal choice revision and its determinants. In particular, this dependent variable permits a novel test for the existence of self-control problems. It turns out revisions of money allocations toward the present are positively associated with measures of present-bias from an earlier baseline survey, as well as the (randomly assigned) closeness in time to the first possible date of money disbursement. Second, the paper investigates other potential determinants of revision, aside from self-control problems. It finds little evidence that revisions of money allocations toward the present are associated with spousal preferences for such revision, household shocks or the financial sophistication of respondents.
TL;DR: In this article, a regression analysis of energy consumption in the Spanish banking sector is presented, for which the available data is divided into a prediction and a validation subset. Three models are obtained: the first can be used to predict the energy consumption of the whole banking sector, while the other two estimate the energy consumption of branches with low winter climate severity (Model 2) and high winter climate severity (Model 3).
TL;DR: The parallels between this problem and that of selective influences in behavioral sciences are established by observing that noncommuting measurements in quantum physics are mutually exclusive and can be treated as different levels of one and the same factor.
TL;DR: In this article, a lab-in-the-field experiment in rural Malawi involving large stakes was conducted to examine intertemporal choice revision and its determinants, and they found that revisions of money allocations toward the present are positively associated with measures of present-bias from an earlier baseline survey, as well as the (randomly assigned) closeness in time to the first possible date of money disbursement.
Abstract: The very poor in developing countries often make intertemporal choices that seem at odds with their individual self-interest. There are many possible reasons why. This paper investigates several of these reasons with a lab-in-the-field experiment in rural Malawi involving large stakes. It makes two contributions. First, it constructs a new dependent variable: revisions of prior choices regarding the allocation of future income. This allows us to directly examine intertemporal choice revision and its determinants. In particular, this dependent variable permits a novel test for the existence of self-control problems. It turns out revisions of money allocations toward the present are positively associated with measures of present-bias from an earlier baseline survey, as well as the (randomly assigned) closeness in time to the first possible date of money disbursement. Second, the paper investigates other potential determinants of revision, aside from self-control problems. It finds little evidence that revisions of money allocations toward the present are associated with spousal preferences for such revision, household shocks or the financial sophistication of respondents.
TL;DR: In this paper, the authors examined the relationship between two independent variables, environmental response and emotional response, and the dependent variable, consumer buying behavior, and found that consumers purchase those brands to which they are emotionally attached.
Abstract: Advertising is a way of communicating to convince an audience to make a purchase decision about a product or service and of delivering information to viewers. This paper examines the relationship between two independent variables, environmental response and emotional response, and the dependent variable, consumer buying behavior. This research investigates the relationship between the variables involved by collecting 200 responses in the twin cities of Pakistan. The findings of this study show a moderate relationship between the independent variables and the dependent variable. They show that consumers purchase those brands to which they are emotionally attached. This study reports new results in the field of consumer buying behavior.
TL;DR: In this article, the authors examined whether aid proliferation hinders economic growth by applying the standard aid-growth regression to Roodman's (2007) dataset, with proper correction for possible biases arising from omitted variable and endogeneity problems.
TL;DR: It is concluded that with data that are currently available, studies performed at a resolution of 1–100 km2 are useful for hypothesizing about the environmental conditions that limit the distribution of a species and incorporating coarse resolution species occurrence data in a model, despite an increase in sample size, lowers model performance.
Abstract: Species distribution models (SDMs) are an important tool in biogeography and ecology and are widely used for both fundamental and applied research purposes. SDMs require spatially explicit information about species occurrence and environmental covariates to produce a set of rules that identify and scale the environmental space where the species was observed and that can further be used to predict the suitability of a site for the species. More spatially accurate data are increasingly available, and the number of publications on the influence of spatial inaccuracies on the performance of modelling procedures is growing exponentially. Three main sources of uncertainty are associated with the three elements of a predictive function: the dependent variable, the explanatory variables and the algorithm or function used to relate these two variables. In this study, we review how spatial uncertainties influence model accuracy and we propose some methodological issues in the application of SDMs with regard to the modelling of fundamental and realized niches of species. We distinguish two cases suitable for different types of spatial data accuracy. For modelling the realized distribution of a species, particularly for management and conservation purposes, we suggest using only accurate species occurrence data and large sample sizes. Appropriate data filtering and examination of the spatial autocorrelation in predictors should be a routine procedure to minimize the possible influence of positional uncertainty in species occurrence data. However, if the data are sparse, models of the potential distribution of species can be created using a relatively small sample size, and this can provide a generalized indication of the main regional drivers of the distribution patterns. By this means, field surveys can be targeted to discover unknown populations and species in poorly surveyed regions in order to improve the robustness of the data for later modelling of the realized distributions. Based on this review, we conclude that (1) with data that are currently available, studies performed at a resolution of 1–100 km2 are useful for hypothesizing about the environmental conditions that limit the distribution of a species and (2) incorporating coarse resolution species occurrence data in a model, despite an increase in sample size, lowers model performance.
TL;DR: This paper improves upon an earlier formulation of the POPPC model by offering an additional conceptual justification for the model and an estimation method that does not require the use of person-threshold data.
TL;DR: In this paper, the authors present a method to separate the two sources of a moderating effect for a general class of nonlinear models that includes all LDV models commonly used in the management literature.
TL;DR: In this paper, the authors proposed an empirical methodology for modeling and mapping the monthly and annual air temperature using remote sensing and GIS techniques, and evaluated three interpolation approaches (inverse distance weighted averaging, thin-plate smoothing splines, and co-kriging) for a 4° × 4° area covering the eastern part of India.
Abstract: Geographic Information Systems (GIS) and modeling are becoming powerful tools in agricultural research and natural resource management. This study proposes an empirical methodology for modeling and mapping of the monthly and annual air temperature using remote sensing and GIS techniques. The study area is Gangetic West Bengal and its neighborhood in eastern India, where a number of weather systems occur throughout the year. Gangetic West Bengal is a region of strong heterogeneous surface with several weather disturbances. This paper also examines statistical approaches for interpolating climatic data over large regions, providing different interpolation techniques for climate variables' use in agricultural research. Three interpolation approaches, namely inverse distance weighted averaging, thin-plate smoothing splines, and co-kriging, are evaluated for a 4° × 4° area covering the eastern part of India. The land use/land cover, soil texture, and digital elevation model are used as the independent variables for temperature modeling. Multiple regression analysis with standard method is used to add dependent variables into regression equation. Prediction of mean temperature for the monsoon season is better than for the winter season. Finally, standard deviation errors are evaluated after comparing the predicted temperature and observed temperature of the area. For better improvement, distance from the coastline and seasonal wind pattern are stressed to be included as independent variables.
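Of the three interpolation approaches named above, inverse distance weighted averaging is the simplest to sketch: the estimate at a target point is a weighted mean of the station values, with weights proportional to 1 / distance^p. The station coordinates, temperatures, and the power p = 2 below are hypothetical values chosen for the example and are not taken from the study.

```python
import math


def idw(stations, x, y, power=2.0):
    """Inverse distance weighted estimate at (x, y).

    stations: list of (station_x, station_y, value) tuples.
    power:    distance exponent; larger values give nearby stations more weight.
    """
    weighted_sum = 0.0
    weight_total = 0.0
    for sx, sy, value in stations:
        dist = math.hypot(x - sx, y - sy)
        if dist == 0.0:
            return value                 # target coincides with a station
        weight = 1.0 / dist ** power
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total


# Hypothetical stations at the corners of a 4 x 4 degree cell: (lon offset, lat offset, temp in C)
stations = [
    (0.0, 0.0, 24.0),
    (4.0, 0.0, 28.0),
    (0.0, 4.0, 22.0),
    (4.0, 4.0, 26.0),
]
center = idw(stations, 2.0, 2.0)         # equidistant from all four stations
near_corner = idw(stations, 3.5, 0.5)    # dominated by the 28.0 station
print(f"center: {center:.2f}, near corner: {near_corner:.2f}")
```

An IDW estimate always lies between the smallest and largest station values, which is one reason the method cannot capture elevation-driven extremes without auxiliary variables such as the digital elevation model mentioned in the abstract.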
TL;DR: In this article, the authors used logistic regression (LR) and various financial ratios as independent variables to investigate indicators that significantly affect the performance of stocks actively traded on the Indian stock market.
Abstract: The authors use logistic regression (LR) and various financial ratios as independent variables to investigate indicators that significantly affect the performance of stocks actively traded on the Indian stock market. The study sample consists of the ratios of 30 large market capitalization companies over a four-year period. The study identifies and examines eight financial ratios that can classify the companies up to a 74.6% level of accuracy into two categories – “good” or “poor” – based on their rate of return. The paper asserts that the model developed can enhance an investor's stock price forecasting ability. Macroeconomic variables, which also can influence the share price, were not taken into account, however. The paper discusses the practical implications of using the LR method to predict the probability of good stock performance. The authors state that the model can be used by investors, fund managers, and investment companies to enhance their ability to select out-performing stocks.
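The general idea of classifying companies as "good" or "poor" from financial ratios with logistic regression can be sketched as follows. The example fits the model by plain gradient descent on synthetic data with two made-up standardized ratios; it does not use the study's eight ratios, its sample, or its estimation procedure, and the learning rate, epoch count, and data-generating weights are assumptions.

```python
import math
import random


def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit weights and intercept by batch gradient descent on the log-loss."""
    n, k = len(X), len(X[0])
    w = [0.0] * k
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * k
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j in range(k):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, xi):
    """Return 1 ("good") if the predicted probability is at least 0.5, else 0 ("poor")."""
    return 1 if sigmoid(b + sum(wj * xj for wj, xj in zip(w, xi))) >= 0.5 else 0


# Synthetic company-years: two hypothetical standardized ratios per observation
rng = random.Random(42)
X, y = [], []
for _ in range(400):
    ratios = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    score = 1.5 * ratios[0] - 1.0 * ratios[1] + rng.gauss(0.0, 1.0)
    X.append(ratios)
    y.append(1 if score > 0 else 0)      # 1 = "good" return, 0 = "poor"

w, b = fit_logistic(X, y)
accuracy = sum(predict(w, b, xi) == yi for xi, yi in zip(X, y)) / len(y)
print(f"weights: {w}, intercept: {b:.3f}, in-sample accuracy: {accuracy:.3f}")
```

The accuracy here is measured on the same data used for fitting, so it is an optimistic figure; a held-out validation set would give a fairer estimate of how well such a classifier selects out-performing stocks.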
TL;DR: In this article, the authors used data of the 2004 German Socio-Economic Panel and Bayesian analysis in order to analyze to what degree violations of the strong validity assumption affect the estimation results.
TL;DR: In this paper, the authors provide a critical look at these studies and conclude that future studies on tourists' length of stay should abandon survival models if they are conducted along similar lines as the ones to date.
TL;DR: In this article, two rule-based machine learning regression methods, Cubist and Random Forest (RF), were compared with traditional OLS regression for hedonic modeling of house sale prices.
TL;DR: The use of pattern recognition and data mining techniques in risk prediction models in the clinical domain of cardiovascular medicine is proposed, and the data is to be modelled and classified using classification data mining techniques.
Abstract: In this research paper, the use of pattern recognition and data mining techniques in risk prediction models in the clinical domain of cardiovascular medicine is proposed. The data is to be modelled and classified using classification data mining techniques. One limitation of conventional medical scoring systems is that they rely on intrinsic linear combinations of the variables in the input set and hence are not adept at modelling nonlinear complex interactions in medical domains. This limitation is handled in this research by the use of classification models, which can implicitly detect complex nonlinear relationships between dependent and independent variables and can detect all possible interactions between predictor variables.