Thesis

Data properties influencing the use of multiple logistic regression

Situational factors were investigated in order to compare the multiple linear and multiple logistic regression models. The study was performed in two phases: the first used artificially created data sets, and the second analyzed a real data set. In phase one, data sets without any residual error were created from a logistic regression equation obtained in a previous research project, so that the linear and logistic models could be compared. Creating the data sets allowed investigation of several situational factors that might influence model preference. In the first investigation, the parameter estimates of the logistic equations were held constant while the values of the predictor variable were made progressively more extreme. In the second, the predictor variable values were held constant while the parameter values were varied. Phase two compared the multiple linear and multiple logistic regression models on the selected criteria, using a real data set containing 70 predictor variables.
For phase one, it was found that, with the parameter values held constant, the multiple linear model became less accurate on the criterion used to evaluate the models as the values of the predictor variable became more extreme. This difference was even greater when all values of the predictor variable were extreme and of the same sign. When the values of the predictor variable were held constant and the value of the parameter estimate was varied, the multiple linear regression model became less accurate relative to the logistic model as the value of the parameter increased.
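
The phase-one design can be illustrated with a minimal sketch. It is not the procedure used in the study: it assumes a single predictor rather than several, uses root-mean-square error (RMSE) as a stand-in for the unnamed evaluation criterion, and the coefficient values and the helper `linear_fit_rmse` are hypothetical. Because the data are generated without residual error, the logistic model reproduces them exactly, so the RMSE of the least-squares line measures how far the linear model falls behind.

```python
import numpy as np


def logistic(z):
    """Standard logistic function."""
    return 1.0 / (1.0 + np.exp(-z))


def linear_fit_rmse(x, b0, b1):
    """RMSE of an ordinary least-squares line fitted to error-free
    probabilities generated by the logistic equation p = logistic(b0 + b1*x).

    b0 and b1 are illustrative values, not the coefficients from the study.
    """
    p = logistic(b0 + b1 * x)                      # error-free response
    X = np.column_stack([np.ones_like(x), x])      # intercept + predictor
    coef, *_ = np.linalg.lstsq(X, p, rcond=None)   # linear model fit
    return float(np.sqrt(np.mean((X @ coef - p) ** 2)))


# Investigation 1: parameters fixed, predictor values made more extreme.
base = np.linspace(-1.0, 1.0, 21)
scales = (0.5, 1.0, 2.0, 4.0)
rmse_by_scale = [linear_fit_rmse(s * base, 0.0, 1.0) for s in scales]

# Investigation 2: predictor values fixed, slope parameter varied.
x_fixed = np.linspace(-2.0, 2.0, 21)
slopes = (0.5, 1.0, 2.0, 4.0)
rmse_by_slope = [linear_fit_rmse(x_fixed, 0.5, b1) for b1 in slopes]

for s, r in zip(scales, rmse_by_scale):
    print(f"predictor scale {s}: linear RMSE {r:.4f}")
for b1, r in zip(slopes, rmse_by_slope):
    print(f"slope {b1}: linear RMSE {r:.4f}")
```

In this simplified setting the linear RMSE grows along both sequences, which is the pattern reported for phase one. With one predictor the two investigations are closely related, since both enlarge the range of `b0 + b1*x` over which the logistic curve must be approximated by a straight line.
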
For phase two, few differences were found between the two models for most of the methods used to select subsets of predictor variables. However, one method did indicate a significant difference between the models on all designated criteria used for phase two. For subset selection using the stepwise logistic regression method, it was found for a number of criteria that the logistic model performed significantly better than the multiple linear model. The variance explained by the multiple logistic model was 10% higher than that explained by the linear model; the logistic model had 27 more cases with residual values less than 0.1; and it had three fewer misclassifications.
