Sains Malaysiana, 2017;46:1001-1010.

Abstract

The location model proposed in the past is a predictive discriminant rule that can classify new observations into one
of two predefined groups based on mixtures of continuous and categorical variables. The ability of location model to
discriminate new observation correctly is highly dependent on the number of multinomial cells created by the number
of categorical variables. This study conducts a preliminary investigation to show the location model that uses maximum
likelihood estimation has high misclassification rate up to 45% on average in dealing with more than six categorical
variables for all 36 data tested. Such model indicated highly incorrect prediction as this model performed badly for
large categorical variables even with large sample size. To alleviate the high rate of misclassification, a new strategy
is embedded in the discriminant rule by introducing nonlinear principal component analysis (NPCA) into the classical
location model (cLM), mainly to handle the large number of categorical variables. This new strategy is investigated
on some simulation and real datasets through the estimation of misclassification rate using leave-one-out method. The
results from numerical investigations manifest the feasibility of the proposed model as the misclassification rate is
dramatically decreased compared to the cLM for all 18 different data settings. A practical application using real dataset
demonstrates a significant improvement and obtains comparable result among the best methods that are compared. The
overall findings reveal that the proposed model extended the applicability range of the location model as previously it
was limited to only six categorical variables to achieve acceptable performance. This study proved that the proposed
model with new discrimination procedure can be used as an alternative to the problems of mixed variables classification,
primarily when facing with large categorical variables.