Research Article Volume 2 Issue 1
1Department of Mathematics/Computer, Federal University of Petroleum Resources Effurun, Nigeria
2Department of Computer Science Education, Federal College of Education Technical, Nigeria
Correspondence: AA Ojugo, Department of Mathematics/Computer, Federal University of Petroleum Resources Effurun, Nigeria
Received: December 23, 2017 | Published: February 1, 2018
Citation: Ojugo AA, Okobah I. Quest for an intelligent convergence solution for the well-known David, Fletcher and Powell quadratic function using supervised models. Open Access J Sci. 2018;2(1):53-59. DOI: 10.15406/oajs.2018.02.00044
Optimization tasks aim to resolve complex events by exploiting historic data and by exploring task knowledge and symbolic reasoning to yield new heuristics and paradigms, whose outcome is the underlying probability of all possible feats in an optimal solution. The outcome consists of input constraints/parameters to be satisfied, while guaranteeing an explicit reasoning structure that conveys data about the task via an algorithm that assigns as output a set of variables satisfying these constraints, in a bid to prune off a huge portion of the search space. Our study seeks to optimize quadratic functions via the Davidon, Fletcher and Powell (DFP) method, and to use it as a pre-processor to train selected supervised models, in order to find all converged points. As dataset, we use the trial solution of the quadratic equation written as a sum of two parts: (a) satisfy the initial condition for the unconstrained optimization, using DFP and the supervised models to resolve the quadratic function; and (b) use DFP as a pre-processor with adjustable parameters to train the supervised models. Results indicate that QDA outperforms KNN, LDA and DFP in all cases in yielding an approximate solution of the quadratic function, and also show that the models converge close to the analytic solution. The approach is easily extended to solve a wide range of other problems.
Optimization models tend to resolve complex tasks by exploiting historic data whose underlying probability feats yield an optimal solution. They achieve this also by exploring domain knowledge and symbolic reasoning, expressed as a mathematical model, to yield a heuristic method that tolerates noise and accepts partial truth, uncertainty and ambiguity at its input, while exhibiting a great level of robustness, flexibility and continuous adaptation (as model feats) in its bid to resolve the constraints within the system. Such a model thus mimics agents in search of an optimal solution (food and survival) in a domain space, and is mostly inspired by behavioral patterns and the natural laws of biological evolution. It yields output feats with uncontrolled parameters as input (not explicitly present from the outset); these are modeled via boundary values in a domain space confined to real parameters. The model derives, via experience, the ability to recognize behavioral feats from observed data, and can suggest optimal solutions of high quality that are void of over-fitting, irrespective of modifications made to the model via other approximations with multiple agents; such modifications constantly affect the quality of the solution. Also, many such heuristics are combined to create hybrids that explore the structural differences in the statistical methods used, and help resolve the implications of the conflict constrained on the model by such a multi-agent populated system, since agents can often create their own behavioral rules based on historic datasets.1
Chaotic and dynamic phenomena abound in almost every sphere of our daily endeavors, and they must be resolved via the creation of systems and models that mimic such behaviors. Thus, nonlinear systems are today modeled as mathematical equations to help resolve nonlinear, dynamic and complex feats in tasks such as control systems and other applications. These systems are often modeled as quadratic functions, and various methods are in use to resolve quadratic and differential equations; yet an optimal solution for such nonlinear, quadratic equations remains a challenging task.2–4 Studies have shown that parallel-processor computers, in solving a first-order differential equation, will use an unsupervised (neural network) model as a factor of speed;5 a new method to solve first-order linear ordinary and partial differential equations using ANN was then presented,6 and was further buttressed by6,7 a hybrid ANN-PSO intelligence model to solve the Wessinger equation (though that model could not satisfy the initial and boundary conditions). It was improved by,5 while6 went further to satisfy the solution for initial and boundary condition problems. Search methods aim to maximize/minimize the objective functions of constraint satisfaction problems (CSPs) and yield a feasible, optimal (closest to best) solution. CSPs are dynamic, nonlinear and complex, making the search for their solution cumbersome and difficult to resolve. This study presents a comparative stochastic model for solving the quadratic equation using DFP as a pre-processor, to seek optimality and convergence properties.
Mathematical optimization is the selection of the best element from a set of available alternatives. It thus aims at either a maximization or a minimization task of a real function: inputs are systematically chosen from an allowed set of variables and the function's output is computed so as to find the best available values via the objective function in a given domain.8–10 Following,11 we note:
Definition 1
A continuous optimization problem is defined as a pair (S, f), where S is the set of possible solutions, S = R^N, N is the number of controllable parameters and R is the real number line. Thus, f is an objective function f: S → R to be optimized, where a solution vector X is limited between lower and upper bounds (XL ≤ X ≤ XU). A maximization task searches for a solution greater than or equal to all other solutions, and a minimization task searches for a solution smaller than or equal to all other solutions. The set of maximal and minimal solutions of a function f: S → R is defined as S* = {x ∈ S : f(x) ≥ f(y) for all y ∈ S} for maximization, and S* = {x ∈ S : f(x) ≤ f(y) for all y ∈ S} for minimization.
The quadratic problem formulation for DFP
We aim to minimize the objective f0(x) over a feasible set A ⊆ R^N. Dai (2001) notes that a point x* ∈ A is a global minimum of f0(x) if f0(x*) ≤ f0(x) for all x ∈ A.
An ideal optimization allows its objective function to have a unique minimizer; thus, the function in Eq. 3 has a global minimizer. Conversely, a point x* ∈ A is a local minimizer if there exists ε > 0 such that f(x*) ≤ f(x) for any x ∈ A with ||x − x*|| ≤ ε.
Line search method / algorithm
Numeric algorithms for unconstrained optimization (grouped into line search and trust region methods) aim to minimize the nonlinear function f(x) as in Eq. 4.12,13 Thus, for an iterative scheme on a continuous f(x) with initial point x1, the kth iteration (new point xk+1) is computed as xk+1 = xk + αk dk, where dk is the search direction and αk the step size.14
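To make the iteration concrete, the following is a minimal Python sketch of the generic line-search scheme xk+1 = xk + αk dk with a backtracking (Armijo) step-size rule; the function names, constants and the steepest-descent direction are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def backtracking_line_search(f, grad, x, d, alpha0=1.0, rho=0.5, c1=1e-4):
    """Shrink alpha until the Armijo sufficient-decrease condition holds."""
    alpha = alpha0
    fx, gx = f(x), grad(x)
    while f(x + alpha * d) > fx + c1 * alpha * (gx @ d):
        alpha *= rho
    return alpha

def line_search_minimize(f, grad, x0, max_iter=500, tol=1e-8):
    """Generic iteration x_{k+1} = x_k + alpha_k * d_k with a descent direction."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop near a stationary point
            break
        d = -g                             # steepest-descent direction d_k
        alpha = backtracking_line_search(f, grad, x, d)
        x = x + alpha * d                  # x_{k+1} = x_k + alpha_k d_k
    return x
```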
The goal of this convergence analysis is to study the generated sequence {xk} and to compare the convergence performance of methods15,16 such as DFP, and DFP used as a pre-processor. The sequence {xk} converges to a point x* if lim k→∞ ||xk − x*|| = 0 (Eq. 5).
Since the solution x* in Eq. 5 is not available, a possible replacement is the condition in Eq. 6 given by,17,18 which unfortunately still does not guarantee the convergence of {xk}.
Global convergence for unconstrained optimization, by contrast, aims to prove Eq. 7, which ensures that xk approaches the set of stationary points where ∇f(x) = 0, or Eq. 8, which ensures that at least a subsequence of {xk} approaches the set of stationary points.19,20
Local convergence, in turn, addresses the convergence speed of the sequence generated by the algorithm. In studying this, we assume the sequence {xk} converges to a local minimizer x* at which the second-order sufficient condition holds, since otherwise the Newton method can converge very slowly.13–21
Quasi-Newton method
Newton's method is the basis of the quasi-Newton methods,22,23 which are among the most widely used for solving nonlinear equations and unconstrained optimization. For a smooth function f(x) with a positive definite Hessian, the second-order Taylor expansion yields f(x + p) ≈ f(x) + ∇f(x)ᵀp + ½ pᵀ∇²f(x)p.
As long as ∇²f(x) is positive definite, the Newton direction p is obtained by solving ∇²f(x)p = −∇f(x), and the next approximation is x + p.
The Newton direction has proven to be more expensive than the steepest descent direction, since we must compute the Hessian matrix and invert it (not required with quasi-Newton methods). Its merits are: (a) the convergence rate of Newton's method is quadratic, so there is much to gain in finding its direction; (b) it forms a good starting point if f''(x) is positive definite; and (c) it is simple and easy to implement. Its demerits are: (a) it is not globally convergent for many tasks; (b) it may diverge if the starting approximation is far from the solution; (c) it fails if the Hessian matrix is not invertible; and (d) it requires analytic second-order derivatives of f. The method is also seen as an approximation of the Newton-Raphson method.23,24
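For comparison with the quasi-Newton variants below, the following minimal sketch (an illustration under the stated assumptions, not the paper's implementation) computes one Newton step by solving the linear system ∇²f(x)p = −∇f(x), rather than forming the inverse Hessian explicitly.

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton iteration: solve hess(x) p = -grad(x) for the Newton
    direction p instead of inverting the Hessian."""
    g = grad(x)
    H = hess(x)
    p = np.linalg.solve(H, -g)   # raises LinAlgError if H is singular
    return x + p
```

For a convex quadratic objective, whose Hessian is constant and positive definite, a single such step from any starting point reaches the minimizer; this is the quadratic convergence merit noted above.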
Consider the behaviour of the quasi-Newton method on Broyden's class for the unconstrained optimization problem min f(x), x ∈ R^n. The class consists of iterations of the form xk+1 = xk − αk Bk⁻¹gk,
where gk is the gradient of f at xk, αk is the step size, and the Hessian approximation Bk is updated by25,26 a Broyden-family formula that is a convex combination, with parameter φk, of the BFGS and DFP updates.
This yields two named update formulae: φk = 0 yields the Broyden, Fletcher, Goldfarb and Shanno (BFGS) method, while φk = 1 yields the Davidon, Fletcher and Powell (DFP) method. It has been noted27,28 that with an exact line search all class members yield the same iterates, whereas with inexact line searches their performance varies markedly. We assume the step size αk is chosen by an inexact line search satisfying two (Wolfe) conditions: f(xk + αk dk) ≤ f(xk) + c1 αk gkᵀdk (Eq. 14) and g(xk + αk dk)ᵀdk ≥ c2 gkᵀdk (Eq. 15), with 0 < c1 < c2 < 1 and dk = −Bk⁻¹gk.
Several important results about this class of methods do not require an exact line search. If the assumptions hold that x* is a minimizer of f at which the Hessian matrix H(x*) is positive definite, and αk = 1 is accepted in DFP/BFGS so that Eq. 16 holds, then xk converges to x* Q-superlinearly and the corresponding error sum is finite; moreover, if the initial errors are sufficiently small, the step size αk = 1 satisfies the line search conditions of Eq. 14 and Eq. 15.28,29
DFP optimizes with or without an exact line search. Here, DFP computes the solution of a given quadratic function (the problem domain) via stochastic hybrid models, defines its convergence properties and states its influence on such convergence properties. The algorithm is:1–3
Set k = k + 1 and return to step 2.
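Since only the final step of the algorithm listing survives in the text above, the following Python sketch outlines the standard DFP procedure applied to a generic quadratic; the test matrix Q, vector b and the simple Armijo backtracking rule are illustrative assumptions, not the exact quadratic or line search used in the study.

```python
import numpy as np

def dfp_minimize(f, grad, x0, max_iter=200, tol=1e-8):
    """Quasi-Newton minimization using the DFP inverse-Hessian update."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                     # step 1: H_0 = I, k = 0
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:        # stop when the gradient is small
            return x
        d = -H @ g                         # step 2: direction d_k = -H_k g_k
        alpha, fx = 1.0, f(x)              # step 3: Armijo backtracking for alpha_k
        while f(x + alpha * d) > fx + 1e-4 * alpha * (g @ d):
            alpha *= 0.5
        s = alpha * d                      # s_k = x_{k+1} - x_k
        x_new, g_new = x + s, grad(x + s)
        y = g_new - g                      # y_k = g_{k+1} - g_k
        Hy = H @ y                         # step 4: DFP update of H
        H = H + np.outer(s, s) / (s @ y) - np.outer(Hy, Hy) / (y @ Hy)
        x, g = x_new, g_new                # step 5: k = k + 1, return to step 2
    return x

# Illustrative quadratic (Q and b are assumptions, not the paper's test function).
Q = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
print(dfp_minimize(f, grad, np.zeros(2)))  # converges to the analytic minimizer Q^{-1} b
```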
Intelligent proposed model
We seek to compare the resolution of the above quadratic function via the supervised models below, using an unsupervised model (HMM) as a benchmark to measure comparative performance.
Linear discriminant analysis (LDA)
LDA is a simple and effective supervised classification method with a wide range of applications. Its basic idea is to classify compounds (rules) by dividing the n-dimensional descriptor space into two regions separated by a hyper-plane defined by a linear discriminant function. Discriminant analysis generally transforms classification tasks into functions that partition data into classes, thus reducing the problem to the identification of a function. The focus of discriminant analysis is to determine this functional form (assumed to be linear) and estimate its coefficients. LDA was first introduced in 1936 by Ronald Aylmer Fisher; his LDA function works by finding the mean of a set of attributes for each class and using the mean of these means as the boundary. The function achieves this by projecting attribute points onto the vector that maximally separates the class means and minimizes the within-class variance, as expressed in Eq. 17,
where X is the vector of observed values, Xi (i = 1, 2…) is the mean of values for each group, S is the sample covariance matrix of all variables, and c is the cost function. If the misclassification cost of each group is considered equal, then c = 0. A member is classified into one group if the result of the equation is greater than c, and into the other if it is less than c. A result that equals c (set to 0) indicates such a sample cannot be classified into either class based on the features used by the analysis. The LDA function distinguishes between two classes; if a data set has more than two classes, the process must be broken down into multiple two-class problems. The LDA function is found for each class versus all samples that are not of that class (one-versus-all). Final class membership for each sample is determined by the LDA function that produces the highest value, and is optimal when variables are normally distributed with equal covariance matrices. In this case, the LDA function points in the same direction as the Bayes optimal classifier,30–33 and it performs well on moderate sample sizes in comparison to more complex methods. Its mathematical function is simple and requires nothing more complicated than matrix arithmetic. The assumption of linearity in the class boundary, however, limits the scope of application of linear discriminant analysis: real-world data frequently cannot be separated by a linear boundary, and when boundaries are nonlinear the performance of the linear discriminant may be inferior to other classification methods. To curb this, we adopt a decimal encoding of the data to give us a semblance of linear, continuous boundaries.
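A minimal sketch of two-class LDA using scikit-learn is given below; the synthetic Gaussian data stand in for the DFP-generated training set and are purely illustrative.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data (a placeholder for the DFP-generated training set).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(2.0, 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

lda = LinearDiscriminantAnalysis()   # assumes equal class covariance matrices
lda.fit(X, y)
print(lda.predict([[1.0, 1.0]]))     # class assigned by the linear boundary
print(lda.score(X, y))               # training accuracy
```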
Quadratic discriminant analysis (QDA)
QDA is another distance-based classifier, very similar to and more of an extension of LDA. Both discriminant functions assume that the values of each attribute in each class are normally distributed; however, in QDA the discriminant score between each sample and each class is calculated using the sample variance-covariance matrix of each class separately, rather than the overall pooled matrix, so the method takes into account the different variance of each class. While LDA assumes that the covariance matrices of the groups are equal, QDA makes no such assumption. When the covariance matrices are not equal, the boundary between the classes will be a hyper-conic, and in theory the use of quadratic discriminant analysis will result in better discrimination and classification rates. However, due to the increased number of additional parameters to be estimated, it is possible that classification by QDA is worse than that of linear discriminant analysis. The QDA score is found by evaluating Eq. 18.
The same conditions apply to the nature of c, as well as to classification in the case that the result is equal to c or zero. As with LDA, QDA distinguishes between two classes; multi-class data sets are handled the same way as for linear discriminant analysis.34,35 The size of the differences in variances determines how much better QDA performs than LDA: for large variance differences QDA excels when compared to LDA, and of the two, only QDA can be used when the population means are equal. QDA is more broadly applicable than LDA, but less resilient in non-optimal conditions. The quadratic discriminant can behave worse than the linear discriminant for small sample sizes, and data that are not normally distributed result in poorer performance by the quadratic discriminant compared to the linear one. The performance of the quadratic discriminant function has also been found to be more sensitive to the dimensionality of the data than the linear discriminant, improving as the number of attributes increases to a certain optimal number and then rapidly declining. Linear and nonlinear discriminant functions are the most widely used classification methods; this broad acceptance is due to their ease of use and the wide availability of tools. Both, however, assume that the form of the class boundary is known and fits a specific shape, assumed to be smooth and described by a known function. These assumptions may fail in many cases: to perform classification for a wider range of real-world data, a method must be able to describe boundaries of unknown, and possibly discontinuous, shapes.
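The contrast between the pooled covariance of LDA and the per-class covariance of QDA can be illustrated with a short scikit-learn sketch; the unequal-covariance synthetic data below are an assumption for illustration only.

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

# Classes with clearly unequal covariances, where QDA's per-class
# covariance estimate should pay off (synthetic placeholder data).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
               rng.normal(1.5, 2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

for model in (LinearDiscriminantAnalysis(), QuadraticDiscriminantAnalysis()):
    model.fit(X, y)
    print(type(model).__name__, model.score(X, y))   # training accuracy
```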
K-nearest neighbor (KNN)
The K-nearest neighbor (KNN) model is a well-known supervised learning algorithm for pattern recognition and is still one of the most popular nonparametric models for classification problems. KNN assumes that observations which are close together are likely to have the same classification. The probability that a point x belongs to a class can be estimated by the proportion of training points in a specified neighborhood of x that belong to that class. The point may be classified either by majority vote or by the similarity-degree sum of the specified number (k) of nearest points. In majority voting, the number of points in the neighborhood belonging to each class is counted, and the class to which the highest proportion of points belongs is the most likely classification of x. The similarity-degree sum calculates a similarity score for each class based on the k nearest points and classifies x into the class with the highest score. Its lower sensitivity to outliers means that majority voting is more commonly used than the similarity-degree sum, and we use majority voting for our data sets. To determine which points belong to the neighborhood, the distances from x to all points in the training set must be calculated. Any distance function that specifies which of two points is closer to the sample point could be employed; the most common distance metric in KNN is the Euclidean distance. The Euclidean distance between each test point ft and training-set point fs, each with n attributes, is calculated via Eq. 19 as d(ft, fs) = √(Σ (fti − fsi)²), summed over the n attributes i.
In general, the following steps are performed for the K-nearest neighbor model (a minimal sketch follows below): (a) choice of the k value, (b) distance calculation, (c) sorting distances in ascending order, (d) finding the k class values, and (e) finding the dominant class.
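A minimal from-scratch sketch of these steps (Euclidean distance, majority vote) is shown below; the function name and defaults are illustrative.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=5):
    """Steps (a)-(e): with k chosen, compute Euclidean distances to every
    training point, sort them, take the k nearest labels and return the
    dominant (majority-vote) class."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_train - np.asarray(x_test, dtype=float), axis=1)  # (b)
    nearest = np.argsort(dists)[:k]                                            # (c)+(d)
    return Counter(y_train[nearest]).most_common(1)[0][0]                      # (e)
```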
A challenge in using KNN is to determine the optimal size of k, which acts as a smoothing parameter. A small k is not sufficient to accurately estimate the population proportions around the test point; a larger k results in less variance in the probability estimates, but at the risk of introducing more bias. K should be large enough to minimize the probability of a non-Bayes decision, and small enough that all points included give an accurate estimate of the true class.33 The optimal value of k has been found to depend on the sample size, the covariance structure of each population, and the proportions of each population in the total sample.33 For cases where the differences in covariance matrices and the differences between sample proportions are both small or both large, the optimal k is about N^(3/8) (N is the number of samples in the training set); when there is a large difference between covariance matrices and a small difference between sample proportions (or vice versa), the optimal value is about N^(2/8).33 This model presents several merits: (a) its mathematical simplicity does not prevent it from achieving classification results as good as (or better than) other, more complex pattern recognition techniques; (b) it is free of statistical assumptions; (c) its effectiveness does not depend on the spatial distribution of the classes; and (d) when the boundaries between classes are not hyper-linear or hyper-conic, KNN performs better than LDA.
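The cited rule of thumb for k can be expressed as a small helper; the function below is an illustrative reading of the N^(3/8) / N^(2/8) heuristic, not code from the study.

```python
def heuristic_k(n_samples, similar_structure=True):
    """Rule-of-thumb choice of k: N**(3/8) when covariance and proportion
    differences are both small or both large, otherwise N**(2/8)."""
    exponent = 3 / 8 if similar_structure else 2 / 8
    return max(1, round(n_samples ** exponent))

print(heuristic_k(500))          # ~10 for a training set of 500 samples
print(heuristic_k(500, False))   # ~5 when only one of the differences is large
```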
However, LDA has been found to perform slightly better than KNN if the population covariance matrices are equal, a condition that suggests a linear boundary. As differences in the covariance matrices increase, KNN performs increasingly better than the LDA and QDA functions. Despite its merits, the demerits of the KNN model include that it does not work well if large differences are present in the samples of each class. KNN provides poor information about the structure of the classes and the relative importance of the variables in the classification. It does not allow a graphical representation of the results, and for a large number of samples the computation becomes excessively slow. In addition, the KNN model requires more memory and processing than other methods: all prototypes in the training set must be stored in memory and used to calculate the Euclidean distance from every test sample, so the computational cost grows rapidly as the number of prototypes increases.
Model performance and effectiveness measure
The performance34,35 of the models is evaluated via the computed MSE, MRE, MAE and coefficient of efficiency (COE). To measure their effectiveness and classification accuracy, we adopt the misclassification rate of each model as well as the corresponding improvement percentage of the proposed model in comparison with the other classification models for the DFP data (in both training and test data), summarized in Table 1. The improvement percentage when trained with DFP is computed as described below.
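A hedged Python sketch of these evaluation measures follows; the COE is implemented here in the Nash-Sutcliffe form and the improvement-percentage formula is an assumed relative error reduction, since the paper's own equations are not reproduced in the text.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """MSE, MRE, MAE and coefficient of efficiency (COE, Nash-Sutcliffe form assumed)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    mse = np.mean(err ** 2)
    mre = np.mean(np.abs(err) / np.abs(y_true))    # assumes no zero targets
    mae = np.mean(np.abs(err))
    coe = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return mse, mre, mae, coe

def improvement_percentage(error_without_dfp, error_with_dfp):
    """Assumed form: relative reduction in misclassification error once the
    model is trained on DFP pre-processed data, expressed as a percentage."""
    return 100.0 * (error_without_dfp - error_with_dfp) / error_without_dfp
```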
Results in Table 2 show that our supervised models LDA, QDA and KNN achieved improvement rates of 45.8%, 64.2% and 46.1% respectively. It is also observed that, because KNN scores are sensitive to the relative magnitudes of different attributes, all attributes were scaled by their z-scores before the KNN model was used. This is in tandem with, and the results are supported by,36–40 (Table 2). The number of data points in the quadratic function has great influence on model performance: if there are too few data points, most supervised models may not achieve accuracy; if there are too many, the result can be overtraining as well as over-parameterization.41–45
Model | MSE | MRE | MAE | COE | Convergence accuracy %
DFP | 0.56 | 0.43 | 0.49 | 0.49 | 30.02
LDA | 0.87 | 0.69 | 0.65 | 0.581 | 23.54
KNN | 0.67 | 0.55 | 0.56 | 0.481 | 28.6
QDA | 0.46 | 0.37 | 0.46 | 0.818 | 39.2

Table 1 Model convergence performance evaluation
Model | Improvement % (training data) | Improvement % (testing data)
LDA | 41.16 | 45.83
KNN | 41.79 | 46.09

Table 2 Improvement percentage with DFP pre-processor
The rationale for the choice of models includes
Related study
A fuzzy-trained neural network has been used to evaluate the well-known Wessinger quadratic function as a constraint satisfaction problem; the results show the model converged close to the data points within the range of the analytical solution. Another study adopted the memetic algorithm and simulated annealing (SA) to evaluate the quadratic function, training the hybrids with DFP as a pre-processor to yield approximate solutions. There too, a trial solution of the quadratic equation was written as the sum of two parts: (a) satisfying the initial condition for unconstrained optimization using DFP and the hybrids as separate methods, and (b) applying DFP as a pre-processor with adjustable parameters for the ANN-TLRN hybrid.56–61
The study consists of four phases: (a) train the models using the quadratic function, (b) delete outliers in the data by resolving the quadratic function and finding the underlying probability of convergence, (c) calculate the membership probability of output points in each class, and (d) assign each output to the appropriate class by largest probability. We adopt four known intelligent and statistical classification models: LDA, QDA, KNN and PHMM. QDA outperforms LDA, KNN and the classic DFP model. The unsupervised models do not assume the shape of the partition, unlike linear and quadratic discriminant analysis. In contrast to the K-nearest neighbor model, the proposed model does not require storage of the training data; once trained, it performs much faster than KNN because it does not need to iterate through individual training samples. The proposed model also does not require experimentation with, and final selection of, a kernel function and a penalty parameter, as support vector machines do; it relies solely on a training process to identify the final classifier. Finally, the unsupervised models do not need a large amount of data in order to yield accurate results.
None.
The authors declare that there is no conflict of interest.
©2018 Ojugo, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and build upon your work non-commercially.