Research Article Volume 10 Issue 3
Researcher, Institute of Statistical Research and Training (ISRT), University of Dhaka, Bangladesh
Correspondence: Md. Israt Rayhan, Professor, Institute of Statistical Research and Training (ISRT), University of Dhaka, Dhaka-1000, Bangladesh
Received: August 16, 2021 | Published: September 14, 2021
Citation: Hemo SA, Rayhan MI. Classification tree and random forest model to predict under-five malnutrition in Bangladesh. Biom Biostat Int J. 2021;10(3):116-123. DOI: 10.15406/bbij.2021.10.00337
Malnutrition is one of the leading causes of morbidity and mortality among children under the age of five in most developing countries, including Bangladesh. The main objective of this study is to build a model that predicts the nutritional status of under-five children using tree-based models and a classical approach. The study uses secondary data on 7,886 children from the Bangladesh Demographic and Health Survey 2014. Tree-based models (classification tree and random forest) and a classical model (multiple binary logistic regression) are fitted to assess the association of child malnutrition with potential socioeconomic and demographic factors. The random forest model achieves an accuracy of 70.1% and 72.4% and an area under the receiver operating characteristic curve of 69.8% and 70% for stunting and underweight, respectively. The prevalence of stunting and underweight among under-five children is 36.5% and 33%, respectively, and is higher in rural than in urban areas. Wealth index, exposure of the mother to mass media, age of child, size of child at birth, and parents' education are significantly associated with stunting and underweight of children.
Keywords: predictive modeling, data mining, random forest, classification tree
This study examines the effect of different socio-economic factors on children's nutritional status. Under-five malnutrition is widespread in both Sub-Saharan Africa and South Asia.1 Toma et al.2 found in their study in Bangladesh that more educated and wealthier mothers have better nourished children at birth. Sarma et al.3 used multiple binary logistic regression and found a significant association between stunting of under-five children in Bangladesh and covariates such as wealth index, exposure of the mother to mass media, age of child, size of child at birth, and parents' education. South Asia has the highest prevalence of under-five underweight, 49.3 percent,4 despite better economic development than Sub-Saharan Africa; this enigma has been attributed to the low status of women in South Asia.4 Khare et al.5 analyzed the association between child malnutrition and socio-economic covariates in the Indian DHS data and found that a machine learning approach identifies some important features that the classical models miss. Talukder and Ahammed6 claimed that the random forest algorithm was moderately superior to other machine learning algorithms in predicting malnutrition status among under-five children in Bangladesh.
Batterham et al.7 found that data mining methods can provide greater accuracy in a weight loss trial than conventional approaches; among the data mining methods, a decision tree gave them the highest accuracy. Bath8 stated that data mining can act as a knowledge discovery technique in analyzing health and medical data, and added that data mining can identify high-risk covariates with common features, whereas Cox's regression can provide an estimate of the strength of these risk factors. Rayhan and Khan9 analyzed BDHS 1999-2000 data with Cox's logistic regression model and concluded that the major contributing factors for under-five malnutrition were birth interval, size at birth, mother's body mass index at birth and parents' education. Alom et al.10 used BDHS 2007 data and found, through a multilevel regression analysis, that under-five malnutrition in Bangladesh was significantly associated with child's age, mother's education, father's education, father's occupation, family wealth index, current breastfeeding, place of delivery and division. Rahman et al.11 analyzed BDHS 2011 data and stated that low birth weight was significantly associated with malnutrition after controlling for other confounders. Sultana et al.12 used a multilevel generalized linear regression model on BDHS 2014 data and found that stunting was significantly associated with mother's age at birth, previous birth interval, mother's BMI, parents' education and wealth index. This study therefore turns to data mining methods that can group the risk factors of under-five malnutrition. Decision tree and random forest models are adopted as unconventional methods to analyze the BDHS 2014 data, using the conventional covariates suggested by the previous literature.
This study uses data from the Bangladesh Demographic and Health Survey (BDHS) 2014. The dependent variable is the malnutrition status of children under five years of age. Malnutrition is measured by two anthropometric indices: stunting (low height-for-age) and underweight (low weight-for-age). The indices are expressed as the number of standard deviations (SD) above or below the median of healthy children in the same age group of the reference population.13 Household socio-economic status, exposure of the mother to mass media, age of child, size of child at birth, parents' education, birth order, birth interval, mother's BMI, area of residence, administrative region and total number of children are considered to be associated with child malnutrition.9-12
Anthropometric measures
According to WHO,13 height-for-age measures a child's stunting, or chronic malnutrition, caused by inadequate nutrition over a long period of time. A child whose height-for-age is more than two standard deviations below the median (-2 SD) of the WHO reference population is considered stunted; a child more than three standard deviations below the median (-3 SD) is considered severely stunted. A child whose weight-for-age is more than two standard deviations below the median (-2 SD) of the reference group is classified as underweight; a child more than three standard deviations below the median (-3 SD) is considered severely underweight. Underweight is an overall indicator of a child's nutritional health.
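The cutoffs above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names and the example z-scores are hypothetical, and the z-scores are assumed to be already computed against the WHO reference population.

```python
def classify_z_score(z: float) -> str:
    """Classify an anthropometric z-score using the -2 SD and -3 SD cutoffs."""
    if z < -3.0:
        return "severe"
    if z < -2.0:
        return "moderate"
    return "normal"


def is_malnourished(z: float) -> bool:
    """Binary outcome for modelling: True when the z-score is below -2 SD."""
    return z < -2.0


# Hypothetical height-for-age z-scores for three children (illustrative only).
haz_scores = [-3.4, -2.1, -0.5]
labels = [classify_z_score(z) for z in haz_scores]
stunted = [is_malnourished(z) for z in haz_scores]
```

The same two functions apply to weight-for-age z-scores for the underweight indicator; only the input index changes.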
Hosmer and Lemeshow14 describe logistic regression for a binary dependent variable. This study treats a child's nutritional status as binary: stunted or not, and underweight or not. A brief description of the logistic model follows.
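Let $\pi(x) = E(Y \mid x)$ be the conditional mean of the dependent variable $Y$, given the explanatory variables $x = (x_1, x_2, \ldots, x_p)$.
Then the logistic regression model is
$$\pi(x) = \frac{\exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)}{1 + \exp\left(\beta_0 + \sum_{j=1}^{p} \beta_j x_j\right)} \qquad (1)$$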
where β0 and βj are the model parameters. The logistic function π(x) takes values between 0 and 1, which is a major reason for the popularity of this model. This study uses this model as the classical approach.
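As a small illustration of how the logistic function maps a linear predictor to a probability, and of how a coefficient βj translates into an odds ratio exp(βj), consider the sketch below. The coefficient values are invented for the example and are not estimates from this study.

```python
import math


def logistic_probability(beta0, betas, x):
    """Return pi(x) for intercept beta0, slopes betas and covariate vector x."""
    eta = beta0 + sum(b * xj for b, xj in zip(betas, x))
    # Numerically stable form of exp(eta) / (1 + exp(eta)).
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    return math.exp(eta) / (1.0 + math.exp(eta))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


# Hypothetical coefficients: intercept and two binary covariates.
beta0, betas = -1.0, [0.8, 0.4]

p_exposed = logistic_probability(beta0, betas, [1, 0])
p_unexposed = logistic_probability(beta0, betas, [0, 0])

# The ratio of the two odds equals exp(beta_1), the odds ratio of covariate 1.
odds_ratio = odds(p_exposed) / odds(p_unexposed)
```

Whatever the value of the linear predictor, the returned probability stays strictly between 0 and 1, which is the property noted above.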
Classification tree
Tree-based methods are simple and useful tools for grouping the covariates.15 The prediction for a test observation under such a model is the modal class of y in the region to which the observation belongs. Because the splitting rules that partition the predictor space can be drawn as a tree, this type of approach is known as a decision tree method.
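There are two steps for building a classification tree:
1. Divide the predictor space into J distinct and non-overlapping regions R1, R2, ..., RJ.
2. For every observation that falls into region Rj, predict the modal class of the training observations in that region.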
(2)
Here, $\hat{p}_{mk}$ represents the proportion of training observations in the mth region that are from the kth class in terms of intra-correlation.
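To make the class proportions concrete, the sketch below computes them for a single node, together with the node's modal class and its Gini index. The Gini index is used here only as one common impurity measure built from these proportions (it is also the measure behind the variable importance reported later); the node contents and function names are hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Proportion of observations in a (non-empty) node belonging to each class."""
    counts = Counter(labels)
    n = len(labels)
    return {k: c / n for k, c in counts.items()}


def gini_index(labels):
    """Gini impurity: sum over classes of p * (1 - p); 0 for a pure node."""
    return sum(p * (1.0 - p) for p in class_proportions(labels).values())


def modal_class(labels):
    """Most frequent class in the node, used as the node's prediction."""
    return Counter(labels).most_common(1)[0][0]


# Hypothetical node with 3 stunted and 7 normal children.
node = ["stunted"] * 3 + ["normal"] * 7

proportions = class_proportions(node)   # {'stunted': 0.3, 'normal': 0.7}
impurity = gini_index(node)             # 0.3 * 0.7 + 0.7 * 0.3
prediction = modal_class(node)
```

A split is attractive when the child nodes have lower impurity than the parent, that is, when each region is dominated by a single class.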
Random forest
At each split, random forest chooses the splitting variable from a random subset of m predictors, using an algorithm that minimizes error. Restricting the candidate variables in this way keeps the correlation between the individual trees low.
A common choice of m is $m \approx \sqrt{p}$, where $p$ is the total number of predictors in the data set. Each bootstrap sample drawn from the training data leaves out some observations. Evaluating each tree on the observations left out of its bootstrap sample and averaging the results gives an estimate of the accuracy of the bagged model, often called the out-of-bag (OOB) estimate of performance. The OOB performance measures can be regarded as cross-validation estimates. As the bagged decision trees are constructed, one can also calculate how much the error function drops for a variable at each split point.
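The size of the out-of-bag set follows from simple arithmetic: an observation is missed by all n draws of a bootstrap sample with probability (1 - 1/n)^n, which approaches 1/e, or roughly 37%. The sketch below checks this for n = 7,886 and draws one bootstrap sample. The helper names are hypothetical, and rounding the square root of p down is an assumed convention, not something stated in the paper.

```python
import math
import random


def expected_oob_fraction(n):
    """Probability that a given observation is absent from a bootstrap sample of size n."""
    return (1.0 - 1.0 / n) ** n


def bootstrap_split(n, rng):
    """Draw n indices with replacement; return (in-bag draws, out-of-bag indices)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n) if i not in drawn]
    return in_bag, out_of_bag


def default_mtry(p):
    """Square-root rule for the number of predictors tried at each split (rounded down)."""
    return max(1, math.isqrt(p))


oob_share = expected_oob_fraction(7886)       # close to exp(-1)
in_bag, out_of_bag = bootstrap_split(1000, random.Random(0))
mtry = default_mtry(16)                       # 16 predictors is an illustrative value
```

Because roughly a third of the observations are unseen by any given tree, each tree has a ready-made test set, which is what makes the OOB estimate available without a separate hold-out sample.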
Cross Validation and parameter tuning
Cross validation is a procedure for tuning model parameters, and there are several ways to perform it. In the validation set approach, the data at hand are randomly partitioned into two parts: the training set and the test set. The model is built on the training set with different combinations of values for the associated parameters and then applied to the test set. In the K-fold cross validation approach, the data are randomly divided into K partitions. For each partition, or fold, the model is trained on the remaining partitions with different parameter settings and applied to the held-out partition. Leave-one-out cross validation is the extreme case of these two approaches: the model is trained n times (n being the total number of observations in the data set), and each time one observation is left out of training and the model makes a prediction on it.
Accuracy is the proportion of correctly classified observations in the test set. Sensitivity is the proportion of true positives that are classified as positive. Specificity is the proportion of true negatives that are classified as negative. The receiver operating characteristic (ROC) curve is a graph of sensitivity against 1 - specificity. As the threshold on the predicted class probability is moved from 0.5 toward either extreme, the sensitivity and specificity of the classifier change. The area under the ROC curve (AUC) measures the discriminating ability of the classifier.
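The performance measures defined above can be computed directly from true labels and predicted probabilities. The following sketch uses a small invented test set (1 = malnourished, 0 = normal) and computes the AUC as the share of positive-negative pairs that the classifier ranks correctly, which is equivalent to the area under the ROC curve. All names and data are illustrative.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 1 (positive) and 0 (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def auc(y_true, scores):
    """Probability that a random positive scores higher than a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test set: true classes and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1]

threshold = 0.5
y_pred = [1 if s >= threshold else 0 for s in scores]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
area = auc(y_true, scores)
```

Lowering `threshold` raises sensitivity at the cost of specificity, while the AUC is unchanged because it depends only on how the scores rank the observations.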
Among children in rural areas, 38.4% were stunted and 35.3% were underweight, which is higher than in urban areas. The prevalence of stunting and underweight is highest in Sylhet division (49.9% and 40.5%, respectively) and lowest in Khulna division (28.1% and 25.9%, respectively). By wealth, the prevalence of stunting and underweight is highest among the poorest (49.5% and 46.4%, respectively). Mother's education level plays an important role, as the prevalence of stunting and underweight increases as mother's education level decreases. The percentage of malnutrition generally increases with the age of the child, and underweight mothers are more likely to have malnourished children. Mothers who watch TV or read the newspaper at least once a week have a lower percentage of malnourished children than mothers who never watch TV or read the newspaper. The chi-square tests indicate a significant association between child malnutrition and size at birth, mother's age, order of birth, type of place of residence, division, wealth index, mother's highest level of education, exposure to media, having had fever recently, having had diarrhea recently, currently breastfeeding, age of child and mother's BMI, as the p-values are very small. For sex of child, the p-values in Table 1 are larger (0.06 for stunting and 0.52 for underweight).
Covariates | Stunted (%) | P-value | Underweight (%) | P-value
Sex of child | | | |
Female | 35.8 | 0.06 | 33.5 | 0.52
Male | 37.2 | | 32.6 |
Size at birth | | | |
Very small | 45.3 | 0.00 | 51.8 | 0.00
Smaller than average | 43.6 | | 43.4 |
Average | 31.6 | | 27.9 |
Larger than average | 22.1 | | 16.9 |
Very large | 26.1 | | 19.2 |
Mother's age | | | |
15-19 | 34.3 | 0.01 | 32.0 | 0.01
20-24 | 35.8 | | 32.9 |
25-29 | 37.0 | | 31.9 |
30-34 | 37.1 | | 35.2 |
35-39 | 38.9 | | 34.6 |
40-44 | 42.7 | | 30.2 |
45-49 | 69.4 | | 60.4 |
Order of birth | | | |
1 | 32.1 | 0.00 | 29.0 | 0.00
2 | 35.9 | | 33.3 |
3 | 37.3 | | 34.2 |
4+ | 48.2 | | 41.4 |
Type of place of residence | | | |
Urban | 30.9 | 0.00 | 26.3 | 0.00
Rural | 38.4 | | 35.3 |
Division | | | |
Barisal | 40.2 | 0.00 | 36.7 | 0.00
Chittagong | 38.2 | | 36.1 |
Dhaka | 34.5 | | 28.8 |
Khulna | 28.1 | | 25.9 |
Rajshahi | 31.2 | | 32.4 |
Rangpur | 37.0 | | 38.0 |
Sylhet | 49.9 | | 40.5 |
Wealth index | | | |
Poorest | 49.5 | 0.00 | 46.4 | 0.00
Poorer | 43.2 | | 39.1 |
Middle | 36.6 | | 32.3 |
Richer | 31.5 | | 27.5 |
Richest | 19.6 | | 17.4 |
Mother's education | | | |
No education | 47.7 | 0.00 | 42.5 | 0.00
Primary | 44.2 | | 39.5 |
Secondary | 31.4 | | 28.8 |
Higher | 19.8 | | 18.2 |
Exposure to media | | | |
Not at all | 42.7 | 0.00 | 40.2 | 0.00
Less than once a week | 43.1 | | 40.3 |
At least once a week | 30.0 | | 25.5 |
Currently breastfeeding | | | |
Yes | 35.0 | 0.00 | 32.6 | 0.02
No | 38.9 | | 33.7 |
Age of child (months) | | | |
0-11 | 18.4 | 0.00 | 21.6 | 0.00
12-23 | 41.0 | | 32.8 |
24-35 | 41.4 | | 37.5 |
36-47 | 44.4 | | 35.6 |
48-59 | 39.5 | | 39.6 |
Mother's BMI | | | |
Underweight | 43.8 | 0.00 | 46.2 | 0.00
Normal | 38.3 | | 33.7 |
Overweight | 31.9 | | 24.6 |
Obese | 18.6 | | 15.7 |
Table 1 Prevalence of malnutrition among different background characteristics (in weighted percentage)
In this paper, the approach to finding the best subset of models is kept as simple and intuitive as possible, which leaves scope for further investigation and refinement of this study. A random forest model is built on the data and its parameters are tuned to minimize the out-of-bag error rate. With the number of trees set to 5000, the optimum number of predictors considered at each split is found to be 4, which is the square root of the total number of predictors. With parameters ntree = 5000 and mtry = 4, a random forest model is trained and cross-validated (10-fold) on the data set. A variable importance plot is then constructed, in which the importance of a variable is calculated as the average decrease in the Gini index resulting from splitting a node on that variable. The variables with the largest mean decrease in Gini index for predicting stunting and underweight are Division, Wealth index, Age of child, Mother's age, Mother's BMI, Order of birth, Size at birth, Mother's highest level of education, Mother's age at first marriage, Exposure to media, Sex of child, etc. These variables are used to train the classification tree.
Figure 2 Variable importance plot by mean decrease in Gini index using random forest for predicting underweight.
A classification tree is first grown using the twelve variables ranked highest by mean decrease in Gini index in the random forest model. The tree is then pruned using the root mean square error and the cross validation error. The figure below suggests the number of splits and the complexity parameter for pruning at which the root mean square error and cross validation error are minimized, and the final tree is fitted using this complexity parameter. For stunting, the classification tree predicts children whose fathers have secondary or higher education (45% of children) as normal with probability 0.74. The second split occurs among children whose fathers have no education or primary education, where children below 1 year of age (only 11% of children) are predicted as normal. Children aged 2 years or more from Dhaka, Khulna, Rajshahi or Rangpur with low birth weight are predicted as stunted with probability 0.62. Children from Barisal, Chittagong or Sylhet who belong to the poorest and poorer socio-economic groups and whose fathers have no education or primary education (12% of children) are predicted as stunted with probability 0.65. For underweight, children from the richer and richest groups (40% of children) are predicted as normal with probability 0.78. Mother's age and mother's BMI play an important role in predicting underweight children. Among poor families, children aged 2-5 years who recently suffered from fever and whose mothers are underweight are predicted as underweight with probability 0.75.
To assess the relationship between the background characteristics and each of the dependent variables, a logistic regression model is fitted. The results show that children from the poorest families have the highest risk of being malnourished: their odds of being stunted are almost 3 times, and of being underweight about 2.4 times, those of children from the richest families. Children aged 12-59 months have more than three times the odds of being stunted and more than two times the odds of being underweight compared with children aged 0-11 months. Children whose mother is underweight have 84% higher odds of being stunted and about three times the odds of being underweight compared with children of an obese mother. Male children have almost 11% higher odds of being stunted than female children.
Coefficient | Stunting OR | Stunting 95% CI | Stunting Sig. | Underweight OR | Underweight 95% CI | Underweight Sig.
Age of Mother | | | | | |
15-19 | 2.30 | 1.40-3.80 | 0.00 | 2.01 | 1.21-3.40 | 0.01
20-24 | 1.58 | 0.99-2.54 | 0.06 | 1.54 | 0.96-2.53 | 0.08
25-29 | 1.52 | 0.97-2.41 | 0.07 | 1.39 | 0.87-2.25 | 0.17
30-34 | 1.19 | 0.76-1.88 | 0.45 | 1.36 | 0.86-2.20 | 0.19
35-39 | 1.33 | 0.83-2.15 | 0.24 | 1.59 | 0.98-2.62 | 0.07
40-44 | 1.00 (ref.) | | | | |
45-49 | 2.46 | 1.05-6.04 | 0.04 | 2.38 | 1.03-5.59 | 0.04
Wealth Index | | | | | |
Poorest | 2.96 | 2.36-3.71 | 0.00 | 2.36 | 1.88-2.97 | 0.00
Poorer | 2.31 | 1.86-2.86 | 0.00 | 1.92 | 1.54-2.39 | 0.00
Middle | 2.08 | 1.71-2.53 | 0.00 | 1.70 | 1.39-2.08 | 0.00
Richer | 1.60 | 1.33-1.94 | 0.00 | 1.28 | 1.05-1.56 | 0.01
Richest | 1.00 (ref.) | | | 1.00 (ref.) | |
Mother's education | | | | | |
No education | 1.76 | 1.35-2.29 | 0.00 | 1.60 | 1.22-2.10 | 0.00
Primary | 1.55 | 1.23-1.97 | 0.00 | 1.52 | 1.19-1.95 | 0.00
Secondary | 1.16 | 0.93-1.44 | 0.19 | 1.18 | 0.94-1.48 | 0.16
Higher | 1.00 (ref.) | | | 1.00 (ref.) | |
Division | | | | | |
Barisal | 1.48 | 1.18-1.86 | 0.00 | 1.22 | 0.97-1.54 | 0.09
Chittagong | 1.65 | 1.34-2.04 | 0.00 | 1.47 | 1.19-1.82 | 0.00
Dhaka | 1.46 | 1.18-1.81 | 0.00 | 1.15 | 0.93-1.44 | 0.21
Khulna | 1.00 (ref.) | | | 1.00 (ref.) | |
Rajshahi | 1.04 | 0.82-1.30 | 0.77 | 1.12 | 0.89-1.41 | 0.32
Rangpur | 1.21 | 0.97-1.52 | 0.10 | 1.11 | 0.89-1.40 | 0.35
Sylhet | 2.18 | 1.76-2.72 | 0.00 | 1.42 | 1.14-1.77 | 0.00
Age of child (months) | | | | | |
0-11 | 1.00 (ref.) | | | 1.00 (ref.) | |
12-23 | 3.45 | 2.88-4.13 | 0.00 | 2.04 | 1.70-2.44 | 0.00
24-35 | 3.90 | 3.25-4.69 | 0.00 | 2.72 | 2.27-3.27 | 0.00
36-47 | 4.30 | 3.57-5.18 | 0.00 | 2.65 | 2.20-3.20 | 0.00
48-59 | 3.26 | 2.69-3.96 | 0.00 | 3.05 | 2.52-3.69 | 0.00
Mother's BMI | | | | | |
Underweight | 1.91 | 1.49-2.47 | 0.00 | 3.22 | 2.47-4.22 | 0.00
Normal | 1.56 | 1.24-1.99 | 0.00 | 1.84 | 1.44-2.39 | 0.00
Overweight | 1.50 | 1.18-1.91 | 0.00 | 1.43 | 1.11-1.87 | 0.01
Obese | 1.00 (ref.) | | | 1.00 (ref.) | |
Order of birth | | | | | |
1 | 1.00 (ref.) | | | 1.00 (ref.) | |
2 | 1.22 | 1.05-1.42 | 0.01 | 1.17 | 1.01-1.37 | 0.04
3 | 1.28 | 1.05-1.56 | 0.02 | 1.23 | 1.00-1.51 | 0.05
4+ | 1.50 | 1.19-1.90 | 0.00 | 1.22 | 0.96-1.54 | 0.11
Sex of child | | | | | |
Female | 1.00 (ref.) | | | 1.04 | 0.93-1.15 | 0.52
Male | 1.10 | 0.99-1.22 | 0.07 | 1.00 (ref.) | |
Table 2 Odds ratios in Multiple Logistic Regressions Assessing the Impacts of Variables
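An odds ratio and its confidence interval can be mapped back to the logistic regression coefficient and an approximate standard error. The sketch below does this for the poorest wealth group in the stunting model (OR 2.96, 95% CI 2.36-3.71). It assumes a Wald-type interval that is symmetric on the log-odds scale, which the paper does not state, and the function names are hypothetical.

```python
import math


def coefficient_from_odds_ratio(odds_ratio):
    """Logistic regression coefficient beta = ln(OR)."""
    return math.log(odds_ratio)


def standard_error_from_ci(lower, upper, z=1.96):
    """Approximate SE of beta, assuming the CI is exp(beta -/+ z * SE)."""
    return (math.log(upper) - math.log(lower)) / (2.0 * z)


def wald_interval(odds_ratio, se, z=1.96):
    """Rebuild the CI for the odds ratio from beta and its standard error."""
    beta = math.log(odds_ratio)
    return math.exp(beta - z * se), math.exp(beta + z * se)


# Poorest versus richest, stunting model (values taken from Table 2).
beta = coefficient_from_odds_ratio(2.96)
se = standard_error_from_ci(2.36, 3.71)
lower, upper = wald_interval(2.96, se)
```

An interval that excludes 1 on the odds ratio scale corresponds to a coefficient whose interval excludes 0, which is why the rows with Sig. below 0.05 are the ones whose intervals do not contain 1.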
Performance measure
To compare the predictive models, accuracy measures are computed to find the best-performing model. Model performance is also measured by the area under the curve (AUC) of the receiver operating characteristic (ROC) curve. The comparison is presented below:
Model using Random Forest
Performance measure | Stunting | Underweight
Accuracy | 70.1% | 72.4%
95% CI | (68.6%, 71.5%) | (70.9%, 73.8%)
No Information Rate | 80.1% | 85.6%
Sensitivity | 59.1% | 56.1%
Specificity | 72.8% | 75.1%
AUC | 69.8% | 70.0%
Table 3 Performance measure of random forest model
Model using Classification Tree
Performance measure | Stunting | Underweight
Accuracy | 68.7% | 70.5%
95% CI | (67.6%, 69.8%) | (69.4%, 71.6%)
No Information Rate | 63.1% | 67.3%
Sensitivity | 35.6% | 24.6%
Specificity | 88.0% | 92.6%
AUC | 67% | 65%
Table 4 Performance measure for classification tree
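The no information rate reported in the tables is the accuracy obtained by always predicting the most frequent class, so it serves as the baseline that a useful classifier has to beat. As a rough sketch, it can be computed from the share of positive cases. The prevalence figures below are the overall survey estimates quoted in this paper; the class shares in the evaluation sample may differ slightly from them, and the function names are hypothetical.

```python
def no_information_rate(positive_share):
    """Accuracy of always predicting the majority class in a binary problem."""
    return max(positive_share, 1.0 - positive_share)


def beats_baseline(accuracy, positive_share):
    """True when a model's accuracy exceeds the no information rate."""
    return accuracy > no_information_rate(positive_share)


# Overall prevalence quoted in the paper: 36.5% stunted, 33% underweight.
nir_stunting = no_information_rate(0.365)
nir_underweight = no_information_rate(0.33)

# Classification tree accuracies from Table 4 compared with these baselines.
tree_beats_stunting = beats_baseline(0.687, 0.365)
tree_beats_underweight = beats_baseline(0.705, 0.33)
```

Because most children in the sample are not malnourished, a model can reach an accuracy in the mid-sixties simply by predicting "normal" for everyone, which is why sensitivity and AUC are reported alongside accuracy.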
The demographic characteristics considered in this study, including household socioeconomic status (poorest) and parents' education (no formal education), appear to influence the prevalence of malnutrition significantly. Random forest has an accuracy of 70.1% and 72.4% for predicting stunting and underweight, respectively. The classification tree predicts 68.7% and 70.5% of children's stunting and underweight status accurately, respectively. The classification tree has higher specificity, while random forest has higher sensitivity and AUC for predicting nutritional status. This study suggests that random forest performs better than the classification tree and the multiple logistic regression model in predicting the stunting and underweight status of under-five children in Bangladesh.6,16 Several target-based and fact-finding interventions can be taken to build a healthy future generation, since today's child is tomorrow's leader. This study suggests a few pathways for future policymaking that still need to be developed.
None.
The authors do not have any conflict of interest.
The authors did not receive any funding from any institution for this study.
The authors have read and approved the final manuscript. Both authors contributed equally to performing the research, designing the research study, contributing essential reagents or tools, analyzing the data and writing the paper.
©2021 Hemo, et al. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.