Research Article Volume 5 Issue 1
Department of Environmental Engineering, Yildiz Technical University, Turkey
Correspondence: Kaan Yetilmezsoy, Professor, Department of Environmental Engineering, Faculty of Civil Engineering, Yildiz Technical University, Davutpasa Campus, 34220, Esenler, Istanbul, Turkey, Tel +90 212 383 53 76
Received: March 27, 2023 | Published: April 5, 2023
Citation: Yetilmezsoy K. Application of random forest-based decision tree approach for modeling fully developed turbulent flow in rough pipes. Fluid Mech Res Int J. 2023;5(1):4-16. DOI: 10.15406/fmrij.2023.05.00060
A random forest (RF)-based decision tree methodology was applied to model fully developed turbulent flow conditions in rough pipes. In the present computational study, a flexible RF-based soft-computing strategy was used to estimate the required pipe diameter (D) and Darcy–Weisbach friction factor (λ or f) obtained from the iterative solution of the implicit Colebrook–White equation for five basic pipeline design variables considered in sizing problems (Type 3) of pipe distribution systems. The prediction performance of the implemented RF-based model was assessed using more than 15 different statistical goodness-of-fit parameters and graphical tools such as box-and-whisker plots and spread plots. The statistical metrics corroborated the superiority of the RF-based approach in predicting both the required pipe diameter (R2 = 0.9793, MAE = 0.0287 m, RMSE = 0.0383 m, SEE = 0.0326 m, IA or WI = 0.9933, CV(RMSE) or SI = 0.0595, NSE = 0.9753, LMI = 0.8482, and AIC = -1954.6438 for the testing dataset) and friction factor (R2 = 0.9576, MAE = 0.0011, RMSE = 0.0023, SEE = 0.0018, IA or WI = 0.9851, CV(RMSE) or SI = 0.0660, NSE = 0.9478, LMI = 0.8500, and AIC = -3646.7124 for the testing dataset). The descriptive statistics showed that the 25th percentile (Q1), median (Q2), and 75th percentile (Q3) of the RF-predicted values of D and λ were very close to those of the corresponding actual values. The proposed RF-based model was also tested against additional datasets obtained from the relevant literature. The validation results indicated that the applied decision tree-based method produced realistic estimations and acceptable statistics (i.e., R2 = 0.9624, MAE = 0.0598 m, and RMSE = 0.0708 m for D values, and R2 = 0.9130, MAE = 0.0043, and RMSE = 0.0052 for λ values) even at extreme L values greater than 2000 m.
This study demonstrated the importance and ability of the applied soft-computing strategy to accurately predict D and λ values and eliminated error-prone steps of the traditional iterative approach.
Keywords: decision tree-based modeling, friction factor, pipeline design, pipe diameter, random forest, sizing problem, soft-computing, statistical analysis
Computer-aided analysis and design of pipe distribution systems have attracted a great deal of attention from hydraulic engineering researchers in recent years. Three types of problems are typically encountered in the design and analysis of piping systems, covering the use of the Moody diagram or the Colebrook–White equation:1–6 (1) Type 1 (discharge problem) where the flow rate (Q) is calculated based on known values of pipe length (L), pipe diameter (D), and pressure drop (or head loss (Δh)); (2) Type 2 (head loss problem) where the pressure drop (or head loss, Δh) is calculated based on known values of pipe length (L), pipe diameter (D), and flow rate (Q); and (3) Type 3 (sizing problem) where the pipe diameter (D) is calculated based on known values of pipe length (L), flow rate (Q), and pressure drop (or head loss (Δh)).
The Colebrook–White equation has been widely used to predict the Darcy–Weisbach friction factor λ (sometimes written as f) for turbulent fluid flow in rough pipes.7,8 Nevertheless, this equation defines the friction factor only implicitly in terms of pipe roughness and Reynolds number (Re). Because the friction factor is a function of the pipe’s relative roughness (ε/D) and Re, the typical design technique necessitates a lengthy iteration procedure even for a single-line set of pipes and even without accounting for local losses.2,4 For the Type 1 problem, for instance, this complexity is not an issue because Q can be calculated using a closed-form formulation in which the velocity V is determined using the Darcy–Weisbach equation and substituted into the right side of the Colebrook–White equation.3 To put it another way, if the minor loss (i.e., one-off losses occurring at a single point) coefficient K is equal to zero, the Darcy–Weisbach equation yields the combination λV², and so λ may be calculated. Knowing both λV² and λ gives V, which then leads to Q.9 The Type 2 problem, however, may require iteration. Type 3, on the other hand, is a dimensioning problem and typically requires additional iterative calculations and assumptions in order to achieve convergence.2
In the application of hydraulic engineering, it is impractical to explicitly compute the Re and ε/D because the D is unknown in Type 3 problems.6 The simulations are started with a hypothetical pipe diameter value and continued with a fresh one iteratively until convergence for the design issues of such pipe distribution systems. Moreover, the diagram-based technique is sensitive to reading mistakes in the logarithmic scale and is not suitable for computer-aided simulations. These factors suggest that a few attempts to use a descriptive computational technique can make a useful contribution to the practice of hydraulic engineering in designing water distribution networks. In addition to fostering a thorough understanding of a process, modeling offers the power to foresee and address issues in particular systems.10
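The trial-and-error sizing loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's code: the function names (`colebrook_rhs`, `size_pipe`), the starting guess for λ, the tolerance, and the example inputs are all hypothetical, minor losses are neglected (K = 0), and the kinematic viscosity `nu` is passed in directly (roughly 1.0e-6 m²/s for water near 20 °C) rather than computed from temperature.

```python
import math

G = 9.807  # gravitational acceleration in m/s^2 (value stated in the study)

def colebrook_rhs(lam, rel_rough, reynolds):
    """Right-hand side of the Colebrook-White equation, i.e. 1/sqrt(lambda)."""
    return -2.0 * math.log10(rel_rough / 3.7 + 2.51 / (reynolds * math.sqrt(lam)))

def diameter_from_lambda(lam, length, q, dh):
    """Darcy-Weisbach plus continuity solved for D (K = 0 assumed)."""
    return (8.0 * lam * length * q ** 2 / (math.pi ** 2 * G * dh)) ** 0.2

def size_pipe(eps_mm, nu, length, q, dh, tol=1e-12, max_iter=200):
    """Type 3 sizing: iterate on lambda until D and lambda are consistent."""
    eps = eps_mm / 1000.0          # absolute roughness, mm -> m
    lam = 0.02                     # hypothetical starting guess
    for _ in range(max_iter):
        d = diameter_from_lambda(lam, length, q, dh)
        re = 4.0 * q / (math.pi * nu * d)          # Reynolds number
        lam_new = colebrook_rhs(lam, eps / d, re) ** -2
        if abs(lam_new - lam) < tol:
            return diameter_from_lambda(lam_new, length, q, dh), lam_new
        lam = lam_new
    raise RuntimeError("Colebrook-White iteration did not converge")

# Illustrative inputs chosen inside the ranges used in the study
d, lam = size_pipe(eps_mm=1.0, nu=1.0e-6, length=1000.0, q=1.0, dh=20.0)
print(f"D = {d:.4f} m, lambda = {lam:.5f}")
```

Each pass updates D from the current λ and then refreshes λ from the Colebrook–White relation, so the loop mirrors the "guess, compute, repeat" procedure that the data-driven model is meant to replace.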
Mathematical modeling and computer-aided simulation are also useful and effective approaches to analyze the system performance under complicated and stable situations, as well as to test the present system under different scenarios.11–13 In the previous research, many data-driven modeling attempts have been made to model various pipelines. For instance, Özger and Yıldırım14 proposed an adaptive neuro-fuzzy (ANFIS) computing technique to determine the friction coefficient in pipe networks. They tested the performance of the ANFIS-based approach for the commonly used explicit models for the Colebrook-White for a wide range of ε/D (relative roughness) and Reynolds number (Re) values. In another work, Lin et al.15 developed an integrated method to predict two-phase flow patterns in upward inclined pipes via deep learning neural networks. Additionally, Alhashem and Aramco16 conducted supervised machine learning as a proof-of-concept in predicting multiphase flow regimes in horizontal pipes. In a different study from Vietnam, Moayedi et al.17 implemented four machine learning methods (i.e., multilayer perceptron (MLP), M5Rules (M5R), decision table (DT), and trees M5P (TM5P)) for predicting the pressure drop reduction in crude oil pipelines. In a Japanese study conducted by Kobayashi et al.,18 prediction of drag reduction effect by pulsating turbulent flow was investigated based on machine learning models such as MLP model and long short-term memory (LSTM) model. In another investigation from Canada, Milukow et al.19 applied gene expression programming (GEP) and extreme learning machines (ELM) for the estimation of the Darcy-Weisbach friction factor for ungauged streams. Moreover, Sattar20 proposed a GEP-based approach to develop new empirical formulas for the prediction of longitudinal dispersion coefficients in pipe flow. 
Najafzadeh et al.21 developed a model tree (MT) to present formulations for evaluation of friction factor in pipes and compared their results with those obtained from GEP, evolutionary paradigm regression (EPR), and conventional models. In another study, Bardestani et al.22 used ANFIS and grid partition method for predicting turbulent flow friction coefficient. Furthermore, Srivastava et al.23 used artificial neural network (ANN) approach to determine the friction factor for turbulent flows of water in a pipe of uniform circular cross-section.
To the best of the author’s knowledge, there is still a specific literature gap in terms of the application of different soft-computing techniques for estimating the primary parameters in sizing problems (Type 3) of pipe distribution systems, even though the aforementioned investigations have made significant contributions to the field. Previous studies have not yet directly addressed the random forest (RF)-based decision tree approach for modeling D and λ within the specified ranges of the main pipeline design variables, such as absolute roughness of the pipe wall (ε), water temperature (T), pipe length (L), flow rate (Q), and head loss (Δh), in the same study. In the traditional approach, D-dependent functions of λ and Re have to be established (e.g., fifth-order and inverse dependence, respectively) to determine the required pipe diameter. In addition, Re requires the temperature-dependent kinematic viscosity (ν). The Colebrook–White equation, on the other hand, combines all D-dependent functions, and D is determined iteratively. Similarly, after determining the diameter D, the friction factor is calculated using the Darcy–Weisbach and continuity equations. All of these are very time-consuming and computationally error-prone processes. However, in the present analysis, D and λ for the specific variables (including ε, T, L, Q, and Δh) were estimated for the first time within the scope of the same study using the RF-based technique. Moreover, the kinematic viscosity and a series of iterative calculation steps were eliminated as a result of the suggested soft-computing approach, and a sophisticated predictive modeling scheme was conducted with high precision. Furthermore, the relevant literature has not yet produced a particular RF-based prediction study for the same model variables and working limits as the one developed within the context of the current study.
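The D-dependent functions mentioned above can be written out explicitly. The relations below are the standard textbook forms with minor losses neglected (K = 0); they are given as a sketch of the traditional route, on the assumption that they match the forms used in the study.

```latex
% Darcy-Weisbach and continuity, combined for a circular pipe
\Delta h = \lambda \,\frac{L}{D}\,\frac{V^{2}}{2g},
\qquad
V = \frac{4Q}{\pi D^{2}}
\;\;\Longrightarrow\;\;
\lambda = \frac{\pi^{2} g\,\Delta h}{8\,L\,Q^{2}}\; D^{5},
\qquad
\mathrm{Re} = \frac{V D}{\nu} = \frac{4Q}{\pi\,\nu\,D}

% Colebrook-White, into which both expressions are substituted
\frac{1}{\sqrt{\lambda}} = -2\,\log_{10}\!\left(\frac{\varepsilon/D}{3.7}
  + \frac{2.51}{\mathrm{Re}\,\sqrt{\lambda}}\right)
```

With λ proportional to D⁵ and Re proportional to 1/D, the Colebrook–White equation becomes a single implicit equation in D, which is why the diameter has to be found iteratively.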
As a consequence, the following objectives have been developed for the current study in order to contribute to addressing the abovementioned gap in this sense: (1) generation of a sufficient amount of fully developed turbulent flow data (including ε, T, L, Q, and Δh) from the iterative solution of the implicit Colebrook–White equation for rough pipes; (2) prediction of D and λ values in the same study utilizing a flexible random forest (RF)-based decision tree technique for sizing problems (Type 3); (3) evaluation of the prediction performance of the established RF-based model using more than 15 different statistical performance evaluations (i.e., R2, MAE, RMSE, RMSES, RMSEU, SEE, IA (WI), FV, FA2, CV(RMSE) (SI or NRMSE), NSE, LMI, MFB, MFE, AIC, and U95), box-and-whisker-plots, spread plots, and illustrative/tabulated presentations for the D and λ datasets; (4) validation of the RF-based model’s performance with various turbulent flow data from the open literature; and (5) demonstration of the versatility and adaptability of the implemented soft-computing method for an implicit and trial-and-error type hydraulic engineering problem.
The widely used ensemble machine learning approach known as random forest (RF) creates a structured collection of tree predictors from input vectors by using random vector samples.24,25 It has been shown to be highly effective as a general-purpose classification and regression tool. With a hit-or-miss approach to the procedure, the variables are chosen using the optimal split. The RF method gathers a number of random trees to create random forests. The RF-based approach combines bagging (also known as bootstrap aggregation) with the random subspace method and works by merging weak trees to obtain a final result via majority vote (for classification) or by averaging the tree outputs (for regression). When selecting how to split the forest trees, both the number of decision trees to be formed and the number of features to be analyzed to discover the best split must be considered. Owing to the relative efficacy of the RF classifier and its resistance to over-fitting, the number of decision trees can be as large as possible. About two-thirds of the training data are used to grow each tree. The out-of-bag (OOB) samples, which make up the remaining third of the training samples, can be used to measure performance. As a consequence, the random forest regression is made up of k trees, where k is the desired number of trees to be produced and can be any value specified by the user. The CART (classification and regression trees) approach is used to grow all of the decision trees in the forest with no pruning. By incorporating numerous criteria, random forest regression allows the tree to grow to the depth of all new training data. A random collection of parameters is chosen as the training set, and a Gini index is utilized to analyze the degree of impurity in the parameters in contrast to the result when generating specific trees.26 The training dataset becomes of paramount significance when a single tree splits on just one criterion.
Small modifications to the dataset and splitting criterion may result in a range of tree topologies, leading to various interpretations.24,25 As a result, RF models rank variables based on their importance in order to produce the optimal RF model.
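The two-thirds / one-third proportion quoted for in-bag and out-of-bag samples follows from sampling with replacement: the chance that a given instance is never drawn in n draws is (1 − 1/n)^n, which approaches e⁻¹ ≈ 0.368. The short simulation below illustrates this; the function name, the seed, and the number of repetitions are arbitrary choices made for the illustration.

```python
import random

def bootstrap_oob_fraction(n_instances, rng):
    """Draw one bootstrap sample of size n and return the share left out of it."""
    drawn = {rng.randrange(n_instances) for _ in range(n_instances)}
    return 1.0 - len(drawn) / n_instances

rng = random.Random(1)       # fixed seed so the run is repeatable
n_train = 700                # size of the training sets used in the study
fractions = [bootstrap_oob_fraction(n_train, rng) for _ in range(100)]
mean_oob = sum(fractions) / len(fractions)

print(f"mean out-of-bag share over 100 bootstrap samples: {mean_oob:.3f}")
print(f"theoretical limit (1 - 1/n)^n: {(1 - 1 / n_train) ** n_train:.3f}")
```

Because every tree sees a different bootstrap sample, each instance is out-of-bag for roughly a third of the trees, which is what allows OOB error to serve as an internal performance estimate.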
In this study, which was carried out as part of an integrated modeling research, two multiple inputs single output (MISO)-type and RF-based models were established based on a trial-and-error process to evaluate their prediction performances on the required pipe diameter (D) and Darcy–Weisbach friction factor (λ or f) for fully developed turbulent flow conditions in rough pipes. It is noted that the integrated modeling research explores the best-performing data-driven models (e.g., genetic/non-parametric regression/decision tree/kernel/multilayer perceptron/fuzzy logic-based data-intelligent approaches, and so forth) for the estimation of different hydraulic output parameters (e.g., D, f, Re) of the Type-3 problems of pipe distribution systems. Nevertheless, deciding which soft-computing model would be utilized to estimate which output in such scenarios necessitates a thorough optimization research. Therefore, other results (e.g., benchmarking with other state-of-art models) associated with the above-mentioned integrated modeling study will be presented in future studies. Figure 1 shows an original flow network diagram of the proposed RF-based modeling approach applied to estimate D and λ values for the rough flow regime.
The purpose of this computer-based investigation was to demonstrate the applicability and usefulness of an RF-based soft-computing technique for estimating both the required pipe diameter (D) and Darcy–Weisbach friction factor (λ) in the sizing problems (Type 3) of pipe distribution systems. In the current work, the implicit Colebrook–White equation was solved using the conventional iteration method for a variety of five significant design parameters, yielding a sufficient number of D and λ datasets (n = 1000 for each of D and λ). The methodology for obtaining fully developed turbulent flow data is in agreement with other studies in the literature.3,6,22
Absolute roughness of the pipe wall (X1: ε [=] mm), water temperature (X2: T [=] °C), pipe length (X3: L [=] m), flow rate (X4: Q [=] m3/s), and head loss (X5: Δh [=] m) were considered as the input variables, whereas the required pipe diameter (Y1: D [=] m) and Darcy–Weisbach friction factor (Y2: λ or f [=] dimensionless) were considered as the output variables. The actual D and λ values were determined from the Colebrook–White equation by simulating the variables indicated above within their working limits. In the present computational study, a flexible RF-based soft-computing strategy was applied for the estimation of D and λ (or f) for the following ranges of five basic pipeline design variables (upper and lower ranges are rounded for simplicity in tracking variable bounds): ε = 0.01–10 mm, T = 5–30 °C, L = 30–2000 m, Q = 0.001–3 m3/s, and Δh = 1–90 m. In line with the literature,27–30 70% of each dataset was used for the model construction (training stage), and 30% of each dataset was utilized for the testing stage.
Table 1 summarizes the detailed descriptive statistics of the simulated variables used in two RF-based and multiple-input single-output (MISO)-type soft-computing models. Because the normalizing technique was not used in this investigation, the inputs contain real units, as seen in Table 1. Many studies analyzed the efficacy of computational analysis utilizing actual (or real) data unit-based and normalized data-based conclusions. Depending on the features of the datasets employed, trials in the current investigation (not provided here due to space constraints) revealed that actual unit-based data outperformed conclusions based on normalized data. This result was shown to be compatible with other earlier soft-computing investigations.27,31,32
| Statistics | ε (mm) | T (°C) | L (m) | Q (m3/s) | Δh (m) | D (m) | λ or f |
|---|---|---|---|---|---|---|---|
| Valid data (n) | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 | 1000 |
| Mean | 4.8508 | 17.8046 | 1001.2852 | 1.5308 | 45.3878 | 0.6552 | 0.0341 |
| Standard deviation | 2.9294 | 7.0992 | 573.6267 | 0.8713 | 25.6509 | 0.2582 | 0.0099 |
| Variance coefficient | 0.6039 | 0.3987 | 0.5729 | 0.5692 | 0.5652 | 0.3941 | 0.2908 |
| Standard error of mean | 0.0926 | 0.2245 | 18.1397 | 0.0276 | 0.8112 | 0.0082 | 0.0003 |
| Upper 95% CL of mean | 5.0326 | 18.2451 | 1036.8815 | 1.5849 | 46.9796 | 0.6712 | 0.0347 |
| Lower 95% CL of mean | 4.6690 | 17.3640 | 965.6890 | 1.4767 | 43.7961 | 0.6391 | 0.0335 |
| Geometric mean | 3.4455 | 16.1604 | 765.8298 | 1.1327 | 34.4332 | 0.5987 | 0.0326 |
| Skewness | 0.0498 | -0.0502 | 0.0285 | -0.0378 | -0.0213 | 0.5545 | 0.6022 |
| Kurtosis | 1.7815 | 1.8594 | 1.7762 | 1.7990 | 1.8116 | 4.0312 | 4.8435 |
| Maximum | 9.9859 | 29.9734 | 1997.2290 | 2.9990 | 89.9510 | 1.7738 | 0.0840 |
| Upper quartile | 7.4027 | 23.9177 | 1498.9119 | 2.2793 | 67.4486 | 0.8107 | 0.0399 |
| Median | 4.8152 | 17.8582 | 987.7940 | 1.5468 | 46.0126 | 0.6456 | 0.0340 |
| Lower quartile | 2.2525 | 11.7243 | 504.5725 | 0.7707 | 23.6498 | 0.4814 | 0.0280 |
| Minimum | 0.0130 | 5.0018 | 30.7203 | 0.0018 | 1.0066 | 0.0541 | 0.0104 |
| Range | 9.9729 | 24.9717 | 1966.5087 | 2.9972 | 88.9444 | 1.7197 | 0.0736 |
| Centile 95 | 9.4660 | 28.7151 | 1902.2067 | 2.8732 | 84.9469 | 1.0710 | 0.0506 |
| Centile 5 | 0.3736 | 6.3438 | 124.1608 | 0.1320 | 5.4173 | 0.2476 | 0.0184 |
Table 1 Detailed descriptive statistics of simulated variables used in RF-based modeling
The skewness values showed that the absolute roughness of the pipe wall (ε) and pipe length (L) datasets were weakly skewed right, while the water temperature (T), flow rate (Q), and head loss (Δh) datasets were weakly skewed left (“-” sign means left-skewed or left-tailed and “+” sign means right-skewed or right-tailed) (Table 1). On the other hand, both the required pipe diameter (D) and Darcy–Weisbach friction factor (λ) datasets had moderately right-skewed distributions for the numerical outputs generated from the iterative solution of the implicit Colebrook–White equation. In addition, the kurtosis values indicated that all input attributes (ε, T, L, Q, and Δh) had platykurtic distributions (i.e., kurtosis < 3), whereas both output attributes (D and λ) were leptokurtic (i.e., kurtosis > 3). Scatter plots of the response (or dependent) variables as a function of each explanatory (or independent) variable are illustrated in Figures 2 and 3. In accordance with prior MISO-type data-intelligent investigations,30,33,34 all predictors demonstrated a distinct relevance according to the strength of different types of clusters in particular intervals, indicating that they should not be excluded from the RF-based model.
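The skewness and kurtosis classification used above can be reproduced with plain moment ratios. The sketch below uses the simple population-moment definitions (kurtosis of 3 for a normal distribution); the estimator used by the statistics package in the study may include small-sample corrections, so this is an assumption. The uniform test sample is purely illustrative: a uniform distribution has a theoretical kurtosis of 1.8, which is close to the values reported for the five inputs, although the source does not state how the inputs were sampled.

```python
import random

def skewness_and_kurtosis(values):
    """Moment-based skewness (m3 / m2^1.5) and kurtosis (m4 / m2^2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2

def tail_class(kurtosis):
    """Label a distribution relative to the normal reference value of 3."""
    if kurtosis < 3.0:
        return "platykurtic"
    if kurtosis > 3.0:
        return "leptokurtic"
    return "mesokurtic"

# Illustrative sample: 1000 uniform draws over the pipe-length range
rng = random.Random(42)
sample = [rng.uniform(30.0, 2000.0) for _ in range(1000)]
skew, kurt = skewness_and_kurtosis(sample)
print(f"skewness = {skew:.4f}, kurtosis = {kurt:.4f} ({tail_class(kurt)})")
```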
In the present analysis, the RF-based soft-computing model was established within the numerical computing environment of WEKA 3.9.6 (Waikato Environment for Knowledge Analysis) software (The University of Waikato, Hamilton, New Zealand, https://www.cs.waikato.ac.nz/ml/weka/). In order to assess the efficacy and usefulness of the RF-based model using D and λ datasets produced under fully developed turbulent flow conditions in rough pipes, WEKA Explorer was used as a potent data mining tool. Because learning algorithms may be sensitive to the sequence in which the data are acquired, randomization (also known as data shuffling) is a common technique to overcome this issue. For this reason, the randomization procedure was employed using WEKA’s “randomize” filter (package weka.filters.unsupervised.instance.randomize) before dividing the original datasets (diameter_ALL.arff and friction_ALL.arff), each with 5 inputs and 1 output, into the training and testing datasets. It is noted that the xlsx data files (Microsoft® Excel® standard format file type) were converted to csv (comma-separated values) and txt (a standard text document (e.g., Microsoft® Notepad) that contains plain text) files, respectively, and finally converted to arff (attribute-relation file format) files for reading datasets in WEKA.
To guarantee uniformity and repeatability, the D and λ datasets were shuffled using a random seed value of 42, which is in line with prior studies.35–37 The full randomized datasets (diameter_ALL_random.arff and friction_ALL_random.arff) were then separated into training and testing datasets (herein abbreviated as TRA and TES, respectively) using the “remove percentage” filter option in WEKA (located in package weka.filters.unsupervised.instance.removepercentage). Following that, these datasets were saved as diameter_random_TRA.arff and friction_random_TRA.arff for the training stages and diameter_random_TES.arff and friction_random_TES.arff for the testing stages of the computational analysis. The block diagram/working process of the WEKA Java-based open-source machine learning software (released under the GNU General Public License (GNU GPL)) can be observed in a recent MLP-based study undertaken by Sharma et al.38
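The shuffle-then-split step can be illustrated outside WEKA as follows. This is only a sketch of the idea in plain Python: the function name is hypothetical, and Python's generator seeded with 42 will not reproduce the instance order produced by WEKA's filter with the same seed.

```python
import random

def shuffle_and_split(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows with a fixed seed, then cut it into two parts."""
    rng = random.Random(seed)
    shuffled = list(rows)          # leave the caller's data untouched
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 1000 simulated instances (row indices only)
rows = list(range(1000))
train, test = shuffle_and_split(rows)
print(len(train), len(test))
```

Fixing the seed makes the partition repeatable, so the same 700 training and 300 testing instances are obtained on every run.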
A statistical and visualization software package (StatsDirect V2.7.2, Copyright© 1990–2008, StatsDirect Ltd, Altrincham, Cheshire, UK) was employed to compute the descriptive statistics (see Table 2) of the RF-based model’s variables (inputs: ε, T, L, Q, Δh, and outputs: D, λ) for both the training and testing datasets. StatsDirect software package was also used to create scatter plots of the predictor variables, box-and-whisker plots, and spread plots. SigmaPlot® (V10.0.0.54, Copyright© 2006, Systat Software, Inc., GmbH, Germany) software and Microsoft® Excel® 2010 were implemented to develop linear correlation graphs of the applied RF-based model for the training and testing stages.

| Statistics | D-RF (TRA) | D-RF (TES) | λ-RF (TRA) | λ-RF (TES) |
|---|---|---|---|---|
| R2 | 0.9969 | 0.9793 | 0.9926 | 0.9576 |
| MAE | 0.0117 | 0.0287 | 0.0005 | 0.0011 |
| RMSE | 0.0168 | 0.0383 | 0.0009 | 0.0023 |
| RMSES | 0.0091 | 0.0203 | 0.0004 | 0.0014 |
| RMSEU | 0.0141 | 0.0325 | 0.0008 | 0.0018 |
| SEE | 0.0141 | 0.0326 | 0.0008 | 0.0018 |
| IA | 0.9990 | 0.9933 | 0.9978 | 0.9851 |
| FV | 0.0334 | 0.0753 | 0.0370 | 0.1254 |
| FA2 | 0.9936 | 0.9816 | 0.9975 | 1.0011 |
| CV(RMSE) | 0.0254 | 0.0595 | 0.0268 | 0.0660 |
| NSE | 0.9960 | 0.9753 | 0.9916 | 0.9478 |
| LMI | 0.9429 | 0.8482 | 0.9386 | 0.8500 |
| MFB (%) | 0.7324 | 2.2055 | 0.2816 | -0.0006 |
| MFE (%) | 2.1840 | 5.6432 | 1.4082 | 2.8488 |
| AIC | -5722.7355 | -1954.6438 | -9805.3383 | -3646.7124 |
| U95 | 0.0196 | 0.0280 | 0.0007 | 0.0012 |
Table 2 Performance evaluation of the implemented RF-based model in terms of various quantitative statistics (the unit of MAE, RMSE, RMSES, RMSEU, SEE, and U95 is meter (m) for the D dataset)
In this computational study, various distinct statistical performance metrics (i.e., R2, b (slope), a (intercept), MAE, RMSE, RMSES, RMSEU, SEE, IA (WI), FV, FA2, CV(RMSE) (SI or NRMSE), NSE, LMI, MFB, MFE, AIC, and U95, and so forth) were computed by executing a new solution script (statistics.m) written in the M-file Editor within the framework of MATLAB® R2018a software (V9.4.0.813654, 64-bit (win64), Academic License Number: 40578168, MathWorks Inc., Natick, MA) running under Windows 10 system on the same PC platform. Full descriptions and formulations of the respective evaluators are presented in the following section.
The following assumptions were employed in this soft-computing investigation on fully developed turbulent flow conditions in sizing problems (Type 3):
(1) Pipe was considered to be totally filled with water in the internal flow,
(2) Piping system as a whole has a constant diameter (D), and the total head loss (Δh) is calculated from,
(3) Minor (local) losses’ impact was not taken into account (K = 0) (this will be covered in the upcoming research), and hence the Darcy–Weisbach equation yielded the Δh as a function of λ, D, V, g, where g = 9.807 m/s2,
(4) According to the continuity equation, Q was considered as constant and obtained by Q = VA = V(πD²/4),
(5) Because commercial pipes are only manufactured with particular standard sizes in practice, the next diameter available will be selected based on the computed or predicted value of D.
(6) The importance of each independent variable was assumed to be equal, and no special safety measures were taken when building the model to prevent any knowledge bias,
(7) Yetilmezsoy’s empirical formula6,13,39,40 was used to calculate the kinematic viscosity (ν [=] m2/s) as a function of temperature (valid for T = 0–100 °C),
(8) Yetilmezsoy’s fifth order nonlinear regression-based equation6,13,39 was used to calculate the specific weight (γ [=] kgf/m3) as a function of temperature (valid for T = -20–100 °C).
As part of the current computational analysis, numerous significant statistics, such as slope of the best-fit line (b), intercept (a), determination coefficient (R2), mean absolute error (MAE), root mean squared error (RMSE), systematic and unsystematic RMSE (RMSES and RMSEU, respectively), standard error of the estimate (SEE), index of agreement (IA, also known as Willmott’s Index (WI)), fractional variance (FV), the factor of two (FA2), coefficient of variation of RMSE (CV(RMSE), also known as scattering index (SI) or normalized root mean squared error (NRMSE)), Nash–Sutcliffe efficiency (NSE), Legates and McCabe’s index (LMI), mean fractional bias (MFB), mean fractional error (MFE), Akaike information criterion (AIC) (named after the Japanese statistician Hirotsugu Akaike), and expanded uncertainty with 95% confidence level (U95) were calculated to measure the agreement and make comparisons between the observed values and predictions of the RF-based technique for the training and testing datasets. The mathematical formulations of the computed performance metrics are provided in Equations (1) to (21). In these expressions, the letters O, P, m, n, reg, and i denote the observed, predicted, mean, number of data points (in both training and testing datasets), regression, and index of data points, respectively. In Equations (10)–(13), the Greek letter σ refers to the standard deviation. In Equations (14) and (15), RSE and RAE are the abbreviations of the relative squared error and the relative absolute error, respectively. In Equation (18), “ln” is the natural logarithm, and ke is the number of parameters being estimated.
Comprehensive explanations of these measurements (which are not included here owing to space constraints) may be found in prior research including soft-computing-based modeling of the flow rate of dry part in the wet gas mixture,29 approximation of the discharge coefficient of differential pressure flowmeters,28 modeling of the lateral confinement coefficient for carbon fiber reinforced polymer (CFRP)-confined rectangular/square reinforced concrete columns,27 group method of data handling (GMDH)-extreme learning machine (ELM)-based prediction of longitudinal dispersion coefficients in water pipelines,41 weather research and forecasting (WRF)-community multiscale air quality (CMAQ)-based modeling of meteorological parameters and PM2.5 concentrations,42 performance evaluation of solar radiation computing models,43 assessment of the precision of mathematical models,44 resistant MAPE (R-MAPE)-based statistical assessment of prediction accuracy,45 intercomparison of wind speed probability distribution models,46 prediction of daily global solar radiation from sunshine duration,47 estimation of discharge capacity of sharp-crested weirs,48 and empirical modeling of pipe-sizing problems.6
[Equations (1)–(21): mathematical formulations of the performance metrics listed above]
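Several of the metrics named above have widely used textbook forms, which can be sketched in a few lines. The definitions below (MAE, RMSE, CV(RMSE), NSE, Willmott's IA, Legates and McCabe's LMI, and R2 as the squared Pearson correlation) are the common forms and are assumed, not confirmed, to match the study's own equations; the function names and the four-point sample are hypothetical.

```python
import math

def _mean(values):
    return sum(values) / len(values)

def mae(obs, pred):
    """Mean absolute error."""
    return _mean([abs(o - p) for o, p in zip(obs, pred)])

def rmse(obs, pred):
    """Root mean squared error."""
    return math.sqrt(_mean([(o - p) ** 2 for o, p in zip(obs, pred)]))

def cv_rmse(obs, pred):
    """RMSE normalized by the mean of the observations (SI or NRMSE)."""
    return rmse(obs, pred) / _mean(obs)

def nse(obs, pred):
    """Nash-Sutcliffe efficiency."""
    om = _mean(obs)
    return 1.0 - sum((o - p) ** 2 for o, p in zip(obs, pred)) / sum((o - om) ** 2 for o in obs)

def index_of_agreement(obs, pred):
    """Willmott's index of agreement."""
    om = _mean(obs)
    num = sum((o - p) ** 2 for o, p in zip(obs, pred))
    den = sum((abs(p - om) + abs(o - om)) ** 2 for o, p in zip(obs, pred))
    return 1.0 - num / den

def lmi(obs, pred):
    """Legates and McCabe's index."""
    om = _mean(obs)
    return 1.0 - sum(abs(o - p) for o, p in zip(obs, pred)) / sum(abs(o - om) for o in obs)

def r_squared(obs, pred):
    """Squared Pearson correlation between observations and predictions."""
    om, pm = _mean(obs), _mean(pred)
    cov = sum((o - om) * (p - pm) for o, p in zip(obs, pred))
    return cov ** 2 / (sum((o - om) ** 2 for o in obs) * sum((p - pm) ** 2 for p in pred))

# Tiny made-up sample, only to show the calls
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
for name, fn in [("MAE", mae), ("RMSE", rmse), ("CV(RMSE)", cv_rmse), ("NSE", nse),
                 ("IA", index_of_agreement), ("LMI", lmi), ("R2", r_squared)]:
    print(f"{name} = {fn(observed, predicted):.4f}")
```

NSE, IA, and LMI all equal 1 for a perfect prediction; LMI uses absolute rather than squared deviations, which is why it is typically lower than NSE for the same residuals, as in Table 2.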
In order to get the optimum user-defined parameter values, decision tree-based model implementation takes some trial and error.29 As a result, in order to obtain better model prediction or minimize error, the values for each parameter in the RF model must be effectively adjusted.49 The user-defined parameter values were optimized for the following parameters in the current study using a number of RF-based model trials: (a) bag size percent (size of each bag as a percentage of the training set size) = 100, (b) batch size (preferred number of instances to process if batch prediction is being performed) = 100, (c) the maximum depth of tree = 0 (0 is used for unlimited), (d) number of execution slots (number of threads to use for constructing the ensemble) = 1, (e) number of features (number of randomly chosen attributes) = 0 (if 0, the function int(log_2(#predictors) + 1) is used per split in each tree), (f) number of iterations (number of trees in the RF) = 100, and (g) random number seed to be used = 1. These settings are consistent with the values obtained in earlier decision tree-based modeling research.49,50
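With the number of features left at 0, the rule quoted above fixes how many randomly chosen attributes are examined at each split. A short worked example for the five predictors used here is given below; the dictionary keys are descriptive labels chosen for this illustration and are not WEKA option names.

```python
import math

def features_per_split(n_predictors):
    """Default rule quoted in the text: int(log2(#predictors) + 1)."""
    return int(math.log2(n_predictors) + 1)

# Settings as reported in the text (key names are illustrative only)
rf_settings = {
    "bag_size_percent": 100,
    "batch_size": 100,
    "max_depth": 0,            # 0 means unlimited depth
    "execution_slots": 1,
    "num_features": 0,         # 0 triggers the default rule above
    "num_trees": 100,
    "seed": 1,
}

n_inputs = 5  # epsilon, T, L, Q, delta-h
print(features_per_split(n_inputs))  # log2(5) is about 2.32, so 3 attributes per split
```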
The elapsed time during the computational analysis is one of WEKA’s output parameters. For the present pipe diameter (D) dataset, the time records for the building, training, and testing of the RF-based model were 0.28 seconds for 700 instances, 0.31 seconds for 700 instances, and 0.14 seconds for 300 instances, respectively. At the end of the analysis carried out in WEKA, RF-based predictions on the training set of D (n = 700) produced a correlation coefficient (R) of 0.9985, a mean absolute error (MAE) of 0.0117 m, and a root mean squared error (RMSE) of 0.0168 m, while the R, MAE, and RMSE values for the testing set of D (n = 300) were computed as 0.9896, 0.0287 m, and 0.0383 m, respectively. Likewise, for the current Darcy–Weisbach friction factor (λ) dataset, the time records for the building, training, and testing of the RF-based model were 0.17 seconds for 700 instances, 0.28 seconds for 700 instances, and 0.11 seconds for 300 instances, respectively. The computational results showed that the RF-based estimations on the training set of λ (n = 700) yielded an R value of 0.9971, an MAE of 0.0005, and an RMSE of 0.0009, while the R, MAE, and RMSE values for the testing set of λ (n = 300) were determined as 0.9789, 0.0011, and 0.0023, respectively.
As seen from the statistics summarized in Table 2, R2 values were determined as 0.9793 and 0.9576 for the testing sets of D and λ, revealing that the RF-based approach satisfactorily predicted the expected responses (D and λ) with small deviations for each subset. The R2 values indicated that unexplained variations were only 2.07% and 4.24% of all the variations in the prediction of the pipe diameter and Darcy–Weisbach friction factor, respectively. The calculated IA (0.9990 and 0.9933) and FA2 (0.9936 and 0.9816) values (for the training and testing datasets of D, respectively) were very close to 1, implying that very satisfactory agreements were achieved between the actual and RF-predicted D values. In addition, the IA (0.9978 and 0.9851) and FA2 (0.9975 and 1.0011) values (for the training and testing datasets of λ, respectively) corroborated that acceptable agreements were attained between the actual and RF-predicted λ values.
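The unexplained-variation percentages quoted above are simply 1 − R2 expressed in percent, and the testing-stage R2 for D follows from squaring the correlation coefficient reported by WEKA. A brief arithmetic check is sketched below, with a hypothetical helper name.

```python
def unexplained_percent(r_squared):
    """Share of the variation not explained by the model, in percent."""
    return (1.0 - r_squared) * 100.0

r_test_d = 0.9896                       # correlation coefficient, testing set of D
r2_test_d = round(r_test_d ** 2, 4)     # squared correlation, rounded to 4 decimals

print(r2_test_d)
print(round(unexplained_percent(0.9793), 2))   # pipe diameter, testing set
print(round(unexplained_percent(0.9576), 2))   # friction factor, testing set
```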
The low CV(RMSE) values ((a) 0.0254 and 0.0595 for the training and testing datasets of D, respectively; and (b) 0.0268 and 0.0660 for the training and testing datasets of λ, respectively) indicated a high degree of precision and reliability for the proposed RF-based method. Moreover, the AIC values of the RF-based model were fairly low in all subsets, indicating the accuracy of the RF-based decision tree strategy applied to estimate the D and λ values. Other descriptive performance metrics, such as MAE, RMSE (including its systematic and unsystematic components), FV, MFB, MFE, and U95, also revealed that the proposed soft-computing model produced very small residuals and uncertainty and demonstrated a noticeable predictive performance in estimating the required pipe diameter (D) and Darcy–Weisbach friction factor (λ).
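The CV(RMSE) values can be cross-checked with a short calculation, assuming that CV(RMSE) is the RMSE divided by the mean of the actual values (the usual definition; the paper's exact formula is not restated in this section). The RMSE values are those reported above and the means are the testing-set means of the actual data in Tables 3 and 4; the function name is illustrative.

```python
def cv_rmse(rmse, mean_actual):
    """Coefficient of variation of the RMSE (assumed: RMSE / mean of actual values)."""
    return rmse / mean_actual

# (reported RMSE, mean of actual values) for the testing sets.
reported = {
    "D (testing)": (0.0383, 0.6440),
    "lambda (testing)": (0.0023, 0.0346),
}

for label, (rmse_value, mean_actual) in reported.items():
    print(f"{label}: CV(RMSE) ~ {cv_rmse(rmse_value, mean_actual):.4f}")
```

Because the inputs are rounded to four decimals, the recomputed ratios agree with the reported 0.0595 and 0.0660 only to within rounding.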
Figures 4 and 5 show the linear correlation between the actual and RF-predicted values of D and λ, respectively, for both the training and testing phases. As seen from Figure 4, the D values predicted by the RF-based approach fall within the ±10% error band during the training stage and within the ±22% error band during the testing stage. Similarly, Figure 5 shows that the λ values estimated by the RF-based model fall within the ±15% error band during the training stage and within the ±25% error band during the testing stage.
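A hedged sketch of how such error bands can be evaluated is given below. It assumes that a ±p% band means |predicted − actual| ≤ p% of the actual value, which is one common reading and may not match the exact construction used in Figures 4 and 5. The function names and the toy data are hypothetical.

```python
def relative_errors(actual, predicted):
    """Absolute relative error of each prediction with respect to its actual value."""
    return [abs(p - a) / abs(a) for a, p in zip(actual, predicted)]

def fraction_within_band(actual, predicted, band):
    """Fraction of points whose relative error does not exceed `band` (0.10 means 10%)."""
    errors = relative_errors(actual, predicted)
    return sum(e <= band for e in errors) / len(errors)

def smallest_enclosing_band(actual, predicted):
    """Smallest symmetric relative band that contains every prediction."""
    return max(relative_errors(actual, predicted))

# Hypothetical toy values, not taken from the study's dataset.
actual = [1.0, 2.0, 4.0]
predicted = [1.05, 2.5, 3.8]
print(fraction_within_band(actual, predicted, 0.10))
print(smallest_enclosing_band(actual, predicted))
```

Reporting the smallest enclosing band alongside the fraction of points inside a narrower band gives a fuller picture than the band width alone, since a single outlier can widen the enclosing band considerably.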
Figure 3 Scatter plots of D and λ as a function of the fourth and fifth predictor variables (Q, Δh).
Figure 4 Linear correlation between the actual and RF-predicted values of pipe diameter (D): (a) training stage (n = 700) and (b) testing stage (n = 300).
Figure 5 Linear correlation between the actual and RF-predicted values of Darcy–Weisbach friction factor (λ or f): (a) training stage (n = 700) and (b) testing stage (n = 300).
Furthermore, for visual comparison, the prediction accuracy of the applied soft-computing strategy was evaluated using two graphical methods, namely the box-and-whisker plot and the spread plot. The box-and-whisker plots summarize each variable by the following components:6,30 (1) the median value (Q2: median or second quartile) in each box is drawn as a solid center line to represent the location or central tendency; (2) the box represents the range of variation around this central tendency (the edges of the box are the 25th (Q1: lower quartile or first quartile) and 75th (Q3: upper quartile or third quartile) percentiles); and (3) the error range (Q4-Q0: maximum value - minimum value) is displayed as whiskers around the box. The black diamond (♦) inside each boxplot represents the mean value. The spread plot, in turn, displays the distribution of data across groups and provides a fully graphical picture of the spread of the data. The vertical axis is divided into a number of divisions, each corresponding to the width of a plot point. If more than one data point falls within a division, the additional points are shown alongside the first. As a result, a broad band represents a concentration of data at a specific value.
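The components of a box-and-whisker plot listed above can be illustrated with a small sketch. It uses Python's standard `statistics.quantiles` with the "inclusive" interpolation rule, which is an assumption: the quartile rule of the software used in the study is not stated here, and other rules give slightly different Q1 and Q3 values. The function name and the sample data are hypothetical.

```python
import statistics

def box_summary(values):
    """Components of a box-and-whisker plot: Q0..Q4, mean, range, and IQR."""
    data = sorted(values)
    # Quartile cut points; the "inclusive" rule is an assumed choice.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    q0, q4 = data[0], data[-1]
    return {
        "Q0": q0, "Q1": q1, "Q2": q2, "Q3": q3, "Q4": q4,
        "mean": statistics.mean(data),
        "range": q4 - q0,   # whisker extent, Q4 - Q0
        "IQR": q3 - q1,     # box height, Q3 - Q1
    }

# Hypothetical sample, not taken from the study's dataset.
summary = box_summary([1, 2, 3, 4, 5])
for name, value in summary.items():
    print(name, value)
```

Applying the same function to the actual and the predicted series gives the two sets of box components that are compared side by side in a plot of this kind.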
Figures 6 and 7 illustrate box-and-whisker and spread plots of the actual datasets against the RF-based estimations for the prediction of the required pipe diameter (D) and Darcy–Weisbach friction factor (λ or f), respectively. For both the training and testing datasets of D and λ (or f), the shapes of the box-and-whisker and spread plots of the RF-based predictions closely resemble those of the actual values of the respective responses.
Figure 6 Box-and-whisker plots of the actual and RF-predicted values of D and f (or λ) datasets: (a) training stage (n = 700) and (b) testing stage (n = 300).
Figure 7 Spread plots of the actual and RF-predicted values of D and f (or λ) datasets: (a) training stage (n = 700) and (b) testing stage (n = 300).
In order to examine the consistency of the RF-based estimations with the actual values, the descriptive statistics of the D and λ datasets, including the 25th, 50th, and 75th percentile values, are presented in Tables 3 and 4, respectively. When Figure 6 (box-and-whisker plots) and Tables 3 and 4 are scrutinized together, the descriptive statistics suggest that the lower quartile (Q1, 25th percentile), median (Q2, 50th percentile), and upper quartile (Q3, 75th percentile) values of the RF-estimated D and λ datasets are very close to the corresponding actual values. As seen from Tables 3 and 4, the interquartile ranges (IQR) of the RF-predicted values and the respective actual values of D and λ are also quite close to each other (e.g., 0.2675 m versus 0.2875 m for the testing set of D, and 0.0110 for both series in the testing set of λ).
| Statistics | D-Actual (TRA) | D-RF (TRA) | D-Actual (TES) | D-RF (TES) |
|---|---|---|---|---|
| Valid data (n) | 700 | 700 | 300 | 300 |
| Mean (m) | 0.6599 | 0.6600 | 0.6440 | 0.6471 |
| Standard deviation (m) | 0.2639 | 0.2552 | 0.2444 | 0.2267 |
| Variance coefficient | 0.3999 | 0.3867 | 0.3795 | 0.3503 |
| Standard error of mean | 0.0100 | 0.0096 | 0.0141 | 0.0131 |
| Upper 95% CL of mean | 0.6795 | 0.6790 | 0.6718 | 0.6729 |
| Lower 95% CL of mean | 0.6404 | 0.6411 | 0.6163 | 0.6213 |
| Geometric mean (m) | 0.6028 | 0.6073 | 0.5895 | 0.6027 |
| Skewness | 0.6395 | 0.5778 | 0.2799 | 0.2962 |
| Kurtosis | 4.1495 | 3.8854 | 3.4853 | 3.4861 |
| Maximum (Q4) (m) | 1.7740 | 1.6790 | 1.4920 | 1.3970 |
| Upper quartile (Q3) (m) | 0.8210 | 0.8160 | 0.7815 | 0.7860 |
| Median (Q2) (m) | 0.6470 | 0.6470 | 0.6405 | 0.6465 |
| Lower quartile (Q1) (m) | 0.4785 | 0.4860 | 0.4940 | 0.5185 |
| Minimum (Q0) (m) | 0.0540 | 0.0910 | 0.0970 | 0.1360 |
| Range (Q4-Q0) (m) | 1.7200 | 1.5880 | 1.3950 | 1.2610 |
| IQR = Q3-Q1 | 0.3425 | 0.3300 | 0.2875 | 0.2675 |
| Centile 95 (m) | 1.1020 | 1.0845 | 1.0370 | 0.9950 |
| Centile 5 (m) | 0.2575 | 0.2620 | 0.2300 | 0.2570 |
Table 3 Descriptive statistics of the actual and predicted D values for the applied RF-based soft-computing model
| Statistics | λ-Actual (TRA) | λ-RF (TRA) | λ-Actual (TES) | λ-RF (TES) |
|---|---|---|---|---|
| Valid data (n) | 700 | 700 | 300 | 300 |
| Mean | 0.0339 | 0.0339 | 0.0346 | 0.0344 |
| Standard deviation | 0.0099 | 0.0095 | 0.0100 | 0.0088 |
| Variance coefficient | 0.2920 | 0.2814 | 0.2893 | 0.2568 |
| Standard error of mean | 0.0004 | 0.0004 | 0.0006 | 0.0005 |
| Upper 95% CL of mean | 0.0346 | 0.0346 | 0.0358 | 0.0354 |
| Lower 95% CL of mean | 0.0331 | 0.0332 | 0.0335 | 0.0334 |
| Geometric mean | 0.0324 | 0.0324 | 0.0332 | 0.0332 |
| Skewness | 0.4191 | 0.2562 | 1.0122 | 0.2702 |
| Kurtosis | 4.0080 | 3.3565 | 6.5258 | 3.7043 |
| Maximum (Q4) | 0.0770 | 0.0690 | 0.0840 | 0.0660 |
| Upper quartile (Q3) | 0.0400 | 0.0400 | 0.0400 | 0.0400 |
| Median (Q2) | 0.0340 | 0.0340 | 0.0340 | 0.0345 |
| Lower quartile (Q1) | 0.0275 | 0.0280 | 0.0290 | 0.0290 |
| Minimum (Q0) | 0.0100 | 0.0120 | 0.0120 | 0.0130 |
| Range (Q4-Q0) | 0.0670 | 0.0570 | 0.0720 | 0.0530 |
| IQR = Q3-Q1 | 0.0125 | 0.0120 | 0.0110 | 0.0110 |
| Centile 95 | 0.0505 | 0.0500 | 0.0510 | 0.0495 |
| Centile 5 | 0.0180 | 0.0180 | 0.0190 | 0.0200 |
Table 4 Descriptive statistics of the actual and predicted λ values for the applied RF-based soft-computing model
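To put a number on how close the quartiles are, the following sketch computes the relative gap between the actual and RF-predicted quartiles of the testing sets, using the Q1, Q2, and Q3 values copied from Tables 3 and 4. The dictionary layout and the function name are illustrative choices rather than part of the original analysis.

```python
# (actual, RF-predicted) quartiles of the testing sets, copied from Tables 3 and 4.
TEST_QUARTILES = {
    "D": {"Q1": (0.4940, 0.5185), "Q2": (0.6405, 0.6465), "Q3": (0.7815, 0.7860)},
    "lambda": {"Q1": (0.0290, 0.0290), "Q2": (0.0340, 0.0345), "Q3": (0.0400, 0.0400)},
}

def relative_gap_percent(actual, predicted):
    """Absolute difference between predicted and actual, as a percentage of actual."""
    return abs(predicted - actual) / actual * 100.0

gaps = {
    response: {q: relative_gap_percent(a, p) for q, (a, p) in quartiles.items()}
    for response, quartiles in TEST_QUARTILES.items()
}

for response, by_quartile in gaps.items():
    for q, gap in by_quartile.items():
        print(f"{response} {q}: {gap:.2f}%")
```

With these tabulated values, every quartile gap stays below 5%, the largest being the lower quartile (Q1) of D in the testing set.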
Finally, validation datasets were built for both output variables using open literature data to check the prediction performance of the RF-based model on D and λ values (Table 5). Descriptions of pipe material acronyms are presented below the table. Figure 8 depicts the agreements between the observed values and the RF-based model predictions for the D and λ validation datasets.
Figure 8 Agreement between the observed values and the RF-based model outputs for the validation datasets of D and λ.
| No | ε | T | L | Q | Δh | D | λ | Re (×10^5) | Pipe | Reference and region |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.26 | 15.5 | 300 | 0.574 | 1.75556 | 0.5999 | 0.0167 | 10.875 | CI | Schumack51, Princeton, MI, USA |
| 2 | 0.254 | 7.7 | 457.2 | 0.08495 | 3.05 | 0.2829 | 0.0203 | 2.7429 | CI | Hoeft et al.,1 Texas, USA |
| 3 | 9.144 | 12.3 | 30.48 | 0.03639 | 1.52 | 0.1743 | 0.0733 | 2.1783 | CN | Hoeft et al.,1 Texas, USA |
| 4 | 0.05 | 20 | 100 | 0.003 | 10 | 0.0444 | 0.0229 | 0.86124 | CS | Senturk,52 Turkey |
| 5 | 3.2 | 15 | 450 | 0.068 | 7.3 | 0.25 | 0.0414 | 3.0513 | RS | Subramanian,53 New York, USA |
| 6 | 0.12 | 20 | 1000 | 0.05 | 11.55 | 0.2017 | 0.0187 | 3.1517 | ACI | Ghabayen and Abualtayef,54 Gaza, Palestine |
| 7 | 0.26 | 10 | 2000 | 0.058 | 4.6 | 0.3014 | 0.0206 | 1.8817 | CI | Ghabayen and Abualtayef54, Gaza, Palestine |
| 8 | 0.045 | 20 | 1000 | 0.4 | 5.5 | 0.5021 | 0.0133 | 10.1294 | CS | Ghabayen and Abualtayef54, Gaza, Palestine |
| 9 | 0.15 | 20 | 200 | 0.0016 | 4 | 0.0499 | 0.0291 | 0.4081 | GI | Ghabayen and Abualtayef54, Gaza, Palestine |
| 10 | 0.259 | 19.6 | 1000 | 0.13 | 85.216 | 0.2033 | 0.0212 | 8.0544 | CI | Sakkas55, Giannitsa, Greece |
| 11 | 3.05 | 19.6 | 305 | 0.12975 | 6.1 | 0.3051 | 0.038 | 5.3552 | RS | Sakkas55, Giannitsa, Greece |
| 12 | 0.915 | 19.6 | 1520 | 2.84 | 15.2 | 1.0495 | 0.0191 | 34.079 | RS | Sakkas55, Giannitsa, Greece |
| 13 | 0.5 | 9.9 | 1000 | 0.051 | 5.39 | 0.2497 | 0.0243 | 1.9915 | CN | Siddique56, Sharjah, UAE |
| 14 | 0.045 | 23 | 500 | 0.003 | 83.49 | 0.0398 | 0.0225 | 1.027 | WI | Ergil57, TR of Northern Cyprus |
| 15 | 0.045 | 25 | 135 | 0.62756 | 35.45 | 0.2786 | 0.0135 | 32.1451 | CS | Ergil57, TR of Northern Cyprus |
| 16 | 0.0015 | 20 | 390 | 0.115 | 16.5 | 0.2002 | 0.0124 | 7.3054 | PVC | Ergil57, TR of Northern Cyprus |
| 17 | 0.15 | 17 | 1450 | 0.0235 | 45.95 | 0.1255 | 0.0217 | 2.2106 | GI | Ergil57, TR of Northern Cyprus |
| 18 | 0.06 | 20 | 500 | 0.03 | 8.753 | 0.15 | 0.0179 | 2.543 | CS | Apsley9, Manchester, UK |
| 19 | 0.03 | 20 | 5000 | 0.4 | 50 | 0.4421 | 0.0128 | 11.5043 | CS | Apsley9, Manchester, UK |
| 20 | 0.1 | 20 | 800 | 0.05174 | 10 | 0.2 | 0.0181 | 3.2888 | ACI | Apsley9, Manchester, UK |
| 21 | 0.1 | 20 | 3000 | 0.1553 | 40 | 0.3 | 0.0163 | 6.5816 | ACI | Apsley9, Manchester, UK |
| 22 | 0.1 | 20 | 3000 | 0.2553 | 65.7 | 0.3292 | 0.0157 | 9.8623 | ACI | Apsley9, Manchester, UK |
| 23 | 0.1 | 20 | 5000 | 0.08 | 18.52 | 0.3 | 0.017 | 3.3908 | ACI | Apsley9, Manchester, UK |
| 24 | 1 | 20 | 3000 | 0.05 | 25 | 0.2357 | 0.0293 | 2.6978 | WCI | Apsley9, Manchester, UK |
| 25 | 0.0015 | 20 | 120 | 0.25 | 10 | 0.2336 | 0.0112 | 13.6056 | PL | Almoulki and Yetilmezsoy58, Turkey |
Table 5 Open-access fully developed turbulent flow data to validate the accuracy of the applied RF-based model’s predictions
CI: cast iron; CN: concrete; CS: commercial steel; RS: riveted steel; ACI: asphalted cast iron; GI: galvanized iron; WI: wrought iron; PVC: polyvinyl chloride; WCI: worn cast iron; PL: plastic
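For orientation, the conventional iterative route for a Type 3 (sizing) problem, which the RF-based model is meant to replace, can be sketched as below and checked against a row of Table 5. The sketch combines the Darcy–Weisbach equation, written as Δh = 8λLQ²/(π²gD⁵), with a fixed-point update of the Colebrook–White equation. Several points are assumptions rather than statements from the paper: the units (ε in mm, L and Δh in m, Q in m³/s) are inferred because the table header carries none, the kinematic viscosity ν = 1.0 × 10⁻⁶ m²/s is a round value for water near 20 °C, and the function name, initial guess, and stopping rule are illustrative.

```python
import math

def size_pipe(eps, L, Q, dh, nu, g=9.81, tol=1e-12, max_iter=500):
    """Iteratively solve for diameter D (m) and friction factor lambda.

    eps: absolute roughness (m), L: pipe length (m), Q: flow rate (m^3/s),
    dh: head loss (m), nu: kinematic viscosity (m^2/s).
    """
    lam = 0.02  # initial guess for the friction factor (assumed)
    for _ in range(max_iter):
        # Darcy-Weisbach solved for D at the current friction factor.
        D = (8.0 * lam * L * Q**2 / (math.pi**2 * g * dh)) ** 0.2
        Re = 4.0 * Q / (math.pi * D * nu)
        # One fixed-point step of the Colebrook-White equation.
        inv_sqrt = -2.0 * math.log10(eps / (3.7 * D) + 2.51 / (Re * math.sqrt(lam)))
        lam_new = 1.0 / inv_sqrt**2
        converged = abs(lam_new - lam) < tol
        lam = lam_new
        if converged:
            break
    # Diameter consistent with the final friction factor.
    D = (8.0 * lam * L * Q**2 / (math.pi**2 * g * dh)) ** 0.2
    return D, lam

# Row 4 of Table 5 (commercial steel), with eps converted from an assumed mm to m.
D, lam = size_pipe(eps=0.05e-3, L=100.0, Q=0.003, dh=10.0, nu=1.0e-6)
print(f"D ~ {D:.4f} m, lambda ~ {lam:.4f}")
```

Under these assumptions the result should land close to the tabulated D = 0.0444 m and λ = 0.0229 for row 4; small differences are expected because the viscosity value used in the original source is not known here.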
Statistical measurements for the validation datasets of D and λ were obtained as follows, in their respective order: R2 = 0.9624 and 0.9130; MAE = 0.0598 m and 0.0043; RMSE = 0.0708 m and 0.0052; RMSES = 0.0621 m and 0.0042; RMSEU = 0.0339 m and 0.0031; SEE = 0.0353 m and 0.0033; IA = 0.9653 and 0.9499; FV = 0.1469 and 0.1677; and AIC = -130.4128 and -260.5149. Notwithstanding the fact that the pipe length dataset contains some extreme values (i.e., L > Lmax ≈ 2000 m) in comparison to the present modeling constraints (Table 1), the statistical findings indicated the validity of the RF-based decision tree strategy implemented to estimate the D and λ values.
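The systematic (RMSES) and unsystematic (RMSEU) components listed above can be related to the total RMSE with a short sketch. It assumes the regression-based decomposition commonly attributed to Willmott, in which the predictions are regressed on the actual values and the squared components add up to the squared RMSE; the paper's own definitions are not restated in this section, so this is an assumption. The function names and the toy data are hypothetical.

```python
import math

def rmse_components(actual, predicted):
    """Return (RMSE, systematic RMSE, unsystematic RMSE) for paired values."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    # Ordinary least-squares line: fitted = intercept + slope * actual.
    sxx = sum((a - mean_a) ** 2 for a in actual)
    sxy = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    slope = sxy / sxx
    intercept = mean_p - slope * mean_a
    fitted = [intercept + slope * a for a in actual]
    rmse = math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / n)
    rmse_s = math.sqrt(sum((f - a) ** 2 for a, f in zip(actual, fitted)) / n)
    rmse_u = math.sqrt(sum((p - f) ** 2 for p, f in zip(predicted, fitted)) / n)
    return rmse, rmse_s, rmse_u

def recombine(rmse_s, rmse_u):
    """Total RMSE implied by the two components: sqrt(RMSEs^2 + RMSEu^2)."""
    return math.hypot(rmse_s, rmse_u)

# Hypothetical toy values, not taken from the validation dataset.
total, systematic, unsystematic = rmse_components([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])
print(total, recombine(systematic, unsystematic))

# Reported validation components for D and lambda.
print(recombine(0.0621, 0.0339), recombine(0.0042, 0.0031))
```

Recombining the reported components gives values that agree with the reported RMSE of 0.0708 m for D and 0.0052 for λ to within rounding, which is consistent with the assumed decomposition.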
An RF-based soft-computing approach was implemented for the first time in the estimation of the required pipe diameter (D) and Darcy–Weisbach friction factor (λ or f) in the same study. In the present computational analysis, five primary pipeline design components (ε, T, L, Q, Δh) were simulated for fully developed turbulent flow conditions in sizing problems (Type 3) of rough pipes. The results were analyzed in terms of various statistical performance measures and useful mathematical diagrams.
It was shown that the suggested RF-based strategy offered a flexible alternative to traditional computation for calculating both D and λ values, performing no worse than the labor-intensive conventional methods. It should be emphasized that the present computational study was carried out as a part of an integrated modeling research effort exploring the prediction performance of various soft-computing methodologies on the estimation of different hydraulic outputs (e.g., D, λ, Re) for Type 3 problems of pipe distribution systems. This study demonstrated the efficacy of an RF-based data-intelligent model without the need for the cumbersome and time-consuming steps of the traditional iterative technique (trial-and-error procedure). The applied soft-computing technique may also help to avoid the problems associated with data collections containing missing values. In this regard, the approach offers a fairly flexible strategy for calculating both D and λ values.
According to the results, the suggested RF-based decision tree technique produced quantitative predictions in a computation time of well under one second for each of the building, training, and testing steps. As a consequence, the established method provided a speedy solution to pipeline sizing problems within the studied limits of the relevant input data. It should be underlined that while all efforts in this field are valued as a product of labor, there is always a demand for high-performance and adaptable approaches for hydraulic engineering applications. From this standpoint, it would be worthwhile to expand the existing research to characterize the behavior of turbulent flow conditions more fully using a number of sophisticated hybrid techniques. Furthermore, additional effort is recommended to build novel soft-computing models that take into account the influence of varied minor (local) loss coefficients (K). It was concluded that the flexibility of the proposed strategy will make it an appropriate data-driven tool for modeling other highly iterative hydraulic engineering applications.
The author would like to thank Ms. Regina Cooper (Editorial & Review Analyst from MedCrave Group) who provided encouragement and time support during the creation of this work.
The author declares that he has no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
©2023 Yetilmezsoy. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.