SiPy 0.7.0 – R-based ANOVA and survival analyses

Wira Bin Ambel; Maurice HT Ling

doi:10.15406/oajs.2026.09.00280

Open Access Journal of Science

eISSN: 2575-9086

Software Tool Article Volume 9 Issue 1

Wira Bin Ambel,¹ Maurice HT Ling²,³

¹School of Health & Life Sciences, Teesside University, UK
²Newcastle Australia Institute of Higher Education, University of Newcastle, Australia
³HOHY PTE LTD, Singapore

Correspondence: Maurice HT Ling, Newcastle Australia Institute of Higher Education, University of Newcastle, Australia

Received: December 12, 2025 | Published: January 8, 2026

Citation: Ambel WB, Ling MHT. SiPy 0.7.0 – R-based ANOVA and survival analyses. Open Access J Sci. 2026;9(1):1-5. DOI: 10.15406/oajs.2026.09.00280

Abstract

Statistics in Python (SiPy) is a data analysis tool built using Python that integrates analyses from R, while aiming to reduce the learning curve needed to learn either Python or R. SiPy version 0.6.0 was recently released but lacks ANOVA and survival analyses. Here, we extend SiPy version 0.6.0 to version 0.7.0 (codenamed Keropok), released on 05 December 2025, by integrating 8 ANOVA-based methods and 15 survival analysis methods from R.

Keywords: SiPy version 0.6.0, data, experimental designs, medical research, data analysts

Introduction

Statistical analysis of clinical and experimental data frequently requires the use of models that can partition, compare, and explain sources of variability across groups, covariates, and time.¹,² Analysis of variance (ANOVA) and its extensions remain foundational tools for evaluating mean differences under varying experimental designs.³ Similarly, survival analysis methods are central to the analysis of time-to-event data in medical research.⁴,⁵

While both Python and R are frequent choices among data analysts,⁶ it is generally accepted that R is stronger than Python in analytics, especially in statistical analysis methods.⁷ SiPy⁸ is a lightweight statistical interface written in Python, and has been demonstrated as a potential platform for incorporating R methods while reducing the learning curve needed to learn R. In this paper, we integrate R-based ANOVA and survival analysis methods into SiPy 0.6.0,⁸ thereby presenting SiPy 0.7.0 (codenamed Keropok), released on 05 December 2025, and illustrate its application through a simulated clinical dataset designed to resemble a multi-centre randomized trial.

Simulated data for case study

We generated a dataset using a Python data generation script (file name = survival_dataset_generator.py in sipy/data folder) comprising 1000 patients across 20 centres, randomized 1:1 to Drug or Control arms, to resemble a multi-centre clinical trial. The ages of these patients were normally distributed with a mean of 60 years and a standard deviation of 10 years, with 50% of each gender. The three stages of disease (I, II, III) were randomized at 40%, 40%, and 20% respectively. Baseline biomarker was normally distributed with a mean of 50 units and a standard deviation of 10 units. Baseline quality-of-life score was normally distributed with a mean of 70 units and a standard deviation of 12 units.

The simulated data incorporated treatment effects at 3 and 6 months. In the Drug arm, the mean biomarker reductions were 5.0 and 8.0 units at 3 and 6 months respectively, with a quality-of-life improvement of 4 units at 6 months. In the Control arm, the mean biomarker reductions were 1.0 and 0.5 units at 3 and 6 months respectively, with a quality-of-life improvement of 1.0 unit at 6 months. Time-to-event data was generated using a Weibull distribution.
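The simulation design described above can be sketched in plain Python as follows. This is a minimal illustration and not the survival_dataset_generator.py script itself. The column names are taken from the example commands in this paper where available, while qol_6m, patient_id, the residual noise (standard deviation of 3 units), and the Weibull scale and shape values are assumptions made only for this sketch. Censoring and the event status column are not modelled here.

```python
import random


def simulate_trial(n_patients=1000, n_centres=20, seed=42):
    """Simulate a two-arm, multi-centre trial using the parameters in the text."""
    rng = random.Random(seed)
    # 1:1 allocation: half Drug, half Control (n_patients assumed even), shuffled
    arms = ["Drug", "Control"] * (n_patients // 2)
    rng.shuffle(arms)
    # Mean biomarker reductions (3 and 6 months) and 6-month QoL gain, from the text
    effects = {
        "Drug": {"m3": 5.0, "m6": 8.0, "qol": 4.0},
        "Control": {"m3": 1.0, "m6": 0.5, "qol": 1.0},
    }
    rows = []
    for i, arm in enumerate(arms):
        baseline = rng.gauss(50, 10)   # baseline biomarker
        qol = rng.gauss(70, 12)        # baseline quality-of-life score
        eff = effects[arm]
        # ASSUMPTION: Weibull scale (months) and shape are illustrative only
        scale = 30.0 if arm == "Drug" else 24.0
        rows.append({
            "patient_id": i + 1,
            "center": rng.randint(1, n_centres),
            "arm": arm,
            "age": rng.gauss(60, 10),
            "sex": rng.choice(["M", "F"]),  # approximately 50% each
            "stage": rng.choices(["I", "II", "III"], weights=[40, 40, 20])[0],
            "biomarker_baseline": baseline,
            # ASSUMPTION: residual noise with standard deviation 3
            "biomarker_3m": baseline - eff["m3"] + rng.gauss(0, 3),
            "biomarker_6m": baseline - eff["m6"] + rng.gauss(0, 3),
            "qol_baseline": qol,
            "qol_6m": qol + eff["qol"] + rng.gauss(0, 3),
            "time_months": rng.weibullvariate(scale, 1.5),
        })
    return rows


trial = simulate_trial()
```

Each element of trial is a dictionary representing one patient, which could be written out with the csv module from the standard library if a comma-delimited file is needed.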

The resulting simulated data is stored as a comma-delimited file in sipy/data folder as survival_dataset.csv and can be read into SiPy as sdata variable using the following command: read csv sdata from data/survival_dataset.csv.

ANOVA-based analyses

The following ANOVA-based methods from R are available in SiPy 0.7.0 (sample ANOVA analyses are shown in Figure 1):

ANOVA is used to compare group means. For example, 1-way ANOVA can evaluate the raw treatment effect at 6 months between drug and control arms (such as in Satre et al.⁹) with Tukey as the posthoc test (example command: ranova anova data=sdata y=biomarker_6m x=arm posthoc=tukey), as shown in Figure 1A, where only the difference in means between the drug and control arms is significant (p-value < 2e-16). This can be extended to 2-way ANOVA with arm and gender as factors (example command: ranova anova data=sdata y=biomarker_6m x=arm,sex posthoc=tukey), to 3-way ANOVA with arm, gender and centre as factors (example command: ranova anova data=sdata y=biomarker_6m x=arm,sex,center posthoc=tukey), or to N-way ANOVA.
Kruskal-Wallis Test is a non-parametric equivalent of 1-way ANOVA.¹⁰ For example, if the data cannot be assumed to be normally distributed, the raw treatment effect at 6 months between drug and control arms can be evaluated with Dunn as the posthoc test (example command: ranova kruskal data=sdata y=biomarker_6m x=arm posthoc=dunn).
Friedman Test is a non-parametric equivalent of repeated measures ANOVA.¹¹ For example, evaluating various stress reduction techniques using the same set of test subjects and assuming no interaction between the techniques.
Welch Test is used when equal variances cannot be assumed in 1-way ANOVA.¹² For example, if the variances of biomarker levels at 6 months between drug and control arms cannot be assumed to be equal, the raw treatment effect at 6 months between drug and control arms can be evaluated with Games-Howell as the posthoc test (example command: ranova welch data=sdata y=biomarker_6m x=arm posthoc=games-howell).
Permutation Test can be used when normality cannot be assumed but N-way ANOVA is needed,¹³ a situation that cannot be addressed by the Kruskal-Wallis Test or Welch Test. For example, to evaluate the raw treatment effect at 6 months between drug and control arms in various centres, the command will be ranova permutation data=sdata y=biomarker_6m x=arm,center; as shown in Figure 1A, which also shows that only the difference in means between the drug and control arms is significant (p-value < 2e-16).
ANCOVA is ANOVA with one or more continuous variables (known as covariates). As ANCOVA mathematically partitions the variance of the dependent variable into variances of the independent covariates and variances of independent factors, it is commonly used to account for baseline differences measured as covariates.¹⁴ For example, the raw treatment effect at 6 months between drug and control arms must take into account the baseline biomarker levels. Hence, baseline biomarker level is used as a covariate (example command: ranova ancova data=sdata y=biomarker_6m x=arm covariates=biomarker_baseline posthoc=tukey). Similar to N-way ANOVA, there can be multiple covariates (example command: ranova ancova data=sdata y=biomarker_6m x=arm,sex covariates=biomarker_baseline,age posthoc=tukey).
MANOVA is ANOVA with more than one dependent variable. For example, we can compare the mean biomarker levels at 3 months and 6 months simultaneously between drug and control arms (example command: ranova manova data=sdata y=biomarker_3m,biomarker_6m x=arm posthoc=tukey). Similar to N-way ANOVA, there can be multiple factors (example command: ranova manova data=sdata y=biomarker_3m,biomarker_6m x=arm,sex posthoc=tukey).
MANCOVA extends MANOVA with covariates, much as ANCOVA extends ANOVA. As such, MANCOVA can be used to control for baseline in MANOVA.¹⁵ For example, to compare the mean biomarker levels at 3 months and 6 months simultaneously across arm (drug versus control) and gender while controlling for baseline biomarker levels and age, the command will be ranova mancova data=sdata y=biomarker_3m,biomarker_6m x=arm,sex posthoc=tukey covariates=biomarker_baseline,age; as shown in Figure 1B, where arm, sex, and biomarker_baseline are significant (p-value < 0.05).
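To make the relationship between 1-way ANOVA and the permutation approach concrete, the sketch below computes the 1-way ANOVA F statistic and then estimates a p-value by shuffling group labels. It is written in plain Python for illustration only and covers a single factor; it is not the R code that SiPy calls, and the function names and the small drug_6m and control_6m samples are hypothetical.

```python
import random
from statistics import mean


def f_statistic(groups):
    """One-way ANOVA F statistic for a list of groups of numeric values."""
    values = [v for g in groups for v in g]
    grand_mean = mean(values)
    group_means = [mean(g) for g in groups]
    # Variability between group means, weighted by group size
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    # Variability of observations around their own group mean
    ss_within = sum((v - m) ** 2
                    for g, m in zip(groups, group_means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


def permutation_p_value(groups, n_perm=2000, seed=1):
    """Estimate the p-value of the F statistic by permuting group labels."""
    rng = random.Random(seed)
    observed = f_statistic(groups)
    pooled = [v for g in groups for v in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        shuffled, start = [], 0
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        if f_statistic(shuffled) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero
    return (hits + 1) / (n_perm + 1)


# Hypothetical 6-month biomarker values for two small arms
drug_6m = [41.0, 43.5, 40.2, 44.1, 42.3]
control_6m = [49.5, 50.2, 48.7, 51.3, 49.9]
f_value = f_statistic([drug_6m, control_6m])
p_value = permutation_p_value([drug_6m, control_6m])
```

Because the permutation p-value depends only on the ranking of F statistics across shuffles, it does not rely on the normality assumption of the classical F test.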

Figure 1 Screenshots of Sample ANOVA-Based Analyses. Panel A shows 2-way ANOVA and Permutation test. Panel B shows MANCOVA.

Survival-based analyses

The following survival analysis methods from R are available in SiPy 0.7.0 (sample survival analyses are shown in Figure 2):

Accelerated Failure Time model (AFT) is commonly utilised to estimate the effect of covariates on the survival times of the patients under a chosen parametric distribution.¹⁶ For instance, to assess how treatment arm, age, sex, disease stage, baseline biomarker level and baseline quality of life influence survival time under a Weibull distribution, the command will be rsurvival aft data=surdata time=time_months event=status covariates=arm,age,sex,stage,biomarker_baseline,qol_baseline dist=Weibull.
Competing Risks Regression (Fine-Gray Model) evaluates the effect of covariates on the incidence of an event of interest in the presence of various competing risks.¹⁷ For instance, to determine how treatment arm, age and sex affect failure from cause 1 while accounting for other causes, the command will be rsurvival competing data=surdata time=time_months event=status cause=cause group=arm covariates=age,sex.
Cox Interaction Model (Cox-Int) extends the Cox model by including interaction effects among covariates.¹⁸ This is useful to assess whether the effect of a covariate depends on another. For example, to test the effect of age and sex at different stages (example command: rsurvival coxint data=surdata time=time_months event=status covariates=age,sex,stage) as shown in Figure 2B.
Cox Proportional Hazards Model (Cox) is utilised to evaluate the effects of covariates on hazard rate while assuming proportional hazards over time.¹⁸ For example, to determine whether survival is influenced by age, sex, and disease stage (example command: rsurvival cox data=surdata time=time_months event=status covariates=age,sex,stage).
Exponential-AFT is a survival method that assumes a constant hazard rate over time under an exponential distribution¹⁹ (example command: rsurvival expaft data=surdata time=time_months event=status covariates=arm,age,sex,stage).
Frailty-Cox Model accounts for random variables shared within clusters such as study groups²⁰ (example command: rsurvival frailtycox data=surdata time=time_months event=status group=center covariates=age,sex,stage).
Interval-Censored Model evaluates interval-censored models with various parametric distribution methods without any covariate adjustments.²¹ For example, the command used to analyse the survival time for different groups will be rsurvival intcens data=surdata time1=time1 time2=time2 event=event group=group.
Kaplan-Meier (KM) method is used to estimate and visualise unadjusted survival times between different groups.²² For example, the survival distributions of the drug and control arms can be compared without covariate adjustment (example command: rsurvival km data=surdata time=time_months event=status group=arm). This produces the Kaplan-Meier curves and median survival times for each arm.
Left-Truncated Cox Model (LT-Cox) is a survival model used to account for left truncation due to delayed study entry, where participants are only observed after the onset of the disease of interest²³ (example command: rsurvival ltcox data=surdata entry=entry time=time event=event covariates=group,age,sex).
Log-rank Test is a non-parametric test used to compare survival curves between different groups.²⁴ An example would be to test if survival distributions differ between drug and control arms without considering any covariates (example command: rsurvival logrank data=surdata time=time_months event=status group=arm), as shown in Figure 2A, where the survival distributions of the drug and control arms differ significantly (p-value = 0.04).
Nonparametric Interval-Censored Model (NPMLE) evaluates interval-censored survival data without assuming any distribution.²⁵ For example, to assess the nonparametric survival functions by treatment group (example command: rsurvival intnp data=surdata time1=time1 time2=time2 event=event group=group).
Parametric Accelerated Failure Time Model (Interval-AFT) assesses the effects of covariates on the time to the event of interest with no proportionality assumptions under a chosen parametric distribution such as log-logistic.²⁶ For example, to determine how group, age and sex affect the survival time under a log-logistic distribution (example command: rsurvival int-aft data=surdata time1=time1 time2=time2 event=event covariates=group,age,sex).
Parametric Interval-Censored Model (Interval-Parametric) assumes a specific survival distribution such as Weibull to analyse interval-censored data²⁷ (example command: rsurvival intpar data=surdata time1=time1 time2=time2 event=event covariates=group,age,sex dist=Weibull).
Semiparametric Interval-Censored Model (Interval-sp) combines nonparametric estimation with time-dependent covariates²⁸ (example command: rsurvival intsp data=surdata time1=time1 time2=time2 event=event covariates=group,age,sex).
Time-Dependent Cox Model (TD-Cox) is used to analyse the effects of covariates that vary over time²⁹ (example command: rsurvival tdcox data=surdata time=time_months event=status group=arm covariates=treatment_td,age,sex).
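As an illustration of what the Kaplan-Meier method in the list above estimates, the following plain-Python sketch computes the product-limit survival curve and the median survival time from follow-up times and event indicators (1 = event, 0 = censored). It is a teaching sketch and not the R routine used by SiPy; the function names and the example values are hypothetical.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct event time."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Collect everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            # Product-limit step: multiply by the fraction surviving time t
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored subjects leave the risk set after t
        at_risk -= removed
    return curve


def median_survival(curve):
    """First time at which the survival probability is 0.5 or lower, else None."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


# Hypothetical follow-up times in months; the second subject is censored
km_curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
km_median = median_survival(km_curve)
```

Censored subjects contribute to the number at risk until their follow-up ends, which is why the drop at month 3 in this example is computed from two subjects at risk rather than three.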

Figure 2 Screenshots of Example Survival Analyses. Panel A shows Log-rank test. Panel B shows Cox proportional hazard model with interaction between the covariates.
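The log-rank test shown in Panel A compares observed and expected event counts at every event time. The two-group version can be sketched in plain Python as below, assuming event indicators coded as 1 = event and 0 = censored. The function name and the example data are hypothetical, and the sketch is not the R implementation that SiPy calls.

```python
import math


def logrank_test(times_a, events_a, times_b, events_b):
    """Two-group log-rank test; returns (chi-square statistic, p-value)."""
    event_times = sorted(
        {t for t, e in zip(times_a, events_a) if e}
        | {t for t, e in zip(times_b, events_b) if e}
    )
    observed_a = 0.0
    expected_a = 0.0
    variance = 0.0
    for t in event_times:
        # Numbers still at risk just before time t in each group
        n_a = sum(1 for x in times_a if x >= t)
        n_b = sum(1 for x in times_b if x >= t)
        # Events occurring exactly at time t in each group
        d_a = sum(1 for x, e in zip(times_a, events_a) if x == t and e)
        d_b = sum(1 for x, e in zip(times_b, events_b) if x == t and e)
        n, d = n_a + n_b, d_a + d_b
        observed_a += d_a
        expected_a += d * n_a / n
        if n > 1:
            # Hypergeometric variance of the event count in group A
            variance += d * (n_a / n) * (n_b / n) * (n - d) / (n - 1)
    if variance == 0:
        return 0.0, 1.0
    chi2 = (observed_a - expected_a) ** 2 / variance
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical survival times in months, all events observed
chi2, p_value = logrank_test([1, 2, 3, 4, 5], [1] * 5,
                             [6, 7, 8, 9, 10], [1] * 5)
```

With one degree of freedom, the chi-square tail probability can be written with the complementary error function, which avoids the need for a statistics library in this sketch.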

Concluding remarks

We extend SiPy 0.6.0⁸ to SiPy 0.7.0 (codenamed Keropok), which was released on 05 December 2025, by incorporating R-based ANOVA and survival analysis methods into a consistent Python interface. This design could potentially reduce the steep learning curve for users while improving the accessibility of sophisticated R statistical tools.³⁰ Future work could focus on extending the capability of SiPy by incorporating other statistical tools such as mixed-effects models to analyse longitudinal data.³¹ Continuous updates will also be performed to ensure that the latest methods and bug fixes are available through SiPy.

Data availability

The source code for SiPy can be found at https://github.com/mauriceling/sipy while documentation can be found at https://github.com/mauriceling/sipy/wiki. The release page for SiPy 0.7.0 (codenamed Keropok) can be found at https://bit.ly/SiPy-070.