Research Article Volume 6 Issue 3
1Deputy CEO, Joint Commission of Taiwan, Taipei, Taiwan
2Department of Medical Affairs and Planning, Taipei Veterans General Hospital & National Yang-Ming University School of Medicine, Taipei, Taiwan
3Division of Quality Improvement, Joint Commission of Taiwan, Taipei, Taiwan
4Advisor, Joint Commission of Taiwan, Taipei, Taiwan
5CEO, Joint Commission of Taiwan
Correspondence: Heon-Jae Jeong, Advisor, Joint Commission of Taiwan, 5F, No. 31, Sec. 2, Sanmin Rd., Banqiao Dist., New Taipei City 22069, Taiwan, Tel 886-2-8964-3000, Fax 886-2-2963-4292
Received: August 23, 2017 | Published: September 5, 2017
Citation: Liao HH, Lee WC, You YL, et al. A practical approach to develop a parsimonious survey questionnaire-taiwanese patient safety culture survey as an example. Biom Biostat Int J. 2017;6(3):350-355. DOI: 10.15406/bbij.2017.06.00169
This study suggests a method for choosing which items can be removed from a survey instrument, thereby lessening survey participants' burden. We used the Taiwanese Patient Safety Culture (TPSC) survey as an example, in particular its emotional exhaustion (EE) and work–life balance (WLB) domains. We applied a traditional factor analysis approach, in which a Likert scale is treated as a linear variable, and an item response theory (IRT) graded response model (GRM), in which a response is treated as an ordinal scale. Both methods filtered out five of the nine EE items and three of the seven WLB items for removal; these items provided significantly less information for domain score estimation than the other items. To check whether the downsized versions showed sufficient validity, we tested the factor structure of the remaining items and compared fit indices between the models with all original items and the new models with only the remaining items. The new models in both the EE and WLB domains showed better model fit than the models with all original items. In this study, we showed how to select the more important items while ensuring that the selected items as a whole provide a better fit than the original instrument. This interesting phenomenon may be due to the translation step and the different cultural soil in which a survey is administered. We strongly recommend checking whether items can be removed when adopting and validating a survey instrument, not only to shorten the overall length but also to increase validity.
Healthcare professionals tend to work on extremely tight schedules. Yet those working to improve the quality and safety of care almost always want to administer survey questionnaires to these professionals to collect information on various aspects of healthcare. Such information is certainly crucial, but it is equally certain that the surveys place a considerable burden on healthcare professionals. In this regard, it is our obligation to control the burden we expect to impose when administering a survey. From a more practical standpoint, fatigue from frequent surveys may result in a low response rate that endangers the representativeness of the survey itself,1 which in turn leads to less useful survey results.
Beyond being frequently asked to participate in surveys, the burden of completing a survey also resides in the characteristics of each questionnaire, such as how many questions are asked, how long each item is, and even what measurement scale is used. Indeed, Jeong and Lee (2016) showed that the number of response options causes significant differences in the time needed to complete a survey; for example, a 3-point Likert scale is much faster to complete than a 5-point Likert scale.2 Beyond the reduced completion time, they also showed that 3-point and 5-point Likert scales yielded very similar results at the population level.3 Although their studies used a specific instrument, the Safety Attitudes Questionnaire-Korean version (SAQ-K), the results suggest that we can modify and streamline survey questionnaires without undermining their validity.
Among the various characteristics of a questionnaire mentioned above, the number of items plays an important role in determining how burdensome a survey is; thus, it is one of the aspects that should be addressed first. As such, we tried to establish, or at least suggest, a clear and structured approach to reducing survey items without compromising validity, aimed at target readers who are quality and safety personnel in the field but may not have experience with advanced statistics.
To this end, we used the Taiwanese Patient Safety Culture Survey Questionnaire (TPSC) as an example. The TPSC has been administered to hospitals in Taiwan by the Joint Commission of Taiwan (JCT) for the past decade. It consists of 46 main items in addition to 14 items asking about demographics and a few issues specific to the JCT's needs. The 46 main items came from three different instruments measuring the corresponding constructs: 30 items from the Safety Attitudes Questionnaire (SAQ), one of the most widely used patient safety culture measurement instruments,4-9 nine items in the emotional exhaustion (EE) domain from the Maslach Burnout Inventory,10 and seven work–life balance (WLB) domain items from the study “Accelerating the coping process”.11
In this study, we applied two different approaches to choosing items that could be removed: traditional factor analysis and item response theory (IRT). Although they share some commonalities, each is rooted in a different foundation. From a practical perspective, IRT has a unique strength in quantifying the amount of information each item provides.12 On the other hand, the factor analysis approach is still the most popular method for survey development,13 and the TPSC was validated with that approach before it was rolled out. By using the two in tandem and checking whether their results were in concordance, we obtained much more stable results. Let us describe the procedures and results in detail.
Selecting target domains to reduce
The SAQ section of the TPSC covers the overall safety culture construct, consisting of six domains, each of which contains an average of five items. These already compact domains did not give us much opportunity to drop items. Thus, we focused on the EE and WLB domains to reduce the number of items. As Table 1 shows, each of these domains has relatively more items than the SAQ domains, suggesting there is room for reduction.
ID | Item
EE1 | I feel like I am at the end of my rope
EE2 | I feel burned out from my work
EE3 | I feel frustrated by my job
EE4 | I feel I am working too hard on my job
EE5 | I feel emotionally drained from my work
EE6 | I feel used up at the end of the workday
EE7 | I feel fatigued when I get up in the morning and have to face another day on the job
EE8 | Working with people all day is really a strain for me
EE9 | Working with people directly puts too much stress on me
WLB1 | Skipped a meal
WLB2 | Ate a poorly balanced meal
WLB3 | Worked through a day/shift without any breaks
WLB4 | Changed personal/family plans because of work
WLB5 | Had difficulty sleeping
WLB6 | Slept less than 5 hours in a night
WLB7 | Arrived home late from work
Table 1 List of items of EE and WLB domains
The nine EE items were measured on a 5-point Likert scale (1 = agree strongly, 2 = agree slightly, 3 = neutral, 4 = disagree slightly, 5 = disagree strongly), and the seven WLB items were measured on a 4-point frequency scale (1 = always, 2 = occasionally, 3 = a few times, 4 = rarely or none). For this study, we used the 2016 TPSC data that JCT collected from hospitals across Taiwan, which is the most recent and largest TPSC dataset as of this writing.
Selection of the items to be retained and their confirmation
First, we utilized the most commonly used form of factor analysis, principal component factor analysis (PCFA), and calculated the factor loading of each item. We also applied the IRT graded response model (GRM), treating the responses as 5- and 4-point ordinal scales for the EE and WLB domains, respectively, to obtain the item information function (IIF), which describes the amount of information each item provides. By examining the PCFA and IIF results together, we identified items that convey less information than the others. We then confirmed our decision from the instrument validity perspective by conducting a confirmatory factor analysis (CFA) and comparing model fit indices between an all-item model and a reduced-item model. For all analyses introduced in this article, Stata 14.2 (StataCorp, College Station, Texas) was used.
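Although our analyses were run in Stata, the PCFA screening step can be illustrated with a short, library-free sketch. The snippet below is only an illustration, not our exact procedure: the file name tpsc_2016.csv and the column names ee1 through ee9 are hypothetical placeholders, and the single-factor loadings are computed directly from the eigen-decomposition of the item correlation matrix.

```python
# Minimal PCFA sketch: single-factor loadings from the item correlation matrix.
# The data file and column names below are hypothetical placeholders.
import numpy as np
import pandas as pd

df = pd.read_csv("tpsc_2016.csv")                   # hypothetical item-level export
items = [f"ee{i}" for i in range(1, 10)]            # the nine EE items

corr = df[items].corr().to_numpy()                  # item correlation matrix
eigvals, eigvecs = np.linalg.eigh(corr)             # eigh returns ascending eigenvalues
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # sort by explained variance

loadings = eigvecs[:, 0] * np.sqrt(eigvals[0])      # loadings on the first component
if loadings.sum() < 0:                              # fix the arbitrary sign of the eigenvector
    loadings = -loadings

for item, loading in zip(items, loadings):
    print(f"{item}: {loading:.3f}")                 # low-loading items are removal candidates
```

In Stata, the corresponding step is the factor command with the pcf option.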
Characteristics of participating hospitals and respondents
As described in Table 2, the 2016 TPSC dataset contains a total of 83,807 respondents from 98 hospitals, which are categorized into four levels according to the Taiwanese healthcare system. Regional hospitals accounted for the majority of both organizations and respondents: 53 (54.1%) and 41,583 (49.6%), respectively. Furthermore, 34,443 (41.1%) respondents came from only 16 (16.3%) medical centers, which is understandable considering that medical centers are the largest facilities. The dataset also included 28 district hospitals (28.6%) and one psychiatric hospital (1.0%), from which 7,279 (8.7%) and 502 (0.6%) respondents, respectively, completed the survey.
Hospital level | Hospitals (N) | Hospitals (%) | Respondents (N) | Respondents (%)
Medical center | 16 | 16.3% | 34,443 | 41.1%
Regional hospital | 53 | 54.1% | 41,583 | 49.6%
District hospital | 28 | 28.6% | 7,279 | 8.7%
Psychiatric hospital | 1 | 1.0% | 502 | 0.6%
Total | 98 | 100.0% | 83,807 | 100.0%
Table 2 Characteristics of participating hospitals
Table 3 shows respondents’ job types. According to the data, 44,000 (52.5%) respondents were nurses, 13,273 (15.8%) were administration staff, and 9,135 (10.9%) were technologists. Physicians accounted for 7,837 (9.4%) of respondents, followed by pharmacists and rehabilitation staff at 3,085 (3.7%) and 1,400 (1.7%), respectively.
Job type | N | %
Physician | 7,837 | 9.4
Nurse | 44,000 | 52.5
Technologist | 9,135 | 10.9
Pharmacist | 3,085 | 3.7
Administration | 13,273 | 15.8
Rehabilitation | 1,400 | 1.7
Others | 5,077 | 6.1
Total | 83,807 | 100.0
Table 3 Characteristics of respondents
Exploring items to drop
Because the EE and WLB domains are independent constructs in the TPSC, we analyzed them separately and describe the results domain by domain.
We begin with the EE domain and its nine items. First, we conducted a PCFA. From this exploratory factor analysis, we obtained factor loadings and factor scoring coefficients; we then identified four items showing clearly smaller loadings and scoring coefficients than the others, namely EE 1, 4, 8, and 9, as well as one item whose values were somewhat equivocal (EE 3). Next, we conducted an IRT GRM on the EE domain and drew IIF curves. We found that the same four items, and possibly EE 3, were underperforming. The PCFA and IRT GRM results showed only negligible differences, justifying our filtering process.
Due to space limits, we include only the IIF curves from the IRT GRM (Figure 1). Interpreting the IIF curves is straightforward: items with higher curves provide more precise information for calculating the EE domain score; in other words, items with lower curves are less effective, or rather redundant. In the IIF curves, EE 1, 4, 8, and 9 clearly provided negligible information, so we removed those four items. As in the PCFA, EE 3 was hard to judge decisively; thus, we decided to keep it at this step.
Figure 1 Item information function for emotional exhaustion domain.
Note X-axis is the level of respondents’ EE, scaled to zero mean and one standard deviation (SD)
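For readers who want to see what lies behind these curves, the following sketch computes an item information function under the GRM from a discrimination parameter and a set of ordered thresholds. The parameter values at the bottom are purely illustrative and are not the estimates fitted to the TPSC data.

```python
import numpy as np

def grm_item_information(theta, a, b):
    """Fisher information of one graded-response-model item.

    theta : latent trait values (array-like)
    a     : discrimination parameter
    b     : ordered category thresholds (K-1 values for K response categories)
    """
    theta = np.atleast_1d(np.asarray(theta, dtype=float))
    b = np.asarray(b, dtype=float)
    # Boundary (cumulative) probabilities P*(X >= k), padded with 1 and 0.
    p_star = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - b[None, :])))
    p_star = np.hstack([np.ones((len(theta), 1)), p_star, np.zeros((len(theta), 1))])
    # Category probabilities and their derivatives with respect to theta.
    p_cat = p_star[:, :-1] - p_star[:, 1:]
    dp_star = a * p_star * (1.0 - p_star)
    dp_cat = dp_star[:, :-1] - dp_star[:, 1:]
    # Fisher information of a categorical outcome: sum over categories of (P_k')^2 / P_k.
    return np.sum(dp_cat**2 / np.clip(p_cat, 1e-12, None), axis=1)

# Example: a highly discriminating item versus a weak one over theta in [-3, 3].
theta = np.linspace(-3, 3, 121)
info_strong = grm_item_information(theta, a=2.2, b=[-1.0, -0.2, 0.5, 1.3])
info_weak = grm_item_information(theta, a=0.8, b=[-1.5, 0.0, 0.8, 1.6])
print(info_strong.max(), info_weak.max())   # the weaker item's curve peaks far lower
```

In Stata, these curves can be drawn with irtgraph iif after fitting irt grm.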
For the WLB domain, we conducted the same analyses. Figure 2 shows the results in the IIF curve format. Clearly, WLB 5, 6, and 7 provided significantly less information; thus, we dropped these three items. Although not shown here, the PCFA gave the same result.
Figure 2 Item information function for work–life balance domain.
Note X-axis is the level of respondents’ WLB, scaled to zero mean and one standard deviation (SD)
Confirmation of retained items
To ensure that validity was not compromised by dropping items, we conducted CFAs and compared model fit indices between the full model with all the original items and the reduced (nested) model with only the items we chose. For the EE domain, we developed three models: model 1 with all nine original items; model 2 with items 2, 3, 5, 6, and 7 (recall that item 3 was equivocal and thus could be either dropped or retained); and model 3, the most parsimonious model, with items 2, 5, 6, and 7. In the following tables, asterisks mark values within a satisfactory range according to generally accepted model fit guidelines. The cutoff values for good model fit were alpha > 0.70, RMSEA < 0.08, CFI > 0.95, TLI > 0.95, and SRMR < 0.08, as summarized by Acock (2013), reflecting the recent trend toward more stringent standards.14
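Two of the reported indices are easy to reproduce and check by hand. The sketch below defines Cronbach's alpha and a small helper that flags which fit indices satisfy the cutoffs listed above; it is a self-contained illustration, and the index values fed to it at the end are simply those reported for model 3 in Table 4.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of the total score)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_variances / total_variance)

# Cutoffs used in this study (Acock 2013), stored as (direction, threshold).
CUTOFFS = {"alpha": (">", 0.70), "RMSEA": ("<", 0.08),
           "CFI": (">", 0.95), "TLI": (">", 0.95), "SRMR": ("<", 0.08)}

def check_fit(indices):
    """Return True/False per index, mirroring the asterisked cells in Tables 4 and 5."""
    result = {}
    for name, value in indices.items():
        direction, threshold = CUTOFFS[name]
        result[name] = value > threshold if direction == ">" else value < threshold
    return result

# Values reported for the reduced EE model (model 3) in Table 4.
print(check_fit({"alpha": 0.900, "RMSEA": 0.018, "CFI": 1.000,
                 "TLI": 0.999, "SRMR": 0.002}))
```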
As depicted in Table 4, model 3 (the most parsimonious model, with EE 2, 5, 6, and 7) showed the best values for all model fit indices—even better than the full model—while satisfying the general guidelines introduced earlier.
Index | Model 1 | Model 2 | Model 3
Included items | All 9 items | EE 2, 3, 5, 6 and 7 | EE 2, 5, 6 and 7
Alpha | 0.920* | 0.908* | 0.900*
RMSEA | 0.198 | 0.176 | 0.018*
CFI | 0.831 | 0.953* | 1.000*
TLI | 0.774 | 0.905 | 0.999*
SRMR | 0.075* | 0.038* | 0.002*
Table 4 Model fit indices of EE domain item combinations
Note Asterisks (*) indicate values within the satisfactory range for each index
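The CFA itself was run in Stata, but an equivalent comparison can be sketched with an open-source SEM package. The snippet below assumes the third-party semopy package and a hypothetical item-level data file; the lavaan-style model syntax and the calc_stats helper are semopy conventions, and the exact set of indices it reports may differ from Stata's output.

```python
# Hedged sketch of the confirmatory step using the semopy package (pip install semopy).
# File name and column names are hypothetical placeholders.
import pandas as pd
import semopy

df = pd.read_csv("tpsc_2016.csv")

full_model = semopy.Model("EE =~ ee1 + ee2 + ee3 + ee4 + ee5 + ee6 + ee7 + ee8 + ee9")
reduced_model = semopy.Model("EE =~ ee2 + ee5 + ee6 + ee7")

full_model.fit(df)
reduced_model.fit(df)

# calc_stats returns a table of fit statistics (chi-square, CFI, TLI, RMSEA, ...),
# which can then be compared against the cutoffs described above.
print(semopy.calc_stats(full_model).T)
print(semopy.calc_stats(reduced_model).T)
```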
Finally, we compared the EE domain scores from the three models, computed as the simple average of the included items, as is our usual practice. The correlation coefficients were 0.980 between models 1 and 2, 0.992 between models 2 and 3, and 0.961 between models 1 and 3, suggesting that the reduced-item models produced EE domain scores that differed little from those of the full model with all the original items. As a result, using a model with only four items (EE 2, 5, 6, and 7) was a statistically reasonable choice.
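Because each domain score is just the mean of the included items, this comparison is one line per model; the sketch below shows the computation, again with hypothetical column names.

```python
import pandas as pd

df = pd.read_csv("tpsc_2016.csv")                                 # hypothetical item-level export

score_full = df[[f"ee{i}" for i in range(1, 10)]].mean(axis=1)    # model 1: all 9 items
score_m2 = df[["ee2", "ee3", "ee5", "ee6", "ee7"]].mean(axis=1)   # model 2
score_m3 = df[["ee2", "ee5", "ee6", "ee7"]].mean(axis=1)          # model 3

# Pearson correlations between the competing domain scores
# (reported in the article as 0.980, 0.992, and 0.961, respectively).
print(score_full.corr(score_m2))
print(score_m2.corr(score_m3))
print(score_full.corr(score_m3))
```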
Unlike the EE domain, the WLB domain did not include any equivocal items to address. Therefore, we developed and compared only two models: model 1 with all seven original WLB items and model 2 with WLB 1, 2, 3, and 4. As demonstrated in Table 5, where asterisks again mark values satisfying the model fit criteria, model 2 showed markedly better fit than model 1 with all the original items. RMSEA was the only criterion that model 2 failed to meet, and only marginally (0.095 against a cutoff of 0.08); overall, the results strongly suggest that model 2 is the optimal item set for the WLB domain. The correlation between the scores from the two models was 0.951, suggesting that the four-item questionnaire derived from model 2 can serve as a stable surrogate for the full seven-item questionnaire; indeed, it might be a better instrument.
Index | Model 1 | Model 2
Included items | All 7 items | WLB 1, 2, 3 and 4
Alpha | 0.894* | 0.868*
RMSEA | 0.149 | 0.095
CFI | 0.916 | 0.991*
TLI | 0.873 | 0.972*
SRMR | 0.053* | 0.017*
Table 5 Model fit indices of WLB domain item combinations
Note Asterisks (*) indicate values within the satisfactory range for each index
When diagnosing a patient, we order many tests in the hope that more tests will yield a more precise diagnosis. There is no problem in doing so as long as the tests are supported by strong clinical evidence. Yet there may be a problem of practicality: too many tests place an avoidable burden not only on patients, but also on the hospital system itself. As a result, system-wide, the benefit from an extensive battery of tests could be smaller than we expected. We can expand on this idea using the development of a statistical model as an analogy: when we build a multiple regression model, we are advised to ensure model parsimony, because a model with too many covariates is generally penalized.15,16 In the diagnostic analogy, the covariates correspond to the ordered tests, while the dependent variable is the disease to be diagnosed. Clearly, the best-case scenario is predicting the dependent variable with the smallest possible number of covariates, provided precision does not decrease.
Administering a survey questionnaire to improve quality and safety in healthcare organizations is similar: more information could be collected with more items in a questionnaire, but there is no doubt that already overworked healthcare professionals would suffer from completing such a survey. More importantly, the burden of a long questionnaire increases the refusal and drop-out rates as well as the number of missing values. Because this missingness caused by the length of the survey is neither missing completely at random nor missing at random, modern imputation methods are, by definition, not easily applicable.17 List-wise deletion cannot solve the problem either, because of the same missingness structure and because too many responses would be lost.18-20 All in all, we should control our greed and keep the length of the entire survey questionnaire within a range manageable for healthcare professionals' busy schedules. Of course, the validity of the shortened version must be ensured.
This study distinguishes itself from similar studies by using the IRT and factor analysis approaches in tandem, and it is important to explain why. The two approaches are, by definition, built on different paradigms. Traditional factor analysis assumes the measurement scale is linear21: basically a ratio scale, or at least an interval scale, on which we can freely compute central tendency and variation.3 However, in many survey instruments, including the TPSC analyzed here, responses are measured on an ordinal scale, usually a Likert scale, which is not linear at all. Indeed, we can never say that the difference between “disagree slightly” and “neutral” equals the difference between “neutral” and “agree slightly.”22 Therefore, applying a traditional factor analysis approach to the TPSC is, technically speaking, inappropriate.23
In contrast, IRT, especially the GRM, treats responses as ordinal rather than ratio or interval, as it should.24,35 This is well illustrated in Figures 1 and 2, which depict the IIF curves of the items. Note that these are curves, not straight lines, with humps produced by the transitions between response categories. The IIF curves give us an intuitive grasp of the relative importance of each item.12 Such information naturally begs the question: why did we use traditional factor analysis in conjunction with IRT instead of sticking to IRT alone? The answer lies in history. The original instruments were developed under the factor analysis paradigm; thus, applying the traditional factor approach and the relatively new IRT paradigm simultaneously is more useful than it might appear. Each methodology serves as a safeguard for the other. Had the results from the two methods differed, that would have signaled the need for a deep, exhaustive examination of each item. In this study, the results from the two converged, indicating that our findings were reliable.
The first step, the exploratory factor analysis using PCFA, warrants more explanation. We used PCFA because of its popularity,14 and we had prepared other factor analysis methods as fallbacks in case its communality assumption did not hold,14,21 although they turned out to be unnecessary. For exploratory factor analysis, then, there is considerable room to choose a different method for each specific situation. So why conduct such a seemingly unstable analysis? Conducting an exploratory factor analysis has more meaning than merely screening out less important items. In this study, the results showed that the current EE and WLB domain items each aligned well under their corresponding single latent factor, regardless of the presence of less informative items. This result provides empirical evidence that IRT's unidimensionality assumption was not violated,26 which allowed us to proceed to the IRT GRM and IIF calculation. Although more sophisticated approaches exist for checking and handling this assumption,27,28 we chose not to go further, given one of the goals of this study: providing guidelines for non-statistician practitioners.
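As a quick, non-statistician-friendly illustration of that unidimensionality check, one common informal signal is a first eigenvalue of the item correlation matrix that clearly dominates the second; a first-to-second ratio of roughly 3 to 4 or more is often cited as a rule of thumb. The sketch below assumes hypothetical WLB column names and is not our exact procedure.

```python
import numpy as np
import pandas as pd

df = pd.read_csv("tpsc_2016.csv")                        # hypothetical item-level export
items = df[[f"wlb{i}" for i in range(1, 8)]]             # the seven WLB items

eigvals = np.linalg.eigvalsh(items.corr().to_numpy())[::-1]   # descending eigenvalues
print("eigenvalues:", np.round(eigvals, 2))
print("first/second ratio:", round(eigvals[0] / eigvals[1], 2))  # a large ratio suggests one dominant factor
```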
In the last step, the CFA of the candidate models had a significant implication, well beyond simply choosing the best item combination. For both the EE and WLB domains, the reduced models showed better model fit than the full model with all the original items. Several scenarios can explain this, such as differences in language and cultural soil between where an instrument was developed and where it is adapted and administered.29 This suggests that we need not be obsessive about translating and using every item in the original version, as we habitually do. Instead, checking the potential to downsize is preferable, considering both the burden on participants and the possibility of even better model fit. It cannot be overstated that more items do not always guarantee higher validity.
Readers may wonder how far the number of items can be reduced. Conventionally, each domain in a CFA with multiple domains should include at least two items.30 Yet we strongly recommend at least four items, especially when a domain-specific score is the key information the survey aims to produce; two items are too few to guarantee reliable estimates of the central tendency and spread of a domain score. Furthermore, testing differential item functioning (DIF) is an issue to consider. Although seldom tested and adjusted for, DIF is a major potential source of bias. To analyze DIF with responses on an ordinal scale such as a 5-point Likert scale, at least four items are needed to obtain stable results, even with Markov chain Monte Carlo methods.31,32
Although briefly mentioned earlier, we would like to reiterate that every step shown in this article could be done solely with IRT. Even exploratory and confirmatory factor analyses can be performed under the name of item factor analysis.33,34 However, few survey instruments in healthcare have been developed and validated with IRT. Thus, until all, or at least the majority of, instruments are developed or updated using IRT, our two-pronged approach fills the gap, serving both IRT-oriented and traditional factor analysis-oriented personnel.
It is hard to resist the temptation to include more items in a survey questionnaire, especially when we have the “more is better” mindset. Yet it may be time to reconsider that groundless axiom. Indeed, we showed that more may not always be better; rather, it could be worse in terms of both instrument validity and length. If you get feedback from your colleagues that they are too busy to complete a long survey, it is an indisputable signal for you to examine your instrument and consider item reduction. In the healthcare setting, we should never forget that completing a survey takes up participants’ precious time that could be used for direct patient care.
We close this article by quoting Mark Twain: “I did not have enough time to write short.” Yes, it is a paradox, but it may be the very thing we face when developing a survey questionnaire. It is time to stop living Mark Twain’s joke. We have too many lives to save.
This study was approved by the Institutional Review Board of National Cheng Kung University in Taiwan (B-ER-106-194-T). A safety culture questionnaire survey of hospital employees seldom requires IRB approval; nevertheless, we obtained one with future studies in mind. We hope the methodology introduced in this article helps your endeavors. Please do not hesitate to contact the corresponding author for the Stata commands used in all analyses in this article.
None.
©2017 Liao, et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and building upon the work non-commercially.