The Chinese version of the Safety Attitudes Questionnaire (SAQ-C) was developed and tested in 2007. It consists of 5 domains: teamwork climate (TC, 5 items), safety climate (SC, 6), job satisfaction (JS, 5), perception of management (PM, 10), and working conditions (WC, 4). A problem with PM was that it had too many items: 4 of them were asked twice, once about unit management and once about hospital management. In many Asian countries, this distinction between management levels does not yield useful information, so each pair was collapsed into a single item referring to “management in this work setting”. In addition, 2 overly general items were dropped, leaving PM with 4 items. The resulting compact 24-item instrument was named the Taiwanese Patient Safety Culture Survey (TPSC).
We then validated TPSC, but took a different road from previous work. Thus far, almost all survey-instrument validations in healthcare have relied on linear regression-based confirmatory factor analysis (CFA). However, survey responses more often than not use a Likert scale, which is ordinal rather than interval (continuous), so the habitual use of CFA has been technically incorrect, whether through negligence or ignorance. For a scientifically sound analysis, we used multidimensional item response theory (MIRT), which handles ordinal responses properly. To check model fit, we used limited-information goodness-of-fit tests based on the M2* statistic, which accommodate the sparse contingency table.
In the first quarter of 2009, invitation letters encouraging participation were circulated among hospitals in Taiwan. From April 1 to December 31, TPSC was administered to healthcare professionals on a voluntary basis. Because the process was paper-based, returned questionnaires were entered into a computer database manually. A total of 23,999 questionnaires were returned, of which 4,596 were missing the hospital-level variable and were dropped because we intended to build a multigroup model by hospital level.
We examined each item and domain and their parameters, such as factor loadings and variances/covariances; all were satisfactory. We then investigated the overall model fit using M2*-based goodness-of-fit (GOF) statistics. The root mean square error of approximation (RMSEA) was 0.03 (cut-off < 0.06), and the non-normed fit index (NNFI, or Tucker-Lewis index) was 0.98 (cut-off > 0.95), indicating that TPSC is well validated within the MIRT (IRT) framework. In addition, we conducted the (albeit theoretically flawed) CFA, and its model fit was also satisfactory.
Ultimately, we can safely say that TPSC, the downsized version of SAQ-C, is well validated under either the classic linear regression-based CFA or the newer IRT-based approach. This validation study will help researchers conduct their own studies with TPSC.
“Everything boils down to culture.” For anyone who works in the field of patient safety, it does not take long to recognize the truth of this statement. Yet this consensus did not emerge at the very beginning of the patient safety era. Rather, we thought that simply changing care processes and adding automated systems would suffice to drive the adverse event rate down, perhaps even to zero. Such naïve optimism was dispelled before long. Many researchers have examined why such efforts, typified by physical improvements, could not solve the problem completely. As the evidence grew, healthcare professionals reached an agreement: safety culture is a must-have ingredient for true improvement in patient safety.1–4
The evidence was clear: history shows that even when the very same safety improvement program is implemented, its effectiveness, even whether it succeeds or fails at all, varies widely with the cultural background of the target setting, be it a hospital or an entire country. This suggests that we must either tailor the program to the cultural background of the place where it will be transplanted or influence the culture so that it accepts the program without resistance, much like terraforming.5,6
Figuratively, such a recommendation presupposes that we know and understand the topography of the safety culture in the place we are interested in. Without this information, we could never plan, execute, or evaluate a program precisely. Peter Drucker once said, “If you can't measure it, you can't improve it”7; this certainly applies to cultural issues in patient safety. Once the need for measurement was recognized, safety enthusiasts set about developing instruments without a single “nay”. Among the several safety culture measurement instruments,8 one of the most popular worldwide is the Safety Attitudes Questionnaire (SAQ). Developed by Bryan Sexton, the original SAQ consists of six domains: teamwork climate (TC), safety climate (SC), job satisfaction (JS), perception of management (PM), stress recognition (SR), and working conditions (WC).9 It provides views of safety culture from different angles.
Taiwan has been a trailblazer in adopting the SAQ. With permission from the original developer, Bryan Sexton, Taiwanese researchers translated the SAQ into Chinese to develop the SAQ-Chinese version (SAQ-C). During translation, they found that two items, one from TC (“In this work setting it is difficult to speak up if I perceive a problem with patient care”) and one from SC (“In this clinical area, it is difficult to discuss errors”), did not work well, and these were dropped. Their factor loadings were around 0.30, probably because negatively worded questions do not perform as intended in Chinese. In addition, the SR domain was a completely different animal among the original 6 SAQ domains, not only in the Chinese version but also in several other countries.10–12 As Jeong et al. showed using a bifactor model, SR does not belong in the SAQ,12 so the whole SR domain (4 items) was discarded. The pilot version of SAQ-C was administered to volunteers in Taiwan, and 45,252 questionnaires were returned. With very positive results, the instrument officially received the name “SAQ-C” in 2008.13
Despite its successful validation, SAQ-C has been criticized on two counts. First, although SAQ-C uses a 5-point Likert scale, which is ordinal, the developer's scoring formula treats it as an interval (think continuous) scale. This issue can invalidate any results derived from SAQ-C. Second, compared with the other domains, PM contains 10 items, which is, practically speaking, too many; it is suspected of being inefficient, leading respondents to leave several items blank until eventually the whole questionnaire has to be scrapped. These items are listed in Table 1.
The goal of this study is clear. We developed a method that treats the Likert scale as the ordinal scale it is, and we re-validated the instrument with this new method. In the process, we shrank the PM domain to 4 items for efficiency. As Table 1 shows, these 4 items are the most representative among the slightly tweaked versions of the original 10 items.
Developing TPSC
Through several discussion sessions, an expert group decided to remove 6 items from the PM domain, in addition to the already removed SR domain (4 items), one item from TC, and one from SC; thus, TPSC has 12 items fewer than the original SAQ. Table 1 summarizes the modifications in the PM domain. Many Asian countries, including Taiwan, do not distinguish between unit managers and hospital management. Specifically, healthcare professionals seldom have a chance to see the C-suite people, so SAQ-C respondents had a hard time answering such items. Eventually, we combined these two management levels into one (see Table 1). In addition, PM5 and PM6 in SAQ-C did not give meaningful information; they contained overly general ideas and thus only increased the burden on respondents. In sum, TPSC had 24 items across 5 domains (the detailed items and domain list are given in Table 3).
Data collection
In the first quarter of 2009, invitation letters encouraging participation in TPSC were sent out to hospitals in Taiwan. From April 1 to December 31, TPSC was administered to volunteers. The process relied on a paper-based version of the questionnaire, so returned questionnaires were entered into a computer database manually.
Model development to validate TPSC
We built an MIRT model with a correlated-factor structure, allowing the domains (latent traits) to correlate with one another. In addition, reflecting the consensus in Taiwan's medical community that hospital level is closely related to quality and safety, we added a multigroup structure based on the hospital-level variable. Readers may think of this as a kind of control, much like a categorical covariate, although technically it is not. Under this structure, equality constraints on the item parameters are applied equivalently across all hospital levels represented by the participants.17
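For readers less familiar with MIRT, the sketch below illustrates the basic building block such a model rests on: category probabilities for a single 5-point item under a graded response parameterization, one common choice for ordinal items. The parameterization, slope, and threshold values here are illustrative assumptions rather than parameters from this study; the actual multigroup, correlated-factor model was estimated in flexMIRT.

import numpy as np

def grm_category_probs(theta, a, thresholds):
    # Graded response model: probability of each of K ordered categories
    # for one item, given latent trait value(s) theta.
    #   theta      : latent trait score(s) on the item's domain
    #   a          : item slope (discrimination)
    #   thresholds : K-1 increasing category thresholds
    theta = np.atleast_1d(theta)
    # Cumulative probabilities P(X >= k) for k = 1..K-1 (2PL-type curves)
    cum = 1.0 / (1.0 + np.exp(-a * (theta[:, None] - np.asarray(thresholds))))
    # Pad with P(X >= 0) = 1 and P(X >= K) = 0, then take adjacent differences
    ones = np.ones((theta.size, 1))
    zeros = np.zeros((theta.size, 1))
    cum_full = np.hstack([ones, cum, zeros])
    return cum_full[:, :-1] - cum_full[:, 1:]   # P(X = k), k = 0..K-1

# Illustrative 5-point item with slope 2.0 and spread-out thresholds
probs = grm_category_probs(theta=[-1.0, 0.0, 1.0], a=2.0,
                           thresholds=[-2.0, -0.8, 0.3, 1.5])
print(probs.round(3), probs.sum(axis=1))   # each row of probabilities sums to 1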
Checking the model fit
Generally, classic Bock-Aitkin expectation-maximization (BAEM) estimation provides various goodness-of-fit (GOF) indices. However, with ordinal data such as a 5-point Likert scale, we cannot harness the popular full-information GOF tests; for example, the X2 and G2 statistics did not work because the contingency table was far too sparse.18,19 We therefore had to use a limited-information GOF test based on the M2* statistic, even though this drastically reduced the number of available GOF indices.18,20
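To see just how sparse the underlying table is, a quick back-of-the-envelope calculation suffices; the figures below use only numbers already stated in this article (24 items, 5 response categories, 19,403 analysable questionnaires).

# Why full-information GOF tests (X2, G2) break down here: the complete
# contingency table has one cell per possible response pattern.
n_items, n_categories, n_respondents = 24, 5, 19_403
n_cells = n_categories ** n_items                       # 5**24 response patterns
print(f"{n_cells:.2e} cells in the full table")         # about 5.96e+16
print(f"{n_respondents / n_cells:.2e} respondents per cell on average")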
All analyses were performed using the item response theory software package flexMIRT 3.51 (Vector Psychometric Group, LLC, Chapel Hill, North Carolina).17
Characteristics of respondents
A total of 23,999 questionnaires were returned, among which 4,596 questionnaires were missing the hospital-level variable. As we were seeking to build a multigroup model, we dropped those questionnaires. Table 2 summarizes the characteristics of the remaining 19,403 questionnaires used in the MIRT analysis.
As in many other studies administering questionnaires to healthcare professionals, most questionnaires were returned by females, which is not surprising given that most nurses are female.4,16,21,22 In addition, within each characteristic, the number of respondents varied considerably across categories: female nurses aged 20–40 formed the majority, and regarding hospital level, 58.8% of questionnaires came from medical centres, which have large workforces. Some might suggest checking the representativeness of the sample; however, one of the strengths of the IRT framework is that its estimates are not significantly influenced by the respondent mix.23
Running the MIRT model and its results
Table 3 describes all item-level parameters. First, we checked the factor loadings: every loading was at least 0.67 (TC1), satisfying the generally accepted threshold of 0.5,25 and they ranged up to 0.94 (JS3 and JS4). Because the loadings varied substantially, a simple mean cannot be justified as a domain score; instead, scores should be obtained as factor scores.
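IRT item parameters are often reported as slopes (discriminations), whereas the summary above is stated in terms of loadings. The sketch below shows one common way the two metrics relate, assuming logistic-metric slopes and the usual scaling constant D ≈ 1.702; the loading values 0.67 and 0.94 come from Table 3, but the implied slopes are purely illustrative.

import math

def slope_to_loading(a, D=1.702):
    # Convert a logistic-metric IRT slope to a standardized factor loading,
    # via the normal-ogive correspondence: lambda = a*/sqrt(1 + a*^2), a* = a/D.
    a_star = a / D
    return a_star / math.sqrt(1.0 + a_star ** 2)

def loading_to_slope(lam, D=1.702):
    # Inverse mapping, e.g. to see what slope a given loading implies.
    return D * lam / math.sqrt(1.0 - lam ** 2)

# Illustrative check against the range of loadings reported in Table 3
print(round(loading_to_slope(0.67), 2))  # slope implied by the lowest loading
print(round(loading_to_slope(0.94), 2))  # slope implied by the highest loading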
As Table 4 indicates, all correlations between the 5 domains are high (the lowest, between JS and WC, was 0.77). Although calculating individual participants' domain scores is not the topic of this article, obtaining them requires this variance/covariance matrix; otherwise, we ignore the blueprint of the complex structure among the domains and can achieve nothing better than simple mean scores.12
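To make this concrete, the following minimal sketch shows where the variance/covariance matrix enters when computing expected a posteriori (EAP) factor scores for a single respondent on two correlated domains. Everything here is hypothetical except the 0.77 correlation quoted above: the two items, their slopes and thresholds, and the graded response form are illustrative assumptions, and real scores would of course come from the full 24-item multigroup model.

import numpy as np
from scipy.stats import multivariate_normal

def grm_prob(theta, a, thresholds, category):
    # P(response = category | theta) for one item under a graded response model
    cum = 1.0 / (1.0 + np.exp(-a * (theta - np.asarray(thresholds))))
    cum = np.concatenate(([1.0], cum, [0.0]))
    return cum[category] - cum[category + 1]

def eap_two_domain_scores(responses, items, corr, n_points=31):
    # EAP factor scores for one respondent on two correlated domains.
    #   responses: dict item -> observed category (0..4)
    #   items:     dict item -> (domain index, slope, thresholds)
    #   corr:      latent correlation between the two domains
    grid = np.linspace(-4.0, 4.0, n_points)
    t1, t2 = np.meshgrid(grid, grid, indexing="ij")
    points = np.column_stack([t1.ravel(), t2.ravel()])
    # The correlated latent prior is where the variance/covariance matrix enters
    prior = multivariate_normal(mean=[0.0, 0.0],
                                cov=[[1.0, corr], [corr, 1.0]]).pdf(points)
    likelihood = np.ones(points.shape[0])
    for name, category in responses.items():
        domain, a, thresholds = items[name]
        likelihood *= np.array([grm_prob(p[domain], a, thresholds, category)
                                for p in points])
    posterior = prior * likelihood
    posterior /= posterior.sum()
    return posterior @ points          # posterior mean on each domain

# Hypothetical two-item illustration; parameters are made up, not Table 3 values
items = {"JS_item": (0, 2.0, [-2.0, -0.8, 0.3, 1.5]),
         "WC_item": (1, 1.8, [-1.8, -0.6, 0.4, 1.6])}
print(eap_two_domain_scores({"JS_item": 4, "WC_item": 2}, items, corr=0.77))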
We examined GOF with fit indices based on the M2* statistic.18,20 The root mean square error of approximation (RMSEA) was 0.03 (cut-off < 0.06), and the non-normed fit index (NNFI, or Tucker-Lewis index) was 0.98 (cut-off > 0.95), suggesting TPSC is well validated in the MIRT framework.
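For reference, an RMSEA of this kind can be derived from the M2* statistic with a standard formula. The values plugged in below are hypothetical, chosen only to show the mechanics (the article does not report the M2* value or its degrees of freedom); only the sample size of 19,403 is taken from this study.

import math

def rmsea_from_m2(m2, df, n):
    # RMSEA derived from a limited-information M2-type statistic.
    # A standard form is sqrt(max(M2 - df, 0) / (df * N)); some software uses
    # N - 1 in the denominator, which is negligible at this sample size.
    return math.sqrt(max(m2 - df, 0.0) / (df * n))

# Hypothetical M2* and df chosen purely for illustration; N is from this study
print(round(rmsea_from_m2(m2=18_500.0, df=1_000.0, n=19_403), 3))   # ~0.030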
Traditional CFA and MIRT
In addition to the MIRT-based validation, we also conducted a typical linear regression-based CFA without the IRT component and found that most GOF indices were satisfactory, although the details are not reported here. Thus, TPSC is a validated instrument in both the traditional linear CFA and IRT frameworks. Even so, we still recommend MIRT. Linear regression-based CFA theoretically cannot handle an ordinal scale such as the 5-point Likert scale TPSC uses, although in the field CFA is all too often used where it should not be. Again, treating a Likert scale as a linear continuous scale is just like using simple linear regression for dichotomous data instead of logistic regression. Another reason we favour MIRT is that it yields the finest possible granularity in the results, giving safety managers surgical precision regarding each respondent's information.
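The dichotomous analogy above is easy to demonstrate with simulated data (no study data are involved): fitting an ordinary linear regression to a binary outcome produces predicted “probabilities” that escape the [0, 1] range, whereas a logistic model does not.

import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
x = rng.normal(size=(500, 1))
# Simulate a binary outcome whose true relation to x is logistic
p = 1.0 / (1.0 + np.exp(-2.5 * x.ravel()))
y = rng.binomial(1, p)

grid = np.linspace(-3, 3, 7).reshape(-1, 1)
ols = LinearRegression().fit(x, y)
logit = LogisticRegression().fit(x, y)

print(ols.predict(grid).round(2))                # values fall below 0 / above 1
print(logit.predict_proba(grid)[:, 1].round(2))  # stays inside [0, 1]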
When to revalidate
At this point, readers might ask why we conducted a validation study on an already validated instrument. We propose two scenarios. The first is what we described in this article. To aid understanding, consider the following analogy: TPSC (the instrument) is a fish. So far, the fish has lived in seawater (the world of classic CFA). Now we want to move the fish to freshwater (the IRT paradigm). Before this migration, we must test whether the fish can survive in a radically different type of water. This is the instrument validation that we demonstrated with MIRT in the earlier sections.
Second, imagine a different situation: the fish keeps growing until it needs a new fishbowl. We know the relationship between the fish and the fishbowl has changed. There are many such examples: a change of language, a change in the number of items, or a modification of content. Technically, any change to the instrument is a signal for validation. But is it possible, or even practical, to conduct a validation study so often? How big a change constitutes a non-ignorable signal for validation? As always, this is left to the researcher's discretion. The change from SAQ-C to TPSC definitely warranted validation. However, combining the considerations that follow with IRT may relax this stringent need for validation.
Classic test theory and IRT
A decade ago, we all sat in classrooms taking exams, and many countries still use this method for college entrance exams. In the universe of classic test theory (CTT),25 literally any change triggers the validation process; even changing the order of a couple of items calls for re-validation. CTT regards the whole test instrument as a complete kit, so a small change means creating a new version of the survey. In addition, environmental factors during administration, such as noise and modality, must be controlled if this approach is used for an entrance exam. Furthermore, items cannot be reused, so building an item bank or applying computer-adaptive testing makes no sense.26,27 In sum, in the world of CTT no change can slide; pure CTT enthusiasts might claim that even the fonts should stay the same.
IRT does not work that way. Of course, instruments must be validated in this realm as well, but the biggest difference between IRT and CTT is that IRT is item oriented, not test oriented. A practical scenario: item 1 (Table 3) used the word “nurse” in the original SAQ rather than “staff”; in SAQ-C, because respondents had trouble grasping the real meaning of the power gradient in Chinese, the researchers changed the term. Under CTT, this instrument would theoretically require a full-scale validation and, if it passed, would be called TPSC ver. 2.0. In an item-oriented method like IRT, however, the impact of such a change is minimal as long as local independence holds, although such changes should still be reported to academia. Finally, IRT results are not influenced by who takes the survey, and such characteristics are easy to examine numerically.28,29
Time to shift gears to IRT in the field of safety culture
Thus far, only a few researchers in patient safety culture have utilized IRT, despite its superior applicability to various measurement scales. There may be many reasons for this, such as insufficient computing power to run such demanding analyses quickly, but, to be honest, few people really understand how it works. This article has confined itself to validation; real-world uses of IRT in safety culture surveys can be found in previously published articles.28,30,31