Mini Review Volume 8 Issue 3
Georgetown University, USA
Correspondence: Othmar W Winkler, Professor emeritus, Georgetown University, USA
Received: May 11, 2019 | Published: May 27, 2019
Citation: Winkler OW. A Statistical mystery resolved. Biom Biostat Int J. 2019;8(3):101-102. DOI: 10.15406/bbij.2019.08.00278
During consulting work, a regression analysis of the salaries and lengths of employment of a group of professional women gave an implausible, counterintuitive result. The resolution of this statistical mystery revealed a common, unrecognized misunderstanding of the nature and interpretation of regression.
Keywords: interpreting regression lines, sex-discrimination in government
Everybody dealing with data will, at one time or another, employ regression analysis. This very unusual case arose during the exploratory phase of the data analysis for a sex-discrimination lawsuit.1 The 32 librarians of that government agency, 16 male and 16 equally qualified female, appeared to be ideally suited to begin the discovery of the claimed discrimination in that professional workforce. Simple linear regressions of salary on length of employment, computed separately for the male and the female librarians, were a first approach to reveal the supposed existence and nature of sex discrimination. Though expecting differences between these two regressions, the author was incredulous at the women’s regression and unprepared to make sense of it. The fundamental insight gained by resolving this statistical puzzle should be of general interest.
The relationship of ‘Salary and Length of Service’ for the male librarians (Figure 1) was SALARY (M) = $16,900 + $1,380 × ERDAEMPL, where ERDAEMPL is the length of employment at ERDA in years; in other words, “the average starting salary of these 16 male librarians at ERDA was $16,900, with a yearly average salary increase of $1,380,” which appeared to be reasonable. The relationship for the female librarians, with comparable academic degrees and the same duties, was SALARY (W) = $26,500 − $1,020 × ERDAEMPL. The slope β = −$1,020 indicated that for each additional year of service the salaries of those female librarians were reduced, on average, by $1,020 (Figure 2), while their entrance salaries were highest when they were first employed. This just did not make sense, defying every experience with employment.
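To make the contrast between the two fitted lines concrete, they can be evaluated side by side. The following is a minimal sketch in Python that uses only the coefficients reported above; the function names and the chosen evaluation points are illustrative assumptions, not part of the original analysis.

```python
# Coefficients as reported in the text; all names here are illustrative.
MEN_INTERCEPT, MEN_SLOPE = 16_900, 1_380
WOMEN_INTERCEPT, WOMEN_SLOPE = 26_500, -1_020


def fitted_salary(intercept, slope, years):
    """Value of a fitted line at a given length of employment (in years)."""
    return intercept + slope * years


def crossing_point(a1, b1, a2, b2):
    """Length of employment at which two fitted lines give the same salary."""
    return (a2 - a1) / (b1 - b2)


if __name__ == "__main__":
    for years in (0, 2, 4, 6, 8, 10):
        men = fitted_salary(MEN_INTERCEPT, MEN_SLOPE, years)
        women = fitted_salary(WOMEN_INTERCEPT, WOMEN_SLOPE, years)
        print(f"{years:>2} years: men ${men:,}   women ${women:,}")
    cross = crossing_point(MEN_INTERCEPT, MEN_SLOPE, WOMEN_INTERCEPT, WOMEN_SLOPE)
    print(f"lines cross at {cross} years")
```

With these coefficients the two lines cross at four years of employment, where both give $22,420; for shorter employment the women’s line lies above the men’s, and for longer employment it lies below.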
My first impression was that this obviously had to be an error in the data or some mix-up in the computer program. The person responsible for these data, however, swore that they represented the situation correctly and that the computer program worked fine. This assurance was trustworthy because this ‘discovery of facts’ had been ordered by the District judge. I had hoped for errors in the data or in the computer program as the explanation of this implausible result, but no such easy explanation of that puzzling regression became available. Before continuing to the next section, I would like to invite the reader to stop reading and think of a probable solution to the conundrum of Figure 2.
Resolving the paradox
While I was searching for reasons to explain this negative slope, which implied that the female librarians at ERDA were paid $1,020 less for every additional year they were employed there, it dawned on me that the error was in the interpretation, not in the data. The usual interpretation of the slope β as the change in Y corresponding to a one-unit change in X implicitly assumes a longitudinal, dynamic situation. Yet both regressions, that of the male and that of the female librarians, represented a ‘static cross-section’ of the situation at the time of this lawsuit. These data required a cross-sectional, static understanding of β, as if these 16 women, at the time of this lawsuit, were lined up for a group photo. Imagine the recent hires, employed the shortest time, standing to the left of the group, with low values on the horizontal axis of the graph but with the largest salaries, their Y-values. The longer-employed librarians, hired in previous years and decades, standing to their right with larger X-values, happened to have the lower incomes, their Y-values. Of any two female librarians in this line-up whose length of employment differed by one year, the one employed longer, hired a year earlier, earned the lower income, smaller on average by $1,020. β was the average salary difference between any two of these 16 female librarians whose length of employment differed by one year.2 The key to resolving this puzzle was recognizing these data as a ‘cross-section’ at a given point in time, instead of wrongly interpreting them as the ‘longitudinal development’ of their salaries over the years.
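The difference between the two readings of β can be illustrated with a small synthetic example. The sketch below, in Python, uses entirely hypothetical numbers (not the ERDA data): every simulated employee receives a raise every year, yet the regression across employees at the snapshot date has a negative slope, because those hired earlier started from much lower salaries. The helper `ols_slope`, the raise rate, and all salary figures are assumptions made for illustration.

```python
# All numbers below are hypothetical, chosen only to illustrate the argument.
ANNUAL_RAISE = 0.03  # every employee gets the same percentage raise each year

# years employed at the snapshot date -> starting salary when hired
START_SALARY = {1: 25_000, 3: 22_000, 5: 19_000, 10: 12_000, 15: 8_000, 20: 6_000}


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def salary_after(start, years):
    """Salary after `years` of compounding raises from a starting salary."""
    return start * (1 + ANNUAL_RAISE) ** years


# Cross-section: one observation per employee, all taken at the snapshot date.
years_employed = list(START_SALARY)
snapshot_salary = [salary_after(START_SALARY[y], y) for y in years_employed]
cross_sectional_slope = ols_slope(years_employed, snapshot_salary)

# Longitudinal: one employee followed through every year of her own career.
longest = max(years_employed)
career_years = list(range(longest + 1))
career_salary = [salary_after(START_SALARY[longest], t) for t in career_years]
longitudinal_slope = ols_slope(career_years, career_salary)

print(f"cross-sectional slope: {cross_sectional_slope:,.0f} per year of employment")
print(f"longitudinal slope (one career): {longitudinal_slope:,.0f} per year")
```

In this sketch the cross-sectional slope is negative while the slope along any single career is positive: the first compares different people at one date, the second follows one person over time.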
How did this unbelievably dismal situation happen?
The employment histories of these women revealed that those who entered ERDA before the promulgation of ‘Title VII’ (anti-sex-discrimination legislation signed into law in 1969 and expanded in 1974) had been hired decades earlier at starting salaries that were substantially lower than the starting salaries of the male librarians hired at that time, and also lower than the salaries of female librarians who were hired after the anti-sex-discrimination laws had been enacted. A colleague opined that the earlier-hired female librarians had had fewer educational opportunities and less professional education available to them. This was used as justification for their lower salaries. All librarians had received raises of similar percentages, but for the longer-tenured female librarians, because of their lower starting incomes, these amounted to smaller pay increases. The existence of these discrepancies in 1975 was due to management’s improper, selective implementation of the anti-sex-discrimination laws. ERDA’s management had assumed that only the averages of the women’s salaries in a department, not individual cases of discrimination, would be checked for compliance with anti-discrimination laws.
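The effect of equal percentage raises on unequal starting salaries is simple arithmetic. The following Python sketch uses hypothetical starting salaries and a hypothetical raise rate (none of these figures come from the case) to show that the same percentage yields a smaller dollar raise on the lower salary, so the dollar gap widens every year.

```python
# Hypothetical starting salaries and raise rate, for illustration only.
LOW_START, HIGH_START = 10_000, 15_000
RAISE_PERCENT = 5


def salaries_over_time(start, percent, years):
    """Salary path beginning at `start`, with the same percentage raise each year."""
    path = [float(start)]
    for _ in range(years):
        path.append(path[-1] * (100 + percent) / 100)
    return path


low_path = salaries_over_time(LOW_START, RAISE_PERCENT, 10)
high_path = salaries_over_time(HIGH_START, RAISE_PERCENT, 10)

# Dollar value of the first raise for each starting salary.
first_raise_low = low_path[1] - low_path[0]
first_raise_high = high_path[1] - high_path[0]

# Dollar gap between the two salaries in each year.
gaps = [high - low for low, high in zip(low_path, high_path)]

print(f"first raise on the lower salary:  ${first_raise_low:,.0f}")
print(f"first raise on the higher salary: ${first_raise_high:,.0f}")
print(f"gap at hire: ${gaps[0]:,.0f}; gap after 10 years: ${gaps[-1]:,.0f}")
```

Under these assumed figures the first raise is $500 on the lower salary and $750 on the higher one, and the gap between the two salaries grows in every subsequent year.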
The statistical paradox of female librarians who appeared to be paid less the longer they were employed originated in the failure to recognize the cross-sectional, static nature of the data when interpreting the regression. This misinterpretation was subconsciously encouraged by the preceding, equally inappropriate interpretation of the regression of the male librarians, an error easily overlooked because their positive correlation of length of employment and income seemed to agree with common sense and general experience. In conclusion, the peculiar circumstances that led to this lawsuit revealed the unnoticed, preferred custom of interpreting regression lines longitudinally, as a dynamic “average change in Y for a change or increase in X,” even when the data are a cross-section that does not warrant such an interpretation. This incorrect interpretation of the women’s data created the puzzle of an obviously implausible employment situation. Had it not been for this very unusual employment situation, such misinterpretations of the regression line would have continued to go unrecognized.
This class-action lawsuit, by the way, had a happy ending. The women won a decisive victory against their agency, proving convincingly, through statistics, the presence of substantial social and economic discrimination.
©2019 Winkler. This is an open access article distributed under the terms of the, which permits unrestricted use, distribution, and build upon your work non-commercially.