Measuring Agreement Between Diagnostic Devices: Summary

The recommended method for diagnosing sleep apnea is polysomnography. As physicians and the public have gained awareness of sleep apnea, demand for the investigation of patients suspected of having this disorder has steadily increased, which in many laboratories has resulted in unacceptably long waiting lists. This has prompted increasing interest in portable monitoring that does not require the patient to be studied in a laboratory. These devices use various combinations of the signals that are commonly recorded during polysomnography. Some monitors still record the EEG and electromyogram, which allows for sleep staging and the calculation of an apnea-hypopnea index (AHI), that is, the total number of apneas and hypopneas per hour of sleep time. Those that do not record the EEG and electromyogram quantitate “respiratory disturbances” and use total monitoring time as the denominator to determine the respiratory disturbance index. The validity of portable monitors for investigating patients with suspected sleep apnea generally has been studied by comparing their results with those of standard sleep-laboratory-based polysomnography. A systematic review of this research has recently been completed by a joint working group of the American College of Chest Physicians, the American Thoracic Society, and the American Academy of Sleep Medicine using the principles outlined in this methodology review.

There are several commonly used approaches for assessing how well two different methods agree. Each approach has its strengths and weaknesses. Pearson product-moment correlation coefficients have the advantage of being commonly used and of having a scale that is easily understood. However, they can be misleading and therefore are not recommended. Intraclass correlation coefficients compare total variability among patients, measurement variability, and measurement error. They are statistically better than product-moment coefficients but are not intuitive to physicians and thus are not commonly used.

Ultimately, a clinician could accept that the measurement of breathing events using a portable monitor does not agree completely with polysomnography, as long as the monitor accurately classifies patients as having or not having sleep apnea. However, since a large number of patients have index values around the usual cut-off point, a patient’s classification could change simply because of expected variability in the measurement.

Measuring agreement between two different clinical measurements using a product-moment correlation coefficient can be misleading. Two methods may correlate perfectly yet be measured on different scales, so that their values never coincide. The size of the coefficient also depends on the range of values being compared: a wider range tends to produce a higher correlation, yet this does not necessarily reflect greater agreement between the two methods. For these reasons, product-moment correlation coefficients are not recommended as a statistic to describe how well two methods of measurement agree.
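
The point about scale can be illustrated with a minimal sketch. The index values are hypothetical (not patient data), and the second "method" is an invented one that always reports half of the reference value; `pearson_r` is an illustrative helper, not taken from the review.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical index values (events per hour) for six patients; not real data.
reference = [5.0, 10.0, 20.0, 30.0, 40.0, 60.0]
# Invented second method that reports exactly half the reference value.
half_scale = [value / 2 for value in reference]

r = pearson_r(reference, half_scale)
mean_difference = sum(a - b for a, b in zip(reference, half_scale)) / len(reference)

print(f"correlation r = {r:.3f}")
print(f"mean difference = {mean_difference:.2f} events/h")
```

The correlation is essentially perfect even though the two sets of values disagree for every patient, which is why the mean difference is reported alongside it here.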

A widely accepted method of measuring agreement is the approach proposed by Bland and Altman, in which the difference between the two measurements for each subject is determined. The mean difference provides an estimate of whether the two methods, on average, return a similar result. The limits of agreement, conventionally the mean difference plus or minus 1.96 standard deviations of the differences, describe the range within which most individual differences are expected to fall. The use of logarithmic transformation of differences between polysomnography and portable monitors is recommended. When it is done, however, the quoted limits of agreement can be misleading and should be interpreted with caution.
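
A minimal sketch of the Bland-Altman calculation follows. It assumes the conventional limits of agreement (mean difference plus or minus 1.96 sample standard deviations of the differences) and uses hypothetical paired index values; the function name `bland_altman` and the variable names are illustrative only.

```python
import statistics

def bland_altman(method_a, method_b):
    """Return (mean difference, lower limit, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    # Conventional 95% limits of agreement (an assumption of this sketch).
    lower = mean_diff - 1.96 * sd_diff
    upper = mean_diff + 1.96 * sd_diff
    return mean_diff, lower, upper

# Hypothetical paired index values (events per hour); not real patient data.
psg = [4.0, 12.0, 18.0, 25.0, 33.0, 47.0]
portable = [6.0, 10.0, 21.0, 22.0, 35.0, 41.0]

mean_diff, lower, upper = bland_altman(psg, portable)
print(f"mean difference: {mean_diff:.2f} events/h")
print(f"limits of agreement: {lower:.2f} to {upper:.2f} events/h")
```

A mean difference near zero with wide limits would suggest that the methods agree on average but can differ substantially for an individual patient.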

The operating characteristics of tests are often summarized as sensitivity and specificity. Using them to describe the utility of a diagnostic test has some limitations, since sensitivity is the probability that the test result will be positive if the patient has the disease, and specificity is the probability that the test result will be negative if the patient does not have the disease. Clinicians cannot apply these numbers directly, because they do not know whether or not the patient has the disease. What the physician wants to know is the reverse conditional probability: the probability that the patient has the disease given that the test result is positive or negative.

When two methods of measurement do not completely agree, the potential user of the test should understand how sensitivity, specificity, and pretest probability interact. Together they dictate the number of tests that will, on average, come back positive or negative, the percentage of positive results that will be false-positive, and the percentage of negative results that will be false-negative. The thresholds of sensitivity and specificity at which a test can exclude or confirm a diagnosis in a substantial percentage of cases, and the acceptable rate of false results, will be affected by several factors, such as the potential risk to a patient of a false-negative or false-positive test result.
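
The interaction described above can be sketched numerically. The sensitivity, specificity, and pretest probabilities below are hypothetical values chosen only for illustration, and `expected_results` is an illustrative helper rather than anything defined in the review.

```python
def expected_results(sensitivity, specificity, pretest_probability, n_patients=1000):
    """Expected test outcomes for n_patients; inputs are proportions in [0, 1]."""
    diseased = pretest_probability * n_patients
    healthy = n_patients - diseased
    true_pos = sensitivity * diseased
    false_neg = diseased - true_pos
    true_neg = specificity * healthy
    false_pos = healthy - true_neg
    positives = true_pos + false_pos
    negatives = true_neg + false_neg
    return {
        "positive_tests": positives,
        "negative_tests": negatives,
        # Share of positive results that are wrong (1 - positive predictive value).
        "false_positive_share": false_pos / positives if positives else float("nan"),
        # Share of negative results that are wrong (1 - negative predictive value).
        "false_negative_share": false_neg / negatives if negatives else float("nan"),
    }

# Hypothetical operating characteristics, evaluated at three pretest probabilities.
for pretest in (0.2, 0.5, 0.8):
    result = expected_results(sensitivity=0.9, specificity=0.8,
                              pretest_probability=pretest)
    print(f"pretest {pretest:.0%}: "
          f"{result['positive_tests']:.0f} positive, "
          f"{result['negative_tests']:.0f} negative, "
          f"{result['false_positive_share']:.1%} of positives false, "
          f"{result['false_negative_share']:.1%} of negatives false")
```

With sensitivity and specificity held fixed, the share of false results shifts with the pretest probability alone, which is the dependence the paragraph describes.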

Although sensitivity and specificity are often used to infer the utility of a diagnostic test to exclude or confirm a disease, either of these statistics, when considered in isolation, can be misleading. This is because positive and negative predictive values depend on the combination of sensitivity and specificity. The utility of a test for excluding or confirming a disorder can be captured in a single number, the likelihood ratio (LR). The LR for a positive test result is the ratio of the proportion of patients with disease who have a positive test result to the proportion of people without disease who have a positive test result. Similarly, the LR for a negative test result is the ratio of the proportion of patients with disease who have a negative test result to the proportion of people without disease who have a negative test result.
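
The two ratios defined above can be written in terms of sensitivity and specificity, as in the following minimal sketch. The conversion to a post-test probability through odds is a standard step that the text does not spell out, and is included here as an assumption; the function names and input values are hypothetical.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR for a positive result, LR for a negative result).

    Assumes 0 < specificity < 1 so that neither denominator is zero.
    """
    # P(positive | disease) / P(positive | no disease)
    lr_positive = sensitivity / (1 - specificity)
    # P(negative | disease) / P(negative | no disease)
    lr_negative = (1 - sensitivity) / specificity
    return lr_positive, lr_negative

def posttest_probability(pretest_probability, likelihood_ratio):
    """Apply an LR to a pretest probability via odds (assumes pretest < 1)."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    posttest_odds = pretest_odds * likelihood_ratio
    return posttest_odds / (1 + posttest_odds)

# Hypothetical operating characteristics, for illustration only.
lr_pos, lr_neg = likelihood_ratios(sensitivity=0.9, specificity=0.8)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.3f}")
print(f"post-test probability after a positive result: "
      f"{posttest_probability(0.5, lr_pos):.1%}")
print(f"post-test probability after a negative result: "
      f"{posttest_probability(0.5, lr_neg):.1%}")
```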

When the question is whether a portable monitor can reduce the probability that a patient has sleep apnea, the focus is on sensitivity. A high sensitivity will result in a low number of false-negative results and a low LR for a negative test result. Conversely, when the question is whether a portable monitor can increase the probability of sleep apnea, the focus is on specificity. A high specificity will result in a low number of false-positive results and a high LR for a positive test result. Using different thresholds for positive and negative results generates multiple combinations of sensitivity and specificity, and therefore different LRs, which can be confusing.
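
How the choice of threshold changes the operating characteristics can be sketched with a small, entirely hypothetical data set of ten patients, each with a portable-monitor index and a reference diagnosis. The thresholds, the data, and the helper `operating_point` are assumptions made for illustration.

```python
import math

def operating_point(patients, threshold):
    """Return (sensitivity, specificity, LR+, LR-) when index >= threshold is positive.

    Assumes the data contain at least one patient with and one without disease.
    """
    tp = sum(1 for index, diseased in patients if diseased and index >= threshold)
    fn = sum(1 for index, diseased in patients if diseased and index < threshold)
    fp = sum(1 for index, diseased in patients if not diseased and index >= threshold)
    tn = sum(1 for index, diseased in patients if not diseased and index < threshold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Guard the denominators: no false positives (or no true negatives) makes
    # the corresponding ratio unbounded, reported here as infinity.
    lr_positive = math.inf if fp == 0 else sensitivity / (1 - specificity)
    lr_negative = math.inf if tn == 0 else (1 - sensitivity) / specificity
    return sensitivity, specificity, lr_positive, lr_negative

# Hypothetical (portable index in events/h, disease present on reference test).
patients = [(3, False), (6, False), (8, False), (12, False), (14, True),
            (18, False), (22, True), (28, True), (35, True), (50, True)]

for threshold in (5, 15, 30):
    sens, spec, lr_pos, lr_neg = operating_point(patients, threshold)
    print(f"threshold {threshold}: sensitivity {sens:.2f}, specificity {spec:.2f}, "
          f"LR+ {lr_pos:.2f}, LR- {lr_neg:.2f}")
```

In this made-up data set, a low threshold favors sensitivity and a low LR for a negative result, while a high threshold favors specificity and a high LR for a positive result.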

Although there are several approaches to measuring agreement between two methods of measurement, such as portable monitoring and polysomnography, each one has limitations. The two recommended approaches are the following: (1) the Bland-Altman calculation of mean differences and limits of agreement; and (2) sensitivity, specificity, and LRs.

Questions:

State the reasons why portable monitoring devices are now being used. Also, identify the commonly used approaches for assessing agreement between portable monitors and laboratory polysomnography, and state their strengths and weaknesses. Lastly, differentiate the Bland-Altman approach (mean differences and limits of agreement) from the approach based on sensitivity, specificity, and LRs.