
9.8 Validation Studies


9.8.1 The Colorado Study[1] Overview

“This study was undertaken specifically to extend study of the SFSTs from the laboratory setting to field use. The primary study question was, ‘How accurate are officers’ arrest and release decisions when the SFSTs are used by trained and experienced officers?’”

“Officers from the cooperating law enforcement agencies, who were routinely assigned to traffic patrol and/or special DUI units and who were SFST-trained by NHTSA guidelines, were eligible to participate[2].”

“As illustrated by the figure, the decisions may be correct under two different conditions. A ‘Correct Arrest’ (Cell 1) occurs when an officer correctly decides, as confirmed by a chemical test, that the driver’s blood alcohol concentration (BAC) is at or above legally-defined limits for driving.”[3]

The primary problem with the study is that Colorado sets its DWAI per se limit at 0.05 BAC. So any arrest decision where the BAC was 0.05 or higher is deemed by NHTSA to be a “correct” decision. This not only contradicts the previous studies holding that the SFSTs are meant to measure a BAC of 0.10 or higher, but it is of little use to any State with a per se limit of .08. As defense counsel, I would often be happy to concede the Defendant’s BAC was 0.05. The study simply does not distinguish correct decisions between 0.05 and 0.08.

Study Results / Data

The 0.10% criterion for a DUI charge dictated the entry of 133 arrests into the decision matrix as correct decisions. In addition, the criterion of >0.05 to <0.10% for a DWAI charge dictated the entry of 30 additional arrests, for a total of 163 correct decisions[4]. Based upon that statement, roughly one-fifth of the arrestees had a BAC of <.10. Looking at the full data, 19 people were arrested with a BAC of <.08[5], while only 12 people had a BAC of .08 or .09[6]. So, by my count, there were 115 correct decisions at .08+ and 30 false arrests. This gives an accuracy of 79%. While this may seem high, the mean BAC of the drivers was .152[7]. An officer should, one hopes, be able to tell that someone with a BAC of .152 is impaired.
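For those who want to check my arithmetic, the calculation is simple (a minimal sketch in Python, using the figures above as I read them):

```python
# Colorado study re-scored at a .08 threshold, per the figures discussed above:
# 115 arrest decisions at a BAC of .08+ counted as correct, 30 arrests below .08 as false.
correct_arrests = 115
false_arrests = 30
accuracy = correct_arrests / (correct_arrests + false_arrests)
print(f"Accuracy at .08+: {accuracy:.0%}")  # prints "Accuracy at .08+: 79%"
```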

The FSTs were not administered or scored properly. NHTSA notes that in the study, the Walk and Turn was observed for “improper balance”[8]. This is not a standardized, validated clue in the manual. Further, it was the most frequently reported clue[9].

NHTSA even conceded that drivers with low and moderate BACs made many errors[10].

The second major source of error in the study is that the officers were given PBTs to measure the suspects’ BACs. While the officers were told to determine the BAC prior to administering the PBT, they often were unsupervised.

HGN

For released subjects with a BAC of <0.05, officers observed at least one sign of HGN in 15 out of 38 drivers[11]. “It is possible that lack of smooth pursuit and distinct nystagmus at maximum deviation occur at low BACs with some subjects but not with others, or on some occasions but not others. It is possible that these subjects had combined low levels of alcohol with some other nystagmus-producing substances. Also, of course, it is possible that the officers erred in their observations. Research has not yet clearly defined HGN signs for low BACs.”[12]

Officers Used Information Besides FSTs

The officers probably used all of the information available to them to make an arrest. “More than 80% (n=250) of the stops occurred because the officers observed driving behavior which led them to believe the driver might be impaired.[13]” Accordingly, the officer was not using just the FSTs to make an arrest determination.


9.9 San Diego Study[14]

In this study, NHTSA attempted to determine the accuracy of the FSTs at .08. “Seven officers of the San Diego Police Department’s alcohol enforcement unit were trained in the administration and modified scoring of NHTSA’s SFST battery (i.e., Horizontal Gaze Nystagmus-HGN, Walk and Turn, and One Leg Stand). SFST scoring was adjusted: the observation of four HGN clues indicated a BAC >= 0.08 percent (rather than four clues indicating a BAC >= 0.10 percent), and the observation of two HGN clues indicated a BAC >= 0.04 percent. During routine patrols, the participating officers followed study procedures in administering SFSTs and completing a data collection form for each test administered. The officers’ final step in each case was the administration of an evidentiary breath alcohol test.”[15]

“Only officers who were members of the San Diego Police Department’s alcohol enforcement unit and who received NHTSA-approved SFST training participated directly in the study. Dr. Marcelline Burns provided brief ‘refresher’ training to all participating officers to ensure a consistent and systematic approach to SFST administration during the study.”[16] Oddly, the refresher course used a 1995 NHTSA curriculum[17].

As you know from reading the previous sections, the scoring, interpretation, and other factors have changed over the years.

Like the previous Colorado study, the officers also based their arrest decisions on the driving and other observations, not just the three “adjusted” FSTs. Again, women were disproportionately underrepresented in this study, making up only 12% of those tested.

In this study, officers were provided PBTs but were told not to use them until after estimating the driver’s BAC. Only biased “project staff” witnesses were ever present, and project staff only “periodically” rode along with the officers[18].


9.9.1 Validation Study Data

While NHTSA touts an extremely high accuracy rate of 91%, the charts provided show that the false positive rates are quite high. The problem with NHTSA’s statistic is that it is based upon the whole sample, in which 72% of the drivers had a measured BAC >= .08[19].



Instead of looking at overall rates or correct releases, I am primarily interested in how many people were falsely arrested. 83 drivers were below the legal limit and should have been released, according to the premise that the FSTs can measure a BAC of .08. However, the officers arrested 24 of those people. This leaves an accuracy rate of 71%, much lower than the 91% NHTSA promotes. Even in a flawed study like this one, nearly 30% of the drivers under the legal limit of .08 were still determined to be over it.
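The math behind those percentages is straightforward (a quick Python check using the counts above):

```python
# San Diego study: 83 drivers had a BAC below the .08 legal limit; officers arrested 24 of them.
below_limit = 83
false_arrests = 24
correct_releases = below_limit - false_arrests  # 59 drivers correctly released
print(f"Accuracy for drivers under .08: {correct_releases / below_limit:.0%}")  # prints "Accuracy for drivers under .08: 71%"
print(f"False arrest rate: {false_arrests / below_limit:.0%}")                  # prints "False arrest rate: 29%"
```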

The individual rates for the FSTs are even worse.


Using the same guidelines to determine false arrests, the HGN alone was 67% accurate, the WAT was 47% accurate, and the OLS was 58% accurate. It is amazing that a test remains in use that is less than 50% accurate in estimating the BAC of an innocent driver who is under .08.

Further, the study likes to point out that some of the false positives were .07 and therefore within the margin of error for the PBT. Of course, NHTSA does not concede that results of .08 or higher are also within the margin of error and just as likely to be incorrect.

The study also provides two important quotes regarding field sobriety tests.

In regard to the HGN: “Horizontal gaze nystagmus lacks face validity because it does not appear to be linked to the requirements of driving a motor vehicle. The reasoning is correct, but it is based on the incorrect assumption that field sobriety tests are designed to measure driving impairment.”[20] “HGN’s apparent lack of face validity to driving tasks is irrelevant because the objective of the test is to discriminate between drivers above and below the statutory BAC limit, not to measure driving impairment.”[21]

Further, “Driving a motor vehicle is a very complex activity that involves a wide variety of tasks and operator capabilities. It is unlikely that complex human performance, such as that required to safely drive an automobile, can be measured at roadside… As a consequence, they pursued the development of tests that would provide statistically valid and reliable indications of a driver’s BAC, rather than indications of driving impairment. The link between BAC and driving impairment is a separate issue, involving entirely different research methods.”[22]


9.10 Florida Study[23]

The Florida study is the most recent validation study for all three FSTs. The report of the study is rather short and does not contain as much data as the other studies. At the outset, the study notes, “They reported that the test battery is valid for detection of low BAC’s and that no other measures or observations offer greater validity for BAC’s of 0.08% and higher.”[24] This could be particularly useful in an aggravated DWI case where the State has to show a BAC of .16 or higher; it appears all observations, including the FSTs, would be irrelevant. I am not sure how a chemical test is not a better measure of BAC, but NHTSA says “no other measures.”

In regard to attacking the original validation studies, NHTSA claims, “It is entirely appropriate to inquire whether that early research to identify a best set of sobriety tests was conducted with scientific rigor. Beyond that inquiry, however, the data, which were obtained in a laboratory setting and now are more than twenty years old, are of little interest.”[25] If the data and results are not important, maybe NHTSA can explain why it still mentions them in every manual.


9.10.1 Officer Experience Matters

“Experience and confidence have a direct bearing on an officer's skill with roadside tests.”[26] NHTSA concludes that more experienced officers do better at administering and scoring FSTs. “[R]ecognition of alcohol impaired drivers can be difficult and is, therefore, subject to error.”[27]


“DUI arrest decisions made by Florida law enforcement officers

1. Who have been trained under NHTSA guidelines to administer, score, and interpret the Standardized Field Sobriety Tests (SFST’s),

2. Who have developed experience and skill with the SFST’s,

3. Who use only the 3 test battery to examine suspected DUI drivers, and who do not have access to a preliminary breath tester (PBT) will be > 90% correct, as confirmed by measured BAC's.

The design of the study was dictated by the need to insure:

1. Standardization of SFST administration and interpretation,

2. Data integrity, and

3. Data completeness

The compromise of any of these requirements would have made interpretation of the obtained data both difficult and subject to question.”[28]

NHTSA probably used the best available officers to do this study. All 8 officers were DWI Instructors, and half were DRE’s[29].


9.10.2 Data Collection & Results

In regard to data collection, NHTSA recognizes, “Control of data collection is difficult but essential in a field study. Although the expression ‘garbage in, garbage out’ lacks elegance, it does aptly describe the consequences of a failure to control what goes on in the field.”[30]

The mean BAC of the drivers arrested was 0.15, and 37 drivers had a BAC between 0.20 and 0.284[31]. Again, officers should, one hopes, be able to correctly detect drivers with such a high BAC. “Not surprisingly, given that the mean BAC of arrested drivers was 0.15%, an observed loss of car control was the most frequently reported reason the deputies made vehicle stops. The cues with the highest frequency were ‘failure to maintain a single lane’ and ‘weaving within a lane.’”[32]

Again, focusing on those with a BAC of less than 0.08, the officers incorrectly arrested 9 out of 50. This is an 82% accuracy rate, assuming the BAC number is valid, which is questionable.
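The same calculation as in the earlier studies (sketched in Python with the counts above):

```python
# Florida study: 50 drivers had a BAC below 0.08; officers arrested 9 of them.
below_limit = 50
false_arrests = 9
accuracy = (below_limit - false_arrests) / below_limit  # 41 correct out of 50
print(f"{accuracy:.0%}")  # prints "82%"
```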

In determining BAC, NHTSA explains: “Evidential tests require two breath specimens, and the two measured BAC’s must not differ by more than 0.02%. If they do, a third specimen is required. Since only one BAC could be entered into the database, the means for the first and second specimens were compared. They were found to differ by only 0.002%. It is unlikely that the small difference could affect the analysis in any significant way, and arbitrarily the first obtained BAC has been used. The only exceptions are those cases where the officer noted that the first test was a ‘low blow’ and where only the second measurement was 0.08% or higher.”[33] NHTSA appears not to account for any “high blows,” and appears to be manipulating the data to get higher BACs.


9.10.3 Missing Data

53 cases were excluded from the study because the officers did tests other than the three approved FSTs[34]. If the officers could not properly follow the guidelines for the validation study, are they also doing other things incorrectly? NHTSA “suggests that other tests may have been used not only when it was not possible to administer some of the SFST's but also when assessment of impairment was particularly difficult”[35]. Accordingly, those borderline cases, which often produce the most inaccurate results, were excluded from the study.


9.10.4 Tests Administered

In regard to the FSTs themselves, the HGN was checked for “distinct nystagmus at maximum deviation”[36]. This differs from the present standard of distinct and sustained nystagmus at maximum deviation. Of further note, in 6 of the 9 false arrests, the officer noted 6/6 clues on the HGN[37].

The Walk and Turn test was improperly administered. The officers did not check for “incorrect number of steps[38].” As NHTSA likes to emphasize in all of the manuals, if the tests are not given or scored correctly, the validity is compromised. Accordingly, in this study, NHTSA “validated” a test that is not presently administered or scored the same as in the study. The results are therefore compromised.

NHTSA further recognized the false positives in the WAT: “Thirty-three drivers who were correctly released (mean BAC 0.033%) had been given the WAT test. If the decisions had been based solely on WAT, only ten of those drivers would have been released.”[39]


9.11 The Robustness of the Horizontal Gaze Nystagmus Test[40]

“It is concluded that HGN is a robust phenomenon.”[41]

This study should be titled the “unrobustness” of the HGN. The information in the study can greatly help the defense. Any officer who testifies that this study proves the effectiveness of HGN clearly has not read the study.

The study validated (in the minds of NHTSA) the HGN under its approved administration, as well as slight deviations.

“Within the standardized procedures specified in Table 2, there may be some variations in roadside test administration, but no evidence has been reported that these minor variations change either the occurrence of HGN signs or an officer’s observations of them. However, because this assumption has been challenged and because the topic had not been systematically examined, this study of the effects of a set of procedural variations was conducted. The general research hypothesis of this study was that the variations do not affect the accuracy of the HGN observations and the validity of conclusions based on them.”[42]

The problem, however, is that HGN is not at all accurate when administered 100% correctly.

There were three experiments in the study.

1: Changing the stimulus speed, passing, and elevation.

2: Driver’s posture.

3: Driver’s vision.

The officers used were, again, well trained. This helps defense counsel, however, because the results of the study show HGN is junk science.

NHTSA concludes that neither the position of the driver nor the driver’s vision matters. NHTSA also concludes that moving the stimulus too quickly will produce false negatives.

For defense purposes, it is not necessary to counter these arguments. In New Hampshire, the HGN test must be administered properly; otherwise it should be inadmissible. Accordingly, I would accept all of NHTSA’s conclusions about test variations and focus solely on whether the test was administered 100% correctly.

It is, however, important to point out that NHTSA keeps manipulating fail points for the HGN. The new magic formula for NHTSA is that 4 out of 6 clues on the HGN equals a BAC of .06 or higher[43]. (Based upon the chart below, it appears a .03 BAC with 4 clues is also deemed a correct determination.) As defense counsel, I would almost always be more than happy to concede a BAC of .06, or even .03, for a presumption of non-impairment.



9.11.1 Data for Lack of Smooth Pursuit in Test I


It is important to note that EVERY PARTICIPANT HAD AT LEAST 2 CLUES WHEN THE TEST WAS PROPERLY ADMINISTERED. NHTSA even comments on this: “At the two-second speed, LSP was reported for both eyes for all participants (Tables 10 and 11). A breakdown of pursuit movements is not expected at very low BACs, and it is interesting that it was already observed at .016 and .019 with both the two-second and the one-second speeds.”[44]

Also of importance: using 4/6 as a fail point, people failed the test with a BAC as low as .016.

9.11.2 Test 2



Again, we should only be concerned with the correct administration. We can see people failed the test with a BAC as low as .019, and showed 6 clues with a BAC as low as .047.



9.11.3 Test 3


It is interesting to note the bottom number. Based on NHTSA determining it was a false positive, and the fact that the numbers are otherwise decreasing, it appears to be a misprint. However, I cannot simply guess that it was supposed to be .012. Giving NHTSA the benefit of the doubt, the lowest false positive on this test is .022, and all 6 clues appear with a BAC as low as .029 (less than the presumptive standard of non-impairment).


9.11.4 Total Overall Statistics

Based upon my reading of these tables, there were 107 total tests administered. Instead of using NHTSA’s capriciously changing formula, I use the standard that 4 out of 6 clues equates to a BAC of .08 or higher to see how many decisions were correct. Out of the 107, there was only one case of a false release (BAC .08+ but with only 2 clues).

30 individuals had a BAC of .08+ (assuming the .120 is a typo in the last chart). All but one of them were deemed to have had 4+ clues, and therefore failed.

More important, however, is to see how the officers measured people under .08. 77 people had a BAC < .08. Of those, 51 PEOPLE HAD 4+ CLUES! The test is only 33% accurate at correctly classifying someone with a BAC of < .08 (sounds like Dahood should be revisited). And this is the test NHTSA always claims is the most accurate.
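My 33% figure comes directly from those two counts (a minimal check in Python):

```python
# Robustness study tables, as I read them: 77 participants had a BAC below .08,
# and 51 of them nonetheless showed 4 or more HGN clues.
below_limit = 77
four_plus_clues = 51
correctly_classified = below_limit - four_plus_clues   # 26 participants
print(f"{100 * correctly_classified // below_limit}%")  # prints "33%"
```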



[1] A Colorado Validation Study of the Standardized Field Sobriety Test (SFST) Battery, Burns et al., Sept. 1996. Again, because this study has no page numbers, my references refer to the page number of the .pdf file.

[2] Id. at 12.

[3] Id. at 8.

[4] Id. at 22.

[5] Id. at 77.

[7] Id. at 24.

[8] Id. at 26.

[10] Id. at 27-28.

[11] Id. at 29.

[12] Id. at 30.

[13] Id. at 30.


[15] Id. at iii.

[16] Id. at 13.

[17] Id. at 9.

[18] Id. at 11.

[19] Id. at 18.

[20] Id. at 27-28.

[21] Id. at 28.


[23] A Florida Validation Study of the Standardized Field Sobriety Test (S.F.S.T.) Battery, Burns et al.

[24] Id. at 2.

[25] Id. at 3.

[26] Id. at 3.


[28] Id. at 5.

[29] Id. at 6.

[30] Id. at 7.

[31] Id. at 12.

[32] Id. at 25.

[33] Id. at 11-12.

[34] Id. at 15.

[35] Id. at 16.

[36] Id. at 16.

[37] Id. at 17.

[38] Id. at 18.

[39] Id. at 18.

[40] The Robustness of the Horizontal Gaze Nystagmus Test, M. Burns, September 2007.

[41] Id. at 9.

[42] Id. at 2.

[43] Id. at 5.

[44] Id. at 5.