Difference testing
As I type this, over on Twitter, Dr. Vino, Steve!, and lord knows where else, folks are rehashing—with considerable vitriol—arguments on the merits of 100-point wine ratings, or lack thereof. This got me thinking about what it takes to assign meaningful numerical value to a wine’s attributes—something I have had some experience with at points in my career where I was responsible for various research projects. In light of the current “discussions” surrounding the validity of wine reviews and point scales, I thought it might be of interest to explore what it takes in the research setting to evaluate wines to the objective standard that some feel wine reviewers should aspire to.

Define What Is “Better”

In any discussion of wine, in order to get beyond endless argument over personal opinion there has to be agreement on what constitutes “better”—exactly what is it that makes wine A superior to wine B. This is a non-trivial question that seems to be completely glossed over in the discussions of the merits of wine reviews. In my opinion, in a general sense there is no answer to this question. But my opinion aside, in order to put numerical values on wines there must be universal agreement on the value to be assigned to specific attributes. Simply put, in a research setting the first and most important question is: “what is the goal of this project?” For example, we might say “Chardonnay that shows more minerality, fruitiness, lack of vegetal notes, and creamy texture is better; our desire is to increase these attributes in our wine, so what can we do to increase these attributes?”

Set Up The Experiments

Perhaps we could explore the effects of canopy, crop load and irrigation management in the vineyard. Or maybe we could study options in fruit handling, processing temperature, juice settling, yeast selection, barrel choice, and lees stirring in the winery. First we have to define what we are willing to change, and then rigorously produce wines that reflect the range of these options, treating them as closely as possible to how we would in routine production. Ideally, we would do this over several vintages to eliminate uncontrolled seasonal variables in the results.

Train The Tasting Panel

Aye and here’s the rub. Training the tasting panel—more than one person; my preference is for 5 to 7 experienced tasters—is the single most critical control point in assuring that the evaluation of experimental results has any meaning. In the research setting, reference standards for the attributes being tested for must be established, e.g. from the example above: “this is what we mean by ‘mineral,’ this is ‘fruity,’ this is ‘vegetal’ and this is ‘creamy.’” Reference compounds are dosed into neutral wines, and the panel members are drilled to develop their ability to recognize them. If a reference can’t be reliably identified it has to be dropped from the trial. If a member can’t reliably identify a standard obvious to the rest of the panel, that taster has to be removed from the trial. (I recall hearing that Ann Noble at UC Davis used to reward her tasting panel trainees with cookies when they got good at picking out particular attributes. I never found that motivation all that useful, but then the panels I trained weren’t hungry students.)

Present The Trial

This is the easy part. The setting needs to be well lit without distracting sights, sounds, drafts, or especially aromas. The glasses need to be all the same and well-cleaned, without any residue of the cleaner. Importantly, the wines to be evaluated need to be presented to the panel on more than one occasion (3 to 5 seems to be optimal) and these evaluations should be made at the same time of day in each instance. Of course the samples are presented blind and in random order. Reference standards need to be included in the blind presentation—these are to control for panel members having a bad day; if a taster who is usually good at identifying the attributes fails on the standards, their results should be excluded. When I would evaluate a multivariate trial, the tasting sheet for each wine would have the attributes listed, with a 100mm long straight line next to each and the words “low” and “high” underneath the lines to the left and right, respectively. The tasters were required only to put a mark on the line indicating their perception of the intensity of each attribute. The protocol I most often employed was to present the trial wines in ensemble; the tasters were allowed to smell all, taste all, smell all again, and then mark their sheets. The trial wines were presented in different orders for each taster and in each tasting session.
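For readers who want to see the randomization concretely, here is a minimal Python sketch of drawing a separate blind presentation order for every taster in every session. It is only an illustration, not the tooling used in the trials described here: the wine, taster, and session labels, the `presentation_orders` function, and the fixed seed are all assumptions made for the example.

```python
import random


def presentation_orders(wines, tasters, sessions, seed=None):
    """Return {(session, taster): independently shuffled list of wines}."""
    rng = random.Random(seed)  # a seed makes the plan reproducible
    orders = {}
    for session in sessions:
        for taster in tasters:
            order = list(wines)  # copy so the master list is untouched
            rng.shuffle(order)
            orders[(session, taster)] = order
    return orders


# Hypothetical labels: four trial wines plus one blind reference standard.
wines = ["W1", "W2", "W3", "W4", "REF"]
orders = presentation_orders(
    wines, tasters=["T1", "T2", "T3"], sessions=[1, 2, 3], seed=42
)

for key in sorted(orders):
    print(key, orders[key])
```

Independent shuffles make repeated orders unlikely but do not rule them out. If every taster must see a strictly different order, the plan would need an explicit check for duplicates.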

Evaluate The Data

I would slap a ruler on each line and measure where each mark was located: 0mm to 100mm. Each record in the data set comprised the session ID, the taster ID, the trial wine ID, the attribute ID and the associated intensity “value.” A first cleanup pass on the data set would scrub the records for session/taster/attribute combinations where reference standards were poorly identified. The references I used were usually pretty obvious, so I somewhat arbitrarily set the cutoff at 60; i.e. if a taster failed to mark a standard at an intensity value of 60 or more, their session results for that attribute were excluded from the data set. Finally, I fed the data into statistics software to crunch the numbers. The most robust results came from principal component or factor analyses: non-parametric methods that maximize the variance in the observational data, and then rotate the experimental treatment axes relative to the observational vectors. In the example above, say if the wines produced from different crop loads grouped along the vector for perceived minerality, or perhaps the vector for perceived fruitiness, we could conclude that crop load affects these attributes of the wine.
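The cleanup pass can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the software actually used: the `Record` layout follows the fields listed above, but the `REF_...` wine IDs, the mapping from each reference to the attribute it was dosed for, and the sample values are all made up for the example. The resulting table of mean intensities per wine and attribute is the kind of input one might then hand to a statistics package for principal component or factor analysis.

```python
from collections import defaultdict, namedtuple
from statistics import mean

Record = namedtuple("Record", "session taster wine attribute value")

CUTOFF = 60  # mm on the 100mm line; the cutoff described in the text

# Assumption: each blind reference sample is dosed for exactly one attribute.
REFERENCES = {"REF_MINERAL": "mineral", "REF_FRUITY": "fruity"}


def failed_keys(records):
    """Find (session, taster, attribute) combos where the standard was missed."""
    failed = set()
    for r in records:
        if REFERENCES.get(r.wine) == r.attribute and r.value < CUTOFF:
            failed.add((r.session, r.taster, r.attribute))
    return failed


def clean(records):
    """Drop reference samples and every record tied to a failed combination."""
    failed = failed_keys(records)
    return [
        r for r in records
        if r.wine not in REFERENCES
        and (r.session, r.taster, r.attribute) not in failed
    ]


def mean_intensity(records):
    """Average the surviving marks per (wine, attribute)."""
    grouped = defaultdict(list)
    for r in records:
        grouped[(r.wine, r.attribute)].append(r.value)
    return {key: mean(values) for key, values in grouped.items()}


# Made-up sample data, for illustration only.
records = [
    Record(1, "T1", "REF_MINERAL", "mineral", 82),
    Record(1, "T1", "W1", "mineral", 40),
    Record(1, "T1", "W2", "mineral", 70),
    Record(1, "T2", "REF_MINERAL", "mineral", 35),  # missed the standard
    Record(1, "T2", "W1", "mineral", 90),
    Record(1, "T2", "W2", "mineral", 10),
    Record(1, "T2", "REF_FRUITY", "fruity", 75),
    Record(1, "T2", "W1", "fruity", 55),
]

kept = clean(records)
means = mean_intensity(kept)
print(means)
```

In this sample, taster T2's "mineral" marks for session 1 are discarded because the mineral standard was marked at 35, while the same taster's "fruity" mark survives. A combination with no reference record at all is retained here, which is a design choice of the sketch rather than something the protocol specifies.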

I don’t own any of the data I generated from my days as a researcher, and I worked for private companies that did not publish the results of the work I did. So to illustrate this kind of analysis I have lifted a pretty decent graphic from a published study exploring the effects of yeast selection on the attributes of Sauvignon Blanc (click to go to the published study). I leave it to the (very) interested reader to look deeper into this statistical approach.

The Bottom Line

What I have tried to convey here is not the method, but rather a sense of the level of rigor I believe is necessary to perform an objective evaluation of wine—to be able to conclude with reasonable certainty that one wine is “better” than another, according to some specific definition of what constitutes better-ness. Would it surprise anyone to find that I view any expectation of inviolate veracity for 100-point scores to be hopelessly naive? Given the work I have done, I have earned the right to tell y’all that any insistence that someone reviewing many wines a day can approach tasting with this level of rigor and reproducibility is misplaced to the point of irrationality.

In The Trenches

I have huge respect for anyone who reviews wines for a living. It is hard work. In the argument over the meaning of scores—inflated or not—I come down on the side that scores are a shorthand valued by a culture that views everything in terms of a competition and shuns relativism. I truly believe that most if not all reviewers would prefer not to use scores if they had a choice, but that consumers demand them. I also believe that a certain slice of consumers are 100% to blame for the expectation that scores must reflect some sort of absolute. I don’t see a single reviewer claiming omniscience, infallibility, or the inviolability of their scores or evaluation methodologies. And I don’t fault wineries for touting scores to move their product—that’s just good business sense. But anyone who buys by scores and truly expects that anybody’s 96 is objectively “better” than an 88, every time, to every person—as the sage said, there’s one of them born every minute.

Yeah, I said it. Oh yeah, I really did. Sucker.


Today I was reading the February 2012 issue of Road & Track magazine (in paper, thank you—I’ve been a subscriber for nearly 40 years) and was struck by this bit from the opening “Road Ahead” column by Editor-in-Chief Matt DeLorenzo:

‘What’s a good car?’ It’s a common question put to enthusiasts, yet impossible to answer because invariably part of the reason it’s asked is to validate the questioner’s own opinion. What really is a good car? More often than not, you end up engaging in Socratic dialogue to find out the person’s needs or wants before settling on an answer.

A better question is what car do you love? The beauty of this approach lies in its subjectivity, as opposed to the objectivity demanded by the ‘what’s a good car’ question. If someone is seeking your opinion, shouldn’t the answer be more subjective than objective? This also opens the door to allow passion to enter the discussion rather than simple data.

You can love a car for many reasons, both rational and irrational, the latter being eminently more fun than the former. So… we’ve decided to bring you a loose collection of cars we love. We aren’t saying these are the definitive best cars in the world, but rather cars worthy of not just your attention, but more importantly, your affection.

Subjectivity. Passion. Fun. Affection.

I could not agree more with Mr. DeLorenzo. As he suggests with cars, the joy of wine appreciation is sucked out by the “simple data” implied by scores. I have come to realize that when James Suckling gushes “I’m 100 points on that!” he is expressing enthusiasm and emotional honesty about a particular experience more than he is saying “this 100-point wine is objectively better than that 96-point wine.” Score inflation? Wines getting better? No, I don’t think it is either. I think that reviewers are just getting more enthusiastic about wines they love.

I wish reviewers would give up the pretense of objectivity. If we acknowledge that when a reviewer gives a high score it means they love that particular wine—no more, no less, and with no expectation that that score fits in a wider, objective context—we would all be happier. Kumbaya.