Pandora's Bots: Watson Crushes Jeopardy Champions in Game 1 of 2!

I just watched the first two nights of the IBM Watson Jeopardy Challenge (http://www-943.ibm.com/innovation/us/watson/) on my Tivo...woohoo! First, let me talk about the non-game portions of the show.

Interspersed with the actual gameplay are polished PR segments from IBM, giving some very limited background on Watson, but mostly talking about how awesome and game-changing it is and how they expect to sell the systems to everyone in every industry. There's not much meat to the info...I think even a completely uninformed viewer isn't going to come away with a good understanding of what's going on.

I also found it interesting that they are playing in a Jeopardy studio actually built on IBM premises. I suppose that they didn't want to have to rely on a third-party network connection from Watson's room on the East coast to a West-coast studio. It'd be pretty embarrassing to be brought low by simple light-speed lag. But it made me wonder - I'm fairly trusting of IBM's integrity on this, but what does the average person think of the contest being held on IBM property?

Anyway, on to the good stuff! The first round was a little slow, although Watson picked a clue and found the first Daily Double almost immediately. I'm very curious what its strategy is for which clue to pick at what time. Over the entire first game, I couldn't identify a rationale (but then, I can't do that for human players either, unless they use a top-down, left-right pattern, which many do). I'd also like to know what its wagering formula is. It's clear that it has one: on the first Daily Double it wagered $1000 (the max, since it only had $200), but when it found another, it picked a "weird" number, $6435.

Edit: I found some background on the clue-picking strategy. Apparently they did some statistical analysis of where daily doubles are found, and Watson plays the numbers, making its highest priority to get all the daily doubles. After that it goes for the cheap clues in categories, so it can refine/confirm its interpretation of any word-play in the category title with minimal score risk.
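
To make that two-phase idea concrete, here's a rough sketch of how such a picker might look. This is purely my own illustration, not IBM's code: the `pick_clue` function, the `dd_prior` table and the numbers in it are all made up, and the real system surely weighs a lot more than this.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[str, int]  # (category, dollar value); hypothetical board representation


def pick_clue(available: Set[Cell],
              dd_prior: Dict[Cell, float],
              dds_remaining: int) -> Cell:
    """Pick the next clue using the two priorities described above.

    dd_prior is an ASSUMED table of how likely each cell is to hide a
    Daily Double; the values used below are invented for illustration.
    """
    if not available:
        raise ValueError("no clues left on the board")
    if dds_remaining > 0:
        # Phase 1: hunt Daily Doubles. Take the cell with the highest prior;
        # break ties by preferring the cheaper clue.
        return max(available, key=lambda c: (dd_prior.get(c, 0.0), -c[1], c[0]))
    # Phase 2: no Daily Doubles left, so take the cheapest clue to probe a
    # category's word-play with minimal score risk.
    return min(available, key=lambda c: (c[1], c[0]))


board = {("Literature", 200), ("Literature", 800), ("Science", 400)}
prior = {("Literature", 800): 0.3, ("Science", 400): 0.1}  # made-up numbers
print(pick_clue(board, prior, dds_remaining=1))
print(pick_clue(board, prior, dds_remaining=0))
```

With a Daily Double still hidden, the sketch goes for the cell with the best odds; once they're all found, it drops down to the cheapest clue on the board.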

The audience seemed surprised when it guessed on a DD question (and verbally admitted that it was a guess), but this makes perfect sense - when you make a DD wager, you lose that money if you don't answer at all, so there's no downside to guessing, unlike a normal question, where the confidence threshold is important to avoid losing money on wrong answers.
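
The reasoning is easy to check with a little expected-value arithmetic. The sketch below is my own simplification (the `should_answer` function and the break-even rule are assumptions, and it ignores game state, opponents and everything else Watson presumably considers), but it shows why the two cases differ.

```python
def should_answer(confidence: float, stake: int, is_daily_double: bool) -> bool:
    """Compare the expected score change of answering vs. staying silent.

    Simplified model (an assumption, not Watson's actual rule): a right
    answer wins `stake`, a wrong answer loses `stake`. Staying silent costs
    nothing on a normal clue but forfeits the whole wager on a Daily Double.
    """
    ev_answer = confidence * stake - (1 - confidence) * stake
    ev_silent = -stake if is_daily_double else 0
    return ev_answer >= ev_silent


# Daily Double: even a wild guess is never worse than saying nothing.
print(should_answer(0.05, 1000, is_daily_double=True))   # True
# Normal clue: a low-confidence answer has negative expected value.
print(should_answer(0.30, 400, is_daily_double=False))   # False
print(should_answer(0.80, 400, is_daily_double=False))   # True
```

In this toy model the break-even confidence on a normal clue is 50%, while on a Daily Double any guess at all is worth making.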

The first night's show was just the first half of the first game, and ended with Watson in a tie with Brad at $5000, with Ken trailing at $2000. I was pleased it was holding its own, at least. But in Double Jeopardy, it was off to the races, acing clue after clue. It did get a couple wrong, but it was building up such a huge lead that this hardly mattered at all.

Watson clearly had an advantage in the buzzer-timing department. Even though they made a point of showing in a video clip that Watson buzzed in by activating a mechanical actuator that pressed an actual buzzer button, it was apparent that Watson had a very good feel for the timing and, having verified that feel, could do it exactly the same way every time (unless it was actually still thinking about a question, which appeared to happen a couple of times). The human champions, on the other hand, had to "re-learn" the buzzer timing...in both Single and Double Jeopardy, they seemed frustrated early on by missed buzz-ins, but did much better towards the ends of the halves.

One clear disadvantage for Watson was the lack of any speech recognition or other input based on the other players' answers, because a couple of times a human buzzed in first, answered incorrectly, and then Watson buzzed in and repeated the same wrong answer. This felt a little embarrassing for everyone, and no one commented on it. It also highlighted a competitive edge for Watson, though - if a human contestant gets an answer wrong, let alone pulls a bonehead maneuver like repeating a wrong answer, they can get discouraged and take a few clues to bounce back. But Watson just keeps plugging along (although it's possible it adaptively adjusted its "confidence threshold" when it got something wrong, which could be interpreted as hesitancy or uncertainty I suppose).

It was also clear that as successful as its strategy was, you could not really say that Watson "understood" the language, based on the times when it went off the rails. The prime example was Final Jeopardy, where it answered not just wrong but categorically wrong: the category was a very straightforward "U.S. Cities", and it answered "Toronto????". It was endearing that it displayed its confusion with all the question marks, but not very impressive in terms of semantic skill.

But even with some shortcomings, Watson ended the first game with $35,734, a lead of over $25,000 over Brad (and $30k over Ken), which can't have been very fun for the humans, but had me jumping off the sofa and cheering!

The Challenge is a two-game combined total event, however, so technically the humans aren't out of it yet. I have a feeling they went back to their hotel rooms that night and practiced buzzing in for a couple hours before going to sleep.

Some final musings - the buzzer issue is interesting. Is this really a good test of AI vs human, given the artificial and arbitrary boundaries of the contest? Say there was no buzzer, just a 6-second time limit, and every contestant had the option to answer or not answer as they wished. How would Watson be doing against Ken and Brad in that context? My gut impression is that Watson would still be doing well, but clearly with not as huge a lead. I have a feeling that this is going to give people a very easy excuse should Watson go on to overall victory - "Yeah, sure, it's a computer, obviously it's going to buzz in quicker most of the time...but did you hear some of its answers? Artificial Stupid is more like it..."

It's ok, Watson. You can't hear those people anyway, so just keep doing your best, I'm proud of you!

2011/02/16
