Public notes for CS6750 - HCI Spring 2022 at Georgia Tech

Public webpage for sharing information about Dr. Joyner's CS6750 - Human Computer Interaction course in Spring 2022.


3.6 Evaluation

The heart of user-centered design is getting frequent feedback from the users. That’s where evaluation comes into play. Evaluation is where we take what we’ve designed and put it in front of users to get their feedback. But just as different prototypes serve different functions at different stages of the design process, so also our methods for evaluation need to match. Early on, we want more qualitative feedback. We want to know what they like, what they don’t like, whether it’s readable, whether it’s understandable. Later on, we want to know if it’s usable. Does it actually minimize their workload? Is it intuitive? Is it easy to learn? Then at the end, we might want to know something more quantitative. We might want to actually measure, for example, whether the time to complete a task has changed, or whether the number of sales has increased. Along the way, we might also want to iterate even more quickly by predicting what the results of user evaluation will be. The type of evaluation we employ is tightly related to where we are in our design process. So in this lesson, we’ll discuss the different methods for performing evaluation to get the feedback we need when we need it.

3.6.2 - Three Types of Evaluation

There are a lot of ways to evaluate interfaces. So to organize our discussion of evaluation, I’ve broken these into three categories.

  1. The first is qualitative evaluation.
    • This is where we want to get qualitative feedback from users.
    • What do they like, what do they dislike, what’s easy, what’s hard.
    • We’ll get that information through some methods very similar, in fact identical, to our methods for need finding.
  2. The second is empirical evaluation.
    • This is where we actually want to do some controlled experiments and evaluate the results quantitatively.
    • For that, we need
    • many more participants,
    • and we also want to make sure we addressed the big qualitative feedback first.
  3. The third is predictive evaluation.
    • Predictive evaluation is specifically evaluation without users.
    • In user-centered design, this is obviously not our favorite kind of evaluation.
    • Evaluation with real users, though, is oftentimes slow and really expensive. So it’s useful for us to have ways we can do some simple evaluation on a day-to-day basis.

So we’ll structure our discussion of evaluation around these three general categories.

3.6.3 - Evaluation Terminology

Before we begin, there’s some vocabulary we need to cover to understand evaluation. These terms especially apply to the data we gather during evaluation. While they are particularly relevant for gathering quantitative data, they’re useful in discussing other kinds of data as well.

  1. The first term is reliability. Reliability refers to whether or not some assessment of some phenomenon is consistent over time.
    • So for example, Amanda what time is it? It’s about 2:30. Amanda what time is it? It’s about 2:30. Amanda, what time is it? It’s 2:30.
    • Amanda gives a very reliable assessment of the time.
    • Every time I asked, she gives me the same time.
    • We want that in an assessment measure. We want it to be reliable across multiple trials.
    • Otherwise, its conclusions are random and just not very useful.
  2. A second principle is validity. Validity refers to how accurately an assessment measures reality.
    • An assessment could be completely reliable but completely inaccurate.
    • So for example, Amanda, what time is it?
    • Oh my goodness, it’s 2:30! Actually it’s 1:30. Oh, shoot!
    • So while Amanda was a reliable timekeeper, she wasn’t a very valid one.
    • Her time wasn’t correct even though it was consistent.
    • Validity is closely connected to a principle called generalizability. Generalizability is the extent to which we can apply lessons we learned in our evaluation to broader audiences of people.
    • So for example, we might find that the kinds of people that volunteer for usability studies have different preferences than the regular user.
    • So the conclusions we draw from those volunteers might not be generalizable to the broader user base, and might not measure what we want to measure.
  3. Finally, one last term we want to understand is precision. Precision is a measurement of how specific some assessment is.
    • So for example, Amanda, what time is it?
    • Well apparently, it’s 1:30. Actually, it’s 1:31.
    • Come on! But in this case, no one’s really going to say that Amanda was wrong in saying that it was 1:30.
    • She just wasn’t as precise.
    • I could just as accurately say it’s 1:31:27,
    • but that’s probably more precision than we need.

As we describe the different kinds of data we can gather during evaluation, keep these things in mind: if we were to conduct the same procedure again, would we get the same results (reliability)? Do the results reflect what we are actually trying to measure (validity)? And are they specific enough to be useful (precision)?

3.6.4 - 5 Tips: What to Evaluate

In designing evaluations, it’s critical that we define what we’re evaluating. Without that, we generally tend to bottom out in vague assessments about whether or not users like our interface. So, here are five quick tips on what you might choose to evaluate.

  1. Number one, efficiency. How long does it take users to accomplish certain tasks?
    • That’s one of the classic metrics for evaluating interfaces.
    • Can one interface accomplish a task in fewer actions or in less time than another?
    • You might test this with predictive models, or you might actually time users completing these tasks (a small sketch of computing efficiency and accuracy from logged sessions appears after this list). Still though, this paints a pretty narrow picture of usability.
  2. Number two, accuracy. How many errors do users commit while accomplishing a task?
    • That’s typically a pretty empirical question although we can address it qualitatively as well.
    • Ideally, we want an interface that reduces the number of errors a user commits while performing a task.
    • Both efficiency and accuracy, however, examine the narrow setting of an expert user using an interface. So, that brings us to our next metric.
  3. Number three, learnability.
    • Sit a new user down in front of the interface. Define some standard for expertise. How long does it take the user to hit that level of expertise?
    • Expertise here might range from
    • performing a particular action to
    • something like creating an entire document.
  4. Number four, memorability.
    • Similar to learnability, memorability refers to the user’s ability to remember how to use an interface over time.
    • Imagine you have a user learn an interface,
    • then leave and come back a week later.
    • How much do they remember?
    • Ideally, you want interfaces that need only be learned once, which means high memorability.
  5. Number five, satisfaction.
    • When we forget to look at our other metrics, we bottom out in a general notion of satisfaction.
    • But that doesn’t mean it’s unimportant.
    • We do need to operationalize it though.
    • We can operationalize it as things like
    • users’ enjoyment of the system or
    • the cognitive load they experience while using the system.
    • To avoid social desirability bias, we might want to evaluate this in creative ways like finding out,
    • how many participants actually download an app they tested after the session is over?
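
As referenced above, here is a minimal sketch of computing the first two metrics (efficiency and accuracy) from logged evaluation sessions. The data and field names are hypothetical, not from the course; it just shows one way to operationalize time on task and error counts.

```python
# Hypothetical sketch: operationalizing efficiency (time on task) and
# accuracy (errors committed) from logged evaluation sessions.
from statistics import mean

# Each record is one participant's attempt at the task being evaluated.
sessions = [
    {"participant": "P1", "seconds_to_complete": 42.0, "errors": 1},
    {"participant": "P2", "seconds_to_complete": 55.5, "errors": 3},
    {"participant": "P3", "seconds_to_complete": 38.2, "errors": 0},
]

# Efficiency: average time to complete the task.
print("Mean time on task:", mean(s["seconds_to_complete"] for s in sessions))

# Accuracy: average number of errors committed during the task.
print("Mean errors per session:", mean(s["errors"] for s in sessions))
```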

Regardless of what you choose to evaluate, it’s important that you very clearly articulate at the beginning what you’re evaluating, how you’re defining it, and how you’re measuring it.

These three things should match up to address your research questions.

3.6.5 - Evaluation Timeline

When we discussed prototyping, we talked about how our prototypes grow in fidelity over the course of the design process. Evaluation follows a similar timeline: early on we want quick, qualitative feedback on low-fidelity prototypes; later we move toward more formal, empirical evaluation of higher-fidelity prototypes; and predictive evaluation lets us iterate quickly along the way.

3.6.6 - Evaluation Design

Regardless of the type of evaluation you’re planning to perform, there’s a series of steps to perform to ensure that the evaluation is actually useful.

  1. First, we want to clearly define the task that we’re examining.
    • Depending on your place in the design process this can be
      • very large or
      • very small.
    • If we were designing Facebook,
      • it can be as
        • simple as posting a status update, or as
        • complicated as navigating amongst and using several different pages.
    • It could involve
      • context and
      • constraints
      • like
        • taking notes
          • while running,
        • or looking up a restaurant address
          • without touching the screen.
      • Whatever it is, we want to start by
        • clearly identifying what task we’re going to investigate.
  2. Second, we want to define our performance measures.
    • How are we going to evaluate the user’s performance?
      • Qualitatively, it could be based on their
        • spoken or written
          • feedback about the experience.
      • Quantitatively, we can
        • measure efficiency in certain activities or
        • count the number of mistakes.
    • Defining performance measures helps us avoid confirmation bias.
    • It makes sure we don’t just pick out whatever observations or data confirm our hypotheses, or say that we have a good interface.
    • It forces us to look at it objectively.
  3. Third, we develop the experiment.
    • How will we assess users’ performance on the performance measures?
    • If we’re looking qualitatively
      • will we have them think out loud while they’re using the tool?
      • Or will we have them do a survey after they’re done?
    • If we’re looking quantitatively
      • what will we measure,
      • what will we control, and
      • what will we vary?
    • This is also where we ask questions about
      • whether our assessment measures are
        • reliable and
        • valid.
      • And whether the users we’re testing are generalizable.
  4. Fourth, we recruit the participants.
    • As part of the ethics process,
      • we make sure we’re recruiting participants
        • who are aware of their rights and contributing willingly.
  5. Then fifth, we do the experiment.
    • We have them walk through what we outlined when we developed the experiment.
  6. Sixth, we analyze the data.
    • We focus on what the data tells us about our performance measures.
    • It’s important that we stay close to what we outlined initially.
    • It can be tempting to just look for whatever supports our design,
      • but we want to be impartial.
    • If we find some evidence that suggests our interface is good in ways we didn’t anticipate,
      • we can always do a follow up experiment to test if we’re right.
  7. Seventh, we summarize the data in a way that
    • informs our ongoing design process.
      • What did our data say was working?
      • What could be improved?
    • How can we take the results of this experiment and use it to then revise our interface?
    • The results of this experiment then become part of our design life cycle.
    • We
      • investigated user needs,
      • developed alternatives,
      • made a prototype and
      • put the prototype in front of users.
    • To put the prototype in front of users,
    • we walked through this experimental method.
    • We defined the task,
      • defined the performance measures,
      • developed the experiment,
      • recruited participants,
      • did the experiment,
      • analyzed our data and
      • summarized our data.
    • Based on the experience,
      • we now have the data necessary
        • to develop a better understanding of the user’s needs,
        • to revisit our earlier design alternatives and
        • to either improve our prototypes by increasing their fidelity or
          • by revising them based on what we just learned.
      • Regardless of whether we’re doing
        • qualitative,
        • empirical, or
        • predictive evaluation,
      • these steps remain largely the same.
    • Those different types of evaluation just fill in the experiment that we develop, and they inform our performance measures, data analysis, and summaries. A minimal sketch of writing such an evaluation plan down appears below.
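
To make those steps concrete, here is a small, purely hypothetical sketch of recording an evaluation plan in code before running the experiment. None of the field names or values come from the course; the point is just that the task, performance measures, experimental design, and recruiting target are fixed before any data is gathered.

```python
# Hypothetical sketch: pin down the evaluation plan before gathering any data.
from dataclasses import dataclass, field


@dataclass
class EvaluationPlan:
    task: str                                    # step 1: the task we're examining
    performance_measures: list = field(default_factory=list)  # step 2
    experiment_design: str = ""                  # step 3: what we measure, control, and vary
    participants_needed: int = 0                 # step 4: how many people to recruit


plan = EvaluationPlan(
    task="Look up a restaurant address without touching the screen",
    performance_measures=["time to complete", "number of errors", "self-reported satisfaction"],
    experiment_design="within-subjects: voice prototype vs. current baseline, order counterbalanced",
    participants_needed=10,
)
print(plan)
```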

3.6.7 - Qualitative Evaluation

Qualitative evaluation involves getting qualitative feedback from the user.

There are a lot of qualitative questions we want to ask throughout the design process.

Now, if this sounds familiar, it’s because it should be. The methods we use for qualitative evaluation are very similar to the methods we used for need-finding:

We use those methods to get information about the task in the first place, and now, we can use these techniques to get feedback on how our prototype changes the task.

3.6.8 - Designing a Qualitative Evaluation

Let’s run through some of the questions you’ll have to answer in designing a qualitative evaluation.

  1. First, is this based on prior experience, or is it a live demonstration?
  2. Second, is the session going to be synchronous or asynchronous?
  3. Third, how many prototypes or how many interfaces will they be evaluating?
  4. Fourth, when do you want to get feedback from the user?
  5. Finally, do you want to get feedback from individuals or from groups?

3.6.9 - Capturing Qualitative Evaluation

With qualitative research, one of our big questions is how we capture what happens during the session. There are a few ways to do that.

  1. One way is to actually record the session.
    • The pros of recording a session are that
      • it’s automated,
      • it’s comprehensive, and
      • it’s passive.
    • Automated means that it runs automatically in the background.
    • Comprehensive means that it captures everything that happens during the session. And
    • passive means that it lets us focus on administering the session instead of capturing it.
      • The cons though, are that
        • it’s intrusive,
        • it’s difficult to analyze, and
        • it’s screenless.
      • Intrusive means that many participants are uncomfortable being videotaped.
        • It creates pressure knowing that
          • every question or
          • every mistake
          • is going to be captured and analyzed by researchers later.
      • Video is also very difficult to analyze.
        • It requires a person to
          • come later and watch
            • every minute of video,
              • usually several times,
          • in order to
            • code and
            • pull out
            • what was actually relevant in that session.
      • And video recording often has
        • difficulty capturing interactions on-screen.
      • We can film what a person is doing on a keyboard or with a mouse,
        • but it is difficult to then see how that translates to on-screen actions.
      • Now some of these issues can be resolved, of course.
      • We can do video capture of the screen and synchronize it with the video recording.
      • But if
        • we’re dealing with
          • children, or
          • at-risk populations, or
          • with some delicate subject matter,
        • the intrusiveness can be overwhelming.
        • And if we want to do a lot of complex sessions,
          • the difficulty in analyzing that data can also be overwhelming.
            • For my dissertation work I captured about 200 hours of video,
              • and that’s probably why it took me an extra year to graduate.
        • It takes a lot of time to go through all that video.
  2. So instead we can also focus on note-taking.
  3. A third approach, if we’re designing software, is to actually log the behavior inside the software (see the sketch below).
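
Here is a minimal sketch of what that kind of instrumentation might look like, using Python’s standard logging module. The event names and participant IDs are hypothetical, not something the course prescribes.

```python
# Hypothetical sketch: log on-screen interactions from inside the software
# so the session can be analyzed later without recording video.
import logging

logging.basicConfig(filename="session_events.log",
                    format="%(asctime)s %(message)s",
                    level=logging.INFO)


def log_event(participant_id, event, detail=""):
    """Record one interaction for later analysis."""
    logging.info("participant=%s event=%s detail=%s", participant_id, event, detail)


# Example events while a (hypothetical) participant works through a task:
log_event("P1", "clicked", "post_status_button")
log_event("P1", "error", "submitted_empty_status")
log_event("P1", "task_complete", "status_posted")
```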

3.6.10 - 5 Tips: Qualitative Evaluation

Here are five quick tips for conducting successful evaluations.

  1. Number one, run pilot studies.
    • Recruiting participants is hard.
      • You want to make sure that once you start working with real users,
        • you’re ready to gather really useful data.
      • So, try out your experiment with
        • friends or
        • family or
        • co-workers
        • before trying it out with real users
          • to iron out the kinks in your design and your directions.
  2. Number two, focus on feedback.
    • It’s tempting in qualitative evaluations to spend too much time trying to teach this one user.
    • If the user criticizes an element of the prototype,
      • you don’t need to explain to them the rationale.
    • Your goal is to get feedback to design the next interface,
      • not to just teach this one current user.
  3. Number three, use questions when users get stuck. That way,
    • you get some information on
      • why they’re stuck, and
      • what they’re thinking.
    • Those questions can also be used to guide users
      • to how they should use it, which makes the session seem less instructional.
  4. Number four, tell users what to do, but not how to do it.
    • This doesn’t always apply,
      • but most often we want to design interfaces that users can use without any real instruction whatsoever.
    • So, in performing qualitative evaluation,
      • give them instruction on what to accomplish,
        • but let them try to figure out how to do it.
      • If they try to do it differently than what you expect,
        • then you know how to design the next interface.
  5. Number five, capture satisfaction.
    • Sometimes, we can get so distracted by
      • whether or not users can use our interface that
        • we forget to ask them whether or not they like using our interface.
    • So, make sure to capture user satisfaction in your qualitative evaluation.

3.6.11 - Empirical Evaluation

3.6.12 - Designing Empirical Evaluations

3.6.13 - Hypothesis Testing

3.6.14 - Quantitative Data and Empirical Tests

3.6.15 - Special Statistical Tests

  1. First, notice that we only ever had two levels to our independent variable.
    • We were only ever comparing online and traditional students.
    • For your work, that might mean only comparing two different interfaces.
    • What if we wanted to test three?
      • How can we do that?
    • Imagine, for example, I wanted to test these two classes against a third type of class, a flipped class.
      • Here we’d be testing the online section versus the traditional section versus the flipped section, how would we do that?
        • You might be tempted to just test them in a pairwise fashion: online versus traditional, traditional versus flipped, and online versus flipped.
    • You’d use that to try to uncover any differences between pairs. That’s called repeated testing, and the problem is that it raises the likelihood of a type one error.
    • A type one error is also called a false positive,
      • and it’s where we falsely reject the null hypothesis.
      • In other words, we falsely say that we have enough data to conclude the alternative hypothesis.
      • Here that would mean, falsely concluding that there is a difference when there isn’t actually a difference.
      • The reason repeated testing raises the likelihood of this is that, remember, we said we reject the null hypothesis if there’s generally less than a five percent chance it could have occurred by random chance.
      • But if you do three different tests, you raise the likelihood of one of them turning up conclusive even though it really isn’t.
      • Think of it like playing the lottery.
        • If I say you have a one-in-20 chance of winning and you play 20 times,
          • you’re quite likely to win at least once: the chance is 1 − (19/20)^20, or about 64 percent.
        • That’s because your overall odds of winning at least once increase with each additional play.
        • Performing multiple tests likewise raises our overall likelihood of finding a false positive.
        • So instead, what we need, is a single test that can compare against all these different treatments at once.
      • Now fortunately, for ordinal or nominal data, it’s actually just the same test.
        • A chi-squared test can handle more than just two levels of our independent variable.
          • Our null and alternative hypotheses change a little bit if we’re dealing with more than two levels.
          • The weakness here is that if we do a chi-squared test on all three of these levels at once,
            • all it will tell us is if there’s any difference between any of the levels.
        • It doesn’t tell us where the difference is.
          • So, if the chi-squared test shows that there is a difference, we don’t have any way of knowing where it is: is it a difference between online and traditional, online and flipped, or traditional and flipped? Or is flipped different from both online and traditional, or something like that?
        • So, generally what we do is an overall chi-squared test on all of the levels, and then, if that first test finds a difference, we follow up with pairwise comparisons between the conditions.
        • In that case, we’re basically concluding that we know there’s a difference before we actually do the repeated testing.
        • So, the overall odds of finding a false positive aren’t changing.
  2. For interval and ratio data though, we need to use a different test altogether.
    • This test is called an analysis of variance or ANOVA.
      • A one-way ANOVA test lets us compare between three or more groups simultaneously.
      • Here, that means we could test between all three of our classes at the same time. For you, that could mean testing between three or four interfaces at the same time.
      • With a two-way ANOVA, we could actually take this a step further. We could have two dimensions of independent variables: we could test online, traditional, and flipped against upperclassmen versus lowerclassmen. We could actually look at differences like: do freshmen do better online but sophomores do better in traditional?
      • The weakness, though, is the same as with the chi-squared test: an analysis of variance will tell us if there are differences, but it won’t tell us where the differences are.
    • Our approach to that is the same as it was with the chi-squared test as well. If the analysis of variance indicates there’s an overall difference, then we can follow up with pairwise t-tests.
    • Notice though, there’s still one assumption that’s been embedded in every single analysis we’ve talked about.
      • Our independent variables have always been categorical, that’s generally true for most of the tests we’re going to do.
      • If we’re testing one interface against another, then those are our two categories.
      • If we’re testing one body of people against another, then those are our two categories.
      • So, this isn’t really a weakness or a challenge, but there are cases where we want our independent variable to be something non-categorical.
    • Mostly that happens when we want our independent variable to be some interval or ratio data.
      • So, imagine for example I wanted to see if GPA was a good predictor of course grade.
      • GPA though is generally considered interval data, we might consider it ratio data but it’s usually discussed as interval data.
      • We could do this by breaking GPA down into categories:
        • for example, average the course grades for everyone with a GPA from 3.5 to 4.0, and so on for each bucket.
      • Or, instead of this, we could leave GPA as interval data and just do a direct analysis.
        • Generally, here we’d be doing a regression, where we see how well one variable predicts another.
        • Most of our regressions are linear, but we could also do
          • a logistic regression,
          • a polynomial regression, and
          • lots more.
      • Again, I’m getting outside the scope of our class.
        • Here, our null hypothesis is that the variables are unrelated, and our alternative hypothesis is that they are related.
          • So, we need evidence that they’re related before assuming that they are.
          • Here things get a little bit more complex as well, because we’re not quite as emphatic about how we reject our null hypothesis and accept our alternative hypothesis.
          • Usually with regressions, we describe how well the two fit. They might fit very well, somewhat well, not well at all, and so on.
    • Before we move on, there’s
  3. one last type of data I’d like to talk about, and that’s binomial data.
    • Binomial data is data with only two possible outcomes, like a coin flip.
    • For us we might have outcomes like success in a class.
    • In HCI, we might be curious which of multiple interfaces allows users to succeed at a task with greater frequency.
    • Notice there that success and failure are binary, and that’s what makes this binomial data.
      • What can be tricky here, is that our data actually looks continuous, it looks just like straightforward continuous ratio data.
    • Here we might say online students succeed 94.9 percent of the time and traditional students succeed 92.1 percent of the time, and we might be tempted just to do a straightforward t-test on that.
      • But if you try to do the math, you’ll quickly find that it doesn’t work.
    • A t-test requires a standard deviation, and if every single student is either a one or a zero, a success or a failure, then you don’t really have a standard deviation in the same way.
    • So instead, we have a specific test that we use for binomial data, called a binomial test.
      • With a two sample binomial test, we compare two different sets of trials, each with a certain number of successes.
        • So, we can answer questions like:
          • does one lead to a greater ratio of successes than the other?
      • Alternatively, we can also do a one-sample binomial test.
        • That’s where we compare only one set of trials to some arbitrary number.
        • So, for example, if we wanted to prove that a coin was unbalanced, we would use a one-sample binomial test comparing it to a ratio of 50 percent.
    • You’ll know that you want to use a binomial test if the individual observations you’re getting from users are binary.
      • If you’re only concerned with whether they succeeded or failed on a particular task and if your data is just a bunch of instances of successes and failures, then you’re using binomial data.
    • If the data you’re getting out of your users is more complex like multiple categories or continuous observations, then you’re probably looking at using a chi-squared test or a t-test or any of the ones we talked about before.
    • Now, we’ve gone through a lot of tests and we’ve gone through them very quickly, but remember our goal is just for you to know what test to use and when.
      • Once you’ve identified the appropriate test, looking up how to run it and actually putting the data in is usually a much simpler task. A brief sketch of these tests in code follows below.
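
To make the tests above concrete, here is a minimal sketch using SciPy. The course does not prescribe a particular tool, and all of the numbers below are made up for illustration; the point is simply which function matches which kind of data.

```python
# Hypothetical data throughout -- this just shows which SciPy function
# matches which kind of test described above.
from scipy import stats

# Chi-squared test with three levels of the independent variable:
# counts of students earning an A / not earning an A in each section.
observed = [[30, 20],   # online:      [A, not A]
            [25, 25],   # traditional: [A, not A]
            [35, 15]]   # flipped:     [A, not A]
chi2, p, dof, expected = stats.chi2_contingency(observed)
print("chi-squared p-value:", p)

# One-way ANOVA: compare interval/ratio data (e.g., final grades)
# across three or more groups at once.
online      = [85, 90, 78, 92, 88]
traditional = [80, 84, 79, 88, 83]
flipped     = [91, 87, 94, 89, 90]
print("ANOVA p-value:", stats.f_oneway(online, traditional, flipped).pvalue)

# If the ANOVA finds an overall difference, follow up with pairwise t-tests.
print("online vs. traditional p-value:",
      stats.ttest_ind(online, traditional).pvalue)

# Regression: how well does GPA (interval data) predict course grade?
gpa   = [2.8, 3.1, 3.4, 3.6, 3.9]
grade = [75, 80, 84, 88, 95]
fit = stats.linregress(gpa, grade)
print("slope:", fit.slope, "r-squared:", fit.rvalue ** 2)

# One-sample binomial test: are 47 successes out of 50 trials
# consistent with a 50 percent success rate?
print("binomial test p-value:", stats.binomtest(47, n=50, p=0.5).pvalue)
```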

3.6.16 - Summary of Empirical Tests


3.6.17 - 5 Tips: Empirical Evaluation

Here are five quick tips for doing empirical evaluations. You can actually take entire classes on doing empirical evaluations, but these tips should get you started.

  1. Number one, control what you can, document what you can’t.
    • Try to make your treatments as identical as possible.
    • However, if there are systematic differences between them, document and report that.
  2. Number two, limit your variables.
    • It can be tempting to try to vary lots of different things and monitor lots of other things, but that just leads to noisy difficult data that will probably generate some false conclusions.
    • Instead, focus on varying only one or two things and monitor only a handful of things in response. There’s nothing at all wrong with only modifying one variable and only monitoring one variable.
  3. Number three, work backwards in designing your experiment.
    • A common mistake that I’ve seen is to just gather a bunch of data and figure out how to analyze it later.
    • That’s messy, and it doesn’t lead to very reliable conclusions.
    • Decide at the start what question you want to answer, then decide the analysis you need to use, and then decide the data that you need to gather.
  4. Number four, script your analyses in advance.
    • Ronald Coase once said, “If you torture the data long enough, nature will always confess.”
    • What the quote means is that if we analyze and reanalyze data enough times, we can always find conclusions, but that doesn’t mean they’re actually there.
    • So, decide in advance what analysis you’ll do and do it.
    • If it doesn’t give you the results that you want, don’t just keep reanalyzing that same data until it does.
  5. Number five, pay attention to power.
    • Power refers to a test’s ability to detect a difference of a given size.
    • Generally, it’s very dependent on how many participants you have.
    • If you want to detect only a small effect, then you’ll need a lot of participants.
    • If you only care about detecting a big effect, you can usually get by with fewer. (A small power-calculation sketch follows this list.)
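
As referenced above, here is a minimal sketch of a power calculation using statsmodels (a library choice of mine, not something the course requires): how many participants per group a two-sample t-test would need to detect effects of different sizes at 80% power and alpha = 0.05.

```python
# Hypothetical power calculation: participants per group needed for a
# two-sample t-test at 80% power and alpha = 0.05, for several effect sizes.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.2, 0.5, 0.8):   # small, medium, large (Cohen's d)
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"effect size {effect_size}: about {n:.0f} participants per group")
```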

3.6.18 - Predictive Evaluation

3.6.19 - Types of Predictive Evaluation

When we talked about design principles, we talked about several heuristics and guidelines we use in designing interfaces.

3.6.20 - Cognitive Walkthroughs

The most common type of predictive evaluation you’ll encounter is most likely the cognitive walkthrough.

3.6.21 - Evaluating Prototypes

3.6.22 - Exercise: Evaluation Pros and Cons Question

3.6.22 - Exercise: Evaluation Pros and Cons Solution

3.6.23 - Exploring HCI: Evaluation

3.6.24 - Conclusion to Evaluation

In this lesson, we’ve discussed the basics of evaluation. Evaluation is a massive topic to cover though. You could take entire classes on evaluation. Heck, you could take entire classes only on specific types of evaluation. Our goal here has been to give you enough information to know what to look into further and when. We want you to understand when to use qualitative evaluation, when to use empirical evaluation, and when to use predictive evaluation. We want you to understand within those categories, what the different options are. That way, when you’re ready to begin evaluation, you know what you should look into doing.