Case Study: Querying a Video Dataset Using Explainable AI

The AI Models

Our XAI Model

The Black-Box Model

The User Dashboard

A screenshot of the User Dashboard with annotations labeling different components of the interface.

User Case Study

Study Design

  1. Clip Identify Task — the participant is prompted to pick up to 3 video clips, from a set of 10, that they think are most relevant to the displayed keyword, using a drag-and-drop interface.
  2. Timeline Spot Task — the participant is asked to identify which segments of a longer video illustrate the displayed keyword.
  3. User-Machine Collaboration Task — the participant is given one of 7 manually constructed scenario descriptions and must search the dataset for the video clip that best matches the scenario.
A screenshot of a Timeline Spot Task trial with the XAI system after a participant has completed the trial. The image shows the video containing the specified keyword (top-left), the XAI-generated evidence (bottom-left), the prediction/output summary (top-right), and the AI performance-assessment questionnaire (bottom-right). Note that the explanation interface described and shown previously was modified to facilitate each of the 3 task types.

Study Results

Trials were grouped into three task clusters:

  • Black-Box AI tasks
  • XAI tasks completed by users with low levels of trust (XAI LOW)
  • XAI tasks completed by users who did not express explicit distrust (XAI HIGH)
Results of accuracy (left), user-machine synchronization (middle), and user skepticism (right) across the three task clusters. Skepticism here refers to the proportion of trials in which the AI was incorrect but the user still found the correct answer on their own. There was no significant difference in skepticism across the three task clusters.
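The three measures in the figure can be made concrete with a small sketch. This is an illustrative reconstruction, not the study's analysis code: the `Trial` record and the choice to compute skepticism over only the AI-incorrect trials are our assumptions.

```python
from dataclasses import dataclass

@dataclass
class Trial:
    ai_correct: bool    # the AI's answer matched ground truth
    user_agreed: bool   # the participant went along with the AI's answer
    user_correct: bool  # the participant's final answer matched ground truth

def accuracy(trials):
    """Fraction of trials the participant answered correctly."""
    return sum(t.user_correct for t in trials) / len(trials)

def synchronization(trials):
    """Fraction of trials where the participant's answer followed the AI's."""
    return sum(t.user_agreed for t in trials) / len(trials)

def skepticism(trials):
    """Of the trials where the AI was wrong, the fraction in which the
    participant still found the correct answer on their own."""
    ai_wrong = [t for t in trials if not t.ai_correct]
    if not ai_wrong:
        return 0.0
    return sum(t.user_correct and not t.user_agreed
               for t in ai_wrong) / len(ai_wrong)
```

Under this reading, a high-skepticism participant is one who notices the AI is wrong and overrides it, which is why skepticism is conditioned on AI-incorrect trials rather than all trials.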
  1. The amount of trust in the XAI’s evidence (as indicated in the mental model questionnaire) correlated strongly with user-machine synchronization: the XAI LOW cluster showed lower rates of synchronization than the XAI HIGH cluster. This, combined with the fact that individual users’ reactions to the XAI explanations tended to shift from one trial to the next, suggests that users actively respond to XAI explanations and adjust their opinions and behavior accordingly.
  2. While the Black-Box AI system initially appeared to yield higher overall satisfaction among participants, satisfaction with the XAI system was sharply divided between participants who trusted and relied on it and those who did not. Further analysis showed that, for users with a high level of trust in and reliance on the system, the XAI system produced a more positive overall experience than the Black-Box AI. These users also outperformed Black-Box AI users on some tasks.
  3. In addition to collecting participant reactions to model-generated video clips during the study, five external data annotators (with no prior experience with the study) were hired to conduct a post-hoc analysis of the quality of the XAI explanations, rating how well each model-generated clip represented its query. These ratings served both as a proxy for explanation quality and as an indicator of participant attentiveness throughout the study. There was no apparent correlation between participant reactions to individual clips and the external ratings. This may indicate that our experts and participants assessed clip quality by different criteria, or that overall clip quality did not strongly influence users’ trust or confidence in the AI agent. Notably, however, positive user reactions clustered around clips featuring exaggerated motions and cartoon-like premises, such as “bear (human subject),” “salsa dance,” and “express joy,” while more generic and muted clips such as “pull up” and “walk and turn repeated” received negative reactions.
  4. There was divergence between clusters of participants who found the XAI system to be reliable and influential to their decision-making processes, and those who deemed the system counter-intuitive and underwhelming. One participant wrote “[AI explanation] is a good basis of determining the reliability of AI in terms of [whether] the AI is able to detect the proper animations,” and another expressed satisfaction, stating “I’m impressed with what the AI system outputs.” However, some expressed caution and distrust, with one writing “the AI system often interpreted small portions of movements as if they met the definition of the keyword although it was a mere segment of the movement,” and another writing “I didn’t trust it completely as it detected similar movements and categorized it as the real one.” Two participants plainly wrote “I did not see the explanation,” suggesting that what counts as an AI explanation may vary between individuals, and that additional training or clearer messaging may be needed to help people interpret generated clips as explanations.
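The lack of correlation between participant reactions and annotator ratings (point 3 above) is the kind of question a rank correlation answers. The sketch below implements Spearman’s rho from scratch, assuming no tied ratings for brevity; the scores shown are hypothetical and not data from the study.

```python
def rank(values):
    """Rank positions starting at 1 (assumes no ties, for brevity)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for r, i in enumerate(order, start=1):
        ranks[i] = r
    return ranks

def spearman(x, y):
    """Spearman's rank correlation via the classic d-squared formula."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-clip scores: participant reactions vs. annotator ratings.
participant_reactions = [5, 2, 4, 1, 3]
annotator_ratings = [3, 4, 1, 5, 2]
rho = spearman(participant_reactions, annotator_ratings)
```

A rho near zero would match the study’s finding of no apparent agreement; values near +1 or -1 would indicate that the two groups ranked clip quality consistently (or consistently in reverse).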

In Conclusion…




Visualization and HCI research at Ontario Tech University, led by Dr. Christopher Collins



vialab research
