Enabling Peer-to-Peer QA and Scaling for ML
Data Review Tasks
Role: Designer on 3-eng, 1-PM team
Timeline: July 2023; April 2024–Jan 2025
Users: Abstractors, Abstraction Managers
Outcome
Created a new type of task that allows part-time abstractors to QA other abstractors’ work, when previously only managers could. The Data Review Task (DRT) pulls disparate data into a one-page workflow. After several rounds of comprehension testing, we released DRTs across all diseases.
Between 3,000–5,000 DRTs are completed per month, shifting 1,200–2,000 working hours per month from overworked full-time managers to part-time workers who always want more tasks. DRTs save $200k a year and are a key way to identify the best abstractors. DRTs also set us up for a future where abstractors verify fully ML-extracted tasks.
Resolving Flags
When a task is completed with unusual data (say, a surgery date before a diagnosis date), this raises a flag. Pre-DRT, abstraction managers would pick a flag from a report table and then click through individual tasks to identify the root cause of the issue. After confirming the data’s accuracy or finding an error and fixing it, managers would return to the table and fill out a report form. Flags are either very easy to solve (5-10 minutes of work) or extremely difficult (25-30 minutes of investigation across several tabs).
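The flag mechanism described above can be sketched as a simple validation rule. This is a minimal, hypothetical illustration; the field names and flag label are assumptions, not the actual system's schema:

```python
from datetime import date

# Hypothetical flag rule: raise a flag when a record's surgery date
# precedes its diagnosis date (field names are assumed for illustration).
def flag_unusual_dates(record):
    flags = []
    if record["surgery_date"] < record["diagnosis_date"]:
        flags.append("surgery_before_diagnosis")
    return flags

record = {"diagnosis_date": date(2022, 5, 1), "surgery_date": date(2022, 3, 10)}
print(flag_unusual_dates(record))  # ['surgery_before_diagnosis']
```

In the pre-DRT flow, a manager would start from a report of such flags; a DRT instead packages one flagged record, with the relevant context, into a single reviewable task.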
Abstraction
Abstractors are responsible for a single set of questions centered around a disease, using a stack of patient charts to find answers.
Hackathon
In Summer 2023, a team hacked together a set of abstraction tasks that represented what a manager would do to resolve a flag (e.g., verify that an “unknown” response is correct). Managers who picked up these tasks were as accurate as managers doing the old flow, and 10-30% faster. If all flags were solved this way, it would represent savings of $150k–$400k.
Prioritization
Unfortunately, the amount of technical task reworking and flag rewriting wasn’t worth the initial $400k in savings for a user group that was already efficient at flags. We had to say no to pursuing this work. However, as we did more Machine Learning discovery, we realized that the sheer amount of ML abstraction we could generate (tripling our datasets) would need some way to be QA’d at scale, too. In addition, managers were working overtime and unable to finish monthly backlogs. We pitched DRTs again, but as a task for abstractors to pick up. While there was some hesitation from operations, we convinced them that we would test quickly and lightly, and pull the project if we saw discomfort from our users.
Designing PLQA
Since we knew margins were thin, we wanted to optimize DRTs without a heavy testing burden. We started by switching internal manager teams over to DRTs where possible. If they weren’t faster at scale, we would need to re-evaluate.
Designs re-used as much of the original task as possible, while highlighting that this was a new task with different expectations. I emphasized the decision-making unique to DRTs at the top of the task and used a banner color (already present in other tasks) to visually distinguish it from standard tasks.
Language, Content & Testing
I used the monthly abstractor user research sessions I had set up to test DRTs with abstractors. Across the board, we saw abstractors immediately understand that these tasks were different and explain what needed to be done. Abstractors reached speeds similar to managers’ after 3 tasks and could be faster within 10.
We adjusted language to push abstractors to read and make a decision on tasks faster rather than make them re-read risk-mitigating instructions each time. Our newest Flatiron designer, Carissa, made the latest changes to the look and feel of a DRT. She also realized we could get rid of the comment section when, after 10 tests, no abstractor put any useful data in it!
How Soon Is Now?
While DRTs are critical to setting up review workflows and save some money (half what we expected), we still see a large subset of users relitigate the chart and skip over the specific instructions of what to review—relying instead on their intuition about what might be wrong with the data. Our monthly abstractor sessions have also shown that some abstractors are too heavily invested in quality and avoiding mistakes, to the detriment of any efficiency gains. This can be helpful for difficult, blank-slate tasks, but is a cost sink for easy tasks like DRTs and incremental tasks (completed tasks with a few updated documents).
As a designer, this was an opportunity to use our regular research to push back against further layout and content changes, and ask our operations team to consider workforce management and allocation solutions first.