
Until recently I worked at Conductor Technologies. Conductor provides on-demand compute for the media and entertainment industry. Artists submit CG render jobs consisting of hundreds of tasks, each representing a frame of animation. The service runs these tasks on an array of machines at one of the hyperscalers. Since tasks are compute-intensive, running them in parallel allows artists to get results quickly and iterate on their work.
Tasks cannot communicate with each other, so while Conductor works well for a flat list of render tasks, it cannot support jobs where tasks depend on the output of other tasks.

For example, an artist might want to run a fluid simulation that writes out cache files, render those caches, and then composite the images into a QuickTime movie. Complex workflows like this are common in larger visual effects studios and require a graph to describe the dependencies.
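As a rough illustration of why a graph is needed, the simulate, render, composite example can be written as a small dependency map and sorted into a valid run order. This is only a sketch using Python's standard `graphlib` module; the task names are hypothetical and it is not how Storm itself stores or schedules graphs.

```python
from graphlib import TopologicalSorter

# Hypothetical task names. Each key maps to the set of tasks it depends on.
dependencies = {
    "simulate": set(),
    "render_frame_1": {"simulate"},
    "render_frame_2": {"simulate"},
    "composite": {"render_frame_1", "render_frame_2"},
}


def execution_order(deps):
    """Return one valid order in which the tasks could run."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

The simulation always comes first and the composite last. The two render tasks have no dependency on each other, which is what lets them run in parallel.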
I pitched the idea of building a workflow system based on a directed acyclic graph (DAG) to my boss and got the green light. Storm was born.
The team:
I had already done some groundwork, but to get things properly started, we set up the project in Basecamp and met to shape the first phase. We experimented with libraries and settled on these key decisions:
I've included a copy of the Storm documentation in this blog. I'll reference it where relevant to provide deeper insight into features.
The core of the project was DAG-based workflows, but the desktop app also presented the opportunity to solve many pain points of the legacy system. This is discussed under Customer flow later.

The Composer is where tasks are laid out as a visual graph before submission. Artists typically send the graph from their DCC (Digital Content Creation application) to the Composer. In this view, they can edit the graph or modify properties of tasks and other nodes. They then submit the tasks to be orchestrated by the Workflow API. This workflow is described in detail in the Slinky tutorial.
Once the tasks are submitted, the app switches to the Monitor view, which shows the status of the tasks as they are executed. The Monitor looks similar to the Composer, with some differences that are described later.
Our first three months focused on getting the system working end-to-end for a preview at SIGGRAPH (the annual computer graphics conference). At that time, the app handled only single node selection and editing. It quickly became clear we needed multi-selection and batch-editing in both the Composer and Monitor views.
If a frame fails to render because it ran out of memory, it's likely that every frame in the sequence will fail the same way. The artist needs a way to select all the tasks, switch them to a higher-spec machine, and retry them. Play the video to see multi-select and batch-edit in action.
The green play button and other bright green UI elements are temporary developer tools. The play button simulates advancing graph statuses by one step. This was necessary because the Workflow API development wasn't progressing as quickly as hoped.
With multi-selection and batch editing solved in the Composer view, we tackled the Monitor view, which was trickier.
In the Monitor view, task nodes have several different statuses: holding, running, succeeded, and so on. When nodes are edited, the behavior depends on the status of the node. Editing a running node requires that the task be stopped and restarted with the new values. It also causes the previous configuration to be saved as a snapshot.
The screen recording below hints at the underlying complexity of marrying the concepts of batch editing, retries, status transitions, and history.
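The status-dependent edit described above might look something like the sketch below. The `TaskNode` class, the `edit_node` function, and the choice of "holding" as the state a restarted task re-enters are all assumptions made for illustration. For statuses other than running, the sketch simply applies the change, which may not match Storm's actual rules.

```python
from dataclasses import dataclass, field


@dataclass
class TaskNode:
    name: str
    status: str = "holding"
    config: dict = field(default_factory=dict)
    history: list = field(default_factory=list)  # snapshots of earlier configs


def edit_node(node, **changes):
    """Apply an edit; what happens depends on the node's current status."""
    if node.status == "running":
        # Keep the configuration that was running as a snapshot,
        # then restart the task with the new values.
        node.history.append(dict(node.config))
        node.config.update(changes)
        node.status = "holding"  # assumption: a restarted task waits to be rescheduled
    else:
        # Assumption: nodes that are not running are edited in place.
        node.config.update(changes)
    return node


node = TaskNode("frame_1", status="running", config={"machine": "small"})
edit_node(node, machine="large")
print(node.config, node.history)
```

Batch editing would then amount to calling the same function on every selected node, so each node takes the path that matches its own status.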
To get a better understanding of the state machine and the history feature, read the two sections of the documentation linked below.
We had two major eureka moments that eliminated a lot of complexity:
We were trying to solve the problem of failed tasks that were either false failures or not serious enough to block the whole workflow. Jon suggested a "continue on error" attribute, but that would have meant knowing beforehand that a failure on that node was not fatal.
I proposed adding a manual transition "Flip to success," but that meant information about the failure was lost and not visually clear.
The insight was to add a Skipped status, which allows users to bypass failed nodes to let the graph continue while making it easy to see at a glance which nodes were skipped.
A cascade means that when a task is retried, any running or succeeded downstream tasks are reset to their initial state so they can run again once the retried task succeeds. We had assumed that cascades were optional, and in fact we thought that implementing them would be harder than leaving them out.
However, the idea that the graph could exist in a state where tasks downstream from a failed node were in a succeeded or running state seemed very ugly to me. It would mean that users looking at the graph would not be able to figure out how it got into that illegal state.
With that in mind we agreed that the graph must always be in a state that's possible through the natural flow of dependencies. For a task to start running, all its upstream dependencies must have succeeded or have been skipped. See this in diagram form here. The only way to guarantee this was to ensure a retry always resets downstream nodes.
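The start rule and the retry cascade can be sketched in a few lines. This is a minimal illustration, not Storm's implementation: the function names, the dictionary-based graph, and the use of "holding" as the initial state are assumptions.

```python
from collections import deque

INITIAL = "holding"  # assumption: the name of a task's initial state


def can_start(task, upstream, status):
    """A task may start only if every upstream task succeeded or was skipped."""
    return all(status[u] in ("succeeded", "skipped") for u in upstream.get(task, ()))


def retry(task, downstream, status):
    """Reset the retried task, then cascade the reset to everything downstream."""
    status[task] = INITIAL
    queue = deque(downstream.get(task, ()))
    seen = set()
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        # Only running or succeeded tasks are reset, as described above.
        if status[node] in ("running", "succeeded"):
            status[node] = INITIAL
        queue.extend(downstream.get(node, ()))
    return status


# Hypothetical three-task chain: simulate -> render -> composite.
downstream = {"simulate": ["render"], "render": ["composite"], "composite": []}
upstream = {"render": ["simulate"], "composite": ["render"]}
status = {"simulate": "skipped", "render": "succeeded", "composite": "running"}

retry("simulate", downstream, status)
print(status)
```

After the retry, no task downstream of the retried node is left running or succeeded, so the graph is back in a state that the natural flow of dependencies could have produced.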
The screen recording below shows some nodes that have been skipped (purple) with downstream nodes in the success state. When a node is retried, downstream nodes are reset to their initial state and the graph as a whole has integrity.
We touched earlier on the pain points of the existing customer experience. Customers have to switch context repeatedly between different applications.
The new system was on track to eliminate most of that context switching, as illustrated in the diagram below.

In the legacy system, validation logic is scattered across individual DCC plugins, leading to code duplication and inconsistent implementations. Users also have to fix issues manually.
In the redesign, validation is implemented once in the desktop app, and the API allows fixer actions to be added for each validation entry, as shown in the screenshot.
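One way to picture validation entries that carry fixer actions is sketched below. The `ValidationEntry` class, the job fields, and the default output path are hypothetical and are not taken from the Storm API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ValidationEntry:
    message: str
    # Optional action that repairs the problem when the user asks for it.
    fixer: Optional[Callable[[dict], None]] = None


def validate(job: dict) -> list:
    """Return the validation entries for a job (hypothetical checks)."""
    entries = []
    if not job.get("output_path"):
        def set_default_output(j: dict) -> None:
            j["output_path"] = "renders/"  # hypothetical default
        entries.append(ValidationEntry("Output path is not set", set_default_output))
    if job.get("frames", 0) <= 0:
        # No fixer here: the user has to correct this one manually.
        entries.append(ValidationEntry("Job has no frames"))
    return entries


job = {"frames": 10}
for entry in validate(job):
    print(entry.message)
    if entry.fixer:
        entry.fixer(job)
print(validate(job))
```

Because the checks live in one place, every DCC that sends a job to the app gets the same validation, and each entry can offer its own fix where one exists.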

Currently, customers have to wait until uploads finish before they regain control of their DCC. Alternatively, they must use the command line to run an uploader in daemon mode. To fix this poor customer experience, Asher built an uploader GUI that shows progress from a background uploader daemon directly in the desktop app. Play the video to see it in action.
The legacy companion app included an app-store-like plugins page, but it suffered from several issues that needed addressing.
In the new app, we redesigned the plugins page with these guidelines:
The project was hard work but enjoyable. We delivered an end-to-end demo for SIGGRAPH after the first few months. The project was on track to ship at the end of March but was unfortunately cancelled in February.
The most challenging part was designing state transitions for tasks with history tracking and retries, combined with batch-edit functionality. These features were complex but critical for customers. In the legacy system, users had to resubmit entirely new jobs just to make minor changes to running tasks.
Everything should be made as simple as possible, but not simpler.