TL;DR: Most software processes can be represented as a directed acyclic graph (DAG). In reality, however, your processes are more likely to behave as pseudo-DAGs, resulting in slowness, inconsistencies and non-reproducibility. This is because getting a complex system to behave like a true DAG - while managing the challenges that come with it - requires a lot of hard work and clever design. Ediphy’s core technology is a true DAG that has overcome those inherent challenges.
In a world eaten by software, understanding the intricate dependencies that drive your organisation’s computations is crucial. Imagine mapping your organisation as a network of interconnected tasks: each task is a node that relies on others to fulfil its function, and those dependencies are the edges. Enter the directed acyclic graph (DAG).
By breaking computations down into manageable sub-components, represented as a DAG, we gain insight into the dependencies within a system. This enables parallelisation of calculations that do not depend on one another, optimising performance, and provides a starting point for ensuring reproducibility.
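To make this concrete, a small computation graph can be written down as a mapping from each node to the nodes it depends on, and checked for cycles - here sketched with Python’s standard graphlib; the node names are invented for the example.

```python
# A computation expressed as a DAG: each node maps to the set of nodes it
# depends on. Node names are purely illustrative.
from graphlib import TopologicalSorter  # standard library, Python 3.9+

dag = {
    "pnl":         {"positions", "prices"},
    "positions":   {"trades"},
    "prices":      {"market_data"},
    "trades":      set(),
    "market_data": set(),
}

# TopologicalSorter raises CycleError if the graph contains a cycle; otherwise
# static_order() yields a valid execution order, dependencies first.
print(list(TopologicalSorter(dag).static_order()))
# e.g. ['trades', 'market_data', 'positions', 'prices', 'pnl']
```

Nodes with no path between them can be computed in parallel; anything downstream has to wait for its dependencies.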
Sounds simple enough. So then - after migrating to the cloud, deploying a global data lake and more recently considering developing your own LLM/GPT - why is your P&L still wrong?
The problems facing DAGs
Slowness
One issue with traditional systems - where we have a DAG with multiple interdependent nodes - is that the computation time of the entire DAG is limited by the slowest nodes in the graph. This is because the result at the ultimate end node cannot be computed until all of the nodes it depends on have completed their computations.
If we have a node in our DAG that takes 1ms to compute, and it in turn has 10 upstream dependencies that take 1ms each to compute - then due to parallelism, we should be able to compute the node we want in a total of 2ms.
Now, imagine a single one of the upstream dependencies instead takes 5 seconds to compute. This slowest node now infects all downstream nodes and what took 2ms now takes 5.001 seconds. (Several pseudo-DAGs solve this by taking the ‘latest good value’ of a slow node - which breaks consistency. More on this later.)
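A toy sketch of the effect, using the made-up timings from the example above: the ten dependencies run in parallel, but the single slow node still dictates the total.

```python
# Ten parallel dependencies feed one downstream node; one of them is slow.
import time
from concurrent.futures import ThreadPoolExecutor

def dependency(i):
    time.sleep(5.0 if i == 0 else 0.001)   # one 5s node among ten 1ms nodes
    return i

def downstream(inputs):
    time.sleep(0.001)                      # the 1ms node we actually want
    return sum(inputs)

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=10) as pool:
    inputs = list(pool.map(dependency, range(10)))   # dependencies run in parallel
result = downstream(inputs)
print(f"elapsed: {time.perf_counter() - start:.3f}s")  # ~5.001s rather than ~0.002s
```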
In a more complex DAG, the dependency structure between nodes can be much more intricate, making it harder to determine why performance suffers. The basic principle, however, remains the same: the computation time of the entire DAG is limited by its slowest nodes, because the DAG must wait for all nodes to compute before an answer is available at the ultimate end node. In typical DAG systems it is therefore essential to carefully analyse the dependency structure of a computation DAG and optimise performance by minimising the computation time of the slowest nodes, using techniques such as load balancing, task scheduling and resource allocation.
Versioning
Another issue can be described as ‘graph versioning’. With traditional DAG-like systems, the execution algorithm may be described as follows:
- The required calculation is determined, often from a user request
- The system recursively analyses dependencies, determining which calculations need to take place in order to fulfil the request
- A topological sort takes place, to determine the necessary calculation order
- Schedulers then manage the communication between the different processes in order to make the best use of resources (this flow is sketched in code below)
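A rough sketch of that per-request ‘plan then execute’ flow - compute() and the graph structure are hypothetical stand-ins, and the final step is serialised for brevity:

```python
# A per-request 'plan then execute' sketch of the four steps above.
# compute() and the dag structure are hypothetical stand-ins.
from graphlib import TopologicalSorter

def required_subgraph(dag, target):
    """Step 2: recursively collect the dependencies needed for one request."""
    needed, stack = {}, [target]
    while stack:
        node = stack.pop()
        if node not in needed:
            needed[node] = dag[node]
            stack.extend(dag[node])
    return needed

def execute_request(dag, target, compute):           # step 1: 'target' is the user request
    subgraph = required_subgraph(dag, target)
    order = TopologicalSorter(subgraph).static_order()    # step 3: topological sort
    results = {}
    for node in order:                                # step 4: scheduling (serialised here;
        deps = [results[d] for d in subgraph[node]]   #  a real scheduler parallelises)
        results[node] = compute(node, deps)
    return results[target]
```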
While this is taking place, new requests are continually landing - either on the same ‘tick’ (requiring an update to the previous topo-sort strategy) or for different ‘ticks’ (requiring segregated execution strategies). These requests may arrive at a rate of hundreds per second per ‘tick’, over hundreds of ticks (as the state of the world is continually mutating). In traditional DAG-like systems, the above strategy therefore quickly becomes inefficient: previously completed topological sorts are rendered out of date as soon as they finish, and re-planning comes to dominate the runtime for little gain.
The above strategy does work well, but only in the following circumstances:
- Only one request is handled at a time: nothing new is planned until the current request has completed.
- The state of the world does not change while execution takes place.
- There is nothing to gain from cooperation between requests, either on the same data version or for subsections of requests where intermediate results are guaranteed to produce the same output.
Consistency
One common complexity that leads to consistency issues in computation is the problem of race conditions. Race conditions occur when two or more computations access the same data source in an uncontrolled manner, and the time at which they access the data affects the outcome of the computation. This can result in inconsistent results or errors in the final output.
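A classic, minimal illustration of a race condition - nothing specific to the systems discussed here - is two threads performing an uncoordinated read-modify-write on the same value:

```python
# Two threads perform an uncoordinated read-modify-write on the same value,
# so the final result depends on how their operations interleave.
import threading
import time

counter = 0

def add_many(n):
    global counter
    for _ in range(n):
        tmp = counter        # read
        time.sleep(0)        # yield, letting the other thread interleave
        counter = tmp + 1    # write back a possibly stale value

threads = [threading.Thread(target=add_many, args=(1_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # typically far below the expected 2000 - updates were lost
```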
DAGs solve this problem by enforcing a specific order of execution for the computations. In a DAG, each node represents a specific computation, and the edges between nodes represent the dependencies between those computations. This means that a computation cannot begin until all of its antecedent computations have completed, ensuring that the data it requires is consistent and up-to-date.
By enforcing this order of execution, true DAGs eliminate the possibility of race conditions and ensure that computations are performed correctly and consistently. Unfortunately, in practice this is rarely the case.
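For contrast with the race above, the same accumulation expressed as nodes with an explicit dependency is deterministic - a minimal sketch rather than a production scheduler:

```python
# Two nodes with an explicit dependency: 'step_b' cannot start until 'step_a'
# has completed, so no interleaving is possible.
from graphlib import TopologicalSorter

dag = {"step_a": set(), "step_b": {"step_a"}}
state = {"counter": 0}

def run(node):
    state["counter"] += 1_000   # exclusive access: nothing else runs concurrently

for node in TopologicalSorter(dag).static_order():
    run(node)

print(state["counter"])  # always 2000, every run
```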
The pseudo-solution: pseudo-DAGs
In an organisation, a patchwork of systems - Excel sheets, independent databases and more - can introduce a "pseudo-DAG" problem, which leads to consistency issues in computation. The problem arises when computations are performed across different systems asynchronously, without any guarantee that they will complete in a particular order.
For example, suppose one department uses an Excel sheet to perform a set of calculations, and another department uses an independent database to perform a different set of calculations that depend on the results of the Excel sheet. If these computations run asynchronously, nothing ensures that the database computations wait for the Excel sheet computations to complete before starting, or vice versa. As a result, the database computations may start with incomplete or incorrect data, leading to wrong numbers and bad business decisions.
Making matters worse, the database is likely fed from several Excel sheets or upstream systems, each in a state inconsistent with the others. In these cases, the data in your database, and everything derived from it, depends on the input data being fed into each upstream system - which is more often than not a function of timing. That timing is in turn affected by conditions such as the hardware in use and the load currently placed on it.
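A toy simulation of this timing dependence, with invented system names, values and delays: two upstream feeds refresh a shared store on their own schedules, and a downstream report simply reads whatever happens to be there when it runs.

```python
# Two upstream 'systems' refresh a shared store on their own schedules; the
# downstream report reads whatever is present when it runs.
import random
import threading
import time

store = {"sheet_a": 0.0, "sheet_b": 0.0}   # stands in for the shared database

def upstream(name, value, delay):
    time.sleep(delay)                      # lands whenever its own process finishes
    store[name] = value

def downstream_report():
    time.sleep(0.05)                       # runs on its own schedule, no ordering guarantee
    return store["sheet_a"] + store["sheet_b"]

feeds = [
    threading.Thread(target=upstream, args=("sheet_a", 100.0, random.uniform(0, 0.1))),
    threading.Thread(target=upstream, args=("sheet_b", 50.0, random.uniform(0, 0.1))),
]
for f in feeds:
    f.start()
print(downstream_report())   # 0.0, 50.0, 100.0 or 150.0 - purely a function of timing
for f in feeds:
    f.join()
```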
A pseudo-DAG will happily compute away, unaware of the inconsistencies and cycles that we have introduced, and the poor decisions that will be made from downstream calculations as more pseudo-DAGs are constructed in various applications. And before you know it, your P&L is non-reproducible and wrong.
Ediphy’s core technology
Ediphy is built on a single, procedurally generated distributed computation graph.
No up-front generation of the graph topology is required from our engineers, and the technology supports low-latency tasks such as trading decisions while remaining efficient for longer-running, compute-intensive tasks - without introducing the ‘slowest compute’ waits described earlier.
Our core system uses an uncapped number of ‘higher order’ consistent graphs to alter the structure of thousands of sub-graphs in real time as calculations run. The more workloads being computed in parallel across different tasks - often in the 100,000s - the more ‘cross borrowing’ of intermediate results can take place across the ultimate causal tasks, improving execution time.
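The general idea of cross borrowing can be sketched as memoising intermediate results keyed by node and data version, so that concurrent requests on the same tick share work instead of recomputing it - an illustrative sketch only, with an invented graph and versioning scheme, not a description of Ediphy’s engine:

```python
# Intermediate results keyed by (node, data_version): concurrent requests on
# the same version share work instead of recomputing it.
from functools import lru_cache

DAG = {"pnl": ("positions", "prices"), "positions": (), "prices": ()}
COMPUTE = {
    "pnl":       lambda inputs, version: sum(inputs),
    "positions": lambda inputs, version: 10.0 * version,
    "prices":    lambda inputs, version: 2.0 * version,
}

@lru_cache(maxsize=None)
def value(node, data_version):
    inputs = [value(dep, data_version) for dep in DAG[node]]   # reuse cached sub-results
    return COMPUTE[node](inputs, data_version)

# A second request on version 1 borrows the already-computed intermediates;
# a request on version 2 computes fresh, isolated values.
print(value("pnl", 1), value("positions", 1), value("pnl", 2))   # 12.0 10.0 24.0
```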
We also lean on reproducibility to parallelise pending calculations further, allowing them to continue while the system ticks into the future - processing new calculations from newer data while ‘older’ calculations complete in isolation.
This technology is delivered transparently to our users via our web front-end, with calculations and graph structures created just-in-time for any ad-hoc compute required. This design has paid off by allowing small, high-performing teams of engineers to outperform our expectations. In turn, this enables us to move faster while delivering more to our users at a lower cost - with faster turnaround times - and correct P&Ls.