TALL Stack: Nested Sets: Dealing with Hierarchical Data

Date Created: 03-Jul-2022
(Last Updated: 11-Nov-2024)

Nesting Parapolybia nodosa

The Problem

The first element to the problem, is that you hierarchical data that need to access in a hierarchical manner. For my site, I have three tables that fall into this category:

  • article_categories
  • checklist_categories
  • taxa

The second part is that either you are using a database that doesn't support hierarchical queries (e.g. MySQL 5.7) or your dataset is so large/nested that hierarchical queries become too inefficient. There is a possibility this could happen with my taxa table but I would have to upload a lot more observations for that to occur.

In my case, my site is on a shared hosting service that uses MySQL 5.7, and I cannot change that (at the moment).

Modelling Hierarchical Data

Adjacency List Model

The adjacency list model provides the simplest way to model hierarchical data because you simply need to provide a self-referencing relationship on the table, from each child record to its parent, with top-level records having a null value. This is very easy to maintain but difficult to traverse the hierarchy, for example, to retrieve all descendants of a particular record. Many modern databases include queries that alleviate that difficulty. However, I am using MySQL 5.7, which does not and, currently, I have no way of upgrading to MySQL 8, which supports recursive queries.

Without recursive queries, the only way to traverse the hierarchy is to develop stored procedures that will carry out multiple queries to build the required result set.

Nested Sets Model

The nested sets model offers an alternative approach, where every node in a hierarchy has a start and end value for the set of nodes it contains (i.e. that are descendants of it, and including itself). It's easiest to explain this by showing it.

Nested sets explanation

You can see in the diagram that every node has a number before and after it. These are the bounds of its set. The values are often referred to as 'left' and 'right' but I prefer 'start' and 'end', so will be using those throughout.

The blue arrows in the diagram show how the set values are worked out. Basically, you 'travel' the hierarchy prefering down to right, and right to up. That is, if you can go down, you go down to the next level descendant. If there are no more descendants, you go right to the next sibling. If there are no siblings, you go up to the parent. At each step, you add one to the number. The blue numbers in the boxes are the node numbers (i.e. the order in which the nodes are added to the nested set) and the red numbers are the node depths. These are useful for data conversion (see below).

The result is a very useful bunch of numbers. All of the following are very easy (and inexpensive, as long as the set values are indexed):

  • To obtain all of the descendants of a node, you just need all those whose set start value is between the node's start and end values.
  • To obtain all of the ancestors of a node, you just need all those whose set start value is less than and the end value greater than node's start and end values, respectively.
  • To sort items in the 'correct' sequence, sort them by the set start value.
  • To find leaf nodes (i.e. those with no child), just compare the start and end values (if end - start == 1 it's a leaf node).

All update operations that affect the set values are more difficult than with the adjacency list model but inserts and deletes are fairly straightforward, albeit less efficient because multiple database rows need to be updated:

  • When inserting a node, all set values (starts and ends) that are at least the value of the start of the new node need to be incremented by 2.
  • When deleting a node, all set values (starts and ends) that are at least the value of the start of the removed node need to be decreased by the 'width' of the removed node. The width is start - end + 1. This copes with deleting a node that has descendants.

However, all of the following are not so easy:

  • retrieving the direct descendants of as node
  • determining the depth of a node (i.e. how far down the hierarchy it is)
  • moving a node within the hierarchy.

A Hybrid Solution

I have decided to combine the two models. That is, I am using nested sets but I am retaining the parent_id column of the adjacency list model to allow easy retrieval of direct descendants. This is something I use a lot with my tree view control. I have also decided to include a depth column because it is very easy to determine when inserted data but much harder to determine when querying data. I should point out that all three of my hierarchies are likely to be queried far more often than they are changed.

The process of moving sets is much more complex, so I will explain that in more detail in words and show some diagrammatic examples of what is going on. If you don't want to read that, you can skip straight to The Code.

Moving Sets

In my situation, every hierachical table has a column that enables items to be sorted within their parent. In the case of taxa, this is the scientific name of each taxon. For article and checklist categories there is a sequence_no column. I use a generic term for the property the holds this: $sortChildrenBy. My move processing works based on knowing the new values for parent_id and $sortChildrenBy of the moved node. For categories moving to a new parent, I always place them as the last child.

Basically, I move nodes (and their descendants) in four steps:

  1. Determine the new start and end values of the of the node that is being moved.
  2. Move the node completely out of the set by adding a large number to its start and end values, and those of its descendants
  3. Move other nodes into their new positions within the nested set, to make room moved node's new position.
  4. Move the node (and its descendants) to its new position.

The Start and End Values

First, I determine if the new position of the moved node has a following sibling. In other words, is it going as the last sibling or being inserted before another sibling. This has to cope with moving a node to the top level (i.e. with no parent).

If there is no parent and no following sibling, the node must be being moved to the very end of the set. Therefore, as the whole set is remaining the same size, its new end value will be the same as the current maximum end value of the set.

If there is a parent but no following sibling, the node is being moved as the last sibling of the parent. If the end value of the new parent is higher than the current end value of the moved node, the node is moving up in the set and its new end value will be the end value of the new parent minus 1. Otherwise, the node is moving down in the set and its new start will be equal to the current end value of the new parent. This might seem odd but it is all to do with the fact that the values of other nodes will change to accommodate the move. The examples below should help to visualise what is going on.

The remaining situation is where there is a parent and a following sibling. If the current end value of the following sibling is higher than the current end value of the moved node, the new end value will be the start value of the following sibling minus 1. Otherwise, the start value will be the current start value of the next sibling.

At this stage, either the new start or the new end of the moved node is known. The other value is easily determined by applying the width of the node.

Moving the node out of the way

This is a very simle process of adding a large number to the start and end values of the node that is being moved and all of its descendants. These are all nodes whose start value is between the start and end value of the moved node (inclusive). The 'large number' should be large enough to guarantee that the modified set values will not conflict with any real set values. I use 1,000,000.

The purpose of this step is to avoid any complications with modifying overlapping sets during the transition from the original to the final states. In other words, by moving the moved subset completely out of the way, the following two steps become significantly simpler.

Moving the other nodes

The first thing to note is that the set values that need to change are constrained by the range of movement of the new node. If the node is moving up in value, the range is from its old start value to its new end value. If it is moving down in the set, the range is from its new start value to its old end value. No start or end value outside this range needs to be changed. This can be seen visually in the example diagrams (below) because the affected set values are shown in bold.

All we need to do is adjust every node value in the range by the width of the moved node. If the node is moving up in value, we subtract the width. If it is moving down, we add the width. This as the effect of leaving a 'hole' of the correct width in which to slot the moved node.

Moving the node to its new position

To move the node we just need to apply the distance it has moved to it and all its descendants. The distance is the difference between the original and new start values. If it's moving up, we add the distance; otherwise, we subtract it. Obviously, we also need to substract the 1,000,000 we added earlier.

Nested Sets Example

Here, I provide a full example that shows how the set start and end values change based on different operations. If you are not interested in the examples, you can skip to The Code.

In this example, I start with 5 nodes in the hierarchy and go through a number of steps adding new nodes and moving nodes around. In all diagrams: added nodes are filled with blue; moved nodes are green; and changed set values are bold.

Start Position

Start position

Step 1: Add F under C

NodeF is added under NodeC. Therefore, startF = startC + 1 = 5. To accommodate NodeF, add 2 to all the set values of 5 or higher, before adding NodeF.

Step 1

Step 2: Add G under F

NodeG is added under NodeF. Therefore, startG = startF + 1 = 6. To accommodate NodeG, add 2 to all the set values of 6 or higher, before adding NodeG.

Step 2

Step 3: Add H under F and I under E

For brevity, I have included two actions in this step because they are more-or-less independent of each other. I will add NodeH before adding NodeI.

NodeH is added under NodeF and after NodeG. Therefore, startH = endG + 1 = 8. To accommodate NodeH, add 2 to all the set values of 8 or higher, before adding NodeH.

NodeI is added under NodeE. Therefore, startI = startE + 1 = 16 because startE was incremented by 2 when NodeH was added. To accommodate NodeI, add 2 to endE, which is the only value of 16 or higher, before adding NodeI.

Step 3

Step 4: Move H before G

After the move, following sibling of NodeH will be NodeG, and the current end value of NodeG is less than that of NodeH. Therefore, newStartH = currentStartG = 6, and newEndH = 7. After moving NodeH out of the way, add its width (2) to NodeG, the only other node within the affected range. Then subtract the distance of the move (also 2) and subtracts 1,000,000 from the values of NodeH. The result is that they simply swap places.

Step 4

Step 5: Move F under E

There is no following sibling but there is a parent. Therefore, newEndF = currentEndE - 1 = 17, and newStartF = newEndF - widthF + 1 = 17 - 6 + 1 = 12. The affected range is currentStartF to newEndF (5 to 17). After moving NodeF, NodeG and NodeH out of the way, subtract widthF from all values between 5 and 17 (inclusive). Then add the distance of the move (12 - 5 = 7) and 1,000,000 from the values of NodeF, NodeG and NodeH.

Step 5

Step 6: Add J under H

NodeJ is added under NodeH. Therefore, startJ = startH + 1 = 14. To accommodate NodeJ, add 2 to all the set values of 14 or higher, before adding NodeJ.

Step 6

Step 7: Move F under A

There is no following sibling but there is a parent. Therefore, newStartF = currentEndA = 8, and newEndF = newStartF + widthF - 1 = 8 + 8 -1 = 15 (note the width of NodeF is now 8 since NodeJ was added). The affected range is newStartF to currentEndF (8 to 19). After moving NodeF, NodeG, NodeH and NodeJ out of the way, add widthF to all values between 8 and 19 (inclusive). Then subtract the distance of the move (12 - 8 = 7) and subtract 1,000,000 from the values of NodeF, NodeG, NodeH and NodeJ.

Step 7

Step 8: Move D to after F

There is no following sibling but there is a parent. Therefore, newEndD = currentEndA - 1 = 15, and newStartD = newEndD - 1 = 14. The affected range is currentStartD to newEndD (6 to 15). After moving NodeD out of the way, subtract widthD from all values between 5 and 17 (inclusive). Then add the distance of the move (15 - 6 = 9) and 1,000,000 from the values of NodeD.

Step 8

Step 9: Move F to before C

The following sibling is NodeC, and the current end value of NodeC is less than that of NodeF. Therefore, newStartF = currentStartC = 4, and newEndF = newStartF + widthF - 1 = 4 + 8 -1 = 11. After moving NodeF out of the way, add its width (8) to NodeC, the only other node within the affected range. Then subtract the distance of the move (6 - 4 = 2) and 1,000,000 from the values of NodeF, NodeG, NodeH and NodeJ.

Step 9

The Code

Here is the HierarchicalModel class, which extends Illuminate\Database\Eloquent\Model.

abstract class HierarchicalModel extends Model
{
    protected static string $sortChildrenBy;

    /**
     * Get the parent model of the model.
     */
    public function parent()
    {
        return $this->belongsTo(static::class, 'parent_id');
    }

    /**
     * Get the child models of the model.
     */
    public function children()
    {
        $relations = $this->hasMany(static::class, 'parent_id');
        $relations->orderBy('set_start', 'ASC');
        return $relations;
    }

    /**
     * Determine if a model has any child model.
     */
    public function hasChild()
    {
        return $this->set_end - $this->set_start > 1;
    }

    /**
     * Get the ids of all descedants of the current model
     */
    public function getDescendantsAttribute()
    {
        return static::where('set_start', '>', $this->set_start)
            ->where('set_end', '<', $this->set_end)
            ->orderBy('set_start', 'asc')
            ->get();
    }

    /**
     * Get the ancestry of the current model
     */
    public function getAncestorsBottomUpAttribute()
    {
        return static::where('set_start', '<', $this->set_start)
            ->where('set_end', '>', $this->set_end)
            ->orderBy('set_start', 'desc')
            ->get();
    }

    /**
     * Get the ancestry of the current model
     */
    public function getAncestorsAttribute()
    {
        return static::where('set_start', '<', $this->set_start)
            ->where('set_end', '>', $this->set_end)
            ->orderBy('set_start', 'asc')
            ->get();
    }

    /**
     * Get the siblings of the current model
     */
    public function getSiblingsAttribute()
    {
        if (empty($this->parent_id)) {
            return static::whereNull('parent_id')->get();
        }
        return $this->parent->children;
    }

    /**
     * Deals with hierarchical model and nested set specific functionality
     * Then calls the parent save method.
     *
     * @return bool
     */
    public function save(array $options = []): bool
    {
        try {
            /* If necessary, set the nested set start and end values */
            if ($this->isDirty(['parent_id', static::$sortChildrenBy])) {
                $sortChildrenBy = static::$sortChildrenBy;

                /* Find the next sibling of the new position of this node */
                if (empty($this->parent_id)) {
                    $baseQuery = static::whereNull('parent_id');
                    $this->depth = 1;
                } else {
                    $baseQuery = static::where('parent_id', $this->parent_id);
                    $this->depth = $this->parent->depth;
                }
                $nextSibling = $baseQuery->where($sortChildrenBy, '>', $this->$sortChildrenBy)->first();

                /*
                 * If updating, move the node (and any descendants).
                 * If adding, make space in the set for the new node
                 */
                if ($this->exists) {
                    /* Determine the new position of the moved node(s) */
                    if (empty($this->parent_id) && empty($nextSibling)) {
                        /* Move to be the last top level node */
                        $newEnd = static::max('set_end');
                    } else {
                        if (empty($nextSibling)) {
                            /* If we don't have a next sibling, we add as the last child of the parent */
                            $movingUp = $this->parent->set_end > $this->set_end;
                            if ($movingUp) {
                                $newEnd = $this->parent->set_end - 1;
                            } else {
                                $newStart = $this->parent->set_end;
                            }
                        } else {
                            $movingUp = $nextSibling->set_end > $this->set_end;
                            if ($movingUp) {
                                $newEnd = $nextSibling->set_start - 1;
                            } else {
                                $newStart = $nextSibling->set_start;
                            }
                        }
                    }
                    $width = $this->set_end - $this->set_start + 1;

                    if (isset($newEnd)) {
                        $newStart = $newEnd - $width + 1;
                    } else {
                        $newEnd = $newStart + $width - 1;
                    }

                    if ($newStart !== $this->set_start) {
                        /* Move the current node and descendants right out of the way. */
                        static::whereBetween('set_start', [$this->set_start, $this->set_end])
                            ->update([
                                'set_start' => DB::raw('set_start + 1000000'),
                                'set_end' => DB::raw('set_end + 1000000')
                            ]);

                        /* Adjust other nodes to make space for the moved subset */
                        $distance = $newStart - $this->set_start;
                        /**
                         * If distance is positive, we are moving up the set
                         * If we are moving up, we need to adjust all values between the old start and the new end.
                         * If we are moving down, we need to adjust all values between the new start and the old end.
                         */
                        if ($distance > 0) {
                            $rangeStart = $this->set_start;
                            $rangeEnd = $newEnd;
                            $adjustment = $width;
                        } else {
                            $rangeStart = $newStart;
                            $rangeEnd = $this->set_end;
                            $adjustment = $width * -1;
                        }
                        /* We substract the adjustment. If it is negative, it will increase the values of the other nodes */
                        static::whereBetween ('set_start', [$rangeStart, $rangeEnd])
                            ->update(['set_start' => DB::raw("set_start - $adjustment")]);
                        static::whereBetween ('set_end', [$rangeStart, $rangeEnd])
                            ->update(['set_end' => DB::raw("set_end - $adjustment")]);

                        /* Move the current node and descendants to its new location */
                        static::where('set_start', '>', 1000000)
                            ->update([
                                'set_start' => DB::raw("set_start - 1000000 + {$distance}"),
                                'set_end' => DB::raw("set_end - 1000000 + {$distance}")
                            ]);

                    }
                } else {
                    /* Inserting */
                    if (empty($this->parent_id) && empty($nextSibling)) {
                        /* Move to be the last top level node */
                        $newStart = static::max('set_end') + 1;
                    } else {
                        if (empty($nextSibling)) {
                            $newStart = $this->parent->set_end;
                        } else {
                            $newStart = $nextSibling->set_start;
                        }
                    }
                    $newEnd = $newStart + 1;
                    /* Make room for the new node */
                    static::where('set_start', '>=', $newStart)
                        ->update(['set_start' => DB::raw("set_start + 2")]);
                    static::where('set_end', '>=', $newStart)
                        ->update(['set_end' => DB::raw("set_end + 2")]);
                }
                $this->set_start = $newStart;
                $this->set_end = $newEnd;
            }
            return parent::save($options);

        } catch ( \Throwable $ex ) {
            Log::error($ex->getMessage());
            return false;
        }
    }

    public function delete() {
        $width = $this->set_end - $this->set_start + 1;
        /* Substract the width from all higher node values */
        static::where('set_start', '>', $this->set_start)
            ->update(['set_start' => DB::raw("set_start - $width")]);
        static::where('set_end', '>', $this->set_end)
            ->update(['set_end' => DB::raw("set_end - $width")]);
        parent::delete();
    }
}

I am not going to explain much here because all the explanation of how it works is in the Moving Sets section above, and in the comments in the code. The parent() and children() methods make use of the adjancency list model. All of the others use or maintain the nested sets model. The code for both adding and moving nodes is within the save() method, for consistency with the use of the Model::save() method, which is used for both inserts and updates.

I will emphasise a couple of key points about the way I use this. Each subclass of HierarchicalModel specifies a value for the $sortChildrenBy static property, which must match a column in the underlying table that has a unique value within each parent. In the case of categories, the $sortChildrenBy is 'sequence_no'. When I move a category, I set its new parent_id and/or sequence_no and then call the save() method. For taxa, the $sortChildrenBy is 'scientific_name'. The setting of this is out of my control, because the data come from iNaturalist. So the parent_id and scientific_name are set based on the incoming data, and I call the save() method as normal.

Converting the data to the nested sets model

I should start by saying that all of my hierarchical tables already had a sort_path column, with data made from the sort_path of the parent concatenated with '.' followed by the value of the the $sortChildrenBy column. In other words, for categories, it was the series of sequence_no for the current category and its ancestors delimited by '.'. For taxa, it is the same but using scientific names. This helps a lot in the calcution of the start and end values. If you want use the migration code below, you will also need an equivalent column.

Nested sets explanation

The set start values are determined by this simple formula:

set_start = 2 * node_number - node_depth

Where:

  • node_number is the blue number in the boxes in the diagram
  • node_depth is the red number

To determine the depth of each node, I added a generated column to each table to determine the depth based on the sort_path using this derivation:

CHAR_LENGTH(sort_path) - CHAR_LENGTH(REPLACE(sort_path, '.', '')) + 1

This basically counts the number of '.' delimiters in the path and adds 1 to get the depth.

The set end values are determined by this simple formula:

set_end = child_node_count * 2 + set_start + 1

Where child_node_count is the count of descendants of the node.

I implemented all of the above in a single database migration for each table. Here is the migration code for the taxa table:

public function up()
{
    Schema::table('taxa', function (Blueprint $table) {
        $table->unsignedTinyInteger('depth')
            ->storedAs("CHAR_LENGTH(sort_path) - CHAR_LENGTH(REPLACE(sort_path, '.', '')) + 1")
            ->after('sort_path');
        $table->unsignedSmallInteger('set_start')->nullable()->after('parent_id');
        $table->unsignedSmallInteger('set_end')->nullable()->after('set_start');
        $table->index('set_start');
        $table->index('set_end');
    });
    DB::statement(DB::raw('SET @row_number = 0'));
    DB::statement(DB::raw(
        'UPDATE taxa a
        INNER JOIN (
            SELECT
                id,
                (@row_number := @row_number + 1) AS node_number,
                (2 * @row_number) - depth AS set_start
            FROM taxa
            ORDER BY sort_path) AS b ON b.id = a.id
        SET a.set_start = b.set_start'
    ));
    DB::statement(DB::raw(
        'UPDATE taxa a
        INNER JOIN (
            SELECT aa.id, count(1) - 1 as child_node_count
            FROM taxa aa,
            taxa bb
            WHERE substring(bb.sort_path, 1, LENGTH(aa.sort_path)) = aa.sort_path
            GROUP BY aa.id) AS b ON b.id = a.id
        SET a.set_end = (b.child_node_count * 2) + a.set_start + 1'
    ));
}

Incidentally, I decided not to keep the sort_path columns in the two category tables because they are no longer needed because we can no sort by set_start and get the same sequence. For taxa, I still use the sort_path but no longer for sorting. I guess I should rename it to path. As a result, I also changed the depth column so it is no longer generated. It is now maintained in the code, which is very simple (see HeirarchicalModel::save())

Wrapping up

So, that ended up very long and covering quite a bit. I am not apologising for that! I had to read a few different articles to get the source information that helped lead to my solution. I decided I would keep all of this together in one article. I also decided it was worth showing several examples to help exlain the principles of the nested sets model.

Anyway, just to recap, if you got this far (and didn't skip sections) you will have learned:

  • about the adjacency list and nested set models and the potential problems with each.
  • that I decided to use a hybrid of the two models, plus storing a depth column.
  • how to add, delete and move nodes within a nested set.
  • how to convert data to use the nested sets model.

Useful links

The first two articles have links/references to other articles, if you feel the need to dig any deeper. The third is an answer to a Stack Overflow question. Please note that the answer with the most votes (at the time of writing), and the article it links to show the approach I tried and found did not work for me.