This article details how async Python can be used to develop a simple API for managing tree-based data in SQL.

Intro

Tree-based data has often been used to represent the parent-child relationships in the user business processes. There are several well-known models for storing such hierarchical data, for example adjacency list, nested set, closure table, etc. We do not discuss which the best one is, but trying to show how one of those models can be implemented using Python.

Objectives

Let’s develop a simple system that allows user to leave comments. Now it doesn’t matter what exactly the user can comment. The API being developed must satisfy the following most important requirements.

  • The user has to be authorized in order to make other API calls.
  • There is a method to leave a new comment.
  • There is a method to search comments by content.
  • There is a method to retrieve a sub-tree of comment.
  • All the API calls speak JSON.

Let’s define some technical properties of our system. We would use

  • async Python as a backend to handle many concurrent users,
  • PostgreSQL to exploit its full text search,
  • SQLAlchemy to construct SQL queries,
  • Swagger UI as a demonstration GUI for our API.

Closure table

We would use a combination of closure table and adjacency list models. Closure table model usually requires two tables to declare.

import sqlalchemy as sa
 
meta = sa.MetaData()
 
comments = sa.Table(
    'comments_comments', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('content', sa.String, nullable=False)
)
 
comments_tree = sa.Table(
    'comments_tree', meta,
    sa.Column('id', sa.Integer, primary_key=True),
    sa.Column('ancestor_id', sa.Integer, sa.ForeignKey(comments.c.id), nullable=False),
    sa.Column('nearest_ancestor_id', sa.Integer, sa.ForeignKey(comments.c.id), nullable=False),
    sa.Column('descendant_id', sa.Integer, sa.ForeignKey(comments.c.id), nullable=False),
    sa.Column('depth', sa.Integer, nullable=False)
)

So far we have declared the comments table to store user comments, and the comments_tree table to store parent-child comment relationships. Each row in the second one is a relation between two comments. The columns have the following meaning:

  • ancestor_id is an id of the parent comment. It might be not a direct parent, it may be a grand parent or a grand-grand parent for instance. Each comment is an ancestor of itself at the same depth.
  • descendant_id is an id of a child comment.
  • depth is a level describing how deep in the tree the comment is. All the root comments have zero depth.

However, this closure table contains only a plain info about data relations and doesn’t allow to restore the ordered tree structure. To make it possible we will add an additional column.

  • nearest_ancestor_id is an id of the direct parent of a comment.

Now it’s possible to reproduce an ordered tree structure using info about a relation between a comment and its direct parent.

Let’s see how to fill out these tables by example. To create the first comment we’re just putting a text into comments table.

id content
1 Can anybody explain me the difference between Bitcoin and Etherium?


We would also mark that comment as a parent (or ancestor) of itself.

id ancestor_id nearest_ancestor_id descendant_id depth
1 1 1 1 0


To create a child comment we additionally put a relation with its grand parent (a non direct ancestor with non-zero depth). The tables are changed as shown below.

id content
1 Can anybody explain me the difference between Bitcoin and Etherium?
2 I'm sure no one can.


id ancestor_id nearest_ancestor_id descendant_id depth
1 1 1 1 0
2 2 2 2 1
3 1 1 2 1


Now we can see that the number of the relations and the depth are growing together. This is why the closure table model has often been criticized. Obviously, in order to add an element with high depth value to the tree, we will have to store lots of relations of the element and all its ancestors.

But it is not as bad as it seems. In a real case adding an element with a depth of 200 or 1000 may not affect the performance of a server in a serious way. In fact, such a situation probably will never happen (except when we are commenting the news about our national team losing a soccer match). In many cases, the time of data selection is more significant then the insertion time.

Back to our example. The full comment branch is looking like that.

1   Can anybody explain me the difference between Bitcoin and Etherium?
2      I'm sure no one can.
6           Why are you so sure?
7                This is a blog about cooking!
3        The same request.
4             No problem, I'll give an explanation paper later.
5                   Send me it by email then.


Let’s see how this branch will be represented in the tables.

id content
1 Can anybody explain me the difference between Bitcoin and Etherium?
2 I'm sure no one can.
3 The same request.
4 No problem, I'll give an explanation paper later.
5 Send me it by email then.
6 Why are you so sure?
7 This is a blog about cooking!


id ancestor_id nearest_ancestor_id descendant_id depth
1 1 1 1 0
2 2 2 2 1
3 1 1 2 1
4 3 3 3 1
5 1 1 3 1
6 4 4 4 1
7 1 1 4 1
8 5 5 5 2
9 4 4 5 2
10 1 4 5 2
11 6 6 6 2
12 2 2 6 2
13 1 2 6 2
14 7 7 7 3
15 6 6 7 3
16 2 6 7 3
17 1 6 7 3


As it shown above, there is a root comment with zero depth within the branch and the max depth (or level) of comments is 3. Now let’s see how we can deal with these tables in Python using SQLAlchemy.

SQL queries

Inserting a node to the tree

Filling comments_tree table can be done by using SQL INSERT INTO SELECT statement. Using SQLAlchemy it can be written like shown below.

import sqlalchemy as sa
 
ancestor = comments_tree.alias('ancestor')
descendant = comments_tree.alias('descendant')
nearest = comments_tree.alias('nearest')
 
sel = sa.select([
    descendant.c.ancestor_id,
    nearest.c.nearest_ancestor_id,
    ancestor.c.descendant_id,
    nearest.c.depth + 1
]).where(sa.and_(
    descendant.c.descendant_id == parent_id,
    ancestor.c.ancestor_id == comment_id,
    nearest.c.ancestor_id == parent_id,
    nearest.c.descendant_id == parent_id,
))

ins = comments_tree.insert().from_select([
    comments_tree.c.ancestor_id,
    comments_tree.c.nearest_ancestor_id,
    comments_tree.c.descendant_id,
    comments_tree.c.depth,
], sel)

That code produces the following SQL query.

INSERT INTO comments_tree
    (ancestor_id, nearest_ancestor_id, descendant_id, depth)
SELECT
    descendant.ancestor_id,
    nearest.nearest_ancestor_id,
    ancestor.descendant_id,
    nearest.depth + 1
FROM
    comments_tree AS descendant,
    comments_tree AS nearest,
    comments_tree AS ancestor
WHERE descendant.descendant_id = PARENT_ID
    AND ancestor.ancestor_id = COMMENT_ID
    AND nearest.ancestor_id = PARENT_ID
    AND nearest.descendant_id = PARENT_ID

Selecting a sub-tree

Selecting the whole comment branch can be done like that.

import sqlalchemy as sa
 
join = comments.join(
    comments_tree, comments.c.id == comments_tree.c.descendant_id
)

sel = sa.select([
    comments_tree.c.nearest_ancestor_id,
    comments.c.id,
    comments.c.content,
]).select_from(join).where(comments_tree.c.ancestor_id == comment_id)

The corresponding SQL query.

SELECT
    comments_tree.nearest_ancestor_id,
    comments_comments.id,
    comments_comments.content
FROM
    comments_comments
JOIN comments_tree ON comments_comments.id = comments_tree.descendant_id
WHERE comments_tree.ancestor_id = COMMENT_ID

Deleting a sub-tree

To delete a comment and all its sub-comments we have to:

  • figure out and remove all the relations in which the target comment is involved (meaning it’s an ancestor or a nearest ancestor or a descendant),
  • remove all the comments for which the target comment is an ancestor.

Using SQLAlchemy we can write it like that.

import sqlalchemy as sa
 
remove = comments_tree.alias('remove')
descendant = comments_tree.alias('descendant')
 
sel = sa.select([remove.c.descendant_id, remove.c.id]).where(sa.and_(
    sa.or_(
        remove.c.ancestor_id == descendant.c.descendant_id,
        remove.c.nearest_ancestor_id == descendant.c.descendant_id,
        remove.c.descendant_id == descendant.c.descendant_id,
    ),
    descendant.c.ancestor_id == comment_id,
))

SQL query.

SELECT
    remove.descendant_id,
    remove.id
FROM
    comments_tree AS remove,
    comments_tree AS descendant
WHERE (remove.ancestor_id = descendant.descendant_id
    OR remove.nearest_ancestor_id = descendant.descendant_id
    OR remove.descendant_id = descendant.descendant_id)
    AND descendant.ancestor_id = COMMENT_ID

That query will select all the ids of the comments (the first column) and all the ids of the comment relations (the second column) we need to delete.

The details

Schema changes

It’s most likely that we’ll want to show more info about comments, not only a text. For instance, we may want to display a name of the author of the comment and the time when comment has been left. It will cause the schema changes to be applied. We would probably want to use one of the database schema management systems like SQLAlchemy Migrate to simplify that.

Executing SQL

Due to async Python we would use aiopg to operate upon PostgreSQL. We can init and shutdown a connection pool as shown below.

import aiopg.sa
from aiohttp import web
 
async def setup_pg(app):
    engine = await aiopg.sa.create_engine(DATABASE)
    app['db'] = engine

async def close_pg(app):
    app['db'].close()
    await app['db'].wait_closed()

app = web.Application()

app.on_startup.append(setup_pg)
app.on_cleanup.append(close_pg)

PostgreSQL supports full-text search, we can exploit it using SQLAlchemy.

import sqlalchemy as sa
from sqlalchemy.sql.expression import func
 
sel = sa.select([comments]).where(comments.c.content.match(
    sa.cast(func.plainto_tsquery('query text'), sa.TEXT)
))

That will produce the following SQL query.

SELECT
    comments_comments.id,
    comments_comments.content
FROM
    comments_comments
WHERE
    comments_comments.content @@ to_tsquery(
        CAST(plainto_tsquery('query text') AS TEXT)
    )

This is a basic full-text search usage. For more complex examples see the links below.

User auth

To avoid unauthorized access to the API we would require an auth token being passed as a parameter to each call. We can use JSON Web Tokens to implement that. To figure out how to use JWT in Python, follow the links below.

Swagger UI

To provide a demonstration GUI for the API we can use aiohttp-swagger.

from aiohttp import web
from aiohttp_swagger import setup_swagger
 
app = web.Application()

setup_swagger(app, swagger_from_file='swagger.yaml')

Conclusion

In this article we’ve developed a simple API that allows user to leave comments. We’ve learned how to use a combination of closure table and adjacency list models to work with tree-based data. The source code of the project is available in the repository on GitHub.

Links