Using Peewee's Expression to Perform Array Operations
This is a story about problem solving. True, it involves Python and database code, but eventually what I’m describing is running into a problem, understanding its root cause and finding a creative way to solve it.
To understand this blog post, all you need to know is some SQL. I’ll try to take care of the rest.
Entities & Schema
For the sake of simplicity, we’ll be talking about Books and Authors. The relation between them is many-to-many: a book may have more than one author, and an author can write more than one book.
The data is stored in a PostgreSQL database in 3 tables: Book, Author, and BookToAuthor. Each entity (book and author) has an ID, and the last table simply maps between these IDs.
book_id | author_id |
---|---|
1 | 3 |
2 | 9 |
3 | 6 |
3 | 9 |
The database code is written in Peewee. Peewee is an ORM library in Python, which means it converts data from certain SQL databases (PostgreSQL is among them) to Python objects and vice versa. When you query the database, you do it by writing object-oriented code instead of raw SQL statements.
The Task
The task I had to perform was simple: find all books that were written by a specific list of authors, e.g. find all books whose authors are Jane Austen, J.K. Rowling and Harper Lee (all three!).
To do that, I performed a two-level query:
- First, for each book ID, aggregate all its author IDs into an array (as seen in the table below).
- Then, return only the books whose aggregated array equals the given array.
book_id | author_ids |
---|---|
1 | [3] |
2 | [9] |
3 | [6, 9] |
This was my code for the first step:
# First-Level Query
subquery = BookToAuthor.select(
BookToAuthor.book_id,
fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)
In SQL, this translates into:
SELECT book_id,
ARRAY_AGG(DISTINCT author_id) AS author_ids
FROM BookToAuthor
This would return an aggregated array of author IDs per book.
Now the code for the whole process. Before reading it, notice that:
- The
certain_authors
variable is a list of the author IDs I’m interested in, say [6, 9]. - The
.c
operator is Peewee’s way of accessing a subquery result’s column. contains
is a Peewee method that enables to test if a database array contains the given elements.
# First-Level Query
subquery = BookToAuthor.select(
BookToAuthor.book_id,
fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)
# Second-Level Query
results = BookToAuthor.select(
BookToAuthor.book_id).join(
# Join with subquery on the book_id column
subquery, on=(subquery.book_id == BookToAuthor.book_id)).where(
# Choose only book IDs whose authors are (certain_authors)
subquery.c.author_ids.contains(certain_authors))
Whoopie doo. This looked good to me.
It wasn’t. or: The Problem
The problem was I didn’t get any results; I got an error.
ProgrammingError: operator does not exist: integer[] ~~* unknown
LINE 1: ...= "t1"."book_id") WHERE ("t2"."author_ids" ILIKE '%[1...
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts.
Peewee was using ILIKE
string operator. It was not treating author_ids
as an array.
Using Peewee’s .sql()
method, I could confirm this by seeing the full query to which my code was translated:
SELECT t1.book_id
FROM BookToAuthor AS t1
INNER JOIN (SELECT t1.book_id,
ARRAY_AGG(DISTINCT x.author_id) AS author_ids
FROM BookToAuthor AS t1
GROUP BY t1.book_id) AS t2
ON (t2.book_id = t1.book_id)
WHERE t2.author_ids ILIKE '%<certain_authors>%'
It was time to see why Peewee chose a string operator - a case-insensitive string match - instead of array-contains.
Debugging
With Pycharm I did a lot of F7 and F8 to step into and over the code.
Eventually I broke on the following method inside the ColumnBase
class:
def contains(self, rhs):
return Expression(self, 'ILIKE', '%%%s%%' % rhs)
Peewee reached a place I didn’t want it to reach - a class for generic column types. In this class, the method
contains
treats its operands as strings.
However, Peewee has a more specific class
named ArrayField
. There, there’s a different implementation for the method contains
which is mapped onto a different
SQL operator:
def contains(self, *items):
return Expression(self, '@>', ArrayValue(self, items))
This was exactly what I needed - the @>
operator which checks if an array contains one or more of another array’s
elements. But Peewee never got there :(
I needed to find a way to override this behavior and tell Peewee that the two operands are arrays, and that contains
should be translated into @>
and not into ILIKE
1.
Using Peewee’s Expression
Peewee allows adding custom expressions. An expression consists of a left-hand and right-hand side operands and an operator.
For example, the last contain
method called an expression which used ArrayField
values as both operands, and
the operator @>
.
And so I added my new, custom expression inside a function:
def array_equals(lhs, rhs):
return Expression(lhs, '@>', rhs)
Good! My new query looked like this:
# First-Level Query (same as before)
subquery = BookToAuthor.select(
BookToAuthor.book_id,
fn.ARRAY_AGG(Book.author_id.distinct()).alias('author_ids')).group_by(Book.book_id)
# Second-Level Query
results = BookToAuthor.select(
BookToAuthor.book_id).join(
subquery, on=(subquery.book_id == BookToAuthor.book_id)).where(
# Only this line changed:
array_equals(subquery.c.author_ids, certain_authors))
I happily ran the query and… meh. An error again:
ProgrammingError: operator does not exist: integer[] @> record
LINE 1: ...= "t1"."book_id") WHERE ("t2"."author_ids" @> (1, 2))
^
HINT: No operator matches the given name and argument types. You might need to add explicit type casts
Look at the query though - the operator changed! Peewee was now building my query with the array-contains operator @>
,
which was a huge step in the process of solving this issue.
However, as the error indicated - something remained incorrect with the arguments’ types. Since the author_ids
operand was seen as an integer array (see the first line in the error message), it must have been the right-hand side
operand - the list literal - that was incorrectly parsed.
One Last Trick
Looking at the examples in the PostgreSQL documentation, I saw
that array operations were done with explicit type declaration - arrays are preceded with the keyword ARRAY
.
Completely impatient but with a strong feeling of almost-victory, I replaced the python variable with a pure SQL expression:
# Only the relevant line-of-code from before...
array_equals(subquery.c.author_ids, SQL('ARRAY{}'.format(certain_authors)))
This was a hacky solution, but it produced my oh-so-desired SQL query:
SELECT t1.book_id
FROM BookToAuthor AS t1
INNER JOIN (SELECT t1.book_id,
ARRAY_AGG(DISTINCT t1.author_id) AS author_ids
FROM BookToAuthor AS t1
GROUP BY t1.book_id) AS t2
ON (t2.book_id = t1.book_id)
WHERE (t2.author_ids @> ARRAY[6, 9])'
My almost-victory was now a victory.
Lessons Learned
This extremely-specific scenario had important takeaways for me to recall in the future:
- Don’t be afraid to debug library code. We often use it as a black box, but in cases like mine it was important to dig in and confirm where the problem lay.
- The idea of using
Expressions
was my colleague’s - Tal Shahnovsky. When things don’t work, I personally tend to get frustrated and think “Nothing will ever solve this”. Someone else’s clear mind may bring some good ideas into the process, give them a chance. - A better knowledge of Peewee would have helped me in coming up with
Expression
as a solution. If you’re working with a library and having trouble - invest some time in getting to know the module and its capabilities better. This may pay off.
If any of you have an idea of how to solve this problem better, more elegantly or just differently, I’ll be pleased to hear from you. Do not hesitate to contact.
1You may think to yourself: “Why not just execute a raw SQL query?". mat
Peewee indeed has this functionality - build a query using a string and execute it with execute_sql
.
However, the resulting rows are returned as tuples and not as objects, which would make my code ugly AF.
But this was what I intended to do as a last resort.