Mastodon Integration As A Function
First post in almost two years and what is it about? Refactoring my Mastodon comment system yet again, this time turning it into a Digital Ocean Function. I know, I know, wasn't it essentially a function with AWS Lambda originally? Yes, it was, but I converted it to ECS because of the NAT and elastic IP requirement.
That limitation, it turns out, does not apply to DO Functions. They have public internet access out of the box at no extra cost. So it's more economical than either my AWS Lambda or AWS ECS approaches.
Goodbye, AWS
With the ECS version, I'd already paid the cost of Docker-ifying the service to escape Lambda's always-on NAT charges. But if you're not running much on ECS, the fixed cost of the ALB, the VPC, and the Elastic IPs is still a significant portion of the bill. Never mind that ECS is kind of painful to interact with. At work I do everything with kubectl, but on AWS the control plane alone starts at over $70 a month.
I've been helping a buddy who runs his sites on Digital Ocean, and the more I looked at their ecosystem, the more impressed I was. I'm still in AWS daily for work, but for my fun projects, DO's price point and lack of friction have me moving things over a piece at a time.
The appeal of DO Functions over the ECS service isn't just cost — it's the elimination of infrastructure entirely. No container registry, no ECS service, no load balancer, no rolling deployment to manage. The function is just a directory of Python and a project.yml. This was also the promise of Lambda, but in typical AWS fashion, the friction is much higher and the NAT requirement made it unsustainable for my needs.
Hello, DO
While I'm moving the hosting of this to DO Functions, the existing Flask Docker container remains functional, and I added back AWS Lambda support, since Lambda is close enough to Functions. So there are now three ways to deploy this code.
Converting from Flask
DO Functions are built on Apache OpenWhisk. The entry point is a plain main(args) function rather than a web framework, and query parameters arrive as keys in the args dict. The conversion from Flask is almost mechanical:
# Flask
@app.route('/')
def postblog():
    url = request.args.get('url')
    ...
    return (
        dict(host=MASTODON_HOST, user=MASTODON_USER, toot_id=str(toot_id)),
        200,
        {'Access-Control-Allow-Origin': '*'}
    )

# DO Function
def main(args):
    url = args.get('url')
    ...
    return {
        'statusCode': 200,
        'headers': {'Content-Type': 'application/json', 'Access-Control-Allow-Origin': '*'},
        'body': json.dumps(dict(host=MASTODON_HOST, user=MASTODON_USER, toot_id=str(toot_id)))
    }
Flask's (body, status_code, headers) tuple becomes an explicit dict, and body must be a string, so json.dumps does what Flask was doing automatically. No gunicorn, no Dockerfile, no /status health-check endpoint -- the platform handles all of that.
Project configuration lives in project.yml alongside the source:
packages:
  - name: postblog
    environment:
      MASTODON_HOST: ${MASTODON_HOST}
      MASTODON_OAUTH_TOKEN: ${MASTODON_OAUTH_TOKEN}
      ...
    functions:
      - name: post
        runtime: python:3.11
        web: true
        limits:
          timeout: 30000
          memory: 256
The ${VAR_NAME} syntax pulls from namespace secrets set with doctl serverless namespaces secrets set. Deployment is then just:
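Assuming the standard doctl workflow, the invocation is along these lines (run from the directory containing project.yml):

```shell
# Build the dependencies remotely and deploy the package
doctl serverless deploy . --remote-build
```
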
--remote-build matters here -- it ensures the Python dependencies are built on Linux, which is important for anything with C extensions.
The Database Problem
I almost left it at those minor changes, but the original service uses a PostgreSQL database. While I have a hosted instance in DO as well, the same public internet access that makes Functions cheap also means they are not in my VPC and therefore can't be a trusted source for my database instance. I did not want to open my DB up to the world (DO doesn't publish their Function IP ranges).
Dropping the Database
It was time to re-evaluate why I had a database in the mix in the first place. It was always a rather heavy piece of infrastructure to require for something this lightweight.
The database was doing two things: storing the URL→toot_id mapping, and providing row-level locking (SELECT ... FOR SHARE / SELECT ... FOR UPDATE) to prevent duplicate toot creation when multiple readers hit a new post simultaneously. That got me thinking about whether both requirements could be met more simply.
The concurrency control turned out to be redundant. The code was already sending an Idempotency-Key header when creating toots -- Mastodon honors this and will return the original toot rather than creating a duplicate if the same key is used within its deduplication window. Two concurrent requests racing to create the same toot both get the same toot_id back from Mastodon. The FOR SHARE / FOR UPDATE dance was unnecessary.
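As a sketch of that mechanism (hypothetical host, token, and function names; the real create_toot lives in the repo), the status-creation request can carry an Idempotency-Key derived from the post URL, so concurrent or retried calls resolve to the same toot:

```python
import urllib.parse
import urllib.request

MASTODON_HOST = "mastodon.example"      # hypothetical
MASTODON_OAUTH_TOKEN = "secret-token"   # hypothetical

def build_toot_request(url, status_text):
    """Build (but do not send) a POST /api/v1/statuses request.

    Mastodon deduplicates statuses that arrive with the same
    Idempotency-Key within its deduplication window, so using the
    post URL as the key makes concurrent creates converge on one toot.
    """
    data = urllib.parse.urlencode({"status": status_text}).encode()
    return urllib.request.Request(
        f"https://{MASTODON_HOST}/api/v1/statuses",
        data=data,
        headers={
            "Authorization": f"Bearer {MASTODON_OAUTH_TOKEN}",
            "Idempotency-Key": url,  # same URL -> same toot
        },
    )
```

Sending the request with urllib.request.urlopen and reading the id out of the JSON response is then straightforward.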
The mapping only needed a key-value store. What I actually needed was a durable lookup from URL to toot_id. DO Spaces (S3-compatible object storage) fits this perfectly. Instead of rows in a table, I use zero-byte objects keyed by the URL path, with the toot_id stored as S3 object metadata.
The key is derived by stripping the scheme and host from the URL and prepending .mastodon-meta:
https://claassen.net/geek/blog/2024/02/mastodon-integration.html
→ .mastodon-meta/geek/blog/2024/02/mastodon-integration.html
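A minimal version of that derivation, assuming the helper is as simple as the example suggests:

```python
from urllib.parse import urlparse

META_PREFIX = ".mastodon-meta"

def get_s3_key(url):
    """Strip scheme and host, keep the path, prepend the meta prefix."""
    return f"{META_PREFIX}{urlparse(url).path}"
```
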
The lookup is a HEAD request on that key:
import boto3
from botocore.exceptions import ClientError

# s3 is a boto3 client configured for the Spaces endpoint elsewhere in the module
def get_toot_id(url):
    key = get_s3_key(url)
    try:
        response = s3.head_object(Bucket=SPACES_BUCKET, Key=key)
        toot_id = response.get('Metadata', {}).get('toot-id')
        if toot_id:
            return int(toot_id)
    except ClientError as e:
        if e.response['Error']['Code'] != '404':
            raise
    toot_id = create_toot(url)
    s3.put_object(Bucket=SPACES_BUCKET, Key=key, Body=b'',
                  Metadata={'toot-id': str(toot_id)})
    return toot_id
On a 404, the toot is created and the result written back as object metadata. The bucket is private -- objects are never served publicly. Two concurrent requests that both miss the HEAD will both call Mastodon with the same Idempotency-Key, get back the same toot_id, and write the same metadata to Spaces. The second write is an overwrite with identical data.
Note
boto3 strips the x-amz-meta- prefix and lowercases custom metadata keys in responses, so the header written as x-amz-meta-toot-id comes back as response['Metadata']['toot-id'].
The Result
No managed database, no connection pooling, no VPC access concerns. The service is now a private Spaces bucket, a Functions namespace, a handful of secrets, and a directory of Python.
To provide a migration path, I added a Python script that reads each row of the existing database and writes the corresponding Spaces meta object. All the code can be found at https://github.com/sdether/mastodon_post_blog
Now, let's see if I manage to blog again before the next refactor of this code base in another year or two.