NHacker Next
login
▲Postgres LISTEN/NOTIFY does not scalerecall.ai
455 points by davidgu 4 days ago | 191 comments
Loading comments...
JoelJacobson 2 minutes ago [-]
Hey folks, I ran into similar scalability issues and ended up building a benchmark tool to analyze exactly how LISTEN/NOTIFY behaves as you scale up the number of listeners.

Turns out that all Postgres versions from 9.6 through current master scale linearly with the number of idle listeners — about 13 μs extra latency per connection. That adds up fast: with 1,000 idle listeners, a NOTIFY round-trip goes from ~0.4 ms to ~14 ms.

To better understand the bottlenecks, I wrote both a benchmark tool and a proof-of-concept patch that replaces the O(N) backend scan with a shared hash table for the single-listener case — and it brings latency down to near-O(1), even with thousands of listeners.

Full benchmark, source, and analysis here: https://github.com/joelonsql/pg-bench-listen-notify

No proposals yet on what to do upstream, just trying to gather interest and surface the performance cliff. Feedback welcome.

osigurdson 9 hours ago [-]
I like this article. Lots of comments are stating that they are "using it wrong" and I'm sure they are. However, it does help to contrast the much more common, "use Postgres for everything" type sentiment. It is pretty hard to use Postgres wrong for relational things in the sense that everyone knows about indexes and so on. But using something like L/N comes with a separate learning curve anyway - evidenced in this case by someone having to read comments in the Postgres source code itself. Then if it turns out that it cannot work for your situation it may be very hard to back away from as you may have tightly integrated it with your normal Postgres stuff.

I've landed on Postgres/ClickHouse/NATS since together they handle nearly any conceivable workload managing relational, columnar, messaging/streaming very well. It is also not painful at all to use as it is lightweight and fast/easy to spin up in a simple docker compose. Postgres is of course the core and you don't always need all three but compliment each other very well imo. This has been my "go to" for a while.

riedel 2 hours ago [-]
Actually LISTEN/NOTIFY does also not scale the other way. Immich also moved to that pg for everything mentality (trying to remove redis dependencies). The problem: postgres needs a WAL flush for all notifications. I ran immich on my HDD-NAS. The result was constant noise because the pg backed socket.io backend issues constant keep alive messages.
ownagefool 3 hours ago [-]
Largely agree. Functionality wise if you don't have many jobs, using the database as the queue is fine.

However, I've been in several situations where scaling the queue brings down the database, and therefore the app, and am thus of the opinion you probably shouldn't couple these systems too tightly.

There are pros and cons, of course.

mike_hearn 2 hours ago [-]
Using the database for queues is more than fine, it's often essential to correctness. In many use cases for queues you need to atomically update the database with respect to popping from the queue, and if they're separate systems you end up needing either XA or brittle and unreliable custom idempotency logic. I've seen this go wrong before and it's not nice, the common outcome is business-visible data corruption that can have financial impact.

This seems like another case where Postgres gets free marketing due to companies hitting its technical limits. I get why they choose to make lemonade in these cases with an eng blog post, but this is a way too common pattern on HN. Some startup builds on Postgres then spends half their eng budget at the most critical growth time firefighting around its limits instead of scaling their business. OpenAI had a similar blog post a couple of months ago where they revealed they were probably spending more than quarter of a million a month on an Azure managed Postgres, and it had stopped scaling so they were having to slowly abandon it, where I made the same comment [1].

Postgres is a great DB for what you pay, but IMHO well capitalized blitzscaling startups shouldn't be using it. If you buy a database - and realistically most Postgres users do anyway as they're paying for a cloud managed db - then you might as well just buy a commercial DB with an integrated queue engine. I have a financial COI because I have a part time job there in the research division (on non-DB stuff), so keep that in mind, but they should just migrate to an Oracle Database. It has a queue engine called TxEQ which is implemented on top of database tables with some C code for efficient blocking polls. It scales horizontally by just adding database nodes whilst retaining ACID transactions, and you can get hosted versions of them in all the major clouds. I'm using it in a project at the moment and it's been working well. In particular the ability to dequeue a message into the same transaction that does other database writes is very useful, as is the exposed lock manager.

Beyond scaling horizontally the nice thing about TxEQ/AQ is that it's a full message queue broker with all the normal features you'd expect. Delayed messages, exception queues, queue browsing, multi-consumer etc. LISTEN/NOTIFY is barely a queue at all, really.

For startups like this, the amount of time, money and morale they are losing with all these constant stories of firefights just doesn't make sense to me. It doesn't have to be Oracle, there are other DBs that can do this too. But "We discovered X about Postgres" is a eng blog cliché by this point. You're paying $$$ to a cloud and GPU vendor anyway, just buy a database and get back to work!

[1] https://news.ycombinator.com/item?id=44074506

fathomdeez 7 hours ago [-]
This kind of issue always comes up when people put business logic inside the database. Databases are for data. The data goes in and the data goes out, but the data does not get to decide what happens next based on itself. That's what application code is for.
tsimionescu 4 hours ago [-]
The way you model data and store it in your database is fundamentally a part of your business logic. The same data can be modeled in many different ways, with different trade-offs for different use cases. Especially if you have a large amount of data, you can't just work with it as is, you need to know how you will use it and model it in a way that makes the common operations fast enough. As your application evolves, this may change, and even require data migrations.

None of this means you have to or even should use stored procedures, triggers, or listen/notify. I'm just making the point that there is no clean separation between "data" and "business logic".

ehansdais 3 hours ago [-]
Can't upvote this enough. The point is not that procedures outside of the DB is wrong, nor is it that procedures should always go into the DB. It's that you should look at the context and decide what the best way to solve the problem is.
chatmasta 5 hours ago [-]
The first thing I did when I saw this article was to check the Postgres docs, because I thought "heh, surely they just didn't read the fine print," but the LISTEN/NOTIFY page has zero mentions of "lock" in the entire content.
perlgeek 4 hours ago [-]
I really hope somebody reading this article (or HN thread) writes a doc patch to mention that.

I'm unlikely to get it myself today, and by tomorrow I've probably already forgotten it :-(

Cthulhu_ 3 hours ago [-]
> and by tomorrow I've probably already forgotten it :-(

You're self-aware and are writing about it, why not maintain and add it to your todo list if this is a recurring issue?

physix 6 hours ago [-]
That may hold to a certain extent for relational databases where your business model doesn't align well with physical model (tables). Although you might wonder why stored procedures and triggers were invented.

In databases where your domain is also your physical data model, coupling business logic to the database can work quite well, if the DBMS supports that.

https://medium.com/@paul_42036/entity-workflows-for-event-dr...

bevr1337 7 hours ago [-]
> the data does not get to decide what happens next based on itself.

Then why bother with a relational database? Relations and schemas are business logic, and I'll take all the data integrity I can get.

jl6 4 hours ago [-]
I think an argument can be made that relations, schemas and constraints encode a kind of business logic that is intrinsic to the definition and integrity of the data, while other types of business logic represent processes that may hinge on data but aren’t as tightly coupled to it. Similar to the difference between a primitive type and a function.

I guess some will argue that their business logic is special and really is so tightly coupled to the data definition that it belongs in the database, and I’m not going to claim those use cases don’t exist, but I’ve seen over-coupling far more often than under-coupling.

This is why I say: Applications come and go, but data is forever.

Jailbird 7 hours ago [-]
I've seen both of these philosophies. I liken them to religions, the believers are devout. Code is King vs the DB is King.

I'm personally Code is King, and I have my reasons (like everyone else)

ako 42 minutes ago [-]
It’s really not about code is better or database it better, it’s mostly about locality: if you want to update thousands of records, you can’t pull those records into a separate process, update them there and then write back. So you put your code next to the data in the database. Stored procedures are just code deployed to a database container…
IgorPartola 5 hours ago [-]
I am mostly on the side of business logic should live in applications and relationships between data types are not business logic so much as just the layout of the data. But I typically access data via an ORM and they typically don’t have support for triggers and stored procedures. If they did, I would certainly use it because projects I work on might have multiple people writing application code but everyone uses a single set of database models. This would mean that critical constraints on the shape of the data could be defined and respected at all times vs some developer on my team forgetting to include some critical check in their data update routine.
djfivyvusn 2 hours ago [-]
That's purely because nobody knows how to write SQL let alone stored procedures. If stored procedures had better devex they'd be used for most of your app.
platzhirsch 7 hours ago [-]
If you want your database to just store bytes, use a key-value store. But SQL gives you schemas and constraints for a reason; they're guardrails for your business logic. Just don’t ask your tables to run the business for you.
IgorPartola 5 hours ago [-]
If only different ORMs had more support for triggers and stored procedures. Things would be so much easier if I could do things like denormalize certain frequently accessed fields across tables but with proper ability to update them automatically without having to do them in application code.
Cthulhu_ 3 hours ago [-]
It really depends, but it's also a factor of time, that is, "back in the day", databases were designed to serve many different clients, nowadays a common practice is to have a 1:1 relationship between a database and a client application.

Of course, this is sometimes abused and taken to extremes in a microservices architecture where each service has their own database and you end up with nastiness like data duplication and distributed locking.

panzi 7 hours ago [-]
So what are your thoughts on constraints then? Foreign keys? Should that only be handled by the application, like Rails does (or did, haven't used in a long time).
Lio 4 hours ago [-]
Rails fully supports constraints and encourages you to use them.

You can either execute SQL in your migration or use add_check_constraint.

Footkerchief 5 hours ago [-]
You still use constraints even if you put all your business logic in stored procedures.
fathomdeez 7 hours ago [-]
I don't think of those as business logic, per se. They're just validity checks on what the data should look like before it's written to disk - they're not actionable in the way L/N is. That being said, constraints usually end up being duplicated outside the db anyway, but having them where the data rests (so you don't have to assume every client is using the correct constraint code) makes sense.
2 hours ago [-]
v5v3 2 hours ago [-]
Isn't Kafka the Postgresql of pub/sub

I.e. use Kafka unless you have a explicit reason not to?

So why Nats?

evnix 2 hours ago [-]
After working with NATS, I wouldn't want to touch Kafka even with a long stick. Its just too complex and a memory hog for no good reason. It doesn't have all the features that NATS supports as well.
v5v3 1 hours ago [-]
What about the Kafka V2, Pulsar?
the_duke 54 minutes ago [-]
Kafka is far from trivial to operate, for one thing, even post zookeeper.
goodkiwi 7 hours ago [-]
I’ve been meaning to check out NATS - I’ve tended to default to Redis for pubsub. What are the main advantages? I use clickhouse and Postgres extensively
sbstp 6 hours ago [-]
I've been disappointed by Nats. Core Nats is good and works well, but if you need stronger delivery guarantees you need to use Jetstream which has a lot of quirks, for instance it does not integrate well with the permission system in Core Nats. Their client SDKs are very buggy and unreliable. I've used the Python, Rust and Go ones, only the Go one worked as expected. I would recommend using rabbitmq, Kafka or redpanda instead of Nats.
PaoloBarbolini 2 hours ago [-]
I've had the same experience and I fixed part of the problem by writing my own Rust client, Watermelon. It's still missing a lot of features but at least I'm not blocked by weird decisions taken by upstream.
chatmasta 5 hours ago [-]
Are those recommendations based on using them all in the same context? Curious why you chose Kafka (or Redpanda which is effectively the same) over NATS.
osigurdson 5 hours ago [-]
NATS gives you regular pub/sub but also streams as well (similar to Kafka along with strong durability guarantees, etc).
sorentwo 14 hours ago [-]
Postgres LISTEN/NOTIFY was a consistent pain point for Oban (background job processing framework for Elixir) for a while. The payload size limitations and connection pooler issues alone would cause subtle breakage.

It was particularly ironic because Elixir has a fantastic distribution and pubsub story thanks to distributed Erlang. That’s much more commonly used in apps now compared to 5 or so years ago when 40-50% of apps didn’t weren’t clustered. Thanks to the rise of platforms like Fly that made it easier, and the decline of Heroku that made it nearly impossible.

cpursley 14 hours ago [-]
How did you resolve this? Did you consider listening to the WAL?
sorentwo 14 hours ago [-]
We have Postgres based pubsub, but encourage people to use a distributed Erlang based notifier instead whenever possible. Another important change was removing insert triggers, partially for the exact reasons mentioned in this post.
MuffinFlavored 11 hours ago [-]
> Another important change was removing insert triggers, partially for the exact reasons mentioned in this post.

What did you replace them with instead?

sorentwo 9 hours ago [-]
In app notifications, which can be disabled. Our triggers were only used to get subsecond job dispatching though.
parthdesai 13 hours ago [-]
Distributed Erlang if application is clustered, redis if it is not.

Source: Dev at one of the companies that hit this issue with Oban

alberth 13 hours ago [-]
I didn’t realize Oban didn’t use Mnesia (Erlang built-in).
sorentwo 13 hours ago [-]
Very very few applications use mnsesia. There’s absolutely no way I would recommend it over Postgres.
tecleandor 8 hours ago [-]
I think RabbitMQ still uses by default for its metadata storage. Is it problematic?
schaum 7 hours ago [-]
They are in the process of migrating away from it https://www.rabbitmq.com/docs/metadata-store
arcanemachiner 12 hours ago [-]
I have heard the mnesia is very unreliable, which is a damn shame.

I wonder if that is fixable, or just inherent to its design.

sb8244 11 hours ago [-]
My understanding is that mnesia is sort of a relic. Really hard to work with and lots of edge / failure cases.

I'm not sure if it should be salvaged?

asg0451 9 hours ago [-]
can you explain why?
spooneybarger 8 hours ago [-]
Mnesia along with clustering was a recipe for split brain disasters a few years ago I assume that hasn't been addressed.
leontrolski 14 hours ago [-]
I'd be interested as to how dumb-ol' polling would compare here (the FOR UPDATE SKIP LOCKED method https://leontrolski.github.io/postgres-as-queue.html). One day I will set up some benchmarks as this is the kind of thing people argue about a lot without much evidence either way.

Wasn't aware of this AccessExclusiveLock behaviour - a reminder (and shameless plug 2) of how Postgres locks interact: https://leontrolski.github.io/pglockpy.html

qianli_cs 10 hours ago [-]
My colleague did some internal benchmarking and found that LISTEN/NOTIFY performs well under low to moderate load, but doesn't scale well with a large number of listeners. Our findings were pretty consistent with this blog post.

(Shameless plug [1]) I'm working on DBOS, where we implemented durable workflows and queues on top of Postgres. For queues, we use FOR UPDATE SKIP LOCKED for task dispatch, combined with exponential backoff and jitter to reduce contention under high load when many workers are polling the same table.

Would love to hear feedback from you and others building similar systems.

[1] https://github.com/dbos-inc/dbos-transact-py

mind-blight 9 hours ago [-]
Nice! I'm using DBOS and am a little active on the discord. I was just wondering how y'all handled this under the hood. Glad to hear I don't have to worry much about this issue
singron 13 hours ago [-]
Polling is the way to go, but it's also very tricky to get right. In particular, it's non-trivial to make a reliable queue that's also fast when transactions are held open and vacuum isn't able to clean tuples. E.g. "get the first available tuple" might have to skip over 1000s of dead tuples.

Holding transactions open is an anti-pattern for sure, but it's occasionally useful. E.g. pg_repack keeps a transaction open while it runs, and I believe vacuum also holds an open transaction part of the time too. It's also nice if your database doesn't melt whenever this happens on accident.

atombender 55 minutes ago [-]
Dead tuples is a real and significant problem, not just because it has to skip the tuples, but because the statistics that drive the planner don't account for them.

I found this out the hard way when I had a simple query that suddenly got very, very slow on a table where the application would constantly do a `SELECT ... FOR UPDATE SKIP LOCKED` and then immediately delete the rows after a tiny bit of processing.

It turned out that with a nearly empty table of about 10-20k dead tuples, the planner switched to using a different index scan, and would overfetch tons of pages just to discard them, as they only contained dead tuples. What I didn't realize is that the planner statistics doesn't care about dead tuples, and ANALYZE doesn't take them into account. So the planner started to think the table was much bigger than it actually was.

It's really important for these uses cases to tweak the autovacuum settings (which can be set on a per-table basis) to be much more aggressive, so that under high load, the vacuum runs pretty much continuously.

Another option is to avoid deleting rows, but instead use a column to mark rows as complete, which together with a partial index can avoid dead tuples. There are both pros and cons; it requires doing the cleanup (and VACUUM) as a separate job.

time0ut 11 hours ago [-]
An approach that has worked for me is to hash partition the table and have each worker look for work in one partition at a time. There are a number of strategies depending on how you manage workers. This allows you to only consider 1/Nth of the dead tuples, where N is the number of partitions, when looking for work. It does come at the cost of strict ordering, but there are many use cases where strict ordering is not required. The largest scale implementation of this strategy that I have done had 128 partitions with a worker per partition pumping through ~100 million tasks per day.

I also found LISTEN/NOTIFY to not work well at this scale and used a polling based approach with a back off when no work was found.

Quite an interesting problem and a bit challenging to get right at scale.

j16sdiz 10 hours ago [-]
Can't change the number of partition dynamically.

Additional challenge if jobs comes in funny sizes

AlisdairO 7 hours ago [-]
Depending on exactly what you need, you can often fake this with a functional index on mod(queue_value_id, 5000). You then query for mod(queue_value_id,5000) between m and n. You can then dynamically adjust the gap between m and n based on how many partitions you want
dfsegoat 10 hours ago [-]
If there were a toy or other public implementation of this, I would love to see it.
CBLT 9 hours ago [-]
This is how Kafka does it. Kafka has spent years working on the rough edges (e.g. partition resizing), haven't used it recently though.
leontrolski 6 hours ago [-]
> also fast when transactions are held open

In my linked example, on getting the item from the queue, you immediately set the status to something that you're not polling for - does Postgres still have to skip past these tuples (even in an index) until they're vacuumed up?

broken_broken_ 5 hours ago [-]
I have implemented polling against a cluster of mixed mariadb/mysql databases which do not offer listen/notify. It was a pain in the neck to get right.

- The batch size needs to be adaptative for performance, latency, and recovering smoothly after downtime.

- The polling timeouts, frequency etc the same.

- You need to avoid hysteresis.

- You want to be super careful about not disturbing the main application by placing heavy load on the database or accidentally locking tables/rows

- You likely want multiple distributed workers in case of a network partition to keep handling events

It’s hard to get right especially when the databases at the time did not support SKIP LOCKED.

In retrospect I wish I had listened to the WAL. Much easier.

RedShift1 13 hours ago [-]
I use polling with back off up to one minute. So when a workload is done, it immediately polls for more work. If nothing found, wait for 5 seconds, still nothing 10 seconds, ... until one minute and from then on it polls every minute until it finds work again and the back off timer resets to 0 again.
cpursley 14 hours ago [-]
Have you played with pgmq? It's pretty neat: https://github.com/pgmq/pgmq
edoceo 12 hours ago [-]
Another thing for @leontrolski to add to the benchmarks - which I cannot wait to read.
cpursley 11 hours ago [-]
There's a pretty cool solution built on pgmq called pgflow:

https://www.pgflow.dev/concepts/how-pgflow-works

TkTech 10 hours ago [-]
With that experience behind you, would you have feedback for Chancy[1]? It aims to be a batteries-included offering for postgres+python, aiming for hundreds of millions of jobs a day, not massive horizontal worker scaling.

It both polls (configurable per queue) and supports listen/notify simply to inform workers that it can wake up early to trigger polling, and this can be turned off globally with a notifications=false flag.

[1]: https://github.com/tktech/chancy

aurumque 14 hours ago [-]
I'll take the shameless plug. Thank you for putting this together! Very helpful overview of pg locks.
sorentwo 9 hours ago [-]
Ping requires something persistent to check. That requires creating tuples, and most likely deleting them after they’ve been consumed. That puts pressure on the database and requires vacuuming in ways that pubsub doesn’t because it’s entirely ephemeral.

Not to mention that pubsub allows multiple consumers for a single message, whereas FOR UPDATE is single consumer by design.

cpursley 14 hours ago [-]
Right, plus there's character limitations (column size). This is why I prefer listening to the Postgres WAL for database changes:

https://github.com/cpursley/walex?tab=readme-ov-file#walex (there's a few useful links in here)

williamdclt 14 hours ago [-]
I found recently that you can write directly to the WAL with transactional guarantees, without writing to an actual table. This sounds like it would be amazing for queue/outbox purposes, as the normal approaches of actually inserting data in a table cause a lot of resource usage (autovacuum is a major concern for these use cases).

Can’t find the function that does that, and I’ve not seen it used in the wild yet, idk if there’s gotchas

Edit: found it, it’s pg_logical_emit_message

gunnarmorling 4 hours ago [-]
pg_logical_emit_message() is how I recommend users on Postgres to implement the outbox pattern [1]. No table overhead as you say, no need for housekeeping, etc. It has some other cool applications, e.g. providing application-specific metadata for CDC streams or transactional logging, wrote about it at [2] a while ago. Another one is making sure replication slots can advance also if there's no traffic in the database they monitor [3].

[1] https://speakerdeck.com/gunnarmorling/ins-and-outs-of-the-ou...

[2] https://www.infoq.com/articles/wonders-of-postgres-logical-d...

[3] https://www.morling.dev/blog/mastering-postgres-replication-...

williamdclt 2 hours ago [-]
Ha, your [2] is how I learnt about it! Thanks :)
cyberax 13 hours ago [-]
One annoying thing is that there is no counterpart for an operation to wait and read data from WAL. You can poll it using pg_logical_slot_get_binary_changes, but it returns immediately.

It'd be nice to have a method that would block for N seconds waiting for a new entry.

You can also use a streaming replication connection, but it often is not enabled by default.

williamdclt 13 hours ago [-]
I think replication is the way to go, it’s kinda what it’s for.

Might be a bit tricky to get debezium to decode the logical event, not sure

gunnarmorling 4 hours ago [-]
Debezium handles logical decoding messages OOTB. There's also an SMT (single message transform) for decoding the binary payload: https://debezium.io/documentation/reference/stable/transform....
cyberax 10 hours ago [-]
Sure, but the replication protocol requires a separate connection. And the annoying part is that it requires a separate `pg_hba.conf` entry to be allowed. So it's not enabled for IAM-based connections on AWS, for example.

pg_logical_slot_get_binary_changes returns the same entries as the replication connection. It just has no support for long-polling.

denysonique 13 hours ago [-]
For node.js users there is postgres.js that can listen to the Postgres WAL and emit node events that can be handled by application code.
meesles 13 hours ago [-]
Yeah until vendors butcher Postgres replication behaviors and prevent common paths of integrating these capabilities into other tools. Looking at you AWS
CaliforniaKarl 14 hours ago [-]
I appreciate this post for two reasons:

* It gives an indication of how much you need to grow before this Postgres functionality starts being a blocker.

* Folks encountering this issue—and its confusing log line—in the future will be able to find this post and quickly understand the issue.

Gigachad 10 hours ago [-]
Sounds like ChatGPT appreciated the post
acdha 9 hours ago [-]
If you think they’re a bot, flag and move on. No need for a derail about writing style.
jjgreen 2 hours ago [-]
Just for the em-dashes? Some humans also use them.
TrackerFF 2 hours ago [-]
A decent way to classify human vs bot when it comes to dashes, is that all bots use ‘em-dashes(—), while almost none use regular dashes (-) in writing. While plenty of humans will use regular dashes, because they won’t bother to look for ‘em-dashes on the keyboard, or phone.

Of course, you have the people that correctly use em-dashes, too.

Gigachad 2 hours ago [-]
It’s also the fact it’s just a summary of the post content without anything extra or any opinions.
jjgreen 2 hours ago [-]
Fair point
bjornsing 4 hours ago [-]
If I’m not mistaken LISTEN/NOTIFY doesn’t work with connection poolers, and you can’t have tens of thousands of connections to a Postgres database. Not sure you need a more elaborate analysis than that to reach the same conclusion.
callamdelaney 12 hours ago [-]
My kneejerk reaction to the headline is ‘why would it?’.

It’s unsurprising to me that an AI company appears to have chosen exactly the wrong tool for the job.

kristianc 10 hours ago [-]
Sounds like a deliberate attempt to avoid spinning up Redis, Kafka, or an outbox system early on.. and then underestimated how quickly their scale would make it blow up. Story as old as time.
const_cast 9 hours ago [-]
I find the opposite story more true: additional complexity in the form of caching early, for a scale that never comes. I've worked on one too many sprawling, distributed systems with too little users to justify it.
physix 5 hours ago [-]
"Sprawling distributed systems".

I like that. Sounds like a synonym for "Platform Engineering". :-)

I remember being amazed that lambda architecture was considered a kind of reference, when it looked to me more like a workaround.

We like to build IT cathedrals, until we have to run them.

const_cast 2 hours ago [-]
If there's one thing I took away from school, it's that distributed systems are hard. More failure points and much more communication hops. Serialization into deserialization into serialization again over network hops.
v5v3 2 hours ago [-]
Better to be successful with simple tech and have a minor 'blow up', then over engineer and go bust.
oulipo 3 hours ago [-]
Not sure I get it... how would you replicate this functionality with Kafka? You'd still need to have the database LISTEN to changes and push it to Kafka no?
j16sdiz 10 hours ago [-]
Kafka head of line blocking sucks.
chrnola 8 hours ago [-]
Guaranteeing order has its tradeoffs.

There is work happening currently to make Kafka behave more like a queue: https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A...

bravesoul2 12 hours ago [-]
Yeah I have no idea whether it would. But I'd load test it if it needed to scale.

SQS may have been a good "boring" choice for this?

TheTaytay 8 hours ago [-]
Because documentation doesn’t warn about this well-loved feature effectively ruins the ability to perform parallel writes, and because everything else in Postgres scales well.

I think it’s a reasonable assumption. Based on the second half of your comment, you clearly don’t think highly of “AI companies,” but I think that’s a separate issue.

merb 4 hours ago [-]
Wouldn’t it be better nowadays to listen to the Wal. With a temporary replication slot and a publication just for this table and the id column?
hombre_fatal 14 hours ago [-]
Interesting. What if you just execute `NOTIFY` in its own connection outside of / after the transaction?
nick_ 14 hours ago [-]
My thought as well. You could add notify commands to a temp table during the transaction, then run NOTIFY on each row in that temp table after the transaction commits successfully?
zbentley 12 hours ago [-]
This is roughly the “transactional outbox” pattern—and an elegant use of it, since the only service invoked during the “publish” RPC is also the database, reducing distributed reliability concerns.

…of course, you need dedup/support for duplicate messages on the notify stream if you do this, but that’s table stakes in a lot of messaging scenarios anyway.

foota 12 hours ago [-]
Wouldn't you need to then commit to remove the entries from the temp table?
zbentley 12 hours ago [-]
No, so long as the rows in there are transactionally guaranteed to be present or not, a sweeper script can handle removing failed “publishes” (notifys that didn’t delete their row) later.

This does sacrifice ordering and increases the risk of duplicates in the message stream, though.

parthdesai 13 hours ago [-]
You lose transactional guarantees if you notify outside of the transaction though
hombre_fatal 13 hours ago [-]
Yeah, but pub/sub systems already need to be robust to missed messages. And, sending the notify after the transaction succeeds usually accomplishes everything you really care about (no false positives).
parthdesai 12 hours ago [-]
What happens when transaction succeeds but the execution of NOTIFY fails if it's outside of transaction, in it's own separate connection?
saltcured 12 hours ago [-]
For reliability, you can make the recipient poll the table(s) of record for relevant state and use the out-of-band notification channel as a latency-reducer. So, the poller is eventually consistent at some configured polling interval, but opportunistically can respond much sooner when told to check again ahead of the next scheduled poll time.

In my experience, this means you make sure the polling solution is complete and correct, and the notifier gets reduced to a wake-up signal. This signal doesn't even need to carry the actionable change content, if the poller can already pose efficient queries for whatever "new stuff" it needs.

This approach also allows the poller to keep its own persistent cursor state if there is some stateful sequence to how it consumes the DB content. It automatically resynchronizes and the notification channel does not need to be kept in lock-step with the consumption.

valenterry 7 hours ago [-]
> you can make the recipient poll the table(s) of record for relevant state

That is tricky due to transactions and visibility. How do you write the poller to not miss events that were written by a long/blocked transaction? You'd have to set the poller scan to a long time (e.g. "process events that were written since now minus 5minutes") and then make sure transactions are cancelled hard before those 5minutes.

parthdesai 11 hours ago [-]
fwiw - that's what Oban did for the most part. It sent a signal to a worker that there was a new job to pick up and work on. At scale, even that was an issue.
Groxx 10 hours ago [-]
The same thing that happens if the notified process dies suddenly.

If you're not handling that, then whatever you're doing is unreliable either way.

cheesekunator 8 hours ago [-]
98% of developers can't see it
soursoup 14 hours ago [-]
Isn’t it standard practice to have a separate TCP stream for NOTIFY or am I mistaken
remram 14 hours ago [-]
You mean for LISTEN?
zerd 6 hours ago [-]
That would make the locked time shorter, but it would still contend on the global lock, right?
baristaGeek 7 hours ago [-]
Postgres is a great DB, but it's the wrong tool for a write-heavy, high-concurrency, real-time system with pub-sub needs.

You should split your system into specialized components: - Kafka for event transport (you're likely already doing this). - An LSM-tree DB for write-heavy structured data (eg: Cassandra) - Keep Postgres for queries that benefit from relational features in certain parts of your architecture

ryanjshaw 5 hours ago [-]
IMO They don’t have a high concurrency DB writing system, they just think they do.

Recordings can and should be streamed to an object store. Parallel processes can do transcription on those objects; bonus: when they inevitably have a bug in transcription, retranscribing meetings is easy.

The output of transcription can be a single file also stored in the object store with a single completion message notification, or if they really insist on “near real-time”, a message on a queue for every N seconds. Much easier to scale your queue than your DB, eg Kafka partitions.

A handful of consumers can read those messages and insert into the DB. Benefit is you have a fixed and controllable write load into the database, and your client workload never overloads the DB because you’re buffering that with the much more distributed object store (which is way simpler than running another database engine).

baristaGeek 7 hours ago [-]
Very good article! Succinct, and very informative.
redskyluan 1 hours ago [-]
Postgres users often hit scaling issues — whether it's with LISTEN/NOTIFY, PGVector, or even basic relational queries.

For startups, Postgres is a fantastic first choice. But plan ahead: as your workload grows, you’ll likely need to migrate or augment your stack.

NightMKoder 14 hours ago [-]
Facebook’s wormhole seems like a better approach here - just tailing the MySQL bin log gets you commit safety for messages without running into this kind of locking behavior.
winterrx 3 hours ago [-]
They're the same company that ran into this, at least they're learning! > How WebSockets cost us $1M on our AWS bill
polote 14 hours ago [-]
Rls and triggers dont scale either
shivasaxena 14 hours ago [-]
Yeah, I'm going to remove triggers in next deploy of a POS system since they are adding 10-50ms to each insert.

Becomes a problem if you are inserting 40 items to order_items table.

lelanthran 12 hours ago [-]
> Yeah, I'm going to remove triggers in next deploy of a POS system since they are adding 10-50ms to each insert.

Do you expect it to be faster to do the trigger logic in the application? Wouldn't be slower to execute two statements from the application (even if they are in a transaction) than to rely on triggers?

candiddevmike 12 hours ago [-]
How do you handle trigger logic that compares old/new without having a round trip back to the application?
SoftTalker 11 hours ago [-]
Do it in a stored procedure not a trigger. Triggers have their place but a stored procedure is almost always better. Triggers can surprise you.
candiddevmike 11 hours ago [-]
I don't follow how you would do that in a stored procedure outside of a trigger.
const_cast 9 hours ago [-]
I think instead of performing an INSERT you call a stored proc that does the insert and some extra stuff.
nine_k 11 hours ago [-]
Hmm, imho, triggers do scale, they are just slow. But as you add more connections, partitionss, and CPUs, the slowness per operation remains constant.
ants_a 34 minutes ago [-]
Triggers are not even particularly slow. They just hide the extra work that is being done and thus sometimes come back to bite programmers by adding a ton of work to statements that look like they should be quick.
brikym 12 hours ago [-]
Have you tried deferring them?
GuinansEyebrows 14 hours ago [-]
that, and keeping your business logic in the database makes everything more opaque!
lelanthran 12 hours ago [-]
> that, and keeping your business logic in the database makes everything more opaque!

Opaque to who? If there's a piece of business logic that says "After this table's record is updated, you MUST update this other table", what advantages are there to putting that logic in the application?

When (not if) some other application updates that record you are going to have a broken database.

Some things are business constraints, and as such they should be moved into the database if at all possible. The application should never enforce constraints such as "either this column or that column is NULL, but at least one must be NULL and both must never be NULL at the same time".

Your database enforces constraints; what advantages are there to code the enforcement into every application that touches the database over simply coding the constraints into the database?

thisoneisreal 11 hours ago [-]
I think the dream is that business requirements are contained to one artifact and everything else responds to that driver. In an ideal world, it would be great to have databases care only about persistence and be able to swap them out based on persistence needs only. But you're right, in the real world the database is much better at enforcing constraints than applications.
Spivak 13 hours ago [-]
Neither do foreign keys the moment you need to shard. Turns out that there's no free lunch when you ask your database to do "secret extra work" that's supposed to be transparent-ish to the user.
mulmen 12 hours ago [-]
Does that only apply when you need to shard within tenants?

If each tenant gets an instance I would call that a “shard” but in that pattern there’s no need for cross-shard references.

Maybe in the analytics stack but that can be async and eventually consistent.

shivasaxena 14 hours ago [-]
Out of curiosity: Would appreciate if others can share what other things like AccessExclusiveLock should postgres users beware of?

What I already know

- Unique indexes slow inserts since db has to acquire a full table lock

- Case statements in Where break query planner/optimizer and require full table scans

- Read only postgres functions should be marked as `STABLE PARALLEL SAFE`

hans_castorp 5 hours ago [-]
> Unique indexes slow inserts since db has to acquire a full table lock

An INSERT never results in a full table lock (as in "the lock would prevent other inserts or selects on the table)

Any expression used in the WHERE clause that isn't indexed will probably result in a Seq Scan. CASE expressions are no different than e.g. a function call regarding this.

A stable function marked as "STABLE" (or even immutable) can be optimized differently (e.g. can be "inlined"), so yes that's a good recommendation.

1a527dd5 3 hours ago [-]
https://pglocks.org/?pglock=AccessExclusiveLock is my go to reference.

My other reference for a slightly different problem is https://www.thatguyfromdelhi.com/2020/12/what-postgres-sql-c...

franckpachot 13 hours ago [-]
Can you provide more details? Inserting with unique indexes do not lock the table. Case statements are ok in where clause, use expression indexes to index it
spoaceman7777 9 hours ago [-]
This is part of the basis for Supabase offering their realtime service, and broadcast, rather than supporting native LISTEN/NOTIFY. The scaling issues are well known.
DumBthInker007 4 hours ago [-]
My understanding: i think as postgres takes an exclusive lock to enqueue the notifications into a shared queue in PreCommit_Notify(), as the actual commit happens after notification was enqueued into the queue,as other transactions also try to notify but wait becacause of the lock ,so does the commit waits.
cshimmin 14 hours ago [-]
If I understood correctly, the global lock is so that notify events are emitted in order. Would it make sense to have a variant that doesn't make this ordering guarantee if you don't care about it, so that you can "notify" within transactions without locking the whole thing?
GuinansEyebrows 14 hours ago [-]
possibly, but i think at that point it would make more sense to move the business logic outside of the database (you can wait for a successful commit before triggering an external process via the originating app, or monitor the WAL with an external pub/sub system, or something else more clever than i can think of).
supportengineer 14 hours ago [-]
LISTEN/NOTIFY isn’t just a lock-free trigger. It can jeopardize concurrency under load.

Features that seem harmless at small scale can break everything at large scale.

edoceo 11 hours ago [-]
It's true and folk should also choose the right tool at their scale and monitor it. There are plenty of cases where LISTEN/NOTIFY is the right choice.

However, in 2025 I'd pick Redis or MQTT for this kind of role. I'm typically in multi-lamg environments. Is there something better?

vb-8448 2 hours ago [-]
I didn't see it in the article, can some tell me what is the scale of " many writers."?
sleepy_keita 10 hours ago [-]
LISTEN/NOTIFY was always a bit of a puzzler for me. Using it means you can't use things like pgbouncer/pgpool and there are so many other ways to do this, polling included. I guess it could be handy for an application where you know it won't scale and you just want a simple, one-dependency database.
nightfly 9 hours ago [-]
> I guess it could be handy for an application where you know it won't scale and you just want a simple, one-dependency database

That's where we use it at my work. We have host/networking deployment pipelines that used to have up to one minute latency on each step because each was ran on a one-minute cron. A short python script/service that handled the LISTENing + adding NOTIFYs when the next step was ready removed the latency and we'll never do enough for the load on the db to matter

nhumrich 6 hours ago [-]
You can setup notify to run as a trigger on an events table. The job that listens shouldn't need a pool, it's a long lived connection anyway. Now you can keep using pgbouncer everywhere else.
valenterry 7 hours ago [-]
How about using a service that runs continuously and brings it's own pool? So basically all Java/JVM based solutions that use something like HiKariCP.
andrewstuart 14 hours ago [-]
There’s lots of ways to invoke NOTIFY without doing it from with the transaction doing the work.

The post author is too focused on using NOTIFY in only one way.

This post fails to explain WHY they are sending a NOTIFY. Not much use telling us what doesn’t work without telling us the actual business goal.

It’s crazy to send a notify for every transaction, they should be debounced/grouped.

The point of a NOTIFY is to let some other system know something has changed. Don’t do it every transaction.

0xCMP 14 hours ago [-]
Agreed, I am struggling to understand why "it does not scale" is not "we used it wrong and hit the point where it's a problem" here.

Like if it needs to be very consistent I would use an unlogged table (since we're worried about "scale" here) and then `FOR UPDATE SKIP LOCKED` like others have mentioned. Otherwise what exactly is notify doing that can't be done after the first transaction?

Edit: in-fact, how can they send an HTTP call for something and not be able to do a `NOTIFY` after as well?

One possible way I could understand what they wrote is that somewhere in their code, within the same transaction, there are notifies which conditionally trigger and it would be difficult to know which ones to notify again in another transaction after the fact. But they must know enough to make the HTTP call, so why not NOTIFY?

andrewstuart 13 hours ago [-]
Agreed.

They’re using it wrong and blaming Postgres.

Instead they should use Postgres properly and architect their system to match how Postgres works.

There’s correct ways to notify external systems of events via NOTIFY, they should use them.

thom 13 hours ago [-]
Yeah, the way I've always used LISTEN/NOTIFY is just to tell some pool of workers that they should wake up and check some transactional outbox for new work. False positives are basically harmless and therefore don't need to be transactional. If you're sending sophisticated messages with NOTIFY (which is a reasonable thing to think you can do) you're probably headed for pain at some point.
tomrod 14 hours ago [-]
Assuming you skip select transaction, or require logging on it because your regulated industry had bad auditors, then every transaction changes something.
to11mtm 11 hours ago [-]
Seriously people just layer shit with NATS for pubsub after persist and make sure there's a proper way to place a 'on restart recoonect' thing.
caleblloyd 9 hours ago [-]
Amen! NATS is how we do AI streaming! JetStream subject per thread with an ordered consumer on the client.
seunosewa 3 hours ago [-]
They have a history of not prioritising performance.
29 minutes ago [-]
h1fra 14 hours ago [-]
You had one problem with listen notify which was a fair one, but now you have a problem with http latency, network issues, DNS, retries, self-DDoS, etc.
GuinansEyebrows 14 hours ago [-]
it sounds like the impact of LISTEN/NOTIFY scaling issues was much greater on the overall DB performance than the actual load/scope of the task being performed (based on the end of the article), and they're aware that if they needed something more performant for that offloaded task, they have options (pub/sub via redis or w/e).
winterrx 3 hours ago [-]
Funny, I got to their homepage and get 504'd
grumple 1 hours ago [-]
I’m mostly a MySQL user. Two things stand out:

1) the Postgres documentation does not mention that Notify causes a global lock or lock of any sort (I checked). That’s crazy to me; if something causes a lock, the documentation should tell you it does and what kind. Performance notes also belong in documentation for dbs.

2) why the hell does notify require a lock in the first place? Reading the comment this design seems insane; there’s no good reason to queue up notifications for transactions that aren’t committed. Just add the notifications in commit order with no lock, you’re building a db with concurrency, get used to it.

freeasinbeer2 11 hours ago [-]
Am I supposed to be able to tell from these graphs that one was faster than the other? Because I sure can't.

What were the TPS numbers? What was the workload like? How big is the difference in %?

cellis 14 hours ago [-]
It does scale. Just not to recall levels of traffic. Come on guys let's not rewrite everything in cassandra and rust now.
doc_manhat 13 hours ago [-]
Got up to the TL;DR paragraph. This was a major red flag given the initial presentation of the discovery of a bottleneck:

''' When a NOTIFY query is issued during a transaction, it acquires a global lock on the entire database (ref) during the commit phase of the transaction, effectively serializing all commits. '''

Am I missing something - this seems like something the original authors of the system should have done due diligence on before implementing a write heavy work load.

Someone 4 hours ago [-]
If “doing due diligence” involves reading the source code of a database server to verify a design, I doubt many people writing such systems do due diligence.

The documentation doesn’t mention any caveats in this direction, and they had 3 periods of downtime in 4 days, so I don’t think it’s a given that testing would have hit this problem.

kccqzy 13 hours ago [-]
I think it's just difficult to predict how heavy is heavy enough to make this a problem. FWIW I had worked at a startup with a much more primitive data storage system where serialized commits were actually totally fine. The startup never outgrew that bottleneck.
whatevaa 5 hours ago [-]
You don't know how heavy it will be in new systems. As another commenter mentioned, you might never reach that point. Simplier is always better.
mulmen 14 hours ago [-]
Sounds like one centralized Postgres instance, am I understanding that correctly? Wouldn’t meeting bots be very easy to parallelize across single-tenant instances?
maxdo 10 hours ago [-]
What a discovery , even Postgres itself doesn’t scale easy. There are so many solutions that are dedicated and cost you less.
dumbfounder 13 hours ago [-]
Transactional databases are not really the best tool for writing tons of (presumably) immutable records. Why are you using it for this? Why not Elastic?
incoming1211 13 hours ago [-]
Because transactional databases are perfectly fine for this type of thing when you have 0 to 100k users.
0xbadcafebee 9 hours ago [-]
The total number of users in your system is not a performance characteristic. And transactions are generally wrong for write-heavy anything. Further, if you can just append then the transaction is meaningless.
Kwpolska 13 hours ago [-]
[citaiton needed]
randall 13 hours ago [-]
wow thanks for the heads up! no idea this was a thing.
wordofx 13 hours ago [-]
It’s not a thing.
randall 10 hours ago [-]
i don’t understand. is the serialized write global lock a thing or no?
anonu 13 hours ago [-]
was hoping the solution was: we forked postgres.

cool writeup!

threecheese 12 hours ago [-]
I had a similar thought, as I was clicking through to TFA; “NOTIFY does not scale, but our new Widget can! Just five bucks”
deadbabe 11 hours ago [-]
Honestly this article is ridiculous. Most people do not have tens of thousands of concurrent writers. And most applications out there are read heavy, not write. Which means you probably have read replicas distributing loads.

Use LISTEN/NOTIFY. You will get a lot of utility out of it before you’re anywhere close to these problems.

acdha 9 hours ago [-]
I would phrase this as “know where your approach hits scaling walls”. You’re right that many people never need more than LISTEN/NOTIFY but the reason that advice became so popular was the wave of people who had jumped straight into running some complicated system like Kafka when they hadn’t done any analysis to justify it; it would be nice if the lesson we taught was that you should do some analysis rather than just picking one popular option.
winterissnowing 9 hours ago [-]
[dead]
aaa12365 6 hours ago [-]
hi
0xbadcafebee 13 hours ago [-]
RBDMS are not designed for write-heavy applications, they are designed for read-heavy analysis. Also, an RDBMS is not a message queue or an RPC transport.

I feel like somebody needs to write a book on system architecture for Gen Z that's just filled with memes. A funny cat pic telling people not to use the wrong tool will probably make more of an impact than an old fogey in a comment section wagging his finger.

const_cast 9 hours ago [-]
People have been using RDBMS' for write-heavy workflows for forever. Some people even use stored procs or triggers for getting complicated write operations to work properly.

Databases can do a lot of stuff, and if you're not hurting for DB performance it can be a good idea to just... do it in the database. The advantage is that, if the DB does it, you're much less likely to break things. Putting data constraints in application code can be done, but then you're just waiting for the day those constraints are broken.

0xbadcafebee 8 hours ago [-]
That is the same logic that led every failed design I've seen in my career to take months (if not years) and tons of money to fix. "YOLO engineering" is simple at first and a huge pain in the ass later. Whereas actually correct engineering is slightly painful at first and saves your ass later.

The people who design it walk away after a few years, so they don't give a crap what happens. The rest of us have to struggle to support or try to replace whatever the lumbering monstrosity is.

9 hours ago [-]
hombre_fatal 13 hours ago [-]
But those rules of thumb aren't true. People use Postgres for job queues and write-heavy applications.

You'd have to at least accompany your memes with empirics. What is write-heavy? A number you might hit if your startup succeeds with thousands of concurrent users on your v1 naive implementation?

Else you just get another repeat of everyone cargo-culting Mongo because they heard that Postgres wasn't web scale for their app with 0 users.

0xbadcafebee 8 hours ago [-]
There are lots of ways to empirically tell what solutions are right for what applications. The simplest is using basic computer science like applying big-O notation, or using something designed as a message queue to do message queueing, etc. Slightly more complicated are simple benchmarks with immutable infrastructure.
kccqzy 13 hours ago [-]
There are OLTP and OLAP RDBMSes. Only OLAP ones are designed for read-heavy analyses.
ilitirit 4 hours ago [-]
> The structured data gets written to our Postgres database by tens of thousands of simultaneous writers. Each of these writers is a “meeting bot”, which joins a video call and captures the data in real-time.

Maybe I missed it in some folded up embedded content, or some graph (or maybe I'm probably just blind...), but is it mentioned at which point they started running into issues? The quoted bit about "10s of thousands of simultaneous writers" is all I can find.

What is the qualitative and quantitative nature of relevant workloads? Depending on the answers, some people may not care.

I asked ChatGPT to research it and this is the executive summary:

  For PostgreSQL’s LISTEN/NOTIFY, a realistic safe throughput is:

  Up to ~100–500 notifications/sec: Handles well on most systems with minimal tuning. Low risk of contention.

  ~500–2,000 notifications/sec: Reasonable with good tuning (short transactions, fast listeners, few concurrent writers). May start to see lock contention.

  ~2,000–5,000 notifications/sec: Pushing the upper bounds. Requires careful batching, dedicated listeners, possibly separate Postgres instances for pub/sub.

  >5,000 notifications/sec: Not recommended for sustained load. You’ll likely hit serialization bottlenecks due to the global commit lock held during NOTIFY.
cap11235 4 hours ago [-]
Source: ass
ilitirit 2 hours ago [-]
What is wrong with you? Why would you even bother posting a comment like this?

Maybe you also don't know what ChatGPT Research is (the Enterprise version, if you really need to know), or what Executive Summary implies, but here's a snippet of the 28 sources used:

https://imgur.com/a/eMdkjAh

ants_a 11 minutes ago [-]
In that snippet are links to Postgres docs and two blog posts, one being the blog post under discussion. None of those contain the information needed to make the presented claims about throughput.

To make those claims it's necessary to know what work is being done while the lock is held. This includes a bunch of various resource cleanup, which should be cheap, and RecordTransactionCommit() which will grab a lock to insert a WAL record, wait for it to get flushed to disk and potentially also for it to get acknowledged by a synchronous replica. So the expected throughput is somewhere between hundreds and tens of thousands of notifies per second. But as far as I can tell this conclusion is only available from PostgreSQL source code and some assumptions about typical storage and network performance.