Introduction
If you send transactional emails directly inside a web request, you will eventually lose messages during deploys, worker crashes, timeouts, or rate-limit spikes. The biggest problem is silent failure: the user does not get the email, and your team cannot prove what happened.
This guide is for SaaS and development teams building transactional email pipelines for OTPs, password resets, receipts, and alerts. It shows how to design an email queue that does not silently drop messages by using a transactional outbox, at-least-once processing, idempotency, and dead-letter queues (DLQ). It also covers safe acknowledgment timing, visibility and ack timeouts, and troubleshooting with provider logs and webhook events.
You will not get a magic inbox guarantee. You will get a queue design where every email has a traceable lifecycle and recoverable failure states. Start by reviewing MailCub Documentation and confirming your test send appears in delivery logs so your visibility path is working.
Quick Answer
- Use at-least-once delivery and make sending idempotent, because safe duplicates are better than dropped messages.
- Implement a transactional outbox so database writes and send intent are persisted together.
- Acknowledge queue messages only after the provider accepts the send and you store a message reference.
- Add a dead-letter queue and alert on it so poison messages stop looping.
- Set the visibility timeout or ack timeout above your worst-case worker processing time.
- Keep an audit trail with app logs, provider logs, and events or webhooks.
Why It Matters
Transactional emails are part of your product reliability. If OTPs or password resets do not arrive, users assume your app is broken or unsafe. If invoices or receipts do not land, support load and disputes increase.
A good queue turns failures into visible states such as queued, processing, retrying, dead-lettered, and accepted-by-provider. That is the difference between “we think we sent it” and “we can show exactly what happened.”
The goal is not perfection. The goal is no silent loss, safe retries, and fast debugging.
What “Never Loses Messages” Really Means
In practice, “never loses messages” means three things:
- Durability: the intent to send is persisted before any network call
- No silent drops: every failure becomes a visible state you can alert on
- Duplicate-safe processing: at-least-once delivery can happen, and it is harmless
Once you accept these rules, the architecture becomes much easier to design.
Step-by-Step Solution
1) Choose your reliability contract: at-least-once + idempotency
Most real queues behave like at-least-once systems, which means a message can be delivered more than once. If your sending logic cannot tolerate duplicates, you will either drop messages or spam users.
Fix this with an idempotency key. Make the key unique per business event, not per job attempt.
Example key shapes:
- email:{tenant_id}:{message_type}:{business_event_id}
- otp:{tenant_id}:{user_id}:{purpose}:{time_bucket}
Store the idempotency key in a table with a unique constraint so repeat attempts become no-ops.
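A minimal sketch of that unique-constraint check, assuming SQLite as the store. The table name `email_idempotency` and the helpers `email_key` and `claim_key` are hypothetical names chosen for illustration, not part of any provider API.

```python
import sqlite3


def email_key(tenant_id: str, message_type: str, business_event_id: str) -> str:
    """Build a key per business event, so every retry of that event shares it."""
    return f"email:{tenant_id}:{message_type}:{business_event_id}"


def claim_key(conn: sqlite3.Connection, idempotency_key: str) -> bool:
    """Return True if this call inserted the key, False if it already existed."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO email_idempotency (idempotency_key) VALUES (?)",
                (idempotency_key,),
            )
        return True
    except sqlite3.IntegrityError:
        # The primary key constraint rejected a repeat: treat it as a no-op.
        return False


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE email_idempotency (idempotency_key TEXT PRIMARY KEY)")

key = email_key("tenant42", "receipt", "order-1001")
first = claim_key(conn, key)   # first attempt claims the key
second = claim_key(conn, key)  # repeat attempt is rejected by the constraint
```

In a real worker you would probably store the send status alongside the key rather than only its existence, so that a key claimed by a crashed attempt can still be retried.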
2) Use the transactional outbox pattern (stop dual-write loss)
The easiest way to lose messages is doing a database write and queue publish as separate operations. If enqueue fails after the database commit, the email intent is lost.
The transactional outbox pattern fixes this:
- Write the business record and an outbox row in the same database transaction
- A dispatcher publishes outbox rows to the queue and marks them as published
This ensures that if the business state exists, the send intent exists too.
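As a rough illustration of the write path, the sketch below commits an order and its outbox row in one SQLite transaction. The `orders` and `outbox` schemas and the `place_order` function are assumptions made for this example; adapt them to your own data model.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE orders (
    id INTEGER PRIMARY KEY,
    customer_email TEXT NOT NULL,
    total_cents INTEGER NOT NULL
);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY,
    idempotency_key TEXT NOT NULL UNIQUE,
    payload TEXT NOT NULL,
    published_at TEXT  -- NULL until the dispatcher publishes the row
);
"""


def place_order(conn: sqlite3.Connection, tenant_id: str,
                customer_email: str, total_cents: int) -> int:
    """Insert the business record and the send intent atomically."""
    with conn:  # one transaction: both rows commit, or neither does
        cur = conn.execute(
            "INSERT INTO orders (customer_email, total_cents) VALUES (?, ?)",
            (customer_email, total_cents),
        )
        order_id = cur.lastrowid
        payload = json.dumps(
            {"to": customer_email, "template": "receipt", "order_id": order_id}
        )
        conn.execute(
            "INSERT INTO outbox (idempotency_key, payload) VALUES (?, ?)",
            (f"email:{tenant_id}:receipt:{order_id}", payload),
        )
    return order_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
order_id = place_order(conn, "t1", "user@example.com", 1999)
```

If the outbox insert fails for any reason, the order insert is rolled back with it, so there is never a committed order without a matching send intent.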
3) Publish durably and record the broker reference
When publishing to the queue, record traceable references such as:
- Outbox row ID
- Broker message ID (if your queue provides one)
- Published timestamp
This creates an audit trail and makes “lost message” reports debuggable.
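One possible shape for the dispatcher is sketched below. `InMemoryBroker` is a hypothetical stand-in for your real queue client, and the `broker_message_id` and `published_at` columns are assumed names for the references listed above.

```python
import sqlite3
import uuid
from datetime import datetime, timezone

OUTBOX_SCHEMA = """
CREATE TABLE IF NOT EXISTS outbox (
    id INTEGER PRIMARY KEY,
    payload TEXT NOT NULL,
    broker_message_id TEXT,
    published_at TEXT
)
"""


class InMemoryBroker:
    """Hypothetical broker; a real client would return its own message ID."""

    def __init__(self) -> None:
        self.messages = []

    def publish(self, body: str) -> str:
        message_id = str(uuid.uuid4())
        self.messages.append((message_id, body))
        return message_id


def dispatch_pending(conn: sqlite3.Connection, broker, batch_size: int = 100) -> int:
    """Publish unpublished outbox rows and record the broker reference."""
    rows = conn.execute(
        "SELECT id, payload FROM outbox WHERE published_at IS NULL "
        "ORDER BY id LIMIT ?",
        (batch_size,),
    ).fetchall()
    for outbox_id, payload in rows:
        broker_message_id = broker.publish(payload)
        # A crash between publish and this update republishes the row later,
        # which is why the worker side must be idempotent.
        with conn:
            conn.execute(
                "UPDATE outbox SET broker_message_id = ?, published_at = ? "
                "WHERE id = ?",
                (broker_message_id, datetime.now(timezone.utc).isoformat(), outbox_id),
            )
    return len(rows)
```

Marking the row only after the publish call returns keeps the dispatcher on the at-least-once side: a row may be published twice, but it is never marked as published without having been handed to the broker.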
4) Worker: acknowledge only after provider acceptance and state write
Your worker should follow a strict sequence:
- Receive queue message
- Check idempotency key (already accepted or sent?)
- Call provider send API
- Persist provider response (status and message reference)
- Acknowledge the queue message
If you acknowledge before the provider response is persisted, a crash can silently drop the message. If you acknowledge afterward but skip the idempotency check, redeliveries can create duplicates. You need both controls together.
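The worker sequence above can be sketched as follows. `QueueMessage` and `FakeProvider` are hypothetical stand-ins for a real queue client and email provider SDK, and the `email_sends` table is an assumed schema.

```python
import sqlite3
from dataclasses import dataclass

SENDS_SCHEMA = """
CREATE TABLE IF NOT EXISTS email_sends (
    idempotency_key TEXT PRIMARY KEY,
    status TEXT NOT NULL,
    provider_ref TEXT NOT NULL
)
"""


@dataclass
class QueueMessage:
    """Hypothetical queue message; real clients expose their own ack call."""
    idempotency_key: str
    payload: str
    acked: bool = False

    def ack(self) -> None:
        self.acked = True


class FakeProvider:
    """Hypothetical provider client that accepts every send."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, payload: str) -> dict:
        self.sent.append(payload)
        return {"status": "accepted", "message_ref": f"prov-{len(self.sent)}"}


def handle(conn: sqlite3.Connection, provider, msg: QueueMessage) -> str:
    # Idempotency check: was this business event already accepted?
    row = conn.execute(
        "SELECT provider_ref FROM email_sends WHERE idempotency_key = ?",
        (msg.idempotency_key,),
    ).fetchone()
    if row is not None:
        msg.ack()  # safe no-op: the send is already recorded
        return "duplicate"

    # Provider call: an exception here leaves the message unacknowledged,
    # so the queue will redeliver it.
    response = provider.send(msg.payload)

    # Persist the provider response before acknowledging.
    with conn:
        conn.execute(
            "INSERT INTO email_sends (idempotency_key, status, provider_ref) "
            "VALUES (?, ?, ?)",
            (msg.idempotency_key, response["status"], response["message_ref"]),
        )

    # Acknowledge only now.
    msg.ack()
    return "sent"
```

This sketch still leaves a small window: a crash after the provider accepts but before the insert commits results in one duplicate on redelivery. If your provider accepts an idempotency key on its send call, passing the same key there may close that gap.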
5) Tune visibility and ack timeouts for slow sends
If your queue uses a visibility timeout or ack deadline, the message can reappear while a worker is still processing it. This creates duplicates under load.
Set the timeout higher than your worst-case processing time, and use heartbeat or timeout extension if your queue supports it. Also limit worker concurrency so you do not overwhelm the provider and trigger throttling spikes.
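To make the timeout guidance concrete, here is a small sketch with two pieces: a budget helper and a heartbeat that periodically calls a timeout-extension function. The `extend` callback is an assumption; it stands for whatever visibility or ack-deadline extension call your queue client offers, and the safety factor is only an illustrative default.

```python
import threading


def visibility_timeout_s(provider_timeout_s: float, db_write_s: float,
                         safety_factor: float = 2.0) -> float:
    """Worst-case processing time multiplied by a safety margin."""
    return (provider_timeout_s + db_write_s) * safety_factor


class Heartbeat:
    """Call `extend()` every `interval_s` seconds while a message is in flight."""

    def __init__(self, extend, interval_s: float) -> None:
        self._extend = extend
        self._interval_s = interval_s
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def _run(self) -> None:
        # Event.wait returns False on timeout, so the loop runs until stopped.
        while not self._stop.wait(self._interval_s):
            self._extend()

    def __enter__(self) -> "Heartbeat":
        self._thread.start()
        return self

    def __exit__(self, *exc_info) -> None:
        self._stop.set()
        self._thread.join()
```

A worker would wrap its send logic in `with Heartbeat(extend_fn, interval_s=...)`, choosing an interval comfortably shorter than the configured timeout so the extension always lands before the message becomes visible again.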
6) Add retries with backoff and a DLQ for poison messages
Retries should be:
- Limited (maximum attempts)
- Spaced out (exponential backoff plus jitter)
- Classified (retry transient errors, stop on permanent errors)
Then add a dead-letter queue:
- After N failures, send the message to the DLQ
- Alert on DLQ growth
- Keep DLQ retention long enough to investigate and replay safely
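The retry and dead-letter rules above might be combined into a single decision function like the sketch below. The exception classes, the default limits, and the "full jitter" variant of exponential backoff are illustrative choices, not requirements of any particular queue.

```python
import random


class TransientError(Exception):
    """Worth retrying, for example a timeout or a throttling response."""


class PermanentError(Exception):
    """Retrying will not help, for example a malformed payload."""


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 300.0) -> float:
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt)]."""
    return random.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def next_action(error: Exception, attempt: int, max_attempts: int = 5):
    """Decide what to do after a failed attempt.

    `attempt` is the 1-based number of attempts made so far.
    Returns ("retry", delay_seconds) or ("dead_letter", None).
    """
    if isinstance(error, PermanentError):
        return ("dead_letter", None)  # classified: stop on permanent errors
    if attempt >= max_attempts:
        return ("dead_letter", None)  # limited: maximum attempts reached
    return ("retry", backoff_delay(attempt))  # spaced out: backoff plus jitter
```

Jitter matters most during provider outages: without it, every failed message retries on the same schedule and the retries arrive in synchronized waves.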
You can use MailCub Documentation as your operational reference while implementing outbox, retries, and failure visibility, and review the Transactional Email page for delivery logging and event tracking capabilities.
7) Add observability: trace IDs, metrics, and provider logs/events
At minimum, store:
- Correlation ID (request ID or user ID)
- Outbox ID and idempotency key
- Timestamps for queued, processing, accepted, and dead-lettered states
- Provider response snapshot (redacted as needed)
Add core metrics:
- Queue lag and in-flight count
- Retry rate
- DLQ count
- Provider throttle events
If your provider supports webhooks or events, consume them so your app can automatically record final outcomes such as delivered, bounced, or failed.
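As a sketch of consuming such events, the function below records a final outcome against a stored provider reference. The event shape (`message_ref` and `event` fields) is hypothetical; real webhook payloads differ by provider, so check your provider's documentation for the actual field names.

```python
import sqlite3
from datetime import datetime, timezone

FINAL_STATES = {"delivered", "bounced", "failed"}

EVENTS_SCHEMA = """
CREATE TABLE IF NOT EXISTS email_sends (
    provider_ref TEXT PRIMARY KEY,
    final_status TEXT,
    final_at TEXT
)
"""


def record_provider_event(conn: sqlite3.Connection, event: dict) -> bool:
    """Store a final outcome once; return True only if a row was updated."""
    status = event.get("event")
    if status not in FINAL_STATES:
        return False  # ignore event types this sketch does not track
    with conn:
        cur = conn.execute(
            "UPDATE email_sends SET final_status = ?, final_at = ? "
            "WHERE provider_ref = ? AND final_status IS NULL",
            (status, datetime.now(timezone.utc).isoformat(), event.get("message_ref")),
        )
    # rowcount is 0 for unknown references and for repeated events.
    return cur.rowcount == 1
```

The `final_status IS NULL` guard makes the handler safe against repeated webhook deliveries, which follows the same duplicate-safe principle as the send path.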
Email Queue Reliability Checklist
| Layer | Must-have control | What it prevents | “Good” signal |
|---|---|---|---|
| Write path | Transactional outbox | DB updated but no email job | Outbox row exists for every send intent |
| Processing | At-least-once + idempotency | Duplicate sends becoming incidents | Replays become no-ops |
| Acking | Ack after provider acceptance | Silent drops on worker crash | No “missing” message without a state |
| Retries | Backoff + classification | Retry storms and spam | Retry rate remains stable under spikes |
| DLQ | Dead-letter after max attempts | Infinite retry loops | DLQ alerts and a replay runbook exist |
| Visibility | Timeout tuned + heartbeats | Accidental duplicates during slow provider responses | Low duplicate rate during provider slowdowns |
| Observability | IDs + logs + events | Guesswork | Message is traceable end-to-end |
Common Mistakes
- Doing database writes and queue publish separately, with no outbox
- Acknowledging messages before provider acceptance
- Retrying permanent failures forever, with no DLQ
- No idempotency key, causing duplicate OTPs or receipts
- Visibility timeout set too low, causing messages to reappear during processing
- No trace ID linking user action to queue message and provider outcome
Troubleshooting
We’re losing emails
Check the pipeline in this order:
- Does an outbox record exist?
- Did the dispatcher publish it to the queue?
- Did a worker pick it up?
- Did the provider accept it?
- Do provider logs or events show the final status?
If you cannot answer any step, add instrumentation there first.
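If you store the timestamps described in the observability step, the ordered check can be automated. The sketch below assumes a per-message trace dictionary with hypothetical field names; map them to whatever columns you actually keep.

```python
from typing import Optional

# Ordered pipeline stages: (trace field, hint when the field is missing).
PIPELINE_STAGES = [
    ("outbox_created_at", "no outbox record: check the write path"),
    ("published_at", "not published: check the dispatcher"),
    ("worker_started_at", "not picked up: check workers and queue lag"),
    ("provider_accepted_at", "not accepted: check provider errors and retries"),
    ("final_status", "no final status: check provider logs or webhook events"),
]


def first_gap(trace: dict) -> Optional[str]:
    """Return a hint for the earliest stage with no recorded evidence."""
    for field, hint in PIPELINE_STAGES:
        if not trace.get(field):
            return hint
    return None  # every stage is accounted for
```

Running this against the trace of a reported "missing" email points support or on-call staff at the first stage to investigate.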
We’re sending duplicates
The most common causes are:
- Missing idempotency key
- Visibility timeout too short, so the message reappears
- Worker crash after provider accepted but before state was persisted
Fix this by enforcing idempotency and persisting provider acceptance before acknowledgment.
DLQ is growing
Treat the DLQ like a bug queue:
- Inspect sample payloads
- Fix template, domain, or configuration issues
- Replay only after the root cause is resolved
FAQ
Can an email queue truly guarantee zero message loss?
Not in every failure mode. The practical target is no silent loss: durable intent, visible states, and duplicate-safe processing.
What is the transactional outbox pattern and why does it matter?
It ensures the database change and send intent are committed together, so emails are not lost during partial failures between database and queue operations.
When should a worker acknowledge a message?
After the provider accepts the send and your system persists the acceptance status and message reference.
Why is idempotency required for reliable queues?
Because at-least-once delivery can repeat messages. Idempotency makes those repeats safe.
What should go to a DLQ?
Messages that exceed max retries or fail with permanent errors, such as bad payloads or invalid domain configuration.
How do webhooks and events help?
They let your app record final outcomes like delivered, failed, or bounced automatically, which reduces guesswork and improves debugging.
Conclusion
To build an email queue that does not silently lose messages, design for reality: at-least-once delivery, safe duplicates, durable send intent, and visible failure states. The transactional outbox removes the biggest silent-loss risk, DLQs stop poison loops, and strict ack timing prevents drops during worker crashes.
Use MailCub Documentation to guide implementation, explore delivery logging and event workflows on the Transactional Email page, and review MailCub Pricing if you are planning production rollout capacity and operational support.