Database Connection Pooling
:::info Who this guide is for
- New learners โ start at What is a Database Connection? and What is a Connection Pool? to understand why this problem exists and how pools solve it.
- Senior engineers โ jump to Pool Sizing, Failure Modes, PgBouncer, RDS Proxy, or Production Observability. :::
What is a Database Connection?โ
A database connection is a persistent, stateful communication channel between your application and the database server. It is not just a network socket โ it carries:
- An authenticated session (the DB knows who you are)
- A transaction context (what isolation level, what's in-flight)
- Server-side memory allocation (caches, temp buffers, cursor state)
- A dedicated backend process on PostgreSQL (or thread on MySQL)
Your App (JVM) PostgreSQL Server
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Thread 1 Backend Process 1 (~10MB RAM)
Thread 2 โโโ TCP Socket โโโบ Backend Process 2 (~10MB RAM)
Thread 3 Backend Process 3 (~10MB RAM)
Every physical connection consumes resources on both sides โ not just on the database.
Why creating a connection is expensiveโ
Opening a new connection from scratch requires a full multi-step handshake before a single SQL query can run:
Step 1 โ TCP Handshake (1 network round trip)
Client SYN โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ Server
Client โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ SYN-ACK
Client ACK โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ Server
Duration: 0.5โ10ms (depends on network)
Step 2 โ TLS/SSL Negotiation (2โ4 round trips)
ClientHello โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ Server
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ ServerHello + Certificate
Key Exchange โโโโโโโโโโโโโโโโโโโโโโโโโโโโโบ Server
Duration: 5โ50ms
Step 3 โ Database Authentication
Auth request + credentials โโโโโโโโโโโโโโโบ Server
โโโโโโโโโโโโโโโโโโโโโโโโโโ Permission check + session init
Duration: 2โ20ms
Step 4 โ Backend Process Allocation (PostgreSQL)
Server forks a new OS process for this connection
Allocates ~10MB RAM (work_mem, sort buffers, etc.)
Duration: 1โ5ms
Total: 10โ100ms before your first SQL query runs
A typical indexed SQL query: < 1ms
The implication: if every application request opened a fresh connection, connection setup would dominate your response time โ especially under load.
Connection cost comparisonโ
| Operation | Typical duration |
|---|---|
| Open a new TCP connection | 0.5โ10ms |
| TLS handshake | 5โ50ms |
| Database authentication | 2โ20ms |
| Borrow from existing pool | < 1ms |
| Simple indexed SQL query | 0.1โ5ms |
| Total new connection cost | 10โ100ms |
What is a Connection Pool?โ
A connection pool is a cache of pre-opened, pre-authenticated database connections that application threads borrow and return โ rather than creating and destroying a new connection on every request.
The library analogyโ
| Without pooling | With pooling |
|---|---|
| Every student registers for a new library card, gets a background check, receives a laminated card โ uses it once, and the card is destroyed | The library keeps 20 pre-made cards at the desk. Students borrow one, do their reading, return it โ the next student uses the same card immediately |
How borrowing and returning worksโ
Application starts:
Pool creates 10 connections to the DB (pre-authenticated, warm)
Pool: [C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8] [C9] [C10]
HTTP request arrives (Thread A needs a connection):
Thread A borrows C1 โ instantly, no TCP handshake
Pool: [ ] [C2] [C3] [C4] [C5] [C6] [C7] [C8] [C9] [C10]
Thread A runs queries on C1 (1ms)
Thread A finishes โ returns C1 to pool:
Pool: [C1] [C2] [C3] [C4] [C5] [C6] [C7] [C8] [C9] [C10]
Thread B borrows C1 immediately (no waiting, no handshake)
What happens when all connections are borrowed?โ
10 threads all holding connections simultaneously:
Pool: [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] [ ] โ empty
Thread 11 arrives and asks for a connection:
โ Waits in a queue for connection-timeout (e.g. 30 seconds)
โ If no connection is returned within timeout:
โ Throws SQLTransientConnectionException
โ HTTP request fails with 500
This is called POOL STARVATION โ the most common connection pool failure.
HikariCP โ Spring Boot's Default Poolโ
HikariCP is the fastest JVM connection pool. It is the default in Spring Boot since 2.x and is chosen for its extremely low overhead (single-digit microsecond borrow time) and robust failure detection.
Core parameters explainedโ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ HikariCP Pool (max=10) โ
โ โ
โ โโโโโโโโ โโโโโโโโ โโโโโโโโ โ Active (borrowed by threads) โ
โ โ C1 โ โ C2 โ โ C3 โ โ
โ โโโโโโโโ โโโโโโโโ โโโโโโโโ โ
โ โ
โ โโโโโโโโ โโโโโโโโ โโโโโโโโ โโโโโโโโ โ Idle (warm, ready) โ
โ โ C4 โ โ C5 โ โ C6 โ โ C7 โ โ
โ โโโโโโโโ โโโโโโโโ โโโโโโโโ โโโโโโโโ โ
โ โ
โ โ minimum-idle keeps at least N idle connections warm โ โ
โ โ maximum-pool-size is the total ceiling โ โ
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
| Parameter | Default | What it controls | Recommendation |
|---|---|---|---|
maximum-pool-size | 10 | Total connections (active + idle). Hard ceiling the DB sees. | Start at 10โ20; tune via load test |
minimum-idle | = max | Minimum idle connections kept warm | Set equal to maximum-pool-size (fixed pool) |
connection-timeout | 30,000ms | How long a thread waits to borrow before exception | Reduce to 2,000โ5,000ms for OLTP |
idle-timeout | 600,000ms | How long an idle connection lives before removal | Only relevant if min-idle < max |
max-lifetime | 1,800,000ms | Maximum age of any connection before recycling | Set 30s shorter than DB's wait_timeout |
keepalive-time | 0 (off) | Sends a test query to keep idle connections alive through firewalls | Set to 30,000โ60,000ms in cloud |
leak-detection-threshold | 0 (off) | Logs warning if a connection is held longer than N ms | Set to 2,000โ5,000ms in staging |
validation-timeout | 5,000ms | How long to test a connection before declaring it dead | Keep default |
Recommended production configurationโ
# application.yaml
spring:
datasource:
url: jdbc:postgresql://db-host:5432/mydb
username: ${DB_USER}
password: ${DB_PASSWORD}
hikari:
# Pool size โ tune for your DB server's CPU cores
maximum-pool-size: 20
minimum-idle: 20 # Fixed-size pool โ no dynamic churn
# Fail fast under load โ don't let threads queue for 30s
connection-timeout: 3000 # 3 seconds
# Recycle connections before DB kills them
max-lifetime: 1800000 # 30 minutes (set 30s below DB's wait_timeout)
# Keep connections alive through cloud firewalls
keepalive-time: 30000 # ping every 30 seconds
# Catch transactional leaks in staging
leak-detection-threshold: 5000
# Connection validation
connection-test-query: SELECT 1 # for MySQL; not needed for PostgreSQL
validation-timeout: 3000
# Pool name โ appears in logs and metrics
pool-name: HikariPool-Orders
Fixed-size pool vs dynamic poolโ
Dynamic pool (minimum-idle < maximum-pool-size):
Quiet period: pool shrinks to minimum-idle (e.g. 5 connections)
Traffic spike arrives: pool must create 15 new connections on demand
Each new connection: TCP + TLS + auth = 20โ80ms
โ First 15 requests of the spike experience high latency
โโโโโโโโโโโโโโโโโโโโโโโ
Response time: โ spike! โ____
โโโโโโโโโโโโโโโโโโโโโโโ
Fixed-size pool (minimum-idle = maximum-pool-size):
Pool always keeps 20 connections warm regardless of traffic
Traffic spike arrives: all 20 connections already ready
โ No latency spike on first requests
___________________________
Response time:
Use a fixed-size pool in production. The marginal cost of keeping extra connections warm is far less than the latency spike of establishing connections on demand.
Pool Sizingโ
The common beginner mistakeโ
"More connections = more parallel queries = higher performance. I'll set my pool to 500."
This is wrong. Here is why:
CPU parallelism limit: a database server with 8 CPU cores can physically execute exactly 8 queries simultaneously. If 500 connections all submit queries, 492 queries queue waiting for a CPU core. The OS spends more time on context switching than executing SQL.
Disk I/O bottleneck: a single SSD has a finite IOPS queue depth (typically 32โ128 concurrent I/O operations). 500 concurrent write transactions all competing for the same disk cause seek thrashing โ latency goes up, throughput goes down.
Lock contention: more concurrent transactions increase the probability of two transactions waiting for the same row lock. As active connections grow, lock wait time grows super-linearly.
Throughput vs Connection Count:
โฒ
Throughput โ โ
โ โ โ
โ โ โ
โ โ โโโโโโ โ plateau, then degradation
โ โ
โโโโโโโโโโโโโโโโโโโโโโโโบ
Connection count
โฒ
sweet spot
The sizing formulaโ
A widely-used empirical formula from PostgreSQL and Oracle benchmarks:
connections = (CPU_cores ร 2) + effective_spindle_count
Where:
CPU_cores = physical cores on the DATABASE server (not app server)
effective_spindle_count = 1 for a single SSD
= RAID disk count for spinning disks
= 1 for most cloud RDS/Aurora instances
Examples:
4-core DB server, single SSD: (4 ร 2) + 1 = 9 โ start at 10
8-core DB server, single SSD: (8 ร 2) + 1 = 17 โ start at 20
16-core DB server, SSD RAID-10 (4 disks): (16 ร 2) + 4 = 36 โ start at 40
:::tip This formula gives you a starting point, not the final answer The formula captures CPU and I/O saturation. It does not account for:
- Query mix (read-heavy vs write-heavy)
- Average query duration
- Lock contention patterns
- Number of application instances sharing the pool
Always validate with load testing โ find the "knee" where adding connections stops improving throughput and starts increasing latency. :::
Sizing across multiple application instancesโ
The formula gives the total connections the DB can handle. With multiple app instances, divide across them:
DB can handle: 20 connections total (from formula)
Running 4 app instances (pods/containers):
Max pool per instance = 20 / 4 = 5
# application.yaml on each instance:
hikari.maximum-pool-size: 5
If you scale from 4 to 8 app instances without adjusting the pool size, you go from 20 total connections to 40 โ overloading the database. Either reduce per-instance pool size or put a connection proxy (PgBouncer/RDS Proxy) in front.
Failure Modesโ
Pool starvation โ the cascade failureโ
Pool starvation occurs when all connections are borrowed and new requests queue, time out, and fail โ cascading into a service outage.
Normal state (pool size = 10):
Avg request holds connection for 5ms
At 200 req/sec: 200 ร 0.005s = 1 connection in use at any time
Pool: 1 active, 9 idle โ
Slow query incident (avg hold time jumps to 500ms):
At 200 req/sec: 200 ร 0.5s = 100 connections needed
Pool only has 10 โ 90 threads queue
Queue fills โ requests timeout at connection-timeout (30s by default)
All 200 req/sec fail with SQLTransientConnectionException
โ Cascade: upstream services retry โ more load โ starvation deepens โ
How to detect it: hikaricp.connections.pending > 0 sustained for more than a few seconds.
How to prevent it:
- Eliminate slow queries โ add indexes, fix N+1 queries.
- Reduce
connection-timeoutto 2โ5 seconds so threads fail fast and don't pile up. - Add a circuit breaker โ stop accepting requests when pool is exhausted.
- Scale the pool (but only after fixing the root cause).
Anti-pattern 1 โ Holding connections across slow operationsโ
The most common cause of pool starvation is keeping a connection borrowed during non-database work โ external HTTP calls, file I/O, or heavy computation.
// โ DANGEROUS: DB connection held for the entire 2-second API call
@Transactional
public void processOrder(Long orderId) {
Order order = orderRepository.findById(orderId).orElseThrow();
// โ connection is held from here...
PaymentResult result = paymentGateway.charge(order); // โ 2000ms HTTP call
// โ ...to here. Connection is idle but unavailable to other threads.
order.setStatus(result.isSuccess() ? PAID : FAILED);
orderRepository.save(order);
}
// With 10 concurrent requests: 10 threads ร 2s = all 10 pool connections
// exhausted for 2 seconds. Request 11 waits 30s then fails.
// โ
CORRECT: Connection is held only during actual DB work (ms, not seconds)
public void processOrder(Long orderId) {
// Transaction 1: fetch data โ holds connection for ~2ms
Order order = orderService.getOrderById(orderId);
// Outside any transaction โ no connection borrowed
PaymentResult result = paymentGateway.charge(order); // 2000ms HTTP call
// Transaction 2: update status โ holds connection for ~2ms
orderService.updateStatus(orderId, result.isSuccess() ? PAID : FAILED);
}
@Service
public class OrderService {
@Transactional(readOnly = true)
public Order getOrderById(Long id) { return repo.findById(id).orElseThrow(); }
@Transactional
public void updateStatus(Long id, OrderStatus status) { repo.updateStatus(id, status); }
}
Rule of thumb: the connection is held from the first database call inside a @Transactional method to when the method returns. Every millisecond your method spends on non-DB work is wasted borrowed connection time.
Anti-pattern 2 โ N+1 queries depleting the poolโ
N+1 queries hold a connection while executing N serial round trips. Under load, this multiplies connection hold time by N.
// โ N+1: holds the connection for 1 + N DB round trips
@Transactional
public List<OrderDto> getOrdersWithUsers(List<Long> orderIds) {
List<Order> orders = orderRepository.findAllById(orderIds); // 1 query
return orders.stream()
.map(o -> {
User user = userRepository.findById(o.getUserId()).orElseThrow(); // N queries
return new OrderDto(o, user);
})
.collect(toList());
}
// 100 orders = 101 queries ร (avg 2ms each) = 202ms connection hold time
// vs JOIN FETCH: 1 query ร 5ms = 5ms connection hold time โ 40ร better
// โ
Fix: single JOIN FETCH โ one round trip, connection released 40ร faster
@Query("SELECT o FROM Order o JOIN FETCH o.user WHERE o.id IN :ids")
List<Order> findAllWithUsers(@Param("ids") List<Long> ids);
Anti-pattern 3 โ Thread pool / connection pool mismatch deadlockโ
If your web server has more threads than the pool has connections, you can hit a deadlock when threads depend on each other:
Setup:
Tomcat max-threads = 200
HikariCP maximum-pool-size = 10
Scenario:
200 requests arrive โ Tomcat spawns 200 threads
Thread 1โ10 borrow all 10 connections
Thread 11โ200 queue waiting for connections
Thread 1 (holding C1) makes an internal REST call to /api/helper
โ That request also needs a connection from the same pool
โ All 10 connections are held by threads 1โ10
โ Thread for /api/helper queues at connection-timeout (30s)
โ Thread 1 waits for /api/helper โ Thread 1 cannot release C1
โ DEADLOCK โ no thread can make progress
Fix options:
| Fix | Approach |
|---|---|
| Increase pool size | Add more connections to break the cycle (increases DB load) |
| Separate pools | Use a dedicated pool for internal sub-calls |
Reduce connection-timeout | Fail fast โ threads waiting > 3s throw an exception, breaking the cycle |
| Remove internal synchronous calls | Async messaging (Kafka/RabbitMQ) eliminates the dependency |
@Transactional(readOnly = true) | Ensures sub-requests don't also need write connections |
Anti-pattern 4 โ @Transactional on public vs private methodsโ
Spring's @Transactional uses AOP proxies โ it only intercepts calls through the Spring proxy, not direct this.method() calls:
@Service
public class OrderService {
// โ Self-invocation โ @Transactional is ignored
public void processAll(List<Long> ids) {
for (Long id : ids) {
this.processSingle(id); // โ calls directly on 'this', not through proxy
}
}
@Transactional
public void processSingle(Long id) {
// This method's @Transactional has no effect when called from processAll()
// Result: no transaction, dirty reads possible, connection borrowed without transaction scope
orderRepository.findById(id);
}
// โ
Fix: inject self-reference so calls go through the proxy
@Autowired
private OrderService self;
public void processAll(List<Long> ids) {
for (Long id : ids) {
self.processSingle(id); // goes through Spring proxy โ @Transactional works
}
}
}
PgBouncerโ
Why PostgreSQL needs a connection proxyโ
PostgreSQL's architecture spawns a dedicated OS process for every connection โ not a lightweight thread. Each process consumes:
- ~10 MB RAM for memory buffers (work_mem, sort buffers, temp storage)
- OS process overhead (context switches, scheduling)
- File descriptor allocation
10 connections โ 100 MB RAM used on the DB server for processes alone
100 connections โ 1 GB RAM
500 connections โ 5 GB RAM โ database server starts swapping โ performance collapses
PgBouncer sits between your application and PostgreSQL, multiplexing many application connections onto a small number of real PostgreSQL connections:
Without PgBouncer:
[App instance 1] (pool=20) โโโบ
[App instance 2] (pool=20) โโโบ PostgreSQL (60 processes ร 10MB = 600MB)
[App instance 3] (pool=20) โโโบ
With PgBouncer:
[App instance 1] (pool=20) โโโ
[App instance 2] (pool=20) โโโผโโโบ [PgBouncer] โโโบ PostgreSQL (20 processes ร 10MB = 200MB)
[App instance 3] (pool=20) โโโ
โ โ
60 client connections 20 real DB connections
(cheap โ just sockets) (expensive โ OS processes)
PgBouncer pool modesโ
- Transaction mode (most common)
- Session mode (safest)
- Statement mode (rarely used)
A real PostgreSQL connection is assigned to a client only for the duration of one transaction. Once the transaction commits or rolls back, the connection is returned to PgBouncer's pool.
Client 1: BEGIN โ INSERT โ COMMIT โ uses real connection C1
โ (between transactions, C1 is free)
Client 2: BEGIN โ SELECT โ COMMIT โ uses real connection C1 (same physical conn!)
Multiplexing ratio: 3 real connections can serve 10+ clients if transactions are short.
Gotchas:
| Feature | Broken in transaction mode? | Fix |
|---|---|---|
| Prepared statements (JDBC) | โ Yes | Set prepareThreshold=0 in JDBC URL or disable in Hibernate |
SET session variables | โ Yes | Use SET LOCAL inside a transaction instead |
| Temporary tables | โ Yes | Use regular tables with session-scoped cleanup |
| Advisory locks | โ Yes | Avoid or use application-level locks |
LISTEN/NOTIFY | โ Yes | Use a dedicated non-pooled connection |
# Spring datasource URL โ disable prepared statement caching for PgBouncer transaction mode
spring.datasource.url=jdbc:postgresql://pgbouncer:5432/mydb?prepareThreshold=0
A real connection is assigned to a client for their entire session (from connect to disconnect). The connection is only returned when the client disconnects.
Client 1 connects โ gets C1 โ stays on C1 until client 1 disconnects
Client 2 connects โ gets C2 โ stays on C2 until client 2 disconnects
This is safe for all PostgreSQL features but provides minimal connection multiplexing โ the gain is mainly in connection setup amortisation across reconnects, not in reducing the number of real DB connections.
Use when: you need full PostgreSQL session features (advisory locks, temp tables, LISTEN) but want connection setup amortisation.
A real connection is assigned only for a single SQL statement and immediately returned after execution.
Client sends: SELECT * FROM users WHERE id = 1
โ PgBouncer assigns C1 for this one statement
โ C1 returned immediately after the result set is sent
Client sends: SELECT * FROM orders WHERE user_id = 1
โ PgBouncer may assign C2 (different physical connection!)
This breaks multi-statement transactions entirely โ you cannot use BEGIN/COMMIT. Only useful for purely read-only, statement-by-statement workloads (rare).
PgBouncer configurationโ
# pgbouncer.ini
[databases]
mydb = host=postgres-server port=5432 dbname=mydb
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000 # max clients connecting to PgBouncer
default_pool_size = 20 # real PostgreSQL connections per database
reserve_pool_size = 5 # extra connections for emergencies
reserve_pool_timeout = 3 # seconds before using reserve pool
# Connection to PostgreSQL
server_lifetime = 1800 # recycle real connections after 30 min
server_idle_timeout = 600 # remove idle real connections after 10 min
server_connect_timeout = 5
server_login_retry = 3
# Client connection limits
client_login_timeout = 60
query_timeout = 30 # kill queries running over 30 seconds
transaction_timeout = 300 # kill transactions over 5 minutes
# Auth
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
# Logging
log_connections = 1
log_disconnections = 1
stats_period = 60 # log pool stats every 60 seconds
RDS Proxy (for Serverless)โ
The Lambda connection explosion problemโ
AWS Lambda scales out horizontally โ each concurrent invocation is a separate execution environment that opens its own database connections:
Normal load: 10 Lambda instances ร pool=5 = 50 connections โ
Traffic spike: 500 Lambda instances ร pool=5 = 2,500 connections
2,500 connections hitting RDS:
PostgreSQL: 2,500 ร 10MB RAM = 25 GB just for connection processes
Result: RDS runs out of memory โ connection refused โ all Lambdas fail โ
RDS Proxy solves this by acting as a managed connection pool between Lambda and RDS:
[Lambda 1] โโโ
[Lambda 2] โโโค
[Lambda 3] โโโค All connect to RDS Proxy (thin, cheap connections)
... โโโค
[Lambda 500]โโโ
โ
[RDS Proxy] โ maintains a small fixed pool of real RDS connections
โ
[Amazon RDS] โ only sees 20โ50 real connections, not 2,500
RDS Proxy featuresโ
| Feature | Detail |
|---|---|
| Connection multiplexing | Thousands of Lambda connections โ tens of real RDS connections |
| IAM authentication | Lambda functions authenticate via IAM role, not hardcoded DB passwords |
| Failover handling | On RDS Multi-AZ failover, RDS Proxy queues queries and routes them to the new primary automatically โ Lambda functions don't see the failover |
| Connection pinning | Some SQL features (temp tables, SET session variables) "pin" a Lambda to a specific RDS connection โ reduces multiplexing efficiency |
| Secrets Manager integration | DB credentials auto-rotated without Lambda redeployment |
When RDS Proxy helps vs hurtsโ
| Scenario | RDS Proxy helpful? |
|---|---|
| Serverless (Lambda, Fargate spot) โ many short-lived clients | โ Essential |
| Containers on ECS/EKS with fixed instance count | โ ๏ธ PgBouncer may be cheaper |
| Traditional EC2 application with long-lived pool | โ Adds latency with no benefit |
| RDS Multi-AZ with critical failover SLA | โ Smooth failover |
| PostgreSQL with heavy use of session features | โ ๏ธ Connection pinning reduces efficiency |
Connection Validation Strategiesโ
A connection can go stale โ the database server closes it (timeout, restart, network interruption) while your pool still considers it valid. Borrowing a stale connection results in an immediate SQLException.
How pools detect stale connectionsโ
- Test on borrow (safe, slight overhead)
- Keepalive (preferred)
- Max lifetime (always use)
Before handing a connection to a thread, the pool sends a lightweight test query (SELECT 1). If it fails, the connection is discarded and a new one is created.
# HikariCP โ enable test on borrow
hikari:
connection-test-query: SELECT 1 # for older drivers; modern drivers use isValid()
validation-timeout: 3000 # fail if test takes > 3s
Overhead: ~0.1ms per borrow (negligible). Recommended for production.
Instead of testing on every borrow, send a lightweight heartbeat to idle connections periodically. This prevents stale connections from accumulating without the per-borrow overhead.
hikari:
keepalive-time: 30000 # ping idle connections every 30 seconds
# HikariCP uses Connection.isValid() as the keepalive probe
Best for: cloud environments where firewalls kill idle TCP connections after a few minutes.
Regardless of validation, recycle every connection after a maximum age. This catches connections that have silently degraded (memory leaks in the DB backend process, server-side statement cache pollution).
hikari:
max-lifetime: 1800000 # recycle connections after 30 minutes
# CRITICAL: set 30 seconds shorter than your DB's wait_timeout/tcp_keepalive_time
For PostgreSQL:
-- Check your DB's connection limit settings
SHOW idle_in_transaction_session_timeout;
SHOW tcp_keepalives_idle;
-- Set max-lifetime in HikariCP to at least 30s shorter than these values
Connection Pool Library Comparisonโ
- HikariCP (default)
- c3p0 (legacy)
- Apache DBCP2
- R2DBC (reactive)
<!-- Already included in Spring Boot starter โ no extra dependency needed -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
- Fastest borrow/return time (~microseconds)
- Smallest codebase (~130KB jar)
- Best diagnostics (leak detection, metrics)
- Default in Spring Boot 2.x+
Choose HikariCP for all new Spring Boot applications.
<dependency>
<groupId>com.mchange</groupId>
<artifactId>c3p0</artifactId>
<version>0.9.5.5</version>
</dependency>
- Older pool โ still found in legacy codebases
- More configuration options but more complex
- Slower than HikariCP
- Known for memory leaks under certain configurations
Avoid for new code. Migrate to HikariCP.
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-dbcp2</artifactId>
</dependency>
- Commons DBCP2 is the Apache pool
- Reliable, well-maintained
- Slower than HikariCP (~2-3ร)
- Used in some Apache Tomcat configurations
Use only if you have a specific Apache ecosystem requirement.
<!-- For reactive Spring WebFlux applications -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-r2dbc</artifactId>
</dependency>
<dependency>
<groupId>io.r2dbc</groupId>
<artifactId>r2dbc-pool</artifactId>
</dependency>
spring:
r2dbc:
url: r2dbc:postgresql://localhost:5432/mydb
pool:
initial-size: 5
max-size: 20
max-idle-time: 30m
validation-query: SELECT 1
R2DBC uses non-blocking I/O โ connections are never "held" blocking a thread. A reactive pipeline releases the connection between async database calls automatically.
Use only with Spring WebFlux. Do not mix with Spring MVC (blocking) โ you will lose the benefits and add complexity.
Monitoring & Observabilityโ
Auto-exposing HikariCP metrics via Actuatorโ
<!-- pom.xml โ Actuator and Micrometer -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-registry-prometheus</artifactId>
</dependency>
# application.yaml โ expose metrics endpoint
management:
endpoints:
web:
exposure:
include: health, metrics, prometheus
metrics:
tags:
application: ${spring.application.name}
// Register HikariCP metrics with Micrometer
@Configuration
public class HikariMetricsConfig {
@Bean
public MeterRegistryCustomizer<MeterRegistry> hikariPoolMetrics(DataSource dataSource) {
return registry -> {
if (dataSource instanceof HikariDataSource hds) {
new HikariDataSourcePoolMetrics(hds, registry, Tags.empty()).bindTo(registry);
}
};
}
}
Key metrics and alert thresholdsโ
| Metric | What it measures | Alert threshold |
|---|---|---|
hikaricp.connections.active | Currently borrowed connections | Approaching maximum-pool-size sustained |
hikaricp.connections.pending | Threads waiting for a connection | > 0 for more than 5 seconds โ pool starvation |
hikaricp.connections.idle | Warm, available connections | Drops to 0 โ imminent starvation |
hikaricp.connections.acquire | Avg time to borrow a connection | P99 > 10ms โ pool under pressure |
hikaricp.connections.creation | Time to open a new physical connection | Spike โ dynamic pool churn (fix: fixed-size pool) |
hikaricp.connections.usage | Time a connection is held per borrow | Spike โ long transactions or slow queries |
hikaricp.connections.timeout.total | Count of connection-timeout exceptions | Any > 0 โ pool exhaustion events |
Grafana dashboard queries (PromQL)โ
# Active connections as % of pool capacity โ alert at 80%
hikaricp_connections_active / hikaricp_connections_max * 100
# Pending threads โ alert if > 0 for > 30 seconds
hikaricp_connections_pending > 0
# Connection acquisition P99 latency
histogram_quantile(0.99, rate(hikaricp_connections_acquire_seconds_bucket[5m]))
# Timeout rate per minute
rate(hikaricp_connections_timeout_total[1m]) * 60
Connection leak detectionโ
// Enable in staging and production
spring:
datasource:
hikari:
leak-detection-threshold: 5000 # warn if held > 5 seconds
// When a leak is detected, HikariCP logs:
// WARN HikariPool-Orders - Connection leak detection triggered for
// com.example.OrderService.processOrder(OrderService.java:42),
// stack trace follows ...
//
// Stack trace shows EXACTLY which method borrowed and didn't return the connection.
Production Patternsโ
๐ฌ Senior deep-dive: connection pool per DataSource in multi-tenant systems
In a multi-tenant SaaS app where each tenant has their own database schema or database instance, you need a pool per tenant โ not one shared pool:
@Configuration
public class MultiTenantDataSourceConfig {
// Lazily created pool per tenant โ created on first request, reused thereafter
private final ConcurrentHashMap<String, HikariDataSource> pools = new ConcurrentHashMap<>();
public DataSource getDataSourceForTenant(String tenantId) {
return pools.computeIfAbsent(tenantId, this::createPool);
}
private HikariDataSource createPool(String tenantId) {
HikariConfig config = new HikariConfig();
config.setJdbcUrl("jdbc:postgresql://db-" + tenantId + ":5432/app");
config.setUsername(getTenantUser(tenantId));
config.setPassword(getTenantPassword(tenantId));
config.setMaximumPoolSize(5); // smaller pool per tenant
config.setMinimumIdle(2);
config.setConnectionTimeout(3000);
config.setPoolName("HikariPool-" + tenantId);
return new HikariDataSource(config);
}
// Shutdown pools gracefully on application stop
@PreDestroy
public void closeAll() {
pools.values().forEach(HikariDataSource::close);
}
}
Key consideration: if you have 1,000 tenants and each pool keeps 2 idle connections, you have 2,000 open connections to your database cluster. Use PgBouncer in front of each database or shared schema isolation to manage this.
๐ฌ Senior deep-dive: read/write splitting with separate pools
For read-heavy applications, route read-only queries to replicas and writes to the primary โ each with its own pool:
@Configuration
public class ReadWriteDataSourceConfig {
@Bean
@Primary
public DataSource routingDataSource(
@Qualifier("primaryDs") DataSource primary,
@Qualifier("replicaDs") DataSource replica) {
AbstractRoutingDataSource routing = new AbstractRoutingDataSource() {
@Override
protected Object determineCurrentLookupKey() {
// TransactionSynchronizationManager tells us if we're in a read-only tx
return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
? "replica" : "primary";
}
};
routing.setTargetDataSources(Map.of("primary", primary, "replica", replica));
routing.setDefaultTargetDataSource(primary);
return routing;
}
@Bean("primaryDs")
public DataSource primaryDataSource() {
HikariConfig cfg = new HikariConfig();
cfg.setJdbcUrl("jdbc:postgresql://primary-db:5432/mydb");
cfg.setMaximumPoolSize(20);
cfg.setPoolName("HikariPool-Primary");
return new HikariDataSource(cfg);
}
@Bean("replicaDs")
public DataSource replicaDataSource() {
HikariConfig cfg = new HikariConfig();
cfg.setJdbcUrl("jdbc:postgresql://replica-db:5432/mydb");
cfg.setMaximumPoolSize(40); // more connections โ reads are higher volume
cfg.setPoolName("HikariPool-Replica");
return new HikariDataSource(cfg);
}
}
// Usage โ Spring routes to replica automatically
@Transactional(readOnly = true) // โ routes to replica pool
public List<OrderDto> listOrders() { return orderRepository.findAll(); }
@Transactional // โ routes to primary pool
public OrderDto createOrder(CreateOrderRequest req) { ... }
๐ฌ Senior deep-dive: graceful shutdown and connection draining
During a rolling deployment, you must drain connections gracefully โ in-flight queries must complete before the pool closes:
@Configuration
public class GracefulShutdownConfig {
@Bean
public HikariDataSource dataSource() {
HikariConfig config = new HikariConfig();
config.setMaximumPoolSize(20);
config.setConnectionTimeout(3000);
return new HikariDataSource(config);
}
// Spring Boot calls this before the JVM exits
@PreDestroy
public void closeDataSource(HikariDataSource dataSource) {
// HikariCP close() waits for active connections to be returned
// before physically closing the underlying TCP sockets.
// In-flight transactions will complete; new borrows will fail immediately.
dataSource.close();
log.info("Connection pool closed gracefully");
}
}
# application.yaml โ give enough time for in-flight requests to complete
server:
shutdown: graceful # wait for active requests before stopping
spring:
lifecycle:
timeout-per-shutdown-phase: 30s # 30 seconds to drain
Kubernetes readiness probe: set the pod to NotReady first (so no new traffic is routed), then wait for in-flight requests to drain, then close the pool. The pool's close() call blocks until all connections are returned.
Common Mistakesโ
| Mistake | Problem | Fix |
|---|---|---|
minimum-idle < maximum-pool-size (dynamic pool) | Latency spike on traffic surges while new connections are established | Set minimum-idle = maximum-pool-size for a fixed-size pool |
connection-timeout: 30000 (default 30s) | Threads queue for 30 seconds before failing โ request pile-up under starvation | Reduce to 2,000โ5,000ms to fail fast |
External HTTP call inside @Transactional | Connection held idle during slow external call โ pool depletes | Move external calls outside @Transactional scope |
| N+1 queries inside a transaction | Connection held for N sequential DB round trips | Fix with JOIN FETCH or @EntityGraph |
| Pool size = 500 "for performance" | DB CPU thrash, context switching, lock contention โ performance collapses | Use the sizing formula: (cores ร 2) + spindles |
No max-lifetime configured | Connections silently stale โ DB closes them, JDBC gets broken pipe errors | Always set max-lifetime: 1800000 (30 min) |
leak-detection-threshold disabled in production | Connection leaks silently drain the pool with no log evidence | Enable at 5,000ms; the stack trace shows exactly where the leak is |
| PgBouncer transaction mode + prepared statements | ERROR: prepared statement "S_1" does not exist | Add prepareThreshold=0 to JDBC URL |
| Lambda without RDS Proxy | 500 Lambda instances ร pool=5 = 2,500 connections โ RDS OOM crash | Use RDS Proxy for serverless; limit pool size on Lambda to 1โ2 |
| Not accounting for multiple app instances | 5 instances ร pool=20 = 100 connections โ exceeds DB formula of 20 | Divide total by instance count; use PgBouncer to centralise pooling |
๐ฏ Interview Questionsโ
Q1. What is connection pooling and why is it needed?
Opening a new database connection requires a TCP handshake, TLS negotiation, database authentication, and backend process allocation โ typically 10โ100ms. A connection pool maintains a cache of pre-opened, pre-authenticated connections. Threads borrow a connection (< 1ms), run their query, and return it. This eliminates the per-request setup cost, enables connection reuse, and caps the total number of connections the database must serve โ preventing resource exhaustion.
Q2. How does HikariCP determine when to time out a connection request?
When a thread calls
getConnection(), HikariCP checks if an idle connection is available. If not, it places the thread in a queue and waits up toconnection-timeoutmilliseconds (default 30s). If no connection becomes available within that window, HikariCP throwsSQLTransientConnectionException. This is pool starvation โ the signal that either pool size is too small, queries are holding connections too long, or the system is overloaded.
Q3. Why is setting the pool size to a very large number a bad idea?
More connections does not equal more throughput. A database server with N CPU cores can execute exactly N queries simultaneously. Adding more concurrent connections beyond the optimal point causes: (1) CPU context switching overhead โ the OS wastes cycles saving and restoring thread state; (2) disk I/O contention โ concurrent writes compete for finite IOPS; (3) increased lock contention โ more active transactions mean more row-level and table-level lock waits. Performance peaks at the "knee" of the throughput curve and degrades beyond it. The empirical formula
(CPU_cores ร 2) + spindle_countgives a good starting point โ validate with load testing.
Q4. What is pool starvation and what causes it?
Pool starvation is when all connections are borrowed simultaneously, causing new requests to queue at
connection-timeoutand then fail. Common causes: (1) N+1 queries โ one request holds a connection across N serial DB round trips; (2) long transactions โ a@Transactionalmethod makes a slow external HTTP call while holding the connection idle; (3) thread pool / connection pool mismatch โ more app threads than DB connections, causing threads waiting for connections to block threads holding them; (4) slow queries โ a query suddenly takes 5ร longer, multiplying connection hold time across all concurrent threads.
Q5. What is the difference between PgBouncer session mode and transaction mode?
In session mode, a real PostgreSQL connection is assigned to a client for their entire session โ returned only when they disconnect. Safe for all features but provides limited multiplexing. In transaction mode, the real connection is assigned only for the duration of a transaction โ returned immediately on
COMMIT/ROLLBACK. Enables 10ร multiplexing (one real connection can serve multiple clients interleaved between transactions) but breaks connection-scoped features: prepared statement caching, temporary tables,SETsession variables, and advisory locks. Transaction mode is the production default; prepared statement caching must be disabled viaprepareThreshold=0in the JDBC URL.
Q6. Why does Lambda without RDS Proxy crash the database during traffic spikes?
AWS Lambda scales horizontally โ each concurrent invocation runs in an isolated execution environment with its own database connections. A traffic spike from 10 to 500 concurrent Lambda invocations causes 500 separate connection pools to try connecting to RDS simultaneously. PostgreSQL spawns one OS process per connection (~10MB RAM each) โ 500 connections consume 5GB of RAM just for connection management, often exceeding the RDS instance's available memory. The database crashes with "too many connections" or runs out of memory. RDS Proxy solves this by multiplexing the 500 Lambda connections onto 20โ50 real RDS connections.
Q7. (Senior) How would you diagnose a connection pool leak in production without restarting the application?
Step 1: Enable
leak-detection-threshold: 5000in HikariCP config (requires restart or dynamic config update). This logs a warning with a full stack trace identifying exactly which method borrowed a connection and didn't return it within 5 seconds. Step 2: Checkhikaricp.connections.pendingin Prometheus/Grafana โ sustained pending count > 0 confirms starvation. Step 3: Checkhikaricp.connections.activeapproachingmaximum-pool-sizeโ confirm all connections are borrowed. Step 4: Query the database directly:SELECT pid, query, state, query_start, state_change FROM pg_stat_activity WHERE state = 'idle in transaction'โ connections in "idle in transaction" state have borrowed a connection but are not executing SQL (the leak). Step 5: Correlate thepidfrompg_stat_activitywith thepg_locksview to see if these connections hold locks blocking other queries.