Distributed Systems · Fault Tolerance · Event-Driven Architecture · Java NIO · Non-Blocking I/O · Gossip Replication · Vector Timestamps · P-RAM Consistency · Deduplication · Recovery Protocols · Docker

Distributed Fault-Tolerant IRC-Style Chat System

A distributed IRC-style chat system with gossip-based replication, fault-tolerant client/server recovery, and load-aware routing via replicated Addressing Servers. Built on an event-driven Java NIO architecture for scalable, non-blocking concurrency.

Overview

A distributed IRC-style chat system designed around real distributed-systems concerns: membership/routing, replication, message ordering guarantees, and recovery under failure. Clients connect through an Addressing Server cluster to discover and register with Chat Servers. Chat Servers operate as peers (no central chat authority), persist chat logs to disk, and replicate new messages using configurable gossip fanout (forwarding to the N nearest peers). Servers recover state by requesting chat history from peers, ensuring missed messages are re-fetched after crashes/partitions. The networking layer is built with Java NIO (selectors/channels) to multiplex many client + peer connections without per-connection threads.

Built to deliver end-to-end distributed reliability: correct message propagation under reordering and duplication, robust reconnection flows for both clients and servers, and practical failure detection and recovery, all while remaining scalable on an event-driven runtime.

Your Role

What I Built

  • Chat Server: peer-to-peer replication, disk-backed chat log, and non-blocking client + server networking
  • Client: registration, chat UX/commands, reconnect + resend, and history synchronization
  • Protocol foundation: message/data types and JSON schemas used across the system (including shared types used by the Addressing Server)
  • Registration + reconnection handshake flows between Client↔Addressing Server and Chat Server↔Addressing Server
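The shared protocol foundation above can be sketched as a message envelope carried over the TCP + JSON transport. The field names (`type`, `messageId`, `sender`, `vectorClock`, `payload`) are illustrative assumptions, not the project's actual schema:

```java
// Hypothetical sketch of a shared message envelope for the JSON-over-TCP
// protocol. Field names are assumptions for illustration; the serializer
// is hand-rolled and assumes field values contain no quote characters.
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

public record Envelope(String type, String messageId, String sender,
                       Map<String, Integer> vectorClock, String payload) {

    // Minimal JSON serialization without external libraries.
    public String toJson() {
        String clock = vectorClock.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + e.getValue())
                .collect(Collectors.joining(","));
        return "{\"type\":\"" + type + "\",\"messageId\":\"" + messageId
                + "\",\"sender\":\"" + sender + "\",\"vectorClock\":{" + clock
                + "},\"payload\":\"" + payload + "\"}";
    }

    // Convenience factory: fresh UUID gives the unique ID used for dedup.
    public static Envelope chat(String sender, Map<String, Integer> clock, String text) {
        return new Envelope("CHAT", UUID.randomUUID().toString(), sender, clock, text);
    }
}
```

A single envelope shape across Client, Chat Server, and Addressing Server keeps parsing/dispatch code uniform on all three components.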

What I Owned End-to-End

  • Chat Server runtime (Java NIO selector loop, connection state, parsing, routing, and write-queue behavior)
  • Client reliability: bootstrap/redirect, reconnect handshake, resend buffering, and history refresh after failover
  • Chat Server registration + re-registration handshake (initial join + recovery) with Addressing Servers
  • Core protocol/data model design (message envelopes, command/object types, payload shapes, and lifecycle semantics)
  • Replication behavior: gossip fanout (configurable N-nearest) + state transfer via peer chat-log requests
  • Deduplication strategy based on unique message IDs (ignore duplicates on retrieval/ingest)
  • Ordering guarantee: P-RAM consistency using vector timestamps for partial ordering and convergence
  • Heartbeat + zombie detection to avoid lingering dead peers and trigger recovery workflows
  • Integration code connecting Addressing Server coordination with client/server networking flows
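The deduplication strategy above reduces to a set of already-seen message IDs consulted on every ingest path (gossip, chat-log retrieval, client resend). A minimal sketch, assuming each message carries a globally unique ID:

```java
// Sketch of ID-based deduplication on ingest: the first delivery wins,
// redundant copies arriving via other gossip paths are dropped.
import java.util.HashSet;
import java.util.Set;

public class Deduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the message is new and should be applied/forwarded. */
    public boolean ingest(String messageId) {
        return seen.add(messageId); // Set.add() returns false for duplicates
    }
}
```

Because gossip deliberately sends each message along multiple paths, this check is what keeps redundancy from turning into duplicate delivery.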

Technical Highlights

Architecture Decisions

  • Two-layer system: Addressing Server cluster for routing/membership + peer Chat Server network for replicated message plane
  • Event-driven, selector-based Java NIO architecture (non-blocking reads/writes; no per-connection threads)
  • Disk-backed chat history on Chat Servers (append + reload), with recovery by requesting peer logs to re-fill gaps
  • Explicit connection context/state machines for clients, peer servers, and addressing-server coordination
  • Dockerized deployment: separate images for client, chat server, and addressing server with env-driven configuration

Algorithms / Protocols / Constraints

  • Gossip-based replication with configurable fanout: forward each new message to the N nearest peers (tunable N)
  • State transfer: REQUEST_CHATLOG / RESPONSE_CHATLOG to synchronize logs on join/rejoin and fill missed messages
  • P-RAM consistency via vector timestamps (preserve per-sender order; allow concurrent interleavings)
  • Unique message IDs + deduplication on ingest to tolerate redundant propagation paths
  • Heartbeat PING/PONG between chat servers with missed-response thresholds for failure detection

Failure Handling

  • Client failover: Addressing Server-assisted redirection to a healthy Chat Server after server failure
  • Client-side buffering + resend to avoid message loss across disconnects
  • Server recovery: re-register + request chat log from peers to reconstruct state after crash/partition
  • Zombie detection: periodic health checks so clients/servers don’t remain attached to dead peers
  • Peer isolation: stop sending to peers that miss heartbeats; reconnect workflows restore membership
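The heartbeat-driven zombie detection above amounts to a per-peer miss counter: each PING round without a PONG increments it, and crossing a threshold marks the peer dead and triggers recovery (drop from the gossip set, redirect its clients). A sketch, with the threshold value as an illustrative assumption:

```java
// Sketch of PING/PONG failure detection with a missed-response threshold.
// maxMissed is a tunable assumption, not the project's actual setting.
import java.util.HashMap;
import java.util.Map;

public class FailureDetector {
    private final int maxMissed;
    private final Map<String, Integer> missed = new HashMap<>();

    public FailureDetector(int maxMissed) { this.maxMissed = maxMissed; }

    /** Called once per PING round for each known peer. */
    public void onPingSent(String peer) { missed.merge(peer, 1, Integer::sum); }

    /** A PONG arrived: the peer is alive, reset its counter. */
    public void onPongReceived(String peer) { missed.put(peer, 0); }

    /** Past the threshold the peer is treated as dead (a "zombie"). */
    public boolean isDead(String peer) {
        return missed.getOrDefault(peer, 0) >= maxMissed;
    }
}
```

Resetting on every PONG rather than decrementing keeps a flaky-but-alive peer from being declared dead by sporadic losses.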

Optimization Strategies

  • Non-blocking multiplexed I/O (selectors) to scale with concurrent clients + peer connections
  • Write-queue style non-blocking sends to avoid stalling the event loop on slow sockets
  • Tunable gossip fanout to trade bandwidth for faster convergence
  • Separation of protocol parsing/dispatch from low-level I/O readiness handling
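The write-queue pattern above can be sketched as a per-connection deque of buffers drained only as far as the socket accepts bytes; a generic `WritableByteChannel` stands in for the non-blocking `SocketChannel` so the example is self-contained:

```java
// Sketch of non-blocking write queuing: outbound data is buffered per
// connection so a slow peer never stalls the event loop. In the real
// loop, flush() returning false means keep OP_WRITE interest registered.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Deque;

public class WriteQueue {
    private final Deque<ByteBuffer> pending = new ArrayDeque<>();

    public void enqueue(byte[] data) { pending.add(ByteBuffer.wrap(data)); }

    /** Write as much as the channel accepts; true once fully drained. */
    public boolean flush(WritableByteChannel ch) throws IOException {
        while (!pending.isEmpty()) {
            ByteBuffer head = pending.peek();
            ch.write(head);
            if (head.hasRemaining()) return false; // socket full; retry later
            pending.poll();
        }
        return true;
    }
}
```

The event loop toggles `OP_WRITE` based on the return value, so writable-readiness wakeups only occur while there is actually queued data.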

Tech Stack

Java · Java NIO (Selectors, SocketChannel) · TCP + JSON Protocol · Event-Driven Concurrency · Vector Timestamps · Gossip Replication · Docker + Docker Compose · Bash Scripts + Env-based Configuration

Results / Learnings

What Worked

  • Implemented disk-backed message persistence and peer-based log recovery to restore missed messages after failures
  • Built robust registration + reconnection handshakes for both clients and chat servers via Addressing Server coordination
  • Deployed and exercised the system across multiple networks using Docker: ~50 clients, 10 chat servers, and 5 addressing servers
  • Maintained convergence under redundant propagation using unique IDs + deduplication and P-RAM ordering via vector timestamps

What I Learned

  • Correct recovery paths (re-register, resync logs, resend buffered messages) dominate overall system complexity
  • Event-driven NIO scales well, but forces explicit state machines for partial reads/writes and connection lifecycle
  • Replication correctness needs redundancy (gossip) plus guardrails (dedupe, ordering guarantees, and retry/recovery)

Tradeoffs Considered

  • Chose P-RAM (per-sender ordering) over total ordering to avoid heavy coordination and improve availability
  • Accepted redundant gossip traffic to simplify convergence and improve fault tolerance
  • Optimized for recovery + correctness rather than minimal bandwidth or lowest-latency delivery