Distributed Systems · Fault Tolerance · Event-Driven Architecture · Java NIO · Non-Blocking I/O · Gossip Replication · Vector Timestamps · P-RAM Consistency · Deduplication · Recovery Protocols · Docker

Distributed Fault-Tolerant IRC-Style Chat System

A distributed IRC-style chat system with gossip-based replication, fault-tolerant client/server recovery, and load-aware routing via replicated Addressing Servers. Built on an event-driven Java NIO architecture for scalable, non-blocking concurrency.

Overview

A distributed IRC-style chat system designed around real distributed-systems concerns: membership/routing, replication, message ordering guarantees, and recovery under failure. Clients connect through an Addressing Server cluster to discover and register with Chat Servers. Chat Servers operate as peers (no central chat authority), persist chat logs to disk, and replicate new messages using configurable gossip fanout (forwarding to the N nearest peers). Servers recover state by requesting chat history from peers, ensuring missed messages are re-fetched after crashes/partitions. The networking layer is built with Java NIO (selectors/channels) to multiplex many client + peer connections without per-connection threads.

Built to deliver end-to-end distributed reliability: correct message propagation under reordering and duplication, robust reconnection flows for both clients and servers, and practical failure detection and recovery, all while remaining scalable on an event-driven runtime.

Your Role

What I Built

  • Chat Server: peer-to-peer replication, disk-backed chat log, and non-blocking client + server networking
  • Client: registration, chat UX/commands, reconnect + resend, and history synchronization
  • Protocol foundation: message/data types and JSON schemas used across the system (including shared types used by the Addressing Server)
  • Registration + reconnection handshake flows between Client↔Addressing Server and Chat Server↔Addressing Server
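The shared protocol foundation above can be sketched as a message envelope carried over the TCP + JSON transport. The field names (`type`, `messageId`, `sender`, `vectorClock`, `payload`) are illustrative assumptions, not the project's actual schema:

```java
// Hypothetical sketch of a shared message envelope for the JSON-over-TCP
// protocol. Field names are assumptions for illustration; the serializer
// is hand-rolled and assumes field values contain no quote characters.
import java.util.Map;
import java.util.UUID;
import java.util.stream.Collectors;

public record Envelope(String type, String messageId, String sender,
                       Map<String, Integer> vectorClock, String payload) {

    // Minimal JSON serialization without external libraries.
    public String toJson() {
        String clock = vectorClock.entrySet().stream()
                .map(e -> "\"" + e.getKey() + "\":" + e.getValue())
                .collect(Collectors.joining(","));
        return "{\"type\":\"" + type + "\",\"messageId\":\"" + messageId
                + "\",\"sender\":\"" + sender + "\",\"vectorClock\":{" + clock
                + "},\"payload\":\"" + payload + "\"}";
    }

    // Convenience factory: fresh UUID gives the unique ID used for dedup.
    public static Envelope chat(String sender, Map<String, Integer> clock, String text) {
        return new Envelope("CHAT", UUID.randomUUID().toString(), sender, clock, text);
    }
}
```

A single envelope shape across Client, Chat Server, and Addressing Server keeps parsing/dispatch code uniform on all three components.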

What I Owned End-to-End

  • Chat Server runtime (Java NIO selector loop, connection state, parsing, routing, and write-queue behavior)
  • Client reliability: bootstrap/redirect, reconnect handshake, resend buffering, and history refresh after failover
  • Chat Server registration + re-registration handshake (initial join + recovery) with Addressing Servers
  • Core protocol/data model design (message envelopes, command/object types, payload shapes, and lifecycle semantics)
  • Replication behavior: gossip fanout (configurable N-nearest) + state transfer via peer chat-log requests
  • Deduplication strategy based on unique message IDs (ignore duplicates on retrieval/ingest)
  • Ordering guarantee: P-RAM consistency using vector timestamps for partial ordering and convergence
  • Heartbeat + zombie detection to avoid lingering dead peers and trigger recovery workflows
  • Integration code connecting Addressing Server coordination with client/server networking flows
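The deduplication strategy above reduces to a set of already-seen message IDs consulted on every ingest path (gossip, chat-log retrieval, client resend). A minimal sketch, assuming each message carries a globally unique ID:

```java
// Sketch of ID-based deduplication on ingest: the first delivery wins,
// redundant copies arriving via other gossip paths are dropped.
import java.util.HashSet;
import java.util.Set;

public class Deduplicator {
    private final Set<String> seen = new HashSet<>();

    /** Returns true if the message is new and should be applied/forwarded. */
    public boolean ingest(String messageId) {
        return seen.add(messageId); // Set.add() returns false for duplicates
    }
}
```

Because gossip deliberately sends each message along multiple paths, this check is what keeps redundancy from turning into duplicate delivery.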

Technical Highlights

Architecture Decisions

  • Two-layer system: Addressing Server cluster for routing/membership + peer Chat Server network for replicated message plane
  • Event-driven, selector-based Java NIO architecture (non-blocking reads/writes; no per-connection threads)
  • Disk-backed chat history on Chat Servers (append + reload), with recovery by requesting peer logs to re-fill gaps
  • Explicit connection context/state machines for clients, peer servers, and addressing-server coordination
  • Dockerized deployment: separate images for client, chat server, and addressing server with env-driven configuration

Algorithms / Protocols / Constraints

  • Gossip-based replication with configurable fanout: forward each new message to the N nearest peers (tunable N)
  • State transfer: REQUEST_CHATLOG / RESPONSE_CHATLOG to synchronize logs on join/rejoin and fill missed messages
  • P-RAM consistency via vector timestamps (preserve per-sender order; allow concurrent interleavings)
  • Unique message IDs + deduplication on ingest to tolerate redundant propagation paths
  • Heartbeat PING/PONG between chat servers with missed-response thresholds for failure detection

Failure Handling

  • Client failover: Addressing Server-assisted redirection to a healthy Chat Server after server failure
  • Client-side buffering + resend to avoid message loss across disconnects
  • Server recovery: re-register + request chat log from peers to reconstruct state after crash/partition
  • Zombie detection: periodic health checks so clients/servers don’t remain attached to dead peers
  • Peer isolation: stop sending to peers that miss heartbeats; reconnect workflows restore membership
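The heartbeat-driven zombie detection above amounts to a per-peer miss counter: each PING round without a PONG increments it, and crossing a threshold marks the peer dead and triggers recovery (drop from the gossip set, redirect its clients). A sketch, with the threshold value as an illustrative assumption:

```java
// Sketch of PING/PONG failure detection with a missed-response threshold.
// maxMissed is a tunable assumption, not the project's actual setting.
import java.util.HashMap;
import java.util.Map;

public class FailureDetector {
    private final int maxMissed;
    private final Map<String, Integer> missed = new HashMap<>();

    public FailureDetector(int maxMissed) { this.maxMissed = maxMissed; }

    /** Called once per PING round for each known peer. */
    public void onPingSent(String peer) { missed.merge(peer, 1, Integer::sum); }

    /** A PONG arrived: the peer is alive, reset its counter. */
    public void onPongReceived(String peer) { missed.put(peer, 0); }

    /** Past the threshold the peer is treated as dead (a "zombie"). */
    public boolean isDead(String peer) {
        return missed.getOrDefault(peer, 0) >= maxMissed;
    }
}
```

Resetting on every PONG rather than decrementing keeps a flaky-but-alive peer from being declared dead by sporadic losses.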

Optimization Strategies

  • Non-blocking multiplexed I/O (selectors) to scale with concurrent clients + peer connections
  • Write-queue style non-blocking sends to avoid stalling the event loop on slow sockets
  • Tunable gossip fanout to trade bandwidth for faster convergence
  • Separation of protocol parsing/dispatch from low-level I/O readiness handling
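The write-queue pattern above can be sketched as a per-connection deque of buffers drained only as far as the socket accepts bytes; a generic `WritableByteChannel` stands in for the non-blocking `SocketChannel` so the example is self-contained:

```java
// Sketch of non-blocking write queuing: outbound data is buffered per
// connection so a slow peer never stalls the event loop. In the real
// loop, flush() returning false means keep OP_WRITE interest registered.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;
import java.util.ArrayDeque;
import java.util.Deque;

public class WriteQueue {
    private final Deque<ByteBuffer> pending = new ArrayDeque<>();

    public void enqueue(byte[] data) { pending.add(ByteBuffer.wrap(data)); }

    /** Write as much as the channel accepts; true once fully drained. */
    public boolean flush(WritableByteChannel ch) throws IOException {
        while (!pending.isEmpty()) {
            ByteBuffer head = pending.peek();
            ch.write(head);
            if (head.hasRemaining()) return false; // socket full; retry later
            pending.poll();
        }
        return true;
    }
}
```

The event loop toggles `OP_WRITE` based on the return value, so writable-readiness wakeups only occur while there is actually queued data.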

Tech Stack

Java · Java NIO (Selectors, SocketChannel) · TCP + JSON Protocol · Event-Driven Concurrency · Vector Timestamps · Gossip Replication · Docker + Docker Compose · Bash Scripts + Env-based Configuration

Results / Learnings

What Worked

  • Implemented disk-backed message persistence and peer-based log recovery to restore missed messages after failures
  • Built robust registration + reconnection handshakes for both clients and chat servers via Addressing Server coordination
  • Deployed and exercised the system across multiple networks using Docker: ~50 clients, 10 chat servers, and 5 addressing servers
  • Maintained convergence under redundant propagation using unique IDs + deduplication and P-RAM ordering via vector timestamps

What I Learned

  • Correct recovery paths (re-register, resync logs, resend buffered messages) dominate overall system complexity
  • Event-driven NIO scales well, but forces explicit state machines for partial reads/writes and connection lifecycle
  • Replication correctness needs redundancy (gossip) plus guardrails (dedupe, ordering guarantees, and retry/recovery)

Tradeoffs Considered

  • Chose P-RAM (per-sender ordering) over total ordering to avoid heavy coordination and improve availability
  • Accepted redundant gossip traffic to simplify convergence and improve fault tolerance
  • Optimized for recovery + correctness rather than minimal bandwidth or lowest-latency delivery