Software Development Engineering Cheat Sheet

1. C1: System Context Design

1.1. Overview

System design is a series of decisions optimising for:

  • Profit: Functionality, Scalability, Speed
  • Loss: Security, Robustness, Observability, Cost
  1. [Functionality] Does the system work?
  2. [Compliance] Is the system compliant?
  3. [Security] Is the system secure?
  4. [Robustness] How does the system handle failures?
  5. [Scalability] Does the system scale?
  6. [Speed] What are the bottlenecks?
  7. [Observability] Is the system observable?
  8. [Cost] Is there anything we can simplify to decrease cost?

1.2. Functionality

1.2.1. Types of Development

  • Web
    • Frontend
    • Backend
  • Mobile
  • Game
  • Desktop
  • Embedded
  • DevOps
  • Data
  • ML / AI
  • Security

1.2.2. CAP Theorem

| Characteristic | Description | Use Case |
| --- | --- | --- |
| Consistency | All nodes in the system see the same data at the same time | Usually preferred for financial systems |
| Availability | System remains operational even if some nodes fail | Usually preferred for social media / streaming apps |
| Partition Tolerance | System remains operational even if network communication with some nodes fails | Non-optional because networks are not reliable, so the tradeoff is usually between C and A |

1.2.3. Duplexity: Initiation of communication + Sending of data + Concurrency

| Duplexity | Who can initiate communication? | Who can send data? | Can both send at the same time? | Example | Use Case |
| --- | --- | --- | --- | --- | --- |
| Simplex | One side only | One side only | N/A | Webhooks | Event notifications |
| Half-duplex | Typically one side at a time (often client-first) | Both sides | No | HTTP | APIs |
| Full-duplex | Both sides | Both sides | Yes | WebSocket | Chat, collaboration |

1.2.4. Message Distribution / Fanout Patterns

Key question: For one event,

  • who should receive it,
    • this determines the fanout approach
  • how many recipients at peak
    • if >1000, avoid broadcast and look into group fanout strategies

| Fanout Pattern | Description | Use Case | Typical Implementation | Example Schema |
| --- | --- | --- | --- | --- |
| 1:1 (simple fanout/targeted routing) | One sender → one recipient | Notifications, replies | Connection registry + routing | |
| 1:many (group/complex) | One sender → defined subset | Chat rooms, teams | Group registry or pub/sub topics | |
| 1:all (simple fanout/broadcast) | One sender → everyone | Live feeds, system events | Pub/sub broadcast | |

| Group Fanout Pattern | Description | Example | Use Case | Typical Implementation | Registry Shape |
| --- | --- | --- | --- | --- | --- |
| Conditional | Send to subset matching a condition | All users with role=admin | Admin alerts | Attribute filters | attr → [connectionId] |
| Hierarchical | Topics arranged in a tree | region.eu.fr.paris | Geo updates | Topic hierarchy | topicPath → [connectionId] |
| Partitioned | Split audience for scale/performance | Shard users across partitions | Massive scale | Hashing / partitions | shardId → [connectionId] |

| Fanout Implementation | Advantages | Disadvantages | Use Case |
| --- | --- | --- | --- |
| In-memory | Fast, simple | Single node only | MVPs, single-node WS |
| Registry | Scales, targeted delivery | Registry complexity | Chat, notifications |
| Pubsub Broker | Massive fanout, decoupled | Wasteful for 1:1, infra cost | Feeds, live updates |
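
The hierarchical registry shape (topicPath → [connectionId]) can be sketched as follows. This is a minimal in-memory illustration, not a production design: `TopicRegistry`, `subscribe` and `recipients` are hypothetical names, and the rule that a subscriber to a parent topic also receives events for its child topics is an assumption about how the hierarchy is meant to work.

```python
from collections import defaultdict


class TopicRegistry:
    """In-memory registry mapping a dotted topic path to connection IDs."""

    def __init__(self):
        self._subscribers = defaultdict(set)  # topicPath -> {connectionId}

    def subscribe(self, topic_path, connection_id):
        self._subscribers[topic_path].add(connection_id)

    def recipients(self, topic_path):
        """Collect subscribers of the topic and of every ancestor topic."""
        parts = topic_path.split(".")
        found = set()
        for depth in range(1, len(parts) + 1):
            prefix = ".".join(parts[:depth])
            found |= self._subscribers.get(prefix, set())
        return sorted(found)


registry = TopicRegistry()
registry.subscribe("region.eu", "conn-1")
registry.subscribe("region.eu.fr.paris", "conn-2")
print(registry.recipients("region.eu.fr.paris"))
```

A real deployment would usually keep this mapping in a shared store so that every gateway instance sees the same subscriptions.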

1.3. Compliance

1.3.1. MiFID

The Markets in Financial Instruments Directive (MiFID) is an EU compliance standard for financial systems.

Key requirements for system clocks:

  • Be synchronised to UTC
  • Stay within strict accuracy limits (e.g. milliseconds or microseconds depending on system)

Common implementation: Network Time Protocol (NTP)

  • It keeps a machine’s clock accurate and synced with official time servers
  • Prevents clock drift (when a system slowly becomes a few seconds/minutes off)
  • Used everywhere in infrastructure (servers, containers, databases)
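
How a client estimates its clock error from a single NTP-style exchange can be sketched with the standard four-timestamp calculation. The function name and the sample timestamps below are illustrative assumptions; a real deployment would rely on an NTP daemon rather than hand-rolled code.

```python
def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one exchange.

    t0: client clock when the request is sent
    t1: server clock when the request is received
    t2: server clock when the reply is sent
    t3: client clock when the reply is received
    """
    offset = ((t1 - t0) + (t2 - t3)) / 2  # how far the client clock is behind the server
    delay = (t3 - t0) - (t2 - t1)         # network round trip, excluding server processing
    return offset, delay


# Hypothetical exchange: client clock is 0.5 s behind, 0.1 s latency each way.
offset, delay = ntp_offset_and_delay(100.0, 100.6, 100.62, 100.22)
print(round(offset, 3), round(delay, 3))
```

The offset estimate assumes the network path is symmetric; asymmetric latency shows up directly as offset error.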

1.3.2. PCI DSS

PCI DSS (Payment Card Industry Data Security Standard) is a security compliance standard governed by major card brands (Visa, Mastercard, Amex) relevant to credit/debit card data

  1. Prefer using payment providers (e.g. Stripe) to avoid handling card data
  2. If unavoidable:
    1. Never store sensitive authentication data (CVV, PIN, track data)
    2. Isolate the Card Data Environment (CDE) via network segmentation
    3. Encrypt cardholder data in transit and at rest
    4. Strict access control + audit logging for in-scope systems

1.4. Security

1.5. Robustness

1.6. Scalability

1.7. Speed

1.8. Observability

1.9. Cost

1.9.1. Infrastructure as a Service (IaaS) vs Platform as a Service (PaaS)

| Approach | Use Case | Adv. | Disadv. |
| --- | --- | --- | --- |
| IaaS | Large-scale / custom apps | Flexibility, Pay-as-you-go | Setup & maintenance, steeper learning curve |
| PaaS | MVPs | Faster dev + CI/CD + easy deployment & scaling + security out of the box | Vendor lock-in, less flexibility |

1.9.2. Tenancy

A tenant is a customer/organisation space with its own users, data, config

| | Single-tenant | Multi-tenant |
| --- | --- | --- |
| Definition | One tenant per isolated stack | Multiple tenants per stack |
| Isolation | Strong | Weak |
| Per-tenant customisation | Easy | Harder |
| OpEx | Higher | Lower |
| Scale | Worse (under-utilised) | Better (pooling) |
| Compliance / Data residency | Easier | Harder (needs partitioning) |
| Onboarding Speed | Slower | Faster |

2. C2: Container Design

2.1. Functionality

2.1.1. Typical Cloud Infrastructure

| Layer | Component | Description | Use Case | E.g. |
| --- | --- | --- | --- | --- |
| Edge | DNS | Resolves domain name | | AWS Route53, GCP DNS |
| Gateway | Gateway | Routing to different services, security, protocol translation (HTTP to gRPC, REST to GraphQL) | Typically used to route to serverless, not usually needed for servers + ALB | AWS API Gateway |
| Compute | Servers | Physical servers (bare metal) | Ultra-low latency, specialized hardware (GPU/FPGA), compliance, predictable performance | AWS Bare Metal (EC2 metal), on-prem |
| Compute | VMs | Long-running VM that provides environment configuration | Steady high throughput, long-lived connections, heavy local state, custom networking, predictable workloads, higher memory/CPU/GPU, strict latency floors | AWS EC2 |
| Compute | Serverless (Container Runtimes) | Fully managed container environments | Long-running microservices, multi-language, complex dependencies | AWS Fargate |
| Compute | Serverless Functions | Event-driven functions | Spiky demand, MVP: Pay-per-use makes it more suitable, short, event-driven logic | AWS Lambda |
| Relational Data Store | Relational DBs | | Scale-up: Easier to manage schema as compared to Document DBs | AWS RDS, AWS Aurora (Serverless RDS) |
| Not Only Relational Data Store | Document DBs | | MVP: Pay-per-use | DynamoDB |
| Not Only Relational Data Store | KV DBs / Caches | | | AWS ElastiCache (Redis) |
| Object Data Store | Object Storage | | Storing large immutable objects | AWS S3 |
| File Data Store | File Storage | | Share mutable files within private network | AWS EFS |
| | DSQL | Distributed SQL Query Engine | Query large-scale data across object storage / data lake with SQL | AWS Athena |
| | Data Lake | Centralised storage for raw data | Analytics, ML workloads, batch processing | AWS Lake Formation, Iceberg on S3 |
| Messaging | PubSub Broker | Fanout messages to subscribers | Live updates, events | SNS, Redis PubSub |
| Messaging | Message Broker / Message Queue | One message → one consumer | Async jobs, retries | SQS, RabbitMQ |
| Messaging | Stream / Event Queue | Durable ordered event log | Event sourcing, analytics | Kafka, Kinesis |
| Integrations | Email Delivery (ESP) | Sends emails via API/SMTP | Password resets/receipts, notifications | AWS SES, SendGrid, Mailgun, Postmark |

2.1.2. Options for storing data

Storage = Abstraction + Deployment

| Storage Abstraction | Stores data as | Data is organised by | Partial updates |
| --- | --- | --- | --- |
| Block | Fixed-size blocks (address + bits) | Application / file system | Yes (block-level) |
| File | Files (path + blob + metadata) | File system | Yes |
| Object | Objects (key + blob + metadata) | Object storage system | No (rewrite required) |

| Deployment | Storage Abstraction | E.g. | Adv (vs others at same level) | Disadv (vs others at same level) | Use Case |
| --- | --- | --- | --- | --- | --- |
| Local | Block Storage | ?? | Lowest latency, full control | High complexity | I am building a DB or FS |
| Local | File System | ext4, APFS, NTFS | No implementation needed, mutability | Less control than block | |
| Local | Object Storage | MinIO local | | No partial updates, higher overhead | |

2.1.3. Options for structuring data

| Paradigm | Examples | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| SQL | PostgreSQL, MySQL, MSSQL | Structured relationships + strong consistency e.g. financial data | Powerful Querying + ACID | Slower writes due to B-Trees, slower reads/writes due to stronger consistency/locks |
| Key-Value | RocksDB, DynamoDB, Cassandra | High-throughput writes, caching | Extremely fast writes + BASE | Slower reads due to LSM trees (LSMT) |
| Document | MongoDB, Firestore | Semi-structured JSON-like data, e.g. mobile/web apps | Flexible schema + BASE | Slower reads on LSMT-backed engines |
| Columnar | Cassandra | Time series data, e.g. analytics, event logging | Fast on columnar queries, aggregations | Slower reads due to LSMT |
| Graph | Neo4j | Social graphs, recommendation engines | Optimised for graph traversal and relationship modeling | Limited for heavy aggregations |
| Graph | TypeDB | Complex knowledge graphs, strongly typed and structured relationships | | Small ecosystem |

2.1.4. File Transfer Options

| Approach | Adv | Disadv | Typical Use Case |
| --- | --- | --- | --- |
| Direct Host-to-Host (SSH over TCP) | Simple setup, built into most servers, strong authentication, good CLI tooling | Slight protocol overhead, less optimized for massive distribution | Dev → server transfers, admin workflows, secure server-to-server transfers |
| Client-server based (TCP with TLS) | Internet-standard security model, firewall-friendly, highly scalable infrastructure | More setup (certificates, servers), less integrated with shell workflows | Cloud uploads/downloads, APIs, large-scale file distribution |
| Peer-to-Peer | Parallel downloads from many peers, scales with number of participants | More complex coordination, uncommon for private 1-to-1 transfers | Large-scale distributed file sharing |

| H2H Protocol | Transport | Resume Support | Adv | Disadv | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| SFTP | SSH over TCP | Requires setup | Simple setup, single port, strong authentication model | Slight protocol overhead | Secure server-to-server or dev-to-server transfers |
| SCP | SSH over TCP | No | Very simple, minimal setup | No resume support, limited features | Quick ad-hoc file copy |
| rsync (over SSH) | SSH over TCP | Built-in | Efficient delta transfer, resume support | Slightly more complex usage | Large file sync, repeated transfers |

| Client-server Protocol | Transport | Resume Support | Adv | Disadv | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| FTP / FTPS | TCP / with TLS | Requires setup | Widely supported, simple client/server model | Legacy design, complex firewall/NAT handling | Legacy enterprise integrations |
| HTTP / HTTPS | TCP / with TLS | Requires setup | Firewall-friendly, scalable, cloud-native | Stateless, whole-object transfers | Cloud uploads/downloads (e.g. object storage) |

| P2P Protocol | Transport | Resume Support | Adv | Disadv | Typical Use Case |
| --- | --- | --- | --- | --- | --- |
| BitTorrent | TCP / µTP over UDP | Built-in | Parallel P2P transfer, scales with peers | More complex setup, uncommon for private transfers | Large-scale distributed file sharing |

2.1.5. Operating Systems

  • Default stack size (main thread)
    • Linux: 8MB
    • macOS: 8MB
    • Windows: 1MB

2.1.6. Transport Models

Model
TCP
UDP
Quick UDP Internet Connections (QUIC)

2.1.7. API Architectural Styles

What API architectural style is optimal for functionality (speed) and cost (DevX, maintenance, opex)?

| Style | Description | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| REST | Perform HTTP verbs on resources. Entity based, e.g. POST /users | Most common | Universally understood + docgen tools e.g. Swagger, OpenAPI | Slowest: one request for each entity unlike GraphQL + less space efficient than RPC |
| GraphQL | Query or mutate entities. Entity based, e.g. mutation CreateUser() {...} | APIs for FE | Faster: one request for multiple entities | More setup e.g. defining the schema, resolvers + less standardised docgen e.g. GraphiQL |
| RPC (e.g. gRPC) | Call functions remotely. Action based, e.g. await client.createUser() | Internal APIs | Fastest and most space efficient because it uses binary instead of text payloads | Typically internal use only, gRPC requires HTTP/2 |
| tRPC | Type-safe RPC framework that auto-generates client and server | TypeScript apps | Can run on HTTP/1 because of text payload | Tied to TypeScript ecosystem + limited language interoperability + difficult to debug |

2.1.8. Transport Protocols

What transport protocol is optimal for functionality (user experience) and cost (DevX, maintenance, opex)?

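| Transport Protocol | Adv | Disadv | Use Case |
| --- | --- | --- | --- |
| HTTP | | | |
| gRPC | | | |
| WS | | | |

What common combinations are there?

| Edge | Core | Use Case |
| --- | --- | --- |
| HTTP | HTTP | Traditional API |
| WS | HTTP | |
| WS | RPC | |
| WS | MQTT | |
| MQTT | MQTT | |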

2.1.9. Scheduler

Scheduler

  • Does
    • Assign task to node
  • Does not
    • Start or manage the workload

Orchestrator

  • Does
    • Scheduling (assigning tasks to nodes)
    • Provisioning and starting workloads on nodes
    • Scaling workloads up/down based on demand
    • Health monitoring and self-healing
    • Rolling updates and rollback management
    • Managing networking, storage and service discovery

2.1.10. WebRTC

If asked: “How would you design WhatsApp voice calls?”

  • Signaling: WebSockets (or SIP for enterprise VoIP).
  • Transport: RTP/SRTP for media.
  • NAT traversal: STUN + TURN fallback.
  • Encryption: SRTP end-to-end.
  • QoS handling: Adaptive bitrate, jitter buffer.

If asked: “How does WebRTC work?”

  • WebRTC = framework, uses:
    • Signaling (custom, often WebSocket)
    • RTP/SRTP for audio/video streams
    • STUN/TURN for NAT traversal
    • DTLS-SRTP for security
    • Adaptive bitrate + codec negotiation

2.1.11. Websockets

2.1.11.1. Single-Node

At the high-level design stage, a single-node WebSocket system can often handle up to ~10k concurrent connections. To maintain a margin of safety, it is reasonable to start thinking about distributed WebSocket systems above ~1k connections; at that point, distribution also brings better fault tolerance and operational robustness.

At a lower level, WebSocket soak-test tools can validate these assumptions by observing system behaviour over time: CPU/memory usage, message latency, connection health (success/lifetime/dropped) and network egress. This identifies which part of the system becomes a bottleneck and needs to be scaled. The goal at this stage is typically to meet an SLO, e.g.:

  • 99.9% of WebSocket messages delivered within 200ms
  • 99% of API requests complete under 500ms
  • < 0.1% connection drops per hour
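
Checking a latency SLO such as "99.9% of messages delivered within 200ms" against soak-test samples can be sketched as below. The function names and the sample data are assumptions for illustration; real measurements would come from whichever soak-test tool is in use.

```python
def fraction_within(latencies_ms, threshold_ms):
    """Share of samples at or under the latency threshold."""
    if not latencies_ms:
        raise ValueError("no samples collected")
    within = sum(1 for latency in latencies_ms if latency <= threshold_ms)
    return within / len(latencies_ms)


def meets_slo(latencies_ms, threshold_ms, target_fraction):
    """True when at least target_fraction of samples meet the threshold."""
    return fraction_within(latencies_ms, threshold_ms) >= target_fraction


# Hypothetical soak-test output: 995 fast deliveries and 5 slow ones.
samples = [50] * 995 + [900] * 5
print(meets_slo(samples, threshold_ms=200, target_fraction=0.999))
```

The same shape works for the connection-drop SLO by counting dropped connections against total connections per hour.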

2.1.12. Pubsub

What types of pubsub brokers are there?

| Pub/Sub Message Delivery Characteristic | Description | Advantages | Disadvantages | Typical Use Case |
| --- | --- | --- | --- | --- |
| At-most-once | Message is delivered zero or one time; no retries | Very low latency; simple; minimal overhead | Messages can be lost silently | Metrics, logs, realtime notifications where loss is acceptable |
| At-least-once | Message is delivered one or more times; retries on failure | Higher reliability; simple retry model | Duplicate messages; consumers must be idempotent | Event propagation, cache invalidation, background jobs |
| Exactly-once (logical) | System ensures message effects occur exactly once (often via deduplication) | Strong correctness guarantees | High complexity; coordination overhead | Financial transactions, billing, inventory updates |
| Best-effort broadcast | Message is pushed to all subscribers with no persistence | Extremely fast; simple fan-out | No durability; subscribers must be online | Realtime websocket fan-out, multiplayer state updates |
| Durable pub/sub | Messages are persisted until acknowledged by subscribers | Survives subscriber crashes | Higher latency; storage cost | Critical event distribution, audit logs |
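
Logical exactly-once on top of at-least-once delivery is often approximated by deduplicating on a message ID in the consumer. The sketch below assumes every message carries a unique ID and keeps seen IDs in memory; `IdempotentConsumer` is a hypothetical name, and a real system would need a durable, bounded store for the seen set.

```python
class IdempotentConsumer:
    """Applies each message's effect once, even if the broker redelivers it."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = set()  # assumption: in-memory; use a durable store in practice

    def receive(self, message_id, payload):
        """Return True if the message was processed, False if it was a duplicate."""
        if message_id in self._seen:
            return False
        self._handler(payload)
        # Mark as seen only after the handler succeeds, so failures can be retried.
        self._seen.add(message_id)
        return True


credits = []
consumer = IdempotentConsumer(credits.append)
consumer.receive("msg-1", 10)
consumer.receive("msg-1", 10)  # redelivery is ignored
consumer.receive("msg-2", 5)
print(sum(credits))
```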

2.2. Compliance

2.3. Security

2.3.1. Auth

The process of designing an auth system involves two categories,

  1. Authentication
  2. Authorisation

and four key questions:

  1. What type of token do we want to use / Where is auth state stored?
  2. Who issues the token?
  3. Who validates the token?
  4. How do we want to maintain auth state over time?

2.3.1.1. What type of token do we want to use / Where is auth state stored?

| Auth Token Type | Auth state stored in | Example Payload | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- | --- |
| Opaque | Server-side | <hash> | No data exposure + Small + easy revocation | Requires lookup | High security systems, traditional web apps, centralised auth |
| Self-contained (e.g. JWT) | Token (Client-side) | Encoded: <header>.<payload>.<signature> (each part base64url-encoded, not hashed) -> Decoded: { sub: ..., iss: ..., exp: ... } | No lookup | Harder to revoke + readable claims | APIs + microservices (e.g. Cognito, Entra) |

| Opaque Token Subtype | Description | Example | Data stored in | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Session | Token represents a server-side session storing user identity/state | Cookie: session_id=abc123 -> App Server -> Data Store: abc123 : {user_id: ...} | Application service | Simple + easy revocation | Tied to one app | Monoliths, SSR |
| Reference | Token is a pointer to auth data stored in auth server (introspection required) | Cookie: reference_token=<hash> -> App Server -> Auth Server | Authentication service | Centralised control + Works across services | Network call | OAuth, APIs, microservices |

| Self-contained Token Subtype | Description | Example Payload | Data stored in | Adv | Disadv | Use Case | Example |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Fat | Identity + full authorisation data | { sub: ..., roles: ..., groups: ..., permissions: ..., resources: ... } | JWT | Stateless + No network calls | Large tokens + Hard to revoke/update | Stable permissions, high-scale microservices | Azure EntraID |
| Thin | Identity only | { sub: ... } | Data store | Small tokens + Always up-to-date | Additional network call | Hierarchical access control, sensitive systems, frequently changing permissions | Custom BE auth |
| Hybrid | Identity + coarse permissions, fine-grained resolved via BE | { sub: ..., roles: ..., scopes:... } | Both | Best of both | More complex logic | Most modern APIs | Cognito |

| JWT Encryption Approach | Description | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| Unencrypted (JWS) | Signed token, payload is readable (encoded, not encrypted) | Simple + fast + widely supported | Visible info | Default |
| Encrypted (JWE) | Token payload is encrypted (not readable without key) | Hides claims | Slower | Sensitive info in token |

2.3.1.2. Who issues the token to the client?

| Who handles login and issues token | Example | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| API Gateway | AWS API Gateway + JWT/Cognito/Lambda Authoriser | Offload authn from BE + fast | Limited flexibility | Simple serverless setup |
| Application Server | Custom login endpoint | Full control | Reinventing the wheel + load on BE | Small systems, complex authn |
| Auth Server | Cognito, Entra, Auth0 | Managed auth | External dep | Best practice |

2.3.1.3. Who validates the client's token?

| Who validates token before request proceeds | Example | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| CDN / Edge | | Lower latency, offloads traffic from downstream, cache authenticated responses | Complex cache invalidation, limited auth logic | Global low-latency requirements, e.g. public content with lightweight auth |
| API Gateway | AWS API Gateway + JWT/Cognito/Lambda Authoriser | Offload authn from BE + fast | Limited flexibility | Serverless / modern APIs / coarse-grained auth |
| Load Balancer | | Centralised | Limited to basic checks (signature, expiration) | Basic authentication before API Gateway / compute |
| Application Server | Middleware | Full control | Higher latency + Reinventing the wheel + load on BE | Custom authn |
| Auth Server | OAuth /introspect | Immediate revocation | Network call | Banking systems, enterprise APIs, zero-trust environments, fine-grained auth / resource-level gating |

2.3.1.4. How do we want to maintain auth state over time?

| Refresh Strategy | Description | Authorisation Token Lifetime | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- | --- |
| No Refresh Token | User re-authenticates after expiry | Short | Simple | Poor UX | Highly secure systems / short-lived tools |
| Refresh Token (Opaque) | Long-lived refresh token is used to obtain a new authz token on expiry | Short | Better UX | Refresh token is high-value target | SPAs / mobile apps / modern APIs |
| Sliding Session | Session expiry extended on each request | Medium | Best UX | Harder to control expiry | Server-rendered apps / monoliths |
| Silent Re-auth (OIDC) | Client re-authenticates via IdP session | Short | No refresh token in client | Complex, depends on IdP | Enterprise SSO, IdP-managed sessions |

2.3.1.5. Auth Implementations

2.3.1.5.1. What common combinations of authn/authz/refresh strategies are there?

| Pattern | Access Token | Refresh Token | Auth Data | AuthZ Data | Validation | Use Case |
| --- | --- | --- | --- | --- | --- | --- |
| Classic Session | Opaque (session ID) | None | Session store | Session store | Backend | Monoliths, SSR |
| OAuth (Opaque) | Opaque | Opaque | Auth server | Auth server | Introspection | High security APIs |
| JWT + Refresh | JWT (short-lived) | Opaque | JWT (sub) | JWT or backend | Gateway/API | Modern SPAs |
| Thin JWT | JWT (identity only) | Opaque | JWT | Backend | Lambda/backend | Complex permissions |
| Hybrid JWT | JWT (roles/scopes) | Opaque | JWT | JWT + backend | Gateway + backend | Most modern systems |
| Fat JWT | JWT (full permissions) | Optional | JWT | JWT | Gateway/API | Stable permissions |

2.3.1.5.2. Which AWS API Gateway Authoriser should we use?

| AWS API Gateway Authoriser Type | Description | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| JWT Authoriser | Validates JWT locally within API Gateway using signature and claims (iss, aud, exp) | Fastest (no network call) | No custom logic + limited to JWT contents | Simple APIs, coarse-grained auth (roles/scopes in JWT) |
| Cognito Authoriser | JWT validation within API Gateway, tightly integrated with Cognito User Pools | Fast + managed auth | No custom logic, limited to Cognito JWT contents | Apps using Cognito (SPAs, mobile apps, standard OAuth/OIDC flows) |
| Lambda Authoriser | Custom Lambda function invoked by API Gateway to evaluate request and return allow/deny policy | Full flexibility (token type, logic etc.) | Slowest (network + compute) + most complex | Thin tokens, hierarchical access, dynamic permissions, custom auth |

2.3.2. Typical Cloud Infrastructure

| Layer | Component | Description | Use Case | E.g. |
| --- | --- | --- | --- | --- |
| Edge | DNS | Resolves domain name | | AWS Route53, GCP DNS |
| Edge | WAF / DDoS Protection | Protect from malicious acts | | AWS WAF |
| Edge | Edge Gateway | Routing/security/transform at the edge | API gateway-lite, auth, rate limiting, header rewrites | AWS API Gateway + Lambdas, Cloudflare Workers as gateway |
| Gateway | Gateway | Routing to different services, security | Typically used to route to serverless, not usually needed for servers + ALB | AWS API Gateway |
| Compute | Servers | Physical servers (bare metal) | Ultra-low latency, specialized hardware (GPU/FPGA), compliance, predictable performance | AWS Bare Metal (EC2 metal), on-prem |
| Compute | VMs | Long-running VM that provides environment configuration | Steady high throughput, long-lived connections, heavy local state, custom networking, predictable workloads, higher memory/CPU/GPU, strict latency floors | AWS EC2 |
| Compute | Serverless (Container Runtimes) | Fully managed container environments | Long-running microservices, multi-language, complex dependencies | AWS Fargate |
| Compute | Serverless Functions | Event-driven functions | Spiky demand, MVP: Pay-per-use makes it more suitable, short, event-driven logic | AWS Lambda |
| Load Balancing | Gateway Load Balancer (GWLB) | Distributes traffic to third-party security/network appliances using TCP/UDP info | | AWS GWLB |
| Networking | VPC | Isolated virtual network for cloud resources | Define public/private subnets, control routing, isolation, multi-tier deployments | AWS VPC |
| Networking | Subnets | Segments inside a VPC | Controls traffic flow and exposure of resources (e.g. public ALB, private DB) | AWS Subnets |
| Networking | Security Groups | Virtual firewalls attached to resources | Control traffic at instance/service level | AWS Security Groups |

2.3.3. Virtual Private Clouds (VPCs)

2.3.3.1. Connections

| Connection Type | Description | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| VPC Endpoint | Privately connect VPC to AWS Services without traversing internet | Lower latency, higher security, lower data transfer costs | | |

2.3.3.2. Subnets

| Subnet Type | Description | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| Public | Has a route to an Internet Gateway (IGW) | Simpler setup + troubleshooting | Less safe | Load balancers, bastion hosts, public APIs |
| Private | Has no direct route to IGW. Outbound internet goes via NAT Gateway / egress-only IGW / VPN / Direct Connect | Safer | Higher cost (needs NAT + proxy, trickier troubleshooting) | Databases, app servers, microservices, caches/queues, internal ALB/NLB targets, analytics workers |

| VPC Endpoint Type | Description | Adv | Disadv | Use Case |
| --- | --- | --- | --- | --- |
| Interface | Creates an Elastic Network Interface (ENI) in your subnet with a private IP | | | SSM, Secrets Manager, CloudWatch |
| Gateway | Route table entries that direct traffic to S3 / DynamoDB | | | |

Port Forwarding: Forwarding of information from a router's port to a port on a device on its subnet

Tailscale Protocol: Connecting directly to a device's port when it is already on a subnet

2.3.4. Firewalls

| Type | E.g. | Layer | Found In | Checks | Use Case |
| --- | --- | --- | --- | --- | --- |
| Web-Application (WAF) | AWS WAF | Application | CDNs, gateways, load balancers | Examines HTTP payload for attack detection | Web app / API protection against SQLi, XSS, bots, malicious patterns |
| Proxy | Nginx reverse proxy | Application | Proxy servers, gateways | Examines payload for access control and anonymisation | |
| Packet Filtering | iptables (basic rules) | Network & Transport | Routers | Examines packets based on source/destination IP, port, protocol | Simple allow/deny rules, port blocking |
| Host-Based | Windows Firewall, iptables | Network & Transport | Individual servers / VMs | Examines traffic per host | Protects single servers, last line of defense |

2.4. Robustness

If asked: “How does VoLTE differ from WhatsApp?”

  • VoLTE → Managed SIP + RTP inside carrier network, guaranteed QoS, low jitter.
  • WhatsApp → WebRTC over the public Internet, no QoS guarantees.

2.4.1. Typical Cloud Infrastructure

| Layer | Component | Description | Use Case | E.g. |
| --- | --- | --- | --- | --- |

2.4.2. How do we ensure reliable message processing and delivery over time?

| Strategy | Description | Advantages | Disadvantages | Typical Use Case |
| --- | --- | --- | --- | --- |
| Queue / Stream keyed by client | Messages placed in per-client or keyed queue, instance owning WS consumes and delivers | Durable, pull-based backpressure, supports retries/replays/offline delivery (because messages stay in log while client is offline) | Higher latency enqueueing/dequeueing than simple push with pubsub, ownership/rebalancing complexity, not true push (message is delivered when consumer polls, not at production time) | Systems needing durability, offline delivery, or replay |
| Client pull / reconnect catch-up | Client fetches pending messages from shared store on poll or reconnect | Extremely resilient; minimal server coupling | Higher latency; weaker real-time guarantees | Notifications, feeds, async workflows |

2.4.3. Websockets

| Connection Type | Timeout |
| --- | --- |
| Client x API Gateway WebSocket Connection | 2hrs |
| API Gateway x Lambda Integration | 29s |

| WebSocket Issue | Scenario | Mitigation |
| --- | --- | --- |
| Reconnect storms | | Backoff + jitter in the client |
| Gateway crash | | Client reconnect typically handled by client ws library |
| Stale connection registry | | TTL + Heartbeat allows stale data to be cleared from the registry |
| Message loss | | |
| Memory leaks | | |
| Slow clients | | |
2.4.4. Caching

| Cache Stampede Management Strategy | Description | Adv. | Disadv. | Use Case |
| --- | --- | --- | --- | --- |
| Warmup | | | | |
| Prefill | | | | |

2.4.5. Distributed Websockets

2.4.5.1. Issues and Mitigations

| Issue | Scenario | Mitigation |
| --- | --- | --- |
| Reconnect storms | Many clients reconnect after outage | Exponential backoff + jitter |
| Gateway crash | All connections on node drop | Client reconnect + registry TTL |
| Stale registry | Registry points to dead gateway | Heartbeats + expiry |
| Message duplication | Retries cause duplicates | Idempotency |
| Backpressure | Gateways send messages faster than client can read, causing buffers/memory to explode | Stop sending temporarily / drop messages / disconnect clients |
| Connection exhaustion | Too many open sockets | Connection limits |
| Message loss | Crash during send | ACKs, retries, durable queues |
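
The reconnect-storm mitigation, exponential backoff with jitter, can be sketched as below. This uses the "full jitter" variant (a uniform random delay between zero and the exponential ceiling), which is one common choice rather than the only one; the function name and parameter values are assumptions.

```python
import random


def backoff_delays(base_s, cap_s, attempts, rng=None):
    """Return one randomised reconnect delay (in seconds) per attempt.

    The ceiling doubles each attempt up to cap_s; the actual delay is drawn
    uniformly from [0, ceiling] so clients do not reconnect in lockstep.
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


# Hypothetical client settings: 0.5 s base, 30 s cap, 6 attempts.
for attempt, delay in enumerate(backoff_delays(0.5, 30.0, 6)):
    print(f"attempt {attempt}: wait {delay:.2f}s before reconnecting")
```

Without the random component, every client dropped by the same outage would retry at the same instants and recreate the storm on each attempt.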

2.4.6. Adaptive Performance Strategies

| Strategy | Description | Layer | Use Cases |
| --- | --- | --- | --- |
| Jitter Buffer | Temporary storage in receiver's app that smooths out variations in packet arrival times before playback | Application | Jitter |
| Bitrate | Bitrate Reduction + ... | | |
| Bitrate Reduction | Reducing the rate at which data is encoded and sent | Application | Packet Loss |

2.5. Scalability

2.5.1. Overview

| Type | Principle | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| Vertical | Upgrading CPU/RAM/Storage | Small to medium apps, monolithic systems, startups | No code change + lower latency | Limited by hardware ceilings + expensive at scale + SPOF |
| Horizontal | Adding more servers | Distributed systems | Fault tolerance via redundancy + near-unlimited scalability | Network latency + Higher complexity |

Types of horizontal scaling:

  1. Database Horizontal Scaling
  2. Compute Horizontal Scaling

Database Horizontal Scaling, i.e. sharding

| Type | Principle | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| Directory/Lookup-based | The shard a record belongs to is determined by a manually maintained directory | Frequently changing shards / manual control | Easy to add / remove shards | Directory is a SPOF, lookup adds latency |
| Range-based | The shard a record belongs to is determined by the contiguous key range its key falls in (e.g. A-F, G-L, ...) | Time-series data, ordered data, range queries | Efficient for range queries + simple to implement | Data skew possible, hotspot risk |
| Hash-based | The shard a record belongs to is determined by the hash of its key | High-write, evenly distributed workloads | Good load balancing, no need to manage ranges | Range queries inefficient, rebalancing expensive |
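
The three shard-selection principles can be compared side by side in a short sketch. All function names, shard counts and boundaries are illustrative assumptions; the hash variant uses plain modulo, which is why adding a shard forces most keys to move.

```python
import bisect
import hashlib


def directory_shard(key: str, directory: dict) -> int:
    """Directory-based: an explicit, manually maintained key -> shard map."""
    return directory[key]


def range_shard(key: str, upper_bounds: list) -> int:
    """Range-based: upper_bounds is a sorted list of exclusive upper bounds.

    With bounds ["G", "M", "S"], shard 0 holds keys below "G", shard 1 holds
    "G" up to (not including) "M", and the last shard holds everything else.
    """
    return bisect.bisect_right(upper_bounds, key)


def hash_shard(key: str, num_shards: int) -> int:
    """Hash-based: a stable hash of the key, reduced modulo the shard count."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


print(directory_shard("tenant-a", {"tenant-a": 2}))
print(range_shard("Hannah", ["G", "M", "S"]))
print(hash_shard("user-42", 4))
```

A stable hash such as SHA-256 is used instead of Python's built-in `hash`, which is randomised per process for strings and would route the same key differently on each node.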

Compute Horizontal Scaling

| Type | Principle | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| Centralised Load Balancing / Orchestrator-based Scheduling | Requests are routed by a load balancer or scheduler | Workloads are heterogeneous, resource usage unpredictable, fine-grained control over task placement | Assign request based on compute needs + Easy to add/remove nodes + Supports complex scheduling policies | Orchestrator / scheduler is SPOF + can be bottleneck |
| Static Partitioning | Requests are routed based on predefined ranges or affinity rules, e.g. ID range, location | Tasks are grouped logically | Low latency as no lookup is needed | Hotspots + manual rebalancing + difficult to add/remove nodes |
| Consistent Hashing | Requests are routed based on hash of request key | Stateless workloads, e.g. microservices, serverless, API gateways | Automatic load balancing + no central load balancer needed | Range-based tasks difficult + partial rebalancing required when nodes are added/removed |
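
A consistent-hash ring can be sketched in a few lines. The class below is a minimal illustration, not a production router: `HashRing`, the node names and the choice of 100 virtual nodes per server are all assumptions.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a stable position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted list of (ring position, node name)
        for node in nodes:
            self.add(node)

    def add(self, node):
        # Several virtual nodes per server smooth out the key distribution.
        for i in range(self._vnodes):
            bisect.insort(self._points, (_point(f"{node}#{i}"), node))

    def remove(self, node):
        self._points = [(p, n) for p, n in self._points if n != node]

    def node_for(self, key):
        """Route to the first node at or after the key's ring position."""
        if not self._points:
            raise LookupError("ring is empty")
        idx = bisect.bisect_left(self._points, (_point(key), ""))
        return self._points[idx % len(self._points)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("request-123"))
ring.remove("node-c")  # only keys that were on node-c get a new owner
print(ring.node_for("request-123"))
```

The property that matters for scaling is that removing or adding a node only remaps the keys adjacent to that node's points, unlike modulo hashing where most keys move.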

2.5.2. Rules of Thumb

| Scaling Best Practice | Description | Reason | Exception |
| --- | --- | --- | --- |
| Stateless Compute | Keep biz logic compute stateless | Any instance can serve any request, add more instances to scale, replace instance in failure, easy load balancing | |
| WS on the edge, HTTP/RPC at the core | | HTTP/RPC are stateless, i.e. providing easier retries + load balancing + observability + timeouts | |
| Idempotency | Repeating an operation has the same effect as doing it once | | |
| Pull-based backpressure is typically more forgiving than push-based | | | |

2.5.3. Typical Cloud Infrastructure

| Layer | Component | Description | Use Case | E.g. |
| --- | --- | --- | --- | --- |
| Orchestration | Server Orchestration | Scales VMs | | AWS Auto Scaling + EC2 ASG |
| Orchestration | Container Orchestration | Scales containers | | ECS/EKS |
| Object Data Store | Object Storage | | Storing large immutable objects | AWS S3 |
| Messaging | PubSub Broker | Fanout messages to subscribers | Live updates, events | SNS, Redis PubSub |
| Messaging | Message Broker / Message Queue | One message → one consumer | Async jobs, retries | SQS, RabbitMQ |
| Messaging | Stream / Event Queue | Durable ordered event log | Event sourcing, analytics | Kafka, Kinesis |
| Load balancing | Application Load Balancer (ALB) | Distributes traffic to apps using HTTP info | | AWS ALB |
| Load balancing | Network Load Balancer (NLB) | Distributes traffic to apps using TCP/UDP info | | AWS NLB |
| Load balancing | Gateway Load Balancer (GWLB) | Distributes traffic to third-party security/network appliances using TCP/UDP info | | AWS GWLB |
| Load balancing | Global Load Balancer (GLB) | Distributes traffic geographically | | AWS ELB |

2.5.4. Options for storing data

| Deployment | Storage Abstraction | E.g. | Adv (vs others at same level) | Disadv (vs others at same level) | Use Case |
| --- | --- | --- | --- | --- | --- |
| Network | Block Storage | AWS EBS | | Higher complexity | |
| Network | File System | AWS EFS | Mutability | Metadata bottlenecks, locking complexity | Internally shared files |
| Network | Object Storage | MinIO | Scalable reads | No partial updates, higher latency | Internal services |
| Distributed | Block Storage | Ceph RBD | | Highest complexity | I am building a distributed DB |
| Distributed | File System | AWS FSx for Lustre | | Metadata bottlenecks, locking complexity | I am |
| Distributed | Object Storage | AWS S3 | Scalable, minimal coordination | | I am building a data lake / serving media to thousands of people |

2.5.5. Strategies to serve SPAs

| Hosting Strategy | Description | Use Case | Adv | Disadv |
| --- | --- | --- | --- | --- |
| Blob Storage Service (e.g. S3) | SPA is stored in blob storage and exposed via a public URL | Low-traffic / internal | Simple and cheap | No edge caching and higher global latency |
| Blob Storage Service + Content Delivery Network (e.g. S3 + CloudFront) | SPA is stored in blob storage, CDN caches assets at edge locations | Public production SPAs | Fast global delivery | Extra setup |
| Virtual Machine Hosting (e.g. EC2 + nginx) | SPA is stored in a virtual machine and exposed via a port | Existing monolith | Flexible configuration | VM management |

2.5.6. Container Orchestration

2.5.6.1. Orchestration Framework

| Container Orchestration Framework | Description | Use Case |
| --- | --- | --- |
| Kubernetes | Open-source container orchestration framework | |
| OpenShift | Kubernetes with batteries included | |
| ECS / ... | | |

2.5.6.2. Platform

| Container Orchestration Platform | Kubernetes? | Description | Use Case |
| --- | --- | --- | --- |
| ECS | | AWS managed orchestration service | Minimal setup |
| EKS | | AWS hosted open-source orchestration framework | Flexibility |
| OpenShift | | | |

2.5.7. Distributed Websockets

MechanismWhat it storesStrengthWeaknessUse Case
Connection registryconnectionId → gatewayIdPrecise routingNeeds cleanup1:1 messaging
Group registrygroupId → [connectionId]Controlled fanoutLarge groups expensiveChat rooms
Pub/sub brokertopic → subscribersMassive fanoutCoarse routingBroadcast feeds
Distributed Websocket ApproachDescriptionAdvDisadvUse Case
Sticky SessionsLoad balancer pins client to a specific gateway instance, e.g. via a cookie or hashSimplePoor rebalancingSmall systems
Connection Registry + Targeted RoutingclientId -> gatewayId KV lookupEfficient 1:1Chat, notifications
PubsubBroker fans out messagesFeeds, one event must notify many recipients immediately

2.5.8. Optimising for reads/writes

Read Optimisation Strategy
CDN caching

The disadvantages in general are:

  1. Higher storage
  2. Stale data
  3. Additional complexity with invalidation strategy
Write Optimisation Strategy

The disadvantages in general are:

  1. More complex read paths
  2. Additional complexity with background preprocessors
Balanced Approach
CQRS + messaging
per-endpoint SLAs with targeted caching
tiered storage (hot cache -> primary DB -> datalake)

2.5.9. Caching

Cache Distribution StrategiesDescriptionAdv.Disadv.Use Case
Single-node (L1)Cache local to one instanceSimpleNo sharingSmall apps
Distributed (L2)Multiple caches, e.g. redisScales horizontallyNetwork latency + Op overheadMicroservices
Multi-level (L1/L2)Local + DistributedBest latency + scaleComplexityHigh-scale systems

2.5.10. Distributed Websockets

2.5.10.1. : How do we ensure that messages get to the correct client?

StrategyDescriptionAdvantagesDisadvantagesTypical Use Case
Pub/Sub broadcastAny instance publishes to a broker which broadcasts to all instances, instance holding the WS delivers, others dropSimple, resilient to instance churnWasteful fan-out, message loss if nobody is listeningSmall–medium clusters, low message volume
Connection registry + direct routingInstances add {clientId → instance} in registry, sender looks up owner and forwards via RPCPrecise delivery, scales wellRegistry correctness complexity, e.g. flapping ownership, more failure cases to handleLarge clusters, high throughput, real-time messaging
WebSocket gateway layerDedicated gateway owns all WS connections, compute instances send messages to gatewayCompute stateless, clean separation of concerns, simple delivery semanticsStateful gateway tier, extra hopHigh-scale systems, many short-lived compute instances
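
The connection registry + direct routing row can be sketched roughly as follows. This is a minimal in-memory illustration, not a production design: `ConnectionRegistry`, `route`, and the `inboxes` dict are hypothetical names, the dict stands in for a shared KV store, and appending to an inbox stands in for the RPC to the owning instance.

```python
from collections import defaultdict


class ConnectionRegistry:
    """In-memory stand-in for a shared KV store (hypothetical)."""

    def __init__(self):
        self._owner = {}  # clientId -> instanceId

    def register(self, client_id, instance_id):
        self._owner[client_id] = instance_id

    def unregister(self, client_id, instance_id):
        # Only the current owner may remove the entry, so a stale
        # disconnect cannot erase a newer connection's ownership.
        if self._owner.get(client_id) == instance_id:
            del self._owner[client_id]

    def lookup(self, client_id):
        return self._owner.get(client_id)


def route(registry, inboxes, client_id, message):
    """Forward a message to the instance that owns the client's socket."""
    instance_id = registry.lookup(client_id)
    if instance_id is None:
        return False  # client offline: caller decides whether to store or drop
    inboxes[instance_id].append((client_id, message))  # stands in for an RPC
    return True


registry = ConnectionRegistry()
inboxes = defaultdict(list)
registry.register("alice", "gateway-1")
delivered = route(registry, inboxes, "alice", "hi")
missed = route(registry, inboxes, "bob", "hi")
```

The owner check in `unregister` is one way to reduce the flapping-ownership problem listed in the table; a real registry would likely also need TTLs or heartbeats to clean up after crashed instances.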

2.6. Speed

2.6.1. Typical Cloud Infrastructure

LayerComponentDescriptionUse CaseE.g.
EdgeCDNCaches static content for low-latencyAWS CloudFront
Edge FunctionsExecute code at edge on incoming HTTP requestsAuth, redirects, request shaping, thin APIs, caching logicAWS CloudFront Functions, AWS Lambda@Edge, Cloudflare Functions
Edge GatewayRouting/security/transform at the edgeAPI gateway-lite, auth, rate limiting, header rewritesAWS API Gateway + Lambdas, Cloudflare Workers as gateway

2.6.2. Rendering Strategy

Rendering StrategyDescriptionUse CaseAdvDisadv
Client-Side Rendering (CSR) with Single-Page Apps (SPAs)Client downloads minimal HTML shell with JS, JS renders everything elseInternal tools, dashboardsCheap hostingSlow first paint, poor SEO
Server Side Rendering (SSR)Server renders full HTML per request, client JS then hydrates it for interactivitySEO critical (e-commerce, social)Fast first paintHigher infra cost, complex
Static Site Generation (SSG)HTML rendered at build timeBlogs, docs, marketing sitesFaster first paintStatic content only, rebuild for updates
Incremental Static Regeneration (ISR)SSG with on-demand/timer based regenerationCatalogs, listingsSSG with refreshCache stale window, build limits
Islands Architecture / Partial HydrationPage is served as static HTML, only the interactive components (islands) are hydratedSites with selective interactivityFaster first paintComplex
Multi-page Apps (MPAs)Client downloads new HTML for every pageTraditional sites, simple appsSimple model, good SEOFull page reloads, less dynamic UX
Edge-Side Rendering (ESR)SSR but running on edge functionsGlobal appsFastest first paintLimited runtime, cold start issues

2.6.3. Caching

Cache Placement StrategiesDescriptionAdv.Disadv.Use Case
Client / BrowserHTTP cache in browserZero latency and costInvalidation complexityStatic assets
CDN / EdgeCache at edge e.g. CloudFrontVery fast + offloads backendAuth + Invalidation complexityPublic content
In-app cache (L1)In-process cacheUltra-fastMemory-bound, per-instanceHot keys
Remote cache (L2)Redis / MemcachedShared across servicesNetwork latencyShared state
In data layer cacheDB buffer / query cacheTransparentLimited controlRead-heavy DBs

2.7. Observability

2.7.1. Typical Cloud Infrastructure

LayerComponentDescriptionUse CaseE.g.
ObservabilityLoggingCollect, aggregate and index logs from all servicesAWS CloudWatch
Monitoring / MetricsMonitor resource usage, uptime, etc.AWS CloudWatch
TracingTraces request flow across different servicesAWS X-Ray

2.8. Cost

2.8.1. Typical Cloud Infrastructure

LayerComponentDescriptionUse CaseE.g.
EdgeCDNCaches static content for low-latencyAWS CloudFront
WAF / DDoS ProtectionProtect from malicious actsAWS WAF
OrchestrationServer OrchestrationScales VMsAWS Auto Scaling + EC2 ASG
Container OrchestrationScales containersECS/EKS
MessagingPubSub BrokerFanout messages to subscribersLive updates, eventsSNS, Redis PubSub
Message Broker / Message QueueOne message → one consumerAsync jobs, retriesSQS, RabbitMQ
Stream / Event QueueDurable ordered event logEvent sourcing, analyticsKafka, Kinesis
DevOpsCI/CDAWS CodeBuild, GitHub Actions
ArtifactContainer RegistryStores, versions and distributes container imagesAWS ECR

2.8.1.1. How do we route clients to the same instance to reduce coordination?

StrategyDescriptionAdvantagesDisadvantagesTypical Use Case
Sticky sessionsLoad balancer routes client to same instance based on hash/cookieVery simple, reduces cross-instance routingBreaks on instance failure, doesn’t guarantee ownershipLow churn systems, cost-sensitive setups
Consistent hashing ownershipAll instances compute owner for clientId using membership + hash ringNo central registry; predictable routingComplex failure handling; membership convergence issuesAdvanced infra teams, custom routing layers
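
As a rough illustration of the consistent hashing row, the sketch below builds a hash ring with virtual nodes. `HashRing`, `_hash`, the instance names, and the `vnodes=64` default are all assumptions made for this example; membership convergence and failure handling, which the table flags as the hard parts, are not covered.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # Stable across processes, unlike Python's built-in hash() for str.
    return int.from_bytes(hashlib.sha256(key.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, instances, vnodes=64):
        # Each instance is placed on the ring at `vnodes` points to even out load.
        self._ring = sorted(
            (_hash(f"{inst}#{i}"), inst) for inst in instances for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def owner(self, client_id: str) -> str:
        # First ring point clockwise from the client's hash, wrapping around.
        idx = bisect.bisect_right(self._points, _hash(client_id)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["gw-a", "gw-b", "gw-c"])
smaller = HashRing(["gw-a", "gw-b"])  # same ring after gw-c leaves
clients = [f"client-{n}" for n in range(1000)]
moved = [c for c in clients if ring.owner(c) != smaller.owner(c)]
```

The property worth noticing is that when `gw-c` leaves, only the clients it owned are reassigned; clients owned by the other instances keep their owner.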

3. C3: Component Design

  • BFF: Backend for Frontend
    • GET /dashboard instead of GET /users + GET /orders + GET /recommendations

3.1. Functionality

3.1.1. Choosing a language for mobile app development

3.1.2. Choosing a language for frontend web development

LanguageUse CaseAdv.Disadv.
JSDefaultNatively supported - browsers come with JS engineSingle-threaded by default
Dart (compiled to JS)Cross-platformNo UI interactivity
C/C++/Rust (through WASM)3D graphics, gaming, video editing (e.g. Figma, Canva, AutoCAD Web)High performanceNo UI interactivity
Python (through WASM)AI/ML in the browserHigh performance, mature AI/ML ecosystem libraryNo UI interactivity
C# (through Blazor WASM)Existing .NET implementationUI interactivityYoung ecosystem, large initial payload (downloads 6MB .NET runtime)

JS is the default choice as it is the only language that has direct access to the DOM to render UI.

3.1.3. Choosing a language / framework for backend web development

The choice of language for backend web development is tightly coupled to the language's runtime, libraries and frameworks as they provide key tradeoffs.

LanguageUse CaseAdv.Disadv.
JavaScriptReal-time apps, typically preferred over PHP these daysMature ecosystem, same language for FE and BE, great for concurrency (<10k users)Not statically typed
PHPWordPress, CMS, e-commerceHuge CMS ecosystem, powers WordPressProcess-per-request model limits real-time apps without extra tooling, JS is typically preferred
PythonML / AIHuge AI/ML ecosystem
JavaEnterprise, financeStrict typing, battle testedHeavier setup
C#Enterprise with Microsoft ecosystemGreat integrations with Microsoft / AzureTied to Microsoft ecosystem
GoMicroservices, cloud-native, high-concurrency APIsExtremely fast, great concurrency with goroutinesLess suited for CMS, e-commerce
RustHigh-performance APIs
RubyReplaced by JS-Declining in popularity due to memory usage, scaling, and struggling with concurrency

3.1.4. Choosing an Infrastructure as Code (IaC) framework for cloud

FrameworkDescriptionUse CaseAdv.Disadv.
AWS
SST (Serverless Stack)Third party abstraction on top of CDKSmall projectsUltra-fast local lambdas with hot reload, DevXLess flexible than CDK, third party solution, risky with breaking changes
CDKAWS high-level code-first framework built on CloudFormationBest all-round choice for AWSCommon programming languages supportedSteep learning curve, no local emulators for lambdas and API gateways
SAM (Serverless Application Model)AWS high-level serverless-first legacy framework built on CloudFormationPrefer CDKDevX with emulators for local lambdas/API gatewaysYAML config, serverless projects only
CloudFormationAWS low-level frameworkLow-level controlAccess to L1 constructs for high customisabilityJSON/YAML config, verbose
Azure
Bicep
ARM TemplatesJSON Config
GCP
Deployment ManagerYAML
Multi-vendor
Terraform
Pulumi
Serverless FrameworkLegacy vendor agnostic frameworkDo not use, it is deadSupports AWS, Azure, GCPYAML config, mocking AWS locally required

3.1.5. Choosing a library for local dev of cloud resources

AWS

Library / ToolDescriptionUse CaseAdv.Disadv.
LocalStackFull AWS service emulator in DockerBest library to start with before using other libraries for specific functionalityBroad AWS coverage, runs in one containerSlower than service-specific emulators, partial coverage of some services
MinIOS3 compatible object storeLocal S3FastS3 only, some S3 features differ
ElasticMQSQS emulatorLocal SQSFastSQS only
DynamoDB LocalDynamoDB emulatorLocal KVFastDynamoDB only
SAM CLILambdas / API Gateway emulatorLocal lambdas / API GatewayFastServerless services only
SSTLambda emulator with hot reloadExtremely fast local lambda devExtremely fastNeed to use SST

3.1.6. Encoding

Encoding is used to serialise user facing data (text/image/audio/video) for storage / transport over the network.

TypeDescriptionUse CaseE.g.
Base3232-character set encoding (A-Z, 2-7)QR codes, OTP secretsJBSWY3DPEBLW64TMMQ======
Base64Represents binary data in ASCIIImages, API keys, JWT segmentsSGVsbG8gd29ybGQ=
Base85Represents binary data in ASCIIPDF<~87cURD_*#TDfTZ)+T~>
URLMakes data safe for URLsURLs%20 -> spaces
HexRepresents binary as hex strings0x12ab
ASCII / UTF-8Maps chars as numeric codesText65 -> "A"
Unicode (UTF-16, UTF-32)Maps characters to numeric codesText (International)U+4F60 -> "你"

3.1.7. JavaScript

EngineBrowser
V8Chrome
SpiderMonkeyFirefox
JavaScriptCoreSafari
HermesReact Native
RuntimesEngineAdvDisadvUse Case
NodeV8Mature ecosystemSlower + Security needs to be implemented via containers / OS policiesDefault choice
DenoV8Faster than NodeMostly compatible with Node modules + Security needs to be implemented via containers / OS policies
BunJavaScriptCoreSandboxedLeast compatible with node modules

3.1.8. Websockets

Choosing a websocket session identifier

IdentifierDescriptionAdvantagesDisadvantagesTypical Use Case
Connection IDServer-generated unique ID for each WebSocket connection; changes on every reconnectPrecise 1:1 mapping to an actual socket; ideal for ownership, fencing, and livenessEphemeral; not useful for user-level routing or groupingMessage delivery, connection ownership, detecting stale connections
Client ID (User ID)Logical identifier for a user or client across devices/sessionsStable identity; good for authorization and groupingToo coarse: one client can have many connections; unsafe for deliverySend to all user devices, auth checks, user-level fan-out
Session IDIdentifier for a login session or browser/app contextHelps replace old connections; supports “last session wins” semanticsStill not 1:1 with sockets; session handling adds complexityEnforcing single active session, reconnect fencing
Channel / Topic / Room IDLogical grouping that connections subscribe toClean abstraction for broadcast and fan-out; decouples sender from connectionsRequires subscription management; not tied to identityChat rooms, game lobbies, collaborative documents
Device IDStable identifier per physical deviceUseful for presence, multi-device sync, fallback deliveryPrivacy concerns; not always available or reliablePush notification routing, device-specific state
Instance IDIdentifier of the compute instance holding the connectionUseful for routing and debugging; enables direct forwardingChanges with churn; not meaningful at business levelInternal routing, connection registries, observability

3.1.9. Testing Frameworks

Frontend

  • Web
    • Playwright (purpose built from the ground up)
    • Cypress (multiple packages patched together)
  • Cross Platform
    • integration_test (flutter)
  • Mobile
    • Maestro (js)
      • Supports OS level interaction, e.g. going to system settings

3.1.10. Browser Storage

Storage TypeDescriptionSet byAccess viaLifetimeAccess scopeCapacityUse CasesSecurity Notes
CookiesKV pairsResponses (Set-Cookie) + JS (document.cookie)Requests (auto-sent) + JSConfigurable to clear after session / expiry datetimeBrowser + domain4KB each, 50 per domainAuth, prefsUse HttpOnly, Secure, SameSite flags
Session StorageKV pairsJSJSCleared on tab closeTab / Session5MBTemporary UI state, multi-tab separationAccessible to JS -> XSS risk
Local StorageKV pairsJSJSPersistent until clearedBrowser + Origin10MBApp state, non-sensitive prefsAccessible to JS -> XSS risk
Extension Storage???JS (Extensions only)JS (Extensions only)Persistent until clearedExtension5MB (sync), 10MB (local)Extension settings, sync across devices
IndexedDBNoSQL DBJSJSPersistent until clearedBrowser + OriginxGB, depending on disk spacePWAs, offline apps, large structured dataOrigin-scoped, but XSS risk

3.1.11. Database Data Persistence

Data in Tables (Persistent)

  1. Base/Regular Table
    • Data stored in disk
    • Data is persistent across sessions
  2. Temporary Table
    • Data stored in disk
    • Data exists only in session
    • Data can exist across sessions if cached

Data in Queries (In Memory)

  1. Result Set
    • Data stored in memory
    • Data exists onl
  2. Derived / Subquery e.g. FROM
    • Data stored in memory
    • Data exists only in query
  3. Common Table Expression (CTEs) e.g. WITH
    • Same as subquery, but provides syntactic alias for reusing subqueries

Named Queries

  1. View/Virtual
    • Query definition stored in disk
    • Data only stored
  2. Materialised View
    • Data stored in disk
    • Manual/scheduled refresh
  3. Stored Procedure
    • Data stored in disk
    • ???

3.1.12. Database Isolation Levels

Isolation LevelPrevents Dirty ReadsPrevents Non-Repeatable ReadsPrevents Phantom ReadsTypical Use CasesAdvantagesDisadvantages
Read UncommittedNoNoNoRarely appropriate; niche analytics where absolute accuracy is not neededMaximum concurrency, minimal blockingCan read uncommitted/rolled-back data; usually unsafe
Read CommittedYesNoNoGeneral OLTP apps, most standard business CRUD workloadsGood balance of correctness and concurrencySame row can change between reads; query result sets can change
Repeatable ReadYesYesUsually not fully, depends on DB implementationWorkflows needing stable rereads of rows within one transactionMore consistent repeated readsMore overhead, more chance of contention, phantom handling varies by DB
SerializableYesYesYesFinancial transfers, inventory correctness, highly sensitive concurrent workflowsStrongest correctness guaranteesLowest concurrency, more blocking or retries, highest cost

3.2. Compliance

3.3. Security

3.3.0.1. Auth: How do we model who gets access to what?

Access Control ApproachPrincipleUse Case
Role-Based (RBAC)Users -> Roles -> PermissionsEasiest to implement / reason about
Attribute-Based (ABAC)Permission based on user attributes, e.g. user.department == doc.department and time < 18:00Highly customisable
Relationship-Based (ReBAC)Permissions via graph relations, e.g. editor of project XCollaboration apps
Scope-Based (SBAC)Users -> Scope -> Permissions, e.g. contacts.readOAuth
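
The RBAC chain (Users -> Roles -> Permissions) can be sketched with two lookups. The role names, permission strings, and users below are hypothetical, and a real system would load them from a data store rather than module-level dicts.

```python
# Hypothetical role and user assignments, for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"doc.read"},
    "editor": {"doc.read", "doc.write"},
}
USER_ROLES = {
    "ann": {"editor"},
    "bob": {"viewer"},
}


def is_allowed(user: str, permission: str) -> bool:
    """A user is allowed if any of their roles grants the permission."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)
```

Unknown users and unknown roles fall through to "not allowed", which keeps the check deny-by-default.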

3.3.1. Encryption / Decryption with Keys

There are two types of encryption/decryption patterns

Key TypeDescriptionE.g.AdvDisadvUse Case
SymmetricPrivate key is shared, i.e. one key for both encryption and decryptionAESComputationally fasterHard to distributeBulk data encryption (disks, HTTPS session data, VPNs)
AsymmetricPublic/private key is set up, i.e. two keysRSA, ECDSAEasier to distributeComputationally slowerKey exchange, digital signatures, SSL/TLS handshake, email encryption

Public and private keys are used for two main purposes:

Key Use CasePrivate KeyPublic Key / Shared Private Key
Message Authentication and Integrity (Digital Signatures)Sign messageVerify message came from sender (authentication) + Ensure message wasn't modified in transit (integrity)
Message ConfidentialityDecrypt messageEncrypt message
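
For the shared-key case of message authentication and integrity, an HMAC is a common choice. The sketch below uses only the standard library; `sign`, `verify`, and the example message are names made up for illustration, and it does not cover the asymmetric (public/private key) variant, which needs a third-party cryptography library.

```python
import hashlib
import hmac
import secrets

shared_key = secrets.token_bytes(32)  # hypothetical key known to sender and receiver


def sign(key: bytes, message: bytes) -> str:
    """Produce a tag that only holders of the key can compute."""
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify(key: bytes, message: bytes, tag: str) -> bool:
    # compare_digest runs in constant time to avoid timing side channels.
    return hmac.compare_digest(sign(key, message), tag)


tag = sign(shared_key, b"transfer 10 to bob")
```

A valid tag shows the message came from a key holder (authentication) and was not modified in transit (integrity); it does not hide the message contents.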

3.4. Robustness

3.4.1. Queues

Queue TypeDescriptionUse Case
Simple Queue
Durable Queue
Dead-Letter Queue (DLQ)

3.5. Scalability

3.5.1. Database Access Patterns

3.5.1.1. DB Lock Modes and Granularity

Data Lock ModesTriggered ByAllows other transactions to acquireBlocks other transactions from acquiringUse Case
Shared (S)Read operationsSU, XConcurrent reads
Exclusive (X)Write operationsn.a.S, U, XData modification
Update (U)Read-for-updateSU, XPreventing deadlocks in read-then-write flows
Intent Lock ModesTriggered ByAllows other transactions to acquireBlocks other transactions from acquiringUse Case
Intent Shared (IS)Read operationsIS, IX, SXInternal coordination of hierarchical locks
Intent Exclusive (IX)Write operationsIS, IXS, U, XInternal coordination for writes
GranularityDescriptionAdvantagesDisadvantagesTypical Use Case
RowLocks a single rowHigh concurrency, very preciseMany locks can add overheadTransactional updates
PageLocks a page of multiple rowsFewer locks than row-levelCan block nearby unrelated rowsDB engine optimization
TableLocks the entire tableSimple, strong consistencyVery low concurrencyBulk operations, migrations
DatabaseLocks the entire database or schemaFull protectionExtremely restrictiveSchema changes, maintenance

3.5.1.2. ORM Lock Types

Lock TypeDescriptionHow It WorksWhen to UseDownsides
Optimistic LockingDetects conflicts at write time using a version columnUPDATE includes WHERE version = X; fails if changedLow contention systemsRequires retry handling
Pessimistic ReadPrevents others from modifying while readingUses DB read locks (e.g. FOR SHARE)When you must read stable data before decidingCan block writers
Pessimistic WriteLocks row for exclusive accessUses DB write locks (e.g. FOR UPDATE)High contention, critical updatesCan cause blocking/deadlocks
Force IncrementForces version increment even without changeORM increments version explicitlySignaling “logical update”Rare use case
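
Optimistic locking from the table above can be tried out without an ORM. The sketch below uses an in-memory SQLite database; the `account` table, its columns, and `update_balance` are assumptions made for this example, and an ORM would normally generate the equivalent `WHERE version = X` clause for you.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER, version INTEGER)"
)
conn.execute("INSERT INTO account VALUES (1, 100, 0)")


def update_balance(conn, account_id, new_balance, expected_version):
    """Return True if the write won, False if another writer got there first."""
    cur = conn.execute(
        "UPDATE account SET balance = ?, version = version + 1 "
        "WHERE id = ? AND version = ?",
        (new_balance, account_id, expected_version),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows updated means the version was stale


first = update_balance(conn, 1, 90, expected_version=0)   # matches version 0
second = update_balance(conn, 1, 80, expected_version=0)  # version is now 1
```

When the update reports a stale version, the caller would typically re-read the row and retry, which is the retry handling listed under Downsides.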

3.5.1.3. Spring

@Transactional: joins existing transaction, or starts a new one if no transaction exists

Propagation TypeIf Transaction ExistsIf No Transaction ExistsTypical Use CasesRisks / Notes
REQUIREDJoin itStart oneDefault business service methodsMost common default
REQUIRES_NEWSuspend existing and start newStart oneAudit logging, independent side effectsInner commit can succeed even if outer rolls back
SUPPORTSJoin itRun without transactionRead-only/helper logic that can work either wayEasy to accidentally run non-transactionally
NOT_SUPPORTEDSuspend existingRun without transactionWork that should avoid transaction contextRarely used; dangerous if writes happen unexpectedly
MANDATORYJoin itErrorMethods that must be called from an existing transactionGood for enforcing calling discipline
NEVERErrorRun without transactionCode that must not run in transaction contextRare
NESTEDUse savepoint/nested behaviorStart one or fail depending on framework/DB supportPartial rollback within a larger transactionSupport varies; often misunderstood

3.5.2. S3

URL Types

URL TypeDescriptionAdvDisadvUse Case
UnsignedPublic URLHosting public assets, e.g. website images, JS/CSS, downloads
SignedURL signed with S3 access keysUploading images, private file sharing
Pre-signedAllows users who do not have AWS credentials to access S3Uploading images, private file sharing
CloudFront SignedURL signed with CloudFront key pairsMedia streaming, CDNs, large-scale distribution

3.6. Speed

3.6.1. HTTP

There are 3 main versions of HTTP being used

VersionDescriptionAdvDisadvUse Case
1.1Most widely supportedSimple, easy to debug, universally compatibleOne in-flight request per connection at a time -> head-of-line blocking -> higher latency, more open connections = higher infra costLegacy, IoT
2Multiplexed streams over one TCP connectionBig improvements in latency and throughput over HTTP/1, fewer connections per client, required for gRPCHead-of-line blocking if packet loss occurs, more complex load balancinggRPC
3Runs over QUIC (UDP)Lowest latencyLess mature, harder debugging, firewalls may block UDPMobile / unstable networks
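  • Modern clients auto-negotiate the best protocol via Application Layer Protocol Negotiation (ALPN) during the TLS handshake
    • client says “I support h2, http/1.1”, server picks one
    • HTTP/3 runs over QUIC rather than TCP, so it is typically discovered first via the Alt-Svc response header or HTTPS DNS records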

3.7. Observability

3.8. Cost

4. C4: Code Design

4.1. Functionality

4.1.1. Data Structures & Algorithms

How to solve problems with code.

4.1.1.1. Methods to Reinterpret Problems

  • Create formula and see if shifting variables around can simplify solution

4.1.1.2. Modulo

ApplicationModulo byExample
Get n trailing digits10^n1234 % 100 = 34
Check even/odd2isEven = x % 2 == 0
Get value of bit after addition2(1 + 1) % 2 = 0
(0 + 1) % 2 = 1
(0 + 0) % 2 = 0
Check divisible by nnisXDivisibleByN = x % n == 0

4.1.1.3. Floor Division

ApplicationDenominatorExample
Remove n trailing digits10^n12345 // 100 = 123
Get carry over bit after addition2(1 + 1) // 2 = 1
(0 + 1) // 2 = 0
(0 + 0) // 2 = 0
Get midpoint of any array ([0,1,2] [0,1,2,3])2midpoint = len(arr) // 2
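
The modulo and floor division tables pair up naturally: `%` keeps what `//` throws away. A small sketch, with `split_digits` and `add_bits` as hypothetical helper names:

```python
def split_digits(x: int, n: int):
    """Split x into (everything except the n trailing digits, the n trailing digits)."""
    return x // 10**n, x % 10**n


def add_bits(a: int, b: int, carry_in: int = 0):
    """Add single bits, returning (sum_bit, carry_out)."""
    total = a + b + carry_in
    return total % 2, total // 2


def midpoint(arr):
    """Index of the middle element (upper middle for even lengths)."""
    return len(arr) // 2
```

For example, `split_digits(12345, 2)` gives `(123, 45)` and `add_bits(1, 1)` gives `(0, 1)`.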

4.1.1.4. Binary Trees

Sizes

  • no. of nodes: $n$
  • height of tree: $\log_x n$,
    • where $x$ is for an $x$-ary tree
  • width of tree: $2^x$
    • where $x$ is the level of the tree for which you want the width

How to navigate a Tree

There are two methods of navigating a tree: Depth-First Search (DFS) and Breadth-First Search (BFS)

DFS

There are three ways to perform traversal:

  1. In-Order Traversal (IOT) -> left, node, right
  2. Pre-Order Traversal (PreOT) -> node, left, right
  3. Post-Order Traversal (PostOT) -> left, right, node

There are two ways to implement DFS:

'''
1. Recursively
    - Adv.: Clean and intuitive
    - Disadv.: Limited by recursion depth, stack overflow risk
'''

def recursive(root):
    iot(root)
    preOT(root)
    postOT(root)

def iot(node):
    if node is None:
        return

    iot(node.left)
    process(node)
    iot(node.right)

def preOT(node):
    if node is None:
        return

    process(node)
    preOT(node.left)
    preOT(node.right)

def postOT(node):
    if node is None:
        return

    postOT(node.left)
    postOT(node.right)
    process(node)

'''
2. Iteratively
    - Adv.: Robust for large or unbounded inputs
    - Disadv.: Less intuitive and readable
'''

def iot(root):
    if root is None:
        return

    stack = []
    node = root

    while stack or node:
        # go left as far as possible
        while node:
            stack.append(node)
            node = node.left

        node = stack.pop()
        process(node)
        # move to the right subtree (may be None, in which case we pop again)
        node = node.right

def preOT(root):
    if root is None:
        return 
    
    stack = [root]  # switching this to a queue changes the DFS to BFS
    while stack:
        node = stack.pop()

        process(node)

        # push right first so left is processed first
        if node.right:
            stack.append(node.right)
        if node.left:
            stack.append(node.left)


def postOT(root):
    if root is None:
        return

    stack = []
    lastNode = None
    node = root

    while stack or node:
        # go left as far as possible
        if node:
            stack.append(node)
            node = node.left
            continue

        # at leftmost node: if the candidate has a right subtree that was not
        # the last one visited, traverse that right subtree first
        candidateNode = stack[-1]
        if candidateNode.right and lastNode is not candidateNode.right:
            node = candidateNode.right
            continue

        node = stack.pop()
        process(node)
        lastNode = node
        node = None  # do not process node again

BFS

There are two ways to perform traversal:

  1. Flat Traversal (FT)
  2. Level-Order Traversal (LOT)

BFS is primarily done iteratively - it can be implemented recursively but there is no practical benefit.


from collections import deque


def ft(root):
    if root is None:
        return
    
    queue = deque([root])

    while queue:
        node = queue.popleft()

        process(node)

        if node.left is not None:
            queue.append(node.left)
        if node.right is not None:
            queue.append(node.right)

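def lot(root):
    if root is None:
        return

    queue = deque([root])

    while queue:
        # for LOT, wrap the flat traversal logic in a for loop with levelSize iterations
        levelSize = len(queue)
        for _ in range(levelSize):
            # same as flat traversal
            node = queue.popleft()

            process(node)

            if node.left is not None:
                queue.append(node.left)
            if node.right is not None:
                queue.append(node.right)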

Note:

  • You can also add metadata for each node by appending tuples (node, metadata) to the queue instead of just nodes
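
As a sketch of the metadata note, the traversal below carries each node's depth in the queue as a `(node, depth)` tuple. The `Node` class and `values_with_depth` are hypothetical, introduced only so the example is self-contained.

```python
from collections import deque


class Node:  # hypothetical minimal node type for illustration
    def __init__(self, val, left=None, right=None):
        self.val, self.left, self.right = val, left, right


def values_with_depth(root):
    """Breadth-first traversal returning (value, depth) pairs."""
    if root is None:
        return []

    out = []
    queue = deque([(root, 0)])  # metadata travels with the node

    while queue:
        node, depth = queue.popleft()
        out.append((node.val, depth))

        if node.left is not None:
            queue.append((node.left, depth + 1))
        if node.right is not None:
            queue.append((node.right, depth + 1))

    return out


tree = Node(1, Node(2, Node(4)), Node(3))
```

Carrying the depth this way gives the same information as level-order traversal without the inner loop over `levelSize`.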

4.1.1.5. Array

How many times can I slide a window over an array?

  • Intuition
    • Start from the base case - window size 1
      • How many times can you slide it?
    • Increase window size
  • Formula
    • len(array) - windowSize + 1
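
The formula can be checked against an explicit enumeration of the windows. `window_count` and `windows` are hypothetical helper names; returning 0 for a window larger than the array is an assumption about the desired edge-case behaviour.

```python
def window_count(array, window_size):
    """Number of positions a window of the given size can take."""
    if window_size <= 0 or window_size > len(array):
        return 0
    return len(array) - window_size + 1


def windows(array, window_size):
    """All contiguous windows, one per valid start index."""
    return [array[i:i + window_size] for i in range(window_count(array, window_size))]
```

With window size 1 there are `len(array)` positions, and each increase in window size removes exactly one position from the end.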

4.1.1.6. Bitwise Operations

OperationApplicationExample
AND &Get carry for binary addition of two numbers1 & 1 = 1
AND &Get last bit0b10 & 1 = 0, 0b11 & 1 = 1
XOR ^Get sum without carry for binary addition of two numbers1 ^ 1 = 0
0 ^ 1 = 1
1 ^ 0 = 1
XOR ^Find differences between two bit patterns0110 ^ 1010 = 1100, i.e. different in first two bits
Bit ShiftMultiply/divide by 2x = 2, x << 1 = 4, x >> 1 = 1
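
Combining the AND, XOR and shift rows gives addition without the `+` operator. This sketch assumes non-negative integers; Python's unbounded integers mean the loop would not terminate for some negative inputs without extra masking.

```python
def add(a: int, b: int) -> int:
    """Add two non-negative ints using only bitwise operations."""
    while b:
        carry = (a & b) << 1  # bits where both are 1 carry into the next position
        a = a ^ b             # sum without carry
        b = carry             # repeat until there is nothing left to carry
    return a
```

Each iteration moves the carry one position to the left, so the loop ends once the carry has shifted past the highest set bit.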

4.1.1.7. Dynamic Programming

  • Caching results for fibonacci-style recurrence
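
A minimal illustration of caching a Fibonacci-style recurrence, using the standard library's `functools.lru_cache` as the cache:

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def fib(n: int) -> int:
    """n-th Fibonacci number; each subproblem is computed once and then reused."""
    if n < 2:
        return n
    return fib(n - 1) + fib(n - 2)
```

Without the cache the call tree grows exponentially in `n`; with it, each value from 0 to `n` is computed once. The recursion depth limit still applies for large `n`, so a bottom-up loop may be preferable there.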

4.1.1.8. Binomial Theorem

Theory

  • The Binomial Theorem describes how to expand binomial expressions without brute force
    • Binomial Expression:
      • An expression formed from two terms,
      • e.g. $(a + b)$
    • Binomial Theorem Formula:
      • $(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^{k}$
        • where $\binom{n}{k} \equiv {}^{n}C_k$ is the binomial coefficient a.k.a. combinations

Applications

  • The binomial coefficient can be used to describe symmetric number sequences, e.g. 1 4 6 4 1
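
The symmetric sequence and the expansion formula can both be checked numerically with `math.comb` (Python 3.8+). `binomial_row` and `expand` are hypothetical helper names.

```python
from math import comb


def binomial_row(n: int):
    """Coefficients of (x + y)^n, i.e. row n of Pascal's triangle."""
    return [comb(n, k) for k in range(n + 1)]


def expand(x: int, y: int, n: int) -> int:
    """Evaluate (x + y)^n term by term using the binomial theorem."""
    return sum(comb(n, k) * x ** (n - k) * y ** k for k in range(n + 1))
```

The row reads the same forwards and backwards because $\binom{n}{k} = \binom{n}{n-k}$.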

4.1.1.9. Describing Symmetry

  • Linear Symmetry
    • Combinations / Binomial Coefficient
    • Modulus
    • Even Functions
    • Cosine
  • Rotational Symmetry
    • Odd Functions
    • Sine

4.2. Compliance

4.3. Security

4.3.1. Request/Response Flags

FlagPurposeUse Case
HttpOnlyPrevents JS from reading cookiesProtect tokens from XSS
SecureCookie only sent over HTTPSProtect plaintext cookies from being leaked
SameSiteControls if cookies are sent on cross-site requests (Strict/Lax/None)CSRF protection / cross-site marketing
Cache-ControlControls caching of response data (no-store, max-age etc.)Ensure sensitive data isn't cached
CORS headersControl which domains can make cross-origin requestsAPIs that need controlled access

4.3.1.1. Authentication

Transporting Passwords

  • Use HTTPS for password submissions
  • Avoid logging raw credentials

4.3.1.2. Authentication Methods

MethodUse Case
Username + Password
Username + Password + 2FA
SSO
Custom-built SSO
Securing Passwords
  • Hashing
    • Passwords should be stored as irreversible cryptographic hashes
  • Salting
    • A random, user-specific unique value (salt) is added to the plain-text password before hashing; the salt itself is stored in plaintext in the database alongside the hash
    • Prevents
      • two users with the same passwords from getting the same hash
      • hackers using rainbow tables (precomputed mappings of common passwords -> hashes)
  • Peppering
    • A random, global value (pepper) is added to the plain-text password before hashing; the pepper is kept outside the database, e.g. as an env variable on the server
    • An additional layer of security on top of salting

4.4. Robustness

4.4.1. Recursion Depth Limits

  • C++: 100,000
    • Depends on frame size + OS stack size
  • Dart: 10,000
    • Set by default
  • JS: 10,000
    • (V8 engine/chrome)
    • Depends on
  • Java: 1,000
    • Depends on frame size + OS stack size
  • Python: 1,000
    • Set by default
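
In Python the limit is visible at runtime and exceeding it raises `RecursionError` rather than crashing the process. The helpers below (`depth`, `safe_depth`) are hypothetical names used to show that behaviour.

```python
import sys


def depth(n: int) -> int:
    """Recurse n levels deep and return how deep we got."""
    return 0 if n == 0 else 1 + depth(n - 1)


def safe_depth(n: int):
    """Like depth(), but returns None when the recursion limit is exceeded."""
    try:
        return depth(n)
    except RecursionError:
        return None


limit = sys.getrecursionlimit()  # current interpreter limit
```

`sys.setrecursionlimit` can raise the limit, but the OS stack size still bounds how deep a program can really go, so converting to an iterative version with an explicit stack is usually the more robust fix.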

4.5. Scalability

4.6. Speed

4.6.1. CPU Optimisations

  • Branch Prediction
  • Variable reassignment
  • CPU Pipelining
  • CPU Preloading
  • CPU Prefetching
  • Cache Locality
  • Memory Access Patterns

4.6.2. Language Optimisations

  • Peephole Optimisations
  • Inline
  • Unroll

4.6.3. Caching

Cache Read StrategiesDescriptionAdv.Disadv.Use Case
Read-thruApp reads cache -> on miss, cache reads from DBSimplifies app logicStampede risk on hot keys + Tight coupling between cache and data store + Limited flexibility for custom fetch logicSimple KV access
Cache Write StrategiesDescriptionAdv.Disadv.Use Case
Write-thruApp writes to cache -> cache writes to DB syncCache is consistent + Reads are fast after writesHigher write latency + Cache outage blocks writesStrong consistency / configuration data
Write-behind / backApp writes to cache -> cache writes to DB asyncVery fast writesRisk of data loss without durable buffering (queue / WAL required) + eventual consistencyHigh-throughput / analytics / logging / non-critical data
Cache Read/Write StrategiesDescriptionAdv.Disadv.Use Case
Cache-asideApp checks cache -> on miss, app reads from DB -> app writes to cacheSimple + cache only stores what is usedStampede risk on hot keys + Harder to guarantee consistency under concurrent writesDefault choice for most BE systems / Read-heavy systems / microservices / web APIs
Cache-thruRead-thru + Write-thruCentralised data accessCache is SPOF + Reduced observability + Harder debuggingRare / legacy / strict data access boundaries
Cache Invalidation StrategiesDescriptionAdv.Disadv.Use Case
TTL-basedCached entries expire after a set timeSimple invalidation + Prevents stale data buildupStampede risk on expiry + Hard to pick optimal TTLOften combined with cache-aside / CDN caching
Event-basedCache entries invalidated on data change eventsVery fresh data + No guess work with TTLEvent loss or ordering issues can cause permanently stale cache + More moving partsEvent-driven / CQRS systems
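
Cache-aside combined with TTL-based invalidation, the pairing noted in the table, might look like the sketch below. `TTLCache` and `get_user` are hypothetical names, a plain dict stands in for the database, and the clock is injectable so expiry can be exercised without sleeping.

```python
import time


class TTLCache:
    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable for tests
        self._entries = {}    # key -> (expires_at, value)

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._entries[key]  # TTL-based invalidation
            return None
        return value

    def set(self, key, value):
        self._entries[key] = (self._clock() + self._ttl, value)


def get_user(cache, db, user_id):
    """Cache-aside: check the cache, fall back to the DB, then fill the cache."""
    user = cache.get(user_id)
    if user is None:
        user = db[user_id]  # `db` is a dict standing in for a real data store
        cache.set(user_id, user)
    return user
```

Between a write to the DB and the entry's expiry, readers see the old value, which is the stale-data trade-off of TTL-based invalidation. This sketch also does nothing about the stampede risk on hot keys.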

4.7. Observability

4.7.1. Logging

  • Avoid auto logging POST bodies and GET parameters
    • If the auto logging runs on auth endpoints, passwords could be written in plaintext to logs
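
If request bodies do need to be logged, one option is to redact known sensitive fields first. The sketch below is a minimal illustration; `SENSITIVE_KEYS` and `redact` are hypothetical, only top-level keys are handled, and a real list of sensitive fields depends on your API.

```python
SENSITIVE_KEYS = {"password", "token", "secret"}  # hypothetical field names


def redact(payload: dict) -> dict:
    """Return a copy of the payload that is safe to log."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in payload.items()
    }
```

Redaction by key name is a denylist, so new sensitive fields are logged until someone adds them; logging an explicit allowlist of fields is the stricter alternative.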

4.7.2. Performance Metrics

MetricDescriptionLayerUnitsE.g.
BitrateRate at which app encodes and sends dataApplicationbits/sVoice: 10 kbps 2G, 64kbps 3G, 64 kbps LTE, 12-64 kbps VoLTE, 128 kbps Vo5G
Video: 1 Mbps (360p), 2 Mbps (720p), 5 Mbps (1080p), 15 Mbps (4K)
ThroughputRate at which data is sent over the networkNetworkbits/sZoom bitrate 2Mbps, network throughput only 1.5Mbps due to packet loss
Available BandwidthRate at which a network link can support data transferNetworkbits/sWi-Fi: 5Mbps
Latency / Round Trip Time (RTT)Time taken for packet to go to peer and backms<150ms before humans detect delay
Packet Loss% of dropped packets between nodes in one direction%<1% before choppy/freezing video/audio
JitterVariability in packet arrival time in one directionms<30ms before video stutters / audio crackles

4.8. Cost

4.8.1. Maintainability

How to deliver value to users with minimal waste using code.

  • Single Layer of Abstraction Principle (SLAP)
  • Dependency Injection
  • Clean Conditionals
  • Conventional Commits
  • Early Returns / Continues
  • Prefer for loops over while
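
A small before/after sketch of the "Early Returns" and "Clean Conditionals" items. The order shape and the fee rule are made up for illustration; both functions behave identically, but the second handles invalid input up front so the main path reads flat.

```python
from typing import Optional


def shipping_fee_nested(order: Optional[dict]) -> float:
    # Before: nested conditionals bury the main path.
    if order is not None:
        if order["items"]:
            if order["total"] < 50:
                return 5.0
            else:
                return 0.0
        else:
            raise ValueError("order has no items")
    else:
        raise ValueError("order is missing")


def shipping_fee(order: Optional[dict]) -> float:
    # After: guard clauses reject bad input first, then the happy path is flat.
    if order is None:
        raise ValueError("order is missing")
    if not order["items"]:
        raise ValueError("order has no items")
    if order["total"] >= 50:
        return 0.0
    return 5.0
```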

4.8.2. Response Codes

| Code | Meaning | When to use | Benefit of using |
| --- | --- | --- | --- |
| 1xx Informational | Request received, continuing process | Rare in practice, mostly for protocol-level interactions | |
| 100 | Continue | Client should continue sending request body (after headers OK) | Saves bandwidth if request is rejected early |
| 101 | Switching Protocols | Used for the HTTP to WebSocket upgrade (and the rarely used cleartext HTTP/1.1 to HTTP/2 upgrade) | Necessary to start persistent connections |
| 2xx Success | Request succeeded | | |
| 200 | OK | Standard response for successful request (e.g. GET, POST when no resource creation) | |
| 201 | Created | New resource created successfully (e.g. POST /users) | |
| 202 | Accepted | Request accepted for async processing but is not done yet | |
| 204 | No Content | Success, but no response body (e.g. DELETE) | |
| 3xx Redirection | Further action needed | | |
| 301 | Moved Permanently | Resource permanently moved | Tells crawlers to update their search index, better SEO |
| 302 | Found (Moved Temporarily) | Temporary redirect (historically used like 303) | |
| 303 | See Other | Redirect after POST -> GET (common for web forms), e.g. sending the browser to a confirmation page after a form submission | |
| 304 | Not Modified | Used with caching (conditional requests) | Client can reuse its cached response; lowers latency and bandwidth because no body is sent |
| 4xx Client Error | Problem with request | | |
| 400 | Bad Request | Malformed syntax, invalid patterns | |
| 401 | Unauthorized | Missing/invalid authentication | |
| 403 | Forbidden | Authenticated but not authorised | |
| 404 | Not Found | Resource doesn't exist, or you don't want to reveal that an endpoint exists to unauthenticated/unauthorised callers | Security through obscurity + clear feedback |
| 409 | Conflict | Resource conflict (e.g. duplicate unique field) | |
| 429 | Too Many Requests | Rate limiting / throttling | |
| 5xx Server Error | Problem on server side | | |
| 500 | Internal Server Error | Generic server crash/error | |
| 502 | Bad Gateway | Upstream server error (e.g. reverse proxy can't reach backend) | |
| 503 | Service Unavailable | Server overloaded, down for maintenance | |
| 504 | Gateway Timeout | Upstream service didn't respond in time | |
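
To avoid scattering magic numbers through handlers, the codes above can be referenced by name. The sketch below uses Python's standard `http.HTTPStatus` enum; the outcome labels and the `status_for` helper are hypothetical and only illustrate one possible mapping.

```python
from http import HTTPStatus


def status_for(outcome: str) -> HTTPStatus:
    """Map an application-level outcome label to an HTTP status code."""
    mapping = {
        "fetched": HTTPStatus.OK,
        "created": HTTPStatus.CREATED,
        "queued": HTTPStatus.ACCEPTED,
        "deleted": HTTPStatus.NO_CONTENT,
        "invalid_input": HTTPStatus.BAD_REQUEST,
        "not_logged_in": HTTPStatus.UNAUTHORIZED,
        "not_allowed": HTTPStatus.FORBIDDEN,
        "missing": HTTPStatus.NOT_FOUND,
        "duplicate": HTTPStatus.CONFLICT,
        "rate_limited": HTTPStatus.TOO_MANY_REQUESTS,
    }
    # Anything unrecognised is treated as a server-side fault.
    return mapping.get(outcome, HTTPStatus.INTERNAL_SERVER_ERROR)
```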

4.8.3. React

  • Avoid useEffect when there is no external system to synchronise with (source)

4.8.4. Git

  • BFG Repo-Cleaner
    • CLI for cleaning up git repos
      • e.g. committed large files / sensitive data

5. First Principles

| Question | Answer |
| --- | --- |
| What is a gateway? | A gateway is a specialised, stateful compute component optimised for connection handling, routing, auth, fanout |
| What is a load balancer? | |
| What is OAuth? | Authorisation protocol, i.e. can this app access this resource? |
| What is OIDC? | Authentication protocol built on top of OAuth, i.e. who is this user? |
| What is authentication? | Authentication is verifying identity, i.e. confirming that you are who you say you are |
| What is authorisation? | Authorisation is checking permissions, i.e. determining what you can do |

5.1. Options for deploying

| Deployment Approach | How data is served |
| --- | --- |
| Local | From the same machine |
| Network | From a remote system |
| Distributed | From multiple remote nodes |

5.2. Testing Best Practices

  • E2E
    • Main user stories, happy paths
  • Integration
    • Edge cases not caught by E2E, unhappy paths
  • Unit
    • Edge cases for small functions

5.3. Mobile

5.3.1. Cold/Warm/Hot Starts on Mobile

  1. Cold Start
    • binary not in memory
    • e.g. launching app after killing it
  2. Warm Start
    • binary in memory, app process in background
    • e.g. when switching between apps
  3. Hot Start
    • binary in memory, app process in foreground
    • e.g. when locking and unlocking the screen momentarily, or switching between apps briefly
      • This occurs because Android and iOS give apps a grace period (~2s) before backgrounding them
    • App still has GPU and CPU priority

5.3.2. Splash Screen

Splash screens are only shown for cold start

| Phase | Native iOS | Native Android | React Native | Flutter |
| --- | --- | --- | --- | --- |
| Process Startup | OS launches app process | Same | Same | Same |
| Show OS-level Splash | Launch splash | Same | Same | Same |
| Runtime Init + Framework Bootstrap | Initializes iOS runtime + UIKit, sets up main run loop, prepares initial UIViewController | Init Android Runtime + base Activity, inflates first layout | Native layer starts JS engine, loads JS bundle, sets up React tree & JS x native bridge | Native layer starts Flutter engine, loads Dart VM, initializes widget tree & Skia renderer |
| App Init | Set up SDKs, DB, config etc. | Same | Same | Same |
| Remove Splash | OS removes splash once first UIViewController is ready | OS removes splash once Activity content is ready | Native splash removed after JS bundle + RN root view are mounted | Native splash removed after Flutter engine renders first frame |
| First Frame Rendered | First frame is rendered | Same | Same | Same |

5.4. Databases

5.4.1. SQL Databases

| Question | Answer |
| --- | --- |
| What is atomicity? | Atomicity guarantees that all statements within a transaction are committed, or none are, i.e. all or nothing |
| What is consistency? | Consistency guarantees that a transaction brings the database from one valid state to another, preserving all defined constraints |
| What is isolation? | Isolation guarantees that transactions running in parallel do not interact in unsafe ways (the exact guarantees depend on the isolation level) |
| What is durability? | Durability guarantees that once a transaction has been committed, there will be no data loss in the event of a crash |
| What is SQL? | Structured Query Language is a language used to interact with data in relational DBs |
| What subtypes of SQL are there? | DQL, DML, DDL, DCL |
| What is DQL? | Data Query Language is used to read data from a DB, e.g. SELECT |
| What is DML? | Data Manipulation Language is used to modify data in the DB, e.g. INSERT, UPDATE, DELETE |
| What is DDL? | Data Definition Language is used to define the structure of data in the DB, e.g. CREATE, ALTER, DROP |
| What is DCL? | Data Control Language is used to control access to data in the DB, e.g. GRANT, REVOKE |
| What is a command/statement? | An instruction executed by the database, e.g. SELECT * FROM fooTable; |
| What is a clause? | A part within a statement, e.g. WHERE |
| What is a read/query? | A statement that reads data, e.g. SELECT * FROM fooTable; |
| What is a write/update? | A statement that writes data, e.g. DELETE FROM fooTable WHERE ...; |
| What is a read-then-write? | A pattern where data is read before a write (the write may use the data / update the same data) |
| What is a read-for-update? | A pattern that reads data with the intention of updating the same data later, e.g. BEGIN; SELECT ... FOR UPDATE; -- biz logic on that data ; UPDATE ...; COMMIT; |
| What is a result set? | Data returned from a query |
| What is an update acknowledgement? | Confirmation returned from an update |
| What is a transaction? | A group of statements executed as a single unit of work, e.g. BEGIN; SELECT ...; UPDATE ...; COMMIT; (or ROLLBACK; on failure) |
| What is a single unit of work? | A set of operations that guarantees atomicity, i.e. either all statements commit, or all are rolled back |
| What is a lost update? | Two transactions read the same row, both modify it, one overwrites the other |
| What is a dirty read? | One transaction reads data written by another transaction that has not committed yet |
| What is a non-repeatable read? | A situation where a transaction reads data twice and gets different values because another transaction committed in between |
| What is a phantom read? | A situation where the set of rows matching a condition changes during a transaction |
| What is a deadlock? | A deadlock is when 2 or more transactions are waiting for each other's locks, preventing any from proceeding |
| What is autocommit? | Autocommit is a DB configuration that automatically commits each individual SQL statement as its own transaction |
| What is an object? | Anything defined in a DB, e.g. Tables, Views, Indices, Stored Procedures, Triggers, Functions |
| What is a schema? | Logical grouping of DB objects |
| What is an execution plan? | The strategy the DB optimiser chooses to execute reads/writes efficiently, e.g. index scan vs full scan, hash join |
| What is a lock? | Mechanism used to control concurrent access to data, e.g. through blocking |
| What is a lock mode? | The type of lock that defines what operations are allowed/blocked, e.g. S (shared), U (update), X (exclusive), IS (intent shared), IX (intent exclusive) |
| What is lock granularity? | The level of the object at which a lock is applied (e.g. row, page, table) |
| What is a data lock mode? | Locks placed on actual data to control read/write access (e.g. S, U, X) |
| What is an intent lock mode? | Locks placed on coarser-granularity objects by transactions, signalling their intentions on finer-granularity objects (e.g. IS, IX) |
| What is the lock lifecycle? | Acquire intent lock when higher-level object is accessed during execution -> acquire data lock when lower-level object is accessed during execution -> do work -> release lock depending on lock mode |
| When are read locks released? | Depends on the isolation level |
| When are write locks released? | At transaction end / commit |
| What does blocking mean in locks? | A transaction is forced to wait at a statement that is trying to access a locked resource, because it cannot acquire the required lock |
| How many S locks can exist on a resource at any time? | Multiple |
| How many U locks can exist on a resource at any time? | One |
| How many X locks can exist on a resource at any time? | One |
| Why does S block U? | Because a U lock is intended to be promoted to an X lock, which would be blocked by the existing S lock and could result in a deadlock |
| Why doesn't U block S? | Because DBs allow concurrent reads while the updater determines whether an update is necessary, reducing total lock duration and improving performance (design choice) |

Example timeline (time steps 1 to 6) for three concurrent threads:

  • A: request received -> read old data (allowed under U lock) -> response sent with old data
  • B: request received -> read blocked -> read new data -> response sent with new data
  • C: request received -> U lock acquired -> read -> write -> other processing
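
The atomicity and transaction rows above can be tried out with Python's built-in `sqlite3` module. This is a small sketch under assumptions: the `accounts` table, its `CHECK` constraint and the `transfer` helper are invented for the example, and SQLite's locking differs from the S/U/X model described in the table.

```python
import sqlite3


def transfer(conn: sqlite3.Connection, src: str, dst: str, amount: int) -> bool:
    """Move amount between accounts atomically; return False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst))
            # If this debit violates the CHECK constraint, the credit above is undone too.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src))
    except sqlite3.IntegrityError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()
```

A transfer larger than the source balance fails on the second statement, and the rollback leaves both rows exactly as they were, which is the "all or nothing" behaviour.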

5.4.1.1. Deadlock Problem

| Step | Transaction A | Transaction B |
| --- | --- | --- |
| 1 | SELECT → acquires S lock | |
| 2 | | SELECT → acquires S lock |
| 3 | UPDATE → tries to acquire X, waits | |
| 4 | | UPDATE → tries to acquire X, waits |
| 5 | ❌ waiting for B to release S | ❌ waiting for A to release S |
| 6 | 💥 deadlock | 💥 deadlock |
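
The situation in step 5 is a cycle in the "who is waiting for whom" graph. A minimal sketch of detecting such a cycle follows; the `has_deadlock` function and the dictionary shape are assumptions for illustration, not how any particular database implements its detector.

```python
from typing import Dict, Set


def has_deadlock(waits_for: Dict[str, Set[str]]) -> bool:
    """Return True if the wait-for graph contains a cycle."""
    visiting: Set[str] = set()  # transactions on the current DFS path
    done: Set[str] = set()      # transactions fully explored, known cycle-free

    def visit(txn: str) -> bool:
        if txn in visiting:
            return True         # came back to a transaction on the current path
        if txn in done:
            return False
        visiting.add(txn)
        for blocker in waits_for.get(txn, set()):
            if visit(blocker):
                return True
        visiting.remove(txn)
        done.add(txn)
        return False

    return any(visit(txn) for txn in waits_for)
```

With the table above, `{"A": {"B"}, "B": {"A"}}` is reported as a deadlock. Databases that detect this typically abort one of the transactions so the other can proceed.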

5.4.2. ORMs

| Question | Answer |
| --- | --- |
| What is a session? | A session represents a unit of work and manages the persistence context and communication with the DB |
| What is the difference between an ORM session and a DB transaction? | ORM session represents an application-level UoW while DB transaction represents a DB-level UoW + a session may span multiple transactions + a transaction may be managed within a session |
| What is persistence context? | A cache that tracks entities and their changes during a session, ensuring consistency between in-memory objects and the DB |

5.5. Networking Model

There are two main models that are used in the industry today:

  1. Open Systems Interconnection (OSI) model
    1. Abstract: Typically used to discuss concepts
  2. TCP/IP model
    1. Concrete: This is what is used in the internet today

| OSI Layer | Name | Purpose | TCP/IP Layer | Data Unit | Examples |
| --- | --- | --- | --- | --- | --- |
| 7 | Application | User Apps | Application | Data | Zoom, WhatsApp, Teams |
| 7 | Application | App Protocols | Application | Data | HTTP, WebSockets, WebRTC (API, signaling), SIP, DNS, gRPC, RTP/SRTP |
| 6 | Presentation | Data formatting | Application | Data | JSON, XML, Protobuf |
| 6 | Presentation | Encoding & Compression | Application | Data | JPEG, MP3, H.264, gzip |
| 6 | Presentation | Encryption | Application | Data | TLS, DTLS, SSL, SRTP |
| 5 | Session | Manage session lifecycle | Application | Data | NetBIOS, RPC, WebRTC session setup |
| 4 | Transport | Reliable/unreliable delivery, multiplexing, manage connections | Transport | Segment (TCP) / Datagram (UDP) | TCP, UDP, QUIC |
| 3 | Network | Routing, addressing | Internet | Packet | IP, ICMP, BGP |
| 2 | Data Link | Framing, error detection | Link | Frame | Ethernet, Wi-Fi MAC, PPP, 5G NR |
| 1 | Physical | Raw bits over a medium | Link | Bits | Fiber, RF, copper, modulation |

5.6. Sessions and connections

Definition

| | Connection | Session |
| --- | --- | --- |
| Layer | Transport | Application |
| Definition | A channel between two peers | A context between two peers |
| Lifespan | Exists only while the underlying transport link is open | Can span multiple connections, until either peer terminates the session |

Signaling: Session Management

Signaling is the process of setting up, managing, and tearing down a communication session. Setup happens before any real-time data flows. Signaling encompasses multiple processes:

  • Session Setup
  • Codec Negotiation
    • Process where two peers agree on a common codec for audio/video during signaling
  • NAT Traversal
    • Techniques + Protocols that allow devices behind NAT to communicate directly
    • There are three main techniques
      1. Session Traversal Utilities for NAT (STUN)
        • Device asks STUN server "What's my public IP:port?"
        • Device shares info with other peer (P2P)
        • Works only if NAT keeps mappings stable
      2. Traversal Using Relays around NAT (TURN)
        • Both devices send media to a TURN server
        • Used as fallback if direct P2P fails
        • Higher latency + server bandwidth cost
      3. Interactive Connectivity Establishment (ICE)
        • Gathers candidates
          • Private IP:port
          • Public IP:port from STUN
          • Relay addresses from TURN
        • Runs connectivity checks on the candidate pairs
        • Picks the highest-priority pair that works (direct paths are preferred over relays)
  • Encryption key exchange
  • Exchange session metadata
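
To make the ICE candidate ordering concrete, here is a small sketch of how candidates might be ranked before connectivity checks. The priority formula and the type preference values (host 126, server-reflexive 100, relay 0) follow the recommendations in the ICE specification as far as I recall them; treat the exact numbers, the `Candidate` class and the addresses as assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List

# Assumed type preferences: direct host paths first, TURN relays last.
TYPE_PREFERENCE = {"host": 126, "srflx": 100, "relay": 0}


@dataclass(frozen=True)
class Candidate:
    kind: str                    # "host" (private), "srflx" (via STUN), "relay" (via TURN)
    address: str                 # illustrative IP:port string
    component_id: int = 1
    local_preference: int = 65535

    @property
    def priority(self) -> int:
        # priority = 2^24 * type pref + 2^8 * local pref + (256 - component ID)
        return (
            (TYPE_PREFERENCE[self.kind] << 24)
            + (self.local_preference << 8)
            + (256 - self.component_id)
        )


def order_candidates(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Highest priority first, i.e. the order in which paths would be preferred."""
    return sorted(candidates, key=lambda c: c.priority, reverse=True)
```

Ranking alone does not pick the route: a candidate pair is only usable once its connectivity check succeeds, which is why the relay remains in the list as a fallback.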

5.7. Web Identifiers

| Term | Definition | E.g. |
| --- | --- | --- |
| TCP connection | Source IP : Source Port -> Destination IP : Destination Port | 192.168.1.10 : 52341 → 34.120.10.5 : 443 |
| Socket | OS-managed object that includes TCP connection + send/receive buffer | |
| Domain | Registrable name of a website | example.com |
| Subdomain | | shop.example.com |
| Host | | shop |
| Ephemeral Port | | 52341 |
| Scheme | Protocol used to access the resource | http://, ws:// |
| Port | Number identifying the service on the host | 443 |
| Origin | Scheme + Host + Port | https://example.com:443 |
| Fragment | Reference to a section within the resource; handled client-side and not sent to the server | #reviews |
| Uniform Resource Name (URN) | Name of a resource, not how to locate it | urn:isbn:0451450523 (book ISBN), urn:uuid:6fa459ea-ee8a-3ca4-894e-db77e160355e (UUID) |
| Uniform Resource Locator (URL) | How to locate a resource | https://shop.example.com:443/products?id=10#reviews |
| Uniform Resource Identifier (URI) | Umbrella term covering both URLs and URNs | - |
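
The URL row above can be taken apart with Python's standard `urllib.parse` module, which is a quick way to see which piece is which. The variable names are illustrative only.

```python
from urllib.parse import parse_qs, urlsplit

url = "https://shop.example.com:443/products?id=10#reviews"
parts = urlsplit(url)

scheme = parts.scheme          # "https"
host = parts.hostname          # "shop.example.com"
port = parts.port              # 443
path = parts.path              # "/products"
query = parse_qs(parts.query)  # {"id": ["10"]}
fragment = parts.fragment      # "reviews"

# Origin = scheme + host + port
origin = f"{scheme}://{host}:{port}"
```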

5.7.1. Defining IP address ranges

  • CIDR (Classless Inter-Domain Routing) blocks
    • Notation that defines a range of IP addresses
    • Notation: <ip address>/<prefix length>
      • Prefix length determines how many leading bits in the address are fixed
      • The remaining bits are free, so a shorter prefix means a larger range
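
Python's standard `ipaddress` module can be used to check what a CIDR block covers. The block `192.168.1.0/24` and the `in_block` helper are chosen here purely as an illustration.

```python
import ipaddress

# A /24 fixes the first 24 bits, leaving 8 bits (256 addresses) free.
network = ipaddress.ip_network("192.168.1.0/24")


def in_block(ip: str, cidr: str) -> bool:
    """Return True if ip falls inside the CIDR block."""
    return ipaddress.ip_address(ip) in ipaddress.ip_network(cidr)
```

This kind of membership check is what an IP allowlist does when it is configured with CIDR ranges.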

5.8. Telco 101:

  • Cell Tower
    • Software Components i.e. Base Station Software Stack
      • Radio Access Network (RAN) Software
        • Handles communication between mobile devices and cell tower, e.g.
          • Handover Control: Deciding when phone switches from one tower to another
          • Radio Resource Control (RRC): managing spectrum and assigning frequencies to devices
          • MAC & PHY Scheduling: Deciding which user gets how much bandwidth every millisecond
          • Security & Authentication: Encrypting radio traffic before it hits the core
          • Quality of Service: Prioritising latency-sensitive traffic like voice and video
      • Cell Tower OS
        • Manages hardware scheduling, memory and task prioritisation
      • Management Software
        • For engineers to monitor and configure the cell tower
    • Hardware Components
      • Antennas: Send/receive radio signals
      • Remote Radio Unit (RRU): Converts radio waves to/from digital data
      • Baseband Unit (BBU): Runs the base station software stack
        • In 5G, BBUs are often
          • centralised in regional data centers
          • shared across dozens of towers
          • no longer located at the cell tower itself
      • Backhaul: Connection to core network via
        • Fiber (Most common)
        • Microwave (rural areas)
        • Satellite (remote locations)

5.9. WebRTC

  • Frameworks

    • Web Real-Time Communication (WebRTC)
      • Open source framework for P2P RTC
      • Components
        • Signaling
        • Media Capture
        • Media Transport
        • Encryption
        • NAT Traversal
        • Adaptive Quality
        • Data Channels
  • Signaling Protocols

    • Session Initiation Protocol (SIP)
      • Set up, modify, tear down real-time sessions for voice/video/messaging
  • Monitoring Protocols

    • Real-time Transport Control Protocol (RTCP)
      • Measures network performance metrics for RTP
  • Security Protocols

    • Transport Layer Security (TLS)
      • Secures TCP
    • Datagram Transport Layer Security (DTLS)
      • Secures UDP
      • i.e. TLS for UDP
  • Transport Protocols

    • Real-time Transport Protocol (RTP)
      • Transports real-time media (audio/video)
      • Rides on UDP, sometimes TCP
    • Secure Real-time Transport Protocol (SRTP)
      • Encrypted RTP
      • Uses DTLS for key exchange
    • RTCP
  • Network Address Translation (NAT)

    • NAT Devices
      • Home Routers
      • Corporate Firewalls
    • Vanilla NAT
      • 1:1 mapping between private and public IPs (e.g. 192.168.0.1 (private) : 203.0.113.1 (public))
      • Provides control over private IP ranges
      • Single source of truth for configuring public/private IP mappings (e.g. ISP changes IP allocations)
    • Port Address Translation (PAT) a.k.a NAT Overload
      • Many:1 mapping from private IPs to a single public IP, with ports used to tell connections apart
        • e.g.
          • 192.168.0.10:52301 -> 203.0.113.7:40001
          • 192.168.0.11:52301 -> 203.0.113.7:40002
      • Workaround for IPv4's small address space; not needed in IPv6, where every device can have its own public address
  • Firewall

    • Decides which packets are allowed/blocked
    • Lives between private network and public internet
    • Typically blocks incoming connections, not outgoing
    • Corporate firewalls often block UDP because the lack of handshakes makes it hard for them to track session state
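
The PAT mapping described above can be sketched as a pair of lookup tables. The `PatTable` class, its method names and the starting port 40001 are assumptions that mirror the example addresses in the list, not a description of any real router.

```python
from typing import Dict, Optional, Tuple

Endpoint = Tuple[str, int]  # (IP, port)


class PatTable:
    """Toy Port Address Translation table for one public IP."""

    def __init__(self, public_ip: str, first_port: int = 40001) -> None:
        self.public_ip = public_ip
        self._next_port = first_port
        self._outbound: Dict[Endpoint, int] = {}  # private endpoint -> public port
        self._inbound: Dict[int, Endpoint] = {}   # public port -> private endpoint

    def translate_outbound(self, private_ip: str, private_port: int) -> Endpoint:
        key = (private_ip, private_port)
        if key not in self._outbound:
            # First packet from this endpoint: allocate a fresh public port.
            self._outbound[key] = self._next_port
            self._inbound[self._next_port] = key
            self._next_port += 1
        return (self.public_ip, self._outbound[key])

    def translate_inbound(self, public_port: int) -> Optional[Endpoint]:
        # Replies are routed back only if an outbound mapping already exists.
        return self._inbound.get(public_port)
```

The `None` result for unknown ports also hints at why unsolicited incoming traffic is dropped by default, and why STUN and TURN are needed for peers behind NAT.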

5.10. Wireless Systems

  • Application
  • Transport / IP
  • Radio Resource Control (RRC): Manages radio resources and connection states between base station and user device
    • Types of radio resources:
      • Time
      • Frequency
      • Power
      • Modulation & Coding
      • Bearer
      • Control
      • Random access
      • Beamforming
    • Types of connection states:
      • RRC_IDLE
      • RRC_INACTIVE (5G)
      • RRC_CONNECTED
  • Packet Data Convergence Protocol (PDCP)
  • Radio Link Control (RLC)
  • Medium Access Control (MAC) Layer: Decides who gets to transmit, when, and how much bandwidth
  • Physical (PHY) Layer: Deals with actual signal transmission over radio waves (modulation, power levels etc.)

5.11. Network Protocols

Application Layer

Signaling Layer

  • Voice over Public Switched Telephone Network (PSTN)
    • Dedicated E2E path between landlines/mobile phones using circuit switching
    • Transmits uncompressed voice using Pulse Code Modulation (PCM) at 64 kbps per call
    • Used by landlines, and by mobile phones on pre-4G (2G/3G) connections
    • 4G and above carry voice over IP instead (e.g. VoLTE)
    • Carrier provides QoS guarantees
  • Voice over IP
    • Transmits voice using IP
    • No QoS guarantees, call quality depends on network connection
  • Video over IP
    • Transmits video using IP