References
- System Design Top 100
- Hello interview:
Java-fullstack SD:
- In U.S. Java full-stack system design (SD) interviews, the common questions center on core issues such as high concurrency, scalability, data consistency, and microservice architecture. Below are the most frequently asked system design topics (especially for mid-to-senior or full-stack roles):
- Additional system design topics a Java full-stack role may cover:
- Decoupled front-end/back-end design (API design conventions, API versioning)
- Microservice architecture design (Spring Cloud, Netflix OSS, gRPC)
- Choosing SSR vs. CSR on the front end (React server-side rendering)
- CI/CD and DevOps integration (Jenkins, Docker, Kubernetes)
- Data consistency and the CAP theorem (especially eventual consistency)
Backend SD sample questions:
- notification system
- youtube
- uber
- comment system
- distributed MQ
- distributed counter
- Data platform
Frontend SD:
- Build a running-light (marquee) component that supports both automatic switching every 500 ms and manual switching.
- Given an API that returns a CSV-formatted table, fetch and process the data and render it as a table with pagination and search.
- Implement throttling and debouncing in React.
- How would you optimize a list that renders slowly? Propose at least two solutions.
- Implement a modal dialog component that supports nesting.
- Design a countdown timer: a numeric input plus a confirm button that starts the countdown, with support for stop, resume, and reset.
- Simulate TV remote-control input: draw a keyboard containing 1-9 and a-z, move the focus with the arrow keys, record a character each time Enter is pressed and display the entered characters, and let Backspace delete characters. Handle the edge cases when the focus reaches the keyboard boundary.
- Login form (email and password inputs with simple validation)
- Load a user list from an API.
- Lazy-load a large list
- Multi-step form (wizard), e.g. a registration flow split into steps
Scaling - prerequisite knowledge
Single server setup
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Database
Non-relational databases might be the right choice if:
- • Your application requires super-low latency.
- • Your data are unstructured, or you do not have any relational data.
- • You only need to serialize and deserialize data (JSON, XML, YAML, etc.).
- • You need to store a massive amount of data.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Vertical scaling vs horizontal scaling
- Vertical scaling, referred to as “scale up”
- means the process of adding more power (CPU, RAM, etc.) to your servers.
- it comes with serious limitations:
- • Vertical scaling has a hard limit. It is impossible to add unlimited CPU and memory to a single server.
- • Vertical scaling does not have failover and redundancy. If one server goes down, the website/app goes down with it completely.
- Horizontal scaling, referred to as “scale out”
- allows you to scale by adding more servers into your pool of resources.
Horizontal scaling is more desirable for large scale applications due to the limitations of vertical scaling.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Load balancer
After a load balancer and a second web server are added, we successfully solved the no-failover issue and improved the availability of the web tier. Details are explained below:
- If server 1 goes offline, all the traffic will be routed to server 2. This prevents the website from going offline. We will also add a new healthy web server to the server pool to balance the load.
- If the website traffic grows rapidly, and two servers are not enough to handle the traffic, the load balancer can handle this problem gracefully. You only need to add more servers to the web server pool, and the load balancer automatically starts to send requests to them.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Database replication
Advantages of database replication:
- Reliability: data is replicated across multiple servers, so losing one database server does not lose the data.
- Better performance: writes go to the master while reads are spread across read replicas, so more queries can be processed in parallel.
- High availability: if one database server goes offline, the site stays up by serving data from another replica.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Cache
Considerations for using cache:
- • Decide when to use cache. Consider using cache when data is read frequently but modified infrequently.
- • Expiration policy. It is a good practice to implement an expiration policy. Once cached data is expired, it is removed from the cache.
- • Consistency: This involves keeping the data store and the cache in sync. Inconsistency can happen because data-modifying operations on the data store and cache are not in a single transaction.
- When scaling across multiple regions, maintaining consistency between the data store and cache is challenging. For further details, refer to the paper titled “Scaling Memcache at Facebook” published by Facebook [7].
- • Mitigating failures: A single cache server represents a potential single point of failure (SPOF), defined in Wikipedia as follows: “A single point of failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working” [8].
- As a result, multiple cache servers across different data centers are recommended to avoid SPOF.
- • Eviction Policy: Once the cache is full, any requests to add items to the cache might cause existing items to be removed. This is called cache eviction. Least-recently-used (LRU) is the most popular cache eviction policy.
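To make the read/write interaction above concrete, here is a minimal cache-aside sketch in Java; CacheClient and UserDatabase are hypothetical interfaces standing in for Redis/Memcached and the primary data store:

```java
import java.time.Duration;
import java.util.Optional;

// Minimal cache-aside sketch. CacheClient and UserDatabase are hypothetical
// interfaces standing in for Redis/Memcached and the primary data store.
interface CacheClient {
    Optional<String> get(String key);
    void set(String key, String value, Duration ttl);   // expiration policy
    void delete(String key);
}

interface UserDatabase {
    String loadUserJson(long userId);
    void saveUserJson(long userId, String json);
}

class UserService {
    private final CacheClient cache;
    private final UserDatabase db;
    private static final Duration TTL = Duration.ofMinutes(10);

    UserService(CacheClient cache, UserDatabase db) {
        this.cache = cache;
        this.db = db;
    }

    // Read path: try the cache first, fall back to the database, then populate the cache.
    String getUser(long userId) {
        String key = "user:" + userId;
        return cache.get(key).orElseGet(() -> {
            String json = db.loadUserJson(userId);
            cache.set(key, json, TTL);
            return json;
        });
    }

    // Write path: update the database, then invalidate the cache entry.
    // Note these two operations are not a single transaction, which is the
    // consistency concern mentioned above.
    void updateUser(long userId, String json) {
        db.saveUserJson(userId, json);
        cache.delete("user:" + userId);
    }
}
```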
CDN
A CDN is a network of geographically dispersed servers used to deliver static content. CDN
servers cache static content like images, videos, CSS, JavaScript files, etc.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Stateless web tier
Stateful architecture:
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
Stateless architecture:
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
We move the session data out of the web tier and store it in a persistent shared data store. The shared data store could be a relational database, Memcached/Redis, a NoSQL store, etc. Here, a NoSQL data store is chosen because it is easy to scale.
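A minimal sketch of such a shared session store, assuming Redis via the Jedis client (the key naming and the 30-minute TTL are illustrative choices):

```java
import redis.clients.jedis.Jedis;

// Minimal sketch of a shared session store backed by Redis (Jedis client).
// "session:<id>" keys and the 30-minute TTL are illustrative, not from the text.
public class SessionStore {
    private static final int SESSION_TTL_SECONDS = 1800;

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            String sessionId = "abc123";

            // Any web server can write the session after login ...
            jedis.setex("session:" + sessionId, SESSION_TTL_SECONDS,
                        "{\"userId\": 42, \"role\": \"member\"}");

            // ... and any other stateless web server can read it on the next request.
            String session = jedis.get("session:" + sessionId);
            System.out.println("session = " + session);
        }
    }
}
```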
Data center
In normal operation, users are geoDNS-routed, also known as geo-routed, to the closest data center.
↓ ↓ ↓ ↓ ↓ ↓ ↓ ↓
In the event of any significant data center outage, we direct all traffic to a healthy data center.
Message queue
To further scale our system, we need to decouple different components of the system so they
can be scaled independently. A message queue is a key strategy employed by many real-world distributed systems to solve this problem.
Logging, metrics, automation
- Logging: Monitoring error logs is important because it helps to identify errors and problems in the system.
- Metrics: Collecting different types of metrics helps us gain business insights and understand the health status of the system. Some useful metrics:
  - Host level metrics: CPU, memory, disk I/O, etc.
  - Aggregated level metrics: for example, the performance of the entire database tier, cache tier, etc.
  - Key business metrics: daily active users, retention, revenue, etc.
- Automation: CI/CD. When a system gets big and complex, we need to build or leverage automation tools to improve productivity.
Database scaling - Sharding
sharding:
It introduces complexities and new challenges to the system:
- Resharding data: resharding is needed when 1) a single shard can no longer hold more data due to rapid growth, or 2) certain shards experience shard exhaustion faster than others because of uneven data distribution. Consistent hashing is a commonly used technique to solve this problem (see the sketch after this list).
- Celebrity problem: also called the hotspot key problem. Excessive access to a specific shard can overload a server. Imagine data for Katy Perry, Justin Bieber, and Lady Gaga all end up on the same shard; for social applications, that shard will be overwhelmed with read operations. To solve this, we may need to allocate a shard per celebrity, and each such shard might even require further partitioning.
- Join and denormalization: once a database has been sharded across multiple servers, it is hard to perform join operations across shards. A common workaround is to denormalize the database so that queries can be performed in a single table.
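A minimal consistent-hashing sketch in Java, referenced from the resharding bullet above; the virtual-node count and the MD5-based hash are illustrative choices:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.SortedMap;
import java.util.TreeMap;

// Minimal consistent-hashing ring. Virtual nodes smooth out uneven key
// distribution; removing a shard only remaps the keys that lived on it.
public class ConsistentHashRing {
    private final TreeMap<Long, String> ring = new TreeMap<>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(shard + "#" + i), shard);
        }
    }

    public void removeShard(String shard) {
        for (int i = 0; i < virtualNodes; i++) {
            ring.remove(hash(shard + "#" + i));
        }
    }

    // A key maps to the first shard clockwise from its hash position.
    public String shardFor(String key) {
        SortedMap<Long, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static long hash(String s) {
        try {
            byte[] d = MessageDigest.getInstance("MD5").digest(s.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (d[i] & 0xFF); // take the first 8 bytes as a long
            }
            return h;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ConsistentHashRing ring = new ConsistentHashRing(100);
        ring.addShard("shard-0");
        ring.addShard("shard-1");
        ring.addShard("shard-2");
        System.out.println("user:123 -> " + ring.shardFor("user:123"));
        ring.removeShard("shard-1"); // only keys that lived on shard-1 move
        System.out.println("user:123 -> " + ring.shardFor("user:123"));
    }
}
```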
Denormalization
Denormalization means deliberately introducing redundancy when designing the database in order to improve read performance and reduce cross-table queries (especially in distributed systems). Common concrete techniques include:
- Field duplication: copy certain fields from a related table into the current table to avoid a JOIN. Example: copy the user name from the users table into the orders table so that querying orders does not require joining users.
- Table merging: merge several highly related tables into one wide table. Example: merge users and user_profiles into a single user_full_info table.
- Redundant data: store computed or aggregated fields. Example: store the review count or average rating directly on the product table instead of recalculating it from the reviews table on every query.
- Nested structures (common in NoSQL): in a document database such as MongoDB, embed related data directly inside a single document. Example: embed all of a user's orders inside the user document.
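For instance, the nested-document approach might produce a user document like the following illustrative JSON (all field names and values are made up for the example):

```json
{
  "_id": "u123",
  "name": "Alice",
  "order_count": 2,
  "orders": [
    { "order_id": "o1", "amount": 25.0, "created_at": "2024-01-01" },
    { "order_id": "o2", "amount": 49.9, "created_at": "2024-02-14" }
  ]
}
```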
Conclusion
To conclude this chapter, we provide a summary of how we scale our system to support millions of users:
- • Keep web tier stateless
- • Build redundancy at every tier
- • Cache data as much as you can
- • Support multiple data centers
- • Host static assets in CDN
- • Scale your data tier by sharding
- • Split tiers into individual services
- • Monitor your system and use automation tools
Back-of-the-envelope Estimation
Back-of-the-envelope estimation means producing a rough calculation with very simple, quick methods; the goal is not precision but to gauge an order of magnitude or judge whether an approach is feasible.
The term comes from the literal image of a calculation scribbled on the back of an envelope, and it is common in system design, business analysis, and engineering.
A typical interview prompt:
"You are designing a video upload service. Roughly how much data is written per day?"
You might reason:
- Assume 1% of 100 million daily users upload a video
- Each video averages 100 MB
- That gives: 1,000,000 × 100 MB ≈ 100 TB/day
This is a typical back-of-the-envelope estimate.
Example: Estimate Twitter QPS and storage requirements
Please note the following numbers are for this exercise only as they are not real numbers from Twitter:
Assumptions:
- 300 million monthly active users.
- 50% of users use Twitter daily.
- Users post 2 tweets per day on average.
- 10% of tweets contain media.
- Data is stored for 5 years.
Estimations:
- Query per second (QPS) estimate:
  - Daily active users (DAU) = 300 million × 50% = 150 million
  - Tweets QPS = 150 million × 2 tweets / 24 hours / 3600 seconds ≈ 3500
  - Peak QPS = 2 × QPS ≈ 7000
- We will only estimate media storage here.
  - Average tweet size:
    - tweet_id: 64 bytes
    - text: 140 bytes
    - media: 1 MB
  - Media storage per day = 150 million × 2 × 10% × 1 MB = 30 million MB = 30 TB per day
  - 5-year media storage = 30 TB × 365 × 5 ≈ 55,000 TB ≈ 55 PB
Framework for SD interview
If you are asked to design a news feed system:
step1-Understand the problem and establish design scope
"Design scope" refers to the boundaries, coverage, and goals of what you are designing.
More concretely: design scope = what you will design, what you will not design, which problems you are solving, and which scenarios you need to consider.
One of the most important skills as an engineer is to ask the right questions, make the proper assumptions, and gather all the information needed to build a system.
So, do not be afraid to ask questions:
- Is this a mobile app? Or a web app? Or both?
- What are the most important features for the product?
- How many friends can a user have?
- What is the traffic volume?
step2-Propose high-level design and get buy-in
"Buy-in" is a term from business and product development meaning that the key stakeholders (teammates, management, customers) accept, support, and commit to a proposal or design. When they agree to move forward with development based on your design, they have given their buy-in.
we aim to develop a high-level design and reach an agreement with the interviewer on the design:
- Come up with an initial blueprint for the design. Ask for feedback. Treat your interviewer as a teammate and work together. Many good interviewers love to talk and get involved.
- Draw box diagrams with key components on the whiteboard or paper. This might include clients (mobile/web), APIs, web servers, data stores, cache, CDN, message queue, etc.
- Do back-of-the-envelope calculations to evaluate if your blueprint fits the scale constraints. Think out loud. Communicate with your interviewer if back-of-the-envelope is necessary before diving into it.
feed publishing:
news feed building:
step3-Design deep dive
Sometimes, for a senior candidate interview, the discussion could be on the system performance characteristics, likely focusing
on the bottlenecks and resource estimations. In most cases, the interviewer may want you to
dig into details of some system components.
- For URL shortener, it is interesting to dive into the hash function design that converts a long URL to a short one.
- For a chat system, how to reduce latency and how to support online/offline status are two interesting topics.
step4-Wrap up
In this final step, the interviewer might ask you a few follow-up questions or give you the freedom to discuss other additional points. Here are a few directions to follow:
- The interviewer might want you to identify the system bottlenecks and discuss potential improvements
- Error cases (server failure, network loss, etc.) are interesting to talk about.
- Operational issues are worth mentioning. How do you monitor metrics and error logs? How do you roll out the system (deploy it gradually and make it available to users)?
- How to handle the next scale curve (the growth in resource needs, performance bottlenecks, and architectural complexity as the system scales) is also an interesting topic.
- For example, if your current design supports 1 million users, what changes do you need to make to support 10 million users?
Time allocation on each step
- Step 1 Understand the problem and establish design scope: 3 - 10 minutes
- Step 2 Propose high-level design and get buy-in: 10 - 15 minutes
- Step 3 Design deep dive: 10 - 25 minutes
- Step 4 Wrap up: 3 - 5 minutes
Layer 4 vs Layer 7 load balancing
Layer 4 (transport-layer load balancing)
- Works at: the TCP/UDP layer (IP address + port)
- Principle: looks only at IP, port, and protocol; never inspects application data.
- Forwarding: like a router, it receives a packet and forwards it by rule, without caring whether the payload is HTTP, gRPC, or MySQL.
Pros:
- Fast forwarding, high performance (low overhead)
- Supports non-HTTP protocols (e.g. MySQL, Redis, gRPC, WebSocket)
Cons:
- Can only route on the simple 4-tuple (source/destination IP + port)
- Cannot route on application-layer information such as URL, Cookie, or Header
👉 Typical uses: database clusters (MySQL, Redis), gRPC, arbitrary TCP services, VoIP, game servers.
Layer 7 (application-layer load balancing)
- Works at: the application layer (HTTP/HTTPS, gRPC, SSE, etc.)
- Principle: can parse the request content (URL, Host, Cookie, Header, Method, ...).
- Forwarding: routes intelligently based on application-layer information.
Pros:
- Supports routing by URL/path/domain (e.g. /api to service A, /static to service B)
- Can do SSL termination, HTTP compression, and caching
- Supports A/B testing, canary releases, and multi-version routing
Cons:
- Must parse HTTP, so the overhead is higher than L4
- Handles only application-layer protocols such as HTTP/HTTPS, WebSocket, and gRPC
👉 Typical uses: web applications, microservice gateways, API Gateway, CDN edge load balancing.
Summary comparison
Feature | L4 LB | L7 LB |
---|---|---|
Layer | TCP/UDP | HTTP/HTTPS (application layer) |
Performance | High | Lower |
Protocol support | Any TCP/UDP | Mainly HTTP/HTTPS, gRPC, WebSocket |
Routing capability | By IP/port | By URL/Host/Header/Cookie |
Use cases | Databases, TCP services, games, VoIP | Web/API services, microservice routing, CDN, canary releases |
👉 The simplest way to put it:
- L4 = fast traffic splitting (does not understand the content)
- L7 = smart traffic splitting (understands the content)
- If you use WebSockets in your interview design, you will likely want an L4 load balancer; for everything else, a Layer 7 load balancer is probably a better fit.
- A Layer 7 load balancer is ideal for HTTP-based traffic and covers every protocol we have discussed so far, except WebSockets.
Special uses of Redis
Redis for Proximity Search
Redis natively supports geospatial indexes with commands like GEOADD and GEOSEARCH. The basic commands are simple:
GEOADD key longitude latitude member   # Adds "member" to the index at key "key"
GEOSEARCH key FROMLONLAT longitude latitude BYRADIUS radius m ASC   # Returns members within "radius" meters of the point
The search command, in this instance, runs in O(N+log(M)) time where N is the number of elements in the radius and M is the number of items inside the shape.
Redis for Pub/Sub
Redis natively supports a publish/subscribe (Pub/Sub) messaging pattern, allowing messages to be broadcast to multiple subscribers in real time. This is useful for building chat systems, real-time notifications, or any scenario where you want to decouple message producers from consumers (more discussion on this in our Realtime Updates pattern).
People frequently call out limitations of Redis Pub/Sub that are no longer true; for example, Redis Pub/Sub is now sharded, which enables scalability that was not possible in previous versions.
The basic commands are straightforward:
SPUBLISH channel message   # Sends a message to all subscribers of 'channel' (the S prefix means "sharded")
SSUBSCRIBE channel         # Subscribes the current connection to the sharded channel
When a client subscribes to a channel, it will receive any messages published to that channel as long as the connection remains open. This makes Pub/Sub great for ephemeral, real-time communication, but it’s important to note that messages are not persisted—if a subscriber is offline when a message is published, it will miss that message entirely.
Pub/Sub clients use a single connection to each node in the cluster (rather than a connection per channel). Generally speaking, this means that in most cases you’ll use a number of connections equal to the number of nodes in your cluster. It also means that you don’t need millions of connections even if you have millions of channels!
Redis Pub/Sub is simple and fast, but not durable. The delivery of messages is “at most once” which means that if a subscriber is offline when a message is published, it will miss that message entirely. If you need message persistence, delivery guarantees, or the ability to replay missed messages, consider using Redis Streams or a dedicated message broker like Kafka or RabbitMQ.
Pub/Sub is a great fit for interview scenarios where you need to demonstrate real-time communication patterns, but be ready to discuss its limitations and when you might need a more robust solution.
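A minimal Java sketch of the pattern, assuming the Jedis client and the classic (non-sharded) PUBLISH/SUBSCRIBE commands; the channel name and message are illustrative:

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.JedisPubSub;

// Minimal Redis Pub/Sub sketch using the Jedis client.
// Delivery is at most once: offline subscribers miss the message entirely.
public class PubSubDemo {
    public static void main(String[] args) throws InterruptedException {
        // Subscriber: subscribe() blocks, so run it on its own thread with its own connection.
        Thread subscriber = new Thread(() -> {
            try (Jedis jedis = new Jedis("localhost", 6379)) {
                jedis.subscribe(new JedisPubSub() {
                    @Override
                    public void onMessage(String channel, String message) {
                        System.out.println("Received on " + channel + ": " + message);
                    }
                }, "chat:room1");
            }
        });
        subscriber.setDaemon(true); // let the demo JVM exit; a real service keeps this running
        subscriber.start();
        Thread.sleep(500); // crude wait until the subscription is registered

        // Publisher: a separate connection broadcasts to all current subscribers.
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            jedis.publish("chat:room1", "hello from the publisher");
        }
        Thread.sleep(500); // give the subscriber a moment to print before exiting
    }
}
```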
Special Application Layer Protocols
Technology | Direction | Protocol | Best-fit scenarios |
---|---|---|---|
SSE | One-way (Server → Client) | HTTP | Simple real-time push (notifications, stock prices, logs) |
WebSocket | Two-way | TCP | Chat, messaging systems, collaborative apps |
WebRTC | Two-way (peer-to-peer) | UDP (SRTP, SCTP) | Real-time audio/video, P2P data transfer |
Unique IDs generation
Here’s a complete production-ready design for generating unique IDs, suitable for platforms like Twitter or Instagram, focused on global uniqueness, scalability, and reliability — no code, just design:
- Globally unique IDs
- High throughput, high concurrency
- Horizontally scalable
- Roughly time-ordered (ascending)
- Resilient to clock issues
- Sharding-friendly
ID Structure (Based on Twitter Snowflake - 64 bits)
| 1 bit | 41 bits Timestamp | 5 bits Datacenter ID | 5 bits Machine ID | 12 bits Sequence |
Field Breakdown:
- 1 bit: Sign bit, always 0 (because long is signed)
- 41 bits: Timestamp in milliseconds (can support ~69 years)
- 5 bits: Datacenter ID (up to 32 data centers)
- 5 bits: Machine ID (up to 32 machines per DC)
- 12 bits: Sequence number (up to 4096 IDs per millisecond per machine)
This ensures IDs are globally unique, roughly ordered, and can be generated independently without coordination.
Assigning Datacenter and Machine IDs:
- Use a configuration service (e.g. Nacos, Consul) or service registry (e.g. ZooKeeper) to assign and manage IDs for each machine.
- Ensure no duplication across machines.
Handling Clock Skew / Time Rollback
Primary Strategy:
- If current time < last generated time:
- If small offset (< 5ms): wait until time catches up
- If large offset: log alert and fail the request
Preventative Measures:
- Disable automatic time sync (e.g., via BIOS/NTP)
- Use “slew” mode NTP to avoid sudden clock jumps
- Deploy in HA mode with backup ID generators
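A minimal Java sketch of a Snowflake-style generator following the bit layout and clock-rollback handling described above; the custom epoch and the 5 ms wait threshold are illustrative choices:

```java
// Minimal Snowflake-style ID generator following the 1+41+5+5+12 bit layout above.
// The custom epoch and the 5 ms clock-rollback threshold are illustrative.
public class SnowflakeIdGenerator {
    private static final long CUSTOM_EPOCH = 1577836800000L; // 2020-01-01T00:00:00Z
    private static final long MAX_SEQUENCE = (1L << 12) - 1; // 4096 IDs per ms per machine

    private final long datacenterId; // 0..31
    private final long machineId;    // 0..31
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public SnowflakeIdGenerator(long datacenterId, long machineId) {
        this.datacenterId = datacenterId;
        this.machineId = machineId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();

        if (now < lastTimestamp) {
            // Clock moved backwards: wait out small offsets, reject large ones.
            if (lastTimestamp - now < 5) {
                now = waitUntil(lastTimestamp);
            } else {
                throw new IllegalStateException("Clock moved backwards, refusing to generate ID");
            }
        }

        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQUENCE;
            if (sequence == 0) {                 // sequence exhausted for this millisecond
                now = waitUntil(lastTimestamp + 1);
            }
        } else {
            sequence = 0;
        }

        lastTimestamp = now;
        return ((now - CUSTOM_EPOCH) << 22)  // 41 bits of timestamp
             | (datacenterId << 17)          // 5 bits datacenter
             | (machineId << 12)             // 5 bits machine
             | sequence;                     // 12 bits sequence
    }

    private static long waitUntil(long timestamp) {
        long now = System.currentTimeMillis();
        while (now < timestamp) {
            now = System.currentTimeMillis();
        }
        return now;
    }

    public static void main(String[] args) {
        SnowflakeIdGenerator gen = new SnowflakeIdGenerator(1, 1);
        System.out.println(gen.nextId());
        System.out.println(gen.nextId());
    }
}
```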
Alternative ID Schemes:
Method | Characteristics | Suitable For |
---|---|---|
Snowflake-style | Local generation, ordered, scalable | High-scale backend systems |
UUID + DB check | Simple but less performant | Small-scale systems |
Leaf (Meituan) | DB-segment-based allocation | DB-centric systems |
MongoDB ObjectId | Built-in time/machine info | Document stores like MongoDB |
How to choose DB
Many candidates trip themselves up by trying to insert a comparison of relational and NoSQL databases into their answer. The reality is that these two technologies are highly overlapping and broad statements like “I need to use a relational database because I have relationships in my data” (NoSQL databases can work great for this) or “I’ve gotta use NoSQL because I need scale and performance” (relational databases, used correctly, perform and scale incredibly well) are often yellow flags that reveal inexperience.
Here’s the truth: most interviewers don’t need an explicit comparison of SQL and NoSQL databases in your session and it’s a pothole you should completely avoid. Instead, talk about what you know about the database you’re using and how it will help you solve the problem at hand. If you’re asked to compare, focus on the differences in the databases you’re familiar with and how they would impact your design. So “I’m using Postgres here because its ACID properties will allow me to maintain data integrity” is a great opener.
Feature | MongoDB | Cassandra | DynamoDB | PostgreSQL |
---|---|---|---|---|
Category | NoSQL → document store | NoSQL → wide-column store | NoSQL → key-value/document store | Relational database (RDBMS) |
Data model | JSON documents, flexible schema | Row → dynamic columns (wide column), Bigtable model | JSON-like items, PK+SK | Fixed table schema, rows/columns |
Scalability | Sharding supported; powerful but operationally complex | Extremely strong horizontal scaling, decentralized with no single point | Managed service, auto-scaling | Mostly vertical (horizontal only via manual sharding/partitioning) |
Write performance | High (good for bulk document writes) | Extremely high (append-only, write-optimized) | High (but costly) | Medium (strong transactions, write amplification) |
Read performance | Flexible queries, rich indexes | Fast by partition key + clustering key, strong range queries | Fast by PK+SK, limited complex queries | Strong (flexible SQL, rich indexes) |
Transactions | Strong within a single document, weak across documents | Weak (eventual consistency, lightweight transactions) | Weak (single-item transactions, eventual consistency) | Strong (full ACID) |
Query capability | Flexible (rich query/filter) | Limited (tables must be modeled around query patterns) | Limited (depends on PK+SK; secondary indexes cost extra) | Strongest (SQL + joins + aggregation) |
Typical scenarios | CRUD-heavy apps, real-time analytics, JSON data | Massive write-heavy workloads, time series, message streams (e.g. Twitter timeline / Facebook feed) | User session tables, KV storage, simple high-concurrency lookups | Core transactions, back-office admin, complex analytics |
Operations | Self-hosted clusters require experience | Self-hosting is complex and needs dedicated ops | AWS-managed, least effort | Simple, mature ecosystem |
Cost | Open source; self-hosted hardware cost | Open source; self-hosted hardware cost | Expensive (pay per read/write; very costly at high traffic) | Open source; cheap to self-host |
📊 One-sentence summaries
- DynamoDB → managed cloud KV/document store; good for medium-to-high concurrency with a stable schema, but expensive.
- MongoDB → flexible document database; good for medium-scale data and frequently changing schemas.
- Cassandra → first choice for massive writes and time-series/log-style systems; scales horizontally extremely well, but query patterns are limited.
- PostgreSQL → the classic relational database; strongest at transactions, SQL queries, and aggregation/analytics, but limited scalability.
When to use MongoDB / DynamoDB?
- The data structure is flexible and fields change frequently; you want a NoSQL counterpart to MySQL
- For example, if a single workload writes 100 million records per day, pick MongoDB, because DynamoDB would be very expensive; a small or mid-sized company that does not want to operate its own database and writes on the order of tens of millions of records per day can pick DynamoDB
- You need complex queries or aggregation
- The data volume is not hyperscale-distributed, or you can accept scaling out via sharding
- You need transactions or strong consistency
When to use Cassandra?
- High-concurrency, high-throughput writes, but comparatively weak reads
- Ideal for extremely high-volume log-style systems or message streams (e.g. Twitter timeline / Facebook feed)
- You can tolerate eventual consistency (logs, sensor data)
- You need high availability and high scalability (across data centers)
- You want to avoid single points of failure
Quick summary:
- MongoDB / DynamoDB are more like a NoSQL version of MySQL: developer-friendly and feature-rich, good for data-driven applications.
- Cassandra is a write-throughput monster, suited to distributed, massive-write scenarios with simple queries.
Which one to choose depends on whether your workload is write-heavy or read-heavy, whether the data structure is fixed, and whether you need strong consistency or high availability.
An example of the biggest modeling difference between Cassandra, DynamoDB, and PostgreSQL
Say we record user login logs:
📊 Cassandra (wide-column model)
- One row = one user
- Columns = one per login (clustering key = login_time), all hanging off the same row:
Row(user_id=123):
  login_time=t1 → {ip, device}, login_time=t2 → {ip, device}, ...
👉 Querying the latest N logins: the columns are stored in order, so a slice is fast.
👉 Sparse storage and append-only writes are very efficient.
👉 Downside: the data model must be designed around the query patterns (you cannot just join), otherwise querying becomes painful.
📊 PostgreSQL
- One row = one login
- Fixed table schema:
user_id | login_time | ip | device
👉 Pros: SQL is general-purpose and supports complex queries.
👉 Cons: with heavy writes the table grows without bound and needs partitioning/sharding.
📊 DynamoDB
- One row = one login (same as PostgreSQL)
- But storage is a JSON-like item, and each item is independent:
{ "user_id": "123", "login_time": "...", "ip": "...", "device": "..." }
👉 Pros: lookups by user_id + login_time are fast, and horizontal scaling is simple.
👉 Cons: every write is a separate item insert, and reads/writes are billed per request, so cost is high.
📊 Summary comparison
Feature | PostgreSQL | DynamoDB | Cassandra |
---|---|---|---|
Storage model | One row = one login | One row = one login | One row = one user, columns = login records |
Scalability | Poor; needs manual sharding/partitioning | AWS-managed, auto-scaling | Strong horizontal scaling, self-hosted |
Query pattern | Flexible SQL | Constrained by PK+SK design | Based on partition key + clustering key |
Write cost | Cheap storage | Billed per write, expensive | Append-only writes, efficient and cheap |
Typical fit | Small/medium scale, flexible queries | Medium scale, simple high-concurrency lookups | Massive writes, time series / log streams |
✅ So:
- PostgreSQL / DynamoDB → every login inserts a new row/item.
- Cassandra → all of a user's logins live in the same row (the row can grow arbitrarily wide; columns are appended dynamically).
Algorithms to fetch nearby stuff
This kind of algorithm is the key to designing an efficient LBS (location-based service), i.e. finding nearby businesses for a given location and radius.
Geohash
Geohash is a spatial indexing method that encodes latitude/longitude coordinates into a string. It supports fast lookup, sorting, and sharding of geographic locations, and is widely used in map services, LBS, and database geo indexes.
Note: Redis's GEOADD and GEOSEARCH commands are built on this algorithm.
It works by reducing the two-dimensional longitude and latitude data into a one-dimensional string of letters and digits. Geohash algorithms work by recursively dividing the world into smaller and smaller grids with each additional bit.
Geohash interleaves longitude and latitude bits produced by recursive binary subdivision, then maps the bits to a Base32 string:
- Longitude range: [-180, 180]; latitude range: [-90, 90]
- At each step, split the current longitude (or latitude) range in half and emit 0 or 1 depending on which half the target coordinate falls into
- Interleave the bits, alternating one longitude bit and one latitude bit, into a single binary string
- Convert every 5 bits into one Base32 character
Example:
Coordinates (39.92324, 116.3906) → Geohash wx4g0ec1 (the 8-character code used in the examples below)
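A minimal Java sketch of the encoding steps above (recursive bisection, bit interleaving, Base32 mapping):

```java
// Minimal Geohash encoder: binary-subdivide longitude and latitude, interleave
// the bits (longitude first), and map every 5 bits to a Base32 character.
public class GeohashEncoder {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double latitude, double longitude, int precision) {
        double latMin = -90, latMax = 90, lonMin = -180, lonMax = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bit positions encode longitude, odd ones latitude
        int bit = 0, charIndex = 0;

        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (lonMin + lonMax) / 2;
                if (longitude >= mid) { charIndex = (charIndex << 1) | 1; lonMin = mid; }
                else                  { charIndex = charIndex << 1;       lonMax = mid; }
            } else {
                double mid = (latMin + latMax) / 2;
                if (latitude >= mid) { charIndex = (charIndex << 1) | 1; latMin = mid; }
                else                 { charIndex = charIndex << 1;       latMax = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // every 5 bits become one Base32 character
                hash.append(BASE32.charAt(charIndex));
                bit = 0;
                charIndex = 0;
            }
        }
        return hash.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode(39.92324, 116.3906, 8)); // expected: wx4g0ec1
    }
}
```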
Structural property 1: the longer the Geohash, the smaller the cell and the higher the precision:
Length | Approximate cell size |
---|---|
5 | ~4.9 km × 4.9 km |
7 | ~153 m × 153 m |
9 | ~5 m × 5 m |
Structural property 2: a shared prefix implies nearby areas (prefix matching):
- wx4g0ec1 and wx4g0ec2 represent two adjacent ~5 m cells
- So you can filter for "nearby" points by string prefix
Advantages:
- Indexable and sortable: the string can be indexed with a B+ tree in a database
- Easy to shard and aggregate: group or shard by prefix (e.g. in Redis, Elasticsearch, HBase)
- Supports prefix matching: fetching every location whose Geohash starts with wx4g0 returns all points in that area
Disadvantages:
- Boundary problem: two nearby locations can have discontinuous Geohashes (e.g. when they sit exactly on a cell boundary)
- Conversely, two positions can share a long prefix yet belong to different geohash cells
- Shape problem: the encoded cell is a rectangle, which does not exactly match a circular radius query
- Precision differs between the equator and the poles, because the longitude spacing distorts as latitude changes
Mitigation: expand to the neighboring cells and query them together.
Each Geohash cell has up to 8 neighbors (up/down/left/right plus diagonals):
[NW] [ N ] [NE]
[ W ] [ C ] [ E ]
[SW] [ S ] [SE]
Steps:
- At query time, first compute the Geohash of the target coordinate (e.g. wx4g0ec1)
- Then also query its 8 neighboring cells (e.g. wx4g0ec0, wx4g0ec2, wx4g0ebz, ...)
- Fetch the points in all of these cells together
- Finally, filter down to the true "nearby" results with an exact distance calculation (e.g. Haversine), as sketched below
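A minimal Java sketch of the Haversine distance used in that final filtering step (the mean Earth radius is a standard constant; the test coordinates are illustrative):

```java
// Great-circle distance (Haversine) used to filter candidates from the
// neighboring Geohash cells down to the truly "nearby" points.
public class Haversine {
    private static final double EARTH_RADIUS_METERS = 6_371_000; // mean Earth radius

    public static double distanceMeters(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                   * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_METERS * Math.asin(Math.sqrt(a));
    }

    public static void main(String[] args) {
        // Two illustrative points in Beijing, roughly a kilometer or two apart.
        System.out.println(distanceMeters(39.92324, 116.3906, 39.93, 116.40) + " meters");
    }
}
```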
Application scenarios:
- Nearby search (e.g. "find drivers within 3 km") using neighboring Geohash cells plus exact distance filtering
- Data sharding: split Redis keys or database tables by Geohash prefix
- Spatial aggregation: group location heat-map data by Geohash
- Trajectory compression: compress a trajectory's coordinate points into a Geohash sequence to save storage
Quadtree
A quadtree is a tree structure that recursively partitions two-dimensional space. It is widely used in geographic information systems (GIS), game maps, image compression, and spatial indexing. Its core idea is to split each region into four sub-regions, enabling efficient positioning and spatial queries.
Another popular solution is a quadtree. A quadtree is a data structure that is commonly used to partition a two-dimensional space by recursively subdividing it into four quadrants (grids) until the contents of the grids meet certain criteria.
Note that quadtree is an in-memory data structure and it is not a database solution. It runs on each LBS server, and the data structure is built at server startup time.
Each node represents a rectangular region and has at most four children; each sub-region is further split into NW, NE, SW, and SE quadrants until a termination condition is met (e.g. the node holds fewer points than a threshold, or the region reaches a minimum size).
The figure below explains the quadtree building process in more detail. The root node represents the whole world map. The root node is recursively broken down into 4 quadrants until no nodes are left with more than 100 businesses.
How to get businesses with quadtree?
- Build the quadtree in memory.
- After the quadtree is built, start searching from the root and traverse the tree, until we find the leaf node where the search origin is. If that leaf node has 100 businesses, return the node. Otherwise, add businesses from its neighbors until enough businesses are returned.
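A minimal in-memory point-quadtree sketch in Java illustrating the build and range-query ideas above; the capacity of 100 points per leaf mirrors the "100 businesses" example:

```java
import java.util.ArrayList;
import java.util.List;

// Minimal point quadtree: each node covers a rectangle and splits into four
// children (NW/NE/SW/SE) once it holds more than CAPACITY points.
public class QuadTree {
    private static final int CAPACITY = 100; // e.g. at most 100 businesses per leaf

    static class Point {
        final double x, y; // x = longitude, y = latitude
        Point(double x, double y) { this.x = x; this.y = y; }
    }

    static class Rect {
        final double minX, minY, maxX, maxY;
        Rect(double minX, double minY, double maxX, double maxY) {
            this.minX = minX; this.minY = minY; this.maxX = maxX; this.maxY = maxY;
        }
        boolean contains(Point p) {
            return p.x >= minX && p.x < maxX && p.y >= minY && p.y < maxY;
        }
        boolean intersects(Rect o) {
            return minX < o.maxX && maxX > o.minX && minY < o.maxY && maxY > o.minY;
        }
    }

    private final Rect boundary;
    private final List<Point> points = new ArrayList<>();
    private QuadTree[] children; // null while this node is still a leaf

    QuadTree(Rect boundary) { this.boundary = boundary; }

    void insert(Point p) {
        if (!boundary.contains(p)) return;
        if (children == null && points.size() < CAPACITY) {
            points.add(p);
            return;
        }
        if (children == null) subdivide();
        for (QuadTree child : children) child.insert(p);
    }

    private void subdivide() {
        double midX = (boundary.minX + boundary.maxX) / 2;
        double midY = (boundary.minY + boundary.maxY) / 2;
        children = new QuadTree[] {
            new QuadTree(new Rect(boundary.minX, midY, midX, boundary.maxY)), // NW
            new QuadTree(new Rect(midX, midY, boundary.maxX, boundary.maxY)), // NE
            new QuadTree(new Rect(boundary.minX, boundary.minY, midX, midY)), // SW
            new QuadTree(new Rect(midX, boundary.minY, boundary.maxX, midY))  // SE
        };
        for (Point p : points) {
            for (QuadTree child : children) child.insert(p); // push points down
        }
        points.clear();
    }

    // Range query: only nodes whose rectangle intersects the search area are visited.
    void query(Rect range, List<Point> out) {
        if (!boundary.intersects(range)) return; // pruned branch
        for (Point p : points) {
            if (range.contains(p)) out.add(p);
        }
        if (children != null) {
            for (QuadTree child : children) child.query(range, out);
        }
    }

    public static void main(String[] args) {
        QuadTree world = new QuadTree(new Rect(-180, -90, 180, 90)); // whole world map
        world.insert(new Point(116.3906, 39.92324));
        world.insert(new Point(116.40, 39.93));
        List<Point> found = new ArrayList<>();
        world.query(new Rect(116.0, 39.5, 117.0, 40.5), found);
        System.out.println("Found " + found.size() + " businesses in range");
    }
}
```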
Real-world quadtree example:
Why are lookups fast? A spatial query (e.g. finding points near a given point) visits only the nodes that intersect the query region:
- There is no need to scan every data point (unlike a linear search)
- Each level shrinks the search space (roughly by half per axis), so the average query complexity is close to O(log N)
- It behaves like a combination of binary search and spatial positioning
For example:
- To find drivers within 1 km, you only traverse the few child nodes that intersect that range
- Most nodes that do not intersect the region are pruned and never visited
Good fits (especially for Uber/LBS):
Scenario | Purpose |
---|---|
Nearby resource lookup | Find nearby vehicles, charging stations, orders, etc.; far more efficient than a full table scan |
Spatial index | Build an in-memory location index tree that supports dynamic updates and fast matching |
Spatial load sharding | Distribute data to different servers or service nodes by geographic location |
Hotspot detection and aggregation | High-density regions automatically get deeper subdivision, enabling fine-grained control and dispatch |
Geohash vs Quadtree
Feature | Geohash | Quadtree |
---|---|---|
Ease of explaining in an interview | Simple; recommended | Harder |
Storage form | Encoded as a string | Tree structure (four children per node) |
Query speed | Fast; good for approximate range matching, but boundaries need manual expansion | Faster; good for precise spatial search |
Insert/delete | Just recompute the encoding; easy but unstructured | Fast local updates |
Spatial structure | Fixed grid (Base32 cells) | Dynamic partitioning that adapts to density |
Ease of deployment | Simple; strings are easy to distribute | More complex; suited to in-memory indexing |
Blob Storage workflow
To upload:
- When clients want to upload a file, they request a presigned URL from the server.
- The server returns a presigned URL to the client, recording it in the database.
- The client uploads the file to the presigned URL.
- The blob storage triggers a notification to the server that the upload is complete and the status is updated.
To download:
- The client requests a specific file from the server and is returned a presigned URL.
- The client uses the presigned URL to download the file via the CDN, which proxies the request to the underlying blob storage.
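A minimal sketch of generating a presigned upload URL on the server, assuming the AWS SDK for Java v2 S3Presigner API; the bucket name, object key, and 15-minute expiry are illustrative:

```java
import java.net.URL;
import java.time.Duration;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;
import software.amazon.awssdk.services.s3.presigner.S3Presigner;
import software.amazon.awssdk.services.s3.presigner.model.PresignedPutObjectRequest;
import software.amazon.awssdk.services.s3.presigner.model.PutObjectPresignRequest;

// Minimal presigned-upload-URL sketch using the AWS SDK for Java v2.
// Assumes region/credentials are configured in the environment.
public class PresignedUrlService {
    public static URL createUploadUrl(String bucket, String objectKey) {
        try (S3Presigner presigner = S3Presigner.create()) {
            PutObjectRequest putRequest = PutObjectRequest.builder()
                    .bucket(bucket)
                    .key(objectKey)
                    .build();

            PutObjectPresignRequest presignRequest = PutObjectPresignRequest.builder()
                    .signatureDuration(Duration.ofMinutes(15)) // URL validity window
                    .putObjectRequest(putRequest)
                    .build();

            PresignedPutObjectRequest presigned = presigner.presignPutObject(presignRequest);
            return presigned.url(); // hand this back to the client for a direct upload
        }
    }

    public static void main(String[] args) {
        System.out.println(createUploadUrl("my-upload-bucket", "videos/raw/abc123.mp4"));
    }
}
```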
Features:
- Upload and Download Directly from the Client:
- Blob storage services allow you to upload and download blobs directly from the client. This is useful for applications that need to store and retrieve large blobs of data, like images or videos. Familiarize yourself with presigned URLs and how they can be used to grant temporary access to a blob – either for upload or download.
- Chunking:
- When uploading large files, it’s common to use chunking to upload the file in smaller pieces. This allows you to resume an upload if it fails partway through, and it also allows you to upload the file in parallel. This is especially useful for large files, where uploading the entire file at once might take a long time. Modern blob storage services like S3 support chunking out of the box via the multipart upload API.
Examples of blob storage services:
- The most popular blob storage services are Amazon S3, Google Cloud Storage, and Azure Blob. All of these services are designed to be fast, durable, and cost effective. They also have a range of features like versioning, lifecycle policies, and access control. They are used by companies like Netflix, Airbnb, and Spotify to store large blobs of data like images, videos, and files.
- If you don’t have experience with them, opt for S3 as it’s the most popular and widely understood by interviewers - even non-S3 platforms often have an S3-compatible API.
Video transcoding
Directed acyclic graph(DAG) model
To support different video processing pipelines and maintain high parallelism, it is important to add some level of abstraction and let client programmers define what tasks to execute.
For example, Facebook's streaming video engine uses a directed acyclic graph (DAG) programming model, which defines tasks in stages so they can be executed sequentially or in parallel [8].
video encodings:
Video transcoding architecture:
The DAG scheduler: splits a DAG graph into stages of tasks and puts them in the task queue in the resource manager.
resource manager:
optimizations
- parallelize video uploading:
- place upload centers close to users:
- After the message queue is introduced, the encoding module does not need to wait for the output of the download module anymore. If there are events in the message queue, the encoding module can execute those jobs in parallel.
- Pre-signed upload URL:
  - The client makes an HTTP request to the API servers to fetch the pre-signed URL, which grants access permission to the object identified in the URL. The term pre-signed URL is used when uploading files to Amazon S3.
  - Once the client gets the response, it uploads the video using the pre-signed URL.
Payment flow
Pay-in flow:
Payment Service Provider(PSP), PSP integration:
Reconciliation:
what if a customer clicks the “pay” button quickly twice?
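A common answer is to make the payment request idempotent with an idempotency key supplied by the client; a minimal sketch (all names are illustrative, and a real implementation would persist the key in a database or Redis with a unique constraint rather than an in-memory map):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Minimal idempotency-key sketch for the "pay button clicked twice" case.
public class PaymentService {
    private final Map<String, String> processed = new ConcurrentHashMap<>();

    // The client generates one idempotency key per checkout attempt and reuses it on retries.
    public String pay(String idempotencyKey, long amountCents) {
        return processed.computeIfAbsent(idempotencyKey, key -> {
            // Charge the PSP exactly once for this key; duplicates return the stored result.
            return chargePaymentProvider(amountCents);
        });
    }

    private String chargePaymentProvider(long amountCents) {
        // Placeholder for the real PSP call.
        return "payment-id-" + System.nanoTime();
    }

    public static void main(String[] args) {
        PaymentService service = new PaymentService();
        String first = service.pay("checkout-42", 1999);
        String second = service.pay("checkout-42", 1999); // double click: same key
        System.out.println(first.equals(second)); // true, charged only once
    }
}
```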
Distributed transaction - Saga
Problem: there is no guarantee that both updates will succeed.
If, for example, the wallet service node crashes after the first update has gone through but before the second update is done, the result is an incomplete transfer. The two updates need to be in a single atomic transaction.
Solution: Saga
It is easier to understand this with an example. The figure above shows the Saga workflow to transfer $1 from account A to account C:
- The top horizontal line shows the normal order of execution.
- The two vertical lines show what the system should do when there is an error.
- When it encounters an error, the transfer operations are rolled back and the client receives an error message.
How do we coordinate the operations? There are two ways to do it:
- Choreography. In a microservice architecture, all the services involved in the Saga distributed transaction do their jobs by subscribing to other services' events, so coordination is fully decentralized.
  - It can become hard to manage when there are many services.
- Orchestration (preferred). A single coordinator instructs all services to do their jobs in the correct order. The coordinator is indeed a potential single point of failure (SPOF), so how do we solve that?
- a. Make the Coordinator Stateless + Persistent Log: Store all Saga state transitions in a durable log (e.g., relational DB, Kafka, or a durable event store). If the coordinator crashes, a new instance can pick up from the log and resume. This makes the system crash-recoverable, even with a single coordinator.
- b. Leader election + replication: run multiple instances of the coordinator. Use leader election (e.g. via ZooKeeper or etcd) to ensure only one acts at a time. Replicate Saga state to the followers to enable fast failover.