System Design is a typical interview stage at many software companies like Amazon, Google, Uber, Tinkoff, Bolt etc.

This is an example of how one can approach the system design interview for an Online Training Portal like Coursera, Udemy etc.

Task

As a famous training center, we would like to allow our clients to take our courses online. We have both free and paid courses. Our clients fall into two categories: individuals, and employers who want to get their employees certified. We specialize in different scientific domains including IT, biology, physics etc.

Functional requirements

  1. paid and free courses
  2. each course may have a number of attached materials: videos, slides, articles, links
  3. a course might be divided into stages with intermediate tests
  4. a user may track their progress
  5. an employer may track the progress of all employees
  6. intermediate tests can be taken any number of times; once a user starts the final exam, it is available for 24h and is then evaluated
  7. we already have all videos recorded, so they should just be imported
  8. we would like to have some aggregated statistics (reporting)
  9. we want to allow a user to log in using a Facebook or Google account; in this case the user could publish their achievements in social networks
  10. we think that it is very important to support tablets
  11. an administrator should be able to add new courses in a couple of clicks, but a course must become available only after all its materials are loaded
  12. discounts and special offer programs

Non-functional requirements:

  1. number of courses is approximately 5000
  2. number of categories is 30
  3. each course has about 10 video recordings (from 2 to 30)
  4. one video recording is about 100-300 MB
  5. number of concurrent users: 1000, 2500 in peak hours
  6. total number of users: 100k
  7. focused on US and EU customers
  8. video streaming must be supported
  9. expected availability is 99.99%
  10. response time: a user transaction must take less than 2s
  11. scalability: the number of courses grows, we add 20-30 courses per month, and the system must support this growth
  12. UX is something that is really important

Solution

The solution consists of three parts: load calculations, overall design, and data model.

Load calculations

Let's calculate the course size. For each course we need a title, a description, and a list of lectures, each containing a title and links to video and/or text materials. With 10 recordings per course on average, we can assume 500 KB of data per course. So 5000 * 500 KB gives us 2.5 GB of course metadata. For videos, taking the upper bound of 300 MB per recording, we need 10 * 300 MB * 5000 = 15 TB of binary storage. On the network side, we have 2500 users in peak hours, each sending, let's say, one 1 KB request per second, which gives 2.5 MB per second. This looks like a relatively small system.
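The same estimate can be written down as a short script, which makes it easy to change an input and see the effect. This is only a sketch: decimal units (1 GB = 1000 MB), one request per user per second, and the upper bound of 30 new courses per month are assumptions, and the constant and function names are made up for illustration.

```python
# Back-of-the-envelope load estimate based on the requirements above.
# Decimal units are assumed (1 GB = 1000 MB = 1,000,000 KB).

COURSES = 5_000
VIDEOS_PER_COURSE = 10            # average from the requirements
VIDEO_SIZE_MB = 300               # upper bound of the 100-300 MB range
METADATA_PER_COURSE_KB = 500      # assumed size of course metadata
PEAK_USERS = 2_500
REQUEST_SIZE_KB = 1               # assumed request size
REQUESTS_PER_USER_PER_SEC = 1     # assumption: one request per user per second
NEW_COURSES_PER_MONTH = 30        # upper bound of the 20-30 range


def estimate_load() -> dict:
    """Return storage and traffic estimates as a dictionary."""
    metadata_gb = COURSES * METADATA_PER_COURSE_KB / 1_000_000
    video_tb = COURSES * VIDEOS_PER_COURSE * VIDEO_SIZE_MB / 1_000_000
    peak_mb_per_sec = PEAK_USERS * REQUESTS_PER_USER_PER_SEC * REQUEST_SIZE_KB / 1_000
    video_growth_gb_per_month = (
        NEW_COURSES_PER_MONTH * VIDEOS_PER_COURSE * VIDEO_SIZE_MB / 1_000
    )
    return {
        "metadata_gb": metadata_gb,
        "video_tb": video_tb,
        "peak_mb_per_sec": peak_mb_per_sec,
        "video_growth_gb_per_month": video_growth_gb_per_month,
    }


if __name__ == "__main__":
    for name, value in estimate_load().items():
        print(f"{name}: {value}")
```

Under these assumptions, video storage grows by well under 1% of the initial 15 TB per month, so the growth requirement does not change the sizing much.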

Overall design

I try to start with the personas who interact with the system. Here we have 4 personas: a learner, a company representative, an admin and an analyst. They use a bunch of interfaces to access the system.

We also want to fulfill the requirement on social networks usage, so we add them as a 3rd party component.

The next step is a first-level load balancer, which we need because of the availability requirement. If we assume that the API Gateway SLA and the EC2 SLA for our services are both 99.99%, then the overall availability of the two in series will be 0.9999 * 0.9999 ≈ 99.98%, which is below the requirement. To fix this we will need a second region (not shown on the scheme).
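The availability arithmetic can be checked with a few lines of code. This is a simplified sketch: it assumes that component failures are independent and that failover between regions is instant and perfect, which real deployments only approximate. The function names are made up for illustration.

```python
def serial_availability(*components: float) -> float:
    """Availability of a chain where every component must be up."""
    result = 1.0
    for availability in components:
        result *= availability
    return result


def parallel_availability(*replicas: float) -> float:
    """Availability when at least one independent replica must be up."""
    all_down = 1.0
    for availability in replicas:
        all_down *= 1.0 - availability
    return 1.0 - all_down


# One region: API Gateway and EC2 in series, both with a 99.99% SLA.
single_region = serial_availability(0.9999, 0.9999)

# Two such regions, either of which can serve the traffic.
two_regions = parallel_availability(single_region, single_region)

if __name__ == "__main__":
    print(f"single region: {single_region:.6%}")
    print(f"two regions:   {two_regions:.6%}")
```

A single region comes out just below the 99.99% target, while two independent regions come out above it under these assumptions.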

Also, as we have several roles, we need to authenticate the requests and route them to different internal services, hence the API Gateway and an AuthService.

Speaking of internal services, we follow a DDD approach and place a bunch of microservices, each with its own database. We have a user context (signup/signin/profile), a payment context, a learning context and a learning progression context.

The microservices approach is not a dogma here. With this amount of load we could go with a monolith and a single database without any issues, except maybe for the payment service, which has a different level of security concerns. However, as time to market (t2m) is usually a valid concern, I believe that going with separate services will decrease it. This is also a good point to ask whether we want to optimize for the cost of the solution or for development time.

The next bits of the system are notifications and statistics. From the requirements we know that tablet support is important, so we can conclude that we need to notify the users about different events: maybe a new course is published, or a signup completed successfully, or a purchase is completed etc. In order to do so we create a new notification service which can communicate with mail services, Google and Apple push notification services, SMS senders and others. The important thing here is that we do not want to lose any messages, so we employ a message queue. This is also useful for decoupling the system events from their processing. The statistics service, which generates the analytical data, is subscribed to the same queue.
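To illustrate the decoupling, here is a minimal in-memory sketch of a queue with fan-out to two subscribers. It is not a real broker: a production system would use a durable message queue, and the `Event` and `EventBus` names, as well as the event kinds, are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict


@dataclass(frozen=True)
class Event:
    kind: str      # e.g. "course_published", "purchase_completed"
    user_id: int


class EventBus:
    """In-memory stand-in for a message queue with one queue per subscriber."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Event]] = {}
        self._handlers: Dict[str, Callable[[Event], None]] = {}

    def subscribe(self, name: str, handler: Callable[[Event], None]) -> None:
        self._queues[name] = deque()
        self._handlers[name] = handler

    def publish(self, event: Event) -> None:
        # Every subscriber gets its own copy, so a slow or failing
        # consumer does not affect the others.
        for queue in self._queues.values():
            queue.append(event)

    def drain(self, name: str) -> int:
        """Deliver pending events to one subscriber and return the count."""
        queue = self._queues[name]
        handler = self._handlers[name]
        delivered = 0
        while queue:
            handler(queue[0])   # if the handler raises, the event stays queued
            queue.popleft()
            delivered += 1
        return delivered
```

The producer only calls `publish` and does not know who consumes the event. The notification service and the statistics service each drain their own queue at their own pace.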

Last, but not least, is video streaming. As we will have people from multiple locations accessing our videos, we need a proper streaming mechanism. We are going to pick HLS for that purpose and leverage it at the CDN level to cache the content at edge locations. More on CDNs here.
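HLS works well with a CDN because a video is served as plain files over HTTP: a master playlist that lists the available quality levels, a media playlist per quality level, and the video segments themselves. The sketch below builds a master playlist. The rendition names, bitrates and the `index.m3u8` file layout are assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Rendition(NamedTuple):
    name: str         # hypothetical directory name, e.g. "720p"
    bandwidth: int    # peak bitrate in bits per second
    resolution: str   # "WIDTHxHEIGHT"


def master_playlist(renditions: List[Rendition]) -> str:
    """Build an HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for r in renditions:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth},RESOLUTION={r.resolution}"
        )
        lines.append(f"{r.name}/index.m3u8")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(master_playlist([
        Rendition("360p", 800_000, "640x360"),
        Rendition("720p", 2_800_000, "1280x720"),
    ]))
```

The player picks a variant based on the available bandwidth, and every file it requests afterwards can be cached at the edge like any other static asset.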

Data Model

The data model is pretty straightforward: we need to save info about companies, users, the link between the two (including roles), courses, lectures and tests. We also need to know about payments.
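One possible shape of these entities is sketched below as plain data classes. All field names are assumptions made for illustration, and since each service owns its database, the real tables would be split across services accordingly.

```python
from dataclasses import dataclass, field
from decimal import Decimal
from enum import Enum
from typing import List, Optional


class Role(Enum):
    LEARNER = "learner"
    COMPANY_REPRESENTATIVE = "company_representative"
    ADMIN = "admin"
    ANALYST = "analyst"


@dataclass
class Company:
    id: int
    name: str


@dataclass
class User:
    id: int
    email: str


@dataclass
class Membership:
    """Link between a user and a company, including the user's role."""
    user_id: int
    company_id: int
    role: Role


@dataclass
class Lecture:
    id: int
    title: str
    video_url: Optional[str] = None
    text_url: Optional[str] = None


@dataclass
class Assessment:
    """An intermediate test or the final exam of a course."""
    id: int
    title: str
    is_final: bool = False


@dataclass
class Course:
    id: int
    title: str
    description: str
    price: Decimal = Decimal("0")
    lectures: List[Lecture] = field(default_factory=list)
    assessments: List[Assessment] = field(default_factory=list)

    @property
    def is_free(self) -> bool:
        return self.price == 0


@dataclass
class Payment:
    id: int
    user_id: int
    course_id: int
    amount: Decimal
```

Keeping the user-to-company link in a separate `Membership` record lets an individual learner exist without any company.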

 

The whole picture