Final Projects
• Each final-project team may have up to 2 students.
Task 1. Application Reliability Project:
- As a team, choose a web application or service.
- Apply SRE principles to ensure the application's reliability. This includes setting up
monitoring, alerting, defining SLIs/SLOs/SLAs, and incident management.
- Document the entire process, challenges faced, and how they were addressed.
Task 2. Infrastructure as Code Collaboration:
- Collaboratively design a complex infrastructure setup for a high-availability
application.
- Use tools like Terraform or Ansible to script the setup.
- Each team member should focus on different aspects/modules of the infrastructure.
- Integrate individual components into one coherent system.
Task 3. Automated Deployment Pipeline:
- Create a CI/CD pipeline for a sample application.
- Ensure that the pipeline includes stages for building, testing, and deploying the
application.
- Document roles and responsibilities of each team member in the pipeline setup and
management.
Task 4. Security Team Audit:
- Select a pre-existing infrastructure or application.
- As a team, perform a comprehensive security audit. Identify potential vulnerabilities
and suggest remediations.
- Create a detailed report on findings and proposed solutions.
Task 5. Capacity Planning Simulation:
- Simulate an application with growing user demand.
- As a team, predict infrastructure needs and scale the application accordingly. Test the
application's performance at various scales.
- Document the predictive strategies used and the results at each scale level.
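One possible predictive strategy for Task 5 is a straight-line trend fitted to past peak load, converted into an instance count with some headroom. The sketch below is only an illustration under stated assumptions: the function names, the sample history, the per-instance capacity, and the 70% target utilization are all hypothetical and not part of the assignment.

```python
import math
from typing import Sequence, Tuple


def fit_line(ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of ys against x = 0, 1, 2, ...; returns (slope, intercept)."""
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two observations to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def forecast_peak_load(history: Sequence[float], periods_ahead: int) -> float:
    """Extrapolate the fitted line `periods_ahead` periods past the last observation."""
    slope, intercept = fit_line(history)
    return slope * (len(history) - 1 + periods_ahead) + intercept


def instances_needed(peak_rps: float, rps_per_instance: float,
                     target_utilization: float = 0.7) -> int:
    """Instances required so that each one runs at or below the target utilization."""
    return math.ceil(peak_rps / (rps_per_instance * target_utilization))


# Hypothetical monthly peak requests per second.
history = [100, 120, 140, 160]
predicted = forecast_peak_load(history, periods_ahead=2)
print(f"predicted peak: {predicted:.0f} rps")
print(f"instances needed: {instances_needed(predicted, rps_per_instance=50)}")
```

A linear trend is the simplest choice; teams whose simulated demand grows exponentially or seasonally would need a different model, and the report should say which one was used and why.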
Task 6. SRE Tool Development:
- Identify a common challenge in the SRE domain that doesn't have a readily available
solution.
- As a team, develop a tool or script to address that challenge.
- Test the tool under different scenarios and document its efficiency and applications.
Note: The goal of these team projects is to foster collaboration, divide responsibilities,
and integrate individual efforts to achieve a common goal. Each project should culminate in a
presentation or a report that details the process, challenges, solutions, and learnings. Be sure to
reflect on the team's dynamics, communication, and collaborative efforts in the final
documentation.
Here are more detailed instructions for the first three team tasks:
1. Application Reliability Project:
Objective: Enhance the reliability of a web application/service through the application
of SRE principles.
Detailed Steps:
Selection & Breakdown:
- Choose a web application or service. This could be an open-source project or a mock
service you create.
- Break down the application into its main components (front-end, back-end, database,
etc.).
Monitoring & Alerting:
- Set up a monitoring stack using tools such as Prometheus, Grafana, or Nagios.
- Define key metrics to monitor for each component.
- Create alerting rules based on thresholds that could indicate potential issues.
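The idea of a threshold-based alerting rule can be sketched in a few lines. This is a simplified illustration, not the rule syntax of Prometheus or any other tool: the `AlertRule` class, the `should_fire` function, and the latency values are hypothetical. Requiring several consecutive breaches is one common way to avoid firing on a single noisy sample.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical rule: fire when a metric stays above a threshold."""
    name: str
    threshold: float
    min_consecutive: int  # breaching samples required in a row; assumed to be >= 1


def should_fire(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Return True if the most recent `min_consecutive` samples all exceed the threshold."""
    if len(samples) < rule.min_consecutive:
        return False
    recent = samples[-rule.min_consecutive:]
    return all(value > rule.threshold for value in recent)


# Illustrative values only: p95 latency in milliseconds, one sample per scrape.
latency_rule = AlertRule(name="HighP95Latency", threshold=500.0, min_consecutive=3)
print(should_fire(latency_rule, [120, 640, 130, 510, 530, 720]))  # True
print(should_fire(latency_rule, [120, 640, 130, 510, 480, 720]))  # False
```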
SLIs/SLOs/SLAs:
- Define SLIs for your application (e.g., response time, uptime).
- Set SLOs based on what you consider acceptable performance.
- Draft an SLA that communicates these performance benchmarks to users.
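To make the relationship between an SLI and an SLO concrete, the sketch below computes an availability SLI from request counts and the share of the error budget that remains. The function names and the numbers are assumptions chosen for illustration; a team would substitute its own SLI definition and target.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Availability SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo: float) -> float:
    """Fraction of the error budget left: 1.0 is untouched, 0.0 or below is exhausted."""
    allowed_failure = 1.0 - slo
    if allowed_failure <= 0:
        raise ValueError("SLO must be below 1.0 to leave an error budget")
    actual_failure = 1.0 - sli
    return 1.0 - actual_failure / allowed_failure


# Hypothetical month: 1,000,000 requests, 600 of them failed, SLO of 99.9%.
sli = availability_sli(good_requests=999_400, total_requests=1_000_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI = {sli:.4%}, error budget remaining = {remaining:.0%}")
```

With a 99.9% SLO the budget is 1,000 failed requests per million, so 600 failures consume 60% of it. An SLA would normally promise less than the internal SLO so that the team has room to react before users are owed anything.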
Incident Management:
- Simulate an incident (e.g., service downtime or database failure).
- Document the team's response strategy, including communication plans,
troubleshooting steps, and resolution.
- After resolving the incident, draft a postmortem detailing the incident, cause, resolution,
and preventive measures for the future.
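A postmortem usually includes a timeline with a few derived durations. The sketch below shows one way to compute them; the field names and the timestamps are hypothetical, and "time to resolve" is measured here from detection, which is only one of several conventions in use.

```python
from datetime import datetime, timedelta
from typing import Dict


def incident_durations(started: datetime, detected: datetime,
                       resolved: datetime) -> Dict[str, timedelta]:
    """Derive basic postmortem durations from three timeline points."""
    if not (started <= detected <= resolved):
        raise ValueError("expected started <= detected <= resolved")
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - detected,  # measured from detection
        "total_impact": resolved - started,
    }


# Hypothetical simulated database outage.
durations = incident_durations(
    started=datetime(2024, 1, 1, 10, 0),
    detected=datetime(2024, 1, 1, 10, 7),
    resolved=datetime(2024, 1, 1, 10, 45),
)
for label, value in durations.items():
    print(f"{label}: {value}")
```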
Deliverables:
- A detailed report that includes the application's breakdown, monitoring setup, defined
SLIs/SLOs/SLAs, and a postmortem report.
- A live demonstration showcasing the monitoring in action and the alerting mechanism.
2. Infrastructure as Code Collaboration:
Objective: Design and deploy a complex, high-availability infrastructure for an
application using Infrastructure as Code (IaC) tools.
Detailed Steps:
Application & Infrastructure Design:
- Choose or design a mock application requiring high availability.
- Architect a multi-tier infrastructure setup for the application considering redundancy,
failover, and scalability.
IaC Implementation:
- Use tools like Terraform, Ansible, or Chef.
- Break the infrastructure into modules or components. Assign each team member a
module to develop the IaC scripts.
Integration & Deployment:
- Integrate individual IaC modules to represent the complete infrastructure.
- Deploy the infrastructure and test the application's deployment on it.
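When modules are written by different team members, integration largely comes down to agreeing on which module depends on which. The sketch below is a tool-agnostic illustration in Python, not Terraform or Ansible syntax: the module names and the dependency map are hypothetical. It derives a valid apply order with the standard library's `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set

# Hypothetical modules: each key maps to the modules it depends on.
module_dependencies: Dict[str, Set[str]] = {
    "network": set(),
    "database": {"network"},
    "compute": {"network"},
    "load_balancer": {"compute"},
    "monitoring": {"compute", "database"},
}


def apply_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return an order in which every module comes after its dependencies.

    Raises graphlib.CycleError if the modules depend on each other in a loop.
    """
    return list(TopologicalSorter(dependencies).static_order())


print(apply_order(module_dependencies))
```

IaC tools typically resolve this ordering themselves from references between resources, so the value of writing the map down is mainly as a shared design document that reveals circular dependencies before the modules are integrated.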
Documentation:
- Document best practices followed, challenges faced, and solutions implemented during
the IaC development.
Deliverables:
- IaC scripts for each module and integrated infrastructure.
- A live demonstration of the infrastructure deployment and application setup.
- A detailed report on the design, implementation, and learnings.
3. Automated Deployment Pipeline:
Objective: Implement a robust CI/CD pipeline for a sample application, ensuring
seamless integration and deployment.
Detailed Steps:
Application Selection:
- Choose a sample application with multiple components (e.g., front-end, API, database).
Pipeline Design:
- Architect a CI/CD pipeline that includes stages for code integration, testing (unit,
integration, acceptance tests), and deployment.
Tool Selection & Setup:
- Use tools like Jenkins, GitLab CI, or CircleCI for pipeline setup.
- Define roles and responsibilities: for example, one member could handle testing while the
other handles deployment and post-deployment monitoring.
Implementation & Testing:
- Implement the pipeline and integrate the application's codebase.
- Simulate code changes and observe the pipeline's execution, ensuring the application is
built, tested, and deployed automatically.
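The core behavior expected of the pipeline, stages that run in order and stop at the first failure, can be sketched as follows. This is a conceptual model rather than configuration for Jenkins, GitLab CI, or CircleCI: `run_pipeline` and the stage functions are hypothetical stand-ins for real build, test, and deploy commands.

```python
from typing import Callable, List, Tuple

# A stage is a name plus an action that returns True on success.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order, stopping at the first failure.

    Returns the names of the stages that were actually executed.
    """
    executed: List[str] = []
    for name, action in stages:
        executed.append(name)
        if not action():
            print(f"{name}: FAILED, later stages skipped")
            break
        print(f"{name}: ok")
    return executed


# Stand-in actions; a real pipeline would invoke the build tool, test runner, and deploy script.
stages: List[Stage] = [
    ("build", lambda: True),
    ("test", lambda: False),  # simulate a failing test suite
    ("deploy", lambda: True),
]
print(run_pipeline(stages))
```

In this run the deploy stage never executes because the test stage fails, which is the property the live demonstration should show when a broken change is pushed.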
Deliverables:
- A working CI/CD pipeline.
- A live demonstration of a change in the codebase triggering the pipeline and the
subsequent stages.
- A report detailing the pipeline's design, tools used, challenges, and benefits observed.
General Requirements for All Projects:
1. Team Collaboration Tools: Platforms such as Slack, Teams, or Zoom for
communication.
2. Project Management Tools: Tools like Jira, Trello, or Asana for tracking progress,
assigning tasks, and ensuring deadlines are met.
3. Access to Cloud Services: Depending on the project, teams might need access to cloud
platforms like AWS, GCP, or Azure.
4. Documentation and Reporting: Ability to create detailed reports, which might require
knowledge of platforms/tools such as Google Docs, Microsoft Word, Confluence, or LaTeX.
5. Source Control: Familiarity with Git and platforms like GitHub or GitLab is essential
for versioning and collaboration.
Presentation & Documentation:
1. Document every stage of the process, challenges faced, and how they were addressed.
2. Prepare a comprehensive presentation detailing the end-to-end SRE implementation,
tools used, team dynamics, and key learnings.
3. Reflect on the overall collaboration, the division of responsibilities, and how individual
efforts were integrated to achieve the common goal.
Note: This unified project aims to combine all the facets of SRE into one cohesive
project. Each team should ideally have members focusing on specific aspects, ensuring a
thorough understanding and implementation of SRE principles across the board.
Final Evaluation Criteria
Title | Description | Weight (%)
Application Reliability Project | Apply SRE principles (monitoring, SLIs/SLOs, incident management) to a web app | 20%
Infrastructure as Code Collaboration | Design and deploy a scalable infrastructure using tools like Terraform or Ansible | 20%
Automated Deployment Pipeline | Create a CI/CD pipeline with build, test, and deploy stages | 10%
Security Team Audit | Audit an application or infrastructure, identify vulnerabilities, and suggest fixes | 10%
Capacity Planning Simulation | Predict scaling needs for an app under increasing load and simulate performance | 10%
SRE Tool Development | Develop a custom tool to solve a common SRE problem | 10%
Defense | Explanation of the project, answers to theoretical questions, quality and completeness of the presentation (structure, clarity, coverage of all required points) | 20%
Total | | 100%