Project Documentation: Monitoring and Logging with Prometheus and Grafana

1. Project Overview

This project involved setting up a monitoring and logging system using Prometheus and Grafana to monitor the performance and health of cloud infrastructure and applications. The system allows for real-time data collection, visualization, and alerting, enabling proactive management of resources.

2. Environment Setup

Server Provisioning:
- Cloud Provider: AWS EC2 instance
- Instance Type: t3.medium
- Operating System: Ubuntu 22.04
- Network Configuration: The instance was associated with a public subnet and had Security Groups configured to allow HTTP (port 80), HTTPS (port 443), and Grafana (port 3000).
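
The ingress rules described above could be created with the AWS CLI roughly as follows. This is a sketch rather than the commands used in the project: the security group ID and the source CIDR are hypothetical placeholders, and the function only prints the commands so they can be reviewed before anything is changed.

```shell
# Print (do not execute) the AWS CLI calls that open ports 80, 443, and 3000.
# SG_ID and SOURCE_CIDR are hypothetical placeholders, not project values.
SG_ID="${SG_ID:-sg-0123456789abcdef0}"
SOURCE_CIDR="${SOURCE_CIDR:-203.0.113.10/32}"

print_ingress_rules() {
  for port in 80 443 3000; do
    echo "aws ec2 authorize-security-group-ingress --group-id $SG_ID --protocol tcp --port $port --cidr $SOURCE_CIDR"
  done
}

print_ingress_rules
```

Once the printed commands look right, the output can be piped to `sh`. Limiting port 3000 to a known source range, instead of opening it to every address, keeps the Grafana login page off the public internet.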

3. Prometheus Installation and Configuration

Installation:

Prometheus v2.47.0, the latest release at the time of setup, was downloaded and extracted:

wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar -xvzf prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64
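
Before extracting a downloaded release it is worth checking its integrity. The helper below is a minimal sketch, assuming a `sha256sums.txt` file published alongside the tarball has been downloaded into the same directory; the function name `verify_sha256` is made up for this example.

```shell
# Verify a downloaded file against a sha256sums-style list.
# Returns 0 only if the file has an entry in the list and its checksum matches.
verify_sha256() {
  file="$1"
  sums="$2"
  # Keep only the line for this file, then let sha256sum check it.
  grep " $file\$" "$sums" | sha256sum -c - >/dev/null 2>&1
}

# Usage:
#   verify_sha256 prometheus-2.47.0.linux-amd64.tar.gz sha256sums.txt && echo "checksum OK"
```

If the file has no entry in the list, `sha256sum` receives empty input and exits non-zero, so a missing entry is treated as a failure rather than silently passing.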

Running Prometheus:
- Prometheus was started using the following command:
```
./prometheus --config.file=prometheus.yml
```
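
Running the binary in the foreground stops Prometheus when the shell session ends. One common alternative is a systemd unit, sketched below. The install directory `/opt/prometheus` and the `prometheus` service account are assumptions made for illustration; neither is described in this project.

```shell
# Print a minimal systemd unit for Prometheus to stdout.
# The install directory and the service user are assumptions for illustration.
render_prometheus_unit() {
  install_dir="${1:-/opt/prometheus}"   # hypothetical install location
  run_user="${2:-prometheus}"           # hypothetical service account
  cat <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=$run_user
ExecStart=$install_dir/prometheus --config.file=$install_dir/prometheus.yml --storage.tsdb.path=$install_dir/data
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF
}

# Usage:
#   render_prometheus_unit /opt/prometheus prometheus | sudo tee /etc/systemd/system/prometheus.service
#   sudo systemctl daemon-reload && sudo systemctl enable --now prometheus
```

Printing the unit to stdout keeps the function free of side effects, so the result can be inspected before it is written to `/etc/systemd/system`.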

Configuration:

The prometheus.yml file was configured to scrape Prometheus's own metrics endpoint on the local machine:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
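
Whether the scrape job is working can be checked through the Prometheus HTTP API, which reports a `health` field for each target. The helper below is a rough sketch that counts healthy targets with `grep`; it assumes the compact JSON that Prometheus emits, and a JSON parser such as `jq` would be more robust where available.

```shell
# Count scrape targets reported as healthy in a /api/v1/targets response.
# Reads the JSON from stdin; relies on Prometheus emitting compact JSON.
count_up_targets() {
  grep -o '"health":"up"' | wc -l | tr -d ' '
}

# Usage against the running server:
#   curl -s http://localhost:9090/api/v1/targets | count_up_targets
```

With only the self-scrape job configured, a healthy setup should report one target.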

4. Grafana Installation and Configuration

Installation:

Grafana OSS 10.0.0 was downloaded and installed from its Debian package, independently of Prometheus:

wget https://dl.grafana.com/oss/release/grafana_10.0.0_amd64.deb
sudo dpkg -i grafana_10.0.0_amd64.deb

Starting Grafana:

The Grafana service was started and enabled to run on boot:

sudo systemctl start grafana-server
sudo systemctl enable grafana-server
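
After the service starts, Grafana's `/api/health` endpoint gives a quick readiness signal. The check below is a small sketch; the function name is made up for this example, and it only looks at the `database` field of the response.

```shell
# Succeed if a Grafana /api/health response (read from stdin) reports database "ok".
grafana_db_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# Usage:
#   systemctl is-active grafana-server
#   curl -s http://localhost:3000/api/health | grafana_db_ok && echo "Grafana healthy"
```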

Accessing Grafana:
- Grafana was accessed via a web browser at http://your-server-ip:3000.
- The default login credentials (admin/admin) were used for the first login, and the password was changed upon initial access.

Adding Prometheus as a Data Source:
- Prometheus was added as a data source within Grafana:
  - Navigate to Configuration > Data Sources.
  - Select Prometheus and set the URL to http://localhost:9090.
  - Save and test the connection to confirm that it succeeds.
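
The same data source can also be declared in a provisioning file so that it survives a reinstall. This is an alternative to the UI steps above, not what was done in this project. The sketch below prints such a file; the target path in the usage comment is the default provisioning directory of the Debian package, and the file name `prometheus.yaml` is an arbitrary choice.

```shell
# Print a Grafana data source provisioning file for Prometheus to stdout.
render_datasource_yaml() {
  url="${1:-http://localhost:9090}"
  cat <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: $url
    isDefault: true
EOF
}

# Usage:
#   render_datasource_yaml | sudo tee /etc/grafana/provisioning/datasources/prometheus.yaml
#   sudo systemctl restart grafana-server
```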

5. Dashboard Creation

Creating a New Dashboard:
- A new dashboard was created in Grafana to visualize metrics:
  - Go to Dashboards > New Dashboard.
  - Add a panel, selecting a metric from Prometheus (e.g., CPU usage).
  - Customize the visualization and save the dashboard.

Dashboard Example:
- The dashboard was designed to display key metrics like CPU usage, memory usage, disk I/O, and network traffic, allowing for a comprehensive view of the system’s performance.
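
Panel queries can be tried from the command line before they are pasted into Grafana. The helper below is a sketch that sends an instant query to the Prometheus HTTP API; `PROM_URL` and `DRY_RUN` are names invented for this example. The sample queries use only metrics exposed by the self-scrape job. Host-level CPU, memory, disk, and network series would normally come from an exporter such as node_exporter, which this document does not describe.

```shell
PROM_URL="${PROM_URL:-http://localhost:9090}"

# Run an instant PromQL query against the HTTP API.
# With DRY_RUN=1 the curl command is printed instead of executed.
prom_query() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "curl -sG $PROM_URL/api/v1/query --data-urlencode query=$1"
  else
    curl -sG "$PROM_URL/api/v1/query" --data-urlencode "query=$1"
  fi
}

# Usage: CPU and memory of the Prometheus process itself
#   prom_query 'rate(process_cpu_seconds_total[5m])'
#   prom_query 'process_resident_memory_bytes'
```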

6. Alerting Setup in Prometheus

Configuration:

The prometheus.yml configuration was extended to point Prometheus at an Alertmanager instance and to load an alert rule file:

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "alert.rules.yml"

An example alert rule was created in the alert.rules.yml file:

groups:
- name: example
  rules:
  - alert: HighCPUUsage
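    # rate() turns the CPU-seconds counter into cores used; this covers the scraped Prometheus process only.
    expr: rate(process_cpu_seconds_total[5m]) > 0.85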
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "High CPU usage detected"
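
Both files can be validated with `promtool`, which ships in the same release tarball as the `prometheus` binary, before Prometheus is restarted. The wrapper below is a minimal sketch; the `PROMTOOL` variable and the function name are assumptions made for this example.

```shell
# Validate the main config and the rule file with promtool.
# PROMTOOL defaults to the binary in the extracted release directory.
check_monitoring_config() {
  promtool_bin="${PROMTOOL:-./promtool}"
  "$promtool_bin" check config "$1" && "$promtool_bin" check rules "$2"
}

# Usage:
#   check_monitoring_config prometheus.yml alert.rules.yml && echo "configuration OK"
```

Running the check first means a YAML or PromQL mistake surfaces as a readable error instead of a failed startup.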

Verification:
- Alerts were verified by stressing the system and checking that the alert was triggered and displayed in the Prometheus web UI.

7. Security and Maintenance

Security Measures:
- Basic security was implemented by setting up authentication for Grafana.
- The firewall was configured to restrict access to the monitoring services.
- HTTPS was considered for secure access, although it was not implemented in this basic setup.

Maintenance:
- Regular maintenance tasks were established, including:
  - Updating Prometheus and Grafana as new versions are released.
  - Backing up Grafana dashboards and Prometheus configuration files.
  - Reviewing and updating alert configurations based on the evolving infrastructure needs.
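
The backup task could be scripted along the following lines. This is a sketch under assumptions: the function name and the backup directory are hypothetical, and the Grafana paths in the usage comment are the defaults of the Debian package, which may differ from the actual installation.

```shell
# Archive the given files into a timestamped tarball under the destination directory.
# Prints the path of the archive that was written.
backup_monitoring_config() {
  dest="$1"
  shift
  mkdir -p "$dest" || return 1
  archive="$dest/monitoring-backup-$(date +%Y%m%d-%H%M%S).tar.gz"
  tar -czf "$archive" "$@" && echo "$archive"
}

# Usage (paths are assumptions; adjust to where the files actually live):
#   backup_monitoring_config /var/backups/monitoring \
#     prometheus.yml alert.rules.yml /etc/grafana/grafana.ini /var/lib/grafana/grafana.db
```

Dashboards created in the UI are stored in Grafana's database, so including `grafana.db` covers them when the default SQLite backend is in use.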

8. Testing and Outcome

System Testing:
- The monitoring and alerting setup was tested by inducing load on the server and verifying that metrics were collected and visualized correctly in Grafana.
- Alerts were triggered as expected when predefined conditions were met.

Final Outcome:
- The project was successfully completed, with a fully operational monitoring and logging system in place. The system provides real-time insights into the health and performance of the cloud infrastructure, enabling proactive management and rapid issue resolution.

9. Conclusion

This project provided hands-on experience in setting up and configuring a monitoring and logging system using Prometheus and Grafana. The skills gained are crucial for maintaining high availability, performance, and security in modern cloud environments. The setup is scalable and can be adapted to more complex infrastructures as needed.

Appendix

Useful Commands:
- Restart Prometheus (if installed as a systemd service): sudo systemctl restart prometheus
- Restart Grafana: sudo systemctl restart grafana-server
- View Prometheus logs (if installed as a systemd service): sudo journalctl -u prometheus.service -f
- View Grafana logs: sudo journalctl -u grafana-server -f

Resources:
- Prometheus Documentation
- Grafana Documentation