#9 - Monitoring, troubleshooting và bảo trì liên tục

Trong tám bài đầu tiên của series, chúng ta đã tìm hiểu về các khía cạnh khác nhau của tối ưu hóa cơ sở dữ liệu, từ thiết kế schema, indexing, transaction management đến tối ưu hóa cho các hệ quản trị cơ sở dữ liệu cụ thể. Tuy nhiên, tối ưu hóa không phải là một hoạt động một lần mà là một quá trình liên tục. Bài viết này sẽ tập trung vào các kỹ thuật và công cụ để giám sát, khắc phục sự cố và bảo trì cơ sở dữ liệu một cách chủ động.

Một hệ thống cơ sở dữ liệu hiệu quả cần được giám sát liên tục để phát hiện sớm các vấn đề tiềm ẩn, xác định chính xác nguyên nhân khi sự cố xảy ra, và duy trì hiệu năng tối ưu theo thời gian. Trong bài viết này, chúng ta sẽ khám phá các KPIs quan trọng, công cụ giám sát, kỹ thuật troubleshooting, và chiến lược bảo trì để đảm bảo cơ sở dữ liệu của bạn luôn hoạt động ở trạng thái tốt nhất.

1. Giám sát hiệu năng cơ sở dữ liệu

1.1 Key Performance Indicators (KPIs)

Để giám sát hiệu quả, bạn cần xác định rõ các chỉ số hiệu năng quan trọng phù hợp với hệ thống của mình:


  graph LR
    A[Database KPIs] --> B[
      Performance KPIs
      • Query Response Time• Throughput
      • Cache Hit Ratio
      • Index Usage
      • Deadlock Rate
      • Lock Wait Time
    ]
    A --> C[
      Availability KPIs
      • Uptime Percentage
      • Mean Time Between Failures
      • Mean Time To Recovery
    ]
    A --> D[
      Resource Utilization KPIs
      • CPU Utilization
      • Memory Usage
      • Disk I/O
      • Network Bandwidth
    ]
    A --> E[
      Business Impact KPIs
      • Application Response Time
      • User Experience Metrics
      • Business Transaction Completion]

Performance KPIs

Query Response Time: Thời gian trung bình, P95, P99 để hoàn thành queries

-- MySQL: Kiểm tra query response time
SELECT * FROM performance_schema.events_statements_summary_by_digest
ORDER BY avg_timer_wait DESC LIMIT 10;

-- PostgreSQL: Kiểm tra query response time
SELECT query, calls, total_time / calls as avg_time
FROM pg_stat_statements
ORDER BY avg_time DESC LIMIT 10;

Throughput: Số lượng queries hoặc transactions xử lý mỗi giây

-- MySQL: Kiểm tra throughput
SHOW GLOBAL STATUS LIKE 'Questions';
SHOW GLOBAL STATUS LIKE 'Com_select';
SHOW GLOBAL STATUS LIKE 'Com_insert';
SHOW GLOBAL STATUS LIKE 'Com_update';
SHOW GLOBAL STATUS LIKE 'Com_delete';

Cache Hit Ratio: Tỷ lệ dữ liệu được phục vụ từ cache

-- MySQL: Buffer pool hit ratio
SELECT (1 - (
  SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_reads'
) / (
  SELECT variable_value FROM performance_schema.global_status WHERE variable_name = 'Innodb_buffer_pool_read_requests'
)) * 100 AS buffer_pool_hit_ratio;

Resource Utilization KPIs

CPU Utilization: Mức sử dụng CPU
Memory Usage: Mức sử dụng RAM
Disk I/O: Throughput và latency của disk operations
Network Bandwidth: Lưu lượng mạng vào/ra

1.2 Công cụ giám sát

Có nhiều công cụ giám sát cơ sở dữ liệu, từ built-in tools đến third-party solutions:

Built-in Database Monitoring Tools

MySQL: Performance Schema, Information Schema, Slow Query Log
PostgreSQL: pg_stat_statements, pg_stat_activity, pg_stat_database
MongoDB: mongostat, mongotop, db.serverStatus()
Redis: INFO command, Redis Sentinel

Open-source Monitoring Tools

Prometheus + Grafana: Hệ thống giám sát và visualization mạnh mẽ

# prometheus.yml example
scrape_configs:
  - job_name: "mysql"
    static_configs:
      - targets: ["localhost:9104"] # MySQL exporter

Percona Monitoring and Management (PMM): Giám sát toàn diện cho MySQL, MongoDB, PostgreSQL
Zabbix: Giám sát hệ thống và cơ sở dữ liệu
Nagios: Giám sát và alerting

1.3 Thiết lập Dashboard hiệu quả

Một dashboard giám sát hiệu quả nên bao gồm:


  graph LR
    A[Database Monitoring Dashboard] --> B[
      Overview Panel
      • Key Metrics Summary
      • Health Status
    ]
    A --> C[
      Performance Panel
      • Query Performance
      • Throughput Graphs
      • Slow Queries
    ]
    A --> D[
      Resource Utilization Panel
      • CPU Usage
      • Memory Usage
      • Disk I/O
      • Network Traffic
    ]
    A --> E[
      Availability Panel
      • Uptime
      • Replication Status
      • Backup Status
    ]
    A --> F[
      Alerts Panel
      • Recent Alerts
      • Resolved Issues
    ]

2. Phân tích và xử lý slow queries

Slow queries là một trong những nguyên nhân phổ biến nhất gây ra vấn đề hiệu năng cơ sở dữ liệu.

2.1 Bật và cấu hình Slow Query Log

MySQL/MariaDB:

-- Bật slow query log
SET GLOBAL slow_query_log = 'ON';
SET GLOBAL long_query_time = 1;  -- Log queries slower than 1 second
SET GLOBAL slow_query_log_file = '/var/log/mysql/mysql-slow.log';
SET GLOBAL log_queries_not_using_indexes = 'ON';  -- Log queries not using indexes

PostgreSQL:

-- Bật slow query logging
ALTER SYSTEM SET log_min_duration_statement = 1000;  -- Log queries slower than 1000ms
ALTER SYSTEM SET log_directory = 'pg_log';
ALTER SYSTEM SET log_filename = 'postgresql-slow-%Y-%m-%d.log';
ALTER SYSTEM SET logging_collector = on;

-- Reload configuration
SELECT pg_reload_conf();

MongoDB:

// Bật profiling
db.setProfilingLevel(1, { slowms: 100 }); // Log operations slower than 100ms

// Kiểm tra profiling status
db.getProfilingStatus();

2.2 Phân tích Slow Queries

MySQL/MariaDB:

# Sử dụng pt-query-digest (Percona Toolkit)
pt-query-digest /var/log/mysql/mysql-slow.log

PostgreSQL:

-- Sử dụng pg_stat_statements
SELECT
    substring(query, 1, 100) AS short_query,
    round(total_time::numeric, 2) AS total_time,
    calls,
    round(mean_time::numeric, 2) AS mean_time,
    round((100 * total_time / sum(total_time::numeric) OVER ())::numeric, 2) AS percentage
FROM pg_stat_statements
ORDER BY total_time DESC
LIMIT 10;

2.3 Tối ưu Slow Queries

Sau khi xác định slow queries, áp dụng các kỹ thuật tối ưu:

Thêm indexes phù hợp:

-- MySQL: Thêm index cho query chậm
CREATE INDEX idx_large_table_non_indexed_column ON large_table(non_indexed_column);

-- PostgreSQL: Thêm index với INCLUDE
CREATE INDEX idx_orders_customer_date ON orders(customer_id) INCLUDE (order_date);

Viết lại queries:

-- Trước khi tối ưu
SELECT * FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC;

-- Sau khi tối ưu
SELECT o.id, o.order_date, o.total_amount, c.name, c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id
WHERE o.status = 'completed'
ORDER BY o.created_at DESC
LIMIT 100;

Sử dụng query hints:

-- MySQL: Sử dụng index hints
SELECT * FROM products USE INDEX (idx_category_price)
WHERE category_id = 5 AND price > 100;

3. Troubleshooting memory, CPU, disk I/O issues

3.1 Memory Issues

Phát hiện memory issues:

# Linux: Kiểm tra memory usage
free -m
vmstat 1
top

Các vấn đề phổ biến và giải pháp:

Buffer pool/cache quá nhỏ:

-- MySQL: Tăng buffer pool size
SET GLOBAL innodb_buffer_pool_size = 8589934592;  -- 8GB

-- PostgreSQL: Tăng shared buffers
ALTER SYSTEM SET shared_buffers = '2GB';

Swapping:

# Kiểm tra swap usage
swapon -s
vmstat 1

# Giảm swappiness
echo 10 > /proc/sys/vm/swappiness

3.2 CPU Issues

Phát hiện CPU issues:

# Linux: Kiểm tra CPU usage
top
mpstat -P ALL 1

Các vấn đề phổ biến và giải pháp:

CPU-intensive queries:

-- MySQL: Tìm queries đang chạy
SHOW PROCESSLIST;

-- PostgreSQL: Tìm queries đang chạy
SELECT pid, age(clock_timestamp(), query_start), usename, query
FROM pg_stat_activity
WHERE state != 'idle'
ORDER BY query_start desc;

Quá nhiều connections:

-- MySQL: Kiểm tra số lượng connections
SHOW STATUS LIKE 'Threads_connected';

-- PostgreSQL: Kiểm tra số lượng connections
SELECT count(*) FROM pg_stat_activity;

3.3 Disk I/O Issues

Phát hiện Disk I/O issues:

# Linux: Kiểm tra disk I/O
iostat -x 1
iotop

Các vấn đề phổ biến và giải pháp:

Random I/O quá nhiều:

-- MySQL: Tăng innodb_buffer_pool_size
SET GLOBAL innodb_buffer_pool_size = 8G;

-- PostgreSQL: Tăng shared_buffers và effective_cache_size
ALTER SYSTEM SET shared_buffers = '2GB';
ALTER SYSTEM SET effective_cache_size = '6GB';

Write-heavy workloads:

-- MySQL: Điều chỉnh innodb_flush_method và innodb_flush_log_at_trx_commit
SET GLOBAL innodb_flush_method = 'O_DIRECT';
SET GLOBAL innodb_flush_log_at_trx_commit = 2;  -- Cân nhắc giữa hiệu năng và durability

4. Alerting, thresholds và automated reports

4.1 Thiết lập Alerting System


  graph TD
    A[Monitoring System] -->|Threshold exceeded| B[Alert Manager]
    B -->|Email| C[DBA/DevOps]
    B -->|SMS| C
    B -->|Slack| C
    B -->|PagerDuty| C
    C -->|Acknowledge| D[Incident Response]
    D -->|Resolve| E[Post-Mortem Analysis]

Prometheus Alerting Rules:

groups:
  - name: database_alerts
    rules:
      - alert: HighCPUUsage
        expr: rate(node_cpu_seconds_total{mode="system"}[1m]) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      - alert: LowDiskSpace
        expr: node_filesystem_avail_bytes{mountpoint="/var/lib/mysql"} / node_filesystem_size_bytes{mountpoint="/var/lib/mysql"} * 100 < 10
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Less than 10% disk space available on {{ $labels.instance }} MySQL data directory"

4.2 Thresholds hợp lý

Một số ngưỡng cảnh báo phổ biến:

CPU Usage: > 80% trong 5 phút
Memory: < 10% free memory trong 5 phút
Disk Space: < 10% free space
Slow Queries: > 10 slow queries/phút
Connection Usage: > 80% max_connections
Replication Lag: > 30 giây

4.3 Automated Reports

Python script để tạo báo cáo hiệu năng hàng ngày:

import mysql.connector
import pandas as pd
import matplotlib.pyplot as plt
from datetime import datetime
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

def generate_daily_report():
    # Connect to database
    conn = mysql.connector.connect(
        host="localhost",
        user="report_user",
        password="password",
        database="mysql"
    )

    # Query for slow queries
    slow_query_df = pd.read_sql("""
        SELECT db, count(*) as count, avg(query_time) as avg_time
        FROM mysql.slow_log
        WHERE start_time > NOW() - INTERVAL 24 HOUR
        GROUP BY db
        ORDER BY count DESC
    """, conn)

    # Query for resource usage
    resource_df = pd.read_sql("""
        SELECT
            variable_name,
            variable_value
        FROM performance_schema.global_status
        WHERE variable_name IN (
            'Threads_connected',
            'Threads_running',
            'Innodb_buffer_pool_reads',
            'Innodb_buffer_pool_read_requests',
            'Queries',
            'Slow_queries'
        )
    """, conn)

    # Create report
    report_date = datetime.now().strftime("%Y-%m-%d")
    report_file = f"database_report_{report_date}.html"

    # Generate plots
    plt.figure(figsize=(10, 6))
    slow_query_df.plot(kind='bar', x='db', y='count')
    plt.title('Slow Queries by Database')
    plt.tight_layout()
    plt.savefig('slow_queries.png')

    # Create HTML report
    html = f"""
    <html>
    <head>
        <title>Database Performance Report - {report_date}</title>
        <style>
            body {{ font-family: Arial, sans-serif; }}
            table {{ border-collapse: collapse; width: 100%; }}
            th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
            th {{ background-color: #f2f2f2; }}
        </style>
    </head>
    <body>
        <h1>Database Performance Report - {report_date}</h1>

        <h2>Slow Queries Summary</h2>
        {slow_query_df.to_html()}

        <h2>Resource Usage</h2>
        {resource_df.to_html()}

        <h2>Slow Queries by Database</h2>
        <img src="cid:slow_queries.png">

        <h2>Recommendations</h2>
        <p>Based on the data, consider the following actions:</p>
        <ul>
            <li>Review slow queries in the most affected databases</li>
            <li>Check for missing indexes</li>
            <li>Optimize buffer pool size if read ratio is low</li>
        </ul>
    </body>
    </html>
    """

    # Send email report
    send_email_report(html, report_date)

    conn.close()

def send_email_report(html_content, report_date):
    msg = MIMEMultipart()
    msg['Subject'] = f'Database Performance Report - {report_date}'
    msg['From'] = '[email protected]'
    msg['To'] = '[email protected]'

    # Attach HTML content
    msg.attach(MIMEText(html_content, 'html'))

    # Attach image
    with open('slow_queries.png', 'rb') as f:
        img = MIMEApplication(f.read(), _subtype="png")
        img.add_header('Content-ID', '<slow_queries.png>')
        msg.attach(img)

    # Send email
    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.login('[email protected]', 'password')
        server.send_message(msg)

# Run daily via cron
if __name__ == "__main__":
    generate_daily_report()

5. Bảo trì định kỳ và proactive optimization

5.1 Lịch bảo trì định kỳ


  gantt
    title Database Maintenance Schedule
    dateFormat  YYMMDD
    section Daily
    Backup Verification       :daily1, 230101, 1d
    Slow Query Analysis       :daily2, 230101, 1d
    section Weekly
    Index Optimization        :weekly1, 230101, 7d
    Statistics Update         :weekly2, 230103, 7d
    section Monthly
    Full Table Optimization   :monthly1, 230115, 30d
    Storage Cleanup           :monthly2, 230120, 30d
    section Quarterly
    Performance Audit         :quarterly1, 230101, 90d
    Capacity Planning         :quarterly2, 230201, 90d

5.2 Tác vụ bảo trì định kỳ

Daily Tasks

Kiểm tra backup:

# Verify MySQL backup
mysqlbinlog /var/backup/mysql/binlog.000123 | head

# Restore test
mysql -e "CREATE DATABASE backup_test; USE backup_test; SOURCE /var/backup/mysql/backup.sql;"

Kiểm tra slow queries:

# Analyze slow query log
pt-query-digest /var/log/mysql/mysql-slow.log --since=24h

Weekly Tasks

Optimize indexes:

-- MySQL: Analyze tables
ANALYZE TABLE customers, orders, products;

-- PostgreSQL: Analyze tables
VACUUM ANALYZE customers, orders, products;

Update statistics:

-- MySQL: Update statistics
ANALYZE TABLE customers, orders, products;

-- PostgreSQL: Update statistics
ANALYZE VERBOSE customers, orders, products;

Monthly Tasks

Table optimization:

-- MySQL: Optimize tables
OPTIMIZE TABLE customers, orders, products;

-- PostgreSQL: Full VACUUM
VACUUM FULL ANALYZE customers, orders, products;

Storage cleanup:

# Clean old binary logs
mysql -e "PURGE BINARY LOGS BEFORE DATE_SUB(NOW(), INTERVAL 7 DAY);"

# Clean old backup files
find /var/backup/mysql -name "backup_*.sql" -mtime +30 -delete

5.3 Proactive Optimization

Capacity Planning

-- MySQL: Kiểm tra tăng trưởng database
SELECT
    table_schema as 'Database',
    SUM(data_length + index_length) / 1024 / 1024 as 'Size (MB)',
    SUM(data_free) / 1024 / 1024 as 'Free Space (MB)'
FROM information_schema.TABLES
GROUP BY table_schema;

-- PostgreSQL: Kiểm tra tăng trưởng database
SELECT
    datname as database_name,
    pg_size_pretty(pg_database_size(datname)) as size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;

Trend Analysis

import pandas as pd
import matplotlib.pyplot as plt
import mysql.connector
from datetime import datetime, timedelta

# Connect to database
conn = mysql.connector.connect(
    host="localhost",
    user="monitor",
    password="password",
    database="monitoring"
)

# Query historical metrics
query = """
SELECT
    collection_time,
    metric_name,
    metric_value
FROM performance_metrics
WHERE
    collection_time >= DATE_SUB(NOW(), INTERVAL 30 DAY)
    AND metric_name IN ('queries_per_second', 'connections', 'buffer_pool_hit_ratio')
ORDER BY collection_time
"""

df = pd.read_sql(query, conn)

# Pivot data for plotting
pivot_df = df.pivot(index='collection_time', columns='metric_name', values='metric_value')

# Plot trends
plt.figure(figsize=(12, 8))

# Plot queries per second
plt.subplot(3, 1, 1)
pivot_df['queries_per_second'].plot()
plt.title('Queries Per Second - Last 30 Days')
plt.grid(True)

# Plot connections
plt.subplot(3, 1, 2)
pivot_df['connections'].plot()
plt.title('Active Connections - Last 30 Days')
plt.grid(True)

# Plot buffer pool hit ratio
plt.subplot(3, 1, 3)
pivot_df['buffer_pool_hit_ratio'].plot()
plt.title('Buffer Pool Hit Ratio - Last 30 Days')
plt.grid(True)

plt.tight_layout()
plt.savefig('database_trends.png')

# Predict future growth
from sklearn.linear_model import LinearRegression
import numpy as np

# Prepare data for prediction
X = np.array(range(len(pivot_df))).reshape(-1, 1)
y = pivot_df['queries_per_second'].values

# Train model
model = LinearRegression()
model.fit(X, y)

# Predict next 30 days
future_days = 30
future_X = np.array(range(len(pivot_df), len(pivot_df) + future_days)).reshape(-1, 1)
future_y = model.predict(future_X)

# Plot prediction
plt.figure(figsize=(10, 6))
plt.plot(pivot_df.index, pivot_df['queries_per_second'], label='Historical')
future_dates = [pivot_df.index[-1] + timedelta(days=i+1) for i in range(future_days)]
plt.plot(future_dates, future_y, 'r--', label='Predicted')
plt.title('Queries Per Second - Prediction for Next 30 Days')
plt.legend()
plt.grid(True)
plt.savefig('qps_prediction.png')

conn.close()

6. Case Study: Troubleshooting một hệ thống database trong thực tế

6.1 Scenario: Hệ thống e-commerce gặp sự cố hiệu năng

Một hệ thống e-commerce đang gặp vấn đề hiệu năng nghiêm trọng trong giờ cao điểm. Người dùng báo cáo trang web chậm, thời gian tải trang dài và đôi khi gặp lỗi timeout. Hệ thống sử dụng MySQL làm cơ sở dữ liệu chính.

6.2 Quy trình troubleshooting


  flowchart TD
    A[Phát hiện vấn đề] --> B[Thu thập metrics]
    B --> C{Xác định nguyên nhân}
    C -->|CPU cao| D[Phân tích CPU usage]
    C -->|Memory cao| E[Phân tích memory usage]
    C -->|Disk I/O cao| F[Phân tích disk I/O]
    C -->|Slow queries| G[Phân tích slow queries]
    C -->|Connection cao| H[Phân tích connections]

    D --> I[Tối ưu queries]
    E --> J[Điều chỉnh memory settings]
    F --> K[Tối ưu disk I/O]
    G --> L[Tối ưu indexes và queries]
    H --> M[Điều chỉnh connection pool]

    I --> N[Triển khai giải pháp]
    J --> N
    K --> N
    L --> N
    M --> N

    N --> O[Giám sát kết quả]
    O --> P{Vấn đề đã giải quyết?}
    P -->|Có| Q[Tài liệu hóa và chia sẻ]
    P -->|Không| B

Bước 1: Thu thập metrics

# Kiểm tra system load
uptime

# Kiểm tra CPU, memory, disk usage
top
vmstat 1
iostat -x 1

# Kiểm tra MySQL status
mysql -e "SHOW GLOBAL STATUS LIKE '%connections%';"
mysql -e "SHOW GLOBAL STATUS LIKE '%threads%';"
mysql -e "SHOW GLOBAL STATUS LIKE '%questions%';"
mysql -e "SHOW GLOBAL STATUS LIKE '%slow%';"

Bước 2: Phân tích slow queries

# Phân tích slow query log
pt-query-digest /var/log/mysql/mysql-slow.log

# Kết quả phân tích
# Phát hiện: Query tìm kiếm sản phẩm không sử dụng index hiệu quả
# Query: SELECT * FROM products WHERE category_id = ? AND price BETWEEN ? AND ? ORDER BY popularity DESC

Bước 3: Phân tích execution plan

-- Phân tích execution plan
EXPLAIN SELECT * FROM products
WHERE category_id = 5 AND price BETWEEN 100 AND 500
ORDER BY popularity DESC;

-- Kết quả:
-- Using filesort (không tốt)
-- Không sử dụng index cho ORDER BY

Bước 4: Triển khai giải pháp

-- Tạo composite index phù hợp
CREATE INDEX idx_category_price_popularity ON products(category_id, price, popularity);

-- Tối ưu query
-- Thay vì SELECT *
SELECT id, name, description, price, image_url, stock_quantity
FROM products
WHERE category_id = 5 AND price BETWEEN 100 AND 500
ORDER BY popularity DESC
LIMIT 20;

Bước 5: Điều chỉnh cấu hình MySQL

# my.cnf adjustments
innodb_buffer_pool_size = 6G  # Tăng từ 2G
innodb_log_file_size = 512M   # Tăng từ 128M
max_connections = 300         # Tăng từ 150

Bước 6: Giám sát kết quả

# Giám sát hiệu năng sau khi tối ưu
mysqltuner --user=root --pass=password

# Kiểm tra slow queries mới
pt-query-digest /var/log/mysql/mysql-slow.log --since="2 hours ago"

6.3 Kết quả và bài học

Sau khi triển khai các giải pháp, hệ thống đã cải thiện đáng kể:

Thời gian phản hồi trung bình giảm 70%
CPU usage giảm từ 95% xuống 40%
Số lượng slow queries giảm 85%
Throughput tăng 50%

Bài học rút ra:

Giám sát liên tục là cần thiết để phát hiện vấn đề sớm
Indexes phù hợp có tác động lớn đến hiệu năng query
Việc chọn lọc dữ liệu trả về (thay vì SELECT *) giúp giảm tải network và memory
Cấu hình buffer pool phù hợp với workload và available memory rất quan trọng
Giám sát và phân tích liên tục giúp phát hiện các vấn đề tiềm ẩn trước khi chúng trở nên nghiêm trọng

7. Chiến lược backup và disaster recovery

Một phần quan trọng của việc bảo trì cơ sở dữ liệu là đảm bảo dữ liệu được sao lưu đầy đủ và có thể khôi phục nhanh chóng khi cần thiết.

7.1 Các loại backup


  graph LR
    A[Database Backup Types] --> B[Logical Backup]
    A --> C[Physical Backup]

    B --> B1[SQL Dumps]
    B --> B2[CSV/Text Export]

    C --> C1[File System Snapshot]
    C --> C2[Block-level Backup]

    A --> D[Backup Strategy]
    D --> D1[Full Backup]
    D --> D2[Incremental Backup]
    D --> D3[Differential Backup]

    A --> E[Backup Schedule]
    E --> E1[Daily]
    E --> E2[Weekly]
    E --> E3[Monthly]

Logical Backup

# MySQL: Logical backup với mysqldump
mysqldump --single-transaction --routines --triggers --all-databases > full_backup.sql

# PostgreSQL: Logical backup với pg_dump
pg_dump -Fc -f database_backup.dump database_name

Physical Backup

# MySQL: Physical backup với Percona XtraBackup
xtrabackup --backup --target-dir=/backup/mysql/full

# PostgreSQL: Physical backup
pg_basebackup -D /backup/postgresql/full -Ft -z -P

7.2 Backup Automation và Verification

#!/bin/bash
# backup_mysql.sh - Script tự động backup MySQL database

# Cấu hình
BACKUP_DIR="/var/backups/mysql"
DATE=$(date +%Y-%m-%d)
MYSQL_USER="backup_user"
MYSQL_PASSWORD="secure_password"
RETENTION_DAYS=7

# Tạo thư mục backup
mkdir -p $BACKUP_DIR/$DATE

# Thực hiện full backup
mysqldump --user=$MYSQL_USER --password=$MYSQL_PASSWORD \
  --single-transaction --routines --triggers --events --all-databases \
  | gzip > $BACKUP_DIR/$DATE/full_backup.sql.gz

# Backup binary logs
mysql --user=$MYSQL_USER --password=$MYSQL_PASSWORD \
  -e "SHOW BINARY LOGS" | grep -v "Log_name" | awk '{print $1}' | \
  xargs -I{} cp /var/lib/mysql/{} $BACKUP_DIR/$DATE/

# Verify backup
gunzip -c $BACKUP_DIR/$DATE/full_backup.sql.gz | head -n 20 > /dev/null
if [ $? -ne 0 ]; then
  echo "Backup verification failed!" | mail -s "MySQL Backup Failed" [email protected]
  exit 1
fi

# Cleanup old backups
find $BACKUP_DIR -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;

# Log success
echo "Backup completed successfully at $(date)" >> /var/log/mysql_backup.log

7.3 Disaster Recovery Plan


  flowchart TD
    A[Database Failure Detected] --> B{Assess Severity}
    B -->|Minor Issue| C[Quick Fix]
    B -->|Major Issue| D[Initiate Recovery Plan]

    D --> E[Notify Stakeholders]
    E --> F[Stop Applications]
    F --> G{Determine Recovery Method}

    G -->|Point-in-Time Recovery| H[Restore Full Backup]
    H --> I[Apply Binary Logs]
    I --> J[Verify Data Integrity]

    G -->|Full Restore| K[Restore Latest Backup]
    K --> J

    J --> L[Test Database Functionality]
    L --> M[Restart Applications]
    M --> N[Monitor System]
    N --> O[Post-Incident Review]

Recovery Time Objective (RTO) và Recovery Point Objective (RPO)

RTO: Thời gian tối đa cho phép để khôi phục hệ thống sau sự cố
RPO: Lượng dữ liệu tối đa có thể mất trong trường hợp sự cố


  graph LR
    A[Disaster] --> B[Recovery Point]
    B --> C[Present]

    D[RPO] ---|"Data Loss"| A
    E[RTO] ---|"Downtime"| C

    style A fill:#f44336,stroke:#000,stroke-width:2px
    style B fill:#4caf50,stroke:#000,stroke-width:2px
    style C fill:#2196f3,stroke:#000,stroke-width:2px

Point-in-Time Recovery (PITR)

-- MySQL: Point-in-Time Recovery
-- 1. Restore full backup
mysql < full_backup.sql

-- 2. Apply binary logs to specific point in time
mysqlbinlog --stop-datetime="2023-06-15 14:30:00" \
  /var/lib/mysql/binlog.000001 /var/lib/mysql/binlog.000002 | mysql

-- PostgreSQL: Point-in-Time Recovery
-- 1. Restore base backup
pg_restore -d database_name backup_file.dump

-- 2. Create recovery.conf
/*
restore_command = 'cp /path/to/archive/%f %p'
recovery_target_time = '2023-06-15 14:30:00'
*/

-- 3. Start PostgreSQL

7.4 Testing Recovery Procedures

#!/bin/bash
# test_recovery.sh - Script kiểm tra quy trình recovery

# Cấu hình
BACKUP_DIR="/var/backups/mysql"
TEST_SERVER="recovery-test.example.com"
MYSQL_USER="recovery_user"
MYSQL_PASSWORD="secure_password"
TEST_DB="recovery_test"

# Lấy backup mới nhất
LATEST_BACKUP=$(find $BACKUP_DIR -name "full_backup.sql.gz" | sort -r | head -n 1)

# Tạo test database
ssh $TEST_SERVER "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e 'DROP DATABASE IF EXISTS $TEST_DB; CREATE DATABASE $TEST_DB;'"

# Restore backup vào test database
gunzip -c $LATEST_BACKUP | ssh $TEST_SERVER "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD $TEST_DB"

# Kiểm tra integrity
TABLES_COUNT=$(ssh $TEST_SERVER "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e 'SHOW TABLES FROM $TEST_DB;' | wc -l")

if [ $TABLES_COUNT -lt 2 ]; then
  echo "Recovery test failed: No tables found in restored database"
  exit 1
fi

# Chạy validation queries
VALIDATION_RESULT=$(ssh $TEST_SERVER "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD $TEST_DB -e 'SELECT COUNT(*) FROM users; SELECT COUNT(*) FROM orders;'")

echo "Recovery test completed successfully"
echo "Validation results:"
echo "$VALIDATION_RESULT"

# Cleanup
ssh $TEST_SERVER "mysql -u$MYSQL_USER -p$MYSQL_PASSWORD -e 'DROP DATABASE $TEST_DB;'"

8. Giám sát và quản lý replication

Replication là một phần quan trọng của nhiều hệ thống cơ sở dữ liệu, giúp tăng tính sẵn sàng và khả năng mở rộng. Việc giám sát và quản lý replication hiệu quả là rất quan trọng.

8.1 Kiến trúc replication


  graph TD
    A[Primary/Master] -->|Binary Log| B[Replica/Slave 1]
    A -->|Binary Log| C[Replica/Slave 2]
    A -->|Binary Log| D[Replica/Slave 3]

    B -->|Read Traffic| E[Application Servers]
    C -->|Read Traffic| E
    D -->|Read Traffic| E

    A -->|Write Traffic| E

8.2 Giám sát replication status

MySQL/MariaDB:

-- Kiểm tra replication status trên replica
SHOW SLAVE STATUS\G

-- Các metrics quan trọng cần giám sát:
-- 1. Slave_IO_Running và Slave_SQL_Running (phải là "Yes")
-- 2. Seconds_Behind_Master (độ trễ replication)
-- 3. Last_IO_Error và Last_SQL_Error (lỗi replication)

PostgreSQL:

-- Kiểm tra replication status
SELECT client_addr, state, sent_lsn, write_lsn, flush_lsn, replay_lsn,
       (pg_wal_lsn_diff(sent_lsn, replay_lsn) / 1024 / 1024) AS lag_size_mb
FROM pg_stat_replication;

-- Kiểm tra replication slot
SELECT slot_name, plugin, slot_type, active, restart_lsn
FROM pg_replication_slots;

8.3 Xử lý replication lag


  flowchart TD
    A[Phát hiện Replication Lag] --> B{Xác định nguyên nhân}
    B -->|Workload cao| C[Tối ưu write workload]
    B -->|Network issue| D[Kiểm tra network]
    B -->|Resource bottleneck| E[Tăng resources cho replica]
    B -->|Single-threaded replication| F[Sử dụng parallel replication]

    C --> G[Giám sát kết quả]
    D --> G
    E --> G
    F --> G

    G --> H{Lag đã giảm?}
    H -->|Có| I[Tiếp tục giám sát]
    H -->|Không| B

Giải pháp cho replication lag:

-- MySQL: Bật parallel replication (MySQL 5.7+)
SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 8;

-- PostgreSQL: Tăng max_wal_senders
ALTER SYSTEM SET max_wal_senders = 10;

8.4 Failover và switchover


  sequenceDiagram
    participant M as Master
    participant S1 as Slave 1
    participant S2 as Slave 2
    participant P as Proxy/Load Balancer
    participant A as Application

    Note over M: Master failure detected
    P->>P: Stop sending writes to Master
    P->>S1: Promote Slave 1 to new Master
    S1->>S1: SET GLOBAL read_only = OFF
    S1->>S2: Reconfigure Slave 2 to replicate from Slave 1
    S2->>S1: CHANGE MASTER TO MASTER_HOST='slave1'
    P->>A: Redirect write traffic to new Master (S1)
    Note over S1: Now acting as Master
    Note over S2: Now replicating from S1
    Note over M: When recovered, can be added back as slave

Automated Failover với Orchestrator (MySQL):

# Cài đặt Orchestrator
docker run -d --name orchestrator -p 3000:3000 \
  -e ORCHESTRATOR_USER=orc_user \
  -e ORCHESTRATOR_PASSWORD=orc_password \
  openark/orchestrator:latest

# Cấu hình Orchestrator (orchestrator.conf.json)
{
  "MySQLTopologyCredentialsConfigFile": "/etc/orchestrator/topology.cnf",
  "MySQLOrchestratorHost": "orchestrator-db",
  "MySQLOrchestratorPort": 3306,
  "MySQLOrchestratorDatabase": "orchestrator",
  "MySQLOrchestratorUser": "orc_user",
  "MySQLOrchestratorPassword": "orc_password",
  "InstancePollSeconds": 5,
  "PromotionIgnoreHostnameFilters": [],
  "DetectClusterAliasQuery": "SELECT SUBSTRING_INDEX(@@hostname, '.', 1)",
  "DetectInstanceAliasQuery": "SELECT @@hostname",
  "PromoteMasterOnFailover": true,
  "FailMasterPromotionOnLagSeconds": 30
}

9. Quản lý và tối ưu hóa connection pools

Connection pooling là một kỹ thuật quan trọng để tối ưu hóa hiệu năng ứng dụng khi tương tác với cơ sở dữ liệu.

9.1 Lợi ích của connection pooling


  graph LR
    A[Connection Pooling Benefits] --> B[Giảm overhead kết nối]
    A --> C[Tái sử dụng connections]
    A --> D[Kiểm soát số lượng connections]
    A --> E[Cân bằng tải]
    A --> F[Giảm thời gian phản hồi]

9.2 Cấu hình connection pool hiệu quả

Laravel (PHP):

// Laravel - Cấu hình trong config/database.php
'mysql' => [
    'driver' => 'mysql',
    'url' => env('DATABASE_URL'),
    'host' => env('DB_HOST', '127.0.0.1'),
    'port' => env('DB_PORT', '3306'),
    'database' => env('DB_DATABASE', 'mydb'),
    'username' => env('DB_USERNAME', 'user'),
    'password' => env('DB_PASSWORD', 'password'),
    'unix_socket' => env('DB_SOCKET', ''),
    'charset' => 'utf8mb4',
    'collation' => 'utf8mb4_unicode_ci',
    'prefix' => '',
    'prefix_indexes' => true,
    'strict' => true,
    'engine' => null,
    // Cấu hình connection pool
    'pool' => [
        'min' => 5,     // Tương đương với minimumIdle
        'max' => 10,    // Tương đương với maximumPoolSize
    ],
    // Cấu hình timeout
    'wait_timeout' => 28800,  // Thời gian tối đa một connection có thể ở trạng thái idle
    'connect_timeout' => 10,  // Tương đương với connectionTimeout (tính bằng giây)
    'idle_timeout' => 30,     // Tương đương với idleTimeout (tính bằng giây)
    'max_lifetime' => 1800,   // Tương đương với maxLifetime (tính bằng giây)
],

// Sử dụng connection pool trong code
// Laravel tự động quản lý connection pool, bạn chỉ cần sử dụng DB facade bình thường
$results = DB::table('users')->get();

PgBouncer (PostgreSQL):

# pgbouncer.ini
[databases]
mydb = host=localhost port=5432 dbname=mydb

[pgbouncer]
listen_addr = *
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 20
min_pool_size = 10
reserve_pool_size = 5
reserve_pool_timeout = 3
max_db_connections = 50
max_user_connections = 50

9.3 Giám sát connection pool

Laravel DB Metrics (với Prometheus):

// Cài đặt package prometheus-exporter
// composer require spatie/laravel-prometheus

// Đăng ký metrics trong app/Providers/AppServiceProvider.php
public function boot()
{
    // Đăng ký metrics cho database connections
    \Spatie\Prometheus\Facades\Prometheus::addGauge(
        name: 'db_connections_active',
        help: 'Number of active database connections',
        callback: fn () => DB::getConnections()->filter->isConnected()->count()
    );

    \Spatie\Prometheus\Facades\Prometheus::addGauge(
        name: 'db_connections_total',
        help: 'Total number of database connections',
        callback: fn () => DB::getConnections()->count()
    );

    // Đăng ký metrics cho query time
    DB::listen(function ($query) {
        \Spatie\Prometheus\Facades\Prometheus::trackHistogram(
            name: 'db_query_duration_seconds',
            help: 'Database query duration in seconds',
            value: $query->time / 1000, // Convert from ms to seconds
            labels: ['connection' => $query->connectionName]
        );
    });
}

// Metrics được thu thập:
// - db_connections_active
// - db_connections_total
// - db_query_duration_seconds

PgBouncer Monitoring:

-- Kết nối đến PgBouncer admin console
psql -p 6432 -U pgbouncer pgbouncer

-- Xem thống kê pools
SHOW POOLS;

-- Xem thống kê clients
SHOW CLIENTS;

-- Xem thống kê servers
SHOW SERVERS;

9.4 Troubleshooting connection issues


  flowchart TD
    A[Connection Issues] --> B{Xác định loại vấn đề}
    B -->|Connection timeouts| C[Kiểm tra connection timeout settings]
    B -->|Too many connections| D[Kiểm tra max_connections]
    B -->|Connection leaks| E[Kiểm tra connection closing]
    B -->|Slow connection establishment| F[Kiểm tra DNS và network]

    C --> G[Điều chỉnh timeout settings]
    D --> H[Tối ưu pool size]
    E --> I[Fix connection leaks trong code]
    F --> J[Tối ưu network và DNS]

    G --> K[Giám sát kết quả]
    H --> K
    I --> K
    J --> K

Phát hiện connection leaks:

// Laravel - Phát hiện connection leaks

// 1. Cấu hình trong config/database.php
'mysql' => [
    // Các cấu hình khác...

    // Bật chế độ debug connection
    'dump_queries_on_error' => true,

    // Thiết lập thời gian timeout cho connections
    'idle_timeout' => 60, // 60 giây - tương đương với leakDetectionThreshold
],

// 2. Tạo middleware để theo dõi và đóng connections sau mỗi request
class CloseDbConnectionsMiddleware
{
    public function handle($request, Closure $next)
    {
        $response = $next($request);

        // Đóng tất cả các connections sau khi xử lý request
        DB::disconnect();

        return $response;
    }
}

// 3. Đăng ký middleware trong app/Http/Kernel.php
protected $middleware = [
    // Các middleware khác...
    \App\Http\Middleware\CloseDbConnectionsMiddleware::class,
];

// 4. Ghi log các connections bị leak trong AppServiceProvider
public function boot()
{
    DB::listen(function ($query) {
        // Ghi lại thời gian bắt đầu của query
        $query->start_time = microtime(true);
    });

    DB::getEventDispatcher()->listen('connection.disconnecting', function ($connection) {
        // Kiểm tra thời gian kết nối đã mở
        if (isset($connection->start_time) && (microtime(true) - $connection->start_time) > 60) {
            Log::warning('Possible DB connection leak detected', [
                'connection' => $connection->getName(),
                'duration' => microtime(true) - $connection->start_time,
                'trace' => debug_backtrace(DEBUG_BACKTRACE_IGNORE_ARGS),
            ]);
        }
    });
}

Xử lý too many connections:

-- MySQL: Kiểm tra active connections
SHOW STATUS LIKE 'Threads_connected';

-- MySQL: Kiểm tra max_connections
SHOW VARIABLES LIKE 'max_connections';

-- PostgreSQL: Kiểm tra active connections
SELECT count(*) FROM pg_stat_activity;

-- PostgreSQL: Kiểm tra max_connections
SHOW max_connections;

10. Tự động hóa tác vụ bảo trì

Tự động hóa các tác vụ bảo trì thường xuyên giúp giảm công sức và đảm bảo tính nhất quán.

10.1 Sử dụng cron jobs

# /etc/crontab

# Hàng ngày: Backup database
0 1 * * * root /usr/local/bin/backup_mysql.sh

# Hàng tuần: Optimize tables
0 3 * * 0 root /usr/local/bin/optimize_tables.sh

# Hàng ngày: Phân tích slow queries
30 1 * * * root /usr/local/bin/analyze_slow_queries.sh

# Mỗi giờ: Kiểm tra replication status
0 * * * * root /usr/local/bin/check_replication.sh

# Mỗi 5 phút: Kiểm tra database health
*/5 * * * * root /usr/local/bin/check_db_health.sh

Nội dung các file script:

# /usr/local/bin/backup_mysql.sh
#!/bin/bash

DATE=$(date +"%Y-%m-%d")
BACKUP_DIR="/var/backups/mysql"
DB_USER="backup_user"
DB_PASS="secure_password"
DB_HOST="localhost"

# Tạo thư mục backup nếu chưa tồn tại
mkdir -p $BACKUP_DIR

# Danh sách database cần backup (loại trừ các system databases)
DATABASES=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema|performance_schema|mysql|sys)")

# Backup từng database
for DB in $DATABASES; do
    echo "Backing up database: $DB"
    mysqldump -h $DB_HOST -u $DB_USER -p$DB_PASS --single-transaction --quick --lock-tables=false $DB | gzip > "$BACKUP_DIR/$DB-$DATE.sql.gz"
done

# Xóa các backup cũ hơn 30 ngày
find $BACKUP_DIR -name "*.sql.gz" -type f -mtime +30 -delete

echo "Backup completed at $(date)"

# /usr/local/bin/optimize_tables.sh
#!/bin/bash

DB_USER="maintenance_user"
DB_PASS="secure_password"
DB_HOST="localhost"
LOG_FILE="/var/log/mysql/optimize-$(date +"%Y-%m-%d").log"

echo "Starting table optimization at $(date)" > $LOG_FILE

# Lấy danh sách các database
DATABASES=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema|performance_schema|mysql|sys)")

for DB in $DATABASES; do
    echo "Optimizing tables in database: $DB" >> $LOG_FILE

    # Lấy danh sách các bảng trong database
    TABLES=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "USE $DB; SHOW TABLES;" | grep -v "Tables_in")

    for TABLE in $TABLES; do
        echo "  - Optimizing table: $TABLE" >> $LOG_FILE
        mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "USE $DB; OPTIMIZE TABLE $TABLE;" >> $LOG_FILE 2>&1
    done
done

echo "Table optimization completed at $(date)" >> $LOG_FILE

# /usr/local/bin/analyze_slow_queries.sh
#!/bin/bash

SLOW_LOG="/var/log/mysql/mysql-slow.log"
REPORT_DIR="/var/log/mysql/reports"
DATE=$(date +"%Y-%m-%d")

# Tạo thư mục báo cáo nếu chưa tồn tại
mkdir -p $REPORT_DIR

# Kiểm tra xem pt-query-digest đã được cài đặt chưa
if ! command -v pt-query-digest &> /dev/null; then
    echo "pt-query-digest not found. Please install Percona Toolkit."
    exit 1
fi

# Phân tích slow query log
pt-query-digest $SLOW_LOG > "$REPORT_DIR/slow-query-report-$DATE.txt"

# Tạo báo cáo tổng hợp
echo "Top 10 slow queries on $DATE:" > "$REPORT_DIR/summary-$DATE.txt"
head -n 50 "$REPORT_DIR/slow-query-report-$DATE.txt" >> "$REPORT_DIR/summary-$DATE.txt"

# Xóa các báo cáo cũ hơn 30 ngày
find $REPORT_DIR -name "*.txt" -type f -mtime +30 -delete

# Gửi email báo cáo
mail -s "MySQL Slow Query Report for $DATE" [email protected] < "$REPORT_DIR/summary-$DATE.txt"

echo "Slow query analysis completed at $(date)"

# /usr/local/bin/check_replication.sh
#!/bin/bash

DB_USER="repl_monitor"
DB_PASS="secure_password"
DB_HOST="localhost"
LOG_FILE="/var/log/mysql/replication-check.log"
ALERT_EMAIL="[email protected]"

# Kiểm tra trạng thái replication
REPL_STATUS=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SHOW SLAVE STATUS\G")

# Kiểm tra xem Slave_IO_Running và Slave_SQL_Running có đang chạy không
IO_RUNNING=$(echo "$REPL_STATUS" | grep "Slave_IO_Running" | awk '{print $2}')
SQL_RUNNING=$(echo "$REPL_STATUS" | grep "Slave_SQL_Running" | awk '{print $2}')
SECONDS_BEHIND=$(echo "$REPL_STATUS" | grep "Seconds_Behind_Master" | awk '{print $2}')

TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")

# Ghi log
echo "$TIMESTAMP - IO: $IO_RUNNING, SQL: $SQL_RUNNING, Seconds Behind: $SECONDS_BEHIND" >> $LOG_FILE

# Gửi cảnh báo nếu có vấn đề
if [ "$IO_RUNNING" != "Yes" ] || [ "$SQL_RUNNING" != "Yes" ]; then
    echo "ALERT: Replication is not running properly!" | mail -s "MySQL Replication Alert" $ALERT_EMAIL
fi

# Cảnh báo nếu replication lag quá lớn
if [ "$SECONDS_BEHIND" != "NULL" ] && [ $SECONDS_BEHIND -gt 300 ]; then
    echo "ALERT: Replication lag is $SECONDS_BEHIND seconds!" | mail -s "MySQL Replication Lag Alert" $ALERT_EMAIL
fi

# /usr/local/bin/check_db_health.sh
#!/bin/bash

DB_USER="monitor_user"
DB_PASS="secure_password"
DB_HOST="localhost"
LOG_FILE="/var/log/mysql/health-check.log"
ALERT_EMAIL="[email protected]"
TIMESTAMP=$(date +"%Y-%m-%d %H:%M:%S")

# Kiểm tra xem MySQL có đang chạy không
if ! mysqladmin -h $DB_HOST -u $DB_USER -p$DB_PASS ping &>/dev/null; then
    echo "$TIMESTAMP - CRITICAL: MySQL server is not responding!" >> $LOG_FILE
    echo "MySQL server on $DB_HOST is not responding!" | mail -s "CRITICAL: MySQL Down" $ALERT_EMAIL
    exit 1
fi

# Kiểm tra số lượng connections
CONNECTIONS=$(mysqladmin -h $DB_HOST -u $DB_USER -p$DB_PASS status | awk '{print $4}')
MAX_CONNECTIONS=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SHOW VARIABLES LIKE 'max_connections';" | awk 'NR==2 {print $2}')
CONN_PERCENT=$((CONNECTIONS * 100 / MAX_CONNECTIONS))

# Kiểm tra dung lượng bộ nhớ
MEM_USAGE=$(mysql -h $DB_HOST -u $DB_USER -p$DB_PASS -e "SELECT ROUND((SUM(data_length + index_length) / 1024 / 1024), 2) as 'DB Size (MB)' FROM information_schema.TABLES;" | awk 'NR==2 {print $1}')

# Ghi log
echo "$TIMESTAMP - Connections: $CONNECTIONS/$MAX_CONNECTIONS ($CONN_PERCENT%), Memory: $MEM_USAGE MB" >> $LOG_FILE

# Cảnh báo nếu số lượng connections quá cao
if [ $CONN_PERCENT -gt 80 ]; then
    echo "WARNING: High connection usage: $CONNECTIONS/$MAX_CONNECTIONS ($CONN_PERCENT%)" | mail -s "MySQL Connection Alert" $ALERT_EMAIL
fi

# Kiểm tra các bảng bị hỏng
CORRUPT_TABLES=$(mysqlcheck -h $DB_HOST -u $DB_USER -p$DB_PASS --all-databases --check | grep -v "OK" | grep -v "Table" | wc -l)

if [ $CORRUPT_TABLES -gt 0 ]; then
    echo "$TIMESTAMP - CRITICAL: Found $CORRUPT_TABLES corrupted tables!" >> $LOG_FILE
    mysqlcheck -h $DB_HOST -u $DB_USER -p$DB_PASS --all-databases --check | grep -v "OK" | mail -s "MySQL Corrupt Tables Alert" $ALERT_EMAIL
fi

10.2 Sử dụng Ansible cho tự động hóa

# database_maintenance.yml
---
- name: Database Maintenance Tasks
  hosts: database_servers
  become: yes
  tasks:
    - name: Check MySQL service status
      service_facts:
      register: service_state

    - name: Ensure MySQL is running
      service:
        name: mysql
        state: started
      when: service_state.ansible_facts.services['mysql.service'] is defined and service_state.ansible_facts.services['mysql.service'].state != "running"

    - name: Check disk space
      shell: df -h /var/lib/mysql
      register: disk_space

    - name: Alert if disk space is low
      debug:
        msg: "WARNING: Low disk space on {{ inventory_hostname }}"
      when: disk_space.stdout is regex("([7-9][0-9]|100)%")

    - name: Optimize MySQL tables
      mysql_db:
        state: import
        name: all
        target: /tmp/optimize.sql
      when: ansible_date_time.weekday == "Sunday"

    - name: Backup MySQL databases
      mysql_db:
        state: dump
        name: all
        target: "/backup/mysql/{{ ansible_date_time.date }}_full.sql"

10.3 Sử dụng monitoring tools để trigger actions

Prometheus Alertmanager với Webhook:

# alertmanager.yml
receivers:
  - name: "database-team"
    webhook_configs:
      - url: "http://automation-server:8080/api/trigger/database-maintenance"
        send_resolved: true

route:
  group_by: ["alertname", "instance"]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: "database-team"
  routes:
    - match:
        severity: critical
      receiver: "database-team"

Automation Server Endpoint:

from flask import Flask, request, jsonify
import subprocess
import json

app = Flask(__name__)

@app.route('/api/trigger/database-maintenance', methods=['POST'])
def trigger_maintenance():
    alert_data = request.json

    # Xử lý các loại alerts khác nhau
    for alert in alert_data['alerts']:
        if alert['labels']['alertname'] == 'HighDiskUsage':
            # Trigger cleanup script
            instance = alert['labels']['instance']
            subprocess.run(['/usr/local/bin/cleanup_database.sh', instance])

        elif alert['labels']['alertname'] == 'SlowQueries':
            # Trigger query optimization
            instance = alert['labels']['instance']
            subprocess.run(['/usr/local/bin/optimize_slow_queries.sh', instance])

        elif alert['labels']['alertname'] == 'ReplicationLag':
            # Trigger replication fix
            instance = alert['labels']['instance']
            subprocess.run(['/usr/local/bin/fix_replication_lag.sh', instance])

    return jsonify({'status': 'success', 'message': 'Maintenance tasks triggered'})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Kết luận

Monitoring, troubleshooting và bảo trì liên tục là những yếu tố quan trọng để đảm bảo cơ sở dữ liệu của bạn hoạt động hiệu quả và ổn định. Trong bài viết này, chúng ta đã tìm hiểu:

Giám sát hiệu năng: Xác định KPIs quan trọng và thiết lập hệ thống giám sát hiệu quả
Phân tích và tối ưu slow queries: Phát hiện, phân tích và khắc phục các queries chậm
Troubleshooting issues: Xử lý các vấn đề về memory, CPU, disk I/O
Alerting và reporting: Thiết lập cảnh báo và báo cáo tự động
Bảo trì định kỳ: Lập lịch và thực hiện các tác vụ bảo trì thường xuyên
Backup và disaster recovery: Đảm bảo dữ liệu được bảo vệ và có thể khôi phục
Quản lý replication: Giám sát và xử lý vấn đề replication
Connection pooling: Tối ưu hóa kết nối đến cơ sở dữ liệu
Tự động hóa: Sử dụng các công cụ để tự động hóa tác vụ bảo trì

Bằng cách áp dụng các kỹ thuật và công cụ này, bạn có thể xây dựng một hệ thống cơ sở dữ liệu mạnh mẽ, ổn định và hiệu quả, đáp ứng nhu cầu của ứng dụng và người dùng.

Trong bài tiếp theo, chúng ta sẽ khám phá các case studies thực tế về tối ưu hóa cơ sở dữ liệu và xu hướng tương lai trong lĩnh vực này.

Tài liệu tham khảo

MySQL Documentation: https://dev.mysql.com/doc/
PostgreSQL Documentation: https://www.postgresql.org/docs/
MongoDB Documentation: https://docs.mongodb.com/
Percona Database Performance Blog: https://www.percona.com/blog/
Prometheus Documentation: https://prometheus.io/docs/
Grafana Documentation: https://grafana.com/docs/
“High Performance MySQL” by Baron Schwartz, Peter Zaitsev, and Vadim Tkachenko
“PostgreSQL 11 Administration Cookbook” by Simon Riggs, Gianni Ciolli