A complete A-to-Z guide to building an AI project (Data → Training → Deploy)

Contents

Project flow overview
- Flow diagram (ASCII)
1) Ideation and scoping
- Clear objectives
- Non-functional requirements
- Deliverables
2) Data design (data plan)
- Define the classes and the number of images per class
- Metadata to collect
- Storage
3) Data collection and normalization
- Collection
- Inspection and filtering
- Example script: minimum-size check (Python)
4) Exploratory Data Analysis (EDA)
- Purpose
- Tools
- Example code
- Checking for imbalance
5) Preprocessing pipeline (production-ready)
- Pipeline requirements
- Example using tf.data (TensorFlow)
- Augmentation (on-the-fly)
6) Model selection and training strategy
- General strategy
- Choice for a local demo with Flask
- Keras sample code (transfer learning)
7) Training and experiment tracking
- Setting up experiment tracking
- Required callbacks
- Example training call
- Experiment notes
8) Model evaluation, debugging, and validation
- Metrics to watch
- Building the confusion matrix
- If performance is low
9) Model optimization and export
- Export format
- Export Keras .h5
- Convert to TFLite (float16 quant)
- Validate exported model
10) Building the model-serving API (Flask / FastAPI)
- Rules
- Example Flask app (production-ready pattern)
11) Packaging and deployment (Docker, VPS, K8s)
- Sample Dockerfile (Flask + Keras)
- Docker Compose (nginx + app)
- Deployment options
- Healthcheck
12) CI/CD, monitoring, model versioning
- CI/CD
- Model registry & versioning
- Monitoring
13) Privacy / Ethics / Legal checklist
14) Final deployment checklist (pre-release)
15) Figures and diagrams: how to create them and the code to generate them
- 1) Flowchart (Graphviz)
- 2) Simple architecture diagram using matplotlib (python)
- 3) Confusion matrix heatmap (seaborn)
16) Example project: plant identification (quick repro)
17) Common errors and quick debugging
18) Documentation and learning resources (brief)
Conclusion (brief)

Project flow overview

User → (Upload / Data collection) → Preprocess → Train → Eval → Export model → Serve via API → Client (Web/Mobile) → Feedback → Retrain.

Flow diagram (ASCII)

1) Ideation and scoping

Clear objectives

Define the problem type: classification, regression, detection, or segmentation.
Example: "Identify the name of an ornamental plant (multi-class classification) from an RGB image with a single plant in the frame."
Define the output: plant name + confidence. Care tips and advice are out of scope (per the requirements).

Non-functional requirements

Target response time (local demo): under 1 s per inference (lightweight model).
Target accuracy (initial): at least 85% on a real-world test set.
Initial deployment: run locally with Flask.

Deliverables

A clean dataset, training scripts, the exported model, a Flask API, and a README with installation instructions.

2) Data design (data plan)

Define the classes and the number of images per class

Start with 20–50 common species.
Collect at least 100 images per class (ideally 300–1000 images per class if possible).

Metadata to collect

filename, class_label, source, date_collected, camera_exif (if available), location (opt-in), user_feedback.

Storage

Use a standard directory structure:
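
One possible layout is data/<split>/<class_name>/, with one subfolder per class inside train, val, and test. The sketch below builds that layout from a folder of raw images. The directory names, the 70/15/15 split, and the helper name split_dataset are illustrative assumptions, not something this guide prescribes.

```python
import random
import shutil
from pathlib import Path


def split_dataset(raw_dir, out_dir, train_pct=70, val_pct=15, seed=42):
    """Copy raw_dir/<class>/* into out_dir/{train,val,test}/<class>/."""
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    counts = {"train": 0, "val": 0, "test": 0}
    for class_dir in sorted(p for p in raw_dir.iterdir() if p.is_dir()):
        files = sorted(f for f in class_dir.iterdir() if f.is_file())
        rng.shuffle(files)
        n_train = len(files) * train_pct // 100
        n_val = len(files) * val_pct // 100
        parts = {
            "train": files[:n_train],
            "val": files[n_train:n_train + n_val],
            "test": files[n_train + n_val:],  # whatever remains
        }
        for split, split_files in parts.items():
            target = out_dir / split / class_dir.name
            target.mkdir(parents=True, exist_ok=True)
            for f in split_files:
                shutil.copy2(f, target / f.name)
            counts[split] += len(split_files)
    return counts

# Example call (paths are hypothetical):
#   split_dataset("raw_images", "data")
```

Splitting per class keeps every species represented in all three splits, which matters when some classes have only about 100 images.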

Use object storage (S3/MinIO) if the dataset is large.

3) Data collection and normalization

Collection

Take your own photos, pull images from iNaturalist/Flickr/Kaggle (check the licenses), or crowdsource.
Write a scraper script (requests, plus selenium if needed) or use google-images-download/bing-image-downloader.

Inspection and filtering

Remove blurry, watermarked, or too-small images. Use a script to check resolution and file size.
Open each class and review its quality by hand.

Example script: minimum-size check (Python)
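
A minimal sketch of such a check, assuming Pillow is installed and that "too small" means a shorter side below 224 pixels (the threshold and the function names are assumptions chosen to match the 224x224 model input used later in this guide).

```python
from pathlib import Path

MIN_SIDE = 224  # assumption: matches the 224x224 model input used later


def is_large_enough(width, height, min_side=MIN_SIDE):
    """Return True when the shorter side is at least min_side pixels."""
    return min(width, height) >= min_side


def find_small_images(root, min_side=MIN_SIDE):
    """Yield (path, reason) for unreadable or too-small images under root."""
    # Pillow is imported here so the size rule above stays dependency-free.
    from PIL import Image

    for path in sorted(Path(root).rglob("*")):
        if path.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        try:
            with Image.open(path) as img:
                width, height = img.size
        except OSError:  # also covers PIL.UnidentifiedImageError
            yield path, "unreadable"
            continue
        if not is_large_enough(width, height, min_side):
            yield path, f"too small ({width}x{height})"

# Example call (path is hypothetical):
#   for path, reason in find_small_images("data/train"):
#       print(path, reason)
```

The script only reports files; deleting or moving them is left as a manual step so that borderline images can still be reviewed.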

4) Exploratory Data Analysis (EDA)

Purpose

Understand the class distribution, imbalance, and outliers.
Look at sample images, histograms of image sizes, and color distributions.

Tools

Jupyter notebook, pandas, matplotlib, seaborn.

Example code
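
As a starting point, the class distribution can be read straight from the folder layout. This sketch assumes one subfolder per class and uses only the standard library; the function names are illustrative.

```python
from collections import Counter
from pathlib import Path

IMAGE_EXTS = {".jpg", ".jpeg", ".png"}


def count_images_per_class(train_dir):
    """Count image files in each class subfolder of train_dir."""
    counts = Counter()
    for path in Path(train_dir).rglob("*"):
        if path.is_file() and path.suffix.lower() in IMAGE_EXTS:
            counts[path.parent.name] += 1
    return counts


def summarize(counts):
    """Return (smallest class, largest class, largest/smallest ratio)."""
    smallest = min(counts, key=counts.get)
    largest = max(counts, key=counts.get)
    return smallest, largest, counts[largest] / counts[smallest]
```

In a notebook, the resulting counts can be passed to pandas for plotting, for example pd.Series(counts).sort_values().plot.barh().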

Checking for imbalance

If the imbalance is large, plan for oversampling or class weighting.
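
For class weighting, a common heuristic is the inverse-frequency weight total / (n_classes * count), the same rule scikit-learn applies for class_weight="balanced". The sketch below is one way to compute it; the class names and counts are made up.

```python
def compute_class_weights(counts):
    """Inverse-frequency weights: total / (n_classes * count_for_class)."""
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


# Hypothetical counts: the rarer class receives the larger weight.
weights = compute_class_weights({"ficus": 300, "monstera": 100})
print(weights)
```

Keras accepts such weights through the class_weight argument of model.fit, keyed by integer class index rather than by name.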

5) Preprocessing pipeline (production-ready)

Pipeline requirements

Deterministic preprocessing (identical for train/val/test and for inference).
Augmentation applied only to the training set.
Fast (use tf.data or torch.utils.data).

Example using `tf.data` (TensorFlow)
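
A minimal sketch of a pipeline that meets the three requirements above, assuming JPEG files, integer labels, and scaling to [0, 1]. The function name make_dataset, the batch size, and the specific augmentations are assumptions.

```python
IMG_SIZE = (224, 224)  # assumption: matches the model input size in this guide


def make_dataset(paths, labels, num_classes, batch_size=32, training=False):
    """Build a tf.data pipeline from parallel lists of file paths and int labels."""
    import tensorflow as tf  # imported lazily because TensorFlow is slow to load

    def load(path, label):
        img = tf.io.decode_jpeg(tf.io.read_file(path), channels=3)
        # Deterministic step shared by train/val/test and inference
        img = tf.image.resize(img, IMG_SIZE) / 255.0
        return img, tf.one_hot(label, num_classes)

    def augment(img, label):
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_brightness(img, max_delta=0.1)
        return tf.clip_by_value(img, 0.0, 1.0), label

    ds = tf.data.Dataset.from_tensor_slices((list(paths), list(labels)))
    if training:
        ds = ds.shuffle(len(paths))
    ds = ds.map(load, num_parallel_calls=tf.data.AUTOTUNE)
    if training:
        # Augmentation runs only on the training split
        ds = ds.map(augment, num_parallel_calls=tf.data.AUTOTUNE)
    return ds.batch(batch_size).prefetch(tf.data.AUTOTUNE)
```

Keeping load separate from augment means the same deterministic step can be reused for validation, test, and serving, while random transforms stay confined to training.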

Augmentation (on-the-fly)

rotation, flip, random_crop, color_jitter.
Use albumentations for PyTorch; use tf.image/keras.preprocessing for TF.

6) Model selection and training strategy

General strategy

Use transfer learning with a pretrained backbone (MobileNetV2/EfficientNetB0).
Freeze the base, train the head, then unfreeze part of the base and fine-tune.

Choice for a local demo with Flask

MobileNetV2 or EfficientNetB0: small, with fast inference.

Keras sample code (transfer learning)

from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.layers import GlobalAveragePooling2D, Dense, Dropout
from tensorflow.keras.models import Model

num_classes = 20  # assumption: set this to the number of classes in your dataset

# Pretrained backbone without its ImageNet classification head
base = MobileNetV2(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = GlobalAveragePooling2D()(base.output)
x = Dropout(0.3)(x)
x = Dense(256, activation='relu')(x)
out = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base.input, outputs=out)

# Freeze the backbone so that only the new head is trained at first
for layer in base.layers:
    layer.trainable = False

# categorical_crossentropy expects one-hot encoded labels
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
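
The second phase of the strategy (unfreeze part of the base and fine-tune) could be sketched as follows. The helper name, the number of unfrozen layers, and the learning rate of 1e-5 are assumptions; the function works on any objects with a trainable attribute, such as Keras layers.

```python
def unfreeze_top_layers(layers, n_trainable):
    """Mark only the last n_trainable layers as trainable; freeze the rest."""
    layers = list(layers)
    cutoff = len(layers) - n_trainable
    for i, layer in enumerate(layers):
        layer.trainable = i >= cutoff
    return sum(layer.trainable for layer in layers)

# Usage with the Keras model above (not executed here):
#   unfreeze_top_layers(base.layers, n_trainable=30)
#   model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-5),
#                 loss='categorical_crossentropy', metrics=['accuracy'])
#   model.fit(...)
```

The model has to be recompiled after changing the trainable flags, and a much lower learning rate than in the head-only phase helps avoid destroying the pretrained features.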

7) Training and experiment tracking

Setting up experiment tracking

Use TensorBoard or Weights & Biases (wandb) to track loss/accuracy, learning rate, and gradient histograms.

Required callbacks

ModelCheckpoint(save_best_only=True)
EarlyStopping(patience=5)
ReduceLROnPlateau

Example training call
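
A training call that wires in the three callbacks listed above, plus TensorBoard logging, might look like this sketch. The monitored metric, the ReduceLROnPlateau settings, and the file and directory names are assumptions.

```python
def train(model, train_ds, val_ds, epochs=30, checkpoint_path="best_model.h5"):
    """Fit the model with checkpointing, early stopping, and LR reduction."""
    import tensorflow as tf  # imported lazily because TensorFlow is slow to load

    callbacks = [
        # Keep only the weights with the best validation loss so far
        tf.keras.callbacks.ModelCheckpoint(
            checkpoint_path, monitor="val_loss", save_best_only=True),
        # Stop after 5 epochs without improvement
        tf.keras.callbacks.EarlyStopping(
            monitor="val_loss", patience=5, restore_best_weights=True),
        # Halve the learning rate when validation loss plateaus
        tf.keras.callbacks.ReduceLROnPlateau(
            monitor="val_loss", factor=0.5, patience=2),
        tf.keras.callbacks.TensorBoard(log_dir="logs"),
    ]
    return model.fit(train_ds, validation_data=val_ds,
                     epochs=epochs, callbacks=callbacks)
```

The ReduceLROnPlateau patience is set lower than the EarlyStopping patience so that the learning rate gets a chance to drop before training stops.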

Experiment notes

Record the configuration (batch size, lr, backbone, augmentation) in a config.yaml file.
Save the model under a descriptive name: plant_mobilenetv2_bs32_lr1e-3_epoch30.h5.

8) Model evaluation, debugging, and validation

Metrics to watch

Accuracy (top-1) and top-3 accuracy
Confusion matrix: reveals pairs of classes that are easily confused
Per-class precision and recall

Building the confusion matrix
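
To make the metrics above concrete, here is a dependency-free sketch of a confusion matrix, per-class precision and recall, and top-k accuracy. The toy labels are made up; in practice scikit-learn's confusion_matrix and classification_report cover the same ground.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_precision_recall(cm):
    """Return a list of (precision, recall) tuples, one per class."""
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        results.append((precision, recall))
    return results


def top_k_accuracy(probs, y_true, k=3):
    """Fraction of samples whose true class is among the k highest scores."""
    hits = 0
    for row, t in zip(probs, y_true):
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += t in top_k
    return hits / len(y_true)


y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
print(cm)  # [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Large off-diagonal entries such as cm[0][1] point to the class pairs worth inspecting by hand.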

If performance is low

Check for data leakage (test images that also appear in the training set)
Check whether augmentation is so strong that it destroys distinguishing features
Add real-world images and reduce overfitting (dropout, weight decay)
Try a stronger backbone or grow the dataset
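
The data-leakage check in the list above can be automated for exact duplicates by hashing file contents. This is a sketch with assumed function names; it catches only byte-identical copies, so resized or re-encoded duplicates would need a perceptual hash instead.

```python
import hashlib
from pathlib import Path


def file_hashes(root):
    """Map SHA-256 digest -> list of files under root with that exact content."""
    hashes = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            hashes.setdefault(digest, []).append(path)
    return hashes


def find_leaks(train_dir, test_dir):
    """Return (train_path, test_path) pairs whose bytes are identical."""
    train = file_hashes(train_dir)
    leaks = []
    for digest, test_paths in file_hashes(test_dir).items():
        for train_path in train.get(digest, []):
            leaks.extend((train_path, test_path) for test_path in test_paths)
    return leaks

# Example call (paths are hypothetical):
#   for train_path, test_path in find_leaks("data/train", "data/test"):
#       print(train_path, "==", test_path)
```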

9) Model optimization and export

Export format

For Flask local: save Keras .h5 or SavedModel.
For mobile: convert to TFLite.
For cross-platform: ONNX.

Export Keras .h5

Convert to TFLite (float16 quant)
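
Both exports can be done from the trained Keras model in a few lines. The sketch below assumes the model directory already exists; the .tflite file name and the function name are assumptions, while the .h5 path matches the one the Flask app loads.

```python
def export_model(model, h5_path="model/plant_model.h5",
                 tflite_path="model/plant_model.tflite"):
    """Save a Keras model as .h5 and as a float16-quantized TFLite file."""
    import tensorflow as tf  # imported lazily because TensorFlow is slow to load

    model.save(h5_path)  # the .h5 suffix selects the HDF5 format

    converter = tf.lite.TFLiteConverter.from_keras_model(model)
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    # Store weights as float16 instead of float32
    converter.target_spec.supported_types = [tf.float16]
    tflite_bytes = converter.convert()

    with open(tflite_path, "wb") as f:
        f.write(tflite_bytes)
    return h5_path, tflite_path
```

Float16 quantization roughly halves the file size, usually with a small effect on accuracy, which is why the validation step that follows is worth running.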

Validate exported model

Run sample inference on exported model and compare outputs to original model (sanity check).
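
One way to phrase that sanity check: the exported model should pick the same top class and produce probabilities within a small tolerance of the original. The tolerance of 1e-2 and the function name are assumptions; float16 models will not match bit for bit.

```python
def outputs_match(reference, candidate, atol=1e-2):
    """Compare two probability vectors: same top class and values within atol."""
    if len(reference) != len(candidate):
        return False
    ref_top = max(range(len(reference)), key=reference.__getitem__)
    cand_top = max(range(len(candidate)), key=candidate.__getitem__)
    close = all(abs(r - c) <= atol for r, c in zip(reference, candidate))
    return ref_top == cand_top and close

# Obtaining the two vectors (not executed here; batch is one preprocessed image):
#   reference = keras_model.predict(batch)[0]
#   interpreter = tf.lite.Interpreter(model_path="model/plant_model.tflite")
#   interpreter.allocate_tensors()
#   inp = interpreter.get_input_details()[0]["index"]
#   out = interpreter.get_output_details()[0]["index"]
#   interpreter.set_tensor(inp, batch.astype("float32"))
#   interpreter.invoke()
#   candidate = interpreter.get_tensor(out)[0]
```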

10) Building the model-serving API (Flask / FastAPI)

Rules

Load the model once when the server starts, not on every request.
Preprocessing and postprocessing must match the training pipeline.

Example Flask app (production-ready pattern)

app.py

import io

import numpy as np
import tensorflow as tf
from flask import Flask, request, jsonify, render_template
from PIL import Image

app = Flask(__name__)

# Load the model once at startup, not on every request
model = tf.keras.models.load_model("model/plant_model.h5")
labels = [...]  # load from JSON


def preprocess_image(image_bytes):
    # Must match training preprocessing: RGB, 224x224, scaled to [0, 1]
    img = Image.open(io.BytesIO(image_bytes)).convert("RGB").resize((224, 224))
    arr = np.array(img) / 255.0
    return np.expand_dims(arr, 0)  # add the batch dimension


@app.route("/identify", methods=["POST"])
def identify():
    if 'image' not in request.files:
        return jsonify({"error": "no file"}), 400
    file = request.files['image'].read()
    inp = preprocess_image(file)
    preds = model.predict(inp)[0]
    idx = int(np.argmax(preds))
    return jsonify({"class": labels[idx], "confidence": float(preds[idx])})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=7860)

FastAPI (if you need async + docs)

FastAPI generates OpenAPI docs automatically, which is useful when building an API for a front end.

11) Packaging and deployment (Docker, VPS, K8s)

Sample Dockerfile (Flask + Keras)

Docker Compose (nginx + app)

Nginx acts as the reverse proxy and handles static files and TLS.
The app runs under gunicorn with 2-4 workers.

Deployment options

VPS (Ubuntu) + Docker Compose
Cloud VM (DigitalOcean, AWS EC2)
Container service (AWS ECS, GCP Cloud Run)
Kubernetes (GKE/EKS) for large scale

Healthcheck

The /healthz endpoint returns 200 OK once the model has loaded successfully.
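
A sketch of such an endpoint, written so the readiness rule can be tested without a running server. Returning 503 while the model is not yet loaded is an assumption (the guide only specifies the 200 case), and the helper names are illustrative.

```python
def health_status(model):
    """Return (body, status_code): 200 once the model is loaded, 503 otherwise."""
    if model is None:
        return {"status": "loading"}, 503
    return {"status": "ok"}, 200


def register_healthz(app, get_model):
    """Attach GET /healthz to a Flask app; get_model() returns the model or None."""
    @app.route("/healthz", methods=["GET"])
    def healthz():
        # Flask (1.1+) converts a returned dict into a JSON response
        return health_status(get_model())
    return healthz

# Usage in app.py (not executed here):
#   register_healthz(app, lambda: model)
```

Load balancers and Docker healthchecks can then poll /healthz and route traffic only to instances that answer 200.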

12) CI/CD, monitoring, model versioning

CI/CD

Use GitHub Actions / GitLab CI to:
- Run linting and unit tests
- Build the Docker image → push it to a registry
- Deploy to staging → run smoke tests
Sample workflow: push => build image => run tests => deploy to server.

Model registry & versioning

Store model artifacts on S3 / MinIO or with DVC.
Store metadata: model_id, version, training config, metrics.
Use MLflow/W&B for tracking experiments.

Monitoring

Log every request (input hash, prediction, latency).
Use Prometheus + Grafana to monitor latency/throughput.
Model drift: monitor accuracy on a "golden test set" over time.
Alert when latency or error rate spikes.
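
The per-request log described above (input hash, prediction, latency) could be produced by a small wrapper around the prediction function. This sketch uses only the standard library; the logger name, the record fields, and the assumed predict_fn signature are illustrative.

```python
import hashlib
import json
import logging
import time

logger = logging.getLogger("inference")


def timed_predict(predict_fn, image_bytes):
    """Run predict_fn(image_bytes) -> (label, confidence) and log one JSON record."""
    start = time.perf_counter()
    label, confidence = predict_fn(image_bytes)
    latency_ms = (time.perf_counter() - start) * 1000
    record = {
        # A hash identifies repeated inputs without storing the image itself
        "input_sha256": hashlib.sha256(image_bytes).hexdigest(),
        "prediction": label,
        "confidence": round(float(confidence), 4),
        "latency_ms": round(latency_ms, 2),
    }
    logger.info(json.dumps(record))
    return record
```

One JSON object per line is easy to ship to a log aggregator, and the latency field can feed the latency dashboards and alerts.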

13) Privacy / Ethics / Legal checklist

Clearly tell users when their images are stored (privacy policy).
Collecting location data requires opt-in.
Check the licenses of images collected from the internet.
If the model is used to identify rare species, consider keeping that information confidential.

14) Final deployment checklist (pre-release)

Unit tests for the pipeline (preprocess, predict wrapper)
Integration tests (curl requests)
Smoke tests post-deploy
Model audit: confusion matrix + per-class metrics
Monitoring & alerting setup
Backup & rollback plan
Documentation (README + API docs)
Docker image scanned for vulnerabilities

15) Figures and diagrams: how to create them and the code to generate them

1) Flowchart (Graphviz)

You can create a flow.dot file:
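
One option is to generate the file from Python. The sketch below writes DOT source for the stages named in the project flow overview; the node styling, the dashed retrain edge, and the function name are assumptions.

```python
STAGES = ["Data collection", "Preprocess", "Train", "Eval",
          "Export model", "Serve via API", "Client (Web/Mobile)", "Feedback"]


def build_dot(stages, loop_back_to="Train"):
    """Return Graphviz DOT source: a left-to-right chain plus a retrain loop."""
    lines = ["digraph flow {", "  rankdir=LR;", "  node [shape=box, style=rounded];"]
    for src, dst in zip(stages, stages[1:]):
        lines.append(f'  "{src}" -> "{dst}";')
    # Feedback from the last stage triggers retraining
    lines.append(
        f'  "{stages[-1]}" -> "{loop_back_to}" [label="Retrain", style=dashed];')
    lines.append("}")
    return "\n".join(lines) + "\n"


with open("flow.dot", "w", encoding="utf-8") as f:
    f.write(build_dot(STAGES))
```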

Render it to PNG with the Graphviz dot command, for example: dot -Tpng flow.dot -o flow.png

2) Simple architecture diagram using matplotlib (python)

import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8,4))
ax.text(0.1,0.6,"Data\n(Images)", bbox=dict(boxstyle="round", facecolor="lightblue"))
ax.text(0.35,0.6,"Preprocess\nPipeline", bbox=dict(boxstyle="round", facecolor="lightgreen"))
ax.text(0.6,0.6,"Model\n(Train/Export)", bbox=dict(boxstyle="round", facecolor="lightcoral"))
ax.text(0.85,0.6,"Serving\n(Flask)", bbox=dict(boxstyle="round", facecolor="lightgrey"))
ax.arrow(0.23,0.6,0.1,0, head_width=0.02)
ax.arrow(0.48,0.6,0.1,0, head_width=0.02)
ax.arrow(0.73,0.6,0.08,0, head_width=0.02)
ax.axis('off')
plt.savefig("arch.png", dpi=150)

3) Confusion matrix heatmap (seaborn)
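
A minimal sketch of the heatmap, assuming seaborn and matplotlib are installed and that cm is a square matrix of integer counts with rows as true classes. The function name, the color map, and the output file name are assumptions.

```python
def plot_confusion_matrix(cm, labels, out_path="confusion_matrix.png"):
    """Draw cm (rows = true class, columns = predicted class) as a heatmap."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, ax = plt.subplots(figsize=(8, 6))
    # fmt="d" prints the integer counts inside each cell
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=labels, yticklabels=labels, ax=ax)
    ax.set_xlabel("Predicted")
    ax.set_ylabel("True")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)
    return out_path

# Example call (labels are hypothetical):
#   plot_confusion_matrix([[5, 1], [2, 7]], ["ficus", "monstera"])
```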

16) Example project: plant identification (quick repro)

Folder skeleton (local):

train.py contains the training pipeline; model_utils.py contains preprocessing and label loading.

17) Common errors and quick debugging

Model not found in Flask: check the path (relative vs absolute). Use os.path.join(os.path.dirname(__file__), 'model', 'plant_model.h5').
ModuleNotFoundError in a venv: activate the venv and install the packages inside it.
Preprocessing mismatch: make sure inference preprocessing equals training preprocessing (resize and normalization).
CORS issues: if the frontend is served from a different origin, enable CORS in Flask.
Slow inference: batch predictions, use ONNX Runtime, or quantize with TFLite.

18) Documentation and learning resources (brief)

TensorFlow docs (official)
Keras API
ONNX Runtime docs
Weights & Biases / MLflow (experiment tracking)
Docker, Gunicorn, Nginx guides

Conclusion (brief)

This is a comprehensive, step-by-step roadmap for delivering a complete AI project, from data design and training through to deployment and operations. The keys are high-quality data, a consistent preprocessing pipeline, and a solid deployment and monitoring process.