Machine Learning for Forecasting Vietnamese Stock Prices: Random Forest vs LSTM, Which Is Better?

Machine learning applied to the stock market is an appealing topic, but one full of misconceptions. This article compares Random Forest and LSTM in practice on predicting the price direction of a stock listed in Vietnam (the code uses the ticker VNM). It promises no profit; it only assesses real-world accuracy.

The Problem: Predicting Up or Down?

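Instead of forecasting the exact price (too hard), we forecast tomorrow's price direction: up (1) or down (0). This is a binary classification problem, which is easier to evaluate. Note that a day whose next close is unchanged also falls into class 0, because the label only checks whether tomorrow's close is strictly higher.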

Step 1: Prepare the Data

from vnstock import stock_historical_data
import pandas as pd
import numpy as np

df = stock_historical_data("VNM", "2020-01-01", "2026-06-01", "1D")
df.index = pd.to_datetime(df['time'])
df = df[['open', 'high', 'low', 'close', 'volume']].copy()

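# Target: 1 if tomorrow's close is higher than today's, otherwise 0
df['target'] = (df['close'].shift(-1) > df['close']).astype(int)
# The last row has no next-day close: its comparison evaluates to False (0),
# which dropna() would not catch, so drop that row explicitly
df = df.iloc[:-1].copy()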
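To make the labeling concrete, here is a minimal sketch on a handful of made-up closing prices (the numbers are illustrative, not market data). It shows how `shift(-1)` lines up tomorrow's close with today's row, and why the final row has no usable label.

```python
import pandas as pd

# Toy closing prices (made-up numbers, for illustration only)
close = pd.Series([100.0, 102.0, 101.0, 101.0, 105.0], name="close")

next_close = close.shift(-1)                # tomorrow's close; NaN on the last row
target = (next_close > close).astype(int)   # NaN > x is False, so the last row becomes 0

labels = pd.DataFrame({"close": close, "next_close": next_close, "target": target})
labels = labels.iloc[:-1]                   # drop the last row: its true label is unknown
print(labels)
```

The unchanged day (101 followed by 101) is labeled 0, which is worth keeping in mind when reading the "Down" class in any report.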

Step 2: Feature Engineering

def add_features(df):
    # Momentum features
    df['ret_1d']  = df['close'].pct_change(1)
    df['ret_5d']  = df['close'].pct_change(5)
    df['ret_20d'] = df['close'].pct_change(20)

    # Moving averages
    df['sma5']   = df['close'].rolling(5).mean()
    df['sma20']  = df['close'].rolling(20).mean()
    df['sma50']  = df['close'].rolling(50).mean()
    df['ema20']  = df['close'].ewm(span=20).mean()

    # Price relative to MA
    df['price_sma20_ratio'] = df['close'] / df['sma20']
    df['sma5_sma20_ratio']  = df['sma5']  / df['sma20']

    # RSI
    delta = df['close'].diff()
    gain  = delta.clip(lower=0).rolling(14).mean()
    loss  = (-delta.clip(upper=0)).rolling(14).mean()
    df['rsi'] = 100 - (100 / (1 + gain/loss))

    # MACD
    ema12       = df['close'].ewm(span=12).mean()
    ema26       = df['close'].ewm(span=26).mean()
    df['macd']  = ema12 - ema26
    df['macd_signal'] = df['macd'].ewm(span=9).mean()
    df['macd_hist']   = df['macd'] - df['macd_signal']

    # Volatility
    df['volatility_10'] = df['ret_1d'].rolling(10).std()
    df['volatility_20'] = df['ret_1d'].rolling(20).std()

    # Volume features
    df['vol_ma20_ratio'] = df['volume'] / df['volume'].rolling(20).mean()

    # Candle body
    df['body']       = (df['close'] - df['open']) / df['open']
    df['upper_wick'] = (df['high']  - df[['open','close']].max(axis=1)) / df['open']
    df['lower_wick'] = (df[['open','close']].min(axis=1) - df['low'])   / df['open']

    return df

df = add_features(df)
df.dropna(inplace=True)

FEATURES = ['ret_1d', 'ret_5d', 'ret_20d', 'price_sma20_ratio',
            'sma5_sma20_ratio', 'rsi', 'macd_hist',
            'volatility_10', 'vol_ma20_ratio', 'body',
            'upper_wick', 'lower_wick']

X = df[FEATURES]
y = df['target']
print(f"Dataset: {len(X)} samples, {len(FEATURES)} features")
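The RSI in `add_features` averages gains and losses with a simple 14-day rolling mean. The classic formulation uses Wilder smoothing instead, which is an exponential average with alpha = 1/period. A minimal sketch of that variant follows; the function name `rsi_wilder` and the toy price series are assumptions made for illustration, not part of the original pipeline.

```python
import numpy as np
import pandas as pd

def rsi_wilder(close: pd.Series, period: int = 14) -> pd.Series:
    """RSI using Wilder smoothing (exponential average with alpha = 1/period)."""
    delta = close.diff()
    gain = delta.clip(lower=0)        # positive moves, 0 otherwise
    loss = (-delta).clip(lower=0)     # size of negative moves, 0 otherwise
    avg_gain = gain.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    avg_loss = loss.ewm(alpha=1 / period, adjust=False, min_periods=period).mean()
    rs = avg_gain / avg_loss
    return 100 - 100 / (1 + rs)

# Synthetic random-walk prices, only to exercise the function
rng = np.random.default_rng(0)
toy_close = pd.Series(100 + rng.normal(0, 1, 200).cumsum())
toy_rsi = rsi_wilder(toy_close)
print(toy_rsi.dropna().describe())
```

Either version can serve as a feature; the point is only that the two definitions give different values, so the choice should be stated when results are compared.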

Model 1: Random Forest

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

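# Chronological hold-out split: train on the first 80%, test on the last 20%
# (no future data in training; this is a single split, not full walk-forward validation)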
split = int(len(X) * 0.8)
X_train, X_test = X.iloc[:split], X.iloc[split:]
y_train, y_test = y.iloc[:split], y.iloc[split:]

rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_leaf=20,
    random_state=42
)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
rf_acc  = accuracy_score(y_test, rf_pred)

print("\n=== Random Forest ===")
print(f"Accuracy: {rf_acc:.1%}")
print(classification_report(y_test, rf_pred, target_names=['Down', 'Up']))

# Feature importance
fi = pd.Series(rf_model.feature_importances_, index=FEATURES)
print("\nTop 5 most important features:")
print(fi.sort_values(ascending=False).head())
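A single 80/20 split gives one accuracy number for one market period. Walk-forward validation repeats the idea over several consecutive periods, always training on the past and testing on the block that follows. The sketch below uses scikit-learn's `TimeSeriesSplit` on synthetic stand-in data (`X_demo` and `y_demo` are random and hypothetical); with the real dataset one would pass `X.values` and `y.values` instead.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(600, 4))       # synthetic stand-in features
y_demo = rng.integers(0, 2, size=600)    # synthetic random labels

tscv = TimeSeriesSplit(n_splits=5)       # expanding training window, 5 test blocks
fold_acc = []
for train_idx, test_idx in tscv.split(X_demo):
    # Every training index precedes every test index, so no future data is used
    assert train_idx.max() < test_idx.min()
    clf = RandomForestClassifier(n_estimators=50, min_samples_leaf=20, random_state=42)
    clf.fit(X_demo[train_idx], y_demo[train_idx])
    fold_acc.append(accuracy_score(y_demo[test_idx], clf.predict(X_demo[test_idx])))

print([f"{a:.1%}" for a in fold_acc])
print(f"Mean accuracy: {np.mean(fold_acc):.1%}")
```

Because the labels here are random, the fold accuracies should hover around 50%. On real data, the spread between folds is as informative as the mean: a model that scores well in only one fold is probably fitting a single market regime.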

Model 2: LSTM

import tensorflow as tf
from sklearn.preprocessing import StandardScaler

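# Chronological split point: the same 80/20 cut used for Random Forest
split = int(len(X) * 0.8)

# Standardize the data. Fit the scaler on the training rows only,
# so test-period statistics do not leak into training
scaler = StandardScaler()
scaler.fit(X.iloc[:split])
X_scaled = scaler.transform(X)

# Build sequences for the LSTM: the 20 most recent days (including today)
# are used to predict the next day's direction
SEQ_LEN = 20

def create_sequences(X, y, seq_len):
    Xs, ys = [], []
    for i in range(seq_len - 1, len(X)):
        Xs.append(X[i - seq_len + 1:i + 1])  # window ends on day i
        ys.append(y[i])                      # y[i]: does day i+1 close above day i?
    return np.array(Xs), np.array(ys)

X_seq, y_seq = create_sequences(X_scaled, y.values, SEQ_LEN)
# Sequence k ends on row k + SEQ_LEN - 1, so sequences ending before `split`
# go to training and the test days match the Random Forest test set
split_seq = split - SEQ_LEN + 1

X_tr, X_te = X_seq[:split_seq], X_seq[split_seq:]
y_tr, y_te = y_seq[:split_seq], y_seq[split_seq:]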

# Build the LSTM
model = tf.keras.Sequential([
    tf.keras.layers.LSTM(64, return_sequences=True, input_shape=(SEQ_LEN, len(FEATURES))),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_tr, y_tr, epochs=30, batch_size=32, validation_split=0.1, verbose=0)

_, lstm_acc = model.evaluate(X_te, y_te, verbose=0)
print("\n=== LSTM ===")
print(f"Accuracy: {lstm_acc:.1%}")
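The shape of the LSTM input is easy to get wrong, so it may help to look at the windowing on a tiny array. The sketch below is one way to pair each window with the label of its last row, so that the most recent day is always inside the window; `make_windows` and the toy arrays are hypothetical names and values used only for illustration.

```python
import numpy as np

def make_windows(features, labels, seq_len):
    """Pair each window of `seq_len` consecutive rows with the label of its last row."""
    windows, targets = [], []
    for end in range(seq_len - 1, len(features)):
        windows.append(features[end - seq_len + 1:end + 1])
        targets.append(labels[end])
    return np.array(windows), np.array(targets)

toy_X = np.arange(12, dtype=float).reshape(6, 2)   # 6 days, 2 features per day
toy_y = np.array([1, 0, 1, 1, 0, 1])               # one label per day

W, t = make_windows(toy_X, toy_y, seq_len=3)
print(W.shape, t.shape)   # (samples, time steps, features) and (samples,)
print(W[0])               # first window: days 0, 1 and 2
print(t)                  # labels of days 2, 3, 4 and 5
```

Six days with a window of three yield four samples. The first `seq_len - 1` days never appear as the last row of a window, so their labels are not used.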

Real-World Comparison Results

	Random Forest	LSTM
Accuracy	~54–58%	~53–57%
Training time	~10 seconds	5–10 minutes
GPU needed?	No	Recommended
Interpretable?	✅ (feature importance)	❌ (black box)
Easy to deploy?	✅	More complex
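Accuracies this close to 50% deserve a quick check against sampling noise. The sketch below computes a normal-approximation 95% confidence interval for an accuracy measured on a test set. The test-set size of 300 is an assumption chosen for illustration (roughly one fifth of a few years of daily bars), and `accuracy_interval` is a hypothetical helper, not part of the pipeline above.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Normal-approximation 95% confidence interval for an accuracy measured on n samples."""
    se = math.sqrt(acc * (1 - acc) / n)   # standard error of a proportion
    return acc - z * se, acc + z * se

n_test = 300  # assumed test-set size, for illustration only
for acc in (0.54, 0.58):
    lo, hi = accuracy_interval(acc, n_test)
    print(f"accuracy {acc:.0%}: 95% CI [{lo:.1%}, {hi:.1%}]")
```

Under these assumptions, the interval around 54% still contains 50%, while the interval around 58% does not. A one or two point gap between the two models on a test set of this size is therefore well within noise.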

Practical Conclusion

Random Forest usually beats or matches LSTM on traditional financial data, with less compute and better interpretability. LSTM only gains an edge when the dataset is large enough (>50,000 samples) and contains many complex sequential patterns.

Important: an accuracy of 55–58% can be profitable when combined with good risk management, but it is not enough to rely on entirely. Use ML as a filter within a strategy, not as a magic formula.
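Whether a given accuracy is profitable depends on the size of wins, the size of losses, and trading costs. A minimal sketch of that arithmetic follows: the expected return per trade is p × avg_win − (1 − p) × avg_loss − cost, and setting it to zero gives the break-even accuracy. All numbers below (1% average win, 1% average loss, 0.1% round-trip cost) are hypothetical inputs, not measured values.

```python
def expected_return(p_win, avg_win, avg_loss, cost):
    """Expected return per trade, given hit rate, average win/loss size and round-trip cost."""
    return p_win * avg_win - (1 - p_win) * avg_loss - cost

def breakeven_accuracy(avg_win, avg_loss, cost):
    """Hit rate at which the expected return per trade is exactly zero."""
    return (avg_loss + cost) / (avg_win + avg_loss)

# Hypothetical inputs, for illustration only
avg_win, avg_loss, cost = 0.010, 0.010, 0.001

p_star = breakeven_accuracy(avg_win, avg_loss, cost)
print(f"Break-even accuracy: {p_star:.1%}")
for p in (0.53, 0.55, 0.58):
    print(f"accuracy {p:.0%}: expected return per trade {expected_return(p, avg_win, avg_loss, cost):+.4%}")
```

With symmetric wins and losses, costs alone push the break-even point above 50%, which is why the risk-management side (keeping average losses smaller than average wins) matters as much as the classifier itself.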

admin

Editor, Hướng Nghiệp Dữ Liệu
