A Guide to Using Pydantic to Validate LLM Outputs

Introduction

Large language models (LLMs) generate text, and that text does not always conform to a structured format such as JSON. Even when you ask for JSON, the result can contain misnamed fields, missing fields, wrong data types, or stray commentary and markdown. Left unchecked, these errors can crash your application or cause bugs that are hard to track down.
Pydantic is a Python library for validating data against type annotations at runtime. When you receive a response from an LLM, Pydantic checks that the data matches the schema you defined, coerces types where it can, and reports clear errors when something is wrong. As a result, any data that passes validation has the structure and types your code expects.
This article shows how to apply Pydantic to LLM output validation in practice: defining schemas, validating nested data, integrating with the OpenAI API, LangChain, and LlamaIndex, and retrying when validation fails.

You can find the source code on GitHub. To get started, install Pydantic 2.x with the email extra, which EmailStr requires:
pip install "pydantic[email]"

Quick start with Pydantic and LLMs

Suppose you want to extract contact information from natural-language text, using an LLM to analyze the text and Pydantic to validate the result:


from pydantic import BaseModel, EmailStr, field_validator
from typing import Optional
class ContactInfo(BaseModel):
    name: str
    email: EmailStr
    phone: Optional[str] = None
    company: Optional[str] = None
    @field_validator('phone')
    @classmethod
    def validate_phone(cls, v):
        if v is None:
            return v
        cleaned = ''.join(filter(str.isdigit, v))
        if len(cleaned) < 10:
            raise ValueError('Phone number must have at least 10 digits')
        return cleaned

This model checks data types (str, email, and so on), distinguishes required fields from optional ones, and can run custom validation such as normalizing a phone number.
When you receive a response from the LLM:


import json
# The email below is a placeholder address for illustration.
llm_response = '''
{
    "name": "Sarah Johnson",
    "email": "sarah.johnson@example.com",
    "phone": "(555) 123-4567",
    "company": "TechCorp Industries"
}
'''
data = json.loads(llm_response)
contact = ContactInfo(**data)
print(contact.name)
print(contact.email)
print(contact.model_dump())

If the data has the wrong type or a required field is missing, Pydantic raises a ValidationError that identifies the offending field.
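To make the error reporting concrete, here is a minimal sketch of what a failed validation looks like. The Ticket model and its payloads are hypothetical and exist only for this illustration; ValidationError.errors() and model_validate_json() are part of the Pydantic 2.x API.

```python
from pydantic import BaseModel, ValidationError


class Ticket(BaseModel):
    # Hypothetical model, used only to illustrate error reporting.
    title: str
    priority: int
    assignee: str


# "priority" is not an integer and "assignee" is missing entirely.
bad_payload = '{"title": "Login fails", "priority": "high"}'

problems = []
try:
    # model_validate_json parses and validates the JSON string in one step.
    Ticket.model_validate_json(bad_payload)
except ValidationError as exc:
    # Each error is a dict with the field path ("loc") and an error type.
    problems = [(err["loc"], err["type"]) for err in exc.errors()]
    for loc, err_type in problems:
        print(f"{'.'.join(map(str, loc))}: {err_type}")

# In the default (lax) mode, a numeric string is coerced to int.
ok = Ticket.model_validate_json(
    '{"title": "Login fails", "priority": "2", "assignee": "dana"}'
)
print(ok.priority + 1)
```

Iterating over errors() rather than printing the exception gives you structured data you can log, count, or send back to the model.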

Handling LLM output that is not clean JSON

In practice, LLMs often wrap the JSON in commentary, markdown, or other surrounding text. To handle this, you can use a regular expression to pull out the JSON portion and then validate it:


from pydantic import BaseModel, ValidationError, field_validator
import json
import re
class ProductReview(BaseModel):
    product_name: str
    rating: int
    review_text: str
    would_recommend: bool
    @field_validator('rating')
    @classmethod
    def validate_rating(cls, v):
        if not 1 <= v <= 5:
            raise ValueError('Rating must be an integer between 1 and 5')
        return v
def extract_json_from_llm_response(response: str) -> dict:
    """Extract JSON from an LLM response that contains extra text."""
    # Greedy match: spans from the first "{" to the last "}" in the response.
    json_match = re.search(r'\{.*\}', response, re.DOTALL)
    if json_match:
        return json.loads(json_match.group())
    raise ValueError("No JSON found in response")
def parse_review(llm_output: str) -> ProductReview:
    """Parse and validate review data from LLM output."""
    try:
        data = extract_json_from_llm_response(llm_output)
        review = ProductReview(**data)
        return review
    except json.JSONDecodeError as e:
        print(f"JSON error: {e}")
        raise
    except ValidationError as e:
        print(f"Validation error: {e}")
        raise
    except Exception as e:
        print(f"Other error: {e}")
        raise

Example of a messy LLM response:


messy_response = '''
Here's the review in JSON format:
{
    "product_name": "Wireless Headphones X100",
    "rating": 4,
    "review_text": "Great sound quality, comfortable for long use.",
    "would_recommend": true
}
Hope this helps!
'''
review = parse_review(messy_response)
print(f"Product: {review.product_name}")
print(f"Rating: {review.rating}/5")
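The greedy pattern \{.*\} spans from the first { to the last } in the response, so it breaks when the surrounding text itself contains braces. One possible alternative, sketched below, scans for each { and lets json.JSONDecoder.raw_decode from the standard library decide where the object ends. The function name extract_first_json_object is made up for this example.

```python
import json


def extract_first_json_object(response: str) -> dict:
    """Return the first decodable JSON object found in `response`."""
    decoder = json.JSONDecoder()
    start = response.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(response, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        # Not valid JSON at this brace; try the next one.
        start = response.find("{", start + 1)
    raise ValueError("No JSON object found in response")


fence = "`" * 3  # markdown code fence, built at runtime
tricky_response = (
    "Sure! Here is the review:\n"
    f"{fence}json\n"
    '{"product_name": "Desk Lamp", "rating": 5, "tags": {"color": "black"}}\n'
    f"{fence}\n"
    "Let me know if {anything} else is needed."
)

data = extract_first_json_object(tricky_response)
print(data["product_name"], data["rating"])
```

The returned dict can then be passed to a Pydantic model exactly as in parse_review.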

Working with nested data (nested models)

When the data is more complex, for example a product with several specifications and reviews:


from pydantic import BaseModel, Field, field_validator
from typing import List
class Specification(BaseModel):
    key: str
    value: str
class Review(BaseModel):
    reviewer_name: str
    rating: int = Field(..., ge=1, le=5)
    comment: str
    verified_purchase: bool = False
class Product(BaseModel):
    id: str
    name: str
    price: float = Field(..., gt=0)
    category: str
    specifications: List[Specification]
    reviews: List[Review]
    average_rating: float = Field(..., ge=1, le=5)
    @field_validator('average_rating')
    @classmethod
    def check_average_matches_reviews(cls, v, info):
        reviews = info.data.get('reviews', [])
        if reviews:
            calculated_avg = sum(r.rating for r in reviews) / len(reviews)
            if abs(calculated_avg - v) > 0.1:
                raise ValueError(
                    f'Average rating {v} does not match calculated average {calculated_avg:.2f}'
                )
        return v

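This model validates data at every level: if a review has the wrong type or the average rating does not match the individual ratings, validation fails with an error that points to the problem. Note that the validator reads info.data, which only contains fields validated earlier, so reviews must be declared before average_rating.
Example of validating a product: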


llm_response = {
    "id": "PROD-2024-001",
    "name": "Smart Coffee Maker",
    "price": 129.99,
    "category": "Kitchen Appliances",
    "specifications": [
        {"key": "Capacity", "value": "12 cups"},
        {"key": "Power", "value": "1000W"},
        {"key": "Color", "value": "Stainless Steel"}
    ],
    "reviews": [
        {
            "reviewer_name": "Alex M.",
            "rating": 5,
            "comment": "Makes excellent coffee every time!",
            "verified_purchase": True
        },
        {
            "reviewer_name": "Jordan P.",
            "rating": 4,
            "comment": "Good but a bit noisy",
            "verified_purchase": True
        }
    ],
    "average_rating": 4.5
}
product = Product(**llm_response)
print(f"{product.name}: ${product.price}")
print(f"Average rating: {product.average_rating}")
print(f"Number of reviews: {len(product.reviews)}")
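When a nested item is invalid, the error location tells you which element failed. The sketch below uses trimmed-down, hypothetical ReviewItem and CatalogEntry models (not the Product model above) to show how loc encodes the path into the nested data.

```python
from typing import List

from pydantic import BaseModel, Field, ValidationError


class ReviewItem(BaseModel):
    # Hypothetical, reduced version of a review.
    reviewer_name: str
    rating: int = Field(..., ge=1, le=5)


class CatalogEntry(BaseModel):
    # Hypothetical container model with a list of nested reviews.
    name: str
    reviews: List[ReviewItem]


payload = {
    "name": "Smart Coffee Maker",
    "reviews": [
        {"reviewer_name": "Alex M.", "rating": 5},
        {"reviewer_name": "Jordan P.", "rating": 9},  # out of range
    ],
}

error_paths = []
try:
    CatalogEntry.model_validate(payload)
except ValidationError as exc:
    for err in exc.errors():
        # loc is a tuple: field name, list index, nested field name.
        error_paths.append(err["loc"])
        print(err["loc"], "->", err["msg"])
```

Here the path points at the second review (index 1) and its rating field, which is usually enough to locate the bad value in the raw LLM output.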

Using Pydantic with the OpenAI API, LangChain, and LlamaIndex

Integrating with the OpenAI API

You can send a prompt to GPT, ask for JSON that follows your schema, and then validate the response with Pydantic:


from openai import OpenAI
from pydantic import BaseModel
from typing import List
import json
import os
client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))
class BookSummary(BaseModel):
    title: str
    author: str
    genre: str
    key_themes: List[str]
    main_characters: List[str]
    brief_summary: str
    recommended_for: List[str]
def extract_book_info(text: str) -> BookSummary:
    # %-formatting is used so the literal braces in the JSON template need no escaping.
    prompt = """
    Extract book information from the following text and return it as JSON.
    Required format:
    {
        "title": "book title",
        "author": "author name",
        "genre": "genre",
        "key_themes": ["theme1", "theme2"],
        "main_characters": ["character1", "character2"],
        "brief_summary": "summary in 2-3 sentences",
        "recommended_for": ["audience1", "audience2"]
    }
    Text: %s
    Return ONLY the JSON, no additional text.
    """ % text
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that extracts structured data."},
            {"role": "user", "content": prompt}
        ],
        temperature=0
    )
    llm_output = response.choices[0].message.content
    # json.loads raises JSONDecodeError if the model added text around the JSON.
    data = json.loads(llm_output)
    # BookSummary raises ValidationError if fields are missing or mistyped.
    return BookSummary(**data)

Always validate LLM output with Pydantic, no matter how detailed the prompt is.
Example usage:


book_text = """
'The Midnight Library' by Matt Haig is a contemporary fiction novel that explores themes of regret, mental health, and the infinite possibilities of life. The story follows Nora Seed, a woman who finds herself in a library between life and death, where each book represents a different life she could have lived. Through her journey, she encounters various versions of herself and must decide what truly makes a life worth living. The book resonates with readers dealing with depression, anxiety, or life transitions.
"""
try:
    book_info = extract_book_info(book_text)
    print(f"Title: {book_info.title}")
    print(f"Author: {book_info.author}")
    print(f"Themes: {', '.join(book_info.key_themes)}")
except Exception as e:
    print(f"Error extracting book information: {e}")
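Hand-writing the expected format in the prompt means the prompt and the model can drift apart over time. One option, sketched here, is to generate the format description from the model itself with model_json_schema(). The build_extraction_prompt helper and the shortened BookCard model are illustrative names, not part of any library.

```python
import json
from typing import List

from pydantic import BaseModel, Field


class BookCard(BaseModel):
    # Hypothetical, shortened version of a book summary model.
    title: str = Field(description="Book title")
    author: str = Field(description="Author name")
    key_themes: List[str] = Field(description="Main themes of the book")


def build_extraction_prompt(text: str) -> str:
    """Embed the model's JSON Schema in the prompt so both stay in sync."""
    schema = json.dumps(BookCard.model_json_schema(), indent=2)
    return (
        "Extract book information from the text below.\n"
        "Return ONLY a JSON object that conforms to this JSON Schema:\n"
        f"{schema}\n\n"
        f"Text: {text}"
    )


prompt = build_extraction_prompt("'The Midnight Library' by Matt Haig")
print(prompt)
```

Because the schema is derived from the class, adding or renaming a field updates the prompt automatically; the response still needs to be validated as before.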

Combining LangChain with Pydantic

LangChain can extract structured data from an LLM directly into a Pydantic model.
Option 1: use PydanticOutputParser:


from langchain_openai import ChatOpenAI
# langchain_core is the current home of these classes in recent LangChain releases.
from langchain_core.output_parsers import PydanticOutputParser
from langchain_core.prompts import PromptTemplate
from pydantic import BaseModel, Field
from typing import List, Optional
class Restaurant(BaseModel):
    name: str = Field(description="Restaurant name")
    cuisine: str = Field(description="Type of cuisine")
    price_range: str = Field(description="Price range: $, $$, $$$, or $$$$")
    rating: Optional[float] = Field(default=None, description="Rating on a 5.0 scale")
    specialties: List[str] = Field(description="Signature dishes")
def extract_restaurant_with_parser(text: str) -> Restaurant:
    parser = PydanticOutputParser(pydantic_object=Restaurant)
    prompt = PromptTemplate(
        template="Extract restaurant information from the following text.\n{format_instructions}\n{text}\n",
        input_variables=["text"],
        partial_variables={"format_instructions": parser.get_format_instructions()}
    )
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    chain = prompt | llm | parser
    result = chain.invoke({"text": text})
    return result

Option 2: use the model's native function calling through with_structured_output():


def extract_restaurant_structured(text: str) -> Restaurant:
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    structured_llm = llm.with_structured_output(Restaurant)
    prompt = PromptTemplate.from_template(
        "Extract restaurant information from the following text:\n\n{text}"
    )
    chain = prompt | structured_llm
    result = chain.invoke({"text": text})
    return result

Example usage:


restaurant_text = """
Mama's Italian Kitchen is a cozy family-owned restaurant serving authentic Italian cuisine. Rated 4.5 stars, it's known for its homemade pasta and wood-fired pizzas. Prices are moderate ($$), and their signature dishes include lasagna bolognese and tiramisu.
"""
try:
    restaurant_info = extract_restaurant_structured(restaurant_text)
    print(f"Restaurant: {restaurant_info.name}")
    print(f"Cuisine: {restaurant_info.cuisine}")
    print(f"Specialties: {', '.join(restaurant_info.specialties)}")
except Exception as e:
    print(f"Error: {e}")

Using LlamaIndex with Pydantic

LlamaIndex also offers several ways to extract structured data, which is especially useful when processing large documents.
The simplest is LLMTextCompletionProgram:


from llama_index.core.program import LLMTextCompletionProgram
from pydantic import BaseModel, Field
from typing import List, Optional
# Note: this redefines the Product name used in the nested-model section.
class Product(BaseModel):
    name: str = Field(description="Product name")
    brand: str = Field(description="Brand")
    category: str = Field(description="Product category")
    price: float = Field(description="Price in USD")
    features: List[str] = Field(description="Key features")
    rating: Optional[float] = Field(default=None, description="Customer rating, 1-5")
def extract_product_simple(text: str) -> Product:
    prompt_template_str = """
    Extract product information from the following text and structure it properly: {text}
    """
    program = LLMTextCompletionProgram.from_defaults(
        output_cls=Product,
        prompt_template_str=prompt_template_str,
        verbose=False
    )
    result = program(text=text)
    return result

If you need finer control, use a dedicated output parser:


from llama_index.core.program import LLMTextCompletionProgram
from llama_index.core.output_parsers import PydanticOutputParser
from llama_index.llms.openai import OpenAI
def extract_product_with_parser(text: str) -> Product:
    prompt_template_str = """
    Extract product information from the following text: {text} {format_instructions}
    """
    llm = OpenAI(model="gpt-4o-mini", temperature=0)
    program = LLMTextCompletionProgram.from_defaults(
        output_parser=PydanticOutputParser(output_cls=Product),
        prompt_template_str=prompt_template_str,
        llm=llm,
        verbose=False
    )
    result = program(text=text)
    return result

A practical example:


product_text = """
The Sony WH-1000XM5 wireless headphones feature industry-leading noise cancellation, exceptional sound quality, and up to 30 hours of battery life. Priced at $399.99, these premium headphones include Adaptive Sound Control, multipoint connection, and speak-to-chat technology. Customers rate them 4.7 out of 5 stars.
"""
try:
    product_info = extract_product_with_parser(product_text)
    print(f"Product: {product_info.name}")
    print(f"Brand: {product_info.brand}")
    print(f"Price: ${product_info.price}")
    print(f"Features: {', '.join(product_info.features)}")
except Exception as e:
    print(f"Error: {e}")

Retrying automatically on validation errors

If the LLM returns invalid data, you can automatically send a new prompt that includes the error as feedback so the LLM can correct it:


from pydantic import BaseModel, ValidationError
from typing import Optional
import json
class EventExtraction(BaseModel):
    event_name: str
    date: str
    location: str
    attendees: int
    event_type: str
def extract_with_retry(llm_call_function, max_retries: int = 3) -> Optional[EventExtraction]:
    """Call the LLM up to max_retries times, feeding back the previous error."""
    last_error = None
    for attempt in range(max_retries):
        try:
            response = llm_call_function(last_error)
            data = json.loads(response)
            return EventExtraction(**data)
        except ValidationError as e:
            last_error = str(e)
            print(f"Attempt {attempt + 1} failed: {last_error}")
        except json.JSONDecodeError:
            print(f"Attempt {attempt + 1}: invalid JSON")
            last_error = "The response was not valid JSON. Please return only valid JSON."
    print("Maximum number of attempts reached, giving up")
    return None

A mock LLM function that corrects its output step by step in response to feedback:


def mock_llm_call(previous_error: Optional[str] = None) -> str:
    if previous_error is None:
        # First attempt: "attendees" and "event_type" are missing.
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco"}'
    elif "field required" in previous_error.lower():
        # Second attempt: all fields present, but "attendees" is not an integer.
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco", "attendees": "about 500", "event_type": "Conference"}'
    else:
        # Third attempt: fully valid.
        return '{"event_name": "Tech Conference 2024", "date": "2024-06-15", "location": "San Francisco", "attendees": 500, "event_type": "Conference"}'
result = extract_with_retry(mock_llm_call)
if result:
    print(f"\nSuccess! Extracted event: {result.event_name}")
    print(f"Expected attendees: {result.attendees}")
else:
    print("Could not extract valid data")
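The full string form of a ValidationError includes extra detail such as documentation links, which the model does not need. A possible refinement, sketched below, is to turn exc.errors() into a short list before sending it back. format_errors_for_llm is a hypothetical helper, and EventExtraction is redefined so the snippet runs on its own.

```python
from pydantic import BaseModel, ValidationError


class EventExtraction(BaseModel):
    event_name: str
    date: str
    location: str
    attendees: int
    event_type: str


def format_errors_for_llm(exc: ValidationError) -> str:
    """Summarize validation errors as one short line per field."""
    lines = []
    for err in exc.errors():
        # Join the location tuple into a dotted path, e.g. "reviews.1.rating".
        field = ".".join(str(part) for part in err["loc"])
        lines.append(f"- {field}: {err['msg']}")
    return "Fix these problems and return only JSON:\n" + "\n".join(lines)


feedback = ""
try:
    # "attendees" is not an integer and "event_type" is missing.
    EventExtraction.model_validate_json(
        '{"event_name": "Tech Conference 2024", "date": "2024-06-15", '
        '"location": "San Francisco", "attendees": "about 500"}'
    )
except ValidationError as exc:
    feedback = format_errors_for_llm(exc)

print(feedback)
```

The resulting string can be passed as the previous_error argument in a retry loop like extract_with_retry, in place of str(e).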

Conclusion

Pydantic turns free-form LLM output into structured, correctly typed data that is stable and easy to control. When building AI applications:

Define clear schemas that match your needs
Validate every LLM output before using it
Handle errors and retry automatically when needed
Include the schema in the prompt so the model returns data in the right format

Start with simple models and add checks as you run into real-world cases. Good luck building effective AI systems!


About The Author

Le Quoc Thai
