Gemma 4: The Complete Guide, from Choosing a Model and Installing It to Fine-Tuning

An in-depth Gemma 4 guide covering all 4 models (E2B to 31B), installation with Ollama, Hugging Face, and vLLM, Thinking Mode, Function Calling, image and audio understanding, fine-tuning with LoRA, and deploying on Android.

6 Apr 2026 · 25 min

Gemma 4, Google, AI, Open Source, LLM, Tutorial, Fine-Tuning, On-Device AI

1. What Is Gemma 4? A Quick Summary

Gemma 4 is a family of open-weight AI models from Google DeepMind, released on 2 April 2026.

What you need to know:

Built on Gemini 3 technology: the open-weight counterpart to the flagship models Google uses itself
4 sizes, from models that run on a phone up to server-class models
Apache 2.0 license: a switch from the earlier, more restrictive Gemma License to a standard permissive license that allows commercial use and that most legal teams already know
Multimodal at the core: accepts text, image, and audio in a single model (audio on E2B and E4B only), with no need to chain several pipelines together
Supports 140+ languages, including Thai
The 31B model ranked 3rd on the Arena AI text leaderboard at the time of writing, with an Elo of ~1452

This article walks through choosing a model, installing it, using each feature, fine-tuning, and real deployment. It is not news and not an overview; it is a tutorial you can follow step by step.

2. Which Model Should You Choose? Comparing the 4 Sizes

Model	Params (Total / Active)	Context	Image Input	Audio Input	RAM (Q4)	Best For
E2B	5.1B / 2.3B	128K	✓	✓	~3.2 GB	Phones, edge devices, experiments
E4B	8B / 4.5B	128K	✓	✓	~5 GB	Laptops, everyday use
26B A4B	25.2B / 3.8B (MoE)	256K	✓	✗	~15.6 GB	Best bang for the buck
31B	30.7B / 30.7B	256K	✓	✗	~17.4 GB	Maximum quality

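What does "E" mean?

E stands for Effective: the number of parameters that are actually active during inference. For example, E2B has 5.1B parameters in total but only 2.3B are active at run time, so it delivers better quality than a typical 2B model without using much more RAM.

26B A4B is an MoE

The 26B model uses a Mixture of Experts architecture: it has many experts, but only ~3.8B parameters are active for each token. Quality is close to the 31B model with far less compute per token. All experts still have to be loaded, so memory use stays close to the 31B model (~15.6 GB vs ~17.4 GB at Q4); the saving is in compute and latency. If your GPU budget is limited but you need high quality, this is the best option.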

How to choose?

Experiments / phones / edge → E2B
Everyday use on a laptop → E4B (the default in Ollama)
Production that needs high quality on a limited GPU budget → 26B A4B
Maximum quality with enough GPU → 31B
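
The selection rules above can be sketched as a small helper. This is only an illustration: the RAM figures are copied from the comparison table, while the function name `pick_model` and the 1.5x `headroom` factor (room for the OS and the KV cache) are assumptions, not anything Gemma or Ollama prescribes.

```python
# Rows copied from the comparison table: (name, approx. RAM at Q4 in GB, audio input).
MODELS = [
    ("E2B", 3.2, True),
    ("E4B", 5.0, True),
    ("26B A4B", 15.6, False),
    ("31B", 17.4, False),
]


def pick_model(free_ram_gb, need_audio=False, headroom=1.5):
    """Return the largest model whose Q4 footprint times `headroom` fits, or None."""
    candidates = [
        name
        for name, ram_gb, has_audio in MODELS
        if ram_gb * headroom <= free_ram_gb and (has_audio or not need_audio)
    ]
    # MODELS is ordered from smallest to largest, so the last fit is the biggest.
    return candidates[-1] if candidates else None


print(pick_model(8))                    # laptop with 8 GB free
print(pick_model(32))                   # workstation with 32 GB free
print(pick_model(32, need_audio=True))  # audio is limited to E2B / E4B
```

Tune `headroom` to your own workload; long contexts need more room than short chats.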

3. Installation: 5 Ways, from Easiest to Most Serious

3.1 Google AI Studio (nothing to install)

The fastest way to try Gemma 4 is to open Google AI Studio and select a Gemma 4 model.

Free, no credit card required
Supports every size
Provides an API key so you can call it over REST right away
Rate limit: ~15 requests/minute on the free tier

Best for: experimenting with prompts and checking output quality before deciding to run locally
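
To stay under a limit of about 15 requests per minute, a script can space its calls at least 4 seconds apart. The class below is a minimal client-side sketch; the name `MinIntervalLimiter` and its injectable `clock` and `sleep` parameters are illustrative choices, not part of any Google SDK.

```python
import time


class MinIntervalLimiter:
    """Space calls at least 60 / per_minute seconds apart."""

    def __init__(self, per_minute=15, clock=time.monotonic, sleep=time.sleep):
        self.interval = 60.0 / per_minute
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time the next call may start

    def wait(self):
        """Block until the next request is allowed; return the seconds waited."""
        now = self._clock()
        waited = 0.0
        if self._next_allowed is not None and now < self._next_allowed:
            waited = self._next_allowed - now
            self._sleep(waited)
            now = self._next_allowed
        self._next_allowed = now + self.interval
        return waited


limiter = MinIntervalLimiter(per_minute=15)
print(limiter.interval)  # seconds between requests
```

Call `limiter.wait()` immediately before each API request.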

3.2 Ollama (local, the easiest)

Ollama is the easiest way to run an LLM on your own machine. It supports macOS, Linux, and Windows.

Install Ollama:

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Or on macOS, download the app from https://ollama.com/download

Run Gemma 4:

# Default (E4B): suits machines with 8 GB+ of RAM
ollama run gemma4

# Pick a specific size
ollama run gemma4:e2b     # smallest, ~3.2 GB
ollama run gemma4:e4b     # default, ~5 GB
ollama run gemma4:26b     # MoE, ~15.6 GB
ollama run gemma4:31b     # largest, ~17.4 GB

When you run one of these commands, Ollama downloads the model (first time only) and then opens a chat prompt where you can start typing.

Calling the API:

Ollama automatically exposes a REST API on port 11434:

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {"role": "system", "content": "You are an AI assistant that answers in Thai"},
    {"role": "user", "content": "Explain Kubernetes briefly"}
  ],
  "stream": false
}'
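
The same request can be made from Python with only the standard library. This is a sketch, assuming Ollama is listening on the default port; the helper names `build_chat_payload` and `chat` are made up for illustration. It sets `"stream": false` so the reply arrives as a single JSON object.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # default port named in the text


def build_chat_payload(user_text, system_text=None, model="gemma4"):
    """Build the JSON body for Ollama's /api/chat endpoint."""
    messages = []
    if system_text:
        messages.append({"role": "system", "content": system_text})
    messages.append({"role": "user", "content": user_text})
    # stream=False asks for one JSON object instead of a stream of chunks.
    return {"model": model, "messages": messages, "stream": False}


def chat(user_text, system_text=None, model="gemma4", timeout=120):
    """Send one chat request and return the assistant's text (needs a running server)."""
    body = json.dumps(build_chat_payload(user_text, system_text, model)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["message"]["content"]


if __name__ == "__main__":
    # Printing the payload needs no server; call chat(...) once Ollama is running.
    print(json.dumps(build_chat_payload("Explain Kubernetes briefly"), indent=2))
```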

Advantages of Ollama:

Easy to install with a single command
Supports Apple Silicon, using MLX automatically from v0.19+
Provides an OpenAI-compatible API at /v1/chat/completions
Manages model versions for you

3.3 Hugging Face Transformers (Python)

Best for people who want to control every step of the pipeline themselves.

Install:

pip install "transformers>=5.5.0" torch accelerate

Usage:

from transformers import pipeline

MODEL_ID = "google/gemma-4-E4B-it"

pipe = pipeline(
    task="any-to-any",
    model=MODEL_ID,
    device_map="auto",
    dtype="auto",
)

messages = [
    {"role": "system", "content": "You are an AI assistant that answers in Thai"},
    {"role": "user", "content": "Explain how Docker differs from a VM"},
]

output = pipe(text=pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
))
print(output[0]["generated_text"])

Models available on Hugging Face:

google/gemma-4-E2B-it: the instruct version of E2B
google/gemma-4-E4B-it: the instruct version of E4B
google/gemma-4-26B-A4B-it: the instruct version of the 26B MoE
google/gemma-4-31B-it: the instruct version of 31B

If you need a quantized version, see Unsloth: unsloth/gemma-4-E4B-it-GGUF and others.

3.4 LM Studio (GUI desktop app)

For people who would rather avoid the terminal, LM Studio is a desktop app that makes running LLMs easy.

Download it from lmstudio.ai
Search for "gemma-4" in the Models tab
Pick a size that fits your machine's RAM
Click Download, then start chatting

LM Studio supports the GGUF format and can start a local server that you call over an API, just like Ollama.

3.5 vLLM / SGLang (production server)

For production that has to serve real traffic, vLLM is the de facto standard.

pip install vllm

# Start the server
vllm serve google/gemma-4-31B-it \
  --dtype auto \
  --max-model-len 65536 \
  --port 8000

The server exposes an OpenAI-compatible API at http://localhost:8000/v1/chat/completions, so any library that speaks the OpenAI format can call it right away.

Docker version:

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e HF_TOKEN=$HF_TOKEN \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model google/gemma-4-31B-it

vLLM handles continuous batching and KV-cache management (PagedAttention) automatically, so throughput is several times higher than running Transformers directly.

4. Basic Usage: Chat, System Prompt, Streaming

Gemma 4 Prompt Format

Gemma 4 uses its own token format:

<|turn>system
You are an AI assistant that answers in Thai<turn|>
<|turn>user
Explain Kubernetes briefly<turn|>
<|turn>model

In practice, if you use Ollama or the Transformers pipeline you do not have to handle the format yourself, because the library does it for you. If you call a raw completion endpoint or write custom inference code, you need to know this format.
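
For custom inference code, the format can be produced by a small renderer. This sketch follows only the example shown in this article; the function name `render_prompt`, the mapping of the `assistant` role to `model`, and the exact newline placement are assumptions, so prefer the tokenizer's own chat template whenever it is available.

```python
def render_prompt(messages):
    """Render chat messages into the turn format shown in this article."""
    parts = []
    for message in messages:
        # The article's examples label the assistant side "model".
        role = "model" if message["role"] == "assistant" else message["role"]
        parts.append(f"<|turn>{role}\n{message['content']}<turn|>\n")
    # Leave an open model turn so generation continues from here.
    parts.append("<|turn>model\n")
    return "".join(parts)


prompt = render_prompt([
    {"role": "system", "content": "You are an AI assistant that answers in Thai"},
    {"role": "user", "content": "Explain Kubernetes briefly"},
])
print(prompt)
```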

System Prompt

Gemma 4 supports the system role natively, so you no longer have to hack it into the user message as with older models:

messages = [
    {
        "role": "system",
        "content": "You are a senior software engineer who specializes in Kubernetes. "
                   "Answer concisely and always include example commands."
    },
    {
        "role": "user",
        "content": "My pod is stuck in CrashLoopBackOff. What should I do?"
    },
]

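Streaming with the Ollama API

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {"role": "user", "content": "Write a Python function that computes Fibonacci numbers"}
  ],
  "stream": true
}'

Ollama's /api/chat streams by default, so "stream": true only makes that explicit. The output arrives as newline-delimited JSON chunks, which suits UIs that need to render text in real time.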
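
On the client side, each streamed line is its own JSON object. The sketch below joins the text pieces, assuming each chunk carries `message.content` and a `done` flag as Ollama's chat stream does; the function name `collect_stream` and the sample lines are made up for illustration.

```python
import json


def collect_stream(lines):
    """Join message.content from newline-delimited JSON chunks."""
    pieces = []
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        pieces.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break  # the final chunk marks the end of the reply
    return "".join(pieces)


sample = [
    '{"message": {"role": "assistant", "content": "def fib"}, "done": false}',
    '{"message": {"role": "assistant", "content": "(n): ..."}, "done": false}',
    '{"message": {"role": "assistant", "content": ""}, "done": true}',
]
print(collect_stream(sample))
```

In a real UI you would render each piece as it arrives instead of joining them at the end.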

Streaming with Python (OpenAI-compatible)

from openai import OpenAI

# Point at an Ollama or vLLM server
client = OpenAI(base_url="http://localhost:11434/v1", api_key="unused")

stream = client.chat.completions.create(
    model="gemma4",
    messages=[
        {"role": "system", "content": "Answer in Thai, concisely"},
        {"role": "user", "content": "Explain the Node.js event loop"},
    ],
    stream=True,
)

for chunk in stream:
    # Some chunks carry no choices or no text; skip them.
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

5. Thinking Mode: Let the Model Think Before It Answers

Thinking Mode lets the model reason step by step (chain-of-thought) before giving its final answer. It suits problems that need complex reasoning.

How to enable Thinking Mode

Put the <|think|> token at the start of the system prompt:

<|turn>system
<|think|>You are a mathematician<turn|>
<|turn>user
If x^2 + 5x + 6 = 0, find x<turn|>
<|turn>model

With thinking mode on, the model outputs its internal reasoning first:

<|channel>thought
I need to solve the quadratic equation for x
x^2 + 5x + 6 = 0
Factor: (x+2)(x+3) = 0
So x = -2 or x = -3
<channel|>
Answer: x = -2 or x = -3
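
If you want to log the reasoning separately from the answer shown to the user, the two parts can be split apart. This is a sketch that assumes the output always follows the channel format shown above; the name `split_thinking` is illustrative.

```python
import re

# Matches one thought block, including the newline that follows it.
THOUGHT_RE = re.compile(r"<\|channel>thought\n?(.*?)<channel\|>\n?", re.DOTALL)


def split_thinking(response):
    """Return (thought, answer) from a response in the format shown above."""
    thoughts = [block.strip() for block in THOUGHT_RE.findall(response)]
    answer = THOUGHT_RE.sub("", response).strip()
    return "\n".join(thoughts), answer


raw = (
    "<|channel>thought\n"
    "Factor: (x+2)(x+3) = 0\n"
    "<channel|>\n"
    "Answer: x = -2 or x = -3"
)
thought, answer = split_thinking(raw)
print(thought)
print(answer)
```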

Using it with Ollama

curl http://localhost:11434/api/chat -d '{
  "model": "gemma4",
  "messages": [
    {
      "role": "system",
      "content": "<|think|>You are an expert at solving logic problems"
    },
    {
      "role": "user",
      "content": "There are 3 people in a room. The first always tells the truth, the second always lies, and the third answers randomly. With a single question, can you tell who is who?"
    }
  ]
}'

Important: handling thinking tokens in multi-turn conversations

When you send the conversation history back to the model, strip the thinking tokens and keep only the final answer. Otherwise the model gets confused by stale reasoning in the context.

import re

def strip_thinking(response: str) -> str:
    """Remove thinking blocks from a response."""
    return re.sub(
        r"<\|channel>thought.*?<channel\|>",
        "",
        response,
        flags=re.DOTALL,
    ).strip()

When should you turn it on or off?

Situation	Thinking Mode
Math / logic problems	✓ On
Writing complex code	✓ On
Data analysis	✓ On
General chat	✗ Off (faster)
Summarizing text	✗ Off
Translation	✗ Off

Thinking mode makes output longer and slower, so use it only when you really need it.

Tip: to cut thinking tokens by roughly 20%, add a phrase such as "think briefly" (or its Thai equivalent, "คิดสั้นๆ") to the system prompt.
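
The table can be turned into a small switch that decides whether to add the `<|think|>` prefix. This is an illustrative sketch: the task labels in `THINKING_TASKS` and the function name `build_system_prompt` are made up here, and only the `<|think|>` token and the "think briefly" hint come from the text above.

```python
# Task types that the table above marks as "On".
THINKING_TASKS = {"math", "logic", "complex_code", "data_analysis"}


def build_system_prompt(base_prompt, task, brief=False):
    """Prefix <|think|> only for tasks that benefit from step-by-step reasoning."""
    if task not in THINKING_TASKS:
        return base_prompt  # chat, summaries, translation: keep it fast
    hint = " Think briefly." if brief else ""
    return f"<|think|>{base_prompt}{hint}"


print(build_system_prompt("You are a mathematician", "math"))
print(build_system_prompt("You are a translator", "translate"))
```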

6. Function Calling: Let the Model Call Tools

Function Calling lets the model "call functions" that we define. Instead of only replying with text, the model outputs a function name plus arguments; we run the real function and send the result back.

Defining tools

Gemma 4 uses the <|tool> token to define a tool:

<|turn>system
You are an assistant that can use tools
<|tool>get_weather{location:string, unit:string}<tool|>
<|tool>search_database{query:string, limit:integer}<tool|><turn|>
<|turn>user
What is the weather like in Bangkok today?<turn|>
<|turn>model

The model outputs a tool call

The model replies in this form:

<|tool_call>call:get_weather{location:<|"|>Bangkok<|"|>, unit:<|"|>celsius<|"|>}<tool_call|>

Sending the result back

After we run the real function, we send the result back with <|tool_response>:

<|tool_response>response:get_weather{temperature:<|"|>32<|"|>, condition:<|"|>partly cloudy<|"|>}<tool_response|>

Example agentic loop (Python)

import json
import requests

TOOLS = {
    "get_weather": lambda location, unit="celsius": {
        "temperature": 32,
        "condition": "partly cloudy",
        "location": location,
    },
    "search_database": lambda query, limit=10: {
        "results": [f"Result for '{query}'"],
        "count": 1,
    },
}

# Declare the tools in the system prompt so the model knows they exist.
SYSTEM_PROMPT = (
    "You are an assistant that can use tools\n"
    "<|tool>get_weather{location:string, unit:string}<tool|>\n"
    "<|tool>search_database{query:string, limit:integer}<tool|>"
)

def agent_loop(user_message: str, max_rounds: int = 5):
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": user_message},
    ]

    for _ in range(max_rounds):
        response = requests.post(
            "http://localhost:11434/api/chat",
            json={"model": "gemma4", "messages": messages, "stream": False},
            timeout=120,
        )
        response.raise_for_status()

        assistant_msg = response.json()["message"]["content"]

        # No tool_call means the model has finished answering
        if "<|tool_call>" not in assistant_msg:
            return assistant_msg

        # Parse the tool call, then run the real function.
        # parse_tool_call must be defined separately and should return
        # (function_name, kwargs); parse the tokens carefully in real code.
        func_name, kwargs = parse_tool_call(assistant_msg)
        result = TOOLS[func_name](**kwargs)

        # Send the result back and go around again
        messages.append({"role": "assistant", "content": assistant_msg})
        messages.append({"role": "tool", "content": json.dumps(result)})

    return "Exceeded the maximum number of rounds"
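
The loop calls `parse_tool_call`, which the article leaves undefined. A minimal sketch is below, assuming the model emits exactly the `call:name{key:<|"|>value<|"|>, ...}` form shown earlier. Treating unquoted values as JSON literals (so `limit:5` becomes the integer 5) is an assumption of this sketch, not documented behavior.

```python
import json
import re

# One tool call: <|tool_call>call:NAME{ARGS}<tool_call|>
_CALL_RE = re.compile(r"<\|tool_call>call:(\w+)\{(.*?)\}<tool_call\|>", re.DOTALL)
# One argument: key:<|"|>quoted value<|"|>  or  key:bare_value
_ARG_RE = re.compile(r'(\w+):(?:<\|"\|>(.*?)<\|"\|>|([^,}]+))', re.DOTALL)


def parse_tool_call(text):
    """Return (function_name, kwargs) for the first tool call found in text."""
    match = _CALL_RE.search(text)
    if match is None:
        raise ValueError("no tool call found")
    name, body = match.group(1), match.group(2)
    kwargs = {}
    for key, quoted, bare in _ARG_RE.findall(body):
        if bare:
            # Unquoted value: try a JSON literal (number, true, false, null).
            try:
                kwargs[key] = json.loads(bare.strip())
            except json.JSONDecodeError:
                kwargs[key] = bare.strip()
        else:
            kwargs[key] = quoted
    return name, kwargs


call = '<|tool_call>call:get_weather{location:<|"|>Bangkok<|"|>, unit:<|"|>celsius<|"|>}<tool_call|>'
print(parse_tool_call(call))
```

In production, also check that the returned name is one of your registered tools before calling it.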

Key pattern

Define tools clearly: function name plus parameter types
Parse the tool_call: separate the function name from the arguments
Run the real function: this happens on the application side
Send the result back: via tool_response
Loop: until the model answers without calling a tool

This pattern is the agentic loop that turns the model into an AI agent that can both reason and act.
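
The first and fourth steps, declaring tools and sending results back, are plain string formatting. The helpers below are a sketch that reproduces the token layout shown in this section; the function names are illustrative, and quoting every response value with `<|"|>` simply mirrors the article's example.

```python
QUOTE = '<|"|>'  # string delimiter token used in the article's examples


def declare_tool(name, params):
    """Render one tool as <|tool>name{param:type, ...}<tool|>."""
    fields = ", ".join(f"{key}:{type_name}" for key, type_name in params.items())
    return f"<|tool>{name}{{{fields}}}<tool|>"


def format_tool_response(name, result):
    """Render a result dict as <|tool_response>response:name{...}<tool_response|>."""
    fields = ", ".join(f"{key}:{QUOTE}{value}{QUOTE}" for key, value in result.items())
    return f"<|tool_response>response:{name}{{{fields}}}<tool_response|>"


print(declare_tool("get_weather", {"location": "string", "unit": "string"}))
print(format_tool_response("get_weather", {"temperature": 32, "condition": "partly cloudy"}))
```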

7. Image Understanding: Analyzing Images

Every Gemma 4 size supports image input, so you do not need a separate vision model.

Sending an image with Transformers

from transformers import pipeline
from PIL import Image

pipe = pipeline(
    task="any-to-any",
    model="google/gemma-4-E4B-it",
    device_map="auto",
    dtype="auto",
)

image = Image.open("receipt.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Read this receipt and extract the store name, line items, and total"},
        ],
    },
]

output = pipe(text=pipe.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
), images=[image])
print(output[0]["generated_text"])

Example use cases

Reading receipts (OCR + extraction):

prompt: "Read this receipt and return the data as JSON: store name, date, line items with prices, total, VAT"

Dashboard analysis:

prompt: "Look at this chart and summarize the trend. Where are the anomalies?"

GUI element detection:

prompt: "Identify every UI element in this image, with bounding box coordinates"

Gemma 4 supports bounding box detection in images: it can report the position (x, y, width, height) of objects in an image. This suits automation work that has to locate elements on screen.

Sending several images at once

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": Image.open("before.png")},
            {"type": "image", "image": Image.open("after.png")},
            {"type": "text", "text": "Compare these 2 images. What has changed?"},
        ],
    },
]

8. Audio Understanding (E2B / E4B only)

The E2B and E4B models have a built-in conformer-based audio encoder and can take audio in directly for analysis.

Limits you need to know

Maximum length: 30 seconds per audio clip
Token cost: 25 tokens per second (30 seconds of audio = 750 tokens)
Sample rate: 16 kHz
Channels: mono only (convert stereo to mono first)
Formats: WAV, MP3
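
These limits translate into two small calculations: the token cost of a clip and how to cut a longer recording into windows of at most 30 seconds. The sketch below uses only the numbers listed above; the function names are illustrative, and cutting at fixed boundaries without overlap is a simplifying assumption (it can split a word in half).

```python
import math

MAX_CLIP_SECONDS = 30   # per-clip limit listed above
TOKENS_PER_SECOND = 25  # token cost listed above


def audio_token_cost(seconds):
    """Tokens consumed by one clip, rounded up to whole tokens."""
    if seconds > MAX_CLIP_SECONDS:
        raise ValueError("clip is longer than the 30-second limit")
    return math.ceil(seconds * TOKENS_PER_SECOND)


def plan_clips(total_seconds):
    """Split a recording into (start, end) windows of at most 30 seconds."""
    clips = []
    start = 0.0
    while start < total_seconds:
        end = min(start + MAX_CLIP_SECONDS, total_seconds)
        clips.append((start, end))
        start = end
    return clips


print(audio_token_cost(30))
print(plan_clips(75))
```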

Sending audio with Transformers

from transformers import AutoProcessor, AutoModelForMultimodalLM
import librosa

model_id = "google/gemma-4-E4B-it"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id, device_map="auto", dtype="auto"
)

# Load the audio: librosa resamples to 16 kHz and downmixes to mono by default
audio, sr = librosa.load("meeting.wav", sr=16000)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "audio", "audio": audio},
            {"type": "text", "text": "Transcribe this audio into Thai text"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,  # open the model turn
    tokenize=True,
    return_dict=True,            # needed so **inputs can be unpacked below
    return_tensors="pt",
).to(model.device)

output = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[-1]:]
print(processor.decode(new_tokens, skip_special_tokens=True))

Example use cases

Transcription (speech to text):

prompt: "Transcribe this audio"

Speech translation (translating speech across languages):

prompt: "Listen to this Thai audio and translate it into English"

Audio analysis:

prompt: "Analyze the tone of voice and emotion of the speaker in this clip"

Note: the 26B and 31B models do not support audio. Google designed audio as a feature of the small on-device models, which take input directly from the microphone.

9. Structured Output: Forcing JSON Responses

In real applications we usually do not want prose; we want structured data that can be parsed.

Method 1: Prompt Engineering

The simplest approach is to say so in the prompt:

messages = [
    {
        "role": "system",
        "content": "Respond with JSON only. Do not add any explanation."
    },
    {
        "role": "user",
        "content": """
Extract the data from this text:
"ABC Co., Ltd. issued invoice number INV-2026-0042
dated 5 April 2026 for 125,000 baht,
payment due in 30 days"

Schema:
{
  "company": "string",
  "invoice_number": "string",
  "date": "YYYY-MM-DD",
  "amount": number,
  "currency": "string",
  "payment_terms_days": number
}
"""
    },
]
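
Prompting alone does not guarantee clean JSON, so the reply should be parsed defensively. The sketch below pulls the first JSON object out of a reply, even if the model wrapped it in a Markdown fence or added a sentence around it, and checks for required keys. The name `extract_json` and the choice of `REQUIRED_KEYS` are illustrative.

```python
import json
import re

REQUIRED_KEYS = {"company", "invoice_number", "amount"}  # illustrative choice


def extract_json(reply):
    """Parse the JSON object in a model reply and check its required keys."""
    # Grab everything from the first "{" to the last "}" to skip stray text.
    match = re.search(r"\{.*\}", reply, re.DOTALL)
    if match is None:
        raise ValueError("no JSON object found in reply")
    data = json.loads(match.group(0))
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return data


reply = (
    "Here is the data:\n"
    '{"company": "ABC Co., Ltd.", "invoice_number": "INV-2026-0042", "amount": 125000}'
)
print(extract_json(reply))
```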

Method 2: Constrained Generation (vLLM)

vLLM supports guided decoding, which constrains the output format:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

response = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "user", "content": "Extract the data from this text: ..."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "invoice",
            "schema": {
                "type": "object",
                "properties": {
                    "company": {"type": "string"},
                    "invoice_number": {"type": "string"},
                    "date": {"type": "string"},
                    "amount": {"type": "number"},
                    "currency": {"type": "string"},
                },
                "required": ["company", "invoice_number", "amount"],
            },
        },
    },
)

This constrains decoding so that the output is valid JSON that follows the schema, as long as generation is not cut off by the token limit.

Method 3: Gemma 4 Native Structured Output

Gemma 4 uses the <|"|> token as a string delimiter when emitting structured data, so string values are not confused with special characters:

<|tool_call>call:extract_invoice{
  company:<|"|>ABC Co., Ltd.<|"|>,
  amount:<|"|>125000<|"|>
}<tool_call|>

In practice, use method 1 (prompting) for prototypes and method 2 (constrained generation) for production.

10. Fine-Tuning with LoRA

Fine-tuning teaches the model to do better at a specialized task. Gemma 4 supports LoRA/QLoRA, so a single GPU is enough to fine-tune it.

Preparing the data

Format the data as JSONL, with one conversation per line:

{"messages": [{"role": "system", "content": "You are a PDPA law expert"}, {"role": "user", "content": "A company collects customer data without asking for consent. Is that illegal?"}, {"role": "assistant", "content": "It violates Section 19..."}]}
{"messages": [{"role": "system", "content": "You are a PDPA law expert"}, {"role": "user", "content": "Within how many hours must a data breach be reported?"}, {"role": "assistant", "content": "It must be reported within 72 hours..."}]}

You should have at least 200-500 examples for LoRA fine-tuning.
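
Before training, it helps to check every line of the file. The validator below is a sketch built on the JSONL layout shown above; the rules it enforces (known roles, non-empty content, last turn from the assistant) are reasonable assumptions for chat fine-tuning rather than requirements stated by any library.

```python
import json

VALID_ROLES = {"system", "user", "assistant"}


def validate_example(line):
    """Return a list of problems found in one JSONL line (empty if it looks fine)."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return ["not valid JSON"]
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]
    problems = []
    for index, message in enumerate(messages):
        if not isinstance(message, dict):
            problems.append(f"message {index}: not an object")
            continue
        if message.get("role") not in VALID_ROLES:
            problems.append(f"message {index}: unknown role")
        content = message.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {index}: empty content")
    last = messages[-1]
    if not isinstance(last, dict) or last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems


good = json.dumps({"messages": [
    {"role": "user", "content": "Within how many hours must a breach be reported?"},
    {"role": "assistant", "content": "Within 72 hours..."},
]})
print(validate_example(good))
```

Run it over each line of train.jsonl and fix or drop any line that returns a non-empty list.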

Fine-tuning with Unsloth (recommended)

Unsloth is a library optimized for fine-tuning LLMs on consumer GPUs. It uses 2-4x less memory than plain Transformers.

pip install unsloth

from unsloth import FastModel, is_bfloat16_supported
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

# Load the model with 4-bit quantization
model, tokenizer = FastModel.from_pretrained(
    "unsloth/gemma-4-E4B-it",
    max_seq_length=4096,
    load_in_4bit=True,
    full_finetuning=False,
)

# Add LoRA adapters
model = FastModel.get_peft_model(
    model,
    r=16,                    # LoRA rank
    lora_alpha=16,
    target_modules=[
        "q_proj", "k_proj", "v_proj",
        "o_proj", "gate_proj", "up_proj", "down_proj",
    ],
    lora_dropout=0,
    bias="none",
)

# Load the data
dataset = load_dataset("json", data_files="train.jsonl", split="train")

# Training
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./output",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        # Use bf16 where the GPU supports it; fall back to fp16 otherwise
        fp16=not is_bfloat16_supported(),
        bf16=is_bfloat16_supported(),
        logging_steps=10,
        save_strategy="epoch",
    ),
)

trainer.train()
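
The batch settings above determine how many optimizer steps the run takes, which tells you whether `warmup_steps=10` and `logging_steps=10` are sensible for your dataset. The arithmetic is sketched below; `training_plan` is an illustrative helper, and the trainer's own rounding may differ by a step.

```python
import math


def training_plan(num_examples, per_device_batch=2, grad_accum=4, epochs=3, num_gpus=1):
    """Estimate optimizer steps from the batch settings used above."""
    # One optimizer step consumes this many examples.
    effective_batch = per_device_batch * grad_accum * num_gpus
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": steps_per_epoch * epochs,
    }


print(training_plan(500))
```

With 500 examples, the 10 warmup steps cover roughly the first 5% of training.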

Memory requirements

Model	Full Fine-Tune	LoRA (r=16)	QLoRA (4-bit + LoRA)
E2B	~12 GB	~6 GB	~4 GB
E4B	~18 GB	~8 GB	~5 GB
26B A4B	~55 GB	~20 GB	~12 GB
31B	~65 GB	~24 GB	~15 GB

Tip: set use_gradient_checkpointing="unsloth" to cut memory by a further ~30% in exchange for slightly slower training.

Save and Deploy

# Save the LoRA adapter
model.save_pretrained("./gemma4-pdpa-adapter")
tokenizer.save_pretrained("./gemma4-pdpa-adapter")

# Merge the adapter back into the base model (for production)
model.save_pretrained_merged(
    "./gemma4-pdpa-merged",
    tokenizer,
    save_method="merged_16bit",
)

# Or export to GGUF for Ollama
model.save_pretrained_gguf(
    "./gemma4-pdpa-gguf",
    tokenizer,
    quantization_method="q4_k_m",
)

After exporting to GGUF you can use the model with Ollama right away:

# Create a Modelfile (check the exported .gguf filename first; it can vary by Unsloth version)
echo 'FROM ./gemma4-pdpa-gguf/unsloth.Q4_K_M.gguf' > Modelfile

# Import into Ollama
ollama create gemma4-pdpa -f Modelfile

# Test
ollama run gemma4-pdpa

11. Deploying on Android (AICore / Edge Gallery)

Gemma 4 is designed with Android in mind. Google offers 3 routes:

11.1 AICore Developer Preview

AICore is an Android system service that lets apps use AI models built into the OS.

The Gemma 4 E2B model is optimized specifically for AICore
Runs 100% on-device; no data leaves the phone
Uses < 1.5 GB of RAM (2-bit + 4-bit quantization + memory-mapped embeddings)
Requires signing up for the Developer Preview on Android Developers

11.2 Google AI Edge Gallery

Edge Gallery is an app that lets you try Gemma 4 on a phone right away:

Download Google AI Edge Gallery from the Play Store (or App Store)
Select Gemma 4 E2B or E4B
Start chatting; everything runs on the device

Good for demos that show customers what on-device AI can do.

11.3 ML Kit GenAI Prompt API

For Android developers who want to integrate it into a real app:

// build.gradle.kts
implementation("com.google.mlkit:genai-prompt:1.0.0-beta")

import com.google.mlkit.genai.prompt.GenerativeModel

val model = GenerativeModel(
    modelName = "gemma-4-e2b",  // or "gemma-4-e4b"
)

// Usage
val response = model.generateContent("Explain the PDPA briefly")
println(response.text)

ML Kit handles model download, quantization, and memory management for you.

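12. Self-Hosting for Production

GPU Requirements

Weights alone need roughly parameters × bytes per parameter (2 for FP16, 1 for INT8, 0.5 for INT4), plus headroom for the KV cache. For the MoE model, all 25.2B parameters must be resident even though only 3.8B are active per token.

Model	GPU (FP16)	GPU (INT8)	GPU (INT4/GPTQ)
E2B	1x RTX 3060 12GB	1x RTX 3060 12GB	1x RTX 3060 12GB
E4B	1x RTX 4090 24GB	1x RTX 3060 12GB	1x RTX 3060 12GB
26B A4B	1x A100 80GB	1x A100 40GB	1x RTX 4090 24GB
31B	1x A100 80GB	1x A100 40GB	1x RTX 4090 24GB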
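
A quick way to sanity-check a GPU choice is to estimate weight memory as parameters times bytes per parameter. The sketch below uses the total parameter counts from the model comparison table; the 10% `kv_headroom` default and the function names are assumptions for illustration, and real usage also depends on context length and batch size.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

# Total parameter counts (in billions) from the model comparison table.
TOTAL_PARAMS_B = {"E2B": 5.1, "E4B": 8.0, "26B A4B": 25.2, "31B": 30.7}


def weight_memory_gb(model, precision):
    """Approximate weight memory in GB (taking 1 GB as 1e9 bytes)."""
    return TOTAL_PARAMS_B[model] * BYTES_PER_PARAM[precision]


def fits(model, precision, gpu_gb, kv_headroom=0.1):
    """True if the weights plus a fractional KV-cache headroom fit in gpu_gb."""
    return weight_memory_gb(model, precision) * (1 + kv_headroom) <= gpu_gb


for name in TOTAL_PARAMS_B:
    print(name, {p: round(weight_memory_gb(name, p), 1) for p in BYTES_PER_PARAM})
```

For example, `fits("26B A4B", "fp16", 40)` is false because the FP16 weights alone are about 50 GB.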

vLLM Production Setup

# docker-compose.yml
version: "3.8"
services:
  gemma4:
    image: vllm/vllm-openai:latest
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    ports:
      - "8000:8000"
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    environment:
      - HF_TOKEN=${HF_TOKEN}
    command: >
      --model google/gemma-4-26B-A4B-it
      --dtype auto
      --max-model-len 65536
      --gpu-memory-utilization 0.9
      --enable-prefix-caching
    ipc: host

docker compose up -d

Testing

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "google/gemma-4-26B-A4B-it",
    "messages": [
      {"role": "user", "content": "Hello"}
    ]
  }'

Production Checklist

Set --gpu-memory-utilization appropriately (0.85-0.95)
Enable --enable-prefix-caching for repeated system prompts
Set --max-model-len to what you actually use (no need for the full 256K)
Monitor GPU memory and request latency
Put rate limiting in front, at the reverse proxy
Use the /health endpoint for health checks
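
The flag-related items in the checklist can be enforced in a launcher script so that a bad value fails fast instead of at runtime. The builder below is a sketch: the flags are the ones used in this article, while the function name `build_vllm_args` and the validation ranges (taken from the checklist and the 256K context figure) are illustrative.

```python
def build_vllm_args(model, max_model_len=65536, gpu_memory_utilization=0.9,
                    prefix_caching=True, port=8000):
    """Return the argv list for `vllm serve` with the checklist settings applied."""
    if not 0.85 <= gpu_memory_utilization <= 0.95:
        raise ValueError("gpu_memory_utilization outside the 0.85-0.95 range")
    if max_model_len > 256 * 1024:
        raise ValueError("max_model_len exceeds the 256K context window")
    args = [
        "vllm", "serve", model,
        "--dtype", "auto",
        "--max-model-len", str(max_model_len),
        "--gpu-memory-utilization", str(gpu_memory_utilization),
        "--port", str(port),
    ]
    if prefix_caching:
        args.append("--enable-prefix-caching")
    return args


print(" ".join(build_vllm_args("google/gemma-4-26B-A4B-it")))
```

The returned list can be passed directly to `subprocess.run`.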

13. Pricing and License

Free options

Option	Cost	Limitations
Google AI Studio	Free	Rate limit ~15 req/min
Ollama (local)	Free (hardware cost)	Depends on the machine's RAM/GPU
llama.cpp (local)	Free (hardware cost)	More manual setup; may require building from source
Hugging Face Spaces	Free (basic)	CPU only, slow

API Pricing (OpenRouter)

Model	Input (per 1M tokens)	Output (per 1M tokens)
Gemma 4 26B A4B	$0.13	$0.40
Gemma 4 31B	$0.14	$0.40

This is very cheap compared with proprietary models priced at $3-15 per 1M tokens.
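
To see what these prices mean for a real workload, multiply token counts by the per-million rates. The sketch below copies the prices from the table above; the workload numbers in the example are made up.

```python
# USD per 1M tokens, copied from the OpenRouter table above.
PRICES = {
    "Gemma 4 26B A4B": {"input": 0.13, "output": 0.40},
    "Gemma 4 31B": {"input": 0.14, "output": 0.40},
}


def estimate_cost_usd(model, input_tokens, output_tokens):
    """Cost in USD for the given number of input and output tokens."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000


# Example workload (made-up numbers): 10,000 requests, 1,500 tokens in and 500 out each.
monthly = estimate_cost_usd("Gemma 4 31B", 10_000 * 1_500, 10_000 * 500)
print(f"${monthly:.2f}")
```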

Apache 2.0: What You Can Do

Permission	Apache 2.0
Commercial use	✓ Yes
Modify and adapt	✓ Yes
Redistribute	✓ Yes
Embed in a product you sell	✓ Yes
Combine with other OSS	✓ Compatible
Must disclose your own source code	✗ No

Summary: you can use it for almost anything. The main obligations are to keep the license and attribution notices and to mark modified files when you redistribute. Most legal teams approve Apache 2.0 quickly.

14. Summary: What to Choose, and When

Quick Decision Guide

Just want to try it out?
  → Google AI Studio (nothing to install)

Running on a laptop day to day?
  → Ollama + gemma4 (E4B default)

Need audio on a phone?
  → E2B or E4B + AICore / Edge Gallery

Production API server?
  → vLLM + 26B A4B (best bang for the buck)

Maximum quality, no GPU limits?
  → vLLM + 31B

Fine-tuning for a specialized task?
  → Unsloth + E4B (consumer GPU)
  → Unsloth + 26B A4B (if you have an A100)

What Gemma 4 changes

On-device AI is real: E2B runs on a phone in < 1.5 GB
Apache 2.0 unlocks commercial use with few legal barriers
Multimodal built in: no complex pipeline to wire up
Native function calling: build AI agents without relying on an external framework
Top-3 quality worldwide: from $0.14 per 1M input tokens, or free if you run it yourself

For teams building AI products, whether a chatbot, document processing, an on-device assistant, or an AI agent, Gemma 4 is an option worth serious consideration.
