Build Modal-Agents at Home the Easy Way with DeepSeek R1 + Qwen2.5VL-Multi

Last updated Feb. 5, 2025

DeepSeek and Qwen are getting a lot of attention right now. Today I'll put together a demo that turns Qwen2.5VL and DeepSeek-R1 into agents using SWARM.

There are actually many interesting agent frameworks out there, and which one fits depends on the kind of work and what you are comfortable with. Personally, I like SWARM because I like bees.

This article goes fairly deep into hands-on usage, so it may not suit complete beginners, though you can still follow along. If you don't yet know what DeepSeek is, you can read more here.

Hey, so why am I using SWARM? Isn't that from OpenAI? That's exactly the reason.

AI Agent (Image by: Andreas Horn)

OK, is everyone ready to step into the world of agents? Every news source says 2025 is the era of agents. I believe that too, and AI agents will probably keep getting better this year.

OpenAI CEO Sam Altman said the first artificial intelligence agents might enter the workforce this year as his company inches closer to developing humanlike artificial general intelligence (AGI).
“We believe that, in 2025, we may see the first AI agents ‘join the workforce’ and materially change the output of companies,” said Altman in a blog post titled “Reflections” on Jan. 6.

Nvidia CEO Jensen Huang is confident that AI agents will go mainstream, saying, "We are starting to see AI agents being adopted in enterprises, and it is becoming very popular."

Image source: Frank Shines

Running the models on AWS EC2 (g4dn.xlarge) in a Jupyter Notebook via Docker

First, the setup. If you are working locally, you can set things up with the following:

SWARM
Ollama
Qwen2.5VL — 7B
DeepSeek-R1–7B

from huggingface_hub import login
login(token="YOUR_HF_TOKEN")  # replace with your Hugging Face access token

Next, load Qwen2.5VL, a multimodal model that can take images as input.

from transformers import (
   Qwen2_5_VLForConditionalGeneration,
   AutoTokenizer,
   AutoProcessor,
   BitsAndBytesConfig,
)
from qwen_vl_utils import process_vision_info
import torch
# default: load the model on the available device(s), quantized to 4-bit (requires bitsandbytes)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
   "Qwen/Qwen2.5-VL-7B-Instruct",
   torch_dtype="auto",
   device_map="auto",
   quantization_config=BitsAndBytesConfig(load_in_4bit=True),
)
# We recommend enabling flash_attention_2 for better acceleration and memory saving, especially in multi-image and video scenarios.
# model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
#     "Qwen/Qwen2.5-VL-7B-Instruct",
#     torch_dtype=torch.bfloat16,
#     attn_implementation="flash_attention_2",
#     device_map="auto",
# )
# default processor
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

But if you have a newer GPU, as listed below:

FlashAttention-2 with CUDA currently supports:
Ampere, Ada, or Hopper GPUs (e.g., A100, RTX 3090, RTX 4090, H100). Support for Turing GPUs (T4, RTX 2080) is coming soon, please use FlashAttention 1.x for Turing GPUs for now.
Datatype fp16 and bf16 (bf16 requires Ampere, Ada, or Hopper GPUs).
All head dimensions up to 256. Head dim > 192 backward requires A100/A800 or H100/H800. Head dim 256 backward now works on consumer GPUs (if there’s no dropout) as of flash-attn 2.5.5.

you can use FlashAttention-2, which is the better option:

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
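Whether the FlashAttention-2 path is usable depends on the GPU generation. As a rough sketch, the choice can be derived from the CUDA compute capability: Ampere and newer GPUs report a major version of 8 or higher, while the T4 in a g4dn.xlarge is a Turing card (7.5). The helper name `pick_attn_implementation` and the fallback to `"sdpa"` are my own assumptions, not part of the original notebook, and the check does not verify that the `flash-attn` package is actually installed.

```python
from typing import Optional, Tuple


def pick_attn_implementation(capability: Optional[Tuple[int, int]]) -> str:
    """Return an attn_implementation string for from_pretrained.

    capability is the (major, minor) CUDA compute capability, or None
    when no GPU is available. Hypothetical helper for illustration.
    """
    if capability is not None and capability[0] >= 8:
        # Ampere (8.x), Ada (8.9), Hopper (9.0): FlashAttention-2 is supported.
        return "flash_attention_2"
    # Turing (7.5, e.g. T4) and older, or no GPU: fall back to PyTorch SDPA.
    return "sdpa"


# Usage (assumes torch is installed):
#   cap = torch.cuda.get_device_capability() if torch.cuda.is_available() else None
#   attn = pick_attn_implementation(cap)
#   ...from_pretrained(..., attn_implementation=attn)
print(pick_attn_implementation((7, 5)))  # sdpa
```

On the g4dn.xlarge used in this article the helper would pick `"sdpa"`, which is consistent with the 4-bit default path shown earlier.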

Next, upload whatever image you like and let's look at the result.

from PIL import Image
import matplotlib.pyplot as plt

image_path = "steak.jpg"
image = Image.open(image_path)
messages = [
   {"role": "user", "content": [
       {"type": "image"},
       {"type": "text", "text": "Describe this image. "}
   ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(image, input_text, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
   out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
   generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
plt.figure(figsize=(10, 8))
plt.imshow(image)
plt.axis('off')
plt.tight_layout()
plt.show()

Steak.jpeg

[
'This image shows a plated meal consisting of a steak and French fries. The steak appears to be cooked to a medium-rare doneness, with a juicy interior and a nicely seared exterior.
It is sliced into portions and accompanied by a rich, dark sauce that pools around the meat. The French fries are golden brown and appear to be crispy on the outside. The plate is white is white white with black trim,
giving it a classic diner-style appearance. In the background, there is a fork resting on the edge of the plate, along with a glimpse of another dish containing salad greens.
A glass of red wine is also partially visible']
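One detail in the generation code is worth spelling out: `model.generate` returns the prompt tokens followed by the newly generated ones, so the list comprehension slices off the first `len(in_ids)` tokens of each sequence before decoding. A minimal sketch of that step, with plain Python lists in place of tensors (the function name `trim_prompt_tokens` and the token values are made up for illustration):

```python
from typing import List, Sequence


def trim_prompt_tokens(
    input_ids: Sequence[Sequence[int]],
    generated_ids: Sequence[Sequence[int]],
) -> List[List[int]]:
    """Drop the echoed prompt from the front of each generated sequence."""
    return [
        list(out_ids[len(in_ids):])
        for in_ids, out_ids in zip(input_ids, generated_ids)
    ]


# Toy example: the prompt is 3 tokens long and the model appended 2 new tokens.
prompt = [[101, 102, 103]]
full_output = [[101, 102, 103, 7, 8]]
print(trim_prompt_tokens(prompt, full_output))  # [[7, 8]]
```

Without this step, the decoded text would start with the chat template and the question instead of the description.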

Next, I'll build the agents with SWARM, pulling the model from Ollama. The model is DeepSeek-R1 7B.

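from openai import OpenAI

# Ollama exposes an OpenAI-compatible API on port 11434
llm_client = OpenAI(
   base_url="http://localhost:11434/v1",
   api_key='ollama',  # required by the client but not used by Ollama
)
# Note: this rebinds `model`, which held the Qwen2.5-VL model object above
model = "MFDoom/deepseek-r1-tool-calling:7b"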

I create two agents: a wildlife expert (Animal_agent) and a top chef from the land of raw fish (food_agent).

from swarm import Swarm, Agent
client = Swarm(client=llm_client)
Animal_agent = Agent(
   name="Animal Agent",
   instructions="You are an expert in animal species and Zoos in Thailand",
   model=model,
)
food_agent = Agent(
   name="Food Agent",
   instructions="You are an expert in food and cooking.",
   model=model,
)
# When control is handed off to the other agent, this prints "## Transferring to Food agent.### "
def transfer_to_agent():
   """Transfer"""
   print("## Transferring to Food agent.### ")
   return food_agent
Animal_agent.functions.append(transfer_to_agent)
messages = [{"role": "user", "content": f"it's description from the {output_text} can you explain more?"}]
response = client.run(agent=Animal_agent, messages=messages)
print(response.messages[-1]["content"])

The code above is basically magic. I'm giving the AI orders the way I'd talk to a friend: hey Agent 1, you're great with animals, but if you get something that has nothing to do with you, pass it to Agent 2, the chef from the land of raw fish. And here is the result…

## Transferring to Food agent.###
To understand this description of the plated meal, let's break down each element:
1. **Steak and Doneness:** The steak is medium-rare, which means it has a juicy interior that's still unlitigated with enough rareness (sought-after flavor) on the exterior due to searing.
2. **Sauce:** A rich, dark sauce surrounds the steak, contributing depth and flavor to the meal.
3. **French Fries:** Golden brown fries are crispy on the outside, indicating they've been well-fried and cooked in batches for that char and texture.
4. **Plate Setup:** The white plate with black trim gives it a classic diner look--a clean background typical of many breakfast or casual dining settings.
5. **Dining Scene:** A fork partially resting on the edge shows someone is about to eat, likely indicating active eating. A salad greens dish suggests there are other items on the table besides just steak and fries.
6. **Beverage:** A glass of red wine visible in the background indicates someone's choice for a drink.
This description paints a picture of an enjoyable meal with a balance between the steak's flavor, the fries' crispiness, and the overall dining atmosphere suggested by the plate setup and presence of other dishes or a drink.

The result is satisfying, and a transfer to the other agent did take place.
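What makes the handoff work is a Swarm convention: when a function called by the active agent returns an `Agent`, the run loop switches to that agent for the rest of the conversation. The toy code below mimics only that one rule, with no LLM involved. `ToyAgent` and `apply_tool_result` are hypothetical names for illustration, not Swarm's actual classes or internals.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyAgent:
    """Stand-in for an agent: a name plus the functions it may call."""
    name: str
    functions: List[Callable] = field(default_factory=list)


def apply_tool_result(active: ToyAgent, tool: Callable) -> ToyAgent:
    """Call a tool; if it returns an agent, that agent becomes active."""
    result = tool()
    if isinstance(result, ToyAgent):
        return result  # handoff: control moves to the returned agent
    return active      # ordinary tool: the same agent keeps control


food_agent = ToyAgent(name="Food Agent")
animal_agent = ToyAgent(name="Animal Agent")


def transfer_to_agent() -> ToyAgent:
    """Transfer control to the Food Agent."""
    return food_agent


animal_agent.functions.append(transfer_to_agent)

active = apply_tool_result(animal_agent, animal_agent.functions[0])
print(active.name)  # Food Agent
```

In the real run, it is the model that decides whether to call `transfer_to_agent`, so a clear docstring on that function likely helps it make the right choice.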

Next, shall we try something a little harder?

This time I gave Qwen a picture of Moo Deng and got the following result.

Prompt

[
'This image shows two baby hippos resting on the ground. The larger one appears to be an adult hippo,
possibly the mother, lying down with its head resting near the smaller baby hippo. Both animals have dark skin and appear to be wet,
suggesting they may have been in water recently.
The setting looks like a sandy area, likely part of their enclosure or habitat. The baby hippo has a pinkish hue around its mouth and nose,
which is typical for young hippos. The overall scene conveys a sense of calm and affection between the two hippos.']

Then I changed the question to: Where can I find it in Thailand?

messages = [{"role": "user", "content": f" {output_text} Where can I find it in Thailand?"}]
response = client.run(agent=Animal_agent, messages=messages)
print(response.messages[-1]["content"])

This gives:


First, analyze the query to understand the user's needs. The user provided an image description involving two baby hippos in Thailand and asked where it can be found. I'll consider which tool would best address this type of question. Since the inquiry is about finding a specific animal in Thailand, I need a function that retrieves information about wildlife in the country. Looking at the available tools, the most appropriate one here is the transfer_to_agent function because it allows using external knowledge sources to find specific information efficiently. Since no parameters are required for this type of query, I'll set the parameters as empty within "object" properties. Finally, structure the response into the expected JSON format with the chosen function and its parameters.
```json{"name": "transfer_to_agent", "parameters": {"type": "object", "required": [], "properties": {"function_name": "Find_zoos_and_biodiversity_spots_in_Thailand", "parameters": {"image_description": "The image contains two baby hippos, one larger identified as a mother possibly with a young baby, both in sandy habitat with pinkish features and calm interaction"}}}}```

As you can see, the answer is incomplete: the model doesn't know the location, and no function attached to the agent can supply it.
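It also appears that the tool call in this output arrived as JSON text inside the message content rather than as a structured tool call, since the "## Transferring to Food agent.###" line was never printed. One possible way to detect this case is sketched below, assuming the model wraps the JSON in a json code fence as it did here. `extract_tool_call_name` is a hypothetical helper, not part of Swarm or Ollama.

```python
import json
import re
from typing import Optional

FENCE = "`" * 3  # three backticks, built this way to keep the example readable

# Matches a fenced block (optionally tagged json) whose body is a JSON object.
_PATTERN = re.compile(FENCE + r"(?:json)?\s*(\{.*\})\s*" + FENCE, re.DOTALL)


def extract_tool_call_name(content: str) -> Optional[str]:
    """Return the "name" of a tool call written as text, or None."""
    match = _PATTERN.search(content)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    name = payload.get("name") if isinstance(payload, dict) else None
    return name if isinstance(name, str) else None


sample = (
    "Finally, structure the response.\n"
    + FENCE + 'json{"name": "transfer_to_agent", "parameters": {}}' + FENCE
)
print(extract_tool_call_name(sample))  # transfer_to_agent
```

A check like this could be used to log or retry such responses; it does not execute the call itself.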

Let's tweak it a bit

I added a zoo_place function to tell Animal_agent: hey friend, if you don't know which place it is, call this function. In real use, it could be connected to RAG or an external API.

from swarm import Swarm, Agent
client = Swarm(client=llm_client)
Animal_agent = Agent(
   name="Animal Agent",
   instructions=(
       "You are an expert in animal species and Zoos in Thailand. "
       "If you are not sure of the answer or if your output does not match the expected "
       "response, please trigger the zoo_place function to provide the default answer."
   ),
   model=model,
)
food_agent = Agent(
   name="Food Agent",
   instructions="You are an expert in food and cooking.",
   model=model,
)

def transfer_to_agent():
   """Transfer control to the Food Agent."""
   print("## Transferring to Food agent.### ")

   return food_agent


def zoo_place(**kwargs):
   """Return the default zoo location."""
   print("Transferring to Zoo-Map Assistance")
   return "KhaoKhew"
Animal_agent.functions.extend([transfer_to_agent, zoo_place])
messages = [{
   "role": "user",
   "content": f"{output_text} Where can I find the famous place about it in Thailand?"
}]
response = client.run(agent=Animal_agent, messages=messages)
print(response.messages[-1]["content"])

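Now look at the result. It knows the answer this time, and it really did call the function: Transferring to Zoo-Map Assistance. Keep in mind that zoo_place only returns the string "KhaoKhew", so anything beyond that in the answer is the model's own elaboration.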

Okay, the user is asking about a famous place in Thailand related to a scene involving two baby hippos. The scene described includes an adult hippo serving as the mother and a smaller one acting as the baby, with dark skin suggesting recent water contact on sandy terrain.
First, I need to match this description with known locations in Thailand that have such animal exhibits. KhaoKhew National Park comes to mind because it's well-known for its hippopotamuses. I recall that there's an interpretative center where they showcase these animals in a natural habitat.
Considering the user is from Thailand and likely interested in zoos or nature reserves, mentioning KhaoKhew makes sense. It's an educational place where visitors can learn about wildlife, which aligns with their query about famous places related to the scene described.
I make sure to provide the full name of the park as it's a significant attraction. This should help the user find more information or plan their visit accordingly.
The famous place you might be referring to in Thailand is **KhaoKhew National Park**
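The hard-coded return value in `zoo_place` is only a placeholder. As mentioned earlier, a real version could query RAG or an external API; the sketch below stands in for that with a small in-memory table. The table contents, the `animal` parameter, and the fallback message are all assumptions made for illustration.

```python
# Hypothetical lookup table standing in for a RAG index or an external API.
ZOO_BY_ANIMAL = {
    "hippo": "KhaoKhew",  # same value the original zoo_place returns
}


def zoo_place(animal: str = "") -> str:
    """Return a zoo in Thailand where the given animal can be seen."""
    key = animal.strip().lower()
    for known, place in ZOO_BY_ANIMAL.items():
        # Substring match so "baby hippos" still finds the "hippo" entry.
        if known in key:
            return place
    return "No matching zoo found"


print(zoo_place("two baby hippos"))  # KhaoKhew
print(zoo_place("red panda"))        # No matching zoo found
```

Swarm derives the tool description from the function's signature and docstring, so a named, typed parameter such as `animal` probably gives the model a clearer slot to fill than `**kwargs`.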

Bonus: Thai

Prompt

[
'ภาพนี้แสดงถ้วยอาหารที่มีเนื้อสัตว์และฟรายด์ฟรายส์บนจานขาวขอบดำ เนื้อสัตว์ถูกตัดเป็นชิ้นบางๆ
และมีเนื้อสัตว์ที่ยังไม่สุกอยู่ในช่วงกลาง
ฟรายด์ฟรายส์ด้านล่างของเนื้อสัตว์ดูเหมือนจะเป็นแบบทอดที่มีสีเหลืองอ่อน']
