How to Describe a Video with Qwen2VL

The Qwen2VL model can analyze and describe video and image content. This post shows how to describe a video with the Qwen2-VL-7B-Instruct model.

1. Prerequisites

First, install the required libraries (torch, transformers, and langchain_ollama), then load the model and processor.

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoProcessor
from langchain_ollama import ChatOllama

# Set up the Ollama-based model used for Korean translation
chat_model = ChatOllama(model="exaone3.5", temperature=0)

# Load the Qwen2VL model and processor.
# device_map="auto" already places the weights on the available device(s),
# so no extra .to(device) call is needed.
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")
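
Before loading a 7B model, it can help to confirm that the libraries are actually installed. The following is a minimal sketch, assuming the import names used above (torch, transformers, langchain_ollama) match the installed packages; the helper name find_missing_modules is hypothetical and not part of any library.

```python
import importlib.util

# Import names taken from the import statements in this post (an assumption:
# the installed package names may differ from the import names).
REQUIRED_MODULES = ["torch", "transformers", "langchain_ollama"]


def find_missing_modules(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = find_missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

The check only looks up the modules without importing them, so it runs quickly even when torch is installed.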

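2. Building the Video Description Request

The model generates a description from frames sampled from the video. This example passes the file gdp_MBCnews.mp4 as input and asks for a description of its contents.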

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "./data/gdp_MBCnews.mp4"},
            {"type": "text", "text": "Describe this video in detail."}
        ],
    }
]

inputs = processor.apply_chat_template(
    conversation,
    video_fps=1,  # sample only one frame per second
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
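
The video_fps value directly controls how many frames the model has to process, and therefore how much memory the request needs. The sketch below only illustrates that arithmetic for a 30-second clip (an illustrative length); the helper estimate_sampled_frames is hypothetical, and the real processor may round or clamp the frame count differently.

```python
import math


def estimate_sampled_frames(duration_seconds: float, video_fps: float) -> int:
    """Rough number of frames sampled from a clip at the given sampling rate."""
    if duration_seconds <= 0 or video_fps <= 0:
        raise ValueError("duration_seconds and video_fps must be positive")
    # At least one frame is always needed to describe the clip.
    return max(1, math.floor(duration_seconds * video_fps))


# Compare a few sampling rates for a 30-second clip.
for fps in (0.5, 1, 2):
    print(f"video_fps={fps}: about {estimate_sampled_frames(30, fps)} frames")
```

Lowering video_fps is a simple first step when a long video does not fit in GPU memory.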

3. Generating the Description

Use the model to generate a description of the given video.

output_ids = model.generate(**inputs, max_new_tokens=1280)

# Keep only the newly generated tokens by dropping the prompt tokens from each sequence
generated_ids = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)

print("Generated Description:", output_text)
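
The slicing step above works because generate returns each sequence with the prompt tokens first and the new tokens after them. Here is a small stand-alone sketch of the same idea using plain Python lists with made-up token IDs; the function name trim_prompt_tokens is hypothetical.

```python
def trim_prompt_tokens(prompt_ids, full_output_ids):
    """Drop the leading prompt tokens from each generated sequence."""
    return [out[len(prompt):] for prompt, out in zip(prompt_ids, full_output_ids)]


# Toy token IDs: each output row starts with a copy of its prompt row.
prompt_ids = [[101, 7, 8], [101, 9, 10]]
full_output_ids = [[101, 7, 8, 42, 43], [101, 9, 10, 44, 45]]

new_tokens = trim_prompt_tokens(prompt_ids, full_output_ids)
print(new_tokens)  # [[42, 43], [44, 45]]
```

Without this step, the decoded text would begin with the chat template and the user prompt instead of the description.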

4. Korean Translation (Optional)

The generated description can also be translated into Korean.

# The Korean prompt asks the model to write the text inside the <내용> ("content") tags in Korean.
response = chat_model.invoke(f"<내용>{output_text[0]}</내용> <내용>을 한국어로 적어줘.")
print("Translated Response:", response.content)
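
Building the translation prompt in a small helper makes it easier to guard against an empty description before calling the chat model. This is only a sketch: build_translation_prompt is a hypothetical helper, and the prompt wording is copied from the call above.

```python
def build_translation_prompt(description: str) -> str:
    """Wrap a description in <내용> tags and ask for it to be written in Korean."""
    cleaned = description.strip()
    if not cleaned:
        raise ValueError("description is empty; nothing to translate")
    return f"<내용>{cleaned}</내용> <내용>을 한국어로 적어줘."


prompt = build_translation_prompt("  A news anchor speaks to the camera.  ")
print(prompt)
```

The resulting string can be passed to chat_model.invoke in place of the inline f-string.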

5. Further Use: Processing Multiple Media Files

Qwen2VL can handle several kinds of input (images, videos, text) in a single request, so multiple media files can be analyzed at once.

conversation_multi = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "./data/image1.jpg"},
            {"type": "video", "path": "./data/video1.mp4"},
            {"type": "text", "text": "Describe the common elements in these media files."}
        ]
    }
]

inputs_multi = processor.apply_chat_template(
    conversation_multi,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

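output_ids_multi = model.generate(**inputs_multi, max_new_tokens=128)
# Drop the prompt tokens so that only the generated description is decoded
generated_ids_multi = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs_multi.input_ids, output_ids_multi)]
generated_text_multi = processor.batch_decode(generated_ids_multi, skip_special_tokens=True, clean_up_tokenization_spaces=True)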

print("Generated Multi-Media Description:", generated_text_multi)
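
When the list of files changes often, the conversation structure can be built from the paths instead of being written by hand. The following is a minimal sketch: build_conversation and the two extension sets are hypothetical and only cover a few common file types.

```python
from pathlib import Path

# Extension sets are illustrative assumptions, not an exhaustive list.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}
VIDEO_EXTENSIONS = {".mp4", ".mov", ".avi"}


def build_conversation(media_paths, question):
    """Build a single-turn user message from media paths and a text question."""
    content = []
    for path in media_paths:
        suffix = Path(path).suffix.lower()
        if suffix in IMAGE_EXTENSIONS:
            content.append({"type": "image", "path": path})
        elif suffix in VIDEO_EXTENSIONS:
            content.append({"type": "video", "path": path})
        else:
            raise ValueError(f"Unsupported media type: {path}")
    # The text question goes last, after all media entries.
    content.append({"type": "text", "text": question})
    return [{"role": "user", "content": content}]


conversation_built = build_conversation(
    ["./data/image1.jpg", "./data/video1.mp4"],
    "Describe the common elements in these media files.",
)
print(conversation_built)
```

The returned list has the same shape as conversation_multi, so it can be passed to processor.apply_chat_template in the same way.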

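Conclusion

With the Qwen2VL model, the content of videos and images can be analyzed and described, and the resulting description can optionally be translated into Korean with a local Ollama model.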

import torch
from transformers import Qwen2VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from langchain_ollama import ChatOllama

chat_model = ChatOllama(model="exaone3.5", temperature=0)

# Load the model in half-precision on the available device(s)
model = Qwen2VLForConditionalGeneration.from_pretrained("Qwen/Qwen2-VL-7B-Instruct", device_map="auto")
processor = AutoProcessor.from_pretrained("Qwen/Qwen2-VL-7B-Instruct")

# conversation = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "image", "url": "./data/cats.jpg"},
#             {"type": "text", "text": "Describe this image."}
#         ]
#     }
# ]
# inputs = processor.apply_chat_template(
#     conversation,
#     add_generation_prompt=True,
#     tokenize=True,
#     return_dict=True,
#     return_tensors="pt"
# ).to(model.device)

###############################
# Allocation attempt: tried to allocate 76.73 GiB of memory.
# GPU total capacity: GPU 0 has a total capacity of 79.15 GiB, of which 31.59 GiB is free.
# Memory currently in use: process 1719050 is using 2.54 GiB, process 1747901 is using 2.32 GiB, and process 1748781 is using 3.02 GiB.
#
# ./data/gdp_MBCnews.mp4 Resolution:640x360 Length:30" CPU
#
# The video begins with a man in a suit and tie standing behind a desk, speaking to the camera. He is wearing a blue suit and tie,
# and he appears to be a news anchor or reporter. The background behind him is a cityscape with tall buildings and a river.
# The news anchor is speaking in Korean, and there is a caption at the bottom of the screen that reads "MBC NEWS 9:30." The caption is in Korean,
# and it translates to "MBC NEWS 9:30" in English. The news anchor is wearing a blue suit and tie, and he appears to be a news anchor or reporter.
# The background behind him is a cityscape with tall buildings and a river. The news anchor is speaking in Korean,
# and there is a caption at the bottom of the screen that reads "MBC NEWS 9:30." The caption is in Korean, and it translates to "MBC NEWS 9:30" in English.
#
# Korean translation of the description above:
# 영상은 정장 차림의 남성이 책상 뒤에 서서 카메라를 향해 말하는 것으로 시작됩니다. 그는 파란색 정장과 넥타이를 착용하고 있으며, 뉴스 앵커나 리포터로 보입니다.
# 배경에는 고층 건물과 강이 보이는 도시 풍경이 나타납니다. 뉴스 앵커는 한국어로 말하고 있으며,
# 화면 하단에는 "MBC 뉴스 9:30"이라는 한글 자막과 함께 영어로 "MBC NEWS 9:30"이 표시됩니다. 앵커는 정장과 넥타이를 착용하고 있으며, 뉴스 앵커 또는 리포터로 보입니다.
# 배경은 여전히 고층 건물과 강이 있는 도시 풍경입니다. 앵커는 한국어로 이야기하고 있으며, 화면 하단에는 "MBC 뉴스 9:30"이라는 한글 자막과 함께 영어로 동일한 문구가 표시됩니다.

# conversation = [
#     {
#         "role": "user",
#         "content": [
#             {"type": "video", "path": "./data/gdp_MBCnews.mp4"},
#             {"type": "text", "text": "Describe this video in detail."},
#         ],
#     }
# ]
# inputs = processor.apply_chat_template(
#     conversation,
#     video_fps=1,
#     add_generation_prompt=True,
#     tokenize=True,
#     return_dict=True,
#     return_tensors="pt"
# ).to(model.device)

###############################
# Batch Mixed Media Inference
# The model can batch inputs composed of mixed samples of various types such as images, videos, and text. Here is an example.

# Conversation for the first image
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "./data/text_kor.png"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

###############################
# Inference: Generation of the output
output_ids = model.generate(**inputs, max_new_tokens=1280)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)

response = chat_model.invoke("<내용>" + output_text[0] + "</내용> <내용>을 한국어로 적어줘.")
print(f"response: {response.content}")

###############################
'''
# Conversation with two images
conversation2 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image2.jpg"},
            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "text", "text": "What is written in the pictures?"}
        ]
    }
]

# Conversation with mixed media
conversation4 = [
    {
        "role": "user",
        "content": [
            {"type": "image", "path": "/path/to/image3.jpg"},
            {"type": "image", "path": "/path/to/image4.jpg"},
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "What are the common elements in these media?"}
        ]
    }
]

# conversation1 and conversation3 are not defined in this script,
# so only the conversations defined above are batched here.
conversations = [conversation, conversation2, conversation4]

# Preparation for batch inference
inputs = processor.apply_chat_template(
    conversations,
    video_fps=1,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Batch Inference
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids = [output_ids[len(input_ids):] for input_ids, output_ids in zip(inputs.input_ids, output_ids)]
output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
print(output_text)

# Multiple Image Inputs
# By default, images and video content are directly included in the conversation.
# When handling multiple images, it is helpful to add labels to the images and videos for better reference.
# Users can control this behavior with the following settings:
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Hello, how are you?"}
        ]
    },
    {
        "role": "assistant",
        "content": "I'm doing well, thank you for asking. How can I assist you today?"
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Can you describe these images and video?"},
            {"type": "image"},
            {"type": "image"},
            {"type": "video"},
            {"type": "text", "text": "These are from my vacation."}
        ]
    },
    {
        "role": "assistant",
        "content": "I'd be happy to describe the images and video for you. Could you please provide more context about your vacation?"
    },
    {
        "role": "user",
        "content": "It was a trip to the mountains. Can you see the details in the images and video?"
    }
]

# default:
prompt_without_id = processor.apply_chat_template(conversation, add_generation_prompt=True)

# add ids
prompt_with_id = processor.apply_chat_template(conversation, add_generation_prompt=True, add_vision_id=True)
'''


Author: 김영국
