Applying the Amazon Nova Pro model's video understanding capabilities to the first real motion picture ever made!

5 minute read
Content level: Intermediate

The video understanding capabilities of the foundation model Amazon Nova Pro are applied to the first real motion picture ever made, in order to evaluate how a leading-edge LLM performs on creative artifacts published 125+ years ago.

1. Introduction

In this article, we apply the video understanding capabilities of the foundation model Amazon Nova Pro (announced at re:Invent 2024) to the first real motion picture ever made, in order to evaluate how a leading-edge LLM performs on artifacts published 125+ years ago. We will also assess the model's ability to date scenes captured long ago and to “understand” the context of this epoch.

2. The first ever real motion picture

This is the video that we will use for this article:

Leaving Lumiere Factory

The full video can be watched on YouTube at: https://www.youtube.com/watch?v=yvC_xrDqB3s

“Workers Leaving the Lumière Factory” is, according to Wikipedia, an 1895 French short black-and-white silent documentary film directed and produced by Louis Lumière. It is often referred to as the first real motion picture ever made. The film was shot in the 35 mm format with an aspect ratio of 1.33:1, at a speed of 16 frames per second. At that rate, the 17 meters of film length provide a duration of 46 seconds, holding a total of 800 frames.

3. Results

We defined the following system and user prompts for the video analysis:

user_prompt = ("Please, provide all relevant details about the places, people "
               "and actions in the various scenes this video. "
               "If you recognize a famous movie, its authors, its (approximate) date of creation, etc,  "
               "please tell us its name, when and by whom it was done")
               
system_prompt = ("You are an expert media analyst. "
                 "When the user provides you with a video, "
                 "analyze it carefully to answer the questions")

Then, when Nova Pro is invoked (see the Python code below) with these prompts on the video, it gives this answer:

This is a black and white silent film, probably from the early 20th century. It depicts a group of people in a snowy environment, engaging in various activities such as walking, carrying objects, and interacting with each other. The setting appears to be a rural or small-town area, with a barn and a house visible in the background. The people are dressed in winter clothing, including hats, coats, and boots. Some individuals are carrying baskets or other items, suggesting they might be engaged in some form of work or gathering. The film captures a slice of everyday life in a bygone era, showcasing the simplicity and community spirit of the time.

It is interesting to see that the model detected very precisely the various components of the scene, down to the hats, coats, boots and baskets. The proposed date is quite accurate as well: “early 20th century” is not a bad guess for 1895! Personally, I also like the social / emotional touch at the end: “the simplicity and community spirit of the time”.

On the token consumption side, Nova in Bedrock counted 1599 input tokens and 131 output tokens, leading to a total of 1730.
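
As a side note, these token counts can be turned into a rough cost estimate. The snippet below is purely illustrative: the per-1K-token prices are assumptions based on the Nova Pro list prices at launch, so check the current Amazon Bedrock pricing page before relying on them.

# Illustrative cost estimate -- the prices below are assumptions, check the Bedrock pricing page
INPUT_PRICE_PER_1K = 0.0008    # USD per 1000 input tokens (assumed Nova Pro price)
OUTPUT_PRICE_PER_1K = 0.0032   # USD per 1000 output tokens (assumed Nova Pro price)

input_tokens, output_tokens = 1599, 131
cost = (input_tokens / 1000) * INPUT_PRICE_PER_1K + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K
print(f"Approximate cost of this invocation: ${cost:.4f}")  # a fraction of a cent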

4. Amazon Nova Models

Amazon Nova models are in fact an entire family:

  • Amazon Nova Micro, a text-only model that delivers the lowest latency responses at very low cost.
  • Amazon Nova Lite, a very low-cost multimodal model that is lightning fast for processing image, video, and text inputs.
  • Amazon Nova Pro, a highly capable multimodal model with the best combination of accuracy, speed, and cost for a wide range of tasks.
  • Amazon Nova Premier, the most capable of Amazon’s multimodal models for complex reasoning tasks and for use as the best teacher for distilling custom models (available in the Q1 2025 timeframe).
  • Amazon Nova Canvas, a state-of-the-art image generation model.
  • Amazon Nova Reel, a state-of-the-art video generation model.

In the rest of this article, we will use only Amazon Nova Pro and leverage its video understanding capabilities.
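
If you want to check beforehand which Nova models are available in your account and Region, a quick listing like the sketch below can help. It relies on the standard Bedrock list_foundation_models call; the filtering on the "nova" substring is an assumption of mine about the model ID naming convention.

import boto3

# List the Amazon foundation models visible in this Region and keep the Nova ones
bedrock = boto3.client("bedrock", region_name="us-east-1")
for model in bedrock.list_foundation_models(byProvider="Amazon")["modelSummaries"]:
    if "nova" in model["modelId"]:
        print(model["modelId"], "-", model["modelName"])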

For more details on the Nova family of models, see this blog post published for the launch or read the public documentation. If you want to replicate my experiment and tweak its parameters, please read the section about vision capabilities.
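
One practical prerequisite for the code in the next section: the video must sit in an S3 bucket of your account, in the same Region, so that Nova Pro can read it. Here is a minimal upload sketch, where the local file name, bucket name and object key are placeholders to adapt:

import boto3

# Upload the local MP4 so that Nova Pro can read it from S3 (all names are placeholders)
s3 = boto3.client("s3", region_name="us-east-1")
s3.upload_file("lumiere_factory.mp4", "<your-bucket-name>", "lumiere_factory.mp4")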

5. Python code for Nova Invocation

Feel free to reuse this code for your own experiments:

import json

import boto3

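# Account, Region, Nova Pro cross-Region inference profile ARN, and location of the video in S3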
AWS_ACCOUNT = boto3.client('sts').get_caller_identity().get('Account')
AWS_REGION = "us-east-1"
INFERENCE_PROFILE = f"arn:aws:bedrock:{AWS_REGION}:{AWS_ACCOUNT}:inference-profile/us.amazon.nova-pro-v1:0"
S3_BUCKET = "<your-bucket-name>"

VIDEO_FILE = "<your-mp4-video>"

system_prompt = ("You are an expert media analyst. "
                 "When the user provides you with a video, "
                 "analyze it carefully to answer the questions")
user_prompt = ("Please, provide all relevant details about the places, people "
               "and actions in the various scenes this video. "
               "If you recognize a famous movie, its authors, its (approximate) date of creation, etc,  "
               "please tell us its name, when and by whom it was done")

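# User turn: the video read from S3, followed by the text prompt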
user_message_list = [
    {
        "role": "user",
        "content": [
            {
                "video": {
                    "format": "mp4",
                    "source": {
                        "s3Location": {
                            "uri": "s3://" + S3_BUCKET + "/" + VIDEO_FILE,
                            "bucketOwner": AWS_ACCOUNT
                        }
                    }
                }
            },
            {
                "text": user_prompt
            }
        ]
    }
]
system_message_list = [
    {
        "text": system_prompt
    }
]

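# Request body following the Nova messages-v1 schema, with conservative sampling parameters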
body = {
    "schemaVersion": "messages-v1",
    "system": system_message_list,
    "messages": user_message_list,
    "inferenceConfig": {"max_new_tokens": 300, "top_p": 0.1, "top_k": 20, "temperature": 0.3},
}

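# Synchronous invocation of Nova Pro through the Bedrock Runtime API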
invocation = boto3.client("bedrock-runtime", region_name=AWS_REGION).invoke_model(
    body=json.dumps(body),
    modelId=INFERENCE_PROFILE,
    accept="application/json",
    contentType="application/json"
)

# Print the raw invocation metadata, then the parsed response body
print(json.dumps(invocation, indent=3, default=str))

response_body = json.loads(invocation["body"].read())
print("response: ", json.dumps(response_body, indent=3))