# Using the Vision Service
With our dedicated Vision Service, you get access to additional models and configuration options for advanced computer vision tasks. For example, you can use it to generate custom captions and labels for your photos. The service runs in a separate container that acts as a proxy between the models and PhotoPrism®, extending its capabilities. It also allows Python developers to experiment with new ideas, try different models, and customize prompts.
The service and its integrations are under active development, so the configuration, commands, and other details may change or break unexpectedly. Please keep this in mind and notify us when something doesn't work as expected. Thank you for your help in keeping this documentation updated!
## Getting Started
This guide explains how to set up the dedicated service as an AI model proxy to enhance PhotoPrism's capabilities. You can use a wide range of additional models with it, including lightweight, preconfigured models, as well as popular but more demanding large language models in combination with Ollama.
While the upcoming version of PhotoPrism will also allow you to generate captions with Ollama directly, a key advantage of using the dedicated vision service is greater flexibility and access to an even broader range of models. This makes it ideal for advanced users and developers.
Developers can proceed to the Build Setup guide, which explains how to set up a Vision Service development environment.
Since neither the Vision Service nor Ollama supports authentication, both services should only be used within a secure, private network. They must not be exposed to the public internet.
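One way to limit exposure is to bind the published port to a private address instead of all interfaces. The sketch below shows the standard Docker Compose syntax for this; `192.168.1.10` is a placeholder for your server's LAN IP:

```yaml
services:
  photoprism-vision:
    ports:
      # Bind to a private LAN address instead of all interfaces (0.0.0.0)
      - "192.168.1.10:5000:5000"
```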
### Step 1: Start the Vision Service
1. Create a new, empty folder on the server where you want to run the Vision service.

2. Inside this folder, create a `compose.yaml` file with the following content:

    ```yaml
    services:
      photoprism-vision:
        image: photoprism/vision:latest
        restart: unless-stopped
        ports:
          - "5000:5000"
        environment:
          # Set OLLAMA_ENABLED=true and configure the host if you want this service to use Ollama
          - OLLAMA_ENABLED=false
          - OLLAMA_HOST=http://<ollama-ip>:11434
        volumes:
          - "./models:/app/models"
          - "./venv:/app/venv"
    ```

3. If you plan to use Ollama through this service, set `OLLAMA_ENABLED=true` and replace `<ollama-ip>` with the IP address of your Ollama machine.

4. Start the service:

    ```bash
    docker compose up -d
    ```
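To confirm that the container started successfully, you can use the standard Docker Compose status and log commands from the same folder (the service name matches the `compose.yaml` above):

```bash
# Show the container status
docker compose ps

# Follow the service logs to check for startup errors
docker compose logs -f photoprism-vision
```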
### Step 2: Configure PhotoPrism
Now, create a new `config/vision.yml` file or edit the existing file in the storage folder of your PhotoPrism instance, following the examples below. Its absolute path from inside the container is `/photoprism/storage/config/vision.yml`:
This example uses the pre-installed `kosmos-2` model for generating captions. It does not require Ollama.
**Available pre-installed Models**

The Vision service also provides additional pre-installed models, such as `vit-gpt2` and `blip` for image captioning, as well as `nsfw_image_detector` for NSFW content detection. You can enable these models by updating the `Name` field in your `vision.yml` configuration.
`vision.yml`:

```yaml
Models:
  - Type: caption
    Resolution: 720
    Name: "kosmos-2"
    Version: "latest"
    Prompt: |
      Write a journalistic caption that is informative and briefly describes the most important visual content in up to 3 sentences:
      - Use explicit language to describe the scene if necessary for a proper understanding.
      - Avoid text formatting, meta-language, and filler words.
      - Do not start captions with boring phrases such as "This image", "The image", "This picture", "The picture", "A picture of", "Here are", or "There is".
      - Instead, start describing the content by first identifying the subjects and any actions that might be performed.
      - Try providing a casual description of what the subjects look like, including their gender and age.
      - If the place seems special or familiar, provide a brief, interesting description without being vague.
    Service:
      # IMPORTANT: Replace this IP with the address of your Vision service machine.
      Uri: "http://<vision-service-ip>:5000/api/v1/vision/caption"
Thresholds:
  Confidence: 10
```
This example uses Ollama's `llava-phi3` model for generating captions, proxied through the Vision service.
`vision.yml`:

```yaml
Models:
  - Type: caption
    Resolution: 720
    Name: "llava-phi3"
    Version: "latest"
    Prompt: |
      Write a journalistic caption that is informative and briefly describes the most important visual content in up to 3 sentences:
      - Use explicit language to describe the scene if necessary for a proper understanding.
      - Avoid text formatting, meta-language, and filler words.
      - Do not start captions with boring phrases such as "This image", "The image", "This picture", "The picture", "A picture of", "Here are", or "There is".
      - Instead, start describing the content by first identifying the subjects and any actions that might be performed.
      - Try providing a casual description of what the subjects look like, including their gender and age.
      - If the place seems special or familiar, provide a brief, interesting description without being vague.
    Service:
      # IMPORTANT: Replace this IP with the address of your Vision service machine.
      Uri: "http://<vision-service-ip>:5000/api/v1/vision/caption"
Thresholds:
  Confidence: 10
```
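For the Ollama-backed example to work, the model must already be available on your Ollama host. Assuming a standard Ollama installation, you can fetch it with the Ollama CLI:

```bash
# Download the llava-phi3 model on the Ollama host
ollama pull llava-phi3
```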
The config file must be named `vision.yml`, not `vision.yaml`, as otherwise it won't be found and will have no effect.
### Step 3: Restart PhotoPrism
Run the following commands to restart `photoprism` and apply the new settings:

```bash
docker compose stop photoprism
docker compose up -d
```
You should now be able to use the `photoprism vision` CLI commands when opening a terminal, e.g. `photoprism vision run -m caption` to generate captions.
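Assuming PhotoPrism runs in Docker, one way to open such a terminal session is through `docker compose exec`; the sketch below uses the default `photoprism` service name:

```bash
# Generate captions using the model configured in vision.yml
docker compose exec photoprism photoprism vision run -m caption
```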
## Troubleshooting
### GPU Performance Issues
If you're using the Vision Service with Ollama enabled (`OLLAMA_ENABLED=true`), you may experience GPU VRAM management issues over time. The same VRAM degradation symptoms and solutions apply when Ollama is used through the Vision Service proxy.
Detailed troubleshooting tips can be found in the Caption Generation documentation.
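As a generic first step, restarting the Ollama container usually releases GPU memory that was not freed. This sketch assumes Ollama also runs via Docker Compose under a service named `ollama` (adjust the name to your setup):

```bash
# Restart Ollama to release leaked VRAM (service name is an assumption)
docker compose restart ollama
```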