Operators had to answer questions such as whether a person in a red jacket appeared near a gate after 9pm, but traditional network video recorder (NVR) tooling still forced them to scrub through footage camera by camera and timestamp by timestamp.
The compliance posture ruled out sending CCTV footage to a third-party cloud service. Any useful AI layer had to be provisioned inside the client's environment.
The team also wanted alerting that could keep up with changing business scenarios. Training a new detector every time someone requested a new non-critical condition was too slow and too expensive.
We started with a narrow proof of value: ingest a bounded set of CCTV clips, generate visual-language embeddings, and expose a simple search box that accepted real operator questions.
The prototype proved the important part quickly. Instead of browsing an NVR timeline, operators could type a scene description and jump directly to likely clips.
From there we added early zero-shot alert experiments. The point was not to replace urgent safety systems, but to let teams monitor long-tail conditions without starting a model-training project every time.
The full solution is an on-prem video intelligence stack. Camera streams are segmented into searchable clips, passed through a local small VLM for scene understanding, then embedded and written into a private vector database with camera, timestamp, and location metadata.
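The ingestion path described above can be sketched roughly as follows. This is a minimal illustration, not the production code: `ClipRecord`, `segment_windows`, `toy_embed`, and `index_stream` are hypothetical names, the `describe` callback stands in for the local VLM, the hashed bag-of-words vector stands in for a real embedding model, and a plain Python list stands in for the private vector database.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, Iterator, List, Tuple
import hashlib
import math


@dataclass
class ClipRecord:
    """One searchable clip plus the metadata stored next to its vector."""
    camera_id: str
    location: str
    start: datetime
    end: datetime
    caption: str
    embedding: List[float]


def segment_windows(start: datetime, end: datetime,
                    clip_seconds: int = 10) -> Iterator[Tuple[datetime, datetime]]:
    """Yield (clip_start, clip_end) pairs covering [start, end)."""
    step = timedelta(seconds=clip_seconds)
    cursor = start
    while cursor < end:
        yield cursor, min(cursor + step, end)
        cursor += step


def toy_embed(text: str, dim: int = 64) -> List[float]:
    """Hashed bag-of-words vector, L2-normalised.

    Placeholder only (assumption): a real deployment would call a local
    embedding model here.
    """
    vec = [0.0] * dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def index_stream(camera_id: str, location: str, start: datetime, end: datetime,
                 describe: Callable[[str, datetime, datetime], str],
                 store: List[ClipRecord], clip_seconds: int = 10) -> List[ClipRecord]:
    """Segment a stream, caption each clip, embed the caption, store the record."""
    for clip_start, clip_end in segment_windows(start, end, clip_seconds):
        caption = describe(camera_id, clip_start, clip_end)  # stand-in for the VLM
        store.append(ClipRecord(camera_id, location, clip_start, clip_end,
                                caption, toy_embed(caption)))
    return store
```

Keeping camera, timestamp, and location on the same record as the vector is what lets later filters narrow the search before any similarity scoring happens.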
When an operator searches, a local small LLM plans the query: it separates the visual description from filters such as camera zone, time window, and event type. The system retrieves candidate clips from the vector DB, reranks them against the original request, and returns the most relevant moments first.
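The plan, retrieve, rerank flow might look something like the sketch below. Everything here is an assumption made for illustration: the regex-based `plan_query` is a stand-in for the local LLM planner, `KNOWN_ZONES` is a made-up zone list, and the two text-similarity functions are placeholders for vector search and a real reranking model.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, time
from typing import List, Optional

KNOWN_ZONES = ("loading dock", "lobby", "gate")  # hypothetical zone names


@dataclass
class Clip:
    clip_id: str
    zone: str
    start: datetime
    caption: str


@dataclass
class QueryPlan:
    visual_text: str        # what the scene should look like
    zone: Optional[str]     # metadata filter: camera zone
    after: Optional[time]   # metadata filter: earliest time of day


def plan_query(question: str) -> QueryPlan:
    """Split a question into a visual description and metadata filters."""
    text = question.lower()
    after = None
    match = re.search(r"\bafter (\d{1,2})\s*(am|pm)\b", text)
    if match:
        hour = int(match.group(1)) % 12 + (12 if match.group(2) == "pm" else 0)
        after = time(hour, 0)
        text = text.replace(match.group(0), " ")
    zone = next((z for z in KNOWN_ZONES if z in text), None)
    return QueryPlan(" ".join(text.split()), zone, after)


def bow_cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity; placeholder for vector search."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def token_overlap(a: str, b: str) -> float:
    """Jaccard token overlap; placeholder for a reranking model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def search(question: str, clips: List[Clip],
           candidate_k: int = 20, top_k: int = 3) -> List[Clip]:
    plan = plan_query(question)
    # 1. Apply metadata filters before any similarity scoring.
    eligible = [c for c in clips
                if (plan.zone is None or c.zone == plan.zone)
                and (plan.after is None or c.start.time() >= plan.after)]
    # 2. Retrieve candidates using only the visual part of the query.
    candidates = sorted(eligible, reverse=True,
                        key=lambda c: bow_cosine(plan.visual_text, c.caption))[:candidate_k]
    # 3. Rerank the candidates against the original, unsplit request.
    return sorted(candidates, reverse=True,
                  key=lambda c: token_overlap(question, c.caption))[:top_k]
```

In this sketch, a question such as "person in a red jacket near the gate after 9pm" yields a 21:00 time filter and a `gate` zone filter, while the remaining text drives the similarity search.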
Operators can search by object, person attributes, location, time window, or plain-language description. Matching clips can be stitched into reviewable incident timelines for handoff, with the embedding, retrieval, and reranking path staying inside the client's environment.
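One plausible way to stitch matching clips into a reviewable timeline is to sort them by start time and merge any that fall within a short gap of each other. The sketch below assumes that approach; `Match`, `Incident`, `stitch_timeline`, and the 30-second default gap are illustrative choices, not details taken from the deployed system.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List


@dataclass
class Match:
    camera_id: str
    start: datetime
    end: datetime


@dataclass
class Incident:
    start: datetime
    end: datetime
    matches: List[Match] = field(default_factory=list)


def stitch_timeline(matches: List[Match],
                    max_gap: timedelta = timedelta(seconds=30)) -> List[Incident]:
    """Group matches into incidents when they are at most max_gap apart."""
    incidents: List[Incident] = []
    for match in sorted(matches, key=lambda m: m.start):
        if incidents and match.start - incidents[-1].end <= max_gap:
            # Close enough to the running incident: extend it.
            current = incidents[-1]
            current.end = max(current.end, match.end)
            current.matches.append(match)
        else:
            # Too far from the previous incident: start a new one.
            incidents.append(Incident(match.start, match.end, [match]))
    return incidents
```

Because matches from different cameras are merged purely on time, a person moving between two adjacent cameras ends up in one incident that a reviewer can hand off as a single sequence.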
For alerting, the system supports configurable zero-shot rules for non-time-sensitive events. A rule can describe what to look for, and the same local VLM plus retrieval pipeline evaluates it against indexed footage without requiring a freshly trained detector.
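A zero-shot rule of this kind can be thought of as plain configuration plus a scoring function. The sketch below is a guess at the general shape rather than the actual rule format: `AlertRule`, `IndexedClip`, `evaluate_rule`, and the threshold value are hypothetical, and `keyword_score` is a deliberately crude placeholder for the local VLM and retrieval scoring.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class AlertRule:
    name: str
    description: str        # plain-language condition to look for
    zones: Tuple[str, ...]  # empty tuple means "all zones"
    threshold: float        # minimum score that counts as a hit


@dataclass
class IndexedClip:
    clip_id: str
    zone: str
    caption: str


def keyword_score(description: str, caption: str) -> float:
    """Fraction of rule words found in the caption.

    Placeholder only (assumption): the real system would score the rule
    with the local VLM and retrieval pipeline instead.
    """
    wanted = set(description.lower().split())
    seen = set(caption.lower().split())
    return len(wanted & seen) / len(wanted) if wanted else 0.0


def evaluate_rule(rule: AlertRule, clips: List[IndexedClip],
                  score: Callable[[str, str], float] = keyword_score
                  ) -> List[Tuple[str, float]]:
    """Return (clip_id, score) for every in-scope clip that meets the threshold."""
    hits = []
    for clip in clips:
        if rule.zones and clip.zone not in rule.zones:
            continue  # rule does not apply to this camera zone
        value = score(rule.description, clip.caption)
        if value >= rule.threshold:
            hits.append((clip.clip_id, value))
    return hits
```

Since a rule is only data, adding a new monitoring condition means writing a new description and threshold rather than collecting labels and training a detector.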
- CCTV review moved from manual timeline scrubbing to clip retrieval within seconds.
- The architecture respected the client's on-prem compliance requirement.
- New monitoring ideas could be tested as prompts and rules before becoming hardened workflows.