Semantic video search for private CCTV operations

// Customer context

A client had a large CCTV footprint and strict compliance requirements. The footage could not be sent to a public cloud, and the existing NVR workflow made investigations slow. We prototyped semantic search over video clips, then expanded it into an on-prem video intelligence layer for search, review, and non-time-sensitive alerts.

deployment: on-prem

interface: plain English

alerting: zero-shot rules

01 The customer problem

Operators were asked to answer questions like whether a person in a red jacket appeared near a gate after 9pm, but traditional NVR tooling still forced them to scrub camera by camera and timestamp by timestamp.

The compliance posture ruled out sending CCTV footage to a third-party cloud service. Any useful AI layer had to be provisioned inside the client's environment.

The team also wanted alerting for changing business scenarios. Training a new detector every time someone asked for a new non-critical condition was too slow and too expensive.

02 The proof of value

We started with a narrow proof of value: ingest a bounded set of CCTV clips, generate visual-language embeddings, and expose a simple search box that accepted real operator questions.
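
To make the idea concrete, here is a minimal sketch of embedding search over a handful of clips. It is not the prototype's code: the three-dimensional vectors, the clip IDs, and the `search` helper are all hypothetical stand-ins, and a real deployment would get its vectors from a visual-language embedding model rather than from hand-written numbers.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def search(query_vec, clip_index, top_k=3):
    """Return the IDs of the top_k clips most similar to the query vector."""
    scored = [(cosine(query_vec, vec), clip_id) for clip_id, vec in clip_index.items()]
    scored.sort(reverse=True)
    return [clip_id for _, clip_id in scored[:top_k]]

# Hypothetical index: clip ID -> toy embedding (assumed, not real model output).
clip_index = {
    "gate_cam_2105": [0.9, 0.1, 0.0],
    "lobby_cam_0830": [0.1, 0.8, 0.2],
    "dock_cam_2300": [0.2, 0.1, 0.9],
}

# Stand-in for the embedded operator question.
query_vec = [0.8, 0.2, 0.1]
results = search(query_vec, clip_index, top_k=2)
print(results)
```

The operator's text and the clips share one vector space, so the search box reduces to a nearest-neighbour lookup.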

The prototype proved the important part quickly. Instead of browsing an NVR timeline, operators could type a scene description and jump directly to likely clips.

From there we added early zero-shot alert experiments. The point was not to replace urgent safety systems, but to let teams monitor long-tail conditions without starting a model-training project every time.

03 The production solution

The full solution is an on-prem video intelligence stack. Camera streams are segmented into searchable clips, passed through a local small VLM for scene understanding, then embedded and written into a private vector database with camera, timestamp, and location metadata.
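
The segmentation step can be sketched as follows, assuming fixed-length clips. The `ClipRecord` fields, the 30-second window, and the camera and location names are illustrative assumptions. The source does not say how clips are cut, and the caption and embedding fields are left empty here because they would be filled by the local VLM and embedding model.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class ClipRecord:
    # Metadata named in the text: camera, timestamp, location.
    camera_id: str
    location: str
    start: datetime
    end: datetime
    caption: str = ""      # assumed to be filled later by the local VLM
    embedding: tuple = ()  # assumed to be filled later by the embedding model

def segment_stream(camera_id, location, start, end, clip_seconds=30):
    """Split a recording window into consecutive clips of at most clip_seconds."""
    clips = []
    step = timedelta(seconds=clip_seconds)
    cursor = start
    while cursor < end:
        clip_end = min(cursor + step, end)  # the last clip may be shorter
        clips.append(ClipRecord(camera_id, location, cursor, clip_end))
        cursor = clip_end
    return clips

# Hypothetical camera and a 2 minute 15 second recording window.
clips = segment_stream(
    "cam-07", "north gate",
    datetime(2024, 1, 1, 21, 0, 0),
    datetime(2024, 1, 1, 21, 2, 15),
)
for clip in clips:
    print(clip.camera_id, clip.start.time(), clip.end.time())
```

Keeping the metadata on the same record as the embedding is what later lets a search filter by camera or time window before any similarity scoring happens.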

When an operator searches, a local small LLM plans the query: it separates the visual description from filters such as camera zone, time window, and event type. The system retrieves candidate clips from the vector DB, reranks them against the original request, and returns the most relevant moments first.
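
A rough sketch of that plan, retrieve, rerank flow is shown below. Everything in it is an assumption made for illustration: the `QueryPlan` is written by hand where the real system would have the local LLM produce it, the two-dimensional embeddings are toy values, and the reranker is a simple word-overlap score standing in for whatever model the production stack uses.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryPlan:
    description: str                  # the visual part of the request
    zone: Optional[str] = None        # metadata filter: camera zone
    after_hour: Optional[int] = None  # metadata filter: earliest hour of day

@dataclass
class Clip:
    clip_id: str
    zone: str
    hour: int
    caption: str
    embedding: tuple

def retrieve(plan, query_vec, clips, top_k=10):
    """Apply metadata filters, then rank the survivors by dot-product similarity."""
    candidates = [
        c for c in clips
        if (plan.zone is None or c.zone == plan.zone)
        and (plan.after_hour is None or c.hour >= plan.after_hour)
    ]
    candidates.sort(
        key=lambda c: sum(q * e for q, e in zip(query_vec, c.embedding)),
        reverse=True,
    )
    return candidates[:top_k]

def rerank(plan, candidates):
    """Toy reranker: order candidates by word overlap with the description."""
    wanted = set(plan.description.lower().split())
    return sorted(
        candidates,
        key=lambda c: len(wanted & set(c.caption.lower().split())),
        reverse=True,
    )

clips = [
    Clip("a", "gate", 22, "person in red jacket walking past gate", (0.7, 0.3)),
    Clip("b", "gate", 21, "white van parked at gate", (0.9, 0.1)),
    Clip("c", "gate", 14, "person in red jacket at gate", (0.8, 0.2)),
    Clip("d", "lobby", 23, "person in red jacket in lobby", (0.8, 0.2)),
]

# Hand-written plan for "person in a red jacket near the gate after 9pm".
plan = QueryPlan("person in red jacket", zone="gate", after_hour=21)
candidates = retrieve(plan, (1.0, 0.0), clips)
ranked = [c.clip_id for c in rerank(plan, candidates)]
print(ranked)
```

In this toy data the vector stage alone ranks the van clip first. The rerank pass against the original wording is what moves the red-jacket clip to the top.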

Operators can search by object, person attributes, location, time window, or plain-language description. Matching clips can be stitched into reviewable incident timelines for handoff, with the embedding, retrieval, and reranking path staying inside the client's environment.
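
One plausible way to stitch matches into an incident timeline is to sort them by start time and merge clips that fall within a short gap of each other. The sketch below assumes that approach. The 60-second gap, the `Match` tuple, and the camera names are hypothetical, and the source does not describe the actual stitching logic.

```python
from datetime import datetime, timedelta
from typing import NamedTuple

class Match(NamedTuple):
    camera_id: str
    start: datetime
    end: datetime

def stitch_timeline(matches, max_gap=timedelta(seconds=60)):
    """Group matching clips into incidents when they are close together in time."""
    incidents = []
    for clip in sorted(matches, key=lambda m: m.start):
        if incidents and clip.start - incidents[-1]["end"] <= max_gap:
            # Close enough to the running incident: extend it.
            incidents[-1]["clips"].append(clip)
            incidents[-1]["end"] = max(incidents[-1]["end"], clip.end)
        else:
            # Otherwise start a new incident.
            incidents.append({"start": clip.start, "end": clip.end, "clips": [clip]})
    return incidents

day = datetime(2024, 1, 1)
matches = [
    Match("gate", day.replace(hour=21, minute=10), day.replace(hour=21, minute=10, second=30)),
    Match("gate", day.replace(hour=21, minute=0), day.replace(hour=21, minute=0, second=30)),
    Match("lot", day.replace(hour=21, minute=1), day.replace(hour=21, minute=1, second=30)),
]

incidents = stitch_timeline(matches)
for incident in incidents:
    cameras = [clip.camera_id for clip in incident["clips"]]
    print(incident["start"].time(), incident["end"].time(), cameras)
```

Merging across cameras rather than per camera means a person moving from the gate to the lot shows up as one incident for handoff.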

For alerting, the system supports configurable zero-shot rules for non-time-sensitive events. A rule can describe what to look for, and the same local VLM plus retrieval pipeline evaluates it against indexed footage without requiring a freshly trained detector.
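
As an illustration, a rule of this kind can be little more than a prompt, a scope, and a threshold. The sketch below is an assumption about shape, not the product's configuration format: `AlertRule`, its fields, and the `keyword_score` function are hypothetical, and the keyword scorer is only a placeholder for scoring by the local VLM and retrieval pipeline.

```python
from dataclasses import dataclass

@dataclass
class AlertRule:
    name: str
    prompt: str       # plain-language description of what to look for
    zones: tuple      # camera zones the rule applies to
    min_score: float  # minimum match score before the rule fires

def keyword_score(prompt, clip):
    """Placeholder scorer: fraction of prompt words found in the clip caption."""
    words = set(prompt.lower().split())
    if not words:
        return 0.0
    return len(words & set(clip["caption"].lower().split())) / len(words)

def evaluate_rule(rule, indexed_clips, score_fn):
    """Return (clip_id, score) for every in-scope clip that meets the threshold."""
    hits = []
    for clip in indexed_clips:
        if clip["zone"] not in rule.zones:
            continue
        score = score_fn(rule.prompt, clip)
        if score >= rule.min_score:
            hits.append((clip["clip_id"], score))
    return hits

rule = AlertRule(
    name="blocked-fire-exit",
    prompt="pallet blocking fire exit",
    zones=("warehouse",),
    min_score=0.75,
)

indexed_clips = [
    {"clip_id": "c1", "zone": "warehouse", "caption": "wooden pallet blocking the fire exit door"},
    {"clip_id": "c2", "zone": "warehouse", "caption": "forklift parked near exit"},
    {"clip_id": "c3", "zone": "lobby", "caption": "pallet blocking fire exit"},
]

hits = evaluate_rule(rule, indexed_clips, keyword_score)
print(hits)
```

Because the rule is data rather than a trained model, changing what is monitored means editing the prompt or threshold instead of collecting labels and retraining.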

04 What changed

CCTV review moved from manual timeline scrubbing to seconds-level clip retrieval.
The architecture respected the client's on-prem compliance requirement.
New monitoring ideas could be tested as prompts and rules before becoming hardened workflows.

AI agents for commercial cleaning operations