This assistant helps team members navigate apps, capture screens, interpret on-screen text, and generate intelligent guidance using free AI tools. Below are the tech stack and instructions for customizing your department's version.
| Layer | Tool/Tech | Role |
|---|---|---|
| Interface | Gradio / Streamlit | Launch a user-friendly UI |
| Screen Input | Python + OpenCV | Capture the screen or a window |
| Text Recognition | Tesseract OCR | Extract readable content |
| Vision Model | Moondream 2 / BLIP / LLaVA | Understand UI elements |
| Language Model | Mistral 7B / Phi-2 / LLaMA 3 | Natural-language reasoning |
| Orchestrator | LangChain | Connect vision, language, and actions |
| Local Hosting | LocalAI | Run everything without the cloud |
| Audio Input (optional) | Whisper.cpp | Voice command support |
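For concreteness, here is a minimal sketch of how the Screen Input and Text Recognition layers could fit together (roughly the job of `capture_screen.py` and `ocr_engine.py`). It assumes the `mss`, `opencv-python`, and `pytesseract` packages plus a local Tesseract install; treat it as a starting point rather than the canonical implementation.

```python
import cv2
import numpy as np
import pytesseract
from mss import mss

def capture_and_read() -> str:
    """Grab the primary monitor and return any text Tesseract finds."""
    with mss() as sct:
        shot = sct.grab(sct.monitors[1])  # monitor 1 = primary display
    frame = np.array(shot)                # BGRA pixel buffer
    gray = cv2.cvtColor(frame, cv2.COLOR_BGRA2GRAY)  # OCR works best on grayscale
    return pytesseract.image_to_string(gray)

if __name__ == "__main__":
    print(capture_and_read())
```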
Project structure:

```
vision-assistant/
├── app.py
├── capture_screen.py
├── ocr_engine.py
├── vision_model.py
├── language_model.py
├── agent_logic.py
├── config.yaml
├── requirements.txt
├── README.md
└── utils/
    ├── image_utils.py
    └── text_utils.py
```
A minimal `app.py` wires the agent into the Gradio UI:

```python
import gradio as gr

from agent_logic import process_screen

# One-button UI: process_screen takes no inputs (it captures the screen itself)
# and returns its guidance as text.
gr.Interface(
    fn=process_screen,
    inputs=[],
    outputs="text",
    title="Vision Assistant",
).launch()
```
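Running `python app.py` starts a local Gradio server and prints a URL (typically http://127.0.0.1:7860) to open in your browser.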
Team members can edit `agent_logic.py` and `config.yaml` to tailor prompts and task-automation logic to their department's workflows (e.g., HR onboarding, Sales CRM navigation); an illustrative sketch of one possible `process_screen` follows below.
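As one illustration only, not the shipped implementation (assumptions: the `capture_and_read` helper sketched above, a running LocalAI server exposing its OpenAI-compatible `/v1/chat/completions` endpoint, and `localai_url` / `model` / `prompt_template` keys in `config.yaml`, none of which are mandated by this project), `process_screen` might look like:

```python
import requests
import yaml

from capture_screen import capture_and_read  # hypothetical helper, sketched above

def process_screen() -> str:
    """Capture the screen, OCR it, and ask a locally hosted model for guidance."""
    with open("config.yaml") as f:
        cfg = yaml.safe_load(f)  # e.g. localai_url, model, prompt_template

    screen_text = capture_and_read()
    prompt = cfg["prompt_template"].format(screen_text=screen_text)

    # LocalAI serves an OpenAI-compatible chat endpoint; the URL and model name
    # come from config.yaml (illustrative keys, not a fixed schema).
    resp = requests.post(
        f"{cfg['localai_url']}/v1/chat/completions",
        json={
            "model": cfg["model"],
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```

Once vision-model output is in the loop, the HTTP call can be swapped for a LangChain chain; the shape of `process_screen` (no inputs, text out) is all `app.py` depends on.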
Please add any improvements, utilities, or department workflows to the shared Git repo. Make sure your models are locally hosted or cloud-safe per your privacy standards.
This assistant is built for local use. Make sure sensitive data is handled according to your organization's internal guidelines.