CONCIERGE AGENT: A LOCAL HYBRID VISION-LANGUAGE FRAMEWORK FOR PRIVACY PRESERVING AUTONOMOUS COMPUTER USE
DOI:
https://doi.org/10.46121/pspc.54.2.50
Keywords:
Computer Use Agents, Vision Language Models, Local Inference, Privacy Preserving Automation, Browser Automation, Operating System Control, GUI Agents
Abstract
Most computer-use agents available today rely on cloud-hosted AI models that require continuous transmission of screenshots and user activity to remote servers, which poses significant privacy risks when a task involves personal information, credentials, or documents. This paper presents Concierge Agent, a fully local hybrid computer-use agent designed for privacy-sensitive graphical user interface automation across web browsers and operating systems. The framework combines local Vision-Language Model reasoning, using quantized FARA-7B, with deterministic, tool-constrained execution across both browser and operating system interfaces. In contrast to cloud-based agents that continuously transmit screenshots for remote inference, the proposed system keeps visual context and execution traces on the user's device. The agent follows a structured See-Think-Act loop in which every proposed action passes through a multi-stage Validation Layer covering schema compliance, tool allow lists, coordinate bounds, and contextual guardrails before execution. Experimental evaluation shows that this validation-first design attains a 100% task success rate by filtering invalid actions and recovering through controlled regeneration, compared with a 42.1% success rate for the cloud-based GPT-4o. This reliability comes at a cost in execution time: 177 seconds for local models versus 44 seconds for cloud models. The results indicate that local hybrid agents offer an effective alternative to cloud-based agents for reliable computer automation in scenarios where keeping data on the local device is of primary importance.
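The validation-first step of the See-Think-Act loop can be illustrated with a minimal Python sketch. This is not the paper's implementation: the action format (a dict with `tool` and `args`), the tool names in `ALLOWED_TOOLS`, the blocked-text guardrail, and the `propose` callback standing in for the local model are all assumptions made for illustration. The sketch shows only the general pattern of checking schema, allow list, coordinate bounds, and a contextual guardrail in sequence, then regenerating on rejection.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Hypothetical allow list; the abstract does not name the actual tools.
ALLOWED_TOOLS = {"click", "type_text", "scroll"}
# Hypothetical contextual guardrail: text that must never be typed.
BLOCKED_SUBSTRINGS = ("rm -rf",)


@dataclass
class Screen:
    width: int
    height: int


def validate(action: object, screen: Screen) -> Optional[str]:
    """Return None if the action passes every stage, else a rejection reason."""
    # Stage 1: schema compliance (assumed shape: {"tool": str, "args": dict}).
    if not isinstance(action, dict) or not isinstance(action.get("tool"), str):
        return "schema: missing or invalid 'tool'"
    if not isinstance(action.get("args"), dict):
        return "schema: missing or invalid 'args'"
    tool, args = action["tool"], action["args"]

    # Stage 2: tool allow list.
    if tool not in ALLOWED_TOOLS:
        return f"allow list: '{tool}' is not permitted"

    # Stage 3: coordinate bounds for pointer actions.
    if tool == "click":
        x, y = args.get("x"), args.get("y")
        if not (isinstance(x, int) and isinstance(y, int)):
            return "schema: click needs integer 'x' and 'y'"
        if not (0 <= x < screen.width and 0 <= y < screen.height):
            return "bounds: click is outside the screen"

    # Stage 4: contextual guardrail on typed text.
    if tool == "type_text":
        text = args.get("text")
        if not isinstance(text, str):
            return "schema: type_text needs string 'text'"
        if any(blocked in text for blocked in BLOCKED_SUBSTRINGS):
            return "guardrail: blocked text"

    return None


def next_valid_action(
    propose: Callable[[Optional[str]], object],
    screen: Screen,
    max_attempts: int = 3,
) -> Optional[dict]:
    """Ask the model for an action, regenerating with feedback on rejection."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        action = propose(feedback)  # stand-in for the local model call
        feedback = validate(action, screen)
        if feedback is None:
            return action
    return None  # no valid action within the attempt budget
```

In this sketch a rejected action is never executed; the rejection reason is passed back to `propose` so the next attempt can correct it, which mirrors the filter-and-regenerate behavior described in the abstract under the stated assumptions.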

