SCALE & LLM-OPS: ARCHITECTING LLM-AS-A-SERVICE - INFRASTRUCTURE REQUIREMENTS FOR HIGH-CONCURRENCY AGENTIC WORKLOADS

Authors

  • Pavan Madduri

DOI:

https://doi.org/10.46121/pspc.54.1.27

Keywords:

large language models, LLM infrastructure, agentic AI, model serving, GPU optimization, inference scaling

Abstract

Large language models have transitioned from research prototypes to production services powering critical business applications, yet the infrastructure requirements for deploying LLMs at scale remain poorly understood, particularly for agentic workloads involving multi-turn conversations, tool use, and complex reasoning chains. This research develops and evaluates comprehensive infrastructure architectures for LLM-as-a-Service platforms supporting high-concurrency agentic workloads. The study implements three deployment strategies—monolithic GPU clusters, disaggregated inference, and hybrid edge-cloud architectures—across production environments serving 2.4 million requests daily. The hybrid architecture achieved 94.7% GPU utilization while maintaining P95 latency below 850 milliseconds for complex agentic interactions requiring multiple LLM invocations. Disaggregated inference enabled a 3.2x cost reduction by separating prompt processing from token generation and optimizing each phase independently. Intelligent request batching improved throughput by 420% compared to naive sequential processing while preserving response quality. The framework scaled to 12,000 concurrent agentic sessions without performance degradation through adaptive resource allocation and predictive auto-scaling. Cost per million tokens decreased by 67% through architectural optimizations including KV cache sharing, speculative decoding, and quantization techniques. This research contributes practical infrastructure patterns that enable organizations to deploy production LLM services supporting agentic workloads at enterprise scale with acceptable latency and cost economics.
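The batching improvement described in the abstract rests on admitting new requests into a running batch as soon as slots free up, rather than waiting for an entire batch to drain. The sketch below is an illustrative toy scheduler, not code from the paper; the class name `ContinuousBatcher`, the `Request` fields, and the single-token-per-step decode model are all simplifying assumptions.

```python
from dataclasses import dataclass
from collections import deque

@dataclass
class Request:
    request_id: str
    prompt_tokens: int
    max_new_tokens: int

class ContinuousBatcher:
    """Toy continuous-batching scheduler: waiting requests are admitted
    into the running batch whenever a slot frees up, instead of waiting
    for the whole batch to finish (the naive sequential baseline)."""

    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.waiting: deque = deque()
        self.running: dict = {}  # request_id -> decode tokens remaining

    def submit(self, req: Request) -> None:
        self.waiting.append(req)

    def step(self) -> list:
        """Run one decode step; return ids of requests that finished."""
        # Admit waiting requests into any free batch slots.
        while self.waiting and len(self.running) < self.max_batch_size:
            req = self.waiting.popleft()
            self.running[req.request_id] = req.max_new_tokens
        finished = []
        for rid in list(self.running):
            self.running[rid] -= 1  # emit one token per running request
            if self.running[rid] == 0:
                finished.append(rid)
                del self.running[rid]
        return finished
```

A real serving stack would additionally separate the prefill phase (processing `prompt_tokens`) from decode, as in the disaggregated-inference strategy the abstract evaluates, but the slot-recycling loop above is the core of the throughput gain over sequential processing.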

Published

2026-02-25