As generative AI moves from experimental prototypes to full-scale production, businesses are focusing on cost-efficiency. Running large language models (LLMs) can be expensive, prompting companies to explore methods like caching and routing simpler queries to smaller models. At its re:Invent conference, AWS introduced two such features for its Bedrock LLM hosting service: prompt caching and intelligent prompt routing.
Prompt caching eliminates redundant processing of repeated prompt prefixes. For example, when multiple users ask questions about the same document, caching lets the model reuse the already-processed context rather than recompute it for every query. According to Atul Deo, AWS's director of product for Bedrock, this significantly reduces cost and improves latency, and the savings grow as context windows do: models like Nova now support context windows of 300,000 to 2 million tokens, and that capacity is expected to keep increasing.
AWS claims that caching can lower costs by up to 90% and reduce response times by as much as 85%. Adobe, one of the companies testing this feature, reported a 72% decrease in response time for its generative AI applications on Bedrock.
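To make the mechanism concrete, here is a minimal sketch of what a cached call might look like through Bedrock's Converse API in Python. The model ID, file name, and the exact shape of the cachePoint content block are illustrative assumptions, not a confirmed picture of AWS's implementation; check the current Bedrock documentation before relying on them.

```python
import boto3

# Hypothetical client and model ID; prompt caching support varies by model.
client = boto3.client("bedrock-runtime", region_name="us-east-1")
MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"  # placeholder model ID

# A large shared document that many users will ask questions about.
with open("annual_report.txt") as f:  # hypothetical file
    document = f.read()

def ask(question: str) -> str:
    response = client.converse(
        modelId=MODEL_ID,
        messages=[{
            "role": "user",
            "content": [
                {"text": f"Here is a document:\n\n{document}"},
                # Cache checkpoint: everything above this block is cached on
                # the first call and reused on later calls with the same prefix.
                {"cachePoint": {"type": "default"}},
                {"text": question},  # only this part changes per user
            ],
        }],
    )
    # response["usage"] may report cache read/write token counts, which is
    # one way to verify the cache is actually being hit (field names assumed).
    return response["output"]["message"]["content"][0]["text"]

print(ask("Summarize the main findings."))
print(ask("List the key risks."))  # second call should reuse the cached prefix
```

The key design point is that the cache boundary sits after the large, stable part of the prompt (the document) and before the part that varies (each user's question), so repeated queries pay full price only for the short, unique suffix.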
The second feature, intelligent prompt routing, optimizes model selection within a single model family. It evaluates each incoming query and automatically routes it to the most suitable model: simple queries go to smaller, faster models, avoiding the expense of invoking larger, more capable but slower ones.
While prompt routing is not entirely new—similar systems exist in open-source projects and startups—AWS emphasizes that its solution requires minimal human input. Currently, the routing feature is limited to models within the same family, but AWS plans to expand its capabilities and offer more customization options in the future.
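For illustration, a routed call might look like the following sketch, in which a prompt router stands in for a concrete model ID. The router ARN, account number, and the trace field used to inspect the routing decision are assumptions here, not confirmed details of the service.

```python
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")

# Placeholder ARN: the account number and router name are illustrative.
ROUTER_ARN = ("arn:aws:bedrock:us-east-1:123456789012:"
              "default-prompt-router/anthropic.claude:1")

for prompt in [
    "What is the capital of France?",  # trivial: a small model should suffice
    "Design a sharding strategy for a 10 TB multi-tenant Postgres cluster.",
]:
    response = client.converse(
        modelId=ROUTER_ARN,  # the router ARN goes where a model ID would
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    answer = response["output"]["message"]["content"][0]["text"]
    # The response trace should indicate which model the router actually
    # invoked (field names assumed for illustration).
    chosen = response.get("trace", {}).get("promptRouter", {}).get("invokedModelId")
    print(f"routed to: {chosen}\n{answer}\n")
```

From the caller's perspective nothing else changes, which is what AWS means by minimal human input: the routing decision happens per request, inside the service, rather than in application code.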
To support niche needs, AWS also announced the Amazon Bedrock Marketplace, which gives businesses access to specialized models built for smaller audiences. Unlike with standard Bedrock models, marketplace users will need to provision and manage infrastructure capacity themselves. The marketplace will launch with around 100 specialized models, with more to be added over time.
These new features, prompt caching and intelligent routing, aim to make generative AI more cost-effective and accessible for businesses. While AWS's moves are part of a broader industry trend, its focus on reducing costs and improving efficiency positions Bedrock as a valuable tool for organizations scaling their AI operations. The specialized model marketplace further broadens Bedrock's appeal, catering to diverse use cases across industries.