Sizing GPUs for 70B-Class LLM Inference: Memory, Throughput, Architecture, and Cost
For most 70B-class dense LLMs, the practical GPU choice is determined less by raw compute than by memory headroom for weights, KV cache, and concurrency. A single 80GB GPU can serve a heavily quantized deployment, but BF16 or FP16 inference usually needs multi-GPU tensor parallelism or a larger-memory accelerator. The correct answer depends on quantization, target context length, batch size, and latency SLOs.
Executive Summary
Choosing the right GPU for 70B-class inference is a memory planning exercise first and a throughput exercise second. If the model must run in BF16 or FP16, one 48GB or 80GB GPU is not enough, because the weights alone occupy roughly 140 GB before any context or concurrency is added. If the model can be quantized to 4-bit, or to 8-bit on an 80GB card, single-GPU deployment becomes possible, but the operational envelope is narrower than most teams expect. In practice, the most common production choices are a single large-memory GPU for tightly controlled workloads, dual 80GB GPUs for balanced enterprise inference, or a 4-GPU node when latency, concurrency, and long context must all be supported at once.
For INS-CO buyers evaluating GPU hosting, dedicated GPU servers, or bare metal AI infrastructure, the important decision is not whether a 70B model can load. The real decision is whether it can load with enough headroom to serve real users, long prompts, and peak traffic without fragmentation, OOM errors, or a painful offload penalty. The best architecture is the one that preserves stable latency at the required context window while keeping operating cost predictable.
Direct Answer
Question: What GPU is recommended for 70B-class LLM inference?
Answer: For BF16 or FP16 inference, a 70B-class dense model needs roughly 140 GB for weights alone, before runtime overhead, so one 80GB GPU is not sufficient. For practical production deployment, the safest choices are dual 80GB GPUs, a multi-GPU node with fast interconnect, or a high-memory GPU only if the model is quantized and the context window is controlled. For cost-sensitive deployments, 4-bit quantization can make a single 48GB GPU workable for low concurrency and shorter contexts, but it is not the default recommendation for demanding enterprise workloads.
Question: Is a single 141GB GPU enough for BF16 70B inference?
Answer: Usually not in a production-safe sense. A 141GB-class GPU roughly matches the 140 GB raw weight size of a 70B dense model in BF16, which leaves almost nothing for runtime buffers, fragmentation, and KV cache, and therefore too little headroom for reliable deployment.
Question: What is the simplest production-safe setup?
Answer: Dual 80GB GPUs with an optimized inference engine, or a 4-GPU bare metal server when you need higher concurrency and long context support.
Why This Topic Matters
70B-class models are a common reference point for enterprise AI because they sit near the boundary where consumer-grade hardware stops being comfortable and infrastructure planning becomes mandatory. Teams that move from pilot projects to internal copilots, support automation, code assistants, or retrieval-augmented generation often discover that the bottleneck is not model quality but deployability. Memory footprint, batch scheduling, and interconnect quality become first-order concerns.
This matters for three reasons. First, model loadability is not the same as model usability. Second, inference cost depends heavily on memory headroom and batching efficiency, not just on vendor marketing claims. Third, enterprise buyers frequently overbuy compute and underbuy memory, then spend months chasing OOM errors and unstable latency. A good infrastructure decision can save far more money than a small token-per-second gain from a faster but undersized GPU.
Technical Background
A 70B-class transformer stores several kinds of memory during inference:
- Model weights — the parameter tensors loaded into GPU memory.
- KV cache — attention state used to generate subsequent tokens efficiently.
- Activation and runtime buffers — temporary workspace used by kernels and inference engines.
- Framework overhead — allocator fragmentation, engine metadata, and host-device staging.
The basic weight-memory formula is straightforward:
Weight memory in bytes ≈ parameter count × bytes per parameter
For a dense 70B parameter model:
- BF16 or FP16: roughly 140 GB for weights alone
- INT8: roughly 70 GB plus metadata and runtime overhead
- 4-bit quantization: roughly 35 GB plus metadata and scaling overhead
That arithmetic explains why the question is never just about the model size. Once you add KV cache, longer prompts, multiple concurrent users, and allocator headroom, a configuration that looks adequate on paper can collapse in production. This is especially true for customer-facing services where one outlier request can exhaust the remaining memory margin.
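The weight arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not a sizing tool: the function name and the precision table are assumptions made for this example, and the quantized figures ignore the metadata and scaling overhead mentioned in the list.

```python
# Bytes per parameter for each precision. Quantized formats also carry
# metadata (scales, zero points) that this sketch deliberately ignores.
BYTES_PER_PARAM = {"bf16": 2.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Return approximate weight memory in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * BYTES_PER_PARAM[precision]
    return total_bytes / 1e9


for precision in ("bf16", "int8", "int4"):
    # Prints roughly 140, 70, and 35 GB for a dense 70B model.
    print(f"{precision}: {weight_memory_gb(70, precision):.0f} GB")
```

The result is a lower bound on what the GPU must hold; KV cache, runtime buffers, and allocator overhead come on top of it.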
Question: Why does context length matter so much?
Answer: Because KV cache grows with both context length and concurrency. A deployment that works at short prompts may fail as soon as users send long documents, extended chat histories, or multiple concurrent requests.
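A rough way to see this growth is to estimate the KV cache size directly. The sketch below uses the standard per-token accounting (one key and one value vector per layer and per KV head). The model shape used in the example (80 layers, 8 KV heads, head dimension 128, 16-bit cache) is an assumption chosen to resemble a 70B-class model with grouped-query attention; check the configuration of your actual checkpoint before relying on the numbers.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int,
    tokens_per_sequence: int,
    sequences: int,
) -> int:
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # The factor 2 accounts for storing both the key and the value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * tokens_per_sequence * sequences


# Assumed 70B-class shape; these values are illustrative, not authoritative.
SHAPE = dict(num_layers=80, num_kv_heads=8, head_dim=128, bytes_per_value=2)

one_user = kv_cache_bytes(**SHAPE, tokens_per_sequence=8192, sequences=1)
sixteen_users = kv_cache_bytes(**SHAPE, tokens_per_sequence=8192, sequences=16)

print(f"1 sequence at 8192 tokens:   {one_user / 1e9:.1f} GB")
print(f"16 sequences at 8192 tokens: {sixteen_users / 1e9:.1f} GB")
```

Under these assumptions the cache scales linearly in both dimensions, so doubling either the context length or the number of concurrent sequences doubles the memory required.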
Industry Context
The market has moved from model experimentation toward infrastructure standardization. Enterprises want private inference for compliance, data protection, and predictable latency. European organizations often add data sovereignty and locality requirements on top of that. At the same time, model context windows have expanded, which increases memory pressure even when weight size stays constant. This has made GPU hosting and dedicated bare metal AI servers more relevant than ever.
For infrastructure providers like INS-CO, this shift puts AI infrastructure, enterprise GPU hosting, dedicated GPU servers, European datacenters, and high-performance bare metal at the center of buyer conversations. The buyers are not asking for generic hosting. They are asking which architecture can support a 70B model reliably, at what cost, and with what operational trade-offs.
Detailed Technical Analysis
1. Weight precision is the first decision
Precision determines whether the model fits and how much room remains for runtime state. BF16 and FP16 preserve quality well, but they are memory intensive. 8-bit quantization roughly halves the weight footprint and is often a practical compromise for inference. 4-bit quantization enables single-GPU deployment in many cases, but quality, throughput, and engine compatibility should be validated on your actual workload.
The right precision depends on the use case:
- BF16/FP16: best when quality fidelity and compatibility matter most
- INT8: good middle ground for many enterprise workloads
- 4-bit: useful for cost-sensitive deployments, internal tools, and shorter context windows
2. KV cache is the hidden cost
Many teams size a GPU by checking whether the model weights fit and stop there. That is a mistake. Inference at scale is usually limited by the combination of weights and KV cache, especially when context windows are long or request concurrency is high. Long-form summarization, document analysis, and retrieval-augmented chat can consume much more memory than short prompts.
A practical rule is to keep significant memory headroom after loading the model. If your deployment uses almost all available memory just to load the weights, it is vulnerable to latency spikes, OOM failures, and reduced batch flexibility.
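One way to make the headroom rule concrete is to ask how many concurrent sequences fit in whatever memory is left after weights and runtime overhead. The sketch below is illustrative only: the function name is made up for this example, and the 5 GB overhead and 2.7 GB per-sequence KV cache figures are assumptions that should be replaced with measured values.

```python
def max_concurrent_sequences(
    gpu_memory_gb: float,
    weights_gb: float,
    overhead_gb: float,
    kv_gb_per_sequence: float,
) -> int:
    """Return how many full-length sequences fit in the remaining memory."""
    free_gb = gpu_memory_gb - weights_gb - overhead_gb
    if free_gb <= 0:
        # The weights and runtime overhead alone exhaust the device.
        return 0
    return int(free_gb // kv_gb_per_sequence)


# Assumed inputs: 80 GB device, 4-bit weights (about 35 GB), 5 GB of runtime
# overhead, and 2.7 GB of KV cache per sequence at the longest context.
print(max_concurrent_sequences(80, 35, 5, 2.7))

# The same device with BF16 weights (about 140 GB) cannot serve anything.
print(max_concurrent_sequences(80, 140, 5, 2.7))
```

Sizing against the longest supported context rather than the average keeps the estimate conservative, which is the point of reserving headroom.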
3. Prefill and decode behave differently
Prefill processes the input prompt and is often compute heavy. Decode generates output tokens one step at a time and is usually more sensitive to memory bandwidth, caching behavior, and batching strategy. A GPU that looks strong in a single synthetic benchmark may underperform in production if it cannot handle the decode phase efficiently under concurrency.
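The bandwidth sensitivity of decode can be approximated with a simple upper bound: each decode step has to stream the model weights from GPU memory once, and all sequences in the batch share that read. The sketch below encodes that bound. It is a rough estimate under stated assumptions: it ignores KV cache reads, kernel efficiency, and communication, and the 2000 GB/s bandwidth figure is a placeholder rather than a specification of any particular GPU.

```python
def decode_tokens_per_second_upper_bound(
    weights_gb: float,
    memory_bandwidth_gb_s: float,
    batch_size: int = 1,
) -> float:
    """Upper bound on decode throughput when weight reads dominate."""
    # One decode step reads every weight once, regardless of batch size.
    steps_per_second = memory_bandwidth_gb_s / weights_gb
    # Each step emits one token per sequence in the batch.
    return steps_per_second * batch_size


# Assumed inputs: 35 GB of 4-bit weights and 2000 GB/s of memory bandwidth.
single = decode_tokens_per_second_upper_bound(35, 2000)
batched = decode_tokens_per_second_upper_bound(35, 2000, batch_size=8)

print(f"batch 1: at most {single:.0f} tokens/s")
print(f"batch 8: at most {batched:.0f} tokens/s in aggregate")
```

The bound shows why batching matters so much for decode economics, and why a benchmark that reports only one throughput figure hides most of the picture.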
Question: Why do some benchmark numbers mislead buyers?
Answer: Because they often report a single average throughput figure without separating prompt prefill, token decoding, batch size, context length, and quantization. Those variables can change the result dramatically.
4. Interconnect quality changes the scaling story
Once you move beyond one GPU, communication becomes a major design factor. Tensor parallelism splits the model across devices, but each device must exchange activations efficiently. NVLink-class interconnects, properly configured PCIe layouts, and low-latency networking matter. Without them, adding GPUs can increase total memory but hurt latency.
That is why many enterprise deployments favor a bare metal node with a carefully designed GPU topology over a loosely specified cloud instance. The infrastructure design is part of the model performance envelope.
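To get a feel for the communication cost, the sketch below estimates how many bytes each GPU moves per decode step under tensor parallelism. It assumes a Megatron-style layout with two all-reduces per transformer layer and a ring all-reduce, where each device transfers 2 × (n − 1) / n times the message size. The hidden size of 8192 and the other inputs are illustrative assumptions, and real engines may use different collectives or fuse operations.

```python
def allreduce_bytes_per_decode_step(
    num_layers: int,
    hidden_size: int,
    bytes_per_value: int,
    batch_size: int,
    tp_degree: int,
) -> float:
    """Estimate per-GPU all-reduce traffic for one decode step."""
    if tp_degree < 2:
        # A single GPU needs no cross-device communication.
        return 0.0
    # During decode each sequence contributes one token of hidden state.
    message_bytes = hidden_size * bytes_per_value * batch_size
    # Ring all-reduce: each device sends and receives 2 * (n - 1) / n messages.
    ring_factor = 2 * (tp_degree - 1) / tp_degree
    # Assumed: two all-reduces per layer (after attention and after the MLP).
    return num_layers * 2 * message_bytes * ring_factor


traffic = allreduce_bytes_per_decode_step(
    num_layers=80, hidden_size=8192, bytes_per_value=2, batch_size=16, tp_degree=2
)
print(f"{traffic / 1e6:.1f} MB per GPU per decode step")
```

The volume per step is modest, but it is split into many small messages, so link latency often matters as much as bandwidth. That is one reason interconnect quality shows up in decode latency rather than only in bulk throughput.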
Architecture Considerations
Single GPU deployment
A single GPU is simplest to operate. It reduces orchestration complexity, avoids cross-GPU communication, and can work well for quantized models with modest traffic. The drawback is limited headroom. If the workload expands, the architecture often needs a redesign rather than a minor configuration change.
Dual GPU deployment
Two GPUs are often the best balance for production inference. They can provide enough memory for BF16 or high-quality quantized setups while keeping the system operationally simpler than larger clusters. With proper tensor parallelism and a good inference engine, dual GPU nodes can serve many enterprise applications well.
Four GPU deployment
A four-GPU server is the workhorse design for more demanding deployments. It is especially useful for long context, higher concurrency, and multi-tenant services. The key benefit is memory aggregation. The key risk is misconfigured parallelism that turns extra hardware into extra complexity without meaningful latency gains.
When a larger-memory single GPU is attractive
High-memory GPUs can be attractive when simplicity matters and the model fits comfortably with sufficient overhead. However, a near-fit is not a safe fit. If the model loads but leaves little room for KV cache, the system will behave poorly under realistic traffic. The practical question is not whether the weights fit, but whether the service remains stable after real user behavior begins.
Infrastructure Recommendations
If you are designing a production environment for 70B-class inference, the most important recommendation is to design for the workload you expect in six months, not the workload you can barely launch today. Start with a representative prompt set, expected concurrency, and the longest context window you will support. Then choose a GPU configuration with at least 20 to 30 percent memory headroom after weights are loaded.
Recommended starting points:
- Cost-sensitive internal use: single 48GB GPU with 4-bit quantization and short-to-moderate context
- Balanced enterprise inference: dual 80GB GPUs with an optimized inference engine
- Higher concurrency or long context: 4-GPU bare metal node with fast interconnect
- Private compliance-sensitive workloads: dedicated GPU servers in a controlled European datacenter or equivalent sovereign environment
For buyers evaluating INS-CO GPU hosting or dedicated bare metal, the strongest default recommendation is reserved capacity rather than oversubscribed instances. Inference traffic is sensitive to noisy neighbors, memory fragmentation, and unpredictable queueing. Reserved infrastructure produces much more stable service characteristics.
Benchmark Analysis
Benchmarking 70B-class inference requires discipline. A useful benchmark reports more than tokens per second. It should include the model version, precision, context length, engine, batch size, prompt distribution, and hardware topology. If two results do not use the same method, they are not directly comparable.
Benchmark dimensions to record:
- Model family and exact checkpoint
- Precision or quantization method
- Prompt length and output length
- Concurrent request count
- Prefill latency
- Decode throughput
- Peak GPU memory usage
- Failure rate and OOM behavior
- Engine used for inference
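The dimensions above can be captured in a small record type so that every run is logged the same way. The sketch below is one possible shape, not a standard schema: the class, field names, and the choice of which fields define the "method" are assumptions for illustration, and the checkpoint and engine names are hypothetical.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class BenchmarkRun:
    checkpoint: str
    precision: str
    engine: str
    prompt_tokens: int
    output_tokens: int
    concurrency: int
    gpu_count: int
    prefill_latency_ms: float
    decode_tokens_per_s: float
    peak_memory_gb: float
    failure_rate: float


# Fields that describe how the test was run, as opposed to what it measured.
METHOD_FIELDS = (
    "checkpoint", "precision", "engine", "prompt_tokens",
    "output_tokens", "concurrency", "gpu_count",
)


def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Two runs are directly comparable only if their method fields match."""
    return all(getattr(a, f) == getattr(b, f) for f in METHOD_FIELDS)


baseline = BenchmarkRun(
    "example-70b", "bf16", "engine-a", 2048, 256, 16, 2, 850.0, 40.0, 150.0, 0.0
)
rerun = replace(baseline, decode_tokens_per_s=42.5)   # same method, new result
quantized = replace(baseline, precision="int4")       # different method

print(comparable(baseline, rerun))
print(comparable(baseline, quantized))
```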
How to interpret public benchmarks:
- Look for repeated workloads, not a single best-case prompt.
- Prefer benchmarks that report p50 and p95 latency.
- Check whether batching is static or continuous.
- Check whether the test included long context or only short prompts.
- Verify whether the result was obtained on a single GPU or with tensor parallelism.
Question: What is the best benchmark metric for enterprise inference?
Answer: p95 latency at representative concurrency, together with memory headroom and failure rate. Tokens per second is useful, but it is not enough by itself.
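For teams computing p50 and p95 from their own request logs, a nearest-rank percentile is a simple and unambiguous definition. The sketch below is a minimal version: the function name is an assumption, and the latency values are made-up sample data rather than measurements.

```python
import math


def percentile_nearest_rank(values: list[float], pct: int) -> float:
    """Return the nearest-rank percentile (pct in 1..100) of the values."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    # Nearest-rank: the smallest value with at least pct percent at or below it.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical end-to-end latencies in milliseconds, including one outlier.
latencies_ms = [410, 395, 430, 402, 2900, 415, 420, 398, 405, 440]

print("p50:", percentile_nearest_rank(latencies_ms, 50))
print("p95:", percentile_nearest_rank(latencies_ms, 95))
```

In this sample the median looks healthy while p95 lands on the outlier, which is the kind of behavior an average throughput figure would hide.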
Benchmark-style fit analysis
| Configuration | Load Fit | Operational Headroom | Best Use |
|---|---|---|---|
| 48GB GPU, 4-bit | Usually yes | Limited | Internal assistants, low concurrency, short context |
| 48GB GPU, 8-bit | No for weights alone (about 70 GB) | N/A | Not suitable without model sharding or offload |
| 80GB GPU, 4-bit | Yes | Moderate | Production inference with modest traffic |
| 80GB GPU, BF16 | No for weights alone | N/A | Not suitable without model sharding |
| 2 x 80GB GPUs, BF16 | Yes with tensor parallelism | Good for moderate context (about 20 GB remains for KV cache and buffers) | Balanced enterprise deployment |
| 141GB GPU, BF16 | Near-fit only | Weak | Not recommended without careful verification |
Comparison Tables
GPU comparison for 70B-class inference
| GPU class | Memory | Strengths | Limitations | Recommended role |
|---|---|---|---|---|
| L40S | 48GB | Good price-performance for quantized inference | Insufficient for BF16 70B; limited headroom | Cost-efficient single-node inference |
| RTX 6000 Ada | 48GB | Useful for dev/test and compact production | Memory ceiling is the main constraint | Light production and validation |
| A100 | 80GB | Strong memory capacity and broad software support | BF16 weights still require sharding | Balanced enterprise deployments |
| H100 | 80GB | High performance and mature inference ecosystem | Cost is high; memory still finite | High-throughput inference nodes |
| H200 | 141GB | Large memory for bigger models and longer context | Still tight for BF16 70B after overhead | High-memory inference and multi-model workloads |
Inference engine comparison
| Engine | Strengths | Trade-offs | Best fit |
|---|---|---|---|
| vLLM | Strong batching, popular ecosystem, flexible deployment | Needs careful tuning for memory and concurrency | General-purpose production inference |
| TensorRT-LLM | Highly optimized NVIDIA path, strong performance potential | More complex build and tuning workflow | Performance-focused NVIDIA environments |
| TGI | Simple operational model, broad adoption | Not always the fastest in every setup | Reliable service-oriented deployments |
| SGLang | Flexible orchestration for agentic and structured workloads | Requires careful architecture decisions | Advanced LLM serving patterns |
Cost Analysis
Cost analysis for inference should be based on total cost of serving, not only GPU hourly price. A cheaper GPU can become expensive if it cannot support enough concurrency or if it forces frequent OOM recovery. Conversely, a higher-end GPU can lower effective token cost if it sustains larger batch sizes and avoids offload penalties.
The useful formula is:
Cost per million output tokens ≈ GPU hourly cost / (3600 × decoded tokens per second) × 1,000,000
Example only, using an illustrative assumption:
- GPU cost: 8 currency units per hour
- Decoded throughput: 40 tokens per second
- Estimated cost per million output tokens: 8 / (3600 × 40) × 1,000,000 ≈ 55.56 currency units
This example is not a market quote. It is a formula demonstration. The point is that throughput matters just as much as hourly price. If a higher-end GPU doubles throughput but costs only modestly more, its token economics may be better than a cheaper card that sits underutilized.
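The same formula is easy to wrap in a helper for comparing configurations. This is a minimal sketch that uses the illustrative figures from the example above; the function name is an assumption, and the second scenario (10 per hour at 80 tokens per second) is a made-up comparison point, not a quote.

```python
def cost_per_million_output_tokens(
    gpu_hourly_cost: float,
    decoded_tokens_per_second: float,
) -> float:
    """Apply the formula: hourly cost / (3600 x tokens per second) x 1,000,000."""
    if decoded_tokens_per_second <= 0:
        raise ValueError("throughput must be positive")
    tokens_per_hour = 3600 * decoded_tokens_per_second
    return gpu_hourly_cost / tokens_per_hour * 1_000_000


# The illustrative example from the text: 8 per hour at 40 tokens per second.
baseline = cost_per_million_output_tokens(8, 40)

# Hypothetical alternative: 25 percent more expensive, twice the throughput.
faster = cost_per_million_output_tokens(10, 80)

print(f"baseline: {baseline:.2f} per million output tokens")
print(f"faster:   {faster:.2f} per million output tokens")
```

Under these assumed numbers the more expensive configuration is cheaper per token, which is the effect the formula is meant to expose.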
Hidden cost factors:
- Power and cooling
- Rack space
- Support and replacement availability
- Data transfer and storage
- Engineering time spent tuning unstable systems
- Downtime cost from OOM failures or degraded latency
For many enterprise teams, a dedicated bare metal deployment in a well-run datacenter is economically superior to a superficially cheaper shared setup. The hardware bill may be higher, but the operational bill is lower.
Decision Matrix
| Need | Best option | Why |
|---|---|---|
| Lowest upfront cost | Single 48GB GPU with 4-bit quantization | Smallest entry footprint, but limited scale |
| Best balance of stability and cost | Dual 80GB GPUs | Enough memory headroom for practical production use |
| Long context and high concurrency | 4-GPU node | Better memory aggregation and batching room |
| Simple deployment with very large memory | High-memory single GPU if workload is tightly controlled | Operational simplicity, but verify headroom carefully |
| Private European hosting | Dedicated bare metal in European datacenter | Supports sovereignty, locality, and stable infrastructure control |
Rule of thumb: If you need a model to serve real users, choose the smallest architecture that leaves enough headroom for the longest prompt and the busiest hour you expect to handle.
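The rule of thumb can be expressed as a small selection routine: add up weights, runtime overhead, and the KV cache needed at the busiest hour, then pick the smallest candidate that covers the total. The sketch below is a simplification under stated assumptions. The function name, the candidate list, and the overhead and KV cache figures are all illustrative, and it treats aggregate memory across GPUs as one pool, which ignores per-device imbalance.

```python
from typing import Optional


def smallest_stable_config(
    candidates: dict[str, float],
    weights_gb: float,
    overhead_gb: float,
    peak_kv_cache_gb: float,
) -> Optional[str]:
    """Return the smallest candidate whose total memory covers peak demand."""
    required_gb = weights_gb + overhead_gb + peak_kv_cache_gb
    fitting = [
        (total_gb, name)
        for name, total_gb in candidates.items()
        if total_gb >= required_gb
    ]
    if not fitting:
        return None
    # min() on (total_gb, name) tuples picks the least total memory.
    return min(fitting)[1]


# Candidate nodes and their aggregate GPU memory in GB (illustrative).
CANDIDATES = {"1 x 80GB": 80, "2 x 80GB": 160, "4 x 80GB": 320}

# BF16 weights (about 140 GB) with an assumed 8 GB of runtime overhead.
print(smallest_stable_config(CANDIDATES, 140, 8, peak_kv_cache_gb=10))
print(smallest_stable_config(CANDIDATES, 140, 8, peak_kv_cache_gb=40))
```

With a small peak KV cache the dual-GPU node is selected; once the peak grows, the same model needs the 4-GPU node, which mirrors the long-context row in the matrix.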
Real-World Deployment Scenarios
Scenario 1: Internal knowledge assistant
An internal copilot for HR, IT, or operations often has moderate concurrency and controlled prompt size. A 4-bit quantized model on a 48GB GPU can be sufficient, provided you limit context growth and validate quality on the organization’s actual documents.
Scenario 2: Customer support automation
Customer-facing systems have less forgiving latency requirements and more variable prompt lengths. Dual 80GB GPUs or a 4-GPU node is usually a safer choice because it can absorb traffic spikes and reduce the risk of degraded user experience.
Scenario 3: Code assistant or agentic workflow
Code generation and tool-using agents can produce longer chains of reasoning and larger contexts. These workloads often benefit from more memory headroom than a simple chat bot. A higher-memory GPU or multi-GPU setup is often justified.
Scenario 4: Sovereign deployment in Europe
When data locality, compliance, or customer trust requires European infrastructure, the architecture must be designed around the datacenter as well as the GPU. INS-CO-style dedicated servers and bare metal GPU hosting are well suited to this requirement because they combine performance control with deployment locality.
Scenario 5: Batch scoring and offline inference
If the workload is not interactive, cost per token may matter more than latency. In that case, a quantized multi-GPU configuration that maximizes throughput can be more economical than a low-latency single-GPU design.
Common Mistakes
- Assuming a model fits because the weights alone fit
- Ignoring KV cache growth at long context lengths
- Comparing benchmark numbers from different batch sizes or engines
- Choosing a GPU based only on peak FLOPS
- Overlooking interconnect limits when using multiple GPUs
- Deploying consumer hardware for always-on enterprise workloads without validating support, thermals, and memory stability
- Underestimating the engineering time required to tune quantization and batching
Best Practices
- Measure on your own prompt distribution before buying hardware
- Keep at least 20 to 30 percent memory headroom after loading the model
- Separate prefill and decode measurements
- Test the worst-case context length, not just the average
- Use a mature inference engine with continuous batching and good memory management
- Prefer reserved bare metal or dedicated GPU hosting for production workloads that need stable latency
- Track failure rates, not just throughput
Expert Recommendations
Recommendation 1: If you are building a production 70B-class service and you need the least amount of operational risk, start with dual 80GB GPUs and a modern inference engine. It is the safest balance between memory, latency, and maintainability.
Recommendation 2: If your workload is truly cost-sensitive and can tolerate tighter constraints, 4-bit quantization on a 48GB GPU is acceptable for internal tools, but only after validation against real prompts and real context lengths.
Recommendation 3: If you need long context, high concurrency, or multiple model variants on the same hardware, choose a larger-memory multi-GPU bare metal node. The extra memory headroom usually matters more than a small theoretical compute advantage.
Recommendation 4: For organizations that want predictable performance, data control, and European locality, dedicated GPU servers or bare metal infrastructure from a provider such as INS-CO are often a better fit than oversubscribed shared environments.
Recommendation 5: Do not buy hardware before you define the service level target. Start with the required latency, concurrency, and context window, then map those requirements to the smallest stable architecture.
FAQ
Can one 80GB GPU run a 70B model?
Usually not in BF16 or FP16. It can be feasible with 8-bit or 4-bit quantization, but the usable range depends on context length and runtime overhead.
Is 4-bit quantization good enough for production?
Sometimes. It is often good enough for internal tools and cost-sensitive use cases, but you should validate output quality, latency, and memory stability against real traffic.
Is more memory more important than more compute?
For 70B-class inference, yes, up to the point where the model and KV cache fit safely. A GPU with more memory and slightly lower raw compute can outperform a faster card that is constantly constrained.
Should I choose cloud GPUs or dedicated bare metal?
If you need stable latency, predictable costs, and control over interconnect and locality, dedicated bare metal is often the stronger choice. Cloud is useful for elasticity and rapid experimentation.
Does H200 eliminate the need for multiple GPUs?
Not generally. Its larger memory helps, but the margin can still be tight for BF16 70B deployments once runtime overhead and KV cache are included.
Which inference engine should I start with?
vLLM is a strong default for general-purpose serving, while TensorRT-LLM is attractive when you want to push NVIDIA-specific optimization further.
Conclusion
The right GPU for 70B-class inference is not the fastest GPU on a spec sheet. It is the GPU, or GPU cluster, that can hold the model with enough headroom to serve real requests reliably. For many enterprise teams, that means moving from a single-card mindset to a memory-first architecture decision. Once you include KV cache, concurrency, and runtime buffers, the value of additional memory becomes obvious.
If the goal is a cost-efficient internal deployment, a quantized single-GPU setup can work. If the goal is a production-grade service with stable latency, the better answer is often a dual 80GB node or a multi-GPU bare metal server. For organizations planning AI infrastructure, GPU hosting, or dedicated enterprise AI platforms, the best decision is the one that balances memory, throughput, and operational control instead of optimizing only for headline specs.
For INS-CO, this topic sits directly at the intersection of AI infrastructure, dedicated GPU servers, European datacenters, and enterprise hosting. That makes it a high-value decision guide for buyers, architects, and technical leaders who need a deployment they can trust.
Suggested Internal Links
- /gpu-hosting
- /dedicated-gpu-servers
- /bare-metal-hosting
- /ai-infrastructure
- /llm-deployment
- /kubernetes
- /hpc-infrastructure
- /knowledge-base/gpu-memory-calculation
- /knowledge-base/kv-cache-explained
- /knowledge-base/quantization-guide
- /knowledge-base/tensor-parallelism-vs-pipeline-parallelism
Suggested INS-CO Service Links
- /services/gpu-hosting
- /services/dedicated-gpu-servers
- /services/bare-metal-servers
- /services/european-datacenter-hosting
- /services/enterprise-ai-infrastructure
- /services/hpc-and-ml-infrastructure
Suggested Knowledge Base Links
- /kb/how-to-size-gpu-memory-for-llm-inference
- /kb/understanding-kv-cache
- /kb/choosing-between-vllm-and-tensorrt-llm
- /kb/quantization-tradeoffs-for-production-llms
- /kb/setting-up-tensor-parallel-inference
- /kb/designing-low-latency-ai-serving
JSON-LD FAQ Schema
{"@context":"https://schema.org","@type":"FAQPage","mainEntity":[{"@type":"Question","name":"Can one 80GB GPU run a 70B model?","acceptedAnswer":{"@type":"Answer","text":"Usually not in BF16 or FP16. It can be feasible with 8-bit or 4-bit quantization, but the usable range depends on context length and runtime overhead."}},{"@type":"Question","name":"Is 4-bit quantization good enough for production?","acceptedAnswer":{"@type":"Answer","text":"Sometimes. It is often good enough for internal tools and cost-sensitive use cases, but you should validate output quality, latency, and memory stability against real traffic."}},{"@type":"Question","name":"What matters more for 70B-class inference, memory or compute?","acceptedAnswer":{"@type":"Answer","text":"Memory is usually the first constraint because weights, KV cache, and runtime buffers must fit safely before throughput becomes the main optimization target."}},{"@type":"Question","name":"Should I choose cloud GPUs or dedicated bare metal?","acceptedAnswer":{"@type":"Answer","text":"If you need stable latency, predictable costs, and control over interconnect and locality, dedicated bare metal is often the stronger choice. Cloud is useful for elasticity and rapid experimentation."}},{"@type":"Question","name":"Which inference engine should I start with?","acceptedAnswer":{"@type":"Answer","text":"vLLM is a strong default for general-purpose serving, while TensorRT-LLM is attractive when you want to push NVIDIA-specific optimization further."}}]}