Symptom
Information is required on the correct and efficient specification and
configuration of Intel/AMD x64 hardware running Windows for SAP ABAP and
SAP Java application server environments. This note also discusses SAP
on Windows in virtualized environments.
Cause
x86-based hardware (from either Intel or AMD) has evolved
rapidly in recent years. Many new technologies and features in Windows
and Intel H/W platforms (hereafter called “Intel”) directly impact the
optimal configuration of SAP systems.
SAP ABAP and SAP Java application servers should be deployed after reviewing the recommendations in this note. The configurations in this SAP Note have been tested and proven by SAP, Microsoft and hardware vendors in lab tests, benchmarks and customer deployments.
More information on SAP standard benchmarks and the term “SAPS” can be found at http://www.sap.com/benchmark/
Resolution
Prior to purchasing new hardware, and when installing and configuring
SAP on Windows in physical or virtual environments, follow the
deployment guidelines in the PDF file attached to this SAP Note.
General
SAP server throughput (as measured by
SAPS) has increased significantly on Intel based server hardware in
recent years. Intel/AMD and OEM hardware manufacturers achieved
performance increases by introducing many new technologies and
concepts. SAP applications require appropriate hardware configurations
and parameterization to achieve the performance and throughput increases
demonstrated in the SAP Standard Application benchmarks. Inappropriate
Intel configurations could cause significant performance problems,
unpredictable performance (sometimes slow) or significantly underperform
relative to SAP Standard Application benchmarks. Provided the concepts
and configurations documented in this note are followed these problems
should not occur.
1. Overview of Modern Intel Server Technologies
1.1. Clock Speed
All SAP work processes other than the Message
Server and Enqueue Server execute logic within a single thread.
The performance of batch jobs in particular, and of other work process
types in general, is largely determined by the latency of database
requests and by the time a SAP work process spends running on a single
Windows CPU thread. SCU (Single Computing Unit – note 1501701) is the
SAP-specific terminology for per-“thread” throughput. SCU is very important in determining the performance of a SAP system.
SAP Standard Application benchmarks have
shown a strong correlation between clock speed (GHz) and SCU on the same
processor architecture. On some Intel servers disabling Hyperthreading
may increase SCU, thereby improving the throughput of a single work
process (e.g. a batch job) but decreasing the total aggregate throughput
of the entire server. If you need to speed up a single transaction or
report, you might try switching off Hyperthreading. The exact performance
increase per thread is dependent on factors beyond the scope of this
note. Please contact Intel for further information on Hyperthreading
and performance. SAP benchmarks on Windows Intel systems have shown
higher SCU on higher clock speed processors.
2 socket servers have significantly higher SCU than 4 or 8 socket servers. Benchmarks show 8 socket Intel servers have 55% lower SAPS/thread than 2 socket as at June 2012.
Some SAP components and some specific SAP processes (see note 1501701)
are particularly sensitive to SCU performance. Hypothetical examples
below show how to calculate SCU performance (which corresponds to per
thread performance on Windows):
Example – Intel Server with Hyperthreading ON:
SAPS = 32,000
H/W configuration = Intel E5 2 processors / 16 cores / 32 threads
SCU = 32,000 / 32 threads = 1,000 SCU SAPS
Same Intel server with Hyperthreading OFF:
SAPS = 22,000
H/W configuration = Intel E5 2 processors / 16 cores / 16 threads
SCU = 22,000 / 16 threads = 1,375 SCU SAPS
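The SCU arithmetic in the examples above is a straight division of total SAPS by CPU threads. A minimal sketch (the function name is illustrative, not an SAP tool):

```python
def scu_saps(total_saps, threads):
    """Per-thread throughput (SCU SAPS) = total SAPS / CPU threads."""
    return total_saps / threads

# Hyperthreading ON: 2 processors x 8 cores x 2 HT threads = 32 threads
print(scu_saps(32_000, 32))  # 1000.0 SCU SAPS

# Hyperthreading OFF: 16 threads; lower total SAPS but higher SCU
print(scu_saps(22_000, 16))  # 1375.0 SCU SAPS
```

Disabling Hyperthreading lowers total throughput (32,000 to 22,000 SAPS) while raising per-thread throughput, which is the trade-off described above.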
Windows Power Saving features can lower
the clock speed when the CPU is idle. Hardware vendors and Microsoft
can provide more information on the optimal energy/performance
configuration.
Additional information about Hyperthreading and SCU on Virtualized systems can be found in note 1246467 - Hyper-V Configuration Guideline and note 1056052 - Windows: VMware vSphere configuration guidelines. Microsoft and VMware provide additional whitepapers and blogs on this topic.
1.2. Multi-core
When discussing performance we need to distinguish between throughput —
for example sales orders per hour or payroll calculations per hour on a
given piece of hardware — and the time it takes to execute a single
elementary operation, such as one payroll calculation or, on the
database side, a single lookup of a row in a table. The throughput a
single server can deliver can be read from SAP benchmarks and the
associated SAPS number; the time to execute an elementary operation is
better expressed by the SCU introduced above. One consequence of moving
a SAP system to a more recent hardware model with more processor cores
can be the need to adjust the configuration or the number of SAP
instances running on a single server in order to leverage the increased
number of CPU cores. On the SAP application side, the scale-out provided
by the SAP application layer allows high flexibility to leverage
hardware with a high SCU, whereas on the DBMS side the focus when
selecting a server is more often on throughput and the ability to
execute as many requests as possible in parallel.
1.3. Large Physical Memory
Windows Zero Memory Management is generally recommended and is documented in note 88416. SAP generally recommends against huge ABAP or Java instances as documented in note 9942.
A 2 socket server is very powerful and a single instance with around 50
work processes is unlikely to leverage the CPU power of the H/W.
Increasing the number of work processes (beyond about 50) and users on a
single instance may not linearly improve throughput. For example, three
ABAP instances each with 50 work processes have shown much better
performance than one ABAP instance with 150 work processes.
Installing multiple ABAP or Java instances on a single physical server will allow the H/W resources to be fully leveraged.
Solution: install multiple smaller ABAP instances per physical server,
balance the workload with SAP Logon Load Balancing, and keep the
instance configurations identical by setting most parameters in the
default profile (default.pfl).
In general use Windows Zero Administration Memory Management. Remove the profile parameters listed in note 88416 and set only the PHYS_MEMSIZE. ZAMM parameters will be automatically calculated correctly based on the value for PHYS_MEMSIZE.
Suggested Profile Parameters for ABAP instances sharing the same H/W and operating system:
PHYS_MEMSIZE | physical RAM / number of instances + small amount for operating system
em/max_size_MB | ZAMM default = 1.5 x PHYS_MEMSIZE*
abap/heap_area_dia | 2GB (2000000000) or slightly higher
abap/heap_area_nondia | 0 (up to max value of abap/heap_area_total)
abap/heap_area_total | ZAMM default = PHYS_MEMSIZE
*As of 720_EXT downwards compatible kernel patch 315 or higher
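The PHYS_MEMSIZE rule in the table can be sketched as follows. The 4 GB operating system reserve and the helper name are illustrative assumptions for this sketch, not SAP defaults:

```python
def phys_memsize_mb(total_ram_gb, instances, os_reserve_gb=4):
    """Suggested PHYS_MEMSIZE per ABAP instance sharing one server:
    split physical RAM evenly across instances after reserving a small
    amount for the operating system (4 GB here is an assumption)."""
    return (total_ram_gb - os_reserve_gb) * 1024 // instances

# 256 GB server hosting three identically configured ABAP instances
print(phys_memsize_mb(256, 3))  # 86016 MB per instance
```

With Zero Administration Memory Management, the remaining ZAMM parameters are then derived automatically from this single value.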
The attached PDF file contains sample configurations
1.4. NUMA
Non-Uniform Memory Access (NUMA) directly
impacts the performance of SAP ABAP application servers. The SAP
Kernel for Windows is single threaded and does not contain NUMA handling
logic to localize memory storage for a specific process to a specific
NUMA node.
Performance will therefore be maximized
on high clock speed processors with the least number of NUMA nodes.
These conditions are both met on 2 socket commodity Intel systems.
Local memory access times are very fast
on NUMA based systems because the memory controller is directly
connected to one processor. Remote memory access is many times slower
than local. The calculation of local versus remote memory access for
SAP application instances is a simple mathematical formula:
2 socket = 50% chance of a local NUMA node access
4 socket = 25% chance of a local NUMA node access
8 socket = 12.5% chance of a local NUMA node access
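These probabilities follow from memory being distributed evenly across the NUMA nodes: without NUMA-aware placement, a random access lands on the local node with probability 1/sockets. A minimal sketch (the helper name is illustrative):

```python
def local_access_probability(sockets):
    """Chance that a random memory access hits the local NUMA node,
    assuming memory is spread evenly and placement is not NUMA aware."""
    return 1.0 / sockets

for sockets in (2, 4, 8):
    print(f"{sockets} socket: {local_access_probability(sockets):.1%} local")
# 2 socket: 50.0% local
# 4 socket: 25.0% local
# 8 socket: 12.5% local
```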
2 socket Intel commodity servers have a
higher clock speed and better NUMA characteristics and are therefore
suitable for SAP application servers. Excessive remote memory accesses
on 8 socket or higher servers running SAP ABAP instances will adversely
impact performance. This can occur with or without virtualization.
Virtualization software does not prevent NUMA induced latencies nor
change the physical structure of the processor/memory layout. Modern
virtualization software may avoid remote memory communication if a
Virtual Machine is equal to or smaller than the resources of one NUMA
node.
RDBMS software from Microsoft, IBM, Oracle and SAP is NUMA aware as of
current releases. NUMA-aware RDBMS software will attempt to keep memory
structures local and avoid remote memory access. Modern DBMS software
has demonstrated very good scalability on 8 socket or higher Intel
servers.
1.5. Processor Groups (K-Groups)
Windows 2008 R2 and higher introduced a
concept called “Processor Groups”. Processor groups are required to
address > 64 threads. See SAP Note 1635387 - Windows Processor Groups.
Processor Groups are required on most 4 socket servers (4 socket * 10 core * hyperthreading = 80 threads)
Applications and DBMS software must be Processor Group aware, otherwise
the maximum number of threads the application or DBMS can address is
limited to 64. Performance will then be somewhat less than the H/W
capability. See section 2 of this note for further information.
Current status (August 2012)
- SAP Kernel = no automatic processor group handling – see note 1635387
- SQL Server 2008 R2 and higher = processor group aware
- Oracle 11g = processor group support planned with patch 11.2.0.4
- Other DBMS = check with DBMS vendor for support status (MaxDB/Livecache, DB2, Sybase etc)
1.6. Performance Bottlenecks
1.6.1. Network
SAP 3 tier configurations require a very
high performance, low latency and 100% reliable network connection
between the SAP application server(s), the message server and the
database.
Large or busy systems strongly benefit from:
- 10 Gigabit network
- A separate network for SAP application servers to communicate with the RDBMS
- Offload, SR-IOV, VM-FEX and parallelism features built into modern network cards and drivers. TCPIP v4 and v6 offload and Receive Side Scaling have been tested by Microsoft, HP and other vendors. Contact Microsoft and/or H/W vendor for recommended NIC and drivers
- TCPIP Loopback (127.0.0.1) communication is single threaded and is unable to be distributed over multiple threads with technologies such as RSS. Some RDBMS and SAP instances may attempt to use loopback rather than shared memory by default
The attached PDF file contains links with additional information about network topologies and configuration for Intel systems
1.6.2. Memory
SAP and DBMS performance testing and
customer deployments have shown that RAM is a determining factor in
scalability. A modern Intel or AMD system with insufficient memory will
be unable to run efficiently or achieve peak throughput. SAP benchmarks
provide an indication of the appropriate amount of RAM for a particular
hardware configuration. Customers should use the H/W configurations
published on the SAP benchmark website as guidance for how much RAM to
specify. SAP Quicksizer also provides some guidance. As of August 2012
the minimum RAM for a 2 socket Intel server should be 128GB.
1.6.3. Insufficient IO Performance
IO can be a significant performance
bottleneck. Common causes are insufficient LUNs, one LUN presented to
Hyper-Visor partitioned into multiple drive letters, insufficient HBAs,
incorrectly configured MPIO software. Microsoft and SAN vendors can
provide additional information on optimal IO configurations.
1.7. Energy Consumption
Customers are encouraged to compare the
energy consumption of different H/W configurations. In most cases it is
observed that 8 socket systems use proportionately more energy than 2
socket systems.
2. Summary of Physical Hardware Configurations
SAP benchmarks show several clear trends:
- Total SAPS on Intel servers has increased significantly in recent years
- A substantial increase in SAPS per core on 2 socket Intel server and somewhat lesser increase on 4 socket and 8 socket Intel servers
- A significant but more moderate increase in SAPS per CPU thread. Increase in SAPS per CPU thread (SCU) is most significant on 2 socket Intel servers
- Total number of cores and threads has increased dramatically. Servers with 12 to 80 core and 24 to 160 threads are available from most H/W vendors as at 2012.
OS Limitations:
Windows 2012 supports up to 640 threads* and 4TB RAM
Windows 2008 R2 supports up to 256 threads and 2TB RAM
Windows 2008 supports 64 threads and 2TB RAM
Hyper-V 3.0 (Windows 2012) supports 64 vCPU & 1TB RAM per Virtual Machine
Hyper-V 2.0 (Windows 2008 R2) supports 4 vCPU & 64GB RAM per Virtual Machine
VMware vSphere 4.x supports 8 vCPU and 255GB RAM per Virtual Machine
VMware vSphere 5.0 supports 32 vCPU and 1TB RAM per Virtual Machine
VMware vSphere 5.1 supports 64 vCPU and 1TB RAM per Virtual Machine
*thread = Sockets x cores per processor x Hyperthreading.
4 socket x 10 core Intel server with Hyperthreading = 80 threads (e.g. HP DL580 G7, Dell R910)
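The thread formula in the footnote above can be sketched as (the helper name is illustrative):

```python
def total_threads(sockets, cores_per_socket, hyperthreading=True):
    """OS-visible threads = sockets x cores per processor x (2 if HT)."""
    return sockets * cores_per_socket * (2 if hyperthreading else 1)

# 4 socket x 10 core Intel server (e.g. HP DL580 G7, Dell R910)
print(total_threads(4, 10))         # 80 -> exceeds 64, K-Groups required
print(total_threads(4, 10, False))  # 40 -> fits within a 64-thread group
```

This is why disabling Hyperthreading on such a 4 socket server brings the thread count back under the 64-thread Processor Group limit.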
Balanced configurations tested and deployed at customer sites:
- #1: 2 socket Intel or AMD = 32,000-42,000 SAPS, 128-384GB RAM, 10G network and 2 x Dual Port HBA
- #2: 4 socket Intel or AMD = 62,000-75,000 SAPS, 512GB-1TB RAM, 10G network and 2-4 Dual Port HBA
- #3: 8 socket Intel = 130,000-140,000 SAPS, 1TB RAM or more, 10G network and 4-8 Dual Port HBA
SAP ABAP & Java servers and DBMS software will perform well on configuration #1.
#1 has generally demonstrated best performance with simple configuration and tuning for most SAP applications relative to #2 and #3.
#2 may also be possible for ABAP & Java servers, though performance will not be as good as expected and additional configuration is required.
Configuration #3 requires special expert
configuration and tuning to run SAP application servers or DBMS together
with SAP application servers (with or without virtualization). The SAP
ASCS/SCS is not a full application server and can run without problems
on configuration #1, #2 or #3.
Configurations #1, #2 and #3 are suitable
for modern DBMS software and will deliver nearly linear scalability
with addition of CPU sockets. DBMS software running on 2, 4 and 8 socket
servers with large amounts of memory will achieve very good scalability
and performance without the need for complex configurations and
tuning.
SAP application server
installation, configuration and tuning on 2 socket servers is simple and
largely automatic. 2 socket servers have a high clock speed, high SAPS
per thread/SCU, efficient energy consumption and demonstrate good NUMA
characteristics.
If SAP sizing indicates additional
capacity in excess of a 2 socket server (currently 32,000 - 42,000 SAPS)
is required for DBMS layer, if additional availability &
reliability features are required or if many databases are consolidated
onto a single server/cluster then select 4 socket or higher servers.
Note: on average 10-30% of the total SAPS
resources is consumed by the DBMS layer. SAP application servers
typically consume 70%+ of the overall CPU resources of most customer
systems. The SAP application server layer should be scaled out horizontally on 2 socket commodity servers or Virtual Machines. The PDF file attached to this note demonstrates examples.
3. Configuration of SAP ABAP Server on 4 socket & 8 or higher socket servers
3.1. Guidelines for running SAP ABAP server on 4 socket systems
4 socket servers can be configured to run
SAP application servers if 2 socket servers are unavailable. Additional
configuration and tuning is required. Knowledge of modern Processor
technologies, NUMA, K-Groups and SAP profile parameters is required.
Recommended Configuration Steps:
- If possible disable HyperThreading to reduce the total threads to below 64. All threads will be in one K-Group. Performance will be significantly reduced.
- If Hyperthreading remains enabled AND there are more than 64 CPUs, implement Microsoft KB 2510206 as per note 1635387. This forces the Windows OS to create evenly sized K-Groups/Processor Groups.
- Implement NUMA affinity as detailed in SAP note 1667863
- Determine the amount of local memory per NUMA node(s) and size the SAP instance accordingly
- If Virtualization is configured on 4 socket systems please consult the Hypervisor vendor for further information, guidance and best practices regarding the configuration of non-NUMA aware applications on VMs
3.2. Guidelines for running SAP ABAP server on 8+ socket systems
Complex configuration and tuning is
required to achieve good, stable and predictable performance on SAP
application servers or DBMS together with SAP application servers (with
or without virtualization) on 8 socket or higher servers.
SAP is unable to provide generalized documentation regarding 8 socket or higher configurations because:
- Some hardware architectures only provide 4 QPI/HyperTransport links. H/W configurations with > 4 sockets require specialized Hubs/Node controllers. The implementation of > 4 socket servers differs significantly between the various hardware vendors
- Disabling hyperthreading is often insufficient to reduce the number of threads below 64, therefore K-Group configuration is generally required
- Placement of PCI HBA, NIC or SSD cards into an inappropriate PCI slot can have a dramatic impact on performance on some 8 socket systems
- The impact of device drivers, some backup software and some Anti-Virus software that was not designed for K-Groups, 8 socket servers with OEM-designed Hubs/Node controllers and NUMA architectures is likely to be pronounced and significant
- Remote memory accesses are vastly more probable on 8 socket systems
- Total physical memory will be (most often evenly) distributed over 8 sockets which may lead to very little local memory per NUMA node
Hardware vendors are responsible for the
specification, implementation, configuration and performance support of
SAP application servers on 8 socket servers. Poor SAP application server
performance on 8 socket servers should be referred to the hardware
vendor. Configuration, tuning and performance support of SAP application servers on 8 socket servers requires a Consulting engagement.
8 socket or higher servers offer
excellent performance, reliability and scalability for DBMS software (or
other software that is NUMA aware). Typically there would be no need
to engage expert consulting to install, configure and tune DBMS software
on 8 socket servers. It is generally recommended to obtain the latest
“Best Practices” deployment guides from the relevant hardware vendor.
Hardware vendors will often provide a deployment guide for each specific
DBMS. Standard readily available documentation is sufficient to deploy
DBMS software on large 8 socket or higher systems. Provided only DBMS and (A)SCS software is installed the SAP Support procedures for 2, 4 or 8 socket servers are the same.
4. Virtualization
4.1. Virtual platforms supported for Windows
Windows Hyper-V and VMware vSphere are both supported for SAP and documented in note 1409608
4.2. Virtual CPU (vCPU)
Hyper-V and VMware vSphere both map each
individual vCPU to one physical core/thread. Hypervisors will try to
run all vCPU on the same physical processor. This is only possible if
the number of vCPU is equal to or less than the number of cores on a
physical processor. A server with 2 processors each with 8 cores would
be able to run 8 vCPU on a single processor. If the number of vCPU was
increased to 12, the Hypervisor will run the VM across both processors.
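The fit check described above reduces to a simple comparison. A hedged sketch (the helper name is an assumption, and real hypervisor placement involves more factors than this):

```python
def fits_one_numa_node(vcpus, cores_per_socket):
    """A VM can be kept on a single processor/NUMA node only if its
    vCPU count does not exceed the cores of one physical socket."""
    return vcpus <= cores_per_socket

# 2 processors x 8 cores each
print(fits_one_numa_node(8, 8))   # True  -> VM stays on one processor
print(fits_one_numa_node(12, 8))  # False -> VM spans both processors
```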
Hypervisors can automatically “relocate” a
VM from a busy processor socket to another processor that is not so
busy. The process of moving a VM from one NUMA node to another will
eventually require copying the entire memory context across the QPI or
Hyper-Transport (AMD) links. Frequent VM relocations are likely to
impact overall system performance and impact the predictability of
performance (sometimes a VM will run slowly then after a migration to
another NUMA node run fast).
4.3. Virtual RAM + Virtual NUMA (vRAM)
Hypervisors allocate vRAM to physical
RAM. SAP systems should not “overcommit” meaning the vRAM should be
equal to or less than physical RAM for Production systems.
If vRAM is larger than the physical RAM
connected to one NUMA node or if the number of vCPU exceeds the number
of cores on a single processor, the Virtual Machine will be performing
Remote NUMA memory access (which is many times slower than Local
access).
Hyper-V 2.0 and VMware vSphere 4.x did
not provide NUMA information to the Virtual Machines. RDBMS software
performance was therefore significantly decreased if vRAM > local
NUMA memory or vCPU > cores on one single processor.
Hyper-V 3.0 and VMware vSphere 5.0 do
provide NUMA topology information to the Virtual Machine. Both Hyper-V
3.0 and VMWare 5.0 greatly improve the alignment between VMs, the NUMA
node, the vCPU and local memory.
The amount of Local NUMA memory
(therefore the maximum vRAM before remote access occurs) is a function
of Total RAM and number of processors.
- 2 socket with 8 cores each and 128GB RAM. Each processor has 8 cores and 64GB local memory directly connected to one processor + 64GB remote memory.
- 4 socket with 8 cores each and 128GB RAM. Each processor has 8 cores and 32GB local memory directly connected to one processor + 96GB remote memory.
- 8 socket with 8 cores each and 128GB RAM. Each processor has 8 cores and 16GB local memory directly connected to one processor + 112GB remote memory.
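The local/remote split in the examples above is an even division of total RAM across sockets; everything outside a node's share is remote to it. A sketch (illustrative helper, assuming evenly distributed memory):

```python
def numa_memory_split_gb(total_ram_gb, sockets):
    """Local memory per NUMA node = total RAM / sockets, assuming RAM
    is distributed evenly; the remainder is remote to that node."""
    local = total_ram_gb / sockets
    return local, total_ram_gb - local

for sockets in (2, 4, 8):
    local, remote = numa_memory_split_gb(128, sockets)
    print(f"{sockets} socket: {local:.0f}GB local / {remote:.0f}GB remote")
# 2 socket: 64GB local / 64GB remote
# 4 socket: 32GB local / 96GB remote
# 8 socket: 16GB local / 112GB remote
```

This is why vRAM larger than one node's local share forces remote NUMA access on larger socket counts.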
Partitioning large 4 socket or 8 socket
servers into many Virtual Machines is unlikely to achieve good
predictable and stable performance without expert knowledge and
configuration. 2 socket servers with large amounts of physical memory
(up to 768GB as of 2012) have shown consistent and predictable results
running virtual workloads. The configuration and operation of 2 socket
servers with large amounts of RAM is relatively simple. Virtualization
vendors provide additional documentation and recommendations on NUMA
configurations and best practices.
Virtualization software does not prevent
NUMA induced latencies nor change the physical structure of the
processor/memory layout. Modern Virtualization software may avoid remote
memory communication if a Virtual Machine is equal to or smaller than
the resources of one NUMA node.
Author:
Cameron Gardiner, Microsoft Corporation
Contact Person for questions and comments on this article:
cgardin@microsoft.com
Reviewer:
Karl-Heinz Hochmuth, SAP AG
Bernd Lober, SAP AG
Matthias Schlarb, VMware Global Inc.
Peter Simon, SAP AG
Jürgen Thomas, Microsoft Corporation
Keywords
Intel, AMD, NUMA, local memory, remote memory, single threaded, Zero
Memory Management, ZMM, PHYS_MEMSIZE, em/max_size_MB, vCPU, vRAM, x64,
64 bit, sizing, Wintel, per thread performance, Multi-SID,
consolidation, QPI, Hyperthreading, abap/heap_area, energy
Header Data
Released On | 07.01.2013 09:01:41
Release Status | Released to Customer
Component | BC-OP-NT Windows
Priority | Normal
Category | How To
Operating System |
Product
This document is not restricted to a product or product version
Attachments