DevTech101

Best Practices – Top Ten Tuning Tips Updated

This post is one of a series of "best practices" notes for Oracle VM Server for SPARC (formerly called Logical Domains). This is an update to a previous entry on the same topic.

Top Ten Tuning Tips – Updated

Oracle VM Server for SPARC is a high performance virtualization technology for SPARC servers. It provides native CPU performance without the virtualization overhead typical of hypervisors. The way memory and CPU resources are assigned to domains avoids problems often seen in other virtual machine environments, and there are intentionally few "tuning knobs" to adjust.

However, there are best practices that can enhance or ensure performance. This blog post lists and briefly explains performance tips and best practices that should be used in most environments. Detailed instructions are in the Oracle VM Server for SPARC Administration Guide. Other important information is in the Release Notes. (The Oracle VM Server for SPARC documentation home page is here.)

Big Rules / General Advice

Some important notes first:

  1. "Best practices" may not apply to every situation. There are often exceptions or trade-offs to consider. We'll mention them so you can make informed decisions. Please evaluate these practices in the context of your requirements. There is no one "best way", since there is no single solution that is optimal for all workloads, platforms, and requirements.
  2. Best practices, and "rules of thumb" change over time as technology changes. What may be "best" at one time may not be the best answer later as features are added or enhanced.
  3. Continuously measure, and tune and allocate resources to meet service level objectives. Once objectives are met, do something else – it's rarely worth trying to squeeze the last bit of performance when performance objectives have been achieved.
  4. Standard Solaris tools and tuning apply in a domain or virtual machine just as on bare metal: the *stat tools, DTrace, driver options, TCP window sizing, /etc/system settings, and so on, apply here as well.
  5. The answer to many performance questions is "it depends". Your mileage may vary. In other words: there are few fixed "rules" that say how much performance boost you'll achieve from a given practice.

Despite these disclaimers, there is advice that can be valuable for providing performance and availability:

The Tips

  1. Keep firmware, Logical Domains Manager, and Solaris up to date – Performance enhancements are continually added to Oracle VM Server for SPARC, so staying current is important. For example, Oracle VM Server for SPARC 3.1 and 3.1.1 both added important performance enhancements.

    That also means keeping firmware current. Firmware is easy to "install once and forget", but it contains much of the logical domains infrastructure, so it should be kept current too. The Release Notes list minimum and recommended firmware and software levels needed for each platform.

    Some enhancements improve performance automatically just by installing the new versions. Others require administrators to configure and enable new features; the following items mention them where relevant.
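
    A quick way to check what you are currently running, from the control domain (a sketch assuming a Solaris 11 control domain; package names can vary):

      # Show Logical Domains Manager, hypervisor, and system firmware versions
      ldm -V

      # Show the installed Logical Domains Manager package (Solaris 11 IPS)
      pkg list ldomsmanager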

  2. Allocate sufficient CPU and memory resources to each domain, especially control, I/O and service domains – This cannot be overemphasized. If a service domain is short on CPU, then all of its clients are delayed. Don't starve service domains!

    For the control domain and other service domains, use at least 1 core (8 vCPUs) and 4GB to 8GB of memory for small workloads. Use two cores and 16GB of RAM if there is substantial I/O load, and be prepared to allocate more resources as needed. Don't think of this as "waste": to a large extent it represents the CPU load needed to drive physical devices, shifted from the guest domains to the service domain.

    Actual requirements must be based on system load: small CPU and memory allocations were appropriate for older, smaller LDoms-capable systems, but larger values are better choices for the demanding, higher-scale systems and applications now used with domains. Today's faster CPUs and I/O devices can generate much higher I/O rates than older systems, and service domains must be provisioned to support that load. Control domain sizing suitable for a T2000 or T5220 will not be enough for a T5-8 or an M6-32! I/O devices matter too: a 10GbE network device driven at line speed can consume an entire CPU core, so add another core to drive it.

    How can you tell if you need more resources in the service domain? Within the domain you can use vmstat, mpstat, and prstat to see if there is pent-up demand for CPU. Alternatively, issue ldm list or ldm list -l from the control domain. If you consistently see high CPU utilization, add more CPU cores (a sketch of this workflow appears at the end of this tip). You might not be observing peak loads, so add capacity proactively.

    Good news: you can dynamically add and remove CPUs to meet changing load conditions, even for the control domain. You should leave some headroom on the server so you can allocate resources as needed. Tip: Rather than leave "extra" CPU cores unassigned, just give them to the service domains. They'll make use of them if needed, and you can remove them later if that capacity is needed for another domain.

    You can allocate CPU resources manually via ldm set-core or automatically with the built-in policy-based resource manager. The resource manager is a best practice of its own, especially if you have guest domains with peak and idle periods.

    The same applies to memory. Again, the good news is that standard Solaris tools like vmstat can be used to see if a domain is low on memory, and memory can also be added to or removed from a domain. Applications need the same amount of RAM to run efficiently in a domain as they do on bare metal, so no guesswork or fudge-factor is required. Logical domains do not oversubscribe memory, which avoids problems like unpredictable thrashing.

    In summary, add another core if ldm list shows that the control domain is busy. Add more RAM if you are hosting lots of virtual devices or running agents, management software, or applications in the control domain and vmstat -p shows that you are short on memory. Both can be done dynamically without an outage.
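
    A minimal sketch of that workflow (assuming the control domain has the default name "primary" and that spare cores and memory are available on the server):

      # From the control domain, check CPU utilization of all domains
      ldm list

      # Inside a busy service domain, confirm there is pent-up CPU or memory demand
      mpstat 5
      vmstat -p 5

      # Dynamically grow the control domain: one more core and 8GB more RAM
      ldm add-core 1 primary
      ldm add-memory 8G primary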

  3. Allocate domains on core boundaries – SPARC servers supporting logical domains have multiple CPU cores with 8 CPU threads each. (The exception is that Fujitsu M10 SPARC servers have 2 CPU threads per core. The considerations are similar, just substitute "2" for "8" as needed.) Avoid "split core" situations in which CPU cores are shared by more than one domain (different domains with CPU threads on the same core). This can reduce performance by causing "false cache sharing" in which domains compete for a core's Level 1 cache. The impact on performance is highly variable, depending on the domains' behavior.

    Split core situations are easily avoided by always assigning virtual CPUs in multiples of 8 (ldm set-vcpu 8 mydomain or ldm add-vcpu 24 mydomain). It is rarely good practice to give tiny allocations of 1 or 2 virtual CPUs, and definitely not for production workloads. If fine-grain CPU granularity is needed for multiple applications, deploy them in zones within a logical domain for sub-core resource control.

    The best method is to use the whole core constraint to assign CPU resources in increments of entire cores (ldm set-core 1 mydomain or ldm add-core 3 mydomain). The whole-core constraint requires a domain be given its own cores, or the bind operation will fail. This prevents unnoticed sub-optimal configurations, and also enables the critical thread optimization discussed below in the section Single Thread Performance.

    In most cases the logical domain manager avoids split-core situations even if you allocate fewer than 8 virtual CPUs to a domain. The manager attempts to allocate different cores to different domains even when partial core allocations are used. It is not always possible, though, so the best practice is to allocate entire cores.

    For a slightly lengthier writeup, see Best Practices – Core allocation.
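
    A short sketch of whole-core allocation (the domain name "mydomain" is illustrative):

      # Constrain the domain to whole cores and give it two full cores (16 vCPUs)
      ldm set-core 2 mydomain

      # After the domain is bound, verify which physical cores it received
      ldm list -o core mydomain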

  4. Use Solaris 11 in the control and service domains – Solaris 11 contains functional and performance improvements over Solaris 10 (some will be mentioned below), and will be where future enhancements are made. It is also required to use Oracle VM Manager with SPARC. Guest domains can be a mixture of Solaris 10 and Solaris 11, so there is no problem doing "mix and match" regardless of which version of Solaris is used in the control domain. It is a best practice to deploy Solaris 11 in the control domain even if you haven't upgraded the domains running applications.
  5. NUMA latency – Servers with more than one CPU socket, such as a T4-4, have non-uniform memory access (NUMA) latency between CPUs and RAM. "Local" memory access from CPUs on the same socket has lower latency than "remote". This can have an effect on applications, especially those with large memory footprints that do not fit in cache, or are otherwise sensitive to memory latency.

    Starting with release 3.0, the logical domains manager attempts to bind domains to CPU cores and RAM locations on the same CPU socket, making all memory references local. If this is not possible because of the domain's size or prior core assignments, the domain manager tries to distribute CPU cores and RAM equally across sockets to prevent an unbalanced configuration. This optimization is done automatically at domain bind time, so subsequent reallocation of CPUs and memory may not be optimal. Keep in mind that this does not apply to single board servers, like a T4-1. In many cases, the best practice is to do nothing special.

    To further reduce the likelihood of NUMA latency, size domains so they don't unnecessarily span multiple sockets. This is unavoidable for very large domains that need more CPU cores or RAM than are available on a single socket, of course.

    If you must control this for the most stringent performance requirements, you can use "named resources" to allocate specific CPU and memory resources to the domain, using commands like ldm add-core cid=3 ldm1 and ldm add-mem mblock=PA-start:size ldm1. This technique is successfully used in the SPARC Supercluster engineered system, which is rigorously tested on a fixed number of configurations. This should be avoided in general purpose environments unless you are certain of your requirements and configuration, because it requires model-specific knowledge of CPU and memory topology, and increases administrative overhead.
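
    To see where a bound domain's cores and memory actually landed (a sketch; how core IDs and physical addresses map to sockets is model specific):

      # Show the physical core IDs and memory blocks bound to the domain
      ldm list -o core,memory mydomain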

  6. Single thread CPU performance – Starting with the T4 processor, SPARC servers can use a critical threading mode that delivers the highest single thread performance. This mode uses out-of-order (OOO) execution and dedicates all of a core's pipeline and cache resources to a software thread. Depending on the application, this can be several times faster than the normal "throughput mode".

    Solaris will generally detect threads that will benefit from this mode and "do the right thing" with little or no administrative effort, whether in a domain or not. To explicitly set this for an application, set its scheduling class to FX with a priority of 60 or more. Several Oracle applications, like Oracle Database, automatically leverage this capability to get performance benefits not available on other platforms, as described in the section "Optimization #2: Critical Threads" in How Oracle Solaris Makes Oracle Database Fast. That's a serious example of the benefits of the combined software/hardware stack's synergy. An excellent writeup can be found in Critical Threads Optimization in the Observatory blog.

    This doesn't require setup at the logical domain level other than to use whole-core allocation, and to provide enough CPU cores so Solaris can dedicate a core to its critical applications. Consider that a domain with one full core or less cannot dedicate a core to 1 CPU thread, as it has other threads to dispatch. The chances of having enough cores to provide dedicated resources to critical threads get better as more cores are added to the domain, and this works best in domains with 4 or more cores. Other than that, there is little you need to do to enable this powerful capability of SPARC systems (tip of the hat to Bob Netherton for enlightening me on this area).

    Mentioned for completeness' sake: there is also a deprecated command to control this at the domain level by using ldm set-domain threading=max-ipc mydomain, but this is generally unnecessary and should not be done.
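
    Returning to the FX-class approach mentioned above, a minimal sketch of placing an existing process at FX priority 60 (the PID 1234 is only a placeholder):

      # Move process 1234 into the fixed-priority class at priority 60, making its
      # threads candidates for the critical threads optimization
      priocntl -s -c FX -m 60 -p 60 -i pid 1234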

  7. Live Migration – Live migration is CPU intensive in the control domain of the source (sending) host. You must assign at least 1 core to the control domain in all cases, but an additional core will speed migration and reduce suspend time. The core can be added just before starting migration and removed afterwards. If the machine is older than a T4, add crypto accelerators to the control domains. No such step is needed on later machines.

    Live migration also adds CPU load in the domain being migrated, so it's best to perform migrations during low activity periods. Guests that heavily modify their memory take more time to migrate since memory contents have to be retransmitted, possibly several times. The overhead of tracking changed pages also increases guest CPU utilization.

    Remember that live migration is not the answer to all questions. Some other platforms lack the ability to update system software without an outage, so they require "evacuating" the server via live migration. With Oracle VM Server for SPARC you should always have an alternate service domain for production systems, and then you can do "rolling upgrades" in place without having to evacuate the box. For example, you can pkg update Solaris in both the control domain and the service domains at the same time during normal operational hours, and then reboot them one at a time into the new Solaris level. While one service domain reboots, all I/O proceeds through the alternate, and you can cycle through all the service domains without any loss in application availability. Oracle VM Server for SPARC reduces the number of use cases in which live migration is the only answer.
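
    A sketch of the "extra core for migration" workflow (host and domain names are illustrative):

      # Temporarily give the control domain an extra core to speed up the migration
      ldm add-core 1 primary

      # Live migrate the guest to the target host
      ldm migrate-domain mydomain root@target-host

      # Return the extra core once the migration completes
      ldm remove-core 1 primary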

  8. Network I/O – Configure aggregates, use multiple network links, use jumbo frames, and adjust TCP windows and other system settings the same way and for the same reasons as you would in a non-virtual environment.

    Use RxDring support to substantially reduce network latency and CPU utilization. To turn this on, issue ldm set-domain extended-mapin-space=on mydomain for each of the involved domains (see the sketch at the end of this tip). The domains must run Solaris 11 or Solaris 10 update 10 or later, and the involved domains (including the control domain) will require a reboot for the change to take effect. This also requires 4MB of RAM per guest.

    If you are using a Solaris 10 control or service domain for virtual network I/O, then it is important to plumb the virtual switch (vsw) as the network interface and not use the native NIC or aggregate (aggr) interface. If the native NIC or aggr interface is plumbed, there can be a performance impact since each packet may be duplicated to provide a packet to each client of the physical hardware. Avoid this by not plumbing the NIC and only plumbing the vsw. The vsw doesn't need to be plumbed either unless the guest domains need to communicate with the service domain. This isn't an issue for Solaris 11 – another reason to use that in the service domain. (Thanks to Raghuram for a great tip.)

    As an alternative to virtual network I/O, use Direct I/O (DIO) or Single Root I/O Virtualization (SR-IOV) to provide native-level network I/O performance. With physical I/O, there is no virtualization overhead at all, which improves bandwidth and latency and eliminates load in the service domain. These methods currently have two main limitations: they cannot be used in conjunction with live migration, and they introduce a dependency on the domain owning the bus containing the physical device; in exchange, they provide superior performance. SR-IOV is described in an excellent blog article by Raghuram Kothakota.

    For the ultimate performance for large application or database domains, you can use a PCIe root complex domain for completely native performance for network and any other devices on the bus.
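
    A hedged example of the virtual network settings mentioned above (domain, switch, and vnet names are illustrative; the domains must be rebooted for extended-mapin-space to take effect):

      # Enable extended mapin space (RxDring support) on the service and guest domains
      ldm set-domain extended-mapin-space=on primary
      ldm set-domain extended-mapin-space=on mydomain

      # Use jumbo frames on the virtual switch and the guest's virtual network device
      ldm set-vsw mtu=9000 primary-vsw0
      ldm set-vnet mtu=9000 vnet0 mydomain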

  9. Disk I/O – For best performance, use a whole disk backend (a LUN or full disk). Use multiple LUNs to spread load across virtual and physical disks and reduce queueing (just as you would do in a non-virtual environment). Flat files in a file system are convenient and easy to set up as backends, but offer lower performance.

    Starting with Oracle VM Server for SPARC 3.1.1, you can also use SR-IOV for Fibre Channel devices, with the same benefits as with networking: native I/O performance. For completely native performance for all devices, use a PCIe root complex domain and exclusively use physical I/O.

    ZFS can also be used for disk backends. This provides flexibility and useful features (clones, snapshots, compression) but can impose overhead compared to a raw device. Note that local or SAN ZFS disk backends preclude live migration, because a zpool can be imported on only one host at a time. When using ZFS backends for virtual disk, use a zvol rather than a flat file – it performs much better. Also, make sure the ZFS recordsize for the dataset matches the application's I/O size (just as in a non-virtual environment); for a zvol, the equivalent property is volblocksize, set when the volume is created. This avoids read-modify-write cycles that inflate I/O counts and overhead. The default of 128K is not optimal for small random I/O.
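
    A minimal sketch of a zvol-backed virtual disk (pool, volume, and service names are illustrative, and a virtual disk service such as primary-vds0 is assumed to already exist):

      # Create a 50GB zvol with a block size matched to the application (e.g. 8K for a database)
      zfs create -V 50g -o volblocksize=8k rpool/ldoms/mydomain-disk0

      # Export it through the virtual disk service and attach it to the guest
      ldm add-vdsdev /dev/zvol/rdsk/rpool/ldoms/mydomain-disk0 mydomain-disk0@primary-vds0
      ldm add-vdisk vdisk0 mydomain-disk0@primary-vds0 mydomain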

  10. Networked disk on NFS and iSCSI – NFS and iSCSI can also perform quite well if an appropriately fast network is used. Apply the same network tuning you would use in a non-virtual environment. For NFS, specify mount options to disable atime, use hard mounts, and set large read and write sizes.

    If the NFS and iSCSI backends are provided by ZFS, such as in the ZFS Storage Appliance, provide lots of RAM for buffering, and install write-optimized solid-state disk (SSD) "logzilla" ZFS Intent Logs (ZIL) to speed up synchronous writes.
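
    An illustrative NFS mount for a disk image backend (server, path, and sizes are placeholders; use the largest read/write sizes your server and network support, and disable atime on the server side, for example with zfs set atime=off on a ZFS-backed share):

      # Hard-mount the NFS share with large read and write sizes on the service domain
      mount -F nfs -o hard,rsize=1048576,wsize=1048576 nfs-server:/export/ldoms /ldoms/nfs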

Summary

By design, logical domains don't have a lot of "tuning knobs", and many tuning practices you would do for Solaris in a non-domained environment apply equally when domains are used. However, there are configuration best practices and tuning steps you can use to improve performance. This blog note itemizes some of the most effective (and least exotic) performance best practices.
