Thursday, November 12, 2009

Migration from UltraSPARC I,-V to UltraSPARC T1 and T2

Migration from fast single threaded CPU machine to CMT UltraSPARC T1 and T2 results in increased CPU reporting [ID 781763.1]

--------------------------------------------------------------------------------

In this Document
Symptoms
Changes
Cause
Solution
References



--------------------------------------------------------------------------------



Applies to:
Oracle Server - Enterprise Edition - Version: 10.2.0.1 to 11.1.0.7
Sun Solaris SPARC (64-bit)
Symptoms
Sun Microsystems have recently rolled out a new generation of servers that many Oracle Customers are migrating to. Many are seeing diminished performance when doing simple single-user / single-process performance tests against an Oracle database.

Here is a classic example:
This first tkprof output is from the new T2000 UltraSPARC machine:
select distinct...

call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        1     0.11       0.12          0         10          0          0
Execute      1     0.00       0.00          0          0          0          0
Fetch        2    11.43      11.16          0      41520          0          1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        4    11.54      11.28          0      41530          0          1

This second tkprof output is from the older V440 SPARC machine:
select distinct...

call     count      cpu    elapsed       disk      query    current       rows
------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse        1     0.03       0.01          0         10          0          0
Execute      1     0.00       0.00          0          0          0          0
Fetch        2     3.62       3.53          0      39693          0          1
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total        4     3.65       3.55          0      39703          0          1

The identical SQL statement, with no disk reads and practically the same number of logical reads, takes about 3 times longer in elapsed time and uses about 3 times more CPU time on the new machine. Oddly, overall CPU utilization on the new machine is actually under 10% even at peak load, whereas on the older machine the CPU was almost 100% busy during peak load.
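The "about 3 times" figure can be checked directly from the two tkprof total rows. The following is a minimal sketch; the dictionary and function names are illustrative only, and the numbers are copied from the totals shown above.

```python
# Totals copied from the "total" rows of the two tkprof outputs above.
t2000 = {"cpu": 11.54, "elapsed": 11.28, "query": 41530}
v440 = {"cpu": 3.65, "elapsed": 3.55, "query": 39703}


def ratio(metric):
    """Return how many times larger the T2000 figure is than the V440 figure."""
    return t2000[metric] / v440[metric]


for metric in ("cpu", "elapsed", "query"):
    print(f"{metric:8s} T2000/V440 = {ratio(metric):.2f}x")
```

CPU and elapsed time both come out a little above 3x, while the logical reads ("query") differ by less than 5%, so the extra time is not explained by extra work.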

Changes
# A CMT pipeline runs at, say, 1.2GHz and has 4 threads sharing it
# Therefore each thread gets only 1/4 of the cycles and effectively runs at 300MHz
# This makes a single thread slower than one running on an old US II chip
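
The arithmetic behind these points can be sketched as follows. This is the note's own simplification (an equal share of pipeline cycles per hardware thread), not a model of real pipeline behaviour, and the function name is hypothetical.

```python
def effective_mhz(pipeline_mhz, threads_per_pipeline):
    """Naive per-thread clock, assuming every hardware thread gets an equal share of cycles."""
    return pipeline_mhz / threads_per_pipeline


# 1.2GHz pipeline shared by 4 hardware threads, as in the example above.
print(effective_mhz(1200, 4))  # 300.0
```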


The Sun engineer gave a nice metaphor that describes this situation:

Suppose you have 20 boxes that you need to transport from Colorado Springs to Redwood Shores. The old machine is like a Ferrari that can take 1 box at a time, whereas the new machine is like a semi truck that can transport all the boxes at the same time.

A good tool to monitor the core utilization is called corestat. See the following link for usage instructions:


http://www.solarisinternals.com/wiki/index.php/CMT_Utilization


Cause
Although Sun reports every virtual CPU on a T2-based computer as running at 1200MHz (1.2GHz), for example in "/usr/sbin/psrinfo -v" output, the work within each physical core is actually divided across 4 Chip-level MultiThreading (CMT) threads. So any one active Unix process can use only 25% of the CPU core's cycles (effectively, it feels as if it is on a single CPU running at only 300MHz).

So a single process may report that it is nearly 100% CPU bound from its own perspective (as measured via getrusage() calls), but it can keep at most one virtual CPU busy. At the system level, the computer may therefore appear to be only (100/N)% busy (where N is the number of virtual CPUs) due to the activity of that single process.
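
As a rough illustration of the (100/N)% effect, the sketch below computes the system-wide busy percentage when each CPU-bound process saturates exactly one virtual CPU. The function name and the virtual CPU counts are assumptions chosen for illustration; 32 matches the 32 hardware strands of the UltraSPARC T1 mentioned later in this note.

```python
def system_busy_percent(busy_processes, virtual_cpus):
    """System-wide CPU busy % when each CPU-bound process saturates one virtual CPU."""
    # A process cannot occupy more than one virtual CPU, and there are
    # only `virtual_cpus` of them to occupy.
    running = min(busy_processes, virtual_cpus)
    return 100.0 * running / virtual_cpus


# One fully CPU-bound process on machines with different virtual CPU counts.
for n in (4, 32, 64):
    print(f"{n:3d} virtual CPUs -> {system_busy_percent(1, n):.2f}% busy")
```

With 32 virtual CPUs, a single process that believes it is 100% CPU bound shows up as roughly 3% system utilization, which is consistent with the "under 10%" observation in the Symptoms section.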

Thus, individual process response time may suffer, but the unused resources will allow a higher number of concurrent processes to run than older systems could handle as efficiently.

The T2 is designed to handle high numbers of concurrently active processes. It is expected that, as the number of concurrent high-CPU consumers grows, individual response time on the Sun T2 will remain fairly steady, while response time on the old SPARC platform will degrade as users spend more time waiting on the run queue for the faster CPU. The concurrent load at which the SPARC-based system's response time crosses over and appears slower than the Sun T2's is unknown and depends on the overall workload profile generated by the concurrent user activity.
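
To make the crossover idea concrete, here is a deliberately simplistic toy model. It is not a prediction for any real workload. It assumes perfectly independent CPU-bound jobs, ideal time-slicing once jobs outnumber CPUs, and no slowdown from hardware threads sharing a core. The solo run times are the elapsed totals from the tkprof outputs in the Symptoms section; the CPU counts (4 for the older machine, 32 hardware threads for the T2000) are assumptions made for illustration.

```python
def response_time(solo_seconds, concurrent_jobs, cpus):
    """Idealized time-slicing: jobs beyond the CPU count share the CPUs evenly."""
    return solo_seconds * max(1.0, concurrent_jobs / cpus)


def crossover(old_solo, old_cpus, new_solo, new_cpus, max_jobs=256):
    """Smallest job count at which the old machine becomes slower, or None."""
    for jobs in range(1, max_jobs + 1):
        old = response_time(old_solo, jobs, old_cpus)
        new = response_time(new_solo, jobs, new_cpus)
        if old > new:
            return jobs
    return None


# Assumed inputs: 3.55s solo on 4 CPUs (old) vs 11.28s solo on 32 threads (new).
print(crossover(3.55, 4, 11.28, 32))
```

Under these assumptions the old machine falls behind once about a dozen such jobs run at the same time. A real system will behave differently, since threads on the same core compete for the pipeline and real workloads wait on I/O and locks.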

So in a simple comparison test of a relatively simple single-threaded task, with little or no time spent waiting for CPU, I/O, or other significant wait sources, the process on Sun SPARC could, on paper, appear to complete as much as 8x faster than a similar task running on the Sun T2.

++++++++++++++++++++++++
Here is the general write up on this issue from Sun Microsystems:
The CMT machine has a lower clock rate and is a single-instruction-issue per
clock cycle while the US-IIIi has a higher clock rate and a potential
of four-instruction-issue per cycle. That is the main reason the single
thread performance of the US-T1/1 GHz is less than that of the US-IIIi/1593.

This single thread performance tradeoff was the price of the high
level of integration of the US-T1 and the dramatic increase in
system level throughput it allows over the non-CMT type machines.

If you run a single process and measure the elapsed time to completion
on a T6300 vs a 280R, you are not taking advantage of the throughput computing
that you get from a multi-threaded application, which is where the T6300 shines.
If this application is single threaded, that would be the expected
outcome (poor performance on the T6300). If many instances (the ideal
number being equal to the number of hardware threads) of this application
are started in parallel, and they are independent with no internal
synchronization between them, you will see completion times
that are a fraction of those on the 280R.

For additional information you can read this Blueprint:

Developing and Tuning Applications on UltraSPARC T1 Chip Multithreading Systems
January 2007
http://www.sun.com/blueprints/0107/819-5144.html
in which we read:
"...The performance of a single thread on a system with UltraSPARC T1
processors is less than that of a single threaded processor. This is because
the strands do not have exclusive access to the resources of the core and
the pipeline is simpler. Although individual strands are weaker, aggregate
performance is greater than in previous generations of SPARC processors.
Indeed by running LWPs on 32 hardware strands in parallel, the aggregate
performance of UltraSPARC T1 processors exceeds that of today's SPARC
systems. This is the tradeoff of single-thread performance versus
throughput."


Solution
This is not so much a defect in Sun or Oracle code as a misunderstanding of how this architecture works and what is expected of it. Possible Oracle methods for alleviating the problem are to leverage parallel execution where available, or to increase the concurrent workload and let the hardware scale to handle it. This implies that changes to the application architecture are necessary to take full advantage of the available horsepower of this platform, which comes in the form of more threads and cores per processor.

Please refer to the following note from Sun:

http://blogs.sun.com/deniss/entry/lesons_learned_from_t1

Note:
Sun Microsystems' UltraSPARC T2 microprocessor is a multithreading, multi-core CPU. It is a member of the SPARC family, and the successor to the UltraSPARC T1. The chip is sometimes referred to by its codename, Niagara 2. Sun started selling servers with the T2 processor in October 2007.

http://en.wikipedia.org/wiki/UltraSPARC_T2

This link gives some good specs on how this new architecture works with regard to CMT (Chip Multithreading):
http://www.solarisinternals.com/wiki/index.php/CMT_Utilization

This link gives some recommendations on how to make Oracle work better on CMT architecture:
http://blogs.sun.com/glennf/tags/cmt

This link is to an Adobe PDF presentation from Sun that discusses the costs/benefits of using the T2000.
Slide 12 is an illustration of a comparison with an AMD x6220:
http://blogs.sun.com/glennf/resource/Optimizing_Oracle_CMT_v1.pdf



References
BUG:8316255 - NUMCPU ALGORITHM NEEDS TO IGNORE THREADS WHEN CALCULATING CPU COUNT
