<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>vPivot &#187; vmkernel</title>
	<atom:link href="http://vpivot.com/tag/vmkernel/feed/" rel="self" type="application/rss+xml" />
	<link>http://vpivot.com</link>
	<description>Scott Drummonds on Virtualization</description>
	<lastBuildDate>Wed, 01 Feb 2012 06:46:55 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Storage Consolidation (or: How Many VMDKs Per Volume?)</title>
		<link>http://vpivot.com/2010/11/07/storage-consolidation-or-how-many-vmdks-per-volume/</link>
		<comments>http://vpivot.com/2010/11/07/storage-consolidation-or-how-many-vmdks-per-volume/#comments</comments>
		<pubDate>Sun, 07 Nov 2010 08:15:23 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[esxtop]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[vcenter]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmworld]]></category>
		<category><![CDATA[vmworld europe]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=696</guid>
		<description><![CDATA[Part of the performance best practices talk I co-presented at VMworld in San Francisco and Copenhagen focused on answering the question, &#8220;How many virtual machines can be placed on a single VMFS volume?&#8221;  There are a lot of theories as to a best answer.  It will not surprise you to learn that no single consolidation [...]]]></description>
			<content:encoded><![CDATA[<p>Part of the performance best practices talk I co-presented at VMworld in San Francisco and Copenhagen focused on answering the question, &#8220;How many virtual machines can be placed on a single VMFS volume?&#8221;  There are a lot of theories as to a best answer.  It will not surprise you to learn that no single consolidation ratio works in every environment.  Your workloads will influence the maximum consolidation.  But we know enough about how ESX virtualizes storage to provide guidance as to the right storage consolidation ratios.</p>
<p><span id="more-696"></span>First, a little background on ESX&#8217;s storage queues.  There are two relevant queues in ESX.  First is the device queue, which has one instantiation at each HBA for each LUN.  Second is the kernel queue, which handles &#8220;overflowed&#8221; IOs that are waiting to be placed in a full device queue.</p>
<p>For Fibre Channel HBAs, the device queue&#8217;s default length is 32 commands.  It is much larger for iSCSI. No HBA, and thus no device queue, exists for NFS.  A 32 command queue is capable of opening 32 commands at a time.  Obviously, if you double this queue length then the queue will drive twice as many IOs to the volume.  For the rest of this article I will discuss queues in terms of the 32 element Fibre Channel queue.</p>
<p>Because one device queue is instantiated at each HBA for each LUN, a storage reconfiguration at an array can change the number of queues at an ESX host.  Increasing the number of queues increases the total number of IOs that the host can open against the array.  I demonstrated this in my VMworld presentation with the following figure.</p>
<div id="attachment_697" class="wp-caption aligncenter" style="width: 489px"><a href="http://vpivot.com/wp-content/uploads/2010/11/device-queues.png"><img class="size-full wp-image-697" title="Example: Two Storage Configurations" src="http://vpivot.com/wp-content/uploads/2010/11/device-queues.png" alt="Two VMFS volumes means two queues.  One volume one queue." width="479" height="519" /></a><p class="wp-caption-text">Putting two VMs on two volumes results in up to 64 commands being opened from the pair of them at one time.</p></div>
<p>This figure shows the simple difference between two virtual machines sharing a single VMFS volume and two that each get their own.  In the first configuration, only 32 commands can be opened from the host and that single queue is shared between the virtual machines.  In the second configuration, the host can open up to 64 total commands and each virtual machine can open up to 32.</p>
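<p>The arithmetic behind the figure can be sketched in a few lines of Python.  This is only an illustration under stated assumptions: the function name is hypothetical, the host is assumed to have a single HBA, and the result is an upper bound on open commands, not a prediction of throughput.</p>

```python
# Hypothetical helper (not a VMware API): upper bound on the commands a host
# can have open at once, assuming one device queue per HBA per LUN.
DEFAULT_FC_QUEUE_DEPTH = 32  # default Fibre Channel device queue length


def max_open_commands(num_hbas, num_luns, queue_depth=DEFAULT_FC_QUEUE_DEPTH):
    """Return the most commands the host can have open against the array."""
    return num_hbas * num_luns * queue_depth


# Two VMs sharing one volume versus two VMs with a volume each (one HBA assumed).
one_volume = max_open_commands(num_hbas=1, num_luns=1)
two_volumes = max_open_commands(num_hbas=1, num_luns=2)
print(one_volume, two_volumes)  # prints: 32 64
```

<p>The product grows quickly as volumes are added, which is exactly why the fixed command budget of each HBA matters.</p>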
<p>Your first reaction to this might be, &#8220;Wow! I should put every VMDK on a VMFS volume of its own!  Then imagine the total throughput that the host could drive!!&#8221;  My first response to this is stop using so many exclamation points.  Nobody likes an overenthusiastic writer.  But second, you should consider that more is not always better.  In fact, I can think of several reasons why you should not reconfigure storage to multiply the number of queues:</p>
<ol>
<li>Allowing a host to open many commands simultaneously may be good for the individual virtual machines but is likely to be dangerous for the shared infrastructure.  This could result in short but extremely intense <a href="http://virtualgeek.typepad.com/virtual_geek/2009/06/vmware-io-queues-micro-bursting-and-multipathing.html">microbursts</a> of IO that could present challenges to your fabric or storage processors.</li>
<li>The device driver (and the HBA) can only open a fixed number of commands depending on the device&#8217;s implementation.  You have to use these sparingly.</li>
<li>The configuration that results in more queues necessarily requires more VMFS volumes which results in a greater administration cost.</li>
</ol>
<p>In addition to reconfiguring storage to increase the number of device queues, you always have the option of increasing the length of ESX&#8217;s device queues.  This is documented on page 71 of the <a href="http://www.vmware.com/pdf/vsphere4/r40/vsp_40_san_cfg.pdf">Fibre Channel SAN Configuration Guide</a>.  But I will caution you against reconfiguring storage queues, too.  This requires manual changes at every host, produces longer queues that more quickly eat into the fixed number of commands each HBA can support, and increases the possible IO intensity of every virtual machine on the host.</p>
<p>And if these detailed explanations are insufficient at explaining why storage queue manipulation is unproductive or even counterproductive towards your goal of optimizing your infrastructure, let me point out that VMware has years of experience at consolidating storage and they chose 32 commands per queue as the right number for most environments.  Trust their experience on this one.</p>
<p>Of course I would be remiss if I did not mention that there are rare times that a storage reconfiguration may help performance.  Redistributing virtual machines across different VMFS volumes or increasing queue depths can correct some issues.  You can identify the occasions where such a change may help by looking for a large kernel latency.</p>
<p>As I mentioned above, commands that are waiting for access to a full device queue reside in the kernel queue until a device queue slot becomes available.  On the whole, commands should only spend a fraction of a millisecond in the kernel queue on their way to the device queue.  A kernel queuing time of over one millisecond and certainly over two milliseconds suggests the virtual machines are not having their IO needs served fast enough.</p>
<p>You can see kernel queuing times in the kernel latency statistic reported in esxtop (counter: KAVG) and vCenter (counter: Kernel Latency).  When these latencies consistently average a millisecond or more it&#8217;s time to investigate storage.  But know that slow storage can result in high kernel queuing times.  So, before you go manipulating queues, or reconfiguring your storage layout, make sure your storage is serving IOs in periods deemed acceptable by the storage teams (usually 5-10 ms).</p>
<p>This is kind of a long article by vPivot standards, I know.  But cut me some slack.  <a href="http://virtualgeek.typepad.com/">Chad Sakac</a> bangs out footnotes and parenthetical digressions that are longer than this entry.  This content has already been covered in my VMworld presentations so if you have access to those recordings go listen to Kaushik and me present it there.  But for those of you who were unable to attend I wanted to present this important guidance for your consideration.</p>
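<p>The order of those checks can be written down as a small Python sketch.  The function and its default thresholds are hypothetical illustrations of the guidance in this article, not a VMware tool.  The device latency input is my assumption about where the storage service time would come from (esxtop reports it as DAVG), and the 10 ms default is simply the top of the 5-10 ms range mentioned above.</p>

```python
def triage_storage_latency(kavg_ms, davg_ms, kavg_limit_ms=1.0, davg_limit_ms=10.0):
    """Suggest a next step from sustained average latencies in milliseconds.

    kavg_ms: kernel latency (time commands wait in the kernel queue).
    davg_ms: device latency (time the storage takes to serve a command).
    Both limits are illustrative defaults, not official thresholds.
    """
    # Slow storage inflates kernel queuing time, so rule it out first.
    if davg_ms > davg_limit_ms:
        return "fix slow storage first"
    # Storage is healthy but commands still wait in the kernel queue.
    if kavg_ms >= kavg_limit_ms:
        return "investigate queues and VM placement"
    return "no action"


print(triage_storage_latency(kavg_ms=2.5, davg_ms=25.0))  # fix slow storage first
print(triage_storage_latency(kavg_ms=2.5, davg_ms=6.0))   # investigate queues and VM placement
print(triage_storage_latency(kavg_ms=0.2, davg_ms=6.0))   # no action
```

<p>Checking the device latency before the kernel latency keeps you from tuning queues to hide an array that is simply slow.</p>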
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/11/07/storage-consolidation-or-how-many-vmdks-per-volume/feed/</wfw:commentRss>
		<slash:comments>18</slash:comments>
		</item>
		<item>
		<title>Hyper-Threading on vSphere</title>
		<link>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/</link>
		<comments>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/#comments</comments>
		<pubDate>Sat, 06 Mar 2010 18:05:38 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[hyper-threading]]></category>
		<category><![CDATA[intel]]></category>
		<category><![CDATA[java]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmmark]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=328</guid>
		<description><![CDATA[I continue to receive many questions from our customers on the expected performance gains of the new version of Hyper-Threading in Intel&#8217;s Core i7 processors. The answer requires a little bit of discussion on Hyper-Threading, a little bit on ESX, and comes with some performance data. If you are still interested, read on. On VI3, [...]]]></description>
			<content:encoded><![CDATA[<p>I continue to receive many questions from our customers on the expected performance gains of the new version of Hyper-Threading in Intel&#8217;s Core i7 processors.  The answer requires a little bit of discussion on Hyper-Threading, a little bit on ESX, and comes with some performance data.  If you are still interested, read on.</p>
<p><span id="more-328"></span>On VI3, many of VMware&#8217;s customers disabled Hyper-Threading on their older, Netburst architecture Intel processors.  Intel has vaguely described the new Hyper-Threading as more efficient than the previous generation and I believe this to be due to a shorter pipeline and an improved ability to context switch pipeline stage data.  Long pipelines&#8211;such as the Netburst era Xeons of model numbers x1xx and x2xx&#8211;are more likely to suffer bubbles during context switches and are therefore penalized versus shorter pipeline products, such as the Core i7.  Furthermore, by pushing and restoring pipeline stage data during a hardware context switch, the new HT can reduce pipeline bubbles.</p>
<p>But the gains vSphere users experience as a result of the new Hyper-Threading also come from changes in ESX.  ESX&#8217;s scheduler must make decisions as to when to co-locate two worlds on a physical core to take advantage of Hyper-Threading.  In some conditions the scheduler will perform this co-location and in others it will allow a world to run on the core by itself.  The decision to execute worlds concurrently instead of serially on a physical core can be informally called the scheduler&#8217;s <em>trust</em> of Hyper-Threading.  The vSphere scheduler <em>trusts</em> Hyper-Threading more than the VI3 scheduler did.  This amplifies the effect of HT.</p>
<p>I am now going to bore you with a disclaimer before I give you any data showing the effect of Hyper-Threading.  The value of HT will vary from workload to workload and the ultimate authority of HT&#8217;s value is the end-user.  The following numbers are the result of informal analysis at VMware and should only be used as a guide in your own analysis.  Please do not make purchasing decisions on this information, which is devoid of the detail we would normally commit to a white paper.</p>
<table id="newspaper-a">
<tbody>
<tr>
<th>Workload</th>
<th>Observed Throughput Gain Due to HT</th>
</tr>
<tr>
<td>VMmark</td>
<td>24%</td>
</tr>
<tr>
<td>SPECjbb</td>
<td>10%</td>
</tr>
<tr>
<td>Consolidated SQL</td>
<td>19%</td>
</tr>
</tbody>
</table>
<p>In addition to the gains we informally cite here, I can say that we have not yet seen a workload where the new Hyper-Threading slows down consolidated performance.  As far as we can tell, the new Hyper-Threading should be left enabled in 100% of virtualized environments.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/03/06/hyper-threading-on-vsphere/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
		<item>
		<title>Maximum Concurrent VMotions</title>
		<link>http://vpivot.com/2010/03/03/maximum-concurrent-vmotions/</link>
		<comments>http://vpivot.com/2010/03/03/maximum-concurrent-vmotions/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 18:38:35 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[network]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmotion]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=322</guid>
		<description><![CDATA[A VMware customer and attendee of a talk I gave at a performance roundtable asked me for a preview of unreleased features*.  When I talked about the amazing improvements to VMotion that would enable as many as eight concurrent VMotions the customer said, and I am paraphrasing here, &#8220;Yawn.  I can already do that.&#8221;  Really?  [...]]]></description>
			<content:encoded><![CDATA[<p>A VMware customer and attendee of a talk I gave at a performance roundtable asked me for a preview of unreleased features*.  When I talked about the amazing improvements to VMotion that would enable as many as eight concurrent VMotions the customer said, and I am paraphrasing here, &#8220;Yawn.  I can already do that.&#8221;  Really?  I had no idea customers could do this.  As it turns out, many of us at VMware did not know that customers knew how to do this.</p>
<p><span id="more-322"></span>VMware&#8217;s dedicated and curious customer base somehow obtained information on an undocumented parameter that limits the maximum number of concurrent VMotions.  Information on modifying this parameter is sprinkled around the <a href="http://www.youtube.com/watch?v=f99PcP0aFNE">internet tubes</a> but my favorite comes from Jason Boche.  Jason and others have identified how you can <a href="http://www.boche.net/blog/?p=806">change the current limit of two VMotions per host</a>.  I want to give you an explanation of why you should not do this and show you what you can expect from future releases of vSphere.</p>
<p>The concurrent VMotion limit was set after careful analysis of the capabilities of existing hardware, VMotion implementation details, and a large number of enterprise applications.  It has been set at two to provide a near 100% guarantee of no downtime during the migrations.  On some workloads or on older hardware, it is possible that three or more concurrent migrations could saturate a system resource and result in downtime.  If your hardware is brand new and your application is not wildly touching memory, it is possible that you can somewhat safely increase the concurrent VMotion limit.  But, as Kit Colbert told me, &#8220;I think allowing four simultaneous VMotions is probably OK in most scenarios, but if the VMs are really large and/or have very big working sets, then I’d dissuade customers from bumping up the limits.&#8221;</p>
<p>And if Kit&#8217;s gentle reminder is not enough to dissuade you from making this change to your production environments, I will point out that problems that arise as a result of changing the concurrent VMotion limit are not supported by VMware.  We simply cannot promise the unfaltering quality of VMotion if end users increase this limit.</p>
<p>Now, back to the feature preview in my ongoing performance road show (today: Bellevue, WA!).  Using an in-house development version of ESX, we are running eight concurrent VMotions on a single host with even better quality than that of vSphere 4.  A phenomenally dedicated group of engineers has drastically improved the throughput of VMotions and decreased the already tiny virtual machine stun time.  This means that we can maintain our zero downtime commitment while upping the number of concurrent VMotions by a factor of four.  Furthermore, total migration time has also decreased, so our development host can evacuate a large number of virtual machines almost an order of magnitude faster than vSphere 4.</p>
<p>(*) Any time I talk about this or other unreleased features the standard VMware disclaimer applies.  We are not committing this feature to any specific product nor committing any product to a specific date.  Not my rules, guy</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/03/03/maximum-concurrent-vmotions/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>PVSCSI and vmxnet3</title>
		<link>http://vpivot.com/2010/02/22/pvscsi-and-vmxnet3/</link>
		<comments>http://vpivot.com/2010/02/22/pvscsi-and-vmxnet3/#comments</comments>
		<pubDate>Tue, 23 Feb 2010 00:12:34 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[pvscsi]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmxnet]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=309</guid>
		<description><![CDATA[I heard a myth today that VMware did not support running vmxnet3 and PVSCSI in the same virtual machine.  I have talked with a dozen engineers on the subject since it came up this morning and all swear the drivers run great together.  The two drivers work on very different and unrelated stacks in the [...]]]></description>
			<content:encoded><![CDATA[<p>I heard a myth today that VMware did not support running vmxnet3 and PVSCSI in the same virtual machine.  I have talked with a dozen engineers on the subject since it came up this morning and all swear the drivers run great together.  The two drivers work on very different and unrelated stacks in the VMkernel.  There are no inter-dependencies of any sort between PVSCSI and vmxnet3.</p>
<p>I think this rumor sprang from our somewhat limited support of paravirtualized drivers in FT-protected virtual machines, which will be improved in a subsequent release.  And while most of you probably know that PVSCSI and vmxnet3 run together, I thought it worth a brief comment on this blog.  Myths are like cockroaches.  For every one you see there are hundreds hiding behind the walls.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/02/22/pvscsi-and-vmxnet3/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Windows Guest Defragmentation</title>
		<link>http://vpivot.com/2010/02/12/windows-guest-defragmentation/</link>
		<comments>http://vpivot.com/2010/02/12/windows-guest-defragmentation/#comments</comments>
		<pubDate>Fri, 12 Feb 2010 16:22:55 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[pex]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[windows]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=288</guid>
		<description><![CDATA[Today at VMware Partner Exchange I had a lunchtime discussion with a partner of ours that makes a Windows file system (NTFS) defragmentation tool. He related anecdotes of incredible performance acceleration credited to defragmentation and quoted a few numbers based on his test environment. When he asked me what VMware&#8217;s recommendations were on the subject [...]]]></description>
			<content:encoded><![CDATA[<p>Today at VMware Partner Exchange I had a lunchtime discussion with a partner of ours that makes a Windows file system (NTFS) defragmentation tool.  He related anecdotes of incredible performance acceleration credited to defragmentation and quoted a few numbers based on his test environment.  When he asked me what VMware&#8217;s recommendations were on the subject I remained uncharacteristically silent.  Do we have best practices on this?</p>
<p><span id="more-288"></span>When people ask me about file system fragmentation I explain that fragmentation can come from two sources: the guest file system or VMFS.  In 2009 we included experiments in our <a href="http://www.vmware.com/pdf/vsp_4_thinprov_perf.pdf">thin provisioning white paper</a> that showed that both internal and external fragmentation in VMFS have no significant effect on performance.  As for guest fragmentation, VMware has avoided the business of optimizing native operating systems so there is no extant, official guidance.</p>
<p>More precisely, the large number of mappings from the guest file to the disk make it difficult to know how changes to each can impact the system&#8217;s performance as a whole.  But in talking with this partner I realized that there are two inescapable truths that suggest guest defragmentation is critical in a virtualized environment:</p>
<ol>
<li>Defragmentation can decrease the number of disk commands and the resultant IOPS.</li>
<li>The fewer IOs, the more efficient the virtualization.</li>
</ol>
<p>Guest defrag tools will order each file&#8217;s blocks sequentially in the guest file system.  This will enable the guest to make a smaller number of calls to larger, contiguous ranges of data than it could had the blocks been scattered across the guest file system.  By making fewer calls to larger blocks, the following things happen:</p>
<ul>
<li>The array can leverage its faster sequential access capabilities to improve storage throughput.</li>
<li>The hypervisor handles fewer SCSI messages from the guest resulting in lower overhead.</li>
<li>The smaller number of commands results in fewer outstanding operations in the 32-element HBA queue, which allows more virtual machines to access the storage concurrently.</li>
</ul>
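<p>To make the "fewer calls to larger blocks" idea concrete, here is a rough Python sketch.  It assumes the guest issues one read command per contiguous extent and splits an extent only when it exceeds a maximum transfer size.  Both the function and the block counts are made up for illustration and say nothing about real NTFS or ESX limits.</p>

```python
def read_commands_needed(extent_lengths, max_blocks_per_command):
    """Count read commands for a file laid out as the given contiguous extents.

    Each extent needs at least one command; an extent longer than the
    per-command cap is split into several (ceiling division).
    """
    return sum(-(-length // max_blocks_per_command) for length in extent_lengths)


fragmented = [4] * 16  # a 64-block file scattered into 16 extents of 4 blocks
contiguous = [64]      # the same file after defragmentation

print(read_commands_needed(fragmented, max_blocks_per_command=64))  # 16
print(read_commands_needed(contiguous, max_blocks_per_command=64))  # 1
```

<p>Under these assumptions every command saved is one less SCSI message for the hypervisor to handle and one less slot occupied in the device queue.</p>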
<p>I have not found out how much consolidated workloads have to gain from guest defragmentation.  Nor have I quantified the impact to shared storage of a shift from a larger number of small commands to a smaller number of large commands.  But I am going to work with this partner and see if we can publish some numbers in 2010.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/02/12/windows-guest-defragmentation/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Inaccuracy of In-guest Performance Counters</title>
		<link>http://vpivot.com/2010/02/10/inaccuracy-of-in-guest-performance-counters/</link>
		<comments>http://vpivot.com/2010/02/10/inaccuracy-of-in-guest-performance-counters/#comments</comments>
		<pubDate>Wed, 10 Feb 2010 23:33:43 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[perfmon]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[timekeeping]]></category>
		<category><![CDATA[vmkernel]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=268</guid>
		<description><![CDATA[Every couple of months I receive a request for an explanation as to why performance counters in a virtual machine cannot be trusted.  While it is unfairly cynical to say that in-guest counters are never right, accurate capacity management and troubleshooting should rely on the counters provided by vSphere in either vCenter or esxtop.  The [...]]]></description>
			<content:encoded><![CDATA[<p>Every couple of months I receive a request for an explanation as to why performance counters in a virtual machine cannot be trusted.  While it is unfairly cynical to say that in-guest counters are never right, accurate capacity management and troubleshooting should rely on the counters provided by vSphere in either <a href="http://communities.vmware.com/docs/DOC-5600">vCenter</a> or <a href="http://communities.vmware.com/docs/DOC-9279">esxtop</a>.  The explanation is too short to merit a white paper but I hope a blog article will serve as the authoritative comment on the subject.</p>
<p><span id="more-268"></span>Usually this issue arises at a new VMware customer or at an established customer that has added new staff to the virtualization team.  In both cases the administrators are familiar with existing tools and require good reason to retool their thinking and environment around a new measurement system.</p>
<p>I was discussing the response to these concerns with my friend and colleague Kaushik Banerjee, the head of VMware&#8217;s outbound engineering group.  Kaushik and I spend a lot of time thinking about communicating technical details to our customers and in this case we chose different approaches to answer the question.  Both responses are complementary, so choose the one that suits your needs.</p>
<h2>Kaushik&#8217;s Approach: The Killer Examples</h2>
<p>Kaushik suggested that if we show cases where the guest OS&#8217;s counters were obviously wrong that a naturally suspicious VI admin would never trust the guest counters again.  To that end, I offer the following screen shots to make our point.</p>
<div id="attachment_285" class="wp-caption alignnone" style="width: 610px"><a href="http://vpivot.com/wp-content/uploads/2010/02/utilization_guest_higher.jpg"><img class="size-full wp-image-285" title="Guest Utilization Higher Than Host" src="http://vpivot.com/wp-content/uploads/2010/02/utilization_guest_higher.jpg" alt="Guest Utilization Higher Than Host" width="600" /></a><p class="wp-caption-text">Perfmon&#39;s counters show utilization higher in the guest than the host reports.</p></div>
<p>This screen shot shows two counters available in Perfmon inside a Windows guest with the <a href="http://vpivot.com/2009/09/17/using-perfmon-for-accurate-esx-performance-counters">vmStatsProvider</a> installed (available by default since vSphere).  The darker, red line is the CPU utilization as reported by the guest.  The lighter, greenish (?) line is CPU utilization of the virtual machine, from the host&#8217;s perspective.  This is the real CPU utilization passed into the guest by vmStatsProvider.  Notice how the host is always reporting higher utilization than the guest.  This is due to one of the reasons why guest counters cannot be trusted: they are unaware of hypervisor overheads.</p>
<p>This second screen shot shows a different case where the host utilization is lower than that reported by the guest.  Again, the dark red line represents the guest OS&#8217;s report of CPU utilization and the lighter line shows the real CPU utilization as reported by ESX.</p>
<div id="attachment_284" class="wp-caption alignnone" style="width: 610px"><a href="http://vpivot.com/wp-content/uploads/2010/02/utilization_host_higher.jpg"><img class="size-full wp-image-284" title="Host Utilization Higher Than Guest" src="http://vpivot.com/wp-content/uploads/2010/02/utilization_host_higher.jpg" alt="Host Utilization Higher Than Guest" width="600" /></a><p class="wp-caption-text">Perfmon&#39;s counters report a higher CPU utilization than ESX&#39;s.</p></div>
<p>The reason the host shows lower utilization than the guest is that the guest is unaware that it is only getting a fraction of the host&#8217;s CPU, time-sliced by ESX&#8217;s scheduler.  In this case the virtual machine was contending for CPU with other active virtual machines, but the same principle would apply had a CPU limit been set.</p>
<h2>Scott&#8217;s Approach: A Detailed Explanation</h2>
<p>My approach to convincing VI admins to avoid guest tools is based on the bottomless thirst for information that is common to technophiles.  If I can provide an explanation for the underlying system of resource scheduling and manipulation, then our admins will be able to deal with the guest counter issue and maybe solve other issues with their newfound knowledge.</p>
<p>There are four reasons why guest counters cannot be trusted:</p>
<ol>
<li>The guest is unaware of virtualization overheads.  As screen shot one showed above, the hypervisor will increase the CPU load as it virtualizes the hardware for the guest operating system.  That additional CPU work is not seen by guest tools.</li>
<li>The guest is unaware that it is only seeing the portion of CPU that ESX&#8217;s scheduler is allowing it to see.  Because of contention or resource restrictions, virtual machines only get a slice of the CPU&#8217;s time.  When a guest thinks it is getting 100% of the CPU it may not know that the processor is being shared by eight other virtual machines.  See the second screen shot above.</li>
<li>Time skew in virtual machines can change the sample window for time-based counters.  This means that the guest may have measured 10 milliseconds of time passage during a read command when 12 milliseconds have elapsed.  This is more common on older versions of ESX and when the host CPU is saturated.  More on this below.</li>
<li>The virtual machines are unaware that they are being de-scheduled when idle, which means that they appear to be working more of the time than they are.  Consider a case where a virtual machine is idle 90% of the time.  If ESX does not schedule the VM during its idle time then the guest will think that its processor queues are full 100% of the time that it is being executed.</li>
</ol>
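<p>Two of these effects are easy to put numbers on.  The Python below is a back-of-the-envelope sketch, not a model of ESX: the helper names are hypothetical, the 120 events are an invented sample, and the time-slicing case assumes the nine virtual machines share one processor equally.</p>

```python
def rate_per_ms(events, window_ms):
    """Events per millisecond over a sample window."""
    return events / window_ms


def host_view_cpu_percent(guest_busy_percent, scheduled_fraction):
    """Scale the guest's reported utilization by the share of real time it ran."""
    return guest_busy_percent * scheduled_fraction


# Time skew (item three): the guest believes 10 ms passed while 12 ms elapsed,
# so any rate it computes over that window is overstated.
guest_rate = rate_per_ms(events=120, window_ms=10)
real_rate = rate_per_ms(events=120, window_ms=12)
print(guest_rate, real_rate)  # prints: 12.0 10.0

# Time slicing (item two): the guest reports 100% busy while sharing one
# processor with eight other virtual machines (equal shares assumed).
print(round(host_view_cpu_percent(100, 1 / 9), 1))  # prints: 11.1
```

<p>In both cases the guest is doing honest arithmetic on inputs it cannot see are wrong, which is why the host-side counters are the ones to trust.</p>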
<p>The time drift explanation (item three) was historically the most problematic for VMware.  On older versions of our products time drift was common.  As ESX has matured we have reduced the amount of drift, which has improved the accuracy of guest counters.  But the timer hardware is still being virtualized in software running on the host CPU.  This means that if the host processor is fully utilized, the timer may not be scheduled on time, resulting in a delay in some ticks and a resultant skew in guest time.</p>
<h2>References</h2>
<p><a href="https://www.vmware.com/pdf/VI3.5_Performance.pdf">Performance Best Practices and Benchmarking Guidelines</a>.  This white paper was the last version that we printed that included benchmarking best practices, which contains some discussion on the need to measure performance from outside the host-under-test.</p>
<p><a href="http://www.vmware.com/pdf/vmware_timekeeping.pdf">Timekeeping in Virtual Machines</a>.  This document&#8211;not updated since VI3 but still accurate in its theory&#8211;will give the background on VMware-based time keeping and provide an explanation as to how skew can occur.</p>
<p><a href="http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf">VMware vSphere™ 4: The CPU Scheduler in VMware® ESX™ 4</a>.  This white paper provides great detail on how the scheduler works, which fully explains the notion of time slicing.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/02/10/inaccuracy-of-in-guest-performance-counters/feed/</wfw:commentRss>
		<slash:comments>20</slash:comments>
		</item>
		<item>
		<title>PVSCSI and Low IO Workloads</title>
		<link>http://vpivot.com/2010/02/04/pvscsi-and-low-io-workloads/</link>
		<comments>http://vpivot.com/2010/02/04/pvscsi-and-low-io-workloads/#comments</comments>
		<pubDate>Thu, 04 Feb 2010 17:46:56 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[pvscsi]]></category>
		<category><![CDATA[storage]]></category>
		<category><![CDATA[vmkernel]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=274</guid>
		<description><![CDATA[Scott Sauer recently asked me a tough question on Twitter.  My roaming best practices talk includes the phrase &#8220;do not use PVSCSI for low-IO workloads&#8221;.  When Scott saw a VMware KB echoing my recommendation, he asked the obvious question: &#8220;Why?&#8221;  It took me a couple of days to get a sufficient answer. One technique for [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://twitter.com/ssauer">Scott Sauer</a> recently asked me a tough question on Twitter.  My roaming best practices talk includes the phrase &#8220;do not use PVSCSI for low-IO workloads&#8221;.  When Scott saw a <a href="http://kb.vmware.com/kb/1017652">VMware KB echoing my recommendation</a>, he asked the obvious question: &#8220;Why?&#8221;  It took me a couple of days to get a sufficient answer.</p>
<p><span id="more-274"></span>One technique for improving storage driver efficiency is interrupt coalescing.  Coalescing can be thought of as buffering: multiple events are queued for simultaneous processing.  For coalescing to improve efficiency, interrupts must stream in fast enough to create large batch requests.  Otherwise the timeout window will pass with no additional interrupts arriving, and the single interrupt is handled as normal but after a useless delay.</p>
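<p>A rough way to see this trade-off is a toy model, sketched below in Python.  It is not how any ESX driver is implemented: the 50-microsecond window, the two arrival rates, and the function names are all assumptions chosen for illustration.  Each batch fires a single interrupt one window after its first completion arrives.</p>

```python
def coalesce(arrivals_us, window_us):
    """Group completions into batches; each batch fires one interrupt
    window_us after its first completion arrives."""
    batches = []          # list of (fire_time, [arrival times in the batch])
    current = []
    deadline = None
    for t in arrivals_us:
        # The window for the open batch has expired: fire it.
        if current and t > deadline:
            batches.append((deadline, current))
            current = []
        # The first completion of a new batch starts the timeout window.
        if not current:
            deadline = t + window_us
        current.append(t)
    if current:
        batches.append((deadline, current))
    return batches


def summarize(arrivals_us, window_us):
    """Return (interrupt count, mean delay added per IO in microseconds)."""
    batches = coalesce(arrivals_us, window_us)
    added = sum(fire - t for fire, members in batches for t in members)
    return len(batches), added / len(arrivals_us)


WINDOW_US = 50                           # hypothetical timeout window
fast = [i * 10 for i in range(100)]      # one completion every 10 us
slow = [i * 1000 for i in range(100)]    # one completion every 1 ms

for name, stream in (("fast", fast), ("slow", slow)):
    interrupts, mean_delay = summarize(stream, WINDOW_US)
    print(f"{name}: {interrupts} interrupts for {len(stream)} IOs, "
          f"mean added delay {mean_delay:.1f} us")
```

<p>In this model the fast stream needs far fewer interrupts than IOs, so the added delay buys efficiency.  The slow stream still takes one interrupt per IO, and every IO waits the full window for nothing.</p>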
<p>An intelligent storage driver will therefore coalesce at high IO but not low IO.  In the years we have spent optimizing ESX&#8217;s LSI Logic virtual storage adapter, we have fine-tuned the coalescing behavior to give fantastic performance on all workloads.  This is done by tracking two key storage counters:</p>
<ul>
<li>Outstanding IOs (OIOs): Represents the virtual machine&#8217;s <em>demand</em> for IO.</li>
<li>IOs per second (IOPS):  Represents the storage system&#8217;s <em>supply</em> of IO.</li>
</ul>
<p>The robust LSI Logic driver increases coalescing as OIOs and IOPS increase.  No coalescing is used with few OIOs or low throughput.  This produces efficient IO at large throughput and low latency IO when throughput is small.</p>
<p>Currently the PVSCSI driver coalesces based on OIOs only, and not throughput.  This means that when the virtual machine is requesting a lot of IO but the storage is not delivering, the PVSCSI driver is coalescing interrupts.  But without the storage supplying a steady stream of IOs there are no interrupts to coalesce.  <em>The result is a slightly increased latency with little or no efficiency gain for PVSCSI in low throughput environments.</em></p>
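<p>The difference between the two policies can be sketched as a pair of decision functions.  This is only an illustration of the logic described above: the threshold values and function names are hypothetical, and the real drivers scale coalescing gradually rather than switching it on and off.</p>

```python
OIO_THRESHOLD = 4        # hypothetical demand threshold (outstanding IOs)
IOPS_THRESHOLD = 2000    # hypothetical supply threshold (IOs per second)


def coalesce_on_demand_only(oio, iops):
    """Policy keyed on outstanding IOs alone, as described for PVSCSI."""
    return oio >= OIO_THRESHOLD


def coalesce_on_demand_and_supply(oio, iops):
    """Policy keyed on both counters, as described for LSI Logic."""
    return oio >= OIO_THRESHOLD and iops >= IOPS_THRESHOLD


scenarios = {
    "idle VM, fast storage": (1, 20000),
    "busy VM, fast storage": (32, 20000),
    "busy VM, slow storage": (32, 300),
}

for name, (oio, iops) in scenarios.items():
    print(f"{name}: demand-only={coalesce_on_demand_only(oio, iops)}, "
          f"demand-and-supply={coalesce_on_demand_and_supply(oio, iops)}")
```

<p>The two policies disagree only in the last scenario, a demanding virtual machine on slow storage, which is exactly the case where coalescing adds latency without saving work.</p>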
<p>LSI Logic is so efficient at low throughput levels that there is no need for a special device driver to improve efficiency.  The CPU utilization difference between LSI and PVSCSI at hundreds of IOPS is insignificant.  But at massive amounts of IO&#8211;where 10-50K IOPS are streaming over the virtual SCSI bus&#8211;PVSCSI can save a large number of CPU cycles.  Because of that, our first implementation of PVSCSI was built on the assumption that customers would only use the technology when they had backed their virtual machines with world-class storage.</p>
<p>But VMware&#8217;s marketing engine (me, really) started telling everyone about PVSCSI without the right caveat (&#8220;only for massive IO systems!&#8221;).  So everyone started using it as a general solution.  This meant that in one condition&#8211;slow storage (low IOPS) with a demanding virtual machine (high OIOs)&#8211;PVSCSI has been inefficiently coalescing interrupts, resulting in performance slightly worse than LSI Logic.</p>
<p>But now VMware&#8217;s customers want PVSCSI as a general solution and not just for high IO workloads.  As a result we are including advanced coalescing behavior in PVSCSI for future versions of ESX.  More on that when the release vehicle is set.</p>
<h2>PVSCSI In A Nutshell</h2>
<p>If you plodded through the above technical explanation of interrupt coalescing and PVSCSI, I applaud you.  If you just want a summary of what to do, here it is:</p>
<ul>
<li>For existing products, only use PVSCSI against VMDKs that are backed by fast (greater than 2,000 IOPS) storage.</li>
<li>If you have installed PVSCSI in low IO environments, do not worry about reconfiguring to LSI Logic.  The net loss of performance is very small.  And clearly these low IO virtual machines are not running your performance-critical applications.</li>
<li>For future products*, PVSCSI will be as efficient as LSI Logic for all environments.</li>
</ul>
<p>(*) Specific product versions not yet announced.</p>
<h2>Update: February 16</h2>
<p>The simple, almost austere KB on this rare occurrence raised more questions than answers.  You may notice that <a href="http://kb.vmware.com/kb/1017652">the KB has been updated</a> with text from this blog since the blog&#8217;s original publication.  A <a href="http://www.vmware.com/pdf/vsp_4_pvscsi_perf.pdf">white paper on PVSCSI</a> that had been under construction for quite some time was also released with <a href="http://blogs.vmware.com/performance/2010/02/highperformance-pvsci-storage-adapter-can-reduce-cpio-by-1030.html">a VROOM! article</a> we often pair with such a white paper.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2010/02/04/pvscsi-and-low-io-workloads/feed/</wfw:commentRss>
		<slash:comments>27</slash:comments>
		</item>
		<item>
		<title>Solid State Disks and Host Swapping</title>
		<link>http://vpivot.com/2009/12/24/solid-state-disks-and-host-swapping/</link>
		<comments>http://vpivot.com/2009/12/24/solid-state-disks-and-host-swapping/#comments</comments>
		<pubDate>Fri, 25 Dec 2009 01:15:45 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[esxtop]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[ssd]]></category>
		<category><![CDATA[swap]]></category>
		<category><![CDATA[vmkernel]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=183</guid>
		<description><![CDATA[Recently I have been thinking, talking, and writing about ESX host memory swapping a lot.  ESX swaps memory under the same conditions that traditional operating systems do; the application(s) is using more memory than available on the physical hardware.  Host swapping is an unavoidable consequence of this condition, whether virtualization is present or not. But [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I have been thinking, talking, and <a href="http://vpivot.com/2009/12/23/your-performance-enemy-host-swapping/">writing</a> about ESX host memory swapping a lot.  ESX swaps memory under the same conditions that traditional operating systems do: the applications are using more memory than is available on the physical hardware.  Host swapping is an unavoidable consequence of this condition, whether virtualization is present or not.</p>
<p><span id="more-183"></span>But <a href="http://communities.vmware.com/blogs/chethank/2009/12/22/using-solidstate-drives-to-improve-performance-of-sql-databases-on-vsphere-hosts-when-memory-is-overcommitted">a recent article</a> by my engineering colleague Chethan Kumar shows an avenue that allows VI admins to aggressively over-commit memory and avoid the catastrophic performance penalty of swapping: use solid state disks to host ESX swap files.</p>
<p>The fundamental problem with host swapping comes from the high latency of traditional disks compared to memory.  Data can be retrieved from memory in nanoseconds but takes milliseconds to fetch from a hard drive.  That means a single 4K memory page takes on the order of 100,000 times longer to retrieve if the operating system swapped it out.</p>
<p>What solid state disks bring to this problem is exceptionally low latency compared to traditional drives.  The SSD that Chethan used showed microsecond latencies, about 1,000 times lower than those of spinning disks.  This means that time spent waiting for swap activity* drops to roughly 0.1% of the time spent swapping to spinning disks.</p>
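<p>The arithmetic behind these ratios is worth spelling out.  The sketch below uses assumed round numbers (100 ns for a memory access, 10 ms for a hard drive access) that are typical orders of magnitude rather than measurements from Chethan&#8217;s article.  Only the 1,000-times figure for the SSD comes from the text above.</p>

```python
MEMORY_NS = 100              # assumed memory access time: 100 ns
DISK_NS = 10_000_000         # assumed hard drive access time: 10 ms
SSD_NS = DISK_NS // 1000     # "about 1,000 times lower" than the disk

disk_vs_memory = DISK_NS // MEMORY_NS     # penalty of swapping to disk
ssd_wait_fraction = SSD_NS / DISK_NS      # SSD wait relative to disk wait
ssd_vs_memory = SSD_NS // MEMORY_NS       # penalty that remains with SSD

print(f"disk is {disk_vs_memory:,}x slower than memory")
print(f"SSD swap wait is {ssd_wait_fraction:.1%} of disk swap wait")
print(f"SSD is still {ssd_vs_memory:,}x slower than memory")
```

<p>Under these assumptions the disk penalty works out to 100,000 times and the SSD wait to 0.1% of the disk wait.  The last line is a reminder that a swapped page on SSD remains much slower than a page in memory, so SSD swap reduces the penalty rather than removing it.</p>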
<p>Fast swap files matter because they enable administrators to over-commit memory more aggressively.  Today our admins rightfully fear the VMs&#8217; aggregate active memory exceeding the available physical memory, which results in swapping.  SSD technology in shared storage such as EMC&#8217;s new CLARiiONs now allows our admins to cleverly place swap files and drive up memory utilization to previously unheard-of levels.  This may enable standard memory overcommitment of 200% or more, with extreme over-commit being much higher than this.</p>
<p>In future versions of ESX we want to automate the usage of SSDs to maximize the use of available memory.  But that&#8217;s a roadmap discussion that I will leave for another day.</p>
<p>(*) This swap wait time has conveniently been added to ESX 4&#8217;s version of esxtop under the counter %SWPWT.  See <a href="http://communities.vmware.com/docs/DOC-9279">Interpreting esxtop Statistics</a> for more information.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2009/12/24/solid-state-disks-and-host-swapping/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Your Performance Enemy: Host Swapping</title>
		<link>http://vpivot.com/2009/12/23/your-performance-enemy-host-swapping/</link>
		<comments>http://vpivot.com/2009/12/23/your-performance-enemy-host-swapping/#comments</comments>
		<pubDate>Wed, 23 Dec 2009 18:43:57 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[memory]]></category>
		<category><![CDATA[swap]]></category>
		<category><![CDATA[vmkernel]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=178</guid>
		<description><![CDATA[Three times in the past week I have engaged in challenging discussions on host memory swapping and its impact to performance.  If you read my article on host swapping and the whitepaper it summarized, you know the deleterious effect on performance caused by host swapping.  When reading the paper, one of our most astute customers [...]]]></description>
			<content:encoded><![CDATA[<p>Three times in the past week I have engaged in challenging discussions on host memory swapping and its impact on performance.  If you read <a href="http://vpivot.com/2009/09/25/esx-memory-management-ballooning-rules/">my article on host swapping</a> and <a href="http://www.vmware.com/resources/techresources/10062">the whitepaper it summarized</a>, you know the deleterious effect on performance caused by host swapping.  When reading the paper, one of our most astute customers saw a sentence that gave him pause:<br />
<span id="more-178"></span></p>
<blockquote><p>ESX attempts to mitigate the impact of interacting with guest operating system memory management by randomly selecting the swapped guest physical pages.</p></blockquote>
<p>This customer has read some of our other documentation and knows that the vSphere client and esxtop report active and touched memory, each representing a kind of working set.  &#8220;So,&#8221; the customer asks me, &#8220;if ESX is keeping working sets, why would it choose swap pages randomly instead of selecting inactive pages outside of the working set?&#8221;  If ESX could choose only inactive pages then the penalty due to host swapping would drop greatly.</p>
<p>As it turns out, ESX does not track working sets.  For ESX to know which pages are actively being read it would have to trap every memory access, which would greatly hurt performance.  Instead, we track a small sample of the host&#8217;s memory&#8211;exactly 100 pages&#8211;to extrapolate the size of active memory.  Because the error of a sample-based estimate shrinks with the square root of the sample size, a 100-page sample delivers a reasonably accurate estimate of active memory.  But, because it is only a sample, the activity on the great majority of pages is a mystery.</p>
<p>This means that ESX has absolutely no information on the read and write behavior of most memory on the system.  Guest operating systems fare no better in tracking working sets.  But because they can categorize pages based on usage&#8211;application heap, buffers for IO, kernel memory, etc.&#8211;guest OSes can make better decisions about which pages should be swapped and which should not.  For this reason the balloon driver can induce non-harmful guest paging which is superior to host swapping.</p>
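<p>The sampling idea can be sketched in a few lines of Python.  This is a statistical illustration only, not ESX code: the page count, the 30% active fraction, the seed, and the function names are assumptions.  It uses the standard result that the standard error of an estimated proportion p from n random samples is sqrt(p(1-p)/n).</p>

```python
import math
import random


def estimate_active_fraction(is_active, sample_size, rng):
    """Estimate the active fraction by inspecting a random sample of pages."""
    sampled = rng.sample(range(len(is_active)), sample_size)
    return sum(is_active[i] for i in sampled) / sample_size


def standard_error(fraction, sample_size):
    """Standard error of a proportion estimated from a random sample."""
    return math.sqrt(fraction * (1 - fraction) / sample_size)


# Hypothetical memory image: 100,000 pages, 30% of them active.
is_active = [True] * 30_000 + [False] * 70_000
rng = random.Random(42)   # fixed seed so the run is repeatable

estimate = estimate_active_fraction(is_active, 100, rng)
print(f"estimated active fraction from 100 pages: {estimate:.2f}")
print(f"standard error at n=100: {standard_error(0.30, 100):.3f}")
print(f"standard error at n=400: {standard_error(0.30, 400):.3f}")
```

<p>Quadrupling the sample only halves the error, which is why a small sample is an attractive trade-off.  The sample says how much memory is active, but nothing about which of the unsampled pages are active, so it cannot guide the choice of pages to swap.</p>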
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2009/12/23/your-performance-enemy-host-swapping/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Four Things You Should Know About ESX 4&#039;s Scheduler</title>
		<link>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/</link>
		<comments>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/#comments</comments>
		<pubDate>Wed, 30 Sep 2009 06:00:18 +0000</pubDate>
		<dc:creator>drummonds</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[cpu]]></category>
		<category><![CDATA[scheduler]]></category>
		<category><![CDATA[vmkernel]]></category>
		<category><![CDATA[vmmark]]></category>
		<category><![CDATA[vsphere]]></category>

		<guid isPermaLink="false">http://vpivot.com/?p=11</guid>
		<description><![CDATA[[This is the last re-post of old community content.  But this content is important enough to be worth a re-post.] I spend a great deal of time answering customers&#8217; questions about the scheduler. Never have so many questions been asked about such an abstruse component for which so little user influence is possible. But CPU [...]]]></description>
			<content:encoded><![CDATA[<p><em>[This is the last <a href="http://communities.vmware.com/blogs/drummonds/2009/08/21/four-things-you-should-know-about-esx-4s-scheduler">re-post of old community content</a>.  But this content is important enough to be worth a re-post.]</em></p>
<p>I spend a great deal of time answering customers&#8217; questions about the scheduler.  Never have so many questions been asked about such an abstruse component for which so little user influence is possible.  But CPU scheduling is central to system performance, so VMware strives to provide as much information on the subject as possible.  In this blog entry, I want to point out a few nuggets of information on the CPU scheduler.  These four bullets answer 95% of the questions I get asked.</p>
<p><span id="more-11"></span></p>
<h2>Item 1: ESX 4&#8217;s Scheduler Better Uses Caches Across Sockets</h2>
<p>On UMA systems at low load levels, virtual machine performance improves when each virtual CPU (vCPU) is placed on its own socket.  This is because providing each vCPU its own socket also gives it the entire cache on that CPU.  On page 18 of a <a class="jive-link-external" href="http://www.vmware.com/files/pdf/perf-vsphere-cpu_scheduler.pdf">recent paper on the scheduler written by Seongbeom Kim</a>, a graph highlights the case where vCPU spreading improves performance.</p>
<p><img class="jive-image-thumbnail jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6674/Picture+2.png" alt="Picture 2.png" width="620" /></p>
<p>The X-axis represents different combinations of VM and vCPU counts.  SPECjbb is memory intensive and shows great gains with increases in CPU cache.  The few cases that show dramatic benefit due to the ESX 4.0 scheduler are benefiting from the distribution of vCPUs across sockets.  Very large gains are possible in this somewhat uncommon case.</p>
<h2>Item 2: Overuse of SMP Only Slows Consolidated Environments At Saturation</h2>
<p>For years customers have asked me how many vCPUs they should give to their VMs.  The best guidance, &#8220;as few as possible&#8221;, seems too vague to satisfy.  It remains the only correct answer, unfortunately.  But <a class="jive-link-external" href="http://blogs.vmware.com/performance/2009/06/measuring-the-cost-of-smp-with-mixed-workloads.html">a recent experiment performed by Bruce Herndon&#8217;s team</a> sheds some light on this VM sizing question.</p>
<p>In this experiment we ran VMmark against VMs that were configured outside of VMmark specifications.  In one case some of the virtual machines were given too few vCPUs and in another they were given too many.  Because VMmark&#8217;s workload is fixed, increasing the VMs&#8217; sizes does not increase the work performed by the VMs.  In other words, the system&#8217;s score does not depend on the VMs&#8217; vCPU count.  Until CPU saturation, that is.</p>
<p><img class="jive-image-thumbnail jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6675/Picture+3.png" alt="Picture 3.png" width="620" /></p>
<p>Notice that the scores are similar between the undersized, right-sized, and over-sized VMs.  Up until tile 10 (60 VMs) they are nearly identical.  There is a slight difference in processor utilization that begins to impact throughput (score) as the system runs out of CPU.  At that point the additional vCPUs waste cycles, which degrades system performance.  Two points I will call out from this work:</p>
<ul>
<li>Sloppy VI admins who provide too many vCPUs need not worry about performance when their servers are under low load.  But performance will suffer when CPU utilization spikes.</li>
<li>The penalty of over-sizing VMs gets worse as VMs get larger.  Using a 2-way VM is not that bad, but unneeded use of 4-way VMs when one or two processors suffice can cost up to 15% of your system throughput.  I presume that unnecessarily using eight vCPUs would be criminal.</li>
</ul>
<h2>Item 3: ESX Has Not Strictly Co-scheduled Since ESX 2.5</h2>
<p>I have documented ESX&#8217;s relaxation of co-scheduling previously (<a class="jive-link-wiki" href="http://communities.vmware.com/docs/DOC-4960">Co-scheduling SMP VMs in VMware ESX Server</a>).  But this statement cannot be repeated too frequently: ESX has not strictly co-scheduled virtual machines since version 2.5.  This means that ESX can place vCPUs from SMP VMs individually.  It is not necessary to wait for physical cores to be available for every vCPU before starting the VM.  However, as Item 2 pointed out, this does not give you free license to over-size your VMs.  Be frugal with your SMP VMs and assign vCPUs only when you need them.</p>
<h2>Item 4: The Cell Construct Has Been Eliminated in ESX 4.0</h2>
<p>In the performance best practices deck that I give at conferences I talk about the benefits of creating small virtual machines over large ones.  In versions of ESX up to ESX 3.5, the scheduler used a construct called a cell that would contain and lock CPU cores.  The vCPUs from a single VM could never span a cell.  With ESX 3.x&#8217;s cell size of four this meant that VMs never spanned multiple four-core sockets.  Consider this figure:</p>
<p><img class="jive-image" src="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png" alt="http://communities.vmware.com/servlet/JiveServlet/downloadImage/38-4886-6688/Picture+1.png" /></p>
<p>What this figure shows is that a 4-way VM on ESX 3.5 can only be placed in two locations on this hypothetical two-socket configuration.  There are 12 combinations for a 2-way VM and eight for a uniprocessor VM.  The scheduler has more opportunities to optimize VM placement when you provide it with smaller VMs.</p>
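<p>The placement counts quoted above follow from simple combinatorics, sketched here in Python.  The function names are made up for illustration, and the model is deliberately naive: it treats every set of distinct cores inside one cell as a valid placement and ignores everything else the scheduler considers.</p>

```python
from math import comb


def placements_with_cells(vcpus, cores, cell_size):
    """Ways to pick cores for a VM when all vCPUs must sit in one cell."""
    cells = cores // cell_size
    return cells * comb(cell_size, vcpus)


def placements_without_cells(vcpus, cores):
    """Ways to pick cores for a VM when any set of cores is allowed."""
    return comb(cores, vcpus)


CORES = 8        # two four-core sockets, as in the figure
CELL_SIZE = 4    # cell size of four

for vcpus in (4, 2, 1):
    print(f"{vcpus}-way VM: "
          f"{placements_with_cells(vcpus, CORES, CELL_SIZE)} placements with cells, "
          f"{placements_without_cells(vcpus, CORES)} without")
```

<p>With cells the counts are 2, 12, and 8, matching the figure.  Without the cell restriction the same naive count gives 70, 28, and 8, which hints at how much freedom the scheduler gains for larger VMs once cells are gone.</p>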
<p>In ESX 4 we have eliminated the cell lock so VMs can span multiple sockets, as Item 1 states.  Continue to think of this placement problem as a challenge to the scheduler that you can alleviate.  By choosing multiple, smaller VMs you free the scheduler to pursue opportunities to optimize performance in consolidated environments.</p>
]]></content:encoded>
			<wfw:commentRss>http://vpivot.com/2009/09/29/four-things-you-should-know-about-esx-4s-scheduler/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

