Ned’s BigFaT Blog!

March 28, 2008

Would you like some cheese with that whine?

Filed under: Uncategorized — makfu @ 2:02 am

So today’s topic is: can DirectX 10 be ported to XP?

This has been reported, incorrectly, an incredible number of times. What started as a technical misunderstanding regarding Vista’s core graphics stack has led to a plethora of conspiracy theories and the notion that Microsoft could “easily” implement DX 10 on top of XP. That the technically inept “tech media” actually propagated this nonsense gave this theory a sense of legitimacy.

I think the most fundamental problem with this discussion is that most people have little to no understanding of how radically different the WDDM driver model and the new DirectX Graphics Kernel are compared to their predecessors. In the old model, the kernel mode “miniport” driver was responsible for implementing all GPU management, including scheduling and memory management. In Vista the DXGK is responsible for this work and is the arbiter for all pipelines rendering to the display (DX9, DX10, OGL ICD, GDI). This major overhaul was necessary to support a number of key features in D3D10 and DX9Ex.

Here is one good example of why Microsoft needed to reengineer the stack for D3D10: D3D10 supports geometry shaders that can procedurally generate new primitives based on properties of existing ones. This means that, with limitations, you can create procedurally generated geometry on the GPU. The best analogy I can come up with is an origami crane: you start with a basic primitive and, by applying geometry shader instructions, you generate a more complex shape, just like following folding instructions until you get a crane (though it’s essentially additive, not subtractive).
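To make the "one primitive in, several out" idea concrete, here is a toy Python sketch. It is not shader code and says nothing about how D3D10 actually schedules the work; the `amplify` and `run_stage` names are made up for illustration, and the midpoint subdivision is just one arbitrary way to turn one triangle into four.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Triangle = Tuple[Point, Point, Point]


def midpoint(a: Point, b: Point) -> Point:
    """Return the point halfway between a and b."""
    return ((a[0] + b[0]) / 2.0, (a[1] + b[1]) / 2.0)


def amplify(tri: Triangle) -> List[Triangle]:
    """Toy stand-in for a geometry shader: one triangle in, four out."""
    a, b, c = tri
    ab, bc, ca = midpoint(a, b), midpoint(b, c), midpoint(c, a)
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]


def run_stage(triangles: List[Triangle], passes: int) -> List[Triangle]:
    """Apply the amplification step repeatedly to every triangle."""
    for _ in range(passes):
        triangles = [out for tri in triangles for out in amplify(tri)]
    return triangles


base = [((0.0, 0.0), (1.0, 0.0), (0.0, 1.0))]
print(len(run_stage(base, 3)))  # 4 ** 3 = 64 triangles from a single input
```

The point is only the growth: three passes turn one triangle into 64, and all of that new geometry exists only on the "GPU side" of the sketch.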

With this capability comes the possibility of much larger and longer running shader programs on the GPU that can generate much more content in a far more efficient and parallelized pipeline (since it’s no longer, for example, one triangle in, one triangle out). This means D3D10 has the ability to generate really complex stuff on screen in real time, but that also means more stuff, period. This leads to greater usage of framebuffer memory and the need to manage execution on the GPU since these shader programs are potentially much more complex (if the GPU is stuck in one thread’s shader code, it could prevent another thread from running, which could be bad – though today’s DXGK/WDDM doesn’t do command stream preemption, that’s a WDDM 2 feature). Thus, the need for a new underlying framework in the form of framebuffer virtualization and GPU scheduling, implemented in the new, aptly named, DirectX Graphics Kernel, and the new driver model that “plugs” into it, the WDDM. WDDM and DXGK make up the core of DirectX 10.

Now, everything written above is extrapolated from discussing the requirements of just one, albeit major, feature of the D3D10 API (a feature that, by the way, isn’t used in a single shipping program because today’s DX10 hardware isn’t optimized for the new geometry shader functionality in DX10). There are, however, more than a few other major changes and features (like advanced instancing, actually used in certain games) in D3D10 that also leverage new core functionality that, when combined, further drove the decisions regarding what that underlying core graphics architecture had to be and what features it had to support.

Even more interesting is that there are some other important, non D3D10 specific reasons why we want framebuffer virtualization and GPU scheduling and all that good stuff. The most obvious is multiple discrete on-screen apps (not just threads within an app’s process) using the GPU. This is becoming a common scenario, a good example of which is running Aero while also running Microsoft Virtual Earth 3D and maybe Chess in Vista =). In the future, probably everything will be rendered to the 3D pipeline via application frameworks like WPF. This, combined with desktop composition’s (DWM/Aero) current need to allocate shared D3D surfaces, all points to a future where the DXGK/WDDM features aren’t just nice to have, but really are necessary.

One final particularly important point: D3D9 isn’t emulated, it’s a reimplementation (forward port, if you like) of the API (libraries) on top of the new DXGK/WDDM infrastructure. Let me repeat that, DX9 on Vista was essentially a bottom up rewrite of the DX9 libraries for WDDM. In the process of re-implementing DX9, the DX folks also added the features above that could only easily be implemented on top of DXGK/WDDM, like cross-process shared surfaces (used by DWM/Aero, for example). With Vista, DX9 is quite a bit more advanced than XP’s version.

So DX10’s advanced features (specifically the D3D API functionality) depend on a lot of functionality implemented via some very advanced core DX10 OS components that simply don’t exist in the old Windows XP DirectX and graphics driver model. Regardless of whether DX10 is back portable, it is NOT a trivial matter, as even forward porting DX9 to the WDDM model was a huge undertaking. Quite simply, it wouldn’t be a port of DX10; it would be a whole new implementation of the libraries on the old display driver model OR you would end up back porting the entire WDDM/DXGK infrastructure. Not trivial and, practically speaking, not feasible.

March 27, 2008

I reject your reality and will substitute my own

Filed under: Uncategorized — makfu @ 2:23 am

Man, the anti-Vista brigade just can’t stop themselves. Every time I turn around I hear some other bit of nonsense about the product. So I am going to voice my highly opinionated views over the next few days on several arguments I have seen lately.

Today’s topic: is Vista SP1 slower than Server 2008 in benchmarks?

Like Duh! The server product is geared to be stripped down and is optimized out of the box for server workloads. This, by the way, makes it well suited for running timed benchmarks. This does not mean it is specifically well suited to interactive, single user workloads.

One of the observations made by one widely quoted blog was that when configured in an equivalent fashion, Vista is about 17% slower than 2008. The problem with this statement, besides a lack of detailed configuration information cited in the blog, is that despite being binary identical, the out of box configuration is quite different between the systems (and yes, all common components between Vista SP1 and Server 2008 are binary identical and the “slipstream” ISO that is available was actually built as a complete build at the same time as Server 2008’s ISOs).

So, for example, Server 2008 doesn’t have superfetch or the Windows Search indexer enabled EVEN if you do install the “desktop experience” feature. Enabling Windows Search requires installing additional role services, and superfetch requires delving into the registry (and is strictly not supported on any server configuration because of the impact on server applications and multi-user configurations). But even assuming that these services are disabled in the OS instances used for testing, there are additional lower-level differences in the run-time optimizations between the two OSes.

For example, the server default for performance settings is to favor background services over foreground applications, which has a profound impact on processor scheduling. Specifically, the Vista default of Optimize Performance for Applications enables short, variable-length quantums (time slices) and gives a high foreground boost by using longer quantums for foreground interactive processes. In contrast, the server default of Optimize Performance for Background Services provides a long quantum that is fixed for all threads (i.e. no foreground boost).
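If you want to see what that setting looks like as data, here is a rough Python sketch. As far as I recall, the choice is stored in the Win32PrioritySeparation registry value as a small bitfield; the bit layout and the two example values (0x26 for the client-style setting, 0x18 for the server-style one) are from memory rather than from anything stated above, so treat them as assumptions and check them against your own systems. The function name is made up for illustration.

```python
def decode_priority_separation(value: int, is_server: bool) -> dict:
    """Decode a 6bit Win32PrioritySeparation-style value (layout is an assumption)."""
    length_bits = (value >> 4) & 0b11       # 1 = long, 2 = short, 0 or 3 = product default
    variability_bits = (value >> 2) & 0b11  # 1 = variable, 2 = fixed, 0 or 3 = product default
    boost_index = value & 0b11              # how much extra quantum the foreground gets

    if length_bits == 1:
        length = "long"
    elif length_bits == 2:
        length = "short"
    else:
        length = "long" if is_server else "short"

    if variability_bits == 1:
        variable = True
    elif variability_bits == 2:
        variable = False
    else:
        variable = not is_server

    return {
        "quantum": length,
        "variable": variable,
        "foreground_boost_index": boost_index,
    }


# Assumed example values: "applications" style vs. "background services" style.
print(decode_priority_separation(0x26, is_server=False))
print(decode_priority_separation(0x18, is_server=True))
```

Under those assumptions the first value decodes to short, variable quantums with the maximum foreground boost, and the second to long, fixed quantums with no boost, which is exactly the difference described above.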

If all these options aren’t configured identically, if the service count and configuration aren’t identical and if all drivers aren’t 100% identical, then performance could differ greatly between two systems, even though they are based on the same base binaries.

Most importantly, it’s irrelevant to compare server versus client OSes because benchmarks do not tell a complete story. For example, during multiple benchmark runs, the basic demand-page system caching model that is used in server (or Vista if superfetch is disabled), versus the proactive paging enabled by superfetch, acts as a major potential differentiator because in subsequent runs, a large quantity of code and data becomes cached in the system cache (standby page list). Performance in scenarios where users load common large files or applications in a variable order will highlight the advantages of a system like superfetch (especially in very large memory configurations) because the system will proactively begin loading commonly used pages (code and data) once the system boots or once memory becomes free, based on a usage profile that the superfetch system develops over time.

Put simply, Vista is tuned to try and scale performance the way users work. Users don’t work like benchmarks; most open and close files and applications randomly and have usage patterns that are not strictly linear. Server workloads, however, look very much like benchmark runs – a series of actions (launching processes and loading files in a repeatable sequence), so a server OS default install (with superfetch disabled) will most likely run benchmarks faster than a default Vista install, and most benchmarks would not benefit at all from proactive paging (caching).
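A toy simulation shows why the two caching approaches look so different depending on the workload. This is a deliberately crude Python sketch: the cache is unbounded, "pages" are just strings, and the usage profile is hard-coded. None of it reflects how the real memory manager or superfetch is implemented; the names are made up for illustration.

```python
def run_workload(accesses, preloaded=()):
    """Count cache hits and misses; a miss faults the page in on first touch."""
    cache = set(preloaded)
    hits = misses = 0
    for page in accesses:
        if page in cache:
            hits += 1
        else:
            misses += 1
            cache.add(page)  # demand paging: load only when actually touched
    return hits, misses


# A benchmark repeats the same sequence, so plain demand paging warms up
# after the first pass and every later pass is served from cache.
benchmark = ["a", "b", "c", "d"] * 3
print(run_workload(benchmark))  # (8, 4)

# A user opening things after boot: nothing has been touched yet.
session = ["word", "mail", "photo"]
profile = {"word", "mail", "browser"}  # hypothetical learned usage profile
print(run_workload(session))                     # (0, 3) with demand paging only
print(run_workload(session, preloaded=profile))  # (2, 1) with proactive loading
```

The repeated benchmark gets nothing extra from preloading after its first pass, while the one-off user session gets most of its pages for free.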

With that said, I will be running some benchmarks this weekend to illustrate the above topics, using a common (real hardware) platform to evaluate performance between identical Vista and Server 2008 installs.

The topic for tomorrow? Can DX10 be ported to XP…

March 7, 2008

Prepare to be Rocked…

Filed under: Uncategorized — makfu @ 3:36 am

One of the recurring bits of misinformation that I see floating about message forums is how OS X has supposedly better “64bit support” with comments stating that OS X Leopard is “more 64bit” than Vista. I find this assertion amusing because it is, contrary to Apple marketing, completely wrong.

First, let me state that, their Windows software notwithstanding, Apple makes terrific products and OS X is an excellent operating system. However, the question at hand is whether OS X Leopard is a 64bit OS, and the answer to that is an unequivocal no.

To make myself clear, I define the OS as the core kernel, drivers, system services, shell and primary UI libraries. An OS can support application code of different word lengths, via subsystems, without actually being natively said word length. For example, DOS, via the DOS/4GW runtime environment, could run 32bit protected mode code; however, this did not make DOS itself a 32bit OS (though some would claim that DOS/4GW provided so many functions that it was itself an OS).

Some operating systems, like Windows 9x, are legitimately hybrid systems, as the Windows 9x core kernel was derived from a v86 VMM (virtual machine monitor that, yes, you could almost call a hypervisor). The Windows V86 VMM was a true 32bit preemptive multitasking, virtual memory, Ring 0 kernel that managed 1 “system VM”, where your 16bit Windows apps and 16bit Windows OS code lived, and other DOS v86 VMs. In fact, even as far back as Windows 2.11 /386, the system VM (where your 16bit Windows apps ran using cooperative multitasking) was preemptively multitasked alongside all DOS applications running in their separate DOS “box” v86 virtual machines.

The VMM was extended in Windows95 (not by much) to support preemptive thread scheduling and memory management for 32bit protected-mode Windows processes. This meant that Windows 9x was not a pure 32bit system, since much functionality was derived from the 16bit components in the system VM and, even in a few occurrences, DOS int21h and BIOS int13h 16bit real mode functions were invoked. Because of its 32 bit kernel, however, it was also not a 16bit OS in the strict sense and is correctly described as a hybrid (or, as I prefer, a big fat kludge).

Now what about 64bit Windows NT based operating systems, such as Vista and Server 2008 x64? Are they a kludge similar to Win9x, given they run both 32bit and 64bit code? The short answer is no. The long answer requires that we delve into how x86-64, or as it’s more commonly called, x64, works.

The x64 CPU supports 3 modes of operation: Real Mode – the legacy x86 segmented memory model used by DOS; Protected Mode – the 32bit linear address space mode with hardware memory management introduced with the 80386 (also includes 16bit 286 protected mode and v86 mode); and Long Mode – the 64bit linear address space mode, also with hardware memory management, introduced by AMD with the Hammer architecture based CPUs.

Long Mode is interesting because, when active, it actually encapsulates 32bit Protected Mode and 64bit Native Long Mode. When a 64bit OS switches the CPU to Long Mode, the first stop is an intermediate “compatibility sub-mode”. This sub-mode is essentially identical to “legacy” 32bit Protected Mode, but without the virtual-8086 (v86) sub-mode support used to run DOS/BIOS 16bit Real Mode code in a Protected Mode, 32bit OS. It is a further step to actually switch the CPU to full 64bit Long Mode, but this step is actually a critical part of AMD’s well thought out compatibility strategy. By allowing a nearly unmodified 32bit code-base to run in a default 32-bit “sub-mode”, AMD solved many problems for OS developers and, as we shall see in a few paragraphs, actually made it possible for Apple, and others, to get to a 64bit world via an elegant shortcut.

But first, how does Windows support x64? 64bit (x64) variants of XP, Server 2003, Server 2008 and Vista run in full 64bit Long Mode, meaning that the system boots all the way to full Long Mode, supports and uses pointers 64bits in length and consequently supports 64bit virtual addressing along with 64bit datatypes (for example, in the Windows data model, LLP64, long long is natively a 64bit data type). Furthermore, all data registers used are 64bits in length and 8 additional general purpose and XMM registers are available for use via the full Long Mode instruction set architecture. This is a point that needs repeating: with 64bit Windows, the OS, kernel, drivers, shell and all major libraries (Win32, COM, GDI, DirectX, .Net, etc.) are all true, native 64bit code, all running in 64bit full Long Mode with access to the 8 extra GP and XMM registers and gobs of address space. Top to bottom, 64bit Windows (NT) is a 64bit OS.
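Since LLP64 came up, here is a small Python table of the three C data models that matter for this discussion, with sizes in bytes. ILP32 is the usual 32bit model, LLP64 is what 64bit Windows uses, and LP64 is what 64bit Unix-like systems use. The `changed_types` helper is just something I made up to show what actually changes when you move between models.

```python
# Sizes in bytes of the basic C types under each data model.
DATA_MODELS = {
    "ILP32": {"int": 4, "long": 4, "long long": 8, "pointer": 4},
    "LLP64": {"int": 4, "long": 4, "long long": 8, "pointer": 8},
    "LP64":  {"int": 4, "long": 8, "long long": 8, "pointer": 8},
}


def changed_types(old: str, new: str) -> list:
    """Return the types whose size differs between two data models."""
    return [
        name
        for name, size in DATA_MODELS[old].items()
        if DATA_MODELS[new][name] != size
    ]


print(changed_types("ILP32", "LLP64"))  # ['pointer']
print(changed_types("ILP32", "LP64"))   # ['long', 'pointer']
```

In other words, going from 32bit Windows to 64bit Windows only widens pointers, which is a big part of why so much Win32 code recompiles with relatively little pain.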

So, if the OS is 64bit, how does it run 32bit applications? Well, first of all, x64 versions of Windows do not use an “emulation” environment, such as the NTVDM used in 32bit Windows NT based operating systems for running 16bit code. Instead, 64bit NT uses a translation layer called WOW64, which leverages a very cool feature of the x64 architecture that AMD had the foresight to add when developing x64, namely the ability, once the CPU is in Long Mode, to dynamically switch the CPU’s sub-mode between 32bit compatibility (i.e. protected) mode and full Long Mode based on the code segment (CS) value loaded in the CS register.

In 64bit Windows, this works as follows: when loading a 32bit process, 64bit DLLs named NTDLL.DLL, WOW64.DLL and WOW64WIN.DLL are loaded into the address space. WOW64.DLL then loads a 32bit version of ntdll.dll and calls its initialization routine, which loads all the required DLLs for the application, including 32bit system DLLs that, if they make system calls, are modified to call into WOW64.DLL (or WOW64WIN.DLL) rather than the standard call path. For all 32bit code loaded in the address space, the memory manager sets the L&D bits of the Code Segment to its corresponding 32bit mode indicator (L0, D1). Whenever the program begins executing, e.g. there is a context switch to a thread executing 32bit code in the program’s process, that code segment value is loaded into the CS register, per the transfer of control, and the CPU switches on the fly to 32bit Compatibility Mode.

When a 32bit program needs to make a system call or, as is more often the case, a function in a system DLL needs to make a core OS function call, the modified DLL calls the WOW64 stub libraries, which, based on loading the CS value for WOW64, causes the CS register values used for CPU mode selection to be set to L&D values L1, D0. From this point on, until returning execution back to 32bit code, the processor is in 64bit long mode. Before passing (thunking) the system call, WOW64 also performs stack translation for 32bit values/arguments.
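The L and D bits described above can be read straight out of a code segment descriptor. Here is a small Python sketch that does so. The bit positions (L at bit 53, D at bit 54 of the 8 byte descriptor) follow the AMD64 descriptor layout as I understand it, and the two descriptor values are generic flat code segments of the kind you see in OS development tutorials. They are illustrative assumptions, not values dumped from a Windows GDT.

```python
def decode_l_d(descriptor: int) -> tuple:
    """Extract the L and D bits from a raw 64bit code segment descriptor."""
    l_bit = (descriptor >> 53) & 1
    d_bit = (descriptor >> 54) & 1
    return l_bit, d_bit


def long_mode_submode(l_bit: int, d_bit: int) -> str:
    """Map the L/D bits to the sub-mode the CPU uses while in Long Mode."""
    if l_bit == 1 and d_bit == 0:
        return "64bit mode"
    if l_bit == 0 and d_bit == 1:
        return "compatibility mode (32bit code)"
    if l_bit == 0 and d_bit == 0:
        return "compatibility mode (16bit code)"
    raise ValueError("L=1 with D=1 is a reserved combination")


FLAT_CODE_32 = 0x00CF9A000000FFFF  # illustrative descriptor with L0, D1
FLAT_CODE_64 = 0x00AF9A000000FFFF  # illustrative descriptor with L1, D0

print(long_mode_submode(*decode_l_d(FLAT_CODE_32)))
print(long_mode_submode(*decode_l_d(FLAT_CODE_64)))
```

Loading a selector that points at the first descriptor drops the CPU into compatibility mode, and loading one that points at the second puts it back into full 64bit mode, which is all the mode switch in WOW64 really comes down to.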

So, from the above description, you can now understand how a 64bit Long Mode OS, on x64, executes 32bit protected mode code, and it should be fairly clear how x64 CPUs make the transition between 32 and 64bit modes. I will add that the same WOW64 subsystem is used on the other major 64bit platform supported by Windows, IA64 (Itanium), with an additional DLL that provides x86 to IA64 instruction translation, making the Itanium version of WOW64 an actual emulation environment, versus a translation (thunking) layer as it is in the x64 version of Windows.

Okay, so how do other OSes, such as OS X Leopard, run 64bit code on a 32bit kernel? The answer is the exact inverse of what is described above. Using OS X as an example, the “XNU” kernel is not 64bit code; it is a 32bit PAE enabled kernel and, as discussed a few paragraphs back, a 32bit system can bootstrap into Long Mode’s 32bit compatibility mode. Furthermore, none of the XNU/Darwin kernel mode drivers or components are 64bit, as one can’t safely mix x86 and x64 code in a common address space due to differences in pointer lengths (32bit vs. 64bit), argument/variable management (stack vs. register) and register counts (8 vs. 16), except when it’s very tightly controlled, such as the WOW64 DLLs mapped into a 32bit process’s user-mode address space. Doing so in kernel mode would be very dangerous and would make kernel mode code extremely difficult to debug.

However, the OS does support 64bit processes and has certain libraries that are coded as 64bit native for supporting 64bit programs (processes). Just as Windows switches the CPU back to 64bit long mode via WOW64 when an application makes a system call, a system call from a 64bit long mode process in OS X will cause whatever library invokes the system call to follow a call path that results in some code specifying the CS for a 32bit code segment descriptor, thus setting the L&D bits of the Code Segment register to its corresponding 32bit mode indicator (L0,D1). Subsequently, a function, prior to passing arguments to the native kernel mode system call, will truncate/reformat any 64bit values that are being passed to 32bit x86 compatible values. Returning from the system call back to the 64bit process, causes CS register values to be set to L1,D0 and the processor is magically back in full 64bit Long Mode.
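The argument conversion at the boundary is the unglamorous part, so here is a byte-level Python sketch of what widening and narrowing a single integer argument looks like. It is only an illustration of the idea using the standard `struct` module; the function names are made up, and real thunking layers deal with whole structures, pointers and calling conventions, not just one value.

```python
import struct


def widen_32_to_64(raw32: bytes) -> bytes:
    """Zero-extend a little-endian 32bit value to 64 bits (32bit caller, 64bit callee)."""
    (value,) = struct.unpack("<I", raw32)
    return struct.pack("<Q", value)


def narrow_64_to_32(raw64: bytes) -> bytes:
    """Shrink a little-endian 64bit value to 32 bits, refusing to lose data."""
    (value,) = struct.unpack("<Q", raw64)
    if value > 0xFFFFFFFF:
        raise OverflowError("value does not fit in 32 bits")
    return struct.pack("<I", value)


arg32 = struct.pack("<I", 0x1000)
print(widen_32_to_64(arg32).hex())   # 0010000000000000
print(narrow_64_to_32(widen_32_to_64(arg32)) == arg32)  # True
```

Widening is always safe. Narrowing is where the care goes, because a 64bit process can hand over values (addresses especially) that simply have no 32bit representation.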

Now, if that sounded disparaging of OS X, then you have bought into the “more bits is better” “measuring” contest. Reality is that the benefits of a fully 64bit OS are dependent on a lot of factors. Apple using the x64 architecture in the manner they did is a valid way to support 64bit applications. It is interesting to note that, once upon a time, this was the direction NT was headed in with NT 5.0 (Windows 2000) on the Alpha architecture (not to mention this is how several other OSes on other architectures made the move to 64bit support). The likely reason that Windows (NT) ended up becoming a fully 64bit platform all at once is that, with the end of Alpha, development of a 64bit system was focused entirely on Itanium, which had no real 32bit mode. As a result, the x64 version of Windows is actually a port of the Itanium version and even carries over the 44bit address map from that platform. Had 64bit efforts started later, with more focus initially on x86, it is very likely that Windows would have travelled the more relaxed route to a 64bit world.

That said, eventually those OS vendors using the 64bit process on a 32bit kernel model afforded to them via x64’s clever compatibility will have to port their core codebase to native 64bit since, in the near future, constraints in the amount of addressing allowed in 32bit compatibility mode will limit the total physically addressable system memory (even for 64bit processes) and the memory accessible to the kernel itself.
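For a sense of scale, here is a quick back-of-the-envelope calculation in Python. The 44bit figure is the one quoted for x64 Windows earlier; the 32bit and 36bit widths are my assumptions for a plain 32bit virtual address space and classic PAE physical addressing, not numbers taken from any particular OS X build.

```python
GIB = 2 ** 30
TIB = 2 ** 40


def addressable_bytes(bits: int) -> int:
    """Number of distinct byte addresses reachable with the given address width."""
    return 2 ** bits


print(addressable_bytes(32) // GIB, "GiB")  # 4 GiB: a 32bit virtual address space
print(addressable_bytes(36) // GIB, "GiB")  # 64 GiB: classic 36bit PAE physical addressing
print(addressable_bytes(44) // TIB, "TiB")  # 16 TiB: the 44bit address map
```

A kernel that lives in a 32bit address space has to manage all of physical memory through a window of a few GiB at most, and that gets more awkward with every doubling of installed RAM.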

[Khan] Ah. Not so wounded as we were led to believe. So much the better. [/Khan]

Filed under: Uncategorized — makfu @ 2:54 am

Wow, okay, that was pretty slovenly of me not to update my blog in 6 months; I will try to do better for the rest of this year. To start off, I will post a couple of really big entries over the next day or so.

The question, of course, is what to blog about? I try to make sure that I stick to incredibly boring technical stuff, primarily because there are enough idiots out there blogging about all sorts of political, ecological, theological topics, most of whom really don’t have any additional insight beyond what the average layman knows. So I stick to what I know, which is compsci and technology.

This brings me to my second point, which is that one needs to beware those who purport to be experts, but really aren’t. One of the biggest issues with the interweb and its “blogosphere” (I really hate that word) is that there are no standards around what is stated as fact. Nobody fact checks most blogs and therefore, what content is printed on them is only as good as the writer’s own knowledge and their willingness to check their own statements. When you mix agendas and bias with the lack of oversight, you end up with a lot of blatantly misleading content.

Making this even worse is the general downward trend in editorial fact-checking standards in new-media (organized, managed blogs and e-zines) and traditional media. It has become a fairly regular occurrence that some traditional media outlet is under the spotlight for sloppy reporting or, in at least a few cases, outright fabrication of “facts” and “sources”. In my area of specialization, I see this trend very clearly in the tech media, which has always had fairly low standards IMHO, but as of late, they have gotten much worse. The worst offenders are the online “controversy” sites like “The Register” and “The Inquirer”, who I intentionally call out because of their history of gross misreporting (e.g. AMD’s reverse hyperthreading and DX 10 on XP, just to name two off the top of my head).
 
However, this problem of misreporting, exaggeration and plain old technical ignorance has spread to other places including some well read online and print tech reporting outlets. I won’t call them out in this post, but my recommendation is to always test someone else’s assertions, especially if they made them without solid technical details.
