Multi-core processors: hype or the real deal?
9th Aug 2008 | 11:00
Multi-core processors offer great potential but aren't without their problems. We look at what they are.
For years, scientists have enjoyed the benefits of supercomputers that contain thousands of processors each working in parallel to achieve phenomenal levels of performance. The advent of multi-core processors means that these advantages are on offer to ordinary computer users, but unlike earlier computing design improvements, the gains available from working in parallel are not guaranteed.
So if multiple cores are the way to go, it's not surprising that manufacturers are queuing up to release chips with ever more of them. Most PC users now regard dual-core processors as entry-level, while triple- and quad-core chips are being touted to users with power-hungry applications who want a little more out of their computers. Some companies already sell products built around chips with even more cores, although these are less mainstream.
Sun Microsystems uses an eight-core UltraSPARC T2 processor in its latest servers, and some companies make chips with hundreds of cores aimed at specialised applications such as mobile phones. The PC market isn't being left behind, though: both Intel and AMD are expected to release eight-core chips in the near future, and Intel has even more ambitious plans.
Approaches to parallelism
Multi-core might be the buzz-phrase of the moment, but this certainly doesn't represent the industry's first step into parallel computation – far from it. The multi-core approach is just the latest method in a long line of techniques aiming to do more than one thing at once.
For many years, processors performed one operation per clock cycle. So a processor with a clockspeed of 1MHz was able to perform one million operations per second. However, as processors became more sophisticated, instructions took more than a single clock cycle to execute and this relationship broke down.
Each of those instructions might have done the work of several simpler ones, and clockspeeds certainly increased dramatically, but for a time these gains had to be balanced against the fact that each instruction took many clock cycles to execute.
The move back to single-cycle instructions was the result of a technique called 'pipelining', which represented the computing industry's first faltering steps into parallelism. Executing an instruction is a multi-step process. At its simplest, it might involve decoding the instruction, loading the data it operates on from memory, performing the necessary action on that data and finally writing the result back to memory.
Each of these steps could take a clock cycle. Pipelining separates these steps so that different steps of different instructions can be performed in parallel. The steps of a single instruction can't all happen at once, but as soon as one instruction has been decoded, the decoder is free to start decoding the next one while the data is being loaded for the first.
Pipelining doesn't quite give us single clock cycle instructions – mainly because the pipeline has to be flushed each time the program branches – but it comes close.
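The cycle-counting argument above can be sketched in a few lines of Python. This is a toy model rather than a description of any real processor: it assumes the four stages named above, one clock cycle per stage, and treats every flush as forcing the pipeline to refill from empty. The names STAGES, cycles_unpipelined, cycles_pipelined and timeline are illustrative only.

```python
# Toy cycle-count model of the four-stage pipeline described above.
# Assumptions (not taken from any real chip): one cycle per stage, and each
# flush forces the pipeline to refill from empty.

STAGES = ["decode", "load", "execute", "write back"]


def cycles_unpipelined(n_instructions: int, n_stages: int = len(STAGES)) -> int:
    """Each instruction passes through every stage before the next begins."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int = len(STAGES),
                     flushes: int = 0) -> int:
    """Stages overlap: after n_stages - 1 cycles to fill the pipeline,
    one instruction completes per cycle. Each flush adds another refill."""
    if n_instructions == 0:
        return 0
    refill = n_stages - 1
    return n_instructions + refill + flushes * refill


def timeline(n_instructions: int) -> None:
    """Print which instruction (I0, I1, ...) occupies each stage per cycle."""
    for cycle in range(cycles_pipelined(n_instructions)):
        cells = []
        for stage, name in enumerate(STAGES):
            i = cycle - stage  # instruction i reaches this stage on cycle i + stage
            cells.append(f"{name}=I{i}" if 0 <= i < n_instructions else f"{name}=-")
        print(f"cycle {cycle}: " + ", ".join(cells))


print(cycles_unpipelined(100), cycles_pipelined(100),
      cycles_pipelined(100, flushes=10))  # prints: 400 103 133
timeline(3)
```

With 100 instructions the model needs 400 cycles without pipelining and 103 with it; ten flushes push that to 133, which is why pipelining comes close to, but doesn't quite reach, one instruction per cycle. Calling timeline(3) prints six rows, showing the decoder starting on I1 while I0 is still loading its data.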
The next development in parallelism was the provision of multiple execution units in processors. A chip might have two arithmetic/logic units (which perform integer arithmetic and logical operations) and two floating point units (which perform arithmetic on floating point numbers). When used alongside pipelining, the multiple execution units permitted – for the first time – more than one instruction to be executed in a clock cycle.
However, the improvements aren't as dramatic as you might expect. For a start, it often isn't possible to execute two consecutive instructions in parallel, because the second might need the result of the first. Secondly, executing multiple instructions in parallel might involve guessing whether or not a branch will be taken.
If that guess proves to be wrong, the result of one of the parallel instructions has to be discarded. A final point is that if only the arithmetic/logic and floating point units are duplicated, there is still contention for other resources within the processor.
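The dependency problem can be illustrated with a toy in-order processor that issues at most two instructions per cycle. It is a sketch under simplifying assumptions: each instruction is reduced to the register it writes and the registers it reads, and branches, instruction latencies and the resource contention mentioned above are all ignored. Instr, can_pair and dual_issue_cycles are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    dest: str               # register this instruction writes
    srcs: tuple[str, ...]   # registers this instruction reads


def can_pair(first: Instr, second: Instr) -> bool:
    """The second instruction may issue in the same cycle only if it neither
    reads the first one's result nor overwrites the same register."""
    return first.dest not in second.srcs and first.dest != second.dest


def dual_issue_cycles(program: list[Instr]) -> int:
    """Issue cycles for an in-order processor with two execution units."""
    cycles, i = 0, 0
    while i < len(program):
        if i + 1 < len(program) and can_pair(program[i], program[i + 1]):
            i += 2  # both units busy this cycle
        else:
            i += 1  # no safe partner: the second unit sits idle
        cycles += 1
    return cycles


independent = [Instr("r1", ("a", "b")), Instr("r2", ("c", "d")),
               Instr("r3", ("e", "f")), Instr("r4", ("g", "h"))]
chain = [Instr("r1", ("a", "b")), Instr("r2", ("r1", "c")),
         Instr("r3", ("r2", "d")), Instr("r4", ("r3", "e"))]
print(dual_issue_cycles(independent), dual_issue_cycles(chain))  # prints: 2 4
```

Four independent instructions finish in two cycles, but a chain in which each instruction needs its predecessor's result takes four, exactly as long as it would with a single execution unit.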
All of the approaches to parallelism described above have one common factor. The parallel resources are managed within the processor, which means that they are transparent to both the operating system and application software.
In other words, if you'd swapped from a processor without pipelining to one that supports it, or if you'd upgraded from a pipelined processor with one arithmetic/logic unit to one with two, you'd have seen an immediate performance increase with the same software. Sadly, this does not happen with multi-core processors; as some industry experts have commented, there's no longer such a thing as a free lunch. For the first time, the extra parallel resources will only make a program run faster if it has been specially written using a technique called 'multi-threading'.
At its simplest, each thread can belong to a separate application. In this case, all you need to make use of a multi-core processor is an operating system that can schedule threads across several cores. Happily, all versions of Windows since XP have this support, so you can run two programs at once without them competing for the resources of a single core.
This helps if you want to run a power-hungry application in the background – such as rendering – without the foreground task being affected; but it doesn't help if you want that power-hungry application to run faster. To achieve that, the application itself has to support multi-threading. This involves splitting up the operation of the code into logical tasks that can be performed independently and in parallel. We'll investigate this issue in quite a bit more detail later on.
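Splitting work into independent tasks can be sketched with Python's standard concurrent.futures module. The example divides a sum of squares into chunks, hands each chunk to a worker and adds the partial results together. The function names are hypothetical, and one caveat applies: in standard CPython builds, the global interpreter lock stops pure-Python code in several threads from running on different cores at the same time, so for CPU-heavy work of this kind you would swap ThreadPoolExecutor for ProcessPoolExecutor. The way the work is divided stays the same.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n: int, parts: int) -> list[tuple[int, int]]:
    """Split the range [0, n) into `parts` contiguous, near-equal chunks."""
    step, extra = divmod(n, parts)
    bounds, start = [], 0
    for p in range(parts):
        end = start + step + (1 if p < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def sum_of_squares(start: int, end: int) -> int:
    """One independent task: it shares no data with the other tasks."""
    return sum(x * x for x in range(start, end))


def parallel_sum_of_squares(n: int, workers: int = 4) -> int:
    starts, ends = zip(*chunk_bounds(n, workers))
    # Each chunk runs as a separate task on the pool; the partial sums
    # are then added together to give the final answer.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum_of_squares, starts, ends))


print(parallel_sum_of_squares(1_000_000) == sum(x * x for x in range(1_000_000)))  # prints: True
```

The key design choice is that sum_of_squares touches nothing outside its own chunk, so the tasks never have to wait for, or lock against, one another.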
The gains on offer
The idea that going from one to two or two to four cores will double the performance of a chip is, sadly, naive. So what performance gains can we reasonably expect in the real world? I asked James Reinders, Director of Marketing and Business Development for Intel's Software Development Products, what he expected to happen.
"Some programs scale well," he told me, "and may see gains close to the number of cores. Occasionally, this might mean a four-fold speed increase on four cores, but even a three-fold increase on four cores is good. Anything which has reasonably independent work to do can see such speed-ups, for instance database queries. But we will not see doubling cores make a system twice as fast in general."
So why does each additional core give less advantage than the previous one? "Eventually, the overhead of distributing work catches up to any program," Reinders explained. "For some it might be a few cores, for others it might be tens of thousands. For anyone who has been up against a deadline, you know that having someone offer to help can sometimes be the answer to your prayers, and other times be simply of no use at all. Sometimes the overhead of breaking down short-term work can be more trouble than it is worth. On the other hand, the more work there is to do, the more likely it is that you can use the help."
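Reinders' point about overhead can be illustrated with a deliberately simple model. It is an assumption made purely for illustration, not a model proposed by Intel: the work divides perfectly between cores, but every core beyond the first adds a fixed coordination cost. toy_speedup and best_core_count are hypothetical names.

```python
def toy_speedup(work: float, cores: int, overhead_per_core: float) -> float:
    """Hypothetical model: work splits evenly across cores, but each core
    beyond the first adds a fixed cost for handing out and collecting work."""
    parallel_time = work / cores + overhead_per_core * (cores - 1)
    return work / parallel_time


def best_core_count(work: float, overhead_per_core: float,
                    max_cores: int = 1024) -> int:
    """Core count (1..max_cores) giving the highest toy speed-up."""
    return max(range(1, max_cores + 1),
               key=lambda n: toy_speedup(work, n, overhead_per_core))


print(best_core_count(100, 1), best_core_count(10_000, 1))  # prints: 10 100
```

With 100 units of work and a cost of one unit per extra core, the speed-up peaks at 10 cores and then falls; with 10,000 units of work it peaks at 100 cores. In this model the best core count sits near the square root of work divided by overhead, which matches the idea that the more work there is, the more helpers are worth having.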
Reinders also had a cautionary tale to tell about how all this extra power might end up being used. "I used to have a 40MHz processor in my laptop; now I have one with a processor running at over 2GHz. Is that fifty times as fast? Not really, and the speed boosts it does have are used for many things other than running a single application faster, such as smooth scrolling, higher-resolution screens with more colours, wireless connectivity and virus checkers. The extra power is used more for things that I didn't have before than for making my old applications faster. We will see history repeat itself with multi-core processors."
AMD's Chuck Moore, Senior Fellow Design Engineer for Accelerated Computing, was more cautious still. "An important aspect is something called Amdahl's Law. While not widely known outside the silicon design engineering and software development communities, it's highly relevant to the parallel processing and programming issue." Amdahl's Law is a formula for working out the maximum speed-up a program can achieve when only part of it can be run in parallel.
If all of the workload can be parallelised, the speed-up is equal to the number of processors. However, if even a small percentage of the program can't be run in parallel, the level of improvement drops dramatically. For example, if five per cent of a program can't be parallelised, the performance gain is limited to 20 times – no matter how many cores are used.
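Amdahl's Law itself is short: if a fraction p of a program's run time can be spread across n cores, the overall speed-up is 1 / ((1 - p) + p / n). As n grows, the p / n term shrinks towards zero and the speed-up approaches 1 / (1 - p), which is where the 20-times ceiling for a five per cent serial portion comes from. The sketch below simply evaluates that formula; the function names are our own.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: speed-up when `parallel_fraction` of the run time is
    spread across `cores` and the remainder still runs serially."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def amdahl_limit(parallel_fraction: float) -> float:
    """Ceiling on speed-up with unlimited cores: 1 / serial fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0 else 1.0 / serial_fraction


for n in (2, 4, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
print("limit", round(amdahl_limit(0.95), 2))
```

For a program that is 95 per cent parallel, four cores give a speed-up of about 3.48 times and 1,024 cores only about 19.64 times, creeping towards but never reaching the limit of 20.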
However, these figures assume a chip built from identical, or homogeneous, cores, and that design could be superseded in the near future. "AMD is now focusing heavily on bringing heterogeneous multi-core solutions to market, as part of our Accelerated Computing initiative," Moore told me. "The title Accelerated Computing comes from the idea that in the future, AMD and the industry plan to increasingly produce microprocessor designs that combine varying mixtures of scalar processing cores, parallel processing cores, and fixed-function accelerators on-chip.
AMD calls this new category of processor an Accelerated Processing Unit (APU). By creating the optimum mix of these three types of blocks, an APU can be more highly tailored to accelerate the software that matters most to a particular end-user. AMD's first APU will be 'Swift', which is targeted at the notebook space. This APU combines x86 scalar processing cores, a parallel processing core (based on ATi GPU technology) and a universal video decoder (again based on ATi technology) on-chip."
Certain types of software are more likely to benefit from the multi-core approach than others. So what applications will benefit the most from the new approach, and which already available software can take advantage of an increase in the number of cores?
I asked Mike Taulty, Developer Evangelist at Microsoft UK, what types of application are multi-threaded today, and I received a somewhat unexpected answer. "On a Windows machine, it would be easier to list applications that are not multi-threaded. Most applications are multi-threaded," he told me. This is rather surprising.
An application like Word, for example, spends most of its time waiting for a keypress, so there would seem to be no great imperative to use multi-threading. I put this point to Taulty, and asked whether in this case, multi-threading is used as a programming convenience rather than for performance gains.
"Yes," he answered. "Client applications often push work to separate threads to partition the programming of that work. However, client applications like Outlook or Word might make use of secondary threads so that they can be doing things like re-indexing your mail while you're typing a message."
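The pattern Taulty describes, where a secondary thread gets on with background work while the main thread stays responsive, can be sketched with Python's standard threading and queue modules. The "re-indexing" here is a hypothetical stand-in that just counts the words in each message; it is not how Outlook or Word work internally, and reindex_worker and run_demo are illustrative names.

```python
import queue
import threading


def reindex_worker(jobs: queue.Queue, index: dict) -> None:
    """Secondary thread: takes messages off the queue and 'indexes' them.
    Counting words is a hypothetical stand-in for real indexing work."""
    while True:
        message = jobs.get()
        if message is None:  # sentinel value: no more work is coming
            break
        index[message] = len(message.split())


def run_demo(messages: list[str]) -> dict:
    jobs: queue.Queue = queue.Queue()
    index: dict = {}
    worker = threading.Thread(target=reindex_worker, args=(jobs, index))
    worker.start()
    for message in messages:
        # put() returns immediately, so the main thread could carry on
        # handling keystrokes while the worker indexes in the background.
        jobs.put(message)
    jobs.put(None)   # tell the worker to stop once the queue is drained
    worker.join()    # wait, so the index is complete before it is read
    return index


print(run_demo(["hello there", "meeting moved to ten"]))
# prints: {'hello there': 2, 'meeting moved to ten': 4}
```

Here the gain is responsiveness rather than raw speed: the foreground thread never waits for the indexing, which is why such programs use threads even on a single core.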
Intel's James Reinders also had some ideas about which applications lend themselves best to multi-threading on multi-core processors. "Any program which processes a lot of data is generally quite easy to optimise for parallelism. Concurrent processing of data is the easiest to find, and it's usually easy to modify a program to achieve this.
We call this type of parallelism data-parallelism. Programs which process photos, videos or scientific data tend to exploit their data-parallelism. The other type of parallelism is task-parallelism. This is definitely harder to grasp for almost everyone, but the concept is simple: doing multiple things at once. But figuring out the things in a program which can be done at the same time eludes people all the time."
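The two kinds of parallelism Reinders describes can be contrasted in a short sketch. The functions are hypothetical examples: brightening every pixel is data-parallel (one operation applied to many independent items), while counting words and finding the longest word are task-parallel (different jobs run at the same time). Threads keep the sketch simple; as with any pure-Python CPU-bound work in standard CPython, a ProcessPoolExecutor would be needed for a genuine multi-core speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(pixel: int, amount: int = 10) -> int:
    """Same operation applied to every element: data-parallelism."""
    return min(255, pixel + amount)


def word_count(text: str) -> int:
    return len(text.split())


def longest_word(text: str) -> str:
    return max(text.split(), key=len)


def data_parallel(pixels: list[int]) -> list[int]:
    # One function, many independent data items, shared out among workers.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(brighten, pixels))


def task_parallel(text: str) -> tuple[int, str]:
    # Two different, unrelated jobs on the same input, run concurrently.
    with ThreadPoolExecutor() as pool:
        count = pool.submit(word_count, text)
        longest = pool.submit(longest_word, text)
        return count.result(), longest.result()


print(data_parallel([0, 250, 100]))        # prints: [10, 255, 110]
print(task_parallel("the quick brown fox"))  # prints: (4, 'quick')
```

Data-parallelism scales naturally with the amount of data, whereas task-parallelism is limited by how many genuinely independent jobs the programmer can identify, which is the difficulty Reinders highlights.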
AMD might have part of the solution. "The need for better parallel programming techniques is real and increasing," Chuck Moore explains. "AMD clearly understands and is taking action to solve the challenges of programming for parallelism.
Of equal and growing importance, AMD is also working collaboratively with the software community to develop better tools for programming within both homogeneous and heterogeneous processing environments as part of our Accelerated Computing initiative. We look forward to these methods becoming part of a cross-industry, standards-based approach to overcoming the challenges and achieving the full potential of parallel processing."
A new computing paradigm
Last year, Intel demonstrated an 80-core chip that was capable of exchanging data at a speed of one terabyte a second. The company hopes to have these chips ready for commercial production within five years. Industry experts have suggested that chips like these will produce a totally new computing paradigm. I asked Intel's James Reinders whether he agreed with this view.
"Yes, I'm a huge believer that large numbers of cores will change many things. Mainframes, mini-computers, personal computers – more performance has always brought new computing paradigms. For me, I can't understand why anyone would think what we have today is 'good enough' and will be all there is in the future!"
So what is this new paradigm? What new ways will people be using computers in the future thanks to the multi-core revolution? "It takes a little imagination to envision the future. It always has," Reinders commented. "My imagination has thoughts on how it will work out. I tend to think of five things.
First, speculation – the computer does things it thinks I will want, so it's ready when I actually ask for it. Second, modelling – the computer models the world it resides in, and adapts better to it. A far too simple example is having my computer default to 'USA' when I'm using it in the USA; a more sophisticated model is having it learn my face and notice it in photos, videos, etc. Why can a five-year-old do this so well, but no computer seems to even try to do it?
Third, virtual reality at a level far beyond what we experience today; making that commonplace, replacing current computer graphics as completely as CGA and VGA graphics replaced the original monochrome text displays. Fourth, speech recognition. And lastly, eliminating the wait-cursor (the Windows hourglass). I'm sure they will turn out to be only part of what happens, and many days I worry that wait-cursors are as inevitable as death and taxes. We'll see."