BE CLASSIFIEDS: Student Seeking Internship

Be Developer Classifieds are an opportunity for Be developers to tell each other about projects they're working on, are thinking about working on, or would like to work on. Here's your chance to request assistance on a project, or make your talents known to other projects. For more information, check out the Classified Ads section in the Registered Be Developer area.

BeOS Job: Student Seeking Internship

Computer science student seeking a paid internship for the remainder of this semester with a company currently developing BeOS software in the US. Resume available at http://www.pobox.com/~orrd/. For more information, please contact: orrd@pobox.com.
BE ENGINEERING INSIGHTS: Optimizing Your Applications for Cache and Memory Access By Dmitriy Budko dmitriy@be.com
Today's high-performance CPUs (Pentium II, PowerPC 604) are fast, but without a good memory and cache system they are helpless. Just try using any Intel system with the cache disabled in the BIOS settings! The CPU will spend 95% of its time waiting for data from RAM or waiting until data is written to RAM. Caches minimize this time by keeping the most recently used data in small but fast memory near the CPU.

Normally, this is all an application programmer needs to know, because all details related to cache/memory coherency, bus mastering devices, and multiple CPUs are transparent to an application. Nevertheless, if you care about the performance of your code you have to know much more, or your code may run faster on an old Pentium 166 MHz system than on a new, fast Pentium II 300 MHz system with 512K cache, SDRAM, etc.

You don't believe it -- or you think that such code must be black magic, written in assembler? Wrong. The example I'm presenting here is very simple and straightforward, but *is* memory intensive. High memory traffic is also characteristic of BeOS multimedia applications.

The example program is a primitive implementation of the classic Sieve of Eratosthenes algorithm for finding prime numbers. This implementation can be sped up easily, but I preferred to focus here on cache effects. The main loop is terse:

for(i=2; i<=sqrt_max; i++)   /* don't search all numbers */
{
  if(!list[i])               /* i is a prime number */
  {
    /* mark as non-prime all multiples of i */
    for(j=i+i; j<=nmax; j+=i)
      list[j] = 1;
  }
}

As you see, the code is very straightforward, but it runs faster on a Pentium 166 than on a Pentium II 300 MHz! Here's the full text of the test program:

#include <OS.h>
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <string.h>

#define NMAX (8*1024*1024)

double get_sec(void)
{
  return (double)system_time()/1.0e6;
}

/* simple implementation */
void sieve1(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max = (int)(sqrt(nmax)+0.5);
  for(i=2; i<=sqrt_max; i++)
  {
    if(!list[i])
    {
      for(j=i+i; j<=nmax; j+=i)
        list[j] = 1;
    }
  }
}

/* the same but without unnecessary memory writes */
void sieve2(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max = (int)(sqrt(nmax)+0.5);
  for(i=2; i<=sqrt_max; i++)
  {
    if(!list[i])
    {
      for(j=i+i; j<=nmax; j+=i)
        if(!list[j])     /* it's not already marked */
          list[j] = 1;   /* mark it */
    }
  }
}

/* the same but use only 1 bit, not 1 byte, per number */
#define TEST_LIST_BIT(n) ( (list[n>>3] & (1<<(n&7))) != 0)
#define SET_LIST_BIT(n)  ( (list[n>>3] |= (1<<(n&7))) )

void sieve3(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max = (int)(sqrt(nmax)+0.5);
  for(i=2; i<=sqrt_max; i++)
  {
    if(!TEST_LIST_BIT(i))
    {
      for(j=i+i; j<=nmax; j+=i)
        SET_LIST_BIT(j);
    }
  }
}

void sieve4(unsigned char* list, int nmax)
{
  int i,j;
  int sqrt_max;

  sqrt_max = (int)(sqrt(nmax)+0.5);
  for(i=2; i<=sqrt_max; i++)
  {
    if(!TEST_LIST_BIT(i))
    {
      for(j=i+i; j<=nmax; j+=i)
        if(!TEST_LIST_BIT(j))
          SET_LIST_BIT(j);
    }
  }
}

double profile_sieve(void (*sf)(unsigned char*, int),
  const char* name, unsigned char* list, int nmax)
{
  double t1, t2, dt;

  memset(list, 0, nmax);
  t1 = get_sec();
  sf(list, nmax);
  t2 = get_sec();
  dt = t2 - t1;
  printf("%s: %4.3f sec\n", name, dt);
  return dt;
}

unsigned char primes_list[NMAX+1];

int main()
{
  profile_sieve(sieve1, "sieve1", primes_list, NMAX);
  profile_sieve(sieve2, "sieve2", primes_list, NMAX);
  profile_sieve(sieve3, "sieve3", primes_list, NMAX);
  profile_sieve(sieve4, "sieve4", primes_list, NMAX);
  return 0;
}

The results with gcc -O3 and Metrowerks
cc -O3 (smaller numbers are better) in seconds are:

            PPC 604e 132   Pentium 166   Pentium II 300
  sieve1        9.36           1.97           2.75
  sieve2        6.97           3.83           1.76
  sieve3        4.80           3.72           1.48
  sieve4        3.53           2.80           1.03

On the sieve1 function, the Pentium 166 is much faster than the Pentium II 300. What's wrong with this version?

Pentium Pro and Pentium II processors have a "write allocate by read-for-ownership" cache, whereas the Pentium processor has a "no-write-allocate; write through on write miss" cache. On Pentium Pro and Pentium II processors, when a write occurs and the write misses the cache, the entire 32-byte cache line is fetched. On the Pentium processor, when the same write miss occurs, the write is simply sent out to memory.

Write allocate is generally advantageous, since sequential stores are merged into burst writes and the data remains in the cache for use by later loads. This is why P6-family processors and the PowerPC 603 and 604 adopted this write strategy, and why some Pentium processor system designs implement it for the L2 cache, even though the Pentium processor uses write-through on a write miss. However, write allocate can be a disadvantage in code where:
a large number of writes occur within an application (as in the sieve1 function), the stride is longer than the 32-byte cache line, and the array is large. Under these conditions, every store on a Pentium Pro or Pentium II processor causes an entire cache line to be fetched. In addition, this fetch will probably replace one (sometimes two) dirty cache lines.

What are the possible optimizations? Sieve2 checks the value prior to writing, and thus reduces the number of writes of dirty cache lines to memory. Sieve3 packs the byte array into a bit array, thereby reducing the size of the array, which reduces memory and cache traffic. Sieve4 uses both techniques.

So, when you write performance-critical code, think about cache behavior. This article illustrates only one specific speed trap -- there are many more. For a better understanding of how interactions between hardware and software affect performance, I recommend the excellent textbook "Computer Architecture: A Quantitative Approach," by David Patterson and John Hennessy, ISBN 1-55860-329-8. Specific recommendations for Intel CPUs are in the "Intel Architecture Optimization Manual," <http://developer.intel.com/design/pentiumii/manuals/242816.htm>, which was used as a reference for this article.
BE ENGINEERING INSIGHTS: Scrolling to Oblivion By Steven Black steven@be.com
Hello again. After my initial idea for a Newsletter article fell through (alas, it even had functional example code), I had to settle for something quick and straightforward, as QA, along with the rest of engineering, is busy preparing for R4. So, we'll just talk about scrolling.
<ftp://ftp.be.com/pub/samples/interface_kit/scrollbarapp.zip>
In the Be API, scroll bars are normally handled for you by a BScrollView wrapped around a target view. Applications like the Tracker, with the horizontal scroll bar starting after the status area, and NetPositive, with scroll bars that disappear and reappear, do not use the standard arrangement. Under normal circumstances you use a BScrollView.

Let's start out with a relatively common example for text. Suppose you have a BTextView inside a BScrollView. The frame must include space around it for the scroll bars. If your frame doesn't take into account the size of the scroll bars, the scroll bars may not be visible, which would make them useless. You also need to specify your initial text rectangle. (Note that the text rectangle is specific to the text view and differs from the view's frame rectangle.) While the bottom of this rectangle will grow to fit, the left and right coordinates determine things such as where line wrap occurs and where text aligns for right and centered alignment. (If you don't use line wrap, the text disappears when it reaches the bounds of the text rectangle.) When you create the BScrollView with the BTextView as its target, BTextViews handle the scroll bars automatically. The scroll bars are automatically resized as the content of the BTextView changes.

However, many applications use text views for purposes other than simple direct user input. Medium-sized to large scrollable text views raise the question of word wrap. There are two different, nonexclusive kinds of word wrap: wrapping at a fixed point, and wrapping at the edge of the view. The difference is that when you wrap at a fixed point you emulate physical (i.e., hardware) limitations, and when you wrap at the edge of the view, you no longer require the horizontal scroll bar. The key to wrapping at the edge of the view is to catch the view's resize and adjust the text rectangle to the new width, as in the sketch below.
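Here is a minimal sketch of that setup. It is not the scrollbarapp sample itself; the class name TWrapTextView, the margin, and the flags are assumptions, but the calls (BScrollView, SetTextRect(), SetWordWrap()) are the standard Interface Kit ones.

#include <ScrollView.h>
#include <TextView.h>
#include <Window.h>

// A text view that keeps its text rectangle as wide as the view,
// so text re-wraps at the view's edge when the frame changes.
class TWrapTextView : public BTextView
{
public:
  TWrapTextView(BRect frame, const char *name, BRect textRect,
      uint32 resizingMode, uint32 flags)
    : BTextView(frame, name, textRect, resizingMode, flags) {}

  virtual void FrameResized(float width, float height)
  {
    BTextView::FrameResized(width, height);
    BRect r = Bounds();
    r.OffsetTo(B_ORIGIN);
    r.InsetBy(3.0, 3.0);   // small margin, purely cosmetic
    SetTextRect(r);        // the wrap point follows the view edge
  }
};

void AddScrollingText(BWindow *window)
{
  // Leave room for the vertical scroll bar, or it won't be visible.
  BRect frame = window->Bounds();
  frame.right -= B_V_SCROLL_BAR_WIDTH;

  BRect textRect = frame;
  textRect.OffsetTo(B_ORIGIN);

  TWrapTextView *text = new TWrapTextView(frame, "text", textRect,
    B_FOLLOW_ALL_SIDES, B_WILL_DRAW | B_FRAME_EVENTS);
  text->SetWordWrap(true);

  // The BScrollView adds and targets the scroll bar; the BTextView
  // keeps it up to date by itself. Wrapping at the view edge means
  // no horizontal bar is needed.
  BScrollView *scroller = new BScrollView("scroller", text,
    B_FOLLOW_ALL_SIDES, 0, false, true);
  window->AddChild(scroller);
}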
If you don't want to wrap at all, you need to make sure your text region is large enough for all of your text. Many programs just allocate a presumably larger-than-needed text rectangle. (Examples of this approach include at least one popular programmer's editor on BeWare.) Most people can get away with just counting the lines and finding the longest line available -- though this gets slow with large files. However, I digress -- let's get back to scrolling.
Images are typically contained in BBitmaps. Getting one from a file takes two lines:

BBitmap *image;
image = BTranslationUtils::GetBitmapFile(name);

That's really all you need to get an image from a filename in BeOS. And -- you must remember to link in the Translation Kit, as it's not normally included. It's all in libtranslation.so.

Additionally, some scaling is often required. I identify two different types of scaling. There's scaling to fit the window, done by some arbitrary percentage. There's also scaling to fit the screen while keeping the aspect ratio, which is actually an adaptation of scaling by a fixed percentage.

The first thing to be aware of with images and scroll views is that not much is handled automatically. You're going to need to make some sort of view for the image to live in. We've already mentioned the basic scroll view setup. Images have a fixed size, and if you're not scaling the image, you'll either have the view background color around it, or, if you've set the background color to a transparent color (either B_TRANSPARENT_32_BIT or B_TRANSPARENT_8_BIT), whatever happened to be there before.

The view containing the image needs to adjust the scroll bars as needed. This includes the proportion, the range, and the steps. Instead of trying to explain what I think are rather straightforward calls to adjust the scroll bars, here's the section of the code that I used:

void TImageView::FixupScrollbars(void)
{
  BRect bounds;
  BScrollBar *sb;

  bounds = Bounds();
  float myPixelWidth = bounds.Width();
  float myPixelHeight = bounds.Height();
  float maxWidth = 1, maxHeight = 1;

  if(image!=NULL){
    // get max size of image
    GetMaxSize(&maxWidth, &maxHeight);
  } else fprintf(stderr, "Image is null\n");

  float propW = myPixelWidth/maxWidth;
  float propH = myPixelHeight/maxHeight;
  float rangeW = maxWidth - myPixelWidth;
  float rangeH = maxHeight - myPixelHeight;
  if(rangeW < 0) rangeW = 0;
  if(rangeH < 0) rangeH = 0;

  if ((sb=ScrollBar(B_HORIZONTAL))!=NULL) {
    sb->SetProportion(propW);
    sb->SetRange(0, rangeW);

    // Steps are 1/8 visible window for small steps
    // and 1/2 visible window for large steps
    sb->SetSteps(myPixelWidth / 8.0, myPixelWidth / 2.0);
  }

  if ((sb=ScrollBar(B_VERTICAL))!=NULL) {
    sb->SetProportion(propH);
    sb->SetRange(0, rangeH);

    // Steps are 1/8 visible window for small steps
    // and 1/2 visible window for large steps
    sb->SetSteps(myPixelHeight / 8.0, myPixelHeight / 2.0);
  }
}

You might be wondering what's needed to scale the image.
Would you believe: nothing! DrawBitmap() will stretch the bitmap to whatever destination rectangle you hand it, so scaling comes along for free (there's a sketch below). The sample code is readily available. It can handle images and text files. For images it shows them maximized to perspective and zoomed. For text files, it starts out not wrapping them. Since too few image programs actually allow you to view files maximized to perspective, the sample code should be mildly useful. (If you view it as an excuse to learn some Tracker scripting, too, it could be highly useful for those who wish to browse image files.) I lifted some of the image-related stuff from George Hoffman's QuickPaint sample code, so credit for that goes to him.
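As a footnote to the scaling point above, here is a sketch of a Draw() that scales to fit while keeping the aspect ratio. The class and member names are assumptions and this is not the sample's actual code; the stretching itself is done by the DrawBitmap() overload that takes a source and a destination rectangle.

#include <Bitmap.h>
#include <View.h>

// Draws its BBitmap scaled to fit the current bounds while
// preserving the aspect ratio.
class TScaledImageView : public BView
{
public:
  TScaledImageView(BRect frame, BBitmap *bitmap)
    : BView(frame, "image", B_FOLLOW_ALL_SIDES, B_WILL_DRAW),
      image(bitmap) {}

  virtual void Draw(BRect updateRect)
  {
    if (image == NULL)
      return;

    BRect src = image->Bounds();
    BRect bounds = Bounds();

    // Use the smaller scale factor so the whole image fits.
    // (+1 because BRect widths and heights are inclusive.)
    float scaleX = (bounds.Width() + 1) / (src.Width() + 1);
    float scaleY = (bounds.Height() + 1) / (src.Height() + 1);
    float scale = (scaleX < scaleY) ? scaleX : scaleY;

    BRect dest(0, 0, src.Width() * scale, src.Height() * scale);

    // DrawBitmap() maps the source rectangle onto the destination
    // rectangle, scaling as needed.
    DrawBitmap(image, src, dest);
  }

private:
  BBitmap *image;
};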
DEVELOPERS' WORKSHOP: Untangling Threads By Michael Morrissey jersey@be.com

"Developers' Workshop" is a weekly feature that provides
answers to our developers' questions, or topic requests.
To submit a question, visit
http://www.be.com/developers/suggestion_box.html.
If you've never given much thought to what multithreading is and how it works, chances are that your code has routines which could benefit greatly from being threaded. Applications often have small helper functions which contribute significantly to the running time of the application, and threading these functions can frequently lead to enormous speed improvements on multiprocessor machines.
In this article, we'll be looking at one such helper
function, the world-famous Quicksort routine. We'll
investigate what makes a routine a good candidate for
threading, give some general design tips, and offer
questions to keep in mind when threading your code.
Before you look at the sample code, you might want to skim
through the Be Book's section on Threads, and the one on
Semaphores. You can get the sample code from here:
<ftp://ftp.be.com/pub/samples/intro/QSort.zip>
The sample code version of Quicksort is not meant to be put
into a library, but rather to be used as a tool for
experimentation. Consequently, there are three command-line
arguments: the first specifies the size of the array to be
created and filled with random integers, the next argument
specifies the maximum number of threads to spawn for
sorting, and the third argument specifies a threshold
variable, which indicates the size of the smallest array
which is to spawn a thread.
First, about the algorithm: Quicksort is the canonical
divide-and-conquer algorithm. The general idea is this:
given an array of numbers, choose a "pivot value"; ideally,
the pivot is the median value of the array. Then reorder the
elements of the array so that elements less than the pivot
appear on the low end of the array, and elements greater
than the pivot appear on the high end. You now have
partitioned the array into two sub-arrays: one with values
less than pivot, one with values greater than pivot. Now
recursively apply this algorithm to each of the two
sub-arrays, partitioning each of the sub-arrays into two
more sub-arrays. In this way, you order the entire array by
working on consecutively smaller and smaller arrays.
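For reference, here is a compact, unthreaded sketch of the idea just described. It is not the QSort sample's routine; it simply uses the middle element as the pivot and recurses on both partitions.

#include <algorithm>  // std::swap

// In-place quicksort of data[lo..hi] (inclusive bounds).
// Usage: QuickSort(array, 0, count - 1);
static void QuickSort(int *data, int lo, int hi)
{
  int i = lo, j = hi;
  int pivot = data[(lo + hi) / 2];   // middle element as the pivot

  // Partition: values < pivot end up on the left,
  // values > pivot on the right.
  while (i <= j) {
    while (data[i] < pivot) i++;
    while (data[j] > pivot) j--;
    if (i <= j) {
      std::swap(data[i], data[j]);
      i++;
      j--;
    }
  }

  // Recurse on the two (now independent) partitions.
  if (lo < j) QuickSort(data, lo, j);
  if (i < hi) QuickSort(data, i, hi);
}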
Quicksort is good for large arrays, but inefficient on smaller ones. For this reason, one optimization that is generally made is to order small sub-arrays using a straight insertion sort, rather than partitioning down to single elements. This results in a great speedup, and the sample code uses it as well.

At this point, you might be saying to yourself, "We have the
regular, unthreaded version of Quicksort. Great! Now how do
we thread it?" If you're wondering about that, back up! The
question that you should be wondering about is: "Is
threading this algorithm going to speed up the running time
of this algorithm?" Threading an algorithm doesn't always
improve running time, and it can even hurt running time.
There's nothing worse than rewriting a piece of code to make
it run faster, only to discover that you've slowed it down.
(Trust me.) Unfortunately, there's no simple way to decide
whether or not threading is a good idea. You can, however,
look for some tell-tale signs, use your best judgement, and
experiment.
One of the most important questions to ask is, "Will the
data the threads are working on overlap or be mutually
exclusive?" If the data is independent, the need for
synchronization between the threads is greatly reduced,
which saves on both running time and on your effort in
implementing the algorithm. If the data overlaps, you'll
have to carefully consider how to synchronize your threads
so that they don't clobber each other's work, and also so
that they aren't spending all of their time competing for
control of a variable rather than doing "real" work.
The next question to ask is, "Does the amount of work each
thread will perform significantly outweigh the overhead of
managing the thread?" As light-weight as they are, threads
still have overhead, in the form of kernel calls, memory
allocation, and context-switching. If your thread isn't
going to spend a fair amount of time computing, the overhead
might not be worth it.
I must admit that Quicksort for integers fails this test.
Comparing integers is very inexpensive, and doesn't come close to outweighing the cost of spawning a new thread. It's easy to
imagine, however, sorting an array of a user-defined type
which has an expensive comparison operation, such as sorting
an array of complex numbers. For simplicity's sake, rather
than defining my own type and templatizing the QSort class,
I compare the logarithms of an array of integers (rather
than the integers themselves). The resulting ordering is the
same as if I'd ordered the integers directly, but the
logarithm operations are expensive enough to make threading
worthwhile.
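A sketch of that idea (not the sample's actual comparison code):

#include <math.h>

// Compare positive integers by their logarithms: the resulting order
// is the same, but each comparison is now expensive enough that
// spreading the work across threads starts to pay off.
static inline bool LogLess(int a, int b)
{
  return log((double)a) < log((double)b);
}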
Your next question might be, "How will I know when my
algorithm has finished?" That sounds like a funny question
if you're only used to working with single-threaded
algorithms, but it's quite important in the multithreaded
world. Threads are not guaranteed to execute in a particular
order (unless you synchronize them to do so explicitly), so
you cannot predicate your termination condition on the order
of execution. In Quicksort, when an element is put into its
correct place in the overall array, a counter which refers
to "elements yet to be sorted" is decremented. When the
counter reaches zero, a semaphore is released, indicating to
the managing thread that the sorting is finished and the
child threads are finished.
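A minimal sketch of that completion mechanism, built from the Kernel Kit's atomic_add() and a semaphore (the variable names are assumptions, not the sample's):

#include <OS.h>

static vint32 elements_left;  // elements not yet in their final place
static sem_id done_sem;       // released once, when sorting completes

// Called by any worker each time it fixes "count" elements in place.
static void MarkElementsDone(int32 count)
{
  // atomic_add() returns the previous value of the counter.
  if (atomic_add(&elements_left, -count) == count) {
    // We just took the counter to zero: wake the managing thread,
    // which is blocked in acquire_sem(done_sem).
    release_sem(done_sem);
  }
}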
Another crucial question is, "How many threads should I
start?" Surprise -- there's no easy answer to this question
either. In the case of Quicksort, it doesn't do much good to
have more threads running than processors. That's because once a Partition thread is running, it can run straight through, as fast as it can: it never has to wait to acquire variables that other threads might be using, so there's no contention. Some algorithms may benefit from having
more threads than processors, since while one thread is
idle, waiting to acquire a shared resource, another thread
can be running.
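If you want to match the number of threads to the machine rather than taking it from argv, the Kernel Kit will tell you how many CPUs there are; a one-liner sketch:

#include <OS.h>

// A sensible default for the number of worker threads.
static int32 DefaultThreadCount(void)
{
  system_info info;
  get_system_info(&info);
  return info.cpu_count;
}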
Not having more threads than processors (for Quicksort) is
only half of the situation. Ideally, you'd like to have exactly as many threads as processors running at any point
in time, so as to keep all of the processors working. Just
spawning threads and giving them an initial amount of work
to do is not enough -- one thread might finish sorting a
sub-array faster than the others. It would be good if this
thread could go back and help out the other threads which
are still running. So in the QSort constructor, we create a
semaphore and give it a thread count of THREADS (from argv):
...
}
The workthread semaphore controls how many threads can be
started. Before a worker thread is spawned, the sorting code tries to acquire the workthread semaphore. If the semaphore is immediately available, a worker thread is spawned and resumed:
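The original listing is not reproduced here; the following sketch shows the pattern with the Kernel Kit calls spelled out (the names gWorkThreads, DoSort(), and SortWorker() are assumptions):

#include <OS.h>

static sem_id gWorkThreads;  // created with create_sem(THREADS, "workthread")

// Partitions the sub-array described by "work" and recurses,
// calling SpawnOrSort() for each new sub-array. (Placeholder.)
static void DoSort(void *work)
{
  // ...
}

// Worker entry point: do the work, then return the slot to the pool.
static int32 SortWorker(void *work)
{
  DoSort(work);
  release_sem(gWorkThreads);
  return 0;
}

static void SpawnOrSort(void *work)
{
  // Try to take a pool slot without blocking.
  if (acquire_sem_etc(gWorkThreads, 1, B_TIMEOUT, 0) == B_OK) {
    thread_id tid = spawn_thread(SortWorker, "sort worker",
                                 B_NORMAL_PRIORITY, work);
    resume_thread(tid);
  } else {
    // No slot free: keep sorting in the current thread.
    DoSort(work);
  }
}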
If there are no available threads in the pool -- that is, if the semaphore isn't immediately available -- no new thread is spawned and the sorting continues in the current thread.
Typically, rather than creating a thread every time it is
taken from the pool, and killing the thread every time it is
returned to the pool, you would simply resume and suspend
the same threads, and feed the awakened thread fresh data.
For this sample code, I chose not to implement that strategy
for reasons of clarity. You should take a few minutes and
think about how you'd restructure this program to use the
"traditional pool" scheme, how you'd get fresh data into the
threads, and how you might modify the mechanism which
detects when the sort is finished.
One final thought: keep in mind that you want your routine
to scale as nicely as possible from a single-processor
machine to a multi-processor machine. Compare the single
thread version of your routine on a single-processor system
with the multithreaded version of your routine on a
single-processor system. Is the multithreaded version much
slower? If it is, can you rework your threading model to
reduce that discrepancy? If not, you may want to consider
special-casing your routine on single-processor systems.
I strongly encourage you to play around with the arguments
to this program, and see how the size of the array, the
number of threads in the pool, and the threshold variable
interact and affect running time of this program. Then try
looking at your own code for functions which might benefit
from threading. If you think you have a good candidate,
experiment!
This is Agenda time. Agenda is a yearly industry conference which always takes place in Phoenix, Arizona, at a resort, The Phoenician, once famous for its prominence in the real estate debacle of an earlier decade. So, in taxpayer-subsidized splendor we spin and schmooze -- and hear
preachers warning us about the danger of government
intervention in our affairs.
Still on symbols, last year's Agenda opening day was made
memorable by the "surprise" DOJ suit against Microsoft.
During the morning's coffee break, video was piped on the
screen as attendees walked in and out of the room. At first,
some of us thought this was one of the spoofs we're
regularly treated to, but no, this was the Bill and Janet
(now Joel) reality show.
A year later, on the very same opening day, the trial opens,
and, for the first time since the beginning of Agenda, Bill
Gates isn't at the conference. This was deemed such a momentous change by the Agenda organizers that they e-mailed attendees in advance, essentially offering their money back if Bill's absence killed their interest in the conference.
On the one hand, this is laudable truth-in-conferencing on the part of the organizers. On the other hand, there's the implication that so much of the event's value lies in the submissive hope of being able to touch a fringe of the emperor's tunic.
One is happy to report the assumption of such a supine
position wasn't supported by attendance numbers: there was
the usual turnout of industry grandees and smaller
entrepreneurial fry as well as many last minute pleas to get
admitted to this "by invitation only" conference.
Another observable change is what some call the shift from
MIPS to Mbps, the feeling that what we need most isn't faster
processors but faster connections to the great network in
our future. Indeed, one can feel the frustration at PCs with
more computing power and storage than an early Cray
supercomputer, connected to the Internet via a 56Kbps modem.
Dave Nagel, the head of AT&T Labs, commenting on the impact
of broadband home connections, observed the experience was
so good customers couldn't bear the thought of going back to
the old POTS days and turned into broadband evangelists. "If
it's offered where you live, get it, if not, demand it and,
if you don't get it, move to a place that has it." In other
words, now that we have the $500 PC running Office and
Explorer, we need a $50 per month high-speed connection.
Call it a "T1 in every pot," at least figuratively speaking.
This, in turn, gives rise to more connected thoughts, home
networking and Internet multimedia appliances. AT&T, Philips
and Cisco, from their vantage points, all stressed the
importance of such technologies and products in creating new
user experiences, not just smaller/cheaper versions of what
we have today, and in reaching new users, not just relieving
current users from their frustrations -- and from more of
their money.
But, of course, there is the other side of the issue, the
side that says we need more MIPS. Or, perhaps better stated,
both sides are right, we need more Mbps and more MIPS. If
all I want to do is run Microsoft Office and Explorer with a
POTS connection, the $500 PC offers an unbeatable
price-performance combination. If, on the other hand, I want
to capture, edit, produce and polish video, sound, music, 3D
animations, or play really good games, the 450MHz
"supercomputer" doesn't look so overpowered anymore.
That's the point Andy Grove made in his opening Monday
morning talk and demos. But first, we were treated to a live
videoconferenced ion-beam healing session. The patient? A
processor chip in Santa Clara. Some attendees thought it a
little too geeky; personally, I see it as providing a nice
balance to Steve (Forbes), who shared his well-pressed,
high-level vision of our industry's role in the future of
our country.
Staying on the "Why We Need More MIPS" message, Intel's
chairman then treated us to demonstrations of Pentium-based
systems: Dell servers running Linux, newly converted Silicon
Graphics systems running Windows NT, and a sample of Intel's
upcoming processors running BeOS audio, video and 3D
applications, drawing nice applause and looking quite smooth
when compared to the SGI system. A nice milestone in fulfilling the goal set out by one of our shareholders who, before investing, teasingly called Be "The Poor Man's Silicon Graphics" -- promptly rephrased as "SGI for the Rest of Us" by others who wanted more than the poor man's goal.
Some found strange symbolism in the demonstration of several
OS platforms just as the DOJ-MS anti-trust suit opened on
the other side of the country. Personally, I hope this kind
of demonstration will soon become unremarkable.
Copyright ©1998 Be, Inc. Be is a registered trademark, and BeOS, BeBox, BeWare, GeekPort, the Be logo and the BeOS logo are trademarks of Be, Inc. All other trademarks mentioned are the property of their respective owners.