Proceedings of the August Be Developer Conference



August 1997 BeDC

Approaching the Be File System




DOMINIC GIAMPAOLO: I am Dominic Giampaolo, and this is going to be three talks in one, not just one, but three. I am talking about the Be file system, the file system API, which is how you access all the functionality, and the file system independent layer, which is how you write additional file systems, as part of the BeOS.



I guess I've already skipped my first slide, haven't I? So that's a rough overview of what we're going to be talking about. Let's jump right in, since I have an awful lot of slides to go through. Is someone from Be here that could press the key so that I don't keep having to walk back? I guess I'm on my own.



The Be file system, which I wrote, is a 64-bit file system. It supports extended file attributes and indexing of attributes, and BTrees are used for directories and indices. It's a journaled file system, and it's designed for sustained high-bandwidth I/O. Now, let's go through those.

64-bit file system. Files can be 2 to the 64th in size. You can have file systems, as a whole, which are also 2 to the 64th. There are no funky limitations or cluster sizes or anything like that. You pick a block size for the file system, and that's independent of the size of the disk that you're on. We have 9-gig drives with 1K blocks. You can have a 2-gig drive with 8K blocks. It is up to you.

Extended file attributes, I'll go into a little more detail in the next few slides. But essentially a file is not just one piece of data; it's also anything that you want to associate with it. You can associate meta information with a file, and that data can also be any size.

We also index attributes. If an attribute is, say, a simple string, we can index those, so that you can do efficient look-ups, and I'll talk a little bit more about that when I go into queries and how you can use them with the file system.

We use BTrees, which are a standard database technique for storing data on disk for efficient look-ups. Both directories and indices use the same structure. So if you have a directory with like 10,000 files, most systems have to go linearly through to find a particular file name. This way we can efficiently look it up. The same goes for indices: when you index file names, for example, and you may have 50,000 or 100,000 files on a file system, it's very efficient to look up a particular file.

We also support journaling. Journaling is a technique that, again, I'll go into more detail, but roughly it preserves data integrity of the file system, not necessarily user data, but of the file system, so you don't have to spend time during boot-up with a lengthy file system check, or you don't have to necessarily worry about a power failure or anything like that.

And as well, the last bullet item, sustained high bandwidth I/O, the goal is to try to maintain as close to the raw drive speed as possible. So that is, if you're just accessing the raw device, no file system, what's the bandwidth and how close can we get to that going through the file system.



Attributes are name-value pairs. An example of one: "comment equals this is cool." So the attribute name is "comment"; the value is "this is cool." You could have "age equals 27." Again, that's an attribute; the name is "age," the value is "27."

The types, of course, are string, integer, floating point, double and raw, which is just raw binary data, anything you want, you know, blobs in database terminology. You can have an unlimited number of attributes per file, so if you need to have 10,000 attributes on a file, no problem, you can do it. The data, for the raw types, can of course be any size you want, and in fact it actually uses the same data structure as the file to store that data, so it's relatively efficient to access it as well.
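To make the name-value idea concrete, here is a minimal in-memory sketch of typed attributes attached to a file. This is purely illustrative: the type names and structures are invented, and the real BFS on-disk layout is quite different.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// A toy model of BFS-style extended attributes: each file carries a set
// of typed name/value pairs alongside its main data stream.
enum class AttrType { String, Int32, Float, Double, Raw };

struct AttrValue {
    AttrType type;
    std::vector<uint8_t> bytes;  // raw payload; any size
};

struct FileNode {
    std::string name;
    std::vector<uint8_t> data;               // the file's main data stream
    std::map<std::string, AttrValue> attrs;  // name -> typed value
};

// Attach or overwrite an attribute on a file.
inline void write_attr(FileNode& f, const std::string& name,
                       AttrType t, const void* buf, size_t len) {
    const uint8_t* p = static_cast<const uint8_t*>(buf);
    f.attrs[name] = AttrValue{t, std::vector<uint8_t>(p, p + len)};
}
```

The point is simply that the attribute namespace is separate from the file's data, and each attribute carries its own type and arbitrary-length value.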

You store meta information about files. So the tracker, for example, stores the icon location as an attribute of a file. StyleEdit, which is an editor that comes with the BeOS, allows you to do multifont style text editing in different colors, et cetera. Instead of inventing yet another text format, it stores the text that you have written as plain ASCII text, and as an attribute of the file, it stores the style information. So that way you can send that text file to someone, and it's sent as raw text, or you could compile it or whatever, but you could actually edit it in different colors and different fonts and whatnot.



Indexing of attributes is what enables queries. What this means is that we've talked about some of these attributes, such as a comment or keywords or age or who an E-mail is from. When you index them, you can perform queries, so you could say, give me all E-mails that have a from line of, you know, dominic@be.com, and the system maintains the indices for you. So any time an E-mail is created and a from attribute is added, it's also added to the index, and then you can later go back and issue queries about them.

The query language has standard comparison operators: equals, not equal, less than, less than or equal, greater than or equal to, all those sorts of things. It's not really an SQL database, however. Some people, you know, we've used the term "database" in the past. It has a lot of the functionality of that, but it's not a strict database per se. You know, the entries are really files in the file system and all have names. Some people really want things to be hidden, just little records that are very inexpensive, and it's not quite intended for that.

The other main feature is the notion of live queries. So you can, for example, say, show me all files that have been created in the last 15 minutes, and you can say, I want this query to be live, and what that means is that any time a new file is created, since it would be in the last 15 minutes, you will receive an update about that.

You could say -- you could monitor a directory, and you could say, show me any file that's created in this directory. So, for example, the print server works that way. When you drop a file into the print directory, into the spool directory, it's simply monitoring that, and it pulls the file out and says, oh, there's a new file to be printed.
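A live query is essentially a standing subscription: a predicate plus a notification channel. Here is a small sketch of that shape. The class and function names are invented; the real BeOS mechanism delivers BMessages to a BLooper/BHandler rather than invoking callbacks directly.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy sketch of a "live query": callers register a predicate plus a
// callback; whenever a new entry appears, every matching callback fires.
struct LiveQuery {
    std::function<bool(const std::string&)> matches;
    std::function<void(const std::string&)> on_match;
};

class EntryMonitor {
public:
    void subscribe(LiveQuery q) { queries_.push_back(std::move(q)); }

    // Called by the "file system" side when an entry is created.
    void entry_created(const std::string& name) {
        for (auto& q : queries_)
            if (q.matches(name)) q.on_match(name);
    }
private:
    std::vector<LiveQuery> queries_;
};
```

A print server in this model would subscribe with a predicate matching its spool directory and print each name it is handed.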



Journaling. As I said, it preserves file system integrity while not compromising performance. The idea is that when the file system is doing update operations such as creating a file, deleting a file, those sorts of things, you have to be careful, because should the power go out or the system crash or anything happen, the drive catches on fire, well, you're probably not going to get your data if that happens, but anything else, you don't want the file system structures corrupted.

And so you use a technique called journaling, which essentially takes the changes that you are about to make and writes them to one location on the disk, and then later on they're flushed out to the appropriate location, you know, whenever the system has time, and then that journal entry is recorded as being completed, when the last of those changes is written to the correct location.

What's important about this is that because you're writing everything at once to one location on the disk in a contiguous write, that's very efficient. You can then lazily flush things out, and so this actually can be a bit of a performance win. It is not as much of a performance win as some systems, such as Linux, which have no checking whatsoever. They are completely unsafe, very, very fast, but completely unsafe, as I found out on our new server when it corrupted its file system.

At the other end of the spectrum are most UNIX file systems, which do synchronous writes when you change a directory or when you create a file, those sorts of things. Journaling is much faster than that. It sits in between that and something like the ext2 file system.

Transactions, of course, happen at most once. You can never wind up with, like, the same file in two places, because of journaling. The log playback is simple and fast. Essentially, no matter what the size of your disk or how many files you have on it, should you just shut off the power or reset the machine in the middle of doing something, all that has to happen is the log has to go through and say, this block goes here, this block goes here, and so on, and that takes at most 30 seconds in the worst case I could imagine, which would be 2 megs worth of blocks to play back.
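The journaling discipline described above -- record a transaction's blocks contiguously in the log, lazily copy them home, retire the log entry, and on recovery just re-apply completed entries -- can be sketched in a few lines. This is a conceptual model only, not the BFS implementation; the types here are invented.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Toy write-ahead journal: a transaction's modified blocks are first
// recorded contiguously in the log, then lazily copied to their real
// locations; the log entry is retired only after the last copy lands.
struct Block { uint64_t addr; std::vector<uint8_t> data; };

class Journal {
public:
    // Phase 1: one contiguous write of the whole transaction to the log.
    void commit(std::vector<Block> txn) { log_.push_back(std::move(txn)); }

    // Phase 2: flush logged blocks to their home locations, then retire.
    void checkpoint(std::map<uint64_t, std::vector<uint8_t>>& disk) {
        for (auto& txn : log_)
            for (auto& b : txn) disk[b.addr] = b.data;
        log_.clear();
    }

    // Crash recovery is the same walk: "this block goes here".
    void replay(std::map<uint64_t, std::vector<uint8_t>>& disk) {
        checkpoint(disk);
    }
private:
    std::vector<std::vector<Block>> log_;
};
```

Because commit is one contiguous write and checkpointing is lazy, the common case pays for a single sequential I/O up front, which is where the performance win comes from.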



The next slide. Performance. Big long writes, so in terms of what you can do to get performance out of the Be file system when you are programming and using the BeOS, big long writes as well as reads are good. Of course with writes it helps maintain file contiguity.

So if you were to do I/O reading one byte at a time, you would find the performance abysmal. If you read 64K at a time, you would be running at close to the drive's maximum. In fact, we found that, well, certainly in a clean file system, we have gotten within 95 percent of the raw disk bandwidth, but even with a file system that has been in use for some time, you can get within 85 to 90 percent of the raw disk bandwidth, which is quite good, simply by doing large reads of 64K or greater.

You can bypass the cache by doing large writes. For example, someone who is working on CD-ROM writer software wouldn't want to write all of that data and have it go through the cache, or even read all the file data through the cache, because it would simply blow the cache out of proportion and it would have no useful data in it, because it is read once and thrown away.

Most of our movie playing, in fact, is not done in memory, it completely bypasses the cache, because when you're reading, as I guess some of you saw at the demo earlier today, 12 movies from disk on the Intel demo, that data -- you can't keep it all cached in memory. At least we don't cheat and keep it all in memory, that's not fair, and it's pointless to keep it in memory, so you simply bypass the cache by doing I/O's that large.

I/O's that are a multiple of the file system block size are good. The minimum block size is 1K, so you are pretty much guaranteed that if you do I/O's of 1K or 2K, multiples of that are always good. If you were to do like 17 bytes at a time, you would find that to be very inefficient.

So, just -- these are things that it's important to be aware of. You know, buffering always does make a difference. And this is about how you get performance, that's our story.

The POSIX fopen, the plain old open API, and the BFile API, which I'll talk about next, have little or no performance difference. The BFile API, which I'll talk about again in a minute, is not buffered, whereas fopen is. So if you do have to read 17 bytes at a time, it is probably wise to use fopen or to build a buffering scheme on top of the plain BFile API.
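A buffering scheme of the kind suggested here is straightforward: serve small reads from a block-sized buffer that is refilled in large chunks. This is a sketch layered over a byte vector standing in for an unbuffered file; the class is invented, not a Be API.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <vector>

// Minimal buffering layer of the kind you might put on top of an
// unbuffered file API: tiny reads are satisfied from an internal
// buffer that is refilled in large, efficient chunks.
class BufferedReader {
public:
    explicit BufferedReader(const std::vector<char>& backing,
                            size_t chunk = 4096)
        : src_(backing), chunk_(chunk) {}

    // Read up to len bytes; refills the internal buffer as needed.
    size_t read(char* out, size_t len) {
        size_t done = 0;
        while (done < len) {
            if (bpos_ == buf_.size() && !refill()) break;
            size_t n = std::min(len - done, buf_.size() - bpos_);
            std::memcpy(out + done, buf_.data() + bpos_, n);
            bpos_ += n; done += n;
        }
        return done;
    }
    size_t refills() const { return refills_; }  // how many big reads we did
private:
    bool refill() {
        size_t n = std::min(chunk_, src_.size() - spos_);
        if (n == 0) return false;
        buf_.assign(src_.begin() + spos_, src_.begin() + spos_ + n);
        spos_ += n; bpos_ = 0; ++refills_;
        return true;
    }
    const std::vector<char>& src_;
    size_t chunk_, spos_ = 0, bpos_ = 0, refills_ = 0;
    std::vector<char> buf_;
};
```

Hundreds of 17-byte reads collapse into a handful of 4K refills, which is exactly the trade fopen's stdio buffering makes for you.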



Now we'll talk about the file system APIs. So there is the standard POSIX API, straightforward. You know, if you've got UNIX code, or code from anywhere in the world, really, a platform that uses fopen or open or read and write, it works, straight away, no problems. We have also added several extensions to support some of the features that we have, such as attributes and indexing, and not so much in the query aspect, but a little bit there as well.



We also provide an object oriented C++ API, one of the authors of which just walked in, Doug Foley. Next slide.



So POSIX API extensions for reading attributes and writing attributes, read attr and write attr, are pretty straightforward. To create an index and remove an index, it's about as simple as you can imagine. Essentially you give what the index name is, which matches an attribute name, and what its type is, and an index is created and maintained for you. Nothing else has to be done. Then you can issue queries on that index and so on.



So, C++ API design goals. One is to balance efficiency and ease of use; another is to keep consistency between the API and the underlying file system -- you know, what are the notions that the file system exports, and what do you have at the top level. Transparent support for external file systems: you shouldn't have to use a different API to access HFS or NFS or DOS file systems, et cetera. As well as to appease Jon Watte, since he is one of our favorite guys at Metrowerks.



The key concepts in the C++ API are that there are entries and there is data. When you think of a file system, there is a hierarchical structure to it, and there are named entries in it. Some of those names are directories, some of those names are files, and some of those names can actually be devices in the Be world, and there is associated data with them. So the directory contains, of course, named entries, and each entry has associated data.

Now, for a directory, the associated data is the list of things that are in it. For a file, that data is what's in it, whether it's a tiff file, the contents of the image, or a word processing file, et cetera.

The difference between directories and files is provided by BFile and BDirectory. There are two concepts here: the entry, the notion of a BEntry object, and the data, which you reach through a BFile object -- those are two separate things, sorry. This is where I stumble.



The BEntry class is a named entry in a directory, and it simply refers to an entity. So when you have a hierarchy -- say the BeOS folder has a directory in it named system, and that has other directories inside it -- those are entries in the file system, and you can refer to them with the BEntry class. With the BEntry class you have access to the name and the location in the hierarchy. You can change the name, you can get the parent directory, move, remove -- all the meta operations that operate on the entity as a whole, the name of it, et cetera.



The BFile class is what you use to get at a file's data, and you get it from the BEntry class. You create a BFile object from the entry. This gives you access to read and write the data, read and write the attributes, and so on, but you can't actually change the name of the file with only the BFile class.



BDirectory allows you to iterate through all the entries in the directory. You can look up whether a specific entry exists, as well as you can create new entries, so you can get a new BEntry from a BDirectory.



The BQuery class, which we didn't talk about very much at the previous developer conference, because it didn't exist yet, is the access to file system queries. There is a stack-based push-pop interface which is similar to the way things existed in the DR8 world; that is, you push on operators and query pieces to build a query. As well, there is a regular infix string notation, so you can literally say name equals star foo star, and that will return to you the results of that query.
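The push-pop style mentioned here is essentially reverse-Polish predicate building: push leaf tests, then pop two and push their combination. Here is a toy version of that shape -- the names are invented for illustration and are not the actual BQuery methods.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Toy push/pop (RPN) query builder in the spirit of the BQuery stack
// interface: push attribute tests, then combine the top two predicates
// with AND/OR. A "file" here is just a map of attribute name -> value.
using FileAttrs = std::map<std::string, std::string>;
using Pred = std::function<bool(const FileAttrs&)>;

class QueryBuilder {
public:
    void push_eq(std::string attr, std::string value) {
        stack_.push_back([=](const FileAttrs& f) {
            auto it = f.find(attr);
            return it != f.end() && it->second == value;
        });
    }
    void push_and() { combine([](bool a, bool b) { return a && b; }); }
    void push_or()  { combine([](bool a, bool b) { return a || b; }); }
    Pred pop() { Pred p = stack_.back(); stack_.pop_back(); return p; }
private:
    void combine(std::function<bool(bool, bool)> op) {
        Pred b = pop(), a = pop();
        stack_.push_back([=](const FileAttrs& f) { return op(a(f), b(f)); });
    }
    std::vector<Pred> stack_;
};
```

An infix string like "from == x && status == y" is just another front end that produces the same predicate tree.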

You can also specify that you would like a query to be live, and the BQuery class integrates with the rest of the Be APIs, which if you are new to the BeOS, there is the notion of loopers and handlers which receive messages which are updates. So the BQuery class can be very easily integrated into a GUI program, because it sends you the same sort of messages that the windowing system does, so you receive events that are happening, if a file has been added to or removed from your query.

So, for example, if you say, find me all the files whose name contains foo, and the file that matches that query is deleted, you can find out about it, and you can update your lists appropriately. This is exactly how the tracker works, by the way.



BPath and EntryRef are reference objects. They refer to an entry in the named space. You can use BPath or EntryRef to pass references between applications. So if you need to tell someone else, here, load this file, for example, when you receive a drag and drop message, I believe you get an EntryRef. And so that's the way to pass things. You don't want to store EntryRefs on disk, but that's another story. BPath and EntryRef can be used to create BEntry, BFile, BDirectory objects. So basically from the reference to the actual contents of it, there are constructors that go from one to the other.



Basic information is accessible through methods such as IsFile, IsDirectory. So this is in the BFile?

FROM THE AUDIENCE: Stat.

GIAMPAOLO: Oh, the statable structure, the hierarchy. I'm a pure kernel person.

So, there is a notion -- the class hierarchy has a statable object. If you're familiar with UNIX, the stat function tells you information about something, about a named entry, and BStatable is the base class which provides these things, such as the size of the file. These are all methods of the object, so it's very easy to figure out what the size of a file is once you have a reference to it. And of course, there is also the standard stat structure that is normal to POSIX.
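For readers coming from the POSIX side, the plain stat structure mentioned here looks like this in use; the helper names are mine.

```cpp
#include <sys/stat.h>
#include <cstdint>

// The plain POSIX counterpart of BStatable: stat() fills a structure
// describing a named entry -- size, type, timestamps -- without opening it.

// Returns the file's size in bytes, or -1 on error / not a regular file.
int64_t file_size(const char* path) {
    struct stat st;
    if (stat(path, &st) != 0) return -1;
    if (!S_ISREG(st.st_mode)) return -1;
    return static_cast<int64_t>(st.st_size);
}

bool is_directory(const char* path) {
    struct stat st;
    return stat(path, &st) == 0 && S_ISDIR(st.st_mode);
}
```

BStatable wraps the same information in methods such as GetSize and IsDirectory, which is why both APIs can sit on the same underlying calls.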



Okay. Plug-in file systems. This is switching gears completely and moving more into my realm, which is if you wanted to write a loadable file system, for example, to access, I don't know, NTFS for whatever reason, or what we've done internally, HFS -- these are all plug-in file systems. The kernel knows nothing, actually, about any file system, except for the root file system, which is purely virtual; it's a container to hold other file systems.



So, next slide. File system handlers are add-ons which the kernel loads dynamically; even the Be file system is not part of the kernel -- it's simply loaded at boot time, actually by the boot ROM. And the file system provides the notion of files and directories that are exported all the way to the user level. There are examples of these, like the native file system, HFS, DOS, NFS -- any file system that you can imagine, the API is such that you can plug it in.



A volume is a set of files and directories typically on a physical device, although, as I'll talk about, that's not always true. It is served by a file system handler. It can be mounted and unmounted. So when you mount a volume, it doesn't have to necessarily be at the very top level. Like on the Macintosh, you always have named entries, the volume has a name, and it is always at the very top level. We actually support mounting anywhere in the hierarchy, although it's not generally used.



There is an API, a single API that's presented by the kernel and is propagated up, and then at the lower level, there is an API that the file system handlers plug into and that basically unifies everything. So you don't have to necessarily -- like, for example, the Mac HFS file system doesn't support queries, and so it simply doesn't implement those functions, and they return an error if you try to call them. But it's only one API that you have to program to.



File systems can support things like attributes and queries -- and MIME file types are basically attributes -- as well as indexing and so on and so forth. The API for file systems is such that if you have that functionality, it can be plugged in, and if you don't have it, you don't have to implement it.

The file system determines which features are supported. The ISO 9660 file system, which is really very useful to have, and we do have it, implements very, very few of the file system API calls. It implements enough to essentially iterate through a directory and read data. Clearly you don't need to do writing to CD-ROMs.
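One common way to structure this kind of plug-in interface is a table of function pointers, where an operation a handler leaves empty simply reports "not supported." The shapes below are invented for illustration -- this is not the actual BeOS file-system-handler interface.

```cpp
#include <cstddef>

// Sketch of a plug-in file-system handler table: the dispatch layer
// calls through function pointers, and a null entry means the handler
// doesn't implement that operation -- the way a read-only or query-less
// file system omits calls it cannot honor.
enum Status { OK = 0, ENOTSUP_FS = -1 };

struct FsOps {
    Status (*read)(const char* path, void* buf, size_t len);
    Status (*write)(const char* path, const void* buf, size_t len);
    Status (*open_query)(const char* predicate);
};

// One API on top, many handlers underneath.
inline Status fs_write(const FsOps& ops, const char* p,
                       const void* b, size_t n) {
    return ops.write ? ops.write(p, b, n) : ENOTSUP_FS;
}
inline Status fs_query(const FsOps& ops, const char* pred) {
    return ops.open_query ? ops.open_query(pred) : ENOTSUP_FS;
}

// A CD-ROM-style handler: reads work, writes and queries don't exist.
inline Status cd_read(const char*, void*, size_t) { return OK; }
const FsOps cdrom_ops = { cd_read, nullptr, nullptr };
```

Callers program against one API; unsupported operations just come back with an error, as described for HFS and queries above.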



Virtual file system handlers are ones that are purely in memory, so they don't actually represent things that are on a disk. As I alluded to, there is the root file system, which is a container. You can create directories, and you can create subdirectories in there, and you can create symbolic links, but there is no actual data storage associated with those things. It is only maintained in memory. This is useful, as I'll get to later, and there are things like the proc file system from UNIX. You can envision many things that you would want to represent for the file system handler that become part of the name space but aren't associated with real storage.



The powerful API that we have allows you to do things like /dev; so for us, the devices that are available on the system are implemented as a directory hierarchy. For example, there is /dev/disk -- and we have tended to shy away from the cryptic UNIX names, even though we have /dev -- so you have dev, disk, scsi, and the SCSI ID, and so on and so forth. Pipes are another file system. One we plan to implement in the future is /proc, which will allow you access to process information and information about the kernel. You can export many, many things that are synthesizable through this.



This allows us -- the key benefit of having the plug-in file system architecture is that it allows us to support external file systems, ones that we haven't written. Like I said, as part of the Be Masters Awards we've already received DOS file system support -- things that the kernel doesn't know about, and it can all just be plugged in. I'll get to questions in a minute.

And you can also do powerful virtual file systems. Eventually, like I said, /proc has the potential to open up a great deal of functionality, and it's a pseudo file system.



So, in summary, the big things about BFS are that it supports attributes, indices, journaling, queries, and 64-bit volumes and files. These are all things that you would pretty much expect in a modern file system. The key things, I think, about the Be file system are that it has attributes and indexing -- it actually indexes those attributes. For example, NT allows you to have other attributes on a file, but it doesn't necessarily index them, unless that's changed recently. And we have, you know, a full query language to allow you to access those things.

We also provide the standard POSIX APIs. Another good thing is that because we have been able to start from scratch, access to 64-bit files doesn't require using a funky API. You don't have to do anything special. It simply works out of the box. You know, if you have written your program once, you don't have to change it because, well, now I need to access files greater than 2 gigabytes.

The object-oriented C++ API is there for people that are working in the C++ level and want to maintain consistency. And it works, you know, side by side, you know, with the standard POSIX API. They both use the same underlying calls, so there is no real difference. It is just the style of interface.

And our plug-in file system support -- it got chopped off -- supports a variety of file systems that implement different functionality for access to different systems. As for network file systems: during the design of the plug-in file systems, we spent a great deal of time focusing on how to make sure this would work with NFS, because network file systems are very important.

So I have roared through my slides, and I believe that is the last one, so I'll open it up to questions.

FROM THE AUDIENCE: I have two questions. First, where is the log data for the journal stored? Another well-known computer vendor that has a journaled file system has a separate partition on their disks that has the log data stored on it. Where do you keep yours?

GIAMPAOLO: The log is stored generally at the beginning of the file system. There is a section of the disk that's reserved when the file system is formatted. The log size is 2 megabytes currently. The location isn't particularly fixed, and actually, I have been meaning to experiment with moving it around. It's not a separate partition; it's just there.

Actually, moving it into a separate partition, or even better a separate spindle, i.e., a different disk, is what you would really want to do for performance, although we haven't explored that yet.

FROM THE AUDIENCE: Are there options for that, to do that in the future?

GIAMPAOLO: In the future. Right now, when formatting a file system, there is one option essentially, which is how big the block size is. As we get into more of those performance-oriented sorts of things, we will probably move it elsewhere. It hasn't actually proved to be a big bottleneck in many things.

FROM THE AUDIENCE: Would you correlate, say, if you did move the log to a different partition or a different spindle, even, would you be correlating logs with individual other file systems, or would they have a generic log for all file systems?

GIAMPAOLO: No, it would be one per file system. There is no way you could merge different logs.

FROM THE AUDIENCE: You were talking about creating indices for attributes. When you are creating an index, do you tell it what -- you said that the OS handles keeping that updated. So then in effect what you do is you create it and then tell it what attributes you want it to have in that index?

GIAMPAOLO: No, an index is only for one particular attribute. This is something that I have yet to find a good way to really explain. An attribute, like I said, such as a comment -- when you choose to have the comment attribute indexed, you would create an index whose name is comment. Then any time a file is written that has an attribute whose name is comment, it's also added to that index. The types have to match as well. And that's what I mean by it is maintained for you by the system. So one index matches one attribute, and then you build queries such as comment is equal to this, and name is equal to something else.

FROM THE AUDIENCE: Is it possible to have two attributes with the same name and different types?

GIAMPAOLO: Not on one file, no.

FROM THE AUDIENCE: But on different files?

GIAMPAOLO: Yes.

FROM THE AUDIENCE: What happens if your journal space gets full?

GIAMPAOLO: The journal space never will actually get full. It's a circular buffer that wraps around.

FROM THE AUDIENCE: This is even more worrisome.

GIAMPAOLO: No. Let me finish. It really does work. So each log transaction is at most 128K, and each one of those goes into the journal. As the journal fills up, then you have a list of blocks in memory that have not been flushed yet. If the journal becomes full, the cache is flushed, and that allows those transactions to complete and free up space. So the journal can never become full.
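The "can never become full" argument can be made concrete: when a new transaction would not fit, older entries are checkpointed first to free space. Here is a toy model of that policy (sizes only, no real I/O; the class is invented).

```cpp
#include <cstddef>
#include <vector>

// Toy circular journal: fixed capacity, transactions appended in order;
// when a new transaction would not fit, the oldest entries are flushed
// (checkpointed) first to free space -- so the log can never overflow.
class CircularJournal {
public:
    explicit CircularJournal(size_t capacity) : cap_(capacity) {}

    void append(size_t txn_bytes) {
        while (used_ + txn_bytes > cap_ && !pending_.empty())
            flush_oldest();  // forced checkpoint frees log space
        pending_.push_back(txn_bytes);
        used_ += txn_bytes;
    }
    size_t used() const { return used_; }
    size_t flushes() const { return flushes_; }
private:
    void flush_oldest() {
        used_ -= pending_.front();
        pending_.erase(pending_.begin());
        ++flushes_;
    }
    size_t cap_, used_ = 0, flushes_ = 0;
    std::vector<size_t> pending_;
};
```

The cost of a forced flush is the "time hit" the questioner asks about next: it's I/O you would have had to do anyway, just brought forward.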

FROM THE AUDIENCE: So what we have basically is you have a time hit; things would flow down as they got full. You would have to have some time to do some I/O.

GIAMPAOLO: Sure. I mean, it's not infinite in size. You would have to flush things. But, I mean, you have to do the I/O no matter what. It's not like all of a sudden there is this big gap in time.

FROM THE AUDIENCE: I was thinking for like video applications if you are writing video out.

GIAMPAOLO: Actually, video is almost a nonissue, because you tend to write one file and make it really big, and so that the -- the way the transaction buffer works, the transactions to the same file tend to get coalesced, and filling up that 128K transaction buffer will take you a long time, and then you still have 2 megs minus 128K to go before you would care about it. You would write an awful lot of data before that became an issue.

FROM THE AUDIENCE: The second question I had was, with these attributes you have, is there any performance advantage or disadvantage between storing large amounts of data in attributes versus in the main file structure?

GIAMPAOLO: Not really. The API is slightly different to read and write data from an attribute, and so there is a slightly different code path, but underlying it, it uses the same data structures. So if you're storing like 100 megabytes in an attribute, it's about the same.

The only difference is that you have to explicitly specify the position that you're writing at in the attribute, so every write comes into the file system; there is no buffered API for attribute I/O yet. It's a trivial thing to add.

FROM THE AUDIENCE: Now that the user is given the option of initializing the drive with a block size as small as 1K, why would somebody want anything bigger?

GIAMPAOLO: If you are doing video, for example, having larger block size might be more efficient, because it takes fewer blocks to represent a larger file. So that's the main reason. The block sizes are 1, 2, 4 or 8K. Right now that's the maximum.

FROM THE AUDIENCE: So like for optimum performance with video, you would recommend having a separate drive initialized with a larger block size?

GIAMPAOLO: You could. I haven't actually measured to see what the real difference is. 1K blocks, like I said, provide within 95 percent of the raw disk bandwidth, so I haven't bothered to go much larger than that myself. We have done it in the past.

FROM THE AUDIENCE: Can you also have access to the raw disk if you want to?

GIAMPAOLO: Yes, under the BeOS access to a raw device is trivial. In fact, if you're familiar with UNIX, you can basically cat the raw device -- you just open it and read and write it. It continues to baffle me how difficult that is to do under the MacOS. I actually have to port the Be file system -- eventually, I'm going to -- to run on the MacOS, so that you can copy files from the Be file system to the MacOS. And the Be file system is totally portable. I mean, to do simulations, I just run it inside of a file under UNIX. I was working remotely for a while, and I only had access to a Sun workstation, and I just did it as a raw file there.

You can access any raw device under the BeOS trivially. So literally you can use open or you can construct a BFile with the name of a raw disk and just access it and just open, read and write.

FROM THE AUDIENCE: If you are creating lots of new files, and they have attributes which are indexed, is the creation of the files slow -- essentially are inserts slow?

GIAMPAOLO: Sure. The more indices and the more attributes you have -- so, if you add an attribute to a file which is indexed, it has to be added to the index, so it's going to be slower than if you add an unindexed attribute. Just for general numbers, I've run some file system benchmarks which time creation and deletion of files, and we've gotten around 400 file creates a second and, I think, 1,500 deletes per second for zero-byte files, and from there it goes down as you write more and more data. So those are pretty reasonable numbers.

And the name, size and last modification time are all indexed. So any time you create a file, you actually insert into the name index, the size index and the last modification index.

FROM THE AUDIENCE: Can you talk a bit about small attributes?

GIAMPAOLO: Yes, there is actually a small trick that we can do to improve the performance of small attributes. Without getting too much into the gory file system details, there is a fast area for attributes, which is about 760 bytes or so, and things such as the tracker's location for an icon tend to be stored there, because they are small and they fit there. So if you add like age equals 17, that's a very tiny attribute, you're storing 4 bytes of data, and that tends to go into this fast area. Once that area fills up, then they spill out to the normal attribute area, and they cost a little bit more.
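The fast-area behavior is a simple inline-or-spill policy, which can be sketched as follows. The 760-byte figure comes from the talk; the per-attribute 8-byte header is an assumption of mine purely for illustration, not the real BFS accounting.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Sketch of the "fast area" idea: small attributes are packed into a
// fixed inline region (~760 bytes in BFS); once that fills, further
// attributes spill to the slower general attribute storage.
class AttrStore {
public:
    static constexpr size_t kFastArea = 760;

    // Returns true if the attribute landed in the fast area.
    bool add(const std::string& name, size_t value_bytes) {
        size_t cost = name.size() + value_bytes + 8;  // assumed header size
        if (fast_used_ + cost <= kFastArea) {
            fast_used_ += cost;
            placement_[name] = true;
            return true;
        }
        placement_[name] = false;  // spilled to the normal attribute area
        return false;
    }
    bool in_fast_area(const std::string& name) const {
        return placement_.at(name);
    }
private:
    size_t fast_used_ = 0;
    std::map<std::string, bool> placement_;
};
```

As the talk notes, the placement is transparent: callers can't force an attribute into the fast area, they just benefit when it fits.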

FROM THE AUDIENCE: Is there an advantage to using short names?

GIAMPAOLO: Sure, it compacts things. There is no way to get feedback or to force something into the fast area. It is meant to be transparent, which is good and bad. It's good because you don't ever have to think about it; it's bad because if you really wanted to lock something there, it's not possible. Adding the ability to do that would actually not be terribly hard.

FROM THE AUDIENCE: You said age equals 17 is 4 bytes?

GIAMPAOLO: Well, the value data is only 4 bytes. Sure, the name -- the actual attribute structure, I mean, there are 3 bytes for the name "age," and if it is stored in the small area, that would be about 12 bytes total.

FROM THE AUDIENCE: Is there any way to obtain quality of service guarantees other than raw access to the disks?

GIAMPAOLO: If you're just accessing the raw device, it is as fast as it is; that's the quality of service. There is no notion of guaranteed-rate I/O or anything like that. You may be talking about IRIX or something; they have guaranteed-rate I/O. No. If you're accessing the raw device, and you're the only person accessing it, then you get it as fast as it comes off the platter, essentially. But the file system doesn't provide any guaranteed rate.

FROM THE AUDIENCE: Is there any advantage to partitioning a particularly large drive, and what is the optimum partition size?

GIAMPAOLO: There is no advantage to partitioning a large drive really. I mean, the only thing I can think of is that, you know, if you knew that one drive was going to have many, many, many temporary files, you may not want to pay the cost of doing the indexing -- and you can't turn it off in the Preview Release, I don't think, but there is going to be a way to turn off indexing -- because then the performance numbers actually jump considerably.

I mean, indexing is a very nice feature, but it does cost. So that would be the only reason that I could think of that you would want to split things up.

FROM THE AUDIENCE: Can you give an estimate of how much it jumps?

GIAMPAOLO: I haven't run the benchmarks. I forget. I think it was like 700 file creates a second and like 2500 deletes, but don't quote me on that, because I don't remember.

FROM THE AUDIENCE: Would you comment on security features in the Be file system, things like shredding and access control?

GIAMPAOLO: Shredding on delete, NSA type stuff. We don't actually go back and rewrite the data like 30 times to make sure it has really been erased. I mean, it's possible to implement, but we don't do anything like that currently.

Access control lists, attributes are perfect for that. That's not something that we've focused on just yet. There is standard UID, GID style protection which is actually honored by the file system. The rest of the system has a long way to go with protection and security, and that's actually something we are working on for multiuser, but the user ID and the group ID of files are honored, as well as the protection bits. So if the file says you can't read it, you can't read it, and access control lists almost certainly will be added later.

FROM THE AUDIENCE: How easy would it be to layer a virtual file system on above BFS to provide security?

GIAMPAOLO: PGP, they were working on that. Well, they were actually doing it underneath. They were providing an encrypted disk device, and then the Be file system would work on top of that. There is still a few gotchas with that, and if someone is interested in working on that, we try to work very closely with people on that sort of thing. It's not a stackable file system layer per se, and we have to do a few little things internally to make that even easier. Technically it will work, but there is still a little bit to go.

FROM THE AUDIENCE: The file forks, does that separate into forks like a Mac?

GIAMPAOLO: You can think of attributes as different forks of a file, but in the physical implementation there is the notion of a data stream: there is the primary one, which is the main data of the file, and then the attributes. You can think of it as an N-forked file system, if you want, which is I think sometimes how NTFS is referred to, because each of the attributes can be as large as you want. So you could treat an attribute named resource as a resource fork. Does that answer your question?

FROM THE AUDIENCE: I was curious, if you are going to store that file in a different file system, on UNIX --

GIAMPAOLO: If you were to copy it, how do you preserve the attributes? That's a very difficult problem, and what we've tried to do is separate those: if you were just to tar up the file using UNIX tar, or even the shell copy, it does not preserve attributes. There are Be tools that understand attributes, such as the tracker; when you copy a file, it does copy attributes. And then there are, in quotes, dumb tools that just know how to deal with the data segment.

In terms of archiving utilities, Chris Herborth, our favorite Be developer, has done zip so that it will archive attributes of files, so that you can preserve them. And Jon Watte, in his ever-coding fashion, has also come up with a slightly more modern, shall we say, file archiving protocol which supports greater than 2-gigabyte files and arbitrary numbers of attributes, because zip does have a few limitations there. So the way we tend to see it is, use the right tools to compress things and to move them if you need to.

FROM THE AUDIENCE: How extensive is the access to drivers and hardware? You mentioned /dev earlier. Is it full access, or is it just enough to support POSIX?

GIAMPAOLO: I'm not quite sure what you mean. I mean, you can go in there and type ls -lR, if you want.

FROM THE AUDIENCE: Is the model that every chunk of hardware, every device driver is going to have an entry in the /dev drive?

GIAMPAOLO: Yes, and they're dynamically publishable and you can change it. For example, we're not supposed to, but we really do hot-plug SCSI devices, and you can rescan the device tree, and they show up. I have only fried like one drive doing that, which isn't bad for ten months in development.

So the device hierarchy is actually fairly nice, because, like I said, it is very dynamic and you can change it. Take something like the pipe file system: normally you don't think of pipes as being implemented as a file system, but the idea is that you can have any number of pipes active at a time, and so it simply creates names in there, which show up dynamically.

I can show you privately, if you want, afterwards, because it's hard to type and all that sort of stuff on the screen.

FROM THE AUDIENCE: Do you have any plans to support something like the POSIX .4 AIO functions?

GIAMPAOLO: We do have some thinking about asynchronous I/O. His question was about POSIX AIO, which is asynchronous I/O -- that is, do this I/O and tell me when you're done. It hasn't been completed yet. We're shooting to try to get it in for the next big release, because that's very important for high bandwidth and for many, many things. So we would like to get that done. We're trying.

FROM THE AUDIENCE: What's the deal with the temp directory; is that actually purged at some point?

GIAMPAOLO: Yes, at boot time, the boot script cleans out anything in temp. I argued with Cyril about that. I didn't think it was a good idea. He wanted it cleaned out, so it got cleaned out.

FROM THE AUDIENCE: (Inaudible)

GIAMPAOLO: Actually, if you look in the boot script, there is a command to check it and remove anything below /temp if it exists.

FROM THE AUDIENCE: Is there any facility for combining automounting of disks with something like fstab?

GIAMPAOLO: Disks are automounted, and in the tracker -- this one is actually easy to show, so I'll do it. You have mount settings, so you can choose what to mount at boot time, and whether the tracker should automount nothing, or poll removable media for new disks and automatically mount them if they are BFS, or if they are anything at all. So you can choose it that way.

FROM THE AUDIENCE: What I mean is, let's say you wanted some disks to be automounted and others mounted in a specific place. Is there a way to fix that?

GIAMPAOLO: In that case you would add a command to the boot script. There is a user-extensible portion of the boot script that runs fairly early on that you can use to automatically mount things. You can add the commands there to mount it, which is essentially mount and then whatever the device is. It's not the same as fstab, though. We have tended -- I have a very strong UNIX heritage, but we didn't try to clone UNIX, so we probably won't ever have an explicit fstab file.

FROM THE AUDIENCE: Can you talk about some of the differences, both positive and negative, between BFS and NTFS?

GIAMPAOLO: Where to begin? There is actually a great deal of similarity between the two file systems. I didn't really know a lot about NTFS when I started this. I just had some of my own experience from working in grad school and stuff. And it turns out that they do many things similarly. I believe they use BTrees to store directories. I mean, you know, the details of that, of course, are obscure -- actually, I think they're publishing a book on that.

There is very little information available at the level a real file system implementer would need. I don't know. I know that they support attributes; I don't know whether they can be arbitrary size, whether you can have any number of them, or whether you can index them.

The stuff I've seen doesn't seem to imply that, but it's hard to find real good information about it. They're a 64-bit file system; we're a 64-bit file system. They have done a lot more work on support for fault tolerance and RAID subsystems. That's not really file system per se; that's more of the supporting stuff. I'm trying to think of other stuff. Offhand, I can't think of any other major differences.

FROM THE AUDIENCE: Do you do disk mirroring?

GIAMPAOLO: No, not currently. Things like that we sort of view as more third-party opportunities, you know, writing a driver to do real disk mirroring, that kind of thing. We don't have the engineering resources really to tackle that. We probably will try and do some kind of simple RAID solution, just because I really want to see raw CCIR 601 video coming off the disk. I would like to see like 40 megabytes a second through the file system, hopefully soon.

FROM THE AUDIENCE: Where would you do this, somewhere underneath the file system?

GIAMPAOLO: That would be done at the driver level, sure. You would implement a RAID driver, which would be configured to open up the other drivers that do SCSI and then use them appropriately, and that comes back to the asynchronous I/O question, and so on, because you need asynchronous I/O to do RAID correctly.

FROM THE AUDIENCE: How will the Intel port affect the file system?

GIAMPAOLO: It doesn't at all. You can't currently take a disk from a PowerPC big-endian machine, plug it into a little-endian system, and have it work. The file system does know about endianness; that's actually recorded in the root block along with a bunch of other things. So the information is there; I'm just not sure what the plan is going to be, whether you will actually be able to take one disk from any system and go. There will be a performance hit, because you obviously have to convert the data, and that can be expensive.

Other questions?

FROM THE AUDIENCE: What about removable media with that issue? What if you have someone with the BeOS on Intel, and another with the BeOS on a PowerMac, and they want to put some files on a floppy or a Zip disk and move them between machines?

GIAMPAOLO: That's probably going to be the driving force for recognizing it. In that case, performance isn't supercritical. You're not writing real-time video to a disk like that; you're just copying files. So I believe we're going to have to support it. I haven't done any work on it yet. I have just been catching a breather the last couple of weeks.

FROM THE AUDIENCE: In that situation, how do you prevent every application from having to get involved in the details of endianness conversion?

GIAMPAOLO: The file system would never convert an application's data. It can't; it simply doesn't know. But I'm talking about the file system data structures. So, for example, when you read an i-node from a disk, the block numbers of the blocks of the file have a certain format, and those would need to be byte swapped when you are on a machine of different endianness. That's the part, and that recognition and swapping hasn't been done yet.

Other questions?

Well, we whipped through this demo pretty quickly. Feel free to come up and talk to me afterwards. I can show you more things. If you want to see something about /dev, or have more questions about performance or particular APIs, feel free to come up and ask.

(Applause)

(End of session)

