BE ENGINEERING INSIGHTS: Creating an Internet Search Engine By Russ McMahon russ@be.com
The most rewarding thing for me as a contributor to the Be Developer Newsletter is the chance to explore applications and technologies that aren't on any schedule or project master plan. It's an opportunity to write code just for the fun of it, and to share the knowledge and experience with others. One long-time interest of mine is Internet search engines and Internet robots. I thought it would be cool to scan the Web to retrieve particular items and then process the returns using my own rules and conditions. I know there are services out there to do just that, but I wanted something I could control, so for this article I built a simple Internet search engine.

I'd like to thank Benoit Schillings for his article "Mining the Net" <http://www.be.com/aboutbe/benewsletter/volume_II/Issue40.html#Insight>. In it, Benoit lays out a "site getter" class and goes into detail about the Internet search process. I also want to thank James Clark for his awesome SGML parser/normalizer NSGMLS. These tools are definitely worth checking out.

Currently, most people use large-scale search engines such as Lycos, Yahoo, Alta Vista, and others. These have huge databases that hold Web links and other information that users can access. Robots -- also called spiders -- roam the Web and feed the local databases, trying to keep an up-to-date representation of the Web.

I wanted to call my project "the embryonic beginnings of a Web 'bot," but I can't, because as you'll see below, my sample engine is really a meta-search engine. Meta-search engines query primary search engines (like those named above) to gather query results. They maintain no local database of their own, and often do pre- and post-processing on the query and the returned results. A good example of a meta-search engine is MetaCrawler <http://www.metacrawler.com/>; it can span multiple search engines and claims to provide users with a more consistent interface.
Another popular meta-search engine is Sherlock <http://www.apple.com/sherlock/>. Sherlock ships with Mac OS 8.5, and replaces the system Find application as a generalized search program. In addition to local searches, it does Web-based searches. Both Sherlock and MetaCrawler can span multiple search engines, but unlike MetaCrawler, Sherlock processes the query results locally through configurable plug-ins.

Well heck, if all a meta-search engine needs to do is contact the main search engines and have them return the goods, that seems easy enough -- so it's time to start coding. In the sample below, a TCP connection is made to www.altavista.com, server port 80 (HTTP). I send a GET request with the path of the CGI script to execute and the parameters to pass to the script. The main parameter of interest is the search keyword, in this case "primus" -- a locally popular rock band. The request line ends with HTTP/1.0, and a blank line terminates the request. I receive back from the server an HTML page with a list of hits matching "primus." The page is sent to standard out.

    #include <stdio.h>
    #include <socket.h>
    #include <netdb.h>
    #include <string.h>
    #include <iostream.h>

    class mysearch {
    public:
        mysearch();
    };

    mysearch::mysearch()
    {
        int sd;
        long len;
        char buf[BUFSIZ+1];

        // Resolve host name to IP address
        struct in_addr **pptr;
        struct hostent *hp;
        hp = gethostbyname("www.altavista.com");
        pptr = (struct in_addr**)hp->h_addr_list;

        // Server-side socket
        struct sockaddr_in dest;
        memset(&dest, 0, sizeof(dest));
        dest.sin_family = AF_INET;
        memcpy(&dest.sin_addr, *pptr, sizeof(struct in_addr));
        dest.sin_port = htons(80);

        sd = socket(AF_INET, SOCK_STREAM, 0);
        connect(sd, (sockaddr*)&dest, sizeof(dest));

        // Blank line terminates HTTP request
        sprintf(buf,
            "GET /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search"
            " HTTP/1.0\n\n");
        send(sd, buf, strlen(buf), 0);

        // Terminate each chunk before printing it
        while ((len = recv(sd, buf, BUFSIZ, 0)) > 0) {
            buf[len] = '\0';
            cout << buf;
        }
    }

    int main()
    {
        mysearch mysrch;
        exit(EXIT_SUCCESS);
    }

Why do it this way?
Since searching and filtering tasks are
quite disparate in nature, I've put the search mechanism in
one binary and the query parser in another. I hope this
makes the basic operations clearer, and easier to follow. I
use standard in and out to communicate between the binaries.
I haven't tried to make this robust or of production
quality, as the code serves only to demonstrate my topic.
For the search engine, much of the code is just getting a
client connection set up, and sending and receiving HTTP
data. The above sample goes out to Alta Vista and looks for "primus." However, it's easy to query other search engines as well. Just change the host name, the CGI path, and the parameter string. For example:

To search using Alta Vista:
Host Name: www.altavista.com
Parameters: /cgi-bin/query?pg=q&kl=XX&q=primus&search=Search

To search using Excite:
Host Name: search.excite.com
Parameters: /search.gw?search=beegees

To search using Yahoo:
Host Name: search.yahoo.com
Parameters: /bin/search?search=beegees

Notice that I changed the Excite and Yahoo queries to search for "beegees" rather than "primus." To query using other search engines I haven't listed, such as Infoseek or Lycos, go to the site with your browser and enter your search keyword, then look at the URL that is sent to the Web server (in the location field at the top of the browser window). It should be easy to see what the host name and CGI parameters are. Be aware that there are several ways to send parameters to Web servers; if the simple method described above doesn't work, a deeper understanding of CGI and HTTP may be required. Dumping the HTML page as text is also a great way to see what's going on.

Another way to understand how CGI and HTTP work is to set up a Web server such as Apache and write some scripts and forms. I recommend doing this before you use the samples and blast away at the main search services.

If the query was successful, an HTML page listing the search results is returned. To confirm that things are working, you can save the page to a file and then open it in the browser. It should appear just as if the browser had launched the query.

Now we're at the heart of things, and it's time to get at the goods! The problem with this phase is that it's very application specific; what I'm after may not be what you're after. Also, each search site has its own results format, which may even change from query to query.
Sherlock uses pattern matching on the returned HTML document to get at items of interest. OK, but we can do better. I use James Clark's NSGMLS application to transform the document to a simpler and more rigid format called ESIS (Element Structure Information Set). The ESIS format makes getting at HTML elements easy. NSGMLS is a command-line utility that uses James's SP library to parse SGML, and also HTML and XML. James provides nice HTML documentation and sources for his parse utilities at <http://www.jclark.com/sp.html>. The port to BeOS was trivial.

For example, the following HTML document:

    <html><body>Your rock candy

is transformed to ESIS by NSGMLS:

    (HTML
    (BODY
    -Your rock candy
    )BODY
    )HTML
    C

In ESIS every element is on a separate line, and the first character on the line is known as the command. The command can be used to decide how to parse the line. In this simple sample that's no big deal, but results returned from the search engines are more complex.

Now we're getting somewhere. If we run the search engine and pipe the output to NSGMLS, the ESIS syntax will be sent to standard out. However, it's the whole page, and I just want to get at specific elements within it. Time to slap down a bit more code.
    #include <stdio.h>
    #include <iostream.h>
    #include <string.h>
    #include <stdlib.h>

    class myfilter {
    public:
        myfilter();
        friend istream& operator >> (istream& s, myfilter& myfltr);
    private:
        int GetNextFieldStr(char* instring, char* outstring);
    };

    myfilter::myfilter()
    {
    }

    istream& operator >> (istream& s, myfilter& myfltr)
    {
        char buf[BUFSIZ+1];
        char cmpbuf[BUFSIZ+1];

        while (s.getline(buf, sizeof(buf))) {
            // Parse the lines using ESIS syntax
            if (buf[0] == 'A') {
                myfltr.GetNextFieldStr(buf, cmpbuf);
                if (strcmp(cmpbuf, "AHREF") != 0)
                    continue;
                myfltr.GetNextFieldStr(buf, cmpbuf);
                if (strcmp(cmpbuf, "CDATA") != 0)
                    continue;
                myfltr.GetNextFieldStr(buf, cmpbuf);
                if (strncmp(cmpbuf, "http", 4) != 0)
                    continue;
                // Found something of interest
                cout << buf << endl;
            }
        }
        return s;
    }

    int myfilter::GetNextFieldStr(char* instring, char* outstring)
    {
        char* ptr;
        char* ptr1;
        char* ptr2;

        ptr = instring;
        if ((ptr1 = strchr(ptr, ' ')) == NULL) {
            strcpy(outstring, instring);
            return 0;
        }
        *ptr1 = '\0';
        ptr2 = ++ptr1;
        for (int n = BUFSIZ; n >= 0; n--) {
            ptr1--;
            if (*ptr1 == ' ')
                *ptr1 = '\0';
            else
                break;
        }
        strcpy(outstring, ptr);
        strcpy(instring, ptr2);
        return 1;
    }

    int main()
    {
        myfilter myfltr;
        // Overloaded extraction operator
        cin >> myfltr;
        exit(EXIT_SUCCESS);
    }

The sample above goes through the page and grabs URLs beginning with http. The URLs are located using the AHREF CDATA syntax of ESIS. So if the search engine binary is called mysearch, and the filter above is called myfilter, then after running the pipeline

    mysearch | nsgmls | myfilter

only URLs starting with http will be sent to standard out.
I know by looking at the URLs output that further filtering is needed to really nail down the "primus" search. This should be easy: I'll just look at the ESIS output and add some more parsing code to the filter operation. I hope you can see from these samples that there's a world of possibilities to explore. Once URL links are returned, they can be further filtered to really customize the results of the query. Also, I go after only http URLs, but within the page returned from the main search engines there's often hit-rating information and other goodies. I'd like to turn the command-line utilities into real applications, search the different engines in parallel using threads, and cross-reference the results. It would also be great to plug the meta-search engine into Be's Find application, and turn the results into NetPositive bookmarks. I didn't make it quite that far in this article, but there's always next time.
BE ENGINEERING INSIGHTS: Something Fishy By Adam Haberlach adam@be.com
I recently implemented a client for the "Distributed Fish Protocol" (DFP). The protocol defines a common environment for transferring simple data objects ("fish") between machines ("tanks"). The protocol was designed as a demonstration of some operating systems created by members of the ACM chapter at UIUC. The entire system is extremely small; the whole thing can be implemented using only broadcast UDP packets. Because UDP is somewhat unreliable, steps are taken to make sure that fish are not lost. For relevant RFCs, check out <http://www.acm.uiuc.edu/sigops/eoh98/dfp/fish.html>.

To handle incoming packets, my DFP client implementation defines the "FishPortal" class. A FishPortal object binds and listens to a UDP port (in a separate thread). When it gets a Fish Protocol Packet, the packet is repackaged as a BMessage, which is sent to and handled by the client's BWindow object. There are two types of messages that need to be handled: the "Tank" message lets a tank locate its neighbors, and the "Transfer" message lets a tank transfer a fish to a neighboring tank.

The Location Message

A tank lives in a 256x256 tank array called "TankSpace." To see if its neighbors are "alive," a tank broadcasts a PING message to its nearest neighbors in TankSpace (top, bottom, left, right), and then listens for a responsive PONG. In my DFP client, I use the new BMessageRunner class to regularly broadcast the PING message. Before every PING, my FishPortal object marks all four neighbors as dead, and then marks them alive if they respond. (A more complex system could be a bit smarter about tracking the identity and states of the other tanks.) Whenever the status of a neighboring tank changes, a BMessage is sent so that the client can update any indicators it uses. My client displays green and red dots to indicate the states of the other tanks.
Still another neat thing for a client to do would be to recognize when it is in the same space as another tank and move itself to another location to avoid tank collisions.

The Transfer Message

When a fish swims to the edge of a tank, the client checks to make sure the neighboring tank in that direction is alive, and, if it is, it sends the tank a Transfer message. If a neighboring tank isn't alive, the fish is thrown back in the same tank and swims off in a new direction. Either way, the client need only send a message to the FishPortal, which takes care of all the details of passing the fish along or throwing it back in the tank.

The FishPortal class also handles a list of "pending" fish, since fish spend some time in limbo while the fish-passing messages are being handled. Currently, there's no provision for other clients denying reception.

Note that I didn't spend much time fine-tuning the display. As a result, my fish swim backwards and flicker. I know of one bug in the code, and I'm sure I will get lots of mail pointing out the ones that got away. If anyone out there has any interesting additions or suggestions (or even new clients), I'm always interested.

The supplied client defaults to a tank located at (3,3). To specify a different location, use the command-line arguments -x <int> and -y <int>. For example, if you have two machines on the same network, just starting a client on one machine with the default location and the other with the line "Tank -x 4 -y 3" should put them next to each other in TankSpace.
DEVELOPERS' WORKSHOP: The Refrigerator Question By Eric Shepherd sheppy@be.com

"Developers' Workshop" is a weekly feature that provides
answers to our developers' questions, or topic requests.
To submit a question, visit
http://www.be.com/developers/suggestion_box.html.
Of all the functions the BeOS provides to programmers, one
of the least understood and most widely debated is
is_computer_on().

Some developers claim that this function always returns 1,
and say that it's a sham. They say their tests have produced
significant empirical evidence to this effect. Others offer
circumstantial evidence, such as "programs can't run when
the computer's off," which doesn't really explain anything.
This is a lot like trying to figure out if the light stays
on in the refrigerator when you close the door. How can you
be sure that is_computer_on() is giving you an accurate answer? If you run a simple program that calls it, you get output reporting that the computer is on.
Now try running the program and shutting off your computer
while it's running. Unfortunately, the program never gets a chance to report its result. In other words, the empirical evidence is a little flimsy.
Due to failures beyond Be's control, test programs such as
the one above can't present their full results. I'm not
saying that is_computer_on() doesn't work as advertised. However, Be guarantees that is_computer_on() returns 1 whenever the computer is on.
As is often the case, this column is in response to recurring
discussions I've had or witnessed, in person or
electronically. Today's topic is what type of developers
we're looking for -- a subject that appears to have been
re-energized by recent announcements around Release 4 and
Comdex.
When major, "brand name" developers such as MGI or Steinberg
announce their support of the BeOS platform, what does it
mean for emerging developers who lack the resources and
market muscle of more well-established companies? Are we
abandoning the little guy and cozying up to the
establishment?
I'll say at the outset that these questions address very
real, permanent issues. The need to give critical mass to
the platform is one issue; the threat of established
developers muscling smaller ones out of the ecosystem is
another; and Be's role, if any, in maintaining balance in
the system yet another.
Meeting the first need -- giving critical mass to the
platform -- causes trouble and ambivalence. On one hand,
seeing a major player on the nascent platform is reassuring,
since it legitimizes it and strengthens the self-fulfilling
prophecy: the platform will take because respectable,
established people believe it will. On the other hand, these
established companies may become ogres that devour fledgling
developers. This is a real issue, but it's only part of the
story. The establishment may be powerful, but it is also
conservative. It rarely shows the most innovation and is
often muscle-bound.
A new platform is like a new musical instrument. You can
transpose existing music for it, or write new music that
exploits its unique expressive power. The saxophone, for
example, solved two problems: it provided better coupling
and impedance between the reed and the air mass; and the
mechanics offered better, faster fingering and a wider range
of notes. You could play Bach fugues on a saxophone (rather
nicely) or explore the then-emerging genre of jazz.
Likewise, with a new operating system, you can write a
PhotoShop clone, or take a more innovative approach to image
creation and processing. If this sounds like "all you need
to do to lose weight is eat less and exercise more," my
apologies. So instead of offering simplistic advice, perhaps
I should address what we can do. For one thing, we're
already working to make BeDepot an effective, inexpensive
distribution channel for Be developers. We still have much
work to do, but our intent is clear: this is equally open to
all; this is a level playing field.
And then there is what we shouldn't do. I still remember the
1985 day when Bill Gates threatened to stop developing
applications for the Macintosh if the license for a certain
piece of software and its look and feel wasn't renewed. Apple's executive
staff met and decided not to go along. One dinner later, the
license was renewed. We can only speculate what would have
happened if other developers had been allowed to take the
room on the Mac platform that Microsoft now occupies. The
lesson in this story for Be is that we know we shouldn't
make decisions that limit innovation and tie the platform down.
Copyright ©1998 Be, Inc. Be is a registered trademark, and BeOS, BeBox, BeWare, GeekPort, the Be logo and the BeOS logo are trademarks of Be, Inc. All other trademarks mentioned are the property of their respective owners.