Programming assignment #4: CSE 100 Fall 2008



>>> Due FRIDAY DECEMBER 5, 11:59 pm PST.

>>> What will be turned in: Index.java, Lookup.java, plus any other .java files in your working directory.


If you are working in a team of 2 students, turn in your assignment from one of your two accounts only. On this assignment, your team can be the same as on one of your previous assignments, if you want.


Getting started


Review the data structures and algorithms you have learned up to this point: graphs, hashtables, stacks, queues, and breadth- and depth-first search will be particularly important in this assignment.


Indexing and searching the web

You have no doubt used internet search engines: you type in a query in the form of a list of keywords, and the engine returns to you a list of web pages, ordered (hopefully) by how well they match the keywords. In this assignment, you will implement the basic functionality used in these keyword-driven search systems.


Your system will have two parts: an indexing utility, which will create a keyword index by searching a graph of web pages; and a lookup utility, which will create lists of "relevant" web pages from keyword queries. The first part will be implemented as a Java program with a main() method in a file Index.java. The second part will be implemented as a Java program with a main() method in a file Lookup.java. Breaking the task into two parts makes computational sense: once the index is created, keyword searches in the index can be performed much faster than searching the web documents directly. But it means that both of these parts are essential to having a working system: without code to read the index correctly, indexing is useless; and without code to produce an index, lookup is useless.

Index

You will write a Java program Index.java that will be invoked with a command line of the form
java Index firstpage searchtype maxpages indexfile
where firstpage is the complete URL of the web page where the search should start, searchtype is either breadth or depth, maxpages is an integer specifying the maximum number of web pages to search, and indexfile is the name of the file that will contain the index. Depending on the search type, the program will do a breadth-first or depth-first search of the web starting with the page with the given URL, visiting no more than the maximum number of pages specified, and produce a keyword index. This index will be written to the named file. The format of the index file will be suitable for reading by the Lookup program, described below.
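
For example, the argument handling in Index.main() might be sketched like this (the variable names here are our own, not required by the assignment):

    import java.net.URL;

    public class Index {
        public static void main(String[] args) throws Exception {
            if (args.length != 4) {
                System.err.println("usage: java Index firstpage searchtype maxpages indexfile");
                System.exit(1);
            }
            URL firstPage = new URL(args[0]);                  // complete URL of the start page
            boolean breadthFirst = args[1].equals("breadth");  // otherwise "depth"
            int maxPages = Integer.parseInt(args[2]);          // maximum number of pages to visit
            String indexFile = args[3];                        // where the index will be written
            // ... do the search, build the index, write it to indexFile ...
        }
    }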


Your program should also print to standard output some informative messages as it is building the index. For example, for each web page it searches, it should print out how many distinct keywords were found in it; and it should print out some summary information at the end (e.g. saying how many distinct keywords and how many web pages appear in the final index).

Constructing a web page keyword index

A keyword index is a dictionary or table structure that associates keywords with documents in which they occur. That is, the index is a structure that, when given a keyword, returns a list (or some other structure) containing the names of documents that contain that keyword. For this assignment, documents are web pages. In fact, to make lookup more meaningful, the index should map a keyword to a list of pairs, each pair consisting of the URL of the web page containing the keyword, together with the number of times that keyword appeared in that page. This number will be used to rank pages when doing lookup, based on the idea that more occurrences of a keyword in a page indicate that page is more likely to be "about" that keyword.
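
In Java terms, one plausible shape for this structure is sketched below (the names Posting and KeywordIndex are our own, not part of the assignment):

    import java.io.Serializable;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // A (URL, occurrence count) pair; Serializable so the whole index can
    // later be written to the index file with an ObjectOutputStream.
    class Posting implements Serializable {
        String url;    // URL of a page containing the keyword
        int count;     // number of times the keyword occurred in that page
    }

    // The index proper: each keyword maps to the postings for the pages
    // that contain it.
    class KeywordIndex implements Serializable {
        Map<String, List<Posting>> entries = new HashMap<String, List<Posting>>();
    }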


In order to construct the index, you will have to visit a set of web pages, and read the contents of these pages. You can think of this set of pages as vertices in a directed graph. Each web page has a URL ("Uniform Resource Locator") that identifies it; a typical URL has a form like http://www.cse.ucsd.edu/classes/fa08/cse100/assignments.html. The URL of the "start vertex" web page is given on the command line to your Index program. Each web page can contain links to other web pages. These links are expressed as URL's which appear inside certain "tags" in the page; so the graph of web pages has an adjacency-list representation. If you are doing a breadth-first search from a page P, you should visit P, and then visit all of the pages linked to by P before visiting any pages linked to by pages linked to by P. If you are doing a depth-first search from a page P, you should visit P, and then recursively do a depth-first search from each of the pages linked to by P. In either kind of search, you should make sure not to visit the same web page (i.e., a page with the same URL) twice.
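
Here is a sketch of the traversal loop (the class name Crawler and the helper visitPage are our own assumptions, not part of the assignment). The only difference between the two search types is whether newly found links go on the back of the frontier (queue, breadth-first) or on the front (stack, depth-first):

    import java.net.URL;
    import java.util.HashSet;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Set;

    public class Crawler {
        // Assumed helper: parse the page at u, add its keywords to the index,
        // and return the links found on it. (Not shown here.)
        static List<URL> visitPage(URL u) throws Exception {
            throw new UnsupportedOperationException("sketch only");
        }

        // One loop serves both search types; the visited set ensures that
        // no URL is processed twice.
        static void crawl(URL start, boolean breadthFirst, int maxPages) throws Exception {
            LinkedList<URL> frontier = new LinkedList<URL>();
            Set<URL> visited = new HashSet<URL>();
            frontier.addLast(start);
            while (!frontier.isEmpty() && visited.size() < maxPages) {
                URL u = frontier.removeFirst();
                if (visited.contains(u)) continue;   // already indexed this page
                visited.add(u);
                for (URL link : visitPage(u)) {
                    if (breadthFirst) frontier.addLast(link);   // queue: FIFO
                    else              frontier.addFirst(link);  // stack: LIFO
                }
            }
        }
    }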

Visiting a page: keywords and links

Web pages are HTML (HyperText Markup Language) documents, which consist of HTML "tags", and text. Visiting a page will involve parsing it to discover the keywords it contains, and discovering the links it contains. The keywords found will be incorporated appropriately into the index, and the links will be noted for visiting later. The javax.swing.text.html package and its subpackages contain classes that are designed for processing web pages. A tutorial on these classes is available here, though our application is simpler than the one described there.


The key class in that package is HTMLEditorKit and its subclasses. In particular, you will want to define a class that extends the class HTMLEditorKit.ParserCallback (a public inner class defined within HTMLEditorKit), and that overrides its handleStartTag and handleText methods. The handleStartTag method will be called by the HTML parser itself when a start tag is encountered; if this start tag includes an HREF attribute, it indicates a link. The handleText method is called by the HTML parser when text is encountered; this text contains the keywords. More details are available below, and in the online Java API documentation, about how these methods are intended to be used.

Processing keywords

Stemming is a process that most search engines perform on words that appear in documents they have indexed, and the same stemming process is applied to words that users submit for a search using the index. Stemming means removing common prefixes and suffixes in an attempt to get to the core of the keyword; for example, "modification", "remodify", "modifier", "modified", and "modify" all have the same stem, which can be represented as "modifi". In indexing and lookup, all of these keywords will be counted as equivalent to their stem, because they are just variants of it.


Along with the stemming process, it can be helpful to eliminate noise words (also called "stop words"): very common words (e.g. "a", "an", "are", "but", etc.) that occur in almost every document, and so don't serve very well in distinguishing relevant documents from irrelevant ones. It also helps to remove punctuation, convert to lower case, and perform other "cleanup" operations.


The result after stemming and noise word removal is supposed to be the reduction of the words in a document to their essentials. The design of the stemmer can make a big difference (for better or worse) in the performance of the search engine; aggressive stemming does improve the average relevance of search results, but the downside is that it can make it impossible to restrict your search to a particular form of a word.


For this assignment, we will provide a basic stemmer for you to use. The static makeKeyword() method in the Keyword class takes a String argument, and returns a String representing the stemmed and cleaned keyword (or the empty string "" if the result is a stop word or otherwise not usable as a valid keyword). Copy Keyword.java from the public/P4 directory and use it!
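
For example, a tiny test of the provided stemmer might look like this (a sketch; the exact stemmed strings depend on the stemmer's implementation):

    public class KeywordDemo {
        public static void main(String[] args) {
            // Keyword.java is provided in public/P4. makeKeyword() returns the
            // stemmed, cleaned keyword, or "" if the word is a stop word or
            // otherwise unusable.
            String[] words = { "Modification", "modified", "the" };
            for (String w : words) {
                String k = Keyword.makeKeyword(w);
                if (k.equals(""))
                    System.out.println(w + " -> (not a valid keyword)");
                else
                    System.out.println(w + " -> " + k);
            }
        }
    }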


The index you will construct will be a table that maps keywords to a structure of (URL, keyword count) pairs. (Update: you'll want the counts for many forms of relevance rating, but they are not strictly required for basic retrieval.) However, as you are doing your depth-first or breadth-first search of the web, you may find it more convenient to work with a table that maps URL's to a collection of (keyword, keyword count) pairs. With such a table it is easy to check if a URL has been visited before, for example (just see if it already has an entry in the table). But this table, with URL's as keys, then needs to be converted to a map that uses keywords as keys. This process is called "inverting the index". Alternatively, you could work directly in the first instance with a table that uses keywords as keys, and skip the "inverting" step. UPDATE: I wouldn't recommend the "inverting" approach--it is probably easier to just build your index the correct way the first time. But if you build the keyword-keyed index directly, you will need a HashSet to keep track of previously visited URLs (a visited list, like in your assignment #3).


In any case, each valid keyword you find in any web pages you searched will need to have an entry in the index you produce. Associated with each keyword should be a list (or a more suitable structure) of pairs: Each pair will consist of the URL of a page that contained the keyword, and the number of times the keyword occurred in that page.

Processing links

HTML tags are Strings that start with the "left angle bracket" character '<' and continue to the next "right angle bracket" character '>'. If the tag starts with "<a " or "<A ", it is an anchor tag. Anchor tags can contain hypertext references, i.e. links, to other web pages. If there is a link, the URL of the other page will appear as a double-quoted string immediately following HREF= or href=. For example,
<a href="http://slashdot.org/index.html">
is an anchor tag containing a link to the page with URL http://slashdot.org/index.html.

Link URL's can be relative or absolute. The above example is an absolute link. A relative link will specify a web page file relative to the URL of the page containing the relative link. For example, the web page with URL http://www.cse.ucsd.edu/classes/fa08/cse100/assignments.html contains the following anchor tag, which specifies a relative link:

<a HREF="assignment4.html">
In this context, this relative link corresponds to an absolute link http://www.cse.ucsd.edu/classes/fa08/cse100/assignment4.html.

The java.net.URL class

The Java standard library package java.net contains the URL class which provides a very useful API for communicating with web servers and accessing contents of web pages. You should read the online documentation for the class; we will mention some of its main features here.

You can create a URL object by using the constructor that takes as argument a String which is an absolute URL:

URL myURL = new URL("http://www.cse.ucsd.edu/classes/fa08/cse100/index.html");
Now the openStream() instance method of myURL will open a network connection to the URL, and return a java.io.InputStream object associated with this connection. You should wrap this InputStream in a java.io.InputStreamReader to do Unicode conversion, wrap that in a java.io.BufferedReader for efficiency, and then create a javax.swing.text.html.parser.ParserDelegator to parse the stream for you.
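
Put together, opening and parsing a page might look like the following sketch (the class and method names are ours; the callback object is described below):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import javax.swing.text.html.HTMLEditorKit;
    import javax.swing.text.html.parser.ParserDelegator;

    public class PageReader {
        // Open the page at pageURL and feed it to the HTML parser; the parser
        // will call back into the supplied ParserCallback object.
        static void parsePage(URL pageURL, HTMLEditorKit.ParserCallback callback)
                throws IOException {
            Reader in = new BufferedReader(new InputStreamReader(pageURL.openStream()));
            try {
                new ParserDelegator().parse(in, callback, true); // true: ignore charset spec
            } finally {
                in.close();
            }
        }
    }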

How to handle relative URL's? If the URL constructor determines that its single String argument is not a valid absolute URL, it will throw a java.net.MalformedURLException. This exception can be caught, and handled by attempting to create a URL object by using the URL constructor that takes two arguments: a URL object, and a String which is a relative URL, relative to the first argument. That is, if you are visiting a page using a URL object currentURL, and find a relative URL "assignments.html" there, then

URL myURL = new URL(currentURL,"assignments.html");
should create the appropriate URL object. If this also fails, the URL can just be ignored.

Note that constructing a relative URL works by replacing the last component of the absolute URL with the relative part, if the absolute URL does not end with a slash "/"; or, by concatenating the relative part onto the end of the absolute URL, if the absolute URL does end with a slash "/". These will lead to very different results!
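
For example (the host and file names here are hypothetical):

    import java.net.URL;

    public class ResolveDemo {
        public static void main(String[] args) throws Exception {
            URL noSlash = new URL("http://host.example.com/dir");   // no trailing slash
            URL slash   = new URL("http://host.example.com/dir/");  // trailing slash
            // "dir" is replaced, because it is taken to be a file name:
            System.out.println(new URL(noSlash, "page.html")); // http://host.example.com/page.html
            // "dir/" is kept, because it is taken to be a directory:
            System.out.println(new URL(slash, "page.html"));   // http://host.example.com/dir/page.html
        }
    }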

Using HTMLEditorKit.ParserCallback and ParserDelegator

The classes in the javax.swing.text.html package and its subpackages use an event-driven programming model to parse HTML documents.

To use these classes, create an instance of the javax.swing.text.html.parser.ParserDelegator class, and call its parse method. This method takes three arguments: a java.io.Reader from which the HTML document will be read, your HTMLEditorKit.ParserCallback object (see below), and a boolean ignoreCharSet flag (passing true tells the parser to ignore character set specifications in the document).

As the document at that URL is parsed, methods in your HTMLEditorKit.ParserCallback object will be automatically called to handle events that occur in the parse. For purposes of building your web indexer (as opposed to a full web browser), only two methods are important, and they can be defined along the following lines:
public void handleStartTag(HTML.Tag tag, MutableAttributeSet attr, int pos) 
This method will be called when a start tag is encountered. It will be passed three arguments. The first argument is an instance of the HTML.Tag class (a typesafe enumeration), which specifies the kind of start tag. We are only interested in anchor tags, which will be the case when tag == HTML.Tag.A. Then in particular we are only interested in the value of the HREF attribute within the tag. The MutableAttributeSet attr argument contains information about all the attributes of the tag. The value of the HREF attribute can be obtained as a String with a statement like
String link = (String) attr.getAttribute(HTML.Attribute.HREF);
and then that String can be used to construct a URL as described above. (The third argument gives the position in the HTML document of the start tag, and can be ignored.)
public void handleText(char[] text, int pos)
This method will be called when text is encountered in the HTML document being parsed. The text is passed as a char array in the first argument. (The second argument gives the position in the HTML document of the start of this text, and can be ignored.) The char array can be converted to a string:
String t = new String(text);
and then tokenized into keywords. Keywords contain only digits and letters ("word characters"), so the String should be split on non-word characters. This can be done by creating a Scanner, or using the split() instance method of the String itself. A candidate keyword that is a stop word should be discarded; if not a stop word, it will be used to build the index.
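
Putting the two handlers together, a callback class might look like the following sketch (the class name and the links/counts fields are our own choices; Keyword.makeKeyword() is the provided stemmer):

    import java.net.MalformedURLException;
    import java.net.URL;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;

    public class PageCallback extends HTMLEditorKit.ParserCallback {
        private final URL pageURL;   // URL of the page being visited (for relative links)
        final List<URL> links = new ArrayList<URL>();                       // links found
        final Map<String, Integer> counts = new HashMap<String, Integer>(); // keyword counts

        public PageCallback(URL pageURL) { this.pageURL = pageURL; }

        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attr, int pos) {
            if (tag != HTML.Tag.A) return;                 // only anchor tags matter here
            String link = (String) attr.getAttribute(HTML.Attribute.HREF);
            if (link == null) return;                      // anchor with no HREF
            try {
                links.add(new URL(link));                  // absolute URL
            } catch (MalformedURLException e) {
                try {
                    links.add(new URL(pageURL, link));     // try it as a relative URL
                } catch (MalformedURLException e2) {
                    // not usable as a URL; ignore it
                }
            }
        }

        public void handleText(char[] text, int pos) {
            // Split on anything that is not a digit or letter.
            for (String token : new String(text).split("[^A-Za-z0-9]+")) {
                String kw = Keyword.makeKeyword(token);    // "" if stop word / unusable
                if (!kw.equals("")) {
                    Integer c = counts.get(kw);
                    counts.put(kw, c == null ? 1 : c + 1);
                }
            }
        }
    }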

Lookup

You will write a Java program Lookup.java that will be invoked with a command line of the form
 java Lookup indexfile 
where indexfile is the name of a file created with your Index program. The Lookup program will read a keyword index from the named file. Then it will prompt the user for a sequence of whitespace-delimited keywords entered from the keyboard on a single line; this sequence of keywords is the user's query. Keywords that are in parentheses are words that the user does NOT want to see on the page. For example, given the following query:

Hi hello (bye goodbye) howdy (later)

your program should return a page containing the text "Hi howdy how are you today? hello", but not a page containing "Hi howdy hello bye" (it contains an excluded word), and not a page containing "Hi howdy" (it is missing "hello").

Keywords in the query will be stemmed and cleaned. (So, in the above example, the capitalization of "Hi" does not matter: case is removed by the cleaning process anyhow.) Valid keywords in the query will be used to search the index for the URL's of web pages that contain them; these URL's will be printed out, one per line, together with, and sorted by, their "relevance rating" (see below). Then your program will prompt the user for another query, and so on, until a blank line is entered, whereupon the program exits.
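
One way to split a query line into "wanted" and "unwanted" keyword sets is sketched below (the method name is ours, and we assume parentheses are attached to words as in the example above):

    import java.util.Set;

    public class QueryParser {
        // Words inside parentheses go into 'unwanted'; all others into 'wanted'.
        // Every query word is stemmed/cleaned just like indexed words were.
        static void parseQuery(String line, Set<String> wanted, Set<String> unwanted) {
            boolean excluding = false;
            for (String tok : line.trim().split("\\s+")) {
                if (tok.startsWith("(")) { excluding = true; tok = tok.substring(1); }
                boolean closes = tok.endsWith(")");
                if (closes) tok = tok.substring(0, tok.length() - 1);
                String kw = Keyword.makeKeyword(tok);
                if (!kw.equals("")) (excluding ? unwanted : wanted).add(kw);
                if (closes) excluding = false;
            }
        }
    }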


The lookup operation will find web pages in the index that contain all the valid keywords in the query, and none of the keywords in query that are in parentheses. If no web page meets the constraints, your program must inform the user that no pages were found. If one or more pages contain all the keywords, their names should be printed out, one per line, sorted by "relevance rating".
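
A sketch of this filtering step, reusing the hypothetical Posting pair from the indexing section above:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class Matcher {
        // Return the URL's of pages that contain every keyword in 'wanted'
        // and none of the keywords in 'unwanted'.
        static Set<String> matchingPages(Map<String, List<Posting>> index,
                                         Set<String> wanted, Set<String> unwanted) {
            Set<String> result = null;
            for (String kw : wanted) {
                Set<String> pages = new HashSet<String>();
                List<Posting> postings = index.get(kw);
                if (postings != null)
                    for (Posting p : postings) pages.add(p.url);
                if (result == null) result = pages;   // first keyword: start with its pages
                else result.retainAll(pages);         // later keywords: intersect
            }
            if (result == null) return new HashSet<String>(); // no valid keywords in query
            for (String kw : unwanted) {
                List<Posting> postings = index.get(kw);
                if (postings != null)
                    for (Posting p : postings) result.remove(p.url);
            }
            return result;
        }
    }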


The relevance rating for a web page is a score intended to indicate how relevant the page is to the user's query. Your search engine must print all the results, and only the results, that are indicated by the above instructions. But the order in which the results are returned--sorted by relevance rating--is up to you. You may design your own scoring method. It doesn't need to be very elaborate to receive full credit, but it could be fun to get a little creative here. If you'd like some ideas taken from "real life," I'm happy to discuss it in person or on the webboard. There's also a great deal of material on the web about this. (Try searching for PageRank or TF-IDF.) Here's a brainstorm of some things you could take into consideration: (1) the number of times the keywords occur on the pages in question; (2) the total length of the web page (maybe to normalize the results of (1), or maybe you think users might like longer pages because they have more content?); (3) how common the word is throughout the whole set of documents (maybe very common words are less useful for discerning relevance than unusual ones?); (4) the number of links on a page; (5) the number of links to a page (you may not be able to get enough data for this metric to be meaningful); (6) how close together the words appear on the page; (7) "birds of a feather flock together," so you could examine pages that a page links to, or that link to it, to see if they are also relevant to the keywords, and use that as a judge of the page in question (this is a big key to Google's success). Note that some of these require substantial additional data storage for the index, while others require very little additional storage. You should explain your scoring method, why you think it is indicative of relevance of the content, and what was needed behind the scenes to implement it, in about 2 paragraphs in your README.
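
For instance, a minimal rating based on idea (1) alone might be sketched as follows (our own choice, not a requirement):

    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    public class Scorer {
        // The total number of times the wanted keywords occur on the page.
        // Any reasonable, explained scheme is acceptable.
        static int relevance(Map<String, List<Posting>> index, Set<String> wanted, String url) {
            int score = 0;
            for (String kw : wanted) {
                List<Posting> postings = index.get(kw);
                if (postings != null)
                    for (Posting p : postings)
                        if (p.url.equals(url)) score += p.count;
            }
            return score;
        }
    }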


Note that the keywords the user enters must be stemmed before searching the index, just as the words in the source documents were before being put in the index. This means that some words cannot be searched for at all (noise words, for example), and some words cannot be distinguished (different words with the same stem, or words with different capitalization, etc.). The payoff is (hopefully) increased relevance of the search results.


Hints, Suggestions, and other Remarks

  1. You can use any data structures you wish in doing this assignment. The important thing is to use structures and algorithms that are efficient for the purpose. (You could even use Huffman code compression on the INDEX file, to save disk space as well, if you want!) You can assume that the standard Java library class files and Keyword.class will be on the CLASSPATH. Any other code must be in .java files bundled with your assignment. You can start with code you (not a previous team partner) have written for a previous assignment, if you wish.
  2. You might consider using the java.util.Hashtable class (or java.util.HashMap, which has the same functionality in a single-threaded environment, but is faster because its methods are not synchronized) where appropriate. A Hashtable can maintain associations between any type of keys (say, Strings or URL's) and any type of values (say, Integers, Vectors, or even other Hashtables), and gives you fast "put" and "get" operations. For increased compile-time type safety and other advantages, you should use Java generics and specify type parameters for keys and values.

    If you implement your index using a Hashtable or HashMap, note that these classes implement the Serializable interface. This means that a Hashtable object can be passed to an ObjectOutputStream object for writing (for example to a file); and an ObjectInputStream object can be used for reading and reconstituting the Hashtable at a later time. This will work perfectly, as long as the Hashtable contains objects that are themselves Serializable. It could save you a lot of time in coding a solution to this assignment; a small sketch appears after this list. Refer to the online Java documentation for more information about these classes.

  3. Your indexing algorithm needs to be able to do both depth-first and breadth-first searches. These alternatives should be handled with as much code re-use as possible. Hint: It is possible to implement these search strategies so that the only difference between them is that one puts URL's to visit on a queue, and the other puts them on a stack. Another hint: The java.util.LinkedList class has efficient addFirst(), addLast(), removeFirst(), and removeLast() methods.
  4. When run, your Index program can start indexing from any web page, by specifying its URL as the start point. (Be sure to give a reasonable number of maximum pages if you do a search starting at someplace like www.yahoo.com, or you'll use up your CPU quota trying to index the entire Web!) When spidering a limited number of pages on the open web, depth-first and breadth-first searches will usually give different results.

  5. The syntax of well-formed HTML anchor tags is described above. However, quite a few web pages in the world have anchor tags, and other HTML tags, that don't exactly meet that standard. For example, they might be missing quotes on the HREF URL, or have a space between the opening < and the 'A', etc. A web crawler (or a web browser) that wants to find as many pages as possible would try hard to do the right thing with as many technically incorrect tags as possible. However, for this assignment, you can just go with what the ParserDelegator tells your handleStartTag() method.
  6. An absolute URL has a part that names a host machine, and a part that names a document on that machine. For practical purposes, the host machine name is case insensitive, but the rest is case sensitive. So, for example, it is in general an error to lowercase the HREF string before passing it to a URL constructor. You can avoid these problems by using the equals() method of the URL class when comparing two URL's. (However, the equals() method has its own problems: it doesn't work right with virtual hosting. Using the sameFile() method may be better in this respect, but note that all Collection data structures use equals() by default. See the documentation for the URL class for more information.)
  7. The administrator of a web site might decide that some pages on the site should not be indexed by web search engines. (There could be various reasons for that: for example, the pages might be dynamically created and exist for only a short time, so references to them in an index will rapidly be out of date.) The standard way to accomplish that is by placing a file robots.txt in the root directory of the web site, which contains information about which pages on the site should not be automatically indexed. The format of the robots.txt file, and how it should be treated by indexing engines, is covered by the "Robots Exclusion Standard". Your indexing engine for this assignment doesn't have to follow this standard, but you should be aware of it, and any more serious web robot you write should conform to it.
  8. UPDATE 11/16/08: The following is a copy/paste from the webboard, regarding the tutorial on ParserCallback that is linked to above (here is the tutorial link again). Please see the webboard for the original question and conversation context. Here are my replies:
    Callbacks are a very common paradigm for programming tasks. You will very often see them in dealing with graphical user interfaces (GUI) and document parsers (like in this case). In user interfaces, you would create a callback object/method to deal with events such as a mouse click. The callback object would contain code that you want to execute whenever a mouse click occurs, for example.

    In this case, you are dictating what should occur when certain things are found in an HTML object. The code above has a handleText() method. That method says what the document parser should do when it encounters text in the HTML document (i.e., words that are not inside an HTML tag).

    I put some java code in the ../public/p4 directory on ieng6 to help you get started with this. I basically just copied and pasted code from the tutorial you are asking about (that is linked from the assignment spec). The tutorial code prints out all text in the document. I modified it slightly so that it only prints out text that is inside <P> tags (paragraph tags) like this </P>. There is a very simple HTML file in the ../public/p4 directory that you can use to demonstrate the starter code. You will see that the HTML file has some text in a <H3> tag, and some in a <P> tag, but only the text in the <P> tag will be printed when you run the sample java code.

    For the assignment, you probably don't care about the <P> tag, but you do care about link tags--<A> tags. It should be pretty straightforward to modify the code I'm giving you to pay attention to the A tag instead of P.

    -----------------

    I put a second java file in ../public/p4 that does the exact same thing, but puts the class definition in the format you are used to seeing, instead of being an anonymous class. You can compare the two java files side by side and that should help you understand exactly what is going on. For this assignment, it might make more sense to do it the traditional class definition way, instead of the anonymous way, depending on how long your code is. Anonymous classes are a handy trick when the amount of code for the anonymous class is very small and you don't want to bother with a traditional class definition. Depending on how you do your relevance rating, you might have quite a bit of code here.

    Another thing to notice about anonymous classes is what happens when you compile: 'javac ReadWebPage.java' makes *2* .class files. One is ReadWebPage.class, as you would expect. The other is the .class file for the anonymous class. (So it really does work just like a class definition.) You never gave the anonymous class a name (that's why it's called anonymous), so Java just gives it a number, resulting in 'ReadWebPage$1.class'.
  9. UPDATE 11/25/08: You can use the following archive for testing purposes: PHP manual.
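
Finally, here is a sketch of the serialization idea from hint 2 above (the value type Vector<String> is just a placeholder; any Serializable key and value types will work):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.Hashtable;
    import java.util.Vector;

    public class IndexIO {
        // Write the whole table to the index file in one call (in Index).
        static void write(Hashtable<String, Vector<String>> index, String filename)
                throws Exception {
            ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(filename));
            out.writeObject(index);
            out.close();
        }

        // Read it back and reconstitute it (in Lookup).
        @SuppressWarnings("unchecked")
        static Hashtable<String, Vector<String>> read(String filename) throws Exception {
            ObjectInputStream in = new ObjectInputStream(new FileInputStream(filename));
            Hashtable<String, Vector<String>> index =
                (Hashtable<String, Vector<String>>) in.readObject();
            in.close();
            return index;
        }
    }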




Last assignment, good luck!