>>> Due FRIDAY DECEMBER 5, 11:59 pm PST.
>>> What will be turned in: Index.java, Lookup.java, plus any other .java files in your working directory.
If you are working in a team of 2 students, turn in your assignment from one of your two accounts only. On this assignment, your team can be the same as one of your previous assignments, if you want.
Getting started
You have no doubt used internet search engines: you type in a query in the form of a list of keywords, and the engine returns to you a list of web pages, ordered (hopefully) by how well they match the keywords. In this assignment, you will implement the basic functionality used in these keyword-driven search systems.
Your system will have two parts: an indexing utility, which will create a keyword index by searching a graph of web pages; and a lookup utility, which will create lists of "relevant" web pages from keyword queries. The first part will be implemented as a Java program with a main() method in a file Index.java. The second part will be implemented as a Java program with a main() method in a file Lookup.java. Breaking the task into two parts makes computational sense: once the index is created, keyword searches in the index can be performed much faster than searching the web documents directly. But it means that both of these parts are essential to having a working system: without code to read the index correctly, indexing is useless; and without code to produce an index, lookup is useless.
Index.java
Write a program that will be invoked with a command line of the form

java Index firstpage searchtype maxpages indexfile

where firstpage is the complete URL of the web page where the search should start, searchtype is either breadth or depth, maxpages is an integer specifying the maximum number of web pages to search, and indexfile is the name of the file that will contain the index. Depending on the search type, the program will do a breadth-first or depth-first search of the web starting with the page with the given URL, visiting no more than the maximum number of pages specified, and produce a keyword index. This index will be written to the named file. The format of the index file will be suitable for reading by the Lookup program, described below. Your program should also print to standard output some informative messages as it is building the index. For example, for each web page it searches, it should print out how many distinct keywords were found in it; and it should print out some summary information at the end (e.g. how many distinct keywords and how many web pages appear in the final index).
In order to construct the index, you will have to visit a set of web pages, and read the contents of these pages. You can think of this set of pages as vertices in a directed graph. Each web page has a URL ("Uniform Resource Locator") that identifies it; a typical URL has a form like http://www.cse.ucsd.edu/classes/fa08/cse100/assignments.html
. The URL of the "start vertex" web page is given on the command line to your Index program. Each web page can contain links to other web pages. These links are expressed as URL's which appear inside certain "tags" in the page; so the graph of web pages has an adjacency-list representation. If you are doing a breadth-first search from a page P, you should visit P, and then visit all of the pages linked to by P before visiting any pages linked to by pages linked to by P. If you are doing a depth-first search from a page P, you should visit P, and then recursively do a depth-first search from each of the pages linked to by P. In either kind of search, you should make sure not to visit the same web page (i.e., a page with the same URL) twice.
The javax.swing.text.html package and its subpackages contain classes that are designed for processing web pages. A tutorial on these classes is available here, though our application is simpler than the one described there.
The key class in that package is HTMLEditorKit and its nested classes. In particular, you will want to define a class that extends the class HTMLEditorKit.ParserCallback (a public inner class defined within HTMLEditorKit), and that overrides its handleStartTag and handleText methods. The handleStartTag method will be called by the HTML parser itself when a start tag is encountered; if this start tag includes an HREF attribute, it indicates a link. The handleText method is called by the HTML parser when text is encountered; this text contains the keywords. More details are available below, and in the online Java API documentation, about how these methods are intended to be used.
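To make the shape of this concrete, here is a minimal skeleton of such a class (the name IndexerCallback is our own invention, and the method bodies are placeholders for your indexing logic):

    import javax.swing.text.MutableAttributeSet;
    import javax.swing.text.html.HTML;
    import javax.swing.text.html.HTMLEditorKit;

    class IndexerCallback extends HTMLEditorKit.ParserCallback {
        @Override
        public void handleStartTag(HTML.Tag tag, MutableAttributeSet attr, int pos) {
            // Called for each start tag; detect anchor tags and collect links here.
        }

        @Override
        public void handleText(char[] text, int pos) {
            // Called for each run of text; tokenize it into candidate keywords here.
        }
    }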
Along with the stemming process, it can be helpful to eliminate noise words (also called "stop words"), which are very common words (e.g. "a", "an", "are", "but", etc.) that occur in almost every document and so don't serve very well in distinguishing relevant documents from irrelevant ones. It also helps to remove punctuation, convert to lower case, and perform other "cleanup" operations.
The result after stemming and noise word removal is supposed to be the reduction of the words in a document to their essentials. The design of the stemmer can make a big difference (for better or worse) in the performance of the search engine; aggressive stemming does improve the average relevance of search results, but the downside is that it can make it impossible to restrict your search to a particular form of a word.
For this assignment, we will provide a basic stemmer for you to use. The static makeKeyword() method in the Keyword class takes a String argument, and returns a String representing the stemmed and cleaned keyword (or the empty string "" if the result is a stop word or otherwise not usable as a valid keyword). Copy Keyword.java from the public/P4 directory and use it!
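A quick illustration of how it might be used (the stemmed output shown in the comment is hypothetical, since the exact result depends on the provided stemmer):

    String kw = Keyword.makeKeyword("Running!");
    if (!kw.equals("")) {
        // kw is now a cleaned, stemmed keyword (perhaps "run"); index it.
    } else {
        // Stop word, punctuation, or otherwise unusable token; skip it.
    }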
The index you will construct will be a table that maps keywords to a collection of (URL, keyword count) pairs (update: you'll want the counts for many forms of relevance rating, but they are not strictly required for basic retrieval). However, as you are doing your depth-first or breadth-first search of the web, you may find it more convenient to work with a table that maps URLs to a collection of (keyword, keyword count) pairs. With such a table it is easy to check if a URL has been visited before, for example (just see if it already has an entry in the table). But this table, with URLs as keys, then needs to be converted to a map that uses keywords as keys. This process is called "inverting the index". Alternatively, you could work directly in the first instance with a table that uses keywords as keys, and skip the "inverting" step. UPDATE: I wouldn't recommend the "inverting" approach; it is probably easier to just build your index the correct way the first time. But if you do that, you will need a HashSet in order to keep track of previously visited URLs (a visited list, like in your assignment #3).
In any case, each valid keyword you find in any web pages you searched will need to have an entry in the index you produce. Associated with each keyword should be a list (or a more suitable structure) of pairs: Each pair will consist of the URL of a page that contained the keyword, and the number of times the keyword occurred in that page.
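One possible shape for this, sketched as a fragment of what Index.java might contain (the nested-HashMap design and the names index, visited, and addOccurrence are our own, not a required design):

    import java.util.HashMap;
    import java.util.HashSet;

    public class Index {
        // keyword -> (URL -> number of times the keyword occurs on that page)
        static HashMap<String, HashMap<String, Integer>> index =
            new HashMap<String, HashMap<String, Integer>>();

        // URLs already visited during the breadth- or depth-first search
        static HashSet<String> visited = new HashSet<String>();

        // Record one occurrence of keyword kw on the page at url.
        static void addOccurrence(String kw, String url) {
            HashMap<String, Integer> postings = index.get(kw);
            if (postings == null) {
                postings = new HashMap<String, Integer>();
                index.put(kw, postings);
            }
            Integer count = postings.get(url);
            postings.put(url, count == null ? 1 : count + 1);
        }
    }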
An HTML tag starts with the "left corner bracket" character '<' and continues to the next "right corner bracket" character '>'. If the tag starts with "<a " or "<A ", it is an anchor tag. Anchor tags can contain hyperreferences, i.e. links, to other web pages. If there is a link, the URL of the other page will appear as a double-quoted string immediately following HREF= or href=. For example,

<a href="http://slashdot.org/index.html">

is an anchor tag containing a link to the page with URL

http://slashdot.org/index.html
Link URLs can be relative or absolute. The above example is an absolute link. A relative link will specify a web page file relative to the URL of the page containing the relative link. For example, the web page with URL http://www.cse.ucsd.edu/classes/fa08/cse100/assignments.html contains the following anchor tag, which specifies a relative link:

<a HREF="assignment4.html">

In this context, this relative link corresponds to the absolute link

http://www.cse.ucsd.edu/classes/fa08/cse100/assignment4.html
The java.net package contains the URL class, which provides a very useful API for communicating with web servers and accessing the contents of web pages. You should read the online documentation for the class; we will mention some of its main features here.
You can create a URL object by using the constructor that takes as argument a String which is an absolute URL:

URL myURL = new URL("http://www.cse.ucsd.edu/classes/fa08/cse100/index.html");

Now the openStream() instance method of myURL will open a network connection to the URL, and return a java.io.InputStream object associated with this connection. You should wrap this InputStream in a java.io.InputStreamReader to do Unicode conversion, and wrap that in a java.io.BufferedReader for efficiency, and then create a javax.swing.text.html.parser.ParserDelegator to parse the stream for you.
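Assembled, the reading pipeline might look like the following sketch (inside a method of your Index program that handles IOException; IndexerCallback is the hypothetical callback class sketched earlier):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import java.net.URL;
    import javax.swing.text.html.parser.ParserDelegator;

    URL myURL = new URL("http://www.cse.ucsd.edu/classes/fa08/cse100/index.html");
    // openStream() connects to the server; the readers convert the byte stream
    // to buffered Unicode characters for the parser.
    Reader reader = new BufferedReader(new InputStreamReader(myURL.openStream()));
    try {
        // The parser walks the page and fires callbacks into IndexerCallback.
        new ParserDelegator().parse(reader, new IndexerCallback(), true);
    } finally {
        reader.close();
    }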
How to handle relative URLs? If the URL constructor determines that its single String argument is not a valid absolute URL, it will throw a java.net.MalformedURLException. In that case, you can try to create the URL object by using the URL constructor that takes two arguments: a URL object, and a String which is a relative URL, relative to the first argument. That is, if you are visiting a page using a URL object currentURL, and find a relative URL "assignments.html" there, then

URL myURL = new URL(currentURL, "assignments.html");

should create the appropriate URL object. If this also fails, the URL can just be ignored.
Note that constructing a relative URL works by replacing the last component of the absolute URL with the relative part, if the absolute URL does not end with a slash "/"; or, by concatenating the relative part onto the end of the absolute URL, if the absolute URL does end with a slash "/". These will lead to very different results!
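A small demonstration of the difference (the URLs are hypothetical, chosen just to show the two cases):

    import java.net.URL;

    URL noSlash  = new URL("http://example.com/dir");   // does not end with "/"
    URL endSlash = new URL("http://example.com/dir/");  // ends with "/"

    // "dir" is treated as a file name and gets replaced:
    new URL(noSlash, "page.html");   // -> http://example.com/page.html

    // "dir/" is treated as a directory and gets appended to:
    new URL(endSlash, "page.html");  // -> http://example.com/dir/page.html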
The classes in the javax.swing.text.html package and its subpackages use an event-driven programming model to parse HTML documents.
To use these classes, create an instance of the javax.swing.text.html.parser.ParserDelegator class, and call its parse method. This method takes three arguments: a Reader object connected to a URL of an HTML document, as described above; an instance of a class that extends HTMLEditorKit.ParserCallback and overrides methods as described below; and a boolean "ignore character set" flag (which you can just pass as true).
Methods of your HTMLEditorKit.ParserCallback object will be automatically called to handle events that occur in the parse. For purposes of building your web indexer (as opposed to a full web browser), only two methods are important, and they can be defined along the following lines:
public void handleStartTag(HTML.Tag tag, MutableAttributeSet attr, int pos)

This method will be called when a start tag is encountered. It will be passed three arguments. The first argument is an instance of the HTML.Tag enum type, which specifies the kind of start tag. We are only interested in anchor tags, which will be the case when tag == HTML.Tag.A. Then in particular we are only interested in the value of the HREF attribute within the tag. The MutableAttributeSet attr argument contains information about all the attributes of the tag. The value of the HREF attribute can be obtained as a String with a statement like

String link = (String) attr.getAttribute(HTML.Attribute.HREF);

and then that String can be used to construct a URL as described above. (The third argument gives the position in the HTML document of the start tag, and can be ignored.)
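Inside the callback class sketched earlier, handleStartTag might then look like this (currentURL, the URL of the page being parsed, and what you do with the discovered link are up to your design; note that the two-argument URL constructor also accepts absolute links, since an absolute second argument simply overrides the base):

    @Override
    public void handleStartTag(HTML.Tag tag, MutableAttributeSet attr, int pos) {
        if (tag == HTML.Tag.A) {
            String link = (String) attr.getAttribute(HTML.Attribute.HREF);
            if (link != null) {
                try {
                    // Resolve the link (relative or absolute) against this page.
                    URL target = new URL(currentURL, link);
                    // ... record target for later visiting (BFS queue or DFS recursion) ...
                } catch (MalformedURLException e) {
                    // Unparseable link: just ignore it.
                }
            }
        }
    }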
public void handleText(char[] text, int pos)

This method will be called when text is encountered in the HTML document being parsed. The text is passed as a char array in the first argument. (The second argument gives the position in the HTML document of the start of this text, and can be ignored.) The char array can be converted to a string:

String t = new String(text);

and then tokenized into keywords. Keywords contain only digits and letters ("word characters"), so the String should be split on non-word characters. This can be done by creating a Scanner, or using the split() instance method of the String itself. A candidate keyword that is a stop word should be discarded; if not a stop word, it will be used to build the index.
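Continuing the same sketch, handleText might look like the following (splitting on the regular expression "\\W+" is one reasonable choice, not the only one; addOccurrence and currentURL are the hypothetical names used in the earlier sketches):

    @Override
    public void handleText(char[] text, int pos) {
        String t = new String(text);
        // "\\W+" matches runs of characters that are not letters, digits, or '_'.
        for (String token : t.split("\\W+")) {
            String kw = Keyword.makeKeyword(token);
            if (!kw.equals("")) {
                addOccurrence(kw, currentURL.toString());
            }
        }
    }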
Lookup.java
Write a program that will be invoked with a command line of the form

java Lookup indexfile

where indexfile is the name of a file created with your Index program. The Lookup program will read a keyword index from the named file. Then it will prompt the user for a sequence of whitespace-delimited keywords entered from the keyboard on a single line; this sequence of keywords is the user's query. Keywords that are in parentheses are words that the user does NOT want to see on the page. The lookup operation will find web pages in the index that contain all the valid keywords in the query, and none of the keywords in the query that are in parentheses. If no web page meets the constraints, your program must inform the user that no pages were found. If one or more pages contain all the keywords, their names should be printed out, one per line, sorted by "relevance rating".
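One way to separate the query into required and excluded keywords, sketched below (the variable names are hypothetical, and this assumes each excluded word is individually parenthesized; remember that tokens must be stemmed with Keyword.makeKeyword(), as discussed below):

    import java.util.HashSet;

    HashSet<String> required = new HashSet<String>();
    HashSet<String> excluded = new HashSet<String>();
    // queryLine is the line the user typed at the prompt.
    for (String token : queryLine.trim().split("\\s+")) {
        boolean negated = token.startsWith("(") && token.endsWith(")");
        if (negated) {
            token = token.substring(1, token.length() - 1);
        }
        String kw = Keyword.makeKeyword(token);
        if (!kw.equals("")) {
            (negated ? excluded : required).add(kw);
        }
    }

Under this scheme, a (hypothetical) query like apple banana (cherry) asks for pages that contain both "apple" and "banana" but do not contain "cherry".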
The relevance rating for a web page is a score intended to indicate how relevant the page is to the user's query. Your search engine must print all the results, and only the results, that are indicated by the above instructions. But the order in which the results are returned, sorted by relevance rating, is up to you. You may design your own scoring method. It doesn't need to be very elaborate to receive full credit, but it could be fun to get a little creative here. If you'd like some ideas taken from "real life," I'm happy to discuss it in person or on the webboard. There's also a great deal of material on the web about this. (Try searching for PageRank or TF-IDF.) Here's a brainstorm of some things you could take into consideration:
(1) number of times the keywords occur on the pages in question;
(2) total length of the webpage (maybe to normalize the results of (1), or maybe you think users might like longer pages because they have more content?);
(3) how common the word is throughout the whole set of documents (maybe very common words are less useful for discerning relevance than unusual ones?);
(4) number of links on a page;
(5) number of links to a page (you may not be able to get enough data for this metric to be meaningful);
(6) how close together the words appear on the page;
(7) "birds of a feather flock together": you could examine pages that a page links to, or that link to it, to see if they are also relevant to the keywords, and use that as a judge of the page in question (this is a big key to Google's success).
Note that some of these require substantial additional data storage for the index, while others require very little additional storage. You should explain your scoring method, why you think it is indicative of relevance of the content, and what was needed behind the scenes to implement it, in about 2 paragraphs in your README.
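As the simplest concrete instance of idea (1), a score could just total the counts of the required keywords on a page (a sketch built on the hypothetical index and required structures from the earlier sketches; url identifies the page being scored):

    // score = total occurrences of all required query keywords on this page
    int score = 0;
    for (String kw : required) {
        HashMap<String, Integer> postings = index.get(kw);
        Integer count = (postings == null) ? null : postings.get(url);
        score += (count == null) ? 0 : count;
    }
    // Sort the matching pages by descending score before printing them.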
Note that the keywords the user enters must be stemmed before searching the index, just as the words in the source documents were before being put in the index. This means that some words cannot be searched for at all (noise words, for example), and some words cannot be distinguished (different words with the same stem, or words with different capitalization, etc.). The payoff is (hopefully) increased relevance of the search results.
If you implement your index using a Hashtable or HashMap, note that these classes implement the Serializable interface. This means that a Hashtable object can be passed to an ObjectOutputStream object for writing (for example, to a file); and an ObjectInputStream object can be used for reading and reconstituting the Hashtable at a later time. This will work perfectly, as long as the Hashtable contains objects that are themselves Serializable. It could save you a lot of time in coding a solution to this assignment. Refer to the online Java documentation for more information about these classes.
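A sketch of both directions, assuming the nested-HashMap index from the earlier sketch (a real program would catch the IOException and ClassNotFoundException these calls can throw):

    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.ObjectInputStream;
    import java.io.ObjectOutputStream;
    import java.util.HashMap;

    // In Index: write the finished index to the named file.
    ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(indexfile));
    out.writeObject(index);
    out.close();

    // In Lookup: read it back. The cast is unchecked, but safe if Index wrote it.
    ObjectInputStream in = new ObjectInputStream(new FileInputStream(indexfile));
    @SuppressWarnings("unchecked")
    HashMap<String, HashMap<String, Integer>> index =
        (HashMap<String, HashMap<String, Integer>>) in.readObject();
    in.close();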
When run, your Index program can start indexing from any web page, by specifying its URL as the start point. (Be sure to give a reasonable number of maximum pages if you do a search starting at someplace like www.yahoo.com, or you'll use up your CPU quota trying to index the entire Web!) When spidering a limited number of pages on the open web, depth-first and breadth-first searches will usually give different results.
The links available from each page, and the order in which you discover them, are exactly what the ParserDelegator tells your handleStartTag() method.
Many web sites have a file named robots.txt in the root directory of the web site, which contains information about which pages on the site should not be automatically indexed. The format of the robots.txt file, and how it should be treated by indexing engines, is covered by the "Robots Exclusion Standard". Your indexing engine for this assignment doesn't have to follow this standard, but you should be aware of it, and any more serious web robot you write should conform to it.
Callbacks are a very common paradigm for programming tasks. You will very often see them in dealing with graphical user interfaces (GUI) and document parsers (like in this case). In user interfaces, you would create a callback object/method to deal with events such as a mouse click. The callback object would contain code that you want to execute whenever a mouse click occurs, for example.
In this case, you are dictating what should occur when certain things are found in an HTML document. The starter code described below has a handleText() method. That method says what the document parser should do when it encounters text in the HTML document (i.e., words that are not inside an HTML tag).
I put some Java code in the ../public/p4 directory on ieng6 to help you get started with this. I basically just copied and pasted code from the tutorial you are asking about (the one linked from the assignment spec). The tutorial code prints out all text in the document. I modified it slightly so that it only prints out text that is inside <P> (paragraph) tags, i.e. between <P> and </P>. There is a very simple HTML file in the ../public/p4 directory that you can use to demonstrate the starter code. You will see that the HTML file has some text in an <H3> tag, and some in a <P> tag, but only the text in the <P> tag will be printed when you run the sample Java code.
For the assignment, you probably don't care about the <P> tag, but you do care about link tags--<A> tags. It should be pretty straightforward to modify the code I'm giving you to pay attention to the A tag instead of P.
-----------------
I put a second java file in ../public/p4 that does the exact same thing, but puts the class definition in the format you are used to seeing, instead of being an anonymous class. You can compare the two java files side by side and that should help you understand exactly what is going on. For this assignment, it might make more sense to do it the traditional class definition way, instead of the anonymous way, depending on how long your code is. Anonymous classes are a handy trick when the amount of code for the anonymous class is very small and you don't want to bother with a traditional class definition. Depending on how you do your relevance rating, you might have quite a bit of code here.
Another thing to notice about anonymous classes is what happens when you compile: 'javac ReadWebPage.java' makes *2* .class files. One is ReadWebPage.class, as you would expect. The other is the .class file for the anonymous class. (So it really does work just like a class definition.) You never gave the anonymous class a name (that's why it's called anonymous), so java just gives it a number, resulting in 'ReadWebPage$1.class'
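To illustrate the anonymous form (a sketch in the same spirit as the tutorial code, not the literal contents of ReadWebPage.java):

    import javax.swing.text.html.HTMLEditorKit;

    // An anonymous subclass of HTMLEditorKit.ParserCallback, defined inline.
    // Compiling the enclosing class produces an extra EnclosingClass$1.class for it.
    HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
        @Override
        public void handleText(char[] text, int pos) {
            System.out.println(new String(text));
        }
    };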
Last assignment, good luck!