CSE 150 Programming Assignment #4

Date assigned: May 25, 2007

Date due: 11:59:59 June 7 2007

 

 

The task here is to build a spam filter using Bayesian learning.  You should implement the naive Bayes learning algorithm (NB) described in class, and apply it to learn a classifier that distinguishes spam from legitimate email messages as accurately as possible.

 

You must write a preprocessor that converts an email message into a vector of feature values.  Use your creativity in choosing features that you think are likely to be predictive, i.e. likely to be different for the two classes of messages.  Use at least 100 different features, and make sure your NB implementation can handle at least 200 features and 2000 training examples.  Most of your features are likely to be words, but also use other features such as the time of day of the message.  Use human intelligence to avoid features that may be highly predictive for the training data, but that will not be useful for test data, for example the date of the email message.

 

There is a training set of about 600 spam messages and here is a training set of about 300 legitimate messages.  You will have to divide each file into its separate email messages; this is not difficult and you should make sure your counts match the official numbers.  You must also write a script to preprocess each message, to compute the value for it of each feature that you use.

 

On Wednesday June 6 we will publish a test set of mixed spam and legitimate email messages.  You will run your classifier to generate a file of predicted labels 1 for spam and 0 for legitimate.  We will compute the percentage of correct labels that you submit, and part of your score will be based on this percentage.  Note that for this assignment, both types of mistake (classifying spam as legitimate, and vice versa) are equally bad.

 

Logistics:

You can work alone or in groups of at most 3. Code must be submitted (with comments) by the due date. A copy of your write up as a pdf should be included. Turn in the code using the turnin script.

 

Hard copies of write-ups are due no later than at the start of the final, printed and stapled. Code should be included in the printout. Write-ups should include a brief description of the approach taken, answers to any questions posed, and sample output on some examples. See below for more details on the content of the write up.

 

Half of the points are based on the quality of the write-up, testing, and code. The other half are based on the performance of your code on a test set of problems and against other student’s submissions. Bonus points may be awarded at the TA's discretion for going above and beyond the HW's description.

 

Points will be deducted for not following these instructions. Note that no late programming assignments will be accepted!

 

Project specifics

  1. Write a Preprocessor. You must write a preprocessor function that inputs text emails, computes the values of chosen features for these emails, and outputs the resulting vectors of feature values. If you like, the preprocessor may be written in a scripting language such as Python or Perl. This function will be used by both the learning and classifying algorithms. You should not have to replicate this code in both programs. In designing the preprocessor, you must choose which features to use. In your report, describe and justify your method for choosing features. You may use any interface you like, but be sure to describe this interface in your README.

 

The email files are given as long text files with emails concatenated together (namely an mbox created from a bunch of .emlx files). Part of the task of the preprocessor is to separate this text file into separate emails; this doesn’t need to be output, but is obviouslt critical for tagging individual emails. Most binary attachments have been stripped. The headers have been abridged to suppress labelings by spam-filters. For non-spam, names and emails have been changed to protect the innocent.

 

Files are obtainable here.

 

  1. Implement the naive Bayes Learner. You must write a program that implements the naive Bayes learning algorithm. This program should input the set of labeled training text emails (as opposed to feature vectors) and estimate the necessary statistics. It should then output these statistics to a file to be used by the naive Bayes classifier. Use whatever input and output interfaces you like. In the README header, specify the exact command needed to run your learner on the spam and legit training emails. For instance, if the command to run your learner is:

 

spamNBLearner legit spam > spam_params.out

 

please specify this in the README header.

 

  1. Implement the naive Bayes Classifier. You must write a program that implements the naive Bayes classifier algorithm. This program should take two arguments: the name of a file consisting of text emails strung together and a file containing the learned parameters. Using the statistics learned by the naive Bayes learner, it should ouput to stdout the estimated class of each input email: 1 for spam and 0 for legitimate. Each classification should be on a separate line. If the name of your classifier executable is spamNBClassifier, then the command to run your classifier on the email messages in the file test using the params learned above would be:

 

spamNBClassifier test spam_params.out

 

Please specify the name of the classifier executable in the README header. The expected output should be something like:

1

1

0

1

0

...

 

  1. Test your classifier. You are to to test your classifier on each email of a set of unlabeled test emails (provided June 6). You should store the results of your classifier in the file results.txt. Part of your grade will be based on the classification error of your algorithm on this test set. You may not hand label this test set and use it in your learning algorithm.

 

Write up:

 

Your report should include discussions of the following issues:

 

    * In your experience, is the NB algorithm useful for spam filtering?

    * What are the strengths and weaknesses of using NB compared to hand-coding a spam recognizer?

    * What programming problems did you have to solve in order to make your implementation of NB successful?

    * What have you learned about the practicalities of data mining applications, like this one?

 

You should also describe the classifier that your software learns, numerically and qualitatively.  Which features (and which values of which features) are most predictive of a message being spam?  How many features contribute significantly to classification, versus how many are uninformative?  What features might you want to add to your feature set in the future, to get a better classifier?  Is it adequate to represent a message as a "bag of words," or does this lose too much information?

 

As described in previous assignments, the following is a good outline. Please use it.

 

1. Files submitted (with brief description of each)

2. How to run (including unpacking, compiling)

3. Approach (describe algorithm(s)), problems encountered, hurdles overcome)

4. Known bugs (hopefully empty! But better to report than for me to find.)

5. Extra credit (can be empty - just because it's here doesn't mean you'll get points, but won't get points if not listed)

6. Analysis of results on training set (how well did you do on these?)

7. Answers to questions

8. Summary

 

Testing:

Your turned in directory MUST contain a results.txt file that shows the output for each message in the test file. If we provide multiple test files, each must have a results file (alternate names will provided for different files in this case).

 

Grading:

As has typically been the case, half your points are based on your performance and half on the quality of the report and code.