CSE 150 Programming Assignment #4
Date assigned: May 25, 2007
Date due: 11:59:59 June 7 2007
The task here is to build a spam filter using Bayesian learning. You should implement the naive Bayes learning algorithm (NB) described in class, and apply it to learn a classifier that distinguishes spam from legitimate email messages as accurately as possible.
You must write a preprocessor that converts an email message into a vector of feature values. Use your creativity in choosing features that you think are likely to be predictive, i.e. likely to be different for the two classes of messages. Use at least 100 different features, and make sure your NB implementation can handle at least 200 features and 2000 training examples. Most of your features are likely to be words, but also use other features such as the time of day of the message. Use human intelligence to avoid features that may be highly predictive for the training data, but that will not be useful for test data, for example the date of the email message.
There is a training set of about 600 spam messages and here is a training set of about 300 legitimate messages. You will have to divide each file into its separate email messages; this is not difficult and you should make sure your counts match the official numbers. You must also write a script to preprocess each message, to compute the value for it of each feature that you use.
On Wednesday June 6 we will publish a test set of mixed spam and legitimate email messages. You will run your classifier to generate a file of predicted labels 1 for spam and 0 for legitimate. We will compute the percentage of correct labels that you submit, and part of your score will be based on this percentage. Note that for this assignment, both types of mistake (classifying spam as legitimate, and vice versa) are equally bad.
Logistics:
You can work alone or in groups of at most 3. Code must be submitted (with comments) by the due date. A copy of your write up as a pdf should be included. Turn in the code using the turnin script.
Hard copies of write-ups are due no later than at the start of the final, printed and stapled. Code should be included in the printout. Write-ups should include a brief description of the approach taken, answers to any questions posed, and sample output on some examples. See below for more details on the content of the write up.
Half of the points are based on the quality of the write-up, testing, and code. The other half are based on the performance of your code on a test set of problems and against other student’s submissions. Bonus points may be awarded at the TA's discretion for going above and beyond the HW's description.
Points will be deducted for not following these instructions. Note that no late programming assignments will be accepted!
The email files are given as long text files with emails concatenated together (namely an mbox created from a bunch of .emlx files). Part of the task of the preprocessor is to separate this text file into separate emails; this doesn’t need to be output, but is obviouslt critical for tagging individual emails. Most binary attachments have been stripped. The headers have been abridged to suppress labelings by spam-filters. For non-spam, names and emails have been changed to protect the innocent.
Files are obtainable here.
spamNBLearner legit spam > spam_params.out
please specify this in the README header.
spamNBClassifier test spam_params.out
Please specify the
name of the classifier executable in the README header. The expected output
should be something like:
1
1
0
1
0
...
Write up:
Your report should include discussions of the following issues:
* In your experience, is the NB algorithm useful for spam filtering?
* What are the strengths and weaknesses of using NB compared to hand-coding a spam recognizer?
* What programming problems did you have to solve in order to make your implementation of NB successful?
* What have you learned about the practicalities of data mining applications, like this one?
You should also describe the classifier that your software learns, numerically and qualitatively. Which features (and which values of which features) are most predictive of a message being spam? How many features contribute significantly to classification, versus how many are uninformative? What features might you want to add to your feature set in the future, to get a better classifier? Is it adequate to represent a message as a "bag of words," or does this lose too much information?
As described in previous assignments, the following is a good outline. Please use it.
1. Files submitted (with brief description of each)
2. How to run (including unpacking, compiling)
3. Approach (describe algorithm(s)), problems encountered, hurdles overcome)
4. Known bugs (hopefully empty! But better to report than for me to find.)
5. Extra credit (can be empty - just because it's here doesn't mean you'll get points, but won't get points if not listed)
6. Analysis of results on training set (how well did you do on these?)
7. Answers to questions
8. Summary
Testing:
Your turned in directory MUST contain a results.txt file that shows the output for each message in the test file. If we provide multiple test files, each must have a results file (alternate names will provided for different files in this case).
Grading:
As has typically been the case, half your points are based on your
performance and half on the quality of the report and code.