To coordinate the effort, Professor Sweeney has already implemented some Java classes that form the general architecture in which your detector will operate. These classes are SpamBam.java and Detector.java. You do not need the .java versions of these files because you may not make any changes to them whatsoever. Technically, all you need in order to use them are their .class files. Nevertheless, for those wanting to pursue this lab further as part of their project, the source files are being made available.
The general idea is as follows. There will be about 20 detectors, where each detector checks for a particular feature in an email message that is associated with spam messages. Each detector returns a value between 0 and 1, inclusive, reporting the certainty. The value 1 means the message is spam and 0 means it is not. Values between 0 and 1 give a measure of how much certainty exists.
Here is a description of the classes provided.
The SpamBam class contains methods that coordinate the task. It maintains a collection of detectors (up to 100). These detectors can be automatically loaded into its collection by merely providing the .class file in the directory in which an instance of SpamBam operates. The loadDetectors() method is responsible for the autoloading of detectors. Basically, if a detector resides in the directory and has the name Detector1.class, Detector2.class, ..., Detector100.class, the file is added to the collection. If the file does not exist, it is not loaded. Therefore, the methods are operable with 0 to 100 detectors.
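The source of loadDetectors() is not reproduced in this handout, so the following is only a minimal sketch of how name-based autoloading could work, assuming the detector classes are on the classpath. The class name AutoloadSketch and the method loadByName() are hypothetical and are not part of SpamBam.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: SpamBam's real loadDetectors() may work differently.
class AutoloadSketch {
    static final int MAX_DETECTORS = 100;

    // Try <prefix>1 .. <prefix>100 by name and skip any class that is absent.
    static List<Object> loadByName(String prefix) {
        List<Object> loaded = new ArrayList<>();
        for (int i = 1; i <= MAX_DETECTORS; i++) {
            try {
                Class<?> c = Class.forName(prefix + i);
                loaded.add(c.getDeclaredConstructor().newInstance());
            } catch (ClassNotFoundException e) {
                // No such .class file on the classpath: it is simply not loaded.
            } catch (ReflectiveOperationException e) {
                // The class exists but could not be instantiated: skip it as well.
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        System.out.println(loadByName("Detector").size() + " detector(s) loaded");
    }
}
```

Because missing classes are skipped rather than treated as errors, a loop like this works with anywhere from 0 to 100 detectors present.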
The register() method allows a particular instance of a Detector to be loaded. This method is not needed for regular use.
The list() method returns a multi-line string containing an enumerated listing of the detectors currently loaded in the collection. The description of each detector is provided using the detector's getLabel() method, which is described below.
The checkMessage() method accepts a filename and a boolean flag (robust) and runs each detector on that file. If robust is true, then interim messages about detectors and their computed values are displayed. The results are stored in an array of floats, where the number of the detector in the collection (see list() above) corresponds to the array position of the float value containing the detector's result. A FileNotFoundException is thrown if problems are encountered accessing the file.
The listFiles() method returns a multi-line string containing an enumerated listing of the .java files found in the directory. The directory path is given as a string. If no string is provided, the current directory is used.
Finally, the resolve() method returns a string that contains the results and reports whether the message is considered spam, based on a computation over the results from the detectors.
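The handout does not say which computation resolve() performs over the detector results. As one illustrative possibility only, the sketch below averages the float results and compares the mean to a cutoff. The class ResolveSketch, the averaging rule, and the cutoff of 0.5 are all assumptions, not the behavior of SpamBam.

```java
// Sketch only: the actual computation inside SpamBam.resolve() is not documented here.
class ResolveSketch {
    // Hypothetical rule: the arithmetic mean of the detector scores.
    static float mean(float[] results) {
        if (results.length == 0) {
            return 0f; // no detectors loaded: no evidence of spam
        }
        float sum = 0f;
        for (float r : results) {
            sum += r;
        }
        return sum / results.length;
    }

    // Build a one-line verdict from the scores and an assumed cutoff.
    static String resolve(float[] results, float cutoff) {
        float score = mean(results);
        return (score >= cutoff ? "SPAM" : "NOT SPAM") + " (mean score " + score + ")";
    }

    public static void main(String[] args) {
        float[] results = {1.0f, 0.5f, 0.0f, 1.0f}; // made-up detector outputs
        System.out.println(resolve(results, 0.5f));
    }
}
```

A plain mean treats every detector as equally reliable; a weighted combination is another option if some detectors turn out to be more trustworthy than others.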
The Detector class is the superclass of all detectors. Your detector will extend this class. This class has two abstract methods which your detector will have to define. These are: getLabel() and check().
The getLabel() method returns a string containing a short description of the email message features that your detector tests. The text for your detector depends on which detector you decide to build. The list of detectors, along with the text that is to appear as its label, is provided in Step 1 below.
The check() method takes an InputStream parameter, which is assumed to provide the contents of an email message, and a boolean flag (robust). Your check() method returns a value between 0 and 1, inclusive, to report the amount of similarity found. If robust is true, then you should display the values computed in order to provide additional information about the detector's finding beyond the overall 0 to 1 decision it returns. The characteristics that you test with your detector determine how your version of check() will be defined. The list of detectors, along with descriptions of what the check() method is to accomplish, is provided in Step 1 below.
As an example of how this all fits together, here is a sample program named mySpamBam.java that uses these classes. A sample detector is also included for your review. This detector, which is discussed further below, computes the number of characters in an email message (a file ending in .txt) and reports on the results. This detector is available as Detector21.java.
In order to operate the program with email messages, you will have to provide a sub-directory named depot, inside the directory in which the .java files are located, that contains the email messages. In this version of the code, each email message appears in a separate text file in this directory. Some sample email files are provided. Additional email files are available on privacy.cs.cmu.edu in the spam/messages directory.
Once you understand the general idea, as described above, you are ready to begin. In a previous lecture, the following detector assignments were made:
| Student | Detector# | Description |
|---|---|---|
| dyh | 1. | lots of high ascii characters |
| jdewji | 2. | all uppercase |
| ghartman | 3. | attempt to obscure HTML links in message |
| tshah | 4. | Space out of words (e.g., V I A G R A) |
| hlp | 5. | keywords based on sexual references |
| nancyc | 6. | keywords based on purchases or sell attempts |
| rholcomb | 7. | blank email messages |
| wknop | 8. | gibberish letters and digits |
| yiliw | 9. | Notification message in which spam or virus was found |
| bingbin | 10. | Reply message ("Re:") with no original message |
| ihong | 11. | keywords based on pharmacy sales, weight loss, hair loss |
| mmezzour | 12. | lots of font color and size |
| aaziz | 13. | unusual occurrence of bold and italic characters |
| acecchet | 14. | attachment with little or no text |
| abugaj | 15. | keywords diplomas and degrees |
| jungkim | 16. | identify a different language codeset (not wanted) |
| cbenjavi | 17. | recognizing get rich fast (incl. make money ebay) |
| willhoit | 18. | no articles or connecting words, high occurrence of nouns |
| mogliari | 19. | |
| | 20. | |
Your detector will inherit from Detector and provide definitions for the abstract methods described there. In particular, you will have to provide: (1) a constructor that executes the constructor of the superclass; (2) a method named getLabel() that returns the text label provided in the table above; and (3) a check() method that performs the operation for your detector. The check() method accepts two parameters. The first is of type InputStream and the second is of type boolean. The method returns a value between 0 and 1, inclusive, that reports the certainty that the .txt email submission is spam. If the boolean value is true, additional information about the values computed by the detector is displayed.
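The three required pieces can be sketched as follows, using detector 2 ("all uppercase") as the example. So that the sketch compiles on its own, it includes a hypothetical stand-in for Detector. The real Detector.java may declare different modifiers, return types, or exceptions (the float return type here is an assumption based on the float results array), so check your methods against the provided class rather than this stand-in. The scoring rule, the fraction of letters that are uppercase, is likewise only an illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Example subclass for the "all uppercase" detector.
class Detector2 extends Detector {
    // (1) Constructor that executes the constructor of the superclass.
    Detector2() {
        super();
    }

    // (2) Label text taken from the assignment table.
    @Override
    String getLabel() {
        return "all uppercase";
    }

    // (3) Score in [0, 1]: the fraction of letters that are uppercase (assumed rule).
    @Override
    float check(InputStream in, boolean robust) {
        int letters = 0;
        int upper = 0;
        try {
            int b;
            while ((b = in.read()) != -1) {
                char c = (char) b; // each byte is treated as one Latin-1 character
                if (Character.isLetter(c)) {
                    letters++;
                    if (Character.isUpperCase(c)) {
                        upper++;
                    }
                }
            }
        } catch (IOException e) {
            return 0f; // unreadable input: report no evidence
        }
        float score = (letters == 0) ? 0f : (float) upper / letters;
        if (robust) {
            System.out.println(getLabel() + ": " + upper + " of " + letters
                    + " letters are uppercase, score " + score);
        }
        return score;
    }

    public static void main(String[] args) {
        InputStream msg = new ByteArrayInputStream(
                "BUY now".getBytes(StandardCharsets.US_ASCII));
        new Detector2().check(msg, true);
    }
}

// Hypothetical stand-in for the provided superclass; NOT the real Detector.java.
abstract class Detector {
    Detector() { }
    abstract String getLabel();
    abstract float check(InputStream in, boolean robust);
}
```

When you build against the real classes, delete the stand-in and keep only your DetectorN class.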
In writing your detector, you may create additional .java files as needed. If so, be sure to include these with your submission.
You may NOT change the contents of SpamBam or Detector under any circumstances UNLESS you are using a Mac or Linux machine. In that case, change the static constant named MACHINE that appears at the top of the SpamBam.java file. This string is the directory separator for file path statements. Its value should be "\\" for PC machines and "/" for Mac and Unix machines.
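For background only, Java itself exposes the platform's directory separator as java.io.File.separator. The short sketch below shows how a path into the depot sub-directory would be assembled with it. The class SeparatorSketch and the method depotPath() are hypothetical; SpamBam uses its own MACHINE constant, and this sketch is not a reason to make any other change to that file.

```java
import java.io.File;

// Sketch only: illustrates what a directory separator is used for.
class SeparatorSketch {
    // Join the depot sub-directory and a file name with the platform separator.
    static String depotPath(String fileName) {
        return "depot" + File.separator + fileName;
    }

    public static void main(String[] args) {
        System.out.println("Separator on this machine: " + File.separator);
        System.out.println(depotPath("message1.txt")); // hypothetical file name
    }
}
```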
The tricky part is the basis on which your detector makes a decision! When you look at the descriptions of the detectors you are writing, the code does not seem particularly difficult. It is not. However, in order to make your code effective as a detector for spam, you will have to experiment with spam messages to get parameters that establish a useful return value (0..1). This is the tricky part!
For example, Detector21.java is provided as a sample detector. It merely counts the total number of characters in the email message. That's the easy part. The tricky part is finding a scientific basis to determine what number of characters is an optimal threshold! In the code provided, the arbitrary values 300 and 1500 are used. These were just guesses! In your case, you need to conduct experiments to determine what values are reasonable and how to map those values into the 0..1 range necessary to provide a decision. This is the real intellectual part of the assignment and serves as the basis for the one-page abstract you write.
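One simple way to map a raw measurement into the 0..1 range is a linear ramp between two thresholds. The sketch below uses the 300 and 1500 character values mentioned above, but how Detector21 actually uses them is not shown in this handout. In particular, the direction of the ramp (longer message means higher score) is an assumption made purely for illustration, and ThresholdMapSketch and ramp() are hypothetical names.

```java
// Sketch only: one possible mapping from a count to a score in [0, 1].
class ThresholdMapSketch {
    // At or below low gives 0, at or above high gives 1, proportional in between.
    static float ramp(long count, long low, long high) {
        if (high <= low) {
            throw new IllegalArgumentException("high must exceed low");
        }
        if (count <= low) {
            return 0f;
        }
        if (count >= high) {
            return 1f;
        }
        return (float) (count - low) / (float) (high - low);
    }

    public static void main(String[] args) {
        long[] counts = {100, 300, 900, 1500, 4000}; // example character counts
        for (long c : counts) {
            System.out.println(c + " characters -> " + ramp(c, 300, 1500));
        }
    }
}
```

If your experiments show that the feature indicates spam when the count is small rather than large, return 1 minus the ramp value instead. The thresholds themselves should come from your measurements, not from guesses.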
Remember! Check your detector not only on the set of experimental messages you used to develop it, but also on other categories and collections of spam as well as on your own non-spam messages. See email archives for a list of sample spam archives.
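A minimal sketch of such a check, assuming you have already collected your detector's scores for a set of spam messages and a set of non-spam messages: count how often each set is flagged at a chosen cutoff. The class EvalSketch, the cutoff, and the score values in main() are made-up placeholders, not measurements.

```java
// Sketch only: summarizes how a detector behaves on two message collections.
class EvalSketch {
    // Fraction of scores at or above the cutoff, i.e. messages flagged as spam.
    static double flaggedFraction(float[] scores, float cutoff) {
        if (scores.length == 0) {
            return 0.0;
        }
        int flagged = 0;
        for (float s : scores) {
            if (s >= cutoff) {
                flagged++;
            }
        }
        return (double) flagged / scores.length;
    }

    public static void main(String[] args) {
        float[] spamScores = {0.9f, 0.8f, 0.4f, 1.0f};    // placeholder values
        float[] nonSpamScores = {0.1f, 0.6f, 0.0f, 0.2f}; // placeholder values
        float cutoff = 0.5f;                               // assumed cutoff
        System.out.println("Spam flagged:     " + flaggedFraction(spamScores, cutoff));
        System.out.println("Non-spam flagged: " + flaggedFraction(nonSpamScores, cutoff));
    }
}
```

A useful detector flags a large fraction of the spam collection and a small fraction of the non-spam collection. Repeating the count on messages that were not used to pick the thresholds shows whether the parameters generalize.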
We have provided some email samples for you to try your detectors on. These are available on privacy.cs.cmu.edu in the spam directory. Feel free to copy these files as needed to your machine to test your detector. In this assignment, you should use the email messages found under the spam/messages directory. Messages in this directory are stored with a single message in a text file. (The other directory, spam/mailboxes, has multiple messages in a text file, which is how email mailboxes are typically stored.) In the Lab, you will work with email messages stored as distinct text files. In the project, you may elect to work with email mailboxes.
Once you have your detector working, you can share your detector with the other students in the class, and in turn, you can also try detectors constructed by other students. However, you can only share your .class file(s). Do not share your .java files!
To share your detector, copy your .class file(s) to privacy.cs.cmu.edu using the spam/detectors directory. Be sure your detector is named properly -- Detector followed by its assigned number (see the table above) and then .class.
To test the detectors of other students, copy the files found on privacy.cs.cmu.edu in the spam/detectors directory. Each detector should be named Detector followed by its assigned number (see the table above) and then .class. Once you copy these files to your directory, running the program will automatically execute them. You do not have to change any code.
Submit the .java file for your detector and any accompanying files you created that your detector needs. Do not submit copies of the files we provided. Your solution must work with the original, unchanged version of those files. Also, do not submit any of the email messages provided for testing. Email your files to padlab@privacy.cs.cmu.edu as attachments.
You must also provide a one-page abstract based on your detector. Describe the algorithm you used, what features are tracked, how they are tracked, and the thresholds and bases for mapping those features into the range 0..1. Provide experimental results that support your design decisions as well as overall performance results. Submit your one-page abstract to padlab@privacy.cs.cmu.edu as a PDF or DOC file.
Enjoy this lab? If so, you may be interested in related projects.