diff --git a/ex6.pdf b/ex6.pdf new file mode 100644 index 0000000..07cc0fc Binary files /dev/null and b/ex6.pdf differ diff --git a/ex6/dataset3Params.m b/ex6/dataset3Params.m new file mode 100644 index 0000000..71a24d8 --- /dev/null +++ b/ex6/dataset3Params.m @@ -0,0 +1,34 @@ +function [C, sigma] = dataset3Params(X, y, Xval, yval) +%EX6PARAMS returns your choice of C and sigma for Part 3 of the exercise +%where you select the optimal (C, sigma) learning parameters to use for SVM +%with RBF kernel +% [C, sigma] = EX6PARAMS(X, y, Xval, yval) returns your choice of C and +% sigma. You should complete this function to return the optimal C and +% sigma based on a cross-validation set. +% + +% You need to return the following variables correctly. +C = 1; +sigma = 0.3; + +% ====================== YOUR CODE HERE ====================== +% Instructions: Fill in this function to return the optimal C and sigma +% learning parameters found using the cross validation set. +% You can use svmPredict to predict the labels on the cross +% validation set. For example, +% predictions = svmPredict(model, Xval); +% will return the predictions on the cross validation set. +% +% Note: You can compute the prediction error using +% mean(double(predictions ~= yval)) +% + + + + + + + +% ========================================================================= + +end diff --git a/ex6/emailFeatures.m b/ex6/emailFeatures.m new file mode 100644 index 0000000..37f8747 --- /dev/null +++ b/ex6/emailFeatures.m @@ -0,0 +1,61 @@ +function x = emailFeatures(word_indices) +%EMAILFEATURES takes in a word_indices vector and produces a feature vector +%from the word indices +% x = EMAILFEATURES(word_indices) takes in a word_indices vector and +% produces a feature vector from the word indices. + +% Total number of words in the dictionary +n = 1899; + +% You need to return the following variables correctly. 
+x = zeros(n, 1); + +% ====================== YOUR CODE HERE ====================== +% Instructions: Fill in this function to return a feature vector for the +% given email (word_indices). To help make it easier to +% process the emails, we have already pre-processed each +% email and converted each word in the email into an index in +% a fixed dictionary (of 1899 words). The variable +% word_indices contains the list of indices of the words +% which occur in one email. +% +% Concretely, if an email has the text: +% +% The quick brown fox jumped over the lazy dog. +% +% Then, the word_indices vector for this text might look +% like: +% +% 60 100 33 44 10 53 60 58 5 +% +% where we have mapped each word onto a number, for example: +% +% the -- 60 +% quick -- 100 +% ... +% +% (note: the above numbers are just an example and are not the +% actual mappings). +% +% Your task is to take one such word_indices vector and construct +% a binary feature vector that indicates whether a particular +% word occurs in the email. That is, x(i) = 1 when word i +% is present in the email. Concretely, if the word 'the' (say, +% index 60) appears in the email, then x(60) = 1. The feature +% vector should look like: +% +% x = [ 0 0 0 0 1 0 0 0 ... 0 0 0 0 1 ... 0 0 0 1 0 ..]; +% +% + + + + + + + + +% ========================================================================= + + +end diff --git a/ex6/emailSample1.txt b/ex6/emailSample1.txt new file mode 100644 index 0000000..eac52a3 --- /dev/null +++ b/ex6/emailSample1.txt @@ -0,0 +1,10 @@ +> Anyone knows how much it costs to host a web portal ? +> +Well, it depends on how many visitors you're expecting. +This can be anywhere from less than 10 bucks a month to a couple of $100. +You should checkout http://www.rackspace.com/ or perhaps Amazon EC2 +if youre running something big.. 
+ +To unsubscribe yourself from this mailing list, send an email to: +groupname-unsubscribe@egroups.com + diff --git a/ex6/emailSample2.txt b/ex6/emailSample2.txt new file mode 100644 index 0000000..e47acda --- /dev/null +++ b/ex6/emailSample2.txt @@ -0,0 +1,34 @@ +Folks, + +my first time posting - have a bit of Unix experience, but am new to Linux. + + +Just got a new PC at home - Dell box with Windows XP. Added a second hard disk +for Linux. Partitioned the disk and have installed Suse 7.2 from CD, which went +fine except it didn't pick up my monitor. + +I have a Dell branded E151FPp 15" LCD flat panel monitor and a nVidia GeForce4 +Ti4200 video card, both of which are probably too new to feature in Suse's default +set. I downloaded a driver from the nVidia website and installed it using RPM. +Then I ran Sax2 (as was recommended in some postings I found on the net), but +it still doesn't feature my video card in the available list. What next? + +Another problem. I have a Dell branded keyboard and if I hit Caps-Lock twice, +the whole machine crashes (in Linux, not Windows) - even the on/off switch is +inactive, leaving me to reach for the power cable instead. + +If anyone can help me in any way with these probs., I'd be really grateful - +I've searched the 'net but have run out of ideas. + +Or should I be going for a different version of Linux such as RedHat? Opinions +welcome. + +Thanks a lot, +Peter + +-- +Irish Linux Users' Group: ilug@linux.ie +http://www.linux.ie/mailman/listinfo/ilug for (un)subscription information. +List maintainer: listmaster@linux.ie + + diff --git a/ex6/ex6.m b/ex6/ex6.m new file mode 100644 index 0000000..a48e060 --- /dev/null +++ b/ex6/ex6.m @@ -0,0 +1,150 @@ +%% Machine Learning Online Class +% Exercise 6 | Support Vector Machines +% +% Instructions +% ------------ +% +% This file contains code that helps you get started on the +% exercise. 
You will need to complete the following functions: +% +% gaussianKernel.m +% dataset3Params.m +% processEmail.m +% emailFeatures.m +% +% For this exercise, you will not need to change any code in this file, +% or any other files other than those mentioned above. +% + +%% Initialization +clear ; close all; clc + +%% =============== Part 1: Loading and Visualizing Data ================ +% We start the exercise by first loading and visualizing the dataset. +% The following code will load the dataset into your environment and plot +% the data. +% + +fprintf('Loading and Visualizing Data ...\n') + +% Load from ex6data1: +% You will have X, y in your environment +load('ex6data1.mat'); + +% Plot training data +plotData(X, y); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% ==================== Part 2: Training Linear SVM ==================== +% The following code will train a linear SVM on the dataset and plot the +% decision boundary learned. +% + +% Load from ex6data1: +% You will have X, y in your environment +load('ex6data1.mat'); + +fprintf('\nTraining Linear SVM ...\n') + +% You should try to change the C value below and see how the decision +% boundary varies (e.g., try C = 1000) +C = 1; +model = svmTrain(X, y, C, @linearKernel, 1e-3, 20); +visualizeBoundaryLinear(X, y, model); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% =============== Part 3: Implementing Gaussian Kernel =============== +% You will now implement the Gaussian kernel to use +% with the SVM. You should complete the code in gaussianKernel.m +% +fprintf('\nEvaluating the Gaussian Kernel ...\n') + +x1 = [1 2 1]; x2 = [0 4 -1]; sigma = 2; +sim = gaussianKernel(x1, x2, sigma); + +fprintf(['Gaussian Kernel between x1 = [1; 2; 1], x2 = [0; 4; -1], sigma = 2 :' ... + '\n\t%f\n(this value should be about 0.324652)\n'], sim); + +fprintf('Program paused. 
Press enter to continue.\n'); +pause; + +%% =============== Part 4: Visualizing Dataset 2 ================ +% The following code will load the next dataset into your environment and +% plot the data. +% + +fprintf('Loading and Visualizing Data ...\n') + +% Load from ex6data2: +% You will have X, y in your environment +load('ex6data2.mat'); + +% Plot training data +plotData(X, y); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% ========== Part 5: Training SVM with RBF Kernel (Dataset 2) ========== +% After you have implemented the kernel, we can now use it to train the +% SVM classifier. +% +fprintf('\nTraining SVM with RBF Kernel (this may take 1 to 2 minutes) ...\n'); + +% Load from ex6data2: +% You will have X, y in your environment +load('ex6data2.mat'); + +% SVM Parameters +C = 1; sigma = 0.1; + +% We set the tolerance and max_passes lower here so that the code will run +% faster. However, in practice, you will want to run the training to +% convergence. +model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); +visualizeBoundary(X, y, model); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% =============== Part 6: Visualizing Dataset 3 ================ +% The following code will load the next dataset into your environment and +% plot the data. +% + +fprintf('Loading and Visualizing Data ...\n') + +% Load from ex6data3: +% You will have X, y in your environment +load('ex6data3.mat'); + +% Plot training data +plotData(X, y); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% ========== Part 7: Training SVM with RBF Kernel (Dataset 3) ========== + +% This is a different dataset that you can use to experiment with. Try +% different values of C and sigma here. 
+% + +% Load from ex6data3: +% You will have X, y in your environment +load('ex6data3.mat'); + +% Try different SVM Parameters here +[C, sigma] = dataset3Params(X, y, Xval, yval); + +% Train the SVM +model= svmTrain(X, y, C, @(x1, x2) gaussianKernel(x1, x2, sigma)); +visualizeBoundary(X, y, model); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + diff --git a/ex6/ex6_spam.m b/ex6/ex6_spam.m new file mode 100644 index 0000000..479848e --- /dev/null +++ b/ex6/ex6_spam.m @@ -0,0 +1,138 @@ +%% Machine Learning Online Class +% Exercise 6 | Spam Classification with SVMs +% +% Instructions +% ------------ +% +% This file contains code that helps you get started on the +% exercise. You will need to complete the following functions: +% +% gaussianKernel.m +% dataset3Params.m +% processEmail.m +% emailFeatures.m +% +% For this exercise, you will not need to change any code in this file, +% or any other files other than those mentioned above. +% + +%% Initialization +clear ; close all; clc + +%% ==================== Part 1: Email Preprocessing ==================== +% To use an SVM to classify emails into Spam v.s. Non-Spam, you first need +% to convert each email into a vector of features. In this part, you will +% implement the preprocessing steps for each email. You should +% complete the code in processEmail.m to produce a word indices vector +% for a given email. + +fprintf('\nPreprocessing sample email (emailSample1.txt)\n'); + +% Extract Features +file_contents = readFile('emailSample1.txt'); +word_indices = processEmail(file_contents); + +% Print Stats +fprintf('Word Indices: \n'); +fprintf(' %d', word_indices); +fprintf('\n\n'); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% ==================== Part 2: Feature Extraction ==================== +% Now, you will convert each email into a vector of features in R^n. +% You should complete the code in emailFeatures.m to produce a feature +% vector for a given email. 
+ +fprintf('\nExtracting features from sample email (emailSample1.txt)\n'); + +% Extract Features +file_contents = readFile('emailSample1.txt'); +word_indices = processEmail(file_contents); +features = emailFeatures(word_indices); + +% Print Stats +fprintf('Length of feature vector: %d\n', length(features)); +fprintf('Number of non-zero entries: %d\n', sum(features > 0)); + +fprintf('Program paused. Press enter to continue.\n'); +pause; + +%% =========== Part 3: Train Linear SVM for Spam Classification ======== +% In this section, you will train a linear classifier to determine if an +% email is Spam or Not-Spam. + +% Load the Spam Email dataset +% You will have X, y in your environment +load('spamTrain.mat'); + +fprintf('\nTraining Linear SVM (Spam Classification)\n') +fprintf('(this may take 1 to 2 minutes) ...\n') + +C = 0.1; +model = svmTrain(X, y, C, @linearKernel); + +p = svmPredict(model, X); + +fprintf('Training Accuracy: %f\n', mean(double(p == y)) * 100); + +%% =================== Part 4: Test Spam Classification ================ +% After training the classifier, we can evaluate it on a test set. We have +% included a test set in spamTest.mat + +% Load the test dataset +% You will have Xtest, ytest in your environment +load('spamTest.mat'); + +fprintf('\nEvaluating the trained Linear SVM on a test set ...\n') + +p = svmPredict(model, Xtest); + +fprintf('Test Accuracy: %f\n', mean(double(p == ytest)) * 100); +pause; + + +%% ================= Part 5: Top Predictors of Spam ==================== +% Since the model we are training is a linear SVM, we can inspect the +% weights learned by the model to understand better how it is determining +% whether an email is spam or not. The following code finds the words with +% the highest weights in the classifier. Informally, the classifier +% 'thinks' that these words are the most likely indicators of spam. 
+% + +% Sort the weights and obtain the vocabulary list +[weight, idx] = sort(model.w, 'descend'); +vocabList = getVocabList(); + +fprintf('\nTop predictors of spam: \n'); +for i = 1:15 + fprintf(' %-15s (%f) \n', vocabList{idx(i)}, weight(i)); +end + +fprintf('\n\n'); +fprintf('\nProgram paused. Press enter to continue.\n'); +pause; + +%% =================== Part 6: Try Your Own Emails ===================== +% Now that you've trained the spam classifier, you can use it on your own +% emails! In the starter code, we have included spamSample1.txt, +% spamSample2.txt, emailSample1.txt and emailSample2.txt as examples. +% The following code reads in one of these emails and then uses your +% learned SVM classifier to determine whether the email is Spam or +% Not Spam + +% Set the file to be read in (change this to spamSample2.txt, +% emailSample1.txt or emailSample2.txt to see different predictions on +% different email types). Try your own emails as well! +filename = 'spamSample1.txt'; + +% Read and predict +file_contents = readFile(filename); +word_indices = processEmail(file_contents); +x = emailFeatures(word_indices); +p = svmPredict(model, x); + +fprintf('\nProcessed %s\n\nSpam Classification: %d\n', filename, p); +fprintf('(1 indicates spam, 0 indicates not spam)\n\n'); + diff --git a/ex6/ex6data1.mat b/ex6/ex6data1.mat new file mode 100644 index 0000000..ae0d2aa Binary files /dev/null and b/ex6/ex6data1.mat differ diff --git a/ex6/ex6data2.mat b/ex6/ex6data2.mat new file mode 100644 index 0000000..c6ad661 Binary files /dev/null and b/ex6/ex6data2.mat differ diff --git a/ex6/ex6data3.mat b/ex6/ex6data3.mat new file mode 100644 index 0000000..a0441ac Binary files /dev/null and b/ex6/ex6data3.mat differ diff --git a/ex6/gaussianKernel.m b/ex6/gaussianKernel.m new file mode 100644 index 0000000..5aa2fec --- /dev/null +++ b/ex6/gaussianKernel.m @@ -0,0 +1,26 @@ +function sim = gaussianKernel(x1, x2, sigma) +%RBFKERNEL returns a radial basis function kernel between x1 
and x2 +% sim = gaussianKernel(x1, x2) returns a gaussian kernel between x1 and x2 +% and returns the value in sim + +% Ensure that x1 and x2 are column vectors +x1 = x1(:); x2 = x2(:); + +% You need to return the following variables correctly. +sim = 0; + +% ====================== YOUR CODE HERE ====================== +% Instructions: Fill in this function to return the similarity between x1 +% and x2 computed using a Gaussian kernel with bandwidth +% sigma +% +% + + + + + + +% ============================================================= + +end diff --git a/ex6/getVocabList.m b/ex6/getVocabList.m new file mode 100644 index 0000000..0b5f427 --- /dev/null +++ b/ex6/getVocabList.m @@ -0,0 +1,25 @@ +function vocabList = getVocabList() +%GETVOCABLIST reads the fixed vocabulary list in vocab.txt and returns a +%cell array of the words +% vocabList = GETVOCABLIST() reads the fixed vocabulary list in vocab.txt +% and returns a cell array of the words in vocabList. + + +%% Read the fixed vocabulary list +fid = fopen('vocab.txt'); + +% Store all dictionary words in cell array vocab{} +n = 1899; % Total number of words in the dictionary + +% For ease of implementation, we use a struct to map the strings => integers +% In practice, you'll want to use some form of hashmap +vocabList = cell(n, 1); +for i = 1:n + % Word Index (can ignore since it will be = i) + fscanf(fid, '%d', 1); + % Actual Word + vocabList{i} = fscanf(fid, '%s', 1); +end +fclose(fid); + +end diff --git a/ex6/linearKernel.m b/ex6/linearKernel.m new file mode 100644 index 0000000..11fd759 --- /dev/null +++ b/ex6/linearKernel.m @@ -0,0 +1,12 @@ +function sim = linearKernel(x1, x2) +%LINEARKERNEL returns a linear kernel between x1 and x2 +% sim = linearKernel(x1, x2) returns a linear kernel between x1 and x2 +% and returns the value in sim + +% Ensure that x1 and x2 are column vectors +x1 = x1(:); x2 = x2(:); + +% Compute the kernel +sim = x1' * x2; % dot product + +end \ No newline at end of file diff --git 
a/ex6/plotData.m b/ex6/plotData.m new file mode 100644 index 0000000..795cc16 --- /dev/null +++ b/ex6/plotData.m @@ -0,0 +1,17 @@ +function plotData(X, y) +%PLOTDATA Plots the data points X and y into a new figure +% PLOTDATA(x,y) plots the data points with + for the positive examples +% and o for the negative examples. X is assumed to be a Mx2 matrix. +% +% Note: This was slightly modified such that it expects y = 1 or y = 0 + +% Find Indices of Positive and Negative Examples +pos = find(y == 1); neg = find(y == 0); + +% Plot Examples +plot(X(pos, 1), X(pos, 2), 'k+','LineWidth', 1, 'MarkerSize', 7) +hold on; +plot(X(neg, 1), X(neg, 2), 'ko', 'MarkerFaceColor', 'y', 'MarkerSize', 7) +hold off; + +end diff --git a/ex6/porterStemmer.m b/ex6/porterStemmer.m new file mode 100644 index 0000000..6da6fd1 --- /dev/null +++ b/ex6/porterStemmer.m @@ -0,0 +1,385 @@ +function stem = porterStemmer(inString) +% Applies the Porter Stemming algorithm as presented in the following +% paper: +% Porter, 1980, An algorithm for suffix stripping, Program, Vol. 14, +% no. 3, pp 130-137 + +% Original code modeled after the C version provided at: +% http://www.tartarus.org/~martin/PorterStemmer/c.txt + +% The main part of the stemming algorithm starts here. b is an array of +% characters, holding the word to be stemmed. The letters are in b[k0], +% b[k0+1] ending at b[k]. In fact k0 = 1 in this demo program (since +% matlab begins indexing by 1 instead of 0). k is readjusted downwards as +% the stemming progresses. Zero termination is not in fact used in the +% algorithm. + +% To call this function, use the string to be stemmed as the input +% argument. This function returns the stemmed word as a string. + +% Lower-case string +inString = lower(inString); + +global j; +b = inString; +k = length(b); +k0 = 1; +j = k; + + + +% With this if statement, strings of length 1 or 2 don't go through the +% stemming process. Remove this conditional to match the published +% algorithm. 
+stem = b; +if k > 2 + % Output displays per step are commented out. + %disp(sprintf('Word to stem: %s', b)); + x = step1ab(b, k, k0); + %disp(sprintf('Steps 1A and B yield: %s', x{1})); + x = step1c(x{1}, x{2}, k0); + %disp(sprintf('Step 1C yields: %s', x{1})); + x = step2(x{1}, x{2}, k0); + %disp(sprintf('Step 2 yields: %s', x{1})); + x = step3(x{1}, x{2}, k0); + %disp(sprintf('Step 3 yields: %s', x{1})); + x = step4(x{1}, x{2}, k0); + %disp(sprintf('Step 4 yields: %s', x{1})); + x = step5(x{1}, x{2}, k0); + %disp(sprintf('Step 5 yields: %s', x{1})); + stem = x{1}; +end + +% cons(j) is TRUE <=> b[j] is a consonant. +function c = cons(i, b, k0) +c = true; +switch(b(i)) + case {'a', 'e', 'i', 'o', 'u'} + c = false; + case 'y' + if i == k0 + c = true; + else + c = ~cons(i - 1, b, k0); + end +end + +% mseq() measures the number of consonant sequences between k0 and j. If +% c is a consonant sequence and v a vowel sequence, and <..> indicates +% arbitrary presence, + +% gives 0 +% vc gives 1 +% vcvc gives 2 +% vcvcvc gives 3 +% .... +function n = measure(b, k0) +global j; +n = 0; +i = k0; +while true + if i > j + return + end + if ~cons(i, b, k0) + break; + end + i = i + 1; +end +i = i + 1; +while true + while true + if i > j + return + end + if cons(i, b, k0) + break; + end + i = i + 1; + end + i = i + 1; + n = n + 1; + while true + if i > j + return + end + if ~cons(i, b, k0) + break; + end + i = i + 1; + end + i = i + 1; +end + + +% vowelinstem() is TRUE <=> k0,...j contains a vowel +function vis = vowelinstem(b, k0) +global j; +for i = k0:j, + if ~cons(i, b, k0) + vis = true; + return + end +end +vis = false; + +%doublec(i) is TRUE <=> i,(i-1) contain a double consonant. +function dc = doublec(i, b, k0) +if i < k0+1 + dc = false; + return +end +if b(i) ~= b(i-1) + dc = false; + return +end +dc = cons(i, b, k0); + + +% cvc(j) is TRUE <=> j-2,j-1,j has the form consonant - vowel - consonant +% and also if the second c is not w,x or y. 
this is used when trying to +% restore an e at the end of a short word. e.g. +% +% cav(e), lov(e), hop(e), crim(e), but +% snow, box, tray. + +function c1 = cvc(i, b, k0) +if ((i < (k0+2)) || ~cons(i, b, k0) || cons(i-1, b, k0) || ~cons(i-2, b, k0)) + c1 = false; +else + if (b(i) == 'w' || b(i) == 'x' || b(i) == 'y') + c1 = false; + return + end + c1 = true; +end + +% ends(s) is TRUE <=> k0,...k ends with the string s. +function s = ends(str, b, k) +global j; +if (str(length(str)) ~= b(k)) + s = false; + return +end % tiny speed-up +if (length(str) > k) + s = false; + return +end +if strcmp(b(k-length(str)+1:k), str) + s = true; + j = k - length(str); + return +else + s = false; +end + +% setto(s) sets (j+1),...k to the characters in the string s, readjusting +% k accordingly. + +function so = setto(s, b, k) +global j; +for i = j+1:(j+length(s)) + b(i) = s(i-j); +end +if k > j+length(s) + b((j+length(s)+1):k) = ''; +end +k = length(b); +so = {b, k}; + +% rs(s) is used further down. +% [Note: possible null/value for r if rs is called] +function r = rs(str, b, k, k0) +r = {b, k}; +if measure(b, k0) > 0 + r = setto(str, b, k); +end + +% step1ab() gets rid of plurals and -ed or -ing. e.g. 
+ +% caresses -> caress +% ponies -> poni +% ties -> ti +% caress -> caress +% cats -> cat + +% feed -> feed +% agreed -> agree +% disabled -> disable + +% matting -> mat +% mating -> mate +% meeting -> meet +% milling -> mill +% messing -> mess + +% meetings -> meet + +function s1ab = step1ab(b, k, k0) +global j; +if b(k) == 's' + if ends('sses', b, k) + k = k-2; + elseif ends('ies', b, k) + retVal = setto('i', b, k); + b = retVal{1}; + k = retVal{2}; + elseif (b(k-1) ~= 's') + k = k-1; + end +end +if ends('eed', b, k) + if measure(b, k0) > 0; + k = k-1; + end +elseif (ends('ed', b, k) || ends('ing', b, k)) && vowelinstem(b, k0) + k = j; + retVal = {b, k}; + if ends('at', b, k) + retVal = setto('ate', b(k0:k), k); + elseif ends('bl', b, k) + retVal = setto('ble', b(k0:k), k); + elseif ends('iz', b, k) + retVal = setto('ize', b(k0:k), k); + elseif doublec(k, b, k0) + retVal = {b, k-1}; + if b(retVal{2}) == 'l' || b(retVal{2}) == 's' || ... + b(retVal{2}) == 'z' + retVal = {retVal{1}, retVal{2}+1}; + end + elseif measure(b, k0) == 1 && cvc(k, b, k0) + retVal = setto('e', b(k0:k), k); + end + k = retVal{2}; + b = retVal{1}(k0:k); +end +j = k; +s1ab = {b(k0:k), k}; + +% step1c() turns terminal y to i when there is another vowel in the stem. +function s1c = step1c(b, k, k0) +global j; +if ends('y', b, k) && vowelinstem(b, k0) + b(k) = 'i'; +end +j = k; +s1c = {b, k}; + +% step2() maps double suffices to single ones. so -ization ( = -ize plus +% -ation) maps to -ize etc. note that the string before the suffix must give +% m() > 0. 
+function s2 = step2(b, k, k0) +global j; +s2 = {b, k}; +switch b(k-1) + case {'a'} + if ends('ational', b, k) s2 = rs('ate', b, k, k0); + elseif ends('tional', b, k) s2 = rs('tion', b, k, k0); end; + case {'c'} + if ends('enci', b, k) s2 = rs('ence', b, k, k0); + elseif ends('anci', b, k) s2 = rs('ance', b, k, k0); end; + case {'e'} + if ends('izer', b, k) s2 = rs('ize', b, k, k0); end; + case {'l'} + if ends('bli', b, k) s2 = rs('ble', b, k, k0); + elseif ends('alli', b, k) s2 = rs('al', b, k, k0); + elseif ends('entli', b, k) s2 = rs('ent', b, k, k0); + elseif ends('eli', b, k) s2 = rs('e', b, k, k0); + elseif ends('ousli', b, k) s2 = rs('ous', b, k, k0); end; + case {'o'} + if ends('ization', b, k) s2 = rs('ize', b, k, k0); + elseif ends('ation', b, k) s2 = rs('ate', b, k, k0); + elseif ends('ator', b, k) s2 = rs('ate', b, k, k0); end; + case {'s'} + if ends('alism', b, k) s2 = rs('al', b, k, k0); + elseif ends('iveness', b, k) s2 = rs('ive', b, k, k0); + elseif ends('fulness', b, k) s2 = rs('ful', b, k, k0); + elseif ends('ousness', b, k) s2 = rs('ous', b, k, k0); end; + case {'t'} + if ends('aliti', b, k) s2 = rs('al', b, k, k0); + elseif ends('iviti', b, k) s2 = rs('ive', b, k, k0); + elseif ends('biliti', b, k) s2 = rs('ble', b, k, k0); end; + case {'g'} + if ends('logi', b, k) s2 = rs('log', b, k, k0); end; +end +j = s2{2}; + +% step3() deals with -ic-, -full, -ness etc. similar strategy to step2. +function s3 = step3(b, k, k0) +global j; +s3 = {b, k}; +switch b(k) + case {'e'} + if ends('icate', b, k) s3 = rs('ic', b, k, k0); + elseif ends('ative', b, k) s3 = rs('', b, k, k0); + elseif ends('alize', b, k) s3 = rs('al', b, k, k0); end; + case {'i'} + if ends('iciti', b, k) s3 = rs('ic', b, k, k0); end; + case {'l'} + if ends('ical', b, k) s3 = rs('ic', b, k, k0); + elseif ends('ful', b, k) s3 = rs('', b, k, k0); end; + case {'s'} + if ends('ness', b, k) s3 = rs('', b, k, k0); end; +end +j = s3{2}; + +% step4() takes off -ant, -ence etc., in context vcvc. 
+function s4 = step4(b, k, k0) +global j; +switch b(k-1) + case {'a'} + if ends('al', b, k) end; + case {'c'} + if ends('ance', b, k) + elseif ends('ence', b, k) end; + case {'e'} + if ends('er', b, k) end; + case {'i'} + if ends('ic', b, k) end; + case {'l'} + if ends('able', b, k) + elseif ends('ible', b, k) end; + case {'n'} + if ends('ant', b, k) + elseif ends('ement', b, k) + elseif ends('ment', b, k) + elseif ends('ent', b, k) end; + case {'o'} + if ends('ion', b, k) + if j == 0 + elseif ~(strcmp(b(j),'s') || strcmp(b(j),'t')) + j = k; + end + elseif ends('ou', b, k) end; + case {'s'} + if ends('ism', b, k) end; + case {'t'} + if ends('ate', b, k) + elseif ends('iti', b, k) end; + case {'u'} + if ends('ous', b, k) end; + case {'v'} + if ends('ive', b, k) end; + case {'z'} + if ends('ize', b, k) end; +end +if measure(b, k0) > 1 + s4 = {b(k0:j), j}; +else + s4 = {b(k0:k), k}; +end + +% step5() removes a final -e if m() > 1, and changes -ll to -l if m() > 1. +function s5 = step5(b, k, k0) +global j; +j = k; +if b(k) == 'e' + a = measure(b, k0); + if (a > 1) || ((a == 1) && ~cvc(k-1, b, k0)) + k = k-1; + end +end +if (b(k) == 'l') && doublec(k, b, k0) && (measure(b, k0) > 1) + k = k-1; +end +s5 = {b(k0:k), k}; diff --git a/ex6/processEmail.m b/ex6/processEmail.m new file mode 100644 index 0000000..234bae0 --- /dev/null +++ b/ex6/processEmail.m @@ -0,0 +1,125 @@ +function word_indices = processEmail(email_contents) +%PROCESSEMAIL preprocesses the body of an email and +%returns a list of word_indices +% word_indices = PROCESSEMAIL(email_contents) preprocesses +% the body of an email and returns a list of indices of the +% words contained in the email. 
+% + +% Load Vocabulary +vocabList = getVocabList(); + +% Init return value +word_indices = []; + +% ========================== Preprocess Email =========================== + +% Find the Headers ( \n\n and remove ) +% Uncomment the following lines if you are working with raw emails with the +% full headers + +% hdrstart = strfind(email_contents, ([char(10) char(10)])); +% email_contents = email_contents(hdrstart(1):end); + +% Lower case +email_contents = lower(email_contents); + +% Strip all HTML +% Looks for any expression that starts with < and ends with >, and does not +% contain any < or > inside the tag, and replaces it with a space +email_contents = regexprep(email_contents, '<[^<>]+>', ' '); + +% Handle Numbers +% Look for one or more characters between 0-9 +email_contents = regexprep(email_contents, '[0-9]+', 'number'); + +% Handle URLS +% Look for strings starting with http:// or https:// +email_contents = regexprep(email_contents, ... + '(http|https)://[^\s]*', 'httpaddr'); + +% Handle Email Addresses +% Look for strings with @ in the middle +email_contents = regexprep(email_contents, '[^\s]+@[^\s]+', 'emailaddr'); + +% Handle $ sign +email_contents = regexprep(email_contents, '[$]+', 'dollar'); + + +% ========================== Tokenize Email =========================== + +% Output the email to screen as well +fprintf('\n==== Processed Email ====\n\n'); + +% Process file +l = 0; + +while ~isempty(email_contents) + + % Tokenize and also get rid of any punctuation + [str, email_contents] = ... + strtok(email_contents, ... 
+ [' @$/#.-:&*+=[]?!(){},''">_<;%' char(10) char(13)]); + + % Remove any non alphanumeric characters + str = regexprep(str, '[^a-zA-Z0-9]', ''); + + % Stem the word + % (the porterStemmer sometimes has issues, so we use a try catch block) + try str = porterStemmer(strtrim(str)); + catch str = ''; continue; + end; + + % Skip the word if it is too short + if length(str) < 1 + continue; + end + + % Look up the word in the dictionary and add to word_indices if + % found + % ====================== YOUR CODE HERE ====================== + % Instructions: Fill in this function to add the index of str to + % word_indices if it is in the vocabulary. At this point + % of the code, you have a stemmed word from the email in + % the variable str. You should look up str in the + % vocabulary list (vocabList). If a match exists, you + % should add the index of the word to the word_indices + % vector. Concretely, if str = 'action', then you should + % look up the vocabulary list to find where in vocabList + % 'action' appears. For example, if vocabList{18} = + % 'action', then, you should add 18 to the word_indices + % vector (e.g., word_indices = [word_indices ; 18]; ). + % + % Note: vocabList{idx} returns the word with index idx in the + % vocabulary list. + % + % Note: You can use strcmp(str1, str2) to compare two strings (str1 and + % str2). It will return 1 only if the two strings are equivalent. 
+ % + + + + + + + + + + + % ============================================================= + + + % Print to screen, ensuring that the output lines are not too long + if (l + length(str) + 1) > 78 + fprintf('\n'); + l = 0; + end + fprintf('%s ', str); + l = l + length(str) + 1; + +end + +% Print footer +fprintf('\n\n=========================\n'); + +end diff --git a/ex6/readFile.m b/ex6/readFile.m new file mode 100644 index 0000000..08686d6 --- /dev/null +++ b/ex6/readFile.m @@ -0,0 +1,18 @@ +function file_contents = readFile(filename) +%READFILE reads a file and returns its entire contents +% file_contents = READFILE(filename) reads a file and returns its entire +% contents in file_contents +% + +% Load File (fopen returns -1 on failure, which is nonzero and so truthy) +fid = fopen(filename); +if fid ~= -1 + file_contents = fscanf(fid, '%c', inf); + fclose(fid); +else + file_contents = ''; + fprintf('Unable to open %s\n', filename); +end + +end + diff --git a/ex6/spamSample1.txt b/ex6/spamSample1.txt new file mode 100644 index 0000000..bab0ca2 --- /dev/null +++ b/ex6/spamSample1.txt @@ -0,0 +1,42 @@ +Do You Want To Make $1000 Or More Per Week? + + + +If you are a motivated and qualified individual - I +will personally demonstrate to you a system that will +make you $1,000 per week or more! This is NOT mlm. + + + +Call our 24 hour pre-recorded number to get the +details. + + + +000-456-789 + + + +I need people who want to make serious money. Make +the call and get the facts. + +Invest 2 minutes in yourself now! + + + +000-456-789 + + + +Looking forward to your call and I will introduce you +to people like yourself who +are currently making $10,000 plus per week! 
+ + + +000-456-789 + + + +3484lJGv6-241lEaN9080lRmS6-271WxHo7524qiyT5-438rjUv5615hQcf0-662eiDB9057dMtVl72 + diff --git a/ex6/spamSample2.txt b/ex6/spamSample2.txt new file mode 100644 index 0000000..f8e8fce --- /dev/null +++ b/ex6/spamSample2.txt @@ -0,0 +1,8 @@ +Best Buy Viagra Generic Online + +Viagra 100mg x 60 Pills $125, Free Pills & Reorder Discount, Top Selling 100% Quality & Satisfaction guaranteed! + +We accept VISA, Master & E-Check Payments, 90000+ Satisfied Customers! +http://medphysitcstech.ru + + diff --git a/ex6/spamTest.mat b/ex6/spamTest.mat new file mode 100644 index 0000000..b7bf953 Binary files /dev/null and b/ex6/spamTest.mat differ diff --git a/ex6/spamTrain.mat b/ex6/spamTrain.mat new file mode 100644 index 0000000..1b9c81f Binary files /dev/null and b/ex6/spamTrain.mat differ diff --git a/ex6/submit.m b/ex6/submit.m new file mode 100644 index 0000000..dc4e5d9 --- /dev/null +++ b/ex6/submit.m @@ -0,0 +1,573 @@ +function submit(partId, webSubmit) +%SUBMIT Submit your code and output to the ml-class servers +% SUBMIT() will connect to the ml-class server and submit your solution + + fprintf('==\n== [ml-class] Submitting Solutions | Programming Exercise %s\n==\n', ... + homework_id()); + if ~exist('partId', 'var') || isempty(partId) + partId = promptPart(); + end + + if ~exist('webSubmit', 'var') || isempty(webSubmit) + webSubmit = 0; % submit directly by default + end + + % Check valid partId + partNames = validParts(); + if ~isValidPartId(partId) + fprintf('!! Invalid homework part selected.\n'); + fprintf('!! Expected an integer from 1 to %d.\n', numel(partNames) + 1); + fprintf('!! Submission Cancelled\n'); + return + end + + if ~exist('ml_login_data.mat','file') + [login password] = loginPrompt(); + save('ml_login_data.mat','login','password'); + else + load('ml_login_data.mat'); + [login password] = quickLogin(login, password); + save('ml_login_data.mat','login','password'); + end + + if isempty(login) + fprintf('!! 
Submission Cancelled\n'); + return + end + + fprintf('\n== Connecting to ml-class ... '); + if exist('OCTAVE_VERSION') + fflush(stdout); + end + + % Setup submit list + if partId == numel(partNames) + 1 + submitParts = 1:numel(partNames); + else + submitParts = [partId]; + end + + for s = 1:numel(submitParts) + thisPartId = submitParts(s); + if (~webSubmit) % submit directly to server + [login, ch, signature, auxstring] = getChallenge(login, thisPartId); + if isempty(login) || isempty(ch) || isempty(signature) + % Some error occurred, error string in first return element. + fprintf('\n!! Error: %s\n\n', login); + return + end + + % Attempt Submission with Challenge + ch_resp = challengeResponse(login, password, ch); + + [result, str] = submitSolution(login, ch_resp, thisPartId, ... + output(thisPartId, auxstring), source(thisPartId), signature); + + partName = partNames{thisPartId}; + + fprintf('\n== [ml-class] Submitted Assignment %s - Part %d - %s\n', ... + homework_id(), thisPartId, partName); + fprintf('== %s\n', strtrim(str)); + + if exist('OCTAVE_VERSION') + fflush(stdout); + end + else + [result] = submitSolutionWeb(login, thisPartId, output(thisPartId), ... + source(thisPartId)); + result = base64encode(result); + + fprintf('\nSave as submission file [submit_ex%s_part%d.txt (enter to accept default)]:', ... + homework_id(), thisPartId); + saveAsFile = input('', 's'); + if (isempty(saveAsFile)) + saveAsFile = sprintf('submit_ex%s_part%d.txt', homework_id(), thisPartId); + end + + fid = fopen(saveAsFile, 'w'); + if (fid ~= -1) + fwrite(fid, result); + fclose(fid); + fprintf('\nSaved your solutions to %s.\n\n', saveAsFile); + fprintf(['You can now submit your solutions through the web \n' ... + 'form in the programming exercises. Select the corresponding \n' ... + 'programming exercise to access the form.\n']); + + else + fprintf('Unable to save to %s\n\n', saveAsFile); + fprintf(['You can create a submission file by saving the \n' ... 
+ 'following text in a file: (press enter to continue)\n\n']); + pause; + fprintf(result); + end + end + end +end + +% ================== CONFIGURABLES FOR EACH HOMEWORK ================== + +function id = homework_id() + id = '6'; +end + +function [partNames] = validParts() + partNames = { 'Gaussian Kernel', ... + 'Parameters (C, sigma) for Dataset 3', ... + 'Email Preprocessing' ... + 'Email Feature Extraction' ... + }; +end + +function srcs = sources() + % Separated by part + srcs = { { 'gaussianKernel.m' }, ... + { 'dataset3Params.m' }, ... + { 'processEmail.m' }, ... + { 'emailFeatures.m' } }; +end + +function out = output(partId, auxstring) + % Random Test Cases + x1 = sin(1:10)'; + x2 = cos(1:10)'; + ec = 'the quick brown fox jumped over the lazy dog'; + wi = 1 + abs(round(x1 * 1863)); + wi = [wi ; wi]; + if partId == 1 + sim = gaussianKernel(x1, x2, 2); + out = sprintf('%0.5f ', sim); + elseif partId == 2 + load('ex6data3.mat'); + [C, sigma] = dataset3Params(X, y, Xval, yval); + out = sprintf('%0.5f ', C); + out = [out sprintf('%0.5f ', sigma)]; + elseif partId == 3 + word_indices = processEmail(ec); + out = sprintf('%d ', word_indices); + elseif partId == 4 + x = emailFeatures(wi); + out = sprintf('%d ', x); + end +end + + +% ====================== SERVER CONFIGURATION =========================== + +% ***************** REMOVE -staging WHEN YOU DEPLOY ********************* +function url = site_url() + url = 'http://class.coursera.org/ml-007'; +end + +function url = challenge_url() + url = [site_url() '/assignment/challenge']; +end + +function url = submit_url() + url = [site_url() '/assignment/submit']; +end + +% ========================= CHALLENGE HELPERS ========================= + +function src = source(partId) + src = ''; + src_files = sources(); + if partId <= numel(src_files) + flist = src_files{partId}; + for i = 1:numel(flist) + fid = fopen(flist{i}); + if (fid == -1) + error('Error opening %s (is it missing?)', flist{i}); + end + line = fgets(fid); 
+ while ischar(line) + src = [src line]; + line = fgets(fid); + end + fclose(fid); + src = [src '||||||||']; + end + end +end + +function ret = isValidPartId(partId) + partNames = validParts(); + ret = (~isempty(partId)) && (partId >= 1) && (partId <= numel(partNames) + 1); +end + +function partId = promptPart() + fprintf('== Select which part(s) to submit:\n'); + partNames = validParts(); + srcFiles = sources(); + for i = 1:numel(partNames) + fprintf('== %d) %s [', i, partNames{i}); + fprintf(' %s ', srcFiles{i}{:}); + fprintf(']\n'); + end + fprintf('== %d) All of the above \n==\nEnter your choice [1-%d]: ', ... + numel(partNames) + 1, numel(partNames) + 1); + selPart = input('', 's'); + partId = str2num(selPart); + if ~isValidPartId(partId) + partId = -1; + end +end + +function [email,ch,signature,auxstring] = getChallenge(email, part) + str = urlread(challenge_url(), 'post', {'email_address', email, 'assignment_part_sid', [homework_id() '-' num2str(part)], 'response_encoding', 'delim'}); + + str = strtrim(str); + r = struct; + while(numel(str) > 0) + [f, str] = strtok (str, '|'); + [v, str] = strtok (str, '|'); + r = setfield(r, f, v); + end + + email = getfield(r, 'email_address'); + ch = getfield(r, 'challenge_key'); + signature = getfield(r, 'state'); + auxstring = getfield(r, 'challenge_aux_data'); +end + +function [result, str] = submitSolutionWeb(email, part, output, source) + + result = ['{"assignment_part_sid":"' base64encode([homework_id() '-' num2str(part)], '') '",' ... + '"email_address":"' base64encode(email, '') '",' ... + '"submission":"' base64encode(output, '') '",' ... + '"submission_aux":"' base64encode(source, '') '"' ... + '}']; + str = 'Web-submission'; +end + +function [result, str] = submitSolution(email, ch_resp, part, output, ... + source, signature) + + params = {'assignment_part_sid', [homework_id() '-' num2str(part)], ... + 'email_address', email, ... + 'submission', base64encode(output, ''), ... 
+ 'submission_aux', base64encode(source, ''), ... + 'challenge_response', ch_resp, ... + 'state', signature}; + + str = urlread(submit_url(), 'post', params); + + % Parse str to read for success / failure + result = 0; + +end + +% =========================== LOGIN HELPERS =========================== + +function [login password] = loginPrompt() + % Prompt for password + [login password] = basicPrompt(); + + if isempty(login) || isempty(password) + login = []; password = []; + end +end + + +function [login password] = basicPrompt() + login = input('Login (Email address): ', 's'); + password = input('Password: ', 's'); +end + +function [login password] = quickLogin(login,password) + disp(['You are currently logged in as ' login '.']); + cont_token = input('Is this you? (y/n - type n to reenter password)','s'); + if(isempty(cont_token) || cont_token(1)=='Y'||cont_token(1)=='y') + return; + else + [login password] = loginPrompt(); + end +end + +function [str] = challengeResponse(email, passwd, challenge) + str = sha1([challenge passwd]); +end + +% =============================== SHA-1 ================================ + +function hash = sha1(str) + + % Initialize variables + h0 = uint32(1732584193); + h1 = uint32(4023233417); + h2 = uint32(2562383102); + h3 = uint32(271733878); + h4 = uint32(3285377520); + + % Convert to word array + strlen = numel(str); + + % Break string into chars and append the bit 1 to the message + mC = [double(str) 128]; + mC = [mC zeros(1, 4-mod(numel(mC), 4), 'uint8')]; + + numB = strlen * 8; + if exist('idivide') + numC = idivide(uint32(numB + 65), 512, 'ceil'); + else + numC = ceil(double(numB + 65)/512); + end + numW = numC * 16; + mW = zeros(numW, 1, 'uint32'); + + idx = 1; + for i = 1:4:strlen + 1 + mW(idx) = bitor(bitor(bitor( ... + bitshift(uint32(mC(i)), 24), ... + bitshift(uint32(mC(i+1)), 16)), ... + bitshift(uint32(mC(i+2)), 8)), ... 
+ uint32(mC(i+3))); + idx = idx + 1; + end + + % Append length of message + mW(numW - 1) = uint32(bitshift(uint64(numB), -32)); + mW(numW) = uint32(bitshift(bitshift(uint64(numB), 32), -32)); + + % Process the message in successive 512-bit chs + for cId = 1 : double(numC) + cSt = (cId - 1) * 16 + 1; + cEnd = cId * 16; + ch = mW(cSt : cEnd); + + % Extend the sixteen 32-bit words into eighty 32-bit words + for j = 17 : 80 + ch(j) = ch(j - 3); + ch(j) = bitxor(ch(j), ch(j - 8)); + ch(j) = bitxor(ch(j), ch(j - 14)); + ch(j) = bitxor(ch(j), ch(j - 16)); + ch(j) = bitrotate(ch(j), 1); + end + + % Initialize hash value for this ch + a = h0; + b = h1; + c = h2; + d = h3; + e = h4; + + % Main loop + for i = 1 : 80 + if(i >= 1 && i <= 20) + f = bitor(bitand(b, c), bitand(bitcmp(b), d)); + k = uint32(1518500249); + elseif(i >= 21 && i <= 40) + f = bitxor(bitxor(b, c), d); + k = uint32(1859775393); + elseif(i >= 41 && i <= 60) + f = bitor(bitor(bitand(b, c), bitand(b, d)), bitand(c, d)); + k = uint32(2400959708); + elseif(i >= 61 && i <= 80) + f = bitxor(bitxor(b, c), d); + k = uint32(3395469782); + end + + t = bitrotate(a, 5); + t = bitadd(t, f); + t = bitadd(t, e); + t = bitadd(t, k); + t = bitadd(t, ch(i)); + e = d; + d = c; + c = bitrotate(b, 30); + b = a; + a = t; + + end + h0 = bitadd(h0, a); + h1 = bitadd(h1, b); + h2 = bitadd(h2, c); + h3 = bitadd(h3, d); + h4 = bitadd(h4, e); + + end + + hash = reshape(dec2hex(double([h0 h1 h2 h3 h4]), 8)', [1 40]); + + hash = lower(hash); + +end + +function ret = bitadd(iA, iB) + ret = double(iA) + double(iB); + ret = bitset(ret, 33, 0); + ret = uint32(ret); +end + +function ret = bitrotate(iA, places) + t = bitshift(iA, places - 32); + ret = bitshift(iA, places); + ret = bitor(ret, t); +end + +% =========================== Base64 Encoder ============================ +% Thanks to Peter John Acklam +% + +function y = base64encode(x, eol) +%BASE64ENCODE Perform base64 encoding on a string. 
+% +% BASE64ENCODE(STR, EOL) encode the given string STR. EOL is the line ending +% sequence to use; it is optional and defaults to '\n' (ASCII decimal 10). +% The returned encoded string is broken into lines of no more than 76 +% characters each, and each line will end with EOL unless it is empty. Let +% EOL be empty if you do not want the encoded string broken into lines. +% +% STR and EOL don't have to be strings (i.e., char arrays). The only +% requirement is that they are vectors containing values in the range 0-255. +% +% This function may be used to encode strings into the Base64 encoding +% specified in RFC 2045 - MIME (Multipurpose Internet Mail Extensions). The +% Base64 encoding is designed to represent arbitrary sequences of octets in a +% form that need not be humanly readable. A 65-character subset +% ([A-Za-z0-9+/=]) of US-ASCII is used, enabling 6 bits to be represented per +% printable character. +% +% Examples +% -------- +% +% If you want to encode a large file, you should encode it in chunks that are +% a multiple of 57 bytes. This ensures that the base64 lines line up and +% that you do not end up with padding in the middle. 57 bytes of data fills +% one complete base64 line (76 == 57*4/3): +% +% If ifid and ofid are two file identifiers opened for reading and writing, +% respectively, then you can base64 encode the data with +% +% while ~feof(ifid) +% fwrite(ofid, base64encode(fread(ifid, 60*57))); +% end +% +% or, if you have enough memory, +% +% fwrite(ofid, base64encode(fread(ifid))); +% +% See also BASE64DECODE. 
+ +% Author: Peter John Acklam +% Time-stamp: 2004-02-03 21:36:56 +0100 +% E-mail: pjacklam@online.no +% URL: http://home.online.no/~pjacklam + + if isnumeric(x) + x = num2str(x); + end + + % make sure we have the EOL value + if nargin < 2 + eol = sprintf('\n'); + else + if sum(size(eol) > 1) > 1 + error('EOL must be a vector.'); + end + if any(eol(:) > 255) + error('EOL can not contain values larger than 255.'); + end + end + + if sum(size(x) > 1) > 1 + error('STR must be a vector.'); + end + + x = uint8(x); + eol = uint8(eol); + + ndbytes = length(x); % number of decoded bytes + nchunks = ceil(ndbytes / 3); % number of chunks/groups + nebytes = 4 * nchunks; % number of encoded bytes + + % add padding if necessary, to make the length of x a multiple of 3 + if rem(ndbytes, 3) + x(end+1 : 3*nchunks) = 0; + end + + x = reshape(x, [3, nchunks]); % reshape the data + y = repmat(uint8(0), 4, nchunks); % for the encoded data + + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + % Split up every 3 bytes into 4 pieces + % + % aaaaaabb bbbbcccc ccdddddd + % + % to form + % + % 00aaaaaa 00bbbbbb 00cccccc 00dddddd + % + y(1,:) = bitshift(x(1,:), -2); % 6 highest bits of x(1,:) + + y(2,:) = bitshift(bitand(x(1,:), 3), 4); % 2 lowest bits of x(1,:) + y(2,:) = bitor(y(2,:), bitshift(x(2,:), -4)); % 4 highest bits of x(2,:) + + y(3,:) = bitshift(bitand(x(2,:), 15), 2); % 4 lowest bits of x(2,:) + y(3,:) = bitor(y(3,:), bitshift(x(3,:), -6)); % 2 highest bits of x(3,:) + + y(4,:) = bitand(x(3,:), 63); % 6 lowest bits of x(3,:) + + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + % Now perform the following mapping + % + % 0 - 25 -> A-Z + % 26 - 51 -> a-z + % 52 - 61 -> 0-9 + % 62 -> + + % 63 -> / + % + % We could use a mapping vector like + % + % ['A':'Z', 'a':'z', '0':'9', '+/'] + % + % but that would require an index vector of class double. 
+ % + z = repmat(uint8(0), size(y)); + i = y <= 25; z(i) = 'A' + double(y(i)); + i = 26 <= y & y <= 51; z(i) = 'a' - 26 + double(y(i)); + i = 52 <= y & y <= 61; z(i) = '0' - 52 + double(y(i)); + i = y == 62; z(i) = '+'; + i = y == 63; z(i) = '/'; + y = z; + + %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% + % Add padding if necessary. + % + npbytes = 3 * nchunks - ndbytes; % number of padding bytes + if npbytes + y(end-npbytes+1 : end) = '='; % '=' is used for padding + end + + if isempty(eol) + + % reshape to a row vector + y = reshape(y, [1, nebytes]); + + else + + nlines = ceil(nebytes / 76); % number of lines + neolbytes = length(eol); % number of bytes in eol string + + % pad data so it becomes a multiple of 76 elements + y = [y(:) ; zeros(76 * nlines - numel(y), 1)]; + y(nebytes + 1 : 76 * nlines) = 0; + y = reshape(y, 76, nlines); + + % insert eol strings + eol = eol(:); + y(end + 1 : end + neolbytes, :) = eol(:, ones(1, nlines)); + + % remove padding, but keep the last eol string + m = nebytes + neolbytes * (nlines - 1); + n = (76+neolbytes)*nlines - neolbytes; + y(m+1 : n) = ''; + + % extract and reshape to row vector + y = reshape(y, 1, m+neolbytes); + + end + + % output is a character array + y = char(y); + +end diff --git a/ex6/submitWeb.m b/ex6/submitWeb.m new file mode 100644 index 0000000..e429365 --- /dev/null +++ b/ex6/submitWeb.m @@ -0,0 +1,20 @@ +% submitWeb Creates files from your code and output for web submission. +% +% If the submit function does not work for you, use the web-submission mechanism. +% Call this function to produce a file for the part you wish to submit. Then, +% submit the file to the class servers using the "Web Submission" button on the +% Programming Exercises page on the course website. +% +% You should call this function without arguments (submitWeb), to receive +% an interactive prompt for submission; optionally you can call it with the partID +% if you so wish. 
Make sure your working directory is set to the directory +% containing the submitWeb.m file and your assignment files. + +function submitWeb(partId) + if ~exist('partId', 'var') || isempty(partId) + partId = []; + end + + submit(partId, 1); +end + diff --git a/ex6/svmPredict.m b/ex6/svmPredict.m new file mode 100644 index 0000000..ec8ef77 --- /dev/null +++ b/ex6/svmPredict.m @@ -0,0 +1,54 @@ +function pred = svmPredict(model, X) +%SVMPREDICT returns a vector of predictions using a trained SVM model +%(svmTrain). +% pred = SVMPREDICT(model, X) returns a vector of predictions using a +% trained SVM model (svmTrain). X is an m x n matrix where each +% example is a row. model is an SVM model returned from svmTrain. +% pred is an m x 1 column vector of predictions with values in {0, 1}. +% + +% Check if we are getting a column vector, if so, then assume that we only +% need to do prediction for a single example +if (size(X, 2) == 1) + % Examples should be in rows + X = X'; +end + +% Dataset +m = size(X, 1); +p = zeros(m, 1); +pred = zeros(m, 1); + +if strcmp(func2str(model.kernelFunction), 'linearKernel') + % We can use the weights and bias directly if working with the + % linear kernel + p = X * model.w + model.b; +elseif strfind(func2str(model.kernelFunction), 'gaussianKernel') + % Vectorized RBF Kernel + % This is equivalent to computing the kernel on every pair of examples + X1 = sum(X.^2, 2); + X2 = sum(model.X.^2, 2)'; + K = bsxfun(@plus, X1, bsxfun(@plus, X2, - 2 * X * model.X')); + K = model.kernelFunction(1, 0) .^ K; + K = bsxfun(@times, model.y', K); + K = bsxfun(@times, model.alphas', K); + p = sum(K, 2); +else + % Other Non-linear kernel + for i = 1:m + prediction = 0; + for j = 1:size(model.X, 1) + prediction = prediction + ... + model.alphas(j) * model.y(j) * ... 
+ model.kernelFunction(X(i,:)', model.X(j,:)'); + end + p(i) = prediction + model.b; + end +end + +% Convert predictions into 0 / 1 +pred(p >= 0) = 1; +pred(p < 0) = 0; + +end + diff --git a/ex6/svmTrain.m b/ex6/svmTrain.m new file mode 100644 index 0000000..2b2f169 --- /dev/null +++ b/ex6/svmTrain.m @@ -0,0 +1,192 @@ +function [model] = svmTrain(X, Y, C, kernelFunction, ... + tol, max_passes) +%SVMTRAIN Trains an SVM classifier using a simplified version of the SMO +%algorithm. +% [model] = SVMTRAIN(X, Y, C, kernelFunction, tol, max_passes) trains an +% SVM classifier and returns trained model. X is the matrix of training +% examples. Each row is a training example, and the jth column holds the +% jth feature. Y is a column matrix containing 1 for positive examples +% and 0 for negative examples. C is the standard SVM regularization +% parameter. tol is a tolerance value used for determining equality of +% floating point numbers. max_passes controls the number of iterations +% over the dataset (without changes to alpha) before the algorithm quits. +% +% Note: This is a simplified version of the SMO algorithm for training +% SVMs. In practice, if you want to train an SVM classifier, we +% recommend using an optimized package such as: +% +% LIBSVM (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) +% SVMLight (http://svmlight.joachims.org/) +% +% + +if ~exist('tol', 'var') || isempty(tol) + tol = 1e-3; +end + +if ~exist('max_passes', 'var') || isempty(max_passes) + max_passes = 5; +end + +% Data parameters +m = size(X, 1); +n = size(X, 2); + +% Map 0 to -1 +Y(Y==0) = -1; + +% Variables +alphas = zeros(m, 1); +b = 0; +E = zeros(m, 1); +passes = 0; +eta = 0; +L = 0; +H = 0; + +% Pre-compute the Kernel Matrix since our dataset is small +% (in practice, optimized SVM packages that handle large datasets +% gracefully will _not_ do this) +% +% We have implemented optimized vectorized version of the Kernels here so +% that the svm training will run faster. 
+if strcmp(func2str(kernelFunction), 'linearKernel') + % Vectorized computation for the Linear Kernel + % This is equivalent to computing the kernel on every pair of examples + K = X*X'; +elseif strfind(func2str(kernelFunction), 'gaussianKernel') + % Vectorized RBF Kernel + % This is equivalent to computing the kernel on every pair of examples + X2 = sum(X.^2, 2); + K = bsxfun(@plus, X2, bsxfun(@plus, X2', - 2 * (X * X'))); + K = kernelFunction(1, 0) .^ K; +else + % Pre-compute the Kernel Matrix + % The following can be slow due to the lack of vectorization + K = zeros(m); + for i = 1:m + for j = i:m + K(i,j) = kernelFunction(X(i,:)', X(j,:)'); + K(j,i) = K(i,j); %the matrix is symmetric + end + end +end + +% Train +fprintf('\nTraining ...'); +dots = 12; +while passes < max_passes, + + num_changed_alphas = 0; + for i = 1:m, + + % Calculate Ei = f(x(i)) - y(i) using (2). + % E(i) = b + sum (X(i, :) * (repmat(alphas.*Y,1,n).*X)') - Y(i); + E(i) = b + sum (alphas.*Y.*K(:,i)) - Y(i); + + if ((Y(i)*E(i) < -tol && alphas(i) < C) || (Y(i)*E(i) > tol && alphas(i) > 0)), + + % In practice, there are many heuristics one can use to select + % the i and j. In this simplified code, we select them randomly. + j = ceil(m * rand()); + while j == i, % Make sure i \neq j + j = ceil(m * rand()); + end + + % Calculate Ej = f(x(j)) - y(j) using (2). + E(j) = b + sum (alphas.*Y.*K(:,j)) - Y(j); + + % Save old alphas + alpha_i_old = alphas(i); + alpha_j_old = alphas(j); + + % Compute L and H by (10) or (11). + if (Y(i) == Y(j)), + L = max(0, alphas(j) + alphas(i) - C); + H = min(C, alphas(j) + alphas(i)); + else + L = max(0, alphas(j) - alphas(i)); + H = min(C, C + alphas(j) - alphas(i)); + end + + if (L == H), + % continue to next i. + continue; + end + + % Compute eta by (14). + eta = 2 * K(i,j) - K(i,i) - K(j,j); + if (eta >= 0), + % continue to next i. + continue; + end + + % Compute and clip new value for alpha j using (12) and (15). 
+ alphas(j) = alphas(j) - (Y(j) * (E(i) - E(j))) / eta; + + % Clip + alphas(j) = min (H, alphas(j)); + alphas(j) = max (L, alphas(j)); + + % Check if change in alpha is significant + if (abs(alphas(j) - alpha_j_old) < tol), + % continue to next i. + % replace anyway + alphas(j) = alpha_j_old; + continue; + end + + % Determine value for alpha i using (16). + alphas(i) = alphas(i) + Y(i)*Y(j)*(alpha_j_old - alphas(j)); + + % Compute b1 and b2 using (17) and (18) respectively. + b1 = b - E(i) ... + - Y(i) * (alphas(i) - alpha_i_old) * K(i,j)' ... + - Y(j) * (alphas(j) - alpha_j_old) * K(i,j)'; + b2 = b - E(j) ... + - Y(i) * (alphas(i) - alpha_i_old) * K(i,j)' ... + - Y(j) * (alphas(j) - alpha_j_old) * K(j,j)'; + + % Compute b by (19). + if (0 < alphas(i) && alphas(i) < C), + b = b1; + elseif (0 < alphas(j) && alphas(j) < C), + b = b2; + else + b = (b1+b2)/2; + end + + num_changed_alphas = num_changed_alphas + 1; + + end + + end + + if (num_changed_alphas == 0), + passes = passes + 1; + else + passes = 0; + end + + fprintf('.'); + dots = dots + 1; + if dots > 78 + dots = 0; + fprintf('\n'); + end + if exist('OCTAVE_VERSION') + fflush(stdout); + end +end +fprintf(' Done! 
\n\n'); + +% Save the model +idx = alphas > 0; +model.X= X(idx,:); +model.y= Y(idx); +model.kernelFunction = kernelFunction; +model.b= b; +model.alphas= alphas(idx); +model.w = ((alphas.*Y)'*X)'; + +end diff --git a/ex6/visualizeBoundary.m b/ex6/visualizeBoundary.m new file mode 100644 index 0000000..b2020a7 --- /dev/null +++ b/ex6/visualizeBoundary.m @@ -0,0 +1,24 @@ +function visualizeBoundary(X, y, model, varargin) +%VISUALIZEBOUNDARY plots a non-linear decision boundary learned by the SVM +% VISUALIZEBOUNDARY(X, y, model) plots a non-linear decision +% boundary learned by the SVM and overlays the data on it + +% Plot the training data on top of the boundary +plotData(X, y) + +% Make classification predictions over a grid of values +x1plot = linspace(min(X(:,1)), max(X(:,1)), 100)'; +x2plot = linspace(min(X(:,2)), max(X(:,2)), 100)'; +[X1, X2] = meshgrid(x1plot, x2plot); +vals = zeros(size(X1)); +for i = 1:size(X1, 2) + this_X = [X1(:, i), X2(:, i)]; + vals(:, i) = svmPredict(model, this_X); +end + +% Plot the SVM boundary +hold on +contour(X1, X2, vals, [0 0], 'Color', 'b'); +hold off; + +end diff --git a/ex6/visualizeBoundaryLinear.m b/ex6/visualizeBoundaryLinear.m new file mode 100644 index 0000000..f17f5ca --- /dev/null +++ b/ex6/visualizeBoundaryLinear.m @@ -0,0 +1,16 @@ +function visualizeBoundaryLinear(X, y, model) +%VISUALIZEBOUNDARYLINEAR plots a linear decision boundary learned by the +%SVM +% VISUALIZEBOUNDARYLINEAR(X, y, model) plots a linear decision boundary +% learned by the SVM and overlays the data on it + +w = model.w; +b = model.b; +xp = linspace(min(X(:,1)), max(X(:,1)), 100); +yp = - (w(1)*xp + b)/w(2); +plotData(X, y); +hold on; +plot(xp, yp, '-b'); +hold off + +end diff --git a/ex6/vocab.txt b/ex6/vocab.txt new file mode 100644 index 0000000..27f64a3 --- /dev/null +++ b/ex6/vocab.txt @@ -0,0 +1,1899 @@ +1 aa +2 ab +3 abil +4 abl +5 about +6 abov +7 absolut +8 abus +9 ac +10 accept +11 access +12 accord +13 account +14 achiev +15 
acquir +16 across +17 act +18 action +19 activ +20 actual +21 ad +22 adam +23 add +24 addit +25 address +26 administr +27 adult +28 advanc +29 advantag +30 advertis +31 advic +32 advis +33 ae +34 af +35 affect +36 affili +37 afford +38 africa +39 after +40 ag +41 again +42 against +43 agenc +44 agent +45 ago +46 agre +47 agreement +48 aid +49 air +50 al +51 alb +52 align +53 all +54 allow +55 almost +56 alon +57 along +58 alreadi +59 alsa +60 also +61 altern +62 although +63 alwai +64 am +65 amaz +66 america +67 american +68 among +69 amount +70 amp +71 an +72 analysi +73 analyst +74 and +75 ani +76 anim +77 announc +78 annual +79 annuiti +80 anoth +81 answer +82 anti +83 anumb +84 anybodi +85 anymor +86 anyon +87 anyth +88 anywai +89 anywher +90 aol +91 ap +92 apolog +93 app +94 appar +95 appear +96 appl +97 appli +98 applic +99 appreci +100 approach +101 approv +102 apt +103 ar +104 archiv +105 area +106 aren +107 argument +108 arial +109 arm +110 around +111 arrai +112 arriv +113 art +114 articl +115 artist +116 as +117 ascii +118 ask +119 asset +120 assist +121 associ +122 assum +123 assur +124 at +125 atol +126 attach +127 attack +128 attempt +129 attent +130 attornei +131 attract +132 audio +133 aug +134 august +135 author +136 auto +137 autom +138 automat +139 avail +140 averag +141 avoid +142 awai +143 awar +144 award +145 ba +146 babi +147 back +148 background +149 backup +150 bad +151 balanc +152 ban +153 bank +154 bar +155 base +156 basenumb +157 basi +158 basic +159 bb +160 bc +161 bd +162 be +163 beat +164 beberg +165 becaus +166 becom +167 been +168 befor +169 begin +170 behalf +171 behavior +172 behind +173 believ +174 below +175 benefit +176 best +177 beta +178 better +179 between +180 bf +181 big +182 bill +183 billion +184 bin +185 binari +186 bit +187 black +188 blank +189 block +190 blog +191 blood +192 blue +193 bnumber +194 board +195 bodi +196 boi +197 bonu +198 book +199 boot +200 border +201 boss +202 boston +203 botan +204 both +205 bottl 
+206 bottom +207 boundari +208 box +209 brain +210 brand +211 break +212 brian +213 bring +214 broadcast +215 broker +216 browser +217 bug +218 bui +219 build +220 built +221 bulk +222 burn +223 bush +224 busi +225 but +226 button +227 by +228 byte +229 ca +230 cabl +231 cach +232 calcul +233 california +234 call +235 came +236 camera +237 campaign +238 can +239 canada +240 cannot +241 canon +242 capabl +243 capillari +244 capit +245 car +246 card +247 care +248 career +249 carri +250 cartridg +251 case +252 cash +253 cat +254 catch +255 categori +256 caus +257 cb +258 cc +259 cd +260 ce +261 cell +262 cent +263 center +264 central +265 centuri +266 ceo +267 certain +268 certainli +269 cf +270 challeng +271 chanc +272 chang +273 channel +274 char +275 charact +276 charg +277 charset +278 chat +279 cheap +280 check +281 cheer +282 chief +283 children +284 china +285 chip +286 choic +287 choos +288 chri +289 citi +290 citizen +291 civil +292 claim +293 class +294 classifi +295 clean +296 clear +297 clearli +298 click +299 client +300 close +301 clue +302 cnet +303 cnumber +304 co +305 code +306 collect +307 colleg +308 color +309 com +310 combin +311 come +312 comfort +313 command +314 comment +315 commentari +316 commerci +317 commiss +318 commit +319 common +320 commun +321 compani +322 compar +323 comparison +324 compat +325 compet +326 competit +327 compil +328 complet +329 comprehens +330 comput +331 concentr +332 concept +333 concern +334 condit +335 conf +336 confer +337 confid +338 confidenti +339 config +340 configur +341 confirm +342 conflict +343 confus +344 congress +345 connect +346 consid +347 consolid +348 constitut +349 construct +350 consult +351 consum +352 contact +353 contain +354 content +355 continu +356 contract +357 contribut +358 control +359 conveni +360 convers +361 convert +362 cool +363 cooper +364 copi +365 copyright +366 core +367 corpor +368 correct +369 correspond +370 cost +371 could +372 couldn +373 count +374 countri +375 coupl 
+376 cours +377 court +378 cover +379 coverag +380 crash +381 creat +382 creativ +383 credit +384 critic +385 cross +386 cultur +387 current +388 custom +389 cut +390 cv +391 da +392 dagga +393 dai +394 daili +395 dan +396 danger +397 dark +398 data +399 databas +400 datapow +401 date +402 dave +403 david +404 dc +405 de +406 dead +407 deal +408 dear +409 death +410 debt +411 decad +412 decid +413 decis +414 declar +415 declin +416 decor +417 default +418 defend +419 defens +420 defin +421 definit +422 degre +423 delai +424 delet +425 deliv +426 deliveri +427 dell +428 demand +429 democrat +430 depart +431 depend +432 deposit +433 describ +434 descript +435 deserv +436 design +437 desir +438 desktop +439 despit +440 detail +441 detect +442 determin +443 dev +444 devel +445 develop +446 devic +447 di +448 dial +449 did +450 didn +451 diet +452 differ +453 difficult +454 digit +455 direct +456 directli +457 director +458 directori +459 disabl +460 discount +461 discov +462 discoveri +463 discuss +464 disk +465 displai +466 disposit +467 distanc +468 distribut +469 dn +470 dnumber +471 do +472 doc +473 document +474 doe +475 doer +476 doesn +477 dollar +478 dollarac +479 dollarnumb +480 domain +481 don +482 done +483 dont +484 doubl +485 doubt +486 down +487 download +488 dr +489 draw +490 dream +491 drive +492 driver +493 drop +494 drug +495 due +496 dure +497 dvd +498 dw +499 dynam +500 ea +501 each +502 earli +503 earlier +504 earn +505 earth +506 easi +507 easier +508 easili +509 eat +510 eb +511 ebai +512 ec +513 echo +514 econom +515 economi +516 ed +517 edg +518 edit +519 editor +520 educ +521 eff +522 effect +523 effici +524 effort +525 either +526 el +527 electron +528 elimin +529 els +530 email +531 emailaddr +532 emerg +533 empir +534 employ +535 employe +536 en +537 enabl +538 encod +539 encourag +540 end +541 enemi +542 enenkio +543 energi +544 engin +545 english +546 enhanc +547 enjoi +548 enough +549 ensur +550 enter +551 enterpris +552 entertain +553 
entir +554 entri +555 enumb +556 environ +557 equal +558 equip +559 equival +560 error +561 especi +562 essenti +563 establish +564 estat +565 estim +566 et +567 etc +568 euro +569 europ +570 european +571 even +572 event +573 eventu +574 ever +575 everi +576 everyon +577 everyth +578 evid +579 evil +580 exactli +581 exampl +582 excel +583 except +584 exchang +585 excit +586 exclus +587 execut +588 exercis +589 exist +590 exmh +591 expand +592 expect +593 expens +594 experi +595 expert +596 expir +597 explain +598 explor +599 express +600 extend +601 extens +602 extra +603 extract +604 extrem +605 ey +606 fa +607 face +608 fact +609 factor +610 fail +611 fair +612 fall +613 fals +614 famili +615 faq +616 far +617 fast +618 faster +619 fastest +620 fat +621 father +622 favorit +623 fax +624 fb +625 fd +626 featur +627 feder +628 fee +629 feed +630 feedback +631 feel +632 femal +633 few +634 ffffff +635 ffnumber +636 field +637 fight +638 figur +639 file +640 fill +641 film +642 filter +643 final +644 financ +645 financi +646 find +647 fine +648 finish +649 fire +650 firewal +651 firm +652 first +653 fit +654 five +655 fix +656 flag +657 flash +658 flow +659 fnumber +660 focu +661 folder +662 folk +663 follow +664 font +665 food +666 for +667 forc +668 foreign +669 forev +670 forget +671 fork +672 form +673 format +674 former +675 fortun +676 forward +677 found +678 foundat +679 four +680 franc +681 free +682 freedom +683 french +684 freshrpm +685 fri +686 fridai +687 friend +688 from +689 front +690 ftoc +691 ftp +692 full +693 fulli +694 fun +695 function +696 fund +697 further +698 futur +699 ga +700 gain +701 game +702 gari +703 garrigu +704 gave +705 gcc +706 geek +707 gener +708 get +709 gif +710 gift +711 girl +712 give +713 given +714 global +715 gnome +716 gnu +717 gnupg +718 go +719 goal +720 god +721 goe +722 gold +723 gone +724 good +725 googl +726 got +727 govern +728 gpl +729 grand +730 grant +731 graphic +732 great +733 greater +734 ground +735 group 
+736 grow +737 growth +738 gt +739 guarante +740 guess +741 gui +742 guid +743 ha +744 hack +745 had +746 half +747 ham +748 hand +749 handl +750 happen +751 happi +752 hard +753 hardwar +754 hat +755 hate +756 have +757 haven +758 he +759 head +760 header +761 headlin +762 health +763 hear +764 heard +765 heart +766 heaven +767 hei +768 height +769 held +770 hello +771 help +772 helvetica +773 her +774 herba +775 here +776 hermio +777 hettinga +778 hi +779 high +780 higher +781 highli +782 highlight +783 him +784 histori +785 hit +786 hold +787 home +788 honor +789 hope +790 host +791 hot +792 hour +793 hous +794 how +795 howev +796 hp +797 html +798 http +799 httpaddr +800 huge +801 human +802 hundr +803 ibm +804 id +805 idea +806 ident +807 identifi +808 idnumb +809 ie +810 if +811 ignor +812 ii +813 iii +814 iiiiiiihnumberjnumberhnumberjnumberhnumb +815 illeg +816 im +817 imag +818 imagin +819 immedi +820 impact +821 implement +822 import +823 impress +824 improv +825 in +826 inc +827 includ +828 incom +829 increas +830 incred +831 inde +832 independ +833 index +834 india +835 indian +836 indic +837 individu +838 industri +839 info +840 inform +841 initi +842 inlin +843 innov +844 input +845 insert +846 insid +847 instal +848 instanc +849 instant +850 instead +851 institut +852 instruct +853 insur +854 int +855 integr +856 intel +857 intellig +858 intend +859 interact +860 interest +861 interfac +862 intern +863 internet +864 interview +865 into +866 intro +867 introduc +868 inumb +869 invest +870 investig +871 investor +872 invok +873 involv +874 ip +875 ireland +876 irish +877 is +878 island +879 isn +880 iso +881 isp +882 issu +883 it +884 item +885 itself +886 jabber +887 jame +888 java +889 jim +890 jnumberiiiiiiihepihepihf +891 job +892 joe +893 john +894 join +895 journal +896 judg +897 judgment +898 jul +899 juli +900 jump +901 june +902 just +903 justin +904 keep +905 kei +906 kept +907 kernel +908 kevin +909 keyboard +910 kid +911 kill +912 kind +913 
king +914 kingdom +915 knew +916 know +917 knowledg +918 known +919 la +920 lack +921 land +922 languag +923 laptop +924 larg +925 larger +926 largest +927 laser +928 last +929 late +930 later +931 latest +932 launch +933 law +934 lawrenc +935 le +936 lead +937 leader +938 learn +939 least +940 leav +941 left +942 legal +943 lender +944 length +945 less +946 lesson +947 let +948 letter +949 level +950 lib +951 librari +952 licens +953 life +954 lifetim +955 light +956 like +957 limit +958 line +959 link +960 linux +961 list +962 listen +963 littl +964 live +965 ll +966 lo +967 load +968 loan +969 local +970 locat +971 lock +972 lockergnom +973 log +974 long +975 longer +976 look +977 lose +978 loss +979 lost +980 lot +981 love +982 low +983 lower +984 lowest +985 lt +986 ma +987 mac +988 machin +989 made +990 magazin +991 mai +992 mail +993 mailer +994 main +995 maintain +996 major +997 make +998 maker +999 male +1000 man +1001 manag +1002 mani +1003 manual +1004 manufactur +1005 map +1006 march +1007 margin +1008 mark +1009 market +1010 marshal +1011 mass +1012 master +1013 match +1014 materi +1015 matter +1016 matthia +1017 mayb +1018 me +1019 mean +1020 measur +1021 mechan +1022 media +1023 medic +1024 meet +1025 member +1026 membership +1027 memori +1028 men +1029 mention +1030 menu +1031 merchant +1032 messag +1033 method +1034 mh +1035 michael +1036 microsoft +1037 middl +1038 might +1039 mike +1040 mile +1041 militari +1042 million +1043 mime +1044 mind +1045 mine +1046 mini +1047 minimum +1048 minut +1049 miss +1050 mistak +1051 mobil +1052 mode +1053 model +1054 modem +1055 modifi +1056 modul +1057 moment +1058 mon +1059 mondai +1060 monei +1061 monitor +1062 month +1063 monthli +1064 more +1065 morn +1066 mortgag +1067 most +1068 mostli +1069 mother +1070 motiv +1071 move +1072 movi +1073 mpnumber +1074 mr +1075 ms +1076 msg +1077 much +1078 multi +1079 multipart +1080 multipl +1081 murphi +1082 music +1083 must +1084 my +1085 myself +1086 name +1087 
nation +1088 natur +1089 nbsp +1090 near +1091 nearli +1092 necessari +1093 need +1094 neg +1095 net +1096 netscap +1097 network +1098 never +1099 new +1100 newslett +1101 next +1102 nextpart +1103 nice +1104 nigeria +1105 night +1106 no +1107 nobodi +1108 non +1109 none +1110 nor +1111 normal +1112 north +1113 not +1114 note +1115 noth +1116 notic +1117 now +1118 nt +1119 null +1120 number +1121 numbera +1122 numberam +1123 numberanumb +1124 numberb +1125 numberbit +1126 numberc +1127 numbercb +1128 numbercbr +1129 numbercfont +1130 numbercli +1131 numbercnumb +1132 numbercp +1133 numberctd +1134 numberd +1135 numberdari +1136 numberdnumb +1137 numberenumb +1138 numberf +1139 numberfb +1140 numberff +1141 numberffont +1142 numberfp +1143 numberftd +1144 numberk +1145 numberm +1146 numbermb +1147 numberp +1148 numberpd +1149 numberpm +1150 numberpx +1151 numberst +1152 numberth +1153 numbertnumb +1154 numberx +1155 object +1156 oblig +1157 obtain +1158 obvious +1159 occur +1160 oct +1161 octob +1162 of +1163 off +1164 offer +1165 offic +1166 offici +1167 often +1168 oh +1169 ok +1170 old +1171 on +1172 onc +1173 onli +1174 onlin +1175 open +1176 oper +1177 opinion +1178 opportun +1179 opt +1180 optim +1181 option +1182 or +1183 order +1184 org +1185 organ +1186 origin +1187 os +1188 osdn +1189 other +1190 otherwis +1191 our +1192 out +1193 outlook +1194 output +1195 outsid +1196 over +1197 own +1198 owner +1199 oz +1200 pacif +1201 pack +1202 packag +1203 page +1204 pai +1205 paid +1206 pain +1207 palm +1208 panel +1209 paper +1210 paragraph +1211 parent +1212 part +1213 parti +1214 particip +1215 particular +1216 particularli +1217 partit +1218 partner +1219 pass +1220 password +1221 past +1222 patch +1223 patent +1224 path +1225 pattern +1226 paul +1227 payment +1228 pc +1229 peac +1230 peopl +1231 per +1232 percent +1233 percentag +1234 perfect +1235 perfectli +1236 perform +1237 perhap +1238 period +1239 perl +1240 perman +1241 permiss +1242 person +1243 pgp 
+1244 phone +1245 photo +1246 php +1247 phrase +1248 physic +1249 pick +1250 pictur +1251 piec +1252 piiiiiiii +1253 pipe +1254 pjnumber +1255 place +1256 plai +1257 plain +1258 plan +1259 planet +1260 plant +1261 planta +1262 platform +1263 player +1264 pleas +1265 plu +1266 plug +1267 pm +1268 pocket +1269 point +1270 polic +1271 polici +1272 polit +1273 poor +1274 pop +1275 popul +1276 popular +1277 port +1278 posit +1279 possibl +1280 post +1281 potenti +1282 pound +1283 powel +1284 power +1285 powershot +1286 practic +1287 pre +1288 predict +1289 prefer +1290 premium +1291 prepar +1292 present +1293 presid +1294 press +1295 pretti +1296 prevent +1297 previou +1298 previous +1299 price +1300 principl +1301 print +1302 printabl +1303 printer +1304 privaci +1305 privat +1306 prize +1307 pro +1308 probabl +1309 problem +1310 procedur +1311 process +1312 processor +1313 procmail +1314 produc +1315 product +1316 profession +1317 profil +1318 profit +1319 program +1320 programm +1321 progress +1322 project +1323 promis +1324 promot +1325 prompt +1326 properti +1327 propos +1328 proprietari +1329 prospect +1330 protect +1331 protocol +1332 prove +1333 proven +1334 provid +1335 proxi +1336 pub +1337 public +1338 publish +1339 pudg +1340 pull +1341 purchas +1342 purpos +1343 put +1344 python +1345 qnumber +1346 qualifi +1347 qualiti +1348 quarter +1349 question +1350 quick +1351 quickli +1352 quit +1353 quot +1354 radio +1355 ragga +1356 rais +1357 random +1358 rang +1359 rate +1360 rather +1361 ratio +1362 razor +1363 razornumb +1364 re +1365 reach +1366 read +1367 reader +1368 readi +1369 real +1370 realiz +1371 realli +1372 reason +1373 receiv +1374 recent +1375 recipi +1376 recommend +1377 record +1378 red +1379 redhat +1380 reduc +1381 refer +1382 refin +1383 reg +1384 regard +1385 region +1386 regist +1387 regul +1388 regular +1389 rel +1390 relat +1391 relationship +1392 releas +1393 relev +1394 reliabl +1395 remain +1396 rememb +1397 remot +1398 remov +1399 
replac +1400 repli +1401 report +1402 repositori +1403 repres +1404 republ +1405 request +1406 requir +1407 research +1408 reserv +1409 resid +1410 resourc +1411 respect +1412 respond +1413 respons +1414 rest +1415 result +1416 retail +1417 return +1418 reveal +1419 revenu +1420 revers +1421 review +1422 revok +1423 rh +1424 rich +1425 right +1426 risk +1427 road +1428 robert +1429 rock +1430 role +1431 roll +1432 rom +1433 roman +1434 room +1435 root +1436 round +1437 rpm +1438 rss +1439 rule +1440 run +1441 sa +1442 safe +1443 sai +1444 said +1445 sale +1446 same +1447 sampl +1448 san +1449 saou +1450 sat +1451 satellit +1452 save +1453 saw +1454 scan +1455 schedul +1456 school +1457 scienc +1458 score +1459 screen +1460 script +1461 se +1462 search +1463 season +1464 second +1465 secret +1466 section +1467 secur +1468 see +1469 seed +1470 seek +1471 seem +1472 seen +1473 select +1474 self +1475 sell +1476 seminar +1477 send +1478 sender +1479 sendmail +1480 senior +1481 sens +1482 sensit +1483 sent +1484 sep +1485 separ +1486 septemb +1487 sequenc +1488 seri +1489 serif +1490 seriou +1491 serv +1492 server +1493 servic +1494 set +1495 setup +1496 seven +1497 seventh +1498 sever +1499 sex +1500 sexual +1501 sf +1502 shape +1503 share +1504 she +1505 shell +1506 ship +1507 shop +1508 short +1509 shot +1510 should +1511 show +1512 side +1513 sign +1514 signatur +1515 signific +1516 similar +1517 simpl +1518 simpli +1519 sinc +1520 sincer +1521 singl +1522 sit +1523 site +1524 situat +1525 six +1526 size +1527 skeptic +1528 skill +1529 skin +1530 skip +1531 sleep +1532 slow +1533 small +1534 smart +1535 smoke +1536 smtp +1537 snumber +1538 so +1539 social +1540 societi +1541 softwar +1542 sold +1543 solut +1544 solv +1545 some +1546 someon +1547 someth +1548 sometim +1549 son +1550 song +1551 soni +1552 soon +1553 sorri +1554 sort +1555 sound +1556 sourc +1557 south +1558 space +1559 spain +1560 spam +1561 spamassassin +1562 spamd +1563 spammer +1564 speak +1565 
spec +1566 special +1567 specif +1568 specifi +1569 speech +1570 speed +1571 spend +1572 sponsor +1573 sport +1574 spot +1575 src +1576 ssh +1577 st +1578 stabl +1579 staff +1580 stai +1581 stand +1582 standard +1583 star +1584 start +1585 state +1586 statement +1587 statu +1588 step +1589 steve +1590 still +1591 stock +1592 stop +1593 storag +1594 store +1595 stori +1596 strategi +1597 stream +1598 street +1599 string +1600 strip +1601 strong +1602 structur +1603 studi +1604 stuff +1605 stupid +1606 style +1607 subject +1608 submit +1609 subscrib +1610 subscript +1611 substanti +1612 success +1613 such +1614 suffer +1615 suggest +1616 suit +1617 sum +1618 summari +1619 summer +1620 sun +1621 super +1622 suppli +1623 support +1624 suppos +1625 sure +1626 surpris +1627 suse +1628 suspect +1629 sweet +1630 switch +1631 system +1632 tab +1633 tabl +1634 tablet +1635 tag +1636 take +1637 taken +1638 talk +1639 tape +1640 target +1641 task +1642 tax +1643 teach +1644 team +1645 tech +1646 technic +1647 techniqu +1648 technolog +1649 tel +1650 telecom +1651 telephon +1652 tell +1653 temperatur +1654 templ +1655 ten +1656 term +1657 termin +1658 terror +1659 terrorist +1660 test +1661 texa +1662 text +1663 than +1664 thank +1665 that +1666 the +1667 thei +1668 their +1669 them +1670 themselv +1671 then +1672 theori +1673 there +1674 therefor +1675 these +1676 thi +1677 thing +1678 think +1679 thinkgeek +1680 third +1681 those +1682 though +1683 thought +1684 thousand +1685 thread +1686 threat +1687 three +1688 through +1689 thu +1690 thursdai +1691 ti +1692 ticket +1693 tim +1694 time +1695 tip +1696 tire +1697 titl +1698 tm +1699 to +1700 todai +1701 togeth +1702 token +1703 told +1704 toll +1705 tom +1706 toner +1707 toni +1708 too +1709 took +1710 tool +1711 top +1712 topic +1713 total +1714 touch +1715 toward +1716 track +1717 trade +1718 tradit +1719 traffic +1720 train +1721 transact +1722 transfer +1723 travel +1724 treat +1725 tree +1726 tri +1727 trial +1728 
trick +1729 trip +1730 troubl +1731 true +1732 truli +1733 trust +1734 truth +1735 try +1736 tue +1737 tuesdai +1738 turn +1739 tv +1740 two +1741 type +1742 uk +1743 ultim +1744 un +1745 under +1746 understand +1747 unfortun +1748 uniqu +1749 unison +1750 unit +1751 univers +1752 unix +1753 unless +1754 unlik +1755 unlimit +1756 unseen +1757 unsolicit +1758 unsubscrib +1759 until +1760 up +1761 updat +1762 upgrad +1763 upon +1764 urgent +1765 url +1766 us +1767 usa +1768 usag +1769 usb +1770 usd +1771 usdollarnumb +1772 useless +1773 user +1774 usr +1775 usual +1776 util +1777 vacat +1778 valid +1779 valu +1780 valuabl +1781 var +1782 variabl +1783 varieti +1784 variou +1785 ve +1786 vendor +1787 ventur +1788 veri +1789 verifi +1790 version +1791 via +1792 video +1793 view +1794 virtual +1795 visa +1796 visit +1797 visual +1798 vnumber +1799 voic +1800 vote +1801 vs +1802 vulner +1803 wa +1804 wai +1805 wait +1806 wake +1807 walk +1808 wall +1809 want +1810 war +1811 warm +1812 warn +1813 warranti +1814 washington +1815 wasn +1816 wast +1817 watch +1818 water +1819 we +1820 wealth +1821 weapon +1822 web +1823 weblog +1824 websit +1825 wed +1826 wednesdai +1827 week +1828 weekli +1829 weight +1830 welcom +1831 well +1832 went +1833 were +1834 west +1835 what +1836 whatev +1837 when +1838 where +1839 whether +1840 which +1841 while +1842 white +1843 whitelist +1844 who +1845 whole +1846 whose +1847 why +1848 wi +1849 wide +1850 width +1851 wife +1852 will +1853 william +1854 win +1855 window +1856 wing +1857 winner +1858 wireless +1859 wish +1860 with +1861 within +1862 without +1863 wnumberp +1864 woman +1865 women +1866 won +1867 wonder +1868 word +1869 work +1870 worker +1871 world +1872 worldwid +1873 worri +1874 worst +1875 worth +1876 would +1877 wouldn +1878 write +1879 written +1880 wrong +1881 wrote +1882 www +1883 ximian +1884 xml +1885 xp +1886 yahoo +1887 ye +1888 yeah +1889 year +1890 yesterdai +1891 yet +1892 york +1893 you +1894 young +1895 your +1896 
yourself +1897 zdnet +1898 zero +1899 zip