Problem: I was planning my trip to Florida and looking for fun things to do in Orlando and Miami (“adventure” activities like jet ski rentals, kayaking, and go karting). I like saving money, so I subscribed to Groupon, Livingsocial, and Google Offers for those cities. Those sites then promptly flooded my inbox with deals for gym memberships, in-ear headphones, and anti-cellulite treatments. Not useful. Going to each site and specifying my deal preferences took a while. Plus, if I found a deal that I liked, I had to copy-paste its link into another document for future reference (in case I wanted to buy it later). Too many steps, too much hassle, unhappy email inbox.
Solution: I wanted to build a site that automatically scraped the fun/adventure deals from these deal sites. Example use case: someone planning to visit a new city (e.g. Los Angeles) could just visit the site and see at a glance a list of the currently active adventure deals (e.g. scuba diving) in that city. Sure, aggregator sites like Yipit seem to solve this, but almost all of them require users to hand over an email address before showing any deals (and most are difficult to navigate). More unnecessary steps for the user. Plus, I found that the Yipit deals weren’t the same as the ones displayed on the actual Groupon/Livingsocial/Google Offers sites.
“pre” minimum viable product: I gathered feedback for my idea to see if other people besides me would actually use it. This time, I just made a few quick posts on reddit (in the city subreddits), and got many comments. People said they would use it. Next.
MVP: The site I built scrapes only Livingsocial. Groupon generates its pages dynamically with AJAX, so it can’t be scraped without a JavaScript engine, which is a big PITA to set up. Google Offers didn’t have many quality deals, so I simplified by making the MVP Livingsocial-only for now.
Applying the Naive Bayes classifier
Once the deals are scraped, they need to be classified as “adventure” or not. Obviously, doing this by hand doesn’t scale if I want to scrape deals for more than a couple of cities. So I implemented the Naive Bayes classifier. Naive Bayes is often used for authorship attribution, e.g. finding out whether Madison or Hamilton wrote certain unidentified essays in the Federalist Papers.
At a high level, Naive Bayes treats each “document” or block of text as a “bag of words”, meaning that it doesn’t care about the order of the words. The “naive” part is the assumption that, within a category, each word appears independently of the others. When given a new “document” to classify, Naive Bayes asks and answers the question, “for each category, what is the probability that this new document belongs to it?” The category with the highest probability is then the category that Naive Bayes has “predicted” for the new document.
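To make the bag-of-words idea concrete, here is a minimal sketch of a Naive Bayes classifier in Python. It is not the site's actual implementation: the class name, the whitespace tokenizer, the add-one smoothing, and the six training headlines are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; word order plays no role afterwards."""
    return text.lower().split()


class NaiveBayes:
    def __init__(self):
        self.doc_counts = Counter()              # documents seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()                       # every distinct word seen in training

    def train(self, text, category):
        self.doc_counts[category] += 1
        for word in tokenize(text):
            self.word_counts[category][word] += 1
            self.vocab.add(word)

    def log_scores(self, text):
        """Return log P(category) + sum of log P(word | category) for each category."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for category, n_docs in self.doc_counts.items():
            counts = self.word_counts[category]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)
            for word in words:
                # Add-one (Laplace) smoothing: an unseen word must not zero out the product.
                score += math.log((counts[word] + 1) / (total_words + len(self.vocab)))
            scores[category] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Made-up training headlines, for illustration only.
nb = NaiveBayes()
nb.train("jet ski rental for two", "adventure")
nb.train("guided kayaking tour", "adventure")
nb.train("go kart racing session", "adventure")
nb.train("five women's fitness classes", "other")
nb.train("chimney flue sweep", "other")
nb.train("anti-cellulite treatment package", "other")

print(nb.predict("kayaking and jet ski combo"))  # prints: adventure
```

Working with sums of logarithms rather than products of probabilities avoids numeric underflow when a document has many words.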
The site currently uses the deal “headline” (e.g. “Five Women’s Fitness Classes” or “Chimney Flue Sweep”) as the document text that Naive Bayes uses. I also tried using the actual deal description (i.e. the paragraph or two of text that Livingsocial writes to describe the deal), and from eyeballing the predictions, it looked like both gave similar prediction accuracy. Using the deal headline is a lot faster though.
Prediction accuracy is still pretty bad, so I didn’t want Naive Bayes to assign its predicted categories to the deals automatically. Instead, I decided to keep categorizing the deals manually, with Naive Bayes’s recommendations as a guide. I also decided to make its binary classification decisions more “fuzzy”. Here’s a screenshot of the admin page, which shows the predicted deal type of each scraped deal along with a column called “prediction confidence”, a score derived from the Naive Bayes output that signifies how strong the prediction is.
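The post doesn't say exactly how the “prediction confidence” score is computed. One common way to get such a score, sketched below as an assumption and not as the site's actual formula, is to normalize the per-category log scores from Naive Bayes so that they sum to 1. The function name and the example scores are hypothetical.

```python
import math


def prediction_confidence(log_scores):
    """Turn per-category log scores into (best_category, confidence between 0 and 1)."""
    # Subtract the largest score before exponentiating so very negative logs don't underflow.
    top = max(log_scores.values())
    exp_scores = {cat: math.exp(s - top) for cat, s in log_scores.items()}
    total = sum(exp_scores.values())
    best = max(exp_scores, key=exp_scores.get)
    return best, exp_scores[best] / total


# Hypothetical log scores for one deal headline.
scores = {"adventure": -16.2, "other": -18.0}
category, confidence = prediction_confidence(scores)
print(category, round(confidence, 2))
```

With two categories, a confidence near 0.5 means the classifier is close to guessing, so those are the deals most worth a careful manual look.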
No better way to learn than to do
Doing is the best way to learn, because working on your own projects forces you to engage in deliberate practice (Cal Newport’s key to living a remarkable life). Not only do you practice your skills, but you also learn about learning: when faced with an obstacle on a personally initiated project, you have only yourself and your own resourcefulness, with no boss telling you what to do or professor giving guidelines. For example, this time my requests kept timing out in production on Heroku, since Heroku has a max request time of 30 seconds and some of my requests were taking up to a few minutes (back when my Naive Bayes implementation was inefficient). I googled my problem, found a Stack Overflow post, and learned about worker queues and the Ruby library delayed_job, which fixed my problem by letting the more time-intensive requests run in the background.
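The worker-queue pattern itself is simple: the web request only enqueues the slow work and returns right away, while a separate worker picks the job up later. Here is a rough sketch of that idea using only Python's standard library. It is not delayed_job (which stores its jobs in a database table and runs them in a separate worker process), and every function name in it is a hypothetical stand-in.

```python
import queue
import threading
import time

jobs = queue.Queue()
results = {}


def classify_deals(city):
    """Hypothetical stand-in for a slow job, e.g. scraping and classifying a city's deals."""
    time.sleep(0.1)  # pretend this takes minutes
    return f"classified deals for {city}"


def worker():
    # Pull jobs off the queue forever; each one runs outside the request cycle.
    while True:
        job_id, func, args = jobs.get()
        try:
            results[job_id] = func(*args)
        finally:
            jobs.task_done()


threading.Thread(target=worker, daemon=True).start()


def handle_request(job_id, city):
    """Enqueue the slow work and return immediately, well within any request timeout."""
    jobs.put((job_id, classify_deals, (city,)))
    return "queued"


print(handle_request(1, "Orlando"))  # returns right away
jobs.join()                          # for the demo only: wait for the worker to finish
print(results[1])
```

An in-memory queue like this loses its jobs if the process restarts, which is one reason a library like delayed_job persists them instead.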
The site is at https://adrenalinejunkie.herokuapp.com/