Posts

Showing posts from November, 2017

Implementing Python Scraper Day One

Live Decision Making:

Not going to do aggregate values for series, because not all films will be in a series and it will be difficult to differentiate. Only going to use the number in the series.
Need to remember that I can only use features that will be available for a movie that hasn't been released, so film rating is out.
Need to make film MPAA rating a number and not a string (gonna do this in R because I will have collected all possible ratings).
There isn't an easy way to find the gender of an Actor, so that will be left out.
...going back and forth from IMDb to Box Office Mojo is a pain :(
Hmm... I may scrape BOM Alphabetical instead and run searches on IMDb to find the film, rather than go IMDb to BOM.

Process would then be:
Start scraping films from BOM and appending Film objects (__init__ with a mojo_id) to Queue1.
Loop pulls from Queue1 and tries to find the IMDb page.
If that fails, the film is ignored.
Else, set imdb_id and append to Queue2.
Start scraping non-a...
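Here is a rough sketch of the Queue1 to Queue2 hand-off described above, just to pin the idea down. Everything beyond `Film`, `mojo_id` and `imdb_id` is made up for illustration: `match_films`, `FAKE_IMDB_INDEX` and the placeholder IDs are assumptions, and the dictionary lookup stands in for a real IMDb search.

```python
import queue
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Film:
    """Minimal stand-in for the Film object; the real class would hold far more."""
    mojo_id: str
    imdb_id: Optional[str] = None


def match_films(found_on_bom, matched, find_imdb_id: Callable[[str], Optional[str]]) -> int:
    """Drain Queue1, look each film up on IMDb, and pass matches on to Queue2.

    Returns how many films were ignored because no IMDb page was found.
    """
    ignored = 0
    while True:
        try:
            film = found_on_bom.get_nowait()
        except queue.Empty:
            break                      # Queue1 is empty, nothing left to match
        imdb_id = find_imdb_id(film.mojo_id)
        if imdb_id is None:
            ignored += 1               # no IMDb match: the film is dropped
        else:
            film.imdb_id = imdb_id
            matched.put(film)
    return ignored


# Hypothetical lookup table standing in for a real IMDb search (placeholder IDs).
FAKE_IMDB_INDEX = {"film_a": "tt0000001", "film_c": "tt0000003"}

queue1 = queue.Queue()
queue2 = queue.Queue()
for mojo_id in ["film_a", "film_b", "film_c"]:
    queue1.put(Film(mojo_id=mojo_id))

ignored_count = match_films(queue1, queue2, FAKE_IMDB_INDEX.get)

matched_films = []
while not queue2.empty():
    matched_films.append(queue2.get())
```

In the real scraper the BOM crawl would presumably keep feeding Queue1 while this loop runs, so the matching step would likely live in its own thread instead of draining the queue once.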

R Brain Dump

My initial thoughts on the R portion of this assignment are that it should be pretty straightforward. I'll import the Film collection from Mongo into R and turn it into a data.frame (all cleaning of NA values will have been handled by the Python portion). I'll then try backward fitting a model. To test it, I will do a couple of random splits to create test and training data and then compare the results. I will also try to take a real upcoming film, predict its revenue, then wait and see. Once I get a model working, I will train it on all of the data and then create a function to predict. At that point, I will have to find a way to have Python call this function and return a result, but I will cross that bridge when I make the Python interface.
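To make the random-split idea concrete, here is a small sketch. It is written in Python rather than R, and it fits a single-predictor line instead of a multiple regression, so treat it as an illustration of the testing loop only. The helper names (`fit_line`, `rmse_on_random_split`) and the toy rows are invented for this example and are not real film data.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares for one predictor: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rmse_on_random_split(rows, test_fraction=0.25, seed=0):
    """Shuffle rows, hold out a test set, fit on the rest, return the test RMSE."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    squared_errors = [(y - (intercept + slope * x)) ** 2 for x, y in test]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Synthetic (budget, revenue) rows lying on an exact line. NOT real film data.
toy_rows = [(float(b), 2.0 * b + 1.0) for b in range(1, 21)]

# A couple of different random splits, one error figure per split.
split_errors = [rmse_on_random_split(toy_rows, seed=s) for s in range(3)]
```

Because the toy rows sit exactly on a line, every split gives an error of essentially zero. With real data the errors would differ from split to split, and that spread is the thing to compare.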

Python Scraper Brain Dump

Here's what I gotta do. I need to create a web scraper that crawls IMDb to get films to be used in the model. I'm probably going to start with this URL (IMDb Most Popular), followed by the lists sorted by IMDb rating and Number of Votes. The scraper will store a set of seen movie IDs to avoid repeats. Once the three initial lists are parsed, the IDs will be put into a Queue so the program can be multi-threaded. The scraper will then take an item from the queue and start pulling out the needed data. I plan on creating a Python class called Film that knows how to gather information from IMDb. Pulling out fields like budget, runtime, etc. will be pretty straightforward. It's actors that will be the doozy. If an actor has already been processed (check a set of actor IDs), then that actor is skipped. If an actor hasn't been seen, I'll create an Actor class to handle scraping that info. Any film an actor has been in will be added to the Queue of films to...
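A minimal sketch of the seen-set plus Queue idea with worker threads, assuming nothing about the real IMDb pages. `FAKE_GRAPH`, `related_films` and `crawl` are hypothetical names; the dictionary stands in for "films reachable through an actor's filmography", and no network requests are made.

```python
import queue
import threading

# Hypothetical stand-in for IMDb: film ID -> other films its actors appeared in.
FAKE_GRAPH = {
    "f1": ["f2", "f3"],
    "f2": ["f1", "f4"],
    "f3": [],
    "f4": ["f2"],
}


def related_films(film_id):
    """Placeholder for the real scrape; returns films linked through shared actors."""
    return FAKE_GRAPH.get(film_id, [])


def crawl(seed_ids, num_workers=4):
    seen = set()
    seen_lock = threading.Lock()
    todo = queue.Queue()

    def enqueue(film_id):
        # Check-and-add under the lock so two threads cannot queue the same ID.
        with seen_lock:
            if film_id in seen:
                return
            seen.add(film_id)
        todo.put(film_id)

    def worker():
        while True:
            film_id = todo.get()
            if film_id is None:        # sentinel: time to stop
                todo.task_done()
                return
            try:
                for other in related_films(film_id):
                    enqueue(other)
            finally:
                todo.task_done()

    for film_id in seed_ids:
        enqueue(film_id)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()

    todo.join()                        # blocks until every queued film is processed
    for _ in threads:
        todo.put(None)                 # one sentinel per worker
    for t in threads:
        t.join()
    return seen


crawled = crawl(["f1"])
```

The lock around the seen set matters once there is more than one thread: without it, two workers could both decide a film is new and queue it twice.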

Ready, Set, Go!

It's nearing the end of the semester, which means that it is now time to start the final project for DS4100. This will be post numero uno in a multi-part blog series following my struggle. To kick things off, here is the rough outline for my project.

Predict the Gross Revenue for an Upcoming Film

Multiple regression model that takes an IMDb page for an upcoming movie and predicts financial performance. The data is on a couple of different websites, each with its own method of searching, and they do not have API access. I was going to use Python to scrape and parse the HTML and to put it into Mongo. I plan on having two Collections: Films and Actors. Then I plan on using R to pull the information into a data frame and to generate a model. I will use Python to create an interface between the user and the R model.

Data is from IMDb and Box Office Mojo:
Movie rating (critics + people)
Number of votes
Length of film
Film MPAA rating
Film Budget
Known movies rev...