Posts

Video Demo General Script

Video Demo Rough Transcript

- Introduction to the Problem
- Gather Film and Actor data from IMDb and Box Office Mojo
- Following Features:
  - FILM SPECIFIC: weekday (1-7), day, month, budget, length, mpaa_num (converted from strings to ints)
  - ACTOR SPECIFIC: avg_actor_age, max_actor_film_revenue, avg_actor_film_revenue, max_actor_film_votes, avg_actor_film_votes, max_actor_film_stars, avg_actor_film_stars, max_actor_film_appearances, avg_actor_film_appearances, max_actor_film_metascore, avg_actor_film_metascore
  - DIRECTOR SPECIFIC: director_age, director_number_of_films, max_director_film_revenue, avg_director_film_revenue, max_director_film_votes, avg_director_film_votes, max_director_film_stars, avg_director_film_stars, max_director_film_metascore, avg_director_film_metascore
- Use Python to scrape the data
  - Broken into two steps:
    - Scraping
    - Aggregation
- Use MongoDB to save the objects (films and actors)
- Use R to build the multiple regression model
- Scraping...
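The max/avg actor features in the outline above are aggregates over each cast member's films. Here is a minimal sketch of one plausible reading of that aggregation step, assuming each actor is reduced to a list of film revenues and that the feature is computed over per-actor averages. The data layout and the name `aggregate_actor_revenue` are hypothetical, not taken from the project.

```python
from statistics import mean


def aggregate_actor_revenue(actor_revenues):
    """Return (max, avg) over each actor's average film revenue.

    actor_revenues: one list of film revenues per actor in the cast.
    Actors with no known revenues are skipped (an assumption).
    """
    per_actor = [mean(revenues) for revenues in actor_revenues if revenues]
    if not per_actor:
        return None, None
    return max(per_actor), mean(per_actor)


# Three cast members; the last one has no revenue data.
cast = [[100.0, 300.0], [50.0], []]
max_rev, avg_rev = aggregate_actor_revenue(cast)
print(max_rev, avg_rev)  # 200.0 125.0
```

The same shape would apply to votes, stars, and metascore by swapping the per-film value being averaged.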

Improvements for Future Updates

A list of improvements that I update as I find mistakes or make them on purpose.

- True multiprocessing (not just Python's threads, which don't really run in parallel)
- Don't use the requests package, as it doesn't do concurrency on its own
- Find a way to get screen time for actors and weight actors accordingly
- Better solution than a high-level try-except in QueueConsumer for error handling (I did this so I could sleep peacefully knowing that this will continue to run overnight)
- Use IMDb's advanced search instead of the regular one to get better results
- Exclude Actors by potentially checking profile photo first to avoid querying each page and failing
- More testing to guarantee that all scraped values will be correct
- Make two models depending on the budget
  - I ended up making 3. It helped a little bit, but not by much. See the more worthwhile improvement in the bullet below.
- Scrape films that don't have revenue information because they are still importan...

Implementing Python Scraper Day Three

This is (hopefully) the home stretch. What I'm going to accomplish today:

- Switch to running one at a time
- Save Films and Actors before running aggregate commands
- Make the R data model

Here we go...

(12/4/2017 2:18pm): I'm an idiot. My consumers didn't break (at least I don't think so). I never started a while loop, so only one item was consumed and the function exited.

Some errors that appeared:

- Dates on BOM didn't match IMDb, which I'm using to map IDs
  - Fixed! Added a year tolerance that checks for multiple possible years
- Some actors only have a year for birthday
  - Fixed! Set birthdate to 1-1-year
- Budgets for films won't be in number format
  - Not the actual issue. Looks like the BOM -> IMDb mapping is messing up
  - It was messing up because I checked for all years around the BOM year but not for the BOM year itself...smh
  - Once I fixed ^ the budget thing was an issue again. I was looking in the wrong spot for the budget. Should be fixed now....
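The year-tolerance fix described above comes down to generating candidate years around the Box Office Mojo year, with the BOM year itself included. A small sketch, assuming a tolerance of one year and a hypothetical helper name `candidate_years` (the tolerance actually used in the project is not stated):

```python
def candidate_years(bom_year, tolerance=1):
    """Years to try when matching a BOM film on IMDb.

    The BOM year itself comes first, followed by the surrounding
    years ordered by their distance from it.
    """
    years = [bom_year]  # the year that was originally forgotten
    for offset in range(1, tolerance + 1):
        years.extend([bom_year - offset, bom_year + offset])
    return years


print(candidate_years(2017))  # [2017, 2016, 2018]
```

Trying the exact year first means the most likely match is checked before the fallbacks.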

Implementing Python Scraper Day Two

Alright, I'm gonna see if I can get this done in one quick sprint. The goal for today is to finish the code and at least get it running with only a few errors (nothing fatal, obviously).

Write the following:

- Start consumers
- Get a page
- Get the rows of films
- For each row
  - Get all td's (columns)
  - Get needed fields
  - If fields aren't there, skip
  - Create Film and append to the first Queue
- When all Queues are empty:
  - Add all Films to the dictionary
  - Add all Actors to the dictionary
  - For each film, get aggregate fields
  - Save to MongoDB

Live Brainstorm:

- I had to pass self to consume, which gets passed to a thread. Not sure how that'll work...
- Really hoping that I don't run into proxy errors...
- Time to start running the code @ 6:37PM on 12/1. I'm excited to see how much I did incorrectly!
- First thing that I did wrong is that my SetQueue needs to be passed an ID var to put an item in the seen set.
- Looks like I will probably need a bunch of IPs...
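The SetQueue mentioned above can be sketched as a queue that remembers the IDs it has already accepted. This is a guess at its shape, not the project's actual class: the `put_unique` method name and the explicit `item_id` argument are hypothetical, chosen to mirror the note that an ID has to be passed in.

```python
import queue
import threading


class SetQueue(queue.Queue):
    """A FIFO queue that drops items whose ID was already seen."""

    def __init__(self, maxsize=0):
        super().__init__(maxsize)
        self._seen = set()
        self._seen_lock = threading.Lock()  # guards the seen set across threads

    def put_unique(self, item_id, item):
        """Enqueue item unless item_id was seen before; return True if queued."""
        with self._seen_lock:
            if item_id in self._seen:
                return False
            self._seen.add(item_id)
        self.put(item)
        return True


q = SetQueue()
print(q.put_unique("id-1", {"title": "first"}))   # True
print(q.put_unique("id-1", {"title": "repeat"}))  # False
print(q.qsize())                                  # 1
```

Keeping the ID separate from the item lets the queue hold unhashable objects, such as dicts or Film instances, while still deduplicating.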

Implementing Python Scraper Day One

Live Decision Making:

- Not going to do aggregate values for series because not all films will be in a series and it will be difficult to differentiate. Only going to use the number in the series.
- Need to remember that I can only use features that will be available for a movie that hasn't been released, so film rating is out.
- Need to make film MPAA rating a number and not a string (gonna do this in R because I will have collected all possible ratings)
- There isn't an easy way to find the gender of an Actor, so that will be left out
- ...going back and forth from IMDb to Box Office Mojo is a pain :(
- hmm... I may scrape BOM Alphabetical instead and run searches on IMDb to find the film rather than go IMDb to BOM
- Process would then be:
  - Start scraping films from BOM and appending Film objects, initialized via __init__ with a mojo_id, to Queue1
  - Loop pulls from Queue1 and tries to find IMDb page
    - If fails, film is ignored
    - Else, set imdb_id and append to Queue2
  - Start scraping non-a...
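The Queue1 to Queue2 hand-off described above can be sketched with two queues and a worker thread that loops until it sees a sentinel. Everything here is illustrative: `find_imdb_id` is a stand-in backed by a hard-coded dict rather than a real IMDb search, films are plain dicts instead of Film objects, and the IDs are made up.

```python
import queue
import threading

STOP = object()  # sentinel that tells the worker to exit

# Stand-in for an IMDb search; a real version would make an HTTP request.
FAKE_IMDB_INDEX = {"mojo-a": "imdb-1", "mojo-c": "imdb-3"}


def find_imdb_id(mojo_id):
    """Return the matching IMDb ID, or None when no page is found."""
    return FAKE_IMDB_INDEX.get(mojo_id)


def match_worker(queue1, queue2):
    """Pull films from queue1, attach imdb_id, and forward them to queue2."""
    while True:  # keep consuming until the sentinel arrives
        film = queue1.get()
        if film is STOP:
            break
        imdb_id = find_imdb_id(film["mojo_id"])
        if imdb_id is None:
            continue  # lookup failed: the film is ignored
        film["imdb_id"] = imdb_id
        queue2.put(film)


queue1, queue2 = queue.Queue(), queue.Queue()
worker = threading.Thread(target=match_worker, args=(queue1, queue2))
worker.start()

for mojo_id in ["mojo-a", "mojo-b", "mojo-c"]:
    queue1.put({"mojo_id": mojo_id})
queue1.put(STOP)
worker.join()

matched = []
while not queue2.empty():
    matched.append(queue2.get())
print([film["imdb_id"] for film in matched])  # ['imdb-1', 'imdb-3']
```

A worker written without the `while True` loop would process a single film and return, so the loop plus sentinel is what keeps the stage alive until the producer is done.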

R Brain Dump

My initial thoughts on the R portion of this assignment are that it should be pretty straightforward. I'll import the Film collection from Mongo into R and turn it into a data.frame (all cleaning of NA values will have been handled by the Python portion). I'll then try fitting a model by backward elimination. To test it, I will do a couple of random splits to create test and training data and then compare the results. I will also try to take a real upcoming film, predict its revenue, and then wait and see. Once I get a model working, I will train it on all of the data and then create a function to predict. At that point, I will have to find a way to have Python call this function and return a result, but I will cross that bridge when I make the Python interface.

Python Scraper Brain Dump

Here's what I gotta do. I need to create a web scraper that crawls IMDb to get films to be used in the model. I'm probably going to start with this URL (IMDb Most Popular), followed by the lists sorted by IMDb rating and Number of Votes. The scraper will store a set of seen movie IDs to avoid repeats. Once the three initial lists are parsed, the IDs will be put into a Queue so the program can be multi-threaded. The scraper will then take an item from the queue and start pulling out the needed data. I plan on creating a Python class called Film that knows how to gather information from IMDb. Pulling out fields like budget, runtime, etc. will be pretty straightforward. It's actors that will be the doozy. If an actor has already been processed (check a set of actor IDs), then that actor is skipped. If an actor hasn't been seen, I'll create an Actor class to handle scraping that info. Any film an actor has been in will be added to the Queue of films to...