
Showing posts from December, 2017

Video Demo General Script

Video Demo Rough Transcript

- Introduction to the problem
- Gather Film and Actor data from IMDb and Box Office Mojo, with the following features:
  - FILM SPECIFIC: weekday (1-7), day, month, budget, length, mpaa_num (converted from strings to ints)
  - ACTOR SPECIFIC: avg_actor_age, max_actor_film_revenue, avg_actor_film_revenue, max_actor_film_votes, avg_actor_film_votes, max_actor_film_stars, avg_actor_film_stars, max_actor_film_appearances, avg_actor_film_appearances, max_actor_film_metascore, avg_actor_film_metascore
  - DIRECTOR SPECIFIC: director_age, director_number_of_films, max_director_film_revenue, avg_director_film_revenue, max_director_film_votes, avg_director_film_votes, max_director_film_stars, avg_director_film_stars, max_director_film_metascore, avg_director_film_metascore
- Use Python to scrape the data, broken into two steps: scraping and aggregation (see the sketch below)
- Use MongoDB to save the objects (films and actors)
- Use R to build the multiple regression model

Scraping...
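To make the aggregation step concrete, here is a minimal sketch of computing a film's max/avg actor features and saving them back to MongoDB with pymongo. The database name, collection names, and field names (avg_film_revenue, actor_ids, and so on) are assumptions for illustration, not the project's actual schema.

```python
# Minimal sketch of the aggregation step: collapse per-actor stats into
# the max_*/avg_* film features listed above, then save to MongoDB.
# All collection and field names here are assumed for illustration.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["films_db"]  # assumed database name

def aggregate_actor_features(actors):
    """Collapse per-actor stats into max_*/avg_* film features."""
    revenues = [a["avg_film_revenue"] for a in actors]  # assumed field
    votes = [a["avg_film_votes"] for a in actors]       # assumed field
    return {
        "max_actor_film_revenue": max(revenues),
        "avg_actor_film_revenue": sum(revenues) / len(revenues),
        "max_actor_film_votes": max(votes),
        "avg_actor_film_votes": sum(votes) / len(votes),
    }

for film in db.films.find():
    actors = list(db.actors.find({"_id": {"$in": film["actor_ids"]}}))  # assumed link field
    if actors:
        db.films.update_one({"_id": film["_id"]},
                            {"$set": aggregate_actor_features(actors)})
```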

Improvements for Future Updates

A list of improvements I've been trying to update as I find mistakes or purposefully make them.

- True multiprocessing, not just Python's threads, which the GIL keeps from doing any real parallel work (see the sketch after this list)
- Don't use the requests package, as it doesn't do concurrency
- Find a way to get screen time for actors and weight actors accordingly
- A better solution than a high-level try-except in QueueConsumer for error handling (I did this so I could sleep peacefully knowing that it would continue to run overnight)
- Use IMDb's advanced search instead of the regular one to get better results
- Exclude actors by potentially checking the profile photo first, to avoid querying each page and failing
- More testing to guarantee that all scraped values will be correct
- Make two models depending on the budget
  - I ended up making 3. It helped a little bit, but not by much. See the more worthwhile improvement in the bullet below.
- Scrape films that don't have revenue information because they are still importan...
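The first two bullets point at the same underlying fix: CPython's GIL keeps threads from parsing in parallel, and a blocking HTTP client just stacks up wait time. Here is a minimal sketch of one possible alternative, a multiprocessing.Pool that fetches and parses in separate processes; the URLs and the "parsing" are placeholders, and this is one approach rather than what the project actually did.

```python
# Sketch: a process pool so fetching/parsing runs in true parallel,
# unlike CPython threads, which share a single GIL.
from multiprocessing import Pool
import urllib.request

def fetch_and_parse(url):
    """Download a page and return something cheap to pickle."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return url, len(html)  # placeholder for real parsing

if __name__ == "__main__":
    urls = ["https://www.imdb.com/", "https://www.boxofficemojo.com/"]  # placeholders
    with Pool(processes=4) as pool:
        for url, size in pool.imap_unordered(fetch_and_parse, urls):
            print(url, size)
```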

Implementing Python Scraper Day Three

This is (hopefully) the home stretch. What I'm going to accomplish today:

- Switch to running one at a time
- Save Films and Actors before running aggregate commands
- Make the R data model

Here we go...

(12/4/2017 2:18pm): I'm an idiot. My consumers didn't break (I don't think, at least). I never started a while loop, so only one item was consumed and the function exited.

Some errors that appeared:

- Dates on BOM didn't match IMDb, which I'm using to map IDs
  - Fixed! Added a year tolerance that checks for multiple possible years (sketched below)
- Some actors only have a year for a birthday
  - Fixed! Set the birthdate to 1-1-year
- Budgets for films won't be in number format
  - Not the actual issue. Looks like the BOM -> IMDb mapping is messing up
  - It was messing up because I checked for all the years around the BOM year but not the BOM year itself...smh
  - Once I fixed ^, the budget thing was an issue again. I was looking in the wrong spot for the budget. Should be fixed now....
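The year-tolerance fix is worth spelling out, since the bug was exactly that the candidate years skipped the BOM year itself. A minimal sketch, assuming the IMDb side is indexed by (title, year); the key shape and function name are mine, not the project's:

```python
# Sketch of year-tolerant matching between Box Office Mojo and IMDb.
# The bug described above: checking the years around the BOM year but
# never the BOM year itself, so exact matches failed.
def match_imdb_id(title, bom_year, imdb_index, tolerance=1):
    """imdb_index maps (title, year) -> IMDb ID; the shape is assumed."""
    # Check the BOM year first, then +/- tolerance around it.
    offsets = [0] + [o for d in range(1, tolerance + 1) for o in (d, -d)]
    for offset in offsets:
        key = (title, bom_year + offset)
        if key in imdb_index:
            return imdb_index[key]
    return None

index = {("The Big Film", 2017): "tt0000001"}  # toy example
assert match_imdb_id("The Big Film", 2017, index) == "tt0000001"
assert match_imdb_id("The Big Film", 2016, index) == "tt0000001"  # off-by-one year still matches
```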

Implementing Python Scraper Day Two

Alright, I'm gonna see if I can get this done in one quick sprint. The goal for today is to finish the code and at least get it running with only a few errors (nothing fatal, obviously).

Write the following:

- Start consumers
- Get a page
- Get the rows of films
- For each row:
  - Get all td's (columns)
  - Get the needed fields
  - If the fields aren't there, skip
  - Create a Film and append it to the first Queue
- When all Queues are empty:
  - Add all Films to the dictionary
  - Add all Actors to the dictionary
  - For each film, get the aggregate fields
  - Save to MongoDB

Live Brainstorm:

- I had to pass self to consume, which gets passed to a thread. Not sure how that'll work...
- Really hoping that I don't run into proxy errors...

Time to start running the code @ 6:37PM on 12/1. I'm excited to see how much I did incorrectly! The first thing I did wrong: my SetQueue needs to be passed an ID var to put an item in the seen set (a sketch of the idea follows). Looks like I will probably need a bunch of IP's...
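For reference, a minimal sketch of what a SetQueue with an explicit ID could look like: since the queued Film/Actor objects may not be hashable, the seen set tracks IDs instead of the items themselves. This is an illustration of the idea, not the post's actual class.

```python
# Sketch: a Queue that deduplicates by an explicit ID, since the queued
# Film/Actor objects themselves may not be hashable.
import queue
import threading

class SetQueue(queue.Queue):
    """A Queue that skips items whose ID has already been seen."""

    def __init__(self, maxsize=0):
        super().__init__(maxsize)
        self._seen = set()
        self._lock = threading.Lock()

    def put_unique(self, item, item_id):
        """Enqueue item only if item_id has not been seen before."""
        with self._lock:
            if item_id in self._seen:
                return False
            self._seen.add(item_id)
        self.put(item)
        return True

q = SetQueue()
q.put_unique({"title": "The Big Film"}, "tt0000001")
q.put_unique({"title": "The Big Film"}, "tt0000001")  # duplicate, skipped
print(q.qsize())  # 1
```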