Python Scraper Brain Dump
Here's what I gotta do. I need to create a web scraper that crawls IMDb to get films to be used in the model. I'm probably going to start with this URL (IMDb Most Popular) followed by the lists sorted by IMDb rating and Number of Votes.
The scraper will store a set of seen movie ID's to avoid repeats. Once the three initial lists are parsed, the ID's will be put into a Queue so the program can be multi-threaded. The scraper will then take an item from the queue and start pulling out the needed data. I plan on creating a Python class called Film that knows how to gather information from IMDb. Pulling out fields like budget, runtime etc. will be pretty straightforward. It's actors that will be the doozy.
If an actor has already been processed (check a set of actor ID's), then that actor is skipped. If an actor hasn't been seen, I'll create an Actor class to handle scraping that info. Any film an actor has been in will be added to the Queue of films to be scraped. I'll probably set a limit to 10,000 films or so. <- I will have to be careful about a hard limit to avoid referencing a Film from an actor object that doesn't exist. I'll likely have to create a special class around a set so it can be locked when an item is checked and/or added.
Some issues with actor scraping is I need to calculate summary stats like gross income for films that they've been in before, and not including, this film. The best way to do this would probably be to store references to Film objects and create a method that given a Film ID, it will aggregate information from before that film. Might be a good idea to store that information in an array and stop when the given Film ID is hit. This extra step is necessary because an actor may not be as famous during one film as they are in another. For example, DiCaprio in Titanic compared to the Revenant. This won't take into account screen time, so cameos will throw it off a little bit.
Another difficult thing about actors is, if age is an important factor, I will have to calculate the age of the actor when the film was released. I'll probably use the same process that I plan on using for calculating actor popularity.
At the moment, I'm torn on if I should treat a Director as an Actor, make a Director class, or ignore the Director.
I've also gotta remember that if I want to use the revenue of a series as a feature, I need to only look at films released before the one being scraped in the series. Because there isn't an easy way to link IMDb to films listed on Box Office Mojo, I'll have to come up with some funky way to do this calculation.
At this point in time, I will have a collection of Films with info such as release date, budget, revenue, and an array of Actor ID's. I will also have a collection of Actors that have basic info like birth date, number of films starred in, and an array of Film ID's. Assuming all of this works as expected, I can now calculate aggregate Actor information for each Film (avg. actor age, avg. actor film revenue and so on).
Once I get my hands on the data, I'll decide which Films or Actors to drop based on features that they're missing.
I will then save all of this info in a MongoDB database. There will be two (possibly three) collections: actors, films, (directors).
Fingers crossed all of that works. If it does, I'll start working on the R portion of the project followed by the Python to R interface.
The scraper will store a set of seen movie ID's to avoid repeats. Once the three initial lists are parsed, the ID's will be put into a Queue so the program can be multi-threaded. The scraper will then take an item from the queue and start pulling out the needed data. I plan on creating a Python class called Film that knows how to gather information from IMDb. Pulling out fields like budget, runtime etc. will be pretty straightforward. It's actors that will be the doozy.
If an actor has already been processed (check a set of actor ID's), then that actor is skipped. If an actor hasn't been seen, I'll create an Actor class to handle scraping that info. Any film an actor has been in will be added to the Queue of films to be scraped. I'll probably set a limit to 10,000 films or so. <- I will have to be careful about a hard limit to avoid referencing a Film from an actor object that doesn't exist. I'll likely have to create a special class around a set so it can be locked when an item is checked and/or added.
Some issues with actor scraping is I need to calculate summary stats like gross income for films that they've been in before, and not including, this film. The best way to do this would probably be to store references to Film objects and create a method that given a Film ID, it will aggregate information from before that film. Might be a good idea to store that information in an array and stop when the given Film ID is hit. This extra step is necessary because an actor may not be as famous during one film as they are in another. For example, DiCaprio in Titanic compared to the Revenant. This won't take into account screen time, so cameos will throw it off a little bit.
Another difficult thing about actors is, if age is an important factor, I will have to calculate the age of the actor when the film was released. I'll probably use the same process that I plan on using for calculating actor popularity.
At the moment, I'm torn on if I should treat a Director as an Actor, make a Director class, or ignore the Director.
I've also gotta remember that if I want to use the revenue of a series as a feature, I need to only look at films released before the one being scraped in the series. Because there isn't an easy way to link IMDb to films listed on Box Office Mojo, I'll have to come up with some funky way to do this calculation.
At this point in time, I will have a collection of Films with info such as release date, budget, revenue, and an array of Actor ID's. I will also have a collection of Actors that have basic info like birth date, number of films starred in, and an array of Film ID's. Assuming all of this works as expected, I can now calculate aggregate Actor information for each Film (avg. actor age, avg. actor film revenue and so on).
Once I get my hands on the data, I'll decide which Films or Actors to drop based on features that they're missing.
I will then save all of this info in a MongoDB database. There will be two (possibly three) collections: actors, films, (directors).
Fingers crossed all of that works. If it does, I'll start working on the R portion of the project followed by the Python to R interface.
Comments
Post a Comment