Python Scraper Brain Dump
Here's what I need to do. I need to build a web scraper that crawls IMDb to collect the films used in the model. I'm probably going to start with this URL (IMDb Most Popular), followed by the lists sorted by IMDb rating and by number of votes. The scraper will keep a set of seen movie IDs to avoid repeats. Once the three initial lists are parsed, the IDs will be put into a Queue so the program can be multi-threaded. The scraper will then take an ID from the queue and start pulling out the needed data. I plan on creating a Python class called Film that knows how to gather information from IMDb. Pulling out fields like budget and runtime will be pretty straightforward. It's the actors that will be the doozy. If an actor has already been processed (checked against a set of actor IDs), that actor is skipped. If an actor hasn't been seen, I'll create an Actor class to handle scraping that info. Any film an actor has been in will be added to the Queue of films to...
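To make the plan concrete, here's a rough sketch of the queue-plus-seen-sets loop. It doesn't touch IMDb at all: the `FAKE_CAST` and `FAKE_FILMOGRAPHY` dictionaries, the `Crawler` class, and every method name are made up for illustration, and the two `fetch_*` methods stand in for whatever the real Film and Actor classes end up doing. Treat it as one possible shape, not the final design.

```python
import queue
import threading

# Hypothetical stand-ins for IMDb pages (assumed data, not real IDs):
# film ID -> actor IDs in its cast, actor ID -> film IDs they appeared in.
FAKE_CAST = {"tt1": ["nm1", "nm2"], "tt2": ["nm2"], "tt3": ["nm3"]}
FAKE_FILMOGRAPHY = {"nm1": ["tt1", "tt3"], "nm2": ["tt1", "tt2"], "nm3": ["tt3"]}


class Crawler:
    """Illustrative crawler: a film queue plus sets of seen film and actor IDs."""

    def __init__(self, seed_film_ids, num_workers=4):
        self.film_queue = queue.Queue()
        self.seen_films = set()
        self.seen_actors = set()
        self.lock = threading.Lock()  # guards both seen sets
        self.num_workers = num_workers
        for film_id in seed_film_ids:
            self.enqueue_film(film_id)

    def enqueue_film(self, film_id):
        # Only queue a film the first time its ID shows up.
        with self.lock:
            if film_id in self.seen_films:
                return
            self.seen_films.add(film_id)
        self.film_queue.put(film_id)

    def claim_actor(self, actor_id):
        # Return True only for the first thread that sees this actor.
        with self.lock:
            if actor_id in self.seen_actors:
                return False
            self.seen_actors.add(actor_id)
            return True

    def fetch_cast(self, film_id):
        # Placeholder for the Film class scraping a film page.
        return FAKE_CAST.get(film_id, [])

    def fetch_filmography(self, actor_id):
        # Placeholder for the Actor class scraping an actor page.
        return FAKE_FILMOGRAPHY.get(actor_id, [])

    def worker(self):
        while True:
            film_id = self.film_queue.get()
            if film_id is None:  # sentinel: no more work
                self.film_queue.task_done()
                break
            try:
                for actor_id in self.fetch_cast(film_id):
                    if self.claim_actor(actor_id):
                        for other_film in self.fetch_filmography(actor_id):
                            self.enqueue_film(other_film)
            finally:
                self.film_queue.task_done()

    def run(self):
        threads = [threading.Thread(target=self.worker)
                   for _ in range(self.num_workers)]
        for t in threads:
            t.start()
        self.film_queue.join()  # wait until every queued film is processed
        for _ in threads:
            self.film_queue.put(None)  # one sentinel per worker
        for t in threads:
            t.join()
        return self.seen_films, self.seen_actors
```

Calling `Crawler(["tt1"]).run()` on the fake data walks from one seed film out to every film and actor reachable from it. New films get queued before the current one is marked done, so `join()` can't return while work is still being discovered.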