Implementing Python Scraper Day One


Live Decision Making:
  • Not going to do aggregate values for series because not all films will be in a series and it would be difficult to differentiate which films belong together. I'm only going to use the film's number in the series.
  • Need to remember that I can only use features that will be available for a movie that hasn't been released, so film rating is out. 
  • Need to make film MPAA rating a number and not a string (gonna do this in R because I will have collected all possible ratings)
  • There isn't an easy way to find the gender of an Actor so that will be left out
  • ...going back and forth from IMDb to Box Office Mojo is a pain :(
    • hmm... I may scrape BOM Alphabetical instead and run searches on IMDb to find the film rather than go IMDb to BOM
      • Process would then be:
        1. Start scraping films from BOM, appending Film objects (initialized with a mojo_id) to Queue1
        2. A loop pulls from Queue1 and tries to find the film's IMDb page
          1. If that fails, the film is ignored
          2. Else, set imdb_id and append to Queue2
        3. Pull from Queue2 and start scraping the Film's non-aggregate fields
          1. When an Actor is created using an imdb_id, add it to Queue3
          2. Take an Actor from Queue3 and start scraping its non-aggregate fields
            • See the lower bullet point about Actor Films
            1. Add the finished Actor to Queue4
        4. Add the finished Film to Queue5
      • After the process, Queue4 will hold the finished Actors and Queue5 the finished Films. Two dictionaries mapping imdb_id -> Object will be built from them, and then the aggregate-value process will start.
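The first two steps of that hand-off can be sketched with Python's thread-safe queue.Queue. This is only a sketch of the flow for one film; find_imdb_id is a hypothetical stand-in for the real IMDb search, and the ids are made up:

```python
import queue

class Film:
    def __init__(self, mojo_id):
        self.mojo_id = mojo_id
        self.imdb_id = None

# Hand-off points (the real pipeline has five of these queues).
queue1 = queue.Queue()  # Films with only a mojo_id
queue2 = queue.Queue()  # Films matched to an IMDb page

def find_imdb_id(film):
    # Hypothetical stand-in for searching IMDb with the BOM title;
    # returns None when no match is found.
    return "tt0000001" if film.mojo_id == "avengers" else None

# Step 1: the BOM scraper seeds Queue1.
queue1.put(Film("avengers"))
queue1.put(Film("unmatchable"))

# Step 2: match each Film to an IMDb page, or drop it.
while not queue1.empty():
    film = queue1.get()
    imdb_id = find_imdb_id(film)
    if imdb_id is None:
        continue  # lookup failed: the film is ignored
    film.imdb_id = imdb_id
    queue2.put(film)

print(queue2.qsize())  # -> 1
```

The later steps (Queue3 through Queue5) follow the same get/process/put shape, just with Actor objects in the middle.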
  • Gonna need a QueueConsumer abstract class with concrete classes whose job is to process all of these
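A minimal sketch of that abstract class, assuming each concrete consumer only needs to define how to process a single item (the names and the string-uppercasing example are mine, not the final design):

```python
from abc import ABC, abstractmethod
from queue import Queue, Empty

class QueueConsumer(ABC):
    """Pulls items off an input Queue, processes them, and
    optionally pushes the results onto an output Queue."""

    def __init__(self, in_queue, out_queue=None):
        self.in_queue = in_queue
        self.out_queue = out_queue

    @abstractmethod
    def process(self, item):
        """Return the processed item, or None to drop it."""

    def run(self):
        while True:
            try:
                item = self.in_queue.get_nowait()
            except Empty:
                break  # input queue drained
            result = self.process(item)
            if result is not None and self.out_queue is not None:
                self.out_queue.put(result)

# Toy concrete consumer standing in for a real scraping step.
class UpperConsumer(QueueConsumer):
    def process(self, item):
        return item.upper()
```

Each queue in the pipeline then gets its own concrete subclass, and only process changes between them.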
  • A Film will need two __init__s depending on whether it is created with a mojo_id or an imdb_id
    • Not the case anymore because of the assumption below
  • Movies will be ignored if their BOM page can't be found, so if an Actor has been in a Film that wasn't scraped, that Film will be ignored. This is a shift from the original plan to take the list of Films an Actor has been in and add it to a Queue to be scraped. The main reason for this decision is that I didn't want to implement imdb_id-to-mojo_id lookup. A second reason is that checking multiple Queues before starting a scrape would have been difficult, because scraping Actor Films would mean multiple consumers processing Films independently.
  • Would make sense to create a method that returns statistics about a value (max, min, avg) so that the logic won't be repeated dozens of times
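That shared stats helper could be as small as this (my own sketch; the None-skipping reflects fields that failed to scrape):

```python
def stats(values):
    """Return (max, min, avg) for a list of numbers, skipping Nones."""
    values = [v for v in values if v is not None]
    if not values:
        return None, None, None  # nothing was scraped for this field
    return max(values), min(values), sum(values) / len(values)

# e.g. revenues across an Actor's Films, one of which failed to scrape
print(stats([100, 250, None, 50]))  # -> (250, 50, 133.33333333333334)
```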
  • I can't filter votes, metascore and star rating by date so Films that became popular years after release will skew results. Same goes for revenue. Some Films will have been out for decades and others a couple months.
  • I'm just gonna assume I match a BOM movie to its IMDb page correctly. I'll of course check that the function works, but I won't verify every match. Some matches may be false but that's okay for this rough draft
  • I'm not going to have classes dedicated to interacting with IMDb and BOM. I'll have the Film and Actor classes know how to do that. My thought behind that is it is more important to group Actor/Film functionality together than IMDb/BOM.
  • Rather than try to catch errors and remove items from a Queue or verify every field, I'm going to add a field to the Actor and Film classes called FAILED. If anything critical breaks, this will be set to True and that object will be ignored. 
  • I noticed a bunch of n/a entries in the BOM alphabetical lists. Those are rando movies or films that haven't been released yet. I will ignore those in the scraping process.
  • So things don't fail, I will need an object in Film that handles errors. This object will know what default values to return and whether or not to set self.FAILED to True. If I don't do this, I will be copying and pasting a ton of code.
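A rough sketch of that handler, assuming each field has a known default and only some fields are critical enough to mark the whole object as failed (all names here are placeholders, and the raising _scrape_budget just simulates a broken scrape):

```python
class FieldErrorHandler:
    """Catches scraping errors for one object, supplies a default value,
    and flips FAILED when the missing field is critical."""

    def __init__(self, owner):
        self.owner = owner  # the Film or Actor this handler belongs to

    def handle(self, scrape_fn, default, critical=False):
        try:
            return scrape_fn()
        except Exception:
            if critical:
                self.owner.FAILED = True
            return default

class Film:
    def __init__(self):
        self.FAILED = False
        self.errors = FieldErrorHandler(self)
        # budget is critical; a failure poisons the whole Film
        self.budget = self.errors.handle(self._scrape_budget, None,
                                         critical=True)

    def _scrape_budget(self):
        raise ValueError("parse error")  # simulate a failed scrape
```

Every field scrape then goes through handle once, instead of each one carrying its own try/except.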
  • I initialized all the fields in Film to some kind of value, but I now realize I will have to revert them to None if I want the if not .... checks to work.
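Worth noting why None matters here: a plain "if not ..." check also fires on legitimate falsy values like 0, so checking against None is the safer test. A tiny illustration:

```python
revenue = 0     # a real scraped value: zero revenue so far
missing = None  # a field that was never scraped

# "if not ..." can't tell these apart; both are falsy:
print(not revenue, not missing)  # -> True True

# Checking against None only flags the truly missing field:
print(revenue is None, missing is None)  # -> False True
```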
  • Number in the series will more than likely be wrong due to re-releases and random cartoon movies not directly connected to the series (e.g. Star Wars). Gonna remove this feature as it would be too difficult to scrape.
  • I'm going to try to add a classwide instance of a dictionary to Film and Actor to see if that can be used to link all of the instances together
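The class-wide dictionary idea, sketched; whether this is how the final linking will work is still open, and the ids are made up:

```python
class Film:
    # Shared across all instances: imdb_id -> Film.
    registry = {}

    def __init__(self, imdb_id):
        self.imdb_id = imdb_id
        Film.registry[imdb_id] = self  # every new Film registers itself

    @classmethod
    def get(cls, imdb_id):
        return cls.registry.get(imdb_id)

a = Film("tt0000001")
b = Film("tt0000002")
print(Film.get("tt0000001") is a)  # -> True
```

Actor would get the same treatment, so any instance can look up any other by imdb_id without passing dictionaries around.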
  • Ergg... I have methods to get both the average *whatever* of all of the Films for an Actor and the max *whatever* of all of the Films for an Actor. A director uses both of them, but when finding the max and average *whatevers* across every Actor, only the averages are considered. So it is the average of the averages and the max of the averages. Not the best implementation, but I'm not gonna change it now...
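In other words, the per-Actor numbers flow up like this (toy numbers and my own helper names, just to pin down which quantity is which):

```python
def avg(xs):
    return sum(xs) / len(xs)

# Each inner list: some metric for every Film an Actor has been in.
actor_films = {
    "actor_a": [10, 20, 30],
    "actor_b": [5, 15],
}

# Per-Actor: both the average and the max get computed...
per_actor_avg = {a: avg(v) for a, v in actor_films.items()}
per_actor_max = {a: max(v) for a, v in actor_films.items()}  # unused upstream

# ...but across Actors, only the averages feed the final features:
avg_of_avgs = avg(list(per_actor_avg.values()))  # the average average
max_of_avgs = max(per_actor_avg.values())        # the max average
print(avg_of_avgs, max_of_avgs)  # -> 15.0 20.0
```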
  • This model doesn't account for cameos...oh well
  • All of the scraping is done in memory which is probably a mistake. One error and it will have to start all over which is unfortunate. 
    • High level try...except(?) hmm....
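One hedged option: wrap each object's scrape in its own try/except so a single failure marks that object as failed instead of killing hours of in-memory work. A sketch, with fake_scrape standing in for the real scraping call:

```python
def scrape_all(films, scrape_one):
    """Scrape every film; one bad film no longer aborts the whole run."""
    finished, failed = [], []
    for film in films:
        try:
            scrape_one(film)
            finished.append(film)
        except Exception as exc:
            # Log and move on instead of restarting from scratch.
            print(f"scrape failed for {film}: {exc}")
            failed.append(film)
    return finished, failed

def fake_scrape(film):
    if film == "film_bad":
        raise ValueError("boom")

done, bad = scrape_all(["film_ok", "film_bad"], fake_scrape)
print(done, bad)  # -> ['film_ok'] ['film_bad']
```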
  • Grrr... my refactored methods don't currently handle a missing field. If get_film(id) returns None, I currently access a field on it inside a lambda, which will throw an error. Unfortunate.
    • Gonna fix this by adding get_methods to each class. If a Film wasn't scraped, a generic Film value will be returned. Not the cleanest way but it'll work.
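Roughly, the fix is a lookup that never returns None, so the lambdas always have something to read. The default values and ids below are placeholders:

```python
class Film:
    def __init__(self, imdb_id=None, revenue=0, votes=0):
        self.imdb_id = imdb_id
        self.revenue = revenue
        self.votes = votes

# One shared "generic" Film stands in for anything that wasn't scraped.
GENERIC_FILM = Film()

films = {"tt0000001": Film("tt0000001", revenue=500, votes=1000)}

def get_film(imdb_id):
    """Return the scraped Film, or a generic placeholder instead of None."""
    return films.get(imdb_id, GENERIC_FILM)

# The lambda no longer blows up on an unscraped id:
revenue_of = lambda imdb_id: get_film(imdb_id).revenue
print(revenue_of("tt0000001"), revenue_of("tt9999999"))  # -> 500 0
```

The trade-off is that a missing Film silently contributes default values to the aggregates, which matches the rough-draft spirit above.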
And so concludes Day One of coding. I worked on this from 13:00-18:00 and 21:00-00:00. 'Twas a solid 8 hours of work. Tomorrow I will implement the multithreaded scraping process, and hopefully I wrote today's code well enough that nothing will break. Key takeaways to remember for tomorrow are:
  • Look at the outlined process for scraping
  • Remember to convert MPAA in R
  • Gotta check n/a when scraping from the alphabetical list in BOM
  • I may need some high-level try...excepts to keep the entire scraping process from breaking midway through
