Improvements for Future Updates


A list of improvements that I update as I find mistakes, or as I make them on purpose.
  • True multiprocessing (not just Python threads, which the GIL keeps from running CPU-bound work in parallel)
  • Replace the requests package, since it is blocking and has no built-in support for concurrent requests
  • Find a way to get screen time for actors and weight actors accordingly
  • A better solution than a high-level try-except in QueueConsumer for error handling (I did this so I could sleep peacefully knowing that the scraper would keep running overnight)
  • Use IMDb's advanced search instead of the regular one to get better results 
  • Exclude actors early, possibly by checking for a profile photo first, to avoid querying each actor's page only to have it fail
  • More testing to guarantee that all scraped values will be correct
  • Make two models depending on the budget
    • I ended up making three. It helped a little, but not by much. The next bullet describes a more worthwhile improvement.
  • Scrape films that don't have revenue information because they are still important for aggregate values such as average actor metascore.
  • Normalize more than just the budget field
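For the last item above, here is a minimal sketch of normalizing several numeric fields at once. It assumes min-max scaling to the range [0, 1]; this post does not say which normalization was applied to budget, so treat the method as a stand-in. The function name `min_max_normalize`, the `runtime` field, and the sample films are all hypothetical.

```python
def min_max_normalize(rows, fields):
    """Return copies of rows with each named field rescaled to [0, 1].

    Assumption: min-max scaling. Missing (None) values are left untouched.
    """
    # First pass: find the min and max of every field, ignoring missing values.
    bounds = {}
    for field in fields:
        values = [row[field] for row in rows if row.get(field) is not None]
        bounds[field] = (min(values), max(values)) if values else None

    # Second pass: rescale each value using the bounds of its field.
    normalized = []
    for row in rows:
        new_row = dict(row)
        for field in fields:
            value = row.get(field)
            if value is None or bounds[field] is None:
                continue
            low, high = bounds[field]
            # A constant column has no spread, so map it to 0.0.
            new_row[field] = 0.0 if high == low else (value - low) / (high - low)
        normalized.append(new_row)
    return normalized


# Hypothetical sample data, only for illustration.
films = [
    {"title": "A", "budget": 10_000_000, "runtime": 90},
    {"title": "B", "budget": 60_000_000, "runtime": 120},
    {"title": "C", "budget": 110_000_000, "runtime": None},
]
scaled = min_max_normalize(films, ["budget", "runtime"])
```

The bounds should come from the training data only and then be reused for any new film, otherwise the scaled values are not comparable between the two.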
I also need to create a util file that generates an index for Box Office Mojo:
  • There will be two data structures: a dict mapping word -> set(titles), and a second dict mapping title -> mojo_id
  • When given an IMDb title, it looks up each word in the title and intersects the resulting sets. Not the best way, but it's what I plan to do until I come up with something better. (It doesn't help that this is due in 5 hours.)
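The index described above could be sketched roughly as follows. This is only an illustration of the word -> set(titles) and title -> mojo_id idea. The function names (`tokenize`, `build_index`, `lookup`), the tokenization rule, and the sample IDs are assumptions made for the example, not real Box Office Mojo identifiers.

```python
import re
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and split it into alphanumeric words (assumed rule)."""
    return re.findall(r"[a-z0-9]+", title.lower())


def build_index(title_to_id):
    """Build the word -> set(titles) dict from the title -> mojo_id dict."""
    word_to_titles = defaultdict(set)
    for title in title_to_id:
        for word in tokenize(title):
            word_to_titles[word].add(title)
    return word_to_titles


def lookup(imdb_title, word_to_titles, title_to_id):
    """Intersect the title sets of every word in the IMDb title.

    Returns a dict of candidate title -> mojo_id, possibly empty.
    """
    words = tokenize(imdb_title)
    if not words:
        return {}
    candidates = None
    for word in words:
        titles = word_to_titles.get(word, set())
        candidates = titles if candidates is None else candidates & titles
        if not candidates:
            return {}  # some word matched nothing, so no title can match
    return {title: title_to_id[title] for title in candidates}


# Placeholder data: the IDs are made up for this example.
title_to_id = {
    "The Dark Knight": "id-1",
    "The Dark Knight Rises": "id-2",
    "Dark Shadows": "id-3",
}
word_to_titles = build_index(title_to_id)
```

With this data, `lookup("The Dark Knight Rises", word_to_titles, title_to_id)` narrows to a single title, while `lookup("The Dark Knight", ...)` returns both Dark Knight films, because every word of the shorter title also appears in the longer one. Some tie-break is still needed, for example preferring the candidate with the fewest extra words.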
