Improvements for Future Updates
A running list of improvements, which I update as I find mistakes or make them on purpose.
- True multiprocessing (not just Python threads, which the GIL prevents from running CPU-bound work in parallel)
- Don't use the requests package, since it is blocking and offers no concurrency on its own
- Find a way to get screen time for actors and weight actors accordingly
- Find a better solution for error handling than a high-level try-except in QueueConsumer (I did this so I could sleep peacefully knowing that it would keep running overnight)
- Use IMDb's advanced search instead of the regular one to get better results
- Exclude actors early, possibly by checking for a profile photo first, to avoid querying each actor's page and failing
- More testing to guarantee that all scraped values will be correct
- Make two models depending on the budget
  - I ended up making 3. It helped a little, but not by much. See the more worthwhile improvement in the next bullet.
- Scrape films that don't have revenue information because they are still important for aggregate values such as average actor metascore.
- Normalize more than just the budget field
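
For the QueueConsumer item above, one possible alternative to a single high-level try-except is to catch errors per item, retry a limited number of times, and record whatever still fails. This is only a minimal sketch, not the actual QueueConsumer: the function `consume`, the `handler` callback, the `max_retries` parameter, and the `(item, attempt)` queue layout are all assumptions made for illustration.

```python
import logging
import queue

logger = logging.getLogger("scraper")  # hypothetical logger name


def consume(tasks, handler, max_retries=2):
    """Drain `tasks`, a queue of (item, attempt) pairs, through `handler`.

    Returns (results, failures) so a failed item is recorded instead of
    being silently swallowed by one big try-except around the whole loop.
    """
    results, failures = [], []
    while True:
        try:
            item, attempt = tasks.get_nowait()
        except queue.Empty:
            break  # nothing left to process
        try:
            results.append(handler(item))
        except Exception as exc:  # scoped to one item, not the whole run
            if attempt < max_retries:
                tasks.put((item, attempt + 1))  # requeue for another try
            else:
                logger.warning("giving up on %r: %s", item, exc)
                failures.append((item, repr(exc)))
        finally:
            tasks.task_done()
    return results, failures
```

The run still survives overnight, but the `failures` list shows in the morning which pages broke and why, which also helps with the testing item above.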
Need to create a util file to generate an index for Box Office Mojo:
- Data structure will be a dict: word -> set(titles) and a second structure will be a dict: title -> mojo_id
- Given an IMDb title, look up each word of the title in the word index and intersect the resulting sets. Not the best way, but that's what I currently plan on doing until I come up with a better one. (Doesn't help that this is due in 5 hours.)
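
The two structures and the lookup described above could be sketched roughly as follows. The function names (`tokenize`, `build_index`, `lookup`) and the tokenizing rule (lowercase, alphanumeric words only) are assumptions for illustration, and the mojo ids in the usage example are made-up placeholders rather than real Box Office Mojo ids.

```python
import re
from collections import defaultdict


def tokenize(title):
    """Lowercase a title and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", title.lower()))


def build_index(mojo_titles):
    """Build word -> set(titles) from a dict of title -> mojo_id."""
    index = defaultdict(set)
    for title in mojo_titles:
        for word in tokenize(title):
            index[word].add(title)
    return index


def lookup(imdb_title, index, mojo_titles):
    """Return sorted (title, mojo_id) pairs whose titles contain every word."""
    words = tokenize(imdb_title)
    if not words:
        return []
    # Intersect the title sets of all words; an unknown word gives an empty set.
    matches = set.intersection(*(index.get(w, set()) for w in words))
    return sorted((t, mojo_titles[t]) for t in matches)


# Usage with placeholder ids (not real Box Office Mojo ids):
mojo_titles = {"Alien": "id-alien", "Aliens": "id-aliens", "Alien 3": "id-alien3"}
index = build_index(mojo_titles)
print(lookup("Alien 3", index, mojo_titles))
```

The intersection returns every title that contains all of the query words, so a short title like "Alien" also matches "Alien 3". Some tie-break is still needed, for example preferring an exact title match or comparing release years.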