Implementing Python Scraper Day Three

This is (hopefully) the home stretch. What I'm going to accomplish today:

  1. Switch to running one at a time
  2. Save Films and Actors before running aggregate commands
  3. Make the R data model

Here we go... (12/4/2017 2:18pm):
  • I'm an idiot. My consumers didn't break (at least I don't think so). I never started a while loop, so each consumer handled only one item and then the function exited.
  • Some errors that appeared:
    • Dates on BOM didn't match IMDb, which I'm using to map IDs
      • Fixed! Added a year tolerance that checks for multiple possible years
    • Some actors only have a year for birthday
      • Fixed! Set birthdate to 1-1-year
    • Budgets for films won't be in number format
      • Not the actual issue. Looks like the BOM -> IMDb mapping is messing up
        • It was messing up because I checked for all years around the BOM year but not for the BOM year itself...smh
      • Once I fixed ^, the budget thing was an issue again. I was looking in the wrong spot for the budget. Should be fixed now.
  • Realizing my default date of year 3000 may be an issue if I don't check for it...
  • I almost didn't scrape any actresses because I only looked for the word actor
  • Seriously underestimated the RAM this needs. I'm going to save the films as they reach the output Queue so I need another consumer.
  • Next steps:
    • Refactor to save into MongoDB
    • Get running on an Intel NUC
    • Write code to get everything from the database and then run the aggregation
    • Start working on R code
  • If for some reason the process fails, I can restart and preload the already-saved objects so the SetQueue excludes them.
  • I should've purged the imdb_page and Box Office Mojo page after the values were scraped...
  • Well, the NUC crashed. I'm going to write code to purge the HTML pages and to populate the SetQueues
  • Smh such a rookie mistake. When passing sets to the SetQueues, I used a reference instead of a copy so every SetQueue was mutating the same sets. I'm a dingus.
  • Forgot to add Actors that were pending scraping back into the Queue in the event of a failure (of which there have been several)
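
Three of the fixes above, sketched in Python. This is not the scraper's actual code. `candidate_years`, `consume`, the `STOP` sentinel, the one-year default tolerance, and this particular `SetQueue` implementation are all assumptions made up for illustration. The sketch covers the year tolerance that includes the BOM year itself, a SetQueue that copies the set it is given instead of sharing it, and a consumer that loops until told to stop.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel that tells a consumer to exit


def candidate_years(bom_year, tolerance=1):
    """Years to try when matching a BOM release year against IMDb."""
    # The range runs from -tolerance to +tolerance inclusive, so offset 0
    # (the BOM year itself) is always one of the candidates.
    return [bom_year + offset for offset in range(-tolerance, tolerance + 1)]


class SetQueue:
    """Assumed design: a FIFO queue that silently drops items it has seen."""

    def __init__(self, seen=None):
        self._queue = queue.Queue()
        # set(...) makes a copy, so two SetQueues built from the same
        # starting set never mutate each other's (or the caller's) state.
        self._seen = set(seen) if seen is not None else set()
        self._lock = threading.Lock()

    def put(self, item):
        """Enqueue item unless it was seen before; return True if enqueued."""
        with self._lock:
            if item in self._seen:
                return False
            self._seen.add(item)
        self._queue.put(item)
        return True

    def get(self):
        return self._queue.get()


def consume(source, results):
    """Keep pulling from `source` until STOP arrives."""
    # Without this loop the function would handle one item and return.
    while True:
        item = source.get()
        if item is STOP:
            break
        results.append(item)


if __name__ == "__main__":
    print(candidate_years(2010))

    already_saved = {"film-1"}
    films = SetQueue(already_saved)
    actors = SetQueue(already_saved)
    films.put("film-2")
    # actors still accepts "film-2" because it holds its own copy of the set.
    print(actors.put("film-2"), already_saved)

    work, results = queue.Queue(), []
    worker = threading.Thread(target=consume, args=(work, results))
    worker.start()
    for name in ["a", "b", "c"]:
        work.put(name)
    work.put(STOP)
    worker.join()
    print(results)
```

Passing the already-saved IDs as `seen` is also one way to handle the restart case: a fresh SetQueue built that way skips everything that was stored before the crash.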
