Implementing Python Scraper Day Three
This is (hopefully) the home stretch. What I'm going to accomplish today:
- Switch to running one at a time
- Save Films and Actors before running aggregate commands
- Make the R data model
Here we go... (12/4/2017 2:18pm):
- I'm an idiot. My consumers didn't break (I don't think at least). I never started a while loop, so only one item was consumed and the function exited.
- Some errors that appeared:
- Dates on BOM didn't match IMDb, which I'm using to map IDs
- Fixed! Added a year tolerance that checks for multiple possible years
- Some actors only have a year for birthday
- Fixed! Set birthdate to 1-1-year
- Budgets for films won't be in number format
- Not the actual issue. Looks like the BOM -> IMDb mapping is messing up
- It was messing up because I checked for all years around the BOM year but not for the BOM year itself...smh
- Once I fixed ^ the budget thing was an issue again. I was looking in the wrong spot for the budget. Should be fixed now.
- Realizing my default date of year 3000 may be an issue if I don't check for it...
- I almost didn't scrape any actresses because I only looked for the word actor
- Seriously underestimated the RAM this needs. I'm going to save the films as they reach the output Queue so I need another consumer.
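The date fixes from the list above can be sketched roughly like this. It is a minimal illustration, not the scraper's actual code: the function names, the one-year default tolerance, the `YYYY-MM-DD` input format, and the `SENTINEL_DATE` constant are all assumptions made for the example.

```python
from datetime import date

# Assumed stand-in for the "year 3000" default date mentioned above.
SENTINEL_DATE = date(3000, 1, 1)


def candidate_years(bom_year, tolerance=1):
    """Return the BOM year itself plus every year within `tolerance` of it."""
    return list(range(bom_year - tolerance, bom_year + tolerance + 1))


def parse_birthdate(text):
    """Parse 'YYYY-MM-DD' or a bare 'YYYY'; a bare year becomes January 1."""
    parts = [int(p) for p in text.strip().split("-")]
    if len(parts) == 1:
        return date(parts[0], 1, 1)
    year, month, day = parts
    return date(year, month, day)


def is_real_date(value):
    """True unless the value is missing or still the placeholder default."""
    return value is not None and value != SENTINEL_DATE


years = candidate_years(2017)
birthday = parse_birthdate("1974")
placeholder_ok = is_real_date(SENTINEL_DATE)
print(years, birthday, placeholder_ok)
```

Building the candidate list with a single `range` keeps the BOM year in the middle of the window, which avoids the "checked every year except the right one" bug.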
- Next steps:
- Refactor to save into MongoDB
- Get running on an Intel NUC
- Write code to get all from Database and then run aggregation
- Start working on R code
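Before the MongoDB refactor, the saving consumer needs the `while` loop I originally forgot. Here is a rough sketch of that shape, assuming a thread-based consumer and a `None` stop marker; the `save` callable is a placeholder for whatever ends up writing to the database, and `save_consumer` is a name made up for this example.

```python
import queue
import threading


def save_consumer(output_queue, save):
    """Keep pulling items until the stop marker arrives.

    Without the loop, only the first item is consumed and the function exits.
    """
    while True:
        item = output_queue.get()
        if item is None:  # assumed stop marker
            break
        save(item)


saved = []
output_queue = queue.Queue()
worker = threading.Thread(target=save_consumer, args=(output_queue, saved.append))
worker.start()

for title in ["Film A", "Film B", "Film C"]:  # placeholder items
    output_queue.put(title)
output_queue.put(None)  # tell the consumer to stop
worker.join()
print(saved)
```

Saving each item as it arrives means the finished films don't have to pile up in memory until the end of the run.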
- If for some reason the process fails, I can restart and add already saved objects to be excluded from the SetQueue.
- I should've purged the imdb_page and Box Office Mojo page after the values were scraped...
- Well the NUC crashed. I'm going to write code to purge the html pages and to populate the SetQueues
- Smh such a rookie mistake. When passing sets to the SetQueues, I used a reference instead of a copy so every SetQueue was mutating the same sets. I'm a dingus.
- Forgot to add Actors that were pending scraping back into the Queue in the event of a failure (of which there have been several)
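The reference-versus-copy mistake and the restart idea are easier to see in code. The `SetQueue` below is a hypothetical reconstruction written for this post, not the real class, and the IDs are made-up placeholders. The important line is `set(seen)`, which copies the incoming set instead of keeping a reference to it.

```python
import queue


class SetQueue:
    """Hypothetical FIFO queue that ignores items it has already seen."""

    def __init__(self, seen=None):
        # set(...) makes a copy, so two queues never mutate one shared set.
        self._seen = set(seen) if seen is not None else set()
        self._queue = queue.Queue()

    def put(self, item):
        """Enqueue the item unless it was seen before; return True if queued."""
        if item in self._seen:
            return False
        self._seen.add(item)
        self._queue.put(item)
        return True

    def get(self):
        return self._queue.get()

    def qsize(self):
        return self._queue.qsize()


already_saved = {"a1"}  # e.g. IDs loaded from storage after a crash
queue_a = SetQueue(already_saved)
queue_b = SetQueue(already_saved)

queue_a.put("a2")
accepted_by_b = queue_b.put("a2")  # queue_b has its own copy of the set
skipped = queue_a.put("a1")        # already saved, so it is excluded
print(already_saved, accepted_by_b, skipped)
```

On a restart, anything that was pending but never saved is simply missing from `already_saved`, so putting it again gets it back into the queue. Note that the check-then-add in `put` is not atomic, so a version shared between several producers would need a lock.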