Implementing Python Scraper Day Two
Alright, I'm gonna see if I can get this done in one quick sprint. The goal for today is to finish the code and at least get it running with only a few errors (nothing fatal, obviously).
Write the following (a rough sketch of the loop follows the list):
- Start consumers
- Get a page
- Get the rows of films
- For each row:
  - Get all td's (columns)
  - Get needed fields
  - If fields aren't there, skip
  - Create a Film and append it to the first Queue
- When all Queues are empty:
  - Add all Films to the dictionary
  - Add all Actors to the dictionary
  - For each film, get aggregate fields
  - Save to MongoDB
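To make the plan concrete, here's a minimal sketch of the producer side of that loop, assuming requests and BeautifulSoup for fetching and parsing. The Film class, the CSS selector, and the column positions are hypothetical placeholders, not the real site's structure.

```python
# Minimal sketch of the plan above. Film, the selector, and the column
# positions are placeholders for illustration, not the actual page layout.
from dataclasses import dataclass
from queue import Queue

import requests
from bs4 import BeautifulSoup

@dataclass
class Film:
    title: str
    year: str
    gross: str

def scrape_page(url: str, film_queue: Queue) -> None:
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for row in soup.select("table tr"):       # get the rows of films
        cols = row.find_all("td")             # get all td's (columns)
        if len(cols) < 3:                     # needed fields aren't there -> skip
            continue
        title = cols[0].get_text(strip=True)
        year = cols[1].get_text(strip=True)
        gross = cols[2].get_text(strip=True)
        if not (title and year and gross):    # empty fields -> skip
            continue
        film_queue.put(Film(title, year, gross))  # append to the first Queue
```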
Live Brainstorm:
- I had to pass self to consume, which gets passed to a thread. Not sure how that'll work... (see the thread sketch after these notes)
- Really hoping that I don't run into proxy errors...
- Time to start running the code @ 6:37PM on 12/1. I'm excited to see how much I did incorrectly!
- First thing I did wrong: my SetQueue needs to be passed an ID when putting an item, so the item can go into the seen set (a SetQueue sketch is below)
- Looks like I will probably need a bunch of IPs and a proxy...
- Never mind - I was just parsing the data wrong. Silly me.
- For some reason, the NUM page doesn't error when increasing the page number like the other pages do
- It seems that requests automatically encodes a URL - fancy! (example below)
- Darnit... I don't currently account for an actor who is also a director
- Okay, I'm also gonna scrape whatever films an Actor has directed
- Film, when aggregating, will have to specify what to use
- It shouldn't be a terrible thing to refactor
- Okay, it was terrible. New plan:
  - Add the director id to the TODO queue as director-(id)
  - This will get processed and saved separately from the normal actor
  - The director-(id) prefix will be stripped when the id is used (sketched after these notes)
- Looks like I royally messed up the multithreading - should've expected that
- Gonna switch it to run in order (single-threaded) to see if I can debug it
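On the "passing self to consume" worry: a bound method already carries its instance, so handing self.consume to a Thread works without any extra plumbing. A minimal sketch (the Consumer class and sentinel shutdown are my own illustration, not the scraper's actual classes):

```python
import queue
import threading

class Consumer:
    def __init__(self):
        self.todo = queue.Queue()

    def consume(self):
        # self is already bound here, even though the thread only got a callable
        while True:
            item = self.todo.get()
            if item is None:          # sentinel tells the worker to stop
                self.todo.task_done()
                break
            print("processing", item)
            self.todo.task_done()

    def start_workers(self, n=4):
        for _ in range(n):
            # target=self.consume passes the bound method; no explicit self needed
            threading.Thread(target=self.consume, daemon=True).start()
```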
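And the SetQueue fix, sketched under the assumption that the seen set is keyed by an explicit ID rather than by the item itself (the name comes from my notes; the real class may differ):

```python
import queue
import threading

class SetQueue:
    """Queue that silently drops items whose ID has already been enqueued."""

    def __init__(self):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def put(self, item, item_id):
        # the caller supplies the ID, since items themselves may not be hashable
        with self._lock:
            if item_id in self._seen:
                return
            self._seen.add(item_id)
        self._queue.put(item)

    def get(self, block=True, timeout=None):
        return self._queue.get(block, timeout)
```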
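The URL-encoding observation in action: requests builds and escapes the query string itself when you pass params, so nothing needs encoding by hand. (httpbin.org is just a stand-in endpoint here.)

```python
import requests

# requests encodes the query string for you; spaces and special
# characters in params come out safely escaped in the final URL.
resp = requests.get(
    "https://httpbin.org/get",
    params={"name": "actor with spaces", "page": 2},
)
print(resp.url)  # the encoded URL that requests actually sent
```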
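Finally, the director-(id) scheme: directors ride through the same TODO queue as actors, tagged with a prefix that's stripped off before the ID is used. A sketch with hypothetical helper names:

```python
DIRECTOR_PREFIX = "director-"

def enqueue_person(todo_queue, person_id, is_director=False):
    # tag directors so the consumer can tell them apart from actors
    tag = DIRECTOR_PREFIX if is_director else ""
    todo_queue.put(tag + person_id)

def classify(tagged_id):
    # strip the director-(id) prefix when the item comes off the queue
    if tagged_id.startswith(DIRECTOR_PREFIX):
        return "director", tagged_id[len(DIRECTOR_PREFIX):]
    return "actor", tagged_id
```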