Implementing Python Scraper Day Two

Alright, I'm gonna see if I can get this done in one quick sprint. The goal for today is to finish the code and at least get it running with only a few errors (nothing fatal, obviously).

Today I'll write the following:

  1. Start consumers
  2. Get a page
  3. Get the rows of films
  4. For each row
    1. Get all td's (columns)
    2. Get needed fields
      • If fields aren't there, skip
    3. Create Film and append to the first Queue
  5. When all Queues are empty:
    1. Add all Films to the dictionary
    2. Add all Actors to the dictionary
  6. For each film, get aggregate fields
  7. Save to MongoDB
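The steps above can be sketched as a minimal producer/consumer pipeline. This is my rough approximation, not the real scraper: the `Film` fields, `parse_row` logic, and fake row data are all assumptions for illustration (and the MongoDB save is left out).

```python
import queue
import threading
from dataclasses import dataclass

@dataclass
class Film:
    # Hypothetical fields; the real scraper pulls these from each row's columns
    title: str
    year: int

films = {}            # step 5.1: title -> Film
todo = queue.Queue()  # step 4.3: parsed Films waiting to be consumed

def parse_row(cells):
    """Steps 4.1-4.3: pull the needed fields out of a row's columns,
    skipping any row that is missing one of them."""
    if len(cells) < 2 or not all(cells[:2]):
        return None  # step 4.2: fields aren't there, skip
    return Film(title=cells[0], year=int(cells[1]))

def consume():
    # Step 1: consumers drain the queue until they see the sentinel
    while True:
        film = todo.get()
        if film is None:
            break
        films[film.title] = film  # step 5.1: add to the dictionary

workers = [threading.Thread(target=consume) for _ in range(2)]
for w in workers:
    w.start()

rows = [["Alien", "1979"], ["", ""], ["Heat", "1995"]]  # fake page rows
for cells in rows:
    film = parse_row(cells)
    if film:
        todo.put(film)

for _ in workers:
    todo.put(None)  # one sentinel per worker
for w in workers:
    w.join()

print(sorted(films))  # ['Alien', 'Heat']
```

The `None` sentinels are one simple way to shut consumers down once everything is queued; `queue.Queue.join()` with `task_done()` would work too.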
Live Brainstorm:
  • I had to pass self to consume, which gets passed to a thread. Not sure how that'll work...
  • Really hoping that I don't run into proxy errors...
  • Time to start running the code @ 6:37PM on 12/1. I'm excited to see how much I did incorrectly!
  • First thing that I did wrong is that my SetQueue needs to be passed an ID var to put an item in the seen set.
  • Looks like I will probably need a bunch of IPs and a proxy...
    • Nevermind - I was just parsing the data wrong. Silly me.
  • For some reason, the NUM page doesn't error when increasing the page number the way the other pages do
  • It seems that requests automatically encodes a URL - fancy!
  • Darnit..... I don't currently account for an actor that is also a director
    • Okay, I'm also gonna scrape whatever films an Actor has also directed
    • Film when aggregating will have to specify what to use
    • It shouldn't be a terrible thing to refactor
      • Okay, it was terrible. New plan:
      • Add director id to TODO queue with director-(id)
        • This will get processed and saved separate from the normal actor
        • This director-(id) will be stripped when being called
  • Looks like I royally messed up the multithreading - should've expected that
    • Gonna switch it to run row by row to see if I can debug it
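On the "pass self to consume" worry above: it actually works out fine, because a bound method carries its instance with it. A tiny sketch (the `Scraper` class and its attributes are made up, not the real code):

```python
import queue
import threading

class Scraper:
    def __init__(self):
        self.todo = queue.Queue()
        self.done = []

    def consume(self):
        # Bound method: `self` travels with it, no explicit passing needed
        while True:
            item = self.todo.get()
            if item is None:
                break
            self.done.append(item)

s = Scraper()
t = threading.Thread(target=s.consume)  # no args; self is baked in
t.start()
for x in ["a", "b"]:
    s.todo.put(x)
s.todo.put(None)  # sentinel to stop the consumer
t.join()
print(s.done)  # ['a', 'b']
```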
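The SetQueue fix (passing an ID so items can land in the seen set) might look something like this. The class name matches the post, but the `item_id` parameter and internals are my guesses at the interface:

```python
import queue

class SetQueue(queue.Queue):
    """A Queue that remembers every ID it has seen and silently drops
    duplicates, so the same film/actor is never queued twice."""

    def _init(self, maxsize):
        # _init is queue.Queue's hook for setting up internal storage
        super()._init(maxsize)
        self.seen = set()

    def put(self, item, item_id=None, **kwargs):
        # Use an explicit ID when given, otherwise the item itself
        key = item_id if item_id is not None else item
        if key in self.seen:
            return  # already queued once; skip
        self.seen.add(key)
        super().put(item, **kwargs)

q = SetQueue()
q.put({"title": "Alien"}, item_id="tt0078748")
q.put({"title": "Alien (dupe)"}, item_id="tt0078748")  # dropped
q.put({"title": "Heat"}, item_id="tt0113277")
print(q.qsize())  # 2
```

Note the `seen` check isn't under the queue's lock here, so in a truly concurrent producer setup you'd want to guard it; for a sketch it shows the idea.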
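And the director-(id) workaround: tag director work items on the TODO queue with a prefix, then strip it when the item is processed so the same person can be handled once as an actor and once as a director. The prefix constant and helper names are assumptions for illustration:

```python
import queue

DIRECTOR_PREFIX = "director-"

todo = queue.Queue()

def enqueue_person(person_id, as_director=False):
    # An actor who also directs gets queued twice: once normally,
    # once with the director- prefix so it's processed separately
    todo.put(DIRECTOR_PREFIX + person_id if as_director else person_id)

def process(item):
    # Strip the prefix before doing the actual fetch/parse
    if item.startswith(DIRECTOR_PREFIX):
        return ("director", item[len(DIRECTOR_PREFIX):])
    return ("actor", item)

enqueue_person("nm0000233")                    # as actor
enqueue_person("nm0000233", as_director=True)  # same person as director
print(process(todo.get()))  # ('actor', 'nm0000233')
print(process(todo.get()))  # ('director', 'nm0000233')
```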
