Implementing Python Scraper Day Two
Alright, I'm gonna see if I can get this done in one quick sprint. The goal for today is to finish the code and at least get it running with only a few errors (nothing fatal, obviously).
Write the following (a rough sketch of the loop follows the list):
- Start consumers
- Get a page
- Get the rows of films
- For each row:
  - Get all td's (columns)
  - Get needed fields
  - If fields aren't there, skip
  - Create a Film and append it to the first Queue
- When all Queues are empty:
  - Add all Films to the dictionary
  - Add all Actors to the dictionary
  - For each film, get aggregate fields
  - Save to MongoDB
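To make the plan concrete, here's a minimal sketch of the producer side of that loop, assuming requests and BeautifulSoup for fetching and parsing. The Film class, the CSS selector, and the column positions are hypothetical placeholders, not the real site's structure.

```python
# Minimal sketch of the plan above. Film, the selector, and the column
# positions are placeholders for illustration, not the actual page layout.
from dataclasses import dataclass
from queue import Queue

import requests
from bs4 import BeautifulSoup

@dataclass
class Film:
    title: str
    year: str
    gross: str

def scrape_page(url: str, film_queue: Queue) -> None:
    resp = requests.get(url)
    resp.raise_for_status()
    soup = BeautifulSoup(resp.text, "html.parser")

    for row in soup.select("table tr"):       # get the rows of films
        cols = row.find_all("td")             # get all td's (columns)
        if len(cols) < 3:                     # needed fields aren't there -> skip
            continue
        title = cols[0].get_text(strip=True)
        year = cols[1].get_text(strip=True)
        gross = cols[2].get_text(strip=True)
        if not (title and year and gross):    # empty fields -> skip
            continue
        film_queue.put(Film(title, year, gross))  # append to the first Queue
```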
Live Brainstorm:
- I had to pass self to consume, which gets passed to a thread. Not sure how that'll work... (see the thread sketch after these notes)
- Really hoping that I don't run into proxy errors...
- Time to start running the code @ 6:37PM on 12/1. I'm excited to see how much I did incorrectly!
- First thing I did wrong: my SetQueue needs to be passed an ID when putting an item, so the item can go into the seen set (a SetQueue sketch is below)
- Looks like I will probably need a bunch of IPs and a proxy...
- Never mind - I was just parsing the data wrong. Silly me.
- For some reason, the NUM page doesn't error when increasing the page number like the other pages do
- It seems that requests automatically encodes a URL - fancy! (example below)
- Darnit... I don't currently account for an actor who is also a director
- Okay, I'm also gonna scrape whatever films an Actor has directed
- Film, when aggregating, will have to specify what to use
- It shouldn't be a terrible thing to refactor
- Okay, it was terrible. New plan:
  - Add the director id to the TODO queue as director-(id)
  - This will get processed and saved separately from the normal actor
  - The director-(id) prefix will be stripped when the id is used (sketched after these notes)
- Looks like I royally messed up the multithreading - should've expected that
- Gonna switch it to run in order (single-threaded) to see if I can debug it
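On the "passing self to consume" worry: a bound method already carries its instance, so handing self.consume to a Thread works without any extra plumbing. A minimal sketch (the Consumer class and sentinel shutdown are my own illustration, not the scraper's actual classes):

```python
import queue
import threading

class Consumer:
    def __init__(self):
        self.todo = queue.Queue()

    def consume(self):
        # self is already bound here, even though the thread only got a callable
        while True:
            item = self.todo.get()
            if item is None:          # sentinel tells the worker to stop
                self.todo.task_done()
                break
            print("processing", item)
            self.todo.task_done()

    def start_workers(self, n=4):
        for _ in range(n):
            # target=self.consume passes the bound method; no explicit self needed
            threading.Thread(target=self.consume, daemon=True).start()
```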
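And the SetQueue fix, sketched under the assumption that the seen set is keyed by an explicit ID rather than by the item itself (the name comes from my notes; the real class may differ):

```python
import queue
import threading

class SetQueue:
    """Queue that silently drops items whose ID has already been enqueued."""

    def __init__(self):
        self._queue = queue.Queue()
        self._seen = set()
        self._lock = threading.Lock()

    def put(self, item, item_id):
        # the caller supplies the ID, since items themselves may not be hashable
        with self._lock:
            if item_id in self._seen:
                return
            self._seen.add(item_id)
        self._queue.put(item)

    def get(self, block=True, timeout=None):
        return self._queue.get(block, timeout)
```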
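The URL-encoding observation in action: requests builds and escapes the query string itself when you pass params, so nothing needs encoding by hand. (httpbin.org is just a stand-in endpoint here.)

```python
import requests

# requests encodes the query string for you; spaces and special
# characters in params come out safely escaped in the final URL.
resp = requests.get(
    "https://httpbin.org/get",
    params={"name": "actor with spaces", "page": 2},
)
print(resp.url)  # the encoded URL that requests actually sent
```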
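Finally, the director-(id) scheme: directors ride through the same TODO queue as actors, tagged with a prefix that's stripped off before the ID is used. A sketch with hypothetical helper names:

```python
DIRECTOR_PREFIX = "director-"

def enqueue_person(todo_queue, person_id, is_director=False):
    # tag directors so the consumer can tell them apart from actors
    tag = DIRECTOR_PREFIX if is_director else ""
    todo_queue.put(tag + person_id)

def classify(tagged_id):
    # strip the director-(id) prefix when the item comes off the queue
    if tagged_id.startswith(DIRECTOR_PREFIX):
        return "director", tagged_id[len(DIRECTOR_PREFIX):]
    return "actor", tagged_id
```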