My First Contracting Gig
Scraping woes
Billy Fung / 2018-02-20
Continuing from my first PostgreSQL experience
I first learned Python through web scraping; it was what I needed most while I was at university, grabbing content from web pages, or navigating them to check if something was there or had changed. Mostly it was for enrolling in courses through the online portal and downloading web articles to read later. But after uni I found that web scraping was a pretty handy skill to have, especially for contracting gigs. Often people want to crawl through an entire website and then grab specific content.
My first contracting gig was to go through a website that had profile pages for models, and grab all the information about each model. Back then, it never even crossed my mind that this might be illegal, or unethical; it was a job that seemed fairly straightforward and paid well. Before this, all the scraping I did was for myself, so I had never considered other ways of presenting the output, or storing it. Keeping data around for later use was also new to me. As per my previous blog post, this is how I came to learn Postgres.
Scraping and storing
Learning on the job, I quickly found issues that arose between scraping and storing. Scraping this many pages, probably the most I've ever done for a single job, I realised that just using requests and beautifulsoup4 was woefully inadequate. The first time I ran the script, it took hours and heated up my laptop like crazy. I wish I had a picture, but I would leave my laptop outside in the winter so the freezing temps would cool it better. And to make matters worse, I didn't write proper exception handling, so this job is when I realised how important it is to keep a script running instead of having it fail four hours in.
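If I were doing it again, I'd wrap every request in a retry loop so one flaky page can't kill the whole run. A minimal sketch of the idea (the URL list and the extraction step are placeholders, not the original code):

```python
import time

import requests
from bs4 import BeautifulSoup

# placeholder list; the real URLs came from crawling the site's index pages
PROFILE_URLS = ["https://example.com/profiles/1", "https://example.com/profiles/2"]

def scrape_profile(url, retries=3):
    """Fetch one profile page, retrying instead of letting the whole run die."""
    for attempt in range(1, retries + 1):
        try:
            resp = requests.get(url, timeout=10)
            resp.raise_for_status()
            return BeautifulSoup(resp.text, "html.parser")
        except requests.RequestException as exc:
            print(f"attempt {attempt} failed for {url}: {exc}")
            time.sleep(2 ** attempt)  # back off a little more each time
    return None  # give up on this one page, but keep the run alive

for url in PROFILE_URLS:
    soup = scrape_profile(url)
    if soup is None:
        continue  # skip the bad page instead of crashing hours in
    # ... pull name, location, etc. out of soup here ...
```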
Encoding
Because the nature of the information meant it came in many different languages, I quickly learned about character encodings. Until then I had always dealt with nice, proper-looking text that didn't have accents or unusual symbols. I then also learned how to properly store encoded data, or at least present it.
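The gotcha, roughly: requests trusts the HTTP headers for the encoding, and plenty of pages declare it wrong or not at all. Something like this sketch handles the common case, falling back to sniffing the body and storing everything as UTF-8 (the URL is a placeholder):

```python
import requests

resp = requests.get("https://example.com/profiles/1")  # placeholder URL

# requests takes the encoding from the HTTP headers; pages often omit or lie
# about it, so fall back to the encoding sniffed from the body itself
if resp.encoding is None or resp.encoding.lower() == "iso-8859-1":
    resp.encoding = resp.apparent_encoding

text = resp.text  # now a properly decoded unicode string

# write everything out as UTF-8 so accents and non-Latin names survive
with open("profile.html", "w", encoding="utf-8") as f:
    f.write(text)
```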
Validating
After I finally got a presentable database, I quickly realised that I had no way of properly checking it. Nor had I thought about the best schema for the data. This job had quickly become more work than I expected, which is a recurring theme in most of my work. If I had defined a stricter schema with constraints, the database itself could have flagged things like emails that didn't make sense, or locations that didn't match. This would be a lesson in data cleaning.
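In Postgres that can be as simple as CHECK constraints, so bad rows fail at insert time instead of surfacing weeks later. A rough sketch of what I mean (the table and columns are invented for illustration, not the actual schema):

```python
import psycopg2

conn = psycopg2.connect("dbname=models")  # hypothetical database
cur = conn.cursor()

# constraints make Postgres reject obviously bad rows at insert time,
# instead of letting junk pile up silently
cur.execute("""
    CREATE TABLE IF NOT EXISTS profiles (
        id        serial PRIMARY KEY,
        name      text NOT NULL,
        email     text CHECK (email LIKE '%@%'),  -- crude, but catches junk
        location  text CHECK (length(location) > 1)
    )
""")
conn.commit()
```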
Presenting
The final lesson was that I had no idea how to present the output. I had a database, so my first thought was to export it to a csv, which was silly. Then I learned about .dump files and ended up going with those. In hindsight, I think it would have been better to host the database somewhere, show the client where it is, and restrict access appropriately.
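For the record, both routes are short; something like this, assuming a hypothetical models database (pg_dump's custom format produces a .dump file the client can restore with pg_restore):

```python
import subprocess

import psycopg2

# the csv route I tried first: one flat file per table
conn = psycopg2.connect("dbname=models")  # hypothetical database
with conn.cursor() as cur, open("profiles.csv", "w", encoding="utf-8") as f:
    cur.copy_expert("COPY profiles TO STDOUT WITH CSV HEADER", f)

# the .dump route: pg_dump's custom format, restorable with pg_restore
subprocess.run(["pg_dump", "--format=custom", "--file=models.dump", "models"],
               check=True)
```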
I’ll try to update this post in the future by going back and hopefully finding some code/emails/pictures that go along with the story.