Web Scraping and PostgreSQL

Billy Fung / 2016-01-28


PostgreSQL and Python scraping

Been meaning to learn Python? Want to see what the hype around PostgreSQL is all about? Well, the best way to do both is to dive right in with a scraping project. Following along, you’ll learn the basics of running a web scraper on a website to gather information, and then putting that information into a database that you can run useful queries against.

[PostgreSQL][2] has been touted as the “world’s most advanced open source database”, and offers various features that make it stand out from other SQL databases such as MySQL and MariaDB. Although you might not make use of those features immediately, it is good to start with a database that has them available if needed in the future. With a database chosen, the next step is to fill it with data. It doesn’t really matter whether you choose the database first or the data first; what matters is that you choose the tools that get the job done.

The data

Scraping data from a website is a way to take information that is available online and either manipulate it immediately or store it in a database for later use. In this example, the data will come from an audition and jobs site, StarNow, which allows models and photographers to advertise their talent. Exploring the website, model profiles include a plethora of data, ranging from acting and modeling credentials to physical details of the model. Depending on how much data you want and the goals of your project, you can choose to scrape all the data available, or just a small subset to make the job quicker. Here the goal will be to scrape all the physical attributes of models and then put those attributes into a PostgreSQL database, using Python.

Creating the database

Following the installation instructions from the PostgreSQL website for your OS, the first step is to create the database, either locally or in the cloud with Compose.io. In order to create a new database, you must first connect to the default postgres database and then run the command CREATE DATABASE starnow. There will then be a starnow database available locally (or in the cloud), and you will be ready to start putting data into it. Alternatively, this can all be done from a Python script with the psycopg2 library, which is an adapter layer that allows Python to communicate with the Postgres database.

    from psycopg2 import connect
    from psycopg2.extensions import ISOLATION_LEVEL_AUTOCOMMIT

    # connect to the default admin database
    con = connect(database='postgres')

    dbname = "starnow"

    # CREATE DATABASE cannot run inside a transaction block, so switch to
    # autocommit, which ends the transaction after every query
    con.set_isolation_level(ISOLATION_LEVEL_AUTOCOMMIT)
    cur = con.cursor()
    cur.execute('CREATE DATABASE ' + dbname)
    cur.close()
    con.close()

Scraping the data

Beautiful Soup is a Python library for scraping and parsing, and although other Python scraper libraries exist, this one is very well documented and easy to use. Along with the requests library for HTTP communication, these two tools are the bread and butter of Python scraping for projects that aren’t too complex. Other options will be discussed later. The general scraping process is then:

1. Request a search results page
2. Scrape the search page for models
3. Request a model’s page
4. Scrape the model page for physical attributes
5. Repeat for all models on the current search page
6. Scrape the next search page, until there are no more search pages
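
To make that flow concrete, here is a rough sketch of how the steps nest. The helper functions parse_search_page, parse_profile and insert_row are hypothetical placeholders standing in for the parsing and database code shown in the rest of this post.

    import requests
    from bs4 import BeautifulSoup

    def parse_search_page(soup):
        # hypothetical placeholder: return the list of profile URLs on a search page
        return []

    def parse_profile(soup):
        # hypothetical placeholder: return a dict of physical attributes for one profile
        return {}

    def insert_row(data):
        # hypothetical placeholder: INSERT the dict into the physical table
        pass

    # 3528 search result pages in total; range() excludes the end point
    for page in range(1, 3529):
        search = requests.get("http://www.starnow.com.au/talent/?p=%d" % page)
        search_soup = BeautifulSoup(search.content)
        # roughly 20 profiles per search page
        for profile_url in parse_search_page(search_soup):
            profile = requests.get(profile_url)
            profile_soup = BeautifulSoup(profile.content)
            insert_row(parse_profile(profile_soup))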

The data fields that are available on the profile pages on StarNow are limited to:

    variables = ['name', 'gender', 'location', 'username', 'talenturl', 'roles', 'height', 'hips',
                 'weight', 'shoe', 'ethnicity', 'hair', 'skin', 'hairlength', 'eye', 'hairtype',
                 'chest', 'dress', 'waist']

These variables will then become the columns for the physical table within the starnow database.

    CREATE TABLE physical (
        id serial PRIMARY KEY,
        name varchar, gender varchar, location varchar, username varchar,
        talenturl varchar, roles varchar, height varchar, hips varchar,
        weight varchar, shoe varchar, ethnicity varchar, hair varchar,
        skin varchar, hairlength varchar, eye varchar, hairtype varchar,
        chest varchar, dress varchar, waist varchar
    );

This will create a table called physical with all the columns in lower case as type varchar. If you want to use case-sensitive column names, enclose the name in double quotes, e.g. "Name" varchar. For the sake of getting data into the database quickly, the types were all chosen to be varchar, which maps directly to Python strings.
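
If you would rather stay in Python for this step too, the same table can be created through psycopg2; here is a minimal sketch, assuming the starnow database created above and the variables list from earlier.

    from psycopg2 import connect

    variables = ['name', 'gender', 'location', 'username', 'talenturl', 'roles', 'height', 'hips',
                 'weight', 'shoe', 'ethnicity', 'hair', 'skin', 'hairlength', 'eye', 'hairtype',
                 'chest', 'dress', 'waist']

    con = connect(database='starnow')
    cur = con.cursor()
    # id is the primary key, every scraped field is stored as varchar
    columns = ', '.join('%s varchar' % v for v in variables)
    cur.execute('CREATE TABLE physical (id serial PRIMARY KEY, %s)' % columns)
    con.commit()
    cur.close()
    con.close()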

The lazy method of scraping the results pages is to first look at how many pages of results there are; in this instance the last page is 3528. Knowing this, iterating requests through all the results pages is then:

    import requests
    from bs4 import BeautifulSoup

    # range() excludes the end point, so go one past the last page
    for x in range(1, 3529):
        searchPage = "http://www.starnow.com.au/talent/?p=%d" % x
        # grab html and then use bs4 to parse
        try:
            r = requests.get(searchPage)
            soup = BeautifulSoup(r.content)
        except:
            print "Error getting page " + searchPage

Parsing with Beautiful Soup

Each search results page returns 20 different profiles; the next step is to identify how each profile appears in the HTML source. This is often the part where scraping data gets messy and dirty: sometimes websites don’t have logical coding styles or naming patterns, and it is left to the scraper to dig through it all. Luckily the StarNow code is nicely tagged, so each of the 20 ads on the results page can be obtained by

    soup.find_all("div", {'itemtype': ['http://schema.org/Person']})

which gives a Python list of 20 results, assigned here to directory. The next step is to request each model’s page, which is easily done once the username is obtained:

    for talent in directory:
        # go through the 20 ads and parse within each one
        data = {}
        # the username is in the <a href=""> attribute
        data['name'] = talent('a')[0]['href'][1:]
        data['talenturl'] = "http://www.starnow.com.au/" + data['name']
        try:
            personalPage = requests.get(data['talenturl'])
            personalSoup = BeautifulSoup(personalPage.content)
        except:
            print "Error getting model page " + data['talenturl']

Once the model’s profile page is obtained, you can begin looking for the data that fills the table. A very handy feature of BeautifulSoup is the find function, which lets you search through the document for certain HTML tags. Indexing the result with ['content'] gives the value of the content attribute of the tag that BeautifulSoup finds:

    data['gender'] = personalSoup.find('meta', {'property':'gender'})['content']
    data['name'] = personalSoup.find('meta', {'property':'name'})['content']
    data['location'] = personalSoup('span')[2].string.strip()
    data['username'] = personalSoup.find('meta',{'property':'profile:username'})['content']
    data['roles'] = personalSoup.find_all('div', {'class': 'profile__roles'})[0].string.strip()

The physical details are located within a table, but BeautifulSoup makes it easy to find:

    physical = personalSoup.find('table', id='ctl00_cphMain_physicalDetails_memberAttributes')

With the HTML of the physical details table in hand, the rest of the data for the database table can then be filled in using a simple iteration.

    import re

    # create a dict mapping the db column names to search terms
    searchword = {'height':'Height:', 'hips':'Hips:', 'weight':'Weight:', 'shoe':'Shoe size:', 'ethnicity':'Ethnicity:',
                  'hair':'Hair color:', 'skin':'Skin color:', 'hairlength':'Hair length:', 'eye':'Eye color', 'hairtype':'Hair type:',
                  'chest':'Chest:', 'dress':'Dress Size:', 'waist':'Waist:'}

    if 'Model' in data['roles'] or 'Actor' in data['roles']:
        for x in searchword.keys():
            try:
                # find the label div and take its sibling as the value
                data[x] = physical('div', text=re.compile(searchword[x]))[0].next_sibling.strip()
            except:
                # not every profile fills in every attribute
                pass

Putting the data into the database

Once all the data is collected, you are ready to put it into your PostgreSQL database. The psycopg2 library makes this easy once again:

    import psycopg2

    try:
        conn = psycopg2.connect(database="starnow")
        cur = conn.cursor()
    except:
        print "Unable to connect"

    # build the column list and matching named placeholders from the data dict
    cur.execute("INSERT INTO physical (%s) VALUES (%s)" % (','.join(data), ','.join('%%(%s)s' % k for k in data)), data)

    # don't forget to commit to make changes persistent
    conn.commit()

That last step had a bit of weirdness going on with the INSERT statement, but all it does is go through the data dict and build the column list along with a matching set of named placeholders, so that psycopg2 fills in the values itself and the command executes without risk of SQL injection. The longer way would be to write out the INSERT statement with every column and value in the data dictionary spelled out by hand. Now all the data is in your starnow database!
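
To see what that string formatting actually produces, here is a small standalone example using a hypothetical two-column dict. Both joins iterate over the same dict, so the column names and the placeholders always line up.

    # a hypothetical two-column dict, just to illustrate the placeholder construction
    data = {'name': 'Jane', 'gender': 'Female'}

    columns = ','.join(data)                              # e.g. "name,gender"
    placeholders = ','.join('%%(%s)s' % k for k in data)  # e.g. "%(name)s,%(gender)s"

    query = "INSERT INTO physical (%s) VALUES (%s)" % (columns, placeholders)
    # query is now: INSERT INTO physical (name,gender) VALUES (%(name)s,%(gender)s)
    # psycopg2 substitutes the values from the dict when you call
    # cur.execute(query, data), escaping them safely along the way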

Querying

Well, with all that data in your database, what do you want to do with it? Fear not, querying it with psycopg2 is also very simple. Website searches often let you filter by gender, and that looks like:

    cur.execute("SELECT * FROM physical WHERE gender='Male'")
    male = cur.fetchall()

And with that you have all the scraped male models ready for whatever purpose you’d like.

Using numpy and matplotlib you can create simple histograms to look at distributions.
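
As a sketch of what that could look like, assuming the scraped heights are strings along the lines of “180cm” (the exact format on StarNow may differ), the heights can be cleaned up and plotted like this:

    import re
    import psycopg2
    import matplotlib.pyplot as plt

    conn = psycopg2.connect(database="starnow")
    cur = conn.cursor()
    cur.execute("SELECT height FROM physical WHERE gender='Male'")

    heights = []
    for (height,) in cur.fetchall():
        # pull the first number out of the string, skipping empty values
        match = re.search(r'\d+', height or '')
        if match:
            heights.append(int(match.group()))

    plt.hist(heights, bins=30)
    plt.xlabel('Height (cm)')
    plt.ylabel('Number of male models')
    plt.show()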

[Image: A simple histogram showing the distribution of male model heights on StarNow.au]

What next?

Well, that sums up a very simple and quick intro to scraping with Python, BeautifulSoup4 and PostgreSQL. As stated earlier, this is only one of the many options available to you for scraping; the same results could easily have been accomplished using Python with Scrapy and SQLite. With all this physical-details data you can now do lots of fun stuff: create recommendation engines based on physical details, look at the statistics for models in the database, create your own search engine app for the database, the list goes on. A database and the queries you run against it form the backbone of most online web applications these days. If you want to do more, creating more tables within the database for extra data, such as the acting and modeling credentials, is good practice as well.
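
As one purely illustrative example of that, a second credits table could reference the physical table by its id, so each acting or modelling credit gets its own row. The table and column names here are hypothetical, not part of the scraper above.

    # a purely illustrative sketch of a second table for acting and modelling
    # credits, linked back to the physical table by its id
    import psycopg2

    conn = psycopg2.connect(database="starnow")
    cur = conn.cursor()
    cur.execute("""
        CREATE TABLE credits (
            id serial PRIMARY KEY,
            model_id integer REFERENCES physical (id),
            credit_type varchar,   -- e.g. 'acting' or 'modeling'
            title varchar,
            year varchar
        )
    """)
    conn.commit()
    cur.close()
    conn.close()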