Web Scraping


Web scraping

This notebook contains the final code for the web scraping article I wrote on Medium. It can be accessed here: https://medium.com/p/32b0ceeee538/edit

Import all the required modules

# Only `get` is imported from requests, so the `requests` counter
# used later in the notebook does not shadow the module.
from requests import get
from bs4 import BeautifulSoup
from IPython.display import clear_output
from warnings import warn
from time import sleep, time
from random import randint

Specify a browser-like User-Agent header so the site is less likely to reject our requests


headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}

Build the list of page numbers to request: 0, 2, 4, 6 and 8 (range(0, 10, 2) stops before 10)

pages = [str(i) for i in range(0, 10, 2)]
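
As a rough sketch of how these page numbers turn into request URLs, the snippet below builds the full list up front. The base URL is the one used in the request later in this notebook; the helper name build_page_urls is an assumption made for illustration only.

```python
from urllib.parse import urlencode

# Base listing URL, copied from the get() call made later in this notebook.
BASE_URL = 'https://www.metacritic.com/browse/movies/score/metascore/all/filtered'

def build_page_urls(start, stop, step):
    """Return one listing URL per page number in range(start, stop, step).

    Hypothetical helper, not part of the original notebook.
    """
    return ['{}?{}'.format(BASE_URL, urlencode({'page': page}))
            for page in range(start, stop, step)]

urls = build_page_urls(0, 10, 2)
for url in urls:
    print(url)
```

Printing the list before scraping is a cheap way to confirm which pages will actually be requested.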

Declare lists to store the scraped data

names = []
release_dates = []
ratings = []
meta_scores = []
user_scores = []

Prepare the variables used to monitor the loop

start_time = time()
requests = 0
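
The same bookkeeping could also be wrapped in a small class, which makes the frequency calculation easy to check without waiting for real requests. This is only a sketch: RequestMonitor and its clock parameter are hypothetical names, not something the notebook defines.

```python
from time import time

class RequestMonitor:
    """Counts requests and reports the average request rate (hypothetical helper)."""

    def __init__(self, clock=time):
        self._clock = clock      # injectable so the class can be tested without sleeping
        self._start = clock()    # plays the role of start_time
        self.count = 0           # plays the role of the requests counter

    def record(self):
        """Register one request and return the average requests per second so far."""
        self.count += 1
        elapsed = self._clock() - self._start
        # Guard against a zero elapsed time when called immediately after creation.
        return self.count / elapsed if elapsed > 0 else float('inf')

monitor = RequestMonitor()
frequency = monitor.record()
print('Request:{}; Frequency: {} requests/s'.format(monitor.count, frequency))
```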

Iterate through the pages


for page in pages:
    
    #make a get request
    movies = get('https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page=' + page, headers=headers)
    
    #pause the loop for 8-20 seconds
    sleep(randint(8,20))
    
    #monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    
    #show a warning if a non-200 status code is returned
    if movies.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, movies.status_code))
    
    #break the loop if the number of requests exceeds 10
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break
   
    #parse the response content into a BeautifulSoup object
    movie_soup = BeautifulSoup(movies.text, 'html.parser')
    
    #find the container tag that wraps each movie
    container = movie_soup.find_all('td', class_ = 'clamp-summary-wrap')
     
    #iterate through the containers
    for con in container:
        #scrape the movie names
        name = con.find('h3').text
        names.append(name)
    
        #scrape the release_dates
        release_date = con.select('div.clamp-details span')[0].text
        release_dates.append(release_date)
    
        #scrape the ratings
        rating = con.select('div.clamp-details span')[1].text
        ratings.append(rating)
    
        #scrape the meta scores
        meta_score = con.select('a.metascore_anchor div')[0].text
        meta_scores.append(meta_score)
    
        #scrape the user scores.
        user_score = con.select('a.metascore_anchor div')[2].text
        user_scores.append(user_score)
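
The loop above indexes the selected elements directly with [0], [1] and [2], so a movie that is missing one of them would raise an IndexError and stop the whole scrape. A more defensive variant might look like the sketch below. The helper text_at is a hypothetical name, the SimpleNamespace objects are stand-ins for the tags that con.select(...) returns, and unlike the original code it also strips surrounding whitespace.

```python
from types import SimpleNamespace

def text_at(elements, index, default=None):
    """Return the stripped .text of elements[index], or default if it is missing."""
    try:
        return elements[index].text.strip()
    except IndexError:
        return default

# Stand-ins for the result of con.select('div.clamp-details span');
# real BeautifulSoup tags expose a .text attribute in the same way.
spans = [SimpleNamespace(text='September 4, 1941'),
         SimpleNamespace(text=' | Approved ')]

release_date = text_at(spans, 0)
rating = text_at(spans, 1)
missing = text_at(spans, 2, default='n/a')   # no third element in this sample
print(release_date, rating, missing)
```

Appending a default instead of failing keeps the five lists the same length, which the dataframe construction in the next step relies on.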

Parse the data into a DataFrame and store it in a CSV file

import pandas as pd
movie_df = pd.DataFrame({'Movie_names': names,
                         'Release_dates': release_dates,
                         'Ratings': ratings,
                         'Meta_scores': meta_scores,
                         'User_scores': user_scores})
# info() prints its summary itself and returns None, so no print() is needed
movie_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
Meta_scores      500 non-null object
Movie_names      500 non-null object
Ratings          500 non-null object
Release_dates    500 non-null object
User_scores      500 non-null object
dtypes: object(5)
memory usage: 19.6+ KB
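
The cell above builds the dataframe but does not show the step that writes it to disk. A minimal sketch of that step, assuming pandas' to_csv is what is wanted, could look like this. The file name movies.csv and the two-row sample_df are assumptions made so the example runs on its own; in the notebook the call would be made on movie_df.

```python
import os
import tempfile

import pandas as pd

# A two-row stand-in for movie_df, using values shown in this notebook's output.
sample_df = pd.DataFrame({'Movie_names': ['Citizen Kane', 'The Godfather'],
                          'Meta_scores': ['100', '100'],
                          'User_scores': ['8.4', '9.2']})

# Hypothetical file name; a temporary directory keeps the example self-contained.
csv_path = os.path.join(tempfile.mkdtemp(), 'movies.csv')

# index=False leaves the 0..n-1 row index out of the file.
sample_df.to_csv(csv_path, index=False)

# dtype=str keeps the scores as text, matching the object columns reported by info().
loaded_df = pd.read_csv(csv_path, dtype=str)
print(loaded_df)
```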
movie_df
     Meta_scores  Movie_names                                        Ratings      Release_dates       User_scores
  0  100          Citizen Kane                                       | Approved   September 4, 1941   8.4
  1  100          The Godfather                                      | R          March 11, 1972      9.2
  2  100          Rear Window                                        | TV-G       September 1, 1954   8.8
  3  100          Casablanca                                         | TV-PG      January 23, 1943    9.0
  4  100          Boyhood                                            | R          July 11, 2014       7.6
  5  100          Three Colors: Red                                  | R          November 23, 1994   8.7
  6  100          Vertigo                                            | PG         May 28, 1958        8.7
  7  100          Notorious                                          | Approved   September 6, 1946   7.9
  8  99           Singin' in the Rain                                | G          April 11, 1952      8.8
  9  99           City Lights                                        | Passed     March 7, 1931       8.2
 10  99           Moonlight                                          | Not Rated  October 21, 2016    7.2
 11  99           Pinocchio                                          | Passed     February 23, 1940   8.3
 12  99           Touch of Evil                                      | PG-13      February 1, 1958    7.8
 13  98           The Treasure of the Sierra Madre                   | TV-PG      January 24, 1948    8.5
 14  98           Pan's Labyrinth                                    | R          December 29, 2006   8.7
 15  98           North by Northwest                                 | TV-G       August 6, 1959      8.1
 16  98           Rashomon                                           | Not Rated  December 26, 1951   8.3
 17  98           All About Eve                                      | TV-PG      October 27, 1950    8.8
 18  98           Hoop Dreams                                        | PG-13      October 14, 1994    8.0
 19  97           My Left Foot                                       | R          March 30, 1990      8.5
 20  97           The Third Man                                      | Approved   September 3, 1949   8.2
 21  97           Dr. Strangelove or: How I Learned to Stop Worr...  | GP         January 29, 1964    8.3
 22  97           Gone with the Wind                                 | TV-PG      January 17, 1940    8.4
 23  97           4 Months, 3 Weeks and 2 Days                       | Not Rated  January 23, 2008    7.9
 24  97           Some Like It Hot                                   | Approved   March 29, 1959      8.3
 25  97           Psycho                                             | M          September 8, 1960   9.1
 26  97           American Graffiti                                  | PG         August 11, 1973     8.1
 27  96           Dumbo                                              | Approved   October 31, 1941    8.1
 28  96           Roma                                               | Not Rated  November 21, 2018   8.0
 29  96           Ran                                                | R          December 20, 1985   8.4
...  ...          ...                                                ...          ...                 ...
470  82           Good Bye, Dragon Inn                               Not Rated    September 17, 2004  5.9
471  82           Safe Conduct                                       Not Rated    October 11, 2002    6.2
472  82           Leaving Las Vegas                                  | R          October 27, 1995    8.9
473  82           Maiden                                             | Not Rated  June 28, 2019       7.6
474  82           War for the Planet of the Apes                     | PG-13      July 14, 2017       8.0
475  82           Duma                                               | PG         August 5, 2005      8.5
476  82           The Constant Gardener                              | R          August 31, 2005     7.0
477  82           Short Term 12                                      | R          August 23, 2013     8.5
478  82           Parenthood                                         | PG-13      August 2, 1989      8.5
479  82           Star Wars: Episode V - The Empire Strikes Back     | PG         May 21, 1980        9.0
480  82           Our Beloved Month of August                        | Not Rated  September 3, 2010   tbd
481  82           Sugar                                              | R          April 3, 2009       7.8
482  82           Marwencol                                          | Not Rated  October 8, 2010     7.7
483  82           The Wind That Shakes the Barley                    | Not Rated  March 16, 2007      7.9
484  82           Face/Off                                           | R          June 27, 1997       8.8
485  82           The Lobster                                        | R          May 13, 2016        7.0
486  82           The Nightmare Before Christmas                     | PG         October 22, 1993    8.6
487  82           2001: A Space Odyssey                              | G          April 2, 1968       8.1
488  82           Pride & Prejudice                                  | PG         November 11, 2005   8.7
489  82           The Squid and the Whale                            | R          October 5, 2005     7.4
490  82           Winged Migration                                   | G          April 18, 2003      8.7
491  82           Quince Tree of the Sun                             Not Rated    May 5, 2000         tbd
492  82           School of Rock                                     | PG-13      October 3, 2003     8.5
493  82           Life and Nothing More                              | Not Rated  October 24, 2018    8.2
494  82           Star Trek                                          | PG-13      May 7, 2009         7.9
495  82           A Quiet Place                                      | PG-13      April 6, 2018       7.4
496  82           Deliverance                                        | TV-14      July 21, 1972       7.7
497  82           Frances Ha                                         | R          May 17, 2013        7.7
498  82           The Namesake                                       | PG-13      March 9, 2007       8.0
499  82           A Hijacking                                        | R          June 21, 2013       7.3

500 rows × 5 columns
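
Every column in the scraped table is still text: most ratings carry a leading "| ", some user scores are "tbd", and the dates are written out in words. The functions below are one possible way to tidy those values, sketched with the standard library only. The function names are hypothetical, and the date parsing assumes English month names as they appear in the output above.

```python
from datetime import datetime

def clean_rating(raw):
    """Drop the leading '|' separator that most scraped ratings carry."""
    return raw.strip().lstrip('|').strip()

def parse_user_score(raw):
    """Convert a user score to float; 'tbd' (no score yet) becomes None."""
    raw = raw.strip()
    return None if raw.lower() == 'tbd' else float(raw)

def parse_release_date(raw):
    """Parse dates written like 'September 4, 1941' into a date object."""
    return datetime.strptime(raw.strip(), '%B %d, %Y').date()

print(clean_rating('| Approved'))
print(parse_user_score('tbd'))
print(parse_release_date('September 4, 1941'))
```

Each function takes a single value, so it could be applied to a whole column with something like movie_df['Ratings'].map(clean_rating).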
