Web Scraping
This notebook contains the final code for the web scraping article I wrote on Medium. It can be accessed here: https://medium.com/p/32b0ceeee538/edit
Import all the required modules
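# `get` and `BeautifulSoup` are imported by name; importing the whole `requests`
# module as well is unnecessary and would clash with the `requests` counter used below
from requests import get
from bs4 import BeautifulSoup
# clear_output is part of the public IPython.display module
from IPython.display import clear_output
from warnings import warn
from time import sleep, time
from random import randint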
Specify a User-Agent header so that our requests look like they come from a regular browser and are accepted
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
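To see what this header changes without sending anything over the network, one option is to build a request and inspect it before it goes out. The sketch below assumes the `requests` library used in this notebook; the shortened User-Agent string and the names `demo_headers`, `demo_url` and `prepared` are illustrative only.

```python
import requests

# Illustrative, shortened User-Agent; the notebook itself uses a full Chrome string.
demo_headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6)'}
demo_url = 'https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page=0'

# Build the request without sending it, so the outgoing headers can be inspected.
prepared = requests.Request('GET', demo_url, headers=demo_headers).prepare()
sent_user_agent = prepared.headers['User-Agent']
print(sent_user_agent)

# For comparison, the User-Agent that requests would send by default.
default_user_agent = requests.utils.default_user_agent()
print(default_user_agent)
```

The default value identifies the client as a Python script, which some sites reject; that is the likely reason a browser-style string is supplied here.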
Build the list of page numbers: 0 to 8 in steps of 2 (five pages in total)
pages = [str(i) for i in range(0,10,2)]
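A quick way to confirm which pages this list covers is to build the URLs up front. This sketch uses only the standard library; `BASE_URL` and `urls` are names introduced here for illustration, and the URL pattern is the one requested in the loop further down.

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.metacritic.com/browse/movies/score/metascore/all/filtered'

# range(0, 10, 2) stops before 10, so it yields 0, 2, 4, 6 and 8.
pages = [str(i) for i in range(0, 10, 2)]

# urlencode builds the query string, e.g. 'page=4'.
urls = ['{}?{}'.format(BASE_URL, urlencode({'page': page})) for page in pages]

for url in urls:
    print(url)
```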
Declare lists to store the scraped data
names = []
release_dates = []
ratings = []
meta_scores = []
user_scores = []
Prepare to monitor the loop
start_time = time()
requests = 0
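The same bookkeeping (a request counter, a start time, and a cap on the number of requests) could also be wrapped in a small helper. The following is only a sketch: `RequestMonitor` and its methods are hypothetical names, not part of any library, and the timestamps are passed in by hand so the numbers are easy to follow.

```python
from time import time


class RequestMonitor:
    """Tracks how many requests were made and how fast (hypothetical helper)."""

    def __init__(self, max_requests, start_time=None):
        self.max_requests = max_requests
        self.start_time = time() if start_time is None else start_time
        self.count = 0

    def record(self, now=None):
        """Count one request and return the average rate in requests per second."""
        self.count += 1
        now = time() if now is None else now
        elapsed = now - self.start_time
        # Guard against a zero elapsed time when two clock readings coincide.
        return self.count / elapsed if elapsed > 0 else float('inf')

    def limit_exceeded(self):
        """True once more requests were made than expected."""
        return self.count > self.max_requests


# Usage with fixed timestamps: one request recorded 10 seconds after the start.
monitor = RequestMonitor(max_requests=10, start_time=100.0)
rate = monitor.record(now=110.0)
print('Request: {}; Frequency: {:.2f} requests/s'.format(monitor.count, rate))
```

In a real loop the `now` argument would simply be left out, so that `time()` is read at each request.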
Iterate through the pages
for page in pages:
    
    #make a get request
    movies = get('https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page='+ page, headers = headers)
    
    #pause the loop for 8-20 seconds
    sleep(randint(8,20))
    
    #monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request:{}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
    clear_output(wait = True)
    
    #show a warning if a non-200 status code is returned
    if movies.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, movies.status_code))
    
    #break the loop if the number of requests exceeds 10
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break
   
    #parse the response content into a BeautifulSoup object
    movie_soup = BeautifulSoup(movies.text, 'html.parser')
    
    #find the major tag peculiar to each movie
    container = movie_soup.find_all('td', class_ = 'clamp-summary-wrap')
     
    #iterate through the major tag   
    for con in container:
        #scrape the movie names
        name = con.find('h3').text
        names.append(name)
    
        #scrape the release_dates
        release_date = con.select('div.clamp-details span')[0].text
        release_dates.append(release_date)
    
        #scrape the ratings
        rating = con.select('div.clamp-details span')[1].text
        ratings.append(rating)
    
        #scrape the meta scores
        meta_score= con.select('a.metascore_anchor div')[0].text
        meta_scores.append(meta_score)
    
        #scrape the user scores.
        user_score = con.select('a.metascore_anchor div')[2].text
        user_scores.append(user_score)
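The selectors in the loop can be tried offline on a small piece of markup. The HTML below is a hypothetical, simplified stand-in that only mimics the class names used above; it is not the real page. The helper `text_at` is also an addition for illustration: it returns None when an element is missing, whereas indexing directly (as in the loop) raises an IndexError in that case.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup that mimics the class names used in the loop.
SAMPLE_HTML = """
<table>
  <tr><td class="clamp-summary-wrap">
    <a class="title" href="#"><h3>Citizen Kane</h3></a>
    <div class="clamp-details"><span>September 4, 1941</span><span>Approved</span></div>
    <a class="metascore_anchor" href="#"><div>100</div></a>
    <a class="metascore_anchor" href="#"><div>100</div></a>
    <a class="metascore_anchor" href="#"><div>8.4</div></a>
  </td></tr>
  <tr><td class="clamp-summary-wrap">
    <a class="title" href="#"><h3>Untitled Example</h3></a>
    <div class="clamp-details"><span>January 1, 2000</span></div>
    <a class="metascore_anchor" href="#"><div>75</div></a>
  </td></tr>
</table>
"""


def text_at(tag, selector, index):
    """Return the stripped text of the index-th match, or None if it is missing."""
    matches = tag.select(selector)
    if index < len(matches):
        return matches[index].get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
rows = []
for con in soup.find_all('td', class_='clamp-summary-wrap'):
    rows.append({
        'name': text_at(con, 'h3', 0),
        'release_date': text_at(con, 'div.clamp-details span', 0),
        'rating': text_at(con, 'div.clamp-details span', 1),
        'meta_score': text_at(con, 'a.metascore_anchor div', 0),
        'user_score': text_at(con, 'a.metascore_anchor div', 2),
    })

for row in rows:
    print(row)
```

The second sample entry has no rating and no user score, so those fields come back as None instead of stopping the whole scrape.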
Parse the data into a DataFrame and store it in a CSV file
import pandas as pd
movie_df = pd.DataFrame({'Movie_names': names,
'Release_dates': release_dates,
'Ratings': ratings,
'Meta_scores': meta_scores,
'User_scores': user_scores})
# info() prints its summary itself and returns None, so it does not need print()
movie_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
Meta_scores      500 non-null object
Movie_names      500 non-null object
Ratings          500 non-null object
Release_dates    500 non-null object
User_scores      500 non-null object
dtypes: object(5)
memory usage: 19.6+ KB
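Writing the DataFrame to a CSV file is a single call to `to_csv`. The sketch below uses two sample rows in place of the scraped lists; the names `sample_df`, `csv_path` and the file name `movies.csv` are assumptions made for this example.

```python
import os
import tempfile

import pandas as pd

# Two sample rows standing in for the scraped lists.
sample_df = pd.DataFrame({
    'Movie_names': ['Citizen Kane', 'The Godfather'],
    'Release_dates': ['September 4, 1941', 'March 11, 1972'],
    'Ratings': ['Approved', 'R'],
    'Meta_scores': ['100', '100'],
    'User_scores': ['8.4', '9.2'],
})

# Hypothetical output location; any writable path works.
csv_path = os.path.join(tempfile.gettempdir(), 'movies.csv')

# index=False leaves out the 0..n-1 row index, which carries no information here.
sample_df.to_csv(csv_path, index=False)

# Read the file back to confirm that it round-trips.
restored = pd.read_csv(csv_path)
print(restored.shape)
```

For the full scrape, the same call on `movie_df` would write all 500 rows.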
movie_df
| | Meta_scores | Movie_names | Ratings | Release_dates | User_scores |
|---|---|---|---|---|---|
| 0 | 100 | Citizen Kane | Approved | September 4, 1941 | 8.4 |
| 1 | 100 | The Godfather | R | March 11, 1972 | 9.2 |
| 2 | 100 | Rear Window | TV-G | September 1, 1954 | 8.8 |
| 3 | 100 | Casablanca | TV-PG | January 23, 1943 | 9.0 |
| 4 | 100 | Boyhood | R | July 11, 2014 | 7.6 |
| 5 | 100 | Three Colors: Red | R | November 23, 1994 | 8.7 |
| 6 | 100 | Vertigo | PG | May 28, 1958 | 8.7 |
| 7 | 100 | Notorious | Approved | September 6, 1946 | 7.9 |
| 8 | 99 | Singin' in the Rain | G | April 11, 1952 | 8.8 |
| 9 | 99 | City Lights | Passed | March 7, 1931 | 8.2 |
| 10 | 99 | Moonlight | Not Rated | October 21, 2016 | 7.2 |
| 11 | 99 | Pinocchio | Passed | February 23, 1940 | 8.3 |
| 12 | 99 | Touch of Evil | PG-13 | February 1, 1958 | 7.8 |
| 13 | 98 | The Treasure of the Sierra Madre | TV-PG | January 24, 1948 | 8.5 |
| 14 | 98 | Pan's Labyrinth | R | December 29, 2006 | 8.7 |
| 15 | 98 | North by Northwest | TV-G | August 6, 1959 | 8.1 |
| 16 | 98 | Rashomon | Not Rated | December 26, 1951 | 8.3 |
| 17 | 98 | All About Eve | TV-PG | October 27, 1950 | 8.8 |
| 18 | 98 | Hoop Dreams | PG-13 | October 14, 1994 | 8.0 |
| 19 | 97 | My Left Foot | R | March 30, 1990 | 8.5 |
| 20 | 97 | The Third Man | Approved | September 3, 1949 | 8.2 |
| 21 | 97 | Dr. Strangelove or: How I Learned to Stop Worr... | GP | January 29, 1964 | 8.3 |
| 22 | 97 | Gone with the Wind | TV-PG | January 17, 1940 | 8.4 |
| 23 | 97 | 4 Months, 3 Weeks and 2 Days | Not Rated | January 23, 2008 | 7.9 |
| 24 | 97 | Some Like It Hot | Approved | March 29, 1959 | 8.3 |
| 25 | 97 | Psycho | M | September 8, 1960 | 9.1 |
| 26 | 97 | American Graffiti | PG | August 11, 1973 | 8.1 |
| 27 | 96 | Dumbo | Approved | October 31, 1941 | 8.1 |
| 28 | 96 | Roma | Not Rated | November 21, 2018 | 8.0 |
| 29 | 96 | Ran | R | December 20, 1985 | 8.4 |
| ... | ... | ... | ... | ... | ... |
| 470 | 82 | Good Bye, Dragon Inn | Not Rated | September 17, 2004 | 5.9 |
| 471 | 82 | Safe Conduct | Not Rated | October 11, 2002 | 6.2 |
| 472 | 82 | Leaving Las Vegas | R | October 27, 1995 | 8.9 |
| 473 | 82 | Maiden | Not Rated | June 28, 2019 | 7.6 |
| 474 | 82 | War for the Planet of the Apes | PG-13 | July 14, 2017 | 8.0 |
| 475 | 82 | Duma | PG | August 5, 2005 | 8.5 |
| 476 | 82 | The Constant Gardener | R | August 31, 2005 | 7.0 |
| 477 | 82 | Short Term 12 | R | August 23, 2013 | 8.5 |
| 478 | 82 | Parenthood | PG-13 | August 2, 1989 | 8.5 |
| 479 | 82 | Star Wars: Episode V - The Empire Strikes Back | PG | May 21, 1980 | 9.0 |
| 480 | 82 | Our Beloved Month of August | Not Rated | September 3, 2010 | tbd |
| 481 | 82 | Sugar | R | April 3, 2009 | 7.8 |
| 482 | 82 | Marwencol | Not Rated | October 8, 2010 | 7.7 |
| 483 | 82 | The Wind That Shakes the Barley | Not Rated | March 16, 2007 | 7.9 |
| 484 | 82 | Face/Off | R | June 27, 1997 | 8.8 |
| 485 | 82 | The Lobster | R | May 13, 2016 | 7.0 |
| 486 | 82 | The Nightmare Before Christmas | PG | October 22, 1993 | 8.6 |
| 487 | 82 | 2001: A Space Odyssey | G | April 2, 1968 | 8.1 |
| 488 | 82 | Pride & Prejudice | PG | November 11, 2005 | 8.7 |
| 489 | 82 | The Squid and the Whale | R | October 5, 2005 | 7.4 |
| 490 | 82 | Winged Migration | G | April 18, 2003 | 8.7 |
| 491 | 82 | Quince Tree of the Sun | Not Rated | May 5, 2000 | tbd |
| 492 | 82 | School of Rock | PG-13 | October 3, 2003 | 8.5 |
| 493 | 82 | Life and Nothing More | Not Rated | October 24, 2018 | 8.2 |
| 494 | 82 | Star Trek | PG-13 | May 7, 2009 | 7.9 |
| 495 | 82 | A Quiet Place | PG-13 | April 6, 2018 | 7.4 |
| 496 | 82 | Deliverance | TV-14 | July 21, 1972 | 7.7 |
| 497 | 82 | Frances Ha | R | May 17, 2013 | 7.7 |
| 498 | 82 | The Namesake | PG-13 | March 9, 2007 | 8.0 |
| 499 | 82 | A Hijacking | R | June 21, 2013 | 7.3 |
500 rows × 5 columns
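Every column is stored as text (dtype `object`), and some user scores are the placeholder `tbd`. Before any analysis, the columns would probably need converting to proper types. Here is a minimal sketch on three rows copied from the table; `typed_df` is a name made up for this example, and the date format string is an assumption based on how the dates appear in the table.

```python
import pandas as pd

# Three rows copied from the table, including one 'tbd' user score.
typed_df = pd.DataFrame({
    'Movie_names': ['Citizen Kane', 'Boyhood', 'Our Beloved Month of August'],
    'Release_dates': ['September 4, 1941', 'July 11, 2014', 'September 3, 2010'],
    'Meta_scores': ['100', '100', '82'],
    'User_scores': ['8.4', '7.6', 'tbd'],
})

# Metascores are whole numbers, so a plain integer conversion is enough.
typed_df['Meta_scores'] = typed_df['Meta_scores'].astype(int)

# errors='coerce' turns unparseable values such as 'tbd' into NaN.
typed_df['User_scores'] = pd.to_numeric(typed_df['User_scores'], errors='coerce')

# The dates follow the pattern 'Month day, year'.
typed_df['Release_dates'] = pd.to_datetime(typed_df['Release_dates'], format='%B %d, %Y')

print(typed_df.dtypes)
```

Coercing `tbd` to NaN keeps the row in the table while letting numeric operations such as `mean()` skip it.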