Web Scraping
This notebook contains the final code for the web scraping article I wrote on Medium. It can be accessed here: https://medium.com/p/32b0ceeee538/edit
Import all the required modules
# Only `get` is imported from requests: the name `requests` is reused below as a request counter
from requests import get
from bs4 import BeautifulSoup
from IPython.display import clear_output
from warnings import warn
from time import sleep, time
from random import randint
Specify a User-Agent header so that our requests look like they come from a browser and are accepted
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'}
Build the list of page numbers to request: 0 to 8 in steps of 2
pages = [str(i) for i in range(0,10,2)]
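As a quick check of what this range produces, the sketch below builds the page numbers and the URLs they map to. The names `BASE_URL` and `urls` are introduced here for illustration only and are not part of the notebook; the URL string is the one requested in the scraping loop.

```python
# Hypothetical names for illustration: BASE_URL and urls are not part of the notebook.
BASE_URL = 'https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page='

# range(0, 10, 2) stops before 10, so it yields 0, 2, 4, 6 and 8
pages = [str(i) for i in range(0, 10, 2)]
urls = [BASE_URL + page for page in pages]

print(pages)
print(len(urls))
```

The page numbers are kept as strings so they can be concatenated to the URL without a conversion step.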
Declare lists to store the scraped data
names = []
release_dates = []
ratings = []
meta_scores = []
user_scores = []
Prepare the monitoring loop
start_time = time()
requests = 0
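These two values are all the loop needs to report a request frequency. As a rough sketch of the same bookkeeping, the hypothetical `RequestMonitor` class below (not part of the notebook) wraps the counter and the start time. It accepts an injectable clock, which is an assumption made here so the arithmetic can be checked without actually waiting.

```python
from time import time


class RequestMonitor:
    """Hypothetical helper: counts requests and reports requests per second."""

    def __init__(self, clock=time):
        self.clock = clock              # callable returning seconds, time() by default
        self.start_time = clock()
        self.count = 0

    def record(self):
        """Register one request and return the current frequency."""
        self.count += 1
        elapsed = self.clock() - self.start_time
        # Guard against a zero elapsed time on very fast clocks
        return self.count / elapsed if elapsed > 0 else float('inf')


# Usage with a fake clock that advances 10 seconds per call
ticks = iter([0, 10, 20])
monitor = RequestMonitor(clock=lambda: next(ticks))
first = monitor.record()    # 1 request after 10 s
second = monitor.record()   # 2 requests after 20 s
print(first, second)
```

With the default clock, calling `monitor.record()` once per request gives the same figure that the notebook prints as `requests/elapsed_time`.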
Iterate through the pages
for page in pages:
    # make a GET request
    movies = get('https://www.metacritic.com/browse/movies/score/metascore/all/filtered?page=' + page, headers=headers)
    # pause the loop for 8-20 seconds
    sleep(randint(8, 20))
    # monitor the requests
    requests += 1
    elapsed_time = time() - start_time
    print('Request: {}; Frequency: {} requests/s'.format(requests, requests / elapsed_time))
    clear_output(wait=True)
    # show a warning if a non-200 status code is returned
    if movies.status_code != 200:
        warn('Request: {}; Status code: {}'.format(requests, movies.status_code))
    # break the loop if the number of requests exceeds 10
    if requests > 10:
        warn('Number of requests was greater than expected.')
        break
    # parse the response content into a BeautifulSoup object
    movie_soup = BeautifulSoup(movies.text, 'html.parser')
    # find the tag that wraps each movie
    container = movie_soup.find_all('td', class_='clamp-summary-wrap')
    # iterate through the movie containers
    for con in container:
        # scrape the movie name
        name = con.find('h3').text
        names.append(name)
        # scrape the release date (first span in the details block)
        release_date = con.select('div.clamp-details span')[0].text
        release_dates.append(release_date)
        # scrape the rating (second span in the details block)
        rating = con.select('div.clamp-details span')[1].text
        ratings.append(rating)
        # scrape the meta score
        meta_score = con.select('a.metascore_anchor div')[0].text
        meta_scores.append(meta_score)
        # scrape the user score
        user_score = con.select('a.metascore_anchor div')[2].text
        user_scores.append(user_score)
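The indexed lookups above raise an `IndexError` when a container lacks one of the expected elements. One way to guard against that is a small helper that returns `None` instead. The sketch below runs on a hand-written HTML fragment that only imitates the class names used above, so the real page markup may differ; `text_at` and `SAMPLE_HTML` are hypothetical names.

```python
from bs4 import BeautifulSoup

# Hand-written fragment that mimics the class names used in the notebook;
# it is an assumption about the page structure, not a copy of the real page.
SAMPLE_HTML = """
<table><tr><td class="clamp-summary-wrap">
  <a class="title"><h3>Citizen Kane</h3></a>
  <div class="clamp-details"><span>September 4, 1941</span><span>Approved</span></div>
  <a class="metascore_anchor"><div>100</div></a>
</td></tr></table>
"""


def text_at(container, selector, index):
    """Return the stripped text of the index-th match, or None if it is missing."""
    matches = container.select(selector)
    if index < len(matches):
        return matches[index].get_text(strip=True)
    return None


soup = BeautifulSoup(SAMPLE_HTML, 'html.parser')
con = soup.find('td', class_='clamp-summary-wrap')

name = text_at(con, 'h3', 0)
release_date = text_at(con, 'div.clamp-details span', 0)
rating = text_at(con, 'div.clamp-details span', 1)
meta_score = text_at(con, 'a.metascore_anchor div', 0)
user_score = text_at(con, 'a.metascore_anchor div', 2)  # absent in this fragment

print(name, release_date, rating, meta_score, user_score)
```

Appending `None` for a missing field keeps the five lists the same length, which matters because the DataFrame constructor rejects columns of unequal length.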
Parse the data into a dataframe and store in a csv file
import pandas as pd
movie_df = pd.DataFrame({'Movie_names': names,
                         'Release_dates': release_dates,
                         'Ratings': ratings,
                         'Meta_scores': meta_scores,
                         'User_scores': user_scores})
print(movie_df.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 500 entries, 0 to 499
Data columns (total 5 columns):
Meta_scores 500 non-null object
Movie_names 500 non-null object
Ratings 500 non-null object
Release_dates 500 non-null object
User_scores 500 non-null object
dtypes: object(5)
memory usage: 19.6+ KB
None
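`info()` reports every column as `object`, which means the scraped values are still strings. A possible next step, sketched here on three rows taken from the scraped output rather than on the full frame, is to convert the score and date columns. The names `sample` and `cleaned` are hypothetical, and the date format string assumes every date looks like `September 4, 1941`.

```python
import pandas as pd

# Three rows copied from the scraped output, all values still strings
sample = pd.DataFrame({
    'Movie_names': ['Citizen Kane', 'The Godfather', 'Our Beloved Month of August'],
    'Release_dates': ['September 4, 1941', 'March 11, 1972', 'September 3, 2010'],
    'Meta_scores': ['100', '100', '82'],
    'User_scores': ['8.4', '9.2', 'tbd'],
})

cleaned = sample.copy()
# errors='coerce' turns unparseable entries such as 'tbd' into NaN
cleaned['Meta_scores'] = pd.to_numeric(cleaned['Meta_scores'], errors='coerce')
cleaned['User_scores'] = pd.to_numeric(cleaned['User_scores'], errors='coerce')
# Assumed format: full month name, day, four-digit year
cleaned['Release_dates'] = pd.to_datetime(cleaned['Release_dates'], format='%B %d, %Y')

print(cleaned.dtypes)
```

Coercing `tbd` to `NaN` is one choice among several; dropping those rows or keeping the column as text are equally valid depending on the analysis.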
movie_df
| | Meta_scores | Movie_names | Ratings | Release_dates | User_scores |
|---|---|---|---|---|---|
| 0 | 100 | Citizen Kane | Approved | September 4, 1941 | 8.4 |
| 1 | 100 | The Godfather | R | March 11, 1972 | 9.2 |
| 2 | 100 | Rear Window | TV-G | September 1, 1954 | 8.8 |
| 3 | 100 | Casablanca | TV-PG | January 23, 1943 | 9.0 |
| 4 | 100 | Boyhood | R | July 11, 2014 | 7.6 |
| 5 | 100 | Three Colors: Red | R | November 23, 1994 | 8.7 |
| 6 | 100 | Vertigo | PG | May 28, 1958 | 8.7 |
| 7 | 100 | Notorious | Approved | September 6, 1946 | 7.9 |
| 8 | 99 | Singin' in the Rain | G | April 11, 1952 | 8.8 |
| 9 | 99 | City Lights | Passed | March 7, 1931 | 8.2 |
| 10 | 99 | Moonlight | Not Rated | October 21, 2016 | 7.2 |
| 11 | 99 | Pinocchio | Passed | February 23, 1940 | 8.3 |
| 12 | 99 | Touch of Evil | PG-13 | February 1, 1958 | 7.8 |
| 13 | 98 | The Treasure of the Sierra Madre | TV-PG | January 24, 1948 | 8.5 |
| 14 | 98 | Pan's Labyrinth | R | December 29, 2006 | 8.7 |
| 15 | 98 | North by Northwest | TV-G | August 6, 1959 | 8.1 |
| 16 | 98 | Rashomon | Not Rated | December 26, 1951 | 8.3 |
| 17 | 98 | All About Eve | TV-PG | October 27, 1950 | 8.8 |
| 18 | 98 | Hoop Dreams | PG-13 | October 14, 1994 | 8.0 |
| 19 | 97 | My Left Foot | R | March 30, 1990 | 8.5 |
| 20 | 97 | The Third Man | Approved | September 3, 1949 | 8.2 |
| 21 | 97 | Dr. Strangelove or: How I Learned to Stop Worr... | GP | January 29, 1964 | 8.3 |
| 22 | 97 | Gone with the Wind | TV-PG | January 17, 1940 | 8.4 |
| 23 | 97 | 4 Months, 3 Weeks and 2 Days | Not Rated | January 23, 2008 | 7.9 |
| 24 | 97 | Some Like It Hot | Approved | March 29, 1959 | 8.3 |
| 25 | 97 | Psycho | M | September 8, 1960 | 9.1 |
| 26 | 97 | American Graffiti | PG | August 11, 1973 | 8.1 |
| 27 | 96 | Dumbo | Approved | October 31, 1941 | 8.1 |
| 28 | 96 | Roma | Not Rated | November 21, 2018 | 8.0 |
| 29 | 96 | Ran | R | December 20, 1985 | 8.4 |
| ... | ... | ... | ... | ... | ... |
| 470 | 82 | Good Bye, Dragon Inn | Not Rated | September 17, 2004 | 5.9 |
| 471 | 82 | Safe Conduct | Not Rated | October 11, 2002 | 6.2 |
| 472 | 82 | Leaving Las Vegas | R | October 27, 1995 | 8.9 |
| 473 | 82 | Maiden | Not Rated | June 28, 2019 | 7.6 |
| 474 | 82 | War for the Planet of the Apes | PG-13 | July 14, 2017 | 8.0 |
| 475 | 82 | Duma | PG | August 5, 2005 | 8.5 |
| 476 | 82 | The Constant Gardener | R | August 31, 2005 | 7.0 |
| 477 | 82 | Short Term 12 | R | August 23, 2013 | 8.5 |
| 478 | 82 | Parenthood | PG-13 | August 2, 1989 | 8.5 |
| 479 | 82 | Star Wars: Episode V - The Empire Strikes Back | PG | May 21, 1980 | 9.0 |
| 480 | 82 | Our Beloved Month of August | Not Rated | September 3, 2010 | tbd |
| 481 | 82 | Sugar | R | April 3, 2009 | 7.8 |
| 482 | 82 | Marwencol | Not Rated | October 8, 2010 | 7.7 |
| 483 | 82 | The Wind That Shakes the Barley | Not Rated | March 16, 2007 | 7.9 |
| 484 | 82 | Face/Off | R | June 27, 1997 | 8.8 |
| 485 | 82 | The Lobster | R | May 13, 2016 | 7.0 |
| 486 | 82 | The Nightmare Before Christmas | PG | October 22, 1993 | 8.6 |
| 487 | 82 | 2001: A Space Odyssey | G | April 2, 1968 | 8.1 |
| 488 | 82 | Pride & Prejudice | PG | November 11, 2005 | 8.7 |
| 489 | 82 | The Squid and the Whale | R | October 5, 2005 | 7.4 |
| 490 | 82 | Winged Migration | G | April 18, 2003 | 8.7 |
| 491 | 82 | Quince Tree of the Sun | Not Rated | May 5, 2000 | tbd |
| 492 | 82 | School of Rock | PG-13 | October 3, 2003 | 8.5 |
| 493 | 82 | Life and Nothing More | Not Rated | October 24, 2018 | 8.2 |
| 494 | 82 | Star Trek | PG-13 | May 7, 2009 | 7.9 |
| 495 | 82 | A Quiet Place | PG-13 | April 6, 2018 | 7.4 |
| 496 | 82 | Deliverance | TV-14 | July 21, 1972 | 7.7 |
| 497 | 82 | Frances Ha | R | May 17, 2013 | 7.7 |
| 498 | 82 | The Namesake | PG-13 | March 9, 2007 | 8.0 |
| 499 | 82 | A Hijacking | R | June 21, 2013 | 7.3 |
500 rows × 5 columns
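To persist the frame, pandas' `DataFrame.to_csv` can write it to disk. The sketch below uses a two-row stand-in for `movie_df` (named `sample_df` here) and a hypothetical file name, `movies.csv`, inside a temporary directory, then reads the file back to check the round trip.

```python
import os
import tempfile

import pandas as pd

# Two-row stand-in for movie_df, using values from the output above
sample_df = pd.DataFrame({'Movie_names': ['Citizen Kane', 'A Hijacking'],
                          'Meta_scores': ['100', '82'],
                          'User_scores': ['8.4', '7.3']})

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'movies.csv')  # hypothetical file name
    # index=False leaves the 0..n-1 row labels out of the file
    sample_df.to_csv(csv_path, index=False)
    # dtype=str keeps the values as strings, matching the scraped frame
    restored = pd.read_csv(csv_path, dtype=str)

print(restored)
```

In the notebook itself the equivalent call would be `movie_df.to_csv('movies.csv', index=False)`, with the file name chosen freely.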