Garbage In – Getting the Data – Basic WebScraping With Python

Data acquisition is the first stage of the data analysis pipeline, and the web is a vast source of data. Often we can access this data through an Application Programming Interface (API). However, not all the information we may want is exposed this way; sometimes the data we are after only exists in the pages themselves.

Python

Python is an excellent, very easy-to-use general-purpose programming language. The power of Python comes from its simplicity and from the extensive array of third-party libraries, which means you don’t have to “reinvent the wheel”.

Under the Hood: The World Wide Web AKA the Net or Internet

Underlying the World Wide Web is the Hypertext Transfer Protocol, or HTTP (you may have seen it in the address bar of your browser). HTTP allows for the exchange of HTML documents; it is a stateless, client-server protocol. The image below gives an example of a simple GET request.

[Image: a simple HTTP GET request and response]
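If you want to see that request/response cycle from Python itself, the standard library’s http.client module exposes the status line and headers directly. The sketch below uses example.com as a stand-in host:

from http.client import HTTPConnection

# Open a connection and send a plain GET request for the root page
conn = HTTPConnection('example.com')
conn.request('GET', '/')

# The response carries the status line and headers the server sent back
response = conn.getresponse()
print(response.status, response.reason)   # e.g. 200 OK
print(response.getheaders())              # list of (header, value) pairs
conn.close()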

3 Lines of Python to Grab a Web Page

In this simple example we will GET an HTML page from a website and print it out, using nothing more than a simple GET request via urlopen from the urllib.request module.

from urllib.request import urlopen

# Fetch the page; urlopen returns a file-like response object
full_html = urlopen('http://www.dolekemp96.org/main.htm')

# read() returns the raw bytes of the page (decode() would turn them into a string)
print(full_html.read())

In order to make sense of the returned HTML and parse the data, we can make use of the Beautiful Soup library. You can read the docs for Beautiful Soup to get a better understanding of how it works. It can be installed with conda:

conda install beautifulsoup4
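Note that the package installs as beautifulsoup4 but imports as bs4. As a minimal sketch (the tags extracted here are just examples), parsing the page we fetched above might look like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen('http://www.dolekemp96.org/main.htm')
soup = BeautifulSoup(html.read(), 'html.parser')  # use Python's built-in parser

print(soup.title.string)            # the text of the page's <title> tag
for link in soup.find_all('a'):     # every anchor tag on the page
    print(link.get('href'))         # its href attribute (None if missing)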

Super Simple Web Page Grabber

The code below is very rudimentary, but it can be used to grab a web page for parsing, and it is the basic template I use for simple scraping projects.

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # Replace '#WEB_SITE' with the URL of the page you want to fetch
    html = urlopen('#WEB_SITE')
except HTTPError as e:
    # The server was reached but returned an error status (e.g. 404, 500)
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Server not found!")
else:
    print('Success')
    print(html.read())

This of course only works for simple, static websites, but it is a good start.
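One common snag is that some servers reject requests carrying urllib’s default User-Agent. A small extension of the template (a sketch; the browser string below is illustrative, not special) is to wrap the URL in a Request object with your own headers:

from urllib.request import urlopen, Request

# Some servers block urllib's default User-Agent; a browser-like string often helps
req = Request(
    'http://www.dolekemp96.org/main.htm',
    headers={'User-Agent': 'Mozilla/5.0 (compatible; my-scraper/1.0)'},
)
html = urlopen(req)
print(html.read())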
