Data acquisition is the first stage of the data analysis pipeline. The web is a vast source of data. Often we can access this data through an application programming interface (API). However, not all the information we seek is accessible this way. We may also want hidden or additional information that no API exposes.
Python is an excellent general-purpose programming language that is very simple to use. Its power comes from that simplicity and from an extensive array of third-party libraries, which mean you don’t have to “reinvent the wheel”.
Under the Hood: The World Wide Web
Underlying the Web is the Hypertext Transfer Protocol, or HTTP (you may have seen it in the address bar of your browser). The Web is often loosely called the Net or the Internet, although strictly speaking the Internet is the network that the Web runs on. HTTP allows for the exchange of HTML documents. It is a client-server protocol and is stateless: each request is handled independently of any earlier one. The image below gives an example of simple GET requests.
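To make the request and response exchange concrete, here is a minimal sketch of what a GET request looks like as raw text, and how the first line of a response can be read. The helper names build_get_request and parse_status_line, the host www.example.com, and the sample response are all made up for illustration; nothing is sent over the network.

```python
def build_get_request(host, path="/"):
    """Return the raw text of a minimal HTTP/1.1 GET request (illustrative helper)."""
    lines = [
        f"GET {path} HTTP/1.1",   # request line: method, path, protocol version
        f"Host: {host}",          # the Host header is required in HTTP/1.1
        "Connection: close",      # ask the server to close the connection afterwards
        "",                       # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(response_text):
    """Split the first line of a raw HTTP response into (version, code, reason)."""
    status_line = response_text.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


# Hypothetical host and path, used only to show the shape of the request.
request = build_get_request("www.example.com", "/index.html")
print(request)

# A hand-written sample response, not captured from a real server.
sample_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html\r\n"
    "\r\n"
    "<html>...</html>"
)
print(parse_status_line(sample_response))
```

Because HTTP is stateless, every request carries everything the server needs (here, the path and the Host header), and the server answers with a status line, headers, and the document body.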
3 Lines of Python to Grab a Web Page
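In this simple example we will GET the HTML page from a website and print it out. We will just use a simple GET request via urlopen from the urllib.request module in the standard library. Note that read() returns bytes, so the page is printed as a bytes literal.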
from urllib.request import urlopen
full_html = urlopen('http://www.dolekemp96.org/main.htm')
print(full_html.read())
To make sense of the returned HTML and parse the data, we can use the Beautiful Soup library. You can read the docs for Beautiful Soup to get a better understanding of how it works. It can be installed with conda:
conda install beautifulsoup4
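As a rough illustration of what parsing with Beautiful Soup looks like, the sketch below pulls the title and the links out of a small HTML string. The sample HTML is invented for this example so that it runs without a network connection; a real page would come from urlopen as shown earlier.

```python
from bs4 import BeautifulSoup

# A made-up page, standing in for HTML fetched from a real site.
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>Welcome</h1>
    <a href="/about.htm">About</a>
    <a href="/contact.htm">Contact</a>
  </body>
</html>
"""

# "html.parser" is the parser bundled with Python, so no extra install is needed.
soup = BeautifulSoup(sample_html, "html.parser")

# Text of the <title> tag.
page_title = soup.title.string

# The href attribute of every <a> tag, in document order.
links = [a["href"] for a in soup.find_all("a")]

print(page_title)
print(links)
```

The same calls work on the bytes returned by urlopen(...).read(), since BeautifulSoup accepts either a string or bytes as its first argument.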
Super Simple Web Page Grabber
The code below is very rudimentary, but it can be used to grab a web page for parsing, and it is the basic template I use for simple scraping projects.
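from urllib.request import urlopen
from urllib.error import HTTPError, URLError

try:
    # Replace '#WEB_SITE' with the URL of the page to fetch
    html = urlopen('#WEB_SITE')
except HTTPError as e:
    # The server responded, but with an error status (for example 404)
    print(e)
except URLError as e:
    # The server could not be reached at all
    print("Server not found!")
else:
    print('Success')
    print(html.read())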
This, of course, only works for simple websites, but it is a good start.
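One possible next step, offered as a sketch rather than a finished tool, is to wrap the template in a reusable function. The name fetch_html, the 10-second default timeout, and the fallback to UTF-8 when the server declares no character set are all assumptions made for this example.

```python
from urllib.request import urlopen
from urllib.error import HTTPError, URLError


def fetch_html(url, timeout=10):
    """Return the page at `url` as text, or None if the request fails.

    Hypothetical helper: the name, the timeout default, and the UTF-8
    fallback are illustrative choices, not part of urllib.
    """
    try:
        # The with-block closes the connection once the body has been read.
        with urlopen(url, timeout=timeout) as response:
            # Use the charset declared by the server, if any.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    except HTTPError as e:
        # The server answered with an error status such as 404 or 500.
        print(f"HTTP error: {e.code} {e.reason}")
    except URLError as e:
        # The server could not be reached (bad host name, refused connection, ...).
        print(f"Could not reach server: {e.reason}")
    return None
```

Calling fetch_html('http://www.dolekemp96.org/main.htm') would return the page as a string rather than bytes, ready to hand to Beautiful Soup, and a failed request returns None instead of raising. A string that is not a valid URL at all still raises ValueError, which this sketch does not catch.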