Data Scraping

On Monday, 31 August, we had a guest lecture at Sabudh Foundation where we were introduced to the concept of Data Scraping.

What is Data Scraping?

Data scraping, also known as web scraping, is the process of importing information from a website into a CSV or other local file. It’s one of the most efficient ways to get data from the web, and in some cases to channel that data to another website.


Popular uses of data scraping include:

  • Research for web content/business intelligence
  • Pricing for travel booking sites or price comparison sites
  • Finding sales leads or conducting market research by crawling public data sources (e.g. Yell and Twitter)
  • Sending product data from an e-commerce site to another online vendor (e.g. Google Shopping)

Scraping using Python


There are numerous packages available for web scraping in Python, but we only need a handful of them to scrape almost any site. Some of these libraries are named below:

  • Requests
  • BeautifulSoup
  • lxml
  • Selenium
  • Scrapy

Note that not every package mentioned above is required for scraping the web: Requests is the one that is always needed, while the others depend on the use case of the project.

Note: In this lecture we were introduced to Beautiful Soup.

The Requests Module

The Requests module is a simple yet powerful HTTP library, which means that one can use it to access web pages. It is the library that defines how one communicates with websites.

Its simplicity is definitely its greatest strength: one can easily use it without constantly consulting the documentation.

For example, if we want to pull down the contents of a page, it can be done as:

import requests

# Fetch the page; .content holds the raw bytes of the response body
page = requests.get('http://examplesite.com')
contents = page.content

The variable contents will hold the raw HTML of the website whose URL is given to the get() method of the Requests library.

Beyond this, the Requests module can access APIs, post to forms, and perform various other tasks.
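
As a minimal sketch (the endpoint and form fields here are hypothetical), posting form data looks like this:

import requests

# POST form data to a hypothetical login endpoint
response = requests.post('http://examplesite.com/login',
                         data={'username': 'alice', 'password': 'secret'})
print(response.status_code)   # e.g. 200 on success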

Beautiful Soup

Beautiful Soup (BS4) is a parsing library that can use different parsers. A parser is simply a program that can extract data from HTML and XML documents.

One advantage of BS4 is its ability to automatically detect encodings, which allows it to gracefully handle HTML documents with special characters. In addition, BS4 helps us navigate a parsed document and find what we need, which makes it quick and painless to build common applications.

For example, if we want to find all the links in the web page we fetched earlier (using Requests), it can be done as:

from bs4 import BeautifulSoup

# Parse the HTML fetched earlier and collect every anchor tag
soup = BeautifulSoup(contents, 'html.parser')
soup.find_all('a')


soup is an object of the BeautifulSoup class, which takes in the HTML of the website and a parser to go through that HTML.

Once the object is created, we use the find_all() method to get all the anchor tags ('a'), which returns a list of all the links present in the web page.
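
Each element of that list is a Tag object, so the actual URLs can be printed like this (a small sketch building on the soup object above):

# get('href') returns the link target of each tag (or None if absent)
for link in soup.find_all('a'):
    print(link.get('href'))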

Methods and Attributes From Beautiful Soup

text

The Beautiful Soup object has a text attribute that returns the plain text of an HTML document. Given a simple soup of <p>Hello world</p>, the text attribute returns Hello world.
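
A minimal sketch of this:

from bs4 import BeautifulSoup

# The text attribute strips the markup and keeps only the text content
soup = BeautifulSoup('<p>Hello world</p>', 'html.parser')
print(soup.text)   # Hello world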

find()

The find() method is used to get the first instance of any HTML tag, such as an anchor tag or a paragraph tag. After getting the first instance of the particular tag, one can get its text using the text attribute of Beautiful Soup.
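
For example (a small self-contained sketch):

from bs4 import BeautifulSoup

# find() returns only the first matching tag, unlike find_all()
soup = BeautifulSoup('<p>First</p><p>Second</p>', 'html.parser')
print(soup.find('p').text)   # First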

attrs

attrs is similar to the text attribute in Beautiful Soup; it helps us get the values present in the attributes of various tags. For example, we can get the src attribute of an image tag or the href value of an anchor tag.
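
A minimal sketch (the URL is just a placeholder):

from bs4 import BeautifulSoup

# attrs is a dictionary of all the attributes on the tag
link = BeautifulSoup('<a href="http://examplesite.com">Example</a>',
                     'html.parser').find('a')
print(link.attrs)           # {'href': 'http://examplesite.com'}
print(link.attrs['href'])   # http://examplesite.com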

