
Modern Web Scraping With BeautifulSoup and Selenium


Overview

HTML is almost intuitive. CSS is a great advancement that cleanly separates the structure of a page from its look and feel. JavaScript adds some pizazz. That is the theory. The real world is a little different.

In this tutorial, you'll learn how the content you see in the browser actually gets rendered and how to go about scraping it when necessary. In particular, you'll learn how to count Disqus comments. Our tools will be Python and awesome packages like requests, BeautifulSoup, and Selenium.

When Should You Use Web Scraping?

Web scraping is the practice of automatically fetching the content of web pages designed for interaction with human users, parsing them, and extracting some information (possibly navigating links to other pages). It is sometimes necessary if there is no other way to extract the needed information. Ideally, the application provides a dedicated API for accessing its data programmatically. There are several reasons web scraping should be your last resort:

  • It's fragile (the web pages you are scraping might change frequently).
  • It may be forbidden (some web apps have policies against scraping).
  • It can be slow and expensive (if you need to fetch and wade through a lot of noise).

Understanding Real-World Web Pages

Let's understand what we are up against by looking at the output of some common web application code. In the article Introduction to Vagrant, there are some Disqus comments at the bottom of the page:

[Image: Understanding Real-World Web Pages]

In order to scrape these comments, we need to find them on the page first.

View Page Source

Every browser since the dawn of time (the 1990s) has supported the ability to view the HTML of the current page. Here is a snippet from the view source of Introduction to Vagrant that starts with a big chunk of minified and uglified JavaScript unrelated to the article itself. Here is a small portion of it:

[Image: Page Source]

Here is some actual HTML from the page:

[Image: HTML From the Page]

This looks quite messy, but what is surprising is that you will not find the Disqus comments in the source of the page.

The Mighty Inline Frame

It turns out that the page is a mashup, and the Disqus comments are embedded as an iframe (inline frame) element. You can find this out by right-clicking on the comments area, and you'll see that there is frame information and source there:

[Image: The Mighty Inline Frame]

That makes sense. Embedding third-party content as an iframe is one of the primary reasons to use iframes. Let's find the <iframe> tag then in the main page source. Foiled again! There is no <iframe> tag in the main page source.

JavaScript-Generated Markup

The reason for this omission is that view page source shows you the content that was fetched from the server. But the final DOM (document object model) that gets rendered by the browser may be very different. JavaScript kicks in and can manipulate the DOM at will. The iframe can't be found, because it wasn't there when the page was retrieved from the server.
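
You can verify this yourself with a static fetch. Here is a minimal sketch (the URL is a placeholder for the Vagrant article, and it assumes the Disqus frame is the only iframe you care about):

import requests

# Fetch the raw HTML exactly as the server returns it, before any JavaScript runs.
url = 'https://code.tutsplus.com/introduction-to-vagrant'  # placeholder URL
html = requests.get(url).text

# The Disqus iframe is injected later by JavaScript, so it is absent here.
print('<iframe' in html)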

Static Scraping vs. Dynamic Scraping

Static scraping ignores JavaScript. It fetches web pages from the server without the help of a browser. You get exactly what you see in "view page source", and then you slice and dice it. If the content you're looking for is available, you need to go no further. However, if the content is something like the Disqus comments iframe, you need dynamic scraping.

Dynamic scraping uses an actual browser (or a headless browser) and lets JavaScript do its thing. Then, it queries the DOM to extract the content it's looking for. Sometimes you need to automate the browser by simulating a user to get the content you need.

Static Scraping With Requests and BeautifulSoup

Let's see how static scraping works using two awesome Python packages: requests for fetching web pages and BeautifulSoup for parsing HTML pages.

Installing Requests and BeautifulSoup

Install pipenv first, and then: pipenv install requests beautifulsoup4

This will create a virtual environment for you too. If you're using the code from GitLab, you can just pipenv install.

Fetching Pages

Fetching a page with requests is a one-liner: r = requests.get(url)

The response object has a lot of attributes. The most important ones are ok and content. If the request fails, then r.ok will be False and r.content will contain the error. The content is a stream of bytes. It is usually better to decode it to utf-8 when dealing with text:

>>> r = requests.get('http://www.c2.com/no-such-page')
>>> r.ok
False
>>> print(r.content.decode('utf-8'))
<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML 2.0//EN">
<html><head>
<title>404 Not Found</title>
</head><body>
<h1>Not Found</h1>
<p>The requested URL /ggg was not found on this server.</p>
<hr>
<address>
Apache/2.0.52 (CentOS) Server at www.c2.com Port 80
</address>
</body></html>

If everything is OK, then r.content will contain the requested web page (same as view page source).
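
If you would rather fail fast than silently parse an error page, a small wrapper helps. This is a minimal sketch (the fetch_page name is mine, not part of the original code):

import requests

def fetch_page(url):
    # r.ok is False for any 4xx or 5xx status code.
    r = requests.get(url)
    if not r.ok:
        raise RuntimeError(f'Failed to fetch {url}: HTTP {r.status_code}')
    return r.content.decode('utf-8')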

Finding Elements With BeautifulSoup

The get_page() function below fetches a web page by URL, decodes it to UTF-8, and parses it into a BeautifulSoup object using the HTML parser.

import requests
from bs4 import BeautifulSoup

def get_page(url):
    r = requests.get(url)
    content = r.content.decode('utf-8')
    return BeautifulSoup(content, 'html.parser')

Once we have a BeautifulSoup object, we can start extracting information from the page. BeautifulSoup provides many find functions to locate elements inside the page and drill down into deeply nested elements.
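
For instance, given the soup object returned by get_page(), the most common calls look like this (the tag names and selectors here are illustrative, not taken from the Tuts+ markup):

soup = get_page('https://example.com')

first_article = soup.find('article')          # first matching tag
all_articles = soup.find_all('article')       # every matching tag
posts = soup.find_all('div', class_='post')   # filter by attribute
links = soup.select('article a[href]')        # CSS selector syntax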

Tuts+ author pages contain multiple tutorials. Here is my author page. On each page, there are up to 12 tutorials. If you have more than 12 tutorials, then you can navigate to the next page (a sketch of walking the pages appears after the output below). The HTML for each article is enclosed in an <article> tag. The following function finds all the article elements on the page, drills down to their links, and extracts the href attribute to get the URL of the tutorial:

def get_page_articles(page):
    elements = page.findAll('article')
    articles = [e.a.attrs['href'] for e in elements]
    return articles

The following code gets all the articles from my page and prints them (without the common prefix):

page = get_page('https://tutsplus.com/authors/gigi-sayfan')
articles = get_page_articles(page)
prefix = 'https://code.tutsplus.com/tutorials'
for a in articles:
    print(a[len(prefix):])

Output:

building-games-with-python-3-and-pygame-part-5--cms-30085
building-games-with-python-3-and-pygame-part-4--cms-30084
building-games-with-python-3-and-pygame-part-3--cms-30083
building-games-with-python-3-and-pygame-part-2--cms-30082
building-games-with-python-3-and-pygame-part-1--cms-30081
mastering-the-react-lifecycle-methods--cms-29849
testing-data-intensive-code-with-go-part-5--cms-29852
testing-data-intensive-code-with-go-part-4--cms-29851
testing-data-intensive-code-with-go-part-3--cms-29850
testing-data-intensive-code-with-go-part-2--cms-29848
testing-data-intensive-code-with-go-part-1--cms-29847
make-your-go-programs-lightning-fast-with-profiling--cms-29809
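
Since each author page shows at most 12 tutorials, collecting everything means walking the pagination mentioned above. Here is a minimal sketch, assuming the pages are addressable with a ?page=N query parameter (the parameter name is a guess, not verified against the site):

def get_all_articles(author_url, max_pages=10):
    articles = []
    for page_number in range(1, max_pages + 1):
        page = get_page(f'{author_url}?page={page_number}')
        page_articles = get_page_articles(page)
        if not page_articles:  # an empty page means we ran out of results
            break
        articles.extend(page_articles)
    return articles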

Dynamic Scraping With Selenium

Static scraping was good enough to get the list of articles, but as we saw earlier, the Disqus comments are embedded as an iframe element by JavaScript. In order to harvest the comments, we will need to automate the browser and interact with the DOM interactively. One of the best tools for the job is Selenium.

Selenium is primarily geared towards automated testing of web applications, but it is great as a general-purpose browser automation tool.

Installing Selenium

Type this command to install Selenium: pipenv install selenium

Choose Your Web Driver

Selenium needs a web driver (the browser it automates). For web scraping, it usually doesn't matter which driver you choose. I prefer the Chrome driver. Follow the instructions in this Selenium guide.

Chrome vs. PhantomJS

In some cases you may prefer to use a headless browser, which means no UI is displayed. Theoretically, PhantomJS is just another web driver. But, in practice, people reported incompatibility issues where Selenium works properly with Chrome or Firefox and sometimes fails with PhantomJS. I prefer to remove this variable from the equation and use an actual browser web driver.
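
Creating the driver is a one-time setup. Here is a minimal sketch for Chrome, with the headless option left commented out (it uses the older chrome_options keyword to match the Selenium API used in the rest of this tutorial):

from selenium import webdriver

options = webdriver.ChromeOptions()
# options.add_argument('headless')  # uncomment to run Chrome without a UI

driver = webdriver.Chrome(chrome_options=options)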

Counting Disqus Comments

Let's do some dynamic scraping and use Selenium to count Disqus comments on Tuts+ tutorials. Here are the necessary imports.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.expected_conditions import (
    presence_of_element_located)
from selenium.webdriver.support.wait import WebDriverWait

The get_comment_count() function accepts a Selenium driver and URL. It uses the get() method of the driver to fetch the URL. This is similar to requests.get(), but the difference is that the driver object manages a live representation of the DOM.

Then, it gets the title of the tutorial and locates the Disqus iframe using its parent id disqus_thread and then the iframe itself:

def get_comment_count(driver, url):
    driver.get(url)
    class_name = 'content-banner__title'
    name = driver.find_element_by_class_name(class_name).text
    e = driver.find_element_by_id('disqus_thread')
    disqus_iframe = e.find_element_by_tag_name('iframe')
    iframe_url = disqus_iframe.get_attribute('src')

The next step is to fetch the contents of the iframe itself. Note that we wait for the comment-count element to be present because the comments are loaded dynamically and not necessarily available yet.

    driver.get(iframe_url)
    wait = WebDriverWait(driver, 5)
    commentCountPresent = presence_of_element_located(
        (By.CLASS_NAME, 'comment-count'))
    wait.until(commentCountPresent)

    comment_count_span = driver.find_element_by_class_name(
        'comment-count')
    comment_count = int(comment_count_span.text.split()[0])

The last part is to return the last comment if it wasn't made by me. The idea is to detect comments I haven't responded to yet.

    last_comment = {}
    if comment_count > 0:
        # Grab the last author element and read its username from the link.
        e = driver.find_elements_by_class_name('author')[-1]
        author_link = e.find_element_by_tag_name('a')
        last_author = author_link.get_attribute('data-username')
        if last_author != 'the_gigi':
            e = driver.find_elements_by_class_name('post-meta')
            meta = e[-1].find_element_by_tag_name('a')
            last_comment = dict(
                author=last_author,
                title=meta.get_attribute('title'),
                when=meta.text)
    return name, comment_count, last_comment
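
Putting it all together, here is a minimal sketch that drives the function over a couple of articles (the URLs are placeholders, not real tutorial slugs):

driver = webdriver.Chrome()
urls = [
    'https://code.tutsplus.com/tutorials/some-tutorial--cms-11111',     # placeholder
    'https://code.tutsplus.com/tutorials/another-tutorial--cms-22222',  # placeholder
]
try:
    for url in urls:
        name, count, last_comment = get_comment_count(driver, url)
        print(f'{name}: {count} comments')
        if last_comment:
            print('  last comment by:', last_comment['author'])
finally:
    driver.quit()  # always release the browser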

Conclusion

Web scraping is a useful practice when the information you need is accessible through a web application that doesn't provide an appropriate API. It takes some non-trivial work to extract data from modern web applications, but mature and well-designed tools like requests, BeautifulSoup, and Selenium make it worthwhile.

Additionally, don't hesitate to see what we have available for sale and for study in the Envato Market, and don't hesitate to ask any questions and provide your valuable feedback using the feed below.
