Downloading Images from a Web Page with Python

Summary of This Article

How to download the images on a web page with Python
Sample code for the steps above

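Background

When you try to learn machine learning hands-on, the bottleneck for most individuals is that they do not have the data it requires. This post therefore shows how to download the image data on a web page with a script, as an efficient way to collect data.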

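Overview

In this post I write a Python script that downloads every JPEG file on the home page of my own photo album blog, <hassiweb-photo.blogspot.jp>.

The steps are covered in the following order.

Get the HTML content using an HTTP library
Extract the links to the JPEG files from the retrieved HTML
Use the HTTP library again to fetch the content at each JPEG link and save it as a JPEG file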

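0. About HTTP Libraries

To work with web pages, you need to handle HTTP from Python. Several libraries can do this (e.g. urllib, urllib2, urllib3; note that urllib2 exists only in Python 2 and was folded into urllib.request in Python 3). I use the requests library, which is available in an Anaconda environment and seems the easiest to work with. requests is built on top of urllib3 and defines functions that make it easier to use than urllib3 directly.

If pip is installed, requests can be installed as follows.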

pip install requests

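1. Getting the HTML Content with the `requests` Library

To get the HTML content, use the get function of the requests library. Pass the URL as the argument. As options you can pass a timeout, headers, cookies, and URL query parameters. The return value holds the HTTP server's response. The response is an object that exposes the text content and its encoding, the binary content, the raw content, the JSON-decoded content, and so on. See the requests documentation for details.

In this sample, requests.get fetches the HTML text of http://hassiweb-photo.blogspot.jp. The only option that matters is timeout, which is set to an arbitrary value. The other options are unused and passed as empty dicts; if you do not need them, you can simply omit them. To check the server's response for errors, call the raise_for_status method of the response. It returns None on success and raises requests.exceptions.HTTPError when the status code is 4xx or 5xx. If the exception is not caught, the script stops with a traceback that includes the status code. Connection failures and timeouts are raised by requests.get itself.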

import sys
import requests

home_url = 'http://hassiweb-photo.blogspot.jp'

timeout = 10 # in seconds
params  = {} # not used
cookies = {} # not used
headers = {} # not used

try:
    # Get the HTML file
    home_response = requests.get(home_url, timeout=timeout, params=params, cookies=cookies, headers=headers)
    # Check the HTTP return code: returns None on success, raises HTTPError on a 4xx/5xx status
    home_response.raise_for_status()
except requests.exceptions.RequestException as e:
    sys.exit('HTTP Error When Accessing The Target URL! ({})'.format(e)) # on failure, this script will be terminated
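
To see how raise_for_status reports errors without contacting a server, here is a minimal offline sketch. It builds requests.Response objects by hand and sets status_code manually, which is done only for demonstration. The helper name describe_status is hypothetical and not part of requests.

```python
import requests


def describe_status(response):
    """Return 'ok' if raise_for_status() passes, otherwise the HTTPError message."""
    try:
        # Returns None for success codes; raises HTTPError for 4xx/5xx codes.
        response.raise_for_status()
    except requests.exceptions.HTTPError as e:
        return 'error: {}'.format(e)
    return 'ok'


# Hand-built responses, for demonstration only (no network access).
ok_response = requests.Response()
ok_response.status_code = 200

missing_response = requests.Response()
missing_response.status_code = 404

print(describe_status(ok_response))
print(describe_status(missing_response))
```

Because the method signals failure with an exception rather than a return value, comparing its result against None never detects an error; a try/except block is the usual way to handle it.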

2. Extracting the JPEG File Links from the Retrieved HTML

Search the HTML text retrieved above for links to JPEG files using a regular expression.

import re
import sys

# Find URLs of JPEG images from the HTML text.
# The extensions are grouped with (?:...) so that the alternation applies only to the extension.
html       = home_response.text
img_search = re.findall(r'"(https?://[a-zA-Z0-9:/.=_\-]*\.(?:jpg|jpeg|JPG|JPEG))"', html)

# Check whether URLs are found or not
if not img_search:
    sys.exit('Not Found Image URLs!') # if not found, this script will be terminated
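
The grouping of the file extensions in such a pattern matters, because `|` has the lowest precedence in a regular expression. The following small sketch compares an ungrouped and a grouped pattern on a made-up HTML fragment. The example.com URLs and the variable names are assumptions for illustration only.

```python
import re

# Made-up HTML fragment (hypothetical URLs), plus a stray quoted word "jpeg".
html = '<a href="http://example.com/a/photo1.jpg"><img src="http://example.com/a/photo2.JPEG"></a> "jpeg"'

# Ungrouped: the alternatives are  https?://...jpg | jpeg | JPG | JPEG,
# so only URLs ending in lowercase "jpg" are matched as URLs.
ungrouped = re.findall(r'"(https?://[a-zA-Z0-9:/.=_\-]*jpg|jpeg|JPG|JPEG)"', html)

# Grouped: the alternation is confined to the extension after the dot.
grouped = re.findall(r'"(https?://[a-zA-Z0-9:/.=_\-]*\.(?:jpg|jpeg|JPG|JPEG))"', html)

print(ungrouped)  # ['http://example.com/a/photo1.jpg', 'jpeg']
print(grouped)    # ['http://example.com/a/photo1.jpg', 'http://example.com/a/photo2.JPEG']
```

The ungrouped pattern misses the .JPEG link and picks up the bare word instead, while the grouped pattern returns both image URLs.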

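3. Fetching Each JPEG Link with the `requests` Library Again and Saving It as a JPEG File

For every JPEG URL found, fetch the content with requests.get in the same way as above. The difference is the stream=False option. When viewing a web page with a heavy JPEG file, you have probably seen the image appear gradually from the top as it downloads. This happens because the file arrives in pieces. With stream=False, requests.get does not return until the whole response body has been downloaded, so the file is handed over in one piece. stream=False is the default in requests, so writing it out only makes the intent explicit.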

import os

img_dir = 'images'
os.makedirs(img_dir, exist_ok=True) # create the output directory if it does not exist

for img_url in img_search:
    try:
        # Get the content of the image
        img_response = requests.get(img_url, timeout=timeout, params=params, cookies=cookies, headers=headers, stream=False)
        img_response.raise_for_status() # raises HTTPError on a 4xx/5xx status
    except requests.exceptions.RequestException as e:
        sys.exit('HTTP Error When Accessing The Image File! ({})'.format(e)) # on failure, this script will be terminated

    # Retrieve the file name of the image
    name_search = re.findall(r'/([a-zA-Z0-9:.=_-]*\.(?:jpg|jpeg|JPG|JPEG))', img_url)
    img_name    = name_search[0]

    # Save the image
    save_image(os.path.join(img_dir, img_name), img_response.content)
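
As an alternative to extracting the file name with a regular expression, the standard library can split the URL and take the last path segment. This is a sketch of a different technique, not the approach used in this post. The function name image_name_from_url and the example.com URLs are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit, unquote


def image_name_from_url(img_url):
    """Return the last path segment of a URL, ignoring any query string or fragment."""
    path = urlsplit(img_url).path              # e.g. '/s1600/IMG_0001.JPG'
    return unquote(posixpath.basename(path))   # decode %xx escapes such as %20


name_with_query = image_name_from_url('http://example.com/s1600/IMG_0001.JPG?imgmax=800')
name_with_escape = image_name_from_url('http://example.com/a/my%20photo.jpg')

print(name_with_query)   # IMG_0001.JPG
print(name_with_escape)  # my photo.jpg
```

This variant does not depend on the character set listed in the regular expression, so it also handles file names that contain characters such as `%` or `+`.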

The downloaded file is held in the content attribute of the response, so simply writing it to a file completes the job.

# Save an image file
def save_image(file_name, image):
    with open(file_name, 'wb') as f:
        f.write(image)
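
For large files, an alternative is to write the data chunk by chunk instead of holding the whole body in memory. The sketch below accepts any iterable of bytes objects and also creates the parent directory if it is missing. The function name save_image_chunks and the demo data are assumptions for illustration.

```python
import os
import tempfile


def save_image_chunks(file_name, chunks):
    """Write an iterable of bytes chunks to file_name and return the number of bytes written."""
    directory = os.path.dirname(file_name)
    if directory:
        # Create the parent directory if it does not exist yet.
        os.makedirs(directory, exist_ok=True)
    total = 0
    with open(file_name, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                total += len(chunk)
    return total


# Demo with hand-made chunks written into a temporary directory (no network access).
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, 'images', 'demo.jpg')
written = save_image_chunks(demo_path, [b'\xff\xd8', b'', b'\xff\xd9'])
print(written)
```

With requests, such an iterable can be obtained by passing stream=True to requests.get and then calling iter_content(chunk_size=8192) on the response.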

The full sample code is included at the end of this page for reference. My environment is Python 3.4.

That is all for this time. Thank you for reading to the end.


Sample Code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
#-------------------------------------------------------------------------------
# Name:        Image Downloader
# Purpose:     Download JPEG images which are attached in a web page
#
# Author:      hassiweb
#
# Created:     05/21/2017
# Copyright:   (c) hassiweb 2017
# Licence:     hassiweb
#-------------------------------------------------------------------------------
 
import os       # for "makedirs", "path.join"
import re       # for "findall"
import sys      # for "exit"

import requests # for "get", "raise_for_status"

# Save an image file
def save_image(file_name, image):
    with open(file_name, 'wb') as f:
        f.write(image)

# Main
if __name__ == '__main__':
    home_url = 'http://hassiweb-photo.blogspot.jp'
    img_dir  = 'images'

    timeout = 10 # in seconds
    params  = {} # not used
    cookies = {} # not used
    headers = {} # not used

    try:
        # Get the HTML file
        home_response = requests.get(home_url, timeout=timeout, params=params, cookies=cookies, headers=headers)
        # Check the HTTP return code: returns None on success, raises HTTPError on a 4xx/5xx status
        home_response.raise_for_status()
    except requests.exceptions.RequestException as e:
        sys.exit('HTTP Error When Accessing The Target URL! ({})'.format(e)) # on failure, this script will be terminated

    # Find URLs of JPEG images from the HTML file.
    # The extensions are grouped with (?:...) so that the alternation applies only to the extension.
    html       = home_response.text
    img_search = re.findall(r'"(https?://[a-zA-Z0-9:/.=_\-]*\.(?:jpg|jpeg|JPG|JPEG))"', html)

    # Check whether URLs are found or not
    if not img_search:
        sys.exit('Not Found Image URLs!') # if not found, this script will be terminated

    # Create the output directory if it does not exist
    os.makedirs(img_dir, exist_ok=True)

    for img_url in img_search:
        # Retrieve the file name of the image
        name_search = re.findall(r'/([a-zA-Z0-9:.=_-]*\.(?:jpg|jpeg|JPG|JPEG))', img_url)
        img_name    = name_search[0]

        try:
            # Get the content of the image
            img_response = requests.get(img_url, timeout=timeout, params=params, cookies=cookies, headers=headers, stream=False)
            img_response.raise_for_status() # raises HTTPError on a 4xx/5xx status
        except requests.exceptions.RequestException as e:
            sys.exit('HTTP Error When Accessing The Image File! ({})'.format(e)) # on failure, this script will be terminated

        # Save the image
        save_image(os.path.join(img_dir, img_name), img_response.content)
