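Scraping with Laravel!

Scraping can be implemented in the following three steps.

1. Access the URL with GuzzleHttp and fetch the HTTP response body (the HTML)
2. Load the fetched HTML into DOMDocument so it can be traversed as a DOM
3. Extract the required data with DOMXPath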

1. Installing GuzzleHttp (if not already installed)

Recent versions of Laravel ship with GuzzleHttp by default,
but if it is not installed, you can install it with the following command.

composer require guzzlehttp/guzzle

2. Creating a Laravel service class for scraping

Create app/Services/WebScraperService.php as follows.

app/Services/WebScraperService.php

<?php

namespace App\Services;

use GuzzleHttp\Client;
use DOMDocument;
use DOMXPath;

class WebScraperService
{
    protected $client;

    public function __construct()
    {
        $this->client = new Client();
    }

    public function scrape(string $url): array
    {
        // 1. Access the URL with GuzzleHttp
        $response = $this->client->request('GET', $url, [
            'headers' => [
                'User-Agent' => 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
            ]
        ]);

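        // Guzzle throws on 4xx/5xx responses by default (http_errors),
        // so this check only catches other non-200 statuses such as 204.
        if ($response->getStatusCode() !== 200) {
            throw new \Exception("Failed to fetch the URL: " . $url);
        }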

        $html = (string) $response->getBody();

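        // 2. Parse the HTML with DOMDocument
        libxml_use_internal_errors(true); // Collect HTML parse warnings instead of emitting them
        $dom = new DOMDocument();
        // Encode non-ASCII characters as numeric entities so loadHTML() does not misread UTF-8;
        // the 'HTML-ENTITIES' target of mb_convert_encoding() is deprecated since PHP 8.2.
        $dom->loadHTML(mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8'));
        libxml_clear_errors();

        // 3. Extract the required data with DOMXPath
        $xpath = new DOMXPath($dom);

        // Example: collect the h1 elements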
        $titles = [];
        foreach ($xpath->query('//h1') as $node) {
            $titles[] = trim($node->nodeValue);
        }

        // Example: collect the links from a elements
        $links = [];
        foreach ($xpath->query('//a[@href]') as $node) {
            $links[] = [
                'text' => trim($node->nodeValue),
                'href' => $node->getAttribute('href'),
            ];
        }

        return [
            'titles' => $titles,
            'links'  => $links,
        ];
    }
}
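
To try steps 2 and 3 without making a network request, the parsing logic can be run against an inline HTML string. The following is a minimal sketch rather than part of the service above: the function name extractTitlesAndLinks and the sample HTML are hypothetical, and it assumes PHP 8 with the dom and mbstring extensions enabled and UTF-8 input.

```php
<?php

/**
 * Hypothetical helper: parses an HTML string and returns h1 texts and links.
 */
function extractTitlesAndLinks(string $html): array
{
    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    // Encode non-ASCII characters as numeric entities so that loadHTML()
    // does not misread UTF-8 input as ISO-8859-1.
    $dom->loadHTML(mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8'));

    // Discard the buffered warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);

    $titles = [];
    foreach ($xpath->query('//h1') as $node) {
        $titles[] = trim($node->textContent);
    }

    $links = [];
    foreach ($xpath->query('//a[@href]') as $node) {
        $links[] = [
            'text' => trim($node->textContent),
            'href' => $node->getAttribute('href'),
        ];
    }

    return ['titles' => $titles, 'links' => $links];
}

// Sample input, made up for illustration.
$html = '<html><body><h1> 見出し </h1><a href="/about">About</a><a>no href</a></body></html>';
$result = extractTitlesAndLinks($html);

echo json_encode($result, JSON_UNESCAPED_UNICODE), PHP_EOL;
```

Keeping the parsing in a function that takes a string makes it possible to check the XPath expressions against saved HTML before pointing the scraper at a live site.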

3. Creating the controller

Create a controller that uses this service.

app/Http/Controllers/ScraperController.php

<?php

namespace App\Http\Controllers;

use Illuminate\Http\Request;
use App\Services\WebScraperService;

class ScraperController extends Controller
{
    protected $scraperService;

    public function __construct(WebScraperService $scraperService)
    {
        $this->scraperService = $scraperService;
    }

    public function scrape()
    {
        $url = 'https://example.com'; // URL of the site to scrape

        try {
            $data = $this->scraperService->scrape($url);
            return response()->json($data);
        } catch (\Exception $e) {
            return response()->json(['error' => $e->getMessage()], 500);
        }
    }
}

4. Configuring the route

Add a route to routes/web.php as follows.

use App\Http\Controllers\ScraperController;

Route::get('/scrape', [ScraperController::class, 'scrape']);

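5. Checking that it works

Access the following URL to get the scraping result as JSON. Adjust the host and port to your environment; for example, php artisan serve listens on http://127.0.0.1:8000 by default.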

http://localhost/scrape

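Notes

If you do not set a User-Agent, some sites may block the request.
The XPath query('//a[@href]') selects only the a elements that have an href attribute; read the value with getAttribute('href').
libxml_use_internal_errors(true); stops warnings caused by malformed HTML from being emitted during parsing; libxml collects them internally instead.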
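
The last two notes can be illustrated with a short sketch, assuming PHP 8 with the dom extension; the malformed sample HTML is made up. It shows that errors buffered by libxml_use_internal_errors(true) can still be read with libxml_get_errors(), and that '//a[@href]' selects elements while '//a/@href' selects the attribute nodes directly. Which errors are reported, if any, depends on the libxml2 version.

```php
<?php

// Deliberately malformed HTML (made-up sample): an unknown tag and an unclosed <a>.
$html = '<html><body><foo>bar</foo><a href="/one">One</a><a>Two</a><a href="/three">Three</body></html>';

// Buffer libxml errors instead of emitting them as PHP warnings.
libxml_use_internal_errors(true);

$dom = new DOMDocument();
$dom->loadHTML($html);

// The buffered errors remain available for logging or debugging.
$errors = libxml_get_errors();
foreach ($errors as $error) {
    echo 'line ', $error->line, ': ', trim($error->message), PHP_EOL;
}
libxml_clear_errors();

$xpath = new DOMXPath($dom);

// '//a[@href]' selects <a> elements that have an href attribute.
$elements = $xpath->query('//a[@href]');

// '//a/@href' selects the href attribute nodes themselves.
$hrefs = [];
foreach ($xpath->query('//a/@href') as $attr) {
    $hrefs[] = $attr->nodeValue;
}

echo count($elements), ' links: ', implode(', ', $hrefs), PHP_EOL;
```

Selecting the attribute nodes is convenient when only the URLs are needed; selecting the elements, as the service does, keeps the link text available as well.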