Website scraping tutorial 1: feed of manga chapters

Are you by any chance a manga fan? Does your favourite manga website not publish RSS feeds? Then you might be interested in this scraping example for manga hosted at mangakatana.com.

How it works

mangakatana.com has manga overview pages like this one for Eminence in Shadow. If you inspect the chapter list on this page (right-click it and choose “Inspect”), you will see an HTML structure like this:

<div class="chapters">
     ...
     <div class="chapter">
        <a href="https://mangakatana.com/manga/the-eminence-in-shadow.22020/c59">Chapter 59</a>
     </div>
     ...
     <div class="chapter">
        <a href="https://mangakatana.com/manga/the-eminence-in-shadow.22020/c58">Chapter 58</a>
     </div>
     ...
</div>

One way to extract this structure is using XSLT. An example stylesheet can be found in the RSS scraping example repo: mangakatana-chapters.xsl.
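
To make the idea concrete, here is a minimal sketch of what such a transformation could look like. It is not the repo's mangakatana-chapters.xsl: the file name chapters-sketch.xsl, the channel title and description, and the exact XPath are assumptions made for illustration. The script writes a guessed stylesheet and a copy of the HTML structure shown above to a temporary directory, then runs xsltproc on them locally, without any network access.

```shell
#!/bin/sh
# Hypothetical sketch, NOT the stylesheet from the RSS scraping repo.
workdir=$(mktemp -d)

# Guessed stylesheet: one RSS 2.0 <item> per chapter link.
cat > "$workdir/chapters-sketch.xsl" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" encoding="UTF-8" indent="yes"/>
  <xsl:template match="/">
    <rss version="2.0">
      <channel>
        <title>Manga chapters (sketch)</title>
        <link>https://mangakatana.com/</link>
        <description>Chapter list scraped from an overview page</description>
        <!-- Assumes the class attributes are exactly "chapters" and "chapter" -->
        <xsl:for-each select="//div[@class='chapters']//div[@class='chapter']/a">
          <item>
            <title><xsl:value-of select="normalize-space(.)"/></title>
            <link><xsl:value-of select="@href"/></link>
          </item>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:stylesheet>
EOF

# Sample input: the chapter list structure shown above, reduced to two entries.
cat > "$workdir/sample.html" <<'EOF'
<div class="chapters">
  <div class="chapter">
    <a href="https://mangakatana.com/manga/the-eminence-in-shadow.22020/c59">Chapter 59</a>
  </div>
  <div class="chapter">
    <a href="https://mangakatana.com/manga/the-eminence-in-shadow.22020/c58">Chapter 58</a>
  </div>
</div>
EOF

# --html parses the input as HTML; --novalid skips DTD loading.
feed=$(xsltproc --html --novalid "$workdir/chapters-sketch.xsl" "$workdir/sample.html")
printf '%s\n' "$feed"

rm -rf "$workdir"
```

If the real page uses additional class names on these elements, the exact-match predicates in the sketch would need to be loosened. The steps that follow use the actual mangakatana-chapters.xsl from the repo, not this sketch.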

Using the mangakatana-chapters.xsl stylesheet we can create a feed like this:

Download the XSLT stylesheet file to ~/.config/liferea/
Right click in the feed list and choose “New subscription …”
Click “Advanced”
Select “Command” as feed source

Set the following command

curl -L https://mangakatana.com/manga/the-eminence-in-shadow.22020 | xsltproc --html --novalid ~/.config/liferea/mangakatana-chapters.xsl -

Click “OK” to subscribe
Finally, open the feed properties
In the “Cache” tab, set the cache to “Unlimited”
In the “Advanced” tab, you can optionally enable “Auto-load item link…”

If it does not work, check whether you have curl and xsltproc installed!
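
One quick way to run that check from a terminal is sketched below. The helper name check_tools is made up for this example. It relies only on the POSIX `command -v` builtin and reports which of the given tools cannot be found on the PATH.

```shell
#!/bin/sh
# check_tools is a hypothetical helper name, not part of Liferea.
# Prints the tools that are missing and returns non-zero if any are.
check_tools() {
    missing=""
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Not installed:$missing"
        return 1
    fi
    echo "All tools found: $*"
}

# The two tools the feed command depends on.
check_tools curl xsltproc || true
```

If both tools are reported as found, running the feed command itself in a terminal and looking at its output is a reasonable next debugging step.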

Contribute Examples!

Did you adapt the XSLT to other websites? If so, consider making a pull request with your new stylesheet to the RSS scraping repo!