
Website Scraping with Liferea

Not every interesting website provides a feed, and some provide feeds with only summaries or no content at all. Besides asking the owner of the website to add a feed or to provide more details, the only choice left is to "scrape" the website content.

Liferea provides two ways to do website scraping:

  • By running a Unix command (usually a script written in your favourite scripting language) that writes a feed to stdout.
  • By applying a post-processing filter after downloading a web resource. This way it is possible to augment an existing feed or to extract content from an HTML page. The resulting feed also needs to be written to stdout.

The difference between the two approaches is that a Unix command or script can save its state and retrieve multiple source documents, which is not possible with a post-processing filter. The advantage of a post-processing filter is that you do not need to retrieve the source document yourself: Liferea downloads it and passes it to the filter on stdin.
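To illustrate the first approach, here is a minimal sketch of a stand-alone conversion script in Perl (the same idea works in any language). It is only an example: the URL, the feed metadata and the <h2> headline pattern are hypothetical placeholders, and it assumes the LWP::Simple module is installed. The script downloads the page itself and writes a simple RSS 2.0 feed to stdout, so it corresponds to the command-based approach rather than to the filter approach:

#!/usr/bin/perl
# Minimal sketch of the "Unix command" approach: the script downloads the
# page itself and prints an RSS 2.0 feed to stdout. The URL and the
# headline pattern below are hypothetical placeholders.
use strict;
use warnings;
use LWP::Simple;

my $url  = "http://www.example.com/news";
my $html = get($url) or die "could not fetch $url\n";

# feed header
print "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n";
print "<rss version=\"2.0\">\n";
print "<channel>\n";
print "  <title>Example News</title>\n";
print "  <link>$url</link>\n";
print "  <description>Scraped example headlines</description>\n";

# one item per headline link found on the page
# (a real script should also escape XML special characters)
while ($html =~ /<h2><a href="([^"]+)">([^<]+)<\/a><\/h2>/g) {
	print "  <item>\n";
	print "    <title>$2</title>\n";
	print "    <link>$1</link>\n";
	print "  </item>\n";
}

# feed footer
print "</channel>\n";
print "</rss>\n";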

Script Repository

Oliver Feiler, the author of SnowNews, has set up a website scraping script repository. All scripts you find there can be used with Liferea. If you write your own script, please consider posting it there so that other users can reuse it.

Example Scraping Script

The German news site Spiegel provides only an RSS 0.91 feed, which contains just the titles. The main page has a simple and easy-to-parse structure:

  • a div element with class "spTopThema" encloses each headline section
  • each headline title starts with an h3 tag
  • a short teaser text and the links to the article appear on the line following the h3 tag

Admittedly this is a pretty simple structure (and it might change), but it is sufficient to illustrate a scraping script that generates an RSS 1.0 feed:

#!/usr/bin/perl
use strict;
use warnings;

# print a feed header
print "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?>\n".
	"<rdf:RDF\n".
	"xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n".
	"xmlns:content=\"http://purl.org/rss/1.0/modules/content/\"\n".
	"xmlns=\"http://my.netscape.com/rdf/simple/0.9/\">\n".
	"<channel>\n".
	"  <title>Spiegel News</title>\n".
	"  <link>http://www.spiegel.de/</link>\n".
	"  <description>Extracted Spiegel News</description>\n".
	"</channel>\n";

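# scan the page line by line: once a "spTopThema" div starts, turn the
# next h3 headline and the teaser line following it into a feed item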
my $contentfound = 0;
while(<STDIN>) {
	my $line = $_;

	$contentfound = 1 if($line =~ /^<div class="spTopThema/);

	if(1 == $contentfound) {
		if($line =~ /<h3><a href="(.*)">(.*)<\/a><\/h3>/) {
			print "<item>\n";
			print "  <title>$2</title>\n";
			print "  <id>$1</id>\n";
			print "  <link>http://www.spiegel.de/$1</link>\n";
			my $tmp = <STDIN>;
			print "  <content:encoded>";
			print "<![CDATA[${tmp}]]>";
			print "</content:encoded>\n";
			print "</item>\n";
			$contentfound = 0;
		}
	}
}

# print a feed footer
print "</rdf:RDF>\n";
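Note that in the RDF-based RSS formats the item elements are not children of the channel element but siblings of it inside rdf:RDF, which is why the script closes the channel before printing any items.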

This script does not download the website itself but reads it from stdin. Therefore it has to be used as a post-processing filter. To use it, create a subscription with the URL www.spiegel.de, enable the filter check box and set the script as the filter, as shown in the following screenshot:

How to set up scraping
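You can also test the filter outside of Liferea by downloading the main page with a tool like wget or curl and piping the HTML into the script from a terminal; the generated feed should then appear on stdout. Any command that writes the page to the script's standard input will do.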