HeraldicArt.org: Traceable Art | Emblazons | Blog

NLI Download Perl Script

I put together this quick-and-dirty Perl script to extract period armorials from the click-to-pan-and-zoom document hosting service used by the National Library of Ireland.

To run it, you will need the ImageMagick tools, which provide the montage command that stitches image tiles together. Mac users can install ImageMagick using the following commands:

# Install Homebrew
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"

# Install Dependencies
brew install imagemagick
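
Before going further, it may be worth confirming that the external tools the script shells out to (curl and montage, plus perl itself) are on your PATH. The following is only a minimal sketch; the function name missing_tools is made up for illustration and is not part of the script:

```sh
# Print the name of each listed tool that cannot be found on the PATH.
# (missing_tools is an illustrative helper, not part of nli_extract.pl.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# No output means all three tools were found.
missing_tools perl curl montage
```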

Then save the code below as a Perl script named nli_extract.pl:

#!/usr/bin/perl

use strict;

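# Expect four arguments: book ID, page list, zoom scale, tile dimensions.
die "Usage: $0 book_id pages zoom_scale sizes\n" unless @ARGV == 4;
my ( $book_id, $pages, $zoom_scale, $sizes ) = @ARGV;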

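# Derive the folder name from the book ID: take the two digits that precede
# the last four, add one, and append 0000 (000529247 becomes 000530000).
my $folder = $book_id;
$folder =~ s/(\d\d)\d{4}$/ ( $1 + 1 ) . '0000' /e;
my $base_url = 'http://catalogue.nli.ie/cgi-bin/iipsrv.fcgi' .
               "?FIF=$folder/$book_id/vtls${book_id}_";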

my @page_list = map { /(\d+)-(\d+)/ ? $1 .. $2 : $_ } split /[,\s]+/, $pages;

my @sizes = map { [ /(\d+)/g ] } split /[,\s]+/, $sizes;

my ( $max_size, %sizes );
foreach my $size ( @sizes ) {
    my $dims = $size->[0] * $size->[1];
    if ( $sizes{ $dims } ) {
        die "There are two sizes with the same number of tiles! $size->[0]x$size->[1] == $sizes{$dims}->[0]x$sizes{$dims}->[1]\n";
    }
    $sizes{ $dims } = $size;
    if ( $dims > $max_size ) {
        $max_size = $dims;
    }
}

foreach my $page_n ( @page_list ) {
    my $page_str = sprintf( "%03d", $page_n );

    my $page_fn = "page_$page_str.jpeg";
    next if ( -f $page_fn ); # Page is already present

    print "\nFetching page $page_n\n";

    my $page_tile_pattern = "page_$page_str-tile_*.jpeg";

    my $tile_count = 0;
    # attempt to fetch one more tile than we expect to find
    foreach my $tile_n ( 0 .. $max_size ) { 
        my $tile_str = sprintf( "%03d", $tile_n );
        my $url = "$base_url$page_str.jp2&JTL=$zoom_scale,$tile_str";
        my $file = $page_tile_pattern;
        $file =~ s/\*/$tile_str/;

        if ( -f $file and -s $file > 48 ) {
            $tile_count ++;
            next;
        }
        execute( "curl -s -o $file '$url'" );
        if ( ! -f $file or -s $file < 48 ) {
            print "Failed to retrieve $file\n";
            execute( "rm $file" ) if ( -f $file );
            last;
        }
        $tile_count ++;
    }

    if ( $tile_count == 0 ) {
        print "Unable to retrieve any tiles for page $page_n!\n";
        next;
    }

    my $size = $sizes{ $tile_count }
        or die "Can't find a size that matches for $tile_count tiles!\n";

    execute( "montage $page_tile_pattern -geometry +0+0 " .
             "-tile $size->[0]x$size->[1] $page_fn" );

    execute( "rm $page_tile_pattern" ) if ( -f $page_fn );
}

sub execute {
    print qq{$_[0]\n};
    qx{$_[0]\n}
}

Lastly, make a new empty directory, set it as your current working directory, and invoke the script with the manuscript's document ID, the pages to be retrieved, the desired scale, and the tile dimensions:

perl nli_extract.pl 000529247 1-113 2 2x4,3x4

The script will take several minutes to run, or as much as a couple of hours for large books at high scale, and then exit, leaving you with a series of high-resolution JPEG images, one per page.
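
Because the script skips any page whose page_NNN.jpeg file already exists, an interrupted run can be resumed by invoking it again with the same arguments. To see which pages are still outstanding, something along these lines might help. It is a rough sketch that assumes the page_NNN.jpeg naming used by the script; the helper name missing_pages is invented here:

```sh
# List the page numbers in the range FIRST..LAST that have no stitched
# page_NNN.jpeg file in the current directory.
# (missing_pages is an illustrative helper, not part of nli_extract.pl.)
missing_pages() {
    first=$1
    last=$2
    n=$first
    while [ "$n" -le "$last" ]; do
        file=$(printf 'page_%03d.jpeg' "$n")
        [ -f "$file" ] || printf '%s\n' "$n"
        n=$(( n + 1 ))
    done
}

# Check the same page range that was passed to the script.
missing_pages 1 113
```

Pass the numbers without leading zeros, as in the example, so the shell does not try to read them as octal.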

You can determine the document ID and page number range by browsing the NLI.ie web site.

A scale of 0 will produce a small thumbnail image, while a scale of 5 typically yields a high-resolution image corresponding to the highest-quality images the library makes available.
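
To get a feel for a given scale before committing to a full download, you could fetch a single tile by hand. The sketch below rebuilds the tile URL the same way the Perl script does (same host, folder derivation, and JTL parameter); the function name tile_url is made up, it assumes a document ID of at least six digits, and I have not checked how the server responds to scales or tile numbers outside what the script requests:

```sh
# Build the URL for one tile, mirroring the URL construction in nli_extract.pl.
# Arguments: document ID, page number, scale, tile number (no leading zeros
# on the last three). (tile_url is an illustrative helper.)
tile_url() {
    book_id=$1
    page=$2
    scale=$3
    tile=$4
    head=${book_id%??????}      # everything before the last six digits
    tail=${book_id#"$head"}     # the last six digits
    bucket=${tail%????}         # the first two of those six digits
    # Add one to the two-digit number (leading zero stripped so the shell
    # does not treat it as octal) and append 0000, as the script does.
    folder=$head$(( ${bucket#0} + 1 ))0000
    printf 'http://catalogue.nli.ie/cgi-bin/iipsrv.fcgi?FIF=%s/%s/vtls%s_%03d.jp2&JTL=%s,%03d\n' \
        "$folder" "$book_id" "$book_id" "$page" "$scale" "$tile"
}

# Print the URL of the first tile of page 1 at scale 2.
tile_url 000529247 1 2 0

# To download it for inspection:
# curl -s -o preview.jpeg "$(tile_url 000529247 1 2 0)"
```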

The tile dimensions require some experimental fiddling to get right. If you start by running the command with dimensions that are much too large, such as 99×99, the script will download as many tiles as it can and then exit when it discovers that it couldn't retrieve as many as you requested. By examining the individual tile images, you should be able to determine how often they repeat (that is, how many tiles make up one row), and then you can re-run the command with the necessary tile dimensions, in width x height order. If some pages have different dimensions than others, separate the possible combinations with commas.
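
Since the script picks a layout by matching the number of tiles it retrieved against width times height, the candidates for a page are just the factor pairs of its tile count. When the oversized trial run stops, the downloaded tiles are left in the working directory, so you can count them (for example with ls page_001-tile_*.jpeg | wc -l) and list the possibilities. The helper below is an illustrative sketch, not part of the script:

```sh
# Print every WIDTHxHEIGHT pair whose product equals the given tile count.
# (grid_candidates is an illustrative helper, not part of nli_extract.pl.)
grid_candidates() {
    n=$1
    w=1
    while [ "$w" -le "$n" ]; do
        if [ $(( n % w )) -eq 0 ]; then
            printf '%sx%s\n' "$w" $(( n / w ))
        fi
        w=$(( w + 1 ))
    done
}

# A page that yielded 8 tiles could be 1x8, 2x4, 4x2, or 8x1.
grid_candidates 8
```

Looking at a few of the tiles is still needed to decide which of the candidates is the real layout.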