April 28, 2016

New 'semi-stable' build for Heritrix

Earlier this month I mentioned that I was planning on making another one of my "semi-stable" Heritrix builds, using the current master 3.3.0. This work is now underway in the Landsbokasafn Heritrix repo on GitHub, under the name LBS-2016-02.

Heritrix 3.3.0-LBS-2016-02
I've merged in one pull request that is still open in the IA repository: #154, "Fixes for apparent build errors". Most notably, this makes it possible to have Travis-CI build and test Heritrix.

You can review the full list of changes between my last Heritrix build (2015-01) and this new one here. Here is a list of the main changes:

  • Some fixes to how server-not-modified revisit records are written (PR #118)
  • Fix outlink hoppath in metadata records (PR #119)
  • Allow dots in filenames for known good extensions (PR #120)
  • Require Maven 3.3 (PR #126)
  • Allow realm to be set by server for basic auth (PR #124)
  • Better error handling in StatisticsTracker (PR #130)
  • Fix to Java 8 Keytool (PR #129) - I wrote a post about this back in 2014.
  • Changes to how cookies are stored in Bdb (PR #133)
  • Handle multiple clauses for same user agent in robots.txt (PR #139)
  • SourceSeedDecideRule and SeedLimitsEnforcer (PR #137 and #148)
  • 'Novel' URL and byte quotas (PR #138)
  • Only submit 'checked' checkbox and radio buttons when submitting forms (PR #122)
  • Form login improvements (PR #142 and #143)
  • Improvements to hosts report (PR #123)
  • Handle SNI error better (PR #141)
  • Allow some whitespace in URLs extracted by ExtractorJS (PR #145)
  • Fix to ExtractorHTML dealing with HTML comments (PR #149)
  • Build against Java 7 (PR #152)

The list above leaves out all pull requests that apply primarily to the contrib package. There were quite a few of those, mostly (but not exclusively) relating to AMQP.

I've done some preliminary testing and everything looks good. So far, the only issue I've noted is one that I was already aware of: noisy alerts relating to 401 responses.

I'll be testing this version further over the next few weeks and welcome any additional input.
