nutch.git
11 days ago Merge pull request #407 from sebastian-nagel/NUTCH-2674-hostdb-dump-header master
Sebastian Nagel [Fri, 7 Dec 2018 16:56:40 +0000 (17:56 +0100)] 
Merge pull request #407 from sebastian-nagel/NUTCH-2674-hostdb-dump-header

NUTCH-2674 HostDb: dump shows wrong column headers

4 weeks ago NUTCH-2668 Integrate OWASP dependency checks as ant target
Sebastian Nagel [Mon, 19 Nov 2018 21:59:59 +0000 (22:59 +0100)] 
NUTCH-2668 Integrate OWASP dependency checks as ant target
- relax ant build if the OWASP dependency check tool is not installed

4 weeks ago Merge pull request #401 from sebastian-nagel/dependency-check
Sebastian Nagel [Mon, 19 Nov 2018 20:57:25 +0000 (21:57 +0100)] 
Merge pull request #401 from sebastian-nagel/dependency-check

NUTCH-2668 Integrate OWASP dependency checks as ant target

4 weeks ago NUTCH-1842: crawl.gen.delay value is read incorrectly from config
Sebastian Nagel [Mon, 19 Nov 2018 20:54:44 +0000 (21:54 +0100)] 
NUTCH-1842: crawl.gen.delay value is read incorrectly from config
- add warning to CHANGES.txt

4 weeks ago Merge pull request #392 from sebastian-nagel/NUTCH-2606-mime-detection-plain-text
Sebastian Nagel [Mon, 19 Nov 2018 20:52:24 +0000 (21:52 +0100)] 
Merge pull request #392 from sebastian-nagel/NUTCH-2606-mime-detection-plain-text

NUTCH-2606 MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

4 weeks ago NUTCH-1842: crawl.gen.delay value is read incorrectly from config
Sebastian Nagel [Thu, 15 Nov 2018 10:17:37 +0000 (11:17 +0100)] 
NUTCH-1842: crawl.gen.delay value is read incorrectly from config
Merge pull request #393 from YossiTamari/patch-2

4 weeks ago NUTCH-2671 Upgrade to ant ivy library
Sebastian Nagel [Tue, 30 Oct 2018 16:47:22 +0000 (17:47 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- roll back to 2.4.0 to bring Jenkins build back to normal

4 weeks ago NUTCH-2671 Upgrade to ant ivy library
Sebastian Nagel [Tue, 30 Oct 2018 15:45:22 +0000 (16:45 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- fix order of ant target dependencies:
  "compile-core" must come before "resolve-test"

4 weeks ago NUTCH-2671 Upgrade to ant ivy library
Sebastian Nagel [Mon, 29 Oct 2018 12:41:42 +0000 (13:41 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- upgrade to 2.5.0-rc1 to address NUTCH-2669

4 weeks ago NUTCH-2658 Adding the fields required by the index-links plugin to the schema
Jorge Luis Betancourt [Tue, 23 Oct 2018 20:57:03 +0000 (22:57 +0200)] 
NUTCH-2658 Adding the fields required by the index-links plugin to the schema

4 weeks ago NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18)
Sebastian Nagel [Sun, 21 Oct 2018 18:49:51 +0000 (20:49 +0200)] 
NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18)
- modified work-around to fix downloading of dependency javax.ws.rs-api-*.jar:
  define property packaging.type in ivysettings.xml

4 weeks ago NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path
Jorge Luis Betancourt Gonzalez [Wed, 17 Oct 2018 16:07:51 +0000 (18:07 +0200)] 
NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path

4 weeks ago NUTCH-2660 Plugin tests not executed
Sebastian Nagel [Wed, 17 Oct 2018 12:36:58 +0000 (14:36 +0200)] 
NUTCH-2660 Plugin tests not executed
- add missing unit test packages to plugin build.xml
- tests of "headings" plugin depend on "lib-nekohtml"
- add "protocol-okhttp" to Javadoc API overview
- add missing test packages to ant "eclipse" target

4 weeks ago NUTCH-2659 Add missing Apache license headers
Sebastian Nagel [Wed, 17 Oct 2018 12:23:44 +0000 (14:23 +0200)] 
NUTCH-2659 Add missing Apache license headers

4 weeks ago NUTCH-2655 Update Solr schema.xml for Solr 7.x
Sebastian Nagel [Mon, 15 Oct 2018 13:04:01 +0000 (15:04 +0200)] 
NUTCH-2655 Update Solr schema.xml for Solr 7.x
- add required field types to schema.xml

4 weeks ago NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
Sebastian Nagel [Mon, 15 Oct 2018 11:44:20 +0000 (13:44 +0200)] 
NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
- properly override method getSplits(...) of FileInputFormat

4 weeks ago NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1
Sebastian Nagel [Fri, 12 Oct 2018 11:47:43 +0000 (13:47 +0200)] 
NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1
- add work-around to fix downloading of dependency javax.ws.rs-api-*.jar
  (need to set property packaging.type=jar)

4 weeks ago NUTCH-2630 Fetcher to log skipped records by robots.txt
Sebastian Nagel [Mon, 8 Oct 2018 12:50:51 +0000 (14:50 +0200)] 
NUTCH-2630 Fetcher to log skipped records by robots.txt
- change required log level to INFO (default) for messages
  reporting skipped URLs because of robots.txt rules
  (disallow or crawl delay larger than fetcher.max.crawl.delay)

4 weeks ago NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances
Sebastian Nagel [Tue, 24 Jul 2018 14:19:04 +0000 (16:19 +0200)] 
NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances
- lock critical block (conditional creation of plugin instance)
  on object cache object

4 weeks ago Merge pull request #387 from sebastian-nagel/NUTCH-2630-fetcher-log-robotstxt-denied
Sebastian Nagel [Wed, 14 Nov 2018 12:04:49 +0000 (13:04 +0100)] 
Merge pull request #387 from sebastian-nagel/NUTCH-2630-fetcher-log-robotstxt-denied

NUTCH-2630 Fetcher to log skipped records by robots.txt

4 weeks ago Merge pull request #395 from sebastian-nagel/NUTCH-2655-solr-schema-7x
Sebastian Nagel [Wed, 14 Nov 2018 09:10:31 +0000 (10:10 +0100)] 
Merge pull request #395 from sebastian-nagel/NUTCH-2655-solr-schema-7x

NUTCH-2655 Update Solr schema.xml for Solr 7.x

5 weeks ago Merge pull request #402 from jorgelbg/index-links-schema
Jorge Luis Betancourt [Sun, 11 Nov 2018 01:19:29 +0000 (02:19 +0100)] 
Merge pull request #402 from jorgelbg/index-links-schema

Add the fields required by the index-links plugin to the Solr schema

5 weeks ago NUTCH-2674 HostDb: dump shows wrong column headers 407/head
Sebastian Nagel [Thu, 8 Nov 2018 15:12:52 +0000 (16:12 +0100)] 
NUTCH-2674 HostDb: dump shows wrong column headers
- replace column headers `redirSum \t ok` with `notModified`
  to match actual dump of HostDb

7 weeks ago NUTCH-2671 Upgrade to ant ivy library
Sebastian Nagel [Tue, 30 Oct 2018 16:47:22 +0000 (17:47 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- roll back to 2.4.0 to bring Jenkins build back to normal

7 weeks ago NUTCH-2671 Upgrade to ant ivy library
Sebastian Nagel [Tue, 30 Oct 2018 15:45:22 +0000 (16:45 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- fix order of ant target dependencies:
  "compile-core" must come before "resolve-test"

7 weeks ago Merge pull request #406 from sebastian-nagel/NUTCH-2671-ivy-lib-upgrade
Sebastian Nagel [Tue, 30 Oct 2018 09:52:16 +0000 (10:52 +0100)] 
Merge pull request #406 from sebastian-nagel/NUTCH-2671-ivy-lib-upgrade

NUTCH-2671 Upgrade to ant ivy library

7 weeks ago NUTCH-2671 Upgrade to ant ivy library 406/head
Sebastian Nagel [Mon, 29 Oct 2018 12:41:42 +0000 (13:41 +0100)] 
NUTCH-2671 Upgrade to ant ivy library
- upgrade to 2.5.0-rc1 to address NUTCH-2669

7 weeks ago NUTCH-2668 Integrate OWASP dependency checks as ant target 401/head
Sebastian Nagel [Tue, 23 Oct 2018 12:49:19 +0000 (14:49 +0200)] 
NUTCH-2668 Integrate OWASP dependency checks as ant target
- add ant target "report-vulnerabilities" to generate report
- initial suppression list to exclude false positives

7 weeks ago NUTCH-2658 Adding the fields required by the index-links plugin to the schema 402/head
Jorge Luis Betancourt [Tue, 23 Oct 2018 20:57:03 +0000 (22:57 +0200)] 
NUTCH-2658 Adding the fields required by the index-links plugin to the schema

7 weeks ago Merge pull request #396 from sebastian-nagel/NUTCH-2659-license-headers
Jorge Luis Betancourt [Tue, 23 Oct 2018 20:45:38 +0000 (22:45 +0200)] 
Merge pull request #396 from sebastian-nagel/NUTCH-2659-license-headers

NUTCH-2659 Add missing Apache license headers

8 weeks ago Merge pull request #399 from jorgelbg/indexer-link-test-move
Jorge Luis Betancourt [Tue, 23 Oct 2018 20:06:16 +0000 (22:06 +0200)] 
Merge pull request #399 from jorgelbg/indexer-link-test-move

NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path

8 weeks ago NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18)
Sebastian Nagel [Sun, 21 Oct 2018 18:49:51 +0000 (20:49 +0200)] 
NUTCH-2651 Upgrade to Tika 1.19.1 (from 1.18)
- modified work-around to fix downloading of dependency javax.ws.rs-api-*.jar:
  define property packaging.type in ivysettings.xml

8 weeks ago Merge pull request #394 from sebastian-nagel/NUTCH-2652-fetcher-not-split-inputs
Sebastian Nagel [Sat, 20 Oct 2018 17:36:11 +0000 (19:36 +0200)] 
Merge pull request #394 from sebastian-nagel/NUTCH-2652-fetcher-not-split-inputs

NUTCH-2652 Fetcher launches more fetch tasks than fetch lists

8 weeks ago Merge pull request #391 from sebastian-nagel/NUTCH-2651-tika-1.19.1
Sebastian Nagel [Sat, 20 Oct 2018 17:19:28 +0000 (19:19 +0200)] 
Merge pull request #391 from sebastian-nagel/NUTCH-2651-tika-1.19.1

NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1

8 weeks ago Merge pull request #397 from sebastian-nagel/NUTCH-2660-execute-plugin-tests
Sebastian Nagel [Sat, 20 Oct 2018 17:16:11 +0000 (19:16 +0200)] 
Merge pull request #397 from sebastian-nagel/NUTCH-2660-execute-plugin-tests

NUTCH-2660 Plugin tests not executed

8 weeks ago Merge pull request #368 from sebastian-nagel/NUTCH-2625-protocolfactory-getprotocol...
Sebastian Nagel [Sat, 20 Oct 2018 17:13:00 +0000 (19:13 +0200)] 
Merge pull request #368 from sebastian-nagel/NUTCH-2625-protocolfactory-getprotocol-synchronized

NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances

2 months ago NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path 399/head
Jorge Luis Betancourt Gonzalez [Wed, 17 Oct 2018 16:07:51 +0000 (18:07 +0200)] 
NUTCH-2661 Move the TestOutlinks class into the o.a.n.parse path

2 months ago NUTCH-2660 Plugin tests not executed 397/head
Sebastian Nagel [Wed, 17 Oct 2018 12:36:58 +0000 (14:36 +0200)] 
NUTCH-2660 Plugin tests not executed
- add missing unit test packages to plugin build.xml
- tests of "headings" plugin depend on "lib-nekohtml"
- add "protocol-okhttp" to Javadoc API overview
- add missing test packages to ant "eclipse" target

2 months ago NUTCH-2659 Add missing Apache license headers 396/head
Sebastian Nagel [Wed, 17 Oct 2018 12:23:44 +0000 (14:23 +0200)] 
NUTCH-2659 Add missing Apache license headers

2 months ago NUTCH-2655 Update Solr schema.xml for Solr 7.x 395/head
Sebastian Nagel [Mon, 15 Oct 2018 13:04:01 +0000 (15:04 +0200)] 
NUTCH-2655 Update Solr schema.xml for Solr 7.x
- add required field types to schema.xml

2 months ago NUTCH-2652 Fetcher launches more fetch tasks than fetch lists 394/head
Sebastian Nagel [Mon, 15 Oct 2018 11:44:20 +0000 (13:44 +0200)] 
NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
- properly override method getSplits(...) of FileInputFormat

2 months ago NUTCH-1842: crawl.gen.delay value is read incorrectly from configuration. 393/head
YossiTamari [Sun, 14 Oct 2018 09:36:36 +0000 (12:36 +0300)] 
NUTCH-1842: crawl.gen.delay value is read incorrectly from configuration.

The documentation in nutch-default.xml says this value is in milliseconds, but the code assumes it is in days.
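
A minimal Java sketch of the unit mismatch described above. The class and method names are hypothetical and are not taken from the Nutch source; only the two readings of the configured number (milliseconds versus days) come from the commit message.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration; this is not the Nutch Generator code.
final class GenDelayUnits {

    // Reading the configured number as days scales it by the milliseconds in a day.
    static long readAsDays(long configured) {
        return TimeUnit.DAYS.toMillis(configured);
    }

    // Reading the configured number as milliseconds uses it unchanged.
    static long readAsMillis(long configured) {
        return configured;
    }

    public static void main(String[] args) {
        long sevenDaysInMillis = 604_800_000L; // 7 * 24 * 60 * 60 * 1000
        // The same configured number yields very different delays
        // depending on which unit the reader assumes.
        System.out.println(readAsMillis(sevenDaysInMillis));
        System.out.println(readAsDays(sevenDaysInMillis));
    }
}
```

A value written in milliseconds, as the documentation suggests, becomes a far longer delay if the reader treats it as a number of days.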

2 months ago NUTCH-2606 MIME detection is wrong for plain-text documents send 392/head
Sebastian Nagel [Sat, 13 Oct 2018 11:56:44 +0000 (13:56 +0200)] 
NUTCH-2606 MIME detection is wrong for plain-text documents send
as Content-Type "application/msword"
- allow text/plain (from MIME magic) to overwrite type derived
  from HTTP Content-Type or file extension

2 months ago Merge pull request #389 from sebastian-nagel/NUTCH-2192-remove-oro
Sebastian Nagel [Sat, 13 Oct 2018 09:52:25 +0000 (11:52 +0200)] 
Merge pull request #389 from sebastian-nagel/NUTCH-2192-remove-oro

NUTCH-2192 Migrate from Apache ORO to java.util.regex

2 months ago NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1 391/head
Sebastian Nagel [Fri, 12 Oct 2018 11:47:43 +0000 (13:47 +0200)] 
NUTCH-2651 Upgrade core and parse-tika to use Tika 1.19.1
- add work-around to fix downloading of dependency javax.ws.rs-api-*.jar
  (need to set property packaging.type=jar)

2 months ago NUTCH-2192 Migrate from Apache ORO to java.util.regex 389/head
Sebastian Nagel [Wed, 10 Oct 2018 21:58:31 +0000 (23:58 +0200)] 
NUTCH-2192 Migrate from Apache ORO to java.util.regex
- improve javadoc in plugin parse-js (from 2.x)

2 months ago NUTCH-1121 JUnit test for parse-js
Sebastian Nagel [Tue, 9 Oct 2018 16:30:31 +0000 (18:30 +0200)] 
NUTCH-1121 JUnit test for parse-js
- port tests from 2.x
- add test file for "pure" JavaScript (parser extension)

2 months ago NUTCH-2192 NUTCH-1678 NUTCH-1014 NUTCH-1021 Migrate from Apache ORO to java.util...
Sebastian Nagel [Tue, 9 Oct 2018 16:22:56 +0000 (18:22 +0200)] 
NUTCH-2192 NUTCH-1678 NUTCH-1014 NUTCH-1021 Migrate from Apache ORO to java.util.regex
- apply Markus' patch of NUTCH-2192
- finish migration of parse-js
- remove oro dependency
- correct pointer to Java regex syntax (instead of "Perl5")
NUTCH-1063 OutlinkExtractor test generates an exception but does not fail
- fixed by adding null-check (required by java.util.regex classes)

2 months ago Merge pull request #388 from sebastian-nagel/NUTCH-2648-configurable-tls-cert-check
Sebastian Nagel [Tue, 9 Oct 2018 12:57:33 +0000 (14:57 +0200)] 
Merge pull request #388 from sebastian-nagel/NUTCH-2648-configurable-tls-cert-check

NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol plugins

2 months ago NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol... 388/head
Sebastian Nagel [Mon, 8 Oct 2018 15:50:44 +0000 (17:50 +0200)] 
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol plugins
- add support to skip certificate validation in protocol-okhttp

2 months ago NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol...
Sebastian Nagel [Mon, 8 Oct 2018 15:47:59 +0000 (17:47 +0200)] 
NUTCH-2648 Make configurable whether TLS/SSL certificates are checked by protocol plugins
- enable/disable validation of certs by property
  `http.tls.certificates.check` (default: false/disabled)

2 months ago NUTCH-2630 Fetcher to log skipped records by robots.txt 387/head
Sebastian Nagel [Mon, 8 Oct 2018 12:50:51 +0000 (14:50 +0200)] 
NUTCH-2630 Fetcher to log skipped records by robots.txt
- change required log level to INFO (default) for messages
  reporting skipped URLs because of robots.txt rules
  (disallow or crawl delay larger than fetcher.max.crawl.delay)

2 months ago Merge pull request #369 from sebastian-nagel/NUTCH-2623-fetcher-queue-mode
Sebastian Nagel [Sun, 7 Oct 2018 19:12:08 +0000 (21:12 +0200)] 
Merge pull request #369 from sebastian-nagel/NUTCH-2623-fetcher-queue-mode

NUTCH-2623 Fetcher to guarantee delay for same host/domain/ip independent of http/https protocol

2 months ago Merge pull request #383 from sebastian-nagel/NUTCH-2644-crawldb-reader
Sebastian Nagel [Sun, 7 Oct 2018 19:08:53 +0000 (21:08 +0200)] 
Merge pull request #383 from sebastian-nagel/NUTCH-2644-crawldb-reader

NUTCH-2644 CrawlDbReader -dump ignores filter options

2 months ago Merge pull request #382 from sebastian-nagel/NUTCH-2634-ant-resolve-default
Sebastian Nagel [Sun, 7 Oct 2018 18:52:41 +0000 (20:52 +0200)] 
Merge pull request #382 from sebastian-nagel/NUTCH-2634-ant-resolve-default

NUTCH-2643 ant target "resolve-default" to depend on "init"

2 months ago Merge pull request #376 from sebastian-nagel/NUTCH-2635-generator-temporary-output
Sebastian Nagel [Sun, 7 Oct 2018 18:44:10 +0000 (20:44 +0200)] 
Merge pull request #376 from sebastian-nagel/NUTCH-2635-generator-temporary-output

NUTCH-2635 Generator writes unneeded temporary output

2 months ago Merge pull request #385 from sebastian-nagel/NUTCH-2642-index-more-date-timezone
Sebastian Nagel [Sun, 7 Oct 2018 17:09:41 +0000 (19:09 +0200)] 
Merge pull request #385 from sebastian-nagel/NUTCH-2642-index-more-date-timezone

NUTCH-2642 MoreIndexingFilter parses ISO 8601 UTC dates in local time zone

2 months ago NUTCH-2623 Fetcher to guarantee delay for same host/domain/ip 369/head
Sebastian Nagel [Fri, 27 Jul 2018 09:01:21 +0000 (11:01 +0200)] 
NUTCH-2623 Fetcher to guarantee delay for same host/domain/ip
independent of http/https protocol
- the modes byHost, byDomain and byIP now do not include the
  protocol for assigning URLs to queues
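
The effect of leaving the protocol out of the queue key can be sketched in Java as follows. This is only an illustration of the byHost case: the class and method names are hypothetical, and the real Fetcher queue code is not reproduced here.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration of queue keys; not the Nutch Fetcher implementation.
final class QueueIds {

    // Key that includes the protocol: http and https URLs of one host
    // land in different queues, so a per-queue delay does not span both.
    static String withProtocol(String url) {
        URI uri = URI.create(url);
        return uri.getScheme() + "://" + uri.getHost().toLowerCase(Locale.ROOT);
    }

    // Key without the protocol: all URLs of one host share a single queue.
    static String hostOnly(String url) {
        return URI.create(url).getHost().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(withProtocol("http://example.org/a"));
        System.out.println(withProtocol("https://example.org/b"));
        System.out.println(hostOnly("https://example.org/b"));
    }
}
```

With a host-only key, a politeness delay attached to the queue applies to the host as a whole, regardless of whether a URL uses http or https.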

2 months ago NUTCH-2647 Skip TLS certificate checks in protocol-http plugin
Markus Jelsma [Fri, 28 Sep 2018 09:25:31 +0000 (11:25 +0200)] 
NUTCH-2647 Skip TLS certificate checks in protocol-http plugin

2 months ago Merge pull request #356 from r0ann3l/NUTCH-2602
Roannel Fernández Hernández [Thu, 27 Sep 2018 19:37:20 +0000 (15:37 -0400)] 
Merge pull request #356 from r0ann3l/NUTCH-2602

fix for NUTCH-2602: Index writers description

2 months ago Merge branch 'master' into NUTCH-2602 356/head
r0ann3l [Thu, 27 Sep 2018 16:13:46 +0000 (12:13 -0400)] 
Merge branch 'master' into NUTCH-2602

# Conflicts:
# src/plugin/indexer-dummy/src/java/org/apache/nutch/indexwriter/dummy/DummyIndexWriter.java
# src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java

2 months ago NUTCH-2642 MoreIndexingFilter parses ISO 8601 UTC dates in local time zone 385/head
Sebastian Nagel [Wed, 26 Sep 2018 20:53:41 +0000 (22:53 +0200)] 
NUTCH-2642 MoreIndexingFilter parses ISO 8601 UTC dates in local time zone
- fix date pattern not to convert date in local time zone
  (as suggested by John Lacey)
- do not add last-modified date if there is none
- add unit test
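
A minimal Java sketch of this class of bug, under the assumption that the date pattern treats the trailing "Z" as a literal character. The class name and the pattern are hypothetical and are not copied from MoreIndexingFilter.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical illustration; not the MoreIndexingFilter code.
final class UtcDateParsing {

    // The quoted 'Z' is matched as plain text, so the formatter's own
    // time zone decides how the wall-clock fields are interpreted.
    static long parseMillis(String isoDate, TimeZone zone) {
        SimpleDateFormat format =
            new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss'Z'", Locale.ROOT);
        format.setTimeZone(zone);
        try {
            return format.parse(isoDate).getTime();
        } catch (ParseException e) {
            throw new IllegalArgumentException("Not an ISO 8601 UTC date: " + isoDate, e);
        }
    }

    public static void main(String[] args) {
        String date = "2018-09-26T20:53:41Z";
        long asUtc = parseMillis(date, TimeZone.getTimeZone("UTC"));
        long asLocal = parseMillis(date, TimeZone.getTimeZone("GMT+02:00"));
        // Difference in milliseconds between the two interpretations.
        System.out.println(asUtc - asLocal);
    }
}
```

Setting the formatter's zone to UTC explicitly keeps the parsed instant independent of the machine's default time zone.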

3 months ago NUTCH-2645 Webgraph tools ignore command-line options 383/head
Sebastian Nagel [Thu, 13 Sep 2018 10:17:04 +0000 (12:17 +0200)] 
NUTCH-2645 Webgraph tools ignore command-line options
- must set values of command-line options in job configuration
  to pass them to job tasks
- use separate job configuration for separate web graph jobs/steps
- make NodeDumper job/tool to log to stdout

3 months ago ProtocolStatusStatistics: job configuration should not be static
Sebastian Nagel [Thu, 13 Sep 2018 10:16:50 +0000 (12:16 +0200)] 
ProtocolStatusStatistics: job configuration should not be static

3 months ago NUTCH-2644 CrawlDbReader -dump ignores filter options
Sebastian Nagel [Wed, 12 Sep 2018 13:04:05 +0000 (15:04 +0200)] 
NUTCH-2644 CrawlDbReader -dump ignores filter options
- need to pass filter options via job configuration into mapper
  (modifying original configuration does not have any effect)

3 months ago NUTCH-2643 ant target "resolve-default" to depend on "init" 382/head
Sebastian Nagel [Wed, 12 Sep 2018 09:02:54 +0000 (11:02 +0200)] 
NUTCH-2643 ant target "resolve-default" to depend on "init"

3 months ago NUTCH-2639 bin/nutch fails to set native library path on Cygwin causing jobs to fail...
rustyx [Tue, 4 Sep 2018 10:51:02 +0000 (12:51 +0200)] 
NUTCH-2639 bin/nutch fails to set native library path on Cygwin causing jobs to fail with UnsatisfiedLinkError
Pick fix contributed by rustyx for 2.x

A fix for NUTCH-2639.
Will not set java.library.path unnecessarily, so that hadoop.dll/libhadoop.so can be found in PATH.

4 months ago Merge pull request #365 from sebastian-nagel/NUTCH-2621-3rd-party-license-report
Sebastian Nagel [Fri, 17 Aug 2018 09:59:50 +0000 (11:59 +0200)] 
Merge pull request #365 from sebastian-nagel/NUTCH-2621-3rd-party-license-report

NUTCH-2621 Generate report of third-party licenses

4 months ago NUTCH-2632 protocol-okhttp doesn't accept proxy authentication
Sebastian Nagel [Fri, 17 Aug 2018 09:48:56 +0000 (11:48 +0200)] 
NUTCH-2632 protocol-okhttp doesn't accept proxy authentication
- merge PR #375 from branch 'sjwoodard-master', contributed by Steven W.

4 months ago NUTCH-2632 protocol-okhttp doesn't accept proxy authentication
Sebastian Nagel [Fri, 17 Aug 2018 09:45:37 +0000 (11:45 +0200)] 
NUTCH-2632 protocol-okhttp doesn't accept proxy authentication
- apply code formatting rules
- add pointer to NUTCH-2636 and square/okhttp#3995 regarding
  limitations to use http.proxy.exclusion.list together with
  http.proxy.username

4 months ago NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10...
Lewis John McGibbney [Sat, 11 Aug 2018 00:43:36 +0000 (17:43 -0700)] 
NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374)

4 months ago NUTCH-2635 Generator writes unneeded temporary output 376/head
Sebastian Nagel [Thu, 16 Aug 2018 19:23:11 +0000 (21:23 +0200)] 
NUTCH-2635 Generator writes unneeded temporary output
- output is written to MultipleOutputs, skip context.write(...)
- fix comment wrapping

4 months ago NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10...
Lewis John McGibbney [Sat, 11 Aug 2018 00:43:36 +0000 (17:43 -0700)] 
NUTCH-2633 Fix deprecation warnings when building Nutch master branch under JDK 10.0.2+13 (#374)

4 months ago NUTCH-2632 protocol-okhttp proxy authentication 375/head
Steven Woodard [Thu, 9 Aug 2018 18:56:03 +0000 (18:56 +0000)] 
NUTCH-2632 protocol-okhttp proxy authentication

4 months ago Prepare for new development after release of 1.15
Sebastian Nagel [Thu, 26 Jul 2018 12:55:38 +0000 (14:55 +0200)] 
Prepare for new development after release of 1.15
- bump version number (1.15-SNAPSHOT -> 1.16-SNAPSHOT)
- add 1.15 changes / release notes

4 months ago Fixes for NUTCH-2602: Description as a table with columns: KEY, DESCRIPTION, VALUE.
r0ann3l [Mon, 30 Jul 2018 20:09:43 +0000 (16:09 -0400)] 
Fixes for NUTCH-2602: Description as a table with columns: KEY, DESCRIPTION, VALUE.

4 months ago Merge pull request #366 from sebastian-nagel/NUTCH-2622-unbundle-lgpl-licensed-jars
Sebastian Nagel [Wed, 25 Jul 2018 13:34:09 +0000 (15:34 +0200)] 
Merge pull request #366 from sebastian-nagel/NUTCH-2622-unbundle-lgpl-licensed-jars

NUTCH-2622 Unbundle LGPL-licensed jars from binary release

4 months ago Merge pull request #367 from sebastian-nagel/NUTCH-2624-protocol-okhttp-resource...
Sebastian Nagel [Wed, 25 Jul 2018 13:30:45 +0000 (15:30 +0200)] 
Merge pull request #367 from sebastian-nagel/NUTCH-2624-protocol-okhttp-resource-leak

NUTCH-2624 protocol-okhttp resource leak

4 months ago Merge branch 'master' into NUTCH-2602
r0ann3l [Tue, 24 Jul 2018 15:34:56 +0000 (11:34 -0400)] 
Merge branch 'master' into NUTCH-2602

4 months ago NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances 368/head
Sebastian Nagel [Tue, 24 Jul 2018 14:19:04 +0000 (16:19 +0200)] 
NUTCH-2625 ProtocolFactory.getProtocol(url) may create multiple plugin instances
- lock critical block (conditional creation of plugin instance)
  on object cache object
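
The locking pattern named above, conditional creation guarded by a lock on the cache object, might look roughly like the following Java sketch. The class, its fields, and the use of a plain HashMap are assumptions made for illustration; this is not the ProtocolFactory or ObjectCache code.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical illustration; not the Nutch ProtocolFactory or ObjectCache.
final class InstanceCache {

    private final Map<String, Object> cache = new HashMap<>();

    // Counts how often the factory actually ran (for demonstration only).
    int created = 0;

    // The lookup and the conditional creation form one critical block, so two
    // threads asking for the same key cannot both see "missing" and both create.
    Object getOrCreate(String key, Supplier<Object> factory) {
        synchronized (cache) {
            Object instance = cache.get(key);
            if (instance == null) {
                instance = factory.get();
                cache.put(key, instance);
                created++;
            }
            return instance;
        }
    }

    public static void main(String[] args) {
        InstanceCache cache = new InstanceCache();
        Object first = cache.getOrCreate("http", Object::new);
        Object second = cache.getOrCreate("http", Object::new);
        System.out.println(first == second);
    }
}
```

Without the lock, the check and the creation are separate steps, which is how more than one instance per key can come into existence.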

4 months ago NUTCH-2624 protocol-okhttp resource leak 367/head
Sebastian Nagel [Mon, 23 Jul 2018 15:57:38 +0000 (17:57 +0200)] 
NUTCH-2624 protocol-okhttp resource leak
- make sure responses are closed to avoid
  that connections/sockets leak

4 months ago NUTCH-2622 Unbundle LGPL-licensed jars from binary release 366/head
Sebastian Nagel [Fri, 20 Jul 2018 12:29:16 +0000 (14:29 +0200)] 
NUTCH-2622 Unbundle LGPL-licensed jars from binary release
- exclude LGPL-licensed dsiutils and libidn as dependencies
  of webarchive-commons
- provide instructions how to include the excluded libs
  when building from source
- fix whitespace in ivy file

4 months ago NUTCH-2621 Generate report of third-party licenses 365/head
Sebastian Nagel [Fri, 20 Jul 2018 11:48:10 +0000 (13:48 +0200)] 
NUTCH-2621 Generate report of third-party licenses
- add ant target `ant report-licenses` (for core and plugins)
  which generates tabular reports of dependency licenses

5 months ago Merge pull request #364 from sebastian-nagel/NUTCH-1993-use-backup-parsers
Sebastian Nagel [Thu, 19 Jul 2018 13:06:09 +0000 (15:06 +0200)] 
Merge pull request #364 from sebastian-nagel/NUTCH-1993-use-backup-parsers

NUTCH-1993 Nutch does not use backup parsers

5 months ago Merge pull request #361 from sebastian-nagel/NUTCH-2619-protocol-okhttp-partial-as...
Sebastian Nagel [Thu, 19 Jul 2018 12:56:26 +0000 (14:56 +0200)] 
Merge pull request #361 from sebastian-nagel/NUTCH-2619-protocol-okhttp-partial-as-truncated

NUTCH-2619 protocol-okhttp: allow to keep partially fetched docs as truncated

5 months ago Merge pull request #355 from sebastian-nagel/NUTCH-2152
Sebastian Nagel [Thu, 19 Jul 2018 12:53:45 +0000 (14:53 +0200)] 
Merge pull request #355 from sebastian-nagel/NUTCH-2152

NUTCH-2152 CommonCrawl dump via Service endpoint

5 months ago Merge pull request #363 from sebastian-nagel/NUTCH-2616-exchange-route-deletions
Sebastian Nagel [Thu, 19 Jul 2018 12:51:04 +0000 (14:51 +0200)] 
Merge pull request #363 from sebastian-nagel/NUTCH-2616-exchange-route-deletions

NUTCH-2616 Review routing of deletions by Exchange component

5 months ago NUTCH-1993 Nutch does not use backup parsers 364/head
Sebastian Nagel [Tue, 17 Jul 2018 13:53:14 +0000 (15:53 +0200)] 
NUTCH-1993 Nutch does not use backup parsers
- apply patch contributed by Arkadi Kosmynin
- return last failed parse result (empty result
  does not contain the failure reason)
- improve javadoc

5 months ago Merge pull request #359 from sebastian-nagel/NUTCH-1106-max-outlink-length
Sebastian Nagel [Tue, 17 Jul 2018 11:31:20 +0000 (13:31 +0200)] 
Merge pull request #359 from sebastian-nagel/NUTCH-1106-max-outlink-length

NUTCH-1106 Options to skip url's based on length

5 months ago Merge pull request #358 from sebastian-nagel/NUTCH-2071
Sebastian Nagel [Tue, 17 Jul 2018 11:29:23 +0000 (13:29 +0200)] 
Merge pull request #358 from sebastian-nagel/NUTCH-2071

NUTCH-2071 A parser failure on a single document may fail crawling job if parser.timeout=-1

5 months ago NUTCH-2616 Review routing of deletions by Exchange component 363/head
Sebastian Nagel [Tue, 17 Jul 2018 11:00:47 +0000 (13:00 +0200)] 
NUTCH-2616 Review routing of deletions by Exchange component
- send deletions to all index writers

5 months ago NUTCH-2620 urlfilter-validator incorrectly assumes that top-level domains are not...
Sebastian Nagel [Mon, 16 Jul 2018 10:02:21 +0000 (12:02 +0200)] 
NUTCH-2620 urlfilter-validator incorrectly assumes that top-level domains are not longer than 4 characters

Merge pull request #362 from owenson/urlvalid-fix
[NUTCH-2620] Fix invalid assumption in URL validator
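
The assumption about top-level domain length can be illustrated with two regular expressions in Java. Both patterns and the class name are hypothetical and much simpler than a real URL validator; they are not taken from the urlfilter-validator plugin.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; not the urlfilter-validator implementation.
final class TldLengthCheck {

    // Caps the top-level domain at four letters, which rejects longer TLDs.
    static final Pattern CAPPED =
        Pattern.compile("^(?:[a-z0-9-]+\\.)+[a-z]{2,4}$");

    // Same shape without the upper bound on the top-level domain.
    static final Pattern UNCAPPED =
        Pattern.compile("^(?:[a-z0-9-]+\\.)+[a-z]{2,}$");

    static boolean accepts(Pattern pattern, String host) {
        return pattern.matcher(host).matches();
    }

    public static void main(String[] args) {
        String host = "example.photography";
        System.out.println(accepts(CAPPED, host));
        System.out.println(accepts(UNCAPPED, host));
    }
}
```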

5 months ago typo in fix 362/head
Gareth Owen [Fri, 13 Jul 2018 07:21:38 +0000 (08:21 +0100)] 
typo in fix

5 months ago Fix invalid assumption in URL validator
Gareth Owen [Fri, 13 Jul 2018 07:18:51 +0000 (08:18 +0100)] 
Fix invalid assumption in URL validator

5 months ago NUTCH-2071 358/head
Sebastian Nagel [Mon, 9 Jul 2018 15:55:03 +0000 (17:55 +0200)] 
NUTCH-2071
- also catch any Throwable if parser is called by extension ID

5 months ago NUTCH-2071 A parser failure on a single document
Sebastian Nagel [Mon, 9 Jul 2018 15:39:11 +0000 (17:39 +0200)] 
NUTCH-2071 A parser failure on a single document
may fail crawling job if parser.timeout=-1
- also catch any Throwable if parser.timeout == -1
  (parser is not called from ExecutorService)
- improve log message: show full class name
  of called parser

5 months ago NUTCH-1106 Options to skip url's based on length 359/head
Sebastian Nagel [Tue, 10 Jul 2018 12:01:37 +0000 (14:01 +0200)] 
NUTCH-1106 Options to skip url's based on length
- most browsers support URLs up to around 2048 characters
- use this value for the rule in regex-urlfilter.txt
- limit outlink length to 4096 characters to allow additional
  characters removed during normalization (anchor, query args)

5 months ago NUTCH-1106 Options to skip url's based on length
Sebastian Nagel [Tue, 10 Jul 2018 11:00:59 +0000 (13:00 +0200)] 
NUTCH-1106 Options to skip url's based on length
- add property db.max.outlink.length to limit length
  of outlinks and redirects (default = 8192 characters)
- add rule (not active) to regex-urlfilters.txt.template

5 months ago NUTCH-2619 protocol-okhttp: allow to keep partially fetched docs as truncated 361/head
Sebastian Nagel [Wed, 11 Jul 2018 10:18:26 +0000 (12:18 +0200)] 
NUTCH-2619 protocol-okhttp: allow to keep partially fetched docs as truncated
- return content as successful response (marked as truncated)
  if http.partial.truncated is true and there is already content fetched

5 months ago NUTCH-2618 protocol-okhttp not to use http.timeout for max duration to fetch document
Sebastian Nagel [Tue, 10 Jul 2018 16:11:57 +0000 (18:11 +0200)] 
NUTCH-2618 protocol-okhttp not to use http.timeout for max duration to fetch document
- add property http.time.limit to configure the max. time allowed to
  fetch a single document
- add reason of truncation (content or time) to response metadata
- rename "trimmed" -> "truncated" to follow common Nutch terminology