nutch.git
17 months ago Prepare for Nutch 1.18 release branch-1.18 release-1.18
Lewis John McGibbney [Tue, 19 Jan 2021 23:33:48 +0000 (15:33 -0800)] 
Prepare for Nutch 1.18 release

17 months ago Prepare for Nutch 1.18 release
Lewis John McGibbney [Thu, 14 Jan 2021 23:27:00 +0000 (15:27 -0800)] 
Prepare for Nutch 1.18 release

17 months ago NUTCH-2841 Upgrade xercesImpl dependency (#563)
Lewis John McGibbney [Wed, 13 Jan 2021 18:56:07 +0000 (10:56 -0800)] 
NUTCH-2841 Upgrade xercesImpl dependency (#563)

* NUTCH-2841 Upgrade xercesImpl dependency

17 months ago NUTCH-2837 Update multiple dependencies (#560)
Lewis John McGibbney [Fri, 8 Jan 2021 18:01:38 +0000 (10:01 -0800)] 
NUTCH-2837 Update multiple dependencies (#560)

* NUTCH-2837 Upgrade Slf4j dependencies

* NUTCH-2837 Update multiple dependencies

17 months ago NUTCH-2836 Upgrade various commons dependencies (#559)
Lewis John McGibbney [Fri, 8 Jan 2021 04:41:37 +0000 (20:41 -0800)] 
NUTCH-2836 Upgrade various commons dependencies (#559)

18 months ago Add possibility to setup deduplication group mode in crawl script (#557)
Jakob Berlin [Thu, 17 Dec 2020 16:59:30 +0000 (17:59 +0100)] 
Add possibility to setup deduplication group mode in crawl script (#557)

18 months ago NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
Lewis John McGibbney [Thu, 17 Dec 2020 16:56:04 +0000 (08:56 -0800)] 
NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)

18 months ago Merge pull request #556 from sebastian-nagel/tika-1.25
Sebastian Nagel [Tue, 8 Dec 2020 12:54:31 +0000 (13:54 +0100)] 
Merge pull request #556 from sebastian-nagel/tika-1.25

NUTCH-2833 Upgrade to Tika 1.25

19 months ago NUTCH-2833 Upgrade to Tika 1.25 556/head
Sebastian Nagel [Fri, 27 Nov 2020 13:53:06 +0000 (14:53 +0100)] 
NUTCH-2833 Upgrade to Tika 1.25

19 months ago Merge pull request #554 from sebastian-nagel/NUTCH-2582-set-mime-types-reader-pool...
Sebastian Nagel [Wed, 18 Nov 2020 11:26:44 +0000 (12:26 +0100)] 
Merge pull request #554 from sebastian-nagel/NUTCH-2582-set-mime-types-reader-pool-size

NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika

19 months ago NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)
Lewis John McGibbney [Wed, 18 Nov 2020 03:10:18 +0000 (19:10 -0800)] 
NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)

* NUTCH-2809 Upgrade any23 plugin dependency to 2.4

19 months ago Merge pull request #555 from sebastian-nagel/NUTCH-2829-ant-clean-cache
Sebastian Nagel [Fri, 13 Nov 2020 18:11:51 +0000 (19:11 +0100)] 
Merge pull request #555 from sebastian-nagel/NUTCH-2829-ant-clean-cache

NUTCH-2829 Fix ant target "clean-cache"

19 months ago NUTCH-2829 Fix ant target "clean-cache" 555/head
Sebastian Nagel [Tue, 10 Nov 2020 11:06:16 +0000 (12:06 +0100)] 
NUTCH-2829 Fix ant target "clean-cache"
- make target "clean-cache" depend on "ivy-init" so that
  ivy-related resources are defined

19 months ago NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika 554/head
Sebastian Nagel [Fri, 16 Oct 2020 21:10:03 +0000 (23:10 +0200)] 
NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika
- add method in MimeUtil to set MimeTypesReader pool size
- actually adjust pool size to number of Fetcher threads / 2
  (minimum pool size is 10 in case there are less than 20 Fetcher threads)
- double pool size (10 -> 20) of Tika XMLReaderUtils in tika-config.xml
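
The sizing rule described above can be sketched as follows. This is only an illustration of the arithmetic stated in the commit message, not the code in MimeUtil; the class and method names (`MimePoolSize`, `poolSizeFor`) are hypothetical.

```java
// Hypothetical sketch of the sizing rule from the commit message:
// pool size = number of Fetcher threads / 2, but never below 10.
public final class MimePoolSize {

    /** Minimum pool size named in the commit message. */
    public static final int MIN_POOL_SIZE = 10;

    /** Returns half the number of fetcher threads, or the minimum if that is larger. */
    public static int poolSizeFor(int fetcherThreads) {
        return Math.max(MIN_POOL_SIZE, fetcherThreads / 2);
    }

    public static void main(String[] args) {
        for (int threads : new int[] {10, 20, 50}) {
            System.out.println(threads + " threads -> pool size " + poolSizeFor(threads));
        }
    }
}
```

With fewer than 20 threads the minimum applies, which matches the parenthetical note in the message.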

21 months ago Merge pull request #552 from sebastian-nagel/NUTCH-2824
Sebastian Nagel [Mon, 14 Sep 2020 12:27:39 +0000 (14:27 +0200)] 
Merge pull request #552 from sebastian-nagel/NUTCH-2824

NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names

21 months ago Merge pull request #551 from sebastian-nagel/NUTCH-2823
Sebastian Nagel [Mon, 14 Sep 2020 12:06:23 +0000 (14:06 +0200)] 
Merge pull request #551 from sebastian-nagel/NUTCH-2823

NUTCH-2823 IllegalStateException in IndexWriters.describe() when vali…

21 months ago NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names 552/head
Sebastian Nagel [Mon, 14 Sep 2020 11:55:34 +0000 (13:55 +0200)] 
NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names

- add unit tests to verify that a declared MalformedURLException is thrown
  on host names containing illegal percent-encoded sequences and
  any (undeclared) runtime exceptions are caught and rethrown

22 months ago NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names
Sebastian Nagel [Fri, 21 Aug 2020 14:38:28 +0000 (16:38 +0200)] 
NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names

22 months ago Merge pull request #549 from sebastian-nagel/NUTCH-2818-ant-rat-task
Sebastian Nagel [Tue, 18 Aug 2020 09:49:01 +0000 (11:49 +0200)] 
Merge pull request #549 from sebastian-nagel/NUTCH-2818-ant-rat-task

NUTCH-2818 Fix Apache Rat task to check sources for license headers

22 months ago Merge pull request #546 from sebastian-nagel/NUTCH-2814-http-date-format-time-zone
Sebastian Nagel [Tue, 18 Aug 2020 09:41:46 +0000 (11:41 +0200)] 
Merge pull request #546 from sebastian-nagel/NUTCH-2814-http-date-format-time-zone

NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date

22 months ago NUTCH-2823 IllegalStateException in IndexWriters.describe() when validating url param... 551/head
Sebastian Nagel [Mon, 17 Aug 2020 14:54:42 +0000 (16:54 +0200)] 
NUTCH-2823 IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer

- when calculating required column width for first (param names) and
  third column (param values): verify that none of these columns occupy
  more than one third of the table width, otherwise reset width to 1/3
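
The width check described above amounts to capping a column at one third of the table. A minimal sketch, assuming integer character widths; `ColumnWidths` and `capToThird` are hypothetical names and not taken from IndexWriters:

```java
// Hypothetical sketch: keep a column's required width unless it would
// exceed one third of the table width, in which case use one third.
public final class ColumnWidths {

    public static int capToThird(int requiredWidth, int tableWidth) {
        int third = tableWidth / 3;
        return Math.min(requiredWidth, third);
    }

    public static void main(String[] args) {
        // A very long parameter value no longer pushes the column past 1/3.
        System.out.println(capToThird(50, 90));
        // A short column keeps its natural width.
        System.out.println(capToThird(10, 90));
    }
}
```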

22 months ago NUTCH-2697 Upgrade Ivy to fix the issue of an unset packaging.type property
Sebastian Nagel [Wed, 12 Aug 2020 10:51:51 +0000 (12:51 +0200)] 
NUTCH-2697 Upgrade Ivy to fix the issue of an unset packaging.type property
NUTCH-2671 Upgrade ant ivy library
- upgrade Ivy (2.4.0 -> 2.5.0)
- upgrade all plugins build-ivy.xml to use the ivy jar 2.5.0 installed in
  $NUTCH_HOME/ivy/ for preparing lists of dependencies registered in plugin.xml

22 months ago Merge branch 'derhecht-patch-2', closes #545
Sebastian Nagel [Sun, 16 Aug 2020 18:59:33 +0000 (20:59 +0200)] 
Merge branch 'derhecht-patch-2', closes #545

Includes solution for (closes #544)
NUTCH-2813 MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14

22 months ago NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
Sebastian Nagel [Sat, 8 Aug 2020 08:54:42 +0000 (10:54 +0200)] 
NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
- replace check whether URL path and file are identical
  by check whether URL has a query
- clean up code and improve log messages
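
For context: in `java.net.URL`, `getFile()` returns the path plus the query string, so path and file differ exactly when a query is present. Comparing the two strings with `==`/`!=` compares object references and is unreliable. The replacement check can be sketched like this; the class and method names are hypothetical, and this is not the patched Nutch code.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical sketch: ask the URL directly whether it has a query
// instead of comparing getPath() and getFile() by reference.
public final class UrlQueryCheck {

    public static boolean hasQuery(URL url) {
        return url.getQuery() != null;
    }

    /** Convenience overload for string input; wraps the checked exception. */
    public static boolean hasQuery(String url) {
        try {
            return hasQuery(URI.create(url).toURL());
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Not a valid URL: " + url, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(hasQuery("https://example.org/a/b.html"));
        System.out.println(hasQuery("https://example.org/a/b.html?x=1"));
    }
}
```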

22 months ago NUTCH-2816 Add Spotbugs target to ant build
Sebastian Nagel [Thu, 6 Aug 2020 17:24:35 +0000 (19:24 +0200)] 
NUTCH-2816 Add Spotbugs target to ant build
- called on-demand as ant target "spotbugs"
- creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins

22 months ago NUTCH-2811 : Setup Github workflows for prs (#543)
Madhawa Gunasekara [Mon, 3 Aug 2020 15:10:45 +0000 (17:10 +0200)] 
NUTCH-2811 : Setup Github workflows for prs (#543)

* NUTCH-2811 : Setup Github workflows for prs

Merging to setup the git workflows.

22 months ago NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
Sebastian Nagel [Mon, 27 Jul 2020 10:05:23 +0000 (12:05 +0200)] 
NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists

22 months ago [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as...
Sebastian Nagel [Fri, 10 Jul 2020 14:01:18 +0000 (16:01 +0200)] 
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- clarify comment regarding bypassing the confidence check for a non-empty http.agent.name

22 months ago [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as...
Sebastian Nagel [Fri, 10 Jul 2020 13:13:49 +0000 (15:13 +0200)] 
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- if no agent names are given as command-line arguments use values of
  http.agent.name and http.robots.agents as agent names to be checked
- update command-line help

22 months ago NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" to...
Jakob Berlin [Mon, 3 Aug 2020 13:56:44 +0000 (15:56 +0200)] 
NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" to a config file

22 months ago NUTCH-2799 Add .asf.yaml file
Sebastian Nagel [Fri, 10 Jul 2020 13:28:53 +0000 (15:28 +0200)] 
NUTCH-2799 Add .asf.yaml file
- update pull request template regarding Jira linking:
  issue id should be in square brackets (`[NUTCH-XXXX]`)

22 months ago NUTCH-2799 Add .asf.yaml file
Sebastian Nagel [Wed, 8 Jul 2020 14:30:55 +0000 (16:30 +0200)] 
NUTCH-2799 Add .asf.yaml file
- add project description in one sentence
- add github topics
- set github mailing list notifications as configured before

22 months ago NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
Shashanka Balakuntala Srinivasa [Wed, 29 Jul 2020 14:35:04 +0000 (20:05 +0530)] 
NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)

NUTCH-2805: Rename plugin urlfilter-domainblacklist

22 months ago NUTCH-2782: protocol-http / lib-http: support TLSv1.3
shbalaku [Fri, 10 Jul 2020 17:37:36 +0000 (23:07 +0530)] 
NUTCH-2782: protocol-http / lib-http: support TLSv1.3

22 months ago [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
Sebastian Nagel [Mon, 6 Jul 2020 12:03:33 +0000 (14:03 +0200)] 
[NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
- sitemap links from robots.txt are treated as set by crawler-commons
  (since crawler-commons 1.1)
- sitemaps referenced in sitemap index are deduplicated
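
The effect of switching from a List to a Set can be sketched as below. This only illustrates the deduplication idea with plain strings; `SitemapDedup` is a hypothetical name, and the choice of `LinkedHashSet` (to keep first-seen order) is an assumption, not something the commit message states.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: collecting sitemap URLs in a Set drops repeated
// references, so each sitemap is processed once.
public final class SitemapDedup {

    public static Set<String> dedup(List<String> sitemapUrls) {
        // LinkedHashSet keeps insertion order (an assumption made for readability).
        return new LinkedHashSet<>(sitemapUrls);
    }

    public static void main(String[] args) {
        List<String> fromIndex = Arrays.asList(
            "https://example.org/sitemap-1.xml",
            "https://example.org/sitemap-2.xml",
            "https://example.org/sitemap-1.xml");
        System.out.println(dedup(fromIndex));
    }
}
```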

22 months ago [NUTCH-2796] Upgrade to crawler-commons 1.1
Sebastian Nagel [Mon, 6 Jul 2020 12:02:39 +0000 (14:02 +0200)] 
[NUTCH-2796] Upgrade to crawler-commons 1.1

22 months ago Prepare for new development after release of 1.17
Sebastian Nagel [Wed, 17 Jun 2020 21:00:09 +0000 (23:00 +0200)] 
Prepare for new development after release of 1.17
- bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT)
- add 1.17 changes / release notes
- update links to Hadoop and Solr API docs
- update current year in API docs etc.

22 months ago NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
Markus Jelsma [Wed, 17 Jun 2020 11:21:24 +0000 (13:21 +0200)] 
NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite

22 months ago NUTCH-2791 Handle GCS URLs in stats commands
Patrick Mezard [Tue, 9 Jun 2020 15:39:41 +0000 (17:39 +0200)] 
NUTCH-2791 Handle GCS URLs in stats commands

- Handle Google Cloud Storage URLs as crawldb inputs in domainstats,
  protocolstats and crawlcomplete commands.
- Correctly resolve numReducers in protocolstats.
- Align crawlcomplete -inputDirs behaviour on the other commands: expect
  directories containing "current", not "crawldb/current".

22 months ago NUTCH-2789 Documentation: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 10:28:00 +0000 (12:28 +0200)] 
NUTCH-2789 Documentation: update links to point to cwiki

22 months ago NUTCH-2789 Docker README: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 10:06:17 +0000 (12:06 +0200)] 
NUTCH-2789 Docker README: update links to point to cwiki

22 months ago NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
Sebastian Nagel [Tue, 9 Jun 2020 09:41:37 +0000 (11:41 +0200)] 
NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
- switch to multi-line presentation of Metadata in ParseData::toString
- default implementation of Metadata::toString is still single-line
- replace StringBuffer by StringBuilder in modified methods

22 months ago NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
Sebastian Nagel [Tue, 9 Jun 2020 12:17:40 +0000 (14:17 +0200)] 
NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
- add JsonSerializer to write common Writable types (null, boolean, numbers)
- remaining "unknown" Writables are written after calling toString()

22 months ago NUTCH-2790 indexer-csv: escape field leading quote character
Patrick Mezard [Tue, 9 Jun 2020 15:00:16 +0000 (17:00 +0200)] 
NUTCH-2790 indexer-csv: escape field leading quote character

Before the change, the leading quote of a field value like '"value'
would be left unescaped.
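
To illustrate the kind of bug described, here is a minimal sketch of quoting a CSV field so that every quote character, including one in the first position, is escaped. It assumes RFC 4180 style escaping (doubling the quote); the plugin's actual quote and escape characters may differ, and `CsvQuote` is a hypothetical name.

```java
// Hypothetical sketch, assuming RFC 4180 style escaping: every quote
// character inside the value is doubled, including a leading one.
public final class CsvQuote {

    public static String quoteField(String value) {
        String escaped = value.replace("\"", "\"\"");
        return "\"" + escaped + "\"";
    }

    public static void main(String[] args) {
        // The value from the commit message: a field starting with a quote.
        System.out.println(quoteField("\"value"));
    }
}
```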

22 months ago NUTCH-2496 Speed up link inversion step in crawling script
Sebastian Nagel [Fri, 15 May 2020 17:17:00 +0000 (19:17 +0200)] 
NUTCH-2496 Speed up link inversion step in crawling script

- disable URL filtering and normalizing when calling invertlinks
  in bin/crawl

- add note that the steps invertlinks, dedup, index could also
  be done outside the loop over all segments created in the loop
  iterations

- move webgraph construction (commented out anyway) outside the
  loop because it's done over all available segments

22 months ago NUTCH-2720 ROBOTS metatag ignored when capitalized
Sebastian Nagel [Sun, 17 May 2020 12:37:47 +0000 (14:37 +0200)] 
NUTCH-2720 ROBOTS metatag ignored when capitalized
- move string "robots" to constant in metadata.Nutch
- make string lowercase not depend on system locale
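
Why the locale matters: `String.toLowerCase()` without arguments uses the default locale, and under a Turkish locale an uppercase `I` becomes a dotless `ı`. A short sketch of locale-independent lowercasing with `Locale.ROOT`; the class name is hypothetical and the sample value `NOINDEX` is chosen only because it contains an `I`.

```java
import java.util.Locale;

// Hypothetical sketch: lowercase meta tag names and values with
// Locale.ROOT so the result does not depend on the system locale.
public final class LocaleSafeLowercase {

    public static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        Locale turkish = Locale.forLanguageTag("tr");
        // Locale-sensitive: the 'I' is mapped to a dotless 'ı' under Turkish rules.
        System.out.println("NOINDEX".toLowerCase(turkish));
        // Locale-independent result.
        System.out.println(lower("NOINDEX"));
    }
}
```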

22 months ago NUTCH-2720 ROBOTS metatag ignored when capitalized
Sebastian Nagel [Fri, 15 May 2020 21:11:08 +0000 (23:11 +0200)] 
NUTCH-2720 ROBOTS metatag ignored when capitalized

- parse-tika: add lowercase "robots" to metadata

22 months ago NUTCH-2419 Some URL filters and normalizers do not respect command-line override...
Sebastian Nagel [Wed, 13 May 2020 12:39:15 +0000 (14:39 +0200)] 
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

- simplify selection of rule file (from property or attribute in plugin.xml)

22 months ago NUTCH-2419 Some URL filters and normalizers do not respect command-line override...
Sebastian Nagel [Fri, 27 Sep 2019 20:51:29 +0000 (22:51 +0200)] 
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

- fix urlfilter-domain, urlfilter-domainblacklist, urlfilter-prefix
  and urlfilter-suffix

- always prefer the configured rule file (urlfilter.domain.file,
  urlfilter.domainblacklist.file, urlfilter.prefix.file,
  urlfilter.suffix.file) over the file defined in plugin.xml

- remove constructors taking rule file as argument
  (used only in unit tests and now obsolete because we can override the
   rule file via configuration)

- update Java API doc comments
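
The precedence rule described above (configured property first, plugin.xml attribute as fall-back) can be sketched as follows. A plain `Map` stands in for the Nutch/Hadoop configuration object, and `RuleFileSelection`, `selectRuleFile` and the file names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: prefer the rule file named in the configuration
// over the one declared as an attribute in plugin.xml.
public final class RuleFileSelection {

    public static String selectRuleFile(Map<String, String> conf,
                                        String property,
                                        String pluginXmlFile) {
        String configured = conf.get(property);
        if (configured != null && !configured.trim().isEmpty()) {
            return configured.trim();
        }
        return pluginXmlFile;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("urlfilter.domain.file", "my-domains.txt");
        System.out.println(
            selectRuleFile(conf, "urlfilter.domain.file", "domain-urlfilter.txt"));
    }
}
```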

22 months ago NUTCH-1945 Test for XLSX parser
Sebastian Nagel [Tue, 5 May 2020 11:25:15 +0000 (13:25 +0200)] 
NUTCH-1945 Test for XLSX parser
- add Tika unit test for XLSX files
- bundle instance variables and utility methods in class TikaParserTest
- clean up javadoc comments

22 months ago NUTCH-2758 Add plugin READMEs to binary release packages
Sebastian Nagel [Thu, 30 Apr 2020 12:16:03 +0000 (14:16 +0200)] 
NUTCH-2758 Add plugin READMEs to binary release packages

22 months ago NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader
Sebastian Nagel [Thu, 30 Apr 2020 15:07:30 +0000 (17:07 +0200)] 
NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader

22 months ago NUTCH-2002 parse and index checkers to check robots.txt
Sebastian Nagel [Thu, 30 Apr 2020 10:58:05 +0000 (12:58 +0200)] 
NUTCH-2002 parse and index checkers to check robots.txt
- applied Julien's patch to recent code base
- also check redirects whether they are allowed
- add command-line parameter `-checkRobotsTxt` enabling this check

22 months ago NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch...
Sebastian Nagel [Wed, 29 Apr 2020 07:54:32 +0000 (09:54 +0200)] 
NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch lists
- add command-line option `-numFetchers` to FreeGenerator
- in local mode: generate one single fetch list

22 months ago NUTCH-1194 Generator: CrawlDB lock should be released earlier
Sebastian Nagel [Thu, 23 Apr 2020 13:55:32 +0000 (15:55 +0200)] 
NUTCH-1194 Generator: CrawlDB lock should be released earlier
- release CrawlDb lock after select step, in case, generated items
  are not marked in CrawlDb (generate.update.crawldb is false)

22 months ago NUTCH-2434 Add methods to reset parameters HTMLMetaTags
Sebastian Nagel [Tue, 5 May 2020 09:27:35 +0000 (11:27 +0200)] 
NUTCH-2434 Add methods to reset parameters HTMLMetaTags
(apply patch contributed by Markus)

22 months ago NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation
Sebastian Nagel [Wed, 29 Apr 2020 11:03:01 +0000 (13:03 +0200)] 
NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation
- modify ant build.xml to copy nutch-default.xml into docs/api/resources/
- adapt XSLT table layout
- remove obsolete nutch-conf.xsl
- fix typos and normalize spelling in nutch-default.xml

22 months ago NUTCH-2818 Fix Apache Rat task to check sources for license headers 549/head
Sebastian Nagel [Tue, 11 Aug 2020 09:19:14 +0000 (11:19 +0200)] 
NUTCH-2818 Fix Apache Rat task to check sources for license headers
- automatize download of Apache Rat jar file
- write report to build/apache-rat-report.txt

22 months ago Merge pull request #548 from sebastian-nagel/NUTCH-2817-spotbugs-object-equality
Sebastian Nagel [Tue, 11 Aug 2020 07:38:47 +0000 (09:38 +0200)] 
Merge pull request #548 from sebastian-nagel/NUTCH-2817-spotbugs-object-equality

[NUTCH-2817] Avoid check for equality of URL path and file part using == / !=

22 months ago Merge pull request #547 from sebastian-nagel/NUTCH-2816-add-spotbugs-ant-target
Sebastian Nagel [Tue, 11 Aug 2020 07:37:24 +0000 (09:37 +0200)] 
Merge pull request #547 from sebastian-nagel/NUTCH-2816-add-spotbugs-ant-target

[NUTCH-2816] Add Spotbugs target to ant build

22 months ago NUTCH-2817 Avoid check for equality of URL path and file part using ==/!= 548/head
Sebastian Nagel [Sat, 8 Aug 2020 08:54:42 +0000 (10:54 +0200)] 
NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
- replace check whether URL path and file are identical
  by check whether URL has a query
- clean up code and improve log messages

22 months ago NUTCH-2816 Add Spotbugs target to ant build 547/head
Sebastian Nagel [Thu, 6 Aug 2020 17:24:35 +0000 (19:24 +0200)] 
NUTCH-2816 Add Spotbugs target to ant build
- called on-demand as ant target "spotbugs"
- creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins

22 months ago NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date 546/head
Sebastian Nagel [Fri, 7 Aug 2020 16:12:22 +0000 (18:12 +0200)] 
NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date
- reset time zone to GMT after parsing a date
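
Background for this fix: the JDK documentation for `DateFormat.parse` notes that the formatter's time zone may be overwritten by the zone found in the parsed text and may need to be restored afterwards. A minimal sketch of that restore step, assuming a `SimpleDateFormat` with an HTTP-style pattern; `GmtDateParser` and its methods are hypothetical and not the HttpDateFormat code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

// Hypothetical sketch: restore GMT on the shared formatter after every
// parse so later format() calls are not affected by the parsed zone.
public final class GmtDateParser {

    private static final TimeZone GMT = TimeZone.getTimeZone("GMT");

    private final SimpleDateFormat fmt =
        new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);

    public GmtDateParser() {
        fmt.setTimeZone(GMT);
    }

    public synchronized Date parse(String text) throws ParseException {
        try {
            return fmt.parse(text);
        } finally {
            fmt.setTimeZone(GMT); // undo any zone change made while parsing
        }
    }

    public synchronized String format(Date date) {
        return fmt.format(date);
    }

    /** Parses and re-formats a date; wraps the checked exception for convenience. */
    public String roundTrip(String text) {
        try {
            return format(parse(text));
        } catch (ParseException e) {
            throw new IllegalArgumentException("Unparseable date: " + text, e);
        }
    }

    public synchronized String timeZoneId() {
        return fmt.getTimeZone().getID();
    }
}
```

The methods are synchronized because `SimpleDateFormat` is not thread-safe; that is a design choice of this sketch, not a statement about how Nutch handles it.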

23 months ago Merge pull request #542 from sebastian-nagel/NUTCH-2810
Sebastian Nagel [Mon, 3 Aug 2020 19:08:22 +0000 (21:08 +0200)] 
Merge pull request #542 from sebastian-nagel/NUTCH-2810

NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists

23 months ago Merge pull request #537 from sebastian-nagel/NUTCH-2801-robots-checker
Sebastian Nagel [Mon, 3 Aug 2020 19:06:48 +0000 (21:06 +0200)] 
Merge pull request #537 from sebastian-nagel/NUTCH-2801-robots-checker

[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back

23 months ago NUTCH-2811 : Setup Github workflows for prs (#543)
Madhawa Gunasekara [Mon, 3 Aug 2020 15:10:45 +0000 (17:10 +0200)] 
NUTCH-2811 : Setup Github workflows for prs (#543)

* NUTCH-2811 : Setup Github workflows for prs

Merging to setup the git workflows.

23 months ago Merge pull request #536 from sebastian-nagel/NUTCH-2799-asf-yaml-file
Sebastian Nagel [Sun, 2 Aug 2020 11:09:21 +0000 (13:09 +0200)] 
Merge pull request #536 from sebastian-nagel/NUTCH-2799-asf-yaml-file

[NUTCH-2799] Add .asf.yaml file

23 months ago NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
Shashanka Balakuntala Srinivasa [Wed, 29 Jul 2020 14:35:04 +0000 (20:05 +0530)] 
NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)

NUTCH-2805: Rename plugin urlfilter-domainblacklist

23 months ago NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists 542/head
Sebastian Nagel [Mon, 27 Jul 2020 10:05:23 +0000 (12:05 +0200)] 
NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists

23 months ago Merge pull request #538 from balashashanka/NUTCH-2782
Sebastian Nagel [Tue, 14 Jul 2020 10:51:53 +0000 (12:51 +0200)] 
Merge pull request #538 from balashashanka/NUTCH-2782

NUTCH-2782: protocol-http / lib-http: support TLSv1.3

23 months ago Merge pull request #535 from sebastian-nagel/NUTCH-2796-NUTCH-2730
Sebastian Nagel [Tue, 14 Jul 2020 10:48:29 +0000 (12:48 +0200)] 
Merge pull request #535 from sebastian-nagel/NUTCH-2796-NUTCH-2730

[NUTCH-2796] [NUTCH-2730] Update crawler-commons 1.1, SitemapProcessor to treat sitemap URLs as Set instead of List

23 months ago NUTCH-2782: protocol-http / lib-http: support TLSv1.3 538/head
shbalaku [Fri, 10 Jul 2020 17:37:36 +0000 (23:07 +0530)] 
NUTCH-2782: protocol-http / lib-http: support TLSv1.3

23 months ago [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as... 537/head
Sebastian Nagel [Fri, 10 Jul 2020 14:01:18 +0000 (16:01 +0200)] 
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- clarify comment regarding bypassing the confidence check for a non-empty http.agent.name

23 months ago NUTCH-2799 Add .asf.yaml file 536/head
Sebastian Nagel [Fri, 10 Jul 2020 13:28:53 +0000 (15:28 +0200)] 
NUTCH-2799 Add .asf.yaml file
- update pull request template regarding Jira linking:
  issue id should be in square brackets (`[NUTCH-XXXX]`)

23 months ago [NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as...
Sebastian Nagel [Fri, 10 Jul 2020 13:13:49 +0000 (15:13 +0200)] 
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- if no agent names are given as command-line arguments use values of
  http.agent.name and http.robots.agents as agent names to be checked
- update command-line help

23 months ago NUTCH-2799 Add .asf.yaml file
Sebastian Nagel [Wed, 8 Jul 2020 14:30:55 +0000 (16:30 +0200)] 
NUTCH-2799 Add .asf.yaml file
- add project description in one sentence
- add github topics
- set github mailing list notifications as configured before

2 years ago [NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List 535/head
Sebastian Nagel [Mon, 6 Jul 2020 12:03:33 +0000 (14:03 +0200)] 
[NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
- sitemap links from robots.txt are treated as set by crawler-commons
  (since crawler-commons 1.1)
- sitemaps referenced in sitemap index are deduplicated

2 years ago [NUTCH-2796] Upgrade to crawler-commons 1.1
Sebastian Nagel [Mon, 6 Jul 2020 12:02:39 +0000 (14:02 +0200)] 
[NUTCH-2796] Upgrade to crawler-commons 1.1

2 years ago Prepare for new development after release of 1.17
Sebastian Nagel [Wed, 17 Jun 2020 21:00:09 +0000 (23:00 +0200)] 
Prepare for new development after release of 1.17
- bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT)
- add 1.17 changes / release notes
- update links to Hadoop and Solr API docs
- update current year in API docs etc.

2 years ago NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
Markus Jelsma [Wed, 17 Jun 2020 11:21:24 +0000 (13:21 +0200)] 
NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite

2 years ago Merge pull request #533 from pmezard/NUTCH-2791
Sebastian Nagel [Thu, 11 Jun 2020 11:21:19 +0000 (13:21 +0200)] 
Merge pull request #533 from pmezard/NUTCH-2791

NUTCH-2791 Handle GCS URLs in stats commands

2 years ago NUTCH-2791 Handle GCS URLs in stats commands 533/head
Patrick Mezard [Tue, 9 Jun 2020 15:39:41 +0000 (17:39 +0200)] 
NUTCH-2791 Handle GCS URLs in stats commands

- Handle Google Cloud Storage URLs as crawldb inputs in domainstats,
  protocolstats and crawlcomplete commands.
- Correctly resolve numReducers in protocolstats.
- Align crawlcomplete -inputDirs behaviour on the other commands: expect
  directories containing "current", not "crawldb/current".

2 years ago Merge pull request #530 from sebastian-nagel/NUTCH-2789
Sebastian Nagel [Wed, 10 Jun 2020 18:44:36 +0000 (20:44 +0200)] 
Merge pull request #530 from sebastian-nagel/NUTCH-2789

NUTCH-2789 Documentation: update links to point to cwiki

2 years ago Merge pull request #529 from sebastian-nagel/NUTCH-2788
Sebastian Nagel [Wed, 10 Jun 2020 18:42:38 +0000 (20:42 +0200)] 
Merge pull request #529 from sebastian-nagel/NUTCH-2788

NUTCH-2788 ParseData: improve presentation of Metadata in method toString()

2 years ago Merge pull request #531 from sebastian-nagel/NUTCH-2787
Sebastian Nagel [Wed, 10 Jun 2020 18:34:50 +0000 (20:34 +0200)] 
Merge pull request #531 from sebastian-nagel/NUTCH-2787

NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly

2 years ago Merge pull request #532 from pmezard/NUTCH-2790
Sebastian Nagel [Wed, 10 Jun 2020 18:26:56 +0000 (20:26 +0200)] 
Merge pull request #532 from pmezard/NUTCH-2790

NUTCH-2790 indexer-csv: escape field leading quote character

2 years ago NUTCH-2790 indexer-csv: escape field leading quote character 532/head
Patrick Mezard [Tue, 9 Jun 2020 15:00:16 +0000 (17:00 +0200)] 
NUTCH-2790 indexer-csv: escape field leading quote character

Before the change, the leading quote of a field value like '"value'
would be left unescaped.

2 years ago NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly 531/head
Sebastian Nagel [Tue, 9 Jun 2020 12:17:40 +0000 (14:17 +0200)] 
NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
- add JsonSerializer to write common Writable types (null, boolean, numbers)
- remaining "unknown" Writables are written after calling toString()

2 years ago Merge pull request #527 from sebastian-nagel/NUTCH-2496
Sebastian Nagel [Tue, 9 Jun 2020 10:46:49 +0000 (12:46 +0200)] 
Merge pull request #527 from sebastian-nagel/NUTCH-2496

NUTCH-2496 Speed up link inversion step in crawling script

2 years ago Merge pull request #528 from sebastian-nagel/NUTCH-2720
Sebastian Nagel [Tue, 9 Jun 2020 10:44:30 +0000 (12:44 +0200)] 
Merge pull request #528 from sebastian-nagel/NUTCH-2720

NUTCH-2720 ROBOTS metatag ignored when capitalized

2 years ago NUTCH-2789 Documentation: update links to point to cwiki 530/head
Sebastian Nagel [Tue, 9 Jun 2020 10:28:00 +0000 (12:28 +0200)] 
NUTCH-2789 Documentation: update links to point to cwiki

2 years ago NUTCH-2789 Docker README: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 10:06:17 +0000 (12:06 +0200)] 
NUTCH-2789 Docker README: update links to point to cwiki

2 years ago NUTCH-2788 ParseData: improve presentation of Metadata in method toString() 529/head
Sebastian Nagel [Tue, 9 Jun 2020 09:41:37 +0000 (11:41 +0200)] 
NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
- switch to multi-line presentation of Metadata in ParseData::toString
- default implementation of Metadata::toString is still single-line
- replace StringBuffer by StringBuilder in modified methods

2 years ago NUTCH-2720 ROBOTS metatag ignored when capitalized 528/head
Sebastian Nagel [Sun, 17 May 2020 12:37:47 +0000 (14:37 +0200)] 
NUTCH-2720 ROBOTS metatag ignored when capitalized
- move string "robots" to constant in metadata.Nutch
- make string lowercase not depend on system locale

2 years ago NUTCH-2720 ROBOTS metatag ignored when capitalized
Sebastian Nagel [Fri, 15 May 2020 21:11:08 +0000 (23:11 +0200)] 
NUTCH-2720 ROBOTS metatag ignored when capitalized

- parse-tika: add lowercase "robots" to metadata

2 years ago NUTCH-2496 Speed up link inversion step in crawling script 527/head
Sebastian Nagel [Fri, 15 May 2020 17:17:00 +0000 (19:17 +0200)] 
NUTCH-2496 Speed up link inversion step in crawling script

- disable URL filtering and normalizing when calling invertlinks
  in bin/crawl

- add note that the steps invertlinks, dedup, index could also
  be done outside the loop over all segments created in the loop
  iterations

- move webgraph construction (commented out anyway) outside the
  loop because it's done over all available segments

2 years ago Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
Sebastian Nagel [Thu, 14 May 2020 15:43:14 +0000 (17:43 +0200)] 
Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence

NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

2 years ago NUTCH-2419 Some URL filters and normalizers do not respect command-line override... 526/head
Sebastian Nagel [Wed, 13 May 2020 12:39:15 +0000 (14:39 +0200)] 
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

- simplify selection of rule file (from property or attribute in plugin.xml)

2 years ago Merge pull request #525 from sebastian-nagel/NUTCH-1945
Sebastian Nagel [Tue, 12 May 2020 13:35:09 +0000 (15:35 +0200)] 
Merge pull request #525 from sebastian-nagel/NUTCH-1945

NUTCH-1945 Test for XLSX parser

2 years ago NUTCH-2419 Some URL filters and normalizers do not respect command-line override...
Sebastian Nagel [Fri, 27 Sep 2019 20:51:29 +0000 (22:51 +0200)] 
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file

- fix urlfilter-domain, urlfilter-domainblacklist, urlfilter-prefix
  and urlfilter-suffix

- always prefer the configured rule file (urlfilter.domain.file,
  urlfilter.domainblacklist.file, urlfilter.prefix.file,
  urlfilter.suffix.file) over the file defined in plugin.xml

- remove constructors taking rule file as argument
  (used only in unit tests and now obsolete because we can override the
   rule file via configuration)

- update Java API doc comments