Lewis John McGibbney [Tue, 19 Jan 2021 23:33:48 +0000 (15:33 -0800)]
Prepare for Nutch 1.18 release
Lewis John McGibbney [Thu, 14 Jan 2021 23:27:00 +0000 (15:27 -0800)]
Prepare for Nutch 1.18 release
Lewis John McGibbney [Wed, 13 Jan 2021 18:56:07 +0000 (10:56 -0800)]
NUTCH-2841 Upgrade xercesImpl dependency (#563)
* NUTCH-2841 Upgrade xercesImpl dependency
Lewis John McGibbney [Fri, 8 Jan 2021 18:01:38 +0000 (10:01 -0800)]
NUTCH-2837 Update multiple dependencies (#560)
* NUTCH-2837 Upgrade Slf4j dependencies
* NUTCH-2837 Update multiple dependencies
Lewis John McGibbney [Fri, 8 Jan 2021 04:41:37 +0000 (20:41 -0800)]
NUTCH-2836 Upgrade various commons dependencies (#559)
Jakob Berlin [Thu, 17 Dec 2020 16:59:30 +0000 (17:59 +0100)]
Add possibility to set up deduplication group mode in crawl script (#557)
Lewis John McGibbney [Thu, 17 Dec 2020 16:56:04 +0000 (08:56 -0800)]
NUTCH-2835 Upgrade commons-jexl from 2 --> 3 (#558)
Sebastian Nagel [Tue, 8 Dec 2020 12:54:31 +0000 (13:54 +0100)]
Merge pull request #556 from sebastian-nagel/tika-1.25
NUTCH-2833 Upgrade to Tika 1.25
Sebastian Nagel [Fri, 27 Nov 2020 13:53:06 +0000 (14:53 +0100)]
NUTCH-2833 Upgrade to Tika 1.25
Sebastian Nagel [Wed, 18 Nov 2020 11:26:44 +0000 (12:26 +0100)]
Merge pull request #554 from sebastian-nagel/NUTCH-2582-set-mime-types-reader-pool-size
NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika
Lewis John McGibbney [Wed, 18 Nov 2020 03:10:18 +0000 (19:10 -0800)]
NUTCH-2809 Upgrade any23 plugin dependency to 2.4 (#553)
* NUTCH-2809 Upgrade any23 plugin dependency to 2.4
Sebastian Nagel [Fri, 13 Nov 2020 18:11:51 +0000 (19:11 +0100)]
Merge pull request #555 from sebastian-nagel/NUTCH-2829-ant-clean-cache
NUTCH-2829 Fix ant target "clean-cache"
Sebastian Nagel [Tue, 10 Nov 2020 11:06:16 +0000 (12:06 +0100)]
NUTCH-2829 Fix ant target "clean-cache"
- make target "clean-cache" depend on "ivy-init" so that
ivy-related resources are defined
Sebastian Nagel [Fri, 16 Oct 2020 21:10:03 +0000 (23:10 +0200)]
NUTCH-2582 Set pool size of XML SAX parsers used for MIME detection in Tika
- add method in MimeUtil to set MimeTypesReader pool size
- actually adjust pool size to number of Fetcher threads / 2
(minimum pool size is 10 in case there are fewer than 20 Fetcher threads)
- double pool size (10 -> 20) of Tika XMLReaderUtils in tika-config.xml
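The sizing rule in the message above (half the number of Fetcher threads, with a floor of 10) can be sketched as follows. This is a minimal illustration only: the class and method names are hypothetical and are not taken from MimeUtil.

```java
/** Sketch of the pool sizing rule; class and method names are hypothetical. */
final class MimePoolSizing {
  /** Floor stated in the commit message. */
  static final int MIN_POOL_SIZE = 10;

  /** Half the number of fetcher threads, but never below the minimum of 10. */
  static int poolSize(int fetcherThreads) {
    return Math.max(MIN_POOL_SIZE, fetcherThreads / 2);
  }

  public static void main(String[] args) {
    for (int threads : new int[] {10, 20, 50}) {
      System.out.println(threads + " threads -> pool size " + poolSize(threads));
    }
  }
}
```

With integer division, any thread count below 20 yields the minimum, which matches the note in the message.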
Sebastian Nagel [Mon, 14 Sep 2020 12:27:39 +0000 (14:27 +0200)]
Merge pull request #552 from sebastian-nagel/NUTCH-2824
NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names
Sebastian Nagel [Mon, 14 Sep 2020 12:06:23 +0000 (14:06 +0200)]
Merge pull request #551 from sebastian-nagel/NUTCH-2823
NUTCH-2823 IllegalStateException in IndexWriters.describe() when vali…
Sebastian Nagel [Mon, 14 Sep 2020 11:55:34 +0000 (13:55 +0200)]
NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names
- add unit tests to verify that a declared MalformedURLException is thrown
on host names containing illegal percent-encoded sequences and
any (undeclared) runtime exceptions are caught and rethrown
Sebastian Nagel [Fri, 21 Aug 2020 14:38:28 +0000 (16:38 +0200)]
NUTCH-2824 urlnormalizer-basic to unescape percent-encoded host names
Sebastian Nagel [Tue, 18 Aug 2020 09:49:01 +0000 (11:49 +0200)]
Merge pull request #549 from sebastian-nagel/NUTCH-2818-ant-rat-task
NUTCH-2818 Fix Apache Rat task to check sources for license headers
Sebastian Nagel [Tue, 18 Aug 2020 09:41:46 +0000 (11:41 +0200)]
Merge pull request #546 from sebastian-nagel/NUTCH-2814-http-date-format-time-zone
NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date
Sebastian Nagel [Mon, 17 Aug 2020 14:54:42 +0000 (16:54 +0200)]
NUTCH-2823 IllegalStateException in IndexWriters.describe() when validating url param for SolrIndexer
- when calculating required column width for first (param names) and
third column (param values): verify that none of these columns occupy
more than one third of the table width, otherwise reset width to 1/3
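The width check described above amounts to capping a column at one third of the table width. A minimal sketch, assuming integer widths in characters; the names are hypothetical and this is not the code in IndexWriters.describe():

```java
/** Sketch of the one-third column cap; class and method names are hypothetical. */
final class ColumnWidths {
  /**
   * Returns the required width unless it exceeds one third of the table
   * width, in which case the width is reset to one third.
   */
  static int capWidth(int requiredWidth, int tableWidth) {
    int limit = tableWidth / 3;
    return requiredWidth > limit ? limit : requiredWidth;
  }

  public static void main(String[] args) {
    int tableWidth = 90;
    System.out.println("short param name: " + capWidth(10, tableWidth));
    System.out.println("long url value:   " + capWidth(200, tableWidth));
  }
}
```

Capping the first and third columns this way leaves at least one third of the table for the middle column, however long a parameter value (such as a URL) is.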
Sebastian Nagel [Wed, 12 Aug 2020 10:51:51 +0000 (12:51 +0200)]
NUTCH-2697 Upgrade Ivy to fix the issue of an unset packaging.type property
NUTCH-2671 Upgrade ant ivy library
- upgrade Ivy (2.4.0 -> 2.5.0)
- upgrade all plugins build-ivy.xml to use the ivy jar 2.5.0 installed in
$NUTCH_HOME/ivy/ for preparing lists of dependencies registered in plugin.xml
Sebastian Nagel [Sun, 16 Aug 2020 18:59:33 +0000 (20:59 +0200)]
Merge branch 'derhecht-patch-2', closes #545
Includes solution for (closes #544)
NUTCH-2813 MoreIndexingFilter - can't parse erroneous date - 2019-07-03T10:28:14
Sebastian Nagel [Sat, 8 Aug 2020 08:54:42 +0000 (10:54 +0200)]
NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
- replace check whether URL path and file are identical
by check whether URL has a query
- clean up code and improve log messages
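For context: in java.net.URL, getFile() returns the path plus "?" and the query when a query is present, so path and file differ exactly when the URL has a query. Comparing the two strings with == or != compares object references, not content. A minimal sketch of the replacement check, with a hypothetical helper name that is not the Nutch code:

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

/** Sketch of a query check replacing a path/file comparison; names are hypothetical. */
final class UrlQueryCheck {
  /** Returns true if the URL has a query part (text after '?'). */
  static boolean hasQuery(String spec) {
    try {
      URL url = URI.create(spec).toURL();
      // getQuery() is null when the URL has no '?' part.
      return url.getQuery() != null;
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException("Not a valid URL: " + spec, e);
    }
  }

  public static void main(String[] args) {
    System.out.println(hasQuery("http://example.com/a"));
    System.out.println(hasQuery("http://example.com/a?b=c"));
  }
}
```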
Sebastian Nagel [Thu, 6 Aug 2020 17:24:35 +0000 (19:24 +0200)]
NUTCH-2816 Add Spotbugs target to ant build
- called on-demand as ant target "spotbugs"
- creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins
Madhawa Gunasekara [Mon, 3 Aug 2020 15:10:45 +0000 (17:10 +0200)]
NUTCH-2811 : Setup Github workflows for prs (#543)
* NUTCH-2811 : Setup Github workflows for prs
Merging to set up the git workflows.
Sebastian Nagel [Mon, 27 Jul 2020 10:05:23 +0000 (12:05 +0200)]
NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
Sebastian Nagel [Fri, 10 Jul 2020 14:01:18 +0000 (16:01 +0200)]
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- clarify comment regarding bypassing the confidence check for a non-empty http.agent.name
Sebastian Nagel [Fri, 10 Jul 2020 13:13:49 +0000 (15:13 +0200)]
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- if no agent names are given as command-line arguments use values of
http.agent.name and http.robots.agents as agent names to be checked
- update command-line help
Jakob Berlin [Mon, 3 Aug 2020 13:56:44 +0000 (15:56 +0200)]
NUTCH-1190 MoreIndexingFilter: move data formats used to parse "lastModified" to a config file
Sebastian Nagel [Fri, 10 Jul 2020 13:28:53 +0000 (15:28 +0200)]
NUTCH-2799 Add .asf.yaml file
- update pull request template regarding Jira linking:
issue id should be in square brackets (`[NUTCH-XXXX]`)
Sebastian Nagel [Wed, 8 Jul 2020 14:30:55 +0000 (16:30 +0200)]
NUTCH-2799 Add .asf.yaml file
- add project description in one sentence
- add github topics
- set github mailing list notifications as configured before
Shashanka Balakuntala Srinivasa [Wed, 29 Jul 2020 14:35:04 +0000 (20:05 +0530)]
NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
NUTCH-2805: Rename plugin urlfilter-domainblacklist
shbalaku [Fri, 10 Jul 2020 17:37:36 +0000 (23:07 +0530)]
NUTCH-2782: protocol-http / lib-http: support TLSv1.3
Sebastian Nagel [Mon, 6 Jul 2020 12:03:33 +0000 (14:03 +0200)]
[NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
- sitemap links from robots.txt are treated as set by crawler-commons
(since crawler-commons 1.1)
- sitemaps referenced in sitemap index are deduplicated
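Treating sitemap URLs as a set rather than a list can be sketched with a LinkedHashSet, which drops repeated entries while keeping first-seen order. This is an illustration only; the class name is hypothetical and the sketch does not reproduce SitemapProcessor or crawler-commons.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Sketch of sitemap URL deduplication; class and method names are hypothetical. */
final class SitemapUrls {
  /** Removes duplicate URLs while preserving the order of first occurrence. */
  static Set<String> deduplicate(List<String> urls) {
    return new LinkedHashSet<>(urls);
  }

  public static void main(String[] args) {
    List<String> fromIndex = Arrays.asList(
        "https://example.com/sitemap-1.xml",
        "https://example.com/sitemap-2.xml",
        "https://example.com/sitemap-1.xml");
    System.out.println(deduplicate(fromIndex));
  }
}
```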
Sebastian Nagel [Mon, 6 Jul 2020 12:02:39 +0000 (14:02 +0200)]
[NUTCH-2796] Upgrade to crawler-commons 1.1
Sebastian Nagel [Wed, 17 Jun 2020 21:00:09 +0000 (23:00 +0200)]
Prepare for new development after release of 1.17
- bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT)
- add 1.17 changes / release notes
- update links to Hadoop and Solr API docs
- update current year in API docs etc.
Markus Jelsma [Wed, 17 Jun 2020 11:21:24 +0000 (13:21 +0200)]
NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
Patrick Mezard [Tue, 9 Jun 2020 15:39:41 +0000 (17:39 +0200)]
NUTCH-2791 Handle GCS URLs in stats commands
- Handle Google Cloud Storage URLs as crawldb inputs in domainstats,
protocolstats and crawlcomplete commands.
- Correctly resolve numReducers in protocolstats.
- Align crawlcomplete -inputDirs behaviour on the other commands: expect
directories containing "current", not "crawldb/current".
Sebastian Nagel [Tue, 9 Jun 2020 10:28:00 +0000 (12:28 +0200)]
NUTCH-2789 Documentation: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 10:06:17 +0000 (12:06 +0200)]
NUTCH-2789 Docker README: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 09:41:37 +0000 (11:41 +0200)]
NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
- switch to multi-line presentation of Metadata in ParseData::toString
- default implementation of Metadata::toString is still single-line
- replace StringBuffer by StringBuilder in modified methods
Sebastian Nagel [Tue, 9 Jun 2020 12:17:40 +0000 (14:17 +0200)]
NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
- add JsonSerializer to write common Writable types (null, boolean, numbers)
- remaining "unknown" Writables are written after calling toString()
Patrick Mezard [Tue, 9 Jun 2020 15:00:16 +0000 (17:00 +0200)]
NUTCH-2790 indexer-csv: escape field leading quote character
Before the change, the leading quote of a field value like '"value'
would be left unescaped.
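As an illustration of the case named above, here is a minimal sketch of RFC 4180-style field quoting, in which every quote character is doubled, including a leading one. The class, method, and parameter names are assumptions; this is not the indexer-csv implementation, whose escaping rules may differ.

```java
/** Sketch of RFC 4180-style CSV field escaping; names are hypothetical. */
final class CsvFieldEscaper {
  /**
   * Wraps the value in quotes and doubles embedded quotes if the value
   * contains a quote, the separator, or a line break.
   */
  static String escape(String value, char separator, char quote) {
    boolean needsQuoting = value.indexOf(quote) >= 0
        || value.indexOf(separator) >= 0
        || value.indexOf('\n') >= 0
        || value.indexOf('\r') >= 0;
    if (!needsQuoting) {
      return value;
    }
    String q = String.valueOf(quote);
    // replace() handles every occurrence, so a quote at index 0 is doubled too.
    return q + value.replace(q, q + q) + q;
  }

  public static void main(String[] args) {
    System.out.println(escape("\"value", ',', '"'));
    System.out.println(escape("plain", ',', '"'));
  }
}
```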
Sebastian Nagel [Fri, 15 May 2020 17:17:00 +0000 (19:17 +0200)]
NUTCH-2496 Speed up link inversion step in crawling script
- disable URL filtering and normalizing when calling invertlinks
in bin/crawl
- add note that the steps invertlinks, dedup, index could also
be done outside the loop over all segments created in the loop
iterations
- move webgraph construction (commented out anyway) outside the
loop because it's done over all available segments
Sebastian Nagel [Sun, 17 May 2020 12:37:47 +0000 (14:37 +0200)]
NUTCH-2720 ROBOTS metatag ignored when capitalized
- move string "robots" to constant in metadata.Nutch
- make string lowercase not depend on system locale
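Locale-independent lowercasing in Java is usually done with Locale.ROOT. The sketch below shows the idea with a hypothetical helper; it is not the code in metadata.Nutch. The second call in main is included only for comparison, since String.toLowerCase with a Turkish locale maps "I" to a dotless "ı".

```java
import java.util.Locale;

/** Sketch of locale-independent meta tag name normalization; names are hypothetical. */
final class MetaTagNames {
  /** Lowercases a meta tag name without consulting the system default locale. */
  static String normalize(String name) {
    return name.toLowerCase(Locale.ROOT);
  }

  public static void main(String[] args) {
    System.out.println(normalize("ROBOTS"));
    // Locale-sensitive lowercasing, shown for comparison only.
    System.out.println("TITLE".toLowerCase(Locale.forLanguageTag("tr")));
  }
}
```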
Sebastian Nagel [Fri, 15 May 2020 21:11:08 +0000 (23:11 +0200)]
NUTCH-2720 ROBOTS metatag ignored when capitalized
- parse-tika: add lowercase "robots" to metadata
Sebastian Nagel [Wed, 13 May 2020 12:39:15 +0000 (14:39 +0200)]
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
- simplify selection of rule file (from property or attribute in plugin.xml)
Sebastian Nagel [Fri, 27 Sep 2019 20:51:29 +0000 (22:51 +0200)]
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
- fix urlfilter-domain, urlfilter-domainblacklist, urlfilter-prefix
and urlfilter-suffix
- always prefer the configured rule file (urlfilter.domain.file,
urlfilter.domainblacklist.file, urlfilter.prefix.file,
urlfilter.suffix.file) over the file defined in plugin.xml
- remove constructors taking rule file as argument
(used only in unit tests and now obsolete because we can override the
rule file via configuration)
- update Java API doc comments
Sebastian Nagel [Tue, 5 May 2020 11:25:15 +0000 (13:25 +0200)]
NUTCH-1945 Test for XLSX parser
- add Tika unit test for XLSX files
- bundle instance variables and utility methods in class TikaParserTest
- clean up javadoc comments
Sebastian Nagel [Thu, 30 Apr 2020 12:16:03 +0000 (14:16 +0200)]
NUTCH-2758 Add plugin READMEs to binary release packages
Sebastian Nagel [Thu, 30 Apr 2020 15:07:30 +0000 (17:07 +0200)]
NUTCH-2753 Add -listen option to command-line help of CrawlDbReader and LinkDbReader
Sebastian Nagel [Thu, 30 Apr 2020 10:58:05 +0000 (12:58 +0200)]
NUTCH-2002 parse and index checkers to check robots.txt
- applied Julien's patch to recent code base
- also check redirects whether they are allowed
- add command-line parameter `-checkRobotsTxt` enabling this check
Sebastian Nagel [Wed, 29 Apr 2020 07:54:32 +0000 (09:54 +0200)]
NUTCH-2785 FreeGenerator: command-line option to define number of generated fetch lists
- add command-line option `-numFetchers` to FreeGenerator
- in local mode: generate one single fetch list
Sebastian Nagel [Thu, 23 Apr 2020 13:55:32 +0000 (15:55 +0200)]
NUTCH-1194 Generator: CrawlDB lock should be released earlier
- release CrawlDb lock after select step, in case, generated items
are not marked in CrawlDb (generate.update.crawldb is false)
Sebastian Nagel [Tue, 5 May 2020 09:27:35 +0000 (11:27 +0200)]
NUTCH-2434 Add methods to reset parameters HTMLMetaTags
(apply patch contributed by Markus)
Sebastian Nagel [Wed, 29 Apr 2020 11:03:01 +0000 (13:03 +0200)]
NUTCH-2743 Add list of Nutch properties (nutch-default.xml) to documentation
- modify ant build.xml to copy nutch-default.xml into docs/api/resources/
- adapt XSLT table layout
- remove obsolete nutch-conf.xsl
- fix typos and normalize spelling in nutch-default.xml
Sebastian Nagel [Tue, 11 Aug 2020 09:19:14 +0000 (11:19 +0200)]
NUTCH-2818 Fix Apache Rat task to check sources for license headers
- automate download of Apache Rat jar file
- write report to build/apache-rat-report.txt
Sebastian Nagel [Tue, 11 Aug 2020 07:38:47 +0000 (09:38 +0200)]
Merge pull request #548 from sebastian-nagel/NUTCH-2817-spotbugs-object-equality
[NUTCH-2817] Avoid check for equality of URL path and file part using == / !=
Sebastian Nagel [Tue, 11 Aug 2020 07:37:24 +0000 (09:37 +0200)]
Merge pull request #547 from sebastian-nagel/NUTCH-2816-add-spotbugs-ant-target
[NUTCH-2816] Add Spotbugs target to ant build
Sebastian Nagel [Sat, 8 Aug 2020 08:54:42 +0000 (10:54 +0200)]
NUTCH-2817 Avoid check for equality of URL path and file part using ==/!=
- replace check whether URL path and file are identical
by check whether URL has a query
- clean up code and improve log messages
Sebastian Nagel [Thu, 6 Aug 2020 17:24:35 +0000 (19:24 +0200)]
NUTCH-2816 Add Spotbugs target to ant build
- called on-demand as ant target "spotbugs"
- creates spotbugs report ("build/nutch-spotbugs.html") covering Nutch core and plugins
Sebastian Nagel [Fri, 7 Aug 2020 16:12:22 +0000 (18:12 +0200)]
NUTCH-2814 HttpDateFormat's internal time zone may change after parsing a date
- reset time zone to GMT after parsing a date
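Background for the fix above: the DateFormat.parse documentation notes that parsing may overwrite the formatter's time zone when the input carries its own zone, and that a previously set zone may need to be restored. A minimal sketch of the reset, assuming an RFC 1123-style pattern; the class and method names are hypothetical and this is not the HttpDateFormat code.

```java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;

/** Sketch of a date parser that restores GMT after each parse; names are hypothetical. */
final class GmtDateParser {
  private static final TimeZone GMT = TimeZone.getTimeZone("GMT");
  private final SimpleDateFormat format =
      new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss zzz", Locale.US);

  GmtDateParser() {
    format.setTimeZone(GMT);
  }

  /** Parses a date and returns epoch milliseconds; SimpleDateFormat is not thread-safe. */
  synchronized long parseMillis(String text) {
    try {
      return format.parse(text).getTime();
    } catch (ParseException e) {
      throw new IllegalArgumentException("Unparseable date: " + text, e);
    } finally {
      // Parsing a date with another zone may have changed the formatter's zone.
      format.setTimeZone(GMT);
    }
  }

  /** Formats epoch milliseconds in GMT. */
  synchronized String formatMillis(long millis) {
    return format.format(new Date(millis));
  }
}
```

Resetting in a finally block keeps later formatting in GMT even when a parse fails.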
Sebastian Nagel [Mon, 3 Aug 2020 19:08:22 +0000 (21:08 +0200)]
Merge pull request #542 from sebastian-nagel/NUTCH-2810
NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
Sebastian Nagel [Mon, 3 Aug 2020 19:06:48 +0000 (21:06 +0200)]
Merge pull request #537 from sebastian-nagel/NUTCH-2801-robots-checker
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
Madhawa Gunasekara [Mon, 3 Aug 2020 15:10:45 +0000 (17:10 +0200)]
NUTCH-2811 : Setup Github workflows for prs (#543)
* NUTCH-2811 : Setup Github workflows for prs
Merging to set up the git workflows.
Sebastian Nagel [Sun, 2 Aug 2020 11:09:21 +0000 (13:09 +0200)]
Merge pull request #536 from sebastian-nagel/NUTCH-2799-asf-yaml-file
[NUTCH-2799] Add .asf.yaml file
Shashanka Balakuntala Srinivasa [Wed, 29 Jul 2020 14:35:04 +0000 (20:05 +0530)]
NUTCH-2805: Rename plugin urlfilter-domainblacklist (#540)
NUTCH-2805: Rename plugin urlfilter-domainblacklist
Sebastian Nagel [Mon, 27 Jul 2020 10:05:23 +0000 (12:05 +0200)]
NUTCH-2810 FreeGenerator to actually apply configured number of fetch lists
Sebastian Nagel [Tue, 14 Jul 2020 10:51:53 +0000 (12:51 +0200)]
Merge pull request #538 from balashashanka/NUTCH-2782
NUTCH-2782: protocol-http / lib-http: support TLSv1.3
Sebastian Nagel [Tue, 14 Jul 2020 10:48:29 +0000 (12:48 +0200)]
Merge pull request #535 from sebastian-nagel/NUTCH-2796-NUTCH-2730
[NUTCH-2796] [NUTCH-2730] Update crawler-commons 1.1, SitemapProcessor to treat sitemap URLs as Set instead of List
shbalaku [Fri, 10 Jul 2020 17:37:36 +0000 (23:07 +0530)]
NUTCH-2782: protocol-http / lib-http: support TLSv1.3
Sebastian Nagel [Fri, 10 Jul 2020 14:01:18 +0000 (16:01 +0200)]
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- clarify comment regarding bypassing the confidence check for a non-empty http.agent.name
Sebastian Nagel [Fri, 10 Jul 2020 13:28:53 +0000 (15:28 +0200)]
NUTCH-2799 Add .asf.yaml file
- update pull request template regarding Jira linking:
issue id should be in square brackets (`[NUTCH-XXXX]`)
Sebastian Nagel [Fri, 10 Jul 2020 13:13:49 +0000 (15:13 +0200)]
[NUTCH-2801] RobotsRulesParser command-line checker to use http.robots.agents as fall-back
- if no agent names are given as command-line arguments use values of
http.agent.name and http.robots.agents as agent names to be checked
- update command-line help
Sebastian Nagel [Wed, 8 Jul 2020 14:30:55 +0000 (16:30 +0200)]
NUTCH-2799 Add .asf.yaml file
- add project description in one sentence
- add github topics
- set github mailing list notifications as configured before
Sebastian Nagel [Mon, 6 Jul 2020 12:03:33 +0000 (14:03 +0200)]
[NUTCH-2730] SitemapProcessor to treat sitemap URLs as Set instead of List
- sitemap links from robots.txt are treated as set by crawler-commons
(since crawler-commons 1.1)
- sitemaps referenced in sitemap index are deduplicated
Sebastian Nagel [Mon, 6 Jul 2020 12:02:39 +0000 (14:02 +0200)]
[NUTCH-2796] Upgrade to crawler-commons 1.1
Sebastian Nagel [Wed, 17 Jun 2020 21:00:09 +0000 (23:00 +0200)]
Prepare for new development after release of 1.17
- bump version number (1.17-SNAPSHOT -> 1.18-SNAPSHOT)
- add 1.17 changes / release notes
- update links to Hadoop and Solr API docs
- update current year in API docs etc.
Markus Jelsma [Wed, 17 Jun 2020 11:21:24 +0000 (13:21 +0200)]
NUTCH-2794 Add additional ciphers to HTTP base's default cipher suite
Sebastian Nagel [Thu, 11 Jun 2020 11:21:19 +0000 (13:21 +0200)]
Merge pull request #533 from pmezard/NUTCH-2791
NUTCH-2791 Handle GCS URLs in stats commands
Patrick Mezard [Tue, 9 Jun 2020 15:39:41 +0000 (17:39 +0200)]
NUTCH-2791 Handle GCS URLs in stats commands
- Handle Google Cloud Storage URLs as crawldb inputs in domainstats,
protocolstats and crawlcomplete commands.
- Correctly resolve numReducers in protocolstats.
- Align crawlcomplete -inputDirs behaviour on the other commands: expect
directories containing "current", not "crawldb/current".
Sebastian Nagel [Wed, 10 Jun 2020 18:44:36 +0000 (20:44 +0200)]
Merge pull request #530 from sebastian-nagel/NUTCH-2789
NUTCH-2789 Documentation: update links to point to cwiki
Sebastian Nagel [Wed, 10 Jun 2020 18:42:38 +0000 (20:42 +0200)]
Merge pull request #529 from sebastian-nagel/NUTCH-2788
NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
Sebastian Nagel [Wed, 10 Jun 2020 18:34:50 +0000 (20:34 +0200)]
Merge pull request #531 from sebastian-nagel/NUTCH-2787
NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
Sebastian Nagel [Wed, 10 Jun 2020 18:26:56 +0000 (20:26 +0200)]
Merge pull request #532 from pmezard/NUTCH-2790
NUTCH-2790 indexer-csv: escape field leading quote character
Patrick Mezard [Tue, 9 Jun 2020 15:00:16 +0000 (17:00 +0200)]
NUTCH-2790 indexer-csv: escape field leading quote character
Before the change, the leading quote of a field value like '"value'
would be left unescaped.
Sebastian Nagel [Tue, 9 Jun 2020 12:17:40 +0000 (14:17 +0200)]
NUTCH-2787 CrawlDb JSON dump does not export metadata primitive data types correctly
- add JsonSerializer to write common Writable types (null, boolean, numbers)
- remaining "unknown" Writables are written after calling toString()
Sebastian Nagel [Tue, 9 Jun 2020 10:46:49 +0000 (12:46 +0200)]
Merge pull request #527 from sebastian-nagel/NUTCH-2496
NUTCH-2496 Speed up link inversion step in crawling script
Sebastian Nagel [Tue, 9 Jun 2020 10:44:30 +0000 (12:44 +0200)]
Merge pull request #528 from sebastian-nagel/NUTCH-2720
NUTCH-2720 ROBOTS metatag ignored when capitalized
Sebastian Nagel [Tue, 9 Jun 2020 10:28:00 +0000 (12:28 +0200)]
NUTCH-2789 Documentation: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 10:06:17 +0000 (12:06 +0200)]
NUTCH-2789 Docker README: update links to point to cwiki
Sebastian Nagel [Tue, 9 Jun 2020 09:41:37 +0000 (11:41 +0200)]
NUTCH-2788 ParseData: improve presentation of Metadata in method toString()
- switch to multi-line presentation of Metadata in ParseData::toString
- default implementation of Metadata::toString is still single-line
- replace StringBuffer by StringBuilder in modified methods
Sebastian Nagel [Sun, 17 May 2020 12:37:47 +0000 (14:37 +0200)]
NUTCH-2720 ROBOTS metatag ignored when capitalized
- move string "robots" to constant in metadata.Nutch
- make string lowercase not depend on system locale
Sebastian Nagel [Fri, 15 May 2020 21:11:08 +0000 (23:11 +0200)]
NUTCH-2720 ROBOTS metatag ignored when capitalized
- parse-tika: add lowercase "robots" to metadata
Sebastian Nagel [Fri, 15 May 2020 17:17:00 +0000 (19:17 +0200)]
NUTCH-2496 Speed up link inversion step in crawling script
- disable URL filtering and normalizing when calling invertlinks
in bin/crawl
- add note that the steps invertlinks, dedup, index could also
be done outside the loop over all segments created in the loop
iterations
- move webgraph construction (commented out anyway) outside the
loop because it's done over all available segments
Sebastian Nagel [Thu, 14 May 2020 15:43:14 +0000 (17:43 +0200)]
Merge pull request #526 from sebastian-nagel/NUTCH-2419-urlfilter-rule-file-precedence
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
Sebastian Nagel [Wed, 13 May 2020 12:39:15 +0000 (14:39 +0200)]
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
- simplify selection of rule file (from property or attribute in plugin.xml)
Sebastian Nagel [Tue, 12 May 2020 13:35:09 +0000 (15:35 +0200)]
Merge pull request #525 from sebastian-nagel/NUTCH-1945
NUTCH-1945 Test for XLSX parser
Sebastian Nagel [Fri, 27 Sep 2019 20:51:29 +0000 (22:51 +0200)]
NUTCH-2419 Some URL filters and normalizers do not respect command-line override for rule file
- fix urlfilter-domain, urlfilter-domainblacklist, urlfilter-prefix
and urlfilter-suffix
- always prefer the configured rule file (urlfilter.domain.file,
urlfilter.domainblacklist.file, urlfilter.prefix.file,
urlfilter.suffix.file) over the file defined in plugin.xml
- remove constructors taking rule file as argument
(used only in unit tests and now obsolete because we can override the
rule file via configuration)
- update Java API doc comments