madlib.git
2 days agoTypos master
Bruno P. Kinoshita [Thu, 10 Jan 2019 21:36:49 +0000 (10:36 +1300)] 
Typos

6 days agoElastic net: Fix minor typo in install-check
Rahul Iyer [Fri, 11 Jan 2019 01:17:56 +0000 (17:17 -0800)] 
Elastic net: Fix minor typo in install-check

7 days agoupdate NOTICE file to 2019
Frank McQuillan [Thu, 10 Jan 2019 01:34:11 +0000 (17:34 -0800)] 
update NOTICE file to 2019

4 weeks agoValidation: Support other 'relkind' for input tables
Rahul Iyer [Tue, 11 Dec 2018 13:24:17 +0000 (05:24 -0800)] 
Validation: Support other 'relkind' for input tables

JIRA: MADLIB-1287

Within `validate_args.py_in:table_exists()` we checked if a table existed within
`pg_class` but limited the input table to specific `relkind`s. This limited
scope is unnecessary and precluded MADlib functions from accepting partition
tables.

This commit removes the `relkind` check, effectively adding 'partition',
'index' and 'sequence' tables as valid input tables.

Closes #340

4 weeks agoBuild: Add PG11 Support
Orhan Kislal [Fri, 14 Dec 2018 11:34:43 +0000 (14:34 +0300)] 
Build: Add PG11 Support

JIRA: MADLIB-1283

PG11 support required a number of minor changes in the code.
- Change TRUE/FALSE to true/false
- Use TupleDescAttr function instead of direct access.
- Use prokind column instead of proisagg.

We also added a function to check if the PG version is earlier than 11
as well as the necessary cmake files.

Closes #339

2 months agoInstall/Dev check: Add new test cases for some modules
Nandish Jayaram [Sat, 29 Sep 2018 00:15:40 +0000 (17:15 -0700)] 
Install/Dev check: Add new test cases for some modules

Some modules such as array_ops and pmml did not have any install check
files, while stemmer did not have any test files. This commit adds some
basic test cases for these modules. We can add more comprehensive tests
in the future if need be.

Closes #338

2 months agoMinibatch Preprocessor: Update online doc
Nandish Jayaram [Tue, 23 Oct 2018 17:35:02 +0000 (10:35 -0700)] 
Minibatch Preprocessor: Update online doc

The online doc is outdated. This commit adds two new parameters that
have been introduced since the last time the doc was edited.

Closes #334

2 months agoMadpack: Add UDO and UDOC automation
Orhan Kislal [Wed, 14 Nov 2018 17:15:57 +0000 (20:15 +0300)] 
Madpack: Add UDO and UDOC automation

JIRA: MADLIB-1281

- Add scripts for detecting changed/dropped UDOs and UDOCs.
- Expand the create_changelist.py file to consume these scripts and
create changelists with these fields filled if necessary.
- Fix the update_util.py to use the correct dictionary key.
- Add drop operator class command to the svac.sql_in to make sure the
old class is removed before creating the updated one.

Closes #337

2 months agoUpdate dockerfile to use ubuntu 16.04
Jingyi Mei [Fri, 19 Oct 2018 14:43:07 +0000 (22:43 +0800)] 
Update dockerfile to use ubuntu 16.04

This commit adds a new dockerfile to bake postgres 10.5 on ubuntu
16.04. It also updates docker_start.sh and README to pull the new docker image
instead of the old one (Postgres 9.6 on Ubuntu 8.9).

Closes #332

2 months agoUpdate version numbers to 1.16-dev
Orhan Kislal [Fri, 19 Oct 2018 17:23:42 +0000 (20:23 +0300)] 
Update version numbers to 1.16-dev

3 months agoBuild: Include preflight and postflight scripts for mac rc/1.15.1-rc1 rel/v1.15.1
Nikhil Kak [Tue, 9 Oct 2018 19:07:06 +0000 (12:07 -0700)] 
Build: Include preflight and postflight scripts for mac

Commit 441f16bd55d2a26e4dd59df6129c6092f099cbca introduced a bug where
the preflight and postflight scripts for mac were not getting included
in the .dmg file. This commit adds an if check for the APPLE platform to
include these scripts necessary for dmg installation.

Closes #331

3 months agoAdd 1.15.1 changelist and fix upgrade util
Orhan Kislal [Thu, 4 Oct 2018 08:14:25 +0000 (11:14 +0300)] 
Add 1.15.1 changelist and fix upgrade util

Upgrade was failing when functions without any arguments were added to
the changelist. This commit fixes the issue by setting the argument list
to empty string.

Closes #329

3 months agoUpdate RELEASE_NOTES for 1.15.1 release
Orhan Kislal [Thu, 4 Oct 2018 08:13:46 +0000 (11:13 +0300)] 
Update RELEASE_NOTES for 1.15.1 release

3 months agoBuild: Change version to 1.15.1
Orhan Kislal [Thu, 4 Oct 2018 08:13:18 +0000 (11:13 +0300)] 
Build: Change version to 1.15.1

3 months agoMargins: Copy summary table instead of renaming
Orhan Kislal [Thu, 4 Oct 2018 07:40:20 +0000 (10:40 +0300)] 
Margins: Copy summary table instead of renaming

JIRA: MADLIB-1276

The Margins summary table gets dropped because its schema remains pg_temp.
This commit fixes the issue by copying the contents instead of renaming.

Closes #330

3 months agoUpgrade: Fix issue with upgrading RPM to 1.15.1
Nandish Jayaram [Mon, 1 Oct 2018 21:32:44 +0000 (14:32 -0700)] 
Upgrade: Fix issue with upgrading RPM to 1.15.1

JIRA: MADLIB-1278

During an RPM upgrade, rpm_post.sh is run first, followed by
rpm_post_uninstall.sh, so the uninstallation-specific steps must depend
on whether the current operation is an uninstall or an upgrade.
This commit makes the necessary change to remove symlinks only during
uninstallation, and not while upgrading.

Closes #327

Co-authored-by: Domino Valdano <dvaldano@pivotal.io>
3 months agoGraph: Add id of nodes with 0 in-degree
Rahul Iyer [Mon, 1 Oct 2018 22:24:20 +0000 (15:24 -0700)] 
Graph: Add id of nodes with 0 in-degree

JIRA: MADLIB-1279

IDs of nodes with 0 in-degree were not showing in the result of
`in_out_degrees` since the output table was a result of a full outer
join where the id was obtained from only one side.

This has been fixed by checking for NULL values (using coalesce): if the
ID is missing on the primary side, it is taken from the other side.
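
The effect of the fix can be illustrated outside the database. The
following is a minimal Python sketch, not the SQL used by MADlib: it
emulates a full outer join of in-degree and out-degree counts and takes
the id from whichever side is present. The `coalesce` helper, the
function name and the edge list are assumptions made for illustration.

```python
from collections import Counter


def coalesce(*values):
    """Return the first value that is not None, like SQL COALESCE."""
    for value in values:
        if value is not None:
            return value
    return None


def in_out_degrees(edges):
    """Emulate a full outer join of in-degree and out-degree counts."""
    out_deg = Counter(src for src, _ in edges)
    in_deg = Counter(dest for _, dest in edges)
    rows = []
    for vertex in sorted(set(out_deg) | set(in_deg)):
        # Each side yields its id only when the vertex appears on that side.
        in_id = vertex if vertex in in_deg else None
        out_id = vertex if vertex in out_deg else None
        rows.append((coalesce(in_id, out_id),
                     in_deg.get(vertex, 0),
                     out_deg.get(vertex, 0)))
    return rows


# Vertex 0 has no incoming edges, so only the out-degree side knows its id.
edges = [(0, 1), (0, 2), (1, 2)]
result = in_out_degrees(edges)
```

Without the `coalesce` call, the row for vertex 0 would carry a missing id,
which is the symptom described above.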

Closes #328

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
3 months agoBuild: Add single quote while setting AppendOnly guc
Jingyi Mei [Wed, 26 Sep 2018 20:56:04 +0000 (13:56 -0700)] 
Build: Add single quote while setting AppendOnly guc

JIRA: MADLIB-1273

Commit 3db98babe3326fb5e2cd16d0639a2bef264f4b04 added a context manager
for setting appendonly to false for all madlib modules. The commit was
missing a quote around the `gp_default_storage_options` guc because of
which the set command always failed. This commit adds a quote while
setting the guc.

Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Closes #323

3 months agoBuild: Remove primary key constraint in IC/DC
Jingyi Mei [Mon, 24 Sep 2018 18:48:51 +0000 (11:48 -0700)] 
Build: Remove primary key constraint in IC/DC

Some of the install-check/dev-check tests were setting primary key
constraints. If the user sets the GUC
gp_default_storage_options='appendonly=true', IC/DC will fail during
table creation because appendonly tables don't support primary keys. This
commit removes these constraints since they are unnecessary for those
test cases.

Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Closes #323

3 months agoRF: Increase the dataset size of dev-check test
Orhan Kislal [Mon, 24 Sep 2018 06:48:27 +0000 (09:48 +0300)] 
RF: Increase the dataset size of dev-check test

Closes #321

3 months agoadd caution on run-times to assoc rules user docs re: max itemset size usage
Frank McQuillan [Tue, 18 Sep 2018 22:02:18 +0000 (15:02 -0700)] 
add caution on run-times to assoc rules user docs re: max itemset size usage

3 months agoAllocator: Remove 16-byte alignment in GPDB 6
Rahul Iyer [Wed, 12 Sep 2018 23:59:59 +0000 (16:59 -0700)] 
Allocator: Remove 16-byte alignment in GPDB 6

Findings:
1. MADlib performs a 16-byte alignment for pointers returned by palloc.
2. Postgres prepends a small (16 byte usually) header before every
pointer which includes
a. the memory context and
b. the size of the memory allocation.
3. Greenplum 6+ tweaks that scheme a little: instead of the memory context,
the header tracks a "shared header" which points to another struct with
richer information (aside from the memory context).
4. Postgres calls MemoryContextContains both with the final func
for an aggregate and final function for a windowed aggregate.
5. Currently Postgres always concludes that the datum from MADlib is
allocated outside of the context and makes an extra copy. In
Greenplum, MemoryContextContains needs to dereference the shared header.
This is a problem since the pointer has been shifted and the function is
getting a bad header.

In this commit, we disable the pointer alignment for GPDB 6+ to avoid
failure in this check. Further, we also have to disable vectorization in
Eigen since it does not work when pointers are not 16-byte aligned.
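
The arithmetic behind finding 5 can be sketched in a few lines of Python.
This is only an illustration under assumed numbers: the addresses are
hypothetical, the header size is taken as the "usual" 16 bytes mentioned
above, and the way MADlib actually performs the alignment is not shown
here.

```python
HEADER_SIZE = 16  # assumed header size; the text says "16 byte usually"
ALIGNMENT = 16


def align_up(address, alignment=ALIGNMENT):
    """Round an address up to the next multiple of `alignment`."""
    return (address + alignment - 1) & ~(alignment - 1)


def header_address(pointer):
    """Where a header lookup would search for the header of `pointer`."""
    return pointer - HEADER_SIZE


# Hypothetical allocation: the header starts at 0x1008, so the pointer
# returned by the allocator is 0x1018, which is not 16-byte aligned.
real_header = 0x1008
raw_pointer = real_header + HEADER_SIZE
shifted_pointer = align_up(raw_pointer)

# A lookup from the raw pointer finds the real header; a lookup from the
# shifted pointer lands 8 bytes past it and reads unrelated bytes.
lookup_from_raw = header_address(raw_pointer)
lookup_from_shifted = header_address(shifted_pointer)
```

In this example the shifted pointer is 0x1020, so the lookup reads from
0x1010 instead of 0x1008, which is the "bad header" case described above.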

Closes #319

Co-authored-by: Jesse Zhang <sbjesse@gmail.com>
Co-authored-by: Nandish Jayaram <njayaram@apache.org>
3 months agoCMake: Fix false positive for Postgres 10+ check
Jesse Zhang [Thu, 13 Sep 2018 01:53:35 +0000 (18:53 -0700)] 
CMake: Fix false positive for Postgres 10+ check

We used to mistake 9.3.24 as a higher version than Postgres 10 and stop
matching it to the correct "port". This patch fixes that.
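
The commit message does not say how the check went wrong. One common way
for this kind of false positive to arise, shown here purely as an
illustration in Python rather than CMake, is comparing version strings
character by character instead of component by component. The helper
names are hypothetical.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def is_at_least(version, minimum):
    """Compare versions numerically, component by component."""
    return parse_version(version) >= parse_version(minimum)


# Lexicographic comparison: "9" sorts after "1", so 9.3.24 looks newer.
naive_result = "9.3.24" >= "10"

# Numeric comparison: 9 < 10, so 9.3.24 is correctly treated as older.
correct_result = is_at_least("9.3.24", "10")
```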

Closes #320

4 months agoMadpack: Add a script for automating changelist creation
Orhan Kislal [Mon, 17 Sep 2018 05:54:28 +0000 (08:54 +0300)] 
Madpack: Add a script for automating changelist creation

Closes #318

4 months agoControl: Add minor comments to context managers
Rahul Iyer [Thu, 13 Sep 2018 21:43:09 +0000 (14:43 -0700)] 
Control: Add minor comments to context managers

4 months agominor docs update to svm and elastic net on cross validation table naming
Frank McQuillan [Thu, 13 Sep 2018 19:22:21 +0000 (12:22 -0700)] 
minor docs update to svm and elastic net on cross validation table naming

4 months agoBuild: Disable AppendOnly if available
Rahul Iyer [Wed, 29 Aug 2018 23:23:04 +0000 (16:23 -0700)] 
Build: Disable AppendOnly if available

JIRA: MADLIB-1171

Greenplum provides an Append-optimized table storage that does not allow
UPDATE and DELETE. MADlib model tables are small enough that they won't
see a big benefit of using AO instead of Heap tables.

This commit ensures that APPENDONLY=False during MADlib function call
(the GUC is reset back to original value during exit). For cases where
we recreate the data table (standardization, redistribution, etc), we
have to explicitly add an 'APPENDONLY=true' to see the AO benefits.
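
The set-and-restore behavior can be sketched as a Python context manager.
This is a minimal sketch under assumptions, not the MADlib code: the
`FakeSession` class stands in for a real database connection, the name
`append_only_disabled` is hypothetical, and the whole GUC value is
replaced for simplicity.

```python
from contextlib import contextmanager


class FakeSession:
    """Stand-in for a database session that only records GUC values."""

    def __init__(self):
        self.gucs = {"gp_default_storage_options": "appendonly=true"}
        self.log = []

    def show(self, name):
        return self.gucs[name]

    def set(self, name, value):
        # The value is wrapped in single quotes in the SET statement.
        self.log.append("SET {0}='{1}'".format(name, value))
        self.gucs[name] = value


@contextmanager
def append_only_disabled(session):
    """Turn appendonly off for the duration of the block, then restore it."""
    name = "gp_default_storage_options"
    original = session.show(name)
    session.set(name, "appendonly=false")
    try:
        yield session
    finally:
        # Runs on normal exit and on error, so the user's value comes back.
        session.set(name, original)


session = FakeSession()
with append_only_disabled(session):
    inside = session.show("gp_default_storage_options")
after = session.show("gp_default_storage_options")
```

Putting the restore step in `finally` means the original value is put
back even if the wrapped function raises.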

Closes #316

4 months agokNN: Accept expressions for point_column_name and test_column_name
hpandeycodeit [Tue, 28 Aug 2018 06:40:13 +0000 (23:40 -0700)] 
kNN: Accept expressions for point_column_name and test_column_name

JIRA: MADLIB-1060

This commit adds code to allow expressions for point and test column
names in kNN. This also adds test cases for the same in dev-check.

Closes #315

4 months agoMultiple: Remove trailing whitespace from all SQL
Rahul Iyer [Fri, 7 Sep 2018 22:12:49 +0000 (15:12 -0700)] 
Multiple: Remove trailing whitespace from all SQL

Markdown states that two trailing whitespace characters should be
interpreted as a line break (<br>), which has been implemented by
Doxygen 1.8+. This commit removes all such instances since the trailing
whitespace is inadvertent in most cases. If a line break is required,
then it should be added explicitly (using the HTML tag <br>).

Closes #317

Co-authored-by: Domino Valdano <dvaldano@pivotal.io>
4 months agoMLP: Simplify momentum and Nesterov updates 313/head
Rahul Iyer [Fri, 17 Aug 2018 08:42:53 +0000 (01:42 -0700)] 
MLP: Simplify momentum and Nesterov updates

JIRA: MADLIB-1272

Momentum updates are complicated due to Nesterov requiring an initial
update before gradient calculations. There is, however, a different form
of the Nesterov update that can be cleanly performed after the regular
update, simplifying the code. This allows performing the gradient
calculations before any update - with or without Nesterov.
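
The commit message does not spell out the equations. A widely used
reformulation of this kind stores the look-ahead point as the parameter,
so the gradient is computed first and the Nesterov correction is applied
afterwards. The Python sketch below compares it with the classic form on
a toy objective; the function names, step size and momentum value are
assumptions, and this is not claimed to be the exact update used in
MADlib.

```python
def grad(x):
    """Gradient of the toy objective f(x) = 0.5 * x**2."""
    return x


def nesterov_lookahead(x, steps, lr=0.1, mu=0.9):
    """Classic form: shift to the look-ahead point, then take the gradient."""
    v = 0.0
    for _ in range(steps):
        x_ahead = x + mu * v          # initial update before the gradient
        v = mu * v - lr * grad(x_ahead)
        x = x + v
    return x + mu * v                  # report the look-ahead point


def nesterov_reformulated(x, steps, lr=0.1, mu=0.9):
    """Alternative form: gradient first, Nesterov correction afterwards."""
    v = 0.0
    for _ in range(steps):
        g = grad(x)                    # gradient at the stored parameters
        v_prev = v
        v = mu * v - lr * g            # regular momentum update
        x = x - mu * v_prev + (1 + mu) * v
    return x


a = nesterov_lookahead(5.0, steps=25)
b = nesterov_reformulated(5.0, steps=25)
```

Both functions track the same sequence of look-ahead points, so `a` and
`b` agree up to floating-point rounding.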

Closes #313

4 months agoUbuntu support: Enable creation of gppkg on Ubuntu
Nandish Jayaram [Mon, 6 Aug 2018 22:19:18 +0000 (15:19 -0700)] 
Ubuntu support: Enable creation of gppkg on Ubuntu

This commit makes necessary changes to create a gppkg on Ubuntu. The
default behavior when MADlib is built on Ubuntu is to create a .deb
installer. If we want to create a gppkg, then we need an RPM due to
limitations in gppkg. We now create an RPM on Ubuntu (assuming package
alien is installed on Ubuntu) if the right cmake flag is specified. Once
an RPM is created on `make package`, we can now go ahead and create the
gppkg using `make gppkg`.
The cmake flag to use if we want to create an .rpm instead of .deb on
Ubuntu when we run `make package` is:
-DCREATE_RPM_FOR_UBUNTU=True

Closes #314

Co-authored-by: Orhan Kislal <okislal@pivotal.io>
4 months agoBuild: Update version after release for Ubuntu
Jingyi Mei [Tue, 21 Aug 2018 19:32:10 +0000 (12:32 -0700)] 
Build: Update version after release for Ubuntu

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
5 months agoadd note to user docs on vec2cols about unequal arrays
Frank McQuillan [Fri, 17 Aug 2018 20:38:20 +0000 (13:38 -0700)] 
add note to user docs on vec2cols about unequal arrays

5 months agoMultiple: Re-enable tests in PCA, Pagerank
Jingyi Mei [Fri, 17 Aug 2018 03:12:25 +0000 (20:12 -0700)] 
Multiple: Re-enable tests in PCA, Pagerank

JIRA: MADLIB-1264

Some tests were commented out due to failures on GPDB 5.X.
These tests are now working and have been enabled again.

Closes #312

Co-authored-by: Arvind Sridhar <arvindsridhar@berkeley.edu>
5 months agoVec2Cols: Allow arrays of different lengths
Rahul Iyer [Fri, 17 Aug 2018 03:08:32 +0000 (20:08 -0700)] 
Vec2Cols: Allow arrays of different lengths

JIRA: MADLIB-1270

Added support to split arrays of different lengths in the vector_col.
If the user does not provide feature names, we pad each array to the
maximum length and split across the maximum possible number of features.
If the user does provide feature names, we truncate/pad the arrays
according to the number of features the user desires.
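
The two cases can be sketched in plain Python. This is an illustration
only: padding with `None` (standing in for NULL) and the generated column
names `f1`, `f2`, ... are assumptions, since the message does not state
the padding value or the default names.

```python
def split_vectors(rows, feature_names=None):
    """Split each array into columns, padding with None or truncating."""
    if feature_names is None:
        width = max(len(row) for row in rows)
        # Hypothetical default naming scheme for the generated columns.
        feature_names = ["f{0}".format(i + 1) for i in range(width)]
    else:
        width = len(feature_names)
    table = []
    for row in rows:
        fitted = list(row[:width]) + [None] * (width - len(row))
        table.append(dict(zip(feature_names, fitted)))
    return feature_names, table


vectors = [[1, 2, 3], [4, 5], [6]]
names_auto, padded = split_vectors(vectors)
names_user, fitted = split_vectors(vectors, ["a", "b"])
```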

Closes #311

Co-authored-by: Arvind Sridhar <arvindsridhar@berkeley.edu>
5 months agoElastic Net: Allow grouping by non-numeric column
Arvind Sridhar [Fri, 17 Aug 2018 03:02:48 +0000 (20:02 -0700)] 
Elastic Net: Allow grouping by non-numeric column

JIRA: MADLIB-1262

- Grouping columns should be quoted if the type of the column is of type
TEXT.
- Grouping column names that require double quoting need special
handling.

Closes #309

Co-authored-by: Domino Valdano <dvaldano@pivotal.io>
Co-authored-by: Rahul Iyer <riyer@apache.org>
5 months agoUbuntu support for MADlib
Orhan Kislal [Thu, 16 Aug 2018 23:26:25 +0000 (16:26 -0700)] 
Ubuntu support for MADlib

JIRA: MADLIB-1256

Adds support for compiling on Ubuntu as well as creating a deb package.

Closes #306

Co-authored-by: Domino Valdano <dvaldano@pivotal.io>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Nandish Jayaram <njayaram@apache.org>
5 months agoDocumentation: Remove online examples from sql functions.
Orhan Kislal [Wed, 15 Aug 2018 19:27:50 +0000 (12:27 -0700)] 
Documentation: Remove online examples from sql functions.

JIRA: MADLIB-1260

For a MADlib module, we can call
`select madlib_schema.module_name('example');` to print out examples for that
module. These examples are hard to maintain and not that useful, since we
already have examples in our user documentation at
http://madlib.apache.org/docs/latest/index.html/.
This commit removes those examples from every module that has them and makes
sure MADlib throws a proper error message when a user calls the function this
way.

Closes #302

Co-authored-by: Orhan Kislal <okislal@pivotal.io>
Co-authored-by: Nandish Jayaram <njayaram@apache.org>
5 months agoBuild: Download compatible Boost if version >= 1.65
Rahul Iyer [Sat, 11 Aug 2018 19:28:29 +0000 (12:28 -0700)] 
Build: Download compatible Boost if version >= 1.65

JIRA: MADLIB-1235

BOOST 1.65.0 removed the TR1 library which is required by MADlib till
C++11 is completely supported. Hence, we force download of a compatible
version if existing Boost is 1.65 or greater. This should be removed
when TR1 dependency is removed.

Closes #310

5 months agoUtilities: Use plpy.quote_ident if available
Rahul Iyer [Mon, 13 Aug 2018 22:45:31 +0000 (15:45 -0700)] 
Utilities: Use plpy.quote_ident if available

5 months agoBuild: Update versions after release
Rahul Iyer [Mon, 13 Aug 2018 18:40:36 +0000 (11:40 -0700)] 
Build: Update versions after release

5 months agoDoc: Remove update_mathjax target latest_release 335/head rc/1.15-rc1 rel/v1.15
Rahul Iyer [Tue, 7 Aug 2018 04:46:03 +0000 (21:46 -0700)] 
Doc: Remove update_mathjax target

5 months agoRelease: Release Notes for v1.15
Rahul Iyer [Mon, 6 Aug 2018 21:27:38 +0000 (14:27 -0700)] 
Release: Release Notes for v1.15

Closes #308

5 months agoCV: Simplify and fix internal CV requirements
Rahul Iyer [Mon, 6 Aug 2018 02:18:32 +0000 (19:18 -0700)] 
CV: Simplify and fix internal CV requirements

This commit ensures internal cross validation API is consistent and
simplifies the arguments for CV parameters.

Closes #307

5 months agoMadpack: Duplicate DROP SCHEMA due to invalid cache
Rahul Iyer [Mon, 6 Aug 2018 02:18:48 +0000 (19:18 -0700)] 
Madpack: Duplicate DROP SCHEMA due to invalid cache

JIRA: MADLIB-1014

We've intermittently noticed a "cache lookup failure" due to a
"DROP OWNED BY". This was noticed only with CV enabled, possibly due to
the excessive number of tables created by CV.

This could be related to an error on StackExchange:
https://dba.stackexchange.com/questions/173815/redshift-internalerror-cache-lookup-failed-for-relation

An excerpt from the issue: "The problem is on the Postgres DB engine
caching is something to do with the System Catalog Cache access. ... the
issue gets reproduced when DROP is used very often and the cache is not
able to retrieve data which seems to be out-of-sync. After sometime (few
seconds) the cache is in back in sync and query runs fine. The only
workaround I see for the moment is retry query when fail ..."

To get around this issue, we duplicate the "DROP CASCADE" call after a
delay with the hope that the second call will clear the schema without
an issue.
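
A minimal Python sketch of the idea follows. It assumes a hypothetical
`run_sql` callable and retries only when the first call fails; whether
the real code repeats the call unconditionally is not stated above, and
the schema name and delay are placeholders.

```python
import time


def drop_schema_with_retry(run_sql, schema, delay_seconds=1.0):
    """Issue DROP SCHEMA ... CASCADE, repeating it once after a delay."""
    statement = "DROP SCHEMA IF EXISTS {0} CASCADE".format(schema)
    try:
        run_sql(statement)
    except Exception:
        # Give the catalog cache time to settle, then try a second time.
        time.sleep(delay_seconds)
        run_sql(statement)


class FlakyDatabase:
    """Test double whose first statement fails with a cache lookup error."""

    def __init__(self):
        self.calls = []

    def __call__(self, statement):
        self.calls.append(statement)
        if len(self.calls) == 1:
            raise RuntimeError("cache lookup failed for relation")


database = FlakyDatabase()
drop_schema_with_retry(database, "madlib_installcheck", delay_seconds=0.01)
```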

5 months agoChange the version to 1.15 and add changelist
Orhan Kislal [Wed, 1 Aug 2018 21:49:46 +0000 (14:49 -0700)] 
Change the version to 1.15 and add changelist

Closes #305
Co-authored-by: Nandish Jayaram <njayaram@apache.org>
5 months agominor edit to minibatch preproc user doc
Frank McQuillan [Thu, 2 Aug 2018 17:34:17 +0000 (10:34 -0700)] 
minor edit to minibatch preproc user doc

5 months agoDT/RF: Fix user doc examples 295/head
Frank McQuillan [Wed, 1 Aug 2018 19:49:10 +0000 (12:49 -0700)] 
DT/RF: Fix user doc examples

5 months agoDT/RF: Add function to report importance scores
Nandish Jayaram [Tue, 3 Jul 2018 19:22:07 +0000 (12:22 -0700)] 
DT/RF: Add function to report importance scores

JIRA: MADLIB-925

This commit adds a new MADlib function (get_var_importance) to report the
importance scores in decision tree and random forest by unnesting the
importance values along with corresponding features.

Closes #295

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
5 months agoDT/RF: Don't eliminate single-level cat variable
Rahul Iyer [Thu, 26 Jul 2018 19:17:58 +0000 (12:17 -0700)] 
DT/RF: Don't eliminate single-level cat variable

JIRA: MADLIB-1258

When DT/RF is run with grouping, a subset of the groups could eliminate
a categorical variable leading to multiple issues downstream, including
invalid importance values and incorrect prediction.

This commit keeps all categorical variables (even those that contain just
one level). The accumulator state would use additional space during
tree_train for this categorical variable, even though the variable is
never consumed by the tree. This inefficiency is still preferred since
it yields clean code and error-free prediction/importance reporting.

Additional changes:
- get_expr_type (validate_args.py) has been updated to return type for
multiple expressions at the same time. This prevents calling a separate
query for each expression, thus saving time.
- Cat features are not stored per tree (in the grouping case) anymore
since the features are now consistent across trees.

Closes #301

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
5 months agoUtilities: Add module transform_vec_cols for column-vector conversion
Arvind Sridhar [Wed, 1 Aug 2018 18:22:27 +0000 (11:22 -0700)] 
Utilities: Add module transform_vec_cols for column-vector conversion

JIRA: MADLIB-1240

This commit adds a new SQL function called vec2cols and refactors the
current function cols2vec, providing greater integration between the two
modules. We now have a single Python file with separate classes for each
feature. We also have unified unit-tests and dev-check/install-check
tests.

The vec2cols function enables users to split up a single column into
multiple columns, given that the input column contains array entries.
For example, if the input column contained ARRAY[1, 2, 3] in one of its
rows, the output table will contain 3 different columns, one for each
element of the array.

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
Co-authored-by: Frank McQuillan <fmcquillan@pivotal.io>
Closes #291

5 months agoMadpack: Fix missing test logs bug.
Orhan Kislal [Wed, 25 Jul 2018 22:05:08 +0000 (15:05 -0700)] 
Madpack: Fix missing test logs bug.

Due to a recent commit, madpack cleaned log files of test operations as
well as the atomic operations. As a result, log files are missing even
after install/dev check fails. This commit fixes this issue.

Closes #300

Co-authored-by: Jingyi Mei <jmei@pivotal.io>
5 months agoMultiple: Clean and update documentation
Frank McQuillan [Wed, 25 Jul 2018 00:20:18 +0000 (17:20 -0700)] 
Multiple: Clean and update documentation

Closes #298

5 months agoRF: Port DT fix for incorrect importance vector length
Rahul Iyer [Wed, 25 Jul 2018 17:51:47 +0000 (10:51 -0700)] 
RF: Port DT fix for incorrect importance vector length

JIRA: MADLIB-1254

Commit 0f7834e contained a fix in DT that ensured that impurity variable
importance was of the correct length for each group if a single group
eliminated a categorical variable.

This commit applies the same fix for random forest.

Closes #299

5 months agoMadpack: Improve error message
Orhan Kislal [Wed, 25 Jul 2018 17:41:08 +0000 (10:41 -0700)] 
Madpack: Improve error message

5 months agoMadpack: Fix various schema related bugs and messages.
Orhan Kislal [Tue, 24 Jul 2018 18:20:18 +0000 (11:20 -0700)] 
Madpack: Fix various schema related bugs and messages.

Closes #297

5 months agoDT/RF: Ensure cat features are recorded per group
Rahul Iyer [Mon, 23 Jul 2018 18:10:48 +0000 (11:10 -0700)] 
DT/RF: Ensure cat features are recorded per group

JIRA: MADLIB-1254

If tree_train/forest_train is run with grouping enabled and if one of
the groups has a categorical feature with just a single level, then the
categorical feature is eliminated for that group. If other groups retain
that feature, then we end up with incorrect "bins" data structure built
as part of DT.

This commit fixes this issue by recording the categorical features
present in each group separately.

Closes #296

5 months agoCols2Vec: Add Apache License header
Rahul Iyer [Wed, 18 Jul 2018 23:30:57 +0000 (16:30 -0700)] 
Cols2Vec: Add Apache License header

6 months agoRF: Add impurity variable importance
Rahul Iyer [Wed, 18 Jul 2018 20:24:53 +0000 (13:24 -0700)] 
RF: Add impurity variable importance

JIRA: MADLIB-1205

This commit makes the following changes:
- Add impurity variable importance for random forests.
- Rename current cat_var_importance and con_var_importance measurements to
oob_cat_var_importance and oob_con_var_importance.

New impurity measurement is provided as impurity_var_importance, and supports
grouping. It combines the importance values for both categorical and
continuous features into a single array.

Closes #289

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Arvind Sridhar <asridhar@pivotal.io>
6 months agoPagerank: Remove duplicate entries from grouping output
Nandish Jayaram [Sat, 14 Jul 2018 00:09:11 +0000 (17:09 -0700)] 
Pagerank: Remove duplicate entries from grouping output

JIRA: MADLIB-1229
JIRA: MADLIB-1253

This commit also fixes the bug where output was missing for complete graphs.

Closes #294

Co-authored-by: Orhan Kislal <okislal@pivotal.io>
6 months agoDocs: Update madpack help message.
Nandish Jayaram [Tue, 17 Jul 2018 16:22:38 +0000 (09:22 -0700)] 
Docs: Update madpack help message.

6 months agomadpack: Add madpack option to run unit tests.
Nandish Jayaram [Wed, 11 Jul 2018 00:32:24 +0000 (17:32 -0700)] 
madpack: Add madpack option to run unit tests.

JIRA: MADLIB-1251
JIRA: MADLIB-1252

Unit tests in MADlib are written in Python files whose names begin with
the prefix "test_" and that are located in the
...<module_name>/test/unit_tests/ folders. This commit adds a new madpack
option to run unit tests, similar to how we run install and dev checks.

- The new option added is: `unit-test`.
- Sample usage (on a postgres database, with MADlib installed on
  database `madlib`):
  * Run unit tests on all modules that have it defined:
      src/bin/madpack -p postgres -c /madlib unit-test
  * Run unit tests only for the `convex` module:
      src/bin/madpack -p postgres -c /madlib unit-test -t convex
  * Run unit tests only for the `convex` and decision trees module:
      src/bin/madpack -p postgres -c /madlib unit-test -t convex,recursive_partitioning/decision_tree
- Add command to run all unit tests in Jenkins build script.
- This commit also removes `-t` option for install, reinstall and upgrade.
Using `-t` options for specific modules was bringing MADlib installation
into an unstable state.
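
The file layout described above lends itself to a simple search. The
following Python sketch is an assumption about how such a search could
look, not the madpack implementation: `find_unit_tests` is a
hypothetical helper, and the module and file names in the throwaway tree
are made up.

```python
import os
import tempfile


def find_unit_tests(root, module=None):
    """Return sorted paths of test_*.py files under */test/unit_tests/."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        parts = dirpath.split(os.sep)
        if parts[-2:] != ["test", "unit_tests"]:
            continue
        if module is not None and module not in parts:
            continue
        for name in filenames:
            if name.startswith("test_") and name.endswith(".py"):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a throwaway tree with two hypothetical modules to exercise the search.
root = tempfile.mkdtemp()
for module_name, file_name in [("convex", "test_mlp.py"),
                               ("convex", "helper.py"),
                               ("knn", "test_knn.py")]:
    folder = os.path.join(root, module_name, "test", "unit_tests")
    os.makedirs(folder, exist_ok=True)
    open(os.path.join(folder, file_name), "w").close()

all_tests = find_unit_tests(root)
convex_tests = find_unit_tests(root, module="convex")
```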

Closes #290

6 months agoUtilities: Refactor and clean cols2vec from 2828d86
Rahul Iyer [Thu, 12 Jul 2018 23:44:57 +0000 (16:44 -0700)] 
Utilities: Refactor and clean cols2vec from 2828d86

JIRA: MADLIB-1239

Closes #288

6 months agoUtilities: Add cols2vec() to convert columns to array
Himanshu Pandey [Fri, 15 Jun 2018 08:33:27 +0000 (01:33 -0700)] 
Utilities: Add cols2vec() to convert columns to array

JIRA: MADLIB-1239

This commit adds a new function called cols2vec that can be used to
convert features from multiple columns of an input table into a feature
array in a single column.
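
The transformation itself is easy to picture in plain Python. This
sketch only illustrates the idea on a list of dictionaries; the function
name `cols_to_vec`, the output column name `feature_vector` and the
sample data are assumptions and do not reflect the SQL signature of
cols2vec.

```python
def cols_to_vec(rows, feature_columns, vector_name="feature_vector"):
    """Collect the named columns of each row into one array column."""
    output = []
    for row in rows:
        kept = {k: v for k, v in row.items() if k not in feature_columns}
        kept[vector_name] = [row[column] for column in feature_columns]
        output.append(kept)
    return output


table = [
    {"id": 1, "height": 1.7, "weight": 65.0},
    {"id": 2, "height": 1.8, "weight": 80.0},
]
converted = cols_to_vec(table, ["height", "weight"])
```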

6 months agoUtils: Simplify proxy quote function
Rahul Iyer [Fri, 13 Jul 2018 17:59:33 +0000 (10:59 -0700)] 
Utils: Simplify proxy quote function

Commit 5e47c8e added a wrapper quote_literal function that called
plpy.quote_literal if available, else returned dollar-quoted string.
We can use Python's introspection to switch between these two
options at runtime instead of a compile-time preprocessor switch.
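
A minimal sketch of such a runtime switch is shown below. It is not the
MADlib code: the two `SimpleNamespace` objects stand in for plpy modules
with and without quote_literal, the stand-in quoting function is
simplified, and the tag `$__madlib__$` is a made-up example of an
unlikely dollar-quote tag.

```python
import types


def dollar_quote(text, tag="$__madlib__$"):
    """Fallback: wrap text in dollar quotes with an unlikely tag."""
    if tag in text:
        raise ValueError("text contains the quoting tag")
    return "{0}{1}{0}".format(tag, text)


def make_quote_literal(plpy_module):
    """Pick the quoting function once, based on what the module offers."""
    native = getattr(plpy_module, "quote_literal", None)
    return native if callable(native) else dollar_quote


# Two stand-ins for plpy: one with quote_literal, one without it.
# The lambda only doubles single quotes; the real function does more.
new_plpy = types.SimpleNamespace(
    quote_literal=lambda s: "'" + s.replace("'", "''") + "'")
old_plpy = types.SimpleNamespace()

quoted_native = make_quote_literal(new_plpy)("it's")
quoted_fallback = make_quote_literal(old_plpy)("it's")
```

Using `getattr` with a default keeps the decision in ordinary Python, so
no preprocessor branch is needed for platforms that lack the function.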

6 months agoUtils: Add a Python quote_literal for GP platforms
Rahul Iyer [Fri, 13 Jul 2018 05:46:07 +0000 (22:46 -0700)] 
Utils: Add a Python quote_literal for GP platforms

Versions prior to GPDB 6 or PostgreSQL 9.1 do not provide
plpy.quote_literal, which is necessary for building a SQL text array from
a Python list of strings. We work around this limitation by creating
our own quote_literal function that just returns plpy.quote_literal
output for platforms that provide the function. For other platforms, we
compromise by using dollar-quoting (with an obscure tag between the
dollars).

6 months agoMadpack: Fix glob expansion for dev-check
Rahul Iyer [Fri, 13 Jul 2018 00:11:54 +0000 (17:11 -0700)] 
Madpack: Fix glob expansion for dev-check

6 months agoUtilities: Add check for any array type
Arvind Sridhar [Mon, 9 Jul 2018 23:14:48 +0000 (16:14 -0700)] 
Utilities: Add check for any array type

Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Closes #293

6 months agoMultiple: Update docs related to CV
Frank McQuillan [Wed, 11 Jul 2018 18:02:02 +0000 (11:02 -0700)] 
Multiple: Update docs related to CV

JIRA: MADLIB-1250

This commit updates documentation to reflect latest changes in cross
validation. An additional minor change is made to MLP docs to use 'AVG'
instead of 'SUM/COUNT'.

6 months agoCV: Fix incorrect dict index + change output columns
Rahul Iyer [Tue, 3 Jul 2018 21:28:21 +0000 (14:28 -0700)] 
CV: Fix incorrect dict index + change output columns

JIRA: MADLIB-1250

Cross validation had a minor bug that didn't fully index into a two-level
nested dictionary. This led to a KeyError while writing CV results to an
output table. This has been fixed in this commit.

Additionally, the CV output table columns are now called 'mean_score' and
'std_dev_score' instead of 'mean_neg_loss' and 'std_neg_loss', to avoid
confusion with the loss function used in the primary modeling technique.

Closes #287

6 months agoSVM: Compute average loss per row instead of total loss
Rahul Iyer [Tue, 10 Jul 2018 20:47:39 +0000 (13:47 -0700)] 
SVM: Compute average loss per row instead of total loss

6 months agoBuild: Remove symlinks during rpm uninstall
Rahul Iyer [Tue, 3 Jul 2018 19:02:55 +0000 (12:02 -0700)] 
Build: Remove symlinks during rpm uninstall

JIRA: MADLIB-1175

`rpm --install` creates three symlinks to `Versions/`, `.../bin`, and
`.../doc`. These symlinks should be deleted during `rpm --erase`.
Additionally, we also delete `Versions/` if it is empty after the erase.

Closes #286

Co-Authored-by: Arvind Sridhar <asridhar@pivotal.io>
6 months agoUtilities: Add CTAS while dropping some columns
Rahul Iyer [Thu, 12 Jul 2018 17:08:52 +0000 (10:08 -0700)] 
Utilities: Add CTAS while dropping some columns

JIRA: MADLIB-1241

This commit adds a function to create a new table from an existing table
while dropping some of the columns of the original table.

Closes #282

6 months agoFix unit test failure in MLP
Nandish Jayaram [Wed, 11 Jul 2018 21:50:37 +0000 (14:50 -0700)] 
Fix unit test failure in MLP

6 months agoDT: Fix true and false child indexing
Rahul Iyer [Wed, 11 Jul 2018 15:04:54 +0000 (08:04 -0700)] 
DT: Fix true and false child indexing

Majority count and split incorrectly indexed the child indices in the
tree leading to an invalid surrogate agreement calculation. This commit
fixes the problem by using the trueChild and falseChild functions to get
the children locations instead of directly computing the indices.

6 months agoMadpack: fix install/reinstall not giving proper error message
Jingyi Mei [Mon, 2 Jul 2018 00:49:24 +0000 (17:49 -0700)] 
Madpack: fix install/reinstall not giving proper error message

JIRA: MADLIB-1248

Previously, uninstalling or reinstalling on a database that did not have
MADlib installed failed as expected. However, the info messages did not
mention this failure, and madpack kept going to prepare database objects
for installation. This commit fixes this issue: if MADlib is not installed
in the database, madpack reports that there is nothing to
uninstall/reinstall and stops.

Closes #285

6 months agoEncode categorical variables: handling special characters
Arvind Sridhar [Thu, 24 May 2018 00:02:43 +0000 (17:02 -0700)] 
Encode categorical variables: handling special characters

JIRA: MADLIB-1238
JIRA: MADLIB-1243

This commit deals with special characters in column name and column
values. Also adds install check test cases to cover these scenarios.

Closes #281

Co-Authored-by: Jingyi Mei <jmei@pivotal.io>
Co-Authored-by: Arvind Sridhar <asridhar@pivotal.io>
6 months agoMLP+Minibatch Preprocessing: Support special characters
Jingyi Mei [Wed, 23 May 2018 23:29:54 +0000 (16:29 -0700)] 
MLP+Minibatch Preprocessing: Support special characters

JIRA: MADLIB-1237
JIRA: MADLIB-1238

This commit enables special character support for column names and
column values for mlp and minibatch preprocessor. We decided to use the
following strategy for supporting special characters:

The module that needs to support special characters will have to call
quote_literal() on all the column values that need to be escaped and
quoted, and then this list can be passed to the py_list_to_sql_string
function.

We also created a function called get_distinct_col_levels which will
call quote_literal and then return a list of escaped column levels. The
output of this function can then be safely passed to
py_list_to_sql_string with long_format set as True.
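
The quoting strategy can be sketched in plain Python. This is only an illustration: the `quote_literal` below is a simplified stand-in for the PostgreSQL function of the same name (it doubles single quotes and ignores backslash handling), and `levels_to_sql_array` is a hypothetical helper, not the actual py_list_to_sql_string.

```python
def quote_literal(value):
    """Simplified stand-in for PostgreSQL quote_literal(): wrap the value in
    single quotes and double any embedded single quote."""
    return "'" + str(value).replace("'", "''") + "'"


def levels_to_sql_array(levels):
    """Hypothetical helper: quote every level, then join into an ARRAY[...] text."""
    quoted = [quote_literal(level) for level in levels]
    return "ARRAY[" + ", ".join(quoted) + "]"


sql_array = levels_to_sql_array(["it's", "x,y"])
```

Quoting each value before the list is joined keeps commas and quotes inside a value from being read as SQL syntax.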

Co-Authored-by: Jingyi Mei <jmei@pivotal.io>
Co-Authored-by: Rahul Iyer <riyer@apache.org>
Co-Authored-by: Arvind Sridhar <asridhar@pivotal.io>
6 months agoMLP: Add momentum and nesterov accelerated gradient
Nikhil Kak [Wed, 2 May 2018 12:25:48 +0000 (05:25 -0700)] 
MLP: Add momentum and nesterov accelerated gradient

JIRA: MADLIB-1210

Momentum methods remember the past gradients/model updates and allow
smoothing out the erratic behaviour of the gradient updates, without
slowing down the learning. With Momentum update, the parameter vector
will build up velocity in any direction that has consistent gradient.

Nesterov Accelerated Gradient method is a slightly different version of
the momentum update that enjoys stronger theoretical convergence guarantees
for convex functions and in practice also works slightly better than
standard momentum.
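
The two update rules can be sketched on a one-dimensional problem. This is a textbook formulation under assumed names (`lr` for the learning rate, `mu` for the momentum coefficient), not the MLP module's actual code.

```python
def momentum_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Standard momentum: the velocity accumulates past gradients."""
    v_new = mu * v - lr * grad_fn(w)
    return w + v_new, v_new


def nesterov_step(w, v, grad_fn, lr=0.1, mu=0.9):
    """Nesterov: evaluate the gradient at the look-ahead point w + mu * v."""
    lookahead = w + mu * v
    v_new = mu * v - lr * grad_fn(lookahead)
    return w + v_new, v_new


def minimize(step, steps=500):
    """Minimize f(w) = w^2 (gradient 2w) starting from w = 5 with zero velocity."""
    w, v = 5.0, 0.0
    for _ in range(steps):
        w, v = step(w, v, lambda x: 2.0 * x)
    return w


w_momentum = minimize(momentum_step)
w_nesterov = minimize(nesterov_step)
```

The only difference between the two steps is where the gradient is evaluated, which is why they can share most of their update code.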

This commit also includes some refactoring that combines the update
methods for IGD and mini-batch.

Closes #272

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Jingyi Mei <jmei@pivotal.io>
6 months agoSVM: Fix flaky dev-check failure
Nandish Jayaram [Wed, 27 Jun 2018 19:40:19 +0000 (12:40 -0700)] 
SVM: Fix flaky dev-check failure

JIRA: MADLIB-1232

SVM has a dev-check query that is flaky on a large cluster. This commit
relaxes the assert condition for that query.

Closes #284

6 months agoBugfix: Fix failing dev check in CRF
Nandish Jayaram [Wed, 27 Jun 2018 18:25:46 +0000 (11:25 -0700)] 
Bugfix: Fix failing dev check in CRF

A couple of dev check files in CRF did not have the label table creation
in it, which resulted in dev-check failures (not sure how it was going
through fine earlier). This commit consists of changes to fix those failures.

Closes #283

6 months agoInfra: Use dev-check in Jenkins build
Nandish Jayaram [Wed, 27 Jun 2018 21:52:36 +0000 (14:52 -0700)] 
Infra: Use dev-check in Jenkins build

Jenkins builds (both PR and master) are currently still running only
install-check. This commit now runs dev-check instead.

6 months agoMadpack: Reenable testcase selection in IC
Rahul Iyer [Wed, 27 Jun 2018 14:11:17 +0000 (07:11 -0700)] 
Madpack: Reenable testcase selection in IC

6 months agoMadpack: Add dev-check and a compact install-check.
Orhan Kislal [Tue, 26 Jun 2018 17:54:15 +0000 (10:54 -0700)] 
Madpack: Add dev-check and a compact install-check.

JIRA: MADLIB-1247

- The current install check is expensive since it runs various hyper-param
permutations for all MADlib modules. This commit moves all of those
tests to dev-check, which can be used by developers for iterating
faster. We have now created a watered-down install-check for each module,
which just runs one hyper-param combination for each MADlib function,
and does not do any asserts.
- This commit also includes changes in madpack to add a new madpack
option for dev-check.

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Arvind Sridhar <asridhar@pivotal.io>
6 months agoDT: Add impurity variable importance
Rahul Iyer [Wed, 30 May 2018 01:09:18 +0000 (18:09 -0700)] 
DT: Add impurity variable importance

JIRA: MADLIB-1205

Breiman et al. [1] describe a "gini importance" measure that can be
computed for a single decision tree. This measure is the impurity
decrease produced by any given feature in a node, accumulated over the
whole tree. Surrogates can also be added to this by scaling the impurity
decrease with the adjusted surrogate agreement.

This commit adds this importance measure for all the impurity functions
(hence the term impurity importance instead of gini importance).
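
A minimal sketch of the measure follows. It assumes each split's decrease is weighted by the fraction of rows reaching the node, and that a surrogate's share is the same decrease scaled by its adjusted agreement; the node dictionary layout and the function names are hypothetical.

```python
from collections import defaultdict


def gini(counts):
    """Gini impurity of a list of per-class row counts."""
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)


def impurity_importance(nodes, impurity=gini):
    """Accumulate the weighted impurity decrease per feature over internal nodes.

    nodes[0] is assumed to be the root; every node holds the class counts of
    the node itself and of its true/false children.
    """
    n_root = float(sum(nodes[0]["counts"]))
    importance = defaultdict(float)
    for node in nodes:
        n = float(sum(node["counts"]))
        n_true = sum(node["true_counts"])
        n_false = sum(node["false_counts"])
        decrease = (impurity(node["counts"])
                    - (n_true / n) * impurity(node["true_counts"])
                    - (n_false / n) * impurity(node["false_counts"]))
        weighted = (n / n_root) * decrease
        importance[node["feature"]] += weighted
        # Surrogates get the same decrease scaled by their adjusted agreement.
        for feature, agreement in node.get("surrogates", []):
            importance[feature] += weighted * agreement
    return dict(importance)


tree = [
    {"feature": "age", "counts": [50, 50],
     "true_counts": [40, 10], "false_counts": [10, 40]},
    {"feature": "income", "counts": [40, 10],
     "true_counts": [40, 0], "false_counts": [0, 10]},
]
importance = impurity_importance(tree)
```

Passing a different `impurity` callable (entropy, for example) reuses the same accumulation, which is the sense in which the measure is an impurity importance rather than only a gini importance.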

[1] https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#giniimp

Closes #277

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
6 months agoMadpack: Fix error with dropping user after IC failure.
Nandish Jayaram [Tue, 5 Jun 2018 19:08:25 +0000 (12:08 -0700)] 
Madpack: Fix error with dropping user after IC failure.

JIRA: MADLIB-1182

Previously, when install check did not fail gracefully, the user created
by madpack hung around and disturbed IC attempts within other databases.
We fixed this by:
1) Renaming the test user using the specific database that the IC test
was run from, making the test user name database-specific.
2) Dropping schema and user in a try-finally block, so that it's executed even
on a failed IC run.
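
The two steps above can be sketched as follows. The user and schema naming scheme, the SQL text, and the `run_sql`/`run_tests` callables are all hypothetical; the sketch only shows the try-finally shape.

```python
def run_install_check(db_name, run_tests, run_sql):
    """Run the checks as a database-specific test user and always clean up."""
    test_user = "madlib_ic_" + db_name          # hypothetical naming scheme
    test_schema = "madlib_ic_schema_" + db_name  # hypothetical naming scheme
    run_sql("CREATE USER {0}".format(test_user))
    try:
        run_tests(test_user)
    finally:
        # Executed on success and on failure alike.
        run_sql("DROP SCHEMA IF EXISTS {0} CASCADE".format(test_schema))
        run_sql("DROP USER IF EXISTS {0}".format(test_user))


def failing_tests(user):
    raise RuntimeError("install check failed")


executed = []
try:
    run_install_check("db1", failing_tests, executed.append)
except RuntimeError:
    pass
```

Even though the checks raise, `executed` ends with the two DROP statements, so no test user is left behind to disturb runs in other databases.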

Closes #275

Co-authored-by: Arvind Sridhar <asridhar@pivotal.io>
7 months agoKNN: Make install-check asserts deterministic
Nikhil Kak [Thu, 14 Jun 2018 17:20:55 +0000 (10:20 -0700)] 
KNN: Make install-check asserts deterministic

KNN install-check had a couple of unordered array_agg asserts which were not
deterministic. This commit ensures the order is deterministic.

Closes #279

7 months agoUpgrade: Fix multiple bugs
Nikhil Kak [Fri, 15 Jun 2018 15:18:34 +0000 (08:18 -0700)] 
Upgrade: Fix multiple bugs

1. Appended schema_madlib to the mlp_igd_final return type. The missing
schema name caused the upgrade to fail from 1.12 to 1.x if there was a
dependency on mlp_igd_final.

2. A new changelist was created for changes from v1.14 to 1.15-dev.  We
will rename this at the 1.15 release from 1.14_1.15-dev.yaml to
1.14_1.15.yaml.

3. Commit 8e34f68 added a new function called `_write_to_file` that
takes 2 arguments.  Some of the calls to this function were not passing
the first file handle argument.

Closes #278

Co-authored-by: Orhan Kislal <okislal@pivotal.io>

7 months agoMadpack: Make install, reinstall and upgrade atomic 271/head
Nandish Jayaram [Fri, 18 May 2018 23:13:28 +0000 (16:13 -0700)] 
Madpack: Make install, reinstall and upgrade atomic

JIRA: MADLIB-1242

We now write all the necessary sql for MADlib installation into one file,
and run it once in a single session. The database's rollback will be useful
to bring it back to original state in case of a failure during install,
reinstall, uninstall, and upgrade.
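
The rollback behaviour can be illustrated with a small sketch. It uses Python's built-in sqlite3 module purely as a stand-in for the target database, and `run_atomically` is a hypothetical helper rather than madpack code.

```python
import sqlite3


def run_atomically(conn, statements):
    """Run all statements in one transaction; undo everything on any failure."""
    conn.execute("BEGIN")
    try:
        for statement in statements:
            conn.execute(statement)
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        conn.execute("ROLLBACK")
        return False


# isolation_level=None lets the code manage the transaction explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
install_ok = run_atomically(conn, [
    "CREATE TABLE model (id INTEGER)",
    "CREATE TABLE model (id INTEGER)",  # fails: the table already exists
])
tables = conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
```

Because the second statement fails, the first CREATE TABLE is rolled back as well and the database is left as it was before the run.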

Closes #271

Co-authored-by: Rahul Iyer <riyer@apache.org>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
7 months agoLogregr: Report error if output table is empty
Himanshu Pandey [Fri, 1 Jun 2018 01:44:41 +0000 (18:44 -0700)] 
Logregr: Report error if output table is empty

JIRA: MADLIB-1172

When the model cannot be generated due to ill-conditioned input data,
the output table doesn't get populated.  In this case, we report back an
error instead of creating the empty table.

Closes #270

7 months agoDT: Ensure summary table has correct features
Rahul Iyer [Thu, 3 May 2018 18:38:27 +0000 (11:38 -0700)] 
DT: Ensure summary table has correct features

JIRA: MADLIB-1236

If a cat_feature is dropped (due to just a single level), that feature
should not be included in the summary table list, since tree_predict
uses the features in summary table while reading source table. This
commit ensures the right features are populated in the summary table.

Closes #268

7 months agoDT: Don't use NULL value to get dep_var type
Rahul Iyer [Tue, 1 May 2018 21:24:34 +0000 (14:24 -0700)] 
DT: Don't use NULL value to get dep_var type

JIRA: MADLIB-1233

Function `_is_dep_categorical` is used to obtain the type of the
dependent variable expression. This function gets a random value using
`LIMIT 1` and checks the type of the corresponding value in Python.
However, it does not filter out NULL values, so the `LIMIT 1` query can
return a "None" type in Python, leading to incorrect results.

This commit updates the type extraction by checking the type in the
database instead of in Python and also filters out NULL values.
Additionally it checks if at least one non-NULL value is obtained, else
throws an appropriate error.
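
The approach can be sketched with Python's built-in sqlite3 module as a stand-in database. SQLite's `typeof()` plays the role of a database-side type check here, and `dep_var_type` is a hypothetical helper, not the actual `_is_dep_categorical`.

```python
import sqlite3


def dep_var_type(conn, table, dep_var):
    """Ask the database for the type of one non-NULL dependent-variable value."""
    query = ("SELECT typeof({col}) FROM {tbl} "
             "WHERE {col} IS NOT NULL LIMIT 1").format(col=dep_var, tbl=table)
    row = conn.execute(query).fetchone()
    if row is None:
        # Every value is NULL (or the table is empty): fail loudly.
        raise ValueError("{0} has no non-NULL values".format(dep_var))
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE train (label TEXT)")
conn.executemany("INSERT INTO train VALUES (?)", [(None,), ("yes",), ("no",)])
label_type = dep_var_type(conn, "train", "label")
```

The NULL row inserted first can no longer be the value that decides the type, and a column with only NULL values produces an explicit error.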

8 months agoDT: Fix sparse vector to float8[] casting bug
Nandish Jayaram [Mon, 14 May 2018 18:54:26 +0000 (11:54 -0700)] 
DT: Fix sparse vector to float8[] casting bug

JIRA: MADLIB-1234

The cast to float array (float8[]) should be evaluated before we access
the individual elements of the array, otherwise we encounter a strange
notation such as feature::madlib.svec::float8[][1]. A simple addition of
parentheses should fix the issue ((feature::madlib.svec::float8[])[1]).

Co-authored-by: Orhan Kislal <okislal@pivotal.io>
8 months agoBuild: Add support for GCC 5.0+
Rahul Iyer [Thu, 17 May 2018 22:38:50 +0000 (15:38 -0700)] 
Build: Add support for GCC 5.0+

JIRA: MADLIB-1025

The GCC 5.1 release of libstdc++ introduced a new library ABI that includes
new implementations of std::string and std::list. We disable this ABI to
ensure MADlib compiles with non-gcc compilers as well.

Further, using -O3 optimization (default in CMAKE_BUILD_TYPE=Release)
led to runtime errors. We revert to O2 optimization level (default in
CMAKE_BUILD_TYPE=RelWithDebInfo) to avoid this issue.
The root cause of this problem is not yet known.

Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Co-authored-by: Orhan Kislal <okislal@pivotal.io>
8 months agoStatistics: Add grouping support for correlation functions
Orhan Kislal [Wed, 16 May 2018 23:54:34 +0000 (16:54 -0700)] 
Statistics: Add grouping support for correlation functions

JIRA: MADLIB-1128

This commit adds grouping support to the correlation and covariance
functions in MADlib stats, including the queries needed to compute them
per group.
This commit also refactors a helper function in utilities.py_in.
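
What grouping means for a correlation can be sketched in plain Python. This is an in-memory illustration with hypothetical function names, not the SQL that the module generates.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def grouped_correlation(rows):
    """rows are (group, x, y) tuples; return one correlation per group."""
    groups = defaultdict(lambda: ([], []))
    for group, x, y in rows:
        groups[group][0].append(x)
        groups[group][1].append(y)
    return {group: pearson(xs, ys) for group, (xs, ys) in groups.items()}


rows = [("a", 1, 2), ("a", 2, 4), ("a", 3, 6),
        ("b", 1, 3), ("b", 2, 2), ("b", 3, 1)]
correlations = grouped_correlation(rows)
```

Over all six rows the two columns look only weakly related, while within group "a" they are perfectly correlated and within group "b" perfectly anti-correlated, which is the kind of structure per-group output exposes.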

Co-authored-by: Jingyi Mei <jmei@pivotal.io>
Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Co-authored-by: Nandish Jayaram <njayaram@apache.org>
Co-authored-by: Frank McQuillan <fmcquillan@pivotal.io>
8 months agoMultiple: Remove support for HAWQ from all modules
Rahul Iyer [Mon, 30 Apr 2018 04:18:35 +0000 (21:18 -0700)] 
Multiple: Remove support for HAWQ from all modules

With HAWQ support removed for the past few versions, we can eliminate
all the code that was specifically written for that port. This
includes madpack changes for upgrade and reinstall, workarounds in
multiple modules for table updates, and special consideration in
Iteration Controllers.

Closes #267

8 months agoChange version to 1.15-dev
Jingyi Mei [Thu, 3 May 2018 17:53:18 +0000 (10:53 -0700)] 
Change version to 1.15-dev

8 months agoMatrix: Catch error with different-sized input arrays rc/1.14-rc1 rel/v1.14
Rahul Iyer [Wed, 25 Apr 2018 19:51:21 +0000 (12:51 -0700)] 
Matrix: Catch error with different-sized input arrays

8 months agoRelease 1.14: Update version numbers and support upgrading to v1.14
Orhan Kislal [Mon, 23 Apr 2018 23:59:11 +0000 (16:59 -0700)] 
Release 1.14: Update version numbers and support upgrading to v1.14

Update the version number to 1.14 for the release candidate.
Update the changelists and other related files for upgrade.
Update the upgrade_util to ensure PG 10 support.
Simplify the _get_existing_uda function since it is not possible to
define an aggregate without any arguments.
Note that upgrade is not supported from versions prior to 1.11.

Co-authored-by: Nikhil Kak <nkak@pivotal.io>
Closes #266

8 months agoMLP: Set min messages to error for predict
Rahul Iyer [Mon, 23 Apr 2018 20:40:10 +0000 (13:40 -0700)] 
MLP: Set min messages to error for predict