kunal642 [Mon, 1 Jun 2020 10:42:58 +0000 (16:12 +0530)]
[maven-release-plugin] prepare for next development iteration
kunal642 [Mon, 1 Jun 2020 10:41:50 +0000 (16:11 +0530)]
[maven-release-plugin] prepare release apache-carbondata-2.0.1-rc1
kunal642 [Mon, 1 Jun 2020 07:37:24 +0000 (13:07 +0530)]
[CARBONDATA-3840] Mark features as experimental
Why is this PR needed?
Mark features as experimental because they are subject to change in the future.
What changes were proposed in this PR?
Mark features as experimental because they are subject to change in the future.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3783
akashrn5 [Mon, 1 Jun 2020 04:28:31 +0000 (09:58 +0530)]
[CARBONDATA-3839]Fix rename file failed for FilterFileSystem DFS object
Why is this PR needed?
Rename file fails in HDFS when the FS object is a FilterFileSystem
(which can wrap any filesystem to use as the underlying filesystem).
What changes were proposed in this PR?
While force-renaming, add a check for this FS object.
This closes #3781
Manhua [Sat, 30 May 2020 02:47:10 +0000 (10:47 +0800)]
[CARBONDATA-3836] Fix metadata folder FileNotFoundException while creating new carbon table
Why is this PR needed?
1. When carbon is used with only carbon.storelocation set, carbon will use the local spark warehouse path instead of the configured value.
2. FileNotFoundException is thrown when creating the schema file for a brand new table, because the current implementation gets the schema file path by listing the Metadata directory, which has not been created yet.
What changes were proposed in this PR?
1. spark.sql.warehouse.dir has its own default value in Spark; stop using carbonStorePath as its default value, which makes hiveStorePath.equals(carbonStorePath) TRUE when the user has not set spark.sql.warehouse.dir.
2. Create the Metadata directory before getting the schema file path.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3780
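The second fix above can be illustrated with a small local-filesystem sketch (hypothetical class and method names, not CarbonData's actual code): create the Metadata directory up front so that resolving the schema file path never touches a directory that does not exist yet.

```java
import java.io.File;

public class SchemaPathResolver {
    // Hypothetical sketch of the fix: ensure the Metadata directory exists
    // before resolving the schema file path, so a brand new table never
    // triggers a FileNotFoundException from listing a missing directory.
    static File resolveSchemaFile(File tablePath) {
        File metadataDir = new File(tablePath, "Metadata");
        if (!metadataDir.exists()) {
            metadataDir.mkdirs();  // brand new table: create before any listing
        }
        return new File(metadataDir, "schema");
    }

    public static void main(String[] args) {
        File tableDir = new File(System.getProperty("java.io.tmpdir"), "carbon_tbl_demo");
        File schema = resolveSchemaFile(tableDir);
        System.out.println(schema.getParentFile().exists()); // true
    }
}
```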
QiangCai [Sun, 31 May 2020 16:01:47 +0000 (00:01 +0800)]
[CARBONDATA-3837] Fall back to the original plan when mv rewrite throws an exception
Why is this PR needed?
All plans are checked by MVRewriteRule;
if MVRewriteRule throws an exception, it leads to query failure.
What changes were proposed in this PR?
Only queries should go through MVRewriteRule; other plans should skip it quickly.
Catch all exceptions from MVRewriteRule and fall back to the original plan.
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes
This closes #3777
ajantha-bhat [Sat, 30 May 2020 02:54:39 +0000 (08:24 +0530)]
[CARBONDATA-3835] Fix global sort issues
Why is this PR needed?
For global sort without partition, string values come as byte[]; if we use the string comparator (StringSerializableComparator), it converts the byte[] via toString(), which yields the object address, so the comparison goes wrong.
For global sort with partition, when the sort column is a partition column, sorting was done on the first column instead of the partition columns.
What changes were proposed in this PR?
Change the data type to byte before choosing a comparator.
Get the sort column based on its index instead of just taking the first column.
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes
This closes #3779
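The comparator bug above can be reproduced in isolation: calling toString() on a byte[] yields an identity string like "[B@1a2b3c", so ordering depends on object addresses, not content. A minimal sketch (hypothetical helper names, not CarbonData's actual comparator classes; the real fix selects a byte-aware comparator by correcting the data type):

```java
public class ByteKeyCompare {
    // Wrong: byte[].toString() is the identity string "[B@<hash>", so this
    // ordering depends on object addresses, not on the key's content.
    static int wrongCompare(byte[] a, byte[] b) {
        return a.toString().compareTo(b.toString());
    }

    // Right: compare the raw bytes lexicographically (unsigned), the way a
    // byte-aware comparator would.
    static int byteCompare(byte[] a, byte[] b) {
        int len = Math.min(a.length, b.length);
        for (int i = 0; i < len; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] x = "apple".getBytes();
        byte[] y = "banana".getBytes();
        System.out.println(byteCompare(x, y) < 0); // true: content-based order
    }
}
```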
kunal642 [Sat, 30 May 2020 06:01:36 +0000 (11:31 +0530)]
[HOTFIX] changed development version to 2.0.1
kunal642 [Sun, 17 May 2020 07:58:30 +0000 (13:28 +0530)]
[maven-release-plugin] prepare for next development iteration
kunal642 [Sun, 17 May 2020 07:57:38 +0000 (13:27 +0530)]
[maven-release-plugin] prepare release apache-carbondata-2.0.0-rc3
Venu Reddy [Fri, 8 May 2020 21:09:51 +0000 (02:39 +0530)]
[CARBONDATA-3791] Updated documentation for dynamic configuration params
Why is this PR needed?
Dynamic configuration params were not updated in the documentation.
What changes were proposed in this PR?
Updated the documentation for dynamic configuration params and added testcases.
This closes #3756
Venu Reddy [Wed, 6 May 2020 18:52:04 +0000 (00:22 +0530)]
[CARBONDATA-3548]Renamed index_handler to spatial_index property and indexColumn is changed to spatialColumn
Why is this PR needed?
The current code base has many types of indexes. To avoid confusion and be more specific, the index_handler property is changed to spatial_index, and isIndexColumn/setIndexColumn are changed to isSpatialColumn/setSpatialColumn respectively.
What changes were proposed in this PR?
Changed the index_handler property to spatial_index. Also changed isIndexColumn/setIndexColumn to isSpatialColumn/setSpatialColumn respectively.
Documentation is updated accordingly.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3750
Venu Reddy [Mon, 11 May 2020 07:59:14 +0000 (13:29 +0530)]
[CARBONDATA-3815]Insert into table select from another table throws exception for spatial tables
Why is this PR needed?
Insert into table select from another table throws an exception for spatial tables.
A NoSuchElementException is thrown for the 'mygeohash' column.
What changes were proposed in this PR?
Excluded spatial columns during getReArrangedIndexAndSelectedSchema,
and set carbonLoadModel.setIndexColumnsPresent if spatial columns
are present during rearranging.
If the target spatial table has sort_scope configured as global-sort,
made the insert flow go through loadDataFrame() instead of
insertDataUsingGlobalSortWithInternalRow() in CarbonDataRDDFactory.loadCarbonData().
This ensures that it goes through the existing load flow without the conversion step, and spatial
column values are regenerated in the flow.
This closes #3760
ajantha-bhat [Wed, 13 May 2020 16:45:45 +0000 (22:15 +0530)]
[CARBONDATA-3791] update pycarbon document
Why is this PR needed?
setup.cfg lies outside the folder, so the document needs to be updated accordingly.
What changes were proposed in this PR?
Updated the document to reflect that setup.cfg lies outside the folder.
Some other minor fixes.
This closes #3767
Indhumathi27 [Thu, 7 May 2020 14:06:01 +0000 (19:36 +0530)]
[CARBONDATA-3808] Added documentation for cdc and scd scenarios
Why is this PR needed?
Added documentation for cdc and scd scenarios
What changes were proposed in this PR?
Added documentation for cdc and scd scenarios
This closes #3754
liuzhi [Thu, 7 May 2020 03:59:09 +0000 (11:59 +0800)]
[CARBONDATA-3804] Provide end-to-end flink integration guide
Why is this PR needed?
Provide an end-to-end guide to help user understand and use flink integration module.
What changes were proposed in this PR?
Add file docs/flink-integration-guide.md
This closes #3752
kunal642 [Wed, 13 May 2020 04:49:02 +0000 (10:19 +0530)]
[CARBONDATA-3822] Fixed load time taken and added format back
Why is this PR needed?
1. Load time taken is shown as PT-1.0S, which is wrong (negative time).
2. Show segments is missing the format information.
What changes were proposed in this PR?
1. Fixed the time and removed the PT prefix to show the time as "1.0S" for better viewing.
2. Added the format back.
This closes #3765
Indhumathi27 [Mon, 11 May 2020 12:34:52 +0000 (18:04 +0530)]
[CARBONDATA-3818] Missed code to check "MV with same query already exists" during MV refactoring
Why is this PR needed?
The code to check "MV with same query already exists" was missed during the MV refactoring.
What changes were proposed in this PR?
Added the check for "MV with same query already exists" and added a testcase.
This closes #3761
akashrn5 [Mon, 11 May 2020 13:14:13 +0000 (18:44 +0530)]
[CARBONDATA-3817]Fix table creation with all columns as partition columns
Why is this PR needed?
When all the columns are given as partition columns during create table,
create table should fail, as a minimum of one column should be present as a
non-partition column. This is because after #3574 we improved the create
data source table flow and call Spark's CreateDataSourceTableCommand.
Since we create it as a Hive table, if creating it in a Hive-compatible way
fails, it falls back to saving its metadata in the Spark SQL
specific way. The partition validation fails when we try to store it in a
Hive-compatible way, so the retry passes, which is wrong behavior for a Hive-compatible table.
Also, in the Hive integration, the location does not have a file system URI prepended.
What changes were proposed in this PR?
For partition tables, if all the columns are present as partition columns,
validate with the same API Spark uses. Append the file system URI
to the location parameter while inferring the schema.
This closes #3762
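The validation above can be sketched in a few lines (hypothetical helper name, not the Spark or CarbonData API): a partition table must keep at least one non-partition column.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class PartitionValidator {
    // Sketch of the create-table check: at least one column must remain
    // outside the partition spec, mirroring the validation Spark performs.
    static boolean hasNonPartitionColumn(List<String> allColumns,
                                         List<String> partitionColumns) {
        Set<String> rest = new HashSet<>(allColumns);
        partitionColumns.forEach(rest::remove);
        return !rest.isEmpty();   // false => create table should be rejected
    }

    public static void main(String[] args) {
        // All columns partitioned: create table should fail
        System.out.println(hasNonPartitionColumn(
            Arrays.asList("id", "dt"), Arrays.asList("id", "dt"))); // false
    }
}
```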
Indhumathi27 [Wed, 6 May 2020 09:21:09 +0000 (14:51 +0530)]
[CARBONDATA-3800] Load data to SI and MV after insert stage command
Why is this PR needed?
1. Data is not loaded to child tables(SI and MV) after executing insert stage command.
2. Datamap keyword still exists in some files
What changes were proposed in this PR?
1. Add load pre and post listeners in CarbonInsertStageCommand to trigger data load
to secondary indexes and materialized views.
2. Rename datamap to index
This closes #3747
ajantha-bhat [Tue, 12 May 2020 09:25:58 +0000 (14:55 +0530)]
[CARBONDATA-3811] Fix flat folder query returns 0 rows
Why is this PR needed?
With the 'flat_folder'='true' table property,
BlockletIndexUtil#createBlockMetaInfo for HDFS looks up meta info
from the fileNameToMetaInfoMapping map. The issue does not happen for local
or S3A file systems (the method does not look up from the map for those);
it is observed with the HDFS file system.
What changes were proposed in this PR?
The path should not have // in the first place. This was a bug in the flat folder case.
This closes #3763
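The path problem above can be sketched with a simple normalizer (hypothetical helper, not the actual CarbonData fix): collapse duplicate '/' in the path portion while leaving the scheme separator intact, so map lookups keyed by path never miss due to a stray "//".

```java
public class PathNormalizer {
    // Collapse runs of '/' in the path part while preserving the
    // "scheme://" prefix. Hypothetical sketch of the idea behind the fix.
    static String normalize(String path) {
        int schemeEnd = path.indexOf("://");
        String prefix = schemeEnd >= 0 ? path.substring(0, schemeEnd + 3) : "";
        String rest = schemeEnd >= 0 ? path.substring(schemeEnd + 3) : path;
        return prefix + rest.replaceAll("/{2,}", "/");
    }

    public static void main(String[] args) {
        // A double slash like "store//table" would break exact-match lookups
        System.out.println(normalize("hdfs://nn:9000/store//table/part.carbondata"));
        // -> hdfs://nn:9000/store/table/part.carbondata
    }
}
```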
Indhumathi27 [Wed, 6 May 2020 10:45:38 +0000 (16:15 +0530)]
[CARBONDATA-3801][CARBONDATA-3805][CARBONDATA-3809] Query on partition table with SI having multiple partition
columns gives empty results
Why is this PR needed?
1. Query on a partition table with SI having multiple partition columns gives empty results, because while loading data to the SI table,
the partition directory path separator in the block id is replaced by '#' instead of '/' to support multi-level partitioning. Hence the block id will be
like part1=1#part2=2/xxxxxxxxx. During query, this block id is compared with the actual block id part1=1/part2=2/xxxxxxxxx,
which does not match, producing empty results.
2. Drop Index throws an exception while starting beeline: an error while editing the schema file.
3. The refresh syntax in the SI doc is wrong.
What changes were proposed in this PR?
1. During query, convert the block id as in the load flow in case of partition tables, to support multi-level partitioning.
2. Removed modifying the schema mdt file in the index flow.
3. Updated the refresh syntax.
This closes #3748
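The block id mismatch above reduces to a separator conversion. A minimal sketch (hypothetical helper name; it assumes '#' appears only as the encoded partition separator, as in the commit's example):

```java
public class BlockIdConverter {
    // SI load encodes the partition directory part of the block id with '#'
    // (part1=1#part2=2/<file>); convert it back before comparing with the
    // query-side block id (part1=1/part2=2/<file>).
    static String toQueryFormat(String siBlockId) {
        return siBlockId.replace('#', '/');
    }

    public static void main(String[] args) {
        String si = "part1=1#part2=2/xxxxxxxxx";
        System.out.println(toQueryFormat(si)); // part1=1/part2=2/xxxxxxxxx
    }
}
```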
ajantha-bhat [Wed, 6 May 2020 08:03:21 +0000 (13:33 +0530)]
[CARBONDATA-3799] Fix inverted index cannot work with adaptive encoding
Why is this PR needed?
After PR #3638, the inverted index cannot work with adaptive encoding.
Two issues are present:
a) For the byte adaptive type (not DirectByteBuffer), the encoded column page has
a wrong result, as position() is used instead of limit().
b) For the short adaptive type (DirectByteBuffer), result.array() fails, as
it is unsupported for direct byte buffers.
What changes were proposed in this PR?
For problem
a) use limit() instead of position();
b) write byte by byte instead of using .array().
This closes #3746
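Both buffer pitfalls above can be shown in isolation (toy helper, not the CarbonData encoder): after flipping, limit() bounds the valid data, and a direct buffer has no backing array, so array() throws and bytes must be copied one by one.

```java
import java.nio.ByteBuffer;

public class BufferDump {
    // Copy out the valid bytes of a buffer. Size comes from limit() after
    // flip (not position()), and bytes are read one by one so the code also
    // works for direct buffers, where array() is unsupported.
    static byte[] readAll(ByteBuffer buf) {
        ByteBuffer dup = buf.duplicate(); // leave the caller's buffer untouched
        dup.flip();                       // limit = bytes written, position = 0
        byte[] out = new byte[dup.limit()];
        for (int i = 0; i < out.length; i++) {
            out[i] = dup.get(i);          // absolute get: heap or direct
        }
        return out;
    }

    public static void main(String[] args) {
        ByteBuffer direct = ByteBuffer.allocateDirect(8);
        direct.put((byte) 1).put((byte) 2);
        System.out.println(readAll(direct).length); // 2
    }
}
```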
akashrn5 [Thu, 7 May 2020 16:54:58 +0000 (22:24 +0530)]
[CARBONDATA-3814]Remove dead code and refactor MV events
Why is this PR needed?
Some code is unused in the index events, and some MV events should be in one place.
What changes were proposed in this PR?
Removed the dead code and refactored the existing MV events class.
This closes #3755
QiangCai [Mon, 11 May 2020 07:00:35 +0000 (15:00 +0800)]
[HOTFIX] Set default header for load data in legacy environment
Why is this PR needed?
The old code had been commented out; this PR reverts that change.
What changes were proposed in this PR?
Set the default header for load data in a legacy environment.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3759
ajantha-bhat [Sat, 9 May 2020 07:38:11 +0000 (13:08 +0530)]
[CARBONDATA-3713] update SDK doc about sort columns
Why is this PR needed?
The sort columns API information is wrong in the SDK and CSDK docs.
What changes were proposed in this PR?
Updated with the correct information.
This closes #3758
Venu Reddy [Tue, 5 May 2020 16:30:49 +0000 (22:00 +0530)]
[CARBONDATA-3793] Data load with partition columns fails with InvalidLoadOptionException when
load option 'header' is set to 'true'
Why is this PR needed?
Data load with partition columns fails with InvalidLoadOptionException when the load option header
is set to true.
What changes were proposed in this PR?
In the SparkCarbonTableFormat.prepareWrite() method, after adding the fileheader option with the header
columns to optionsFinal, the header option needs to be set to false.
This closes #3745
ajantha-bhat [Wed, 29 Apr 2020 16:23:03 +0000 (21:53 +0530)]
[CARBONDATA-3786] Presto carbon reader should use tablePath from hive catalog
Why is this PR needed?
In 1.6 to 2.0 upgrade scenarios, when spark.sql.warehouse is not configured,
the Hive storage location is not proper, so the Presto carbon integration should use tablePath from Hive storage instead of location.
What changes were proposed in this PR?
Use tablePath instead of location from the Hive metastore table.
This closes #3731
QiangCai [Sat, 9 May 2020 03:52:24 +0000 (11:52 +0800)]
[CARBONDATA-3812] Set output metrics for data load spark job
Why is this PR needed?
Data load jobs are missing output metrics; see details in JIRA CARBONDATA-3812.
What changes were proposed in this PR?
Refactor OutputFilesInfoHolder into DataLoadMetrics.
Add metrics: numOutputBytes and numOutputRows.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3757
QiangCai [Thu, 7 May 2020 08:42:40 +0000 (16:42 +0800)]
[CARBONDATA-3810] Partition column name should be case insensitive
Why is this PR needed?
When inserting into a static partition, the partition column name is currently case sensitive.
What changes were proposed in this PR?
The partition column name should be case insensitive:
convert all partition column names to lower case.
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes
This closes #3753
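The normalization above can be sketched as follows (hypothetical helper, not the actual CarbonData code): lower-case the user-supplied static partition column names once, so later matching against table partition columns is case insensitive.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class PartitionSpecNormalizer {
    // Lower-case the column names of a static partition spec so that
    // "DeptNo=10" and "deptno=10" resolve to the same partition column.
    static Map<String, String> normalize(Map<String, String> spec) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : spec.entrySet()) {
            out.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> spec = new LinkedHashMap<>();
        spec.put("DeptNo", "10");
        System.out.println(normalize(spec)); // {deptno=10}
    }
}
```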
QiangCai [Thu, 7 May 2020 02:02:06 +0000 (10:02 +0800)]
[CARBONDATA-3803] Mark CarbonSession as deprecated since 2.0
Why is this PR needed?
Better to use CarbonExtensions instead of CarbonSession in version 2.0.
What changes were proposed in this PR?
mark CarbonSession as deprecated since version 2.0.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No, no need
This closes #3751
akashrn5 [Mon, 4 May 2020 15:04:40 +0000 (20:34 +0530)]
[CARBONDATA-3792]Refactor system folder location and removed unwanted property
Why is this PR needed?
Currently carbon.system.folder.location is used only in test code,
which is not required; the system folder location is based on the database,
and that code is duplicated.
What changes were proposed in this PR?
Refactored the code to get the system folder location based on the db
location without duplicate code, removed the carbon.system.folder.location
usage in all the test classes, and removed unnecessary code.
This closes #3743
Vikram Ahuja [Thu, 7 May 2020 05:15:18 +0000 (10:45 +0530)]
[CARBONDATA-3791] Index server doc changes
Why is this PR needed?
Changing index server docs as per the code.
What changes were proposed in this PR?
Changing index server docs as per the code.
This closes #3739
Vikram Ahuja [Wed, 6 May 2020 14:57:13 +0000 (20:27 +0530)]
[CARBONDATA-3791]: Query and Compaction changes in configuration parameters
Why is this PR needed?
Query and Compaction changes in configuration parameters as per the code.
What changes were proposed in this PR?
Query and Compaction changes in configuration parameters as per the code.
This closes #3742
kunal642 [Sun, 3 May 2020 16:13:37 +0000 (21:43 +0530)]
[CARBONDATA-3791] Fix documentation for various features
Why is this PR needed?
Fix documentation for various features
What changes were proposed in this PR?
1. Added write with hive doc
2. Added alter upgrade segment doc
3. Fix other random issues
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3738
Mahesh Raju Somalaraju [Sun, 3 May 2020 18:30:07 +0000 (00:00 +0530)]
[CARBONDATA-3791] Fix spelling, validating links for quick-start guide and configuration parameters
Why is this PR needed?
Fix spelling, validating links for quick-start guide and configuration parameters
What changes were proposed in this PR?
.md file changes
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3740
akashrn5 [Thu, 30 Apr 2020 09:05:06 +0000 (14:35 +0530)]
[CARBONDATA-3789] Fix cache issue in case of compaction failure in compaction post listeners
Why is this PR needed?
Consider a scenario where the post-event is called after compaction, and the listener selects the ongoing compacted segment data and triggers the SI table load after the main table compaction. During that time the cache is loaded; if the load fails, the cache is still present but the actual segment is not, which leads to consecutive failures.
What changes were proposed in this PR?
When the failure happens, the cache should be cleared in either the index server or the driver cache.
Does this PR introduce any user interface change?
No
Is any new test case added?
No
This closes #3733
Venu Reddy [Mon, 4 May 2020 16:38:29 +0000 (22:08 +0530)]
[CARBONDATA-3791] Updated configuration-parameters.md and removed unused configuration
Why is this PR needed?
Updated configuration-parameters.md and removed unused configuration
What changes were proposed in this PR?
Updated configuration-parameters.md and removed unused configuration
This closes #3744
Indhumathi27 [Sun, 3 May 2020 11:47:06 +0000 (17:17 +0530)]
[CARBONDATA-3791] Correct spelling, link and ddl in SI and MV Documentation
Why is this PR needed?
Correct spelling, link and ddl in SI and MV Documentation
What changes were proposed in this PR?
Fixed spelling, link and ddl in SI and MV Documentation
This closes #3735
akashrn5 [Sun, 3 May 2020 17:41:09 +0000 (23:11 +0530)]
[CARBONDATA-3791]Correct the link, grammars and content of dml-management document
Why is this PR needed?
Some links were missing, and there were grammar mistakes and indentation errors.
What changes were proposed in this PR?
Corrected the grammar, removed unnecessary content, and corrected the indentation problems.
This closes #3736
Pankaj Yadav [Mon, 4 May 2020 08:00:58 +0000 (13:30 +0530)]
[CARBONDATA-3791] Frequently Asked Question doc changes
Why is this PR needed?
The faq.md doc is not in line with the code,
and unwanted FAQs are present.
What changes were proposed in this PR?
Changed the faq.md doc as per the code
and removed the unwanted FAQs.
This closes #3741
Nihal kumar ojha [Mon, 4 May 2020 04:42:45 +0000 (10:12 +0530)]
[CARBONDATA-3791] Correct spelling, query, default value, in performance-tuning, prestodb and prestosql documentation.
Why is this PR needed?
Correct spelling, query, default value, in performance-tuning, prestodb and prestosql documentation.
What changes were proposed in this PR?
Corrected spelling, query, default value, in performance-tuning, prestodb and prestosql documentation.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3737
akashrn5 [Thu, 30 Apr 2020 09:11:10 +0000 (14:41 +0530)]
[CARBONDATA-3790] Fix SI table consistency with main table segments
Why is this PR needed?
Consider a scenario where SI loading happens after the main table load: while taking the segment lock we hit an issue and skipped the segment, assuming the missed segments would be loaded in the next load by SILoadEventListenerForFailedSegments, but the status of the SI is still enabled. SILoadEventListenerForFailedSegments will only load the skipped segments if the status is disabled, which leads to a segment mismatch between the main and SI tables, which may lead to query failure or data mismatch.
What changes were proposed in this PR?
If taking the segment lock fails during SI load, add the segment to a skip list; if that list is not empty, disable the SI so that SILoadEventListenerForFailedSegments takes care of loading the missing segments in the next load to the main table.
Does this PR introduce any user interface change?
No
Is any new test case added?
No
This closes #3734
xubo245 [Sun, 22 Mar 2020 16:01:14 +0000 (00:01 +0800)]
[CARBONDATA-3461] Carbon SDK support filter equal values set
Add support in Carbon SDK for filter equal values set:
1. prepareEqualToExpression(String columnName, String dataType, Object value)
2. prepareOrExpression(List expressions)
3. prepareEqualToExpressionSet(String columnName, DataType dataType, List values)
4. prepareEqualToExpressionSet(String columnName, String dataType, List values)
This closes #3317
xubo245 [Wed, 20 Mar 2019 14:54:20 +0000 (22:54 +0800)]
[CARBONDATA-3342] Fix the IllegalArgumentException when using filter and result is null.
This PR fixes the IllegalArgumentException when using a filter and the result is null.
This closes #3273
ajantha-bhat [Thu, 30 Apr 2020 03:56:10 +0000 (09:26 +0530)]
[CARBONDATA-3788] Fix insert failure during global sort with huge data in new insert flow
Why is this PR needed?
Spark is reusing the internalRow in the global sort partition flow with huge data,
as the RDD of InternalRow is persisted for global sort.
What changes were proposed in this PR?
Copy the internalRow and work on the copy before the last transform in the global sort partition flow.
This was already done for the insert stage command (which uses global sort partition).
This closes #3732
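The object-reuse hazard above can be modeled without Spark (a toy model, not Spark's InternalRow classes): when an iterator hands out the same mutable object for every element, persisting the reference without a copy makes every cached row reflect the last value.

```java
import java.util.ArrayList;
import java.util.List;

public class RowReuseDemo {
    // Mimics Spark reusing one mutable row per iterator element: caching the
    // reference directly clobbers earlier rows; copying first preserves them.
    static List<int[]> cacheRows(boolean copyBeforeCache) {
        int[] reused = new int[1];              // the single reused "row"
        List<int[]> cached = new ArrayList<>();
        for (int v = 1; v <= 3; v++) {
            reused[0] = v;                      // producer overwrites in place
            cached.add(copyBeforeCache ? reused.clone() : reused);
        }
        return cached;
    }

    public static void main(String[] args) {
        System.out.println(cacheRows(false).get(0)[0]); // 3: clobbered
        System.out.println(cacheRows(true).get(0)[0]);  // 1: preserved by copy
    }
}
```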
xubo245 [Thu, 20 Jun 2019 04:08:11 +0000 (12:08 +0800)]
[CARBONDATA-3446] Support read schema of complex data type from carbon file folder path
Background:
The SDK cannot read the schema of complex data types from a carbon file folder path.
Support:
Support reading the schema of complex data types from a carbon file folder path.
This closes #3301
kumarvishal09 [Tue, 14 Apr 2020 01:40:55 +0000 (09:40 +0800)]
[CARBONDATA-3787] Fixed MemoryLeak in data load and compaction
Why is this PR needed?
PR #3638 uses direct byte buffers. This PR fixes an offheap memory leak during load/compaction.
What changes were proposed in this PR?
DirectByteBuffers are cleaned in offheap memory by the JVM when GC happens (when the heap is full). Consider a scenario where the heap is not full but offheap is getting full: in this case the DirectByteBuffers are not cleaned, as no GC has happened. We may then get OOM from offheap memory, and YARN may kill the JVM process. So it is better to clean up the DirectByteBuffers after use by calling a reflection method.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3706
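One common reflection approach for eagerly freeing a DirectByteBuffer on JDK 9+ uses sun.misc.Unsafe.invokeCleaner; this is a sketch of the technique the commit refers to, not necessarily the exact reflection path CarbonData uses (JDK 8, for instance, needs the sun.misc.Cleaner route instead).

```java
import java.lang.reflect.Field;
import java.lang.reflect.Method;
import java.nio.ByteBuffer;

public class DirectBufferCleaner {
    // Eagerly release a DirectByteBuffer's off-heap memory instead of
    // waiting for a GC. Returns false for heap buffers or when the
    // reflection path is unavailable (e.g. on JDK 8).
    static boolean clean(ByteBuffer buffer) {
        if (!buffer.isDirect()) return false;     // heap buffers need no cleanup
        try {
            Field f = Class.forName("sun.misc.Unsafe").getDeclaredField("theUnsafe");
            f.setAccessible(true);
            Object unsafe = f.get(null);
            Method invokeCleaner =
                unsafe.getClass().getMethod("invokeCleaner", ByteBuffer.class);
            invokeCleaner.invoke(unsafe, buffer); // frees native memory now
            return true;
        } catch (Exception e) {
            return false;                          // fall back to GC-driven cleanup
        }
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024);
        System.out.println(clean(buf));
    }
}
```

After invokeCleaner the buffer's memory is gone, so the buffer must never be touched again; that is why eager cleanup only makes sense once usage is fully finished.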
kunal642 [Wed, 8 Apr 2020 10:02:39 +0000 (15:32 +0530)]
[CARBONDATA-3784] Added spark binary version to related modules
Why is this PR needed?
For deploying multiple carbon jar versions with different spark versions, the jar/module names should be different.
What changes were proposed in this PR?
Added the spark binary version to the related modules to distinguish the jars based on spark version.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3700
liuzhi [Mon, 27 Apr 2020 02:11:48 +0000 (10:11 +0800)]
[CARBONDATA-3778] Change conf.getProperty("carbon.storelocation") to CarbonEnv.getDatabaseLocation in mv module
Why is this PR needed?
Change the way the mv module gets the database location to the recommended way.
Currently, the recommended way is using CarbonEnv.getDatabaseLocation, not the carbon.storelocation variable.
What changes were proposed in this PR?
Changed all code that used the carbon.storelocation variable to use CarbonEnv.getDatabaseLocation.
This closes #3727
Indhumathi27 [Thu, 16 Apr 2020 07:57:14 +0000 (13:27 +0530)]
[CARBONDATA-3773] Skip Validate partition info in Indexserver count star flow
Why is this PR needed?
The BlockIndex.validatePartitionInfo check was added for the IUD scenario as part of the count(*) query
performance improvement in PR #3148. For the count(*) flow with indexserver, validatePartitionInfo
is called multiple times (number of blocks * number of partitions), which is not required.
What changes were proposed in this PR?
Skip the validatePartitionInfo check in case of a count(*) query with indexserver.
This closes #3714
Venu Reddy [Mon, 27 Apr 2020 13:36:45 +0000 (19:06 +0530)]
[HOTFIX]Compilation error in assembly module as it could not
resolve the mv-core dependency
Why is this PR needed?
Recent PR #3692 removed the carbondata-mv-core module, and part of the
code moved to the carbondata-spark module.
But the assembly module still has a dependency on mv-core in its pom.
What changes were proposed in this PR?
Removed the carbondata-mv-core dependency from the assembly module.
This closes #3728
Venu Reddy [Wed, 22 Apr 2020 07:09:42 +0000 (12:39 +0530)]
[CARBONDATA-3779]BlockletIndexInputFormat object instantiation failed due to mismatch in constructor params
Why is this PR needed?
BlockletIndexInputFormat object instantiation failed due to a mismatch between the params passed to the reflection-based constructor
instantiation and the actual parameters of the BlockletIndexInputFormat constructor.
What changes were proposed in this PR?
1. Modified to pass the correct parameters while instantiating the BlockletIndexInputFormat through reflection.
2. Segment min-max based pruning happens when CARBON_LOAD_INDEXES_PARALLEL is enabled.
This closes #3723
Venu Reddy [Thu, 23 Apr 2020 09:45:13 +0000 (15:15 +0530)]
[CARBONDATA-3783]Alter table drop column on main table is not dropping the eligible secondary index tables
Why is this PR needed?
Alter table drop column on the main table does not actually drop a secondary index table
when the dropped columns match that secondary index table's columns.
What changes were proposed in this PR?
1. Corrected dropApplicableSITables to check if the drop columns match the index table columns.
If all columns of an index table match the drop columns, then drop the index table itself.
2. Index table refresh was missed in CarbonCreateSecondaryIndexCommand; it was missed in the secondary index feature PR.
This closes #3725
akashrn5 [Tue, 21 Apr 2020 08:53:41 +0000 (14:23 +0530)]
[CARBONDATA-3777] Add HDFSLocalCarbonFile implementation to Use FileSystem's LocalFileSystem in cluster mode
Why is this PR needed?
Currently the local file implementation is Java's File implementation, which causes problems if we want to load a local file in a cluster, for instance.
What changes were proposed in this PR?
Implemented a new class HDFSLocalCarbonFile, which extends HDFSCarbonFile; when a file with the local file scheme "file://" is given and loaded in a cluster, it is treated as an HDFSLocalCarbonFile and proceeds instead of failing, which is the current behaviour.
Does this PR introduce any user interface change?
Yes. (Doc update is not needed)
Is any new testcase added?
No (existing HDFSCarbonFile tests will take care of it)
This closes #3721
akashrn5 [Thu, 23 Apr 2020 08:47:38 +0000 (14:17 +0530)]
[CARBONDATA-3785] Update the scala version for spark-2.4 in compliance with open source Spark
Why is this PR needed?
Update the scala version for spark-2.4 in compliance with open source Spark.
What changes were proposed in this PR?
Update the scala version to 2.11.12.
This closes #3724
akashrn5 [Fri, 17 Apr 2020 05:36:19 +0000 (11:06 +0530)]
[CARBONDATA-3680]Add NI as a function not as udf
Why is this PR needed?
NI is registered as a UDF, so it is handled for both the ScalaUDF and HiveSimpleUDF cases.
What changes were proposed in this PR?
Register NI as a function and remove the unwanted ScalaUDF case for the NI functionality.
This closes #3718
Indhumathi27 [Tue, 14 Apr 2020 11:07:45 +0000 (16:37 +0530)]
[CARBONDATA-3781] Refactor code to optimize partition pruning
Why is this PR needed?
If the number of partitions on a segment is large, the cost of checking if (indexPath equalsIgnoreCase partitionsToPrune)
in BlockletIndexFactory.getTableBlockUniqueIdentifierWrappers() is high.
What changes were proposed in this PR?
1. Store the location of the partition filter in a SET and check if it contains indexPath.
2. Remove unused variables.
This closes #3707
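The optimization above can be sketched as follows (hypothetical helper names, not the actual CarbonData code): instead of scanning every partition path with equalsIgnoreCase per index path, pre-build a lower-cased set once and probe it in O(1).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class PartitionPruneLookup {
    // Build the lookup set once per pruning pass; lower-casing both sides
    // preserves the case-insensitive semantics of equalsIgnoreCase.
    static Set<String> buildLookup(List<String> partitionPaths) {
        Set<String> lookup = new HashSet<>();
        for (String p : partitionPaths) {
            lookup.add(p.toLowerCase(Locale.ROOT));
        }
        return lookup;
    }

    // O(1) membership probe instead of an O(partitions) scan per index path.
    static boolean matches(Set<String> lookup, String indexPath) {
        return lookup.contains(indexPath.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> lookup = buildLookup(Arrays.asList("/t/dt=2020/Part0"));
        System.out.println(matches(lookup, "/t/DT=2020/part0")); // true
    }
}
```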
liuzhi [Sat, 18 Apr 2020 08:14:12 +0000 (16:14 +0800)]
[CARBONDATA-3775] Update materialized view document
Why is this PR needed?
Update the materialized view document to match, after the materialized view is separated from the DataMap module.
What changes were proposed in this PR?
Update materialized view syntax comment.
Add comment about usage of time series.
Move document to document root directory from index directory.
This closes #3720
liuzhi [Fri, 3 Apr 2020 01:09:24 +0000 (09:09 +0800)]
[CARBONDATA-3776] Clean old materialized view implementation
Why is this PR needed?
After the materialized view was separated from the DataMap module, the old materialized view implementation in the DataMap module is no longer used. That code should be cleaned up to keep the codebase clean.
What changes were proposed in this PR?
Moved the materialized view test cases to the carbondata-spark module.
Removed the carbondata-mv-core module, because the main code in this module has moved to the carbondata-spark module.
Removed the old materialized view metadata management implementation, which doesn't support multi-tenancy.
This closes #3692
akashrn5 [Fri, 17 Apr 2020 07:47:24 +0000 (13:17 +0530)]
[CARBONDATA-3754]avoid listing index files during SI rebuild
Why is this PR needed?
Listing files was done during the rebuild of SI, which is costly in the case of S3 and OBS.
What changes were proposed in this PR?
Avoided listing files; deleting files is handled through segmentFileStore.
This closes #3719
akkio-97 [Tue, 21 Apr 2020 08:03:46 +0000 (13:33 +0530)]
[CARBONDATA-3782] changes to reflect default behaviour in case of invalid configuration
Why is this PR needed?
Invalid configuration values in carbon.properties cause operation failures instead of falling back to the default configuration value behaviour.
What changes were proposed in this PR?
When invalid configurations are provided, use the default behavior instead of failing.
This closes #3722
Gampa Shreelekhya [Tue, 14 Apr 2020 13:10:03 +0000 (18:40 +0530)]
[CARBONDATA-3772] Update index documents
Why is this PR needed?
update index documentation to comply with recent changes
What changes were proposed in this PR?
Updated the index documents to comply with the recent changes.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3708
ajantha-bhat [Wed, 15 Apr 2020 06:11:47 +0000 (11:41 +0530)]
[HOTFIX] Avoid reading table status file again per query
Why is this PR needed?
After the add segment feature, we take the snapshot twice (which leads to two table status IO reads);
we can reuse the original snapshot.
a. First snapshot in LateDecodeStrategy.pruneFilterProjectRaw()
b. Second snapshot in CarbonTableInputFormat.getSplits()
What changes were proposed in this PR?
Take the snapshot only once and reuse it in the second place.
This closes #3709
kunal642 [Thu, 16 Apr 2020 05:59:36 +0000 (11:29 +0530)]
[HOTFIX] Fixed task distribution issue in SegmentPruneRDD
Why is this PR needed?
SI queries are degraded because getPreferredLocations is
not overridden in SegmentPruneRDD, due to which tasks are fired randomly to any executor.
What changes were proposed in this PR?
Override getPreferredLocations so that tasks are fired to the correct executors.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3712
akashrn5 [Thu, 16 Apr 2020 14:20:03 +0000 (19:50 +0530)]
[CARBONDATA-3774]Fix hiding the actual exceptions during tableExists check
Why is this PR needed?
In the table exists API, we catch all exceptions and say the table does not exist, which is wrong.
What changes were proposed in this PR?
Catch only NoSuchTableException from Spark and return false; for other exceptions, throw them to the caller.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No (existing tests cover this)
This closes #3716
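The fix is the classic "catch only what you can interpret" pattern: only "table not found" means the table does not exist; anything else must propagate. A sketch with stand-in types (Spark's real exception is org.apache.spark.sql.catalyst.analysis.NoSuchTableException; the Catalog interface here is hypothetical):

```java
// Illustrative sketch: tableExists should swallow only "no such table",
// not every exception.
public class TableExistsCheck {
    public static class NoSuchTableException extends RuntimeException {}

    public interface Catalog { void lookup(String table); }

    public static boolean tableExists(Catalog catalog, String table) {
        try {
            catalog.lookup(table);
            return true;
        } catch (NoSuchTableException e) {
            return false; // only this means "table does not exist"
        }                 // any other exception propagates to the caller
    }

    public static void main(String[] args) {
        Catalog c = t -> { if (!t.equals("known")) throw new NoSuchTableException(); };
        System.out.println(tableExists(c, "known"));   // true
        System.out.println(tableExists(c, "missing")); // false
    }
}
```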
Indhumathi27 [Mon, 30 Mar 2020 13:46:55 +0000 (19:16 +0530)]
[CARBONDATA-3765] Refactor Index Metadata for CG and FG Indexes
Why is this PR needed?
1. Refactor is required to make CG and FG Indexes support multi-tenant
2. Refactor DataMap to Index
What changes were proposed in this PR?
CG and FG Indexes support multi-tenant
-> Remove writing datamapschema for cg and fg indexes to system folder
-> Add index-related information to the tableProperties of the main table and to Hive, similar to SI, to support multi-tenancy.
-> All indexes - CG,FG and Secondary Index information will be stored in main table
Does this PR introduce any user interface change?
Yes.
Is any new testcase added?
No
This closes #3688
ravipesala [Fri, 3 Jan 2020 16:26:50 +0000 (00:26 +0800)]
[CARBONDATA-3597] Missed commit from PR-3483 (SCD2)
There is a commit missed during PR merge of #3483
Why is this PR needed?
SCD can use the insert flow to write data instead of calling the writer
directly.
What changes were proposed in this PR?
using insert flow for SCD to write data
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes.
This closes #3704
Indhumathi27 [Wed, 8 Apr 2020 09:50:24 +0000 (15:20 +0530)]
[CARBONDATA-3768] Fix query not hitting mv without alias, with mv having Alias
Why is this PR needed?
When an MV is created with a select query having an alias, a query without the alias does not hit the MV.
What changes were proposed in this PR?
While matching a query, check whether the output list with aliases contains all of the user query's output list.
This closes #3699
Zhichao Zhang [Fri, 10 Apr 2020 02:46:11 +0000 (10:46 +0800)]
[HOTFIX] Fix some issues in doc
Fix some issues in the Chinese documentation.
This closes #3703
kunal642 [Sat, 11 Jan 2020 21:00:15 +0000 (02:30 +0530)]
[CARBONDATA-3771] Fixed date filter issue and use hive constants instead of hard-coded values
Why is this PR needed?
Filter on date column is giving wrong results due to incorrect carbon expression.
What changes were proposed in this PR?
For a date column, getExprString gives the filter value as 'DATE2020-01-01', which causes the filter to fail.
Changed to ((ExprNodeConstantDesc) exprNodeDesc).getValue().toString() for ConstantDesc for better handling.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3705
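The symptom can be reproduced with a plain date parser: the expression string carries a type prefix ("DATE2020-01-01") that no date parser accepts, while the constant's raw value ("2020-01-01") parses cleanly. A minimal sketch using java.sql.Date (the helper below is illustrative, not the Carbon code):

```java
import java.sql.Date;

// The symptom described above: a type-prefixed expression string fails to
// parse as a date, while the constant's bare value parses fine.
public class DateLiteral {
    public static boolean isParseableDate(String s) {
        try {
            Date.valueOf(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isParseableDate("DATE2020-01-01")); // false: filter fails
        System.out.println(isParseableDate("2020-01-01"));     // true: filter works
    }
}
```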
Vikram Ahuja [Wed, 25 Mar 2020 11:26:54 +0000 (16:56 +0530)]
[CARBONDATA-3751] Segments are not Marked for delete if everything is deleted in a segment with index server enabled
Why is this PR needed?
When all the rows are deleted from a segment with index server enabled, the segment is not Marked for delete.
What changes were proposed in this PR?
The row count was always being sent as 0 from the index server's DistributedPruneRDD. That value is now obtained from the DataMapRow.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3679
Jacky Li [Wed, 18 Mar 2020 05:06:04 +0000 (13:06 +0800)]
[CARBONDATA-3736] Support show segment by query
Why is this PR needed?
There are many fields in a segment that are not shown by the SHOW SEGMENTS command.
What changes were proposed in this PR?
This PR changes the SHOW SEGMENTS command to:
SHOW [HISTORY] SEGMENTS
[FOR TABLE | ON] [db_name.]table_name
[AS select_query]
Users can query the segments as if they were a table. Sample output:
show segments on source as
select id, status, dataSize from source_segments where status = 'Success' order by dataSize
+------------------------+
|output |
+------------------------+
|id | status | dataSize|
|4.1 | Success | 1762 |
|0.2 | Success | 3524 |
+------------------------+
Does this PR introduce any user interface change?
Yes
Is any new testcase added?
Yes
This closes #3657
maheshrajus [Tue, 25 Feb 2020 16:00:53 +0000 (21:30 +0530)]
[CARBONDATA-3724] Secondary Index enable on partition Table
Why is this PR needed?
Currently, secondary index is not supported on partition tables.
Secondary indexes should be supported on non-partition columns instead of blocking partition tables entirely.
What changes were proposed in this PR?
Support secondary indexes on the non-partition columns of partition tables.
This closes #3639
ajantha-bhat [Mon, 30 Mar 2020 13:38:25 +0000 (19:08 +0530)]
[CARBONDATA-3761] Remove redundant conversion for complex type insert
Why is this PR needed?
In PrimitiveDataType#writeByteArray
DataTypeUtil.parseValue(**input.toString()**, carbonDimension)
Here we convert every complex child element to a string and then parse it as an object to handle bad records, which leads to heavy GC.
In DataTypeUtil#getBytesDataDataTypeForNoDictionaryColumn, the double, float, byte and decimal cases are missing, so we convert those values to strings and then to bytes, which creates more redundant objects.
What changes were proposed in this PR?
For the new insert-into flow, there is no need to handle bad records for complex types since the data is already validated in the source table, so use the object directly. This decreases the memory footprint for complex type inserts.
Add cases for the double, float, byte and decimal data types.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3687
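The redundant conversion removed above has this shape: value -> String -> parse -> bytes allocates temporaries per cell, while writing the typed value directly does not. A minimal sketch (method names are illustrative, not CarbonData's):

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Illustrative sketch of the redundant conversion: both paths yield the same
// bytes, but the string round trip creates extra garbage per value.
public class DirectConversion {
    // Redundant path: object -> String -> double -> bytes (two temporaries per cell).
    public static byte[] viaString(Object value) {
        double d = Double.parseDouble(value.toString());
        return ByteBuffer.allocate(8).putDouble(d).array();
    }

    // Direct path: the typed value is already validated, use it as-is.
    public static byte[] direct(double value) {
        return ByteBuffer.allocate(8).putDouble(value).array();
    }

    public static void main(String[] args) {
        // Same bytes either way; the direct path just skips the temporaries.
        System.out.println(Arrays.equals(viaString(42.5d), direct(42.5d)));
    }
}
```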
ajantha-bhat [Thu, 26 Mar 2020 09:37:57 +0000 (15:07 +0530)]
[CARBONDATA-3753] optimize double/float stats collector
Why is this PR needed?
For every double/float column value, we call
PrimitivePageStatsCollector.getDecimalCount(double value).
The problem is that this creates a new BigDecimal object and a plain String object every time,
which leads to huge memory usage during insert.
What changes were proposed in this PR?
Create only the BigDecimal object and take the scale from it.
This closes #3682
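The optimization works because BigDecimal.valueOf(double) is built from the canonical Double.toString representation, so its scale() already gives the number of digits after the decimal point; no separate plain string needs to be built by hand. A small sketch (the helper name is illustrative):

```java
import java.math.BigDecimal;

// Sketch of the optimization above: get a double's decimal count from
// BigDecimal.scale() instead of also building a plain string per value.
public class DecimalCount {
    public static int decimalCount(double value) {
        // BigDecimal.valueOf uses the canonical Double.toString form,
        // so scale() is the number of digits after the decimal point.
        return Math.max(0, BigDecimal.valueOf(value).scale());
    }

    public static void main(String[] args) {
        System.out.println(decimalCount(3.14)); // 2
        System.out.println(decimalCount(1.5));  // 1
    }
}
```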
ajantha-bhat [Fri, 3 Apr 2020 14:51:52 +0000 (20:21 +0530)]
[CARBONDATA-3763] Fix wrong insert result during insert stage command
Why is this PR needed?
For InsertStageCommand, Spark reuses the InternalRow because we transform twice: RDD[InternalRow] -> DataFrame -> logical plan -> RDD[InternalRow]. As a result, the same data is inserted into other rows.
What changes were proposed in this PR?
Copy the internalRow after the last transform.
This closes #3694
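The bug class here is general: an iterator that reuses one mutable buffer (as Spark reuses an InternalRow) must be copied before rows are retained, otherwise every retained row ends up holding the last value. A self-contained sketch with a plain array standing in for InternalRow:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the row-reuse bug above: keeping references to a reused buffer
// makes every kept "row" show the last value; copying fixes it.
public class RowReuse {
    // Simulates consuming 3 rows from a source that reuses one buffer.
    public static List<Object[]> collect(boolean copyBeforeKeeping) {
        Object[] reused = new Object[1]; // the single buffer the source reuses
        List<Object[]> kept = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            reused[0] = i;
            // Without the copy, all list entries point at the same array.
            kept.add(copyBeforeKeeping ? reused.clone() : reused);
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(collect(false).get(0)[0]); // 2 -- overwritten: same (last) value everywhere
        System.out.println(collect(true).get(0)[0]);  // 0 -- copied: correct per-row values
    }
}
```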
akashrn5 [Sat, 28 Mar 2020 11:51:30 +0000 (17:21 +0530)]
[CARBONDATA-3759] Refactor segmentRefreshInfo and fix cache issue in multiple application scenario
Why is this PR needed?
Currently segmentRefreshInfo helps clear the cache only in update cases; it fails to refresh the cache if any segment files change or are updated.
When two applications run on the same store, one application may change some segment files, remove the old cache and possibly delete files. This is not reflected in the other application, which may result in wrong results or query failures.
What changes were proposed in this PR?
Refactor segmentRefreshInfo to clear and update the cache when there are any updates on segments or when the segment files of the respective segments change.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
Tested in a cluster; the existing test cases are enough.
This closes #3686
haomarch [Sat, 4 Apr 2020 14:35:43 +0000 (22:35 +0800)]
[HOTFIX] Fix Repeated access to getSegmentProperties
Why is this PR needed?
getSegmentProperties is accessed repeatedly in query processing, which leads to heavy time overhead.
What changes were proposed in this PR?
Reuse the segmentProperty.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3696
Vikram Ahuja [Fri, 13 Mar 2020 09:50:29 +0000 (15:20 +0530)]
[CARBONDATA-3743] Added pre-priming check in the Spark job description page
Why is this PR needed?
Added pre-priming check in the Spark job description page
What changes were proposed in this PR?
Added pre-priming check in the Spark job description page
This closes #3669
akashrn5 [Tue, 24 Mar 2020 09:47:18 +0000 (15:17 +0530)]
[CARBONDATA-3754]Clean up the data file and index files after SI rebuild
Why is this PR needed?
Clean up of the data files and index files does not happen after SI rebuild.
What changes were proposed in this PR?
Every task should clear its old data and index files once the task finishes.
This closes #3676
IceMimosa [Tue, 7 Jan 2020 05:24:57 +0000 (13:24 +0800)]
[CARBONDATA-3565] Fix complex binary data broken issue when loading dataframe data
Why is this PR needed?
When binary data is written with DataOutputStream#writeDouble (and similar methods)
and a Spark DataFrame (SQL) loads it into a table, the data is broken (EF BF BD, the UTF-8 replacement character) when read back.
What changes were proposed in this PR?
If the data is already a byte[], there is no need to convert it to a string and decode it to byte[]
again.
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes
This closes #3430
Vikram Ahuja [Thu, 5 Mar 2020 10:24:20 +0000 (15:54 +0530)]
[CARBONDATA-3738] Delete segment by ID shows as failed with invalid ID upon
deleting an added parquet segment
Why is this PR needed?
Unable to delete segment in case of SI when table status file is not present.
What changes were proposed in this PR?
Checking for the table status file before triggering delete for that segment.
This closes #3659
Indhumathi27 [Wed, 1 Apr 2020 13:34:08 +0000 (19:04 +0530)]
[CARBONDATA-3762] Block creating Materialized view's with duplicate column
Why is this PR needed?
Currently, for materialized views with duplicate columns:
1. On creation, we take the distinct of the projection columns.
2. Because of this, during load, data is inserted wrongly and queries give wrong results.
Materialized views can be matched against queries having duplicate columns without having duplicate columns in the MV table.
What changes were proposed in this PR?
1. Block creating materialized views with duplicate columns
2. Fix bug in matching mv against queries having duplicate columns.
Does this PR introduce any user interface change?
Yes.
Is any new testcase added?
Yes
This closes #3690
kunal642 [Thu, 26 Mar 2020 09:25:34 +0000 (14:55 +0530)]
[CARBONDATA-3766] Fixed desc formatted and show segment data size issues
Why is this PR needed?
The SHOW SEGMENTS command shows the data size as the index size for non-carbon segments.
What changes were proposed in this PR?
Take the data size from the index size variable and show it in SHOW SEGMENTS, since non-carbon segments do not have
index files. The index size will be shown as 0.
This closes #3680
ajantha-bhat [Fri, 20 Mar 2020 05:44:04 +0000 (11:14 +0530)]
[CARBONDATA-3744] Fix select query failure issue when warehouse directory is default (not configured) in cluster
Why is this PR needed?
Select query fails when the warehouse directory is default (not configured), with the below callstack.
0: jdbc:hive2://localhost:10000> create table ab(age int) stored as carbondata;
+---------+
| Result  |
+---------+
+---------+
No rows selected (0.093 seconds)
0: jdbc:hive2://localhost:10000> select count from ab;
Error: org.apache.spark.sql.catalyst.analysis.NoSuchTableException: Table or view 'ab' not found in database 'tpch'; (state=,code=0)
caused by
java.io.FileNotFoundException: File hdfs://localhost:54311/home/root1/tools/spark-2.3.4-bin-hadoop2.7/spark-warehouse/tpch.db/ab/Metadata does not exist.
What changes were proposed in this PR?
When spark.sql.warehouse.dir is not configured, the default local filesystem path under SPARK_HOME is used, but describe table shows the path with an HDFS prefix in the cluster.
The reason is that we remove the local filesystem scheme, so when the table path is read we add the HDFS prefix in the cluster. If we keep the scheme instead, the issue will not occur.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No. Happens only in cluster with HDFS or OBS.
This closes #3675
ajantha-bhat [Fri, 3 Apr 2020 07:01:34 +0000 (12:31 +0530)]
[CARBONDATA-3764] Reduce Reusable buffer creation when few projections selected out of many columns
Why is this PR needed?
If only a few projections are selected out of many columns, the reusable buffer is sized by the column count instead of the projection count, so many unused objects are created.
What changes were proposed in this PR?
Create reusable buffers only as per the projection count.
Fix spelling error from Resusable to Reusable
Does this PR introduce any user interface change?
No
Is any new testcase added?
No. Internal change
This closes #3693
akashrn5 [Thu, 26 Mar 2020 18:30:23 +0000 (00:00 +0530)]
[CARBONDATA-3755]Fix clean up issue with respect to segmentMetadaInfo
after update and clean files
Why is this PR needed?
1. segmentMetadaInfo is not getting copied to new segment files written
after multiple update and clean files operations.
2. Old segment files are not getting deleted and keep accumulating.
What changes were proposed in this PR?
1. Update the segmentMetadaInfo in the new files.
2. Once we write a new segment file, delete the old invalid segment files.
This closes #3683
akashrn5 [Tue, 31 Mar 2020 12:02:09 +0000 (17:32 +0530)]
[CARBONDATA-3680] Add Secondary Index Document
Why is this PR needed?
No documentation present for Secondary index.
What changes were proposed in this PR?
Added documentation for secondary index.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3689
akashrn5 [Wed, 1 Apr 2020 14:21:30 +0000 (19:51 +0530)]
[CARBONDATA-3760] Enable prefetch in presto to improve query performance
Why is this PR needed?
Prefetch was set to false explicitly.
What changes were proposed in this PR?
Improve the query performance with prefetch enabled
Does this PR introduce any user interface change?
No
Is any new testcase added?
No. TPC-DS queries were executed and some query times improved by almost 50%.
This closes #3691
Jacky Li [Sat, 7 Mar 2020 01:33:06 +0000 (09:33 +0800)]
[CARBONDATA-3704] Support create materialized view on all table types, and make MV support multi-tenancy.
Why is this PR needed?
Support creating materialized views on all table types, including parquet, ORC, Hive and carbon tables.
Materialized view DDL is common in databases; carbondata should change its materialized-view-related SQL syntax to match other databases.
Materialized views should support multi-tenancy.
What changes were proposed in this PR?
Define materialized view related commands: CREATE MATERIALIZED VIEW, DROP MATERIALIZED VIEW, REFRESH MATERIALIZED VIEW and SHOW MATERIALIZED VIEW.
Move materialized view schema files to each database directory.
Support creating materialized views on all table types; remove carbon-table-specific checks.
Include some datamap renaming changes, renaming datamap to index.
This closes #3661
Co-authored-by: niuge01 <371684521@qq.com>
jack86596 [Wed, 11 Mar 2020 07:36:45 +0000 (15:36 +0800)]
[CARBONDATA-3740] Add line separator option to load command to configure the line separator during csv parsing.
Why is this PR needed?
Sometimes the univocity parser detects the line separator incorrectly. In this case, the user should be able to set the line separator explicitly.
Issue: during loading, if a field in the first line contains a '\r' character and this '\r' appears before the first '\n',
line separator detection will treat '\r' as the line separator. This is not the intention.
Example:
Data file has two lines, ^M is '\r':
1,2^M,3
4,5,6
After loading, the records in the table will be (the middle field of the second row contains an embedded line break):
| 1 | 2 | null |
| null | 3
4 | 5 |
The correct result should be:
| 1 | 2^M | 3 |
| 4 | 5 | 6 |
What changes were proposed in this PR?
Allow user to specify line separator explicitly in load command, add one new option to load command named "line_separator".
Does this PR introduce any user interface change?
Yes. New load option "line_separator" is added.
Is any new testcase added?
Yes.
This closes #3664
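The effect of fixing the separator can be shown with plain string splitting: when rows are split on an explicitly configured line separator, a stray '\r' inside a field stays as data instead of ending the record. This is a simplified stand-in; the real load command configures the univocity parser's line separator rather than splitting strings.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Sketch of the fix above: split records on an explicit line separator so a
// '\r' inside a field is kept as data, not treated as the end of a record.
public class LineSeparator {
    public static List<String[]> parse(String data, String lineSeparator) {
        return Arrays.stream(data.split(lineSeparator))
                .map(row -> row.split(","))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String data = "1,2\r,3\n4,5,6";          // '\r' is part of the field "2\r"
        List<String[]> rows = parse(data, "\n"); // separator set explicitly
        System.out.println(rows.get(0)[1].equals("2\r")); // true: the '\r' survives as data
        System.out.println(rows.get(1).length);           // 3 fields in the second record
    }
}
```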
litao [Thu, 19 Dec 2019 13:12:21 +0000 (21:12 +0800)]
[CARBONDATA-3548] Add spatial-index user guide to doc
Why is this PR needed?
Spatial index feature document is not updated yet.
What changes were proposed in this PR?
updated the document for spatial index feature
Does this PR introduce any user interface change?
Updated the document for the spatial index feature.
Is any new testcase added?
No
This closes #3520
Co-authored-by: VenuReddy2103 <venugopalreddyk@huawei.com>
QiangCai [Fri, 27 Mar 2020 08:35:38 +0000 (16:35 +0800)]
[CARBONDATA-3756] Fix stage query bug it only read the first blocklet of each carbondata file
Why is this PR needed?
The query of stage files reads only the first blocklet of each carbondata file,
so when a file contains multiple blocklets, the query result will be wrong.
What changes were proposed in this PR?
The query of stage files should read all blocklets of all carbondata files.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3684
ajantha-bhat [Fri, 27 Mar 2020 12:11:18 +0000 (17:41 +0530)]
[HOTFIX] Fix all flink test case failure and enable UT in CI
Why is this PR needed?
After #3628, the default BATCH_FILE_ORDER is wrong (it is neither ASC nor DESC),
so all test cases in the flink module failed because no order is set in the stage command.
Also, flink UTs were not running in CI, so this was not caught.
What changes were proposed in this PR?
Fix the default value to Ascending order.
Enable UT running for flink module.
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3685
QiangCai [Thu, 26 Mar 2020 09:18:36 +0000 (17:18 +0800)]
[CARBONDATA-3752] Reuse Exchange to fix performance issue
Why is this PR needed?
Spark's ReusedExchange rule can't recognize identical Exchange plans on a carbon table,
so queries on carbon tables don't reuse Exchanges, which leads to bad performance.
For Example:
create table t1(c1 int, c2 string) using carbondata
explain
select c2, sum(c1) from t1 group by c2
union all
select c2, sum(c1) from t1 group by c2
physical plan as following:
Union
:- *(2) HashAggregate(keys=[c2#37], functions=[sum(cast(c1#36 as bigint))])
: +- Exchange hashpartitioning(c2#37, 200)
: +- *(1) HashAggregate(keys=[c2#37], functions=[partial_sum(cast(c1#36 as bigint))])
: +- *(1) FileScan carbondata default.t1[c1#36,c2#37] ReadSchema: struct<c1:int,c2:string>
+- *(4) HashAggregate(keys=[c2#37], functions=[sum(cast(c1#36 as bigint))])
+- Exchange hashpartitioning(c2#37, 200)
+- *(3) HashAggregate(keys=[c2#37], functions=[partial_sum(cast(c1#36 as bigint))])
+- *(3) FileScan carbondata default.t1[c1#36,c2#37] ReadSchema: struct<c1:int,c2:string>
after change, physical plan as following:
Union
:- *(2) HashAggregate(keys=[c2#37], functions=[sum(cast(c1#36 as bigint))])
: +- Exchange hashpartitioning(c2#37, 200)
: +- *(1) HashAggregate(keys=[c2#37], functions=[partial_sum(cast(c1#36 as bigint))])
: +- *(1) FileScan carbondata default.t1[c1#36,c2#37] ReadSchema: struct<c1:int,c2:string>
+- *(4) HashAggregate(keys=[c2#37], functions=[sum(cast(c1#36 as bigint))])
+- ReusedExchange [c2#37, sum#54L], Exchange hashpartitioning(c2#37, 200)
What changes were proposed in this PR?
Change the CarbonFileIndex class to a case class.
Does this PR introduce any user interface change?
No
Is any new testcase added?
Yes
This closes #3681
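The fix works because ReusedExchange matches exchanges by structural equality of their plan subtrees: a plain class compares by object identity, so two identical scans never match, while a Scala case class compares by value. The same idea in Java (equals/hashCode on a stand-in FileIndex; the class and path are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of why the case-class change enables ReusedExchange: value equality
// lets two structurally identical scans be recognized as the same plan.
public class PlanEquality {
    public static final class FileIndex {
        final String tablePath;
        public FileIndex(String tablePath) { this.tablePath = tablePath; }
        // Value equality, as a Scala case class would generate.
        @Override public boolean equals(Object o) {
            return o instanceof FileIndex && ((FileIndex) o).tablePath.equals(tablePath);
        }
        @Override public int hashCode() { return tablePath.hashCode(); }
    }

    public static void main(String[] args) {
        Set<FileIndex> plans = new HashSet<>();
        plans.add(new FileIndex("/store/default/t1")); // first scan of t1
        plans.add(new FileIndex("/store/default/t1")); // identical second scan
        System.out.println(plans.size()); // 1: the second exchange can be reused
    }
}
```

With identity equality (no equals/hashCode override) the set would hold two entries and nothing could be reused.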
Indhumathi27 [Thu, 9 Jan 2020 10:11:44 +0000 (15:41 +0530)]
[CARBONDATA-3718] Support SegmentLevel MinMax for better Pruning and less driver memory usage for cache
Why is this PR needed?
In cloud scenarios, the index can be too big to store in the Spark driver, since the VM may not have
that much memory. Currently in Carbon, we load all indexes into the cache for the first query.
Since the Carbon LRU cache does not support time-based expiration, indexes are removed from the
cache on a least-recently-used basis when the cache is full.
In scenarios where a user's table has many segments and the user often queries only a
few of them, we need not load all indexes into the cache. For filter queries,
if we prune and load only the matched segments into the cache,
driver memory is saved.
What changes were proposed in this PR?
Added all blocks' min/max values with column-id and sort_column info to the segment metadata file,
and prune segments based on segment files, loading the index only for matched segments. Added a configurable carbon property 'carbon.load.all.index.to.cache'
to allow users to load all indexes into the cache if needed. By default, the value is
true, which loads all indexes into the cache.
Does this PR introduce any user interface change?
Yes.
Is any new testcase added?
Yes
This closes #3584
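The pruning step described above can be sketched as a range check against per-segment min/max before any index is loaded; only segments whose range can contain the filter value pay the index-loading cost. All names and values below are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch of segment-level min/max pruning: consult the small per-segment
// min/max metadata first, and load the (much larger) index only for
// segments that can contain the filter value.
public class SegmentMinMaxPrune {
    public static final class Segment {
        final String id; final int min; final int max;
        public Segment(String id, int min, int max) { this.id = id; this.min = min; this.max = max; }
    }

    // Keep only segments whose [min, max] range can contain the filter value.
    public static List<String> segmentsToLoad(List<Segment> segments, int filterValue) {
        List<String> matched = new ArrayList<>();
        for (Segment s : segments) {
            if (filterValue >= s.min && filterValue <= s.max) {
                matched.add(s.id);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment("0", 1, 100),
                new Segment("1", 101, 200),
                new Segment("2", 201, 300));
        // Only segment "1" can contain 150; the other indexes stay out of the driver cache.
        System.out.println(segmentsToLoad(segments, 150)); // [1]
    }
}
```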
Jacky Li [Tue, 17 Mar 2020 08:37:58 +0000 (16:37 +0800)]
[CARBONDATA-3742] Support spark 2.4.5 integration
Why is this PR needed?
Currently CarbonData does not support integration with spark 2.4.5
What changes were proposed in this PR?
Support integration with spark 2.4.5
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3671
Indhumathi27 [Wed, 18 Mar 2020 10:22:18 +0000 (15:52 +0530)]
[HOTFIX] Fix ClassName for load datamaps parallel job
Why is this PR needed?
Loading datamaps in parallel was not launching the job because the class name was not correct.
What changes were proposed in this PR?
Change className for load datamaps parallel job
Does this PR introduce any user interface change?
No
Is any new testcase added?
No
This closes #3674
Indhumathi27 [Mon, 2 Mar 2020 15:58:40 +0000 (21:28 +0530)]
[CARBONDATA-3733] Fix Incorrect query results on mv with limit
Why is this PR needed?
Issue 1:
After creating a materialized view, queries with a simple projection and limit give incorrect results.
Issue 2:
Compacting IUD_DELTA on a materialized view throws NullPointerException, because SegmentUpdateStatusManager is not set.
Issue 3:
Queries with order-by columns that are not in the projection used to create the MV give incorrect results.
What changes were proposed in this PR?
Issue 1:
Copy the subsume flag and flagSpec to the subsumer plan while rewriting with summary datasets.
Update the flagSpec as per the MV attributes and copy it to the relation.
Issue 2:
Set SegmentUpdateStatusManager on the CarbonLoadModel using the carbonTable in case of IUD_DELTA compaction.
Issue 3:
Order-by columns must be present in the projection list during MV creation.
This closes #3651