Add Table.cross_join and Table.zip to In-Memory Table (#4063)

Implements https://www.pivotaltracker.com/story/show/184239059
This commit is contained in:
Radosław Waśko 2023-01-23 14:19:52 +01:00 committed by GitHub
parent aa995110e9
commit d2e57edc8b
38 changed files with 949 additions and 95 deletions


@ -280,6 +280,8 @@
to the types.][4026]
- [Implemented `Table.distinct` for Database backends.][4027]
- [Implemented `Table.union` for the in-memory backend.][4052]
- [Implemented `Table.cross_join` and `Table.zip` for the in-memory
backend.][4063]
[debug-shortcuts]:
https://github.com/enso-org/enso/blob/develop/app/gui/docs/product/shortcuts.md#debug
@ -438,6 +440,7 @@
[4027]: https://github.com/enso-org/enso/pull/4027
[4044]: https://github.com/enso-org/enso/pull/4044
[4052]: https://github.com/enso-org/enso/pull/4052
[4063]: https://github.com/enso-org/enso/pull/4063
#### Enso Compiler


@ -788,6 +788,94 @@ type Table
problem_builder.attach_problems_before on_problems <|
self.connection.dialect.prepare_join self.connection sql_join_kind new_table_name left_setup.subquery right_setup.subquery on_expressions where_expressions columns_to_select=result_columns
## ALIAS Cartesian Join
Joins tables by pairing every row of the left table with every row of the
right table.
Arguments:
- right: The table to join with.
- right_row_limit: If the number of rows in the right table exceeds this,
then a `Cross_Join_Row_Limit_Exceeded` problem is raised. The check
exists to avoid exploding the size of the table by accident. This check
can be disabled by setting this parameter to `Nothing`.
- right_prefix: The prefix added to right table column names in case of
name conflict. See "Column Renaming" below for more information.
- on_problems: Specifies how to handle problems if they occur, reporting
them as warnings by default.
- If the `right` table has more rows than the `right_row_limit` allows,
a `Cross_Join_Row_Limit_Exceeded` is reported. In warning/ignore
mode, the join is still executed.
? Column Renaming
If columns from the two tables have colliding names, a prefix (by
default `Right_`) is added to the name of the column from the right
table. The left column remains unchanged. If the new name is already
in use, it is resolved using the normal renaming strategy of appending
`_1`, `_2`, etc.
? Result Ordering
Rows in the result are first ordered by the order of the corresponding
rows from the left table and then the order of rows from the right
table. This applies only if the order of the rows was specified (for
example, by sorting the table; in-memory tables will keep the memory
layout order while for database tables the order may be unspecified).
cross_join : Table -> Integer | Nothing -> Text -> Problem_Behavior -> Table
cross_join self right right_row_limit=100 right_prefix="Right_" on_problems=Report_Warning =
_ = [right, right_row_limit, right_prefix, on_problems]
Error.throw (Unsupported_Database_Operation.Error "Table.cross_join is not implemented yet for the Database backends.")
## ALIAS Join By Row Position
Joins two tables by zipping their rows together: the first row of the
left table is paired with the first row of the right table, the second
with the second, and so on.
Arguments:
- right: The table to join with.
- keep_unmatched: If set to `True`, the result will include as many rows
as the larger of the two tables - the last rows of the larger table
will have nulls for columns of the smaller one. If set to `False`, the
result will have as many rows as the smaller of the two tables - the
additional rows of the larger table will be discarded. The default
value is `Report_Unmatched`, which means the user expects the two
tables to have the same number of rows; if they do not, the
behaviour is the same as if it was set to `True` - i.e. the unmatched
rows are kept with `Nothing` values for the other table, but a
`Row_Count_Mismatch` problem is also reported.
- right_prefix: The prefix added to right table column names in case of
name conflict. See "Column Renaming" below for more information.
- on_problems: Specifies how to handle problems if they occur, reporting
them as warnings by default.
- If the tables have different numbers of rows and `keep_unmatched` is
set to `Report_Unmatched`, the join will report `Row_Count_Mismatch`.
? Column Renaming
If columns from the two tables have colliding names, a prefix (by
default `Right_`) is added to the name of the column from the right
table. The left column remains unchanged. If the new name is already
in use, it is resolved using the normal renaming strategy of appending
`_1`, `_2`, etc.
? Row Ordering
This operation requires a well-defined order of rows in the input
tables. In-memory tables rely on the ordering stemming directly from
their layout in memory. Database tables may not impose a deterministic
ordering. If the table defines a primary key, it is used by default
to ensure deterministic ordering. That can be overridden by specifying
a different ordering using `Table.order_by`. If neither a primary key
was defined nor an ordering specified explicitly by the user, the
order of rows is undefined and the operation will fail, reporting an
`Undefined_Column_Order` problem and returning an empty table.
zip : Table -> Boolean | Report_Unmatched -> Text -> Problem_Behavior -> Table
zip self right keep_unmatched=Report_Unmatched right_prefix="Right_" on_problems=Report_Warning =
_ = [right, keep_unmatched, right_prefix, on_problems]
Error.throw (Unsupported_Database_Operation.Error "Table.zip is not implemented yet for the Database backends.")
## ALIAS append, concat
Appends records from other table(s) to this table.


@ -42,7 +42,7 @@ import project.Delimited.Delimited_Format.Delimited_Format
from project.Data.Column_Type_Selection import Column_Type_Selection, Auto
from project.Internal.Rows_View import Rows_View
from project.Errors import Column_Count_Mismatch, Missing_Input_Columns, Column_Indexes_Out_Of_Range, Duplicate_Type_Selector, No_Index_Set_Error, No_Such_Column, No_Input_Columns_Selected, No_Output_Columns, Invalid_Value_Type
from project.Errors import Column_Count_Mismatch, Missing_Input_Columns, Column_Indexes_Out_Of_Range, Duplicate_Type_Selector, No_Index_Set_Error, No_Such_Column, No_Input_Columns_Selected, No_Output_Columns, Invalid_Value_Type, Cross_Join_Row_Limit_Exceeded, Row_Count_Mismatch
from project.Data.Column import get_item_string
from project.Internal.Filter_Condition_Helpers import make_filter_column
@ -1113,6 +1113,113 @@ type Table
problems = new_java_table.getProblems
Java_Problems.parse_aggregated_problems problems
## ALIAS Cartesian Join
Joins tables by pairing every row of the left table with every row of the
right table.
Arguments:
- right: The table to join with.
- right_row_limit: If the number of rows in the right table exceeds this,
then a `Cross_Join_Row_Limit_Exceeded` problem is raised. The check
exists to avoid exploding the size of the table by accident. This check
can be disabled by setting this parameter to `Nothing`.
- right_prefix: The prefix added to right table column names in case of
name conflict. See "Column Renaming" below for more information.
- on_problems: Specifies how to handle problems if they occur, reporting
them as warnings by default.
- If the `right` table has more rows than the `right_row_limit` allows,
a `Cross_Join_Row_Limit_Exceeded` is reported. In warning/ignore
mode, the join is still executed.
? Column Renaming
If columns from the two tables have colliding names, a prefix (by
default `Right_`) is added to the name of the column from the right
table. The left column remains unchanged. If the new name is already
in use, it is resolved using the normal renaming strategy of appending
`_1`, `_2`, etc.
? Result Ordering
Rows in the result are first ordered by the order of the corresponding
rows from the left table and then the order of rows from the right
table. This applies only if the order of the rows was specified (for
example, by sorting the table; in-memory tables will keep the memory
layout order while for database tables the order may be unspecified).
cross_join : Table -> Integer | Nothing -> Text -> Problem_Behavior -> Table
cross_join self right right_row_limit=100 right_prefix="Right_" on_problems=Report_Warning =
if check_table "right" right then
limit_problems = case right_row_limit.is_nothing.not && (right.row_count > right_row_limit) of
True ->
[Cross_Join_Row_Limit_Exceeded.Error right_row_limit right.row_count]
False -> []
on_problems.attach_problems_before limit_problems <|
new_java_table = self.java_table.crossJoin right.java_table right_prefix
renaming_problems = new_java_table.getProblems |> Java_Problems.parse_aggregated_problems
on_problems.attach_problems_before renaming_problems (Table.Value new_java_table)
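The pairing performed by `cross_join` can be sketched as a small standalone Java routine (a hypothetical illustration of the semantics, not the library's actual `crossJoin` implementation): every left row index is combined with every right row index, in left-major order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Cartesian pairing behind cross_join: each left
// row index is paired with every right row index, in left-major order, so
// the result has leftRowCount * rightRowCount rows.
public class CrossJoinSketch {
  public static List<int[]> pairs(int leftRowCount, int rightRowCount) {
    List<int[]> result = new ArrayList<>(leftRowCount * rightRowCount);
    for (int l = 0; l < leftRowCount; l++) {
      for (int r = 0; r < rightRowCount; r++) {
        result.add(new int[] {l, r});
      }
    }
    return result;
  }
}
```

For a 2-row left table and a 3-row right table this yields six pairs, starting at `{0, 0}` and ending at `{1, 2}`.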
## ALIAS Join By Row Position
Joins two tables by zipping their rows together: the first row of the
left table is paired with the first row of the right table, the second
with the second, and so on.
Arguments:
- right: The table to join with.
- keep_unmatched: If set to `True`, the result will include as many rows
as the larger of the two tables - the last rows of the larger table
will have nulls for columns of the smaller one. If set to `False`, the
result will have as many rows as the smaller of the two tables - the
additional rows of the larger table will be discarded. The default
value is `Report_Unmatched`, which means the user expects the two
tables to have the same number of rows; if they do not, the
behaviour is the same as if it was set to `True` - i.e. the unmatched
rows are kept with `Nothing` values for the other table, but a
`Row_Count_Mismatch` problem is also reported.
- right_prefix: The prefix added to right table column names in case of
name conflict. See "Column Renaming" below for more information.
- on_problems: Specifies how to handle problems if they occur, reporting
them as warnings by default.
- If the tables have different numbers of rows and `keep_unmatched` is
set to `Report_Unmatched`, the join will report `Row_Count_Mismatch`.
? Column Renaming
If columns from the two tables have colliding names, a prefix (by
default `Right_`) is added to the name of the column from the right
table. The left column remains unchanged. If the new name is already
in use, it is resolved using the normal renaming strategy of appending
`_1`, `_2`, etc.
? Row Ordering
This operation requires a well-defined order of rows in the input
tables. In-memory tables rely on the ordering stemming directly from
their layout in memory. Database tables may not impose a deterministic
ordering. If the table defines a primary key, it is used by default
to ensure deterministic ordering. That can be overridden by specifying
a different ordering using `Table.order_by`. If neither a primary key
was defined nor an ordering specified explicitly by the user, the
order of rows is undefined and the operation will fail, reporting an
`Undefined_Column_Order` problem and returning an empty table.
zip : Table -> Boolean | Report_Unmatched -> Text -> Problem_Behavior -> Table
zip self right keep_unmatched=Report_Unmatched right_prefix="Right_" on_problems=Report_Warning =
if check_table "right" right then
keep_unmatched_bool = case keep_unmatched of
Report_Unmatched -> True
b : Boolean -> b
report_mismatch = keep_unmatched == Report_Unmatched
left_row_count = self.row_count
right_row_count = right.row_count
problems = if (left_row_count == right_row_count) || report_mismatch.not then [] else
[Row_Count_Mismatch.Error left_row_count right_row_count]
on_problems.attach_problems_before problems <|
new_java_table = self.java_table.zip right.java_table keep_unmatched_bool right_prefix
renaming_problems = new_java_table.getProblems |> Java_Problems.parse_aggregated_problems
on_problems.attach_problems_before renaming_problems (Table.Value new_java_table)
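The row-count rule that `zip` applies can be sketched in isolation (a hypothetical helper, not part of the library's API): keeping unmatched rows pads to the longer table, while discarding them truncates to the shorter one.

```java
// Hypothetical sketch of zip's row-count rule: with keep_unmatched the
// result is padded to the longer table (extra rows get Nothing values),
// otherwise it is truncated to the shorter one.
public class ZipRowCountSketch {
  public static int resultRowCount(int leftRows, int rightRows, boolean keepUnmatched) {
    return keepUnmatched ? Math.max(leftRows, rightRows) : Math.min(leftRows, rightRows);
  }
}
```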
## ALIAS append, concat
Appends records from other table(s) to this table.


@ -358,3 +358,20 @@ type Unmatched_Columns
to_display_text : Text
to_display_text self =
"The following columns were not present in some of the provided tables: " + (self.column_names.map (n -> "["+n+"]") . join ", ") + ". The missing values have been filled with `Nothing`."
type Cross_Join_Row_Limit_Exceeded
## Indicates that a `cross_join` has been attempted where the right table
has more rows than allowed by the limit.
Error (limit : Integer) (existing_rows : Integer)
to_display_text : Text
to_display_text self =
"The cross join operation exceeded the maximum number of rows allowed. The limit is "+self.limit.to_text+" and the number of rows in the right table was "+self.existing_rows.to_text+". The limit may be turned off by setting the `right_row_limit` option to `Nothing`."
type Row_Count_Mismatch
## Indicates that the row counts of zipped tables do not match.
Error (left_rows : Integer) (right_rows : Integer)
to_display_text : Text
to_display_text self =
"The number of rows in the left table ("+self.left_rows.to_text+") does not match the number of rows in the right table ("+self.right_rows.to_text+")."
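The `right_row_limit` check that guards `cross_join` can be sketched as a tiny predicate (hypothetical names; a `null` limit plays the role of Enso's `Nothing` and disables the check):

```java
// Hypothetical sketch of the cross_join safety check: the limit only
// applies when non-null; passing null (Enso's `Nothing`) disables it.
public class RowLimitCheckSketch {
  public static boolean exceedsLimit(Integer rightRowLimit, int rightRowCount) {
    return rightRowLimit != null && rightRowCount > rightRowLimit;
  }
}
```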


@ -2,6 +2,10 @@ from Standard.Base import all
import project.Extensions
## Returns the values of warnings attached to the value.
get_attached_warnings v =
Warning.get_all v . map .value
## UNSTABLE
Tests how a specific operation behaves depending on the requested
`Problem_Behavior`.
@ -58,12 +62,10 @@ test_advanced_problem_handling action error_checker warnings_checker result_chec
# Lastly, we check the report warnings mode and ensure that both the result is correct and the warnings are as expected.
result_warning = action Problem_Behavior.Report_Warning
result_checker result_warning
warnings = Warning.get_all result_warning . map .value
warnings_checker warnings
warnings_checker (get_attached_warnings result_warning)
## UNSTABLE
Checks if the provided value does not have any attached problems.
assume_no_problems result =
result.is_error.should_be_false
warnings = Warning.get_all result . map .value
warnings.should_equal []
(get_attached_warnings result).should_equal []


@ -1,4 +1,4 @@
package org.enso.base.text;
package org.enso.base.arrays;
/** A helper to efficiently build an array of unboxed integers of arbitrary length. */
public class IntArrayBuilder {
@ -62,4 +62,10 @@ public class IntArrayBuilder {
this.storage = null;
return tmp;
}
public int[] build() {
int[] result = new int[length];
System.arraycopy(storage, 0, result, 0, length);
return result;
}
}
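The `build()` method added above returns an exact-length copy instead of handing over the oversized backing array. The same growable-buffer pattern can be sketched self-contained (hypothetical class and method names, not the real `IntArrayBuilder` API):

```java
import java.util.Arrays;

// Minimal sketch of the growable int-buffer pattern: appends amortize to
// O(1) by doubling the backing array; build() copies out exactly `length`
// elements, leaving the builder reusable.
public class GrowableIntBuffer {
  private int[] storage = new int[4];
  private int length = 0;

  public void append(int value) {
    if (length == storage.length) {
      storage = Arrays.copyOf(storage, storage.length * 2);
    }
    storage[length++] = value;
  }

  public int[] build() {
    return Arrays.copyOf(storage, length);
  }
}
```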


@ -3,6 +3,8 @@ package org.enso.base.text;
import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.text.CaseMap;
import com.ibm.icu.text.CaseMap.Fold;
import org.enso.base.arrays.IntArrayBuilder;
import java.util.Locale;
/**


@ -1,5 +1,7 @@
package org.enso.table.data.column.builder.object;
import java.util.Arrays;
import java.util.BitSet;
import org.enso.base.polyglot.NumericConverter;
import org.enso.table.data.column.storage.BoolStorage;
import org.enso.table.data.column.storage.DoubleStorage;
@ -7,9 +9,6 @@ import org.enso.table.data.column.storage.LongStorage;
import org.enso.table.data.column.storage.Storage;
import org.enso.table.util.BitSets;
import java.util.Arrays;
import java.util.BitSet;
/** A builder for numeric columns. */
public class NumericBuilder extends TypedBuilder {
private final BitSet isMissing = new BitSet();
@ -103,11 +102,11 @@ public class NumericBuilder extends TypedBuilder {
@Override
public void appendBulkStorage(Storage<?> storage) {
if (isDouble) {
appendBulkDouble(storage);
} else {
appendBulkLong(storage);
}
if (isDouble) {
appendBulkDouble(storage);
} else {
appendBulkLong(storage);
}
}
private void ensureFreeSpaceFor(int additionalSize) {
@ -125,7 +124,10 @@ public class NumericBuilder extends TypedBuilder {
BitSets.copy(doubleStorage.getIsMissing(), isMissing, currentSize, n);
currentSize += n;
} else {
throw new IllegalStateException("Unexpected storage implementation for type DOUBLE: " + storage + ". This is a bug in the Table library.");
throw new IllegalStateException(
"Unexpected storage implementation for type DOUBLE: "
+ storage
+ ". This is a bug in the Table library.");
}
} else if (storage.getType() == Storage.Type.LONG) {
if (storage instanceof LongStorage longStorage) {
@ -135,7 +137,10 @@ public class NumericBuilder extends TypedBuilder {
data[currentSize++] = Double.doubleToRawLongBits(longStorage.getItem(i));
}
} else {
throw new IllegalStateException("Unexpected storage implementation for type LONG: " + storage + ". This is a bug in the Table library.");
throw new IllegalStateException(
"Unexpected storage implementation for type LONG: "
+ storage
+ ". This is a bug in the Table library.");
}
} else if (storage.getType() == Storage.Type.BOOL) {
if (storage instanceof BoolStorage boolStorage) {
@ -149,7 +154,10 @@ public class NumericBuilder extends TypedBuilder {
}
}
} else {
throw new IllegalStateException("Unexpected storage implementation for type BOOLEAN: " + storage + ". This is a bug in the Table library.");
throw new IllegalStateException(
"Unexpected storage implementation for type BOOLEAN: "
+ storage
+ ". This is a bug in the Table library.");
}
} else {
throw new StorageTypeMismatch(getType(), storage.getType());
@ -165,7 +173,10 @@ public class NumericBuilder extends TypedBuilder {
BitSets.copy(longStorage.getIsMissing(), isMissing, currentSize, n);
currentSize += n;
} else {
throw new IllegalStateException("Unexpected storage implementation for type DOUBLE: " + storage + ". This is a bug in the Table library.");
throw new IllegalStateException(
"Unexpected storage implementation for type DOUBLE: "
+ storage
+ ". This is a bug in the Table library.");
}
} else if (storage.getType() == Storage.Type.BOOL) {
if (storage instanceof BoolStorage boolStorage) {
@ -178,7 +189,10 @@ public class NumericBuilder extends TypedBuilder {
}
}
} else {
throw new IllegalStateException("Unexpected storage implementation for type BOOLEAN: " + storage + ". This is a bug in the Table library.");
throw new IllegalStateException(
"Unexpected storage implementation for type BOOLEAN: "
+ storage
+ ". This is a bug in the Table library.");
}
} else {
throw new StorageTypeMismatch(getType(), storage.getType());

View File

@ -5,6 +5,8 @@ import java.util.List;
import java.util.function.IntFunction;
import org.enso.base.polyglot.Polyglot_Utils;
import org.enso.table.data.column.builder.object.BoolBuilder;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.InferredBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.MapOperation;
@ -364,6 +366,11 @@ public final class BoolStorage extends Storage<Boolean> {
negated);
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new BoolBuilder(capacity);
}
@Override
public BoolStorage slice(List<SliceRange> ranges) {
int newSize = SliceRange.totalLength(ranges);


@ -1,6 +1,9 @@
package org.enso.table.data.column.storage;
import java.time.LocalDate;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.DateBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.SpecializedIsInOp;
import org.enso.table.data.column.operation.map.datetime.DateTimeIsInOp;
@ -36,4 +39,9 @@ public final class DateStorage extends SpecializedStorage<LocalDate> {
public int getType() {
return Type.DATE;
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new DateBuilder(capacity);
}
}


@ -1,5 +1,7 @@
package org.enso.table.data.column.storage;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.DateTimeBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.SpecializedIsInOp;
import org.enso.table.data.column.operation.map.datetime.DateTimeIsInOp;
@ -39,4 +41,9 @@ public final class DateTimeStorage extends SpecializedStorage<ZonedDateTime> {
public int getType() {
return Type.DATE_TIME;
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new DateTimeBuilder(capacity);
}
}


@ -2,6 +2,8 @@ package org.enso.table.data.column.storage;
import java.util.BitSet;
import java.util.List;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.NumericBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.UnaryMapOperation;
@ -296,6 +298,11 @@ public final class DoubleStorage extends NumericStorage<Double> {
return new DoubleStorage(newData, newSize, newMask);
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return NumericBuilder.createDoubleBuilder(capacity);
}
@Override
public Storage<Double> slice(List<SliceRange> ranges) {
int newSize = SliceRange.totalLength(ranges);


@ -2,6 +2,8 @@ package org.enso.table.data.column.storage;
import java.util.BitSet;
import java.util.List;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.NumericBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.UnaryMapOperation;
@ -356,6 +358,11 @@ public final class LongStorage extends NumericStorage<Long> {
return new LongStorage(newData, newSize, newMask);
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return NumericBuilder.createLongBuilder(capacity);
}
@Override
public LongStorage slice(List<SliceRange> ranges) {
int newSize = SliceRange.totalLength(ranges);


@ -1,6 +1,9 @@
package org.enso.table.data.column.storage;
import java.util.BitSet;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.ObjectBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.UnaryMapOperation;
@ -29,6 +32,11 @@ public final class ObjectStorage extends SpecializedStorage<Object> {
return Type.OBJECT;
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new ObjectBuilder(capacity);
}
private static final MapOpStorage<Object, SpecializedStorage<Object>> ops = buildObjectOps();
static <T, S extends SpecializedStorage<T>> MapOpStorage<T, S> buildObjectOps() {


@ -257,6 +257,24 @@ public abstract class Storage<T> {
/** @return a copy of the storage containing a slice of the original data */
public abstract Storage<T> slice(int offset, int limit);
/**
* @return a new storage instance, containing the same elements as this one, with {@code count}
* nulls appended at the end
*/
public Storage<?> appendNulls(int count) {
Builder builder = new InferredBuilder(size() + count);
builder.appendBulkStorage(this);
builder.appendNulls(count);
return builder.seal();
}
/**
* Creates a builder that is capable of creating storages of the same type as the current one.
*
* <p>This is useful for example when copying the current storage with some modifications.
*/
public abstract Builder createDefaultBuilderOfSameType(int capacity);
/** @return a copy of the storage consisting of slices of the original data */
public abstract Storage<T> slice(List<SliceRange> ranges);


@ -3,6 +3,7 @@ package org.enso.table.data.column.storage;
import java.util.BitSet;
import java.util.HashSet;
import org.enso.base.Text_Utils;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.StringBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.MapOperation;
@ -60,6 +61,11 @@ public final class StringStorage extends SpecializedStorage<String> {
}
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new StringBuilder(capacity);
}
private static MapOpStorage<String, SpecializedStorage<String>> buildOps() {
MapOpStorage<String, SpecializedStorage<String>> t = ObjectStorage.buildObjectOps();
t.add(


@ -1,6 +1,9 @@
package org.enso.table.data.column.storage;
import java.time.LocalTime;
import org.enso.table.data.column.builder.object.Builder;
import org.enso.table.data.column.builder.object.TimeOfDayBuilder;
import org.enso.table.data.column.operation.map.MapOpStorage;
import org.enso.table.data.column.operation.map.SpecializedIsInOp;
import org.enso.table.data.column.operation.map.datetime.DateTimeIsInOp;
@ -36,4 +39,9 @@ public final class TimeOfDayStorage extends SpecializedStorage<LocalTime> {
public int getType() {
return Type.TIME_OF_DAY;
}
@Override
public Builder createDefaultBuilderOfSameType(int capacity) {
return new TimeOfDayBuilder(capacity);
}
}


@ -46,4 +46,18 @@ public class OrderMask {
}
return new OrderMask(result);
}
public static OrderMask concat(List<OrderMask> masks) {
int size = 0;
for (OrderMask mask : masks) {
size += mask.positions.length;
}
int[] result = new int[size];
int offset = 0;
for (OrderMask mask : masks) {
System.arraycopy(mask.positions, 0, result, offset, mask.positions.length);
offset += mask.positions.length;
}
return new OrderMask(result);
}
}
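The new `OrderMask.concat` simply copies the position arrays of all masks back-to-back. The core can be sketched on bare `int[]` arrays (a hypothetical standalone version of the same idea):

```java
import java.util.List;

// Hypothetical sketch of OrderMask.concat's core: position arrays are
// copied back-to-back into one array, preserving the order of the inputs.
public class ConcatPositionsSketch {
  public static int[] concat(List<int[]> positionArrays) {
    int size = 0;
    for (int[] positions : positionArrays) {
      size += positions.length;
    }
    int[] result = new int[size];
    int offset = 0;
    for (int[] positions : positionArrays) {
      System.arraycopy(positions, 0, result, offset, positions.length);
      offset += positions.length;
    }
    return result;
  }
}
```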


@ -158,4 +158,20 @@ public class Column {
public Column duplicateCount() {
return new Column(name + "_duplicate_count", storage.duplicateCount());
}
/** Resizes this column to the provided new size.
* <p>
* If the new size is smaller than the current size, the column is truncated.
* If the new size is larger than the current size, the column is padded with nulls.
*/
public Column resize(int newSize) {
if (newSize == getSize()) {
return this;
} else if (newSize < getSize()) {
return slice(0, newSize);
} else {
int nullsToAdd = newSize - getSize();
return new Column(name, storage.appendNulls(nullsToAdd));
}
}
}
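`Column.resize` is the building block that lets `zip` pad the shorter table. Its semantics can be sketched on a plain list (a hypothetical illustration, not the real `Column` API): shrinking truncates, growing pads with nulls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of Column.resize semantics on a plain list: a smaller
// target truncates, a larger target pads with nulls (mirroring appendNulls).
public class ResizeSketch {
  public static List<Integer> resize(List<Integer> values, int newSize) {
    if (newSize <= values.size()) {
      return new ArrayList<>(values.subList(0, newSize));
    }
    List<Integer> result = new ArrayList<>(values);
    while (result.size() < newSize) {
      result.add(null); // pad with nulls
    }
    return result;
  }
}
```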


@ -12,10 +12,7 @@ import org.enso.table.data.index.Index;
import org.enso.table.data.index.MultiValueIndex;
import org.enso.table.data.mask.OrderMask;
import org.enso.table.data.mask.SliceRange;
import org.enso.table.data.table.join.IndexJoin;
import org.enso.table.data.table.join.JoinCondition;
import org.enso.table.data.table.join.JoinResult;
import org.enso.table.data.table.join.JoinStrategy;
import org.enso.table.data.table.join.*;
import org.enso.table.problems.AggregatedProblems;
import org.enso.table.error.UnexpectedColumnTypeException;
import org.enso.table.operations.Distinct;
@ -219,57 +216,47 @@ public class Table {
*/
public Table join(Table right, List<JoinCondition> conditions, boolean keepLeftUnmatched, boolean keepMatched, boolean keepRightUnmatched, boolean includeLeftColumns, boolean includeRightColumns, List<String> rightColumnsToDrop, String right_prefix, Comparator<Object> objectComparator, BiFunction<Object, Object, Boolean> equalityFallback) {
NameDeduplicator nameDeduplicator = new NameDeduplicator();
JoinResult joinResult = null;
// Only compute the join if there are any results to be returned.
if (keepLeftUnmatched || keepMatched || keepRightUnmatched) {
JoinStrategy strategy = new IndexJoin(objectComparator, equalityFallback);
joinResult = strategy.join(this, right, conditions);
if (!keepLeftUnmatched && !keepMatched && !keepRightUnmatched) {
throw new IllegalArgumentException("At least one of keepLeftUnmatched, keepMatched or keepRightUnmatched must be true.");
}
List<Integer> leftRows = new ArrayList<>();
List<Integer> rightRows = new ArrayList<>();
JoinStrategy strategy = new IndexJoin(objectComparator, equalityFallback);
JoinResult joinResult = strategy.join(this, right, conditions);
List<JoinResult> resultsToKeep = new ArrayList<>();
if (keepMatched) {
for (var match : joinResult.matchedRows()) {
leftRows.add(match.getLeft());
rightRows.add(match.getRight());
}
resultsToKeep.add(joinResult);
}
if (keepLeftUnmatched) {
HashSet<Integer> matchedLeftRows = new HashSet<>();
for (var match : joinResult.matchedRows()) {
matchedLeftRows.add(match.getLeft());
}
Set<Integer> matchedLeftRows = joinResult.leftMatchedRows();
JoinResult.Builder leftUnmatchedBuilder = new JoinResult.Builder();
for (int i = 0; i < this.rowCount(); i++) {
if (!matchedLeftRows.contains(i)) {
leftRows.add(i);
rightRows.add(Index.NOT_FOUND);
leftUnmatchedBuilder.addRow(i, Index.NOT_FOUND);
}
}
resultsToKeep.add(leftUnmatchedBuilder.build(AggregatedProblems.of()));
}
if (keepRightUnmatched) {
HashSet<Integer> matchedRightRows = new HashSet<>();
for (var match : joinResult.matchedRows()) {
matchedRightRows.add(match.getRight());
}
Set<Integer> matchedRightRows = joinResult.rightMatchedRows();
JoinResult.Builder rightUnmatchedBuilder = new JoinResult.Builder();
for (int i = 0; i < right.rowCount(); i++) {
if (!matchedRightRows.contains(i)) {
leftRows.add(Index.NOT_FOUND);
rightRows.add(i);
rightUnmatchedBuilder.addRow(Index.NOT_FOUND, i);
}
}
}
OrderMask leftMask = OrderMask.fromList(leftRows);
OrderMask rightMask = OrderMask.fromList(rightRows);
resultsToKeep.add(rightUnmatchedBuilder.build(AggregatedProblems.of()));
}
List<Column> newColumns = new ArrayList<>();
if (includeLeftColumns) {
OrderMask leftMask = OrderMask.concat(resultsToKeep.stream().map(JoinResult::getLeftOrderMask).collect(Collectors.toList()));
for (Column column : this.columns) {
Column newColumn = column.applyMask(leftMask);
newColumns.add(newColumn);
@ -277,6 +264,7 @@ public class Table {
}
if (includeRightColumns) {
OrderMask rightMask = OrderMask.concat(resultsToKeep.stream().map(JoinResult::getRightOrderMask).collect(Collectors.toList()));
List<String> leftColumnNames = newColumns.stream().map(Column::getName).collect(Collectors.toList());
HashSet<String> toDrop = new HashSet<>(rightColumnsToDrop);
@ -288,17 +276,74 @@ public class Table {
for (int i = 0; i < rightColumnsToKeep.size(); ++i) {
Column column = rightColumnsToKeep.get(i);
String newName = newRightColumnNames.get(i);
Storage<?> newStorage = column.getStorage().applyMask(rightMask);
Column newColumn = new Column(newName, newStorage);
Column newColumn = column.applyMask(rightMask).rename(newName);
newColumns.add(newColumn);
}
}
AggregatedProblems joinProblems = joinResult != null ? joinResult.problems() : null;
AggregatedProblems aggregatedProblems = AggregatedProblems.merge(joinProblems, AggregatedProblems.of(nameDeduplicator.getProblems()));
AggregatedProblems aggregatedProblems = AggregatedProblems.merge(AggregatedProblems.of(nameDeduplicator.getProblems()), joinProblems);
return new Table(newColumns.toArray(new Column[0]), aggregatedProblems);
}
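The refactored `join` collects unmatched rows by set difference against the matched-row sets. That step can be sketched in isolation (hypothetical names; `NOT_FOUND` stands in for `Index.NOT_FOUND`): every left row index absent from the matched set is paired with a not-found marker on the right.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch of collecting unmatched left rows: indices missing
// from the matched set are paired with a NOT_FOUND marker on the right.
public class UnmatchedRowsSketch {
  public static final int NOT_FOUND = -1;

  public static List<int[]> leftUnmatched(int leftRowCount, Set<Integer> matchedLeftRows) {
    List<int[]> rows = new ArrayList<>();
    for (int i = 0; i < leftRowCount; i++) {
      if (!matchedLeftRows.contains(i)) {
        rows.add(new int[] {i, NOT_FOUND});
      }
    }
    return rows;
  }
}
```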
/**
* Performs a cross-join of this table with the right table.
*/
public Table crossJoin(Table right, String rightPrefix) {
NameDeduplicator nameDeduplicator = new NameDeduplicator();
List<String> leftColumnNames = Arrays.stream(this.columns).map(Column::getName).collect(Collectors.toList());
List<String> rightColumnNames = Arrays.stream(right.columns).map(Column::getName).collect(Collectors.toList());
List<String> newRightColumnNames = nameDeduplicator.combineWithPrefix(leftColumnNames, rightColumnNames, rightPrefix);
JoinResult joinResult = CrossJoin.perform(this.rowCount(), right.rowCount());
OrderMask leftMask = joinResult.getLeftOrderMask();
OrderMask rightMask = joinResult.getRightOrderMask();
Column[] newColumns = new Column[this.columns.length + right.columns.length];
int leftColumnCount = this.columns.length;
int rightColumnCount = right.columns.length;
for (int i = 0; i < leftColumnCount; i++) {
newColumns[i] = this.columns[i].applyMask(leftMask);
}
for (int i = 0; i < rightColumnCount; i++) {
newColumns[leftColumnCount + i] = right.columns[i].applyMask(rightMask).rename(newRightColumnNames.get(i));
}
AggregatedProblems aggregatedProblems = AggregatedProblems.merge(AggregatedProblems.of(nameDeduplicator.getProblems()), joinResult.problems());
return new Table(newColumns, aggregatedProblems);
}
/**
* Zips rows of this table with rows of the right table.
*/
public Table zip(Table right, boolean keepUnmatched, String rightPrefix) {
NameDeduplicator nameDeduplicator = new NameDeduplicator();
int leftRowCount = this.rowCount();
int rightRowCount = right.rowCount();
int resultRowCount = keepUnmatched ? Math.max(leftRowCount, rightRowCount) : Math.min(leftRowCount, rightRowCount);
List<String> leftColumnNames = Arrays.stream(this.columns).map(Column::getName).collect(Collectors.toList());
List<String> rightColumnNames = Arrays.stream(right.columns).map(Column::getName).collect(Collectors.toList());
List<String> newRightColumnNames = nameDeduplicator.combineWithPrefix(leftColumnNames, rightColumnNames, rightPrefix);
Column[] newColumns = new Column[this.columns.length + right.columns.length];
int leftColumnCount = this.columns.length;
int rightColumnCount = right.columns.length;
for (int i = 0; i < leftColumnCount; i++) {
newColumns[i] = this.columns[i].resize(resultRowCount);
}
for (int i = 0; i < rightColumnCount; i++) {
newColumns[leftColumnCount + i] = right.columns[i].resize(resultRowCount).rename(newRightColumnNames.get(i));
}
return new Table(newColumns, AggregatedProblems.of(nameDeduplicator.getProblems()));
}
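The row-count rule used by `zip` above — `max` of the two row counts when unmatched rows are kept, `min` otherwise, with shorter columns padded by nulls — can be sketched standalone. This is an illustrative model, not the actual `Column.resize` implementation; class and method names here are hypothetical.

```java
import java.util.Arrays;

public class ZipSketch {
    // Pads or truncates a column's values to the requested length,
    // mirroring what Column.resize is expected to do: extra slots are null.
    static Object[] resize(Object[] values, int newLength) {
        return Arrays.copyOf(values, newLength);
    }

    public static void main(String[] args) {
        Object[] left = {1, 2, 3};
        Object[] right = {"a"};
        boolean keepUnmatched = true;
        // keep_unmatched=True -> pad to the longer side; False -> truncate to the shorter.
        int rows = keepUnmatched
            ? Math.max(left.length, right.length)
            : Math.min(left.length, right.length);
        System.out.println(Arrays.toString(resize(left, rows)));  // [1, 2, 3]
        System.out.println(Arrays.toString(resize(right, rows))); // [a, null, null]
    }
}
```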
/**
* Applies an order mask to all columns and indexes of this table.
*


@@ -0,0 +1,16 @@
package org.enso.table.data.table.join;
import org.enso.table.problems.AggregatedProblems;
public class CrossJoin {
public static JoinResult perform(int leftRowCount, int rightRowCount) {
JoinResult.Builder resultBuilder = new JoinResult.Builder(leftRowCount * rightRowCount);
for (int l = 0; l < leftRowCount; ++l) {
for (int r = 0; r < rightRowCount; ++r) {
resultBuilder.addRow(l, r);
}
}
return resultBuilder.build(AggregatedProblems.of());
}
}
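`CrossJoin.perform` above builds the Cartesian product as index pairs in left-major order. A minimal standalone illustration of that pairing (hypothetical class name, returning explicit pairs instead of a `JoinResult`):

```java
import java.util.ArrayList;
import java.util.List;

public class CrossJoinSketch {
    // Pairs every left row index with every right row index, left-major,
    // matching the nested loops in CrossJoin.perform.
    static List<int[]> pairs(int leftRows, int rightRows) {
        List<int[]> result = new ArrayList<>(leftRows * rightRows);
        for (int l = 0; l < leftRows; ++l) {
            for (int r = 0; r < rightRows; ++r) {
                result.add(new int[] {l, r});
            }
        }
        return result;
    }

    public static void main(String[] args) {
        for (int[] p : pairs(2, 2)) {
            System.out.println(p[0] + "," + p[1]); // 0,0  0,1  1,0  1,1
        }
    }
}
```

Applying the left indices and right indices as order masks to the respective tables then yields the joined rows in this exact order.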


@@ -56,13 +56,13 @@ public class IndexJoin implements JoinStrategy {
MatcherFactory factory = new MatcherFactory(objectComparator, equalityFallback);
Matcher remainingMatcher = factory.create(remainingConditions);
List<Pair<Integer, Integer>> matches = new ArrayList<>();
JoinResult.Builder resultBuilder = new JoinResult.Builder();
for (var leftKey : leftIndex.keys()) {
if (rightIndex.contains(leftKey)) {
for (var leftRow : leftIndex.get(leftKey)) {
for (var rightRow : rightIndex.get(leftKey)) {
if (remainingMatcher.matches(leftRow, rightRow)) {
matches.add(Pair.create(leftRow, rightRow));
resultBuilder.addRow(leftRow, rightRow);
}
}
}
@@ -70,11 +70,8 @@ public class IndexJoin implements JoinStrategy {
}
AggregatedProblems problems =
AggregatedProblems.merge(
new AggregatedProblems[] {
leftIndex.getProblems(), rightIndex.getProblems(), remainingMatcher.getProblems()
});
return new JoinResult(matches, problems);
AggregatedProblems.merge(leftIndex.getProblems(), rightIndex.getProblems(), remainingMatcher.getProblems());
return resultBuilder.build(problems);
}
private static boolean isSupported(JoinCondition condition) {


@@ -1,8 +1,50 @@
package org.enso.table.data.table.join;
import org.enso.base.arrays.IntArrayBuilder;
import org.enso.table.data.mask.OrderMask;
import org.enso.table.problems.AggregatedProblems;
import org.graalvm.collections.Pair;
import java.util.List;
import java.util.*;
import java.util.stream.Collectors;
public record JoinResult(List<Pair<Integer, Integer>> matchedRows, AggregatedProblems problems) {}
public record JoinResult(int[] matchedRowsLeftIndices, int[] matchedRowsRightIndices, AggregatedProblems problems) {
public OrderMask getLeftOrderMask() {
return new OrderMask(matchedRowsLeftIndices);
}
public OrderMask getRightOrderMask() {
return new OrderMask(matchedRowsRightIndices);
}
public Set<Integer> leftMatchedRows() {
return new HashSet<>(Arrays.stream(matchedRowsLeftIndices).boxed().collect(Collectors.toList()));
}
public Set<Integer> rightMatchedRows() {
return new HashSet<>(Arrays.stream(matchedRowsRightIndices).boxed().collect(Collectors.toList()));
}
public static class Builder {
IntArrayBuilder leftIndices;
IntArrayBuilder rightIndices;
public Builder(int initialCapacity) {
leftIndices = new IntArrayBuilder(initialCapacity);
rightIndices = new IntArrayBuilder(initialCapacity);
}
public Builder() {
this(128);
}
public void addRow(int leftIndex, int rightIndex) {
leftIndices.add(leftIndex);
rightIndices.add(rightIndex);
}
public JoinResult build(AggregatedProblems problemsToInherit) {
return new JoinResult(leftIndices.build(), rightIndices.build(), problemsToInherit);
}
}
}
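`JoinResult.Builder` accumulates indices in `org.enso.base.arrays.IntArrayBuilder`. A minimal equivalent of such a growable int buffer (a sketch in the same spirit, not the actual implementation) looks like:

```java
import java.util.Arrays;

// A growable int buffer: amortized O(1) appends via geometric growth,
// with a final build() that trims the backing array to the exact size.
public class IntArrayBuilderSketch {
    private int[] data;
    private int size;

    public IntArrayBuilderSketch(int initialCapacity) {
        data = new int[Math.max(1, initialCapacity)];
    }

    public void add(int value) {
        if (size == data.length) {
            data = Arrays.copyOf(data, data.length * 2); // double on overflow
        }
        data[size++] = value;
    }

    public int[] build() {
        return Arrays.copyOf(data, size); // trim to the exact length
    }
}
```

Storing matched rows as two parallel int arrays instead of a `List<Pair<Integer, Integer>>` avoids boxing every row index and lets the result be handed directly to `OrderMask`.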


@@ -21,21 +21,22 @@ public class ScanJoin implements JoinStrategy {
@Override
public JoinResult join(Table left, Table right, List<JoinCondition> conditions) {
List<Pair<Integer, Integer>> matches = new ArrayList<>();
int ls = left.rowCount();
int rs = right.rowCount();
MatcherFactory factory = new MatcherFactory(objectComparator, equalityFallback);
Matcher compoundMatcher = factory.create(conditions);
JoinResult.Builder resultBuilder = new JoinResult.Builder();
for (int l = 0; l < ls; ++l) {
for (int r = 0; r < rs; ++r) {
if (compoundMatcher.matches(l, r)) {
matches.add(Pair.create(l, r));
resultBuilder.addRow(l, r);
}
}
}
return new JoinResult(matches, compoundMatcher.getProblems());
return resultBuilder.build(compoundMatcher.getProblems());
}
}
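The scan-join strategy above is a plain nested loop that tests every (left, right) row pair against a compound matcher. A standalone model of that shape, with the matcher abstracted to a predicate (illustrative names, not the actual `Matcher` API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

public class ScanJoinSketch {
    // Tests every (left, right) index pair and keeps the ones that match,
    // in the same left-major order as ScanJoin.join.
    static List<int[]> join(int leftRows, int rightRows, BiPredicate<Integer, Integer> matches) {
        List<int[]> result = new ArrayList<>();
        for (int l = 0; l < leftRows; ++l) {
            for (int r = 0; r < rightRows; ++r) {
                if (matches.test(l, r)) {
                    result.add(new int[] {l, r});
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // e.g. a condition that matches rows with equal indices.
        List<int[]> pairs = join(3, 3, (l, r) -> l.equals(r));
        System.out.println(pairs.size()); // 3
    }
}
```

The O(leftRows × rightRows) cost is why the index-based strategy is preferred whenever the join conditions support it.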


@@ -10,6 +10,12 @@ public class BitSets {
* something on our own that would operate on whole longs instead of bit by bit.
*/
public static void copy(BitSet source, BitSet destination, int destinationOffset, int length) {
if (destinationOffset == 0) {
destination.clear(0, length);
destination.or(source.get(0, length));
return;
}
for (int i = 0; i < length; i++) {
if (source.get(i)) {
destination.set(destinationOffset + i);
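The `BitSets.copy` change above adds a bulk path for the common zero-offset case: clear the target range, then OR in a slice of the source, which operates on whole words rather than bit by bit. A self-contained sketch of both paths (the fallback, like the original, assumes the destination range is clear):

```java
import java.util.BitSet;

public class BitSetCopySketch {
    static void copy(BitSet source, BitSet destination, int destinationOffset, int length) {
        if (destinationOffset == 0) {
            // Bulk path: clear the range, then OR in a slice of the source.
            destination.clear(0, length);
            destination.or(source.get(0, length));
            return;
        }
        // Fallback: copy bit by bit at an offset (destination range assumed clear).
        for (int i = 0; i < length; i++) {
            if (source.get(i)) {
                destination.set(destinationOffset + i);
            }
        }
    }

    public static void main(String[] args) {
        BitSet src = new BitSet();
        src.set(0);
        src.set(2);
        BitSet dst = new BitSet();
        copy(src, dst, 5, 3);
        System.out.println(dst); // {5, 7}
    }
}
```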


@@ -1324,7 +1324,7 @@ spec setup =
Test.specify "should merge Invalid Aggregation warnings" <|
new_table = table.aggregate [Group_By "Key", Concatenate "Value"]
problems = Warning.get_all new_table . map .value
problems = Problems.get_attached_warnings new_table
problems.length . should_equal 1
problems.at 0 . is_a Invalid_Aggregation.Error . should_be_true
problems.at 0 . column . should_equal "Concatenate Value"
@@ -1332,7 +1332,7 @@ spec setup =
Test.specify "should merge Floating Point Grouping warnings" <|
new_table = table.aggregate [Group_By "Float", Count]
problems = Warning.get_all new_table . map .value
problems = Problems.get_attached_warnings new_table
problems.length . should_equal 1
problems.at 0 . is_a Floating_Point_Grouping.Error . should_be_true
problems.at 0 . column . should_equal "Float"
@@ -1343,7 +1343,7 @@ spec setup =
result.column_count . should_equal 1
result.row_count . should_equal 1
result.columns.first.to_vector . should_equal [6]
warnings = Warning.get_all result . map .value
warnings = Problems.get_attached_warnings result
warnings.length . should_equal error_count
warnings.each warning->
warning.should_be_an Unsupported_Database_Operation.Error


@@ -0,0 +1,152 @@
from Standard.Base import all
import Standard.Base.Error.Illegal_State.Illegal_State
from Standard.Table import all hiding Table
from Standard.Table.Errors import all
from Standard.Database.Errors import Unsupported_Database_Operation
from Standard.Test import Test, Problems
import Standard.Test.Extensions
from project.Common_Table_Operations.Util import expect_column_names, run_default_backend
main = run_default_backend spec
spec setup =
prefix = setup.prefix
table_builder = setup.table_builder
materialize = setup.materialize
db_todo = if prefix.contains "In-Memory" then Nothing else "Table.cross_join is still WIP for the DB backend."
Test.group prefix+"Table.cross_join" pending=db_todo <|
Test.specify "should allow to create a cross product of two tables in the right order" <|
t1 = table_builder [["X", [1, 2]], ["Y", [4, 5]]]
t2 = table_builder [["Z", ['a', 'b']], ["W", ['c', 'd']]]
t3 = t1.cross_join t2
expect_column_names ["X", "Y", "Z", "W"] t3
t3.row_count . should_equal 4
r = materialize t3 . rows . map .to_vector
r.length . should_equal 4
r0 = [1, 4, 'a', 'c']
r1 = [1, 4, 'b', 'd']
r2 = [2, 5, 'a', 'c']
r3 = [2, 5, 'b', 'd']
expected_rows = [r0, r1, r2, r3]
case setup.is_database of
True -> r.should_contain_the_same_elements_as expected_rows
False -> r.should_equal expected_rows
Test.specify "should work correctly with empty tables" <|
t1 = table_builder [["X", [1, 2]], ["Y", [4, 5]]]
t2 = table_builder [["Z", ['a']], ["W", ['c']]]
# Workaround to easily create empty table until table builder allows that directly.
empty = t2.filter "Z" Filter_Condition.Is_Nothing
empty.row_count . should_equal 0
t3 = t1.cross_join empty
expect_column_names ["X", "Y", "Z", "W"] t3
t3.row_count.should_equal 0
t3.at "X" . to_vector . should_equal []
t4 = empty.cross_join t1
expect_column_names ["Z", "W", "X", "Y"] t4
t4.row_count.should_equal 0
t4.at "X" . to_vector . should_equal []
Test.specify "should respect the right row limit" <|
t2 = table_builder [["X", [1, 2]]]
t3 = table_builder [["X", [1, 2, 3]]]
t100 = table_builder [["Y", 0.up_to 100 . to_vector]]
t101 = table_builder [["Y", 0.up_to 101 . to_vector]]
t2.cross_join t100 . row_count . should_equal 200
t101.cross_join t2 . row_count . should_equal 202
action = t2.cross_join t101 on_problems=_
tester table =
table.row_count . should_equal 202
problems = [Cross_Join_Row_Limit_Exceeded.Error 100 101]
Problems.test_problem_handling action problems tester
t2.cross_join t101 right_row_limit=Nothing . row_count . should_equal 202
t2.cross_join t3 right_row_limit=2 on_problems=Problem_Behavior.Report_Error . should_fail_with Cross_Join_Row_Limit_Exceeded
Test.specify "should ensure 1-1 mapping even with duplicate rows" <|
t1 = table_builder [["X", [2, 1, 2, 2]], ["Y", [5, 4, 5, 5]]]
t2 = table_builder [["Z", ['a', 'a']]]
t3 = t1.cross_join t2
expect_column_names ["X", "Y", "Z"] t3
t3.row_count . should_equal 8
r = materialize t3 . rows . map .to_vector
r.length . should_equal 8
r1 = [2, 5, 'a']
r2 = [1, 4, 'a']
expected_rows = [r1, r1, r2, r2, r1, r1, r1, r1]
case setup.is_database of
True -> r.should_contain_the_same_elements_as expected_rows
False -> r.should_equal expected_rows
Test.specify "should allow self-joins" <|
t1 = table_builder [["X", [1, 2]], ["Y", [4, 5]]]
t2 = t1.cross_join t1
expect_column_names ["X", "Y", "Right_X", "Right_Y"] t2
t2.row_count . should_equal 4
r = materialize t2 . rows . map .to_vector
r.length . should_equal 4
r0 = [1, 4, 1, 4]
r1 = [1, 4, 2, 5]
r2 = [2, 5, 1, 4]
r3 = [2, 5, 2, 5]
expected_rows = [r0, r1, r2, r3]
case setup.is_database of
True -> r.should_contain_the_same_elements_as expected_rows
False -> r.should_equal expected_rows
Test.specify "should rename columns of the right table to avoid duplicates" <|
t1 = table_builder [["X", [1]], ["Y", [5]]]
t2 = table_builder [["X", ['a']], ["Y", ['d']]]
t3 = t1.cross_join t2
expect_column_names ["X", "Y", "Right_X", "Right_Y"] t3
Problems.get_attached_warnings t3 . should_equal [Duplicate_Output_Column_Names.Error ["X", "Y"]]
t3.row_count . should_equal 1
t3.at "X" . to_vector . should_equal [1]
t3.at "Y" . to_vector . should_equal [5]
t3.at "Right_X" . to_vector . should_equal ['a']
t3.at "Right_Y" . to_vector . should_equal ['d']
t1.cross_join t2 on_problems=Problem_Behavior.Report_Error . should_fail_with Duplicate_Output_Column_Names
expect_column_names ["X", "Y", "X_1", "Y_1"] (t1.cross_join t2 right_prefix="")
t4 = table_builder [["X", [1]], ["Right_X", [5]]]
expect_column_names ["X", "Y", "Right_X_1", "Right_X"] (t1.cross_join t4)
expect_column_names ["X", "Right_X", "Right_X_1", "Y"] (t4.cross_join t1)
Test.specify "should respect the column ordering" <|
t1 = table_builder [["X", [100, 2]], ["Y", [4, 5]]]
t2 = table_builder [["Z", ['a', 'b', 'c']], ["W", ['x', 'd', 'd']]]
t3 = t1.order_by "X"
t4 = t2.order_by (Sort_Column_Selector.By_Name [Sort_Column.Name "Z" Sort_Direction.Descending])
t5 = t3.cross_join t4
expect_column_names ["X", "Y", "Z", "W"] t5
t5.row_count . should_equal 6
r = materialize t5 . rows . map .to_vector
r.length . should_equal 6
r0 = [2, 5, 'c', 'd']
r1 = [2, 5, 'b', 'd']
r2 = [2, 5, 'a', 'x']
r3 = [100, 4, 'c', 'd']
r4 = [100, 4, 'b', 'd']
r5 = [100, 4, 'a', 'x']
expected_rows = [r0, r1, r2, r3, r4, r5]
r.should_equal expected_rows


@@ -1,7 +1,7 @@
from Standard.Base import all
import Standard.Base.Error.Illegal_State.Illegal_State
from Standard.Table import all
from Standard.Table import all hiding Table
from Standard.Table.Errors import all
import Standard.Table.Data.Value_Type.Value_Type
@@ -417,12 +417,15 @@ spec setup =
t2 = table_builder [["X", [2, 1]], ["Y", [2, 2]]]
t3 = t1.join t2 on=(Join_Condition.Equals "X" "Y") |> materialize |> _.order_by ["Right_X"]
Problems.get_attached_warnings t3 . should_equal [Duplicate_Output_Column_Names.Error ["X", "Y"]]
expect_column_names ["X", "Y", "Right_X", "Right_Y"] t3
t3.at "X" . to_vector . should_equal [2, 2]
t3.at "Right_Y" . to_vector . should_equal [2, 2]
t3.at "Y" . to_vector . should_equal [4, 4]
t3.at "Right_X" . to_vector . should_equal [1, 2]
t1.join t2 on=(Join_Condition.Equals "X" "Y") on_problems=Problem_Behavior.Report_Error . should_fail_with Duplicate_Output_Column_Names
t4 = table_builder [["Right_X", [1, 1]], ["X", [1, 2]], ["Y", [3, 4]], ["Right_Y_2", [2, 2]]]
t5 = table_builder [["Right_X", [2, 1]], ["X", [2, 2]], ["Y", [2, 2]], ["Right_Y", [2, 2]], ["Right_Y_1", [2, 2]], ["Right_Y_4", [2, 2]]]
@@ -431,6 +434,7 @@ spec setup =
t7 = t1.join t2 right_prefix=""
expect_column_names ["X", "Y", "Y_1"] t7
Problems.get_attached_warnings t7 . should_equal [Duplicate_Output_Column_Names.Error ["Y"]]
t8 = t1.join t2 right_prefix="P"
expect_column_names ["X", "Y", "PY"] t8


@@ -20,7 +20,7 @@ main = run_default_backend spec
spec setup =
prefix = setup.prefix
table_builder = setup.table_builder
db_todo = if prefix.contains "In-Memory" then Nothing else "Union API is not yet implemented for the DB backend."
db_todo = if prefix.contains "In-Memory" then Nothing else "Table.union is not yet implemented for the DB backend."
Test.group prefix+"Table.union" pending=db_todo <|
Test.specify "should merge columns from multiple tables" <|
t1 = table_builder [["A", [1, 2, 3]], ["B", ["a", "b", "c"]]]
@@ -148,7 +148,7 @@ spec setup =
t3 = t1.union t2 match_columns=Match_Columns.By_Position
within_table t3 <|
check t3
Warning.get_all t3 . map .value . should_equal [Column_Count_Mismatch.Error 2 1]
Problems.get_attached_warnings t3 . should_equal [Column_Count_Mismatch.Error 2 1]
t4 = t1.union t2 match_columns=Match_Columns.By_Position keep_unmatched_columns=True
within_table t4 <|


@@ -0,0 +1,238 @@
from Standard.Base import all
import Standard.Base.Error.Illegal_State.Illegal_State
from Standard.Table import all hiding Table
from Standard.Table.Errors import all
import Standard.Table.Data.Value_Type.Value_Type
from Standard.Database.Errors import Unsupported_Database_Operation
from Standard.Test import Test, Problems
import Standard.Test.Extensions
from project.Common_Table_Operations.Util import expect_column_names, run_default_backend
main = run_default_backend spec
spec setup =
prefix = setup.prefix
table_builder = setup.table_builder
materialize = setup.materialize
db_todo = if prefix.contains "In-Memory" then Nothing else "Table.zip is still WIP for the DB backend."
Test.group prefix+"Table.zip" pending=db_todo <|
if setup.is_database.not then
Test.specify "should allow to zip two tables, preserving memory layout order" <|
t1 = table_builder [["X", [1, 2, 3]], ["Y", [4, 5, 6]]]
t2 = table_builder [["Z", ['a', 'b', 'c']], ["W", ['x', 'y', 'z']]]
t3 = t1.zip t2
expect_column_names ["X", "Y", "Z", "W"] t3
t3.row_count . should_equal 3
r = materialize t3 . rows . map .to_vector
r.length . should_equal 3
r0 = [1, 4, 'a', 'x']
r1 = [2, 5, 'b', 'y']
r2 = [3, 6, 'c', 'z']
expected_rows = [r0, r1, r2]
r.should_equal expected_rows
Test.specify "should allow to zip two tables, preserving the order defined by `order_by`" <|
t1 = table_builder [["X", [100, 2]], ["Y", [4, 5]]]
t2 = table_builder [["Z", ['a', 'b']], ["W", ['x', 'd']]]
t3 = t1.order_by "X"
t4 = t2.order_by (Sort_Column_Selector.By_Name [Sort_Column.Name "Z" Sort_Direction.Descending])
t5 = t3.zip t4
expect_column_names ["X", "Y", "Z", "W"] t5
t5.row_count . should_equal 2
r = materialize t5 . rows . map .to_vector
r.length . should_equal 2
r0 = [2, 5, 'b', 'd']
r1 = [100, 4, 'a', 'x']
expected_rows = [r0, r1]
r.should_equal expected_rows
Test.specify "should report unmatched rows if the row counts do not match and pad them with nulls" <|
t1 = table_builder [["X", [1, 2, 3]], ["Y", [4, 5, 6]]]
t2 = table_builder [["Z", ['a', 'b']], ["W", ['x', 'd']]]
action_1 = t1.zip t2 on_problems=_
tester_1 table =
expect_column_names ["X", "Y", "Z", "W"] table
table.at "X" . to_vector . should_equal [1, 2, 3]
table.at "Y" . to_vector . should_equal [4, 5, 6]
table.at "Z" . to_vector . should_equal ['a', 'b', Nothing]
table.at "W" . to_vector . should_equal ['x', 'd', Nothing]
problems_1 = [Row_Count_Mismatch.Error 3 2]
Problems.test_problem_handling action_1 problems_1 tester_1
action_2 = t2.zip t1 on_problems=_
tester_2 table =
expect_column_names ["Z", "W", "X", "Y"] table
table.at "Z" . to_vector . should_equal ['a', 'b', Nothing]
table.at "W" . to_vector . should_equal ['x', 'd', Nothing]
table.at "X" . to_vector . should_equal [1, 2, 3]
table.at "Y" . to_vector . should_equal [4, 5, 6]
problems_2 = [Row_Count_Mismatch.Error 2 3]
Problems.test_problem_handling action_2 problems_2 tester_2
Test.specify "should allow to keep the unmatched rows padded with nulls without reporting problems" <|
t1 = table_builder [["X", [1, 2, 3]], ["Y", [4, 5, 6]]]
t2 = table_builder [["Z", ['a']], ["W", ['x']]]
t3 = t1.zip t2 keep_unmatched=True on_problems=Problem_Behavior.Report_Error
Problems.assume_no_problems t3
expect_column_names ["X", "Y", "Z", "W"] t3
t3.at "X" . to_vector . should_equal [1, 2, 3]
t3.at "Y" . to_vector . should_equal [4, 5, 6]
t3.at "Z" . to_vector . should_equal ['a', Nothing, Nothing]
t3.at "W" . to_vector . should_equal ['x', Nothing, Nothing]
Test.specify "should allow to drop the unmatched rows" <|
t1 = table_builder [["X", [1, 2, 3]], ["Y", [4, 5, 6]]]
t2 = table_builder [["Z", ['a']], ["W", ['x']]]
t3 = t1.zip t2 keep_unmatched=False on_problems=Problem_Behavior.Report_Error
Problems.assume_no_problems t3
expect_column_names ["X", "Y", "Z", "W"] t3
t3.at "X" . to_vector . should_equal [1]
t3.at "Y" . to_vector . should_equal [4]
t3.at "Z" . to_vector . should_equal ['a']
t3.at "W" . to_vector . should_equal ['x']
Test.specify "should work when zipping with an empty table" <|
t1 = table_builder [["X", [1, 2]], ["Y", [4, 5]]]
t2 = table_builder [["Z", ['a']], ["W", ['c']]]
# Workaround to easily create empty table until table builder allows that directly.
empty = t2.filter "Z" Filter_Condition.Is_Nothing
empty.row_count . should_equal 0
t3 = t1.zip empty
expect_column_names ["X", "Y", "Z", "W"] t3
t3.row_count . should_equal 2
t3.at "X" . to_vector . should_equal [1, 2]
t3.at "Y" . to_vector . should_equal [4, 5]
t3.at "Z" . to_vector . should_equal [Nothing, Nothing]
t3.at "W" . to_vector . should_equal [Nothing, Nothing]
t4 = empty.zip t1
expect_column_names ["Z", "W", "X", "Y"] t4
t4.row_count . should_equal 2
t4.at "X" . to_vector . should_equal [1, 2]
t4.at "Y" . to_vector . should_equal [4, 5]
t4.at "Z" . to_vector . should_equal [Nothing, Nothing]
t4.at "W" . to_vector . should_equal [Nothing, Nothing]
t5 = t1.zip empty keep_unmatched=False
expect_column_names ["X", "Y", "Z", "W"] t5
t5.row_count . should_equal 0
t5.at "X" . to_vector . should_equal []
t6 = empty.zip t1 keep_unmatched=False
expect_column_names ["Z", "W", "X", "Y"] t6
t6.row_count . should_equal 0
t6.at "X" . to_vector . should_equal []
Test.specify "should not report unmatched rows for rows that simply are all null" <|
t1 = table_builder [["X", [1, 2, 3]], ["Y", [4, 5, 6]]]
t2 = table_builder [["Z", ['a', Nothing, Nothing]], ["W", ['b', Nothing, Nothing]]]
t3 = t1.zip t2 on_problems=Problem_Behavior.Report_Error
Problems.assume_no_problems t3
expect_column_names ["X", "Y", "Z", "W"] t3
t3.at "X" . to_vector . should_equal [1, 2, 3]
t3.at "Y" . to_vector . should_equal [4, 5, 6]
t3.at "Z" . to_vector . should_equal ['a', Nothing, Nothing]
t3.at "W" . to_vector . should_equal ['b', Nothing, Nothing]
Test.specify "should rename columns of the right table to avoid duplicates" <|
t1 = table_builder [["X", [1, 2]], ["Y", [5, 6]]]
t2 = table_builder [["X", ['a']], ["Y", ['d']]]
t3 = t1.zip t2 keep_unmatched=True
expect_column_names ["X", "Y", "Right_X", "Right_Y"] t3
Problems.get_attached_warnings t3 . should_equal [Duplicate_Output_Column_Names.Error ["X", "Y"]]
t3.row_count . should_equal 2
t3.at "X" . to_vector . should_equal [1, 2]
t3.at "Y" . to_vector . should_equal [5, 6]
t3.at "Right_X" . to_vector . should_equal ['a', Nothing]
t3.at "Right_Y" . to_vector . should_equal ['d', Nothing]
t1.zip t2 keep_unmatched=False on_problems=Problem_Behavior.Report_Error . should_fail_with Duplicate_Output_Column_Names
expect_column_names ["X", "Y", "X_1", "Y_1"] (t1.zip t2 right_prefix="")
t4 = table_builder [["X", [1]], ["Right_X", [5]]]
expect_column_names ["X", "Y", "Right_X_1", "Right_X"] (t1.zip t4)
expect_column_names ["X", "Right_X", "Right_X_1", "Y"] (t4.zip t1)
Test.specify "should report both row count mismatch and duplicate column warnings at the same time" <|
t1 = table_builder [["X", [1, 2]], ["Y", [5, 6]]]
t2 = table_builder [["X", ['a']], ["Z", ['d']]]
t3 = t1.zip t2
expected_problems = [Row_Count_Mismatch.Error 2 1, Duplicate_Output_Column_Names.Error ["X"]]
Problems.get_attached_warnings t3 . should_contain_the_same_elements_as expected_problems
Test.specify "should allow to zip the table with itself" <|
## Even though this does not seem very useful, we should verify that
this edge case works correctly. It may especially be fragile in
the Database backend.
t1 = table_builder [["X", [1, 2]], ["Y", [4, 5]]]
t2 = t1.zip t1
expect_column_names ["X", "Y", "Right_X", "Right_Y"] t2
t2.row_count . should_equal 2
t2.at "X" . to_vector . should_equal [1, 2]
t2.at "Y" . to_vector . should_equal [4, 5]
t2.at "Right_X" . to_vector . should_equal [1, 2]
t2.at "Right_Y" . to_vector . should_equal [4, 5]
if setup.is_database.not then
Test.specify "should correctly pad/truncate all kinds of column types" <|
primitives = [["ints", [1, 2, 3]], ["strs", ['a', 'b', 'c']], ["bools", [True, Nothing, False]]]
times = [["dates", [Date.new 1999 1 1, Date.new 2000 4 1, Date.new 2001 1 2]], ["times", [Time_Of_Day.new 23 59, Time_Of_Day.new 0 0, Time_Of_Day.new 12 34]], ["datetimes", [Date_Time.new 1999 1 1 23 59, Date_Time.new 2000 4 1 0 0, Date_Time.new 2001 1 2 12 34]]]
t = table_builder <|
primitives + times + [["mixed", ['a', 2, True]]]
t1 = table_builder [["X", [1]]]
t5 = table_builder [["X", 0.up_to 5 . to_vector]]
truncated = t.zip t1 keep_unmatched=False
expect_column_names ["ints", "strs", "bools", "dates", "times", "datetimes", "mixed", "X"] truncated
truncated.row_count . should_equal 1
truncated.at "ints" . to_vector . should_equal [1]
truncated.at "strs" . to_vector . should_equal ['a']
truncated.at "bools" . to_vector . should_equal [True]
truncated.at "dates" . to_vector . should_equal [Date.new 1999 1 1]
truncated.at "times" . to_vector . should_equal [Time_Of_Day.new 23 59]
truncated.at "datetimes" . to_vector . should_equal [Date_Time.new 1999 1 1 23 59]
truncated.at "mixed" . to_vector . should_equal ['a']
truncated.at "ints" . value_type . should_equal Value_Type.Integer
truncated.at "strs" . value_type . should_equal Value_Type.Char
truncated.at "bools" . value_type . should_equal Value_Type.Boolean
truncated.at "dates" . value_type . should_equal Value_Type.Date
truncated.at "times" . value_type . should_equal Value_Type.Time
truncated.at "datetimes" . value_type . should_equal Value_Type.Date_Time
truncated.at "mixed" . value_type . should_equal Value_Type.Mixed
padded = t.zip t5 keep_unmatched=True
expect_column_names ["ints", "strs", "bools", "dates", "times", "datetimes", "mixed", "X"] padded
padded.row_count . should_equal 5
padded.at "ints" . to_vector . should_equal [1, 2, 3, Nothing, Nothing]
padded.at "strs" . to_vector . should_equal ['a', 'b', 'c', Nothing, Nothing]
padded.at "bools" . to_vector . should_equal [True, Nothing, False, Nothing, Nothing]
padded.at "dates" . to_vector . should_equal [Date.new 1999 1 1, Date.new 2000 4 1, Date.new 2001 1 2, Nothing, Nothing]
padded.at "times" . to_vector . should_equal [Time_Of_Day.new 23 59, Time_Of_Day.new 0 0, Time_Of_Day.new 12 34, Nothing, Nothing]
padded.at "datetimes" . to_vector . should_equal [Date_Time.new 1999 1 1 23 59, Date_Time.new 2000 4 1 0 0, Date_Time.new 2001 1 2 12 34, Nothing, Nothing]
padded.at "mixed" . to_vector . should_equal ['a', 2, True, Nothing, Nothing]
padded.at "ints" . value_type . should_equal Value_Type.Integer
padded.at "strs" . value_type . should_equal Value_Type.Char
padded.at "bools" . value_type . should_equal Value_Type.Boolean
padded.at "dates" . value_type . should_equal Value_Type.Date
padded.at "times" . value_type . should_equal Value_Type.Time
padded.at "datetimes" . value_type . should_equal Value_Type.Date_Time
padded.at "mixed" . value_type . should_equal Value_Type.Mixed


@@ -7,12 +7,14 @@ import project.Common_Table_Operations.Distinct_Spec
import project.Common_Table_Operations.Expression_Spec
import project.Common_Table_Operations.Filter_Spec
import project.Common_Table_Operations.Integration_Tests
import project.Common_Table_Operations.Join_Spec
import project.Common_Table_Operations.Join.Join_Spec
import project.Common_Table_Operations.Join.Cross_Join_Spec
import project.Common_Table_Operations.Join.Zip_Spec
import project.Common_Table_Operations.Join.Union_Spec
import project.Common_Table_Operations.Missing_Values_Spec
import project.Common_Table_Operations.Order_By_Spec
import project.Common_Table_Operations.Select_Columns_Spec
import project.Common_Table_Operations.Take_Drop_Spec
import project.Common_Table_Operations.Union_Spec
from project.Common_Table_Operations.Util import run_default_backend
@@ -95,6 +97,8 @@ spec setup =
Take_Drop_Spec.spec setup
Expression_Spec.spec detailed=False setup
Join_Spec.spec setup
Cross_Join_Spec.spec setup
Zip_Spec.spec setup
Union_Spec.spec setup
Distinct_Spec.spec setup
Integration_Tests.spec setup


@@ -83,11 +83,15 @@ spec setup =
selector = By_Index [0, -7, -6, 1]
action = table.select_columns selector on_problems=_
tester = expect_column_names ["foo", "bar"]
problems = [Input_Indices_Already_Matched.Error [-7, 1]]
problem_checker problem =
problem.should_be_a Input_Indices_Already_Matched.Error
problem.indices.should_contain_the_same_elements_as [-7, 1]
True
err_checker err =
err.catch.should_be_a Input_Indices_Already_Matched.Error
err.catch.indices.should_contain_the_same_elements_as [-7, 1]
Problems.test_advanced_problem_handling action err_checker (x-> x) tester
problem_checker err.catch
warn_checker warnings =
warnings.all problem_checker
Problems.test_advanced_problem_handling action err_checker warn_checker tester
Test.specify "should correctly handle problems: duplicate names" <|
selector = By_Name ["foo", "foo"]


@@ -61,7 +61,7 @@ spec =
r1 = plain_formatter.parse "1E3" Decimal
r1.should_equal Nothing
Warning.get_all r1 . map .value . should_equal [(Invalid_Format.Error Nothing Decimal ["1E3"])]
Problems.get_attached_warnings r1 . should_equal [(Invalid_Format.Error Nothing Decimal ["1E3"])]
exponential_formatter.parse "1E3" . should_equal 1000.0
exponential_formatter.parse "1E3" Decimal . should_equal 1000.0


@@ -35,15 +35,15 @@ spec = Test.group "Table.parse_values" <|
t1_zeros = ["+00", "-00", "+01", "-01", "01", "000", "0010"]
t3 = t1.parse_values column_types=[Column_Type_Selection.Value 0 Integer]
t3.at "ints" . to_vector . should_equal t1_parsed
Warning.get_all t3 . map .value . should_equal [Leading_Zeros.Error "ints" Integer t1_zeros]
Problems.get_attached_warnings t3 . should_equal [Leading_Zeros.Error "ints" Integer t1_zeros]
t4 = t1.parse_values column_types=[Column_Type_Selection.Value 0 Decimal]
t4.at "ints" . to_vector . should_equal t1_parsed
Warning.get_all t4 . map .value . should_equal [Leading_Zeros.Error "ints" Decimal t1_zeros]
Problems.get_attached_warnings t4 . should_equal [Leading_Zeros.Error "ints" Decimal t1_zeros]
t5 = t2.parse_values column_types=[Column_Type_Selection.Value 0 Decimal]
t5.at "floats" . to_vector . should_equal [0.0, 0.0, Nothing, Nothing, Nothing, 1.0]
Warning.get_all t5 . map .value . should_equal [Leading_Zeros.Error "floats" Decimal ["00.", "01.0", '-0010.0000']]
Problems.get_attached_warnings t5 . should_equal [Leading_Zeros.Error "floats" Decimal ["00.", "01.0", '-0010.0000']]
opts = Data_Formatter.Value allow_leading_zeros=True
t1_parsed_zeros = [0, 0, 0, 1, -1, 1, 0, 10, 12345, Nothing]
@@ -203,10 +203,10 @@ spec = Test.group "Table.parse_values" <|
t3 = Table.new [["xs", ["1,2", "1.2", "_0", "0_", "1_0_0"]]]
t4 = t3.parse_values opts column_types=[Column_Type_Selection.Value 0 Decimal]
t4.at "xs" . to_vector . should_equal [1.2, Nothing, Nothing, Nothing, 100.0]
Warning.get_all t4 . map .value . should_equal [Invalid_Format.Error "xs" Decimal ["1.2", "_0", "0_"]]
Problems.get_attached_warnings t4 . should_equal [Invalid_Format.Error "xs" Decimal ["1.2", "_0", "0_"]]
t5 = t3.parse_values opts column_types=[Column_Type_Selection.Value 0 Integer]
t5.at "xs" . to_vector . should_equal [Nothing, Nothing, Nothing, Nothing, 100.0]
Warning.get_all t5 . map .value . should_equal [Invalid_Format.Error "xs" Integer ["1,2", "1.2", "_0", "0_"]]
Problems.get_attached_warnings t5 . should_equal [Invalid_Format.Error "xs" Integer ["1,2", "1.2", "_0", "0_"]]
Test.specify "should allow to specify custom values for booleans" <|
opts_1 = Data_Formatter.Value true_values=["1", "YES"] false_values=["0"]
@@ -217,7 +217,7 @@ spec = Test.group "Table.parse_values" <|
t3 = Table.new [["bools", ["1", "NO", "False", "True", "YES", "no", "oui", "0"]]]
t4 = t3.parse_values opts_1 column_types=[Column_Type_Selection.Value 0 Boolean]
t4.at "bools" . to_vector . should_equal [True, Nothing, Nothing, Nothing, True, Nothing, Nothing, False]
Warning.get_all t4 . map .value . should_equal [Invalid_Format.Error "bools" Boolean ["NO", "False", "True", "no", "oui"]]
Problems.get_attached_warnings t4 . should_equal [Invalid_Format.Error "bools" Boolean ["NO", "False", "True", "no", "oui"]]
whitespace_table =
ints = ["ints", ["0", "1 ", "0 1", " 2"]]
@@ -236,7 +236,7 @@ spec = Test.group "Table.parse_values" <|
t1.at "dates" . to_vector . should_equal [Date.new 2022 1 1, Date.new 2022 7 17, Nothing, Nothing]
t1.at "datetimes" . to_vector . should_equal [Date_Time.new 2022 1 1 11 59, Nothing, Nothing, Nothing]
t1.at "times" . to_vector . should_equal [Time_Of_Day.new 11 0 0, Time_Of_Day.new, Nothing, Nothing]
warnings = Warning.get_all t1 . map .value
warnings = Problems.get_attached_warnings t1
expected_warnings = Vector.new_builder
expected_warnings.append (Invalid_Format.Error "ints" Integer ["0 1"])
expected_warnings.append (Invalid_Format.Error "floats" Decimal ["- 1"])
@@ -256,7 +256,7 @@ spec = Test.group "Table.parse_values" <|
t1.at "dates" . to_vector . should_equal nulls
t1.at "datetimes" . to_vector . should_equal nulls
t1.at "times" . to_vector . should_equal nulls
warnings = Warning.get_all t1 . map .value
warnings = Problems.get_attached_warnings t1
expected_warnings = Vector.new_builder
expected_warnings.append (Invalid_Format.Error "ints" Integer ["1 ", "0 1", " 2"])
expected_warnings.append (Invalid_Format.Error "floats" Decimal ["0 ", " 2.0", "- 1"])


@@ -202,7 +202,7 @@ spec =
positions = [7, 8, 15]
msg = "Encoding issues at codepoints " +
positions.map .to_text . join separator=", " suffix="."
-Warning.get_all result . map .value . should_equal [Encoding_Error.Error msg]
+Problems.get_attached_warnings result . should_equal [Encoding_Error.Error msg]
file.delete
Test.specify "should allow only text columns if no formatter is specified" <|


@@ -2,7 +2,7 @@ from Standard.Base import all
import Standard.Base.Error.Common.Type_Error
import Standard.Base.Error.Time_Error.Time_Error
-from Standard.Test import Test, Test_Suite
+from Standard.Test import Problems, Test, Test_Suite
import Standard.Test.Extensions
import project.Data.Time.Date_Part_Spec
@@ -146,7 +146,7 @@ spec_with name create_new_date parse_date =
is_time_error v = case v of
_ : Time_Error -> True
_ -> False
-expect_warning value = (Warning.get_all value . map .value . any is_time_error) . should_be_true
+expect_warning value = (Problems.get_attached_warnings value . any is_time_error) . should_be_true
dates_before_epoch = [(create_new_date 100), (create_new_date 500 6 3)]
dates_before_epoch.each date->
expect_warning date.week_of_year


@@ -93,7 +93,7 @@ spec =
expected_problems = [Encoding_Error.Error "Encoding issues at bytes 14, 15, 16."]
contents_1 = read_file_one_by_one windows_file encoding expected_contents.length on_problems=Problem_Behavior.Report_Warning
contents_1.should_equal expected_contents
-Warning.get_all contents_1 . map .value . should_equal expected_problems
+Problems.get_attached_warnings contents_1 . should_equal expected_problems
contents_2 = windows_file.with_input_stream [File_Access.Read] stream->
stream.with_stream_decoder encoding Problem_Behavior.Report_Warning reporting_stream_decoder->
@@ -104,7 +104,7 @@ spec =
reporting_stream_decoder.read.should_equal -1
Text.from_codepoints <| [codepoint_1]+codepoints_1+codepoints_2+codepoints_3
contents_2.should_equal expected_contents
-Warning.get_all contents_2 . map .value . should_equal expected_problems
+Problems.get_attached_warnings contents_2 . should_equal expected_problems
Test.specify "should work correctly if no data is read from it" <|
result = windows_file.with_input_stream [File_Access.Read] stream->


@@ -61,7 +61,7 @@ spec =
stream.with_stream_encoder encoding Problem_Behavior.Report_Warning reporting_stream_encoder->
reporting_stream_encoder.write contents
result.should_succeed
-Warning.get_all result . map .value . should_equal [Encoding_Error.Error "Encoding issues at codepoints 1, 3."]
+Problems.get_attached_warnings result . should_equal [Encoding_Error.Error "Encoding issues at codepoints 1, 3."]
f.read_text encoding . should_equal "S?o?wka!"
f.delete_if_exists
@@ -74,7 +74,7 @@ spec =
reporting_stream_encoder.write "bar"
result_2.should_succeed
-Warning.get_all result_2 . map .value . should_equal [Encoding_Error.Error "Encoding issues at codepoints 3, 9."]
+Problems.get_attached_warnings result_2 . should_equal [Encoding_Error.Error "Encoding issues at codepoints 3, 9."]
f.read_text encoding . should_equal "ABC?foo -?- bar"
Test.specify "should work correctly if no data is written to it" <|