Commit Graph

155 Commits

Author SHA1 Message Date
Radosław Waśko
8db2ad51a1
Adding typechecks to Column Operations (#6298)
Closes #6106
2023-04-21 12:20:12 +00:00
James Dunkerley
0350762386
Add replace, trim to Column. Better number parsing. (#6253)
- Add `replace` with same syntax as on `Text` to an in-memory `Column`.
- Add `trim` with same syntax as on `Text` to an in-memory `Column`.
- Add `trim` to in-database `Column`.
- Added `is_supported` to dialects and exposed the dialect consistently on the `Connection`.
- Add `write_table` support to `JSON_File` allowing `Table.write` to write JSON.
- Updated the parsing for integers and decimals:
- Support for currency symbols.
- Support for brackets for negative numbers.
- Automatic detection of decimal points and thousand separators.
- Tighter rules for scientific and thousand separated numbers.
- Remove `replace_text` from `Table`.
- Remove `write_json` from `Table`.
2023-04-20 16:04:59 +00:00
Radosław Waśko
f5db35af07
Adjust {Table|Column}.parse to use Value_Type (#6213)
Closes #5660
2023-04-06 10:58:55 +00:00
Jaroslav Tulach
4805193428
Text.to_display_text is (shortened) identity (#6174)
Fixes #5971.
2023-04-05 19:53:07 +00:00
GregoryTravis
d9bc5246ba
Remove old (Java) Regex library and replace with new (Truffle) library. (#6195)
Remove old (Java) Regex library and replace with new (Truffle) library.
2023-04-04 19:58:26 +00:00
GregoryTravis
fb77f42fd5
Update Text.split to take a Vector Text parameter (#6156)
Allows you to pass a vector of delimiters to `split`.
2023-04-04 14:44:47 +00:00
James Dunkerley
f26bcf6ab6
Small issues from working with Ned (#6160)
- `Process.run` now returns a `Process_Result` allowing the easy capture of stdout and stderr.
- Joining a column with a column name does not warn if adding just the prefix.
- Stop the table viz from changing case and adding spaces to the headers.
2023-04-03 13:01:42 +00:00
Radosław Waśko
6ddcb553e5
Date/time support for Postgres. Year/month/day operations on Columns. (#6153)
Closes #6115
2023-03-31 18:37:04 +00:00
Radosław Waśko
6f86115498
Proper implementation of Value Types in Table (#6073)
This is the first part of the #5158 umbrella task. It closes #5158, follow-up tasks are listed as a comment in the issue.

- Updates all prototype methods dealing with `Value_Type` with a proper implementation.
- Adds a more precise mapping from in-memory storage to `Value_Type`.
- Adds a dialect-dependent mapping between `SQL_Type` and `Value_Type`.
- Removes obsolete methods and constants on `SQL_Type` that were not portable.
- Ensures that in the Database backend, operation results are computed based on what the Database is meaning to return (by asking the Database about expected types of each operation).
- But also ensures that the result types are sane.
- While SQLite does not officially support a BOOLEAN affinity, we add a set of type overrides to our operations to ensure that Boolean operations will return Boolean values and will not be changed to integers as SQLite would suggest.
- Some methods in SQLite fallback to a NUMERIC affinity unnecessarily, so stuff like `max(text, text)` will keep the `text` type instead of falling back to numeric as SQLite would suggest.
- Adds ability to use custom fetch / builder logic for various types, so that we can support vendor specific types (for example, Postgres dates).

# Important Notes
- There are some TODOs left in the code. I'm still aligning follow-up tasks - once done I will try to add references to relevant tasks in them.
2023-03-31 16:16:18 +00:00
GregoryTravis
6b9cbeacb2
Implement Regular Expression replace and update Text.replace to the new API (#5959)
Re-implement replace on top of Truffle regex.
2023-03-28 06:13:12 +00:00
James Dunkerley
bf2545fa04
Use new common parse method throwing less exceptions. (#6075)
Avoiding exceptions by not using parseBest.

Time now in CLI is 1.15s for 500k rows vs 1.65s in GUI.

CLI:
![image](https://user-images.githubusercontent.com/4699705/227711266-bc005b0d-5011-450f-964b-65dd2e437c2e.png)

GUI:
![image](https://user-images.githubusercontent.com/4699705/227711259-f7ddda29-86c7-4eef-a002-4bf0bda6063f.png)

Added it as a function in the shared library so used by both engine and polyglot.
2023-03-27 11:02:10 +00:00
James Dunkerley
58f2c7643f
Use new Enso Hash Codes and Comparable (#6060)
Enables `distinct`, `aggregate` and `cross_tab` to use the Enso hashing and equality operations.
Also, I rewired the way the ObjectComparators are obtained in polyglot code to be more consistent.

Add Comparator for `Day_Of_Week`, `Header`, `SQL_Type`, `Image` and `Matrix`.
Also, removed the custom `==` from these types as needed. (Closes #5626)
2023-03-24 15:02:25 +00:00
Radosław Waśko
952beba8d1
Fix cross_tab column naming edge cases, add fill_empty (#5863)
Closes #5151 and adds some additional tests for `cross_tab` that verify duplicated and invalid names.

I decided that for empty or `Nothing` names, instead of replacing them with `Column` and implicitly losing connection with the value that was in the column, we should just error on such values.

To make handling of these easier, `fill_empty` was added allowing to easily replace the empty values with something else.

Also, `{is,fill}_missing` was renamed to `{is,fill}_nothing` to align with `Filter_Condition.Is_Nothing`.
2023-03-11 11:58:54 +00:00
Radosław Waśko
263c3ad651
Add a common-polyglot-core-utils project (#5855)
Adds a common project that allows sharing code between the `runtime` and `std-bits`.

Due to classpath separation and the way it is compiled, the classes will be duplicated - we will have one copy for the `runtime` classpath and another copy as a small JAR for `Standard.Base` library.

This is still much better than having the code duplicated - now at least we have a single source of truth for the shared implementations.

Due to the copying we should not expand this project too much, but I encourage to put here any methods that would otherwise require us to copy the code itself.

This may be a good place to put parts of the hashing logic to then allow sharing the logic between the `runtime` and the `MultiValueKey` in the `Table` library (cc: @Akirathan).
2023-03-11 09:27:26 +00:00
Radosław Waśko
91ef8acf35
Review generated Column names (#5850)
Closes #5583 and closes #5157
2023-03-10 19:07:58 +00:00
Radosław Waśko
62e57f5557
Test some Mismatched Quote edge cases in Delimited reader (#5810)
Follow-up to #5113 - I add some more edge case tests as we discussed with @jdunkerley

When debugging some quoting issues, I also realised the current `Mismatched_Quote` error provided not enough information. So I amended it to at least include some context indicating which was the 'offending' cell.
2023-03-10 15:47:57 +00:00
James Dunkerley
299bfd6b7d
Fixes from the Demo on 2nd March (#5823)
- Fix issue with Geo Map viz.
- Handle invalid format strings better in `Data_Formatter`.
- New constants for the ISO format strings (and a special ENSO_ZONED_DATE_TIME)
- Consistent Date Time format for parsing in all places.
- Avoid throwing exception in datetime parsing.
- Support for milliseconds (well nanoseconds) in Date_Time and Time_Of_Day.
- `Column.map` stays within Enso.
- Allow `Aggregate_Column.Group_By` in `cross_tab` group_by parameter.
2023-03-07 20:58:00 +00:00
Pavel Marek
b6e2319fcc
Comparators support partial ordering (#5778) 2023-03-07 04:16:38 +00:00
Radosław Waśko
2d29456ed1
Review File/Data read and read_text warnings (#5799)
Closes #5113

Fixes a bug where read-only files would be overwritten if File.write was used in backup mode, and added tests to avoid such regression. To implement it, introduced a `is_writable` property on `File`.
2023-03-06 03:43:38 +00:00
James Dunkerley
01fc34c18a
Improving Expression Support for In Database (#5790)
- Adjust Excel Workbook write behaviour.
- Support Nothing / Null constants.
- Deduce the type of arithmetic operations and `iif`.
- Allow Date_Time constants, treating as local timezone.
- Removed the `to_column_name` and `ensure_sane_name` code.
2023-03-03 12:03:05 +00:00
Radosław Waśko
793eafc866
Improve Table.parse_values API (#5692)
Closes #5111
2023-02-24 13:35:01 +00:00
James Dunkerley
652b8d5db3
Update rename_columns to new API design, add first_row, second_row and last_row functions to the table. (#5719)
- Updates the `rename_columns` API.
- Add `first_row`, `second_row` and `last_row` to the Table types.
- New option for reading only last row of ResultSet.
2023-02-23 19:42:45 +00:00
Radosław Waśko
4dcf802831
Ensure that warnings are preserved on Nothing values passing back to Enso through polyglot boundary (#5677)
Fixes #5672

# Important Notes
- Added a subproject `enso-test-java-helpers` which allows the in-Enso tests to add Java helpers for testing.
2023-02-17 13:38:26 +00:00
Radosław Waśko
3027c6f3a2
Ensure entries containing newlines are quoted when writing Delimited Files (#5652)
Fixes #5638
2023-02-17 00:57:48 +00:00
James Dunkerley
1bc27501e6
Remove Column type from Aggregate_Column, simplify Column_Selector, some new File_Formats (#5646)
- Updated `Widget.Vector_Editor` ready for use by IDE team.
- Added `get` to `Row` to make API more aligned.
- Added `first_column`, `second_column` and `last_column` to `Table` APIs.
- Adjusted `Column_Selector` and associated methods to have simpler API.
- Removed `Column` from `Aggregate_Column` constructors.
- Added new `Excel_Workbook` type and added to `Excel_Section`.
- Added new `SQLiteFormatSPI` and `SQLite_Format`.
- Added new `IamgeFormatSPI` and `Image_Format`.
2023-02-16 15:15:49 +00:00
Radosław Waśko
a02eab451e
Implement basic warnings for column arithmetic, review warnings on expressions and filter (#5605)
Closes #5109

# Important Notes
- Currently the tests pass for the in-memory parts of Common_Table_Operations, but still some stuff not working on DB backends - in progress.
2023-02-14 09:33:04 +00:00
James Dunkerley
1c821e22cf
Some fixed form the Anagrams experiment. (#5592)
- Fixes the display of Date, Time_Of_Day and Date_Time so doesn't wrap.
- Adjust serialization of large integer values for JS and display within table.
- Workaround for issue with using `.lines` in the Table (new bug filed).
- Disabled warning on no specified `separator` on `Concatenate`.

Does not include fix for aggregation on integer values outside of `long` range.
2023-02-08 22:17:00 +00:00
Radosław Waśko
4f90946d1e
Rework Invalid Aggregations (#5579)
Closes #5108
2023-02-08 18:39:09 +00:00
Radosław Waśko
778d28fba3
Table with no columns is not valid, No_Output_Columns is always an error (#4073)
Implements https://www.pivotaltracker.com/story/show/184226020
2023-01-25 02:40:23 +00:00
Radosław Waśko
d2e57edc8b
Add Table.cross_join and Table.zip to In-Memory Table (#4063)
Implements https://www.pivotaltracker.com/story/show/184239059
2023-01-23 13:19:52 +00:00
Radosław Waśko
8853053020
Division in Columns within InDB is integer based if both columns are integers (#4057)
Fixes https://www.pivotaltracker.com/story/show/184073099

# Important Notes
- Since now the only operator on columns for division, `/`, returns floats, it may be worth creating an additional `div` operator exposing integer division. But that will be done as a separate task aligning column operator APIs.
2023-01-17 20:29:25 +00:00
Radosław Waśko
082e0bfd0d
Add Table.union to the In-Memory Table. (#4052)
Implements https://www.pivotaltracker.com/story/show/183854144
2023-01-17 00:34:57 +00:00
Radosław Waśko
0088096a58
Implement Distinct for the Database backends (#4027)
Implements https://www.pivotaltracker.com/story/show/182307281
2023-01-11 22:46:54 +00:00
Radosław Waśko
8c661fdb74
Database Joins (#4007)
Implements https://www.pivotaltracker.com/story/show/184032869

# Important Notes
- Currently we get failures in Full joins on Postgres which show a more serious problem - amending equality to ensure that `[NULL = NULL] == True` breaks hash/merge based indexing - so such joins will be extremely inefficient. All our joins currently rely on this notion of equality which will mean all of our DB joins will be extremely inefficient.
- We need to find a solution that will support nulls and still work OK with indices (but after exploring a few approaches: `COALESCE(a = b, a IS NULL AND b is NULL)`, `a IS NOT DISTINCT FROM b`, `(a = b) OR (a IS NULL AND b is NULL)`; all of which did not work (they all result in `ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions`) I'm less certain that it is possible. Alternatively, we may need to change the NULL semantics to align it with SQL - this seems like likely the simpler solution, allowing us to generate simple, reliable SQL - the NULL=NULL solution will be cornering us into nasty workarounds very dependent on the particular backend.
2023-01-05 10:36:22 +00:00
Dmitry Bushev
1e5e2327ab
Improve performance of Text.compare_to (#4012)
PR adds a flag to `Text` implementation tracking whether it is in a FCD normal form. Then this information can be used in the `Normalizer.compare` method.



| Benchmark name | Old (ms) | With flag (ms)
| --- | --- | ---
| Unicode very short | 40.29 | 40.04
| Unicode medium | 9.07 | 1.99
| Unicode big - random | 115.39 | 0.35
| Unicode big - early difference | 107.02 | 0.54
| Unicode big - late difference | 749.81 | 94.73
| ASCII very short | 28.13 | 31.13
| ASCII medium | 4.58 | 2.26
| ASCII big - random | 42.68 | 0.26
| ASCII big - early difference | 30.91 | 0.32
| ASCII big - late difference | 66.29 | 42.72

Full benchmark output.
[bench_old.txt](https://github.com/enso-org/enso/files/10325202/bench_old.txt)
[bench_new.txt](https://github.com/enso-org/enso/files/10325201/bench_new.txt)
2023-01-02 17:09:03 +00:00
Jaroslav Tulach
7252af6d62
Enso.getMetaObject, Type.isMetaInstance and Meta.is_a consolidation (#3949)
Implements `getMetaObject` and related messages from Truffle interop for Enso values and types. Turns `Meta.is_a` into builtin and re-uses the same functionality.

# Important Notes
Adds `ValueGenerator` testing infrastructure to provide unified access to special Enso values and builtin types that can be reused by other tests, not just `MetaIsATest` and `MetaObjectTest`.
2022-12-22 08:00:06 +00:00
James Dunkerley
579d3fc397
Adds Date, Time_Of_Day and Date_Time support to Excel IO (#3997)
- Allow date time inputs from Excel.
- Enables disabled test.
- Fix for Map.==.
- Allow nulls in crosstab name.
2022-12-20 16:12:00 +00:00
James Dunkerley
ace459ed53
Let JavaScript parse JSON and write JSON ... (#3987)
Use JavaScript to parse and serialise to JSON. Parses to native Enso object.
- `.to_json` now returns a `Text` of the JSON.
- Json methods now `parse`, `stringify` and `from_pairs`.
- New `JSON_Object` representing a JavaScript Object.
- `.to_js_object` allows for types to custom serialize. Returning a `JS_Object`.
- Default JSON format for Atom now has a `type` and `constructor` property (or method to call for as needed to deserialise).
- Removed `.into` support for now.
- Added JSON File Format and SPI to allow `Data.read` to work.
- Added `Data.fetch` API for easy Web download.
- Default visualization for JS Object trunctes, and made Vector default truncate children too.

Fixes defect where types with no constructor crashed on `to_json` (e.g. `Matching_Mode.Last.to_json`.
Adjusted default visualisation for Vector, so it doesn't serialise an array of arrays forever.
Likewise, JS_Object default visualisation is truncated to a small subset.

New convention:
- `.get` returns `Nothing` if a key or index is not present. Takes an `other` argument allowing control of default.
- `.at` error if key or index is not present.
- `Nothing` gains a `get` method allowing for easy propagation.
2022-12-20 10:33:46 +00:00
Radosław Waśko
b9bf958f2c
Efficient joining for Equals and Equals_Ignore_Case using a hashmap (#3978)
- Implemented https://www.pivotaltracker.com/story/show/183913276
- Refactored MultiValueIndex and MultiValueKeys to be more type-safe and more direct about using ordered or unordered maps.
- Added performance tests ensuring we use an efficient algorithm for the joins (the tests will fail for a full O(N*M) scan).
- Removed some duplicate code in the Table library.
- Added optional coloring of test results in terminal to make failures easier to spot.
2022-12-14 22:56:20 +00:00
James Dunkerley
77fe69dfd9
JSON Improvements, small Table stuff, Statistic in Enso not Java and few other minor bits. (#3964)
- Aligned `compare_to` so returns `Type_Error` if `that` is wrong type for `Text`, `Ordering` and `Duration`.
- Add `empty_object`, `empty_array`. `get_or_else`, `at`, `field_names` and `length` to `Json`.
- Fix `Json` serialisation of NaN and Infinity (to "null").
- Added `length`, `at` and `to_vector` to Pair (allowing it to be treated as a Vector).
- Added `running_fold` to the `Vector` and `Range`.
- Added `first` and `last` to the `Vector.Builder`.
- Allow `order_by` to take a single `Sort_Column` or have a mix of `Text` and `Sort_Column.Name` in a `Vector`.
- Allow `select_columns_helper` to take a `Text` value. Allows for a single field in group_by in cross_tab.
- Added `Patch` and `Custom` to HTTP_Method.
- Added running `Statistic` calculation and moved more of the logic from Java to Enso. Performance seems similar to pure Java version now.
2022-12-14 19:40:27 +00:00
Radosław Waśko
8e880e430b
Improve basic join implementation (#3958)
Implements https://www.pivotaltracker.com/story/show/183913232

# Important Notes
Added counts of succeeded/failed tests within a group and global summary, to easier see how many tests failed.
2022-12-09 00:55:07 +00:00
James Dunkerley
11e07f8676
Use the MultiValueIndex for the JoinStrategy. (#3959)
Use the MultiValueStrategy for pure equals Joins.
2022-12-08 12:24:53 +00:00
James Dunkerley
0ad70c6332
Tidy Standard.Base part 5 of n ... (hopefully the end...) (#3929)
- Moved `Any`, `Error` and `Panic` to `Standard.Base`.
- Separated `Json` and `Range` extensions into own modules.
- Tidied `Case`, `Case_Sensitivity`, `Encoding`, `Matching`, `Regex_Matcher`, `Span`, `Text_Matcher`, `Text_Ordering` and `Text_Sub_Range` in `Standard.Base.Data.Text`.
- Tidied `Standard.Base.Data.Text.Extensions` and stopped it re-exporting anything.
- Tidied `Regex_Mode`. Renamed `Option` to `Regex_Option` and added type to export.
- Tidied up `Regex` space.
- Tidied up `Meta` space.
- Remove `Matching` from export.
- Moved `Standard.Base.Data.Boolean` to `Standard.Base.Boolean`.

# Important Notes
- Moved `to_json` and `to_default_visualization_data` from base types to extension methods.
2022-12-02 18:08:14 +00:00
James Dunkerley
4518f8303d
Implementing transpose and cross_tab for the InMemory table. (#3919)
- Adds transpose and cross_tab to the In-Memory table.
- Cross Tab is built on top of aggregate and hence allows for expressions and has same error trapping as in aggregate.

# Important Notes
Only basic tests have been implemented. Error and warning tests will be added as a follow up task.
2022-11-30 01:19:25 +00:00
Radosław Waśko
85cbf7d9f9
Initial (naive) implementation for in memory join (#3918)
Implements https://www.pivotaltracker.com/story/show/183854123

It features a naive full scan join and only allows equality conditions. More advanced conditions and better optimized algorithms will be implemented in a subsequent PR.
2022-11-29 19:37:31 +00:00
Jaroslav Tulach
35c9ef7680
Enhanced Vector Builder (#3809)
Manual implementation of vector builder that avoid any copying (if the initial `capacity` is exact). Moreover the builder optimizes for  storage of `double` and `long` values - if the array homogeneously consists of these values, then no boxing happens and only primitive types are stored.

# Important Notes
Added few tests to [Vector_Spec.enso](76d2f38247).
2022-11-29 04:41:06 +00:00
Jaroslav Tulach
ecd1fdc3f8
Caching the grapheme_length of a Text (#3864)
Computing length of a text takes time. Let's cache it after first computation.

# Important Notes
Wrote `StringBenchmarks` that sums lengths of (the same) `Text` present many time in a `Vector`. Initially it took `383.673 ms` per operation. Then it took `0.031 ms/op`. Looks like the `length` calls are returning instantly as they get cached.
2022-11-14 15:53:10 +00:00
James Dunkerley
45276b243d
Expanding Derived Columns and Expression Syntax (#3782)
- Added expression ANTLR4 grammar and sbt based build.
- Added expression support to `set` and `filter` on the Database and InMemory `Table`.
- Added expression support to `aggregate` on the Database and InMemory `Table`.
- Removed old aggregate functions (`sum`, `max`, `min` and `mean`) from `Column` types.
- Adjusted database `Column` `+` operator to do concatenation (`||`) when text types.
- Added power operator `^` to both `Column` types.
- Adjust `iif` to allow for columns to be passed for `when_true` and `when_false` parameters.
- Added `is_present` to database `Column` type.
- Added `coalesce`, `min` and `max` functions to both `Column` types performing row based operation.
- Added support for `Date`, `Time_Of_Day` and `Date_Time` constants in database.
- Added `read` method to InMemory `Column` returning `self` (or a slice).

# Important Notes
- Moved approximate type computation to `SQL_Type`.
- Fixed issue in `LongNumericOp` where it was always casting to a double.
- Removed `head` from InMemory Table (still has `first` method).
2022-11-08 15:57:59 +00:00
Pavel Marek
f8a4e2a9d2
Add Period type (#3818)
This PR adds `Period` type, which is a date-only complement to `Duration` builtin type.

# Important Notes
- `Period` replaces `Date_Period`, and `Time_Period`.
- Added shorthand constructors for `Duration` and `Period`. For example: `Period.days 10` instead of `Period.new days=10`.
- `Period` can be compared to other `Period` in some cases, other cases throw an error.
2022-10-28 17:27:20 +00:00
Radosław Waśko
2bc0611869
Add support for using Columns within Is_In (#3822)
Implements https://www.pivotaltracker.com/story/show/183560222
2022-10-24 12:51:15 +00:00