csv: doc: clean up/expand manual after #1095

[ci skip]
2024-12-26 03:42:25 +03:00 · 2019-11-06 13:08:54 -08:00 · 2019-11-06 13:08:54 -08:00 · d92351e21a
commit d92351e21a
parent dcfc833d92
1 changed files with 351 additions and 220 deletions
--- a/hledger-lib/hledger_csv.m4.md
+++ b/hledger-lib/hledger_csv.m4.md
@@ -25,7 +25,7 @@ Converting CSV to transactions requires some special conversion rules.
 These do several things:

 - they describe the layout and format of the CSV data
-- they can customize the generated journal entries using a simple templating language
+- they can customize the generated journal entries (transactions) using a simple templating language
 - they can add refinements based on patterns in the CSV data, eg categorizing transactions with more detailed account names.

 When reading a CSV file named `FILE.csv`, hledger looks for a
@@ -38,12 +38,245 @@ At minimum, the rules file must identify the date and amount fields.
 It's often necessary to specify the date format, and the number of header lines to skip, also.
 Eg:
 ```
-fields date, _, _, amount1
+fields date, _, _, amount
 date-format  %d/%m/%Y
 skip 1
 ```

-A more complete example:
+More examples in the EXAMPLES section below.
+
+# CSV RULES
+
+The following kinds of rule can appear in the rules file, in any order
+(except for `end` which can appear only inside a conditional block).
+Blank lines and lines beginning with `#` or `;` are ignored.
+
+## `skip`
+
+```rules
+skip N
+```
+The word "skip" followed by a number (or no number, meaning 1)
+tells hledger to ignore this many non-empty lines preceding the CSV data.
+(Empty/blank lines are skipped automatically.)
+You'll need this whenever your CSV data contains header lines.
+
+It also has a second purpose: it can be used to ignore certain CSV
+records, see [conditional blocks](#if) below.
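+
+For instance, assuming a hypothetical CSV export that begins with a title line
+and a column heading line, both could be skipped like this (a sketch; adjust N to your file):
+```rules
+# ignore the title line and the column heading line preceding the CSV data
+skip 2
+```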
+
+## `fields`
+
+```rules
+fields FIELDNAME1, FIELDNAME2, ...
+```
+A fields list ("fields" followed by one or more comma-separated field names) is the quick way to assign CSV field values to hledger fields.
+It  (a) names the CSV fields, in order (names may not contain whitespace; fields you don't care about can be left unnamed),
+and (b) assigns them to hledger fields if you use standard hledger field names.
+Here's an example:
+```rules
+# use the 1st, 2nd and 4th CSV fields as the transaction's date, description and amount,
+# ignore the 3rd, 5th and 6th fields,
+# and name the 7th and 8th fields for later reference:
+#      1     2           3  4       5 6  7          8
+
+fields date, description, , amount1, , , somefield, anotherfield
+```
+
+Here are the standard hledger field names:
+
+### Transaction fields
+
+`date`, `date2`, `status`, `code`, `description`, `comment` can be used to form the
+[transaction's](journal.html#transactions) first line. Only `date` is required.
+(See also [date-format](#date-format) below.)
+
+### Posting fields
+
+`accountN`, where N is 1 to 9, sets the Nth [posting's](journal.html#postings) account name.
+Most often there are two postings, so you'll want to set `account1` and `account2`.
+<!-- (Often, `account1` is fixed and `account2` will be set later by a [conditional block](#if).) -->
+
+A number of field/pseudo-field names are available for setting posting [amounts](journal.html#amounts):
+
+- `amountN` sets posting N's amount
+- `amountN-in` and `amountN-out` can be used instead, if the CSV has separate fields for debits and credits
+- `currencyN` sets a currency symbol to be left-prefixed to the amount, useful if the CSV provides that as a separate field
+- `balanceN` sets a (separate) [balance assertion](journal.html#balance-assertions) amount 
+   (or when no posting amount is set, a [balance assignment](journal.html#balance-assignments))
+
+If you write these with no number
+(`amount`, `amount-in`, `amount-out`, `currency`, `balance`),
+it means posting 1.
+Also, if you set an amount for posting 1 only, 
+a second posting that balances the transaction will be generated automatically.
+This helps support CSV rules created before hledger 1.16.
+<!-- XXX check exact behaviour, eg in three-posting example below -->
+
+Finally, `commentN` sets a [comment](journal.html#comments) on the Nth posting. 
+Comments can of course contain [tags](journal.html#tags).
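+
+As an illustration, a CSV with separate withdrawal and deposit columns
+(a hypothetical layout) might be handled with a fields list like this sketch:
+```rules
+# hypothetical CSV columns: date, description, withdrawal, deposit
+fields date, description, amount-out, amount-in
+```
+Here the unnumbered names refer to posting 1, as described above.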
+
+## `(field assignment)`
+
+```rules
+HLEDGERFIELDNAME FIELDVALUE
+```
+
+Instead of or in addition to a [fields list](#fields), you can
+assign a value to a hledger field by writing its name
+(any of the standard names above) followed by a text value.
+The value may contain interpolated CSV fields, 
+referenced by their 1-based position in the CSV record (`%N`),
+or by the name they were given in the fields list (`%CSVFIELDNAME`).
+Eg:
+```rules
+# set the amount to the 4th CSV field, with " USD" appended
+amount %4 USD
+```
+```rules
+# combine three fields to make a comment, containing note: and date: tags
+comment note: %somefield - %anotherfield, date: %1
+```
+Interpolation strips any outer whitespace, so a CSV value like `" 1 "`
+becomes `1` when interpolated
+([#1051](https://github.com/simonmichael/hledger/issues/1051)).
+Note you can only interpolate CSV fields, not the hledger fields being assigned to;
+for more on this, see [TIPS](#tips).
+
+## `date-format`
+
+```rules
+date-format DATEFMT
+```
+This is a helper for the `date` (and `date2`) fields.
+If your CSV dates are not formatted like `YYYY-MM-DD`, `YYYY/MM/DD` or `YYYY.MM.DD`,
+you'll need to specify the format by writing "date-format" followed by 
+a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime),
+which must parse the date field values completely. Examples:
+
+``` rules
+# for dates like "11/06/2013":
+date-format %m/%d/%Y
+```
+
+``` rules
+# for dates like "6/11/2013". The - allows leading zeros to be optional.
+date-format %-d/%-m/%Y
+```
+
+``` rules
+# for dates like "2013-Nov-06":
+date-format %Y-%h-%d
+```
+
+``` rules
+# for dates like "11/6/2013 11:32 PM":
+date-format %-m/%-d/%Y %l:%M %p
+```
+
+## `if`
+
+```rules
+if PATTERN
+ RULE
+
+if
+PATTERN
+PATTERN
+PATTERN
+ RULE
+ RULE
+```
+
+Conditional blocks apply one or more rules to CSV records which are
+matched by any of the PATTERNs. This allows transactions to be
+customised or categorised based on patterns in the data.
+
+A single pattern can be written on the same line as the "if";
+or multiple patterns can be written on the following lines, non-indented.
+
+Patterns are case-insensitive [regular expressions](hledger.html#regular-expressions)
+which try to match any part of the whole CSV record.
+It's not yet possible to match within a specific field.
+Note the CSV record they see is close but not identical to the one in the CSV file;
+eg double quotes are removed, and the separator character becomes comma.
+
+After the patterns, there should be one or more rules to apply, all
+indented by at least one space. Three kinds of rule are allowed in
+conditional blocks:
+
+- [field assignments](#field-assignment) (to set a field's value)
+- [skip](#skip) (to skip the matched CSV record)
+- [end](#end) (to skip all remaining CSV records).
+
+Examples:
+```rules
+# if the CSV record contains "groceries", set account2 to "expenses:groceries"
+if groceries
+ account2 expenses:groceries
+```
+```rules
+# if the CSV record contains any of these patterns, set account2 and comment as shown
+if
+monthly service fee
+atm transaction fee
+banking thru software
+ account2 expenses:business:banking
+ comment  XXX deductible ? check it
+```
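+
+A `skip` rule can be used in a conditional block in the same way.
+In this sketch, the pattern "pending" is only an assumed marker in the CSV data,
+not something hledger defines:
+```rules
+# ignore any CSV record containing "pending"
+if pending
+ skip
+```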

+
+## `end`
+
+As mentioned above, this rule can be used inside conditional blocks
+(only) to cause hledger to stop reading CSV records and proceed with
+command execution. Eg:
+```rules
+# ignore everything following the first empty record
+if ,,,,
+ end
+```
+
+## `include`
+
+```rules
+include RULESFILE
+```
+
+Include another CSV rules file at this point, as if it were written inline. 
+`RULESFILE` is an absolute file path or a path relative to the current file's directory.
+
+This can be useful eg for reusing common rules in several rules files:
+```rules
+# someaccount.csv.rules
+
+## someaccount-specific rules
+fields date,description,amount
+account1 some:account
+account2 some:misc
+
+## common rules
+include categorisation.rules
+```
+
+## `newest-first`
+
+hledger always sorts the generated transactions by date.
+Transactions on the same date should appear in the same order as their CSV records,
+as hledger can usually auto-detect whether the CSV's normal order is oldest first or newest first.
+But if all of the following are true:
+
+- the CSV might sometimes contain just one day of data (all records having the same date)
+- the CSV records are normally in reverse chronological order (newest first)
+- and you care about preserving the order of same-day transactions
+
+you should add the `newest-first` rule as a hint. Eg:
+```rules
+# tell hledger explicitly that the CSV is normally newest-first
+newest-first
+```
+
+# EXAMPLES
+
+A more complete example, generating three-posting transactions:
 ```
 # hledger CSV rules for amazon.com order history

@@ -79,264 +312,162 @@ comment3    fees

 For more examples, see [Convert CSV files](https://github.com/simonmichael/hledger/wiki/Convert-CSV-files).

+# TIPS

-# CSV RULES
+## Reading multiple CSV files

-The following seven kinds of rule can appear in the rules file, in any order.
-Blank lines and lines beginning with `#` or `;` are ignored.
+You can read multiple CSV files at once using multiple `-f` arguments on the command line.
+hledger will look for a correspondingly-named rules file for each CSV file.
+If you use the `--rules-file` option, that rules file will be used for all the CSV files.

-## skip
+## Deduplicating, importing

-`skip `*`N`*
-
-Skip this many non-empty lines preceding the CSV data.
-(Empty/blank lines are skipped automatically.)
-You'll need this whenever your CSV data contains header lines. Eg:
-<!-- XXX -->
-<!-- hledger tries to skip initial CSV header lines automatically. -->
-<!-- If it guesses wrong, use this directive to skip exactly N lines. -->
-This can also be used in a conditional block to ignore certain CSV records.
-```rules
-# ignore the first CSV line
-skip 1
+When you download a CSV file repeatedly, eg to get your latest bank
+transactions, the new file may contain some of the same records as the
+old one. The [print --new](hledger.html#print) command is one simple
+way to detect just the new transactions. Or better still, the
+[import](hledger.html#import) command appends those new transactions
+to your main journal. This is the easiest way to import CSV data. Eg,
+after downloading your latest CSV files:
+```shell
+$ hledger import *.csv [--dry]
 ```

-## date-format
+## Other import methods

-`date-format `*`DATEFMT`*
+A number of other tools and workflows, hledger-specific and otherwise,
+exist for converting, deduplicating, classifying and managing CSV data.
+See:

-When your CSV date fields are not formatted like `YYYY/MM/DD` (or `YYYY-MM-DD` or `YYYY.MM.DD`),
-you'll need to specify the format.
-DATEFMT is a [strptime-like date parsing pattern](http://hackage.haskell.org/packages/archive/time/latest/doc/html/Data-Time-Format.html#v:formatTime),
-which must parse the date field values completely. Examples:
+- <https://hledger.org> -> sidebar -> real world setups
+- <https://plaintextaccounting.org> -> data import/conversion

-``` rules
-# for dates like "11/06/2013":
-date-format %m/%d/%Y
+## Valid CSV
+
+hledger accepts CSV conforming to [RFC 4180](https://tools.ietf.org/html/rfc4180).
+Some things to note when values are enclosed in quotes:
+
+- you must use double quotes (not single quotes)
+- spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field)
+
+## Other separator characters
+
+With the `--separator 'CHAR'` option, hledger will expect the
+separator to be CHAR instead of a comma. Ie it will read other
+"Character Separated Values" formats, such as TSV (Tab Separated Values).
+Note: on the command line, use a real tab character in quotes, not \t. Eg:
+```shell
+$ hledger -f foo.tsv --separator '	' print
 ```
+(Experimental.)

-``` rules
-# for dates like "6/11/2013" (note the - to make leading zeros optional):
-date-format %-d/%-m/%Y
-```
+## Setting amounts

-``` rules
-# for dates like "2013-Nov-06":
-date-format %Y-%h-%d
-```
+A posting amount can be set in one of these ways:

-``` rules
-# for dates like "11/6/2013 11:32 PM":
-date-format %-m/%-d/%Y %l:%M %p
-```
+- by assigning (with a fields list or field assignment) to
+  `amountN` (posting N's amount) or `amount` (posting 1's amount)

-## field list
+- by assigning to `amountN-in` and `amountN-out` (or `amount-in` and `amount-out`).
+  For each CSV record, whichever of these has a non-zero value will be used, with appropriate sign. 
+  If both contain a non-zero value, this may not work.

-`fields `*`FIELDNAME1`*, *`FIELDNAME2`*...
-
-This (a) names the CSV fields, in order (names may not contain whitespace; uninteresting names may be left blank),
-and (b) assigns them to journal entry fields if you use any of these standard field names:
-
-Fields `date`, `date2`, `status`, `code`, `description` will form transaction description.
-
-An assignment to any of `accountN`, `amountN`, `amountN-in`, `amountN-out`, `balanceN` or `currencyN` will generate a posting (though it's your responsibility to ensure it is a well formed one). Normally the `N`'s are consecutive starting from 1 but it's not required. One posting will be generated for each unique `N`. If you wish to supply a comment for the posting, use `commentN`, though comment on its own will not cause posting to be generated.
-
-Fields `amount`, `amount-in`, `amount-out`, `currency`, `balance` and `comment` are treated as aliases for `amount1`, and so on. If your rules file leads to both aliased fields having different values, `hledger` will raise an error.
-
-Eg:
-```rules
-# use the 1st, 2nd and 4th CSV fields as the entry's date, description and amount,
-# and give the 7th and 8th fields meaningful names for later reference:
-#
-# CSV field:
-#      1     2            3 4       5 6 7          8
-# entry field:
-fields date, description, , amount1, , , somefield, anotherfield
-```
-
-For backwards compatibility, we treat posting 1 specially. If your rules generated just posting 1, another posting would be added to your transaction to balance it. If your rules generated posting 1 and posting 2, but amount in the posting 2 is empty, hledger will fill it out with the opposite of posting 1. This special handling is needed to ensure smooth upgrade path from version 1.15.
-
-## field assignment
-
-*`ENTRYFIELDNAME`* *`FIELDVALUE`*
-
-This sets a journal entry field (one of the standard names above) to the given text value,
-which can include CSV field values interpolated by name (`%CSVFIELDNAME`) or 1-based position (`%N`).
-<!-- Whitespace before or after the value is ignored. -->
-Eg:
-```rules
-# set the amount to the 4th CSV field with "USD " prepended
-amount USD %4
-```
-```rules
-# combine three fields to make a comment (containing two tags)
-comment note: %somefield - %anotherfield, date: %1
-```
-
-Field assignments can be used instead of or in addition to a field list.
-
-Note, interpolation strips any outer whitespace, so a CSV value like
-`" 1 "` becomes `1` when interpolated ([#1051](https://github.com/simonmichael/hledger/issues/1051)).
-
-## conditional block
-
-`if` *`PATTERN`*\
-&nbsp;&nbsp;&nbsp;&nbsp;*`FIELDASSIGNMENTS`*...
-
-`if`\
-*`PATTERN`*\
-*`PATTERN`*...\
-&nbsp;&nbsp;&nbsp;&nbsp;*`FIELDASSIGNMENTS`*...
-
-`if` *`PATTERN`*\
-*`PATTERN`*...\
-&nbsp;&nbsp;&nbsp;&nbsp;*`skip N`*...
-
-`if` *`PATTERN`*\
-*`PATTERN`*...\
-&nbsp;&nbsp;&nbsp;&nbsp;*`end`*...
-
-This applies one or more field assignments, only to those CSV records matched by one of the PATTERNs.
-The patterns are case-insensitive regular expressions which match anywhere
-within the whole CSV record (it's not yet possible to match within a
-specific field).  When there are multiple patterns they can be written
-on separate lines, unindented.
-The field assignments are on separate lines indented by at least one space.
-
-Instead of field assignments you can specify `skip` or `skip 1` to skip this record, `skip N` to skip the next N records (including the one that matched) or `end` to skip the rest of the file.
-
-Examples:
-```rules
-# if the CSV record contains "groceries", set account2 to "expenses:groceries"
-if groceries
- account2 expenses:groceries
-```
-```rules
-# if the CSV record contains any of these patterns, set account2 and comment as shown
-if
-monthly service fee
-atm transaction fee
-banking thru software
- account2 expenses:business:banking
- comment  XXX deductible ? check it
-```
-
-## include
-
-`include `*`RULESFILE`*
-
-Include another rules file at this point. `RULESFILE` is either an absolute file path or
-a path relative to the current file's directory. Eg:
-```rules
-# rules reused with several CSV files
-include common.rules
-```
-
-## newest-first
-
-`newest-first`
-
-Consider adding this rule if all of the following are true: 
-you might be processing just one day of data,
-your CSV records are in reverse chronological order (newest first),
-and you care about preserving the order of same-day transactions.
-It usually isn't needed, because hledger autodetects the CSV order,
-but when all CSV records have the same date it will assume they are oldest first.
-
-# CSV TIPS
-
-## CSV ordering
-
-The generated [journal entries](journal.html#transactions) will be sorted by date. 
-The order of same-day entries will be preserved 
-(except in the special case where you might need [`newest-first`](#newest-first), see above).
-
-## CSV accounts
-
-Each journal entry will have at least two [postings](journal.html#postings), to `account1` and some other account (usually `account2`).
-It's conventional and recommended to use `account1` for the account whose CSV we are reading.
-
-## CSV amounts
-
-A posting [amount](journal.html#amounts) could be set in one of these ways:
-
-- with an `amountN` field assignment, which sets the Nth posting's amount
-
-- (When the CSV has debit and credit amounts in separate fields:)\
-  with field assignments for the `amountN-in` and `amountN-out` pseudo
-  fields (both of them). Whichever one has a value will be used, with
-  appropriate sign. If both contain a value, it might not work so well.
-
-- with `balanceN` field assignment that creates a [balance assignment](journal.html#balance-assignments) (see below).
+- by assigning to `balanceN` (or `balance`) instead of the above,
+  setting the amount indirectly via a 
+  [balance assignment](journal.html#balance-assignments).
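+
+As a sketch of the last case, assuming a hypothetical CSV that provides only
+a running balance and no amount column (the account name is also just an example):
+```rules
+# hypothetical CSV columns: date, description, running balance
+fields date, description, balance
+account1 assets:bank:checking
+```
+With no posting amount set, posting 1's amount is left to be inferred from the balance.
+It's worth checking the result with the print command.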

 There is some special handling for sign in amounts:

 - If an amount value is parenthesised, it will be de-parenthesised and sign-flipped.
-- If an amount value begins with a double minus sign, those will cancel out and be removed.
+- If an amount value begins with a double minus sign, those cancel out and are removed.
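+
+One situation where the double minus rule helps, sketched here with a hypothetical
+CSV field named `csvamount`, is negating a CSV value that may already be negative:
+```rules
+fields date, description, csvamount
+# negate the CSV amount; a CSV value of -10.00 is interpolated as --10.00,
+# and the two minus signs then cancel out
+amount -%csvamount
+```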

 If the currency/commodity symbol is provided as a separate CSV field,
-assign it to the `currency` pseudo field (applicable to the whole transaction) or `currencyN` (applicable to Nth posting only); the symbol will be prepended
-to the amount 
-(TODO: <s>when there is an amount</s>).
-Or, you can use an `amountN` [field assignment](#field-assignment) for more control, eg:
+you can assign it to `currency` (affects all posting amounts) or `currencyN` (affects just posting N's amount).
+The symbol will be prepended to the amount.
+Or for more control, you can set both currency symbol and amount with a field assignment, eg:
 ```
-fields date,description,currency,amount1
-amount1 %amount1 %currency
+fields date,description,currency,amount
+# add currency symbol on the right:
+amount %amount %currency
 ```

-## CSV balance assertions/assignments
+## Referencing other fields

-If the CSV includes a running balance, you can assign that to one of the pseudo fields
-`balance` (or `balance1`), `balance2`, ... up to `balance9`.
-This will generate a [balance assertion](journal.html#balance-assertions) 
-(or if the amount is left empty, a [balance assignment](journal.html#balance-assignments)),
-on the appropriate posting, whenever the running balance field is non-empty.
+In field assignments, you can interpolate only CSV fields, not hledger
+fields. In the example below, there's both a CSV field and a hledger
+field named amount1, but %amount1 always means the CSV field, not
+the hledger field:

-## References to other fields and evaluation order
-
-Field assignments could include references to other fields or even to the same field you are trying to assign:
-
-```
-fields date,description,currency,amount1
+```rules
+# Name the third CSV field "amount1"
+fields date,description,amount1

+# Set hledger's amount1 to the CSV amount1 field followed by USD
 amount1 %amount1 USD
-amount1 %amount1 EUR
-amount1 %amount1 %currency

-if SOME_REGEXP
-    amount1 %amount1 GBP
+# Set comment to the CSV amount1 (not the amount1 assigned above)
+comment %amount1
 ```
-This is how this file would be evaluated.

-First, parts of CSV record are assigned according to `fields` directive.
+Here, since there's no CSV amount1 field, %amount1 will produce a literal "amount1":
+```rules
+fields date,description,csvamount
+amount1 %csvamount USD
+# Can't interpolate amount1 here
+comment %amount1
+```

-Then all other field assignments -- written at top level, or included in `if` blocks -- are considered to see if they should be applied. They are checked in the order they are written, with later assignment overwriting earlier ones.
+When there are multiple field assignments to the same hledger field,
+only the last one takes effect. Here, comment's value will be B,
+or C if "something" is matched, but never A:
+```rules
+comment A
+comment B
+if something
+ comment C
+```

-Once full set of field assignments that should be applied is known, their values are computed, and this is when all `%<fieldname>` references are evaluated.
+## How CSV rules are evaluated

-So for a particular row from CSV file, value from fourth column would be assigned to `amount1`.
+Here's how to think of CSV rules being evaluated (if you really need to). First,

-Then `hledger` will decide that `amount1` would have to be amended to `%amount1 USD`, but this will not happen immediately. This choice would be replaced by decision to rewrite `amount1` to `%amount1 EUR`, which will in turn be thrown away in favor of `%amount1 %currency`. If the `if` block condition will match the row, it will assign `amount1` to `%amount1 GBP`.
+- include - all includes are inlined, from top to bottom, depth first. (At each include point the file is inlined and scanned for further includes, before proceeding.)

-Overall, we will end up with one of the two alternatives for `amount1` - either `%amount1 %currency` or `%amount1 GBP`.
+Then "global" rules are evaluated, top to bottom. If a rule is repeated, the last one wins:

-Now substitution of all referenced values will happen, using the current values for `%amount1` and `currency`, which were provided by the `fields` directive.
+- skip (at top level)
+- date-format
+- newest-first
+- fields - names the CSV fields, optionally sets up initial assignments to hledger fields

+Then for each CSV record in turn:

-## Reading multiple CSV files
+- test all `if` blocks. If any matched block contains an `end` rule, skip all remaining CSV records.
+  Otherwise, if any matched block contains a `skip` rule, skip that many CSV records.
+  If there are multiple matched skip rules, the first one wins.
+- collect all field assignments at top level and in matched if blocks.
+  When there are multiple assignments for a field, keep only the last one.
+- compute a value for each hledger field - either the one that was assigned to it
+  (and interpolate the %CSVFIELDNAME references), or a default
+- generate a synthetic hledger transaction from these values, 
+  which becomes part of the input to the hledger command that has been selected
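+
+To make this concrete, here is a small sketch; the account names and the
+pattern are illustrative assumptions, not defaults:
+```rules
+skip 1
+fields date, description, csvamount
+account1 assets:checking
+account2 expenses:unknown
+amount %csvamount USD
+if coffee
+ account2 expenses:coffee
+```
+For a CSV record like `2019-11-06,coffee shop,-3.50`, the `if` block matches,
+so two assignments to `account2` are collected and the last one
+(`expenses:coffee`) is kept. `amount` is then computed by interpolating
+`%csvamount`, giving `-3.50 USD`. For a record not matching "coffee",
+`account2` stays `expenses:unknown`.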

-You can read multiple CSV files at once using multiple `-f` arguments on the command line,
-and hledger will look for a correspondingly-named rules file for each.
-Note if you use the `--rules-file` option, this one rules file will be used for all the CSV files being read. 
+## Valid transactions

-## Valid CSV
+hledger currently does not post-process and validate transactions
+generated from CSV as thoroughly as transactions read from a journal
+file. This means that if your rules are wrong, you can generate invalid
+transactions. Or, amounts may not be displayed with a canonical
+display style.

-hledger follows [RFC 4180](https://tools.ietf.org/html/rfc4180),
-with the addition of a customisable separator character.
+So when setting up or adjusting CSV rules, you should check your
+results visually with the print command. You can pipe print's output
+through hledger once more to validate and canonicalise fully.
+Eg:

-Some things to note:
+```shell
+$ hledger -f some.csv print | hledger -f- print -I
+```

-When quoting fields, 
-
-- you must use double quotes, not single quotes
-- spaces outside the quotes are [not allowed](https://stackoverflow.com/questions/4863852/space-before-quote-in-csv-field).
+(The -I/--ignore-assertions flag disables balance assertion checks,
+usually needed when re-parsing print output.)