mirror of https://github.com/wader/fq.git synced 2024-11-26 10:33:53 +03:00
fq/doc/dev.md
Mattias Wadman e77f776999 decode,interp: Rename unknown gap fields from "unknown#" to "gap#"
Think it makes it clearer and also less likely to collide with a field
name a decoder wants to use.
2022-12-01 20:43:30 +01:00

Implement a decoder

Steps to add a new decoder

  • Create a directory format/<name>
  • Copy some similar decoder to format/<name>/<name>.go; format/bson/bson.go is quite small.
  • Clean up and fill in the register struct, rename format.BSON, add the new name to format/format.go, and don't forget to change the string constant.
  • Add an import to format/all/all.go

Some general tips

  • The main goal is to produce a tree structure that is user-friendly and easy to work with. Prefer a nice and easy-to-query tree structure over a nice decoder implementation.
  • Use the same names, symbols, constant number bases etc as in the specification, but maybe in lowercase to be jq/JSON-ish.
  • Decode only ranges whose meaning you know. If possible, let the "parent" decide what to do with unknown gap bits by using *Decode*Len/Range/Limit functions. fq will also automatically add "gap" fields if it finds gaps.
  • Try not to decode too much as one value. A length encoded int could be two fields, but maybe a length prefixed string should be one. Flags can be a struct with bit-fields.
  • Map as many values as possible to more symbolic values.
  • Endian is inherited inside one format decoder and defaults to big endian for a new format decoder.
  • Make sure zero length input, no frames found etc fails decoding.
  • If the format is in the probe group, make sure to validate input so that it is non-ambiguous with other decoders.
  • Try to keep decoder code as declarative as possible.
  • Split into multiple sub formats if possible. Makes it possible to use them separately.
  • Validate/Assert
  • Error/Fatal/panic
  • Is format probeable or not
  • Can new formats be added to other formats
  • Does the new format include existing formats
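The tip about decoding only known ranges works because uncovered bits can be found afterwards. Here is a minimal sketch of how such gaps could be computed from a set of decoded bit ranges. The bitRange type and findGaps function are hypothetical names made up for illustration; this is not fq's actual implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// bitRange is a hypothetical stand-in for the bit range of a decoded field.
type bitRange struct {
	Start int64 // first bit
	Len   int64 // number of bits
}

// findGaps returns the ranges inside [0, total) that no decoded range covers.
func findGaps(total int64, decoded []bitRange) []bitRange {
	// Sort a copy by start so the ranges can be swept left to right.
	sorted := append([]bitRange(nil), decoded...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i].Start < sorted[j].Start })

	var gaps []bitRange
	pos := int64(0) // first bit not yet known to be covered
	for _, r := range sorted {
		if r.Start > pos {
			gaps = append(gaps, bitRange{Start: pos, Len: r.Start - pos})
		}
		if end := r.Start + r.Len; end > pos {
			pos = end
		}
	}
	if pos < total {
		gaps = append(gaps, bitRange{Start: pos, Len: total - pos})
	}
	return gaps
}

func main() {
	// 64 bits of input where only bits 0-31 and 40-47 were decoded.
	decoded := []bitRange{{Start: 40, Len: 8}, {Start: 0, Len: 32}}
	for i, g := range findGaps(64, decoded) {
		fmt.Printf("gap%d: start=%d len=%d\n", i, g.Start, g.Len)
	}
}
```

With the input above the sketch reports one gap at bits 32-39 and one at bits 48-63.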

Decoder API

*decode.D reader methods use this name convention:

<Field>?(<reader<length>?>|<type>Fn>)(...[, scalar.Mapper...]) <type>

  • If it starts with Field a field will be added and first argument will be name of field. If not it will just read.
  • <try>?<reader<length>?>|<try>?<type>Fn> a reader or a reader function
    • <try>? If prefixed with Try, the function returns an error instead of panicking on error.
    • <reader<length>?> Read bits using some decoder.
      • U16 unsigned 16 bit integer.
      • UTF8 UTF8 with byte length as argument.
    • <type>Fn> read using a func(d *decode.D) <type> function.
      • This can be used to implement own custom readers.

All Field functions take a variadic list of scalar.Mapper:s that are applied after reading.
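The Try prefix convention described above can be sketched with the standard library alone. The reader type and its tryU16/u16 methods below are hypothetical and byte-oriented; they only illustrate the error-returning versus panicking pair and are not the *decode.D API.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// reader is a hypothetical, byte-oriented stand-in for a decoder's read state.
type reader struct {
	buf []byte
	pos int
}

// tryU16 reads a big endian unsigned 16 bit integer and reports failure as an error.
func (r *reader) tryU16() (uint64, error) {
	if len(r.buf)-r.pos < 2 {
		return 0, errors.New("unexpected end of input")
	}
	v := binary.BigEndian.Uint16(r.buf[r.pos:])
	r.pos += 2
	return uint64(v), nil
}

// u16 is the non-Try variant: same read, but it panics on error.
func (r *reader) u16() uint64 {
	v, err := r.tryU16()
	if err != nil {
		panic(err)
	}
	return v
}

func main() {
	r := &reader{buf: []byte{0x01, 0x02, 0xff}}
	fmt.Println(r.u16()) // reads 0x0102
	// Only one byte is left, so the Try variant returns an error instead of panicking.
	if _, err := r.tryU16(); err != nil {
		fmt.Println("try failed:", err)
	}
}
```

The panicking variant keeps straight-line decoder code short, while the Try variant suits places where a failed read is an expected outcome.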

<type> are these types:

<type>  Go type  jq type
U       uint64   number
S       int64    number
F       float64  number
Str     string   string
Bool    bool     boolean
Nil     nil      null

TODO: there are some more (BitBuf etc, should be renamed)

To add a struct or array use d.FieldStruct(...) and d.FieldArray(...).

TODO: nested formats, buffers, own decoders, scalar mappers

TODO: seeking, framed/limited/range decode

For example this decoder:

// read 4 byte UTF8 string and add it as "magic", return a string
d.FieldUTF8("magic", 4)
// create a new struct and add it as "headers", returns a *decode.D
d.FieldStruct("headers", func(d *decode.D) {
    // read 8 bit unsigned integer, map it and add it as "type", returns a uint64
    d.FieldU8("type", scalar.UToSymStr{
        1: "start",
        // ...
    })
})

will produce something like this:

*decode.Value{
    Parent: nil,
    V: *decode.Compound{
        IsArray: false, // is struct
        Children: []*decode.Value{
            *decode.Value{
                Name: "magic",
                V: scalar.S{
                    Actual: "abcd", // read and set by UTF8 reader
                },
                Range: ranges.Range{Start: 0, Len: 32},
            },
            *decode.Value{
                Parent: &..., // ref to parent *decode.Value
                Name: "headers",
                V: *decode.Compound{
                    IsArray: false, // is struct
                    Children: []*decode.Value{
                        *decode.Value{
                            Name: "type",
                            V: scalar.S{
                                Actual: uint64(1), // read and set by U8 reader
                                Sym: "start", // set by UToSymStr scalar.Mapper
                            },
                            Range: ranges.Range{Start: 32, Len: 8},
                        },
                    },
                },
                Range: ranges.Range{Start: 32, Len: 8},
            },
        },
    },
    Range: ranges.Range{Start: 0, Len: 40},
}

and will look like this in jq/JSON:

{
    "magic": "abcd",
    "headers": {
        "type": "start"
    }
}

*decode.D type

This is the main type used during decoding. It keeps track of:

  • A current array or struct *decode.Value where fields will be added.
  • Current bit reader
  • Current default endian
  • Decode options

A new *decode.D is created during decoding whenever d.FieldStruct etc is used. It is also a kitchen sink of all kinds of functions for reading various standard number and string encodings etc.

Decoder authors do not have to create them.
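Since the bit reader works on bit positions, not byte positions, fields do not have to be byte aligned. A minimal sketch of reading an unaligned big endian value from a byte slice follows; readBits is a hypothetical helper written for illustration and is not part of fq or bitio.

```go
package main

import (
	"errors"
	"fmt"
)

// readBits reads nBits (at most 64) starting at bit offset bitPos,
// most significant bit first, and returns them as an unsigned integer.
func readBits(buf []byte, bitPos, nBits int) (uint64, error) {
	if bitPos < 0 || nBits < 0 || nBits > 64 || bitPos+nBits > len(buf)*8 {
		return 0, errors.New("bit range out of bounds")
	}
	var v uint64
	for i := 0; i < nBits; i++ {
		p := bitPos + i
		// Bit 0 of a byte is its most significant bit.
		bit := (buf[p/8] >> (7 - uint(p%8))) & 1
		v = v<<1 | uint64(bit)
	}
	return v, nil
}

func main() {
	buf := []byte{0b1010_1100, 0b0101_0000}
	hi, _ := readBits(buf, 0, 4)  // first 4 bits: 1010
	mid, _ := readBits(buf, 4, 6) // next 6 bits cross the byte boundary: 110001
	fmt.Println(hi, mid)
}
```

The second read spans two bytes, which is the case a byte-oriented reader cannot express directly.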

*decode.Value type

This is what *decode.D produces, and it is used to represent the decoded structure. It can be an array, struct, number, string etc. It is the underlying type used by interp.DecodeValue, which implements gojq.JQValue to expose it as various jq types, which in turn are used to produce JSON.

It stores:

  • Parent *decode.Value unless it's a root.
  • A decoded value, a scalar.S or *decode.Compound (struct or array)
  • Name in parent struct or array. If parent is a struct the name is unique.
  • Index in parent array. Not used if parent is a struct.
  • A bit range. Also struct and array have a range that is the min/max range of its children.
  • A bit reader where the bit range can be read from.

Decoder authors will probably not have to create them.
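The note that a struct or array has a range that is the min/max range of its children can be sketched as below. bitRange and unionRange are hypothetical names used only for this illustration; the input mirrors the "magic" and "headers" fields from the earlier example.

```go
package main

import "fmt"

// bitRange is a hypothetical stand-in for a value's bit range.
type bitRange struct {
	Start int64 // first bit
	Len   int64 // number of bits
}

// unionRange returns the smallest range that covers all child ranges.
// An empty list of children gives the zero range.
func unionRange(children []bitRange) bitRange {
	if len(children) == 0 {
		return bitRange{}
	}
	start := children[0].Start
	end := children[0].Start + children[0].Len
	for _, c := range children[1:] {
		if c.Start < start {
			start = c.Start
		}
		if e := c.Start + c.Len; e > end {
			end = e
		}
	}
	return bitRange{Start: start, Len: end - start}
}

func main() {
	// "magic" covers bits 0-31 and "headers" covers bits 32-39.
	root := unionRange([]bitRange{{Start: 0, Len: 32}, {Start: 32, Len: 8}})
	fmt.Printf("start=%d len=%d\n", root.Start, root.Len)
}
```

For these two children the result starts at bit 0 and is 40 bits long, which matches the root range in the example tree.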

scalar.S type

Keeps track of

  • Actual value. Decoded value represented using a Go type like uint64, string etc. For example, a value read by a UTF-8 or a UTF-16 reader ends up as a string in both cases.
  • Symbolic value. Optional symbolic representation of the actual value. For example a scalar.UToSymStr would map an actual uint64 to a symbolic string.
  • String description of the value.
  • Number representation

The scalar package has scalar.Mapper implementations for all types, either to map an actual value to a whole scalar.S value (scalar.<type>ToScalar) or to just set the symbolic value (scalar.<type>ToSym<type>). There are also mappers that just set values or change the number representation (scalar.Hex/scalar.SymHex etc).

Decoder authors will probably not have to create them. But you might implement your own scalar.Mapper to modify them.
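To make the actual versus symbolic distinction concrete, here is a minimal model of a scalar with a map-based mapper applied after reading. All names (scalarValue, mapper, symMap, readWithMappers, display) are hypothetical and simplified; the real scalar.S and scalar.Mapper have more fields and a different interface.

```go
package main

import "fmt"

// scalarValue is a hypothetical, simplified model of a decoded scalar.
type scalarValue struct {
	Actual uint64 // value as read from the input
	Sym    string // optional symbolic value, empty if unset
}

// mapper is a hypothetical interface for functions applied after reading.
type mapper interface {
	mapScalar(s scalarValue) scalarValue
}

// symMap sets Sym when Actual has an entry in the map.
type symMap map[uint64]string

func (m symMap) mapScalar(s scalarValue) scalarValue {
	if sym, ok := m[s.Actual]; ok {
		s.Sym = sym
	}
	return s
}

// readWithMappers models a Field reader: start from the actual value,
// then apply the mappers in order.
func readWithMappers(actual uint64, ms ...mapper) scalarValue {
	s := scalarValue{Actual: actual}
	for _, m := range ms {
		s = m.mapScalar(s)
	}
	return s
}

// display returns the symbolic value if set, otherwise the actual value.
func display(s scalarValue) string {
	if s.Sym != "" {
		return s.Sym
	}
	return fmt.Sprint(s.Actual)
}

func main() {
	types := symMap{1: "start"}
	fmt.Println(display(readWithMappers(1, types))) // mapped to a symbol
	fmt.Println(display(readWithMappers(7, types))) // no symbol, actual value shown
}
```

The actual value is kept even when a symbol is set, so nothing read from the input is lost by mapping.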

*decode.Compound type

Used to store struct or array of *decode.Value.

Decoder authors do not have to create them.

Development tips

I usually use -d <format> and dv while developing, that way you will get a decode tree even if decoding fails. dv gives verbose output and also includes a stack trace.

go run fq.go -d <format> dv file

If the format is inside some other format it can be handy to first extract the bits and run the decoder directly. For example, when working on an aac_frame decoder issue:

fq '.tracks[0].samples[1234] | tobytes' file.mp4 > aac_frame_1234
fq -d aac_frame dv aac_frame_1234

Sometimes nested decoding fails. A good way to handle that can be to temporarily change the parent decoder to use d.RawLen() etc instead of d.FormatLen() etc to extract the bits. Hopefully there will be some option to do this in the future.

When researching or investigating something I can recommend using watchexec, modd etc to make things more comfortable. Using vscode/delve for debugging should also work fine once launch args are set up etc.

watchexec "go run fq.go -d aac_frame dv aac_frame"

Some different ways to run tests:

# run all tests
make test
# run all go tests
go test ./...
# run all tests for one format
go test -run TestFQTests/mp4 ./format/
# write all actual outputs
WRITE_ACTUAL=1 go test ./...
# write actual output for specific tests
WRITE_ACTUAL=1 go test -run ...
# color diff
DIFF_COLOR=1 go test ...

To lint source use:

make lint

Generate documentation. Requires FFmpeg and Graphviz:

make doc

TODO: make fuzz

Debug

Split debug and normal output even when using repl:

Write log package output and stderr to a file that can be tail -f:ed in another terminal:

LOGFILE=/tmp/log go run fq.go ... 2>>/tmp/log

gojq execution debug:

GOJQ_DEBUG=1 go run -tags debug fq.go ...

Memory and CPU profile (will open a browser):

make memprof ARGS=". file"
make cpuprof ARGS=". test.mp3"

From start to decoded value

main:main()
    cli.Main(default registry)
        interp.New(registry, std os interp implementation)
        interp.(*Interp).Main()
            interp.jq _main/0:
                args.jq _args_parse/2
                populate filenames for input/0
                interp.jq inputs/0
                    foreach valid input/0 output
                        interp.jq open
                            funcs.go _open
                        interp.jq decode
                            funcs.go _decode
                                decode.go Decode(...)
                                    ...
                        interp.jq eval expr
                            funcs.go _eval
                        interp.jq display
                            funcs.go _display
                                for interp.(decodeValueBase).Display()
                                    dump.go
                                        print tree
                                empty output

bitio and other io packages

*os.File, *bytes.Buffer
^
ctxreadseeker.Reader defers blocking io operations to a goroutine to make them cancellable
^
progressreadseeker.Reader approximates how much of a file has been read
^
aheadreadseeker.Reader does readahead caching
^
| (io.ReadSeeker interface)
|
bitio.IOBitReader (implements bitio.Bit* interfaces)
SectionBitReader
MultiBitReader

jq oddities

jq -n '[1,2,3,4] | .[null:], .[null:2], .[2:null], .[:null]'

Setup docker desktop with golang windows container

git clone https://github.com/StefanScherer/windows-docker-machine.git
cd windows-docker-machine
vagrant up 2016-box
cd ../fq
docker --context 2016-box run --rm -ti -v "C:${PWD//\//\\}:C:${PWD//\//\\}" -w "$PWD" golang:1.18-windowsservercore-ltsc2016

Implementation details

Dependencies and source origins

  • gojq fork that can be found at https://github.com/wader/gojq/tree/fq
    Issues and PR:s related to fq:
    #43 Support for functions written in go when used as a library
    #46 Support custom internal functions
    #56 String format query with no operator using %#v or %#+v panics
    #65 Try-catch with custom function
    #67 Add custom iterator function support which enables implementing a REPL in jq
    #81 path/1 behaviour and path expression question
    #86 ER: basic TCO
    #109 jq halt_error behaviour difference
    #113 error/0 and error/1 behavior difference
    #117 Negative number modulus *big.Int behaves differently to int
    #118 Regression introduced by "remove fork analysis from tail call optimization (ref #86)"
    #122 Slow performance for large error values that ends up using typeErrorPreview()
    #125 improve performance of join by make it internal
    #141 Empty array flatten regression since "improve flatten performance by reducing copy"

  • readline fork that can be found at https://github.com/wader/readline/tree/fq

  • gopacket for TCP and IPv4 reassembly

  • mapstructure for convenient JSON/map conversion

  • go-difflib for diff tests

  • golang.org/x/text for text encoding conversions

  • float16.go to convert bits into 16-bit floats

Release process

Run and follow instructions:

make release VERSION=1.2.3

Commits since release

git log --no-decorate --no-merges --oneline v0.0.4..wader/master | sort -t " " -k 2 | sed 's/\(.*\)/* \1/'