---
language: kdb+
contributors:
- ["Matt Doherty", "https://github.com/picodoc"]
- ["Jonny Press", "https://github.com/jonnypress"]
filename: learnkdb.q
---

The q language and its database component kdb+ were developed by Arthur Whitney
and released by Kx Systems in 2003. q is a descendant of APL and as such is
very terse and a little strange-looking to anyone from a "C heritage" language
background. Its expressiveness and vector-oriented nature make it well suited
to performing complex calculations on large amounts of data (while also
encouraging some amount of [code
golf](https://en.wikipedia.org/wiki/Code_golf)). The fundamental structure in
the language is not the object but the list, and tables are built as
collections of lists. This means that, unlike in most traditional RDBMSs,
tables are column-oriented. The language has both an in-memory and an on-disk
database built in, giving a large amount of flexibility. kdb+ is most widely
used in the world of finance to store, analyze, process and retrieve large
time-series data sets.

The terms *q* and *kdb+* are usually used interchangeably: the two are not
separable, so the distinction is rarely useful.

All feedback welcome! You can reach me at matt.doherty@aquaq.co.uk, or Jonny
at jonny.press@aquaq.co.uk.

To learn more about kdb+ you can join the [Personal kdb+](https://groups.google.com/forum/#!forum/personal-kdbplus) or [TorQ kdb+](https://groups.google.com/forum/#!forum/kdbtorq) group.

```
/ Single line comments start with a forward-slash
/ These can also be used in-line, so long as at least one whitespace character
/ separates it from text to the left
/
A forward-slash on a line by itself starts a multiline comment
and a backward-slash on a line by itself terminates it
\
/ Run this file in an empty directory
////////////////////////////////////
// Basic Operators and Datatypes //
////////////////////////////////////
/ We have integers, which are 8 byte by default
3 / => 3
/ And floats, also 8 byte as standard. Trailing f distinguishes from int
3.0 / => 3f
/ 4 byte numerical types can also be specified with trailing chars
3i / => 3i
3.0e / => 3e
/ Math is mostly what you would expect
1+1 / => 2
8-1 / => 7
10*2 / => 20
/ Except division, which uses percent (%) instead of forward-slash (/)
35%5 / => 7f (the result of division is always a float)
/ For integer division we have the keyword div
4 div 3 / => 1
/ Modulo also uses a keyword, since percent (%) is taken
4 mod 3 / => 1
/ And exponentiation...
2 xexp 4 / => 16f (xexp always returns a float)
/ ...and truncating...
floor 3.14159 / => 3
/ ...getting the absolute value...
abs -3.14159 / => 3.14159
/ ...and many other things
/ see http://code.kx.com/q/ref/card/ for more
/ q has no operator precedence, everything is evaluated right to left
/ so results like this might take some getting used to
2*1+1 / => 4 / (no operator precedence tables to remember!)
/ Precedence can be modified with parentheses (restoring the 'normal' result)
(2*1)+1 / => 3
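/ One more illustration (an extra example, not from the original text):
/ subtraction is also evaluated right to left
10-2-3 / => 11 (evaluated as 10-(2-3))
(10-2)-3 / => 5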
/ Assignment uses colon (:) instead of equals (=)
/ No need to declare variables before assignment
a:3
a / => 3
/ Variables can also be assigned in-line
/ this does not affect the value passed on
c:3+b:2+a:1 / (data "flows" from right to left)
a / => 1
b / => 3
c / => 6
/ In-place operations are also as you might expect
a+:2
a / => 3
/ There are no "true" or "false" keywords in q
/ boolean values are indicated by the bit value followed by b
1b / => true value
0b / => false value
/ Equality comparisons use equals (=) (since we don't need it for assignment)
1=1 / => 1b
2=1 / => 0b
/ Inequality uses <>
1<>1 / => 0b
2<>1 / => 1b
/ The other comparisons are as you might expect
1<2 / => 1b
1>2 / => 0b
2<=2 / => 1b
2>=2 / => 1b
/ Comparison is not strict with regard to types...
42=42.0 / => 1b
/ ...unless we use the match operator (~)
/ which only returns true if entities are identical
42~42.0 / => 0b
/ The not operator returns true if the underlying value is zero
not 0b / => 1b
not 1b / => 0b
not 42 / => 0b
not 0.0 / => 1b
/ The max operator (|) reduces to logical "or" for bools
42|2.0 / => 42f
1b|0b / => 1b
/ The min operator (&) reduces to logical "and" for bools
42&2.0 / => 2f
1b&0b / => 0b
/ q provides two ways to store character data
/ Chars in q are stored in a single byte and use double-quotes (")
ch:"a"
/ Strings are simply lists of char (more on lists later)
str:"This is a string"
/ Escape characters work as normal
str:"This is a string with \"quotes\""
/ Char data can also be stored as symbols using backtick (`)
symbol:`sym
/ Symbols are NOT LISTS, they are an enumeration
/ the q process stores internally a vector of strings
/ symbols are enumerated against this vector
/ this can be more space and speed efficient as these are constant width
/ The string function converts to strings
string `symbol / => "symbol"
string 1.2345 / => "1.2345"
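/ Going the other way (an extra example): cast ($) turns strings back into data
`$"symbol" / => `symbol
"J"$"42" / => 42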
/ q has a time type...
t:01:00:00.000
/ date type...
d:2015.12.25
/ and a timestamp type (among other time types)
dt:2015.12.25D12:00:00.000000000
/ These support some arithmetic for easy manipulation
dt + t / => 2015.12.25D13:00:00.000000000
t - 00:10:00.000 / => 00:50:00.000
/ and can be decomposed using dot notation
d.year / => 2015i
d.mm / => 12i
d.dd / => 25i
/ see http://code.kx.com/q4m3/2_Basic_Data_Types_Atoms/#25-temporal-data for more
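/ As a further small sketch (not from the original text): dates accept integer
/ arithmetic in days, and cast ($) converts between temporal types
2015.12.25+1 / => 2015.12.26
`date$2015.12.25D12:00:00.000000000 / => 2015.12.25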
/ q also has an infinity value so div by zero will not throw an error
1%0 / => 0w
-1%0 / => -0w
/ And null types for representing missing values
0N / => null long (the default integer type)
0n / => null float
/ see http://code.kx.com/q4m3/2_Basic_Data_Types_Atoms/#27-nulls for more
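/ A short extra example: null tests for missing values, and fill (^) replaces them
null 1 0N 3 / => 010b
0^1 0N 3 / => 1 0 3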
/ q has standard control structures
/ if is as you might expect (; separates the condition and instructions)
if[1=1;a:"hi"]
a / => "hi"
/ if-else uses $ (and unlike if, returns a value)
$[1=0;a:"hi";a:"bye"] / => "bye"
a / => "bye"
/ if-else can be extended to multiple clauses by adding args separated by ;
$[1=0;a:"hi";0=1;a:"bye";a:"hello again"]
a / => "hello again"
////////////////////////////////////
//// Data Structures ////
////////////////////////////////////
/ q is not an object oriented language
/ instead complexity is built through ordered lists
/ and mapping them into higher order structures: dictionaries and tables
/ Lists (or arrays if you prefer) are simple ordered collections
/ they are defined using parentheses () and semi-colons (;)
(1;2;3) / => 1 2 3
(-10.0;3.14159e;1b;`abc;"c")
/ => -10f
/ => 3.14159e
/ => 1b
/ => `abc
/ => "c" (mixed type lists are displayed on multiple lines)
((1;2;3);(4;5;6);(7;8;9))
/ => 1 2 3
/ => 4 5 6
/ => 7 8 9
/ Lists of uniform type can also be defined more concisely
1 2 3 / => 1 2 3
`list`of`syms / => `list`of`syms
`list`of`syms ~ (`list;`of;`syms) / => 1b
/ List length
count (1;2;3) / => 3
count "I am a string" / => 13 (strings are lists of char)
/ Empty lists are defined with parentheses
l:()
count l / => 0
/ Simple variables and single item lists are not equivalent
/ parentheses syntax cannot create a single item list (they indicate precedence)
(1)~1 / => 1b
/ single item lists can be created using enlist
singleton:enlist 1
/ or appending to an empty list
singleton:(),1
1~(),1 / => 0b
/ Speaking of appending, comma (,) is used for this, not plus (+)
1 2 3,4 5 6 / => 1 2 3 4 5 6
"hello ","there" / => "hello there"
/ Indexing uses square brackets []
l:1 2 3 4
l[0] / => 1
l[1] / => 2
/ indexing out of bounds returns a null value rather than an error
l[5] / => 0N
/ and indexed assignment
l[0]:5
l / => 5 2 3 4
/ Lists can also be used for indexing and indexed assignment
l[1 3] / => 2 4
l[1 3]: 1 3
l / => 5 1 3 3
/ Lists can be untyped/mixed type
l:(1;2;`hi)
/ but once they are uniformly typed, q will enforce this
l[2]:3
l / => 1 2 3
l[2]:`hi / throws a type error
/ this makes sense in the context of lists as table columns (more later)
/ For a nested list we can index at depth
l:((1;2;3);(4;5;6);(7;8;9))
l[1;1] / => 5
/ We can elide the indexes to return entire rows or columns
l[;1] / => 2 5 8
l[1;] / => 4 5 6
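/ Lists of indexes also work at depth (an extra example on a literal nested list)
((1;2;3);(4;5;6);(7;8;9))[0 2;1] / => 2 8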
/ All the functions mentioned in the previous section work on lists natively
1+(1;2;3) / => 2 3 4 (single variable and list)
(1;2;3) - (3;2;1) / => -2 0 2 (list and list)
/ And there are many more that are designed specifically for lists
avg 1 2 3 / => 2f
sum 1 2 3 / => 6
sums 1 2 3 / => 1 3 6 (running sum)
last 1 2 3 / => 3
1 rotate 1 2 3 / => 2 3 1
/ etc.
/ Using and combining these functions to manipulate lists is where much of the
/ power and expressiveness of the language comes from
/ Take (#), drop (_) and find (?) are also useful when working with lists
l:1 2 3 4 5 6 7 8 9
l:1+til 9 / til is a useful shortcut for generating ranges
/ take the first 5 elements
5#l / => 1 2 3 4 5
/ drop the first 5
5_l / => 6 7 8 9
/ take the last 5
-5#l / => 5 6 7 8 9
/ drop the last 5
-5_l / => 1 2 3 4
/ find the first occurrence of 4
l?4 / => 3
l[3] / => 4
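/ Two related points, as an extra sketch on a literal list:
/ if the item is not found, find returns the count of the list
1 2 3?4 / => 3
/ and the keyword where returns the indexes of the true values
where 1 2 3>1 / => 1 2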
/ Dictionaries in q are a generalization of lists
/ they map a list to another list (of equal length)
/ the bang (!) symbol is used for defining a dictionary
d:(`a;`b;`c)!(1;2;3)
/ or more simply with concise list syntax
d:`a`b`c!1 2 3
/ the keyword key returns the first list
key d / => `a`b`c
/ and value the second
value d / => 1 2 3
/ Indexing is identical to lists
/ with the first list as a key instead of the position
d[`a] / => 1
d[`b] / => 2
/ As is assignment
d[`c]:4
d
/ => a| 1
/ => b| 2
/ => c| 4
/ Arithmetic and comparison work natively, just like lists
e:(`a;`b;`c)!(2;3;4)
d+e
/ => a| 3
/ => b| 5
/ => c| 8
d-2
/ => a| -1
/ => b| 0
/ => c| 2
d > (1;1;1)
/ => a| 0
/ => b| 1
/ => c| 1
/ And the take, drop and find operators are remarkably similar too
`a`b#d
/ => a| 1
/ => b| 2
`a`b _ d
/ => c| 4
d?2
/ => `b
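/ As an extra example: where applied to a boolean dictionary returns the keys
where (`a`b`c!1 2 4)>1 / => `b`c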
/ Tables in q are basically a subset of dictionaries
/ a table is a dictionary where all values must be lists of the same length
/ as such tables in q are column oriented (unlike most RDBMS)
/ the flip keyword is used to convert a dictionary to a table
/ i.e. flip the indices
flip `c1`c2`c3!(1 2 3;4 5 6;7 8 9)
/ => c1 c2 c3
/ => --------
/ => 1 4 7
/ => 2 5 8
/ => 3 6 9
/ we can also define tables using this syntax
t:([]c1:1 2 3;c2:4 5 6;c3:7 8 9)
t
/ => c1 c2 c3
/ => --------
/ => 1 4 7
/ => 2 5 8
/ => 3 6 9
/ Tables can be indexed and manipulated in a similar way to dicts and lists
t[`c1]
/ => 1 2 3
/ table rows are returned as dictionaries
t[1]
/ => c1| 2
/ => c2| 5
/ => c3| 8
/ meta returns table type information
meta t
/ => c | t f a
/ => --| -----
/ => c1| j
/ => c2| j
/ => c3| j
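/ Two more helpers (an extra sketch on a literal table):
/ cols lists the column names and count gives the number of rows
cols ([]c1:1 2 3;c2:4 5 6) / => `c1`c2
count ([]c1:1 2 3;c2:4 5 6) / => 3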
/ now we see why type is enforced in lists (to protect column types)
t[1;`c1]:3
t[1;`c1]:3.0 / throws a type error
/ Most traditional databases have primary key columns
/ in q we have keyed tables, where one table containing key columns
/ is mapped to another table using bang (!)
k:([]id:1 2 3)
k!t
/ => id| c1 c2 c3
/ => --| --------
/ => 1 | 1 4 7
/ => 2 | 3 5 8
/ => 3 | 3 6 9
/ We can also use this shortcut for defining keyed tables
kt:([id:1 2 3]c1:1 2 3;c2:4 5 6;c3:7 8 9)
/ Records can then be retrieved based on this key
kt[1]
/ => c1| 1
/ => c2| 4
/ => c3| 7
kt[(enlist `id)!enlist 1] / the full form: a dictionary built from single-item lists
/ => c1| 1
/ => c2| 4
/ => c3| 7
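/ A few related helpers, sketched on a small illustrative keyed table
/ (ktx is a made-up name used only for this example)
ktx:([id:1 2 3]c1:1 2 3)
keys ktx / => ,`id (the key column names)
0!ktx / removes the key, returning a plain table with columns id and c1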
////////////////////////////////////
//////// Functions ////////
////////////////////////////////////
/ In q the function is similar to a mathematical map, mapping inputs to outputs
/ curly braces {} are used for function definition
/ and square brackets [] for calling functions (just like list indexing)
/ a very minimal function
f:{x+x}
f[2] / => 4
/ Functions can be anonymous and called at point of definition
{x+x}[2] / => 4
/ By default the last expression is returned
/ colon (:) can be used to specify return
{x+x}[2] / => 4
{:x+x}[2] / => 4
/ semi-colon (;) separates expressions
{r:x+x;:r}[2] / => 4
/ Function arguments can be specified explicitly (separated by ;)
{[arg1;arg2] arg1+arg2}[1;2] / => 3
/ or if omitted will default to x, y and z
{x+y+z}[1;2;3] / => 6
/ Built in functions are no different, and can be called the same way (with [])
+[1;2] / => 3
<[1;2] / => 1b
/ Functions are first class in q, so can be returned, stored in lists etc.
{:{x+y}}[] / => {x+y}
(1;"hi";{x+y})
/ => 1
/ => "hi"
/ => {x+y}
/ There is no overloading and no keyword arguments for custom q functions
/ however using a dictionary as a single argument can overcome this
/ allows for optional arguments or differing functionality
d:`arg1`arg2`arg3!(1.0;2;"my function argument")
{x[`arg1]+x[`arg2]}[d] / => 3f
/ Functions in q see the global scope
a:1
{:a}[] / => 1
/ However local scope obscures this
a:1
{a:2;:a}[] / => 2
a / => 1
/ Functions cannot see nested scopes (only local and global)
{local:1;{:local}[]}[] / throws error as local is not defined in inner function
/ A function can have one or more of its arguments fixed (projection)
f:+[4]
f[4] / => 8
f[5] / => 9
f[6] / => 10
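/ Any argument can be fixed by leaving the others empty
/ (an extra example; g is a made-up name)
g:{x-y}[;1]
g[5] / => 4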
////////////////////////////////////
////////// q-sql //////////
////////////////////////////////////
/ q has its own syntax for manipulating tables, similar to standard SQL
/ This contains the usual suspects of select, insert, update etc.
/ and some new functionality not typically available
/ q-sql has two significant differences (other than syntax) to normal SQL:
/ - q tables have well defined record orders
/ - tables are stored as a collection of columns
/ (so vectorized column operations are fast)
/ a full description of q-sql is a little beyond the scope of this intro
/ so we will just cover enough of the basics to get you going
/ First define ourselves a table
t:([]name:`Arthur`Thomas`Polly;age:35 32 52;height:180 175 160;sex:`m`m`f)
/ equivalent of SELECT * FROM t
select from t / (must be lower case, and the wildcard is not necessary)
/ => name age height sex
/ => ---------------------
/ => Arthur 35 180 m
/ => Thomas 32 175 m
/ => Polly 52 160 f
/ Select specific columns
select name,age from t
/ => name age
/ => ----------
/ => Arthur 35
/ => Thomas 32
/ => Polly 52
/ And name them (equivalent of using AS in standard SQL)
select charactername:name, currentage:age from t
/ => charactername currentage
/ => ------------------------
/ => Arthur 35
/ => Thomas 32
/ => Polly 52
/ This SQL syntax is integrated with the q language
/ so q can be used seamlessly in SQL statements
select name, feet:floor height*0.032, inches:12*(height*0.032) mod 1 from t
/ => name feet inches
/ => ------------------
/ => Arthur 5 9.12
/ => Thomas 5 7.2
/ => Polly 5 1.44
/ Including custom functions
select name, growth:{[h;a]h%a}[height;age] from t
/ => name growth
/ => ---------------
/ => Arthur 5.142857
/ => Thomas 5.46875
/ => Polly 3.076923
/ The where clause can contain multiple statements separated by commas
select from t where age>33,height>175
/ => name age height sex
/ => ---------------------
/ => Arthur 35 180 m
/ The where statements are executed sequentially (not the same as logical AND)
select from t where age<40,height=min height
/ => name age height sex
/ => ---------------------
/ => Thomas 32 175 m
select from t where (age<40)&(height=min height)
/ => name age height sex
/ => -------------------
/ The by clause falls between select and from
/ and is equivalent to SQL's GROUP BY
select avg height by sex from t
/ => sex| height
/ => ---| ------
/ => f | 160
/ => m | 177.5
/ If no aggregation function is specified, last is assumed
select by sex from t
/ => sex| name age height
/ => ---| -----------------
/ => f | Polly 52 160
/ => m | Thomas 32 175
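/ Several aggregations can be combined in one query
/ (an extra sketch; people is a made-up table so that t is left untouched)
people:([]sex:`m`m`f;age:35 32 52)
select n:count i, avgage:avg age by sex from people
/ => sex| n avgage
/ => ---| --------
/ => f | 1 52
/ => m | 2 33.5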
/ Update has the same basic form as select
update sex:`male from t where sex=`m
/ => name age height sex
/ => ----------------------
/ => Arthur 35 180 male
/ => Thomas 32 175 male
/ => Polly 52 160 f
/ As does delete
delete from t where sex=`m
/ => name age height sex
/ => --------------------
/ => Polly 52 160 f
/ None of these sql operations are carried out in place
t
/ => name age height sex
/ => ---------------------
/ => Arthur 35 180 m
/ => Thomas 32 175 m
/ => Polly 52 160 f
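/ To modify a table in place, pass its name as a symbol instead
/ (an extra sketch; t2 is a made-up table so that t is left untouched)
t2:([]name:`Arthur`Polly;height:180 160)
update height:height+1 from `t2 where name=`Polly / => `t2
t2
/ => name height
/ => -------------
/ => Arthur 180
/ => Polly 161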
/ Insert however is in place, it takes a table name, and new data
`t insert (`John;25;178;`m) / => ,3
t
/ => name age height sex
/ => ---------------------
/ => Arthur 35 180 m
/ => Thomas 32 175 m
/ => Polly 52 160 f
/ => John 25 178 m
/ Upsert is similar (but doesn't have to be in-place)
t upsert (`Chester;58;179;`m)
/ => name age height sex
/ => ----------------------
/ => Arthur 35 180 m
/ => Thomas 32 175 m
/ => Polly 52 160 f
/ => John 25 178 m
/ => Chester 58 179 m
/ it will also upsert dicts or tables
t upsert `name`age`height`sex!(`Chester;58;179;`m)
t upsert ([]name:enlist `Chester;age:enlist 58;height:enlist 179;sex:enlist `m)
/ => name age height sex
/ => ----------------------
/ => Arthur 35 180 m
/ => Thomas 32 175 m
/ => Polly 52 160 f
/ => John 25 178 m
/ => Chester 58 179 m
/ And if our table is keyed
kt:`name xkey t
/ upsert will replace records where required
kt upsert ([]name:`Thomas`Chester;age:33 58;height:175 179;sex:`f`m)
/ => name | age height sex
/ => -------| --------------
/ => Arthur | 35 180 m
/ => Thomas | 33 175 f
/ => Polly | 52 160 f
/ => John | 25 178 m
/ => Chester| 58 179 m
/ There is no ORDER BY clause in q-sql, instead use xasc/xdesc
`name xasc t
/ => name age height sex
/ => ---------------------
/ => Arthur 35 180 m
/ => John 25 178 m
/ => Polly 52 160 f
/ => Thomas 32 175 m
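/ xdesc sorts in descending order (an extra example on a small literal table)
`age xdesc ([]name:`Arthur`Thomas`Polly;age:35 32 52)
/ => name age
/ => ----------
/ => Polly 52
/ => Arthur 35
/ => Thomas 32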
/ Most of the standard SQL joins are present in q-sql, plus a few new friends
/ see http://code.kx.com/q4m3/9_Queries_q-sql/#99-joins
/ the two most important (commonly used) are lj and aj
/ lj is basically the same as SQL LEFT JOIN
/ where the join is carried out on the key columns of the right table
le:([sex:`m`f]lifeexpectancy:78 85)
t lj le
/ => name age height sex lifeexpectancy
/ => ------------------------------------
/ => Arthur 35 180 m 78
/ => Thomas 32 175 m 78
/ => Polly 52 160 f 85
/ => John 25 178 m 78
/ aj is an asof join. This is not a standard SQL join, and can be very powerful
/ The canonical example of this is joining financial trades and quotes tables
trades:([]time:10:01:01 10:01:03 10:01:04;sym:`msft`ibm`ge;qty:100 200 150)
quotes:([]time:10:01:00 10:01:01 10:01:01 10:01:03;
sym:`ibm`msft`msft`ibm; px:100 99 101 98)
aj[`time`sym;trades;quotes]
/ => time sym qty px
/ => ---------------------
/ => 10:01:01 msft 100 101
/ => 10:01:03 ibm 200 98
/ => 10:01:04 ge 150
/ for each row in the trades table, the last (prevailing) quote (px) for that sym
/ is joined on.
/ see http://code.kx.com/q4m3/9_Queries_q-sql/#998-as-of-joins
////////////////////////////////////
///// Extra/Advanced //////
////////////////////////////////////
////// Adverbs //////
/ You may have noticed the total lack of loops to this point
/ This is not a mistake!
/ q is a vector language so explicit loops (for, while etc.) are not encouraged
/ where possible functionality should be vectorized (i.e. operations on lists)
/ adverbs supplement this, modifying the behaviour of functions
/ and providing loop type functionality when required
/ (in q functions are sometimes referred to as verbs, hence adverbs)
/ the "each" adverb modifies a function to treat a list as individual variables
first each (1 2 3;4 5 6;7 8 9)
/ => 1 4 7
/ each-left (\:) and each-right (/:) modify a two-argument function
/ to treat one of the arguments as individual variables instead of a list
1 2 3 +\: 11 22 33
/ => 12 23 34
/ => 13 24 35
/ => 14 25 36
1 2 3 +/: 11 22 33
/ => 12 13 14
/ => 23 24 25
/ => 34 35 36
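/ each-both (') pairs up corresponding items from the two lists
/ (an extra example, not from the original text)
1 2 3,'4 5 6
/ => 1 4
/ => 2 5
/ => 3 6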
/ The true alternatives to loops in q are the adverbs scan (\) and over (/)
/ their behaviour differs based on the number of arguments the function they
/ are modifying receives. Here I'll summarise some of the most useful cases
/ a single argument function modified by scan given 2 args behaves like "do"
{x * 2}\[5;1] / => 1 2 4 8 16 32 (i.e. multiply by 2, 5 times)
{x * 2}/[5;1] / => 32 (using over only the final result is shown)
/ If the first argument is a function, we have the equivalent of "while"
{x * 2}\[{x<100};1] / => 1 2 4 8 16 32 64 128 (iterates until returns 0b)
{x * 2}/[{x<100};1] / => 128 (again returns only the final result)
/ If the function takes two arguments, and we pass a list, we have "for"
/ where the result of the previous execution is passed back into the next loop
/ along with the next member of the list
{x + y}\[1 2 3 4 5] / => 1 3 6 10 15 (i.e. the running sum)
{x + y}/[1 2 3 4 5] / => 15 (only the final result)
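/ An initial value can be supplied as the first argument (an extra example)
{x + y}\[10;1 2 3] / => 11 13 16
{x + y}/[10;1 2 3] / => 16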
/ There are other adverbs and uses, this is only intended as a quick overview
/ http://code.kx.com/q4m3/6_Functions/#67-adverbs
////// Scripts //////
/ q scripts can be loaded from a q session using the "\l" command
/ for example "\l learnkdb.q" will load this script
/ or from the command prompt passing the script as an argument
/ for example "q learnkdb.q"
////// On-disk data //////
/ Tables can be persisted to disk in several formats
/ the two most fundamental are serialized and splayed
t:([]a:1 2 3;b:1 2 3f)
`:serialized set t / saves the table as a single serialized file
`:splayed/ set t / saves the table splayed into a directory
/ the dir structure will now look something like:
/ db/
/ ├── serialized
/ └── splayed
/ ├── a
/ └── b
/ Loading this directory (as if it were a script, see above)
/ loads these tables into the q session
\l .
/ the serialized table will be loaded into memory
/ however the splayed table will only be mapped, not loaded
/ both tables can be queried using q-sql
select from serialized
/ => a b
/ => ---
/ => 1 1
/ => 2 2
/ => 3 3
select from splayed / (the columns are read from disk on request)
/ => a b
/ => ---
/ => 1 1
/ => 2 2
/ => 3 3
/ see http://code.kx.com/q4m3/14_Introduction_to_Kdb+/ for more
////// Frameworks //////
/ kdb+ is typically used for data capture and analysis.
/ This involves using an architecture with multiple processes
/ working together. kdb+ frameworks are available to streamline the setup
/ and configuration of this architecture and add additional functionality
/ such as disaster recovery, logging, access, load balancing etc.
/ https://github.com/AquaQAnalytics/TorQ
```
## Want to know more?
* [*q for mortals* q language tutorial](http://code.kx.com/q4m3/)
* [*Introduction to Kdb+* on disk data tutorial](http://code.kx.com/q4m3/14_Introduction_to_Kdb+/)
* [q language reference](https://code.kx.com/q/ref/)
* [Online training courses](http://training.aquaq.co.uk/)
* [TorQ production framework](https://github.com/AquaQAnalytics/TorQ)