1455 lines
64 KiB
Markdown
1455 lines
64 KiB
Markdown
# JOE: Hacker's Guide
|
|
|
|
[TOC]
|
|
|
|
## Character classes
|
|
|
|
The data structures for representing character classes such as \[a-z0-9\] or
|
|
\[\\w\\d\] are in [cclass.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.c) and [cclass.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/cclass.h). These are used by the syntax
|
|
highligher, the regular expression matcher, the keyboard mapping table and
|
|
for implementing JOE's ctype routines (such as joe_iswupper).
|
|
|
|
Structures are provided for sets (is a character in a particular set?), mappings
|
|
to integers (which integer is associated with this character?) and mappings to
|
|
pointers (which pointer is associated with this character?).
|
|
|
|
Two different sets of algorithms are provided: one set based on intervals
|
|
and another based on radix trees.
|
|
|
|
### Interval arrays
|
|
|
|
An interval is a simple range of character codes in the following structure:
|
|
|
|
struct interval {
|
|
int first, last;
|
|
};
|
|
|
|
If A-Z is be represented, then first is 65 and last is 90. If just the
|
|
number zero is to be represented, then first and last are both 48.
|
|
|
|
An array of these represents a set. If the array is sorted, you can use a
|
|
binary search to determine if a character belongs to a set. Two functions
|
|
supporting this are provided:
|
|
|
|
/* Sort an struct interval array */
|
|
void interval_sort(struct interval *array, ptrdiff_t size);
|
|
|
|
/* Test if character ch is in a sorted interval array using binary search.
|
|
Returns -1 if not found, or index to matching struct interval */
|
|
ptrdiff_t interval_test(struct interval *array, ptrdiff_t size, int ch);
|
|
|
|
These functions are no longer used in the editor since they have been
|
|
superseded by radix trees. However, facts about Unicode use arrays of
|
|
intervals, see [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c). Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
|
|
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website. Arrays of intervals provide
|
|
the source data for the radix trees. The struct Cclass includes an
|
|
interval array as the source data before it is optimized into a struct Rset.
|
|
|
|
### Interval lists
|
|
|
|
The next data structure in cclass.c is a linked list of intervals:
|
|
|
|
struct interval_list {
|
|
struct interval_list *next; /* Next item in list */
|
|
struct interval interval; /* Range of characters */
|
|
void *map;
|
|
};
|
|
|
|
This provides a way to create a mapping between characters and pointers
|
|
(void *map). For example, a key sequence mapping table (a KMAP) can be
|
|
implemented this way. Map could point to either a MACRO bound to a key, or
|
|
to a KMAP for a sub-map (for example a top-level map would have ^K bound to
|
|
a submap which has H bound to the help command).
|
|
|
|
The following functions are provided to handle interval lists:
|
|
|
|
/* Construct a struct interval_list */
|
|
struct interval_list *mkinterval(
|
|
struct interval_list *next,
|
|
int first, int last,
|
|
void *map);
|
|
|
|
/* Free an interval list (frees all items in the list) */
|
|
void rminterval(struct interval_list *item);
|
|
|
|
/* Add a single interval to an interval list */
|
|
struct interval_list *interval_add(
|
|
struct interval_list *interval_list,
|
|
int first, int last,
|
|
void *map);
|
|
|
|
/* Add an array of intervals (a character class) to an interval list */
|
|
struct interval_list *interval_set(
|
|
struct interval_list *list,
|
|
struct interval *array, int size,
|
|
void *map);
|
|
|
|
/* Look up single character in an interval list, return what it's mapped to */
|
|
void *interval_lookup(
|
|
struct interval_list *list,
|
|
void *dflt,
|
|
int ch);
|
|
|
|
/* Print interval list to the log buffer for debugging */
|
|
void interval_show(struct interval_list *list);
|
|
|
|
Interval_add is a complex funtion which adds a mapping to an interval list.
|
|
You pass it the address of the first item in the list (or NULL) and it
|
|
returns an edited list. Interval_add will override any mappings which
|
|
already exist in the given range. The resulting list is always sorted and
|
|
the items in it are non-overlapping. It might add, delete or modify items
|
|
to the list.
|
|
|
|
Interval_set is just like interval_add, but takes an array of intervals (as
|
|
previously discussed) instead of a single interval. For example, to create
|
|
a mapping from all Unicode numbers to a particular MACRO, you could pass the
|
|
Nd_table from unicat.c to it.
|
|
|
|
Interval_lookup returns the pointer mapped to the character, or if none is
|
|
mapped, it returns dflt.
|
|
|
|
### Radix trees
|
|
|
|
These sets of radix tree structures and associated methods are provided:
|
|
|
|
* Rset: for sets
|
|
* Rtree: for maps from characters to pointers
|
|
* Rmap: for maps from characters to integers
|
|
|
|
The best way to understand a radix tree is to study one of the lookup
|
|
functions (slightly simplified):
|
|
|
|
void *rtree_lookup_unopt(struct Rtree *r, int ch)
|
|
{
|
|
int a, b, c, d;
|
|
int idx;
|
|
|
|
a = (ch >> TOPSHIFT);
|
|
b = (SECONDMASK & (ch >> SECONDSHIFT));
|
|
c = (THIRDMASK & (ch >> THIRDSHIFT));
|
|
d = (LEAFMASK & (ch >> LEAFSHIFT));
|
|
|
|
idx = r->top.entry[a];
|
|
if (idx != -1) {
|
|
idx = r->second.table.b[idx].entry[b];
|
|
if (idx != -1) {
|
|
idx = r->third.table.c[idx].entry[c];
|
|
if (idx != -1)
|
|
return r->leaf.table.d[idx].entry[d];
|
|
}
|
|
}
|
|
|
|
return NULL;
|
|
}
|
|
|
|
The idea is to break the 21 bit character code into four parts: a (7 bits),
|
|
b (5 bits), c (5 bits) and d (4 bits) and use these to traverse a four-level
|
|
tree. The top-level table is embedded in the struct Rtree itself. Tables
|
|
from the other levels (the two mid levels and the leaf level) are allocated
|
|
out of struct Levels. The top and mid-level tables are arrays of 16-bit
|
|
shorts.
|
|
|
|
The number of levels and the size of each level was chosen as a compromise
|
|
between speed and size. We certainly want radix tree lookups to be faster
|
|
than interval array binary searches (otherwise why bother?), so we can't
|
|
have too many levels. On the other hand, radix trees should not use too
|
|
much memory. So for example a single level of 2^21 entries is out.
|
|
|
|
#### De-duplication
|
|
|
|
The leaf tables often contain the same data, so we de-duplicate them such
|
|
that many of the second to last level table entries index to the same leaf
|
|
table. This is such an effective optimization that the resulting radix tree
|
|
ends up being around the same size as an equivalent interval array (an
|
|
interval array with an array of the same number of entries containing
|
|
associated data (ints or pointers)).
|
|
|
|
The second to last level also has duplicate tables, but not many so it's not
|
|
de-duplicated.
|
|
|
|
De-deplication happens progressively during the build, and as a second pass
|
|
optimization step. Why do it progressively if there is a whole second pass?
|
|
There is an important reason for this: shorts are used for the second to
|
|
last level table indexes. If you don't deduplicate progressively, you can
|
|
have more than 64K leaf tables before optimization and overflow the short.
|
|
There is a second reason: lower memory usage tends to equate to faster
|
|
performance. For this reason it's preferable to keep the memory usage low
|
|
during the build.
|
|
|
|
The progressive deduplication algorithm is simple: if the most recently
|
|
completed leaf table matches the previously completed leaf table, we use the
|
|
previous one in its place.
|
|
|
|
There is one big complication: there is no strong concept of a completed
|
|
leaf table. A leaf table is assumed to be complete when the last entry has
|
|
been written to. But there is nothing preventing previous data from being
|
|
overwritten, or for tables to be made in reverse order (this can happen in a
|
|
regex character class like [zyxw...]). It means that a de-duplicated table
|
|
might have to re-duplicated. There is a reference count on each leaf table
|
|
which is incremented every time that table is used during de-duplication.
|
|
If the reference count is not equal to one during a table write, it is first
|
|
copied to a new table (it's a copy on write scheme).
|
|
|
|
The second-pass deduplication works like this: a hash table of all leaf
|
|
tables is built. If a duplicate table is detected during the hash table
|
|
build, it is deduplicated (not added to hash table).
|
|
|
|
#### Anglocentrism
|
|
|
|
There is one further optimization: we provide faster access to the first 512
|
|
character codes (so that ASCII is fast). To do this, the first of the
|
|
second to last level tables is embedded in the struct Rtree. If the code is
|
|
below 512, then the upper 5 bits index this table to find the leaf table,
|
|
and the lower 4 bits index to leaf table. Thus, there is one local access
|
|
and one indirect access for codes below 512.
|
|
|
|
### Rset
|
|
|
|
struct Rset radix trees are used for simple sets. In this case there are no
|
|
leaf tables. Instead, the second to last level table contains bit maps. If
|
|
a bit is a one, then the corresponding character is in the set. Each bit
|
|
map is 16 bits: the shorts that make up the second to last level table are
|
|
used for these bitmaps instead of as leaf-table indexes.
|
|
|
|
Here are the provided methods:
|
|
|
|
/* Initialize a struct Rset */
|
|
void rset_init(struct Rset *r);
|
|
|
|
/* Clear a struct Rset: anything malloced for it is freed, but the
|
|
* struct Rset itself is not freed (it could be embedded in another
|
|
* structure). */
|
|
void rset_clr(struct Rset *r);
|
|
|
|
/* Test if a character is in a set: this one can be called before
|
|
* before rset_opt is called. */
|
|
int rset_lookup_unopt(struct Rset *r, int ch);
|
|
|
|
/* Add a range of characters to a set: ch to che */
|
|
void rset_add(struct Rset *r, int ch, int che);
|
|
|
|
/* Add an array of ranges to a set */
|
|
void rset_set(struct Rset *r, struct interval *array, ptrdiff_t size);
|
|
|
|
/* De-duplicate leaf tables and set up embedded table for quicker
|
|
* access for characters below 512 */
|
|
void rset_opt(struct Rset *r);
|
|
|
|
/* Test if a character is in a set: only use this after rset_opt has
|
|
* been called */
|
|
int rset_lookup(struct Rset *r, int ch);
|
|
|
|
/* Print Rset in human readable form to startup log for debugging */
|
|
void rset_show(struct Rset *r);
|
|
|
|
Rset_add originally accepted just a single character to add, not a range.
|
|
This turned out to be too inefficient: the editor was using a lot of CPU
|
|
time during start-up just to build the standard character class tables.
|
|
Therefore, it has been modified to loop over the range for a fast inner loop
|
|
and to lift function call overhead out of the original loop in rset_set.
|
|
|
|
### Rtree
|
|
|
|
struct Rtree provides a mapping from characters to pointers. For example,
|
|
this is used in the regular expression code to traverse states.
|
|
|
|
Here are the provided methods:
|
|
|
|
/* Initialize a struct Rset */
|
|
void rtree_init(struct Rtree *r);
|
|
|
|
/* Clear a struct Rtree: anything malloced for it is freed, but the
|
|
* struct Rtree itself is not freed (it could be embedded in another
|
|
* structure). */
|
|
void rtree_clr(struct Rtree *r);
|
|
|
|
/* Test if a character is in a tree: return pointer bound to it
|
|
* or NULL. This can be called before rtree_opt. */
|
|
void *rtree_lookup_unopt(struct Rtree *r, int ch);
|
|
|
|
/* Add a binding to from a range of characters to a pointer */
|
|
void rtree_add(struct Rtree *r, int ch, int che, void *map);
|
|
|
|
/* Add a binding to from each range of characters in the given array to a pointer */
|
|
void rtree_set(struct Rtree *r, struct interval *array, ptrdiff_t len, void *map);
|
|
|
|
/* Add a binding to from each range of characters in the given
|
|
* interval_list list to its pointer */
|
|
void rtree_build(struct Rtree *r, struct interval_list *l);
|
|
|
|
/* De-duplicate leaf tables and set up embedded table for quicker
|
|
* access for characters below 512 */
|
|
void rtree_opt(struct Rtree *r);
|
|
|
|
/* Test if a character is in a tree: return pointer bound to it
|
|
* or NULL. Only call this after rtree_opt has been called. */
|
|
void *rtree_lookup(struct Rtree *r, int ch);
|
|
|
|
/* Print Rtree in human readable form to startup log for debugging */
|
|
void rtree_show(struct Rtree *r);
|
|
|
|
They are the same as the rset methods, except that a pointer is provided.
|
|
Also, a method is provided to convert an interval_list into an Rtree.
|
|
|
|
### Rmap
|
|
|
|
struct Rmap provides a mapping from characters to integers. For example,
|
|
this is used for Unicode simple case conversion. The integer provides an
|
|
offset to be added to the character to convert it from one case to another.
|
|
It's important to use an offset, not a direct replacement character for this
|
|
so that leaf nodes have the same values and can be de-duplicated.
|
|
|
|
Here are the provided methods:
|
|
|
|
void rmap_init(struct Rtree *r);
|
|
|
|
void rmap_clr(struct Rtree *r);
|
|
|
|
int rmap_lookup_unopt(
|
|
struct Rtree *r,
|
|
int ch,
|
|
int dflt
|
|
);
|
|
|
|
void rmap_add(
|
|
struct Rtree *r,
|
|
int ch, int che,
|
|
int map,
|
|
int dflt
|
|
);
|
|
|
|
void rmap_set(
|
|
struct Rtree *r,
|
|
struct interval *array,
|
|
ptrdiff_t len,
|
|
int map,
|
|
int dflt
|
|
);
|
|
|
|
void rmap_opt(struct Rtree *r);
|
|
|
|
int rmap_lookup(struct Rtree *r, int ch, int dflt);
|
|
|
|
void rmap_show(struct Rtree *r);
|
|
|
|
They are the same methods as for struct Rtree, except that you provide
|
|
'dflt', the value to return if there is no binding for the character.
|
|
|
|
### struct Cclass
|
|
|
|
Finally we have struct Cclass, which is made up of an interval array and a
|
|
struct Rset and which provides set operations for character classes. The
|
|
interval array is used for the set operations, then the array is optimized
|
|
into a struct Rset for quick character lookups.
|
|
|
|
The Unicode database (made from the data in unicat.c and methods in
|
|
unicode.c) uses this data structure. For example, you can get a character
|
|
class containing all letters with:
|
|
|
|
struct Cclass *letters = unicode("L");
|
|
|
|
The set operations allow you to construct complex classes, for example the
|
|
character class for \\i, the initial character of identifiers is contructed
|
|
like this:
|
|
|
|
cclass_init(cclass_alnum_);
|
|
cclass_union(cclass_alnum_, unicode("L"));
|
|
cclass_union(cclass_alnum_, unicode("Pc"));
|
|
/* cclass_union(cclass_alpha_, unicode("Sc")); */
|
|
cclass_union(cclass_alpha_, unicode("Nl"));
|
|
cclass_union(cclass_alnum_, unicode("Mn"));
|
|
cclass_union(cclass_alnum_, unicode("Mc"));
|
|
cclass_union(cclass_alnum_, unicode("Nd"));
|
|
cclass_add(cclass_alnum_, 0x200c, 0x200d);
|
|
cclass_opt(cclass_alnum_);
|
|
|
|
The following methods are provided:
|
|
|
|
/* Initialize a character class */
|
|
void cclass_init(struct Cclass *cclass);
|
|
|
|
/* Free memory used by a Cclass (does not free cclass itself) */
|
|
void cclass_clr(struct Cclass *cclass);
|
|
|
|
/* Add a range to a character class */
|
|
void cclass_add(struct Cclass *cclass, int first, int last);
|
|
|
|
/* Remove a range from a character class */
|
|
void cclass_sub(struct Cclass *cclass, int first, int last);
|
|
|
|
/* Merge one character class into another */
|
|
void cclass_union(struct Cclass *cclass, struct Cclass *n);
|
|
|
|
/* Merge an interval array into a character class */
|
|
void cclass_merge(struct Cclass *cclass, struct interval *intervals, int len);
|
|
|
|
/* Subtract one character class from another */
|
|
void cclass_diff(struct Cclass *cclass, struct Cclass *n);
|
|
|
|
/* Compute unicode inverse of a character class, for [^a-z] */
|
|
void cclass_inv(struct Cclass *cclass);
|
|
|
|
/* Lookup a character in a character class using binary search.
|
|
Return true if character is in the class. */
|
|
int cclass_lookup_unopt(struct Cclass *m, int ch);
|
|
|
|
/* Optimize a character class for fast lookup with cclass_lookup.
|
|
* Generates a radix tree version of the character class. */
|
|
void cclass_opt(struct Cclass *cclass);
|
|
|
|
/* Return true if character is in the class */
|
|
int cclass_lookup(struct Cclass *cclass, int ch);
|
|
|
|
/* Print character class to startup log */
|
|
void cclass_show(struct Cclass *cclass);
|
|
|
|
/* Remap a character class from Unicode to a specific locale */
|
|
struct Cclass *cclass_remap(struct Cclass *m, struct charmap *map);
|
|
|
|
Cclass_remap is used to convert a character class from Unicode to some other
|
|
character set (represented by struct charmap *). For example, to get
|
|
letters in the KOI8-R 8-bit character set, use:
|
|
|
|
struct Cclass *r = cclass_remap(unicode("L"), find_charmap("KOI8-R"));
|
|
|
|
Cclass_remap remembers the conversion to save work, so the input struct
|
|
Cclass is expected to remain around forever. Basically this is to be used
|
|
only for built-in character classes.
|
|
|
|
## Regular expressions
|
|
|
|
As of verison 4.1, JOE uses an enhanced version of [Thompson's NFA matching algorithm](https://swtch.com/~rsc/regexp/regexp1.html).
|
|
|
|
The code is in [regex.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.c) and [regex.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/regex.h).
|
|
|
|
The regular expression matcher is a subroutine of the the larger text search
|
|
algorithm in JOE. For example, text search will use [Boyer-Moore](https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_string_search_algorithm)
|
|
to find a leading prefix of the regular expression before running the
|
|
regular expression matcher. A leading prefix is leading text present in
|
|
all of the different possible text which can match the regular expression.
|
|
For example, in 'hello(foo|bar)', the leading text is 'hello', but it's empty
|
|
in 'hello|there'. One of the jobs of the regular expression code is to
|
|
identify this leading text.
|
|
|
|
The API for the regular expression library is simple:
|
|
|
|
/* Compile an expression s/len into a struct regcomp
|
|
* Check regcomp->err for compile errors.
|
|
*/
|
|
struct regcomp *joe_regcomp(
|
|
struct charmap *charmap, /* The character set of the expression */
|
|
const char *s, ptrdiff_t len, /* The regular expression */
|
|
int icase, /* Set to ignore case */
|
|
int stdfmt, /* Set for standard syntax expressions, vs. JOE syntax */
|
|
int debug /* Set to have debug information printed to startup log */
|
|
);
|
|
|
|
/* Free a compiled regular expression */
|
|
void joe_regfree(struct regcomp *r);
|
|
|
|
struct regmatch {
|
|
off_t rm_so, rm_eo; /* Starting and ending byte offset of a match, or -1 if no match */
|
|
}
|
|
|
|
/* Returns 0 if buffer at p matches compiled expression r
|
|
* 'matches' is filled in with submatch addresses
|
|
* 'nmatch' has size of matches array
|
|
*/
|
|
int joe_regexec(
|
|
struct regcomp *r, /* Compiled regular expression */
|
|
P *p, /* Pointer to edit buffer */
|
|
int nmatch, /* Size of matches array */
|
|
struct regmatch *matches, /* Locations of sub-matches */
|
|
int fold /* Set to ignore case */
|
|
);
|
|
|
|
|
|
Joe_regcomp converts the regular expression into an [AST](https://en.wikipedia.org/wiki/Abstract_syntax_tree)
|
|
with an operator precedence parser and then generates a byte-code program
|
|
representation of the regular expression's [NFA](https://en.wikipedia.org/wiki/Nondeterministic_finite_automaton).
|
|
Joe_regexec simulates the byte-code program with an arbitrary number of threads machine.
|
|
It feeds characters from the buffer into each active thread until one
|
|
of the threads encounters the END instruction, there are no more threads, or
|
|
the end of the buffer is reached. A side effect is the buffer pointer __P__
|
|
is advanced to the first character not part of the matching text.
|
|
|
|
If the 'v' flag is included in JOE's search and replace options prompt, the
|
|
AST and program are sent to JOE's startup log so that they can be viewed
|
|
(with ESC x showlog). For example, here is the log when you try to search
|
|
for 'hello\\(foo\\|bar\\)\\*there':
|
|
|
|
~~~~~
|
|
Parse tree:
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
,
|
|
'h'
|
|
'e'
|
|
'l'
|
|
'l'
|
|
'o'
|
|
*
|
|
{
|
|
|
|
|
,
|
|
,
|
|
'f'
|
|
'o'
|
|
'o'
|
|
,
|
|
,
|
|
'b'
|
|
'a'
|
|
'r'
|
|
't'
|
|
'h'
|
|
'e'
|
|
'r'
|
|
'e'
|
|
Leading prefix 'hello'
|
|
NFA-program:
|
|
PC INSN
|
|
-- ----
|
|
0: 'h'
|
|
4: 'e'
|
|
8: 'l'
|
|
12: 'l'
|
|
16: 'o'
|
|
20: fork 92
|
|
28: bra 0
|
|
36: fork 64
|
|
44: 'f'
|
|
48: 'o'
|
|
52: 'o'
|
|
56: jump 76
|
|
64: 'b'
|
|
68: 'a'
|
|
72: 'r'
|
|
76: ket 0
|
|
84: fork 28
|
|
92: 't'
|
|
96: 'h'
|
|
100: 'e'
|
|
104: 'r'
|
|
108: 'e'
|
|
112: end
|
|
Total size = 116
|
|
~~~~~
|
|
|
|
The "Parse tree" (AST) is printed in an indented format: A line begins with an
|
|
operator, then all succeeding lines with deeper indentation are its operands.
|
|
The ',' operator means succession: match the first operand, then match the
|
|
second operand. The '{' operator means record the submatch of its
|
|
argument. The '*' operator means 0 or more matches of its argument. The
|
|
'|' operator means match one argument or the other.
|
|
|
|
The "Leading prefix" is the part of the regular expression which can use
|
|
Boyer-Moore search.
|
|
|
|
The "NFA-program" shows the dis-assembled program. A letter like 'h' is an
|
|
instruction to match the letter itself. 'fork' is an instruction which
|
|
creates a new thread beginning at the specified program counter value.
|
|
'jump' means the thread should goto the specified PC value. 'bra' and 'ket'
|
|
indicate that the start and ending location of submatch should be recorded.
|
|
|
|
Here is the log from the regular expression '-\\i\\c-' to show you how
|
|
character classes appear (\\i is the initial character of an identifier, and
|
|
\\c is the following character of an identifier):
|
|
|
|
~~~~~
|
|
Parse tree:
|
|
,
|
|
,
|
|
,
|
|
'-'
|
|
[
|
|
[41..5a] [5f..5f] [61..7a]
|
|
[
|
|
[30..39] [41..5a] [5f..5f] [61..7a]
|
|
'-'
|
|
Leading prefix '-'
|
|
NFA-program:
|
|
PC INSN
|
|
-- ----
|
|
0: '-'
|
|
4: class [41..5a] [5f..5f] [61..7a]
|
|
12: class [30..39] [41..5a] [5f..5f] [61..7a]
|
|
20: '-'
|
|
24: end
|
|
Total size = 28
|
|
~~~~~
|
|
|
|
Note that this is showing the ASCII character encoding so that the character
|
|
classes are not too big. With the UTF-8 character encoding, the character
|
|
classes will include the full Unicode definitions of \\i and \\c.
|
|
|
|
### Functions
|
|
|
|
/* Parse a character or escape sequence from ptr/len.
|
|
* Returns with character or -256 and cat (if not NULL) filled in with a character class
|
|
*/
|
|
int escape(int utf8, const char **ptr, ptrdiff_t *len, struct Cclass **cat);
|
|
|
|
Escape reads one character from a string. If the character is an escape
|
|
sequence, it's parsed into either a single character or a character class.
|
|
Escape is used in other parts of the editor: for example, it's used to
|
|
parse character lists of syntax highlighting files.
|
|
|
|
/* Conventional syntax regex */
|
|
static int do_parse_conventional(struct regcomp *g, int prec, int fold);
|
|
|
|
do_parse_conventional is the standard syntax regular expression parser. It
|
|
records the AST in regcomp->nodes.
|
|
|
|
/* JOE syntax regex */
|
|
static int do_parse(struct regcomp *g, int prec, int fold);
|
|
|
|
do_parse is the JOE syntax regular expression parser. The JOE syntax is
|
|
just like the regular syntax except that regular expression special
|
|
characters must be escaped.
|
|
|
|
/* Determine leading prefix of search string */
|
|
/* We can use boyer-moore on the prefix if:
|
|
Character set is byte coded.
|
|
Character set is UTF-8 coded, but no folding is requested.
|
|
Character set is UTF-8 coded, folding is requested, but character is below 128 */
|
|
|
|
static int extract(struct regcomp *g, int no, int fold);
|
|
|
|
Extract finds the leading prefix for Boyer-Moore from the AST.
|
|
|
|
/* Convert parse-tree into code with infinite loops */
|
|
static void codegen(struct regcomp *g, int no, int *end);
|
|
|
|
Codegen converts the AST into byte-code. Functions from [frag.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.c) / [frag.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/frag.h)
|
|
write instructions to the memory image and handle alignment. They also provide a
|
|
mechanism for linking the program (to handle forward jumps). Basically a
|
|
linked-list of all the jump intructions which target the same destination is
|
|
created right in the memory image, using the jump instruction operands as
|
|
the link pointers. When the jump destination is finally known, the
|
|
link-list is traversed and the link pointers are replaced with the
|
|
destination address.
|
|
|
|
### Matcher
|
|
|
|
joe_regexec has the arbitrary number of threads simulator. It has two banks
|
|
of threads, A and B, and ping-pongs between them for each input character.
|
|
It feeds the next input character to all threads in bank A. It executes as
|
|
many instructions as possible for each thread, and stops when the thread
|
|
finally accepts the character or rejects it. If the thread accepts the
|
|
character, it will be active for the next character, so it is moved to bank
|
|
B. Otherwise, the thread will die when we transition to bank B. If a
|
|
thread in bank A forks, the newly created thread is appended to bank A so
|
|
that the current character is fed to it before switching. Once all threads
|
|
have been fed the character, we switch to bank B and feed in the next
|
|
character.
|
|
|
|
This scheme is simple and fast, but there are complications. First has to
|
|
do with sub-match addressing. Without modification, the matcher will find
|
|
all possible solutions to expressions like 'x(a\*)(a\*)y'. This means that
|
|
at the end of the match, there will be a thread for each of these
|
|
solutions. If the input is xaaaay, it will find x()(aaaa)y, x(a)(aaa)y,
|
|
x(aa)(aa)y, x(aaa)(a)y, and x(aaaa)()y.
|
|
|
|
Research topic: due to this property, I think the Thompson NFA matcher can
|
|
be extended to support back-references.
|
|
|
|
But we don't want all solutions, we want just one solution, the one where
|
|
the left-most submatch is longest: x(aaaa)()y. To select the preferred
|
|
solution, joe_regexec has a helper function:
|
|
|
|
static int add_thread(
|
|
struct thread *pool, /* Thread pool */
|
|
int l, int le, /* Current start and end of pool */
|
|
unsigned char *pc, /* Current thread's pc */
|
|
Regmatch_t *pos, /* Submatch addresses */
|
|
int bra_no, /* Size of submatch addresses */
|
|
int *stack, int sp /* Current thread's stack */
|
|
);
|
|
|
|
This adds the thread to the end of the new pool B (which is at
|
|
pool[le]). Before it adds, it checks if there are other existing threads
|
|
(between l and le) at the same program counter. If there are, then we pick
|
|
the one with the "best" submatch, the one with the longest to the left.
|
|
|
|
The second complication is JOE's extension \\! (which used to be \\c in
|
|
versions before 4.1). \\! is like '.': it matches one character. However,
|
|
if that one character is the opening bracket of an expression, it matches
|
|
the entire expression. \\! will match all of "(1+(2+3))". To support this,
|
|
each thread needs a stack to record its progress in parsing the balanced
|
|
expression.
|
|
|
|
## Unicode database
|
|
|
|
[unicode.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.c), [unicode.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicode.h) and [unicat.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/unicat.c) provide
|
|
JOE's database of Unicode facts. Unicat.c is generated by [util/uniproc.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/util/uniproc.c) from the
|
|
raw data provided on the [unicode.org](ftp://ftp.unicode.org/Public/UNIDATA/) website.
|
|
|
|
There is a table in unicat.c which has the name of, pointer to and size of
|
|
each other table.
|
|
|
|
struct unicat unicat[] = {
|
|
{ "Co", 3, Co_table, 0 },
|
|
{ "Cs", 1, Cs_table, 0 },
|
|
{ "Zp", 1, Zp_table, 0 },
|
|
...
|
|
{ "Domino Tiles", 1, uniblocks + 241, 0 },
|
|
{ "Mahjong Tiles", 1, uniblocks + 240, 0 },
|
|
{ "Arabic Mathematical Alphabetic Symbols", 1, uniblocks + 239, 0 },
|
|
...
|
|
|
|
The function unicode looks up one of these tables and returns a struct
|
|
Cclass version of it:
|
|
|
|
struct Cclass *unicode(const char *name);
|
|
|
|
If a single letter name is passed to the above function, a Cclass containing
|
|
all two-letter named tables with the same matching first letter is returned.
|
|
So for example, unicode("L") returns a Cclass of all letters, including
|
|
"Lu", "Ll", "Lt", etc.
|
|
|
|
Joe_iswinit initializes Cclasses for JOE's isw ctype functions. The
|
|
following functions are provided:
|
|
|
|
int joe_iswupper(struct charmap *,int c);
|
|
int joe_iswlower(struct charmap *,int c);
|
|
|
|
int joe_iswalpha(struct charmap *,int c);
|
|
int joe_iswalpha_(struct charmap *,int c); /* used for \i */
|
|
|
|
int joe_iswalnum(struct charmap *,int c);
|
|
int joe_iswalnum_(struct charmap *,int c); /* used for \c */
|
|
|
|
int joe_iswdigit(struct charmap *,int c);
|
|
int joe_iswspace(struct charmap *,int c);
|
|
int joe_iswctrl(struct charmap *,int c);
|
|
int joe_iswpunct(struct charmap *,int c);
|
|
int joe_iswgraph(struct charmap *,int c);
|
|
int joe_iswprint(struct charmap *,int c);
|
|
int joe_iswxdigit(struct charmap *,int c);
|
|
int joe_iswblank(struct charmap *,int c);
|
|
|
|
Another table in unicat.c is the width_table. This indicates which
|
|
characters are double-wide on terminal emulators. It is used to implement
|
|
the joe_wcwidth function:
|
|
|
|
int joe_wcwidth(int wide,int c);
|
|
|
|
'wide' should be true if the character set of c is Unicode. Both the
|
|
terminal and the file must be Unicode (UTF-8 or UTF-16) or double-wide
|
|
characters to be printed.
|
|
|
|
Another set of tables in unicat.c includes fold_table and fold_repl for case
|
|
folding for string comparisons. Fold_table is an array of ranges. If a
|
|
character is in one of these ranges, then the corresponding entry in
|
|
fold_repl is a string (of 1 to 3 characters) which should replace it for
|
|
case folding. The first character of the string is an offset which should
|
|
be added to the original character to get the first character of the
|
|
replacement.
|
|
|
|
rtree_fold is an struct Rtree representation of the fold table. It returns
|
|
either a single character replacement offset, or an index into fold_repl if the
|
|
character is to be replaced by a string longer than one character.
|
|
|
|
Lowerize can be called to convert a UTF-32 string into its case folded
|
|
version:
|
|
|
|
int *lowerize(
|
|
int *d, ptrdiff_t len, /* Destination string and length */
|
|
const int *s /* Source string */
|
|
);
|
|
|
|
Unicat.c also provides simple case conversion tables: toupper_table,
|
|
tolower_table and totitle_table along with the corresponding replacement
|
|
tables: toupper_cvt, tolower_cvt and totitle_cvt.
|
|
|
|
These are used to implement joe_towupper and joe_towlower:
|
|
|
|
int joe_towupper(struct charmap *,int c); /* Convert to uppercase */
|
|
int joe_towlower(struct charmap *,int c); /* Convert to lowercase */
|
|
|
|
Digval is provided to return the decimal value of any unicode digit:
|
|
|
|
int digval(int ch)
|
|
|
|
This function uses Nd_table from unicat.c. Nd_table has the property that
|
|
each range represents a single set of decimal digits. Normally adjacent
|
|
ranges would be combined together, but in this case they are separated to
|
|
support digval.
|
|
|
|
Finally unicode.c provides the following function:
|
|
|
|
int unictrl(int c);
|
|
|
|
Which returns true if JOE should treat the character as a control character
|
|
which should not be sent directly to the terminal. Instead the character
|
|
value is displayed in hexadecimal: <FFFFF\>. Basically this function
|
|
returns true if the character is not printable- if it's not in cclass_print.
|
|
|
|
## Character sets
|
|
|
|
[charmap.c](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/charmap.c) and [charmap.h](http://sourceforge.net/p/joe-editor/mercurial/ci/default/tree/joe/charmap.h) implement JOE's representation of character
|
|
sets.
|
|
|
|
Each character set is represented by a "struct charmap *". A charmap
|
|
provides information about a specific character set. It provides these
|
|
methods:
|
|
|
|
/* True if character is punctuation */
|
|
int joe_ispunct(struct charmap *map, int ch);
|
|
|
|
/* True if character is printable */
|
|
int joe_isprint(struct charmap *map, int ch);
|
|
|
|
/* True if character is a space */
|
|
int joe_isspace(struct charmap *map, int ch);
|
|
|
|
/* True if character is first of an identifier */
|
|
int joe_isalpha_(struct charmap *map, int ch);
|
|
|
|
/* True if character is rest of an identifier */
|
|
int joe_isalnum_(struct charmap *map, int ch);
|
|
|
|
/* True if character is blank: space or tab */
|
|
int joe_isblank(struct charmap *map, int ch);
|
|
|
|
/* True if character is blank tab CR LF FF or NUL */
|
|
int joe_isspace_eos(struct charmap *map, int ch);
|
|
|
|
/* Convert to lowercase */
|
|
int joe_tolower(struct charmap *map, int ch);
|
|
|
|
/* Convert to uppercase */
|
|
int joe_tolower(struct charmap *map, int ch);
|
|
|
|
/* Convert from character set to Unicode */
|
|
int to_uni(struct charmap *map, int ch);
|
|
|
|
/* Convert from Unicode to character set */
|
|
int from_uni(struct charmap *map, int ch);
|
|
|
|
There are two types of character sets: Unicode and 8-bit. If the character
|
|
set is Unicode, then charmap->type will be true.
|
|
|
|
Joe_locale intializes some global character maps:
|
|
|
|
extern struct charmap *locale_map; /* Character map of terminal */
|
|
extern struct charmap *utf8_map; /* UTF-8 character map */
|
|
extern struct charmap *utf16_map; /* UTF-16 character map */
|
|
extern struct charmap *utf16r_map; /* UTF-16 reversed character map */
|
|
extern struct charmap *ascii_map; /* Plain ASCII map */
|
|
|
|
Find_charmap can be used to find a named character set:
|
|
|
|
/* Find (load if necessary) a character set */
|
|
struct charmap *find_charmap(const char *name);
|
|
|
|
If the character set is not already built into JOE, find_charmap tries to
|
|
load it from a file. It looks in $HOME/.joe/charmaps and /usr/share/joe/charmaps.
|
|
|
|
The following character sets are built into JOE:
|
|
|
|
ASCII, ISO-8859-1, ISO-8859-2, ISO-8859-3, ISO-8859-4,
|
|
ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8,
|
|
ISO-8859-9, ISO-8859-10, ISO-8859-11, ISO-8859-13,
|
|
ISO-8859-14, ISO-8859-15, ISO-8859-16, ISO-8859-1,
|
|
KOI-8, KOI8-R, KOI8-T, KOI8-U, CP1251, CP1256,
|
|
CP1255, ARMSCII-8, TIS-620, GEORGIAN-PS
|
|
|
|
Guess_map tries to determine the character map based on the contents of a
|
|
buffer:
|
|
|
|
struct charmap *guess_map(const char *buf, ptrdiff_t len);
|
|
|
|
It selects the locale_map unless it detects valid UTF-8 sequences in the
|
|
buffers.
|
|
|
|
Finally, my_iconv is provided to convert from one character set to another:
|
|
|
|
void my_iconv(
|
|
char *dest, ptrdiff_t destsiz, struct charmap *dest_map,
|
|
const char *src,struct charmap *src_map
|
|
);
|
|
|
|
This is used in a few places in JOE, but most of JOE uses to_uni and
|
|
from_uni to convert between character sets a character at a time.
|
|
|
|
## Coroutines in JOE
|
|
|
|
<p>JOE 3.6 now uses co-routines to help reduce the amount of event-driven
|
|
code which has to be written. All previous versions of JOE were entirely in
|
|
the event driven coding style, which made user dialog code a chore to write.
|
|
Now, with the help of co-routines, dialog code can be written as old
|
|
fashioned blocking code which is much cleaner, even though JOE can have
|
|
multiple un-finished dialogs on the screen.</p>
|
|
|
|
<p>The coroutine library has only three (now four) functions:</p>
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0> <colgroup> <col width="300"> <tbody>
|
|
|
|
<tr valign="top"><td>int co_call(int (*func)(va_list args),...);</td><td>Call a function
|
|
as a co-routine.</td></tr>
|
|
|
|
<tr valign="top"><td>int co_yield(Coroutine *t, int val);</td><td>Suspend current
|
|
co-routine and return to caller with given return value.</td></tr>
|
|
|
|
<tr valign="top"><td>int co_resume(Coroutine *t, int val);</td><td>Suspend current
|
|
co-routine and resume specified one with given return value.</td></tr>
|
|
|
|
<tr valign="top"><td>int co_suspend(Coroutine *u, int val);</td><td>Suspend
|
|
current co-routine (remembering its suspended invoking coroutines) and
|
|
resume the top-level. The current coroutine is stored in u. When u is
|
|
resumed and returns, the suspended chain is resumed (the resumer is
|
|
continued when the chain tries to continue the top level.</td></tr>
|
|
|
|
</tbody> </table>
|
|
|
|
<p>co_call() invokes 'func' as a co-routine- it will run on its own stack.
|
|
'func' is called immediately upon the call to co_call. All of the arguments
|
|
after func are passed as a va_list to the function. co_call calls va_start
|
|
and va_end, 'func' should call va_arg once for each passed argument (see
|
|
stdarg.h). The co-routine should not yield before it has finished reading
|
|
all of the arguments, otherwise va_end will be called early. When 'func'
|
|
returns, the co-routine is destroyed and co_call() returns with the return
|
|
value of 'func'. If the co-routine yields by calling co_yield, co_call
|
|
returns with the return value specified in the call to co_yield. Note that
|
|
co_call only ever returns once for a particular call.</p>
|
|
|
|
<p>co_yield() suspends the current co-routine and returns with the specified
|
|
return value to the caller (either to co_call or co_resume). co_yield()
|
|
writes information about the co-routine to the 'Coroutine' structure whose
|
|
address is passed to co_yield. The Coroutine structure's address is used as
|
|
a handle for the co-routine and can be passed to co_resume().</p>
|
|
|
|
<p>co_resume() suspends the current co-routine (henceforth called the
|
|
original co-routine) and resumes the specified co-routine. When the
|
|
specified co-routine returns (when the function given in the co_call that
|
|
created the co-routine returns), the original co-routine is continued and
|
|
co_resume() returns with the function's return value. If the specified
|
|
co-routine yields, the original co-routine is continued and co_resume()
|
|
returns with the return value specified in the call to co_yield.</p>
|
|
|
|
<p>co_suspend() is like co_resume, except that the top-level task is resumed
|
|
instead of the calling co-routine. This has the effect of suspending the
|
|
entire chain of co-routines which invoked the caller of co_suspend.
|
|
Co_suspend stores the suspended co_routine in another co_routine. That
|
|
co_routine will resume this one when it finally returns.</p>
|
|
|
|
<p>The co-routine library is structured this way (which is different from
|
|
the usual way where you get a co-routine handle immediately upon creation,
|
|
for example see Knuth) to encourage proper stack un-winding. Basically, a
|
|
co-routine can not be forcibly destroyed. Instead, each created co-routine
|
|
is only destroyed when the initial function returns.</p>
|
|
|
|
<p>The co-routine library is based on ucontext.h and the getcontext,
|
|
swapcontext, makecontext and setcontext system calls. If the system does
|
|
not have these calls, it uses setjmp/longjmp tricks in a highly portable
|
|
way: the stacks for each co-routine are allocated off the top of the main
|
|
stack. The main stack grows after each allocation. The stacks are never
|
|
freed, but they are reused. This works for any machine with a single normal
|
|
stack. It will not work on IA64 (two stacks) or Cray (the stack is really a
|
|
linked list).</p>
|
|
|
|
<p>I was originally going to use simple cooperative multi-threading, with
|
|
only two basic calls: sleep_on() and wake_up() (inspired by old versions of
|
|
the Linux kernel) and a top-level scheduler. Sleep_on would suspend the
|
|
current thread, and create a new thread which runs just the scheduler.
|
|
Wake_up would mark a thread as ready to run, and on the next return to the
|
|
scheduler, it would be run.</p>
|
|
|
|
<p>This turns out to not be powerful enough to implement JOE's keyboard
|
|
macro language, at least without much rewriting and inelegance. It could be
|
|
done, but the keyboard macro player would have to feed events into the
|
|
top-level scheduler or would have to be integrated with it, whereas in older
|
|
versions of JOE (even without co-routines) the macro player just calls the
|
|
edit functions (more or less) directly. The problem is that sleep_on
|
|
(called by a function asking for user input) would suspend the task all the
|
|
way up to the scheduler including the macro player, thus preventing the
|
|
macro player from simulating user input.</p>
|
|
|
|
<p>On the other hand with real co-routines, the macro player will call each
|
|
edit function as a co-routine (which I like to think of as a task with a
|
|
very clear return point). If the edit function has to yield, it returns
|
|
immediately to the macro player for the next macro step. If the yield is
|
|
due to a request for input, the macro player supplies it by calling the
|
|
recorded edit functions. If the macro wants to allow the user to provide
|
|
input (using the "query" function, also known as an interactive macro), it
|
|
can itself yield (since it is also running in a coroutine) back to the
|
|
top-level edit loop (in actuality, the macro player calls a function which
|
|
calls co_suspend() to do this). The macro player is continued when the user
|
|
finishes supplying the input (hits the return key at a prompt). Several
|
|
interactive macros can be running at once with this scheme.</p>
|
|
|
|
## Semi-Automatic Variables
|
|
|
|
<p>JOE 3.6 uses "semi-automatic" variables for strings. There is a global
|
|
"object-stack", accessed through these functions:</p>
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
|
|
<colgroup>
|
|
<col width="300">
|
|
<tbody>
|
|
<tr valign="top"><td>void *obj_malloc(int size);</td><td>Allocate a block of memory.</td></tr>
|
|
<tr valign="top"><td>void *obj_realloc(void *ptr, int size);</td><td>Change allocation size.</td></tr>
|
|
<tr valign="top"><td>void obj_free(void *ptr);</td><td>Free a specified block and all newer blocks.</td></tr>
|
|
<tr valign="top"><td>void obj_perm(void *ptr);</td><td>Mark a block as permanent, which in effect moves the block from the stack to the heap.</td></tr>
|
|
<tr valign="top"><td>int obj_len(void *ptr);</td><td>Access hidden "length" part of block (string length).</td></tr>
|
|
<tr valign="top"><td>int obj_size(void *ptr);</td><td>Access hidden "size" part of block (allocation size).</td></tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<p>These functions remember the order of allocation. When an allocated
|
|
block is freed, it and all newer blocks are freed as well. For example, in:</p>
|
|
|
|
<pre>
|
|
unsigned char *a = obj_malloc(10);
|
|
unsigned char *b = obj_malloc(10);
|
|
unsigned char *c = obj_malloc(10);
|
|
unsigned char *d = obj_malloc(10);
|
|
|
|
obj_free(b);
|
|
</pre>
|
|
|
|
<p>b, c, and d are all freed, but a remains allocated. The top-level edit
|
|
loop is structured to call obj_free after every editor command:</p>
|
|
|
|
<pre>
|
|
unsigned char *gc = obj_malloc(1);
|
|
execmd(...); /* Execute edit commands */
|
|
obj_free(gc); /* Collect garbage */
|
|
</pre>
|
|
|
|
<p>Thus, you rarely have to call obj_free- the only time you might want to do
|
|
it is if you are calling obj_malloc in a loop. This could cause a large
|
|
amount of memory to be allocated, so obj_free() should be called at the
|
|
end of each loop iteration.</p>
|
|
|
|
<p>obj_perm() marks an object as a heap object instead of a stack object.
|
|
When a heap object is freed, only itself is freed- the stack is not
|
|
modified. Also when the stack is freed, heap objects are not touched.</p>
|
|
|
|
## Variable Strings
|
|
|
|
<p>JOE uses dynamic strings built on top of the global object stack
|
|
functions. These strings automatically resize themselves to fit their
|
|
contents. They also know their own length, which allows you to have NULs in
|
|
the middle of a string. They also always have a terminating NUL to make
|
|
them compatible with C-strings. In fact they have (almost) the same type as
|
|
C-strings: unsigned char *. The length is hidden in the obj_malloc data
|
|
structure. These functions are provided:</p>
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
|
|
<colgroup>
|
|
<col width="400">
|
|
<tbody>
|
|
<tr valign="top"><td>unsigned char *vsmk(int prealloc);</td><td>Allocate a zero length string, but with enough space to grow to the specified length.</td></tr>
|
|
<tr valign="top"><td>int vslen(unsigned char *s);</td><td>Return string length or 0 if s is NULL.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsensure(unsigned char *s, int len);</td><td>Make sure there is enough space for a string of the specified length.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vstrunc(unsigned char *s, int len);</td><td>Set length of string: the string is truncated or extended with spaces to the specified length.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsncpy(unsigned char *s,int offset,unsigned char *block, int block_size);</td><td>Copy a memory block to a string. The string is extended with spaces if offset > string length.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsdupz(unsigned char *s);</td><td>Duplicate a z-string to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsdup(unsigned char *s);</td><td>Duplicate a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vscpyz(unsigned char *s,unsigned char *z);</td><td>Copy a z-string to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vscpy(unsigned char *s,unsigned char *block,int size);</td><td>Copy a memory block to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vscatz(unsigned char *s,unsigned char *z);</td><td>Append a z-string to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vscat(unsigned char *s,unsigned char *block,int size);</td><td>Append a memory block to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsadd(unsigned char *s,unsigned char c);</td><td>Append a single character to a string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsfmt(unsigned char *s,int offset,const char *format,...);</td><td>Sprintf() to a variable string.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vsgets(unsigned char **sp,FILE *f);</td><td>fgets() to a variable string (returns NULL on EOF, does not put '\n' in the string).</td></tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<p>Note that it is OK to pass NULL in place of 's' in any of the above
|
|
functions. In this case, a new string is allocated.</p>
|
|
|
|
<p>Most of these function return the address of the variable string, in case
|
|
it was modified. One exception is vsgets()- it takes an address of a
|
|
variable holding the string. It returns the string address, but it also
|
|
writes it to the variable directly.</p>
|
|
|
|
<p>Many of the above functions take a memory block, which is specified by an
|
|
address and a length. These macros can be used to help construct these
|
|
arguments:</p>
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
|
|
<colgroup>
|
|
<col width="350">
|
|
<tbody>
|
|
<tr valign="top"><td>#define sc(s) (s), (sizeof(s)-1)</td><td>For string constants.</td></tr>
|
|
<tr valign="top"><td>#define sz(s) (s), zlen(s)</td><td>For C-strings (zlen is the same as strlen).</td></tr>
|
|
<tr valign="top"><td>#define sv(s) (s), vslen(s)</td><td>For variable strings.</td></tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
|
|
<p>So for example, vscatz is defined as:</p>
|
|
|
|
<pre>
|
|
unsigned char *vscatz(unsigned char *s, unsigned char *z);
|
|
{
|
|
return vsncpy(sv(s), sz(z));
|
|
}
|
|
</pre>
|
|
|
|
## Variable arrays of variable strings
|
|
|
|
<p>JOE also semi-automatic variable arrays of variable strings. These
|
|
functions are provided:</p>
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
|
|
<colgroup>
|
|
<col width="400">
|
|
<tbody>
|
|
<tr valign="top"><td>unsigned char *vamk(int prealloc);</td><td>Allocate a zero length array, but with enough space to grow to the specified length.</td></tr>
|
|
<tr valign="top"><td>void varm(unsigned char **a);</td><td>Free an array.</td></tr>
|
|
<tr valign="top"><td>int valen(unsigned char *a);</td><td>Return array length or 0 if a is NULL.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vaensure(unsigned char *s, int len);</td><td>Make sure there is enough space for an array of the specified length.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vatrunc(unsigned char *s, int len);</td><td>Set length of array: the array is truncated or extended with NULLs to the specified length.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vaadd(unsigned char **a,unsigned char *s);</td><td>Append a single string to an array.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vasort(unsigned char **a,int len);</td><td>Sort an array.</td></tr>
|
|
<tr valign="top"><td>unsigned char *vawords(unsigned char **a,unsigned char *s,int len, unsigned char *sep, int seplen);</td><td>Convert a word list in s (words separated by characters given in sep/seplen) into an array of strings.</td></tr>
|
|
<tr valign="top"><td>void vaperm(unsigned char **a);</td><td>Mark an array as preferment (heap instead of stack).</td></tr>
|
|
</tbody>
|
|
</table>
|
|
|
|
<p>When an array is marked as permanent, any strings in the array also
|
|
marked as permanent, and any string later added to the array are also marked
|
|
as permanent.</p>
|
|
|
|
|
|
## Edit Buffers
|
|
<p>API:
|
|
</p>
|
|
<p> Look at the comments in b.h for more information.
|
|
</p>
|
|
<p> B *bfind(unsigned char *name);
|
|
Load disk file into memory buffer 'B'.
|
|
</p>
|
|
<p> bsave(P *p,unsigned char *name,int size);
|
|
Write size bytes from buffer beginning at p to disk file
|
|
</p>
|
|
<p> brm(b); Free data structure
|
|
</p>
|
|
<p>Once you have a B you can access the characters in it via P pointers (which
|
|
are like C++ STL iterators).
|
|
</p>
|
|
<p> B *b = bfind("foo"); Load file into memory
|
|
</p>
|
|
<p> P *p = pdup(b->bof); Get pointer to beginning of file (duplicate
|
|
b->bof which is a P).
|
|
</p>
|
|
<p> prm(p); Free p when we're done with it.
|
|
</p>
|
|
<p> int c=brch(p); Get character at p.</p>
|
|
<p> int c=pgetc(p); Get character at p and advance it.</p>
|
|
<p> int c=prgetc(p); Retract p, then return character at it.</p>
|
|
</p>
|
|
<p> - These return -1 (NO_MORE_DATA) if you try to read end of file or
|
|
before beginning of file.
|
|
</p>
|
|
<p> - A pointer can point to any character of the file and right after the
|
|
end of the file.
|
|
</p>
|
|
<p> - For UTF-8 files, character can be between 0 and 0x7FFFFFFF
|
|
</p>
|
|
<p> Publicly readable members of P:</p>
|
|
<p> p->byte The byte offset into the buffer</p>
|
|
<p> p->line The line number</p>
|
|
<p> p->xcol If P is the cursor, this is the column number
|
|
where the cursor will be displayed on the screen
|
|
(which does not have to match the column of the
|
|
character at P).</p>
|
|
<p> Some predicates:</p>
|
|
<p> pisbof(p); True if pointer is at beginning of buffer</p>
|
|
<p> piseof(p); True if pointer is at end of buffer</p>
|
|
<p> pisbol(p); True if pointer is at beginning of line</p>
|
|
<p> piseol(p); True if pointer is at end of line</p>
|
|
<p> pisbow(p); True if pointer is at beginning of a word</p>
|
|
<p> piseow(p); True if pointer is at end of a word</p>
|
|
|
|
<p> More information about character at p:</p>
|
|
<p> piscol(p); Get column number of character at p.
|
|
</p>
|
|
<p> Some other ways of moving a P through a B:
|
|
</p>
|
|
<p> pnextl(p); Go to beginning of next line</p>
|
|
<p> pprevl(p); Go to end of previous line</p>
|
|
<p> pfwrd(p,int n); Move forward n bytes</p>
|
|
<p> pbkwd(p,int n); Move backward n bytes</p>
|
|
<p> p_goto_bof(p);</p>
|
|
<p> p_goto_eof(p);</p>
|
|
<p> p_goto_bol(p);</p>
|
|
<p> p_goto_eol(p);</p>
|
|
<p> pset(p,q); Move p to same position as q.
|
|
</p>
|
|
<p> pline(p,n); Goto to beginning of a specific line.</p>
|
|
<p> pgoto(p,n); Goto a specific byte position.</p>
|
|
<p> pfind(P,unsigned char *s,int len);
|
|
Fast Boyer-Moore search forward.</p>
|
|
<p> prfind(P,unsigned char *s,int len);
|
|
Fast Boyer-Moore search backward.</p>
|
|
<p> These are very fast- they look at low level
|
|
data structure and don't go through pgetc(). Boyer-Moore
|
|
allows you to skip over many characters without reading
|
|
them, so you can get something like O(n/len).</p>
|
|
<p> Some facts:</p>
|
|
<p> Local operations are fast: pgetc(), prgetc().</p>
|
|
<p> Copy is fast: pset().</p>
|
|
<p> pline() and pgoto() are slower, but look for the closest existing
|
|
P to start from.</p>
|
|
<p> The column number is stored in P, but it is only updated if
|
|
it is easy to do so. If it's hard (like you crossed a line
|
|
boundary backward) it's marked as invalid. piscol() then has
|
|
to recalculate it.
|
|
</p>
|
|
<p> Modifying a buffer:
|
|
</p>
|
|
<p> binsc(p,int c); Insert character at p.</p>
|
|
<p> bdel(P *from,P *to); Delete character between two Ps.</p>
|
|
<p> Note that when you insert or delete, all of the Ps after the insertion/
|
|
deletion point are adjusted so that they continue to point to the same
|
|
character before the insert or delete.
|
|
</p>
|
|
<p> Insert and Delete create undo records.
|
|
</p>
|
|
<p> Insert and Delete set dirty flags on lines which are currently being
|
|
displayed on the screen, so that when you return to the edit loop, these
|
|
lines automatically get redrawn.
|
|
</p>
|
|
<p>Internal:
|
|
</p>
|
|
<p> An edit buffer is made up of a doubly-linked list of fixed sized (4 KB)
|
|
gap buffers. A gap buffer has two parts: a ~16 byte header, which is always
|
|
in memory, and the actual buffer, which can be paged out to a swap file (a
|
|
vfile- see vfile.h). A gap buffer consists of three regions: text before
|
|
the gap, the gap and text after the gap (which always goes all the way to
|
|
the end of buffer). (hole and ehole in the header indicate the gap
|
|
position). The size of the gap may be 0 (which is the case when a file is
|
|
first loaded). Gap buffers are fast for small inserts and deletes when the
|
|
cursor is at the gap (for delete you just adjust a pointer, for insert you
|
|
copy the data into gap). When you reposition the cursor, you have to move
|
|
the gap before any inserts or deletes occur. If you had only one window and
|
|
a single large gap buffer for the file, you could always keep the gap at the
|
|
cursor- the right-arrow key copies one character across the gap.
|
|
</p>
|
|
<p> Of course for edits which span gap buffers or which are larger than a gap
|
|
buffer, you get a big mess of gap buffer splitting and merging plus
|
|
doubly-linked list splicing.
|
|
</p>
|
|
<p> Still, this buffer method is quite fast: you never have to do large memory
|
|
moves since the gap buffers are limited in size. To help search for line
|
|
numbers, the number of newlines '\n's contained in each gap buffer is stored
|
|
in the header. Reads are fast as long as you have a P at the place you
|
|
want to read from, which is almost always the case.
|
|
</p>
|
|
<p> It should be possible to quickly load files by mapping them directly into
|
|
memory (using mmap()) and treating each 4KB page as a gap buffer with 0 size
|
|
gap. When page is modified, use copy-on-write to move the page into the
|
|
swap file (change pointer in header). This is not done now. Instead the
|
|
file is copied when loaded.
|
|
</p>
|
|
## Windowing System
|
|
<p>There is a tiny object-oriented windowing system built into JOE. This is
|
|
the class hierarchy:
|
|
</p>
|
|
<p>SCRN
|
|
A optimizing terminal screen driver (very similar to 'curses').
|
|
has a pointer to a CAP, which has the terminal capabilities read
|
|
from termcap or terminfo.
|
|
</p>
|
|
<p> writes output to screen with calls to the macro ttputc(). (tty.c is the
|
|
actual interface to the tty device).
|
|
</p>
|
|
<p> cpos() - set cursor position</p>
|
|
<p> outatr() - draw a character at a screen position with attributes</p>
|
|
<p> eraeol() - erase from some position to the end of the line</p>
|
|
<p>SCREEN
|
|
Contains list of windows on the screen (W *topwin).
|
|
</p>
|
|
<p> Points to window with focus (W *curwin).
|
|
</p>
|
|
<p> Contains pointer to a 'SCRN', the tty driver for the particular terminal
|
|
type.
|
|
</p>
|
|
<p>W
|
|
A window on a screen.
|
|
</p>
|
|
<p> Has position and size of window.
|
|
</p>
|
|
<p> Has:
|
|
void *object- pointer to a structure which inherits window (W should
|
|
really be a base class for these objects- since C doesn't have this
|
|
concept, a pointer to the derived class is used instead- the derived
|
|
class has a pointer to the base class: it's called 'parent').
|
|
</p>
|
|
<p> Currently this is one of:</p>
|
|
<p> BW * a text buffer window (screen update code is here.)</p>
|
|
<p> QW * query window (single character yes/no questions)</p>
|
|
<p> MENU * file selection menu</p>
|
|
<p> BW * is inherited by (in the same way that a BW inherits a W):</p>
|
|
<p> PW * a single line prompt window (file name prompts)</p>
|
|
<p> TW * a text buffer window (main editing window).</p>
|
|
<p> WATOM *watom- Gives type of this window. The WATOM structure has
|
|
pointers to virtual member functions.</p>
|
|
<p> KBD *kbd- The keyboard handler for this window. When window has
|
|
focus, keyboard events are sent to this object. When key sequences
|
|
are recognized, macros bound to them are invoked.</p>
|
|
<p>Some window are operators on others. For example ^K E, load a file into a
|
|
window prompt operates on a text window. If you hit tab, a file selection
|
|
menu which operates on the prompt window appears below this. When a window
|
|
is the target of operator windows is killed, the operators are killed also.</p>
|
|
<p>Currently all windows are currently the width of the screen (should be fixed
|
|
in the future). The windows are in a big circular list (think of a big loop
|
|
of paper). The screen is small window onto this list. So unlike emacs, you
|
|
can have windows which exist, but which are not on the screen.</p>
|
|
<p>^K N and ^K P move the cursor to next or previous window. If the next
|
|
window is off the screen it is moved onto the screen, along with any
|
|
operator windows are target it.</p>
|
|
## Macros
|
|
<p>- add something here.
|
|
</p>
|
|
## Screen update
|
|
<p>- add something here.
|
|
</p>
|
|
## Syntax Highlighter
|
|
|
|
<p>There are two parts to the syntax highlighter: the parser (in syntax.c)
|
|
and the line attribute database (in lattr.c). The parser is simple enough:
|
|
it is a finite state machine interpreter. You supply a line of text and the
|
|
starting state, and it returns the starting state of the next line and an
|
|
array containing the color to be used for each character. Each state has a
|
|
color. The line is scanned left to right, character by character: the state
|
|
and the character index a table, which gives an action, which includes the
|
|
next state.</p>
|
|
|
|
<p>The action specifies the next state, and also indicates whether some
|
|
previous N characters (including the current) should be "recolored" with the
|
|
color of the next state, or whether the color of the current state should be
|
|
used for the character. The action also indicates whether to move on to the
|
|
next character, or to stay put and re-evaluate the current character with
|
|
the new state. This leads to simpler hand-coding of the state machine, but
|
|
can create an infinite loop. The action also indicates if the character
|
|
should be added to a string buffer. The string buffer can be used (when
|
|
specified by an action) as a key for a hash table look-up (typically filled
|
|
with key-words). If there is a match, the hash-table entry supplies the
|
|
action instead of the state table.</p>
|
|
|
|
<p>This parser is fast and simple and powerful enough to lexically analyze
|
|
more than 40 languages. However a few enhancements were made to improve
|
|
both the class of languages which can be highlighted and to improve the ease
|
|
of writing the state machine. To support "here" documents in perl and sh,
|
|
as well as to simplify matching delimiters (like open with close
|
|
parenthesis), a string (from the string buffer) can be stored along with the
|
|
state. This can be matched against the string buffer. So first the
|
|
terminating word from the here document is stored in the string, and then
|
|
the state machine looks forward until it finds this string to determine the
|
|
end of the here document.</p>
|
|
|
|
<p>The state is only a single integer, not an entire stack. The parser can
|
|
therefore not handle recursive grammars. Unfortunately, some languages have
|
|
recursive elements in their lexical structure. One way to deal with this is
|
|
to support only a very limited recursion depth by tediously repeating a set
|
|
of states in the hand written state machine N times, where N is your maximum
|
|
recursion depth. To ease writing this type of thing, and also to allow
|
|
state machine to be reused in cases where one language is embedded within
|
|
another (PHP in HTML, for example), a macro system was added to the state
|
|
machine loader. Recursion is allowed in the macro calls, but is limited to
|
|
a depth of 5.</p>
|
|
|
|
<p>As of JOE 3.6, the macro instantiation was replaced with a real stack, so
|
|
now there is no recursion depth limit.</p>
|
|
|
|
## Line attribute cache
|
|
|
|
<p>The second part of the syntax highlighter is the line attribute cache.
|
|
In the original implementation, the state of the first line of each window
|
|
is maintained and the parser is invoked during the screen update algorithm
|
|
to highlight all of the lines on the screen. This works well when the
|
|
window remains in place or when it scrolls forward through the file.
|
|
However, if the window scrolls backward you may have to reparse the entire
|
|
file or from the nearest previous window if there happens to be one. This
|
|
can be slow for large files, and certainly wastes a lot of CPU time when
|
|
scrolling backward through a file.</p>
|
|
|
|
<p>One attempt to solve this was to instead of starting at the beginning of
|
|
the file, to instead assume that the line 200 lines (a configurable option)
|
|
back was at the initial state, parse forward from there and hope for the
|
|
best. Sometimes the state machine would find itself back in the proper
|
|
state, but not always- to the point where most syntaxes had specified to
|
|
just always reparse from the beginning of the file.</p>
|
|
|
|
<p>Finally, the line attribute cache was added. It stores the parser state
|
|
at the beginning of each line, from the beginning of the file to last line
|
|
actually displayed on the screen. This solves the scrolling problem, but
|
|
introduces a new one: how to keep the line attribute cache up to date. This
|
|
turns out to not be so difficult. First of all, the states are stored in a
|
|
gap-buffer, which makes it easy to perform insertions and deletions.
|
|
Whenever lines are inserted or deleted in the file, the line attribute cache
|
|
is expanded or contracted to keep in sync. Now, when any change is made to
|
|
the file, we reparse from just before the start of the change until the
|
|
state of the beginning of a line again matches the corresponding entry in
|
|
the cache. If a match is found, the rest of the cache is assumed to be
|
|
correct. Some changes will require a large amount of reparsing, but many
|
|
will not. In any case, the the highlighting on the screen is always
|
|
correct.</p>
|
|
|
|
## Files
|
|
|
|
<table width="100%" cellspacing=20 border=0 cellpadding=0>
|
|
<colgroup>
|
|
<col width="100">
|
|
<tbody>
|
|
<tr valign="top"><td>b.c</td><td>Text buffer management</td></tr>
|
|
<tr valign="top"><td>blocks.c</td><td>Library: fast memcpy() functions (necessary on really old versions of UNIX).</td></tr>
|
|
<tr valign="top"><td>bw.c</td><td>A class: text buffer window (screen update code is here)</td></tr>
|
|
<tr valign="top"><td>charmap.c</td><td>UNICODE to 8-bit conversion functions</td></tr>
|
|
<tr valign="top"><td>cmd.c</td><td>Table of user edit functions</td></tr>
|
|
<tr valign="top"><td>coroutine.c</td><td>Coroutine library</td></tr>
|
|
<tr valign="top"><td>dir.c</td><td>Directory reading functions (for old UNIXs).</td></tr>
|
|
<tr valign="top"><td>hash.c</td><td>Library: automatically expanding hash table functions.</td></tr>
|
|
<tr valign="top"><td>help.c</td><td>Implement the on-line help window</td></tr>
|
|
<tr valign="top"><td>i18n.c</td><td>Unicode character type information database</td></tr>
|
|
<tr valign="top"><td>kbd.c</td><td>Keymap data structure (keysequence to macro bindings).</td></tr>
|
|
<tr valign="top"><td>lattr.c</td><td>Line attribute cache</td></tr>
|
|
<tr valign="top"><td>macro.c</td><td>Keyboard and joerc file macros</td></tr>
|
|
<tr valign="top"><td>main.c</td><td>Has main() and top level edit loop</td></tr>
|
|
<tr valign="top"><td>menu.c</td><td>A class: menu windows</td></tr>
|
|
<tr valign="top"><td>obj.c</td><td>Semi-automatic objects, plus dynamic string and array library</td></tr>
|
|
<tr valign="top"><td>path.c</td><td>Library: file name and path manipulation functions</td></tr>
|
|
<tr valign="top"><td>poshist.c</td><td>Cursor position history</td></tr>
|
|
<tr valign="top"><td>pw.c</td><td>A class: prompt windows</td></tr>
|
|
<tr valign="top"><td>queue.c</td><td>Library: doubly linked lists</td></tr>
|
|
<tr valign="top"><td>qw.c</td><td>A class: single-key query windows</td></tr>
|
|
<tr valign="top"><td>rc.c</td><td>Joerc file loader</td></tr>
|
|
<tr valign="top"><td>regex.c</td><td>Regular expressions</td></tr>
|
|
<tr valign="top"><td>scrn.c</td><td>Terminal update functions (like curses)</td></tr>
|
|
<tr valign="top"><td>selinux.c</td><td>Secure linux (for RedHat) functions</td></tr>
|
|
<tr valign="top"><td>syntax.c</td><td>Syntax highlighter</td></tr>
|
|
<tr valign="top"><td>tab.c</td><td>Tab completion for file selection prompt</td></tr>
|
|
<tr valign="top"><td>termcap.c</td><td>Load terminal capabilities from /etc/termcap file or terminfo database</td></tr>
|
|
<tr valign="top"><td>tw.c</td><td>A class: main text editing window</td></tr>
|
|
<tr valign="top"><td>ublock.c</td><td>User edit functions: block moves</td></tr>
|
|
<tr valign="top"><td>uedit.c</td><td>User edit functions: basic edit functions</td></tr>
|
|
<tr valign="top"><td>uerror.c</td><td>User edit functions: parse compiler error messages and goto next error, previous error</td></tr>
|
|
<tr valign="top"><td>ufile.c</td><td>User edit functions: load and save file</td></tr>
|
|
<tr valign="top"><td>uformat.c</td><td>User edit functions: paragraph formatting, centering</td></tr>
|
|
<tr valign="top"><td>uisrch.c</td><td>User edit functions: incremental search</td></tr>
|
|
<tr valign="top"><td>umath.c</td><td>User edit functions: calculator</td></tr>
|
|
<tr valign="top"><td>undo.c</td><td>Undo system</td></tr>
|
|
<tr valign="top"><td>usearch.c</td><td>User edit functions: search & replace</td></tr>
|
|
<tr valign="top"><td>ushell.c</td><td>User edit functions: subshell</td></tr>
|
|
<tr valign="top"><td>utag.c</td><td>User edit functions: tags file search</td></tr>
|
|
<tr valign="top"><td>utf8.c</td><td>UTF-8 to unicode coding functions</td></tr>
|
|
<tr valign="top"><td>utils.c</td><td>Misc. utilities</td></tr>
|
|
<tr valign="top"><td>vfile.c</td><td>Library: virtual memory functions (allows you to edit files larger than memory)</td></tr>
|
|
<tr valign="top"><td>w.c</td><td>A class: base class for all windows</td></tr>
|
|
</tbody>
|
|
</table>
|