Name

rules table - The rules table contains a set of rules that map sequences of address input tokens to standardized output sequences. A rule is defined as a set of input tokens, followed by -1 (terminator), followed by a set of output tokens, followed by -1, followed by a number denoting the kind of rule, followed by the ranking of the rule.

Description

A rules table must have at least the following columns, though you are allowed to add more for your own uses.

id

Primary key of the table

rule

Text field denoting the rule. Details are at PAGC Address Standardizer Rule records.

A rule consists of a set of non-negative integers representing input tokens, terminated by a -1, followed by an equal number of non-negative integers representing postal attributes, terminated by a -1, followed by an integer representing a rule type, followed by an integer representing the rank of the rule. The rules are ranked from 0 (lowest) to 17 (highest).

For example, the rule 2 0 2 22 3 -1 5 5 6 7 3 -1 2 6 maps the sequence of input tokens TYPE NUMBER TYPE DIRECT QUALIF to the output sequence STREET STREET SUFTYP SUFDIR QUALIF. The rule is an ARC_C rule of rank 6.

Numbers for corresponding output tokens are listed in stdaddr.
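The layout described above can be sketched in a few lines of Python. This is only an illustration of the record format, not code from the standardizer itself; the names `Rule` and `parse_rule` are hypothetical helpers introduced here.

```python
from typing import List, NamedTuple


class Rule(NamedTuple):
    input_tokens: List[int]
    output_tokens: List[int]
    rule_type: int
    rank: int


def parse_rule(text: str) -> Rule:
    """Split a rule string into its four parts (hypothetical helper)."""
    numbers = [int(part) for part in text.split()]
    first_end = numbers.index(-1)                  # terminator after the input tokens
    second_end = numbers.index(-1, first_end + 1)  # terminator after the output tokens
    input_tokens = numbers[:first_end]
    output_tokens = numbers[first_end + 1:second_end]
    trailer = numbers[second_end + 1:]
    if len(trailer) != 2:
        raise ValueError("expected a rule type and a rank after the second -1")
    if len(input_tokens) != len(output_tokens):
        raise ValueError("input and output token counts differ")
    rule_type, rank = trailer
    if not 0 <= rank <= 17:
        raise ValueError("rank must be between 0 and 17")
    return Rule(input_tokens, output_tokens, rule_type, rank)


# The example rule from the text: an ARC_C (2) rule of rank 6.
rule = parse_rule("2 0 2 22 3 -1 5 5 6 7 3 -1 2 6")
print(rule)
```

The checks on token counts and rank simply mirror the constraints stated in the description; a real rules table may warrant stricter or looser validation.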

Input Tokens

Each rule starts with a set of input tokens followed by a terminator -1. Valid input tokens excerpted from PAGC Input Tokens are as follows:

Form-Based Input Tokens

AMPERS

(13). The ampersand (&) is frequently used to abbreviate the word "and".

DASH

(9). A punctuation character.

DOUBLE

(21). A sequence of two letters. Often used as identifiers.

FRACT

(25). Fractions are sometimes used in civic numbers or unit numbers.

MIXED

(23). An alphanumeric string that contains both letters and digits. Used for identifiers.

NUMBER

(0). A string of digits.

ORD

(15). Representations such as First or 1st. Often used in street names.

SINGLE

(18). A single letter.

WORD

(1). A word is a string of letters of arbitrary length. A single letter can be both a SINGLE and a WORD.

Function-based Input Tokens

BOXH

(14). Words used to denote post office boxes. For example: Box or PO Box.

BUILDH

(19). Words used to denote buildings or building complexes, usually as a prefix. For example: Tower in Tower 7A.

BUILDT

(24). Words and abbreviations used to denote buildings or building complexes, usually as a suffix. For example: Shopping Centre.

DIRECT

(22). Words used to denote directions, for example North.

MILE

(20). Words used to denote milepost addresses.

ROAD

(6). Words and abbreviations used to denote highways and roads. For example: the Interstate in Interstate 5.

RR

(8). Words and abbreviations used to denote rural routes. For example: RR.

TYPE

(2). Words and abbreviations used to denote street types. For example: ST or AVE.

UNITH

(16). Words and abbreviations used to denote internal subaddresses. For example: APT or UNIT.

Postal Type Input Tokens

QUINT

(28). A 5 digit number. Identifies a Zip Code.

QUAD

(29). A 4 digit number. Identifies ZIP4.

PCH

(27). A 3 character sequence of letter number letter. Identifies an FSA, the first 3 characters of a Canadian postal code.

PCT

(26). A 3 character sequence of number letter number. Identifies an LDU, the last 3 characters of a Canadian postal code.
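For reading rules by eye, the token numbers on this page can be collected into a lookup table. The sketch below is only a convenience built from the numbers listed here (STOPWORD, number 7, is described in the Stopwords part of this page); `INPUT_TOKENS` and `name_input_tokens` are hypothetical names, and any number not excerpted on this page is reported as unlisted rather than guessed.

```python
from typing import Dict, List

# Input token numbers as excerpted on this page.
INPUT_TOKENS: Dict[int, str] = {
    0: "NUMBER", 1: "WORD", 2: "TYPE", 6: "ROAD", 7: "STOPWORD", 8: "RR",
    9: "DASH", 13: "AMPERS", 14: "BOXH", 15: "ORD", 16: "UNITH",
    18: "SINGLE", 19: "BUILDH", 20: "MILE", 21: "DOUBLE", 22: "DIRECT",
    23: "MIXED", 24: "BUILDT", 25: "FRACT", 26: "PCT", 27: "PCH",
    28: "QUINT", 29: "QUAD",
}


def name_input_tokens(numbers: List[int]) -> List[str]:
    """Translate input token numbers to names; unknown numbers are flagged."""
    return [INPUT_TOKENS.get(n, f"UNLISTED({n})") for n in numbers]


names = name_input_tokens([2, 0, 2, 22])
print(names)  # ['TYPE', 'NUMBER', 'TYPE', 'DIRECT']
```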

Stopwords

STOPWORDs combine with WORDs. In rules, a string of multiple WORDs and STOPWORDs is represented by a single WORD token.

STOPWORD

(7). A word with low lexical significance, that can be omitted in parsing. For example: THE.
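One way to picture the collapsing described above is as a pass over a token sequence that merges each run of WORD and STOPWORD tokens into a single WORD. The following is a sketch of that reading only, not the standardizer's actual algorithm; `collapse_words` is a hypothetical helper, and leaving a run of length one unchanged is an assumption.

```python
from itertools import groupby
from typing import List

WORD = 1
STOPWORD = 7


def collapse_words(tokens: List[int]) -> List[int]:
    """Merge each run of two or more WORD/STOPWORD tokens into one WORD."""
    collapsed: List[int] = []
    for is_wordlike, group in groupby(tokens, key=lambda t: t in (WORD, STOPWORD)):
        run = list(group)
        if is_wordlike and len(run) > 1:
            collapsed.append(WORD)   # the whole run becomes a single WORD
        else:
            collapsed.extend(run)    # other tokens (and single tokens) pass through
    return collapsed


# NUMBER STOPWORD WORD WORD TYPE -> NUMBER WORD TYPE
print(collapse_words([0, 7, 1, 1, 2]))  # [0, 1, 2]
```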

Output Tokens

After the first -1 (terminator) come the output tokens in order, followed by a terminator -1. Numbers for the corresponding output tokens are listed in stdaddr. Which output tokens are allowed depends on the kind of rule. Output tokens valid for each rule type are listed in the section called “Rule Types and Rank”.

Rule Types and Rank

The final part of the rule is the rule type, which is denoted by one of the following, followed by a rule rank. The rules are ranked from 0 (lowest) to 17 (highest).

MACRO_C

(token number = "0"). The class of rules for parsing MACRO clauses such as PLACE STATE ZIP.

MACRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).

CITY

(token number "10"). Example "Albany"

STATE

(token number "11"). Example "NY"

NATION

(token number "12"). This attribute is not used in most reference files. Example "USA"

POSTAL

(token number "13"). (SADS elements "ZIP CODE" , "PLUS 4" ). This attribute is used for both the US Zip and the Canadian Postal Codes.

MICRO_C

(token number = "1"). The class of rules for parsing full MICRO clauses (such as House, street, sufdir, predir, pretyp, suftype, qualif), i.e. ARC_C plus CIVIC_C. These rules are not used in the build phase.

MICRO_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).

HOUSE

Is a text (token number 1): This is the street number on a street. Example 75 in 75 State Street.

predir

Is text (token number 2): STREET NAME PRE-DIRECTIONAL such as North, South, East, West, etc.

qual

Is text (token number 3): STREET NAME PRE-MODIFIER. Example OLD in 3715 OLD HIGHWAY 99.

pretype

Is text (token number 4): STREET PREFIX TYPE

street

Is text (token number 5): STREET NAME

suftype

Is text (token number 6): STREET POST TYPE, e.g. St, Ave, Cir. A street type following the root street name. Example STREET in 75 State Street.

sufdir

Is text (token number 7): STREET POST-DIRECTIONAL. A directional modifier that follows the street name. Example WEST in 3715 TENTH AVENUE WEST.

ARC_C

(token number = "2"). The class of rules for parsing MICRO clauses, excluding the HOUSE attribute. As such, it uses the same set of output tokens as MICRO_C minus the HOUSE token.

CIVIC_C

(token number = "3"). The class of rules for parsing the HOUSE attribute.

EXTRA_C

(token number = "4"). The class of rules for parsing EXTRA attributes - attributes excluded from geocoding. These rules are not used in the build phase.

EXTRA_C output tokens (excerpted from http://www.pagcgeo.org/docs/html/pagc-12.html#--r-typ--).

BLDNG

(token number 0): Unparsed building identifiers and types.

BOXH

(token number 14): The BOX in BOX 3B

BOXT

(token number 15): The 3B in BOX 3B

RR

(token number 8): The RR in RR 7

UNITH

(token number 16): The APT in APT 3B

UNITT

(token number 17): The 3B in APT 3B

UNKNWN

(token number 9): An otherwise unclassified output.
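The rule types and output token numbers listed in this section can be turned into a simple consistency check for hand-written rules. The sketch below is an illustration built only from the numbers on this page; `ALLOWED_OUTPUT` and `check_rule` are hypothetical names, and treating HOUSE as the only valid output token for CIVIC_C is an assumption drawn from its description.

```python
from typing import Dict, FrozenSet, List

MACRO_OUT = frozenset({10, 11, 12, 13})          # CITY, STATE, NATION, POSTAL
MICRO_OUT = frozenset({1, 2, 3, 4, 5, 6, 7})     # HOUSE .. SUFDIR
EXTRA_OUT = frozenset({0, 8, 9, 14, 15, 16, 17}) # BLDNG, RR, UNKNWN, BOXH, BOXT, UNITH, UNITT

RULE_TYPE_NAMES: Dict[int, str] = {
    0: "MACRO_C", 1: "MICRO_C", 2: "ARC_C", 3: "CIVIC_C", 4: "EXTRA_C",
}

ALLOWED_OUTPUT: Dict[int, FrozenSet[int]] = {
    0: MACRO_OUT,
    1: MICRO_OUT,
    2: MICRO_OUT - {1},   # ARC_C: MICRO_C tokens minus HOUSE
    3: frozenset({1}),    # CIVIC_C: HOUSE only (assumption)
    4: EXTRA_OUT,
}


def check_rule(output_tokens: List[int], rule_type: int, rank: int) -> List[str]:
    """Return a list of problems found; an empty list means none were found."""
    problems: List[str] = []
    if rule_type not in ALLOWED_OUTPUT:
        problems.append(f"unknown rule type {rule_type}")
    else:
        bad = [t for t in output_tokens if t not in ALLOWED_OUTPUT[rule_type]]
        if bad:
            problems.append(
                f"output tokens {bad} not valid for {RULE_TYPE_NAMES[rule_type]}"
            )
    if not 0 <= rank <= 17:
        problems.append(f"rank {rank} is outside 0..17")
    return problems


ok = check_rule([5, 5, 6, 7, 3], rule_type=2, rank=6)   # the ARC_C example rule
bad = check_rule([1, 5, 6], rule_type=2, rank=6)        # HOUSE is not valid in ARC_C
print(ok, bad)
```

Returning a list of messages instead of raising on the first problem makes it easier to review a whole rules table in one pass.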