23. Assembly Language

With the ability to make decisions, the ESAP system is now complete
However, using the ESAP system is not particularly easy as it’s programmed with machine code
- Writing programs for the ESAP system is tedious
- Machine code is prone to errors and requires memorizing bit patterns
A solution to this problem is to improve the way programming is done
- Instead of binary patterns or hex numbers, English mnemonics can be used
- Other quality-of-life features can be included, like separating the opcode mnemonic from the operand value
However, the ESAP system ultimately requires machine code

23.1. Assembler

An assembly language is a very low-level programming language
- Often referred to as assembly
- Assembly is strongly tied to a specific system design, the underlying hardware, and its machine language
- For example, ARM assembly is tied to the instruction set of ARM processors
An assembler is a tool to convert assembly language to machine code
Compared to machine code, assembly enables a more human-centric way of programming the system
- Assembly is effectively programming in machine code, but with a few nice features
Assembly languages have many benefits over programming in machine code, but two key features are
- Mnemonics for referring to specific instructions
  For example, consider loading the value 5 into register A
  
  Instead of the machine code 0b00100101, one could write LDAD 5
  
  The mnemonic would mean the same thing, and would be translated to the machine code
  
  But the mnemonic is much easier to remember and mentally parse
- Labels/symbolic representation for memory addresses
  For example, memory addresses could be labelled and referenced by their label
  
  This would make referencing memory addresses for jumps and loading from RAM easier
  
  This removes the need to remember specific memory addresses
  
  It also removes the need to update addresses whenever lines are added to or removed from the program
An assembler would take the assembly language and translate it, or assemble it, to the corresponding machine code
- It would replace the mnemonics with their opcode bit patterns and translate literals to their binary/hex values
- It would replace all labels within the assembly with their corresponding memory addresses
Typically, each statement in assembly has a 1-to-1 mapping to a statement in machine code
Despite its simplicity, an assembler improves the programming experience and allows for a small amount of abstraction

Figure (assembly_to_machine_code.png): An assembler is a tool used to translate assembly language to machine code. The left-hand side shows an example of some assembly language making use of mnemonics. The right-hand side shows the hex representation of the corresponding machine code. The assembler takes the assembly language and “assembles” it to the machine code. Here, each instruction has a 1-to-1 mapping between assembly and machine code.
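
As a rough illustration of the two translation steps described above, the following minimal sketch assembles a tiny program that uses a label
- The label syntax (a name followed by a colon) and the function name assemble_with_labels are assumptions made for this example
- It is not the ESAP assembler, which does not support labels
- The opcodes are the ESAP bit patterns for the four mnemonics used

```python
# Opcodes for the mnemonics used in this example (ESAP bit patterns).
OPCODES = {"LDAD": 0b0010, "LDBD": 0b0100, "ADAB": 0b0111, "JMPA": 0b1001}


def assemble_with_labels(lines):
    """Translate assembly lines to machine code bytes, resolving labels.

    Hypothetical sketch: a line ending in ':' defines a label for the
    address of the next instruction.
    """
    # First pass: record the address each label refers to.
    labels = {}
    instructions = []
    for line in lines:
        if line.endswith(":"):
            labels[line[:-1]] = len(instructions)
        else:
            instructions.append(line.split())

    # Second pass: replace mnemonics with opcodes and labels with addresses.
    machine_code = []
    for parts in instructions:
        opcode = OPCODES[parts[0]]
        operand = 0
        if len(parts) > 1:
            operand = labels[parts[1]] if parts[1] in labels else int(parts[1], 0)
        machine_code.append((opcode << 4) | operand)
    return machine_code


program = ["LDAD 0", "LDBD 1", "loop:", "ADAB", "JMPA loop"]
print([hex(byte) for byte in assemble_with_labels(program)])
# ['0x20', '0x41', '0x70', '0x92']
```

Here the label loop resolves to address 2, so JMPA loop becomes 0x92; adding a line before the label would change that address without any edit to the jump itself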

23.2. The ESAP Assembler

To make programming easier, a simple assembler will be built for the ESAP system
This ESAP assembler will only implement the mnemonics and interpret various literal value encodings
- It will not make use of labels for memory addresses
The mnemonics make writing programs much easier
They also make reading and interpreting programs easier
- LDAR 15 versus 0b00011111 or 0x1F
The mnemonics for each instruction have already been discussed in a previous topic
- Below is a table of all 16 instructions
- This table was shown before, but did not include the conditional jump instructions
- These conditional jump instructions are included here

Complete Instruction Set for the Current ESAP System
Bit Pattern	Hex	Mnemonic	Description
`0000`	`0`	`NOOP`	No Operation
`0001`	`1`	`LDAR`	Load A From RAM
`0010`	`2`	`LDAD`	Load A Direct
`0011`	`3`	`LDBR`	Load B From RAM
`0100`	`4`	`LDBD`	Load B Direct
`0101`	`5`	`SAVA`	Save A to RAM
`0110`	`6`	`SAVB`	Save B to RAM
`0111`	`7`	`ADAB`	Add B to A — `A += B`
`1000`	`8`	`SUAB`	Subtract B from A — `A -= B`
`1001`	`9`	`JMPA`	Jump Always
`1010`	`A`	`JMPZ`	Jump if Zero Flag Set
`1011`	`B`	`JMPS`	Jump if Significant/Sign Flag Set
`1100`	`C`	`JMPC`	Jump if Carry Flag Set
`1101`	`D`	`OUTU`	Output Unsigned Integer
`1110`	`E`	`OUTS`	Output Signed Integer
`1111`	`F`	`HALT`	Halt
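
Each instruction occupies one byte, with the 4-bit pattern from the table in the upper four bits and the 4-bit operand in the lower four bits
- The sketch below packs and unpacks a byte to show this layout
- The helper names encode and decode are hypothetical and are not part of the assembler script

```python
def encode(opcode: int, operand: int = 0) -> int:
    """Pack a 4-bit opcode and a 4-bit operand into one byte."""
    return (opcode << 4) | operand


def decode(byte: int) -> tuple:
    """Split a byte back into its (opcode, operand) halves."""
    return byte >> 4, byte & 0xF


LDAR, LDAD = 0b0001, 0b0010

print(f"{encode(LDAR, 15):#010b}")  # 0b00011111
print(f"{encode(LDAR, 15):#04x}")   # 0x1f
print(f"{encode(LDAD, 5):#010b}")   # 0b00100101
print(decode(0x1F))                 # (1, 15)
```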

The assembler will translate literal values from various bases
- For example, the programmer could write 0b1010, 10, or 0xA to mean ten
- Although they all mean the same thing, one encoding may make more sense for the programmer in some context
  Remember, code is for humans, machine code is for machines
  
  An assembly language is one step away from machine code
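
One way to read all three spellings in Python is the built-in int with a base of 0, which infers the base from the literal's prefix
- This is only a sketch of the idea that the three spellings are one value
- The variable names are made up for the example

```python
literals = ["0b1010", "10", "0xA"]

# Base 0 tells int() to pick binary, decimal, or hex from the prefix.
values = [int(literal, 0) for literal in literals]

for literal, value in zip(literals, values):
    print(literal, "->", value)
# 0b1010 -> 10
# 10 -> 10
# 0xA -> 10
```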
Negative numbers will also be handled
- The assembler will convert the number to a two’s complement number
Finally, the ESAP assembler will provide some level of error checking on the program
- Whether the program fits into RAM
- Whether each line has valid syntax
- Whether any operands are missing
- Whether all values are within range
Since an assembler is a program, a Python script can serve as the ESAP assembler
Below, a script created for the ESAP system’s assembler is discussed
- This script is by no means the only way one could write an assembler
- Its presentation serves to show the simplicity of such an assembler
- It facilitates the additional layer of abstraction
A series of constants is used to simplify the code

import re
import sys

OPERATORS = {
    "NOOP": 0b0000,
    "LDAR": 0b0001,
    "LDAD": 0b0010,
    "LDBR": 0b0011,
    "LDBD": 0b0100,
    "SAVA": 0b0101,
    "SAVB": 0b0110,
    "ADAB": 0b0111,
    "SUAB": 0b1000,
    "JMPA": 0b1001,
    "JMPZ": 0b1010,
    "JMPS": 0b1011,
    "JMPC": 0b1100,
    "OUTU": 0b1101,
    "OUTS": 0b1110,
    "HALT": 0b1111,
}
HAS_OPERAND = {
    "LDAR",
    "LDAD",
    "LDBR",
    "LDBD",
    "SAVA",
    "SAVB",
    "JMPA",
    "JMPZ",
    "JMPS",
    "JMPC",
    "OUTU",
    "OUTS",
}
VALID_SYNTAX = {
    r"NOOP",
    r"LDAR\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"LDAD\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"LDBR\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"LDBD\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"SAVA\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"SAVB\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"ADAB",
    r"SUAB",
    r"JMPA\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"JMPZ\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"JMPS\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"JMPC\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"OUTU\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"OUTS\s+\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
    r"HALT",
    r"^-?\b(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)\b",
}

Helper functions are also included that will make the assembly loop easier to implement

def parse_number(number_string: str) -> int:
    """
    Convert a string of a number to an integer. This function works with binary (0bXXXX), hex (0xXX), and decimal
    literals, each with an optional leading minus sign.

    :param number_string: String of a number to be converted.
    :raise ValueError: If the string is not a valid number literal.
    :return: Value of the string as an integer.
    """
    try:
        # Base 0 makes int() infer the base from the 0b/0x prefix, which avoids calling eval on file contents
        number = int(number_string, 0)
    except ValueError:
        raise ValueError(f"Cannot parse operand {number_string}") from None
    return number

def verify_number_and_fix_negative(number: int, max_bits: int) -> int:
    """
    Verify that the number fits in the specified number of bits and convert to a signed, two's complement binary
    pattern where necessary.

    If the number is negative, this function applies the two's complement conversion since Python does not store
    negative integers in a two's complement format. For example, with max_bits of 4, the number -6 should be converted
    to the two's complement binary pattern 0b1010. Since Python treats all binary patterns as unsigned ints, this
    function returns the integer 10 in this case.

    :param number: Number to be verified and converted
    :param max_bits: Maximum number of bits the number can be stored in.
    :raise ValueError: If the number cannot be represented with max_bits bits.
    :return: Decimal version of the number (may be signed int binary pattern's decimal value).
    """
    # max_bits + 2 to account for the 0b in the string
    # negative numbers are representable in max_bits - 1, but require +1 for the negative sign
    # len(bin(number + 1)) when negative to account for edge case of the min negative number needing 1 more bit
    if (number < 0 and len(bin(number + 1)) > max_bits + 2 or
            number >= 0 and len(bin(number)) > max_bits + 2):
        raise ValueError(f"Data value {number} cannot be represented with {max_bits} bits.")
    if number < 0:
        # Invert the bits of the magnitude within max_bits, then add 1
        number = ((2**max_bits - 1) ^ (number * -1)) + 1
    return number

verify_number_and_fix_negative does two things
- It verifies that a number can be represented with some specific number of bits
  For example, if the number is data, it must fit in 8 bits
  
  If the number is an operand, it must fit in 4 bits
- This function also converts negative numbers to the integer representing the signed two’s complement number
  This is necessary as Python has a peculiarity when it comes to signed integers
  
  Python stores a signed integer as a magnitude (an unsigned integer) together with a sign flag
  
  It does not store the signed integers as two’s complement numbers
  
  For example, -7 would be stored as 0b0111 with a sign flag
  
  However, the two’s complement number for -7 is 0b1001, which is what the ESAP system expects
  
  This function would convert -7 to the number representing the correct two’s complement bit pattern
  
  In this case, it would return the integer 9, which corresponds to the bit pattern 0b1001
  
  Here, the value 9 is not important, but the underlying bit pattern for 9 is important
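
The same bit pattern can also be obtained by masking, which is a different technique from the invert-and-add-one conversion used in the function above
- Python's bitwise operators treat a negative integer as if it had an infinitely long two's complement representation, so keeping only the lowest bits yields the fixed-width pattern
- The helper name to_twos_complement is hypothetical, and this sketch does no range checking

```python
def to_twos_complement(number: int, bits: int) -> int:
    """Return the unsigned integer whose low `bits` bits encode `number`."""
    return number & ((1 << bits) - 1)


print(bin(-7))                             # -0b111
print(f"{to_twos_complement(-7, 4):04b}")  # 1001
print(to_twos_complement(-7, 4))           # 9
print(to_twos_complement(-1, 8))           # 255
print(to_twos_complement(5, 4))            # 5
```

Non-negative numbers that already fit pass through unchanged, which is why only negative numbers need special handling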

def verify_syntax_return_string(program_line: str) -> str:
    """
    Verifies that a given program line is valid syntax for the assembler. If valid, this function returns the matched
    portion of the line, otherwise the function raises a ValueError.

    :param program_line: A single, stripped line of the assembly program.
    :raise ValueError: If the program line does not match a valid syntax pattern.
    :return: Returns a valid program line
    """
    for syntax in VALID_SYNTAX:
        syntax_match = re.match(syntax, program_line)
        if syntax_match:
            return syntax_match[0]
    raise ValueError(f"Invalid operator and/or operand {program_line}")
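
To see how one of these patterns behaves, the sketch below tests a few lines against the LDAD pattern only
- The names OPERAND, LDAD_SYNTAX, and is_valid_ldad are made up for this example
- Note that re.match only anchors at the start of the string, so trailing text after a valid instruction is not rejected; re.fullmatch is the stricter alternative

```python
import re

# The operand part shared by every instruction that takes a value.
OPERAND = r"(0x[0-9a-fA-F]+|0b[0-1]+|[0-9]+)"
LDAD_SYNTAX = rf"LDAD\s+\b{OPERAND}\b"


def is_valid_ldad(line: str) -> bool:
    """Report whether the line starts with a valid LDAD instruction."""
    return re.match(LDAD_SYNTAX, line) is not None


print(is_valid_ldad("LDAD 5"))       # True
print(is_valid_ldad("LDAD 0b0101"))  # True
print(is_valid_ldad("LDAD 0x5"))     # True
print(is_valid_ldad("LDAD five"))    # False
print(is_valid_ldad("LDAD"))         # False
print(is_valid_ldad("LDAD -5"))      # False

# re.match accepts trailing text, re.fullmatch does not.
print(re.match(LDAD_SYNTAX, "LDAD 5 junk") is not None)      # True
print(re.fullmatch(LDAD_SYNTAX, "LDAD 5 junk") is not None)  # False
```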

The main part of the script uses the above constants and functions

if len(sys.argv) < 2 or len(sys.argv) > 3:
    raise ValueError(f"Assembler takes 1 or 2 argument(s), {len(sys.argv) - 1} given\n"
                     f"\tUsage: assembler.py input.as [out.hex]\n"
                     f"\t\tinput.as: the source assembly file to assemble\n"
                     f"\t\tout.hex: the output hex dig file, defaults to `a.hex` (optional)\n")

file_to_assemble = sys.argv[1]
file_to_output = sys.argv[2] if len(sys.argv) == 3 else "a.hex"

The assembler takes one or two command line arguments
- One argument specifies the name of the file containing the assembly code to assemble
- The second argument, which is optional, specifies the name of the file to write the machine code to
- This portion of the script verifies that the correct number of arguments is provided to the script
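
The same interface could also be expressed with Python's standard argparse module, which is a different approach from the manual sys.argv check above
- This is only a sketch, and the function name parse_arguments is hypothetical
- Unlike the script above, argparse prints a usage message and exits instead of raising a ValueError when the arguments are wrong

```python
import argparse


def parse_arguments(argv):
    """Parse one required source file and one optional output file."""
    parser = argparse.ArgumentParser(description="Assemble ESAP assembly into a hex file")
    parser.add_argument("source", help="the source assembly file to assemble")
    parser.add_argument("output", nargs="?", default="a.hex",
                        help="the output hex file (optional)")
    return parser.parse_args(argv)


args = parse_arguments(["to_assemble.esap"])
print(args.source, args.output)  # to_assemble.esap a.hex

args = parse_arguments(["to_assemble.esap", "assembled.hex"])
print(args.source, args.output)  # to_assemble.esap assembled.hex
```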

with open(file_to_assemble) as file:
    program_list = [line.strip() for line in file.readlines() if line.strip()]

if len(program_list) > 16:
    raise ValueError(f"Program length of {len(program_list)} exceeds maximum size of 16 bytes")

The assembler reads the assembly language code and verifies that it fits within RAM
The main loop of the assembler processes one instruction at a time

machine_code = []
for raw_program_line in program_list:
    verified_program_line = verify_syntax_return_string(raw_program_line)
    line = verified_program_line.split()
    if line[0].isalpha():
        # Instruction: the opcode fills the upper 4 bits and the operand fills the lower 4 bits
        operator = OPERATORS[line[0]]
        if line[0] not in HAS_OPERAND:
            operand = 0
        else:
            operand = parse_number(line[1])
            operand = verify_number_and_fix_negative(operand, 4)
        machine_code_line = (operator << 4) | operand
    else:
        # Data: a bare number is stored as a full 8-bit value
        machine_code_line = parse_number(line[0])
        machine_code_line = verify_number_and_fix_negative(machine_code_line, 8)
    machine_code.append(machine_code_line)

The main loop verifies the line and processes it as an instruction or data accordingly
- If it’s an instruction, it processes the operand if necessary too
Finally, the assembler saves the assembled code to a file

with open(file_to_output, "w") as hex_file:
    hex_file.write("v2.0 raw\n")
    hex_file.writelines([f"0x{code:02x}\n" for code in machine_code])

Using the assembler is then a matter of running the script with the proper command line arguments
- For example, python assembler.py to_assemble.esap assembled.hex
- The file extension .esap is not necessary for the assembly language file
- If the second argument is not set, the file is saved to a.hex by default
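
To give a sense of the resulting file, the sketch below formats the machine code for a four-line program (LDAD 5, LDBD 3, ADAB, HALT) in the same way as the script's final step
- The function name write_hex is hypothetical
- An in-memory buffer is used here instead of a real file so the output can be printed directly

```python
import io


def write_hex(machine_code, hex_file):
    """Write the header line followed by one hex byte per line."""
    hex_file.write("v2.0 raw\n")
    hex_file.writelines(f"0x{code:02x}\n" for code in machine_code)


# LDAD 5, LDBD 3, ADAB, HALT
machine_code = [0x25, 0x43, 0x70, 0xF0]

buffer = io.StringIO()
write_hex(machine_code, buffer)
print(buffer.getvalue())
# v2.0 raw
# 0x25
# 0x43
# 0x70
# 0xf0
```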

23.3. For Next Time

Something?