State of Layout Descriptor Language

15 July, 2015

Version 0.3

Tobi Ajila, Angela Lin, Daniel Heidinga, Karl Taylor and Patrick Doyle

This is a proposal to specify a layout descriptor language (LDL) for the Java Virtual Machine. The contents of this proposal are a summary of the discussions in the mailing list as well as a general outline for the future direction of this work.

Background

Project Panama proposes to create a new standard way to connect the JVM and foreign (non-Java) APIs. The project includes (and is not limited to) the following:

A Java Layout is a representation of a native data structure. A layout can be mapped over an existing structure in native memory or it can be used to allocate a new structure in memory while providing access to its members. The sequence below describes a general overview of Layouts.

Layout Overview
fig.1 Layout Overview

A groveller is a tool that parses files for native structures and their members. The output of the groveller is a layout descriptor (LD) and stub interfaces. The stub interfaces provide an interface for programmers to use during development. They contain method headers for accessors but no implementation. The LD is a description of the native data that the runtime system uses to generate appropriate accessors. The Layout Factory supplies instances of the stub interface classes at runtime. These instances contain the implementation of all the accessors.

1 Introduction

Panama proposes a new way to interface the JVM with native APIs. To achieve this we will need to come up with a standard way to describe structured data. Different languages have different ways of describing data (COBOL comp-5 vs C int) and different platforms interpret the same native types differently (C/C++ x86-32 long vs x84-64 long). There is a need for a language that can describe native data in a general way regardless of platform or language. For these reasons we have come up with the layout descriptor language. The main goals of the language are the following:

  1. Describe data access disciplines
  2. Describe the data layout
  3. Attach type information to the data

2 Data Access Disciplines

2.1 Motivation for Memory Model

There are many diverse languages with specific rules regarding native data access. It is not feasible to describe all the possible data access disciplines in one language. We want to define a memory model that is familiar to people who are familiar with native languages, and also allows a compiler to make optimizations. C++11 introduced a memory model that describes concurrent interaction between memory locations and bit-fields. This memory model gives flexibility to compiler designers by permitting interference within the confines of a memory location. It is important to have the ability to capture this kind of behaviour in the LDL as we could also benefit from similar optimizations. Also, it would mean that C++ users could expect similar data access behaviour in Panama.

2.2 Memory Model

2.2.1 Composition of a Layout

A field is defined as a contiguous sequence of bits, where bit zero is the least significant bit (this is a LE-centric bit numbering scheme). A container is defined as a contiguous sequence of one or more fields. This implies that a field cannot have a greater size than its enclosing container. There are no other size restrictions on fields. The size of a container must be a multiple of 8 bits.

Composition of a Container
fig.2 Composition of a container

A member of a layout can be a container, array, union, or nested layout. A container is the most primitive form of member.

An array is defined as a contiguous sequence of identical member types. Multi-dimensional arrays are laid out in row-major order.

A union is defined as a group of overlapping member types where the size of the union is the size of its largest member type.

A layout is defined as a contiguous sequence of members.

Based on the definitions above a member can be composed of other members. The leaf member of any member hierarchy is always a container. A container is the most basic atomic unit, it cannot be composed of another container.

Composition of a Layout
fig.3 Memory layout of a Layout composed of a container, nested layout, array and a union

2.2.2 Atomicity and Tearing

An update of a field may overwrite the contents of other fields within the same container. An update of a field may not overwrite the contents of a field in another container. If the enclosing container is marked as "atomic", another thread cannot observe the value of fields in the container before the update is complete. Since member types in unions overlap with one another, atomicity and tearing rules are only enforced for the individual member types.

2.3 Endianness

Endianness is specified in the layout descriptor (LD) at a container granularity, where the endian can be either big-endian or little-endian. In cases where the data layout is invariant across platforms (e.g., network protocols) we have just one LD. The bytecode generated from the LD must be portable between platforms. This means that the runtime system must make the appropriate byte swaps (LEloadInt & BEloadInt) to ensure that the behaviour of the generated bytecode is consistent across all platforms.

3 Data Layout

  1. A layout descriptor must be interpreted the same way by all JDKs. This means that there can be no implementation-defined inference of padding or alignment. Every field and member type must be specified in terms of offset and size. The offset is calculated from the beginning of the enclosing type. E.g. the offset of a field is calculated from the beginning of the enclosing container. The offset must take into account any required padding or alignment. The LD must describe every piece of memory in a layout. There cannot be any unspecified regions.

  2. A layout can have an alignment associated with all allocations of its type. The default alignment of a Layout is recursively defined as the following: The Layout alignment is the maximum alignment of all of its members, where the alignment of a member is the maximum alignment of all of its members. If a member is a container the alignment is its size rounded up to 2^n.

    Note: The "2^n" requirement is to ensure that default alignment is register sized. However once the size of a container is larger than the largest register size this limitation is no longer necessary. This is an area that we can expand on in future discussions.

    Layout Alignment = Max(Align(member1),... Align(memberN))
            = Max(Max(Align(member1_1), Max(Align(member1_2))), ... Align(memberN))
            = Max(Max(ROUND2N(sizeof(container1_1)), Max(ROUND2N(sizeof(container1_2)))), ... Align(memberN))

    C Example:

     struct Triangle {
      uint8_t triDim;
      struct {
       struct {
        uint32_t x;
        uint32_t y;
        uint32_t z;
       } point[2];
      } line[3];
     };
    The alignment of the structure above is 4bytes. Therefore alignment is not based on direct member size, but on the largest leaf primitive that the structure is composed of. This is analogous to saying that the alignment of a layout is based on the largest leaf container size.

  3. A layout cannot be recursively defined. This means a layout cannot have a nested type of itself or of another layout that contains a nested type of itself.
  4. All member types in a union begin at offset zero.
  5. A variable sized array can be specified as long as the size of the array is provided ahead of time such that it is possible to bounds check the access. Note, there are many other types of variable sized arrays. This proposal only attempts to provide this specific usage of variable sized arrays. This is an area that we can expand on in future discussions.

4 Accessor Type specification

This section describes how to view layouts and their members in Java. Except for unions, each member of a layout (as defined in section 2.2.1) is associated with one type. For a union, each member of the union has one type.

Type informs the generation of accessors for these members. For example, the getter of a nested layout member might return a layout instance (see 4.1 #2 below).

For the purpose of illustration, we have referenced a type called "Layout" below. "Layout" represents a base type for Layouts. The specifics of this type are beyond the scope of this document, and are not essential for the following discussions. "Layout" is defined as an interface in this specification, but it could also be defined as an abstract class. This document does not suggest that we favour one or another.

4.1 Accessor type categories

There are 5 categories of types: primitive, structured, raw, pointer and opaque.

  1. Primitive types: boolean, byte, char, short, int, long, float, double

    Associating a primitive type with a layout member indicates that the member can be interpreted as the corresponding primitive Java type.

  2. Structured Types:

    Structured types are typically associated with layout members that are composites of containers, such as arrays and nested layouts. A layout member that has a structured type is accessed using a layout. A getter for the member would return a layout instance.

  3. Raw:

    The Raw type is used to provide access to the raw bits of a layout member. This is useful in cases where there are no suitable Java types to represent the data, such as native types wider than 64 bits. A getter for a Raw member might provide a ByteBuffer interface to the data, such as:

    interface Raw extends Layout {
     ByteBuffer getValue();
    }

  4. Pointer:

    This type provides a way to view and dereference native pointers in layouts. It enables Java code to interact with data structures such as native linked lists. In the example below, one could use the Pointer type to walk all the nodes of a linked list.

    struct Node {
     uint64_t data;
     struct Node* next;
    }

    This would be done using the following API.
    interface Pointer extends Layout {
     T deference();
     void set(Pointer ptr);
    }

  5. Opaque:

    An Opaque member type is useful in cases where we want the offset and size of a member represented, but we don't want any accessors to be generated for it. In the previous version of this document, we stated that an unnamed container would have this behaviour, but it may be worthwhile to be more explicit. Opaque members could be used when specifying padding in a layout.

4.2 Effective Address (EA) Accessor

An EA accessor is analogous to C's "address of a struct field". It allows us to pass along a view to a layout member, for example, in a calling sequence from C->Java->C, without copying or marshalling the member's data. Other potential uses could be to facilitate casting between layout types.

EA accessors do not need to be tied to a particular member type. We could potentially generate them for all members of a layout.

The following provides some examples as to how EA should exposed. For this document I will use the term 'address' to describe where a layout is located.

4.3 Accessor type rules

The following are the rules for associating type information with Layouts. These rules are in addition to the previously defined rules in section 2 and 3.

  1. Type information only represents the Java type that is returned, not how the container is loaded or stored.
  2. Each container must have an accessor type associated with it.
  3. If a container is of type Pointer or Raw, it cannot have fields.
  4. The type of a nested layout member is itself. In other words, its accessor must return a layout instance of the same type as the nested layout.

4.4 Accessor generation

It seems sensible to generate accessors in all cases where pointer, primitive and structured types appear in a LD, for Raw and EA this may not be the case. This section proposes strategies for generating EA and Raw.

4.4.1 Generate up front

  1. Generate Always:

    Generate EA accessors for every layout member and make an exception where Opaque is specified. The downside is that this greatly increases the size of the class (memory footprint) and adds additional API complexity.

  2. Generate when specified:

    Ensure that all accessor types to be generated are specified in the LD, this also includes EA and Raw. This implies that all required accessor types used must be known when the LD is created. The provider of the LD determines which fields get an EA and it cannot be requested after.

  3. EA inner class:

    Create an inner Class in the Layout that contains all the EA getters. This separates all basic getters and setters from the EA getters and thus reduces API complexity.

  4. Two Stub Interfaces:

    Two separate stub interfaces for Layouts, one with basic LD specified getters and another sub-Interface with EA (and/or Raw). However, JVMs prefer monomorphic callsites and single implementations of interfaces.

4.4.2 Generate on demand

  1. EAFactory factory:

     class EAFactory {
     //detects if member is opaque or non-existent
      EA getEA(Layout instance, String memberName);
      EA getEA(Layout instance);
      EA getEA(Array1D instance, long index);
      EA getEA(Array2D instance, long row, long column);
     }

    The disadvantage is this requires reflection to look up offsets of members when given a member name (could make use of caching techniques to make this faster).

4.4.3 Overview

Options JIT Optimizations Runtime Cost Footprint API complexity/usability
Generate Always Good:
Jit can easily inline methods, likely monomorphic
Good:
No runtime cost to generating methods
Poor:
Increased class size
Poor:
Additional EA methods may confuse users who are not familiar with low-level languages
Generate When Specified Good:
Jit can easily inline methods, likely monomorphic
Good:
No runtime cost to generating methods
Fair:
Increased class size, but only when specified
Poor:
Cannot request EA methods that are not specified in the LD
EA inner class Good:
Jit can easily inline methods, likely monomorphic
Good:
No runtime cost to generating methods
Poor:
Increased class size
Good:
Separation of EA methods and other basic methods
Two Interfaces Poor:
Not monomorphic
Good:
No runtime cost to generating methods
Poor:
Additional class
Good:
Separation of EA methods and other basic methods
EA factory Fair:
Acquiring EA is not a simple base + offset operation
Poor:
Requires reflection to find fields
Good:
No EA methods in Layout Classes
Good:
Separation of EA methods and other basic methods

EA inner class appears to be the best choice as it does not increases API complexity and is likely to have good performance.

4.4.4 RAW utility

Raw has been previously defined as an accessor type, but it can also be viewed as a utility type. The are cases where it would be useful to access the raw bits of a layout (copying a layout, overwriting a layout, etc.). The following describes a Raw API that can facilitate this:
 class RAWFactory {
  Raw getRaw(Layout instance);
  Raw getRaw(Layout instance, EA start, EA end);
  Raw getRaw(Array1D instance, long index);
  Raw getRaw(Array2D instance, long row, long column);
 }
This API makes use of EA to generate Raw objects. The use of this API will be restricted by a security layer as improper use of this functionality may cause the JVM to malfunction.

5 Grammar

5.1 Goals

The grammar of the LDL must be able to fully capture the requirements mentioned in sections 2), 3) and 4). It should not contain any redundant information, and must be easily verifiable. The grammar should also be human-readable and writeable as some users may need to generate LDL for data structures that are not encoded in a formal language.

5.2 Abstract model

The following describes the kind of information that should be captured in the LDL.

5.3 Straw-man Grammar syntax

The following grammar is a rough attempt and is subject to change. It is retained from the previous version of this document for the purpose of sharing examples.

for layouts:
 layoutName','size','[endianness','][alignment','] //endianness (<, >) LE, BE
 ['{'
  {(containers | unions)}
 '}']

for unions:
 'U:'unionSize [unionName]
 ['{'
  {containers}
 '},']

for containers:
 [endianess,] ([typeInfo,] containerSize, [containerName,] | layoutName) {'[' numOFElements ']'}
 ['{'
  {fields}
 '},']

 for fields:
  fieldSize, [fieldName,]

for typeInfo:


for layoutName:

6 Examples

6.1 Basic Example

The following is a basic structure with two fields 'x' and 'y'.

 struct A {
  uint16_t x;
  uint16_t y;
 };

This structure produces the following little-endian layout.

LD:

 LA;, 32, < { //<-- little-endian
  int, 16, x,
  int, 16, y,
 }

6.2 Array Example

The following example shows how a structure of arrays can be described in a layout.

 struct SOA {
  uint8_t a[10];
  uint16_t b[10][10]; //2-d array
 };

LD:

 LSOA;, 210, < {
  int, 8[10], a,
  int, 16[10][10], b,
 }

6.3 IP Header Example

http://en.wikipedia.org/wiki/IPv4#Header

The next example shows the layout of an IPV4Header.

LD:

 LIPv4;, 160, > { //<-- big-endian
  byte, 8, {
   4 ihl,
   4 version,
  },
  byte, 8, {
   2 ECN,
   6 DSCP,
  },
  short, 16, totLen,
  short, 16, iden,
  short, 16, {
   13 fragOff,
   3 flags,
  },
  byte, 8, TTL,
  byte, 8, Proto,
  short, 16, Checksum,
  int, 32, srcAddr,
  int, 32, destAddr,
  int, 32, options,
 }

The following is the memory layout of the first 4 bytes:
0: version 7 - 4, ihl 3 - 0
1: DHCP 7 - 3, ECN 2 - 0
2: totLen 15 - 8
3: totLen 7 - 0

6.4 Nested Layout

 struct A {
  uint32_t x;
  uint32_t y;
 }

 struct B {
  struct A xy;
  uint32_t z;
 }

LD:

 LA;, 64, < {
  int, 32, x,
  int, 32, y,
 }

 LB;, 96, < {
  LA;, xy, //<-- A is nested in B
  int, 32, z,
 }

6.5 UDP Packet

http://en.wikipedia.org/wiki/User_Datagram_Protocol

LD:

 LUDPPacket;, 224, > {
  LIPv4;, ipHeader, //<-- nested struct
  short, 16, srcPort,
  short, 16, destPort,
  short, 16, length,
  short, 16, checksum,
 }

6.6 Implicit padding example

On a 64 bit machine the compiler would add 32 bit padding between the two fields shown in the following structure.

 struct A {
  uint32_t x;
  uint64_t y;
 }

This structure would produce the following Layout:

LD:

 LA;, 128, < {
  int, 32, x,
  32, //<-- unnamed fields are used to represent padding
  long, 64, y,
 }