.. _Readers:

Developing lld Readers
======================

Note: this document discusses the Mach-O port of LLD. For ELF and COFF,
see :doc:`index`.

Introduction
------------

The purpose of a "Reader" is to take an object file in a particular format
and create an `lld::File`:cpp:class: (which is a graph of Atoms)
representing the object file.  A Reader inherits from
`lld::Reader`:cpp:class: which lives in
:file:`include/lld/Core/Reader.h` and
:file:`lib/Core/Reader.cpp`.
 
The Reader infrastructure for an object format ``Foo`` requires the
following pieces in order to fit into lld:

:file:`include/lld/ReaderWriter/ReaderFoo.h`

   .. cpp:class:: ReaderOptionsFoo : public ReaderOptions

      This Options class is the only way to configure how the Reader will
      parse any file into an `lld::File`:cpp:class: object.  This class
      should be declared in the `lld`:cpp:class: namespace.

   .. cpp:function:: Reader *createReaderFoo(ReaderOptionsFoo &reader)

      This factory function configures and creates the Reader. This function
      should be declared in the `lld`:cpp:class: namespace.

:file:`lib/ReaderWriter/Foo/ReaderFoo.cpp`

   .. cpp:class:: ReaderFoo : public Reader

      This is the concrete Reader class which can be called to parse
      object files. It should be declared in an anonymous namespace or,
      if there is shared code with the `lld::WriterFoo`:cpp:class:, you
      can make a nested namespace (e.g. `lld::foo`:cpp:class:).

You may have noticed that :cpp:class:`ReaderFoo` is not declared in the
``.h`` file. An important design aspect of lld is that all Readers are
created *only* through an object-format-specific
:cpp:func:`createReaderFoo` factory function. The creation of the Reader is
parametrized through a :cpp:class:`ReaderOptionsFoo` class. This options
class is the one-and-only way to control how the Reader operates when
parsing an input file into an Atom graph. For instance, you may want the
Reader to only accept certain architectures. The options class can be
instantiated from command line options or be programmatically configured.
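The layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ``Reader`` and ``ReaderOptions`` classes below are simplified stand-ins rather than the real lld declarations, the ``allowedArchs`` option and the ``canParse`` method are hypothetical, and the factory returns a ``std::unique_ptr`` instead of the raw pointer shown in the declaration above simply to keep the sketch leak-free.

```cpp
#include <memory>
#include <set>
#include <string>
#include <utility>

// Simplified stand-ins for the lld base classes (not the real declarations).
struct ReaderOptions {
  virtual ~ReaderOptions() = default;
};

class Reader {
public:
  virtual ~Reader() = default;
  // Hypothetical query: would this Reader accept an object file for `arch`?
  virtual bool canParse(const std::string &arch) const = 0;
};

// ---- What ReaderFoo.h would expose: the options class and the factory. ----
struct ReaderOptionsFoo : ReaderOptions {
  // Hypothetical option: architectures to accept (empty means "accept all").
  std::set<std::string> allowedArchs;
};

std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options);

// ---- What ReaderFoo.cpp would contain: the hidden concrete class. ----
namespace {
class ReaderFoo : public Reader {
public:
  explicit ReaderFoo(ReaderOptionsFoo options) : _options(std::move(options)) {}

  bool canParse(const std::string &arch) const override {
    return _options.allowedArchs.empty() ||
           _options.allowedArchs.count(arch) != 0;
  }

private:
  const ReaderOptionsFoo _options; // fixed once the Reader has been created
};
} // namespace

std::unique_ptr<Reader> createReaderFoo(const ReaderOptionsFoo &options) {
  return std::make_unique<ReaderFoo>(options);
}
```

Client code in this sketch only ever sees ``Reader``, ``ReaderOptionsFoo`` and ``createReaderFoo``. Because the concrete class stays private to the ``.cpp`` file, the options object is the only handle a client has for changing the Reader's behavior.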
 
Where to start
--------------

The lld project already has a skeleton of source code for Readers for
``ELF``, ``PECOFF``, ``MachO``, and lld's native ``YAML`` graph format.
If your file format is a variant of one of those, you should modify the
existing Reader to support your variant. This is done by customizing the Options
class for the Reader and making appropriate changes to the ``.cpp`` file to
interpret those options and act accordingly.

If your object file format is not a variant of any existing Reader, you'll need
to create a new Reader subclass with the organization described above.

Readers are factories
---------------------

The linker will usually only instantiate your Reader once.  That one Reader will
have its ``loadFile()`` method called many times with different input files.
To support multithreaded linking, the Reader may be parsing multiple input
files in parallel. Therefore, there should be no parsing state in your Reader
object.  Any parsing state should be in member variables of your File subclass
or in some temporary object.

The key function to implement in a reader is::

  virtual error_code loadFile(LinkerInput &input,
                              std::vector<std::unique_ptr<File>> &result);

It takes the linker input (a memory buffer containing the contents of the
object file being read) and returns, through ``result``, instantiated
lld::File objects, each of which is a collection of Atoms. The result is a
vector of File pointers (instead of simply a single File pointer) because
some file formats allow multiple object "files" to be encoded in one file
system file.
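As a rough sketch of what "no parsing state in the Reader" can look like, consider the following. Everything here is an assumption made for illustration: ``File`` is a simplified stand-in for lld::File, the method takes a path and a ``std::string`` buffer instead of a ``LinkerInput``, it returns ``std::error_code``, and the line-based input format is invented for the example.

```cpp
#include <memory>
#include <sstream>
#include <string>
#include <system_error>
#include <utility>
#include <vector>

// Simplified stand-in for lld::File: everything learned while parsing one
// object file is stored here, never in the Reader.
struct File {
  std::string path;
  std::vector<std::string> atomNames;
};

class ReaderFoo {
public:
  // The method is const and the class has no data members, so one ReaderFoo
  // can load many inputs, even from several threads at once.
  std::error_code loadFile(const std::string &path, const std::string &buffer,
                           std::vector<std::unique_ptr<File>> &result) const {
    if (buffer.empty())
      return std::make_error_code(std::errc::invalid_argument);

    // Toy format invented for this sketch: each non-empty line names one
    // atom, and a line containing "---" starts another embedded object file.
    std::istringstream lines(buffer); // temporary, per-call parsing state
    std::string line;
    auto current = std::make_unique<File>();
    current->path = path;
    while (std::getline(lines, line)) {
      if (line == "---") {
        result.push_back(std::move(current));
        current = std::make_unique<File>();
        current->path = path;
      } else if (!line.empty()) {
        current->atomNames.push_back(line);
      }
    }
    result.push_back(std::move(current));
    return std::error_code(); // success
  }
};
```

All state that exists during a call lives either in locals (``lines``, ``line``, ``current``) or in the File objects handed back through ``result``, so two concurrent calls never touch the same data.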
 
Memory Ownership
----------------

Atoms are always owned by their File object. During core linking when Atoms
are coalesced or stripped away, core linking does not delete them.
Core linking just removes those unused Atoms from its internal list.
The destructor of a File object is responsible for deleting all Atoms it
owns, and if ownership of the MemoryBuffer was passed to it, the File
destructor needs to delete that too.
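The ownership rule can be illustrated with a small sketch. The classes below are hypothetical stand-ins, not lld's real ``File`` and ``Atom`` types, and ``LiveAtoms`` is an invented name for the core linker's internal list. The File holds its Atoms through ``std::unique_ptr``, while the linker side only keeps non-owning pointers.

```cpp
#include <algorithm>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified Atom.
struct Atom {
  std::string name;
};

// Stand-in for a File subclass: it owns every Atom it creates.
class File {
public:
  Atom *addAtom(std::string name) {
    _atoms.push_back(std::make_unique<Atom>(Atom{std::move(name)}));
    return _atoms.back().get(); // callers receive a non-owning pointer
  }
  std::size_t atomCount() const { return _atoms.size(); }

private:
  // Destroyed together with the File, which releases all of its Atoms.
  std::vector<std::unique_ptr<Atom>> _atoms;
};

// Stand-in for core linking's internal list of Atoms still in use.
class LiveAtoms {
public:
  void add(Atom *atom) { _live.push_back(atom); }

  // Dead stripping or coalescing: forget the Atom, but do not delete it.
  void strip(const Atom *atom) {
    _live.erase(std::remove(_live.begin(), _live.end(), atom), _live.end());
  }
  std::size_t size() const { return _live.size(); }

private:
  std::vector<Atom *> _live; // non-owning
};
```

After ``strip()`` the Atom is gone from the linker's list but still alive inside its File, which matches the rule that only the File destructor deletes Atoms.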
 
Making Atoms
------------

The internal model of lld is purely Atom based.  But most object files do not
have an explicit concept of Atoms; instead, most have "sections". The way
to think of this is that a section is just a list of Atoms with common
attributes.

The first step in parsing section-based object files is to cleave each
section into a list of Atoms. The technique may vary by section type. For
code sections (e.g. .text), there are usually symbols at the start of each
function. Those symbol addresses are the points at which the section is
cleaved into discrete Atoms.  Some file formats (like ELF) also include the
length of each symbol in the symbol table. Otherwise, the length of each
Atom is calculated to run to the start of the next symbol or the end of the
section.
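The symbol-based cleaving described above might look roughly like this. The ``Symbol`` and ``AtomRange`` types and the ``cleaveBySymbols`` function are hypothetical names chosen for the sketch. It assumes every symbol lies inside the section and that no two symbols share an offset (aliases are not handled).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical symbol table entry; size == 0 means "the format gives no size".
struct Symbol {
  std::string name;
  uint64_t offset; // offset of the symbol within its section
  uint64_t size;
};

// Hypothetical description of one Atom carved out of the section.
struct AtomRange {
  std::string name;
  uint64_t offset;
  uint64_t length;
};

std::vector<AtomRange> cleaveBySymbols(std::vector<Symbol> symbols,
                                       uint64_t sectionSize) {
  // Symbol tables are not necessarily ordered by address, so sort first.
  std::sort(symbols.begin(), symbols.end(),
            [](const Symbol &a, const Symbol &b) { return a.offset < b.offset; });

  std::vector<AtomRange> atoms;
  atoms.reserve(symbols.size());
  for (std::size_t i = 0; i < symbols.size(); ++i) {
    const Symbol &sym = symbols[i];
    assert(sym.offset <= sectionSize && "symbol lies outside the section");
    // Without an explicit size, the Atom runs to the next symbol or to the
    // end of the section.
    uint64_t next =
        (i + 1 < symbols.size()) ? symbols[i + 1].offset : sectionSize;
    uint64_t length = (sym.size != 0) ? sym.size : next - sym.offset;
    atoms.push_back({sym.name, sym.offset, length});
  }
  return atoms;
}
```

For a 64-byte section with symbols at offsets 0, 16 and 40, where only the last one carries an explicit size of 8, this yields Atoms of length 16, 24 and 8.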
 
Other section types can be implicitly cleaved. For instance, c-string literals
or unwind info (e.g. .eh_frame) can be cleaved by having the Reader look at
the content of the section.  It is important to cleave sections into Atoms
to remove false dependencies. For instance, the .eh_frame section often
has no symbols, but contains "pointers" to the functions for which it
has unwind info.  If the .eh_frame section were not cleaved (but left as one
big Atom), there would always be a reference (from the eh_frame Atom) to
each function.  So the linker would be unable to coalesce or dead strip
the function atoms.
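Implicit cleaving is easiest to see with c-string literals, where the content itself marks the Atom boundaries. The following sketch uses hypothetical names (``CStringAtom``, ``cleaveCStrings``) and takes the section content as a ``std::string`` for simplicity. Each NUL-terminated string becomes one Atom, and a trailing run without a terminator is kept as a final Atom.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one c-string Atom within the section.
struct CStringAtom {
  std::size_t offset;
  std::size_t length; // includes the terminating NUL, when one is present
};

std::vector<CStringAtom> cleaveCStrings(const std::string &content) {
  std::vector<CStringAtom> atoms;
  std::size_t start = 0;
  while (start < content.size()) {
    // No symbols are needed: the NUL terminator ends each Atom.
    std::size_t nul = content.find('\0', start);
    std::size_t end = (nul == std::string::npos) ? content.size() : nul + 1;
    atoms.push_back({start, end - start});
    start = end;
  }
  return atoms;
}
```

Once each literal is its own Atom, identical strings can be coalesced and unreferenced ones stripped independently of their neighbors.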
 
The lld Atom model also requires that a reference to an undefined symbol be
modeled as a Reference to an UndefinedAtom. So the Reader also needs to
create an UndefinedAtom for each undefined symbol in the object file.

Once all Atoms have been created, the second step is to create References
(recall that Atoms are "nodes" and References are "edges"). Most References
are created by looking at the "relocation records" in the object file. If
a function contains a call to "malloc", there is usually a relocation record
specifying the address in the section and the symbol table index. Your
Reader will need to convert the address into an Atom and an offset within
that Atom, and the symbol table index into a target Atom. If "malloc" is not
defined in the object file, the target Atom of the Reference will be an
UndefinedAtom.
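The conversion from relocation records to References could be sketched as below. All types and function names are hypothetical. The sketch assumes the section's Atoms are sorted by offset and do not overlap, and that the Reader has already built a ``symbolToAtom`` table mapping each symbol table index to an index in a file-wide Atom table (with undefined symbols mapped to their UndefinedAtom entries).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct AtomRange {  // one Atom carved out of a section
  uint64_t offset;
  uint64_t length;
};

struct Relocation { // simplified relocation record
  uint64_t address; // offset within the section
  uint32_t symbolIndex;
};

struct Reference {  // an edge of the Atom graph
  std::size_t sourceAtom;   // index into the section's Atoms
  uint64_t offsetInAtom;
  std::size_t targetAtom;   // index into the file-wide Atom table
};

// Finds the Atom containing `address` by binary search.
std::size_t findAtomIndex(const std::vector<AtomRange> &atoms,
                          uint64_t address) {
  auto it = std::upper_bound(
      atoms.begin(), atoms.end(), address,
      [](uint64_t addr, const AtomRange &atom) { return addr < atom.offset; });
  assert(it != atoms.begin() && "address precedes the first atom");
  std::size_t index = static_cast<std::size_t>(it - atoms.begin()) - 1;
  assert(address - atoms[index].offset < atoms[index].length);
  return index;
}

std::vector<Reference>
makeReferences(const std::vector<AtomRange> &atoms,
               const std::vector<Relocation> &relocs,
               const std::vector<std::size_t> &symbolToAtom) {
  std::vector<Reference> refs;
  refs.reserve(relocs.size());
  for (const Relocation &reloc : relocs) {
    std::size_t source = findAtomIndex(atoms, reloc.address);
    refs.push_back({source, reloc.address - atoms[source].offset,
                    symbolToAtom.at(reloc.symbolIndex)});
  }
  return refs;
}
```

With this table-driven approach a call to an undefined "malloc" needs no special case: its symbol index simply maps to the UndefinedAtom created in the first step.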
 
Performance
-----------

Once you have the above working to parse an object file into Atoms and
References, you'll want to look at performance.  Some techniques that can
help performance are:

* Use llvm::BumpPtrAllocator or pre-allocate one big vector<Reference> and then
  just have each atom point to its subrange of References in that vector.
  This can be faster than allocating each Reference as a separate object.
* Pre-scan the symbol table and determine how many atoms are in each section,
  then allocate space for all the Atom objects at once.
* Don't copy symbol names or section content to each Atom. Instead, use
  StringRef and ArrayRef in each Atom to point to its name and content in the
  MemoryBuffer.
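The first two techniques can be sketched together: count first, allocate once, and let each Atom refer to a subrange of one shared vector. The types and the ``buildFile`` function are hypothetical, and the sketch stores index ranges rather than pointers so that the subranges remain valid if the vector is moved.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Reference {
  uint64_t offsetInAtom;
  std::size_t targetAtom;
};

// Each Atom records only where its References sit in the shared vector.
struct Atom {
  std::size_t firstRef = 0;
  std::size_t refCount = 0;
};

struct File {
  std::vector<Atom> atoms;
  std::vector<Reference> references; // one allocation for the whole File

  const Reference *refsBegin(const Atom &atom) const {
    return references.data() + atom.firstRef;
  }
  const Reference *refsEnd(const Atom &atom) const {
    return refsBegin(atom) + atom.refCount;
  }
};

// A Reference discovered during parsing, tagged with its source Atom.
struct PendingRef {
  std::size_t atomIndex;
  Reference ref;
};

File buildFile(std::size_t atomCount, const std::vector<PendingRef> &pending) {
  File file;
  file.atoms.resize(atomCount); // all Atom objects allocated at once

  // Pre-scan: count how many References each Atom has.
  for (const PendingRef &p : pending)
    ++file.atoms.at(p.atomIndex).refCount;

  // Turn the counts into starting positions (a running sum).
  std::size_t next = 0;
  for (Atom &atom : file.atoms) {
    atom.firstRef = next;
    next += atom.refCount;
  }

  // Single allocation, then drop each Reference into its Atom's subrange.
  file.references.resize(pending.size());
  std::vector<std::size_t> cursor(atomCount);
  for (std::size_t i = 0; i < atomCount; ++i)
    cursor[i] = file.atoms[i].firstRef;
  for (const PendingRef &p : pending)
    file.references[cursor[p.atomIndex]++] = p.ref;
  return file;
}
```

Iterating an Atom's References is then a walk over contiguous memory from ``refsBegin`` to ``refsEnd``, with no per-Reference heap allocation.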
 
Testing
-------

We are still working on infrastructure to test Readers. The issue is that
you don't want to check in binary files to the test suite. And the tools
for creating your object file from assembly source may not be available on
every OS.

We are investigating a way to use YAML to describe the sections, symbols,
and content of a file, and then have some code write out an object
file from that YAML description.

Once that is in place, you can write test cases that contain section/symbol
YAML and are run through the linker to produce Atom/Reference-based YAML, which
is then run through FileCheck to verify that the Atoms and References are as
expected.