Program Interaction
139 labs???? any%
Program Interaction: Binary Files
Let's start with cat. What is cat?
It is a 64 bit ELF file.
What is an ELF then? ELF is a binary file format that contains a program and its data.
2 Main components
ELF Program Headers - The source of information used when loading a file
Define the segments aka the parts of the file that are loaded into memory.
INTERP: defines the library that should be used to load this ELF into memory
LOAD: defines a part of the file that should be loaded into memory.
ELF Section Headers - Section headers are not a necessary part of the ELF. Segments (defined via the program headers) are needed for loading and operation; section headers are just metadata.
.text: the executable code of your program
.plt and .got: used to resolve and dispatch library calls
.data: used for pre-initialized global writable data (global arrays with initial values)
.rodata: global read-only data (like string constants)
.bss: uninitialized global writable data (global arrays without initial values)
Then there are Symbols - Binaries and libraries that use dynamically loaded libraries rely on symbols (names) to find libraries, resolve function calls into those libraries etc.
How do we interact with ELF?
gcc - to make your elf
readelf - to parse the header
objdump - to parse the ELF header and disassemble the machine code
nm - to view ELF symbols
patchelf - to change some ELF properties (such as the interpreter or the RPATH)
objcopy - to swap out ELF sections
strip - to remove otherwise-helpful information (such as symbols)
kaitai struct to look through ELF interactively
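As a rough illustration of what these tools read first, here is a minimal Python sketch that decodes the start of an ELF file (the e_ident bytes). The function names are made up for this example, and only the magic, the class byte, and the data-encoding byte are decoded.

```python
import struct

ELF_MAGIC = b"\x7fELF"  # the four magic bytes at offset 0 of every ELF file


def parse_elf_ident(data: bytes) -> dict:
    """Decode the magic, class and endianness from the start of an ELF file."""
    if len(data) < 6 or data[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    # Byte 4 is EI_CLASS (1 = 32-bit, 2 = 64-bit); byte 5 is EI_DATA (1 = LSB, 2 = MSB).
    ei_class, ei_data = struct.unpack_from("BB", data, 4)
    return {
        "bits": {1: 32, 2: 64}.get(ei_class),
        "endianness": {1: "little", 2: "big"}.get(ei_data),
    }


def describe_file(path: str) -> dict:
    """Hypothetical helper: read the first 16 bytes of a file and decode them."""
    with open(path, "rb") as f:
        return parse_elf_ident(f.read(16))
```

On a typical 64-bit Linux system, describe_file("/bin/cat") would be expected to report a 64-bit ELF, matching what readelf -h shows.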
Linux Process Loading
Let's take /bin/cat
When we run cat /flag
A process is created
Cat is loaded
Cat is initialized
Cat is launched
Cat reads its arguments and environment
Cat does its thing
Cat terminates
Portrait of a process
Every Linux process has:
state (running, waiting, stopped, zombie)
priority (and other scheduling information)
parents, siblings, children
shared resources (files, pipes, sockets)
virtual memory space
security context
effective uid and gid
saved uid and gid
capabilities
Where do these processes come from, though? fork and (more recently) clone are system calls that create a nearly exact copy of the calling process: a parent and a child. Later, the child process usually uses the execve syscall to replace itself with another program.
Example:
you type /bin/cat in bash
bash forks itself into the old parent process and the child process
the child process execves /bin/cat becoming /bin/cat
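The fork-then-execve pattern from the example above can be sketched in Python, whose os module exposes thin wrappers around these system calls. run_program is a name invented for this sketch; a real shell does much more (job control, redirections, PATH lookup).

```python
import os
import sys


def run_program(argv: list) -> int:
    """Fork, exec argv[0] in the child, and return the child's exit code."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execv(argv[0], argv)
        finally:
            # Only reached if execv failed (e.g. file missing or not executable).
            os._exit(127)
    # Parent: wait for the child, the way a shell does.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)  # negative value: killed by that signal
```

For example, run_program(["/bin/cat", "/flag"]) would mirror typing /bin/cat /flag in bash, assuming both paths exist.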
Can we load? Before anything is loaded, the kernel checks for executable permissions. If a file isn't executable, execve will fail.
But what do we load? To figure this out, the kernel reads the beginning of the file and makes a decision:
If the file starts with #! the kernel extracts the interpreter from the rest of that line and executes this interpreter with the original file as an argument.
This interpreter also totally could be a shell script/alternative binary.
If the file matches a format in /proc/sys/fs/binfmt_misc - the kernel executes the interpreter specifically for that format with the original file as an argument.
If the file is a dynamically-linked ELF, the kernel reads the interpreter/loader defined in the ELF, loads the interpreter and the original file, and lets the interpreter take control
If the file is a statically linked ELF the kernel will load it
Other legacy file formats are checked for
The kernel looks for magic bytes at offset 0 in order to figure out what format to use.
Note: These steps can be recursive (for example, an interpreter may itself be a script).
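To make the decision above concrete, here is a heavily simplified Python sketch of the "look at the first bytes" step. classify_executable is a hypothetical name; it only covers the #! and ELF cases and ignores binfmt_misc, legacy formats, and the static/dynamic distinction.

```python
def classify_executable(head: bytes) -> tuple:
    """Guess how a file would be launched from its first bytes."""
    if head.startswith(b"#!"):
        # Everything up to the first newline names the interpreter (plus an optional argument).
        line = head[2:].split(b"\n", 1)[0].strip()
        interpreter = line.split(None, 1)[0].decode() if line else ""
        return ("script", interpreter)
    if head.startswith(b"\x7fELF"):
        return ("elf", None)
    return ("unknown", None)
```

Feeding it the first line of a shell script, such as b"#!/bin/sh\n", yields ("script", "/bin/sh"), while the first bytes of /bin/cat would be classified as "elf".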
Dynamically linked ELFs: the interpreter
Process loading is done by the ELF interpreter specified in the binary.
This can be overridden
Or even changed permanently
Now the loading process
The program and its interpreter are loaded by the kernel
The interpreter locates the libraries
LD_PRELOAD env variable, and anything in /etc/ld.so.preload
LD_LIBRARY_PATH env variable (can be set in the shell)
DT_RUNPATH or DT_RPATH specified in the binary (both can be modified with patchelf)
system-wide configs (/etc/ld.so.conf)
/lib and /usr/lib
The interpreter loads the libraries
These can depend on other libraries causing more to be loaded
relocations are updated
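The library lookup can be pictured as a first-match search over a list of directories. The sketch below follows the order listed above in a simplified form; find_library and its parameters are invented for illustration, and it leaves out preloads, the ld.so cache, and the finer precedence rules between DT_RPATH and DT_RUNPATH.

```python
import os


def find_library(name, ld_library_path="", runpath=(), defaults=("/lib", "/usr/lib")):
    """Return the first existing path for `name`, searching directories in order."""
    candidates = [d for d in ld_library_path.split(":") if d]  # LD_LIBRARY_PATH entries
    candidates += list(runpath)                                 # paths recorded in the binary
    candidates += list(defaults)                                # system default directories
    for directory in candidates:
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            return path
    return None
```

Because the search stops at the first hit, a directory placed in LD_LIBRARY_PATH can shadow a library of the same name that lives in /usr/lib.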
Where is all this getting loaded to?
Each Linux process has a virtual memory space that contains
the binaries
the libraries
the "heap" (for dynamically allocated memory)
the "stack" (for function local variables)
any memory specifically mapped by the program
some helper regions
kernel code in the "upper half" of memory (above 0x8000000000000000 on 64-bit architectures)
Virtual memory is dedicated to your process. Physical memory is shared among the whole system.
You can see this whole space by looking at /proc/self/maps
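Each line of /proc/self/maps describes one mapped region: an address range, permissions, and (when there is one) the backing file. A small Python sketch for reading it, with hypothetical helper names:

```python
def parse_maps_line(line: str) -> dict:
    """Split one /proc/<pid>/maps line into its address range, permissions and path."""
    # Format: start-end perms offset dev inode [pathname]
    fields = line.split(None, 5)
    start, end = (int(x, 16) for x in fields[0].split("-"))
    return {
        "start": start,
        "end": end,
        "perms": fields[1],
        "path": fields[5].strip() if len(fields) > 5 else "",  # empty for anonymous memory
    }


def read_own_maps() -> list:
    """Parse the memory map of the current process (Linux only)."""
    with open("/proc/self/maps") as f:
        return [parse_maps_line(line) for line in f if line.strip()]
```

Filtering read_own_maps() for paths such as [heap] or [stack] shows where those regions sit in the process's virtual address space.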
The Standard C Library
libc.so is linked by almost every process.
This provides many of the main functionalities
printf()
scanf()
socket()
atoi()
malloc()
free()
Once the binary is loaded, it is initialized (step 3). Using cat as the example again:
Every ELF binary can specify constructors, which are functions that run before the program is actually launched
For example, depending on the version - libc can initialize memory regions for dynamic allocations (malloc/free) when the program launches.
Linux Process Execution
Still focusing on cat, let's move on to the launching step (4)
A normal ELF automatically calls __libc_start_main() in libc which in turn calls the program's main() function.
Step 5: Cat reads its arguments and environment
int main(int argc, char **argv, char **envp);
Your process's entire input from the outside world at launch comprises:
the loaded objects (binaries and libraries)
command-line arguments in argv
"environment" in envp
However, processes need to keep interacting with the outside world.
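The same two launch-time inputs are visible from Python as sys.argv and os.environ. The sketch below starts a child process with chosen arguments and environment and has it print what it received; launch, CHILD, and the GREETING variable are all made up for this example.

```python
import os
import subprocess
import sys

# Tiny child program: print the arguments it was given and one environment variable.
CHILD = "import os, sys; print(sys.argv[1:], os.environ.get('GREETING'))"


def launch(args: list, env: dict) -> str:
    """Run the child with the given argv tail and environment; return its output."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, *args],
        env=env,              # becomes the child's envp
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()
```

Calling launch(["/flag"], dict(os.environ, GREETING="hi")) shows that the child sees exactly the argv and environment its parent handed over at execve time.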
Step 6: Cat does the thing
The binary's imported symbols have to be resolved using the libraries' exported symbols. In the past this was an on-demand process and carried great peril. In modern times this is all done when the binary is loaded, which is much safer.
So, back to it: how does cat interact with the environment? This is primarily done via system calls (syscalls). Each system call is documented in section 2 of the man pages. We can trace the system calls a process makes using strace.
System Calls
System calls have a well-defined interface that rarely changes. While there are over 300 syscalls in Linux, here are some important ones.
Typical syscall combinations:
fork, execve, wait (think: a shell)
open, read, write (cat)
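The open, read, write combination can be sketched as a toy cat in Python, using the os functions that wrap those system calls directly. This is an illustration of the pattern, not how the real cat is implemented.

```python
import os


def cat(path: str, out_fd: int = 1, chunk: int = 4096) -> None:
    """Copy a file to a file descriptor using open/read/write."""
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            data = os.read(fd, chunk)
            if not data:          # an empty read means end of file
                break
            while data:           # write may be partial, so loop until done
                written = os.write(out_fd, data)
                data = data[written:]
    finally:
        os.close(fd)
```

Running a script that calls this function under strace should show the corresponding open (or openat), read, and write calls.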
Signals
System calls are a way for a process to call into the OS. What about the other way around? This is what signals are used for. Some relevant calls:
Signals pause process execution and invoke the handler. Handlers are functions that take one argument: the signal number. Without a handler for a signal, the default action is used (typically, the process is killed). SIGKILL (signal 9) and SIGSTOP (signal 19) cannot be handled.
The full list of signals is in section 7 of the man pages (man 7 signal) and in the output of kill -l.
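A minimal Python sketch of installing a handler and sending the process a signal. Note that Python handlers receive a second argument (the current stack frame) in addition to the signal number; the names handler, send_to_self, and try_install are invented here, and the code assumes it runs in the main thread.

```python
import os
import signal

received = []  # signal numbers seen by the handler


def handler(signum, frame):
    """Record the signal instead of taking the default action."""
    received.append(signum)


def send_to_self(signum=signal.SIGUSR1) -> list:
    """Install the handler, send the signal to this process, then restore the old handler."""
    previous = signal.signal(signum, handler)
    try:
        os.kill(os.getpid(), signum)
    finally:
        signal.signal(signum, previous)
    return received


def try_install(signum) -> bool:
    """Return False if the OS refuses a handler for this signal (e.g. SIGKILL)."""
    try:
        previous = signal.signal(signum, handler)
    except OSError:
        return False
    signal.signal(signum, previous)
    return True
```

try_install(signal.SIGKILL) returns False, which reflects the rule that SIGKILL cannot be handled.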
Shared memory
Another way of interacting with the outside world is by sharing memory with other processes. This requires system calls to establish, but once it is established, communication happens without system calls.
The easy way is to use a shared memory-mapped file in /dev/shm
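As a sketch of the /dev/shm approach, the Python code below maps a temporary file as shared memory, forks, and lets the child write a message that the parent then reads. share_via_mmap is a hypothetical name; the fallback to the default temp directory is an assumption for systems where /dev/shm is missing or not writable.

```python
import mmap
import os
import tempfile

SIZE = 4096  # size of the shared region in bytes


def share_via_mmap(message: bytes) -> bytes:
    """Have a child write `message` into shared memory; return what the parent reads."""
    use_shm = os.path.isdir("/dev/shm") and os.access("/dev/shm", os.W_OK)
    with tempfile.TemporaryFile(dir="/dev/shm" if use_shm else None) as f:
        f.truncate(SIZE)                      # give the file a size so it can be mapped
        shared = mmap.mmap(f.fileno(), SIZE)  # shared mapping by default on Unix
        pid = os.fork()
        if pid == 0:
            shared[:len(message)] = message   # a plain memory write in the child
            os._exit(0)
        os.waitpid(pid, 0)                    # wait only so the read happens after the write
        data = bytes(shared[:len(message)])
        shared.close()
        return data
```

Setting up the mapping takes system calls (open, ftruncate, mmap), but the write in the child and the read in the parent are ordinary memory accesses.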
Process termination
The last thing cat does (step 7): cat terminates
A process terminates in one of two ways:
Receiving an unhandled signal
Calling the exit() system call:
void exit(int status)
All processes must be "reaped"
After termination they will remain in a zombie state until they are wait()ed on by their parent
When this happens their exit code will be returned to the parent and the process will be freed.
If their parent dies without wait()ing on them they are re-parented to PID 1 and will stay there until they're cleaned up.
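The zombie state can be observed directly. In this Python sketch (helper names are made up, and /proc is assumed to be mounted), the parent waits for the child to exit without reaping it, reads the child's state from /proc, and only then collects the exit code.

```python
import os


def proc_state(pid: int) -> str:
    """Return the one-letter state of a process from /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        # The state field follows the parenthesised command name.
        return f.read().rsplit(")", 1)[1].split()[0]


def zombie_demo(exit_code: int = 7) -> tuple:
    """Fork a child that exits at once; return (state before reaping, exit code)."""
    pid = os.fork()
    if pid == 0:
        os._exit(exit_code)
    # Block until the child has exited, but leave it unreaped (WNOWAIT).
    os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)
    state_before_reap = proc_state(pid)
    _, status = os.waitpid(pid, 0)  # reap: the zombie is freed here
    return state_before_reap, os.WEXITSTATUS(status)
```

The state read before the final waitpid should be Z (zombie), and the exit code only reaches the parent once it actually waits.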
BONUS: Unix Shell: The Art of I/O Redirection
An absolutely fantastic writeup by Benjamin Cane going over helpful tips/tricks for I/O redirection within the Linux/Unix shell. Full article can be found here.
Input and Output
There are always 3 streams open
stdin
stdout
stderr
These streams are used for user input and program output within the shell environment. They have reserved file descriptors attached to them, which means we can interact with them from the command line.
stdin - Standard Input
This stream is used to get input for commands from the user keyboard. stdin has the file descriptor of 0 and a file of /dev/stdin
stdout - Standard Output
This stream is used for non-error output from programs. stdout has the file descriptor of 1 and a file of /dev/stdout.
stderr - Standard Error
This stream is used for error output from programs. stderr has a file descriptor of 2 and a file of /dev/stderr
Examples of input and output
Example reference
Running cat without a filename starts cat, which then waits for user input on stdin.
When a program prints errors or diagnostic info, it uses the standard error stream rather than standard output. This keeps a command's errors from mixing with the same command's output.
Redirection
What if we want to take the stdout of one program and write it to a file?
The > is used to redirect to a file.
A single > truncates the file and then writes the redirected data into it (replacing the old contents); if the file doesn't exist, it is created. The > symbol redirects stdout by default; however, you can choose whether to redirect stdout or stderr by putting the appropriate file descriptor in front of the redirect symbol (for example, 2> redirects stderr).
Two > symbols (>>) append the output to a file rather than overwriting the file entirely. Again, a file descriptor can be placed in front to specify whether you want stdout or stderr.
< is used to redirect input from a file instead
It can be used to redirect stdin from a file to a program
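Under the hood, an output redirection amounts to opening the file and placing it on the chosen file descriptor before the program starts. The Python sketch below imitates cmd > file, cmd >> file, and cmd 2> file with fork, dup2, and exec; run_redirected is an invented name, and real shells handle many more cases.

```python
import os
import sys


def run_redirected(argv: list, path: str, fd: int = 1, append: bool = False) -> int:
    """Run argv with file descriptor `fd` (1 = stdout, 2 = stderr) sent to `path`."""
    # '>' truncates the file, '>>' appends; both create the file if it is missing.
    flags = os.O_WRONLY | os.O_CREAT | (os.O_APPEND if append else os.O_TRUNC)
    pid = os.fork()
    if pid == 0:
        try:
            target = os.open(path, flags, 0o644)
            if target != fd:
                os.dup2(target, fd)  # make `fd` refer to the file
                os.close(target)
            os.execv(argv[0], argv)
        finally:
            os._exit(127)            # only reached if something above failed
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status)
```

The program being run never learns about the redirection; it simply writes to descriptor 1 or 2 as usual.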
Piping from one process to another
The | pipe redirects stdout from one command to stdin of another. The pipe is not considered a redirection operator but rather a control operator; while the pipe is used to redirect output, not all control operators have this function.
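What the shell does for left | right can be approximated in Python with the subprocess module: the first process's stdout is connected to the second process's stdin. pipeline is a hypothetical helper name.

```python
import subprocess
import sys


def pipeline(left: list, right: list) -> bytes:
    """Run `left | right` and return what `right` writes to stdout."""
    producer = subprocess.Popen(left, stdout=subprocess.PIPE)
    consumer = subprocess.Popen(right, stdin=producer.stdout, stdout=subprocess.PIPE)
    # Close our copy of the read end so the consumer sees EOF when the producer exits.
    producer.stdout.close()
    output, _ = consumer.communicate()
    producer.wait()
    return output
```

Only stdout travels through the pipe; anything the first command writes to stderr still goes to the terminal.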
We can also write to a file and stdout using tee
tee is neither a control operator nor a redirection operator; it is a command that takes the stdin given to it and writes it both to a specified file and to stdout. The -a flag appends the output rather than overwriting the file.
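The behaviour of tee is simple enough to sketch in a few lines of Python. This toy version (the function signature is made up) copies a binary input stream to both a file and an output stream, with append playing the role of the -a flag.

```python
def tee(source, sink, path: str, append: bool = False, chunk: int = 4096) -> None:
    """Copy everything from `source` to both `sink` and the file at `path`."""
    mode = "ab" if append else "wb"  # append like `tee -a`, otherwise overwrite
    with open(path, mode) as f:
        while True:
            data = source.read(chunk)
            if not data:             # end of input
                break
            f.write(data)
            sink.write(data)
```

To use it like the real command, one could call tee(sys.stdin.buffer, sys.stdout.buffer, "out.txt") after importing sys.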