main method:

    int main(int argc, char* argv[]) { /* ... */ }

Wikipedia tells me that "Unix (though not POSIX.1) and Windows have a third argument giving the program's environment", in which case it takes the following form:

    int main(int argc, char *argv[], char *envp[]) { /* ... */ }

See libc_start_main.c at lines 58-67 for the mechanics of how the signature is varied between the two forms. Depending on which form is implemented, the stack frame would look different, as it would have a third argument pushed onto it. However, we're not going to rely on the glibc code to invoke a main method; we're going to implement a global function called _start and see what we're given: this is at a level below what you'd see in a C-style main
function.

The story with x86 code

The second form of the main function appears unusual for a programmer who is used to writing software in Java, but it gives an indication as to how a program is initialised and receives its program arguments. If you do a search for "System V ABI i386" you'll no doubt find a document whose copyright is asserted by both the "Santa Cruz Operation, Inc" (there's a bogeyman acronym there somewhere) and AT&T. I've been looking at the fourth edition, and on page 54, section 3-28, it shows the initial process stack. A helpful diagram shows something a bit like this:

Contents | Address |
Unspecified | High address |
Information block, including argument strings, environment strings, auxiliary information ... (size varies) | |
Unspecified | |
Null auxiliary vector entry | |
Auxiliary vector ... (2-word entries) | |
Null word | |
Environment pointers ... (one word each) | |
Null word | |
Argument pointers ... (argc words) | 4(%esp) |
Argument count (argc) | 0(%esp) |

In the document the term word refers to a 32-bit value, which unhelpfully collides with what you come to accept as a "word" if you do any assembly programming (i.e. you expect it to refer to a 16-bit value).

Just to ensure that everyone's on the same page, it's worth mentioning that the stack starts at a high address and "grows" downwards. As you add a stack frame for nested function invocations, the value of the stack pointer in %esp (or %rsp for 64-bit code) decreases.

I'll quote directly from the document at this point:
"Argument strings, environment strings, and the auxiliary information appear in no specific order within the information block; the system makes no guarantees about their arrangement. The system also may leave an unspecified amount of memory between the null auxiliary vector entry and the beginning of the information block."

The ABI goes into some detail about the structure of the auxiliary vector entries (each is an 8-byte structure containing a type and a value or pointer).
On the basis that a picture is worth a thousand words, I'm going to borrow another diagram from the document (hopefully not so slavishly that SCO sue me):
Address | Contents | Note |
 | n  :  \0  pad | High address |
 | r  /  b  i | |
 | =  /  u  s | |
0x8047ff0 | P  A  T  H | |
 | d  i  r  \0 | |
 | o  m  e  / | |
 | E  =  /  h | |
0x8047fe0 | \0  H  O  M | |
0x8047fdc | \0  a  b  i | |
0x8047fd8 | e  c  h  o | |
 | 0 | |
0x8047fd0 | 0 | |
 | 13 | |
0x8047fc8 | 2 | Auxiliary vector |
 | 0 | |
 | 0x8047ff0 | |
0x8047fbc | 0x8047fe1 | Environment vector |
 | 0 | |
 | 0x8047fdd | |
0x8047fb0 | 0x8047fd8 | Argument vector |
0(%esp), 0x8047fac | 2 | Argument count |
 | Undefined | Low address |

I hope this helps, it took ages to type it in! You can see how the two argument pointers point to the first byte of their argument values, which are null-terminated. Accessing the environment variables involves accounting for the argc value and the null word between the argument vector and the environment vector, and multiplying the number of arguments by their width. We can write this using indexed addressing, which in 32-bit code (each pointer being four bytes) is

    8(%esp, argc, 4)

or slightly more readably

    8 + %esp + (argc * 4)

where argc stands for a register holding the argument count, and the 8 covers the argc word itself plus the null word. In this example, that makes

    8 + 0x8047fac + 8

which is 0x8047fbc.

The program arguments in this example were echo and abi and the environment variables are HOME=/home/dir and PATH=/usr/bin:. Interestingly, without the benefit of glibc start-up code, it's necessary to scan through the environment vector until the null word is detected in order to access the auxiliary vector.

As an aside, it seems that glibc initialises the hidden variables __libc_argc, __libc_argv and the readable __environ in the _init function. I don't know from where they're later accessed, but they're declared as follows:

    /* Remember the command line argument and enviroment contents for
       later calls of initializers for dynamic libraries. */
    int __libc_argc attribute_hidden;
    char **__libc_argv attribute_hidden;

The story with x86_64 code

Given the different function calling convention with x86_64 defined in the x86_64 System V ABI, when I went to convert a simple 32-bit piece of assembly to do the same using 64-bit instructions and registers, I was somewhat surprised to find that the arguments to the _start function were not passed in the registers %rdi, %rsi and %rdx, but are passed in the same way as for 32-bit code.

They provide the following table in the PDF:

Purpose | Start Address | Length |
Unspecified | High Address | |
Information block, including argument strings, environment strings, auxiliary information... | | varies |
Unspecified | | |
Null auxiliary vector entry | | 1 eightbyte |
Auxiliary vector entries... | | 2 eightbytes each |
0 | | eightbyte |
Environment pointers... | | 1 eightbyte each |
0 | 8+8*argc+%rsp | eightbyte |
Argument pointers | 8+%rsp | argc eightbytes |
Argument count | %rsp | eightbyte |
Unspecified | Low Address | |

In other words, allowing for the difference in size between a pointer on the two different architectures, the initial process stack is the same on both. It's no surprise that process arguments, environment variables and auxiliary vector are stored there. It took some thought to realise why argc and the pointer to argv[0] are passed on the stack, rather than %rdi and %rsi: permanence. If the values were passed in the registers, they would be ephemeral at best.

Finally, some assembly

Anyway, here's one I made earlier. It prints the program arguments in order using the standard library's printf function, before printing each environment variable the same way in pointer-order.

envvars.s
-------------------------------8<-------------------------------
        .section .data

argc_str:
        .asciz  "argc: %d\n"
argv_str:
        .asciz  "argv[%d]: %s\n"
env_str:
        .asciz  "env: %s\n"

        .section .text
        .globl  _start

_start:
        #
        # Application prologue. See page 29 in the System V 64-bit ABI
        #
        movq    %rsp, %rbp              # Store the stack-pointer in RBP

        #
        # Print the number of command-line arguments
        #
        # Function:
        #   printf
        # Parameters:
        #   RDI: address of the format-string $argc_str
        #   RSI: the number of arguments passed to this program
        #   AL:  the number of vector registers used by this varargs call
        # Returns:
        #   int (ignored)
        movq    $argc_str, %rdi         # Store the address of the format-string in RDI
        movq    (%rbp), %rsi            # Store the cmd-line arg-count in RSI, by dereferencing RBP
        movq    $0x0, %rax              # No vector registers are used, so AL is zero
        call    printf                  # Invoke the standard library's printf function

        #
        # Print each command-line argument
        #
        movq    (%rbp), %rcx            # Store the argument count in counter register RCX
        movq    %rcx, %r12              # Copy that value to the (protected) R12 register

.Lprintarg:
        movq    %rcx, %rbx              # Copy the count value to protected register RBX

        # Call function:
        #   printf
        # Parameters:
        #   RDI: address of the format-string "argv[%d]: %s\n"
        #   RSI: 1st value for conversion: index-count
        #   RDX: 2nd value for conversion: address of cmd-line arg
        #   AL:  number of vector registers used by the varargs section of the call
        # Returns:
        #   int (ignored)
        movq    $argv_str, %rdi         # Store the address of the format-string in RDI
        movq    %r12, %rsi              # Calculate the index value in RSI
        subq    %rcx, %rsi              # Subtract counter from arg-count to get the index
        # Calculate the pointer's address and store it in RDX
        leaq    0x8(%rbp, %rsi, 0x8), %rdx
        movq    (%rdx), %rdx            # Dereference that pointer to get the parameter's address
        movq    $0x0, %rax              # No vector registers are used, so AL is zero
        call    printf                  # Invoke printf
        movq    %rbx, %rcx              # Restore the counter from the protected RBX register
        loop    .Lprintarg              # Decrement RCX and loop again if not zero

        #
        # Print each environment variable
        #
.Largsfinished:
        # Calculate the base address of the env-vars, which is:
        #   %rbp + (8 * argc) + 16
        leaq    0x10(%rbp, %r12, 0x8), %r12
        testq   $-0x01, (%r12)          # Test the word R12 points at against -1 to find a zero-value (footnote 1)
        jz      .Lexit

.Lprintenv:
        movq    $env_str, %rdi          # Store address of the format-string in RDI
        movq    (%r12), %rsi            # Store pointer to env-var in RSI
        movq    $0x0, %rax              # No vector registers are used, so AL is zero
        call    printf
        addq    $0x8, %r12              # Step to the next pointer
        testq   $-0x1, (%r12)           # Test to see whether it is zero (footnote 1)
        jne     .Lprintenv              # Jump if not zero to print the next variable

.Lexit:
        movq    $0x0, %rdi              # NULL argument: flush every open output stream
        call    fflush                  # sys_exit skips stdio's cleanup, so flush buffered output first
        movq    $0x3C, %rax             # index of sys_exit
        movq    $0x0, %rdi              # exit status
        syscall
-------------------------------8<-------------------------------