# CodeCut - Detecting Object File Boundaries in radare2

r2codecut is a port of [CodeCut](https://github.com/JHUAPL/CodeCut/) to [radare2](https://github.com/radareorg/radare2).

## Theory of Operation

CodeCut attempts to identify object file boundaries (modules) in binary files. It then attempts to name each module using [r2magicstrings](https://github.com/FernandoDoming/r2magicstrings).

## Dependencies

r2codecut relies on [radare2](https://github.com/radareorg/radare2), which must be installed on the system.
Please refer to [radare2's documentation](https://github.com/radareorg/radare2) for installation instructions.

## Installation

You can install the package from PyPI with:
```
pip install r2codecut
```

## Usage

```
$ r2codecut -h
usage: r2codecut [-h] [-v] [-vv] filepath

positional arguments:
  filepath    File to analyze

options:
  -h, --help  show this help message and exit
  -v          Verbose output (INFO)
  -vv         Increase verbosity (DEBUG)
```

Example execution:
```
$ r2codecut ~/examples/ls 
INFO: Analyze all flags starting with sym. and entry0 (aa)
INFO: Analyze function calls (aac)
INFO: Analyze len bytes of instructions for references (aar)
INFO: Finding and parsing C++ vtables (avrr)
INFO: Type matching analysis for all functions (aaft)
INFO: Propagate noreturn information (aanr)
INFO: Scanning for strings constructed in code (/azs)
INFO: Finding function preludes (aap)
INFO: Enable anal.types.constraint for experimental type propagation
[+] Identified 10 modules in /home/fdd/malware/ls
        src/ls.c - Start: 0x402a00, End: 0x404e2f
        src/ls.c - Start: 0x404e30, End: 0x4053ef
        Module 0x004053f0:0x00407f7f - Start: 0x4053f0, End: 0x407f7f
        Module 0x00407f80:0x0040ad6f - Start: 0x407f80, End: 0x40ad6f
        Module 0x0040ad70:0x0040d22f - Start: 0x40ad70, End: 0x40d22f
        Module 0x0040d230:0x0040f3ff - Start: 0x40d230, End: 0x40f3ff
        Module 0x0040f400:0x0041181f - Start: 0x40f400, End: 0x41181f
        lib/xstrtol.c - Start: 0x411820, End: 0x41264f
        Module 0x00412650:0x0041295f - Start: 0x412650, End: 0x41295f
        Module 0x00412960:0x004134c6 - Start: 0x412960, End: 0x4134c6
```
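If you want to post-process the results, the module lines printed above can be parsed with a short script. This is a minimal sketch that assumes the output format stays exactly as in the example; `parse_modules` and `MODULE_LINE` are hypothetical names and are not part of r2codecut. It also assumes `End` is inclusive, as the consecutive addresses in the example suggest.

```python
import re

# Matches lines such as "src/ls.c - Start: 0x402a00, End: 0x404e2f".
# The format is taken from the example output; it is an assumption that it is stable.
MODULE_LINE = re.compile(
    r"^\s*(?P<name>.+?) - Start: (?P<start>0x[0-9a-fA-F]+), "
    r"End: (?P<end>0x[0-9a-fA-F]+)\s*$"
)


def parse_modules(output):
    """Return a list of (name, start, end) tuples, one per module line."""
    modules = []
    for line in output.splitlines():
        match = MODULE_LINE.match(line)
        if match:
            modules.append(
                (match["name"], int(match["start"], 16), int(match["end"], 16))
            )
    return modules


sample = """[+] Identified 10 modules in /home/fdd/malware/ls
        src/ls.c - Start: 0x402a00, End: 0x404e2f
        Module 0x004053f0:0x00407f7f - Start: 0x4053f0, End: 0x407f7f
"""

for name, start, end in parse_modules(sample):
    # End is treated as inclusive, hence the +1.
    print(f"{name}: {end - start + 1:#x} bytes")
```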


## LFA Parameters & Interpolation

A few areas for further research:

  - The idea behind LFA is that we throw out "external" calls. We can't
    determine this exactly in a binary, so we throw out calls whose distance
    is above a certain threshold.  This is set to 4K in the code but it could
    be tweaked.

  - There is a threshold set for edge detection - plus a little bit of extra
    logic (value has to be positive and 2 of last 3 values were negative). You
    can either vary this threshold or write your own edge_detect() function.

  - Currently "calls to" affinity and "calls from" affinity are treated as
    separate scores.  If one of these scores is zero, an interpolation from
    the previous score is used - just a simple linear equation assuming
    decreasing scores.  This could be improved in a number of ways, for
    example by replacing it with an actual interpolation between scores.

  - If both "calls to" affinity and "calls from" affinity for a function are 0
    the function is skipped and is essentially treated like it's not there.
    This happens for functions with no references or where all references are
    above the "external" threshold.  This means there can be gaps between the
    modules in the output list.

  - The portion of code that tries to name object files based on common strings
    is completely researchy and open ended. Lots of things to play with there.
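To make the first two points concrete, here is a minimal sketch of the "external" call filter and of an edge detector with the extra logic described above. It is not the code shipped in r2codecut: `EXTERNAL_THRESHOLD`, `EDGE_THRESHOLD`, and `is_external` are hypothetical names, the edge threshold value is a placeholder, and applying the threshold to the jump from the previous score is an assumption.

```python
EXTERNAL_THRESHOLD = 0x1000  # 4K, the value mentioned above
EDGE_THRESHOLD = 0.5         # placeholder value, chosen only for illustration


def is_external(caller_addr, callee_addr, threshold=EXTERNAL_THRESHOLD):
    """Treat a call as "external" when it spans more than `threshold` bytes."""
    return abs(callee_addr - caller_addr) > threshold


def edge_detect(scores, index, threshold=EDGE_THRESHOLD):
    """Return True if scores[index] looks like the start of a new module.

    Conditions: the current score is positive, at least 2 of the 3 previous
    scores are negative, and (assumption) the jump from the previous score
    exceeds the threshold.
    """
    if index < 3:
        return False  # not enough history to look at the last 3 values
    current = scores[index]
    negatives = sum(1 for s in scores[index - 3:index] if s < 0)
    jump = current - scores[index - 1]
    return current > 0 and negatives >= 2 and jump > threshold


scores = [-0.4, -0.2, 0.1, 0.9]
print(edge_detect(scores, 3))
```

Swapping in a different `edge_detect()` only requires keeping the same inputs (the score history and the position being tested), which makes it easy to experiment with other rules.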

## MaxCut Parameters & Interpolation

  - The only real parameter for MaxCut is a THRESHOLD variable that corresponds to the size at which the algorithm will stop subdividing modules.  A threshold of 4K (0x1000) seems to produce modules of a similar size to those from LFA.  A threshold of 8K (0x2000) seems to be a good upper bound.  A good area of research would be to replace this static cutoff with a decision to stop subdividing based on a connectedness measurement or something along those lines.
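
The role of THRESHOLD as a stopping rule can be sketched as follows. This only illustrates the size cutoff: `subdivide` and `midpoint` are hypothetical names, and `midpoint` is a stand-in for the real cut selection, which is not reproduced here. Ranges are half-open, `[start, end)`.

```python
THRESHOLD = 0x1000  # 4K, the value suggested above


def subdivide(start, end, choose_cut, threshold=THRESHOLD):
    """Recursively split [start, end) until no piece is larger than `threshold`."""
    if end - start <= threshold:
        return [(start, end)]  # small enough: stop subdividing
    cut = choose_cut(start, end)
    if not start < cut < end:
        return [(start, end)]  # no usable cut point: keep the range whole
    return (subdivide(start, cut, choose_cut, threshold)
            + subdivide(cut, end, choose_cut, threshold))


def midpoint(start, end):
    # Stand-in only: a real implementation would pick the cut from the call graph.
    return (start + end) // 2


modules = subdivide(0x402a00, 0x4134c7, midpoint)
for start, end in modules[:3]:
    print(f"Start: {start:#x}, End: {end - 1:#x}")
```

A connectedness-based stopping rule, as suggested above, would replace the size check on the first line of `subdivide` with a test on the candidate module itself.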

