AMPLEC - 2: Basics of the project

Table of Contents

This is the second post in a series of posts about AMPLEC, my most recent project. See more details about the project and the previous posts here.

The why behind the project is covered in part 1, Why would you even want that?.

Now, to dive right into it:

General Structure
#

The project is dependent on the access to an LLM (Large Language Model) and the access to an API of the Malware Analysis Pipeline (which in this case is KARTON).

flowchart TD
    %% Nodes
    A[Malware 
    Analysis Pipeline]
    B[LLM]
    C[AMPLEC]

    %% Edges (with labels)
    A -->|Analysis
    Results| C
    B -->|Answer: 
    Text / FunctionCall| C
    C -->|Prompt & Data| B
    C -->|Analysis
    Request| A

The LLM will be used to

Generate the beautiful responses to the user, including all the data that is needed to be shown to the user.
use a function call to amplec to retrieve the data from the Malware Analysis Pipeline (KARTON) to be able to display the data to the user.

But, the real magic happens in the core-module of the AMPLEC project. This module is responsible for the following:

Prompting the LLM with the data that is needed to be shown to the user.
serving the FunctionCall to the LLM.
Preprocessing, Cleaning, Naturalizing and Enriching the data that is retrieved from the Malware Analysis Pipeline (KARTON).

The Flow of Data
#

To make the ins and outs of the project a bit clearer, in this section I will illustrate the mode of operation by showing the flow of data through the project, from the generation of the data during the malware analysis in the Malware Analysis Pipeline (KARTON) to the final response to the user.

flowchart TD
    %% Nodes
    A[Malware 
    Analysis Pipeline]
    B[AMPLEC]
    C[LLM]
    D[User]

    %% Edges (with labels)
    A -->|Analysis
    Results| B
    B -->|Result of FunctionCall| C
    C -->|FunctionCall| B
    C -->|Response| D
    C -->|Prompt| D

Data Processing
#

Inside the AMPLEC project, the data is processed in the following way:

Data is retrieved from the pipeline (KARTON).
The data gets preprocessed and cleaned, this includes:
- Removing unnecessary data
- Formatting the data
- Polling further parts of the data only referenced in the result from the pipeline
The data is then naturalized (I’ll do a deepdive on this technique in the next post, so stay tuned), which is a process of converting the data into human-readable text.
In its naturalized form, the data is enriched with additional information, such as:
- Names and descriptions of TTPs (Tactics Techniques Procedures) mentioned in the data (current)
- contextual information like scores or tags for IoCs (Indicators of Compromise) like hashes, URLs, etc. (planned)
The data is then made accessible to the LLM via a function call.
The LLM generates a response for the user based on the result of the function call and the user prompt.

Tech Stack
#

To clarify what AMPLEC is built on, I’ll list the frameworks and tools, as well as the different LLMs and the Malware Pipeline used.

Language & Frameworks
#

Amplec is developed purely in Python (3.12 to be specific). For the interaction with LLMs, especially tool-calling, Langchain was used. To serve the API which the core module offers, I chose Flask and paired it with the WSGI server Waitress. For the UI I used Streamlit, which lets you build interactive UIs in Python without hand-writing HTML or CSS.

LLM back-end(s)
#

In the development process, for this project, I tested different LLMs. While the closed source, commercial models such as OpenAIs ChatGPT performed significantly better, I still needed to evaluate the project based on an open-source, really lightweight model that could realistically be run locally with solid performance, which is why, after careful statistic evaluation of different models, I chose llama3.2:3b as my main contestant.

Malware-pipeline interface
#

The Malware Analysis Pipeline in question, as mentioned above is Karton, developed by CERT Polska. However, we are not running a completely vanilla instance of this framework and modified the way analysis results are saved quite heavily.

Karton runs a Task based system, so far so normal, in our case though we push each result of each analysis and then built an API to consolidate all the results and then send them out. This is not part of the AMPLEC project itself, but it is how we get to our malware analysis data.

General Structure #

The Flow of Data #

Data Processing #

Tech Stack #

Language & Frameworks #

LLM back-end(s) #

Malware-pipeline interface #