This is the first post in a series about AMPLEC, my most recent project. See more details about the project in this article:
Now, to get to the point:
Why would you even want to have something like AMPLEC? #
Well, that’s a good question. First of all, I should clarify why I felt the need to create this project in the first place.
A short story about Karton #
I was working on the Threat Intel team at Deutsche Telekom, where we used a tool called Karton to analyze malware. Karton is a distributed malware processing framework based on Python, Redis and S3 that lets you build your own analysis steps and pipelines. So what makes Karton unique in a world of preexisting pipelining frameworks? It is designed to never have a static or hardcoded flow of analysis. Instead, every karton (an individual analysis service) is simply programmed to take in all the entities in the pipeline that match certain filters. This also means the pipeline can technically “start” with more than one step.
This is a very powerful concept, because it allows you to build a pipeline that is not only flexible but can also, by its nature, react dynamically to the results of previous steps.
However, I would not go into this much detail if this design did not come with a downside.
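To make the filter idea concrete, here is a minimal toy model in plain Python. It is not the real Karton API: the names `Consumer`, `matches`, `run_pipeline` and the header values are assumptions made up for illustration, and the matching rule (a task is accepted if any one filter is fully contained in its headers) is a simplification.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable

# Toy model only: all names below are hypothetical, not part of karton-core.
Headers = dict[str, str]


def matches(filters: list[Headers], headers: Headers) -> bool:
    """True if at least one filter is fully contained in the task headers."""
    return any(
        all(headers.get(key) == value for key, value in flt.items())
        for flt in filters
    )


@dataclass
class Consumer:
    name: str
    filters: list[Headers]
    handler: Callable[[Headers], list[Headers]]  # returns follow-up tasks


def classify(headers: Headers) -> list[Headers]:
    # Emit a new task whose headers depend on what was found.
    kind = "script" if headers["name"].endswith(".js") else "executable"
    return [{"type": "sample", "stage": "recognized", "kind": kind}]


def no_followup(headers: Headers) -> list[Headers]:
    return []


CONSUMERS = [
    Consumer("classifier", [{"type": "sample", "stage": "new"}], classify),
    Consumer("hasher", [{"type": "sample", "stage": "new"}], no_followup),
    Consumer("script-analyzer", [{"stage": "recognized", "kind": "script"}], no_followup),
    Consumer("pe-analyzer", [{"stage": "recognized", "kind": "executable"}], no_followup),
]


def run_pipeline(consumers: list[Consumer], initial_tasks: list[Headers]) -> list[str]:
    """Route every task to all consumers whose filters match; record who ran."""
    queue = deque(initial_tasks)
    trace = []
    while queue:
        headers = queue.popleft()
        for consumer in consumers:
            if matches(consumer.filters, headers):
                trace.append(consumer.name)
                queue.extend(consumer.handler(headers))
    return trace


print(run_pipeline(CONSUMERS, [{"type": "sample", "stage": "new", "name": "dropper.js"}]))
# ['classifier', 'hasher', 'script-analyzer']
```

Note that no step names its successor. Two consumers pick up the new sample at once, and which analyzer runs next depends only on the headers the classifier emits.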
The initial problem #
The downside of Karton is that the results of the analysis are as dynamic and unpredictable as the pipeline itself. You cannot expect the results to be in a certain format, or even to contain certain information, and yet they are still larger and more verbose than a human analyst has the time and patience to dive into.
On top of that, many different team members can write kartons. And this is actually encouraged, as it allows each person to focus on their area of expertise. However, this often leads to analysis results that are not structured uniformly. While this isn’t necessarily a problem in itself, it does make it even harder to extract the information you need.
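As a small illustration of that last point, consider two made-up result fragments that report the same indicator in different shapes. The field names, the nesting and the helper `find_key` are all assumptions for this sketch and are not taken from any real karton.

```python
from typing import Any, Iterator

# Hypothetical outputs of two kartons written by different people.
result_a = {"config": {"family": "examplebot", "c2": ["198.51.100.7:443"]}}
result_b = {"analysis": [{"tool": "strings", "findings": {"C2": "198.51.100.7:443"}}]}


def find_key(node: Any, key: str) -> Iterator[Any]:
    """Yield every value stored under `key` (case-insensitive), at any depth."""
    if isinstance(node, dict):
        for name, value in node.items():
            if name.lower() == key.lower():
                yield value
            yield from find_key(value, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


print(list(find_key(result_a, "c2")))  # [['198.51.100.7:443']]
print(list(find_key(result_b, "c2")))  # ['198.51.100.7:443']
```

Even with a generic search like this, one result yields a list and the other a bare string, and the search only works if you already know what the field is called. Every new karton can add another variation.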
This is where I started to think about how I could make the results of the analysis more accessible and more efficiently digestible for analysts and, by extension, for consumers. From there came the idea that an LLM could help by extracting and describing the information an analyst wants from these giant results. And so the idea for AMPLEC was born.
First Big Setback and a New Direction #
When I first tested the AMPLEC concept using ChatGPT, everything worked surprisingly well. However, I knew the tool would need to run locally to address privacy and security concerns. When I tried smaller, locally hosted LLMs (like Llama3.2 7B), I began to suspect I had created more of a “hallucination machine” than a reliable malware analysis tool. This realization made me question the feasibility of my project.
Recognizing this issue, I decided to refocus my efforts. Instead of building yet another ChatGPT wrapper, I shifted my attention to solving the specific problems that arise when working with large datasets and small LLMs, particularly ones that can run on any standard machine.
This also means that AMPLEC is now not only a malware analysis tool but also a collection of lessons learned and a few new ideas about how to work with small LLMs and large, complex, domain-specific datasets.
Now, do I need it? #
Well, I think that depends greatly on your use case. If you already use Karton and are looking for a way to integrate an LLM into the pipeline as a data access point, then AMPLEC is definitely something you should take a look at. Further, if you are combining a large dataset with a small LLM and run into similar issues as I did, then you might find some useful ideas in the project as well.
Especially if you struggle with limited domain knowledge of your LLM, which is a common problem with smaller models, you might find AMPLEC and the ideas behind it useful.
In the following posts of this series, I will go into much greater detail about the project and the ideas behind it. I will share the lessons learned, the challenges I faced, and the statistics of the different LLMs I tested.