Project Log: Day 17


It’s been a hell of a week so far. A good one. Now I have the weekend in front of me, and as I’m a bit tired, I’ll slow down a bit. Or maybe next week, instead.

Regardless, I need to plan what’s next.

Upcoming objectives

  • Applicability 1: In the network security world, sometimes you care about certain assets far more than anything else (your Intellectual Property database, say). It seems like a valid use-case, and… I think I can optimize for that too, i.e. minimize the risk of infection of a few key nodes, regardless of the rest. It’s simply a matter of choosing another fitness measure: instead of overall infection prevalence, I minimize the average infection of those *key nodes*. I call it “Protecting the Crown Jewels”. And although the goal is slightly different, my program will work… basically the same way! I’m just optimizing a different function, that’s all (see the sketch after this list).
  • Applicability 2: This one I know will be harder to implement (maybe a lot of refactoring, in fact :S), but it’s valuable given my real day job (this project, as much as it pains me, has to stay second to my job, for obvious reasons…). As it turns out, a lot of my current work revolves around inventories, and one question I keep asking myself is: “What if I don’t know about a subset of my IT network? How bad is that?” That’s what I call the “unknown-unknown risk”. Well… I can modify my simulator to account for that too! IN FACT, I’m thinking about how to allocate a budget across protection, detection AND INVENTORY, and see the impact of that split (the sketch after this list hints at this part too). If you’re familiar with the NIST Framework, hopefully you’ll see the parallel: “Identify” (the inventory), “Protect” (acting on Beta) and “Detect & Respond” (acting on Mu) are what I would be covering in my theoretical setting.
  • Theory: Graph Topology. I know by now that my simulator’s effectiveness varies with the graph it runs on. So when does it work best? Worst? And mind you, it’s not only interesting for my dissertation/paper, no no… In a real-world setting for Cyber Operations optimization, one “obvious” parameter to look at is: “How much segmentation do I need? Is segmentation important? By how much?” In any case, it feels like the most relevant point for my dissertation, by a long shot.
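
To make the first two bullets concrete, here is a rough sketch in R of what “just optimizing a different function” means. Everything below (the toy simulator, the function names, the parameter values) is a placeholder I’m inventing for illustration, not my actual code; the real simulator is more involved. The only thing that changes between “overall prevalence” and “Crown Jewels” is which nodes get averaged into the score:

    # Toy discrete-time SIS on a 0/1 adjacency matrix: each step, a susceptible node
    # with k infected neighbours gets infected with probability 1 - (1 - beta)^k,
    # and each infected node recovers with probability mu.
    simulate_sis <- function(adjacency, beta, mu, steps = 200) {
      n <- nrow(adjacency)
      infected <- rep(FALSE, n)
      infected[sample.int(n, 1)] <- TRUE                 # one random patient zero
      states <- matrix(FALSE, nrow = steps, ncol = n)    # infection state per step, per node
      for (t in seq_len(steps)) {
        pressure   <- as.vector(adjacency %*% infected)  # infected neighbours of each node
        catches_it <- runif(n) < 1 - (1 - beta)^pressure
        recovers   <- infected & (runif(n) < mu)
        infected   <- (infected | catches_it) & !recovers
        states[t, ] <- infected
      }
      states
    }

    # Fitness I have been using so far: average prevalence across the whole network.
    fitness_overall <- function(adjacency, beta, mu, runs = 50) {
      mean(replicate(runs, mean(simulate_sis(adjacency, beta, mu))))
    }

    # "Crown Jewels" fitness: the same simulations, but only the key nodes are scored.
    fitness_crown_jewels <- function(adjacency, beta, mu, key_nodes, runs = 50) {
      mean(replicate(runs, mean(simulate_sis(adjacency, beta, mu)[, key_nodes])))
    }

The budget idea from the second bullet would sit one level above this: the candidate the GA evolves becomes a split like c(identify = 0.2, protect = 0.5, detect = 0.3), mapped onto how much of the graph is even in the inventory, the per-edge Beta and the recovery rate Mu, before calling the same kind of fitness function.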

As it turns out, this whole time, whenever I discuss my project from the Cybersecurity perspective, what I’m really describing is a “PRE-POST-MORTEM” (not a registered trademark; let’s say “Registration Pending”! I certainly should register it. For the record, here is the earliest publication of that denomination that I know of: 2023/10/28, by me :D).

So what I do is simulate an infection and its spread, measure the impact, and recommend what would have worked better to minimise it. Then I iterate and make improvements: I compare many POST-MORTEMs, a few thousand times. (That’s maybe the best value of simulation here: you don’t have to wait to actually suffer 10 million different infection scenarios on your real-world company network…)

Then comes the PRE part: making those recommendations before the fact.
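
In code terms (still leaning on the toy simulate_sis() sketched above, so treat everything here as a placeholder), a single POST-MORTEM is just one simulated outbreak boiled down to a few impact numbers, and “compare a few thousand of them” is one replicate() and a mean away:

    # One simulated post-mortem: run an outbreak and summarise the damage.
    postmortem <- function(adjacency, beta, mu) {
      states     <- simulate_sis(adjacency, beta, mu)  # toy simulator from the earlier sketch
      prevalence <- rowMeans(states)                   # fraction of nodes infected at each step
      c(peak       = max(prevalence),                  # worst moment of the outbreak
        node_steps = sum(states),                      # total node-steps spent infected
        final      = tail(prevalence, 1))              # where it settles
    }

    # The "compare many POST-MORTEMs" part: average a few thousand of them per
    # candidate plan, so one lucky (or unlucky) run does not mislead the ranking.
    score_plan <- function(adjacency, beta, mu, runs = 2000) {
      rowMeans(replicate(runs, postmortem(adjacency, beta, mu)))
    }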

(Personal note: I came up with this denomination just yesterday, in conversation with one of the best people – one of those I respect the most – in this world, while trying to explain it all… Thanks, “you”. For all the listening and feedback. Really. I wouldn’t be here if it weren’t for you. And I’m not just talking about this project thing; it is surely the least of the infinite things I owe you…)

Tomorrow, some numbers

By the by, one discussion I just had was: how does this “thing” differ from pure brute-force?

I almost felt offended there for a second, like “I’m so clever I wouldn’t waste resources like that” – but then, going through it calmly, there is a bit of that, sure.

And let me be clear, I don’t want to call it an “ML” thing (much less “AI”, please!), because we all abuse those terms.

It is an optimization program, running on top of a simulation program, that uses graphs as a basis. That’s it.

And it’s not exactly brute-force, because the Genetic Algorithm (GA) doesn’t explore aimlessly at random: selection, crossover and mutation steer the search towards candidates that score well… But there IS a lot of number crunching in there (VERY simple number-crunching, but… well, millions upon millions of repetitions of it…).
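
To show what I mean by “not aimless”, here is what a stripped-down version of the loop looks like (again a sketch with invented knobs, not my actual implementation). The randomness is there, but it is steered: candidates that score well get to reproduce, the rest die out.

    # Minimal GA sketch: evolve a population of candidate "plans" (numeric vectors)
    # towards a low fitness score. All the knobs here are illustrative defaults.
    run_ga <- function(fitness, plan_length, pop_size = 40, generations = 100,
                       mutation_sd = 0.05) {
      population <- matrix(runif(pop_size * plan_length), nrow = pop_size)
      for (g in seq_len(generations)) {
        scores  <- apply(population, 1, fitness)        # the expensive part: the simulations
        parents <- population[order(scores), , drop = FALSE][seq_len(pop_size %/% 2), ,
                                                             drop = FALSE]  # keep the better half
        # crossover: each child mixes the genes of two randomly chosen parents
        n_children <- pop_size - nrow(parents)
        mothers <- parents[sample(nrow(parents), n_children, replace = TRUE), , drop = FALSE]
        fathers <- parents[sample(nrow(parents), n_children, replace = TRUE), , drop = FALSE]
        mask    <- matrix(runif(n_children * plan_length) < 0.5, nrow = n_children)
        children <- ifelse(mask, mothers, fathers)
        # mutation: small random tweaks so the search keeps exploring new plans
        children   <- children + rnorm(length(children), sd = mutation_sd)
        population <- rbind(parents, children)
      }
      population[which.min(apply(population, 1, fitness)), ]  # best plan found
    }

In my setting, fitness would be something like the fitness_crown_jewels() sketched earlier, with the plan vector decoded into Beta, Mu and segmentation choices before simulating.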

And so an interesting measure to look at is: how much computing? It’s never going to be extremely expensive (certainly not comparable to training a Deep NNet on a billion images, or the current LLM craziness, or anything like that…), but let’s just say it grows quickly with the network size, and with a few other knobs (time steps, repeated runs, population size, generations)… Interesting, I believe. (But that’s me, I like these questions, I’m a weird person I guess).
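
Back-of-the-envelope, with invented but plausible-for-me settings, the raw count of node updates comes out like this (which is exactly the “very simple, but a lot of it” feeling):

    # All of these numbers are made up for illustration.
    nodes       <- 1000     # machines in the simulated network
    steps       <- 200      # time steps per simulated outbreak
    runs        <- 50       # repeated outbreaks per candidate, to tame the randomness
    pop_size    <- 40       # GA population size
    generations <- 100      # GA generations

    nodes * steps * runs * pop_size * generations   # 4e+10 node updates: forty billion
    # ...and each update also looks at that node's neighbours, so denser graphs
    # cost more again on top of this.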

Side note: Confirmed, I didn’t invent the wheel

So I haven’t found papers (yet) that do things EXACTLY like what I’m proposing, but I did find a few that come very close, from 2012-2013. And I haven’t looked anywhere near hard enough yet… (Plus, the ones I’m thinking of do it with Cybersecurity in mind!! What are the odds? Very high, of course…)

I’ll admit, I truly was stupid enough (albeit for a very brief moment) to think that there was a tiny chance – but non-zero, you know – that no-one else would have done this. Well, not exactly this, anyway… As I knew, and kept repeating to myself, of course I didn’t invent fire here.

The only thing is, from what I found, my approach relies a bit less on equations and more on simulations, and I’m making fewer simplifications overall (I keep the complexity of the network layout, which seems hardly realistic to encode in simplistic ODE systems… although I’m not a mathematician, so what do I know…), probably at the cost of much more processing power. But also, others used existing tools to do what I programmed basically from scratch (in R and C++), so I feel I learnt a lot anyway. And then, computing power in 2012 compared to 2023 is a world of difference, meaning I can do things in a fraction of the time, which is probably why I’m not as worried as those papers’ authors about modelling it all mathematically… Well, that’s what I have in my head.

Not that it wouldn’t be cool to translate it all magically into a few Differential Equations and get results instantaneously, instead of running minutes (or hours) of simulations. Absolutely. I’m just not there yet, I guess.
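
For the record, the “few Differential Equations” version I have in mind is the classic mean-field SIS model, which tracks only the overall infected fraction and forgets the network layout entirely (which is precisely the simplification I’m trying not to make). A quick base-R integration, with arbitrary parameter values:

    # Mean-field SIS: i is the infected fraction, with di/dt = beta*i*(1-i) - mu*i.
    # No network structure at all; beta and mu below are arbitrary.
    beta <- 0.3
    mu   <- 0.1
    dt   <- 0.1
    i    <- 0.01                       # start with 1% of the "network" infected
    for (step in seq_len(1000)) {
      i <- i + dt * (beta * i * (1 - i) - mu * i)
    }
    i                                   # settles near 1 - mu/beta = 0.667 when beta > mu

That runs in microseconds, which is exactly the appeal… but it answers “what fraction of the network is infected on average”, not “which segmentation, which key nodes, which budget split”, which is the part I actually care about.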

Also, I haven’t (yet) found a paper about using GA+SIS on the specific parameters I keep mentioning (Beta, Mu, segmentation, unknown subnets…), so I’m hopeful this is still somewhat original work!

By the way, how the hell did scientists do “due-diligence” before Google Scholar? It must have been incredibly hard work for a looong time… Anyhow.

ONWARDS!