Tutorial 2: Syslog data mining with attached md5sum. AKA "Store 100% of data".

December 6, 2007  |  Dominique Karg

1. The need. The Hype.

There’s obviously a need for storing vast amounts of logs, and few things today can’t log to syslog. So it’s only natural to run into this request every once in a while, and this tutorial illustrates the OSSIM approach to massive syslog data storage. Of course, where you read syslog you can substitute Windows event log, SNMP data, or whatever else generates a large amount of raw data.


I don’t know much yet about all of this compliance stuff (I was lucky: Julio has always been much more knowledgeable in that area than me, so I could skip it), but I guess I’ll have to start learning; there are just too many people asking for it and I’m getting very curious.

From what I’ve seen, a short list of regulations requiring, or at least strongly recommending, a certain amount of raw data storage and reporting includes:

  • ISO27001/17799
  • SOX
  • PCI
  • Basel II
  • NIST 800-53
  • Many more…

(Searching for SIM and compliance information, I see that’s a major marketing point for vendors too. Well, just for the record: ossim helps you be compliant with all that stuff.)

Centralized logging

Maybe the need is pure sysadmin laziness. You want to be able to answer the questions you get asked by your management/customers in the easiest possible way.

I heard this from a guy a couple of days ago: the more information you have about your network, the more answers you can give, and that’s exactly what SIM/SEM systems are good at.

Data mining

This is a bit redundant with the previous entry, but there are people who just don’t care about exact data; they’re in desperate need of colorful graphs to keep their bosses calm. Well, having logs from everything in your network allows for easy, colorful report generation with little knowledge of the underlying data. The worthiness of those reports will, of course, be highly questionable in the end.

2. The preparation.

I thought it would be interesting to explain the process we used to create this plugin in the first place, so this will be a hybrid tutorial: syslog, plus how to add your own datasources.

Write a new function

Well, the first thing that came to my mind was that, even without knowing much about those compliance regulations, if they’ve been designed with at least a bit of common sense they’ll require a cryptographic stamp on the logs. So let’s just add a little function to our plugin environment that does just that: checksumming each line with MD5.

import hashlib

def md5sum(datastring):
    # return the hex MD5 digest of the raw log line
    return hashlib.md5(datastring.encode()).hexdigest()

Just put your functions into ParserUtils.py and they’re available for all plugins. That file usually resides at /usr/share/ossim-agent/ossim_agent/.
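To see why the checksum is worth storing, here is a minimal, self-contained sketch of the tamper evidence it gives you (the sample line and the verification step are illustrative, not part of the plugin):

```python
import hashlib

def md5sum(datastring):
    # hex MD5 digest of the raw log line, as stored in userdata1
    return hashlib.md5(datastring.encode()).hexdigest()

line = "Dec  6 07:10:12 Gestalt syslog[3997]: test"
stored = md5sum(line)                      # computed at collection time

# Later, anyone can recompute the digest; any edit to the line breaks it.
assert md5sum(line) == stored
assert md5sum(line + " tampered") != stored
```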

Write a new plugin

The next thing we’ll do is create a new plugin. We want this plugin to:

  • Get every single syslog line
  • Respect the original logline
  • Checksum that line

After a bit of toying with it, I saw it would be easy to also:

  • Extract the sensor from the line
  • Extract the source ip
  • Extract the originating process
  • Extract the PID
  • Extract the line without the “changing” part

All of this while respecting the original line of course.

;; syslog
;; plugin_id: 4007
;; type: detector

# Enable syslog to log everything to one file. Add it to log rotation also.
# echo "*.*     /var/log/all.log" >> /etc/syslog.conf; killall -HUP syslogd

# create the log file if it does not exist,
# otherwise stop processing this plugin

## rules

[syslog - datamining]
# Sep  6 12:07:26 ossim-devel su[9886]: FAILED su for root by juanma
regexp="(?P<logline>(?P<date>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<sensor>[^\s]+)\s+(?P<generator>[^\[]*)\[(?P<pid>\d+)\]:(?P<logged_event>.*))$" sensor={resolv($sensor)} date={normalize_date($date)} plugin_sid=1 userdata1={md5sum($logline)} userdata2={$logline} userdata3={$generator} userdata4={$logged_event} userdata5={$pid}


This is quite a simple plugin: only one rule and a relatively simple regexp. You may notice the md5sum function being used, as well as a couple of others.


We don’t really need to extract the fields here, but doing so will make reporting for this plugin a very easy task.
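As a rough sanity check outside the agent, the rule can be exercised with Python’s re module against the sample line from the plugin. This is an approximate reconstruction of the pattern (the group names mirror the fields the rule extracts), so the exact regexp shipped in syslog.cfg may differ slightly:

```python
import re

# Approximate reconstruction of the rule's regexp with named groups.
pattern = re.compile(
    r"(?P<logline>(?P<date>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+"
    r"(?P<sensor>[^\s]+)\s+(?P<generator>[^\[]*)"
    r"\[(?P<pid>\d+)\]:(?P<logged_event>.*))$"
)

m = pattern.match("Sep  6 12:07:26 ossim-devel su[9886]: FAILED su for root by juanma")
print(m.group("sensor"))     # ossim-devel
print(m.group("generator"))  # su
print(m.group("pid"))        # 9886
```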


Write the SQL


Next we have to make the server aware of our new plugin. We’ll assign it ID 4007, since it’s free and in the range reserved for syslog, and insert only one event.


Please notice this event is inserted with priority 0; more on this later.




DELETE FROM plugin WHERE id = "4007";

DELETE FROM plugin_sid where plugin_id = "4007";

INSERT INTO plugin (id, type, name, description)

VALUES (4007, 1, 'syslog', 'Syslog plugin with md5 checksum logging');

INSERT INTO plugin_sid (plugin_id, sid, category_id, class_id, name, priority, reliability)

VALUES (4007, 1, NULL, NULL, 'Syslog: syslog entry' , 0, 1);




Priority 0


Inserting the event with priority 0 is a little trick that helps prevent these items from generating noise in our SIM system.


By default, each incoming event gets an instant risk value calculated as:


risk = asset * priority * reliability / 25


So, what would happen if this plugin were activated and you enabled debugging on an app that generates a thousand log lines? Well, that host would be treated as compromised, alarms would be raised, and many indicators would be affected by this false positive.


Since we actually only want to store events, not assess any instant risk with them, one of the three variables in the multiplication should be set to 0 for this type of event to be ignored by the risk assessment system by default (they would still be correlated, forwarded, could become alarms, etc.). We can’t control the assets, and the event’s reliability can only be modified by correlation rules. Wouldn’t it be nice to be able to decide which events to take into account based on our policy? Well, the obvious choice is priority, since correlation rules can affect both priority and reliability, while policies only affect priorities.
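To make the effect concrete, here is the formula from above evaluated with made-up asset and reliability values (the numbers are purely illustrative):

```python
def risk(asset, priority, reliability):
    # instant risk as calculated by the server: asset * priority * reliability / 25
    return asset * priority * reliability / 25

print(risk(5, 2, 5))  # 2.0 -- a normal event scores some instant risk
print(risk(5, 0, 5))  # 0.0 -- priority 0 zeroes it out, whatever the asset value
```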


That means if we want specific source IPs or generators to behave differently, we can control it via policy; and if we want to generate alarms, or correlate with other events or with some of the extracted fields, we do it using directives (correlation rules).


Enable syslog logging


There are better ways of getting all the logs into one file without duplication, but just for the record, I added “*.* /var/log/all.log” to /etc/syslog.conf and restarted the service. That way we can point our plugin at that file and forget about filtering.


Enable remote logging


Remote logging is similar: enter “*.* @agent_ip” into the syslog file instead and restart. Here is my output on Mac OS X:


Gestalt:etc dk$ sudo vi syslog.conf

Gestalt:etc dk$ ps ax | grep syslog

   13   ??  Ss     0:04.84 /usr/sbin/syslogd

 3994 s004  R+     0:00.00 grep syslog

Gestalt:etc dk$ sudo kill -HUP 13

Gestalt:etc dk$ syslog -s test

On our tail -f /var/log/all.log:

Dec  6 07:10:12 Gestalt syslog[3997]: test




3. The implementation.


Update 2008/03/11: fixed the regexp; as can be seen in the comments below, it was too narrow. Just re-download the .cfg.txt.


Copy the plugin to its place


The plugin file should be put into /etc/ossim/agent/plugins/syslog.cfg.


Insert SQL


The SQL can be inserted from the server as:


cat syslog.sql.txt | mysql -p -uroot ossim


Note for installer users: you can get your database password from /etc/ossim/ossim_setup.conf as root by grepping for “pass”, since it is generated randomly for each installation. That password is also used for ntop, nessus, etc.
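As a quick sanity check (assuming your schema matches the inserts above), you can confirm the rows landed from the same mysql client:

```sql
SELECT id, type, name FROM plugin WHERE id = 4007;
SELECT plugin_id, sid, name, priority FROM plugin_sid WHERE plugin_id = 4007;
```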


Restart server


There are nicer ways but this works:


killall ossim-server; ossim-server -d


Enable plugin


Add a line like this to your /etc/ossim/agent/config.cfg, into the [plugins] section.
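The example line itself was lost from this page; a plausible entry (an assumption: a key=path line pointing at the .cfg location used above) would look like:

```ini
[plugins]
syslog=/etc/ossim/agent/plugins/syslog.cfg
```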




Restart agent


killall ossim-agent; ossim-agent -d


Does it work? The RT event viewer.


The best way to see whether an event makes it through generator → agent → server → database is to fire up the RealTime event viewer and start it with “fast” selected.


Here you can see they’re starting to arrive:


(Image removed, broken link, I’m very sorry. DK.)


4. The results.


That’s it, the data is being fed into the database. What now?




First of all, remember that the events won’t affect our risk, but they’re still getting stored.


WARNING: if you abuse this, your database won’t be able to handle the load from all that data. You’re going to store it, but with little use. Default OSSIM is not tuned for more than 1–1.5 million active events in the database. That’s more than enough for a small/home user, but nowhere near enough if you feed it from 2,000 syslogging devices.


So what can you do? Store only part of the data using policies, when you need it, while still correlating all events for the more interesting stuff. This obviously doesn’t help if you came here looking for better compliance (see sections 5 and 6 below).


Following are a couple of hints on how to create a policy that stores events from only a single syslog host.


  1. Create a host: (Image removed, broken link, I’m very sorry. DK.)
  2. Create a plugin group with syslog plugins: (Image removed, broken link, I’m very sorry. DK.)
  3. Select the generating host as source (you could also apply this to a sensor, and any hosts it receives) (Image removed, broken link, I’m very sorry. DK.)
  4. Disable everything, just store (or leave correlation enabled for these events if you wish) (Image removed, broken link, I’m very sorry. DK.)
  5. Create a drop-everything-else policy: (Image removed, broken link, I’m very sorry. DK.)
  6. The result: (Image removed, broken link, I’m very sorry. DK.)




I’ll just throw in the warning again:


WARNING: if you abuse this, your database won’t be able to handle the load from all that data. You’re going to store it, but with little use. Default OSSIM is not tuned for more than 1–1.5 million active events in the database. That’s more than enough for a small/home user, but nowhere near enough if you feed it from 2,000 syslogging devices.


Which basically means: if you’ve got few devices (at most 200,000 events a day), you can go for “logging everything” with the default installer ossim. Otherwise you’ll have to heavily tune your database, enable “writing into the filesystem” and/or play some other DB tricks. See section 6 below.


Event viewer


Since we’ve already separated our info into fields, besides keeping the whole original line, we can easily create a new event viewer panel for this specific plugin.


In the example below, I created one with four columns, with label:data mappings such as:


  • Checksum:USERDATA1
  • Generator:USERDATA3


(Image removed, broken link, I’m very sorry. DK.)


Note: “log” is not the entire line; that would be USERDATA2, as you can see in our plugin.




What else do we need to be able to tell our boss that we’re now much closer to being compliant? Nifty graphs :blush:


Let’s define a new tab (using a nice icon from images.google.com) and put two panels in there: a simple graph with our top 5 log generators and a cloud with all the generators.
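The panel SQL itself was lost with the screenshots; as a rough sketch (the table and column names are assumptions based on the userdata mapping above), a top-5-generators query might look like:

```sql
SELECT userdata3 AS generator, COUNT(*) AS total
FROM event
WHERE plugin_id = 4007
GROUP BY userdata3
ORDER BY total DESC
LIMIT 5;
```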


Tab: (Image removed, broken link, I’m very sorry. DK.)


Cloud definition: (Image removed, broken link, I’m very sorry. DK.)


Cloud SQL: (Image removed, broken link, I’m very sorry. DK.)


Resulting graph: (Image removed, broken link, I’m very sorry. DK.)


Note: the pie graph is of type “SQL”, using rows as labels, with 45-degree rotation. Here is the export you can import into your panel:

(Panel export removed, broken link.)

5. The tuning/solutions.


After the two warnings, here are some solutions for the problems.


Database rotation


First, we implement database rotation every backup_day days. By default this is set to 5, so if you are collecting at most 200,000–300,000 syslog events a day, you should be fine (5 days × 300,000 events ≈ 1.5 million active events, the upper limit mentioned above). If your volume is much lower, you can increase the number of days; if it’s higher, decrease it. That’s the quick and dirty solution.


File storage


A friend of mine is working on filesystem storage in the form of /date/sensor/plugin/ip.log; the hooks are there for that format and it should be available very soon.


6. The spam.


I really hate spamming in a community article; please don’t read any further if you’re satisfied with what you’ve got so far and don’t have any special requirements.


Compliance module: Performance, storage, reporting.


The founders of ossim started a company about a year ago, called (in order to be original) ossim. The site is ossim.com, and there we provide a series of value-added solutions and services.


Since all of this compliance stuff is really a high-level enterprise need, we’ve spent a lot of time packaging a few custom modules for just that purpose, and we are continuously improving them in parallel with normal ossim evolution.


So the compliance modules are provided as a sort of add-on to the open ossim distribution, since we feel that part is useless for the average community user, and we would like the companies that are saving many hundreds of thousands of dollars by using our open source solution to put something back into development.


Professional compliance modules include:


  • Appliance version for easy deployment
    • Heavily tuned database, using a combination of HEAP, MERGE, InnoDB, MyISAM and compressed MyISAM tables for optimal performance/storage capacity
    • 64Bit version
    • MMAP-compiled libpcap applications for greatly enhanced performance (up to gigabit speeds on a standard dual-core machine)
    • Sets of custom flash graphs for specific compliance needs
    • Sets of custom viewers for specific compliance needs
  • Guides for particular compliance needs: “What exactly do I have to set up in ossim to be more compliant with XXXX?”
  • Specific support
  • Specific compliance rule feeds
