
Basics

The Splunk index is a directory on a filesystem for data ingested by Splunk software. Events are stored in the index as two groups of files: the raw data in compressed form (rawdata) and the indexes that point into that raw data (.tsidx files), plus some metadata files.

These files reside in sets of directories called buckets. An index typically consists of many buckets, organized by age: hot (stored in db - see below), warm (db), cold (colddb), frozen (data in this bucket is not searchable, but can be thawed) and thawed (thaweddb, buckets restored from an archive). A hot bucket is still being written to and has not necessarily been optimized.

Splunk stores the raw data it has indexed, together with its indexes, as flat files in a structured directory ($SPLUNK_HOME/var/lib/splunk), meaning it doesn’t require any database software running in the background. For example, the default index looks like this:

/opt/splunk/var/lib/splunk/defaultdb $ tree -sC
.
├── [       4096]  colddb
├── [       4096]  datamodel_summary
├── [       4096]  db
│   ├── [         10]  CreationTime
│   ├── [       4096]  GlobalMetaData
│   └── [       4096]  hot_v1_0
│       ├── [      53152]  1466256608-1466256608-4107774829233827835.tsidx
│       ├── [      68997]  1466256608-1466256608-4107774830220961558.tsidx
│       ├── [         67]  bucket_info.csv
│       ├── [        105]  Hosts.data
│       ├── [       4096]  rawdata
│       │   └── [      94207]  0
│       ├── [        111]  Sources.data
│       ├── [        106]  SourceTypes.data
│       ├── [         78]  splunk-autogen-params.dat
│       └── [         41]  Strings.data
└── [       4096]  thaweddb

7 directories, 10 files

Splunk breaks data into events based on the timestamps it identifies.
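
Event breaking and timestamp recognition are configured per sourcetype in props.conf. A minimal sketch, assuming a hypothetical sourcetype my_app whose lines start with a timestamp like 2016-06-18 14:50:08 (stanza name and formats are made up for illustration):

# props.conf - event breaking and time parsing for an assumed sourcetype
[my_app]
SHOULD_LINEMERGE = false            # one line = one event
LINE_BREAKER = ([\r\n]+)            # split the raw stream on newlines
TIME_PREFIX = ^                     # timestamp starts at the beginning of the line
TIME_FORMAT = %Y-%m-%d %H:%M:%S
MAX_TIMESTAMP_LOOKAHEAD = 19        # read at most 19 characters for the timestamp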

Event data - all the IT data that has been added to Splunk’s indexes. The individual pieces of data are called events.

Splunk is designed as a platform extensible via Apps and Add-ons. Both are packaged sets of configuration. “Search” is the default App.
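
An App is essentially a directory under $SPLUNK_HOME/etc/apps/; a simplified sketch of the typical layout (my_app is a placeholder name):

$SPLUNK_HOME/etc/apps/my_app/
├── default/        # configuration shipped with the app (inputs.conf, props.conf, ...)
├── local/          # local overrides; changes made in the web UI land here and win over default/
└── metadata/       # object permissions (default.meta, local.meta)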

Instance types (components)

Splunk data pipeline:

  1. INPUT - receipt of raw data from log files, network ports or scripts
  2. PARSING (analyzing) - splitting raw data into events, time parsing, running transforms, setting base metadata, …
  3. INDEXING - data storage and optimization of indexes
  4. SEARCH - running queries and presenting results

These four stages of processing are generally split across two to four layers.

Splunk network diagram

Forwarders (INPUT/PARSING)

Indexers (PARSING/INDEXING/SEARCH)

Search heads (SEARCH)

Deployment server
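
The layers talk to each other over plain TCP. A minimal sketch of the wiring, assuming two indexers and the conventional port 9997 (host names are made up):

# outputs.conf on each forwarder - AutoLB across the indexers
[tcpout]
defaultGroup = primary_indexers

[tcpout:primary_indexers]
server = idx1.example.com:9997, idx2.example.com:9997

# inputs.conf on each indexer - listen for forwarded data
[splunktcp://9997]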

Sizing indexers

An indexer with the following specification should handle 100 GB of raw logs per day (using the AutoLB feature of Splunk forwarders) and four concurrent searches (including both interactive and saved ones):

If you have 200 GB of logs per day, you should have two such indexers (and you should run two anyway for high availability).

.conf files

Everything in Splunk is controlled by configuration files sitting in the filesystem of each Splunk instance. Configuration changes made via the web interface end up in these .conf files.

$SPLUNK_HOME/etc/ - where all configuration lives

inputs.conf - what data to consume and how to tag it

props.conf - how consumed data is parsed (event breaking, timestamp extraction, transforms)

serverclass.conf - which deployment clients get which apps (used by the deployment server)
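
As a taste of the format, a deployment server might use a serverclass.conf like the following to push an app to a group of hosts (class, pattern and app name are invented):

# serverclass.conf on the deployment server
[serverClass:linux_web]
whitelist.0 = web*.example.com          # deployment clients matching this pattern

[serverClass:linux_web:app:web_inputs]
restartSplunkd = true                   # restart clients after deploying the app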

Data sources

Monitoring logs on servers (UF)

Monitoring logs on shared drive

Consuming logs in batch

Receiving syslog events

Consuming logs from a database

Using scripts to gather data
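
Most of these map directly onto inputs.conf stanza types. A sketch of three of them (paths, sourcetypes and the interval are assumptions):

# inputs.conf - three common input types
[monitor:///var/log/messages]           # continuously tail a file
sourcetype = syslog

[batch:///data/dropbox/*.log]           # index once, then delete the file
move_policy = sinkhole
sourcetype = archived_logs

[script://./bin/poll_status.sh]         # index the output of a script
interval = 300                          # run every 300 seconds
sourcetype = status_script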

Searching

 search terms (keywords, "quoted phrases", fields, wildcard*, booleans, comparisons)
                     +
                     |                                    clause
                     |                                      +
                     |                                      |
                     v                                      v
[search]  sourcetype=access_* status=503 | stats sum(price) as lost_revenue
                                            ^     ^    ^
                                            |     |    |
                                            +     |    +
                                       command    |   argument
                                                  +
                                             function
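
The search-term side alone can combine all of those forms; for example (sourcetype, field and host names are made up):

sourcetype=access_* host=web* status>=500 ("server error" OR timeout) NOT staging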

Performance tips

As events are stored by time, time is always the most efficient filter. After time, the most powerful filters are the indexed fields host, source and sourcetype.

The more you tell Splunk, the better the chance for good results.

Field extraction is one of the most costly parts of a search. fields [+] <wc-field-list> - include only the specified fields; this occurs before field extraction, so it improves performance.
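
Putting the tips together: a search that filters on time and indexed fields first, then trims fields before anything expensive (index, sourcetype and field names are assumptions):

index=web sourcetype=access_combined host=web01 earliest=-1h
| fields + clientip, status
| stats count by status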

Debugging

https://answers.splunk.com/answers/4075/whats-the-best-way-to-track-down-props-conf-problems.html

$SPLUNK_HOME/bin/splunk cmd btool props list <sourcetype>
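
Adding --debug prints the file each setting comes from, which is usually the fastest way to spot precedence problems; e.g. for the syslog sourcetype:

$SPLUNK_HOME/bin/splunk cmd btool props list syslog --debug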

Resources