If you’re like me, you’ve probably downloaded Splunk, then panicked and stared at an empty dashboard for a while. To be completely honest, after I downloaded Splunk and installed it on my laptop, I was a little intimidated about what to do next. Fortunately, as an EMC Splunk Ninja, I was able to attend a 5-day Splunk boot camp that answered a ton of my questions. Hopefully this little snippet will help you get started. Please don’t hesitate to reach out to me via Twitter for some help!
Where can I get data?
There are many different ways to get data into Splunk. For the sake of simplicity in our demo, we’re going to take the easy route. If you haven’t already, I highly recommend you check out Data.gov. This website is the embodiment of a movement in the US government to make datasets free and publicly accessible so people way smarter than me can figure out way cooler things. But for now, we’re going to have some fun. For our example, I pulled down the traffic violations of Montgomery County, Maryland in .csv format. Be patient, it’s a little over 250 MB. If you’re feeling brave, try to open that file in Excel… It more than likely won’t work, and you’ll end up force quitting Excel and getting frustrated that you can’t look at the data. So let’s go ahead and ingest that data into a tool built for large datasets, say Splunk, and see what we can find. Here’s how you do that.
Data homepage
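If you’d like a quick look at the file before loading it anywhere, here’s a minimal Python sketch that streams only the header and the first few rows instead of opening the whole thing the way Excel tries to. The column names and sample rows are made up for illustration (only Alcohol, Make, Color and Location are mentioned in this post), and the helper name peek_csv is my own.

```python
import csv
import io
from itertools import islice

def peek_csv(lines, n=3):
    """Return the header and the first n data rows from an iterable of CSV lines."""
    reader = csv.reader(lines)
    header = next(reader)            # first line holds the column names
    rows = list(islice(reader, n))   # read only n rows, never the whole file
    return header, rows

# Tiny in-memory stand-in with hypothetical columns. With the real download you
# would pass open("Traffic_Violations.csv", newline="") instead.
sample = io.StringIO(
    "Timestamp,Location,Make,Color,Alcohol\n"
    "2016-08-01 08:15:00,MAIN ST,TOYOTA,BLACK,No\n"
    "2016-08-01 17:40:00,OAK AVE,HONDA,SILVER,Yes\n"
)

header, rows = peek_csv(sample, n=1)
print(header)
print(rows)
```

Because the reader pulls one line at a time, this works the same on a 250 MB file as on the tiny sample.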
What does the data look like?
Once this data is ingested, Splunk uses the time stamp on each row of the CSV file to place the events in time. Splunk also picks up the important fields in the dataset (like Location and Make), which makes it easy to query the information. Where Splunk really differentiates itself from traditional data warehouses is its schema-on-read architecture. This lets us get the data in place first and then decide what to ask of it, rather than having to formulate our questions first and then go find the data.
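To make the schema-on-read idea a bit more concrete, here’s a rough Python sketch. It is not how Splunk works internally, just an illustration of the principle: the raw text is stored as-is, and fields are only pulled out when a question gets asked. The sample rows and the Timestamp column are invented for the example.

```python
import csv
import io

# Raw data kept exactly as it arrived; no tables or schema defined up front.
RAW = (
    "Timestamp,Location,Make,Alcohol\n"
    "2016-08-01 08:15:00,MAIN ST,TOYOTA,No\n"
    "2016-08-01 17:40:00,OAK AVE,HONDA,Yes\n"
    "2016-08-02 23:05:00,MAIN ST,FORD,No\n"
)

def read_events(raw_text):
    """Parse fields at read time, using whatever columns the data happens to have."""
    return list(csv.DictReader(io.StringIO(raw_text)))

events = read_events(RAW)

# Two unrelated questions, both thought of after the data was already stored.
stops_on_main = [e for e in events if e["Location"] == "MAIN ST"]
alcohol_makes = [e["Make"] for e in events if e["Alcohol"] == "Yes"]

print(len(stops_on_main), alcohol_makes)
```

Neither question had to be known when the data was saved, which is the whole point.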
Splunk doesn’t stop at .csv files, either. Anything that is human readable and has a time stamp is a great candidate for this platform. Across your business, from IT to security to business analytics, your entire data lake can be serviced and governed on one centrally managed platform that lets you get value quickly out of the data that lives in your company.
How do I import it?
For simplicity’s sake, here’s how to manually import the data. Keep in mind that this is the most basic way to load data into Splunk. For more information on the multiple ways Splunk ingests data, check out this article.
We’re going to be ingesting the .csv file we downloaded earlier. In order to do that, log in to the Splunk web page, then do the following:
  1. Click on “Settings”, then “Add Data”.
  2. From there, you have multiple options. For our tutorial, we’re going to manually import the data, so click “Upload from my computer”.
          Upload Data
  3. From there, drag and drop the .csv file onto the upload area shown below and wait for the file to fully upload. Keep in mind that the largest file allowed for manual uploading is 500 MB.
          [Screenshot: file upload screen]
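Before you drag a big file over, it can save some waiting to check its size against that 500 MB limit. Here’s a small sketch of that check. The limit is the figure quoted above, I’m assuming it means 500 × 1024 × 1024 bytes, and the function name fits_upload_limit is just something I made up. The demo writes a tiny temporary file so it runs anywhere; point it at your own download instead.

```python
import os
import tempfile

# 500 MB as quoted in this post; the exact byte count is an assumption.
UPLOAD_LIMIT_BYTES = 500 * 1024 * 1024

def fits_upload_limit(path, limit=UPLOAD_LIMIT_BYTES):
    """Return True if the file at path is no larger than limit bytes."""
    return os.path.getsize(path) <= limit

# Write a tiny throwaway CSV so the example is self-contained.
with tempfile.NamedTemporaryFile(suffix=".csv", delete=False) as tmp:
    tmp.write(b"Alcohol,Make\nNo,TOYOTA\n")
    demo_path = tmp.name

ok = fits_upload_limit(demo_path)                      # small file, default limit
tiny_limit_ok = fits_upload_limit(demo_path, limit=10)  # same file, 10-byte limit
print(ok, tiny_limit_ok)

os.remove(demo_path)  # clean up the throwaway file
```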
How do I search it?
Now that the data is imported, let’s figure out how to actually derive value out of all the work you’ve just done. Your time is valuable, just like this data… So let’s get cracking. My way of seeing what’s in my small environment is to type source=* in the search bar. The ‘*’ is a wildcard in Splunk, so this matches every source and shows us what data is in our environment. I don’t recommend doing this on larger environments though, since it will pull back everything.
  1. Let’s make a simple query of whether the traffic violation was alcohol related. In order to do that, go to the search bar and type ‘source="Traffic_Violations.csv" | stats count by Alcohol’ (use plain straight double quotes around the file name). This command will search the Traffic_Violations.csv file, then count the events for each value (“Yes” or “No”) of the Alcohol field.
  2. Now, the results are really cool! But we really need visualizations. So head on over to the “Visualization” tab in the middle of the screen and you should see a pie chart show up. Some visualizations do a better job of explaining your data than others. You can switch between them by clicking on the pie chart and looking at the different options you have. (Here’s a great book that shows you how to efficiently and effectively tell your story using data.) The great thing about Splunk is that since you’ve already given it the fields for the visualization, it will automatically carry those fields across to whichever visualization you pick.
  3. For now, make sure you keep this search as the Pie Chart. We’re going to build this visualization into a dashboard. In order to do this, click on “Save As” in the upper right corner and click “Dashboard Panel”. Make sure the dashboard is set to “New” and fill out the information following the photo below.
    [Screenshot: “Save As Dashboard Panel” settings]
  4. Then push save and make sure the dashboard looks good. After that, we’re going to repeat the same process. Click on Search in the upper left corner, paste in the following commands one by one in the search bar, and repeat steps 1-3 in this section:
  5. To see what the busiest times of day for traffic stops are, try the following command: ‘source="Traffic_Violations.csv" | stats count by date_hour | sort +date_hour’ Note: The “+” in front of date_hour sorts the results by that field in ascending order. After the query runs, click on the “Visualization” tab and check out the bar graph Splunk automatically recommends. Go ahead and save that to the existing dashboard and name the panel “Violations by the Hour”. Interesting observation: the busiest times of day are the morning and evening commutes and the late evening!
  6. Since we know what time of day is the busiest, let’s see what day of the week is the busiest. To do this, try the following command: ‘source="Traffic_Violations.csv" | stats count by date_wday | sort -count’. Notice this time we used the “-” in front of the count field. This way the busiest day will show at the top of the table and the slowest day at the bottom. Now head over to the visualizations. Add that Line Chart to the existing dashboard as “Violations by Day” and take a second to admire your work!
  7. We have 3 really informative panels, but the layout is not that great. Let’s reorganize it and see what we can do to clean it up. On your “Traffic Violations Stats” dashboard, click on the “Edit” drop-down menu in the upper right corner of the screen and select “Edit Panels”. Your panels should now have a black dotted row at the top of each panel. Scroll down to the “Violations by Day” panel and see if you can move it to the right of the “Alcohol Related Violations” pie chart. Once you have a layout you like, select “Done” and check out your work. Here’s what mine looks like for reference:
[Screenshot: the rearranged dashboard]
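If it helps to see what those three searches are actually computing, here’s a rough Python equivalent of the “stats count by” plus “sort” pattern used in the steps above. This is only a sketch of the idea, not Splunk’s implementation. The events, the time field name and the timestamp format are all made up, and the lowercase weekday labels are just for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical events standing in for rows of the traffic violations file.
events = [
    {"time": "2016-08-01 08:15:00", "Alcohol": "No"},
    {"time": "2016-08-01 08:50:00", "Alcohol": "No"},
    {"time": "2016-08-01 17:40:00", "Alcohol": "Yes"},
    {"time": "2016-08-02 23:05:00", "Alcohol": "No"},
]

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]

def parse_time(event):
    return datetime.strptime(event["time"], "%Y-%m-%d %H:%M:%S")

def stats_count_by(events, key):
    """Count events per value of key(event), like 'stats count by <field>'."""
    return Counter(key(e) for e in events)

# stats count by Alcohol
by_alcohol = stats_count_by(events, lambda e: e["Alcohol"])

# stats count by date_hour | sort +date_hour  (ascending by the hour itself)
by_hour = sorted(stats_count_by(events, lambda e: parse_time(e).hour).items())

# stats count by date_wday | sort -count  (descending by the count)
by_wday = sorted(
    stats_count_by(events, lambda e: WEEKDAYS[parse_time(e).weekday()]).items(),
    key=lambda pair: pair[1],
    reverse=True,
)

print(dict(by_alcohol))
print(by_hour)
print(by_wday)
```

The only difference between the last two is what you sort on: the field you grouped by, or the count.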
Just for the fun of it, here are a few more queries and dashboards you can play with…
Top 10 Car Makes for Violations: ‘source="Traffic_Violations.csv" | top limit=10 Make’
Car Color: ‘source="Traffic_Violations.csv" | top Color’
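For the curious, the top command is basically “count each value, keep the most common ones, and show each one’s share of the total”. Here’s a minimal Python sketch of that idea under my own assumptions: the list of makes is invented, and the output field names (value, count, percent) are my labels for the example rather than anything guaranteed to match Splunk’s output exactly.

```python
from collections import Counter

def top(values, limit=10):
    """Most common values with their count and percent of all values."""
    values = list(values)
    total = len(values)
    return [
        {"value": v, "count": c, "percent": round(100.0 * c / total, 2)}
        for v, c in Counter(values).most_common(limit)
    ]

# Hypothetical car makes pulled from a handful of violations.
makes = ["TOYOTA", "HONDA", "TOYOTA", "FORD", "TOYOTA", "HONDA"]

result = top(makes, limit=2)
for row in result:
    print(row)
```

Note that the percent is measured against every event, not just the ones that made the cut.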
I’m going to call this dashboard good to go. Let’s go ahead and make this the default dashboard, so you’ll automatically see it when you log in. To do that, from the Search & Reporting app, click on “Dashboards” on the upper bar of the Splunk screen and select the dashboard named “Traffic Violations Stats”. From there, click on “Edit”, then select “Set as home dashboard”, and voila. You’re good to go.
I hope you’ve enjoyed playing along and Splunking with us at EMC. Please don’t hesitate to reach out to me if you have any questions about this blog or anything Splunk / EMC related!
Happy Splunking,
Kyle Prins