Like a good tenant, I tend to keep an eye on the resources that my website uses on its hosting platform. I have hosted with SiteGround since late last year on their basic “Startup” package and I have found them to be great value for money for the feature set that you are given.
However, cheap and cheerful comes at a price, and whilst I get access to some great features there is an expectation that I run a well-behaved website. Many websites are hosted within shared ecosystems as shared instances (you can, of course, purchase a dedicated server – but that is really expensive, and really suited to high-traffic transactional websites).
Now, a shared instance means that you are given a little chunk of a web server which is the playground for your own personal website. The server that you are on may host many hundreds, or in some cases thousands, of websites belonging to other people – so it is important that you (or in this case – I) do your best to make sure that your own site does not impact the performance of someone else’s.
On SiteGround there are four key metrics on which you are limited.
- Disk Space Utilisation – there is a limit of 10GB on the starter package – which, for my needs, is perfect.
- iNode usage – an iNode is essentially metadata about each and every file that is stored as part of your website – for example: size, owner information, device and socket information. The number of iNodes on a given website will be proportional to the number of files that you have, and SiteGround gives you an allocation of 150,000 on the starter package.
- CPU Time Usage – this is the amount of CPU time that your website uses on the shared server. This is measured in seconds, and you are allocated no more than 2.7 hours of CPU time in a given day, which is more than reasonable for smaller non-transactional sites.
- Account Executions – this is the total number of script executions permitted within a single day. So, each time that index.php is landed on, that counts as an execution.
On the starter package you are limited to 10,000 within a 24-hour period – which, in essence, means that you can have (sort of) 10,000 visitors a day. I say sort of, because each time a script is executed that counts against your overall allowance – so, you could have 5,000 unique visitors in a day, and if one git sits there and refreshes a page 5,000 times (it is possible) then that’s your allocation gone!
Now, I haven’t ever really used a huge amount of CPU time within my instance, and my Disk Space / iNode use has always been well within limits.
However, what I noticed very early on (right from day one, when I completed my migration from my previous host) was that the Account Executions were always quite high. Not initially over the limit, but high enough for me to question what was going on – although, admittedly, as the executions were still under my allocation I didn’t really devote huge amounts of time to it.
That is, of course, until the 20th of April, when my account executions went through the roof and remained consistently over the thresholds that I should be staying within.
Now, to be fair to SiteGround – they have been very good about this so far (probably because, whilst I am over on Account Executions, my site is still quite low on resource consumption in other areas) – but I could not ignore the fact that my site could be subject to sanction if the situation continued to go unchecked.
So I thought that I would share with you my “journey of discovery” as to what was going on and hopefully it will help someone along the way.
Ok, Sherlock – let’s gather some evidence
SiteGround provides a number of tools within its version of cPanel (this is the account management dashboard) which give an overall view of the amount of resources your website(s) are using.
Now, I have found that whilst these tools are good in isolation, and certainly give you ideas as to what might be going on, they can also lead you down paths where the data you are looking at is inconsistent and you don’t get a single view of what might be happening. I will come back to that point later, but for now I’d like to try and illustrate what I mean.
Account Executions Graph
In the SiteGround dashboard this is accessed from the “Last 24h Graph” hyperlink in the Account Executions box.
It’s useful from the perspective that it gives you an idea of when any “spike” occurred on your website; however, you cannot drill into the data. Whilst that is a little frustrating, it still yields useful information, as it told me that between 07:00 and 10:00 there was a peak of 5,600 executions. So, armed with this information, I decided to jump into AWStats to see if I could figure more out.
So I knew that between 07:00 and 10:00 there were more than 40K hits to my site (which in itself was a little odd) – that amounted to 7.6K more hits than the highest total of any four-hour period during the previous seven hours (the picture below shows this: compare 00:00, 01:00, 02:00, 05:00 to 07:00 – 10:00).
So, now that I had a time frame established and a breakdown of the traffic, the next step for me was to get an idea of where this was coming from. Ideally I wanted the source IP address(es) and indeed the target pages – but SiteGround seems to deploy either an older or a slightly restricted version of AWStats, so some functionality is missing (I have used self-installed versions before which contained far more data).
What I could get was a fairly educated guess as to where the traffic was coming from – China (see the screenshot below).
I ignored the US traffic – not because I have anything against the Chinese, but because the vast majority of my traffic (no matter what platform I have used to host) is from the US. The interesting thing that I noticed was the bandwidth utilisation versus the page hits between the two nations.
The US bandwidth figure is way higher, whilst the Chinese equivalent is much lower – yet the page count from China is almost double.
That suggested to me two things:
- US traffic seems to be more content-consumption based – therefore legitimate.
- Chinese traffic seemed more “scanning based” – e.g. something perhaps crawling my website, as the page hits were high but the bandwidth consumed was lower.
The views from both nations would count as Account Executions – but if one was merely trying to index the data on my site whilst the other was actually viewing it, that could account for my problems. This led me to the suspicion that I was dealing with a bot, or a number of bots.
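To illustrate that reasoning: a crawler tends to request many pages while pulling relatively few bytes per page, so bytes-per-hit is a handy rough heuristic. A minimal sketch, using entirely made-up numbers (not the actual AWStats figures):

```python
# Rough bot heuristic: compare bytes transferred per page hit between
# two traffic sources. The totals below are invented for illustration.
def bytes_per_hit(total_bytes, page_hits):
    """Average bytes served per page hit."""
    return total_bytes / page_hits

# Hypothetical figures: high bandwidth / fewer pages vs low bandwidth / many pages.
us = bytes_per_hit(900 * 1024 * 1024, 20_000)
cn = bytes_per_hit(120 * 1024 * 1024, 38_000)

# A much lower bytes-per-hit figure is one hint of scanning or crawling traffic.
print(f"US: {us / 1024:.0f} KB/hit, China: {cn / 1024:.0f} KB/hit")
```

It is only a hint, of course – a visitor on a heavily cached page can also show a low figure – but it matched what the AWStats screenshot suggested.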
This theory seemed to have further merit when I looked back in the SiteGround dashboard at the further detail of the account executions (see the screenshot below).
The number of executions against the remote publishing component of my WordPress site (this is denoted by /home/telnetpo/public_html/xmlrpc.php) was very high compared to what I would expect. I was not worried about the hits against the index.php as that is the main landing page of my site – so, did I now have my “smoking gun”?
So how could I get the final pieces of this puzzle together? I knew that these would more than likely be found in the raw Apache logs for my site. Apache HTTP logs contain a wealth of information regarding how a visitor has interacted with your website – the only trouble with them is that they are not overtly easy on the eye.
Combine that with the fact that under normal circumstances there will be thousands of entries within the logs, and you (or in this case, I) will have quite a job finding the exact information being looked for. What I needed was a tool which could consume the log data and let me search it simply to determine the answers to the questions that I had.
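Even without a dedicated tool, you can get a rough feel for a log with a few lines of scripting. A minimal sketch in Python, assuming the logs use Apache’s standard “combined” format (the sample lines, IPs and file handling here are made up for illustration):

```python
import re
from collections import Counter

# Regex for the leading fields of Apache's "combined" log format,
# the default on most shared hosts.
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

def count_hits(lines):
    """Return (hits per client IP, hits per path) for an access log."""
    ips, paths = Counter(), Counter()
    for line in lines:
        m = LOG_LINE.match(line)
        if m:
            ips[m.group("ip")] += 1
            paths[m.group("path")] += 1
    return ips, paths

# Two invented log lines; in practice you would pass open("access_log") instead.
sample = [
    '203.0.113.5 - - [20/Apr/2018:07:12:01 +0000] "POST /xmlrpc.php HTTP/1.1" 200 370',
    '198.51.100.7 - - [20/Apr/2018:07:12:03 +0000] "GET /index.php HTTP/1.1" 200 9041',
]
ips, paths = count_hits(sample)
print(ips.most_common(5))    # busiest client IPs
print(paths.most_common(5))  # most-requested paths
```

That gets you basic counts, but ad-hoc scripts get unwieldy quickly once you start asking follow-up questions – which is where a proper search tool earns its keep.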
What is Splunk … sounds naughty!
I’ve looked for a succinct explanation of what Splunk is on the web – something that could be described as sexy but at the same time informative. However, I didn’t find anything that really did it justice, so I guess that you will have to make do with my own hand-crafted bungling diatribe.
It is a product that I came into contact with 10 years ago, and it was pretty amazing even back then. IMHO it was one of the first “true” big data tools. In practical terms, in the UK it has primarily been marketed as a SIEM (Security Information and Event Management) tool, used predominantly for applications in the network and security fields.
However, it is capable of so much more. It is brilliant at consuming large amounts of unstructured data and then finding patterns, either via its own AI or via your own custom-built queries, to produce meaningful intelligence that a user can act upon – and it does all of this in a very simple way.
What is beautiful about Splunk is that you can make it as simple or as complicated as you like, and the learning curve is very gentle. You can have working queries on your data within about 20 minutes of installation.
You might be thinking, ‘Sounds expensive’. You’d be right – the costs of the full Enterprise version are not for the faint-hearted – but Splunk do a community (Free) edition.
Ok, tell me more?
You won’t need any specific infrastructure to run Splunk Free. It will work perfectly well on any relatively modern PC or Mac; for example, for my own purposes I have been running it quite happily on my four-year-old iMac.
You can download (after registration) a copy of Splunk Free from here – you will also find a comparison between the various versions, showing what each can and can’t do. You will notice that one of the key things about the Free version is that you are limited to a data ingestion rate of 500MB per day. So, this means that if you have log files that are larger than that, then you are not going to be able to use it long term.
I say long term, as you can for the first 60 days; during that time you will be on a trial licence. At the end of that term you either need to switch into Free mode or buy it permanently.
Now, in my case the daily 500MB limit wasn’t a problem, as the amount of data that I was working with never really exceeded 12MB – which I would imagine would be the case for the vast majority of personal website owners.
I don’t intend to cover off the installation process in this blog post, as it’s pretty straightforward.
Input, major input ….
All these years and I have been desperate to get a Short Circuit reference into something that I have written!
Anyhow, now that I had Splunk up and running I needed to get some data into it.
The first step was to download my logs from my virtual server instance on SiteGround. For those of you who use that platform, you can accomplish this via the cPanel interface – or, using the built-in SiteGround FTP account, you can connect to the root of your instance and retrieve them directly from /logs/.
They are stored in compressed archives, so you will need to extract them locally before you can import them into Splunk.
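If you would rather script the extraction than unpack each archive by hand, a sketch like the following would do it – assuming the archives are gzip files, and with the directory and naming left as whatever your host actually produces:

```python
import glob
import gzip
import os
import shutil

def extract_logs(directory="."):
    """Decompress every .gz archive in a directory, writing the plain-text
    log alongside it (minus the .gz suffix). Returns the files written."""
    extracted = []
    for archive in glob.glob(os.path.join(directory, "*.gz")):
        target = archive[:-3]  # strip the ".gz" suffix
        with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
        extracted.append(target)
    return extracted

print(extract_logs())  # [] if there are no archives in the current directory
```

The resulting plain-text files are what you then feed to Splunk.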
If you don’t use SiteGround, I would imagine that the process wouldn’t be that much different. Most decent web hosting providers will give you a means to access your website’s logs.
So once the logs were downloaded and extracted I fired up Splunk. From the home page (under “Explore Splunk” area) I selected the “Add Data” option (see in red below):
I was taken to the add data screen, from here I selected the “Upload” option from the “or get data in with the following methods” section (see in red below):
I was then asked to select the source for my data upload (this is, of course, the log files). It is important (and slightly irritating) to note that you can only upload one file at a time from the web interface. I believe that it is possible to do batch inputs – but I haven’t looked into that in any great detail.
I clicked on the “Select File” button (highlighted in red below):
I then navigated to where I had stored the log files, selected the one I wanted and then clicked on “Open“:
The file then uploaded, and I clicked on the “Next” button (highlighted in red below) – I did this for every remaining stage until the progress line to the left of the “Next” button was at “Done“. There are a number of screens in between, but you don’t need to make any changes to them: Apache uses a standard log format which Splunk recognises, so the default options work just fine.
In the next part
In the next part of this series (which, I promise won’t take over a year to write) – I will go through how I used Splunk to track down the culprits behind the increased Account Executions.
In essence I was looking for the source IP addresses and an understanding of what specifically was happening – and I have to say, Splunk was amazing! So I hope that you will join me for the next instalment.