In this tutorial, we will go through all the settings needed to stream log data from a server to Google BigQuery.
There are other tutorials on this topic on the Internet, but I found them either unclear or full of outdated information. Various changes are needed for everything to work as expected, and I will cover them in this article. I have tested this approach in production for several weeks with large amounts of data, and the results are good.
But why do we need this?
One question comes up right away: why do we need this? Why not just download the logs and analyze them with Pandas or some other tool like GoAccess (an awesome open-source web log analyzer)?
I wanted to be able to see the data in real time and to perform some sort of analysis of the traffic. Our voice platform also exposes an API, and I need more information about the hits from each user.
That is how I started searching for the best way to implement log streaming into Google BigQuery, where we can check every request, IP, and visited page.
It is like Google Analytics for all server traffic, not only visitors detected via JavaScript. Well, not quite, but we are close…
With this technique, we achieve:
- the ability to monitor hits on our server in almost real time;
- the possibility to see how search engine bots are crawling our website;
- a way to check whether we have too many visits from specific IP addresses.
And all of this data is stored in BigQuery, where we can run lightning-fast queries using SQL syntax. On top of that, we can integrate our data with Google Data Studio or custom data visualization solutions.
What is Google BigQuery
It is a serverless, highly scalable, and cost-effective cloud data warehouse designed for business agility. It allows us to perform blazing-fast queries and get the data we need with simple SQL — in a fraction of the time.
Also, BigQuery’s high-speed streaming insertion API provides a powerful foundation for real-time analytics, making our data immediately available for analysis.
That is perfect for our needs, as we typically have to handle large amounts of information; just imagine the crawling rates of the regular bots.
And what is Fluentd
In order to send all this streaming information, we need some kind of connection from our server to BigQuery. Building a custom method for shipping log data is hard to get right, as we would need to implement the logic for log rotation, server shutdowns, recovery from the last read position, and more. Fluentd is the best tool I have found so far for these needs. It is an open-source data collector for a unified logging layer. It works with both Apache and Nginx, but in this tutorial we will focus on Nginx.
Installation on Ubuntu + nginx
Let's start and go through the process of setting everything up. The setup was performed on Ubuntu 18.04 + Nginx 1.14.
1. Install fluentd
curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent3.sh | sh
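To verify that the agent was installed, you can print its version (a quick sanity check, assuming the default package layout; the install script is published per Ubuntu release codename, so pick the one matching your system):
# print the installed td-agent version
/usr/sbin/td-agent --version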
2. Create a BigQuery dataset
First, you need a Google Cloud account, and you have to go through the process of creating a project and setting up service account credentials. You can refer to the documentation if you face any issues.
We then have to choose between using the CLI and the visual interface. If you prefer the command line, check out bq. It is a Python-based command-line tool for BigQuery.
From the visual console, you can create the dataset like this:
After creating the dataset, you need to create a table and a schema. For the naming, we use voxpow:logs:ubuntu (project:dataset:table).
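If you prefer the command line, a minimal sketch of the same step with bq (assuming the project is called voxpow, as used throughout this article):
# create the "logs" dataset in the "voxpow" project
bq mk --dataset voxpow:logs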
3. Create a JSON schema to handle the web logs + server hostname + vhost name
An important step is to define the schema with all the fields, which should match the data.
[ { "type": "TIMESTAMP", "name": "time" }, { "type": "STRING", "name": "remote" }, { "type": "STRING", "name": "host" }, { "type": "STRING", "name": "user" }, { "type": "STRING", "name": "method" }, { "type": "STRING", "name": "path" }, { "type": "STRING", "name": "code" }, { "type": "INTEGER", "name": "size" }, { "type": "STRING", "name": "referer" }, { "type": "STRING", "name": "agent" }, { "type": "STRING", "name": "http_x_forwarded_for" }, { "type": "STRING", "name": "hostname" }, { "type": "STRING", "name": "vhost" }, ]
If you use an Apache server, this schema needs to be adapted, as the log formats are different.
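If you saved the schema above to a file, for example schema.json (a hypothetical filename), the table itself can also be created with bq:
# create the "ubuntu" table in the "logs" dataset from the local schema file
bq mk --table voxpow:logs.ubuntu ./schema.json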
4. Install the Fluentd Google BigQuery plugins
sudo /usr/sbin/td-agent-gem install fluent-plugin-bigquery --no-ri --no-rdoc -V
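To confirm the plugin is visible to td-agent, you can list its bundled gems (assuming the default td-agent installation paths):
# the plugin should appear in td-agent's gem list
sudo /usr/sbin/td-agent-gem list | grep fluent-plugin-bigquery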
5. Configure fluentd to read the nginx access log for this vhost and upload to BigQuery
Location of the configuration file: /etc/td-agent/td-agent.conf
<source>
  @type tail
  tag voxpow-logs
  path /var/log/nginx/access.log
  <parse>
    @type nginx
  </parse>
  buffer_type file
  buffer_path /tmp/tdagent.voxpow-logs
  pos_file /var/log/td-agent/nginx.pos
</source>

# add servername
<filter voxpow-logs>
  @type record_transformer
  <record>
    hostname ${hostname}
  </record>
</filter>

# add vhost
<filter voxpow-logs>
  @type record_transformer
  <record>
    vhost voxpow.com
  </record>
</filter>

# forwarding to bigquery plugin
<match voxpow-logs>
  @type bigquery_insert
  auth_method json_key
  email [email protected]
  json_key /home/voxpow/voxpow-aSdx3sf0spf.json
  project "voxpow"
  dataset logs
  tables ubuntu
  <inject>
    time_key time
    time_type string
    time_format %Y-%m-%dT%H:%M:%S.%NZ
  </inject>
</match>
This is probably the hardest part, as you need to adapt the configuration to your needs. We are also using vhosts, which adds some complexity but allows us to extend this functionality to more services.
If you want to read a different log format, you can change the <parse> section inside <source>.
All secrets for the connection to BigQuery are placed in a JSON key file in a secure location, which is not accessible via HTTP. This is the relevant line:
json_key /home/voxpow/voxpow-aSdx3sf0spf.json
If something is not working as expected, or just to confirm that everything is fine, check the Fluentd logs.
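For example, you can follow the agent's log file while requests are coming in:
# follow the td-agent log in real time
sudo tail -f /var/log/td-agent/td-agent.log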
6. After making sure that the user fluentd runs as (td-agent by default) has read access to your Nginx access logs, start (or restart) fluentd.
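A quick way to verify the read access, assuming the default td-agent user:
# try to read the access log as the td-agent user; a "Permission denied" error means access needs to be fixed
sudo -u td-agent head -n 1 /var/log/nginx/access.log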
You can check the current status of the service with:
sudo systemctl status td-agent
You can also start and stop the service:
sudo systemctl start td-agent
sudo systemctl stop td-agent
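And after changing td-agent.conf, a restart applies the new configuration:
sudo systemctl restart td-agent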
If you face a permission issue, which can be identified from the logs, you can use this simple bash script to grant read permissions on the log directory:
#!/bin/sh
chmod -R 755 /var/log/nginx
Then set up a cron job that runs it at an interval adapted to your website's load, as shown below. Keep in mind that the log rotation service replaces the access log file, and the permissions change along with it.
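A minimal sketch of such a cron entry, assuming the script above is saved as /usr/local/bin/fix-nginx-log-perms.sh (a hypothetical path) and made executable:
# open the root crontab for editing
sudo crontab -e
# add a line like this to re-apply the permissions every 30 minutes (adjust to your logrotate schedule)
*/30 * * * * /usr/local/bin/fix-nginx-log-perms.sh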
7. Congratulations, you have just set up your web access logs to stream into BigQuery.
If there are any errors, you will find them in the log file: /var/log/td-agent/td-agent.log
Server logs in BigQuery
If everything is configured correctly, you should be able to see the hits in the BigQuery console.
Example SQL queries
You can adapt and use some of these example queries, but keep in mind that if you have a lot of rows, you can be charged for a large query; the golden rule is to ask only for the information you really need.
/* Get the total number of rows */
SELECT count(*) FROM `voxpow.logs.ubuntu`

/* Get the last 100 records */
SELECT * FROM `voxpow.logs.ubuntu` ORDER BY time DESC LIMIT 100

/* Get all records whose user agent contains a given string */
SELECT * FROM `voxpow.logs.ubuntu` WHERE agent LIKE '%google%'

/* Get all records whose status code is not 200 (OK) */
SELECT * FROM `voxpow.logs.ubuntu` WHERE code NOT LIKE '%200%'

/* Get all records with status code 404 (Not Found) */
SELECT * FROM `voxpow.logs.ubuntu` WHERE code LIKE '%404%'
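The same queries can also be run from the terminal with bq, for example (standard SQL assumed):
# count all rows from the command line
bq query --use_legacy_sql=false 'SELECT count(*) FROM `voxpow.logs.ubuntu`'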
Charts
You can build various charts by exporting the data directly to Data Studio.
Resource usage
In the first few weeks, we saw some high CPU peaks, but they were due to a misconfiguration in td-agent.conf. This article helped us find and resolve the problems.
Summary
In this article, we discussed an example configuration for streaming logs into BigQuery. There are a lot of server log analysis tools, but I found the following key points very hard to get with other solutions:
- analysis of Googlebot crawling;
- real-time statistics for API usage, available in different command-line interfaces;
- the possibility to quickly see IP information and request frequency;
- a data warehouse that stores this information for further analysis.
Using this approach, it is easy to perform a detailed analysis of search engine crawlers and get insights into how they crawl your website, what factors have an impact on the frequency of visits, and so on.