[fusion_builder_container type=”flex” hundred_percent=”no” equal_height_columns=”no” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” background_position=”center center” background_repeat=”no-repeat” fade=”no” background_parallax=”none” parallax_speed=”0.3″ video_aspect_ratio=”16:9″ video_loop=”yes” video_mute=”yes” border_style=”solid”][fusion_builder_row][fusion_builder_column type=”1_1″ layout=”1_1″ background_position=”left top” border_style=”solid” border_position=”all” spacing=”yes” background_repeat=”no-repeat” margin_top=”0px” margin_bottom=”0px” animation_speed=”0.3″ animation_direction=”left” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” center_content=”no” last=”true” hover_type=”none” min_height=”” link=”” background_blend_mode=”overlay” first=”true”][fusion_alert type=”notice” accent_color=”#808080″ background_color=”#ffffff” dismissable=”no” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” sticky_display=”normal,sticky” animation_direction=”left” animation_speed=”0.3″ animation_delay=”0″]

This is an improved version of my original article here.

[/fusion_alert][fusion_text]

In this tutorial, we will go through all the settings needed to stream log data from a server to Google BigQuery.
There are other tutorials on this topic on the Internet, but I found them either unclear or full of outdated information. Various changes are needed for everything to work as expected, and I will cover them in this article. I have tested this approach in production, and after several weeks and huge amounts of data, the results are good.

But why do we need this?

The first question that comes up is: why do we need this at all? Why not just download the logs and analyze them with Pandas or a tool like GoAccess (an awesome open-source web log analyzer)?

I wanted to see the data in real time and to be able to perform some analysis of the traffic. Our voice platform also exposes an API, and I need more information about each user’s hits.

That is why I searched for the best way to stream logs into Google BigQuery, where we can inspect every request, IP address, and visited page.

It is like Google Analytics for all server traffic, not only for visitors detected via JavaScript. Well, not quite, but we are close …

With this technique, we achieve:

  • the ability to monitor hits on our server in near real time;
  • the possibility to see how search engine bots crawl our website;
  • a check for whether we have too many visits from specific IP addresses.

And all the data is stored in BigQuery, where we can run lightning-fast queries using SQL syntax. On top of that, we can integrate the data with Google Data Studio or custom data visualization solutions.

What is Google BigQuery?

It is a serverless, highly scalable, and cost-effective cloud data warehouse designed for business agility. It allows us to perform blazing-fast queries and get the data we need with simple SQL — in a fraction of the time.
Also, BigQuery’s high-speed streaming insertion API provides a powerful foundation for real-time analytics, making our data immediately available for analysis.
That is perfect for our needs, as we normally want to handle large amounts of information, as you can imagine from the crawl rates of regular bots.

And what is Fluentd?

In order to send all this streaming information, we need some kind of connection from our server to BigQuery. Shipping log data with custom scripts is hard to get right, as we would have to implement the logic for log rotation, server shutdowns, recovery from the last position, and so on. Fluentd is the best tool I have found so far for these needs. It is an open-source data collector for a unified logging layer. It works with both Apache and Nginx, but this tutorial focuses on Nginx.

Installation on Ubuntu + nginx

Let’s start and go through the process of setting everything up. The setup was performed on Ubuntu 18.04 with Nginx 1.14.

1. Install Fluentd

curl -L https://toolbelt.treasuredata.com/sh/install-ubuntu-xenial-td-agent3.sh | sh

Note that this is the Xenial (16.04) install script; since we are on Ubuntu 18.04, the matching Bionic script from the same toolbelt (install-ubuntu-bionic-td-agent3.sh) should be preferred if it is available.

2. Create a BigQuery dataset

First, you need to register in Google Cloud and go through the process of creating a project and setting up service account credentials. You can refer to the documentation if you face any issues.

We then have to choose between the command line and the visual interface. If you prefer the command line, check out bq, a Python-based command-line tool for BigQuery.

From the visual console, you can create a dataset like this:

Create Dataset in BigQuery

After creating the dataset, you need to create a table and define its schema. For the naming, we use voxpow:logs:ubuntu (project:dataset:table).

3. Create a JSON schema to handle the web logs + server hostname + vhost name

An important step is to define a schema whose fields match the data:

[  { "type": "TIMESTAMP", "name": "time" },
   { "type": "STRING",    "name": "remote" },
   { "type": "STRING",    "name": "host" },
   { "type": "STRING",    "name": "user" },
   { "type": "STRING",    "name": "method" },
   { "type": "STRING",    "name": "path" },
   { "type": "STRING",    "name": "code" },
   { "type": "INTEGER",   "name": "size" },
   { "type": "STRING",    "name": "referer" },
   { "type": "STRING",    "name": "agent" },
   { "type": "STRING",    "name": "http_x_forwarded_for" },
   { "type": "STRING",    "name": "hostname" },
   { "type": "STRING",    "name": "vhost" }
]

If you use an Apache server, this schema needs to be adapted, as the log formats are different.
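If you prefer the bq command-line tool mentioned in step 2, the dataset and table can be created from a schema file. Here is a minimal sketch, assuming the Google Cloud SDK is installed and authenticated, the project is named voxpow, and the (arbitrary) file name schema.json:

```shell
# Write the schema from step 3 to a local file
# (note: valid JSON, with no trailing comma after the last field)
cat > schema.json <<'EOF'
[ { "type": "TIMESTAMP", "name": "time" },
  { "type": "STRING",    "name": "remote" },
  { "type": "STRING",    "name": "host" },
  { "type": "STRING",    "name": "user" },
  { "type": "STRING",    "name": "method" },
  { "type": "STRING",    "name": "path" },
  { "type": "STRING",    "name": "code" },
  { "type": "INTEGER",   "name": "size" },
  { "type": "STRING",    "name": "referer" },
  { "type": "STRING",    "name": "agent" },
  { "type": "STRING",    "name": "http_x_forwarded_for" },
  { "type": "STRING",    "name": "hostname" },
  { "type": "STRING",    "name": "vhost" } ]
EOF

# Sanity-check that the file is valid JSON before handing it to bq
python3 -m json.tool schema.json > /dev/null && echo "schema OK"

# Create the dataset and the table (commented out here because they
# require Google Cloud credentials):
# bq mk --dataset voxpow:logs
# bq mk --table voxpow:logs.ubuntu ./schema.json
```

The visual console and bq produce the same result, so use whichever you are more comfortable with.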

4. Install the Fluentd Google BigQuery plugins

sudo /usr/sbin/td-agent-gem install fluent-plugin-bigquery --no-ri --no-rdoc -V

On newer RubyGems versions, the deprecated --no-ri --no-rdoc flags are replaced by --no-document.

5. Configure fluentd to read the nginx access log for this vhost and upload to BigQuery

[/fusion_text][fusion_alert type=”success” accent_color=”#808080″ background_color=”#ffffff” hide_on_mobile=”small-visibility,medium-visibility,large-visibility” sticky_display=”normal,sticky” animation_direction=”left” animation_speed=”0.3″ animation_delay=”0″] Location of the configuration file: /etc/td-agent/td-agent.conf

[/fusion_alert][fusion_text]

<source>
  @type tail
  tag voxpow-logs
  path /var/log/nginx/access.log
  <parse>
    @type nginx
  </parse>
  
  buffer_type file
  buffer_path /tmp/tdagent.voxpow-logs
  pos_file /var/log/td-agent/nginx.pos
</source>

# add servername
<filter voxpow-logs>
  @type record_transformer
  <record>
    hostname ${hostname}
  </record>
</filter>

# add vhost
<filter voxpow-logs>
  @type record_transformer
  <record>
    vhost voxpow.com
  </record>
</filter>

# forwarding to bigquery plugin
<match voxpow-logs>
  @type bigquery_insert
  auth_method json_key
  email [email protected]
  json_key /home/voxpow/voxpow-aSdx3sf0spf.json

  project "voxpow"
  dataset logs
  tables ubuntu

  <inject>
  time_key time
  time_type string
  time_format %Y-%m-%dT%H:%M:%S.%NZ
  </inject>
</match>

This is perhaps the hardest part, as you need to adapt this configuration to your needs. We also use vhosts, which adds some complexity but will allow us to extend this functionality to more services.
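For reference, the stock @type nginx parser expects the default combined-style access log format. Here is a quick local sanity check with a made-up log line (the IP, path, and user agent are illustrative only):

```shell
# A sample Nginx access-log line in the default combined format
line='203.0.113.7 - - [26/Dec/2023:10:15:32 +0000] "GET /api/v1/hits HTTP/1.1" 200 512 "https://voxpow.com/" "Mozilla/5.0 (compatible; Googlebot/2.1)"'

# Check it against the combined-format field layout that feeds the schema
# fields from step 3 (remote, user, time, method + path, code, size,
# referer, agent)
echo "$line" | grep -Eq '^[0-9.]+ [^ ]+ [^ ]+ \[[^]]+\] "[A-Z]+ [^"]* HTTP/[0-9.]+" [0-9]{3} [0-9]+ "[^"]*" "[^"]*"$' \
  && echo "combined format OK"
```

If your log_format directive in Nginx deviates from this layout, the parser (and the schema) have to be adjusted accordingly.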

If you want to read a different log format, you can change the <parse> section inside <source>.

All secrets for the connection to BigQuery are stored in a JSON key file in a secure location that is not accessible via HTTP. That is this part:

json_key /home/voxpow/voxpow-aSdx3sf0spf.json

If something is not working as expected, or just for confirmation, check the Fluentd logs. You can also validate the configuration file with td-agent --dry-run.

6. After making sure that the user Fluentd runs as (td-agent by default) has read access to your Nginx access logs, start (or restart) Fluentd:

sudo systemctl restart td-agent

You can also check, start, and stop the service:

sudo systemctl status td-agent
sudo systemctl start td-agent
sudo systemctl stop td-agent

If you face a permission issue, which can be identified from the logs, you can use this simple shell script to open up read permissions on the log directory:

#!/bin/sh

chmod -R 755 /var/log/nginx

Set it up as a cron job at an interval adapted to your website load, because the log rotation service recreates the access log file and resets its permissions. On Ubuntu, a cleaner alternative is to add the td-agent user to the adm group, which has read access to the Nginx logs: sudo usermod -aG adm td-agent

7. Congratulations, you have just set up your web access logs to stream into BigQuery.

If there are any errors, you will find them in the log: /var/log/td-agent/td-agent.log

Server logs in BigQuery

If everything is configured correctly, you should be able to see hits in the BigQuery console.

server logs in big query

Example SQL queries

You can adapt and use some of these example queries, but keep in mind that if you have a lot of rows, you can be charged for a large query; the golden rule is to query only the information you really need.

/* Count all rows */
SELECT count(*) FROM `voxpow.logs.ubuntu`

/* Get the last 100 records */
SELECT * FROM `voxpow.logs.ubuntu` ORDER BY time DESC LIMIT 100

/* Get all hits whose user agent contains a given string */
SELECT * FROM `voxpow.logs.ubuntu` WHERE agent LIKE '%google%'

/* Get all hits whose status code is not 200 (OK); code is a STRING in the schema */
SELECT * FROM `voxpow.logs.ubuntu` WHERE code != '200'

/* Get all hits with status code 404 (Not Found) */
SELECT * FROM `voxpow.logs.ubuntu` WHERE code = '404'
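Building on these, aggregations are where BigQuery shines. Here is a sketch of a query that counts hits per day for Googlebot, assuming the voxpow.logs.ubuntu table and schema from the steps above:

/* Count Googlebot hits per day */
SELECT DATE(time) AS day, COUNT(*) AS hits
FROM `voxpow.logs.ubuntu`
WHERE agent LIKE '%Googlebot%'
GROUP BY day
ORDER BY day DESC

Since BigQuery charges by the amount of data scanned, selecting only the columns you aggregate on keeps such queries cheap even on large tables.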

Charts

You can build various charts by exporting the data directly to Data Studio.

export to datastudio

Visualization in Data Studio

Resource usage

In the first few weeks, we saw some high CPU peaks, but they were due to a misconfiguration in td-agent.conf. This article helped us find and resolve the problems.

Summary

In this article, we went through an example configuration for streaming logs into BigQuery. There are a lot of server log analysis tools, but I found the following key points very hard to get with other solutions:

  • analysis of Google bot crawling;
  • real-time statistics for API usage, available from different command-line interfaces;
  • the ability to quickly get IP information and request frequency;
  • a data warehouse storing this information for further analysis.

Using this approach, it is easy to perform a detailed analysis of search engine crawlers and get insights into how they crawl your website, which factors affect the frequency of visits, and so on.
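For example, here is a sketch of such a crawl-frequency query, again assuming the voxpow.logs.ubuntu table from the steps above:

/* Most frequently crawled pages by Googlebot */
SELECT path, COUNT(*) AS crawls
FROM `voxpow.logs.ubuntu`
WHERE agent LIKE '%Googlebot%'
GROUP BY path
ORDER BY crawls DESC
LIMIT 20

Comparing this list against your most important pages quickly shows whether the crawler budget is spent where you want it.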

[/fusion_text][/fusion_builder_column][/fusion_builder_row][/fusion_builder_container]

Categorized in:

MLOps, Tutorials,

Last Update: 26/12/2023