Nginx Requests Not Being Gzipped on CDN Pass-Through

A couple of months ago I ran into a curious situation. Requests served directly by Nginx were being gzipped as expected, but requests passing through the CDN were not. My Nginx settings for compression looked like this:

gzip on;
gzip_comp_level 6;
gzip_http_version 1.1;
gzip_min_length 0;
gzip_types application/json application/x-javascript application/javascript text/plain text/css text/javascript text/xml;
gzip_vary on;
gzip_disable "MSIE [1-6]\."; 

Reproducing the behavior was simple enough. If you hit the URI at the CDN and append a cache-busting string to the end of the URI, such as ?2342343243, the headers would come back without the Content-Encoding: gzip header. Another way to confirm this was to use curl, passing a Via header with any value, as below:


curl -v -H "Accept-Encoding: gzip" -H "Via: 1.1 (AkamaiGHost)" ""

A simple explanation of this issue is that when a request hits the CDN and the object is not cached, the request is passed to origin with a Via header. Unless the Nginx directive gzip_proxied any; is included, Nginx will not compress responses to these proxied requests. To resolve this, add that line to your nginx.conf file. Below is the same example shown earlier, now including the directive:

gzip  on;
gzip_comp_level 6;
gzip_http_version 1.1;
gzip_min_length 0;
gzip_types application/json application/x-javascript application/javascript text/plain text/css text/javascript text/xml;
gzip_proxied any;
gzip_vary on;
gzip_disable "MSIE [1-6]\.";

Passed AWS Solutions Architect Professional!

Greetings! Apologies the blog hasn’t been more active recently, but it’s been a pretty busy time at work and I just finished a long few months of studying and practicing for the AWS Solutions Architect Professional Exam. I sat the exam on April 06, 2018 and passed with a 70%. This was hands down one of the most difficult tests I have ever taken for a certification. I think the trickiest bit is to pay attention to what is being asked rather than what the best technical answer is, as some questions are looking for the most cost effective solution rather than the most technically accurate or resilient solution.



When I sat my first AWS test back in 2016 it was for the Solutions Architect Associate. As I prepared for that I went through the Linux Academy course as well as the A Cloud Guru course. For the professional I went back to A Cloud Guru, as I found their training to be thorough. Training can be found at their site here.

I also found the WhizLab practice exams to be very close to what you can expect to see on the test. They have 5 exams and each provides answer remediation with links to whitepapers. This was helpful in identifying areas where additional study is needed, as well as linking directly to the resources to study with. You can find a link to their page here.


Other Info:

That’s all I have on the test. Stick around for some posts in the next few months where I’ll be talking about some interesting Azure Infrastructure as Code with ARM templates, service fabric, and Windows containers.

Data Warehousing with Snowflake

I recently had a chance to work on a data warehousing project for a client that wanted a centralized repository of data from a number of analytics providers, to run reports against for business intelligence purposes. For this project the client chose Snowflake, a cloud data warehouse. In this blog post I'm going to give an overview of what Snowflake is and how it works, and talk a bit about some of the data loaders I built to facilitate loading analytics data into the warehouse. We'll take a look at the slick architecture used in Snowflake and some of its tools and features. Thanks to the folks supporting Snowflake, who have been nothing short of amazing in their responsiveness to questions, and whose documentation is useful and to the point.

Snowflake Architecture

Architecturally there are 3 main components that underlie the actual data warehouse:

Compute: Snowflake provides the ability to create "Virtual Warehouses", which are essentially compute clusters in EC2 provisioned behind the scenes. Virtual Warehouses can be used to load data or run queries, and are capable of doing both concurrently. They can be scaled up or down on demand and can be paused when not in use to reduce compute spend.

Storage: The underlying file system in Snowflake is backed by S3 in Snowflake's account. All data is encrypted, compressed, and distributed to optimize performance. By the very nature of Amazon S3, the data is geo-redundant and backed by Amazon's industry-leading data durability and availability.

Services: This layer coordinates and handles everything else in Snowflake, including sessions, authentication, SQL compilation, encryption, etc.

By design, each of these 3 layers can be scaled independently, and each is architecturally redundant. For more information about the underlying architecture, visit Snowflake's documentation here.

Data Handling and Connecting To Snowflake

Snowflake is built to handle both relational and nonrelational data, meaning you can create traditional relational databases as well as document-style databases (more akin to NoSQL) with record formats such as JSON and Avro. One of the slick features is that you can query semi-structured data using JSON keys directly in your SQL, as in this example:


select payload:app_version, payload:app, payload:parameters.attributes from app_data_json;


Connecting to your databases in Snowflake is relatively easy, and there are a few different methods to do so. You can use any of the supported ODBC drivers for Snowflake, use the SnowSQL CLI (install instructions are found here), or use the web-based worksheet within your Snowflake account. For the project that I worked on, we used a Linux instance in Azure as a cron machine: API scraper scripts acquired the data, and SnowSQL ran on a cron to load the data into Snowflake.

Setting Up A Snowflake DB

The first step once you’ve logged into your Snowflake account is to create a database. This can be done by clicking the Databases icon in the web interface and choosing the Create option.

Once you've selected the Create dialog you'll receive a prompt to enter a database name and, optionally, a description. You can also click the Show SQL option if you want to get the exact syntax to script DB creation later using SnowSQL.
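For reference, a minimal sketch of what that Show SQL output looks like (the database name and comment here are hypothetical, not from the actual project):

```sql
-- Hypothetical database name and comment; substitute your own.
CREATE DATABASE app_analytics COMMENT = 'Centralized analytics data';
```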

Now that we've created a database it's time to create a table. To do this, click on the database you just created and click the Create button. Once you do this you'll get a web GUI in which you can name your table and define your schema. Much like every other step in this process, there is a Show SQL prompt at the bottom left of the window that you can use to grab the CLI syntax for the same task.
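As a sketch, the equivalent SQL for a simple table that could hold analytics rows might look like this (the table and column names are hypothetical, not the project's actual schema):

```sql
-- Hypothetical table for semicolon-separated analytics CSV rows.
CREATE TABLE app_data (
    report_date DATE,
    app_name    STRING,
    downloads   NUMBER,
    revenue     FLOAT
);
```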

Now that the database and table have been created, we can load some data. Of course, to load data we must first acquire it. I've created a Python-based API scraper that pulls specific data from AppAnnie (a popular mobile analytics SaaS product) and places it into a CSV file on the S3 bucket I've mounted on my filesystem using the S3FS FUSE driver. See example here.

Loading Data

Now that we have data syncing into the S3 bucket we created, we can set up our stage in Snowflake. When setting up a stage you have a few choices: you can load the data locally, use Snowflake's staging storage, or provide info for your own S3 bucket. I chose the latter, as it gives me long-term retention of the data that I can reuse or repurpose down the road.

To create an S3 stage in Snowflake, click on your database and click the Stages tab. Once you've done this, click Create. You will need to give your stage a name (we'll use this in a bit to set up a Snowflake sync cron). You'll also need to provide the S3 URL as well as AWS API keys. Once you've done this, your stage will show up in Snowflake.
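The Show SQL for a stage like this looks roughly as follows (the stage name, bucket URL, and credentials are placeholders, not real values):

```sql
-- Placeholder stage name, bucket, and keys; use your own values.
CREATE STAGE appannie_stage
  URL = 's3://my-analytics-bucket/appannie/'
  CREDENTIALS = (AWS_KEY_ID = '<your-key-id>' AWS_SECRET_KEY = '<your-secret-key>');
```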

Now that we've set up a stage, we next need to set up a file format; this tells Snowflake how to read the data we wish to import. I've chosen semicolon-separated CSV for my example. Click the File Formats tab in your database and click Create. Fill out the resulting dialog box according to your file format and click Finish.
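A sketch of the equivalent SQL for a semicolon-separated CSV format (the format name here is hypothetical):

```sql
-- Hypothetical format name; semicolon-delimited CSV with a header row.
CREATE FILE FORMAT semicolon_csv
  TYPE = 'CSV'
  FIELD_DELIMITER = ';'
  SKIP_HEADER = 1;
```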

Assuming that you have SnowSQL installed on a Linux server that will run a cron to load data from the S3 stage, you'll first need to set up a connection in your .snowsql/config file. A sample connection looks like this:

[connections.testdb]
accountname = mytestaccount
username = testuser
password = testpassword
dbname = test_db
warehousename = LOAD_WH
schemaname = public

Once you have saved this into your config file, you will want to create a .sql file in your service account's home directory telling Snowflake what database and table to load the data into, where to load the data from, and what file format to use when loading the data. Here is an example:
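A minimal load script might look like this (the database, table, stage, and file format names are hypothetical placeholders for your own):

```sql
-- Hypothetical names throughout; COPY INTO pulls staged CSV files into the table.
USE DATABASE test_db;
COPY INTO app_data
  FROM @appannie_stage
  FILE_FORMAT = (FORMAT_NAME = 'semicolon_csv');
```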



Once you've created this file, you can set up crontab to run it with an entry similar to this (note this example runs daily at 17:00 UTC):

0 17 * * * /home/example/bin/snowsql -c testdb -f /home/builder/snowflaktestdb.sql > /dev/null 2>&1

Final Thoughts

Getting to work on a project with Snowflake was, in totality, really fun. I have seen a number of data warehouse designs and approaches, and I believe there is no one-size-fits-all solution. However, my takeaway from Snowflake is that it was super easy to set up: there is the right amount of choice when it comes to provisioning databases, staging storage for data loading, and virtual warehouses for loading and query execution, without having to worry about optimizing and scaling all of the underlying architecture. The feature set, the plethora of connectivity options, and the flexibility to support relational and nonrelational data made this a fun technology to work with. Not only was the data warehousing project itself fun, but so was writing multiple API scrapers and automating the process of acquiring and loading the data end to end.