Architecting for Real-Time Big Data Analytics

I was recently invited to speak in Naarden on experiences with architecting a big data platform. These are the slides that I used to facilitate the talk.

Architecting for Real-Time Big Data Analytics from Rob Winters

TravelBird AWS Case Study

AWS did a really nice write-up on what my team has been working on, published here.

Critical considerations in BI landscape

Solution providers (See Gartner BI Quadrant for full vendor overview):

The problem space that most organizations face leans towards visual exploration rather than traditional BI. There are also many open source initiatives seeing heavy development, for example Airpal by Airbnb.

Traditional BI flow:

  • ETL->Pre-aggregate to cubes->Present (mostly) static reports

A more modern approach:

  • ELT->Direct query against cleaned event tables


  • (Near) real-time flow
  • Allows rapid iteration against the data / data definitions
  • Lower maintenance (fewer cubes)
  • Deep analysis on reporting data

Trade-offs:

  • Storage must be highly performant due to the lack of aggregations
  • Can be more complex for end users due to higher dimensionality

Challenges BI faces today:

  • Rapidly changing schemas in unstructured/semi-structured data
  • Merging of small and big data for your reporting; most platforms work well for one or the other
  • Achieving scalability of your BI team as more and more employees consume data in a highly interactive manner

Emergent paradigms driving design consideration:

  • Self-service is a must due to employee expectations
  • Mobile is rapidly becoming significant
  • Cycle time for reporting is reducing due to success in agile development
  • Analytics becoming more important relative to reporting

Common solutions to support these paradigms:

  • Structured/cleaned event bus backing reporting
  • Real time aggregations using (for example) Apache Spark
  • Lambda architecture data models to support reporting

Best practices:

  • Isolate financial reporting from operational big data reporting
  • <5 seconds for a report to load
  • Put data management in your BI team’s hands, with report development embedded in your ops teams


The Role of Self-Service in your Business Intelligence Strategy

The BI industry is currently host to a variety of debates, from whether Hadoop will replace the data warehouse to the internet of things and real-time analytics, but one of the topics that most directly impacts your business partners is much simpler: to what degree should a BI team build out a self-service infrastructure? According to some vendors, self-service analytics are bullshit; according to others, self-service is the future of BI. Whichever camp you belong to, there are five questions you should ask yourself before moving down the path of self-service:

1. What business benefit do I hope to achieve? Far and away, this is the most critical question to ask before moving forward. Self-service can afford an organization a variety of benefits: faster report development, better analytical capacity in other departments, a wider variety of views on the data to address related but different questions. Identifying and prioritizing the desired benefit(s) will be critical to moving towards a self-service organization.

2. Is my data ready for self-service? Many BI teams have built up cumbersome data structures and flows over the years to meet report demands and complex business rules. However, a self-service data model will need to be much simpler if users will be working directly against what you make available; the average business user will not have the knowledge or willingness to memorize that D_ORD_CHG_CD represents the order status code while F_ORD_CHG_AMT represents the order amount. Scoping your data and structuring it for layman consumption will be critical to getting user buy-in.

3. Are my users ready for self service? Organizations often put out self-service tools as “just another reporting solution”, but the real benefit to self-service comes from changing how the organization interacts with your data. Properly assessing the training required to empower your organization’s decision making should happen before rolling out new platforms.

4. Is my BI team ready for new questions? Just as the delivery mechanism and user interactions change, the questions posed to your analysts and developers will differ. Your BI team will need to be ready to partner with the business in discovering how and why via exploration and collaboration, not just delivering the numbers.

5. How will I handle governance, quality control, and data security? By exposing business users to underlying data and allowing them the freedom to create on their own, the risk increases exponentially that someone will see something they shouldn’t or a view will be created which is misleading and/or inaccurate. While this is nothing new (folks have been lying with Excel for years), it creates a new challenge because it is “in the reporting system”. Have a strategy to manage/identify “official” sources and reports, protect users from sharing confidential information, and ensure the quality of publicly available reporting within your environment.

Additional resources:

Insights Conference 2015, Neuromarketing in eCommerce

This week I had the privilege to sit in on a conference discussing neuromarketing: the roles of cognitive psychology, intrinsic motivation/intent, and emotional response analysis in designing products and experiences. It’s an interesting contrast from the usual approaches like A/B testing because rather than simply evaluating the success metrics, it attempts to explain why consumers react the way they do to changes in experience, measured by EEGs and eye tracking. Presenters included Transavia, ING, Blue Mango, Telfort, and Incentro. Some key takeaways:

1. Assumptions of customer preference should always be tested: Blue Mango shared the results of a test based on the idea that customers would favor a brand they already owned; for example, Samsung customers would respond more strongly to a Samsung ad than an LG one. In only one case (Apple) did they find this to be true.

2. Effective personalization should be based on intrinsic needs: Incentro broke down personalization into three theories of why individuals are influenced:

  • Persuasive: The information presented is convincing that this is the right choice

  • Motivation level: The effort required to take the action is low relative to my intrinsic motivation (why not?)

  • Proximity: The offer is presented at a time/place where it aligns to a consumer’s intrinsic needs

The Power of Level of Detail Calculations in Tableau

Just a quick one, we’ve been playing a lot with level of detail calculations lately in Tableau 9 and I have to say, they’re incredible, especially when combined with context filtering. A few useful ones so far:

Customer Lifetime Value

Presuming you have a transactions table:

First transaction: {FIXED [customer_id] : min([transaction_date])}
Tenure month: DATEDIFF('month',[First transaction], [transaction_date])

RFM (Recency, Frequency, Lifetime Value) Modeling

Recency: DATEDIFF('month',{FIXED [customer_id] : max([transaction_date])}, TODAY())
Frequency: {FIXED [customer_id] : countd([order_id])}
Lifetime Value: {FIXED [customer_id] : sum([order_value])}
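The same per-customer aggregates are easy to sanity-check outside Tableau. A minimal stdlib Python sketch (the tuple shape here is hypothetical, not tied to any particular schema):

```python
# Compute recency/frequency/lifetime-value per customer from raw
# transaction tuples; mirrors the FIXED LOD calculations above.
from collections import defaultdict

def rfm(transactions):
    """transactions: iterable of (customer_id, order_id, order_value, date)."""
    agg = defaultdict(lambda: {'last': None, 'orders': set(), 'ltv': 0.0})
    for cust, order, value, date in transactions:
        rec = agg[cust]
        rec['last'] = date if rec['last'] is None else max(rec['last'], date)
        rec['orders'].add(order)
        rec['ltv'] += value
    return {c: {'last_purchase': r['last'],
                'frequency': len(r['orders']),
                'lifetime_value': r['ltv']}
            for c, r in agg.items()}
```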

Salesperson effectiveness (% index vs team average)

Size of Team: {FIXED [team_id] : countd([account_mgr_name])}
Team Revenues: {FIXED [team_id] : sum([order_value])}
Salesperson Index: sum([order_value])/avg([Team Revenues]/[Size of Team])

Those are a few examples, but these are really a game-changer with Tableau; now we can skip entire subqueries and ETL processes, massively accelerating analytics and dashboard development.

Redshift vs Vertica, an analyst's experience

When it comes to conducting analysis on very large structured data sets, there are a limited number of viable players which won’t require a second mortgage and army of consultants to use (looking at you, Exadata). For the last several years my platform of choice has been HP Vertica: it’s fast, scales well, relatively easy to manage/performance tune, and is fairly middle of the pack for pricing. However, for the last several months I’ve been using Redshift as my primary data warehouse and feel ready to make a comparison on where the platforms materially differ.

Data Loading: Structured flat files

  • Vertica: Vertica’s standard loading paradigm is to load a file into memory via a local COPY command. This creates a potentially volatile loading pattern (dependent on network performance) but is extremely fast and parallelizes well if the machine executing the copy statement is in the same network. When running loading jobs like “copy all tables from a production DB” or “I need ad hoc data in the DB for analysis”, Vertica was significantly faster/easier than Redshift.

  • Redshift: Redshift can load from S3, EMR, or DynamoDB (it can also load from another machine via SSH, but the mechanism to do so is enormously cumbersome). Since it lacks any COPY FROM LOCAL functionality, loading data from a local source is both more cumbersome and much slower. However, loading data from Hadoop (presuming it’s an EMR cluster) is much better in Redshift.

Time and Time Series

  • Redshift: Time series analysis is where Redshift falls flat; the database is missing certain basic functions like TO_TIMESTAMP… god help you if you need to work with Unix epochs. Most basic date functions are represented (ex. DATEDIFF, DATE_TRUNC) but there are very few analytical or window functions available.

  • Vertica: In contrast, Vertica has an enormous number of date functions available to assist in analysis of event stream data (although not all functions are implemented in ISO/ANSI standards). Additionally, it’s possible to do advanced time analytics in raw SQL. For example, sessionization of an event stream can be done using conditional true events.
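Conceptually, sessionization just starts a new session whenever the gap between a user's consecutive events exceeds a threshold. A toy Python illustration of the idea (not the Vertica SQL itself):

```python
# Assign a session number to each event timestamp (in seconds): a new
# session starts whenever the gap since the previous event exceeds
# `gap_seconds` (30 minutes by default).
def sessionize(timestamps, gap_seconds=1800):
    sessions, current, prev = [], 0, None
    for ts in sorted(timestamps):
        if prev is not None and ts - prev > gap_seconds:
            current += 1  # gap too large: start a new session
        sessions.append(current)
        prev = ts
    return sessions
```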

JSON Handling

  • Redshift: Redshift allows direct loading of JSON objects to tables and will map each key to the appropriate column during loading. This works well when the objects are consistently defined, as it offers high performance on read, but it is a liability if new keys are constantly being added: those keys will be silently ignored until the corresponding column is added to the table. Redshift also has excellent support for querying JSON objects stored in strings.

  • Vertica: In contrast, Vertica stores the JSON object as written in a Flex Table, preserving the whole record and using a view to allow access via SQL to the array. The DBA can choose which (if any) columns should be materialized as database columns; this approach is more complicated from an end-user perspective but ensures that data is never lost. Vertica also does not support querying JSON strings directly except by abusing functions developed for use on Flex tables.

Physical Structure Management

  • Vertica: The reason that column stores are so performant is the physical sorting and partitioning of the data by column; like indices, tuning for performance is a long term process and may require many iterations. Vertica allows multiple sort and compression designs for each table, allowing one to optimize for a variety of queries at the cost of load performance. Additionally, changes can be executed on the fly without impacting end users or dropping tables.

  • Redshift: Redshift allows the same level of sorting/node distribution logic, but each table can only have one physical manifestation. Additionally, changes to the physical structure require the table to be dropped and recreated. This significantly slows down the DBA’s work on performance tuning.


So, which one is “better”? Objectively, neither - both platforms bring strengths and weaknesses to the table. I would argue that Vertica has a more mature functionality set and is easier/more flexible to work in, with the downsides of higher cost and more administrative work required.

Providing personal push reporting via Tableau

Tableau is one of my favorite Business Intelligence platforms: it achieves an excellent balance of flexibility and control and allows users to explore the data and create new insights. However, the pricing can be prohibitively expensive for small companies ($1k per user up front, $200 per year afterwards) and not all members of the organization require the full interactor experience: some people just want to see what happened yesterday, last week, or last month. As a result, many companies start with just a limited use PoC and miss a great opportunity to get the organization used to looking at the numbers. At TravelBird we’ve built a Python script to allow everyone in the company (even those without Tableau licenses) to receive a personalized report every day.

The key: tabcmd and URL filters

One of the most powerful Tableau tools available to a BI Manager who wants to share data is tabcmd, which provides rich functionality to generate images, export PDFs and data, and manage a number of other tasks on the server. For example, if the workbook is Daily Sales Report and the default view is Yesterday Sales Summary, then to generate a PNG summary, PDF report, and the underlying data we could execute:

tabcmd login -s -u <admin_username> -p <admin_password>
tabcmd export "Daily Sales Report/Yesterday Sales Summary" --pdf -f "C:\output\Sales.pdf"
tabcmd get "Daily Sales Report/Yesterday Sales Summary" --png -f "C:\output\summary.png"
tabcmd export "Daily Sales Report/Yesterday Sales Summary" --csv -f "C:\output\sales.csv"
tabcmd logout

As imaginable, this could all be compiled into an email using a tool like Blat and distributed to a group. However, the trick to personal emails is Tableau’s URL filters, which can be used both on the web and with tabcmd. For example, the dashboard URL alone will show the organization’s total performance, but adding ?SalesPerson=Joe Smith to the end will filter the data to only Joe. The real power of this trick is that the filter does not need to be visible on the dashboard, only present in the workbook filters. Provided the reports are built with the desired filter in the data set, almost any report can be automatically personalized on distribution.

Distributing personal reports

To make this work, three things are required: the base report to distribute, a table containing the filter clause and recipient email, and an email provider who plays well with Python (we use SendGrid as they have a convenient library). In our job we’ve added additional controls to provide multiple scheduling options, choose on a recipient basis what attachments are included, and track the last successful distribution, but the base job is relatively straightforward (presume the reference table contains salesperson, region, and email):

import sendgrid
import pyodbc
import os
import time

sendgrid_user = 'my_user'
sendgrid_pass = 'my_pass'

tmp_folder = 'C:\\temp\\'
# Path to the tabcmd executable (adjust to your install)
tabcmd = 'C:\\Program Files\\Tableau\\Tableau Server\\bin\\tabcmd'
tab_server = 'https://our-tableau-server'  # placeholder server address
tab_user = 'my_admin'
tab_pass = 'my_admin_pass'

def return_recipients():
    odbc_cxn = pyodbc.connect('DSN=my_db')
    cxn = odbc_cxn.cursor()
    cxn.execute("select salesperson_name, region, salesperson_email from ref_table")
    return cxn.fetchall()

def generate_files(fileName, filteredName):
    attachments = []
    os.system('"%s" login -s %s -u %s -p %s' %
              (tabcmd, tab_server, tab_user, tab_pass))
    # Export the filtered view as PNG, PDF, and CSV
    for fmt in ('png', 'pdf', 'csv'):
        os.system('"%s" export "%s" --%s -f "%s%s.%s"' %
                  (tabcmd, filteredName, fmt, tmp_folder, fileName, fmt))
        attachments.append('%s.%s' % (fileName, fmt))
    os.system('"%s" logout' % tabcmd)
    return attachments

def send_email(user, toAddress, subject, attachments):
    sg = sendgrid.SendGridClient(sendgrid_user, sendgrid_pass)
    message = sendgrid.Mail()
    message.add_to(toAddress)
    message.set_from('bi-team@example.com')  # placeholder sender address
    message.set_subject(subject)
    body = 'Good morning %s,\n' % user
    body += 'Attached is the %s report for %s.\n\n' % \
        (subject, time.strftime('%a, %d %b %Y'))
    body += 'For questions, please contact the BI team.'
    message.set_text(body)
    for attachment in attachments:
        message.add_attachment(attachment, tmp_folder + attachment)
    sg.send(message)

def main():
    baseReport = 'Daily Sales Report/Yesterday Sales Summary'
    fileName = baseReport.split('/')[0]
    email_list = return_recipients()
    for recipient in email_list:
        filteredName = baseReport + '?Region=' + recipient.region
        attachments = generate_files(fileName, filteredName)
        send_email(recipient.salesperson_name, recipient.salesperson_email,
                   fileName, attachments)
        for f in attachments:
            os.remove(tmp_folder + f)  # clean up before the next recipient

if __name__ == '__main__':
    main()
And that’s it! With this small piece of code, every member of the organization can start the day with their personalized report and have the full data if they need to do further analysis. Compared to the Tableau email platform, this script-based approach offers a huge amount of flexibility:

  • Central BI management of distributions and distributed content
  • Database checks can be added for given reports (ex. check that sales data is complete at 7 AM before sending out the PDF)
  • Multiple reports or PNGs can be consolidated into one email
  • The entire message body can be tuned/customized to meet business needs, for example populating summary statistics into the subject line for a better experience

What might it look like? With the Superstore sample data:

personalized report

Data Vault Loading Automation

Note: These posts on data vault automation are an in-depth discussion of functionality described in a 10 Mar 2015 presentation on the Bijenkorf’s DWH architecture

When we began our vault rollout, one of the initial challenges we faced was the variety of relational systems to work with: MySQL (various versions), Oracle 11g, Vertica, etc. While heterogeneous replication would be an ideal solution in this landscape, the majority of servers were partner hosted, so deploying code or features on those machines was not possible. As such, the easiest solution available was to deploy code generators based on metadata derived from our staging environments. This approach offers a number of advantages:

  • Easy integration of new tables: Rather than building new ETLs or modifying code to add new tables from existing sources, we can integrate the new information automatically by deploying the DDL in the DWH

  • Resiliency against source changes: In the event that the source system adds additional columns, changes referential integrity, etc, ETLs are not impacted. Only INT->VARCHAR changes in source data models cause issues.

  • Automatic full vs incremental loading: Using the last timestamp off our ODS view layer allows the scripts to only pull the new records from the sources; in the event that nothing is currently in the ODS, the engine will automatically pull everything.
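The full-vs-incremental decision above can be sketched very simply (hypothetical helper; the real generator derives table and column names from staging metadata):

```python
# Build the extract query for a source table: if the ODS already holds
# data, pull only rows newer than the last loaded timestamp; otherwise
# pull everything.
def build_extract_query(table, ts_column, last_loaded_ts):
    base = 'SELECT * FROM %s' % table
    if last_loaded_ts is None:
        return base  # empty ODS: full load
    return "%s WHERE %s > '%s'" % (base, ts_column, last_loaded_ts)
```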

Engine workflow

High level data vault workflow model

Real-time recommenders, part four - Creating a recommender


Now that we have both an offline and real-time event stream available, we need to calculate a recommender model to determine what to serve users. Fundamentally, most item-based recommenders work to answer the same question: “given product 1, what products (2-N) should I show a customer (and how good of a recommendation are they)”? There are a number of complex modeling techniques available and off-the-shelf functions in tools like Mahout and Prediction.IO, but one of the best ways to familiarize oneself with how these models work is to calculate one yourself. To that end, an easy starting point is a slope one model, a simple but non-trivial way to calculate item and user similarities.
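To make the mechanics concrete, here is a toy weighted Slope One in Python (hypothetical in-memory {user: {item: rating}} dict; a production model would run over the event tables described earlier):

```python
# Weighted Slope One: learn the average rating difference between every
# item pair, then predict an unseen item by shifting the user's known
# ratings by those differences, weighted by co-rating frequency.
from collections import defaultdict

def train(ratings):
    diffs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(lambda: defaultdict(int))
    for user_items in ratings.values():
        for i, ri in user_items.items():
            for j, rj in user_items.items():
                if i != j:
                    diffs[i][j] += ri - rj
                    counts[i][j] += 1
    for i in diffs:
        for j in diffs[i]:
            diffs[i][j] /= counts[i][j]  # average difference (item i - item j)
    return diffs, counts

def predict(user_ratings, item, diffs, counts):
    num = den = 0.0
    for j, rj in user_ratings.items():
        c = counts[item].get(j, 0)
        if c:
            num += (diffs[item][j] + rj) * c
            den += c
    return num / den if den else None
```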

Real-time recommenders, part three - Event tracking (processing events)


In order to utilize our event data from Kinesis we have configured two separate feeds: a non-real-time feed for reporting and archiving purposes and a real-time feed for use cases like recommendations and last-viewed items. While Snowplow has use-case-specific libraries for both, we built our own utilities for the following reasons:

  • Flexibility: Building our own library allows easier loading to multiple database engines and/or unique archive flows

  • Speed: Using a basic loader allows event rotation into the database once a minute without any loss of information.

  • Archive Structure: The default snowplow libraries do not make any provision for DB loading and archiving to S3; for reliability purposes we wanted both

Database loading

In order to process the data for loading into the database we are using a mixture of the default Kinesis Enricher, sed, a small Python script for event validation, and cronolog to write to a file. The exact flow looks something like:

  1. The Kinesis enricher, with stderr redirected to stdout:

     /var/lib/snowplow_enrich/snowplow-kinesis-enrich-0.2.0 --config /var/lib/snowplow_enrich/snowplow.conf 2>&1

  2. sed to remove checkpointing data from the event stream:

     sed -e '/^\[/d' | sed -e 's/\[pool.*$//'

  3. A Python script which validates whether or not the event has been processed before and validates the custom JSON objects inside the structured event labels

  4. cronolog, which rotates log files every minute:

     cronolog /data/logs/snowplow%Y%m%d%H%M.dat

One issue with using cronolog is that it splits exactly on the minute, resulting in one event being cut in half every minute. Concatenation of the logs before loading (for example, every 15 minutes) reduces the impact to one event per load, but this is still unacceptable - we would like 100% of events written to the database. To correct this, our loader uses a mixture of head and tail to grab the end of the last event from the head of the log file currently being written. Once the log files are correctly merged, loading and backups become trivial:


RIGHTNOW=`date +%Y%m%d%H%M`
MINUTEAGO=`date -d '-1 minute' +%Y%m%d%H%M`
YEAR=`date +%Y`
MONTH=`date +%m`
DAY=`date +%d`

find /data/logs -cmin +1 -type f -exec mv "{}" /data/logs/loading/ \;
ls /data/logs/loading/snowplow*.dat | sort | xargs cat > /data/logs/loading/temp.dat
head -n 1 /data/logs/snowplow$MINUTEAGO.dat>>/data/logs/loading/temp.dat
tail -n +2 /data/logs/loading/temp.dat > /data/logs/loading/snowplow_merged_$RIGHTNOW.dat
rm /data/logs/loading/snowplow2*.dat
rm /data/logs/loading/temp.dat

/opt/vertica/bin/vsql -h my.loadbalancer.path -U $1 -w $2 -c "copy stg.snowplow_events from local '/data/logs/loading/snowplow_merged_$RIGHTNOW.dat' with delimiter E'\t'  rejected data '/data/logs/loading/snowplow_rejected_$RIGHTNOW.dat'"

gzip /data/logs/loading/snowplow_merged_$RIGHTNOW.dat

s3cmd put /data/logs/loading/snowplow_merged_$RIGHTNOW.dat.gz s3://bykdwh/snowplow/$YEAR/$MONTH/$DAY/
rm /data/logs/loading/snowplow_merged_$RIGHTNOW.dat.gz

Far simpler than the default loaders and extremely performant in our environment. Another advantage to having the log files on a minute (or five minute) basis is that it makes it extremely easy to restore an hour or a day if we need to reload.

Real time processing

While the process outlined above is fantastic for reporting purposes, a one minute delay on the events is still not fast enough for a “real time” experience for users. Our objective for users is that the data is analyzed and recommendations updated before their next pageview; this gives us five seconds or less to process the event. To achieve this throughput we are using inline processing in a Python script to read specific data from the event, check records against redis, and execute a number of puts to redis. While this approach is somewhat cumbersome and lower performance than using a language like Scala, it is still possible to achieve throughput of a few thousand events per second per thread - more than sufficient for most businesses.

The functional workflow of real-time processing looks something like:

import sys
import redis

rec_set = redis.StrictRedis()  # connection to the recommendation store

def process_event(event):
    # Do some event processing to extract the correct user, event type
    # (based on data in the event label), and SKU - details omitted
    return (user_id, event_type, product_sku)

def main():
    pipe = rec_set.pipeline()
    for line in sys.stdin:
        split_tab = line.rstrip('\n').split('\t')
        user_id, event_type, product_sku = process_event(split_tab)

        if product_sku is not None:
            # Do some lookups for recommendations and brand information,
            # then load those recommendations to redis via the pipeline
            pass
    pipe.execute()

if __name__ == '__main__':
    main()

Real-time recommenders, part two - Event tracking (front end, collector, and bus)


In order to generate recommendations for users, the first task is to generate user data. One of the best options available right now is Snowplow, an open source event tracker built around the AWS stack. Using Snowplow plus Google Tag Manager, one can get event tracking running in just a few hours.

Front End Tracker

Rather than rewriting already excellent documentation, simply follow the notes here. However, a few tips for getting started:

  1. Start small: Capture pageviews only and validate against Google Analytics. This will immediately expose potential issues in your front end (we initially missed almost half of our events due to ajax refreshes).

  2. Be prepared for lots of data: Without having a baseline, it’s easy to underestimate how much data will arrive. When we turned on page pings on a very conservative interval (every two minutes), event volume doubled even though average time on page was well under a minute.

  3. Use the existing event types: Snowplow is prebuilt to support a number of different events; if at all possible, use those rather than “rolling your own” to capture similar data.

Event Capture and Bus

In order to get real-time recommendations, data must be processed in real time; this rules out the “production ready” Clojure collector and forces one to use the Scala streaming collector. The good news is that it appears to be extremely stable; we have been using it in production for several months with no significant issues.

Configuration of the collector is relatively straightforward. Since our event volume can swing significantly during the day, we set up an AWS image which auto-starts the collector and configured an Elastic Beanstalk application for this purpose. The thresholds used are CPU >65% for launching an instance and <25% for removing one, as CPU saturation was the greatest cause of event processing delays and dropped data. One major advantage of the streaming collector is that machines can be turned on and off with virtually no loss of data; with the Clojure collectors, one can auto-scale up but must manually scale down to ensure that web logs rotate correctly.

While it is possible to write the event stream directly to stdout (for easy processing), we write the records to Kinesis as we have two consumer streams on the data set: one for real time recommendations and one for recording the data to the DB and archiving data to S3. One issue with Kinesis is that it does not autoscale. A quick estimate of event volume to bus utilization is that one needs one shard for every 10k events per minute, depending on how much additional data is written per event.
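That rule of thumb is easy to encode when provisioning shards (the 10k events/minute/shard figure is the estimate above, not an official Kinesis limit):

```python
import math

# Shards needed under the rule of thumb of ~10k events per minute per
# shard; always provision at least one shard.
def shards_needed(events_per_minute, events_per_shard_per_minute=10000):
    return max(1, int(math.ceil(events_per_minute / float(events_per_shard_per_minute))))
```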

Automating Tableau Backups via AWS


Tableau is far and away my favorite Business Intelligence tool for its ease of development and performance on small (<20M row) extracts, but I have always found the backup approach disappointing. Each backup can account for a substantial amount of disk space, there is no default backup rotation, and Windows offers insufficient tooling to effectively handle these tasks. Enter boto, the Python library for AWS. Using boto plus S3 and tabadmin, we were able to build a backup solution for Tableau which is:

  • Infinitely space scalable

  • Fully automated and can email on failure

  • Handles daily/weekly/monthly backup rotation automatically. Currently we keep daily for two weeks, weekly for six months, and monthly for an indefinite period.

  • Low cost. As of February 2015 the storage cost of S3 is approximately $0.03 per GB per month, so keeping backups only runs a few tens of dollars per year.


There are several critical functions to generating and rotating the backups effectively:

1. Options manager: To read configuration details like AWS keys, bucket names, file naming conventions, etc

2. Backup extractor: To pull the backup, in this case just a system call to tabadmin

3. Uploader: Used to place the file in S3 and validate its placement. Given the file sizes, this must be a multipart upload

4. Backup Rotator: Used to eliminate backups which no longer fit our retention policies

Options Management

Options were handled using the argparse and ConfigParser libraries to store configuration details on our Tableau server; our configuration options were:

days_to_keep_daily: 14
days_to_keep_weekly: 180

tempdir: C:/BACKUPS/
tabadmin_path: C:/Program Files/Tableau/Tableau Server/8.3/bin/tabadmin.exe
filename_base: TABLEAU_BACKUP

This affords the flexibility to easily change rotation schedule in the future.

Backup Extraction

Handled using a simple system call to the tabadmin command based on the parameters set in the config file:

from subprocess import call

def generate_extract(configs):
    call(configs.get('Tableau', 'tabadmin_path') + " backup -d " +
         configs.get('Tableau', 'tempdir') + configs.get('Tableau', 'filename_base'))

Uploading and validating

To upload the backup and validate the object, we built a number of small functions to upload the file and confirm its placement. Our files are loaded with a naming convention s3://{bucket}/{tableau_path}/{year}/{month}/{filename_root}-YYYY-MM-DD.tsbak which makes it easy to recover the appropriate backup. Specific functions used include:

  1. A multipart file uploader which is “borrowed” wholesale from boto’s S3 examples

  2. A small function to validate the upload was successful using bucket.list()

File rotation

Since we are backing up daily but have a tiered retention strategy, the simplest approach is to generate a list of valid file names and remove any backups which do not conform to it. Specifically, we will need:

  1. All file names which fit in the daily retention window (14 days ago through today)

  2. All file names which match a Sunday (chosen snapshot day) and are between six months ago and today

  3. All file names which have “01” as the date, regardless of year/month

As a day can match 0-3 of these conditions, the easiest solution was to generate three distinct lists, merge them, and deduplicate them using the set function. We can then pull a list of all keys from S3 for our Tableau backups, filter them, and delete the undesired files. One caveat with S3 keys from boto: the file name is treated as the entire path after the bucket, so it is necessary to parse out the relevant information to find the correct file. To do so, the function below is used:

def remove_old(connection, configs):
    keys = connection.list('tableau_backups')
    valid_keys = keys_to_keep(connection, configs)
    keylist = []
    for key in keys:
        keyname = key.key.rsplit('/', 1)[1]
        if keyname not in valid_keys:
            keylist.append(key)  # queue non-conforming backups for removal
    for j in keylist:
        j.delete()  # boto Key.delete() removes the object from S3
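The list of valid names itself falls out of the retention rules described above. A simplified sketch (it assumes the TABLEAU_BACKUP-YYYY-MM-DD.tsbak naming convention; the monthly horizon is bounded here for illustration, while the real policy is indefinite):

```python
from datetime import date, timedelta

# Generate the set of backup file names that should be retained: daily
# for two weeks, Sunday snapshots for six months, and the 1st of each
# month over a bounded horizon.
def keys_to_keep(today=None, daily_days=14, weekly_days=180,
                 monthly_days=3650, base='TABLEAU_BACKUP'):
    today = today or date.today()
    name = lambda d: '%s-%s.tsbak' % (base, d.isoformat())
    days = [today - timedelta(days=i) for i in range(monthly_days + 1)]
    daily = [name(d) for d in days[:daily_days + 1]]    # last two weeks
    weekly = [name(d) for d in days[:weekly_days + 1]
              if d.weekday() == 6]                      # Sundays, six months back
    monthly = [name(d) for d in days if d.day == 1]     # the 1st of each month
    return set(daily + weekly + monthly)
```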


In addition to the functionality outlined above, it’s also helpful to add email notification in the event of failure and some additional logging. That said, this utility massively reduces data-loss risk with Tableau and introduces a useful, low-cost mechanism to maintain an easily restorable history of the server.

Real-time recommenders, part one - Overview of Technology


This is the first in a series of posts on how to implement near-real-time tracking and personalization for your users based on off-the-shelf and open source technologies. While many organizations perceive “real time” as challenging, the solution outlined here has been tested to scale easily to ten million pageviews per day, a significantly higher volume than most organizations encounter.

Fundamental components

In order to create recommendation for your users, a few fundamental components must be present in your technical environment:

  • Event Tracker: A mechanism to capture user activity in near real time and make it available for scoring/analysis.

  • Data Platform: The environment where your models will be calculated and history stored. Typically column store databases or Hadoop are used.

  • Data Presentation Platform: Where the user recommendations will be stored and presented from. High throughput of reads/writes is critical here.

  • Front End Implementation: How your users will consume the recommendations from your models.

That’s it, regardless of whether you are personalizing for ten users or ten million. From start to finish, an entire recommender architecture can be developed and deployed in about a week. In the next post I’ll discuss the implementation of Snowplow, an open source event tracking stack.


I recently decided to start blogging on all things BI including dashboarding, technology, data science, etc.