Year 2022 in review - a period of growth

Written on December 31, 2022

If I had to choose one word to describe 2022, it would be this word: Growth.

Learning Scala as a Python Programmer: Higher-Order Functions for Functional Data Pipeline Design

Written on February 5, 2022

In my previous post on functional programming features for control flow, I provided an overview of function composition, and discussed the use of higher-order functions and recursion as a form of functional iteration. In this post, we will explore higher-order functions further and see how they can be used in designing functional data pipelines.
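
As a rough illustration of the idea (a minimal Python sketch, not code from the post itself), higher-order functions can build pipeline steps that are then composed into a single function. The names `compose`, `keep_if`, `transform` and the sample weather records are hypothetical and exist only for this example.

```python
from functools import reduce


def compose(*steps):
    """Chain pipeline steps left to right into a single function."""
    def pipeline(records):
        # Feed the output of each step into the next one.
        return reduce(lambda acc, step: step(acc), steps, records)
    return pipeline


def keep_if(predicate):
    """Higher-order function: build a filtering step from a predicate."""
    def step(records):
        return [r for r in records if predicate(r)]
    return step


def transform(func):
    """Higher-order function: build a mapping step from a record function."""
    def step(records):
        return [func(r) for r in records]
    return step


# Hypothetical pipeline: drop missing readings, then add a Fahrenheit column.
clean_readings = compose(
    keep_if(lambda r: r["temp_c"] is not None),
    transform(lambda r: {**r, "temp_f": r["temp_c"] * 9 / 5 + 32}),
)

raw = [{"station": "A", "temp_c": 30.0}, {"station": "B", "temp_c": None}]
result = clean_readings(raw)
```

Each step returns a new list instead of modifying its input, so `raw` is left untouched and the same pipeline can be rerun on it with the same outcome.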

Year 2021 in review - a year of transitions

Written on January 2, 2022

One word to describe 2021: Transitions.

Talk: Designing Functional Data Pipelines for Reproducibility and Maintainability (PyData Global 2021)

Written on October 30, 2021

Getting my proposal accepted for PyData Global a second time was a surprise, given the polarising nature of my talk topic on functional programming and my string of disastrous live talks due to technical issues. For my last virtual live talk of the year, I used an upgraded conferencing setup and modified the structure of my talk to include references to functional design patterns in Apache Spark. For the first time in close to a year, the curse of Murphy's Law in virtual live talks was finally broken.

Talk: Designing Functional Data Pipelines for Reproducibility and Maintainability (EuroPython 2021)

Written on July 29, 2021

I designed this talk as a momentum-driven, content-packed, visually engaging talk that is a culmination of my experiences in designing data pipelines at scale and learning functional programming. Thinking that it would make for a great comeback talk to demonstrate and reinforce my learnings (and a taster for my ongoing series on learning functional programming and Scala), I pictured myself engaging the audience with illustrations of functional programming concepts and design patterns for data pipelines. Murphy's Law had other ideas though - if not for a huge stroke of luck and some help, I can't imagine what it would have spelled for my reputation as a tech speaker.

Learning Scala as a Python Programmer: Functional Programming Features for Control Flow

Written on July 4, 2021

In my previous post on key principles of functional programming, I explained how the functional programming paradigm differs from imperative programming, and discussed how the concepts of idempotency and avoidance of side effects are linked to the property of referential transparency that enables equational reasoning in functional programming. Before we dive into some of the features of functional programming, let's start with a personal anecdote from my first 3 months of writing Scala code.
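
To make these terms concrete, here is a small Python sketch (an illustration only, not taken from the post). The functions `normalise` and `tag_with_count` are hypothetical names chosen for this example.

```python
def normalise(name: str) -> str:
    """Pure function: the result depends only on the argument."""
    return name.strip().lower()


_call_count = 0


def tag_with_count(name: str) -> str:
    """Impure function: reads and updates hidden state on every call."""
    global _call_count
    _call_count += 1  # side effect
    return f"{name}-{_call_count}"


# Referential transparency: a call to normalise can be replaced by its value
# anywhere in the program without changing behaviour.
once = normalise("  Alice ")

# Idempotency: applying normalise to its own output changes nothing further.
twice = normalise(once)

# The impure function gives different results for the same argument, so its
# calls cannot be swapped for a fixed value when reasoning about the code.
first = tag_with_count("alice")
second = tag_with_count("alice")
```

Because `normalise` has no side effects, `once` and `twice` are equal and either can stand in for the call, which is the substitution that equational reasoning relies on; `first` and `second` differ even though the argument is the same.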

Learning Scala as a Python Programmer: Key Principles of Functional Programming

Written on May 9, 2021

In my previous post on my motivations for learning Scala, I stated that one of my key reasons for learning Scala for data engineering is that the language is primarily designed for functional programming. Before we dive into the details of writing functional programs, it is important for us to understand the key principles of functional programming and how these programming principles are useful when designing reproducible data pipelines at scale.

Learning Scala as a Python Programmer: Motivations

Written on April 18, 2021

One of my tech goals in 2021 is to learn Scala. My key reason for learning Scala is to learn Functional Programming for data engineering. The question is: Why go through the trouble of learning Scala if Functional Programming is supported in Python?

Year 2020 in review - when tech conferences go virtual

Written on December 31, 2020

At the start of 2020, I set a goal to speak at 4 tech conferences including one in Europe. COVID-19 disrupted my plans completely, and I was forced to adapt to the new reality of virtual conferences. Here's my journey from regional to international speaker in the midst of a pandemic, and lessons learnt along the way.

#Shitoberfest: How free T-shirts ruined #Hacktoberfest2020

Written on October 3, 2020

Hacktoberfest is an annual event organized by DigitalOcean that celebrates open-source contributions. Occurring every October, the goal of Hacktoberfest is to encourage developers (of all backgrounds and skill levels) and companies to make positive contributions to the open-source community. All this sounds like an initiative with good intentions on paper - incentivise developers to contribute to open-source projects. Unfortunately, the organizers underestimated the extent of what people are willing to do for the sake of getting free T-shirts (or freebies in general).

Talk: Speed Up Your Data Processing: Parallel and Asynchronous Programming in Data Science (PyCon TW 2020 Edition)

Written on September 9, 2020

Speaking at PyCon Taiwan has been one of my key priorities in year 2020 even before the pandemic (another of my key priorities was to speak at a conference outside of Asia). The fact that there will be an offline audience at the other end of the remote call is also a "pull" factor in my decision to speak at PyCon Taiwan this year - I really miss being able to see live audience responses at a physical conference!

Talk: Speed Up Your Data Processing (EuroPython 2020 Edition)

Written on July 24, 2020

I originally designed this talk to be interactive and audience-driven, with the pace of the talk driven by casual "coffee shop" banter with the audience. With the COVID-19 pandemic showing no signs of abating, my original plan of making my European speaking debut at Ljubljana was disrupted. I still wanted to make my European speaking debut with the same talk somehow; hence, I submitted the talk proposal for EuroPython 2020 and hoped for the best.

Lightning Talk: Just-in-Time with Numba - 10-minute Remote Python Pizza 1.0 version

Written on April 26, 2020

I spoke quite a bit about Numba in my first conference talk last year, and felt it deserved more attention. Hence, I decided to craft a jam-packed lightning talk that focuses on the non-trivial aspects of Numba - all in 10 minutes!

Lightning Talk: Just-in-Time with Numba - 7-minute PyLadies version

Written on March 29, 2020

On the morning of 28 March 2020 (Singapore time), I came across a tweet from PyLadies calling for lightning talk submissions for their International Women's Month Lightning Talks Zoom call. Just a day earlier, my data scientist friend had mentioned Numba on Facebook, and I happened to have spoken quite a bit about Numba in my first conference talk last year.

Talk: Speed Up Your Data Processing: Parallel and Asynchronous Programming in Python

Written on March 21, 2020

Year 2020 started out bad. Real bad. The COVID-19 pandemic led to a string of cancellations for tech events (including PyCon) and travel restrictions. As a result of travel restrictions and Business Continuity Plans, many speakers were not able to deliver their talks in person at the eleventh hour, participation was greatly reduced by around 90%, and even the FOSSASIA organizers could not make it to the venue in person due to COVID-19 restrictions. Still, the organizers made the decision to proceed with the event with a mix of offline and online talks with live streaming and chats.

TIL: DataFrame reshaping in Pandas - melt, unstack

Written on February 20, 2020

As a data engineer, part of my daily work involves processing and transforming raw data into data that is ready for analysis. As my development team primarily uses Python for our data science workflow, we often use Pandas to perform operations and transformations on datasets before analysing the data. While we primarily use Pandas for data cleaning and engineering as part of the data science process, sometimes we also have to perform complex data transformations to obtain actionable insights that business users can leverage to improve their processes.

Talk: Exploring Seasonal Insights from Singapore Weather Station Data

Written on January 15, 2020

It started out as a weekend coding exploration of real-time data from the Data.gov.sg APIs, but went on to become something more. 2nd iteration of the talk - given at JuniorDevSG

Year 2019 in review - getting started with speaking at a tech conference

Written on December 31, 2019

At the start of 2019, I set a goal to speak at a tech event. By the end of 2019, I've spoken at 2 tech meetups and 2 conferences. Here's my journey from wide-eyed event attendee to tech conference speaker, and lessons learnt along the way.

Talk: Making Open Weather Data More Accessible: Extracting Seasonal Insights from Singapore Weather Station Data

Written on December 3, 2019

It started out as a weekend coding exploration of real-time data from the Data.gov.sg APIs, but went on to become something more.

Talk: Contributing to pandas documentation for the first time - lessons from open source

Written on November 29, 2019

I use pandas daily, maybe even hourly. I spend time reading the docs, but end up finding my answers on StackOverflow. What better way to mark my first year as a data engineer than by contributing to the docs for pandas?

Contributing to pandas documentation for the first time - and lessons learnt

Written on November 8, 2019

To mark my first year as a data engineer, I started thinking "How can I contribute back to the community that has enabled my work for the past year?" I came across an open issue on documentation for pandas, a popular open-source Python library for data analysis and manipulation, and decided to give it a try. Here's a work-in-progress developer log on the lessons learnt through contributing documentation to an open-source Python library for the first time - and how contributing to open-source projects is not as scary as it might seem to be.

Understanding Python Dependency Management using pipdeptree

Written on October 13, 2019

Dependency management is important, as packages depend on specific versions of other packages in order to run as intended. Typically in a Python project, dependencies are installed from a requirements.txt file, which lists the packages and their dependencies as a flat list. While the package versions are included in the requirements.txt file, the dependency relationships are not explicitly stated.

TIL: Migrating Git repositories manually from GitLab to Azure DevOps (TFS)

Written on October 12, 2019

My development team has been using GitLab on-premise to manage their code repositories. As we are moving development work to our new on-premise development cloud with expanded processing capabilities, we also need to migrate our code repositories to the new development cloud which uses Azure DevOps Team Foundation Server (TFS) for Git workflows. To support the chief architect in the migration, I prepared a quick migration guide for the team's move to Azure DevOps TFS.

Talk: How to Make Your Data Processing Faster - Parallel Processing and JIT in Data Science

Written on September 1, 2019

I did a little talk on how to make your data processing faster as my first-ever conference talk, and it was loads of fun. Toasts, coffee and a barista included in the talk. Oh, and did I mention that it was also my first-ever CFP submission?

Talk: Parallel Processing in Python

Written on August 27, 2019

Sometimes, we just can't make things run indefinitely faster on a single worker.

Musings about Remote Development with Visual Studio Code

Written on August 22, 2019

Keeping code and configuration files in sync between client machine and remote server used to be a drawn-out exercise in personal responsibility via SFTP/SCP. VS Code Remote Development looks to change that - for the better. Here are my notes on VS Code Remote Development.

Accelerating Batch Processing of Images in Python — with gsutil, numba and concurrent.futures

Written on May 27, 2019

In a data science project, one of the biggest bottlenecks (in terms of time) is the constant wait for the data processing code to finish executing. Sometimes, the gigantic execution times even end up making the project infeasible and often force a data scientist to work with only a subset of the entire dataset, depriving the data scientist of insights and performance improvements that could be obtained with a larger dataset.