And what you can do about it


The online machine learning (ML) community agrees that most of the problems that budding ML products face are engineering challenges[1][2]. Leading ML researchers and practitioners seem to confirm, time and time again, that what makes a difference is fast iteration and the ability to add great ML features. The promise is that by investing your resources in continuous ML delivery practices, you can achieve significant gains from great features and a good deal of iteration, not from focusing on the latest cutting-edge ML algorithms, techniques, or frameworks.

In this write-up we will assume that you are already convinced…

The 2020 ACM SIGKDD International Conference on Knowledge Discovery and Data Mining is a core conference in the AI/ML/DS field. The conference delivered on its promise of being all-encompassing in terms of breadth and depth of topics covered. Here are some highlights that can help you sharpen that saw of yours.

This post is in collaboration with Seong Kang and Ian Horton.

The sessions highlighted are in no way a complete view of the KDD 2020 conference. The following sessions are covered in this post:

Sampled Topics

Best Research Paper Award

Recommendation systems

  • Hands On Tutorial: Building Recommender Systems with…

The Missing Link of ML Infrastructures. Part 3/3

This is part 3 of a series of articles: “ML Feature Stores: A Casual Tour”. See also Part 1 and Part 2.

In part 1 of this tour, we covered the pros and cons of having an ML feature store in your ML organization. In part 2, we took a look at the components and goals of four feature stores, Hopsworks Feature Store, Go-Jek Feast, Uber Michelangelo, and Twitter Feature Libraries/Store.

In part 3 of this tour, we look at more examples of publicly described feature stores:

  • Airbnb Zipline Feature Store / Feature engineering framework.
  • Netflix Fact/Feature…

The Missing Link of ML Infrastructures. Part 2/3

This is part 2 of a series of articles: “ML Feature Stores: A Casual Tour”. Part 1 is here and Part 3 is here.

In part 1 of this tour, we covered the pros and cons of having an ML feature store in your ML organization. We also discussed the use cases that benefit from such an ML infrastructure investment.

In part 2 of this tour, we take a look at the components and goals of four feature stores, both open source and proprietary, from hyper-scale internet companies:

  • Hopsworks Feature Store.
  • Go-Jek Feast.

The Missing Link of ML Infrastructures. Part 1/3

This is part 1 of a series of articles: “ML Feature Stores: A Casual Tour”. Part 2 is here and Part 3 is here.

The combined pressures of data monetization and privacy compliance are escalating at a dramatic pace. Machine learning (ML) and data science (DS) teams are asked to ship autonomous and intelligent products at a faster rate. This comes with many hurdles. A central challenge for ML practitioners who want to scale their work is feature management. In the past few years, ML workflows have been improving rapidly. …

Image by Speedy McVroom from Pixabay

I recently had the chance to attend the O’Reilly Software Architecture Conference NYC 2020. I think the goal of that conference is to remind us to make software architecture matter. These days software architecture is at the core of nearly every branch of technology. Besides, it seems that the breadth of knowledge I need, to not seem extra perplexed during architectural reviews, is expanding every year. Regularly sharpening the saw is a necessity.

This conference strikes a nice balance. It introduces the latest in architecture and design, while at the same time providing guidance for legacy software that needs to…

Robert Delaunay, 1913, “Premier Disque”.

The ML Metrics Trap

Reporting small improvements on inadequate metrics is a well-known machine learning (ML) trap. Understanding the pros and cons of ML metrics helps practitioners build personal credibility and avoid prematurely proclaiming victory. ML practitioners invest significant budgets to move prototypes from research to production. The central goal is to extract value from prediction systems, and offline metrics are crucial indicators for promoting a new model to production.

In this post, we look at three ranking metrics. Ranking is a fundamental task. It…

Practical Software Engineering Principles for ML Craftsmanship

Why We Should Care about Clean Machine Learning Code (CMLC)?

Check out my latest (in progress) book about this topic with code examples, in-depth discussions, and more!

Machine learning (ML) pipelines are software pipelines after all. They are full of needless complexity and repetition. This is mixed with thick opacity, rigidity, and viscosity of design. With these issues, ML failures are growing in importance at an unprecedented pace. We have seen self-driving cars hitting pedestrians in Arizona. We learnt about the gender bias of large-scale translation systems. We saw how simple masks hacked Face ID systems in smartphones. We heard about other “smart” systems making bad decisions (e.g…

Time travel in Scikit-learn’s codebase.

The Incident

It all started one morning with a data scientist colleague asking me:
“I have static versions of everything for the test. But for the RandomForest Classifier, it’s not guaranteed that when I pass my static-training-set I will get the same predictions as my static-prediction-table. I did set the random seed in Numpy and all but ¯\_(ツ)_/¯….”

Testing ML code is necessary to build robust machine learning pipelines. But the testing literature is not as thorough as in other software engineering disciplines. For ML code there is a fair amount of uncertainty when writing integration tests. These integration tests actually…
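To make the colleague's question concrete, here is a minimal sketch of a repeatable test for a randomized model. It uses only the Python standard library. The toy `bootstrap_vote` classifier, the `predict_all` helper, and the data are hypothetical stand-ins for a real estimator and are not taken from the article. The idea is to pass an explicitly seeded random generator into the model instead of relying on global state. scikit-learn estimators such as `RandomForestClassifier` accept a `random_state` argument for a similar purpose.

```python
import random
from collections import Counter


def bootstrap_vote(train, query, n_rounds=25, rng=None):
    """Toy randomized classifier (hypothetical, for illustration only).

    Takes a majority vote of 1-nearest-neighbour predictions over
    bootstrap resamples of `train`, a list of (feature, label) pairs.
    Passing a seeded random.Random as `rng` makes the result repeatable.
    """
    rng = rng or random.Random()  # unseeded by default: not repeatable
    votes = Counter()
    for _ in range(n_rounds):
        # Resample the training set with replacement.
        sample = [rng.choice(train) for _ in train]
        # Predict with the label of the closest resampled point.
        nearest = min(sample, key=lambda pair: abs(pair[0] - query))
        votes[nearest[1]] += 1
    return votes.most_common(1)[0][0]


def predict_all(train, queries, seed):
    """Run the toy classifier on every query using one explicit seed."""
    rng = random.Random(seed)
    return [bootstrap_vote(train, q, rng=rng) for q in queries]


# Static training set and static queries, as in the colleague's test setup.
TRAIN = [(0.0, "a"), (1.0, "a"), (2.0, "b"), (3.0, "b"), (1.6, "a"), (1.4, "b")]
QUERIES = [0.2, 1.5, 2.8]

# Same data and same seed give identical predictions on every run.
first = predict_all(TRAIN, QUERIES, seed=42)
second = predict_all(TRAIN, QUERIES, seed=42)
assert first == second
```

Injecting the generator keeps the test independent of whatever other code does to global random state. It does not by itself protect against behaviour changes between library versions.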

Simple evolution guidelines for data science products

A view on the evolution of data science products

Evolving data science products for new teams can be a daunting task. There are conflicting requirements embedded in the nature of data science products. First constraint: product teams want to move proofs of concept to market as fast as possible. Second constraint: data science (DS) teams need a growing infrastructure for efficient experimentation. Data scientists usually lack adequate infrastructure to meet their growing needs. Third constraint: engineering teams (ENG) value system stability and reliability. This makes them unable to keep pace with the “idea generation” coming from the research teams. The combination of these constraints leads to routine delays in…

Moussa Taifi, Ph.D.

Senior Data Science Platform Engineer, Team Lead, Xandr-AT&T
