Career Profile

I am a data scientist, ML engineer, and academically trained statistician. I currently work in the exciting field of ML-platform-as-a-service, bringing my experience and insight to building tools for data scientists and ML engineers. In my latest effort, I am building a distributed vector database that will power the next generation of massive-scale machine learning applications.

I pride myself on my sensitivity to business needs, user experience, and product usability. I excel at abstracting technical work to cater to diverse audiences. I strive to build data products and data ecosystems that enable people of all technical backgrounds to make data-informed decisions.



Pinecone

Founding Engineer
2020 - Present

Keywords: distributed data platform, MLOps, recommender system, ANN search, ML tool UX, technical marketing.

Pinecone is a distributed data platform for massive-scale (billions of vectors) machine learning applications such as recommender systems, text/image search, and fraud detection. It is optimized for deep learning applications and complex MLOps workloads.

The Pinecone team and I develop cutting-edge approximate nearest neighbor (ANN) search algorithms and implementations that outperform state-of-the-art open-source solutions. In addition, I am responsible for designing the user experience (UX) and ensuring Pinecone is a delightful tool for data scientists and machine learning engineers to use. I also wrote the documentation and the SDK.

I work closely with the marketing team, providing both engineering and data analytics support. I also lead the Technical Marketing effort: I write, edit, and manage technical content, including ML-focused demo notebooks, tutorials, and blog posts.

Soda Technology

Co-founder, CTO
2016 - 2019

Recommendation System: I build the recommendation system that powers Soda’s daily operations. It outperforms humans by 50% in terms of revenue.

MLOps and Data Engineering: I architect and lead the development of Soda’s MLOps, data lake, and ETL pipelines. Orchestration runs on Airflow, the data lake and data warehouse are implemented with S3 and Kinesis, and ML applications are deployed as serverless functions.

Business Intelligence: I foster a culture of data-driven decision-making by increasing the company’s data literacy and helping decision-makers learn to use Google Data Studio for better business intelligence analyses.

Software Development: I manage and develop complex in-house ERP and CRM systems that boost the company’s operational efficiency, which translates into substantial cost savings.

Nokia Bell Labs

Member of Technical Staff
2015 - 2016

I am part of the data monetization team. We build POCs to demonstrate novel applications using telecom data. Projects include (1) user segmentation analysis based on cell phone browsing data and (2) city bike lane planning.


    Publications

  • "Scalable privacy-preserving data sharing methodology for genome-wide association studies." In: Journal of Biomedical Informatics. DOI:10.1016/j.jbi.2014.01.008. arXiv:1401.5193.
  • Yu, F., Fienberg, S.E., Slavković, A.B., Uhler, C.

  • "Differentially-private logistic regression for detecting multiple-SNP association in GWAS databases." In: Privacy in Statistical Databases. DOI:10.1007/978-3-319-11257-2_14. arXiv:1407.8067.
  • Yu, F., Rybar, M., Uhler, C., Fienberg, S.E.

  • "Scalable privacy-preserving data sharing methodology for genome-wide association studies: an application to iDASH healthcare privacy protection challenge." In: BMC Medical Informatics and Decision Making. DOI:10.1186/1472-6947-14-S1-S3.
  • Yu, F., Ji, Z.

  • "O Privacy, Where Art Thou?: Genomics and Privacy." In: CHANCE. DOI:10.1080/09332480.2015.1042736.
  • Slavković, A.B., Yu, F.

  • "A unified framework for evaluating online user treatment effectiveness, with advertising applications." In: KDD 2014: Proceedings of the 2nd Workshop of User Engagement Optimization.
  • Wang, P., Meytlis, M., Yu, F., Yang, J.

  • "Whole exome sequencing reveals minimal differences between cell line and whole blood derived DNA." In: Genomics. DOI:10.1016/j.ygeno.2013.05.005.
  • Schafer, C.M., [and 13 others, including Yu, F.]

    Invited Talks

  • Practical methods for privacy-preserving genome-wide association study data sharing. Joint Statistical Meetings. Seattle. 2015.
  • Differentially-private logistic regression for detecting multiple-SNP association in GWAS databases. Privacy in Statistical Databases. Eivissa. 2014.
  • Privacy-preserving data sharing methodology for genome-wide association studies. Joint Statistical Meetings. Boston. 2014.
  • Healthcare data privacy protection competition. UC San Diego. San Diego. 2014.