Resume

Education

  • Columbia University, Statistics (M.A.), Sep 2017 - Dec 2018
  • University of Birmingham, UK, Mathematics (B.Sc. with Honors), Sep 2014 - Jul 2016
  • Beijing Jiaotong University, Information and Computing Science (B.Sc.), Sep 2012 - Jul 2014

Skills

  • Code: Python, Go, Bash, R, MATLAB
  • Database: PostgreSQL, Delta Lake, MySQL, Redis, Neo4j
  • ETL: Apache Airflow, Celery, Kafka
  • Operations: Git/GitHub, Docker, Docker Compose, CircleCI, Kubernetes (Certified Kubernetes Administrator)
  • Cloud: AWS (S3, EC2, CloudFront, Route 53, IAM)
  • Data-related: Machine Learning Algorithms and Modeling, Data Visualization, Feature Engineering
  • Web Frameworks and Dashboards: Django, Flask, Gin, Dash
  • Big Data: Spark

Experience

SimpleBet Inc.

Data Engineer, Apr 2019 - Present

New York, NY, USA

  • Built a production ETL system using Apache Airflow on Kubernetes that centralizes data for multiple sports into a PostgreSQL data warehouse, and set up a Delta Lake layer on AWS S3 for data scientists' research
  • Designed and implemented the data access layer codebase, which provides caching, serialization, and parallelization as a dependency of other research and production repositories used by a team of 40+ data scientists
  • Implemented a serializable filter pipeline compatible with the data science framework, simplifying the delivery of research models to production
  • Configured a Kafka producer in Docker Compose and added data migrations with SQLAlchemy and Alembic to the pre-event pricing service
  • Set up continuous integration and delivery for repositories with CircleCI and managed project dependencies with Poetry
  • Wrote integration and unit tests for multiple projects with pytest

Weber Shandwick

Data Analytics Intern, Feb 2019 - Apr 2019

New York, NY, USA

  • Deployed a brand sentiment analysis dashboard for pitches to multiple clients, such as Coca-Cola and TIAA
  • Refactored the brand analysis pipeline, making it 6 times faster on 500,000 rows of data
  • Implemented an internal influencer analytics GUI that lets business analysts generate influencer reports more efficiently, reducing cross-team communication to a one-click operation

Eino Inc., Cornell Tech

Data Engineer Intern, Sep 2018 - Dec 2018

New York, NY, USA

  • Built an event data visualization dashboard using Dash and Flask in Python
  • Implemented an LSTM model to predict network signal strength and improved it by tuning the network architecture
  • Used PySpark to write network data generation and feature engineering functions, automating the preparation of the prediction dataset

Avlino Inc.

Data Science Intern, May 2018 - Aug 2018

Holmdel, NJ, USA

  • Wrote object-oriented production code in Python for data generation functions that connect to SQL Server and extract equipment error data in formats suitable for feature engineering
  • Completed a prototype of the automated numerical pipeline, wrapping classification and regression functions and time series models such as ARIMA into an object-oriented PySpark script to process data from multiple clients more efficiently through the Alenza product
  • Maintained the internal-consumption analytics pipeline and collaborated with the business team to identify KPIs

Beijing Yuqian Investment Management Co., Ltd.

Data Analyst, Aug 2016 - Jul 2017

Beijing, Beijing, China

  • Generated additional features for a cold-start credit model for a wine agency using the coefficient of variation method
  • Implemented logistic regression, neural network, and SVM models, achieving an AUC of approximately 77% for application and behavior credit models embedded in a semi-automatic scoring system to improve auditing efficiency
  • Used pandas and NLTK to analyze telephone call detail records with clustering, social network analysis, text analysis, and association rules to flag potential organized fraud
  • Completed an automated, web-embeddable R script that connects to the MySQL database to retrieve financial data and compute aging analysis for different products, such as digital products, supporting the development of post-loan management and anti-fraud systems

Projects

End-to-End Wine Recommendation System - WineMaster

  • Combined the CTPN text detection model with OCR to recognize wine label images
  • Designed a matching method using n-grams and the Tversky index to match wine label contents against a wine dataset
  • Wrote callable functions and embedded them with Flask for the end-to-end web app

Database Implementation Project

  • Improved the efficiency of CSV table scripts by applying the database index concept
  • Combined database queries and common REST API calls using Flask to support user stories
  • Implemented object-oriented functions for manipulating CSV tables and relational tables, with self-designed unit tests

Image Analysis Projects

  • Independently implemented iris recognition using Li Ma’s method
  • Used PCA and SVM to classify 4 types of brain status data extracted with the MATLAB SPM12 interface
  • Tuned the hyperparameters of a CNN on MNIST data using Keras

AI & SQL Project at Columbia Statistics Club

  • Implemented text analysis using the tm and SnowballC packages in R to preprocess lyrics and build a bag of words
  • Characterized different songs using an LDA topic model for use in music recommendation
  • Designed an ER diagram for a hotel booking database and used SQL to manipulate its data