Resume

Education

Columbia University, Statistics (M.A.), Sep 2017 - Dec 2018
University of Birmingham, UK, Mathematics (B.Sc. with Honors), Sep 2014 - Jul 2016
Beijing Jiaotong University, Information and Computing Science (B.Sc.), Sep 2012 - Jul 2014

Code: Python, Go, Bash, R, MATLAB
Database: PostgreSQL, Delta Lake, MySQL, Redis, Neo4j
ETL: Apache Airflow, Celery, Kafka
Operations: Git/GitHub, Docker, Docker Compose, CircleCI, Kubernetes (Certified Kubernetes Administrator)
Cloud: AWS (S3, EC2, CloudFront, Route 53, IAM)
Data-related: Machine Learning Algorithms and Modeling, Data Visualization, Feature Engineering
Web Framework and Dashboard: Django, Flask, Gin, Dash
Big Data: Spark

Data Engineer, Apr 2019 - Present

New York, NY, USA

Built a production ETL system using Apache Airflow on Kubernetes that centralizes multiple sports data sources into a PostgreSQL data warehouse, and set up a Delta Lake layer in AWS S3 for data scientists' research
Designed and implemented the data access layer codebase, which provides caching, serialization, and parallelization and serves as a dependency of other research and production repositories used by a team of 40+ data scientists
Implemented a serializable filter pipeline compatible with the data science framework, simplifying the delivery of research models to production
Configured Kafka as a producer in Docker Compose and added data migrations with SQLAlchemy and Alembic in the pre-event pricing service
Set up continuous integration and delivery for repositories with CircleCI and monitored project dependencies with Poetry
Wrote integration tests and unit tests for multiple projects with pytest

Data Analytics Intern, Feb 2019 - Apr 2019

New York, NY, USA

Deployed a brand sentiment analysis dashboard for pitches to multiple clients such as Coca-Cola and TIAA
Refactored the brand analysis pipeline, making it 6 times faster at a scale of 500,000 rows of data
Implemented an internal influencer analytics GUI for business analysts, making influencer report generation more efficient and reducing cross-team communication to a one-click operation

Data Engineer Intern, Sep 2018 - Dec 2018

New York, NY, USA

Built an event data visualization dashboard using Python Dash and Flask
Implemented an LSTM model to predict network signal strength and improved it by tuning the network architecture
Used PySpark to write network data generation and feature engineering functions, automating the preparation of the prediction dataset

Data Science Intern, May 2018 - Aug 2018

Holmdel, NJ, USA

Wrote object-oriented production code in Python for data generation functions that connect to SQL Server and extract equipment error data in formats suitable for feature engineering
Completed a prototype of the automated numerical pipeline, wrapping classification and regression functions and time series models such as ARIMA into an object-oriented PySpark script to process data from multiple clients more efficiently through the Alenza product
Maintained the internal-consumption analytics pipeline and collaborated with the business team to identify KPIs efficiently

Data Analyst, Aug 2016 - Jul 2017

Beijing, Beijing, China

Generated additional valuable features for a cold-start credit model for a wine agency using the coefficient of variation method
Implemented logistic regression, neural network, and SVM models, achieving approximately 77% AUC for the Application and Behavior credit models, which were embedded in a semi-automatic scoring system to increase auditing efficiency
Used pandas and nltk to analyze telephone call detail records with clustering, social network analysis, text analysis, and association rules to screen for potential organized fraud
Completed an automated, web-embeddable R script that connects to the MySQL database to retrieve financial data and compute aging analysis for different products, such as digital products, to support the development of post-loan management and anti-fraud systems

Combined the CTPN text detection model with OCR to recognize wine label images
Designed a matching method using n-grams and the Tversky index to match the contents of wine labels against a wine dataset
Wrote and embedded callable functions using Flask for an end-to-end web app

Improved the efficiency of CSV table scripts by applying the database index concept
Combined database queries and common REST API calls using Flask to support user stories
Implemented object-oriented functions for manipulating CSV tables and relational tables, with self-designed unit tests

Implemented iris recognition with Li Ma’s method as an individual project
Used PCA and SVM to classify 4 types of brain state data extracted with the MATLAB SPM12 interface
Tuned the hyperparameters of a CNN on MNIST data using Keras

Implemented text analysis using the tm and SnowballC packages in R to preprocess lyrics and build a bag of words
Characterized different songs using an LDA topic model for use in music recommendation
Designed the ER diagram of a hotel booking database and used SQL to manipulate the data in the database