Dr. Alejandro Correa Bahnsen is the Chief Data Scientist at Easy Solutions. With a passion for machine learning, he considers himself a technology evangelist for data science. He has more than a decade of experience developing predictive models and applying them to real-world problems such as cyber fraud, human resources analytics, credit scoring, churn modeling and direct marketing.
In addition to advising Easy Solutions' executive team and customers on unique fraud challenges, Alejandro manages the data science team, tests big data processing engines and researches the application of deep learning to electronic fraud prevention. He also creates and develops machine learning algorithms for phishing detection, user identification and malware prevention, and he continually improves Easy Solutions' products with data science and artificial intelligence capabilities.
Alejandro holds a PhD in Machine Learning and Pattern Recognition from the University of Luxembourg. He has published over 15 academic and industrial papers in noteworthy peer-reviewed publications. He has also taught econometrics, financial risk management, machine learning and natural language processing at the university level.
Classifying Phishing URLs using Deep Recurrent Neural Networks
Organizations trying to protect their users from phishing attacks have a hard time dealing with the massive number of emerging sites, each of which must be identified and labeled as either malicious or harmless before users can safely access it. All major web browsers use reactive blacklists to block access to the URLs listed in them. One drawback of such a reactive method is that a phishing URL can only be blocked after it has been submitted to the blacklist. Until someone submits the URL and the blacklist is updated, web users are at risk. This session presents the results of research into two machine learning approaches, each tested to gauge how accurately it can distinguish between genuine URLs and phishing sites. The first is a feature engineering approach followed by a random forest classifier; the second is a novel method using deep recurrent neural networks. At 98.7% detection accuracy, the neural model has higher overall prediction performance and does not need expert knowledge to create features. The downside is that its inner workings cannot be interpreted easily. By comparison, the random forest model achieved an average score of 93.5% and required expert knowledge to create its features. However, it can be interpreted more easily, because its input features and their significance can be inspected.
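To make the contrast between the two approaches concrete, here is a minimal sketch of how a URL might be prepared as input for each kind of model. It is an illustration only and is not taken from the research: the specific features in `lexical_features`, the character vocabulary, the character-level encoding, and the `max_len` value are all assumptions, since the abstract does not say which features or encoding were used. Neither classifier is trained here.

```python
import math
import string
from collections import Counter
from urllib.parse import urlparse

# Hypothetical character vocabulary: index 0 is reserved for padding,
# and one extra index is reserved for characters outside the vocabulary.
VOCAB = {ch: i + 1 for i, ch in enumerate(string.printable)}
UNKNOWN = len(VOCAB) + 1


def lexical_features(url: str) -> dict:
    """Hand-crafted features of the kind a random forest could consume.

    These particular features are illustrative assumptions, not the
    feature set used in the research described above.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    total = len(url)
    counts = Counter(url)
    # Shannon entropy of the character distribution, in bits.
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values()) if total else 0.0
    return {
        "url_length": total,
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "has_at_symbol": "@" in url,
        "uses_https": parsed.scheme == "https",
        "char_entropy": entropy,
    }


def encode_chars(url: str, max_len: int = 150) -> list:
    """Turn a URL into a fixed-length list of integer character indices.

    A recurrent network could consume such a sequence directly, so no
    hand-crafted features are needed. The URL is truncated to max_len
    characters and right-padded with zeros.
    """
    indices = [VOCAB.get(ch, UNKNOWN) for ch in url[:max_len]]
    return indices + [0] * (max_len - len(indices))


if __name__ == "__main__":
    example = "http://paypal.com.example-login.ru/secure@1"  # made-up URL
    print(lexical_features(example))
    print(encode_chars(example, max_len=60))
```

The first function shows where the expert knowledge goes: someone has to decide that dots in the host name or an `@` symbol are worth measuring, and those named features are also what makes the resulting model easier to inspect. The second function needs no such decisions, which matches the trade-off described above: less manual effort, but an input representation that is harder to interpret.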