Research Intern - Data Scientist NLP

Who we are

CrowdSec is a cybersecurity product built by a human-sized team of experts in their fields. We have worked remotely since day one. Our open-source software is used in over 175 countries by tens of thousands of users, and we intend to reach millions more. In 2021 alone, we grew software adoption by 2200%, and we are still growing by about 1% per day on average.

Our mission is to deter opportunistic and organized cybercrime through an Internet-scale, real-time security network. This network is powered by thousands of users who share with one another the attacks they have blocked with our open-source software. By coupling this local behavior analysis (think next-generation Fail2ban) with the sharing of its findings within our community, we create crowdsourced cyber threat intelligence of unprecedented magnitude that will counter the vast majority of technical hacks.

If you …

have a solid scientific background in Machine Learning (including deep learning), Statistics, and NLP techniques.
are a fast learner and have a genuine interest in cybersecurity.
can design and deliver working prototypes in a fast-paced environment and then work with the core team to put them into production.
are autonomous, do not hesitate to share new ideas with the team, and challenge existing solutions.
have programming skills in Python, including Machine Learning libraries, data visualization packages, and data manipulation. Knowledge of at least one compiled language (Golang, C++, etc.) is a big plus.
have a track record in Kaggle competitions. If not, you can also take a look at our current challenge to classify VPN and proxy addresses here: https://www.kaggle.com/competitions/vpn-classification

Then this is an ideal opportunity if you are looking for a six-month (end-of-studies) internship starting in September 2023 and want to work in a tech-savvy environment!

Internship subject

The CrowdSec security engine analyzes malicious behaviors in logs, but will soon also operate at the HTTP layer, similarly to a Web Application Firewall (WAF). This offers several advantages over processing logs alone:

It allows blocking attacks before they reach the application layer: a request can be blocked before being processed.
HTTP requests follow a standardized format that is easier to analyze than raw application logs.
A WAF can inspect the full content of the HTTP request, whereas log analysis is limited to what the web server logs (i.e., method and URI).

Your role is to develop new algorithms to detect malicious activity in HTTP requests.

Successful approaches to this problem already exist, such as the paper Sec2vec: Anomaly Detection in HTTP Traffic and Malicious URLs, in which researchers apply common NLP techniques (tokenization, word embeddings…) to train a classifier on a public, labeled dataset of malicious HTTP requests. Your goal will be first to reproduce these results, then to implement more complex architectures and move to an unsupervised framework. The research should be carried out with production constraints in mind, which means always taking model footprint and complexity into account. However, this does not limit the scope of the research to simple models: more complex architectures are to be considered (deep learning, LLMs, …). The final implementation will be done with the help of the CrowdSec core team, so knowledge of Golang is not mandatory and is only considered a plus.
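
To give a feel for the starting point, here is a minimal sketch of the "tokenize, then classify" idea in plain Python. It is not the Sec2vec method: word embeddings are swapped for simple token counts, and the classifier is a small multinomial naive Bayes written by hand. The `tokenize` function, the `NaiveBayes` class, and the eight training requests are all hypothetical and exist only for illustration; a real study would use the public labeled dataset mentioned above.

```python
import math
import re
from collections import Counter, defaultdict

# Keep alphanumeric runs as words and every other non-space character as its
# own token, since symbols such as ' < / often carry the attack signal.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]")


def tokenize(request):
    """Lowercase a raw HTTP request line and split it into tokens."""
    return TOKEN_RE.findall(request.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing over token counts."""

    def fit(self, requests, labels):
        self.token_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for request, label in zip(requests, labels):
            self.token_counts[label].update(tokenize(request))
        self.vocab = {t for c in self.token_counts.values() for t in c}
        return self

    def predict(self, request):
        total = sum(self.class_counts.values())
        scores = {}
        for label, n in self.class_counts.items():
            counts = self.token_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for token in tokenize(request):
                score += math.log((counts[token] + 1) / denom)
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical toy data, far too small for any real conclusion.
TRAIN = [
    ("GET /index.html", "benign"),
    ("GET /images/logo.png", "benign"),
    ("POST /login user=alice", "benign"),
    ("GET /search?q=shoes", "benign"),
    ("GET /search?q=' OR 1=1 --", "malicious"),
    ("GET /item?id=1 UNION SELECT password FROM users", "malicious"),
    ("GET /page?name=<script>alert(1)</script>", "malicious"),
    ("GET /../../etc/passwd", "malicious"),
]

model = NaiveBayes().fit([r for r, _ in TRAIN], [l for _, l in TRAIN])

if __name__ == "__main__":
    for req in ("GET /index.html",
                "GET /products?id=2 UNION SELECT name FROM users"):
        print(req, "->", model.predict(req))
```

A model this small has a negligible footprint, which is the kind of baseline against which heavier architectures (embeddings, deep learning, LLMs) would have to justify their cost.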

This job is no longer available