We have a server! We spent our most recent meeting figuring out who has access to which parts of it and other things like that. The next step is to learn how to navigate the server; I lucked out and worked on a remote server all summer for a research project two summers ago, so I'm already very familiar with navigating a server via terminal/bash commands.
We also ran some analysis on the official-vs-predicted genders and ages of the patients from an earlier script that Professor Stubbs ran. The script used pronouns, age markers, and other mentions of gender/age/etc to determine the age and gender of each patient on a record-by-record basis.
Along with Stephanie and Rebecca, I worked on a script that looked for initial discrepancies between the "official" and "predicted" ages and genders in this file. We found that some patients' gender is never specified in their clinical narrative records, and that some patients' ages are ambiguous. We'll refine these analyses over the course of the next few weeks, but for now, it's an interesting start.
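To give a sense of what a discrepancy check like this can look like, here is a minimal sketch in Python. It is not our actual script: the field names (`patient_id`, `official_gender`, `predicted_age`, and so on) and the example rows are made up for illustration, and it assumes a missing prediction is stored as `None`.

```python
def find_discrepancies(records):
    """Return (patient_id, field, official, predicted) tuples for every
    field whose prediction is missing or disagrees with the official value."""
    issues = []
    for rec in records:
        for field in ("gender", "age"):
            official = rec.get(f"official_{field}")
            predicted = rec.get(f"predicted_{field}")
            # A prediction of None means the narrative never gave enough
            # evidence; anything else is compared against the official value.
            if predicted is None or predicted != official:
                issues.append((rec["patient_id"], field, official, predicted))
    return issues


# Hypothetical rows, not real patient data.
records = [
    {"patient_id": "p1", "official_gender": "F", "predicted_gender": "F",
     "official_age": 54, "predicted_age": 54},
    {"patient_id": "p2", "official_gender": "M", "predicted_gender": None,
     "official_age": 61, "predicted_age": 16},
]

discrepancies = find_discrepancies(records)
for patient_id, field, official, predicted in discrepancies:
    print(patient_id, field, official, predicted)
```

Keeping "never predicted" and "predicted differently" in the same list is just a convenience here; they could easily be split into two reports.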
It's been tricky to get a server instantiated to keep all of our data in one place, but we're working on it! In the meantime, I read some papers related to our project, including one that Professor Stubbs worked on, and touched up a few scripts to convert .xml files to Patient objects in Python.
Our next steps will be to move the scripts to the server (once we have one), so that our data and our analysis tools can all be in the same place at long last.
I was fortunate enough to attend the Grace Hopper Celebration recently, so I was unable to make our most recent meeting. However, I did program quite a bit on the plane (I had a lot of layovers). Regular expressions are a good time.
Stephanie, Rebecca, and I have started discussing how we're going to integrate all of our code, since we have all begun working on parsers, Patient classes, and so on. I did some more background reading as well.
Lately we've been in a bit of limbo; we have a lot of the tools we'll need to begin the project in earnest, but we don't yet have a place to store the data, so we're at a temporary impasse.
I also spent some time this week ensuring that the patient class is appropriately robust for the data I already have access to, and additionally tweaked my XML parser for improved flexibility.
This week, I read more background, not only on de-identifying medical records but also on developing ways to parse the files, find medical information, and so on. It was very interesting, especially since I hadn't read a lot on the subject since earlier this summer.
I also began to develop a program which will read in a patient's information from an XML file and attribute the details of their medication history, smoking status(es), etc. to a Patient object.
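Here is a rough sketch of the idea, using Python's built-in `xml.etree.ElementTree`. The XML layout, the tag names (`MEDICATION`, `SMOKER`), their attributes, and the `PatientRecord` class are all placeholders I made up for this example; the real files and my real class may look different.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

# Made-up example record; the real annotation format is an assumption here.
SAMPLE_XML = """<record id="example-001">
  <TEXT>Patient is a former smoker taking metformin.</TEXT>
  <TAGS>
    <MEDICATION text="metformin"/>
    <SMOKER status="past"/>
  </TAGS>
</record>"""


@dataclass
class PatientRecord:
    """Hypothetical stand-in for the Patient object."""
    patient_id: str
    medications: List[str] = field(default_factory=list)
    smoking_statuses: List[str] = field(default_factory=list)


def patient_from_xml(xml_string):
    """Parse one record and copy its annotations onto a PatientRecord."""
    root = ET.fromstring(xml_string)
    patient = PatientRecord(patient_id=root.get("id", "unknown"))
    for tag in root.iter("MEDICATION"):
        patient.medications.append(tag.get("text", ""))
    for tag in root.iter("SMOKER"):
        patient.smoking_statuses.append(tag.get("status", ""))
    return patient


record = patient_from_xml(SAMPLE_XML)
print(record)
```

Smoking status is stored as a list because a patient can have more than one status across their records.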
This week I started working on a Patient class, which will encompass information such as medication history, smoking status, family information, co-morbidity of disease, etc.
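As a first pass, the class might look something like the sketch below. The attribute and method names are my own guesses at this stage rather than a settled design, and the normalization in `add_medication` (lowercasing and stripping whitespace) is just one simple assumption about how to avoid duplicates.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Patient:
    """Illustrative container for what we know about one patient."""
    patient_id: str
    medications: List[str] = field(default_factory=list)
    smoking_statuses: List[str] = field(default_factory=list)
    family_history: List[str] = field(default_factory=list)
    comorbidities: List[str] = field(default_factory=list)

    def add_medication(self, name: str) -> None:
        """Record a medication once, ignoring case and stray whitespace."""
        name = name.strip().lower()
        if name and name not in self.medications:
            self.medications.append(name)


# Hypothetical usage with a made-up identifier.
patient = Patient(patient_id="example-001")
patient.add_medication("Metformin")
patient.add_medication("metformin ")  # duplicate after normalization
print(patient.medications)
```

Using `default_factory=list` gives every patient their own lists, which avoids the classic bug of sharing one mutable default across instances.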
I have a small subset of data available to use and have been testing my parser on that. It's been fun to get back into regular expressions -- finding patterns in language, and then implementing them as robustly as possible, is something I actually really enjoy.
Since we're still working on finding a way to host our data that gives all of us access without risking accidental data leaks, I spent my week developing an XML parser. As it currently stands, the program I have written will read in an XML file containing a de-identified medical record, find the tags within the file, and separate them.
I'm working to automate the process and make it more robust to fit the needs of the whole dataset as well.
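The "find the tags and separate them" step can be sketched roughly like this. This version uses Python's standard `xml.etree.ElementTree` rather than my own parsing code, and the sample record and its tag names are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented example; the real records may use different tags.
SAMPLE_XML = (
    "<record>"
    "<TEXT>Pt reports no chest pain.</TEXT>"
    "<TAGS>"
    '<MEDICATION text="aspirin"/>'
    '<MEDICATION text="lisinopril"/>'
    "</TAGS>"
    "</record>"
)


def separate_tags(xml_string):
    """Group every element below the root by its tag name."""
    root = ET.fromstring(xml_string)
    grouped = defaultdict(list)
    for element in root.iter():
        if element is root:
            continue  # skip the wrapper element itself
        grouped[element.tag].append({
            "attributes": dict(element.attrib),
            "text": (element.text or "").strip(),
        })
    return dict(grouped)


grouped = separate_tags(SAMPLE_XML)
for tag_name, entries in grouped.items():
    print(tag_name, len(entries))
```

Grouping by tag name makes it easy to hand each kind of annotation to its own piece of downstream code later.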
I also re-took parts of the CITI certification course to gain more knowledge on research ethics specifically related to our project.
The CREU Clinical Narratives project is officially underway! We're going to be spending our year studying patterns in clinical narratives using a natural language processing (NLP) approach.
This week, we met for the first time since last semester and began discussing data dissemination among the members of our student research team. Stephanie, Rebecca, and Katie signed the data use agreement contract, and we began the requisite CITI training for dealing with medical record data. We also discussed what our first steps will be once we've completed the CITI training, so we'll get started with that very soon.
Next week, we'll meet again and determine the next few steps of our project. We're all very excited -- stay tuned for more updates from the CREU crew!