Over the past few weeks, I had to alter some of the CSVs I created, because my original files were counting symptoms as mentions of a condition. Analyzing symptoms would be interesting as well, but at this stage we are only looking at direct mentions of medical conditions. To simplify things, I created a program that will populate a CSV for whichever tag you input.
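A minimal sketch of how a tag-driven CSV program like this might look. The record names, the `<TAGS>` layout, and the helper names `mention_rows` and `write_mentions_csv` are all assumptions made for illustration; this is not the project's actual code or data format.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical records: the file names and the <TAGS> layout are assumptions.
SAMPLE_RECORDS = {
    "patient1_visit1.xml": "<root><TEXT>History of CAD.</TEXT><TAGS><CAD/></TAGS></root>",
    "patient1_visit2.xml": "<root><TEXT>Routine follow-up.</TEXT><TAGS/></root>",
}

def mention_rows(records, tag):
    """Return one (record name, 0 or 1) row per record for the given tag."""
    rows = []
    for name, xml_text in sorted(records.items()):
        root = ET.fromstring(xml_text)
        # A direct child of <TAGS> named after the tag counts as a mention.
        mentioned = root.find(f"./TAGS/{tag}") is not None
        rows.append((name, int(mentioned)))
    return rows

def write_mentions_csv(records, tag, out):
    """Write a header plus one mentioned/not-mentioned row per record."""
    writer = csv.writer(out)
    writer.writerow(["record", f"{tag}_mentioned"])
    writer.writerows(mention_rows(records, tag))

buffer = io.StringIO()
write_mentions_csv(SAMPLE_RECORDS, "CAD", buffer)
print(buffer.getvalue())
```

Passing a different tag name, such as "OBESE" in this made-up layout, would produce the same two-column CSV for that condition.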
We recently submitted a proposal for the undergraduate research symposium here at Simmons. Since we're nearing the end of the semester, we've also begun writing a final paper on our project.
This past week, I created more CSVs for mentions of CAD, hypertension, hyperlipidemia, and obesity. These CSVs are based on whether or not each condition is mentioned at all, so there are only two possible options (mentioned or not mentioned). But in many cases, we have more information than that about the conditions or related events. Over this next week, which is spring break, I'll be looking into how to organize and analyze this more complex information.
This week, we put together a presentation of our work for our school's metadata research group. They gave a lot of really interesting feedback, which was much appreciated! Some of their ideas were outside of what we'll be able to do, but they were still interesting to think about. For example, they thought it would be worthwhile to track each individual doctor's writing habits. However, because of the way the records are deidentified, a doctor's pseudonym is only consistent within any one patient's narrative. It was nice to get some new feedback on the project and the directions in which we can take it!
After Thanksgiving break, we decided to lay out our game plan for the rest of the semester. I will continue looking into discrepancies between medical records. Specifically, I'm looking into patients' smoking statuses. It's common for one record to say a patient has never smoked and for the next to say that they only quit within the last year, for example. I think this will be a very interesting area to analyze, because smoking status is such an important aspect of one's health history.
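One way a check like this could be sketched, assuming each patient's records have already been reduced to a status label per visit. The labels "never", "current", and "past" and the function name `smoking_discrepancies` are hypothetical, and the rule for what counts as a conflict is only one possible choice.

```python
# Hypothetical status labels; the real annotation values are not shown in this post.
EVER_SMOKED = {"current", "past"}

def smoking_discrepancies(statuses):
    """Return (index, earlier, later) for consecutive records that conflict.

    statuses is one patient's smoking status per record, in visit order.
    """
    conflicts = []
    for i in range(1, len(statuses)):
        earlier, later = statuses[i - 1], statuses[i]
        # "Never smoked" followed by "quit" skips the smoking period entirely.
        never_then_quit = earlier == "never" and later == "past"
        # Any smoking history followed by "never smoked" is contradictory.
        smoked_then_never = earlier in EVER_SMOKED and later == "never"
        if never_then_quit or smoked_then_never:
            conflicts.append((i, earlier, later))
    return conflicts

print(smoking_discrepancies(["never", "past", "past", "never"]))
```

A transition from "never" to "current" is left unflagged here, since a patient could plausibly have started smoking between visits.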
Our server is finally working! I've been learning how to use the command line and how to SSH into the server. Having remote access will be especially useful for me, since I live off campus. Yay!
We also did some preliminary analysis on our data. Mainly, we wanted to check for discrepancies between a patient's actual gender and the gender predicted within each visit. We checked for similar discrepancies between the patient's actual age and predicted age. We did find some discrepancies, and many records never mention age or gender.
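A comparison like this can be sketched as a simple tally per patient. The function name `compare_visits` and the use of `None` for a visit that never mentions the value are assumptions for illustration only.

```python
def compare_visits(actual, predicted_by_visit):
    """Count visits whose predicted value matches, differs from, or lacks the actual value."""
    counts = {"match": 0, "mismatch": 0, "missing": 0}
    for predicted in predicted_by_visit:
        if predicted is None:
            # The record never mentions this value.
            counts["missing"] += 1
        elif predicted == actual:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1
    return counts

print(compare_visits("F", ["F", None, "M", "F"]))
```

For age, an exact comparison is probably too strict, since a patient's age changes across visits; a tolerance of a few years would be one way to adapt the same idea.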
This week, we worked on finishing up our Person and Patient classes. I also created separate objects for each tag category - for example, family history and medication. At first we were reading in these tags as strings. Now we can save them as objects that hold more specific information as attributes, which will be helpful for future analysis.
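A minimal sketch of the string-to-object idea, using dataclasses. The class names, the attribute names, and the category keys in `TAG_CLASSES` are illustrative assumptions, not the attributes our records actually carry.

```python
from dataclasses import dataclass

@dataclass
class MedicationTag:
    # Attribute names here are illustrative assumptions.
    drug_type: str
    time: str

@dataclass
class FamilyHistoryTag:
    indicator: str

# Hypothetical mapping from a tag category name to its class.
TAG_CLASSES = {"MEDICATION": MedicationTag, "FAMILY_HIST": FamilyHistoryTag}

def build_tag(category, attributes):
    """Turn a tag category and its attribute dict into a typed object."""
    return TAG_CLASSES[category](**attributes)

tag = build_tag("MEDICATION", {"drug_type": "statin", "time": "during visit"})
print(tag.drug_type)
```

Keeping each category in its own class means later analysis can ask for `tag.drug_type` directly instead of re-parsing a string.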
Today, Stephanie and I worked on creating Person and Patient objects in Python (Katie is away at a conference, go Katie!). I've never used Python for object-oriented programming before, so I had to learn the basics of class definitions, how constructors work, etc. The objects we created are still pretty basic and static. The Person class has attributes like name and age, and the Patient class inherits these attributes along with others such as current medication. The next step will be to bring our code together with what Katie has put together, and to pull information from our XML files and link that data with each person/patient object.
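For anyone else new to classes in Python, the inheritance described above looks roughly like this. It is a simplified sketch rather than our actual code, and the parameter name `current_medications` and the sample values are made up.

```python
class Person:
    """Basic attributes shared by everyone in the data."""

    def __init__(self, name, age):
        self.name = name
        self.age = age

class Patient(Person):
    """A Person with additional clinical attributes."""

    def __init__(self, name, age, current_medications=None):
        # Reuse the Person constructor for the inherited attributes.
        super().__init__(name, age)
        # Copy into a new list so patients never share one default list.
        self.current_medications = list(current_medications or [])

patient = Patient("Jane Doe", 62, ["aspirin"])
print(patient.name, patient.age, patient.current_medications)
```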
After spending some time writing a program to read in our patient files, we've begun working with regular expressions in Python. First, we worked through many exercises to reacquaint ourselves with regular expressions in general. Then, we worked on implementing these regular expressions in Python. As a simple test, I wrote a program to pull the main text out of an XML file. The next step will be to retrieve more specific data and store it in a useful format.
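A sketch of what pulling the main text out with a regular expression could look like, assuming the narrative sits inside a `<TEXT>` element, optionally wrapped in CDATA. The element name, the sample record, and the function name `extract_text` are assumptions for illustration.

```python
import re

# Hypothetical record; the real files' layout may differ.
SAMPLE = """<?xml version="1.0"?>
<root>
<TEXT><![CDATA[
Patient presents with chest pain.
History of hypertension.
]]></TEXT>
<TAGS></TAGS>
</root>"""

# re.DOTALL lets "." match newlines, since the narrative spans several lines.
TEXT_PATTERN = re.compile(
    r"<TEXT>(?:<!\[CDATA\[)?(.*?)(?:\]\]>)?</TEXT>", re.DOTALL
)

def extract_text(xml_string):
    """Return the narrative inside <TEXT>, or None if there is no such element."""
    match = TEXT_PATTERN.search(xml_string)
    return match.group(1).strip() if match else None

print(extract_text(SAMPLE))
```

A regular expression is enough for a quick test like this one; a real XML parser such as `xml.etree.ElementTree` would be the sturdier choice once nested tags and attributes matter.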
Now that we've all gotten our CITI certifications sorted, we'll be programming soon! We've decided to use Python for this project. I'm excited for this, because I haven't used Python since my first semester here at Simmons. We've also looked at several records and discussed how we will be working with and analyzing the data. For now, we are working with a small set of clinical narratives. Our first step will be to write a simple program to read in our files. Hopefully we will have a private server set up soon where we will be able to work with the full range of records.
The CREU Clinical Narratives project is officially underway! We're going to be spending our year studying patterns in clinical narratives using a natural language processing (NLP) approach.
This week, we met for the first time since last semester and began discussing how data will be shared among the members of our student research team. Stephanie, Rebecca, and Katie signed the data use agreement, and we began the requisite CITI training for dealing with medical record data. We also discussed what our first steps will be once we've completed the CITI training, so we'll get started with that very soon.
Next week, we'll meet again and determine the next few steps of our project. We're all very excited -- stay tuned for more updates from the CREU crew!