This week I got to use R to analyze the CSV file. I ran some descriptive analyses on it and also wrote some code to produce graphs. I ran into an issue with how the file was configured, however. I am collaborating with a professor from the statistics department to write code that analyzes the data by patient rather than by patient visit. In the meantime, I was able to compare our corpus data to national averages of age and gender.
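
A minimal sketch of what analyzing "by patient rather than by patient visit" can look like. It is written in Python rather than R, and the column names (`patient_id`, `visit_date`, `age`, `gender`) and sample rows are hypothetical, not the project's actual file layout.

```python
import csv
import io
from collections import defaultdict

# Hypothetical visit-level data: one row per visit, several rows per patient.
VISITS_CSV = """patient_id,visit_date,age,gender
p1,2010-01-05,61,F
p1,2010-06-12,61,F
p2,2011-03-20,47,M
p1,2011-02-01,62,F
"""

def visits_by_patient(csv_text):
    """Group visit rows under their patient ID."""
    grouped = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        grouped[row["patient_id"]].append(row)
    return grouped

def patient_summary(grouped):
    """Collapse each patient's visits into a single summary row."""
    summary = {}
    for patient_id, rows in grouped.items():
        ages = [int(r["age"]) for r in rows]
        summary[patient_id] = {
            "n_visits": len(rows),
            "min_age": min(ages),
            "genders": sorted({r["gender"] for r in rows}),
        }
    return summary

summary = patient_summary(visits_by_patient(VISITS_CSV))
```

Descriptive statistics computed from `summary` count each patient once, so a patient with many visits no longer weighs more heavily than a patient with one.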

The code I've been working on is going well but is not yet complete; testing it has revealed a few inconsistencies, which I've addressed as they came up.

This past week, on Dec. 8, we presented an overview of our project to a small research group here at Simmons. Many of the questions were helpful, particularly those about areas of inquiry we haven't considered explicitly. (Some were outside the scope of the project, but still good to keep in mind.)

I've been working on finding discrepancies in diabetes mentions in the clinical narratives, and Rebecca has been working on discrepancies in patients' smoking history. Before the end of the semester, our goal is to report some preliminary findings on discrepancies in the clinical narratives.

In the past couple of weeks, I've attempted to combine all of the Python classes I've written in order to extract data from XML files, store them in Patient objects, and assess discrepancies in the clinical narratives.
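
A minimal sketch of the XML-to-`Patient` step, assuming a made-up record layout. The `record` element, its `patient_id` and `date` attributes, the `TEXT` child, and the `add_record` helper are all hypothetical; the real corpus schema and the actual classes may look quite different.

```python
import xml.etree.ElementTree as ET

# Hypothetical record layout; the real corpus schema may differ.
SAMPLE_XML = """<record patient_id="p1" date="2010-01-05">
  <TEXT>Pt is a 61 yo woman with type 2 diabetes.</TEXT>
</record>"""

class Patient:
    """Holds every clinical narrative that belongs to one patient."""

    def __init__(self, patient_id):
        self.patient_id = patient_id
        self.narratives = []  # one (date, text) tuple per record

def add_record(patients, xml_text):
    """Parse one XML record and attach it to the matching Patient."""
    root = ET.fromstring(xml_text)
    patient_id = root.get("patient_id")
    text = root.findtext("TEXT", default="").strip()
    patient = patients.setdefault(patient_id, Patient(patient_id))
    patient.narratives.append((root.get("date"), text))
    return patient

patients = {}
add_record(patients, SAMPLE_XML)
```

Keeping all of a patient's records on one object is what makes record-to-record comparisons possible later on.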

My part of the discrepancy search is to identify places in the records where mentions of diabetes are inconsistent or missing; even though all of the records in the corpus were selected based on patients' diabetes status, many of the clinical narratives do not mention whether or not the patient has diabetes.

The first step will be to gather statistics regarding whether or not a given patient has any explicit reference to diabetes, including via medication mentions, physical state, etc.
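
One way this first step could be sketched is a keyword pass over each patient's narratives. The term lists below are illustrative assumptions and far from complete, and `mention_type` and `mention_counts` are hypothetical helper names.

```python
import re

# Illustrative, far-from-complete term lists (assumptions, not project data).
DIAGNOSIS_TERMS = ["diabetes", "diabetic", "dm2", "t2dm", "niddm", "iddm"]
MEDICATION_TERMS = ["metformin", "insulin", "glipizide", "glyburide"]

def _pattern(terms):
    """Build a case-insensitive whole-word pattern from a term list."""
    return re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b", re.IGNORECASE)

DIAGNOSIS_RE = _pattern(DIAGNOSIS_TERMS)
MEDICATION_RE = _pattern(MEDICATION_TERMS)

def mention_type(narratives):
    """Classify a patient's narratives as 'diagnosis', 'medication', or 'none'."""
    text = " ".join(narratives)
    if DIAGNOSIS_RE.search(text):
        return "diagnosis"
    if MEDICATION_RE.search(text):
        return "medication"
    return "none"

def mention_counts(corpus):
    """Count how many patients fall into each mention category."""
    counts = {"diagnosis": 0, "medication": 0, "none": 0}
    for narratives in corpus.values():
        counts[mention_type(narratives)] += 1
    return counts

corpus = {
    "p1": ["Pt has type 2 diabetes.", "Follow-up visit."],
    "p2": ["Continues metformin 500 mg daily."],
    "p3": ["Here for knee pain."],
}
counts = mention_counts(corpus)
```

Plain keyword matching ignores negation ("denies diabetes") and family history, so counts from a pass like this are only a starting point.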

After Thanksgiving break, we laid out our game plan for the rest of the semester. I will continue looking into discrepancies between medical records, specifically in patients' smoking statuses. It's common for one record to say a patient has never smoked and for the next to say that they quit within the last year, for example. I think this will be a very interesting area to analyze, because smoking status is such an important part of one's health history.
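
A minimal sketch of how contradictions like the one above could be flagged. It assumes each record has already been reduced to a date and one of three hypothetical status labels (`never`, `former`, `current`); extracting those labels from the narrative text is the hard part and is not shown.

```python
NEVER, FORMER, CURRENT = "never", "former", "current"

def smoking_conflicts(history):
    """Flag consecutive records whose smoking statuses are hard to reconcile.

    `history` is a list of (date, status) tuples; ISO-format date strings
    sort chronologically, so a plain sort puts the records in order.
    """
    conflicts = []
    ordered = sorted(history)
    for (date1, status1), (date2, status2) in zip(ordered, ordered[1:]):
        if status1 in (CURRENT, FORMER) and status2 == NEVER:
            conflicts.append((date1, date2, "smoker later recorded as never-smoker"))
        elif status1 == NEVER and status2 == FORMER:
            conflicts.append((date1, date2, "never-smoker later recorded as having quit"))
    return conflicts

history = [
    ("2010-06-12", FORMER),
    ("2010-01-05", NEVER),
    ("2011-02-01", FORMER),
]
conflicts = smoking_conflicts(history)
```

Only adjacent records are compared here. A "never" followed by "former" is not strictly impossible when the records are far apart, so flagged pairs are candidates for review rather than confirmed errors.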

We have a server! We spent our most recent meeting figuring out who has access to which parts of it, among other details. The next step is to learn how to navigate the server. I lucked out: I spent a summer two years ago working on a remote server for a research project, so I'm already very familiar with navigating one via terminal/bash commands.

We also ran some analysis on the official-vs-predicted genders and ages of the patients, using the output of an earlier script that Professor Stubbs ran. The script used pronouns, age markers, and other mentions of gender, age, etc. to determine the age and gender of each patient on a record-by-record basis.
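
To give a feel for this kind of cue-based prediction, here is a rough sketch. It is not Professor Stubbs's script: the word lists, the age pattern, and the function names are all assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative cue lists (assumptions); a real script would need many more.
FEMALE_CUES = {"she", "her", "hers", "herself", "woman", "female"}
MALE_CUES = {"he", "him", "his", "himself", "man", "male"}

# Matches forms such as "61 yo", "61 y/o", and "61-year-old".
AGE_RE = re.compile(r"\b(\d{1,3})[- ]?(?:yo|y/o|year[- ]old)\b", re.IGNORECASE)

def predict_gender(text):
    """Return 'F' or 'M' from word cues, or None if cues are absent or tied."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    female = sum(words[w] for w in FEMALE_CUES)
    male = sum(words[w] for w in MALE_CUES)
    if female == male:
        return None
    return "F" if female > male else "M"

def predict_age(text):
    """Return the first age marker found in the text, or None."""
    match = AGE_RE.search(text)
    return int(match.group(1)) if match else None

record = "Pt is a 61 yo woman. She reports her sugars are stable."
prediction = (predict_gender(record), predict_age(record))
```

Returning `None` when a record carries no cue keeps "not mentioned" separate from "mentioned but wrong" in any later comparison.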

Along with Stephanie and Rebecca, I worked on a script that looked for some initial discrepancies between the "official" and "predicted" ages and genders in this file. We found that there are some patients whose gender is never specified in their clinical narrative records, and some for whom the age is ambiguous. We'll refine these analyses over the next few weeks, but for now, it's an interesting start.
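
A minimal sketch of the official-vs-predicted comparison, assuming each patient has one official value and a list of per-record predictions in which `None` means the record gave no cue. The data layout and the `compare_field` helper are hypothetical.

```python
def compare_field(official, predictions):
    """Classify one patient's per-record predictions against the official value.

    `predictions` holds one value per record; None means the record gave no cue.
    """
    stated = [p for p in predictions if p is not None]
    if not stated:
        return "never_specified"
    if all(p == official for p in stated):
        return "consistent"
    return "discrepant"

# Hypothetical example data, not taken from the corpus.
patients = {
    "p1": {"official": "F", "predicted": ["F", None, "F"]},
    "p2": {"official": "M", "predicted": [None, None]},
    "p3": {"official": "F", "predicted": ["F", "M"]},
}
report = {
    patient_id: compare_field(p["official"], p["predicted"])
    for patient_id, p in patients.items()
}
```

Exact equality suits gender better than age: a patient's age legitimately changes between visits, so an age comparison would probably need a tolerance or a per-visit official value.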

Our server is finally working! I've been learning how to use the command line and how to SSH into the server. Having remote access will be especially useful for me, since I live off campus. Yay!

We also did some preliminary analysis on our data. Mainly, we wanted to check for discrepancies between a patient's actual gender and the gender predicted within each visit. We checked for similar discrepancies between the patient's actual age and predicted age. We did find some discrepancies, and many records never mention age or gender.

It's been tricky to get a server instantiated to keep all of our data in one place, but we're working on it! In the meantime, I read some papers related to our project, including one that Professor Stubbs worked on, and touched up a few scripts to convert .xml files to Patient objects in Python.

Our next steps will be to move the scripts to the server (once we have one), so that our data and our analysis tools can all be in the same place at long last.