This week, we worked on completing our poster proposal. I did a lot of background reading for this, mostly from the Journal of Biomedical Informatics. They recently released a special volume full of articles pertaining to the i2b2 Shared Task, which is where our corpus of records came from. It was interesting to see how other teams approached the task and whether they addressed missing and inaccurate data.

This week we collaborated to finish a poster presentation for the 2016 Tapia conference. We also read articles from the Journal of Biomedical Informatics. The articles pertain to the 2014 i2b2/UTHealth NLP shared task, which is where we are getting the information for our research.

Over winter break, I spent a few days reading through the articles published last year detailing the outcomes of the various tracks of the 2014 i2b2 challenge, which is where the data we're currently investigating came from in the first place. The articles I read primarily concerned risk factor identification and heart disease prediction systems, since those are closely relevant to our team's work.

In particular, a few of these papers explicitly mention the impact of missing (i.e., unstated, undocumented) risk factors on developed systems which rely on tagged risk factor metadata; developing methods not only to identify missing risk factors but also to create workarounds seems to be an area of clear research opportunity/necessity.

Our first week back, we wrote the rest of our proposal for the 2016 Tapia Celebration student poster session and submitted it on Friday (thanks very much to Professor Stubbs for the help!). We will move on to more computational pursuits in the meantime.

This week, we put together a presentation of our work for our school's metadata research group. They gave a lot of really interesting feedback, which was much appreciated! Some of their ideas were outside of what we'll be able to do, but these ideas were still interesting to think about. For example, they thought it would be worthwhile to track each individual doctor's writing habits. However, because of the way the records are deidentified, a doctor's pseudonym will only be consistent within any one patient's narrative. It was nice to get some new feedback on the project and the directions in which we can take it!

This week I got to use R to analyze the CSV file. I ran some descriptive analysis on it and also wrote some code to produce some graphs. I ran into an issue with how the file was configured, however: it is organized by patient visit rather than by patient. I am collaborating with a professor from the statistics department to write some code to analyze the data by patient instead. In the meantime, I was able to do some comparison of our corpus data to national averages of age and gender.
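To illustrate the by-patient versus by-visit issue, here is a minimal sketch in Python (not the R code described above). The column names `patient_id`, `visit`, `age`, and `gender` and the sample rows are made up for illustration and do not come from our actual CSV file. The idea is to keep one row per patient before computing descriptive statistics, so that patients with many visits are not counted several times.

```python
import csv
import io
from collections import Counter

# Hypothetical visit-level data: one row per visit, so patients repeat.
# Column names and values are invented for this example.
SAMPLE_CSV = """patient_id,visit,age,gender
p1,1,61,F
p1,2,62,F
p2,1,55,M
p3,1,70,F
p3,2,71,F
p3,3,72,F
"""


def first_visit_per_patient(csv_text):
    """Keep one row per patient: the row with the lowest visit number."""
    patients = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        pid = row["patient_id"]
        if pid not in patients or int(row["visit"]) < int(patients[pid]["visit"]):
            patients[pid] = row
    return patients


def describe(patients):
    """Compute simple patient-level descriptive statistics."""
    ages = [int(row["age"]) for row in patients.values()]
    genders = Counter(row["gender"] for row in patients.values())
    return {
        "n": len(patients),
        "mean_age": sum(ages) / len(ages),
        "gender_counts": dict(genders),
    }


if __name__ == "__main__":
    print(describe(first_visit_per_patient(SAMPLE_CSV)))
```

Using the first visit is only one possible choice; the last visit or an average across visits may suit some comparisons better.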

The code I was working on is going well but is not complete yet; testing it has shown a few inconsistencies, which I've addressed as necessary.

This past week, on Dec. 8, we presented an overview of our project to a small research group here at Simmons, and a lot of the questions that were asked were helpful, particularly with respect to areas of inquiry we haven't considered explicitly. (Some were out of the scope of the project, but still good to keep in mind.)

I've been working on finding discrepancies in diabetes mentions in the clinical narratives, and additionally Rebecca has been working to address discrepancies in patients' smoking history. Before the end of the semester, our goal is to report some preliminary findings with regard to discrepancies in clinical narratives.

In the past couple of weeks, I've attempted to combine all of the Python classes I've written in order to extract data from XML files, store them in Patient objects, and assess discrepancies in the clinical narratives.
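As a rough idea of what the extract-and-store step can look like, here is a minimal sketch. It is not my actual code: the `Patient` class shown here, the sample record, and the simplified XML layout (a `TEXT` element plus a `TAGS` element holding annotation tags) are assumptions made for illustration, and the real corpus files may be structured differently.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field

# Hypothetical, simplified record layout; the real corpus files may differ.
SAMPLE_RECORD = """<root>
  <TEXT>Patient reports taking metformin daily.</TEXT>
  <TAGS>
    <DIABETES indicator="mention" time="during DCT"/>
    <MEDICATION type1="metformin" time="during DCT"/>
  </TAGS>
</root>"""


@dataclass
class Patient:
    """Illustrative container for one patient's records."""
    patient_id: str
    # One entry per record: (narrative text, list of (tag name, attributes)).
    records: list = field(default_factory=list)

    def add_record(self, xml_string):
        """Parse one XML record and store its text and annotation tags."""
        root = ET.fromstring(xml_string)
        text = root.findtext("TEXT", default="")
        tags_node = root.find("TAGS")
        if tags_node is None:
            tags = []
        else:
            tags = [(tag.tag, dict(tag.attrib)) for tag in tags_node]
        self.records.append((text, tags))

    def tag_names(self):
        """Return the set of annotation tag names seen across all records."""
        return {name for _, tags in self.records for name, _ in tags}


if __name__ == "__main__":
    patient = Patient("p1")
    patient.add_record(SAMPLE_RECORD)
    print(sorted(patient.tag_names()))
```

Keeping the narrative text next to its tags in each record makes it possible to later compare what a record says with what it was annotated as.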

My part of discrepancy-seeking is attempting to identify places in the records where mentions of diabetes are inconsistent or lacking; in spite of the fact that all of the records in the corpus were selected based on patients' diabetes status, many of the clinical narratives do not mention whether or not the patient has diabetes.

The first step will be to gather statistics regarding whether or not a given patient has any explicit reference to diabetes, including via medication mentions, physical state, etc.
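A first pass at these statistics might look something like the sketch below. This is only an illustration under stated assumptions: the term lists are short, made-up examples rather than a vetted clinical vocabulary, the function names are hypothetical, and simple keyword matching will miss negation ("no history of diabetes") and many indirect references such as physical state.

```python
import re

# Illustrative term lists only; not a vetted clinical vocabulary.
DIRECT_TERMS = ["diabetes", "diabetic", "dm2", "iddm", "niddm"]
MEDICATION_TERMS = ["metformin", "insulin", "glipizide"]


def _compile(terms):
    """Build a case-insensitive whole-word pattern from a list of terms."""
    pattern = r"\b(" + "|".join(map(re.escape, terms)) + r")\b"
    return re.compile(pattern, re.IGNORECASE)


DIRECT_RE = _compile(DIRECT_TERMS)
MEDICATION_RE = _compile(MEDICATION_TERMS)


def classify_patient(narratives):
    """Return 'direct', 'medication_only', or 'none' for one patient's notes."""
    text = "\n".join(narratives)
    if DIRECT_RE.search(text):
        return "direct"
    if MEDICATION_RE.search(text):
        return "medication_only"
    return "none"


def summarize(corpus):
    """Count patients per category; corpus maps patient id -> list of notes."""
    counts = {"direct": 0, "medication_only": 0, "none": 0}
    for narratives in corpus.values():
        counts[classify_patient(narratives)] += 1
    return counts


if __name__ == "__main__":
    example = {
        "a": ["History of type 2 diabetes."],
        "b": ["Continues on metformin."],
        "c": ["Presents with chest pain."],
    }
    print(summarize(example))
```

Patients who land in the "none" category would be the ones to look at first when searching for missing diabetes mentions.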