This past week, we participated in our school's Undergraduate Symposium. Simmons holds this event annually, allowing students to showcase the work they've done throughout the year. Many students present their work at panels, but for our project it made more sense to participate in the two-hour poster session. It was really interesting to see what other students have been working on, and I think our work was well received! We talked to a lot of professors and students, many of whom were in the health sciences. We had one of the few computer science posters, and many people we talked to were interested in the intersection of those fields.
I think there are still questions to be answered about our corpus, like how clearly medication affects future symptoms. But overall, I'm happy with the work we did and the final poster we created. It was a great experience and I'm excited to be attending and sharing our work at the Tapia conference next fall!
This week Katie, Stephanie and I worked on our poster for the Undergraduate Research Symposium here at Simmons. We worked on finalizing data and planned out what information and graphics to include on the poster. Of course, we also spent a while picking out the best color scheme to use. I participated in the poster session last year, but this is my first time really working on a research-focused poster, so I'm excited!
Over the past few weeks, I've been working on a number of things. I had to alter some of the CSVs I created, because my original files were counting symptoms as mentions of a condition. Analyzing symptoms would be interesting as well, but at this stage we are only looking at direct mentions of medical conditions. To simplify things, I created a program that populates a CSV based on the tag you input.
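To give a sense of what a tag-driven CSV generator can look like, here is a minimal Python sketch. It is not the project's actual code. It assumes each record is an XML document whose annotations are elements named after the tag (for example `DIABETES`), and that an `indicator` attribute separates a direct mention from other evidence such as a lab value. The function names, record IDs, and sample records are all made up for illustration.

```python
import csv
import io
import xml.etree.ElementTree as ET


def has_tag(xml_text, tag, indicator=None):
    """Return True if the record contains <tag>, optionally with a given indicator."""
    root = ET.fromstring(xml_text)
    path = f".//{tag}"
    if indicator is not None:
        # Assumed attribute name; restricts the match to one kind of evidence.
        path += f"[@indicator='{indicator}']"
    return root.find(path) is not None


def write_tag_csv(records, tag, out, indicator=None):
    """Write one row per record: its ID and 1 or 0 for whether the tag is present."""
    writer = csv.writer(out)
    writer.writerow(["record_id", f"{tag.lower()}_mentioned"])
    for record_id in sorted(records):
        writer.writerow([record_id, int(has_tag(records[record_id], tag, indicator))])


# Made-up records: the second has a lab-value indicator but no direct mention.
SAMPLE_RECORDS = {
    "100-01": "<root><TAGS><DIABETES indicator='mention'/></TAGS></root>",
    "100-02": "<root><TAGS><DIABETES indicator='A1C'/></TAGS></root>",
    "100-03": "<root><TAGS><SMOKER status='never'/></TAGS></root>",
}

buffer = io.StringIO()
write_tag_csv(SAMPLE_RECORDS, "DIABETES", buffer, indicator="mention")
print(buffer.getvalue())
```

Passing `indicator="mention"` is what keeps the lab-value record from being counted as a direct mention. Leaving it out counts any annotation with that tag.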
We recently submitted a proposal for the undergraduate research symposium here at Simmons. Since we're nearing the end of the semester, we've also begun writing a final paper on our project.
This past week, I created more CSVs for mentions of CAD, hypertension, hyperlipidemia, and obesity. These CSVs are based on whether or not each condition is mentioned at all, so there are only two possible options (mentioned or not mentioned). But in many cases, we have more information than that about the conditions or related events. Over this next week, which is spring break, I'll be looking into how to organize and analyze this more complex information.
During week 15, I worked on extracting diabetes mentions from each medical record and writing them to a CSV file. I formatted the file based on what will work best for Stephanie when she goes on to analyze the CSV in R. It took me a while to work out all the bugs, but I had an accurate CSV by the end of the week. Just skimming the file, I could see that a surprising number of records never mention the fact that the patient has diabetes.
Now that I've developed the script to go through each record, extract diabetes mentions, and write them to a CSV, doing the same with other tags will be much easier. I also created a CSV file detailing each patient's smoking status. Though this file was relatively simple for me to create, it will be harder to analyze, since someone's smoking status can change over time.
Currently, I'm working on going through this same process to create CSV files for family history and other tags.
This week, we worked on completing our poster proposal. I did a lot of background reading for this, mostly from the Journal of Biomedical Informatics. They recently released a special volume full of articles pertaining to the i2b2 Shared Task, which is where our corpus of records came from. It was interesting to see how other teams approached the task and whether they addressed missing and inaccurate data.
This week, we put together a presentation of our work for our school's metadata research group. They gave a lot of really interesting feedback, which was much appreciated! Some of their ideas were outside of what we'll be able to do, but these ideas were still interesting to think about. For example, they thought it would be worthwhile to track each individual doctor's writing habits. However, because of the way the records are deidentified, a doctor's pseudonym will only be consistent within any one patient's narrative. It was nice to get some new feedback on the project and the directions in which we can take it!
After Thanksgiving break, we decided to lay out our game plan for the rest of the semester. I will continue looking into discrepancies between medical records. Specifically, I'm looking into patients' smoking statuses. It's common for one record to say a patient has never smoked and for the next to say that they only quit within the last year, for example. I think this will be a very interesting area to analyze, because smoking status is such an important aspect of one's health history.
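One way to flag this kind of discrepancy is to walk through a patient's visits in order and compare each known status with the next. The Python sketch below is only an illustration, not the project's implementation. It assumes statuses are recorded as `never`, `past`, `current`, or `unknown`, and it treats two transitions as suspicious: `never` followed directly by `past`, and any smoking history followed by `never`. The function name and sample data are hypothetical.

```python
KNOWN_STATUSES = {"never", "past", "current"}


def find_smoking_discrepancies(visits):
    """Return (earlier_visit, later_visit) pairs whose smoking statuses conflict.

    `visits` is a list of (visit_id, status) tuples in chronological order.
    Visits with an unknown status are skipped rather than compared.
    """
    known = [(visit_id, status) for visit_id, status in visits
             if status in KNOWN_STATUSES]
    conflicts = []
    for (prev_id, prev), (cur_id, cur) in zip(known, known[1:]):
        never_then_quit = prev == "never" and cur == "past"
        history_then_never = prev in {"past", "current"} and cur == "never"
        if never_then_quit or history_then_never:
            conflicts.append((prev_id, cur_id))
    return conflicts


# Made-up visit history for a single patient.
sample_visits = [("01", "never"), ("02", "past"), ("03", "unknown"), ("04", "never")]
print(find_smoking_discrepancies(sample_visits))
```

A `never` to `past` change is not strictly impossible, since a patient could start and quit between visits, so flagged pairs are candidates for manual review rather than confirmed errors.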
Our server is finally working! I've been learning how to use the command line and how to SSH into the server. Having remote access will be especially useful for me, since I live off campus. Yay!
We also did some preliminary analysis on our data. Mainly, we wanted to check for discrepancies between a patient's actual gender and the gender predicted within each visit. We checked for similar discrepancies between the patient's actual age and predicted age. We did find some discrepancies, and many records never mention age or gender.
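As a rough illustration of this kind of check, the Python sketch below pulls an age and a gender out of a visit's text with simple patterns and compares them to a reference value. The patterns, the `M`/`F` coding, and the function names are assumptions made for this example. They are not how the project actually predicted age or gender.

```python
import re

# Matches phrases such as "67 year old" or "67-year-old".
AGE_PATTERN = re.compile(r"\b(\d{1,3})[ -]year[ -]old\b", re.IGNORECASE)

# Assumed keyword list mapping gendered words to a one-letter code.
GENDER_WORDS = {"male": "M", "man": "M", "gentleman": "M",
                "female": "F", "woman": "F", "lady": "F"}
GENDER_PATTERN = re.compile(r"\b(" + "|".join(GENDER_WORDS) + r")\b", re.IGNORECASE)


def predict_age(text):
    """Return the first age mentioned in the text, or None if there is none."""
    match = AGE_PATTERN.search(text)
    return int(match.group(1)) if match else None


def predict_gender(text):
    """Return 'M' or 'F' from the first gendered word in the text, or None."""
    match = GENDER_PATTERN.search(text)
    return GENDER_WORDS[match.group(1).lower()] if match else None


def compare(actual, predicted):
    """Classify a prediction as 'missing', 'match', or 'mismatch'."""
    if predicted is None:
        return "missing"
    return "match" if actual == predicted else "mismatch"


note = "This is a 67 year old female with diabetes."
print(compare(67, predict_age(note)), compare("M", predict_gender(note)))
```

Keeping `missing` separate from `mismatch` matters here, since a record that never mentions age or gender is a different problem from one that states the wrong value.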
This week, we've worked on finalizing our person and patient objects. After a few weeks of writing the separate parts of this program, we've finally brought them all together into a cohesive whole. The code will now read in the XML files and apply the information to each individual patient. Now that we have this functionality implemented, we should be able to begin analyzing data soon.
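The general shape of this step can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not our actual program: it assumes each file is named with a patient ID and a visit number (such as `220-01.xml`), that the narrative sits in a `TEXT` element, and that a `Patient` only needs to hold its visits. The class, function, and sample files are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field


@dataclass
class Patient:
    patient_id: str
    # Maps visit number to the narrative text of that visit's record.
    visits: dict = field(default_factory=dict)


def load_patients(files):
    """Group (filename, xml_text) pairs into Patient objects keyed by patient ID."""
    patients = {}
    for filename, xml_text in files:
        stem = filename.rsplit(".", 1)[0]  # "220-01.xml" -> "220-01"
        patient_id, visit = stem.split("-")
        root = ET.fromstring(xml_text)
        text = root.findtext("TEXT", default="")
        patient = patients.setdefault(patient_id, Patient(patient_id))
        patient.visits[int(visit)] = text.strip()
    return patients


# Made-up file contents standing in for records read from disk.
SAMPLE_FILES = [
    ("220-01.xml", "<root><TEXT>First visit.</TEXT></root>"),
    ("220-02.xml", "<root><TEXT>Second visit.</TEXT></root>"),
    ("221-01.xml", "<root><TEXT>Other patient.</TEXT></root>"),
]

patients = load_patients(SAMPLE_FILES)
print(sorted(patients), patients["220"].visits)
```

Grouping by patient up front makes it easier to compare visits against each other later, which is where most of the analysis happens.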