Get matching keywords from full-text articles

* Purpose

Search the body of each article for specific keywords and output every keyword found together with the article's PubMed ID and PubMed Central ID.

* How

Download the full-text XML of the articles from PubMed Central and use Python's xml.etree.ElementTree library for text mining.

* Tool

Python 2.7.3
(NOTE: xml.etree.ElementTree was added to the standard library in Python 2.5, so Python 2.4 does not have it.)

* Files

- all_pmcid.txt: get
- words.list: list of keywords, one per line

1. Get the full-text articles as XML files from PubMed Central.

See the previous post.
- http://bioinfomemo.blogspot.jp/2013/10/get-full-text-article-from-pubmed.html

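2. Remove all text-decoration tags

The .text attribute of an ElementTree element holds only the text up to the first child element, so reading a paragraph stops at the first text-decoration tag (such as <bold> or <italic>) inside it. Removing these tags beforehand keeps the text in one piece.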
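The behaviour is easy to reproduce on a small fragment. The sketch below uses a made-up paragraph (not taken from a real PMC article) to show where the text ends up: the part before the inline tag is in .text, the part after it is in the child's .tail, and itertext() (available since Python 2.7) walks all of it.

```python
from xml.etree import ElementTree

# Hypothetical paragraph with one inline <italic> tag.
p = ElementTree.fromstring(
    "<p>Expression of <italic>TP53</italic> was reduced.</p>"
)

# .text stops at the first child element.
before_child = p.text

# The text that follows the inline tag is stored in the child's .tail.
after_child = p.find("italic").tail

# itertext() yields every text fragment in document order.
full_text = "".join(p.itertext())

print(repr(before_child))
print(repr(after_child))
print(repr(full_text))
```

Reading the text with itertext() is an alternative to deleting the tags; which one is more convenient probably depends on how the rest of the pipeline is organised.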

--------------------Shell script---------------------
#!/bin/bash

# Tags to delete: opening tags (with or without attributes) and closing tags.
tags='ext-link|xref|bold|italic|sup|p|supplementary-material|title|caption|media|sec|table-wrap|table-wrap-foot|table|label|thead|tbody|tr|th|td|fn|fig'

# The output keeps the xml/ prefix, so del_tag/xml must exist.
mkdir -p del_tag/xml

for file in xml/*
do
    # Only the tags are deleted; the text between them is kept.
    sed -E "s#</?(${tags})( [^>]*)?>##g" "$file" > "del_tag/$file"
done
---------------------------------------------------
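The same tag removal can also be sketched in Python with the re module, which may be handy when sed is not available. This is only an illustration: the tag list is copied from the shell script above, while the names TAGS, TAG_RE and strip_tags and the sample string are made up for this example.

```python
import re

# Tag names copied from the shell script above.
TAGS = [
    "ext-link", "xref", "bold", "italic", "sup", "p",
    "supplementary-material", "title", "caption", "media", "sec",
    "table-wrap", "table-wrap-foot", "table", "label", "thead",
    "tbody", "tr", "th", "td", "fn", "fig",
]

# Matches <tag>, <tag attr="...">, and </tag> for the listed names only.
TAG_RE = re.compile(
    r"</?(?:%s)(?:\s[^>]*)?>" % "|".join(re.escape(t) for t in TAGS)
)


def strip_tags(xml_text):
    """Delete the listed tags but keep the text between them."""
    return TAG_RE.sub("", xml_text)


# Hypothetical fragment, not taken from a real article.
sample = (
    '<body><p>See <xref ref-type="bibr" rid="B1">1</xref> '
    "and <bold>this</bold>.</p></body>"
)
print(strip_tags(sample))
```

Tags that merely start with a listed name, such as <pub-id> next to <p>, are left alone because the pattern requires whitespace or > right after the name.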

NOTE: Check that each XML file is still well-formed after the tags are removed.
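One simple way to do this check is to let ElementTree try to parse the text and catch the error. The sketch below tests well-formedness only (it does not validate against the DTD), and the helper name is_well_formed is made up for this example.

```python
from xml.etree import ElementTree


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ElementTree.fromstring(xml_text)
        return True
    except ElementTree.ParseError:
        return False


# Hypothetical fragments: the second one is missing </body>.
good = "<article><body>text</body></article>"
bad = "<article><body>text</article>"

print(is_well_formed(good))
print(is_well_formed(bad))
```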

3. Merge the XML files into one file and reformat it for the ElementTree library.

(1) Delete the <!DOCTYPE> line from every XML file and add a new line at the end of each one. Merge all the XML files into one file, wrap the result in <articleset></articleset>, and add <!DOCTYPE> back as the first line.

--------------------Shell script---------------------
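#!/bin/bash

cd del_tag/xml/ || exit 1
mkdir -p ins_end

# Keep the first line (<!DOCTYPE>) of the first file as the shared header.
cat PMC*.xml | head -1 > header.txt

# Delete the first line of each file and add a new line at the end.
for file in PMC*.xml
do
    sed -e '1d' "$file" > "ins_end/$file"
    echo >> "ins_end/$file"
done

# Merge the files, wrap them in <articleset></articleset> and put the header first.
cat ins_end/*.xml > all.xml
{
    cat header.txt
    echo '<articleset>'
    cat all.xml
    echo '</articleset>'
} > all2.xml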
---------------------------------------------------
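The merge can also be sketched inside Python, by parsing each article separately and attaching it under a new root element. This is an assumption-laden illustration rather than the method used above: merge_articles is a made-up helper, the article IDs are placeholders, and it presumes each file parses on its own.

```python
from xml.etree import ElementTree


def merge_articles(xml_strings):
    """Parse each article and attach it under one <articleset> root."""
    root = ElementTree.Element("articleset")
    for xml_text in xml_strings:
        root.append(ElementTree.fromstring(xml_text))
    return root


# Hypothetical, heavily shortened articles with placeholder IDs.
docs = [
    "<article><front><article-id pub-id-type='pmcid'>PMC0000001"
    "</article-id></front><body>first body</body></article>",
    "<article><front><article-id pub-id-type='pmcid'>PMC0000002"
    "</article-id></front><body>second body</body></article>",
]

merged = merge_articles(docs)
print(merged.tag, len(merged.findall("article")))
```

With this approach the tree is already in memory, so it could be handed to the text-mining step without writing a merged file first.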

4. Text mining with Python's ElementTree.

----------------------Python-----------------------

from xml.etree import ElementTree

### LOAD KEYWORDS (one per line)
with open('words.list', 'r') as f:
    words_list = set(f.read().splitlines())

### EXTRACT PMCID, PMID AND BODY TEXT FROM XML
XMLFILE = "all2.xml"  # merged file written in step 3
tree = ElementTree.parse(XMLFILE)
root = tree.getroot()
art = []
for e in root.iter("article"):
    body = e.find('.//body')
    if body is None:
        # no full-text body in this record
        continue
    # itertext() collects every text fragment under <body>, so it also
    # works when the <p> tags were removed in step 2
    p_str = u"".join(body.itertext()).encode('utf-8')
    art.append({
        "pmcid": e.findtext('.//article-id[@pub-id-type="pmcid"]'),
        "pmid": e.findtext('.//article-id[@pub-id-type="pmid"]'),
        "text": p_str
    })

### EXTRACT KEYWORDS FROM BODY TEXT
result = []
for a in art:
    # split on whitespace and keep the words that are in the keyword list
    for word in a['text'].split():
        if word in words_list:
            result.append({
                "pmid": a['pmid'],
                "pmcid": a['pmcid'],
                "word": word
            })

# output
# ['PMID', 'PMCID', 'Keyword']

### UNIQUE RESULT
seen = set()
uniq_result = []
for d in result:
    # sort the items so that equal dicts always give the same tuple
    t = tuple(sorted(d.items()))
    if t not in seen:
        seen.add(t)
        uniq_result.append(d)
---------------------------------------------------
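To get the PMID, PMCID and keyword columns into a file, the unique rows can be written as tab-separated text with the csv module. The sketch below is one possible way and is written for Python 3; write_tsv is a made-up helper and the sample row contains placeholder values in the same shape as uniq_result.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write PMID, PMCID and keyword columns as tab-separated text."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["PMID", "PMCID", "Keyword"])
    for row in rows:
        writer.writerow([row["pmid"], row["pmcid"], row["word"]])


# Placeholder row in the same shape as the entries of uniq_result.
sample_rows = [
    {"pmid": "00000001", "pmcid": "PMC0000001", "word": "kinase"},
]

# An in-memory buffer stands in for an output file here.
buffer = io.StringIO()
write_tsv(sample_rows, buffer)
print(buffer.getvalue())
```

For a real run the buffer would be replaced by a file opened with open("result.tsv", "w", newline=""), where result.tsv is just an example name.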

Bio + Info = Life
