<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Count Anything now supports PDF files</title>
	<atom:link href="http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/feed/" rel="self" type="application/rss+xml" />
	<link>http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/</link>
	<description>Random scribbling about programming, translation, and Japan</description>
	<lastBuildDate>Fri, 03 Feb 2012 09:05:20 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: Ryan Ginstrom</title>
		<link>http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/comment-page-1/#comment-208</link>
		<dc:creator>Ryan Ginstrom</dc:creator>
		<pubDate>Sun, 25 May 2008 23:44:31 +0000</pubDate>
		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/#comment-208</guid>
		<description>@Gururaj

Thanks for trying it out. I can&#039;t be certain, but I suspect that the pdftotext converter is extracting some spurious characters meant to be used as formatting.

You can check to see exactly what is getting counted by using the Dump Text utility included with Count Anything. Go to Start &gt;&gt; All Programs &gt;&gt; Count Anything &gt;&gt; Dump Text, and select the file. The text will be dumped to a text file alongside the file you selected.</description>
		<content:encoded><![CDATA[<p>@Gururaj</p>
<p>Thanks for trying it out. I can&#8217;t be certain, but I suspect that the pdftotext converter is extracting some spurious characters meant to be used as formatting.</p>
<p>You can check to see exactly what is getting counted by using the Dump Text utility included with Count Anything. Go to Start >> All Programs >> Count Anything >> Dump Text, and select the file. The text will be dumped to a text file alongside the file you selected.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Gururaj Rao</title>
		<link>http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/comment-page-1/#comment-209</link>
		<dc:creator>Gururaj Rao</dc:creator>
		<pubDate>Sun, 25 May 2008 22:42:29 +0000</pubDate>
		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/#comment-209</guid>
		<description>Good show! I just tried out the count on a PDF file containing both Japanese and English characters. I copied the text from the PDF into a Word file and compared the two counts. Although the Asian character and non-Asian word counts were off by only a couple, the counts of characters without spaces were as follows:
CountAnything:    29125
MS Word:          25243
Any ideas on how this difference occurs?</description>
		<content:encoded><![CDATA[<p>Good show! I just tried out the count on a PDF file containing both Japanese and English characters. I copied the text from the PDF into a Word file and compared the two counts. Although the Asian character and non-Asian word counts were off by only a couple, the counts of characters without spaces were as follows:<br />
CountAnything:    29125<br />
MS Word:          25243<br />
Any ideas on how this difference occurs?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Nick</title>
		<link>http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/comment-page-1/#comment-207</link>
		<dc:creator>Nick</dc:creator>
		<pubDate>Wed, 21 May 2008 04:58:23 +0000</pubDate>
		<guid isPermaLink="false">http://ginstrom.com/scribbles/2008/05/21/count-anything-now-supports-pdf-files/#comment-207</guid>
		<description>You may want to look at the best PDF library for Python I have used. It parsed over 1,800 documents with only one failure (and that file turned out to be malformed anyway). The author is actively updating it, it&#039;s written in Python, and he uses it for parsing Japanese PDFs (he&#039;s Japanese), so I strongly suspect it would meet your needs... I recommend you check it out!

http://www.unixuser.org/~euske/python/pdfminer/index.html</description>
		<content:encoded><![CDATA[<p>You may want to look at the best PDF library for Python I have used. It parsed over 1,800 documents with only one failure (and that file turned out to be malformed anyway). The author is actively updating it, it&#8217;s written in Python, and he uses it for parsing Japanese PDFs (he&#8217;s Japanese), so I strongly suspect it would meet your needs&#8230; I recommend you check it out!</p>
<p><a href="http://www.unixuser.org/~euske/python/pdfminer/index.html">http://www.unixuser.org/~euske/python/pdfminer/index.html</a></p>
]]></content:encoded>
	</item>
</channel>
</rss>

